vigilance

A simple data validation approach inspired by assertR for testing assumptions about pandas DataFrames in Python.

Use Case

This package provides a structured way of testing assumptions about attributes of a Python object, and is specifically aimed at verifying the attributes and values contained within a pandas DataFrame.

The DataFrame object is very versatile, but it can be helpful to verify certain values of its attributes and the data it contains in an analysis piece. This could be to ensure data types are correct for the operations performed, or to guard against data errors by checking that the data held is still within certain bounds once the input source of the data has changed.

One simple way to check attributes of a DataFrame is to write a custom checking function and use assert to test for particular properties, for example:

def check_df(df):
    """ My custom validator """    
    assert len(df) > 10, "Num rows must be greater than 10"
    assert (df.mpg > 0).all(), "Not all values in mpg are over 0"
    assert df.am.isin([0, 1]).all(), "Values of am are not all in set{0,1}"

check_df(df)

This approach has one main disadvantage: it raises an error on the first failure encountered, so when several assertions fail, fixing them becomes an iterative process of trial and error.
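
To make the drawback concrete, here is a small sketch using the plain-assert validator from the example above. The three-row DataFrame is hypothetical toy data chosen so that all three assertions would fail; only the first failure is ever reported.

```python
import pandas as pd

# Hypothetical toy data: too few rows, a negative mpg, and an am value outside {0, 1}.
df = pd.DataFrame({"mpg": [21.0, -1.0, 22.8], "am": [1, 0, 2]})


def check_df(df):
    """Plain-assert validator: stops at the first failing assertion."""
    assert len(df) > 10, "Num rows must be greater than 10"
    assert (df.mpg > 0).all(), "Not all values in mpg are over 0"
    assert df.am.isin([0, 1]).all(), "Values of am are not all in set{0,1}"


try:
    check_df(df)
except AssertionError as err:
    first_failure = str(err)
    # prints: Num rows must be greater than 10
    print(first_failure)
```

The mpg and am problems stay hidden until the row-count problem is fixed and the check is run again.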

The vigilance package provides the expect function, which operates like assert but, instead of immediately raising an error, stores all failed expectations encountered and allows them to be recalled later with the report_failures function.

A validating function like the above can thus be written as follows:

from vigilance import expect, report_failures

def check_df(df):
    """ My custom validator """    
    expect(
        (len(df) > 10, "Num rows must be greater than 10"),
        ((df.mpg > 0).all(), "Not all values in mpg are over 0"),
        (df.am.isin([0, 1]).all(), "Values of am are not all in set{0,1}")
    )
    report_failures()

Given some sample data, using the mtcars data set from R,

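import pandas as pd

mtcars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv')
invalid_mtcars = mtcars.copy()
# Introduce two invalid values: a non-positive mpg and an am outside {0, 1}.
# .loc replaces the .ix indexer, which has been removed from pandas.
invalid_mtcars.loc[10, 'mpg'] = -999
invalid_mtcars.loc[22, 'am'] = 2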

the following output reports are generated.

>>> check_df(mtcars)
All expectations met.

>>> check_df(invalid_mtcars)

Failed Expectations: 2

1: File <filename>, line 5, in check_df()
    "(df.mpg > 0).all()" is not True
        -- Not all values in mpg are over 0

2: File <filename>, line 6, in check_df()
    "df.am.isin([0, 1]).all()" is not True
        -- Values of am are not all in set{0,1}

For brevity, the message strings can be omitted and the expect function will accept a variable number of arguments as statements to evaluate.

def check_df(df):
    """ Validator for mtcars """    

    expect(
        len(df) > 10, 
        (df.mpg > 0).all(),
        df.vs.isin([0, 1]).all(),
        df.am.isin([0, 1]).all()
    )

    report_failures()
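
The idea of a delayed assertion can be illustrated with a minimal sketch. This is not vigilance's implementation: the FailureCollector class and its method names are hypothetical, and unlike the real expect function it does not capture the source expression or line number of each check. It only shows the collect-then-report pattern, including the option of raising a ValueError at report time.

```python
class FailureCollector:
    """Hypothetical stand-in that records failed checks instead of raising."""

    def __init__(self):
        self.failures = []

    def expect(self, *checks):
        # Each check is either a bare condition or a (condition, message) pair.
        for check in checks:
            if isinstance(check, tuple):
                condition, message = check
            else:
                condition, message = check, "expectation failed"
            if not condition:
                self.failures.append(message)

    def report(self, raise_error=False):
        # Summarise everything recorded so far, then reset for the next run.
        failures, self.failures = self.failures, []
        if not failures:
            summary = "All expectations met."
        else:
            lines = ["Failed Expectations: %d" % len(failures)]
            lines += ["%d: %s" % (i, msg) for i, msg in enumerate(failures, 1)]
            summary = "\n".join(lines)
        if failures and raise_error:
            raise ValueError(summary)
        print(summary)
        return failures


collector = FailureCollector()
values = [3, -1, 7]  # toy data standing in for a DataFrame column
collector.expect(
    (len(values) > 10, "Num rows must be greater than 10"),
    (all(v > 0 for v in values), "Not all values are over 0"),
    len(values) == 3,  # bare condition without a message; this one passes
)
failed = collector.report()
```

Both failures are reported in a single pass. Conditions built from a DataFrame, such as (df.mpg > 0).all(), can be passed in the same way because each one is evaluated to a single truth value before the call.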

Features

Delayed assertions with options to print to console or raise a ValueError upon a call to report_failures.
Helper utility functions to confirm the following conditions:
- within_n_sds() Tests that all values in a column are within a given number of standard deviations of the mean.
- within_n_mads() Tests that all values in a column are within a given number of median absolute deviations of the median.
- maha_dist() Computes the average Mahalanobis distance for each row in the data set, which is a multivariate version of calculating how many standard deviations a value is from the mean. Larger values are indicative of potential outliers in the data.
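
The statistics behind these helpers can be sketched with pandas and NumPy. The function names below are hypothetical and the code is not taken from vigilance, so the package's own helpers may differ in signature and in detail (for example, whether the MAD is scaled, or how the Mahalanobis distance is averaged). The sketch assumes the sample standard deviation, an unscaled MAD, and the standard per-row Mahalanobis distance from the column means.

```python
import numpy as np
import pandas as pd


def all_within_n_sds(series, n):
    """True if every value lies within n standard deviations of the mean."""
    z = (series - series.mean()) / series.std()
    return bool((z.abs() <= n).all())


def all_within_n_mads(series, n):
    """True if every value lies within n median absolute deviations of the median."""
    deviations = (series - series.median()).abs()
    mad = deviations.median()  # unscaled MAD (an assumption of this sketch)
    return bool((deviations <= n * mad).all())


def mahalanobis_per_row(df):
    """Mahalanobis distance of each row from the vector of column means."""
    data = df.to_numpy(dtype=float)
    centred = data - data.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    # Squared distance for row i is x_i^T * S^-1 * x_i, with x_i the centred row.
    squared = np.einsum("ij,jk,ik->i", centred, inv_cov, centred)
    return pd.Series(np.sqrt(squared), index=df.index)


# Hypothetical toy data with one obvious outlier in column x (row 5).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 50.0],
    "y": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
})

sd_ok = all_within_n_sds(toy.x, 3)    # True: the outlier inflates the SD itself
mad_ok = all_within_n_mads(toy.x, 3)  # False: the MAD is robust to the outlier
distances = mahalanobis_per_row(toy)
most_extreme_row = distances.idxmax()  # 5
```

On this toy data the standard deviation check passes even though 50.0 is clearly unusual, because a single extreme value inflates the standard deviation of a small sample. The MAD-based check flags it, which is the usual reason to prefer it when outliers are expected.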

Installation

With git installed, the latest development version can be installed with:

pip install git+https://github.com/MrKriss/vigilance.git

Requirements

As the framework takes pandas DataFrame objects as input, the main dependency is pandas itself, along with its dependencies.

In addition, pytest is used to run the tests.

Compatibility

Tested on Python 3.3 and 3.4.

Licence

MIT; see the LICENSE file in the repository.

Authors

vigilance was written by Chris Musselle.
