A simple data validation approach inspired by assertR for testing assumptions about pandas DataFrames in Python.
This package provides a structured way of testing assumptions about attributes of a Python object, and is specifically aimed at verifying the attributes and values contained within a pandas DataFrame.
The DataFrame object is very versatile, but it can be helpful to verify certain values of its attributes and the data it contains in an analysis piece. This could be to ensure data types are correct for the operations performed, or to guard against data errors by checking that the data held is still within certain bounds once the input source of the data has changed.
One simple way to check attributes of a DataFrame is to write a custom checking function and use assert
to test for particular properties. e.g.
def check_df(df):
""" My custom validator """
assert len(df) > 10, "Num rows must be greater than 10"
assert (df.mpg > 0).all(), "Not all values in mpg are over 0"
assert (df.am.isin([0, 1]).all(), "Values of am are not all in set{0,1}"
check_df(df)
Though such usage has one main disadvantage; it will error on the first failure encountered, and so lead to an iterative trial and error approach to fixing problems if multiple assertions fail.
The vigilance package provides the expect
function, which operates like assert
but instead of imediatly raising an error, it stores all failed expectations encountered and then allows them to be recalled at a later point with the report_failures
function.
A validating function like the above can thus be written as follows:
from vigilance import expect, report_failures
def check_df(df):
""" My custom validator """
expect(
(len(df) > 10, "Num rows must be greater than 10"),
((df.mpg > 0).all(), "Not all values in mpg are over 0"),
(df.am.isin([0, 1]).all(), "Values of am are not all in set{0,1}")
)
report_failures()
Given some sample data, using the mtcars data set from R,
mtcars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv')
invalid_mtcars = mtcars.copy()
invalid_mtcars.ix[10, 'mpg'] = 999
invalid_mtcars.ix[22, 'am'] = 2
the following output reports are generated.
>>> check_df(mtcars)
All expectations met.
>>> check_df(invalid_mtcars)
Failed Expectations: 2
1: File <filename>, line 5, in check_df()
"(df.mpg > 0).all()" is not True
-- Not all values in mpg are over 0
2: File <filename>, line 6, in check_df()
"df.am.isin([0, 1]).all()" is not True
-- Values of am are not all in set{0,1}
For brevity, the message strings can be omitted and the expect
function will accept a variable number of arguments as statements to evaluate.
def check_df(df):
""" Validator for mtcars """
expect(
len(df) > 10,
(df.mpg > 0).all(),
(df.vs.isin([0, 1]).all(),
(df.am.isin([0, 1]).all()
)
report_failures()
- Delayed assertions with options to print to console or raise a ValueError upon a call to
report_failures
. Helper utility functions to confirm the following conditions:
within_n_sds()
Tests all values in a column are with a given number of standard deviations.within_n_mads()
Tests all values in a column are with a given number of median absolute deviations.maha_dist()
Computes the average mahalanobis distance for each row in the data set, which is a multivariate version of calculating how many standard deviations a value is from the mean. Larger values are indicative of potential outliers in the data.
With git installed, the latest development version can be installed with::
pip install git+https://github.com/MrKriss/vigilance.git
As the framework takes pandas DataFrame objects as input, the main dependency is pandas itself, along with its dependencies.
In addition, pytest is used to run the tests.
Tested on Python 3.3 and 3.4.
MIT, see the Licence here
vigilance was written by Chris Musselle.