input_checker.checker module
- class input_checker.checker.InputChecker(columns=None, categorical_columns=None, numerical_columns=None, datetime_columns=None, skip_infer_columns=None, **kwds)
Bases: tubular.base.BaseTransformer
Class to compare a dataframe against a benchmark.
The input checker class currently contains 5 different checks:
1. Null checker: ensures that columns with missing values in the benchmark dataframe are the only columns with missing values in the comparison dataframe.
2. Dtype checker: ensures that columns in the comparison dataframe are of the same data type as in the benchmark dataframe.
3. Categorical value checker: ensures that categorical columns in the comparison dataframe only contain values that exist in the benchmark dataframe.
4. Numerical checker: ensures that the values of the numerical columns in the comparison dataframe lie within the minimum and maximum range of the numerical columns in the benchmark dataframe.
5. Datetime checker: ensures that the values of datetime columns in the comparison dataframe lie within the range set by the minimum date (and, optionally, the maximum date) of the datetime columns in the benchmark dataframe.
Checks 1 and 2 are run for all the columns listed in the ‘columns’ parameter. If this parameter is not set, all of the columns in the dataframe passed to the fit method are taken into account. The numerical and categorical checks may be skipped by setting the categorical_columns and numerical_columns parameters to None. Alternatively, the ‘infer’ option automatically finds the columns of a categorical or numerical type among the columns set in the ‘columns’ parameter.
The class is fitted to the benchmark dataframe by calling the fit method, which in turn calls the individual fit method for each check. The fitted input checker object can then be saved, loaded later, and used to compare a dataframe against the benchmark dataframe. The comparison is done by calling the transform method, which runs every fitted check against the comparison dataframe and, if any checks fail, raises an exception stating which ones failed.
- Parameters
columns (None, list or str) – The list of model input column names that the column name, null and data type checks are generated for. If None, all the columns in the (fitted) benchmark dataframe are included in the checks. If a str holding a single column name, only that column is included in the checks.
categorical_columns (list or 'infer') – The list of model input column names containing categorical data that the categorical level checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the benchmark dataframe (category, boolean or string).
numerical_columns (list, 'infer' or dict) – The list of model input column names containing numerical data that the numerical range checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the benchmark dataframe. If a dict is passed, each key must be a column in the (fitted) benchmark dataframe and each value must itself contain ‘maximum’ and ‘minimum’ keys. These keys hold booleans stating whether a maximum and/or minimum value check is desired.
datetime_columns (list or 'infer') – The list of model input column names containing datetime data that the datetime level checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the (fitted) benchmark dataframe (datetime, object).
skip_infer_columns (list) – The list of dataframe column names that will have type and null checks applied to them but will not be included in the ‘infer’ calculation for the categorical and numerical column checks. These should include id, datetime and text fields.
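To illustrate the dict form of numerical_columns described above, the following sketch builds such a dictionary and checks that it has the documented shape. The column names (`age`, `income`, `city`) and the helper `validate_numerical_columns` are hypothetical and are not part of the library; the library's own validation may differ.

```python
def validate_numerical_columns(spec, benchmark_columns):
    """Check that a numerical_columns dict has the shape described above.

    Hypothetical helper for illustration only, not the library's validation.
    """
    for column, flags in spec.items():
        # Each key must name a column of the benchmark dataframe.
        if column not in benchmark_columns:
            raise ValueError(f"{column!r} is not a column in the benchmark dataframe")
        # Each value must carry boolean 'maximum' and 'minimum' flags.
        for key in ("maximum", "minimum"):
            if key not in flags:
                raise ValueError(f"{column!r} is missing the {key!r} key")
            if not isinstance(flags[key], bool):
                raise TypeError(f"{column!r}[{key!r}] must be a boolean")
    return True


# Check both bounds for 'age', but only the lower bound for 'income'.
numerical_columns = {
    "age": {"maximum": True, "minimum": True},
    "income": {"maximum": False, "minimum": True},
}
is_valid = validate_numerical_columns(numerical_columns, ["age", "income", "city"])
```

A dictionary of this shape would then be passed as the numerical_columns argument when constructing the checker.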
Aside from the class parameters, these attributes are generated when the class is fitted to a benchmark dataframe:
- null_map
Dictionary containing the null map for the specified columns. Keys are the column names and values are 1 if the column can contain nulls and 0 if the column is not allowed to contain any nulls.
- Type
dict
- expected_values
Dictionary containing the categorical map for the specified categorical columns. Keys are the column names and values are the values allowed within each categorical column. Only generated if the categorical_columns parameter is not set to None.
- Type
dict
- column_classes
Dictionary containing the data type map for the specified columns. Keys are the column names and values are the column data types.
- Type
dict
- numerical_values
Dictionary containing the numerical map for the specified numerical columns. Keys are the column names, each of which maps to the minimum and maximum values allowed within that numerical column. Only generated if the numerical_columns parameter is not set to None.
- Type
dict
- datetime_values
Dictionary containing the datetime map for the specified datetime columns. Keys are the column names, each of which maps to the minimum and (optional) maximum values allowed within that datetime column. Only generated if the datetime_columns parameter is not set to None.
- Type
dict
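As a rough picture of what the fitted attributes hold, the sketch below derives a null map and a numerical map from a small benchmark. It uses a plain dict of lists as a stand-in for a pandas dataframe and does not call the library; the column names and the exact inner key names (`minimum`, `maximum`) are assumptions made for illustration.

```python
# Plain-Python stand-in for a benchmark dataframe (hypothetical data).
benchmark = {
    "age": [25, 40, 31],
    "city": ["Leeds", None, "York"],
}

# null_map: 1 if the benchmark column contains nulls, otherwise 0.
null_map = {
    column: int(any(value is None for value in values))
    for column, values in benchmark.items()
}

# numerical_values: smallest and largest value seen in each numerical column.
# The inner key names are assumed here, not taken from the library.
numerical_values = {
    "age": {"minimum": min(benchmark["age"]), "maximum": max(benchmark["age"])},
}
```

Here `city` is allowed to contain nulls in a comparison dataframe because the benchmark column does, while `age` is not.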
- fit(X, y=None)
Checks that the class inputs are of the correct format and then fits the different input checker methods to the benchmark dataframe
- Parameters
X (pd.DataFrame) – The training input samples.
y (None) – y is not needed in this transformer, yet the sklearn pipeline API requires this parameter for checking.
- raise_exception_if_checks_fail(type_failed_checks, null_failed_checks, value_failed_checks, numerical_failed_checks, datetime_failed_checks)
Method to combine all test results from the input checker tests and raise an InputChecker exception if any one of the checks fails.
- Parameters
type_failed_checks (dict) – Details of failed type checker tests, empty if no checks failed.
null_failed_checks (dict) – Details of failed null checker tests, empty if no checks failed.
value_failed_checks (dict) – Details of failed categorical checker tests, empty if no checks failed.
numerical_failed_checks (dict) – Details of failed numerical checker tests, empty if no checks failed.
datetime_failed_checks (dict) – Details of failed datetime checker tests, empty if no checks failed.
- separate_passes_and_fails(type_failed_checks, null_failed_checks, value_failed_checks, numerical_failed_checks, datetime_failed_checks, X)
Method to combine all test results from the input checker tests and separate rows which pass checks (good_df) from rows which fail checks (bad_df). Failing rows will have an extra column added called ‘failed_checks’, which concatenates all the failing test information.
- Parameters
type_failed_checks (dict) – Details of failed type checker tests, empty if no checks failed.
null_failed_checks (dict) – Details of failed null checker tests, empty if no checks failed.
value_failed_checks (dict) – Details of failed categorical checker tests, empty if no checks failed.
numerical_failed_checks (dict) – Details of failed numerical checker tests, empty if no checks failed.
datetime_failed_checks (dict) – Details of failed datetime checker tests, empty if no checks failed.
- Returns
good_df, bad_df – Dataframes containing rows which pass checks (good_df) and rows which fail checks (bad_df).
- Return type
tuple
- transform(X, batch_mode=False)
Method to run the input checker tests, which have been set based on the fitted benchmark dataframe, on the comparison dataframe.
- Parameters
X (pd.DataFrame) – The new dataframe to validate against the benchmark samples.
batch_mode (bool, default=False) – When batch_mode = True, the dataframe is processed row-by-row. Two dataframes are returned: one with the records that pass the checks and one with the records that fail the checks. The failed records have an extra column ‘failed_checks’ which contains the reasons for the failed checks. When batch_mode = False, an exception is raised if any of the rows fail the input checks; otherwise the comparison dataframe X is returned.
- Returns
good_df, bad_df or X – If run in batch mode, a tuple of dataframes with the rows passing and failing the checks respectively; otherwise the comparison dataframe X. If any of the checks fail when batch_mode=False, an InputChecker exception is raised.
- Return type
tuple or pd.DataFrame
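The difference between the two modes can be sketched in plain Python. The example below is not the library's implementation: it uses lists of dicts in place of dataframes, applies only a simple range check, and the names `SketchCheckError`, `run_range_check` and `limits` are hypothetical.

```python
class SketchCheckError(Exception):
    """Hypothetical stand-in for the exception raised on failed checks."""


def run_range_check(rows, limits, batch_mode=False):
    """Apply a min/max check per row, mimicking the two documented modes."""
    good_rows, bad_rows = [], []
    for row in rows:
        # Collect a message for every column whose value is out of range.
        failures = [
            f"{column} outside [{low}, {high}]"
            for column, (low, high) in limits.items()
            if not low <= row[column] <= high
        ]
        if failures:
            # Failing rows gain a 'failed_checks' entry with the reasons.
            bad_rows.append({**row, "failed_checks": "; ".join(failures)})
        else:
            good_rows.append(row)
    if batch_mode:
        # Batch mode: return passing and failing rows separately.
        return good_rows, bad_rows
    if bad_rows:
        # Default mode: any failure raises an exception.
        raise SketchCheckError(f"{len(bad_rows)} row(s) failed the input checks")
    return rows


limits = {"age": (18, 65)}
rows = [{"age": 30}, {"age": 90}]
good_rows, bad_rows = run_range_check(rows, limits, batch_mode=True)
```

With batch_mode left at its default, the same input would raise instead of returning, which suits pipelines that should stop on the first bad batch.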