input_checker.checker module

class input_checker.checker.InputChecker(columns=None, categorical_columns=None, numerical_columns=None, datetime_columns=None, skip_infer_columns=None, **kwds)[source]

Bases: tubular.base.BaseTransformer

Class to compare a dataframe against a benchmark

The input checker class currently contains 5 different checks:

1. Null checker: ensures that columns with missing values in the benchmark dataframe are the only columns with missing values in the comparison dataframe

2. Dtype checker: ensures that columns in the comparison dataframe are of the same data type as in the benchmark dataframe

3. Categorical value checker: ensures that categorical columns in the comparison dataframe only contain values that exist in the benchmark dataframe

4. Numerical checker: ensures that the values of the numerical columns in the comparison dataframe lie within the minimum and maximum range of the numerical columns in the benchmark dataframe.

5. Datetime checker: ensures that the values of datetime columns in the comparison dataframe lie within the range set by the minimum date (and, optionally, the maximum date) of the datetime columns in the benchmark dataframe.
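As an illustration of checks 3 and 4, the sketch below learns the allowed values and the min/max range from benchmark columns, then reports the rows of a comparison column that violate them. It works on plain Python lists rather than dataframes, and the helper names (learn_levels, learn_range, unseen_levels, out_of_range) are hypothetical; they are not part of InputChecker and may not match how the library implements these checks.

```python
def learn_levels(benchmark_values):
    """Return the set of values seen in a benchmark categorical column."""
    return set(benchmark_values)


def learn_range(benchmark_values):
    """Return the (minimum, maximum) of a benchmark numerical column."""
    return min(benchmark_values), max(benchmark_values)


def unseen_levels(values, allowed):
    """Return indices of rows whose value is absent from the benchmark."""
    return [i for i, v in enumerate(values) if v not in allowed]


def out_of_range(values, bounds):
    """Return indices of rows whose value falls outside the benchmark range."""
    low, high = bounds
    return [i for i, v in enumerate(values) if v < low or v > high]


# "Fit" on benchmark data (values are made up for illustration).
allowed = learn_levels(["red", "green", "red"])
bounds = learn_range([3.0, 7.5, 5.0])

# "Compare" new data against what was learned.
bad_levels = unseen_levels(["green", "blue"], allowed)
bad_values = out_of_range([4.0, 9.0, 2.5], bounds)
```

Here bad_levels and bad_values hold row positions, which is one convenient way to report failures; the real class may report them differently.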

Checks 1 and 2 are run for all the columns listed in the ‘columns’ parameter. If this parameter is not set, all of the columns in the dataframe passed to the fit method are taken into account. The categorical and numerical checks may be skipped by setting the categorical_columns and numerical_columns parameters to None. Alternatively, the ‘infer’ option automatically finds the columns of a categorical or numerical type among those listed in the ‘columns’ parameter.

The class is fitted to the benchmark dataframe by calling the fit method, which in turn calls the individual fit method for each check. The fitted input checker object can then be saved, loaded later, and used to compare a dataframe against the benchmark dataframe. The comparison is done by calling the transform method, which runs every fitted check on the comparison dataframe and raises an exception stating which checks have failed, if any.

Parameters
  • columns (None, list or str) – The list of model input column names that the column name, null and data type checks are generated for. If None, all the columns in the (fitted) benchmark dataframe are included in the checks. If a str naming a single column, only that column is included in the checks

  • categorical_columns (list or 'infer') – The list of model input column names containing categorical data that the categorical level checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the benchmark dataframe (category, boolean or string)

  • numerical_columns (list, 'infer' or dict) – The list of model input column names containing numerical data that the numerical range checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the benchmark dataframe. If a dict is passed, each key must be a column in the (fitted) benchmark dataframe and each value must itself contain ‘maximum’ and ‘minimum’ keys. These keys hold booleans stating whether a maximum and/or minimum value check is desired

  • datetime_columns (list or 'infer') – The list of model input column names containing datetime data that the datetime level checks are generated for. If the ‘infer’ option is defined instead, this list is inferred based on the column types of the (fitted) benchmark dataframe (datetime, object).

  • skip_infer_columns (list) – The list of dataframe column names that will have type and null checks applied to them but will not be included in the ‘infer’ calculation for the categorical and numerical column checks. These should include id, datetime and text fields
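The dict form of numerical_columns described above can be sketched as follows. The column names (age, income, city) are made up for illustration, and validate_numerical_spec is a hypothetical helper that only mirrors the documented requirements; it is not the validation that InputChecker itself performs.

```python
# Hypothetical spec: check both bounds for "age", only the minimum for "income".
numerical_columns = {
    "age": {"minimum": True, "maximum": True},
    "income": {"minimum": True, "maximum": False},
}


def validate_numerical_spec(spec, benchmark_columns):
    """Return a list of problems found in a numerical_columns dict."""
    problems = []
    for column, flags in spec.items():
        # Every key must name a column of the benchmark dataframe.
        if column not in benchmark_columns:
            problems.append(f"{column}: not a benchmark column")
            continue
        # Every value must carry boolean 'minimum' and 'maximum' flags.
        for key in ("minimum", "maximum"):
            if key not in flags:
                problems.append(f"{column}: missing '{key}' key")
            elif not isinstance(flags[key], bool):
                problems.append(f"{column}: '{key}' must be a boolean")
    return problems


problems = validate_numerical_spec(numerical_columns, ["age", "income", "city"])
```

An empty problems list means the spec has the documented shape.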

Aside from the class parameters, these attributes are generated when the class is fitted to a benchmark dataframe:

null_map

Dictionary containing the null map for the specified columns. Keys are the column names and values are 1 if the column can contain nulls and 0 if the column is not allowed to contain any nulls

Type

dict
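A minimal sketch of how such a map could be built and used, assuming columns are held as plain lists with None marking a missing value. build_null_map and null_failures are hypothetical names, and the library's own fitting logic may differ.

```python
def build_null_map(benchmark):
    """Map each column name to 1 if the benchmark has nulls in it, else 0."""
    return {
        name: int(any(v is None for v in values))
        for name, values in benchmark.items()
    }


def null_failures(comparison, null_map):
    """Return columns that contain nulls although the benchmark had none."""
    return [
        name
        for name, values in comparison.items()
        if null_map.get(name) == 0 and any(v is None for v in values)
    ]


# Benchmark: column "a" has a missing value, column "b" does not.
benchmark = {"a": [1, None, 3], "b": [4, 5, 6]}
null_map = build_null_map(benchmark)

# Comparison data: nulls in "a" are tolerated, nulls in "b" are not.
failed = null_failures({"a": [None, 2], "b": [7, None]}, null_map)
```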

expected_values

Dictionary containing the categorical map for the specified categorical columns. Keys are the column names and values are the values allowed within each categorical column. Only generated if the categorical_columns parameter is not set to None

Type

dict

column_classes

Dictionary containing the data type map for the specified columns. Keys are the column names and values are the column data types

Type

dict

numerical_values

Dictionary containing the numerical map for the specified numerical columns. Keys are the column names, and each value holds the minimum and maximum allowed within that numerical column. Only generated if the numerical_columns parameter is not set to None

Type

dict

datetime_values

Dictionary containing the datetime map for the specified datetime columns. Keys are the column names, and each value holds the minimum and (optionally) maximum allowed within that datetime column. Only generated if the datetime_columns parameter is not set to None

Type

dict

fit(X, y=None)[source]

Checks that the class inputs are of the correct format and then fits the different input checker methods to the benchmark dataframe

Parameters
  • X (pd.DataFrame) – The training input samples.

  • y (None) – y is not needed in this transformer, yet the sklearn pipeline API requires this parameter for checking.

raise_exception_if_checks_fail(type_failed_checks, null_failed_checks, value_failed_checks, numerical_failed_checks, datetime_failed_checks)[source]

Method to combine all test results from the input checker tests and raise an InputChecker exception if any one of the checks fails.

Parameters
  • type_failed_checks (dict) – Details of failed type checker tests, empty if no checks failed.

  • null_failed_checks (dict) – Details of failed null checker tests, empty if no checks failed.

  • value_failed_checks (dict) – Details of failed categorical checker tests, empty if no checks failed.

  • numerical_failed_checks (dict) – Details of failed numerical checker tests, empty if no checks failed.

  • datetime_failed_checks (dict) – Details of failed datetime checker tests, empty if no checks failed.

separate_passes_and_fails(type_failed_checks, null_failed_checks, value_failed_checks, numerical_failed_checks, datetime_failed_checks, X)[source]

Method to combine all test results from the input checker tests and separate rows which pass checks (good_df) from rows which fail checks (bad_df). Failing rows will have an extra column added called ‘failed_checks’, which concatenates all the failing test information.

Parameters
  • type_failed_checks (dict) – Details of failed type checker tests, empty if no checks failed.

  • null_failed_checks (dict) – Details of failed null checker tests, empty if no checks failed.

  • value_failed_checks (dict) – Details of failed categorical checker tests, empty if no checks failed.

  • numerical_failed_checks (dict) – Details of failed numerical checker tests, empty if no checks failed.

  • datetime_failed_checks (dict) – Details of failed datetime checker tests, empty if no checks failed.

Returns

good_df, bad_df – Dataframes containing rows which pass checks (good_df) and rows which fail checks (bad_df).

Return type

tuple
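The separation described above can be sketched on plain lists of row dicts. The shape of the failed-check dictionaries is not specified here, so this sketch assumes each one maps a row index to a message; that shape, the split_rows helper and the sample messages are all assumptions made for illustration.

```python
def split_rows(rows, *failed_checks):
    """Split rows into passing and failing lists.

    Each element of failed_checks is assumed to map a row index to a
    message; this shape is an assumption made for the sketch.
    """
    messages = {}
    for checks in failed_checks:
        for index, message in checks.items():
            messages.setdefault(index, []).append(message)

    good, bad = [], []
    for index, row in enumerate(rows):
        if index in messages:
            # Concatenate every failure reason into one extra field.
            bad.append({**row, "failed_checks": "; ".join(messages[index])})
        else:
            good.append(row)
    return good, bad


rows = [{"age": 30}, {"age": -1}, {"age": None}]
numerical_failed = {1: "age below minimum"}
null_failed = {2: "age is null"}

good_rows, bad_rows = split_rows(rows, numerical_failed, null_failed)
```

Only the failing rows gain the extra ‘failed_checks’ field, so passing rows keep their original shape.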

transform(X, batch_mode=False)[source]

Method to run the input checker tests, which were set up from the fitted benchmark dataframe, on the comparison dataframe.

Parameters
  • X (pd.DataFrame) – The new dataframe to validate against the benchmark samples.

  • batch_mode (bool, default=False) – When batch_mode = True, the dataframe is processed row-by-row. Two data frames are returned: a DF of the records that pass the checks and a DF of the records that fail the checks. The failed records have an extra column ‘failed_checks’ which contains reasons for the failed checks. When batch_mode = False, an exception will be raised if any of the rows fail the input checks, otherwise the comparison dataframe X is returned

Returns

good_df, bad_df or X – Returns a tuple of dataframes with rows passing and failing checks respectively if run in batch mode or the comparison dataframe X. If any of the checks fail when batch_mode=False, it will throw an InputChecker exception

Return type

tuple or pd.DataFrame
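The two return modes can be sketched with a toy function. ChecksFailedError, run_checks and non_negative are hypothetical stand-ins that only mirror the documented behaviour of batch_mode; they are not the library's exception class or implementation.

```python
class ChecksFailedError(Exception):
    """Hypothetical stand-in for the exception raised on failed checks."""


def run_checks(rows, is_valid, batch_mode=False):
    """Validate rows, mirroring the two documented return modes."""
    good = [row for row in rows if is_valid(row)]
    bad = [row for row in rows if not is_valid(row)]
    if batch_mode:
        # Batch mode: hand back passing and failing rows separately.
        return good, bad
    if bad:
        # Default mode: any failure aborts with an exception.
        raise ChecksFailedError(f"{len(bad)} row(s) failed the checks")
    return rows


def non_negative(row):
    """Toy check: the 'age' field must not be negative."""
    return row["age"] >= 0


rows = [{"age": 30}, {"age": -1}]
good_rows, bad_rows = run_checks(rows, non_negative, batch_mode=True)

try:
    run_checks(rows, non_negative)
    raised = False
except ChecksFailedError:
    raised = True
```

Batch mode suits pipelines that should keep processing the valid rows, while the default mode suits cases where any invalid row should stop the run.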