Lazy Validation#
New in version 0.4.0
By default, when you call the validate method on schema or schema component
objects, a SchemaError is raised as soon as one of the
assumptions specified in the schema is falsified. For example, for a
DataFrameSchema object, the following situations will raise an
exception:
a column specified in the schema is not present in the dataframe.
if
strict=True, a column in the dataframe is not specified in the schema.the
data typedoes not match.if
coerce=True, the dataframe column cannot be coerced into the specifieddata type.the
Checkspecified in one of the columns returnsFalseor a boolean series containing at least oneFalsevalue.
For example:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
df = pd.DataFrame({"column": ["a", "b", "c"]})
schema = pa.DataFrameSchema({"column": Column(int)})
schema.validate(df)
Traceback (most recent call last):
...
SchemaError: expected series 'column' to have type int64, got object
For more complex cases, it is useful to see all of the errors raised during
the validate call so that you can debug the causes of errors on different
columns and checks. The lazy keyword argument in the validate method
of all schemas and schema components gives you the option of doing just this:
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
schema = pa.DataFrameSchema(
columns={
"int_column": Column(int),
"float_column": Column(float, Check.greater_than(0)),
"str_column": Column(str, Check.equal_to("a")),
"date_column": Column(pa.DateTime),
},
strict=True
)
df = pd.DataFrame({
"int_column": ["a", "b", "c"],
"float_column": [0, 1, 2],
"str_column": ["a", "b", "d"],
"unknown_column": None,
})
schema.validate(df, lazy=True)
Traceback (most recent call last):
...
pandera.errors.SchemaErrors: A total of 5 schema errors were found.
Error Counts
------------
- column_not_in_schema: 1
- column_not_in_dataframe: 1
- schema_component_check: 3
Schema Error Summary
--------------------
failure_cases n_failure_cases
schema_context column check
DataFrameSchema <NA> column_in_dataframe [date_column] 1
column_in_schema [unknown_column] 1
Column float_column dtype('float64') [int64] 1
int_column dtype('int64') [object] 1
str_column equal_to(a) [b, d] 2
Usage Tip
---------
Directly inspect all errors by catching the exception:
```
try:
schema.validate(dataframe, lazy=True)
except SchemaErrors as err:
err.failure_cases # dataframe of schema errors
err.data # invalid dataframe
```
As you can see from the output above, a SchemaErrors
exception is raised with a summary of the error counts and failure cases
caught by the schema. You can also see from the Usage Tip that you can
catch these errors and inspect the failure cases in a more granular form:
try:
schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
print("Schema errors and failure cases:")
print(err.failure_cases)
print("\nDataFrame object that failed validation:")
print(err.data)
Schema errors and failure cases:
schema_context column check check_number \
0 DataFrameSchema None column_in_dataframe None
1 DataFrameSchema None column_in_schema None
2 Column int_column dtype('int64') None
3 Column float_column dtype('float64') None
4 Column float_column greater_than(0) 0
5 Column str_column equal_to(a) 0
6 Column str_column equal_to(a) 0
failure_case index
0 date_column None
1 unknown_column None
2 object None
3 int64 None
4 0 0
5 b 1
6 d 2
DataFrame object that failed validation:
int_column float_column str_column unknown_column
0 a 0 a None
1 b 1 b None
2 c 2 d None