Checks
Custom Checks
checkedframe aims to maximize your control over the checks you write. A check accepts one of four kinds of input: a Series, a string, a DataFrame, or nothing. It returns one of three kinds of output: a Series, a boolean, or an Expression. Checks can also be “native”, meaning they are written with the DataFrame library you are using, or they can be written in a DataFrame-agnostic way. Let’s look at some examples using Polars.
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    customer_id = cf.String()
    checking_balance = cf.Float64()
    savings_balance = cf.Float64()

df = pl.DataFrame(
    {
        "customer_id": ["TV09", "DTG9"],
        "checking_balance": [400.4, 20.2],
        "savings_balance": [1_000.1, 89.91],
    }
)
In this example, we have a check that operates on a Series and returns a Series. The columns argument tells checkedframe to apply this check to the “checking_balance” Series. Because the return type is a Series, checkedframe can report exactly which rows fail the check. In error reporting, the check is attached to the “checking_balance” column.
# Native (Polars) version
@cf.Check(columns="checking_balance")
def series_series_check(s: pl.Series) -> pl.Series:
    return s <= 100

# DataFrame-agnostic version
@cf.Check(columns="checking_balance")
def series_series_check(s: cf.Series) -> cf.Series:
    return s <= 100
Expression checks work the same way. They usually take a column name (a string) and return an expression. Because expressions are Series-like, they give the same row-level error reporting. Engines that evaluate expressions in parallel can also run these checks in parallel, so prefer expressions when performance matters.
# Native (Polars) version
@cf.Check(columns="checking_balance")
def str_expr_check(name: str) -> pl.Expr:
    return pl.col(name) <= 100

# DataFrame-agnostic version
@cf.Check(columns="checking_balance")
def str_expr_check(name: str) -> cf.Expr:
    return cf.col(name) <= 100
We can also operate at the DataFrame level using expressions.
# Native (Polars) version
@cf.Check
def expr_check() -> pl.Expr:
    return pl.col("checking_balance") <= 100

# DataFrame-agnostic version
@cf.Check
def expr_check() -> cf.Expr:
    return cf.col("checking_balance") <= 100
We can also just take the DataFrame in directly.
# Native (Polars) version
@cf.Check
def df_check(df: pl.DataFrame) -> pl.Series:
    return df["checking_balance"] <= 100

# DataFrame-agnostic version
@cf.Check
def df_check(df: cf.DataFrame) -> cf.Series:
    return df["checking_balance"] <= 100
While not particularly useful when operating on a single column, DataFrame-level checks shine when multi-column checks are needed.
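To make the role of the return type concrete, here is a library-free sketch of what a Series-returning check conceptually produces: one boolean per row, which is what makes it possible to pinpoint failing rows. It uses plain Python lists in place of Series. The dict-of-lists `frame` and the helper `failing_rows` are illustrative assumptions, not part of checkedframe.

```python
# Plain-Python stand-in for a DataFrame: column name -> list of values.
frame = {
    "customer_id": ["TV09", "DTG9"],
    "checking_balance": [400.4, 20.2],
    "savings_balance": [1_000.1, 89.91],
}


def failing_rows(mask):
    """Return the row indices where a boolean mask is False."""
    return [i for i, ok in enumerate(mask) if not ok]


# Single-column check: one boolean per row, like a Series-returning check.
single_mask = [value <= 100 for value in frame["checking_balance"]]

# Multi-column check: compares two columns row by row, so it needs the frame.
multi_mask = [
    checking <= savings
    for checking, savings in zip(frame["checking_balance"], frame["savings_balance"])
]

# A bool-returning check collapses the mask, so only pass/fail is known.
passed = all(single_mask)

print(failing_rows(single_mask))
print(failing_rows(multi_mask))
print(passed)
```

The multi-column comparison is the kind of rule that cannot be expressed against a single Series, which is where a DataFrame-level check becomes useful.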
Note
The above examples use type hints to indicate the input and return types. A check without type hints would look like this:
@cf.Check(columns="checking_balance", input_type="Series", return_type="Series", native=True)
def series_series_check(s):
    return s <= 100

@cf.Check(columns="checking_balance", input_type="Series", return_type="Series", native=False)
def series_series_check(s):
    return s <= 100
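As a rough illustration of how type hints can stand in for the explicit `input_type` and `return_type` arguments, the sketch below reads a function's annotations with the standard `inspect` module. It is not checkedframe's implementation: the placeholder classes, the lookup tables, and `infer_types` are all hypothetical, and mapping a zero-argument function to `None` is a guess based on the `input_type` signature.

```python
import inspect


# Hypothetical placeholder types; checkedframe and Polars provide their own.
class Series: ...
class Expr: ...
class DataFrame: ...


INPUT_NAMES = {Series: "Series", str: "str", DataFrame: "Frame"}
RETURN_NAMES = {Series: "Series", bool: "bool", Expr: "Expr"}


def infer_types(func):
    """Guess (input_type, return_type) from a function's annotations."""
    sig = inspect.signature(func)
    params = list(sig.parameters.values())
    # No parameters is taken to mean the check needs no input (assumption).
    input_type = INPUT_NAMES.get(params[0].annotation, "auto") if params else None
    return_type = RETURN_NAMES.get(sig.return_annotation, "auto")
    return input_type, return_type


def series_series_check(s: Series) -> Series:
    return s


def expr_check() -> Expr:
    return Expr()


def untyped_check(s):
    return s


print(infer_types(series_series_check))
print(infer_types(expr_check))
print(infer_types(untyped_check))
```

In this sketch an unannotated function falls back to "auto" for both values, which is the situation where passing the types explicitly is the safer choice.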
Built-in Checks
checkedframe also comes with a large suite of built-in checks. Built-in checks are always agnostic, so you can use them the same way regardless of your DataFrame library.
- class checkedframe._checks.Check(func: Callable | None = None, columns: str | list[str] | Selector | None = None, input_type: Literal['auto', 'Frame', 'str', 'Series'] | None = 'auto', return_type: Literal['auto', 'bool', 'Expr', 'Series'] = 'auto', native: bool | Literal['auto'] = 'auto', name: str | None = None, description: str | None = None)
Represents a check to run.
- Parameters:
func (Optional[Callable], optional) – The check to run, by default None
columns (Optional[str | list[str] | Selector], optional) – The columns associated with the check, by default None
input_type (Optional[Literal["auto", "Frame", "str", "Series"]], optional) – The input to the check function. If “auto”, attempts to determine via the context, by default “auto”
return_type (Literal["auto", "bool", "Expr", "Series"], optional) – The return type of the check function. If “auto”, attempts to determine via the context, by default “auto”
native (bool | Literal["auto"], optional) – Whether to run the check on the native DataFrame or the Narwhals DataFrame. If “auto”, attempts to determine via the context, by default “auto”
name (Optional[str], optional) – The name of the check, by default None
description (Optional[str], optional) – The description of the check. If None, attempts to read from the __doc__ attribute, by default None
- static approx_eq(other: Any, rtol: float = 1e-05, atol: float = 1e-08, nan_equal: bool = False) -> Check
Tests whether values are approximately equal to other. Strings are interpreted as column names.
- Parameters:
other (Any)
rtol (float, optional) – Relative tolerance, by default 1e-5
atol (float, optional) – Absolute tolerance, by default 1e-8
nan_equal (bool, optional) – Whether to consider NaN values equal, by default False
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    prob = cf.Float64(checks=[cf.Check.approx_eq(0.5)])

df = pl.DataFrame({"prob": [0.5, 0.50000001, 0.6]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  prob: 1 error(s)
    - approximately_equal_to failed for 1 / 3 (33.33%) rows: Must be approximately equal to 0.5 (rtol=1e-05, atol=1e-08, nan_equal=False)
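The way rtol and atol combine is not spelled out above. The sketch below assumes the same rule as numpy.isclose, whose default tolerances match the ones listed here: a value passes when |value - other| <= atol + rtol * |other|. Treat it as an assumption about the semantics, not as checkedframe's code. The standalone `approx_eq` function is illustrative only.

```python
import math


def approx_eq(value, other, rtol=1e-05, atol=1e-08, nan_equal=False):
    """Closeness test in the style of numpy.isclose (assumed semantics)."""
    if math.isnan(value) or math.isnan(other):
        # NaN only matches NaN, and only when nan_equal is requested.
        return nan_equal and math.isnan(value) and math.isnan(other)
    if value == other:
        # Covers exact matches, including equal infinities.
        return True
    return abs(value - other) <= atol + rtol * abs(other)


values = [0.5, 0.50000001, 0.6]
results = [approx_eq(v, 0.5) for v in values]
print(results)
```

Under this rule the allowed difference for a target of 0.5 is about 5.01e-06, so 0.50000001 passes and 0.6 does not, in line with the single failing row reported in the example output.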
- static eq(other: Any) -> Check
Tests whether values are equal to other. Strings are interpreted as column names.
- Parameters:
other (Any)
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    group = cf.String(checks=[cf.Check.eq("A")])

df = pl.DataFrame({"group": ["A", "B", "A"]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  group: 1 error(s)
    - equal_to failed for 1 / 3 (33.33%) rows: Must be = A
- static ge(other: Any) -> Check
Tests whether values are greater than or equal to other. Strings are interpreted as column names.
- Parameters:
other (Any)
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.ge(10),
            cf.Check.ge("min_age"),
            cf.Check.ge(cf.col("min_age") - 10),
        ]
    )

df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "min_age": [10, 5, 8],
    }
)

S.validate(df)
Output:
SchemaError: Found 2 error(s)
  age: 2 error(s)
    - greater_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be >= 10
    - greater_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be >= min_age
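The comparison checks accept either a literal or a column name for other. The following plain-Python sketch shows one way that distinction could be resolved. It assumes every string names an existing column, and `resolve_other` and `ge_mask` are hypothetical helpers, not checkedframe functions. Expression arguments such as `cf.col("min_age") - 10` are left out.

```python
def resolve_other(frame, other, n_rows):
    """Turn `other` into one comparison value per row."""
    if isinstance(other, str):
        # Strings are treated as column names, not as literal values.
        return frame[other]
    # Anything else is broadcast as a literal to every row.
    return [other] * n_rows


def ge_mask(frame, column, other):
    """Row-wise `column >= other`, with `other` resolved as above."""
    values = frame[column]
    targets = resolve_other(frame, other, len(values))
    return [v >= t for v, t in zip(values, targets)]


# Same data as the ge example, held as a dict of lists.
frame = {"age": [5, 10, 11], "min_age": [10, 5, 8]}

print(ge_mask(frame, "age", 10))
print(ge_mask(frame, "age", "min_age"))
```

Both masks have exactly one False entry (the first row), which agrees with the two "1 / 3" failures in the output above.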
- static gt(other: Any) -> Check
Tests whether values are greater than other. Strings are interpreted as column names.
- Parameters:
other (Any)
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.gt(10),
            cf.Check.gt("min_age"),
            cf.Check.gt(cf.col("min_age") - 100),
        ]
    )

df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "min_age": [10, 5, 8],
    }
)

S.validate(df)
Output:
SchemaError: Found 2 error(s)
  age: 2 error(s)
    - greater_than failed for 2 / 3 (66.67%) rows: Must be > 10
    - greater_than failed for 1 / 3 (33.33%) rows: Must be > min_age
- static is_between(lower_bound: Any, upper_bound: Any, closed: Literal['left', 'right', 'none', 'both'] = 'both') -> Check
Tests whether values are between lower_bound and upper_bound. Strings are interpreted as column names.
- Parameters:
lower_bound (Any) – The lower bound
upper_bound (Any) – The upper bound
closed (ClosedInterval, optional) – Defines which sides of the interval are closed, by default “both”
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(checks=[cf.Check.is_between(0, 128)])
    min_balance = cf.Int64()
    med_balance = cf.Int64(checks=[cf.Check.is_between("min_balance", "max_balance")])
    max_balance = cf.Int64()

df = pl.DataFrame(
    {
        "age": [5, 10, 150],
        "min_balance": [1, 100, 500],
        "med_balance": [0, 83, 525],
        "max_balance": [788, 82, 550],
    }
)

S.validate(df)
Output:
SchemaError: Found 2 error(s)
  age: 1 error(s)
    - is_between failed for 1 / 3 (33.33%) rows: Must be in range [0, 128]
  med_balance: 1 error(s)
    - is_between failed for 2 / 3 (66.67%) rows: Must be in range [min_balance, max_balance]
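The closed option decides whether each endpoint counts as inside the interval. Here is a small scalar sketch of the four settings, assuming the usual interval conventions ("left" includes only the lower bound, "right" only the upper bound). The `_BOUND_OPS` table and this standalone `is_between` are illustrative, not the library's internals.

```python
import operator

# Comparison operators for (lower, upper) bounds, keyed by the `closed` option.
_BOUND_OPS = {
    "both": (operator.ge, operator.le),
    "left": (operator.ge, operator.lt),
    "right": (operator.gt, operator.le),
    "none": (operator.gt, operator.lt),
}


def is_between(value, lower_bound, upper_bound, closed="both"):
    """Interval membership where `closed` names the inclusive side(s)."""
    lower_op, upper_op = _BOUND_OPS[closed]
    return lower_op(value, lower_bound) and upper_op(value, upper_bound)


# Probe both endpoints of [0, 128] under every setting.
for closed in ("both", "left", "right", "none"):
    print(closed, is_between(0, 0, 128, closed), is_between(128, 0, 128, closed))
```

Only the endpoints are affected by the setting: a value strictly inside the range passes in all four cases.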
- static is_finite() -> Check
Tests whether values are finite.
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_finite()])

df = pl.DataFrame({"balances": [1, 2, float("inf")]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_finite failed for 1 / 3 (33.33%) rows: All values must be finite
- static is_id(subset: str | list[str]) -> Check
Tests whether the given column(s) identify the DataFrame.
- Parameters:
subset (str | list[str]) – The columns that identify the DataFrame
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class MySchema(cf.Schema):
    __dataframe_checks__ = [cf.Check.is_id("group")]
    group = cf.String()

df = pl.DataFrame({"group": ["A", "B", "A"]})

MySchema.validate(df)
Output:
SchemaError: Found 1 error(s)
  __dataframe__: 1 error(s)
    - is_id failed: 'group' must uniquely identify the DataFrame
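"Identify the DataFrame" means that no combination of values in the subset columns appears on more than one row. A library-free sketch of that uniqueness test follows. The `duplicated_keys` helper and the extra "day" column are hypothetical additions for illustration.

```python
from collections import Counter


def duplicated_keys(frame, subset):
    """Return key tuples that occur more than once in the subset columns."""
    columns = [subset] if isinstance(subset, str) else list(subset)
    # Build one tuple per row from the chosen columns.
    keys = zip(*(frame[name] for name in columns))
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]


# "group" matches the example above; "day" is added for the composite case.
frame = {"group": ["A", "B", "A"], "day": [1, 1, 2]}

print(duplicated_keys(frame, "group"))
print(duplicated_keys(frame, ["group", "day"]))
```

An empty result corresponds to a passing check: "group" alone repeats the value "A", while the pair of "group" and "day" is unique on every row.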
- static is_in(other: Collection) -> Check
Tests whether all values of the Series are in the given collection.
- Parameters:
other (Collection) – The collection
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    business_type = cf.String(checks=[cf.Check.is_in(["tech", "finance"])])

df = pl.DataFrame({"business_type": ["x", "tech", "finance"]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  business_type: 1 error(s)
    - is_in failed for 1 / 3 (33.33%) rows: Must be in allowed values ['tech', 'finance']
- static is_not_inf() -> Check
Tests whether values are not infinite.
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_not_inf()])

df = pl.DataFrame({"balances": [1, 2, float("inf")]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_not_inf failed for 1 / 3 (33.33%) rows: Must not be inf/-inf
- static is_not_nan() -> Check
Tests whether values are not NaN.
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_not_nan()])

df = pl.DataFrame({"balances": [1, 2, float("nan")]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_not_nan failed for 1 / 3 (33.33%) rows: Must not be NaN
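is_finite, is_not_inf, and is_not_nan overlap but are not interchangeable. The comparison below uses Python's math module as a stand-in and assumes the checks follow the same IEEE 754 conventions. In particular, it assumes is_finite rejects NaN as well as infinities, which the descriptions above do not state explicitly.

```python
import math

values = [1.0, float("inf"), float("-inf"), float("nan")]

# Finite means neither infinite nor NaN.
is_finite = [math.isfinite(v) for v in values]
# Only +inf and -inf are rejected; NaN passes.
is_not_inf = [not math.isinf(v) for v in values]
# Only NaN is rejected; infinities pass.
is_not_nan = [not math.isnan(v) for v in values]

print(is_finite)
print(is_not_inf)
print(is_not_nan)
```

Under these assumptions, is_finite behaves like is_not_inf and is_not_nan applied together.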
- static is_not_null() -> Check
Tests whether values are not null.
Note
This method is mainly here for completeness. Columns are by default not nullable.
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    customer_id = cf.String(checks=[cf.Check.is_not_null()])

df = pl.DataFrame({"customer_id": ["a23", None]})

S.validate(df)
Output:
SchemaError: Found 2 error(s)
  customer_id: 2 error(s)
    - `nullable=False` failed for 1 / 2 (50.00%) rows: Must not be null
    - is_not_null failed for 1 / 2 (50.00%) rows: Must not be null
- static is_sorted(descending: bool = False) -> Check
Tests whether a Series is sorted.
- Parameters:
descending (bool, optional) – Whether to check for descending order, by default False
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    timestamps = cf.Int64(checks=[cf.Check.is_sorted()])

df = pl.DataFrame({"timestamps": [1, 2, 1]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  timestamps: 1 error(s)
    - is_sorted failed: Must be sorted in ascending order
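Unlike the row-wise checks, the is_sorted output above reports no failing-row count, which suggests the check yields a single pass/fail result for the whole Series. A plain-Python sketch of such a test is shown here. Allowing ties (repeated values) is an assumption, and this standalone `is_sorted` is illustrative only.

```python
def is_sorted(values, descending=False):
    """True when every value is in order relative to its predecessor."""
    # Pair each value with its successor: (v0, v1), (v1, v2), ...
    pairs = zip(values, values[1:])
    if descending:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


print(is_sorted([1, 2, 1]))
print(is_sorted([1, 2, 2]))
print(is_sorted([3, 2, 1], descending=True))
```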
- static is_sorted_by(by: str | Sequence[str], descending: bool | Sequence[bool] = False, compare_all: bool = True) -> Check
Tests whether a DataFrame is sorted by the given columns.
- Parameters:
by (str | Sequence[str]) – The column(s) to sort by
descending (bool | Sequence[bool], optional) – Whether to sort in descending order, by default False
compare_all (bool, optional) – Whether to compare all columns or just the sorting columns, by default True
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class MySchema(cf.Schema):
    timestamps = cf.Int64()
    values = cf.Int64()

    _sorted_check = cf.Check.is_sorted_by("timestamps")

df = pl.DataFrame({"timestamps": [1, 2, 1], "values": [1, 2, 3]})

MySchema.validate(df)
Output:
SchemaError: Found 1 error(s)
  * is_sorted_by failed for 3 / 3 (100.00%) rows: Must be sorted by timestamps, where descending is False
- static le(other: Any) -> Check
Tests whether values are less than or equal to other. Strings are interpreted as column names.
- Parameters:
other (Any)
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.le(10),
            cf.Check.le("max_age"),
            cf.Check.le(cf.col("max_age") - 10),
        ]
    )

df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "max_age": [10, 5, 8],
    }
)

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  age: 1 error(s)
    - less_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be <= 10
- static lt(other: Any) -> Check
Tests whether values are less than other. Strings are interpreted as column names.
- Parameters:
other (Any)
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.lt(10),
            cf.Check.lt("max_age"),
            cf.Check.lt(cf.col("max_age") - 10),
        ]
    )

df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "max_age": [10, 5, 8],
    }
)

S.validate(df)
Output:
SchemaError: Found 2 error(s)
  age: 2 error(s)
    - less_than failed for 2 / 3 (66.67%) rows: Must be < 10
    - less_than failed for 1 / 3 (33.33%) rows: Must be < max_age
- static str_contains(pattern: str, literal: bool = False) -> Check
Tests whether string values contain the given pattern.
- Parameters:
pattern (str) – The pattern to check for
literal (bool, optional) – Whether to interpret the pattern as a literal string or a regex, by default False
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    domains = cf.String(checks=[cf.Check.str_contains(r"\.com$", literal=False)])

df = pl.DataFrame({"domains": ["a.com", "b.org"]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  domains: 1 error(s)
    - contains failed for 1 / 2 (50.00%) rows: Must contain \.com$
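The literal flag changes how the same pattern string is read. The sketch below contrasts the two modes using Python's re module as a stand-in. The regex flavor actually applied depends on the underlying DataFrame engine, so treat this as an approximation. The standalone `str_contains` and the third sample value are hypothetical.

```python
import re


def str_contains(value, pattern, literal=False):
    """Substring test when literal=True, regex search otherwise."""
    if literal:
        return pattern in value
    return re.search(pattern, value) is not None


# The last value contains the pattern's characters verbatim.
domains = ["a.com", "b.org", r"weird\.com$"]

regex_mask = [str_contains(d, r"\.com$") for d in domains]
literal_mask = [str_contains(d, r"\.com$", literal=True) for d in domains]

print(regex_mask)
print(literal_mask)
```

As a regex, the pattern matches values ending in ".com". As a literal, it matches only values that contain the backslash, dot, and dollar sign exactly as written.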
- static str_ends_with(suffix: str) -> Check
Tests whether string values end with the given suffix.
- Parameters:
suffix (str) – The suffix to check for
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    emails = cf.String(checks=[cf.Check.str_ends_with("@gmail.com")])

df = pl.DataFrame({"emails": ["a@gmail.com", "b@yahoo.com"]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  emails: 1 error(s)
    - ends_with failed for 1 / 2 (50.00%) rows: Must end with @gmail.com
- static str_starts_with(prefix: str) -> Check
Tests whether string values start with the given prefix.
- Parameters:
prefix (str) – The prefix to check for
- Return type:
Check
Examples
import checkedframe as cf
import polars as pl

class S(cf.Schema):
    ids = cf.String(checks=[cf.Check.str_starts_with("user_")])

df = pl.DataFrame({"ids": ["user_a", "admin_b"]})

S.validate(df)
Output:
SchemaError: Found 1 error(s)
  ids: 1 error(s)
    - starts_with failed for 1 / 2 (50.00%) rows: Must start with user_