Checks

Custom Checks

checkedframe aims to maximize your control over the checks you write. A check can take one of four inputs: a Series, a string, a DataFrame, or nothing. It can return one of three outputs: a Series, a boolean, or an Expression. In addition, checks can be “native”, meaning they are written against the DataFrame library you are using, or they can be written in a DataFrame-agnostic way. Let’s look at some examples using Polars.

import checkedframe as cf
import polars as pl


class S(cf.Schema):
    customer_id = cf.String()
    checking_balance = cf.Float64()
    savings_balance = cf.Float64()


df = pl.DataFrame(
    {
        "customer_id": ["TV09", "DTG9"],
        "checking_balance": [400.4, 20.2],
        "savings_balance": [1_000.1, 89.91],
    }
)

In this example, we have a check that operates on a Series and returns a Series. The columns argument tells checkedframe to apply this check to the “checking_balance” Series. Because the return type is a Series, checkedframe can report exactly which rows fail the check. In error reporting, the check is attached to the “checking_balance” column.

# Native version (Polars types)
@cf.Check(columns="checking_balance")
def series_series_check(s: pl.Series) -> pl.Series:
    return s <= 100

# DataFrame-agnostic version (checkedframe types)
@cf.Check(columns="checking_balance")
def series_series_check(s: cf.Series) -> cf.Series:
    return s <= 100

Similarly, we can use expressions. Expression checks usually take a string (the column name) and return an expression. Since expressions are series-like, we get the same row-level error reporting. Expressions also run in parallel if your engine supports it, so prefer expressions for the best performance.

# Native version (Polars types)
@cf.Check(columns="checking_balance")
def str_expr_check(name: str) -> pl.Expr:
    return pl.col(name) <= 100

# DataFrame-agnostic version (checkedframe types)
@cf.Check(columns="checking_balance")
def str_expr_check(name: str) -> cf.Expr:
    return cf.col(name) <= 100

We can also operate at the DataFrame level using expressions. Here the check takes no arguments and refers to columns by name.

# Native version (Polars types)
@cf.Check
def expr_check() -> pl.Expr:
    return pl.col("checking_balance") <= 100

# DataFrame-agnostic version (checkedframe types)
@cf.Check
def expr_check() -> cf.Expr:
    return cf.col("checking_balance") <= 100

A check can also take the DataFrame itself as input.

# Native version (Polars types)
@cf.Check
def df_check(df: pl.DataFrame) -> pl.Series:
    return df["checking_balance"] <= 100

# DataFrame-agnostic version (checkedframe types)
@cf.Check
def df_check(df: cf.DataFrame) -> cf.Series:
    return df["checking_balance"] <= 100

While not particularly useful when operating on a single column, DataFrame-level checks shine when multi-column checks are needed.

Note

The above examples use type hints to indicate the input and return types. A check without type hints would look like this:

@cf.Check(columns="checking_balance", input_type="Series", return_type="Series", native=True)
def series_series_check(s):
    return s <= 100

@cf.Check(columns="checking_balance", input_type="Series", return_type="Series", native=False)
def series_series_check(s):
    return s <= 100

Built-in Checks

checkedframe also comes with a large suite of built-in checks. Built-in checks are always agnostic, so you can use them the same way regardless of your DataFrame library.

The Check class (cf.Check) represents a check to run.

Parameters:

func (Optional[Callable], optional) – The check to run, by default None
columns (Optional[str | list[str] | Selector], optional) – The columns associated with the check, by default None
input_type (Optional[Literal["auto", "Frame", "str", "Series"]], optional) – The input to the check function. If “auto”, attempts to determine via the context, by default “auto”
return_type (Literal["auto", "bool", "Expr", "Series"], optional) – The return type of the check function. If “auto”, attempts to determine via the context, by default “auto”
native (bool | Literal["auto"], optional) – Whether to run the check on the native DataFrame or the Narwhals DataFrame. If “auto”, attempts to determine via the context, by default “auto”
name (Optional[str], optional) – The name of the check, by default None
description (Optional[str], optional) – The description of the check. If None, attempts to read from the __doc__ attribute, by default None

static approx_eq(other: Any, rtol: float = 1e-05, atol: float = 1e-08, nan_equal: bool = False) → Check

Tests whether values are approximately equal to other. Strings are interpreted as column names.

Parameters:

other (Any)
rtol (float, optional) – Relative tolerance, by default 1e-5
atol (float, optional) – Absolute tolerance, by default 1e-8
nan_equal (bool, optional) – Whether to consider NaN values equal, by default False

Return type:

Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    prob = cf.Float64(checks=[cf.Check.approx_eq(0.5)])


df = pl.DataFrame({"prob": [0.5, 0.50000001, 0.6]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  prob: 1 error(s)
    - approximately_equal_to failed for 1 / 3 (33.33%) rows: Must be approximately equal to 0.5 (rtol=1e-05, atol=1e-08, nan_equal=False)
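
To make the tolerances concrete, the sketch below re-implements the comparison for single values in plain Python. It assumes the NumPy-style rule |value - other| <= atol + rtol * |other|, which the default values of rtol and atol suggest but which this page does not state. The helper name approx_eq_scalar is hypothetical and is not part of checkedframe.

```python
import math


def approx_eq_scalar(value, other, rtol=1e-5, atol=1e-8, nan_equal=False):
    """Illustrative scalar approximate-equality test.

    Assumes the rule |value - other| <= atol + rtol * |other|.
    This is an assumption, not checkedframe's documented implementation.
    """
    if math.isnan(value) or math.isnan(other):
        # NaN only matches NaN, and only when nan_equal is set.
        return nan_equal and math.isnan(value) and math.isnan(other)
    if math.isinf(value) or math.isinf(other):
        # Infinities only match an infinity of the same sign.
        return value == other
    return abs(value - other) <= atol + rtol * abs(other)


# Same values as the example above, compared against 0.5.
results = [approx_eq_scalar(v, 0.5) for v in [0.5, 0.50000001, 0.6]]
```

Under this rule the allowed difference for other=0.5 is 1e-08 + 1e-05 * 0.5, roughly 5.01e-06, which is consistent with 0.50000001 passing and 0.6 failing in the output above.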

static eq(other: Any) → Check

Tests whether values are equal to other. Strings are interpreted as column names.

Parameters: other (Any)
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    group = cf.String(checks=[cf.Check.eq("A")])


df = pl.DataFrame({"group": ["A", "B", "A"]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  group: 1 error(s)
    - equal_to failed for 1 / 3 (33.33%) rows: Must be = A

static ge(other: Any) → Check

Tests whether values are greater than or equal to other. Strings are interpreted as column names.

Parameters: other (Any)
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.ge(10),
            cf.Check.ge("min_age"),
            cf.Check.ge(cf.col("min_age") - 10),
        ]
    )


df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "min_age": [10, 5, 8],
    }
)
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  age: 2 error(s)
    - greater_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be >= 10
    - greater_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be >= min_age
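
The two failure counts above follow from the rule that a string names a column while any other value is a literal. The plain-Python sketch below models that rule on the same data. The helpers resolve and count_ge_failures and the dict-of-lists layout are hypothetical stand-ins for illustration, and how a string that matches no column is handled is not covered.

```python
def resolve(other, columns, n_rows):
    """Return one comparison value per row (illustrative helper)."""
    if isinstance(other, str):
        # A string is looked up as a column name.
        return columns[other]
    # Any other value is broadcast as a literal.
    return [other] * n_rows


def count_ge_failures(columns, column, other):
    """Count rows where column >= other does not hold."""
    values = columns[column]
    bounds = resolve(other, columns, len(values))
    return sum(1 for value, bound in zip(values, bounds) if not value >= bound)


columns = {"age": [5, 10, 11], "min_age": [10, 5, 8]}
failures_literal = count_ge_failures(columns, "age", 10)        # against the number 10
failures_column = count_ge_failures(columns, "age", "min_age")  # row by row against min_age
```

In both cases only the first row (age 5) fails, which matches the "1 / 3" reported for each check.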

static gt(other: Any) → Check

Tests whether values are greater than other. Strings are interpreted as column names.

Parameters: other (Any)
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.gt(10),
            cf.Check.gt("min_age"),
            cf.Check.gt(cf.col("min_age") - 100),
        ]
    )


df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "min_age": [10, 5, 8],
    }
)
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  age: 2 error(s)
    - greater_than failed for 2 / 3 (66.67%) rows: Must be > 10
    - greater_than failed for 1 / 3 (33.33%) rows: Must be > min_age

static is_between(lower_bound: Any, upper_bound: Any, closed: Literal['left', 'right', 'none', 'both'] = 'both') → Check

Tests whether values are between lower_bound and upper_bound. Strings are interpreted as column names.

Parameters:

lower_bound (Any) – The lower bound
upper_bound (Any) – The upper bound
closed (ClosedInterval, optional) – Defines which sides of the interval are closed, by default “both”

Return type:

Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(checks=[cf.Check.is_between(0, 128)])
    min_balance = cf.Int64()
    med_balance = cf.Int64(checks=[cf.Check.is_between("min_balance", "max_balance")])
    max_balance = cf.Int64()


df = pl.DataFrame(
    {
        "age": [5, 10, 150],
        "min_balance": [1, 100, 500],
        "med_balance": [0, 83, 525],
        "max_balance": [788, 82, 550],
    }
)
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  age: 1 error(s)
    - is_between failed for 1 / 3 (33.33%) rows: Must be in range [0, 128]
  med_balance: 1 error(s)
    - is_between failed for 2 / 3 (66.67%) rows: Must be in range [min_balance, max_balance]
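
The closed parameter controls which endpoints count as inside the range. The sketch below spells out the four options in plain Python, assuming the usual convention that "left" includes only the lower bound and "right" includes only the upper bound. The helper in_interval is hypothetical and only mirrors the documented parameter names.

```python
def in_interval(value, lower, upper, closed="both"):
    """Interval membership for the four `closed` options (illustrative)."""
    # The lower bound is inclusive for "both" and "left".
    above = value >= lower if closed in ("both", "left") else value > lower
    # The upper bound is inclusive for "both" and "right".
    below = value <= upper if closed in ("both", "right") else value < upper
    return above and below


# Same data as the example above.
age_ok = [in_interval(age, 0, 128) for age in [5, 10, 150]]

rows = zip([1, 100, 500], [0, 83, 525], [788, 82, 550])  # (min, med, max) per row
med_ok = [in_interval(med, low, high) for low, med, high in rows]
```

The second row of med_balance can never pass because its lower bound (100) is greater than its upper bound (82), which accounts for one of the two failing rows.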

static is_finite() → Check

Tests whether values are finite.

Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_finite()])


df = pl.DataFrame({"balances": [1, 2, float("inf")]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_finite failed for 1 / 3 (33.33%) rows: All values must be finite

static is_id(subset: str | list[str]) → Check

Tests whether the given column(s) identify the DataFrame.

Parameters: subset (str | list[str]) – The columns that identify the DataFrame
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class MySchema(cf.Schema):
    __dataframe_checks__ = [cf.Check.is_id("group")]
    group = cf.String()


df = pl.DataFrame({"group": ["A", "B", "A"]})
MySchema.validate(df)

Output:

SchemaError: Found 1 error(s)
  __dataframe__: 1 error(s)
    - is_id failed: 'group' must uniquely identify the DataFrame
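
Put differently, a subset identifies the DataFrame when no combination of its values occurs in more than one row. The plain-Python sketch below shows that idea on a list of row dictionaries. The helper identifies_rows and the extra "day" column are hypothetical and only serve the illustration.

```python
def identifies_rows(rows, subset):
    """Return True if the values of `subset` are unique across rows."""
    keys = [tuple(row[name] for name in subset) for row in rows]
    return len(set(keys)) == len(keys)


rows = [
    {"group": "A", "day": 1},
    {"group": "B", "day": 1},
    {"group": "A", "day": 2},
]

single_ok = identifies_rows(rows, ["group"])            # "A" appears twice
composite_ok = identifies_rows(rows, ["group", "day"])  # every pair is distinct
```

A column that is not unique on its own can still be part of a composite key, which is why subset also accepts a list of column names.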

static is_in(other: Collection) → Check

Tests whether all values of the Series are in the given collection.

Parameters: other (Collection) – The collection
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    business_type = cf.String(checks=[cf.Check.is_in(["tech", "finance"])])


df = pl.DataFrame({"business_type": ["x", "tech", "finance"]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  business_type: 1 error(s)
    - is_in failed for 1 / 3 (33.33%) rows: Must be in allowed values ['tech', 'finance']

static is_not_inf() → Check

Tests whether values are not infinite.

Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_not_inf()])


df = pl.DataFrame({"balances": [1, 2, float("inf")]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_not_inf failed for 1 / 3 (33.33%) rows: Must not be inf/-inf

static is_not_nan() → Check

Tests whether values are not NaN.

Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    balances = cf.Float64(checks=[cf.Check.is_not_nan()])


df = pl.DataFrame({"balances": [1, 2, float("nan")]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  balances: 1 error(s)
    - is_not_nan failed for 1 / 3 (33.33%) rows: Must not be NaN
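
The is_finite, is_not_inf, and is_not_nan checks differ only in how they treat infinities and NaN. The sketch below uses Python's math module to show the standard floating-point classification. That the built-in checks classify values the same way is an assumption, since this page does not say, for example, whether is_finite rejects NaN.

```python
import math

values = [1.0, float("inf"), float("-inf"), float("nan")]

# Finite means neither infinite nor NaN.
finite = [math.isfinite(v) for v in values]
# Not infinite: NaN still passes.
not_inf = [not math.isinf(v) for v in values]
# Not NaN: infinities still pass.
not_nan = [not math.isnan(v) for v in values]
```

Under this classification, a value is finite exactly when it is both not infinite and not NaN.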

static is_not_null() → Check

Tests whether values are not null.

Note

This method is mainly here for completeness. Columns are by default not nullable.

Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    customer_id = cf.String(checks=[cf.Check.is_not_null()])


df = pl.DataFrame({"customer_id": ["a23", None]})
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  customer_id: 2 error(s)
    - `nullable=False` failed for 1 / 2 (50.00%) rows: Must not be null
    - is_not_null failed for 1 / 2 (50.00%) rows: Must not be null

static is_sorted(descending: bool = False) → Check

Tests whether a Series is sorted.

Parameters: descending (bool, optional) – Whether to check for descending order, by default False
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    timestamps = cf.Int64(checks=[cf.Check.is_sorted()])


df = pl.DataFrame({"timestamps": [1, 2, 1]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  timestamps: 1 error(s)
    - is_sorted failed: Must be sorted in ascending order
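
Sortedness is a property of the whole Series, which is why the message above reports no per-row failure count. A minimal plain-Python sketch of the idea compares each value with its successor. The helper is_sorted_list is hypothetical, and treating equal neighbours as sorted is an assumption.

```python
def is_sorted_list(values, descending=False):
    """Return True if every value is in order relative to the next one."""
    pairs = zip(values, values[1:])
    if descending:
        return all(a >= b for a, b in pairs)
    return all(a <= b for a, b in pairs)


ascending_ok = is_sorted_list([1, 2, 1])                     # the example data above
ties_ok = is_sorted_list([1, 1, 2])                          # ties count as sorted here
descending_ok = is_sorted_list([3, 2, 2], descending=True)
```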

static is_sorted_by(by: str | Sequence[str], descending: bool | Sequence[bool] = False, compare_all: bool = True) → Check

Tests whether a DataFrame is sorted by the given columns.

Parameters:

by (str | Sequence[str]) – The column(s) to sort by
descending (bool | Sequence[bool], optional) – Whether to sort in descending order, by default False
compare_all (bool, optional) – Whether to compare all columns or just the sorting columns, by default True

Return type:

Check

Examples

import checkedframe as cf
import polars as pl

class MySchema(cf.Schema):
    timestamps = cf.Int64()
    values = cf.Int64()

    _sorted_check = cf.Check.is_sorted_by("timestamps")


df = pl.DataFrame({"timestamps": [1, 2, 1], "values": [1, 2, 3]})
MySchema.validate(df)

Output:

SchemaError: Found 1 error(s)
  * is_sorted_by failed for 3 / 3 (100.00%) rows: Must be sorted by timestamps, where descending is False

static le(other: Any) → Check

Tests whether values are less than or equal to other. Strings are interpreted as column names.

Parameters: other (Any)
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.le(10),
            cf.Check.le("max_age"),
            cf.Check.le(cf.col("max_age") + 10),
        ]
    )


df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "max_age": [10, 5, 8],
    }
)
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  age: 2 error(s)
    - less_than_or_equal_to failed for 1 / 3 (33.33%) rows: Must be <= 10
    - less_than_or_equal_to failed for 2 / 3 (66.67%) rows: Must be <= max_age

static lt(other: Any) → Check

Tests whether values are less than other. Strings are interpreted as column names.

Parameters: other (Any)
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    age = cf.Int64(
        checks=[
            cf.Check.lt(10),
            cf.Check.lt("max_age"),
            cf.Check.lt(cf.col("max_age") + 10),
        ]
    )


df = pl.DataFrame(
    {
        "age": [5, 10, 11],
        "max_age": [10, 5, 8],
    }
)
S.validate(df)

Output:

SchemaError: Found 2 error(s)
  age: 2 error(s)
    - less_than failed for 2 / 3 (66.67%) rows: Must be < 10
    - less_than failed for 2 / 3 (66.67%) rows: Must be < max_age

static str_contains(pattern: str, literal: bool = False) → Check

Tests whether string values contain the given pattern.

Parameters:

pattern (str) – The pattern to check for
literal (bool, optional) – Whether to interpret the pattern as a literal string or a regex, by default False

Return type:

Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    domains = cf.String(checks=[cf.Check.str_contains(r"\.com$", literal=False)])


df = pl.DataFrame({"domains": ["a.com", "b.org"]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  domains: 1 error(s)
    - contains failed for 1 / 2 (50.00%) rows: Must contain \.com$
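
The literal flag changes how the pattern is read: as a regex by default, or as a plain substring when literal=True. The sketch below illustrates the difference with Python's re module as a stand-in. The backend's regex engine may differ from re in its details, and the helper contains is hypothetical.

```python
import re


def contains(value, pattern, literal=False):
    """Substring test when literal, regex search otherwise (illustrative)."""
    if literal:
        return pattern in value
    return re.search(pattern, value) is not None


domains = ["a.com", "b.org", "a.com.au"]

# As a regex, "$" anchors the match to the end of the string.
regex_hits = [contains(d, r"\.com$") for d in domains]
# As a literal, ".com" may appear anywhere in the string.
literal_hits = [contains(d, ".com", literal=True) for d in domains]
# The regex text itself never occurs verbatim in these domains.
raw_hits = [contains(d, r"\.com$", literal=True) for d in domains]
```

If the intent is simply "ends with .com", str_ends_with expresses it without any regex escaping.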

static str_ends_with(suffix: str) → Check

Tests whether string values end with the given suffix.

Parameters: suffix (str) – The suffix to check for
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    emails = cf.String(checks=[cf.Check.str_ends_with("@gmail.com")])


df = pl.DataFrame({"emails": ["a@gmail.com", "b@yahoo.com"]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  emails: 1 error(s)
    - ends_with failed for 1 / 2 (50.00%) rows: Must end with @gmail.com

static str_starts_with(prefix: str) → Check

Tests whether string values start with the given prefix.

Parameters: prefix (str) – The prefix to check for
Return type: Check

Examples

import checkedframe as cf
import polars as pl

class S(cf.Schema):
    ids = cf.String(checks=[cf.Check.str_starts_with("user_")])


df = pl.DataFrame({"ids": ["user_a", "admin_b"]})
S.validate(df)

Output:

SchemaError: Found 1 error(s)
  ids: 1 error(s)
    - starts_with failed for 1 / 2 (50.00%) rows: Must start with user_