Quickstart

Install the checkedframe library from PyPI:

pip install checkedframe

checkedframe is agnostic to your DataFrame library: validation, built-in checks, and so on work with every library supported by narwhals, primarily Pandas, Polars, PyArrow, cuDF, and Modin. At the same time, you can write user-defined checks using whatever engine you prefer. The examples below use Pandas, but the same schema can be written for Polars or in a completely engine-agnostic way.

Let’s say you have a very simple DataFrame containing a customer ID, the customer’s balance, and whether or not they have overdraft protection.

import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": ["TYX89", "F38J0M"],
        "balance": [198, -56],
        "overdraft_protection": [True, False],
    }
)

Let’s start by defining a very basic schema.

import checkedframe as cf


class MySchema(cf.Schema):
    customer_id = cf.String()
    balance = cf.Float64()
    overdraft_protection = cf.Boolean()

This schema represents a DataFrame with one string column, one float column, and one boolean column. Schemas are typically created by inheriting from the Schema class. Already, we can catch some errors.

MySchema.validate(df)

Output:

Found 1 error(s)
  balance: 1 error(s)
    - Expected Float64, got Int64

checkedframe complains because we declared balance to be a Float64, but we got an Int64. We could fix this ourselves, or we could tell checkedframe to do it for us. Casting in checkedframe comes with some extra safety (casting must be value-preserving) and convenience (the ability to pinpoint which rows fail). Let’s edit our schema to reflect this.

class MySchema(cf.Schema):
    customer_id = cf.String()
    balance = cf.Float64(cast=True)
    overdraft_protection = cf.Boolean()

Now our DataFrame passes validation! checkedframe was able to inspect balance and safely cast to our desired data type, Float64.

  customer_id  balance  overdraft_protection
0       TYX89    198.0                  True
1      F38J0M    -56.0                 False
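
To make “value-preserving” concrete, here is a minimal plain-Python sketch of the idea. It is not checkedframe’s actual implementation, and the helper name rows_changed_by_float_cast is hypothetical. It round-trips each integer through float and reports the positions where the value would change.

```python
def rows_changed_by_float_cast(values):
    """Return the positions of integers that do not survive a cast to float.

    Hypothetical helper for illustration only; not part of checkedframe.
    """
    failing = []
    for position, value in enumerate(values):
        # A cast is value-preserving if converting back yields the same integer.
        if int(float(value)) != value:
            failing.append(position)
    return failing


# The balances from the example DataFrame are exactly representable as floats.
print(rows_changed_by_float_cast([198, -56]))  # []

# 2**53 + 1 is not exactly representable as a 64-bit float, so row 1 is flagged.
print(rows_changed_by_float_cast([198, 2**53 + 1]))  # [1]
```

Both example balances pass, which is why the cast to Float64 above can be applied safely.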

Beyond these core concepts, you may also want to run arbitrary checks against your DataFrame. This is handled by the Check class. Checks are user-defined (or built-in) functions that assert some property of the data described by your Schema class. For example, let’s say customer_id is always 6 characters long. There are actually a couple of ways to do this, but for now let’s assume you want to write your own function (perhaps your real check is much more complex). Let’s add the function below to our schema class.

class MySchema(cf.Schema):
    customer_id = cf.String()
    balance = cf.Float64(cast=True)
    overdraft_protection = cf.Boolean()

    @cf.Check(columns="customer_id")
    def check_id_length(s: pd.Series) -> pd.Series:
        """customer_id must be of length 6"""
        return s.str.len() == 6

Note

Activate the mypy plugin to allow mypy to recognize Check as a staticmethod.

[tool.mypy]
plugins = ["checkedframe.mypy"]

Alternatively, manually add @staticmethod. This doesn’t do anything but silence the type checker.

@cf.Check
@staticmethod
def my_check(): ...

Let’s try validating again.

MySchema.validate(df)

Output:

SchemaError: Found 1 error(s)
  customer_id: 1 error(s)
    - check_id_length failed: customer_id must be of length 6

Here, we wrote a custom function that we identified as a check with the Check decorator. In addition, via the columns argument, we attached the check to the customer_id column.
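
Because a check is an ordinary function, its logic can be tried out on the column directly before attaching it to a schema. The sketch below uses only pandas (no checkedframe) and repeats the body of check_id_length as a standalone function to show the boolean mask it produces for the example data.

```python
import pandas as pd


def check_id_length(s: pd.Series) -> pd.Series:
    """customer_id must be of length 6"""
    return s.str.len() == 6


customer_ids = pd.Series(["TYX89", "F38J0M"], name="customer_id")
mask = check_id_length(customer_ids)

# One boolean per row: True where the row passes the check.
print(mask.tolist())  # [False, True]
```

The first ID has only 5 characters, which is the row the validation error refers to.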

Now, let’s revisit balances. Notice that one of the balances is negative. It may make sense to have negative balances (in case of overdraft), but if overdraft protection is on, the transaction simply wouldn’t go through, meaning that it isn’t possible to have a negative balance. Let’s write a check for this.

@cf.Check
def check_balances_pos_if_protected(df: pd.DataFrame) -> pd.Series:
    """Balances can only be negative if there is no overdraft protection"""
    return (df["balance"] >= 0) & df["overdraft_protection"]

Output:

Found 2 error(s)
  customer_id: 1 error(s)
    - check_id_length failed for 1 / 2 (50.00%) rows: customer_id must be of length 6
  * check_balances_pos_if_protected failed for 1 / 2 (50.00%) rows: Balances can only be negative if there is no overdraft protection

Note

checkedframe will not stop on the first error; rather, it will try to find all errors before raising.
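
The collect-then-raise behavior can be pictured with a small plain-Python sketch. It illustrates the general pattern only and makes no claim about checkedframe’s internals; run_all_checks and the two check functions are hypothetical names.

```python
def run_all_checks(record, checks):
    """Run every check, collect all failure messages, and raise once at the end."""
    errors = []
    for check in checks:
        message = check(record)
        if message is not None:
            # Keep going instead of raising on the first failure.
            errors.append(message)
    if errors:
        raise ValueError(f"Found {len(errors)} error(s): " + "; ".join(errors))


def id_has_six_characters(record):
    if len(record["customer_id"]) != 6:
        return "customer_id must be of length 6"
    return None


def balance_is_float(record):
    if not isinstance(record["balance"], float):
        return "balance must be a float"
    return None


try:
    run_all_checks(
        {"customer_id": "TYX89", "balance": 198},
        [id_has_six_characters, balance_is_float],
    )
except ValueError as exc:
    report = str(exc)

print(report)
# Found 2 error(s): customer_id must be of length 6; balance must be a float
```

Reporting every failure in one pass means a single validation run is enough to see everything that needs fixing.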

You’ve created your first schema, but there is a lot more to checkedframe, including filtering (returning only rows that pass validation), detailed error inspection, static type-checking, and more. See the relevant sections in the User Guide and API Reference for more information.