Drift

Functions:

Name Description
psi

Calculates the Population Stability Index (PSI) between two populations.

psi(reference, current, *, bins=None, bin_count='fd', include_nulls=True, epsilon=0.0001)

Calculates the Population Stability Index (PSI) between two populations. PSI is defined as

\[ PSI = \sum_{i=1}^{n} (\% \text{Current}_{i} - \% \text{Reference}_{i}) \times \ln\left(\frac{\% \text{Current}_{i}}{\% \text{Reference}_{i}}\right) \]

That is, bin the reference population and compute the percentage of the overall population that falls in each bin. Then bin the current population using the breakpoints from the reference population and compute its per-bin percentages in the same way. If a bin percentage is 0, add \(\epsilon\) to penalize that bin while preserving the validity of the log.
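The procedure above can be sketched in plain Python. This is an illustrative sketch only, not the rapidstats implementation: the helper names `bin_fractions` and `psi_sketch` are hypothetical, the bin edges are passed in explicitly rather than derived from the reference population, both inputs are assumed to be non-empty and free of nulls, and bins are assumed to be closed on the left.

```python
import math
from bisect import bisect_right


def bin_fractions(values, edges, epsilon=1e-4):
    """Return the fraction of `values` in each bin defined by interior `edges`.

    With k interior edges there are k + 1 bins: values below the first edge
    fall in bin 0 and values at or above the last edge fall in bin k.
    (Hypothetical helper; the binning convention is an assumption.)
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    # An empty bin has fraction 0; adding epsilon keeps the logarithm defined.
    return [c / total if c > 0 else epsilon for c in counts]


def psi_sketch(reference, current, edges, epsilon=1e-4):
    """PSI = sum((cur_i - ref_i) * ln(cur_i / ref_i)) over all bins."""
    ref = bin_fractions(reference, edges, epsilon)
    cur = bin_fractions(current, edges, epsilon)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

With a single edge at 2.5, the reference `[1, 2, 3, 4]` splits 50/50 and the current `[1, 1, 1, 4]` splits 75/25, so the sketch gives \(0.25 \ln 1.5 - 0.25 \ln 0.5 = 0.25 \ln 3\). Two identical populations give a PSI of 0.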

Parameters:

Name Type Description Default
reference ArrayLike

The reference population. The bins are always determined on this population.

required
current ArrayLike

The current population. This population is binned using the breakpoints from the reference population.

required
bins list[float] | None

A list of bin edges. Either bins or bin_count must be specified. If both are given, the bins argument takes priority. By default None

None
bin_count int | BinMethod

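If an integer, the number of bins. It can also be a string corresponding to an auto-binning method, by default "fd". The possible methods are

- "doane", see rapidstats.bin.doane
- "fd", see rapidstats.bin.freedman_diaconis
- "rice", see rapidstats.bin.rice
- "sturges", see rapidstats.bin.sturges
- "scott", see rapidstats.bin.scott
- "sqrt", see rapidstats.bin.sqrt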

'fd'
include_nulls bool

Whether nulls should be considered a bin, by default True

True
epsilon float | None

The correction term to add to 0 percentages, by default 1e-4

0.0001
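For context on the default `bin_count="fd"`: the name refers to the Freedman-Diaconis rule, which sets the bin width to \(2 \cdot \text{IQR} / n^{1/3}\). The sketch below shows the textbook rule only. The function names are hypothetical, the quartile interpolation method is an assumption, and `rapidstats.bin.freedman_diaconis` may differ in its details.

```python
import math
import statistics


def fd_bin_width(data):
    """Textbook Freedman-Diaconis bin width: 2 * IQR / n ** (1/3)."""
    # Quartiles via linear interpolation over the observed range (an assumption).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(data) ** (1 / 3)


def fd_bin_count(data):
    """Number of bins needed to cover the data range at the FD width."""
    width = fd_bin_width(data)
    if width == 0:
        # Degenerate case (zero IQR): fall back to a single bin.
        return 1
    return math.ceil((max(data) - min(data)) / width)
```

For `[0, 1, 2, 3, 4, 5, 6, 7, 10]` the quartiles are 2 and 6, so the width is \(8 / 9^{1/3} \approx 3.85\) and the range of 10 is covered by 3 bins.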

Returns:

Type Description
float
Added in version 0.3.0
Source code in python/rapidstats/drift.py
def psi(
    reference: ArrayLike,
    current: ArrayLike,
    *,
    bins: list[float] | None = None,
    bin_count: int | BinMethod = "fd",
    include_nulls: bool = True,
    epsilon: float | None = 1e-4,
) -> float:
    r"""Calculates the Population Stability Index (PSI) between two populations. PSI is
    defined as

    \[
        PSI = \sum_{i=1}^{n} (\% \text{Current}_{i} - \% \text{Reference}_{i}) \times \ln\left(\frac{\% \text{Current}_{i}}{\% \text{Reference}_{i}}\right)
    \]

    That is, bin the reference population and compute the percentage of the overall
    population in each bin. Take the breakpoints from the reference population and
    bin the current population, and repeat the process. If the bin percentage is 0, add
    $\epsilon$ to penalize that bin while preserving the validity of the log.

    Parameters
    ----------
    reference : ArrayLike
        The reference population. The bins are always determined on this population.
    current : ArrayLike
        The current population. This population is binned using the breakpoints from the
        reference population.
    bins : list[float] | None, optional
        A list of bin edges. Either `bins` or `bin_count` must be specified. The `bins`
        argument will take priority, by default None
    bin_count : int | BinMethod, optional
        If an integer, the number of bins. It can also be a string corresponding to an
        auto-binning method, by default "fd". The possible methods are

        - "doane", see [rapidstats.bin.doane][]
        - "fd", see [rapidstats.bin.freedman_diaconis][]
        - "rice", see [rapidstats.bin.rice][]
        - "sturges", see [rapidstats.bin.sturges][]
        - "scott", see [rapidstats.bin.scott][]
        - "sqrt", see [rapidstats.bin.sqrt][]

    include_nulls : bool, optional
        Whether nulls should be considered a bin, by default True
    epsilon : float | None, optional
        The correction term to add to 0 percentages, by default 1e-4

    Returns
    -------
    float

    Added in version 0.3.0
    ----------------------
    """
    reference = pl.Series("reference", reference)
    current = pl.Series("current", current)

    if reference.dtype.is_numeric() and current.dtype.is_numeric():
        return _numeric_psi(
            reference=reference,
            current=current,
            bins=bins,
            bin_count=bin_count,
            include_nulls=include_nulls,
            epsilon=epsilon,
        )

    return _categorical_psi(
        reference=reference,
        current=current,
        include_nulls=include_nulls,
        epsilon=epsilon,
    )
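The function above dispatches to `_numeric_psi` only when both inputs are numeric; everything else goes to `_categorical_psi`, whose body is not shown here. As a rough illustration of what a categorical PSI could look like, the sketch below treats every distinct value as its own bin. It is an assumption about the general approach, not the library's code: the name `categorical_psi_sketch` is hypothetical, nulls are represented as `None`, and both inputs are assumed to be non-empty after null handling.

```python
import math
from collections import Counter


def categorical_psi_sketch(reference, current, include_nulls=True, epsilon=1e-4):
    """PSI over category frequencies; each distinct value acts as one bin."""
    if not include_nulls:
        # Drop missing values (None) instead of treating them as a category.
        reference = [v for v in reference if v is not None]
        current = [v for v in current if v is not None]

    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in ref_counts.keys() | cur_counts.keys():
        # Counter returns 0 for unseen categories; swap a 0 fraction for epsilon.
        ref = ref_counts[category] / len(reference) or epsilon
        cur = cur_counts[category] / len(current) or epsilon
        total += (cur - ref) * math.log(cur / ref)
    return total
```

A category that appears in only one of the two populations contributes through `epsilon`, mirroring the zero-percentage correction described in the docstring.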