Correlation

Classes:

Name	Description
`CorrelationBatchOptions`	Options to control batching in `rapidstats.correlation_matrix`.

Functions:

Name	Description
`correlation_matrix`	Compute the null-aware correlation matrix between two lists of columns.

`CorrelationBatchOptions` `dataclass`

Options to control batching in `rapidstats.correlation_matrix`.

Parameters:

Name	Type	Description	Default
`batch_size`	`int \| float`	The number of combinations (where a combination is a pair of features) to compute in each batch. If a float between 0 and 1, it is interpreted as a fraction of the total number of combinations, by default 0.1	`0.1`
`cache_dir`	`str \| Path \| None`	The directory in which to save the results of each batch. If None, creates a folder called "`__rapidstats_correlation_cache__`" in the current working directory, by default None	`None`
`start_iteration`	`int \| None`	The iteration to start at. If None, will start at the latest iteration available in `cache_dir`, by default None	`None`
`delete_ok`	`bool`	Whether to delete `cache_dir` after the correlation matrix is computed, by default False	`False`
`quiet`	`bool`	Whether to suppress progress information, by default False	`False`
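
As a rough illustration of the `batch_size` semantics described above, the sketch below turns a fractional `batch_size` into a number of column pairs per batch and splits the pairs accordingly. The helpers `resolve_batch_size` and `iter_batches` are hypothetical and not part of rapidstats. Rounding up and treating pairs as unordered combinations are assumptions made only for this example.

```python
import math
from itertools import combinations
from typing import Iterator, List, Tuple, Union


def resolve_batch_size(batch_size: Union[int, float], n_combinations: int) -> int:
    """Hypothetical helper: convert `batch_size` into a count of pairs per batch."""
    if isinstance(batch_size, float) and 0 < batch_size < 1:
        # Assumption: a float in (0, 1) is a fraction of all combinations,
        # rounded up so that a batch is never empty.
        return max(1, math.ceil(batch_size * n_combinations))
    return max(1, int(batch_size))


def iter_batches(
    pairs: List[Tuple[str, str]], size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Hypothetical helper: yield consecutive slices of at most `size` pairs."""
    for start in range(0, len(pairs), size):
        yield pairs[start : start + size]


columns = ["a", "b", "c", "d", "e", "f"]
pairs = list(combinations(columns, 2))  # 15 unordered column pairs
size = resolve_batch_size(0.25, len(pairs))  # ceil(0.25 * 15) = 4
batches = list(iter_batches(pairs, size))
```

With 15 pairs and `batch_size=0.25`, this sketch produces batches of 4, 4, 4, and 3 pairs. An integer `batch_size` is used as the pair count directly.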

Source code in python/rapidstats/_corr.py

@dataclass
class CorrelationBatchOptions:
    """Options to control batching in [rapidstats.correlation_matrix][].

    Parameters
    ----------
    batch_size : int | float, optional
        The number of combinations (where a combination is a pair of features) to
        compute each batch. If a float between 0 and 1, it is interpreted as a percent,
        by default = 0.1
    cache_dir : str | Path | None, optional
        The directory to save out the results of each batch. If None, creates a folder
        called "__rapidstats_correlation_cache__" in the current working directory, by
        default None
    start_iteration : int | None, optional
        The iteration to start at. If None, will start at the latest iteration available
        in `cache_dir`, by default None
    delete_ok : bool, optional
        Whether to delete `cache_dir` after the correlation matrix is computed, by
        default False
    quiet : bool, optional
        Whether to suppress progress information, by default False
    """

    batch_size: int | float = 0.1
    cache_dir: str | Path | None = None
    start_iteration: int | None = None
    delete_ok: bool = False
    quiet: bool = False

`correlation_matrix(data, l1=None, l2=None, method='pearson', format='wide', index='', batch_options=None)`

Warning

If you know that your data has no nulls, use `np.corrcoef` instead. This function returns the correct result and is reasonably fast, but computing a null-aware correlation matrix will always be slower than assuming that there are no nulls.

Compute the null-aware correlation matrix between two lists of columns. If both `l1` and `l2` are None, the correlation matrix covers all columns in the input DataFrame. If `l1` is a list of 2-tuples, each tuple is interpreted as a pair of columns to compute the correlation for.
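
To make "null-aware" concrete, here is a minimal pure-Python sketch of a Pearson correlation that skips rows where either value is missing (pairwise deletion). The source does not state how rapidstats treats nulls internally, so pairwise deletion is an assumption here, and `pairwise_pearson` is a hypothetical name used only for illustration.

```python
import math
from typing import Optional, Sequence


def pairwise_pearson(
    x: Sequence[Optional[float]], y: Sequence[Optional[float]]
) -> float:
    """Pearson correlation over rows where both x and y are present."""
    # Assumption: a null in either column drops that row for this pair only.
    kept = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(kept)
    if n < 2:
        return math.nan  # not enough complete rows to define a correlation

    mean_x = sum(a for a, _ in kept) / n
    mean_y = sum(b for _, b in kept) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in kept)
    var_x = sum((a - mean_x) ** 2 for a, _ in kept)
    var_y = sum((b - mean_y) ** 2 for _, b in kept)
    if var_x == 0 or var_y == 0:
        return math.nan  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, 4.0, 6.0, None, 10.0]
r = pairwise_pearson(x, y)  # uses only rows 0, 1, and 4
```

Because each pair of columns can have a different set of complete rows, this work cannot be reduced to a single matrix product over the full data, which is one plausible reason a null-aware computation is slower than `np.corrcoef`.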

Parameters:

Name	Type	Description	Default
`data`	`IntoFrame`	The input data	required
`l1`	`Union[list[str], list[tuple[str, str]]]`	A list of columns to appear as the columns of the correlation matrix, or a list of 2-tuples naming the column pairs to compute the correlation for, by default None	`None`
`l2`	`list[str]`	A list of columns to appear as the rows of the correlation matrix, by default None	`None`
`method`	`Literal['pearson', 'spearman']`	How to calculate the correlation, by default "pearson"	`'pearson'`
`format`	`Literal['wide', 'long']`	The format the correlation matrix is returned in. If "wide", it is the classic correlation matrix. If "long", it is a DataFrame with the columns `c1`, `c2`, and `correlation`, by default "wide". Added in version 0.4.0.	`'wide'`
`index`	`str`	The name of the `l2` column in the final output. Ignored if the format is "long", by default "". Added in version 0.2.0. Renamed from "index_name" to "index" in version 0.4.0.	`''`
`batch_options`	`CorrelationBatchOptions \| None`	Parameters that control how to compute the correlation matrix in a batched manner. If None, does not use batching, by default None	`None`
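
The `method` parameter chooses between Pearson and Spearman correlation. As a reminder of how the two relate, the sketch below computes Spearman as the Pearson correlation of the ranks. It is a pure-Python illustration for complete data only. Using average ranks for ties is an assumption, and the function names are hypothetical rather than rapidstats internals.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation for complete data (no nulls)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]  # monotone in x, but not linear
p = pearson(x, y)
s = spearman(x, y)
```

For this monotone but non-linear relationship, the Spearman value is 1 while the Pearson value is below 1, which is the usual reason to prefer `"spearman"` when only the ordering matters.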

Returns:

Type	Description
`DataFrame`	A correlation matrix with `l1` as the columns and `l2` as the rows
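
The relationship between the "long" and "wide" formats can be sketched with plain dictionaries. The example assumes that `c1` holds the `l1` column name and `c2` the `l2` column name, which the source does not spell out. The correlation values are made-up placeholders, and `long_to_wide` is a hypothetical helper, not a rapidstats function.

```python
from typing import Any, Dict, List


def long_to_wide(rows: List[Dict[str, Any]], index: str = "") -> List[Dict[str, Any]]:
    """Pivot long records (c1, c2, correlation) into one record per c2 value."""
    col_names: List[str] = []
    by_row: Dict[str, Dict[str, float]] = {}
    for row in rows:
        if row["c1"] not in col_names:
            col_names.append(row["c1"])
        # Assumption: c2 labels the rows (l2) and c1 labels the columns (l1).
        by_row.setdefault(row["c2"], {})[row["c1"]] = row["correlation"]
    return [
        {index: row_name, **{c: values.get(c) for c in col_names}}
        for row_name, values in by_row.items()
    ]


# Illustrative values only, for l1 = ["a", "b"] and l2 = ["x", "y"].
long_rows = [
    {"c1": "a", "c2": "x", "correlation": 0.9},
    {"c1": "b", "c2": "x", "correlation": -0.2},
    {"c1": "a", "c2": "y", "correlation": 0.1},
    {"c1": "b", "c2": "y", "correlation": 0.4},
]
wide_rows = long_to_wide(long_rows, index="feature")
```

In this sketch the `index` argument only names the label column of the wide output, which mirrors why `index` is ignored when `format="long"`.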

Added in version 0.0.24

Source code in python/rapidstats/_corr.py

def correlation_matrix(
    data: nwt.IntoFrame,
    l1: Optional[Union[list[str], list[tuple[str, str]]]] = None,
    l2: Optional[list[str]] = None,
    method: CorrelationMethod = "pearson",
    format: CorrelationMatrixFormat = "wide",
    index: str = "",
    batch_options: CorrelationBatchOptions | None = None,
) -> pl.DataFrame:
    """
    !!! warning

        If you know that your data has no nulls, you should use `np.corrcoef` instead.
        While this function will return the correct result and is reasonably fast,
        computing the null-aware correlation matrix will always be slower than assuming
        that there are no nulls.

    Compute the null-aware correlation matrix between two lists of columns. If both
    lists are None, then the correlation matrix is over all columns in the input
    DataFrame. If `l1` is not None, and is a list of 2-tuples, `l1` is interpreted
    as the combinations of columns to compute the correlation for.

    Parameters
    ----------
    data : nwt.IntoFrame
        The input data
    l1 : Union[list[str], list[tuple[str, str]]], optional
        A list of columns to appear as the columns of the correlation matrix,
        by default None
    l2 : list[str], optional
        A list of columns to appear as the rows of the correlation matrix,
        by default None
    method : Literal["pearson", "spearman"], optional
        How to calculate the correlation, by default "pearson"
    format : Literal["wide", "long"], optional
        The format the correlation matrix is returned in. If "wide", it is the classic
        correlation matrix. If "long", it is a DataFrame with the columns `c1`, `c2`,
        and `correlation`, by default "wide"

        !!! Added in version 0.4.0
    index : str, optional
        The name of the `l2` column in the final output. Ignored if the format is
        "long", by default ""

        !!! Added in version 0.2.0
        !!! Renamed from "index_name" to "index" in version 0.4.0
    batch_options : CorrelationBatchOptions | None, optional
        Parameters that control how to compute the correlation matrix in a batched
        manner. If None, does not use batching, by default None

    Returns
    -------
    pl.DataFrame
        A correlation matrix with `l1` as the columns and `l2` as the rows

    Added in version 0.0.24
    -----------------------
    """
    pf, original, new_columns, combinations = _prepare_inputs(data, l1, l2)

    if batch_options is None:
        return _correlation_matrix(
            pf,
            original=original,
            new_columns=new_columns,
            combinations=combinations,
            method=method,
            index=index,
            format=format,
        )
    else:
        return _batched_correlation_matrix(
            pf=pf,
            original=original,
            new_columns=new_columns,
            combinations=combinations,
            batch_options=batch_options,
            method=method,
            format=format,
            index=index,
        )
