Correlation

Classes:

Name Description
CorrelationBatchOptions

Options to control batching in rapidstats.correlation_matrix.

Functions:

Name Description
correlation_matrix

Compute the null-aware correlation matrix between two lists of columns.

CorrelationBatchOptions dataclass

Options to control batching in rapidstats.correlation_matrix.

Parameters:

Name Type Description Default
batch_size int | float

The number of combinations (where a combination is a pair of features) to compute in each batch. A float between 0 and 1 is interpreted as a fraction of all combinations, by default 0.1

0.1
cache_dir str | Path | None

The directory in which to save the results of each batch. If None, creates a folder called "__rapidstats_correlation_cache__" in the current working directory, by default None

None
start_iteration int | None

The iteration to start at. If None, will start at the latest iteration available in cache_dir, by default None

None
delete_ok bool

Whether to delete cache_dir after the correlation matrix is computed, by default False

False
quiet bool

Whether to suppress progress information, by default False

False
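To make the two readings of batch_size concrete, here is a minimal sketch in plain Python. The helper `resolve_batch_size` is hypothetical and not part of rapidstats. The rounding rule (ceiling, with at least one pair per batch) and the count of unordered pairs are assumptions made for illustration only.

```python
import math
from typing import Union


def resolve_batch_size(batch_size: Union[int, float], n_combinations: int) -> int:
    """Hypothetical helper: turn `batch_size` into a number of pairs per batch."""
    if isinstance(batch_size, float) and 0 < batch_size < 1:
        # A float in (0, 1) is read as a fraction of all combinations.
        # Rounding up is an assumption, not documented library behavior.
        return max(1, math.ceil(batch_size * n_combinations))
    # Anything else is treated as an absolute number of combinations per batch.
    return max(1, int(batch_size))


n_features = 20
# Assumption: one combination per unordered pair of distinct features.
n_combinations = math.comb(n_features, 2)

per_batch = resolve_batch_size(0.1, n_combinations)
n_batches = math.ceil(n_combinations / per_batch)
print(f"{n_combinations} pairs, {per_batch} per batch, {n_batches} batches")
```

With the default of 0.1, a run over many features is split into roughly ten batches regardless of how many columns there are, whereas an integer fixes the amount of work per batch.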
Source code in python/rapidstats/_corr.py
@dataclass
class CorrelationBatchOptions:
    """Options to control batching in [rapidstats.correlation_matrix][].

    Parameters
    ----------
    batch_size : int | float, optional
        The number of combinations (where a combination is a pair of features) to
        compute in each batch. A float between 0 and 1 is interpreted as a fraction
        of all combinations, by default 0.1
    cache_dir : str | Path | None, optional
        The directory to save out the results of each batch. If None, creates a folder
        called "__rapidstats_correlation_cache__" in the current working directory, by
        default None
    start_iteration : int | None, optional
        The iteration to start at. If None, will start at the latest iteration available
        in `cache_dir`, by default None
    delete_ok : bool, optional
        Whether to delete `cache_dir` after the correlation matrix is computed, by
        default False
    quiet : bool, optional
        Whether to suppress progress information, by default False
    """

    batch_size: int | float = 0.1
    cache_dir: str | Path | None = None
    start_iteration: int | None = None
    delete_ok: bool = False
    quiet: bool = False
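
The interplay of cache_dir, start_iteration, and delete_ok can be illustrated with a small, standard-library-only sketch of resumable batching. Everything here is hypothetical: the `batch_<i>.json` file naming, the helpers `latest_iteration` and `run_batches`, and the choice to resume just past the most recent cached batch are assumptions, not a description of how rapidstats stores or resumes its batches.

```python
import json
import shutil
import tempfile
from itertools import combinations
from pathlib import Path


def latest_iteration(cache_dir: Path) -> int:
    """Return the index of the next batch to run, based on files in the cache."""
    done = [int(p.stem.split("_")[1]) for p in cache_dir.glob("batch_*.json")]
    return max(done) + 1 if done else 0


def run_batches(pairs, batch_size, cache_dir, start_iteration=None, stop_after=None):
    """Process pairs in batches, writing one file per batch. Returns the start index."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # None means "pick up wherever the cache left off".
    start = latest_iteration(cache_dir) if start_iteration is None else start_iteration
    batches = [pairs[i : i + batch_size] for i in range(0, len(pairs), batch_size)]
    for i in range(start, len(batches)):
        if stop_after is not None and i >= stop_after:
            break  # simulate an interrupted run
        # Placeholder result: a real run would store correlations, not the pairs.
        (cache_dir / f"batch_{i}.json").write_text(json.dumps(batches[i]))
    return start


pairs = [list(p) for p in combinations(["a", "b", "c", "d"], 2)]
cache = Path(tempfile.mkdtemp()) / "demo_cache"

first_start = run_batches(pairs, 2, cache, stop_after=2)  # interrupted after two batches
resumed_start = run_batches(pairs, 2, cache)  # resumes from the cached state
n_files = len(list(cache.glob("batch_*.json")))

shutil.rmtree(cache.parent)  # analogous in spirit to delete_ok=True
```

The point of the design is that an interrupted run loses at most one batch of work, at the cost of leaving files on disk unless cleanup is requested.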

correlation_matrix(data, l1=None, l2=None, method='pearson', format='wide', index='', batch_options=None)

Warning

If you know that your data has no nulls, you should use np.corrcoef instead. While this function will return the correct result and is reasonably fast, computing the null-aware correlation matrix will always be slower than assuming that there are no nulls.

Compute the null-aware correlation matrix between two lists of columns. If both lists are None, then the correlation matrix is over all columns in the input DataFrame. If l1 is not None, and is a list of 2-tuples, l1 is interpreted as the combinations of columns to compute the correlation for.
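
As an illustration of what "null-aware" can mean, the sketch below computes a Pearson correlation in plain Python using pairwise deletion: for each pair of columns, only rows where both values are present are used. This is one common strategy and is assumed here for illustration. The function `pairwise_pearson` is hypothetical, and the source does not state which strategy rapidstats uses internally.

```python
import math
from typing import Optional, Sequence


def pairwise_pearson(
    x: Sequence[Optional[float]], y: Sequence[Optional[float]]
) -> float:
    """Pearson correlation over the rows where both x and y are non-null."""
    kept = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(kept)
    if n < 2:
        return math.nan  # not enough complete rows
    mean_x = sum(a for a, _ in kept) / n
    mean_y = sum(b for _, b in kept) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in kept)
    var_x = sum((a - mean_x) ** 2 for a, _ in kept)
    var_y = sum((b - mean_y) ** 2 for _, b in kept)
    if var_x == 0 or var_y == 0:
        return math.nan  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Made-up columns with nulls (None) in different rows.
a = [1.0, 2.0, None, 4.0, 5.0]
b = [2.0, 4.0, 6.0, None, 10.0]
c = [5.0, None, 3.0, 2.0, 1.0]

r_ab = pairwise_pearson(a, b)  # uses rows 0, 1, 4
r_ac = pairwise_pearson(a, c)  # uses rows 0, 3, 4
```

Because each pair keeps a different subset of rows, the work cannot be expressed as a single matrix product over the whole table, which is consistent with the warning above that the null-aware path is slower than np.corrcoef on null-free data.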

Parameters:

Name Type Description Default
data IntoFrameT

The input data

required
l1 Union[list[str], list[tuple[str, str]]]

A list of columns to appear as the columns of the correlation matrix, by default None

None
l2 list[str]

A list of columns to appear as the rows of the correlation matrix, by default None

None
method Literal['pearson', 'spearman']

How to calculate the correlation, by default "pearson"

'pearson'
format Literal['wide', 'long']

The format the correlation matrix is returned in. If "wide", it is the classic correlation matrix. If "long", it is a DataFrame with the columns c1, c2, and correlation, by default "wide"

!!! Added in version 0.4.0

'wide'
index str

The name of the l2 column in the final output. Ignored if the format is "long", by default ""

!!! Added in version 0.2.0

!!! Renamed from "index_name" to "index" in version 0.4.0

''
batch_options CorrelationBatchOptions | None

Parameters that control how to compute the correlation matrix in a batched manner. If None, does not use batching, by default None

None

Returns:

Type Description
DataFrame

A correlation matrix with l1 as the columns and l2 as the rows

Added in version 0.0.24
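
The relationship between the "long" and "wide" formats can be sketched with plain Python records. The values are made up, and `long_to_wide` is a hypothetical helper, not a rapidstats function. The mapping of c1 to the matrix columns and c2 to the rows is an assumption based on the parameter descriptions above.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]

# Long format: one record per pair, with the keys c1, c2, and correlation.
long_rows: List[Row] = [
    {"c1": "a", "c2": "x", "correlation": 0.9},
    {"c1": "a", "c2": "y", "correlation": -0.2},
    {"c1": "b", "c2": "x", "correlation": 0.1},
    {"c1": "b", "c2": "y", "correlation": 0.5},
]


def long_to_wide(rows: List[Row], index: str = "") -> List[Row]:
    """Pivot long records into one record per c2 value, with one key per c1 value."""
    wide: Dict[str, Row] = {}
    for row in rows:
        # The `index` key names the column that holds the row labels.
        record = wide.setdefault(row["c2"], {index: row["c2"]})
        record[row["c1"]] = row["correlation"]
    return list(wide.values())


wide_rows = long_to_wide(long_rows, index="feature")
for record in wide_rows:
    print(record)
```

The long layout is convenient for filtering or sorting pairs by correlation, while the wide layout matches the classic matrix view, which is where the index name matters.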
Source code in python/rapidstats/_corr.py
def correlation_matrix(
    data: nwt.IntoFrame,
    l1: Optional[Union[list[str], list[tuple[str, str]]]] = None,
    l2: Optional[list[str]] = None,
    method: CorrelationMethod = "pearson",
    format: CorrelationMatrixFormat = "wide",
    index: str = "",
    batch_options: CorrelationBatchOptions | None = None,
) -> pl.DataFrame:
    """
    !!! warning

        If you know that your data has no nulls, you should use `np.corrcoef` instead.
        While this function will return the correct result and is reasonably fast,
        computing the null-aware correlation matrix will always be slower than assuming
        that there are no nulls.

    Compute the null-aware correlation matrix between two lists of columns. If both
    lists are None, then the correlation matrix is over all columns in the input
    DataFrame. If `l1` is not None, and is a list of 2-tuples, `l1` is interpreted
    as the combinations of columns to compute the correlation for.

    Parameters
    ----------
    data : nwt.IntoFrame
        The input data
    l1 : Union[list[str], list[tuple[str, str]]], optional
        A list of columns to appear as the columns of the correlation matrix,
        by default None
    l2 : list[str], optional
        A list of columns to appear as the rows of the correlation matrix,
        by default None
    method : Literal["pearson", "spearman"], optional
        How to calculate the correlation, by default "pearson"
    format : Literal["wide", "long"], optional
        The format the correlation matrix is returned in. If "wide", it is the classic
        correlation matrix. If "long", it is a DataFrame with the columns `c1`, `c2`,
        and `correlation`, by default "wide"

        !!! Added in version 0.4.0
    index : str, optional
        The name of the `l2` column in the final output. Ignored if the format is
        "long", by default ""

        !!! Added in version 0.2.0
        !!! Renamed from "index_name" to "index" in version 0.4.0
    batch_options : CorrelationBatchOptions | None, optional
        Parameters that control how to compute the correlation matrix in a batched
        manner. If None, does not use batching, by default None

    Returns
    -------
    pl.DataFrame
        A correlation matrix with `l1` as the columns and `l2` as the rows

    Added in version 0.0.24
    -----------------------
    """
    pf, original, new_columns, combinations = _prepare_inputs(data, l1, l2)

    if batch_options is None:
        return _correlation_matrix(
            pf,
            original=original,
            new_columns=new_columns,
            combinations=combinations,
            method=method,
            index=index,
            format=format,
        )
    else:
        return _batched_correlation_matrix(
            pf=pf,
            original=original,
            new_columns=new_columns,
            combinations=combinations,
            batch_options=batch_options,
            method=method,
            format=format,
            index=index,
        )