Statistics Expr

Extension for Statistical Tests and Samples

Simple Statistics.

Functions:

Name	Description
`add_noise`	Adds some noise to the column.
`bicor`	Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.
`chi2`	Computes the Chi Squared statistic and p value between two categorical values.
`corr`	A convenience function for calling different types of correlations.
`cosine_sim`	Column-and-column cosine similarity
`f_test`	Performs the ANOVA F-test.
`gmean`	Computes the geometric mean of the variable.
`hmean`	Computes the harmonic mean.
`jitter`	Adds a Gaussian noise of N(0, std) to the column.
`kendall_tau`	Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.
`ks_2samp`	Computes two-sided KS statistics between var1 and var2.
`mann_whitney_u`	Computes the Mann-Whitney U statistic and the p-value. Note: this function will sanitize data (drop all non-finite values) before computing the statistic.
`normal_test`	Perform a normality test which is based on D'Agostino and Pearson's test that combines skew and kurtosis to produce an omnibus test of normality.
`perturb`	Perturb the var by a small amount. This only applies to float columns.
`random`	Generate random numbers in [lower, upper)
`random_binomial`	Generates random integers following a binomial distribution.
`random_exp`	Generates random numbers following an exponential distribution.
`random_int`	Generates random integers in [lower, upper).
`random_normal`	Generates random numbers following a normal distribution.
`random_null`	Creates random null values in the column. If x contains nulls originally, they will stay null.
`random_str`	Generates random strings of length between min_size and max_size. The characters are uniformly distributed over ASCII letters and numbers: a-z, A-Z and 0-9.
`ttest_1samp`	Performs a standard 1 sample t test using reference column and expected mean.
`ttest_ind`	Performs 2 sample Student's t test or Welch's t test.
`ttest_ind_from_stats`	Performs 2 sample student's t test or Welch's t test, using only scalar statistics from other.
`weighted_corr`	Computes the weighted correlation between x and y.
`weighted_cosine_sim`	Computes the weighted cosine similarity between x and y (column-wise).
`weighted_cov`	Computes the weighted covariance between x and y.
`weighted_gmean`	Computes the weighted geometric mean of the variable.
`weighted_hmean`	Computes the weighted harmonic mean of the variable.
`weighted_mean`	Computes the weighted mean.
`weighted_var`	Computes the weighted variance. The weights column must have the same length as var.
`xi_corr`	Computes the ξ(xi) correlation developed by Sourav Chatterjee in the paper in the reference.

`add_noise(x, noise_type='gaussian', **kwargs)`

Adds some noise to the column.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`noise_type`	`Noise`	Either "gaussian" or "uniform"	`'gaussian'`
`kwargs`		If noise_type = "gaussian", this accepts kwargs to "jitter" and if "uniform", this accepts kwargs to "perturb". You may set a seed via the kwargs.	`{}`

Source code in python/polars_ds/exprs/stats.py

def add_noise(x: str | pl.Expr, noise_type: Noise = "gaussian", **kwargs) -> pl.Expr:
    """
    Adds some noise to the column.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    noise_type
        Either "gaussian" or "uniform"
    kwargs
        If noise_type = "gaussian", this accepts kwargs to "jitter" and if "uniform", this
        accepts kwargs to "perturb". You may set a seed via the kwargs.
    """
    if noise_type == "gaussian":
        return jitter(x, **kwargs)
    elif noise_type == "uniform":
        return perturb(x, **kwargs)
    else:
        raise ValueError(f"The noise_type {noise_type} is not currently supported.")

`bicor(x, y, c=9.0)`

Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.

Performance hint: this expression benefits from .lazy() a lot.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`c`	`float`	Biweight tuning constant which is typically 9	`9.0`

Reference

https://en.wikipedia.org/wiki/Biweight_midcorrelation
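To make the weighting concrete, here is a minimal plain-Python sketch of the same arithmetic on ordinary lists. The name `bicor_py` is hypothetical and not part of the package; the sketch assumes equal-length inputs whose median absolute deviation is non-zero.

```python
from statistics import median


def bicor_py(x, y, c=9.0):
    """Biweight midcorrelation of two equal-length sequences (illustrative only)."""

    def weighted_dev(values):
        med = median(values)
        dev = [v - med for v in values]
        mad = median(abs(d) for d in dev)  # median absolute deviation
        out = []
        for d in dev:
            u = d / (c * mad)
            # Points with |u| >= 1 are treated as outliers and get weight 0.
            w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
            out.append(d * w)
        return out

    def dot(p, q):
        return sum(i * j for i, j in zip(p, q))

    a, b = weighted_dev(x), weighted_dev(y)
    return dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5


r_pos = bicor_py([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_neg = bicor_py([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
```

A perfectly linear increasing relationship gives a value of about 1, and a perfectly decreasing one gives about -1.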

Source code in python/polars_ds/exprs/stats.py

def bicor(x: str | pl.Expr, y: str | pl.Expr, c: float = 9.0) -> pl.Expr:
    """
    Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.

    Performance hint: this expression benefits from .lazy() a lot.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    c
        Biweight tuning constant which is typically 9

    Reference
    ---------
    https://en.wikipedia.org/wiki/Biweight_midcorrelation
    """
    a, b = to_expr(x), to_expr(y)
    med_a = a.median()
    med_b = b.median()

    diff_a = a - med_a
    diff_b = b - med_b

    ua = diff_a / (c * diff_a.abs().median())
    ub = diff_b / (c * diff_b.abs().median())

    w_a = (1 - ua.pow(2)).pow(2) * ((1 - ua.abs()) > 0).cast(pl.Float64)
    w_b = (1 - ub.pow(2)).pow(2) * ((1 - ub.abs()) > 0).cast(pl.Float64)

    aa = diff_a * w_a
    bb = diff_b * w_b

    return aa.dot(bb) / (aa.dot(aa) * (bb.dot(bb))).sqrt()

`chi2(var1, var2, return_full=False)`

Computes the Chi Squared statistic and p value between two categorical values.

Note that it is up to the user to make sure that the two columns contain categorical values. This method is equivalent to SciPy's chi2_contingency, except that it also computes the contingency table internally for the user.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	Either the name of the column or a Polars expression	required
`var2`	`str \| Expr`	Either the name of the column or a Polars expression	required
`return_full`	`bool`	If true, dof and expected frequency will also be returned. The returned "struct" will not be a scalar anymore, but has length = length of expected frequencies.	`False`
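As an illustration of what "computes the contingency table internally" involves, the following plain-Python sketch builds the table from two categorical lists and evaluates the Pearson chi-squared statistic and degrees of freedom. The name `chi2_py` is hypothetical. The sketch applies no continuity correction and does not compute the p-value; whether the plugin applies a correction is not stated here.

```python
from collections import Counter


def chi2_py(var1, var2):
    """Chi-squared statistic and degrees of freedom (illustrative only)."""
    n = len(var1)
    observed = Counter(zip(var1, var2))  # contingency table as a sparse mapping
    row = Counter(var1)
    col = Counter(var2)
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            # Counter returns 0 for category pairs that never occur together.
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


stat_dep, dof_dep = chi2_py(["a", "a", "b", "b"], ["x", "x", "y", "y"])
stat_ind, dof_ind = chi2_py(["a", "a", "b", "b"], ["x", "y", "x", "y"])
```

In the first call the two columns move together, so the statistic is large relative to the table size; in the second every cell matches its expected count and the statistic is zero.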

Source code in python/polars_ds/exprs/stats.py

def chi2(var1: str | pl.Expr, var2: str | pl.Expr, return_full: bool = False) -> pl.Expr:
    """
    Computes the Chi Squared statistic and p value between two categorical values.

    Note that it is up to the user to make sure that the two columns contain categorical
    values. This method is equivalent to SciPy's chi2_contingency, except that it also
    computes the contingency table internally for the user.

    Parameters
    ----------
    var1
        Either the name of the column or a Polars expression
    var2
        Either the name of the column or a Polars expression
    return_full
        If true, dof and expected frequency will also be returned. The returned "struct"
        will not be a scalar anymore, but has length = length of expected frequencies.
    """
    if return_full:
        return pl_plugin(
            symbol="pl_chi2_full", args=[to_expr(var1), to_expr(var2)], changes_length=True
        )
    else:
        return pl_plugin(
            symbol="pl_chi2",
            args=[to_expr(var1), to_expr(var2)],
            returns_scalar=True,
        )

`corr(x, y, method='pearson')`

A convenience function for calling different types of correlations. Pearson and Spearman correlations run on Polars' native expressions, while Kendall, Xi and bicor correlations run on code in this package.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`method`	`CorrMethod`	One of ["pearson", "spearman", "xi", "kendall", "bicor"]	`'pearson'`

Source code in python/polars_ds/exprs/stats.py

def corr(x: str | pl.Expr, y: str | pl.Expr, method: CorrMethod = "pearson") -> pl.Expr:
    """
    A convenience function for calling different types of correlations. Pearson and Spearman correlations
    run on Polars' native expressions, while Kendall, Xi and bicor correlations run on code in this package.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    method
        One of ["pearson", "spearman", "xi", "kendall", "bicor"]
    """
    if method in ["pearson", "spearman"]:
        return pl.corr(x, y, method=method)
    elif method == "xi":
        return xi_corr(x, y)
    elif method == "kendall":
        return kendall_tau(x, y)
    elif method == "bicor":
        return bicor(x, y)
    else:
        raise ValueError(f"Unknown correlation method: {method}.")

`cosine_sim(x, y)`

Column-and-column cosine similarity

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
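Cosine similarity is the dot product of the two columns divided by the product of their Euclidean norms. A minimal plain-Python sketch of that definition follows; `cosine_sim_py` is a hypothetical name used only for illustration, and the inputs are assumed to be non-zero vectors of equal length.

```python
import math


def cosine_sim_py(x, y):
    """Cosine similarity of two equal-length numeric sequences (illustrative only)."""
    dot_xy = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot_xy / (norm_x * norm_y)


sim_parallel = cosine_sim_py([1, 2, 3], [2, 4, 6])   # same direction
sim_orthogonal = cosine_sim_py([1, 0], [0, 1])       # perpendicular
sim_mixed = cosine_sim_py([3, 4], [4, 3])            # 24 / (5 * 5)
```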

Source code in python/polars_ds/exprs/stats.py

def cosine_sim(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Column-and-column cosine similarity

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    """
    xx, yy = to_expr(x), to_expr(y)
    x2 = xx.dot(xx).sqrt()
    y2 = yy.dot(yy).sqrt()
    # x2 and y2 are already the Euclidean norms, so no further square root is needed
    return xx.dot(yy) / (x2 * y2)

`f_test(*variables, group)`

Performs the ANOVA F-test.

Parameters:

Name	Type	Description	Default
`variables`	`str \| Expr`	The columns (variables) to run ANOVA F-test on	`()`
`group`	`str \| Expr`	The "target" column used to group the variables	required
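For orientation, the one-way ANOVA F statistic is the between-group mean square divided by the within-group mean square. The sketch below computes it in plain Python for a single variable; `anova_f_py` is a hypothetical name, and the sketch returns only the statistic, not a p-value.

```python
def anova_f_py(values, groups):
    """One-way ANOVA F statistic for one variable (illustrative only)."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    n, k = len(values), len(by_group)
    grand_mean = sum(values) / n

    ss_between = 0.0
    ss_within = 0.0
    for members in by_group.values():
        mean = sum(members) / len(members)
        ss_between += len(members) * (mean - grand_mean) ** 2
        ss_within += sum((m - mean) ** 2 for m in members)

    # Mean squares use k - 1 and n - k degrees of freedom respectively.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_stat = anova_f_py([1, 2, 3, 2, 3, 4], ["a", "a", "a", "b", "b", "b"])
```

Here the group means are 2 and 3, the between-group sum of squares is 1.5, the within-group sum of squares is 4, and the resulting F statistic is 1.5.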

Source code in python/polars_ds/exprs/stats.py

def f_test(*variables: str | pl.Expr, group: str | pl.Expr) -> pl.Expr:
    """
    Performs the ANOVA F-test.

    Parameters
    ----------
    variables
        The columns (variables) to run ANOVA F-test on
    group
        The "target" column used to group the variables
    """
    vars_ = [to_expr(group)]
    vars_.extend(to_expr(x) for x in variables)
    if len(vars_) <= 1:
        raise ValueError("No input feature column to run F-test on.")
    elif len(vars_) == 2:
        return pl_plugin(symbol="pl_f_test", args=vars_, returns_scalar=True)
    else:
        return pl_plugin(symbol="pl_f_test", args=vars_, changes_length=True)

`gmean(var)`

Computes the geometric mean of the variable.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
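The geometric mean is the exponential of the mean of the logarithms, which is why it is only defined for positive values. A minimal plain-Python sketch (the name `gmean_py` is hypothetical):

```python
import math


def gmean_py(values):
    """Geometric mean via exp(mean(log(v))) (illustrative only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


g = gmean_py([1, 4, 16])  # cube root of 1 * 4 * 16 = 64
```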

Source code in python/polars_ds/exprs/stats.py

def gmean(var: str | pl.Expr) -> pl.Expr:
    """
    Computes the geometric mean of the variable.

    Parameters
    ----------
    var
        The variable
    """
    return to_expr(var).ln().mean().exp()

`hmean(var)`

Computes the harmonic mean.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
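The harmonic mean is the number of values divided by the sum of their reciprocals. A minimal plain-Python sketch (the name `hmean_py` is hypothetical; values are assumed non-zero):

```python
def hmean_py(values):
    """Harmonic mean: n / sum(1 / v) (illustrative only)."""
    return len(values) / sum(1.0 / v for v in values)


h = hmean_py([40, 60])  # 2 / (1/40 + 1/60)
```

For example, averaging speeds of 40 and 60 over equal distances gives 48 rather than the arithmetic mean of 50.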

Source code in python/polars_ds/exprs/stats.py

def hmean(var: str | pl.Expr) -> pl.Expr:
    """
    Computes the harmonic mean.

    Parameters
    ----------
    var
        The variable
    """
    x = to_expr(var)
    return x.count() / (1.0 / x).sum()

`jitter(x, std=1.0, seed=None)`

Adds a Gaussian noise of N(0, std) to the column.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`std`	`float \| Expr`	The std of the Gaussian noise.	`1.0`
`seed`	`int \| None`	A random seed	`None`
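The effect of `std` and `seed` can be sketched with the standard library. This is not the plugin's random number generator, so the actual values will differ from what the package produces; `jitter_py` is a hypothetical name.

```python
import random


def jitter_py(values, std=1.0, seed=None):
    """Add N(0, std) noise to each value (illustrative only)."""
    if std < 0:
        raise ValueError("Standard deviation must be non-negative.")
    if std == 0:
        return list(values)  # nothing to add
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, std) for v in values]


first = jitter_py([1.0, 2.0, 3.0], std=0.5, seed=42)
second = jitter_py([1.0, 2.0, 3.0], std=0.5, seed=42)
unchanged = jitter_py([1.0, 2.0], std=0.0)
```

With a fixed seed the two calls return identical lists, and a standard deviation of zero leaves the data untouched.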

Source code in python/polars_ds/exprs/stats.py

def jitter(x: str | pl.Expr, std: float | pl.Expr = 1.0, seed: int | None = None) -> pl.Expr:
    """
    Adds a Gaussian noise of N(0, std) to the column.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    std
        The std of the Gaussian noise.
    seed
        A random seed
    """
    if isinstance(std, (int, float)):
        if std < 0:
            raise ValueError("Standard deviation must be non-negative.")
        elif std == 0:
            return to_expr(x)

        s = pl.lit(std, dtype=pl.Float64)
    else:
        s = std.cast(pl.Float64)

    return pl_plugin(
        symbol="pl_jitter", args=[to_expr(x), s, pl.lit(seed, dtype=pl.UInt64)], is_elementwise=True
    )

`kendall_tau(x, y)`

Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.

Note: this will map NaN to null and drop all rows with null. Inf will be kept and considered as the largest value, and multiple Infs will be equal. -Inf will be the smallest if it exists in the data. A value of NaN will be returned if the data has < 2 rows after nulls are dropped.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
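For reference, Kendall's Tau-b counts concordant and discordant pairs and corrects the denominator for ties. The quadratic-time sketch below follows that definition directly; `kendall_tau_b_py` is a hypothetical name, and the sketch does not perform the null and NaN handling described above.

```python
import math


def kendall_tau_b_py(x, y):
    """Kendall's Tau-b by direct pair counting, O(n^2) (illustrative only)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


tau_same = kendall_tau_b_py([1, 2, 3, 4], [1, 2, 3, 4])
tau_swap = kendall_tau_b_py([1, 2, 3], [1, 3, 2])
tau_ties = kendall_tau_b_py([1, 1, 2], [1, 2, 3])
```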

Source code in python/polars_ds/exprs/stats.py

def kendall_tau(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.

    Note: this will map NaN to null and drop all rows with null. Inf will be kept and considered as
    the largest value and multiple Infs will be equal. -Inf will be the smallest if it exists in the
    data. A value of NaN will be returned if the data has < 2 rows after nulls are dropped.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    """
    xx, yy = to_expr(x).fill_nan(None), to_expr(y).fill_nan(None)
    return pl_plugin(
        symbol="pl_kendall_tau",
        args=[xx.rank(method="min"), yy.rank(method="min")],
        returns_scalar=True,
    )

`ks_2samp(var1, var2, alpha=0.05, is_binary=False)`

Computes two-sided KS statistics between var1 and var2. This will sanitize data (only non-null finite values are used) before doing the computation. If is_binary is true, it will compare the statistics by comparing var2(var1=0) and var2(var1=1).

Note: this returns a statistic and a threshold value. The threshold is not the p-value; rather, if the statistic is greater than the threshold, the null hypothesis should be rejected. This is suitable only for large sample sizes. See the reference for more details.

If either var1 or var2 has fewer than 30 values, a KS statistic of 0 with a NaN threshold will be returned.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	Variable 1	required
`var2`	`str \| Expr`	Variable 2	required
`alpha`	`float`	The significance level used to compute the rejection threshold	`0.05`
`is_binary`	`bool`	If true, instead of running ks(var1, var2), it runs ks(var2(var1=0), var2(var1=1))	`False`

Reference

https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test
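The statistic-versus-threshold rule can be sketched in plain Python. The threshold below uses the large-sample critical value from the referenced article, c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2); that the plugin uses exactly this formula is an assumption. The name `ks_2samp_py` is hypothetical, and the sketch omits the 30-value minimum described above.

```python
import math
from bisect import bisect_right


def ks_2samp_py(a, b, alpha=0.05):
    """Two-sample KS statistic and large-sample threshold (illustrative only)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # The largest gap between the two empirical CDFs occurs at an observed point.
    stat = max(
        abs(bisect_right(a, t) / n - bisect_right(b, t) / m) for t in a + b
    )
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return stat, threshold


stat_same, _ = ks_2samp_py([1, 2, 3, 4], [1, 2, 3, 4])
stat_apart, thr = ks_2samp_py([1, 2, 3], [4, 5, 6])
reject = stat_apart > thr
```

Identical samples give a statistic of 0 and fully separated samples give 1. With only three values per sample the threshold exceeds 1, so the null hypothesis cannot be rejected, which illustrates why the rule is meant for large samples.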

Source code in python/polars_ds/exprs/stats.py

def ks_2samp(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alpha: float = 0.05,
    is_binary: bool = False,
) -> pl.Expr:
    """
    Computes two-sided KS statistics between var1 and var2. This will
    sanitize data (only non-null finite values are used) before doing the computation. If
    is_binary is true, it will compare the statistics by comparing var2(var1=0) and var2(var1=1).

    Note: this returns a statistic and a threshold value. The threshold is not the p-value;
    rather, if the statistic is greater than the threshold, the null hypothesis should be
    rejected. This is suitable only for large sample sizes. See the reference for more details.

    If either var1 or var2 has fewer than 30 values, a KS statistic of 0 with a NaN threshold
    will be returned.

    Parameters
    ----------
    var1
        Variable 1
    var2
        Variable 2
    alpha
        The significance level used to compute the rejection threshold
    is_binary
        If true, instead of running ks(var1, var2), it runs ks(var2(var1=0), var2(var1=1))

    Reference
    ---------
    https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test
    """
    y1, y2 = to_expr(var1), to_expr(var2)
    if is_binary:
        z1 = y2.filter((y1 == 1) & y2.is_finite()).sort()
        z2 = y2.filter((y1 == 0) & y2.is_finite()).sort()
    else:
        z1 = y1.filter(y1.is_finite()).sort()
        z2 = y2.filter(y2.is_finite()).sort()

    return pl_plugin(
        symbol="pl_ks_2samp",
        args=[z1.cast(pl.Float64), z2.cast(pl.Float64), pl.lit(alpha, pl.Float64)],
        returns_scalar=True,
    )

`mann_whitney_u(var1, var2, alternative='two-sided')`

Computes the Mann-Whitney U statistic and the p-value. Note: this function will sanitize data (drop all non-finite values) before computing the statistic. This implementation follows method 2 in reference. This always applies tie correction, which may slow down computation by a little.

WIP. PVALUE NOT DONE YET.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	Either the name of the column or a Polars expression	required
`var2`	`str \| Expr`	Either the name of the column or a Polars expression	required
`alternative`	`Alternative`	The alternative for the test. `two-sided`, `greater` or `less`	`'two-sided'`

Reference

https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
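The rank-sum computation of U (method 2 in the reference) can be sketched in plain Python: rank the pooled samples with average ranks for ties, sum the ranks of the first sample, and subtract n1 * (n1 + 1) / 2. The names `average_ranks` and `mann_whitney_u_py` are hypothetical, and the sketch returns only U1 and U2, with no p-value.

```python
def average_ranks(values):
    """1-based ranks where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u_py(x, y):
    """U statistics of both samples from pooled ranks (illustrative only)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2


u_low = mann_whitney_u_py([1, 2, 3], [4, 5, 6])
u_tied = mann_whitney_u_py([1, 2], [2, 3])
```

When every value of the first sample lies below the second, U1 is 0 and U2 equals n1 * n2. A tie across the samples contributes one half to each side.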

Source code in python/polars_ds/exprs/stats.py

def mann_whitney_u(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alternative: Alternative = "two-sided",
) -> pl.Expr:
    """
    Computes the Mann-Whitney U statistic and the p-value. Note: this function will sanitize data (drop
    all non-finite values) before computing the statistic. This implementation follows method 2 in reference.
    This always applies tie correction, which may slow down computation by a little.

    WIP. PVALUE NOT DONE YET.

    Parameters
    ----------
    var1 : pl.Expr
        Either the name of the column or a Polars expression
    var2 : pl.Expr
        Either the name of the column or a Polars expression
    alternative: str
        The alternative for the test. `two-sided`, `greater` or `less`

    Reference
    ---------
    https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
    """
    x = to_expr(var1)
    y = to_expr(var2)
    xx = x.filter(x.is_finite())
    yy = y.filter(y.is_finite())
    n1 = xx.len().cast(pl.Float64)
    n2 = yy.len().cast(pl.Float64)

    ranks = (xx.append(yy)).rank()

    u1 = ranks.slice(0, length=xx.len()).sum() - (n1 * (n1 + 1)) / 2
    u2 = (n1 * n2) - u1

    mean = (n1 * n2) / 2
    return pl_plugin(
        symbol="pl_mann_whitney_u",
        args=[u1, u2, mean, ranks.sort(), pl.lit(alternative, dtype=pl.String)],
    )

`normal_test(var)`

Perform a normality test which is based on D'Agostino and Pearson's test that combines skew and kurtosis to produce an omnibus test of normality. Null values, NaN and inf are dropped when running this computation.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	Either the name of the column or a Polars expression	required

References

D'Agostino, R. B. (1971), "An omnibus test of normality for moderate and large sample size", Biometrika, 58, 341-348

D'Agostino, R. and Pearson, E. S. (1973), "Tests for departure from normality", Biometrika, 60, 613-622

Source code in python/polars_ds/exprs/stats.py

def normal_test(var: str | pl.Expr) -> pl.Expr:
    """
    Perform a normality test which is based on D'Agostino and Pearson's test
    that combines skew and kurtosis to produce an omnibus test of normality.
    Null values, NaN and inf are dropped when running this computation.

    Parameters
    ----------
    var
        Either the name of the column or a Polars expression

    References
    ----------
    D'Agostino, R. B. (1971), "An omnibus test of normality for
        moderate and large sample size", Biometrika, 58, 341-348
    D'Agostino, R. and Pearson, E. S. (1973), "Tests for departure from
        normality", Biometrika, 60, 613-622
    """
    y = to_expr(var)
    valid: pl.Expr = y.filter(y.is_finite())
    skew = valid.skew()
    # Pearson Kurtosis, see here: https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test
    kur = valid.kurtosis(fisher=False)
    return pl_plugin(
        symbol="pl_normal_test",
        args=[skew, kur, valid.count().cast(pl.UInt32)],
        returns_scalar=True,
    )

`perturb(x, epsilon=1e-05, positive=False, seed=None)`

Perturb the var by a small amount. This only applies to float columns.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`epsilon`	`float`	The small amount to perturb.	`1e-05`
`positive`	`bool`	If true, randomly add a small amount in [0, epsilon). If false, it will use the range [-epsilon/2, epsilon/2)	`False`
`seed`	`int \| None`	A random seed	`None`
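The two ranges selected by `positive` can be sketched with the standard library. The random number generator here is Python's, not the plugin's, so actual values will differ; `perturb_py` is a hypothetical name.

```python
import math
import random


def perturb_py(values, epsilon=1e-5, positive=False, seed=None):
    """Add a small uniform offset to each value (illustrative only)."""
    if math.isinf(epsilon) or math.isnan(epsilon):
        raise ValueError("Input `epsilon` should be a valid finite value.")
    ep = abs(epsilon)
    lo, hi = (0.0, ep) if positive else (-ep / 2, ep / 2)
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so each offset lies in [lo, hi) up to rounding.
    return [v + (lo + (hi - lo) * rng.random()) for v in values]


one_sided = perturb_py([0.0] * 100, epsilon=1e-3, positive=True, seed=1)
two_sided = perturb_py([0.0] * 100, epsilon=1e-3, seed=1)
```

Starting from zeros makes the offsets directly visible: the one-sided offsets never go negative, while the two-sided offsets are centred on zero.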

Source code in python/polars_ds/exprs/stats.py

def perturb(
    x: str | pl.Expr, epsilon: float = 1e-5, positive: bool = False, seed: int | None = None
) -> pl.Expr:
    """
    Perturb the var by a small amount. This only applies to float columns.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    epsilon
        The small amount to perturb.
    positive
        If true, randomly add a small amount in [0, epsilon). If false, it will use the range
        [-epsilon/2, epsilon/2)
    seed
        A random seed
    """
    if math.isinf(epsilon) or math.isnan(epsilon):
        raise ValueError("Input `epsilon` should be a valid finite value.")

    ep = abs(epsilon)
    if positive:
        lo = pl.lit(0.0, dtype=pl.Float64)
        hi = pl.lit(ep, dtype=pl.Float64)
    else:
        half = ep / 2
        lo = pl.lit(-half, dtype=pl.Float64)
        hi = pl.lit(half, dtype=pl.Float64)

    return pl_plugin(
        symbol="pl_perturb",
        args=[to_expr(x), lo, hi, pl.lit(seed, dtype=pl.UInt64)],
        is_elementwise=True,
    )

`random(lower=0.0, upper=1.0, seed=None, len_ref=None)`

Generate random numbers in [lower, upper)

Parameters:

Name	Type	Description	Default
`lower`	`Expr \| float`	The lower bound	`0.0`
`upper`	`Expr \| float`	The upper bound, exclusive	`1.0`
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`

Source code in python/polars_ds/exprs/stats.py

def random(
    lower: pl.Expr | float = 0.0,
    upper: pl.Expr | float = 1.0,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generate random numbers in [lower, upper)

    Parameters
    ----------
    lower
        The lower bound
    upper
        The upper bound, exclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    lo = pl.lit(lower, pl.Float64) if isinstance(lower, float) else lower
    up = pl.lit(upper, pl.Float64) if isinstance(upper, float) else upper
    len_, is_elementwise = _get_streamable(len_ref)

    return pl_plugin(
        symbol="pl_random",
        args=[len_, lo, up, pl.lit(seed, pl.UInt64)],
        is_elementwise=is_elementwise,
    )

`random_binomial(n, p, seed=None, len_ref=None)`

Generates random integers following a binomial distribution.

Parameters:

Name	Type	Description	Default
`n`	`int`	The n in a binomial distribution	required
`p`	`float`	The p in a binomial distribution. The success rate.	required
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`

Source code in python/polars_ds/exprs/stats.py

def random_binomial(
    n: int, p: float, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random integers following a binomial distribution.

    Parameters
    ----------
    n
        The n in a binomial distribution
    p
        The p in a binomial distribution. The success rate.
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    if n < 1:
        raise ValueError("Input `n` must be >= 1.")
    if p < 0.0 or p > 1.0:
        raise ValueError("Input `p` must be between 0 and 1.")

    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_binomial",
        args=[
            len_,
            pl.lit(n, pl.UInt32),
            pl.lit(p, pl.Float64),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

`random_exp(lambda_, seed=None, len_ref=None)`

Generates random numbers following an exponential distribution.

Parameters:

Name	Type	Description	Default
`lambda_`	`float`	The lambda in an exponential distribution	required
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`

Source code in python/polars_ds/exprs/stats.py

def random_exp(
    lambda_: float, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random numbers following an exponential distribution.

    Parameters
    ----------
    lambda_
        The lambda in an exponential distribution
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_exp",
        args=[
            len_,
            pl.lit(lambda_, pl.Float64),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

`random_int(lower, upper, seed=None, len_ref=None)`

Generates random integers in [lower, upper).

Parameters:

Name	Type	Description	Default
`lower`	`int \| Expr`	The lower bound, inclusive	required
`upper`	`int \| Expr`	The upper bound, exclusive	required
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`

Source code in python/polars_ds/exprs/stats.py

def random_int(
    lower: int | pl.Expr,
    upper: int | pl.Expr,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generates random integers in [lower, upper).

    Parameters
    ----------
    lower
        The lower bound, inclusive
    upper
        The upper bound, exclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
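    # Only plain ints can be compared eagerly; expressions are validated at execution time.
    if isinstance(lower, int) and isinstance(upper, int) and lower >= upper:
        raise ValueError("Input `lower` must be smaller than `upper`.")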

    lo = pl.lit(lower, pl.Int32) if isinstance(lower, int) else lower.cast(pl.Int32)
    hi = pl.lit(upper, pl.Int32) if isinstance(upper, int) else upper.cast(pl.Int32)
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_int",
        args=[
            len_,
            lo,
            hi,
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

`random_normal(mean, std, seed=None, len_ref=None)`

Generates random numbers following a normal distribution.

Parameters:

Name	Type	Description	Default
`mean`	`Expr \| float`	The mean in a normal distribution	required
`std`	`Expr \| float`	The std in a normal distribution	required
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`

Source code in python/polars_ds/exprs/stats.py

def random_normal(
    mean: pl.Expr | float,
    std: pl.Expr | float,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generates random numbers following a normal distribution.

    Parameters
    ----------
    mean
        The mean in a normal distribution
    std
        The std in a normal distribution
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_normal",
        args=[
            len_,
            pl.lit(mean, pl.Float64) if isinstance(mean, float) else mean,
            pl.lit(std, pl.Float64) if isinstance(std, float) else std,
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

`random_null(x, pct, seed=None)`

Creates random null values in the column. If x contains nulls originally, they will stay null.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`pct`	`float`	Fraction of values, strictly between 0 and 1, to randomly set to null. This fraction is based on the length of the column, so the resulting share of nulls may differ depending on how many values are originally null.	required
`seed`	`int \| None`	A seed to fix the random numbers. If none, use the system's entropy.	`None`
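The masking rule can be sketched in plain Python: draw a uniform number in [0, 1) per row and replace the value with null when the draw falls below `pct`. The name `random_null_py` is hypothetical, and Python's generator stands in for the plugin's.

```python
import random


def random_null_py(values, pct, seed=None):
    """Randomly replace values with None (illustrative only)."""
    if pct <= 0.0 or pct >= 1.0:
        raise ValueError("Input `pct` must be > 0 and < 1")
    rng = random.Random(seed)
    # Existing None values stay None whether or not their row is selected.
    return [None if rng.random() < pct else v for v in values]


original = [1, None, 3, 4, 5, 6]
masked = random_null_py(original, pct=0.5, seed=0)
```

Because rows that were already null can also be selected, the final share of nulls is not necessarily `pct`.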

Source code in python/polars_ds/exprs/stats.py

def random_null(x: str | pl.Expr, pct: float, seed: int | None = None) -> pl.Expr:
    """
    Creates random null values in the column. If x contains nulls originally, they
    will stay null.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    pct
        Fraction of values, strictly between 0 and 1, to randomly set to null. This
        fraction is based on the length of the column, so the resulting share of nulls
        may differ depending on how many values are originally null.
    seed
        A seed to fix the random numbers. If none, use the system's entropy.
    """
    if pct <= 0.0 or pct >= 1.0:
        raise ValueError("Input `pct` must be > 0 and < 1")

    return pl.when(random(0.0, 1.0, seed=seed, len_ref=x) < pct).then(None).otherwise(to_expr(x))

`random_str(min_size, max_size, seed=None, len_ref=None)`

Generates random strings of length between min_size and max_size. The characters are uniformly distributed over ASCII letters and numbers: a-z, A-Z and 0-9.

Parameters:

Name	Type	Description	Default
`min_size`	`int`	The min size of the string, inclusive	required
`max_size`	`int`	The max size of the string, inclusive	required
`seed`	`int \| None`	The random seed. None means no seed.	`None`
`len_ref`	`str \| Expr \| None`	Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.	`None`
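As an illustration of the behaviour described above, here is a plain-Python sketch. `random_str_sketch` and its row-count argument `n` are hypothetical names, and the random stream will not match the plugin's. It swaps the bounds when they are given in the wrong order, as the source does.

```python
import random
import string

# a-z, A-Z and 0-9, the alphabet named in the description
ALPHABET = string.ascii_letters + string.digits


def random_str_sketch(n, min_size, max_size, seed=None):
    """Generate n random strings with lengths in [min_size, max_size] (hypothetical helper)."""
    lo, hi = min(min_size, max_size), max(min_size, max_size)
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        size = rng.randint(lo, hi)  # both ends inclusive
        out.append("".join(rng.choice(ALPHABET) for _ in range(size)))
    return out


samples = random_str_sketch(5, min_size=8, max_size=3, seed=0)
```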

Source code in python/polars_ds/exprs/stats.py

def random_str(
    min_size: int, max_size: int, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random strings of length between min_size and max_size. The characters are
    uniformly distributed over ASCII letters and numbers: a-z, A-Z and 0-9.

    Parameters
    ----------
    min_size
        The min size of the string, inclusive
    max_size
        The max size of the string, inclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    mi, ma = min_size, max_size
    if min_size > max_size:
        mi, ma = max_size, min_size

    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_str",
        args=[
            len_,
            pl.lit(mi, pl.UInt32),
            pl.lit(ma, pl.UInt32),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

`ttest_1samp(var1, pop_mean, alternative='two-sided')`

Performs a standard one-sample t test of the reference column against an expected population mean. This function first sanitizes `var1` by dropping null, NaN, and infinite values. The df is the count of valid values.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

1. There is no valid value in the inputs, or the mean is inf.
2. Input variable has length 0 after removing non-finite values.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	Variable 1	required
`pop_mean`	`float`	The expected population mean in the hypothesis test	required
`alternative`	`('two-sided', 'less', 'greater')`	Alternative of the hypothesis test	`"two-sided"`
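The statistic itself is the textbook one-sample formula, t = (mean - pop_mean) / sqrt(var / n), computed on the sanitized data. A minimal plain-Python sketch follows. `t_stat_1samp` is a hypothetical name, and the p-value is omitted because the standard library has no t-distribution CDF.

```python
import math
import statistics


def t_stat_1samp(values, pop_mean):
    """One-sample t statistic on the finite, non-null values (hypothetical helper)."""
    clean = [v for v in values if v is not None and math.isfinite(v)]
    if len(clean) < 2:
        return math.nan  # not enough valid data to estimate a variance
    mean = statistics.fmean(clean)
    var = statistics.variance(clean)  # sample variance (ddof = 1)
    if var == 0.0:
        return math.nan
    return (mean - pop_mean) / math.sqrt(var / len(clean))


t = t_stat_1samp([1.0, 2.0, 3.0, 4.0, 5.0, math.nan, None, math.inf], pop_mean=2.0)
```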

Source code in python/polars_ds/exprs/stats.py

def ttest_1samp(
    var1: str | pl.Expr, pop_mean: float, alternative: Alternative = "two-sided"
) -> pl.Expr:
    """
    Performs a standard 1 sample t test using reference column and expected mean. This function
    sanitizes the self column first. The df is the count of valid values.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        Variable 1
    pop_mean
        The expected population mean in the hypothesis test
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    """
    y = to_expr(var1)
    s1 = y.filter(y.is_finite())
    sm = s1.mean()
    pm = pl.lit(pop_mean, dtype=pl.Float64)
    var = s1.var()
    cnt = s1.len().cast(pl.UInt64)
    alt = pl.lit(alternative, dtype=pl.String)
    return pl_plugin(
        symbol="pl_ttest_1samp",
        args=[sm, pm, var, cnt, alt],
        returns_scalar=True,
    )

`ttest_ind(var1, var2, alternative='two-sided', equal_var=False)`

Performs 2 sample student's t test or Welch's t test. Functionality-wise this is designed to be equivalent to SciPy's ttest_ind, with fewer options. The result is not bit-for-bit identical but agrees with SciPy's to within 1e-10.

In the case of student's t test, the data is assumed to have no nulls, and n = expr.count() is used. Note expr.count() only counts non-null elements after polars 0.20. The degree of freedom will be 2n - 2. As a result, nulls might cause problems.

In the case of Welch's t test, data will be sanitized (nulls, NaNs, Infs will be dropped before the test), and df will be counted based on the length of sanitized data.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

1. There is no valid value in the inputs, or the mean is inf.
2. Input variable has length 0 after removing non-finite values.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	Variable 1	required
`var2`	`str \| Expr`	Variable 2	required
`alternative`	`('two-sided', 'less', 'greater')`	Alternative of the hypothesis test	`"two-sided"`
`equal_var`	`bool`	If true, perform standard student t 2 sample test. Otherwise, perform Welch's t test.	`False`

Examples:

Same length, equal variance comparisons.

>>> df.select(pds.ttest_ind("x1", "x2", equal_var=True))

Potentially unequal length, unequal variance.

>>> df.select(
...     pds.ttest_ind(
...         pl.col("x1").filter(condition_A), pl.col("x1").filter(condition_B), equal_var=False
...     )
... )
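To make the difference between the two branches concrete, the sketch below computes both statistics in plain Python using the textbook formulas. `student_t` and `welch_t` are hypothetical helpers; they assume already sanitized inputs and, for the Student case, equal sample sizes, matching the `2n - 2` degrees of freedom described above. Whether the plugin evaluates exactly these expressions internally is an assumption.

```python
import math
import statistics


def student_t(a, b):
    """Pooled-variance t statistic for two samples of equal length n; df = 2n - 2."""
    n = len(a)
    pooled = (statistics.variance(a) + statistics.variance(b)) / 2.0
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled * 2.0 / n)
    return t, 2 * n - 2


def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_student, df_student = student_t(x1, x2)
t_welch, df_welch = welch_t(x1, x2)
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, which is why the choice of `equal_var` mainly affects the p-value in that case.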

Source code in python/polars_ds/exprs/stats.py

def ttest_ind(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alternative: Alternative = "two-sided",
    equal_var: bool = False,
) -> pl.Expr:
    """
    Performs 2 sample student's t test or Welch's t test. Functionality-wise this is designed
    to be equivalent to SciPy's ttest_ind, with fewer options. The result is not exact but
    within 1e-10 precision from SciPy's.

    In the case of student's t test, the data is assumed to have no nulls, and n = expr.count()
    is used. Note expr.count() only counts non-null elements after polars 0.20.
    The degree of freedom will be 2n - 2. As a result, nulls might cause problems.

    In the case of Welch's t test, data will be sanitized (nulls, NaNs, Infs will be dropped
    before the test), and df will be counted based on the length of sanitized data.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        Variable 1
    var2
        Variable 2
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    equal_var
        If true, perform standard student t 2 sample test. Otherwise, perform Welch's
        t test.

    Examples
    --------
    Same length, equal variance comparisons.
    >>> df.select(pds.ttest_ind("x1", "x2", equal_var=True))

    Potentially unequal length, unequal variance.
    >>> df.select(
    ...     pds.ttest_ind(
    ...         pl.col("x1").filter(condition_A), pl.col("x1").filter(condition_B), equal_var=False
    ...     )
    ... )
    """
    y1, y2 = to_expr(var1), to_expr(var2)
    if equal_var:
        m1 = y1.mean()
        m2 = y2.mean()
        v1 = y1.var()
        v2 = y2.var()
        cnt = y1.count().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_ttest_2samp",
            args=[m1, m2, v1, v2, cnt, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
    else:
        s1 = y1.filter(y1.is_finite())
        s2 = y2.filter(y2.is_finite())
        m1 = s1.mean()
        m2 = s2.mean()
        v1 = s1.var()
        v2 = s2.var()
        n1 = s1.len().cast(pl.UInt64)
        n2 = s2.len().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_welch_t",
            args=[m1, m2, v1, v2, n1, n2, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )

`ttest_ind_from_stats(var1, mean, var, cnt, alternative='two-sided', equal_var=False)`

Performs 2 sample student's t test or Welch's t test, using only scalar statistics (mean, variance, and count) of the second sample. This is more suitable for t-tests between rolling data and some other fixed data, from which you can compute the mean, var, and count only once.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

1. There is no valid value in the inputs, or the mean is inf.
2. Input variable has length 0 after removing non-finite values.

Parameters:

Name	Type	Description	Default
`var1`	`str \| Expr`	The variable 1	required
`mean`	`float`	The mean of var2	required
`var`	`float`	The var of var2	required
`cnt`	`int`	The count of var2, used only in Welch's t test	required
`alternative`	`('two-sided', 'less', 'greater')`	Alternative of the hypothesis test	`"two-sided"`
`equal_var`	`bool`	If true, perform standard student t 2 sample test. Otherwise, perform Welch's t test.	`False`
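The rolling use case can be sketched in plain Python: the reference sample's statistics are computed once and reused for every window. `welch_t_from_stats` is a hypothetical helper that returns only the Welch statistic, and the window size and data are made up for illustration.

```python
import math
import statistics


def welch_t_from_stats(sample, mean2, var2, cnt2):
    """Welch's t statistic of `sample` against precomputed (mean, var, count)."""
    clean = [v for v in sample if math.isfinite(v)]
    m1, v1, n1 = statistics.fmean(clean), statistics.variance(clean), len(clean)
    return (m1 - mean2) / math.sqrt(v1 / n1 + var2 / cnt2)


# Fixed reference data: its statistics are computed a single time.
reference = [2.0, 4.0, 6.0, 8.0, 10.0]
ref_mean = statistics.fmean(reference)
ref_var = statistics.variance(reference)
ref_cnt = len(reference)

# Rolling windows over a series, each compared against the same reference.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
window = 5
rolling_t = [
    welch_t_from_stats(series[i : i + window], ref_mean, ref_var, ref_cnt)
    for i in range(len(series) - window + 1)
]
```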

Source code in python/polars_ds/exprs/stats.py

def ttest_ind_from_stats(
    var1: str | pl.Expr,
    mean: float,
    var: float,
    cnt: int,
    alternative: Alternative = "two-sided",
    equal_var: bool = False,
) -> pl.Expr:
    """
    Performs 2 sample student's t test or Welch's t test, using only scalar statistics from other.
    This is more suitable for t-tests between rolling data and some other fixed data, from which you
    can compute the mean, var, and count only once.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        The variable 1
    mean
        The mean of var2
    var
        The var of var2
    cnt
        The count of var2, used only in welch's t test
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    equal_var
        If true, perform standard student t 2 sample test. Otherwise, perform Welch's
        t test.
    """
    y = to_expr(var1)
    if equal_var:
        m1 = y.mean()
        m2 = pl.lit(mean, pl.Float64)
        v1 = y.var()
        v2 = pl.lit(var, pl.Float64)
        cnt = y.count().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_ttest_2samp",
            args=[m1, m2, v1, v2, cnt, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
    else:
        s1 = y.filter(y.is_finite())
        m1 = s1.mean()
        m2 = pl.lit(mean, pl.Float64)
        v1 = s1.var()
        v2 = pl.lit(var, pl.Float64)
        n1 = s1.len().cast(pl.UInt64)
        n2 = pl.lit(cnt, pl.UInt64)
        return pl_plugin(
            symbol="pl_welch_t",
            args=[m1, m2, v1, v2, n1, n2, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )

`weighted_corr(x, y, weights)`

Computes the weighted correlation between x and y. The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as x and y.	required

Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
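The formula in the reference can be written out in plain Python as a check on small inputs. `weighted_corr_sketch` is a hypothetical helper operating on lists; like the library function, it does not validate the weights.

```python
import math


def weighted_corr_sketch(x, y, w):
    """Weighted Pearson correlation of two equal-length lists (hypothetical helper)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / math.sqrt(sxx * syy)


# With equal weights this reduces to the ordinary Pearson correlation.
r_equal = weighted_corr_sketch([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0], [1.0] * 4)
# An exact linear relationship gives 1 regardless of the (positive) weights.
r_linear = weighted_corr_sketch([1.0, 2.0, 3.0], [3.0, 5.0, 7.0], [0.2, 1.0, 5.0])
```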

Source code in python/polars_ds/exprs/stats.py

def weighted_corr(x: str | pl.Expr, y: str | pl.Expr, weights: str | pl.Expr) -> pl.Expr:
    """
    Computes the weighted correlation between x and y. The weights column must have the same
    length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of same length as var.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy = to_expr(x), to_expr(y)
    w = to_expr(weights)
    numerator = w.dot((xx - weighted_mean(xx, w, False)) * (yy - weighted_mean(yy, w, False)))
    sxx = w.dot((xx - weighted_mean(xx, w, False)).pow(2))
    syy = w.dot((yy - weighted_mean(yy, w, False)).pow(2))
    return numerator / (sxx * syy).sqrt()

`weighted_cosine_sim(x, y, weights)`

Computes the weighted cosine similarity between x and y (column-wise). The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as x and y.	required

Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
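A plain-Python sketch of the quantity being computed, sum(w·x·y) / sqrt(sum(w·x²) · sum(w·y²)), may help when checking results by hand. `weighted_cosine_sim_sketch` is a hypothetical helper.

```python
import math


def weighted_cosine_sim_sketch(x, y, w):
    """Weighted cosine similarity of two equal-length lists (hypothetical helper)."""
    wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    wx2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    wy2 = sum(wi * yi * yi for wi, yi in zip(w, y))
    return wxy / math.sqrt(wx2 * wy2)


parallel = weighted_cosine_sim_sketch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
orthogonal = weighted_cosine_sim_sketch([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
# Up-weighting the coordinate where x is zero lowers the similarity from 1/sqrt(2) to 0.5.
reweighted = weighted_cosine_sim_sketch([1.0, 0.0], [1.0, 1.0], [1.0, 3.0])
```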

Source code in python/polars_ds/exprs/stats.py

def weighted_cosine_sim(x: str | pl.Expr, y: str | pl.Expr, weights: str | pl.Expr) -> pl.Expr:
    """
    Computes the weighted cosine similarity between x and y (column-wise). The weights column
    must have the same length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of same length as var.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy = to_expr(x), to_expr(y)
    w = to_expr(weights)
    wx2 = xx.pow(2).dot(w)
    wy2 = yy.pow(2).dot(w)
    return (w * xx).dot(yy) / (wx2 * wy2).sqrt()

`weighted_cov(x, y, weights)`

Computes the weighted covariance between x and y. The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`weights`	`Expr \| float`	An expr representing weights. Must be of same length as x and y.	required

Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
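The source divides the weighted sum of cross-deviations by the sum of the weights, which is the biased (population-style) weighted covariance. A small plain-Python sketch of that formula, with `weighted_cov_sketch` as a hypothetical name:

```python
def weighted_cov_sketch(x, y, w):
    """Weighted covariance: sum(w * (x - mx) * (y - my)) / sum(w) (hypothetical helper)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    return sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw


cov_xy = weighted_cov_sketch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 2.0])
# The covariance of a variable with itself is its (biased) weighted variance.
cov_xx = weighted_cov_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
```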

Source code in python/polars_ds/exprs/stats.py

def weighted_cov(x: str | pl.Expr, y: str | pl.Expr, weights: pl.Expr | float) -> pl.Expr:
    """
    Computes the weighted covariance between x and y. The weights column must have the same
    length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of same length as var.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy, w = to_expr(x), to_expr(y), to_expr(weights)
    wx, wy = weighted_mean(xx, w, False), weighted_mean(yy, w, False)
    return w.dot((xx - wx) * (yy - wy)) / w.sum()

`weighted_gmean(var, weights, is_normalized=False)`

Computes the weighted geometric mean of the variable.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as var.	required
`is_normalized`	`bool`	If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights	`False`
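The weighted geometric mean is exp(sum(w · ln x) / sum(w)). The sketch below mirrors that formula and the `is_normalized` switch in plain Python; `weighted_gmean_sketch` is a hypothetical helper and assumes strictly positive values.

```python
import math


def weighted_gmean_sketch(x, w, is_normalized=False):
    """Weighted geometric mean of positive values (hypothetical helper)."""
    log_sum = sum(wi * math.log(xi) for wi, xi in zip(w, x))
    if not is_normalized:
        log_sum /= sum(w)  # normalize so the weights effectively sum to 1
    return math.exp(log_sum)


g_equal = weighted_gmean_sketch([1.0, 4.0], [1.0, 1.0])
g_raw = weighted_gmean_sketch([2.0, 8.0], [3.0, 1.0])
g_norm = weighted_gmean_sketch([2.0, 8.0], [0.75, 0.25], is_normalized=True)
```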

Source code in python/polars_ds/exprs/stats.py

def weighted_gmean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted geometric mean of the variable.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    x, w = to_expr(var), to_expr(weights)
    if is_normalized:
        return (x.ln().dot(w)).exp()
    else:
        return (x.ln().dot(w) / (w.sum())).exp()

`weighted_hmean(var, weights, is_normalized=False)`

Computes the weighted harmonic mean of the variable.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as var.	required
`is_normalized`	`bool`	If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights	`False`
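For reference, the textbook weighted harmonic mean is sum(w) / sum(w / x). A plain-Python sketch of that definition, with `weighted_hmean_sketch` as a hypothetical helper that assumes nonzero values:

```python
def weighted_hmean_sketch(x, w, is_normalized=False):
    """Weighted harmonic mean: sum(w) / sum(w / x) (hypothetical helper)."""
    recip = sum(wi / xi for wi, xi in zip(w, x))  # weighted sum of reciprocals
    total = 1.0 if is_normalized else sum(w)
    return total / recip


# Classic example: equal distances travelled at 40 and 60 give an average speed of 48.
h_equal = weighted_hmean_sketch([40.0, 60.0], [1.0, 1.0])
h_weighted = weighted_hmean_sketch([1.0, 2.0], [2.0, 1.0])
```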

Source code in python/polars_ds/exprs/stats.py

def weighted_hmean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted harmonic mean of the variable.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    w = to_expr(weights)
    x = to_expr(var)
    # Weighted sum of reciprocals, sum(w_i / x_i); x.dot(1 / x) would ignore the weights.
    dot = w.dot(pl.lit(1.0, dtype=pl.Float32) / x)
    if is_normalized:
        return 1.0 / dot
    else:
        return 1.0 / (dot / w.sum())

`weighted_mean(var, weights, is_normalized=False)`

Computes the weighted mean, where weights is an expr representing a weight column. The weights column must have the same length as var.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as var.	required
`is_normalized`	`bool`	If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights	`False`
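The effect of `is_normalized` is easiest to see on a tiny example. The plain-Python sketch below mirrors the logic of the source (dot product, then an optional division by the weight sum); `weighted_mean_sketch` is a hypothetical name.

```python
def weighted_mean_sketch(x, w, is_normalized=False):
    """Weighted mean of a list (hypothetical helper)."""
    out = sum(xi * wi for xi, wi in zip(x, w))
    if is_normalized:
        return out  # caller promises that the weights already sum to 1
    return out / sum(w)


m_raw = weighted_mean_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
m_norm = weighted_mean_sketch([1.0, 2.0, 3.0], [0.25, 0.25, 0.5], is_normalized=True)
```

Passing `is_normalized=True` with weights that do not sum to 1 silently returns a scaled result, so the flag is best reserved for weights that were normalized upstream.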

Source code in python/polars_ds/exprs/stats.py

def weighted_mean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted mean, where weights is an expr representing
    a weight column. The weights column must have the same length as var.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    x, w = to_expr(var), to_expr(weights)
    out = x.dot(w)
    if is_normalized:
        return out
    return out / w.sum()

`weighted_var(var, weights, freq_weights=False)`

Computes the weighted variance. The weights column must have the same length as var.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name	Type	Description	Default
`var`	`str \| Expr`	The variable	required
`weights`	`str \| Expr`	An expr representing weights. Must be of same length as var.	required
`freq_weights`	`bool`	Whether to follow the formula for frequency weights or other types of weights. See reference for detail. If true, this assumes frequency weights are NOT normalized. If false, the weighted sample variance is biased. See reference for more info.	`False`

Reference

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance
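The two denominators can be compared in a short plain-Python sketch. `weighted_var_sketch` is a hypothetical helper following the same two formulas as the source. With integer frequency weights, the `freq_weights=True` result matches the ordinary sample variance of the data with each value repeated according to its weight.

```python
import statistics


def weighted_var_sketch(x, w, freq_weights=False):
    """Weighted variance with either denominator (hypothetical helper)."""
    sw = sum(w)
    wm = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean
    summand = sum(wi * (xi - wm) ** 2 for wi, xi in zip(w, x))
    if freq_weights:
        return summand / (sw - 1)  # unbiased for unnormalized frequency weights
    return summand / sw  # biased estimator


x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 2.0]
v_biased = weighted_var_sketch(x, w)
v_freq = weighted_var_sketch(x, w, freq_weights=True)
# Expanding the frequencies by hand: the value 3.0 occurs twice.
v_expanded = statistics.variance([1.0, 2.0, 3.0, 3.0])
```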

Source code in python/polars_ds/exprs/stats.py

def weighted_var(var: str | pl.Expr, weights: str | pl.Expr, freq_weights: bool = False) -> pl.Expr:
    """
    Computes the weighted variance. The weights column must have the same length as var.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    freq_weights
        Whether to follow the formula for frequency weights or other types of weights. See reference
        for detail. If true, this assumes frequency weights are NOT normalized. If false, the
        weighted sample variance is biased. See reference for more info.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance
    """
    x, w = to_expr(var), to_expr(weights)
    wm = weighted_mean(x, w, False)
    summand = w.dot((x - wm).pow(2))
    if freq_weights:
        return summand / (w.sum() - 1)
    return summand / w.sum()

`winsorize(x, q_low=0.05, q_high=0.95, method='nearest')`

Winsorize the data by clipping by percentiles at the lower and upper ends.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`q_low`	`float`	The lower quantile, a fraction in (0, 1), at which to clip the data. E.g. everything < x.quantile(q_low) will be mapped to x.quantile(q_low)	`0.05`
`q_high`	`float`	The upper quantile, a fraction in (0, 1) and greater than `q_low`, at which to clip the data. E.g. everything > x.quantile(q_high) will be mapped to x.quantile(q_high)	`0.95`
`method`	`QuantileMethod`	Method for quantile estimate. One of "nearest", "higher", "lower", "midpoint", "linear".	`'nearest'`
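A plain-Python sketch of the clipping step is shown below. `winsorize_sketch` is a hypothetical helper, and its quantile rule (the sorted element at index `round(q * (n - 1))`) is a simple stand-in that may not agree with Polars' "nearest" method in every case.

```python
def winsorize_sketch(values, q_low=0.05, q_high=0.95):
    """Clip values to the q_low and q_high quantiles (hypothetical helper)."""
    if not (0.0 < q_low < q_high < 1.0):
        raise ValueError("q_low and q_high must be within (0, 1) and q_high must be > q_low")
    ordered = sorted(values)
    n = len(ordered)
    # Assumed nearest-index quantile rule; Polars' own definition may differ.
    lo = ordered[round(q_low * (n - 1))]
    hi = ordered[round(q_high * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


data = [float(i) for i in range(1, 21)]  # 1.0 .. 20.0
clipped = winsorize_sketch(data)
```

Unlike trimming, winsorizing keeps the number of rows unchanged; only the extreme values are pulled in to the quantile bounds.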

Source code in python/polars_ds/exprs/stats.py

def winsorize(
    x: str | pl.Expr,
    q_low: float = 0.05,
    q_high: float = 0.95,
    method: QuantileMethod = "nearest",
) -> pl.Expr:
    """
    Winsorize the data by clipping by percentiles at the lower and upper ends.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    q_low
        The lower percentile value to clip the data. E.g everything < x.quantile(lower)
        will be mapped to x.quantile(lower)
    q_high
        The upper percentile value to clip the data. E.g everything > x.quantile(upper)
        will be mapped to x.quantile(upper)
    method
        Method for quantile estimate. One of "nearest", "higher", "lower", "midpoint", "linear".
    """
    if q_low <= 0.0 or q_low >= 1.0 or q_high <= 0.0 or q_high >= 1.0 or q_high <= q_low:
        raise ValueError("Lower and upper must be within (0, 1) and upper should be > lower")

    xx = to_expr(x)
    return xx.clip(
        xx.quantile(q_low, interpolation=method), xx.quantile(q_high, interpolation=method)
    )

`xi_corr(x, y, seed=None, return_p=False)`

Computes the ξ (xi) correlation developed by Sourav Chatterjee in the paper in the reference. This returns the correlation (the statistic) and, when `return_p` is true, also the p-value. Note that if the sample size is smaller than 30, the p-value will always be NaN. The ξ correlation is not symmetric, as it only tries to explain whether y is a function of x.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The first variable	required
`y`	`str \| Expr`	The second variable	required
`seed`	`int \| None`	Seed used when ties in x are broken at random	`None`
`return_p`	`bool`	Whether to return a two-sided p value for the statistic	`False`

Reference

https://arxiv.org/pdf/1909.10140.pdf
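The statistic from the referenced paper can be sketched in plain Python. `xi_corr_sketch` is a hypothetical helper: it assumes no ties in x (so no random tie-breaking is needed), uses O(n²) rank counting for clarity, and omits the p-value. The two rank vectors correspond to the `rank(method="max")` of y and of -y seen in the source.

```python
def xi_corr_sketch(x, y):
    """Chatterjee's xi for lists x and y, assuming no ties in x (hypothetical helper)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])  # arrange rows by increasing x
    # r_i: number of y values <= y_i; l_i: number of y values >= y_i
    r = [sum(1 for v in y if v <= y[i]) for i in order]
    l_rank = [sum(1 for v in y if v >= y[i]) for i in order]
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l_rank)
    return 1.0 - num / den


xs = [float(i) for i in range(1, 11)]
xi_increasing = xi_corr_sketch(xs, [2.0 * v + 1.0 for v in xs])
xi_decreasing = xi_corr_sketch(xs, [-v for v in xs])
```

For a strictly monotone relationship without ties the value is 1 - 3 / (n + 1), which approaches 1 as n grows. A decreasing relationship yields the same value as an increasing one, because ξ measures functional dependence rather than direction.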

Source code in python/polars_ds/exprs/stats.py

def xi_corr(
    x: str | pl.Expr, y: str | pl.Expr, seed: int | None = None, return_p: bool = False
) -> pl.Expr:
    """
    Computes the ξ(xi) correlation developed by SOURAV CHATTERJEE in the paper in the reference.
    This will return both the correlation (the statistic) and the p-value. Note that if sample size
    is smaller than 30, p-value will always be NaN. The ξ correlation is not symmetric, as it only
    tries to explain whether y is a function of x.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    seed
        Whether to have a seed when we break ties at random
    return_p
        Whether to return a two-sided p value for the statistic

    Reference
    ---------
    https://arxiv.org/pdf/1909.10140.pdf
    """
    xx, yy = to_expr(x), to_expr(y)
    args = [
        xx.rank(method="random", seed=seed),
        yy.rank(method="max").cast(pl.Float64),
        (-yy).rank(method="max").cast(pl.Float64),
    ]
    if return_p:
        return pl_plugin(
            symbol="pl_xi_corr_w_p",
            args=args,
            returns_scalar=True,
        )
    else:
        return pl_plugin(
            symbol="pl_xi_corr",
            args=args,
            returns_scalar=True,
        )