Statistics Expr

Extension for Statistical Tests and Samples

Simple Statistics.

Functions:

Name Description
add_noise

Adds some noise to the column.

bicor

Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.

chi2

Computes the Chi Squared statistic and p value between two categorical values.

corr

A convenience function for calling different types of correlations.

cosine_sim

Column-and-column cosine similarity

f_test

Performs the ANOVA F-test.

gmean

Computes the geometric mean of the variable.

hmean

Computes the harmonic mean.

jitter

Adds a Gaussian noise of N(0, std) to the column.

kendall_tau

Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.

ks_2samp

Computes two-sided KS statistics between var1 and var2.

mann_whitney_u

Computes the Mann-Whitney U statistic and the p-value.

normal_test

Performs a normality test based on D'Agostino and Pearson's test, which combines skew and kurtosis to produce an omnibus test of normality.

perturb

Perturb the var by a small amount. This only applies to float columns.

random

Generate random numbers in [lower, upper)

random_binomial

Generates random integer following a binomial distribution.

random_exp

Generates random numbers following an exponential distribution.

random_int

Generates random integer between lower and upper.

random_normal

Generates random number following a normal distribution.

random_null

Creates random null values in the column.

random_str

Generates random strings of length between min_size and max_size.

ttest_1samp

Performs a standard 1 sample t test using reference column and expected mean.

ttest_ind

Performs a 2 sample Student's t test or Welch's t test.

ttest_ind_from_stats

Performs 2 sample student's t test or Welch's t test, using only scalar statistics from other.

weighted_corr

Computes the weighted correlation between x and y.

weighted_cosine_sim

Computes the weighted cosine similarity between x and y (column-wise).

weighted_cov

Computes the weighted covariance between x and y.

weighted_gmean

Computes the weighted geometric mean of the variable.

weighted_hmean

Computes the weighted harmonic mean of the variable.

weighted_mean

Computes the weighted mean.

weighted_var

Computes the weighted variance. The weights column must have the same length as var.

xi_corr

Computes the ξ (xi) correlation developed by Sourav Chatterjee in the paper in the reference.

add_noise(x, noise_type='gaussian', **kwargs)

Adds some noise to the column.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
noise_type Noise

Either "gaussian" or "uniform"

'gaussian'
kwargs

If noise_type = "gaussian", this accepts kwargs to "jitter" and if "uniform", this accepts kwargs to "perturb". You may set a seed via the kwargs.

{}
Source code in python/polars_ds/exprs/stats.py
def add_noise(x: str | pl.Expr, noise_type: Noise = "gaussian", **kwargs) -> pl.Expr:
    """
    Adds some noise to the column.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    noise_type
        Either "gaussian" or "uniform"
    kwargs
        If noise_type = "gaussian", this accepts kwargs to "jitter" and if "uniform", this
        accepts kwargs to "perturb". You may set a seed via the kwargs.
    """
    if noise_type == "gaussian":
        return jitter(x, **kwargs)
    elif noise_type == "uniform":
        return perturb(x, **kwargs)
    else:
        raise ValueError(f"The noise_type {noise_type} is not currently supported.")

bicor(x, y, c=9.0)

Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.

Performance hint: this expression benefits from .lazy() a lot.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
c float

Biweight tuning constant which is typically 9

9.0
Reference

https://en.wikipedia.org/wiki/Biweight_midcorrelation

Source code in python/polars_ds/exprs/stats.py
def bicor(x: str | pl.Expr, y: str | pl.Expr, c: float = 9.0) -> pl.Expr:
    """
    Computes the Biweight Midcorrelation between x and y. This is commonly referred to as bicor.

    Performance hint: this expression benefits from .lazy() a lot.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    c
        Biweight tuning constant which is typically 9

    Reference
    ---------
    https://en.wikipedia.org/wiki/Biweight_midcorrelation
    """
    a, b = to_expr(x), to_expr(y)
    med_a = a.median()
    med_b = b.median()

    diff_a = a - med_a
    diff_b = b - med_b

    ua = diff_a / (c * diff_a.abs().median())
    ub = diff_b / (c * diff_b.abs().median())

    w_a = (1 - ua.pow(2)).pow(2) * ((1 - ua.abs()) > 0).cast(pl.Float64)
    w_b = (1 - ub.pow(2)).pow(2) * ((1 - ub.abs()) > 0).cast(pl.Float64)

    aa = diff_a * w_a
    bb = diff_b * w_b

    return aa.dot(bb) / (aa.dot(aa) * (bb.dot(bb))).sqrt()
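
To make the arithmetic in the expression easier to follow, here is a minimal plain-Python sketch of the same biweight midcorrelation formula. The name `bicor_reference` is hypothetical and is not part of the package; the sketch assumes both inputs have the same length, contain no nulls, and have a nonzero median absolute deviation.

```python
import math
from statistics import median


def bicor_reference(x, y, c=9.0):
    """Plain-Python biweight midcorrelation (illustrative helper, not the package API)."""

    def weighted_deviations(values):
        med = median(values)
        diffs = [v - med for v in values]
        # Median absolute deviation; assumed nonzero in this sketch.
        mad = median([abs(d) for d in diffs])
        out = []
        for d in diffs:
            u = d / (c * mad)
            # Biweight: (1 - u^2)^2 inside |u| < 1, zero outside.
            w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
            out.append(d * w)
        return out

    a = weighted_deviations(x)
    b = weighted_deviations(y)
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return num / den
```

For perfectly linear data such as `y = 2 * x`, the weighted deviations of `y` are a constant multiple of those of `x`, so the result is 1 up to floating-point error.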

chi2(var1, var2, return_full=False)

Computes the Chi Squared statistic and p value between two categorical values.

Note that it is up to the user to make sure that the two columns contain categorical values. This method is equivalent to SciPy's chi2_contingency, except that it also computes the contingency table internally for the user.

Parameters:

Name Type Description Default
var1 str | Expr

Either the name of the column or a Polars expression

required
var2 str | Expr

Either the name of the column or a Polars expression

required
return_full bool

If true, dof and expected frequency will also be returned. The returned "struct" will not be a scalar anymore, but has length = length of expected frequencies.

False
Source code in python/polars_ds/exprs/stats.py
def chi2(var1: str | pl.Expr, var2: str | pl.Expr, return_full: bool = False) -> pl.Expr:
    """
    Computes the Chi Squared statistic and p value between two categorical values.

    Note that it is up to the user to make sure that the two columns contain categorical
    values. This method is equivalent to SciPy's chi2_contingency, except that it also
    computes the contingency table internally for the user.

    Parameters
    ----------
    var1
        Either the name of the column or a Polars expression
    var2
        Either the name of the column or a Polars expression
    return_full
        If true, dof and expected frequency will also be returned. The returned "struct"
        will not be a scalar anymore, but has length = length of expected frequencies.
    """
    if return_full:
        return pl_plugin(
            symbol="pl_chi2_full", args=[to_expr(var1), to_expr(var2)], changes_length=True
        )
    else:
        return pl_plugin(
            symbol="pl_chi2",
            args=[to_expr(var1), to_expr(var2)],
            returns_scalar=True,
        )
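
As a rough illustration of what building the contingency table internally means, the following plain-Python sketch counts the observed pairs, derives the expected frequencies from the marginals, and sums the Pearson chi-squared terms. The helper name `chi2_reference` is hypothetical. The sketch applies no continuity correction and does not compute the p-value, so it is not a drop-in replacement for the plugin or for SciPy.

```python
from collections import Counter


def chi2_reference(col1, col2):
    """Pearson chi-squared statistic and degrees of freedom (illustrative sketch)."""
    n = len(col1)
    observed = Counter(zip(col1, col2))  # contingency table as (row, col) -> count
    row_totals = Counter(col1)
    col_totals = Counter(col2)

    stat = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            # Counter returns 0 for combinations that never occur.
            stat += (observed[(r, c)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof
```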

corr(x, y, method='pearson')

A convenience function for calling different types of correlations. Pearson and Spearman correlations run on Polars' native expressions, while Kendall, Xi, and bicor correlations run on code in this package.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
method CorrMethod

One of ["pearson", "spearman", "xi", "kendall", "bicor"]

'pearson'

Source code in python/polars_ds/exprs/stats.py
def corr(x: str | pl.Expr, y: str | pl.Expr, method: CorrMethod = "pearson") -> pl.Expr:
    """
    A convenience function for calling different types of correlations. Pearson and Spearman correlations
    run on Polars' native expressions, while Kendall, Xi, and bicor correlations run on code in this package.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    method
        One of ["pearson", "spearman", "xi", "kendall", "bicor"]
    """
    if method in ["pearson", "spearman"]:
        return pl.corr(x, y, method=method)
    elif method == "xi":
        return xi_corr(x, y)
    elif method == "kendall":
        return kendall_tau(x, y)
    elif method == "bicor":
        return bicor(x, y)
    else:
        raise ValueError(f"Unknown correlation method: {method}.")

cosine_sim(x, y)

Column-and-column cosine similarity

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
Source code in python/polars_ds/exprs/stats.py
def cosine_sim(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Column-and-column cosine similarity

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    """
    xx, yy = to_expr(x), to_expr(y)
    x2 = xx.dot(xx).sqrt()
    y2 = yy.dot(yy).sqrt()
    # x2 and y2 are already the Euclidean norms, so no further square root is needed.
    return xx.dot(yy) / (x2 * y2)
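
Cosine similarity is the dot product of the two columns divided by the product of their Euclidean norms. A minimal plain-Python sketch of that definition follows; `cosine_sim_reference` is a hypothetical helper used only for illustration, and it assumes equal-length inputs with no nulls and nonzero norms.

```python
import math


def cosine_sim_reference(x, y):
    """Cosine similarity of two equal-length sequences (illustrative sketch)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Orthogonal vectors give 0, and vectors pointing in the same direction give 1 regardless of their lengths.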

f_test(*variables, group)

Performs the ANOVA F-test.

Parameters:

Name Type Description Default
variables str | Expr

The columns (variables) to run ANOVA F-test on

()
group str | Expr

The "target" column used to group the variables

required
Source code in python/polars_ds/exprs/stats.py
def f_test(*variables: str | pl.Expr, group: str | pl.Expr) -> pl.Expr:
    """
    Performs the ANOVA F-test.

    Parameters
    ----------
    variables
        The columns (variables) to run ANOVA F-test on
    group
        The "target" column used to group the variables
    """
    vars_ = [to_expr(group)]
    vars_.extend(to_expr(x) for x in variables)
    if len(vars_) <= 1:
        raise ValueError("No input feature column to run F-test on.")
    elif len(vars_) == 2:
        return pl_plugin(symbol="pl_f_test", args=vars_, returns_scalar=True)
    else:
        return pl_plugin(symbol="pl_f_test", args=vars_, changes_length=True)
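
The plugin itself is not shown on this page. As a hedged sketch, assuming it computes the standard one-way ANOVA F statistic, the statistic for a single variable can be written in plain Python as below. The helper name `f_statistic` is hypothetical, and the p-value is left out.

```python
def f_statistic(values, groups):
    """One-way ANOVA F statistic for one variable (illustrative sketch)."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    n = len(values)
    k = len(by_group)
    grand_mean = sum(values) / n

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in by_group.values() for v in vs
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The sketch needs at least two groups and more observations than groups; otherwise the denominators are zero.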

gmean(var)

Computes the geometric mean of the variable.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
Source code in python/polars_ds/exprs/stats.py
def gmean(var: str | pl.Expr) -> pl.Expr:
    """
    Computes the geometric mean of the variable.

    Parameters
    ----------
    var
        The variable
    """
    return to_expr(var).ln().mean().exp()
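
The expression uses the identity that the geometric mean equals the exponential of the mean of the logarithms. A small plain-Python sketch of the same identity, checked against the standard library, is shown here; `gmean_reference` is a hypothetical name and the inputs are assumed to be strictly positive.

```python
import math
from statistics import geometric_mean


def gmean_reference(values):
    """Geometric mean as exp(mean(log(x))) (illustrative sketch)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# The standard library agrees up to floating-point error.
sample = [2.0, 8.0]
difference = abs(gmean_reference(sample) - geometric_mean(sample))
```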

hmean(var)

Computes the harmonic mean.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
Source code in python/polars_ds/exprs/stats.py
def hmean(var: str | pl.Expr) -> pl.Expr:
    """
    Computes the harmonic mean.

    Parameters
    ----------
    var
        The variable
    """
    x = to_expr(var)
    return x.count() / (1.0 / x).sum()
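
The harmonic mean is the number of values divided by the sum of their reciprocals. The sketch below mirrors that in plain Python; `hmean_reference` is a hypothetical helper. It skips `None` entries, on the assumption that this matches how the expression's `count()` and `sum()` treat nulls.

```python
def hmean_reference(values):
    """Harmonic mean: n / sum(1 / x) over non-missing values (illustrative sketch)."""
    present = [v for v in values if v is not None]
    return len(present) / sum(1.0 / v for v in present)
```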

jitter(x, std=1.0, seed=None)

Adds a Gaussian noise of N(0, std) to the column.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
std float | Expr

The std of the Gaussian noise.

1.0
seed int | None

A random seed

None
Source code in python/polars_ds/exprs/stats.py
def jitter(x: str | pl.Expr, std: float | pl.Expr = 1.0, seed: int | None = None) -> pl.Expr:
    """
    Adds a Gaussian noise of N(0, std) to the column.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    std
        The std of the Gaussian noise.
    seed
        A random seed
    """
    if isinstance(std, float):
        if std < 0:
            raise ValueError("Standard deviation must be non-negative.")
        elif std == 0:
            return to_expr(x)

        s = pl.lit(std, dtype=pl.Float64)
    else:
        s = std.cast(pl.Float64)

    return pl_plugin(
        symbol="pl_jitter", args=[to_expr(x), s, pl.lit(seed, dtype=pl.UInt64)], is_elementwise=True
    )

kendall_tau(x, y)

Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.

Note: this maps NaN to null and drops all rows with null. Inf is kept and considered the largest value, and multiple Infs are treated as equal. -Inf is the smallest value if it exists in the data. NaN is returned if the data has fewer than 2 rows after nulls are dropped.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
Source code in python/polars_ds/exprs/stats.py
def kendall_tau(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Computes Kendall's Tau (b) correlation between x and y. This automatically drops rows with null.

    Note: this maps NaN to null and drops all rows with null. Inf is kept and considered
    the largest value, and multiple Infs are treated as equal. -Inf is the smallest value if it
    exists in the data. NaN is returned if the data has fewer than 2 rows after nulls are dropped.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    """
    xx, yy = to_expr(x).fill_nan(None), to_expr(y).fill_nan(None)
    return pl_plugin(
        symbol="pl_kendall_tau",
        args=[xx.rank(method="min"), yy.rank(method="min")],
        returns_scalar=True,
    )
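
For readers who want to check small cases by hand, here is a quadratic-time plain-Python sketch of Kendall's Tau-b with the same missing-value rules as described in the note. The names `kendall_tau_b` and `_is_missing` are hypothetical, and the plugin's actual algorithm is not shown on this page and is likely different.

```python
import math


def _is_missing(v):
    # None and NaN are both treated as missing and dropped.
    return v is None or (isinstance(v, float) and math.isnan(v))


def kendall_tau_b(x, y):
    """O(n^2) Kendall's Tau-b (illustrative sketch)."""
    pairs = [(a, b) for a, b in zip(x, y) if not (_is_missing(a) or _is_missing(b))]
    n = len(pairs)
    if n < 2:
        return math.nan

    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Use comparisons rather than subtraction so that inf == inf is a tie.
            sx = (pairs[i][0] > pairs[j][0]) - (pairs[i][0] < pairs[j][0])
            sy = (pairs[i][1] > pairs[j][1]) - (pairs[i][1] < pairs[j][1])
            if sx == 0:
                ties_x += 1
            if sy == 0:
                ties_y += 1
            if sx * sy > 0:
                concordant += 1
            elif sx * sy < 0:
                discordant += 1

    n0 = n * (n - 1) // 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    return (concordant - discordant) / denom if denom > 0 else math.nan
```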

ks_2samp(var1, var2, alpha=0.05, is_binary=False)

Computes two-sided KS statistics between var1 and var2. This will sanitize data (only non-null finite values are used) before doing the computation. If is_binary is true, it will compare the statistics by comparing var2(var1=0) and var2(var1=1).

Note: this returns a statistic and a threshold value. The threshold is not the p-value; rather, if the statistic is greater than the threshold, the null hypothesis should be rejected. This is suitable only for large sample sizes. See the reference for more details.

If either var1 or var2 has fewer than 30 values, a KS statistic of 0 with a NaN threshold is returned.

Parameters:

Name Type Description Default
var1 str | Expr

Variable 1

required
var2 str | Expr

Variable 2

required
alpha float

The significance level used to compute the rejection threshold

0.05
is_binary bool

If true, instead of running ks(var1, var2), it runs ks(var2(var1=0), var2(var1=1))

False
Reference

https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test

Source code in python/polars_ds/exprs/stats.py
def ks_2samp(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alpha: float = 0.05,
    is_binary: bool = False,
) -> pl.Expr:
    """
    Computes two-sided KS statistics between var1 and var2. This will
    sanitize data (only non-null finite values are used) before doing the computation. If
    is_binary is true, it will compare the statistics by comparing var2(var1=0) and var2(var1=1).

    Note: this returns a statistic and a threshold value. The threshold is not the p-value;
    rather, if the statistic is greater than the threshold, the null hypothesis should be
    rejected. This is suitable only for large sample sizes. See the reference for more details.

    If either var1 or var2 has fewer than 30 values, a KS statistic of 0 with a NaN threshold
    is returned.

    Parameters
    ----------
    var1
        Variable 1
    var2
        Variable 2
    alpha
        The significance level used to compute the rejection threshold
    is_binary
        If true, instead of running ks(var1, var2), it runs ks(var2(var1=0), var2(var1=1))

    Reference
    ---------
    https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov%E2%80%93Smirnov_test
    """
    y1, y2 = to_expr(var1), to_expr(var2)
    if is_binary:
        z1 = y2.filter((y1 == 1) & y2.is_finite()).sort()
        z2 = y2.filter((y1 == 0) & y2.is_finite()).sort()
    else:
        z1 = y1.filter(y1.is_finite()).sort()
        z2 = y2.filter(y2.is_finite()).sort()

    return pl_plugin(
        symbol="pl_ks_2samp",
        args=[z1.cast(pl.Float64), z2.cast(pl.Float64), pl.lit(alpha, pl.Float64)],
        returns_scalar=True,
    )
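
The plugin code is not shown here, so the following is only a sketch under an assumption: that the threshold follows the large-sample formula from the referenced Wikipedia section, `c(alpha) * sqrt((n + m) / (n * m))` with `c(alpha) = sqrt(-ln(alpha / 2) / 2)`. The helper name `ks_2samp_reference` is hypothetical, and the sketch skips the 30-value minimum described above.

```python
import math
from bisect import bisect_right


def ks_2samp_reference(sample1, sample2, alpha=0.05):
    """Two-sample KS statistic and large-sample rejection threshold (illustrative sketch)."""
    # Keep only non-null finite values, then sort.
    a = sorted(v for v in sample1 if v is not None and math.isfinite(v))
    b = sorted(v for v in sample2 if v is not None and math.isfinite(v))
    n, m = len(a), len(b)

    # Largest gap between the two empirical CDFs, evaluated at every observed value.
    statistic = max(
        abs(bisect_right(a, v) / n - bisect_right(b, v) / m) for v in a + b
    )

    # Assumed threshold: reject the null hypothesis when statistic > threshold.
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return statistic, threshold
```

With `alpha = 0.05`, `c(alpha)` is about 1.358, so two samples of 50 values each give a threshold of roughly 0.27.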

mann_whitney_u(var1, var2, alternative='two-sided')

Computes the Mann-Whitney U statistic and the p-value. Note: this function will sanitize data (drop all non-finite values) before computing the statistic. This implementation follows method 2 in reference. This always applies tie correction, which may slow down computation by a little.

WIP. PVALUE NOT DONE YET.

Parameters:

Name Type Description Default
var1 Expr

Either the name of the column or a Polars expression

required
var2 Expr

Either the name of the column or a Polars expression

required
alternative Alternative

The alternative for the test. two-sided, greater or less

'two-sided'
Reference

https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

Source code in python/polars_ds/exprs/stats.py
def mann_whitney_u(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alternative: Alternative = "two-sided",
) -> pl.Expr:
    """
    Computes the Mann-Whitney U statistic and the p-value. Note: this function will sanitize data (drop
    all non-finite values) before computing the statistic. This implementation follows method 2 in reference.
    This always applies tie correction, which may slow down computation by a little.

    WIP. PVALUE NOT DONE YET.

    Parameters
    ----------
    var1 : pl.Expr
        Either the name of the column or a Polars expression
    var2 : pl.Expr
        Either the name of the column or a Polars expression
    alternative: str
        The alternative for the test. `two-sided`, `greater` or `less`

    Reference
    ---------
    https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
    """
    x = to_expr(var1)
    y = to_expr(var2)
    xx = x.filter(x.is_finite())
    yy = y.filter(y.is_finite())
    n1 = xx.len().cast(pl.Float64)
    n2 = yy.len().cast(pl.Float64)

    ranks = (xx.append(yy)).rank()

    u1 = ranks.slice(0, length=xx.len()).sum() - (n1 * (n1 + 1)) / 2
    u2 = (n1 * n2) - u1

    mean = (n1 * n2) / 2
    return pl_plugin(
        symbol="pl_mann_whitney_u",
        args=[u1, u2, mean, ranks.sort(), pl.lit(alternative, dtype=pl.String)],
    )
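
The U statistics are computed from the ranks of the pooled sample before anything is passed to the plugin. A plain-Python sketch of that step follows, using average ranks for ties (the default ranking method in Polars). The helper name `mann_whitney_u_reference` is hypothetical, and the p-value is not computed.

```python
def mann_whitney_u_reference(sample1, sample2):
    """U1 and U2 from pooled average ranks (illustrative sketch)."""
    combined = list(sample1) + list(sample2)
    order = sorted(range(len(combined)), key=combined.__getitem__)

    # Assign 1-based ranks; tied values share the mean of their positions.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and combined[order[j + 1]] == combined[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n1, n2 = len(sample1), len(sample2)
    # The first n1 ranks belong to sample1 because it was placed first.
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return u1, u2
```

When every value in the first sample is smaller than every value in the second, `u1` is 0 and `u2` equals `n1 * n2`.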

normal_test(var)

Perform a normality test which is based on D'Agostino and Pearson's test that combines skew and kurtosis to produce an omnibus test of normality. Null values, NaN and inf are dropped when running this computation.

Parameters:

Name Type Description Default
var str | Expr

Either the name of the column or a Polars expression

required
References

D'Agostino, R. B. (1971), "An omnibus test of normality for moderate and large sample size", Biometrika, 58, 341-348

D'Agostino, R. and Pearson, E. S. (1973), "Tests for departure from normality", Biometrika, 60, 613-622

Source code in python/polars_ds/exprs/stats.py
def normal_test(var: str | pl.Expr) -> pl.Expr:
    """
    Perform a normality test which is based on D'Agostino and Pearson's test
    that combines skew and kurtosis to produce an omnibus test of normality.
    Null values, NaN and inf are dropped when running this computation.

    Parameters
    ----------
    var
        Either the name of the column or a Polars expression

    References
    ----------
    D'Agostino, R. B. (1971), "An omnibus test of normality for
        moderate and large sample size", Biometrika, 58, 341-348
    D'Agostino, R. and Pearson, E. S. (1973), "Tests for departure from
        normality", Biometrika, 60, 613-622
    """
    y = to_expr(var)
    valid: pl.Expr = y.filter(y.is_finite())
    skew = valid.skew()
    # Pearson Kurtosis, see here: https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test
    kur = valid.kurtosis(fisher=False)
    return pl_plugin(
        symbol="pl_normal_test",
        args=[skew, kur, valid.count().cast(pl.UInt32)],
        returns_scalar=True,
    )

perturb(x, epsilon=1e-05, positive=False, seed=None)

Perturb the var by a small amount. This only applies to float columns.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
epsilon float

The small amount to perturb.

1e-05
positive bool

If true, randomly add a small amount in [0, epsilon). If false, it will use the range [-epsilon/2, epsilon/2)

False
seed int | None

A random seed

None
Source code in python/polars_ds/exprs/stats.py
def perturb(
    x: str | pl.Expr, epsilon: float = 1e-5, positive: bool = False, seed: int | None = None
) -> pl.Expr:
    """
    Perturb the var by a small amount. This only applies to float columns.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    epsilon
        The small amount to perturb.
    positive
        If true, randomly add a small amount in [0, epsilon). If false, it will use the range
        [-epsilon/2, epsilon/2)
    seed
        A random seed
    """
    if math.isinf(epsilon) or math.isnan(epsilon):
        raise ValueError("Input `epsilon` should be a valid finite value.")

    ep = abs(epsilon)
    if positive:
        lo = pl.lit(0.0, dtype=pl.Float64)
        hi = pl.lit(ep, dtype=pl.Float64)
    else:
        half = ep / 2
        lo = pl.lit(-half, dtype=pl.Float64)
        hi = pl.lit(half, dtype=pl.Float64)

    return pl_plugin(
        symbol="pl_perturb",
        args=[to_expr(x), lo, hi, pl.lit(seed, dtype=pl.UInt64)],
        is_elementwise=True,
    )
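
The way `epsilon` and `positive` translate into a noise range can be sketched in plain Python. This assumes the plugin adds uniform noise drawn from the `[lo, hi)` range it is given, which matches the "uniform" option of add_noise but is not confirmed by the code shown here. The name `perturb_reference` is hypothetical, and Python's `random.uniform` may return the upper endpoint in rare rounding cases.

```python
import random


def perturb_reference(values, epsilon=1e-5, positive=False, seed=None):
    """Add small uniform noise to each value (illustrative sketch)."""
    ep = abs(epsilon)
    if positive:
        lo, hi = 0.0, ep  # noise in [0, epsilon)
    else:
        lo, hi = -ep / 2, ep / 2  # noise in [-epsilon/2, epsilon/2)

    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    return [v + rng.uniform(lo, hi) for v in values]
```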

random(lower=0.0, upper=1.0, seed=None, len_ref=None)

Generate random numbers in [lower, upper)

Parameters:

Name Type Description Default
lower Expr | float

The lower bound

0.0
upper Expr | float

The upper bound, exclusive

1.0
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random(
    lower: pl.Expr | float = 0.0,
    upper: pl.Expr | float = 1.0,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generate random numbers in [lower, upper)

    Parameters
    ----------
    lower
        The lower bound
    upper
        The upper bound, exclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    lo = pl.lit(lower, pl.Float64) if isinstance(lower, float) else lower
    up = pl.lit(upper, pl.Float64) if isinstance(upper, float) else upper
    len_, is_elementwise = _get_streamable(len_ref)

    return pl_plugin(
        symbol="pl_random",
        args=[len_, lo, up, pl.lit(seed, pl.UInt64)],
        is_elementwise=is_elementwise,
    )

random_binomial(n, p, seed=None, len_ref=None)

Generates random integer following a binomial distribution.

Parameters:

Name Type Description Default
n int

The n in a binomial distribution

required
p float

The p in a binomial distribution. The success rate.

required
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random_binomial(
    n: int, p: float, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random integer following a binomial distribution.

    Parameters
    ----------
    n
        The n in a binomial distribution
    p
        The p in a binomial distribution. The success rate.
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    if n < 1:
        raise ValueError("Input `n` must be >= 1.")
    if p < 0.0 or p > 1.0:
        raise ValueError("Input `p` must be between 0 and 1.")

    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_binomial",
        args=[
            len_,
            pl.lit(n, pl.UInt32),
            pl.lit(p, pl.Float64),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

random_exp(lambda_, seed=None, len_ref=None)

Generates random numbers following an exponential distribution.

Parameters:

Name Type Description Default
lambda_ float

The lambda in an exponential distribution

required
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random_exp(
    lambda_: float, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random numbers following an exponential distribution.

    Parameters
    ----------
    lambda_
        The lambda in an exponential distribution
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_exp",
        args=[
            len_,
            pl.lit(lambda_, pl.Float64),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

random_int(lower, upper, seed=None, len_ref=None)

Generates random integer between lower and upper.

Parameters:

Name Type Description Default
lower int | Expr

The lower bound, inclusive

required
upper int | Expr

The upper bound, exclusive

required
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random_int(
    lower: int | pl.Expr,
    upper: int | pl.Expr,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generates random integer between lower and upper.

    Parameters
    ----------
    lower
        The lower bound, inclusive
    upper
        The upper bound, exclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    if lower == upper:
        raise ValueError("Input `lower` must be smaller than `upper`")

    lo = pl.lit(lower, pl.Int32) if isinstance(lower, int) else lower.cast(pl.Int32)
    hi = pl.lit(upper, pl.Int32) if isinstance(upper, int) else upper.cast(pl.Int32)
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_int",
        args=[
            len_,
            lo,
            hi,
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

random_normal(mean, std, seed=None, len_ref=None)

Generates random number following a normal distribution.

Parameters:

Name Type Description Default
mean Expr | float

The mean in a normal distribution

required
std Expr | float

The std in a normal distribution

required
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random_normal(
    mean: pl.Expr | float,
    std: pl.Expr | float,
    seed: int | None = None,
    len_ref: str | pl.Expr | None = None,
) -> pl.Expr:
    """
    Generates random number following a normal distribution.

    Parameters
    ----------
    mean
        The mean in a normal distribution
    std
        The std in a normal distribution
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_normal",
        args=[
            len_,
            pl.lit(mean, pl.Float64) if isinstance(mean, float) else mean,
            pl.lit(std, pl.Float64) if isinstance(std, float) else std,
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )

random_null(x, pct, seed=None)

Creates random null values in the column. If x contains nulls originally, they will stay null.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
pct float

Percentage of nulls to randomly generate. This percentage is based on the length of the column, so may not be the actual percentage of nulls depending on how many values are originally null.

required
seed int | None

A seed to fix the random numbers. If none, use the system's entropy.

None
Source code in python/polars_ds/exprs/stats.py
def random_null(x: str | pl.Expr, pct: float, seed: int | None = None) -> pl.Expr:
    """
    Creates random null values in the column. If x contains nulls originally, they
    will stay null.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    pct
        Percentage of nulls to randomly generate. This percentage is based on the
        length of the column, so may not be the actual percentage of nulls depending
        on how many values are originally null.
    seed
        A seed to fix the random numbers. If none, use the system's entropy.
    """
    if pct <= 0.0 or pct >= 1.0:
        raise ValueError("Input `pct` must be > 0 and < 1")

    return pl.when(random(0.0, 1.0, seed=seed, len_ref=x) < pct).then(None).otherwise(to_expr(x))
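The masking logic can be sketched in plain Python, without Polars. This is a minimal illustration that assumes each row is nulled independently with probability pct; the helper name `random_null_sketch` is hypothetical and not part of the library.

```python
import random


def random_null_sketch(values, pct, seed=None):
    """Replace each value with None with probability pct (hypothetical helper)."""
    if pct <= 0.0 or pct >= 1.0:
        raise ValueError("Input `pct` must be > 0 and < 1")
    rng = random.Random(seed)
    # One uniform draw per row; rows whose draw falls below pct become None.
    # Values that are already None stay None either way.
    return [None if rng.random() < pct else v for v in values]


data = [1.0, None, 3.0, 4.0, None, 6.0, 7.0, 8.0]
masked = random_null_sketch(data, pct=0.25, seed=42)
```

Because nulls already present are kept, the final share of nulls can be larger than pct.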

random_str(min_size, max_size, seed=None, len_ref=None)

Generates random strings of length between min_size and max_size. The characters are uniformly distributed over ASCII letters and numbers: a-z, A-Z and 0-9.

Parameters:

Name Type Description Default
min_size int

The min size of the string, inclusive

required
max_size int

The max size of the string, inclusive

required
seed int | None

The random seed. None means no seed.

None
len_ref str | Expr | None

Length reference. In normal non-streaming context, this should always be None which means it will always use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column name, e.g. len_ref = 'id' so that the random generator knows the corresponding length of each chunk.

None
Source code in python/polars_ds/exprs/stats.py
def random_str(
    min_size: int, max_size: int, seed: int | None = None, len_ref: str | pl.Expr | None = None
) -> pl.Expr:
    """
    Generates random strings of length between min_size and max_size. The characters are
    uniformly distributed over ASCII letters and numbers: a-z, A-Z and 0-9.

    Parameters
    ----------
    min_size
        The min size of the string, inclusive
    max_size
        The max size of the string, inclusive
    seed
        The random seed. None means no seed.
    len_ref
        Length reference. In normal non-streaming context, this should always be None which means it will always
        use pl.len() as the total length of the data you wish to generate. In streaming mode, you may pass any column
        name, e.g. `len_ref = 'id'` so that the random generator knows the corresponding length of each chunk.
    """
    mi, ma = min_size, max_size
    if min_size > max_size:
        mi, ma = max_size, min_size

    len_, is_elementwise = _get_streamable(len_ref)
    return pl_plugin(
        symbol="pl_rand_str",
        args=[
            len_,
            pl.lit(mi, pl.UInt32),
            pl.lit(ma, pl.UInt32),
            pl.lit(seed, pl.UInt64),
        ],
        is_elementwise=is_elementwise,
    )
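A rough plain-Python equivalent of the behavior described above, using only the standard library. The helper name is hypothetical, and drawing the length uniformly between the two bounds is an assumption of this sketch, not something the documentation states.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits  # a-z, A-Z, 0-9


def random_str_sketch(n, min_size, max_size, seed=None):
    """Generate n random alphanumeric strings (hypothetical helper)."""
    # Swap the bounds if they were passed in the wrong order, as the source does.
    lo, hi = min(min_size, max_size), max(min_size, max_size)
    rng = random.Random(seed)
    # Assumption: string length is uniform on [lo, hi], both ends inclusive.
    return ["".join(rng.choices(ALPHABET, k=rng.randint(lo, hi))) for _ in range(n)]


strings = random_str_sketch(5, min_size=6, max_size=2, seed=0)
```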

ttest_1samp(var1, pop_mean, alternative='two-sided')

Performs a standard one-sample t test of var1 against the expected population mean. The function first sanitizes var1 by dropping null and non-finite values. The degrees of freedom (df) are derived from the count of valid values.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

  1. There is no valid value in the inputs, or the mean is inf.
  2. Input variable has length 0 after removing non-finite values.

Parameters:

Name Type Description Default
var1 str | Expr

Variable 1

required
pop_mean float

The expected population mean in the hypothesis test

required
alternative ('two-sided', 'less', 'greater')

Alternative of the hypothesis test

"two-sided"
Source code in python/polars_ds/exprs/stats.py
def ttest_1samp(
    var1: str | pl.Expr, pop_mean: float, alternative: Alternative = "two-sided"
) -> pl.Expr:
    """
    Performs a standard one-sample t test of var1 against the expected population mean. This
    function first sanitizes var1 by dropping null and non-finite values. The degrees of freedom
    (df) are derived from the count of valid values.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        Variable 1
    pop_mean
        The expected population mean in the hypothesis test
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    """
    y = to_expr(var1)
    s1 = y.filter(y.is_finite())
    sm = s1.mean()
    pm = pl.lit(pop_mean, dtype=pl.Float64)
    var = s1.var()
    cnt = s1.len().cast(pl.UInt64)
    alt = pl.lit(alternative, dtype=pl.String)
    return pl_plugin(
        symbol="pl_ttest_1samp",
        args=[sm, pm, var, cnt, alt],
        returns_scalar=True,
    )
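The quantities passed to the plugin (mean, variance, count of finite values) are enough to form the textbook one-sample t statistic. The sketch below shows that computation in plain Python. It illustrates the formula only and is not the plugin's code; the helper name is hypothetical, and the p-value is omitted because the standard library has no t distribution.

```python
import math
import statistics


def ttest_1samp_statistic(values, pop_mean):
    """Return the one-sample t statistic, or NaN if it cannot be computed."""
    # Drop None, NaN, and +/-inf, mirroring the filter on finite values.
    clean = [v for v in values if v is not None and math.isfinite(v)]
    n = len(clean)
    if n < 2:
        return math.nan
    mean = statistics.fmean(clean)
    var = statistics.variance(clean)  # sample variance, ddof = 1
    if var == 0.0:
        return math.nan  # choice made for this sketch
    return (mean - pop_mean) / math.sqrt(var / n)


sample = [2.1, 1.9, None, 2.4, float("nan"), 2.0, 2.2]
t_stat = ttest_1samp_statistic(sample, pop_mean=2.0)
```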

ttest_ind(var1, var2, alternative='two-sided', equal_var=False)

Performs a two-sample Student's t test or Welch's t test. Functionality-wise this is designed to be equivalent to SciPy's ttest_ind, with fewer options. The result is not exact but agrees with SciPy's to within 1e-10.

In the case of Student's t test, the data is assumed to have no nulls, and n = expr.count() is used. Note that expr.count() only counts non-null elements after Polars 0.20. The degrees of freedom will be 2n - 2. As a result, nulls might cause problems.

In the case of Welch's t test, data will be sanitized (nulls, NaNs, Infs will be dropped before the test), and df will be counted based on the length of sanitized data.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

  1. There is no valid value in the inputs, or the mean is inf.
  2. Input variable has length 0 after removing non-finite values.

Parameters:

Name Type Description Default
var1 str | Expr

Variable 1

required
var2 str | Expr

Variable 2

required
alternative ('two-sided', 'less', 'greater')

Alternative of the hypothesis test

"two-sided"
equal_var bool

If true, perform standard student t 2 sample test. Otherwise, perform Welch's t test.

False

Examples:

Same length, equal variance comparisons.

>>> df.select(pds.ttest_ind("x1", "x2", equal_var=True))

Potentially unequal length, unequal variance.

>>> df.select(
...     pds.ttest_ind(
...         pl.col("x1").filter(condition_A), pl.col("x1").filter(condition_B), equal_var=False
...     )
... )
Source code in python/polars_ds/exprs/stats.py
def ttest_ind(
    var1: str | pl.Expr,
    var2: str | pl.Expr,
    alternative: Alternative = "two-sided",
    equal_var: bool = False,
) -> pl.Expr:
    """
    Performs a two-sample Student's t test or Welch's t test. Functionality-wise this is designed
    to be equivalent to SciPy's ttest_ind, with fewer options. The result is not exact but
    agrees with SciPy's to within 1e-10.

    In the case of Student's t test, the data is assumed to have no nulls, and n = expr.count()
    is used. Note that expr.count() only counts non-null elements after Polars 0.20.
    The degrees of freedom will be 2n - 2. As a result, nulls might cause problems.

    In the case of Welch's t test, data will be sanitized (nulls, NaNs, Infs will be dropped
    before the test), and df will be counted based on the length of sanitized data.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        Variable 1
    var2
        Variable 2
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    equal_var
        If true, perform standard student t 2 sample test. Otherwise, perform Welch's
        t test.

    Examples
    --------
    Same length, equal variance comparisons.
    >>> df.select(pds.ttest_ind("x1", "x2", equal_var=True))

    Potentially unequal length, unequal variance.
    >>> df.select(
    ...     pds.ttest_ind(
    ...         pl.col("x1").filter(condition_A), pl.col("x1").filter(condition_B), equal_var=False
    ...     )
    ... )
    """
    y1, y2 = to_expr(var1), to_expr(var2)
    if equal_var:
        m1 = y1.mean()
        m2 = y2.mean()
        v1 = y1.var()
        v2 = y2.var()
        cnt = y1.count().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_ttest_2samp",
            args=[m1, m2, v1, v2, cnt, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
    else:
        s1 = y1.filter(y1.is_finite())
        s2 = y2.filter(y2.is_finite())
        m1 = s1.mean()
        m2 = s2.mean()
        v1 = s1.var()
        v2 = s2.var()
        n1 = s1.len().cast(pl.UInt64)
        n2 = s2.len().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_welch_t",
            args=[m1, m2, v1, v2, n1, n2, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
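For the equal_var=True case, the textbook Student's t statistic for two samples of the same size n can be sketched in plain Python as follows. This illustrates the formula with df = 2n - 2 and is not the plugin's actual code; the helper name is hypothetical.

```python
import math
import statistics


def student_t_equal_n(x1, x2):
    """Student's t statistic and degrees of freedom for two samples of equal size."""
    if len(x1) != len(x2):
        raise ValueError("This sketch assumes both samples have the same length")
    n = len(x1)
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    # With equal n the pooled variance is the average of the two sample variances,
    # so the standard error of the mean difference is sqrt((v1 + v2) / n).
    t = (m1 - m2) / math.sqrt((v1 + v2) / n)
    return t, 2 * n - 2


t_stat, dof = student_t_equal_n([5.1, 4.9, 5.3, 5.0], [4.6, 4.8, 4.5, 4.7])
```

A single n is used for both samples, which is why nulls in either column can distort the result when equal_var=True.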

ttest_ind_from_stats(var1, mean, var, cnt, alternative='two-sided', equal_var=False)

Performs a two-sample Student's t test or Welch's t test, using only scalar statistics (mean, variance, count) of the second sample. This is more suitable for t tests between rolling data and some other fixed data, for which you can compute the mean, var, and count only once.

If (NaN, NaN) is returned, then it is possible that one of the following numeric problems occurred:

  1. There is no valid value in the inputs, or the mean is inf.
  2. Input variable has length 0 after removing non-finite values.

Parameters:

Name Type Description Default
var1 str | Expr

The variable 1

required
mean float

The mean of var2

required
var float

The var of var2

required
cnt int

The count of var2, used only in Welch's t test

required
alternative ('two-sided', 'less', 'greater')

Alternative of the hypothesis test

"two-sided"
equal_var bool

If true, perform standard student t 2 sample test. Otherwise, perform Welch's t test.

False
Source code in python/polars_ds/exprs/stats.py
def ttest_ind_from_stats(
    var1: str | pl.Expr,
    mean: float,
    var: float,
    cnt: int,
    alternative: Alternative = "two-sided",
    equal_var: bool = False,
) -> pl.Expr:
    """
    Performs a two-sample Student's t test or Welch's t test, using only scalar statistics (mean,
    variance, count) of the second sample. This is more suitable for t tests between rolling data
    and some other fixed data, for which you can compute the mean, var, and count only once.

    If (NaN, NaN) is returned, then it is possible that one of the following numeric
    problems occurred:

    1. There is no valid value in the inputs, or the mean is inf.
    2. Input variable has length 0 after removing non-finite values.

    Parameters
    ----------
    var1
        The variable 1
    mean
        The mean of var2
    var
        The var of var2
    cnt
        The count of var2, used only in Welch's t test
    alternative : {"two-sided", "less", "greater"}
        Alternative of the hypothesis test
    equal_var
        If true, perform standard student t 2 sample test. Otherwise, perform Welch's
        t test.
    """
    y = to_expr(var1)
    if equal_var:
        m1 = y.mean()
        m2 = pl.lit(mean, pl.Float64)
        v1 = y.var()
        v2 = pl.lit(var, pl.Float64)
        cnt = y.count().cast(pl.UInt64)
        return pl_plugin(
            symbol="pl_ttest_2samp",
            args=[m1, m2, v1, v2, cnt, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
    else:
        s1 = y.filter(y.is_finite())
        m1 = s1.mean()
        m2 = pl.lit(mean, pl.Float64)
        v1 = s1.var()
        v2 = pl.lit(var, pl.Float64)
        n1 = s1.len().cast(pl.UInt64)
        n2 = pl.lit(cnt, pl.UInt64)
        return pl_plugin(
            symbol="pl_welch_t",
            args=[m1, m2, v1, v2, n1, n2, pl.lit(alternative, dtype=pl.String)],
            returns_scalar=True,
        )
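The rolling use case can be illustrated in plain Python: the reference sample's mean, variance, and count are computed once and reused for every window. The sketch uses the textbook Welch statistic with Welch-Satterthwaite degrees of freedom. The helper name and the toy data are hypothetical, and the plugin's internals may differ.

```python
import math
import statistics


def welch_from_stats(m1, v1, n1, m2, v2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se1, se2 = v1 / n1, v2 / n2
    t = (m1 - m2) / math.sqrt(se1 + se2)
    dof = (se1 + se2) ** 2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return t, dof


# Summary statistics of the fixed reference sample, computed only once.
reference = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0]
ref_mean = statistics.fmean(reference)
ref_var = statistics.variance(reference)
ref_cnt = len(reference)

# Compare every rolling window of size 4 against the reference.
series = [10.0, 10.3, 9.9, 10.1, 11.0, 11.4, 11.2, 11.1]
window = 4
results = []
for start in range(len(series) - window + 1):
    chunk = series[start : start + window]
    results.append(
        welch_from_stats(
            statistics.fmean(chunk), statistics.variance(chunk), len(chunk),
            ref_mean, ref_var, ref_cnt,
        )
    )
```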

weighted_corr(x, y, weights)

Computes the weighted correlation between x and y. The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
weights str | Expr

An expr representing weights. Must be of the same length as x and y.

required
Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient

Source code in python/polars_ds/exprs/stats.py
def weighted_corr(x: str | pl.Expr, y: str | pl.Expr, weights: str | pl.Expr) -> pl.Expr:
    """
    Computes the weighted correlation between x and y. The weights column must have the same
    length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of the same length as x and y.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy = to_expr(x), to_expr(y)
    w = to_expr(weights)
    numerator = w.dot((xx - weighted_mean(xx, w, False)) * (yy - weighted_mean(yy, w, False)))
    sxx = w.dot((xx - weighted_mean(xx, w, False)).pow(2))
    syy = w.dot((yy - weighted_mean(yy, w, False)).pow(2))
    return numerator / (sxx * syy).sqrt()
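The same formula, written out in plain Python to make the role of the weighted means explicit. The function names here are local to the sketch and the data is made up.

```python
import math


def weighted_mean_sketch(x, w):
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def weighted_corr_sketch(x, y, w):
    mx, my = weighted_mean_sketch(x, w), weighted_mean_sketch(y, w)
    # Weighted cross-product and weighted sums of squares around the weighted means.
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return num / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.1, 5.9, 8.2]
w = [1.0, 1.0, 2.0, 4.0]
r = weighted_corr_sketch(x, y, w)
```

Multiplying all weights by a positive constant leaves the result unchanged, since the constant cancels between numerator and denominator.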

weighted_cosine_sim(x, y, weights)

Computes the weighted cosine similarity between x and y (column-wise). The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
weights str | Expr

An expr representing weights. Must be of the same length as x and y.

required
Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient

Source code in python/polars_ds/exprs/stats.py
def weighted_cosine_sim(x: str | pl.Expr, y: str | pl.Expr, weights: str | pl.Expr) -> pl.Expr:
    """
    Computes the weighted cosine similarity between x and y (column-wise). The weights column
    must have the same length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of the same length as x and y.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy = to_expr(x), to_expr(y)
    w = to_expr(weights)
    wx2 = xx.pow(2).dot(w)
    wy2 = yy.pow(2).dot(w)
    return (w * xx).dot(yy) / (wx2 * wy2).sqrt()
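Unlike the weighted correlation, this computation does not center x and y. A plain-Python version of the same expression, with a hypothetical helper name:

```python
import math


def weighted_cosine_sim_sketch(x, y, w):
    wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    wx2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    wy2 = sum(wi * yi * yi for wi, yi in zip(w, y))
    return wxy / math.sqrt(wx2 * wy2)


# With unit weights this reduces to the ordinary cosine similarity.
plain = weighted_cosine_sim_sketch([1.0, 2.0], [2.0, 1.0], [1.0, 1.0])
```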

weighted_cov(x, y, weights)

Computes the weighted covariance between x and y. The weights column must have the same length as both x and y.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
weights Expr | float

An expr representing weights. Must be of the same length as x and y.

required
Reference

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient

Source code in python/polars_ds/exprs/stats.py
def weighted_cov(x: str | pl.Expr, y: str | pl.Expr, weights: pl.Expr | float) -> pl.Expr:
    """
    Computes the weighted covariance between x and y. The weights column must have the same
    length as both x and y.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    weights
        An expr representing weights. Must be of the same length as x and y.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Weighted_correlation_coefficient
    """
    xx, yy, w = to_expr(x), to_expr(y), to_expr(weights)
    wx, wy = weighted_mean(xx, w, False), weighted_mean(yy, w, False)
    return w.dot((xx - wx) * (yy - wy)) / w.sum()

weighted_gmean(var, weights, is_normalized=False)

Computes the weighted geometric mean of the variable.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
weights str | Expr

An expr representing weights. Must be of same length as var.

required
is_normalized bool

If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights

False
Source code in python/polars_ds/exprs/stats.py
def weighted_gmean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted geometric mean of the variable.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    x, w = to_expr(var), to_expr(weights)
    if is_normalized:
        return (x.ln().dot(w)).exp()
    else:
        return (x.ln().dot(w) / (w.sum())).exp()
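The weighted geometric mean is the exponential of the weighted mean of logarithms. A plain-Python sketch of the same computation, with a hypothetical helper name:

```python
import math


def weighted_gmean_sketch(x, w, is_normalized=False):
    log_sum = sum(wi * math.log(xi) for wi, xi in zip(w, x))
    if not is_normalized:
        # Weights do not sum to 1, so divide by their total.
        log_sum /= sum(w)
    return math.exp(log_sum)


equal = weighted_gmean_sketch([2.0, 8.0], [1.0, 1.0])
skewed = weighted_gmean_sketch([2.0, 8.0], [3.0, 1.0])
```

All values must be strictly positive, since the logarithm is undefined otherwise.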

weighted_hmean(var, weights, is_normalized=False)

Computes the weighted harmonic mean of the variable.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
weights str | Expr

An expr representing weights. Must be of same length as var.

required
is_normalized bool

If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights

False
Source code in python/polars_ds/exprs/stats.py
def weighted_hmean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted harmonic mean of the variable.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    w = to_expr(weights)
    x = to_expr(var)
    # Weighted sum of reciprocals: sum_i w_i / x_i
    dot = w.dot(pl.lit(1.0, dtype=pl.Float32) / x)
    if is_normalized:
        return 1.0 / dot
    else:
        return 1.0 / (dot / w.sum())
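For reference, the textbook weighted harmonic mean is sum(w) / sum(w_i / x_i). A minimal plain-Python sketch of that definition follows; the helper name is hypothetical.

```python
def weighted_hmean_sketch(x, w, is_normalized=False):
    # Weighted sum of reciprocals.
    recip = sum(wi / xi for wi, xi in zip(w, x))
    if is_normalized:
        return 1.0 / recip
    return sum(w) / recip


# Two legs at 40 and 60 (speed units), weighted 2:1.
avg_speed = weighted_hmean_sketch([40.0, 60.0], [2.0, 1.0])
```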

weighted_mean(var, weights, is_normalized=False)

Computes the weighted mean, where weights is an expr representing a weight column. The weights column must have the same length as var.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
weights str | Expr

An expr representing weights. Must be of same length as var.

required
is_normalized bool

If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights

False
Source code in python/polars_ds/exprs/stats.py
def weighted_mean(
    var: str | pl.Expr, weights: str | pl.Expr, is_normalized: bool = False
) -> pl.Expr:
    """
    Computes the weighted mean, where weights is an expr representing
    a weight column. The weights column must have the same length as var.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    is_normalized
        If true, the weights are assumed to sum to 1. If false, will divide by sum of the weights
    """
    x, w = to_expr(var), to_expr(weights)
    out = x.dot(w)
    if is_normalized:
        return out
    return out / w.sum()

weighted_var(var, weights, freq_weights=False)

Computes the weighted variance. The weights column must have the same length as var.

All weights are assumed to be > 0. This will not check if weights are valid.

Parameters:

Name Type Description Default
var str | Expr

The variable

required
weights str | Expr

An expr representing weights. Must be of same length as var.

required
freq_weights bool

Whether to follow the formula for frequency weights or other types of weights. See reference for detail. If true, this assumes frequency weights are NOT normalized. If false, the weighted sample variance is biased. See reference for more info.

False
Reference

https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance

Source code in python/polars_ds/exprs/stats.py
def weighted_var(var: str | pl.Expr, weights: str | pl.Expr, freq_weights: bool = False) -> pl.Expr:
    """
    Computes the weighted variance. The weights column must have the same length as var.

    All weights are assumed to be > 0. This will not check if weights are valid.

    Parameters
    ----------
    var
        The variable
    weights
        An expr representing weights. Must be of same length as var.
    freq_weights
        Whether to follow the formula for frequency weights or other types of weights. See reference
        for detail. If true, this assumes frequency weights are NOT normalized. If false, the
        weighted sample variance is biased. See reference for more info.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance
    """
    x, w = to_expr(var), to_expr(weights)
    wm = weighted_mean(x, w, False)
    summand = w.dot((x - wm).pow(2))
    if freq_weights:
        return summand / (w.sum() - 1)
    return summand / w.sum()
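The effect of freq_weights can be checked in plain Python by expanding integer frequency weights into repeated observations. The helper name and the data are made up for this sketch.

```python
import statistics


def weighted_var_sketch(x, w, freq_weights=False):
    total = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    ss = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
    # Frequency weights: unbiased estimate with (sum of weights - 1) in the denominator.
    return ss / (total - 1) if freq_weights else ss / total


x = [1.0, 2.0, 3.0]
w = [2, 1, 3]  # 1.0 appears twice, 2.0 once, 3.0 three times
expanded = [1.0, 1.0, 2.0, 3.0, 3.0, 3.0]

unbiased = weighted_var_sketch(x, w, freq_weights=True)
biased = weighted_var_sketch(x, w, freq_weights=False)
```

With frequency weights the result matches the sample variance of the expanded data; without them it matches the population (biased) variance.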

winsorize(x, q_low=0.05, q_high=0.95, method='nearest')

Winsorize the data by clipping by percentiles at the lower and upper ends.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
q_low float

The lower percentile value to clip the data. E.g. everything < x.quantile(q_low) will be mapped to x.quantile(q_low)

0.05
q_high float

The upper percentile value to clip the data. E.g. everything > x.quantile(q_high) will be mapped to x.quantile(q_high)

0.95
method QuantileMethod

Method for quantile estimate. One of "nearest", "higher", "lower", "midpoint", "linear".

'nearest'
Source code in python/polars_ds/exprs/stats.py
def winsorize(
    x: str | pl.Expr,
    q_low: float = 0.05,
    q_high: float = 0.95,
    method: QuantileMethod = "nearest",
) -> pl.Expr:
    """
    Winsorize the data by clipping by percentiles at the lower and upper ends.

    Parameters
    ----------
    x
        Either the name of the column or a Polars expression
    q_low
        The lower percentile value to clip the data. E.g. everything < x.quantile(q_low)
        will be mapped to x.quantile(q_low)
    q_high
        The upper percentile value to clip the data. E.g. everything > x.quantile(q_high)
        will be mapped to x.quantile(q_high)
    method
        Method for quantile estimate. One of "nearest", "higher", "lower", "midpoint", "linear".
    """
    if q_low <= 0.0 or q_low >= 1.0 or q_high <= 0.0 or q_high >= 1.0 or q_high <= q_low:
        raise ValueError("`q_low` and `q_high` must be within (0, 1) and `q_high` must be > `q_low`")

    xx = to_expr(x)
    return xx.clip(
        xx.quantile(q_low, interpolation=method), xx.quantile(q_high, interpolation=method)
    )
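A plain-Python sketch of the clipping step. The nearest-rank rule used here to pick the quantile (index round(q * (n - 1)) in the sorted data) is an assumption of this sketch and may not match Polars' "nearest" method exactly; the helper name is hypothetical.

```python
def winsorize_sketch(values, q_low=0.05, q_high=0.95):
    if not (0.0 < q_low < q_high < 1.0):
        raise ValueError("q_low and q_high must be within (0, 1) and q_high > q_low")
    ordered = sorted(values)
    n = len(ordered)
    # Assumed nearest-rank quantile rule (not necessarily Polars' exact rule).
    lo = ordered[round(q_low * (n - 1))]
    hi = ordered[round(q_high * (n - 1))]
    # Clip every value into [lo, hi]; values inside the range are untouched.
    return [min(max(v, lo), hi) for v in values]


data = [-100.0] + [float(i) for i in range(1, 20)] + [500.0]
clipped = winsorize_sketch(data)
```

Only the two outliers change; they are pulled in to the 5% and 95% quantile values rather than removed.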

xi_corr(x, y, seed=None, return_p=False)

Computes the ξ (xi) correlation developed by Sourav Chatterjee in the paper in the reference. By default only the correlation (the statistic) is returned. If return_p is True, both the statistic and the p-value are returned. Note that if sample size is smaller than 30, the p-value will always be NaN. The ξ correlation is not symmetric, as it only tries to explain whether y is a function of x.

Parameters:

Name Type Description Default
x str | Expr

The first variable

required
y str | Expr

The second variable

required
seed int | None

Seed used when ties in x are broken at random. None means no seed.

None
return_p bool

Whether to return a two-sided p value for the statistic

False
Reference

https://arxiv.org/pdf/1909.10140.pdf

Source code in python/polars_ds/exprs/stats.py
def xi_corr(
    x: str | pl.Expr, y: str | pl.Expr, seed: int | None = None, return_p: bool = False
) -> pl.Expr:
    """
    Computes the ξ (xi) correlation developed by Sourav Chatterjee in the paper in the reference.
    By default only the correlation (the statistic) is returned. If return_p is True, both the
    statistic and the p-value are returned. Note that if sample size is smaller than 30, the
    p-value will always be NaN. The ξ correlation is not symmetric, as it only tries to explain
    whether y is a function of x.

    Parameters
    ----------
    x
        The first variable
    y
        The second variable
    seed
        Seed used when ties in x are broken at random. None means no seed.
    return_p
        Whether to return a two-sided p value for the statistic

    Reference
    ---------
    https://arxiv.org/pdf/1909.10140.pdf
    """
    xx, yy = to_expr(x), to_expr(y)
    args = [
        xx.rank(method="random", seed=seed),
        yy.rank(method="max").cast(pl.Float64),
        (-yy).rank(method="max").cast(pl.Float64),
    ]
    if return_p:
        return pl_plugin(
            symbol="pl_xi_corr_w_p",
            args=args,
            returns_scalar=True,
        )
    else:
        return pl_plugin(
            symbol="pl_xi_corr",
            args=args,
            returns_scalar=True,
        )
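The three rank inputs map onto the tie-aware formula from the referenced paper: with the data ordered by x, r_i counts the points with y <= y_i and l_i counts the points with y >= y_i. A plain-Python sketch of the statistic, under the assumption that x has no ties so no random tie-breaking is needed; the helper name is hypothetical and the p-value is not computed.

```python
def xi_corr_sketch(x, y):
    n = len(x)
    # Order the observations by x (assumes x has no ties).
    order = sorted(range(n), key=lambda i: x[i])
    ys = [y[i] for i in order]
    # r[i]: number of points with y <= ys[i]; l[i]: number of points with y >= ys[i].
    r = [sum(1 for v in ys if v <= yi) for yi in ys]
    l = [sum(1 for v in ys if v >= yi) for yi in ys]
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l)
    return 1.0 - num / den


xs = list(range(-49, 50))
xi_quadratic = xi_corr_sketch(xs, [v * v for v in xs])
xi_identity = xi_corr_sketch(xs, xs)
```

Here y = x² is a non-monotone function of x, so Pearson correlation would be near zero while ξ stays high. This O(n²) version is only for illustration.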