Numerical Functions Expr

Extension for General Numerical Features/Metrics/Quantities

Miscellaneous Numerical Functions and Transforms.

Functions:

Name Description
add_at

Creates a zero column of length buffer_size, then adds the j-th value in values to the buffer at the j-th index in indices. Equivalent to NumPy's add.at.

arr_dot

Calculates the dot product for two array columns.

arr_l1_dist

Calculates the L1 distance for two array columns.

arr_sql2_dist

Calculates the squared L2 distance for two array columns.

center

Centers the column.

convolve

Performs a convolution with the given kernel (filter). The current implementation is slower than SciPy's but offers parallelization within Polars.

detrend

Detrends self using either linear/mean method. This does not persist.

digamma

The digamma function.

exp2

Returns 2^x.

expit

Applies the Expit function to self. Expit(x) = 1 / (1 + e^(-x))

fract

Returns the fractional part of the input values. E.g. fractional part of 1.1 is 0.1

gamma

Applies the gamma function to self. Note that this returns NaN for negative values and inf when x = 0.

gcd

Computes GCD of two integer columns. This will try to cast everything to int32.

haversine

Computes haversine distance using the naive method. The output unit is km.

info_value

Compute Information Value for x with respect to target. This assumes the variable x is continuous.

info_value_discrete

Compute the Information Value for x with respect to target. This assumes x is discrete and castable to String.

integrate_trapz

Integrate y along x using the trapezoidal rule. If x is not a single value, then x should be sorted.

is_decreasing

Checks whether the column is monotonically decreasing.

is_increasing

Checks whether the column is monotonically increasing.

isotonic_regression

Performs isotonic regression on the data. This is the same as scipy.optimize.isotonic_regression.

jaccard_col

Computes Jaccard similarity column-wise. This hashes entire columns and compares the two hashsets.

jaccard_row

Computes Jaccard similarity pairwise between columns a and b. Each column must be a list column, and the lists must have the same inner type.

l1_horizontal

Horizontally computes L1 norm. Shorthand for pl.sum_horizontal(pl.col(x).abs() for x in exprs).

l2_sq_horizontal

Horizontally computes L2 norm squared. Shorthand for pl.sum_horizontal(pl.col(x).pow(2) for x in exprs).

l_inf_horizontal

Horizontally computes L inf norm. Shorthand for pl.max_horizontal(pl.col(x).abs() for x in exprs).

lcm

Computes LCM of two integer columns. This will try to cast everything to int32.

list_amax

Finds the argmax of the list in this column.

list_dot

Calculates the dot product for two list columns.

list_l1_dist

Calculates the L1 distance for two list columns.

list_sql2_dist

Calculates the squared L2 distance for two list columns.

logit

Applies the logit function to self. Logit(x) = ln(x/(1-x)).

next_down

For any float, return the greatest number smaller than itself (within the precision).

next_up

For any float, return the least number greater than itself (within the precision).

pca

Finds all singular values as well as the principal vectors.

principal_components

Transforms the features to get the first k principal components. This returns NaN if the number

psi

Compute the Population Stability Index between x and the reference column (usually x's historical values).

psi_discrete

Compute the Population Stability Index between self (actual) and the reference column. The baseline

psi_w_breakpoints

Creates a PSI report using the custom breakpoints.

rfft

Computes the DFT transform of a real-valued input series using FFT Algorithm. Note that

singular_values

Finds all principal values (singular values) for the data matrix formed by the given features

softmax

Applies the softmax function to the column, which turns any real valued column into valid probability

trunc

Returns the integer part of the input values. E.g. integer part of 1.1 is 1.0

woe

Compute the Weight of Evidence for x with respect to target. This assumes x

woe_discrete

Compute the Weight of Evidence for x with respect to target. This assumes x

xlogy

Computes x * log(y) so that if x = 0, the product is 0.

z_normalize

Z-normalizes the column.

add_at(indices, values, buffer_size=None)

Creates a zero column of length buffer_size, then adds the j-th value in values to the buffer at the position given by the j-th entry of indices. This is equivalent to NumPy's add.at.

Parameters:

Name Type Description Default
indices str | Expr

Expression or name of a column. Must be castable to u32.

required
values str | Expr

Expression or name of a column. Must be castable to f64 and have the same length as the indices.

required
buffer_size int | Expr | None

If this is None, the buffer size will be inferred from the unique values in indices, which should range over [0..n), where n is the actual buffer size. If this is an integer, the buffer will have exactly that size, which might cause an out-of-bounds error if indices are not checked. If this is an expression, only the first element in the represented column will be used.

None
Source code in python/polars_ds/exprs/num.py
def add_at(
    indices: str | pl.Expr, values: str | pl.Expr, buffer_size: int | pl.Expr | None = None
) -> pl.Expr:
    """
    Creates a zero column of length `buffer_size` first. Then the j-th value in `values`
    will be added to the j-th index in `indices` in the buffer. This is the equivalent to
    NumPy's add.at.

    Parameters
    ----------
    indices
        Expression or name of a column. Must be castable to u32.
    values
        Expression or name of a column. Must be castable to f64 and have the same length
        as the indices.
    buffer_size
        If this is None, buffer size will be inferred from unique values in indices, which
        should range from [0..n), where n is the actual buffer size. If this is an integer,
        then the buffer will have the exact size given here, which might cause out of bounds
        error if indices are not checked. If this is an expression, only the first element
        in the represented column will be used.
    """
    ind = to_expr(indices).cast(pl.UInt32).rechunk()
    val = to_expr(values).cast(pl.Float64).rechunk()
    if isinstance(buffer_size, pl.Expr):
        size = buffer_size
    else:
        size = pl.lit(buffer_size, dtype=pl.UInt32)

    return pl_plugin(
        symbol="pl_add_at",
        args=[ind, val, size],
    )
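
To make the accumulation rule concrete, here is a minimal plain-Python sketch of the behaviour described above. It is not the plugin's implementation (that sits behind the `pl_add_at` symbol). The helper name `add_at_reference` and the size-inference rule (largest index plus one) are assumptions made for illustration.

```python
def add_at_reference(indices, values, buffer_size=None):
    """Plain-Python reference for the add-at semantics (hypothetical helper)."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    # Assumption: with no explicit size, the buffer is just large enough
    # to hold the largest index.
    size = max(indices, default=-1) + 1 if buffer_size is None else buffer_size
    buffer = [0.0] * size
    for idx, val in zip(indices, values):
        buffer[idx] += float(val)  # repeated indices accumulate
    return buffer


result = add_at_reference([0, 2, 2, 1], [1.0, 2.0, 3.0, 4.0])
print(result)
```

Index 2 appears twice, so its two values are summed into the same slot, which is the property that distinguishes add-at from a plain indexed assignment.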

arr_dot(arr1, arr2)

Calculates the dot product for two array columns.

Parameters:

Name Type Description Default
arr1 str | Expr

The first array column

required
arr2 str | Expr

The second array column

required
Source code in python/polars_ds/exprs/num.py
def arr_dot(arr1: str | pl.Expr, arr2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the dot product for two array columns.

    Parameters
    ----------
    arr1
        The first array column
    arr2
        The second array column
    """
    x, y = to_expr(arr1), to_expr(arr2)
    return (x * y).arr.sum()

arr_l1_dist(arr1, arr2)

Calculates the L1 distance for two array columns.

Parameters:

Name Type Description Default
arr1 str | Expr

The first array column

required
arr2 str | Expr

The second array column

required
Source code in python/polars_ds/exprs/num.py
def arr_l1_dist(arr1: str | pl.Expr, arr2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the L1 distance for two array columns.

    Parameters
    ----------
    arr1
        The first array column
    arr2
        The second array column
    """
    x, y = to_expr(arr1), to_expr(arr2)
    return (x - y).arr.eval(pl.element().abs()).arr.sum()

arr_sql2_dist(arr1, arr2)

Calculates the squared L2 distance for two array columns.

Parameters:

Name Type Description Default
arr1 str | Expr

The first array column

required
arr2 str | Expr

The second array column

required
Source code in python/polars_ds/exprs/num.py
def arr_sql2_dist(arr1: str | pl.Expr, arr2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the squared L2 distance for two array columns.

    Parameters
    ----------
    arr1
        The first array column
    arr2
        The second array column
    """
    x, y = to_expr(arr1), to_expr(arr2)
    return (x - y).arr.eval(pl.element().pow(2)).arr.sum()
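
As a rough illustration of what the three array helpers above compute for a single row, the following plain-Python sketch evaluates the dot product, the L1 distance, and the squared L2 distance for one pair of equal-length vectors. The function names are hypothetical stand-ins, not part of the library.

```python
def dot_reference(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))


def l1_dist_reference(a, b):
    # Sum of absolute element-wise differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def sql2_dist_reference(a, b):
    # Sum of squared element-wise differences (no square root taken).
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(dot_reference(a, b), l1_dist_reference(a, b), sql2_dist_reference(a, b))
```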

center(x)

Centers the column.

This is only a shortcut for a standard feature transform, and is not recommended to be used in settings where the means need to be persisted.

Source code in python/polars_ds/exprs/num.py
def center(x: str | pl.Expr) -> pl.Expr:
    """
    Centers the column.

    This is only a shortcut for a standard feature transform, and is not recommended
    to be used in settings where the means need to be persisted.
    """
    xx = to_expr(x)
    return xx - xx.mean()

convolve(x, kernel, fill_value=0.0, method='direct', mode='full', parallel=False)

Performs a convolution with the given kernel (filter). The current implementation is slower than SciPy's but offers parallelization within Polars.

For large kernels (usually kernel length > 120), convolving with FFT is faster, but for smaller kernels, convolving with direct method is faster.

Parameters:

Name Type Description Default
x str | Expr

A column of numbers

required
kernel List[float] | ndarray | Series | Expr

The filter for the convolution. Anything that can be turned into a Polars Series will work. All non-finite values will be filtered out before the convolution.

required
fill_value float | Expr

Fill null values in x with this value. Either a float or a Polars expression representing a single element.

0.0
method ConvMethod

Either fft or direct.

'direct'
mode ConvMode

Please check the reference. One of same, left (left-aligned same), right (right-aligned same), valid or full.

'full'
parallel bool

Only applies when method is direct. Whether to compute the convolution in parallel. Note that this may not have the expected performance when you are in group_by or other parallel context already. It is recommended to use this in select/with_columns context, when few expressions are being run at the same time.

False
Reference

https://brianmcfee.net/dstbook-site/content/ch03-convolution/Modes.html

https://en.wikipedia.org/wiki/Convolution

Source code in python/polars_ds/exprs/num.py
def convolve(
    x: str | pl.Expr,
    kernel: List[float] | ndarray | pl.Series | pl.Expr,  # noqa: F821
    fill_value: float | pl.Expr = 0.0,
    method: ConvMethod = "direct",
    mode: ConvMode = "full",
    parallel: bool = False,
) -> pl.Expr:
    """
    Performs a convolution with the given kernel(filter). The current implementation's performance is worse
    than SciPy but offers parallelization within Polars.

    For large kernels (usually kernel length > 120), convolving with FFT is faster, but for smaller kernels,
    convolving with direct method is faster.

    Parameters
    ----------
    x
        A column of numbers
    kernel
        The filter for the convolution. Anything that can be turned into a Polars Series will work. All non-finite
        values will be filtered out before the convolution.
    fill_value
        Fill null values in `x` with this value. Either a float or a Polars expression representing a single element.
    method
        Either `fft` or `direct`.
    mode
        Please check the reference. One of `same`, `left` (left-aligned same), `right` (right-aligned same),
        `valid` or `full`.
    parallel
        Only applies when method is `direct`. Whether to compute the convolution in parallel. Note that this may not
        have the expected performance when you are in group_by or other parallel context already. It is recommended
        to use this in select/with_columns context, when few expressions are being run at the same time.

    Reference
    ---------
    https://brianmcfee.net/dstbook-site/content/ch03-convolution/Modes.html
    https://en.wikipedia.org/wiki/Convolution
    """
    xx = to_expr(x).fill_null(fill_value).cast(pl.Float64).rechunk()  # One cont slice
    f: pl.Expr | pl.Series
    if isinstance(kernel, pl.Expr):
        f = kernel.filter(kernel.is_finite()).rechunk()  # One cont slice
    else:
        f = pl.Series(values=kernel, dtype=pl.Float64)
        f = f.filter(f.is_finite()).rechunk()  # One cont slice

    if method == "direct":
        f = f.reverse()

    return pl_plugin(
        symbol="pl_convolve",
        args=[xx, f],
        kwargs={"mode": mode, "method": method, "parallel": parallel},
        changes_length=True,
    )
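
For readers unfamiliar with the `full` and `valid` modes, the sketch below spells out the textbook direct (non-FFT) convolution in plain Python. It only illustrates the definition; it is not how the `pl_convolve` plugin is implemented, and the function names are hypothetical.

```python
def convolve_full_reference(x, kernel):
    """Direct discrete convolution in 'full' mode: output length n + m - 1."""
    n, m = len(x), len(kernel)
    if n == 0 or m == 0:
        return []
    out = [0.0] * (n + m - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            out[i + j] += xi * kj
    return out


def convolve_valid_reference(x, kernel):
    """Keep only the outputs where the kernel fully overlaps x."""
    full = convolve_full_reference(x, kernel)
    return full[len(kernel) - 1 : len(x)]


full = convolve_full_reference([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
valid = convolve_valid_reference([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
print(full, valid)
```

The double loop costs O(n * m), which is why an FFT-based method tends to win once the kernel becomes long.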

detrend(x, method='linear')

Detrends self using either the linear or the mean method. The fitted trend is not persisted.

Parameters:

Name Type Description Default
method DetrendMethod

Either linear or mean

'linear'
Source code in python/polars_ds/exprs/num.py
def detrend(x: str | pl.Expr, method: DetrendMethod = "linear") -> pl.Expr:
    """
    Detrends self using either linear/mean method. This does not persist.

    Parameters
    ----------
    method
        Either `linear` or `mean`
    """
    ts = to_expr(x)
    if method == "linear":
        N = ts.count()
        x = pl.int_range(0, N, eager=False)
        coeff = pl.cov(ts, x) / x.var()
        const = ts.mean() - coeff * (N - 1) / 2
        return ts - x * coeff - const
    elif method == "mean":
        return ts - ts.mean()
    else:
        raise ValueError(f"Unknown detrend method: {method}")
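
The linear branch above fits a least-squares line against the positions 0, 1, ..., N-1 and subtracts it. A plain-Python sketch of the same arithmetic (slope as covariance over variance, intercept from the means) is given below. The function name is hypothetical and the input is assumed to contain at least two non-null values.

```python
def detrend_linear_reference(ts):
    """Subtract the least-squares line fitted against positions 0..N-1."""
    n = len(ts)
    xs = range(n)
    x_mean = (n - 1) / 2  # mean of 0, 1, ..., n-1
    t_mean = sum(ts) / n
    cov = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts)) / (n - 1)
    var = sum((x - x_mean) ** 2 for x in xs) / (n - 1)
    coeff = cov / var                  # slope
    const = t_mean - coeff * x_mean    # intercept
    return [t - coeff * x - const for x, t in zip(xs, ts)]


residuals = detrend_linear_reference([1.0, 3.0, 5.0, 7.0])
print(residuals)
```

A perfectly linear input leaves residuals of (numerically) zero, and in general the residuals sum to zero.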

digamma(x)

The digamma function.

Source code in python/polars_ds/exprs/num.py
def digamma(x: str | pl.Expr) -> pl.Expr:
    """
    The digamma function.
    """
    return pl_plugin(
        symbol="pl_diagamma",
        args=[to_expr(x)],
        is_elementwise=True,
    )

exp2(x)

Returns 2^x.

Source code in python/polars_ds/exprs/num.py
def exp2(x: str | pl.Expr) -> pl.Expr:
    """
    Returns 2^x.
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_exp2",
        is_elementwise=True,
    )

expit(x)

Applies the Expit function to self. Expit(x) = 1 / (1 + e^(-x))

Source code in python/polars_ds/exprs/num.py
def expit(x: str | pl.Expr) -> pl.Expr:
    """
    Applies the Expit function to self. Expit(x) = 1 / (1 + e^(-x))
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_expit",
        is_elementwise=True,
    )
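
As a small illustration of the formula Expit(x) = 1 / (1 + e^(-x)), here is a scalar plain-Python sketch. Branching on the sign of x is a common trick for avoiding overflow in `exp`; whether the plugin does the same is not stated in the source, and the function name is hypothetical.

```python
import math


def expit_reference(x):
    """Scalar logistic sigmoid, written to avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)


print(expit_reference(0.0), expit_reference(2.0), expit_reference(-2.0))
```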

fract(x)

Returns the fractional part of the input values. E.g. fractional part of 1.1 is 0.1

Source code in python/polars_ds/exprs/num.py
def fract(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the fractional part of the input values. E.g. fractional part of 1.1 is 0.1
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_fract",
        is_elementwise=True,
    )

gamma(x)

Applies the gamma function to self. Note, this will return NaN for negative values and inf when x = 0, whereas SciPy's gamma function will return inf for all x <= 0.

Source code in python/polars_ds/exprs/num.py
def gamma(x: str | pl.Expr) -> pl.Expr:
    """
    Applies the gamma function to self. Note, this will return NaN for negative values and inf when x = 0,
    whereas SciPy's gamma function will return inf for all x <= 0.
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_gamma",
        is_elementwise=True,
    )

gcd(x, y)

Computes GCD of two integer columns. This will try to cast everything to int32.

Parameters:

Name Type Description Default
x str | Expr

An integer column

required
y int | str | Expr

Either an int, or another integer column

required
Source code in python/polars_ds/exprs/num.py
def gcd(x: str | pl.Expr, y: int | str | pl.Expr) -> pl.Expr:
    """
    Computes GCD of two integer columns. This will try to cast everything to int32.

    Parameters
    ----------
    x
        An integer column
    y
        Either an int, or another integer column
    """
    if isinstance(y, int):
        yy = pl.lit(y, dtype=pl.Int32)
    else:
        yy = to_expr(y).cast(pl.Int32)

    return pl_plugin(
        symbol="pl_gcd",
        args=[to_expr(x).cast(pl.Int32), yy],
        is_elementwise=True,
    )
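
For reference, the element-wise result can be reproduced in plain Python with the Euclidean algorithm. The sketch below mimics the case where `y` is a single integer applied to every row; the function name is hypothetical and nothing here reflects the plugin's internals.

```python
def gcd_reference(a, b):
    """Greatest common divisor via the Euclidean algorithm."""
    a, b = abs(a), abs(b)
    while b:
        a, b = b, a % b
    return a


xs = [12, 18, 7, 0]
result = [gcd_reference(x, 6) for x in xs]
print(result)
```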

haversine(x_lat, x_long, y_lat, y_long)

Computes haversine distance using the naive method. The output unit is km.

Parameters:

Name Type Description Default
x_lat str | Expr

Column representing latitude in x

required
x_long str | Expr

Column representing longitude in x

required
y_lat float | str | Expr

Column representing latitude in y

required
y_long float | str | Expr

Column representing longitude in y

required
Source code in python/polars_ds/exprs/num.py
def haversine(
    x_lat: str | pl.Expr,
    x_long: str | pl.Expr,
    y_lat: float | str | pl.Expr,
    y_long: float | str | pl.Expr,
) -> pl.Expr:
    """
    Computes haversine distance using the naive method. The output unit is km.

    Parameters
    ----------
    x_lat
        Column representing latitude in x
    x_long
        Column representing longitude in x
    y_lat
        Column representing latitude in y
    y_long
        Column representing longitude in y
    """
    xlat = to_expr(x_lat)
    xlong = to_expr(x_long)
    ylat = pl.lit(y_lat) if isinstance(y_lat, float) else to_expr(y_lat)
    ylong = pl.lit(y_long) if isinstance(y_long, float) else to_expr(y_long)
    return pl_plugin(
        symbol="pl_haversine",
        args=[xlat, xlong, ylat, ylong],
        is_elementwise=True,
        cast_to_supertype=True,
    )
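
The haversine formula itself is short enough to write out. The sketch below is a scalar plain-Python version under two assumptions that the source does not state: coordinates are given in degrees, and the Earth radius is taken as 6371 km. The function and constant names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km_reference(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


quarter_equator = haversine_km_reference(0.0, 0.0, 0.0, 90.0)
print(quarter_equator)
```

Two points a quarter of the way around the equator should be a quarter of the circumference apart, which gives a quick sanity check.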

info_value(x, target, n_bins=10, return_sum=True)

Compute Information Value for x with respect to target. This assumes the variable x is continuous. A value of 1 is added to all events/non-events (goods/bads) to smooth the computation.

Currently only quantile binning strategy is implemented.

Parameters:

Name Type Description Default
x str | Expr

The feature. Must be numeric.

required
target str | Expr | Iterable[float]

The target column. Should be 0s and 1s.

required
n_bins int

The number of bins to bin x.

10
return_sum bool

If false, the output is a struct containing the ranges and the corresponding IVs. If true, it is the sum of the individual information values.

True
Reference

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Source code in python/polars_ds/exprs/num.py
def info_value(
    x: str | pl.Expr,
    target: str | pl.Expr | Iterable[float],
    n_bins: int = 10,
    return_sum: bool = True,
) -> pl.Expr:
    """
    Compute Information Value for x with respect to target. This assumes the variable x
    is continuous. A value of 1 is added to all events/non-events
    (goods/bads) to smooth the computation.

    Currently only quantile binning strategy is implemented.

    Parameters
    ----------
    x
        The feature. Must be numeric.
    target
        The target column. Should be 0s and 1s.
    n_bins
        The number of bins to bin x.
    return_sum
        If false, the output is a struct containing the ranges and the corresponding IVs. If true,
        it is the sum of the individual information values.

    Reference
    ---------
    https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
    """
    if isinstance(target, (str, pl.Expr)):
        t = to_expr(target)
    else:
        t = pl.Series(values=target)
    xx = to_expr(x)
    valid = xx.filter(xx.is_finite())
    brk = valid.qcut(n_bins, left_closed=False, allow_duplicates=True).cast(pl.String)
    out = pl_plugin(symbol="pl_iv", args=[brk, t], changes_length=True)
    return out.struct.field("iv").sum() if return_sum else out

info_value_discrete(x, target, return_sum=True)

Compute the Information Value for x with respect to target. This assumes x is discrete and castable to String. A value of 1 is added to all events/non-events (goods/bads) to smooth the computation.

Parameters:

Name Type Description Default
x str | Expr

The feature. The column must be castable to String

required
target str | Expr | Iterable[int]

The target variable. Should be 0s and 1s.

required
return_sum bool

If false, the output is a struct containing the categories and the corresponding IVs. If true, it is the sum of the individual information values.

True
Reference

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Source code in python/polars_ds/exprs/num.py
def info_value_discrete(
    x: str | pl.Expr, target: str | pl.Expr | Iterable[int], return_sum: bool = True
) -> pl.Expr:
    """
    Compute the Information Value for x with respect to target. This assumes x
    is discrete and castable to String. A value of 1 is added to all events/non-events
    (goods/bads) to smooth the computation.

    Parameters
    ----------
    x
        The feature. The column must be castable to String
    target
        The target variable. Should be 0s and 1s.
    return_sum
        If false, the output is a struct containing the categories and the corresponding IVs. If true,
        it is the sum of the individual information values.

    Reference
    ---------
    https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
    """
    if isinstance(target, (str, pl.Expr)):
        t = to_expr(target)
    else:
        t = pl.Series(values=target)
    out = pl_plugin(symbol="pl_iv", args=[to_expr(x).cast(pl.String), t], changes_length=True)
    return out.struct.field("iv").sum() if return_sum else out
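
To show how Information Value with add-one smoothing can be computed for a discrete feature, here is a plain-Python sketch. The details are assumptions: the source only says that 1 is added to all events and non-events, so the exact normalisation used by the `pl_iv` plugin may differ. The function name is hypothetical.

```python
import math
from collections import Counter


def info_value_discrete_reference(x, target):
    """Per-category IV with add-one smoothing (illustrative assumptions)."""
    events = Counter(cat for cat, t in zip(x, target) if t == 1)
    non_events = Counter(cat for cat, t in zip(x, target) if t == 0)
    cats = sorted(set(events) | set(non_events))
    # Add 1 to every event / non-event count so no category has a zero count.
    ev = {c: events[c] + 1 for c in cats}
    nev = {c: non_events[c] + 1 for c in cats}
    ev_total, nev_total = sum(ev.values()), sum(nev.values())
    iv = {}
    for c in cats:
        p_ev, p_nev = ev[c] / ev_total, nev[c] / nev_total
        woe = math.log(p_ev / p_nev)       # weight of evidence for category c
        iv[c] = (p_ev - p_nev) * woe       # always >= 0
    return iv


iv = info_value_discrete_reference(list("aaabbb"), [1, 1, 1, 0, 0, 0])
print(iv, sum(iv.values()))
```

Summing the per-category values corresponds to `return_sum=True`; the dictionary corresponds loosely to the struct output.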

integrate_trapz(y, x)

Integrate y along x using the trapezoidal rule. If x is not a single value, then x should be sorted.

Parameters:

Name Type Description Default
y str | Expr

A column of numbers

required
x float | Expr

If it is a single float, it must be positive and it will represent a uniform distance between points. If it is an expression, it must be sorted, contain no nulls, and have the same length as y.

required
Source code in python/polars_ds/exprs/num.py
def integrate_trapz(y: str | pl.Expr, x: float | pl.Expr) -> pl.Expr:
    """
    Integrate y along x using the trapezoidal rule. If x is not a single
    value, then x should be sorted.

    Parameters
    ----------
    y
        A column of numbers
    x
        If it is a single float, it must be positive and it will represent a uniform
        distance between points. If it is an expression, it must be sorted, does not contain
        null, and have the same length as self.
    """
    yy = to_expr(y).cast(pl.Float64).rechunk()
    if isinstance(x, float):
        xx = pl.lit(abs(x), pl.Float64)
    else:
        xx = to_expr(x).cast(pl.Float64)

    return pl_plugin(
        symbol="pl_trapz",
        args=[yy, xx],
        returns_scalar=True,
    )
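
The two calling conventions (uniform spacing versus explicit x coordinates) can be illustrated with a plain-Python sketch of the trapezoidal rule. The function name is hypothetical, and the inputs are assumed to be plain lists with no nulls.

```python
def integrate_trapz_reference(y, x):
    """Trapezoidal rule; x is either a uniform step or a sorted list."""
    if isinstance(x, (int, float)):
        dx = abs(x)  # uniform spacing between consecutive points
        return sum(dx * (a + b) / 2 for a, b in zip(y, y[1:]))
    # Explicit coordinates: each panel has its own width x1 - x0.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])
    )


uniform = integrate_trapz_reference([0.0, 1.0, 2.0], 1.0)
explicit = integrate_trapz_reference([0.0, 1.0, 3.0], [0.0, 1.0, 3.0])
print(uniform, explicit)
```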

is_decreasing(x, strict=False)

Checks whether the column is monotonically decreasing.

Parameters:

Name Type Description Default
x str | Expr

A numerical column

required
strict bool

Whether the check should be strict

False
Source code in python/polars_ds/exprs/num.py
def is_decreasing(x: str | pl.Expr, strict: bool = False) -> pl.Expr:
    """
    Checks whether the column is monotonically decreasing.

    Parameters
    ----------
    x
        A numerical column
    strict
        Whether the check should be strict
    """
    xx = to_expr(x)
    if strict:
        return (xx.diff() < 0.0).all()
    else:
        return (xx.diff() <= 0.0).all()

is_increasing(x, strict=False)

Checks whether the column is monotonically increasing.

Parameters:

Name Type Description Default
x str | Expr

A numerical column

required
strict bool

Whether the check should be strict

False
Source code in python/polars_ds/exprs/num.py
def is_increasing(x: str | pl.Expr, strict: bool = False) -> pl.Expr:
    """
    Checks whether the column is monotonically increasing.

    Parameters
    ----------
    x
        A numerical column
    strict
        Whether the check should be strict
    """
    if strict:
        return (to_expr(x).diff() > 0.0).all()
    else:
        return (to_expr(x).diff() >= 0.0).all()
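
Both monotonicity checks reduce to testing the sign of consecutive differences. A plain-Python sketch of the same idea, with hypothetical function names and assuming a list without nulls, looks like this:

```python
def is_increasing_reference(xs, strict=False):
    """True if every consecutive difference is > 0 (strict) or >= 0."""
    pairs = zip(xs, xs[1:])
    if strict:
        return all(b - a > 0 for a, b in pairs)
    return all(b - a >= 0 for a, b in pairs)


def is_decreasing_reference(xs, strict=False):
    """True if every consecutive difference is < 0 (strict) or <= 0."""
    pairs = zip(xs, xs[1:])
    if strict:
        return all(b - a < 0 for a, b in pairs)
    return all(b - a <= 0 for a, b in pairs)


print(is_increasing_reference([1, 2, 2, 3]), is_increasing_reference([1, 2, 2, 3], strict=True))
```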

isotonic_regression(y, weights=None, increasing=True)

Performs isotonic regression on the data. This is the same as scipy.optimize.isotonic_regression.

Parameters:

Name Type Description Default
y str | Expr

The response variable

required
weights str | Expr | None

The weights for the response

None
increasing bool

If true, output will be monotonically increasing. If false, it will be monotonically decreasing.

True
Source code in python/polars_ds/exprs/num.py
def isotonic_regression(
    y: str | pl.Expr, weights: str | pl.Expr | None = None, increasing: bool = True
) -> pl.Expr:
    """
    Performs isotonic regression on the data. This is the same as scipy.optimize.isotonic_regression.

    Parameters
    ----------
    y
        The response variable
    weights
        The weights for the response
    increasing
        If true, output will be monotonically increasing. If false, it will be monotonically
        decreasing.
    """

    yy = to_expr(y).cast(pl.Float64)
    args = [yy]
    has_weights = weights is not None
    if has_weights:
        args.append(to_expr(weights).cast(pl.Float64))

    return pl_plugin(
        symbol="pl_isotonic_regression",
        args=args,
        kwargs={
            "has_weights": has_weights,
            "increasing": increasing,
        },
    )
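
Isotonic regression is commonly solved with the pool-adjacent-violators algorithm (PAVA). The source does not say which algorithm the `pl_isotonic_regression` plugin uses, so the plain-Python sketch below should be read only as an illustration of what the fitted output looks like. The function name is hypothetical.

```python
def isotonic_regression_reference(y, weights=None, increasing=True):
    """Weighted isotonic fit via pool-adjacent-violators (illustrative)."""
    w = [1.0] * len(y) if weights is None else list(weights)
    # A decreasing fit is an increasing fit on the negated data.
    vals = list(y) if increasing else [-v for v in y]
    blocks = []  # each block: [weighted mean, total weight, point count]
    for v, wi in zip(vals, w):
        blocks.append([float(v), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted if increasing else [-v for v in fitted]


fit = isotonic_regression_reference([1.0, 3.0, 2.0, 4.0])
print(fit)
```

The out-of-order pair (3, 2) is pooled into its mean, while the values that already respect the ordering are left untouched.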

jaccard_col(a, b, count_null=False)

Computes Jaccard similarity column-wise. This hashes entire columns and compares the two hashsets. Note: only integer/str columns can be compared.

Parameters:

Name Type Description Default
a str | Expr

A column with a hashable type

required
b str | Expr

A column with a hashable type

required
count_null bool

Whether to count null as a distinct element.

False
Source code in python/polars_ds/exprs/num.py
def jaccard_col(a: str | pl.Expr, b: str | pl.Expr, count_null: bool = False) -> pl.Expr:
    """
    Computes jaccard similarity column-wise. This will hash entire columns and compares the two
    hashsets. Note: only integer/str columns can be compared.

    Parameters
    ----------
    a
        A column with a hashable type
    b
        A column with a hashable type
    count_null
        Whether to count null as a distinct element.
    """
    aa = to_expr(a).unique()
    bb = to_expr(b).unique()

    if not count_null:
        aa = aa.drop_nulls()
        bb = bb.drop_nulls()

    return jaccard_row(aa.implode(), bb.implode())

jaccard_row(a, b)

Computes Jaccard similarity pairwise between columns a and b. Each column must be a list column, and the lists must have the same inner type. The inner type must be either integer or string.

Parameters:

Name Type Description Default
a str | Expr

A list column with a hashable inner type

required
b str | Expr

A list column with a hashable inner type

required
Source code in python/polars_ds/exprs/num.py
def jaccard_row(a: str | pl.Expr, b: str | pl.Expr) -> pl.Expr:
    """
    Computes jaccard similarity pairwise between `a` and `b` column. The type of
    each column must be list and the lists must have the same inner type. The inner type
    must either be integer or string.

    Parameters
    ----------
    a
        A list column with a hashable inner type
    b
        A list column with a hashable inner type
    """
    aa = to_expr(a)
    bb = to_expr(b)
    intersection_len = aa.list.set_intersection(bb).list.len()
    a_len = aa.list.len()
    b_len = bb.list.len()
    return intersection_len / (a_len + b_len - intersection_len)
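
For a single pair of rows, Jaccard similarity is the size of the intersection divided by the size of the union. The set-based plain-Python sketch below illustrates this; the function name is hypothetical. Note that the source code above derives the union size from the list lengths, so the two computations agree when neither list contains repeated elements.

```python
def jaccard_reference(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return float("nan")  # assumption: 0/0 for two empty inputs
    return len(sa & sb) / len(union)


similarity = jaccard_reference([1, 2, 3], [2, 3, 4])
print(similarity)
```

The column-wise variant applies the same ratio to the unique values of two whole columns.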

l1_horizontal(*v, normalize=False)

Horizontally computes L1 norm. Shorthand for pl.sum_horizontal(pl.col(x).abs() for x in exprs).

Parameters:

Name Type Description Default
*v str | Expr

Expressions to compute horizontal L1.

()
normalize bool

Whether to divide by the dimension

False
Source code in python/polars_ds/exprs/num.py
def l1_horizontal(*v: str | pl.Expr, normalize: bool = False) -> pl.Expr:
    """
    Horizontally computes L1 norm. Shorthand for pl.sum_horizontal(pl.col(x).abs() for x in exprs).

    Parameters
    ----------
    *v
        Expressions to compute horizontal L1.
    normalize
        Whether to divide by the dimension
    """
    if normalize:
        exprs = list(v)
        return pl.sum_horizontal(to_expr(x).abs() for x in exprs) / len(exprs)
    else:
        return pl.sum_horizontal(to_expr(x).abs() for x in v)

l2_sq_horizontal(*v, normalize=False)

Horizontally computes L2 norm squared. Shorthand for pl.sum_horizontal(pl.col(x).pow(2) for x in exprs).

Parameters:

Name Type Description Default
*v str | Expr

Expressions to compute horizontal L2.

()
normalize bool

Whether to divide by the dimension

False
Source code in python/polars_ds/exprs/num.py
def l2_sq_horizontal(*v: str | pl.Expr, normalize: bool = False) -> pl.Expr:
    """
    Horizontally computes L2 norm squared. Shorthand for pl.sum_horizontal(pl.col(x).pow(2) for x in exprs).

    Parameters
    ----------
    *v
        Expressions to compute horizontal L2.
    normalize
        Whether to divide by the dimension
    """
    if normalize:
        exprs = list(v)
        return pl.sum_horizontal(to_expr(x).pow(2) for x in exprs) / len(exprs)
    else:
        return pl.sum_horizontal(to_expr(x).pow(2) for x in v)

l_inf_horizontal(*v, normalize=False)

Horizontally computes L inf norm. Shorthand for pl.max_horizontal(pl.col(x).abs() for x in exprs).

Parameters:

Name Type Description Default
*v str | Expr

Expressions to compute horizontal L infinity.

()
normalize bool

Whether to divide by the dimension

False
Source code in python/polars_ds/exprs/num.py
def l_inf_horizontal(*v: str | pl.Expr, normalize: bool = False) -> pl.Expr:
    """
    Horizontally computes L inf norm. Shorthand for pl.max_horizontal(pl.col(x).abs() for x in exprs).

    Parameters
    ----------
    *v
        Expressions to compute horizontal L infinity.
    normalize
        Whether to divide by the dimension
    """
    if normalize:
        exprs = list(v)
        return pl.max_horizontal(to_expr(x).abs() for x in exprs) / len(exprs)
    else:
        return pl.max_horizontal(to_expr(x).abs() for x in v)
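
The three horizontal norms above work row by row across the given columns. As a plain-Python sketch of that row-wise behaviour, including the `normalize` option that divides by the number of columns, consider the following. The function names are hypothetical and the columns are assumed to be equal-length lists without nulls.

```python
def l1_horizontal_reference(*cols, normalize=False):
    out = [sum(abs(v) for v in row) for row in zip(*cols)]
    return [v / len(cols) for v in out] if normalize else out


def l2_sq_horizontal_reference(*cols, normalize=False):
    out = [sum(v ** 2 for v in row) for row in zip(*cols)]
    return [v / len(cols) for v in out] if normalize else out


def l_inf_horizontal_reference(*cols, normalize=False):
    out = [max(abs(v) for v in row) for row in zip(*cols)]
    return [v / len(cols) for v in out] if normalize else out


a, b = [1.0, -2.0], [-3.0, 4.0]
print(l1_horizontal_reference(a, b), l2_sq_horizontal_reference(a, b), l_inf_horizontal_reference(a, b))
```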

lcm(x, y)

Computes LCM of two integer columns. This will try to cast everything to int32.

Parameters:

Name Type Description Default
x str | Expr

An integer column

required
y int | str | Expr

Either an int, or another integer column

required
Source code in python/polars_ds/exprs/num.py
def lcm(x: str | pl.Expr, y: int | str | pl.Expr) -> pl.Expr:
    """
    Computes LCM of two integer columns. This will try to cast everything to int32.

    Parameters
    ----------
    x
        An integer column
    y
        Either an int, or another integer column
    """
    if isinstance(y, int):
        yy = pl.lit(y, dtype=pl.Int32)
    else:
        yy = to_expr(y).cast(pl.Int32)

    return pl_plugin(
        symbol="pl_lcm",
        args=[to_expr(x).cast(pl.Int32), yy],
        is_elementwise=True,
    )
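
Because inputs are cast to int32, it can help to check whether a result would still fit in that type. The sketch below computes the LCM from the GCD in plain Python. The helper name `lcm_int32`, the zero convention, and the explicit overflow error are assumptions made for illustration; the plugin's own overflow behaviour is not documented here.

```python
import math

INT32_MAX = 2**31 - 1


def lcm_int32(a, b):
    """LCM via the identity lcm(a, b) = |a| / gcd(a, b) * |b|."""
    if a == 0 or b == 0:
        return 0  # assumed convention, matching Python's math.lcm
    result = abs(a) // math.gcd(a, b) * abs(b)
    if result > INT32_MAX:
        # Assumption: surface the overflow instead of wrapping around
        raise OverflowError("LCM does not fit in int32")
    return result


print(lcm_int32(4, 6))   # 12
print(lcm_int32(21, 6))  # 42
```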

list_amax(list_col)

Finds the argmax of the list in this column. This is useful for:

(1) Turning sparse multiclass target into dense target.
(2) Finding the max probability class of a multiclass classification output.
(3) As a shortcut for expr.list.eval(pl.element().arg_max()).

Source code in python/polars_ds/exprs/num.py
def list_amax(list_col: str | pl.Expr) -> pl.Expr:
    """
    Finds the argmax of the list in this column. This is useful for

    (1) Turning sparse multiclass target into dense target.
    (2) Finding the max probability class of a multiclass classification output.
    (3) As a shortcut for expr.list.eval(pl.element().arg_max()).
    """
    return to_expr(list_col).list.eval(pl.element().arg_max())
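
The two use cases above come down to an argmax over each list. A minimal plain-Python sketch follows. The helper name `argmax` and the sample lists are illustrative, and the tie-breaking rule (first maximum wins) is a property of this sketch, not a documented guarantee of list_amax.

```python
def argmax(values):
    """Index of the first maximum, or None for an empty list."""
    if not values:
        return None
    return max(range(len(values)), key=values.__getitem__)


one_hot_target = [0, 0, 1, 0]    # sparse (one-hot) multiclass target
class_probs = [0.1, 0.7, 0.2]    # multiclass classifier output

print(argmax(one_hot_target))  # 2
print(argmax(class_probs))     # 1
```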

list_dot(list1, list2)

Calculates the dot product for two list columns.

Parameters:

Name Type Description Default
list1 str | Expr

The first list column

required
list2 str | Expr

The second list column

required
Source code in python/polars_ds/exprs/num.py
def list_dot(list1: str | pl.Expr, list2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the dot product for two list columns.

    Parameters
    ----------
    list1
        The first list column
    list2
        The second list column
    """
    x, y = to_expr(list1), to_expr(list2)
    return (x * y).list.sum()

list_l1_dist(list1, list2)

Calculates the L1 distance for two list columns.

Parameters:

Name Type Description Default
list1 str | Expr

The first list column

required
list2 str | Expr

The second list column

required
Source code in python/polars_ds/exprs/num.py
def list_l1_dist(list1: str | pl.Expr, list2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the L1 distance for two list columns.

    Parameters
    ----------
    list1
        The first list column
    list2
        The second list column
    """
    x, y = to_expr(list1), to_expr(list2)
    return (x - y).list.eval(pl.element().abs()).list.sum()

list_sql2_dist(list1, list2)

Calculates the squared L2 distance for two list columns.

Parameters:

Name Type Description Default
list1 str | Expr

The first list column

required
list2 str | Expr

The second list column

required
Source code in python/polars_ds/exprs/num.py
def list_sql2_dist(list1: str | pl.Expr, list2: str | pl.Expr) -> pl.Expr:
    """
    Calculates the squared L2 distance for two list columns.

    Parameters
    ----------
    list1
        The first list column
    list2
        The second list column
    """
    x, y = to_expr(list1), to_expr(list2)
    return (x - y).list.eval(pl.element().pow(2)).list.sum()

logit(x)

Applies the logit function to self. Logit(x) = ln(x/(1-x)). Note that logit(0) = -inf, logit(1) = inf, and logit(p) for p < 0 or p > 1 yields nan.

Source code in python/polars_ds/exprs/num.py
def logit(x: str | pl.Expr) -> pl.Expr:
    """
    Applies the logit function to self. Logit(x) = ln(x/(1-x)).
    Note that logit(0) = -inf, logit(1) = inf, and logit(p) for p < 0 or p > 1 yields nan.
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_logit",
        is_elementwise=True,
    )
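
The edge cases listed for logit can be made explicit in a scalar sketch. This is plain Python, not the plugin's implementation, and the name `logit_scalar` is hypothetical. Python's `math.log` raises on 0, so the boundary values are handled by hand to mirror the documented results.

```python
import math


def logit_scalar(p):
    """ln(p / (1 - p)) with the documented edge cases spelled out."""
    if math.isnan(p) or p < 0.0 or p > 1.0:
        return math.nan
    if p == 0.0:
        return -math.inf
    if p == 1.0:
        return math.inf
    return math.log(p / (1.0 - p))


def expit_scalar(x):
    """Inverse of logit: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


print(logit_scalar(0.5))                      # 0.0
print(logit_scalar(0.0), logit_scalar(1.0))   # -inf inf
print(logit_scalar(1.5))                      # nan
```

Applying `expit_scalar` to `logit_scalar(p)` recovers `p` up to rounding for any `p` strictly between 0 and 1.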

next_down(x)

For any float, return the greatest number smaller than itself (within the precision). Integers will be treated as f32. E.g. the next value down for 0.1 is 0.09999999999999999. This is useful when you need to make extremely small changes to certain values and you don't want to add random noise.

Source code in python/polars_ds/exprs/num.py
def next_down(x: str | pl.Expr) -> pl.Expr:
    """
    For any float, return the greatest number smaller than itself (within the precision).
    Integers will be treated as f32. E.g. the next value down for 0.1 is 0.09999999999999999.
    This is useful when you need to make extremely small changes to certain values and you don't
    want to add random noise.
    """
    return pl_plugin(
        symbol="pl_next_down",
        args=[to_expr(x)],
        is_elementwise=True,
    )

next_up(x)

For any float, return the least number greater than itself (within the precision). Integers will be treated as f32. E.g. the next value up for 0.1 is 0.10000000000000002 because of precision issues. This is useful when you need to make extremely small changes to certain values and you don't want to add random noise.

Source code in python/polars_ds/exprs/num.py
def next_up(x: str | pl.Expr) -> pl.Expr:
    """
    For any float, return the least number greater than itself (within the precision).
    Integers will be treated as f32. E.g. the next value up for 0.1 is 0.10000000000000002
    because of precision issues. This is useful when you need to make extremely small changes
    to certain values and you don't want to add random noise.
    """
    return pl_plugin(
        symbol="pl_next_up",
        args=[to_expr(x)],
        is_elementwise=True,
    )
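
The 0.1 examples quoted for next_up and next_down can be reproduced for 64-bit floats with the standard library's `math.nextafter` (Python 3.9+). This only illustrates the concept. The wrapper names below are hypothetical, and the f32 treatment of integers is not modelled.

```python
import math


def next_up_f64(x):
    """Smallest float strictly greater than x."""
    return math.nextafter(x, math.inf)


def next_down_f64(x):
    """Largest float strictly smaller than x."""
    return math.nextafter(x, -math.inf)


print(next_up_f64(0.1))    # 0.10000000000000002
print(next_down_f64(0.1))  # 0.09999999999999999
```

A typical use is nudging a value that sits exactly on a bin edge to the other side of the edge without introducing random noise.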

pca(*features, center=True)

Finds all singular values as well as the principal vectors.

Parameters:

Name Type Description Default
features str | Expr

Feature columns

()
center bool

Whether to center the data or not. If you want to standard normalize, set this to False, and do it for input features by hand.

True
Source code in python/polars_ds/exprs/num.py
def pca(
    *features: str | pl.Expr,
    center: bool = True,
) -> pl.Expr:
    """
    Finds all singular values as well as the principal vectors.

    Parameters
    ----------
    features
        Feature columns
    center
        Whether to center the data or not. If you want to standard normalize, set this to False,
        and do it for input features by hand.
    """
    feats = [to_expr(f) for f in features]
    if center:
        actual_inputs = [f - f.mean() for f in feats]
    else:
        actual_inputs = feats

    return pl_plugin(
        symbol="pl_pca", args=actual_inputs, changes_length=True, pass_name_to_apply=True
    )

principal_components(*features, k=2, center=True)

Transforms the features to get the first k principal components. This returns NaN if the number of rows is less than k.

Parameters:

Name Type Description Default
features str | Expr

Feature columns

()
k int

The number of principal components to return

2
center bool

Whether to center the data or not. If you want to standard normalize, set this to False, and do it for input features by hand.

True
Source code in python/polars_ds/exprs/num.py
def principal_components(
    *features: str | pl.Expr,
    k: int = 2,
    center: bool = True,
) -> pl.Expr:
    """
    Transforms the features to get the first k principal components. This returns NaN if the number
    of rows is less than `k`.

    Parameters
    ----------
    features
        Feature columns
    k
        The number of principal components to return
    center
        Whether to center the data or not. If you want to standard normalize, set this to False,
        and do it for input features by hand.
    """
    feats = [to_expr(f) for f in features]
    if k > len(feats) or k <= 0:
        raise ValueError("Input `k` should be between 1 and the number of features inclusive.")

    actual_inputs = [pl.lit(k, dtype=pl.UInt32)]
    if center:
        actual_inputs.extend(f - f.mean() for f in feats)
    else:
        actual_inputs.extend(feats)

    return pl_plugin(symbol="pl_principal_components", args=actual_inputs, pass_name_to_apply=True)

psi(new, baseline, n_bins=10, return_report=False)

Compute the Population Stability Index between new and the baseline column (usually the feature's historical values). The baseline column will be divided into n_bins quantile bins, which are used as the basis of comparison.

Note this assumes values in new and baseline are continuous. This also removes all infinite, null, and NaN values.

Also note that it will try to create n_bins many unique breakpoints. If input data has < n_bins unique breakpoints, the repeated breakpoints will be grouped together, and the computation will be done with < n_bins many bins. This happens when a single value appears too many times in data. This also differs from the reference implementation by treating breakpoints as right-closed intervals with -inf and inf being the first and last values of the intervals. This is because we need to accommodate all data in the case when actual data's min and the reference data's min are not the same, which is common in reality.

Parameters:

Name Type Description Default
new str | Expr | Iterable[float]

An expression or any iterable that can be turned into a Polars series that represents newly arrived feature values

required
baseline str | Expr | Iterable[float]

An expression or any iterable that can be turned into a Polars series. Usually this should be the feature's historical values

required
n_bins int, > 1

The number of quantile bins to use

10
return_report bool

Whether to return a PSI report or not.

False
Reference

https://github.com/mwburke/population-stability-index/blob/master/psi.py
https://www.listendata.com/2015/05/population-stability-index.html

Source code in python/polars_ds/exprs/num.py
def psi(
    new: str | pl.Expr | Iterable[float],
    baseline: str | pl.Expr | Iterable[float],
    n_bins: int = 10,
    return_report: bool = False,
) -> pl.Expr:
    """
    Compute the Population Stability Index between new and the baseline column (usually the feature's
    historical values). The baseline column will be divided into n_bins quantile bins, which are used
    as the basis of comparison.

    Note this assumes values in new and baseline are continuous. This also removes all infinite, null,
    and NaN values.

    Also note that it will try to create `n_bins` many unique breakpoints. If input data has < n_bins
    unique breakpoints, the repeated breakpoints will be grouped together, and the computation will be done
    with < `n_bins` many bins. This happens when a single value appears too many times in data. This also
    differs from the reference implementation by treating breakpoints as right-closed intervals with -inf
    and inf being the first and last values of the intervals. This is because we need to accommodate all data
    in the case when actual data's min and the reference data's min are not the same, which is common in reality.

    Parameters
    ----------
    new
        An expression or any iterable that can be turned into a Polars series that represents newly
        arrived feature values
    baseline
        An expression or any iterable that can be turned into a Polars series. Usually this should
        be the feature's historical values
    n_bins : int, > 1
        The number of quantile bins to use
    return_report
        Whether to return a PSI report or not.

    Reference
    ---------
    https://github.com/mwburke/population-stability-index/blob/master/psi.py
    https://www.listendata.com/2015/05/population-stability-index.html
    """
    if n_bins <= 1:
        raise ValueError("Input `n_bins` must be >= 2.")

    if isinstance(new, (str, pl.Expr)):
        new_ = to_expr(new)
        valid_new = new_.filter(new_.is_finite()).cast(pl.Float64)
    else:
        temp = pl.Series(values=new, dtype=pl.Float64)
        valid_new = pl.lit(temp.filter(temp.is_finite()))

    if isinstance(baseline, (str, pl.Expr)):
        base = to_expr(baseline)
        valid_ref = base.filter(base.is_finite()).cast(pl.Float64)
    else:
        temp = pl.lit(pl.Series(values=baseline, dtype=pl.Float64))
        valid_ref = temp.filter(temp.is_finite())

    vc = (
        valid_ref.qcut(n_bins, left_closed=False, allow_duplicates=True, include_breaks=True)
        .struct.field("breakpoint")
        .value_counts()
        .sort()
    )

    # breakpoints learned from ref
    brk = vc.struct.field("breakpoint")  # .cast(pl.Float64)
    # counts of points in the buckets
    cnt_ref = vc.struct.field("count")  # .cast(pl.UInt32)
    psi_report = pl_plugin(
        symbol="pl_psi_report",
        args=[valid_new, brk, cnt_ref],
        changes_length=True,
    ).alias("psi_report")
    if return_report:
        return psi_report

    return psi_report.struct.field("psi_bin").sum()

psi_discrete(new, baseline, return_report=False)

Compute the Population Stability Index between new (actual) and the baseline column. The baseline column will be used as categories which are the basis of comparison.

Note this assumes that new and baseline are discrete columns (e.g. str categories). This will treat each value as a distinct category and null will be treated as a category by itself. If a category exists in new but not in baseline, the percentage will be imputed by 0.0001. If you do not wish to include new distinct values in PSI calculation, you can still compute the PSI by generating the report and filtering.

Also note that discrete columns must have the same type in order to be considered the same.

Parameters:

Name Type Description Default
new str | Expr | Iterable[float]

The feature

required
baseline str | Expr | Iterable[float]

An expression, or any iterable that can be turned into a Polars series. Usually this should be the historical values

required
return_report bool

Whether to return a PSI report or not.

False
Reference

https://www.listendata.com/2015/05/population-stability-index.html

Source code in python/polars_ds/exprs/num.py
def psi_discrete(
    new: str | pl.Expr | Iterable[float],
    baseline: str | pl.Expr | Iterable[float],
    return_report: bool = False,
) -> pl.Expr:
    """
    Compute the Population Stability Index between new (actual) and the baseline column. The baseline
    column will be used as categories which are the basis of comparison.

    Note this assumes that new and baseline are discrete columns (e.g. str categories). This will
    treat each value as a distinct category and null will be treated as a category by itself. If a category
    exists in new but not in baseline, the percentage will be imputed by 0.0001. If you do not wish to include
    new distinct values in PSI calculation, you can still compute the PSI by generating the report and filtering.

    Also note that discrete columns must have the same type in order to be considered the same.

    Parameters
    ----------
    new
        The feature
    baseline
        An expression, or any iterable that can be turned into a Polars series. Usually this should
        be the historical values
    return_report
        Whether to return a PSI report or not.

    Reference
    ---------
    https://www.listendata.com/2015/05/population-stability-index.html
    """
    if isinstance(new, (str, pl.Expr)):
        new_ = to_expr(new)
        temp = new_.value_counts().struct.rename_fields(["", "count"])
        new_cnt = temp.struct.field("count")
        new_cat = temp.struct.field("")
    else:
        temp = pl.Series(values=new)
        temp = temp.value_counts()  # This is a df in this case
        new_cnt = pl.lit(temp.drop_in_place("count"))
        new_cat = pl.lit(temp[temp.columns[0]])

    if isinstance(baseline, (str, pl.Expr)):
        base = to_expr(baseline)
        temp = base.value_counts().struct.rename_fields(["", "count"])
        ref_cnt = temp.struct.field("count")
        ref_cat = temp.struct.field("")
    else:
        temp = pl.Series(values=baseline)
        temp = temp.value_counts()  # This is a df in this case
        ref_cnt = pl.lit(temp.drop_in_place("count"))
        ref_cat = pl.lit(temp[temp.columns[0]])

    psi_report = pl_plugin(
        symbol="pl_psi_discrete_report",
        args=[new_cat, new_cnt, ref_cat, ref_cnt],
        changes_length=True,
    )
    if return_report:
        return psi_report

    return psi_report.struct.field("psi_bin").sum()
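
For orientation, the standard PSI formula sums (new_pct - ref_pct) * ln(new_pct / ref_pct) over categories. The plain-Python sketch below applies it to discrete values. The function name `psi_discrete_sketch` is hypothetical, and applying the 0.0001 imputation symmetrically (also to categories missing from new) is an assumption; the text above only states it for categories missing from baseline.

```python
import math
from collections import Counter


def psi_discrete_sketch(new, baseline, floor=1e-4):
    """PSI over categories; None is counted as its own category."""
    new_counts, ref_counts = Counter(new), Counter(baseline)
    total = 0.0
    for category in set(new_counts) | set(ref_counts):
        new_pct = new_counts[category] / len(new)
        ref_pct = ref_counts[category] / len(baseline)
        # Assumption: impute a small percentage when a category is absent
        new_pct = new_pct if new_pct > 0 else floor
        ref_pct = ref_pct if ref_pct > 0 else floor
        total += (new_pct - ref_pct) * math.log(new_pct / ref_pct)
    return total


baseline = ["a", "a", "b", "b"]
new = ["a", "a", "a", "b"]
print(psi_discrete_sketch(new, baseline))       # about 0.2747, i.e. 0.25 * ln(3)
print(psi_discrete_sketch(baseline, baseline))  # 0.0
```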

psi_w_breakpoints(new, baseline, breakpoints)

Creates a PSI report using the custom breakpoints.

Parameters:

Name Type Description Default
new str | Expr | Iterable[float]

The data representing the new observed data. Any sequence of numerical values that can be turned into a Polars series, or an expression representing a column, will work.

required
baseline str | Expr | Iterable[float]

The data representing the baseline data. Any sequence of numerical values that can be turned into a Polars series, or an expression representing a column, will work.

required
breakpoints List[float]

The data that represents breakpoints. Input must be sorted, distinct, finite numeric values. This function will not cleanse the breakpoints for the user. E.g. [0.1, 0.5, 0.9] will create four bins: (-inf, 0.1], (0.1, 0.5], (0.5, 0.9] and (0.9, inf). Please do not pass inf or NaN values as breakpoints.

required
Source code in python/polars_ds/exprs/num.py
def psi_w_breakpoints(
    new: str | pl.Expr | Iterable[float],
    baseline: str | pl.Expr | Iterable[float],
    breakpoints: List[float],
) -> pl.Expr:
    """
    Creates a PSI report using the custom breakpoints.

    Parameters
    ----------
    new
        The data representing the new observed data. Any sequence of numerical values that
        can be turned into a Polars series, or an expression representing a column, will work.
    baseline
        The data representing the baseline data. Any sequence of numerical values that
        can be turned into a Polars series, or an expression representing a column, will work.
    breakpoints
        The data that represents breakpoints. Input must be sorted, distinct, finite numeric values.
        This function will not cleanse the breakpoints for the user. E.g. [0.1, 0.5, 0.9] will create
        four bins: (-inf, 0.1], (0.1, 0.5], (0.5, 0.9] and (0.9, inf). Please do not pass inf or NaN values
        as breakpoints.
    """
    if isinstance(baseline, (str, pl.Expr)):
        x: pl.Expr = to_expr(baseline)
        x = x.filter(x.is_finite())
    else:
        temp = pl.Series(values=baseline)
        x: pl.Expr = pl.lit(temp.filter(temp.is_finite()))

    if isinstance(new, (str, pl.Expr)):
        y: pl.Expr = to_expr(new)
        y = y.filter(y.is_finite())
    else:
        temp = pl.Series(values=new)
        y: pl.Expr = pl.lit(temp.filter(temp.is_finite()))

    if len(breakpoints) == 0:
        raise ValueError("Breakpoints is empty.")

    bp = breakpoints + [float("inf")]
    return pl_plugin(
        symbol="pl_psi_w_bps",
        args=[x.rechunk(), y.rechunk(), pl.Series(values=bp)],
        changes_length=True,
    ).alias("psi_report")
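
How sorted breakpoints turn into right-closed bins can be sketched with the standard library's `bisect_left`, which returns the index of the first breakpoint that is greater than or equal to a value. The helper name `bin_fractions` and the sample data are illustrative assumptions; this is not the plugin's code.

```python
import math
from bisect import bisect_left


def bin_fractions(values, breakpoints):
    """Share of finite values falling into each right-closed bin."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    counts = [0] * (len(breakpoints) + 1)  # one extra bin for (last, inf)
    for v in finite:
        # breakpoints[i - 1] < v <= breakpoints[i]  ->  bin i
        counts[bisect_left(breakpoints, v)] += 1
    return [c / len(finite) for c in counts]


breakpoints = [0.1, 0.5, 0.9]
baseline = [0.05, 0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 2.0, float("inf"), None]
print(bin_fractions(baseline, breakpoints))  # [0.25, 0.25, 0.25, 0.25]
```

Values that sit exactly on a breakpoint (0.1, 0.5, 0.9) land in the bin that ends at that breakpoint, which is what right-closed means here.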

rfft(series, n=None, return_full=False)

Computes the discrete Fourier transform (DFT) of a real-valued input series using the FFT algorithm. Note that by default a series of length (length // 2 + 1) will be returned.

Parameters:

Name Type Description Default
series str | Expr

Input real series

required
n int | None

The number of points to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input is used.

None
return_full bool

If true, output will have the same length as determined by n.

False
Source code in python/polars_ds/exprs/num.py
def rfft(series: str | pl.Expr, n: int | None = None, return_full: bool = False) -> pl.Expr:
    """
    Computes the discrete Fourier transform (DFT) of a real-valued input series using the FFT
    algorithm. Note that by default a series of length (length // 2 + 1) will be returned.

    Parameters
    ----------
    series
        Input real series
    n
        The number of points to use. If n is smaller than the length of the input,
        the input is cropped. If it is larger, the input is padded with zeros.
        If n is not given, the length of the input is used.
    return_full
        If true, output will have the same length as determined by n.
    """
    if n is not None and n <= 1:
        raise ValueError("Input `n` should be > 1.")

    full = pl.lit(return_full, pl.Boolean)
    nn = pl.lit(n, pl.UInt32)
    x: pl.Expr = to_expr(series).cast(pl.Float64)
    return pl_plugin(symbol="pl_rfft", args=[x, nn, full], changes_length=True)
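
To make the output length and the crop/pad behaviour of `n` concrete, here is a naive O(n^2) DFT in plain Python. It is not an FFT and not the plugin's implementation. The name `rfft_sketch` is hypothetical, and the output is a list of Python complex numbers, which may differ from how the plugin represents complex values.

```python
import cmath


def rfft_sketch(values, n=None, return_full=False):
    """Naive DFT of a real sequence, cropped or zero-padded to n points."""
    n = len(values) if n is None else n
    x = list(values[:n])            # crop when n is smaller than the input
    x += [0.0] * (n - len(x))       # zero-pad when n is larger
    n_out = n if return_full else n // 2 + 1
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n_out)
    ]


signal = [1.0, 0.0, 0.0, 0.0]
print(len(rfft_sketch(signal)))                    # 3
print(len(rfft_sketch(signal, return_full=True)))  # 4
print(len(rfft_sketch(signal, n=8)))               # 5
```

For real input, the remaining n - (n // 2 + 1) coefficients are complex conjugates of the ones already returned, which is why the shorter output loses no information.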

sinc(x)

Computes the sinc function normalized by pi.

Source code in python/polars_ds/exprs/num.py
def sinc(x: str | pl.Expr) -> pl.Expr:
    """
    Computes the sinc function normalized by pi.
    """
    xx = to_expr(x)
    y = math.pi * pl.when(xx == 0).then(1e-20).otherwise(xx)
    return y.sin() / y
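
The normalized sinc is sin(pi x) / (pi x), with the value 1 at x = 0. The source above avoids the division by zero by substituting a tiny value for 0. The scalar sketch below, with the hypothetical name `sinc_scalar`, uses an explicit branch instead.

```python
import math


def sinc_scalar(x):
    """Normalized sinc: sin(pi * x) / (pi * x), defined as 1 at x = 0."""
    if x == 0:
        return 1.0
    y = math.pi * x
    return math.sin(y) / y


print(sinc_scalar(0.0))  # 1.0
print(sinc_scalar(0.5))  # about 0.6366, i.e. 2 / pi
```

At nonzero integers the result is zero up to floating-point rounding, since sin(pi * k) is only approximately 0 in floating point.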

singular_values(*features, center=True, as_explained_var=False, as_ratio=False)

Finds all principal values (singular values) for the data matrix formed by the given features and returns them in descending order.

Note: if a row has null values, it will be dropped.

Parameters:

Name Type Description Default
features str | Expr

Feature columns

()
center bool

Whether to center the data or not. If you want to standard-normalize, set this to False, and do it for input features by hand.

True
as_explained_var bool

If true, return the explained variance, which is singular_value ^ 2 / (n_samples - 1)

False
as_ratio bool

If true, normalize output to between 0 and 1.

False

Source code in python/polars_ds/exprs/num.py
def singular_values(
    *features: str | pl.Expr,
    center: bool = True,
    as_explained_var: bool = False,
    as_ratio: bool = False,
) -> pl.Expr:
    """
    Finds all principal values (singular values) for the data matrix formed by the given features
    and returns them in descending order.

    Note: if a row has null values, it will be dropped.

    Parameters
    ----------
    features
        Feature columns
    center
        Whether to center the data or not. If you want to standard-normalize, set this to False,
        and do it for input features by hand.
    as_explained_var
        If true, return the explained variance, which is singular_value ^ 2 / (n_samples - 1)
    as_ratio
        If true, normalize output to between 0 and 1.
    """
    feats = [to_expr(f) for f in features]
    if center:
        actual_inputs = [f - f.mean() for f in feats]
    else:
        actual_inputs = feats

    out = pl_plugin(symbol="pl_singular_values", args=actual_inputs, returns_scalar=True)
    if as_explained_var:
        out = out.list.eval(pl.element().pow(2) / (pl.count() - 1))
    if as_ratio:
        out = out.list.eval(pl.element() / pl.element().sum())

    return out
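
The two post-processing options follow directly from the docstring formulas. Below is a plain-Python sketch of the arithmetic, assuming the singular values and the number of samples are already known; both helper names are hypothetical.

```python
def explained_variance(singular_values, n_samples):
    """singular_value ^ 2 / (n_samples - 1), as described in the docstring."""
    return [s * s / (n_samples - 1) for s in singular_values]


def to_ratio(values):
    """Rescale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


singular = [4.0, 2.0]   # made-up singular values, in descending order
variances = explained_variance(singular, n_samples=5)
print(variances)            # [4.0, 1.0]
print(to_ratio(variances))  # [0.8, 0.2]
```

As in the source above, the ratio step is applied after the explained-variance step, so enabling both yields explained-variance ratios.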

softmax(x)

Applies the softmax function to the column, which turns any real valued column into valid probability values. This is simply a shorthand for x.exp() / x.exp().sum() for expressions x.

Parameters:

Name Type Description Default
x str | Expr

Either a str representing a column name or a Polars expression

required

Source code in python/polars_ds/exprs/num.py
def softmax(x: str | pl.Expr) -> pl.Expr:
    """
    Applies the softmax function to the column, which turns any real valued column into valid probability
    values. This is simply a shorthand for x.exp() / x.exp().sum() for expressions x.

    Parameters
    ----------
    x
        Either a str representing a column name or a Polars expression
    """
    xx = to_expr(x)
    return xx.exp() / (xx.exp().sum())
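
A plain-Python sketch of the same formula is given below, together with a common variant that subtracts the maximum before exponentiating. The variant is not what the source above does; it is shown only because exp() overflows for large inputs, while the shifted form gives the same probabilities. Both function names are hypothetical.

```python
import math


def softmax_naive(values):
    """exp(x) / sum(exp(x)), exactly as in the shorthand."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_shifted(values):
    """Same result, but exp() never sees a positive argument."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


print(softmax_naive([0.0, 0.0]))          # [0.5, 0.5]
print(softmax_shifted([1000.0, 1000.0]))  # [0.5, 0.5]
```

In plain Python, `softmax_naive([1000.0, 1000.0])` raises OverflowError, which is why the shifted form is often preferred for large inputs.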

target_encode(s, target, min_samples_leaf=20, smoothing=10.0)

Compute information necessary to target encode a string column.

Note: nulls will be encoded as well.

Parameters:

Name Type Description Default
s str | Expr

The string column to encode

required
target str | Expr | Iterable[int]

The target column. Should be 0s and 1s.

required
min_samples_leaf int

A regularization factor

20
smoothing float

Smoothing effect to balance categorical average vs prior

10.0
Reference

https://contrib.scikit-learn.org/category_encoders/targetencoder.html

Source code in python/polars_ds/exprs/num.py
def target_encode(
    s: str | pl.Expr,
    target: str | pl.Expr | Iterable[int],
    min_samples_leaf: int = 20,
    smoothing: float = 10.0,
) -> pl.Expr:
    """
    Compute information necessary to target encode a string column.

    Note: nulls will be encoded as well.

    Parameters
    ----------
    s
        The string column to encode
    target
        The target column. Should be 0s and 1s.
    min_samples_leaf
        A regularization factor
    smoothing
        Smoothing effect to balance categorical average vs prior

    Reference
    ---------
    https://contrib.scikit-learn.org/category_encoders/targetencoder.html
    """
    if isinstance(target, (str, pl.Expr)):
        t = to_expr(target)
    else:
        t = pl.lit(pl.Series(values=target))
    return pl_plugin(
        symbol="pl_target_encode",
        args=[to_expr(s), t, t.mean()],
        kwargs={"min_samples_leaf": float(min_samples_leaf), "smoothing": smoothing},
        changes_length=True,
    )
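
The roles of min_samples_leaf and smoothing are easier to see in a small sketch. The code below assumes the sigmoid blending used by the referenced category_encoders TargetEncoder: weight = 1 / (1 + exp(-(n - min_samples_leaf) / smoothing)). Whether the plugin matches this formula term by term is an assumption, and the function name `target_encode_sketch` is hypothetical.

```python
import math
from collections import defaultdict


def target_encode_sketch(categories, target, min_samples_leaf=20, smoothing=10.0):
    """Blend each category's target mean with the global prior."""
    prior = sum(target) / len(target)
    sums, counts = defaultdict(float), defaultdict(int)
    for category, t in zip(categories, target):
        sums[category] += t
        counts[category] += 1

    encoded = {}
    for category, n in counts.items():
        # Assumed formula: more samples -> weight closer to 1 -> trust the category mean
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoded[category] = prior * (1.0 - weight) + (sums[category] / n) * weight
    return encoded


cats = ["a", "a", "b", "b"]
y = [1, 1, 0, 0]
print(target_encode_sketch(cats, y, min_samples_leaf=2, smoothing=1.0))
# {'a': 0.75, 'b': 0.25}
```

With a category count equal to min_samples_leaf the weight is exactly 0.5, so the encoding is the midpoint between the prior (0.5 here) and the category mean.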

trunc(x)

Returns the integer part of the input values. E.g. integer part of 1.1 is 1.0

Source code in python/polars_ds/exprs/num.py
def trunc(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the integer part of the input values. E.g. integer part of 1.1 is 1.0
    """
    return pl_plugin(
        args=[to_expr(x)],
        symbol="pl_trunc",
        is_elementwise=True,
    )

woe(x, target, n_bins=10)

Compute the Weight of Evidence for x with respect to target. This assumes x is continuous. A value of 1 is added to all events/non-events (goods/bads) to smooth the computation.

Currently only quantile binning strategy is implemented.

Parameters:

Name Type Description Default
x str | Expr

The feature

required
target str | Expr | Iterable[float]

The target variable. Should be 0s and 1s.

required
n_bins int

The number of bins to bin the variable.

10
Reference

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Source code in python/polars_ds/exprs/num.py
def woe(x: str | pl.Expr, target: str | pl.Expr | Iterable[float], n_bins: int = 10) -> pl.Expr:
    """
    Compute the Weight of Evidence for x with respect to target. This assumes x
    is continuous. A value of 1 is added to all events/non-events
    (goods/bads) to smooth the computation.

    Currently only quantile binning strategy is implemented.

    Parameters
    ----------
    x
        The feature
    target
        The target variable. Should be 0s and 1s.
    n_bins
        The number of bins to bin the variable.

    Reference
    ---------
    https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
    """
    if isinstance(target, (str, pl.Expr)):
        t = to_expr(target)
    else:
        t = pl.Series(values=target)
    xx = to_expr(x)
    valid = xx.filter(xx.is_finite())
    brk = valid.qcut(n_bins, left_closed=False, allow_duplicates=True).cast(pl.String)
    return pl_plugin(symbol="pl_woe_discrete", args=[brk, t], changes_length=True)

woe_discrete(x, target)

Compute the Weight of Evidence for x with respect to target. This assumes x is discrete and castable to String. A value of 1 is added to all events/non-events (goods/bads) to smooth the computation.

Parameters:

Name Type Description Default
x str | Expr

The feature

required
target str | Expr | Iterable[int]

The target variable. Should be 0s and 1s.

required
Reference

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Source code in python/polars_ds/exprs/num.py
def woe_discrete(
    x: str | pl.Expr,
    target: str | pl.Expr | Iterable[int],
) -> pl.Expr:
    """
    Compute the Weight of Evidence for x with respect to target. This assumes x
    is discrete and castable to String. A value of 1 is added to all events/non-events
    (goods/bads) to smooth the computation.

    Parameters
    ----------
    x
        The feature
    target
        The target variable. Should be 0s and 1s.

    Reference
    ---------
    https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
    """
    if isinstance(target, (str, pl.Expr)):
        t = to_expr(target)
    else:
        t = pl.Series(values=target)
    return pl_plugin(
        symbol="pl_woe_discrete",
        args=[to_expr(x).cast(pl.String), t],
        changes_length=True,
    )
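
A rough plain-Python sketch of a smoothed Weight of Evidence per category follows. It makes several assumptions that the text above does not settle: events are rows with target 1, the +1 is added to each per-category count, the denominators are the unsmoothed totals, and the sign follows the linked reference, ln(% non-events / % events). The plugin may differ on any of these, and the function name `woe_sketch` is hypothetical.

```python
import math
from collections import defaultdict


def woe_sketch(categories, target):
    """Smoothed WoE per category under the assumptions stated above."""
    events, non_events = defaultdict(int), defaultdict(int)
    for category, t in zip(categories, target):
        if t == 1:
            events[category] += 1
        else:
            non_events[category] += 1

    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    out = {}
    for category in set(events) | set(non_events):
        pct_events = (events[category] + 1) / total_events
        pct_non_events = (non_events[category] + 1) / total_non_events
        out[category] = math.log(pct_non_events / pct_events)
    return out


x = ["a", "a", "a", "b"]
y = [0, 0, 1, 1]
print(woe_sketch(x, y))  # 'a' is about 0.405 (ln 1.5), 'b' is about -0.693 (ln 0.5)
```

The +1 smoothing keeps the logarithm finite for category "b", which has no non-events at all.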

xlogy(x, y)

Computes x * log(y) so that if x = 0, the product is 0.

Parameters:

Name Type Description Default
x str | Expr

A numerical column

required
y str | Expr

A numerical column

required
Source code in python/polars_ds/exprs/num.py
def xlogy(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Computes x * log(y) so that if x = 0, the product is 0.

    Parameters
    ----------
    x
        A numerical column
    y
        A numerical column
    """
    return pl_plugin(
        args=[to_expr(x).cast(pl.Float64), to_expr(y).cast(pl.Float64)],
        symbol="pl_xlogy",
        is_elementwise=True,
    )
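
The x = 0 convention matters most in entropy-style sums, where 0 * log(0) should contribute 0 rather than NaN. Below is a scalar sketch in plain Python; the names `xlogy_scalar` and `entropy` are hypothetical, and NaN handling is not modelled.

```python
import math


def xlogy_scalar(x, y):
    """x * log(y), defined as 0 whenever x == 0."""
    if x == 0:
        return 0.0
    return x * math.log(y)


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(xlogy_scalar(p, p) for p in probabilities)


print(xlogy_scalar(0.0, 0.0))     # 0.0
print(entropy([0.5, 0.5, 0.0]))   # about 0.6931, i.e. ln(2)
```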

z_normalize(x)

Z-normalizes the column.

This is only a shortcut for a standard feature transform, and is not recommended for settings where the means/stds need to be persisted.

Source code in python/polars_ds/exprs/num.py
def z_normalize(x: str | pl.Expr) -> pl.Expr:
    """
    Z-normalizes the column.

    This is only a shortcut for a standard feature transform, and is not recommended
    for settings where the means/stds need to be persisted.
    """
    xx = to_expr(x)
    mean = xx.mean()
    std = xx.std()
    return (xx - mean) / std
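
The persistence caveat is easier to see when fitting and applying are separate steps. The plain-Python sketch below stores the mean and the sample standard deviation (ddof = 1, which is also the default of Polars' std) and reuses them on new data. The helper names `fit_z` and `apply_z` are hypothetical.

```python
import statistics


def fit_z(values):
    """Return the (mean, sample std) pair that should be persisted."""
    return statistics.fmean(values), statistics.stdev(values)


def apply_z(values, mean, std):
    """Z-normalize with previously fitted statistics."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0]
mean, std = fit_z(train)
print(apply_z(train, mean, std))   # [-1.0, 0.0, 1.0]
print(apply_z([4.0], mean, std))   # [2.0], scaled with the training statistics
```

Calling z_normalize on new data would instead recompute the mean and std from that data, which is why it is unsuitable when training-time statistics must be reused.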