Time Series Features

Feature Engineering Queries and Time Series Features

Time series features.

Functions:

Name Description
query_abs_energy

Absolute energy is defined as Sum(x_i^2).

query_approx_entropy

Approximate sample entropies of a time series given the filtering level. It is highly recommended that the user impute nulls before calling this.

query_ar_coeffs

Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the output.

query_auto_corr

Computes the auto correlation with the given lag.

query_avg_streak

Finds the average streak length where the condition `where` is true. The average is taken on the true set.

query_c3_stats

Measure of non-linearity in the time series using c3 statistics.

query_cid_ce

Estimates the time series complexity.

query_cond_entropy

Queries the conditional entropy of x on y, aka. H(x|y).

query_cond_indep

Computes the conditional independence of x and y, conditioned on z.

query_copula_entropy

Estimates Copula Entropy via rank statistics.

query_count_uniques

Returns the count of unique values.

query_cv

Returns the coefficient of variation for the variable. This is a shorthand for std / mean.

query_entropy

Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()

query_first_digit_cnt

Finds the first digit count in the data. This is closely related to Benford's law, which states that the first digits (1-9) follow a certain distribution.

query_knn_entropy

Computes KNN entropy among all the rows.

query_lempel_ziv

Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.

query_longest_streak

Finds the longest streak length where the condition `where` is true.

query_mean_abs_change

Returns the mean of all successive differences |X_i - X_i-1|

query_mean_n_abs_max

Returns the average of the top n_maxima of |x|.

query_mid_range

A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.

query_permute_entropy

Computes permutation entropy.

query_range_count

Returns the number of values inside [lower, upper].

query_sample_entropy

Calculate the sample entropy of this column. It is highly recommended that the user impute nulls before calling this.

query_similar_count

Given a query subsequence, find the number of same-sized subsequences (windows) in target series that have distance < threshold from it.

query_streak

Finds the streak length where the condition `where` is true. This returns a full column of streak lengths.

query_symm_ratio

Returns the symmetric ratio: |mean - median| / (max - min). Note the closer to 0 this value is,

query_time_reversal_asymmetry_stats

Queries the Time Reversal Asymmetry Statistic, which is the average of

query_transfer_entropy

Estimating transfer entropy from source to x with a lag

query_abs_energy(x)

Absolute energy is defined as Sum(x_i^2).

Source code in python/polars_ds/exprs/ts_features.py
def query_abs_energy(x: str | pl.Expr) -> pl.Expr:
    """
    Absolute energy is defined as Sum(x_i^2).
    """
    y = to_expr(x)
    return y.dot(y)
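
As a quick cross-check of the definition, the same quantity can be computed in plain Python. This is only a sketch: `abs_energy` is a hypothetical helper written for illustration, not part of the library.

```python
def abs_energy(values):
    """Sum of squared values, i.e. the dot product of the series with itself."""
    return sum(v * v for v in values)


print(abs_energy([1.0, -2.0, 3.0]))  # 1 + 4 + 9 = 14.0
```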

query_approx_entropy(ts, m, filtering_level, scale_by_std=True, parallel=True)

Approximate sample entropies of a time series given the filtering level. It is highly recommended that the user impute nulls before calling this.

If NaN is returned or an error is thrown, the likely causes are:

(1) Too little data, e.g. m + 1 > length.
(2) filtering_level or (filtering_level * std) is too close to 0, or std is null/NaN.

Parameters:

Name Type Description Default
ts str | Expr

A time series

required
m int

Length of compared runs of data. This is m in the wikipedia article.

required
filtering_level float

Filtering level, must be positive. This is r in the wikipedia article.

required
scale_by_std bool

Whether to scale filter level by std of data. In most applications, this is the default behavior, but not in some other cases.

True
parallel bool

Whether to run this in parallel or not. This is recommended when you are running only this expression, and not in group_by context.

True
Reference

https://en.wikipedia.org/wiki/Approximate_entropy

Source code in python/polars_ds/exprs/ts_features.py
def query_approx_entropy(
    ts: str | pl.Expr,
    m: int,
    filtering_level: float,
    scale_by_std: bool = True,
    parallel: bool = True,
) -> pl.Expr:
    """
    Approximate sample entropies of a time series given the filtering level. It is highly
    recommended that the user impute nulls before calling this.

    If NaN/some error is returned/thrown, it is likely that:
    (1) Too little data, e.g. m + 1 > length
    (2) filtering_level or (filtering_level * std) is too close to 0 or std is null/NaN.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    m : int
        Length of compared runs of data. This is `m` in the wikipedia article.
    filtering_level : float
        Filtering level, must be positive. This is `r` in the wikipedia article.
    scale_by_std : bool
        Whether to scale filter level by std of data. In most applications, this is the default
        behavior, but not in some other cases.
    parallel : bool
        Whether to run this in parallel or not. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Approximate_entropy
    """

    if filtering_level <= 0 or m <= 1:
        raise ValueError("Filter level must be positive and m must be > 1.")

    t = to_expr(ts)
    if scale_by_std:
        r: pl.Expr = filtering_level * t.std()
    else:
        r: pl.Expr = pl.lit(filtering_level, dtype=pl.Float64)

    rows = t.len() - m + 1
    data = [r, t.slice(0, length=rows).cast(pl.Float64)]
    # See rust code for more comment on why I put m + 1 here.
    data.extend(
        t.shift(-i).slice(0, length=rows).cast(pl.Float64).alias(str(i)) for i in range(1, m + 1)
    )
    # More errors are handled in Rust
    return pl_plugin(
        symbol="pl_approximate_entropy",
        args=data,
        kwargs={
            "k": 0,
            "metric": "inf",
            "parallel": parallel,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )
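
To make the roles of `m` and `r` concrete, here is a minimal plain-Python sketch of the textbook approximate entropy from the referenced Wikipedia article. It is an assumption that the Rust plugin follows the same conventions (Chebyshev distance, self-matches counted, `<=` comparison); `approx_entropy` is a hypothetical helper, and `r` is passed directly, which corresponds to `scale_by_std=False`.

```python
import math


def approx_entropy(series, m, r):
    """Textbook ApEn(m, r) = phi(m) - phi(m + 1), using the Chebyshev distance."""
    n = len(series)

    def phi(width):
        # All overlapping templates of the given width.
        templates = [series[i:i + width] for i in range(n - width + 1)]
        total = 0.0
        for a in templates:
            # Count templates within tolerance r of `a` (the self-match is included).
            close = sum(
                1
                for b in templates
                if max(abs(p - q) for p, q in zip(a, b)) <= r
            )
            total += math.log(close / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


# A perfectly periodic series is highly regular, so the value is very close to 0.
periodic = [85, 80, 89] * 17
print(approx_entropy(periodic, m=2, r=3))
```

The double loop is quadratic in the series length, so this sketch is only suitable for small inputs.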

query_ar_coeffs(x, lag, add_bias=True, null_policy='raise')

Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the output.

Parameters:

Name Type Description Default
x str | Expr

The feature

required
lag int

The lag

required
add_bias bool

Whether to add a bias/intercept term

True
null_policy NullPolicy

One of "raise", "one", "zero", or a finite numeric string.

'raise'
Source code in python/polars_ds/exprs/ts_features.py
def query_ar_coeffs(
    x: str | pl.Expr, lag: int, add_bias: bool = True, null_policy: NullPolicy = "raise"
) -> pl.Expr:
    """
    Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the
    output.

    Parameters
    ----------
    x
        The feature
    lag
        The lag
    add_bias
        Whether to add a bias/intercept term
    null_policy
        One of "raise", "one", "zero", or a finite numeric string.
    """

    if null_policy not in ("raise", "one", "zero"):
        try:
            import math

            z = float(null_policy)
            if not math.isfinite(z):
                raise
        except Exception:
            raise ValueError(
                "`null_policy` must be 'raise', 'one', 'zero' or any finite numeric string for AR coefficients."
            )

    if lag <= 0:
        raise ValueError("`lag` must be > 0.")

    from . import lin_reg

    xx = to_expr(x)
    return lin_reg(
        *[xx.shift(i).slice(offset=lag).alias(str(i)) for i in range(1, lag + 1)],
        target=xx.slice(offset=lag),
        add_bias=add_bias,
        null_policy=null_policy,
    )
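
The source above regresses the series on its own lagged copies. For `lag=1` with a bias term, this reduces to a simple linear regression of x_t on x_{t-1}, which the following plain-Python sketch spells out. `ar1_coeffs` is a hypothetical helper, and it returns the coefficient first and the intercept last to mirror the ordering described above.

```python
def ar1_coeffs(series):
    """Least-squares fit of x_t = a * x_{t-1} + b; returns (a, b)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    # Closed-form slope and intercept of a one-variable regression.
    cov = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    var = sum((p - mean_p) ** 2 for p in prev)
    a = cov / var
    b = mean_c - a * mean_p
    return a, b


# Generated by x_t = 0.5 * x_{t-1} + 1, starting from 0.
xs = [0.0, 1.0, 1.5, 1.75, 1.875]
print(ar1_coeffs(xs))  # approximately (0.5, 1.0)
```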

query_auto_corr(x, lag, ddof=0, normalize=True)

Computes the auto correlation with the given lag.

Parameters:

Name Type Description Default
x str | Expr

The feature

required
lag int

The lag

required
ddof int

The ddof for the variance

0
normalize bool

Whether to normalize the value to [-1, 1] or not.

True
Source code in python/polars_ds/exprs/ts_features.py
def query_auto_corr(x: str | pl.Expr, lag: int, ddof: int = 0, normalize: bool = True) -> pl.Expr:
    """
    Computes the auto correlation with the given lag.

    Parameters
    ----------
    x
        The feature
    lag
        The lag
    ddof
        The ddof for the variance
    normalize
        Whether to normalize the value to [-1, 1] or not.
    """
    xx = to_expr(x)
    if normalize:
        x_m = xx - xx.mean()
        var = xx.var(ddof=ddof)
        n = pl.len()
        n_minus_lag = pl.when(n < lag).then(float("nan")).otherwise(n - lag)
        return x_m.dot(x_m.shift(-lag)) / (n_minus_lag * var)
    else:
        return (xx * xx.shift(-lag)).mean()
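
The normalized branch above divides the lagged cross-product of the centered series by `(n - lag) * var`. A plain-Python sketch of that formula, assuming no nulls and `lag < n`, looks like this (`auto_corr` is a hypothetical helper name):

```python
def auto_corr(series, lag, ddof=0):
    """Normalized autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    var = sum(c * c for c in centered) / (n - ddof)
    # Pair each value with the one `lag` steps later; the last `lag` values have no partner.
    cross = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return cross / ((n - lag) * var)


alternating = [1.0, -1.0] * 4
print(auto_corr(alternating, lag=1))  # -1.0: neighbours always have opposite sign
print(auto_corr(alternating, lag=2))  # 1.0: values two steps apart are identical
```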

query_avg_streak(where)

Finds the average streak length where the condition `where` is true. The average is taken on the true set.

Note: the query still runs when `where` does not represent a boolean column or boolean expression. In that case, however, the result will not be easily interpretable.

Parameters:

Name Type Description Default
where str | Expr

If `where` is a string, it must be the name of a boolean column. If `where` is an expression, it must evaluate to a boolean expression.

required
Source code in python/polars_ds/exprs/ts_features.py
def query_avg_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the average streak length where the condition `where` is true. The average is taken on
    the true set.

    Note: the query still runs when `where` does not represent a boolean column or boolean expression.
    In that case, however, the result will not be easily interpretable.

    Parameters
    ----------
    where
        If `where` is a string, it must be the name of a boolean column. If `where` is
        an expression, it must evaluate to a boolean expression.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return (
        y.filter(y.struct.field("value"))
        .struct.field("len")
        .mean()
        .fill_null(0)
        .alias("avg_streak")
    )
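
The expression above run-length encodes the condition and averages the lengths of the true runs. The same idea can be sketched with `itertools.groupby`; `true_streaks` and `avg_streak` are hypothetical helpers for illustration only.

```python
from itertools import groupby


def true_streaks(flags):
    """Lengths of consecutive runs of True values."""
    return [sum(1 for _ in run) for value, run in groupby(flags) if value]


def avg_streak(flags):
    """Mean length of the True runs, or 0 when there are none."""
    runs = true_streaks(flags)
    return sum(runs) / len(runs) if runs else 0


flags = [True, True, False, True, False, False, True, True, True]
print(true_streaks(flags))  # [2, 1, 3]
print(avg_streak(flags))    # 2.0
```

Taking `max` of the run lengths instead of the mean gives the longest streak.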

query_benford(var)

Finds the first digit counts which is used in Benford's law. This is an alias to query_first_digit_cnt.

Source code in python/polars_ds/exprs/ts_features.py
def query_benford(var: str | pl.Expr) -> pl.Expr:
    """
    Finds the first digit counts which is used in Benford's law. This is an alias to
    `query_first_digit_cnt`.
    """
    return query_first_digit_cnt(var)

query_c3_stats(x, lag)

Measure of non-linearity in the time series using c3 statistics.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
lag int

The lag that should be used in the calculation of the feature.

required
Reference

https://arxiv.org/pdf/chao-dyn/9909043

Source code in python/polars_ds/exprs/ts_features.py
def query_c3_stats(x: str | pl.Expr, lag: int) -> pl.Expr:
    """
    Measure of non-linearity in the time series using c3 statistics.

    Parameters
    ----------
    x : str | pl.Expr
        Either the name of the column or a Polars expression
    lag : int
        The lag that should be used in the calculation of the feature.

    Reference
    ---------
    https://arxiv.org/pdf/chao-dyn/9909043
    """
    two_lags = 2 * lag
    xx = to_expr(x)
    return ((xx.mul(xx.shift(lag)).mul(xx.shift(two_lags))).sum()).truediv(xx.len() - two_lags)
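
The statistic is the average of triple products of values spaced `lag` apart. A plain-Python sketch, assuming a null-free series longer than `2 * lag` (`c3` is a hypothetical helper name):

```python
def c3(series, lag):
    """Mean of x[i] * x[i + lag] * x[i + 2 * lag] over all valid i."""
    n = len(series)
    span = 2 * lag
    total = sum(series[i] * series[i + lag] * series[i + span] for i in range(n - span))
    return total / (n - span)


# Products: 1*2*3 = 6, 2*3*4 = 24, 3*4*5 = 60; their mean is 30.
print(c3([1, 2, 3, 4, 5], lag=1))  # 30.0
```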

query_cid_ce(x, normalize=False)

Estimates the time series complexity.

Parameters:

Name Type Description Default
x str | Expr

Either the name of the column or a Polars expression

required
normalize bool

If True, z-normalizes the time-series before computing the feature. Default is False.

False
Reference

https://www.cs.ucr.edu/~eamonn/Complexity-Invariant%20Distance%20Measure.pdf

Source code in python/polars_ds/exprs/ts_features.py
def query_cid_ce(x: str | pl.Expr, normalize: bool = False) -> pl.Expr:
    """
    Estimates the time series complexity.

    Parameters
    ----------
    x : str | pl.Expr
        Either the name of the column or a Polars expression
    normalize : bool, optional
        If True, z-normalizes the time-series before computing the feature.
        Default is False.

    Reference
    ---------
    https://www.cs.ucr.edu/~eamonn/Complexity-Invariant%20Distance%20Measure.pdf
    """
    xx = to_expr(x)
    if normalize:
        y = (xx - xx.mean()) / xx.std()
    else:
        y = xx

    z = y - y.shift(-1)
    return z.dot(z).sqrt()
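
The complexity estimate is the Euclidean length of the series of successive differences. The sketch below reproduces it in plain Python; `cid_ce` is a hypothetical helper, and the sample standard deviation (ddof=1) is used for normalization on the assumption that it matches the default of Polars' `std()`.

```python
import math
import statistics


def cid_ce(series, normalize=False):
    """Square root of the sum of squared successive differences."""
    values = list(series)
    if normalize:
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample std (ddof=1)
        values = [(v - mean) / std for v in values]
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(values, values[1:])))


# Differences are 1, 2, 0, so the result is sqrt(1 + 4 + 0) = sqrt(5).
print(cid_ce([1, 2, 4, 4]))
```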

query_cond_entropy(x, y)

Queries the conditional entropy of x on y, aka. H(x|y).

Parameters:

Name Type Description Default
x str | Expr

Either a string or a polars expression

required
y str | Expr

Either a string or a polars expression

required
Source code in python/polars_ds/exprs/ts_features.py
def query_cond_entropy(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Queries the conditional entropy of x on y, aka. H(x|y).

    Parameters
    ----------
    x
        Either a string or a polars expression
    y
        Either a string or a polars expression
    """
    return pl_plugin(
        symbol="pl_conditional_entropy",
        args=[to_expr(x), to_expr(y)],
        returns_scalar=True,
        pass_name_to_apply=True,
    )
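
For discrete data, conditional entropy satisfies H(x|y) = H(x, y) - H(y). The following plain-Python sketch uses that identity with the natural logarithm; the log base used by the plugin is an assumption here, and `joint_entropy` and `cond_entropy` are hypothetical helper names.

```python
import math
from collections import Counter


def joint_entropy(labels):
    """Shannon entropy (natural log) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def cond_entropy(x, y):
    """H(x|y) = H(x, y) - H(y)."""
    return joint_entropy(list(zip(x, y))) - joint_entropy(list(y))


x = [0, 1, 0, 1]
y = [0, 0, 1, 1]
print(cond_entropy(x, y))  # y says nothing about x, so this is H(x) = ln 2
print(cond_entropy(x, x))  # x determines itself, so this is 0
```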

query_cond_indep(x, y, z, k=3, parallel=False)

Computes the conditional independence of x and y, conditioned on z.

Reference

Jian Ma. Multivariate Normality Test with Copula Entropy. arXiv preprint arXiv:2206.05956, 2022.

Source code in python/polars_ds/exprs/ts_features.py
def query_cond_indep(
    x: str | pl.Expr, y: str | pl.Expr, z: str | pl.Expr, k: int = 3, parallel: bool = False
) -> pl.Expr:
    """
    Computes the conditional independence of `x` and `y`, conditioned on `z`.

    Reference
    ---------
    Jian Ma. Multivariate Normality Test with Copula Entropy. arXiv preprint arXiv:2206.05956, 2022.
    """
    # We can likely optimize this by going into Rust.
    # Here we are
    # (1) computing rank multiple times
    # (2) creating 3 separate kd-trees, and copying the data 3 times. Might just need to copy once.
    xyz = query_copula_entropy(x, y, z, k=k, parallel=parallel)
    yz = query_copula_entropy(y, z, k=k, parallel=parallel)
    xz = query_copula_entropy(x, z, k=k, parallel=parallel)
    return xyz - yz - xz

query_copula_entropy(*features, k=3, parallel=False)

Estimates Copula Entropy via rank statistics.

Reference

Jian Ma and Zengqi Sun. Mutual information is copula entropy. Tsinghua Science & Technology, 2011, 16(1): 51-54.

Source code in python/polars_ds/exprs/ts_features.py
def query_copula_entropy(*features: str | pl.Expr, k: int = 3, parallel: bool = False) -> pl.Expr:
    """
    Estimates Copula Entropy via rank statistics.

    Reference
    ---------
    Jian Ma and Zengqi Sun. Mutual information is copula entropy. Tsinghua Science & Technology, 2011, 16(1): 51-54.
    """
    ranks = [x.rank() / x.len() for x in (to_expr(f) for f in features)]
    return -query_knn_entropy(*ranks, k=k, dist="l2", parallel=parallel)

query_count_uniques(x)

Returns the count of unique values.

Source code in python/polars_ds/exprs/ts_features.py
def query_count_uniques(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the count of unique values.
    """
    return to_expr(x).is_unique().sum()
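
Note that "unique" can be read two ways. As far as I understand Polars, `is_unique()` flags values that occur exactly once, so the expression above would count those values rather than the number of distinct values. The plain-Python sketch below shows the difference on a small, made-up list.

```python
from collections import Counter

values = [1, 1, 2, 3, 3, 4]
counts = Counter(values)

n_distinct = len(counts)                            # distinct values: 1, 2, 3, 4
n_once = sum(1 for c in counts.values() if c == 1)  # values seen exactly once: 2, 4

print(n_distinct)  # 4
print(n_once)      # 2
```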

query_cv(x, ddof=1)

Returns the coefficient of variation for the variable. This is a shorthand for std / mean.

Parameters:

Name Type Description Default
x str | Expr

The variable

required
ddof int

The delta degrees of freedom used in the std computation

1
Source code in python/polars_ds/exprs/ts_features.py
def query_cv(x: str | pl.Expr, ddof: int = 1) -> pl.Expr:
    """
    Returns the coefficient of variation for the variable. This is a shorthand for std / mean.

    Parameters
    ----------
    x
        The variable
    ddof
        The delta degrees of freedom used in the std computation
    """
    xx = to_expr(x)
    return xx.std(ddof=ddof) / xx.mean()
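
A plain-Python sketch of the same ratio, showing how `ddof` enters the standard deviation (`cv` is a hypothetical helper name):

```python
import statistics


def cv(values, ddof=1):
    """Coefficient of variation: std / mean, with `ddof` delta degrees of freedom."""
    mean = statistics.fmean(values)
    ss = sum((v - mean) ** 2 for v in values)
    std = (ss / (len(values) - ddof)) ** 0.5
    return std / mean


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
print(cv(data, ddof=0))  # 2 / 5 = 0.4
print(cv(data))          # slightly larger, because ddof=1 divides by n - 1
```

The result is undefined when the mean is 0, so this feature is best suited to strictly positive data.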

query_entropy(x, base=math.e, normalize=True)

Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()

Parameters:

Name Type Description Default
x str | Expr

Either a string or a polars expression

required
base float

Base for the log in the entropy computation

e
normalize bool

Normalize if the probabilities don't sum to 1.

True
Source code in python/polars_ds/exprs/ts_features.py
def query_entropy(x: str | pl.Expr, base: float = math.e, normalize: bool = True) -> pl.Expr:
    """
    Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()

    Parameters
    ----------
    x
        Either a string or a polars expression
    base
        Base for the log in the entropy computation
    normalize
        Normalize if the probabilities don't sum to 1.
    """
    return to_expr(x).unique_counts().entropy(base=base, normalize=normalize)
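
The computation amounts to counting each distinct value and taking the Shannon entropy of the resulting frequencies. A plain-Python sketch with a configurable log base (`discrete_entropy` is a hypothetical helper name):

```python
import math
from collections import Counter


def discrete_entropy(values, base=math.e):
    """Shannon entropy of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(values).values())


print(discrete_entropy(["a", "a", "b", "b"], base=2))  # 1.0 bit: two equally likely values
print(discrete_entropy(["a", "a", "b", "b"]))          # ln 2 in natural units
```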

query_first_digit_cnt(var)

Finds the first digit count in the data. This is closely related to Benford's law, which states that the first digits (1-9) follow a certain distribution.

The output is a single element column of type list[u32]. The first value represents the count of 1s that are the first digit, the second value represents the count of 2s that are the first digit, etc.

E.g. first digit of 12 is 1, of 0.0312 is 3. For integers, it is possible to have value = 0, and this will not be counted as a first digit.

Reference

https://en.wikipedia.org/wiki/Benford%27s_law

Source code in python/polars_ds/exprs/ts_features.py
def query_first_digit_cnt(var: str | pl.Expr) -> pl.Expr:
    """
    Finds the first digit count in the data. This is closely related to Benford's law,
    which states that the first digits (1-9) follow a certain distribution.

    The output is a single element column of type list[u32]. The first value represents the count of 1s
    that are the first digit, the second value represents the count of 2s that are the first digit, etc.

    E.g. first digit of 12 is 1, of 0.0312 is 3. For integers, it is possible to have value = 0, and this
    will not be counted as a first digit.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Benford%27s_law
    """
    return pl_plugin(
        symbol="pl_benford_law",
        args=[to_expr(var)],
        returns_scalar=True,
    )
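
The output layout described above (nine counts, for digits 1 through 9) can be illustrated in plain Python. This sketch extracts the first significant digit from the number's text representation; how the Rust plugin does it is not shown in the source, and `first_digit_counts` is a hypothetical helper.

```python
def first_digit_counts(values):
    """Counts of first significant digits; index 0 holds digit 1, index 8 holds digit 9."""
    counts = [0] * 9
    for v in values:
        # Drop the sign, then leading zeros and the decimal point: 0.0312 -> "312".
        digits = repr(abs(v)).lstrip("0.")
        # Zero leaves an empty string and is skipped, as are inf and nan.
        if digits and digits[0].isdigit():
            counts[int(digits[0]) - 1] += 1
    return counts


# First digits: 12 -> 1, 0.0312 -> 3, 0 -> skipped, 19 -> 1, 250 -> 2.
print(first_digit_counts([12, 0.0312, 0, 19, 250]))  # [2, 1, 1, 0, 0, 0, 0, 0, 0]
```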

query_knn_entropy(*features, k=3, dist='l2', parallel=False)

Computes KNN entropy among all the rows.

Note if rows <= k, NaN will be returned.

Parameters:

Name Type Description Default
*features str | Expr

Columns used as features

()
k int

The number of nearest neighbors to consider. Usually 2 or 3.

3
dist Literal[`l2`, `inf`]

Note l2 here has to be l2 with square root.

'l2'
parallel bool

Whether to run the distance query in parallel. This is recommended when you are running only this expression, and not in group_by context.

False
Reference

https://arxiv.org/pdf/1506.06501v1.pdf

Source code in python/polars_ds/exprs/ts_features.py
def query_knn_entropy(
    *features: str | pl.Expr,
    k: int = 3,
    dist: Distance = "l2",
    parallel: bool = False,
) -> pl.Expr:
    """
    Computes KNN entropy among all the rows.

    Note if rows <= k, NaN will be returned.

    Parameters
    ----------
    *features
        Columns used as features
    k
        The number of nearest neighbors to consider. Usually 2 or 3.
    dist : Literal[`l2`, `inf`]
        Note `l2` here has to be `l2` with square root.
    parallel : bool
        Whether to run the distance query in parallel. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://arxiv.org/pdf/1506.06501v1.pdf
    """
    if k <= 0:
        raise ValueError("Input `k` must be > 0.")
    if dist not in ["l2", "inf"]:
        raise ValueError("Invalid metric for KNN entropy.")

    return pl_plugin(
        symbol="pl_knn_entropy",
        args=[to_expr(e).alias(str(i)) for i, e in enumerate(features)],
        kwargs={
            "k": k,
            "metric": dist,
            "parallel": parallel,
            "skip_eval": False,
            "skip_data": False,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )

query_lempel_ziv(b, as_ratio=True)

Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.

Parameters:

Name Type Description Default
b str | Expr

A boolean column

required
as_ratio bool

If true, return complexity / length.

True
Source code in python/polars_ds/exprs/ts_features.py
def query_lempel_ziv(b: str | pl.Expr, as_ratio: bool = True) -> pl.Expr:
    """
    Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.

    Parameters
    ----------
    b
        A boolean column
    as_ratio : bool
        If true, return complexity / length.
    """
    x = to_expr(b)
    out = pl_plugin(
        symbol="pl_lempel_ziv_complexity",
        args=[x],
        returns_scalar=True,
    )
    if as_ratio:
        return out / x.len()
    return out

query_longest_streak(where)

Finds the longest streak length where the condition `where` is true.

Note: the query still runs when `where` does not represent a boolean column or boolean expression. In that case, however, the result will not be easily interpretable.

Parameters:

Name Type Description Default
where str | Expr

If `where` is a string, it must be the name of a boolean column. If `where` is an expression, it must evaluate to a boolean expression.

required
Source code in python/polars_ds/exprs/ts_features.py
def query_longest_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the longest streak length where the condition `where` is true.

    Note: the query still runs when `where` does not represent a boolean column or boolean expression.
    In that case, however, the result will not be easily interpretable.

    Parameters
    ----------
    where
        If `where` is a string, it must be the name of a boolean column. If `where` is
        an expression, it must evaluate to a boolean expression.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return (
        y.filter(y.struct.field("value"))
        .struct.field("len")
        .max()
        .fill_null(0)
        .alias("longest_streak")
    )

query_mean_abs_change(x)

Returns the mean of all successive differences |X_i - X_i-1|

Source code in python/polars_ds/exprs/ts_features.py
def query_mean_abs_change(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the mean of all successive differences |X_i - X_i-1|
    """
    return to_expr(x).diff(null_behavior="drop").abs().mean()
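
A plain-Python sketch of the same computation, assuming at least two values and no nulls (`mean_abs_change` is a hypothetical helper name):

```python
def mean_abs_change(values):
    """Mean of |x[i] - x[i-1]| over all successive pairs."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


# Successive differences are 2, 3, 4, so the mean is 3.
print(mean_abs_change([1, 3, 6, 10]))  # 3.0
```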

query_mean_n_abs_max(x, n_maxima)

Returns the average of the top n_maxima of |x|.

Source code in python/polars_ds/exprs/ts_features.py
def query_mean_n_abs_max(x: str | pl.Expr, n_maxima: int) -> pl.Expr:
    """
    Returns the average of the top `n_maxima` of |x|.
    """
    if n_maxima <= 0:
        raise ValueError("The number of maxima should be > 0.")
    return to_expr(x).abs().top_k(n_maxima).mean()
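
The same feature can be sketched in plain Python with `heapq.nlargest`, which picks the largest absolute values without sorting the whole series. `mean_n_abs_max` here is a hypothetical stand-alone helper.

```python
import heapq


def mean_n_abs_max(values, n_maxima):
    """Mean of the n_maxima largest absolute values."""
    if n_maxima <= 0:
        raise ValueError("The number of maxima should be > 0.")
    top = heapq.nlargest(n_maxima, (abs(v) for v in values))
    return sum(top) / len(top)


# Absolute values are 5, 1, 4, 2; the top two are 5 and 4.
print(mean_n_abs_max([-5, 1, 4, -2], n_maxima=2))  # 4.5
```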

query_mid_range(x)

A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.

Source code in python/polars_ds/exprs/ts_features.py
def query_mid_range(x: str | pl.Expr) -> pl.Expr:
    """
    A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.
    """
    xx = to_expr(x)
    return (xx.max() - xx.min()) / 2

query_permute_entropy(ts, tau=1, n_dims=3, base=math.e)

Computes permutation entropy.

Parameters:

Name Type Description Default
ts str | Expr

A time series

required
tau int

The embedding time delay which controls the number of time periods between elements of each of the new column vectors.

1
n_dims int, > 1

The embedding dimension which controls the length of each of the new column vectors

3
base float

The base for log in the entropy computation

e
Reference

https://www.aptech.com/blog/permutation-entropy/

Source code in python/polars_ds/exprs/ts_features.py
def query_permute_entropy(
    ts: str | pl.Expr,
    tau: int = 1,
    n_dims: int = 3,
    base: float = math.e,
) -> pl.Expr:
    """
    Computes permutation entropy.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    tau : int
        The embedding time delay which controls the number of time periods between elements
        of each of the new column vectors.
    n_dims : int, > 1
        The embedding dimension which controls the length of each of the new column vectors
    base : float
        The base for log in the entropy computation

    Reference
    ---------
    https://www.aptech.com/blog/permutation-entropy/
    """
    if n_dims <= 1:
        raise ValueError("Input `n_dims` has to be > 1.")
    if tau < 1:
        raise ValueError("Input `tau` has to be >= 1.")

    t = to_expr(ts)
    if tau == 1:  # Fast track the most common use case
        return (
            pl.concat_list(t, *(t.shift(-i) for i in range(1, n_dims)))
            .head(t.len() - n_dims + 1)
            .list.eval(pl.element().arg_sort())
            .value_counts()  # groupby and count, but returns a struct
            .struct.field("count")  # extract the field named "count"
            .entropy(base=base, normalize=True)
        )
    else:
        return (
            pl.concat_list(
                t.gather_every(tau),
                *(t.shift(-i).gather_every(tau) for i in range(1, n_dims)),
            )
            .slice(0, length=(t.len() // tau) + 1 - (n_dims // tau))
            .list.eval(pl.element().arg_sort())
            .value_counts()
            .struct.field("count")
            .entropy(base=base, normalize=True)
        )
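
For `tau = 1`, the fast path above slides a window of length `n_dims` over the series, replaces each window by its ordinal pattern (the result of an arg-sort), and takes the entropy of the pattern frequencies. A plain-Python sketch of that procedure follows; `permutation_entropy` is a hypothetical helper, and the sample series is the one I recall from the referenced blog post.

```python
import math
from collections import Counter


def permutation_entropy(series, n_dims=3, base=math.e):
    """Entropy of the ordinal patterns of all length-n_dims windows (tau = 1)."""
    patterns = Counter()
    for i in range(len(series) - n_dims + 1):
        window = series[i:i + n_dims]
        # Ordinal pattern: the indices that would sort the window.
        patterns[tuple(sorted(range(n_dims), key=window.__getitem__))] += 1
    total = sum(patterns.values())
    return -sum((c / total) * math.log(c / total, base) for c in patterns.values())


# Five windows fall into three patterns with counts 2, 2 and 1.
print(permutation_entropy([4, 7, 9, 10, 6, 11, 3], n_dims=3, base=2))  # about 1.52
# A strictly increasing series has a single pattern, so its entropy is 0.
print(permutation_entropy([1, 2, 3, 4, 5]))
```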

query_range_count(x, lower, upper)

Returns the number of values inside [lower, upper].

Source code in python/polars_ds/exprs/ts_features.py
def query_range_count(x: str | pl.Expr, lower: float, upper: float) -> pl.Expr:
    """
    Returns the number of values inside [`lower`, `upper`].
    """
    return to_expr(x).is_between(lower_bound=lower, upper_bound=upper).sum()

query_sample_entropy(ts, ratio=0.2, m=2, parallel=False)

Calculate the sample entropy of this column. It is highly recommended that the user impute nulls before calling this.

If NaN is returned or an error is thrown, the likely causes are:

(1) Too little data, e.g. m + 1 > length.
(2) ratio or (ratio * std) is too close to or below 0, or std is null/NaN.

Parameters:

Name Type Description Default
ts str | Expr

A time series

required
ratio float

The tolerance parameter. Default is 0.2.

0.2
m int

Length of a run of data. Most common run length is 2.

2
parallel bool

Whether to run this in parallel or not. This is recommended when you are running only this expression, and not in group_by context.

False
Reference

https://en.wikipedia.org/wiki/Sample_entropy

Source code in python/polars_ds/exprs/ts_features.py
def query_sample_entropy(
    ts: str | pl.Expr, ratio: float = 0.2, m: int = 2, parallel: bool = False
) -> pl.Expr:
    """
    Calculate the sample entropy of this column. It is highly
    recommended that the user impute nulls before calling this.

    If NaN/some error is returned/thrown, it is likely that:
    (1) Too little data, e.g. m + 1 > length
    (2) ratio or (ratio * std) is too close to or below 0 or std is null/NaN.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    ratio : float
        The tolerance parameter. Default is 0.2.
    m : int
        Length of a run of data. Most common run length is 2.
    parallel : bool
        Whether to run this in parallel or not. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Sample_entropy
    """
    if m <= 1:
        raise ValueError("Input `m` must be > 1.")

    t = to_expr(ts)
    r = ratio * t.std(ddof=0)
    rows = t.len() - m + 1

    data = [r, t.slice(0, length=rows)]
    # See rust code for more comment on why I put m + 1 here.
    data.extend(
        t.shift(-i).slice(0, length=rows).alias(str(i)) for i in range(1, m + 1)
    )  # More errors are handled in Rust
    return pl_plugin(
        symbol="pl_sample_entropy",
        args=data,
        kwargs={
            "k": 0,
            "metric": "inf",
            "parallel": parallel,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )
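
To illustrate what `ratio` and `m` control, here is a minimal plain-Python sketch of the textbook sample entropy from the referenced Wikipedia article: the tolerance is `ratio` times the population standard deviation, and the result is `-ln(A / B)`, where `B` and `A` count matching template pairs of length `m` and `m + 1`. Whether the Rust plugin uses exactly these conventions (strict `<`, Chebyshev distance) is an assumption, and `sample_entropy` is a hypothetical helper.

```python
import math
import statistics


def sample_entropy(series, ratio=0.2, m=2):
    """Textbook SampEn: -ln(A / B), with tolerance r = ratio * population std."""
    n = len(series)
    r = ratio * statistics.pstdev(series)  # population std, as in std(ddof=0)

    def matches(width):
        # Use the same n - m starting points for both template widths.
        templates = [series[i:i + width] for i in range(n - m)]
        return sum(
            1
            for i, a in enumerate(templates)
            for j, b in enumerate(templates)
            if i != j and max(abs(p - q) for p, q in zip(a, b)) < r
        )

    return -math.log(matches(m + 1) / matches(m))


# In a perfectly periodic series every match of length m extends to length m + 1,
# so A equals B and the sample entropy is 0.
print(sample_entropy([1, 2, 3] * 10))
```

This sketch raises `ZeroDivisionError` when no templates of length `m` match, which corresponds to the "too little data" or "tolerance too small" cases listed above.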

query_similar_count(query, target, threshold, metric='sqzl2', parallel=False, return_ratio=False)

Given a query subsequence, find the number of same-sized subsequences (windows) in target series that have distance < threshold from it.

Note: If target is largely null, errors may occur. If metric is sqzl2, a minimum variance of 1e-10 is applied to all variance calculations to avoid division by 0.

Parameters:

Name Type Description Default
query Iterable[float]

The query subsequence. Must not contain nulls.

required
target str | Expr

The target time series.

required
threshold float

The distance threshold

required
metric Literal['sql2', 'sqzl2']

Either 'sql2' or 'sqzl2', which stands for squared l2 and squared z-normalized l2.

'sqzl2'
parallel bool

Only applies when method is direct. Whether to compute the convulotion in parallel. Note that this may not have the expected performance when you are in group_by or other parallel context already. It is recommended to use this in select/with_columns context, when few expressions are being run at the same time.

False
return_ratio bool

If true, return # of similar subsequences / total number of subsequences.

False
Source code in python/polars_ds/exprs/ts_features.py
def query_similar_count(
    query: Iterable[float],
    target: str | pl.Expr,
    threshold: float,
    metric: Literal["sql2", "sqzl2"] = "sqzl2",
    parallel: bool = False,
    return_ratio: bool = False,
) -> pl.Expr:
    """
    Given a query subsequence, find the number of same-sized subsequences (windows) in target
    series that have distance < threshold from it.

    Note: If target is largely null, errors may occur. If metric is sqzl2, a minimum variance
    of 1e-10 is applied to all variance calculations to avoid division by 0.

    Parameters
    ----------
    query
        The query subsequence. Must not contain nulls.
    target
        The target time series.
    threshold
        The distance threshold
    metric
        Either 'sql2' or 'sqzl2', which stands for squared l2 and squared z-normalized l2.
    parallel
        Only applies when method is `direct`. Whether to compute the convolution in parallel. Note that this may not
        have the expected performance when you are in group_by or other parallel context already. It is recommended
        to use this in select/with_columns context, when few expressions are being run at the same time.
    return_ratio
        If true, return # of similar subsequences / total number of subsequences.
    """

    q = pl.Series(name="", values=query, dtype=pl.Float64)
    if q.null_count() > 0:
        raise ValueError("Nulls found in the query subsequence.")
    if len(q) <= 1:
        raise ValueError("Length of the query should be > 1.")

    t = to_expr(target)
    kwargs = {"threshold": threshold, "parallel": parallel}
    if metric == "sql2":
        result = pl_plugin(
            symbol="pl_subseq_sim_cnt_l2",
            args=[t.cast(pl.Float64).rechunk(), q],
            kwargs=kwargs,
            returns_scalar=True,
        )
    elif metric == "sqzl2":  # pl_subseq_sim_cnt_zl2
        rolling_mean = t.rolling_mean(window_size=len(q)).slice(len(q) - 1, None)
        rolling_var = pl.max_horizontal(
            t.rolling_var(window_size=len(q)).slice(len(q) - 1, None).fill_nan(1e-10),
            pl.lit(1e-10, dtype=pl.Float64),
        )
        qq = pl.lit(q)
        args = [
            t.cast(pl.Float64).rechunk(),
            ((qq - qq.mean()) / qq.std()).rechunk(),
            rolling_mean.rechunk(),
            rolling_var.rechunk(),
        ]
        result = pl_plugin(
            symbol="pl_subseq_sim_cnt_zl2",
            args=args,
            kwargs=kwargs,
            returns_scalar=True,
        )
    else:
        raise ValueError(f"Unsupported metric {metric}.")

    if return_ratio:
        return result / (t.len() - len(q) + 1)
    return result
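As a rough illustration of what the 'sql2' metric counts, here is a brute-force pure-Python sketch. It is not the plugin's algorithm: `similar_count_sql2` is a hypothetical name, and the sketch assumes a target without nulls.

```python
from typing import Sequence


def similar_count_sql2(
    query: Sequence[float], target: Sequence[float], threshold: float
) -> int:
    """Hypothetical reference: count windows with squared L2 distance < threshold."""
    m = len(query)
    if m <= 1:
        raise ValueError("Length of the query should be > 1.")
    count = 0
    # Slide a window of the query's length over the target.
    for start in range(len(target) - m + 1):
        window = target[start : start + m]
        dist = sum((w - q) ** 2 for w, q in zip(window, query))
        if dist < threshold:
            count += 1
    return count


query = [1.0, 2.0]
target = [1.0, 2.0, 3.0, 1.0, 2.1]
count = similar_count_sql2(query, target, threshold=0.5)  # 2 of the 4 windows match
ratio = count / (len(target) - len(query) + 1)  # 0.5, the analogue of return_ratio=True
```

The denominator mirrors the `t.len() - len(q) + 1` term in the source above, which is the total number of windows.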

query_streak(where)

Finds the streak length where the condition where is true. This returns a full column of streak lengths.

Note: the query still runs when where does not represent a boolean column or a boolean expression. In that case, however, the result will not be easily interpretable.

Parameters:

Name Type Description Default
where str | Expr

If where is string, the string must represent the name of a boolean column. If where is an expression, the expression must evaluate to some boolean series.

required
Source code in python/polars_ds/exprs/ts_features.py
def query_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the streak length where the condition `where` is true. This returns a full column of streak lengths.

    Note: the query still runs when `where` does not represent a boolean column or a boolean expression.
    In that case, however, the result will not be easily interpretable.

    Parameters
    ----------
    where
        If where is string, the string must represent the name of a boolean column. If where is
        an expression, the expression must evaluate to some boolean series.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return y.struct.field("len").alias("streak_len")
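Since the source above relies on run-length encoding (`rle()`), the idea can be sketched in pure Python with `itertools.groupby`. The helper `run_lengths` is hypothetical and only illustrates the encoding, not the Polars output type.

```python
from itertools import groupby
from typing import Iterable, List


def run_lengths(values: Iterable[bool]) -> List[int]:
    """Hypothetical helper: length of each run of consecutive equal values."""
    return [sum(1 for _ in group) for _, group in groupby(values)]


flags = [True, True, False, True, True, True]
streaks = run_lengths(flags)
# streaks == [2, 1, 3]: two Trues, one False, then three Trues
```

Note that in this sketch runs of False are encoded as well, so the lengths alone do not say which runs satisfied the condition.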

query_symm_ratio(x)

Returns the symmetric ratio: |mean - median| / (max - min). Note the closer to 0 this value is, the more symmetric the series is.

Source code in python/polars_ds/exprs/ts_features.py
def query_symm_ratio(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the symmetric ratio: |mean - median| / (max - min). Note the closer to 0 this value is,
    the more symmetric the series is.
    """
    y = to_expr(x)
    return (y.mean() - y.median()).abs() / (y.max() - y.min())
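A small pure-Python sketch of the same ratio, using only the standard library, may help make the formula concrete. The function name `symm_ratio` is an assumption for illustration; unlike a Polars expression, this version raises `ZeroDivisionError` on a constant series.

```python
from statistics import mean, median
from typing import Sequence


def symm_ratio(values: Sequence[float]) -> float:
    """Hypothetical reference: |mean - median| / (max - min)."""
    return abs(mean(values) - median(values)) / (max(values) - min(values))


symmetric = symm_ratio([1.0, 2.0, 3.0, 4.0, 5.0])  # mean == median, so 0.0
skewed = symm_ratio([1.0, 1.0, 1.0, 1.0, 11.0])    # |3 - 1| / 10 == 0.2
```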

query_time_reversal_asymmetry_stats(x, n_lags)

Queries the Time Reversal Asymmetry Statistic, which is the average of ((L^2(x))^2 * L(x) - L(x) * x^2), where L is the lag operator.

Source code in python/polars_ds/exprs/ts_features.py
def query_time_reversal_asymmetry_stats(x: str | pl.Expr, n_lags: int) -> pl.Expr:
    """
    Queries the Time Reversal Asymmetry Statistic, which is the average of
    ((L^2(x))^2 * L(x) - L(x) * x^2), where L is the lag operator.
    """
    y = to_expr(x)
    one_lag = y.shift(-n_lags)
    two_lag = y.shift(-2 * n_lags)  # Nulls won't be in the mean calculation
    return (one_lag * (two_lag + y) * (two_lag - y)).mean()
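The expression in the source above multiplies the once-shifted value by a difference of squares. A pure-Python sketch of that arithmetic, assuming a series without nulls and using the hypothetical name `time_reversal_asymmetry`:

```python
from typing import Sequence


def time_reversal_asymmetry(values: Sequence[float], n_lags: int) -> float:
    """Hypothetical reference: mean of x[i+k] * (x[i+2k]^2 - x[i]^2) with k = n_lags."""
    n = len(values) - 2 * n_lags  # rows where both shifted values exist
    if n <= 0:
        raise ValueError("Series is too short for the requested n_lags.")
    terms = [
        values[i + n_lags]
        * (values[i + 2 * n_lags] + values[i])
        * (values[i + 2 * n_lags] - values[i])
        for i in range(n)
    ]
    return sum(terms) / n


trend = time_reversal_asymmetry([1, 2, 3, 4, 5], n_lags=1)       # (16 + 36 + 64) / 3
reversible = time_reversal_asymmetry([1, 2, 1, 2, 1], n_lags=1)  # 0.0
```

The second series reads the same forwards and backwards, so every term cancels and the statistic is zero.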

query_transfer_entropy(x, source, lag=1, k=3, parallel=False)

Estimates the transfer entropy from source to x at the given lag.

Reference

Jian Ma. Estimating Transfer Entropy via Copula Entropy. arXiv preprint arXiv:1910.04375, 2019.

Source code in python/polars_ds/exprs/ts_features.py
def query_transfer_entropy(
    x: str | pl.Expr, source: str | pl.Expr, lag: int = 1, k: int = 3, parallel: bool = False
) -> pl.Expr:
    """
    Estimates the transfer entropy from `source` to `x` at the given lag.

    Reference
    ---------
    Jian Ma. Estimating Transfer Entropy via Copula Entropy. arXiv preprint arXiv:1910.04375, 2019.
    """
    if lag < 1:
        raise ValueError("Input `lag` must be >= 1.")

    xx = to_expr(x)
    x1 = xx.slice(0, pl.len() - lag)
    x2 = xx.slice(lag, pl.len() - lag)  # (equivalent to slice(lag, None), but will break in v1.0)
    s = to_expr(source).slice(0, pl.len() - lag)
    return query_cond_indep(x2, s, x1, k=k, parallel=parallel)
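The slicing in the source above lines up the future of `x` with the past of both `x` and `source`. A pure-Python sketch of that alignment follows; `align_for_transfer_entropy` is a hypothetical helper, and the conditional independence estimate itself is not reproduced here.

```python
from typing import List, Sequence, Tuple


def align_for_transfer_entropy(
    x: Sequence[float], source: Sequence[float], lag: int = 1
) -> Tuple[List[float], List[float], List[float]]:
    """Hypothetical helper: return (future of x, past of source, past of x)."""
    if lag < 1:
        raise ValueError("Input `lag` must be >= 1.")
    n = len(x)
    x_past = list(x[: n - lag])            # corresponds to x1 in the source
    x_future = list(x[lag:])               # corresponds to x2 in the source
    source_past = list(source[: n - lag])  # corresponds to s in the source
    return x_future, source_past, x_past


x_future, source_past, x_past = align_for_transfer_entropy(
    [10.0, 11.0, 12.0, 13.0], [0.0, 1.0, 2.0, 3.0], lag=1
)
# x_future == [11.0, 12.0, 13.0]
# source_past == [0.0, 1.0, 2.0]
# x_past == [10.0, 11.0, 12.0]
```

The three lists are passed in this order to the conditional independence query, which relates the future of `x` to the past of `source` while conditioning on the past of `x`.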