Time Series Features

Feature Engineering Queries and Time Series Features

Functions:

Name	Description
`query_abs_energy`	Absolute energy is defined as Sum(x_i^2).
`query_approx_entropy`	Approximate sample entropies of a time series given the filtering level. It is highly recommended that the user impute nulls before calling this.
`query_ar_coeffs`	Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the output.
`query_auto_corr`	Computes the auto correlation with the given lag.
`query_avg_streak`	Finds the average streak length where the condition `where` is true. The average is taken on the true set.
`query_c3_stats`	Measure of non-linearity in the time series using c3 statistics.
`query_cid_ce`	Estimates the time series complexity.
`query_cond_entropy`	Queries the conditional entropy of x on y, aka. H(x\|y).
`query_cond_indep`	Computes the conditional independence of `x` and `y`, conditioned on `z`.
`query_copula_entropy`	Estimates Copula Entropy via rank statistics.
`query_count_uniques`	Returns the count of unique values.
`query_cv`	Returns the coefficient of variation for the variable. This is a shorthand for std / mean.
`query_entropy`	Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()
`query_first_digit_cnt`	Finds the first digit count in the data. This is closely related to Benford's law, which states that the first digits (1-9) follow a certain distribution.
`query_knn_entropy`	Computes KNN entropy among all the rows.
`query_lempel_ziv`	Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.
`query_longest_streak`	Finds the longest streak length where the condition `where` is true.
`query_mean_abs_change`	Returns the mean of all successive differences \|X_i - X_i-1\|
`query_mean_n_abs_max`	Returns the average of the top `n_maxima` of \|x\|.
`query_mid_range`	A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.
`query_permute_entropy`	Computes permutation entropy.
`query_range_count`	Returns the number of values inside [`lower`, `upper`].
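`query_sample_entropy`	Calculate the sample entropy of this column. It is highly recommended that the user impute nulls before calling this.
`query_similar_count`	Given a query subsequence, find the number of same-sized subsequences (windows) in target series that have distance < threshold from it.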
`query_streak`	Finds the streak length where the condition `where` is true. This returns a full column of streak lengths.
`query_symm_ratio`	Returns the symmetric ratio: \|mean - median\| / (max - min). Note the closer to 0 this value is,
`query_time_reversal_asymmetry_stats`	Queries the Time Reversal Asymmetry Statistic, which is the average of
`query_transfer_entropy`	Estimating transfer entropy from `source` to `x` with a lag

`query_abs_energy(x)`

Absolute energy is defined as Sum(x_i^2).
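
As a quick sanity check of the definition, the sum of squares can be written in plain Python. This is only a reference sketch that does not call polars_ds, and the helper name `abs_energy_ref` is hypothetical.

```python
def abs_energy_ref(values):
    """Sum of squared values, i.e. the dot product of the series with itself."""
    return sum(v * v for v in values)


print(abs_energy_ref([1.0, -2.0, 3.0]))  # 1 + 4 + 9 = 14.0
```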

Source code in python/polars_ds/exprs/ts_features.py

def query_abs_energy(x: str | pl.Expr) -> pl.Expr:
    """
    Absolute energy is defined as Sum(x_i^2).
    """
    y = to_expr(x)
    return y.dot(y)

`query_approx_entropy(ts, m, filtering_level, scale_by_std=True, parallel=True)`

Approximate sample entropies of a time series given the filtering level. It is highly recommended that the user impute nulls before calling this.

If NaN is returned or an error is raised, the likely causes are: (1) too little data, e.g. m + 1 > length; (2) filtering_level or (filtering_level * std) is too close to 0, or std is null/NaN.

Parameters:

Name	Type	Description	Default
`ts`	`str \| Expr`	A time series	required
`m`	`int`	Length of compared runs of data. Must be > 1. This is `m` in the wikipedia article.	required
`filtering_level`	`float`	Filtering level, must be positive. This is `r` in the wikipedia article.	required
`scale_by_std`	`bool`	Whether to scale the filtering level by the standard deviation of the data. Scaling is the usual choice in most applications, but not in all of them.	`True`
`parallel`	`bool`	Whether to run this in parallel or not. This is recommended when you are running only this expression, and not in group_by context.	`True`

Reference

https://en.wikipedia.org/wiki/Approximate_entropy
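
To make the roles of `m` and `r` concrete, here is a brute-force sketch of approximate entropy following the textbook definition in the article above. It is plain Python, not the Rust plugin this function dispatches to, so results may differ in detail from the library; the name `approx_entropy_ref` is hypothetical, and `r` is taken as an absolute tolerance (multiply by the standard deviation yourself to mimic `scale_by_std=True`).

```python
import math


def approx_entropy_ref(series, m, r):
    """Approximate entropy with run length m and absolute tolerance r."""

    def phi(length):
        n = len(series) - length + 1
        windows = [series[i:i + length] for i in range(n)]
        total = 0.0
        for a in windows:
            # Count windows within Chebyshev distance r of `a` (self-match included).
            close = sum(
                1 for b in windows
                if max(abs(p - q) for p, q in zip(a, b)) <= r
            )
            total += math.log(close / n)
        return total / n

    return phi(m) - phi(m + 1)


# A strictly periodic series is highly predictable, so the value is close to 0.
print(approx_entropy_ref([1.0, 2.0] * 10, m=2, r=0.5))
```

The sketch needs at least m + 1 observations, which mirrors the "too little data" condition described above.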

Source code in python/polars_ds/exprs/ts_features.py

def query_approx_entropy(
    ts: str | pl.Expr,
    m: int,
    filtering_level: float,
    scale_by_std: bool = True,
    parallel: bool = True,
) -> pl.Expr:
    """
    Approximate sample entropies of a time series given the filtering level. It is highly
    recommended that the user impute nulls before calling this.

    If NaN/some error is returned/thrown, it is likely that:
    (1) Too little data, e.g. m + 1 > length
    (2) filtering_level or (filtering_level * std) is too close to 0 or std is null/NaN.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    m : int
        Length of compared runs of data. This is `m` in the wikipedia article.
    filtering_level : float
        Filtering level, must be positive. This is `r` in the wikipedia article.
    scale_by_std : bool
        Whether to scale filter level by std of data. In most applications, this is the default
        behavior, but not in some other cases.
    parallel : bool
        Whether to run this in parallel or not. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Approximate_entropy
    """

    if filtering_level <= 0 or m <= 1:
        raise ValueError("Filter level must be positive and m must be > 1.")

    t = to_expr(ts)
    if scale_by_std:
        r: pl.Expr = filtering_level * t.std()
    else:
        r: pl.Expr = pl.lit(filtering_level, dtype=pl.Float64)

    rows = t.len() - m + 1
    data = [r, t.slice(0, length=rows).cast(pl.Float64)]
    # See rust code for more comment on why I put m + 1 here.
    data.extend(
        t.shift(-i).slice(0, length=rows).cast(pl.Float64).alias(str(i)) for i in range(1, m + 1)
    )
    # More errors are handled in Rust
    return pl_plugin(
        symbol="pl_approximate_entropy",
        args=data,
        kwargs={
            "k": 0,
            "metric": "inf",
            "parallel": parallel,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )

`query_ar_coeffs(x, lag, add_bias=True, null_policy='raise')`

Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the output.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The feature	required
`lag`	`int`	The lag	required
`add_bias`	`bool`	Whether to add a bias/intercept term	`True`
`null_policy`	`NullPolicy`	One of "raise", "one", "zero", or a finite numeric string.	`'raise'`
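
The output layout (lag coefficients first, bias last) can be illustrated for the simplest case, `lag=1`, with a closed-form least-squares fit. This is a plain-Python sketch under that assumption, not the library's regression routine, and `ar1_coeffs_ref` is a hypothetical name.

```python
def ar1_coeffs_ref(series):
    """Least-squares fit of x[t] = slope * x[t-1] + bias; returns [slope, bias]."""
    lagged, target = series[:-1], series[1:]
    n = len(target)
    mean_x = sum(lagged) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lagged, target))
    slope = sxy / sxx
    # The bias/intercept comes last, matching the ordering described above.
    return [slope, mean_y - slope * mean_x]


# Series generated by x[t] = 0.5 * x[t-1] + 1, starting from 0.
print(ar1_coeffs_ref([0.0, 1.0, 1.5, 1.75, 1.875]))
```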

Source code in python/polars_ds/exprs/ts_features.py

def query_ar_coeffs(
    x: str | pl.Expr, lag: int, add_bias: bool = True, null_policy: NullPolicy = "raise"
) -> pl.Expr:
    """
    Computes the autoregressive coefficients for the given lag. The bias/intercept term will be the last value in the
    output.

    Parameters
    ----------
    x
        The feature
    lag
        The lag
    add_bias
        Whether to add a bias/intercept term
    null_policy
        One of "raise", "one", "zero", or a finite numeric string.
    """

    if null_policy not in ("raise", "one", "zero"):
        try:
            import math

            z = float(null_policy)
            if not math.isfinite(z):
                raise
        except Exception:
            raise ValueError(
                "`null_policy` must be 'raise', 'one', 'zero' or any finite numeric string for AR coefficients."
            )

    if lag <= 0:
        raise ValueError("`lag` must be > 0.")

    from . import lin_reg

    xx = to_expr(x)
    return lin_reg(
        *[xx.shift(i).slice(offset=lag).alias(str(i)) for i in range(1, lag + 1)],
        target=xx.slice(offset=lag),
        add_bias=add_bias,
        null_policy=null_policy,
    )

`query_auto_corr(x, lag, ddof=0, normalize=True)`

Computes the auto correlation with the given lag.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The feature	required
`lag`	`int`	The lag	required
`ddof`	`int`	The ddof for the variance	`0`
`normalize`	`bool`	Whether to normalize the value to [-1, 1] or not.	`True`
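
The two modes can be written out in plain Python to show what is being computed. This sketch mirrors the formulas in the source listed for this function, assumes `lag` is smaller than the series length and that there are no nulls, and uses the hypothetical name `auto_corr_ref`.

```python
def auto_corr_ref(values, lag, ddof=0, normalize=True):
    """Lagged autocorrelation; assumes 0 <= lag < len(values)."""
    n = len(values)
    pairs = [(values[i], values[i + lag]) for i in range(n - lag)]
    if not normalize:
        # Plain mean of the lagged products.
        return sum(a * b for a, b in pairs) / len(pairs)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - ddof)
    cov = sum((a - mean) * (b - mean) for a, b in pairs)
    return cov / ((n - lag) * var)


print(auto_corr_ref([1.0, -1.0, 1.0, -1.0], lag=1))  # -1.0: perfectly alternating
```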

Source code in python/polars_ds/exprs/ts_features.py

def query_auto_corr(x: str | pl.Expr, lag: int, ddof: int = 0, normalize: bool = True) -> pl.Expr:
    """
    Computes the auto correlation with the given lag.

    Parameters
    ----------
    x
        The feature
    lag
        The lag
    ddof
        The ddof for the variance
    normalize
        Whether to normalize the value to [-1, 1] or not.
    """
    xx = to_expr(x)
    if normalize:
        x_m = xx - xx.mean()
        var = xx.var(ddof=ddof)
        n = pl.len()
        n_minus_lag = pl.when(n < lag).then(float("nan")).otherwise(n - lag)
        return x_m.dot(x_m.shift(-lag)) / (n_minus_lag * var)
    else:
        return (xx * xx.shift(-lag)).mean()

`query_avg_streak(where)`

Finds the average streak length where the condition where is true. The average is taken on the true set.

Note: the query is still runnable when where doesn't represent boolean column / boolean expressions. However, if that is the case the answer will not be easily interpretable.

Parameters:

Name	Type	Description	Default
`where`	`str \| Expr`	If where is string, the string must represent the name of a boolean column. If where is an expression, the expression must evaluate to some boolean expression.	required
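
The idea behind the streak queries is run-length encoding of the condition. A minimal plain-Python sketch using `itertools.groupby` is shown here; the helper names `true_streaks` and `avg_streak_ref` are hypothetical and are not part of polars_ds.

```python
from itertools import groupby


def true_streaks(flags):
    """Lengths of consecutive runs of True, in order of appearance."""
    return [sum(1 for _ in run) for value, run in groupby(flags) if value]


def avg_streak_ref(flags):
    """Mean length of the True runs, or 0 when the condition is never true."""
    streaks = true_streaks(flags)
    return sum(streaks) / len(streaks) if streaks else 0


flags = [True, True, False, True, False, False, True, True, True]
print(true_streaks(flags))    # [2, 1, 3]
print(avg_streak_ref(flags))  # 2.0
```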

Source code in python/polars_ds/exprs/ts_features.py

def query_avg_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the average streak length where the condition `where` is true. The average is taken on
    the true set.

    Note: the query is still runnable when `where` doesn't represent boolean column / boolean expressions.
    However, if that is the case the answer will not be easily interpretable.

    Parameters
    ----------
    where
        If where is string, the string must represent the name of a boolean column. If where is
        an expression, the expression must evaluate to some boolean expression.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return (
        y.filter(y.struct.field("value"))
        .struct.field("len")
        .mean()
        .fill_null(0)
        .alias("avg_streak")
    )

`query_benford(var)`

Finds the first digit counts which is used in Benford's law. This is an alias to query_first_digit_cnt.

Source code in python/polars_ds/exprs/ts_features.py

def query_benford(var: str | pl.Expr) -> pl.Expr:
    """
    Finds the first digit counts which is used in Benford's law. This is an alias to
    `query_first_digit_cnt`.
    """
    return query_first_digit_cnt(var)

`query_c3_stats(x, lag)`

Measure of non-linearity in the time series using c3 statistics.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`lag`	`int`	The lag that should be used in the calculation of the feature.	required

Reference

https://arxiv.org/pdf/chao-dyn/9909043
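
Written out, the statistic is the average of the triple products x_i * x_(i+lag) * x_(i+2*lag) over the n - 2*lag positions where all three terms exist. The plain-Python sketch below follows that reading of the listed source; `c3_ref` is a hypothetical name.

```python
def c3_ref(values, lag):
    """Mean of x[i] * x[i + lag] * x[i + 2 * lag] over all valid i."""
    count = len(values) - 2 * lag
    total = sum(
        values[i] * values[i + lag] * values[i + 2 * lag] for i in range(count)
    )
    return total / count


# (1*2*3 + 2*3*4 + 3*4*5) / 3 = 90 / 3
print(c3_ref([1, 2, 3, 4, 5], lag=1))  # 30.0
```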

Source code in python/polars_ds/exprs/ts_features.py

def query_c3_stats(x: str | pl.Expr, lag: int) -> pl.Expr:
    """
    Measure of non-linearity in the time series using c3 statistics.

    Parameters
    ----------
    x : str | pl.Expr
        Either the name of the column or a Polars expression
    lag : int
        The lag that should be used in the calculation of the feature.

    Reference
    ---------
    https://arxiv.org/pdf/chao-dyn/9909043
    """
    two_lags = 2 * lag
    xx = to_expr(x)
    return ((xx.mul(xx.shift(lag)).mul(xx.shift(two_lags))).sum()).truediv(xx.len() - two_lags)

`query_cid_ce(x, normalize=False)`

Estimates the time series complexity.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either the name of the column or a Polars expression	required
`normalize`	`bool`	If True, z-normalizes the time-series before computing the feature. Default is False.	`False`

Reference

https://www.cs.ucr.edu/~eamonn/Complexity-Invariant%20Distance%20Measure.pdf
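
Concretely, the estimate is the square root of the sum of squared successive differences, optionally after z-normalizing the series. The following plain-Python sketch uses the sample standard deviation (ddof=1) for normalization, which is assumed to match the Polars default; `cid_ce_ref` is a hypothetical name.

```python
import math
import statistics


def cid_ce_ref(values, normalize=False):
    """Square root of the sum of squared successive differences."""
    if normalize:
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample std (ddof=1), an assumption
        values = [(v - mean) / std for v in values]
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(values, values[1:])))


print(cid_ce_ref([0.0, 3.0, 7.0]))  # sqrt(3**2 + 4**2) = 5.0
```

With `normalize=True` the result no longer depends on the scale of the series, which is useful when comparing series measured in different units.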

Source code in python/polars_ds/exprs/ts_features.py

def query_cid_ce(x: str | pl.Expr, normalize: bool = False) -> pl.Expr:
    """
    Estimates the time series complexity.

    Parameters
    ----------
    x : str | pl.Expr
        Either the name of the column or a Polars expression
    normalize : bool, optional
        If True, z-normalizes the time-series before computing the feature.
        Default is False.

    Reference
    ---------
    https://www.cs.ucr.edu/~eamonn/Complexity-Invariant%20Distance%20Measure.pdf
    """
    xx = to_expr(x)
    if normalize:
        y = (xx - xx.mean()) / xx.std()
    else:
        y = xx

    z = y - y.shift(-1)
    return z.dot(z).sqrt()

`query_cond_entropy(x, y)`

Queries the conditional entropy of x on y, aka. H(x|y).

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either a string or a polars expression	required
`y`	`str \| Expr`	Either a string or a polars expression	required
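
For discrete data, H(x|y) can be computed directly from joint and marginal counts. The sketch below is a plain-Python reference that assumes the natural logarithm (the base used by the plugin is not stated on this page); `cond_entropy_ref` is a hypothetical name.

```python
import math
from collections import Counter


def cond_entropy_ref(xs, ys):
    """H(x|y) = -sum over observed pairs of p(x, y) * ln(p(x, y) / p(y))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_y = Counter(ys)
    return -sum(
        (count / n) * math.log(count / marginal_y[y])
        for (_, y), count in joint.items()
    )


# x is independent of y here, so knowing y removes no uncertainty: H(x|y) = ln 2.
print(cond_entropy_ref([0, 1, 0, 1], [0, 0, 1, 1]))
```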

Source code in python/polars_ds/exprs/ts_features.py

def query_cond_entropy(x: str | pl.Expr, y: str | pl.Expr) -> pl.Expr:
    """
    Queries the conditional entropy of x on y, aka. H(x|y).

    Parameters
    ----------
    x
        Either a string or a polars expression
    y
        Either a string or a polars expression
    """
    return pl_plugin(
        symbol="pl_conditional_entropy",
        args=[to_expr(x), to_expr(y)],
        returns_scalar=True,
        pass_name_to_apply=True,
    )

`query_cond_indep(x, y, z, k=3, parallel=False)`

Computes the conditional independence of x and y, conditioned on z.

Reference

Jian Ma. Multivariate Normality Test with Copula Entropy. arXiv preprint arXiv:2206.05956, 2022.

Source code in python/polars_ds/exprs/ts_features.py

def query_cond_indep(
    x: str | pl.Expr, y: str | pl.Expr, z: str | pl.Expr, k: int = 3, parallel: bool = False
) -> pl.Expr:
    """
    Computes the conditional independence of `x` and `y`, conditioned on `z`

    Reference
    ---------
    Jian Ma. Multivariate Normality Test with Copula Entropy. arXiv preprint arXiv:2206.05956, 2022.
    """
    # We can likely optimize this by going into Rust.
    # Here we are
    # (1) computing rank multiple times
    # (2) creating 3 separate kd-trees, and copying the data 3 times. Might just need to copy once.
    xyz = query_copula_entropy(x, y, z, k=k, parallel=parallel)
    yz = query_copula_entropy(y, z, k=k, parallel=parallel)
    xz = query_copula_entropy(x, z, k=k, parallel=parallel)
    return xyz - yz - xz

`query_copula_entropy(*features, k=3, parallel=False)`

Estimates Copula Entropy via rank statistics.

Reference

Jian Ma and Zengqi Sun. Mutual information is copula entropy. Tsinghua Science & Technology, 2011, 16(1): 51-54.

Source code in python/polars_ds/exprs/ts_features.py

def query_copula_entropy(*features: str | pl.Expr, k: int = 3, parallel: bool = False) -> pl.Expr:
    """
    Estimates Copula Entropy via rank statistics.

    Reference
    ---------
    Jian Ma and Zengqi Sun. Mutual information is copula entropy. Tsinghua Science & Technology, 2011, 16(1): 51-54.
    """
    ranks = [x.rank() / x.len() for x in (to_expr(f) for f in features)]
    return -query_knn_entropy(*ranks, k=k, dist="l2", parallel=parallel)

`query_count_uniques(x)`

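Returns the count of unique values, i.e. values that occur exactly once in the column. The implementation sums `is_unique()`, so the result can differ from the number of distinct values.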
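
Because the source listed for this function sums `is_unique()`, which in Polars flags values that occur exactly once, it helps to see how that count differs from the number of distinct values. This plain-Python sketch contrasts the two; both helper names are hypothetical.

```python
from collections import Counter


def count_occurring_once(values):
    """Number of values that appear exactly once (what is_unique().sum() counts)."""
    return sum(1 for c in Counter(values).values() if c == 1)


def count_distinct(values):
    """Number of distinct values, for comparison."""
    return len(set(values))


data = [1, 1, 2, 3, 3, 4]
print(count_occurring_once(data))  # 2 -> only 2 and 4 occur once
print(count_distinct(data))        # 4 -> {1, 2, 3, 4}
```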

Source code in python/polars_ds/exprs/ts_features.py

def query_count_uniques(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the count of unique values.
    """
    return to_expr(x).is_unique().sum()

`query_cv(x, ddof=1)`

Returns the coefficient of variation for the variable. This is a shorthand for std / mean.

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	The variable	required
`ddof`	`int`	The delta degrees of freedom used in the std computation	`1`
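
The effect of `ddof` is easiest to see with the ratio written out by hand. This is a plain-Python sketch of std / mean, not a call into polars_ds, and `cv_ref` is a hypothetical name.

```python
def cv_ref(values, ddof=1):
    """Coefficient of variation: standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - ddof)
    return var ** 0.5 / mean


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
print(cv_ref(data, ddof=0))  # 0.4
print(cv_ref(data))          # slightly larger, since ddof=1 divides by n - 1
```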

Source code in python/polars_ds/exprs/ts_features.py

def query_cv(x: str | pl.Expr, ddof: int = 1) -> pl.Expr:
    """
    Returns the coefficient of variation for the variable. This is a shorthand for std / mean.

    Parameters
    ----------
    x
        The variable
    ddof
        The delta degrees of freedom used in the std computation
    """
    xx = to_expr(x)
    return xx.std(ddof=ddof) / xx.mean()

`query_entropy(x, base=math.e, normalize=True)`

Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()

Parameters:

Name	Type	Description	Default
`x`	`str \| Expr`	Either a string or a polars expression	required
`base`	`float`	Base for the log in the entropy computation	`e`
`normalize`	`bool`	Normalize if the probabilities don't sum to 1.	`True`
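
As an illustration of what the shorthand computes, the entropy of a discrete column can be derived from value counts in a few lines of plain Python. The name `entropy_ref` is hypothetical, and the counts are always normalized to probabilities in this sketch.

```python
import math
from collections import Counter


def entropy_ref(values, base=math.e):
    """Shannon entropy of the empirical distribution of `values`."""
    n = len(values)
    return -sum(
        (c / n) * math.log(c / n, base) for c in Counter(values).values()
    )


print(entropy_ref(["a", "a", "b", "b"], base=2))  # two equally likely symbols: 1 bit
```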

Source code in python/polars_ds/exprs/ts_features.py

def query_entropy(x: str | pl.Expr, base: float = math.e, normalize: bool = True) -> pl.Expr:
    """
    Computes the entropy of any discrete column. This is shorthand for x.unique_counts().entropy()

    Parameters
    ----------
    x
        Either a string or a polars expression
    base
        Base for the log in the entropy computation
    normalize
        Normalize if the probabilities don't sum to 1.
    """
    return to_expr(x).unique_counts().entropy(base=base, normalize=normalize)

`query_first_digit_cnt(var)`

Finds the first digit count in the data. This is closely related to Benford's law, which states that the first digits (1-9) follow a certain distribution.

The output is a single element column of type list[u32]. The first value represents the count of 1s that are the first digit, the second value represents the count of 2s that are the first digit, etc.

E.g. first digit of 12 is 1, of 0.0312 is 3. For integers, it is possible to have value = 0, and this will not be counted as a first digit.

Reference

https://en.wikipedia.org/wiki/Benford%27s_law
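
The output layout described above (nine counts, for digits 1 through 9) can be reproduced with a small plain-Python sketch. It reads the first significant digit from `repr()` of the absolute value, which is an implementation choice of this example rather than how the plugin works; `first_digit_counts_ref` is a hypothetical name.

```python
def first_digit_counts_ref(values):
    """Counts of first significant digits 1..9; zeros contribute nothing."""
    counts = [0] * 9
    for v in values:
        # In repr() of a non-negative number, only '0' and '.' can precede
        # the first significant digit, so the first 1-9 character is that digit.
        digit = next((ch for ch in repr(abs(v)) if ch in "123456789"), None)
        if digit is not None:
            counts[int(digit) - 1] += 1
    return counts


print(first_digit_counts_ref([12, 0.0312, 0, 250, 19]))
# [2, 1, 1, 0, 0, 0, 0, 0, 0]
```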

Source code in python/polars_ds/exprs/ts_features.py

def query_first_digit_cnt(var: str | pl.Expr) -> pl.Expr:
    """
    Finds the first digit count in the data. This is closely related to Benford's law,
    which states that the first digits (1-9) follow a certain distribution.

    The output is a single element column of type list[u32]. The first value represents the count of 1s
    that are the first digit, the second value represents the count of 2s that are the first digit, etc.

    E.g. first digit of 12 is 1, of 0.0312 is 3. For integers, it is possible to have value = 0, and this
    will not be counted as a first digit.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Benford%27s_law
    """
    return pl_plugin(
        symbol="pl_benford_law",
        args=[to_expr(var)],
        returns_scalar=True,
    )

`query_knn_entropy(*features, k=3, dist='l2', parallel=False)`

Computes KNN entropy among all the rows.

Note if rows <= k, NaN will be returned.

Parameters:

Name	Type	Description	Default
`*features`	`str \| Expr`	Columns used as features	`()`
`k`	`int`	The number of nearest neighbors to consider. Usually 2 or 3.	`3`
`dist`	Literal[`l2`, `inf`]	Note `l2` here has to be `l2` with square root.	`'l2'`
`parallel`	`bool`	Whether to run the distance query in parallel. This is recommended when you are running only this expression, and not in group_by context.	`False`

Reference

https://arxiv.org/pdf/1506.06501v1.pdf

Source code in python/polars_ds/exprs/ts_features.py

def query_knn_entropy(
    *features: str | pl.Expr,
    k: int = 3,
    dist: Distance = "l2",
    parallel: bool = False,
) -> pl.Expr:
    """
    Computes KNN entropy among all the rows.

    Note if rows <= k, NaN will be returned.

    Parameters
    ----------
    *features
        Columns used as features
    k
        The number of nearest neighbors to consider. Usually 2 or 3.
    dist : Literal[`l2`, `inf`]
        Note `l2` here has to be `l2` with square root.
    parallel : bool
        Whether to run the distance query in parallel. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://arxiv.org/pdf/1506.06501v1.pdf
    """
    if k <= 0:
        raise ValueError("Input `k` must be > 0.")
    if dist not in ["l2", "inf"]:
        raise ValueError("Invalid metric for KNN entropy.")

    return pl_plugin(
        symbol="pl_knn_entropy",
        args=[to_expr(e).alias(str(i)) for i, e in enumerate(features)],
        kwargs={
            "k": k,
            "metric": dist,
            "parallel": parallel,
            "skip_eval": False,
            "skip_data": False,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )

`query_lempel_ziv(b, as_ratio=True)`

Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.

Parameters:

Name	Type	Description	Default
`b`	`str \| Expr`	A boolean column	required
`as_ratio`	`bool`	If true, return complexity / length.	`True`

Source code in python/polars_ds/exprs/ts_features.py

def query_lempel_ziv(b: str | pl.Expr, as_ratio: bool = True) -> pl.Expr:
    """
    Computes Lempel Ziv complexity on a boolean column. Null will be mapped to False.

    Parameters
    ----------
    b
        A boolean column
    as_ratio : bool
        If true, return complexity / length.
    """
    x = to_expr(b)
    out = pl_plugin(
        symbol="pl_lempel_ziv_complexity",
        args=[x],
        returns_scalar=True,
    )
    if as_ratio:
        return out / x.len()
    return out

`query_longest_streak(where)`

Finds the longest streak length where the condition where is true.

Note: the query is still runnable when where doesn't represent boolean column / boolean expressions. However, if that is the case the answer will not be easily interpretable.

Parameters:

Name	Type	Description	Default
`where`	`str \| Expr`	If where is string, the string must represent the name of a boolean column. If where is an expression, the expression must evaluate to some boolean expression.	required

Source code in python/polars_ds/exprs/ts_features.py

def query_longest_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the longest streak length where the condition `where` is true.

    Note: the query is still runnable when `where` doesn't represent boolean column / boolean expressions.
    However, if that is the case the answer will not be easily interpretable.

    Parameters
    ----------
    where
        If where is string, the string must represent the name of a boolean column. If where is
        an expression, the expression must evaluate to some boolean expression.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return (
        y.filter(y.struct.field("value"))
        .struct.field("len")
        .max()
        .fill_null(0)
        .alias("longest_streak")
    )

`query_mean_abs_change(x)`

Returns the mean of all successive differences |X_i - X_i-1|
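
A plain-Python version of the same quantity makes the indexing explicit: with n values there are n - 1 successive differences. This is a reference sketch only, and `mean_abs_change_ref` is a hypothetical name.

```python
def mean_abs_change_ref(values):
    """Mean of |x[i] - x[i-1]| over the n - 1 consecutive pairs."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


print(mean_abs_change_ref([1, 3, 6, 10]))  # (2 + 3 + 4) / 3 = 3.0
```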

Source code in python/polars_ds/exprs/ts_features.py

def query_mean_abs_change(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the mean of all successive differences |X_i - X_i-1|
    """
    return to_expr(x).diff(null_behavior="drop").abs().mean()

`query_mean_n_abs_max(x, n_maxima)`

Returns the average of the top n_maxima of |x|.
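
In other words, take absolute values, keep the `n_maxima` largest, and average them. A plain-Python sketch of that reading follows; `mean_n_abs_max_ref` is a hypothetical name.

```python
def mean_n_abs_max_ref(values, n_maxima):
    """Average of the n_maxima largest absolute values."""
    if n_maxima <= 0:
        raise ValueError("The number of maxima should be > 0.")
    top = sorted((abs(v) for v in values), reverse=True)[:n_maxima]
    return sum(top) / len(top)


print(mean_n_abs_max_ref([-5, 1, 3, -2], n_maxima=2))  # (5 + 3) / 2 = 4.0
```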

Source code in python/polars_ds/exprs/ts_features.py

def query_mean_n_abs_max(x: str | pl.Expr, n_maxima: int) -> pl.Expr:
    """
    Returns the average of the top `n_maxima` of |x|.
    """
    if n_maxima <= 0:
        raise ValueError("The number of maxima should be > 0.")
    return to_expr(x).abs().top_k(n_maxima).mean()

`query_mid_range(x)`

A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.

Source code in python/polars_ds/exprs/ts_features.py

def query_mid_range(x: str | pl.Expr) -> pl.Expr:
    """
    A shorthand for (pl.col(x).max() - pl.col(x).min()) / 2.
    """
    xx = to_expr(x)
    return (xx.max() - xx.min()) / 2

`query_permute_entropy(ts, tau=1, n_dims=3, base=math.e)`

Computes permutation entropy.

Parameters:

Name	Type	Description	Default
`ts`	`str \| Expr`	A time series	required
`tau`	`int`	The embedding time delay which controls the number of time periods between elements of each of the new column vectors.	`1`
`n_dims`	`int, > 1`	The embedding dimension which controls the length of each of the new column vectors	`3`
`base`	`float`	The base for log in the entropy computation	`e`

Reference

https://www.aptech.com/blog/permutation-entropy/
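
For the common case `tau=1`, the computation can be sketched in plain Python: slide a window of length `n_dims`, reduce each window to its ordinal pattern (the argsort of its values), and take the entropy of the pattern frequencies. This is an illustrative sketch, not the Polars expression used by the library; `permute_entropy_ref` is a hypothetical name.

```python
import math
from collections import Counter


def permute_entropy_ref(series, n_dims=3, base=math.e):
    """Permutation entropy with embedding delay 1."""
    windows = (series[i:i + n_dims] for i in range(len(series) - n_dims + 1))
    # Each window is summarized by the index order that would sort it.
    patterns = Counter(
        tuple(sorted(range(n_dims), key=w.__getitem__)) for w in windows
    )
    total = sum(patterns.values())
    return -sum(
        (c / total) * math.log(c / total, base) for c in patterns.values()
    )


# Two ordinal patterns, each seen twice, give an entropy of ln 2.
print(permute_entropy_ref([1, 3, 2, 4, 3, 5]))
```

A monotone series produces a single pattern and therefore an entropy of 0.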

Source code in python/polars_ds/exprs/ts_features.py

def query_permute_entropy(
    ts: str | pl.Expr,
    tau: int = 1,
    n_dims: int = 3,
    base: float = math.e,
) -> pl.Expr:
    """
    Computes permutation entropy.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    tau : int
        The embedding time delay which controls the number of time periods between elements
        of each of the new column vectors.
    n_dims : int, > 1
        The embedding dimension which controls the length of each of the new column vectors
    base : float
        The base for log in the entropy computation

    Reference
    ---------
    https://www.aptech.com/blog/permutation-entropy/
    """
    if n_dims <= 1:
        raise ValueError("Input `n_dims` has to be > 1.")
    if tau < 1:
        raise ValueError("Input `tau` has to be >= 1.")

    t = to_expr(ts)
    if tau == 1:  # Fast track the most common use case
        return (
            pl.concat_list(t, *(t.shift(-i) for i in range(1, n_dims)))
            .head(t.len() - n_dims + 1)
            .list.eval(pl.element().arg_sort())
            .value_counts()  # groupby and count, but returns a struct
            .struct.field("count")  # extract the field named "count"
            .entropy(base=base, normalize=True)
        )
    else:
        return (
            pl.concat_list(
                t.gather_every(tau),
                *(t.shift(-i).gather_every(tau) for i in range(1, n_dims)),
            )
            .slice(0, length=(t.len() // tau) + 1 - (n_dims // tau))
            .list.eval(pl.element().arg_sort())
            .value_counts()
            .struct.field("count")
            .entropy(base=base, normalize=True)
        )

`query_range_count(x, lower, upper)`

Returns the number of values inside [lower, upper].

Source code in python/polars_ds/exprs/ts_features.py

def query_range_count(x: str | pl.Expr, lower: float, upper: float) -> pl.Expr:
    """
    Returns the number of values inside [`lower`, `upper`].
    """
    return to_expr(x).is_between(lower_bound=lower, upper_bound=upper).sum()

`query_sample_entropy(ts, ratio=0.2, m=2, parallel=False)`

Calculate the sample entropy of this column. It is highly recommended that the user impute nulls before calling this.

If NaN is returned or an error is raised, the likely causes are: (1) too little data, e.g. m + 1 > length; (2) ratio or (ratio * std) is too close to or below 0, or std is null/NaN.

Parameters:

Name	Type	Description	Default
`ts`	`str \| Expr`	A time series	required
`ratio`	`float`	The tolerance parameter. Default is 0.2.	`0.2`
`m`	`int`	Length of a run of data. Most common run length is 2.	`2`
`parallel`	`bool`	Whether to run this in parallel or not. This is recommended when you are running only this expression, and not in group_by context.	`False`

Reference

https://en.wikipedia.org/wiki/Sample_entropy
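
A brute-force sketch of sample entropy, following the definition in the article above, shows how `m` and the tolerance interact. It is plain Python rather than the Rust plugin, takes an absolute tolerance `r` (the listed source derives it as `ratio` times the population standard deviation), and uses the hypothetical name `sample_entropy_ref`.

```python
import math
import statistics


def sample_entropy_ref(series, m, r):
    """-ln(A / B): A and B count template pairs of length m + 1 and m within r."""
    n = len(series)

    def matches(length):
        # Use the same n - m starting points for both lengths so counts compare.
        templates = [series[i:i + length] for i in range(n - m)]
        return sum(
            1
            for i, a in enumerate(templates)
            for j, b in enumerate(templates)
            if i != j and max(abs(p - q) for p, q in zip(a, b)) < r
        )

    return -math.log(matches(m + 1) / matches(m))


series = [1.0, 2.0] * 5
r = 0.2 * statistics.pstdev(series)  # mirrors ratio * std(ddof=0)
print(sample_entropy_ref(series, m=2, r=r))  # perfectly regular series: zero entropy
```

If no templates of length m + 1 match, the logarithm is undefined and this sketch raises an error, which corresponds to the NaN/error cases noted above.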

Source code in python/polars_ds/exprs/ts_features.py

def query_sample_entropy(
    ts: str | pl.Expr, ratio: float = 0.2, m: int = 2, parallel: bool = False
) -> pl.Expr:
    """
    Calculate the sample entropy of this column. It is highly
    recommended that the user impute nulls before calling this.

    If NaN/some error is returned/thrown, it is likely that:
    (1) Too little data, e.g. m + 1 > length
    (2) ratio or (ratio * std) is too close to or below 0 or std is null/NaN.

    Parameters
    ----------
    ts : str | pl.Expr
        A time series
    ratio : float
        The tolerance parameter. Default is 0.2.
    m : int
        Length of a run of data. Most common run length is 2.
    parallel : bool
        Whether to run this in parallel or not. This is recommended when you
        are running only this expression, and not in group_by context.

    Reference
    ---------
    https://en.wikipedia.org/wiki/Sample_entropy
    """
    if m <= 1:
        raise ValueError("Input `m` must be > 1.")

    t = to_expr(ts)
    r = ratio * t.std(ddof=0)
    rows = t.len() - m + 1

    data = [r, t.slice(0, length=rows)]
    # See rust code for more comment on why I put m + 1 here.
    data.extend(
        t.shift(-i).slice(0, length=rows).alias(str(i)) for i in range(1, m + 1)
    )  # More errors are handled in Rust
    return pl_plugin(
        symbol="pl_sample_entropy",
        args=data,
        kwargs={
            "k": 0,
            "metric": "inf",
            "parallel": parallel,
        },
        returns_scalar=True,
        pass_name_to_apply=True,
    )

`query_similar_count(query, target, threshold, metric='sqzl2', parallel=False, return_ratio=False)`

Given a query subsequence, find the number of same-sized subsequences (windows) in target series that have distance < threshold from it.

Note: If target is largely null, errors may occur. If metric is sqzl2, a minimum variance of 1e-10 is applied to all variance calculations to avoid division by 0.

Parameters:

Name	Type	Description	Default
`query`	`Iterable[float]`	The query subsequence. Must not contain nulls.	required
`target`	`str \| Expr`	The target time series.	required
`threshold`	`float`	The distance threshold	required
`metric`	`Literal['sql2', 'sqzl2']`	Either 'sql2' or 'sqzl2', which stands for squared l2 and squared z-normalized l2.	`'sqzl2'`
`parallel`	`bool`	Only applies when method is `direct`. Whether to compute the convolution in parallel. Note that this may not have the expected performance when you are in group_by or other parallel context already. It is recommended to use this in select/with_columns context, when few expressions are being run at the same time.	`False`
`return_ratio`	`bool`	If true, return # of similar subsequences / total number of subsequences.	`False`
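
For the `'sql2'` metric, the count can be described as a sliding-window comparison using squared Euclidean distance. The plain-Python sketch below illustrates that reading and does not cover the z-normalized `'sqzl2'` variant; `similar_count_ref` is a hypothetical name.

```python
def similar_count_ref(query, target, threshold, return_ratio=False):
    """Number (or share) of windows in target with squared L2 distance < threshold."""
    width = len(query)
    n_windows = len(target) - width + 1
    count = sum(
        1
        for start in range(n_windows)
        if sum(
            (t - q) ** 2 for t, q in zip(target[start:start + width], query)
        ) < threshold
    )
    return count / n_windows if return_ratio else count


target = [1.0, 2.0, 3.0, 1.0, 2.1]
print(similar_count_ref([1.0, 2.0], target, threshold=0.1))  # 2 of the 4 windows match
```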

Source code in python/polars_ds/exprs/ts_features.py

def query_similar_count(
    query: Iterable[float],
    target: str | pl.Expr,
    threshold: float,
    metric: Literal["sql2", "sqzl2"] = "sqzl2",
    parallel: bool = False,
    return_ratio: bool = False,
) -> pl.Expr:
    """
    Given a query subsequence, find the number of same-sized subsequences (windows) in target
    series that have distance < threshold from it.

    Note: If target is largely null, errors may occur. If metric is sqzl2, a minimum variance
    of 1e-10 is applied to all variance calculations to avoid division by 0.

    Parameters
    ----------
    query
        The query subsequence. Must not contain nulls.
    target
        The target time series.
    threshold
        The distance threshold
    metric
        Either 'sql2' or 'sqzl2', which stands for squared l2 and squared z-normalized l2.
    parallel
        Only applies when method is `direct`. Whether to compute the convolution in parallel. Note that this may not
        have the expected performance when you are in group_by or other parallel context already. It is recommended
        to use this in select/with_columns context, when few expressions are being run at the same time.
    return_ratio
        If true, return # of similar subsequences / total number of subsequences.
    """

    q = pl.Series(name="", values=query, dtype=pl.Float64)
    if q.null_count() > 0:
        raise ValueError("Nulls found in the query subsequence.")
    if len(q) <= 1:
        raise ValueError("Length of the query should be > 1.")

    t = to_expr(target)
    kwargs = {"threshold": threshold, "parallel": parallel}
    if metric == "sql2":
        result = pl_plugin(
            symbol="pl_subseq_sim_cnt_l2",
            args=[t.cast(pl.Float64).rechunk(), q],
            kwargs=kwargs,
            returns_scalar=True,
        )
    elif metric == "sqzl2":  # pl_subseq_sim_cnt_zl2
        rolling_mean = t.rolling_mean(window_size=len(q)).slice(len(q) - 1, None)
        rolling_var = pl.max_horizontal(
            t.rolling_var(window_size=len(q)).slice(len(q) - 1, None).fill_nan(1e-10),
            pl.lit(1e-10, dtype=pl.Float64),
        )
        qq = pl.lit(q)
        args = [
            t.cast(pl.Float64).rechunk(),
            ((qq - qq.mean()) / qq.std()).rechunk(),
            rolling_mean.rechunk(),
            rolling_var.rechunk(),
        ]
        result = pl_plugin(
            symbol="pl_subseq_sim_cnt_zl2",
            args=args,
            kwargs=kwargs,
            returns_scalar=True,
        )
    else:
        raise ValueError(f"Unsupported metric {metric}.")

    if return_ratio:
        return result / (t.len() - len(q) + 1)
    return result
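
To make the two metrics concrete, here is a minimal pure-Python sketch of the sliding-window count. It is a reference illustration only, not the plugin's implementation: the function name `similar_count` and the helper `_znorm` are hypothetical, and the sample variance (ddof=1) with a 1e-10 floor is an assumption modeled on the `rolling_var` and `std` calls in the source above.

```python
from statistics import fmean, variance


def _znorm(values, min_var=1e-10):
    """Z-normalize a window; the variance is floored to avoid division by 0."""
    mean = fmean(values)
    std = max(variance(values), min_var) ** 0.5  # sample variance (assumption)
    return [(v - mean) / std for v in values]


def similar_count(query, target, threshold, metric="sqzl2", return_ratio=False):
    """Count same-sized windows of `target` with distance < threshold from `query`."""
    m = len(query)
    if m <= 1:
        raise ValueError("Length of the query should be > 1.")
    if metric not in ("sql2", "sqzl2"):
        raise ValueError(f"Unsupported metric {metric}.")
    n_windows = len(target) - m + 1
    if n_windows < 1:
        raise ValueError("Target must be at least as long as the query.")

    q = _znorm(query) if metric == "sqzl2" else list(query)
    count = 0
    for start in range(n_windows):
        window = target[start : start + m]
        if metric == "sqzl2":
            window = _znorm(window)
        # Squared L2 distance between the query and the current window.
        dist = sum((a - b) ** 2 for a, b in zip(q, window))
        if dist < threshold:
            count += 1
    return count / n_windows if return_ratio else count
```

Under `sql2` only windows with nearly identical values match, while under `sqzl2` a window such as `[10, 20, 30]` matches the query `[1, 2, 3]` because both have the same shape after normalization.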

`query_streak(where)`

Finds the streak length where the condition `where` is true. This returns a full column of streak lengths.

Note: the query is still runnable when `where` doesn't represent a boolean column or boolean expression. However, if that is the case, the answer will not be easily interpretable.

Parameters:

Name	Type	Description	Default
`where`	`str \| Expr`	If `where` is a string, the string must represent the name of a boolean column. If `where` is an expression, the expression must evaluate to a boolean series.	required

Source code in python/polars_ds/exprs/ts_features.py

def query_streak(where: str | pl.Expr) -> pl.Expr:
    """
    Finds the streak length where the condition `where` is true. This returns a full column of streak lengths.

    Note: the query is still runnable when `where` doesn't represent boolean column / boolean expressions.
    However, if that is the case the answer will not be easily interpretable.

    Parameters
    ----------
    where
        If `where` is a string, the string must represent the name of a boolean column. If `where` is
        an expression, the expression must evaluate to a boolean series.
    """

    if isinstance(where, str):
        condition = pl.col(where)
    else:
        condition = where

    y = condition.rle().struct.rename_fields(
        ["len", "value"]
    )  # POLARS V1 rename fields can be removed when polars hit v1.0
    return y.struct.field("len").alias("streak_len")
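
The source above relies on run-length encoding (`rle()`) and keeps only the `len` field. As a rough illustration of that step, the following pure-Python sketch produces the same kind of (length, value) pairs with `itertools.groupby`. The helper name `run_lengths` is hypothetical and is not part of the library.

```python
from itertools import groupby


def run_lengths(values):
    """Run-length encode a sequence into (length, value) pairs, one per run."""
    return [(sum(1 for _ in group), value) for value, group in groupby(values)]


condition = [True, True, False, True, True, True]
runs = run_lengths(condition)
# Keeping only the lengths mirrors selecting the "len" struct field.
streak_len = [length for length, _ in runs]
```

Note that run-length encoding emits one entry per run, including runs where the value is false, so the lengths are easiest to interpret alongside the run values.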

`query_symm_ratio(x)`

Returns the symmetric ratio: |mean - median| / (max - min). Note the closer to 0 this value is, the more symmetric the series is.

Source code in python/polars_ds/exprs/ts_features.py

def query_symm_ratio(x: str | pl.Expr) -> pl.Expr:
    """
    Returns the symmetric ratio: |mean - median| / (max - min). Note the closer to 0 this value is,
    the more symmetric the series is.
    """
    y = to_expr(x)
    return (y.mean() - y.median()).abs() / (y.max() - y.min())
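
A small pure-Python sketch of the same formula may help when checking results by hand. The function name `symm_ratio` is hypothetical; it simply evaluates |mean - median| / (max - min) on a list.

```python
from statistics import fmean, median


def symm_ratio(values):
    """|mean - median| / (max - min); values near 0 suggest a symmetric series."""
    spread = max(values) - min(values)
    # A constant series has zero spread, so this plain-Python version raises
    # ZeroDivisionError in that case.
    return abs(fmean(values) - median(values)) / spread
```

For example, `[1, 2, 3, 4, 5]` has equal mean and median and therefore a ratio of 0, while a single large outlier, as in `[1, 1, 1, 1, 11]`, pulls the mean away from the median and raises the ratio.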

`query_time_reversal_asymmetry_stats(x, n_lags)`

Queries the Time Reversal Asymmetry Statistic, which is the average of (L^2(x)^2 * L(x) - L(x) * x^2), where L is the lag operator.

Source code in python/polars_ds/exprs/ts_features.py

def query_time_reversal_asymmetry_stats(x: str | pl.Expr, n_lags: int) -> pl.Expr:
    """
    Queries the Time Reversal Asymmetry Statistic, which is the average of
    (L^2(x)^2 * L(x) - L(x) * x^2), where L is the lag operator.
    """
    y = to_expr(x)
    one_lag = y.shift(-n_lags)
    two_lag = y.shift(-2 * n_lags)  # Nulls won't be in the mean calculation
    return (one_lag * (two_lag + y) * (two_lag - y)).mean()
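
The returned expression factors each term as L(x) * (L^2(x) + x) * (L^2(x) - x), which equals L(x) * (L^2(x)^2 - x^2). The pure-Python sketch below spells this out with explicit indices, assuming (as in the source) that positions without a full pair of shifted values are dropped from the mean. The function name `time_reversal_asymmetry` is hypothetical.

```python
from statistics import fmean


def time_reversal_asymmetry(values, n_lags):
    """Mean of x[i + n] * (x[i + 2n]**2 - x[i]**2) over all valid positions i."""
    n_terms = len(values) - 2 * n_lags
    if n_terms < 1:
        raise ValueError("Series is too short for the requested lag.")
    terms = [
        values[i + n_lags] * (values[i + 2 * n_lags] ** 2 - values[i] ** 2)
        for i in range(n_terms)
    ]
    return fmean(terms)
```

A series that looks the same forwards and backwards, such as `[1, 2, 1, 2, 1]`, yields 0, whereas a monotone ramp yields a nonzero value.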

`query_transfer_entropy(x, source, lag=1, k=3, parallel=False)`

Estimates the transfer entropy from `source` to `x` with the given lag.

Reference

Jian Ma. Estimating Transfer Entropy via Copula Entropy. arXiv preprint arXiv:1910.04375, 2019.

Source code in python/polars_ds/exprs/ts_features.py

def query_transfer_entropy(
    x: str | pl.Expr, source: str | pl.Expr, lag: int = 1, k: int = 3, parallel: bool = False
) -> pl.Expr:
    """
    Estimates the transfer entropy from `source` to `x` with the given lag.

    Reference
    ---------
    Jian Ma. Estimating Transfer Entropy via Copula Entropy. arXiv preprint arXiv:1910.04375, 2019.
    """
    if lag < 1:
        raise ValueError("Input `lag` must be >= 1.")

    xx = to_expr(x)
    x1 = xx.slice(0, pl.len() - lag)
    x2 = xx.slice(lag, pl.len() - lag)  # (equivalent to slice(lag, None), but will break in v1.0)
    s = to_expr(source).slice(0, pl.len() - lag)
    return query_cond_indep(x2, s, x1, k=k, parallel=parallel)
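
The slicing in the source lines up three equal-length series before calling `query_cond_indep`: the future of `x`, the past of `source`, and the past of `x`. The sketch below reproduces only that alignment on plain lists; the function name `align_for_transfer_entropy` is hypothetical, and the conditional independence estimate itself is not reimplemented here.

```python
def align_for_transfer_entropy(x, source, lag=1):
    """Return (x_future, source_past, x_past), each of length len(x) - lag."""
    if lag < 1:
        raise ValueError("Input `lag` must be >= 1.")
    n = len(x)
    x_past = x[: n - lag]          # corresponds to x1 in the source
    x_future = x[lag:]             # corresponds to x2 in the source
    source_past = source[: n - lag]  # corresponds to s in the source
    return x_future, source_past, x_past
```

With this alignment, row `i` pairs `x[i + lag]` with `source[i]` and `x[i]`, so the estimate asks whether the past of `source` carries information about the future of `x` beyond what the past of `x` already provides.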