String Functions Expr

Extension for String Manipulation and Metrics

String-related utils, including edit distances, simple cleaning, etc.

Functions:

Name	Description
`extract_numbers`	Extracts numbers from the string column, and stores them in a list.
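`filter_by_hamming`	Returns whether the Hamming distance between self and other is <= bound. This is faster than computing the Hamming distance and then doing a filter.
`filter_by_levenshtein`	Returns whether the Levenshtein distance between self and other is <= bound. This is faster than computing the Levenshtein distance and then doing a filter.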
`map_words`	Replace words based on the specified mapping.
`normalize_whitespace`	Normalize whitespace to one, e.g. 'a   b' -> 'a b'.
`remove_diacritics`	Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized form and removing the resulting non-ASCII components.
`replace_non_ascii`	Replaces non-Ascii values with the specified value.
`similar_to_vocab`	Compare each word in the vocab with each word in the column c. Returns a boolean that indicates whether there exist words in c that are similar to words in vocab.
`str_d_leven`	Computes the Damerau-Levenshtein distance between this and the other str.
`str_fuzz`	Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more information.)
`str_hamming`	Computes the Hamming distance between two strings. If they do not have the same length, null will be returned.
`str_jaccard`	Treats substrings of size `substr_size` as a set and computes the Jaccard similarity between this word and the other.
`str_jaro`	Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.
`str_jw`	Computes the Jaro-Winkler similarity between this and the other str.
`str_lcs_subseq`	Extracts the longest common subsequence from the string between this and the other string.
`str_lcs_subseq_dist`	Computes the Longest Common Subsequence distance/similarity between this and the other str.
`str_lcs_substr`	Extracts the longest common substring from the string between this and the other string.
`str_leven`	Computes the Levenshtein distance between this and the other str.
`str_nearest`	Finds the string in the column that is nearest to the given word in the given metric. This algorithm is
`str_osa`	Computes the Optimal String Alignment distance between this and the other str.
`str_sorensen_dice`	Treats substrings of size `substr_size` as a set. And computes the Sorensen-Dice similarity between
`str_tversky_sim`	Treats substrings of size `substr_size` as a set. And computes the tversky_sim similarity between
`to_camel_case`	Turns itself into camel case. E.g. helloWorld
`to_constant_case`	Turns itself into constant case. E.g. Hello_World
`to_pascal_case`	Turns itself into Pascal case. E.g. HelloWorld
`to_snake_case`	Turns itself into snake case. E.g. hello_world

`extract_numbers(c, ignore_comma=False, join_by='', dtype=pl.String)`

Extracts numbers from the string column, and stores them in a list.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`ignore_comma`	`bool`	Whether to remove all commas before matching for numbers	`False`
`join_by`	`str`	If dtype is pl.String, join the list of strings using the value given here	`''`
`dtype`	`DataType`	The desired inner dtype for the extracted data. Should be either one of Polars' numerical types or pl.String	`String`

Examples:

>>> df = pl.DataFrame(
...     {
...         "survey": [
...             "0% of my time",
...             "1% to 25% of my time",
...             "75% to 99% of my time",
...             "50% to 74% of my time",
...             "75% to 99% of my time",
...             "50% to 74% of my time",
...         ]
...     }
... )
>>> df.select(pds.extract_numbers("survey", dtype=pl.UInt32))
shape: (6, 1)
┌───────────┐
│ survey    │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [0]       │
│ [1, 25]   │
│ [75, 99]  │
│ [50, 74]  │
│ [75, 99]  │
│ [50, 74]  │
└───────────┘
>>> df.select(pds.extract_numbers("survey", join_by="-", dtype=pl.String))
shape: (6, 1)
┌────────┐
│ survey │
│ ---    │
│ str    │
╞════════╡
│ 0      │
│ 1-25   │
│ 75-99  │
│ 50-74  │
│ 75-99  │
│ 50-74  │
└────────┘
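The source for this function finds numbers with the regular expression `(\d*\.?\d+)`. The sketch below applies the same pattern with Python's standard `re` module to show what gets matched. `extract_numbers_sketch` is a hypothetical helper written for illustration only; it is not part of polars_ds, it works on a single string rather than a column, and it covers only the string output path.

```python
import re

# Same pattern as in the function's source: optional integer part,
# optional decimal point, then at least one digit.
NUMBER_PATTERN = re.compile(r"(\d*\.?\d+)")


def extract_numbers_sketch(text, ignore_comma=False, join_by=""):
    """Hypothetical single-string analogue of extract_numbers (string output only)."""
    if ignore_comma:
        # Drop commas first so "1,000" is matched as one number.
        text = text.replace(",", "")
    found = NUMBER_PATTERN.findall(text)
    # Join only when a separator is given, otherwise keep the list.
    return join_by.join(found) if join_by != "" else found


print(extract_numbers_sketch("1% to 25% of my time"))
print(extract_numbers_sketch("1% to 25% of my time", join_by="-"))
print(extract_numbers_sketch("1,000 items", ignore_comma=True))
```

Because the pattern contains no sign or exponent, this sketch reads "-5" as "5" and leaves commas as separators unless `ignore_comma` is set.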

Source code in python/polars_ds/exprs/string.py

def extract_numbers(
    c: str | pl.Expr,
    ignore_comma: bool = False,
    join_by: str = "",
    dtype: pl.DataType = pl.String,
) -> pl.Expr:
    """
    Extracts numbers from the string column, and stores them in a list.

    Parameters
    ----------
    c
        The string column
    ignore_comma
        Whether to remove all commas before matching for numbers
    join_by
        If dtype is pl.String, join the list of strings using the value given here
    dtype
        The desired inner dtype for the extracted data. Should be either one of
        Polars' numerical types or pl.String

    Examples
    --------
    >>> df = pl.DataFrame(
    ...     {
    ...         "survey": [
    ...             "0% of my time",
    ...             "1% to 25% of my time",
    ...             "75% to 99% of my time",
    ...             "50% to 74% of my time",
    ...             "75% to 99% of my time",
    ...             "50% to 74% of my time",
    ...         ]
    ...     }
    ... )
    >>> df.select(pds.extract_numbers("survey", dtype=pl.UInt32))
    shape: (6, 1)
    ┌───────────┐
    │ survey    │
    │ ---       │
    │ list[u32] │
    ╞═══════════╡
    │ [0]       │
    │ [1, 25]   │
    │ [75, 99]  │
    │ [50, 74]  │
    │ [75, 99]  │
    │ [50, 74]  │
    └───────────┘
    >>> df.select(pds.extract_numbers("survey", join_by="-", dtype=pl.String))
    shape: (6, 1)
    ┌────────┐
    │ survey │
    │ ---    │
    │ str    │
    ╞════════╡
    │ 0      │
    │ 1-25   │
    │ 75-99  │
    │ 50-74  │
    │ 75-99  │
    │ 50-74  │
    └────────┘
    """
    expr = to_expr(c)
    if ignore_comma:
        expr = expr.str.replace_all(",", "")

    # Find all numbers
    expr = expr.str.extract_all(r"(\d*\.?\d+)")

    if dtype in [
        pl.UInt8,
        pl.UInt16,
        pl.UInt32,
        pl.UInt64,
        pl.Int8,
        pl.Int16,
        pl.Int32,
        pl.Int64,
        pl.Float32,
        pl.Float64,
    ]:
        expr = expr.list.eval(pl.element().cast(dtype))
    elif dtype == pl.String:  # As a list of strings
        if join_by != "":
            expr = expr.list.join(join_by)

    return expr

`filter_by_hamming(c, other, bound, pad=False, parallel=False)`

Returns whether the hamming distance between self and other is <= bound. This is faster than computing hamming distance and then doing a filter.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	Either the name of the column or a Polars expression	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`bound`	`int`	Closed upper bound. If distance <= bound, return true and false otherwise.	required
`pad`	`bool`	Whether to pad the strings to the same length. If False, and strings have different lengths, they will be filtered out.	`False`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
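To illustrate why a bounded check can be cheaper than computing the full distance and filtering afterwards, here is a minimal pure-Python sketch. It assumes the saving comes from stopping as soon as the bound is exceeded; the compiled plugin's actual strategy is not shown in this page, and `hamming_within` is a hypothetical name.

```python
def hamming_within(a: str, b: str, bound: int) -> bool:
    """Return True if the Hamming distance between a and b is <= bound."""
    if len(a) != len(b):
        # Without padding, strings of unequal length are filtered out.
        return False
    mismatches = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            mismatches += 1
            if mismatches > bound:
                # Early exit: the bound is already exceeded.
                return False
    return True


print(hamming_within("karolin", "kathrin", 3))
print(hamming_within("karolin", "kathrin", 2))
```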

Source code in python/polars_ds/exprs/string.py

def filter_by_hamming(
    c: str | pl.Expr,
    other: str | pl.Expr,
    bound: int,
    pad: bool = False,
    parallel: bool = False,
) -> pl.Expr:
    """
    Returns whether the hamming distance between self and other is <= bound. This is
    faster than computing hamming distance and then doing a filter.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    bound
        Closed upper bound. If distance <= bound, return true and false otherwise.
    pad
        Whether to pad the strings to the same length. If False, and strings have different lengths,
        they will be filtered out.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_hamming_filter",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(bound, dtype=pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )

`filter_by_levenshtein(c, other, bound, parallel=False, as_bytes=False)`

Returns whether the Levenshtein distance between self and other is <= bound. This is faster than computing levenshtein distance and then doing a filter.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	Either the name of the column or a Polars expression	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`bound`	`int`	Closed upper bound. If distance <= bound, return true and false otherwise.	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
`as_bytes`	`bool`	Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.	`False`
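A bounded Levenshtein check can reject many pairs without finishing the full dynamic program. The sketch below shows two common shortcuts (a length-difference test and a row-minimum test); it is an assumption that the plugin does something along these lines, and `levenshtein_within` is a hypothetical name.

```python
def levenshtein_within(a: str, b: str, bound: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= bound."""
    # The distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > bound:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Row minima never decrease, so the final distance cannot be
        # smaller than the minimum of the current row.
        if min(curr) > bound:
            return False
        prev = curr
    return prev[-1] <= bound


print(levenshtein_within("kitten", "sitting", 3))
print(levenshtein_within("kitten", "sitting", 2))
```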

Source code in python/polars_ds/exprs/string.py

def filter_by_levenshtein(
    c: str | pl.Expr,
    other: str | pl.Expr,
    bound: int,
    parallel: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Returns whether the Levenshtein distance between self and other is <= bound. This is
    faster than computing levenshtein distance and then doing a filter.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    bound
        Closed upper bound. If distance <= bound, return true and false otherwise.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    as_bytes
        Whether to treat the strings as ASCII characters. This will boost performance but does not
        work on non-ASCII characters.
    """
    params = {"bound": abs(bound), "parallel": parallel, "as_bytes": as_bytes}
    return pl_plugin(
        symbol="pl_levenshtein_filter",
        args=[to_expr(c), to_expr(other)],
        is_elementwise=True and not parallel,
        kwargs=params,
    )

`map_words(c, mapping)`

Replace words based on the specified mapping.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`mapping`	`dict[str, str]`	A dictionary of {word: the replacement}	required

Returns:

Type	Description
`Expr`

Examples:

>>> df = pl.DataFrame({"x": ["one two three"]})
>>> df.select(pds.map_words("x", {"two": "2"}))
shape: (1, 1)
┌─────────────┐
│ x           │
│ ---         │
│ str         │
╞═════════════╡
│ one 2 three │
└─────────────┘

Source code in python/polars_ds/exprs/string.py

def map_words(c: str | pl.Expr, mapping: Dict[str, str]) -> pl.Expr:
    """
    Replace words based on the specified mapping.

    Parameters
    ----------
    c : str | pl.Expr
        The string column
    mapping : dict[str, str]
        A dictionary of {word: the replacement}

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["one two three"]})
    >>> df.select(pds.map_words("x", {"two": "2"}))
    shape: (1, 1)
    ┌─────────────┐
    │ x           │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ one 2 three │
    └─────────────┘
    """
    return pl_plugin(
        symbol="map_words",
        args=[to_expr(c)],
        kwargs={"mapping": mapping},
        is_elementwise=True,
    )

`normalize_whitespace(c, only_spaces=False)`

Normalize whitespace to one, e.g. 'a   b' -> 'a b'.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`only_spaces`	`bool`	If True, only split on the space character ' ' instead of any whitespace character such as '\t' and '\n'.	`False`

Returns:

Type	Description
`Expr`

Examples:

shape: (2, 3)
┌─────────┬─────┬────────┐
│ x       ┆ y   ┆ z      │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ str ┆ str    │
╞═════════╪═════╪════════╡
│ a     b ┆ a b ┆ a b    │
│ a     b ┆ a b ┆ a     b│
└─────────┴─────┴────────┘
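The two modes can be approximated on a single string with Python's `re` module. The `only_spaces=True` branch mirrors the `" +"` replacement in the source; the default branch is handled by a compiled plugin, so treating it as "collapse every run of `\s+` into one space" is an assumption, as is the handling of leading and trailing whitespace. `normalize_whitespace_sketch` is a hypothetical name.

```python
import re


def normalize_whitespace_sketch(text: str, only_spaces: bool = False) -> str:
    """Hypothetical single-string analogue of normalize_whitespace."""
    if only_spaces:
        # Collapse runs of the space character only; tabs and newlines stay.
        return re.sub(r" +", " ", text)
    # Assumed behaviour: collapse runs of any whitespace into one space.
    return re.sub(r"\s+", " ", text)


print(repr(normalize_whitespace_sketch("a   b")))
print(repr(normalize_whitespace_sketch("a\t  b", only_spaces=True)))
```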

Source code in python/polars_ds/exprs/string.py

def normalize_whitespace(c: str | pl.Expr, only_spaces: bool = False) -> pl.Expr:
    """
    Normalize whitespace to one, e.g. 'a   b' -> 'a b'.

    Parameters
    ----------
    c : str | pl.Expr
        The string column
    only_spaces: bool
        If True, only split on the space character ' ' instead of any whitespace
        character such as '\t' and '\n', by default False

    Returns
    -------
    pl.Expr

    Examples
    --------
    shape: (2, 3)
    ┌─────────┬─────┬────────┐
    │ x       ┆ y   ┆ z      │
    │ ---     ┆ --- ┆ ---    │
    │ str     ┆ str ┆ str    │
    ╞═════════╪═════╪════════╡
    │ a     b ┆ a b ┆ a b    │
    │ a	    b ┆ a b ┆ a	    b│
    └─────────┴─────┴────────┘
    """
    expr = to_expr(c)

    if only_spaces:
        return expr.str.replace_all(" +", " ")

    return pl_plugin(
        symbol="normalize_whitespace",
        args=[expr],
        is_elementwise=True,
    )

`remove_diacritics(c)`

Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized form and removing the resulting non-ASCII components.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required

Returns:

Type	Description
`Expr`

Examples:

>>> df = pl.DataFrame({"x": ["mercy", "mèrcy"]})
>>> df.select(pds.remove_diacritics("x"))
shape: (2, 1)
┌───────┐
│ x     │
│ ---   │
│ str   │
╞═══════╡
│ mercy │
│ mercy │
└───────┘
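The NFD-then-strip approach described for this function can be reproduced on a single string with Python's standard `unicodedata` module. This is a sketch of the technique only; `remove_diacritics_sketch` is a hypothetical name and the plugin itself is compiled code.

```python
import unicodedata


def remove_diacritics_sketch(text: str) -> str:
    """Decompose to NFD, then drop every non-ASCII code point."""
    # NFD splits "è" into "e" followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    # The combining marks are non-ASCII, so filtering removes them.
    return "".join(ch for ch in decomposed if ch.isascii())


print(remove_diacritics_sketch("m\u00e8rcy"))
print(remove_diacritics_sketch("na\u00efve caf\u00e9"))
```

Characters that have no ASCII base letter under NFD (for example "ß") are dropped entirely by this sketch.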

Source code in python/polars_ds/exprs/string.py

def remove_diacritics(c: str | pl.Expr) -> pl.Expr:
    """Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized
    form and removing the resulting non-ASCII components.

    Parameters
    ----------
    c : str | pl.Expr

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["mercy", "mèrcy"]})
    >>> df.select(pds.remove_diacritics("x"))
    shape: (2, 1)
    ┌───────┐
    │ x     │
    │ ---   │
    │ str   │
    ╞═══════╡
    │ mercy │
    │ mercy │
    └───────┘
    """
    return pl_plugin(
        symbol="remove_diacritics",
        args=[to_expr(c)],
        is_elementwise=True,
    )

`replace_non_ascii(c, value='')`

Replaces non-Ascii values with the specified value.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The column name or expression	required
`value`	`str`	The value to replace non-Ascii values with, by default ""	`''`

Returns:

Type	Description
`Expr`

Examples:

>>> df = pl.DataFrame({"x": ["mercy", "xbĤ", "ĤŇƏ"]})
>>> df.select(pds.replace_non_ascii("x"))
shape: (3, 1)
┌───────┐
│ x     │
│ ---   │
│ str   │
╞═══════╡
│ mercy │
│ xb    │
│       │
└───────┘
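For a single string, the same replacement can be sketched with Python's `re` module. The character class `[^\x00-\x7F]` is used here as a stand-in for the `[^\p{Ascii}]` class in the source, and `replace_non_ascii_sketch` is a hypothetical name.

```python
import re

# Any single code point outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii_sketch(text: str, value: str = "") -> str:
    """Replace each non-ASCII character in text with value."""
    # A function is passed so that value is inserted literally,
    # without backslash escapes being interpreted.
    return NON_ASCII.sub(lambda _match: value, text)


print(replace_non_ascii_sketch("xb\u0124"))
print(replace_non_ascii_sketch("m\u00e8rcy", "?"))
```

Unlike diacritic removal, this drops the accented letter as a whole: "mèrcy" becomes "mrcy" when `value` is empty.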

Source code in python/polars_ds/exprs/string.py

def replace_non_ascii(c: str | pl.Expr, value: str = "") -> pl.Expr:
    """Replaces non-Ascii values with the specified value.

    Parameters
    ----------
    c : str | pl.Expr
        The column name or expression
    value : str
        The value to replace non-Ascii values with, by default ""

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["mercy", "xbĤ", "ĤŇƏ"]})
    >>> df.select(pds.replace_non_ascii("x"))
    shape: (3, 1)
    ┌───────┐
    │ x     │
    │ ---   │
    │ str   │
    ╞═══════╡
    │ mercy │
    │ xb    │
    │       │
    └───────┘
    """
    expr = to_expr(c)

    if value == "":
        return pl_plugin(
            symbol="remove_non_ascii",
            args=[expr],
            is_elementwise=True,
        )

    return expr.str.replace_all(r"[^\p{Ascii}]", value)

`similar_to_vocab(c, vocab, threshold, metric='lv', strategy='avg', as_bytes=False)`

Compare each word in the vocab with each word in the column c. Returns a boolean that indicates whether there exist words in c that are similar to words in vocab.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`vocab`	`List[str]`	Any iterable collection of strings	required
`threshold`	`float`	An entry is considered similar to the words in the vocabulary if the similarity is at or above (>=) the threshold	required
`metric`	`Literal['lv', 'dlv', 'jw', 'osa']`	Which similarity metric to use. One of `lv`, `dlv`, `jw`, `osa`	`'lv'`
`strategy`	`Literal['avg', 'all', 'any']`	If `avg`, then will return true if the average similarity is above the threshold. If `all`, then will return true if the similarity to all words in the vocab is above the threshold. If `any`, then will return true if the similarity to any words in the vocab is above the threshold.	`'avg'`
`as_bytes`	`bool`	Only works for Levenshtein distance. Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.	`False`
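The three strategies differ only in how the per-word similarities are combined. The sketch below applies them to a plain list of similarity scores for one entry; `is_similar` is a hypothetical helper, and the scores are made-up inputs chosen for illustration.

```python
def is_similar(similarities, threshold, strategy="avg"):
    """Combine per-vocab-word similarities for one entry into a boolean."""
    if strategy == "all":
        # Every vocab word must be similar enough.
        return all(s >= threshold for s in similarities)
    if strategy == "any":
        # One similar vocab word is sufficient.
        return any(s >= threshold for s in similarities)
    if strategy == "avg":
        # The mean similarity over the vocab must reach the threshold.
        return sum(similarities) / len(similarities) >= threshold
    raise ValueError(f"Unknown strategy: {strategy}")


scores = [0.9, 0.5, 0.4]  # illustrative similarities to three vocab words
for strategy in ("any", "avg", "all"):
    print(strategy, is_similar(scores, 0.7, strategy))
```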

Source code in python/polars_ds/exprs/string.py

def similar_to_vocab(
    c: str | pl.Expr,
    vocab: List[str],
    threshold: float,
    metric: Literal["lv", "dlv", "jw", "osa"] = "lv",
    strategy: Literal["avg", "all", "any"] = "avg",
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Compare each word in the vocab with each word in the column c. Returns a boolean
    that indicates whether there exist words in c that are similar to words in vocab.

    Parameters
    ----------
    c
        The string column
    vocab
        Any iterable collection of strings
    threshold
        An entry is considered similar to the words in the vocabulary if the similarity
        is at or above (>=) the threshold
    metric
        Which similarity metric to use. One of `lv`, `dlv`, `jw`, `osa`
    strategy
        If `avg`, then will return true if the average similarity is above the threshold.
        If `all`, then will return true if the similarity to all words in the vocab is above
        the threshold.
        If `any`, then will return true if the similarity to any words in the vocab is above
        the threshold.
    as_bytes
        Only works for Levenshtein distance. Whether to treat the strings as ASCII characters.
        This will boost performance but does not work on non-ASCII characters.
    """
    if metric == "lv":
        sims = [
            str_leven(c, pl.lit(w, dtype=pl.String), as_bytes=as_bytes, return_sim=True)
            for w in vocab
        ]
    elif metric == "dlv":
        sims = [
            str_d_leven(c, pl.lit(w, dtype=pl.String), as_bytes=as_bytes, return_sim=True)
            for w in vocab
        ]
    elif metric == "osa":
        sims = [str_osa(c, pl.lit(w, dtype=pl.String), return_sim=True) for w in vocab]
    elif metric == "jw":
        sims = [str_jw(c, pl.lit(w, dtype=pl.String), return_sim=True) for w in vocab]
    else:
        raise ValueError(f"Unknown metric: {metric}")

    if strategy == "all":
        return pl.all_horizontal(s >= threshold for s in sims)
    elif strategy == "any":
        return pl.any_horizontal(s >= threshold for s in sims)
    elif strategy == "avg":
        return (pl.sum_horizontal(sims) / len(vocab)) >= threshold
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

`str_d_leven(c, other, parallel=False, return_sim=False, as_bytes=False)`

Computes the Damerau-Levenshtein distance between this and the other str.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
`return_sim`	`bool`	If true, return the normalized Damerau-Levenshtein similarity instead of the distance.	`False`
`as_bytes`	`bool`	Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.	`False`

Source code in python/polars_ds/exprs/string.py

def str_d_leven(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Computes the Damerau-Levenshtein distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return the normalized Damerau-Levenshtein similarity instead of the distance.
    as_bytes
        Whether to treat the strings as ASCII characters. This will boost performance but does not
        work on non-ASCII characters.
    """
    params = {"parallel": parallel, "as_bytes": as_bytes}
    if return_sim:
        return pl_plugin(
            symbol="pl_d_levenshtein_sim",
            args=[to_expr(c), to_expr(other)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
    else:
        return pl_plugin(
            symbol="pl_d_levenshtein",
            args=[to_expr(c), to_expr(other)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )

`str_fuzz(c, other, parallel=False)`

Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more information.)

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
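The Indel distance counts only insertions and deletions, which makes it equal to `len(a) + len(b) - 2 * LCS(a, b)`. A minimal pure-Python sketch of the normalized similarity follows. It reports values on a 0 to 1 scale; whether `str_fuzz` uses that scale or rapidfuzz's 0 to 100 scale is not stated here, so treat the scale as an assumption. Both function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a: str, b: str) -> float:
    """Normalized Indel similarity in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    indel_distance = total - 2 * lcs_length(a, b)
    return 1.0 - indel_distance / total


print(round(indel_similarity("lewenstein", "levenshtein"), 4))
```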

Source code in python/polars_ds/exprs/string.py

def str_fuzz(c: str | pl.Expr, other: str | pl.Expr, parallel: bool = False) -> pl.Expr:
    """
    Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more
    information.)

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_fuzz",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True,
    )

`str_hamming(c, other, pad=False, parallel=False)`

Computes the hamming distance between two strings. If they do not have the same length, null will be returned.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	Either the name of the column or a Polars expression	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`pad`	`bool`	Whether to pad the string when lengths are not equal.	`False`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`

Source code in python/polars_ds/exprs/string.py

def str_hamming(
    c: str | pl.Expr, other: str | pl.Expr, pad: bool = False, parallel: bool = False
) -> pl.Expr:
    """
    Computes the hamming distance between two strings. If they do not have the same length, null will
    be returned.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    pad
        Whether to pad the string when lengths are not equal.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """

    if pad:
        return pl_plugin(
            symbol="pl_hamming_padded",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_hamming",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )

`str_jaccard(c, other, substr_size=2, parallel=False)`

Treats substrings of size substr_size as a set and computes the Jaccard similarity between this word and the other.

Note that substrings are taken at the byte level under the hood, not at the char level, so non-ASCII characters may not be handled correctly.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`substr_size`	`int`	The substring size for Jaccard similarity. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.	`2`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
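As a worked illustration of the decomposition described for `substr_size`, the sketch below builds byte-level substring sets and computes their Jaccard similarity. Returning 0.0 when both sets are empty (strings shorter than `substr_size`) is a choice made for this sketch, not documented behavior, and both function names are hypothetical.

```python
def byte_ngrams(text: str, size: int) -> set:
    """Set of all byte substrings of the given size (UTF-8 encoded)."""
    data = text.encode("utf-8")
    return {data[i : i + size] for i in range(len(data) - size + 1)}


def jaccard_similarity(a: str, b: str, substr_size: int = 2) -> float:
    """|A intersect B| / |A union B| over the substring sets of a and b."""
    set_a = byte_ngrams(a, substr_size)
    set_b = byte_ngrams(b, substr_size)
    union = set_a | set_b
    if not union:
        return 0.0  # sketch-only convention for too-short inputs
    return len(set_a & set_b) / len(union)


print(sorted(byte_ngrams("apple", 2)))
print(jaccard_similarity("apple", "apply"))
```

Here "apple" and "apply" share three of five distinct bigrams, giving 3 / 5.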

Source code in python/polars_ds/exprs/string.py

def str_jaccard(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats substrings of size `substr_size` as a set. And computes the jaccard similarity between
    this word and the other.

    Note this treats substrings at the byte level under the hood, not at the char level. So non-ASCII
    characters may have problems.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The substring size for Jaccard similarity. E.g. if substr_size = 2, "apple" will be decomposed into
        the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_str_jaccard",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )

`str_jaro(c, other, parallel=False)`

Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
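For reference, the textbook Jaro similarity can be written in a few lines of Python. This is a sketch of the standard definition (matching window of `max(len) // 2 - 1`, transpositions counted as half the out-of-order matches); it is not the plugin's code, and `jaro_similarity` is a hypothetical name.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        # Look for an unused equal character within the window.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, halved.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


sim = jaro_similarity("MARTHA", "MARHTA")
print(round(sim, 4), round(1 - sim, 4))  # similarity, then distance
```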

Source code in python/polars_ds/exprs/string.py

def str_jaro(c: str | pl.Expr, other: str | pl.Expr, parallel: bool = False) -> pl.Expr:
    """
    Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_jaro",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )

`str_jw(c, other, weight=0.1, parallel=False)`

Computes the Jaro-Winkler similarity between this and the other str. Jaro-Winkler distance = 1 - Jaro-Winkler sim.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`weight`	`float`	Weight for prefix. A typical value is 0.1.	`0.1`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
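The `weight` parameter controls how much a shared prefix boosts the Jaro score. In the standard formulation the boost is `prefix_len * weight * (1 - jaro)`, with the prefix length capped at 4. The sketch below applies that formula to an already-computed Jaro similarity; the cap of 4 is the usual convention and is assumed here rather than confirmed for the plugin, and `jaro_winkler_from_jaro` is a hypothetical name.

```python
def jaro_winkler_from_jaro(
    jaro_sim: float, a: str, b: str, weight: float = 0.1, max_prefix: int = 4
) -> float:
    """Apply the Winkler prefix boost to a precomputed Jaro similarity."""
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return jaro_sim + prefix * weight * (1.0 - jaro_sim)


# The Jaro similarity of "MARTHA" and "MARHTA" is 17/18; the shared prefix is "MAR".
jw = jaro_winkler_from_jaro(17 / 18, "MARTHA", "MARHTA")
print(round(jw, 4), round(1 - jw, 4))  # similarity, then distance
```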

Source code in python/polars_ds/exprs/string.py

def str_jw(
    c: str | pl.Expr,
    other: str | pl.Expr,
    weight: float = 0.1,
    parallel: bool = False,
) -> pl.Expr:
    """
    Computes the Jaro-Winkler similarity between this and the other str.
    Jaro-Winkler distance = 1 - Jaro-Winkler sim.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    weight
        Weight for prefix. A typical value is 0.1.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_jw",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(weight, pl.Float64),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )

`str_lcs_subseq(c, other, parallel=False)`

Extracts the longest common subsequence between this string and the other string.

Note: this is not the same as the longest common substring.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
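To make the subsequence/substring distinction concrete, here is a minimal dynamic-programming sketch that recovers one longest common subsequence. When several subsequences tie for the longest length, which one is returned depends on the tie-breaking rule; the rule used here is arbitrary and may differ from the plugin's. `lcs_subsequence` is a hypothetical name.

```python
def lcs_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to rebuild the subsequence.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs_subsequence("AGGTAB", "GXTXAYB"))
print(lcs_subsequence("abcde", "ace"))
```

In the second call the result "ace" is not a substring of "abcde", since its characters are not adjacent there.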

Source code in python/polars_ds/exprs/string.py

def str_lcs_subseq(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
) -> pl.Expr:
    """
    Extracts the longest common subsequence between this string and the other string.

    Note: this is not the same as the longest common substring.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_lcs_subseq",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )
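
As an illustration of what "longest common subsequence" means here, the following pure-Python sketch extracts one such subsequence with the classic dynamic-programming table. The function name `lcs_subsequence` is hypothetical, and when several subsequences share the maximum length the plugin may return a different one than this sketch.

```python
def lcs_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (illustrative only)."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover the characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For "AGGTAB" and "GXTXAYB" this gives "GTAB": the characters appear in both strings in that order, but not next to each other.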

`str_lcs_subseq_dist(c, other, parallel=False, return_sim=True)`

Computes the Longest Common Subsequence distance/similarity between this and the other str. The distance is calculated as max(len1, len2) - similarity, where the similarity is the length of the longest common subsequence.

The subsequence does not need to be consecutive.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
`return_sim`	`bool`	If true, return the normalized similarity; otherwise return the distance.	`True`

Source code in python/polars_ds/exprs/string.py

def str_lcs_subseq_dist(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = True,
) -> pl.Expr:
    """
    Computes the Longest Common Subsequence distance/similarity between this and the other str.
    The distance is calculated as max(len1, len2) - similarity, where the similarity is the
    length of the longest common subsequence.

    The subsequence does not need to be consecutive.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return the normalized similarity; otherwise return the distance.
    """
    if return_sim:
        return pl_plugin(
            symbol="pl_lcs_subseq_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_lcs_subseq_dist",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
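
The distance formula above can be sketched in pure Python as follows. The function names are hypothetical, and the normalization (dividing the LCS length by the length of the longer string) is an assumption modelled on how rapidfuzz normalizes this metric, not something the source states.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b):
            if ch == bj:
                cur.append(prev[j] + 1)
            else:
                cur.append(max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """max(len1, len2) - LCS length, as described in the text."""
    return max(len(a), len(b)) - lcs_length(a, b)


def lcs_normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest
```

For "AGGTAB" and "GXTXAYB" the LCS length is 4, so the distance is 7 - 4 = 3 and the assumed normalized similarity is 4/7.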

`str_lcs_substr(c, other, parallel=False)`

Extracts the longest common substring from the string between this and the other string.

Note: this is not the same as the longest common subsequence.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`

Source code in python/polars_ds/exprs/string.py

def str_lcs_substr(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
) -> pl.Expr:
    """
    Extracts the longest common substring from the string between this and the other string.

    Note: this is not the same as the longest common subsequence.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_lcs_substr",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )
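
To contrast with the subsequence variant, here is a small pure-Python sketch of longest common substring extraction, where the shared characters must be contiguous. The function name `longest_common_substring` is hypothetical, and on ties this sketch keeps the first match found in `a`, which may differ from the plugin's choice.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous run shared by a and b (illustrative only)."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best run inside `a`
    prev = [0] * (len(b) + 1)
    for i, ai in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, 1):
            if ai == bj:
                # Extend the run that ended at the previous character of both strings.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]
```

For "abcdef" and "zcdemf" the longest common substring is "cde", whereas the longest common subsequence would also pick up the trailing "f".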

`str_leven(c, other, parallel=False, return_sim=False, as_bytes=False)`

Computes the Levenshtein distance between this and the other str.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
`return_sim`	`bool`	If true, return the normalized Levenshtein similarity instead of the distance.	`False`
`as_bytes`	`bool`	Whether to compare the strings byte by byte. This boosts performance but gives correct results only for ASCII strings.	`False`

Source code in python/polars_ds/exprs/string.py

def str_leven(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Computes the Levenshtein distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return the normalized Levenshtein similarity instead of the distance.
    as_bytes
        Whether to compare the strings byte by byte. This boosts performance but gives correct
        results only for ASCII strings.
    """
    params = {"parallel": parallel, "as_bytes": as_bytes}
    if return_sim:
        return pl_plugin(
            symbol="pl_levenshtein_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
    else:
        return pl_plugin(
            symbol="pl_levenshtein",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
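
The sketch below is a pure-Python reference for the Levenshtein distance, included to show why `as_bytes` only suits ASCII data. The function names are hypothetical, and the normalization (1 - distance / longer length) is an assumption rather than something the source spells out.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, 1):
        cur = [i]
        for j, bj in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ai
                cur[j - 1] + 1,            # insert bj
                prev[j - 1] + (ai != bj),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: Sequence, b: Sequence) -> float:
    """Assumed normalization: 1 - distance / length of the longer input."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Character-level vs byte-level comparison of a non-ASCII string:
chars = levenshtein("café", "cafe")                                  # one substitution
raw = levenshtein("café".encode("utf-8"), "cafe".encode("utf-8"))   # "é" is two bytes
```

Here `chars` is 1 but `raw` is 2, because "é" occupies two bytes in UTF-8. That kind of discrepancy is what to expect when byte-level comparison is applied to non-ASCII text.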

`str_nearest(c, word, threshold=100, metric='lv')`

Finds the string in the column that is nearest to the given word in the given metric. This algorithm is very slow.

Note: Nearest-k strings search functionality is temporarily dropped.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column or its name	required
`word`	`str`	The word to compare each string in the column against	required
`threshold`	`int`	Only strings within this distance of the word are considered near. Must be a positive integer, since every supported metric returns an integer distance.	`100`
`metric`	`Literal['lv', 'hamming']`	Which distance metric to use. One of `lv`, `hamming`	`'lv'`

Source code in python/polars_ds/exprs/string.py

def str_nearest(
    c: str | pl.Expr,
    word: str,
    threshold: int = 100,
    metric: Literal["lv", "hamming"] = "lv",
) -> pl.Expr:
    """
    Finds the string in the column that is nearest to the given word in the given metric. This algorithm is
    very slow.

    Note: Nearest-k strings search functionality is temporarily dropped.

    Parameters
    ----------
    c
        The string column or its name
    word
        The word to compare each string in the column against
    threshold : int
        Only strings within this distance of the word are considered near. Must be a positive
        integer, since every supported metric returns an integer distance.
    metric
        Which distance metric to use. One of `lv`, `hamming`
    """
    if metric not in ("lv", "hamming"):
        raise ValueError(f"Unknown metric for str_nearest: {metric}")

    if threshold <= 0:
        raise ValueError("Distance threshold must be > 0.")

    return pl_plugin(
        symbol="pl_nearest_str",
        args=[to_expr(c)],
        kwargs={
            "word": word,
            "metric": str(metric).lower(),
            "threshold": threshold,
        },
        returns_scalar=True,
    )
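
A brute-force pure-Python sketch of the search described above follows. Everything in it is an assumption made for illustration: the name `nearest_string`, treating the threshold as inclusive, returning `None` when nothing qualifies, keeping the first candidate on ties, and skipping strings whose Hamming distance is undefined because the lengths differ.

```python
from typing import Iterable, Optional


def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ai in enumerate(a, 1):
        cur = [i]
        for j, bj in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ai != bj)))
        prev = cur
    return prev[-1]


def hamming(a: str, b: str) -> Optional[int]:
    # Hamming distance is only defined for strings of equal length.
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def nearest_string(
    column: Iterable[Optional[str]],
    word: str,
    threshold: int = 100,
    metric: str = "lv",
) -> Optional[str]:
    """Scan every string once and keep the closest one within `threshold`."""
    if metric not in ("lv", "hamming"):
        raise ValueError(f"Unknown metric: {metric}")
    if threshold <= 0:
        raise ValueError("Distance threshold must be > 0.")
    dist_fn = levenshtein if metric == "lv" else hamming
    best, best_dist = None, None
    for s in column:
        if s is None:
            continue
        d = dist_fn(s, word)
        if d is None or d > threshold:
            continue
        if best_dist is None or d < best_dist:
            best, best_dist = s, d
    return best
```

Because every string in the column is compared against `word`, the cost grows with both the column length and the string lengths, which is consistent with the warning that this search is slow.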

`str_osa(c, other, parallel=False, return_sim=False)`

Computes the Optimal String Alignment distance between this and the other str.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`
`return_sim`	`bool`	If true, return normalized OSA similarity.	`False`

Source code in python/polars_ds/exprs/string.py

def str_osa(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
) -> pl.Expr:
    """
    Computes the Optimal String Alignment distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return normalized OSA similarity.
    """
    if return_sim:
        return pl_plugin(
            symbol="pl_osa_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_osa",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
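
For readers unfamiliar with Optimal String Alignment, the pure-Python sketch below shows the textbook recurrence: Levenshtein plus a transposition of two adjacent characters, with the restriction that no substring is edited more than once. The name `osa_distance` is hypothetical and this is not the plugin's implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment (restricted Damerau-Levenshtein) distance."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

"ab" and "ba" are at OSA distance 1, where plain Levenshtein gives 2. "ca" and "abc" are at OSA distance 3, a standard example where the unrestricted Damerau-Levenshtein distance is 2.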

`str_overlap_coeff(c, other, substr_size=2, parallel=False)`

Treats the substrings of size substr_size as a set and computes the overlap coefficient between this word and the other.

Note: substrings are taken at the byte level, not the character level, so non-ASCII characters may give unexpected results.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`substr_size`	`int`	The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.	`2`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`

Source code in python/polars_ds/exprs/string.py

def str_overlap_coeff(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the overlap coefficient
    between this word and the other.

    Note: substrings are taken at the byte level, not the character level, so non-ASCII
    characters may give unexpected results.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_overlap_coeff",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
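
The following pure-Python sketch illustrates the overlap coefficient, |A ∩ B| / min(|A|, |B|), on byte-level substrings. The helper names are hypothetical, and returning 0.0 when either set is empty is an assumption of this sketch, not documented plugin behavior.

```python
def byte_ngrams(s: str, n: int = 2) -> set:
    """Set of all n-byte windows of the UTF-8 encoding of s."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def overlap_coeff(a: str, b: str, substr_size: int = 2) -> float:
    """|A & B| / min(|A|, |B|) over byte n-gram sets."""
    set_a = byte_ngrams(a, substr_size)
    set_b = byte_ngrams(b, substr_size)
    if not set_a or not set_b:
        return 0.0  # assumed convention for strings shorter than substr_size
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

With `substr_size=2`, "apple" and "apply" share 3 of their 4 substrings each, giving 0.75, while "app" scores 1.0 against "apple" because all of its substrings are contained in the longer word.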

`str_sorensen_dice(c, other, substr_size=2, parallel=False)`

Treats the substrings of size substr_size as a set and computes the Sorensen-Dice similarity between this word and the other.

Note: substrings are taken at the byte level, not the character level, so non-ASCII characters may give unexpected results.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`substr_size`	`int`	The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.	`2`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`

Source code in python/polars_ds/exprs/string.py

def str_sorensen_dice(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the Sorensen-Dice similarity
    between this word and the other.

    Note: substrings are taken at the byte level, not the character level, so non-ASCII
    characters may give unexpected results.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_sorensen_dice",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
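
As a reference for the metric itself, this pure-Python sketch computes the Sorensen-Dice similarity, 2|A ∩ B| / (|A| + |B|), on byte-level substrings. The helper names are hypothetical, and the 0.0 result for two empty sets is an assumed convention.

```python
def byte_ngrams(s: str, n: int = 2) -> set:
    """Set of all n-byte windows of the UTF-8 encoding of s."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def sorensen_dice(a: str, b: str, substr_size: int = 2) -> float:
    """2 * |A & B| / (|A| + |B|) over byte n-gram sets."""
    set_a = byte_ngrams(a, substr_size)
    set_b = byte_ngrams(b, substr_size)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # assumed convention when neither string yields a substring
    return 2 * len(set_a & set_b) / total
```

"night" and "nacht" each produce 4 substrings of size 2 and share only "ht", so the similarity is 2 * 1 / 8 = 0.25.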

`str_tversky_sim(c, other, alpha, beta, substr_size=2, parallel=False)`

Treats the substrings of size substr_size as a set and computes the Tversky similarity between this word and the other. See the reference for how Tversky similarity relates to the other n-gram based similarities.

Note: substrings are taken at the byte level, not the character level, so non-ASCII characters may give unexpected results. Also note that alpha and beta are meant to be weighting factors; the function only checks that they are non-negative, so choosing sensible values is left to the user.

Parameters:

Name	Type	Description	Default
`c`	`str \| Expr`	The string column	required
`other`	`str \| Expr`	Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)	required
`alpha`	`float`	The first weighting factor. See reference	required
`beta`	`float`	The second weighting factor. See reference	required
`substr_size`	`int`	The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.	`2`
`parallel`	`bool`	Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.	`False`

Reference

https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7

Source code in python/polars_ds/exprs/string.py

def str_tversky_sim(
    c: str | pl.Expr,
    other: str | pl.Expr,
    alpha: float,
    beta: float,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the Tversky similarity between
    this word and the other. See the reference for how Tversky similarity relates to the other
    n-gram based similarities.

    Note: substrings are taken at the byte level, not the character level, so non-ASCII
    characters may give unexpected results. Also note that alpha and beta are meant to be
    weighting factors; the function only checks that they are non-negative, so choosing
    sensible values is left to the user.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    alpha
        The first weighting factor. See reference
    beta
        The second weighting factor. See reference
    substr_size
        The size of the substrings used to build each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.

    Reference
    ---------
    https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7
    """
    if alpha < 0 or beta < 0:
        raise ValueError("Input `alpha` and `beta` must be >= 0.")

    return pl_plugin(
        symbol="pl_tversky_sim",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(alpha, pl.Float64),
            pl.lit(beta, pl.Float64),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
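
To show how `alpha` and `beta` relate Tversky similarity to the other n-gram measures, here is a pure-Python sketch. It assumes the standard Tversky index, |A ∩ B| / (|A ∩ B| + alpha * |A - B| + beta * |B - A|); the helper names and the 0.0 result for a zero denominator are assumptions of the sketch.

```python
def byte_ngrams(s: str, n: int = 2) -> set:
    """Set of all n-byte windows of the UTF-8 encoding of s."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def tversky(a: str, b: str, alpha: float, beta: float, substr_size: int = 2) -> float:
    """Standard Tversky index over byte n-gram sets (illustrative only)."""
    if alpha < 0 or beta < 0:
        raise ValueError("Input `alpha` and `beta` must be >= 0.")
    set_a = byte_ngrams(a, substr_size)
    set_b = byte_ngrams(b, substr_size)
    common = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    denom = common + alpha * only_a + beta * only_b
    return 0.0 if denom == 0 else common / denom


jaccard_like = tversky("apple", "apply", alpha=1.0, beta=1.0)  # 3 / (3 + 1 + 1)
dice_like = tversky("apple", "apply", alpha=0.5, beta=0.5)     # 3 / (3 + 0.5 + 0.5)
```

Under this formula, `alpha = beta = 1` reproduces the Jaccard similarity and `alpha = beta = 0.5` reproduces the Sorensen-Dice similarity, while unequal weights make the measure asymmetric in its two arguments.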

`to_camel_case(c)`

Turns itself into camel case. E.g. helloWorld

Source code in python/polars_ds/exprs/string.py

def to_camel_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into camel case. E.g. helloWorld"""
    return pl_plugin(
        symbol="pl_to_camel",
        args=[to_expr(c)],
        is_elementwise=True,
    )

`to_constant_case(c)`

Turns itself into constant case. E.g. HELLO_WORLD

Source code in python/polars_ds/exprs/string.py

def to_constant_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into constant case. E.g. HELLO_WORLD"""
    return pl_plugin(
        symbol="pl_to_constant",
        args=[to_expr(c)],
        is_elementwise=True,
    )

`to_pascal_case(c)`

Turns itself into Pascal case. E.g. HelloWorld

Source code in python/polars_ds/exprs/string.py

def to_pascal_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into Pascal case. E.g. HelloWorld"""
    return pl_plugin(
        symbol="pl_to_pascal",
        args=[to_expr(c)],
        is_elementwise=True,
    )

`to_snake_case(c)`

Turns itself into snake case. E.g. hello_world

Source code in python/polars_ds/exprs/string.py

def to_snake_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into snake case. E.g. hello_world"""
    return pl_plugin(
        symbol="pl_to_snake",
        args=[to_expr(c)],
        is_elementwise=True,
    )
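
The four case styles can be compared side by side with the pure-Python sketch below. It uses the conventional definitions (constant case meaning upper-case words joined by underscores), and the word-splitting rule and function names are assumptions of the sketch; the plugin may split words differently, for example around digits or runs of capitals.

```python
import re


def split_words(s: str) -> list:
    """Split on non-alphanumerics and on lower-to-upper case boundaries."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", s)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def snake_case(s: str) -> str:
    return "_".join(split_words(s))


def constant_case(s: str) -> str:
    return "_".join(split_words(s)).upper()


def pascal_case(s: str) -> str:
    return "".join(w.capitalize() for w in split_words(s))


def camel_case(s: str) -> str:
    words = split_words(s)
    if not words:
        return ""
    # First word stays lower-case, the rest are capitalized.
    return words[0] + "".join(w.capitalize() for w in words[1:])
```

Applied to "Hello World", these return hello_world, HELLO_WORLD, HelloWorld and helloWorld respectively.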