String Functions Expr

Extension for String Manipulation and Metrics

String-related utils, including edit distances, simple cleaning, etc.

Functions:

Name Description
extract_numbers

Extracts numbers from the string column, and stores them in a list.

filter_by_hamming

Returns whether the hamming distance between self and other is <= bound. This is faster than computing hamming distance and then doing a filter.

filter_by_levenshtein

Returns whether the Levenshtein distance between self and other is <= bound. This is faster than computing levenshtein distance and then doing a filter.

map_words

Replace words based on the specified mapping.

normalize_whitespace
Normalize whitespace to one, e.g. 'a   b' -> 'a b'.
remove_diacritics

Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized form and removing the resulting non-ASCII components.

replace_non_ascii

Replaces non-Ascii values with the specified value.

similar_to_vocab

Compare each word in the vocab with each word in the column c. Returns a boolean that indicates whether there exist words in c that are similar to words in vocab.

str_d_leven

Computes the Damerau-Levenshtein distance between this and the other str.

str_fuzz

Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more information.)

str_hamming

Computes the hamming distance between two strings. If they do not have the same length, null will be returned.

str_jaccard

Treats the substrings of size substr_size as a set and computes the Jaccard similarity between this word and the other.

str_jaro

Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.

str_jw

Computes the Jaro-Winkler similarity between this and the other str.

str_lcs_subseq

Extracts the longest common subsequence between this string and the other string.

str_lcs_subseq_dist

Computes the Longest Common Subsequence distance/similarity between this and the other str.

str_lcs_substr

Extracts the longest common substring between this string and the other string.

str_leven

Computes the Levenshtein distance between this and the other str.

str_nearest

Finds the string in the column that is nearest to the given word in the given metric. This algorithm is

str_osa

Computes the Optimal String Alignment distance between this and the other str.

str_sorensen_dice

Treats substrings of size substr_size as a set. And computes the Sorensen-Dice similarity between

str_tversky_sim

Treats substrings of size substr_size as a set. And computes the tversky_sim similarity between

to_camel_case

Turns itself into camel case. E.g. helloWorld

to_constant_case

Turns itself into constant case. E.g. HELLO_WORLD

to_pascal_case

Turns itself into Pascal case. E.g. HelloWorld

to_snake_case

Turns itself into snake case. E.g. hello_world

extract_numbers(c, ignore_comma=False, join_by='', dtype=pl.String)

Extracts numbers from the string column, and stores them in a list.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
ignore_comma bool

Whether to remove all commas before matching for numbers

False
join_by str

If dtype is pl.String, join the list of strings using the value given here

''
dtype DataType

The desired inner dtype for the extracted data. Should be either one of Polars' numerical types or pl.String

String

Examples:

>>> df = pl.DataFrame(
...     {
...         "survey": [
...             "0% of my time",
...             "1% to 25% of my time",
...             "75% to 99% of my time",
...             "50% to 74% of my time",
...             "75% to 99% of my time",
...             "50% to 74% of my time",
...         ]
...     }
... )
>>> df.select(pl.col("survey").str_ext.extract_numbers(dtype=pl.UInt32))
shape: (6, 1)
┌───────────┐
│ survey    │
│ ---       │
│ list[u32] │
╞═══════════╡
│ [0]       │
│ [1, 25]   │
│ [75, 99]  │
│ [50, 74]  │
│ [75, 99]  │
│ [50, 74]  │
└───────────┘
>>> df.select(pl.col("survey").str_ext.extract_numbers(join_by="-", dtype=pl.String))
shape: (6, 1)
┌────────┐
│ survey │
│ ---    │
│ str    │
╞════════╡
│ 0      │
│ 1-25   │
│ 75-99  │
│ 50-74  │
│ 75-99  │
│ 50-74  │
└────────┘
Source code in python/polars_ds/exprs/string.py
def extract_numbers(
    c: str | pl.Expr,
    ignore_comma: bool = False,
    join_by: str = "",
    dtype: pl.DataType = pl.String,
) -> pl.Expr:
    """
    Extracts numbers from the string column, and stores them in a list.

    Parameters
    ----------
    c
        The string column
    ignore_comma
        Whether to remove all commas before matching for numbers
    join_by
        If dtype is pl.String, join the list of strings using the value given here
    dtype
        The desired inner dtype for the extracted data. Should be either one of
        Polars' numerical types or pl.String

    Examples
    --------
    >>> df = pl.DataFrame(
    ...     {
    ...         "survey": [
    ...             "0% of my time",
    ...             "1% to 25% of my time",
    ...             "75% to 99% of my time",
    ...             "50% to 74% of my time",
    ...             "75% to 99% of my time",
    ...             "50% to 74% of my time",
    ...         ]
    ...     }
    ... )
    >>> df.select(pl.col("survey").str_ext.extract_numbers(dtype=pl.UInt32))
    shape: (6, 1)
    ┌───────────┐
    │ survey    │
    │ ---       │
    │ list[u32] │
    ╞═══════════╡
    │ [0]       │
    │ [1, 25]   │
    │ [75, 99]  │
    │ [50, 74]  │
    │ [75, 99]  │
    │ [50, 74]  │
    └───────────┘
    >>> df.select(pl.col("survey").str_ext.extract_numbers(join_by="-", dtype=pl.String))
    shape: (6, 1)
    ┌────────┐
    │ survey │
    │ ---    │
    │ str    │
    ╞════════╡
    │ 0      │
    │ 1-25   │
    │ 75-99  │
    │ 50-74  │
    │ 75-99  │
    │ 50-74  │
    └────────┘
    """
    expr = to_expr(c)
    if ignore_comma:
        expr = expr.str.replace_all(",", "")

    # Find all numbers
    expr = expr.str.extract_all(r"(\d*\.?\d+)")

    if dtype in [
        pl.UInt8,
        pl.UInt16,
        pl.UInt32,
        pl.UInt64,
        pl.Int8,
        pl.Int16,
        pl.Int32,
        pl.Int64,
        pl.Float32,
        pl.Float64,
    ]:
        expr = expr.list.eval(pl.element().cast(dtype))
    elif dtype == pl.String:  # As a list of strings
        if join_by != "":
            expr = expr.list.join(join_by)

    return expr
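
The matching step can be tried out without Polars. The sketch below is a plain-Python illustration, not the library's implementation: the helper name `extract_number_strings` is hypothetical, and it only reuses the regular expression and the comma handling shown in the source above.

```python
import re
from typing import List

# Same pattern as in the source above: optional integer part, optional dot, digits.
NUMBER_PATTERN = re.compile(r"(\d*\.?\d+)")


def extract_number_strings(text: str, ignore_comma: bool = False) -> List[str]:
    """Hypothetical helper: return every number found in `text` as a string."""
    if ignore_comma:
        # Drop commas first so that "1,000" is matched as a single number.
        text = text.replace(",", "")
    return NUMBER_PATTERN.findall(text)


print(extract_number_strings("1% to 25% of my time"))
print(extract_number_strings("1,000 and 2.5", ignore_comma=True))
```

Without `ignore_comma`, a value such as "1,000" is split at the comma into two matches, which is why the flag exists.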

filter_by_hamming(c, other, bound, pad=False, parallel=False)

Returns whether the hamming distance between self and other is <= bound. This is faster than computing hamming distance and then doing a filter.

Parameters:

Name Type Description Default
c str | Expr

Either the name of the column or a Polars expression

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
bound int

Closed upper bound. If distance <= bound, return true and false otherwise.

required
pad bool

Whether to pad the strings to the same length. If False, and strings have different lengths, they will be filtered out.

False
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def filter_by_hamming(
    c: str | pl.Expr,
    other: str | pl.Expr,
    bound: int,
    pad: bool = False,
    parallel: bool = False,
) -> pl.Expr:
    """
    Returns whether the hamming distance between self and other is <= bound. This is
    faster than computing hamming distance and then doing a filter.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    bound
        Closed upper bound. If distance <= bound, return true and false otherwise.
    pad
        Whether to pad the strings to the same length. If False, and strings have different lengths,
        they will be filtered out.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_hamming_filter",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(bound, dtype=pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
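
A bounded check can be faster than computing the full distance because the comparison may stop as soon as the number of mismatches exceeds the bound. The following plain-Python sketch illustrates that idea only; it is not the plugin's implementation, the name `hamming_within` is hypothetical, and strings of unequal length are treated as a non-match, mirroring the documented behavior for pad=False.

```python
def hamming_within(a: str, b: str, bound: int) -> bool:
    """Hypothetical helper: True if the Hamming distance of a and b is <= bound."""
    if len(a) != len(b):
        # Assumption: unequal lengths never pass the filter (the pad=False case).
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > bound:
                # Early exit: the rest of the strings cannot lower the count.
                return False
    return True


print(hamming_within("karolin", "kathrin", 3))
print(hamming_within("karolin", "kathrin", 2))
```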

filter_by_levenshtein(c, other, bound, parallel=False, as_bytes=False)

Returns whether the Levenshtein distance between self and other is <= bound. This is faster than computing levenshtein distance and then doing a filter.

Parameters:

Name Type Description Default
c str | Expr

Either the name of the column or a Polars expression

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
bound int

Closed upper bound. If distance <= bound, return true and false otherwise.

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
as_bytes bool

Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.

False
Source code in python/polars_ds/exprs/string.py
def filter_by_levenshtein(
    c: str | pl.Expr,
    other: str | pl.Expr,
    bound: int,
    parallel: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Returns whether the Levenshtein distance between self and other is <= bound. This is
    faster than computing levenshtein distance and then doing a filter.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    bound
        Closed upper bound. If distance <= bound, return true and false otherwise.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    as_bytes
        Whether to treat the strings as ASCII characters. This will boost performance but does not
        work on non-ASCII characters.
    """
    params = {"bound": abs(bound), "parallel": parallel, "as_bytes": as_bytes}
    return pl_plugin(
        symbol="pl_levenshtein_filter",
        args=[to_expr(c), to_expr(other)],
        is_elementwise=True and not parallel,
        kwargs=params,
    )
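
To illustrate why filtering by a bound can be cheaper than computing the full distance, here is a plain-Python sketch with two shortcuts: a length-difference check, and an early exit once every entry of the current dynamic-programming row exceeds the bound. It is an assumption-laden illustration, not the plugin's algorithm, and `levenshtein_within` is a hypothetical name.

```python
def levenshtein_within(a: str, b: str, bound: int) -> bool:
    """Hypothetical helper: True if the Levenshtein distance of a and b is <= bound."""
    # The distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > bound:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        # The final distance can never be smaller than the minimum of a row.
        if min(cur) > bound:
            return False
        prev = cur
    return prev[-1] <= bound


print(levenshtein_within("kitten", "sitting", 3))
print(levenshtein_within("kitten", "sitting", 2))
```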

map_words(c, mapping)

Replace words based on the specified mapping.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
mapping dict[str, str]

A dictionary of {word: the replacement}

required

Returns:

Type Description
Expr

Examples:

>>> df = pl.DataFrame({"x": ["one two three"]})
>>> df.select(pds.map_words("x", {"two": "2"}))
shape: (1, 1)
┌─────────────┐
│ x           │
│ ---         │
│ str         │
╞═════════════╡
│ one 2 three │
└─────────────┘
Source code in python/polars_ds/exprs/string.py
def map_words(c: str | pl.Expr, mapping: Dict[str, str]) -> pl.Expr:
    """
    Replace words based on the specified mapping.

    Parameters
    ----------
    c : str | pl.Expr
        The string column
    mapping : dict[str, str]
        A dictionary of {word: the replacement}

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["one two three"]})
    >>> df.select(pds.map_words("x", {"two": "2"}))
    shape: (1, 1)
    ┌─────────────┐
    │ x           │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ one 2 three │
    └─────────────┘
    """
    return pl_plugin(
        symbol="map_words",
        args=[to_expr(c)],
        kwargs={"mapping": mapping},
        is_elementwise=True,
    )
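
The effect of a word mapping can be sketched in plain Python. This is an illustration only: the name `map_words_py` is hypothetical, and it assumes that words are separated by single spaces, which this page does not confirm for the plugin.

```python
from typing import Dict


def map_words_py(text: str, mapping: Dict[str, str]) -> str:
    """Hypothetical helper: replace whole words found in `mapping`."""
    # Assumption: a "word" is whatever sits between single space characters.
    return " ".join(mapping.get(word, word) for word in text.split(" "))


print(map_words_py("one two three", {"two": "2"}))
```

Only whole words are replaced, so a key such as "two" leaves "twofold" untouched in this sketch.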

normalize_whitespace(c, only_spaces=False)

Normalize whitespace to one, e.g. 'a   b' -> 'a b'.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
only_spaces bool

If True, only split on the space character ' ' instead of any whitespace character such as '\t' and '\n'.

False

Returns:

Type Description
Expr

Examples:

shape: (2, 3)
┌─────────┬─────┬────────┐
│ x       ┆ y   ┆ z      │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ str ┆ str    │
╞═════════╪═════╪════════╡
│ a     b ┆ a b ┆ a b    │
│ a     b ┆ a b ┆ a     b│
└─────────┴─────┴────────┘
Source code in python/polars_ds/exprs/string.py
def normalize_whitespace(c: str | pl.Expr, only_spaces: bool = False) -> pl.Expr:
    """
    Normalize whitespace to one, e.g. 'a   b' -> 'a b'.

    Parameters
    ----------
    c : str | pl.Expr
        The string column
    only_spaces: bool
        If True, only split on the space character ' ' instead of any whitespace
        character such as '\t' and '\n', by default False

    Returns
    -------
    pl.Expr

    Examples
    --------
    shape: (2, 3)
    ┌─────────┬─────┬────────┐
    │ x       ┆ y   ┆ z      │
    │ ---     ┆ --- ┆ ---    │
    │ str     ┆ str ┆ str    │
    ╞═════════╪═════╪════════╡
    │ a     b ┆ a b ┆ a b    │
    │ a	    b ┆ a b ┆ a	    b│
    └─────────┴─────┴────────┘
    """
    expr = to_expr(c)

    if only_spaces:
        return expr.str.replace_all(" +", " ")

    return pl_plugin(
        symbol="normalize_whitespace",
        args=[expr],
        is_elementwise=True,
    )
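
The difference between the two modes can be shown with the standard `re` module. This is a sketch under assumptions, not the plugin's code: `normalize_whitespace_py` is a hypothetical name, and Python's `\s` may cover a slightly different set of whitespace characters than the plugin does.

```python
import re


def normalize_whitespace_py(text: str, only_spaces: bool = False) -> str:
    """Hypothetical helper: collapse runs of whitespace into one space."""
    if only_spaces:
        # Same pattern as the only_spaces branch in the source above.
        return re.sub(" +", " ", text)
    # Assumption: any run of whitespace (tabs, newlines, spaces) becomes one space.
    return re.sub(r"\s+", " ", text)


print(repr(normalize_whitespace_py("a \t  b")))
print(repr(normalize_whitespace_py("a\t  b", only_spaces=True)))
```

With only_spaces=True the tab survives and only the run of spaces after it is collapsed.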

remove_diacritics(c)

Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized form and removing the resulting non-ASCII components.

Parameters:

Name Type Description Default
c str | Expr
required

Returns:

Type Description
Expr

Examples:

>>> df = pl.DataFrame({"x": ["mercy", "mèrcy"]})
>>> df.select(pds.remove_diacritics("x"))
shape: (2, 1)
┌───────┐
│ x     │
│ ---   │
│ str   │
╞═══════╡
│ mercy │
│ mercy │
└───────┘
Source code in python/polars_ds/exprs/string.py
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
def remove_diacritics(c: str | pl.Expr) -> pl.Expr:
    """Remove diacritics (e.g. è -> e) by converting the string to its NFD normalized
    form and removing the resulting non-ASCII components.

    Parameters
    ----------
    c : str | pl.Expr

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["mercy", "mèrcy"]})
    >>> df.select(pds.remove_diacritics("x"))
    shape: (2, 1)
    ┌───────┐
    │ x     │
    │ ---   │
    │ str   │
    ╞═══════╡
    │ mercy │
    │ mercy │
    └───────┘
    """
    return pl_plugin(
        symbol="remove_diacritics",
        args=[to_expr(c)],
        is_elementwise=True,
    )
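
The NFD approach described above can be reproduced with Python's standard `unicodedata` module. The sketch below is an illustration of the technique, not the plugin's code, and `strip_diacritics` is a hypothetical name.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: NFD-normalize, then drop non-ASCII code points."""
    # NFD splits "è" into the base letter "e" plus a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    # The combining marks are non-ASCII, so filtering on ASCII removes them.
    return "".join(ch for ch in decomposed if ch.isascii())


print(strip_diacritics("mèrcy"))
```

Characters that have no ASCII base letter after decomposition are removed entirely by this approach.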

replace_non_ascii(c, value='')

Replaces non-Ascii values with the specified value.

Parameters:

Name Type Description Default
c str | Expr

The column name or expression

required
value str

The value to replace non-Ascii values with, by default ""

''

Returns:

Type Description
Expr

Examples:

>>> df = pl.DataFrame({"x": ["mercy", "xbĤ", "ĤŇƏ"]})
>>> df.select(pds.replace_non_ascii("x"))
shape: (3, 1)
┌───────┐
│ x     │
│ ---   │
│ str   │
╞═══════╡
│ mercy │
│ xb    │
│       │
└───────┘
Source code in python/polars_ds/exprs/string.py
def replace_non_ascii(c: str | pl.Expr, value: str = "") -> pl.Expr:
    """Replaces non-Ascii values with the specified value.

    Parameters
    ----------
    c : str | pl.Expr
        The column name or expression
    value : str
        The value to replace non-Ascii values with, by default ""

    Returns
    -------
    pl.Expr

    Examples
    --------
    >>> df = pl.DataFrame({"x": ["mercy", "xbĤ", "ĤŇƏ"]})
    >>> df.select(pds.replace_non_ascii("x"))
    shape: (3, 1)
    ┌───────┐
    │ x     │
    │ ---   │
    │ str   │
    ╞═══════╡
    │ mercy │
    │ xb    │
    │       │
    └───────┘
    """
    expr = to_expr(c)

    if value == "":
        return pl_plugin(
            symbol="remove_non_ascii",
            args=[expr],
            is_elementwise=True,
        )

    return expr.str.replace_all(r"[^\p{Ascii}]", value)
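
For comparison, the same replacement can be written with Python's `re` module. This is a sketch, not the library's code: `replace_non_ascii_py` is a hypothetical name, and the character class `[^\x00-\x7F]` is used here as the Python equivalent of the `[^\p{Ascii}]` pattern shown above.

```python
import re

# Matches any single code point outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii_py(text: str, value: str = "") -> str:
    """Hypothetical helper: replace each non-ASCII character with `value`."""
    # A callable replacement inserts `value` literally, without escape handling.
    return NON_ASCII.sub(lambda _match: value, text)


print(replace_non_ascii_py("xbĤ"))
print(replace_non_ascii_py("ĤŇƏ", "?"))
```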

similar_to_vocab(c, vocab, threshold, metric='lv', strategy='avg', as_bytes=False)

Compare each word in the vocab with each word in the column c. Returns a boolean that indicates whether there exist words in c that are similar to words in vocab.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
vocab List[str]

Any iterable collection of strings

required
threshold float

An entry is considered similar to the words in the vocabulary if the similarity is at or above (>=) the threshold

required
metric Literal['lv', 'dlv', 'jw', 'osa']

Which similarity metric to use. One of lv, dlv, jw, osa

'lv'
strategy Literal['avg', 'all', 'any']

If avg, then will return true if the average similarity is above the threshold. If all, then will return true if the similarity to all words in the vocab is above the threshold. If any, then will return true if the similarity to any words in the vocab is above the threshold.

'avg'
as_bytes bool

Only works for Levenshtein distance. Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.

False
Source code in python/polars_ds/exprs/string.py
def similar_to_vocab(
    c: str | pl.Expr,
    vocab: List[str],
    threshold: float,
    metric: Literal["lv", "dlv", "jw", "osa"] = "lv",
    strategy: Literal["avg", "all", "any"] = "avg",
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Compare each word in the vocab with each word in the column c. Returns a boolean
    that indicates whether there exist words in c that are similar to words in vocab.

    Parameters
    ----------
    c
        The string column
    vocab
        Any iterable collection of strings
    threshold
        An entry is considered similar to the words in the vocabulary if the similarity
        is at or above (>=) the threshold
    metric
        Which similarity metric to use. One of `lv`, `dlv`, `jw`, `osa`
    strategy
        If `avg`, then will return true if the average similarity is above the threshold.
        If `all`, then will return true if the similarity to all words in the vocab is above
        the threshold.
        If `any`, then will return true if the similarity to any words in the vocab is above
        the threshold.
    as_bytes
        Only works for Levenshtein distance. Whether to treat the strings as ASCII characters.
        This will boost performance but does not work on non-ASCII characters.
    """
    if metric == "lv":
        sims = [
            str_leven(c, pl.lit(w, dtype=pl.String), as_bytes=as_bytes, return_sim=True)
            for w in vocab
        ]
    elif metric == "dlv":
        sims = [
            str_d_leven(c, pl.lit(w, dtype=pl.String), as_bytes=as_bytes, return_sim=True)
            for w in vocab
        ]
    elif metric == "osa":
        sims = [str_osa(c, pl.lit(w, dtype=pl.String), return_sim=True) for w in vocab]
    elif metric == "jw":
        sims = [str_jw(c, pl.lit(w, dtype=pl.String), return_sim=True) for w in vocab]
    else:
        raise ValueError(f"Unknown metric: {metric}")

    if strategy == "all":
        return pl.all_horizontal(s >= threshold for s in sims)
    elif strategy == "any":
        return pl.any_horizontal(s >= threshold for s in sims)
    elif strategy == "avg":
        return (pl.sum_horizontal(sims) / len(vocab)) >= threshold
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
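
The three strategies can be illustrated for a single string in plain Python. This is a sketch under assumptions: `similar_to_vocab_py` and its helpers are hypothetical names, only the Levenshtein metric is covered, and the similarity is assumed to be 1 - distance / max(len(a), len(b)), which this page does not state.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    # Assumption: normalize by the longer string's length.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def similar_to_vocab_py(
    word: str, vocab: List[str], threshold: float, strategy: str = "avg"
) -> bool:
    """Hypothetical helper: apply the avg / all / any rule to one string."""
    sims = [similarity(word, w) for w in vocab]
    if strategy == "all":
        return all(s >= threshold for s in sims)
    if strategy == "any":
        return any(s >= threshold for s in sims)
    if strategy == "avg":
        return sum(sims) / len(sims) >= threshold
    raise ValueError(f"Unknown strategy: {strategy}")


print(similar_to_vocab_py("apple", ["apple", "zebra"], 0.9, "any"))
print(similar_to_vocab_py("apple", ["apple", "zebra"], 0.9, "all"))
```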

str_d_leven(c, other, parallel=False, return_sim=False, as_bytes=False)

Computes the Damerau-Levenshtein distance between this and the other str.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
return_sim bool

If true, return normalized Damerau-Levenshtein.

False
as_bytes bool

Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.

False
Source code in python/polars_ds/exprs/string.py
def str_d_leven(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Computes the Damerau-Levenshtein distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return normalized Damerau-Levenshtein.
    as_bytes
        Whether to treat the strings as ASCII characters. This will boost performance but does not
        work on non-ASCII characters.
    """
    params = {"parallel": parallel, "as_bytes": as_bytes}
    if return_sim:
        return pl_plugin(
            symbol="pl_d_levenshtein_sim",
            args=[to_expr(c), to_expr(other)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
    else:
        return pl_plugin(
            symbol="pl_d_levenshtein",
            args=[to_expr(c), to_expr(other)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
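
For reference, the Damerau-Levenshtein distance extends Levenshtein with transpositions of adjacent characters. The plain-Python sketch below implements the unrestricted variant; that the plugin uses this variant is an assumption (made because Optimal String Alignment is a separate function on this page), and `damerau_levenshtein` is a hypothetical name.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Hypothetical helper: unrestricted Damerau-Levenshtein distance."""
    inf = len(a) + len(b)
    # d carries one sentinel row and column filled with `inf`.
    d = [[inf] * (len(b) + 2) for _ in range(len(a) + 2)]
    for i in range(len(a) + 1):
        d[i + 1][1] = i
    for j in range(len(b) + 1):
        d[1][j + 1] = j
    last_row = {}  # last 1-based position in `a` where each character occurred
    for i in range(1, len(a) + 1):
        last_col = 0  # last 1-based position in `b` that matched a[i - 1]
        for j in range(1, len(b) + 1):
            k = last_row.get(b[j - 1], 0)
            col = last_col
            cost = 1
            if a[i - 1] == b[j - 1]:
                cost = 0
                last_col = j
            d[i + 1][j + 1] = min(
                d[i][j] + cost,  # substitution or match
                d[i + 1][j] + 1,  # insertion
                d[i][j + 1] + 1,  # deletion
                d[k][col] + (i - k - 1) + 1 + (j - col - 1),  # transposition
            )
        last_row[a[i - 1]] = i
    return d[len(a) + 1][len(b) + 1]


print(damerau_levenshtein("ab", "ba"))
print(damerau_levenshtein("ca", "abc"))
```

The pair "ca" and "abc" is a common test case for telling the variants apart: the unrestricted distance is 2, while Optimal String Alignment gives 3.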

str_fuzz(c, other, parallel=False)

Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more information.)

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_fuzz(c: str | pl.Expr, other: str | pl.Expr, parallel: bool = False) -> pl.Expr:
    """
    Calculates the normalized Indel similarity. (See the package rapidfuzz, fuzz.ratio for more
    information.)

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_fuzz",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True,
    )
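
The Indel distance counts only insertions and deletions, so it can be derived from the length of the longest common subsequence (LCS). The sketch below is a plain-Python illustration with hypothetical names; it assumes the result is scaled to the range 0 to 1 (rapidfuzz's fuzz.ratio reports the same quantity on a 0 to 100 scale), which this page does not state for the plugin.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1 - indel_distance / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty strings are identical
    # Every character outside the LCS must be deleted or inserted.
    indel_distance = total - 2 * lcs_length(a, b)
    return 1.0 - indel_distance / total


print(round(indel_similarity("lewenstein", "levenshtein"), 4))
```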

str_hamming(c, other, pad=False, parallel=False)

Computes the hamming distance between two strings. If they do not have the same length, null will be returned.

Parameters:

Name Type Description Default
c str | Expr

Either the name of the column or a Polars expression

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
pad bool

Whether to pad the string when lengths are not equal.

False
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_hamming(
    c: str | pl.Expr, other: str | pl.Expr, pad: bool = False, parallel: bool = False
) -> pl.Expr:
    """
    Computes the hamming distance between two strings. If they do not have the same length, null will
    be returned.

    Parameters
    ----------
    c
        Either the name of the column or a Polars expression
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    pad
        Whether to pad the string when lengths are not equal.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """

    if pad:
        return pl_plugin(
            symbol="pl_hamming_padded",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_hamming",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
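
The two modes can be sketched in plain Python, with None standing in for null. This is an illustration only: `hamming_py` is a hypothetical name, and since this page does not say how padding is done, the sketch assumes that every position beyond the end of the shorter string counts as a mismatch.

```python
from typing import Optional


def hamming_py(a: str, b: str, pad: bool = False) -> Optional[int]:
    """Hypothetical helper: Hamming distance, or None for unequal lengths."""
    if len(a) != len(b) and not pad:
        return None  # mirrors the documented null for unequal lengths
    mismatches = sum(x != y for x, y in zip(a, b))
    # Assumption: padded positions never match the other string.
    return mismatches + abs(len(a) - len(b))


print(hamming_py("abc", "abd"))
print(hamming_py("abc", "ab"))
print(hamming_py("abc", "ab", pad=True))
```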

str_jaccard(c, other, substr_size=2, parallel=False)

Treats the substrings of size substr_size as a set and computes the Jaccard similarity between this word and the other.

Note that this works on substrings at the byte level under the hood, not at the char level, so strings with non-ASCII characters may give unexpected results.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
substr_size int

The substring size for Jaccard similarity. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.

2
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_jaccard(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats substrings of size `substr_size` as a set. And computes the jaccard similarity between
    this word and the other.

    Note this treats substrings at the byte level under the hood, not at the char level. So non-ASCII
    characters may have problems.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The substring size for Jaccard similarity. E.g. if substr_size = 2, "apple" will be decomposed into
        the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_str_jaccard",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
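
The decomposition into substrings and the resulting set comparison can be sketched in plain Python. The names `byte_ngrams` and `jaccard_py` are hypothetical, the value returned when both sets are empty is an assumption, and the substrings are taken over UTF-8 bytes to mirror the byte-level note above.

```python
from typing import Set


def byte_ngrams(text: str, size: int = 2) -> Set[bytes]:
    """Set of all byte substrings of length `size`."""
    data = text.encode("utf-8")
    return {data[i : i + size] for i in range(len(data) - size + 1)}


def jaccard_py(a: str, b: str, substr_size: int = 2) -> float:
    """Hypothetical helper: |A & B| / |A | B| over byte n-gram sets."""
    set_a, set_b = byte_ngrams(a, substr_size), byte_ngrams(b, substr_size)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: no n-grams on either side means no similarity
    return len(set_a & set_b) / len(union)


print(sorted(byte_ngrams("apple")))
print(jaccard_py("apple", "apply"))
```

Here "apple" and "apply" share three of five distinct bigrams, which gives 0.6.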

str_jaro(c, other, parallel=False)

Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_jaro(c: str | pl.Expr, other: str | pl.Expr, parallel: bool = False) -> pl.Expr:
    """
    Computes the Jaro similarity between this and the other str. Jaro distance = 1 - Jaro sim.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_jaro",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )
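
As a reference for what the similarity measures, here is a plain-Python sketch of the textbook Jaro definition: characters match if they are equal and lie within a window of each other, and half of the out-of-order matches count as transpositions. The name `jaro_py` is hypothetical, and edge-case handling in the plugin may differ.

```python
def jaro_py(a: str, b: str) -> float:
    """Hypothetical helper: textbook Jaro similarity in the range 0 to 1."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        # Look for an unused equal character of b within the window around i.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both matched sequences in order and count positions that disagree.
    out_of_order = 0
    k = 0
    for i, ca in enumerate(a):
        if a_matched[i]:
            while not b_matched[k]:
                k += 1
            if ca != b[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3


print(round(jaro_py("MARTHA", "MARHTA"), 4))
```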

str_jw(c, other, weight=0.1, parallel=False)

Computes the Jaro-Winkler similarity between this and the other str. Jaro-Winkler distance = 1 - Jaro-Winkler sim.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
weight float

Weight for prefix. A typical value is 0.1.

0.1
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_jw(
    c: str | pl.Expr,
    other: str | pl.Expr,
    weight: float = 0.1,
    parallel: bool = False,
) -> pl.Expr:
    """
    Computes the Jaro-Winkler similarity between this and the other str.
    Jaro-Winkler distance = 1 - Jaro-Winkler sim.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    weight
        Weight for prefix. A typical value is 0.1.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_jw",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(weight, pl.Float64),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
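
To show how the prefix `weight` enters the score, here is a small sketch of the standard Winkler adjustment applied to an already computed Jaro similarity. The helper name `jaro_winkler_from_jaro` is hypothetical, and the 4-character cap on the common prefix follows Winkler's usual formulation; whether the plugin applies the same cap is an assumption.

```python
def jaro_winkler_from_jaro(jaro_sim: float, s1: str, s2: str, weight: float = 0.1) -> float:
    """Apply the Winkler prefix boost to a Jaro similarity (illustrative sketch)."""
    # Length of the common prefix, capped at 4 characters (assumed, per the usual definition).
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    # The boost grows with the shared prefix and shrinks as the Jaro score approaches 1.
    return jaro_sim + prefix * weight * (1.0 - jaro_sim)


# "MARTHA" vs "MARHTA" has Jaro similarity 17/18 and shares the prefix "MAR".
print(round(jaro_winkler_from_jaro(17 / 18, "MARTHA", "MARHTA"), 4))
```

With `weight=0.1` and the 4-character cap, the boosted score cannot exceed 1.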

str_lcs_subseq(c, other, parallel=False)

Extracts the longest common subsequence of this string and the other string.

Note: this is not the same as the longest common substring.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_lcs_subseq(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
) -> pl.Expr:
    """
    Extracts the longest common subsequence of this string and the other string.

    Note: this is not the same as the longest common substring.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_lcs_subseq",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )
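
The difference from a substring is that a subsequence may skip characters. A minimal dynamic-programming sketch in pure Python, for illustration only (the function name is hypothetical, and when several subsequences tie for the longest, which one the plugin returns is not specified here):

```python
def lcs_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (illustrative sketch)."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs_subsequence("AGGTAB", "GXTXAYB"))  # the matched characters are not adjacent in either input
```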

str_lcs_subseq_dist(c, other, parallel=False, return_sim=True)

Computes the Longest Common Subsequence distance/similarity between this and the other str. The distance is calculated as max(len1, len2) - similarity, where the similarity is the length of the longest common subsequence.

The subsequence does not need to be consecutive.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
return_sim bool

If true, return normalized similarity.

True
Source code in python/polars_ds/exprs/string.py
def str_lcs_subseq_dist(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = True,
) -> pl.Expr:
    """
    Computes the Longest Common Subsequence distance/similarity between this and the other str.
    The distance is calculated as max(len1, len2) - similarity, where the similarity is the
    length of the longest common subsequence.

    The subsequence does not need to be consecutive.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return normalized similarity.
    """
    if return_sim:
        return pl_plugin(
            symbol="pl_lcs_subseq_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_lcs_subseq_dist",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
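
The distance formula above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions: the function names are hypothetical, and normalizing the similarity by the longer string's length is an assumption about what "normalized" means here, not something the source states.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using two rolling rows."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    # Distance as described above: max(len1, len2) - LCS length.
    return max(len(a), len(b)) - lcs_length(a, b)


def lcs_normalized_similarity(a: str, b: str) -> float:
    # Assumption: normalize by the longer length so the result lies in [0, 1].
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


print(lcs_distance("AGGTAB", "GXTXAYB"), lcs_normalized_similarity("AGGTAB", "GXTXAYB"))
```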

str_lcs_substr(c, other, parallel=False)

Extracts the longest common substring of this string and the other string.

Note: this is not the same as the longest common subsequence.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_lcs_substr(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
) -> pl.Expr:
    """
    Extracts the longest common substring of this string and the other string.

    Note: this is not the same as the longest common subsequence.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_lcs_substr",
        args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
        is_elementwise=True and not parallel,
    )
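
Unlike a subsequence, a substring must be contiguous in both inputs. A minimal pure-Python sketch for illustration (hypothetical function name; returning the first longest match found when there are ties is an assumption):

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return a longest contiguous run shared by a and b (illustrative sketch)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a, 1):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


print(longest_common_substring("xabcdy", "zabcdw"))
```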

str_leven(c, other, parallel=False, return_sim=False, as_bytes=False)

Computes the Levenshtein distance between this and the other str.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
return_sim bool

If true, return normalized Levenshtein.

False
as_bytes bool

Whether to treat the strings as ASCII characters. This will boost performance but does not work on non-ASCII characters.

False
Source code in python/polars_ds/exprs/string.py
def str_leven(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
    as_bytes: bool = False,
) -> pl.Expr:
    """
    Computes the Levenshtein distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return normalized Levenshtein.
    as_bytes
        Whether to treat the strings as ASCII characters. This will boost performance but does not
        work on non-ASCII characters.
    """
    params = {"parallel": parallel, "as_bytes": as_bytes}
    if return_sim:
        return pl_plugin(
            symbol="pl_levenshtein_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
    else:
        return pl_plugin(
            symbol="pl_levenshtein",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
            kwargs=params,
        )
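
As a reference for the values involved, here is a minimal pure-Python Levenshtein sketch. It is not the plugin's implementation; the function names are hypothetical, and normalizing by the longer length is an assumption about `return_sim`. Running it on UTF-8 bytes instead of characters also illustrates why a byte-level comparison such as `as_bytes` can give different answers for non-ASCII text.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance with insertions, deletions and substitutions (illustrative sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if the items are equal)
            ))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: Sequence, b: Sequence) -> float:
    # Assumption: similarity = 1 - distance / length of the longer input.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
# Character level vs byte level for a non-ASCII character ("\u00e9" is e with an acute accent).
print(levenshtein("\u00e9", "e"), levenshtein("\u00e9".encode("utf-8"), b"e"))
```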

str_nearest(c, word, threshold=100, metric='lv')

Finds the string in the column that is nearest to the given word in the given metric. This algorithm is very slow.

Note: Nearest-k strings search functionality is temporarily dropped.

Parameters:

Name Type Description Default
c str | Expr

The string column or its name

required
word str

The word to compare against the strings in the column

required
threshold int

Only considers strings to be near if they are within distance threshold. This is a positive integer because all the distances output integers.

100
metric Literal['lv', 'hamming']

Which distance metric to use. One of lv (Levenshtein), hamming

'lv'
Source code in python/polars_ds/exprs/string.py
def str_nearest(
    c: str | pl.Expr,
    word: str,
    threshold: int = 100,
    metric: Literal["lv", "hamming"] = "lv",
) -> pl.Expr:
    """
    Finds the string in the column that is nearest to the given word in the given metric. This algorithm is
    very slow.

    Note: Nearest-k strings search functionality is temporarily dropped.

    Parameters
    ----------
    c
        The string column or its name
    word
        The word to compare against the strings in the column
    threshold : int
        Only considers strings to be near if they are within distance threshold. This is a positive integer
        because all the distances output integers.
    metric
        Which distance metric to use. One of `lv` (Levenshtein), `hamming`
    """
    if metric not in ("lv", "hamming"):
        raise ValueError(f"Unknown metric for similar_words: {metric}")

    if threshold <= 0:
        raise ValueError("Distance threshold must be > 0.")

    return pl_plugin(
        symbol="pl_nearest_str",
        args=[to_expr(c)],
        kwargs={
            "word": word,
            "metric": str(metric).lower(),
            "threshold": threshold,
        },
        returns_scalar=True,
    )
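
Conceptually, the search is a linear scan over the column that keeps the closest string within the threshold. The sketch below illustrates that idea in pure Python; it is not the plugin's code. The function names are hypothetical, and several details are assumptions: "within" is read as `<=`, ties keep the first candidate, `None` is returned when nothing qualifies, and Hamming skips candidates of a different length.

```python
from typing import Iterable, Optional


def _levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def _hamming(a: str, b: str) -> Optional[int]:
    # Hamming distance is only defined for strings of equal length.
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def nearest_string(
    candidates: Iterable[str], word: str, threshold: int = 100, metric: str = "lv"
) -> Optional[str]:
    """Linear scan for the candidate closest to `word` (illustrative sketch)."""
    if metric not in ("lv", "hamming"):
        raise ValueError(f"Unknown metric: {metric}")
    if threshold <= 0:
        raise ValueError("Distance threshold must be > 0.")
    dist_fn = _levenshtein if metric == "lv" else _hamming
    best, best_dist = None, None
    for cand in candidates:
        d = dist_fn(cand, word)
        if d is None or d > threshold:
            continue  # undefined distance or too far away
        if best_dist is None or d < best_dist:
            best, best_dist = cand, d
    return best


print(nearest_string(["bandana", "banana", "cabana"], "banan"))
```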

str_osa(c, other, parallel=False, return_sim=False)

Computes the Optimal String Alignment distance between this and the other str.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
return_sim bool

If true, return normalized OSA similarity.

False
Source code in python/polars_ds/exprs/string.py
def str_osa(
    c: str | pl.Expr,
    other: str | pl.Expr,
    parallel: bool = False,
    return_sim: bool = False,
) -> pl.Expr:
    """
    Computes the Optimal String Alignment distance between this and the other str.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    return_sim
        If true, return normalized OSA similarity.
    """
    if return_sim:
        return pl_plugin(
            symbol="pl_osa_sim",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
    else:
        return pl_plugin(
            symbol="pl_osa",
            args=[to_expr(c), to_expr(other), pl.lit(parallel, pl.Boolean)],
            is_elementwise=True and not parallel,
        )
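
Optimal String Alignment extends Levenshtein with swaps of two adjacent characters, with the restriction that no substring is edited more than once. A minimal pure-Python sketch of the textbook recurrence, for illustration only (the function name is hypothetical):

```python
def osa_distance(a: str, b: str) -> int:
    """Textbook Optimal String Alignment distance (illustrative sketch)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"), osa_distance("ca", "abc"))
```

The pair `"ca"` / `"abc"` is the usual example where OSA (3) differs from the unrestricted Damerau-Levenshtein distance (2).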

str_overlap_coeff(c, other, substr_size=2, parallel=False)

Treats the substrings of size substr_size as a set and computes the overlap coefficient between this word and the other.

Note that substrings are taken at the byte level under the hood, not at the char level, so results may be off for non-ASCII characters.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
substr_size int

The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.

2
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_overlap_coeff(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the overlap coefficient
    between this word and the other.

    Note that substrings are taken at the byte level under the hood, not at the char level, so
    results may be off for non-ASCII characters.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_overlap_coeff",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
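
The overlap coefficient divides the size of the intersection by the size of the smaller set. A minimal pure-Python sketch on byte-level n-grams, for illustration only (the function names are hypothetical, and returning 0.0 when a string is too short to produce any n-gram is an assumption):

```python
def ngram_set(s: str, n: int = 2) -> set:
    """Set of byte-level substrings of length n, mirroring the byte-level note above."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def overlap_coefficient(a: str, b: str, n: int = 2) -> float:
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    smaller = min(len(sa), len(sb))
    # Assumption: an empty n-gram set yields 0.0 rather than an error.
    return 0.0 if smaller == 0 else len(sa & sb) / smaller


print(sorted(ngram_set("apple")))
print(overlap_coefficient("apple", "apply"))
```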

str_sorensen_dice(c, other, substr_size=2, parallel=False)

Treats the substrings of size substr_size as a set and computes the Sorensen-Dice similarity between this word and the other.

Note that substrings are taken at the byte level under the hood, not at the char level, so results may be off for non-ASCII characters.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
substr_size int

The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.

2
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Source code in python/polars_ds/exprs/string.py
def str_sorensen_dice(
    c: str | pl.Expr,
    other: str | pl.Expr,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the Sorensen-Dice similarity
    between this word and the other.

    Note that substrings are taken at the byte level under the hood, not at the char level, so
    results may be off for non-ASCII characters.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    substr_size
        The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.
    """
    return pl_plugin(
        symbol="pl_sorensen_dice",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
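
The Sorensen-Dice similarity is twice the intersection size divided by the sum of the two set sizes. A minimal pure-Python sketch on byte-level n-grams, for illustration only (the function names are hypothetical, and returning 0.0 when both sets are empty is an assumption):

```python
def ngram_set(s: str, n: int = 2) -> set:
    """Set of byte-level substrings of length n."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def sorensen_dice(a: str, b: str, n: int = 2) -> float:
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    total = len(sa) + len(sb)
    # Assumption: two empty n-gram sets yield 0.0 rather than an error.
    return 0.0 if total == 0 else 2 * len(sa & sb) / total


# "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}; only "ht" is shared.
print(sorensen_dice("night", "nacht"))
```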

str_tversky_sim(c, other, alpha, beta, substr_size=2, parallel=False)

Treats the substrings of size substr_size as a set and computes the Tversky similarity between this word and the other. See the reference for information on how Tversky similarity relates to the other n-gram based similarities.

Note that substrings are taken at the byte level under the hood, not at the char level, so results may be off for non-ASCII characters. Also note that alpha and beta are supposed to be weighting factors, but this function only checks that they are non-negative; choosing sensible values is left to the user.

Parameters:

Name Type Description Default
c str | Expr

The string column

required
other str | Expr

Either the name of the column or a Polars expression. If you want to compare a single string with all of column c, use pl.lit(your_str)

required
alpha float

The first weighting factor. See reference

required
beta float

The second weighting factor. See reference

required
substr_size int

The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.

2
parallel bool

Whether to run it in parallel. Note that this is only recommended when this query is the only one in execution and when this is not executed in any aggregation / streaming context.

False
Reference

https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7

Source code in python/polars_ds/exprs/string.py
def str_tversky_sim(
    c: str | pl.Expr,
    other: str | pl.Expr,
    alpha: float,
    beta: float,
    substr_size: int = 2,
    parallel: bool = False,
) -> pl.Expr:
    """
    Treats the substrings of size `substr_size` as a set and computes the Tversky similarity between
    this word and the other. See the reference for information on how Tversky similarity relates
    to the other n-gram based similarities.

    Note that substrings are taken at the byte level under the hood, not at the char level, so
    results may be off for non-ASCII characters. Also note that alpha and beta are supposed to be
    weighting factors, but this function only checks that they are non-negative; choosing sensible
    values is left to the user.

    Parameters
    ----------
    c
        The string column
    other
        Either the name of the column or a Polars expression. If you want to compare a single
        string with all of column c, use pl.lit(your_str)
    alpha
        The first weighting factor. See reference
    beta
        The second weighting factor. See reference
    substr_size
        The size of the substrings that form each set. E.g. if substr_size = 2, "apple" will be
        decomposed into the set ('ap', 'pp', 'pl', 'le') before being compared.
    parallel
        Whether to run it in parallel. Note that this is only recommended when this query
        is the only one in execution and when this is not executed in any aggregation / streaming context.

    Reference
    ---------
    https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7
    """
    if alpha < 0 or beta < 0:
        raise ValueError("Input `alpha` and `beta` must be >= 0.")

    return pl_plugin(
        symbol="pl_tversky_sim",
        args=[
            to_expr(c),
            to_expr(other),
            pl.lit(substr_size, pl.UInt32),
            pl.lit(alpha, pl.Float64),
            pl.lit(beta, pl.Float64),
            pl.lit(parallel, pl.Boolean),
        ],
        is_elementwise=True and not parallel,
    )
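
The Tversky index generalizes the other set-based similarities: with alpha = beta = 1 it reduces to Jaccard, and with alpha = beta = 0.5 to Sorensen-Dice. A minimal pure-Python sketch of the standard formula, for illustration only. The function names are hypothetical, and which of the two set differences the plugin weights with `alpha` versus `beta` is an assumption here.

```python
def ngram_set(s: str, n: int = 2) -> set:
    """Set of byte-level substrings of length n."""
    data = s.encode("utf-8")
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def tversky_similarity(a: str, b: str, alpha: float, beta: float, n: int = 2) -> float:
    if alpha < 0 or beta < 0:
        raise ValueError("Input `alpha` and `beta` must be >= 0.")
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    common = len(sa & sb)
    # Assumption: alpha weights n-grams only in `a`, beta weights n-grams only in `b`.
    denom = common + alpha * len(sa - sb) + beta * len(sb - sa)
    return 0.0 if denom == 0 else common / denom


# "apple" and "apply" share 3 bigrams; each has 1 bigram the other lacks.
print(tversky_similarity("apple", "apply", 1.0, 1.0))  # Jaccard: 3 / 5
print(tversky_similarity("apple", "apply", 0.5, 0.5))  # Sorensen-Dice: 3 / 4
```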

to_camel_case(c)

Turns itself into camel case. E.g. helloWorld

Source code in python/polars_ds/exprs/string.py
def to_camel_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into camel case. E.g. helloWorld"""
    return pl_plugin(
        symbol="pl_to_camel",
        args=[to_expr(c)],
        is_elementwise=True,
    )

to_constant_case(c)

Turns itself into constant case. E.g. Hello_World

Source code in python/polars_ds/exprs/string.py
def to_constant_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into constant case. E.g. Hello_World"""
    return pl_plugin(
        symbol="pl_to_constant",
        args=[to_expr(c)],
        is_elementwise=True,
    )

to_pascal_case(c)

Turns itself into Pascal case. E.g. HelloWorld

Source code in python/polars_ds/exprs/string.py
def to_pascal_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into Pascal case. E.g. HelloWorld"""
    return pl_plugin(
        symbol="pl_to_pascal",
        args=[to_expr(c)],
        is_elementwise=True,
    )

to_snake_case(c)

Turns itself into snake case. E.g. hello_world

Source code in python/polars_ds/exprs/string.py
def to_snake_case(c: str | pl.Expr) -> pl.Expr:
    """Turns itself into snake case. E.g. hello_world"""
    return pl_plugin(
        symbol="pl_to_snake",
        args=[to_expr(c)],
        is_elementwise=True,
    )
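
To make the target formats concrete, here is a rough pure-Python sketch of camel, Pascal and snake case conversion. It is an illustration only: the helper names are hypothetical, and the word-splitting rule (split on non-alphanumeric characters and on a lowercase-or-digit to uppercase boundary) is an assumption that may differ from the plugin's behavior on edge cases.

```python
import re


def split_words(s: str) -> list:
    """Split into lowercase words (assumed rule, see the note above)."""
    # Mark boundaries such as "helloWorld" -> "hello World".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", s)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]


def to_snake(s: str) -> str:
    return "_".join(split_words(s))


def to_pascal(s: str) -> str:
    return "".join(w.capitalize() for w in split_words(s))


def to_camel(s: str) -> str:
    pascal = to_pascal(s)
    return pascal[:1].lower() + pascal[1:]


print(to_snake("helloWorld"), to_pascal("hello_world"), to_camel("Hello World"))
```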