Sample and Split

Polars Native Machine Learning Pipeline

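Modules:

  • sample_and_split

Functions:

  • downsample: Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.
  • random_cols: Randomly selects columns from the provided list of column names.
  • sample: Extracts a random sample from a Polars DataFrame or LazyFrame.
  • split_by_ratio: Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.
  • volume_neutral: Subsamples a Polars DataFrame or LazyFrame so that every group has the same number of rows.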

downsample(df, conditions, seed=None, return_df=False)

downsample

Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.

This function applies downsampling to rows where each boolean condition is true. For each condition, you can specify either a fixed number of rows to keep (int) or a fraction of rows to keep (float). The downsampling is performed using a random sampling strategy, which can be made reproducible using a seed.
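The behaviour described above can be illustrated without Polars. The following is a minimal pure-Python sketch, not the library's implementation: `downsample_rows`, the lambda predicates, and the toy `rows` list are hypothetical names introduced here, and truncating `len(subset) * fraction` to an integer is an assumption about how a fraction maps to a row count.

```python
import random
from typing import Callable, Dict, List, Tuple, Union

Row = Dict[str, object]
Condition = Tuple[Callable[[Row], bool], Union[int, float]]


def downsample_rows(rows: List[Row], conditions: List[Condition], seed: int = 0) -> List[Row]:
    """Thin out the rows matching each condition and keep all other rows."""
    rng = random.Random(seed)
    kept: List[Row] = []
    for predicate, value in conditions:
        subset = [r for r in rows if predicate(r)]
        if isinstance(value, int):
            n = min(value, len(subset))   # fixed number of rows, capped at the subset size
        else:
            n = int(len(subset) * value)  # fraction of the subset (assumed to truncate)
        kept.extend(rng.sample(subset, n))
    # Rows that match none of the conditions pass through untouched.
    kept.extend(r for r in rows if not any(p(r) for p, _ in conditions))
    return kept


# 300 toy rows, 100 per category.
rows = [{"id": i, "category": "ABC"[i % 3]} for i in range(300)]
result = downsample_rows(
    rows,
    [
        (lambda r: r["category"] == "A", 0.25),  # keep 25% of category A
        (lambda r: r["category"] == "B", 10),    # keep 10 rows of category B
    ],
    seed=42,
)
counts = {c: sum(r["category"] == c for r in result) for c in "ABC"}
print(counts)  # {'A': 25, 'B': 10, 'C': 100}
```

Category C matches no condition, so all of its rows survive, which mirrors the `C` row count staying unchanged in the library example.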

Parameters:

df : PolarsFrame, required
    Either a polars.DataFrame or a polars.LazyFrame.

conditions : List[Tuple[Expr, float | int]] | Tuple[Expr, float | int], required
    One or more tuples, each containing:
  • A boolean Polars expression (polars.Expr) defining the subset of rows to downsample.
  • A float (fraction of rows to keep, e.g., 0.5 for 50%) or an integer (fixed number of rows to keep).

seed : int, default None
    The seed value for the random number generator. The same seed will produce the same output each time.

return_df : bool, default False
    If True, a polars.LazyFrame input is collected and a polars.DataFrame is returned. If False, the output has the same type as df.

Returns:

PolarsFrame
    A polars.DataFrame if return_df is True; otherwise a polars.DataFrame or a polars.LazyFrame, matching the type of df.

Example

import polars as pl
import polars_ds.sample_and_split as sampling
import numpy as np

np.random.seed(42)
lf = pl.LazyFrame(
    data = {
        "id": range(1, 1001)
        ,"value": np.random.rand(1000) * 100
        ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    }
)
print(lf.group_by("category").len().sort("category").collect())
shape: (3, 2)
┌──────────┬─────┐
│ category ┆ len │
│ ---      ┆ --- │
│ str      ┆ u32 │
╞══════════╪═════╡
│ A        ┆ 341 │
│ B        ┆ 343 │
│ C        ┆ 316 │
└──────────┴─────┘

print(sampling.downsample(
    lf,
    [
        (pl.col("category") == "A", 0.25),
        (pl.col("category") == "B", 10)
    ],
    return_df = True
).group_by("category").len().sort("category"))
shape: (3, 2)
┌──────────┬─────┐
│ category ┆ len │
│ ---      ┆ --- │
│ str      ┆ u32 │
╞══════════╪═════╡
│ A        ┆ 85  │
│ B        ┆ 10  │
│ C        ┆ 316 │
└──────────┴─────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def downsample(
    df: PolarsFrame,
    conditions: List[Tuple[pl.Expr, float | int]] | Tuple[pl.Expr, float | int],
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    """
    downsample
    ===========
    Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.

    This function applies downsampling to rows where each boolean condition is true.
    For each condition, you can specify either a fixed number of rows to keep (int)
    or a fraction of rows to keep (float). The downsampling is performed using a
    random sampling strategy, which can be made reproducible using a seed.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    conditions : List[Tuple[pl.Expr, float | int]] | Tuple[pl.Expr, float | int]
        One or more tuples, each containing:
        - A boolean Polars expression (`polars.Expr`) defining the subset of rows to downsample.
        - A float (fraction of rows to keep, e.g., 0.5 for 50%) or an integer (fixed number of rows to keep).

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(lf.group_by("category").len().sort("category").collect())
    shape: (3, 2)
    ┌──────────┬─────┐
    │ category ┆ len │
    │ ---      ┆ --- │
    │ str      ┆ u32 │
    ╞══════════╪═════╡
    │ A        ┆ 341 │
    │ B        ┆ 343 │
    │ C        ┆ 316 │
    └──────────┴─────┘
    >>> print(sampling.downsample(
    >>>     lf,
    >>>     [
    >>>         (pl.col("category") == "A", 0.25),
    >>>         (pl.col("category") == "B", 10)
    >>>     ],
    >>>     return_df = True
    >>> ).group_by("category").len().sort("category"))
    shape: (3, 2)
    ┌──────────┬─────┐
    │ category ┆ len │
    │ ---      ┆ --- │
    │ str      ┆ u32 │
    ╞══════════╪═════╡
    │ A        ┆ 85  │
    │ B        ┆ 10  │
    │ C        ┆ 316 │
    └──────────┴─────┘
    """
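    # Engine
    ## Accept a single (expr, value) tuple as well as a list of tuples
    if isinstance(conditions, tuple) and isinstance(conditions[0], pl.Expr):
        conditions = [conditions]

    ## Create samples for each pl.Expr
    results = []
    for expr, value in conditions:
        subset = df.filter(expr)
        n, fraction = None, None
        if isinstance(value, int):
            # Cap n at the subset size: sampling without replacement
            # cannot return more rows than the condition matches
            subset_size = (
                subset.select(pl.len())[0, 0]
                if isinstance(subset, pl.DataFrame)
                else subset.select(pl.len()).collect()[0, 0]
            )
            n = min(value, subset_size)
        else:
            fraction = value
        sample = subset.select(
            pl.all().sample(n=n, fraction=fraction, with_replacement=False, shuffle=True, seed=seed)
        )
        results.append(sample)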

    ## Add sample where no pl.Expr is met
    exprs = [expr for expr, _ in conditions]
    combined_expr = exprs[0]
    if len(exprs) > 1:
        for expr in exprs[1:]:
            combined_expr = combined_expr | expr
    sample = df.filter(~combined_expr)
    results.append(sample)

    ## Merge samples
    downsample = pl.concat(results, how="vertical")

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        downsample = downsample.collect()
    return downsample

random_cols(all_columns, k, keep=None, seed=None)

random_cols

Randomly select columns from the provided list of column names.

Parameters:

all_columns : List[str], required
    Names of the columns from which to draw randomly.

k : int, required
    Number of columns to draw at random, in addition to those listed in keep.

keep : List[str], default None
    Columns to always include in the returned list.

seed : int, default None
    The seed value for the random number generator. The same seed will produce the same output each time.

Returns:

List[str]
    The columns in keep (if any), followed by the names of the randomly drawn columns.

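Note(s)
  • Columns are drawn as combinations, not permutations: the drawn columns always appear in the order they have in all_columns, so ["x", "y"] and ["y", "x"] cannot both be returned.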
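A short sketch of why the order is fixed, assuming the selection is made by picking one of the `math.comb(n, k)` combinations by index, as the source listing further down does. `pick_combination` is a hypothetical helper written for this illustration; unlike the listing, it uses a local `random.Random` instance rather than the module-level generator.

```python
import math
import random
from itertools import combinations, islice
from typing import List, Optional


def pick_combination(columns: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Draw one k-combination of `columns` uniformly at random."""
    rng = random.Random(seed)               # local generator, global state is left alone
    total = math.comb(len(columns), k)      # number of distinct k-combinations
    index = rng.randrange(total)            # raises ValueError when k > len(columns)
    # Advance the lazy combinations iterator to the chosen index.
    return list(next(islice(combinations(columns, k), index, None)))


cols = ["a", "b", "c", "d", "e", "f"]
picked = pick_combination(cols, 2, seed=101)
print(picked)

# combinations() yields items in input order, so a reversed pair never occurs.
all_pairs = [list(c) for c in combinations(cols, 2)]
print(["b", "a"] in all_pairs)  # False
```

Because `itertools.combinations` emits each subset exactly once, in input order, every subset has the same probability of being drawn, and the result never needs de-duplication.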
Example

import polars as pl
import polars_ds.sample_and_split as sampling

print(sampling.random_cols(["a", "b", "c", "d", "e", "f"], 2, seed=101))
['c', 'd']

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def random_cols(
    all_columns: List[str],
    k: int,
    keep: List[str] | None = None,
    seed: int | None = None,
) -> List[str]:
    """
    random_cols
    ===========
    Randomly select columns from the provided list of column names.

    Parameters
    ----------
    all_columns : List[str]
        List with the names of the columns from which to draw randomly.

    k : int
        Number of columns to select randomly outside of the list provided in `keep`.

    keep : List[str], optional, default=None
        List of values to always include in the list of randomly drawn columns.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    Returns
    ----------
    List[str]
        Returns a list with the name of the columns that were randomly drawn.

    Note(s)
    ----------
    - It is impossible to randomly select both ["x", "y"] and ["y", "x"].

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> print(sampling.random_cols(["a", "b", "c", "d", "e", "f"], 2, seed=101))
    ['c', 'd']
    """
    # Engine
    if seed is not None:
        random.seed(seed)

    out = [] if keep is None else list(keep)
    # Build the pool once so its size always matches what combinations() iterates over
    pool = [c for c in all_columns if c not in out]
    if len(pool) < k:
        raise ValueError("Not enough columns to select from.")

    # Pick the n-th k-combination of the pool, with n drawn uniformly at random
    n = random.randrange(0, math.comb(len(pool), k))
    rand_cols = next(islice(combinations(pool, k), n, None))
    random_cols = out + list(rand_cols)

    # Output(s)
    return random_cols

sample(df, value, replace=False, seed=None, return_df=False)

sample

Extracts a random sample from a Polars DataFrame or LazyFrame.

Parameters:

df : PolarsFrame, required
    Either a polars.DataFrame or a polars.LazyFrame.

value : int or float, required
    If an integer, the number of rows to draw from df (capped at the number of rows in df). If a float, the fraction of the rows of df to draw.

replace : bool, default False
    Whether to sample with replacement or not.

seed : int, default None
    The seed value for the random number generator. The same seed will produce the same output each time.

return_df : bool, default False
    If True, a polars.LazyFrame input is collected and a polars.DataFrame is returned. If False, the output has the same type as df.

Returns:

PolarsFrame
    A polars.DataFrame if return_df is True; otherwise a polars.DataFrame or a polars.LazyFrame, matching the type of df.
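How `value` translates into a row count can be sketched in a few lines. This is an illustration only: `resolve_sample_size` is a hypothetical helper, the cap on integers follows the `min(value, df_size)` line in the source listing, and truncating the fractional case to an integer is an assumption.

```python
from typing import Union


def resolve_sample_size(value: Union[int, float], total_rows: int) -> int:
    """Translate an int (row count) or a float (fraction) into a number of rows."""
    if isinstance(value, int):
        return min(value, total_rows)  # never ask for more rows than exist
    return int(total_rows * value)     # fraction of the frame (assumed to truncate)


print(resolve_sample_size(100, 1000))   # 100
print(resolve_sample_size(0.5, 1000))   # 500
print(resolve_sample_size(5000, 1000))  # 1000
```

The first two calls correspond to the shapes `(100, 3)` and `(500, 3)` shown in the example for a 1000-row frame.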

Example

import polars as pl
import polars_ds.sample_and_split as sampling
import numpy as np

np.random.seed(42)
lf = pl.LazyFrame(
    data = {
        "id": range(1, 1001)
        ,"value": np.random.rand(1000) * 100
        ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    }
)
print(sampling.sample(lf, 100, seed=101, return_df=True))
shape: (100, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 391 ┆ 99.050514 ┆ C        │
│ 555 ┆ 87.66536  ┆ B        │
│ 778 ┆ 72.225257 ┆ A        │
│ 888 ┆ 23.818278 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 233 ┆ 57.690388 ┆ A        │
│ 196 ┆ 34.920957 ┆ A        │
│ 850 ┆ 59.538502 ┆ C        │
│ 235 ┆ 19.524299 ┆ A        │
│ 404 ┆ 82.645747 ┆ B        │
└─────┴───────────┴──────────┘

print(sampling.sample(lf, 0.5, seed=101, return_df=True))
shape: (500, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 391 ┆ 99.050514 ┆ C        │
│ 555 ┆ 87.66536  ┆ B        │
│ 778 ┆ 72.225257 ┆ A        │
│ 888 ┆ 23.818278 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 320 ┆ 25.02429  ┆ B        │
│ 812 ┆ 83.889809 ┆ C        │
│ 982 ┆ 77.09122  ┆ A        │
│ 412 ┆ 95.006197 ┆ B        │
│ 416 ┆ 44.844552 ┆ C        │
└─────┴───────────┴──────────┘

print(sampling.sample(lf, 0.1, True, 101, True))
shape: (100, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 390 ┆ 80.683474 ┆ A        │
│ 554 ┆ 56.093797 ┆ B        │
│ 777 ┆ 22.92514  ┆ C        │
│ 887 ┆ 65.274611 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 152 ┆ 23.956189 ┆ A        │
│ 110 ┆ 7.697991  ┆ B        │
│ 834 ┆ 17.638699 ┆ C        │
│ 152 ┆ 23.956189 ┆ A        │
│ 339 ┆ 47.417383 ┆ C        │
└─────┴───────────┴──────────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def sample(
    df: PolarsFrame,
    value: float | int,
    replace: bool = False,
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    r"""
    sample
    ===========
    Extracts a random sample from a Polars DataFrame or LazyFrame.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    value : int or float
        If an integer is provided, `value` observations are selected from `df`. Otherwise, a proportion of `value` over the `df` is selected.

    replace : bool, optional, default=False
        Whether to sample with replacement or not.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, optional, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    ----------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.sample(lf, 100, seed=101, return_df=True))
    shape: (100, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 391 ┆ 99.050514 ┆ C        │
    │ 555 ┆ 87.66536  ┆ B        │
    │ 778 ┆ 72.225257 ┆ A        │
    │ 888 ┆ 23.818278 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 233 ┆ 57.690388 ┆ A        │
    │ 196 ┆ 34.920957 ┆ A        │
    │ 850 ┆ 59.538502 ┆ C        │
    │ 235 ┆ 19.524299 ┆ A        │
    │ 404 ┆ 82.645747 ┆ B        │
    └─────┴───────────┴──────────┘

    >>> print(sampling.sample(lf, 0.5, seed=101, return_df=True))
    shape: (500, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 391 ┆ 99.050514 ┆ C        │
    │ 555 ┆ 87.66536  ┆ B        │
    │ 778 ┆ 72.225257 ┆ A        │
    │ 888 ┆ 23.818278 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 320 ┆ 25.02429  ┆ B        │
    │ 812 ┆ 83.889809 ┆ C        │
    │ 982 ┆ 77.09122  ┆ A        │
    │ 412 ┆ 95.006197 ┆ B        │
    │ 416 ┆ 44.844552 ┆ C        │
    └─────┴───────────┴──────────┘

    >>> print(sampling.sample(lf, 0.1, True, 101, True))
    shape: (100, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 390 ┆ 80.683474 ┆ A        │
    │ 554 ┆ 56.093797 ┆ B        │
    │ 777 ┆ 22.92514  ┆ C        │
    │ 887 ┆ 65.274611 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 152 ┆ 23.956189 ┆ A        │
    │ 110 ┆ 7.697991  ┆ B        │
    │ 834 ┆ 17.638699 ┆ C        │
    │ 152 ┆ 23.956189 ┆ A        │
    │ 339 ┆ 47.417383 ┆ C        │
    └─────┴───────────┴──────────┘
    """
    # Engine
    df_size = (
        df.select(pl.len())[0, 0]
        if isinstance(df, pl.DataFrame)
        else df.select(pl.len()).collect()[0, 0]
    )
    n = min(value, df_size) if isinstance(value, int) else None
    fraction = value if isinstance(value, float) else None
    sample = df.select(
        pl.all().sample(n=n, fraction=fraction, with_replacement=replace, shuffle=True, seed=seed)
    )

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        sample = sample.collect()
    return sample

split_by_ratio(df, split_ratio, split_col='__split', by=None, default_split_1='train', default_split_2='test', seed=None, return_df=False)

split_by_ratio

Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.

The function adds a new column (split_col) to the DataFrame/LazyFrame, assigning each row to a subset according to the provided split_ratio. The splitting can be stratified by one or more columns if the by parameter is specified.
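The assignment step can be sketched in plain Python. This is a hedged illustration of the idea, not the library code: `normalize_ratios` and `assign_splits` are hypothetical helpers, and the approach (shuffle row ranks, then compare each rank's relative position against cumulative ratio bounds) is modelled loosely on the list/dict branch of the source listing.

```python
import random
from typing import Dict, List, Union

SplitRatio = Union[float, List[float], Dict[str, float]]


def normalize_ratios(split_ratio: SplitRatio, first: str = "train", second: str = "test") -> Dict[str, float]:
    """Turn any of the three accepted forms into a name -> ratio mapping."""
    if isinstance(split_ratio, float):
        return {first: split_ratio, second: 1.0 - split_ratio}
    if isinstance(split_ratio, dict):
        return dict(split_ratio)
    return {f"split_{i}": r for i, r in enumerate(split_ratio)}


def assign_splits(n_rows: int, split_ratio: SplitRatio, seed: int = 0) -> List[str]:
    """Return one split label per row."""
    ratios = normalize_ratios(split_ratio)
    names = list(ratios)
    # Cumulative upper bounds, e.g. [0.5, 0.75, 1.0]
    bounds, running = [], 0.0
    for r in ratios.values():
        running += r
        bounds.append(running)
    # Give every row a random rank in 0..n_rows-1
    ranks = list(range(n_rows))
    random.Random(seed).shuffle(ranks)
    labels = []
    for rank in ranks:
        pct = rank / n_rows
        label = names[-1]  # fallback in case the cumulative sum ends slightly below 1
        for name, bound in zip(names, bounds):
            if pct < bound:
                label = name
                break
        labels.append(label)
    return labels


labels = assign_splits(1000, {"train": 0.5, "valid": 0.25, "test": 0.25}, seed=101)
counts = {name: labels.count(name) for name in ("train", "valid", "test")}
print(counts)  # {'train': 500, 'valid': 250, 'test': 250}
```

Stratifying by a column amounts to running the same assignment separately on each group of rows and concatenating the results.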

Parameters:

df : PolarsFrame, required
    Either a polars.DataFrame or a polars.LazyFrame.

split_ratio : float | List[float] | Dict[str, float], required
  • Float: The ratio for the first subset (default: "train"), with the remainder assigned to the second subset (default: "test").
  • List of floats: Each float represents the ratio for a subset, and the list must sum to 1. Subsets are named "split_0", "split_1", etc.
  • Dictionary: Keys are subset names, and values are their respective ratios. The values must sum to 1.

split_col : str, default "__split"
    Name of the column to store the split assignments.

by : str | list[str], default None
    Column(s) to stratify by. If specified, the split is performed separately within each stratum; for a polars.LazyFrame, the distinct values of by are collected first.

default_split_1 : str, default "train"
    Name of the first subset when split_ratio is a float.

default_split_2 : str, default "test"
    Name of the second subset when split_ratio is a float.

seed : int, default None
    The seed value for the random number generator. The same seed will produce the same output each time.

return_df : bool, default False
    If True, a polars.LazyFrame input is collected and a polars.DataFrame is returned. If False, the output has the same type as df.

Returns:

PolarsFrame
    A polars.DataFrame if return_df is True; otherwise a polars.DataFrame or a polars.LazyFrame, matching the type of df.

Note(s)
  • Avoid using floating-point values with too many decimal places, as this may cause the splits to be off by one row due to rounding errors.
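A small arithmetic illustration of the note above, assuming the size of a split is obtained by truncating `len * ratio` to an integer (the float branch of the source listing casts that product to `Int64`). The variable names are chosen for this example only.

```python
n_rows = 100
ratio = 0.29

exact = n_rows * ratio   # 0.29 has no exact binary representation
truncated = int(exact)   # what a float-to-integer cast yields
rounded = round(exact)   # what one would expect on paper

print(exact)      # 28.999999999999996
print(truncated)  # 28
print(rounded)    # 29
```

Here the first split would receive 28 rows instead of the 29 one would compute by hand, which is the off-by-one effect the note warns about.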
Example

import polars as pl
import polars_ds.sample_and_split as sampling
import numpy as np

np.random.seed(42)
lf = pl.LazyFrame(
    data = {
        "id": range(1, 1001)
        ,"value": np.random.rand(1000) * 100
        ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    }
)
print(sampling.split_by_ratio(
    df = lf,
    split_ratio = 0.75,
    seed = 101,
    return_df = True
).group_by(["__split", "category"]).len().sort(["__split", "category"]))
shape: (6, 3)
┌─────────┬──────────┬─────┐
│ __split ┆ category ┆ len │
│ ---     ┆ ---      ┆ --- │
│ str     ┆ str      ┆ u32 │
╞═════════╪══════════╪═════╡
│ test    ┆ A        ┆ 98  │
│ test    ┆ B        ┆ 63  │
│ test    ┆ C        ┆ 89  │
│ train   ┆ A        ┆ 243 │
│ train   ┆ B        ┆ 280 │
│ train   ┆ C        ┆ 227 │
└─────────┴──────────┴─────┘

print(sampling.split_by_ratio(
    df = lf,
    split_ratio = 0.75,
    split_col = "sample",
    by = "category",
    seed = 101,
    return_df = True
).group_by(["sample", "category"]).len().sort(["sample", "category"]))
shape: (6, 3)
┌────────┬──────────┬─────┐
│ sample ┆ category ┆ len │
│ ---    ┆ ---      ┆ --- │
│ str    ┆ str      ┆ u32 │
╞════════╪══════════╪═════╡
│ test   ┆ A        ┆ 86  │
│ test   ┆ B        ┆ 86  │
│ test   ┆ C        ┆ 79  │
│ train  ┆ A        ┆ 255 │
│ train  ┆ B        ┆ 257 │
│ train  ┆ C        ┆ 237 │
└────────┴──────────┴─────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def split_by_ratio(
    df: PolarsFrame,
    split_ratio: float | List[float] | Dict[str, float],
    split_col: str = "__split",
    by: str | list[str] | None = None,
    default_split_1: str = "train",
    default_split_2: str = "test",
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    """
    split_by_ratio
    ===========
    Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.

    The function adds a new column (`split_col`) to the DataFrame/LazyFrame, assigning each row to a subset
    according to the provided `split_ratio`. The splitting can be stratified by one or more columns
    if the `by` parameter is specified.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    split_ratio : float | List[float] | Dict[str, float]
        - **Float**: The ratio for the first subset (default: "train"), with the remainder assigned
        to the second subset (default: "test").
        - **List of floats**: Each float represents the ratio for a subset, and the list must sum to 1.
        Subsets are named "split_0", "split_1", etc.
        - **Dictionary**: Keys are subset names, and values are their respective ratios. The values must
        sum to 1.

    split_col : str, optional, default="__split"
        Name of the column to store the split assignments.

    by : str | list[str], optional, default=None
        Column(s) to stratify by. If specified, the DataFrame is collected and split within each stratum.

    default_split_1 : str, optional, default="train"
        Name of the first subset when `split_ratio` is a float.

    default_split_2 : str, optional, default="test"
        Name of the second subset when `split_ratio` is a float.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Note(s)
    ----------
    - Avoid using floating-point values with too many decimal places, as this may cause the
    splits to be off by one row due to rounding errors.

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.split_by_ratio(
    >>>     df = lf,
    >>>     split_ratio = 0.75,
    >>>     seed = 101,
    >>>     return_df = True
    >>> ).group_by(["__split", "category"]).len().sort(["__split", "category"]))
    shape: (6, 3)
    ┌─────────┬──────────┬─────┐
    │ __split ┆ category ┆ len │
    │ ---     ┆ ---      ┆ --- │
    │ str     ┆ str      ┆ u32 │
    ╞═════════╪══════════╪═════╡
    │ test    ┆ A        ┆ 98  │
    │ test    ┆ B        ┆ 63  │
    │ test    ┆ C        ┆ 89  │
    │ train   ┆ A        ┆ 243 │
    │ train   ┆ B        ┆ 280 │
    │ train   ┆ C        ┆ 227 │
    └─────────┴──────────┴─────┘

    >>> print(sampling.split_by_ratio(
    >>>     df = lf,
    >>>     split_ratio = 0.75,
    >>>     split_col = "sample",
    >>>     by = "category",
    >>>     seed = 101,
    >>>     return_df = True
    >>> ).group_by(["sample", "category"]).len().sort(["sample", "category"]))
    shape: (6, 3)
    ┌────────┬──────────┬─────┐
    │ sample ┆ category ┆ len │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ u32 │
    ╞════════╪══════════╪═════╡
    │ test   ┆ A        ┆ 86  │
    │ test   ┆ B        ┆ 86  │
    │ test   ┆ C        ┆ 79  │
    │ train  ┆ A        ┆ 255 │
    │ train  ┆ B        ┆ 257 │
    │ train  ┆ C        ┆ 237 │
    └────────┴──────────┴─────┘
    """
    # Engine
    ## Stratified Sampling
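    if by is not None:
        results = []
        # Treat a single column name and a list of names uniformly
        by_cols = [by] if isinstance(by, str) else list(by)
        strata = df.select(by_cols).unique()
        if isinstance(strata, pl.LazyFrame):
            strata = strata.collect()
        for stratum in strata.iter_rows(named=True):
            # Keep the rows whose `by` columns all match this stratum
            subset = df.filter(
                pl.all_horizontal([pl.col(c) == v for c, v in stratum.items()])
            )
            results.append(
                split_by_ratio(
                    subset,
                    split_ratio=split_ratio,
                    seed=seed,
                    by=None,
                    split_col=split_col,
                    default_split_1=default_split_1,
                    default_split_2=default_split_2,
                )
            )
        # Concatenate once, after every stratum has been split
        split_sample = pl.concat(results, how="vertical")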

    ## Simple Sampling
    else:
        if isinstance(split_ratio, float):
            split_sample = (
                df.with_row_index(name="__id")
                .with_columns(
                    pl.when(
                        pl.col("__id").shuffle(seed=seed) < (pl.len() * split_ratio).cast(pl.Int64)
                    )
                    .then(pl.lit(default_split_1, dtype=pl.String))
                    .otherwise(pl.lit(default_split_2, dtype=pl.String))
                    .alias(split_col)
                )
                .select(pl.all().exclude("__id"))
            )

        else:
            if isinstance(split_ratio, dict):
                ratios: pl.Series = pl.Series(split_ratio.values())
                split_names = [str(k) for k in split_ratio.keys()]
            else:
                ratios: pl.Series = pl.Series(split_ratio)
                split_names = [f"split_{i}" for i in range(len(split_ratio))]

            pct = ratios.cum_sum()
            expr = pl.when(pl.lit(False)).then(None)
            for p, k in zip(pct, split_names):
                expr = expr.when(pl.col("__pct") < p).then(pl.lit(k, dtype=pl.String))

            split_sample = (
                df.with_row_index(name="__id")
                .with_columns(pl.col("__id").shuffle(seed=seed).alias("__tt"))
                .sort("__tt")
                .with_columns((pl.col("__tt") / pl.len()).alias("__pct"))
                .select(expr.alias(split_col), pl.all().exclude(["__id", "__pct", "__tt"]))
            )

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        split_sample = split_sample.collect()
    return split_sample

volume_neutral(df, by, control=None, target_volume=None, seed=None, return_df=False)

volume_neutral

Subsample a polars.DataFrame or polars.LazyFrame to achieve volume neutrality per group, optionally controlling for additional grouping variables.

This function reduces each group defined by by (and optionally control) to a target number of rows, ensuring that all groups have the same number of observations. The selection within groups is randomized, with an optional seed for reproducibility.
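The balancing idea can be shown with the standard library alone. The sketch below is an illustration under assumptions, not the library's implementation: `balance_groups` is a hypothetical helper, rows are modelled as `(group_key, payload)` tuples, and the optional `control` grouping is left out.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Tuple


def balance_groups(
    rows: List[Tuple[Hashable, object]],
    target_volume: Optional[int] = None,
    seed: int = 0,
) -> Dict[Hashable, List[object]]:
    """Randomly reduce every group to the same number of rows."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[object]] = defaultdict(list)
    for key, payload in rows:
        groups[key].append(payload)
    # The smallest group sets the ceiling; target_volume can only lower it.
    smallest = min(len(members) for members in groups.values())
    target = smallest if target_volume is None else min(smallest, target_volume)
    return {key: rng.sample(members, target) for key, members in groups.items()}


rows = [("A", i) for i in range(5)] + [("B", i) for i in range(3)] + [("C", i) for i in range(8)]

sizes = {k: len(v) for k, v in balance_groups(rows, seed=101).items()}
print(sizes)   # {'A': 3, 'B': 3, 'C': 3}

capped = {k: len(v) for k, v in balance_groups(rows, target_volume=2, seed=101).items()}
print(capped)  # {'A': 2, 'B': 2, 'C': 2}
```

Without `target_volume`, the smallest group (`B`, three rows) determines the size of every group; with `target_volume=2`, all groups shrink to two rows, as in the library example where each category keeps two rows.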

Parameters:

df : PolarsFrame, required
    Either a polars.DataFrame or a polars.LazyFrame.

by : Expr, required
    Expression defining the primary discrete grouping variable for volume balancing.

control : pl.Expr or list of pl.Expr, default None
    Additional expressions to control grouping. Subsampling is done within each combination of control and by.

target_volume : int, default None
    Maximum number of rows to retain per group. If None, the size of the smallest group is used.

seed : int, default None
    The seed value for the random number generator. The same seed will produce the same output each time.

return_df : bool, default False
    If True, a polars.LazyFrame input is collected and a polars.DataFrame is returned. If False, the output has the same type as df.

Returns:

PolarsFrame
    A polars.DataFrame if return_df is True; otherwise a polars.DataFrame or a polars.LazyFrame, matching the type of df.

Example

import polars as pl
import polars_ds.sample_and_split as sampling
import numpy as np

np.random.seed(42)
lf = pl.LazyFrame(
    data = {
        "id": range(1, 1001)
        ,"value": np.random.rand(1000) * 100
        ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    }
)
print(sampling.volume_neutral(lf, pl.col("category"), None, 2, 101, True))
shape: (6, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 817 ┆ 59.127544 ┆ A        │
│ 825 ┆ 53.73956  ┆ B        │
│ 874 ┆ 40.873417 ┆ C        │
│ 909 ┆ 25.942343 ┆ A        │
│ 923 ┆ 89.455223 ┆ B        │
│ 990 ┆ 81.910232 ┆ C        │
└─────┴───────────┴──────────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def volume_neutral(
    df: PolarsFrame,
    by: pl.Expr,
    control: pl.Expr | List[pl.Expr] | None = None,
    target_volume: int | None = None,
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    r"""
    volume_neutral
    ===========
    Subsample a polars.DataFrame or polars.LazyFrame to achieve volume neutrality per group,
    optionally controlling for additional grouping variables.

    This function reduces each group defined by `by` (and optionally `control`) to a
    target number of rows, ensuring that all groups have the same number of observations.
    The selection within groups is randomized, with an optional seed for reproducibility.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    by : pl.Expr
        Expression defining the primary grouping discrete variable for volume balancing.

    control : pl.Expr or list of pl.Expr, optional, default=None
        Additional expressions to control grouping. Subsampling is done within each
        combination of `control` and `by`.

    target_volume : int, optional, default=None
        Maximum number of rows to retain per group. If None, the size of the smallest
        group is used.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    ----------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.volume_neutral(lf, pl.col("category"), None, 2, 101, True))
    shape: (6, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 817 ┆ 59.127544 ┆ A        │
    │ 825 ┆ 53.73956  ┆ B        │
    │ 874 ┆ 40.873417 ┆ C        │
    │ 909 ┆ 25.942343 ┆ A        │
    │ 923 ┆ 89.455223 ┆ B        │
    │ 990 ┆ 81.910232 ┆ C        │
    └─────┴───────────┴──────────┘
    """
    # Engine
    if target_volume is not None:
        target = pl.min_horizontal(by.value_counts().struct.field("count").min(), target_volume)
    else:
        target = by.value_counts().struct.field("count").min()

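    # Normalize `control` to a flat list; wrapping a list again would nest it
    if isinstance(control, pl.Expr):
        ctrl = [control]
    elif isinstance(control, list):
        ctrl = list(control)
    else:
        ctrl = []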

    if len(ctrl) > 0:
        target = target.over(ctrl)
        final_ref = ctrl + [by]
    else:
        final_ref = by

    volume_neutral = df.filter(pl.int_range(0, pl.len()).shuffle(seed).over(final_ref) < target)

    # Output
    if isinstance(df, pl.LazyFrame) and return_df:
        volume_neutral = volume_neutral.collect()
    return volume_neutral

sample_and_split

Functions:

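Name Description
downsample

Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.

random_cols

Randomly select columns from the provided list of column names.

sample

Extracts a random sample from a Polars DataFrame or LazyFrame.

split_by_ratio

Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.

volume_neutral

Subsample a polars.DataFrame or polars.LazyFrame to achieve volume neutrality per group.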

downsample(df, conditions, seed=None, return_df=False)

downsample

Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.

This function applies downsampling to rows where each boolean condition is true. For each condition, you can specify either a fixed number of rows to keep (int) or a fraction of rows to keep (float). The downsampling is performed using a random sampling strategy, which can be made reproducible using a seed.

Parameters:

Name Type Description Default
df PolarsFrame

It may be either a polars.DataFrame or a polars.LazyFrame.

required
conditions List[Tuple[Expr, float | int]] | Tuple[Expr, float | int]

One or more tuples, each containing: - A boolean Polars expression (polars.Expr) defining the subset of rows to downsample. - A float (fraction of rows to keep, e.g., 0.5 for 50%) or an integer (fixed number of rows to keep).

required
seed int

The seed value for the random number generator. The same seed will produce the same output each time.

None
return_df bool

Determines whether the output should always be a polars.DataFrame or not.

False

Returns:

Type Description
PolarsFrame

Returns either a polars.DataFrame or a polars.LazyFrame depending on the df provided.

Example

>>> import polars as pl
>>> import polars_ds.sample_and_split as sampling
>>> import numpy as np
>>> np.random.seed(42)
>>> lf = pl.LazyFrame(
>>>     data = {
>>>         "id": range(1, 1001)
>>>         ,"value": np.random.rand(1000) * 100
>>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
>>>     }
>>> )
>>> print(lf.group_by("category").len().sort("category").collect())
shape: (3, 2)
┌──────────┬─────┐
│ category ┆ len │
│ ---      ┆ --- │
│ str      ┆ u32 │
╞══════════╪═════╡
│ A        ┆ 341 │
│ B        ┆ 343 │
│ C        ┆ 316 │
└──────────┴─────┘
>>> print(sampling.downsample(
>>>     lf,
>>>     [
>>>         (pl.col("category") == "A", 0.25),
>>>         (pl.col("category") == "B", 10)
>>>     ],
>>>     return_df = True
>>> ).group_by("category").len().sort("category"))
shape: (3, 2)
┌──────────┬─────┐
│ category ┆ len │
│ ---      ┆ --- │
│ str      ┆ u32 │
╞══════════╪═════╡
│ A        ┆ 85  │
│ B        ┆ 10  │
│ C        ┆ 316 │
└──────────┴─────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def downsample(
    df: PolarsFrame,
    conditions: List[Tuple[pl.Expr, float | int]] | Tuple[pl.Expr, float | int],
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    """
    downsample
    ===========
    Downsamples subsets of a Polars DataFrame or LazyFrame based on specified conditions.

    This function applies downsampling to rows where each boolean condition is true.
    For each condition, you can specify either a fixed number of rows to keep (int)
    or a fraction of rows to keep (float). The downsampling is performed using a
    random sampling strategy, which can be made reproducible using a seed.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    conditions : List[Tuple[pl.Expr, float | int]] | Tuple[pl.Expr, float | int]
        One or more tuples, each containing:
        - A boolean Polars expression (`polars.Expr`) defining the subset of rows to downsample.
        - A float (fraction of rows to keep, e.g., 0.5 for 50%) or an integer (fixed number of rows to keep).

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(lf.group_by("category").len().sort("category").collect())
    shape: (3, 2)
    ┌──────────┬─────┐
    │ category ┆ len │
    │ ---      ┆ --- │
    │ str      ┆ u32 │
    ╞══════════╪═════╡
    │ A        ┆ 341 │
    │ B        ┆ 343 │
    │ C        ┆ 316 │
    └──────────┴─────┘
    >>> print(sampling.downsample(
    >>>     lf,
    >>>     [
    >>>         (pl.col("category") == "A", 0.25),
    >>>         (pl.col("category") == "B", 10)
    >>>     ],
    >>>     return_df = True
    >>> ).group_by("category").len().sort("category"))
    shape: (3, 2)
    ┌──────────┬─────┐
    │ category ┆ len │
    │ ---      ┆ --- │
    │ str      ┆ u32 │
    ╞══════════╪═════╡
    │ A        ┆ 85  │
    │ B        ┆ 10  │
    │ C        ┆ 316 │
    └──────────┴─────┘
    """
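    # Engine
    ## Accept a single (expr, value) tuple as well as a list of tuples
    if isinstance(conditions, tuple) and isinstance(conditions[0], pl.Expr):
        conditions = [conditions]

    ## Row count of the full frame, computed once; caps integer targets
    df_size = (
        df.select(pl.len())[0, 0]
        if isinstance(df, pl.DataFrame)
        else df.select(pl.len()).collect()[0, 0]
    )

    ## Create samples for each pl.Expr
    results = []
    for expr, value in conditions:
        n = min(value, df_size) if isinstance(value, int) else None
        fraction = value if isinstance(value, float) else None
        sample = df.filter(expr).select(
            pl.all().sample(n=n, fraction=fraction, with_replacement=False, shuffle=True, seed=seed)
        )
        results.append(sample)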

    ## Add sample where no pl.Expr is met
    exprs = [expr for expr, _ in conditions]
    combined_expr = exprs[0]
    if len(exprs) > 1:
        for expr in exprs[1:]:
            combined_expr = combined_expr | expr
    sample = df.filter(~combined_expr)
    results.append(sample)

    ## Merge samples
    downsample = pl.concat(results, how="vertical")

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        downsample = downsample.collect()
    return downsample
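
The logic of the source above can be sketched in plain Python, without Polars, to make the three steps visible: sample each matching subset, pass the unmatched rows through, and concatenate. The helper `downsample_rows`, the list-of-dicts row format, and the lambda predicates are hypothetical names used only for illustration. Unlike the source, this sketch caps an integer target at the size of the matching subset rather than at the size of the whole frame.

```python
import random


def downsample_rows(rows, conditions, seed=None):
    """Plain-Python sketch: sample each matching subset, keep unmatched rows whole."""
    rng = random.Random(seed)
    kept = []
    for predicate, value in conditions:
        subset = [row for row in rows if predicate(row)]
        if isinstance(value, int):
            n = min(value, len(subset))   # fixed number of rows, capped at the subset size
        else:
            n = int(len(subset) * value)  # fraction of the subset, truncated to an integer
        kept.extend(rng.sample(subset, n))
    # Rows that match none of the conditions pass through untouched
    kept.extend(row for row in rows if not any(p(row) for p, _ in conditions))
    return kept


# 30 rows, 10 per category
rows = [{"id": i, "category": "ABC"[i % 3]} for i in range(30)]
result = downsample_rows(
    rows,
    [(lambda r: r["category"] == "A", 0.5), (lambda r: r["category"] == "B", 3)],
    seed=101,
)
counts = {c: sum(r["category"] == c for r in result) for c in "ABC"}
print(counts)  # {'A': 5, 'B': 3, 'C': 10}
```

Category C is untouched because no condition refers to it, which mirrors the `~combined_expr` filter in the source.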

random_cols(all_columns, k, keep=None, seed=None)

random_cols

Randomly select columns from the provided list of column names.

Parameters:

Name Type Description Default
all_columns List[str]

List of column names from which to draw randomly.

required
k int

Number of columns to select randomly outside of the list provided in keep.

required
keep List[str]

Column names to always include in the output, in addition to the k randomly drawn columns.

None
seed int

The seed value for the random number generator. The same seed will produce the same output each time.

None

Returns:

Type Description
List[str]

Returns the columns in keep (if any) followed by the names of the randomly drawn columns.

Note(s)
  • Columns are drawn as an unordered combination and returned in the order they appear in all_columns, so ["x", "y"] and ["y", "x"] cannot both be outcomes.
Example

>>> import polars as pl
>>> import polars_ds.sample_and_split as sampling
>>> print(sampling.random_cols(["a", "b", "c", "d", "e", "f"], 2, seed=101))
['c', 'd']

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def random_cols(
    all_columns: List[str],
    k: int,
    keep: List[str] | None = None,
    seed: int | None = None,
) -> List[str]:
    """
    random_cols
    ===========
    Randomly select columns from the provided list of column names.

    Parameters
    ----------
    all_columns : List[str]
        List of column names from which to draw randomly.

    k : int
        Number of columns to select randomly outside of the list provided in `keep`.

    keep : List[str], optional, default=None
        Column names to always include in the output, in addition to the `k` randomly drawn columns.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    Returns
    ----------
    List[str]
        Returns the columns in `keep` (if any) followed by the names of the randomly drawn columns.

    Note(s)
    ----------
    - Columns are drawn as an unordered combination and returned in the order they appear in
    `all_columns`, so ["x", "y"] and ["y", "x"] cannot both be outcomes.

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> print(sampling.random_cols(["a", "b", "c", "d", "e", "f"], 2, seed=101))
    ['c', 'd']
    """
    # Engine
    if seed is not None:
        random.seed(seed)

    # Columns in `keep` are always returned and are excluded from the random draw
    out = [] if keep is None else list(keep)
    pool = [c for c in all_columns if c not in out]

    if len(pool) < k:
        raise ValueError("Not enough columns to select from.")

    # Pick the n-th k-combination of the pool, with n drawn uniformly
    n = random.randrange(0, math.comb(len(pool), k))
    rand_cols = next(islice(combinations(pool, k), n, None))
    random_cols = out + list(rand_cols)

    # Output(s)
    return random_cols
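
The draw in the source works by choosing a uniform index into the stream of k-combinations and advancing to it with `islice`. A minimal standalone sketch of that idea follows. The helper name `pick_combination` is hypothetical, and it uses a local `random.Random` instance instead of seeding the global generator, which is a design variation and not what the source does.

```python
import math
import random
from itertools import combinations, islice


def pick_combination(columns, k, seed=None):
    """Draw one k-subset uniformly by indexing into the stream of combinations."""
    rng = random.Random(seed)  # local generator: the global random state is left untouched
    total = math.comb(len(columns), k)  # 0 when k > len(columns)
    if total == 0:
        raise ValueError("Not enough columns to select from.")
    n = rng.randrange(total)
    # combinations() yields tuples in input order; skip the first n and take the next one
    return list(next(islice(combinations(columns, k), n, None)))


cols = ["a", "b", "c", "d", "e", "f"]
picked = pick_combination(cols, 2, seed=101)
print(picked)
```

Because `combinations` emits elements in input order, the picked columns always appear in the same relative order as in `cols`. The cost of reaching the n-th combination grows with the number of combinations, so for a large pool `random.sample` followed by a sort on the original position may be a cheaper alternative.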

sample(df, value, replace=False, seed=None, return_df=False)

sample

Extracts a random sample from a Polars DataFrame or LazyFrame.

Parameters:

Name Type Description Default
df PolarsFrame

It may be either a polars.DataFrame or a polars.LazyFrame.

required
value int or float

If an integer is provided, value rows are selected from df (capped at the number of rows in df). If a float is provided, that fraction of the rows of df is selected.

required
replace bool

Whether to sample with replacement or not.

False
seed int

The seed value for the random number generator. The same seed will produce the same output each time.

None
return_df bool

Determines whether the output should always be a polars.DataFrame or not.

False

Returns:

Type Description
PolarsFrame

Returns either a polars.DataFrame or a polars.LazyFrame depending on the df provided.

Example

>>> import polars as pl
>>> import polars_ds.sample_and_split as sampling
>>> import numpy as np
>>> np.random.seed(42)
>>> lf = pl.LazyFrame(
>>>     data = {
>>>         "id": range(1, 1001)
>>>         ,"value": np.random.rand(1000) * 100
>>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
>>>     }
>>> )
>>> print(sampling.sample(lf, 100, seed=101, return_df=True))
shape: (100, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 391 ┆ 99.050514 ┆ C        │
│ 555 ┆ 87.66536  ┆ B        │
│ 778 ┆ 72.225257 ┆ A        │
│ 888 ┆ 23.818278 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 233 ┆ 57.690388 ┆ A        │
│ 196 ┆ 34.920957 ┆ A        │
│ 850 ┆ 59.538502 ┆ C        │
│ 235 ┆ 19.524299 ┆ A        │
│ 404 ┆ 82.645747 ┆ B        │
└─────┴───────────┴──────────┘

>>> print(sampling.sample(lf, 0.5, seed=101, return_df=True))
shape: (500, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 391 ┆ 99.050514 ┆ C        │
│ 555 ┆ 87.66536  ┆ B        │
│ 778 ┆ 72.225257 ┆ A        │
│ 888 ┆ 23.818278 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 320 ┆ 25.02429  ┆ B        │
│ 812 ┆ 83.889809 ┆ C        │
│ 982 ┆ 77.09122  ┆ A        │
│ 412 ┆ 95.006197 ┆ B        │
│ 416 ┆ 44.844552 ┆ C        │
└─────┴───────────┴──────────┘

>>> print(sampling.sample(lf, 0.1, True, 101, True))
shape: (100, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 718 ┆ 96.502691 ┆ C        │
│ 390 ┆ 80.683474 ┆ A        │
│ 554 ┆ 56.093797 ┆ B        │
│ 777 ┆ 22.92514  ┆ C        │
│ 887 ┆ 65.274611 ┆ C        │
│ …   ┆ …         ┆ …        │
│ 152 ┆ 23.956189 ┆ A        │
│ 110 ┆ 7.697991  ┆ B        │
│ 834 ┆ 17.638699 ┆ C        │
│ 152 ┆ 23.956189 ┆ A        │
│ 339 ┆ 47.417383 ┆ C        │
└─────┴───────────┴──────────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def sample(
    df: PolarsFrame,
    value: float | int,
    replace: bool = False,
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    r"""
    sample
    ===========
    Extracts a random sample from a Polars DataFrame or LazyFrame.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    value : int or float
        If an integer is provided, `value` rows are selected from `df` (capped at the number of rows in `df`).
        If a float is provided, that fraction of the rows of `df` is selected.

    replace : bool, optional, default=False
        Whether to sample with replacement or not.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, optional, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    ----------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.sample(lf, 100, seed=101, return_df=True))
    shape: (100, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 391 ┆ 99.050514 ┆ C        │
    │ 555 ┆ 87.66536  ┆ B        │
    │ 778 ┆ 72.225257 ┆ A        │
    │ 888 ┆ 23.818278 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 233 ┆ 57.690388 ┆ A        │
    │ 196 ┆ 34.920957 ┆ A        │
    │ 850 ┆ 59.538502 ┆ C        │
    │ 235 ┆ 19.524299 ┆ A        │
    │ 404 ┆ 82.645747 ┆ B        │
    └─────┴───────────┴──────────┘

    >>> print(sampling.sample(lf, 0.5, seed=101, return_df=True))
    shape: (500, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 391 ┆ 99.050514 ┆ C        │
    │ 555 ┆ 87.66536  ┆ B        │
    │ 778 ┆ 72.225257 ┆ A        │
    │ 888 ┆ 23.818278 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 320 ┆ 25.02429  ┆ B        │
    │ 812 ┆ 83.889809 ┆ C        │
    │ 982 ┆ 77.09122  ┆ A        │
    │ 412 ┆ 95.006197 ┆ B        │
    │ 416 ┆ 44.844552 ┆ C        │
    └─────┴───────────┴──────────┘

    >>> print(sampling.sample(lf, 0.1, True, 101, True))
    shape: (100, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 718 ┆ 96.502691 ┆ C        │
    │ 390 ┆ 80.683474 ┆ A        │
    │ 554 ┆ 56.093797 ┆ B        │
    │ 777 ┆ 22.92514  ┆ C        │
    │ 887 ┆ 65.274611 ┆ C        │
    │ …   ┆ …         ┆ …        │
    │ 152 ┆ 23.956189 ┆ A        │
    │ 110 ┆ 7.697991  ┆ B        │
    │ 834 ┆ 17.638699 ┆ C        │
    │ 152 ┆ 23.956189 ┆ A        │
    │ 339 ┆ 47.417383 ┆ C        │
    └─────┴───────────┴──────────┘
    """
    # Engine
    df_size = (
        df.select(pl.len())[0, 0]
        if isinstance(df, pl.DataFrame)
        else df.select(pl.len()).collect()[0, 0]
    )
    n = min(value, df_size) if isinstance(value, int) else None
    fraction = value if isinstance(value, float) else None
    sample = df.select(
        pl.all().sample(n=n, fraction=fraction, with_replacement=replace, shuffle=True, seed=seed)
    )

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        sample = sample.collect()
    return sample
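
The int/float dispatch on `value` is easy to trip over, because `1` and `1.0` mean different things. The sketch below isolates the two lines of the source that compute `n` and `fraction`; the helper name `resolve_sample_args` is hypothetical and exists only for this illustration.

```python
def resolve_sample_args(value, df_size):
    """Map `value` to the (n, fraction) pair, following the int/float rule of the source."""
    n = min(value, df_size) if isinstance(value, int) else None
    fraction = value if isinstance(value, float) else None
    return n, fraction


print(resolve_sample_args(100, 1000))   # (100, None): integer means a row count
print(resolve_sample_args(5000, 1000))  # (1000, None): row count capped at the frame size
print(resolve_sample_args(0.5, 1000))   # (None, 0.5): float means a fraction
print(resolve_sample_args(1, 1000))     # (1, None): a single row
print(resolve_sample_args(1.0, 1000))   # (None, 1.0): every row
```

Note that the cap at `df_size` also applies when sampling with replacement, so an integer `value` never yields more rows than `df` contains.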

split_by_ratio(df, split_ratio, split_col='__split', by=None, default_split_1='train', default_split_2='test', seed=None, return_df=False)

split_by_ratio

Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.

The function adds a new column (split_col) to the DataFrame/LazyFrame, assigning each row to a subset according to the provided split_ratio. The splitting can be stratified by one or more columns if the by parameter is specified.

Parameters:

Name Type Description Default
df PolarsFrame

It may be either a polars.DataFrame or a polars.LazyFrame.

required
split_ratio float | List[float] | Dict[str, float]
  • Float: The ratio for the first subset (default: "train"), with the remainder assigned to the second subset (default: "test").
  • List of floats: Each float represents the ratio for a subset, and the list must sum to 1. Subsets are named "split_0", "split_1", etc.
  • Dictionary: Keys are subset names, and values are their respective ratios. The values must sum to 1.
required
split_col str

Name of the column to store the split assignments.

"__split"
by str | list[str]

Column(s) to stratify by. If specified, the unique values of by are collected and the split is applied within each stratum.

None
default_split_1 str

Name of the first subset when split_ratio is a float.

"train"
default_split_2 str

Name of the second subset when split_ratio is a float.

"test"
seed int

The seed value for the random number generator. The same seed will produce the same output each time.

None
return_df bool

Determines whether the output should always be a polars.DataFrame or not.

False

Returns:

Type Description
PolarsFrame

Returns either a polars.DataFrame or a polars.LazyFrame depending on the df provided.

Note(s)
  • Avoid using floating-point values with too many decimal places, as this may cause the splits to be off by one row due to rounding errors.
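
A small arithmetic example may help explain the off-by-one effect mentioned in the note. When `split_ratio` is a float, the source sizes the first subset as the row count times the ratio, cast to an integer. The sketch below assumes that cast truncates the way Python's `int()` does; the numbers are illustrative only.

```python
n_rows = 100
ratio = 0.29  # not exactly representable as a binary float

product = n_rows * ratio
print(product)         # slightly below 29
print(int(product))    # 28: truncation drops one row from the first subset
print(round(product))  # 29: the size one would expect
```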
Example

>>> import polars as pl
>>> import polars_ds.sample_and_split as sampling
>>> import numpy as np
>>> np.random.seed(42)
>>> lf = pl.LazyFrame(
>>>     data = {
>>>         "id": range(1, 1001)
>>>         ,"value": np.random.rand(1000) * 100
>>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
>>>     }
>>> )
>>> print(sampling.split_by_ratio(
>>>     df = lf,
>>>     split_ratio = 0.75,
>>>     seed = 101,
>>>     return_df = True
>>> ).group_by(["__split", "category"]).len().sort(["__split", "category"]))
shape: (6, 3)
┌─────────┬──────────┬─────┐
│ __split ┆ category ┆ len │
│ ---     ┆ ---      ┆ --- │
│ str     ┆ str      ┆ u32 │
╞═════════╪══════════╪═════╡
│ test    ┆ A        ┆ 98  │
│ test    ┆ B        ┆ 63  │
│ test    ┆ C        ┆ 89  │
│ train   ┆ A        ┆ 243 │
│ train   ┆ B        ┆ 280 │
│ train   ┆ C        ┆ 227 │
└─────────┴──────────┴─────┘

>>> print(sampling.split_by_ratio(
>>>     df = lf,
>>>     split_ratio = 0.75,
>>>     split_col = "sample",
>>>     by = "category",
>>>     seed = 101,
>>>     return_df = True
>>> ).group_by(["sample", "category"]).len().sort(["sample", "category"]))
shape: (6, 3)
┌────────┬──────────┬─────┐
│ sample ┆ category ┆ len │
│ ---    ┆ ---      ┆ --- │
│ str    ┆ str      ┆ u32 │
╞════════╪══════════╪═════╡
│ test   ┆ A        ┆ 86  │
│ test   ┆ B        ┆ 86  │
│ test   ┆ C        ┆ 79  │
│ train  ┆ A        ┆ 255 │
│ train  ┆ B        ┆ 257 │
│ train  ┆ C        ┆ 237 │
└────────┴──────────┴─────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def split_by_ratio(
    df: PolarsFrame,
    split_ratio: float | List[float] | Dict[str, float],
    split_col: str = "__split",
    by: str | list[str] | None = None,
    default_split_1: str = "train",
    default_split_2: str = "test",
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    """
    split_by_ratio
    ===========
    Randomly splits a Polars DataFrame or LazyFrame into subsets based on specified ratios.

    The function adds a new column (`split_col`) to the DataFrame/LazyFrame, assigning each row to a subset
    according to the provided `split_ratio`. The splitting can be stratified by one or more columns
    if the `by` parameter is specified.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    split_ratio : float | List[float] | Dict[str, float]
        - **Float**: The ratio for the first subset (default: "train"), with the remainder assigned
        to the second subset (default: "test").
        - **List of floats**: Each float represents the ratio for a subset, and the list must sum to 1.
        Subsets are named "split_0", "split_1", etc.
        - **Dictionary**: Keys are subset names, and values are their respective ratios. The values must
        sum to 1.

    split_col : str, optional, default="__split"
        Name of the column to store the split assignments.

    by : str | list[str], optional, default=None
        Column(s) to stratify by. If specified, the unique values of `by` are collected and the split
        is applied within each stratum.

    default_split_1 : str, optional, default="train"
        Name of the first subset when `split_ratio` is a float.

    default_split_2 : str, optional, default="test"
        Name of the second subset when `split_ratio` is a float.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Note(s)
    ----------
    - Avoid using floating-point values with too many decimal places, as this may cause the
    splits to be off by one row due to rounding errors.

    Example
    -------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.split_by_ratio(
    >>>     df = lf,
    >>>     split_ratio = 0.75,
    >>>     seed = 101,
    >>>     return_df = True
    >>> ).group_by(["__split", "category"]).len().sort(["__split", "category"]))
    shape: (6, 3)
    ┌─────────┬──────────┬─────┐
    │ __split ┆ category ┆ len │
    │ ---     ┆ ---      ┆ --- │
    │ str     ┆ str      ┆ u32 │
    ╞═════════╪══════════╪═════╡
    │ test    ┆ A        ┆ 98  │
    │ test    ┆ B        ┆ 63  │
    │ test    ┆ C        ┆ 89  │
    │ train   ┆ A        ┆ 243 │
    │ train   ┆ B        ┆ 280 │
    │ train   ┆ C        ┆ 227 │
    └─────────┴──────────┴─────┘

    >>> print(sampling.split_by_ratio(
    >>>     df = lf,
    >>>     split_ratio = 0.75,
    >>>     split_col = "sample",
    >>>     by = "category",
    >>>     seed = 101,
    >>>     return_df = True
    >>> ).group_by(["sample", "category"]).len().sort(["sample", "category"]))
    shape: (6, 3)
    ┌────────┬──────────┬─────┐
    │ sample ┆ category ┆ len │
    │ ---    ┆ ---      ┆ --- │
    │ str    ┆ str      ┆ u32 │
    ╞════════╪══════════╪═════╡
    │ test   ┆ A        ┆ 86  │
    │ test   ┆ B        ┆ 86  │
    │ test   ┆ C        ┆ 79  │
    │ train  ┆ A        ┆ 255 │
    │ train  ┆ B        ┆ 257 │
    │ train  ┆ C        ┆ 237 │
    └────────┴──────────┴─────┘
    """
    # Engine
    ## Stratified Sampling
    if by is not None:
        results = []
        cats = (
            df.select(pl.col(by).unique())
            if isinstance(df, pl.DataFrame)
            else df.select(pl.col(by).unique()).collect()
        )
        for cat in cats.to_series().to_list():
            subset = df.filter(pl.col(by) == cat)
            results.append(
                split_by_ratio(
                    subset,
                    split_ratio=split_ratio,
                    seed=seed,
                    by=None,
                    split_col=split_col,
                    default_split_1=default_split_1,
                    default_split_2=default_split_2,
                )
            )
        split_sample = pl.concat(results, how="vertical")

    ## Simple Sampling
    else:
        if isinstance(split_ratio, float):
            split_sample = (
                df.with_row_index(name="__id")
                .with_columns(
                    pl.when(
                        pl.col("__id").shuffle(seed=seed) < (pl.len() * split_ratio).cast(pl.Int64)
                    )
                    .then(pl.lit(default_split_1, dtype=pl.String))
                    .otherwise(pl.lit(default_split_2, dtype=pl.String))
                    .alias(split_col)
                )
                .select(pl.all().exclude("__id"))
            )

        else:
            if isinstance(split_ratio, dict):
                ratios: pl.Series = pl.Series(split_ratio.values())
                split_names = [str(k) for k in split_ratio.keys()]
            else:
                ratios: pl.Series = pl.Series(split_ratio)
                split_names = [f"split_{i}" for i in range(len(split_ratio))]

            pct = ratios.cum_sum()
            expr = pl.when(pl.lit(False)).then(None)
            for p, k in zip(pct, split_names):
                expr = expr.when(pl.col("__pct") < p).then(pl.lit(k, dtype=pl.String))

            split_sample = (
                df.with_row_index(name="__id")
                .with_columns(pl.col("__id").shuffle(seed=seed).alias("__tt"))
                .sort("__tt")
                .with_columns((pl.col("__tt") / pl.len()).alias("__pct"))
                .select(expr.alias(split_col), pl.all().exclude(["__id", "__pct", "__tt"]))
            )

    # Output(s)
    if isinstance(df, pl.LazyFrame) and return_df:
        split_sample = split_sample.collect()
    return split_sample
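
For list and dictionary ratios, the source assigns each row a shuffled rank, converts it to a percentile, and labels the row with the first split whose cumulative ratio exceeds that percentile. The plain-Python sketch below reproduces only that labelling rule. The helper `assign_splits` is hypothetical, and unlike the source it keeps the rows in their original order instead of sorting by the shuffled rank.

```python
import random
from collections import Counter
from itertools import accumulate


def assign_splits(n_rows, ratios, seed=None):
    """Label each row with the first split whose cumulative ratio exceeds rank / n_rows."""
    names = list(ratios)
    bounds = list(accumulate(ratios.values()))  # cumulative thresholds, e.g. 0.5, 0.75, 1.0
    ranks = list(range(n_rows))
    random.Random(seed).shuffle(ranks)  # one random rank per row
    labels = []
    for rank in ranks:
        pct = rank / n_rows
        # None mirrors the null produced when no threshold matches
        labels.append(next((name for name, b in zip(names, bounds) if pct < b), None))
    return labels


labels = assign_splits(8, {"train": 0.5, "valid": 0.25, "test": 0.25}, seed=101)
print(Counter(labels))  # 4 train, 2 valid, 2 test (key order depends on the shuffle)
```

The ratios in this example are exact binary fractions, so the thresholds carry no rounding error. With ratios such as 0.7, 0.2 and 0.1 the cumulative sums drift slightly below their nominal values, which is the source of the caveat in the note above.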

volume_neutral(df, by, control=None, target_volume=None, seed=None, return_df=False)

volume_neutral

Subsample a polars.DataFrame or polars.LazyFrame to achieve volume neutrality per group, optionally controlling for additional grouping variables.

This function reduces each group defined by by (and optionally control) to a target number of rows, ensuring that all groups have the same number of observations. The selection within groups is randomized, with an optional seed for reproducibility.

Parameters:

Name Type Description Default
df PolarsFrame

It may be either a polars.DataFrame or a polars.LazyFrame.

required
by Expr

Expression defining the primary discrete grouping variable for volume balancing.

required
control pl.Expr or list of pl.Expr

Additional expressions to control grouping. Subsampling is done within each combination of control and by.

None
target_volume int

Maximum number of rows to retain per group. If None, the size of the smallest group is used.

None
seed int

The seed value for the random number generator. The same seed will produce the same output each time.

None
return_df bool

Determines whether the output should always be a polars.DataFrame or not.

False

Returns:

Type Description
PolarsFrame

Returns either a polars.DataFrame or a polars.LazyFrame depending on the df provided.

Example

>>> import polars as pl
>>> import polars_ds.sample_and_split as sampling
>>> import numpy as np
>>> np.random.seed(42)
>>> lf = pl.LazyFrame(
>>>     data = {
>>>         "id": range(1, 1001)
>>>         ,"value": np.random.rand(1000) * 100
>>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
>>>     }
>>> )
>>> print(sampling.volume_neutral(lf, pl.col("category"), None, 2, 101, True))
shape: (6, 3)
┌─────┬───────────┬──────────┐
│ id  ┆ value     ┆ category │
│ --- ┆ ---       ┆ ---      │
│ i64 ┆ f64       ┆ str      │
╞═════╪═══════════╪══════════╡
│ 817 ┆ 59.127544 ┆ A        │
│ 825 ┆ 53.73956  ┆ B        │
│ 874 ┆ 40.873417 ┆ C        │
│ 909 ┆ 25.942343 ┆ A        │
│ 923 ┆ 89.455223 ┆ B        │
│ 990 ┆ 81.910232 ┆ C        │
└─────┴───────────┴──────────┘

Source code in python/polars_ds/sample_and_split/sample_and_split.py
def volume_neutral(
    df: PolarsFrame,
    by: pl.Expr,
    control: pl.Expr | List[pl.Expr] | None = None,
    target_volume: int | None = None,
    seed: int | None = None,
    return_df: bool = False,
) -> PolarsFrame:
    r"""
    volume_neutral
    ===========
    Subsample a polars.DataFrame or polars.LazyFrame to achieve volume neutrality per group,
    optionally controlling for additional grouping variables.

    This function reduces each group defined by `by` (and optionally `control`) to a
    target number of rows, ensuring that all groups have the same number of observations.
    The selection within groups is randomized, with an optional seed for reproducibility.

    Parameters
    ----------
    df : PolarsFrame
        It may be either a polars.DataFrame or a polars.LazyFrame.

    by : pl.Expr
        Expression defining the primary discrete grouping variable for volume balancing.

    control : pl.Expr or list of pl.Expr, optional, default=None
        Additional expressions to control grouping. Subsampling is done within each
        combination of `control` and `by`.

    target_volume : int, optional, default=None
        Maximum number of rows to retain per group. If None, the size of the smallest
        group is used.

    seed : int, optional, default=None
        The seed value for the random number generator. The same seed will produce the same output each time.

    return_df : bool, default=False
        Determines whether the output should always be a polars.DataFrame or not.

    Returns
    ----------
    PolarsFrame
        Returns either a polars.DataFrame or a polars.LazyFrame depending on the `df` provided.

    Example
    ----------
    >>> import polars as pl
    >>> import polars_ds.sample_and_split as sampling
    >>> import numpy as np
    >>> np.random.seed(42)
    >>> lf = pl.LazyFrame(
    >>>     data = {
    >>>         "id": range(1, 1001)
    >>>         ,"value": np.random.rand(1000) * 100
    >>>         ,"category": np.random.choice(["A", "B", "C"], size = 1000)
    >>>     }
    >>> )
    >>> print(sampling.volume_neutral(lf, pl.col("category"), None, 2, 101, True))
    shape: (6, 3)
    ┌─────┬───────────┬──────────┐
    │ id  ┆ value     ┆ category │
    │ --- ┆ ---       ┆ ---      │
    │ i64 ┆ f64       ┆ str      │
    ╞═════╪═══════════╪══════════╡
    │ 817 ┆ 59.127544 ┆ A        │
    │ 825 ┆ 53.73956  ┆ B        │
    │ 874 ┆ 40.873417 ┆ C        │
    │ 909 ┆ 25.942343 ┆ A        │
    │ 923 ┆ 89.455223 ┆ B        │
    │ 990 ┆ 81.910232 ┆ C        │
    └─────┴───────────┴──────────┘
    """
    # Engine
    if target_volume is not None:
        target = pl.min_horizontal(by.value_counts().struct.field("count").min(), target_volume)
    else:
        target = by.value_counts().struct.field("count").min()

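    # Normalize `control` to a flat list; wrapping a list again would nest it
    if isinstance(control, pl.Expr):
        ctrl = [control]
    elif isinstance(control, list):
        ctrl = list(control)
    else:
        ctrl = []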

    if len(ctrl) > 0:
        target = target.over(ctrl)
        final_ref = ctrl + [by]
    else:
        final_ref = by

    volume_neutral = df.filter(pl.int_range(0, pl.len()).shuffle(seed).over(final_ref) < target)

    # Output
    if isinstance(df, pl.LazyFrame) and return_df:
        volume_neutral = volume_neutral.collect()
    return volume_neutral
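
The balancing rule of the source (keep rows whose shuffled within-group rank is below the target) can be illustrated in plain Python for the case without `control`. The helper `volume_neutral_rows` and the list-of-dicts row format are hypothetical and stand in for the Polars window expression; this is a sketch of the idea, not the library's implementation.

```python
import random
from collections import defaultdict


def volume_neutral_rows(rows, key, target_volume=None, seed=None):
    """Keep the same number of randomly chosen rows from every group of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    smallest = min(len(members) for members in groups.values())
    # The target never exceeds the smallest group, so every group can reach it
    target = smallest if target_volume is None else min(smallest, target_volume)
    kept = []
    for members in groups.values():
        kept.extend(rng.sample(members, target))
    return kept


# Group sizes: A = 5, B = 3, C = 4
rows = [{"id": i, "category": c} for i, c in enumerate("AAAAABBBCCCC")]
balanced = volume_neutral_rows(rows, "category", seed=101)
sizes = {c: sum(r["category"] == c for r in balanced) for c in "ABC"}
print(sizes)  # {'A': 3, 'B': 3, 'C': 3}
```

Passing `target_volume=2` would reduce every group to two rows, while a `target_volume` larger than the smallest group has no effect beyond the default.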