Linear Models

This module provides linear models. It is in very early development and is subject to frequent breaking changes. Because the backend is built on Faer in Rust, performance may be better if your NumPy ndarrays are Fortran-style (column-major). Only f64 is currently supported.

This module requires the NumPy package. PDS only requires Polars, but you can install all the optional dependencies with:

pip install polars_ds[all]

Classes:

| Name | Description |
| --- | --- |
| ElasticNet | Elastic Net Regression. |
| GLM | Generalized Linear Models. |
| LR | Normal or Ridge Regression. |
| OnlineLR | Normal or Ridge Online Regression. This doesn't support dataframe inputs. |

ElasticNet

Elastic Net Regression.

Methods:

| Name | Description |
| --- | --- |
| __init__ | Initializes an ElasticNet regressor. This is equivalent to Sklearn's Elastic Net if you set alpha and l1_ratio to be: alpha = l1_reg + l2_reg, and l1_ratio = l1_reg / (l1_reg + l2_reg). |
| coeffs | Returns a copy of the coefficients. |
| fit | Fit the Elastic Net model on NumPy data. |
| fit_df | Fit the Elastic Net model on a dataframe. This will overwrite previously set feature names. |
| from_values | Constructs an ElasticNet instance from coefficients and bias values. |
| predict | Returns the prediction of this linear model. |
| predict_df | Computes the prediction of the linear model and appends it as a column in the dataframe. If input is lazy, output will be lazy. |
| set_input_features | Sets the names of input features. |

Source code in python/polars_ds/linear_models.py
class ElasticNet:
    """
    Elastic Net Regression.
    """

    def __init__(
        self,
        l1_reg: float,
        l2_reg: float,
        has_bias: bool = False,
        tol: float = 1e-5,
        max_iter: int = 2000,
        feature_names_in_: List[str] | None = None,
    ):
        """
        Initializes an ElasticNet regressor. This is equivalent to Sklearn's Elastic Net if you set
        alpha and l1_ratio to be: `alpha = l1_reg + l2_reg`, and `l1_ratio = l1_reg / (l1_reg + l2_reg)`.

        Parameters
        ----------
        l1_reg
            The l1 regularization parameters for the elastic net.
        l2_reg
            The l2 regularization parameters for the elastic net.
        has_bias
            Whether to add a bias term. Also known as intercept in other packages.
        tol
            When updates are smaller than tol, the algorithm will stop.
        max_iter
            The max number of iterations the algorithm will run.
        feature_names_in_
            Names for the incoming features, if available. If None, the names will be empty. They will be
            learned if .fit_df() is run later, or .set_input_features() is set later.
        """
        if l1_reg <= 0.0 and l2_reg <= 0.0:
            raise ValueError("Cannot have both l1_reg and l2_reg <= 0.")

        self._en = PyElasticNet(l1_reg, l2_reg, has_bias, tol, max_iter)
        self.feature_names_in_: List[str] = (
            [] if feature_names_in_ is None else list(feature_names_in_)
        )

    @classmethod
    def from_values(
        cls, coeffs: List[float], bias: float = 0.0, feature_names_in_: List[str] | None = None
    ) -> Self:
        """
        Constructs an ElasticNet instance from coefficients and bias values.

        Parameters
        ----------
        coeffs
            Iterable of numbers representing the coefficients
        bias
            Value for the bias term
        feature_names_in_
            Names for the incoming features, if available. If None, the names will be empty. They will be
            learned if .fit_df() is run later, or .set_input_features() is set later.
        """
        coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
        elastic_net = cls(
            float("nan"),
            float("nan"),
            has_bias=(bias != 0.0),
            tol=1e-5,
            max_iter=2000,
            feature_names_in_=feature_names_in_,
        )
        elastic_net._en.set_coeffs_and_bias(coefficients, bias)
        return elastic_net

    def is_fit(self) -> bool:
        return self._en.is_fit()

    def __repr__(self) -> str:
        output = f"Elastic Net Model\nl1, l2 regularizers: {self._en.regularizers}\n"
        if self._en.is_fit():
            output += f"Coefficients: {list(round(x, 5) for x in self._en.coeffs)}\n"
            output += f"Bias/Intercept: {self._en.bias}\n"
        else:
            output += "Not fitted yet."
        return output

    def set_input_features(self, features: List[str]) -> Self:
        """
        Sets the names of input features.

        Parameters
        ----------
        features
            List of strings.
        """
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        return self

    def coeffs(self) -> np.ndarray:
        """
        Returns a copy of the coefficients.
        """
        return np.asarray(self._en.coeffs)

    def has_bias(self) -> bool:
        return self._en.has_bias()

    def bias(self) -> float:
        return self._en.bias

    @_sanitize_np("X", "y")
    def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
        """
        Fit the Elastic Net model on NumPy data.

        Parameters
        ----------
        X
            The feature Matrix. NumPy 2D matrix only.
        y
            The target data. NumPy array. Must be reshape-able to (-1, 1).
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Null-fill only applies to non-target columns; rows where the target is null are always dropped.
        """
        X_, y_ = _handle_nans_in_np(X, y.reshape((-1, 1)), null_policy)
        self._en.fit(X_, y_)
        return self

    def fit_df(
        self,
        df: PolarsFrame,
        features: List[str],
        target: str,
        null_policy: NullPolicy = "skip",
    ) -> Self:
        """
        Fit the Elastic Net model on a dataframe. This will overwrite previously set feature names.
        The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
        NaN values if they exist in their pipeline.

        Parameters
        ----------
        df
            Either an eager or a lazy Polars dataframe.
        features
            List of strings of column names.
        target
            The target column's name.
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Null-fill only applies to non-target columns; rows where the target is null are always dropped.
        """
        df2 = (
            _handle_nulls_in_df(df.lazy(), features, target, null_policy)
            .select(*features, target)
            .collect()
        )
        if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
            raise ValueError("Nulls found in Dataframe.")

        X = df2.select(features).to_numpy()
        y = df2.select(target).to_numpy()
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        self._en.fit(X, y)
        return self

    @_sanitize_np("X")
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Returns the prediction of this linear model.

        Parameters
        ----------
        X
            Data to predict on, as a matrix
        """
        return np.asarray(self._en.predict(X))

    def predict_df(self, df: PolarsFrame, name: str = "prediction") -> PolarsFrame:
        """
        Computes the prediction of the linear model and appends it as a column in the dataframe. If input
        is lazy, output will be lazy.

        Parameters
        ----------
        df
            Either an eager or a lazy Polars dataframe.
        name
            The name of the prediction column
        """
        if len(self.feature_names_in_) <= 0:
            raise ValueError(
                "The linear model is not fitted on a dataframe, or no feature names have been given. "
                "Not enough info to predict on a dataframe. Hint: try .fit_df() or .set_input_features()."
            )

        pred = pl.sum_horizontal(
            beta * pl.col(c) for c, beta in zip(self.feature_names_in_, self._en.coeffs)
        )
        bias = self._en.bias
        if bias != 0.0:
            pred = pred + bias

        return df.with_columns(pred.alias(name))

__init__(l1_reg, l2_reg, has_bias=False, tol=1e-05, max_iter=2000, feature_names_in_=None)

Initializes an ElasticNet regressor. This is equivalent to Sklearn's Elastic Net if you set alpha and l1_ratio to be: alpha = l1_reg + l2_reg, and l1_ratio = l1_reg / (l1_reg + l2_reg).
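
As a quick check of the mapping above, the conversion can be sketched in plain Python in both directions. The helper names `to_sklearn_params` and `from_sklearn_params` are hypothetical; they are not part of polars_ds or scikit-learn and only encode the two formulas stated above.

```python
from typing import Tuple


def to_sklearn_params(l1_reg: float, l2_reg: float) -> Tuple[float, float]:
    """Map (l1_reg, l2_reg) to (alpha, l1_ratio) using the formulas stated above."""
    total = l1_reg + l2_reg
    if total <= 0.0:
        raise ValueError("l1_reg + l2_reg must be positive.")
    return total, l1_reg / total


def from_sklearn_params(alpha: float, l1_ratio: float) -> Tuple[float, float]:
    """Invert the mapping: recover (l1_reg, l2_reg) from (alpha, l1_ratio)."""
    return alpha * l1_ratio, alpha * (1.0 - l1_ratio)


# Illustrative values only.
alpha, l1_ratio = to_sklearn_params(l1_reg=0.3, l2_reg=0.1)
l1_reg, l2_reg = from_sklearn_params(alpha, l1_ratio)
```

The round trip recovers the original regularizers up to floating-point error, which is a convenient sanity check when porting hyperparameters between the two libraries.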

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| l1_reg | float | The l1 regularization parameters for the elastic net. | required |
| l2_reg | float | The l2 regularization parameters for the elastic net. | required |
| has_bias | bool | Whether to add a bias term. Also known as intercept in other packages. | False |
| tol | float | When updates are smaller than tol, the algorithm will stop. | 1e-05 |
| max_iter | int | The max number of iterations the algorithm will run. | 2000 |
| feature_names_in_ | List[str] \| None | Names for the incoming features, if available. If None, the names will be empty. They will be learned if .fit_df() is run later, or .set_input_features() is set later. | None |

Source code in python/polars_ds/linear_models.py
def __init__(
    self,
    l1_reg: float,
    l2_reg: float,
    has_bias: bool = False,
    tol: float = 1e-5,
    max_iter: int = 2000,
    feature_names_in_: List[str] | None = None,
):
    """
    Initializes an ElasticNet regressor. This is equivalent to Sklearn's Elastic Net if you set
    alpha and l1_ratio to be: `alpha = l1_reg + l2_reg`, and `l1_ratio = l1_reg / (l1_reg + l2_reg)`.

    Parameters
    ----------
    l1_reg
        The l1 regularization parameters for the elastic net.
    l2_reg
        The l2 regularization parameters for the elastic net.
    has_bias
        Whether to add a bias term. Also known as intercept in other packages.
    tol
        When updates are smaller than tol, the algorithm will stop.
    max_iter
        The max number of iterations the algorithm will run.
    feature_names_in_
        Names for the incoming features, if available. If None, the names will be empty. They will be
        learned if .fit_df() is run later, or .set_input_features() is set later.
    """
    if l1_reg <= 0.0 and l2_reg <= 0.0:
        raise ValueError("Cannot have both l1_reg and l2_reg <= 0.")

    self._en = PyElasticNet(l1_reg, l2_reg, has_bias, tol, max_iter)
    self.feature_names_in_: List[str] = (
        [] if feature_names_in_ is None else list(feature_names_in_)
    )

coeffs()

Returns a copy of the coefficients.

Source code in python/polars_ds/linear_models.py
def coeffs(self) -> np.ndarray:
    """
    Returns a copy of the coefficients.
    """
    return np.asarray(self._en.coeffs)

fit(X, y, null_policy='ignore')

Fit the Elastic Net model on NumPy data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | ndarray | The feature Matrix. NumPy 2D matrix only. | required |
| y | ndarray | The target data. NumPy array. Must be reshape-able to (-1, 1). | required |
| null_policy | NullPolicy | One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns; rows where the target is null are always dropped. | 'ignore' |

Source code in python/polars_ds/linear_models.py
@_sanitize_np("X", "y")
def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
    """
    Fit the Elastic Net model on NumPy data.

    Parameters
    ----------
    X
        The feature Matrix. NumPy 2D matrix only.
    y
        The target data. NumPy array. Must be reshape-able to (-1, 1).
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Null-fill only applies to non-target columns; rows where the target is null are always dropped.
    """
    X_, y_ = _handle_nans_in_np(X, y.reshape((-1, 1)), null_policy)
    self._en.fit(X_, y_)
    return self
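
The `null_policy` argument described above accepts either a named option or a numeric string. A minimal sketch of that validation rule follows. `parse_null_policy` is a hypothetical helper written for illustration, not the function polars_ds uses internally, and the fill values for 'zero' and 'one' are assumed from their names.

```python
from typing import Optional, Tuple

NAMED_POLICIES = {"raise", "skip", "zero", "one", "ignore"}


def parse_null_policy(policy: str) -> Tuple[str, Optional[float]]:
    """Return (kind, fill_value); fill_value is None unless nulls get filled."""
    if policy in NAMED_POLICIES:
        # Assumption: 'zero' and 'one' fill with 0.0 and 1.0 respectively.
        fill = {"zero": 0.0, "one": 1.0}.get(policy)
        return policy, fill
    try:
        # Any other string must parse as a float, e.g. "1.25".
        return "fill", float(policy)
    except ValueError:
        raise ValueError(f"Invalid null_policy: {policy!r}") from None
```

For example, `parse_null_policy("1.25")` yields a fill value of 1.25, while a non-numeric string such as `"abc"` raises a `ValueError`.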

fit_df(df, features, target, null_policy='skip')

Fit the Elastic Net model on a dataframe. This will overwrite previously set feature names. The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle NaN values if they exist in their pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | PolarsFrame | Either an eager or a lazy Polars dataframe. | required |
| features | List[str] | List of strings of column names. | required |
| target | str | The target column's name. | required |
| null_policy | NullPolicy | One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns; rows where the target is null are always dropped. | 'skip' |

Source code in python/polars_ds/linear_models.py
def fit_df(
    self,
    df: PolarsFrame,
    features: List[str],
    target: str,
    null_policy: NullPolicy = "skip",
) -> Self:
    """
    Fit the Elastic Net model on a dataframe. This will overwrite previously set feature names.
    The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
    NaN values if they exist in their pipeline.

    Parameters
    ----------
    df
        Either an eager or a lazy Polars dataframe.
    features
        List of strings of column names.
    target
        The target column's name.
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Null-fill only applies to non-target columns; rows where the target is null are always dropped.
    """
    df2 = (
        _handle_nulls_in_df(df.lazy(), features, target, null_policy)
        .select(*features, target)
        .collect()
    )
    if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
        raise ValueError("Nulls found in Dataframe.")

    X = df2.select(features).to_numpy()
    y = df2.select(target).to_numpy()
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    self._en.fit(X, y)
    return self

from_values(coeffs, bias=0.0, feature_names_in_=None) classmethod

Constructs an ElasticNet instance from coefficients and bias values.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| coeffs | List[float] | Iterable of numbers representing the coefficients | required |
| bias | float | Value for the bias term | 0.0 |
| feature_names_in_ | List[str] \| None | Names for the incoming features, if available. If None, the names will be empty. They will be learned if .fit_df() is run later, or .set_input_features() is set later. | None |

Source code in python/polars_ds/linear_models.py
@classmethod
def from_values(
    cls, coeffs: List[float], bias: float = 0.0, feature_names_in_: List[str] | None = None
) -> Self:
    """
    Constructs an ElasticNet instance from coefficients and bias values.

    Parameters
    ----------
    coeffs
        Iterable of numbers representing the coefficients
    bias
        Value for the bias term
    feature_names_in_
        Names for the incoming features, if available. If None, the names will be empty. They will be
        learned if .fit_df() is run later, or .set_input_features() is set later.
    """
    coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
    elastic_net = cls(
        float("nan"),
        float("nan"),
        has_bias=(bias != 0.0),
        tol=1e-5,
        max_iter=2000,
        feature_names_in_=feature_names_in_,
    )
    elastic_net._en.set_coeffs_and_bias(coefficients, bias)
    return elastic_net

predict(X)

Returns the prediction of this linear model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | ndarray | Data to predict on, as a matrix | required |

Source code in python/polars_ds/linear_models.py
@_sanitize_np("X")
def predict(self, X: np.ndarray) -> np.ndarray:
    """
    Returns the prediction of this linear model.

    Parameters
    ----------
    X
        Data to predict on, as a matrix
    """
    return np.asarray(self._en.predict(X))

predict_df(df, name='prediction')

Computes the prediction of the linear model and appends it as a column in the dataframe. If input is lazy, output will be lazy.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | PolarsFrame | Either an eager or a lazy Polars dataframe. | required |
| name | str | The name of the prediction column | 'prediction' |

Source code in python/polars_ds/linear_models.py
def predict_df(self, df: PolarsFrame, name: str = "prediction") -> PolarsFrame:
    """
    Computes the prediction of the linear model and appends it as a column in the dataframe. If input
    is lazy, output will be lazy.

    Parameters
    ----------
    df
        Either an eager or a lazy Polars dataframe.
    name
        The name of the prediction column
    """
    if len(self.feature_names_in_) <= 0:
        raise ValueError(
            "The linear model is not fitted on a dataframe, or no feature names have been given. "
            "Not enough info to predict on a dataframe. Hint: try .fit_df() or .set_input_features()."
        )

    pred = pl.sum_horizontal(
        beta * pl.col(c) for c, beta in zip(self.feature_names_in_, self._en.coeffs)
    )
    bias = self._en.bias
    if bias != 0.0:
        pred = pred + bias

    return df.with_columns(pred.alias(name))
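
The expression built by `predict_df` is a per-row weighted sum of the feature columns, plus the bias when it is nonzero. The same arithmetic can be sketched in plain Python. `predict_rows` is a hypothetical helper and the coefficients are made up; it mirrors the formula only, not the Polars expression or its null handling.

```python
def predict_rows(rows, feature_names, coeffs, bias=0.0):
    """For each row (a dict), compute sum(beta * row[name]) + bias."""
    return [
        sum(beta * row[name] for name, beta in zip(feature_names, coeffs)) + bias
        for row in rows
    ]


# Illustrative data: two rows, two features.
rows = [{"x1": 1.0, "x2": 2.0}, {"x1": 0.0, "x2": -1.0}]
preds = predict_rows(rows, ["x1", "x2"], coeffs=[0.5, 2.0], bias=1.0)
# preds == [5.5, -1.0]
```

As in `predict_df`, the feature names are paired with the coefficients by position, so the order of `feature_names` must match the order used at fit time.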

set_input_features(features)

Sets the names of input features.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| features | List[str] | List of strings. | required |

Source code in python/polars_ds/linear_models.py
def set_input_features(self, features: List[str]) -> Self:
    """
    Sets the names of input features.

    Parameters
    ----------
    features
        List of strings.
    """
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    return self

GLM

Generalized Linear Models.

The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Currently, the variance function will be determined by the link function. If a family is given, then the canonical link function is used. Here is a mapping between currently implemented families and their link functions:

gaussian / normal ==> id (x -> x)
poisson ==> log (x -> ln(x))
binomial / logistic ==> logit (x -> ln(x/(1-x)))
gamma ==> inverse (x -> 1/x)

Reference

https://en.wikipedia.org/wiki/Generalized_linear_model
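
The family-to-link mapping listed above can be sketched with the standard library. The `LINKS` table and the `round_trip` helper are hypothetical names used for illustration. They show each canonical link next to its inverse and are not how the Rust backend represents them.

```python
import math

# Canonical link g (mean -> linear predictor) and its inverse for each family.
LINKS = {
    "normal": (lambda mu: mu, lambda eta: eta),
    "poisson": (math.log, math.exp),
    "binomial": (
        lambda mu: math.log(mu / (1.0 - mu)),      # logit
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # sigmoid
    ),
    "gamma": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}


def round_trip(family: str, mu: float) -> float:
    """Apply the link and then its inverse; the result should equal mu."""
    link, inverse_link = LINKS[family]
    return inverse_link(link(mu))
```

Applying a link and then its inverse should return the original mean, which is an easy property to check when adding a new family.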

Methods:

| Name | Description |
| --- | --- |
| __init__ | Initializes a GLM. |
| __repr__ | Shows a textual representation of the GLM. |
| coeffs | Returns a copy of the coefficients. |
| fit | Fit the GLM model on NumPy data. |
| fit_df | Fit the GLM model on a dataframe. This will overwrite previously set feature names. |
| predict | Returns the prediction of this linear model. |
| set_input_features | Sets the names of input features. |

Source code in python/polars_ds/linear_models.py
class GLM:
    """
    Generalized Linear Models.

    The GLM generalizes linear regression by allowing the linear model to be related to the response
    variable via a link function and by allowing the magnitude of the variance of each measurement to be a
    function of its predicted value.

    Currently, the variance function will be determined by the link function.
    If a family is given, then the canonical link function is used. Here is a mapping between currently
    implemented families and their link functions:

    gaussian / normal    ==>   id (x -> x)
    poisson              ==>   log (x -> ln(x))
    binomial / logistic  ==>   logit (x -> ln(x/(1-x)))
    gamma                ==>   inverse (x -> 1/x)

    Reference
    ---------
    https://en.wikipedia.org/wiki/Generalized_linear_model
    """

    def __init__(
        self,
        add_bias: bool = False,
        solver: GLMSolver = "irls",
        family: GLMFamily = "normal",
        max_iter: int = 100,
        tol: float = 1e-8,
        feature_names_in_: List[str] | None = None,
    ):
        # lambda_: float = 0.0,
        """
        Parameters
        ----------
        family
            One of "gaussian", "normal", "poisson", "binomial", "logistic", "gamma". Note "gaussian" and
            "normal" represent the same family.
        add_bias
            Whether to add a bias term. Also known as intercept in other packages.
        max_iter
            Max number of iterations for the algorithm
        tol
            The tolerance for convergence
        feature_names_in_
            Names for the incoming features, if available. If None, the names will be empty. They will be
            learned if .fit_df() is run later, or .set_input_features() is set later.
        """
        if solver not in ["irls"]:
            raise NotImplementedError

        if max_iter < 1:
            raise ValueError("`max_iter` must be >= 1.")

        if family not in ["gaussian", "normal", "poisson", "binomial", "logistic", "gamma"]:
            raise NotImplementedError

        self._glm = PyGLM(
            add_bias=add_bias, family=family, solver=solver, max_iter=max_iter, tol=abs(tol)
        )
        self.feature_names_in_: List[str] = (
            [] if feature_names_in_ is None else list(feature_names_in_)
        )

    def __repr__(self) -> str:
        """
        Shows a textual representation of the GLM.
        """
        return self._glm.describe()

    # @classmethod
    # def from_values(
    #     cls,
    #     coeffs: List[float],
    #     link: LinkFunction,
    #     bias: float = 0.0,
    #     feature_names_in_: List[str] | None = None
    # ) -> Self:
    #     """
    #     Constructs a LR class instance from coefficients and bias values.

    #     Parameters
    #     ----------
    #     coeffs
    #         Iterable of numbers representing the coefficients
    #     link
    #         One of ["id", "log", "logit", "inverse"].
    #     bias
    #         Value for the bias term
    #     feature_names_in_
    #         Names for the incoming features, if available. If None, the names will be empty. They will be
    #         learned if .fit_df() is run later, or .set_input_features() is set later.
    #     """
    #     coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
    #     lr = cls(
    #         add_bias=(bias != 0.0),
    #         solver="irls",
    #         feature_names_in_=feature_names_in_,
    #     )
    #     lr._lr.set_coeffs_and_bias(coefficients, bias)
    #     return lr

    def is_fit(self) -> bool:
        return self._glm.is_fit()

    # def __repr__(self) -> str:
    #     if self._lr.lambda_ > 0.0:
    #         output = "Linear Regression (Ridge) Model\n"
    #     else:
    #         output = "Linear Regression Model\n"

    #     if self._lr.is_fit():
    #         output += f"Coefficients: {list(round(x, 5) for x in self._lr.coeffs)}\n"
    #         output += f"Bias/Intercept: {self._lr.bias}\n"
    #     else:
    #         output += "Not fitted yet."
    #     return output

    def set_input_features(self, features: List[str]) -> Self:
        """
        Sets the names of input features.

        Parameters
        ----------
        features
            List of strings.
        """
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        return self

    def coeffs(self) -> np.ndarray:
        """
        Returns a copy of the coefficients.
        """
        return self._glm.coeffs

    def bias(self) -> float:
        return self._glm.bias

    @_sanitize_np("X", "y")
    def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
        """
        Fit the GLM model on NumPy data.

        Parameters
        ----------
        X
            The feature Matrix. NumPy 2D matrix only.
        y
            The target data. NumPy array. Must be reshape-able to (-1, 1).
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Null-fill only applies to non-target columns; rows where the target is null are always dropped.
        """
        X_, y_ = _handle_nans_in_np(X, y.astype(np.float64).reshape((-1, 1)), null_policy)
        self._glm.fit(X_, y_)
        return self

    def fit_df(
        self,
        df: PolarsFrame,
        features: List[str],
        target: str,
        null_policy: NullPolicy = "skip",
        show_report: bool = False,
    ) -> Self:
        """
        Fit the GLM model on a dataframe. This will overwrite previously set feature names.
        The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
        NaN values if they exist in their pipeline.

        Parameters
        ----------
        df
            Either an eager or a lazy Polars dataframe.
        features
            List of strings of column names.
        target
            The target column's name.
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to mean
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Null-fill only applies to non-target columns; rows where the target is null are always dropped.
        show_report
            Whether to print out a regression report.
        """
        df2 = (
            _handle_nulls_in_df(df.lazy(), features, target, null_policy)
            .select(*features, target)
            .collect()
        )
        if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
            raise ValueError("Nulls found in Dataframe.")

        X = df2.select(features).to_numpy()
        y = df2.select(target).to_numpy()
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        self._glm.fit(X, y)
        return self

    @_sanitize_np("X")
    def predict(self, X: np.ndarray, linear: bool = False) -> np.ndarray:
        """
        Returns the prediction of this linear model.

        Parameters
        ----------
        X
            Data to predict on, as a matrix
        linear
            If true, return the linear predictor eta instead of the expected value of
            the response variable, E[Y|X].
        """
        return self._glm.predict(X, linear).reshape((-1, 1))
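
The `linear` flag of `predict` chooses between the linear predictor eta and the expected response E[Y|X], which is obtained by applying the inverse link to eta. A sketch for the binomial family follows. `glm_predict_row` is a hypothetical helper and the coefficients are illustrative; it is not the library's implementation.

```python
import math


def glm_predict_row(x, coeffs, bias=0.0, linear=False):
    """Predict for one row under a binomial family with the logit link."""
    # Linear predictor: eta = x . coeffs + bias
    eta = sum(b * v for b, v in zip(coeffs, x)) + bias
    if linear:
        return eta
    # Inverse logit maps eta to the mean E[Y|X], a value in (0, 1).
    return 1.0 / (1.0 + math.exp(-eta))


eta = glm_predict_row([1.0, 2.0], [0.5, -0.25], linear=True)
mean = glm_predict_row([1.0, 2.0], [0.5, -0.25])
# eta == 0.0, mean == 0.5
```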

__init__(add_bias=False, solver='irls', family='normal', max_iter=100, tol=1e-08, feature_names_in_=None)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| family | GLMFamily | One of "gaussian", "normal", "poisson", "binomial", "logistic", "gamma". Note "gaussian" and "normal" represent the same family. | 'normal' |
| add_bias | bool | Whether to add a bias term. Also known as intercept in other packages. | False |
| max_iter | int | Max number of iterations for the algorithm | 100 |
| tol | float | The tolerance for convergence | 1e-08 |
| feature_names_in_ | List[str] \| None | Names for the incoming features, if available. If None, the names will be empty. They will be learned if .fit_df() is run later, or .set_input_features() is set later. | None |

Source code in python/polars_ds/linear_models.py
def __init__(
    self,
    add_bias: bool = False,
    solver: GLMSolver = "irls",
    family: GLMFamily = "normal",
    max_iter: int = 100,
    tol: float = 1e-8,
    feature_names_in_: List[str] | None = None,
):
    # lambda_: float = 0.0,
    """
    Parameters
    ----------
    family
        One of "gaussian", "normal", "poisson", "binomial", "logistic", "gamma". Note "gaussian" and
        "normal" represent the same family.
    add_bias
        Whether to add a bias term. Also known as intercept in other packages.
    max_iter
        Max number of iterations for the algorithm
    tol
        The tolerance for convergence
    feature_names_in_
        Names for the incoming features, if available. If None, the names will be empty. They will be
        learned if .fit_df() is run later, or .set_input_features() is set later.
    """
    if solver not in ["irls"]:
        raise NotImplementedError

    if max_iter < 1:
        raise ValueError("`max_iter` must be >= 1.")

    if family not in ["gaussian", "normal", "poisson", "binomial", "logistic", "gamma"]:
        raise NotImplementedError

    self._glm = PyGLM(
        add_bias=add_bias, family=family, solver=solver, max_iter=max_iter, tol=abs(tol)
    )
    self.feature_names_in_: List[str] = (
        [] if feature_names_in_ is None else list(feature_names_in_)
    )
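
The `irls` solver name refers to iteratively reweighted least squares. As a rough illustration of what `max_iter` and `tol` control, here is a minimal NumPy sketch of IRLS for a Poisson model with a log link. It is not the library's Rust implementation: the function name `irls_poisson`, the choice of link, and the stopping rule (largest coefficient change below `tol`) are assumptions made for this example.

```python
import numpy as np


def irls_poisson(X, y, max_iter=100, tol=1e-8):
    """Illustrative IRLS for Poisson regression with a log link (not the library's code)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # linear predictor
        mu = np.exp(eta)                  # mean under the log link
        z = eta + (y - mu) / mu           # working response
        XtW = X.T * mu                    # X^T W, with weights W = diag(mu)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])  # explicit intercept column
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3]))).astype(np.float64)
beta_hat = irls_poisson(X, y)
```

Each pass solves a weighted least squares problem, so a tighter `tol` or a smaller `max_iter` trades accuracy against the number of such solves.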

__repr__()

Shows a textual representation of the GLM.

Source code in python/polars_ds/linear_models.py
def __repr__(self) -> str:
    """
    Shows a textual representation of the GLM.
    """
    return self._glm.describe()

coeffs()

Returns a copy of the coefficients.

Source code in python/polars_ds/linear_models.py
def coeffs(self) -> np.ndarray:
    """
    Returns a copy of the coefficients.
    """
    return self._glm.coeffs

fit(X, y, null_policy='ignore')

Fit the GLM model on NumPy data.

Parameters:

Name Type Description Default
X ndarray

The feature Matrix. NumPy 2D matrix only.

required
y ndarray

The target data. NumPy array. Must be reshape-able to (-1, 1).

required
null_policy NullPolicy

One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns: rows where the target is null are always dropped.

'ignore'
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X", "y")
def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
    """
    Fit the GLM model on NumPy data.

    Parameters
    ----------
    X
        The feature Matrix. NumPy 2D matrix only.
    y
        The target data. NumPy array. Must be reshape-able to (-1, 1).
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Note: null-fill only applies to non-target columns. Rows where the target is null are
        always dropped.
    """
    X_, y_ = _handle_nans_in_np(X, y.astype(np.float64).reshape((-1, 1)), null_policy)
    self._glm.fit(X_, y_)
    return self
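
To make the `null_policy` options concrete, the following NumPy sketch mimics the behavior described in the docstring. It is a hypothetical helper (`handle_nans_sketch`), not the library's `_handle_nans_in_np`. How 'ignore' and 'raise' treat a NaN target is an assumption here: 'ignore' passes the data through unchanged and 'raise' errors on any NaN.

```python
import numpy as np


def handle_nans_sketch(X, y, policy="ignore"):
    """Hypothetical illustration of the documented null policies on NumPy data."""
    if policy == "ignore":
        return X, y                                  # assumption: data is passed through as-is
    if policy == "raise":
        if np.isnan(X).any() or np.isnan(y).any():
            raise ValueError("NaN found in input data.")
        return X, y
    keep = ~np.isnan(y).ravel()                      # rows with a NaN target are dropped
    X, y = X[keep], y[keep]
    if policy == "skip":
        keep = ~np.isnan(X).any(axis=1)              # drop rows with any NaN feature
        return X[keep], y[keep]
    fill = {"zero": 0.0, "one": 1.0}.get(policy)
    if fill is None:
        fill = float(policy)                         # raises ValueError for non-numeric strings
    return np.where(np.isnan(X), fill, X), y


X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y = np.array([[1.0], [np.nan], [3.0]])

X_skip, y_skip = handle_nans_sketch(X, y, "skip")    # only the last row survives
X_fill, y_fill = handle_nans_sketch(X, y, "1.25")    # NaN-target row dropped, feature NaN filled
```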

fit_df(df, features, target, null_policy='skip', show_report=False)

Fit the GLM model on a dataframe. This will overwrite previously set feature names. The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle NaN values if they exist in their pipeline.

Parameters:

Name Type Description Default
df PolarsFrame

Either an eager or a lazy Polars dataframe.

required
features List[str]

List of strings of column names.

required
target str

The target column's name.

required
null_policy NullPolicy

One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns: rows where the target is null are always dropped.

'skip'
show_report bool

Whether to print out a regression report.

False
Source code in python/polars_ds/linear_models.py
def fit_df(
    self,
    df: PolarsFrame,
    features: List[str],
    target: str,
    null_policy: NullPolicy = "skip",
    show_report: bool = False,
) -> Self:
    """
    Fit the GLM model on a dataframe. This will overwrite previously set feature names.
    The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
    NaN values if they exist in their pipeline.

    Parameters
    ----------
    df
        Either an eager or a lazy Polars dataframe.
    features
        List of strings of column names.
    target
        The target column's name.
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Note: null-fill only applies to non-target columns. Rows where the target is null are
        always dropped.
    show_report
        Whether to print out a regression report.
    """
    df2 = (
        _handle_nulls_in_df(df.lazy(), features, target, null_policy)
        .select(*features, target)
        .collect()
    )
    if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
        raise ValueError("Nulls found in Dataframe.")

    X = df2.select(features).to_numpy()
    y = df2.select(target).to_numpy()
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    self._glm.fit(X, y)
    return self

predict(X, linear=False)

Returns the prediction of this linear model.

Parameters:

Name Type Description Default
X ndarray

Data to predict on, as a matrix

required
linear bool

If true, return the linear predictor eta instead of the expected value of the response variable, E[Y|X].

False
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X")
def predict(self, X: np.ndarray, linear: bool = False) -> np.ndarray:
    """
    Returns the prediction of this linear model.

    Parameters
    ----------
    X
        Data to predict on, as a matrix
    linear
        If true, return the linear predictor eta instead of the expected value of
        the response variable, E[Y|X].
    """
    return self._glm.predict(X, linear).reshape((-1, 1))
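
The difference between the linear predictor eta and the expected response E[Y|X] is the inverse link function. The NumPy sketch below illustrates the `linear` flag under stated assumptions: the `INVERSE_LINKS` table and `predict_sketch` are hypothetical names, and the link assigned to each family (identity, log, logit) is the usual textbook choice, not something confirmed by the source above.

```python
import numpy as np

# Inverse link functions for a few families. Which link each family uses in the
# library is an assumption here (identity, log, and logit are the usual choices).
INVERSE_LINKS = {
    "normal": lambda eta: eta,
    "poisson": np.exp,
    "binomial": lambda eta: 1.0 / (1.0 + np.exp(-eta)),
}


def predict_sketch(X, coeffs, family="normal", linear=False):
    eta = X @ coeffs                        # linear predictor
    out = eta if linear else INVERSE_LINKS[family](eta)
    return out.reshape((-1, 1))             # column vector, like the reshape in predict


X = np.array([[0.0, 1.0], [1.0, 1.0]])
coeffs = np.array([2.0, 0.0])
eta = predict_sketch(X, coeffs, "poisson", linear=True)   # linear predictor: 0 and 2
mu = predict_sketch(X, coeffs, "poisson")                 # expected value: exp(0) and exp(2)
```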

set_input_features(features)

Sets the names of input features.

Parameters:

Name Type Description Default
features List[str]

List of strings.

required
Source code in python/polars_ds/linear_models.py
def set_input_features(self, features: List[str]) -> Self:
    """
    Sets the names of input features.

    Parameters
    ----------
    features
        List of strings.
    """
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    return self

LR

Normal or Ridge Regression.

Methods:

Name Description
__init__

Parameters

coeffs

Returns a copy of the coefficients.

fit

Fit the linear regression model on NumPy data.

fit_df

Fit the linear regression model on a dataframe. This will overwrite previously set feature names.

from_values

Constructs a LR class instance from coefficients and bias values.

predict

Returns the prediction of this linear model.

predict_df

Computes the prediction of the linear model and appends it as a column in the dataframe. If input is lazy, output will be lazy.

set_input_features

Sets the names of input features.

Source code in python/polars_ds/linear_models.py
class LR:
    """
    Normal or Ridge Regression.
    """

    def __init__(
        self,
        has_bias: bool = False,
        lambda_: float = 0.0,
        solver: LRSolverMethods = "qr",
        feature_names_in_: List[str] | None = None,
    ):
        """
        Parameters
        ----------
        lambda_
            The regularization parameter for ridge. If this is positive, then this class will solve Ridge.
        solver
            Use one of the 'svd', 'cholesky' or 'qr' methods to solve the least squares problem. Default is 'qr'.
        has_bias
            Whether to add a bias term. Also known as intercept in other packages.
        feature_names_in_
            Names for the incoming features, if available. If None, the names will be empty. They will be
            learned if .fit_df() is run later, or .set_input_features() is set later.
        """
        self._lr = PyLR(solver, lambda_, has_bias)
        self.feature_names_in_: List[str] = (
            [] if feature_names_in_ is None else list(feature_names_in_)
        )

    @classmethod
    def from_values(
        cls, coeffs: List[float], bias: float = 0.0, feature_names_in_: List[str] | None = None
    ) -> Self:
        """
        Constructs a LR class instance from coefficients and bias values.

        Parameters
        ----------
        coeffs
            Iterable of numbers representing the coefficients
        bias
            Value for the bias term
        feature_names_in_
            Names for the incoming features, if available. If None, the names will be empty. They will be
            learned if .fit_df() is run later, or .set_input_features() is set later.
        """
        coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
        lr = cls(
            has_bias=(bias != 0.0),
            lambda_=0.0,
            solver="Not Solved",
            feature_names_in_=feature_names_in_,
        )
        lr._lr.set_coeffs_and_bias(coefficients, bias)
        return lr

    def is_fit(self) -> bool:
        return self._lr.is_fit()

    def __repr__(self) -> str:
        if self._lr.lambda_ > 0.0:
            output = "Linear Regression (Ridge) Model\n"
        else:
            output = "Linear Regression Model\n"

        if self._lr.is_fit():
            output += f"Coefficients: {list(round(x, 5) for x in self._lr.coeffs)}\n"
            output += f"Bias/Intercept: {self._lr.bias}\n"
        else:
            output += "Not fitted yet."
        return output

    def set_input_features(self, features: List[str]) -> Self:
        """
        Sets the names of input features.

        Parameters
        ----------
        features
            List of strings.
        """
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        return self

    def coeffs(self) -> np.ndarray:
        """
        Returns a copy of the coefficients.
        """
        return np.asarray(self._lr.coeffs)

    def bias(self) -> float:
        return self._lr.bias

    @_sanitize_np("X", "y")
    def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
        """
        Fit the linear regression model on NumPy data.

        Parameters
        ----------
        X
            The feature Matrix. NumPy 2D matrix only.
        y
            The target data. NumPy array. Must be reshape-able to (-1, 1).
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Note: null-fill only applies to non-target columns. Rows where the target is null are
            always dropped.
        """
        X_, y_ = _handle_nans_in_np(X, y.reshape((-1, 1)), null_policy)
        self._lr.fit(X_, y_)
        return self

    def fit_df(
        self,
        df: PolarsFrame,
        features: List[str],
        target: str,
        null_policy: NullPolicy = "skip",
        show_report: bool = False,
    ) -> Self:
        """
        Fit the linear regression model on a dataframe. This will overwrite previously set feature names.
        The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
        NaN values if they exist in their pipeline.

        Parameters
        ----------
        df
            Either an eager or a lazy Polars dataframe.
        features
            List of strings of column names.
        target
            The target column's name.
        null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
            One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
            fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
            Note: null-fill only applies to non-target columns. Rows where the target is null are
            always dropped.
        show_report
            Whether to print out a regression report. This duplicates work and is not supported for Ridge
            regression, i.e. nothing will be printed if lambda_ > 0.
        """
        if show_report and self._lr.lambda_ == 0.0:
            from . import query_lstsq_report

            print(
                df.lazy()
                .select(
                    query_lstsq_report(
                        *features,
                        target=target,
                    ).alias("report")
                )
                .unnest("report")
                .collect()
            )

        df2 = (
            _handle_nulls_in_df(df.lazy(), features, target, null_policy)
            .select(*features, target)
            .collect()
        )
        if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
            raise ValueError("Nulls found in Dataframe.")

        X = df2.select(features).to_numpy()
        y = df2.select(target).to_numpy()
        self.feature_names_in_.clear()
        self.feature_names_in_ = list(features)
        self._lr.fit(X, y)
        return self

    @_sanitize_np("X")
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Returns the prediction of this linear model.

        Parameters
        ----------
        X
            Data to predict on, as a matrix
        """
        return np.asarray(self._lr.predict(X))

    def predict_df(self, df: PolarsFrame, name: str = "prediction") -> PolarsFrame:
        """
        Computes the prediction of the linear model and append it as a column in the dataframe. If input
        is lazy, output will be lazy.

        Parameters
        ----------
        df
            Either an eager or a lazy Polars dataframe.
        name
            The name of the prediction column
        """
        if len(self.feature_names_in_) <= 0:
            raise ValueError(
                "The linear model is not fitted on a dataframe, or no feature names have been given. "
                "Not enough info to predict on a dataframe. Hint: try .fit_df() or .set_input_features()."
            )

        pred = pl.sum_horizontal(
            beta * pl.col(c) for c, beta in zip(self.feature_names_in_, self._lr.coeffs)
        )
        bias = self._lr.bias
        if bias != 0.0:
            pred = pred + bias

        return df.with_columns(pred.alias(name))

__init__(has_bias=False, lambda_=0.0, solver='qr', feature_names_in_=None)

Parameters:

Name Type Description Default
lambda_ float

The regularization parameter for ridge. If this is positive, then this class will solve Ridge.

0.0
solver LRSolverMethods

Use one of the 'svd', 'cholesky' or 'qr' methods to solve the least squares problem. Default is 'qr'.

'qr'
has_bias bool

Whether to add a bias term. Also known as intercept in other packages.

False
feature_names_in_ List[str] | None

Names for the incoming features, if available. If None, the names will be empty. They will be learned if .fit_df() is run later, or .set_input_features() is set later.

None
Source code in python/polars_ds/linear_models.py
def __init__(
    self,
    has_bias: bool = False,
    lambda_: float = 0.0,
    solver: LRSolverMethods = "qr",
    feature_names_in_: List[str] | None = None,
):
    """
    Parameters
    ----------
    lambda_
        The regularization parameter for ridge. If this is positive, then this class will solve Ridge.
    solver
        Use one of the 'svd', 'cholesky' or 'qr' methods to solve the least squares problem. Default is 'qr'.
    has_bias
        Whether to add a bias term. Also known as intercept in other packages.
    feature_names_in_
        Names for the incoming features, if available. If None, the names will be empty. They will be
        learned if .fit_df() is run later, or .set_input_features() is set later.
    """
    self._lr = PyLR(solver, lambda_, has_bias)
    self.feature_names_in_: List[str] = (
        [] if feature_names_in_ is None else list(feature_names_in_)
    )
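
For intuition about `lambda_` and `has_bias`, here is a small NumPy sketch of the problem being solved. It uses the normal equations for brevity, whereas the class delegates to a QR, SVD, or Cholesky solver in the Rust backend. The function name `ridge_sketch` is hypothetical, and leaving the bias term unpenalized is an assumption rather than something the source states.

```python
import numpy as np


def ridge_sketch(X, y, lambda_=0.0, has_bias=False):
    """Solve min ||X b - y||^2 + lambda_ * ||b||^2 via the normal equations (illustrative only)."""
    if has_bias:
        X = np.column_stack([X, np.ones(X.shape[0])])  # bias as a trailing column of ones
    penalty = lambda_ * np.eye(X.shape[1])
    if has_bias:
        penalty[-1, -1] = 0.0                          # assumption: the bias is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

ols = ridge_sketch(X, y)                   # lambda_ = 0: ordinary least squares
ridge = ridge_sketch(X, y, lambda_=10.0)   # lambda_ > 0: coefficients shrink toward zero
```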

coeffs()

Returns a copy of the coefficients.

Source code in python/polars_ds/linear_models.py
def coeffs(self) -> np.ndarray:
    """
    Returns a copy of the coefficients.
    """
    return np.asarray(self._lr.coeffs)

fit(X, y, null_policy='ignore')

Fit the linear regression model on NumPy data.

Parameters:

Name Type Description Default
X ndarray

The feature Matrix. NumPy 2D matrix only.

required
y ndarray

The target data. NumPy array. Must be reshape-able to (-1, 1).

required
null_policy NullPolicy

One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns: rows where the target is null are always dropped.

'ignore'
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X", "y")
def fit(self, X: np.ndarray, y: np.ndarray, null_policy: NullPolicy = "ignore") -> Self:
    """
    Fit the linear regression model on NumPy data.

    Parameters
    ----------
    X
        The feature Matrix. NumPy 2D matrix only.
    y
        The target data. NumPy array. Must be reshape-able to (-1, 1).
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Note: null-fill only applies to non-target columns. Rows where the target is null are
        always dropped.
    """
    X_, y_ = _handle_nans_in_np(X, y.reshape((-1, 1)), null_policy)
    self._lr.fit(X_, y_)
    return self

fit_df(df, features, target, null_policy='skip', show_report=False)

Fit the linear regression model on a dataframe. This will overwrite previously set feature names. The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle NaN values if they exist in their pipeline.

Parameters:

Name Type Description Default
df PolarsFrame

Either an eager or a lazy Polars dataframe.

required
features List[str]

List of strings of column names.

required
target str

The target column's name.

required
null_policy NullPolicy

One of 'raise', 'skip', 'zero', 'one', or 'ignore'. You can also pass any numeric string, e.g. '1.25' to fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown. Null-fill only applies to non-target columns: rows where the target is null are always dropped.

'skip'
show_report bool

Whether to print out a regression report. This duplicates work and is not supported for Ridge regression, i.e. nothing will be printed if lambda_ > 0.

False
Source code in python/polars_ds/linear_models.py
def fit_df(
    self,
    df: PolarsFrame,
    features: List[str],
    target: str,
    null_policy: NullPolicy = "skip",
    show_report: bool = False,
) -> Self:
    """
    Fit the linear regression model on a dataframe. This will overwrite previously set feature names.
    The null policy only handles null values in df, not NaN values. It is the user's responsibility to handle
    NaN values if they exist in their pipeline.

    Parameters
    ----------
    df
        Either an eager or a lazy Polars dataframe.
    features
        List of strings of column names.
    target
        The target column's name.
    null_policy: Literal['raise', 'skip', 'zero', 'one', 'ignore']
        One of the options shown here, or any numeric string. E.g. you may pass '1.25' to
        fill nulls with 1.25. If the string cannot be converted to a float, an error will be thrown.
        Note: null-fill only applies to non-target columns. Rows where the target is null are
        always dropped.
    show_report
        Whether to print out a regression report. This duplicates work and is not supported for Ridge
        regression, i.e. nothing will be printed if lambda_ > 0.
    """
    if show_report and self._lr.lambda_ == 0.0:
        from . import query_lstsq_report

        print(
            df.lazy()
            .select(
                query_lstsq_report(
                    *features,
                    target=target,
                ).alias("report")
            )
            .unnest("report")
            .collect()
        )

    df2 = (
        _handle_nulls_in_df(df.lazy(), features, target, null_policy)
        .select(*features, target)
        .collect()
    )
    if null_policy == "raise" and any(df2[c].has_nulls() for c in df2.columns):
        raise ValueError("Nulls found in Dataframe.")

    X = df2.select(features).to_numpy()
    y = df2.select(target).to_numpy()
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    self._lr.fit(X, y)
    return self

from_values(coeffs, bias=0.0, feature_names_in_=None) classmethod

Constructs a LR class instance from coefficients and bias values.

Parameters:

Name Type Description Default
coeffs List[float]

Iterable of numbers representing the coefficients

required
bias float

Value for the bias term

0.0
feature_names_in_ List[str] | None

Names for the incoming features, if available. If None, the names will be empty. They will be learned if .fit_df() is run later, or .set_input_features() is set later.

None
Source code in python/polars_ds/linear_models.py
@classmethod
def from_values(
    cls, coeffs: List[float], bias: float = 0.0, feature_names_in_: List[str] | None = None
) -> Self:
    """
    Constructs a LR class instance from coefficients and bias values.

    Parameters
    ----------
    coeffs
        Iterable of numbers representing the coefficients
    bias
        Value for the bias term
    feature_names_in_
        Names for the incoming features, if available. If None, the names will be empty. They will be
        learned if .fit_df() is run later, or .set_input_features() is set later.
    """
    coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
    lr = cls(
        has_bias=(bias != 0.0),
        lambda_=0.0,
        solver="Not Solved",
        feature_names_in_=feature_names_in_,
    )
    lr._lr.set_coeffs_and_bias(coefficients, bias)
    return lr

predict(X)

Returns the prediction of this linear model.

Parameters:

Name Type Description Default
X ndarray

Data to predict on, as a matrix

required
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X")
def predict(self, X: np.ndarray) -> np.ndarray:
    """
    Returns the prediction of this linear model.

    Parameters
    ----------
    X
        Data to predict on, as a matrix
    """
    return np.asarray(self._lr.predict(X))

predict_df(df, name='prediction')

Computes the prediction of the linear model and appends it as a column in the dataframe. If input is lazy, output will be lazy.

Parameters:

Name Type Description Default
df PolarsFrame

Either an eager or a lazy Polars dataframe.

required
name str

The name of the prediction column

'prediction'
Source code in python/polars_ds/linear_models.py
def predict_df(self, df: PolarsFrame, name: str = "prediction") -> PolarsFrame:
    """
    Computes the prediction of the linear model and append it as a column in the dataframe. If input
    is lazy, output will be lazy.

    Parameters
    ----------
    df
        Either an eager or a lazy Polars dataframe.
    name
        The name of the prediction column
    """
    if len(self.feature_names_in_) <= 0:
        raise ValueError(
            "The linear model is not fitted on a dataframe, or no feature names have been given. "
            "Not enough info to predict on a dataframe. Hint: try .fit_df() or .set_input_features()."
        )

    pred = pl.sum_horizontal(
        beta * pl.col(c) for c, beta in zip(self.feature_names_in_, self._lr.coeffs)
    )
    bias = self._lr.bias
    if bias != 0.0:
        pred = pred + bias

    return df.with_columns(pred.alias(name))
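
The prediction above pairs each stored feature name with a coefficient by position, so the order of `feature_names_in_` must match the order of the coefficients. The sketch below shows the same arithmetic with a plain dict of NumPy arrays standing in for the dataframe. The column names and values are made up for illustration, and no Polars call is used.

```python
import numpy as np

feature_names_in_ = ["x1", "x2"]          # as set by fit_df or set_input_features
coeffs = [0.5, -1.0]
bias = 2.0

# A plain dict of NumPy columns stands in for the dataframe.
columns = {
    "x1": np.array([2.0, 4.0]),
    "x2": np.array([1.0, 0.0]),
    "unused": np.array([9.0, 9.0]),       # columns not in feature_names_in_ are ignored
}

# Weighted sum of the feature columns, paired with coefficients by position.
pred = sum(beta * columns[c] for c, beta in zip(feature_names_in_, coeffs))
if bias != 0.0:
    pred = pred + bias

columns["prediction"] = pred              # appended under the default column name
```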

set_input_features(features)

Sets the names of input features.

Parameters:

Name Type Description Default
features List[str]

List of strings.

required
Source code in python/polars_ds/linear_models.py
def set_input_features(self, features: List[str]) -> Self:
    """
    Sets the names of input features.

    Parameters
    ----------
    features
        List of strings.
    """
    self.feature_names_in_.clear()
    self.feature_names_in_ = list(features)
    return self

OnlineLR

Normal or Ridge Online Regression. This doesn't support dataframe inputs.

Because of implementation details, it is not recommended to set has_bias = True here if runtime speed is crucial.

Null Behaviors:

1. During the initial fit, no nulls/NaNs should be present.
2. During online updates, if the record has a null/NaN, it will be ignored. Nothing will be updated.

Methods:

Name Description
__init__

lambda_

coeffs

Returns a copy of the current coefficients.

fit

Initial Fit for the online linear regression model on NumPy data.

from_coeffs_bias_inverse

Constructs an online linear regression instance from coefficients, a bias, and an inverse matrix. This copies data.

inv

Returns a copy of the current inverse matrix (inverse of XtX in a linear regression).

predict

Returns the prediction of this online linear model.

update

Updates the online linear regression model with one row of data. If the row contains np.nan, it will be ignored.

Source code in python/polars_ds/linear_models.py
class OnlineLR:
    """
    Normal or Ridge Online Regression. This doesn't support dataframe inputs.

    Because of implementation details, it is not recommended to set has_bias = True here
    if runtime speed is crucial.

    Null Behaviors:
    1. During the initial fit, no nulls/NaNs should be present
    2. During online updates, if the record has null/NaN, then it will be ignored. Nothing will be updated.
    """

    def __init__(
        self,
        lambda_: float = 0.0,
        has_bias: bool = False,
    ):
        """
        Parameters
        ----------
        lambda_
            The L2 regularization factor
        has_bias
            Whether this should fit the bias term
        """
        self._lr = PyOnlineLR(lambda_, has_bias)

    @classmethod
    @_sanitize_np("inv")
    def from_coeffs_bias_inverse(cls, coeffs: List[float], bias: float, inv: np.ndarray) -> Self:
        """
        Constructs an online linear regression instance from coefficients, a bias, and an inverse
        matrix. This copies data.

        Parameters
        ----------
        coeffs
            Iterable of numbers representing the coefficients
        bias
            The bias term
        inv
            2D NumPy matrix representing the inverse of XtX in a regression problem.
        """
        coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
        lr = cls(has_bias=(bias != 0.0), lambda_=0.0)
        lr._lr.set_coeffs_bias_inverse(coefficients, bias, inv)
        return lr

    def is_fit(self) -> bool:
        return self._lr.is_fit()

    def __repr__(self) -> str:
        if self._lr.lambda_ > 0.0:
            output = "Online Linear Regression (Ridge) Model\n"
        else:
            output = "Online Linear Regression Model\n"

        if self._lr.is_fit():
            output += f"Coefficients: {list(round(x, 5) for x in self._lr.coeffs)}\n"
            output += f"Bias/Intercept: {self._lr.bias}\n"
        else:
            output += "Not fitted yet."
        return output

    def coeffs(self) -> np.ndarray:
        """
        Returns a copy of the current coefficients.
        """
        return np.asarray(self._lr.coeffs)

    def bias(self) -> float:
        return self._lr.bias

    def inv(self) -> np.ndarray:
        """
        Returns a copy of the current inverse matrix (inverse of XtX in a linear regression).
        """
        return np.asarray(self._lr.inv)

    @_sanitize_np("X", "y")
    def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
        """
        Initial Fit for the online linear regression model on NumPy data.

        Parameters
        ----------
        X
            The feature Matrix. NumPy 2D matrix only.
        y
            The target data. NumPy array. Must be reshape-able to (-1, 1).
        """
        if np.any(np.isnan(X)) | np.any(np.isnan(y)):
            raise ValueError(
                "Online regression currently must fit without null for the initial fit."
            )

        self._lr.fit(X, y)
        return self

    @_sanitize_np("X", "y")
    def update(self, X: np.ndarray, y: np.ndarray | float, c: float = 1.0) -> Self:
        """
        Updates the online linear regression model with one row of data. If the row contains np.nan,
        it will be ignored.

        Parameters
        ----------
        X
            Either a 1d array or a 2d array with 1 row. Must be reshapeable to a matrix with 1 row.
        y
            Either a scalar, or a 1d array with 1 element, or a 2d array of size 1x1.
        c
            The middle term (C) in the Woodbury matrix identity. A value of 1.0 means we add
            the impact of the new data, and a value of -1.0 means we remove the impact of the
            data. Any other value will `scale` the impact of the data.
        """
        if not self.is_fit():
            raise ValueError("You cannot update before the initial fit of the matrix.")

        x_2d = X.reshape((1, -1))
        # Sanitization will do np.asarray(y). This means y at this point is already
        # an array.
        y_2d = y.reshape((1, 1))

        self._lr.update(x_2d, y_2d, c)
        return self

    @_sanitize_np("X")
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Returns the prediction of this online linear model.

        Parameters
        ----------
        X
            Data to predict on, as a matrix
        """
        return np.asarray(self._lr.predict(X))

__init__(lambda_=0.0, has_bias=False)

lambda_
    The L2 regularization factor
has_bias
    Whether this should fit the bias term

Source code in python/polars_ds/linear_models.py
def __init__(
    self,
    lambda_: float = 0.0,
    has_bias: bool = False,
):
    """
    lambda_
        The L2 regularization factor
    has_bias
        Whether this should fit the bias term
    """
    self._lr = PyOnlineLR(lambda_, has_bias)
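
As a rough illustration of what an L2 regularization factor does, the sketch below computes ridge coefficients in closed form with NumPy. It is not the library's Rust implementation: the helper name `ridge_closed_form` is hypothetical, and the sketch assumes no bias term (as with `has_bias=False`).

```python
import numpy as np


def ridge_closed_form(X: np.ndarray, y: np.ndarray, lambda_: float = 0.0) -> np.ndarray:
    """Hypothetical helper: solve (X^T X + lambda_ * I) w = X^T y for w."""
    n_features = X.shape[1]
    gram = X.T @ X + lambda_ * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y.reshape(-1, 1)).ravel()


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w  # noiseless target, so lambda_=0.0 should recover true_w

w_ols = ridge_closed_form(X, y, lambda_=0.0)     # ordinary least squares
w_ridge = ridge_closed_form(X, y, lambda_=10.0)  # coefficients shrunk toward zero
```

With `lambda_=0.0` this reduces to ordinary least squares; a positive value shrinks the coefficient vector.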

coeffs()

Returns a copy of the current coefficients.

Source code in python/polars_ds/linear_models.py
def coeffs(self) -> np.ndarray:
    """
    Returns a copy of the current coefficients.
    """
    return np.asarray(self._lr.coeffs)

fit(X, y)

Initial fit of the online linear regression model on NumPy data.

Parameters:

X (ndarray, required)
    The feature matrix. NumPy 2D matrix only.
y (ndarray, required)
    The target data. NumPy array. Must be reshape-able to (-1, 1).
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X", "y")
def fit(self, X: np.ndarray, y: np.ndarray) -> Self:
    """
    Initial fit of the online linear regression model on NumPy data.

    Parameters
    ----------
    X
        The feature matrix. NumPy 2D matrix only.
    y
        The target data. NumPy array. Must be reshape-able to (-1, 1).
    """
    if np.any(np.isnan(X)) | np.any(np.isnan(y)):
        raise ValueError(
            "Online regression currently must fit without null for the initial fit."
        )

    self._lr.fit(X, y)
    return self
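
Because the initial fit raises a `ValueError` when `X` or `y` contains NaN, one option is to drop incomplete rows first. The following is a minimal NumPy sketch of that preprocessing step; the array values are made up for illustration, and the final call to `fit` is left out so the snippet runs without the library installed.

```python
import numpy as np

# Illustrative data: row 1 has a missing feature, row 2 has a missing target.
X = np.array([[1.0, 2.0], [np.nan, 1.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, np.nan, 4.0])

# Keep only rows where every feature and the target are present.
keep = ~(np.isnan(X).any(axis=1) | np.isnan(y))
X_clean, y_clean = X[keep], y[keep]

# X_clean and y_clean now pass the same NaN check that fit() performs.
has_nan = bool(np.any(np.isnan(X_clean)) or np.any(np.isnan(y_clean)))
```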

from_coeffs_bias_inverse(coeffs, bias, inv) classmethod

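Constructs an online linear regression instance from coefficients, a bias, and the inverse of XtX. This copies the data. The instance is created with lambda_=0.0, and has_bias is set to True only when bias > 0.0.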

Parameters:

coeffs (List[float], required)
    Iterable of numbers representing the coefficients
bias (float, required)
    The bias term
inv (ndarray, required)
    2D NumPy matrix representing the inverse of XtX in a regression problem.
Source code in python/polars_ds/linear_models.py
@classmethod
@_sanitize_np("inv")
def from_coeffs_bias_inverse(cls, coeffs: List[float], bias: float, inv: np.ndarray) -> Self:
    """
    Constructs an online linear regression instance from coefficients, a bias, and the
    inverse of XtX. This copies the data.

    Parameters
    ----------
    coeffs
        Iterable of numbers representing the coefficients
    bias
        The bias term
    inv
        2D NumPy matrix representing the inverse of XtX in a regression problem.
    """
    coefficients = np.ascontiguousarray(coeffs, dtype=np.float64).flatten()
    lr = cls(has_bias=(bias > 0.0), lambda_=0.0)
    lr._lr.set_coeffs_bias_inverse(coefficients, bias, inv)
    return lr

inv()

Returns a copy of the current inverse matrix (inverse of XtX in a linear regression).

Source code in python/polars_ds/linear_models.py
def inv(self) -> np.ndarray:
    """
    Returns a copy of the current inverse matrix (inverse of XtX in a linear regression).
    """
    return np.asarray(self._lr.inv)

predict(X)

Returns the prediction of this online linear model.

Parameters:

X (ndarray, required)
    Data to predict on, as a matrix
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X")
def predict(self, X: np.ndarray) -> np.ndarray:
    """
    Returns the prediction of this online linear model.

    Parameters
    ----------
    X
        Data to predict on, as a matrix
    """
    return np.asarray(self._lr.predict(X))
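
For orientation, a linear model's prediction is conventionally the matrix product of the features with the coefficients, plus the bias. The sketch below shows that arithmetic in plain NumPy; it assumes this standard form and does not reproduce the Rust backend or guarantee the shape of the array the library returns. The coefficient and bias values are made up.

```python
import numpy as np

coeffs = np.array([2.0, -1.0])  # illustrative coefficients
bias = 0.5                      # illustrative intercept
X = np.array([[1.0, 1.0],
              [0.0, 2.0]])

# One prediction per row of X: X @ coeffs + bias
pred = X @ coeffs + bias
```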

update(X, y, c=1.0)

Updates the online linear regression model with one row of data. If the row contains np.nan, it will be ignored.

Parameters:

X (ndarray, required)
    Either a 1d array or a 2d array with 1 row. Must be reshapeable to a matrix with 1 row.
y (ndarray | float, required)
    Either a scalar, or a 1d array with 1 element, or a 2d array of size 1x1.
c (float, default 1.0)
    The middle term (C) in the Woodbury matrix identity. A value of 1.0 means we add the impact of the new data, and a value of -1.0 means we remove the impact of the data. Any other value will scale the impact of the data.
Source code in python/polars_ds/linear_models.py
@_sanitize_np("X", "y")
def update(self, X: np.ndarray, y: np.ndarray | float, c: float = 1.0) -> Self:
    """
    Updates the online linear regression model with one row of data. If the row contains np.nan,
    it will be ignored.

    Parameters
    ----------
    X
        Either a 1d array or a 2d array with 1 row. Must be reshapeable to a matrix with 1 row.
    y
        Either a scalar, or a 1d array with 1 element, or a 2d array of size 1x1.
    c
        The middle term (C) in the Woodbury matrix identity. A value of 1.0 means we add
        the impact of the new data, and a value of -1.0 means we remove the impact of the
        data. Any other value will `scale` the impact of the data.
    """
    if not self.is_fit():
        raise ValueError("You cannot update before the initial fit of the matrix.")

    x_2d = X.reshape((1, -1))
    # Sanitization will do np.asarray(y). This means y at this point is already
    # an array.
    y_2d = y.reshape((1, 1))

    self._lr.update(x_2d, y_2d, c)
    return self
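
To make the role of `c` more concrete, here is a NumPy sketch of a single-row update based on the Woodbury identity with a 1x1 middle term. It illustrates the math the docstring refers to and is not the library's Rust code. The helper `rank_one_update` is hypothetical, and the sketch assumes no bias term and no regularization.

```python
import numpy as np


def rank_one_update(inv, coeffs, x, y, c=1.0):
    """Hypothetical helper: apply one row (x, y) with weight c.

    inv is the (symmetric) inverse of X^T X and coeffs the current solution.
    """
    x = x.reshape(-1, 1)
    inv_x = inv @ x
    # Woodbury with U = x, V = x^T and middle term C = c:
    # (A + c x x^T)^-1 = A^-1 - (A^-1 x x^T A^-1) / (1/c + x^T A^-1 x)
    denom = 1.0 / c + (x.T @ inv_x).item()
    new_inv = inv - (inv_x @ inv_x.T) / denom
    # Move the coefficients along the weighted residual of the row.
    residual = y - (x.T @ coeffs.reshape(-1, 1)).item()
    new_coeffs = coeffs + c * residual * (new_inv @ x).ravel()
    return new_inv, new_coeffs


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

inv = np.linalg.inv(X.T @ X)
coeffs = inv @ X.T @ y

x_new, y_new = rng.normal(size=3), 0.3
# c = 1.0 adds the row; c = -1.0 afterwards removes it again.
inv_added, coeffs_added = rank_one_update(inv, coeffs, x_new, y_new, c=1.0)
inv_back, coeffs_back = rank_one_update(inv_added, coeffs_added, x_new, y_new, c=-1.0)
```

Adding a row with `c=1.0` should match a full refit on the extended data, and applying the same row with `c=-1.0` should restore the earlier state up to floating-point error.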