Under the hood#

Import packages.

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import matplotlib.pyplot as plt
import numpy as np
from generalized_additive_models.datasets import load_salaries

from generalized_additive_models import GAM, Categorical, Spline, Intercept, Linear

rng = np.random.default_rng(42)
/home/docs/checkouts/readthedocs.org/user_builds/generalized-additive-models/envs/latest/lib/python3.11/site-packages/generalized_additive_models/__init__.py:116: UserWarning:
Thank you for using generalized-additive-models, version 0.4.3.
Until version 1.0.0 is released, the package and API should be considered unstable.
You are welcome to use the package. Report bugs and join the discussion on GitHub:
https://github.com/tommyod/generalized-additive-models
  warnings.warn(message)

Load data.

[3]:
df = load_salaries().sort_values("salary")
df[["age", "work_domain", "salary"]].sample(5, random_state=42)
[3]:
age work_domain salary
1637 58.0 data science 950000.0
1832 38.0 app 865000.0
1098 23.0 fullstack 630000.0
1172 28.0 backend 665000.0
922 48.0 arkitektur 776000.0

Terms are scikit-learn Transformers#

Analogous to Transformers scikit-learn, a term such as Linear implement fit and transform.

Below the age column is transformed by subtracting the mean value.

[4]:
Linear("age").fit_transform(df)
[4]:
array([[33.],
       [23.],
       [33.],
       ...,
       [43.],
       [33.],
       [48.]], shape=(2232, 1))

Categorical variables are one-hot encoded.

[5]:
Categorical("work_domain").fit_transform(df)
[5]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(2232, 16))

A spline basis is created when we fit and transform using a Spline. Under the hood, the implementation relies on the scikit-learn SplineTransformer. For the visualization to look correct, we must sort the values.

[6]:
X = np.sort(rng.triangular(left=-1, mode=0, right=1, size=2**10))[:, np.newaxis]
plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), Spline(0, num_splines=6).fit_transform(X))
plt.show()
../_images/notebooks_under_the_hood_10_0.png

Every spline basis is centered, so the mean value over all data points equals zero.

[7]:
Spline(0, num_splines=6).fit_transform(X).mean(axis=0)
[7]:
array([ 6.91416054e-17, -4.49672851e-17,  2.22966177e-16, -3.44098664e-16,
       -1.02782366e-16, -4.61599075e-17])
[8]:
Spline(0, num_splines=6).fit_transform(X).mean()
[8]:
np.float64(-1.1029216006198245e-17)

Terms may be added#

When we add two Terms, such as Intercept and Linear, a TermList instance is created.

[9]:
terms = Intercept() + Linear("age")
terms
[9]:
Intercept() + Linear(feature='age')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Fitting and transforming a TermList involves fitting each Term in the TermList.

[10]:
terms.fit_transform(df)
[10]:
array([[ 1., 33.],
       [ 1., 23.],
       [ 1., 33.],
       ...,
       [ 1., 43.],
       [ 1., 33.],
       [ 1., 48.]], shape=(2232, 2))

We can access useful properties, used by the GAM:

[11]:
print("Number of coefficients:", terms.num_coefficients)
print("Penalty matrix:\n", terms.penalty_matrix())
Number of coefficients: 2
Penalty matrix:
 [[0. 0.]
 [0. 1.]]

Penalties (regularization)#

The penalty matrix \(D\) goes into the regulatization term as

\[\text{loss}(\beta) = \text{data fit} + \text{regularization} = \text{data fit} + \text{penalty} \lVert D \beta \rVert_2^2\]
[12]:
D = (Intercept() + Linear("age", penalty=2)).penalty_matrix()
D
[12]:
array([[0.        , 0.        ],
       [0.        , 1.41421356]])

The penalty for a Spline acts on the second-order derivative (smoothness) of the spline.

[13]:
Spline(0, num_splines=8).penalty_matrix()
[13]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

A spline with no smoothness (affine function) results in no penalty.

[14]:
spline = Spline(0, num_splines=6)
beta = np.arange(6)

spline.penalty_matrix() @ beta, np.linalg.norm(spline.penalty_matrix() @ beta)
[14]:
(array([0., 0., 0., 0., 0., 0.]), np.float64(0.0))
[15]:
plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), spline.fit_transform(X) @ beta)
plt.show()
../_images/notebooks_under_the_hood_27_0.png

A wiggly, non-smooth spline does incur a penalty.

[16]:
beta = np.array([1, 2, 1, 2, 1, 2])
spline.penalty_matrix() @ beta, np.linalg.norm(spline.penalty_matrix() @ beta)
[16]:
(array([ 0., -2.,  2., -2.,  2.,  0.]), np.float64(4.0))
[17]:
plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), spline.fit_transform(X) @ beta)
plt.show()
../_images/notebooks_under_the_hood_30_0.png

Univariate GAM#

[18]:
model = GAM(Spline("years_relevant_work_experience"), verbose=9)
model.fit(df, df["salary"])
Variables set to zero for identifiability:1/21
Initial guess:      Objective: 1.3762e+14   Coef. rmse: 2.5974e+05
Iteration:  1/100   Objective: 1.3762e+14   Coef. rmse: 2.5974e+05   Step size: 1/2^19
 => SUCCESS: Solver converged (met tolerance criterion).
[18]:
GAM(terms=TermList(data=[Spline(feature='years_relevant_work_experience'), Intercept()]),
    verbose=9)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[19]:
years_smooth = np.linspace(0, 50, num=2**10)

plt.figure(figsize=(8, 3))
plt.scatter(df["years_relevant_work_experience"], df["salary"], alpha=0.33)

# Predict with the model
plt.plot(years_smooth, model.predict(years_smooth[:, np.newaxis]), color="black")

plt.show()
../_images/notebooks_under_the_hood_33_0.png
[20]:
from sklearn.linear_model import Ridge

spline = Spline("years_relevant_work_experience")
X = spline.fit_transform(df)
y = df["salary"]

ridge = Ridge().fit(X, y)

plt.figure(figsize=(8, 3))
plt.scatter(df["years_relevant_work_experience"], df["salary"], alpha=0.33)
plt.plot(
    years_smooth,
    ridge.predict(spline.transform(years_smooth[:, np.newaxis])),
    color="black",
)
plt.plot
plt.show()
../_images/notebooks_under_the_hood_34_0.png
[ ]: