Under the hood#

Import packages.

[1]:

%load_ext autoreload
%autoreload 2

[2]:

import matplotlib.pyplot as plt
import numpy as np
from generalized_additive_models.datasets import load_salaries

from generalized_additive_models import GAM, Categorical, Spline, Intercept, Linear

rng = np.random.default_rng(42)

/home/docs/checkouts/readthedocs.org/user_builds/generalized-additive-models/envs/latest/lib/python3.11/site-packages/generalized_additive_models/__init__.py:116: UserWarning:
Thank you for using generalized-additive-models, version 0.4.3.
Until version 1.0.0 is released, the package and API should be considered unstable.
You are welcome to use the package. Report bugs and join the discussion on GitHub:
https://github.com/tommyod/generalized-additive-models
  warnings.warn(message)

Load data.

[3]:

df = load_salaries().sort_values("salary")
df[["age", "work_domain", "salary"]].sample(5, random_state=42)

[3]:

	age	work_domain	salary
1637	58.0	data science	950000.0
1832	38.0	app	865000.0
1098	23.0	fullstack	630000.0
1172	28.0	backend	665000.0
922	48.0	arkitektur	776000.0

Terms are scikit-learn Transformers#

Analogous to Transformers scikit-learn, a term such as Linear implement fit and transform.

Below the age column is transformed by subtracting the mean value.

[4]:

Linear("age").fit_transform(df)

[4]:

array([[33.],
       [23.],
       [33.],
       ...,
       [43.],
       [33.],
       [48.]], shape=(2232, 1))

Categorical variables are one-hot encoded.

[5]:

Categorical("work_domain").fit_transform(df)

[5]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], shape=(2232, 16))

A spline basis is created when we fit and transform using a Spline. Under the hood, the implementation relies on the scikit-learn SplineTransformer. For the visualization to look correct, we must sort the values.

[6]:

X = np.sort(rng.triangular(left=-1, mode=0, right=1, size=2**10))[:, np.newaxis]
plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), Spline(0, num_splines=6).fit_transform(X))
plt.show()

../_images/notebooks_under_the_hood_10_0.png

Every spline basis is centered, so the mean value over all data points equals zero.

[7]:

Spline(0, num_splines=6).fit_transform(X).mean(axis=0)

[7]:

array([ 6.91416054e-17, -4.49672851e-17,  2.22966177e-16, -3.44098664e-16,
       -1.02782366e-16, -4.61599075e-17])

[8]:

Spline(0, num_splines=6).fit_transform(X).mean()

[8]:

np.float64(-1.1029216006198245e-17)

Terms may be added#

When we add two Terms, such as Intercept and Linear, a TermList instance is created.

[9]:

terms = Intercept() + Linear("age")
terms

[9]:

Intercept() + Linear(feature='age')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Fitting and transforming a TermList involves fitting each Term in the TermList.

[10]:

terms.fit_transform(df)

[10]:

array([[ 1., 33.],
       [ 1., 23.],
       [ 1., 33.],
       ...,
       [ 1., 43.],
       [ 1., 33.],
       [ 1., 48.]], shape=(2232, 2))

We can access useful properties, used by the GAM:

[11]:

print("Number of coefficients:", terms.num_coefficients)
print("Penalty matrix:\n", terms.penalty_matrix())

Number of coefficients: 2
Penalty matrix:
 [[0. 0.]
 [0. 1.]]

Penalties (regularization)#

The penalty matrix \(D\) goes into the regulatization term as

\[\text{loss}(\beta) = \text{data fit} + \text{regularization} = \text{data fit} + \text{penalty} \lVert D \beta \rVert_2^2\]

[12]:

D = (Intercept() + Linear("age", penalty=2)).penalty_matrix()
D

[12]:

array([[0.        , 0.        ],
       [0.        , 1.41421356]])

The penalty for a Spline acts on the second-order derivative (smoothness) of the spline.

[13]:

Spline(0, num_splines=8).penalty_matrix()

[13]:

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

A spline with no smoothness (affine function) results in no penalty.

[14]:

spline = Spline(0, num_splines=6)
beta = np.arange(6)

spline.penalty_matrix() @ beta, np.linalg.norm(spline.penalty_matrix() @ beta)

[14]:

(array([0., 0., 0., 0., 0., 0.]), np.float64(0.0))

[15]:

plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), spline.fit_transform(X) @ beta)
plt.show()

../_images/notebooks_under_the_hood_27_0.png

A wiggly, non-smooth spline does incur a penalty.

[16]:

beta = np.array([1, 2, 1, 2, 1, 2])
spline.penalty_matrix() @ beta, np.linalg.norm(spline.penalty_matrix() @ beta)

[16]:

(array([ 0., -2.,  2., -2.,  2.,  0.]), np.float64(4.0))

[17]:

plt.figure(figsize=(8, 3))
plt.plot(X.ravel(), spline.fit_transform(X) @ beta)
plt.show()

../_images/notebooks_under_the_hood_30_0.png

Univariate GAM#

[18]:

model = GAM(Spline("years_relevant_work_experience"), verbose=9)
model.fit(df, df["salary"])

Variables set to zero for identifiability:1/21
Initial guess:      Objective: 1.3762e+14   Coef. rmse: 2.5974e+05
Iteration:  1/100   Objective: 1.3762e+14   Coef. rmse: 2.5974e+05   Step size: 1/2^19
 => SUCCESS: Solver converged (met tolerance criterion).

[18]:

GAM(terms=TermList(data=[Spline(feature='years_relevant_work_experience'), Intercept()]),
    verbose=9)

[19]:

years_smooth = np.linspace(0, 50, num=2**10)

plt.figure(figsize=(8, 3))
plt.scatter(df["years_relevant_work_experience"], df["salary"], alpha=0.33)

# Predict with the model
plt.plot(years_smooth, model.predict(years_smooth[:, np.newaxis]), color="black")

plt.show()

../_images/notebooks_under_the_hood_33_0.png

[20]:

from sklearn.linear_model import Ridge

spline = Spline("years_relevant_work_experience")
X = spline.fit_transform(df)
y = df["salary"]

ridge = Ridge().fit(X, y)

plt.figure(figsize=(8, 3))
plt.scatter(df["years_relevant_work_experience"], df["salary"], alpha=0.33)
plt.plot(
    years_smooth,
    ridge.predict(spline.transform(years_smooth[:, np.newaxis])),
    color="black",
)
plt.plot
plt.show()

../_images/notebooks_under_the_hood_34_0.png

[ ]:

	0	Intercept()
	1__by	None
	1__constraint	None
	1__feature	'age'
	1__penalty	1
	1	Linear(feature='age')

	terms	TermList([Spl... Intercept()])
	distribution	'normal'
	link	'identity'
	fit_intercept	True
	solver	'pirls'
	max_iter	100
	tol	0.0001
	verbose	9

	0__by	None
	0__constraint	None
	0__degree	3
	0__edges	None
	0__extrapolation	'linear'
	0__feature	'years_relevant_work_experience'
	0__knots	'uniform'
	0__l2_penalty	0
	0__num_splines	20
	0__penalty	1
	0	Spline(featur...k_experience')
	1	Intercept()