Heavily indebted to various internet sources; see References at the end for the most prominent. Additional thanks to Dr. Mara Tableman for corrections via email.
library(splines)
library(tidyverse)
library(pander)
Instead of a single regression line, we fit a set of piecewise linear regressions with the only restriction being that they intersect at the knots. Mathematically, with one predictor variable, we write the regression equation as follows. Note that we have K + 2 parameters to estimate.
\[ \begin{equation} E[y_i] = \beta_0 + \beta_1x_i+\sum_{k=1}^K b_k(x_i-\xi_k)_+ \end{equation} \]
where
\[ \begin{equation} (x_i-\xi_k)_+ = \begin{cases} 0 & x_i < \xi_k \\ (x_i-\xi_k) & x_i \geq \xi_k \end{cases} \end{equation} \]
where K is the number of knots. For a simple example, imagine you have one knot. Evaluating the sum, you can see that the term \(b_k(x_i-\xi_k)\) will only change the expression for \(x\)-values greater than the knot. For those \(x\)-values above the knot, the regression function is \(\mathbb{E}[y_i] = \beta_0 + \beta_1x_i + b_k(x_i-\xi_k)\), meaning there is an ‘added amount of slope’ \(b_k\). Let’s look at this visually:
knot <- 3 # This is the location of our (in this case single) knot
x <- seq(from = 0, to = 6, by = .025)
y <- sin(2*x) + x -.1*x^2 + 2 + rnorm(length(x), sd = .3)
# More on this design matrix later...
design_matrix_X <- cbind(outer(x,1:1,"^"),outer(x,knot,">") * outer(x,knot,"-")^1)
mod_ls <- lm(y~design_matrix_X)
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .5) +
geom_line(aes(x = x, y = predict(mod_ls)), color = "red") +
geom_vline(aes(xintercept = knot), alpha = .5, linetype = "dotdash") +
labs(title = "Piecewise linear spline... not a great fit!")
Here are the coefficients from our model:
summary(mod_ls)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7314389 0.12593095 21.689972 2.671087e-58
## design_matrix_X1 0.4959321 0.06152002 8.061313 3.670208e-14
## design_matrix_X2 -0.5344361 0.11004979 -4.856312 2.167723e-06
The intercept is… well, the intercept, and that matches our plot. The “design_matrix_X1” coefficient is the slope for the first section, while “design_matrix_X2” is the additional slope for the second section. hence, the second section has slope \(.55 + -.6 = -.05\), which matches what we see in the plot.
Now let’s connect what we did above to the design matrix…
Earlier we used some slick R commands to make a design matrix that matched (1) and (2). Let’s look at that matrix in detail. Note that it does not have a vector of \(1\)s for the intercept (yet).
#design_matrix_X[1:nrow(design_matrix_X) %% 10==1,]
design_matrix_X[120:123,]
## [,1] [,2]
## [1,] 2.975 0.000
## [2,] 3.000 0.000
## [3,] 3.025 0.025
## [4,] 3.050 0.050
Only a subset is shown above. The first column is \(x_1\) and the second column is 0 when \(x_1\) is below the knot at \(x=3\), and \((x_1-3)\) when \(x_1\) is above the knot. Let’s assemble the full design matrix and use LSE to estimate the coefficients:
DMX <- cbind(1, design_matrix_X)
(betas <- solve(t(DMX) %*% DMX) %*% t(DMX) %*% y)
## [,1]
## [1,] 2.7314389
## [2,] 0.4959321
## [3,] -0.5344361
summary(mod_ls)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7314389 0.12593095 21.689972 2.671087e-58
## design_matrix_X1 0.4959321 0.06152002 8.061313 3.670208e-14
## design_matrix_X2 -0.5344361 0.11004979 -4.856312 2.167723e-06
They match what we expect.
Over and underfitting are common problems when using splines. For linear splines, there are two things to consider: Knot number/placement and smoothing/penalization.
The number of knots and their placement will certainly affect over/underfitting. Let’s extend the approach shown above, using more knots:
generate_design_matrix <- function(x, knot_vector, degree){
return(cbind(outer(x,1:degree,"^"),outer(x,knot_vector,">")*outer(x,knot_vector,"-")^degree))
}
design_matrix2 <- generate_design_matrix(degree = 1, knot_vector = c(1,2.5,4, 5.7), x = x)
mod_ls2 <- lm(y~design_matrix2)
design_matrix3 <- generate_design_matrix(degree = 1, knot_vector = seq(from = 0.1, to = 5.9, by = .2), x = x)
mod_ls3 <- lm(y~design_matrix3)
yhatbad <- predict(mod_ls3)
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .5) +
geom_line(aes(x = x, y = predict(mod_ls2)), color = "red") +
geom_line(aes(x = x, y = yhatbad), color = "blue") +
labs(title = "Piecewise linear spline - Good number vs. too many knots...")
If there is a scientific rationale for placing knots in a certain place (for example, known changes in hormone levels at the onset of puberty), you may want to select that ahead of time. However, in general, a reasonably dense grid of knots is created (often with an equal number of points in each slice) and other methods of controlling overfitting are employed… like:
This is an alternative to choosing knot location. Assume a large-ish number of knots is OK, then control overfitting via penalization. We use a penalty term \(\lambda\) to constrain coefficients as follows.
Earlier, we noted that we are following the model below:
\[ \begin{equation} E[y_i] = \beta_0 + \beta_1x_i+\sum_{k=1}^K b_k(x_i-\xi_k)_+ \end{equation} \]
One way to keep the model from overfitting is to put a constraint on \(\sum b_k^2\), say, \(\sum b_k^2 < C\), to keep the coefficients from making the slopes leap up and down at each knot. Our penalty term \(\lambda\) is a function of \(C\).
Under this constraint, it can be shown (derivation omitted) that the penalized least squares estimate for the coefficient vector (notably, the same as the ridge regression solution) is:
\[ \begin{equation} \hat{\boldsymbol{\beta}}=(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{D})^{-1} \mathbf{X}^{T} \mathbf{y} \end{equation} \]
where \(\mathbf{D}\) is a matrix of all zeros, except on the diagonal, where the first two terms are zero (for \(\beta_0\) and \(\beta_1\)) and the rest of the diagonal terms are 1 (for all of the constrained \(b_k\)s). Using that result to investigate different penalty levels for our data set:
x <- seq(from = 0, to = 6, by = .025)
knot_vector = seq(from = 0.1, to = 5.9, by = .2)
D <- diag(c(0,0,rep(1, times = length(knot_vector))))
X <- cbind(1, generate_design_matrix(degree = 1, knot_vector = knot_vector, x = x))
# derivation of coefficients for penalized linear splines
betas_.1 <- solve(t(X) %*% X + (.1) * D) %*% t(X) %*% y
betas_9 <- solve(t(X) %*% X + (9) * D) %*% t(X) %*% y
yhat.1 <- X %*% betas_.1
yhat9 <- X %*% betas_9
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_line(aes(x = x, y = yhat.1), color = "black", alpha = 1) +
geom_line(aes(x = x, y = yhat9), color = "purple", alpha = 1) +
geom_line(aes(x = x, y = yhatbad), color = "red", alpha = 1) +
labs(title = "lambda = 0 (red, overfit), .1 (black, better), 9 (purple, underfit)")
As in many applications, we can use cross-validation to select the best balance of bias and variance to minimize the mean squared error. Here we’ll do 5-fold CV to find a value of \(\lambda\). I’m also going to reduce the number of knots to make the matrix inversions easier (more on numerical stability later).
library(caret)
knot_vector = seq(from = 0.1, to = 5.9, by = .4)
D <- diag(c(0,0,rep(1, times = length(knot_vector))))
k <- 5
fold_indices <- createFolds(x, k = k)
lambda_grid <- seq(from = 0, to = .25, by = .01)
results <- matrix(data = NA, nrow = k, ncol = length(lambda_grid))
for(j in 1:length(lambda_grid)){
for(i in 1:k){
X_test <- cbind(1, generate_design_matrix(degree = 1, knot_vector = knot_vector, x = x[fold_indices[[i]]]))
X_train <- cbind(1, generate_design_matrix(degree = 1, knot_vector = knot_vector, x = x[-fold_indices[[i]]]))
betas_train <- solve(t(X_train) %*% X_train + (lambda_grid[j]) * D) %*% t(X_train) %*% y[-fold_indices[[i]]]
y_test <- X_test %*% betas_train
mse <- mean((y[fold_indices[[i]]] - y_test)^2)
results[i,j] <- mse
}
}
CV_mse <- colMeans(results)
ggplot() +
geom_line(aes(x = lambda_grid, y = CV_mse), color = "black", alpha = 1) +
labs(title = "MSE", y = "Average MSE across the 5 folds", x = "lambda value")
The lowest MSE occurrs when \(\lambda\) is about \(.05\). Let’s fit the whole data set with that value and examine the results:
D <- diag(c(0,0,rep(1, times = length(knot_vector))))
X <- cbind(1, generate_design_matrix(degree = 1, knot_vector = knot_vector, x = x))
betas_CV <- solve(t(X) %*% X + (.05) * D) %*% t(X) %*% y
yhatCV <- X %*% betas_CV
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_line(aes(x = x, y = yhatCV), color = "black", alpha = 1) +
labs(title = "lambda = .05... looks like a pretty good fit for these knot choices.")
An obvious downside of the above is that most real-world processes don’t have the sharp changes in slope implied by the model fit above. In fact, linear splines are a pretty crude way to model a non-linear relationship, so we’re going to move to a more advanced technique. However, the reason we went through the linear spline example in such detail is that the big ideas (building the design matrix, estimation, and penalization for over/underfitting) are almost the same, but it’s easier to build the intuition with the simple linear case.
Polynomial splines address the concern of jagged splines by using piecewise polynomials (usually of degree 3) that connect at the knot points and whose derivatives also match at the knot points, making for a smooth curve through the data. The same concerns of over/underfitting are present, and are addressed in the same way as in the linear case.
A linear spline is of course a special case of the more general polynomial spline, where the sections between knots are polynomials of degree \(D=1\). The formula below is equivalent to the linear spline formula, but with added polynomial terms. We now have D + K + 1 parameters to estimate.
\[ \begin{equation} E[y_i]=\beta_{0}+\beta_{1} x_i+\ldots \beta_{D} x_i^{D}+\sum_{k=1}^{K} b_{k}\left(x_i-\xi_{k}\right)_{+}^{D} \end{equation} \]
We generate the design matrix in the same way, just with many more terms. In the next section we’ll apply it to our data set. Here we see an example of over and underfitting based on knot selection:
X <- cbind(1, generate_design_matrix(degree = 3, knot_vector = c(2), x = x))
betas <- solve(t(X) %*% X) %*% t(X) %*% y
yhat <- X %*% betas
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_line(aes(x = x, y = yhat), color = "black", alpha = 1) +
geom_vline(aes(xintercept = 2), color = "black", linetype = "dotdash") +
labs(title = "Cubic spline",
subtitle = "1 knot at x=2, no penalization, underfitting")
X <- cbind(1, generate_design_matrix(degree = 3, knot_vector = seq(.2, 5.8, .2), x = x))
betas <- solve(t(X) %*% X) %*% t(X) %*% y
yhat <- X %*% betas
ggplot() +
geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_line(aes(x = x, y = yhat), color = "black", alpha = 1) +
geom_vline(aes(xintercept = seq(.2, 5.8, .2)), color = "black", alpha = .5, linetype = "dotdash") +
labs(title = "Cubic spline",
subtitle = "Too many knots, no penalization, overfitting!")
We can apply the same cross-validated penalization routine to get an appropriate fit to our data. Note that we can also increase the degree of the polynomial to add flexibility to the model, but the danger of overfitting is even more pronounced.
# We haven't explained the "bs" function yet, stay tuned...
modbs <- lm(y~bs(x, knots = c(3), degree = 50))
ggplot() + geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_vline(aes(xintercept = knot), alpha = .5, linetype = "dotdash") +
geom_line(aes(x = x, y = predict(modbs)), color = "red") +
labs(title = "Only one knot... but 50th degree polynomials obviously overfit!")
The use of the term “natural” means that the spline is restricted to be linear beyond the boundary knots. This is enforced by requiring the second derivative to be zero at the boundary. This is frequently desirable to control overfitting around the edges of the data. Here is an example of a natural and non-natural cubic spline:
knotseq <- seq(from=.5, to = 5.5, by = .5)
modNS <- lm(y~ns(x, knots = knotseq))
modBS <-lm(y~bs(x, knots = knotseq, degree = 3))
ggplot() + geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_vline(aes(xintercept = knotseq), alpha = .5, linetype = "dotdash") +
geom_line(aes(x = x, y = predict(modNS)), color = "red") +
geom_line(aes(x = x, y = predict(modBS)), color = "blue") +
labs(title = "Blue: Unrestricted cubic spline; Red: Natural cubic spline")
widex <- seq(from = -1, to = 7, by = .025)
ggplot() + geom_point(aes(x = x, y = y), color = "black", alpha = .3) +
geom_vline(aes(xintercept = knotseq), alpha = .5, linetype = "dotdash") +
geom_line(aes(x = widex, y = predict(modNS, newdata = data.frame(x=widex))), color = "red") +
geom_line(aes(x = widex, y = predict(modBS, newdata = data.frame(x=widex))), color = "blue") +
labs(title = "Blue: Unrestricted cubic spline; Red: Natural cubic spline")
Depending on the data set, making the design matrix for a bunch of cubed \(x_i\) values can lead to some very large (and very small) values, making the fitting algorithm unstable, and there also may be high correlations between some of the columns in the design matrix, further creating fitting instability.
This can be remedied by re-parameterizing the spline function as the
sum of a set of local polynomials, rather than a global polynomial with
additive pieces at each knot. The mathematical details are not important
to the current project, but they are easily implented using the
bs()
function in R.
Generally, individual spline coefficients are not of interest. More commonly, we ask if the addition of the spline lead to better model fit than a one-coefficient linear model. Without penalization, confidence bands for the spline can be calculated pointwise in the same manner as OLS regression. Also, since splines essentially create ‘multiple coefficients’ for each modeled variable, we can use full/reduced \(F\) tests to check for improvement in fit of the model with and without a spline. When using a penalized spline, bootstrap procedures can be used to get empirical confidence bands or distribution of full/reduced model statistics.
Polynomial splines of degree >3 are rarely useful. Since the first and second derivatives will match at each knot, they will be smooth to the human eye. If you want to overfit your data, just use more knots rather than a higher degree.
Penalization is genereally preferred over choosing specific knots, unless you have specific subject matter knowledge that you want to build your knots around.
Don’t predict outside your data: Whether you choose a natural spline that assumes a linear form beyond the boundaries or not, you can’t have confidence in the accuracy of those predictions. Higher-order polynomials exaggerate this problem.
Super-easy in R; the functions bs()
,
ns()
, and smooth.spline()
are built in to base
R distribution, and many packages like mgcv
add additional
functionality.
Once basis functions are chosen, fitting is just via OLS! Easy computationally.
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] caret_6.0-93 lattice_0.20-45 pander_0.6.5 forcats_0.5.2
## [5] stringr_1.4.1 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3
## [9] tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-160 fs_1.5.2 lubridate_1.9.0
## [4] httr_1.4.4 tools_4.2.2 backports_1.4.1
## [7] bslib_0.4.1 utf8_1.2.2 R6_2.5.1
## [10] rpart_4.1.19 DBI_1.1.3 colorspace_2.0-3
## [13] nnet_7.3-18 withr_2.5.0 tidyselect_1.2.0
## [16] compiler_4.2.2 cli_3.4.1 rvest_1.0.3
## [19] xml2_1.3.3 labeling_0.4.2 sass_0.4.2
## [22] scales_1.2.1 digest_0.6.30 rmarkdown_2.18
## [25] pkgconfig_2.0.3 htmltools_0.5.3 parallelly_1.32.1
## [28] highr_0.9 dbplyr_2.2.1 fastmap_1.1.0
## [31] rlang_1.0.6 readxl_1.4.1 rstudioapi_0.14
## [34] farver_2.1.1 jquerylib_0.1.4 generics_0.1.3
## [37] jsonlite_1.8.3 ModelMetrics_1.2.2.2 googlesheets4_1.0.1
## [40] magrittr_2.0.3 Matrix_1.5-1 Rcpp_1.0.9
## [43] munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.3
## [46] pROC_1.18.0 stringi_1.7.8 yaml_2.3.6
## [49] MASS_7.3-58.1 plyr_1.8.7 recipes_1.0.3
## [52] grid_4.2.2 parallel_4.2.2 listenv_0.8.0
## [55] crayon_1.5.2 haven_2.5.1 hms_1.1.2
## [58] knitr_1.40 pillar_1.8.1 stats4_4.2.2
## [61] reshape2_1.4.4 future.apply_1.10.0 codetools_0.2-18
## [64] reprex_2.0.2 glue_1.6.2 evaluate_0.18
## [67] data.table_1.14.4 modelr_0.1.9 vctrs_0.5.0
## [70] tzdb_0.3.0 foreach_1.5.2 cellranger_1.1.0
## [73] gtable_0.3.1 future_1.29.0 assertthat_0.2.1
## [76] cachem_1.0.6 xfun_0.34 gower_1.0.0
## [79] prodlim_2019.11.13 broom_1.0.1 class_7.3-20
## [82] survival_3.4-0 googledrive_2.0.0 gargle_1.2.1
## [85] timeDate_4021.106 iterators_1.0.14 hardhat_1.2.0
## [88] lava_1.7.0 timechange_0.1.1 globals_0.16.1
## [91] ellipsis_0.3.2 ipred_0.9-13