Generally, our optimization problem can be viewed as follows: we minimize an objective function with respect to \(\theta\): \[\text{argmin}_{\theta}\ L(\theta | x) \quad \textbf{Objective}\ \text{function}.\] For simplicity, we only consider the class of convex functions (which have a single minimum).
Goal
Generate a sequence of \(\theta_i\) such that \(L(\theta_1|x) \ge L(\theta_2|x) \ge \dots \ge L(\theta_i|x) \ge L(\theta_{i+1}|x) \ge \dots\)
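For a differentiable \(L\), gradient descent generates such a sequence with the update rule \[\theta_{i+1} = \theta_i - \gamma \nabla L(\theta_i \mid x),\] where \(\gamma\) is the step size; with a sufficiently small \(\gamma\), each update does not increase the objective.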
Steps
Let’s start with a simple example: coefficient estimation in linear regression (OLS).
\[\begin{align} &Y = X\beta + \epsilon, \ \epsilon \sim N(0, \sigma^2 I) \quad \text{// Statistical model} \\ &\text{argmin}_{\beta}\ L(\beta | X) = (Y - X\beta)^T(Y - X\beta) \quad \text{// Negative log-likelihood (up to constants)} \\ \end{align}\]We use simulated data for this task, with N = 5e4 and p = 5.
\[\beta = \begin{bmatrix} 1\\ 2\\ 3\\ -2\\ -3\end{bmatrix} \quad \text{True}\] \[X \sim N(0, I)\]
N <- 5e4                          # number of observations
p <- 5                            # number of predictors
X <- matrix(rnorm(N * p), N, p)   # design matrix with N(0, 1) entries
beta <- c(1.0, 2.0, 3.0, -2, -3)  # true coefficients
Y <- X %*% beta + rnorm(N)        # response with N(0, 1) noise
OLS
result <- solve(t(X) %*% X) %*% t(X) %*% Y
result
## [,1]
## [1,] 1.006664
## [2,] 1.999297
## [3,] 3.001104
## [4,] -1.997833
## [5,] -2.991333
Gradient Descent
If you are in STAT 628, please implement gradient descent yourself first; copying and pasting code will not teach you anything…
Two things matter here:
Gradient. The gradient is determined by the objective function. glmnet uses the following form as the objective function for linear regression: \[\frac{1}{2N}\sum\limits_{i = 1}^{N}(y_i - x_i\beta)^2 + \lambda |\beta|_p,\quad p \in \{1, 2\}.\] The gradient of the first (least-squares) part is \(X^T(X\beta_i - Y)/N\).
Step size \(\gamma\). If the gradient is not calculated carefully, your algorithm will blow up, so choosing \(\gamma\) is also important; the Wolfe conditions give a principled way to select it.
Gradient descent (pseudocode; a runnable sketch follows below):
initial beta_old
set stepsize
while norm(gradient) > tolerance:
    g = gradient at beta_old
    # update step
    beta_new = beta_old - stepsize * g
    beta_old = beta_new
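To make the pseudocode concrete, here is a minimal runnable sketch in R that assumes the least-squares gradient \(X^T(X\beta - Y)/N\) and a fixed step size. The gdescent() profiled and timed below is not shown in this post, so gdescent_sketch and its arguments are hypothetical names, not the actual implementation.
gdescent_sketch <- function(Y, X, stepsize = 0.1, tol = 1e-6, maxit = 1e4) {
  # Minimal gradient descent for OLS (assumed form; not the gdescent() timed below).
  N <- nrow(X)
  beta_old <- rep(0, ncol(X))                   # start from the zero vector
  for (i in seq_len(maxit)) {
    g <- crossprod(X, X %*% beta_old - Y) / N   # gradient X^T (X beta - Y) / N
    if (sqrt(sum(g^2)) < tol) break             # stop once the gradient is small
    beta_old <- beta_old - stepsize * g         # update step
  }
  drop(beta_old)
}
Calling gdescent_sketch(Y, X) should return estimates close to the closed-form OLS solution above.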
profvis::profvis(beta <- gdescent(Y, X, print = F))
Stochastic Gradient Descent
As we can see, the %*% (together with the t()) in the gradient evaluation is very expensive, because it involves a sum over all the data points. Stochastic gradient descent instead evaluates the gradient on a single data point, or on a mini-batch (\(\ll N\)) of data points, at each update.
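Below is a minimal sketch of a mini-batch SGD update for the same least-squares objective, with a fixed step size and a fixed number of iterations. The sgd() timed below is the author's implementation and is not shown, so sgd_sketch is a hypothetical stand-in; only the MinBatch argument name is taken from the calls below.
sgd_sketch <- function(Y, X, stepsize = 0.1, MinBatch = 50, nsteps = 2000) {
  # Mini-batch stochastic gradient descent for OLS (assumed form).
  N <- nrow(X)
  beta <- rep(0, ncol(X))
  for (i in seq_len(nsteps)) {
    idx <- sample.int(N, MinBatch)                         # draw a random mini-batch
    Xb <- X[idx, , drop = FALSE]
    g <- crossprod(Xb, Xb %*% beta - Y[idx]) / MinBatch    # mini-batch gradient
    beta <- beta - stepsize * g                            # noisy gradient step
  }
  drop(beta)
}
In practice the step size is usually decreased over iterations (or momentum is added, as the momemtum argument below suggests) so that the noisy updates settle down.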
system.time(gdescent(Y, X))
## user system elapsed
## 2.154 0.124 2.304
system.time(sgd(Y, X, MinBatch = 1))
## user system elapsed
## 0.025 0.001 0.025
lm.ridge
print(MASS::lm.ridge(Y ~ -1 + X, lambda = 2 * N)$coef)
## X1 X2 X3 X4 X5
## 0.3251837 0.6615333 1.0026633 -0.6694237 -0.9919205
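The ridge penalty is differentiable, so only the gradient changes. Below is a minimal sketch of the penalized gradient, assuming the objective \(\frac{1}{2N}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\); the exact parameterization inside gdescent(method = "ridge") is not shown, so this scaling of \(\lambda\) is an assumption. Under it, lambda = 1 in gdescent roughly corresponds to lambda = 2 * N in lm.ridge (which minimizes \(\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\)), which would explain the lambda = 2 * N in the call above.
ridge_gradient <- function(Y, X, beta, lambda) {
  # Gradient of (1 / (2 * N)) * ||Y - X beta||^2 + lambda * ||beta||^2
  # (assumed parameterization; penalty term added elementwise to the data term).
  N <- nrow(X)
  drop(crossprod(X, X %*% beta - Y) / N + 2 * lambda * beta)
}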
system.time(gdescent(Y, X, method = "ridge", lambda = 1))
## user system elapsed
## 0.828 0.119 0.971
system.time(sgd(Y, X, MinBatch = 50, method = "ridge", lambda = 1, momemtum = T))
## user system elapsed
## 0.039 0.019 0.058
Lasso
The typical question with the Lasso is how to handle the gradient of \(|\beta|_1\), which is not differentiable at zero. I recommend reading about the proximal gradient method; it converges faster than the sub-gradient method.
Here is my solution (I am not completely sure about it):
\[\begin{align} &\beta \sim \text{Laplace}(0, 1 / \lambda) \\ &\epsilon \sim N(0, \sigma^2 I)\\ &Y = X\beta + \epsilon \quad \text{// Statistical model} \\ &\text{argmin}_{\beta}\ L(\beta | X) = \frac{1}{2N}(Y - X\beta)^T(Y - X\beta) + \lambda |\beta|_1 \quad \text{// Negative log-posterior} \\ &\nabla L(\beta_i | X) = X^T(X\beta_i - Y) / N \quad \text{// Gradient of the smooth part at } \beta_i \\ &\beta_{i + 1} = \beta_i -\gamma\nabla L(\beta_i | X) \quad \text{// Gradient step} \\ &\beta_{i+1} = \text{prox}(\beta_{i + 1}, \lambda \gamma) \quad \text{// Proximal step, applied elementwise}\\ &\text{prox}(a, b) = \left\{\begin{matrix} a - b & a > b\\ 0 & -b \le a \le b\\ a + b & a < -b \end{matrix}\right. \end{align}\]
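The piecewise function above is elementwise soft-thresholding. Here is a minimal R sketch of one proximal-gradient (ISTA-style) update built from it; gdescent(method = "lasso") and sgd(method = "lasso") below are the author's implementations and are not shown, so soft_threshold and lasso_step are hypothetical names.
soft_threshold <- function(a, b) {
  # Proximal operator of b * |.|_1, applied elementwise.
  sign(a) * pmax(abs(a) - b, 0)
}

lasso_step <- function(Y, X, beta, lambda, stepsize) {
  # One ISTA-style update: gradient step on the smooth part, then soft-threshold.
  N <- nrow(X)
  g <- crossprod(X, X %*% beta - Y) / N
  drop(soft_threshold(beta - stepsize * g, lambda * stepsize))
}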
N <- 5e4
p <- 50
X <- matrix(rnorm(N * p), N, p)
beta <- c(c(1.0, 2, 3.0, -2, -3), rep(0, 45))
Y <- X %*% beta + rnorm(N)
system.time(glmbeta <- glmnet::glmnet(x = X, y = Y, lambda = 0.5, standardize = F)$beta)
## user system elapsed
## 0.811 0.023 0.864
xname <- rownames(glmbeta)
glmbeta[1:5,]
## V1 V2 V3 V4 V5
## 0.4834522 1.4962422 2.4899560 -1.4963718 -2.4966073
system.time(beta <- gdescent(Y, X, method = "lasso", lambda = 1, momemtum = T, print = F))
## user system elapsed
## 9.537 1.240 10.966
names(beta) <- xname
beta[beta != 0]
## V1 V2 V3 V4 V5
## 0.483381 1.496108 2.489728 -1.496137 -2.496294
system.time(beta <- sgd(Y, X, method = "lasso", lambda = 1, MinBatch = 100, momemtum = T, print = F))
## user system elapsed
## 1.016 0.201 1.231
names(beta) <- xname
beta[beta != 0]
## V1 V2 V3 V4 V5
## 0.396657 1.412345 2.410041 -1.407589 -2.401663