Gradient Descent

Generally, our optimization problem can be viewed as follows: we minimize an objective function with respect to \(\theta\): \[\text{argmin}_{\theta}\ L(\theta | x) \quad \textbf{Objective}\ \text{function}.\] For simplicity, we only consider the class of convex functions (only one minimal point).

Goal

Generate a sequence of \(\theta_i\) such that \(L(\theta_1|x) \ge L(\theta_2|x) \ge \dots \ge L(\theta_i|x) \ge L(\theta_{i+1}|x) \ge \dots\)

Steps

  1. Calculate the gradient of the objective function, \(\nabla L(\theta_i | x)\)
  2. Move a step in the negative gradient direction, \(-\gamma\nabla L(\theta_i | x)\)
  3. Update \(\theta_{i + 1} = \theta_i - \gamma\nabla L(\theta_i | x)\)
  4. Check whether the algorithm has converged: is \(\nabla L(\theta_{i+1} | x)\) close to zero?
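
As a toy illustration of these steps, here is a minimal R sketch that minimizes the convex function \(L(\theta) = (\theta - 3)^2\) with a fixed step size; the function and all names here are illustrative and are not used in the examples below.

grad <- function(theta) 2 * (theta - 3)  # gradient of L(theta) = (theta - 3)^2
theta <- 0                               # initial value
gamma <- 0.1                             # step size
while (abs(grad(theta)) > 1e-8) {        # step 4: gradient close to zero?
  theta <- theta - gamma * grad(theta)   # steps 1-3: gradient, negative step, update
}
theta                                    # converges to the minimizer, 3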

Examples

Linear regression OLS

Let’s start with a simple example: coefficient estimation in linear regression (OLS).

\[\begin{align} &Y = X\beta + \epsilon, \ \epsilon \sim N(0, \sigma^2 I) \quad \text{// Statistical Model} \\ &\text{argmin}_{\beta}\ L(\beta | X) = (Y - X\beta)^T(Y - X\beta) \quad \text{// Negative log-likelihood} \\ \end{align}\]
  • OLS result: \[\hat{\beta}\ = (X^TX)^{-1}X^TY\]
  • Gradient descent: \[\begin{align} &\nabla L(\beta_i | X) = 2X^T(X\beta_i - Y) = 2X^TX\beta_i - 2X^TY\quad \text{// Gradient of } \beta_i \\ &\beta_{i + 1} = \beta_i -\gamma\nabla L(\beta_i | X) \quad \text{// Update step}\\ &\text{Check whether }\nabla L(\beta_{i+1} | X) \text{ is close to zero} \end{align}\]

Data

We use a simple simulated example to perform the task. Here we set N = 5e4 and p = 5.

\[\beta = \begin{bmatrix} 1\\ 2\\ 3\\ -2\\ -3\end{bmatrix} \quad \text{True}\] \[X \sim N(0, I)\]

N <- 5e4
p <- 5
X <- matrix(rnorm(N * p), N, p)
beta <- c(1.0, 2.0, 3.0, -2, -3)
Y <- X %*% beta + rnorm(N)

OLS result

solve(t(X) %*% X) %*% t(X) %*% Y
##           [,1]
## [1,]  1.006664
## [2,]  1.999297
## [3,]  3.001104
## [4,] -1.997833
## [5,] -2.991333

Gradient Descent result

If you are in STAT 628, please first implement gradient descent yourself; copying and pasting code will not help you learn anything…

Two things:

  1. Gradient. The gradient is defined by the objective function. glmnet uses the following objective function for linear regression: \[\frac{1}{2N}\sum\limits_{i = 1}^{N}(y_i - x_i\beta)^2 + \lambda |\beta|_p,\quad p \in \{1, 2\},\] so the gradient of the first term is \(X^T(X\beta_i - Y)/N\).
  2. Step size \(\gamma\). If the gradient is not calculated carefully, your algorithm will blow up, so \(\gamma\) is also important. See the Wolfe conditions; a simplified backtracking sketch follows this list.
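
A common practical shortcut to the full Wolfe conditions is backtracking on the sufficient-decrease (Armijo) condition, which is the first Wolfe condition. Below is a minimal sketch under that simplification; the helper name backtrack and its defaults are illustrative, not part of the code used later.

## Shrink gamma until the step beta - gamma * g decreases L enough (Armijo condition)
backtrack <- function(L, beta, g, gamma = 1, shrink = 0.5, c1 = 1e-4) {
  while (L(beta - gamma * g) > L(beta) - c1 * gamma * sum(g^2)) {
    gamma <- shrink * gamma
  }
  gamma
}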
gradient descent:

initialize beta_old
set stepsize

while norm(gradient at beta_old) > tolerance:
    g = gradient at beta_old
    beta_new = beta_old - stepsize * g    # update step
    beta_old = beta_new
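
For orientation only, a bare-bones fixed-step version of this loop for plain OLS could look like the sketch below. It is named gdescent_sketch to stress that it is illustrative; the gdescent profiled next (with its method, lambda, and momemtum options) is the one you should write yourself.

gdescent_sketch <- function(Y, X, stepsize = 0.1, tol = 1e-6, maxit = 1e4, print = TRUE) {
  N <- nrow(X)
  beta_old <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    g <- crossprod(X, X %*% beta_old - Y) / N   # gradient of (1 / (2N)) * ||Y - X beta||^2
    if (sqrt(sum(g^2)) < tol) break             # converged: gradient close to zero
    beta_old <- beta_old - stepsize * g         # update step
  }
  if (print) cat("iterations:", it, "\n")
  drop(beta_old)
}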
    
profvis::profvis(beta <- gdescent(Y, X, print = F))

Stochastic Gradient Descent

As we can see, the %*% and t calls in the gradient evaluation are very expensive, because they involve a sum over all the data points. Stochastic gradient descent instead evaluates the gradient on a single data point, or on a batch (of size \(\ll N\)) of data points, at each step.

Steps:

  1. Draw batch \(i\): a random subset of the data points
  2. Calculate the gradient \(\nabla L(\theta_i | X_{\text{batch } i})\)
  3. Move a step in the negative gradient direction, \(-\gamma\nabla L(\theta_i | X_{\text{batch } i})\)
  4. Update \(\theta_{i + 1} = \theta_i -\gamma\nabla L(\theta_i | X_{\text{batch } i})\)
  5. Check convergence (a minimal sketch follows this list)
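
The sgd function timed below is likewise left to you. A minimal mini-batch sketch for OLS with a decaying step size is shown here; the name sgd_sketch, the defaults, and the step-size schedule are illustrative, and the real sgd also takes the method, lambda, and momemtum options used later.

sgd_sketch <- function(Y, X, MinBatch = 1, stepsize = 0.1, niter = 1000) {
  N <- nrow(X)
  beta <- rep(0, ncol(X))
  for (i in seq_len(niter)) {
    idx <- sample(N, MinBatch)                             # draw a random mini-batch
    Xb  <- X[idx, , drop = FALSE]
    g   <- crossprod(Xb, Xb %*% beta - Y[idx]) / MinBatch  # mini-batch gradient
    beta <- beta - (stepsize / sqrt(i)) * g                # decaying step size
  }
  drop(beta)
}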
system.time(gdescent(Y, X))
##    user  system elapsed 
##   2.154   0.124   2.304
system.time(sgd(Y, X, MinBatch = 1))
##    user  system elapsed 
##   0.025   0.001   0.025

Ridge regression

\[\begin{align} &\beta \sim N(0, 1 / \lambda) \\ &\epsilon \sim N(0, \sigma^2)\\ &Y = X\beta + \epsilon \quad \text{// Statistical Model} \\ &\text{argmin}_{\beta}\ L(\beta | X) = \frac{1}{2N}(Y - X\beta)^T(Y - X\beta) + \lambda \beta^T\beta \quad \text{// Negative log-likelihood} \\ &\nabla L(\beta_i | X) = X^T(X\beta_i - Y) / N + 2\lambda\beta_i \quad \text{// Gradient of } \beta_i \\ &\beta_{i + 1} = \beta_i -\gamma\nabla L(\beta_i | X) \quad \text{// Update step} \end{align}\]
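
Compared with OLS, only the gradient changes: the penalty contributes \(2\lambda\beta\). An illustrative helper implementing the gradient above (the name ridge_grad is mine, not part of gdescent):

## Gradient of (1 / (2N)) * ||Y - X beta||^2 + lambda * t(beta) %*% beta
ridge_grad <- function(beta, Y, X, lambda) {
  crossprod(X, X %*% beta - Y) / nrow(X) + 2 * lambda * beta
}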

lm.ridge

library(MASS)
print(lm.ridge(Y ~ -1 + X, lambda = 2 * N)$coef)
##         X1         X2         X3         X4         X5 
##  0.3251837  0.6615333  1.0026633 -0.6694237 -0.9919205
system.time(gdescent(Y, X, method = "ridge", lambda = 1))
##    user  system elapsed 
##   0.828   0.119   0.971
system.time(sgd(Y, X, MinBatch = 50, method = "ridge", lambda = 1, momemtum = T))
##    user  system elapsed 
##   0.039   0.019   0.058

Lasso regression

The typical question related to Lasso is how to evaluate the gradient of \(|\beta|_1\). I recommend reading about the proximal gradient method; it is faster than the sub-gradient method.

Here is my solution (not entirely sure it is standard):

\[\begin{align} &\beta \sim \text{Laplace}(0, 1 / \lambda) \\ &\epsilon \sim N(0, \sigma^2)\\ &Y = X\beta + \epsilon \quad \text{// Statistical Model} \\ &\text{argmin}_{\beta}\ L(\beta | X) = \frac{1}{2N}(Y - X\beta)^T(Y - X\beta) + \lambda |\beta|_1 \quad \text{// Negative log-likelihood} \\ &\nabla L(\beta_i | X) = X^T(X\beta_i - Y) / N \quad \text{// Gradient of the smooth part at } \beta_i \\ &\beta_{i + 1} = \beta_i -\gamma\nabla L(\beta_i | X) \quad \text{// Gradient step} \\ &\beta_{i+1} \leftarrow \text{prox}(\beta_{i + 1}, \lambda\gamma) \quad \text{// Proximal step (elementwise)}\\ &\text{prox}(a, b) = \begin{cases} a - b & a > b\\ 0 & -b \le a \le b\\ a + b & a < -b \end{cases} \end{align}\]
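
The proximal step above is elementwise soft-thresholding. An illustrative R version of that operator (the name soft_threshold is mine, not part of gdescent or sgd):

## Proximal operator of b * |.|_1: shrink each coordinate toward zero by b
soft_threshold <- function(a, b) {
  sign(a) * pmax(abs(a) - b, 0)
}
## One proximal gradient step after the ordinary gradient update:
## beta_new <- soft_threshold(beta_old - gamma * g, lambda * gamma)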
N <- 5e4
p <- 50
X <- matrix(rnorm(N * p), N, p)
beta <- c(c(1.0, 2, 3.0, -2, -3), rep(0, 45))
Y <- X %*% beta + rnorm(N)
system.time(glmbeta <- glmnet::glmnet(x = X, y = Y, lambda = 0.5, standardize = F)$beta)
##    user  system elapsed 
##   0.811   0.023   0.864
xname <- rownames(glmbeta)
glmbeta[1:5,]
##         V1         V2         V3         V4         V5 
##  0.4834522  1.4962422  2.4899560 -1.4963718 -2.4966073
system.time(beta <- gdescent(Y, X, method = "lasso", lambda = 1, momemtum = T, print = F))
##    user  system elapsed 
##   9.537   1.240  10.966
names(beta) <- xname
beta[beta != 0]
##        V1        V2        V3        V4        V5 
##  0.483381  1.496108  2.489728 -1.496137 -2.496294
system.time(beta <- sgd(Y, X, method = "lasso", lambda = 1, MinBatch = 100, momemtum = T, print = F))
##    user  system elapsed 
##   1.016   0.201   1.231
names(beta) <- xname
beta[beta != 0]
##        V1        V2        V3        V4        V5 
##  0.396657  1.412345  2.410041 -1.407589 -2.401663