```{r setup, include=FALSE, message=FALSE, results='hide', echo=FALSE}
knitr::opts_chunk$set(fig.align = "center", fig.pos = "ht", cache = FALSE,
                      message = FALSE, warning = FALSE, echo = TRUE)
set.seed(1)
```

# Gradient Descent

Generally, our optimization problem can be viewed as follows: we minimize the **objective** function with respect to $\theta$,
\[\text{argmin}_{\theta}\ L(\theta | x) \quad \textbf{Objective}\ \text{function}.\]
For simplicity, we only consider the class of convex functions, which have a single minimum.

**Goal** Generate a sequence of $\theta_i$ such that
\[L(\theta_1|x) \ge L(\theta_2|x) \ge \dots \ge L(\theta_i|x) \ge L(\theta_{i+1}|x) \ge \dots\]

**Steps**

1. Calculate the gradient of the objective function, $\nabla L(\theta_i | x)$.
2. Take a step in the negative gradient direction, $-\gamma\nabla L(\theta_i | x)$.
3. Update $\theta_{i + 1} = \theta_i - \gamma\nabla L(\theta_i | x)$.
4. Check convergence: is $\nabla L(\theta_i | x)$ close to zero?

## Examples

### Linear regression OLS

Let's start with a simple example: coefficient estimation in linear regression (OLS).
\begin{align}
&Y = X\beta + \epsilon, \ \epsilon \sim N(0, \sigma^2 I) \quad \text{// Statistical model} \\
&\text{argmin}_{\beta}\ L(\beta | X) = (Y - X\beta)^T(Y - X\beta) \quad \text{// Negative log-likelihood}
\end{align}

- OLS result:
\[\hat{\beta} = (X^TX)^{-1}X^TY\]
- Gradient descent:
\begin{align}
&\nabla L(\beta_i | X) = 2X^T(X\beta_i - Y) = 2X^TX\beta_i - 2X^TY \quad \text{// Gradient at } \beta_i \\
&\beta_{i + 1} = \beta_i - \gamma\nabla L(\beta_i | X) \quad \text{// Update step} \\
&\text{Check whether } \nabla L(\beta_i | X) \text{ is close to zero}
\end{align}

### Data

We use a simple simulated example to perform the task. Here we set `N = 5e4` and `p = 5`.
\[\beta = \begin{bmatrix} 1\\ 2\\ 3\\ -2\\ -3\end{bmatrix} \quad \text{True}\]
\[X \sim N(0, I)\]

```{r, collapse=FALSE}
N <- 5e4
p <- 5
X <- matrix(rnorm(N * p), N, p)
beta <- c(1.0, 2.0, 3.0, -2.0, -3.0)
Y <- X %*% beta + rnorm(N)
```

### `OLS` result

```{r}
solve(t(X) %*% X) %*% t(X) %*% Y
```

### `Gradient Descent` result

If you are in **STAT 628**, please implement gradient descent yourself first; copy-pasting code will not teach you anything.

**Two things:**

1. `Gradient`. The gradient is defined by the objective function. `glmnet` uses the following objective for linear regression:
\[\frac{1}{2N}\sum\limits_{i = 1}^{N}(y_i - x_i\beta)^2 + \lambda |\beta|_p, \quad p \in \{1, 2\}.\]
The gradient of the first (squared-error) part is then $X^T(X\beta_i - Y)/N$.
2. `StepSize` $\gamma$. If your gradient is not scaled carefully (for example, a missing $1/N$ factor), the algorithm will blow up, so the choice of $\gamma$ matters just as much; see the sketch below.
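One standard recipe for choosing $\gamma$ adaptively is a backtracking line search: start with a large step and shrink it until the sufficient-decrease (Armijo) condition holds. Below is a minimal sketch on a toy quadratic; `f`, `g`, and `backtrack` are illustrative names, not part of the code used elsewhere in these notes.

```{r}
# Toy quadratic objective and its gradient (minimum at b = c(3, -1)).
f <- function(b) sum((c(3, -1) - b)^2)
g <- function(b) -2 * (c(3, -1) - b)

# Shrink gamma until the sufficient-decrease (Armijo) condition holds
# along the steepest-descent direction -grad.
backtrack <- function(f, grad, b, gamma = 1, shrink = 0.5, c1 = 1e-4) {
  while (f(b - gamma * grad) > f(b) - c1 * gamma * sum(grad^2)) {
    gamma <- gamma * shrink
  }
  gamma
}

backtrack(f, g(c(0, 0)), c(0, 0))  # a safe step size starting from b = c(0, 0)
```

The Armijo condition is the first of the two Wolfe conditions; the second (curvature) condition additionally rules out steps that are too small.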
See the [Wolfe Condition](https://en.wikipedia.org/wiki/Wolfe_conditions) article for details.

```{r, echo = FALSE}
# Gradient of the scaled squared-error loss; for "ridge" the penalty gradient
# is added, while for "lasso" the penalty is handled by the proximal step,
# so only the smooth part is returned.
gradient <- function(y, x, beta, method = "normal", lambda = 1) {
  if (is.null(nrow(x))) {
    x <- as.matrix(t(x))  # for stochastic gradient descent with batch size 1
  }
  grad <- t(x) %*% (x %*% beta - y) / length(y)
  if (method == "normal") {
    grad
  } else if (method == "ridge") {
    grad + 2 * lambda * beta
  } else if (method == "lasso") {
    grad
  }
}

# Objective value, used to monitor progress and to detect an oversized step.
loss <- function(y, x, beta, method = "normal", lambda = 1) {
  if (is.null(nrow(x))) {
    x <- as.matrix(t(x))
  }
  mse <- mean((y - x %*% beta)^2)
  if (method == "normal") {
    mse
  } else if (method == "ridge") {
    mse + lambda * sum(beta^2)
  } else if (method == "lasso") {
    mse + lambda * sum(abs(beta))
  }
}

# Soft-thresholding operator: the proximal map of the l1 penalty.
prox_l1 <- function(beta, lambda) {
  max(c(0, beta - lambda)) - max(c(0, -beta - lambda))
}
```

```{r, echo = FALSE}
gdescent <- function(Y, X, step = 0.01, maxIter = 2 * length(Y), maxTol = 1e-10,
                     method = "normal", lambda = 1, print = FALSE, plot = FALSE,
                     momentum = FALSE) {
  gamma <- step
  alpha <- 0.5            # momentum weight on the previous update
  iter <- maxIter
  tol <- maxTol
  p <- ncol(X)
  beta_i <- rnorm(p)      # random starting point
  loss_c <- rep(NA, iter)
  delta_i <- 0
  for (i in 1:iter) {
    delta_new <- -gamma * gradient(Y, X, beta_i, method, lambda = lambda)
    if (momentum) {
      # Momentum: carry over a fraction of the previous update.
      delta_new <- delta_new + alpha * delta_i
    }
    beta_new <- beta_i + delta_new
    if (method == "lasso") {
      # Proximal step: soft-threshold each coordinate.
      for (j in 1:p) beta_new[j] <- prox_l1(beta_new[j], lambda * gamma)
    }
    loss_c[i] <- loss(Y, X, beta_i, method, lambda)
    if (loss_c[i] < loss(Y, X, beta_new, method, lambda)) {
      # The step increased the loss: shrink the step size and retry.
      gamma <- gamma * 0.9
      alpha <- alpha * 0.99
      if (print) {
        print(paste0("Iter: ", i, ", step size too large, shrinking to: ", gamma))
      }
      next
    }
    if (sum((beta_i - beta_new)^2) < tol) {
      if (print) {
        print(paste0("Iter: ", i, " Loss: ", loss_c[i]))
      }
      loss_c[i + 1] <- loss(Y, X, beta_new, method, lambda)
      break
    } else {
      beta_i <- beta_new
      delta_i <- delta_new
    }
  }
  if (plot) {
    plot(loss_c[1:i])
  }
  if (print) {
    print(paste0("Iter: ", i, ", Loss: ", loss_c[i], ", Beta: "))
    print(t(beta_new))
  }
  beta_new
}
```

```
gradient descent:
  initial beta_old
  set stepsize
  while gradient > tolerance:
    g = gradient at beta_old
    update beta_new = beta_old - stepsize * g
    beta_old = beta_new
```

```{r}
profvis::profvis(beta <- gdescent(Y, X, print = FALSE))
```

### `Stochastic Gradient Descent`

As the profile shows, `%*%` (together with the `t()` call in the gradient evaluation) is the expensive step, because it sums over all $N$ data points. Stochastic gradient descent instead evaluates the gradient on a single data point, or on a batch ($\ll N$) of data points, at each update.

### Steps:

1. Choose a batch size and sample batch $i$, a small random subset of the data points.
2. Calculate the gradient $\nabla L(\theta_i | X_{\text{batch } i})$.
3. Take a step in the negative gradient direction, $-\gamma\nabla L(\theta_i | X_{\text{batch } i})$.
4. Update $\theta_{i + 1} = \theta_i - \gamma\nabla L(\theta_i | X_{\text{batch } i})$.
5. Check convergence.
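A single stochastic update is cheap because the matrix products only touch the sampled rows. Here is a minimal sketch of one update, reusing the `gradient()` helper defined above; the batch size of 10 and the step size of 0.01 are illustrative choices, not tuned values.

```{r}
beta_i <- rnorm(ncol(X))                 # arbitrary starting point
idx <- sample.int(nrow(X), 10)           # draw a random mini-batch of 10 rows
g <- gradient(Y[idx], X[idx, ], beta_i)  # cost O(batch * p) instead of O(N * p)
beta_i <- beta_i - 0.01 * g              # one stochastic gradient update
```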
```{r, echo = FALSE}
sgd <- function(Y, X, MinBatch = 10, momentum = FALSE, step = 0.01,
                maxIter = 5 * length(Y), maxTol = 1e-10, method = "normal",
                lambda = 1, print = FALSE, plot = FALSE) {
  gamma <- step
  alpha <- 0.5
  iter <- maxIter
  tol <- maxTol
  p <- ncol(X)
  N <- nrow(X)
  batch <- MinBatch
  loss_c <- rep(NA, iter)
  beta_i <- rnorm(p)
  delta_i <- 0
  for (i in 1:iter) {
    # Each update only sees a random mini-batch, not the full data set.
    curIndex <- sample.int(N, batch)
    delta_new <- -gamma * gradient(Y[curIndex], X[curIndex, ], beta_i, method, lambda)
    if (momentum) {
      delta_new <- delta_new + alpha * delta_i
    }
    beta_new <- beta_i + delta_new
    if (method == "lasso") {
      # Proximal step: soft-threshold each coordinate.
      for (j in 1:p) beta_new[j] <- prox_l1(beta_new[j], lambda * gamma)
    }
    loss_c[i] <- loss(Y[curIndex], X[curIndex, ], beta_i, method, lambda)
    if (loss_c[i] < loss(Y[curIndex], X[curIndex, ], beta_new, method, lambda)) {
      # The batch loss increased: shrink the step size and retry.
      gamma <- gamma * 0.9
      alpha <- alpha * 0.99
      if (print) {
        print(paste0("Iter: ", i, ", step size too large, shrinking to: ", gamma))
      }
      next
    }
    if (sum((beta_i - beta_new)^2) < tol) {
      break
    } else {
      beta_i <- beta_new
      delta_i <- delta_new
    }
  }
  if (plot) {
    plot(loss_c[1:i])
  }
  if (print) {
    print(paste0("Iter: ", i, ", Loss: ", loss(Y, X, beta_i, method), ", Beta: "))
    print(t(beta_new))
  }
  beta_new
}
```

```{r}
system.time(gdescent(Y, X))
system.time(sgd(Y, X, MinBatch = 1))
```

# Ridge regression

\begin{align}
&\beta \sim N(0, I / \lambda) \\
&\epsilon \sim N(0, \sigma^2)\\
&Y = X\beta + \epsilon \quad \text{// Statistical model} \\
&\text{argmin}_{\beta}\ L(\beta | X) = \frac{1}{2N}(Y - X\beta)^T(Y - X\beta) + \lambda \beta^T\beta \quad \text{// Negative log-likelihood} \\
&\nabla L(\beta_i | X) = X^T(X\beta_i - Y) / N + 2\lambda\beta_i \quad \text{// Gradient at } \beta_i \\
&\beta_{i + 1} = \beta_i - \gamma\nabla L(\beta_i | X) \quad \text{// Update step}
\end{align}

```{r, echo = FALSE}
library(MASS)
```

### `lm.ridge`

Note that `lm.ridge` penalizes the unscaled residual sum of squares, so its `lambda` must be $2N$ times the $\lambda$ in our objective.

```{r}
print(lm.ridge(Y ~ -1 + X, lambda = 2 * N)$coef)
system.time(gdescent(Y, X, method = "ridge", lambda = 1))
system.time(sgd(Y, X, MinBatch = 50, method = "ridge", lambda = 1, momentum = TRUE))
```

# Lasso regression

The typical question with the lasso is how to evaluate the gradient of $|\beta|_1$, which is not differentiable at zero. I recommend reading about [proximal gradient](https://en.wikipedia.org/wiki/Proximal_gradient_methods_for_learning) methods; they converge faster than the sub-gradient method. Here is my solution (not fully verified):

\begin{align}
&\beta \sim \text{Laplace}(0, 1 / \lambda) \\
&\epsilon \sim N(0, \sigma^2)\\
&Y = X\beta + \epsilon \quad \text{// Statistical model} \\
&\text{argmin}_{\beta}\ L(\beta | X) = \frac{1}{2N}(Y - X\beta)^T(Y - X\beta) + \lambda |\beta|_1 \quad \text{// Negative log-likelihood} \\
&\nabla L(\beta_i | X) = X^T(X\beta_i - Y) / N \quad \text{// Gradient of the smooth part at } \beta_i \\
&\tilde{\beta}_{i + 1} = \beta_i - \gamma\nabla L(\beta_i | X) \quad \text{// Gradient step} \\
&\beta_{i + 1} = \text{prox}(\tilde{\beta}_{i + 1}, \lambda\gamma) \quad \text{// Proximal step, applied coordinate-wise} \\
&\text{prox}(a, b) = \begin{cases} a - b & a > b\\ 0 & -b \le a \le b\\ a + b & a < -b \end{cases}
\end{align}
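The `prox_l1()` helper defined earlier implements exactly this soft-thresholding map. A quick sanity check on a few arbitrary inputs, with threshold $b = 0.5$:

```{r}
# Coordinates inside [-0.5, 0.5] are zeroed; the rest shrink toward zero by 0.5.
sapply(c(-2, -0.3, 0, 0.3, 2), prox_l1, lambda = 0.5)
```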
```{r}
N <- 5e4
p <- 50
X <- matrix(rnorm(N * p), N, p)
beta <- c(c(1.0, 2.0, 3.0, -2.0, -3.0), rep(0, 45))
Y <- X %*% beta + rnorm(N)
```

Note that the $\lambda$ in `glmnet` is not on the same scale as the $\lambda$ passed to `gdescent()` and `sgd()` below, because the two objectives normalize the squared-error term differently.

```{r}
system.time(glmbeta <- glmnet::glmnet(x = X, y = Y, lambda = 0.5, standardize = FALSE)$beta)
xname <- rownames(glmbeta)
glmbeta[1:5, ]
```

```{r}
system.time(beta <- gdescent(Y, X, method = "lasso", lambda = 1, momentum = TRUE, print = FALSE))
names(beta) <- xname
beta[beta != 0]
```

```{r}
system.time(beta <- sgd(Y, X, method = "lasso", lambda = 1, MinBatch = 100, momentum = TRUE, print = FALSE))
names(beta) <- xname
beta[beta != 0]
```
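As a final sanity check, one can line up the estimates of the five truly nonzero coefficients across methods; the comparison below assumes the chunks above were run in order, so `glmbeta` holds the `glmnet` fit and `beta` the SGD fit. Exact agreement is not expected, since the lasso shrinks coefficients and the penalties are scaled differently.

```{r}
# First five coordinates: true values vs. the glmnet and SGD lasso estimates.
cbind(truth = c(1, 2, 3, -2, -3), glmnet = glmbeta[1:5, ], sgd = beta[1:5])
```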