# Differentiation & Optimization

A Naive Introduction to Algorithms: Part II

BB

### Introduction

Optimization is a VERY BIG, VERY IMPORTANT SUBJECT right now thanks to the popularity of MACHINE LEARNING Capital letters are a way of expressing how breathlessly excited I am.. . However, it has been around in one form or another for centuries. What has changed recently, and by 'recently', I mean the past 10, 15 years maximum, is the increase in cheap computing power (e.g., memory, graphics cards, distributed systems, etc.) that make large scale "machine learning" algorithms possible. Modern machine learning methods also have the reputation of being 'black boxes', although they don't necessarily need to be. In particular, by studying the underlying computer code, I think it is possible to get a feel for what is happening in mathmatical terms in a way that more traditional, or theoretical approaches may lack.

That isn't to say one will be able to become comfortable with optimization techniques without any math as that simply isn't true. Still, while the domain-specfic terminology and greek symbols can be daunting at first, basic high school algebra and geometry are more than enough to serve as a basis for learning the small amount of trigonometry and (larger) amount of calculus necessary to get started in optimization.

For this piece, like the previous one, I'm going to stay fairly vanilla Read I couldn't handle anything more complicated... and take a look at the what is actually happening behind the scenes with logistic regression in terms of maximum likelihood estimation. Obviously, there are a lot of ways to optimize the log likelihood, but here I'll go with Newton's Method as it gives a slightly better exposure to finite difference methods than something like gradient descent When your function is twice differentiable, it usually tends to be considerably faster too! . Again, like the linear algebra piece, there won't be a ton of math, but some build-up is necessary for a discussion of finite difference methods.

Following this, I'll discuss how finite differences can be used to estimate the gradient vector and the Hessian matrix. These then are plugged into Newton's Method, which, in turn, is used to find the coefficients for the logistic regression equation by minimizing the log likelihood function.

## Differentiation

### Finite Difference Methods

Nothing I write here can have the slightest pretense of being complete. The notion of a derivative and partial derivative, Taylor Series, numerical approximation error, etc. are massive subject areas. Luckily, there are great resources for each of them so like everything else on this site I'll touch on each briefly, pointing out helpful references along the way.

To start, here is the basic formula for the derivative (e.g., slope of the tangent line to a curve at a point $$x_{o}$$, $\lim_{x \to\ x_{o}} f'(x) = \frac{f(x) - f(x_{o})}{x - x_{o}}$ Of course, the stupid computer doesn't understand the meaning of infinity, so we need to use an approximation. How? A very small number, $$h$$, $f'(x) \approx \frac{f(x_{o} + h) - f(x_{o})}{h}$ This is known as the forward-difference formula Note that it is only an approximation as I omit discussion of approximation error for space, time, etc. considerations... . More commonly, the central-difference formula is used as it is more accurate. $f'(x) \approx \frac{f(x_{o} + h) - f(x_{o} - h)}{2h}$ You can use the same technique to estimate the second deriative Derivative (e.g., change) of first derivative... as well. $f''(x) \approx \frac{f(x_{o} + h) - f(x_{o}) + f(x_{o} - h)}{h^2}$ To continually tie everything back to code (e.g., R), let's see how these look in practice...

f <- function(x) x^2 + 3*x

myderiv <- function(f,x,h=.0001) {

d <- (f(x + h) - f(x - h))/(2*h)
d
}

myderiv(f,5)

## [1] 13

pracma::fderiv(f,5,n = 1) # more simple than base R, see ?praca::fderiv

## [1] 13

myderiv2 <- function(f,x,h) {

d2 <- (f(x + h) - (2* f(x)) + f(x - h))/(h^2)
d2
}

myderiv2(f,5,h=.0001)

## [1] 2

pracma::fderiv(f,5,n = 2) # second derivative

## [1] 2

# SMALLER ISN'T BETTER for your step size, h!

myderiv2(f,5,h=1e-08)

## [1] 71.05427

pracma::fderiv(f,5,2,h=1e-08)

## [1] 71.05427


This last demonstration brings us into more complicated territory very quickly. What is important to know, though, is that there is a continual tradeoff between approximation errors and what are known as condition, or roundoff errors. Although I omit a detailed discussion of the former See ?pracma::fderiv and ?numDeriv::grad from the pracma and numDeriv packages respectively for more information on step-size or the following for a more technical explanation... , it is necessary to touch on the latter, at least briefly.

For this, we turn to Taylor series as a way to estimate function values at other points, and corresponding error, given only derivative information at a single point. The (truncated) Taylor series formula is, $\sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x-x_o)^k$ with truncation error denoted as, $\mathcal{O}(\Delta x)^2$ Basically, what Taylor series does is 'expand about' a point to approximate a function value using a an infinite number of polynomial terms by making use of increasingly higher order derivative information See this handout for more information... . So, the Taylor series approximation using just first derivative information would be, $f(a)+f'(a)(x-a)$ whereas including the second derivative leads to, $(f(a)+f'(a)(x-a)+\frac{f''(a)}{2!}(x-a)^2$ ... and so on. A quick toy example may help make this more clear...

f <- function(x) x^3 + 3*x^2 + 6

x <- 5
x0 <- 5.1

f(x)

## [1] 206

# now, what about x0?

f(x0)

## [1] 216.681

# now via taylor series:

# only 1st derivative

f(x) + myderiv(f,x,h=.0001) * (x0 - x)

## [1] 216.5

# now with second
f(x) + myderiv(f,x,h=.0001) * (x0 - x) + (myderiv2(f,x,h=.0001)/2) * ((x0 - x)^2)

## [1] 216.68

# and now the obligatory example using pracma::fderiv

library(pracma)
x <- 1
x0 <- 1.5

f <- function(x) x^exp(1)

f(1)

## [1] 1

f(x0)

## [1] 3.010687

f(x) + myderiv(f,x,1) * (x0 - x)

## [1] 2.645221

f(x) + fderiv(f,x,1) * (x0 - x) + (fderiv(f,x,2)/2) * ((x0 - x)^2)

## [1] 2.942988

f(x) + fderiv(f,x,1) * (x0 - x) + (fderiv(f,x,2)/2) * ((x0 - x)^2) +
(fderiv(f,x,3)/3) * ((x0 - x)^3)

## [1] 3.082777

f(x) + fderiv(f,x,1) * (x0 - x) + (fderiv(f,x,2)/2) * ((x0 - x)^2) +
(fderiv(f,x,3)/3) * ((x0 - x)^3) + (fderiv(f,x,4)/4) * ((x0 - x)^4)

## [1] 3.068009


Of course, this example is trivial as we need to numerically evaluate the function to get our derivative estimate, but this does allow us to see how the error grows smaller as n increases. Again, there is a lot more to this, but this gives the basic idea. Now, we turn to partial derivatives as in the REAL WORLD, most functions are multivariate.

### The Gradient and the Hessian

The gradient, $$\nabla f$$ is just a vector of partial derivatives and while it doesn't sound like anything exiting, it forms the basis for a considerable number of optimization techniques. The Hessian, $$\mathbf{H_{i,j}} = \frac{\partial^{2} f}{\partial x_{i} \partial x_{j} }$$ ,on the other hand, is a matrix with all combinations of second order partial derivatives See here..., which is used to determine whether a convex function is at the minimum or maximum Wikipedia link....

Neither is very difficult to calculate given the aforementioned formulas for finite difference methods, however, the notion of 'holding a variable constant can be a little tricky to do, or do cleanly, at least. The key to what actually happens is understanding what is happening in terms of arithmetic for each of the different terms.

In very general algorithmic terms, the gradient would be as follows, \begin{align*} &\textbf{for}\ (i\ in\ 1:length(\theta))\\ & \hspace{.5cm} \nabla\ gradient_{i} = \frac{(f(\theta_{[i+h]}) - f(\theta_{[i-h]}))}{(2*h)}\\ &\textbf{end loop} \end{align*} For the Hessian, one would use the standard second partial derivative formula for diagonals, $\frac{\partial f}{\partial \theta_{i}^{2}} = \frac{f(\theta_{i+h,\ j}) - 2f(\theta_{i,\ j}) + f(\theta_{i-h,\ j}) }{h^2}$ and then the mixed partial derivative for all lower(upper) triangular entries, $\frac{\partial^{2} f}{\partial \theta_{i} \partial \theta_{j}} = \frac{(\theta_{i+h,\ j+h}) - (\theta_{i+h,\ j-h}) - (\theta_{i-h,\ j+h}) + (\theta_{i-h,\ j-h})}{(4h^2)}$

In practice, there are a few ways to do this for both the gradient and Hessian, but in any case, you will be using some form of iteration. My initial gradient function is as follows,

mygrad <- function(f, x0, h = 10e-06) {

len <- length(x0)

for (i in 1:len) {

plus_h <- replace(x0,i,x0[i] + h) # replace element and return vector
minus_h <- replace(x0,i,x0[i] - h)

gradout[i] <- (f(plus_h) - f(minus_h))/(2 * h)

}

}

# sample multivariable function below

mvf <- function(x0) {

x <- x0[1]
y <- x0[2]
z <- x0[3]

out <- x^3 + y^2 + z
out
}

x0 <- c(1,1,1)

mvf(x0)

## [1] 3

mygrad(mvf,x0)

## [1] 3 2 1

pracma::grad(mvf,x0)

## [1] 3 2 1


It is important to note that the pracma package's implementation is faster and more efficient however, below I show just the vectorized for-loop strategy that iteratively alternates how 'h', the step-size, is either added or subtracted to each element.

heps = .Machinedouble.eps^(1/3) hh <- rep(0, n) gr <- numeric(n) for (i in 1:n) { hh[i] <- heps # replace specific element of vector # use vectorized addition/subtraction below gr[i] <- (f(x0 + hh) - f(x0 - hh))/(2 * heps) hh[i] <- 0  I thought this technique was kind of cute so I implemented a similar version for the Hessian In the package author(s)' version, a matrix with the step-size along the diagonals is created beforehand and then iteratively subsetted depending on vector position, see pracma::hessian. Note also that is the "correct" way as it is based on the evenly-spaced grid approach found in numerical analysis textbooks, but I digress... . Because of the need to do all possible combinations for the mixed partial derivatives, I created three vectors in advance, one for plus-minus, one for minus-plus, and one for either plus-plus or minus-minus. My implementation is below... myhess <- function(func,x0,h=.001) { f <- func len <- length(x0) H <- diag(0,len) # double for loop for lower diagonal entries for (j in 1:(len - 1)) { plusminus <- numeric(len) minusplus <- numeric(len) samesign <- numeric(len) plusminus[j] <- h minusplus[j] <- -h samesign[j] <- h for (i in (j + 1):len){ plusminus[i] <- -h minusplus[i] <- h samesign[i] <- h H[i,j] <- (f(x0 + samesign) - f(x0 + plusminus) - f(x0 + minusplus) + f(x0 + (-1 * samesign)))/(4 * h^2) H[j,i] <- H[i,j] # clears the prior value plusminus[i] <- 0 minusplus[i] <- 0 samesign[i] <- 0 } samesign[len-1] <- 0 } # diagonal entries for (i in 1:len) { samesign[i] <- h H[i,i] <- (f(x0 - samesign) - (2*f(x0)) + f(x0 + samesign)) / (h^2) samesign[i] <- 0 } H }  To test this out, we'll use a multidimensional version the famous Rosenbrock Banana Function This function was specifically designed for testing optimization techniques as it has numerous valleys... as we will use it in the next section. # taken from: https://www.sfu.ca/~ssurjano/Code/rosenr.html rosen <- function(xx) { d <- length(xx) xi <- xx[1:(d-1)] xnext <- xx[2:d] sum <- sum(100*(xnext-xi^2)^2 + (xi-1)^2) y <- sum return(y) } x0 <- c(3,7) rosen(x0)  ## [1] 404  myhess(rosen,x0)  ## [,1] [,2] ## [1,] 8002 -1200 ## [2,] -1200 200  pracma::hessian(rosen,x0)  ## [,1] [,2] ## [1,] 8002 -1200 ## [2,] -1200 200  ## Optimization ### Newton's Method Finally on to the good stuff! Newton's Method is but one of many different optimization techniques. It uses both first and second derivative information in order to find the minimum, or roots, of the function. Assuming the function is convex and continously differentiable near the root, this method will have quadratic convergence which is very fast A technical overview.... On the other hand, computing the full derivative information can be quite expensive so this is a major downside. Still, for our purposes (e.g., logistic regression) it will work just fine and in my opinion, provides an excellent introduction to optimization. Also, if you understand how Newton's method works, then you will easily be able to implement other derivative-based methods like gradient descent going forward. In pseudocode, the algorithm is as follows, \begin{align*} &i=0\\ &iterations = 100\\ &tol = 1e{-06} \\ &\textbf{while}\ (tol > ||\nabla f(x_i)||):\\ &\hspace{.5cm}i = i + 1\\ &\hspace{.5cm}\textbf{compute}\ x_{i+1} = x_i -H^{-1}\nabla f(x_i)\\ &\hspace{1cm}\textbf{if}\ (i\geq iterations)\ then\ \textbf{stop}\\ &\hspace{.5cm}\textbf{else}\ \hat{x} = x_i\\ &\textbf{return}(\hat{x}) \end{align*} where $$tol$$ is tolerance threshold for termination, $$||\nabla f(x_i)||$$ is the Euclidean norm of the gradient, and $$H^{-1}$$ is the inverse of the Hessian. Compared to finite differences, implementing this in code is a walk in the park. All thats needed is just to set up loops, tolerance, iterations etc., and then test it out Do note that I use a number of functions from the previous section here as I wanted to stay as true to the notion of "from scratch" as possible (still, not very...). If you are interested in creating a more efficient version, Base R functions are bound to be far faster and the same is true for code in Rcpp... ... mynewt <- function(f,x0,eps=1e-06,N=100){ i <- 0 euc <- 1 xhat <- x0 norm_vec <- function(x) sqrt(sum(x^2)) while ((euc > eps) == TRUE) { i <- i + 1 hessinvert <- inv(myhess(f,xhat)) gradvec <- as.matrix(mygrad(f,xhat)) xhat <- xhat - matmult(hessinvert,gradvec) euc <- norm_vec(mygrad(f,xhat)) if (i >= N) { warning('Max iterations reached, returning control...') break } } est <- xhat attributes(est) <- list(iter=i) est } mynewt(rosen,c(2,3,4))  ## [1] 1 1 1 ## attr(,"iter") ## [1] 11  OK, let's test some other functions Courtesy of the fine people at Simon Fraser University..... spheref <- function(xx) { sum <- sum(xx^2) y <- sum return(y) } spheref(c(2,3,4))  ## [1] 29  mynewt(spheref,c(2,3,4))  ## [1] -1.509903e-09 -2.309263e-09 1.127987e-08 ## attr(,"iter") ## [1] 1  optim(c(2,3,4),spheref)  ##par
## [1]  0.0000438641 -0.0002905961  0.0004100185
##
## $value ## [1] 2.544853e-07 ## ##$counts
##      104       NA
##
## $convergence ## [1] 0 ## ##$message
## NULL

himmelblau <- function(x0) {
x <- x0[1]
y <- x0[2]

out <- (x^2 + y -11)^2 + (x + y^2 - 7)^2
out
}

# https://en.wikipedia.org/wiki/Himmelblau%27s_function

himmelblau(c(1,1))

## [1] 106

mynewt(himmelblau,c(-5,8))

## [1] -2.805118  3.131313
## attr(,"iter")
## [1] 7

mynewt(himmelblau,c(-10,-10))

## [1] -3.779310 -3.283186
## attr(,"iter")
## [1] 7

mynewt(himmelblau,c(0,0)) # single global maximum

## [1] -0.2708446 -0.9230386
## attr(,"iter")
## [1] 4


### Logistic Regression

Finally, its time to test mynewt out by optimizing the log-liklihood for logistic regression as Newton's Method is frequently used as the optimizer in that context. Considering that the internet is knee deep in 'from-scratch' logistic regression examples in multiple languages, I'm going to deliberately skimp on the details. Briefly, as opposed to OLS regression, logistic regression allows one to predict a dichotomous, categorical outcome There are a number of great tutorials including Andrew Ng's notes and this guy's as well as full-length books, (e.g., An Introduction to Statistical Learning), available for free... . Moreover, unlike OLS regression, there is no analytical solution in terms of error minimization so it is necessary to use an iterative approach.

In addition to what was previously covered, you will need to know two equations Or at least where to look them up and copy them from... . The first is the inverse-logit, or logistic sigmoid, function which is what gives us a 0/1 outcome as a conditional probability Greater detail here... (Author also has a good (Python) book on the subject... , $g(z) = \frac{1}{1 + exp(-z)}$ The second is the 'cost' function which is what we want to minimize Overview of logistic regression cost function...., $J(\theta)= \frac{1}{m} \sum_{i=1}^m[-y_i log(h_\beta (x_i) - (1 - y_i) log(1-h_\beta (x_i))$ Assuming Newton's Method convergences, the values it returns that minimize the above equation will be our logistic regression coefficients. Coding it up is fairly straightforward with the caveat that you will need to either use reference objects in the global environment create a function wrapper to manage scope. For this example, I use the nycflights13::flights My custom matrix multiplication function makes everything brutally slow so I'm only using a subset. Again, if you actually wanted to make your own, you would want to use base R or Rcpp.. dataset and create a dichotomous variable indicating whether a flight was more than 20 minutes delayed or not.

df <- na.omit(nycflights13::flights)
attributes(df)$class <- "data.frame" # starts as tibble subrows <- sample(nrow(df),10000) x <- as.matrix(df[subrows,c('month','day','distance')]) x <- cbind(1,x) # intercept y <- ifelse(df$dep_delay[subrows] > 20,1,0)

theta <- numeric(dim(x)[2]) # starting values

sigmoid <- function(z) {
g <- 1/(1 + exp(-z))
g
}

mycost <- function(theta) {

# note that x and y are from parent environment, not ideal for real-world..

m <- dim(x)[1]
g <- sigmoid(matmult(x,matrix(theta))) # much faster with %*%

J <- (1/m) * sum((-y * log(g)) - ((1-y) * log(1 - g)))
J
}

mynewt(mycost,theta)

## [1] -1.3775133753 -0.0076038843  0.0038722036 -0.0001291763
## attr(,"iter")
## [1] 7

glm.fit(x,y,family=binomial(link='logit'))\$coefficients

##                       month           day      distance
## -1.3775775549 -0.0076042919  0.0038721237 -0.0001291093

# pretty close...


This is a start, at least, to understanding the underlying implementations behind common statistical techniques...