wiki:gradient

Welcome to part two of the programming blogs. This one will cover an algorithm known as gradient descent. I am nowhere near qualified to teach math, so this will assume at least some working knowledge of calculus concepts, mainly the derivative. As always it will also assume that you know how to program comfortably. The samples will be shown in C++. Let's begin.

**Gradient descent** is an algorithm for iteratively finding local minima of any differentiable function. It is one of the core tools in machine learning algorithms and is not conceptually difficult to understand, with the main hurdle being understanding the calculus that goes behind it. Although it is very powerful in certain scenarios, it is not as widespread outside of statistical analysis in programming. The core of the gradient descent algorithm is a concept known as the gradient vector.

As previously stated, gradient descent is based on the idea of a gradient. But what is a gradient? Put simply, it is a vector describing how a function changes in an instantaneous timeframe. Sound familiar? If you were thinking of derivatives, that's exactly it. The gradient is constructed with the partial derivatives of all the variables in your target function. Taking the partial derivative describes the rate of change of that variable as the others are held constant, and in conjunction, they can describe the whole function's slope at any input.

How does this help us, you may ask. It's simple. Since the gradient tells us the slope of the function at any particular point, i.e. in which way the function is increasing, we go in the *opposite direction*. By doing this, we know that the direction we are travelling is always downwards until we hit a minimum. More than that, the value that is returned by the gradient tells us more than just which direction to travel, it also tells us how much. Larger values indicate that the slope is steep at that point, meaning it is *probably* safe to move in a large step, while smaller values indicate that we are almost at a critical point and that we should slow our steps. Using this information, we can, step by step, descend towards a local minimum. You may notice the word local; Unfortunately, gradient descent does not guarantee the global minimum if the function has multiple minima. Just like rolling something down hilly terrain, the starting position matters and is what decides which ditch the ball rolls into. A minimum is a minimum though and can still give you a desirable result for what you're looking for.

If you don't quite get it yet, that's fine. My favourite analogy for understanding this concept is the skier on a hill.

A skier starts at the top of a curve. Due to gravity, he then begins to slide down. At the top of the curve where he's starting, the slope is very steep. Because of this, he slides down very fast. But as he goes down, his skies start pointing more and more up as the slope decreases. Because of this, he starts slowing down, until he is at a rest at the bottom of the curve. This is gradient descent. We start at a point, and like the skier, we slide down our curve, slowing down and speeding up based on the slope of where we currently are.

Let's look at a classic example of gradient descent, regression. Regression is the act of "fitting" a function to a set of discrete data points and is a powerful tool for the analysis of data. Any function can be regressed to data, but the challenge is picking a function that is both easy to work with and retains in distinct shape (for example, if we regress some data with a quadratic function and the x^2 coefficient is a de-facto zero, it is probably not the best choice for this data). Here we'll use gradient descent to quickly perform linear regression on a set of data.

Before we regress, we have to set up a few definitions. First, how do we know when a function is good? As in, how well does the function we picked fit our data? Luckily, Gauss has this one figured out for us with his method of least squares. Basically what you do in it is subtract every Y value of our data with the corresponding Y value of our function, square that to remove the sign, and then add up all those values. The value you get from that is how close the function fits the data. The smaller the value, better.

In the case of linear regression, our function would be *y = mx + b.* In the sum of least squares this would look like this:

We can now take the partial derivatives with respect to *m* and *b* to get our gradient. The partial derivative with respect to *m* would look like this:

And the partial derivative with respect to *b* would look like this. They are both very similar.

Nice. We now have all the components of the gradient vector. All that's left to do is implement these in code. For this code, we will assume that there has already been X and Y data loaded up into two global arrays *x[]* and *y[]* with size N. First, we can implement the function for the partial derivative with respect to *m*.

```
double d_lin_m(double m, double b) {
double sum(0);
for (int i = 0; i < N; i++) {
sum += -2 * x[i] * (y[i] - (m * x[i] + b));
}
return sum;
}
```

Simple enough. This is just the exact definition of the sigma implemented in a loop. Taking a look at the partial derivative with respect to *b*, it is very similar to the first one as we would expect:

```
double d_lin_b(double m, double b) {
double sum(0);
for (int i = 0; i < N; i++) {
sum += -2 * (y[i] - (m * x[i] + b));
}
return sum;
}
```

Now we just need to create the driver for these two functions that starts at a point and iterates. Before that, it is also handy to define a descention rate *step*, which will ideally be a really small number. The numbers the gradient gives are quite large and if we were to just move based on purely those, we would be flying all over the place. It is benificial to first multiply it by *step* to slow it down a bit.

```
double m(0);
double b(0);
double step(0.0001);
for (int iter = 0; iter < 1e4; iter++) {
double dm = d_lin_m(m, b);
double db = d_lin_b(m, b);
m -= step * dm;
b -= step * db;
}
```

At the end of all those iterations, which will happen in almost an instant, we will have regressed *m* and *b* values that are extremely close to the optimal up to several significant digits. It is, however, important to note that with huge data sets, up to the millions or even hundreds of thousands, you will see a significant slowdown due to it having to sum through the whole set multiple times per iterations.

So that's that. While gradient descent is fairly niche, it is still a fairly interesting process to learn about considering it is what powers much of the "learning" modern AIs do. There are more advanced methods, sure, but gradient descent requiring only a prerequisite of calculus is not a huge challenge to understand and apply.

'til next time -pickle

wiki/gradient.txt · Last modified: 2020/03/09 15:27 by abigpickle