Stochastic gradient descent (SGD) is a way to optimize a function by iteratively nudging its parameters in the direction opposite to the gradient of the loss function.
The rationale for that direction is that the gradient points toward the steepest increase of the loss, so moving in the opposite direction decreases the loss.
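As a minimal, self-contained sketch of that update rule (the toy loss w² and the variable names here are only for illustration and not part of the example below), one SGD step in PyTorch looks like this:
import torch

w = torch.tensor(5.0, requires_grad=True)  # start away from the minimum at w = 0
lr = 0.1
for _ in range(3):
    loss = w ** 2                  # toy loss
    loss.backward()                # d(loss)/dw = 2w
    w.data -= lr * w.grad.data     # move against the gradient
    w.grad = None                  # clear the gradient before the next step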
Here, we will fit a quadratic function. To do so, we initialize the parameters of the quadratic at random, use mean squared error (MSE) as the loss function to compute the gradient, and then use SGD to optimize the parameters.
First, we set up the model and the loss function:
from functools import partial
import torch

# mean squared error
def mse(preds, acts):
    return ((preds - acts) ** 2).mean()

def quad(a, b, c, x):
    return a * x**2 + b * x + c

# bind the coefficients a, b, c, returning a function of x only
def mk_quad(a, b, c):
    return partial(quad, a, b, c)
# target model
f = mk_quad(2, 3, 4)
f(2)
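With a = 2, b = 3, and c = 4, f(2) evaluates 2·2² + 3·2 + 4 = 18.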
# assume some data points
x = torch.linspace(-2, 2, 20)[:, None]
torch.manual_seed(42)
# Generate a tensor of random numbers with the same shape as f(x)
# torch.rand_like(f(x)) generates random numbers between 0 and 1
# with the same shape as f(x). We scale and shift it to the desired range.
random_numbers = torch.rand_like(f(x)) * 10 - 5
# dataset
y = f(x) + random_numbers
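To get a feel for the data, a quick plot helps. This is an optional sketch, assuming matplotlib is installed (it is not part of the setup above):
import matplotlib.pyplot as plt

plt.scatter(x, y, label="noisy samples")              # the 20 generated points
plt.plot(x, f(x), color="red", label="target f(x)")   # the underlying quadratic
plt.legend()
plt.show()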
# loss function
def quad_mse(params):
    f = mk_quad(*params)
    return mse(f(x), y)
# initial params
params = torch.tensor([4, 5.0, 7.0])
params.requires_grad_()
loss = quad_mse(params)
loss
loss.backward()
params.grad
tensor([11.2822, 6.9424, 0.0000])
Let’s calculate the loss and the SGD update manually.
Here is the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $n$ is the number of data points.
To calculate the gradient of the loss, we start with a generic expression for a single prediction,

$$\hat{y}_i = a x_i^2 + b x_i + c,$$

then substitute it into the loss function, where the $x_i$ are the values in the training data and $y_i$ is the target value:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left(a x_i^2 + b x_i + c - y_i\right)^2$$

So for parameter $a$, our gradient is:

$$\frac{\partial L}{\partial a} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)\, x_i^2$$

The process is similar for $b$ and $c$:

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial L}{\partial c} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)$$

To perform SGD, we subtract the gradient, scaled by the learning rate $\eta$, from the weight:

$$a \leftarrow a - \eta\, \frac{\partial L}{\partial a}$$
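As a sanity check, we can evaluate these formulas directly on the data and compare them with the autograd result in params.grad above. This is a sketch that assumes the x, y, and params defined earlier; the helper names grad_a, grad_b, grad_c are just for illustration:
a, b, c = params.detach()           # current parameter values (4., 5., 7.)
y_hat = a * x**2 + b * x + c        # model predictions
residual = y_hat - y
grad_a = (2 * residual * x**2).mean()
grad_b = (2 * residual * x).mean()
grad_c = (2 * residual).mean()
print(grad_a, grad_b, grad_c)       # should match params.grad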
print("target params", f)
print("initial params", params)
print("initial values of x: ", x[:2])
target params functools.partial(<function quad at 0x7fe6424a3060>, 2, 3, 4)
initial params tensor([4., 5., 7.], requires_grad=True)
initial values of x: tensor([[-2.0000],
[-1.7895]])
lr = 0.23
params = torch.tensor([4, 5.0, 7.0])
params.requires_grad_()
for _ in range(10):
    loss = quad_mse(params)
    print("loss ", loss.item())
    loss.backward()
    params.data -= lr * params.grad.data  # step opposite to the gradient
    params.grad = None                    # reset the gradient for the next iteration
loss 23.300430297851562
loss 12.930697441101074
loss 10.261430740356445
loss 8.984521865844727
loss 8.224483489990234
loss 7.7517900466918945
loss 7.455584526062012
loss 7.269739627838135
loss 7.153112888336182
loss 7.079920768737793
As we can see, after a few iterations the loss goes down consistently.
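For reference, the same loop can be written with PyTorch's built-in optimizer instead of updating params by hand. This is a sketch that assumes the quad_mse function defined earlier:
params = torch.tensor([4.0, 5.0, 7.0], requires_grad=True)
opt = torch.optim.SGD([params], lr=0.23)
for _ in range(10):
    loss = quad_mse(params)
    loss.backward()
    opt.step()        # params -= lr * params.grad
    opt.zero_grad()   # clear gradients for the next iteration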