Stochastic gradient descent (SGD) is a way to optimize a function by iteratively nudging its parameters in the direction opposite to the gradient of the loss function.
The rationale for that direction is that the gradient points toward the steepest increase of the loss, so moving in the opposite direction decreases the loss.
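As a minimal, self-contained sketch of that update rule (the toy loss w² and the variable names here are only for illustration and not part of the example below), one SGD step in PyTorch looks like this:
import torch

w = torch.tensor(5.0, requires_grad=True)  # start away from the minimum at w = 0
lr = 0.1
for _ in range(3):
    loss = w ** 2                  # toy loss
    loss.backward()                # d(loss)/dw = 2w
    w.data -= lr * w.grad.data     # move against the gradient
    w.grad = None                  # clear the gradient before the next step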
Here, we will fit a quadratic function. To do so, we initialize the parameters of the quadratic at random, use mean squared error (MSE) as the loss function to compute the gradient, and then use SGD to optimize the parameters.
First, we set up the model and the loss function:
from functools import partial
import torch

# mean squared error
def mse(preds, acts):
    return ((preds - acts) ** 2).mean()

def quad(a, b, c, x):
    return a * x**2 + b * x + c

# bind the coefficients a, b, c, returning a function of x only
def mk_quad(a, b, c):
    return partial(quad, a, b, c)
# target model
f = mk_quad(2, 3, 4)
f(2)
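With a = 2, b = 3, and c = 4, f(2) evaluates 2·2² + 3·2 + 4 = 18.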
# assume some data points
x = torch.linspace(-2, 2, 20)[:, None]
torch.manual_seed(42)
# Generate a tensor of random numbers with the same shape as f(x)
# torch.rand_like(f(x)) generates random numbers between 0 and 1
# with the same shape as f(x). We scale and shift it to the desired range.
random_numbers = torch.rand_like(f(x)) * 10 - 5
# dataset
y = f(x) + random_numbers
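To get a feel for the data, a quick plot helps. This is an optional sketch, assuming matplotlib is installed (it is not part of the setup above):
import matplotlib.pyplot as plt

plt.scatter(x, y, label="noisy samples")              # the 20 generated points
plt.plot(x, f(x), color="red", label="target f(x)")   # the underlying quadratic
plt.legend()
plt.show()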
# loss function
def quad_mse(params):
    f = mk_quad(*params)
    return mse(f(x), y)
# initial params
params = torch.tensor([4, 5.0, 7.0])
params.requires_grad_()
loss = quad_mse(params)
loss
loss.backward()
params.grad
tensor([11.2822, 6.9424, 0.0000])
Let’s calculate the loss and the SGD update manually.
Here is the loss function:

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $n$ is the number of data points.
To calculate the gradient of the loss, we start with a generic expression for a single prediction,

$$\hat{y}_i = a x_i^2 + b x_i + c,$$

then substitute it into the loss function, where the $x_i$ are the values in the training data and $y_i$ is the target value:

$$L = \frac{1}{n} \sum_{i=1}^{n} \left(a x_i^2 + b x_i + c - y_i\right)^2$$

So for parameter $a$, our gradient is:

$$\frac{\partial L}{\partial a} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)\, x_i^2$$

The process is similar for $b$ and $c$:

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)\, x_i, \qquad \frac{\partial L}{\partial c} = \frac{1}{n} \sum_{i=1}^{n} 2\,(\hat{y}_i - y_i)$$

To perform SGD, we subtract the gradient, scaled by the learning rate $\eta$, from the weight:

$$a \leftarrow a - \eta\, \frac{\partial L}{\partial a}$$
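As a sanity check, we can evaluate these formulas directly on the data and compare them with the autograd result in params.grad above. This is a sketch that assumes the x, y, and params defined earlier; the helper names grad_a, grad_b, grad_c are just for illustration:
a, b, c = params.detach()           # current parameter values (4., 5., 7.)
y_hat = a * x**2 + b * x + c        # model predictions
residual = y_hat - y
grad_a = (2 * residual * x**2).mean()
grad_b = (2 * residual * x).mean()
grad_c = (2 * residual).mean()
print(grad_a, grad_b, grad_c)       # should match params.grad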
print("target params", f)
print("initial params", params)
print("initial values of x: ", x[:2])
target params functools.partial(<function quad at 0x7fe6424a3060>, 2, 3, 4)
initial params tensor([4., 5., 7.], requires_grad=True)
initial values of x: tensor([[-2.0000],
[-1.7895]])
lr = 0.23
params = torch.tensor([4, 5.0, 7.0])
params.requires_grad_()
for _ in range(10):
    loss = quad_mse(params)
    print("loss ", loss.item())
    loss.backward()
    params.data -= lr * params.grad.data  # step opposite to the gradient
    params.grad = None                    # reset the gradient for the next iteration
loss 23.300430297851562
loss 12.930697441101074
loss 10.261430740356445
loss 8.984521865844727
loss 8.224483489990234
loss 7.7517900466918945
loss 7.455584526062012
loss 7.269739627838135
loss 7.153112888336182
loss 7.079920768737793
As we can see, after a few iterations the loss goes down consistently.
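For reference, the same loop can be written with PyTorch's built-in optimizer instead of updating params by hand. This is a sketch that assumes the quad_mse function defined earlier:
params = torch.tensor([4.0, 5.0, 7.0], requires_grad=True)
opt = torch.optim.SGD([params], lr=0.23)
for _ in range(10):
    loss = quad_mse(params)
    loss.backward()
    opt.step()        # params -= lr * params.grad
    opt.zero_grad()   # clear gradients for the next iteration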