A neural net is a series of linear functions with nonlinearities in between. We define the structure of the network in the forward pass, which computes the model's output, and we use backpropagation, also called the backward pass, to calculate gradients (we don't cover the optimization step here).
Forward pass
The forward pass is the process of using the input and the model to calculate the output (mostly via matrix multiplication). We start with a linear layer:
def lin(x, w, b):
    return x @ w + b
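For example, a linear layer maps a batch of inputs, one row per sample, to a batch of outputs. A quick shape check (the small tensors and names below are made up purely for illustration):

import torch

x_demo = torch.randn(4, 3)   # batch of 4 samples with 3 features each
w_demo = torch.randn(3, 2)   # maps 3 features to 2 outputs
b_demo = torch.zeros(2)
print(lin(x_demo, w_demo, b_demo).shape)  # torch.Size([4, 2])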
To define a neural net, we put two linear layers together, with a nonlinear function, the activation function, in the middle (the composition of two linear functions is just another linear function, so without the nonlinearity it wouldn't amount to two layers).
We initialize the weights for the linear layers:
x = torch.randn(200, 100)
y = torch.randn(200)
w1 = torch.randn(100, 50)
b1 = torch.zeros(50)
w2 = torch.randn(50, 1)
b2 = torch.zeros(1)
A problem with this set of weights is that the output of the model becomes very big after several layers. That is because repeated matrix multiplications with unscaled random weights compound into very large numbers. This can be illustrated by iteratively performing matrix multiplications on a matrix: as we can see below, the values blow up quickly, all the way to nan. This won't work for deep neural networks.
x = torch.randn(200, 100)
for i in range(50):
    x = x @ torch.randn(100, 100)
x[0:5, 0:5]
tensor([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])
Kaiming He et al. introduced the right scaling factor for neural nets with relu in their paper Delving Deep into Rectifiers (the same authors as the ResNet paper): $\sqrt{2 / n_{\text{in}}}$, where $n_{\text{in}}$ is the number of inputs to the layer. So we initialize according to this approach instead:
import math
x = torch.randn(200, 100)
y = torch.randn(200)
w1 = torch.randn(100, 50) * math.sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50, 1) * math.sqrt(2 / 50)
b2 = torch.zeros(1)
We define the activation function, the nonlinearity, as relu, which clamps a tensor at 0:
def relu(x):
    return x.clamp_min(0.0)
Now we can calculate how an input goes through the first layer:
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()
(tensor(0.5728), tensor(0.8428))
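As a quick sanity check, we can repeat the earlier 50-layer experiment, this time scaling each random weight matrix by $\sqrt{2/100}$ and applying relu in between; the activations should now stay finite instead of overflowing to nan (a rough sketch; the exact statistics vary from run to run, and x_check is a fresh tensor so we don't clobber x):

x_check = torch.randn(200, 100)
for i in range(50):
    x_check = relu(x_check @ (torch.randn(100, 100) * math.sqrt(2 / 100)))
print(x_check.mean(), x_check.std())  # finite values, no overflow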
And we can define our neural network, which is also our forward pass:
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3
Backward pass
The backward pass starts from a loss function, which computes a measure of the difference between the output of the forward pass and our labels (y in our case), and then propagates the gradient of that loss back through the network. In our case, we use mean squared error.
However, we have one minor problem: our model should produce an output of shape [200] (the shape of y), but right now it has a trailing dimension of size 1.
out = model(x)
out.shape
torch.Size([200, 1])
We get rid of the trailing 1 with the squeeze method in our loss function:
def mse(output, targ):
    return (output.squeeze(-1) - targ).pow(2).mean()
Now we need to calculate the gradients of our loss function, the mean squared error, with respect to the weights. To do so, we apply the chain rule. To implement the chain rule for all the operations we have, we save the gradient of every function with respect to its inputs.
For mean squared error, $L = \frac{1}{n}\sum_i (\text{inp}_i - \text{targ}_i)^2$, the gradient with respect to the input (which is the output of the forward pass) is

$$\frac{\partial L}{\partial \text{inp}_i} = \frac{2\,(\text{inp}_i - \text{targ}_i)}{n}$$

where $n$ is the number of elements in $\text{inp}$.
For relu, the gradient with respect to the input is 1 or 0 depending on the sign of the input. Because the result of relu is not the final output, we need to apply the chain rule, which means we multiply by the gradient of the next step, out.g:

$$\frac{\partial L}{\partial \text{inp}} = \mathbb{1}[\text{inp} > 0] \odot \text{out.g}$$

where $\text{out.g}$ is $\frac{\partial L}{\partial \text{out}}$.
Similarly, for the linear function $\text{out} = \text{inp} \cdot w + b$ we have

$$\frac{\partial L}{\partial \text{inp}} = \text{out.g} \cdot w^{T}, \qquad \frac{\partial L}{\partial w} = \text{inp}^{T} \cdot \text{out.g}, \qquad \frac{\partial L}{\partial b} = \sum_{\text{batch}} \text{out.g}$$

where $\text{out.g}$ is again the gradient flowing back from the next step.
This gives us:
def mse_grad(inp, targ):
    # grad of loss with respect to output of previous layer
    inp.g = 2.0 * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp > 0).float() * out.g

def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)
The backward pass simply runs these gradient functions one at a time, in reverse order:
def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
Refactor to PyTorch
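Before refactoring, it is worth sanity-checking the manual gradients against PyTorch's autograd. The sketch below (check_against_autograd is a made-up helper name, assuming the tensors defined above are still in scope) re-runs the same computation on autograd-enabled copies of the weights and compares the results:

def check_against_autograd(inp, targ):
    # re-run the same computation with autograd-enabled copies of the weights
    w1a, b1a = w1.clone().requires_grad_(True), b1.clone().requires_grad_(True)
    w2a, b2a = w2.clone().requires_grad_(True), b2.clone().requires_grad_(True)
    out = relu(inp @ w1a + b1a) @ w2a + b2a
    mse(out, targ).backward()
    # compare autograd's gradients with the ones our backward pass stored in .g
    print(torch.allclose(w1.g, w1a.grad, atol=1e-5),
          torch.allclose(b1.g, b1a.grad, atol=1e-5))

forward_and_backward(x, y)
check_against_autograd(x, y)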
Next, we want to move the current implementation toward PyTorch. To do that, we first organize the code into classes where each operation has a forward pass and a backward pass.
class Relu:
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.0)
        return self.out

    def backward(self):
        self.inp.g = (self.inp > 0).float() * self.out.g

class Lin:
    def __init__(self, w, b):
        self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp @ self.w + self.b
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

class Mse:
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out

    def backward(self):
        x = (self.inp.squeeze() - self.targ).unsqueeze(-1)
        self.inp.g = 2.0 * x / self.targ.shape[0]
We will also create a model class to tie everything together.
There is something subtle here. Because Python passes objects by reference, one layer's self.out and the next layer's self.inp are the same tensor object, so the relations established in the forward pass are retained in the backward pass. This makes it possible to read self.out.g without ever setting it explicitly.
For the forward pass we have:
y1 = lin1(x0) # lin1.out is y1
r1 = relu(y1) # relu.out is r1; note: r1 is also lin2.inp
yhat = lin2(r1) # lin2.out is yhat; note: yhat is also loss.inp
L = loss(yhat, targ)
In the backward pass:
loss.backward() # loss.inp.g is now set → therefore lin2.out.g is set
lin2.backward() # lin2 reads its out.g (set by loss), writes its inp.g (which is relu.out.g)
relu.backward() # relu reads its out.g (set by lin2), writes its inp.g (which is lin1.out.g)
lin1.backward() # lin1 reads its out.g (set by relu), writes its inp.g (which is x0.g)
class Model:
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers:
            x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers):
            l.backward()
Now we are in a position to perform the forward and backward passes in a way similar to how PyTorch does it:
model = Model(w1, b1, w2, b2)
loss = model(x, y)
model.backward()
w1.g.shape
torch.Size([100, 50])
In PyTorch, we rewrite the layers of the neural net as subclasses of nn.Module, the base class for PyTorch models, which registers trainable parameters and keeps track of gradients. We then retrieve the equivalent of the gradient of the layer 1 weight w1 (note that the shape of PyTorch weights, (out_features, in_features), is the reverse of how we defined ours).
import torch.nn as nn

class Model2(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)
        )
        self.loss = mse

    def forward(self, x, targ):
        return self.loss(self.layers(x).squeeze(), targ)
print(x.shape, y.shape)
model = Model2(100, 50, 1)
loss = model(x, y)
loss.backward()
print(model.layers[0].weight.grad.shape)
torch.Size([200, 100]) torch.Size([200])
torch.Size([50, 100])
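To tie the two implementations together, we can copy our hand-initialized weights into the PyTorch model (transposed, to match the (out_features, in_features) layout) and check that autograd reproduces the gradient our manual backward pass stored in w1.g. A rough sketch, assuming the tensors defined above are still in scope:

with torch.no_grad():
    # overwrite nn.Linear's own initialization with our weights
    model.layers[0].weight.copy_(w1.t())
    model.layers[0].bias.copy_(b1)
    model.layers[2].weight.copy_(w2.t())
    model.layers[2].bias.copy_(b2)

model.zero_grad()
loss = model(x, y)
loss.backward()

# the autograd gradient should match our manual w1.g (up to float precision)
print(torch.allclose(model.layers[0].weight.grad, w1.g.t(), atol=1e-5))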