Collaborative filtering predicts a user’s preferences from data about other users’ preferences. It is widely used in recommendation systems. Collaborative filtering constructs similarities between users as well as between items. This differs from content-based filtering, where only item features are used to make recommendations.

For this entry, we use a linear model to perform collaborative filtering. The basic idea is the following. We construct embedding matrices for movies and users; these embedding matrices are the parameters of our linear model. For each user and movie, we create the same number of embedding factors, and their dot product (plus bias terms) is the predicted value of the user’s rating.

Let’s say we have $m$ movies and $n$ users, and we use 5 embedding factors (also called latent factors) for each movie and user. As an example, user $u$’s rating of item $i$ is modeled by the following equation, where the first vector in the dot product is the user’s embedding $p_u$, the second is the item’s embedding $q_i$, and $b_u$, $b_i$ are bias terms:

$$\hat{r}_{u,i} = p_u \cdot q_i + b_u + b_i$$

These embedding matrices are initialized with random numbers. In our case, we have a (1664, 5) embedding matrix for movies and a (943, 5) one for users. We take the matrix product of the movie embeddings and the transpose of the user embeddings to get a matrix holding a predicted rating of every movie by every user. We then improve these embedding matrices with SGD on our dataset of user ratings.
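
As a minimal sketch (with made-up numbers), the prediction for a single user/movie pair looks like this:

import torch

# hypothetical 5-factor embeddings for one user and one movie
user_emb = torch.tensor([0.2, -1.3, 0.8, 0.05, 0.6])
movie_emb = torch.tensor([1.1, -0.4, 0.3, -0.9, 0.2])
user_bias, movie_bias = 0.1, -0.3

# predicted rating: dot product of the embeddings plus the bias terms
pred = (user_emb * movie_emb).sum() + user_bias + movie_bias
print(pred)  # ≈ tensor(0.8550)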

Data exploration

import torch
import pandas as pd
#!wget "https://files.grouplens.org/datasets/movielens/ml-100k.zip" -O "../data/ml-100k.zip" && unzip "../data/ml-100k.zip" -d ../data/
from pathlib import Path
 
path = Path("../data/ml-100k/")
 
ratings = pd.read_csv(
    path / "u.data",
    delimiter="\t",
    header=None,
    names=["user", "movie", "rating", "timestamp"],
)
ratings.head()
 
   user  movie  rating  timestamp
0   196    242       3  881250949
1   186    302       3  891717742
2    22    377       1  878887116
3   244     51       2  880606923
4   166    346       1  886397596
movies = pd.read_csv(
    path / "u.item",
    delimiter="|",
    encoding="latin-1",
    usecols=(0, 1),
    names=("movie", "title"),
    header=None,
)
movies.head()
   movie              title
0      1   Toy Story (1995)
1      2   GoldenEye (1995)
2      3  Four Rooms (1995)
3      4  Get Shorty (1995)
4      5     Copycat (1995)

We merge the two datasets to get movie titles:

ratings = ratings.merge(movies)
ratings.head()
   user  movie  rating  timestamp                       title
0   196    242       3  881250949                Kolya (1996)
1   186    302       3  891717742    L.A. Confidential (1997)
2    22    377       1  878887116         Heavyweights (1994)
3   244     51       2  880606923  Legends of the Fall (1994)
4   166    346       1  886397596         Jackie Brown (1997)
len(ratings.user.unique()), len(ratings.title.unique())
(943, 1664)

Data preparation

We create a DataLoader from the pandas dataframe: the input is a (user, title) pair (title plays the same role as movie), and the output is the rating. When we create the Dataset (see details in Dataloader), we make the input/output distinction by telling the __getitem__ method to stack user and title together. This way, instead of 3 items, the iterator gives us only 2 at a time, input and output, as described above.

from torch.utils.data import DataLoader
 
batch_size = 64
 
 
class CollabDataset:
    def __init__(self, ratings, user_col, item_col, rating_col):
        # note: we reindex the user column; the new code is the original index minus 1
        self.users = torch.tensor(
            ratings[user_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.items = torch.tensor(
            ratings[item_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.ratings = torch.tensor(ratings[rating_col].values, dtype=torch.float32)
 
    def __len__(self):
        return len(self.ratings)
 
    def __getitem__(self, idx):
        return torch.stack([self.users[idx], self.items[idx]]), self.ratings[idx]
 
 
user_col = "user"
item_col = "title"
ratings_col = "rating"
dataset = CollabDataset(ratings, user_col, item_col, ratings_col)
dls = DataLoader(dataset, batch_size=batch_size, shuffle=True)

For the user column, the original index is 1-based, which doesn’t work for indexing the embedding matrices in torch (indices must start at 0). So we reindex it with pandas categories, which makes the new user index the original index minus 1.
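
To see what the reindexing does, here is a toy example (values made up for illustration):

import pandas as pd

s = pd.Series([1, 3, 2, 3, 1])          # 1-based, possibly sparse ids
s.astype("category").cat.codes.values   # contiguous, 0-based codes
array([0, 2, 1, 2, 0], dtype=int8)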

Now we can see that the DataLoader iterator gives a list of 2 items at a time: the first is an array of (user, item) pairs, the second is the corresponding ratings. This is what we need for training.

next(iter(dls))
[tensor([[ 888,  865],
         [  15, 1227],
         [  58,  319],
         [ 192,  793],
         [ 732,  333],
         [ 310, 1393],
         [  37,  559],
         [ 379,   82],
         [ 654,  920],
         [ 325,  296],
         [ 261,  773],
         [ 550,  313],
         [  30,  528],
         [ 392,  312],
         [ 316,  453],
         [ 252, 1204],
         [ 311,  153],
         [ 388, 1008],
         [  38, 1482],
         [ 860,  272],
         [ 386, 1491],
         [ 312, 1210],
         [  14,  780],
         [ 681,  302],
         [ 845, 1572],
         [ 397,  960],
         [ 646, 1251],
         [   6,  890],
         [ 598, 1570],
         [ 881,  494],
         [ 213,  910],
         [ 780,  339],
         [ 312,  337],
         [ 184,  389],
         [ 449,  314],
         [ 827, 1167],
         [ 384, 1038],
         [ 104,  339],
         [ 617,  930],
         [ 398,  769],
         [ 898, 1146],
         [ 795,  232],
         [ 304,  627],
         [ 822,   88],
         [ 642,  950],
         [ 346,  493],
         [ 876, 1016],
         [ 133,  733],
         [ 372, 1597],
         [ 591,  554],
         [ 180,  991],
         [ 168,  618],
         [ 647, 1019],
         [ 214,  179],
         [ 893,  519],
         [ 298,  450],
         [ 294, 1410],
         [ 805, 1214],
         [ 654, 1401],
         [ 857,  378],
         [ 621, 1178],
         [ 351, 1534],
         [ 415,  378],
         [ 676,  860]], dtype=torch.int32),
 tensor([5., 1., 4., 3., 3., 4., 5., 2., 3., 1., 3., 5., 4., 3., 4., 1., 5., 4.,
         5., 4., 3., 3., 3., 4., 5., 4., 3., 5., 5., 5., 4., 4., 5., 4., 4., 4.,
         4., 4., 3., 3., 3., 3., 2., 3., 4., 2., 3., 2., 5., 4., 1., 4., 3., 4.,
         4., 4., 4., 3., 3., 2., 5., 3., 3., 5.])]

We need a way to retrieve movie titles by index. We can verify that our DataLoader works by picking a row from the iterator, e.g. ([803, 1214], 1.), which means that user 804 (original index) rated the movie with code 1214 a score of 1. We check it against the corresponding row in the ratings dataframe.

item_codes_to_titles = dict(
    enumerate(ratings[item_col].astype("category").cat.categories)
)
item_codes_to_titles[1214]
'Reality Bites (1994)'
ratings[(ratings[user_col] == 804) & (ratings[item_col] == "Reality Bites (1994)")]
       user  movie  rating  timestamp                 title
62087   804   1074       1  879447476  Reality Bites (1994)

Elements of collaborative filtering

We need to create embedding matrices of latent factors, user_factors and movie_factors, and initialize them with random numbers:

n_users = len(ratings[user_col].astype("category").cat.categories)
n_movies = len(ratings[item_col].astype("category").cat.categories)
n_factors = 5
 
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
print(n_users, n_movies)
print(user_factors.shape, movie_factors.shape)
943 1664
torch.Size([943, 5]) torch.Size([1664, 5])
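
As described at the start, multiplying one embedding matrix by the transpose of the other yields a full matrix of raw predictions, one entry for every (movie, user) pair:

# (1664, 5) @ (5, 943) -> (1664, 943): one raw prediction per pair
(movie_factors @ user_factors.T).shape
torch.Size([1664, 943])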

We simplify this process with a helper function that takes a shape and outputs an embedding matrix wrapped as a trainable parameter:

def create_params(size):
    return torch.nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
 
 
create_params([2, 3])
Parameter containing:
tensor([[-0.0121, -0.0043,  0.0041],
        [ 0.0078, -0.0008,  0.0028]], requires_grad=True)

Next we define a scaled sigmoid that squashes its input into a given range; we will use it to keep the model’s predictions within valid rating values:

def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low
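
We will pass the range (0, 5.5): the upper bound sits slightly above the maximum rating of 5 because a sigmoid only approaches, and never reaches, the top of its range. A quick check:

sigmoid_range(torch.tensor([-5.0, 0.0, 5.0]), 0, 5.5)
# ≈ tensor([0.0368, 2.7500, 5.4632])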

We define our linear model using a dot product and bias terms. The model needs to subclass torch.nn.Module and implement the __init__ and forward methods.

class DotProductBias(torch.nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
 
    def forward(self, x):
        users = self.user_factors[x[:, 0]]
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
 

We instantiate our model:

model = DotProductBias(n_users, n_movies, 5)
 
xb, yb = next(iter(dls))
xb.shape, yb.shape
(torch.Size([64, 2]), torch.Size([64]))

We can call the model to create a simple prediction; under the hood, the model uses the forward method we just defined.

pred = model(xb)
loss = torch.nn.functional.mse_loss(pred, yb)
pred.shape, yb.shape, loss
(torch.Size([64]),
 torch.Size([64]),
 tensor(1.5828, grad_fn=<MseLossBackward0>))

To train the model with SGD, we implement the following algorithm (in pseudocode):

for xb, yb in dls:
    pred = model(xb)
    loss = loss_func(pred, yb)
    loss.backward()
    parameters -= parameters.grad * lr

The actual training loop:

epochs = 5
lr = 5e-3
wd = 0.1
for epoch in range(epochs):
    epoch_loss = 0
    batch_num = 0
    for xb, yb in dls:
        pred = model(xb)
        loss = torch.nn.functional.mse_loss(pred, yb)
 
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                # uncomment to apply L2 weight decay: p.grad += wd * 2 * p
                p -= p.grad * lr
            model.zero_grad()
        epoch_loss += loss.item()
        batch_num += 1
    avg_loss = epoch_loss / batch_num
    print(f"Average loss: {avg_loss:.4f}")
Average loss: 1.7865
Average loss: 1.6328
Average loss: 1.5120
Average loss: 1.4171
Average loss: 1.3417
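
The manual update above is plain SGD written out by hand. The same loop can be expressed with torch.optim.SGD, whose weight_decay argument applies the L2 penalty we left commented out (a sketch under the same setup, not the run shown above):

opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=wd)
for epoch in range(epochs):
    for xb, yb in dls:
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        loss.backward()
        opt.step()      # p -= lr * (p.grad + wd * p)
        opt.zero_grad()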

From our model, we now have learned embeddings of movies and users. We can use the movie embeddings to find similar movies. For example, let’s find the movies most similar to ‘Chinatown (1974)’:

title = "Chinatown (1974)"
item_idx = list(item_codes_to_titles.values()).index(title)
item_factors = model.movie_factors[item_idx].detach()
 
# Calculate cosine similarity between this movie and all others
similarities = torch.nn.functional.cosine_similarity(
    model.movie_factors.detach(), item_factors.unsqueeze(0)
)
 
# Get top similar movies
_, indices = torch.topk(similarities, 10)
 
print(f"Movies similar to '{title}':\n")
for idx in indices:
    similar_title = item_codes_to_titles[idx.item()]
    similarity = similarities[idx].item()
    print(f"{similar_title:<50} (similarity: {similarity:.3f})")
 
Movies similar to 'Chinatown (1974)':

Chinatown (1974)                                   (similarity: 1.000)
Trial and Error (1997)                             (similarity: 0.981)
Men With Guns (1997)                               (similarity: 0.978)
Forget Paris (1995)                                (similarity: 0.953)
Lashou shentan (1992)                              (similarity: 0.943)
Cook the Thief His Wife & Her Lover, The (1989)    (similarity: 0.936)
Further Gesture, A (1996)                          (similarity: 0.930)
Cool Hand Luke (1967)                              (similarity: 0.927)
Spellbound (1945)                                  (similarity: 0.917)
Liar Liar (1997)                                   (similarity: 0.916)
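
The learned bias terms are interpretable as well: a movie with a large positive bias tends to be rated highly regardless of which user rates it. A quick sketch (the resulting titles will vary between training runs):

# movies whose bias alone predicts high ratings
_, top = torch.topk(model.movie_bias.detach(), 5)
for idx in top:
    print(item_codes_to_titles[idx.item()])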