Collaborative filtering predicts a user’s preferences from the preferences of other users. It is widely used in recommendation systems. Collaborative filtering learns similarities between users as well as between items from the rating data. This differs from content-based filtering, where only item features are used to make recommendations.

For this entry, we use a linear model to perform collaborative filtering. The basic idea is the following. We construct embedding matrices for movies and users; these embedding matrices are the parameters of our linear model. Each user and each movie gets the same number of embedding factors, and the dot product of a user’s factors with a movie’s factors (plus bias terms) is the predicted rating.

Let’s say we have $n_{item}$ movies and $n_{user}$ users, and we use 5 embedding factors (or latent factors) for each movie and user. As an example, user $u$’s rating of item $i$ is modeled by the following equation, where the first term in the dot product is the user’s embedding and the second term is the item’s embedding.

$$r_{ui} = \begin{bmatrix} 0.5 & -0.2 & 0.1 & 0.8 & -0.3 \end{bmatrix} \begin{bmatrix} 0.4 \\ 0.6 \\ -0.3 \\ 0.2 \\ 0.7 \end{bmatrix} + b_{u} + b_{i}$$
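The example can be checked numerically. This is a minimal sketch with torch; the bias values `b_u` and `b_i` are hypothetical, since the example does not fix them:

```python
import torch

# Embeddings taken from the example equation.
user_emb = torch.tensor([0.5, -0.2, 0.1, 0.8, -0.3])
item_emb = torch.tensor([0.4, 0.6, -0.3, 0.2, 0.7])

# Hypothetical bias values, chosen only for illustration.
b_u = torch.tensor(0.3)
b_i = torch.tensor(0.1)

dot = torch.dot(user_emb, item_emb)  # sum of elementwise products
r_ui = dot + b_u + b_i
print(dot.item(), r_ui.item())
```

For these particular vectors the dot product works out to roughly 0 (0.20 - 0.12 - 0.03 + 0.16 - 0.21), so the predicted rating is just the sum of the two biases.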

These embedding matrices are initialized with random numbers. In our case, we have $n_{item} \times n_{factor}$ shaped embeddings for movies and $n_{user} \times n_{factor}$ for users. We take the matrix product of the item embedding matrix and the transpose of the user embedding matrix to get an $n_{item} \times n_{user}$ matrix of predicted ratings, one for each movie and user. We then improve these embedding matrices by fitting them to our dataset of user ratings with SGD.
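As a sketch of the shapes involved (the sizes below are small and hypothetical, not the MovieLens ones), the full matrix of predicted ratings comes from a single matrix product:

```python
import torch

torch.manual_seed(0)

# Hypothetical small sizes for illustration.
n_item, n_user, n_factor = 4, 3, 5

item_emb = torch.randn(n_item, n_factor)  # one row per movie
user_emb = torch.randn(n_user, n_factor)  # one row per user

# Entry [i, u] is the dot product of movie i's and user u's factors
# (bias terms are omitted here).
pred = item_emb @ user_emb.T
print(pred.shape)  # torch.Size([4, 3])

# The same entry computed directly for one (movie, user) pair.
single = torch.dot(item_emb[2], user_emb[1])
print(torch.allclose(pred[2, 1], single))
```

In practice the full matrix is never needed during training: only the (user, movie) pairs that actually have a rating contribute to the loss.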

Data exploration

import torch
import pandas as pd

#!wget "https://files.grouplens.org/datasets/movielens/ml-100k.zip" -O "../data/ml-100k.zip" && unzip "../data/ml-100k.zip" -d ../data/

from pathlib import Path
 
path = Path("../data/ml-100k/")

ratings = pd.read_csv(
    path / "u.data",
    delimiter="\t",
    header=None,
    names=["user", "movie", "rating", "timestamp"],
)
ratings.head()

	user	movie	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

movies = pd.read_csv(
    path / "u.item",
    delimiter="|",
    encoding="latin-1",
    usecols=(0, 1),
    names=("movie", "title"),
    header=None,
)
movies.head()

	movie	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)
3	4	Get Shorty (1995)
4	5	Copycat (1995)

We merge the two datasets to get movie titles:

ratings = ratings.merge(movies)
ratings.head()

	user	movie	rating	timestamp	title
0	196	242	3	881250949	Kolya (1996)
1	186	302	3	891717742	L.A. Confidential (1997)
2	22	377	1	878887116	Heavyweights (1994)
3	244	51	2	880606923	Legends of the Fall (1994)
4	166	346	1	886397596	Jackie Brown (1997)

len(ratings.user.unique()), len(ratings.title.unique())

(943, 1664)

Data preparation

We create a DataLoader from the pandas dataframe. The inputs are the user and the title (a stand-in for the movie), and the output is the rating. When we create the Dataset (see details in Dataloader), we make the input/output distinction by having the __getitem__ method stack the user and title indices together. This way, instead of 3 items, the iterator gives us only 2: the input and the output, as described above.

from torch.utils.data import DataLoader
 
batch_size = 64
 
 
class CollabDataset:
    def __init__(self, ratings, user_col, item_col, rating_col):
        # note: the user column is re-encoded as category codes, so the new user index is original - 1
        self.users = torch.tensor(
            ratings[user_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.items = torch.tensor(
            ratings[item_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.ratings = torch.tensor(ratings[rating_col].values, dtype=torch.float32)
 
    def __len__(self):
        return len(self.ratings)
 
    def __getitem__(self, idx):
        return torch.stack([self.users[idx], self.items[idx]]), self.ratings[idx]
 
 
user_col = "user"
item_col = "title"
ratings_col = "rating"
dataset = CollabDataset(ratings, user_col, item_col, ratings_col)
dls = DataLoader(dataset, batch_size=batch_size, shuffle=True)

In the user column, the original IDs are 1-based, while the rows of an embedding matrix are indexed from 0. So we re-encode the column as category codes. Since the user IDs run contiguously from 1 to 943, the new user index is $new_{index} = old_{index} - 1$.
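To make the re-encoding concrete, here is a small sketch on made-up ID columns (not the MovieLens data). Category codes number the sorted unique values from 0, which matches the old index minus 1 only when the IDs are contiguous and start at 1:

```python
import pandas as pd

# Hypothetical user IDs: 1-based and contiguous, like the user column.
contiguous = pd.Series([3, 1, 2, 1])
codes_contiguous = contiguous.astype("category").cat.codes.tolist()
print(codes_contiguous)  # [2, 0, 1, 0]

# With gaps in the IDs, codes are ranks among the sorted unique values,
# so "old index - 1" no longer holds.
gappy = pd.Series([10, 1, 5, 10])
codes_gappy = gappy.astype("category").cat.codes.tolist()
print(codes_gappy)  # [2, 0, 1, 2]
```

The same mechanism explains why the item index built from the title column is unrelated to the movie column: titles are coded in sorted order.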

Now we can see that the DataLoader iterator yields a list of 2 items at a time: the first is a tensor of (user, item) index pairs, the second is a tensor of ratings. This is what we need for training.

next(iter(dls))

[tensor([[ 888,  865],
         [  15, 1227],
         [  58,  319],
         [ 192,  793],
         [ 732,  333],
         [ 310, 1393],
         [  37,  559],
         [ 379,   82],
         [ 654,  920],
         [ 325,  296],
         [ 261,  773],
         [ 550,  313],
         [  30,  528],
         [ 392,  312],
         [ 316,  453],
         [ 252, 1204],
         [ 311,  153],
         [ 388, 1008],
         [  38, 1482],
         [ 860,  272],
         [ 386, 1491],
         [ 312, 1210],
         [  14,  780],
         [ 681,  302],
         [ 845, 1572],
         [ 397,  960],
         [ 646, 1251],
         [   6,  890],
         [ 598, 1570],
         [ 881,  494],
         [ 213,  910],
         [ 780,  339],
         [ 312,  337],
         [ 184,  389],
         [ 449,  314],
         [ 827, 1167],
         [ 384, 1038],
         [ 104,  339],
         [ 617,  930],
         [ 398,  769],
         [ 898, 1146],
         [ 795,  232],
         [ 304,  627],
         [ 822,   88],
         [ 642,  950],
         [ 346,  493],
         [ 876, 1016],
         [ 133,  733],
         [ 372, 1597],
         [ 591,  554],
         [ 180,  991],
         [ 168,  618],
         [ 647, 1019],
         [ 214,  179],
         [ 893,  519],
         [ 298,  450],
         [ 294, 1410],
         [ 805, 1214],
         [ 654, 1401],
         [ 857,  378],
         [ 621, 1178],
         [ 351, 1534],
         [ 415,  378],
         [ 676,  860]], dtype=torch.int32),
 tensor([5., 1., 4., 3., 3., 4., 5., 2., 3., 1., 3., 5., 4., 3., 4., 1., 5., 4.,
         5., 4., 3., 3., 3., 4., 5., 4., 3., 5., 5., 5., 4., 4., 5., 4., 4., 4.,
         4., 4., 3., 3., 3., 3., 2., 3., 4., 2., 3., 2., 5., 4., 1., 4., 3., 4.,
         4., 4., 4., 3., 3., 2., 5., 3., 3., 5.])]

We need a way to retrieve movie titles by index. We can verify that our DataLoader works by picking a row from a batch, for example ([803, 1214], 1) (batches are shuffled, so this row comes from a different batch than the one printed above), which means user 804 (old index) gave movie 1214 a rating of 1. We then check the corresponding row in the ratings dataframe.

item_codes_to_titles = dict(
    enumerate(ratings[item_col].astype("category").cat.categories)
)
item_codes_to_titles[1214]

'Reality Bites (1994)'

ratings[(ratings[user_col] == 804) & (ratings[item_col] == "Reality Bites (1994)")]

	user	movie	rating	timestamp	title
62087	804	1074	1	879447476	Reality Bites (1994)

Elements of collaborative filtering

We need to create embedding matrices of latent factors, user_factors and movie_factors, which we initialize with random numbers:

n_users = len(ratings[user_col].astype("category").cat.categories)
n_movies = len(ratings[item_col].astype("category").cat.categories)
n_factors = 5
 
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
print(n_users, n_movies)
print(user_factors.shape, movie_factors.shape)

943 1664
torch.Size([943, 5]) torch.Size([1664, 5])

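We simplify this process with a helper function that takes a shape and returns the embedding matrix as a trainable torch.nn.Parameter, initialized from a normal distribution with mean 0 and standard deviation 0.01: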

def create_params(size):
    return torch.nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
 
 
create_params([2, 3])

Parameter containing:
tensor([[-0.0121, -0.0043,  0.0041],
        [ 0.0078, -0.0008,  0.0028]], requires_grad=True)

Next we define a scaled sigmoid function, which our model uses to squash its predictions into a given range:

def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low
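A quick sketch of what the function does to a few sample inputs (the input values are arbitrary):

```python
import torch


def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low


x = torch.tensor([-10.0, 0.0, 10.0])
out = sigmoid_range(x, 0, 5.5)
# Approaches 0 on the left, equals 2.75 at x = 0, approaches 5.5 on the right.
print(out)
```

The output approaches the bounds but never reaches them. This is presumably why the model below uses an upper bound of 5.5 rather than 5: with an upper bound of exactly 5, a prediction of 5 stars would be unreachable.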

We define our linear model using a dot product and bias terms. The model must be a subclass of torch.nn.Module and implement the __init__ and forward methods:

class DotProductBias(torch.nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
 
    def forward(self, x):
        users = self.user_factors[x[:, 0]]
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
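To see what the indexing and the sum in forward compute, here is a small sketch with hand-picked (made-up) factors, leaving out the biases and the sigmoid:

```python
import torch

# Hand-picked factors for 2 users and 3 movies (2 factors each).
user_factors = torch.tensor([[1.0, 0.0], [0.0, 2.0]])
movie_factors = torch.tensor([[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# A batch of (user, movie) index pairs, shaped like the DataLoader input.
x = torch.tensor([[0, 2], [1, 0]])

users = user_factors[x[:, 0]]      # rows 0 and 1 -> shape (2, 2)
movies = movie_factors[x[:, 1]]    # rows 2 and 0 -> shape (2, 2)
res = (users * movies).sum(dim=1)  # one dot product per pair
print(res)  # tensor([7., 8.])
```

Indexing with a tensor of row numbers gathers one embedding row per example, so the whole batch is handled without a Python loop.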

We instantiate our model:

model = DotProductBias(n_users, n_movies, 5)

xb, yb = next(iter(dls))
xb.shape, yb.shape

(torch.Size([64, 2]), torch.Size([64]))

We can call the model to create a simple prediction. Under the hood, the model uses the forward method we just defined.

pred = model(xb)
loss = torch.nn.functional.mse_loss(pred, yb)
pred.shape, yb.shape, loss

(torch.Size([64]),
 torch.Size([64]),
 tensor(1.5828, grad_fn=<MseLossBackward0>))

To train the model with SGD, we implement the following algorithm, shown here as pseudocode:

for xb, yb in dls:  # xb: (user, item) index pairs, yb: ratings
    pred = model(xb)
    loss = loss_func(pred, yb)
    loss.backward()
    parameters -= parameters.grad * lr

epochs = 5
lr = 5e-3
wd = 0.1
for epoch in range(epochs):
    epoch_loss = 0
    batch_num = 0
    for xb, yb in dls:
        pred = model(xb)
        loss = torch.nn.functional.mse_loss(pred, yb)
 
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                # optional L2 weight decay using wd (disabled here, so wd is unused):
                # p.grad += wd * 2 * p
                p -= p.grad * lr
            model.zero_grad()
        epoch_loss += loss.item()
        batch_num += 1
    avg_loss = epoch_loss / batch_num
    print(f"Average loss: {avg_loss:.4f}")

Average loss: 1.7865
Average loss: 1.6328
Average loss: 1.5120
Average loss: 1.4171
Average loss: 1.3417
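The manual update in the loop is what torch.optim.SGD does when momentum and weight decay are left at their defaults. The sketch below is not part of the training code above. It checks the claim on a small stand-in model (a torch.nn.Linear with made-up data) by taking one step each way:

```python
import copy

import torch

torch.manual_seed(0)

# Hypothetical stand-in model and data; any torch.nn.Module would do.
model_manual = torch.nn.Linear(3, 1)
model_optim = copy.deepcopy(model_manual)
xb = torch.randn(8, 3)
yb = torch.randn(8, 1)
lr = 5e-3

# One manual SGD step, as in the training loop.
loss = torch.nn.functional.mse_loss(model_manual(xb), yb)
loss.backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p -= p.grad * lr
model_manual.zero_grad()

# The same step taken by the built-in optimizer.
opt = torch.optim.SGD(model_optim.parameters(), lr=lr)
loss = torch.nn.functional.mse_loss(model_optim(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()

same = all(
    torch.allclose(p, q)
    for p, q in zip(model_manual.parameters(), model_optim.parameters())
)
print(same)
```

torch.optim.SGD also accepts a weight_decay argument, which adds weight_decay * p to each gradient. Matching the commented-out line p.grad += wd * 2 * p would therefore need weight_decay=2 * wd.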

From our model, we now have embeddings of movies and users. We can use the movie embeddings to find similar movies. For example, let’s find movies similar to ‘Chinatown (1974)’:

title = "Chinatown (1974)"
item_idx = list(item_codes_to_titles.values()).index(title)
item_factors = model.movie_factors[item_idx].detach()
 
# Calculate cosine similarity between this movie and all others
similarities = torch.nn.functional.cosine_similarity(
    model.movie_factors.detach(), item_factors.unsqueeze(0)
)
 
# Get top similar movies
_, indices = torch.topk(similarities, 10)
 
print(f"Movies similar to '{title}':\n")
for idx in indices:
    similar_title = item_codes_to_titles[idx.item()]
    similarity = similarities[idx].item()
    print(f"{similar_title:<50} (similarity: {similarity:.3f})")

Movies similar to 'Chinatown (1974)':

Chinatown (1974)                                   (similarity: 1.000)
Trial and Error (1997)                             (similarity: 0.981)
Men With Guns (1997)                               (similarity: 0.978)
Forget Paris (1995)                                (similarity: 0.953)
Lashou shentan (1992)                              (similarity: 0.943)
Cook the Thief His Wife & Her Lover, The (1989)    (similarity: 0.936)
Further Gesture, A (1996)                          (similarity: 0.930)
Cool Hand Luke (1967)                              (similarity: 0.927)
Spellbound (1945)                                  (similarity: 0.917)
Liar Liar (1997)                                   (similarity: 0.916)
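For reference, cosine similarity is the dot product of two vectors after each is scaled to unit length. A minimal sketch on random stand-in factors (hypothetical, not the trained model.movie_factors) compares a manual computation with the built-in function:

```python
import torch

torch.manual_seed(0)

# Hypothetical factors standing in for model.movie_factors.
factors = torch.randn(6, 5)
query = factors[0]

# Scale every row and the query to unit length, then take dot products.
unit = factors / factors.norm(dim=1, keepdim=True)
manual = unit @ (query / query.norm())

builtin = torch.nn.functional.cosine_similarity(factors, query.unsqueeze(0))
print(torch.allclose(manual, builtin, atol=1e-6))
print(manual.argmax().item())  # index of the query itself
```

Because cosine similarity ignores vector length, it compares only the direction of the embeddings, which is why the query movie always comes first with similarity 1.000.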
