Collaborative filtering predicts a user’s preferences from the preferences of other users, and it is widely used in recommendation systems. It builds on similarities between users as well as between items. This differs from content-based filtering, where only item features are used to make recommendations.
For this entry, we use a linear model to perform collaborative filtering. The basic idea is as follows. We construct embedding matrices for movies and users; these matrices are the parameters of our linear model. Each user and each movie gets the same number of embedding factors, and their dot product (plus a bias term) is the predicted rating that the user gives the movie.
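Let’s say we have $m$ movies and $n$ users, and we use 5 embedding factors (or latent factors) for each movie and each user. As an example, user $i$’s rating of item $j$ is modeled by the following equation, where the first term in the dot product is the user’s embedding and the second term is the item’s embedding:

$$\hat{r}_{ij} = \mathbf{u}_i \cdot \mathbf{v}_j + b = \sum_{k=1}^{5} u_{ik} v_{jk} + b$$

Here $b$ is the bias term (in the model defined later, the sum of a user bias and a movie bias).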
These embedding matrices are initialized with random numbers. In our case, the embeddings have shape (1664, 5) for movies and (943, 5) for users. Taking the dot product of the item embedding matrix and the transpose of the user embedding matrix gives a matrix of predicted ratings, one for every movie and user pair. We then improve the embedding matrices by training on our dataset of user ratings with SGD.
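As a rough sketch of that matrix view, the snippet below uses toy sizes (3 users, 4 movies) rather than the real dataset. The names `user_emb`, `movie_emb` and `scores` are illustrative assumptions, and the bias term is left out for brevity.

```python
import torch

torch.manual_seed(0)  # only to make the toy example repeatable

n_users_toy, n_movies_toy, n_factors_toy = 3, 4, 5  # illustrative sizes

# Randomly initialized embedding matrices, one row per user / movie
user_emb = torch.randn(n_users_toy, n_factors_toy)
movie_emb = torch.randn(n_movies_toy, n_factors_toy)

# All (movie, user) scores at once: a movies x users matrix
scores = movie_emb @ user_emb.T

# The score for a single pair is the dot product of the two rows
single = torch.dot(movie_emb[2], user_emb[1])

print(scores.shape)
print(scores[2, 1], single)
```

Entry `scores[2, 1]` and `single` are the same quantity computed two ways, which is why training only ever needs to look up the rows for the (user, movie) pairs that appear in a batch.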
Data exploration
import torch
import pandas as pd
#!wget "https://files.grouplens.org/datasets/movielens/ml-100k.zip" -O "../data/ml-100k.zip" && unzip "../data/ml-100k.zip" -d ../data/
from pathlib import Path
path = Path("../data/ml-100k/")
ratings = pd.read_csv(
    path / "u.data",
    delimiter="\t",
    header=None,
    names=["user", "movie", "rating", "timestamp"],
)
ratings.head()
| | user | movie | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
movies = pd.read_csv(
    path / "u.item",
    delimiter="|",
    encoding="latin-1",
    usecols=(0, 1),
    names=("movie", "title"),
    header=None,
)
movies.head()
| | movie | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |
We merge the two datasets to get movie titles:
ratings = ratings.merge(movies)
ratings.head()
| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
| 1 | 186 | 302 | 3 | 891717742 | L.A. Confidential (1997) |
| 2 | 22 | 377 | 1 | 878887116 | Heavyweights (1994) |
| 3 | 244 | 51 | 2 | 880606923 | Legends of the Fall (1994) |
| 4 | 166 | 346 | 1 | 886397596 | Jackie Brown (1997) |
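With no arguments, `merge` joins on the columns the two frames share, which here is `movie`. The sketch below spells that out on toy frames. The frames `toy_ratings` and `toy_movies` and their values are made up, and the explicit `on`, `how` and `validate` arguments are an optional, more defensive variant rather than what the post uses.

```python
import pandas as pd

# Toy stand-ins for the ratings and movies tables (values are illustrative)
toy_ratings = pd.DataFrame(
    {"user": [1, 2, 2], "movie": [10, 10, 11], "rating": [4, 5, 3]}
)
toy_movies = pd.DataFrame({"movie": [10, 11], "title": ["A (1990)", "B (1991)"]})

# Name the join key explicitly, keep every rating row, and have pandas
# raise an error if a movie ID maps to more than one title
merged = toy_ratings.merge(
    toy_movies, on="movie", how="left", validate="many_to_one"
)
print(merged)
```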
len(ratings.user.unique()), len(ratings.title.unique())
(943, 1664)
Data preparation
We create a DataLoader from the pandas dataframe. The inputs are the user and the title (which stands in for the movie), and the output is the rating. When we create the Dataset (see details in Dataloader), we make the input/output distinction by having the __getitem__ function stack user and title together. This way, instead of 3 items, the iterator gives us only 2: input and output, as described above.
from torch.utils.data import DataLoader
batch_size = 64
class CollabDataset:
    def __init__(self, ratings, user_col, item_col, rating_col):
        # Reindex users and items with 0-based category codes;
        # for users, the new index is the original index minus 1
        self.users = torch.tensor(
            ratings[user_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.items = torch.tensor(
            ratings[item_col].astype("category").cat.codes.values, dtype=torch.int
        )
        self.ratings = torch.tensor(ratings[rating_col].values, dtype=torch.float32)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        # Input is the stacked (user, item) pair; output is the rating
        return torch.stack([self.users[idx], self.items[idx]]), self.ratings[idx]
user_col = "user"
item_col = "title"
ratings_col = "rating"
dataset = CollabDataset(ratings, user_col, item_col, ratings_col)
dls = DataLoader(dataset, batch_size=batch_size, shuffle=True)
For the user column, the original index is 1-based, while row lookups into the embedding matrices are 0-based. So we reindex it by category code, which makes the new user index the original index minus 1.
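To see what the category codes do, here is a small sketch on a toy series of IDs (the values and the name `toy_users` are made up). It also shows that a gap in the IDs gets compacted, so "original index minus 1" only holds when the IDs are contiguous and start at 1.

```python
import pandas as pd

# Toy 1-based IDs with a gap (no ID 3); values are illustrative only
toy_users = pd.Series([4, 1, 2, 4, 1])

as_category = toy_users.astype("category")
codes = as_category.cat.codes            # 0-based position of each ID
categories = as_category.cat.categories  # sorted unique original IDs

print(codes.tolist())       # [2, 0, 1, 2, 0]
print(categories.tolist())  # [1, 2, 4]

# Map the codes back to the original IDs
recovered = [categories[c] for c in codes]
print(recovered)            # [4, 1, 2, 4, 1]
```

Keeping `categories` around gives a reliable way back from a code to the original ID, which is the same trick used for movie titles further down.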
Now we can see that the DataLoader iterator yields a list of 2 items at a time: the first is a tensor of (user, item) pairs, and the second is the ratings. This is what we need for training.
next(iter(dls))
[tensor([[ 888, 865],
[ 15, 1227],
[ 58, 319],
[ 192, 793],
[ 732, 333],
[ 310, 1393],
[ 37, 559],
[ 379, 82],
[ 654, 920],
[ 325, 296],
[ 261, 773],
[ 550, 313],
[ 30, 528],
[ 392, 312],
[ 316, 453],
[ 252, 1204],
[ 311, 153],
[ 388, 1008],
[ 38, 1482],
[ 860, 272],
[ 386, 1491],
[ 312, 1210],
[ 14, 780],
[ 681, 302],
[ 845, 1572],
[ 397, 960],
[ 646, 1251],
[ 6, 890],
[ 598, 1570],
[ 881, 494],
[ 213, 910],
[ 780, 339],
[ 312, 337],
[ 184, 389],
[ 449, 314],
[ 827, 1167],
[ 384, 1038],
[ 104, 339],
[ 617, 930],
[ 398, 769],
[ 898, 1146],
[ 795, 232],
[ 304, 627],
[ 822, 88],
[ 642, 950],
[ 346, 493],
[ 876, 1016],
[ 133, 733],
[ 372, 1597],
[ 591, 554],
[ 180, 991],
[ 168, 618],
[ 647, 1019],
[ 214, 179],
[ 893, 519],
[ 298, 450],
[ 294, 1410],
[ 805, 1214],
[ 654, 1401],
[ 857, 378],
[ 621, 1178],
[ 351, 1534],
[ 415, 378],
[ 676, 860]], dtype=torch.int32),
tensor([5., 1., 4., 3., 3., 4., 5., 2., 3., 1., 3., 5., 4., 3., 4., 1., 5., 4.,
5., 4., 3., 3., 3., 4., 5., 4., 3., 5., 5., 5., 4., 4., 5., 4., 4., 4.,
4., 4., 3., 3., 3., 3., 2., 3., 4., 2., 3., 2., 5., 4., 1., 4., 3., 4.,
4., 4., 4., 3., 3., 2., 5., 3., 3., 5.])]
We need a way to retrieve movie titles by index. We can verify that our DataLoader works by picking a row from the iterator, for example ([803, 1214], 1), which means that user 804 (old index) gave movie 1214 a rating of 1. We then look up the corresponding row in the ratings dataframe.
item_codes_to_titles = dict(
    enumerate(ratings[item_col].astype("category").cat.categories)
)
item_codes_to_titles[1214]
'Reality Bites (1994)'
ratings[(ratings[user_col] == 804) & (ratings[item_col] == "Reality Bites (1994)")]
| | user | movie | rating | timestamp | title |
|---|---|---|---|---|---|
| 62087 | 804 | 1074 | 1 | 879447476 | Reality Bites (1994) |
Elements of collaborative filtering
We need to create embedding matrices of latent factors, user_factors and movie_factors, which we initialize with random numbers:
n_users = len(ratings[user_col].astype("category").cat.categories)
n_movies = len(ratings[item_col].astype("category").cat.categories)
n_factors = 5
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)
print(n_users, n_movies)
print(user_factors.shape, movie_factors.shape)
943 1664
torch.Size([943, 5]) torch.Size([1664, 5])
We simplify this process with a helper function that takes a shape and returns an embedding matrix wrapped in torch.nn.Parameter, so that the model treats it as a trainable parameter:
def create_params(size):
    return torch.nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
create_params([2, 3])
Parameter containing:
tensor([[-0.0121, -0.0043, 0.0041],
[ 0.0078, -0.0008, 0.0028]], requires_grad=True)
Next we define a scaled sigmoid function, which the model uses to squash its predictions into a fixed range:
def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low
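A quick check of how this behaves, using a few hand-picked inputs (the tensor `raw` is just an illustration). The function is repeated so the snippet runs on its own.

```python
import torch


def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low


# Very negative, zero, and very positive raw scores
raw = torch.tensor([-10.0, 0.0, 10.0])
out = sigmoid_range(raw, 0, 5.5)

print(out)  # the middle value is the midpoint of the range, 2.75
```

Because the sigmoid only approaches its limits, an upper bound slightly above the top rating (such as the 5.5 used in the model's `y_range`) presumably makes a prediction of 5 easier to reach.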
We define our linear model using a dot product and a bias term. The model needs to subclass torch.nn.Module and implement the __init__ and forward methods:
class DotProductBias(torch.nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        # x[:, 0] holds user codes, x[:, 1] holds movie codes
        users = self.user_factors[x[:, 0]]
        movies = self.movie_factors[x[:, 1]]
        # Row-wise dot product of user and movie factors
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
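For comparison, here is a sketch of the same model written with `torch.nn.Embedding` layers instead of raw parameter matrices. This is a swapped-in variant, not the code used in this post: the class name `DotProductBiasEmb` and the toy sizes are assumptions, and `nn.Embedding` initializes its weights from a standard normal rather than with the 0.01 standard deviation of `create_params`.

```python
import torch


class DotProductBiasEmb(torch.nn.Module):  # hypothetical name
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        # Each Embedding is a lookup table with one row per index
        self.user_factors = torch.nn.Embedding(n_users, n_factors)
        self.user_bias = torch.nn.Embedding(n_users, 1)
        self.movie_factors = torch.nn.Embedding(n_movies, n_factors)
        self.movie_bias = torch.nn.Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        # keepdim=True so the (batch, 1) biases add without broadcasting surprises
        res = (users * movies).sum(dim=1, keepdim=True)
        res = res + self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        low, high = self.y_range
        return (torch.sigmoid(res) * (high - low) + low).squeeze(1)


torch.manual_seed(0)
toy_model = DotProductBiasEmb(n_users=10, n_movies=20, n_factors=5)
toy_batch = torch.tensor([[0, 3], [9, 19]])  # (user code, movie code) pairs
toy_pred = toy_model(toy_batch)
print(toy_pred.shape)
```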
We instantiate our model:
model = DotProductBias(n_users, n_movies, 5)
xb, yb = next(iter(dls))
xb.shape, yb.shape
(torch.Size([64, 2]), torch.Size([64]))
We can call the model to create a simple prediction. Under the hood, the model uses the forward method we just defined.
pred = model(xb)
loss = torch.nn.functional.mse_loss(pred, yb)
pred.shape, yb.shape, loss
(torch.Size([64]),
torch.Size([64]),
tensor(1.5828, grad_fn=<MseLossBackward0>))
To train the model with SGD, we implement the following algorithm (in pseudocode):
for (user, item), rating in dls:
    pred = model(user, item)
    loss = loss_func(pred, rating)
    loss.backward()
    parameters -= parameters.grad * lr
epochs = 5
lr = 5e-3
wd = 0.1  # weight decay; only used if the commented line below is enabled
for epoch in range(epochs):
    epoch_loss = 0
    batch_num = 0
    for xb, yb in dls:
        pred = model(xb)
        loss = torch.nn.functional.mse_loss(pred, yb)
        loss.backward()
        # Update the parameters by hand, without tracking gradients
        with torch.no_grad():
            for p in model.parameters():
                # p.grad += wd * 2 * p
                p -= p.grad * lr
        # Reset gradients so they do not accumulate across batches
        model.zero_grad()
        epoch_loss += loss.item()
        batch_num += 1
    avg_loss = epoch_loss / batch_num
    print(f"Average loss: {avg_loss:.4f}")
Average loss: 1.7865
Average loss: 1.6328
Average loss: 1.5120
Average loss: 1.4171
Average loss: 1.3417
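The hand-written update can also be delegated to `torch.optim.SGD`. The sketch below checks one optimizer step against the manual formula on a toy parameter (`p` and `target` are stand-ins, not the model above). Note one difference from the commented-out line in the loop: PyTorch's `weight_decay` adds `wd * p` to the gradient, whereas `wd * 2 * p` is the gradient of a `wd * sum(p**2)` penalty, so the two differ by a factor of 2.

```python
import torch

torch.manual_seed(0)

lr, wd = 5e-3, 0.1
p = torch.nn.Parameter(torch.randn(4, 3))  # toy parameter, stands in for an embedding
target = torch.randn(4, 3)

opt = torch.optim.SGD([p], lr=lr, weight_decay=wd)

loss = torch.nn.functional.mse_loss(p, target)
loss.backward()

# What a hand-written SGD step with weight decay would produce
with torch.no_grad():
    expected = p - lr * (p.grad + wd * p)

opt.step()       # applies the same update in place
opt.zero_grad()  # reset gradients for the next step

print(torch.allclose(p, expected, atol=1e-6))
```

In the training loop, this would replace the `with torch.no_grad()` block and the `model.zero_grad()` call with `opt.step()` and `opt.zero_grad()`, with the optimizer built from `model.parameters()`.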
From our model, we now have embeddings of movies and users. We can use the movie embeddings to find similar movies. For example, suppose we want to find movies similar to ‘Chinatown (1974)’:
title = "Chinatown (1974)"
item_idx = list(item_codes_to_titles.values()).index(title)
item_factors = model.movie_factors[item_idx].detach()
# Calculate cosine similarity between this movie and all others
similarities = torch.nn.functional.cosine_similarity(
    model.movie_factors.detach(), item_factors.unsqueeze(0)
)
# Get top similar movies
_, indices = torch.topk(similarities, 10)
print(f"Movies similar to '{title}':\n")
for idx in indices:
    similar_title = item_codes_to_titles[idx.item()]
    similarity = similarities[idx].item()
    print(f"{similar_title:<50} (similarity: {similarity:.3f})")
Movies similar to 'Chinatown (1974)':
Chinatown (1974) (similarity: 1.000)
Trial and Error (1997) (similarity: 0.981)
Men With Guns (1997) (similarity: 0.978)
Forget Paris (1995) (similarity: 0.953)
Lashou shentan (1992) (similarity: 0.943)
Cook the Thief His Wife & Her Lover, The (1989) (similarity: 0.936)
Further Gesture, A (1996) (similarity: 0.930)
Cool Hand Luke (1967) (similarity: 0.927)
Spellbound (1945) (similarity: 0.917)
Liar Liar (1997) (similarity: 0.916)
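Cosine similarity compares only the direction of two embedding vectors, not their length. As a small sketch on a random toy matrix (`emb` is an illustration, not the trained movie embeddings), it can be computed by hand as the dot product of L2-normalized vectors and compared with the built-in function:

```python
import torch

torch.manual_seed(0)

emb = torch.randn(6, 5)  # toy embedding matrix: 6 items, 5 factors
query = emb[2]           # the item we want neighbours for

# Normalize every row to unit length, then take dot products with the query
emb_unit = emb / emb.norm(dim=1, keepdim=True)
manual = emb_unit @ (query / query.norm())

builtin = torch.nn.functional.cosine_similarity(emb, query.unsqueeze(0))

# The query itself always ranks first, with similarity 1
values, indices = torch.topk(manual, 3)
print(indices.tolist(), values.tolist())
```

Since the query always matches itself, dropping the first result gives the actual neighbours, as with 'Chinatown (1974)' heading its own list above.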