Unveiling Movie Preferences: A Deep Dive into Recommendation Systems with PySpark
Introduction
In today's digital age, recommender systems are everywhere, from suggesting your next binge-worthy show to curating your shopping experience. But how do these systems learn our preferences? This post walks through building a movie recommendation system with PySpark, moving from simple baselines to matrix factorization, and shows how genre information can be incorporated for better recommendations.
The Dataset
We use a subset of the MovieLens dataset, which includes:
- Over 32 million ratings
- 200,948 unique users
- 84,432 movies
- About 30 distinct genres
This rich dataset helps us understand user-movie interactions at scale.
Initial Data Analysis
- Sparsity: The user-movie rating matrix is extremely sparse (only about 0.18% filled).
- Rating Trends: The average rating stays stable over time (~3.54 stars), so a random train/test split is reasonable rather than a time-based one.
- Skewed Distribution: Ratings are skewed towards 3 and 4 stars, so the median can be more robust than the mean.
- Long Tail Phenomenon:
  - Many users rate only a few movies.
  - Many movies are rated by only a few users.
  - A small number of popular movies receive most of the ratings.
To address popularity bias, we use Inverse User Frequency (IUF) to weight movies based on their rating frequency.
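Here is a minimal PySpark sketch of IUF weighting, assuming a MovieLens-style `ratings.csv` with `userId`, `movieId`, `rating`, and `timestamp` columns (the file path and column names are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("movielens-iuf").getOrCreate()

# Hypothetical path/schema: MovieLens-style ratings.csv with userId, movieId, rating, timestamp.
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# IUF weight for movie i: log(N / n_i), where N = total users and n_i = users who rated movie i.
total_users = ratings.select("userId").distinct().count()

movie_counts = ratings.groupBy("movieId").agg(F.countDistinct("userId").alias("num_raters"))
iuf_weights = movie_counts.withColumn("iuf", F.log(F.lit(total_users) / F.col("num_raters")))

# Attach the weight to each rating row for the weighted RMSE/MAE used later.
ratings = ratings.join(iuf_weights.select("movieId", "iuf"), on="movieId")
```

The resulting `iuf` column down-weights blockbuster titles in the weighted error metrics described next.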
Evaluation Metrics
We use several metrics to assess our models:
- RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error): Standard regression metrics.
- Weighted RMSE/MAE: Incorporate IUF weights for fairer evaluation.
- Ranking Metrics:
  - Precision@K: Proportion of the top-K recommended items that are relevant.
  - Recall@K: Proportion of all relevant items that appear in the top K.
  - NDCG@K: Accounts for the position of relevant items in the ranked list.
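For the ranking metrics, PySpark's `RankingMetrics` (in `pyspark.mllib.evaluation`) covers Precision@K, Recall@K, and NDCG@K. A minimal sketch using the `spark` session from the earlier snippet, with toy per-user recommendation and relevance lists standing in for real model output:

```python
from pyspark.mllib.evaluation import RankingMetrics

# For each user: (top-10 recommended movieIds, movieIds the user actually liked in the test set).
# The pairs below are toy placeholders; in practice they come from the model and the test split.
prediction_and_labels = spark.sparkContext.parallelize([
    ([1, 2, 3, 4, 5], [2, 5, 9]),      # user A
    ([10, 11, 12, 13, 14], [11, 42]),  # user B
])

metrics = RankingMetrics(prediction_and_labels)
print(metrics.precisionAt(10))
print(metrics.recallAt(10))   # recallAt is available in Spark 3.0+
print(metrics.ndcgAt(10))
```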
Baseline Models
- Average Movie Rating: Predicts a movie's rating as its average in the training set.
  - RMSE: 0.9634
  - MAE: 0.7423
- Median Movie Rating: Uses each movie's median rating for robustness.
  - RMSE: 0.9845
  - MAE: 0.7225
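A rough sketch of both baselines, assuming the `ratings` DataFrame from earlier and illustrative column names:

```python
from pyspark.sql import functions as F

# Random train/test split (justified by the stable rating trend over time).
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Per-movie mean and (approximate) median rating on the training set.
movie_stats = train.groupBy("movieId").agg(
    F.avg("rating").alias("pred_mean"),
    F.expr("percentile_approx(rating, 0.5)").alias("pred_median"),
)

preds = test.join(movie_stats, on="movieId", how="left")

# Fall back to the global mean for movies that never appear in training.
global_mean = train.agg(F.avg("rating")).first()[0]
preds = preds.fillna({"pred_mean": global_mean, "pred_median": global_mean})

rmse = preds.agg(F.sqrt(F.avg((F.col("rating") - F.col("pred_mean")) ** 2))).first()[0]
mae = preds.agg(F.avg(F.abs(F.col("rating") - F.col("pred_mean")))).first()[0]
```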
K-Means Heuristic: Clustering by Genre
We experimented with clustering movies by genre using K-Means:
- One-hot encode genres for each movie.
- Cluster movies into k groups (we chose k=150).
- Predict a user's rating for an unseen movie from that user's ratings of other movies in the same cluster.
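A sketch of this heuristic, assuming a MovieLens-style `movies.csv` (with a pipe-separated `genres` column) and the `ratings` DataFrame from earlier; `CountVectorizer` here produces a multi-hot genre vector per movie:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F

# Hypothetical movies.csv in MovieLens format: movieId, title, genres ("Action|Adventure|...").
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)

# Split the pipe-separated genre string and build a multi-hot genre vector per movie.
movies = movies.withColumn("genre_list", F.split("genres", r"\|"))
cv = CountVectorizer(inputCol="genre_list", outputCol="genre_vec", binary=True)
movies_vec = cv.fit(movies).transform(movies)

# Cluster movies into k=150 genre-based groups.
kmeans = KMeans(k=150, seed=42, featuresCol="genre_vec", predictionCol="cluster")
movie_clusters = kmeans.fit(movies_vec).transform(movies_vec).select("movieId", "cluster")

# Predict a user's rating for a movie as that user's average rating within the movie's cluster.
user_cluster_avg = (
    ratings.join(movie_clusters, on="movieId")
    .groupBy("userId", "cluster")
    .agg(F.avg("rating").alias("cluster_pred"))
)
```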
Performance:
- RMSE: 1.79–1.85 (worse than baselines)
- MAE: 1.18–1.29
Conclusion: Genre clustering alone is insufficient for accurate predictions.
ALS: Collaborative Filtering
Alternating Least Squares (ALS) is a matrix factorization technique that learns latent factors for users and movies.
- Best model: rank=13, selected via cross-validation.
- Performance:
  - RMSE: 0.8494
  - MAE: 0.6629
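A minimal training sketch with `pyspark.ml.recommendation.ALS`, assuming the train/test split from the baseline sketch; `rank=13` matches the cross-validated value above, while `maxIter` and `regParam` are illustrative rather than tuned:

```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(
    rank=13,
    maxIter=10,
    regParam=0.1,
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop predictions for users/movies unseen in training
    seed=42,
)
model = als.fit(train)

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))
print("MAE:", evaluator.setMetricName("mae").evaluate(predictions))

# Top-10 recommendations per user, used for the ranking metrics reported below.
top10 = model.recommendForAllUsers(10)
```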
Top-N Recommendation Performance:
- Recall@10: 6.25e-07
- Precision@10: 4.99e-07
- NDCG@10: 4.50e-07
Note: ALS predicts ratings well but struggles with top-N recommendations, especially for unseen items.
Hybrid, Content-Aware Matrix Factorization
To address cold-start and interpretability, we propose a hybrid model that combines collaborative filtering with genre information.
Prediction Model:
$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u^T \mathbf{q}_i + f(\mathbf{g}_u, \mathbf{m}_i)$
- $\mu$: Global average rating
- $b_u$, $b_i$: User and item biases
- $\mathbf{p}_u^T \mathbf{q}_i$: Latent factor interaction (from ALS)
- $f(\mathbf{g}_u, \mathbf{m}_i)$: Interaction based on genre information
User-Genre Vector ($\mathbf{g}_u$): User's affinity for each genre (average rating per genre).
Movie-Genre Vector ($\mathbf{m}_i$): One-hot encoded genre vector for each movie.
Function $f$: Can be a simple dot product ($\mathbf{g}_u^T \mathbf{m}_i$) or a richer form with a learned weight matrix, as illustrated below.
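As one illustration (an assumption on our part rather than a fixed part of the model), $f$ can be made bilinear with a learned genre-weight matrix $W$:

$$f(\mathbf{g}_u, \mathbf{m}_i) = \mathbf{g}_u^T W \mathbf{m}_i, \qquad W \in \mathbb{R}^{G \times G},$$

where $G$ is the number of genres; setting $W = I$ recovers the plain dot product $\mathbf{g}_u^T \mathbf{m}_i$.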
Why Hybrid Matrix Factorization?
- Cold Start: Genre information helps recommend new movies and make recommendations for new users.
- Diversity: Surfaces niche movies and improves coverage.
- Interpretability: Easier to explain recommendations using explicit features.
Convex Optimization Aspect
If the model is linear in its parameters and uses a convex loss (such as MSE), the optimization problem is convex and any local minimum is a global one. Standard matrix factorization is not jointly convex in the user and item factors, but each alternating subproblem is a regularized least-squares problem and therefore convex, which is why ALS-style alternating optimization works well in practice.
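To make the alternating-optimization point concrete: with the item factors $\mathbf{q}_i$ held fixed (and, in this simplified form, ignoring the bias terms), the subproblem solved for each user is an ordinary regularized least-squares problem, hence convex in $\mathbf{p}_u$:

$$\min_{\mathbf{p}_u} \sum_{i \in \mathcal{I}_u} \left( r_{ui} - \mathbf{p}_u^T \mathbf{q}_i \right)^2 + \lambda \lVert \mathbf{p}_u \rVert^2,$$

where $\mathcal{I}_u$ is the set of movies rated by user $u$. The symmetric problem is then solved for each item with the user factors fixed, and the two steps alternate until convergence.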
PySpark Implementation Tips
- Compute user-genre vectors by joining ratings with genres and aggregating.
- Use one-hot encoded genre vectors for movies.
- Combine ALS latent factors and genre vectors as features for a linear regression model.
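A sketch of this feature pipeline (Spark 3.1+ for `array_to_vector`), reusing the `train` split, the fitted ALS `model`, and the `movies` DataFrame with its `genre_list` column from the earlier sketches. All column names are assumptions, and instead of the full dot product $\mathbf{g}_u^T \mathbf{m}_i$ it uses a simpler per-pair genre score (the user's mean affinity over the movie's genres):

```python
from pyspark.sql import functions as F
from pyspark.ml.functions import array_to_vector
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# 1. User-genre affinity: the user's average rating per genre, computed on the training set.
train_genres = train.join(movies.select("movieId", "genre_list"), on="movieId")
user_genre = (
    train_genres.withColumn("genre", F.explode("genre_list"))
    .groupBy("userId", "genre")
    .agg(F.avg("rating").alias("genre_affinity"))
)

# 2. Per-(user, movie) genre score: mean of the user's affinities over the movie's genres.
genre_score = (
    train_genres.withColumn("genre", F.explode("genre_list"))
    .join(user_genre, on=["userId", "genre"])
    .groupBy("userId", "movieId")
    .agg(F.avg("genre_affinity").alias("genre_score"))
)

# 3. ALS latent factors from the fitted model, converted from arrays to ML vectors.
user_factors = model.userFactors.select(
    F.col("id").alias("userId"), array_to_vector(F.col("features")).alias("user_latent"))
item_factors = model.itemFactors.select(
    F.col("id").alias("movieId"), array_to_vector(F.col("features")).alias("item_latent"))

# 4. Combine latent factors and the genre score, then fit a linear model on top.
feats = (train.join(user_factors, "userId")
              .join(item_factors, "movieId")
              .join(genre_score, ["userId", "movieId"]))
assembler = VectorAssembler(
    inputCols=["user_latent", "item_latent", "genre_score"], outputCol="features")
lr_model = LinearRegression(featuresCol="features", labelCol="rating",
                            regParam=0.1).fit(assembler.transform(feats))
```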
This hybrid approach leverages both collaborative and content-based signals for robust recommendations.
Conclusion
Building effective movie recommendation systems is complex. While ALS excels at rating prediction, hybrid models that combine collaborative and content-based information offer better diversity, interpretability, and cold-start handling. By framing the problem within a convex optimization framework, we can build more robust and theoretically sound recommendation engines.