#Hybrid recommendation system


cosmic jolt
#

I am building a recommendation system on the MovieLens dataset. I want to use the movie descriptions to build content-based filtering to suggest movies to new users and to solve the cold-start problem (using TF-IDF, word2vec embeddings, etc.). Then I want to use actual Neural Collaborative Filtering to suggest movies based on ratings. I am using PyTorch. How do I combine these models? What models should I use for both? It would be helpful if you guys could provide resources.

thick jolt
#

Hey! My master's dissertation project (several years ago now!) was around collaborative filtering with the MovieLens dataset

#

If I were doing something like this in 2024, I'd probably use something like RoBERTa (HuggingFace Transformers) to find a dense embedding for my description, and then filter on the embedding space using something like KNN or RNN
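A minimal sketch of the nearest-neighbour step, assuming each description has already been turned into a dense vector. The titles and 3-dimensional vectors below are made up for illustration (real encoder output would have hundreds of dimensions), and `k_nearest` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def k_nearest(query, embeddings, k=2):
    """Return the k titles whose embeddings are most similar to `query`."""
    scored = [(cosine_similarity(query, vec), title)
              for title, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]

# Hypothetical description embeddings (made-up numbers, not model output).
movie_embeddings = {
    "space_opera": [0.9, 0.1, 0.0],
    "alien_invasion": [0.8, 0.2, 0.1],
    "romantic_comedy": [0.0, 0.9, 0.3],
    "courtroom_drama": [0.1, 0.3, 0.9],
}

# Recommend the movies closest to one the user liked, excluding that movie.
liked = movie_embeddings["space_opera"]
candidates = {t: v for t, v in movie_embeddings.items() if t != "space_opera"}
neighbours = k_nearest(liked, candidates, k=2)
print(neighbours)
```

With a real catalogue, a brute-force scan like this would probably be swapped for an approximate nearest-neighbour index, but the ranking logic stays the same.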

The collaborative filtering aspect is completely separate from this. It's essentially a kind of clustering exercise in a different space. If you wanted to somehow combine these and make the semantic and rating spaces closer to one another, you could try something like SBERT with a triplet loss, where you essentially fine-tune an embedding model to force certain examples to be closer to one another according to some external metric (for example, their distance in your collaborative filtering system).
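To make the triplet idea concrete, here is a small sketch of the loss itself on made-up 2-D vectors. The assumption is that the "positive" is a movie the collaborative filtering system places close to the anchor, and the "negative" is one it places far away; how those triplets get mined is left open. In PyTorch the same kind of loss is available as `torch.nn.TripletMarginLoss`.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Made-up description embeddings before any fine-tuning.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]  # close in rating space, but far away in text space
negative = [1.0, 0.0]  # far in rating space, but close in text space

# The positive is further away than the negative, so the loss is large.
loss_before = triplet_loss(anchor, positive, negative)

# Hypothetical embeddings after fine-tuning: the positive has moved close
# to the anchor and the negative has moved away, so the loss drops to zero.
loss_after = triplet_loss(anchor, [0.0, 0.5], [3.0, 4.0])

print(loss_before, loss_after)
```

Minimising this over many triplets is what pulls the description embeddings toward the structure of the rating space.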

proven dragon
#

+1, collab filtering would be another layer to filter on top for ranking. there are some other tricks out there for solving the cold start problem, such as allowing the user to specify their own preferences during the first session (IIRC both Netflix and Spotify do this) and then showing recs from existing user populations that are similar to the new user's taste preference using similarity search

#

I think you should also evaluate how predictive the MovieLens dataset is of consumption/viewing, if possible. I have some experience in this domain and am skeptical that it is all that useful compared to behavioral data in a commercial recsys. Pre-Covid, the single most predictive feature we found was box office revenue (we suspect because it is a good proxy for market awareness), which cost almost nothing to collect, though I imagine the statistical relationship has broken down since Covid

cosmic jolt
# proven dragon +1, collab filtering would be another layer to filter on top for ranking. there ...

I thought of training a linear model on the movie descriptions using TF-IDF and predicting scores on unseen movies for each user. Then I'd use something like neural collaborative filtering to also make predictions for that user. Finally, I'd combine these predictions using a weighted average => hybrid_preds = (0.5 * content_preds + 0.5 * collab_preds). But it wouldn't work for completely new users; the only way to solve that is through similarity ranking. This is a task for an internship, and they want me to build a hybrid model. Idk how to combine these.
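The weighted average described above could be sketched roughly like this. The function name `hybrid_scores`, the movie ids, and the scores are all made up for illustration. The fallback to the content score when a movie has no collaborative prediction is an added assumption, not part of the original formula, and it does nothing for a user with no ratings at all.

```python
def hybrid_scores(content_preds, collab_preds, alpha=0.5):
    """Blend per-movie scores for one user.

    alpha is the weight on the content model; (1 - alpha) goes to the
    collaborative model. Movies with no collaborative prediction fall
    back to the content score alone (an assumption of this sketch).
    """
    blended = {}
    for movie, content_score in content_preds.items():
        collab_score = collab_preds.get(movie)
        if collab_score is None:
            blended[movie] = content_score
        else:
            blended[movie] = alpha * content_score + (1 - alpha) * collab_score
    return blended

# Hypothetical predicted ratings for a single user.
content_preds = {"movie_a": 4.0, "movie_b": 3.0, "movie_c": 5.0}
collab_preds = {"movie_a": 2.0, "movie_b": 4.0}  # movie_c has no ratings yet

scores = hybrid_scores(content_preds, collab_preds, alpha=0.5)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Rather than fixing `alpha` at 0.5, it could be tuned on a held-out set of ratings.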

thick jolt
#

Oh, so that's more combining them as an ensemble. I see

thick jolt
#

So a problem you will probably find with a linear model on TF-IDF is that the input is a really huge sparse vector, so you need a lot of examples to fit the model (if you tokenize to a 4000-long vector, you'd be looking at wanting ~100,000 examples to train that model)
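A toy illustration of the sparsity point, using a simplified TF-IDF (term frequency times unsmoothed log(N / document frequency)). This is a sketch, not the exact weighting any particular library uses, and the three descriptions are invented. Even with three short descriptions, each vector is as long as the whole vocabulary and mostly zeros.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) with one dense TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Number of documents each token appears in.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # missing tokens count as 0
        vectors.append([
            (counts[tok] / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok in vocab
        ])
    return vocab, vectors

# Made-up movie descriptions.
docs = [
    "a detective hunts a killer in paris",
    "two robots fall in love on mars",
    "a chef opens a restaurant in rome",
]

vocab, vectors = tfidf_vectors(docs)
nonzero_counts = [sum(1 for w in vec if w != 0.0) for vec in vectors]
print(len(vocab), nonzero_counts)
```

Every new word in the corpus adds another dimension, so the vector length grows with the vocabulary while each description still only touches a handful of entries.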