Building a Movie Recommender System in Python
Created By :
Prashant Jha, Data Science Enthusiast
What is a Recommendation System ?
As it is clear by name, Recommender System is a piece of software that recommends us the relevant products or services. Recommender systems are everywhere. There can’t be a day when you don’t come through a recommender system on web.
Recommender system is the most basic application of machine learning. Some real life examples of recommender system:
- Amazon recommends us products based on the products we checked out in the past.
- Netflix recommends us movies based on our watch history.
- Google gives us suggestions when we type something in the search box based on our search history.
Types of Recommender System
There are basically two main components of any recommendation system, Users and Items. Items are the entities that are recommended by the recommender system to the users. Let’s understand by taking some examples:
Netflix recommends movies to the people, hence movies are items and people are users, while Facebook recommends the people you may know to the people, here people are users and people are items too.
There are three types of recommender systems that are mostly used:
● Popularity Based Recommender System
Popularity based recommender system recommends the most popular items to the users. Most popular items is the item that is used by most number of users. For example, youtube trending list recommends the most popular videos of the day.
● Content Based Recommender System
Content based recommender systems recommends similar items used by the user in the past.
For example, Netflix recommends us the similar movies to the movie we recently watched.
Similarly, Youtube also recommends us similar videos to the videos in our watch history.
● Collaborative Filtering based Recommender System
Collaborative Filtering based recommender system creates profiles of users based on the items the user likes. Then it recommends the items liked by a user to the user with similar profile.
For example, Google creates our profile based on our browsing history and then shows us the relevant ads
Now we’ll be building a Content Based Hollywood movie recommender system in Python programming language. First we’ll dive deeper into the working of Content Based Recommender System.
Content Based Recommender System Working
Content Based Recommender System recommends items similar to the items user likes. How does it decide which item is most similar to the item user likes. Here we use the similarity scores.
● Similarity Score : It is a numerical value ranges between zero to one which helps to determine how much two items are similar to each other on a scale of zero to one. This similarity score is obtained measuring the similarity between the text details of both of the items. So, similarity score is the measure of similarity between given text details of two items.
Here we’ll use cosine similarity between text details of items. In the example below it is shown how to get cosine similarity:
- Step 1 : Count the number of unique words in both texts.
- Step 2 : Count the frequency of each word in each text.
- Step 3 : Plot it by taking each word as an axis and frequency as measure.
- Step 4 : Find the points of both texts and get the value of cosine distance between them.
Example : Text 1 = “Money All Money”
Text 2 = “All Money All”
Step 1 : There are only two unique words between both texts.
‘Money’ and ‘All’
Step 2 : Counting the frequency.
Step 3 and Step 4 : There are only two unique words, hence we’re making a 2D graph.
Now we need to find the cosine of angle ‘a’, which is the value of angle between both vectors. Here is the formula used for doing this:
In our case, cos(a) = 0.8 . Hence similarity score between Text 1 and Text 2 is 0.8, so we can say that both the texts are 80% similar.
This was all the Math behind cosine similarity, which is used by Content Based Recommender System. Now let’s see how to make a Content Based Recommender System in Python.
Building it in Python
We are going to make a Hollywood movie recommender system in Python. Here we are using the IMDB-5000 Movie Dataset from Kaggle.
You can download the dataset from this link :- https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
After downloading the data, unzip it and there will be two .csv files. Check them out. We’ll use ‘movie_metadata.csv’ in our code.
We’ll be making it in Jupyter Notebook.
Let’s get started :
● Import pandas and numpy
● Read the .csv file with pandas.
● Take a look at the dataset
● Dealing with the null values. Filling them with string ‘unknown’.
● In the ‘genres’ column, replacing the ‘|’ with whitespace, so the genres would be considered different strings.
● Now converting the ‘movie_title’ columns values to lowercase for searching simplicity.
● One more operation we need to perform on the title column. All the title strings has one special character in the end, which will cause problem while searching for the movie in dataset. Hence removing the last character from all the elements of ‘movie_title’.
● Now we’ll sum up the text of all the features we’re going to use for measuring similarity in one column.
● Now after getting all the features summed up in one column in text form. We can make a similarity matrix. For making a similarity matrix, first we have to create count matrix of features.
After importing the libraries, now let’s make count matrix of features and then similarity Score matrix. Similarity score matrix contains the cosine similarity of all the movies in the dataset. We can get the similarity score of two movies by there index in dataset.
Here is a small example:
In the above cosine similarity score matrix :
Similarity score of A and A is 1.0, ie; they are 100% similar.
Similarity score of A and B is 0.7, ie; they are 70% similar.
● Now let’s name a movie user like. Convert it to small case, cause we converted all the title to smallcase. Then check if this movies exists in our dataset or not. This will Return True.
● Now, get the index of the movie in the dataset. Then fetch the row on the same index from the similarity score matrix, which has the similarity scores of all the movies to the movie user like.
● In the above step we also enumerated the row of similarity score, because for the most similar movies we have to sort this list in descending order, but we don’t want to lose the original index of similar movies, hence enumerating is important.
● Now, let’s sort the similarity score list on the basis of similarity score.
● Now we have the indexes of ten most similar movies to the movie user likes in the list. We just need to iterate through the list and store movie names on the indexes in a new list and print it.
● Yeah ! We’ve got some good suggestions here. Now let’s define a function that will take the movie you like as input and will perform all the steps at once and will print the name of ten most similar movies to the movies you like.
As you can see here we are getting some really cool prediction form the function.There are two same names of the top, these are actually the next parts of the movie you entered. So now you’ve got the idea of how recommendation system works and how to build one.