Recommender System Using “Surprise”

Photo by Glenn Carstens-Peters

Recommender systems predict user preferences. They help people find things when choosing manually is hard because there are too many alternatives. Familiar examples are Amazon recommending the next book to read and Netflix suggesting the next movie to watch. There are three main types of recommender systems:

  • Collaborative filtering
  • Content-based filtering
  • Hybrid recommender system

We are going to build a recommender system using the collaborative filtering technique. Collaborative filtering is the technique of making predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
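To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a made-up rating matrix. The matrix values, the choice of cosine similarity, and the helper names cosine_similarity and predict_rating are illustrative assumptions; this is not how Surprise is used later in the article.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# The values are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 5.0, 1.0],
    [1.0, 2.0, 1.0, 5.0],
])

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    # Fall back to the mean of all known ratings if no neighbour rated the item
    return num / den if den > 0 else float(ratings[ratings > 0].mean())

# User 0 has not rated item 2; estimate it from the other two users
estimate = predict_rating(ratings, user=0, item=2)
print(round(estimate, 2))
```

User 0's tastes are much closer to user 1's than to user 2's, so the estimate is pulled towards user 1's rating of item 2.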

Surprise is a Python scikit for building and testing recommender systems that deal with explicit rating data. We will make use of this library for building the collaborative filtering based recommender system.


You can install Surprise on your machine using pip or conda. You'll also need numpy and a C compiler. For Windows users, conda is the recommended method. With pip:

pip install numpy
pip install scikit-surprise
pip install pandas
pip install matplotlib
pip install seaborn

With conda

conda install -c conda-forge scikit-surprise

Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split
from surprise import SVD, accuracy

Load Data

We are going to use Amazon's Clothing, Shoes and Jewelry ratings dataset from here.

colnames=['reviewerID', 'asin', 'overall', 'unixReviewTime'] 
data = pd.read_csv('./data/ratings.csv', names=colnames, header=None)

The data is loaded with pandas. Because the CSV file has no header row, we pass the column names explicitly. There are four columns:


  • reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin - ID of the product, e.g. 0000013714
  • overall - rating of the product
  • unixReviewTime - time of the review (unix time)

The dataset contains 5748920 rows of data. Now we will check the value counts of the column overall.
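A check along these lines gives the row count and the value counts. The small frame below is a made-up stand-in so the snippet runs on its own; in the article, data is the frame loaded from ratings.csv.

```python
import pandas as pd

# Hypothetical stand-in for the ratings frame loaded from ratings.csv
data = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'asin': ['0001', '0001', '0002', '0003', '0003'],
    'overall': [5.0, 5.0, 4.0, 5.0, 1.0],
    'unixReviewTime': [1400000000, 1400000100, 1400000200,
                       1400000300, 1400000400],
})

# Number of rows in the dataset
print(data.shape[0])

# How often each rating occurs, most frequent first
rating_counts = data['overall'].value_counts()
print(rating_counts)
```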


The rating 5.0 has the highest count, which means 5.0 is the rating people gave most often. It is shown in the plot below.
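A bar chart of the rating counts can be drawn roughly as follows. This is a sketch using plain matplotlib on made-up ratings; the author's original plotting code is not shown, so the chart type and labels here are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ratings standing in for data['overall']
overall = pd.Series([5.0, 5.0, 4.0, 5.0, 1.0, 3.0, 4.0], name='overall')

# Count each rating and order the bars by rating value
counts = overall.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel('overall (rating)')
ax.set_ylabel('number of reviews')
ax.set_title('Distribution of ratings')
# In a script, call plt.show() to display the figure; notebooks render it inline
```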


Now we will check for null values in the dataset.
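One common way to do this in pandas is to count the missing values per column. The frame below is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the ratings frame loaded from ratings.csv
data = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A3'],
    'asin': ['0001', '0001', '0002'],
    'overall': [5.0, 4.0, 1.0],
    'unixReviewTime': [1400000000, 1400000100, 1400000200],
})

# Number of missing values in each column
null_counts = data.isnull().sum()
print(null_counts)
```

If any column reported a non-zero count, data.dropna() would be one way to discard the incomplete rows before going further.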


As there are no null values in the dataset, we can move on to the next step.

Load Data into Surprise

We don't need the timestamp column for this. So we will remove that column from the dataset.

data = data[['reviewerID', 'asin', 'overall', 'unixReviewTime']]
input_data = data.iloc[:, :-1]

To load the dataset from a pandas data frame into Surprise, we will use the load_from_df() method. It expects the columns in the order user ID, item ID, rating. We will also need a Reader object with the rating_scale parameter set to (1, 5).

reader = Reader(rating_scale=(1,5))
input_data = Dataset.load_from_df(input_data[['reviewerID', 'asin', 'overall']], reader)

Now we will split input_data into train and test data in an 80:20 ratio.

train_data, test_data = train_test_split(input_data, test_size=0.20)

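The best-known and most widely used matrix decomposition method is the Singular-Value Decomposition, or SVD. Every matrix has an SVD, which makes it more broadly applicable than decompositions that exist only for certain kinds of matrices. Note that Surprise's SVD class implements the matrix-factorization variant popularized by Simon Funk during the Netflix Prize, which learns latent factors from the known ratings by stochastic gradient descent rather than computing an exact decomposition. We will use this technique for training our model.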

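# Fit SVD on the training set, then predict every rating in the test set
algo = SVD()
algo.fit(train_data)
predictions = algo.test(test_data)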
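For intuition about what the SVD algorithm fits, here is a small numpy sketch of the prediction rule given in the Surprise documentation (global mean plus user bias plus item bias plus the dot product of item and user factors) together with one stochastic gradient descent update. The sizes, the mean rating, and the function names predict and sgd_step are made up for illustration; the learning rate and regularization mirror what I understand to be Surprise's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 3, 4, 2

mu = 3.5                                        # global mean rating (made up)
bu = np.zeros(n_users)                          # user biases
bi = np.zeros(n_items)                          # item biases
pu = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
qi = rng.normal(0, 0.1, (n_items, n_factors))   # item factors

def predict(u, i):
    """Estimated rating: mu + b_u + b_i + q_i . p_u"""
    return mu + bu[u] + bi[i] + qi[i] @ pu[u]

def sgd_step(u, i, r, lr=0.005, reg=0.02):
    """One gradient step on the regularized squared error for a known rating r."""
    err = r - predict(u, i)
    bu[u] += lr * (err - reg * bu[u])
    bi[i] += lr * (err - reg * bi[i])
    pu_old = pu[u].copy()  # keep the old user factors for the item update
    pu[u] += lr * (err * qi[i] - reg * pu[u])
    qi[i] += lr * (err * pu_old - reg * qi[i])

# Repeatedly train on a single observed rating: user 0 gave item 1 a 5.0
error_before = abs(5.0 - predict(0, 1))
for _ in range(200):
    sgd_step(0, 1, 5.0)
error_after = abs(5.0 - predict(0, 1))
print(error_before, error_after)
```

Real training loops over every known rating for several epochs; the single repeated rating here only shows that each update moves the estimate towards the observed value.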

To inspect our predictions in detail, we are going to build a pandas data frame with all the predictions. The following code is largely taken from this notebook.

def get_Iu(uid):
    """Return the number of items rated by the given user.
    args:
      uid: the raw id of the user
    returns:
      the number of items rated by the user
    """
    try:
        return len(train_data.ur[train_data.to_inner_uid(uid)])
    except ValueError:  # user was not part of the trainset
        return 0

def get_Ui(iid):
    """Return the number of users that have rated the given item.
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item
    """
    try:
        return len(train_data.ir[train_data.to_inner_iid(iid)])
    except ValueError:  # item was not part of the trainset
        return 0

# One row per test prediction: user, item, true rating, estimate, details
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)

# The 1000 predictions with the smallest and the largest absolute error
best_predictions = df.sort_values(by='err')[:1000]
worst_predictions = df.sort_values(by='err')[-1000:]

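Here rui is the actual user rating while est is the value predicted by our model. Iu is the number of items the user rated in the training set, Ui is the number of users who rated the item in the training set, and err is the absolute difference between est and rui.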
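The rui and est columns are also enough to summarize overall accuracy. The sketch below computes the mean absolute error and the root mean squared error by hand on a made-up predictions frame; Surprise's accuracy.rmse(predictions) and accuracy.mae(predictions) report the same metrics directly from the list of predictions.

```python
import numpy as np
import pandas as pd

# Made-up predictions standing in for the df built above
df = pd.DataFrame({
    'uid': ['A1', 'A2', 'A3', 'A4'],
    'iid': ['0001', '0001', '0002', '0003'],
    'rui': [5.0, 3.0, 4.0, 1.0],   # actual ratings
    'est': [4.5, 3.5, 4.0, 3.0],   # predicted ratings
})
df['err'] = (df.est - df.rui).abs()

mae = df['err'].mean()                    # mean absolute error
rmse = np.sqrt((df['err'] ** 2).mean())   # root mean squared error
print(mae, rmse)
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two hints that a few predictions are far off.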


That's it! We built a fashion apparel recommender system using the Surprise library. You can perform hyperparameter tuning and cross-validation with Surprise to get more accurate predictions. You can find the official Surprise documentation here. I'd love your feedback, so please let me know what you think.

All the code for this article is available on GitHub. This is a Python project, so it should be easy to import and run as it is.