The intro to this series explained why I was making an anime recommendation system, part 1 gave a brief overview of my approach, and part 2 explained how I got my data. In this part, I will explore how I tuned my three different methods of determining item-item similarity.
Method 1: Item-Item similarity based on user scores.
Anime_Score_Sim in my GitHub shows the code for this.
First, all 0’s (statuses without a score) are dropped. Next, I find the average score for each user and subtract that average from each of that user’s scores. This normalizes the data: without it, users having different ideas of what constitutes a 6 vs. a 7 vs. an 8 add noise, especially users with very few reviews.
Then I make a sparse matrix of every user x every anime, with each user’s scaled scores at the intersections.
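A minimal sketch of these steps, using toy data (the column names here are hypothetical, not necessarily those in Anime_Score_Sim):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the MAL ratings data.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 3, 3],
    "anime_id": [10, 20, 30, 10, 30, 10, 20, 30],
    "score":    [0, 8, 6, 9, 7, 5, 0, 10],
})

# Drop 0 scores (statuses without a score).
ratings = ratings[ratings["score"] > 0].copy()

# Subtract each user's mean score to normalize different rating scales.
ratings["scaled"] = ratings["score"] - ratings.groupby("user_id")["score"].transform("mean")

# Sparse user x anime matrix with the centered scores at the intersections.
user_idx = ratings["user_id"].astype("category").cat.codes
anime_idx = ratings["anime_id"].astype("category").cat.codes
matrix = csr_matrix((ratings["scaled"], (user_idx, anime_idx)))

# Item-item similarity: cosine over the anime columns.
sim = cosine_similarity(matrix.T)
print(sim.shape)  # (n_anime, n_anime)
```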
After some remapping to get those into anime names for EDA, I can spot-check to confirm that what the method indicates as similar are logical choices to an anime fan such as myself:
For example, if I look at Death Note, one of the most popular and successful anime of all time, I get a number of other very popular and successful shows, including Attack on Titan (which has the same director, Tetsuro Araki). However, it’s hard to judge these recommendations, as this is almost just a list of ‘crossover anime’ (anime that crossed over to being mainstream).
Next I look at something more obscure:
Girls und Panzer, the twee show about ‘cute girls riding in tanks as a high school sport’, is a little more informative. We see a great number of spin-off and sequel works (MAL lists prequels and sequels as different anime), as well as a number of other ‘cute girls doing cute things’ shows, such as Hidamari Sketch and Non Non Biyori.
If I examine the polarizing show Bakemonogatari, I see one of the downsides of this method:
There are many recommendations for the sequels to Bakemonogatari, and one recommendation for another work by the same author (Katanagatari). While these are certainly similar shows, it is likely that using this system of similarity with the MAL dataset will give ‘possible recs’ that the user has already seen, but not scored, or is already aware of.
Method 2: Anime Review NLP TF-IDF Cosine Distance collaborative filtering
Review_Sim in my GitHub shows the code for this.
After some pre-processing (merging all of our reviews into one dataframe and removing duplicates), I generate a dataframe in which each row contains one anime with the text of every review for that anime merged together. As a further pre-processing step, I replace every instance of an anime’s title in a review with the word “Anime_Title”. This ensures that Death Note and Death Parade are not similar just because of the shared word in their titles.
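A rough sketch of this masking step (the helper name and the regex approach here are my illustration, not necessarily the exact code in Review_Sim):

```python
import re

# Replace every occurrence of a known anime title in a review with
# "Anime_Title", so shared title words (e.g. "Death" in Death Note /
# Death Parade) don't inflate similarity.
def mask_titles(review_text, titles):
    # Replace longer titles first, so a longer title isn't partially
    # clobbered by a shorter one it contains.
    for title in sorted(titles, key=len, reverse=True):
        review_text = re.sub(re.escape(title), "Anime_Title",
                             review_text, flags=re.IGNORECASE)
    return review_text

masked = mask_titles("Death Note is darker than Death Parade.",
                     ["Death Note", "Death Parade"])
print(masked)  # Anime_Title is darker than Anime_Title.
```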
Then I use a TF-IDF (Term Frequency–Inverse Document Frequency) vectorizer to generate a sparse matrix with one row per anime.
N-grams are set from 1 to 3, which means that multi-word phrases (such as ‘moe trash’) are treated as a single token to be vectorized and counted.
In short, this takes every one- to three-word phrase in every review, determines how often that token appears in the review text for one anime (such as Cowboy Bebop), and compares that to how often the token appears across all reviews.
There are a few ways I can vectorize this, but the best results I have found slice this down to the 900 or 1,800 most common tokens, with a custom list of stop words to be ignored. Most of the custom stop words are references to seasons; without this modification, second and third seasons of very different shows are considered similar.
With this sparse matrix of values, I can once again compute cosine pairwise distance between all shows. (So shows which are described by anime fans using similar uncommon words will be closer to each other).
I then turn these similarity scores into a dataframe, which allows me to look at those shows which are most similar to each other.
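A minimal sketch of this lookup step, with made-up similarity scores (any of the similarity methods produces a matrix like this):

```python
import numpy as np
import pandas as pd

# Hypothetical similarity matrix for three shows.
names = ["Death Note", "Girls und Panzer", "High School Fleet"]
sim = np.array([[1.00, 0.05, 0.02],
                [0.05, 1.00, 0.54],
                [0.02, 0.54, 1.00]])

# Index rows and columns by anime name for easy lookups.
sim_df = pd.DataFrame(sim, index=names, columns=names)

def most_similar(title, k=2):
    # Drop the show itself, then take the k highest similarity scores.
    return sim_df[title].drop(title).nlargest(k)

print(most_similar("Girls und Panzer"))
```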
Checking the results for Death Note:
I see a who’s who of very popular, well-liked anime, but this is a different list from the one user-score distance gave us. Specifically, this list tilts towards older, ‘darker’ shows.
Checking the results for Girls Und Panzer:
I see excellent results. This list is not only different from the user-score distances, but features a wide variety of “moe” (cute girls doing cute things) shows, as well as a large number from the subgenre of “militarized moe” (cute girls doing cute military things).
The most similar show on this list is High School Fleet, which has the same premise as Girls und Panzer, but with battleships rather than tanks.
When I look at Bakemonogatari, I also see a marked improvement:
I see the show’s direct sequel (Nisemonogatari) and a show based on a novel by the same author (Katanagatari), but I also see a number of similar shows (Durarara!!, Spice and Wolf, Mawaru Penguindrum) that are different results from what user-score distance returned.
Method 3: Anime Hidden Factors (HF) Cosine Distance collaborative filtering
Mal_Keras_Fm_Clustering in my GitHub shows the code for this.
As a prerequisite for this method, I make a factorization machine neural net. This neural net embeds each user and each anime as N hidden factors, then makes the score a simple dot product of these hidden factors.
After training this neural network, I use the embedding for each show (which are the show’s ‘hidden factors’) and use cosine pairwise distance to measure similarity.
By changing the number of hidden factors, and the amount of regularization, I can modify how logical these factors become.
Here is the network that is used to generate this:
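As a rough Keras sketch of this kind of dot-product embedding model (the sizes, layer names, and regularization strength here are hypothetical; it regularizes the user factors but not the anime factors, which gave me the best results):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.metrics.pairwise import cosine_similarity

n_users, n_anime, n_factors = 1000, 500, 32  # hypothetical sizes

user_in = keras.Input(shape=(1,), name="user")
anime_in = keras.Input(shape=(1,), name="anime")

# Embed each user and each anime as n_factors hidden factors;
# regularize the user side only.
user_vec = layers.Flatten()(layers.Embedding(
    n_users, n_factors,
    embeddings_regularizer=regularizers.l2(1e-5))(user_in))
anime_vec = layers.Flatten()(layers.Embedding(
    n_anime, n_factors, name="anime_factors")(anime_in))

# The predicted score is a simple dot product of the two factor vectors.
score = layers.Dot(axes=1)([user_vec, anime_vec])

model = keras.Model([user_in, anime_in], score)
model.compile(optimizer="adam", loss="mse")
# model.fit([user_ids, anime_ids], scores, ...) on the MAL data

# After training, each anime's embedding row is its hidden factors,
# and cosine distance between rows measures similarity.
anime_factors = model.get_layer("anime_factors").get_weights()[0]
sim = cosine_similarity(anime_factors)
print(anime_factors.shape, sim.shape)
```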
When tuning this, I discovered that the more hidden factors and the heavier the regularization, the stranger the ‘similar’ anime become. Without regularization, and with too few hidden factors, the results overlap too much with the prior two methods (the shows similar to Bakemonogatari are all Bakemonogatari sequels).
Given that the two other methods are generating ‘logical’ similar items, I tuned this part of the model specifically to allow for “serendipity”: users being recommended shows that they are unlikely to have sought out on their own or to know about already, but would still enjoy. Without serendipity, a recommender system like this one can create a filter bubble, a feedback loop, or otherwise generate recommendations that are logical and accurate, but not useful.
In a production environment, multi-armed bandit testing would help tune the hyperparameters: the number of hidden factors and the amount of regularization.
The best results I have found so far come from regularizing user factors but not anime factors.
It’s important to note that because the two other methods provide ‘traditional’ answers, this method is being used for serendipity. If this were the core of a recommender system without either of the other two item-item similarity methods, it would be better left with fewer hidden factors and less regularization.
Some examples of these ‘serendipity’ similarities:
With all three of these methods, for each show a user has liked, I can generate a variety of similar anime. Next, I will feed all of these similar anime into neural nets to predict the user’s score.
Also published on Medium.