The intro to these posts explained why I was making an Anime recommendation system, and Part 1 gave a brief overview of the approach I was taking. In the next part I will start to explore how I tuned my Item-Item similarity models.
Before diving into that, I wanted to walk through my data collection process and the initial analysis that helped guide the modeling decisions.
Every project like this starts with the data, and since "Data is the New Oil" it's always worth going past the platitudes to explain where my data came from and how it was gathered.
MyAnimeList (Mal) no longer offers API keys (there have been DDoS issues), so all of my data was acquired through scraping.
There is a great project, Mal_Scraper, which I used to get score information for each user once I had a username. The problem is that MyAnimeList doesn't provide any list of users. Mal does, however, provide a search function for users by name, with a minimum search length of three characters. My solution was to randomly generate three characters, get all usernames from Mal that contain those characters, and repeat. I ran this script until I had 1.5 million distinct usernames.
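The random-search loop can be sketched roughly like this. Note that `SEARCH_URL` and `extract_names` are placeholders, not Mal's real endpoint or my actual parsing code; this is just the shape of the approach:

```python
import random
import string
import time

import requests

# Placeholder endpoint: Mal's real user-search URL and result markup differ.
SEARCH_URL = "https://myanimelist.net/users.php"


def random_query(length=3):
    """Generate a random lowercase query at the minimum search length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))


def harvest_usernames(target=1_500_000, delay=1.0):
    """Repeat random searches until enough distinct usernames are collected."""
    usernames = set()
    while len(usernames) < target:
        resp = requests.get(SEARCH_URL, params={"q": random_query()})
        # extract_names() stands in for parsing usernames out of the HTML.
        usernames.update(extract_names(resp.text))
        time.sleep(delay)  # be polite between requests
    return usernames
```

Using a `set` means repeated hits on popular usernames are deduplicated for free, which matters since random queries will overlap heavily.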
With those usernames, I began scraping on three nodes, each behind a VPN.
On a fourth node, I used Beautiful Soup to get plaintext reviews from users.
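Extracting plaintext from a fetched review page looks roughly like this. The `"review-element"` class name is a placeholder; Mal's actual markup differs and changes over time:

```python
from bs4 import BeautifulSoup


def extract_reviews(html):
    """Return the plaintext of each review block on a fetched page."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ name is illustrative, not Mal's real markup
    return [
        div.get_text(strip=True)
        for div in soup.find_all("div", class_="review-element")
    ]
```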
Exploratory Data Analysis
After getting approximately 800,000 scores, I began some exploratory data analysis.
This would tell me some general things about the type of data I was collecting, and help guide the model I was going to build.
I did all of my EDA in Tableau Public, which is great for visualization, especially interactive visualizations.
The first thing I needed to confirm was that my pseudo-random username sampling was not producing a skewed subset of Mal's user population.
Both of these graphs are different versions of an X = Y line, which is what they should look like, since the values in my dataset should track the values that Mal lists for its entire dataset.
The next thing to look at in a dataset like this is the distribution of scores:
A few things jump out immediately: not only are some scores far more common than others (8s especially), but scores below 5 are rarely given out. This is a common pattern across all types of media review systems; there's even a term for it. The best solution is scaling the data: rather than analyzing how a user has scored an Anime in absolute terms, I can express every score as some number of points higher or lower than that user's average score:
I will use this technique (scaled rather than raw scores) throughout the model.
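As a minimal sketch of the scaling step, here is how the per-user mean-centering might look with pandas (the column names and toy data are illustrative, not my actual schema):

```python
import pandas as pd

# Toy score table: two users with different personal baselines
scores = pd.DataFrame({
    "user":  ["a", "a", "a", "b", "b"],
    "anime": [1, 2, 3, 1, 2],
    "score": [8, 9, 10, 5, 7],
})

# Subtract each user's own mean, so a "scaled" score of +1 means
# one point above that user's typical rating
scores["scaled"] = (
    scores["score"] - scores.groupby("user")["score"].transform("mean")
)
```

User `a` averages 9, so their 8/9/10 become -1/0/+1; user `b` averages 6, so their 5 and 7 become -1 and +1. A generous rater and a harsh rater now land on a comparable scale.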
A less common aspect of this dataset is that Mal allows users to give scores to shows they claim they have not seen.
Mal also allows users to indicate if they have seen something without scoring it. Roughly half of the available data is a ‘Status’ without a true score.
The more a show is reviewed in the data, the better it is liked. This is logical: people hear about good Anime and watch it because it is good, or because they think it is good. This also points to a larger issue I will encounter in this project: I don't have nearly enough negative feedback. Too few users log in to score a show they disliked, or a show they stopped watching. Because this data came from an opt-in site, users who began an Anime, realized it was not for them, and stopped watching without updating Mal are underrepresented.
The final bit of exploration was how different genres (by Mal's tags) are scored.
In the next post, I’ll get into the Cosine-Similarity methods I used to determine what Anime is similar to other Anime.