Sam Kenkel

Data Science, Machine Learning, DevOps, CCNA, ACSR

Lol_Scout 1: Data Collection

This is the 1st of 3 blog posts about my process and discoveries working with data from Riot’s online game, League of Legends.  The code I wrote to do this can be found here.

Background: Project Purpose

League of Legends is a 5 v 5 video game and esport in which two teams of 5 players compete against each other. Before the ‘game’ starts, each player chooses 1 of (currently 133) characters, known as champions. No two players may play the same champion. In ranked and professional play, players may “ban” a champion and prevent either side from choosing that champion. There are several common strategies for this ‘Pick/Ban’ phase:

  1. Ban the “strongest” champion.
  2. Ban the champions that are strongest against champions your team is planning on picking.
  3. Ban the champions that (in professional play) you know your opponents are most successful or skilled with.

Since a team only has 5 bans, and every banned champion is banned for both teams, evaluating these three goals (and which 5 champions are worth banning) is a difficult task for professional League of Legends teams.

The purpose of the project was to build a classifier (with Team 1 winning as the target) that would help answer the question of how to pick and ban optimally, and to use that classifier’s output to evaluate the three strategies above.

Approach to data collection

I began this project with the prior version of the Riot API (before the last of the v1 endpoints were deprecated). In that version of the API, for a ‘personal’ project such as this I could get a 24 hour key. With that key, I could query:

The top 200 (by current rank) players of League of Legends in the world.

The match IDs of every ranked match played by a player (if I had the player’s unique identifier).

Summary statistics about a player’s ranked Win/Loss overall and Win/Loss per character.

I also pulled data from the site Champion.gg, which publishes champion (character) statistics for every champion/role combination (the role being how the character is used in game) that is common enough for meaningful statistics to be generated. I used an API key for Champion.gg, but later decided against using any of this data for two reasons:

1. My reading of the Riot API terms of service leads me to believe I should generate my own summary statistics for how champions perform on a specific patch.

2. Without extensive use of unauthorized data sources, I had no way of confirming that my data followed the same distribution of player skill as the champion.gg statistics I collected.


The Limits of my approach to data collection

I’m going to spoil some of my conclusions to this project: by my own metrics for generating insight, I failed. I get 55% accuracy at best versus 50% from just guessing, and the data I wanted to investigate, players’ skill with specific characters, makes my accuracy lower. I’m writing these blog posts (and sharing my source code) because there are things to be learned from failures. The ‘Science’ part of Data Science should refer to using the Scientific Method of learning from mistakes, sharing knowledge, and pursuing reproducibility.

I failed at the data collection step, and I failed because I didn’t refine my data collection program; while my code needs to be refactored, it’s the underlying algorithm that I needed to rewrite. Because I didn’t concern myself sufficiently with data leakage, or with ensuring an unbiased sample, I took a statistically unsound approach to summary stats, and had to impute a staggering number of them. Because I didn’t do sufficient EDA on a trial run of my data collection system, I ended up with an unbalanced distribution of matches played per patch. Because I didn’t truly grasp the sparsity issues I was going to face, I ended up with an unbalanced ratio of player data to match data, with too many different players for the number of matches I had gathered information on.

Issue 1: Inefficient Code

The code that I used to gather this data made extensive use of cassiopeia. This saved me time in writing my data collection application, so it allowed me to start gathering data sooner.

However, the data script I wrote had several problems. The Riot API (sanely) limits the number of API calls, which makes every API call a valuable resource. This is a perfect proxy for how all Data Scientists should care about Data Structures and Algorithms, since not caring about computational efficiency with your experiment dataset is a good way to end up with a solution that will not scale to your production dataset. The risk for people learning Data Science/Machine Learning on the ‘standard’ datasets (Iris/Titanic/Boston housing) with access to modern computation is that it’s easy to end up with ‘bad habits’ that will lead to poor outcomes on any “real” project.

The lesson is to test your process at the smallest scale and optimize before you go further. I used cassiopeia to make calls requesting match information. The Riot API does calls using a summoner id (which I had cached in a dataframe). In the code I wrote for the cassiopeia calls, I made calls using the summoner object. When I did this, cassiopeia was doing a Riot API call to convert the summoner name to an id before using that summoner id to get match data. Writing my code this way meant everything took twice as long as it should have (since calls are limited by time).
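
To make the cost concrete, here is a minimal sketch that uses a stand-in client rather than cassiopeia or the real Riot API. `FakeRiotClient`, `id_for_name`, and `matches_for_id` are hypothetical names invented for this illustration; all it shows is how resolving a name on every request doubles the number of rate-limited calls.

```python
class FakeRiotClient:
    """Stand-in for a rate-limited API; it only counts calls (hypothetical)."""

    def __init__(self):
        self.calls = 0

    def id_for_name(self, name):
        self.calls += 1  # one rate-limited call to resolve the name
        return hash(name) % 10_000

    def matches_for_id(self, summoner_id):
        self.calls += 1  # one rate-limited call to list matches
        return []


def fetch_by_name(client, names):
    # Resolves every name to an id before asking for matches: 2 calls each.
    return [client.matches_for_id(client.id_for_name(n)) for n in names]


def fetch_by_cached_id(client, cached_ids):
    # Uses ids that are already cached locally: 1 call each.
    return [client.matches_for_id(i) for i in cached_ids]


slow, fast = FakeRiotClient(), FakeRiotClient()
fetch_by_name(slow, ["alpha", "beta", "gamma"])
fetch_by_cached_id(fast, [101, 102, 103])
print(slow.calls, fast.calls)
```

Counting calls against a fake client like this on a handful of players is the kind of small-scale test that would have caught the problem before the real crawl started.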

Issue 2: Summary statistics are a double-edged sword

In my initial project/experiment design, I wanted to use summary statistics for feature engineering. I wanted to track the delta between the overall win/loss ratio of a champion in a specific patch and a specific player’s performance on that champion, to feature engineer some measure of specialization or practice with a champion. When I stopped collecting statistics (at the endpoint deprecation which removed my ability to use the Riot API to get summary statistics) I had fewer than 100,000 games, spread out among more than 10 patch releases. This meant that the number of games behind each per-champion, per-patch Win/Loss ratio was small enough that I was sure to have an unbalanced dataset.

At different levels of experience or skill, League of Legends is effectively a different game. There are entire websites dedicated to helping players understand which champions should be a concern at their level of play. Because of the data collection method I used, rather than a targeted collection of games at the highest levels of play, I ended up with a random assortment.

Issue 3: Sparsity is the enemy

Since I began working on an Anime Recommendation Engine as a portfolio project, I’ve been fighting sparsity. Sparsity, the fact that in a combinatorial data space many of the combinations are not present, is one of the core difficulties in many types of product recommenders. When I began working with the data I had collected, I realized it was one of the core issues I was facing here. There are 133 active champions in this dataset, with 10 champions being used per game. How many champion combinations are there? Even counting ordered picks of only four champions gives 133 × 132 × 131 × 130 = 298,978,680; the number of distinct sets of ten champions is on the order of 10^14. That doesn’t include roles (a character being used in a different position, which can often change how they are used so radically that it is not fair to treat them as the same character). Even with some of the standard approaches to sparsity (SVD/embedding encoders), my data is just too sparse: the best way to treat a character in League of Legends is in the context of the role, the patch number, and the skill level of the game. Nothing in my data collection process was sufficiently restricting those variables in the match data I collected.

Because of how this data collection process worked, I ended up with many, many players who are part of very few games, and without summary statistics. This means any type of generalization as to their level of skill is fraught with peril. If the only games with a given character on a certain patch are those played by a few players who have singular dedication to that character, that will distort the output of any model I use, since I’m ultimately trying to generate predictions.
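
The size of that space can be checked with Python’s standard library. This is a sketch under my own assumption about what counts as a “combination”: either an unordered set of ten distinct champions, or two unordered teams of five. It also shows that 298,978,680 is the count of ordered picks of only four champions.

```python
import math

CHAMPIONS = 133

# Ordered picks of only four champions: 133 * 132 * 131 * 130.
ordered_four = math.perm(CHAMPIONS, 4)

# Unordered sets of ten distinct champions appearing in one game.
ten_in_game = math.comb(CHAMPIONS, 10)

# Two unordered teams of five, ignoring which side is "Team 1".
two_teams = math.comb(CHAMPIONS, 5) * math.comb(CHAMPIONS - 5, 5) // 2

print(ordered_four)
print(f"{ten_in_game:.3e}")
print(f"{two_teams:.3e}")
```

Against a dataset of fewer than 100,000 games, either count means almost every ten-champion lineup is seen at most once, before roles, patches, or skill level are considered.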


The Lesson: Minimum Viable Product mentality is dangerous with Data Science

Algorithms and Data Structures matter. Computational Complexity matters. Data Scientists (or machine learning engineers) should remember the lessons from leetcode, ctci and the like. The point of those lessons is that you shouldn’t just throw “more storage”, “more data”, or “more computation” at issues. With poor experiment design and poor data collection techniques, the technical debt I incurred in the initial stages of the project was so severe that my modelling was ultimately fruitless.

If I had truly grappled with my data sparsity problem sooner, I would have realized that I needed to compute my own summary statistics, and not used so much of a limited resource (api calls) on something better handled another way.

If I had better tested my code and optimized it from the ground up, I would have gathered more data.

If I had better explored the initially collected data, I would have changed my “Characters to Scrape” code to focus on a smaller number of players, which would have compressed the range of player skill in the sample to a more manageable amount, and mitigated the sparsity issues.

What I should have done: How I’ll approach this problem the next time I attempt it. 

Create a variable for challenger_players (top 200 players, which can be returned by the riot api)

Create a variable for players_to_scrape. Add all of the challenger players to this list.

Begin pulling data on every match played by each of the players currently in players_to_scrape. For each match, add every unseen player to the players_to_scrape variable, but only if at least one player in the match is part of the challenger_players variable.

When this loop completes, repeat it, checking the current top 200 and adding any new players (there is no need to remove the players
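
One pass of the loop described above could look roughly like the following. This is a sketch, not the code in the repository: `get_match_ids` and `get_participants` are hypothetical callables standing in for whatever API wrapper is used, and the toy `history` and `rosters` dictionaries exist only to exercise it.

```python
from collections import deque


def crawl(challenger_players, get_match_ids, get_participants):
    """Snowball outward from the challenger list (one pass)."""
    challengers = set(challenger_players)
    players_to_scrape = deque(challengers)
    seen_players = set(challengers)
    matches = {}

    while players_to_scrape:
        player = players_to_scrape.popleft()
        for match_id in get_match_ids(player):
            if match_id in matches:
                continue  # already stored; do not spend another call
            participants = get_participants(match_id)
            matches[match_id] = participants
            # Only expand through matches that contain a challenger.
            if challengers.isdisjoint(participants):
                continue
            for other in participants:
                if other not in seen_players:
                    seen_players.add(other)
                    players_to_scrape.append(other)
    return matches, seen_players


# Toy data: c1 is the only challenger; p3 never shares a match with c1.
history = {"c1": ["m1"], "p2": ["m1", "m2"], "p3": ["m2"]}
rosters = {"m1": ["c1", "p2"], "m2": ["p2", "p3"]}
matches, players = crawl(["c1"], lambda p: history.get(p, []), rosters.__getitem__)
print(sorted(matches), sorted(players))
```

In the toy run, p2 is added because they shared match m1 with a challenger, while p3 is stored as a participant of m2 but never becomes a scrape target. Repeating the pass with a refreshed challenger list only requires calling `crawl` again with the new list.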

How is this different from the original code:

  1. All summary statistics and measures of skill are computed in a 2nd stage based on the collected dataset. Specifically, this allows for a summary statistic for each player of “at the time this match was played, what was their Win/Loss as a player, their Win/Loss for that character”, etc.
  2. The restriction on which players are added to the target list radically decreases the “spread” of skill in the dataset. Every player on the target list will be good enough to have been matched (at least once) with a player who will at some point be in the top 200.
  3. Because no summary statistics are being collected, every api call will give match data. Because measures of skill will be computed for the dataset, I can feature engineer measures of character strength (such as champion win rate), or player-character familiarity specific to this dataset: this will target the model to the dataset, and reduce any “noise” that would otherwise come from differences between the summary statistics for characters at different skill levels/ from random chance inherent in how my sample is collected.
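
A second-stage statistic of the “at the time this match was played” kind can be computed by walking the matches in time order and recording each player’s record before updating it. The sketch below uses only the standard library; the tuple layout `(timestamp, player, champion, won)` and the names `p1` and `champ_a` are assumptions made for the example, not the schema of my dataset.

```python
from collections import defaultdict


def add_prior_records(rows):
    """rows: iterable of (timestamp, player, champion, won) tuples.

    Returns one dict per row holding the player's record on that champion
    *before* the match, so the outcome being predicted never leaks in.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    out = []
    for ts, player, champion, won in sorted(rows, key=lambda r: r[0]):
        key = (player, champion)
        prior_games = games[key]
        prior_rate = wins[key] / prior_games if prior_games else None
        out.append({"timestamp": ts, "player": player, "champion": champion,
                    "prior_games": prior_games, "prior_win_rate": prior_rate})
        # Update the running totals only after the features are recorded.
        games[key] += 1
        wins[key] += int(won)
    return out


rows = [(3, "p1", "champ_a", True),
        (1, "p1", "champ_a", True),
        (2, "p1", "champ_a", False)]
features = add_prior_records(rows)
print([f["prior_win_rate"] for f in features])
```

The first match for a player and champion has no prior record, which is reported as `None` rather than imputed; how to handle those rows is a modelling decision left open here.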


Post-Script: Technical Writeup of current code

If you want to follow this code, it can be found here, but I wanted to give the option of following along without a second tab.

First I import the libraries I use. (Including Cassiopeia to manage the calls to the riot api).

As a method of Api Key security, I force myself to type in my api key every time I use this script.
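
A minimal way to do this with the standard library is `getpass`, which reads the key without echoing it to the terminal. The function name `read_api_key` and the prompt text are my own choices for this sketch; the original script may prompt differently.

```python
import getpass


def read_api_key(prompt_fn=getpass.getpass):
    """Ask for the key at run time so it is never stored in the script."""
    key = prompt_fn("Riot API key: ").strip()
    if not key:
        raise ValueError("an API key is required")
    return key
```

Calling `api_key = read_api_key()` at the top of the script prompts once per run. The `prompt_fn` parameter exists only so the function can be exercised without a terminal.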

I use the endpoints in the champion.gg API to get summary statistics (win rate, play rate, KDA) for each champion at each role that champion.gg is tracking. I make a “unique” key of the champion+role+date and use that with a break to ensure that I don’t end up with duplicate data. This is good practice for any data that doesn’t come with its own unique key.
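
The composite-key idea can be sketched as follows. The field names `champion`, `role`, and `date` are assumed for the example and are not taken from the champion.gg response format; this version also skips duplicates with `continue` rather than stopping with a `break`.

```python
def dedupe(records, seen_keys=None):
    """Keep only the first record for each champion+role+date key."""
    seen_keys = set() if seen_keys is None else seen_keys
    unique = []
    for rec in records:
        key = f'{rec["champion"]}|{rec["role"]}|{rec["date"]}'
        if key in seen_keys:
            continue  # duplicate pull for the same champion, role and day
        seen_keys.add(key)
        unique.append({**rec, "key": key})
    return unique


pulled = [
    {"champion": "champ_a", "role": "MIDDLE", "date": "2018-01-01", "win_rate": 0.51},
    {"champion": "champ_a", "role": "MIDDLE", "date": "2018-01-01", "win_rate": 0.51},
    {"champion": "champ_a", "role": "TOP", "date": "2018-01-01", "win_rate": 0.48},
]
clean = dedupe(pulled)
print([row["key"] for row in clean])
```

Passing the same `seen_keys` set into later calls keeps the deduplication consistent across separate daily runs.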

This gives me data that looks like this:

The Riot API gives only two ways to request lists of players (otherwise known as summoners): explicitly, we can request the top 200 players in the ranked ladder; implicitly, we can generate lists of summoners from the games that other players have played with them.

So at the beginning of the data collection process I generate this seed (I made a .csv file with the columns spelled the way I’d prefer long term).

Then I iterate through the list of summoners to get the summoner id and summoner name.

And I generate a pandas dataframe, and end up with my “seed” or initial scrape targets.

I convert these to summoner_to_scrape.

I define a function, which I call in my main loop, that adds a player I see in a match to my overall list of players to get match information from.

Then the main loop, which goes through my current “players to get match information about”, gets every match they have played, and checks each match for summoners not part of the current seed list.  The original version of that code is too long for a screenshot, so I recommend going to my github to check it.

Then I run one more loop, which gathers summary statistics about players (with the intention of having summary statistics for every player who is in the matches database).

In practice, after my initial “seed” function, every 24 hours I would update my Riot API key and run either of the two loops (the player statistics loop or the match information loop).

Also published on Medium.