Sam Kenkel

Data Science, Machine Learning, DevOps, CCNA, ACSR


Like everyone with an interest in Data Science, I use the Data Science learning/competition site Kaggle.

Recently, I’ve been working on the Porto Seguro challenge. It’s a frustrating competition to take part in, but it’s also a great chance to walk through my standard Kaggle process and a great opportunity to dig into some of the limitations of Kaggle for data science projects.

The current source code and readme can be found here.

This is a good project for talking, at a high level, about how I approach Kaggle projects (and how that approach can be applied to non-Kaggle data science projects).

The key to my approach to this project (at every stage) is raw computational power: while insight and clever experiments can often save you some CPU cycles, this data has been thoroughly scrubbed of meaning and context to protect customer privacy.

I use netdata on my workstation (a Z620 with 16 cores and 32 threads) and keep a window open at all times to check CPU usage: if usage drops below 70 percent, I assume something isn’t multithreading correctly and go back and tweak the code.
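When a model leaves most cores idle, the first knob to check is its parallelism setting. This is a minimal sketch using scikit-learn's `n_jobs` parameter; the random forest here is illustrative, not the exact model from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the competition data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to fit trees on all available cores; if
# netdata shows CPU usage well under 100%, this setting is worth checking.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Most scikit-learn estimators and cross-validation helpers accept the same `n_jobs` argument.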

For this project I focused on multithreaded CPU optimization. The next time I encounter a project with this kind of computational demand, I’ll work on getting GPU acceleration running on every model I use.

Step 0: Define the problem and the available data.

For a Kaggle competition, this is easy. You have a metric you want to optimize (which represents a business opportunity or challenge), and you have data.
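For this competition the leaderboard metric is the normalized Gini coefficient, which for a binary target can be computed from the ROC AUC. A minimal sketch:

```python
from sklearn.metrics import roc_auc_score

def normalized_gini(y_true, y_score):
    # For a binary target, the normalized Gini coefficient reduces to 2*AUC - 1.
    return 2 * roc_auc_score(y_true, y_score) - 1

# A perfectly ranked prediction scores 1.0; random guessing scores ~0.0.
print(normalized_gini([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```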

Step 1: Research:

Before starting any problem, it’s always worth reading what others have done. Kaggle sometimes makes this too easy, but it’s great for understanding certain techniques (early stopping with LightGBM, for example) that are explained and commonly used.

Step 2: Feature engineering:

In a normal project, domain knowledge, research, or extensive EDA can drive this. Because this data is anonymized, feature engineering is much more difficult.
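A few features need no domain knowledge at all. This is a hypothetical sketch in the style of the Porto Seguro data, where missing values are coded as -1 and a `_cat` suffix marks categorical columns; the column names and values here are made up.

```python
import pandas as pd

# Hypothetical toy frame mimicking the anonymized competition data.
df = pd.DataFrame({
    "ps_ind_01": [2, -1, 5, 0],
    "ps_car_04_cat": [1, 2, -1, 1],
})

# One cheap context-free feature: how many values are missing per row.
df["missing_count"] = (df == -1).sum(axis=1)

# One-hot encode categoricals, since the anonymized codes carry no known order.
df = pd.get_dummies(df, columns=["ps_car_04_cat"])
print(df["missing_count"].tolist())  # → [0, 1, 1, 0]
```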

Step 3: Hyperparameter tuning. Self-explanatory.
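This is also where raw computational power pays off: a parallel search over the parameter space keeps every core busy. A minimal sketch with scikit-learn's `RandomizedSearchCV`; the model, ranges, and tiny `n_iter` are illustrative.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(3, 12),
    },
    n_iter=5,       # tiny for illustration; real runs use far more candidates
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,      # spread candidate fits across all cores
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```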

Step 4: Stacking or ensembling.
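The basic idea of stacking is to train a meta-model on the out-of-fold predictions of several base models. A minimal sketch using scikit-learn's `StackingClassifier`; the base models here are placeholders, not the ones from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # meta-model over out-of-fold preds
    cv=3,
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```

Even a plain average of the base models' predicted probabilities often beats any single model, and is a reasonable baseline before a full stack.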