I placed in the top 10% of my first Kaggle competition. If you are not familiar with it, Kaggle is an ongoing forum for competitive data science. Individuals and teams compete to create the best model for data sets provided by industry and sometimes academia.
Individuals who enter are ranked as Novice, Kaggler, or Kaggle Master. To become a Kaggle Master, one must place in the top 10% of two competitions and in one of the top 10 slots of a third competition.

I’ve talked about Kaggle in many of my presentations. I’ve also used Kaggle data in my books. Until now, I had yet to actually enter a Kaggle competition. I decided it was finally time to try this for myself. I competed in the Otto Group Product Classification Challenge that ended on May 18th, 2015. My score was sufficient to land in the top 10%, so I’ve completed one of the requirements for Kaggle master. My Kaggle profile can be seen here.

My goals for entering were:

  • See how hard Kaggle actually is, and move towards a Kaggle master designation.
  • Learn from the other Kagglers and forums.
  • Build a basic toolkit that I will use for future Kaggle competitions.
  • Gain an example (from my entry) for the Artificial Intelligence for Humans series.
  • Maybe get an idea or two for my future dissertation (I am a PhD student at Nova Southeastern University).

The Otto Classification Challenge

First, I will give a brief introduction to the Otto Classification Challenge. For a complete description, refer to the Kaggle description (found here).
This challenge was introduced by the Otto Group, which is the world’s largest mail order company and currently one of the biggest e-commerce companies, mainly based in Germany and France but operating in more than 20 countries. They sell many products across numerous countries and would like to be able to classify these products into 9 categories, using 93 features (columns). These 93 columns represent counts, and are often zero.

The data are completely redacted. You do not know what the 9 categories are, nor do you know the meaning behind the 93 features. You only know that the features are integer counts. Most Kaggle competitions provide you with a test and training dataset. For the training dataset you are given the outcomes, or correct answers. For the test set, you are only given the 93 features, and you must provide the outcome. The test and training sets are divided as follows:

  • Test Data: 144K rows
  • Training Data: 61K rows

You do not actually submit your model to Kaggle. Rather, you submit your predictions based on the test data. This allows you to use any platform to make these predictions.
The actual format of a submission for this competition is the probability of each of the 9 categories being the outcome. This is not like a university multiple choice test where you must submit your answer as A, B, C, or D. Rather, you would submit your answer as:

  • A: 80% probability
  • B: 16% probability
  • C: 2% probability
  • D: 2% probability

I wish college exams were graded like this! Often I am very confident about two of the answers, and can eliminate the other two. Simply assign a probability to each, and you get a partial score. If A were the correct answer for the above, I would get 80% of the points.

The actual Kaggle score is slightly more complex than that. You are graded on a logarithm-based scale and are very heavily penalized for assigning a low probability to the correct answer. The following are a few lines from my submission:

1,0.0003,0.2132,0.2340,0.5468,6.2998e-05,0.0001,0.0050,0.0001,4.3826e-05
2,0.0011,0.0029,0.0010,0.0003,0.0001,0.5207,0.0013,0.4711,0.0011
3,3.2977e-06,4.1419e-06,7.4524e-06,2.6550e-06,5.0014e-07,0.9998,5.2621e-06,0.0001,6.6447e-06
4,0.0001,0.6786,0.3162,0.0039,3.3378e-05,4.1196e-05,0.0001,0.0001,0.0006
5,0.1403,0.0002,0.0002,6.734e-05,0.0001,0.0027,0.0009,0.0297,0.8255

Each line starts with a number that specifies the data item that is being answered.
The sample above shows the answers for items 1-5. The next 9 values are the probabilities for each of the product classes. These probabilities must add up to 1.0 (100%).
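
A minimal sketch of a logarithm-based score of this kind (multiclass log loss) follows. It assumes the natural logarithm, rescales each row to sum to 1.0, and clips probabilities away from 0 and 1. The function names and the clipping bound are illustrative choices, not taken from the competition code.

```python
import math

def normalize(row):
    """Rescale a row of non-negative scores so it sums to 1.0."""
    total = sum(row)
    return [p / total for p in row]

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability given to the correct class.

    y_true: correct class index (0-based) for each item.
    y_prob: one row of class probabilities per item.
    eps:    clipping bound so log(0) never occurs (an assumed value).
    """
    total = 0.0
    for label, row in zip(y_true, y_prob):
        p = normalize(row)[label]
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# The multiple-choice example from earlier: A=80%, B=16%, C=2%, D=2%.
answer = [0.80, 0.16, 0.02, 0.02]
loss_if_a = multiclass_log_loss([0], [answer])  # correct answer is A
loss_if_c = multiclass_log_loss([2], [answer])  # correct answer is C
```

Lower is better. The loss when C turns out to be correct is many times larger than when A is correct, which is the heavy penalty described above.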

What I Learned from Kaggle

If you want to do well in Kaggle, the following are very important topics, along with the tools I used.

The two areas that I learned the most about during this challenge were GBM parameter tuning and ensemble learning. I got pretty good at tuning a GBM. The individual scores for my GBMs were in line with those achieved by the top teams.

Before Kaggle, I typically used only one model. If I were using neural networks, I just used neural networks. If I were using an SVM, random forest, or gradient boosting, I stuck to just that model. With Kaggle, it is critical to use multiple models, ensembled to produce better results than any of the models could produce independently.

Some of my main takeaways from the competition:

  • GPU is really important for deep learning. It is best to use a deep learning package that supports it, such as H2O, Theano or Lasagne.
  • The t-SNE visualization is awesome for high-dimension visualization and creating features.
  • I need to learn to ensemble better!

This competition was the first time I used t-SNE. It works like PCA in that it is capable of reducing dimensions. However, the data points separate in such a way that the visualization is often clearer than with PCA. This is done using a stochastic neighbor embedding process. I plan to learn more about how t-SNE actually performs the reduction, compared to PCA.

t-SNE Plot of the Otto Group Challenge

My Approach to the Otto Challenge

Until this competition, I had only worked with single-model systems. I’ve used models that contain “built-in” ensembles, such as random forests and gradient boosting machines. However, it is possible to create higher-level ensembles of these models. I used a total of 20 models: 10 deep neural networks and 10 gradient boosting machines. My deep neural network system provided one prediction and my gradient boosting machines provided the other.
These two predictions were blended together, using a simple ratio. The resulting prediction vector was then normalized so that the sum equaled 1.0 (100%).
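
The ratio blend and the final normalization can be sketched as follows. This is only an illustration: the function name, the three-class rows, and the 0.6 weight are made-up values, not the ones used in the competition.

```python
def blend(nn_probs, gbm_probs, nn_weight):
    """Weighted average of two probability rows, renormalized to sum to 1.0.

    nn_weight is the share given to the neural network prediction; the
    GBM prediction receives the remaining (1 - nn_weight).
    """
    mixed = [nn_weight * a + (1.0 - nn_weight) * b
             for a, b in zip(nn_probs, gbm_probs)]
    total = sum(mixed)
    return [p / total for p in mixed]

# Hypothetical predictions for one item over three classes.
nn_row = [0.7, 0.2, 0.1]
gbm_row = [0.5, 0.3, 0.2]
blended_row = blend(nn_row, gbm_row, nn_weight=0.6)
```

When both inputs already sum to 1.0, the weighted average does too, so the normalization mainly guards against rounding in the submission file.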

Jeff Heaton's Kaggle Model for the Otto Group

I did not remove or engineer any fields. For both model types I converted all 93 attributes into Z-Scores. For the neural network I normalized all values to be in a specific range.
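
As a rough illustration of these two preprocessing steps, the sketch below applies a z-score and a range normalization to one column of counts. The helper names, the sample column, and the 0-to-1 target range are assumptions; the post does not state which range was used.

```python
import statistics

def z_scores(column):
    """Convert a column to z-scores: (x - mean) / standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant column carries no signal
    return [(x - mean) / std for x in column]

def to_range(column, low=0.0, high=1.0):
    """Linearly rescale a column so its minimum maps to low, maximum to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [low for _ in column]
    return [low + (x - lo) * (high - low) / (hi - lo) for x in column]

# A made-up count feature; like the Otto features, it is mostly zeros.
counts = [0, 0, 1, 3, 0, 6]
standardized = z_scores(counts)
ranged = to_range(counts)
```

In practice the mean, standard deviation, minimum, and maximum would be computed on the training data and reused for the test data.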

My 10 deep learning neural networks used a simple bagging method. I averaged the predictions from the 10 different neural networks. Each of these neural networks was created by choosing a different 80/20 split between training and validation. Each neural network was trained on its training data until the validation score did not improve for 25 epochs.
Once training stopped, I used the weights from the epoch that produced the best validation score. This process is a simple form of bagging (bootstrap aggregation).
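
The split, stopping rule, and averaging can be sketched without any particular neural network library. This sketch assumes that the weights kept are those from the epoch with the lowest validation loss; all function names and the sample loss curve are hypothetical.

```python
import random

def split_80_20(n_rows, seed):
    """Return (train_idx, val_idx) for a random 80/20 split of row indices."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    cut = int(n_rows * 0.8)
    return idx[:cut], idx[cut:]

def best_epoch_with_patience(val_losses, patience=25):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs pass without a new best loss;
    the weights to keep are those saved at best_epoch.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

def average_predictions(rows):
    """Average the probability rows that several networks give one item."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Made-up loss curve: improves for three epochs, then stalls.
losses = [0.9, 0.8, 0.7] + [0.75] * 30
best, stopped = best_epoch_with_patience(losses, patience=25)
train_idx, val_idx = split_80_20(100, seed=1)
```

Giving each network a different seed yields a different split, and `average_predictions` then combines the per-network outputs into one prediction.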

My 10 gradient boosting machines (GBM) were each components of a 10-fold cross-validation.
I essentially broke the Kaggle training data into 10 folds and used each of these folds as a validation set, and the others as training. This produced 10 gradient boosting machines.
I then used an NxM coefficient matrix to blend each of these together, where N is the number of models and M is the number of categories. In this case it was a 10x9 grid. This matrix weighted each of the 10 models’ predictive power in each of the 9 categories.
These coefficients were a straight probability calculation from the confusion matrix of each of the 10 models. This allowed each model to potentially specialize in each of the 9 categories.
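
One way such a coefficient matrix could be built and applied is sketched below. It assumes the "straight probability calculation" is per-class precision, that is, how often a model is right when it predicts a given category. That reading, the function names, and the small two-model, three-class example are assumptions rather than the actual competition code.

```python
def class_weights(confusion):
    """Per-class precision from one model's validation confusion matrix.

    confusion[actual][predicted] holds counts. The weight for class c is
    P(actual = c | predicted = c), an assumed choice of coefficient.
    """
    n = len(confusion)
    weights = []
    for c in range(n):
        predicted_c = sum(confusion[a][c] for a in range(n))
        weights.append(confusion[c][c] / predicted_c if predicted_c else 0.0)
    return weights

def blend_models(predictions, coefficients):
    """Blend one item's predictions from N models with an N x M matrix.

    predictions[m][c]:  probability model m assigns to class c.
    coefficients[m][c]: weight of model m for class c.
    """
    n_classes = len(predictions[0])
    blended = []
    for c in range(n_classes):
        weight_sum = sum(row[c] for row in coefficients)
        score = sum(w[c] * p[c] for w, p in zip(coefficients, predictions))
        blended.append(score / weight_sum if weight_sum else 0.0)
    total = sum(blended)
    return [b / total for b in blended]  # renormalize to sum to 1.0

# Hypothetical confusion matrices for two models over three classes.
confusion_a = [[8, 1, 1], [2, 6, 2], [0, 3, 7]]
confusion_b = [[5, 5, 0], [0, 10, 0], [5, 0, 5]]
coefficients = [class_weights(confusion_a), class_weights(confusion_b)]
item = blend_models([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]], coefficients)
```

With 10 models and 9 categories, `coefficients` would be the 10x9 grid, and a model that is reliable for one category gets more say in that column only.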

I spent considerable time tuning my GBM. I used Nelder-Mead searches to optimize my hyper-parameter vector. I ultimately settled on the following parameters:

params = {'max_depth': 13, 'min_child_weight': 4, 'subsample': 0.78, 'gamma': 0, 'colsample_bytree': 0.5, 'eta': 0.005, 'threads': 24}

Each of these two approaches (GBM and neural network) produced a separate submission file. I then blended these together, weighting each. I found that 0.65 gave me the best blend with my deep neural network.
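
For readers unfamiliar with the Nelder-Mead search mentioned above, here is a simplified, dependency-free sketch of the idea. The objective is a toy stand-in; in a real search it would train a GBM with the candidate hyper-parameters and return the validation loss. The function names, step size, and iteration count are illustrative assumptions.

```python
def nelder_mead(f, x0, step=0.1, iterations=200):
    """Minimize f from x0 with a simplified Nelder-Mead simplex search.

    Returns (best_value, best_point). Uses reflection, expansion,
    inside contraction, and shrink with the usual coefficients.
    """
    n = len(x0)
    points = [list(x0)]
    for i in range(n):
        p = list(x0)
        p[i] += step
        points.append(p)
    simplex = [(f(p), p) for p in points]  # cache each evaluation
    for _ in range(iterations):
        simplex.sort(key=lambda t: t[0])
        f_best, best = simplex[0]
        f_second = simplex[-2][0]
        f_worst, worst = simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[i] for _, p in simplex[:-1]) / n for i in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_ref = f(reflected)
        if f_ref < f_best:
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_exp = f(expanded)
            simplex[-1] = (f_exp, expanded) if f_exp < f_ref else (f_ref, reflected)
        elif f_ref < f_second:
            simplex[-1] = (f_ref, reflected)
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_con = f(contracted)
            if f_con < f_worst:
                simplex[-1] = (f_con, contracted)
            else:
                # Shrink every vertex halfway toward the best one.
                shrunk = [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                          for _, p in simplex[1:]]
                simplex = [(f_best, best)] + [(f(p), p) for p in shrunk]
    return min(simplex, key=lambda t: t[0])

def toy_objective(v):
    """Toy stand-in for a validation loss, with its minimum at (0.3, 0.7)."""
    return (v[0] - 0.3) ** 2 + (v[1] - 0.7) ** 2

best_value, best_point = nelder_mead(toy_objective, [0.0, 0.0])
```

Nelder-Mead needs no gradients, which suits hyper-parameter tuning, but every evaluation means training a model, so the iteration budget matters.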

What Worked Well for Top Teams

The top Kaggle teams made use of more sophisticated ensemble techniques than I did.
This will be my primary learning area for the next competition. You can read about some of the top models here:

The above write-ups are very useful, and I’ve already started examining their approaches.

Some of the top technologies discussed were:

  • Feature Engineering
  • Input Transformation - good write up here
    • log transforms
    • sqrt(x + 3/8) - Not sure what this one is called, but I saw it used a few times
    • z-score transforms
    • ranged transformation
  • Hyperparameter Optimization
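
The log and square-root input transformations listed above are easy to sketch for count features. The sketch assumes the log transform is applied as log(1 + x) so that zero counts stay defined. The sqrt(x + 3/8) form is closely related to the Anscombe transform, 2 * sqrt(x + 3/8), a variance-stabilizing transform for Poisson-like counts. The function names and sample values are illustrative.

```python
import math

def log_transform(x):
    """log(1 + x): compresses large counts and maps 0 to 0."""
    return math.log1p(x)

def sqrt_transform(x):
    """sqrt(x + 3/8): the square-root transform seen in the write-ups."""
    return math.sqrt(x + 3.0 / 8.0)

# A made-up row of count features.
counts = [0, 1, 2, 10, 50]
logged = [log_transform(x) for x in counts]
rooted = [sqrt_transform(x) for x in counts]
```

Both transforms preserve the ordering of the counts while shrinking the gap between large and small values, which can help neural networks in particular.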

I will probably not enter another Kaggle until the fall of this year. This blog post will be updated to contain my notes as I investigate other techniques for this competition.