State of data science: the Kaggle prism

The buzz about AI and data science can be deafening. What can Kaggle tell us about realities in the field?

Background

Gauged by media hype, the last year or two have brought a revolution in data science. Has there been such a revolution? If so, we would expect to find evidence on Kaggle. To inspect the state of amateur sport, we would watch the Olympics: as world records stand or fall, we can measure the achievements of today’s athletes against those of the past. Kaggle invites us as spectators (and participants) to the arena for international “data athletes.”

Are these data athletes getting better? Are they performing feats today which were unthinkable a few years ago? Has there been a data science revolution?

What Kaggle offers

Since Anthony Goldbloom founded Kaggle with Ben Hamner in 2010, data hackers have won more than $3 million in prize money across more than 135 competitions. Though there are typically only 20-30 competitions per year, there are few better sources of hackable data about data hackers.

For the last seven years, Kaggle has made almost all competition information available. We know the nature of the challenges, including their data types, objectives and evaluation metrics. We also know how many competitors faced each challenge, how hard they tried, and how successful they were.

Sample issues

Quantitative analysis of Kaggle competition data suffers from several handicaps. The first is the small sample size: 20-30 diverse competitions each year are too few to support strong conclusions. The second problem compounds the small sample: competition quality, data types and evaluation metrics all vary, and the differing evaluation metrics make like-for-like comparisons particularly challenging. Although AUC and RMSE have been the most common evaluation metrics throughout the life of Kaggle, together they account for just 3 of the 16 competitions completed this year.

Possible conclusions

Every year, the average number of players per competition has increased. In 2016, player numbers top 1,500 on average, with nearly 6,000 data hackers competing for $60,000 from Santander. In 2012, competitions attracted just over 200 competitors on average, with even the KDD Cup drawing fewer than 900. More competitors also make winning harder: winning teams this year make 8x as many entries as average teams, versus 6x in 2012.
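These are simple aggregations to reproduce from Kaggle's public competition pages. The sketch below is a minimal illustration only, assuming a hypothetical competitions.csv with one row per team and columns year, competition, team, entries and rank; the file and column names are not Kaggle's actual export format.

```python
import pandas as pd

# Hypothetical per-team data: year, competition, team, entries, rank.
teams = pd.read_csv("competitions.csv")

# Average number of teams per competition, by year.
avg_teams = (
    teams.groupby(["year", "competition"])["team"].nunique()
         .groupby("year").mean()
)
print(avg_teams)

# Ratio of the winning team's entry count to the average team's, by year.
winner_ratio = teams.groupby(["year", "competition"]).apply(
    lambda g: g.loc[g["rank"].idxmin(), "entries"] / g["entries"].mean()
)
print(winner_ratio.groupby("year").mean())
```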

The biggest payouts each year have not grown since the $500,000+ payout of the Heritage Health Prize in 2013. But average cash prizes have generally increased, with 2016 on track for an average prize in excess of $50,000. That said, an increasing percentage of Kaggle contests award jobs rather than cash.

The mix of underlying data types varies from year to year across image, text, waveform and, most commonly, parametric data, but there have not been any persistent trends. Similarly, the spread of evaluation metrics across AUC, RMSE, MAP, MAE, Log Loss and others has varied over the last seven years without many consistent trends. In general, RMSE has become less common, whilst Log Loss has found more favour with Kaggle sponsors.

Similarly, the domains from which the challenges arise vary over time. The commercial focus of data science in most organisations remains marketing. Science and finance trail in second and third, with other domains such as travel, retail, health, industrial and NLP making up the rest.

Uncertainties

The figures above evidence more competition in the data hacking arena. Increasing competition is consistent with broader global interest in data science. But what about performance? What about results?

Much like comparing times for the Olympic 100m sprint, we would like to compare the results of data athletes today to those five years ago. Unfortunately our sample issues prevent that direct comparison.

Evaluation metrics which are not normalised, such as MAE or Log Loss, cannot be compared across contests at all. Even comparing the same normalised metric, such as AUC or RMSE, across contests is misleading. Still, if we compare all the results that share the same class of normalised evaluation metric, the winning scores do not improve over time. Nor, even, do the benchmark scores materially improve.
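A minimal sketch of that comparison, restricted to a single normalised metric, might look like the following. The file scores.csv and its columns (year, metric, winning_score, benchmark_score) are assumptions for illustration, not an actual Kaggle dataset.

```python
import pandas as pd

# Hypothetical per-competition results: year, metric, winning_score, benchmark_score.
scores = pd.read_csv("scores.csv")

# Keep only contests scored with one normalised metric, e.g. AUC.
auc = scores[scores["metric"] == "AUC"]

# Mean winning and benchmark scores by year; per the analysis above,
# neither shows a material upward trend.
print(auc.groupby("year")[["winning_score", "benchmark_score"]].mean())
```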

One might argue that the level of contest difficulty has increased in line with competitor ability, thereby hiding the performance improvements in Kaggle athletes. We have no measure of contest difficulty with which we can test that argument.

Conclusions

The data science arena of Kaggle does not evidence any revolution in data science. Like any field of practice, it is evolving. The explosion in media coverage and popular interest in “deep learning” and “artificial intelligence” seems to reflect a change in public consciousness of the field, rather than newsworthy changes in the field itself.

More money is being spent on data hacking in its various flavours. More organisations are employing more people to extract value from data. However, most of these organisations are catching up with practices employed for years by smaller, savvier organisations. Although improvements are made every day in the field, most effort is expended on the classical problem of marketing — how to sell people more stuff.

Despite the often mundane uses of Extreme Gradient Boosting to sell more T-shirts, empowering more people with better tools and improved understanding is good. Those improvements should benefit the evolution of our society.

Notes:

  1. Noting, however:
    • professionals are now allowed to compete in most Olympic sports
    • recent news stories highlight the extent to which doping has been a feature of international sport
  2. This sample includes 150 competitions completed through 11 July 2016 which were advertised with cash prizes (138) or jobs (12). Competitions with “knowledge” or “swag” prizes were excluded. Also excluded were the following competitions:
    • Additional parts to GE Flight Quest
    • GE Hospital Quest as a subjective, qualitative contest
    • Practice Fusion preliminary phases (problem selection)
    • CHALEARN (Microsoft Kinect) prelim round
    • Parkinson's progression as evaluated on subjective basis
    • State of Colorado Education - no entries?
    • Harvard Business Review Visualisation competition as subjective
    • Prelim Click Predict Fix competition ending Sep 2013
    • Follow the Money from Oct 2012 - no entrants?
    • Eurovision 2010 missing leaderboard
    • Kaggle Leaderboard design contest as subjective
    • Multi-modal learning contest from 2013 ICML as no results or team data
    • World Cup 2010 with zero entries, as well as follow-up with no leaderboard
  3. We do not actually know how much effort competitors expended; we only know how many entries they made and use that as a rough proxy for effort. On that proxy, typical winners make 5-10x as many submissions as average competitors.
  4. Parametric data in CSV or equivalent format represents 67% (100) of the 150 competitions in the sample.
  5. Marketing has been the focus of at least 40 competitions. The 4 retail sales prediction competitions might also have marketing as their underlying motivation rather than financial planning.
  6. Google Trends information is illustrative, rather than authoritative.
  7. Winning scores are based upon private leaderboard results. Benchmark scores are also difficult to compare. Some contests have no benchmarks; in such cases, a benchmark score has been assumed where applicable, e.g. 0.5 for AUC. Other competitions have multiple benchmarks; in these cases, we have subjectively chosen the highest benchmark (closest to a theoretical perfect score) which does not require knowledge of domain-specific modelling techniques. A minimal sketch of this rule follows these notes.
  8. Despite being a valuable platform for encouraging new data hackers and knowledge transfer within the data science community, Kaggle itself has struggled to find a viable business model.
  9. Like Google Trends, trend information from Indeed is illustrative only.
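As flagged in note 7, here is a minimal sketch of the benchmark-handling rule, with assumed function and column names; the "does not require domain-specific modelling" filter is a manual judgement and is not captured here, and the max rule applies to metrics where higher is better.

```python
import math

# Theoretical no-skill scores used when a contest publishes no benchmark.
DEFAULT_BENCHMARKS = {"AUC": 0.5}

def resolve_benchmark(metric, published_benchmarks):
    """Pick one benchmark score for a competition (higher-is-better metrics)."""
    if not published_benchmarks:
        # No published benchmark: fall back to a theoretical default, if any.
        return DEFAULT_BENCHMARKS.get(metric, math.nan)
    # Several published benchmarks: take the highest (closest to perfect).
    return max(published_benchmarks)

# Example: an AUC contest with no published benchmark resolves to 0.5.
print(resolve_benchmark("AUC", []))
```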