Data fairy tales

For all the buzz about data science, there are no magic wands.

No magic wands

Large organisations often talk about Tableau, Qlik, Pentaho, or SAS as if buying a license will magically bring value to the business. These providers (and many of their competitors) are good enterprise software businesses. They sell themselves as magic wands to transform the businesses of their customers.

Reality is different.

Extracting value from data is an ongoing process. People on your team need to perform each role in that process. They need to choose their own tools. No one tool suits everyone on your team. Salespeople are selling you fantasy, not reality. They sell you a license for a cute dashboard. What you need is value from your data. That value extraction is a process.

Data scientists know a lot about process.

Tasks, people, and tools

The likes of Gartner and IBM have spent hundreds or thousands of man-hours documenting and re-documenting “frameworks” for extracting value from data, or “business intelligence”. They speak the language of the large organisation. They do not speak the language of the startup. They do not speak plain English.

In plain English, there are some steps to the value extraction process. Some of those steps require regular human involvement. Some can be automated, but that automation must be engineered by humans. Humans use tools.

Frameworks and acronyms aside, just focus on tasks, people, and tools.

Around the loop

Decision makers prefer not to face the same decisions repeatedly. But value extraction requires it. Extracting value from data is an iterative process. It is dynamic. It is not a one-off. Put a process in place. Then commit everybody to accelerating around that loop by reducing friction each time.

These low-friction, accelerating iterations are familiar territory to startups. They are alien to large organisations. Their first iterations will be slower. Their acceleration will be slower. But large organisations can get there. They need only get started.

Start somewhere

The simplest place to start is with a data source. It might be a log file. It might be a SQL database. It won’t be perfect. That’s okay. Set some simple temporal boundaries. Set a point in time for a snapshot, or a start and end time for a time series. Now extract the data.

The data will be incomplete. The missing data may be elsewhere in your systems. Or it may be data which isn’t captured at all. Just make a note of what is missing. The format will be imperfect. It may even be terrible. Make a note of how the format could be better.

This stage requires a techie, or a tech-savvy business analyst with access to your core IT infrastructure. They need to copy the log file, or run the database query and output the results somewhere safe for others to tinker with. They may use grep, awk, sed, or other command-line tools to do some pre-processing of text files. Or they may load the database dump into a separate database instance for analysis.

You can automate data snapshots with a flavour of cron, or something more heavyweight like Chronos, to call a bash or other script.
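A minimal sketch of such a snapshot script, using only Python's standard library; the database file, table name, and date boundaries below are hypothetical:

    # snapshot.py: dump a time-bounded snapshot of one table to CSV.
    # The database file, table, and boundaries are hypothetical.
    import csv
    import sqlite3
    from datetime import datetime

    SNAPSHOT_START = "2016-01-01"  # start of the time series
    SNAPSHOT_END = "2016-02-01"    # end of the time series

    conn = sqlite3.connect("core.db")
    rows = conn.execute(
        "SELECT * FROM events WHERE created_at >= ? AND created_at < ?",
        (SNAPSHOT_START, SNAPSHOT_END),
    )

    # Output the results somewhere safe for others to tinker with.
    outfile = "snapshot_{}.csv".format(datetime.now().strftime("%Y%m%d"))
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in rows.description])
        writer.writerows(rows)

A crontab entry along the lines of 0 2 * * * python snapshot.py then takes the snapshot nightly.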

Clean it up

Data snapshots may need cleaning during early iterations. Date formats might be a mess. Numeric values might be text. Missing values can be problematic. As your snapshots evolve and incorporate more data, you might need to link data, update indices, normalize or denormalize for database analysis. Log files may need reformatting into SQL or CSV.
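A sketch of that clean-up in pandas, assuming the snapshot file from above and some hypothetical column names:

    import pandas as pd

    # The file and column names are hypothetical.
    df = pd.read_csv("snapshot_20160201.csv")

    # Messy date strings become proper datetimes; anything unparseable becomes NaT.
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")

    # Numeric values stored as text become floats; junk becomes NaN.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

    # Make missing values explicit rather than silently problematic.
    df["channel"] = df["channel"].fillna("unknown")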

Now calculate basic metrics about the data. How many rows or entries? How many columns or fields? How many missing values? What are the basic statistical properties for each field, including the range and distribution of values?
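Continuing the sketch above, pandas answers those questions in a few lines:

    print(df.shape)                    # how many rows and columns
    print(df.isna().sum())             # missing values per field
    print(df.describe(include="all"))  # range, quartiles, and counts per field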

Clean-up should be automated by someone close to your core infrastructure. The basic metrics should be reviewed by a data or business analyst. As you reduce friction in your value extraction, that person need not review the metrics from every iteration. The analyst might instead rely upon alerts via Twilio or your corporate alerting system, triggered when basic metrics breach particular thresholds.
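A sketch of such a trigger, with the thresholds and the notify() helper as placeholders for your own alerting system, be that Twilio or something corporate:

    # The thresholds below are illustrative; set your own.
    MIN_ROWS = 10000
    MAX_MISSING_FRACTION = 0.05

    def notify(message):
        print("ALERT:", message)  # replace with a call to your alerting system

    if len(df) < MIN_ROWS:
        notify("Snapshot unexpectedly small: {} rows".format(len(df)))

    worst_missing = df.isna().mean().max()
    if worst_missing > MAX_MISSING_FRACTION:
        notify("Worst field is {:.1%} missing".format(worst_missing))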

For small snapshots you might use a tool like Trifacta, OpenRefine, or even Excel in early iterations; small snapshots only need lightweight tools. As you automate the process, you can move to a distributed or other suitable architecture, and start to move your data into your analysis platform, such as Python or R.

Make it manageable

Armed with some basic information about the raw snapshot, it might be time to sample. Sampling is appropriate if two conditions hold. First, the raw snapshot is too big for the next stage(s) of your analysis. If you had to use Hadoop or another distributed architecture for the previous stage, the raw snapshot is too big. The second condition is that you can mitigate any sampling bias.

If you need to sample, you will need one of your software engineers to implement the sampling process. You will also want one of your quantitative analysts involved. As sampling is only necessary for large snapshots, your software engineers will work with your distributed architecture and code in a systems language suitable for heavy computation.
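Whatever the implementation language, the logic is simple. A sketch in Python, with a hypothetical segment field used to stratify the sample and so mitigate one common source of sampling bias:

    # Simple random sample: one row in a hundred, reproducibly.
    sample = df.sample(frac=0.01, random_state=42)

    # Stratified sample: 1% of each segment, so small segments are not
    # drowned out by large ones.
    stratified = (
        df.groupby("segment", group_keys=False)
          .apply(lambda g: g.sample(frac=0.01, random_state=42))
    )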

Look for easy wins

At this stage, your data is ready for analysis. Look for easy wins. Check for any simple linear relationships. Run some quick correlations. You might be lucky.
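A sketch of that first look, again in pandas and with hypothetical field names:

    # Correlations across all numeric fields: an easy first look.
    print(df.select_dtypes("number").corr())

    # Or one pairing at a time, e.g. marketing spend against site visits.
    print(df["marketing_spend"].corr(df["site_visits"]))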

Lucky or not, discuss these results with your team. If there is no correlation between releasing a new version of your app and customer engagement, something is wrong. Business people and data analysts must communicate.

Sharing the correlations for just a few fields requires only a table or an image. Share more fields with a CSV which team members can review in their tool of choice. Share those files through the best platform for your organisation. Hopefully that is a communication platform like Slack or HipChat, or a task management platform like Trello. In the worst case, it could just be email.

Have some conversations

If the data suggests there are some easy wins, try them. Your team can make them happen. Be sure you capture the data you need to track your progress, such as logging the correct events on your mobile or web app.

Some easy wins might be missing. Perhaps your marketing spend has not increased site traffic or calls. Have the right conversations. Change the marketing. Stop the marketing. See if the data changes. Again, be sure you capture the data you need to track your progress.

Easy wins or not, discuss the results with your team.

Beyond the obvious

Extracting value from data often requires more work than picking out easy wins. Your average business analyst isn’t enough. Your average software developer isn’t enough. An expensive bit of magical enterprise software isn’t enough. You need some Kagglers on your team. They will bring the right tools for the job.

Predicting outcomes based upon complex features is a challenge. Most of the methods are quantitative. But knowing when to use a random forest versus gradient boosting, or how to engineer features, often requires intuition and experience rather than scientific rules. Even with the benefit of intuition and experience, “data science” is like most science. It is based upon repeated experimentation. It is characterized by repeated failure. With each failure comes new information and small improvements.
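In that spirit, here is a sketch of one such experiment in scikit-learn, with synthetic data standing in for your own engineered features:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic data stands in for your own features.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # No scientific rule says which model wins here; run the experiment.
    for model in (RandomForestClassifier(n_estimators=100, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        scores = cross_val_score(model, X, y, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))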

Tools for the deep

R and sklearn for Python dominate the toolkits of Kagglers. As Spark matures, much of that functionality is coming to the JVM.

Licensed products like Tableau or SAP offer integrations with R, but then why not just use R? They offer lots of drag and drop, but often still require coding skills. Your business team cannot use them without your development team building a bespoke application. Your team should not be beholden to its tools.

Data science work needs to be repeatable. Therefore it must be scriptable. It must be portable from desktops to servers. It should lend itself to parallel processing. Licensed software rarely meets these criteria.

Your team will visualize aspects of your data. They might use some histograms in their feature engineering. Or they might view some decision trees in calibrating their algorithms. But they don’t need additional tools to do that. And they are unlikely to share those images. They are disposable. Just quick checks between sprints of coding.

Across the gap

Communicating the easy wins in your data is straightforward. Discussing deeper value extraction is hard. Your data scientists and business decision makers speak different languages.

But they can communicate simply at the edge of their worlds. Your data scientists can talk about their results. At each iteration, those results are incremental, rather than revolutionary. It should be something as simple as:

“We’ve run the numbers, and the single best predictor of which insurance policy options a customer will choose is the package they were last quoted. We can predict correctly 53.8% of the time using just that method. We’re not sure we can do much better than 55%. That’s even if we throw more time and resource at the problem.”

Everyone can understand that! The team can collectively make informed, well-understood decisions.

In this example, your team used no special visualisation to communicate effectively. You should encourage your team to distill the information to its simplest form. Rarely do more complex tools or interactive dashboards lead to focused conversations and clear decisions.
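The calculation behind a statement like that can be as simple as a one-line baseline check; the DataFrame and column names here are hypothetical:

    # Baseline: predict that each customer chooses the package last quoted.
    accuracy = (quotes["purchased_option"] == quotes["last_quoted_option"]).mean()
    print("Baseline accuracy: {:.1%}".format(accuracy))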

This stage of communication cannot be automated. Real people need to converse.

The feedback loop

Your team should see the effects of its decisions. Track response metrics in real-time. Use dashboards. But use dashboards for live data. Use dashboards available in the office, at home, on the move. Design them well.

As you reduce friction in your process, your team will get real-time feedback on fixes and changes for the next iteration. In early iterations, your team should summarise the fixes, changes, and business decisions for the next trip around the loop. Do not send your team around the loop without any changes to the system. Make the changes small enough to be implemented quickly. Then repeat your process.

No need for fairy dust and magic wands. Invest in your people and your process.

Notes:

  1. Training videos are a great way to get a sense of what it is like to use the various enterprise products, such as SAS, SAP Lumira (as an alternative to exploring SAP HANA data via Eclipse), Oracle, or IBM Cognos.
  2. People often misunderstand the term “data science”. The word science misleads them. We all use science to mean two things: established knowledge (an endpoint) and the process of establishing that knowledge through the scientific method (a process of repeated experimentation). Data science is about using established knowledge (known mathematical techniques) together with a process of repeated experimentation in order to study data.
  3. IBM offers an example of their framework co-authored with Gartner. Gartner separately offers a good insight into the state of the enterprise business intelligence market.
  4. As Eric Ries exhorts, “The only way to win is to learn faster than anyone else.” (The Lean Startup, p. 111)
  5. Compiled languages usually work best. Your engineers might be using C++, Java/Scala, Go, or possibly a .NET language.
  6. The link for feature engineering suggests that it can be obviated by feature learning. In practice, there is almost always human effort required.
  7. JVM is the Java Virtual Machine, on top of which run languages such as Java, Scala, Groovy, Clojure, or even JRuby and Jython. SAS does offer some of its own implementations of many machine learning techniques, but often lags the latest versions available in the open source community.
  8. Scriptable is not particularly well-defined. However, the key point is that it can be automated by code, and integrated with other pieces of code in a larger software architecture.
  9. From the Allstate competition of 2014.
  10. Geckoboard embodies many of the principles of good dashboards.