On the Futility of Machine Learning "Projects"

In recent years, there has been a great buzz around big data and analytics, and the potential of a confluence of computing capabilities and intelligent algorithms to deliver massive business efficiencies and insights. Everyone looks to companies like Google and Facebook and says, "we should be like them!". In Singapore, it seems as if every other company has set up an analytics team in the past couple of years, and the demand for data scientists has never been higher, even though it is not clear that everyone means the same thing when they say "data scientist". In a way, it is reminiscent of the dot-com bubble that drove an over-investment in intercontinental fibre-optic cable. However, as Thomas Friedman noted in The World is Flat, new platforms, techniques or methods cannot achieve their full potential unless they are combined with new ways of conducting business. For example, the light bulb could not light up households and allow productive activities to carry on at night until efficient electricity generation and delivery was widespread. Likewise, many companies are, in my opinion, grappling with how to embed data science into their core business practices, and in many cases they might be grappling in the wrong direction by applying old thinking to new ways.

In this post, I offer some of my thoughts on how data science needs to change in order for businesses to derive actual benefit from it.

Data science is NOT new

Before we begin, let's get one thing straight: data science is not new. It is the analysis of observations via mathematical methods, which scientists and mathematicians have been doing ever since the advent of modern science grounded on a rigorous mathematical framework. Some might even say it was born when people started drawing a straight line through a bunch of dots on a piece of paper. So it is somewhat amusing when I hear people speak of data analytics/science as some new vogue sweeping the industry.

In fact, even in businesses, the analysis of data has long been used to understand customer characteristics and sales performance. If in doubt, take a look at established business intelligence tools such as SAS and Tableau.

What makes this data science revolution different from the data science that scientists and businesses have known for decades is that, for the first time, the development of intelligent algorithms, the capacity to compute over massive amounts of data and the infrastructure to deliver and act on the insights derived have all reached a point of maturity where actual benefits in the commercial environment are now possible.

A never-ending cycle

As many traditional industries begin to think about how they can become more data-driven and how they can derive the benefits of the new data revolution, they look at things through an old lens. They still think of the data science effort in terms of individual projects pursued on a case-by-case, product-by-product basis; every time the company wants to launch a marketing campaign or predict customer response, it starts a new project to address that particular business question.

So with each business question to be answered, the data scientist has to perform the usual extract-transform-load (ETL) process, address anomalies and missing data, understand the peculiarities of the business problem and engineer relevant features, model the data, and finally deliver the insights to the business users. This is the seemingly never-ending, ever-repeating cycle of a data scientist.
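To make the repetition concrete, here is a minimal sketch of one turn of that cycle: extract, handle missing values, engineer a feature, then "model". The data, field names and threshold rule are invented for illustration; a real project would repeat something like this, at far greater length, for every new business question.

```python
# A toy run of the per-project cycle: extract, transform, engineer, model.
# All field names and values here are hypothetical.

raw_rows = [
    {"customer": "a", "spend": "120.0", "visits": "4"},
    {"customer": "b", "spend": "",      "visits": "2"},   # missing spend
    {"customer": "c", "spend": "300.0", "visits": "6"},
]

# Transform: impute missing spend with the mean of observed values.
observed = [float(r["spend"]) for r in raw_rows if r["spend"]]
mean_spend = sum(observed) / len(observed)

cleaned = []
for r in raw_rows:
    spend = float(r["spend"]) if r["spend"] else mean_spend
    visits = int(r["visits"])
    # Feature engineering: derive spend-per-visit from the raw variables.
    cleaned.append({"customer": r["customer"],
                    "spend": spend,
                    "spend_per_visit": spend / visits})

# "Model": a simple threshold rule standing in for the actual modelling step.
high_value = [r["customer"] for r in cleaned if r["spend"] > 150]
print(high_value)  # ['b', 'c']
```

Every step here is specific to this one dataset and this one question, which is precisely why repeating it per project does not scale.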

It is not hard to see that this does not scale. With each project, the data scientist has to painstakingly go through a dataset, probably extracted from the source databases specifically for that project, and address its peculiarities. In fact, it is well known that most of the time is spent on ETL and feature engineering, while the actual modelling work takes up only a small fraction of the effort. Now consider that, in a fast-changing world, companies are constantly pushing out new products and services to cater to the specific needs of individual customers. How can the data scientist keep up?

Representations, not immediate results

In my opinion, companies should focus on building representations of the parts of their business that matter to them. Representations are succinct or compressed descriptions of the relevant input variables. For example, the input variables could be a customer's spending patterns, demographic characteristics, browsing history, etc. However, these variables may live in a space that is very complicated and noisy. Building a representation of the raw input translates the noisy raw data into a cleaner space, making subsequent machine learning work much more effective by allowing those algorithms to start learning on a cleaner data space.

An organisation interested in becoming data-driven should build representations of key aspects of its business. For example, Netflix builds representations of its customers' viewing preferences and of movie characteristics on an aggregate basis so as to make relevant movie recommendations on an individual basis; this in turn keeps customers engaged and returning to the platform to watch movies and shows. Imagine if Netflix tried to build a model for every new movie that came along to tell it which customers are likely to watch it. It would end up with millions of individual models and no end in sight; there would be no way for Netflix to scale. Instead, it has one or two master models/representations that allow it to match any customer to any movie. The key is to focus on the customer and understanding who they are.
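The "one master model" idea can be sketched with a simple embedding model: place customers and movies in a shared vector space and score any pairing with a dot product. This is a hypothetical illustration of the concept, not Netflix's actual system; the names, dimensions and random vectors are all invented, and in practice the vectors would be learned from viewing data.

```python
import numpy as np

# Hypothetical embeddings: in reality these would be learned, not random.
rng = np.random.default_rng(1)
n_customers, n_movies, k = 5, 4, 3
customer_vecs = rng.normal(size=(n_customers, k))  # taste representations
movie_vecs = rng.normal(size=(n_movies, k))        # movie characteristics

# One scoring rule covers every customer-movie pairing, so no
# per-movie model is ever needed -- new movies just get a new vector.
scores = customer_vecs @ movie_vecs.T

def recommend(customer_id, top_n=2):
    """Rank all movies for one customer by predicted affinity."""
    return list(np.argsort(-scores[customer_id])[:top_n])

print(scores.shape)  # (5, 4): every customer scored against every movie
```

The point of the sketch is the scaling behaviour: adding a movie adds one vector, not one model.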

Hence before trying to derive results from data analytics projects, the organisation needs to think carefully about their business needs and start building representations of key aspects of their business to lay the groundwork for future work.

Again, building representations is not new. Principal Components Analysis (PCA) can be thought of as a way of deriving a succinct representation of the raw data that allows more effective machine learning. The key challenge is building representations that can effectively capture the non-linearities in complex datasets while remaining easily amenable to subsequent discriminative model building for specific purposes. Methods such as deep belief networks, sparse autoencoders and variational autoencoders are all ways of building representations using deep learning.
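The PCA view of representation building can be shown in a few lines. The sketch below, with invented data, simulates ten noisy raw variables that really vary along only two underlying directions, then recovers a two-dimensional representation by projecting onto the top principal components:

```python
import numpy as np

# Simulated customer data: 200 customers, 10 raw variables driven by
# only 2 underlying factors plus a little noise (all values invented).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
raw = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# PCA via SVD: centre the data, then project onto the top-2 components.
centred = raw - raw.mean(axis=0)
_, s, vt = np.linalg.svd(centred, full_matrices=False)
representation = centred @ vt[:2].T  # succinct 2-D representation

# Fraction of variance captured by the 2-D representation.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(representation.shape)  # (200, 2)
```

Downstream models can now learn on the compact, denoised representation rather than the raw ten-dimensional space; the non-linear methods mentioned above play the same role when a linear projection is not enough.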

Pipelines and Infrastructure

On top of having good data representations of key business aspects, it is crucial to have good workflows and data pipelines. After all, the benefits of the insights derived from data analytics cannot materialize if the insights only end up in slides and reports. Also, as mentioned before, if businesses are to respond to a rapidly changing world in a data-driven way, data scientists have to be able to move through the analysis cycle rapidly. This means that data needs to be readily available, and the models built must be able to deliver their results to the business frontend, where actual business benefits are derived, for consumption.

Traditionally, data is stored in a database and accessed via SQL-based queries in a low-frequency manner. Access to data is limited to a few data technicians who navigate a myriad of databases and perform basic ETL for users on the business end. In this day and age, however, this way of working is too slow. A data scientist needs data to work and cannot be kept waiting for it; the data must be easily accessible. And as organisations seek to use streaming data to deliver real-time insights or action points, computing frameworks like Apache Kafka or Apache Spark Streaming are required, allowing computation to be performed, or data to be delivered to multiple consumption points, as the data travels from the source to a backend database.
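The shape of that stream-processing pattern can be illustrated in plain Python, without any framework. The sketch below is a deliberately tiny stand-in for what Kafka or Spark Streaming do at scale, and does not use their APIs: events are enriched and consumed in-flight, as they travel, rather than queried in batch after landing in a database. All event names and values are invented.

```python
# Toy stream pipeline: compute on each event as it flows past.
# This imitates the *pattern* of stream processing, not any real framework.

def source():
    """Pretend event stream, e.g. transactions arriving one at a time."""
    for event in [{"user": "a", "amount": 10},
                  {"user": "b", "amount": 250},
                  {"user": "a", "amount": 30}]:
        yield event

def enrich(events):
    """Transformation applied in-flight, before the data reaches storage."""
    for e in events:
        e["large"] = e["amount"] > 100
        yield e

def deliver(events):
    """Fan each event out to multiple consumers: an alert list and a running total."""
    total, alerts = 0, []
    for e in enrich(events):
        total += e["amount"]       # consumer 1: live aggregate
        if e["large"]:
            alerts.append(e["user"])  # consumer 2: real-time alert
    return total, alerts

total, alerts = deliver(source())
print(total, alerts)  # 290 ['b']
```

Nothing here waits for an overnight batch query; each event is acted on as it passes through, which is the property the streaming frameworks provide with durability and scale on top.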

This means that for many organisations the entire data pipeline needs to change. Otherwise, data scientists will be like factories without raw materials, and the insights they derive will be like goods produced with no way of reaching the hands of consumers.


In this post, I have offered some of my thoughts on the challenges facing organisations as they try to be part of the data science revolution. In particular, building useful data representations of key business aspects, and building the data pipelines and infrastructure to support them, stand out to me as the most important hurdles to be crossed before data science can deliver real-time business benefits in this fast-changing world. I don't claim to have the solution, as every organisation is different and will face its own challenges, some of them non-technical, such as the mindset of the leadership and legacy systems that are hard to change. But I believe the points still stand. Every organisation should examine itself and ask whether it is truly ready for the new data age. This is not something that can be solved simply by employing "data scientists". There are tough and perhaps expensive choices to be made in terms of computing hardware and architecture, people and business practices. Otherwise, I find it hard to see how machine learning "projects" can deliver the commercial yield that people think is achievable when they look at companies like Google or Facebook.