Many data scientists accept the model that gives the best results on the data set at hand; no attempt is made to understand why the given inputs lead to the output.
It all started some six years ago, when I was in the middle of switching careers and thus had a fair bit of time on my hands. My wife Priyanka (who, incidentally, also writes a column for this website) suggested that I use this time to learn astrology.
At the outset, I’d been taken aback, and found the suggestion bizarre; for I’ve never been religious or superstitious, or shown any sort of inclination to believe in astrology. Yes, there was this one time when I had helped set up a numerology stall for a school exhibition, but even that wasn’t particularly serious.
Now, Priyanka has some (indirect) background in astrology. One of her aunts is an astrologer, and specializes in something called “prashNa shaastra”, where predictions are made based on the time at which the client asks the astrologer a question. Priyanka believes this has resulted in largely correct predictions (though I suspect a strong dose of confirmation bias there), and (very strangely to me) seems to believe in the stuff.
“What’s the use of studying astrology if I don’t believe in it one bit,” I asked. “Astrology is very mathematical, and you are very good at mathematics. So you’ll enjoy it a lot,” she countered, sidestepping the question.
We went off into a long discussion on the origins of astrology, and how it resulted in early developments in astronomy (necessary in order to precisely determine the position of planets) and so on. The discussion grew involved, with many digressions, as discussions of this sort often do. And as you might expect with such discussions, Priyanka threw a curveball: “You know, you say you’re building a business based on data analysis. Isn’t data analysis just like astrology?”
I was stumped (OK, I know I’m mixing metaphors from different sports here), and that ended the discussion for the time being. I still wasn’t convinced, though, and in due course even forgot about this discussion. For the record, I used my newfound free time by taking lessons in Western classical music.
I’ve spent most of the last six years playing around with data and drawing insights from it (a lot of those insights have been published in Mint). A lot of work that I’ve done can fall under the (rather large) umbrella of “data science”, and some of it can be classified as “machine learning”. Over the last couple of years, though, I’ve been rather disappointed by what goes on in the name of data science.
Stripped to its bare essentials, machine learning is an exercise in pattern recognition. Given a set of inputs and outputs, the system tunes a set of parameters in a mathematical formula such that the outputs can be predicted with as much accuracy as possible given the inputs (I’m massively oversimplifying here, but this captures sufficient essence for this discussion).
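To make the idea of “tuning parameters” concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the formula is a straight line with two parameters, the data are generated from a made-up rule (output = 2 × input + 1, plus a little noise), and the tuning method is ordinary gradient descent. Real systems use far richer formulae, but the loop has the same shape.

```python
import random

def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Tune slope and intercept of y = slope * x + intercept by gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared prediction error for each parameter.
        grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
        grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / n
        # Nudge each parameter in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

rng = random.Random(0)
xs = [i / 10 for i in range(50)]                      # inputs 0.0 to 4.9
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]  # made-up rule plus noise

slope, intercept = fit_line(xs, ys)
print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")
```

The program is never told the rule; it only sees inputs and outputs, and ends up with parameters close to the ones that generated the data.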
One big advantage with machine learning is that algorithms can sometimes recognize patterns that are not easily visible to the human eye. The most spectacular application of this has been in the field of medical imaging, where time and again algorithms have been shown to outperform human experts while analysing images.
In February last year, a team of researchers from Stanford University showed that a deep learning algorithm they had built performed on par with a team of expert doctors in detecting skin cancer. In July, another team from Stanford built an algorithm to detect heart arrhythmia by analysing electrocardiograms, and showed that it outperformed the average cardiologist. More recently, algorithms to detect pneumonia and breast cancer have been shown to perform better than expert doctors.
The way all these algorithms perform is similar—fed with large sets of images that contain both positive and negative cases of the condition to be detected, they calibrate parameters of a mathematical formula so that patterns that lead to positive and negative cases can be distinguished. Then, when fed with new images, they apply these formulae with the calibrated parameters in order to classify those images.
While applications such as medical imaging might make us believe that machine learning will take over the world, we should keep in mind that left to themselves, machines can go wrong spectacularly. In 2015, for example, Google got into trouble when it tagged photos containing a black woman as “gorillas”.
Google admitted the mistake and called the classification “unacceptable”, but it appears that its engineers couldn’t do much to prevent such classification. Last month, The Guardian reported that Google had gotten around this problem by removing tags such as “gorilla”, “chimpanzee” and “monkey” from its database.
The advantage of machine learning—that it can pick up patterns that are not apparent to humans—can be its undoing as well. The gorilla problem aside, the problem with identifying patterns that are not apparent or intuitive to humans is that meaningless patterns can get picked up and amplified as well. In the statistics world, this is known as “spurious correlations”.
And as we deal with larger and larger data sets, the possibility of such spurious correlations appearing out of sheer randomness increases. Nassim Nicholas Taleb, author of Fooled by Randomness and The Black Swan, had written in 2013 about how “big data” would lead to “big errors”.
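A small simulation shows how this happens. The sketch below is an illustration under stated assumptions, not a reproduction of anyone’s analysis: the outcome and every predictor are independent random numbers, so any correlation between them is spurious by construction. The only thing that changes is how many predictors we are allowed to search through.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(42)
n_rows = 20
target = [rng.gauss(0, 1) for _ in range(n_rows)]  # the "outcome": pure noise

strongest = {}  # predictors tried so far -> strongest |correlation| found
best = 0.0
for k in range(1, 1001):
    predictor = [rng.gauss(0, 1) for _ in range(n_rows)]  # unrelated to target
    best = max(best, abs(pearson(predictor, target)))
    if k in (10, 100, 1000):
        strongest[k] = best

for k, value in strongest.items():
    print(f"{k:>5} random predictors: strongest |r| = {value:.2f}")
```

The strongest correlation found can only grow as more candidate predictors are searched, even though none of them has anything to do with the outcome.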
We are seeing this with modern machine learning algorithms that rely on lots of data as well. For example, a team of researchers from New York University showed that algorithms detecting road signs can be fooled by simply inserting an additional set of training images with an obscure pattern in the corner. More importantly, they showed that a malicious adversary could choose these additional images carefully in a way that the misclassification wouldn’t be apparent when the model was being trained.
Traditionally, the most common method to get around spurious correlations has been for the statistician (or data scientist) to inspect their models and to make sure that they make “intuitive sense”. In other words, the models are allowed to find patterns and then a domain expert validates those patterns. In case the patterns don’t make sense, data scientists have tweaked their models in a way that they give more meaningful results.
The other way statisticians have approached the problem is to pick models that are most appropriate for the data at hand. Different mathematical models are adept at detecting patterns in different kinds of data, and picking the right algorithm for the data ensures that spurious pattern detection is minimized.
The problem with modern machine learning algorithms, however, is that the models are hard to inspect and make sense of. It isn’t possible to look at the calibrated parameters of a deep learning system, for example, to see whether the patterns it detects make sense. In fact, “explainability” of artificial intelligence algorithms has become a major topic of interest among researchers.
Given the difficulty of explaining the models, the average data scientist proceeds to use the algorithms as black boxes. Moreover, determining the right model for a given data set is more of an art than science, and involves the process of visually inspecting the data and understanding the maths behind the models.
With standardized packages being available to do the maths, and in a “cheap” manner (a Python package called scikit-learn allows data scientists to implement just about any machine learning model using three very similar looking lines of code), data scientists have gotten around this “problem”.
The way a large number of data scientists approach a problem is to take a data set and then apply all possible machine learning methods on it. They then accept the model that gives the best results on the data set at hand. No attempt is made to understand why the given inputs lead to the output, or if the patterns make “physical sense”. As this XKCD strip puts it, this is akin to stirring a pile of answers till they start looking right.
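Here is a hedged sketch of what that stirring looks like. It uses plain Python rather than scikit-learn, and the “models” are deliberately toy rules (predict the label from a single column, optionally inverted) that I have made up for illustration. Both the features and the labels are coin flips, so there is nothing real to learn; the only question is how good the winning rule looks on the data at hand compared with fresh data.

```python
import random

rng = random.Random(7)
n_features = 200

def make_rows(n_rows):
    """Coin-flip features with coin-flip labels: no real pattern exists."""
    features = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_rows)]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return features, labels

def accuracy(rule, features, labels):
    """Share of rows where the rule's prediction matches the label."""
    column, flip = rule
    hits = sum((row[column] ^ flip) == label for row, label in zip(features, labels))
    return hits / len(labels)

train_x, train_y = make_rows(40)     # the data set at hand
fresh_x, fresh_y = make_rows(2000)   # data the rule has never seen

# Try every candidate rule and keep whichever scores best on the data at hand.
candidates = [(column, flip) for column in range(n_features) for flip in (0, 1)]
best_rule = max(candidates, key=lambda rule: accuracy(rule, train_x, train_y))

train_acc = accuracy(best_rule, train_x, train_y)
fresh_acc = accuracy(best_rule, fresh_x, fresh_y)
print(f"best of {len(candidates)} rules: {train_acc:.0%} on the data at hand, "
      f"{fresh_acc:.0%} on fresh data")
```

The winner looks respectable on the 40 rows it was selected on and falls back to roughly a coin toss on new rows, which is why checking a chosen model on data that played no part in choosing it matters.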
And this is not very different from the way astrology works. There, we have a bunch of predictor variables (position of different “planets” in various parts of the “sky”) and observed variables (whether some disaster happened or not, in most cases). And then some of our ancients did some data analysis on this, trying to identify combinations of predictors that predicted the output (unfortunately, they didn’t have the power of statistics or computers, so in that sense the models were limited). And then they simply accepted the outputs, without challenging why it makes sense that the position of Jupiter at the time of a wedding affects how someone’s marriage would go.
Armed with this analysis, I brought up the topic of astrology and data science again recently, telling my wife that “after careful analysis I admit that astrology is the oldest form of data science”.
“That’s not what I said,” Priyanka countered. “I said that data science is new-age astrology, and not the other way round.”
I admit it’s hard to argue with that!