Google flu trends and the future of Big Data

Accurate analysis of large amounts of data is more difficult to achieve than many think, notes Chris Gonsalves

There's an old psychology test turned team-building exercise that's fun because nine out of 10 people get it wrong: the Wason Card Problem. It does a pretty elegant job of exposing our human penchant for an error known as confirmation bias.

You can try it here.

Our propensity for reading too much into things and stumbling toward the answers we expect to get, even when wrong, is on full display right now in the tearing down of Google flu trends -- an early Google experiment that over the past several years became a standard bearer for the Big Data movement.

Few early presentations promoting the power and possibilities of Big Data failed to include some mention of Google flu trends and its uncanny ability to turn hundreds of billions of global searches into a predictor of where the next outbreak of fever, chills and runny noses was about to erupt.

Except that, as we're finding out now, Google flu trends is mostly wrong. Very wrong.

Four researchers from Harvard University have now pointed out that Google's flu trends data grossly overestimated flu outbreaks for pretty much all of the past two years.

In fact, it hasn't even been close to correct since mid-2011. Almost any other traditional method of flu reporting, including the old Centers for Disease Control and Prevention (CDC) reports that Google advocates used to deride as "lagging" indicators, would have outperformed Google's data by a long shot.

"Google flu trends was like the bathroom scale where the spring slowly loosens up and no one ever recalibrated," David Lazer, one of the researchers who wrote an article on Google flu trends for Science magazine, told the Guardian.

"You know scales are going to need to be recalibrated, yet when [it] started missing by a lot, which started years ago, before it got any media attention, no one tweaked the mechanism."

So, just as Google flu trends was once a poster child for the Big Data movement, its public failure is likely to be used by sceptics who have long questioned the real usefulness - and profitability - of such technology in everyday business.

The reality is that this stumble exposes things we already knew about Big Data and analytics outside the rarefied air of science and academia.

If there's one moral to this story, it's that Google's desire to promote its Big Data efforts but never share its algorithms or underlying data for critical review will never work.

If you want to roll with the scientists, you have to wear the lab coat. Might the Google folks get that now?

Beneath that moral runs an undercurrent of truisms that might comfort anyone hoping to successfully use large data sets for analytics and intelligence gathering. This is no trivial exercise. This is the technology trend McKinsey has labelled "the next frontier for innovation, competition, and productivity", so getting it right is important.

Chief among these issues is the fallacy that more data equals better data. That was Google flu trends' undoing and it's the bane of every budding Big Data project in the corporate world.

Google had all the data in the world and it was wrong. The belief that larger data sets are inherently more reliable is the same assumption error that led to the dot-com collapse, the housing crisis and the global recession.

In a bubble, all signs point to up and nobody can hear you scream.

Kaiser Fung, a statistician at Vimeo and creator of Junk Charts, sums up the level of understanding business users need to approach Big Data sensibly. The worst outcome of the Google flu trends snafu, Fung writes in the Harvard Business Review, would be to use it as evidence that Big Data's not worth it.

"Honest appraisals are meant to create honest progress, to advance the discipline rather than fuel the fad," he says.

Fung says would-be users need to focus on the state of Big Data and the assumptions it often contains. A lot of the data that analysts are working with today is purely observational; it comes from machines that collect data indiscriminately, with none of the purpose of a traditional research instrument.

The data lacks the trusted scientific tool of controls for comparison and analysis, and it fools users into thinking it's complete merely by the sheer massive size of it all.

In fact, "more data creates more false leads and blind alleys, complicating the search for meaningful, predictable structure", Fung writes.

Adding to the problem, particularly in the case of Google flu trends, is the increased use of third-party data collection -- a busy mining of information for a purpose unrelated to the analyst's and data scientist's cause.

The error gets compounded as multiple data sets with misaligned definitions and objectives get mashed together -- perhaps by marketing teams or media workers with little understanding of statistics -- for analysis.

However, far from discouraging the use of Big Data or discounting its value, what happened with Google flu trends could actually help it advance and find its rightful home in the enterprise - the place where industry observers have consistently imagined it when calling it a potential $20bn market.

A frank discussion of what Big Data can and cannot do, and ways to avoid the bad assumptions and confirmation biases that led us down the Google flu trends path, would show Big Data is maturing, and that's a good thing for technologists who would champion - and profit from - its expanded use.

Take the Wason Card Problem test. If you tried it, there's a better than 90 per cent chance you failed. But you may have learned something, and if so there's much less chance you would fail a similar exercise in the future.

The error is pervasive, but it's also simple and easy to avoid.

"I am excited by the promises of data analytics," says Fung. "But I'd like to see our industry practice what we preach, conducting honest assessment of our own successes and failures. In the meantime, outsiders should be attentive to the challenges of big data analysis, and apply considerable caution in interpreting such analyses."

For more US-focused channel news, see www.channelnomics.com