Sunday, August 4, 2013

The Dictatorship of Data or Fooled by Randomness

Big Data has spawned a cult of infallibility: a vision of prediction obviating explanation and math trumping science. In "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Chris Anderson wrote, "With enough data, the numbers speak for themselves."

The trouble is that you don't always know when to believe them. When you've got algorithms weighing hundreds of factors over a huge data set, you can't really know why they come to a particular decision or whether it really makes sense.

As Geoff Nunberg, who teaches at the School of Information at the University of California, Berkeley, stated in an NPR interview, big data is no more exact a notion than big hair. Nothing magic happens when you get to the 18th or 19th zero. After all, digital data has been accumulating for decades in quantities that always seemed unimaginably vast at the time.

An exponential curve looks just as overwhelming wherever you get onboard. And anyway, nobody really knows how to quantify this stuff precisely. Whatever the sticklers say, data isn't a plural noun like pebbles. It's a mass noun like dust.

What's new is the way data is generated and processed. It's like dust in that regard, too. We kick up clouds of it wherever we go. Cell phones and cable boxes; Google and Amazon, Facebook and Twitter; the bar codes on milk cartons; and the RFID chip that whips you through the toll plaza - each of them captures a sliver of what we're doing, and nowadays they're all calling home.

It's only when all those little chunks are aggregated that they turn into big data; then the software called analytics can scour it for patterns. Epidemiologists watch for blips in Google queries to localize flu outbreaks. Economists use them to spot shifts in consumer confidence. Police analytics comb over crime data looking for hot zones.

Big Data and Google Flu Trends

In 2008, Google launched Google Flu Trends, which used big data and search algorithms to estimate the prevalence of flu outbreaks. For the first few years, Google Flu Trends metrics tracked closely with Centers for Disease Control and Prevention (CDC) data - and they were delivered several days before the CDC data.

But for the 2012-13 flu season, a comparison with traditional surveillance data showed that Google Flu Trends, which estimates prevalence from flu-related Internet searches, had drastically overestimated peak flu levels.

It is not the first time that a flu season has tripped Google up. In 2009, Flu Trends had to tweak its algorithms after its models badly underestimated influenza-like illness (ILI) in the United States at the start of the H1N1 (swine flu) pandemic.

The glitch may be no more than a temporary setback for a promising strategy and Google is sure to refine its algorithms. But as flu-tracking techniques based on mining of web data and on social media proliferate, the episode is a reminder that they will complement, but not substitute for, traditional epidemiological surveillance networks.

This stumble doesn't necessarily make Google Flu Trends irrelevant. But it does mean that Google needs to recalibrate the way they mine big data to track the spread of disease, accounting for searches that may not be linked with infections. "You need to be constantly adapting these models, they don’t work in a vacuum," says Harvard Medical School epidemiologist John Brownstein.
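To see why searches that are not linked with infections can push an estimate too high, here is a minimal sketch in Python. It is not Google's method: the numbers are made up, the linear model and the names (fit_line, media_boost, SEARCHES_PER_ILI_POINT) are assumptions chosen only for illustration. A model is calibrated in a period when search volume tracks illness, then applied to a season where news coverage adds extra searches while the underlying illness is unchanged.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up weekly illness levels (percent of doctor visits); not real data.
true_ili = [1.0, 2.0, 3.5, 5.0, 3.0, 1.5]

# Assumption: during calibration, searches are proportional to illness.
SEARCHES_PER_ILI_POINT = 100
calibration_searches = [SEARCHES_PER_ILI_POINT * level for level in true_ili]
slope, intercept = fit_line(calibration_searches, true_ili)

# Later season: same true illness, but news coverage makes healthy people
# search too, so volume is 60% higher (a hypothetical figure).
media_boost = 1.6
later_searches = [s * media_boost for s in calibration_searches]
estimates = [slope * s + intercept for s in later_searches]

peak_true = max(true_ili)
peak_estimated = max(estimates)
print(f"true peak: {peak_true:.1f}, estimated peak: {peak_estimated:.1f}")
```

In this toy setup the model itself has not changed; only the relationship between searching and being sick has. That is the sense in which such models need constant adapting.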

Twitter Flu, Too?

Other researchers are turning to what is probably the largest publicly accessible alternative trove of social-media data: Twitter. Several groups have published work suggesting that models of flu-related tweets can be closely fitted to past official ILI data, and various services, such as MappyHealth and Sickweather, are testing whether real-time analyses of tweets can reliably assess levels of flu.

But Lyn Finelli, head of the CDC’s Influenza Surveillance and Outbreak Response Team, is skeptical. “The Twitter analyses have much less promise” than Google Flu or Flu Near You, she says, arguing that Twitter’s signal-to-noise ratio is very low, and that the most active Twitter users are young adults and so are not representative of the general public.

Further, we know many Twitter accounts are automated response programs called "bots," fake accounts, or "cyborgs" -- human-controlled accounts assisted by bots. Recent estimates suggest there could be as many as 20 million fake accounts. So even before we get into the methodological minefield of how you assess health issues on Twitter, let's ask whether those issues are expressed by people or just automated algorithms.

Fooled by Randomness - The Dictatorship of Data

As stated by the authors in Big Data: A Revolution That Will Transform How We Live, Work, and Think:
Correlations let us analyze a phenomenon not by shedding light on its inner workings but by identifying a useful proxy for it. Of course, even strong correlations are never perfect. It is quite possible that two things may behave similarly just by coincidence. We may simply be “fooled by randomness” to borrow a phrase from the empiricist Nassim Nicholas Taleb. With correlations, there is no certainty, only probability.  
Moreover, because big-data analysis is based on theories, we can’t escape them. They shape both our methods and our results. It begins with how we select the data. Our decisions may be driven by convenience: Is the data readily available? Or by economics: Can the data be captured cheaply? Our choices are influenced by theories. What we choose influences what we find, as the digital-technology researchers danah boyd and Kate Crawford have argued. After all, Google used search terms as a proxy for the flu, not the length of people’s hair. Similarly, when we analyze the data, we choose tools that rest on theories. And as we interpret the results we again apply theories. The age of big data clearly is not without theories – they are present throughout, with all that this entails. 
We are more susceptible than we may think to the “dictatorship of data” — that is, to letting the data govern us in ways that may do as much harm as good. The threat is that we will let ourselves be mindlessly bound by the output of our analyses even when we have reasonable grounds for suspecting something is amiss. Or that we will become obsessed with collecting facts and figures for data’s sake. Or that we will attribute a degree of truth to the data which it does not deserve.
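The point about coincidence is easy to demonstrate. The following is a minimal sketch in Python, using only random numbers and hypothetical names (pearson, best_spurious_correlation): it generates one random "target" series and then searches many unrelated random series for the one that correlates with it best. None of the series has anything to do with the target, yet the more candidates you scan, the stronger the best match looks.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def best_spurious_correlation(n_series, n_points, seed=0):
    """Strongest |correlation| between a random target and n_series
    independent random candidates, each with n_points observations."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# Scan 1, 10, 100, and 1000 unrelated candidates for the best "predictor".
for n in (1, 10, 100, 1000):
    print(n, round(best_spurious_correlation(n, n_points=20), 2))
```

Every candidate here is pure noise, so any correlation the search turns up is coincidence. A data set with hundreds of factors offers the same opportunity to find a proxy that fits the past and means nothing.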
