* * * * * * *
With big data promising valuable insights to those who analyze it, all signs point to a further surge in the gathering, storing, and reusing of our personal data. Data collections, already growing exponentially, will expand in size and scale as storage costs continue to plummet and analytic tools become ever more powerful. If the Internet Age threatened privacy, does big data endanger it even more? Is that the dark side of big data?
Yes, and it is not the only one. Here, too, the essential point about big data is that a change of scale leads to a change of state. This transformation not only makes protecting privacy much harder, but also presents an entirely new menace: penalties based on propensities. That is, the possibility of using big-data predictions about people to judge and punish them even before they have acted. Doing this negates ideas of fairness, justice, and free will.
Penalizing Based on Propensities
The clearest example of penalizing based on propensities is predictive policing. Predictive policing tries to harness the power of information, geospatial technologies, and evidence-based intervention models to reduce crime and improve public safety. This two-pronged approach — applying advanced analytics to various data sets in conjunction with intervention models — can move law enforcement from reacting to crimes toward predicting what is likely to happen and where, and deploying resources accordingly.
Predictive policing raises suspicions that it either legitimizes racial profiling or, at the very least, gives the police much wider latitude of probable cause with which to challenge citizens or compel consent to a search. On top of this, some wonder whether it goes beyond criminalizing actions to criminalizing the simple fact of being in the wrong place at the wrong time.
Take marijuana arrests as an example. We know that black people and Latinos are arrested, prosecuted, and convicted for marijuana offenses at rates astronomically higher than those of their white counterparts, even after adjusting for income and geography. We also know that whites smoke marijuana at about the same rate as blacks and Latinos.
Therefore, we know that marijuana laws are not applied equally across the board: Blacks and Latinos are disproportionately targeted for arrest, while whites are arrested at much lower rates for smoking or selling small amounts of marijuana.
If historical arrest data shows that the majority of arrests for marijuana crimes in a city are made in a predominantly black area rather than in a predominantly white area, predictive policing algorithms working off this problematic data will recommend deploying officers to the predominantly black area, even if other information shows that people in the white area violate marijuana laws at about the same rate as their black counterparts.
If an algorithm is fed only this unjust arrest data (as opposed to conviction data), it will simply repeat the injustice by advising the police to send yet more officers to patrol the black area. In that way, predictive policing creates a feedback loop of injustice.
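To make that loop concrete, the following is a minimal sketch in Python. Everything in it is hypothetical (the area names, the starting arrest counts, and the patrol rule); it simply models an allocation policy of "patrol where past arrests were made" and shows how an initial skew in the record reproduces itself even when the underlying offense rates are identical.

```python
# Toy model of the arrest-data feedback loop described above.
# All numbers and names are made up for illustration; this is not any real
# predictive-policing system or dataset.
import random

random.seed(0)

POPULATION = 10_000          # residents per area
TOTAL_PATROLS = 100          # patrol units to allocate each year
true_offense_rate = {"black_area": 0.05, "white_area": 0.05}   # identical behavior
arrest_history = {"black_area": 60, "white_area": 40}          # skewed starting record

for year in range(1, 6):
    # "Predictive" step: allocate patrols in proportion to past arrests.
    total_past = sum(arrest_history.values())
    patrols = {a: TOTAL_PATROLS * arrest_history[a] / total_past
               for a in arrest_history}

    for area, rate in true_offense_rate.items():
        # Offenses occur at the same rate in both areas...
        offenders = sum(random.random() < rate for _ in range(POPULATION))
        # ...but arrests depend on how many patrols are watching.
        new_arrests = round(offenders * patrols[area] / TOTAL_PATROLS)
        arrest_history[area] += new_arrests

    print(f"year {year}: recorded arrests = {arrest_history}")
```

Because the record, not the behavior, drives the allocation, the 60/40 split in patrols persists year after year and the gap in recorded arrests keeps widening in absolute terms; only by feeding in different evidence, such as conviction data or an independent estimate of offense rates, would the loop be broken.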
Dictatorship of Data
In addition to privacy and propensity, there is a third danger. We risk falling victim to a dictatorship of data, whereby we fetishize the information, the output of our analyses, and end up misusing it. Handled responsibly, big data is a useful tool of rational decision-making. Wielded unwisely, it can become an instrument of the powerful, who may turn it into a source of repression, either by simply frustrating customers and employees or, worse, by harming citizens.
Big data is predicated on correlations among various data elements. Correlations let us analyze a phenomenon not by shedding light on its inner workings but by identifying a useful proxy for it. Of course, even strong correlations are never perfect. It is quite possible that two things may behave similarly just by coincidence. We may simply be “fooled by randomness,” to borrow a phrase from the empiricist Nassim Nicholas Taleb. With correlations, there is no certainty, only probability.
As boyd and Crawford state, "Too often, Big Data enables the practice of apophenia: seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions. In one notable example, David Leinweber demonstrated that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh."
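The butter-and-stocks anecdote is easy to reproduce in spirit: sift through enough unrelated series and something will line up with whatever target you choose. The sketch below is purely illustrative and uses synthetic random numbers rather than the actual S&P 500 or Bangladeshi butter figures; it only shows that, among thousands of random series, at least one will correlate strongly with a random "index" by chance alone.

```python
# Illustration of apophenia / being "fooled by randomness": among many unrelated
# series, some will correlate strongly with a target purely by chance.
# All data here is synthetic; no real financial or agricultural figures are used.
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

random.seed(42)
N_POINTS = 12          # e.g. twelve "annual" observations, as in short time series
N_CANDIDATES = 10_000  # how many unrelated series we are willing to sift through

index = [random.gauss(0, 1) for _ in range(N_POINTS)]   # stand-in for a stock index

best = max(
    abs(pearson(index, [random.gauss(0, 1) for _ in range(N_POINTS)]))
    for _ in range(N_CANDIDATES)
)
print(f"Strongest correlation found among {N_CANDIDATES} random series: {best:.2f}")
```

With only a dozen data points and thousands of candidate series, the strongest match is routinely above 0.8 even though every series is pure noise; the more comparisons we make, the easier it becomes to "find" patterns that mean nothing.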
While many companies and government agencies foster an illusion that classification is (or should be) an area of absolute algorithmic rule—that decisions are neutral, organic, and even automatically rendered without human intervention—the reality is a far messier mix of technical and human curating. Both the datasets and the algorithms reflect choices about, among other things, data, connections, inferences, interpretation, and thresholds for inclusion, all of which advance a specific purpose.
Like maps that represent the physical environment in varied ways to serve different needs—mountaineering, sightseeing, or shopping—classification systems are neither neutral nor objective, but are biased toward their purposes. They reflect the explicit and implicit values of their designers.
A similar principle should apply outside government, when businesses make highly significant decisions about us – to hire or fire, offer a mortgage, or deny a credit card. When they base these decisions mostly on big-data predictions, we recommend that certain safeguards be in place.
First is openness: making available the data and algorithm underlying the prediction that affects an individual. With big-data analysis, however, traceability will become much harder. The basis of an algorithm’s predictions may often be far too intricate for most people to understand.
Second is certification: having the algorithm certified for certain sensitive uses by an expert third party as sound and valid. Third is disprovability: specifying concrete ways that people can disprove a prediction about themselves. (This is analogous to the tradition in science of disclosing any factors that might undermine the findings of a study.)
Most important, a guarantee of human agency guards against the threat of a dictatorship of data, in which we endow the data with more meaning and importance than it deserves.
Preexisting Bias
Freedom from bias should be counted among the select set of criteria—including reliability, accuracy, and efficiency—according to which the quality of systems in use in society should be judged. We use the term bias to refer to computer systems that systematically and unfairly discriminate against certain individuals or groups of individuals in favor of others. A system discriminates unfairly if it denies an opportunity or a good or if it assigns an undesirable outcome to an individual or group of individuals on grounds that are unreasonable or inappropriate.
Moreover, because big-data analysis is based on theories, we can’t escape them. They shape both our methods and our results. It begins with how we select the data. Our decisions may be driven by convenience: Is the data readily available? Or by economics: Can the data be captured cheaply? Our choices are influenced by theories. What we choose influences what we find, as the digital-technology researchers boyd and Crawford have noted. Similarly, when we analyze the data, we choose tools that rest on theories. And as we interpret the results, we again apply theories. The age of big data clearly is not without theories – they are present throughout, with all that this entails.
Ameliorating the Risks
In these scenarios, we can see the risk that big-data predictions, and the algorithms and datasets behind them, will become black boxes that offer us no accountability, traceability, or confidence. To prevent this, big data will require monitoring and transparency, which in turn will require new types of expertise and institutions. These new players will provide support in areas where society needs to scrutinize big-data predictions and enable people who feel wronged by them to seek redress. Please see “Are Discriminatory Systems Discriminatory? If So, Then What?”
Big data will require a new group of people to take on this role. Perhaps they will be called “algorithmists.” They could take two forms: independent entities that monitor firms from the outside, and employees or departments that monitor them from within, just as companies have in-house accountants as well as outside auditors who review their finances.
We envision external algorithmists acting as impartial auditors to review the accuracy or validity of big-data predictions whenever the government requires it, such as under court order or regulation. They can also take on big-data companies as clients, performing audits for firms that want expert support. And they may certify the soundness of big-data applications such as anti-fraud techniques or stock-trading systems. Finally, external algorithmists could consult with government agencies on how best to use big data in the public sector.
Moreover, people who believe they’ve been harmed by big-data predictions — a patient rejected for surgery, an inmate denied parole, a loan applicant denied a mortgage — can look to algorithmists much as they already look to lawyers for help in understanding and appealing those decisions.