Wednesday, July 24, 2013

Big Data and Employment Testing: Correlation Is Not Causation

Imagine you are watching at a railway station. More and more people arrive until the platform is crowded, and then — hey presto — along comes a train. Did the people cause the train to arrive (A causes B)? Did the train cause the people to arrive (B causes A)? No, they both depended on a railway timetable (C caused both A and B).

Some 60% of American workers earn hourly wages. Of these, about half change jobs each year. So firms that employ lots of unskilled workers, such as supermarkets, home improvement stores and fast-food chains, have to vet millions of applications every year. Making the process more efficient could yield big payoffs.

Algorithms and big data are powerful tools. Wisely used, they can help match the right people with the right jobs. But they must be designed and used by humans; otherwise they can go terribly wrong.

Big Data

In the right hands, Big Data is a tool that can yield insight, help determine which paths and alternatives are more likely to be successful, and lead to improved conditions. But Big Data is just one of a set of tools that can be used to develop successful paths and alternatives. It works best when used in conjunction with other tools: intuition, inductive reasoning and statistical analysis, to name a few.

One reason Big Data is at the forefront today is that, with the advent of new tools, very large data flows and advanced computing techniques, there are real opportunities to manage and use huge data sets.

The theory of big data is to have no theory, at least about human nature. One just gathers huge amounts of information, observes the patterns and estimates probabilities about how people will act in the future. One does not address causality.

As the authors of Big Data state, “Contrary to conventional wisdom, such human intuiting of causality does not deepen our understanding of the world.” Instead, they aim to stand back nonjudgmentally and observe linkages: “Correlations are powerful not only because they offer insights, but also because the insights they offer are relatively clear. These insights often get obscured when we bring causality back into the picture.”

But are correlations relatively clear? The authors of Freakonomics discuss correlation and causation in the video below; specifically, the view of medical professionals in the first half of the 20th century that polio was caused by ice cream consumption (since disproved).

When two variables, A and B, are found to be correlated, there are several possibilities:
  1. A causes B
  2. B causes A
  3. A causes B at the same time as B causes A (a self-reinforcing system)
  4. Some third factor causes both A and B
  5. The correlation is simple coincidence.

It’s wrong to assume any one of these possibilities without further evidence. Correlation is a (perhaps strong) hint that there may be a relationship, but identifying the exact nature of that relationship requires more, such as a controlled experiment or proper statistical analysis. One needs to examine all of the variables that may influence the relationship and look for evidence supporting or rejecting the influence of each. One also needs to find a mechanism that explains any causal relationship.
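
To make the fourth possibility concrete, here is a minimal Python sketch of the railway-station example. Everything in it is an assumption made for illustration: the names `timetable`, `crowd` and `train`, the noise levels and the sample size are invented, and the data is simulated, not observed. It shows how two variables that never influence each other can still be strongly correlated, and how that correlation fades once the shared cause is accounted for.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# C: the hidden common cause (hypothetical "timetable" signal).
timetable = [rng.gauss(0, 1) for _ in range(5000)]
# A and B each depend on C plus their own noise, never on each other.
crowd = [c + rng.gauss(0, 0.5) for c in timetable]
train = [c + rng.gauss(0, 0.5) for c in timetable]

raw = pearson(crowd, train)
# Correlation between A and B after removing what C explains in each.
partial = pearson(residuals(crowd, timetable), residuals(train, timetable))

print(f"raw correlation:       {raw:.2f}")
print(f"controlling for cause: {partial:.2f}")
```

With these made-up parameters the raw correlation should be strongly positive (about 0.8 in theory), while the correlation that remains after controlling for the timetable should sit close to zero. An analyst who only "observes linkages" sees the first number and never the second.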

Bad Data

"Bad data" is data that has not been collected accurately or consistently, or that has been defined differently from person to person, group to group and company to company. The huge amount of "bad" data that is regularly served up for analysis may make it irresponsible for one to just "stand back nonjudgmentally and observe linkages," especially in the pre-employment testing process.

In the recruitment and hiring context, unproctored online personality tests allow an applicant to take the test anywhere and at any time. That freedom creates conditions ripe for obtaining "bad" data. As stated by Jim Beaty, PhD, Chief Science Officer at testing company Previsor:
"Applicants who want to cheat on the test can employ a number of strategies to beat the test, including logging in for the test multiple times to practice or get the answers, colluding with another person while completing the test, or hiring a test proxy to take the test."
And what about the accuracy of test responses from those who are hired? Analyzing a sample of over 31,000 employees, a data analytics company's researchers found that employees who said they were most likely to follow the rules left the job on average 10% earlier, were 3% less likely to close a sale and were actually not particularly good at following rules.

Bad data exposes a vexing problem for employers. Applicants and employees seek to tell employers what they believe employers want to hear, and employers tend to ask questions that lead applicants and employees to answer these questions in the “right” way.

A Simple Want of Careful, Rational Reflection

As noted in a prior post, prejudice rises not from malice or hostile animus alone. It may result as well from insensitivity caused by simple want of careful, rational reflection.

For example, take two insights from Evolv, a data analytics company:

  1. Living in close proximity to the job site and having access to reliable transportation are correlated with reduced attrition and better performance; and
  2. Referred employees have 10% longer tenure than non-referred employees and demonstrate approximately equal performance.

An employer confronted with these two insights might well determine that (i) applicants living beyond a certain distance from the job site (e.g., a retail store) should be excluded from employment consideration and (ii) preference in hiring should be extended to applicants referred by existing employees. Such a determination may end up being penny wise and pound foolish.

Painting with the broad brush of distance from the job site will result in well-qualified applicants being excluded, applicants who might have ended up being among the longest-tenured employees. Remember that the Evolv insight is a generalized correlation (i.e., the pool of persons living closer to the job site tends to have longer tenure than the pool of persons living farther from the job site). The insight says nothing about any particular applicant or employee.
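
A small simulation can illustrate why a difference between pools says so little about individuals. The sketch below is hypothetical throughout: the tenure figures (12 months versus 10.8 months on average, with a wide spread) are invented to mimic a roughly 10% gap between groups and are not Evolv's data; the names `near` and `far` are placeholders.

```python
import random
from statistics import mean, median

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical tenure in months; negative draws are clipped to zero.
# "near" applicants average about 10% longer tenure than "far" ones,
# but individual outcomes vary widely within both pools.
near = [max(0.0, rng.gauss(12.0, 6.0)) for _ in range(10_000)]
far = [max(0.0, rng.gauss(10.8, 6.0)) for _ in range(10_000)]

typical_near = median(near)
# Share of "far" applicants who outlast the typical "near" applicant.
share_far_above = sum(t > typical_near for t in far) / len(far)

print(f"mean tenure, near: {mean(near):.1f} months")
print(f"mean tenure, far:  {mean(far):.1f} months")
print(f"far applicants outlasting the typical near applicant: {share_far_above:.0%}")
```

Under these assumed numbers the group averages differ as the insight says, yet roughly four in ten of the "far" applicants stay longer than the typical "near" applicant. A distance cutoff would reject every one of them.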

As a consequence, employers will pass over qualified applicants solely because they live (or don't live) in certain areas. Not only does the employer do a disservice to itself and the applicant, it also increases the risk of employment litigation, with its consequent costs (attorney fees, damages, reputational harm, etc.). How?

A recent New York Times article, "In Climbing Income Ladder, Location Matters," reads, in part:
"Her nearly four-hour round-trip [job commute] stems largely from the economic geography of Atlanta, which is one of America’s most affluent metropolitan areas yet also one of the most physically divided by income. The low-income neighborhoods here often stretch for miles, with rows of houses and low-slung apartments, interrupted by the occasional strip mall, and lacking much in the way of good-paying jobs."
The dearth of good-paying jobs in low-income neighborhoods means that residents of those neighborhoods have a longer commute. As to the demographic makeup of low-income families, the 2010 Census showed that poverty rates are much higher for blacks and Hispanics. Consequently, hiring decisions predicated on distance, intentionally or not, discriminate against certain protected classes.

Similarly, an employer extending a hiring preference to referrals of existing employees may be further exacerbating the discriminatory impact of its hiring process. Those referrals tend to be persons from the same neighborhoods and socioeconomic backgrounds as existing employees, meaning that workforce diversity, broadly considered, will decline.

With the huge amounts of "bad" data that get generated and stored daily, the failure to understand how to leverage the data in a practical way that has business benefit will increasingly lead to shaky insights and faulty decision-making, with significant costs both to the employer and society.

A Fool With A Tool Is Still A Fool

Most companies have vast amounts of HR data (employee demographics, performance ratings, talent mobility data, training completed, age, academic history, etc.) but they are in no position to use it. According to Bersin by Deloitte, an HR research and consultancy organization, only 20% of them believe that the data they capture now is highly credible and reliable for decision-making in their own organization.

Research shows that the average large company has more than 10 different HR applications and that its core HR system is more than 6 years old. So it will take significant effort and resources (read: funding) to bring this data together and make sense of it.

As stated by Jim Stikeleather on the Harvard Business Review blog:
Machines don't make the essential and important connections among data and they don't create information. Humans do. Tools have the power to make work easier and solve problems. A tool is an enabler, facilitator, accelerator and magnifier of human capability, not its replacement or surrogate. That's what the software architect Grady Booch had in mind when he uttered that famous phrase: "A fool with a tool is still a fool." 
Understand that expertise is more important than the tool. Otherwise the tool will be used incorrectly and generate nonsense (logical, properly processed nonsense, but nonsense nonetheless). 
Although data does give rise to information and insight, they are not the same. Data's value to business relies on human intelligence, on how well managers and leaders formulate questions and interpret results. More data doesn't mean you will get "proportionately" more information. In fact, the more data you have, the less information you gain as a proportion of the data (concepts of marginal utility, signal to noise and diminishing returns). 
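
One familiar instance of those diminishing returns is the standard error of an average, which shrinks with the square root of the sample size and not with the sample size itself. The short sketch below assumes independent observations with a standard deviation of 1.0; the function name `standard_error` and the sample sizes are chosen only for illustration.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of a sample mean for n independent observations."""
    return sd / sqrt(n)


sizes = [100, 10_000, 1_000_000]
errors = [standard_error(1.0, n) for n in sizes]

for n, se in zip(sizes, errors):
    print(f"n = {n:>9,}: standard error = {se:.4f}")
```

Each hundredfold increase in data buys only a tenfold gain in precision, and that is before any allowance for "bad" data, which more volume does nothing to repair.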
