Employment Testing: Failing to Make the Grade: 2014

Thursday, October 9, 2014

Big Data's Disparate Impact - Excerpts and Annotations

This posting is based on, and excerpts are taken from, "Big Data's Disparate Impact" by Solon Barocas and Andrew D. Selbst. Their article addresses the potential for disparate impact in the data mining process and points to different places within the process where a disproportionately adverse impact on protected classes may result from innocent choices on the part of the data miner. Excerpts from the article are set out below in normal typeface. Please note that footnotes from the article are not included in the excerpts set out below. Annotations that further illuminate issues raised in the article are indented and italicized. Readers are strongly encouraged to read the article by Messrs Barocas and Selbst.

* * * * * * *

"Big Data's Disparate Impact" introduces the computer science literature on data mining and proceeds through the various steps of solving a problem this way:

defining the target variable,
labeling and collecting the training data,
feature selection, and
making decisions on the basis of the resulting model.

Each of these steps creates possibilities for a final result that has a disproportionately adverse impact on protected classes, whether by specifying the problem to be solved in ways that affect classes differently, failing to recognize or address statistical biases, reproducing past prejudice, or considering an insufficiently rich set of factors. Even in situations where data miners are extremely careful, they can still effect discriminatory results with models that, quite unintentionally, pick out proxy variables for protected classes.

To be sure, data mining is a very useful construct. It even has the potential to be a boon to those who would not discriminate, by formalizing decision-making processes and thus limiting the influence of individual bias.

Data mining in such an instance addresses the issue of the "rogue recruiter," a recruiter who is biased, whether intentionally or not, against certain protected classes. Employers and testing companies argue that replacing the rogue recruiter with an algorithmic-based decision model will eliminate the biased hiring practices of that recruiter.

But where data mining does perpetuate discrimination, society does not have a ready answer for what to do about it.

The simple fact that hiring decisions are made "by computers" does not mean the decisions are not subject to bias. Human judgment is subject to an automation bias, which fosters a tendency to disregard or not search for contradictory information in light of a computer-generated solution that is accepted as correct. Such bias has been found to be most pronounced when computer technology fails to flag a problem.

The use of technology systems to hardwire workforce analytics raises a number of fundamental issues regarding the translation of legal mandates, psychological models and business practices into computer code and the resulting distortions. These translation distortions arise from the organizational and social context in which translation occurs; choices embody biases that exist independently, and usually prior to the creation of the system. And they arise as well from the nature of the technology itself and the attempt to make human constructs amenable to computers. (Please see What Gets Lost? Risks of Translating Psychological Models and Legal Requirements to Computer Code.)

Defining the Target Variable and Class Labels

In contrast to those traditional forms of data analysis that simply return records or summary statistics in response to a specific query, data mining attempts to locate statistical relationships in a dataset. In particular, it automates the process of discovering useful patterns, revealing regularities upon which subsequent decision-making can rely. The accumulated set of discovered relationships is commonly called a “model,” and these models can be employed to automate the process of classifying entities or activities of interest, estimating the value of unobserved variables, or predicting future outcomes.

[B]y exposing so-called “machine learning” algorithms to examples of the cases of interest, the algorithm “learns” which related attributes or activities can serve as potential proxies for those qualities or outcomes of interest. In the machine learning and data mining literature, these states or outcomes of interest are known as “target variables.”

The proper specification of the target variable is frequently not obvious, and it is the data miner’s task to define it. In doing so, data miners must translate some amorphous problem into a question that can be expressed in more formal terms that computers can parse. In particular, data miners must determine how to solve the problem at hand by translating it into a question about the value of some target variable.

This initial step requires a data miner to “understand[] the project objectives and requirements from a business perspective [and] then convert[] this knowledge into a data mining problem definition.” Through this necessarily subjective process of translation, though, data miners may unintentionally parse the problem and define the target variable in such a way that protected classes happen to be subject to systematically less favorable determinations.

Kenexa, an employment assessment company purchased by IBM in December 2012, believes that a lengthy commute raises the risk of attrition in call-center and fast-food jobs. It asks applicants for call-center and fast-food jobs to describe their commute by picking options ranging from "less than 10 minutes" to "more than 45 minutes."

The longer the commute, the lower their recommendation score for these jobs, says Jeff Weekley, who oversees the assessments. Applicants also can be asked how long they have been at their current address and how many times they have moved. People who move more frequently "have a higher likelihood of leaving," Mr. Weekley said.

Are there any groups of people who might live farther from the work site and may move more frequently than others? Yes: lower-income persons, who are disproportionately women, Black, Hispanic and mentally ill (all protected classes). They can't afford to live where the jobs are, and they move more frequently because of an inability to afford housing or the loss of employment. Not only are these protected classes poorly paid, many are electronically redlined from hiring consideration.

As a consequence of Kenexa's "insights," its clients will pass over qualified applicants solely because they live (or don't live) in certain areas. Not only does the employer do a disservice to itself and the applicant, it increases the risk of employment litigation, with its consequent costs. (Please see From What Distance is Discrimination Acceptable?)

[W]here employers turn to data mining to develop ways of improving and automating their search for good employees, they face a number of crucial choices. Like [the term] creditworthiness, the definition of a good employee is not a given. “Good” must be defined in ways that correspond to measurable outcomes: relatively higher sales, shorter production time, or longer tenure, for example.

When employers use data mining to find good employees, they are, in fact, looking for employees whose observable characteristics suggest, based on the evidence that an employer has assembled, that they would meet or exceed some monthly sales threshold, that they would perform some task in less than a certain amount of time, or that they would remain in their positions for more than a set number of weeks or months. Rather than drawing categorical distinctions along these lines, data mining could also estimate or predict the specific numerical value of sales, production time, or tenure period, enabling employers to rank rather than simply sort employees.
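The difference between sorting and ranking can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the employee identifiers, the tenure figures and the 12-month cutoff are invented for illustration and come from neither the article nor any vendor's model.

```python
# Hypothetical records: (employee id, tenure in months). Illustrative only.
records = [("A", 3), ("B", 14), ("C", 9), ("D", 26)]

TENURE_CUTOFF_MONTHS = 12  # assumed threshold for a "good" employee


def label_good(tenure_months, cutoff=TENURE_CUTOFF_MONTHS):
    """Categorical target: did the employee stay longer than the cutoff?"""
    return tenure_months > cutoff


# Sorting: every employee falls into one of two classes.
labels = {emp: label_good(months) for emp, months in records}

# Ranking: order employees by the numeric target, longest tenure first.
ranking = [emp for emp, _ in sorted(records, key=lambda r: r[1], reverse=True)]
```

Moving the assumed cutoff from 12 to 8 months would relabel employee C as "good" without any change in the underlying data. That is the kind of choice a data miner makes when defining the target variable.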

These may seem like eminently reasonable things for employers to want to predict, but they are, by necessity, only part of an array of possible ways of defining what “good” means. An employer may attempt to define the target variable in a more holistic way—by, for example, relying on the grades that prior employees have received in annual reviews, which are supposed to reflect an overall assessment of performance. These target variable definitions simply inherit the formalizations involved in preexisting assessment mechanisms, which in the case of human-graded performance reviews, may be far less consistent.

As previously noted, Kenexa defines a "good" employee as a function, in part, of job tenure. It then uses a number of proxies - distance from jobsite, length of time at current address, and how many times moved - to define "job tenure."

Painting with the broad brush of distance from job site, commute time and moving frequency results in well-qualified applicants being excluded, applicants who might have ended up being among the longest tenured of employees. The Kenexa findings are generalized correlations (i.e., persons living closer to the job site tend to have longer tenure than persons living farther from the job site). The insights say nothing about any particular applicant.
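A small Python sketch, using invented numbers, shows how a group-level tendency and an individual outcome can point in opposite directions. The applicants, commute times, tenure figures and the 45-minute cutoff are all hypothetical. In practice the tenure of a rejected applicant is never observed, so the sketch assumes knowledge that no employer actually has.

```python
# Hypothetical applicants: (name, commute in minutes, tenure in months
# they would have reached if hired). All values are invented.
applicants = [
    ("P", 10, 20), ("Q", 15, 6), ("T", 8, 18),   # shorter commutes
    ("R", 50, 30), ("S", 55, 4), ("U", 60, 5),   # longer commutes
]

MAX_COMMUTE_MINUTES = 45  # assumed screening cutoff


def mean(values):
    return sum(values) / len(values)


near_tenure = [t for _, c, t in applicants if c <= MAX_COMMUTE_MINUTES]
far_tenure = [t for _, c, t in applicants if c > MAX_COMMUTE_MINUTES]

# Group-level pattern: shorter commutes go with longer average tenure.
group_pattern_holds = mean(near_tenure) > mean(far_tenure)

# Individual-level result: the longest-tenured applicant is screened out.
longest = max(applicants, key=lambda a: a[2])
longest_is_rejected = longest[1] > MAX_COMMUTE_MINUTES
```

In this made-up data the correlation is real, yet the screen rejects applicant R, who would have stayed longer than anyone else.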

The general lesson to draw from this discussion is that the definition of the target variable and its associated class labels will determine what data mining happens to find. While critics of data mining have tended to focus on inaccurate classifications (false positives and false negatives), as much, if not more, danger resides in the definition of the class label itself and the subsequent labeling of examples from which rules are inferred. While different choices for the target variable and class labels can seem more or less reasonable, valid concerns with discrimination enter at this stage because the different choices may have a greater or lesser adverse impact on protected classes.

Training Data

As described above, data mining learns by example. Accordingly, what a model learns depends on the examples to which it has been exposed. The data that function as examples are known as training data: quite literally the data that train the model to behave in a certain way. The character of the training data can have meaningful consequences for the lessons that data mining happens to learn.

Discriminatory training data leads to discriminatory models. This can mean two rather different things, though:

If data mining treats cases in which prejudice has played some role as valid examples from which to learn a decision-making rule, that rule may simply reproduce the prejudice involved in these earlier cases; and
If data mining draws inferences from a biased sample of the populations to which the inferences are expected to generalize, any decision that rests on these inferences may systematically disadvantage those who are under- or over-represented in the dataset.

Labeling Examples

The unavoidably subjective labeling of examples can skew the resulting findings in such a way that any decisions taken on the basis of those findings will characterize all future cases along the same lines, even if such characterizations would seem plainly erroneous to analysts who looked more closely at the individual cases. For all their potential problems, though, the labels applied to the training data must serve as ground truth.

The kinds of subtle mischaracterizations that happened during training will be impossible to detect when evaluating the performance of a model, because the training is taken as a given at that point. Thus, decisions taken on the basis of discoveries that rest on haphazardly labeled data or data labeled in a systematically, though unintentionally, biased manner will seem valid.

So long as prior decisions affected by some form of prejudice serve as examples of correctly rendered determinations, data mining will necessarily infer rules that exhibit the same prejudice.
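The mechanism can be illustrated with a deliberately crude "learner" written in Python. The hiring history, the neighborhood names and the majority-vote rule are all assumptions made for this sketch; real data mining methods are far more elaborate, but the point carries over: the rule is only as fair as the examples it treats as correct.

```python
from collections import defaultdict

# Hypothetical past decisions: (neighborhood, qualified, hired).
# In this invented history, qualified applicants from "east" were
# usually passed over.
history = [
    ("west", True, True), ("west", True, True), ("west", False, False),
    ("east", True, False), ("east", True, False), ("east", True, True),
    ("east", False, False),
]


def train(examples):
    """Learn, per (neighborhood, qualified) pair, the majority past decision."""
    counts = defaultdict(lambda: [0, 0])  # key -> [times hired, total cases]
    for neighborhood, qualified, hired in examples:
        entry = counts[(neighborhood, qualified)]
        entry[0] += int(hired)
        entry[1] += 1
    # Predict "hire" only where more than half of past cases were hired.
    return {key: hired / total > 0.5 for key, (hired, total) in counts.items()}


model = train(history)
```

Here the learned rule recommends qualified applicants from "west" and turns away equally qualified applicants from "east", because that is what the past decisions it was given look like.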

An employer currently subject to an EEOC investigation states it identified “a pool of existing employees” that Kronos, a third party assessment provider, utilized to create a customized assessment for use by the employer. The employer's reliance on that employee sample is flawed because people with mental disabilities are severely underrepresented in the existing workforce:

According to a 2010 Kessler Foundation/NOD Survey of Employment of Americans with Disabilities conducted by Harris Interactive, the employment gap between people with and without disabilities has remained significant over the past 25+ years.

According to a 2013 report of the Senate HELP Committee, Unfinished Business: Making Employment of People with Disabilities A National Priority, only 32% of working age people with disabilities participate in the labor force, as compared with 77% of working age people without disabilities. For people with mental illnesses, rates are even lower.

The employment rate for people with serious mental illness is less than half the 33% rate for other disability groups (Anthony, Cohen, Farkas, & Gagne, 2002).

Surveys have found that only 10% - 15% of people with serious mental illness receiving community treatment are competitively employed (Henry, 1990; Lindamer et al., 2003; Pandiani & Leno, 2011; Rosenheck et al., 2006; Salkever et al., 2007).

In Albemarle Paper Company v. Moody, 422 US 405 (1975), in which an employer implemented a test on the theory that a certain verbal intelligence was called for by the increasing sophistication of the plant's operations, the Supreme Court cited the Standards of the American Psychological Association and pointed out that a test should be validated on people as similar as possible to those to whom it will be administered. The Court further stated that differential studies should be conducted on minority groups/protected classes wherever feasible.

The use of the employer's own workforce to develop and benchmark its assessment is flawed because people with mental disabilities are severely underrepresented in the employer's workforce and the overall U.S. workforce.

Not only can data mining inherit prior prejudice through the mislabeling of examples, it can also reflect current prejudice through the ongoing behavior of users taken as inputs to data mining.

This is what Latanya Sweeney discovered in a study that found that Google queries for black-sounding names were more likely to return contextual (i.e., key-word triggered) advertisements for arrest records than those for white-sounding names.

Sweeney confirmed that the companies paying for these ads had not set out to focus on black-sounding names; rather, the fact that black-sounding names were more likely to trigger such advertisements seemed to be an artifact of the algorithmic process that Google employs to determine which advertisements to deliver alongside the results for certain queries. Although the details of the process by which Google computes the so-called “quality score” according to which it ranks advertisers’ bids are not fully known, one important factor is the predicted likelihood, based on historical trends, that users will click on an advertisement.

As Sweeney points out, the process “learns over time which ad text gets the most clicks from viewers of the ad” and promotes that advertisement in its rankings accordingly. Sweeney posits that this aspect of the process could result in the differential delivery of advertisements that reflect the kinds of prejudice held by those exposed to the advertisements. In attempting to cater to the preferences of users, Google will unintentionally reproduce the existing prejudices that inform users’ choices.

A similar situation could conceivably arise on websites that recommend potential employees to employers, as LinkedIn does through its Talent Match feature. If LinkedIn determines which candidates to recommend on the basis of the demonstrated interest of employers in certain types of candidates, Talent Match will offer recommendations that reflect whatever biases employers happen to exhibit. In particular, if LinkedIn’s algorithm observes that employers disfavor certain candidates that are members of a protected class, Talent Match may decrease the rate at which it recommends these types of candidates to employers. The recommendation engine would learn to cater to the prejudicial preferences of employers.

Data Collection

Organizations that do not or cannot observe different populations in a consistent way and with equal coverage will amass evidence that fails to reflect the actual incidence and relative proportion of some attribute or activity in the under- or over-observed group. Consequently, decisions that depend on conclusions drawn from this data may discriminate against members of these groups.

The data might suffer from a variety of problems: the individual records that a company maintains about a person might have serious mistakes, the records of the entire protected class of which this person is a member might also have similar mistakes at a higher rate than other groups, and the entire set of records may fail to reflect members of protected classes in accurate proportion to others. In other words, the quality and representativeness of records might vary in ways that correlate with class membership (e.g., institutions might maintain systematically less accurate, precise, timely, and complete records). Even a dataset with individual records of consistently high quality can suffer from statistical biases that fail to represent different groups in accurate proportions. Much attention has focused on the harms that might befall individuals whose records in various commercial databases are error-ridden, but far less consideration has been paid to the systematic disadvantage that members of protected classes may suffer from being miscounted and the resulting biases in their representation in the evidence base.

Recent scholarship has begun to stress this point. Jonas Lerman, for example, worries about “the nonrandom, systemic omission of people who live on big data’s margins, whether due to poverty, geography, or lifestyle, and whose lives are less ‘datafied’ than the general population’s.” Kate Crawford has likewise warned, “because not all data is created or even collected equally, there are ‘signal problems’ in big-data sets—dark zones or shadows where some citizens and communities are ... underrepresented.” Errors of this sort may befall historically disadvantaged groups at higher rates because they are less involved in the formal economy and its data-generating activities.

Crawford points to Street Bump, an application for Boston residents that takes advantage of accelerometers built into smart phones to detect when drivers ride over potholes (sudden movement that suggests broken road automatically prompts the phone to report the location to the city).

While Crawford praises the cleverness and cost-effectiveness of this passive approach to reporting road problems, she rightly warns that whatever information the city receives from this application will be biased by the uneven distribution of smartphones across populations in different parts of the city. In particular, systematic differences in smartphone ownership will very likely result in the underreporting of road problems in the poorer communities where protected groups disproportionately congregate. If the city were to rely on this data to determine where it should direct its resources, it would only further underserve these communities. Indeed, the city would discriminate against those who lack the capacity to report problems as effectively as wealthier residents with smartphones.
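The arithmetic behind this warning is simple enough to sketch in Python. The pothole counts and smartphone ownership rates below are invented, and the model of reporting (a pothole is reported only when the driver who hits it carries a phone running the application) is a simplifying assumption, not a description of how Street Bump works.

```python
# Hypothetical neighborhoods with the same number of potholes but
# different rates of smartphone ownership. All figures are invented.
neighborhoods = {
    "wealthier": {"potholes": 100, "smartphone_rate": 0.8},
    "poorer": {"potholes": 100, "smartphone_rate": 0.4},
}


def expected_reports(potholes, smartphone_rate):
    """Assumed model: reports scale with the share of drivers carrying a phone."""
    return potholes * smartphone_rate


reports = {
    name: expected_reports(n["potholes"], n["smartphone_rate"])
    for name, n in neighborhoods.items()
}

# Share of all reports coming from each neighborhood.
total_reports = sum(reports.values())
report_share = {name: count / total_reports for name, count in reports.items()}
```

Under these assumptions the poorer neighborhood has half of the potholes but generates only about one third of the reports, so a city that allocates repairs by report volume would shortchange it.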

A similar dynamic could easily apply in an employment context if members of protected classes are unable to report their interest in and qualification for jobs listed online as easily or effectively as others due to systematic differences in Internet access.

Zappos has launched a new careers site and removed all job postings. Instead of applying for jobs, persons interested in working at Zappos will need to enroll in a social network run by the company, called Zappos Insiders. The social network will allow them to network with current employees by digital Q&As, contests and other means in hopes that Zappos will tap them when jobs come open.

"Zappos Insiders will have unique access to content, Google Hangouts, and discussions with recruiters and hiring teams. Since the call-to-action is to become an Insider versus applying for a specific opening, we will capture more people with a variety of skill sets that we can pipeline for current or future openings," said Michael Bailen, Zappos’ head of talent acquisition.

In response to a question, “How can I stand out from the pack and stay front-and-center in the Zappos Recruiters’ minds?” on the Zappos' Insider site, the company lists six ways to stand out, including: using Twitter, Facebook, Instagram, Pinterest and Google Hangouts; participating in TweetChats; following Zappos’ employees on various social media platforms; and, reaching out to Zappos’ “team ambassadors.”

For the most part, all of the foregoing activities require broadband internet access and devices (tablets, smartphones, etc.) that run on those access networks. A number of protected classes will be challenged by both the broadband access and social media participation requirements:

As noted by a PewResearch Internet Project Research report, African Americans have long been less likely than whites to have high speed broadband access at home, and that continues to be the case. Today, African Americans trail whites by seven percentage points when it comes to overall internet use (87% of whites and 80% of blacks are internet users), and by twelve percentage points when it comes to home broadband adoption (74% of whites and 62% of blacks have some sort of broadband connection at home).

The gap between whites and blacks when it comes to traditional measures of internet and broadband adoption is pronounced. Specifically, older African Americans, as well as those who have not attended college, are significantly less likely to go online or to have broadband service at home compared to whites with a similar demographic profile.

According to the PewResearch Internet Project, even among those persons who have broadband access, the percentage of those using social media sites varies significantly by age.

Social media participation is not solely a function of age. "Social media is transforming how we engage with customers, employees, jobseekers and other stakeholders," said Kathy Martinez, Assistant Secretary of Labor for Disability Employment Policy. "But when social media is inaccessible to people with disabilities, it excludes a sizeable segment of our population."

Persons with disabilities (e.g., sight or hearing loss, paralysis), whether physical, mental, or developmental, face challenges accessing social media. Each of the social media platforms promoted by Zappos - Twitter, Facebook, Instagram, Pinterest, and Google Hangouts - has a differing level of support for those with disabilities (e.g., closed captions or live captions on content that uses sound/voice). (Please see Zappos: The Future of Hiring and Hiring Discrimination?)

To ensure that data mining reveals patterns that obtain for more than the particular sample under analysis, the sample must share the same probability distribution as the data that would be gathered from all cases across both time and population. In other words, the sample must be proportionally representative of the entire population, even though the sample, by definition, does not include every case.

If a sample includes a disproportionate representation of a particular class (more or less than its actual incidence in the overall population), the results of an analysis of that sample may skew in favor or against the over- or under-represented class. While the representativeness of the data is often simply assumed, this assumption is rarely justified, and is “perhaps more often incorrect than correct.”
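Rather than assuming representativeness, it can be checked. The Python sketch below compares each group's share of a sample with its share of the population. The group names, counts and population shares are hypothetical, and the ratio is just one simple way of expressing the comparison.

```python
# Hypothetical population shares and sample counts. All figures are invented.
population_share = {"group_a": 0.70, "group_b": 0.30}
sample_counts = {"group_a": 90, "group_b": 10}


def representation_ratio(counts, shares):
    """Sample share divided by population share, per group.

    A ratio of 1.0 means proportional representation; below 1.0 the
    group is under-represented, above 1.0 it is over-represented.
    """
    total = sum(counts.values())
    return {group: (counts[group] / total) / shares[group] for group in shares}


ratios = representation_ratio(sample_counts, population_share)
under_represented = [group for group, ratio in ratios.items() if ratio < 1.0]
```

In this invented sample, group_b makes up 30% of the population but only 10% of the sample, a ratio of roughly one third, so patterns mined from the sample rest mostly on group_a.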

Feature Selection

Organizations—and the data miners that work for them—also make choices about what attributes they observe and what they subsequently fold into their analyses. Data miners refer to the process of settling on the specific string of input variables as “feature selection.” Members of protected classes may find that they are subject to systematically less accurate classifications or predictions because the details necessary to achieve equally accurate determinations reside at a level of granularity and coverage that the features fail to achieve.

This problem stems from the fact that data are by necessity reductive representations of an infinitely more specific real-world object or phenomenon. At issue, really, is the coarseness and comprehensiveness of the criteria that permit statistical discrimination and the uneven rates at which different groups happen to be subject to erroneous determinations. Crucially, these erroneous and potentially adverse outcomes are artifacts of statistical reasoning rather than prejudice on the part of decision-makers or bias in the composition of the dataset. As Frederick Schauer explains, decision-makers that rely on statistically sound but nonuniversal generalizations “are being simultaneously rational and unfair” because certain individuals are “actuarially saddled” by statistically sound inferences that are nevertheless inaccurate.

To take an obvious example, hiring decisions that consider credentials tend to assign enormous weight to the reputation of the college or university from which an applicant has graduated, despite the fact that such credentials may communicate very little about the applicant’s job-related skills and competencies. If equally competent members of protected classes happen to graduate from these colleges or universities at disproportionately low rates, decisions that turn on the credentials conferred by these schools, rather than some set of more specific qualities that more accurately sort individuals, will incorrectly and systematically discount these individuals.

Kenexa, an assessment company owned by IBM and used by hundreds of employers, believes that a lengthy commute raises the risk of attrition in call-center and fast-food jobs. It asks applicants for those jobs to describe their commute by picking options ranging from "less than 10 minutes" to "more than 45 minutes." According to Kenexa’s Jeff Weekley, in a September 20, 2012 article in The Wall Street Journal, “The longer the commute, the lower their recommendation score for these jobs.” Applicants are also asked how long they have been at their current address and how many times they have moved. People who move more frequently "have a higher likelihood of leaving," Mr. Weekley said.

A 2011 study by the Center for Public Housing found that poor and near-poor families tend to move more frequently than the general population. A wide range of often complex forces appears to drive this mobility:

the formation and dissolution of households;

an inability to afford one’s housing costs;

the loss of employment;

lack of quality housing; or

the search for a safer neighborhood.

According to the U.S. Census, lower-income persons are disproportionately female, black, Hispanic, and mentally ill.

Painting with the broad brush of distance from work, commute time and moving frequency results in well-qualified applicants being excluded from employment consideration. Importantly, the workforce insights of companies like Kenexa are based on data correlations - they say nothing about a particular person.

The application of these insights means that many low-income persons are electronically redlined. Employers do not even interview, let alone hire, qualified applicants because they live in certain areas or because they have moved. The reasons for moving do not matter, even if it was to find a better school for their children, to escape domestic violence or due to job loss from a plant shutdown.

When Clayton County, Georgia killed its bus system in 2010, it had nearly 9,000 daily riders. Many of those riders used the service to commute to their jobs. The transit shutdown increased commuting times (as persons found alternate ways to get to work) and led to more housing mobility (as persons relocated to be closer to their jobs to mitigate commuting time). Through no fault of their own, the former bus riders' longer commutes or changes of residence made them less attractive job candidates to the many employers who use companies like Kenexa.

Making Decisions on the Basis of the Resulting Model

Cases of decision-making that do not artificially introduce discriminatory effects into the data mining process may nevertheless result in systematically less favorable determinations for members of protected classes. Situations of this sort are possible when the criteria that are genuinely relevant in making rational and well-informed decisions also happen to serve as reliable proxies for class membership. In other words, the very same criteria that correctly sort individuals according to their predicted likelihood of excelling at a job, as formalized in some fashion, may also sort individuals according to class membership.

For example, employers may find, in conferring greater attention and opportunities to employees that they predict will prove most competent at some task, that they subject members of protected groups to consistently disadvantageous treatment because the criteria that determine the attractiveness of employees happen to be held at systematically lower rates by members of these groups. Decision-makers do not necessarily intend this disparate impact because they hold prejudicial beliefs; rather, their reasonable priorities as profit-seekers unintentionally recapitulate the inequality that happens to exist in society. Furthermore, this may occur even if proscribed criteria have been removed from the dataset, the data are free from latent prejudice or bias, the data is especially granular and diverse, and the only goal is to maximize classificatory or predictive accuracy.

The problem stems from what researchers call “redundant encodings”: cases in which membership in a protected class happens to be encoded in other data. This occurs when a particular piece of data or certain values for that piece of data are highly correlated with membership in specific protected classes. The fact that these data may hold significant statistical relevance to the decision at hand explains why data mining can result in seemingly discriminatory models even when its only objective is to ensure the greatest possible accuracy for its determinations. If there is a disparate distribution of an attribute, a more precise form of data mining will be more likely to capture it as such. Better data and more features will simply expose the exact extent of inequality.
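A redundant encoding can be shown with a toy example in Python. The applicants, the two groups, the ZIP codes and the decision rule are all hypothetical. The rule is handed only the ZIP code, never the group, yet its outcomes still differ sharply by group because the ZIP code carries much of the same information.

```python
# Hypothetical applicants. "group" stands in for membership in a protected
# class; "zip_code" is an ordinary-looking feature correlated with it.
applicants = [
    {"group": "x", "zip_code": "A"},
    {"group": "x", "zip_code": "A"},
    {"group": "x", "zip_code": "A"},
    {"group": "x", "zip_code": "B"},
    {"group": "y", "zip_code": "A"},
    {"group": "y", "zip_code": "B"},
    {"group": "y", "zip_code": "B"},
    {"group": "y", "zip_code": "B"},
]


def favorable(features):
    """Assumed decision rule: it sees only the ZIP code, never the group."""
    return features["zip_code"] == "A"


def favorable_rate(group):
    """Share of a group's members who receive a favorable decision."""
    members = [a for a in applicants if a["group"] == group]
    # Strip the protected attribute before the rule is applied.
    decisions = [favorable({"zip_code": a["zip_code"]}) for a in members]
    return sum(decisions) / len(decisions)


rate_x = favorable_rate("x")
rate_y = favorable_rate("y")
```

With these invented records, three of four members of group x receive a favorable decision against one of four members of group y, even though the protected attribute was removed before the rule ran.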

Data mining could also breathe new life into traditional forms of intentional discrimination because decision-makers with prejudicial views can mask their intentions by exploiting each of the mechanisms enumerated above. Stated simply, any form of discrimination that happens unintentionally can be orchestrated intentionally as well.

For instance, decision-makers could knowingly and purposefully bias the collection of data to ensure that mining suggests rules that are less favorable to members of protected classes. They could likewise attempt to preserve the known effects of prejudice in prior decision-making by insisting that such decisions constitute a reliable and impartial set of examples from which to induce a decision-making rule. And decision-makers could intentionally rely on features that only permit coarse-grain distinction-making—distinctions that result in avoidable and higher rates of erroneous determinations for members of a protected class.

Because data mining holds the potential to infer otherwise unseen attributes, including those traditionally deemed sensitive, it can furnish methods by which to determine indirectly individuals’ membership in protected classes and to unduly discount, penalize, or exclude such people accordingly. In other words, data mining could grant decision-makers the ability to distinguish and disadvantage members of protected classes without access to explicit information about individuals’ class membership. It could instead help to pinpoint reliable proxies for such membership and thus place institutions in the position to automatically sort individuals into their respective class without ever having to learn these facts directly.

Tuesday, August 26, 2014

Knack Testing Illegal Under ADA?

Wasabi Waiter looks a lot like hundreds of other simple online games. Players acting as sushi servers track the moods of their customers, deliver them dishes that correspond to those emotions, and clear plates while tending to incoming patrons. Unlike most games, though, Wasabi Waiter purportedly analyzes every millisecond of player behavior, measuring conscientiousness, emotion recognition, and other attributes that academic studies show correlate with job performance. The game, designed by startup Knack.it, then scores each player’s likelihood of becoming an outstanding employee.

Knack's assessments are based on games developed by the company that may be "played" on computers and mobile devices. Interesting, but how do persons with disabilities play these games? How would a blind person play these games? How would a person with limb paralysis play these games? How would a person with diminished mental capacity play these games? How well would a person who may not be computer literate, an older person for example, play these games? What advantage, if any, does a gaming environment provide for one class of persons (young male online gamer) versus another (mature female non-gamer)?

Screening Out Applicants

Tests that screen out or tend to screen out an individual with a disability or a class of individuals with disabilities are illegal under the Americans with Disabilities Act (ADA) unless the tests are job-related and consistent with business necessity.

Knack testing relies on gamification. Applicants "play" Wasabi Waiter, Balloon Brigade, and other video games to generate the data used by Knack to identify promising applicants. As noted above, however, the reliance on video games screens out persons with disabilities, whether physical disabilities like blindness and limb paralysis or mental disabilities like diminished mental capacity.

Phrased differently, how would physicist Stephen Hawking, clearly an innovator and high performer, fare in taking Knack's Balloon Brigade? Hawking has a motor neurone disease related to amyotrophic lateral sclerosis, a condition that has progressed over the years. He is almost entirely paralysed and communicates through a speech generating device.

From a practical standpoint, legal claims that an individual with a disability has been screened out do not require a statistical showing of disparate impact, or other comparative evidence showing that a group of disabled persons is adversely affected. The plain language of the law – “screen out or tend to screen out” and “an individual with a disability or a class of individuals with disabilities” – confirms that a claim may be supported by evidence that the challenged practice screens out an individual on the basis of their disability. “In the ADA context, a plaintiff may satisfy the second prong of his prima facie case [impact upon persons with protected characteristic] by demonstrating an adverse impact on himself rather than on an entire group.” Gonzalez v. City of New Braunfels.

Illegal Medical Examination

The ADA prohibits employers, whether directly or via third parties like Knack, from administering pre-employment medical examinations. Guidance by the Equal Employment Opportunity Commission defines medical examination under the ADA by reference to seven factors, any one of which may be sufficient to determine that a test is a medical examination.

Physiological Responses

One of those factors is whether the test measures an applicant's physiological responses to performing a task. EEOC guidance on this issue states:

[I]f an employer measures an applicant's physiological or biological responses to performance, the test would be medical.

According to Knack, its test:

leverages cutting-edge behavioral and cognitive neuroscience, data science, and computer science to build games which produce thousands of data points describing how a player perceives, responds, plans, reacts, thinks, problem-solves, adapts, learns, persists, and performs in a multitude of situations.

Types of physiological responses include a reaction or response - a bodily process occurring due to the effect of some antecedent stimulus or agent. As noted in the prior paragraph, Knack tests create data points that track how an applicant perceives, responds, reacts, adapts, learns and persists. The Knack test, therefore, is an illegal medical examination under the ADA.

Five Factor Model of Personality

Justin Fox, executive editor of the Harvard Business Review Group, took two of the Knack assessments and received information in the following report:

As can be seen by the report, among the factors measured by Knack are conscientiousness, openness and stability. These are elements found in the Five Factor Model of Personality, a model that is currently being challenged in at least seven charges filed with the EEOC. Please see ADA, FFM and DSM.

The ADA prohibits pre-employment medical exams but allows employers to “make pre-employment inquiries into the ability of an applicant to perform job-related functions.” The Knack gaming measurements do not seek job-related information and are not consistent with business necessity. The measurements, designed to reveal information about individuals’ openness, conscientiousness, stability (also referred to as neuroticism), and other factors do not seek information about the ability of an applicant to perform the day-to-day functions of a job.

Knowledge of Disability Not Required

Neither the medical examination claim nor the "screen out" claim under the ADA requires that an employer have knowledge that an applicant has a disability, a holding consistent across a number of jurisdictions, including the 7th, 9th, 10th, and 11th Federal Circuit Courts of Appeals.

ADA guidance states, in relevant part:

A covered entity shall not require a medical examination and shall not make inquiries of an employee as to whether such employee is an individual with a disability or as to the nature and severity of the disability, unless such examination or inquiry is shown to be job-related and consistent with business necessity.

According to guidance issued by the EEOC, "This statutory language makes clear that the ADA’s restrictions on inquiries and examinations apply to all employees, not just those with disabilities.”

Monday, August 25, 2014

Sound and Fury, Signifying Nothing

Incorporating elements of gamification, big data, machine learning, and predictive human analytics, Knack is a veritable buzzword oasis. According to Knack, their games are designed to test cognitive skills that employers might want, drawing on some of the latest scientific research. These range from pattern recognition to emotional intelligence, risk appetite and adaptability to changing situations.

John Funge, Knack's CTO, states that "we have used our games to infer cognitive ability, conscientiousness, leadership potential, creativity as well as predict how people would perform as surgeons, management consultants, and innovators." In an Economist article, Chris Chabris, a Knack executive, states that games have huge advantages over traditional recruitment tools, such as personality tests, which can easily be outwitted by an astute candidate. Many more things can be tested quickly and performance can't be faked on Knack's games, he says.

Guy Halfteck, Knack's founder and CEO, says playing a video game can be a better representation of who you are and your skill sets than an employer might get in a one-on-one conversation. "As people, we make many decisions that are biased, whether it's consciously or subconsciously, and we have no good tools to assess and evaluate, let alone predict, what one's potential is," he says.

If Knack's CEO admits that people make many decisions that are biased, what prevents the people at Knack from being biased in the creation, development and implementation of their games? Further, what prevents employers using Knack from being held liable for the biases of those games? The answer to both questions: Nothing.

Algorithmic Illusion

While many companies foster an illusion that scoring/classification is an area of absolute algorithmic rule—that decisions are neutral, organic, and even automatically rendered without human intervention—reality is a far messier mix of technical and human curating. Both the datasets and the algorithms used to analyze the data reflect choices, among others, about connections, inferences, and interpretation.

The recent White House report, “Big Data: Seizing Opportunities, Preserving Values," found that, "while big data can be used for great social good, it can also be used in ways that perpetrate social harms or render outcomes that have inequitable impacts, even when discrimination is not intended."

The fact sheet accompanying the White House report warns:

As more decisions about our commercial and personal lives are determined by algorithms and automated processes, we must pay careful attention that big data does not systematically disadvantage certain groups, whether inadvertently or intentionally. We must prevent new modes of discrimination that some uses of big data may enable, particularly with regard to longstanding civil rights protections in housing, employment, and credit.

Some of the most profound challenges revealed by the White House Report concern how data analytics may lead to disparate inequitable treatment, particularly of disadvantaged groups, or create such an opaque decision-making environment that individual autonomy is lost in an impenetrable set of algorithms. Please see Knack Testing Illegal Under ADA?

Systemic Risk

Workforce assessment systems like Knack's games, designed in part to mitigate risks for employers, are becoming sources of material risk, both to job applicants and employers. The systems create the perception of stability through probabilistic reasoning and the experience of accuracy, reliability, and comprehensiveness through automation and presentation. But in so doing, technology systems draw attention away from uncertainty and partiality.

While Knack's approach may help reduce an employer's hiring costs and may reduce the impact of overtly biased or discriminatory behavior, the inclusion of one or more potentially "defective components" in the assessments means that employers face the risk that a finding of bias or discrimination of a Knack assessment used by one employer will put all employers that use the assessment at risk. Please see When the First Domino Falls: Consequences to Employers of Embracing Workforce Assessment Solutions.

These "defective components" in assessments may be either design defects (e.g., the adoption and use of certain personality models) or manufacturing defects (e.g., coding errors in the assessment software). The latter is analogous to the coding error at 23andMe that resulted in notices going out to some customers informing them that they had a chronic and life-shortening condition when they did not. Please see On Not Dying Young: Fatal Illness or Flawed Algorithm?

Each day an employer continues to use the Knack assessment, there are more potential plaintiffs with claims against that employer. Labor and employment laws like Title VII and the ADA permit an employer to use a third party like Knack to undertake the assessment of job applicants. The use of a third party, however, does not insulate an employer from any claims arising from the assessment usage. Under those laws, an employer is responsible (and liable) for any failures on the part of an assessment or assessment provider to comply with the provisions of those laws.

No Silver Bullet

Just as concerns about scoring systems are heightened, their human element is diminishing. Although software engineers initially identify the correlations and inferences programmed into algorithms, machine learning, predictive analytics, and big data promise to eliminate the human “middleman” at some point in the process.

As Hector J. Levesque, a professor at the University of Toronto and a founding member of the American Association for Artificial Intelligence, wrote:

"As a field, I believe that we tend to suffer from what might be called serial silver bulletism, defined as follows:

the tendency to believe in a silver bullet for AI, coupled with the belief that previous beliefs about silver bullets were hopelessly naïve.

We see this in the fads and fashions of AI research over the years: first, automated theorem proving is going to solve it all; then, the methods appear too weak, and we favour expert systems; then the programs are not situated enough, and we move to behaviour-based robotics; then we come to believe that learning from big data is the answer; and on it goes."

Similarly, employment assessment companies like Knack have marketed the benefits of science, precision and data over the past fifteen years under the guise of neural networks, artificial intelligence, big data and deep learning, yet what has changed? Employee engagement levels have hardly budged and employee turnover remains a continuing and expensive challenge for employers. Please see Gut Check: How Intelligent is Artificial Intelligence?

Friday, August 15, 2014

The Next Asbestos? The Next FLSA?

Asbestos Litigation

A 2005 RAND report states that asbestos litigation arose as a result of millions of individuals’ exposure to asbestos and as a result of many asbestos product manufacturers’ failure to protect workers against exposure and failure to warn their workers to take adequate precautions against exposure. The history of the litigation has been shaped by the rise of a sophisticated and well-capitalized plaintiff bar, heightened media attention to litigation, and the information technology revolution.

According to the RAND report:

At least 8,400 entities have been named as asbestos defendants through 2002.
Defendants are distributed across most U.S. industries.
Total spending on asbestos litigation through 2002 was about $70 billion, broken down as set out in the diagram below.

FLSA Litigation

The Fair Labor Standards Act (FLSA) is the federal law of broadest application governing minimum wage, overtime pay, and youth employment. Employees who are covered by the FLSA are entitled to be paid at least the Federal minimum wage as well as time and one-half their regular rates of pay for all hours worked over 40 in a workweek, unless an exemption applies.

Although FLSA litigation can involve a variety of claims, two of the most common are misclassification claims—i.e., allegations that an employer has misclassified an employee, or a group of employees, as exempt from the FLSA’s overtime requirements—and “off-the-clock” claims—i.e., allegations that an employee, or group of employees, has not been paid for all of the time they worked for the employers.

These and other claims under the FLSA can be brought individually or on behalf of all “similarly situated” employees and former employees. As a result, FLSA cases can involve a large number of employees and present significant financial exposure for employers. For instance, in 2008 Walmart agreed to pay as much as $640 million to settle 63 federal and state class actions claiming the company cheated hourly workers and forced them to work through breaks.

The multitude of wage and hour claims and lawsuits that workers have filed under the FLSA, and its state law counterparts, have made wage and hour law the nation’s fastest growing type of litigation. All industries (including retail, financial services, hospitality, construction, technology, and communications) have been susceptible to these lawsuits.

As shown in the graph to the left, the number of wage and hour lawsuits increased significantly over the past reporting year to 8,126, up another 4.7% over the prior 12-month period.

This is the seventh straight year of increases in federal court wage and hour lawsuits. It brings the continuing growth in these cases to 237% over the past decade and 438% since 2000.
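The percentages above can be turned back into approximate case counts. The short sketch below assumes each figure is a percentage increase over the earlier period (so a 237% increase means the current count is 3.37 times the earlier one). That reading is an assumption, and the resulting counts are implied values, not figures reported in the source. The helper name base_from_increase is hypothetical.

```python
def base_from_increase(current, pct_increase):
    """Earlier count implied by a current count and a percentage increase."""
    return current / (1 + pct_increase / 100)

current = 8126  # wage and hour lawsuits in the latest reporting year

prior_year = base_from_increase(current, 4.7)   # year before
decade_ago = base_from_increase(current, 237)   # ten years earlier
year_2000 = base_from_increase(current, 438)    # baseline year 2000

print(f"implied prior-year filings: {prior_year:.0f}")
print(f"implied filings a decade ago: {decade_ago:.0f}")
print(f"implied filings in 2000: {year_2000:.0f}")
```

If the percentages were instead meant as ratios (current count as a percentage of the earlier count), the implied baselines would be larger, so these numbers should be read only as a rough check on scale.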

Although anecdotal, a partner at a major labor and employment defense law firm believes that those numbers would be substantially greater if wage and hour lawsuits filed in state courts under state pay practices, tip laws, meal and rest break requirements, independent contractor rules, and the like, were added.

Employment Testing Litigation

Employment testing litigation will have many parallels with asbestos and FLSA litigation, but on an even larger scale.

There are potentially tens of millions of plaintiffs

Any person who takes an assessment, if the assessment is determined to be a medical examination, will be a plaintiff. Appellate courts have held consistently that the prohibition on medical examinations extends to all persons, including persons who are not disabled (see, e.g., decisions from the Second, Sixth, Eighth, Tenth and Eleventh Circuits).
Any class of disabled persons (e.g., those with mental illness) where the assessment tends to screen out those persons from employment consideration. Unlike disparate impact claims under other employment discrimination laws, the ADA does not require statistical evidence if an expert can confirm that the test would screen out persons with disabilities or categories of disabilities.
Federal and state agencies seeking to recover billions of dollars spent on SSDI/SSI disability awards, Medicare/Medicaid and other costs expended on persons who were illegally and invidiously discriminated against as a consequence of the use of employment assessments.

There are potentially hundreds of thousands of defendants

Employers utilizing testing will face claims by (A) applicants and employees for both the illegal use of a pre-employment medical examination and the failure to treat information obtained from such medical examination as confidential medical information, (B) federal and state agencies seeking recovery of costs incurred (disability awards, Medicare/Medicaid, etc.) as a consequence of the illegal testing, (C) claims by insurers denying coverage and (D) claims by testing companies denying liability/rejecting indemnification.
Testing companies will face claims by (A) applicants whose information was not treated as confidential medical information (a separate cause of action that does not require exhausting of remedies with the EEOC), (B) employers seeking indemnification from the testing companies for claims made against the employers by applicants, government agencies and others, and (C) claims by insurers denying coverage.
Insurance companies who underwrite policies for employers and testing companies will face claims from those they insure as well as individuals and government agencies making claims against those employers and testing companies.

Costs to employers, testing companies and their insurers will be in the tens (if not hundreds) of billions of dollars, including:

Defense transaction costs, including the costs of outside counsel, internal management and employee time, public relations, lobbying, etc.
Gross compensation, including awards to applicants and payment of costs and fees (i.e., counsel, expert witnesses, e-discovery).
Reputational damage costs, including lost/reduced sales and brand damage.
Business restructuring and/or “disinfectant” costs – The employers' and testing companies' retention and use of confidential medical information in violation of the ADA safeguards has resulted in applications and solutions that illegally use this data. The data derived from the hundreds of millions of assessments over the past years has created a virus that has "infected" the employer and testing company solutions that integrate this data.

Thursday, August 7, 2014

Lovin It (Or Not) - McDonald's and the NLRB

The National Labor Relations Board (NLRB) announced on July 29, 2014 that its Office of General Counsel (OGC) authorized the filing of administrative complaints against franchise giant, McDonald’s USA LLC, for unfair labor practices involving workers at franchisee-owned restaurants.

The OGC said that it had investigated 181 cases of unlawful labor practices at McDonald’s franchise restaurants since 2012 and found sufficient merit in at least 43 cases to name McDonald’s as the workers’ “joint employer” creating a legal basis for holding McDonald’s responsible with the franchise owners for the labor violations. The OGC's findings were made in the form of an Advice Memo supporting the OGC's legal theory. Since this is a matter of ongoing litigation, disclosure of the Advice Memo will not be made at this time.

The NLRB's rationale is likely found in the new “joint employer” test that it is pressing for in Browning-Ferris Industries of California, Inc., a non-franchise case. In its amicus brief, the OGC urges the NLRB to replace the current “joint employer” standard, which examines a company’s direct control over another company’s essential employment decisions specifically affecting hiring, firing, supervision and direction of employment, with the pre-1984 broader-based “industrial realities” test, which focuses on the “economic dependence” between two companies and assumes that a company effectively controls another company’s labor decisions if it dictates standards for every other variable of its business.

McDonald's HR Practices

Heather Smedstad, senior vice president, human resources, of McDonald’s USA, said in a statement that “this decision to allow unfair labor practice complaints to allege that McDonald’s is a joint employer with its franchisees is wrong. McDonald’s will contest this allegation in the appropriate forum.” In the statement, Ms. Smedstad also says that "McDonald’s does not direct or co-determine the hiring, termination, wages, hours, or any other essential terms and conditions of employment of our franchisees’ employees ..."

Ms. Smedstad's statement that McDonald's does not determine or help determine decisions on employment matters appears at odds with her executive bio on the McDonald's website which reads, in part, that "she has lead responsibility for ... execution of all areas of HR for McDonald’s U.S. business and its 14,000 restaurants." (emphasis added) Assuming the accuracy of Ms. Smedstad's bio, McDonald's appears to play a role in employment matters at franchisees, since 90% of those 14,000 restaurants are franchisee-owned.

McDonald's: Hiring Gatekeeper

The process of applying for an hourly job at a McDonald's franchisee requires the applicant to use the application process found on the McDonald's corporate website. In addition to providing personal information, applicants are required to complete an online assessment. The assessment goes beyond testing skills and evaluating knowledge, and assesses cultural fit, behavior, and potential.

Applicants are required to choose between pairs of statements, including:

I am usually a very stable person
I often am not sure why I feel the way I do about certain things
I am pretty good at understanding what other people are thinking
If something very bad happens, it takes time before I'm happy again
I am sometimes not in touch with my feelings
Most of the time I am not interested in other people's problems
I prefer to avoid difficult tasks, in case I end up making mistakes
When I think about the future, I get worried because I know how difficult life can be
Sometimes I find it hard to sympathize with others' feelings
I do not like the idea of change, I like things the way they are
I have certain ways of doing things which I do not like to change
I am very disorganized, but it works for me
New experiences often do not turn out well so I like to do what I already like
I smile more often than not
I get frustrated doing things in groups because most people are hard to get along with
I am not very assertive because I do not want to upset anyone

In most instances, by the time the franchisee first becomes aware of the applicant's interest in a job, the applicant has already had several employment-related interactions with McDonald's. The franchisee is made aware of the applicant's interest by a report that not only contains personal information supplied by the applicant, but also the results of the assessment - including a dashboard of predictors, tagging the candidate as qualified or not qualified.

McDonald's active and ongoing control of the applicant intake and assessment portions of the hiring process for the franchisees contrasts sharply with its public claims. It is also at odds with advice being provided by labor and employment lawyers who represent employers. Rochelle Spandorf of Davis Wright Tremaine writes, "[T]he best advice for franchisors at the moment is to completely distance all operating advice from anything that could remotely be interpreted as suggesting or recommending particular employment practices." Similarly, John T. Lovett of Frost Brown Todd writes, "The more influence a franchisor has over the employment practices of the franchisee, the greater the likelihood that the franchisor will be found to be a 'joint employer' with the franchisee."

Thursday, July 31, 2014

TeacherInsight Assessments: Fooled by Randomness?

According to Gallup, the current TeacherInsight (TI) assessment is based on a study of teachers (1,000 and 13,000 candidates) whose students have grown academically, regardless of the beginning level of the student. Unlike state-required certification exams, TI measures values and behavior -- not subject knowledge.

The TI assessment is based on a "profile" of the 1,000 teachers studied and applicants are measured (graded) based on correlation between their responses and the "profile" response. As noted in an article in the Dallas Morning News:

Gallup would not release its test but provided one question without answering it: "When students say they want their teachers to be fair, what do they mean?" Applicants choose from among four answers. It may seem like a subjective question, but according to Gallup, the best teachers all answer the same way. "There's quite a bit of consistency in their behavior," Gary Gordon, vice president of Gallup's Education Division, said of the best teachers. "They don't distinguish between students as much."
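Gallup has not published its scoring method, so the following is only a hypothetical sketch of what grading against a "profile" could look like. It swaps in a simple agreement rate (the share of items on which an applicant gives the profile answer) in place of whatever correlation measure Gallup actually uses. The function name profile_match_score and the answer keys are invented for illustration.

```python
def profile_match_score(responses, profile):
    """Fraction of items on which an applicant's answer equals the profile answer."""
    if len(responses) != len(profile):
        raise ValueError("responses and profile must cover the same items")
    matches = sum(given == expected for given, expected in zip(responses, profile))
    return matches / len(profile)

# Hypothetical five-item answer key and one applicant's answers.
profile = ["B", "A", "D", "C", "A"]
applicant = ["B", "A", "C", "C", "A"]

score = profile_match_score(applicant, profile)
print(f"agreement with profile: {score:.0%}")
```

A score of this kind measures only how closely an applicant resembles the profile group. Whether that resemblance says anything about effectiveness in the classroom is a separate empirical question.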

As discussed in The (Non)Predictive Ability of the Gallup TeacherInsight Assessment there is little evidence linking teachers' test scores to student achievement and teacher effectiveness.

Correlation vs Causation

TI assessment scoring is predicated on correlations among various data elements. Correlations let us analyze a phenomenon (teacher effectiveness) not by shedding light on its inner workings but by identifying a (hopefully) useful proxy for it. Of course, even strong correlations are never perfect. It is quite possible that two things may behave similarly just by coincidence. We may simply be “fooled by randomness” to borrow a phrase from the empiricist Nassim Nicholas Taleb. With correlations, there is no certainty, only probability.

Decisions made or affected by correlation are inherently flawed. Correlation does not equal causation. This point is made vividly by Tyler Vigen, a law student at Harvard who put together a website that finds very, very high correlations - as shown below - between things that are absolutely not related.

Each of these has a correlation coefficient in excess of 0.99, serving to demonstrate the point that a strong correlation isn't nearly enough to make strong conclusions about how two phenomena are related to each other.
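The same point can be reproduced with purely random data. The sketch below generates 50 independent random walks of 12 points each (synthetic series with no connection to one another, and not the data behind the examples above) and then searches all 1,225 pairs for the strongest Pearson correlation. The function names, the seed, and the series sizes are arbitrary choices made for illustration.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def random_walk(rng, length):
    """Cumulative sum of independent Gaussian steps."""
    level, walk = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(42)
series = [random_walk(rng, 12) for _ in range(50)]

# Search every pair of unrelated series for the largest absolute correlation.
best = max(abs(pearson(a, b)) for a, b in itertools.combinations(series, 2))
print(f"strongest correlation among unrelated pairs: {best:.3f}")
```

Because the search looks at many pairs of short, trending series, a very strong correlation typically turns up even though every series was generated independently. Selecting the best match after the fact is what makes the coincidence look meaningful.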

While many companies foster an illusion that scoring/classification is an area of absolute algorithmic rule—that decisions are neutral, organic, and even automatically rendered without human intervention—reality is a far messier mix of technical and human curating. Both the datasets and the algorithms used to analyze the data reflect choices, among others, about connections, inferences, and interpretation.

Theories about teacher effectiveness shape both the methods used in the TI assessment and the results of that assessment. This begins with how the data was selected: what is chosen influences what is found. Similarly, when Gallup analyzes the data, it chooses tools that rest on theories. And as it interprets the results it again applies theories.

Seizing Opportunities, Preserving Values

The recent White House report, “Big Data: Seizing Opportunities, Preserving Values," found that, "while big data can be used for great social good, it can also be used in ways that perpetrate social harms or render outcomes that have inequitable impacts, even when discrimination is not intended." The fact sheet accompanying the White House report warns:

As more decisions about our commercial and personal lives are determined by algorithms and automated processes, we must pay careful attention that big data does not systematically disadvantage certain groups, whether inadvertently or intentionally. We must prevent new modes of discrimination that some uses of big data may enable, particularly with regard to longstanding civil rights protections in housing, employment, and credit.

Some of the most profound challenges revealed by the White House Report concern how data analytics may lead to disparate inequitable treatment, particularly of disadvantaged groups, or create such an opaque decision-making environment that individual autonomy is lost in an impenetrable set of algorithms.

Workforce assessment systems like Gallup's TeacherInsight, designed in part to mitigate risks for employers, have become sources of material risk, both to job applicants and employers. The systems create the perception of stability through probabilistic reasoning and the experience of accuracy, reliability, and comprehensiveness through automation and presentation. But in so doing, technology systems draw attention away from uncertainty and partiality.

As more and more school districts seek to broaden their teaching staff to include more ethnically, linguistically, and culturally diverse teachers, it is imperative to make the selection and hiring practices of teachers more transparent.