Wednesday, June 17, 2015

A Fool With A Tool Is Still A Fool

In the June 22, 2015 cover story for Time magazine, "Questions to Answer in the Age of Optimized Hiring," author Eliza Gray asks, “Are we truly comfortable with turning hiring–potentially one of the most life-changing experiences that a person can go through–over to the algorithms?” The answer should be no.

When algorithms weigh hundreds of factors over a huge data set, you can't really know why they come to a particular decision or whether it makes sense. As Geoff Nunberg, who teaches at the School of Information at the University of California, Berkeley, stated in an NPR interview, “big data is no more exact a notion than big hair.”

Decisions driven by correlation alone are inherently suspect, because correlation does not equal causation, as demonstrated by Tyler Vigen on his website Spurious Correlations. For example:
  • There is a greater than 99% correlation (0.992558) between the divorce rate in Maine and the per capita consumption of butter in the U.S. over the years 2000-2009;
  • There is a greater than 78% correlation (0.78915) between the number of worldwide non-commercial space launches and the number of sociology doctorates awarded in the U.S. over the years 1997-2009; and,
  • There is a greater than 66% correlation (0.666004) between the number of films Nicolas Cage appeared in and the number of people who drowned by falling into a swimming pool over the years 1999-2009.
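Coefficients like the ones above are easy to reproduce. The sketch below computes Pearson's r on two made-up declining series (not Vigen's actual data) to show how a pair of unrelated downward trends yields a near-perfect correlation:

```python
# Minimal sketch: two series that both trend downward over the same decade
# show a high Pearson r even though neither causes the other.
# The numbers below are hypothetical, NOT Vigen's actual data.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

divorce_rate = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]  # hypothetical
butter_lbs   = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 4.1]  # hypothetical

print(round(pearson_r(divorce_rate, butter_lbs), 3))  # high r, no causal link
```

The r here lands well above 0.9, which is the whole point: a strong coefficient tells you the two lines move together, not that one has anything to do with the other.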

And what of the correlation between personality and job performance? In a 2007 article titled “Reconsidering the Use of Personality Tests in Personnel Selection Contexts,” Dr. Neal Schmitt, the University Distinguished Professor at Michigan State University, wrote:

 [A 1965 research paper found that] the average validity of personality tests was 0.09. Twenty-five years later, Barrick and Mount (1991) published a paper in which the best validity they could get for the Big Five [personality model] was 0.13. They looked at the same research. Why are we now suddenly looking at personality as a valid predictor of job performance when the validities still haven’t changed and are still close to zero?

If personality assessments are designed to find those employees with the best fit for the company culture, shouldn't the rising use of those assessments by employers over the past 10-15 years have resulted in a concomitant rise in employee engagement?

Gallup has taken an employee engagement poll annually since 2000. Gallup defines engaged employees as those who are involved in, enthusiastic about, and committed to their work and workplace. According to the 2014 Gallup poll, 51% of employees in the U.S. were "not engaged" in their jobs and 17.5% were "actively disengaged." These percentages have changed little over the fifteen years Gallup has been polling.

Gallup’s research shows that employee engagement is strongly connected to business outcomes essential to an organization’s financial success, including productivity, profitability, and customer satisfaction. Yet, the purported benefits of personality assessments have failed to move the needle on employee engagement, meaning companies have not received the promised productivity and profitability "bumps" from using personality assessments.

There are significant risks associated with the use of personality assessments in hiring algorithms, both to the employer and to the job applicant. As Google’s Laszlo Bock stated in the Time article, “if [an employer] makes a bad assessment based on an algorithm or a test, that has a major impact on a person’s life–a job they don’t get or a promotion they don’t get.”

For the employer, the risks are at least two-fold. First, people who are “different” will be screened out, denying the employer the benefits that come from having a widely diverse group of employees. As Bock states in the article:
“I imagine someone who has Asperger’s or autism, they will test differently on these things. We want people like that at the company because we want people of all kinds, but they’ll get screened out by this kind of thing.”
The second risk for employers is the liability they face under laws like the Americans with Disabilities Act for using personality tests that screen out persons with disabilities, whether Asperger’s, autism, bipolar disorder, or other mental health challenges. The Equal Employment Opportunity Commission (EEOC) currently has two systemic investigations ongoing against employers that used personality tests in their pre-employment screening processes.

The 2014 White House report, “Big Data: Seizing Opportunities, Preserving Values," found that, "while big data can be used for great social good, it can also be used in ways that perpetrate social harms or render outcomes that have inequitable impacts, even when discrimination is not intended." An accompanying fact sheet warns:

As more decisions about our commercial and personal lives are determined by algorithms and automated processes, we must pay careful attention that big data does not systematically disadvantage certain groups, whether inadvertently or intentionally. We must prevent new modes of discrimination that some uses of big data may enable, particularly with regard to longstanding civil rights protections in housing, employment, and credit.

Just as neighborhoods can serve as a proxy for racial or ethnic identity, there are new worries that big data technologies (personality assessments and algorithmic decision-making) could be used to “digitally redline” unwanted groups, whether as customers, employees, tenants, or recipients of credit. That is why we should not be comfortable with turning hiring over to the algorithms.

Sunday, June 7, 2015

The LAST-2 Not the Last One

On Friday, June 5, 2015, a federal judge found that an exam for New York teaching candidates was racially discriminatory because it did not measure skills necessary to do the job. The exam, the second incarnation of the Liberal Arts and Sciences Test, called the LAST-2, was administered from 2004 through 2012 and was designed to test an applicant’s knowledge of liberal arts and science.

Establishing a Prima Facie Case

Under Title VII of the Civil Rights Act of 1964, a plaintiff can make out a prima facie case of discrimination with respect to an employment exam by showing that the exam has a disparate impact on minority candidates. To do so, a party must (1) identify a policy or practice (in this case, the employment exam), (2) demonstrate that a disparity exists, and (3) establish a causal relationship between the two. A party can meet the second and third requirements by relying on the “80% rule." As stated by the EEOC:
A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by Federal enforcement agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be regarded by Federal enforcement agencies as evidence of adverse impact.
In the LAST-2 case, Judge Kimba M. Wood found that the pass rate for African-American and Latino candidates was between 54 percent and 75 percent of the pass rate for white candidates.
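The four-fifths rule quoted above reduces to a simple ratio test. The sketch below applies it to hypothetical pass rates (not the actual LAST-2 figures, which were expressed as ratios rather than raw rates):

```python
# Minimal sketch of the EEOC "four-fifths" (80%) rule: compare each group's
# selection rate to the highest group's rate; a ratio under 0.8 is generally
# regarded as evidence of adverse impact. Pass rates below are hypothetical.

def adverse_impact_ratio(rate_group, rate_highest):
    """Ratio of a group's selection rate to the highest group's rate."""
    return rate_group / rate_highest

def flags_adverse_impact(rate_group, rate_highest, threshold=0.8):
    """True if the ratio falls below the four-fifths threshold."""
    return adverse_impact_ratio(rate_group, rate_highest) < threshold

white_pass = 0.90      # hypothetical pass rate for the highest-scoring group
minority_pass = 0.55   # hypothetical pass rate for a comparison group

ratio = adverse_impact_ratio(minority_pass, white_pass)
print(f"ratio = {ratio:.2f}, adverse impact flagged: "
      f"{flags_adverse_impact(minority_pass, white_pass)}")
```

With these numbers the ratio is about 0.61, squarely inside the 54-75% band the court found, and well below the 0.8 line.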


Rebutting the Prima Facie Case

The defendant can rebut that prima facie showing by demonstrating that the exam is job related. To do so, the defendant must prove that the exam has been validated properly. Validation requires showing, by professionally acceptable methods, that the exam is predictive of or significantly correlated with important elements of work behavior which comprise or are relevant to the job for which candidates are being evaluated.

In determining whether an employment exam has been properly validated and is thus job related for the purposes of Title VII, the following factors must be considered:

  1. the test-makers must have conducted a suitable job analysis;
  2. the test-makers must have used reasonable competence in constructing the test;
  3. the content of the test must be related to the content of the job;
  4. the content of the test must be representative of the content of the job; and
  5. there must be a scoring system that usefully selects those applicants who can better perform the job.

The LAST-2 decision found that the defendant New York City Board of Education (BOE) failed to rebut the prima facie showing of discrimination because it had not demonstrated that the LAST-2 was properly validated. The court found that the work of National Evaluation Systems (NES), the Pearson-owned test developer, did not satisfy the five factors listed above, focusing primarily on the first: the sufficiency of NES’s job analysis.

Wholly Deficient Job Analysis

A job analysis is an assessment of the important work behavior(s) required for successful performance of the job in question and the relative importance of these behaviors. The purpose of a job analysis is to ensure that an exam adequately tests for the knowledge, skills, and abilities (KSAs) that are actually needed to perform the daily tasks of the job. The test developer must be able to explain the relationship between the subject matter being assessed by the exam and the job tasks identified.

To perform a suitable job analysis, a test developer must: (1) identify the tasks involved in performing the job; (2) include a thorough survey of the relative importance of the various skills involved in the job in question; and (3) define the degree of competency required in regard to each skill.

The LAST-2 court found that the core flaw in NES’s job analysis was that it failed to identify any job tasks whatsoever. Without identifying the tasks involved in performing the job (required by the first factor discussed above), it was not possible for NES to determine the relative importance of each job task (second factor), or to define the degree of competency required for each skill needed to accomplish those job tasks (third factor). Accordingly, the court found NES’s job analysis to be wholly deficient.

An Inherently Flawed Approach

Instead of beginning with ascertaining the job tasks of New York teachers, the LAST-2 examination began with the premise that all New York teachers should be required to demonstrate an understanding of the liberal arts.

NES began developing the LAST-2 by consulting documents describing liberal arts and general education undergraduate and graduate course requirements, syllabi, and course outlines. NES then defined the KSAs it believed a liberal arts exam should assess, based on the way the liberal arts were characterized in those documents. Thus, NES did not investigate the job tasks that a teacher must perform to do her job satisfactorily, but instead used liberal arts curricular documents to construct the entirety of the LAST-2.

In other words, NES started with the unproved assumption that specific facets of liberal arts and science knowledge were critically important to the role of teaching, and then attempted to determine how to test for that specific knowledge. This is an inherently flawed approach because at no point did NES ascertain, through an open-ended investigation into the job tasks a successful teacher performs, whether its conception of the liberal arts and sciences was important to even some New York public school teachers, let alone to all of them.

Survey Says ... Unpersuasive

NES argued that it had surveyed several hundred teachers about the importance of the KSAs that NES identified, and those teachers affirmed their importance, but the court found the argument unpersuasive. 

The problem with NES’s approach is that it assumed, without investigation or proof, that specific KSAs are important to a teacher’s effectiveness at her job—namely, an understanding of some pre-determined subset of the liberal arts and sciences—and then asked teachers to rank only those KSAs in importance. The fact that survey respondents stated that certain surveyed KSAs were important to teaching says nothing about the relative importance of the surveyed KSAs compared to any KSA not included in NES’s survey.

The court found that NES cannot determine the KSAs most important to teaching by surveying only those KSAs NES already believed were important. NES should have determined which KSAs to survey based on an investigation of the job tasks performed by successful teachers. Only KSAs which NES has directly linked to those identified job tasks should be included in a survey attempting to determine “relative importance.”

As an example, the court wrote:
Assume that the KSA of reading comprehension has an importance value of 9, the KSA of logical reasoning has an importance value of 4, and the KSA of leadership has an importance value of 20. Assume that NES’s survey would have queried the value of both reading comprehension and logical reasoning, but not of leadership. Ranked relative to each other, reading comprehension would be very important, while logical reasoning might be somewhat important. But in this example, neither is nearly as important as leadership. In this way, NES’s survey would have greatly exaggerated the importance of both reading comprehension and logical reasoning.
Although the survey might be an appropriate way of confirming information gathered through a proper job task investigation, or as a way of determining the relative importance of already-ascertained job tasks, it is not an appropriate way of initially identifying KSAs.
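The court's arithmetic can be worked directly. Using the court's own hypothetical importance values, the sketch below shows how omitting leadership from the survey inflates the apparent share of the KSAs that were surveyed:

```python
# Sketch of the court's hypothetical: ranking only a pre-selected subset of
# KSAs exaggerates their apparent importance. Importance values are the
# court's example numbers (reading comprehension 9, logical reasoning 4,
# leadership 20).

all_ksas = {"reading comprehension": 9, "logical reasoning": 4, "leadership": 20}
surveyed = {k: v for k, v in all_ksas.items() if k != "leadership"}

def relative_importance(ksas):
    """Each KSA's share of the total importance within the set surveyed."""
    total = sum(ksas.values())
    return {k: round(v / total, 2) for k, v in ksas.items()}

print(relative_importance(surveyed))   # reading comprehension dominates...
print(relative_importance(all_ksas))   # ...until leadership enters the pool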

What To Do Now?

Judge Wood stated that NES should begin by first identifying the necessary job tasks for a New York public school teacher. Necessary job tasks could be identified through some combination of (1) teacher interviews, (2) observations of teachers across the state performing their day-to-day duties, and (3) the survey responses of educators who have been given open-ended surveys requiring them to describe the job tasks they perform and to rank the importance of those tasks.

Job tasks must be ascertained from the source—in this case, from public school teachers. Using the data culled from such an investigation, NES could then analyze these job tasks, and from that analysis determine what KSAs a teacher must possess to adequately perform the tasks identified. NES should document precisely how those KSAs are necessary to the performance of the identified job tasks. It is those KSAs that should provide the foundation for the development of the test framework.

The importance of identifying these job tasks is amplified here because every teacher in New York must be licensed, whether she teaches kindergarten or advanced chemistry. NES therefore needs to determine exactly what job tasks are performed, and accordingly what KSAs are required, to teach kindergarten through twelfth grade proficiently. This is likely a daunting task given how different the daily experience of a kindergarten teacher is from that of an advanced chemistry teacher.

Last, NES needs to ensure that the exam tests for abilities not already covered by related exams. In the LAST-2 case, applicants were also required to pass the Assessment of Teaching Skills – Written and a Content Specialty Test applicable to the teacher’s subject area before they could become licensed.