Employment Testing: Failing to Make the Grade: July 2014

Thursday, July 31, 2014

TeacherInsight Assessments: Fooled by Randomness?

According to Gallup, the current TeacherInsight (TI) assessment is based on a study of teachers (1,000 and 13,000 candidates) whose students have grown academically, regardless of the beginning level of the student. Unlike state-required certification exams, TI measures values and behavior -- not subject knowledge.

The TI assessment is based on a "profile" of the 1,000 teachers studied and applicants are measured (graded) based on correlation between their responses and the "profile" response. As noted in an article in the Dallas Morning News:

Gallup would not release its test but provided one question without answering it: "When students say they want their teachers to be fair, what do they mean?" Applicants choose from among four answers. It may seem like a subjective question, but according to Gallup, the best teachers all answer the same way. "There's quite a bit of consistency in their behavior," Gary Gordon, vice president of Gallup's Education Division, said of the best teachers. "They don't distinguish between students as much."
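To make the profile-matching idea concrete, here is a minimal Python sketch of grading an applicant by the correlation between their answers and a reference profile. Gallup has not published its scoring method, so everything here is an assumption for illustration only: the 1-to-5 numeric coding of answers, the function names pearson and profile_score, and the sample answers are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def profile_score(applicant, profile):
    """Hypothetical grade: how closely an applicant's answers track the profile."""
    return pearson(applicant, profile)


# Hypothetical answers coded 1 (Strongly Disagree) to 5 (Strongly Agree).
profile = [5, 4, 1, 2, 5, 3]    # assumed "best teacher" profile
similar = [4, 4, 2, 2, 5, 3]    # applicant who mostly agrees with the profile
different = [1, 2, 5, 4, 1, 3]  # applicant who answers the opposite way

print(f"similar applicant:   {profile_score(similar, profile):+.2f}")
print(f"different applicant: {profile_score(different, profile):+.2f}")
```

Under a scheme like this, the grade measures only how much an applicant resembles the profile. Whether resembling the profile makes someone a better teacher is a separate question that the score itself cannot answer.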

As discussed in The (Non)Predictive Ability of the Gallup TeacherInsight Assessment, there is little evidence linking teachers' TI scores to student achievement or teacher effectiveness.

Correlation vs Causation

TI assessment scoring is predicated on correlations among various data elements. Correlations let us analyze a phenomenon (teacher effectiveness) not by shedding light on its inner workings but by identifying a (hopefully) useful proxy for it. Of course, even strong correlations are never perfect. It is quite possible that two things may behave similarly just by coincidence. We may simply be “fooled by randomness” to borrow a phrase from the empiricist Nassim Nicholas Taleb. With correlations, there is no certainty, only probability.

Decisions made or affected by correlation are inherently flawed. Correlation does not equal causation. This point is made vividly by Tyler Vigen, a law student at Harvard who put together a website that finds very, very high correlations - as shown below - between things that are absolutely not related.

Each of these has a correlation coefficient in excess of 0.99, demonstrating that a strong correlation isn't nearly enough to support strong conclusions about how two phenomena are related to each other.
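The effect is easy to reproduce. The sketch below is not Vigen's method; it is a hypothetical illustration that generates one short random series, compares it against many other random series that are unrelated by construction, and reports the strongest correlation found by chance. The series length, the number of candidates, and the seed are arbitrary assumptions.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """Cumulative sum of independent Gaussian steps: noise that looks like a trend."""
    position, walk = 0.0, []
    for _ in range(length):
        position += rng.gauss(0.0, 1.0)
        walk.append(position)
    return walk


def strongest_chance_correlation(n_series=2000, length=10, seed=0):
    """Largest |r| between one random series and n_series unrelated ones."""
    rng = random.Random(seed)
    target = random_walk(rng, length)
    candidates = [random_walk(rng, length) for _ in range(n_series)]
    return max(abs(pearson(target, candidate)) for candidate in candidates)


best = strongest_chance_correlation()
print(f"strongest |r| among 2000 unrelated series: {best:.3f}")
```

The more candidate series are searched, the stronger the best chance match becomes, which is why a high correlation found by searching is weak evidence of any real relationship.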

While many companies foster an illusion that scoring/classification is an area of absolute algorithmic rule—that decisions are neutral, organic, and even automatically rendered without human intervention—reality is a far messier mix of technical and human curating. Both the datasets and the algorithms used to analyze the data reflect choices, among others, about connections, inferences, and interpretation.

Theories about teacher effectiveness shape both the methods used in the TI assessment and the results of that assessment. It begins with how the data was selected: what is chosen influences what is found. Similarly, when Gallup analyzes the data, it chooses tools that rest on theories. And as it interprets the results, it again applies theories.

Seizing Opportunities, Preserving Values

The recent White House report, “Big Data: Seizing Opportunities, Preserving Values,” found that, "while big data can be used for great social good, it can also be used in ways that perpetrate social harms or render outcomes that have inequitable impacts, even when discrimination is not intended." The fact sheet accompanying the White House report warns:

As more decisions about our commercial and personal lives are determined by algorithms and automated processes, we must pay careful attention that big data does not systematically disadvantage certain groups, whether inadvertently or intentionally. We must prevent new modes of discrimination that some uses of big data may enable, particularly with regard to longstanding civil rights protections in housing, employment, and credit.

Some of the most profound challenges revealed by the White House Report concern how data analytics may lead to disparate inequitable treatment, particularly of disadvantaged groups, or create such an opaque decision-making environment that individual autonomy is lost in an impenetrable set of algorithms.

Workforce assessment systems like Gallup's TeacherInsight, designed in part to mitigate risks for employers, have become sources of material risk, both to job applicants and employers. The systems create the perception of stability through probabilistic reasoning and the experience of accuracy, reliability, and comprehensiveness through automation and presentation. But in so doing, technology systems draw attention away from uncertainty and partiality.

As more and more school districts seek to broaden their teaching staff to include more ethnically, linguistically, and culturally diverse teachers, it is imperative to make the selection and hiring practices of teachers more transparent.

Systemic Risks for School Systems Employing TeacherInsight

School districts that are using, or considering the use of, the TI should require the Gallup Organization to show them the research in support of the instrument. The lack of independent research on the TI and Gallup’s unwillingness to publish their own research does not help support the credibility of the TI. If Gallup were to publish its own studies, independent researchers could attempt to replicate them and either confirm or contradict Gallup's findings regarding the validity of the TI. Most large school districts have a research department capable of conducting such research.

As discussed under the headings "Ignoring EEOC Guidance" and "Disregarding Industry Standards" in Risks to Kroger Shareholders, a key element in guidance from both the EEOC and EEAC (an employer association) is the responsibility of the employer (e.g., local school district) to independently review the hiring assessment(s) it uses and to avoid reliance on the representations of the assessment provider (e.g., Gallup).

Accordingly, as Novotny recommends, school districts that are using, or considering the use of, the TI assessment should require Gallup to show them the research in support of the instrument. The Gallup-provided research should include information on the ability of the TI assessment to predict teacher performance, validity studies undertaken by or on behalf of Gallup, and evidence that the assessment does not discriminate against individuals who are members of classes protected by laws like Title VII and the ADA.

What's Missing?

According to Gallup, a TI interview development study, originally completed in January 2002, demonstrated content, construct, and criterion-related validity as well as fairness across classifications of race, gender, and age. A number of other "protected classes" are missing from Gallup's list, including disability (physical, developmental, and mental), national origin, and religion.

As an example, Dr. Melanie Schneider has identified three primary concerns relevant to nonnative and bilingual speakers of English (raising issues of discrimination on the basis of national origin):

the timed nature of the TI assessment,
possible inequalities associated with limited access to computers or the Internet, and
little perceived consideration for cultural and linguistic differences.

Dr. Schneider writes:

As mentioned earlier, parts of the TI contain timed questions. Unlike other standardized tests, there is no accommodation for applicants for whom English is a second or additional language or others who may have a disability that prevents a rapid-fire response. Waiting too long to respond to a question results in a missing response, which counts against an applicant’s total score. Although the timed nature of some types of questions affects all applicants who take the TI, nonnative speakers of English are especially penalized when speed of response is required.

Linked to the timed format of the TI, which may disadvantage some nonnative speakers of English, are possible inequalities due to the digital divide between low-income and middle-income students. ... English language learners in the United States are more likely to come from lower income families than their native-English-speaking peers. Inequalities between low-income and middle-income children in the use of and access to computers and the Internet have been well documented. ... Even among undergraduate students, values, attitudes, and beliefs about access to and facility with certain technologies may disadvantage certain groups of students, such as those from low-income immigrant families.

Finally, a third concern for bilingual speakers of English is the belief that a singular view of talents characterizes successful teachers. Assuming that there is a single, preferred set of values, attitudes, beliefs, and behaviors associated with teacher success in the classroom ignores the role of culture in teaching and assessment.

In Albemarle Paper Company v. Moody, the Supreme Court addressed a case in which an employer implemented a test (Wonderlic) on the theory that a certain verbal intelligence was called for by the increasing sophistication of the plant's operations. The company made no attempt to validate the test for job-relatedness, and simply adopted the national "norm" score as a cut-off point for new job applicants.

The Supreme Court cited the Standards of the American Psychological Association and pointed out that a test should be validated on people as similar as possible to those to whom it will be administered. The Court further stated that differential studies should be conducted on minority groups wherever feasible.

Ongoing Investigations of Assessments for ADA Compliance by EEOC

For more than six years, the EEOC has been investigating Kroger and Kronos, Kroger's assessment provider. The investigation focuses on whether the Kronos assessment illegally screens out persons with mental illness. Please see Kroger and Kronos: Chaos and Disorder.

The case arose from a charge filed by Vicky Sands with the EEOC in 2007. The EEOC has converted Ms. Sands' investigation into a systemic investigation and has since started at least two additional systemic investigations involving companies using the Kronos assessment.

The EEOC investigations involve claims that the Kronos assessment is an illegal pre-employment medical examination, that the data collected by the assessments is confidential medical information, and that the assessment screens out or tends to screen out persons with mental illness.

Systemic Risks

For many school districts, employment assessments like TI offer a standardized experience for all applicants. This "one size fits all" approach helps reduce a school district's costs and may reduce the impact of overtly biased or discriminatory behavior. But if an assessment includes one or more potentially "defective components," a finding of bias or discrimination against the assessment as used in one school district puts every school district that uses it at risk. Please see When the First Domino Falls: Consequences to Employers of Embracing Workforce Assessment Solutions.

These "defective components" in assessments may be either design defects (i.e., the adoption and use of certain personality models) or manufacturing defects (i.e., coding errors in the assessment software). The latter is analogous to the coding error at 23andMe that resulted in notices going out to some customers informing them that they had a chronic and life-shortening condition when they did not. Please see On Not Dying Young: Fatal Illness or Flawed Algorithm?

Each day a school district continues to use the TI assessment, there are more potential plaintiffs with claims against that school district. Labor and employment laws like Title VII and the ADA permit a school district to use a third party like Gallup to undertake the assessment of job applicants. The use of a third party, however, does not insulate a school district from claims arising from its use of the assessment. Under those laws, a school district is responsible (and liable) for any failure on the part of an assessment or assessment provider to comply with the provisions of those laws.

The (Non)Predictive Ability of the Gallup TeacherInsight Assessment

Gallup states that the TeacherInsight (TI) assessments have "been thoroughly researched and tested to be sure they identify potentially superior teachers." While it is to be expected that the company marketing the TI assessment would make such a statement, is there any independent support for the predictive ability of the TI assessment?

Establishing the predictive validity of an assessment usually requires that applicants who "pass" the assessment perform satisfactorily in practice, and those who do not pass do not perform satisfactorily in practice. The challenge with assessing the predictive validity of the TI assessment, however, is that teacher applicants who do not meet the cutoff score may not be hired. Consequently, support for the predictive validity of the TI assessment is determined by carefully documenting the evidence from teachers who “pass,” through correlational and regression analyses that compare performance on the target measure, TI scores, with one or more other measures.

Doctoral dissertations by Robert Jacob Koerner and Michael T. Novotny are two of the few independent, published research studies to examine the predictive validity of the TeacherInsight (TI) assessment, that is, whether teachers who score higher on it turn out to be more successful teachers.

Novotny Study

The Novotny study involved 527 teachers hired into a North Texas school district for the 2006-2007 school year. The study analyzed the relationships between the TI assessment scores and the eight Professional Development Appraisal System (PDAS) domain scores for those teachers. PDAS is the instrument used by the State of Texas for appraising its teachers and identifying areas that would benefit from staff development.

The Novotny study concluded that:

The TeacherInsight scores produced a statistically significant correlation with only one of the eight PDAS domain scores. However, even that correlation (r = 0.14) was weak. ... The findings do not support the ability of the TeacherInsight to identify more effective teachers, based on Professional Development Appraisal System scores.
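A short calculation shows how a correlation can be statistically significant and still weak. The sketch below squares r to get the share of variance explained and computes the usual t statistic for testing whether a Pearson correlation differs from zero. The sample size of 527 is an assumption: it is the number of teachers in the study, and the number actually contributing to this particular correlation may differ.

```python
from math import sqrt


def variance_explained(r):
    """Share of variance in one measure accounted for by the other (r squared)."""
    return r * r


def t_statistic(r, n):
    """t statistic for testing whether a Pearson r from n pairs differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


r = 0.14  # correlation reported in the Novotny study
n = 527   # assumption: every teacher in the study contributed to this correlation

print(f"variance explained: {variance_explained(r):.1%}")
print(f"t statistic:        {t_statistic(r, n):.2f}")
```

Under that assumption, the t statistic is well above the roughly 1.96 needed for significance at the 5% level in a large sample, yet the TI score accounts for only about 2% of the variance in that PDAS domain score. With several hundred teachers, even a very small relationship clears the significance bar.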

Koerner Study

The Koerner study examined the relationship between the TI assessment and student achievement as measured by the Texas Growth Instrument (TGI), which is an estimate of a student’s academic growth based on the Texas Assessment of Knowledge and Skills (TAKS) scores for two consecutive years. More specifically, the study focused on the predictive validity of the TI assessment, that is, how well teacher TI scores predict student achievement gains, as measured by the TGI in reading, English language arts, and mathematics at the primary and secondary school levels. Participants in the study were 132 teachers from one Texas school district who taught reading, English language arts, or mathematics in Grades 3–11 during the 2005–2006 school year and had taken the online TI assessment.

According to Koerner:

The findings [of his study] provide little support to the validity of TeacherInsight in terms of its ability to predict student achievement scores and its usefulness as a tool for the selection of teachers by school systems.

Wasting Resources, Eliminating Good Teachers

Both the Koerner and Novotny studies found statistically significant, but very weak, relationships with a small number of variables related to teacher success. Those very low correlations suggest a weak link at best between the TI assessment and student achievement and teacher effectiveness.

As Novotny states in the introduction to his study:

It is critical that schools and districts identify highly effective, highly qualified teachers to raise student achievement. School districts have limited resources such as time, money, and manpower to achieve this task. If standardized interview tools such as the TI are effective at identifying better teachers, the time and money spent on them are worthwhile. However, if these tools are not effective, then the time and money spent could be better utilized elsewhere. Furthermore, if the TI does not effectively identify better teachers it could be preventing good candidates from being hired or from being accepted into alternative certification programs.

There are concerns that the TI assessment may waste resources and screen out good candidates. In addition, as set out in Systemic Risks for School Systems Employing TeacherInsight, the use of the TI assessment may violate the non-discrimination provisions of labor and employment laws like Title VII and the ADA that are enforced by the U.S. Equal Employment Opportunity Commission.

As Dr. Melanie Schneider writes in TeacherInsight and the Selection and Hiring of Bilingual Speakers of English:

Should teachers, teacher educators, and professional teacher organizations passively accept the increasingly pivotal role of online assessment tools in the hiring process, or should they press for more research and a balanced selection process that includes multiple sources of information, including human interaction?

What Is the Gallup TeacherInsight Assessment?

According to Gallup Inc., its TeacherInsight (TI) assessment is an automated online interview used by many school districts to help them identify the best potential teachers.

School districts across the country use the TI assessment as part of their teacher application and selection process. Some school districts use a cut score and do not consider applicants that fail to achieve or exceed that minimum score. Other districts do not use a cut score and instead use the TI score as one source of information to be considered in the selection process. Regardless of which method a district utilizes, the TI score inevitably affects applicants’ chances of obtaining a position.

Although Gallup recommends that school districts use the assessment as one piece of information when making their selection decisions, some school districts use the assessment as an initial screening mechanism to determine which applicants to interview. In larger school districts, the TI assessment is administered to tens of thousands of applicants each year.

Development of TI Assessment

The TI assessment was developed as a Web-based version of the Teacher Perceiver Instrument (TPI). The TPI was originally created in the 1970s by Selection Research, which acquired the Gallup Organization in 1988 and took on Gallup’s name. The TPI was designed to identify effective teachers in face-to-face interviews based on their responses to interview questions. To develop the interview questions, Gallup states that it drew on the responses of experienced, master teachers across the country who were interviewed about effective teaching practices and behaviors.

In 2002, Gallup transitioned to the web-based TI assessment and in early February 2011, Gallup introduced its second generation of the TI assessment. The TI assessment has three types of questions:

First are multiple choice questions where one has 50 seconds to choose the response that BEST describes the applicant from four possible responses.
Second are forced-choice questions where one has 50 seconds to choose the response that BEST describes the applicant from two possible statements.
Third are Likert questions where one has 20 seconds to read a statement and rate the applicant's level of agreement with the statement. The applicant selects from five possible responses: "Strongly Disagree," "Disagree," "Neutral," "Agree," and "Strongly Agree."
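The three question formats can be summarized as a small data structure. This is a hypothetical Python sketch, not Gallup's implementation: the names QuestionFormat and record_response are invented for illustration, while the time limits and number of choices come from the description above. Treating a late answer as a missing response follows Dr. Schneider's account of the timed questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionFormat:
    name: str
    time_limit_seconds: int
    num_choices: int


# Time limits and choice counts as described for the TI assessment.
QUESTION_FORMATS = (
    QuestionFormat("multiple choice", 50, 4),
    QuestionFormat("forced choice", 50, 2),
    QuestionFormat("Likert", 20, 5),
)


def record_response(fmt, choice_index, seconds_taken):
    """Return the chosen option index, or None (a missing response) if time ran out."""
    if seconds_taken > fmt.time_limit_seconds:
        return None
    if not 0 <= choice_index < fmt.num_choices:
        raise ValueError("choice index out of range for this question format")
    return choice_index


likert = QUESTION_FORMATS[2]
print(record_response(likert, 3, seconds_taken=12))  # answered within the limit
print(record_response(likert, 3, seconds_taken=25))  # too slow: missing response
```

The sketch makes one consequence of the design visible: the same answer is recorded or discarded depending only on how quickly it was given.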

Feedback, Fairness and Credibility

Teacher applicants do not receive their scores or any other feedback from Gallup about whether they passed. A passing score is based on cutoff score guidelines recommended by Gallup and set by school districts. In most school districts, applicants who do not make the cutoff score are allowed to take the TI only once in a 12-month period.

A Gallup response to one of its FAQs on the TI assessment states:

TeacherInsight is fair because all applicants are asked exactly the same questions and they are evaluated exactly the same way. The questions have been thoroughly researched and tested to be sure they identify potentially superior teachers.

The TeacherInsight interview development study, originally completed in January 2002, demonstrated content, construct, and criterion-related validity as well as fairness across Equal Employment Opportunity Commission (EEOC) classifications of race, gender, and age. Subsequent analysis of candidate scores indicates similar results and interview fairness across groups.

The (Non)Predictive Ability of the Gallup TeacherInsight Assessment notes that Gallup's research on the effectiveness of its TI assessment is not publicly available. As stated in the Recommendations section in one of the few independent studies analyzing the TI assessment: "The lack of independent research on the TI and Gallup’s unwillingness to publish their own research does not help support the credibility of the TI."