Friday, June 6, 2014

Primer on Big Data and Hiring: Chapter 3

This is the third chapter of a primer on big data and hiring. The structure of the primer is based on a graphic created by Evolv, a company that provides "workforce optimization" services. Evolv was selected not because it is sui generis; rather, it is emblematic of the many companies, from start-ups to well-established firms, that market "workforce science" services to employers.

The Evolv graphic below is intended to illustrate the process of workforce science.

Chapter 3: Cleanse and Upload
Structured and Unstructured Data Is Aggregated

Data isn't something that's abstract and value-neutral. Data only exists when it's collected, and collecting data is a human activity. And in turn, the act of collecting and analyzing data changes (one could even say "interprets") us. 

Workforce science requires enormous amounts of historic or legacy data. This data has to be consolidated from a number of disparate source systems within each company, each with their specific data environment and particular brand of business logic. That data consolidation then must be replicated across hundreds or thousands of companies.

Structured data refers to data that is identifiable because it is organized in a defined structure. The most common form of structured data is a database in which specific information is stored in columns and rows (e.g., an Excel spreadsheet or a SQL table). Structured data is readily processed by computers and is also efficiently organized for human readers.
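As a minimal illustration of why structure matters (the table, column names, and values here are hypothetical, not drawn from any real system), a program can ask precise questions of structured data because the schema is defined up front:

```python
import sqlite3

# A toy "structured" store: every record fits a predefined schema
# of named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, job_title TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Analyst", 72000.0), ("Grace", "Engineer", 95000.0)],
)

# Because the structure is known in advance, a precise query is trivial.
rows = conn.execute("SELECT name FROM employees WHERE salary > 80000").fetchall()
print(rows)  # [('Grace',)]
```

The same question asked of a pile of emails or scanned documents would require the far more elaborate pipeline described below.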

Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. The resulting irregularities and ambiguities make it more difficult for traditional computer programs to process than structured data.

Unstructured data consists of two basic categories: textual objects (based on written or printed language, such as emails or Word documents) and bitmap objects (non-language-based, such as image, video, or audio files).

Making use of unstructured data requires combining many techniques into a complex data-processing flow. These techniques include:
  • information extraction (to produce structured records from text or semi-structured data)
  • cleansing and normalization (so that string values of the same type, such as a dollar amount or a job title, can even be compared)
  • entity resolution (to link records that correspond to the same real-world entity or that are related via some other type of semantic relationship)
  • mapping (to bring the extracted and linked records to a uniform schematic representation)
  • data fusion (to merge all the related facts into one integrated, clean object)
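To make the flow concrete, here is a minimal sketch of three of the steps above: cleansing/normalization, entity resolution, and data fusion. All the field names, job titles, and matching rules are hypothetical, and the entity-resolution rule is deliberately naive; production systems use probabilistic or machine-learned matching.

```python
# Records from two hypothetical source systems, formatted inconsistently
# even though both describe the same real-world employee.
hr_record = {"name": "SMITH, JANE", "title": "Sr. Software Eng."}
payroll_record = {"name": "Jane Smith", "title": "SENIOR SOFTWARE ENGINEER",
                  "salary": 95000}

def normalize_name(name):
    # Cleansing/normalization: "SMITH, JANE" and "Jane Smith" -> "jane smith"
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        name = f"{first} {last}"
    return " ".join(name.lower().split())

TITLE_SYNONYMS = {  # a toy normalization table
    "sr. software eng.": "senior software engineer",
    "senior software engineer": "senior software engineer",
}

def normalize(record):
    out = dict(record)
    out["name"] = normalize_name(record["name"])
    key = record["title"].lower().strip()
    out["title"] = TITLE_SYNONYMS.get(key, key)
    return out

def same_entity(a, b):
    # Entity resolution: link records that refer to the same person.
    # Matching on normalized name alone is far too crude for real use.
    return a["name"] == b["name"]

a, b = normalize(hr_record), normalize(payroll_record)
fused = None
if same_entity(a, b):
    # Data fusion: merge the related facts into one integrated object.
    fused = {**a, **b}
print(fused)
```

Even this toy version shows where judgment enters: someone decided which titles are "the same," and someone decided what counts as a match. Those decisions are exactly the embedded assumptions the next paragraphs describe.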
Assumptions are embedded in a data model upon its creation. Data sources are shaped through ‘washing’, integration, and algorithmic calculation until they are commensurable enough for a data set to be created. By the time the data are ready to be used, they are already ‘at several degrees of remove from the world.’

Data is never raw; it’s always structured according to somebody’s predispositions and values. The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation.
