Friday, June 6, 2014

Primer on Big Data and Hiring: Chapter 4

This is the fourth chapter of a primer on big data and hiring. The structure of the primer is based on the following graphic created by Evolv, a company that provides "workforce optimization" services. Evolv was selected not because it is sui generis but because it is emblematic of numerous companies, from start-ups to well-established firms, that market "workforce science" services to employers.

The Evolv graphic below is intended to illustrate the process of workforce science.



Chapter 4: Analyze and Predict
Data Analyzed Using Machine Learning and Predictive Algorithms

The theory of big data is to have no theory, at least about human nature. One just gathers huge amounts of information, observes the patterns and estimates probabilities about how people will act in the future. One does not address causality.
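To make this concrete, here is a minimal sketch in Python of the pattern-first approach. The data, feature names, and model choice are all hypothetical assumptions for illustration, not any vendor's actual method: fit a model to historical outcomes, then score a new applicant. Nothing in it encodes a theory of why any feature should matter.

```python
# A minimal sketch of theory-free prediction: learn correlations from
# historical records and score new applicants by those patterns alone.
# All data, features, and the model choice are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical history: three applicant features (say, tenure at last
# job, assessment score, commute distance) and a binary "succeeded" label.
X_history = rng.normal(size=(1000, 3))
y_history = (X_history[:, 0] + 0.5 * X_history[:, 1]
             + rng.normal(size=1000) > 0).astype(int)

# The model captures correlation only; it has no notion of causation.
model = LogisticRegression().fit(X_history, y_history)

applicant = rng.normal(size=(1, 3))
print(f"P(success) = {model.predict_proba(applicant)[0, 1]:.2f}")
```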

In linear systems, cause and effect are much easier to pinpoint. The world around us, however, is better described as a complex system, in which multiple variables often push an outcome to occur. Nigel Goldenfeld, a professor of physics at the University of Illinois, sums it up best: “For every event that occurs, there are a multitude of possible causes, and the extent to which each contributes to the event is not clear.”

Algorithms and big data are powerful tools. Wisely used, they can help match the right people with the right jobs. But they must be designed and used carefully by humans, or they can go very wrong. As David Brooks wrote in the New York Times:
Data creates bigger haystacks. This is a point Nassim Taleb, the author of “Antifragile,” has made. As we acquire more data, we have the ability to find many, many more statistically significant correlations. Most of these correlations are spurious and deceive us when we’re trying to understand a situation. Falsity grows exponentially the more data we collect. 
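Taleb's haystack point can be checked with a few lines of simulation: generate a purely random outcome and thousands of purely random "predictors," and a predictable share of them will clear the conventional p < 0.05 significance bar by chance alone. A sketch with illustrative numbers, not real data:

```python
# Spurious correlations in pure noise: at a p < 0.05 threshold, about
# 5% of unrelated variables will look "significant" by chance alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_samples, n_features = 500, 2000

outcome = rng.normal(size=n_samples)                 # random "performance"
features = rng.normal(size=(n_samples, n_features))  # random predictors

spurious = sum(
    stats.pearsonr(features[:, j], outcome)[1] < 0.05
    for j in range(n_features)
)
print(f"{spurious} of {n_features} noise variables look significant")
# Expect roughly 100 -- none of them mean anything.
```

In a hiring dataset, each of those chance hits would look like a newly discovered predictor of job performance.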
There’s a saying in artificial intelligence circles that techniques like machine learning can very quickly get you 80% of the way to solving just about any real-world problem, but going beyond 80% is extremely hard, maybe even impossible. The Netflix Prize is a case in point: hundreds of the best researchers in the world worked on the problem for nearly three years, and the winning team achieved only a 10% improvement over Netflix’s in-house algorithm.

A corollary of the above saying is that it is very rare for a startup to gain a competitive advantage from its machine learning algorithms. If a worldwide concerted effort could improve Netflix’s algorithm by only 10%, how likely is it that four people in a startup’s R&D department will produce a significant breakthrough? Modern machine learning algorithms are the product of thousands of academics and billions of dollars of R&D, and individual companies generally improve on them only at the margins.

Some of the best and brightest organizations have recognized that improvements in machine learning, when they come at all, tend to come from outside the organization. Facebook, Ford, GE and other companies have run data-science contests on Kaggle, while NASA and other government agencies, as well as the Harvard Business School, have taken the crowdsourcing route on Topcoder.

Abhishek Shivkumar of IBM Watson Labs has listed the top ten open problems for machine learning in 2013, including churn prediction, truth and veracity, scalability, and intelligent learning. This doesn’t mean machine learning is never useful; it means one needs to apply it in contexts that are fault tolerant: online ad targeting, ranking search results, recommendations, and spam filtering, for example. Applying machine learning to persons’ livelihoods (and, potentially, lives) is problematic, not just for the individual applicant or employee but also for the employer and its potential liability exposure.
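The contrast is easy to see in code. A spam filter can be built in a dozen lines, its errors cost one misfiled message, and users correct it constantly; none of that is true when the label being predicted is a person's employability. A minimal naive Bayes filter, sketched with invented training messages:

```python
# A fault-tolerant ML application: a tiny naive Bayes spam filter.
# A wrong prediction here misfiles one email; it does not cost a job.
# Training messages are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_messages = [
    "win a free prize now", "cheap meds online",       # spam
    "meeting moved to 3pm", "draft report attached",   # ham
]
train_labels = ["spam", "spam", "ham", "ham"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_messages, train_labels)

print(classifier.predict(["free prize meds"]))         # likely ['spam']
print(classifier.predict(["report for the meeting"]))  # likely ['ham']
```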