Sunday, February 9, 2014

The difference between statistics, machine learning, data mining and data science.

Originally there was just statistics: a method for summarizing huge populations with two numbers, the mean and the variance. With a bit of exaggeration, the whole of classical statistics operates on just these two numbers. From them you can compute significance, confidence intervals, correlation, regression and much more. Back then this was an amazing success: you could work with millions of records on a single piece of paper.
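To make the "single piece of paper" point concrete, here is a minimal sketch of a 95% confidence interval for a population mean computed from nothing but the record count, the mean and the variance. The function name and the numbers are made up for illustration, and the interval assumes the usual normal approximation.

```python
import math

def mean_confidence_interval(n, mean, variance, z=1.96):
    """Approximate 95% confidence interval for the population mean.

    Assumes the normal approximation (z = 1.96) and that `variance`
    is the sample variance of the n records.
    """
    standard_error = math.sqrt(variance / n)
    return mean - z * standard_error, mean + z * standard_error

# Illustrative (invented) summary of a million records.
low, high = mean_confidence_interval(n=1_000_000, mean=170.0, variance=100.0)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

Note that the million raw records never appear: the three summary numbers are enough.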

But with the dawn of computers, people became less limited in the amount of computation that was deemed practical, and they started to think big. What if we worked with the whole distribution of the population? Or if we ran those old, trivial statistical tests on this huge pile of data, wouldn't we find something? These two questions stood at the beginning of two fields: machine learning, which explores the former question, and data mining, which explores the latter.

Statisticians with access to computers started to pull nasty tricks like bootstrapping to narrow confidence intervals. Traditional statisticians felt threatened and started to accuse the computer statisticians of cheating. Later on, the computer statisticians persuaded the traditional ones of the validity of the approach, and bootstrapping was accepted as a useful tool. But disagreements like this led to the divergence of machine learning from statistics.
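For readers who have not met bootstrapping, the following is a rough sketch of the percentile bootstrap for the mean: resample the data with replacement many times and read the interval off the resampled means. The function name, the sample values and the number of resamples are illustrative assumptions, not something taken from a particular paper or library.

```python
import random
import statistics

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (a sketch)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    # Mean of each resample drawn with replacement from the original data.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Invented measurements, for illustration only.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
lower, upper = bootstrap_mean_ci(sample)
print(f"bootstrap 95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```

The only thing a computer adds here is brute force: two thousand resamples would be unthinkable on paper.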

Similarly, the guys and gals from data mining were targets of many attacks, because data mining allowed scientific articles to be produced at an amazing pace: what would have taken the whole life of a respected scientist could be done in less than 5 minutes by a stupid computer. Unfortunately, in this case the contempt was deserved, because many data mining results were false positives. Later on, data miners learned to use the Bonferroni correction and to validate their results to decrease the rate of false positives, but the damage was done. Both statisticians and machine learners started to look down on data miners as kids who had learned a few tricks and applied them without any deeper understanding.
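The Bonferroni correction mentioned above is simple enough to sketch in a few lines: when you run m tests, each p-value has to beat alpha / m instead of alpha. The function name and the p-values below are invented for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag the tests that stay significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter per-test threshold
    return [p < threshold for p in p_values]

# Invented p-values from five tests run on the same pile of data.
p_values = [0.001, 0.02, 0.04, 0.30, 0.008]
flags = bonferroni_significant(p_values)
print(flags)
```

Without the correction, four of these five results would pass the usual 0.05 threshold; with it, only the two with p-values below 0.01 survive.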

With the rise of the Internet, access to data became simpler, and the biggest time burden shifted from data collection to data processing. The methods invented by machine learners were hopelessly slow on data from the Internet and phones, and even the methods employed by data miners were too slow to run on whole datasets. This paradigm shift led to a return to the roots of statistics, where people first formulate a hypothesis and then study the data to confirm or invalidate it. But because the focus shifted from the correctness of methods (which had been proven many times by then) to the efficient computation of trivial algorithms, a new group of statisticians with a computer science background emerged. Nowadays we call this group data scientists.

PS: A quick guide to telling the fields apart by their keywords:
  1. State space -> artificial intelligence
  2. Significance -> statistics
  3. Maximum a posteriori (MAP) -> machine learning
  4. Algorithmic efficiency (O-notation) -> theoretical computer science
  5. Cross-validation -> data mining
