Datafilos: February 2014

Saturday, 15 February 2014

Camera sensor

Recently I noticed that someone had patented a layout of sensors on a camera chip that is better tuned to the sensitivity of the eye. In particular, the patented chip combines colorless sensors with color sensors. This combination makes sense, since the human eye is more sensitive to luminance than to color. Furthermore, the resulting pictures are less noisy because colorless sensors do not filter the light.

Nevertheless, the presented design doesn’t exactly follow the sensitivity of the human eye. Hence I predict that sooner or later someone will patent the right proportion of sensors without specifying the exact geometric layout. Indeed, it’s possible that the exact layout of the sensors will be random (while preserving the right proportions).

Sunday, 9 February 2014

Confession

Mrs. H., I hold one thing against you. As we wrote composition exercises week after week, I developed an addiction. For example, I lie down in bed but cannot fall asleep, because in my mind I keep improving some story. The only thing that helps is to get up, sit down at the desk and write it all out. Only once the thoughts have been exported to paper, and it has been validated that the export succeeded, can I turn to the originally intended activity, sleep. For me, writing it out and sleeping it off (in Czech, "vypsat" and "vyspat") often mean the same thing.

Or I am with friends and want to have fun, but some thought keeps returning to my mind like a fly to excrement. The only way to get rid of it is to tell it to my friends or to paper. If only they were at least proper thoughts that would amuse the company. But no, those thoughts are fit only for paper. Perhaps we should have conversed more and written less.

The difference between statistics, machine learning, data mining and data science.

Originally there was just statistics: a method for summarizing huge populations in two numbers, the mean and the variance. With a bit of exaggeration, the whole of statistics operates on just these two numbers. From them you can compute significance, confidence intervals, correlation, regression and much more. At the time it was an amazing success: you could work with millions of records on a single sheet of paper.
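To make the "two numbers" idea concrete, here is a minimal sketch of a confidence interval for the mean computed from summary statistics alone. It assumes a normal approximation (z = 1.96 for roughly 95%) and, strictly speaking, needs a third number, the sample size. The function name and the example values are made up for illustration.

```python
import math


def mean_confidence_interval(n, mean, variance, z=1.96):
    """Approximate confidence interval for a population mean.

    Uses only summary statistics: the sample size, the sample mean
    and the sample variance. z=1.96 corresponds to roughly 95%
    under a normal approximation (an assumption of this sketch).
    """
    standard_error = math.sqrt(variance / n)
    margin = z * standard_error
    return mean - margin, mean + margin


# Hypothetical example: a million records summarized on "one sheet of paper".
low, high = mean_confidence_interval(n=1_000_000, mean=170.0, variance=100.0)
print(f"95% CI for the mean: [{low:.4f}, {high:.4f}]")
```

None of the million individual records is needed once the summary numbers are known.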

But with the dawn of computers, people became less limited in the amount of computation deemed practical, and they started to think big. What if we worked with the whole population distribution? Or if we ran these old, trivial statistical tests on this huge pile of data, wouldn’t we find something? These two questions stood at the beginning of two fields: machine learning, which explores the former question, and data mining, which explores the latter.

Statisticians with access to computers started to pull nasty tricks like bootstrapping to narrow confidence intervals. Traditional statisticians felt threatened and started to accuse the computer statisticians of cheating. Later on, the computer statisticians persuaded the traditional ones of the validity of the approach, and statisticians accepted bootstrapping as a useful tool. But disagreements like this led to the divergence of machine learning from statistics.
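For readers who have not met bootstrapping, the following is a minimal sketch of one common variant, the percentile bootstrap for the mean. It trades the paper-and-pencil formula for brute-force resampling with replacement. The function name, the number of resamples and the sample data are assumptions made for illustration only.

```python
import random
import statistics


def bootstrap_mean_interval(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Resample with replacement, same size as the original sample,
    # and record the mean of every resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    # Cut off an equal share of resampled means in each tail.
    tail = (1.0 - confidence) / 2.0
    lower_index = int(tail * n_resamples)
    upper_index = int((1.0 - tail) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical measurements.
sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.2, 3.0, 2.7, 2.6]
low, high = bootstrap_mean_interval(sample)
print(f"bootstrap 95% interval for the mean: [{low:.3f}, {high:.3f}]")
```

The whole procedure is a loop around a trivial statistic, which is exactly why it only became practical once computers were available.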

Similarly, the guys and gals from data mining were targets of many attacks, because data mining allowed scientific articles to be produced at an amazing pace: what would have taken the whole life of a respected scientist could be done in less than 5 minutes by a stupid computer. Unfortunately, in this case the contempt was deserved, because many results of data mining were false positives. Later on, data miners learned to use the Bonferroni correction and to validate their results to decrease the rate of false positives, but the damage was done. Both statisticians and machine learners started to look down upon data miners as kids who had learned a few tricks that they apply without any deeper understanding.
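The Bonferroni correction itself is a one-liner: with m tests, each p-value is compared against alpha / m instead of alpha. A minimal sketch, with made-up p-values and a hypothetical function name:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests stay significant after the Bonferroni correction."""
    if not p_values:
        return []
    # With m tests, each one must pass the stricter threshold alpha / m.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests run on the same pile of data.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
uncorrected = [p <= 0.05 for p in p_values]
flags = bonferroni_significant(p_values)
print(sum(uncorrected), "significant before correction,", sum(flags), "after")
```

In this made-up example four of the five tests look significant at the usual 0.05 level, but only one survives the corrected threshold of 0.01.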

With the rise of the Internet, access to data became simpler and the biggest time burden shifted from data collection to data processing. The methods invented by machine learners were hopelessly slow on data from the Internet and phones, and even the methods employed by data miners were too slow to be executed on whole datasets. This change of paradigm led to a return to the roots of statistics, where people first formulate a hypothesis and then study the data to prove or invalidate it. But because the focus shifted from the correctness of methods (which had been proved many times by then) to the efficient computation of trivial algorithms, a new group of statisticians with a computer science background emerged. Nowadays we call this group data scientists.

PS: A quick guide to telling the fields apart by their keywords:

State space -> artificial intelligence
Significance -> statistics
Maximum a posteriori (MAP) -> machine learning
Algorithmic efficiency (O-notation) -> theoretical computer science
Cross-validation -> data mining

A Sigh

Father is obsessed with vintage cars (familiarly nicknamed wrecks by Mother), vacuum cleaners, televisions and measuring equipment. Mother, in turn, is obsessed with flowers (familiarly nicknamed weeds by Father), porcelain and glass. And both tend to expand their collections, even at the other's expense. So when one goes away for a while, the other takes advantage and shifts the demarcation line. For example, when Mother goes away for a week, Father acquires a new wreck and, as if nothing were amiss, parks it on Mother's lawn. On her return Mother wrings her hands, because the vintage wreck has wet itself and left a little puddle of oil, so even once the wreck is moved the lawn is already ruined. And conversely, when Father goes away, Mother throws out old tires and plants a full-grown tree in their place. Father then laments pitifully that those were still good tires and that he needs them. But because he loves Mom, he leaves the tree there and just surrounds it with new tires, so the tree twists as it tries to reach the light through the tires.

I wish my parents would never die, because ethically disposing of the accumulated property would be exhausting.

Friday, 7 February 2014

My knowledge of RapidMiner

My knowledge of operators in RapidMiner:

Process Control (10/39)
Utility (7/54)
Repository Access (2/6)
Import (2/28)
Export (1/18)
Data Transformation (42/115)
Modeling (27/66)
Evaluation (11/32)

Overall, I know around 28% (102/358) of operators in RapidMiner.
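The overall figure can be re-derived from the per-category counts above; a small sketch (the variable names are mine, the numbers are copied from the list):

```python
# (known operators, total operators) per RapidMiner category, from the list above.
known_and_total = {
    "Process Control": (10, 39),
    "Utility": (7, 54),
    "Repository Access": (2, 6),
    "Import": (2, 28),
    "Export": (1, 18),
    "Data Transformation": (42, 115),
    "Modeling": (27, 66),
    "Evaluation": (11, 32),
}

known = sum(k for k, _ in known_and_total.values())
total = sum(t for _, t in known_and_total.values())
share = 100 * known / total
print(f"{known}/{total} operators known, about {share:.0f}%")
```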