Datafilos: 2014

Saturday, November 1, 2014

3Vs (variety, velocity and volume)

Three terms stand out in relation to Big Data: Variety, Velocity and Volume.
Similarly, the 4Ps define all of marketing using only four terms:
Product, Promotion, Place, and Price.

Friday, October 17, 2014

Friday, October 3, 2014

Comparison of MATLAB and R

Advantages of R:

Easy setting of default parameters (inherited from functional languages). Not that it is incredibly difficult to set a default value in MATLAB, but it is verbose and error-prone.
Named parameters (again, inherited from functional languages). In MATLAB, when you pass many parameters with string values to a function, it is unclear at a glance which is a parameter name and which is a parameter value. In R, it is immediately clear.
Mixed tables (a combination of string and numerical columns). Incredibly useful for real-world messy data sets. A partial remedy in MATLAB is 'Tables', available in recent versions.
The possibility to name rows and columns. This is awesome because you don't have to remember that you want column 181; all you have to remember is the name of the column. It also keeps the metadata together with the data. Hence if you perform selection, projection or transformation of the data, the metadata stay automatically in sync with the data. No work is left to the user. In MATLAB, you have to use 'Struct', or 'Tables' in recent versions.
Negative indexes for dropping particular columns/rows.

Advantages of MATLAB:

There are fewer competing packages for MATLAB than for R. Hence in MATLAB you are spared from deciding which library is the best.
Sparse matrices are an integral part of MATLAB. Hence all algorithms that benefit from sparse matrices use the same representation. In R, each library uses its own representation.

Sunday, September 28, 2014

Difference between Machine Learning and Artificial Intelligence

In my biased opinion, the difference between Machine Learning (ML) and Artificial Intelligence (AI) lies in the way they solve problems. AI seeks an optimal solution, while ML seeks a usable solution.

This difference is reflected in the tool sets they use. A typical tool for AI is logic, which is traditionally binary. On the other hand, probability, a common tool in ML, allows any value between 0 and 1.

The difference is reflected in individual algorithms as well. A* is a representative algorithm from AI. It is an elegant algorithm that guarantees that the returned solution is optimal (provided its heuristic never overestimates the remaining cost). In contrast, a neural network, an algorithm from ML, doesn’t guarantee optimality of the solution at all. But it can tackle a much wider range of problems than A*. And that is the reason why ML is currently more popular and successful than AI. Despite all the hopes, it turned out that we are unable to optimally solve many problems, like voice or object recognition. All we can hope for is a good enough solution. And that is exactly where ML beats AI. ML is all about “how to get a usable solution”, while AI is about “how to get an optimal solution”. And when the optimal solution is unreachable, AI just gives up, while ML gives at least something.
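
To make the optimality guarantee of A* concrete, here is a minimal sketch in Python. It assumes a small 4-connected grid with unit step costs and uses the Manhattan distance as the heuristic; the grid, the function name a_star and the wall marker '#' are made up for illustration.

```python
import heapq

def a_star(grid, start, goal):
    """Return the length of a shortest 4-connected path, or None if none exists.

    grid is a list of equally long strings, '#' marks a wall.
    start and goal are (row, col) tuples.
    """
    def h(cell):
        # Manhattan distance never overestimates on a unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost = {start: 0}
    frontier = [(h(start), 0, start)]  # (estimated total, cost so far, cell)
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost[cell]:
            continue  # stale queue entry, a cheaper route was found meanwhile
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        frontier, (new_cost + h((nr, nc)), new_cost, (nr, nc))
                    )
    return None  # no path at all: the search gives up

GRID = [
    "..#.",
    "..#.",
    "....",
]
BLOCKED = ["..#."] * 3

shortest = a_star(GRID, (0, 0), (0, 3))
unreachable = a_star(BLOCKED, (0, 0), (0, 3))
print(shortest, unreachable)
```

On the open grid the search returns the length of the detour around the wall. On the blocked grid it returns nothing, which is the "AI just gives up" case.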

Tuesday, September 9, 2014

Comparison of SAS data step and SQL

The default tool for ETL in SAS is the data step. However, SAS also offers support for SQL. When to use which?

The main advantages of the data step are:

Drop keyword. Imagine that you want to remove one column from a table with 2000 columns. In SQL you would have to name all the columns you want to keep. In the data step it is enough to name the column you don't want to include. Awesome.
Wildcards. If you want to select all columns beginning with "pred_", all you have to do in the data step is write "pred_:" (note the colon). In SQL you would have to write the name of each predictor.
Speed. SQL in SAS is not implemented very efficiently.
LAG command. In SQL you have to perform a slow and cumbersome join to get the corresponding functionality.
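
The join needed to imitate LAG can be sketched as follows. This is an illustration in Python with the built-in sqlite3 module standing in for SQL in SAS (an assumption, the dialects differ). The table sales and its columns day and amount are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (2, 12.5), (3, 9.0)],
)

# Self-join: every row is matched with the row of the previous day.
# The first day has no predecessor, so its lagged value is NULL (None).
rows = conn.execute(
    """
    SELECT cur.day, cur.amount, prev.amount AS lag_amount
    FROM sales AS cur
    LEFT JOIN sales AS prev ON prev.day = cur.day - 1
    ORDER BY cur.day
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The join only works here because the days are consecutive integers. With gaps or duplicates in the key the query gets more complicated, which is the cumbersome part.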

The main advantages of SQL:

Group by command. The data step simply doesn't offer such functionality.
Order by command. Again, you can't sort directly in the data step.
Metadata. Queries on the metadata are so addictive!

Wednesday, May 28, 2014

Religion

Why do religions exist? Because the goal of every religion is to spread. And if that happened not to be its goal, such a religion died out in competition with the others: there is selection pressure among religions, much as among living organisms. How do religions spread? There are three main strategies:

war (for example the Crusades, the Hussite raids known as spanilé jízdy, or jihad by the sword),
population explosion (the Catholic Church's opposition to contraception),
attractiveness (the promise of an afterlife or of rebirth).

War eliminates opponents, so the share of believers rises at the expense of the others. Encouraging reproduction directly increases the number of believers. And the attractiveness of a religion raises the probability that people convert to it, whether of their own will or by force.

And because religions appeared that are more viable than atheism, we have religions.

Sunday, May 11, 2014

Why are analysts so often climbers?

When you spend the whole day sitting, you feel the need to stretch your muscles. Probably the best sport for stretching every muscle in the body is swimming. But if you are thin, you get cold quickly in the water. So thin analysts often treat their back pain with the second-best sport for stretching, climbing.
If you work on long-term projects where it takes years before your results are evaluated, you miss the (timely) feeling of euphoria from a job well done. In climbing, though, you get immediate feedback: either you climb the route or you don't.
If you are an analyst consultant, you are a harlot for hire: each time you work for someone else, somewhere else. So it is difficult for you to meet regularly for a team sport. Fortunately, bouldering is an individual sport.
In the mountains the population density is low, so if you are an over-educated introvert, climbing becomes a welcome excuse to escape from civilization.
People can be divided into two groups: some prefer to optimize the correctness of an answer, others its speed. Analysts usually lean toward accuracy, and climbing is first about optimizing the choice of route and only then about action. Squash, by contrast, is first about action and only then about strategy. That is why salespeople tend to prefer running and squash.

Sunday, April 20, 2014

Hacker in the browser and other hacking ideas

If I were a hacker with the ability to modify web pages in the browser, I would modify the top of Wikipedia and ask people to donate. People accustomed to donating would donate again, but this time into my pocket. If it were synced with the real campaign, it would be awesome.

Or if I were Microsoft and detected that two people communicating over Skype (or another messenger) are in a relationship, I would suggest that the man send roses to his fiancée. It would be just a small bunch of roses. But delivery would be guaranteed within the next 30 minutes to the fiancée's current estimated location.

And to be even more devilish, I would use the fiancée as the vector. I would show her a pop-up saying: “Do you want to test your boyfriend? Show him an advertisement with one of the following bouquets!” She would be presented with three simple options like snowdrops, violas and sunflowers. She would be thinking about the ad while chatting with the boyfriend, and in the end she would click on one of the bouquets, because who would not like to test the love of their beloved? And to make sure that the boyfriend is not going to disappoint her (men are notoriously unreliable), she would keep nudging him to buy the flowers until she actually gets the bouquet (women can get really persistent when they decide to get something). The poor boyfriend would then be forced to buy the overpriced bouquet, because who the hell wants to endure all that moaning of the fiancée about some stupid inedible vegetable? The sooner he gets it over with, the better.

The boyfriend would be under pressure to buy that specific bouquet. But men find it hard to multitask (searching for a cheaper delivery while chatting with the fiancée). And it would be ridiculously hard to find a delivery service with that specific flower (why the hell didn’t she want roses?!) and even harder to find one able to deliver the bouquet in less than 30 minutes. Simply Machiavellian.

Saturday, February 15, 2014

Camera sensor

Recently I noticed that someone patented a layout of sensors on a camera chip that is better tuned to the sensitivity of the eye. In particular, the patented chip combines colorless sensors with color sensors. This combination makes sense, since the human eye is more sensitive to luminance than to color. Furthermore, the resulting pictures are less noisy because colorless sensors do not filter light.

Nevertheless, the presented design doesn’t exactly follow the sensitivity of the human eye. Hence I predict that sooner or later someone will patent the right proportion of sensors without specifying the exact geometric layout. Indeed, it is possible that the exact layout of the sensors will be random (while preserving the right proportions).

Sunday, February 9, 2014

Confession

Mrs. H., I hold one thing against you. As we wrote composition exercises week after week, I developed an addiction. For example, I lie down in bed but cannot fall asleep, because in my mind I keep improving some story. And the only thing that helps is to get up, sit down at the desk and write it all out. Only once the thoughts have been exported to paper, and the export has been validated as successful, can I return to the originally intended activity, sleep. In Czech, vypsat (to write out) and vyspat (to sleep) are nearly the same word, and for me they often mean the same thing.

Or I am with friends and want to have fun, but some thought keeps returning to my mind like a fly to excrement. And the only way to get rid of it is to tell it to my friends or to paper. If only they were proper thoughts that would amuse the company. But no, those thoughts are fit only for paper. We probably should have conversed more and written less.

The difference between statistics, machine learning, data mining and data science

Originally there was just statistics: a method for summarizing huge populations in two numbers, the average and the variance. With a bit of exaggeration, the whole of statistics operates with just these two numbers. With them you can compute significance, confidence intervals, correlation, regression and much more. Back then it was an amazing success: you could work with millions of records on a single piece of paper.
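
As a small illustration of how far the two numbers go, the sketch below computes an approximate 95% confidence interval for the mean from nothing but the sample size, the average and the variance. It is written in Python, the measurements are made up, and the normal quantile 1.96 is an approximation (for a sample this small a t quantile would be more appropriate).

```python
import math
import statistics

def confidence_interval(n, mean, variance, z=1.96):
    """Approximate 95% interval for the mean from summary numbers only."""
    standard_error = math.sqrt(variance / n)
    return mean - z * standard_error, mean + z * standard_error

# Made-up measurements, used only to produce the summary numbers.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.2, 4.6, 5.1]
n = len(sample)
mean = statistics.mean(sample)
variance = statistics.variance(sample)  # sample variance, n - 1 in the denominator

# From here on the raw data are no longer needed.
low, high = confidence_interval(n, mean, variance)
print(f"mean={mean:.3f} variance={variance:.3f} CI=({low:.3f}, {high:.3f})")
```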

But with the dawn of computers, people became less limited in the amount of computation that was deemed practical, and they started to think big. What if we worked with the whole population distribution? Or if we ran these old trivial statistical tests on this huge pile of data, wouldn’t we find something? These two questions stood at the beginning of two fields: machine learning, exploring the former question, and data mining, exploring the latter.

Statisticians with access to computers started to pull nasty tricks like bootstrapping to narrow confidence intervals. Traditional statisticians felt threatened and started to accuse computer statisticians of cheating. Later on, computer statisticians persuaded traditional statisticians of the validity of the approach, and statisticians accepted bootstrapping as a useful tool. But disagreements like this led to the divergence of machine learning from statistics.
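
For readers who have not met the trick, here is a minimal sketch of one common variant, the percentile bootstrap, in Python. The function name bootstrap_ci, the default of 2000 resamples and the data are assumptions made for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.median, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    low = estimates[int(n_resamples * alpha / 2)]
    high = estimates[int(n_resamples * (1 - alpha / 2)) - 1]
    return low, high

data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]  # made-up observations
low, high = bootstrap_ci(data)
print(f"median={statistics.median(data)} interval=({low}, {high})")
```

The appeal is that the same few lines work for the median or any other statistic for which no handy formula exists. All it costs is computation.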

Similarly, the guys and gals from data mining were targets of many attacks, because data mining allowed the production of scientific articles at an amazing pace: what would have taken the whole life of a respected scientist could be done in less than 5 minutes with a stupid computer. Unfortunately, in this case the contempt was deserved, because many results of data mining were false positives. Later on, data miners learned to use the Bonferroni correction and to validate results to decrease the rate of false positives, but the damage was done. Both statisticians and machine learners started to look down upon data miners as kids who had learned a few tricks, which they apply without any deeper understanding.
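
The Bonferroni correction itself is a one-liner: with m tests, each p-value has to clear alpha / m instead of alpha. A sketch in Python with made-up p-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag the tests that stay significant after Bonferroni correction."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)  # stricter per-test threshold
    return [p <= threshold for p in p_values]

p_values = [0.001, 0.012, 0.03, 0.04, 0.2]  # made-up results of 5 tests
naive = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)
print(sum(naive), "findings before correction,", sum(corrected), "after")
```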

With the rise of the Internet, access to data became simpler and the biggest time burden shifted from data collection to data processing. The methods invented by machine learners were hopelessly slow on data from the Internet and phones, and even the methods employed by data miners were too slow to be executed on whole data sets. This change of paradigm led to a return to the roots of statistics, where people first formulated a hypothesis and then studied the data to prove or invalidate it. But because the focus shifted from the correctness of methods (they had been proved many times by then) to the efficient computation of trivial algorithms, a new group of statisticians with a computer science background emerged. Nowadays we call this group data scientists.

PS: A quick guide how to differentiate different fields based on the keywords:

State space -> artificial intelligence
Significance -> statistics
Maximum a posteriori (MAP) -> machine learning
Algorithmic efficiency (O-notation) -> theoretical computer science
Cross-validation -> data mining

A sigh

Father is obsessed with vintage cars (familiarly nicknamed wrecks by mother), vacuum cleaners, televisions and measuring equipment. Mother, in turn, is obsessed with flowers (familiarly nicknamed weeds by father), porcelain and glass. And both tend to expand their collections, even at the other's expense. So when one of them leaves for a while, the other takes advantage and shifts the demarcation line. For example, when mother leaves for a week, father gets a new wreck and, as if nothing had happened, parks it on mother's lawn. On her return mother wrings her hands, because the vintage wreck has wet itself and left a little oil puddle, so even once the wreck is moved away, the lawn is already ruined. And conversely, when father leaves, mother throws out old tires and plants a full-grown tree in their place. Father then laments piteously that those tires were still good and that he needs them. But because he loves mom, he leaves the tree there and just surrounds it with new tires, so the tree twists as it tries to reach the light through the tires.

I wish my parents would never die, because the ethical disposal of the accumulated property would be exhausting.

Friday, February 7, 2014

My knowledge of RapidMiner

My knowledge of operators in RapidMiner:

Process Control (10/39)
Utility (7/54)
Repository Access (2/6)
Import (2/28)
Export (1/18)
Data Transformation (42/115)
Modeling (27/66)
Evaluation (11/32)

Overall, I know around 28% (102/358) of operators in RapidMiner.
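
The overall figure can be rechecked from the per-group counts above. A tiny Python sketch that only restates the numbers from the list:

```python
# (known operators, all operators) per group, copied from the list above
groups = {
    "Process Control": (10, 39),
    "Utility": (7, 54),
    "Repository Access": (2, 6),
    "Import": (2, 28),
    "Export": (1, 18),
    "Data Transformation": (42, 115),
    "Modeling": (27, 66),
    "Evaluation": (11, 32),
}

known = sum(k for k, _ in groups.values())
total = sum(t for _, t in groups.values())
share = 100 * known / total
print(f"{known}/{total} = {share:.1f}%")
```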

Wednesday, January 29, 2014

My knowledge of SAS Enterprise Miner

Currently I am familiar with 48/70 (69%) nodes in Enterprise Miner.

Thursday, January 23, 2014

Execution of MBR score.sas generated in SAS Enterprise Miner in SAS Enterprise Guide

Execution of non-data-step models like MBR and SVM differs from execution of data-step models like tree and regression. Hence it is not possible to call score.sas like this:

data output;
   set input;
   %include "C:\score.sas";
run;

Instead you have to use something like:

/* Note: For k-NN, score.sas is not enough by itself.
   k-NN also needs the whole training set. */

/* Define EMSCORE library where MBR will look for training
   and scoring data. It has to be named EMSCORE. */
libname EMSCORE base "C:\EMSCORE";

/* Define training data. It has to be named EM_TRAIN_MBR
   and reside in EMSCORE. */
data EMSCORE.EM_TRAIN_MBR;
    set learning;
run;

/* Define scoring data. Scoring data will be overwritten
   with MBR prediction. The macro variable pointing to it
   has to be named em_score_output. */
data EMSCORE.scoring;
    set scoring;
run;
%let em_score_output= EMSCORE.scoring;

/* Execute MBR */
data _null_;
   %include "C:\score.sas";
run;

Monday, January 20, 2014

Comparison of variable selection methods

Since I didn't know which variable selection method to use, I performed a trivial test on the Sonar data set. The Sonar data set has 60 attributes, and I arbitrarily decided to reduce the number of attributes to 10. Then I measured classification accuracy with ten-fold cross-validation. And to get an idea of how much the feature selection methods depend on the classifier, I tried three different models: naive Bayes, k-NN and a classification tree.

Based on the comparison, the best method to use is SVM attribute selection. However, this method requires parameter tuning. The next best variable selection method is Chi2. The disadvantage of this method is that it favors attributes with many levels. Hence the performance of Chi2 could be severely hindered on data sets whose attributes, unlike those in Sonar, differ in their number of levels. The last of the top three methods is information gain ratio. Its advantage is that, unlike Chi2, it can handle attributes with differing numbers of levels.
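
Why the gain ratio copes with attributes that have many levels can be shown on toy data. The Python sketch below contrasts it with plain information gain (not with Chi2, which is not implemented here). The labels and both attributes are made up and have nothing to do with the measured results above.

```python
import math
from collections import Counter

def entropy(values):
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )

def information_gain(attribute, labels):
    # Entropy of the labels minus the entropy left after splitting.
    groups = {}
    for level, label in zip(attribute, labels):
        groups.setdefault(level, []).append(label)
    remaining = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remaining

def gain_ratio(attribute, labels):
    # Dividing by the entropy of the attribute itself penalizes
    # attributes that split the data into many small groups.
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0
    return information_gain(attribute, labels) / split_info

labels = ["rock"] * 4 + ["mine"] * 4
two_level = ["lo", "lo", "lo", "lo", "lo", "hi", "hi", "hi"]  # useful, one error
id_like = ["a", "b", "c", "d", "e", "f", "g", "h"]            # unique per row

for name, attribute in (("two_level", two_level), ("id_like", id_like)):
    print(name,
          round(information_gain(attribute, labels), 3),
          round(gain_ratio(attribute, labels), 3))
```

Plain information gain prefers the useless ID-like attribute because it splits the data perfectly. The gain ratio ranks the two-level attribute first.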

Sunday, January 19, 2014

Oh people of SAS, you are amazing!

Sometimes you have to use a comma between parameters, for example in the definition of a macro:
%macro append(columnName=, labelName=);
But in other cases, like dropping, you can't use a comma between parameters:
data table(drop=attr1 attr2 attr3);
If you type DAAT instead of DATA, your program will run anyway with a polite note in your SAS log telling you that it has assumed you meant DATA and went ahead and executed based on that assumption, but if not, hey, feel free to let it know.

Thursday, January 16, 2014

SAS experience

How can you tell that I have been working in SAS? My hair is swept back to an incredible degree. I am afraid that if they tell me tomorrow, too, that I will be working with SAS, I will end up like Homer Simpson, who lost part of his mane every time he learned he was going to have another child.

Work

What do I dislike about my current workplace? The carpet in the office. As I shuffle my feet across the carpet, I charge up with static electricity. Then all it takes is touching the water tap and sparks fly.

But at least I can say that I positively glow at work.

Tuesday, January 14, 2014

Nest Thermostat

Since we have self-learning thermostats, I predict that sooner or later we will have self-learning electric kettles. When you wake up, the water for the morning coffee will be preheated, so when you push the button you don't have to wait an eternity for your morning cup. And in the evening, when the kettle detects that you are returning home, it preheats the water. It doesn't matter whether you use the hot water for a cup of tea or for dinner; it will be ready at the push of a button. In the meantime the kettle is not going to keep the water warm, since it knows no one is home. Comfort and efficiency blended together.

PS: Did you notice how hard it is to estimate how much water to pour into the kettle, so that you always pour a bit more than necessary just to be sure you don't end up with too little? If you have a favorite cup, the intelligent kettle can learn it and preheat the exact amount of water. It can even go so far that you push the button and, based on its sensors, it fills any cup just the way you like: to the brim or a centimeter below. Your choice.

Thursday, January 9, 2014

SAS Miner error messages

Are you getting error messages when you run MBR (Memory Based Reasoning, also known as k-nearest neighbors)?

ERROR in MBR: Run time error was encountered. Please see log for more details.

SOLUTION: Edit variables entering into MBR and unselect all ID attributes like _dataobs_. That should do the trick.

NOTE: MBR processes only interval variables. And it's not sufficient to just change attribute level from binary to interval.

ERROR in Partial Least Squares: No factor was extracted. Use different classifier.

SOLUTION: Use variable selection to reduce the search space.

Saturday, January 4, 2014

Optimism

I saw the tram arriving. It was already past the point where I could influence things by how fast I ran; whether I would catch it depended only on the driver and whether he would wait. But I told myself: it is morning, and in the morning people tend to be more lenient. And it is a new year, people are optimistic. And out of that optimism I set about chasing the tram. I was already standing at the tram door, planning to step in as quickly as possible. What if sadistic urges awoke in the driver?

But as I was stepping into the tram, part of my body kept flying on parallel to the tram. I sprawled out in that doorway in textbook fashion.

Comparison of SAS Enterprise Miner and RapidMiner

Enterprise Miner (EM)
+ Nice interactive visualizations. Whenever you click on a label, the corresponding data in the chart get highlighted. Still, the plots could support interactive exploration similar to KNIME, where highlighting a sample in one plot highlights it in all other plots.
+ Ingenious default settings. You plug in the operator and it works. In RapidMiner you often have to preprocess the data and fiddle with the settings to get a usable result in a reasonable time.
+ You can selectively execute a portion of a flow, while in RapidMiner you can execute only the whole flow. Selective execution in EM is a great feature that accelerates development: if you make a mistake at the end of the flow, you don't have to recalculate everything from scratch. Instead, EM (by default) recalculates only the branches affected by the change. Unfortunately, this feature takes a high toll. While RapidMiner can process data "on the fly", EM has to process data in steps: each operator in the flow has to finish before the subsequent operators can start. Furthermore, in EM you need fast and big storage for intermediate calculations.
- Uninformative error messages. You often have to blindly test many different things until you find the root of the problem.
- It's full of bugs. Not that RapidMiner is free of bugs. But after a day of bug hunting in RapidMiner you can at least fix the bug in the source code yourself. In the case of SAS you either have to be a better hacker than I am, or you have to go through the frustrating process of communicating with the infamous SAS support.
- The EM ecosystem lacks ETL operators; instead, you are supposed to use the SAS code operator. Personally, I prefer to perform all the transformations in Enterprise Guide and then just load the final table(s) into EM.
- EM has no undo command. Instead of undo, you have to confirm each action, since each change is irreversible. This approach goes against the UI best practice of limiting the number of modal windows and making each action easily reversible.
- EM has no save command. Beware of blindly confirming modal windows! It happened to me several times that, instead of a single operator, the whole flow was deleted, because I misclicked the operator and selected the drawing board instead.
- Setting up the SAS ecosystem takes some time. Count on about one month.
- Keeping the SAS ecosystem running takes a lot of energy. Count on two days of repairs for each working day.
- Importing data into EM is tedious, as you first have to define a library and then walk through a long wizard. Count on 5 minutes for even the simplest data set. But at least the example data sets from SASHELP are fast to load.
- Moving EM projects from one directory to another is trivial but unintuitive. First, you must make sure that the name of the project directory is the same as the project name. Second, you open a moved project via "New project", selecting the project directory as the destination directory.
- Metadatabase. So far all it does is double the maintenance time. I am sure there must be some benefit to having a metadatabase, but so far I do not see it. Hence I strongly suggest that everyone install a standalone version instead of the server version, unless you have a good reason to do otherwise.

RapidMiner (RM)
+ Fully sufficient ecosystem. You can do all ETL within RapidMiner.
+ You can truly program visually in RM, since you have loop operators and conditions.
+ You have the source code.
+ The drawing board is editable at runtime. Hence you can prepare a new experiment while RapidMiner is still performing calculations on the last one.
- Every time, you have to rerun the whole flow, even if you want to make just a small refinement.
- It's too easy to set up flows/operators so that they take an eternity to calculate. For comparison, everything in SAS is optimized for big data and you don't get trapped in computational black holes too often.
- You can't stop a running node. You can only tell RapidMiner to prevent execution of subsequent nodes. Hence forced application restarts are common.
~ Since each connection between operators in RM fulfills a distinct function, you commonly have to wire two operators with more than one connection (for example one connection for training data and a second for testing data), making the flow look overcrowded. In EM a single connection between two operators is always enough, as each connection can transfer any type of data (training, testing, validation...). Hence flows in EM look tidier.
~ On the other hand, RM allows grouping operators into a single operator, enabling a "divide and conquer" strategy. While the absence of operator nesting in EM makes small EM flows easy to understand, bigger flows look messier in EM than in RM.
- SVD and PCA are ridiculously slow and memory-consuming in comparison to the SAS versions.
- The graphical presentation of hierarchical clustering is just a joke in comparison to the output from Orange, KNIME or MATLAB.
- I miss Partial Least Squares regression and full Bayes (not just Naive Bayes). But you can get them from the WEKA plugin.
- The ROC plot should show a reference line for simpler visual evaluation.

Wednesday, January 1, 2014

Programming languages/scripts in which I wrote something

Pascal
VisualBasic
Assembler
C
C++
C#
Python
MATLAB
Lisp
Prolog
R
NetLogo
Java
JavaScript
PHP
SQL
SAS (if we relax requirements for definition of programming languages/scripts)

And I am not good at any of them.