Saturday, January 4, 2014

Optimism

I saw the tram pulling in. It was already past the point where my running speed could make a difference; whether I caught it now depended only on whether the driver would wait. But I told myself: it is morning, and people tend to be more lenient in the morning. And it is a new year, so people are optimistic. Out of that optimism I set off after the tram. Soon I was standing at the tram door, planning to get through it as quickly as possible. What if sadistic instincts awoke in the driver?

But as I stepped into the tram, part of my body kept flying on, parallel to the tram. I went sprawling in that doorway in textbook fashion.

Comparison of SAS Enterprise Miner and RapidMiner

Enterprise Miner (EM)
+ Nice interactive visualizations. Whenever you click on a label, the corresponding data in the chart gets highlighted. Still, the plots could go further and support linked exploration as in KNIME, where highlighting a sample in one plot highlights it in all other plots.
+ Ingenious default settings. You plug in the operator and it works. In RapidMiner you often have to preprocess the data and fiddle with the settings to get a usable result in a reasonable time.
+ You can selectively execute a portion of a flow, while in RapidMiner you can execute only the whole flow. Selective execution in EM is a great feature that accelerates development: if you make a mistake at the end of the flow, you don't have to recalculate everything from scratch. Instead, EM (by default) recalculates only the branches affected by the change. Unfortunately, this feature takes a heavy toll. While RapidMiner can process data "on the fly", EM has to process data in steps: each operator in the flow has to finish before the subsequent operators can start. Furthermore, EM needs fast and large storage for the intermediate results.
- Uninformative error messages. You often have to blindly test many different things until you find the root of the problem.
- It's full of bugs. Not that RapidMiner is free of bugs, but after a day of bug hunting in RapidMiner you can at least fix the bug in the source code yourself. With SAS you either have to be a better hacker than I am, or you have to go through the frustrating process of communicating with the infamous SAS support.
- The EM ecosystem lacks ETL operators; you are supposed to use the SAS Code operator instead. Personally, I prefer to perform all the transformations in Enterprise Guide and then just load the final table(s) into EM.
- EM has no undo command. Instead, you have to confirm each action, since each change is irreversible. This approach goes against the UI best practice of limiting the number of modal windows and making each action easily reversible.
- EM has no save command. Beware of blindly confirming modal windows! Several times, instead of deleting a single operator, I deleted the whole flow because I misclicked the operator and selected the drawing board instead.
- Setting up the SAS ecosystem takes some time. Count on about one month.
- Keeping the SAS ecosystem running takes a lot of energy. Count on two days of repairs for each working day.
- Importing data into EM is tedious, as you first have to define a library and then walk through a long wizard. Count on 5 minutes for even the simplest data set. But at least the example data sets from SASHELP are fast to load.
- Moving EM projects from one directory to another is trivial but unintuitive. First, you must make sure that the name of the project directory is the same as the project name. Second, you open the moved project via "New project", selecting the project directory as the destination directory.
- Metadatabase. So far all it does is double the maintenance time. I am sure there must be some benefit to having a metadatabase, but so far I do not see it. Hence I strongly suggest installing the standalone version instead of the server version unless you have a good reason to do otherwise.
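
The selective execution mentioned in the list above can be illustrated with a toy sketch. This is not how EM is implemented internally (I have no access to that); it only shows the general idea of caching each operator's output and invalidating just the branches downstream of a change. The `Flow` class and the node names are made up for the illustration.

```python
class Flow:
    """Toy data flow: every node is a function of its upstream nodes' outputs."""

    def __init__(self):
        self.funcs = {}      # node name -> callable
        self.inputs = {}     # node name -> list of upstream node names
        self.cache = {}      # node name -> stored intermediate result
        self.run_count = {}  # node name -> number of actual executions

    def add(self, name, func, inputs=()):
        """Add a node, or replace an existing one (an "edit" of the flow)."""
        self.funcs[name] = func
        self.inputs[name] = list(inputs)
        self.run_count.setdefault(name, 0)
        self.invalidate(name)

    def invalidate(self, name):
        """Drop the stored result of `name` and of everything downstream."""
        self.cache.pop(name, None)
        for other, upstream in self.inputs.items():
            if name in upstream:
                self.invalidate(other)

    def run(self, name):
        """Return the result of `name`, executing only nodes with no stored result."""
        if name not in self.cache:
            args = [self.run(up) for up in self.inputs[name]]
            self.cache[name] = self.funcs[name](*args)
            self.run_count[name] += 1
        return self.cache[name]


flow = Flow()
flow.add("load", lambda: [3, 1, 2])
flow.add("sort", lambda rows: sorted(rows), ["load"])
flow.add("top", lambda rows: rows[-1], ["sort"])
first = flow.run("top")    # executes load, sort and top

flow.add("top", lambda rows: rows[0], ["sort"])  # change only the last node
second = flow.run("top")   # re-executes top only; load and sort are reused
print(first, second, flow.run_count)
```

The sketch also shows where the price comes from: every intermediate result has to be kept in `cache`, which corresponds to the storage EM needs between steps.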

RapidMiner (RM)
+ Self-sufficient ecosystem. You can do all ETL within RapidMiner.
+ You can truly program visually in RM, since it has loop and condition operators.
+ You have the source code.
+ The drawing board is editable at runtime. Hence you can prepare a new experiment while RapidMiner is still performing calculations on the last experiment.
- Every time, you have to rerun the whole flow, even if you only want to make a small refinement.
- It's too easy to set up flows/operators so that they take an eternity to calculate. For comparison, everything in SAS is optimized for big data and you don't get trapped in computational black holes too often.
- You can't stop a running node. You can only tell RapidMiner not to execute the subsequent nodes. Hence forced application restarts are common.
~ Since each connection between operators in RM fulfills a distinct function, you often have to wire two operators with more than one connection (for example, one connection for training data and a second one for testing data), which makes the flow look overcrowded. In EM a single connection between two operators is always enough, as each connection can transfer any type of data (training, testing, validation...). Hence flows in EM look tidier.
~ On the other hand, RM allows grouping operators into a single operator, enabling a "divide and conquer" strategy. While the absence of operator nesting makes small EM flows easy to understand, bigger flows look messier in EM than in RM.
- SVD and PCA are ridiculously slow and memory-hungry compared to the SAS versions.
- Graphical presentation of hierarchical clustering is just a joke compared to the output from Orange, KNIME or MATLAB.
- I miss Partial Least Squares regression and full Bayes (not just Naive Bayes). But you can get them from the WEKA plugin.
- The ROC plot should show a reference line for simpler visual evaluation.

Wednesday, January 1, 2014

Programming languages/scripts in which I wrote something

  1. Pascal
  2. VisualBasic
  3. Assembler
  4. C
  5. C++
  6. C#
  7. Python
  8. MATLAB
  9. Lisp
  10. Prolog
  11. R
  12. NetLogo
  13. Java
  14. JavaScript
  15. PHP
  16. SQL
  17. SAS (if we relax the requirements for what counts as a programming language/script)
And I am not good at any of them.

Saturday, May 18, 2013

Which classifiers can deal with useless attributes

One of the preprocessing steps in data mining is feature selection. Let's perform a simple test to identify the classifiers that benefit from feature selection. The test is performed on the Wisconsin Breast Cancer dataset with a subset of attributes (the dataset is too easy to classify with all the attributes).


Classifier            Original data   With useless attributes   Relative difference
Naive Bayes           94%             69%                       37%
k-nn                  94%             93%                       1%
Classification Tree   85%             82%                       5%
Random Forest         93%             64%                       46%
Perceptron            86%             85%                       1%
SVM                   94%             90%                       5%

Based on the test, Naive Bayes and Random Forest are sensitive to irrelevant attributes and therefore benefit from feature selection, while Perceptron, k-nn, Classification Tree and SVM are resistant to adding irrelevant attributes.

Honestly, I am surprised that Random Forest performed so poorly in the comparison. But in this case it is because only 100 trees were used to classify over 500 attributes, and that ratio is too small. When 300 trees were used, the relative difference dropped to 12%.
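
One way to build intuition for the Random Forest result is to count how often a split even gets to look at a useful attribute. The sketch below assumes a forest that draws a random subset of about sqrt(p) attributes at every split, as in Breiman's original formulation; the post does not record which implementation or settings were used, so this is an assumption. The attribute counts in the usage lines (5 useful attributes, 500 useless ones) are hypothetical stand-ins, not the numbers from the test.

```python
from math import comb, isqrt


def prob_split_sees_useful(n_total, n_useful, n_candidates):
    """Probability that a random subset of n_candidates attributes, drawn
    without replacement from n_total attributes, contains at least one of
    the n_useful attributes (complement of the hypergeometric "none" case)."""
    if n_candidates > n_total:
        raise ValueError("cannot draw more candidates than attributes")
    none_useful = comb(n_total - n_useful, n_candidates)
    return 1.0 - none_useful / comb(n_total, n_candidates)


# Hypothetical numbers: 5 useful attributes, 500 useless ones added.
n_useful, n_useless = 5, 500
n_total = n_useful + n_useless
n_candidates = isqrt(n_total)  # assumed per-split subset size, about sqrt(p)

p_clean = prob_split_sees_useful(n_useful, n_useful, isqrt(n_useful))
p_noisy = prob_split_sees_useful(n_total, n_useful, n_candidates)
print(f"without useless attributes: {p_clean:.2f}")
print(f"with useless attributes:    {p_noisy:.2f}")
```

Under these assumed numbers only about one split in five is offered a useful attribute at all, so each individual tree is built mostly from noise and more trees are needed for the votes to average out. That would be consistent with the improvement seen with 300 trees, but it is an illustration, not a verified explanation of the result.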

Sunday, March 31, 2013

Archetypes in relationships

There are two archetypes in a relationship. One is boss-and-his-secretary and the other is craftsman-and-his-saleswoman. Both archetypes are based on a difference between the average man and the average woman. Men like to think big and women prefer to think small. For illustration, ask students at an elementary school what they would change if they could change anything. Boys would likely answer with things like world peace or a cure for cancer, while girls would answer with something like no pet excrement on pavements or building a shelter for the local homeless. In other words, women are practical while men are theoretical. Women take care of their family and neighborhood, while men discuss politics. And that's all right, because the sexes complement each other perfectly: the boss handles big deals but is incompetent when it comes to finding his misplaced keys. And here comes his secretary-hero and says, "In your left pocket," saving his day.

Another difference between a typical man and woman is that women are, on average, better at talking than men, because men were hunters and women stayed at home. While men didn't have anyone to talk to during the hunt, women were always accompanied by children and other women. Hence good communication skills improved women's fitness, while men's fitness didn't really increase because they didn't have time to benefit from such skills. Hence a craftsman and his saleswoman are a good combination. He makes products somewhere hidden in the cellar while she shines by communicating with clients.

Of course, you can disagree by saying that your experience is different, that the most talkative person you know is a man, not a woman. But that's all right, because there are more extremes among men than among women. And on average, men are still less talkative than women.

Saturday, March 9, 2013

Google Redirect Notice

Do you get a Google Redirect Notice regardless of the page you want to visit?

If you are using Firefox, the fix is simple. Just use the Redirect Cleaner add-on and the problem is gone.

Monday, February 18, 2013

Review of Fifty Shades of Grey

Some criticism right at the start. The very same descriptions of sex keep repeating throughout Fifty Shades of Grey. That makes them so boring that I could not help skipping them. Yet passages about sex can be written in an interesting way. In Justine by the Marquis de Sade, for example, they were a pleasure to read: every act brought a new piquancy, described moreover with unique language and without a single vulgar word. Well, the author probably had neither a very colorful sex life nor a poetic streak; otherwise I cannot explain the dullness.

Nevertheless, the book contains lovely tropes: a man well versed in sex gives up all his women for his chosen one, the man lets his beloved re-educate him, the man is incredibly attractive (powerful, handsome, intelligent, ...). And yet the chosen one is utterly average. In short, the dream of every woman who was read the Cinderella fairy tale.

Besides that, part of the book is written as an epistolary novel. And that is exactly where the author excels. British humor radiates from the letters and lifts the book several levels higher.

So can I say with a clear conscience that, apart from the boring descriptions of sex, the book is excellent? Definitely not. The book is a product of its time, and I would compare it to the works of Jane Austen: an incredibly popular author in her day, but a century later her works are hopelessly dated. And it seems the publishers are aware of this similarity. Despite its thickness, the book is sold at a very low price thanks to a very cheap binding, which predestines it for quick consumption: read it, lend it three times at most, and let it drown in the bookcase forever.