Sunday, 8 May 2016

Why I lost trust in RapidMiner

I used to like RapidMiner despite its many flaws. But the discovery of a bug in the calculation of ROC and AUC made me furious - if I can't trust the metric that I optimize, how can I trust any model produced by RapidMiner?

To vent my anger, I am publishing a shame list of bugs in RapidMiner.

The list of bugs in RapidMiner 7.1 that I am aware of:
  1. The first step in the ROC (Receiver Operating Characteristic) curve is drawn incorrectly - the first step is always horizontal. Sometimes the error is negligible. Sometimes it is the whole difference between a perfect model with AUC=1 and a random model with AUC=0.5. The bug is most visible on binary estimates of the label (i.e. all estimates are either p=0 or p=1).
  2. AUC (Area Under Curve) is similarly way off (like 0.5 instead of 1.0). The reported value differs from both the expected AUC and the area under the returned (and flawed) ROC curve.
  3. DBI (Davies-Bouldin index) reported by the performance operator is negative. But by definition it can't be negative.
  4. The returned correlation matrix sometimes contains values out of range (like -67). The error is caused by a numerically unstable calculation of variance. Since the correlation matrix is at the heart of many algorithms, this is worrisome.
  5. The operator for declaring missing values causes trouble because the missing values often propagate back into other branches (the operator uses on-the-fly processing, which is buggy in RapidMiner). EDIT: fixed in version 7.4.
  6. Evolutionary optimization often crashes, even on toy datasets like Iris.
  7. Weight by Chi Squared Statistic does not work with date attributes, while other weighting operators (like Gini or Information Gain Ratio) do.
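To make bugs 1 and 2 concrete, here is a minimal sketch of an AUC calculation that handles tied scores, assuming the usual rank-based definition (the probability that a random positive is scored above a random negative, with ties counted as half). The function name `auc_with_ties` is my own and this is not RapidMiner's code; it only shows what value I would expect on binary estimates.

```python
def auc_with_ties(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative counts as half a win, which
    corresponds to a diagonal (not horizontal) ROC segment for tied scores.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(auc_with_ties(labels, [1.0, 1.0, 0.0, 0.0]))  # binary estimates, all correct
print(auc_with_ties(labels, [0.5, 0.5, 0.5, 0.5]))  # constant, uninformative estimates
```

The pairwise loop is quadratic, so it is only meant for checking small cases by hand, not for production use.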
And finally, missing features that bother me:
  1. Whenever I make a plot and re-run the schema, the plot settings are reset.
  2. They removed the "invert" checkbox from the dictionary filter in the text mining extension.
And of course there are deficiencies. For example, linear regression is 200 times slower than linear regression (lm) in R (measured with feature selection and ridge bias turned off in both cases). EDIT: This deficiency was addressed in RapidMiner 7.2.
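As for the out-of-range correlations in bug 4, the sketch below illustrates the kind of instability I mean. It is an assumption on my part that a one-pass "sum of squares" formula is involved; I have not read RapidMiner's source. The function names are hypothetical. The two-pass versions subtract the mean before squaring, which avoids the cancellation.

```python
import math


def naive_variance(xs):
    """One-pass formula (sum of squares minus squared sum).

    Prone to catastrophic cancellation when the mean is large
    compared to the spread of the data.
    """
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(xs):
    """Subtract the mean first, then sum the squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation computed from centered values (two-pass)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Small spread around a large offset: the exact sample variance is 30.
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
ys = [2 * x for x in xs]

print("naive   :", naive_variance(xs))
print("two-pass:", two_pass_variance(xs))
print("pearson :", pearson(xs, ys))
```

A single-pass alternative with similar stability is Welford's online algorithm, which is useful when the data cannot be read twice.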

Thursday, 31 December 2015

Missing values in a decision tree

There are multiple ways a decision tree can deal with missing values in the data.
  1. When a decision has to be made on an attribute that is missing, the scoring of the instance can terminate and the class probabilities of the current node can be returned as the prediction. Note that the implementation has to keep class probabilities not only for the leaves, but for all the nodes in the tree.
  2. Or we may keep statistics about how many samples go into each node. If a decision has to be made on a missing value, the instance goes into the most frequent descendant.
  3. We may also train the tree to score based on a missing value. For example, suppose a split is learned on a continuous attribute and 90% of the training samples go into the right descendant. The model can still learn that instances with a missing value are, judged by their class labels, more similar to the instances in the left descendant. Hence it sends the instances with the missing attribute value left.
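The three strategies above can be sketched at scoring time as follows. This is a toy illustration, not taken from any particular library: the `Node` fields, the strategy names ("stop", "majority", "learned") and the example numbers are all my own assumptions, and the training step that would set `missing_goes_left` is left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    class_probs: Dict[str, float]    # kept for every node, not only leaves (strategy 1)
    n_samples: int                   # training samples that reached this node (strategy 2)
    attribute: Optional[str] = None  # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None    # taken when value <= threshold
    right: Optional["Node"] = None   # taken when value > threshold
    missing_goes_left: bool = True   # direction learned during training (strategy 3)


def predict(node, instance, strategy):
    """Walk down the tree; `strategy` decides what happens on a missing value."""
    while node.attribute is not None:
        value = instance.get(node.attribute)
        if value is None:
            if strategy == "stop":
                return node.class_probs
            if strategy == "majority":
                go_left = node.left.n_samples >= node.right.n_samples
            elif strategy == "learned":
                go_left = node.missing_goes_left
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        else:
            go_left = value <= node.threshold
        node = node.left if go_left else node.right
    return node.class_probs


# 90% of the training samples go right, yet missing values are sent left.
left = Node(class_probs={"a": 0.9, "b": 0.1}, n_samples=10)
right = Node(class_probs={"a": 0.2, "b": 0.8}, n_samples=90)
root = Node(class_probs={"a": 0.27, "b": 0.73}, n_samples=100,
            attribute="x", threshold=5.0, left=left, right=right,
            missing_goes_left=True)

for strategy in ("stop", "majority", "learned"):
    print(strategy, predict(root, {"x": None}, strategy))
```

Keeping all three pieces of information in every node costs little memory and lets the strategy be chosen at scoring time.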

Saturday, 12 December 2015

Money


There is one thing that isolates a person from humanity better than a prison ever could - money. If you are rich, people are not going to forgive you. There are different strategies for coping with the isolation:

  1. Bribe artists to like you by buying their art.
  2. Bribe scientists to like you by supporting their research.
  3. Bribe women to like you by hiring prostitutes.
  4. Seize power to force people to listen to you.
  5. Take drugs to forget about the world.
  6. Substitute civilization and humanity with nature.
  7. Believe in something with your whole heart.



Comparison of import.io and OutWit

When import.io was released, I was excited. However, the excitement has faded, for the following reasons:
  1. Whenever you define a crawler, you always have to provide at least 5 examples, even when you know that just 2 examples would be enough.
  2. The interface is sluggish even in the offline version of import.io.
  3. The crawling is approximately 10 times slower than in OutWit.
  4. The export is not satisfactory. If you tell import.io to export the data to CSV, import.io strips all commas from the scraped text. If you need to preserve the commas, you can still export the data to XLS or JSON. But Excel has a limit on the length of text in a cell, and when you exceed the limit, you cannot open the file. JSON is not a workable solution either, because the characters in the text are not always correctly escaped, which makes the JSON invalid. Hence, after several hours of web scraping with import.io, you find yourself unable to get your data out of import.io.
While OutWit irritates me with its deep context menus, at least it does its job.
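For comparison, neither CSV nor JSON requires dropping characters from the text. The sketch below is not import.io's exporter; it only uses Python's standard `csv` and `json` modules to show that commas and quotes survive a round trip when fields are quoted and escaped properly. The helper names and the sample text are made up for the illustration.

```python
import csv
import io
import json


def roundtrip_csv(fields):
    """Write one row to CSV text and read it back."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)  # quotes fields containing commas or quotes
    return next(csv.reader(io.StringIO(buffer.getvalue())))


def roundtrip_json(record):
    """Serialize a record to JSON text and parse it back."""
    return json.loads(json.dumps(record))  # dumps escapes quotes and backslashes


text = 'Prague, Czech Republic: a "quoted" word and a back\\slash'

print(roundtrip_csv(["id-1", text]))
print(roundtrip_json({"id": "id-1", "text": text}))
```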

Thursday, 19 November 2015

Metric

A metric has to fulfill the following properties:
  1. Be symmetric
  2. Be zero for d(A, A)
  3. Be non-negative
  4. Follow the triangle inequality
Symmetry is a useful property - the distance between city A and city B should be the same as the distance between city B and city A. However, if we were measuring the gas consumption of a car, and city B was on a hill and city A was in the valley below the hill, we would expect different results for the route from A to B (up the hill) than for the route from B to A (down the hill).

Similarly, the zero property may not always be fulfilled - for example, a taxi fare is not a metric, because you pay a minimum fare just for getting into the taxi.

Non-negativity is not fulfilled whenever we are interested in the direction. For example, in money transactions we are usually quite interested in the direction of the transfer.

The triangle inequality does not always hold either. For instance, suppose the weights between nodes represent the time required to travel between the points that the nodes stand for. To make this scenario more physical, consider 3 points: A, B, and C. Imagine that A and C are separated by a lake, whereas A and B are on a bank such that I can walk from A to B on land, and from B to C on land. Let's say I can walk the ABC path in about 15 minutes, but it will take me 30 minutes to swim across the lake from A to C. This situation is physically possible, but it violates the triangle inequality, because removing the intermediate point B increases the cost of my path.
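The four properties can be checked mechanically on a finite set of points. The following is a small sketch with names of my own choosing (`metric_violations`, `lake_time`); splitting the 15-minute walk evenly into 7.5 minutes per leg is an assumption made only to have concrete numbers.

```python
from itertools import product


def metric_violations(points, d, tol=1e-12):
    """Return (property, witnesses) for every metric axiom violated on `points`."""
    violations = []
    for a in points:
        if abs(d(a, a)) > tol:
            violations.append(("zero", (a,)))
    for a, b in product(points, repeat=2):
        if d(a, b) < -tol:
            violations.append(("non-negativity", (a, b)))
        if abs(d(a, b) - d(b, a)) > tol:
            violations.append(("symmetry", (a, b)))
    for a, b, c in product(points, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:
            violations.append(("triangle", (a, b, c)))
    return violations


# Direct travel times in minutes: walking A-B and B-C, swimming A-C.
direct_minutes = {
    frozenset("AB"): 7.5,   # assumed: half of the 15-minute walk
    frozenset("BC"): 7.5,   # assumed: the other half
    frozenset("AC"): 30.0,  # swimming across the lake
}


def lake_time(a, b):
    return 0.0 if a == b else direct_minutes[frozenset((a, b))]


print(metric_violations(["A", "B", "C"], lake_time))
print(metric_violations([0, 1, 3], lambda x, y: abs(x - y)))
```

The check enumerates all triples, so it is meant for small hand-made examples like this one.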

Monday, 2 November 2015

Will the Czech Republic adopt the Euro?

Looking back:

In times of plenty, it pays for states to form a flotilla, because all the ships are heading in roughly the same direction - toward wealth and prosperity. And for the crews it is convenient that they can trade freely with each other. The problem comes when a storm arrives. For big ships, like Germany, it pays to keep taking the shortest route toward wealth and prosperity even in a storm, because a small storm does not threaten them. But for small ships it may be better to turn into the waves, so that the waves do not swamp them, and to set out toward wealth and prosperity only once the storm has passed.

When a state has its own currency, it has control of the rudder. But when the state adopts the Euro, it loses that control.

As I have already said, in times of prosperity the Euro is advantageous. But in times of crisis the Euro is deadly, especially for a small economy.

At present, the best outcome for the Czech Republic is that most countries adopt the Euro while the Czech Republic itself does not. Life gets simpler for its citizens - when they travel abroad and have Euros left over, they can calmly put them in a drawer, because next year they will again travel to a country where people pay in Euros. At the same time, the state keeps the ability to steer in bad weather.