Monday, May 30, 2016

In what data type to store year?

If you are using MySQL, use the "year" data type, if possible. The reasons follow:
  1. It takes just 1 byte. Hence, it saves storage.
  2. Date arithmetic can be applied without a conversion. Hence, it saves processor time.
  3. The reference point is well defined and unique. Hence, everyone can select the right attribute value for an event without any background knowledge (at least if you are in a country that uses the Gregorian calendar).
  4. Visualization and analytical tools automatically treat a "year" attribute as a temporal variable. That saves the user's time.
But of course, the "year" data type has some disadvantages. Namely, it has a narrow range of values (1901-2155), so it can happen that the value you need to store is out of range. An example could be the year when an organization was established. On the other hand, this limitation can work to your advantage - if someone types a year with 3 or 5 digits, the database will complain. And the database will most likely be right, because the typed year is most likely a typo.
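
For illustration, below is a minimal sketch in Python using the mysql-connector-python package; the connection parameters and the "events" table are placeholders I made up, not taken from any real schema:

import mysql.connector

# Placeholder connection details - adjust to your own server.
conn = mysql.connector.connect(host="localhost", user="user",
                               password="password", database="test")
cur = conn.cursor()

# A YEAR column takes 1 byte and accepts values from 1901 to 2155.
cur.execute("CREATE TABLE IF NOT EXISTS events (name VARCHAR(100), happened YEAR)")
cur.execute("INSERT INTO events VALUES ('conference', 2016)")

# Date arithmetic works directly on the YEAR column, no conversion needed.
cur.execute("SELECT name, YEAR(CURDATE()) - happened AS years_ago FROM events")
print(cur.fetchall())

# In strict SQL mode, an out-of-range value (a 3- or 5-digit typo,
# or a year like 1348) is rejected instead of being silently stored.
conn.commit()
conn.close()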

Thursday, May 26, 2016

Is it possible to evaluate a campaign without a control group?

First, this is the domain of so-called "quasi-experimental studies". But all these quasi-experiments need a time dimension. Can we design an experimental study that uses neither a control group nor repeated measurements over time?

If we assume additivity of the campaigns, all we have to do is obtain as many independent equations as there are unknown variables. For example, let's imagine we have two campaigns, the first one with response rate A and the second one with response rate B. Let's also imagine that if we do not expose a user to any campaign, then the response rate of the user is N. And let's imagine that we can combine campaigns, either because each uses a different channel or because each promotes a different product. Then we can set up 3 groups of users:

  N+A = 5
  N+B = 7
  N+A+B = 9

The first group of users is exposed to the first campaign and the response rate is 5%. The second group of users is exposed to the second campaign and the response rate is 7%. And the last group of users is exposed to both campaigns and the resulting response rate is 9%.

Since we have 3 independent equations and 3 unknown variables, we can calculate that the first campaign has an uplift of 9-7 = 2 percentage points (A = (N+A+B) - (N+B)), that the second campaign has an uplift of 9-5 = 4 percentage points (B = (N+A+B) - (N+A)), and that the baseline response rate is N = 3 percent.
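
As a sanity check, here is a minimal sketch in Python/NumPy that solves these three equations (the variable names are mine):

import numpy as np

# Rows correspond to the three groups: N+A = 5, N+B = 7, N+A+B = 9.
design = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 1, 1]], dtype=float)
response = np.array([5.0, 7.0, 9.0])

n, a, b = np.linalg.solve(design, response)
print(n, a, b)  # 3.0 2.0 4.0 -> baseline 3%, uplifts of 2 and 4 percentage points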

Of course, this approach neglects interactions between the campaigns. But we can model interactions as well. In this example, let's consider 3 campaigns and 5 groups with the following response rates:

 N+A = 5
 N+B = 7
 N+A+B+AB = 9
 N+C = 8
 N+A+C = 3

where AB represents an interaction of the first two campaigns. We can put the equations into matrix form:

 matrix = [1 1 0 0 0; 1 0 1 0 0; 1 1 1 1 0; 1 0 0 0 1; 1 1 0 0 1] 
 response = [5; 7; 9; 8; 3] 
 x = [?; ?; ?; ?; ?]

Where x = [N; A; B; AB; C] is the vector of unknown variables. It holds that:

 matrix*x = response

With linear algebra:

 x = inv(matrix) * response

We can get that x = [10; -5; -3; 7; -2], i.e. N = 10, A = -5, B = -3, AB = 7 and C = -2.
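
The same computation as a sketch in Python/NumPy (using np.linalg.solve instead of an explicit inverse, which gives the same result but is numerically safer):

import numpy as np

# Columns of the matrix correspond to the unknowns [N, A, B, AB, C].
matrix = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 0, 0],
                   [1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 1],
                   [1, 1, 0, 0, 1]], dtype=float)
response = np.array([5.0, 7.0, 9.0, 8.0, 3.0])

x = np.linalg.solve(matrix, response)
print(x)  # [10. -5. -3.  7. -2.]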

To get higher confidence in the estimates, it is possible to use more independent equations than there are unknown variables and fit the unknown variables with a least squares estimate (or another method of your choice).
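
With NumPy, the least squares fit is one call to np.linalg.lstsq. In the sketch below, one extra, made-up group (N+B+C with a 6% response rate) is appended to the system above just to make it overdetermined:

import numpy as np

# The five groups from above plus one hypothetical extra group (N+B+C = 6).
matrix = np.array([[1, 1, 0, 0, 0],
                   [1, 0, 1, 0, 0],
                   [1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 1],
                   [1, 1, 0, 0, 1],
                   [1, 0, 1, 0, 1]], dtype=float)
response = np.array([5.0, 7.0, 9.0, 8.0, 3.0, 6.0])

# Least squares estimate of [N, A, B, AB, C].
x, residuals, rank, _ = np.linalg.lstsq(matrix, response, rcond=None)
print(x)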

Sunday, May 8, 2016

Why I lost trust in RapidMiner

I used to like RapidMiner despite the many flaws it has. But the discovery of a bug in the calculation of ROC and AUC made me furious - if I can't trust the metric that I optimize, how can I trust any model produced by RapidMiner?

To vent my anger, I am publishing a shame list of bugs in RapidMiner.

The list of bugs in RapidMiner 7.1 that I am aware of:
  1. The first step of the ROC (Receiver Operating Characteristic) curve is not drawn correctly - the first step is always horizontal. Sometimes the error is negligible. Sometimes it is the whole difference between a perfect model with AUC=1 and a random model with AUC=0.5. The bug is best visible on binary estimates of the label (i.e. all estimates are either p=0 or p=1).
  2. AUC (Area Under the Curve) is similarly way off (like 0.5 instead of 1.0). The reported value differs from both the expected AUC and the area under the returned (and flawed) ROC curve.
  3. DBI (Davies-Bouldin index) reported by the performance operator is negative. But by definition it cannot be negative.
  4. The returned correlation matrix sometimes contains values out of range (like -67). The error is caused by an unstable calculation of variance. Since the correlation matrix is at the heart of many algorithms, this is worrisome.
  5. The operator for declaring missing values causes trouble because the missing values often propagate back to other branches (this is because the operator uses on-the-fly processing, which is buggy in RapidMiner). EDIT: fixed in version 7.4.
  6. Evolutionary optimization often crashes. Even on toy datasets like Iris.
  7. Weight by Chi Squared Statistic does not work with date attributes while other weighting operators (like Gini or Information Gain Ratio) work.
And finally, missing features that bother me:
  1. Whenever I make a plot and re-run the schema, the plot settings are reset.
  2. They removed an "invert" checkbox from dictionary filter in text mining extension.
And of course, there are performance deficiencies. For example, linear regression is 200 times slower than linear regression (lm) in R (measured with feature selection and ridge bias turned off in both scenarios). EDIT: This deficiency was addressed in RapidMiner 7.2.