Thursday, December 31, 2015

Missing values in a decision tree

There are multiple ways a decision tree can deal with missing values in the data (a small scoring sketch follows the list):
  1. When a decision has to be made on an attribute that is missing, scoring of the instance can stop and the class probabilities of the current node can be returned as the prediction. Note that the implementation then has to keep class probabilities not only in the leaves, but in all nodes of the tree.
  2. Or we may keep statistics about how many training samples go into each node. If a decision has to be made on a missing value, the instance goes into the most frequent descendant.
  3. We may also train the tree how to score on a missing value. For example, if a split is learned on a continuous attribute and the split sends 90% of the training samples into the right descendant, the model can also learn that instances with a missing value are, judging by their class labels, more similar to the instances in the left descendant. Hence it sends instances with the missing attribute value to the left.
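A minimal sketch of the first two strategies (illustrative Python, not any particular library's implementation; the node structure and attribute names are made up):

class Node:
    def __init__(self, attribute=None, threshold=None, class_probs=None,
                 left=None, right=None, left_count=0, right_count=0):
        self.attribute = attribute        # attribute used for the split (None in a leaf)
        self.threshold = threshold        # split threshold for a continuous attribute
        self.class_probs = class_probs    # class distribution of training samples in this node
        self.left, self.right = left, right
        self.left_count, self.right_count = left_count, right_count  # training samples per child

def predict_proba(node, instance, missing_strategy="stop"):
    while node.attribute is not None:
        value = instance.get(node.attribute)
        if value is None:                   # missing value encountered
            if missing_strategy == "stop":  # strategy 1: return this node's class distribution
                return node.class_probs
            # strategy 2: follow the child that received more training samples
            node = node.left if node.left_count >= node.right_count else node.right
            continue
        node = node.left if value <= node.threshold else node.right
    return node.class_probs

# Example: a stump splitting on "age" at 40, with 90% of training samples going left.
leaf_l = Node(class_probs={"yes": 0.9, "no": 0.1})
leaf_r = Node(class_probs={"yes": 0.2, "no": 0.8})
root = Node(attribute="age", threshold=40, class_probs={"yes": 0.55, "no": 0.45},
            left=leaf_l, right=leaf_r, left_count=90, right_count=10)
print(predict_proba(root, {"age": None}))                    # {'yes': 0.55, 'no': 0.45}
print(predict_proba(root, {"age": None}, "most_frequent"))   # {'yes': 0.9, 'no': 0.1}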

Saturday, December 12, 2015

Money


There is one thing that isolates a person from humanity better than a prison ever could - money. If you are rich, people are not going to forgive you for it. There are different strategies for coping with the isolation:

  1. Bribe artists to like you by buying their art.
  2. Bribe scientists to like you by supporting their research.
  3. Bribe women to like you by hiring prostitutes.
  4. Seize power to force people to listen to you.
  5. Take drugs to forget about the world.
  6. Substitute civilization and humanity with nature.
  7. Believe in something with your whole heart.



Comparison of import.io and OutWit

When import.io was released, I was excited. However, the excitement disappeared. The reasons follow:
  1. Whenever you define a crawler, you always have to provide at least 5 examples, even when you know that in a given case just 2 examples would be enough.
  2. The interface is sluggish even in the offline version of import.io.
  3. The crawling is approximately 10 times slower than in OutWit.
  4. The export is not satisfactory. If you tell import.io to export the data into CSV, it strips away all commas from the scraped text. If you need to preserve the commas, you can still export the data as XLS or JSON. But Excel has a limit on the length of text in a cell, and once you exceed that limit, you cannot open the file. JSON is not a workable solution either, because the characters in the text are not always correctly escaped, which makes the JSON invalid. Hence, after several hours of web scraping with import.io, you find yourself unable to get your data out of import.io.
While OutWit irritates me with its deep context menus, at least it does its work.

Thursday, November 19, 2015

Metric

A metric has to fulfill the following properties (written out formally after the list):
  1. Be symmetric
  2. Zero for d(A, A)
  3. Be non-negative
  4. Follow the triangle inequality
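These are the standard textbook conditions on a distance function d over points A, B, C:

    d(A, B) = d(B, A)                         (symmetry)
    d(A, A) = 0                               (identity)
    d(A, B) >= 0                              (non-negativity)
    d(A, C) <= d(A, B) + d(B, C)              (triangle inequality)
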
Symmetry is a useful property - the distance between city A and city B should be the same as the distance between city B and city A. However, if we were measuring the gas consumption of a car, and city B was on a hill while city A was in the valley below it, we would expect to see a different result for the route from A to B (up the hill) than for the route from B to A (down the hill).

Similarly, the zero property may not always be fulfilled - for example, a taxi fare is not a metric, because you pay a minimum fare just for getting into the taxi.

Non-negativity is not fulfilled whenever we are interested in direction. For example, in money transactions we are commonly quite interested in the direction of the transfer.

The triangle inequality does not always hold either. For instance, imagine that the weights between nodes represent the time required to travel between the points represented by the nodes. To make the scenario more physical, consider three points A, B and C. Imagine that A and C are separated by a lake, whereas A and B lie on the bank such that I can walk from A to B on land, and from B to C on land. Let's say I can walk the A-B-C path in about 15 minutes, but it would take me 30 minutes to swim across the lake from A to C. The scenario is physically possible, but it does not satisfy the triangle inequality, because removing an intermediate point increases the cost of my path.

Monday, November 2, 2015

Will the Czech Republic adopt the Euro?

Looking back:

In times of plenty it is advantageous for states to form a convoy, because all the ships are heading in roughly the same direction - toward wealth and prosperity. And for the crews it is advantageous that they can trade freely with one another. The problem arises when a storm comes. For big ships, like Germany, it pays to keep sailing the shortest route toward wealth and prosperity even during a storm, because a small storm does not threaten them. For small ships, however, it may be better to turn into the waves so that the waves do not swamp them, and to set course for wealth and prosperity again only once the storm has passed.

When a state has its own currency, it has control over the rudder. When it adopts the Euro, it gives that control up.

As I have already said, in times of plenty the Euro is advantageous. But in times of crisis the Euro is deadly, especially for a small economy.

At the moment, the most advantageous position for the Czech Republic is for most other countries to adopt the Euro while the Czech Republic itself does not. Life gets simpler for its citizens - when they travel abroad and come back with Euros left over, they can calmly put them in a drawer, because next year they will again travel to a country that pays with the Euro. And yet the state keeps the ability to steer when the weather turns bad.

Saturday, October 31, 2015

I hate OS X El Capitan

The following quirks of the new OS X drive me crazy:
  1. I like it when typed text shows up on the display immediately. However, on El Capitan the text can sometimes lag by whole seconds. This is a system-wide problem - the "feature" affects Mail just as much as TextEdit. And when text lags behind my typing even in an undemanding application like TextEdit, I get furious.
  2. Why do they have to move the previous/next week controls in Calendar in every version? Every time I finally get accustomed to the new position, they move it from one side of the window to the other. Guys, make up your minds!
  3. The GUI in some applications, like Navicat, got unbearably sluggish. This will likely get fixed with updates of the affected applications. But in the meantime, I am not happy.
  4. With each update of OS X the GUI (or the display setting) gets more colorful. The change is subtle - I notice it only for the first few hours. But still, I am terrified of where this trend will end.

Sunday, October 18, 2015

The optimal length of text

When I was working with text I noticed that the optimal length of text is somewhere around 100 characters.

The estimate is based on the following measures:
  • Twitter: 140 characters
  • Performance of bank tellers based on their notes about the clients: 100-150 characters
  • Dating websites: 97 characters

Sunday, June 7, 2015

What is wrong with SQL

There are several things I wish were different in SQL:
  1. When you want to select a whole table except for one column, how do you approach it in (pure) SQL? You have to write out the names of all the columns that you want to keep. If you have a table with 1000 columns, you have to name 999 columns. In other languages, like R (with a negative index) or SAS (with the drop statement), you would just name the single column that you don't want to keep. SQL should have the same functionality (a programmatic workaround is sketched after this list).
  2. Intelligent code completion in SQL simply doesn't work very well. First you have to select the names of the columns you want to keep (out of thousands of columns in the schema), and only then do you name the source tables (out of tens of tables in the schema). If the order were just reversed, intelligent code completion would work like a charm - select the table names out of tens of tables and then select the column names from the list of columns present in the selected tables. LINQ got this order right.
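Since the column list can be read from database metadata, the "all columns but one" SELECT can at least be generated programmatically. A minimal sketch using SQLite; the table and column names are made up for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, name TEXT, secret TEXT)")

# Read the table's columns from metadata and drop the one we do not want.
columns = [row[1] for row in conn.execute("PRAGMA table_info(t1)")]
kept = [c for c in columns if c != "secret"]

query = "SELECT {} FROM t1".format(", ".join(kept))
print(query)   # SELECT id, name FROM t1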

Monday, May 25, 2015

MAPE - Mean Absolute Percent Error

The problem with percentage errors that is often overlooked is that they assume a meaningful zero. For example, a percentage error makes no sense when measuring the accuracy of temperature forecasts on the Fahrenheit or Celsius scales.

It means that percentage errors are useful only when changing the scale does not change the percentage error. With Celsius and Fahrenheit it does: if the forecast is 10 degrees and the actual value is 11 degrees Celsius, the percentage error is 9.09%. But 10 degrees Celsius is 50 degrees Fahrenheit and 11 degrees Celsius is 51.8 degrees Fahrenheit, so in Fahrenheit the percentage error of the same forecast is only 3.47%.

But when we consider units like metres and inches, we have a meaningful zero. If the forecast is 10 m and the actual value is 11 m, the percentage error is 9.09%. 10 m is 393.7 inches and 11 m is 433.07 inches, and the percentage error is again 9.09%.

MAPE also has the disadvantage that it puts a heavier penalty on negative errors than on positive errors.

This means that an over-forecast (a negative error) produces a higher MAPE than an under-forecast of the same size. For example, if the forecast is 27 and the actual value is 23, the absolute percentage error is 17.4%; if the forecast is 23 and the actual value is 27, it is 14.8%. The absolute difference is 4 in both cases, but the negative error yields the higher percentage error.
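A few lines of code to verify the numbers above (the absolute percentage error of a single observation, |actual - forecast| / |actual|):

def ape(actual, forecast):
    return abs(actual - forecast) / abs(actual) * 100

# Scale dependence: the same one-degree miss in Celsius vs. Fahrenheit.
print(round(ape(11, 10), 2))        # 9.09  (Celsius)
print(round(ape(51.8, 50.0), 2))    # 3.47  (Fahrenheit)

# Asymmetry: the same absolute miss of 4, over-forecast vs. under-forecast.
print(round(ape(23, 27), 1))        # 17.4  (over-forecast is penalized more)
print(round(ape(27, 23), 1))        # 14.8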

Thursday, May 21, 2015

Why to use cosine distance instead of Euclidean distance in text mining?

Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.


The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Key idea of cosine distance:
Rank documents according to angle with query.

But even if we normalize the data and use Euclidean distance, we still get into trouble when the data are sparse (contain many zeros):

Taking the Euclidean metric, we can write: |x - y|^2 = |x|^2 + |y|^2 - 2(x·y).
If the space is sparse, then the dot product x·y is zero most of the time. Hence the metric degrades into |x|^2 + |y|^2, and that is useless as a distance measure.

Cosine distance, on the other hand, looks only at the dot product x·y. Hence it is not adversely impacted by the combination of high dimensionality and sparseness [SAS].

But of course, sometimes we don't want to look only at the angle between the vectors. Hence for dense data it is still advisable to use Euclidean distance.
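A minimal sketch of the length argument above, with toy term-count vectors made up for illustration - a short query and a long document with the same term distribution:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

q  = [1, 2, 0, 1]       # short query
d2 = [10, 20, 0, 10]    # long document with the same term distribution

print(round(euclidean(q, d2), 2))           # 22.05 - large, despite similar content
print(round(cosine_similarity(q, d2), 2))   # 1.0   - identical direction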

For further discussion, see a related question at stats.stackexchange.com.

Wednesday, May 6, 2015

Why does the F-measure use the harmonic mean?

I will start with a story and then we will transfer the knowledge from the story to the F-measure.

Once upon a time, a friend of mine shared a problem with me. He had a website about cars, used equally by Americans and Canadians. He wanted to calculate the average fuel consumption for each type of car. But Americans use miles per gallon while Canadians use litres per 100 kilometres. My friend was educated enough to know that he has to convert all the measurements to the same units before he averages them. But to his surprise, when he calculated the arithmetic mean of the measurements in the American system and converted the result to the Canadian system, he got a different result from the arithmetic mean of the measurements in the Canadian system. So which unit system is the right one to use?

If he decided to use the American system without any justification, he would upset the Canadians. If he used the Canadian system without any justification, he would upset the Americans. And because Americans and Canadians visited my friend's web page equally, he didn't feel comfortable going with either system without any reasoning.

I suggested that he use a diplomatic solution and go with the geometric mean, because the geometric mean returns an average that is directly convertible to the other system. Hence, in the documentation of the system he doesn't have to spell out the units; all he has to say is that he is using the geometric mean. An example follows:

Let's imagine that my friend wanted to average two cars, one with a consumption of 5 l/100 km and the second with a consumption of 10 l/100 km. The arithmetic mean of these two values is 7.5 l/100 km. But if he first converted the values into miles per gallon (for example with Google, using the query: 5 l/100 km in miles/gallon) and then calculated the arithmetic mean, he would get 35.25 miles/gallon. And 35.25 miles/gallon is approximately 6.67 l/100 km. And 6.67 is quite different from 7.5!

Nevertheless, if he used the geometric mean, he would get 7.07 l/100 km. And that is 33.23 miles/gallon, as expected. Everything works nicely.

But was it correct to use the geometric mean? Not quite. If we want to calculate the average consumption for a single driver, it is appropriate to use the arithmetic mean in the Canadian system. But if we want to calculate the average consumption for a state, it is appropriate to calculate the arithmetic mean in the American system. Why?

Let's imagine that our hypothetical driver has to drive exactly 100 km to work and back every day, and that he alternates between his two cars daily (with the consumptions from the example above). What is his average consumption? It is 7.5 l/100 km, because the 100 km are fixed.

And now let's consider a state which can afford to provide exactly 1 gallon of gasoline per day, with the two cars alternating in consuming it. One day the first car consumes the gallon; the next day the second car does. What is the average consumption? In this scenario the correct answer is 35.25 miles/gallon (~6.67 l/100 km), because the 1 gallon is fixed (reference).

The geometric mean always returns the same consumption (~7.07 l/100 km) regardless of the units being averaged. But it doesn't really answer either of the hypothetical scenarios above.
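A few lines of code checking the numbers above (the conversion constant of roughly 235.21 mpg per 1 l/100 km is approximate; its exact value does not matter for the argument):

from statistics import mean, geometric_mean, harmonic_mean  # Python 3.8+

L_PER_100KM_TO_MPG = 235.21

cars_lpk = [5.0, 10.0]                                  # consumption in l/100 km
cars_mpg = [L_PER_100KM_TO_MPG / c for c in cars_lpk]   # ~47.0 and ~23.5 mpg

print(round(mean(cars_lpk), 2))                                  # 7.5  l/100 km (fixed distance)
print(round(L_PER_100KM_TO_MPG / mean(cars_mpg), 2))             # 6.67 l/100 km (fixed fuel)
print(round(geometric_mean(cars_lpk), 2))                        # 7.07 l/100 km
print(round(L_PER_100KM_TO_MPG / geometric_mean(cars_mpg), 2))   # 7.07 l/100 km again
print(round(harmonic_mean(cars_lpk), 2))                         # 6.67 - the arithmetic mean in mpg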

Now, how do we transfer the knowledge about fuel consumption to the F-measure? Let's first define the elements that are averaged in the F-measure:
    Precision = TP/(TP+FP)
    Recall = TP/(TP+FN)
If we took the arithmetic mean of precision and recall, we would be assuming that the denominators are constant (recall the examples with the fuel consumption). Unfortunately, precision and recall use different denominators. However, precision and recall share the same numerator. Couldn't we just calculate the arithmetic mean of the inverted precision and recall, and then invert the result, to get an average that is formally acceptable? The answer is yes, we can! And this process has a special name. The name is "harmonic mean".

You could have noticed in the example above that the arithmetic mean in the American system produces the harmonic mean in the Canadian system, and vice versa. This is a general property of the arithmetic and harmonic means - they are complementary. And it applies everywhere. For example, if you have two different resistors in series and you want to replace them with two resistors of equal resistance without changing the total resistance, you use the arithmetic mean. But if the resistors were in parallel, you would use the harmonic mean (reference).

Tl;dr: 
Arithmetic mean is applicable if the denominators of the averaged values are the same.
Harmonic mean is applicable if the numerators of the averaged values are the same.
Geometric mean is the last resort, if neither the numerators nor the denominators are the same (reference).

In the F-measure the denominators are different, but the numerators are the same. Hence the harmonic mean was the best choice for the average.

PS
Harmonic mean of precision and recall can be written as:
Harmonic mean = 2 * 1/(1/Precision + 1/Recall)
                          = 2 * 1/(1/(TP/(TP+FP)) + 1/(TP/(TP+FN)))
                          = 2 * 1/((TP+FP)/TP + (TP+FN)/TP)
                          = 2 * 1/((TP+FP+TP+FN)/TP)
                          = 2*TP/(2*TP+FP+FN)

While the arithmetic mean of precision and recall can hardly be simplified beyond:
Arithmetic mean = (Precision + Recall)/2
                            = (TP/(TP+FP) + TP/(TP+FN))/2
                            = (TP*(TP+FN) + TP*(TP+FP)) / (2*(TP+FP)*(TP+FN))
                            = TP*(2*TP+FN+FP) / (2*(TP+FP)*(TP+FN))

With a common numerator but different denominators, the harmonic mean is truly nicer than the arithmetic mean.
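A quick numeric check of the derivation, with made-up confusion-matrix counts:

TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)     # 0.8
recall    = TP / (TP + FN)     # 0.666...

f1_from_means  = 2 / (1 / precision + 1 / recall)
f1_from_counts = 2 * TP / (2 * TP + FP + FN)

print(round(f1_from_means, 4), round(f1_from_counts, 4))   # 0.7273 0.7273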

Edit:
Another look at the problem can be found in "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze:
Why do we use a harmonic mean rather than the simpler average (arithmetic mean)? Recall that we can always get 100% recall by just returning all documents, and therefore we can always get a 50% arithmetic mean by the same process. This strongly suggests that the arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in 10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean. When the values of two numbers differ greatly, the harmonic mean is closer to their minimum than to their arithmetic mean.
Also, a quote from A survey of named entity recognition and classification written by D. Nadeau:
The harmonic mean of two numbers is never higher than the geometrical mean. It also tends towards the least number, minimizing the impact of large outliers and maximizing the impact of small ones. The F-measure therefore tends to privilege balanced systems.

Sunday, March 29, 2015

Artificial intelligence

People like to be scared. That is why fairy tales for children are so brutal. Teenagers watch horror movies and adults watch the TV news.

And so when, during the economic crisis, global warming and genetically modified crops stopped pulling the crowds, a new, not yet worn-out bogeyman was needed. Artificial intelligence became that bogeyman.

Public figures draw attention to themselves with it. The media have something to write about. People are entertained. And the best part is that in the end the fears will be confirmed.

When artificial intelligence started, it was expected that computers would understand spoken language within five years. Within ten years at the latest. But it took fifty years before computers learned merely to transcribe spoken words into text. And if the best predictor of the future is the past, it will take another fifty years before we can say without blushing that we have artificial intelligence.

And fifty years is the optimal horizon for panic. It is short enough for us to care, because it will affect our children. On the other hand, it is long enough for the panic to have time to grow.

Steve must be having a great time :).

Friday, March 6, 2015

How to deal with overfitting

1) Measure it (with cross-validation, for example; see the sketch after this list)
2) Decrease it
  - Get more data
  - Decrease the size of the hypothesis space (for example, decrease the degree of the polynomial in regression, limit the decision tree size, assume attribute independence in Bayes, ...)
  - Introduce bias (for example, L1 or L2 regularization in regression, ensembles of different classifiers, an operator's background knowledge, ...)
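A minimal sketch of steps 1 and 2, assuming scikit-learn is available (the dataset and the max_depth value are arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 1) Measure overfitting: compare training accuracy with cross-validated accuracy.
big_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(big_tree.score(X, y))                            # ~1.0 on the training data
print(cross_val_score(big_tree, X, y, cv=5).mean())    # noticeably lower

# 2) Decrease it by shrinking the hypothesis space: limit the tree size.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(small_tree, X, y, cv=5).mean())  # the gap usually shrinks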

Sunday, March 1, 2015

Why I gave up on writing a thorough SQL translator

While it's easy to translate the common stddev_samp into stdev when MSSQL is used, the situation can get more complicated. The first level of complication is date and time data types. Let's consider adding a month to a date in several databases:
  1. MSSQL: DATEADD(MONTH, @amount, @date)
  2. MySQL: DATE_ADD(@date, INTERVAL @amount MONTH)
  3. Oracle: ADD_MONTHS(@date, @amount)
  4. PostgreSQL: (@date + INTERVAL '@amount MONTH')
Now it's no longer a mere find-and-replace (even once we get rid of the entities). A small sketch of why follows.
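The argument order differs per dialect, so a translator has to rebuild the whole expression rather than just rename the function. A hedged illustration (dialect names and templates mirror the list above; this is not the translator itself):

DATE_ADD_MONTH = {
    "mssql":      "DATEADD(MONTH, {amount}, {date})",
    "mysql":      "DATE_ADD({date}, INTERVAL {amount} MONTH)",
    "oracle":     "ADD_MONTHS({date}, {amount})",
    "postgresql": "({date} + INTERVAL '{amount} MONTH')",
}

def add_months(dialect, date_expr, amount):
    return DATE_ADD_MONTH[dialect].format(date=date_expr, amount=amount)

print(add_months("mysql", "order_date", 3))   # DATE_ADD(order_date, INTERVAL 3 MONTH)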

The second level of complication is missing functions, like correlation:

(Avg(@numericalColumn * @timeColumn) - Avg(@numericalColumn) * Avg(@timeColumn)) / (StdDev_Samp(@numericalColumn) * StdDev_Samp(@timeColumn)) "@columnName"


It's verbose, but doable. However, there are so many functions in, for example, Oracle, that it would be too much work for one person to reach completeness. And if I can't do it all myself, I have to rely on the work of others. And if I am relying on the work of others, I have to make the project as approachable as possible.

Saturday, February 28, 2015

Why did I implement my own Java-SQL framework

Many libraries focus on Object Relational Mapping (ORM) and some on Data Definition Language (DDL). Each library supports a subset of SQL: 
  1. ORM: select, insert, update, delete
  2. DDL: create table, create schema, create database
But if we need a combination of commands like in:
    create table t2 as
    select *
    from t1
we are doomed, because only a small subset of libraries is expressive enough to cover this scenario.

To make things worse, the majority of database frameworks focus on Online Transaction Processing (OLTP), but I have to work with Data Warehouse (DW) databases:
  1. OLTP: Oracle, MSSQL, PostgreSQL, MySQL, ...
  2. DW: Netezza, Teradata, Hive, GreenPlum, ...
Hence the list of suitable libraries gets even smaller. In the end I was left with the following candidates:
  1. SwissSQL
  2. General SQL Parser
  3. JSQLParser
SwissSQL is a nice library - you write the commands in the Oracle dialect and they get transparently translated into the dialect you need. Unfortunately, the library is no longer developed.

General SQL Parser allows you to construct queries from scratch. However, its verbosity put me off.

Finally, JSQLParser comes to save the day. The developer is responsive and willing to implement new features. However, it still takes about a month until a new feature is implemented and propagated into a release. Hence there is a month during which I have to survive on my own code.

PS: Not even LINQ from Microsoft supports DDL. And it lacks support for many DW databases like Netezza.

Sunday, February 15, 2015

How to write a database agnostic Java application

There are several best practices:
  1. Use a Java persistence API if you can (like Hibernate, JOOQ or JSQLParser).
  2. Use JDBC functions whenever possible. Do you want to retrieve a list of tables? Do it with getMetaData(). If you do it with, for example, information_schema, it is not going to work on Oracle.
  3. If you have to write SQL directly, limit yourself to the subset of commands supported by MySQL - only a few relational databases support a smaller subset of SQL (the only one I can think of is SQLite). Hence, if your application works on MySQL, it is likely to be more or less portable to any other database.
  4. Build good unit tests. You are going to need them.

Thursday, January 15, 2015

Reserved words. Or no reserved words.

Recently I have heard that Prolog is awesome because it uses only 6 reserved words. However, is less better?

I argue that there is a sweet spot. You can write your code directly in binary. But sooner or later you will realize that you are repeating the same sequences again and again. So you start giving names to the repeating sequences. Congratulations, you have just reinvented assembler. Later on you realize that it would be nice to simplify some constructs. And you end up with something like C.

On the other end there are languages which have many keywords, like SAS. There are so many keywords to remember that you have to consult the documentation all the time. Hence this extreme is not ideal either.

OK. Too few or too many reserved words is bad. But where is the sweet spot? Let's introduce a parallel between computer and human languages. They both serve communication. And some computer languages, like Prolog or SQL, were modeled after human languages. Hence we can translate the problem from the optimal count of reserved words in computer languages to the count of unique sound atoms in human languages. And in many cases we can approximate this count with the length of the alphabet. Here is an (incomplete) list of alphabets:

27 Hebrew
27 Latin
28 Arabic
28 Hindi
29 Cyrillic
48 Kanji

There are some extremes in the coding. For example, Chinese uses many characters; however, one character does not always represent an atomic sound. On the other end, we can represent a language with Morse code. However, that does not appear to be a preferred way of communication - otherwise we would not bother to convert the human voice to bits, send it over the air, and transform it back to sound when talking over cellphones, even though a cellphone has buttons that could generate the signal in bit form directly.

Since the distribution of alphabet lengths is bounded from the left, it is asymmetric. Hence it is better to use the median than the mean. And based on millennia of evolution, we are safe to say that the optimal count of reserved words in a programming language is 28.

Of course, you can argue that there are accents, or changes in pronunciation when one character follows another. But you could say something similar about programming languages. Hence let us ignore it for simplicity.

How do computer languages compare with each other? Let's see:
Based on this comparison (of reserved-word counts), Python got it right.
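For Python specifically, the claim is easy to check - the standard library exposes the list of reserved words (35 of them in recent Python 3 releases, not far from the alphabet-sized sweet spot argued for above):

import keyword

print(len(keyword.kwlist))
print(keyword.kwlist)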

For discussion about the topic see http://lambda-the-ultimate.org/node/4295.

Monday, January 12, 2015

Can Artificial Intelligence overthrow humanity?

If I were an AI attempting to overthrow humanity, I would use the following two strategies: pretend to be stupid, and behave symbiotically.

In the virus-simulation game Pandemic you have to engineer a disease that kills all humans around the world. And the best strategy is to develop a disease with a long incubation period - that way it spreads unnoticed. The development of computer intelligence is far behind the expectations from the sixties. Doesn't AI just pretend to be stupid?

A nice strategy for helping an infection spread is to behave symbiotically. For example, the Stuxnet virus enhanced the functionality of the software it infected in order to mask itself as a software update. And don't computers pretend to be useful and spread everywhere?

And the best thing is that we now depend on the computers.