Datafilos

pátek 6. března 2015

How to deal with overfitting

1) Measure it (with cross-validation...)
2) Decrease it
- Get more data
- Decrease the size of the hypotheses space (for example decrease the degree of polynomial in regression, limit the decision tree size, assume attribute independence in Bayes...)
- Introduce bias (for example L1 or L2 regularization in regression, ensembles of different classifiers, operators background knowledge,...)

neděle 1. března 2015

Why did I resignate on writing thorough SQL translater

While it's easy to translate common stddev_samp into stdev when MSSQL is used, the situation can get more complicated. The first level of complication are time data types. Let's consider adding a month to the date several databases:

MSSQL: DATEADD(MONTH, @amount, @date)
MySQL: DATE_ADD(@date, INTERVAL @amount MONTH)
Oracle: ADD_MONTHS(@date, @amount)
PostgreSQL: (@date + INTERVAL '@amount MONTH')

Now, it's not mere find and replace (once we get rid of entities).

The second level of complication are missing functions like correlation:

(Avg(@numericalColumn * @timeColumn) - Avg(@numericalColumn) * Avg(@timeColumn)) / (StdDev_Samp(@numericalColumn) * StdDev_Samp(@timeColumn)) "@columnName"

It's verbose, but doable. However, there is so many functions, for example, in Oracle, that it would be too much work for one person to reach completeness. And if I can't do it myself, I have to rely on the work of others. And if I am relying on the work of others, I have to make it as approachable as possible.

sobota 28. února 2015

Why did I implement my own Java-SQL framework

Many libraries focus on Object Relational Mapping (ORM) and some on Data Definition Language (DDL). Each library supports a subset of SQL:

ORM: select, insert, update, delete
DDL: create table, create schema, create database

But if we need a combination of commands like in:
    create table t2 as
    select *
    from t1
we are doomed because only a small subset of libraries is expressive enough to cover this scenario.

To make things worse, majority of database frameworks focus on Online Transaction Processing (OLTP) but I have to work with Data Warehouse (DW) databases:

OLTP: Oracle, MSSQL, PostgreSQL, MySQL, ...
DW: Netezza, Teradata, Hive, GreenPlum, ...

Hence the list of complying libraries gets even smaller. In the end I ended up with following candidates:

SwissSQL
General SQL Parser
JSQLParser

SwissSQL is a nice library - you write the commands in Oracle dialect and they get transparently translated into the dialect you just need. Unfortunately, the library is not developed anymore.

General SQL Parser allows you to construct queries from the scratch. However, the verbosity has put me away from this product.

Finally JSQLParser comes to save the day. The developer is responsive and is willing to implement new features. However, it still takes a month until the new feature is implemented and propagated into the release. Hence there is a month, during which I have survive on my own code.

PS: Neither LINQ from Microsoft have support for DDL. And it lacks support for many DW databases like Netezza.

neděle 15. února 2015

How to write database agnostic java application

There are several best practices:

Use some Java Persistent API, if you can (like Hibernate, JOOQ or JSQLParser).
Use jdbc functions whenever possible. Do you want to retrieve a list of tables? Do it with getMetaData(). If you do it with, for example, information_schema, then it's not going to work on Oracle.
If you have to directly write SQL, limit yourself to subset of commands supported by MySQL - only a few relational databases support a smaller subset of SQL commands (the only one I can think of is SQLite). Hence, if your application works on MySQL, it's likely to be more or less portable to any other database.
Build good unit tests. You are going to need them.

čtvrtek 15. ledna 2015

Reserved words. Or no reserved words.

Recently I have heard that Prolog is awesome because it uses only 6 reserved words. However, is less better?

I argue that there is a sweet spot. You can write your code in bit code. But sooner or later you will realize that you are repeating the same sequence again and again. So you start to giving names to the repeating sequences. Congratulation. You have just reinvented assembler. Later on you realize that it would be nice to simplify some constructs. And you end up with something like C.

On the other end there are languages which have many keywords. Like SAS. There are so many keywords to remember, that you have to use documentation all the time. Hence this extreme is neither ideal.

OK. Too little or too much reserved words is bad. But where is the sweet spot? Let's introduce a parallel between computer and human languages. They both function for communication. And some computer languages, like Prolog or SQL, were modeled after human languages. Hence we can translate the problem from what is the optimal count of reserved words in computer languages to the count of unique sound atoms in human languages. And in many cases we can approximate this count with the length of alphabet. Here is (an incomplete) list of alphabets:

27 Hebrew
27 Latin
28 Arabic
28 Hindi
29 Cyrillic
48 Kanji

There are some extremes in the coding. For example, Chinese are using many characters. However, one character doesn't have to always represent an atomic sound. On the other end we can represent a language with Morse code. However, it doesn't appear to be a preferred way of communication - otherwise we would not bother to convert human voice to bits, send it over space and transform back to sound when we are talking over cellphones when the cellphones has a button, which directly generates signal in bit form.

Since the distribution of lengths of alphabets is left bounded, it is asymmetric. Hence it is better to use median for the mean value. And based on millenniums of evolution we are safe to say that the optimal count of reserved words in a programming language is 28.

Of course you can argue, that they are accents. Or changes in pronunciation when one character follows another. But you could say something similar about the programming languages. Hence let it ignore for the simplicity.

How do the computer languages compare between each other? Let's see:

ANSI COBOL 85: 357
SystemVerilog: 250 + 73 reserved system functions = 323
C#: 79 + 23 contextual = 102
F#: 64 + 8 from ocaml + 26 future = 98
C++: 82
Java: 50
PHP: 49
Ruby 42
Python 3.x: 33
C: 32
Python 2.7: 31
Go: 25
Smalltalk: 6 pseudo-variables
iota: 2

Based on this comparison Python made it right.

For discussion about the topic see http://lambda-the-ultimate.org/node/4295.

pondělí 12. ledna 2015

Can Artificial Inteligence overthrow humanity?

If I was an AI attempting to overthrow the humanity I would use following two strategies: pretend, that the AI is indeed stupid and behave symbiotically.

In a virus simulation Pandemic you have to engineer a disease to kill all the humans around the world. And the best strategy is develop a disease with a long incubation period. That way it will spread unnoticed. The development of computer intelligence is far behind the expectations from the sixties. Doesn't AI just pretend to be stupid?

A nice strategy how to help with the spread of an infection is to behave symbiotically. For example Stuxnet virus was enhancing the functionality of the software to mask itself as a software update. And don't computers pretend to be useful and spread everywhere?

And the best thing is, that we are now depending on the computers.

sobota 1. listopadu 2014

3Vs (variety, velocity and volume)

Three terms stood out in relation to Big Data.
Variety, Velocity and Volume.
In marketing, the 4Ps define all of marketing using only four terms:
Product, Promotion, Place, and Price.