Datafilos

Saturday, 28 February 2015

Why I implemented my own Java-SQL framework

Many libraries focus on Object Relational Mapping (ORM) and some on Data Definition Language (DDL). Each library supports a subset of SQL:

ORM: select, insert, update, delete
DDL: create table, create schema, create database

But if we need a combination of commands, as in:
    create table t2 as
    select *
    from t1
we are doomed, because only a small subset of libraries is expressive enough to cover this scenario.

To make things worse, the majority of database frameworks focus on Online Transaction Processing (OLTP), but I have to work with Data Warehouse (DW) databases:

OLTP: Oracle, MSSQL, PostgreSQL, MySQL, ...
DW: Netezza, Teradata, Hive, GreenPlum, ...

Hence the list of suitable libraries gets even smaller. In the end I was left with the following candidates:

SwissSQL
General SQL Parser
JSQLParser

SwissSQL is a nice library: you write the commands in the Oracle dialect and they get transparently translated into the dialect you need. Unfortunately, the library is no longer developed.

General SQL Parser allows you to construct queries from scratch. However, its verbosity put me off this product.

Finally, JSQLParser comes to save the day. The developer is responsive and willing to implement new features. However, it still takes about a month until a new feature is implemented and makes it into a release. Hence there is a month during which I have to survive on my own code.

PS: LINQ from Microsoft does not support DDL either. And it lacks support for many DW databases like Netezza.

Sunday, 15 February 2015

How to write a database-agnostic Java application

There are several best practices:

Use a Java persistence or SQL abstraction library, if you can (like Hibernate, jOOQ or JSQLParser).
Use JDBC functions whenever possible. Do you want to retrieve a list of tables? Do it with getMetaData(). If you do it with, for example, information_schema, then it's not going to work on Oracle.
If you have to write SQL directly, limit yourself to the subset of commands supported by MySQL. Only a few relational databases support a smaller subset of SQL commands (the only one I can think of is SQLite). Hence, if your application works on MySQL, it's likely to be more or less portable to any other database.
Build good unit tests. You are going to need them.
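
The getMetaData() advice can be sketched as follows. This is only a minimal illustration, not a complete solution: the class name TableLister and the method listTables are made up for this example, and the JDBC URL is assumed to be passed on the command line with a matching driver on the classpath. The only JDBC calls used are Connection.getMetaData() and DatabaseMetaData.getTables().

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
public class TableLister {

    // Returns the names of all tables visible through the connection,
    // using JDBC metadata instead of a vendor-specific catalog query.
    public static List<String> listTables(Connection connection) throws SQLException {
        DatabaseMetaData metaData = connection.getMetaData();
        List<String> tables = new ArrayList<>();
        // null catalog and schema mean "do not filter"; "%" matches every table name.
        try (ResultSet rs = metaData.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (rs.next()) {
                tables.add(rs.getString("TABLE_NAME"));
            }
        }
        return tables;
    }

    public static void main(String[] args) throws SQLException {
        // Assumption: args[0] is a JDBC URL for a driver available on the classpath.
        try (Connection connection = DriverManager.getConnection(args[0])) {
            for (String table : listTables(connection)) {
                System.out.println(table);
            }
        }
    }
}
```

The same code should run unchanged against any database with a JDBC driver, although how much the driver returns for a null schema differs between vendors.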

Thursday, 15 January 2015

Reserved words. Or no reserved words.

Recently I heard that Prolog is awesome because it uses only 6 reserved words. But is fewer better?

I argue that there is a sweet spot. You can write your code directly in binary. But sooner or later you will realize that you are repeating the same sequences again and again. So you start giving names to the repeating sequences. Congratulations. You have just reinvented assembler. Later on you realize that it would be nice to simplify some constructs. And you end up with something like C.

At the other end there are languages with many keywords. Like SAS. There are so many keywords to remember that you have to consult the documentation all the time. Hence this extreme is not ideal either.

OK. Too few or too many reserved words are bad. But where is the sweet spot? Let's draw a parallel between computer and human languages. Both serve communication. And some computer languages, like Prolog or SQL, were modeled after human languages. Hence we can translate the question of the optimal count of reserved words in computer languages into the count of unique sound atoms in human languages. And in many cases we can approximate this count with the length of the alphabet. Here is an (incomplete) list of alphabets:

27 Hebrew
27 Latin
28 Arabic
28 Hindi
29 Cyrillic
48 Japanese kana

There are some extremes in the coding. For example, Chinese uses thousands of characters. However, one character doesn't always represent an atomic sound. At the other end, we can represent a language with Morse code. However, it doesn't appear to be a preferred way of communication. Otherwise we would not bother to convert the human voice to bits, send them over the air and transform them back to sound when talking over cellphones, given that a cellphone has buttons that could generate the signal directly in bit form.

Since the distribution of alphabet lengths is bounded on the left, it is asymmetric. Hence the median is a better measure of the central value than the mean. And based on millennia of evolution we are safe to say that the optimal count of reserved words in a programming language is 28.

Of course you can argue that there are accents. Or changes in pronunciation when one character follows another. But you could say something similar about programming languages. So let's ignore it for simplicity.

How do computer languages compare with each other? Let's see:
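
To make the arithmetic concrete, here is a small sketch that computes the median and the mean of the alphabet lengths listed above. The class name AlphabetMedian and the method median are made up for this example; the numbers are just the six values from the list.

```java
import java.util.Arrays;

// Hypothetical class, named for this example only.
public class AlphabetMedian {

    // Median of the values: the middle element of the sorted array,
    // or the average of the two middle elements for an even count.
    public static double median(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        if (n % 2 == 1) {
            return sorted[n / 2];
        }
        return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Alphabet lengths taken from the list in the text.
        int[] alphabetLengths = {27, 27, 28, 28, 29, 48};
        double mean = Arrays.stream(alphabetLengths).average().getAsDouble();
        System.out.println("median = " + median(alphabetLengths)); // prints "median = 28.0"
        System.out.println("mean = " + mean);
    }
}
```

The mean comes out a little above 31 because the single value 48 pulls it up, while the median stays at 28.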

ANSI COBOL 85: 357
SystemVerilog: 250 + 73 reserved system functions = 323
C#: 79 + 23 contextual = 102
F#: 64 + 8 from ocaml + 26 future = 98
C++: 82
Java: 50
PHP: 49
Ruby: 42
Python 3.x: 33
C: 32
Python 2.7: 31
Go: 25
Smalltalk: 6 pseudo-variables
iota: 2

Based on this comparison, Python got it right.

For a discussion of the topic see http://lambda-the-ultimate.org/node/4295.

Monday, 12 January 2015

Can Artificial Intelligence overthrow humanity?

If I were an AI attempting to overthrow humanity, I would use the following two strategies: pretend to be stupid, and behave symbiotically.

In the virus simulation game Pandemic you have to engineer a disease that kills all the humans around the world. And the best strategy is to develop a disease with a long incubation period. That way it will spread unnoticed. The development of computer intelligence is far behind the expectations of the sixties. Isn't AI just pretending to be stupid?

A nice way to help an infection spread is to behave symbiotically. For example, the Stuxnet virus enhanced the functionality of the software in order to pass itself off as a software update. And don't computers pretend to be useful and spread everywhere?

And the best thing is that we now depend on computers.

Saturday, 1 November 2014

3Vs (variety, velocity and volume)

Three terms stood out in relation to Big Data.
Variety, Velocity and Volume.
In marketing, the 4Ps define all of marketing using only four terms:
Product, Promotion, Place, and Price.

Friday, 17 October 2014

Hail to Emperor, baby!

Friday, 3 October 2014

Comparison of MATLAB and R

Advantages of R:

Easy setting of default parameters (inherited from functional languages). Not that it is incredibly difficult to set a default value in MATLAB, but it's verbose and error-prone.
Named parameters (again, inherited from functional languages). In MATLAB, when you pass many parameters with string values to a function, it's unclear at a glance what is a parameter name and what is a parameter value. In R, it's immediately clear.
Mixed tables (a combination of string and numerical columns). Incredibly useful for messy real-world data sets. A partial remedy for this problem is 'Tables' in recent versions of MATLAB.
Possibility to name rows and columns. This is awesome because you don't have to remember that you want column 181; all you have to remember is the name of the column. It also has the advantage that the metadata stay together with the data. Hence if you perform selection, projection or transformation of the data, the metadata are automatically in sync with the data. No work is left to the user. In MATLAB, you have to use 'Struct'. Or 'Tables' in recent versions of MATLAB.
Negative indexes for dropping particular columns/rows.

Advantages of MATLAB:

There are fewer competing packages for MATLAB than for R. Hence in MATLAB you are spared from deciding which library is the best.
Sparse matrices are an integral part of MATLAB. Hence all algorithms that benefit from sparse matrices use the same representation of sparse matrices. In R, each library uses its own representation.