Datafilos: listopadu 2016

čtvrtek 10. listopadu 2016

Information Gain Ratio vs. Gini vs. Chi2

Information Gain Ratio (IGR), Gini index and Chi2 are among the most popular univariate feature selection methods for classification (and decision tree construction). But just as no free lunch theorems were formulated for optimization and classification, a no free lunch theorem for feature selection could be formulated - a single method may not be averagely better than other methods over all datasets.

If we compare feature selection methods on a single dataset by the accuracy on the subsequent classifier, we generally find out that either IGR or Chi2 is the best, while Gini is (almost) always the second:

What is intriguing is the fact that both, IGR and Chi2 sometimes fail terribly. And that Gini generally lacks behind the best method just a bit. Hence, if we calculate accuracy of the feature selection methods over many (real-world) datasets, we find out that Gini is, on average, the best method.

Recommendation: On text related datasets, Chi2 generally excells. On datasets with id-like attributes (attributes with very high cardinality), IGR generally excels, because IGR, in comparison to Chi2, penalizes attributes with high cardinality. If you want to get a reasonable model on the first shot regardless of the data, use Gini (assuming real world datasets since we can always craft datasets to foul the feature selection methods).

pátek 4. listopadu 2016

How to write an agnostic Java application for SQL databases

Whenever we want to talk to a relational database, we generally have two options: use native connection or use some standard API like Open Database Connectivity (ODBC).

If we aim to support multiple databases, it is easier to use a single standard API than to deal with multiple native protocols. There are three well known APIs (and some less known) for communication with SQL databases: ODBC, JDBC and OLE DB. OLE DB was deprecated by its author, Microsoft (e.g. http://www.simba.com). Hence, it is possibly not smart to start new projects with this technology. Consequently, we are left with the decision between ODBC and JDBC. JDBC was developed for Java as a (better) replacement of ODBC. Hence, if you are developing in Java, the choice is clear - pick JDBC. If you are developing in something else (e.g. C), you don't have a choice but to pick ODBC.

Following paragraphs describe useful tricks to know about JDBC. First, forget about information_schema (PostgreSQL, MySQL,...) or dictionary tables (Oracle) to collect metadata about objects (tables, columns,...) in the database. Instead of that use DatabaseMetaData interface in JDBC because only this way works reliably and without any vendor specific code over all databases that have a JDBC driver.

Second, to deal with different data types supported by the databases, use JDBC data types. That way the JDBC driver takes care of the conversions, not you.

Third, limit yourself to a subset of SQL-92 that is supported across multiple databases. A good list of supported functions is entry conformance of SQL-92. If this list is too narrow, use ODBC Escape Sequences - they are fully supported in JDBC. Just be aware of the fact that escape sequences do not generally add any new functionality into databases - they just translate vendor's function names (and parameters) into a standardized format. Hence, if a database does not provide the functionality natively, it is unlikely it will provide the functionality over the escape sequences. Consequently, only a subset of escape sequences are save to use.

Fourth, named entity quoting and literal quoting is specific to each database. You may extract the quotes from metaData in JDBC (e.g. getIdentifierQuoteString()).

Fifth, even if you try hard, situations arise, when you want to do things vendor specific way. To deal with these scenarios, have a single class, which implements the default solution (e.g. "Ansi"), and have vendor specific classes that inherit from it (e.g. "Mysql", "Postgre",...). That way, you can switch from vendor agnostic code, which is deemed to be too slow, to vendor specific code any time. Also, it gives you the ability to work-around bugs in implementations of JDBC drivers.

Unfortunately, there are some aspects of SQL for which I didn't find an a good enough agnostic solution. For example, "limit" clause is not agnostic because it does not work, for example, in SQL Server, Oracle or SAS. The ability to limit the size of the ResultSet in JDBC is nice, but if you need to write the result into a table in the database, properties of the ResultSet are not applicable. If you know of a nice agnostic way how to "limit", let me know.