pátek 23. června 2017

Each database has its own charm, it only takes a while to discover it

Microsoft SQL Server
This is a tough one. But this was the only database that warned me about data corruption, when the SSD disk started to fail. While other databases (Oracle, PostgreSQL and MySQL) were silently returning corrupted data. To be fair to other databases, if nothing else, you can [[https://www.postgresql.org/docs/9.5/static/wal-reliability.html|configure]] PostgreSQL to be more resistant to silent errors.

You can relly on the order of the data in the tables, i.e. the tables are not sets, as dictated by relational algebra, but lists! That makes working with the database more intuitive. Of course, some other databases have this property as well (Oracle, MSSQL, SAS) but other not (PostgreSQL, Teradata, Netteza) - generally, all distributed databases use sets. Another nice property is that you can change the order of the columns in a table any time you want to (in PostgreSQL, for example, you can only append new columns at the end of the tables).

PostgreSQL has the nicest installer on OsX I have ever seen for a database. It's just drag and drop like any other normal app. And after starting the database it tells you the connection parameters. And that's it! No configuration needed! In comparison to installation of SAS or full blooded Oracle, it is Heaven versus Hell. Also, PostgreSQL does not need any configuration fine tuning to be usable. Once, I installed MySQL and PostgreSQL at the same server and mirrored their content. While PostgreSQL worked without any touch for a year, I had to change the configuration of MySQL multiple times, because some default limit (always different one) was too tight.

Hand down Oracle has the best execution planner I have ever used. Plus, it has provides a rich set of commands.

The best part of SAS's SQL is that it allows you to use some of data step conventions in the SQL. Do you need to limit the count of read (not outputted!) rows from a table? No problem, just use inobs parameter! Or is the SQL too inconvenient for your task? Just use SAS code!

sobota 25. února 2017

TPC benchmarks

Since I repeatedly stumble upon the problem how to generate the benchmark databases, here are the references:
  1. HammerDB (for TPC-C and TPC-H databases).
  2. Benchmark Factory (for TPC-C, TPC-D, TPC-E, TPC-H and ASP3AP databases).

středa 22. února 2017


Sometimes I find LaTeX to be frustrating because of the following reasons:
  1. It is difficult to parse TeX code. This occasionally leads to unclear error messages and inconsistent treatment of white space characters.
  2. It is sensitive to the order, in which packages are loaded. If you are loading more than a few packages, finding the correct order can be a non-trivial problem.
  3. Inconsistent quality of typesetting. While TeX is prized for typography, kerning by default is not applied on equations.
For further discussion of the topic see 25 Years of TEX and METAFONT: Looking Back and Looking Forward.

čtvrtek 10. listopadu 2016

Information Gain Ratio vs. Gini vs. Chi2

Information Gain Ratio (IGR), Gini index and Chi2 are among the most popular univariate feature selection methods for classification (and decision tree construction). But just as no free lunch theorems were formulated for optimization and classification, a no free lunch theorem for feature selection could be formulated - a single method may not be averagely better than other methods over all datasets.

If we compare feature selection methods on a single dataset by the accuracy on the subsequent classifier, we generally find out that either IGR or Chi2 is the best, while Gini is (almost) always the second:
What is intriguing is the fact that both, IGR and Chi2 sometimes fail terribly. And that Gini generally lacks behind the best method just a bit. Hence, if we calculate accuracy of the feature selection methods over many (real-world) datasets, we find out that Gini is, on average, the best method.

Recommendation: On text related datasets, Chi2 generally excells. On datasets with id-like attributes (attributes with very high cardinality), IGR generally excels, because IGR, in comparison to Chi2, penalizes attributes with high cardinality. If you want to get a reasonable model on the first shot regardless of the data, use Gini (assuming real world datasets since we can always craft datasets to foul the feature selection methods).

pátek 4. listopadu 2016

How to write an agnostic Java application for SQL databases

Whenever we want to talk to a relational database, we generally have two options: use native connection or use some standard API like Open Database Connectivity (ODBC).

If we aim to support multiple databases, it is easier to use a single standard API than to deal with multiple native protocols. There are three well known APIs (and some less known) for communication with SQL databases: ODBC, JDBC and OLE DB. OLE DB was deprecated by its author, Microsoft (e.g. http://www.simba.com). Hence, it is possibly not smart to start new projects with this technology. Consequently, we are left with the decision between ODBC and JDBC. JDBC was developed for Java as a (better) replacement of ODBC. Hence, if you are developing in Java, the choice is clear - pick JDBC. If you are developing in something else (e.g. C), you don't have a choice but to pick ODBC.

Following paragraphs describe useful tricks to know about JDBC. First, forget about information_schema (PostgreSQL, MySQL,...) or dictionary tables (Oracle) to collect metadata about objects (tables, columns,...) in the database. Instead of that use DatabaseMetaData interface in JDBC because only this way works reliably and without any vendor specific code over all databases that have a JDBC driver.

Second, to deal with different data types supported by the databases, use JDBC data types. That way the JDBC driver takes care of the conversions, not you.

Third, limit yourself to a subset of SQL-92 that is supported across multiple databases. A good list of supported functions is entry conformance of SQL-92. If this list is too narrow, use ODBC Escape Sequences - they are fully supported in JDBC. Just be aware of the fact that escape sequences do not generally add any new functionality into databases - they just translate vendor's function names (and parameters) into a standardized format. Hence, if a database does not provide the functionality natively, it is unlikely it will provide the functionality over the escape sequences. Consequently, only a subset of escape sequences are save to use.

Fourth, named entity quoting and literal quoting is specific to each database. You may extract the quotes from metaData in JDBC (e.g. getIdentifierQuoteString()).

Fifth, even if you try hard, situations arise, when you want to do things vendor specific way. To deal with these scenarios, have a single class, which implements the default solution (e.g. "Ansi"), and have vendor specific classes that inherit from it (e.g. "Mysql", "Postgre",...). That way, you can switch from vendor agnostic code, which is deemed to be too slow, to vendor specific code any time. Also, it gives you the ability to work-around bugs in implementations of JDBC drivers.

Unfortunately, there are some aspects of SQL for which I didn't find an a good enough agnostic solution. For example, "limit" clause is not agnostic because it does not work, for example, in SQL Server, Oracle or SAS. The ability to limit the size of the ResultSet in JDBC is nice, but if you need to write the result into a table in the database, properties of the ResultSet are not applicable. If you know of a nice agnostic way how to "limit", let me know.

pátek 2. září 2016

What is the correct measure for classification?

There are multiple measures that we can use to describe performance of a classifier. To name just a few: classification accuracy (ACC), area under receiver-operating curve (ROC-AUC) or F-measure.

Each metric is suitable for a different task. And there are two questions we have ask to select an appropriate metric:
  1. Is the decision threshold known ahead and fixed?
  2. Do we need calibrated predictions?  
If a single threshold is known ahead and fixed, for example, because we know the misclassification cost matrix, threshold-based metrics like classification accuracy or F-measure are appropriate. However, if we do not know the cost matrix or if the cost matrix is expected to evolve in time, it is desirable to use ranking-based metrics like ROC-AUC, that consider all possible thresholds with equal weight.

Of course, sometimes we may want sometimes in the middle, because we have some knowledge about the distribution of the values in the cost matrix. Example measures that consider just a range of thresholds or weight the thresholds unequally are partial Gini index, lift or H-measure. If you have some information about the working region of the model, you should use a measure that takes that knowledge into account (see the correlations of measures in the second referenced article).

Another question to ask is whether we need calibrated results. If all we need is to rank results and pick top n% of samples, ranking measures like ROC-AUC or lift are appropriate, following the  Vapnik's philosophy: "Do not solve harder problem than you have to". On the other end, if we need to asses risk, calibrated results are desirable. Measures that evaluate quality of calibration are Brier score or logarithmic-loss.

And of course, there are measures that are somewhere in the middle, like threshold-base measures, that evaluate quality of calibration at a single threshold, while ignoring the quality of calibration at the rest of thresholds.

A metric may also feature following desirable properties:
  1. Provides information about how much better is the model in comparison to a baseline for easier interpretation of the quality of the model.
  2. Works not only with binary, but also with polynomial labels.
  3. Is bounded for easier comparison of models across different datasets.
  4. Works also with unbalanced classes. 
But that would be for a longer post. Relevant literature:
  1. Getting the Most Out of Ensemble Selection
  2. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research

sobota 27. srpna 2016

How to design and evaluate tests in schools

There two ways how to end up with high quality products. Either you invest more energy into a design phase, or into a test phase. Following paragraphs describe the second approach.

Premise: If you replace paper-based tests with electronic tests, you can easily analyze the student's responses.

The first trivial analysis is to rank the questions based on the students' average success rate on the question. By looking at the bottom 10% of questions, you may identify badly formulated questions or flawed answer options. Alternatively, the topic may have not be thoroughly explained and practiced. Similarly, you may look at the top 10% of questions and identify accidentally too easy questions. 

The second analysis is based on correlation coefficients. Pearson correlation (although we should call it point biserial correlation) is calculated for each question between the vector of students' responses on that question and the students' overall gain of points from the test. Ideally, each question would highly correlate with the overall test score. If we sort the questions in the descending order and look at the bottom 10% of questions, we may identify irrelevant, too easy, too hard or simply flawed questions.

An example of a irrelevant question is calculation of Euclidean distance between 2 points in 3 dimensional space - each university student should already know how to calculate that. But this question could have been missed by the first analysis simply because of frequent numerical errors.

On the other end, an example of a relevant question is computation of Chebyshev distance - it is unlikely that students have ever heard about it before the course. Plus, it is easier to compute than Euclidean distance. Hence, the effect of numerical errors is minimized.

After identifying 10% of the worst questions, give students free points for these questions. It will make students happy for following reasons:
  1. If a test consists of more than a single question, students will almost inevitably complain, that some of the questions were too difficult. Not counting the toughest questions solves the problem.
  2. If a test is long, students will almost inevitably complain, that something was not properly covered. Not counting the toughest questions solves the problem.
  3. The correction may improve students' scores, but never hurt.
 Professors like it because:
  1. They may dedicate the test creation to unskilled workforce (read: teacher assistants), as the questions do not have to perfect. The students themselves will identify troublesome questions and the system will deal with them.
  2. It is a systematic (and proactive) approach how deal with students' complains about the tests.
  3. As a side product, it allows generation of a large set of relevant, correct and appropriately difficult test questions, which may work as poor man's replacement for absent/obsolete textbooks.
Summary: The described approach is suitable for fast evolving domains like machine learning (or any computer science related domain), where battle-field tested questions are still not available or are already obsolete.