Sunday, December 30, 2018

Beta-binomial distribution

The beta-binomial distribution is a distribution that is not commonly taught at universities. However, it is a nice distribution to know. Examples of usage:
  1. Estimate the popularity of a product based on impression and purchase counts (a small sketch follows this list).
  2. Estimate quality of a baseball player (section 7.2) based on hit and at-bat counts.
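
As a rough illustration of the first use case, here is a minimal empirical-Bayes sketch in Python. The impression and purchase counts are made up. A beta prior is fitted to the observed purchase rates by the method of moments, and each product's rate is then shrunk toward the prior mean, more strongly when the impression count is small.

    import numpy as np

    # Made-up data: impressions and purchases for a handful of products.
    impressions = np.array([1200,  40, 530,  15, 9000])
    purchases   = np.array([  60,   5,  20,   3,  410])
    rates = purchases / impressions

    # Method-of-moments fit of a Beta(alpha, beta) prior to the observed rates.
    mean, var = rates.mean(), rates.var()
    common = mean * (1 - mean) / var - 1
    alpha, beta = mean * common, (1 - mean) * common

    # Posterior mean under the beta-binomial model: each product's rate is
    # shrunk toward the prior mean, more strongly when impressions are few.
    shrunk = (purchases + alpha) / (impressions + alpha + beta)
    print(np.round(shrunk, 4))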

Wednesday, December 26, 2018

Classification with high cardinality attributes

There are many ways to deal with high cardinality categorical attributes, like zip code, in binary classification. If the classifier directly supports high cardinality attributes, like CatBoost or mixed models, there is nothing we have to do - we pass the attribute to the classifier and we have a nice day. However, if the model does not ingest high cardinality attributes, we have a problem. And maybe surprisingly, we have a problem not only when we use a classifier that does not accept (polynomial) categorical attributes at all, like SVM or logistic regression, but also when we use a classifier based on decision trees, like random forest or gradient boosted trees (CatBoost is a noteworthy exception), because decision trees can frequently deal with only up to 64 categories in a categorical attribute (the exact threshold depends on the implementation). When the categorical attribute has more categories, the attribute simply gets ignored.

A common remedy, when we have a high cardinality attribute and the classifier does not accept it, is to use one-hot encoding (or dummy coding). But it creates a large matrix. And when the classifier does not have good support for sparse matrices (some implementations of kernel based algorithms like SVM have it), larger datasets become uncomputable with this approach.

But not all encodings that convert high cardinality attributes into numerical attributes result in an explosion of attributes. One such encoding is the estimation of the conditional probability of the label y given the categorical value x_i: p(y|x_i) for each category i. That means a single categorical attribute gets replaced with only a single numerical attribute. However, this approach is not without its own issues. Namely, we have to be able to estimate this conditional probability from very few samples. The following list contains a few methods that can be used to estimate the conditional probability (a sketch of one of them follows the list):
  1. Use empirical estimate (simple average of the label, assuming y takes values {0, 1}).
  2. Use empirical estimate with Laplace correction (as used in naive Bayes).
  3. Use leave-one-out (as used in CatBoost).
  4. Calculate lower confidence bound of the estimate, assuming Bernoulli distribution.
  5. Shrink the estimates toward the global label average, as in equations 3 and 4.
  6. Shrink the estimates, assuming Gaussian distributions, as in equations 3 and 5.
  7. Use m-probability estimate, as in equation 7.
  8. Shrink the estimates, assuming beta-binomial distribution.
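
To make the idea concrete, here is a minimal sketch of point 7, the m-probability estimate. The function name, the toy zip codes and the value of m are made up; in practice the encoding should be computed out-of-fold so the label does not leak into its own encoding.

    import pandas as pd

    def m_estimate_encode(categories, y, m=10.0):
        """Replace each category with an estimate of p(y|category) shrunk
        toward the global label mean (an m-probability estimate)."""
        df = pd.DataFrame({"x": categories, "y": y})
        global_mean = df["y"].mean()
        stats = df.groupby("x")["y"].agg(["sum", "count"])
        encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)
        return df["x"].map(encoding)

    # Toy usage on a made-up zip-code attribute.
    zips = ["11000", "11000", "25001", "25001", "25001", "60200"]
    y    = [1, 0, 1, 1, 0, 1]
    print(m_estimate_encode(zips, y, m=5.0))
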
Experimental results

Testing AUC for points {1, 2, 7, 8} from running 9 different classifiers on 32 datasets:

Conclusion

At alpha=0.05, there is no significant difference between the four estimators.

The impression of machine learning in R

One professor once said that whenever a student asks him a question, he immediately knows whether the student studies economics, statistics, or computer science. If a student asks him what the return on investment is, the student studies economics. If a student asks him what the sample size is, the student studies statistics. And if a student asks him about an edge scenario, the student is a computer scientist.
From the current state of algorithms in R it is evident that many authors are statisticians but not programmers. On the good side, whenever I need an implementation of some obscure statistical algorithm, I know that either I will find it in R or I am out of luck. However, package authors in R frequently forget to check for common edge scenarios like division by zero.

I got burned in the past so many times that I have assembled a table with my prior belief that an implementation is not going to blow up when I throw my dirty data at it:
  1. Core R functions like c(): 100%
  2. Core library functions like chisq.test(): 98%
  3. A function in a library implementing a single algorithm: 70%
  4. A function in a library implementing many algorithms: 40%


Friday, December 7, 2018

Hegemon scenario for Civ 3

The best Civilization 3 contrib scenario that I have ever played is, without hesitation, Hegemon. It is a scenario that you will want to play multiple times, each time with a different nation. Notes for one of the weakest nations, Ithaka, follow.

While Ithaka has a strong start, their power is limited:
  1. While they have an awesome unique wonder, which is buildable from the beginning, they don't have any other unique building (not even a tribal camp!).
  2. While they start with 4 prebuilt unique units, they are not able to create additional unique units.
  3. While the motherland is fertile, it does not contain an Island resource, which is necessary for trade network construction. Hence, forget about trading strategic and luxury resources with other nations.
  4. The city of Ithaka does not lie on a river, while many other capitals do. Hence, the growth of Ithaka is capped until the aqueduct is discovered and built.
  5. Ithaka belongs to the Island Greeks. And the Island Greeks (together with the Lesser Greeks) have inferior units to everyone else (with a few exceptions during the early and mid-game, when their units are just as good as everyone else's).

Nevertheless, contrary to what is said in the guide, it is possible to win with Ithaka.
  1. Ithaka starts with 2 prebuilt boats. These boats allow you to quickly uncover the whole coastline and make highly profitable discovery trades.
  2. Thanks to the unique wonder, Ithaka has a lead in production capacity. Use it to build unit-generating buildings, because Ithaka's own units suck. And use the generated units for an early attack on weaker nations. There are four goals:
    1. Get their cities. Economically, it is cheaper to build attack units and conquer cities than to build settlers.
    2. Get workers from their settlers. They are an easy target.
    3. Get a leader in order to build an army. The army will then allow you to conquer well-defended capitals of stronger nations (e.g.: Sparta).
    4. If you are not able to eliminate some nation completely, deny them access to copper, as copper is necessary for building many mid-game units.
  3. Rush the conquest. Ithaka is greatly positioned to take down nations in the order they get their unique units. Sparta has early unique units. So, take down Sparta before they start to spawn their unique units. Then there are Athens and Thebes with their mid-game unique units. And finally the Macedonians with their late unique units.
  4. Focus on nations that have freshly built a wonder and take it from them (it is frequently cheaper to conquer a city with a wonder than to build the wonder yourself).

Tuesday, December 4, 2018

Incremental learning feature-by-feature

Commonly, when we are training a model, we just use the data that are readily available and call it a day. But then there are "online" models that can be learned sample-by-sample. The advantage of these models is that they can be cheaply retrained as new data are observed. However, this post is about models that can learn feature-by-feature, not sample-by-sample.

What is the advantage of feature-by-feature learning? Sometimes, there is a cost associated with getting new measurements (features) for the given set of samples. And if all we want to do is to get a good enough classifier for the smallest possible cost, it makes sense to first try to build a model on a single feature, and keep adding new features until the model is good enough (or until we run out of money). The issue is that model training can be expensive. And if possible, we want to reuse calculations from the previous model in construction of the updated model. That is the core idea of incremental learning feature-by-feature.

The following paragraphs list common machine learning methods and how they can be modified for learning feature-by-feature.

Naive Bayes

Models that assume (conditional) independence of features support feature-by-feature learning by design. When a new feature arrives, we just calculate statistics on the feature and append the statistics to the model. An example of such a model is naive Bayes. Limitation: naive Bayes has relatively small plasticity.
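
A minimal sketch of what such an interface could look like, assuming Gaussian class-conditional densities; the class and method names, as well as the toy data, are made up.

    import numpy as np

    class IncrementalGaussianNB:
        """Minimal Gaussian naive Bayes that grows feature-by-feature:
        adding a feature only needs per-class statistics of that feature."""

        def __init__(self, y):
            self.y = np.asarray(y)
            self.classes, counts = np.unique(self.y, return_counts=True)
            self.log_prior = np.log(counts / counts.sum())
            self.means, self.vars = [], []        # one entry per feature

        def add_feature(self, x):
            x = np.asarray(x, dtype=float)
            self.means.append([x[self.y == c].mean() for c in self.classes])
            self.vars.append([x[self.y == c].var() + 1e-9 for c in self.classes])

        def predict(self, X):
            X = np.asarray(X, dtype=float)        # (n_samples, n_features)
            mu = np.array(self.means).T           # (n_classes, n_features)
            var = np.array(self.vars).T
            ll = -0.5 * (np.log(2 * np.pi * var) + (X[:, None, :] - mu) ** 2 / var)
            return self.classes[np.argmax(self.log_prior + ll.sum(axis=2), axis=1)]

    # Toy usage: features arrive one at a time.
    model = IncrementalGaussianNB(y=[0, 0, 1, 1])
    model.add_feature([1.0, 1.2, 3.1, 2.9])       # first feature
    model.add_feature([0.2, 0.1, 0.9, 1.1])       # second feature arrives later
    print(model.predict([[1.1, 0.15], [3.0, 1.0]]))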

k-NN

In k-NN, all we have to do is to extend the training set by a feature. For small dimension counts, branch-and-bound pruning can speed up the scoring of new samples. And to speed things up even further, we do not have to calculate the distances with expensive metrics (e.g.: Euclidean distance). It is enough to select good candidates with inexpensive distances/similarities (e.g.: Jaccard similarity), prune the space, and only then calculate the expensive metric on the remaining samples. This is (approximately) the approach taken by Vector Approximation file (VA-file) implementations.
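
A minimal sketch, assuming a brute-force Euclidean search; the class name and the toy data are made up, and the pruning tricks mentioned above are only hinted at in a comment.

    import numpy as np

    class IncrementalKNN:
        """k-NN that learns feature-by-feature: adding a feature is just
        appending a column to the stored training matrix."""

        def __init__(self, y):
            self.X, self.y = None, np.asarray(y)

        def add_feature(self, x):
            col = np.asarray(x, dtype=float).reshape(-1, 1)
            self.X = col if self.X is None else np.hstack([self.X, col])

        def predict(self, Q, k=3):
            Q = np.asarray(Q, dtype=float)
            # Brute-force Euclidean distances; a VA-file-style cheap pre-filter
            # could prune candidates here before the exact metric is evaluated.
            d = np.sqrt(((Q[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2))
            nearest = np.argsort(d, axis=1)[:, :k]
            return np.array([np.bincount(self.y[idx]).argmax() for idx in nearest])

    # Toy usage with labels in {0, 1}.
    knn = IncrementalKNN(y=[0, 0, 1, 1])
    knn.add_feature([1.0, 1.1, 3.0, 3.2])
    knn.add_feature([0.0, 0.2, 1.0, 0.9])
    print(knn.predict([[1.05, 0.1], [3.1, 1.0]], k=3))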

Kernel methods

Kernel methods like SVM sometimes accept a precomputed "similarity matrix" (e.g.: when a dot product kernel is used, the matrix is called the Gram matrix). And when a new feature appears, we update the similarity matrix. The update is approximately 6 times faster than calculating the similarity matrix from scratch. Note that kernels can be used in other models besides SVM. To name just a few: logistic regression, perceptron, LDA and PCA. As long as the similarity matrix fits into memory, all these methods are well suited for learning feature-by-feature. Note also that we have to remember the training data in order to score new samples. Fortunately, SVM requires only the support vectors, not all the training data. Unfortunately, after adding a new feature the set of support vectors may change. Hence, we still have to remember all the training data.
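
A minimal sketch for the dot-product kernel case: the Gram matrix is a sum of per-feature outer products, so a new feature amounts to a rank-1 update. The data are random and only serve to check the update against recomputation.

    import numpy as np

    # For a dot-product (linear) kernel, the Gram matrix is a sum of outer
    # products over features: K = sum_f x_f x_f^T, so a new feature is a
    # rank-1 update instead of a full recomputation.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))      # 500 samples, 20 features so far
    K = X @ X.T                         # Gram matrix for the current features

    x_new = rng.normal(size=500)        # a newly measured feature
    K += np.outer(x_new, x_new)         # update instead of rebuilding K

    # Sanity check against recomputation from scratch.
    X_ext = np.hstack([X, x_new[:, None]])
    assert np.allclose(K, X_ext @ X_ext.T)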

Decision tree

When we presort the (feature, label) tuples for each feature, the search for the optimal threshold gets accelerated. This is a common optimization, used for example in the scikit-learn decision tree. When a new feature appears, we just sort its tuples and rebuild the tree. Furthermore, we can memoize the optimal threshold for the given path from the root to the current node - most of the time, the structure of the decision tree is not going to change when we add a new feature.
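
A minimal sketch of the presorting bookkeeping; the class name and the toy data are made up, and the actual split search and tree rebuilding are left out.

    import numpy as np

    class PresortedFeatures:
        """Keeps each feature column presorted so that the threshold search
        during tree building can sweep values in order. Adding a feature
        costs one extra sort; existing sort orders are reused."""

        def __init__(self, y):
            self.y = np.asarray(y)
            self.columns, self.orders = [], []

        def add_feature(self, x):
            x = np.asarray(x, dtype=float)
            self.columns.append(x)
            self.orders.append(np.argsort(x, kind="mergesort"))

        def sorted_pairs(self, feature_index):
            """(values, labels) in increasing value order, ready for a single
            left-to-right sweep over candidate thresholds."""
            order = self.orders[feature_index]
            return self.columns[feature_index][order], self.y[order]

    # The tree builder would call sorted_pairs() when searching for splits.
    ps = PresortedFeatures(y=[0, 1, 0, 1])
    ps.add_feature([2.0, 0.5, 3.0, 1.0])
    print(ps.sorted_pairs(0))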

Random Forest/Extremely Randomized Trees/Boosted trees

Whenever a new feature appears, we just add a new tree to the ensemble. The only issue with RF and ERT is that newly added features have a chance to appear only in new trees, while old features can appear in both old and new trees. Hence, older features are generally favored. To combat that, we may prefer newly added features in the new trees. Unfortunately, this does not fix the probability that two features appear in the same tree (it should be constant for any two features, but it isn't). A solution is to weight the trees with approximately 1.6^n, where n is the index of the tree in the ensemble. The advantage of this solution is that the weight of the old trees decays exponentially. Hence, once the ensemble grows too large to be practical (e.g.: it gets over 150 trees), we may simply drop the oldest tree whenever a new tree is generated, without a large drop in accuracy.
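
A toy sketch of the weighting idea: one scikit-learn decision tree per newly arrived feature, with newer trees weighted exponentially. The class name, the "sqrt" feature subsampling and the toy data are made up; the 1.6 base and the 150-tree cap are taken from the paragraph above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class FeatureIncrementalForest:
        """Toy ensemble: one tree per newly arrived feature, newer trees
        weighted exponentially. Assumes binary labels in {0, 1}."""

        def __init__(self, y, base=1.6, max_trees=150):
            self.y, self.base, self.max_trees = np.asarray(y), base, max_trees
            self.X, self.trees = None, []

        def add_feature(self, x):
            col = np.asarray(x, dtype=float).reshape(-1, 1)
            self.X = col if self.X is None else np.hstack([self.X, col])
            tree = DecisionTreeClassifier(max_features="sqrt",
                                          random_state=len(self.trees))
            self.trees.append(tree.fit(self.X, self.y))
            if len(self.trees) > self.max_trees:   # drop the oldest tree
                self.trees.pop(0)

        def predict_proba_positive(self, Q):
            Q = np.asarray(Q, dtype=float)
            weights = self.base ** np.arange(len(self.trees))
            # Each tree only sees the features that existed when it was built.
            votes = np.array([t.predict(Q[:, :t.n_features_in_]) for t in self.trees])
            return (weights[:, None] * votes).sum(axis=0) / weights.sum()

    # Toy usage: five features arrive one by one.
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, size=200)
    forest = FeatureIncrementalForest(y)
    for _ in range(5):
        forest.add_feature(rng.normal(size=200) + y)   # weakly informative feature
    print(forest.predict_proba_positive(rng.normal(size=(3, 5))))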

Full Bayes

Assuming a Bayes model that works with a single covariance matrix and a single vector of mean values, it is simple to update the matrix and the vector when a new feature appears.
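
A minimal sketch of such an update; the function name and the random data are made up. Note that the old feature columns are still needed to compute the cross-covariances with the new feature.

    import numpy as np

    def extend_gaussian_stats(mean, cov, X_old, x_new):
        """Extend a mean vector and covariance matrix with one new feature.
        Only the new row/column of the covariance is computed; the existing
        d x d block is reused. The old columns are still needed for the
        cross-covariances."""
        x_new = np.asarray(x_new, dtype=float)
        n = len(x_new)
        centered_old = X_old - X_old.mean(axis=0)
        centered_new = x_new - x_new.mean()
        cross = centered_old.T @ centered_new / (n - 1)    # cov(old_i, new)
        var_new = centered_new @ centered_new / (n - 1)
        mean_ext = np.append(mean, x_new.mean())
        cov_ext = np.block([[cov,             cross[:, None]],
                            [cross[None, :],  np.array([[var_new]])]])
        return mean_ext, cov_ext

    # Toy usage with a sanity check against recomputation from scratch.
    rng = np.random.default_rng(2)
    X_old = rng.normal(size=(100, 3))
    mean, cov = X_old.mean(axis=0), np.cov(X_old, rowvar=False)
    x_new = rng.normal(size=100)
    mean_ext, cov_ext = extend_gaussian_stats(mean, cov, X_old, x_new)
    assert np.allclose(cov_ext, np.cov(np.hstack([X_old, x_new[:, None]]), rowvar=False))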