Sunday, December 30, 2018

Beta-binomial distribution

The beta-binomial distribution is not commonly taught at universities. However, it is a nice distribution to know. Examples of usage (a small code sketch of the first example follows the list):
  1. Estimate popularity of a product based on impression and purchase counts.
  2. Estimate quality of a baseball player (section 7.2) based on hit and at-bat counts.
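
Below is a minimal sketch of the first example in Python, assuming numpy is available. The counts are made up and the Beta prior is fitted by the method of moments on the raw rates (a maximum-likelihood beta-binomial fit would be more principled); it is an illustration, not a reference implementation.

```python
import numpy as np

# Hypothetical impression and purchase counts for a handful of products.
impressions = np.array([1000, 50, 10, 400, 3])
purchases   = np.array([  30,  5,  3,  10, 2])

# Fit a Beta(alpha, beta) prior by the method of moments on the raw rates.
rates = purchases / impressions
mean, var = rates.mean(), rates.var()
common = mean * (1 - mean) / var - 1
alpha, beta = mean * common, (1 - mean) * common

# Posterior mean for each product: the observed rate is shrunk toward the
# global rate, and products with few impressions are shrunk the most.
shrunken = (purchases + alpha) / (impressions + alpha + beta)
print(np.round(shrunken, 3))
```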

Wednesday, December 26, 2018

Classification with high cardinality attributes

There are many ways to deal with high cardinality categorical attributes, like zip code, in binary classification. If the classifier directly supports high cardinality attributes, like CatBoost or mixed models, there is nothing we have to do - we pass the attribute to the classifier and have a nice day. However, if the model does not ingest high cardinality attributes, we have a problem. And maybe surprisingly, we have a problem not only when we use a classifier that does not accept (polynomial) categorical attributes at all, like SVM or logistic regression, but also when we use a classifier based on decision trees, like random forest or gradient boosted trees (CatBoost is a noteworthy exception), because decision trees can frequently deal with only up to 64 categories in a categorical attribute (the exact threshold depends on the implementation). When the categorical attribute has more categories, the attribute gets ignored.

A common remedy, when we have a high cardinality attribute and the classifier does not accept it, is to use one-hot encoding (or dummy coding). But it creates a large matrix. And when the classifier does not have good support for sparse matrices (some implementations of kernel-based algorithms like SVM have it), larger datasets become computationally infeasible with this approach.

But not all encodings that convert high cardinality attributes into numerical attributes result in an explosion of attributes. One such encoding is the estimate of the conditional probability of the label y given the category value x_i: p(y|x_i) for each category i. That means that a single categorical attribute gets replaced with only a single numerical attribute. However, this approach is not without its own issues. Namely, we have to be able to estimate this conditional probability from very few samples. The following list contains a few methods that can be used to estimate the conditional probability (a small code sketch of a few of them follows the list):
  1. Use empirical estimate (simple average of the label, assuming y takes values {0, 1}).
  2. Use empirical estimate with Laplace correction (as used in naive Bayes).
  3. Use leave-one-out (as used in CatBoost).
  4. Calculate lower confidence bound of the estimate, assuming Bernoulli distribution.
  5. Shrink the estimates toward the global label average, as in equations 3 and 4.
  6. Shrink the estimates, assuming Gaussian distributions, as in equations 3 and 5.
  7. Use m-probability estimate, as in equation 7.
  8. Shrink the estimates, assuming beta-binomial distribution.
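
The sketch below illustrates three of the estimators from the list (the empirical average, the Laplace correction, and the m-probability estimate) on a hypothetical zip-code column; it assumes pandas, and the column and parameter names are made up for the example.

```python
import pandas as pd

def encode_category(df, cat_col, y_col, m=10.0, laplace=1.0):
    """Estimate p(y=1 | category) for each category of a high cardinality column."""
    global_mean = df[y_col].mean()
    grouped = df.groupby(cat_col)[y_col].agg(['sum', 'count'])

    empirical = grouped['sum'] / grouped['count']                                # method 1
    laplace_est = (grouped['sum'] + laplace) / (grouped['count'] + 2 * laplace)  # method 2
    m_estimate = (grouped['sum'] + m * global_mean) / (grouped['count'] + m)     # method 7

    return pd.DataFrame({'empirical': empirical,
                         'laplace': laplace_est,
                         'm_estimate': m_estimate})

# Toy usage with a hypothetical zip-code attribute.
df = pd.DataFrame({'zip': ['10001', '10001', '94103', '94103', '94103', '60601'],
                   'y':   [1, 0, 1, 1, 0, 0]})
print(encode_category(df, 'zip', 'y'))
```
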
Experimental results
Test AUC for methods {1, 2, 7, 8}, obtained by running 9 different classifiers on 32 datasets:
Conclusion
At alpha=0.05, there is no significant difference.

The impression of machine learning in R

One professor once said that whenever a student asks him a question, he immediately knows whether the student studies economics, statistics or computer science. If a student asks him about the return on investment, the student studies economics. If a student asks him about the sample size, the student studies statistics. And if a student asks him about an edge case, the student is a computer scientist.
From the current state of algorithms in R it is evident that many authors are statisticians but not programmers. On the good side, whenever I need an implementation of some obscure statistical algorithm, I know that either I will find it in R or I am out of luck. However, package authors in R frequently forget to check for common edge cases like division by zero.

I got burned in the past so many times that I have assembled a table with my prior belief that an implementation is not going to blow up when I throw my dirty data at it:
  1. Core R functions like c(): 100%
  2. Core library functions like chisq.test(): 98%
  3. A function in a library implementing a single algorithm: 70%
  4. A function in a library implementing many algorithms: 40%


Friday, December 7, 2018

Hegemon scenario for Civ 3

The best Civilization 3 contrib scenario that I have ever played is, without hesitation, Hegemon. This is a scenario that you will want to play multiple times, each time with a different nation. Notes for one of the weakest nations, Ithaka, follow.

While Ithaka has a strong start, their power is limited:
  1. While they have an awesome unique wonder, which is buildable from the beginning, they don't have any other unique building (not even a tribal camp!).
  2. While they start with 4 unique units prebuilt, they are not able to create additional unique units.
  3. While the motherland is fertile, it does not contain an Island resource, which is necessary for trade network construction. Hence, forget about trading strategic and luxury resources with other nations.
  4. The city of Ithaka does not lie on a river, while many other capitals do. Hence, the growth of Ithaka is capped until the aqueduct is discovered and built.
  5. Ithaka belongs to the Island Greeks. And the Island Greeks (together with the Lesser Greeks) have units inferior to everyone else's (with a few exceptions during the early and mid-game when their units are just as good as everyone else's).

Nevertheless, contrary to what is said in the guide, it is possible to win with Ithaka.
  1. Ithaka starts with 2 prebuilt boats. These boats allow you to quickly uncover the whole coastline and make highly profitable discovery trades.
  2. Thanks to the unique wonder, Ithaka has a lead in production capacity. Use it to build unit-generating units, because Ithaka's own units suck. And use the generated units for an early attack on weaker nations. There are four goals:
    1. Get their cities. Economically, it is cheaper to build attack units and conquer cities than to build settlers.
    2. Get workers from their settlers. They are an easy target.
    3. Get a leader in order to build an army. The army will then allow you to conquer the well defended capitals of stronger nations (e.g.: Sparta).
    4. If you are not able to eliminate some nation completely, deny them access to copper, as copper is necessary for building many mid-game units.
  3. Rush the conquest. Ithaka is greatly positioned to take down nations in the order in which they get their unique units. Sparta has early unique units, so take down Sparta before they start to spawn them. Then there are Athens and Thebes with their mid-game unique units. And finally the Macedonians with their late unique units.
  4. Focus on nations that have freshly built a wonder and take it from them (it is frequently cheaper to conquer a city with a wonder than to build the wonder).

Tuesday, December 4, 2018

Incremental learning feature-by-feature

Commonly, when we are training a model, we just use the data that are readily available and call it a day. But then there are "online" models that can be learned sample-by-sample. The advantage of these models is that they can be cheaply retrained as new data are observed. However, this post is about models that can learn feature-by-feature, not sample-by-sample.

What is the advantage of feature-by-feature learning? Sometimes, there is a cost associated with getting new measurements (features) for the given set of samples. And if all we want to do is to get a good enough classifier for the smallest possible cost, it makes sense to first try to build a model on a single feature, and keep adding new features until the model is good enough (or until we run out of money). The issue is that model training can be expensive. And if possible, we want to reuse calculations from the previous model in construction of the updated model. That is the core idea of incremental learning feature-by-feature.

The following paragraphs list common machine learning methods and how they can be modified to learn feature-by-feature.

Naive Bayes

Models that assume (conditional) independence of features support feature-by-feature learning by design. When a new feature arrives, we just calculate statistics on the feature and append the statistics to the model. An example of such a model is naive Bayes. Limitation: naive Bayes has relatively low plasticity.
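
A minimal sketch of the idea for a Gaussian naive Bayes follows; the class is illustrative (names made up), assumes numpy, and stores only per-class means and variances per feature, so adding a feature just appends two small arrays.

```python
import numpy as np

class IncrementalGaussianNB:
    """Gaussian naive Bayes that can grow feature-by-feature.

    Each feature contributes only its per-class means and variances, so adding
    a feature means computing a handful of numbers and appending them.
    """

    def __init__(self, y):
        self.y = np.asarray(y)
        self.classes = np.unique(self.y)
        self.log_priors = np.log([(self.y == c).mean() for c in self.classes])
        self.means, self.vars = [], []   # one entry per feature

    def add_feature(self, x):
        x = np.asarray(x, dtype=float)
        self.means.append(np.array([x[self.y == c].mean() for c in self.classes]))
        self.vars.append(np.array([x[self.y == c].var() + 1e-9 for c in self.classes]))

    def predict_log_proba(self, X):
        X = np.asarray(X, dtype=float)
        log_p = np.tile(self.log_priors, (len(X), 1))
        for j, (mu, var) in enumerate(zip(self.means, self.vars)):
            log_p += -0.5 * np.log(2 * np.pi * var) - (X[:, [j]] - mu) ** 2 / (2 * var)
        return log_p   # unnormalized; argmax over columns gives the predicted class
```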

k-NN

In k-NN, all we have to do is to extend the training set by a feature. For small dimension counts, branch-and-bound pruning can speed up scoring of the new samples. And to speed things up even further, we do not have to calculate the distances with expensive metrics (e.g.: Euclidean distance). It is enough to select good candidates with inexpensive distances/similarities (e.g.: Jaccard similarity), prune the space, and only then calculate the expensive metric on the remaining samples. This is (approximately) the approach taken by the Vector-Approximation file (VA-file) implementation.
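
A small sketch of the candidate-pruning idea (not of the VA-file itself): a cheap screening distance selects a shortlist, and the exact metric is computed only on the shortlist. The screening distance and the shortlist size are arbitrary choices for the example.

```python
import numpy as np

def knn_query(X, q, k=5, shortlist_size=50):
    """Approximate k-NN: prune with a cheap distance, then rank the shortlist exactly."""
    # Cheap screening distance: L1 distance on the first two features only.
    cheap = np.abs(X[:, :2] - q[:2]).sum(axis=1)
    shortlist = np.argsort(cheap)[:shortlist_size]

    # Expensive (exact) metric only on the remaining candidates.
    exact = np.linalg.norm(X[shortlist] - q, axis=1)
    return shortlist[np.argsort(exact)[:k]]

# Adding a feature to the training set is just appending a column.
X = np.random.rand(1000, 3)
X = np.hstack([X, np.random.rand(1000, 1)])   # a new feature arrives
print(knn_query(X, X[0], k=3))
```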

Kernel methods

Kernel methods like SVM sometimes accept a precomputed "similarity matrix" (e.g.: when the dot product kernel is used, the matrix is called the Gram matrix). And when a new feature appears, we update the similarity matrix. The update is approximately 6 times faster than calculating the similarity matrix from scratch. Note that kernels can be used in other models besides SVM. To name just a few: logistic regression, perceptron, LDA and PCA. As long as the similarity matrix fits into memory, all these methods are well suited for learning feature-by-feature. Note also that we have to remember the training data in order to score new samples. Fortunately, SVM requires only the support vectors, not all the training data. Unfortunately, after adding a new feature the set of support vectors may change. Hence, we still have to remember all the training data.
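
For the dot product (linear) kernel, the update is a rank-one addition to the Gram matrix; a minimal numpy sketch follows (the actual speed-up depends on how many features are already present).

```python
import numpy as np

n, d = 500, 20
X = np.random.rand(n, d)
K = X @ X.T                      # Gram matrix of the linear kernel

x_new = np.random.rand(n, 1)     # a new feature: one value per sample

# Rank-one update instead of recomputing X @ X.T from scratch.
K += x_new @ x_new.T

# Sanity check against the full recomputation.
X = np.hstack([X, x_new])
assert np.allclose(K, X @ X.T)
```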

Decision tree

When we presort the (feature, label) tuples for each feature, the search for the optimal threshold gets accelerated. This is a common optimization, used for example in the scikit-learn decision tree. When a new feature appears, we just sort the new column and rebuild the tree. Furthermore, we can memoize the optimal threshold for a given path from the root to the current node - most of the time, the structure of the decision tree is not going to change when we add a new feature.
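
A sketch of the bookkeeping part, assuming numpy: one sorted index per feature is maintained, so a newly arriving feature only requires sorting its own column (the tree rebuilding and threshold memoization are not shown).

```python
import numpy as np

class PresortedFeatures:
    """Keeps one sorted index per feature so the threshold search never re-sorts."""

    def __init__(self, X):
        self.X = np.asarray(X, dtype=float)
        self.order = [np.argsort(self.X[:, j]) for j in range(self.X.shape[1])]

    def add_feature(self, x):
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        self.X = np.hstack([self.X, x])
        self.order.append(np.argsort(x.ravel()))   # only the new column is sorted

    def sorted_column(self, j, y):
        """Feature values and labels in ascending feature order, for threshold search."""
        idx = self.order[j]
        return self.X[idx, j], np.asarray(y)[idx]
```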

Random Forest/Extremely Randomized Trees/Boosted trees

Whenever a new feature appears, we just add a new tree to the ensemble. The only issue with RF and ERT is that newly added features have a chance to appear only in new trees, while old features can appear in both old and new trees. Hence, older features are generally favored. To combat that, we may prefer newly added features in the new trees. Unfortunately, that does not fix the probability that two features appear in the same tree (it should be constant for any two features, but it isn't). A solution is to weight the trees with approximately 1.6^n, where n is the index of the tree in the ensemble. The advantage of this solution is that the weight of old trees decays exponentially. Hence, once the ensemble grows too large to be practical (e.g.: the ensemble gets over 150 trees), we may simply drop the oldest tree whenever a new tree is generated, without a large drop in accuracy.
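
A sketch of the weighting scheme, using scikit-learn decision trees as the base learners; the 1.6 base and the 150-tree cap come from the text above, while the tree depth and class handling are arbitrary simplifications.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class FeatureByFeatureEnsemble:
    """One new tree per arriving feature, with exponentially growing tree weights."""

    def __init__(self, base=1.6, max_trees=150):
        self.base, self.max_trees = base, max_trees
        self.trees, self.n_features = [], 0

    def add_feature(self, X, y):
        """X contains all features observed so far; the new tree sees all of them."""
        self.n_features += 1
        tree = DecisionTreeClassifier(max_depth=3).fit(X[:, :self.n_features], y)
        self.trees.append(tree)
        if len(self.trees) > self.max_trees:
            self.trees.pop(0)              # drop the oldest, lowest-weight tree

    def predict_proba(self, X):
        weights = self.base ** np.arange(len(self.trees))   # the newest tree weighs the most
        total = sum(w * t.predict_proba(X[:, :t.n_features_in_])
                    for w, t in zip(weights, self.trees))
        return total / weights.sum()
```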

Full Bayes

Assuming a Bayes model that works with a single covariance matrix and a single vector of means, it is simple to extend the matrix and the vector when a new feature appears.
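
A numpy sketch of the update: only the new feature's mean and its covariances with the existing features need to be computed, while the old blocks are reused. The function name is made up, and the sanity check at the end compares against fitting from scratch.

```python
import numpy as np

def extend_gaussian(mean, cov, X_old, x_new):
    """Extend a fitted mean vector and covariance matrix with one new feature."""
    x_new = np.asarray(x_new, dtype=float)
    n = len(x_new)

    new_mean = np.append(mean, x_new.mean())
    cross = (X_old - mean).T @ (x_new - x_new.mean()) / (n - 1)   # covariances with old features
    new_var = x_new.var(ddof=1)

    new_cov = np.block([[cov,            cross[:, None]],
                        [cross[None, :], np.array([[new_var]])]])
    return new_mean, new_cov

# Sanity check against fitting from scratch.
X = np.random.rand(200, 4)
x = np.random.rand(200)
mean, cov = X.mean(axis=0), np.cov(X, rowvar=False)
new_mean, new_cov = extend_gaussian(mean, cov, X, x)
assert np.allclose(new_cov, np.cov(np.column_stack([X, x]), rowvar=False))
```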

Saturday, October 27, 2018

How to win on deity or sid in Civilization 3

Generally, you want to finish the game as fast as possible, otherwise the enemies are going to out-build/out-research you. In order to be able to finish the map quickly, pick a small map. In order to be able to quickly get to the enemy, pick a single continent with a minimum of water mass. Set the winning condition to elimination - with this setting, it is enough to conquer a single opponent's city in order to completely eliminate the opponent. With this setting, the opponent's biggest advantage - many settlers from the beginning - is turned into a weakness. All you need is 2 units that are capable of overcoming 2 spearmen in a weakly defended city, and the opponent is gone. In order to maximize your advantage, pick a nation with an early unique unit. And set your opponents to nations with late unique units. The selection of your nation depends on the game rules of your Civ mod/patch. For example, in RARR, the Aztecs are awesome. Their unique unit costs the same as a warrior. But it has a bonus attack point. It has a bonus movement point. It is not penalized for movement in jungle and forest (set the map to wet climate in order to maximize this benefit). It has a bonus hit point and it can enslave. And of course it helps that you can build the unique unit from the beginning without any resource dependency and that it triggers an early golden age (remember, you want to finish the game as fast as possible). If you can pick your starting position, it is good to start next to a bonus resource like game in order to maximize population growth and unique unit production. If everything goes well, you should easily win by conquest in the first 30 turns.

Tuesday, July 31, 2018

Disadvantages of AUC and the fixes

The Area Under the Receiver Operating Characteristic Curve (further referred to as AUC) is a nice measure, but of course, it has some potential theoretical disadvantages:

  1. It takes into account all possible decision thresholds.
  2. It assumes uniform misclassification cost.
  3. It does not measure calibration of the predictions.

Consequences:
  1. We are generally interested in decision thresholds close to the true ratio of positive samples to negative samples. For example, whenever we have an extreme class ratio, like in information retrieval where we have only a few positive samples but many negative samples, we are much more likely to be interested in observing the few predicted positives than in observing a few almost certain negatives (or, complementarily, in seeing all but a few almost certain negatives), simply because positives are rare while there is an abundance of negative samples (if you randomly pick a few samples, you are almost guaranteed that they will all be negative). But AUC puts equal weight on all thresholds. Consequently, in the case of severe class imbalance, it is easy to find examples where a higher AUC corresponds to worse performance in the real world - a model with a modest AUC but stellar performance in the working region can beat a model that has an awesome AUC but underperforms in the working region.
  2. AUC assumes that the summed cost of all positive samples is similar to the summed cost of all negative samples. In the case of severe class imbalance, we generally put much more weight on the rare class samples than on the common samples. I.e.: it is more important for us that a positive sample is classified as positive than that a negative sample is classified as negative. However, AUC puts equal weight on both classes. Consequently, if we can correctly predict a large portion of negative samples as negative (as is common in information retrieval), it is easy to get an AUC close to 1.0, as demonstrated in this example.
  3. Whenever we need calibrated probabilities as the output of a classifier (for example for risk assessment), AUC is not really informative. But in defense of AUC, if the business application asks for ranking and not for calibrated probabilities, we want to optimize ranking and not the quality of calibration, because we can come up with two models: one with great ranking but terrible calibration (e.g.: because all the probabilities are squeezed toward the decision boundary, as in SVM) and another model with great calibration but terrible ranking (e.g.: decision trees create plateaus of probabilities that hurt their ranking scores).

Practical complications:
  1. When AUC was defined, it was not specified how to handle ties. Consequently, everyone implements it differently and the reported values differ.
  2. It was (originally) defined only for binary classification. Consequently, there are multiple non-comparable proposals for extension into multiclass situations.
  3. All ranking measures that take into consideration all the predictions (like AUC does) are slow when you have a lot of samples (> 10M). 
Solutions for the mentioned issues (in the order of their appearance; a small code sketch of two of them follows the list):
  1. Use AUC with a Beta prior on the threshold location: the H measure.
  2. Use sample-weighted AUC.
  3. Use the Brier score.
  4. Test the implementation against reasonable-looking unit tests. A fast and concise implementation that passes the tests is available at MATLAB Central.
  5. Use the M measure.
  6. Both thresholding and calibration measures scale linearly with the count of samples. Use them. Alternatively, you may calculate a ranking measure on a specific subset of the data. An example is NDCG@k, which you may get in O(n + k log k) time, where k is the number of top samples considered and n is the count of samples.
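
Below is a small sketch of two of the fixes in Python: an AUC with an explicit tie-handling convention (a tie between a positive and a negative score counts as half a concordant pair) and the Brier score. Both functions are illustrative, assume numpy, and the quadratic AUC loop is fine only for modest sample counts.

```python
import numpy as np

def auc_with_ties(y_true, y_score):
    """Rank-based AUC; a tie between a positive and a negative score counts as 0.5."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes;
    unlike AUC, it rewards calibration."""
    y_true, y_prob = np.asarray(y_true, dtype=float), np.asarray(y_prob, dtype=float)
    return np.mean((y_prob - y_true) ** 2)

y = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.4, 0.8, 0.2, 0.9]
print(auc_with_ties(y, p), brier_score(y, p))
```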

Monday, July 16, 2018

Impression of R

The following text is structured like a sonnet: thesis, antithesis and synthesis.

The philosophy of R is beautiful. Everything is a function. Hence the syntax is highly regular (predictable). And R contains the common syntactic sugar that you would expect from a functional language, like named and default function parameters. This is a great improvement over Matlab, where you have to fake this functionality. Also, R always passes by value. Hence, it behaves more predictably than Java, which passes primitives by value and objects, effectively, by reference (Java passes a reference to the object by value). But R utilizes copy-on-write, which helps to avoid unnecessary copying of data.

But there are also disadvantages. In R, you can do the same thing in many ways. And that can quickly become overwhelming. You search for a way to do X. And the search returns ten different ways to do it. Which of the methods is the best? You don't know. So you test the first one. It works. But it does not allow you to do Y, which is closely related to X. So you test the second approach. It works - it allows you to do both X and Y. But it is too slow. So you test the third approach. There is a bug, which does not allow you to do X. You test the fourth approach. It does not work at all... I think you get the idea. Finding a way to do something in R is much more time-consuming than in Python, where there is commonly only a single dominant way to do something. The R community realizes that this is a problem. CRAN introduced the possibility to rate packages with stars. But it got phased out because only a few people were contributing their evaluations. Another disadvantage of R is that when you look at the source code of a function, you never know ahead of time what you will find. Is it a function written in R, S or C? Sometimes it really feels like unwrapping a Christmas present - at the beginning you only see the function API. You unwrap it just to find out there is a thin R wrapper inside. So you remove another layer. It is in S. But not pure S - it is FORTRAN code ported into S. So you look into the FORTRAN code. But the FORTRAN code is not typical FORTRAN code either - it was already ported from something else... I have to admit that I am impressed that so many levels of indirection actually work. But debugging and understanding such code is a nightmare.

My conclusion: R is awesome if all you need is to use it as a scripting language. Whenever I need to run some exotic statistical method, I can be pretty confident that it was ported to or written in R and distributed as a package. There are situations when I simply do not have a choice but to use R, unless I want to implement the method from scratch. But whenever I have to read R code written by my colleagues, it is hell - each of my colleagues is using a different set of packages, different objects (data frame vs. data table vs. tibble vs. ...), and a different approach to writing loops (for loop vs. apply vs. tidyverse vs. ...). In the end, it is not difficult to understand the code. But it feels like reading text riddled with typos - you understand it, but it diverts mental capacity from the content to the representation layer.

Monday, April 23, 2018

Bugs in code

It is said that there are just two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. Nevertheless, the most memorable bugs for me are race conditions and numerical stability problems.

Race conditions are memorable because they are difficult to reproduce, while numerical stability issues are memorable because they are difficult to fix correctly.

Monday, March 26, 2018

Type systems in programming languages

There are four approaches to data types:
  1. Have very few basic data structures. This is the philosophy of Scheme.
  2. Have a rich type system that is organized into a hierarchy. Haskell is famous for this approach.
  3. Have a rich type system that is using union types. This is the Ceylon way.
  4. Have a rich type system with traits. This approach is taken by Rust, Swift, Python and Go.
Languages that desire simplicity generally favour having just a few data types. The advantage of this approach is that you need to write just a few different functions to handle all possible data types. However, there are two disadvantages. First, sometimes a specialized data type can dramatically decrease runtime and memory requirements. Second, sometimes you may want to deliberately limit the domain of values a variable can take, to be sure that the variable cannot contain illegal values.

But once you have many data types, you frequently find yourself needing to write the same function for multiple data types. An illustrative example would be a sum function that should be able to work on an integer vector and on a floating-point vector. But it can quickly become tiresome to maintain multiple copies of the same code. Languages with a rich type system have to provide some way to avoid plain duplication of the code. Haskell solves it by organizing data types into a hierarchy - if you need to write a function that works on numbers, just write it for the common ancestor of all numbers.

The hierarchical type system is, however, quite restrictive. If some types fulfill requirement α and some types fulfill requirement β, then we can represent such types in a Venn diagram:

[Venn diagram: the set of types fulfilling requirement α and the set of types fulfilling requirement β, partially overlapping.]

But if all 3 regions of the diagram - types with only α, types with only β, and types with both (α∩β) - are non-empty, we cannot say that α⊆β or β⊆α. In other words, with a hierarchical system we cannot use both the α and β characteristics to describe the types. We can, at best, pick just one of these characteristics to describe the data types.

A concrete example of this dilemma is integers in databases. Should we treat them as additive or as "groupable"? Integers are both. But continuous numbers are, arguably, just additive, while characters are, arguably, only groupable. In a hierarchical system, we generally pick the lesser of the evils and accept that we can test exact equality even on floating point numbers. Union type systems (systems that rely on union/sum types) and trait type systems (as in protocols in Swift or traits in Python or intersection types in algebraic data types) allow you to tackle this dilemma. Btw., if you are using tags to organize your bookmarks, files or photos, you are already using a trait system. A nice critique of hierarchical systems is given in Goodbye, Object Oriented Programming.

What is the difference between the union type system and the trait type system? Union type systems assume that we have a fixed set of types (e.g.: float, double, integer, ...), and each variable (e.g.: a function argument) specifies the set of data types that it accepts. On the other hand, trait type systems assume that we have a fixed set of traits (e.g.: comparable, ordinal, additive, ...), and each variable defines the traits that the data type must fulfill.
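
A small Python illustration of the difference: typing.Union enumerates the accepted concrete types, while a typing.Protocol (Python's closest analogue to a trait) names a capability that any type may satisfy. The Addable protocol below is made up for the example.

```python
from typing import Protocol, Union

# Union style: the parameter enumerates the concrete types it accepts.
def total_union(values: list[Union[int, float]]) -> float:
    return sum(values)

# Trait style: the parameter names a capability, and any type implementing
# __add__ qualifies - including types written long after this function.
class Addable(Protocol):
    def __add__(self, other): ...

def total_trait(values: list[Addable]):
    result = values[0]
    for v in values[1:]:
        result = result + v
    return result

print(total_union([1, 2.5, 3]))        # 6.5
print(total_trait(["a", "b", "c"]))    # 'abc' - str satisfies the protocol
```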

How to return multiple values from a function

There are three approaches to returning multiple values from a function:
  1. Return some object. This is the canonical solution in object-oriented and functional languages (the latter return "tuples").
  2. Use function parameters for the output. This is the canonical solution in Prolog.
  3. Allow functions to return multiple values. This is the canonical solution in Matlab.
Discussion:
Matlab's approach is beautiful as long as you return up to 4 values. But it does not scale to a higher count of return values, because you have to address the output values by position - addressing by name does not work. And just the mental picture of me trying to read the 100th return value by writing the correct number of commas and getting it wrong by one comma is painful.

Prolog's approach requires a convention to differentiate between input parameters and output parameters.

Hence, the nicest approach, in my opinion, is to return a "tuple" that can be addressed both by position and by name. This is the approach taken by R.
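
For illustration, Python's namedtuple gives the same kind of return value, addressable both by position and by name (the function and field names below are made up):

```python
from collections import namedtuple

# A return type whose fields can be read by position or by name.
FitResult = namedtuple("FitResult", ["slope", "intercept", "r_squared"])

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return FitResult(slope, intercept, 1 - ss_res / ss_tot)

result = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1])
print(result.r_squared)   # by name
print(result[2])          # the same value, by position
```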

Monday, March 19, 2018

Dream castle in Heroes III by unit levels

1) Skeletons
While skeletons are only an average level 1 unit, the necromancy skill makes them a great choice for large maps with a lot of wandering creatures, which can be harvested.

Disadvantage: On tiny maps occupied only by castles, where the majority of the battles happen during the first two weeks, upgraded gremlins or sprites would be a better choice, since they are better suited for castle conquests during the first weeks than skeletons.

2) Harpy
Upgraded harpies are masters of guerrilla attacks, as the enemy does not retaliate and the harpies return to their original position unharmed. If combined with mass slow, melee units can have a tough time taking down the harpies. If you dislike losing units while cleaning the map, this is the creature for you. It was voted the second most favourite level 2 unit.

Disadvantage: While harpies are awesome units for map cleaning and for castle defense against weak opponents because of their low attrition, harpies are lousy units for final battles where big unit losses have to be expected. If you expect early battles between the main heroes in the open field, wolves with their impressive damage ability would be a better choice.

3) Elves
Each castle should have at least one shooting unit. Upgraded elves shoot twice per turn, making the upgrade highly desirable. Elves were voted the favourite level 3 unit.

4) Vampires
Once the count of upgraded vampires reaches ~20 units, they are awesome beasts for map cleaning, as you may expect minimal losses thanks to their ability to resurrect from the flesh of their enemies. Rated the favourite level 4 unit.

Disadvantage: Just like harpies, vampires are not the best units for the final battles. Furthermore, it takes two weeks to grow the necessary number of units from a single castle.

5) Pits
Upgraded pits can (or could, if they also worked on the undead) raise ridiculously expensive vampires in vast quantities from the slaughtered skeletons. Finally, you do not have to worry about skeleton losses during map cleaning, as the lost skeletons get converted into vampires!

Disadvantage: Just like harpies, elves and vampires, this unit has to be upgraded to benefit from it fully.

6) Wyvern
To compensate for the need to upgrade almost all lower level units, wyverns do not have to be upgraded to be of some use. Furthermore, the wyverns' dwelling has very low prerequisites (just a level 2 dwelling building), making it possible to recruit wyverns on the first day. Since they are also fliers, they can be of great use for early castle sieging.

Disadvantage: Champions, black knights and nagas are all beautiful creatures. But no other level 6 dwelling can be built on day 1. If we went with any other unit, we would not be able to recruit level 7 units on day 2 of the first week.

7) Firebird
While the firebird is not the best level 7 unit, it is still a pretty decent choice: the dwelling is one of the cheapest (second behind the titan's), the unit itself is inexpensive, the weekly growth is above average, it is one of the fastest units in the game, giving you the opportunity to apply mass haste/slow and decimate the opponent before he gets his turn, and its fire immunity allows you to embrace the Armageddon strategy (with the help of a support hero to store raised skeletons). It is the only level 7 unit whose parameters were toned down in Horn of the Abyss. Furthermore, the firebird dwelling has only one prerequisite - the level 6 dwelling building. Hence, if you play on normal difficulty, you just need to get 5 wood (or better, set the starting bonus to resources - wood and stone - and you do not have to worry about getting any additional resource or gold) and on the second day you will have 2 firebirds, 2 wyverns and 3900 gold in your pocket to spend on whatever you want (based on the situation, you may want to keep it for the capitol, or spend it on another hero, 8 harpies and 6 skeletons).

Discussion:
Since all the units would be from the same castle, there would not be a morale penalty for mixing several units together. The castle would be usable for a blitzkrieg strategy during the early phase of the game, if played on normal difficulty, thanks to wyverns and firebirds. But the castle would also scale to long games due to the ability to harvest skeletons, which would be continuously converted into vampires. And because upgraded harpies and vampires simply do not die during map cleaning, you may go into the final battles with vast counts of skeletons, harpies, and vampires. Furthermore, firebirds and phoenixes are, contrary to many dragons, immune to the Armageddon spell. And because phoenixes are the fastest units in the game, you are almost guaranteed to get the opportunity to cast Armageddon before the opponent gets the opportunity to do anything. Hence, phoenixes can in some situations be used for dragon slaying without any loss.