Monday, June 7, 2021

Microarray classification with background knowledge

The main issue with microarray classification is that the dimensionality is high (feature count > 10000) while the sample count is low (~100 samples). We can partially mitigate this issue by incorporating background knowledge about how the features (genes) are controlled by transcription factors (TFs).

I have two proposals: one with a linear discriminant analysis (LDA) and another with a neural network.

Linear discriminant analysis

LDA and its variants, like nearest shrunken centroids (PAM) or shrunken centroids regularized discriminant analysis (SCRDA), are fairly popular in microarray analysis, but they all have to deal with the problem of reliably estimating the covariance matrix, because the size of the covariance matrix is #features × #features. PAM gives up the hope of estimating the whole covariance matrix and estimates only the diagonal, while SCRDA merely shrinks the covariance matrix toward the diagonal. But with the knowledge of which genes are coregulated by which transcription factors, we can not only shrink the covariance matrix toward the diagonal, but also shrink the coregulated genes together. If the estimate of the covariance matrix in SCRDA is: (1-alpha)*cov(X) + alpha*I, where X are the training data, alpha is a tunable scalar and I is the identity matrix, then the covariance matrix in SCRDA with background knowledge is: (1-alpha-beta)*cov(X) + alpha*I + beta*B, where beta is another tunable scalar and B is a block matrix. This block matrix can be generated with the following pseudocode (tf2genes maps a transcription factor to the indices of the genes it regulates, and gene2tfs is the reverse mapping):

# Get the count of TFs that two genes share:
n_genes = len(gene2tfs)
B = np.zeros((n_genes, n_genes))
for tf in tfs:
    for coregulated_gene_1 in tf2genes[tf]:
        for coregulated_gene_2 in tf2genes[tf]:
            B[coregulated_gene_1, coregulated_gene_2] += 1

# Normalize the shared count to Jaccard similarity:
# shared / (|TFs of gene_1| + |TFs of gene_2| - shared)
for gene_1 in range(n_genes):
    for gene_2 in range(n_genes):
        if B[gene_1, gene_2] > 0:
            shared = B[gene_1, gene_2]
            B[gene_1, gene_2] = shared / (len(gene2tfs[gene_1]) + len(gene2tfs[gene_2]) - shared)

The code does nothing else but calculate the Jaccard similarity between the genes based on the shared transcription factors. And if (1-alpha)*cov(X) + alpha*I is equivalent to shrinking toward the identity matrix (see "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection"), then (1-beta)*cov(X) + beta*B is equivalent to shrinking toward the coregulated genes.

Model assumptions: Beyond what LDA already assumes (a shared covariance matrix across the classes,...), we also assume that the impact of all the transcription factors on the genes is identical.
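The resulting estimator can be sketched as follows (a minimal illustration with my own variable names; in practice, alpha and beta would be tuned, e.g., by cross-validation):

```python
import numpy as np

def shrunken_covariance(X, B, alpha, beta):
    """Shrink the sample covariance toward the identity matrix
    and toward the gene-coregulation matrix B."""
    S = np.cov(X, rowvar=False)  # features are in the columns
    I = np.eye(S.shape[0])
    return (1 - alpha - beta) * S + alpha * I + beta * B

# Toy example: 5 samples, 3 genes, genes 0 and 1 coregulated.
X = np.random.default_rng(0).normal(size=(5, 3))
B = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
S_shrunk = shrunken_covariance(X, B, alpha=0.2, beta=0.1)
```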

Neural network

While LDA is fairly popular in microarray analysis, neural networks are not, as they require a lot of training samples to learn anything. But we can partially alleviate that issue by using a smart network structure and initialization. We can use a neural network with a single hidden layer, where the number of neurons in the hidden layer equals the number of transcription factors. And instead of using a fully connected layer between the input layer (which represents genes) and the hidden layer (which represents transcription factors), we can make connections only between those genes and transcription factors where an interaction is known. This dramatically reduces the count of parameters to estimate. Furthermore, if we know whether the transcription factor works as an "activator" or a "deactivator", we can initialize the connection weights to a fairly high, respectively low, random value. The idea is that if a gene is highly expressed (its feature value is high), its activator transcription factors are likely "on" while its deactivator transcription factors are likely "off".
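A minimal sketch of how such a first-layer weight matrix could be built and initialized (numpy only; tf2genes maps each transcription factor to the indices of the genes it regulates, as in the LDA section, and the magnitude of the random initialization is my assumption):

```python
import numpy as np

def build_first_layer(n_genes, tf_list, tf2genes, tf_is_activator, rng=None):
    """Weight matrix genes -> TFs with connections only where an
    interaction is known; activators start high, deactivators low."""
    rng = rng or np.random.default_rng(0)
    W = np.zeros((n_genes, len(tf_list)))
    for j, tf in enumerate(tf_list):
        for gene in tf2genes[tf]:
            w = rng.uniform(0.5, 1.0)  # "fairly high" magnitude
            W[gene, j] = w if tf_is_activator[tf] else -w
    return W

# Toy example: 4 genes, 2 TFs.
tf2genes = {"tf_a": [0, 1], "tf_b": [1, 2, 3]}
tf_is_activator = {"tf_a": True, "tf_b": False}
W = build_first_layer(4, ["tf_a", "tf_b"], tf2genes, tf_is_activator)
# Genes without a known interaction stay disconnected (weight 0).
```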

Contrary to the LDA model, the neural network model does not assume that transcription factors affect the genes with the same strength but rather learns the strength. But that also means that we need more training data than in the LDA model.

Transfer learning

To make the neural network model viable, we also have to exploit transfer learning: train the model on some large_dataset and use the trained weights between the input layer and the middle layer as the initialization values for our model on dataset_of_interest. Of course, the same trick can be employed in LDA to shrink the covariance matrix toward a pooled covariance matrix estimated from many different datasets: (1-gamma)*cov(X) + gamma*P, where P is a pooled covariance matrix over many datasets. Hence, the final estimate of the covariance matrix might look like: (1-alpha-beta-gamma)*cov(X) + alpha*I + beta*B + gamma*P.

Conclusion

Both models can be further improved. For example, the assumption of identical covariance matrices per class in the LDA could be relaxed, as in regularized discriminant analysis (RDA), and the initial weights in the neural network could be scaled by 1/len(gene2tfs[gene]) to initialize them to the expected value. But that is beyond the scope of this post.

Thursday, April 22, 2021

Anti-patterns of two-factor authentication

  1. Use a password field for the one-time password instead of a plain text field. Password fields hide the password to prevent other people from reading it off your screen. But a one-time password becomes useless the moment you use it. Hence, a password field for a one-time password does not really increase security. It does, however, increase the probability of overlooking a typo, as the screen no longer provides feedback about what you typed. This can be quite a nuisance if you have to use a computer with a different keyboard layout than the one you are accustomed to. Bonus points for disabling the clipboard to prevent users from copy-pasting the one-time password from a notepad.
  2. Allocate only enough resources to handle the normal login rate, not the theoretical peak login rate, for critical applications. Most of the time, only a small portion of users attempts to log in at the same time. But when something serious is happening, everyone wants to log in at once. And when the login system crashes at this stressful moment, it doesn't make people happy.
  3. Make the login memory-less and batch-less. A memory-less implementation does not remember that you logged in 10 seconds ago. And a batch-less implementation does not allow you to pack multiple privileged commands together. Hence, even if you know ahead of time that you want to issue 10 privileged commands in one go, you are still forced to perform 10 two-factor authentications - one for each command.


Saturday, December 5, 2020

Pandas violates composability

The Python Pandas library violates composability. What is composability? It is the ability to take the output of one piece of code and use it as the input for another piece of code.

For example, SQL is composable - SQL takes relations and produces a relation. Lisp is composable - Lisp takes lists and produces a list. Matlab is composable - classical Matlab code takes arrays and produces an array.

But Pandas is not composable. A Pandas method, which takes a DataFrame, may return anything, be it:

  • Categorical
  • DataFrame
  • Group
  • List 
  • Mask (a Numpy vector)
  • Index
  • IntervalIndex
  • Scalar (be it a Python data type, Numpy data type or Pandas data type)
  • Series
  • Slice
  • TimeDeltaIndex

And these are just the ones I got after one minute with the Pandas API.

Now you may argue: "What's the problem? Python uses duck typing." The issue is that many Pandas methods which accept a DataFrame refuse to work on anything but a DataFrame. And that's the better case. In the worse case, the method accepts the input but produces something different from what you would get if you had passed the data as a DataFrame.
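A small illustration of the type zoo: the same verb, sum, returns a different type at each level (a toy example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(type(df.sum()))         # DataFrame in -> Series out
print(type(df.sum().sum()))   # Series in -> scalar out
print(type(df.groupby("a")))  # DataFrame in -> GroupBy out
print(type(df.columns))       # attribute access -> Index
```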

To make it even worse, Pandas mixes the ways of doing things. Sometimes you have to call a method on the instance. Sometimes you have to access an attribute of the instance. And sometimes you have to use a module-level function.

If Pandas used module-level functions everywhere, like Numpy, it would be possible to hide the differences between the different data structures. But you can't hide a missing attribute or method of an instance.

Composability is not a small thing. For example, GQL, a new language for graph databases, has composability as a governing principle. And people seem to, knowingly or not, gravitate toward composability. If nothing else, datatable, an alternative to Pandas in Python, ditches the Series data structure, as it is considered unnecessary.

Wednesday, November 18, 2020

What is wrong with Wikipedia

I like Wikipedia. But I am worried about the future of Wikipedia. Why? Because it keeps growing without limit.

When Wikipedia was founded, it got one thing right: the delivery of information has to snowball. You begin with the minimal quantum of information that is self-sustainable. And then you keep iteratively expanding it. A good Wikipedia article starts with a self-sustainable sentence, which can stand alone and provides a basic description of the keyword. Then there is the rest of the first paragraph, which slightly expands the first sentence. And as a whole, the first paragraph is self-sustainable. Then there are the remaining paragraphs that make up the header. They slightly expand the description of the first paragraph. And together with the first paragraph, they are self-sustainable. And finally, there is the body, which provides the rest of the information and which is, by the fact that it completes the whole article, also self-sustainable.

A good metaphor for this concept is the viewing of a picture. With the snowball approach, you first look at the picture from a distance. And you recognize a house. Then you move closer and recognize the front door and windows. You move even closer and recognize individual parts of the door. And finally, when you move closest, you see details like the cracks in the door panels.

In contrast, with the pinhole approach you scan the picture pixel by pixel. And you have to reconstruct the whole picture in your mind.

While the pinhole approach is perfectly fine for computers, humans generally prefer the snowball approach. If nothing else, it allows them to skip irrelevant information like the details of the clouds, because from the previous step they already know that that patch of pixels is clouds.

For a long time, Wikipedia followed the snowball approach. But now it keeps shifting to the pinhole approach, as the bodies of the articles keep growing without any limit.

For me, the current transition from the header to the body is frequently too abrupt. I cope with that by switching from the English version of the article to some non-English version, where the articles are smaller. Most of the time, the smaller version provides all the information that I need. But even if it does not answer everything, I at least know which information I seek. And I can then quickly jump to the relevant parts of the English version of the article.

But how can the situation be systematically rectified? There are multiple options:
    1) Identify an optimal size of the articles and start truncating the overgrown articles.
    2) Allow fast time travel to the time when the article was closest to the optimal size.
    3) An expandable TOC. Right now, the TOC often doesn't even fit on one screen. Could we hide the lowest-level headers by default and expand them on click?
    4) Inform Wikipedians that the article is too long and that they should abstain from making it longer. If anything, they should make it shorter.
    5) Introduce another layer of granularity.

The first approach is politically unacceptable. Wikipedians like to expand articles, not reduce them. The second approach is meaningful for topics that do not evolve rapidly, like the interpretation of ancient events, but fails miserably for contemporary topics. The third approach is a nice technical solution. The fourth approach might help a bit. But the last option is the only real solution that might have a chance to succeed.


Saturday, April 11, 2020

My requirements for a data-scientist programming language

Unified data types and data structures

Reasoning: As a data scientist, you may want to be able to quickly apply different libraries to your data. But that can work only if they all use the same data representation.

Example: R doesn't have a canonical implementation of sparse matrices. Hence, each library uses its own implementation. And if you want to process your sparse data with two different libraries, you frequently have to perform the format conversion via a dense matrix. That's a no-go for any non-toy matrix. Matlab got it right in the case of sparse matrices: there is only a single format for sparse matrices.

True pass-by-value

Many languages use pass-by-value. But the approaches to making it computationally feasible differ. Python and Java frequently pass only a reference by value, while R and Matlab use copy-on-write (CoW), which delivers the behaviour that I call true pass-by-value.

Reasoning: Data in operating systems and databases are true pass-by-value. Hence, the expectation is set. R and Matlab got it right.

Working autocomplete

It is nice when autocomplete works on table names, column names, file paths, function names, function argument names... It decreases the typo rate, speeds up typing and provides real-time validation: if the autocomplete found the file/table/column/function/whatever, it exists.

Example: Thanks to the clause ordering in SQL, table and column name autocomplete doesn't work very well. LINQ got it right: first define the tables and only then the columns.

Working documentation

As a data scientist, you may have to work with many different tools. And a good manual can make all the difference when you are learning something new.

Example: Python uses multiple formats for documentation, and that can cause errors in documentation rendering when the wrong formatter is used. Java got it right: provide (and enforce) a single documentation format.

Simple copy-paste of matrices

It is nice to be able to simply copy-paste matrices from the output of print(), from Excel or from a publication directly into the interpreter/script code.

Example: Python (and Numpy, Pandas,...) requires commas between the values in a matrix. But when you print a Numpy matrix, it is printed without commas (for improved legibility). That means that you can't simply copy-paste the printed matrix into your code: you first have to add the missing commas. Matlab got it right: copy-paste from/to Excel works. And parsing of copy-pasted tables from PDFs/web pages frequently works even better than in Excel.
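A small demonstration of the comma problem in Numpy (note that repr, unlike str, does print the commas, so its output can be pasted back):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

print(str(a))   # printed without commas - not valid Python syntax
print(repr(a))  # printed with commas - can be pasted back into code
```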

This requirement can be extended to all other data structures.

High-quality and interactive plotting

As a data scientist, you may benefit greatly from visualizing the data.

Example: The built-in plots in R are static. Matlab got it right: you may interact with the plot, move overlapping labels a bit, or read the data value below the cursor.

Support for functions with many arguments (default values,...)

Argument parsing & validation can take a lot of code if the language doesn't handle it for you.

Example: Matlab doesn't support named arguments in its syntax. But many functions accept something like f('argument_name', 'argument_value'). Since the argument names are strings, argument name autocomplete doesn't work. And when you are passing many string arguments, it is difficult for a reader to recognize which is the argument name and which is the argument value - they are both strings! R got it right, just like almost any other functional language.
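For contrast, this is what named arguments with default values look like in Python (a toy function of my own, just to illustrate the ergonomics: the argument name is part of the syntax, so both autocomplete and the reader see it):

```python
import statistics

def normalize(values, method="zscore"):
    """Normalize a list of numbers; method is a named argument with a default."""
    if method == "zscore":
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        return [(v - mean) / sd for v in values]
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]
    raise ValueError(f"unknown method: {method}")

print(normalize([1, 2, 3], method="minmax"))  # [0.0, 0.5, 1.0]
```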

True raw strings

As a data scientist, you may want to embed different languages (be it regex, SQL or HTML) in your language. And being able to simply copy-paste the foreign code without the need to escape/unescape makes life easier.

Example: In Java, you have to escape a backslash in a regex with another backslash. Groovy got it right: just use """ """.
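Python's raw strings also cover the regex case; note that the r"..." literal denotes exactly the same string as its escaped counterpart:

```python
import re

# Without a raw string, every backslash must be doubled:
escaped = "\\d+\\s\\w+"
# With a raw string, the regex can be pasted almost verbatim:
raw = r"\d+\s\w+"

assert escaped == raw
print(re.findall(raw, "42 apples and 7 oranges"))  # ['42 apples', '7 oranges']
```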

Python Pandas

Pandas is convenient but sometimes also a bit inconsistent. For example, None == None evaluates to True in pure Python and Numpy. In Pandas, it evaluates to False:

import pandas as pd
import numpy as np

df = pd.DataFrame([None])
x = np.array([None])

print(None == None)  # Python says True
print(x == x)        # Numpy says True
print(df == df)      # Pandas says False
While we should generally avoid equality comparison for the detection of None in code (and use is, respectively isnull()), when we are actually comparing two variables coming from the outside, we may end up comparing None to None. And if we care about the result of the comparison (we do, otherwise why would we bother with the comparison in the first place?), we have to be careful about whether we are comparing the variables with "Pandas logic" or with "the rest of the Python world" logic.
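If two missing values should count as equal, a null-safe elementwise comparison can be sketched as follows (a helper of my own; for whole-object comparison, Pandas itself provides equals(), which treats missing values in matching locations as equal):

```python
import pandas as pd

def null_safe_eq(a, b):
    """Elementwise equality that counts missing == missing as True."""
    return (a == b) | (a.isnull() & b.isnull())

a = pd.Series([1, None, 3])
b = pd.Series([1, None, 4])
print(null_safe_eq(a, b).tolist())  # [True, True, False]
```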

Tuesday, February 18, 2020

Civilization 6

One of the most criticized features of Civ 5 & 6 is the "one unit per tile" (1UPT) limitation - in the older versions of Civ, it was possible to stack an unlimited number of units on a single tile.

The benefit of the change is more opportunity for tactics. For example, it is possible to replay the Battle of Marathon in the Jadwiga's Legacy scenario: a single well-placed pikeman can indefinitely block a mountain passage against a horde of knights, since only a single knight can attack the pikeman per turn. Or you may place a city between two large lakes, like the real-world Madison, Wisconsin, making it difficult to "flank" the city and put it under siege. And of course, 1UPT fixes the infamous Stack of Doom issue.

On the other hand, it makes logistics difficult. It has happened to me multiple times that I ordered a unit to move forward, only to watch in horror as it moved backward, because all the "resting places" in the forward direction were already occupied by my units. A simple mitigation of the described nuisance is to introduce a keyboard shortcut particularly popular in Microsoft Word: Ctrl+Z. The implementation can take inspiration from The Battle for Wesnoth, a turn-based strategy game that gives the user the opportunity to take back a misclick. But to prevent players from abusing the feature, it is not possible to take back a move that uncovers part of the map or lifts the fog of war. Wesnoth also prevents you from taking back a move that results in combat. Since Civ 6 still uses randomness in the damage calculation, this limitation would have to be ported as well. The long loading times caused by a silly misclick would be mostly a thing of the past...

Another irritating thing is that sometimes a victory (or loss) is inevitable. But you still have to get through many turns to reach it. Some people enjoy this part of the game. But I find it boring. It would be nice to have something like "auto combat" or "quick battle" from Heroes of Might & Magic III, which would allow you to watch the march to victory from the resting position of your seat, or to just skip to the victory screen because you really need some sleep. Didn't you get the outcome you expected? No worries, just press the sweet Ctrl+Z combo and you can show the computer how it should have been done.

The biggest advantage of having the "fast end" option is that as a developer you do not have to stress yourself so much with fine-tuning the winning conditions. Does the game slip too frequently into a war of attrition? Just add the "fast end" option. Problem mitigated. Release the game. And leave the rebalancing for the expansion pack.

And finally, the game logic for city states and civilizations should be the same. City states are great. But I want to be able to:
  1. Befriend not only city states but also civilizations with amenities.
  2. Get unique bonuses not only from being the suzerain of city states, but also from being a friend of civilizations.
And conversely, civilizations are great, but I want to be able to:
  1. Select not only the civilizations that appear in the game, but also the city states.
All these issues can be nicely resolved by adding a switch next to each civilization in the "create new game" GUI, which would "cripple" the selected civilization to a city state. This switch would:
  1. Remove the leader and his or her traits. But the civilization traits would stay.
  2. Limit the count of cities to one.
  3. Adjust things like the loyalty bonus,... to actually make the city state work.