Saturday, December 18, 2021

Fraying apple cables

Low-voltage cables from Apple chargers are infamous for their durability issues:

The issue is caused by repeated torsion of the cable:

As we expose the cable to torsion, the rubber jacket eventually separates from the braid:

However, this separation is already present from the factory at the cable ends as the braid is pulled to one side:

This separation is troublesome because when we further twist the cable, the braid works like a grater, which "eats" the rubber jacket. And eventually, the cable frays.

How to change the cable design to fix fraying:

  1. Make the rubber jacket thicker. Apple has already done that with Lightning cables. However, this just delays the fraying.
  2. Wrap the braid in foil to separate the jacket from the braid. The jacket will then slide nicely over the foil instead of getting grated by the braid. The common thickness of such foil is 50 μm, but it can be made as thin as 6 μm. Since the foil adds one layer on each side of the cable's cross-section, this change could increase the thickness of the cable by only 12 μm. For comparison, a typical hair is 75 μm thick.
Conclusion: It is well known that Apple uses badly designed "strain reliefs" at the cable ends. But that does not explain why the cables also fray in the middle (as illustrated in the first photo), not just at the ends.

Sunday, August 1, 2021

Replication crisis and the proposed solution

Remarkably, only 12 percent of post-replication citations of non-replicable findings acknowledge the replication failure. [1]
Roommate submitted his thesis for publication and one reviewer told him "oh, you cited this result from ~30y ago but it actually has a gap in the proof that no one's figured out how to fix yet." (People learn this stuff via the number theory gossip grapevine apparently?) [2]

Google Scholar is in a great position to reduce the "replication crisis" by alerting users that a listed article is known to have some defect.

In principle, it could work like the "disputed" label on Twitter or Facebook:


Is it the best UX to show a modal window? Most likely not:

  1. We want to inform the visitors not just about failed replications, but also about successful replications and small rectifications (like adding a missing condition to a claim or fixing a troublesome typo).
  2. We do not want to unnecessarily interrupt the visitor’s flow - maybe the visitor is already familiar with the issues of the article or they just don't care about them.

So what instead? The presence and the overall conclusion of the replications could be represented with a double-ended bar chart sparkline, similar to how Google Translate shows the frequency of a translation pair (note the red-gray bar graph at the bottom):

When there is a lot of negative evidence, the red bar graph to the left of the black divider is long. When there is a lot of positive evidence, the green bar graph to the right of the black divider is long (not present in this case).

How to get it started? Let people mark articles as a replication of other articles. 

Why would people bother?

  1. It is a great opportunity for the authors of replication studies to piggyback (collect citations) on the original, likely popular, articles. 
  2. After a lot of wasted time, you might find out that a claim in paper A does not hold, and that there is a paper B that has already spotted the issue - you just were not aware of paper B's existence. In your rage, you might be willing to spend a minute complaining to the world that paper A has an issue, as noted by paper B.

How to collect feedback? The "piggybacking" articles could be explicitly ranked (up-voted/down-voted) like on StackOverflow. While explicit feedback is not in Google's style, it is important to realize that Google Scholar serves a niche community, and niche communities seem to benefit from explicit feedback because there isn't enough implicit signal (observe the success of StackOverflow, Reddit, Hacker News,...). A nice side effect would be increased engagement due to the IKEA effect (people value things on which they have spent some effort more than things they got for free; in this case, people would value Google Scholar more because they have spent time marking articles as a "rectification" of other articles).

And what about machine learning? Of course, over time Google would collect enough training data, explicit feedback, and implicit feedback that the pairing of articles could be predicted fairly reliably. But to get there, Google first has to collect the training data.

Monday, June 7, 2021

Microarray classification with background knowledge

The main issue with microarray classification is that the dimensionality is high (feature count > 10000) while the sample count is low (samples cost roughly $100 each, so datasets rarely contain more than a few hundred of them). We can partially mitigate this issue by incorporating background knowledge about how the features (genes) are controlled by transcription factors (TFs).

I have two proposals: one with linear discriminant analysis (LDA) and another with a neural network.

Linear discriminant analysis

LDA and variants of LDA, like nearest shrunken centroids (PAM) or shrunken centroids regularized discriminant analysis (SCRDA), are fairly popular in microarray analysis, but they all struggle with reliably estimating the covariance matrix, because the size of the covariance matrix is #features × #features. PAM gives up the hope of estimating the whole covariance matrix and estimates only its diagonal, while SCRDA merely shrinks the covariance matrix toward the diagonal matrix.

But with the knowledge of which genes are coregulated by the same transcription factors, we can not only shrink the covariance matrix toward the diagonal matrix, but also shrink the coregulated genes together. If the covariance estimate in SCRDA is (1-alpha)*cov(X) + alpha*I, where X is the training data, alpha is a tunable scalar and I is the identity matrix, then the covariance estimate in SCRDA with background knowledge is (1-alpha-beta)*cov(X) + alpha*I + beta*B, where beta is another tunable scalar and B is a block matrix. This block matrix can be generated with the following pseudocode:

# Get the count of TFs that two genes share.
# tf2genes maps a TF to the indices of the genes it regulates,
# gene2tfs maps a gene index to the set of its TFs, n_genes is the gene count.
B = np.zeros((n_genes, n_genes))
for tf in tfs:
    for coregulated_gene_1 in tf2genes[tf]:
        for coregulated_gene_2 in tf2genes[tf]:
            B[coregulated_gene_1, coregulated_gene_2] += 1

# Turn the shared-TF count into Jaccard similarity by dividing it
# by the size of the union of the two genes' TF sets:
for gene_1 in range(n_genes):
    for gene_2 in range(n_genes):
        union = len(gene2tfs[gene_1]) + len(gene2tfs[gene_2]) - B[gene_1, gene_2]
        if union > 0:
            B[gene_1, gene_2] /= union

The code does nothing more than compute the Jaccard similarity between genes based on their shared transcription factors. And if (1-alpha)*cov(X) + alpha*I is equivalent to shrinking toward the identity matrix (see "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection"), then (1-beta)*cov(X) + beta*B is equivalent to shrinking toward coregulated genes.
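To make the shrinkage concrete, here is a minimal sketch of the regularized estimate plugged into a plain LDA classifier. The names (shrunken_covariance, lda_predictor, X, y) are illustrative, not from an existing library, and alpha and beta would be tuned, e.g., by cross-validation:

import numpy as np

def shrunken_covariance(X, B, alpha, beta):
    # (1 - alpha - beta) * cov(X) + alpha * I + beta * B
    p = X.shape[1]
    return ((1 - alpha - beta) * np.cov(X, rowvar=False)
            + alpha * np.eye(p)
            + beta * B)

def lda_predictor(X, y, B, alpha, beta):
    # Standard LDA discriminants, but with the background-knowledge covariance estimate.
    classes = np.unique(y)
    cov_inv = np.linalg.pinv(shrunken_covariance(X, B, alpha, beta))
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])

    def predict(X_new):
        # delta_k(x) = x' S^-1 mu_k - 0.5 * mu_k' S^-1 mu_k + log(prior_k)
        scores = (X_new @ cov_inv @ means.T
                  - 0.5 * np.sum((means @ cov_inv) * means, axis=1)
                  + np.log(priors))
        return classes[np.argmax(scores, axis=1)]

    return predict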

Model assumptions: Beyond what LDA already assumes (a shared covariance matrix across the classes,...), we also assume that the impact of all transcription factors on genes is identical.

Neural network

While LDA is fairly popular in microarray analysis, neural networks are not, as they require a lot of training samples to learn anything. But we can partially alleviate that issue with a smart network structure and initialization. We can use a neural network with a single hidden layer, where the number of neurons in the hidden layer equals the number of transcription factors. And instead of using a fully connected layer between the input layer (which represents genes) and the hidden layer (which represents transcription factors), we can create connections only between the genes and transcription factors whose interactions are known. This dramatically reduces the count of parameters to estimate. Furthermore, if we know whether a transcription factor works as an "activator" or a "deactivator", we can initialize the connection weights to fairly high, respectively low, random values. The idea is that if a gene is highly expressed (its feature value is high), its activator transcription factors are likely "on" while its deactivator transcription factors are likely "off".
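A minimal sketch of this structure, assuming PyTorch; the MaskedLinear name, the initialization scale, and the reading of "high, respectively low" as positive versus negative initial weights are illustrative choices, not a reference implementation:

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    # mask: (n_tfs, n_genes) matrix with 1 where the TF is known to regulate the gene.
    # sign: optional (n_tfs, n_genes) matrix with +1 for "activator" and -1 for
    # "deactivator" interactions, used only to bias the initial weights.
    def __init__(self, mask, sign=None):
        super().__init__()
        self.register_buffer("mask", mask.float())
        init = 0.1 * torch.rand_like(self.mask)  # small positive random weights
        if sign is not None:
            init = init * sign.float()  # "deactivator" connections start negative
        self.weight = nn.Parameter(init * self.mask)
        self.bias = nn.Parameter(torch.zeros(mask.shape[0]))

    def forward(self, x):
        # Re-applying the mask keeps the unknown gene-TF connections at exactly zero.
        return x @ (self.weight * self.mask).t() + self.bias

# Hypothetical sizes: 10000 genes, 300 TFs, 2 classes.
# model = nn.Sequential(MaskedLinear(mask, sign), nn.Sigmoid(), nn.Linear(300, 2))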

Contrary to the LDA model, the neural network model does not assume that transcription factors affect the genes with the same strength; it learns the strength instead. But that also means that we need more training data than in the LDA model.

Transfer learning

To make the neural network model viable, we also have to exploit transfer learning: train the model on some large_dataset and use the trained weights between the input layer and the hidden layer as the initialization values for our model on dataset_of_interest. Of course, the same trick can be employed in LDA to shrink the covariance matrix toward a pooled covariance matrix estimated from many different datasets: (1-gamma)*cov(X) + gamma*P, where gamma is a tunable scalar and P is the pooled covariance matrix over many datasets. Hence, the final estimate of the covariance matrix might look like: (1-alpha-beta-gamma)*cov(X) + alpha*I + beta*B + gamma*P.
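A rough sketch of the weight transfer, reusing the MaskedLinear layer from the sketch above (n_tfs, n_classes_large and n_classes_small are hypothetical sizes, and the training loops are omitted):

# Pretrain on the large dataset.
pretrained = nn.Sequential(MaskedLinear(mask, sign), nn.Sigmoid(), nn.Linear(n_tfs, n_classes_large))
# ... train `pretrained` on large_dataset ...

# Build the model for the small dataset and copy the gene-to-TF weights as initialization.
model = nn.Sequential(MaskedLinear(mask, sign), nn.Sigmoid(), nn.Linear(n_tfs, n_classes_small))
model[0].load_state_dict(pretrained[0].state_dict())
# ... fine-tune `model` on dataset_of_interest ...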

Conclusion

Both models can be further improved. For example, the assumption of identical covariance matrices per class in the LDA could be relaxed as in regularized discriminant analysis (RDA), and the initial weights in the neural network could be scaled by 1/len(gene2tfs[gene]) to initialize them to their expected value. But that is beyond this post.

Thursday, April 22, 2021

Anti-patterns of two-factor authentication

  1. Use a password field for the one-time password instead of a plain text field. Password fields hide the password to prevent other people from reading it off your screen. But once a one-time password has been used, it is useless, so hiding it does not really increase security. It does, however, increase the probability of overlooking a typo, as the screen no longer provides feedback about what you typed. This can be quite a nuisance if you have to use a computer with a different keyboard layout than the one you are accustomed to. Bonus points for disabling the clipboard to prevent users from copy-pasting the one-time password from notepad.
  2. Allocate only enough resources to handle the normal login rate, not the theoretical peak login rate, for critical applications. Most of the time, only a small portion of users attempts to log in at the same time. But when something serious is happening, everyone wants to log in at once. And when the login system crashes at this stressful moment, it doesn't make people happy.
  3. Make the login memory-less and batch-less. A memory-less implementation does not remember that you logged in 10 seconds ago. And a batch-less implementation does not allow you to pack multiple privileged commands together. Hence, even if you know ahead of time that you want to issue 10 privileged commands in one go, you are still forced to perform 10 two-factor authentications - one for each command.