Monday, June 7, 2021

Microarray classification with background knowledge

The main issue with microarray classification is that the dimensionality is high (feature count > 10000) while the sample count is low (~100 samples). We can partially mitigate this issue by incorporating background knowledge about how the features (genes) are regulated by transcription factors (TFs).

I have two proposals: one based on linear discriminant analysis (LDA) and another based on a neural network.

Linear discriminant analysis

LDA and its variants, like nearest shrunken centroids (PAM) or shrunken centroids regularized discriminant analysis (SCRDA), are fairly popular in microarray analysis, but they all have to deal with the problem of how to reliably estimate the covariance matrix, because the size of the covariance matrix is #features × #features. PAM gives up on estimating the whole covariance matrix and estimates only its diagonal, while SCRDA merely shrinks the covariance matrix toward the diagonal. But with the knowledge of which genes are coregulated by the same transcription factors, we can not only shrink the covariance matrix toward the diagonal, we can also shrink coregulated genes together. If the estimate of the covariance matrix in SCRDA is: (1-alpha)*cov(X) + alpha*I, where X is the training data, alpha is a tunable scalar and I is the identity matrix, then the covariance matrix in SCRDA with background knowledge is: (1-alpha-beta)*cov(X) + alpha*I + beta*B, where beta is another tunable scalar and B is a block matrix of gene-gene similarities. This block matrix can be generated with the following code:

import numpy as np

# tf2genes maps each TF to the indices of the genes it regulates;
# gene2tfs is the inverse mapping from a gene index to its set of TFs.
n_genes = len(gene2tfs)

# Count the TFs that each pair of genes shares:
B = np.zeros((n_genes, n_genes))
for tf in tfs:
    for coregulated_gene_1 in tf2genes[tf]:
        for coregulated_gene_2 in tf2genes[tf]:
            B[coregulated_gene_1, coregulated_gene_2] += 1

# Normalize the shared-TF count by the size of the union of the TFs
# regulating gene_1 and gene_2 (i.e., compute the Jaccard similarity):
for gene_1 in range(n_genes):
    for gene_2 in range(n_genes):
        if B[gene_1, gene_2] > 0:
            B[gene_1, gene_2] /= (len(gene2tfs[gene_1])
                                  + len(gene2tfs[gene_2])
                                  - B[gene_1, gene_2])
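As a quick sanity check of the code above, here is a toy example of my own (the TF names are hypothetical):

tfs = ["TF_A", "TF_B", "TF_C"]
tf2genes = {"TF_A": [0], "TF_B": [0, 1], "TF_C": [1]}
gene2tfs = {0: {"TF_A", "TF_B"}, 1: {"TF_B", "TF_C"}}
# Genes 0 and 1 share one TF out of three in the union of their regulators,
# so after the loops above, B[0, 1] == 1/3, and the diagonal of B is 1.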

The code does nothing more than calculate the Jaccard similarity between genes based on their shared transcription factors. And if (1-alpha)*cov(X) + alpha*I is equivalent to shrinking toward the identity matrix (see "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection"), then (1-beta)*cov(X) + beta*B is equivalent to shrinking coregulated genes together.

Model assumptions: Beyond what LDA already assumes (a covariance matrix shared across the classes, ...), we also assume that the impact of all transcription factors on their genes is identical.

Neural network

While LDA is fairly popular in microarray analysis, neural networks are not, as they require a lot of training samples to learn anything useful. But we can partially alleviate that issue with a smart network structure and initialization. We can use a neural network with a single hidden layer, where the number of neurons in the hidden layer equals the number of transcription factors. And instead of using a fully connected layer between the input layer (which represents genes) and the hidden layer (which represents transcription factors), we can make connections only between those genes and transcription factors where an interaction is known. This dramatically reduces the number of parameters to estimate. Furthermore, if we know whether a transcription factor works as an "activator" or a "deactivator", we can initialize the corresponding connection weights to fairly high, respectively low, random values. The idea is that if a gene is highly expressed (its feature value is high), its activator transcription factors are likely "on" while its deactivator transcription factors are likely "off".
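A minimal sketch of such a masked layer, assuming PyTorch (my own illustration, not a finished implementation; tf2genes and tf_sign are hypothetical mappings from TFs to regulated gene indices and to "activator"/"deactivator" labels):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TFNet(nn.Module):
    def __init__(self, n_genes, tfs, tf2genes, tf_sign, n_classes):
        super().__init__()
        self.hidden = nn.Linear(n_genes, len(tfs))  # one neuron per TF
        self.out = nn.Linear(len(tfs), n_classes)
        mask = torch.zeros(len(tfs), n_genes)
        init = torch.zeros(len(tfs), n_genes)
        for i, tf in enumerate(tfs):
            for gene in tf2genes[tf]:
                mask[i, gene] = 1.0
                # Fairly high random weights for activators, low for deactivators:
                sign = 1.0 if tf_sign[tf] == "activator" else -1.0
                init[i, gene] = sign * torch.rand(1).item()
        self.register_buffer("mask", mask)
        with torch.no_grad():
            self.hidden.weight.copy_(init)

    def forward(self, x):
        # Re-applying the mask in every forward pass keeps the unknown
        # gene-TF connections (and their gradients) at exactly zero.
        h = torch.relu(F.linear(x, self.hidden.weight * self.mask, self.hidden.bias))
        return self.out(h)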

Contrary to the LDA model, the neural network model does not assume that transcription factors affect genes with the same strength but rather learns the strength. But that also means that we need more training data than in the LDA model.

Transfer learning

To make the neural network model viable, we also have to exploit transfer learning: train the model on some large_dataset and use the trained weights between the input layer and the hidden layer as the initialization values for our model on dataset_of_interest. Of course, the same trick can be employed in LDA to shrink the covariance matrix toward the pooled covariance matrix estimated from many different datasets: (1-gamma)*cov(X) + gamma*P, where P is a pooled covariance matrix over many datasets. Hence, the final estimate of the covariance matrix might look like: (1-alpha-beta-gamma)*cov(X) + alpha*I + beta*B + gamma*P.
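Putting the pieces together, the final estimator is straightforward to write down (again a sketch of my own; B is the block matrix from above and P a pooled covariance matrix, both assumed precomputed):

import numpy as np

def shrunken_cov(X, B, P, alpha, beta, gamma):
    # (1-alpha-beta-gamma)*cov(X) + alpha*I + beta*B + gamma*P
    assert alpha >= 0 and beta >= 0 and gamma >= 0 and alpha + beta + gamma <= 1
    S = np.cov(X, rowvar=False)  # X: samples in rows, genes in columns
    I = np.eye(S.shape[0])
    return (1 - alpha - beta - gamma) * S + alpha * I + beta * B + gamma * P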

Conclusion

Both models can be further improved. For example, the assumption of identical covariance matrices per class in LDA could be relaxed as in regularized discriminant analysis (RDA), and the initial weights in the neural network could be scaled by 1/len(gene2tfs[gene]) so that they start at their expected value. But that is beyond this post.