The main issue with microarray classification is that the dimensionality is high (feature count > 10000) while the sample count is low (~$100 per sample). We can partially mitigate this issue by incorporating the background knowledge about how the features (genes) are controlled with transcription factors (TF).
I have two proposals: one with a linear discriminant analysis (LDA) and another with a neural network.
Linear discriminant analysis
LDA and variants of LDA, like nearest shrunken centroids (PAM) or shrunken centroids regularized discriminant analysis (SCRDA), are fairly popular in microarray analysis, but they all deal with the problem of how to reliably estimate the covariance matrix because the size of the covariance matrix is #features × #features. PAM gives up the hope of estimating the whole covariance matrix and estimates only the diagonal matrix, while SCRDA only shrinks the covariance matrix toward the diagonal matrix. But with the knowledge of which genes are coregulated together with the transcription factors, we can not only shrink the covariance matrix toward the diagonal matrix, but we can also shrink the coregulated genes together. If the estimate for covariance matrix in SCRDA is: (1-alpha)*cov(X) + alpha*I
, where X
are the training data, alpha
is a tunable scalar and I
is an identity matrix, then the covariance matrix in SCRDA with background knowledge is: (1-alpha-beta)*cov(X) + alpha*I + beta*B
, where beta
is a tunable scalar and B
is a block matrix. This block matrix can be generated with the following pseudocode:
# Get the count of TFs that two genes share:
B = np.zeros(len(x))
for tf in tfs:
for coregulated_gene_1 in tf2genes[tf]:
for coregulated_gene_2 in tf2genes[tf]:
B[coregulated_gene_1, coregulated_gene_2] += 1
# Normalize the intersect count by the count of TFs per gene_1 and gene_2:
for gene_1 in B:
for gene_2 in B:
B[gene_1, gene_2] /= len(gene2tfs[gene_1]) + len(gene2tfs[gene_2])
The code doesn't do anything else but calculate Jaccard similarity between the genes based on the shared transcription factors. And if (1-alpha)*cov(X) + alpha*I
is equivalent to shrinking toward the identity matrix (see "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection") then (1-beta)*cov(X) + beta*B
is equivalent to shrinking toward coregulated genes.
Model assumptions: Beyond what LDA does (shared covariance matrices across the classes,...), we also assume that the impact of all the transcription factors to genes is identical.
Neural network
Contrary to the LDA model, the neural network model does not assume that transcription factors affect the genes with the same strength but rather learns the strength. But that also means that we need more training data than in LDA model.
Transfer learning
To make the neural network model viable, we have to also exploit transfer learning: Train the model on some large_dataset
and use the trained weights between the input layer and the middle layer as the initialization values for our model on dataset_of_interest
. Of course, the same trick can be employed in LDA to shrink the covariance matrix toward the pooled covariance matrix estimated from many different datasets: (1-gamma)*cov(X) + gamma*P
, where P
is a pooled covariance matrix over many datasets. Hence, the final estimate of the covariance matrix might look like: (1-alpha-beta-gamma)*cov(X) + alpha*I + beta*B + gamma*P
.
Conclusion
Both models can be further improved. For example, the assumption of identical covariance matrices per class in the LDA could be relaxed like in regularized discriminant analysis (RDA) and the initial weights in the neural network could be scaled with 1/len(gene2tfs[gene])
to initialize them to the expected value. But that is beyond this post.
Žádné komentáře:
Okomentovat