Brain inter-dataset
Warning: this vignette was written for scHPL v0.0.2 and should be updated.
[1]:
import os
import pandas as pd
import time as tm
import scanpy as sc
from scHPL import progressive_learning, predict, evaluate
In this vignette, we repeat the brain inter-dataset experiment. We use three datasets to construct a tree for brain cell populations and a fourth dataset to test it. The aligned datasets and labels can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.4557712
Read the data
We start by reading the datasets and their corresponding labels. The data are stored in an AnnData object, which we split by dataset and convert into pandas DataFrames.
In each dataset, rows represent cells and columns represent genes.
[2]:
adata = sc.read('brain_downsampled5000_integrated.h5ad')
groups = adata.obs.groupby('dataset').indices
TM = adata[groups['TM']]
RO = adata[groups['Rosenberg']]
ZE = adata[groups['Zeisel']]
SA = adata[groups['Saunders']]
# Training data: one DataFrame (cells x genes) and one label table per dataset,
# in the order the datasets are added to the tree
data = []
labels = []
# Saunders and Zeisel: labels come from 'original2'; commas are replaced by underscores
data.append(pd.DataFrame(data = SA.X, index = SA.obs_names, columns=SA.var_names))
labels.append(pd.DataFrame(data = SA.obs['original2'].values).stack().str.replace(',','_').unstack())
data.append(pd.DataFrame(data = ZE.X, index = ZE.obs_names, columns=ZE.var_names))
labels.append(pd.DataFrame(data = ZE.obs['original2'].values).stack().str.replace(',','_').unstack())
# TM: labels come from 'original'
data.append(pd.DataFrame(data = TM.X, index = TM.obs_names, columns=TM.var_names))
labels.append(pd.DataFrame(data = TM.obs['original'].values))
# Rosenberg is held out as the test dataset
testdata = pd.DataFrame(data = RO.X, index = RO.obs_names, columns=RO.var_names)
testlabels = pd.DataFrame(data = RO.obs['original'].values)
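The splitting and label cleaning used in the cell above can be tried on a small table first. The following is a minimal sketch with made-up cells and labels standing in for `adata.obs`; only the `groupby(...).indices` lookup and the `stack().str.replace().unstack()` chain are taken from the vignette, everything else is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs: one row per cell (all values are made up).
obs = pd.DataFrame(
    {
        "dataset": ["Saunders", "Zeisel", "Saunders", "Zeisel"],
        "original2": ["NEURON", "Neurons,Cycling", "MURAL", "Oligos,Cycling"],
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

# groupby(...).indices maps each dataset name to the integer row positions of its cells.
groups = obs.groupby("dataset").indices
zeisel = obs.iloc[groups["Zeisel"]]

# Same cleaning chain as in the vignette: commas in label names become underscores.
raw = pd.DataFrame(data=zeisel["original2"].values)
clean = raw.stack().str.replace(",", "_").unstack()

# For a single label column, a plain column-wise replace gives the same labels.
simple = raw[0].str.replace(",", "_")

print(groups["Zeisel"].tolist())
print(clean[0].tolist())
print(simple.tolist() == clean[0].tolist())
```

The result keeps the one-column DataFrame layout that the label tables in this vignette use.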
Construct and train the classification tree
Next, we use hierarchical progressive learning to construct and train a classification tree. After each iteration, the updated tree is printed. If two labels match perfectly, only one of them is shown in the tree, so each perfect match is also reported on a separate printed line.
In this vignette, we use the one-class SVM, apply dimensionality reduction, and use the default threshold of 0.25. If you want to use a linear SVM instead, set classifier = 'svm'. When using a linear SVM, we advise setting dimred to False.
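The two configurations described above can be written down as keyword arguments. This is a minimal sketch: `learn_tree_settings` is a hypothetical helper, not part of scHPL, and it only covers the two classifier options named in this vignette.

```python
def learn_tree_settings(classifier="svm_occ", threshold=0.25):
    """Return keyword arguments for learn_tree (hypothetical helper, not part of scHPL)."""
    if classifier not in ("svm_occ", "svm"):
        raise ValueError("this sketch only covers 'svm_occ' and 'svm'")
    # Following the text: dimensionality reduction with the one-class SVM,
    # no dimensionality reduction with the linear SVM.
    dimred = classifier == "svm_occ"
    return {"classifier": classifier, "dimred": dimred, "threshold": threshold}


occ_settings = learn_tree_settings()
linear_settings = learn_tree_settings("svm")
print(occ_settings)
print(linear_settings)

# With the objects from this vignette (not executed here):
# tree = progressive_learning.learn_tree(data, labels, **learn_tree_settings("svm"))
```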
[3]:
start = tm.time()
classifier = 'svm_occ'
dimred = True
threshold = 0.25
tree = progressive_learning.learn_tree(data, labels, classifier = classifier, dimred = dimred, threshold = threshold)
training_time = tm.time()-start
print('Training time:', training_time)
Iteration 1
Perfect match: Ependymal-Zeisel is now: EPENDYMAL-Saunders
Perfect match: Neurons-Zeisel is now: NEURON-Saunders
Updated tree:
root
ASTROCYTE-Saunders
Astrocyte-Zeisel
Bergmann-glia-Zeisel
OEC-Zeisel
EPENDYMAL-Saunders
NEUROGENESIS-Saunders
Neurons_Cycling-Zeisel
NEURON-Saunders
Vascular-Zeisel
ENDOTHELIAL_STALK-Saunders
ENDOTHELIAL_TIP-Saunders
MURAL-Saunders
Immune-Zeisel
MACROPHAGE-Saunders
MICROGLIA-Saunders
Oligos-Zeisel
OLIGODENDROCYTE-Saunders
POLYDENDROCYTE-Saunders
Oligos_Cycling-Zeisel
Ttr-Zeisel
Iteration 2
Perfect match: endothelial cell-TM is now: ENDOTHELIAL_STALK-Saunders
Perfect match: microglial cell-TM is now: Immune-Zeisel
Perfect match: brain pericyte-TM is now: MURAL-Saunders
Perfect match: Bergmann glial cell-TM is now: Bergmann-glia-Zeisel
Perfect match: oligodendrocyte-TM is now: OLIGODENDROCYTE-Saunders
Perfect match: oligodendrocyte precursor cell-TM is now: POLYDENDROCYTE-Saunders
Updated tree:
root
ASTROCYTE-Saunders
Bergmann-glia-Zeisel
astrocyte of the cerebral cortex-TM
Astrocyte-Zeisel
OEC-Zeisel
EPENDYMAL-Saunders
Vascular-Zeisel
ENDOTHELIAL_STALK-Saunders
ENDOTHELIAL_TIP-Saunders
MURAL-Saunders
Immune-Zeisel
MACROPHAGE-Saunders
MICROGLIA-Saunders
Oligos-Zeisel
OLIGODENDROCYTE-Saunders
POLYDENDROCYTE-Saunders
Oligos_Cycling-Zeisel
Ttr-Zeisel
neuron-TM
NEUROGENESIS-Saunders
Neurons_Cycling-Zeisel
NEURON-Saunders
macrophage-TM
Training time: 17915.45694875717
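The training time above is the difference of two `time.time()` calls and is therefore reported in seconds. A small formatter makes such durations easier to read; `format_duration` is a hypothetical helper introduced only for this sketch.

```python
def format_duration(seconds):
    """Format a duration given in seconds as hours, minutes and seconds (hypothetical helper)."""
    total = int(seconds)  # drop the fractional part
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# The training time printed in this vignette
print(format_duration(17915.45694875717))  # 4h 58m 35s
```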
Predict the labels of the fourth dataset
In this last step, we use the learned tree to predict the labels of the Rosenberg dataset.
[4]:
start = tm.time()
ypred = predict.predict_labels(testdata, tree)
test_time = tm.time()-start
print('Predict time:', test_time)
Predict time: 1220.4297716617584
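The vignette loads `testlabels` but does not compare them with the predictions. One possible way to inspect the result is a cross-tabulation of true against predicted labels. The sketch below uses plain pandas rather than scHPL's own `evaluate` module, and the labels are made up; with the real data, `testlabels` and `ypred` would first need to be flattened to one-dimensional sequences of equal length.

```python
import pandas as pd

# Made-up labels standing in for the true Rosenberg labels and the predictions.
true_labels = pd.Series(
    ["Neuron", "Neuron", "Astrocyte", "Microglia", "Astrocyte"], name="true"
)
pred_labels = pd.Series(
    ["NEURON-Saunders", "neuron-TM", "ASTROCYTE-Saunders", "Immune-Zeisel", "NEURON-Saunders"],
    name="predicted",
)

# Rows are true labels, columns are predicted labels, entries are cell counts.
confusion = pd.crosstab(true_labels, pred_labels)

# Row-normalize so that each row shows the fraction of a true population per prediction.
confusion_norm = confusion.div(confusion.sum(axis=1), axis=0)

print(confusion)
print(confusion_norm.round(2))
```

Because the tree is hierarchical, a prediction at an internal node such as neuron-TM can still be consistent with the true label, so the table should be read alongside the printed tree.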