Brain inter-dataset

Warning: this vignette was written for scHPL v0.0.2 and should be updated.

[1]:
import os
import pandas as pd
import time as tm
import scanpy as sc
from scHPL import progressive_learning, predict, evaluate

In this vignette, we repeat the brain inter-dataset experiment. We use three datasets (Saunders, Zeisel, and TM) to construct a classification tree for brain cell populations, and keep the fourth (Rosenberg) for testing. The aligned datasets and labels can be downloaded from https://doi.org/10.5281/zenodo.4557712

Read the data

We start by reading the datasets and their corresponding labels. Here we load an AnnData object and convert each dataset into a pandas DataFrame.

In the datasets, the rows represent cells and the columns represent genes.
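
The layout described above can be sketched with a small helper. This is only an illustration under stated assumptions: `to_dataframe` is a hypothetical name, the cell and gene names are made up, and the `toarray()` branch assumes the expression matrix might be stored as a SciPy sparse matrix (the integrated matrix used here may already be dense).

```python
import numpy as np
import pandas as pd


def to_dataframe(matrix, cell_names, gene_names):
    """Build a cells x genes DataFrame (hypothetical helper, not part of scHPL)."""
    # Assumption: sparse matrices expose toarray(); dense arrays pass through.
    if hasattr(matrix, "toarray"):
        matrix = matrix.toarray()
    return pd.DataFrame(np.asarray(matrix), index=cell_names, columns=gene_names)


# Made-up toy matrix: 2 cells (rows) x 3 genes (columns)
toy = np.array([[1.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
expr = to_dataframe(toy, ["cell_1", "cell_2"], ["GeneA", "GeneB", "GeneC"])
print(expr.shape)
```

Each row of `expr` is then one cell and each column one gene, which is the orientation the rest of the vignette relies on.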

[2]:
# Load the integrated AnnData object (rows are cells, columns are genes)
adata = sc.read('brain_downsampled5000_integrated.h5ad')

# Map each dataset name to the positions of its cells
groups = adata.obs.groupby('dataset').indices

TM = adata[groups['TM']]
RO = adata[groups['Rosenberg']]
ZE = adata[groups['Zeisel']]
SA = adata[groups['Saunders']]

# Training data and labels, in the order the datasets are added to the tree
data = []
labels = []

# Saunders: labels come from 'original2'; commas are replaced by underscores
data.append(pd.DataFrame(data = SA.X, index = SA.obs_names, columns=SA.var_names))
labels.append(pd.DataFrame(data = SA.obs['original2'].values).stack().str.replace(',','_').unstack())

# Zeisel: same treatment as Saunders
data.append(pd.DataFrame(data = ZE.X, index = ZE.obs_names, columns=ZE.var_names))
labels.append(pd.DataFrame(data = ZE.obs['original2'].values).stack().str.replace(',','_').unstack())

# TM: labels come from 'original' and are used unchanged
data.append(pd.DataFrame(data = TM.X, index = TM.obs_names, columns=TM.var_names))
labels.append(pd.DataFrame(data = TM.obs['original'].values))

# Rosenberg is held out as the test dataset
testdata = pd.DataFrame(data = RO.X, index = RO.obs_names, columns=RO.var_names)
testlabels = pd.DataFrame(data = RO.obs['original'].values)
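
The stack/replace/unstack chain in the cell above replaces commas in the labels with underscores and returns a one-column DataFrame. A possibly more readable variant is sketched below. `clean_labels` is a hypothetical helper, the raw labels are made up for illustration, and the sketch assumes the labels contain no missing values.

```python
import pandas as pd


def clean_labels(values):
    """Return a one-column DataFrame with commas replaced by underscores.

    Hypothetical helper, not part of scHPL.
    """
    series = pd.Series(list(values), dtype="object")
    # regex=False treats ',' as a literal character
    return series.str.replace(",", "_", regex=False).to_frame()


# Made-up labels, for illustration only
raw = ["Neurons,Cycling", "Astrocyte", "Oligos,Cycling"]
cleaned = clean_labels(raw)
print(cleaned[0].tolist())
```

The result keeps the default column name `0`, as `pd.DataFrame(data = ...)` does in the cell above.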

Construct and train the classification tree

Next, we use hierarchical progressive learning to construct and train a classification tree. After each iteration, the updated tree is printed. If two labels match perfectly, only one of them is visible in the tree, so each perfect match is also reported with a print statement.
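
To make the printed output easier to read, here is a minimal sketch of how a tree with merged labels could be rendered. The `Node` class, `render`, and `perfect_matches` are hypothetical and are not scHPL's internal data structures; the sketch simply assumes that a node keeps a list of names, shows only the first, and treats the rest as perfect matches.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    # First name is displayed; any further names are perfect-match aliases.
    names: List[str]
    children: List["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, showing only the first name."""
    lines = [" " * 8 * depth + node.names[0]]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


def perfect_matches(node: Node) -> List[Tuple[str, str]]:
    """Collect (hidden label, displayed label) pairs across the tree."""
    found = [(hidden, node.names[0]) for hidden in node.names[1:]]
    for child in node.children:
        found.extend(perfect_matches(child))
    return found


# Small toy tree using a few labels from the output in this vignette
root = Node(["root"], [
    Node(["EPENDYMAL-Saunders", "Ependymal-Zeisel"]),
    Node(["Vascular-Zeisel"], [Node(["MURAL-Saunders"])]),
])

for hidden, shown in perfect_matches(root):
    print("Perfect match: ", hidden, "is now:", shown)
print("\n".join(render(root)))
```

In this toy tree, `Ependymal-Zeisel` never appears in the rendered lines, which is why the separate "Perfect match" message is needed to trace it.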

In this vignette, we use the one-class SVM, apply dimensionality reduction, and use the default threshold of 0.25. If you want to use a linear SVM instead, set classifier = 'svm'. When using a linear SVM, we advise setting dimred to False.
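
The two configurations described above can be collected in one place, as in the sketch below. `tree_settings` is a hypothetical helper, not part of scHPL; the keyword names mirror the `learn_tree` call in this vignette, and keeping the threshold at 0.25 for the linear SVM is an assumption.

```python
def tree_settings(classifier="svm_occ"):
    """Return keyword arguments for the two setups described in the text.

    Hypothetical helper; keeping threshold=0.25 for 'svm' is an assumption.
    """
    if classifier == "svm_occ":
        # One-class SVM with dimensionality reduction, as used in this vignette
        return {"classifier": "svm_occ", "dimred": True, "threshold": 0.25}
    if classifier == "svm":
        # Linear SVM: the text advises turning dimensionality reduction off
        return {"classifier": "svm", "dimred": False, "threshold": 0.25}
    raise ValueError(f"Unknown classifier: {classifier!r}")


settings = tree_settings("svm")
print(settings)
# Possible use with the call from this vignette:
# tree = progressive_learning.learn_tree(data, labels, **settings)
```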

[3]:
start = tm.time()
classifier = 'svm_occ'
dimred = True
threshold = 0.25
tree = progressive_learning.learn_tree(data, labels, classifier = classifier, dimred = dimred, threshold = threshold)

training_time = tm.time()-start

print('Training time:', training_time)
Iteration  1

Perfect match:  Ependymal-Zeisel is now: EPENDYMAL-Saunders
Perfect match:  Neurons-Zeisel is now: NEURON-Saunders

Updated tree:
root
        ASTROCYTE-Saunders
                Astrocyte-Zeisel
                Bergmann-glia-Zeisel
                OEC-Zeisel
        EPENDYMAL-Saunders
        NEUROGENESIS-Saunders
                Neurons_Cycling-Zeisel
        NEURON-Saunders
        Vascular-Zeisel
                ENDOTHELIAL_STALK-Saunders
                ENDOTHELIAL_TIP-Saunders
                MURAL-Saunders
        Immune-Zeisel
                MACROPHAGE-Saunders
                MICROGLIA-Saunders
        Oligos-Zeisel
                OLIGODENDROCYTE-Saunders
                POLYDENDROCYTE-Saunders
        Oligos_Cycling-Zeisel
        Ttr-Zeisel
Iteration  2

Perfect match:  endothelial cell-TM is now: ENDOTHELIAL_STALK-Saunders
Perfect match:  microglial cell-TM is now: Immune-Zeisel
Perfect match:  brain pericyte-TM is now: MURAL-Saunders
Perfect match:  Bergmann glial cell-TM is now: Bergmann-glia-Zeisel
Perfect match:  oligodendrocyte-TM is now: OLIGODENDROCYTE-Saunders
Perfect match:  oligodendrocyte precursor cell-TM is now: POLYDENDROCYTE-Saunders

Updated tree:
root
        ASTROCYTE-Saunders
                Bergmann-glia-Zeisel
                astrocyte of the cerebral cortex-TM
                        Astrocyte-Zeisel
                        OEC-Zeisel
        EPENDYMAL-Saunders
        Vascular-Zeisel
                ENDOTHELIAL_STALK-Saunders
                ENDOTHELIAL_TIP-Saunders
                MURAL-Saunders
        Immune-Zeisel
                MACROPHAGE-Saunders
                MICROGLIA-Saunders
        Oligos-Zeisel
                OLIGODENDROCYTE-Saunders
                POLYDENDROCYTE-Saunders
        Oligos_Cycling-Zeisel
        Ttr-Zeisel
        neuron-TM
                NEUROGENESIS-Saunders
                        Neurons_Cycling-Zeisel
                NEURON-Saunders
        macrophage-TM
Training time: 17915.45694875717

Predict the labels of the fourth dataset

In this last step, we use the learned tree to predict the labels of the Rosenberg dataset.

[4]:
start = tm.time()
ypred = predict.predict_labels(testdata, tree)
test_time = tm.time()-start
print('Predict time:', test_time)
Predict time: 1220.4297716617584
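
Because the Rosenberg labels use their own naming, one way to inspect the predictions is to cross-tabulate true labels against predicted labels. The sketch below uses made-up labels purely for illustration; with the objects from this vignette, `testlabels` and `ypred` would take their place, assuming both can be flattened to one label per cell.

```python
import pandas as pd

# Made-up labels for illustration; not results from this vignette
true = pd.Series(["Neuron", "Neuron", "Neuron", "Astrocyte"], name="true")
pred = pd.Series(
    ["NEURON-Saunders", "NEURON-Saunders", "neuron-TM", "Astrocyte-Zeisel"],
    name="predicted",
)

# Rows are true labels, columns are predicted tree labels, cells are counts
table = pd.crosstab(true, pred)
print(table)
```

Each row shows where the cells of one true population end up in the tree, including internal nodes such as `neuron-TM` in this toy example.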