{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# treeArches: learning and updating a cell-type hierarchy (basic tutorial)\n", "\n", "In this tutorial, we explain the different functionalities of treeArches. We show how to:\n", "\n", "- [Step 1](#Create-scVI-model-and-train-it-on-reference-dataset): Integrate reference datasets using scVI\n", "- [Step 2](#Construct-hierarchy-for-the-reference-using-scHPL): Match the cell-types in the reference datasets to learn the cell-type hierarchy of the reference datasets using scHPL\n", "- [Step 3](#Use-pretrained-reference-model-and-apply-surgery-with-a-new-query-dataset-to-get-a-bigger-reference-atlas): Apply architectural surgery to extend the reference dataset using scArches\n", "- [Step 4a](#Updating-the-hierarchy-using-scHPL): Update the learned hierarchy with the cell-types from the query dataset using scHPL (useful when the query dataset is labeled)\n", "- [Step 4b](#Predicting-cell-type-labels-using-scHPL): Predict the labels of the cells in the query dataset using scHPL (useful when the query dataset is unlabeled)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "warnings.simplefilter(action='ignore', category=UserWarning)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Global seed set to 0\n" ] } ], "source": [ "import scanpy as sc\n", "import torch\n", "import scarches as sca\n", "from scarches.dataset.trvae.data_handling import remove_sparsity\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import gdown\n", "import copy as cp\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sc.settings.set_figure_params(dpi=1000, frameon=False)\n", "sc.set_figure_params(dpi=1000)\n", 
"sc.set_figure_params(figsize=(7,7))\n", "torch.set_printoptions(precision=3, sci_mode=False, edgeitems=7)\n", "\n", "import matplotlib\n", "matplotlib.rcParams['pdf.fonttype'] = 42" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download raw Dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading...\n", "From: https://drive.google.com/uc?id=1Vh6RpYkusbGIZQC8GMFe3OKVDk5PWEpC\n", "To: /exports/humgen/lmichielsen/scArches-scHPL/PBMC/pbmc.h5ad\n", "100%|█████████████████████████████████████████████| 2.06G/2.06G [01:37<00:00, 21.1MB/s]\n" ] }, { "data": { "text/plain": [ "'pbmc.h5ad'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url = 'https://drive.google.com/uc?id=1Vh6RpYkusbGIZQC8GMFe3OKVDk5PWEpC'\n", "output = 'pbmc.h5ad'\n", "gdown.download(url, output, quiet=False)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "adata = sc.read('pbmc.h5ad')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "adata.X = adata.layers[\"counts\"].copy()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "adata = adata[adata.obs.study != \"Villani\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now split the data into reference and query datasets to simulate the process of building and then extending a reference. Here we use the '10X' study (selected via `adata.obs.study`) as the query data." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AnnData object with n_obs × n_vars = 21757 × 12303\n", " obs: 'batch', 'chemistry', 'data_type', 'dpt_pseudotime', 'final_annotation', 'mt_frac', 'n_counts', 'n_genes', 'sample_ID', 'size_factors', 'species', 'study', 'tissue'\n", " layers: 'counts'\n", "AnnData object with n_obs × n_vars = 10727 × 12303\n", " obs: 'batch', 'chemistry', 'data_type', 'dpt_pseudotime', 'final_annotation', 'mt_frac', 'n_counts', 'n_genes', 'sample_ID', 'size_factors', 'species', 'study', 'tissue'\n", " layers: 'counts'\n" ] } ], "source": [ "target_conditions = [\"10X\"]\n", "source_adata = adata[~adata.obs.study.isin(target_conditions)].copy()\n", "target_adata = adata[adata.obs.study.isin(target_conditions)].copy()\n", "print(source_adata)\n", "print(target_adata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For better model performance, it is necessary to select highly variable genes (HVGs). We do this with the function `scanpy.pp.highly_variable_genes()`. The parameter `n_top_genes` is set to 2000 here; for more complex datasets, you might have to increase the number of genes to capture more of the diversity in the data." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "source_adata.raw = source_adata" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 21757 × 12303\n", " obs: 'batch', 'chemistry', 'data_type', 'dpt_pseudotime', 'final_annotation', 'mt_frac', 'n_counts', 'n_genes', 'sample_ID', 'size_factors', 'species', 'study', 'tissue'\n", " layers: 'counts'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "source_adata" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "sc.pp.normalize_total(source_adata)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "sc.pp.log1p(source_adata)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "sc.pp.highly_variable_genes(\n", " source_adata,\n", " n_top_genes=2000,\n", " batch_key=\"study\",\n", " subset=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For consistency, we set `source_adata.X` back to the raw counts, restricted to the selected HVGs. The normalized, log-transformed values above were used for HVG selection. In other datasets, `X` may already contain raw counts." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "source_adata.X = source_adata.raw[:, source_adata.var_names].X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create scVI model and train it on reference dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "