Denoising scRNAseq#
In this tutorial, we will walk through the steps for denoising mRNA counts in species-mixing data.
In this experiment, an equal number of human HEK293T cells and mouse NIH3T3 cells were pooled and sequenced, transcripts from the other organism were unambiguously ambient contamination. The original dataset is downloaded from 10x genomics dataset, and cell annotation file is available at scAR-reproducibility/data
[1]:
# # Run this cell to install scar in Colab
# # Skip this cell if running on your own device
# %pip install scanpy
# %pip install git+https://github.com/Novartis/scAR.git
# %pip install matplotlib==3.1.3 # Specify this matplotlib version to avoid errors
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
from scar import model, setup_anndata
import warnings
warnings.simplefilter("ignore")
Download data#
The raw and filtered count matrices can be downloaded from 10x Dataset.
The raw data#
cellranger output: raw_feature_bc_matrix
[2]:
hgmm_20k_raw = sc.read_10x_h5(filename='20k_hgmm_3p_HT_nextgem_Chromium_X_raw_feature_bc_matrix.h5ad',
backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/20k_hgmm_3p_HT_nextgem_Chromium_X/20k_hgmm_3p_HT_nextgem_Chromium_X_raw_feature_bc_matrix.h5');
hgmm_20k_raw.var_names_make_unique();
The filtered data#
cellranger output: filtered_feature_bc_matrix
[3]:
hgmm_20k = sc.read_10x_h5(filename='20k_hgmm_3p_HT_nextgem_Chromium_X_filtered_feature_bc_matrix.h5ad',
backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/20k_hgmm_3p_HT_nextgem_Chromium_X/20k_hgmm_3p_HT_nextgem_Chromium_X_filtered_feature_bc_matrix.h5');
hgmm_20k.var_names_make_unique();
Annotation and filtering#
We annotated cells by their cell species and provided the annotation file
[4]:
!git clone https://github.com/CaibinSh/scAR-reproducibility.git
hgmm_20k_anno = pd.read_csv('scAR-reproducibility/data/20k_hgmm_cell_annotation.csv', index_col=0)
hgmm_20k.obs=hgmm_20k.obs.join(hgmm_20k_anno[['species']])
Cloning into 'scAR-reproducibility'...
remote: Enumerating objects: 110, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 110 (delta 0), reused 3 (delta 0), pack-reused 106
Receiving objects: 100% (110/110), 65.67 MiB | 37.13 MiB/s, done.
Resolving deltas: 100% (48/48), done.
Gene and cell filtering
[5]:
sc.pp.filter_genes(hgmm_20k, min_counts=200);
sc.pp.filter_genes(hgmm_20k, max_counts=6000);
sc.pp.filter_cells(hgmm_20k, min_genes=200);
Calculate number of human and mouse transcripts
[6]:
hgmm_20k.obs['human gene counts'] = hgmm_20k[:, hgmm_20k.var['genome']=="GRCh38"].X.sum(axis=1).A1
hgmm_20k.obs['mouse gene counts'] = hgmm_20k[:, hgmm_20k.var['genome']=="mm10"].X.sum(axis=1).A1
[7]:
hgmm_20k
[7]:
AnnData object with n_obs × n_vars = 16292 × 16586
obs: 'species', 'n_genes', 'human gene counts', 'mouse gene counts'
var: 'gene_ids', 'feature_types', 'genome', 'n_counts'
Estimate ambient profile#
See details in calculating ambient profile
[8]:
setup_anndata(
adata = hgmm_20k,
raw_adata = hgmm_20k_raw,
prob = 0.995,
kneeplot = True
)
2024-08-10 20:48:50|INFO|setup_anndata|Use all 199666 droplets.
2024-08-10 20:48:50|INFO|setup_anndata|Estimating ambient profile for ['Gene Expression']...
2024-08-10 20:49:23|INFO|setup_anndata|Iteration: 1
2024-08-10 20:49:54|INFO|setup_anndata|Iteration: 2
2024-08-10 20:50:26|INFO|setup_anndata|Iteration: 3
2024-08-10 20:50:26|INFO|setup_anndata|Estimated ambient profile for Gene Expression saved in adata.uns
2024-08-10 20:50:26|INFO|setup_anndata|Estimated ambient profile for all features saved in adata.uns
[9]:
hgmm_20k.uns["ambient_profile_Gene Expression"].head()
[9]:
| ambient_profile_Gene Expression | |
|---|---|
| GRCh38_AL627309.5 | 0.000000 |
| GRCh38_LINC01128 | 0.000127 |
| GRCh38_LINC00115 | 0.000000 |
| GRCh38_FAM41C | 0.000010 |
| GRCh38_AL645608.2 | 0.000010 |
Training#
Note
For this specific case, it is necessary to set a large sparsity parameter because the data is highly sparse due to the following reasons:
1) The presence of low-coverage transcripts, as filtering was not applied;
2) The inclusion of both human and mouse transcripts in the mRNA matrix, which leads to additional zero counts.
However, when working with data that includes only highly variable genes, a smaller sparsity parameter is recommended to prevent overcorrection.
scAR allows for denoising a specific subset of genes and samples. To save computational time, it is advisable to use filtered raw counts (e.g., xxx_filtered_feature_bc_matrix.h5).
[10]:
hgmm_20k_scar = model(raw_count=hgmm_20k, # In the case of Anndata object, scar will automatically use the estimated ambient_profile present in adata.uns.
ambient_profile=hgmm_20k.uns['ambient_profile_Gene Expression'],
feature_type='mRNA',
sparsity=1,
device='cuda' # CPU, CUDA and MPS are supported.
)
hgmm_20k_scar.train(epochs=200,
batch_size=64,
verbose=True
)
2024-08-10 20:50:38|INFO|model|cuda will be used.
2024-08-10 20:50:39|INFO|VAE|Running VAE using the following param set:
2024-08-10 20:50:39|INFO|VAE|...denoised count type: mRNA
2024-08-10 20:50:39|INFO|VAE|...count model: binomial
2024-08-10 20:50:39|INFO|VAE|...num_input_feature: 16586
2024-08-10 20:50:39|INFO|VAE|...NN_layer1: 150
2024-08-10 20:50:39|INFO|VAE|...NN_layer2: 100
2024-08-10 20:50:39|INFO|VAE|...latent_space: 15
2024-08-10 20:50:39|INFO|VAE|...dropout_prob: 0.00
2024-08-10 20:50:39|INFO|VAE|...expected data sparsity: 1.00
2024-08-10 20:50:40|INFO|model|kld_weight: 1.00e-05
2024-08-10 20:50:40|INFO|model|learning rate: 1.00e-03
2024-08-10 20:50:40|INFO|model|lr_step_size: 5
2024-08-10 20:50:40|INFO|model|lr_gamma: 0.97
Training: 100%|██████████| 200/200 [05:06<00:00, 1.53s/it, Loss=4.6976e+03]
[11]:
# After training, we can infer the native signals
hgmm_20k_scar.inference() # by defaut, batch_size = None, set a batch_size if getting a memory issue
The denoised counts are saved in hgmm_20k_scar.native_counts and can be used for downstream analysis.
[12]:
denoised_count = pd.DataFrame(hgmm_20k_scar.native_counts.toarray(), index=hgmm_20k.obs_names, columns=hgmm_20k.var_names)
denoised_count.head()
[12]:
| GRCh38_AL627309.5 | GRCh38_LINC01128 | GRCh38_LINC00115 | GRCh38_FAM41C | GRCh38_AL645608.2 | GRCh38_LINC02593 | GRCh38_SAMD11 | GRCh38_KLHL17 | GRCh38_AL645608.7 | GRCh38_ISG15 | ... | mm10___Gm21887 | mm10___Kdm5d | mm10___Eif2s3y | mm10___Uty | mm10___Ddx3y | mm10___AC133103.1 | mm10___AC168977.1 | mm10___CAAA01118383.1 | mm10___Spry3 | mm10___Tmlhe | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAACCCAAGAGCCGAT-1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAACCCAAGCGTTCCG-1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| AAACCCAAGTGCGTCC-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 |
| AAACCCACAAACTGCT-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| AAACCCACAATACCTG-1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 16586 columns
Visualization#
Calculate number of human and mouse transcripts before denoising
[13]:
raw_counts_df = hgmm_20k.obs[['human gene counts', 'mouse gene counts']].astype(int)
raw_counts_df['log2(human gene counts+1)'] = np.log2(raw_counts_df['human gene counts']+1)
raw_counts_df['log2(mouse gene counts+1)'] = np.log2(raw_counts_df['mouse gene counts']+1)
raw_counts_df = raw_counts_df.join(hgmm_20k.obs[['species']])
visualization of raw counts
[14]:
plt.figure(figsize=(4, 4), dpi=150)
ax = sns.jointplot(data=raw_counts_df,
x="log2(human gene counts+1)",
y="log2(mouse gene counts+1)",
hue='species',
hue_order=['HEK293T', 'mixture', 'NIH3T3'],
s=8,
alpha=0.5,
legend=True,
height=5,
marginal_kws={'common_norm':False})
ax.ax_joint.text(0, 0, 'raw counts')
ax.ax_joint.set_xlim(-1, 14)
ax.ax_joint.set_ylim(-1, 14)
ax.ax_joint.legend(bbox_to_anchor=(1.55, 0.6), borderaxespad=0);
<Figure size 600x600 with 0 Axes>
Calculate number of human and mouse transcripts after denoising
[15]:
denoised_count['human gene counts'] = denoised_count.loc[:,denoised_count.columns.str.startswith('GRCh38_')].sum(axis=1)
denoised_count['mouse gene counts'] = denoised_count.loc[:,denoised_count.columns.str.startswith('mm10_')].sum(axis=1)
denoised_count['species'] = raw_counts_df['species']
denoised_count['log2(human gene counts+1)'] = np.log2(denoised_count['human gene counts']+1)
denoised_count['log2(mouse gene counts+1)'] = np.log2(denoised_count['mouse gene counts']+1)
visualization of denoised counts
[16]:
plt.figure(figsize=(4, 4), dpi=150);
ax = sns.jointplot(data=denoised_count,
x="log2(human gene counts+1)",
y="log2(mouse gene counts+1)",
hue='species',
hue_order=['HEK293T', 'mixture', 'NIH3T3'],
s=8,
alpha=0.5,
legend=True,
height=5,
marginal_kws={'common_norm':False});
ax.ax_joint.text(0, 0, 'denoised counts')
ax.ax_joint.set_xlim(-1, 14)
ax.ax_joint.set_ylim(-1, 14)
ax.ax_joint.legend(bbox_to_anchor=(1.55, 0.6), borderaxespad=0);
<Figure size 600x600 with 0 Axes>