Denoising scRNAseq#

In this tutorial, we will walk through the steps for denoising mRNA counts in species-mixing data.

In this experiment, an equal number of human HEK293T cells and mouse NIH3T3 cells were pooled and sequenced, transcripts from the other organism were unambiguously ambient contamination. The original dataset is downloaded from 10x genomics dataset, and cell annotation file is available at scAR-reproducibility/data

Note

To run this notebook on your device, you need to install scAR.

Alternatively, you can also run this notebook on Colab Open In Colab

[1]:
# # Run this cell to install scar in Colab
# # Skip this cell if running on your own device

# %pip install scanpy
# %pip install git+https://github.com/Novartis/scAR.git
# %pip install matplotlib==3.1.3  # Specify this matplotlib version to avoid errors
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
from scar import model, setup_anndata

import warnings
warnings.simplefilter("ignore")

Download data#

The raw and filtered count matrices can be downloaded from 10x Dataset.

The raw data#

cellranger output: raw_feature_bc_matrix

[2]:
hgmm_20k_raw = sc.read_10x_h5(filename='20k_hgmm_3p_HT_nextgem_Chromium_X_raw_feature_bc_matrix.h5ad',
                                 backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/20k_hgmm_3p_HT_nextgem_Chromium_X/20k_hgmm_3p_HT_nextgem_Chromium_X_raw_feature_bc_matrix.h5');
hgmm_20k_raw.var_names_make_unique();

The filtered data#

cellranger output: filtered_feature_bc_matrix

[3]:
hgmm_20k = sc.read_10x_h5(filename='20k_hgmm_3p_HT_nextgem_Chromium_X_filtered_feature_bc_matrix.h5ad',
                             backup_url='https://cf.10xgenomics.com/samples/cell-exp/6.1.0/20k_hgmm_3p_HT_nextgem_Chromium_X/20k_hgmm_3p_HT_nextgem_Chromium_X_filtered_feature_bc_matrix.h5');
hgmm_20k.var_names_make_unique();

Annotation and filtering#

We annotated cells by their cell species and provided the annotation file

[4]:
!git clone https://github.com/CaibinSh/scAR-reproducibility.git

hgmm_20k_anno = pd.read_csv('scAR-reproducibility/data/20k_hgmm_cell_annotation.csv', index_col=0)
hgmm_20k.obs=hgmm_20k.obs.join(hgmm_20k_anno[['species']])
Cloning into 'scAR-reproducibility'...
remote: Enumerating objects: 110, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 110 (delta 0), reused 3 (delta 0), pack-reused 106
Receiving objects: 100% (110/110), 65.67 MiB | 37.13 MiB/s, done.
Resolving deltas: 100% (48/48), done.

Gene and cell filtering

[5]:
sc.pp.filter_genes(hgmm_20k, min_counts=200);
sc.pp.filter_genes(hgmm_20k, max_counts=6000);
sc.pp.filter_cells(hgmm_20k, min_genes=200);

Calculate number of human and mouse transcripts

[6]:
hgmm_20k.obs['human gene counts'] = hgmm_20k[:, hgmm_20k.var['genome']=="GRCh38"].X.sum(axis=1).A1
hgmm_20k.obs['mouse gene counts'] = hgmm_20k[:, hgmm_20k.var['genome']=="mm10"].X.sum(axis=1).A1
[7]:
hgmm_20k
[7]:
AnnData object with n_obs × n_vars = 16292 × 16586
    obs: 'species', 'n_genes', 'human gene counts', 'mouse gene counts'
    var: 'gene_ids', 'feature_types', 'genome', 'n_counts'

Estimate ambient profile#

See details in calculating ambient profile

[8]:
setup_anndata(
    adata = hgmm_20k,
    raw_adata = hgmm_20k_raw,
    prob = 0.995,
    kneeplot = True
)
2024-08-10 20:48:50|INFO|setup_anndata|Use all 199666 droplets.
2024-08-10 20:48:50|INFO|setup_anndata|Estimating ambient profile for ['Gene Expression']...
2024-08-10 20:49:23|INFO|setup_anndata|Iteration: 1
2024-08-10 20:49:54|INFO|setup_anndata|Iteration: 2
2024-08-10 20:50:26|INFO|setup_anndata|Iteration: 3
2024-08-10 20:50:26|INFO|setup_anndata|Estimated ambient profile for Gene Expression saved in adata.uns
2024-08-10 20:50:26|INFO|setup_anndata|Estimated ambient profile for all features saved in adata.uns
../_images/tutorials_scAR_tutorial_denoising_scRNAseq_18_1.png
[9]:
hgmm_20k.uns["ambient_profile_Gene Expression"].head()
[9]:
ambient_profile_Gene Expression
GRCh38_AL627309.5 0.000000
GRCh38_LINC01128 0.000127
GRCh38_LINC00115 0.000000
GRCh38_FAM41C 0.000010
GRCh38_AL645608.2 0.000010

Training#

Note

For this specific case, it is necessary to set a large sparsity parameter because the data is highly sparse due to the following reasons:

1) The presence of low-coverage transcripts, as filtering was not applied;
2) The inclusion of both human and mouse transcripts in the mRNA matrix, which leads to additional zero counts.

However, when working with data that includes only highly variable genes, a smaller sparsity parameter is recommended to prevent overcorrection.

scAR allows for denoising a specific subset of genes and samples. To save computational time, it is advisable to use filtered raw counts (e.g., xxx_filtered_feature_bc_matrix.h5).

[10]:
hgmm_20k_scar = model(raw_count=hgmm_20k, # In the case of Anndata object, scar will automatically use the estimated ambient_profile present in adata.uns.
                      ambient_profile=hgmm_20k.uns['ambient_profile_Gene Expression'],
                      feature_type='mRNA',
                      sparsity=1,
                      device='cuda' # CPU, CUDA and MPS are supported.
                     )

hgmm_20k_scar.train(epochs=200,
                    batch_size=64,
                    verbose=True
                   )
2024-08-10 20:50:38|INFO|model|cuda will be used.
2024-08-10 20:50:39|INFO|VAE|Running VAE using the following param set:
2024-08-10 20:50:39|INFO|VAE|...denoised count type: mRNA
2024-08-10 20:50:39|INFO|VAE|...count model: binomial
2024-08-10 20:50:39|INFO|VAE|...num_input_feature: 16586
2024-08-10 20:50:39|INFO|VAE|...NN_layer1: 150
2024-08-10 20:50:39|INFO|VAE|...NN_layer2: 100
2024-08-10 20:50:39|INFO|VAE|...latent_space: 15
2024-08-10 20:50:39|INFO|VAE|...dropout_prob: 0.00
2024-08-10 20:50:39|INFO|VAE|...expected data sparsity: 1.00
2024-08-10 20:50:40|INFO|model|kld_weight: 1.00e-05
2024-08-10 20:50:40|INFO|model|learning rate: 1.00e-03
2024-08-10 20:50:40|INFO|model|lr_step_size: 5
2024-08-10 20:50:40|INFO|model|lr_gamma: 0.97
Training: 100%|██████████| 200/200 [05:06<00:00,  1.53s/it, Loss=4.6976e+03]
[11]:
# After training, we can infer the native signals
hgmm_20k_scar.inference()  # by defaut, batch_size = None, set a batch_size if getting a memory issue

The denoised counts are saved in hgmm_20k_scar.native_counts and can be used for downstream analysis.

[12]:
denoised_count = pd.DataFrame(hgmm_20k_scar.native_counts.toarray(), index=hgmm_20k.obs_names, columns=hgmm_20k.var_names)
denoised_count.head()
[12]:
GRCh38_AL627309.5 GRCh38_LINC01128 GRCh38_LINC00115 GRCh38_FAM41C GRCh38_AL645608.2 GRCh38_LINC02593 GRCh38_SAMD11 GRCh38_KLHL17 GRCh38_AL645608.7 GRCh38_ISG15 ... mm10___Gm21887 mm10___Kdm5d mm10___Eif2s3y mm10___Uty mm10___Ddx3y mm10___AC133103.1 mm10___AC168977.1 mm10___CAAA01118383.1 mm10___Spry3 mm10___Tmlhe
AAACCCAAGAGCCGAT-1 0 1 0 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAACCCAAGCGTTCCG-1 0 1 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
AAACCCAAGTGCGTCC-1 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 2 0 0 0 0 0
AAACCCACAAACTGCT-1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
AAACCCACAATACCTG-1 0 1 0 0 1 0 1 0 1 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 16586 columns

Visualization#

Calculate number of human and mouse transcripts before denoising

[13]:
raw_counts_df = hgmm_20k.obs[['human gene counts', 'mouse gene counts']].astype(int)
raw_counts_df['log2(human gene counts+1)'] = np.log2(raw_counts_df['human gene counts']+1)
raw_counts_df['log2(mouse gene counts+1)'] = np.log2(raw_counts_df['mouse gene counts']+1)
raw_counts_df = raw_counts_df.join(hgmm_20k.obs[['species']])

visualization of raw counts

[14]:
plt.figure(figsize=(4, 4), dpi=150)

ax = sns.jointplot(data=raw_counts_df,
                   x="log2(human gene counts+1)",
                   y="log2(mouse gene counts+1)",
                   hue='species',
                   hue_order=['HEK293T', 'mixture', 'NIH3T3'],
                   s=8,
                   alpha=0.5,
                   legend=True,
                   height=5,
                   marginal_kws={'common_norm':False})
ax.ax_joint.text(0, 0, 'raw counts')
ax.ax_joint.set_xlim(-1, 14)
ax.ax_joint.set_ylim(-1, 14)
ax.ax_joint.legend(bbox_to_anchor=(1.55, 0.6), borderaxespad=0);
<Figure size 600x600 with 0 Axes>
../_images/tutorials_scAR_tutorial_denoising_scRNAseq_30_1.png

Calculate number of human and mouse transcripts after denoising

[15]:
denoised_count['human gene counts'] = denoised_count.loc[:,denoised_count.columns.str.startswith('GRCh38_')].sum(axis=1)
denoised_count['mouse gene counts'] = denoised_count.loc[:,denoised_count.columns.str.startswith('mm10_')].sum(axis=1)
denoised_count['species'] = raw_counts_df['species']
denoised_count['log2(human gene counts+1)'] = np.log2(denoised_count['human gene counts']+1)
denoised_count['log2(mouse gene counts+1)'] = np.log2(denoised_count['mouse gene counts']+1)

visualization of denoised counts

[16]:
plt.figure(figsize=(4, 4), dpi=150);

ax = sns.jointplot(data=denoised_count,
                   x="log2(human gene counts+1)",
                   y="log2(mouse gene counts+1)",
                   hue='species',
                   hue_order=['HEK293T', 'mixture', 'NIH3T3'],
                   s=8,
                   alpha=0.5,
                   legend=True,
                   height=5,
                   marginal_kws={'common_norm':False});
ax.ax_joint.text(0, 0, 'denoised counts')
ax.ax_joint.set_xlim(-1, 14)
ax.ax_joint.set_ylim(-1, 14)
ax.ax_joint.legend(bbox_to_anchor=(1.55, 0.6), borderaxespad=0);
<Figure size 600x600 with 0 Axes>
../_images/tutorials_scAR_tutorial_denoising_scRNAseq_34_1.png