Denoising model

The main module of scar

class scar.main._scar.model(raw_count: Union[str, numpy.ndarray, pandas.core.frame.DataFrame, anndata._core.anndata.AnnData], ambient_profile: Optional[Union[str, numpy.ndarray, pandas.core.frame.DataFrame]] = None, nn_layer1: int = 150, nn_layer2: int = 100, latent_dim: int = 15, dropout_prob: float = 0, feature_type: str = 'mRNA', count_model: str = 'binomial', sparsity: float = 0.9, device: str = 'auto', verbose: bool = True)

The scar model

Parameters

raw_count (Union[str, np.ndarray, pd.DataFrame, ad.AnnData]) –
Raw count matrix or Anndata object.

Note

scar takes raw UMI counts as input. Do not apply size normalization or log transformation.
ambient_profile (Optional[Union[str, np.ndarray, pd.DataFrame]], optional) – the probability of occurrence of each ambient transcript. If None, the ambient profile is estimated by averaging cells, by default None
nn_layer1 (int, optional) – number of neurons of the 1st layer, by default 150
nn_layer2 (int, optional) – number of neurons of the 2nd layer, by default 100
latent_dim (int, optional) – number of neurons of the bottleneck layer, by default 15
dropout_prob (float, optional) – dropout probability of neurons, by default 0
feature_type (str, optional) –
the feature to be denoised. One of the following:

‘mRNA’ – transcriptome data, including scRNAseq and snRNAseq

‘ADT’ – protein counts in CITE-seq

‘sgRNA’ – sgRNA counts for scCRISPRseq

‘tag’ – identity barcodes or any data types of high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell

‘CMO’ – Cell Multiplexing Oligo counts for cell hashing

‘ATAC’ – peak counts for scATACseq

New in version 0.5.2.

By default “mRNA”
count_model (str, optional) –
the model to generate the UMI count. One of the following:

‘binomial’ – binomial model,

‘poisson’ – poisson model,

‘zeroinflatedpoisson’ – zeroinflatedpoisson model.

By default “binomial”
sparsity (float, optional) –
range: [0, 1]. The sparsity of expected native signals. It varies between datasets: if genes are prefiltered (e.g. only highly variable genes are used), the sparsity should be low; with unfiltered genes, it should be set high. Forced to be one in the “sgRNA(s)” and “tag(s)” modes. Thanks to Will Macnair for the valuable feedback.

New in version 0.4.0.
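The ambient_profile parameter is described as a per-feature probability that, when omitted, is estimated by averaging cells. The sketch below shows one plausible reading of that default: pool raw counts over all cells and normalize them to sum to 1. The helper name `estimate_ambient_profile` and the input check are assumptions for illustration; scar's internal estimator may differ.

```python
import numpy as np


def estimate_ambient_profile(raw_count: np.ndarray) -> np.ndarray:
    """Hypothetical helper (not part of scar): pooled per-feature frequencies."""
    raw_count = np.asarray(raw_count)
    # The model expects raw UMI counts, so reject normalized or log-transformed input.
    if (raw_count < 0).any() or not np.allclose(raw_count, np.round(raw_count)):
        raise ValueError("expected raw, non-negative integer UMI counts")
    totals = raw_count.sum(axis=0).astype(float)  # total count of each feature over all cells
    return totals / totals.sum()  # probabilities that sum to 1


# Toy matrix: 3 cells (rows) x 3 features (columns)
counts = np.array([[3, 0, 1], [5, 2, 0], [2, 0, 7]])
profile = estimate_ambient_profile(counts)
```

In practice an ambient profile is often estimated from cell-free droplets instead; the array returned here only illustrates the expected shape (one probability per feature).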

Raises

TypeError – if raw_count is not str or np.ndarray or pd.DataFrame
TypeError – if ambient_profile is not str or np.ndarray or pd.DataFrame or None

Examples

>>> # Real data
>>> import scanpy as sc
>>> from scar import model
>>> adata = sc.read("...")  # load an anndata object
>>> scarObj = model(adata, ambient_profile)  # initialize scar model (ambient_profile is optional)
>>> scarObj.train()  # start training
>>> scarObj.inference()  # inference
>>> adata.layers["X_scar_denoised"] = scarObj.native_counts   # results are saved in scarObj
>>> adata.obsm["X_scar_assignment"] = scarObj.feature_assignment   # only for the 'sgRNA' or 'tag' feature types

Examples

# Synthetic data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scar import data_generator, model

# Generate a synthetic ADT count dataset
np.random.seed(8)
n_features = 50  # 50 ADTs
n_cells = 6000  # 6000 cells
n_celltypes = 6  # cell types
citeseq = data_generator.citeseq(n_cells, n_celltypes, n_features)
citeseq.generate()

# Train scAR
citeseq_denoised = model(citeseq.obs_count, citeseq.ambient_profile, feature_type="ADT", sparsity=0.6)  # initialize scar model
citeseq_denoised.train(epochs=100, verbose=False)  # start training
citeseq_denoised.inference()  # inference

# Visualization
sorted_noisy_counts = citeseq.obs_count[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # noisy observation
sorted_native_counts = citeseq.native_signals[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # native counts
sorted_denoised_counts = citeseq_denoised.native_counts[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # denoised counts

fig, axs = plt.subplots(ncols=3, figsize=(12,4))
sns.heatmap(
            np.log2(sorted_noisy_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[0],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[0].set_title("noisy observation")

sns.heatmap(
            np.log2(sorted_native_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[1],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[1].set_title("native counts (ground truth)")

sns.heatmap(
            np.log2(sorted_denoised_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[2],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[2].set_title("denoised counts (prediction)")

fig.supxlabel("ADTs")
fig.supylabel("cells")
plt.tight_layout()

Attributes:

`logger`	logging.Logger, the logger for this class.
`nn_layer1`	int, number of neurons of the 1st layer.
`nn_layer2`	int, number of neurons of the 2nd layer.
`latent_dim`	int, number of neurons of the bottleneck layer.
`dropout_prob`	float, dropout probability of neurons.
`feature_type`	str, the feature to be denoised.
`count_model`	str, the model to generate the UMI count.
`sparsity`	float, the sparsity of expected native signals.
`raw_count`	np.ndarray, raw count matrix.
`ambient_profile`	np.ndarray, the probability of occurrence of each ambient transcript.
`runtime`	int, runtime in seconds.
`loss_values`	list, loss values during training.
`trained_model`	nn.Module object, added after training.
`native_counts`	np.ndarray, denoised counts, added after inference
`bayesfactor`	np.ndarray, bayesian factor of whether native signals are present, added after inference
`native_frequencies`	np.ndarray, probability of native transcripts (normalized denoised counts), added after inference
`noise_ratio`	np.ndarray, noise ratio per cell, added after inference
`feature_assignment`	pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment
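The table describes native_frequencies as normalized denoised counts. The following sketch illustrates that relationship on a toy matrix using plain NumPy; it is not how scar computes the attribute internally, and the variable names simply mirror the attribute names for readability.

```python
import numpy as np

# Toy denoised counts: 3 cells (rows) x 3 features (columns)
native_counts = np.array([[8, 0, 2], [0, 0, 0], [1, 3, 0]], dtype=float)

# Total native count per cell, kept as a column for broadcasting
row_totals = native_counts.sum(axis=1, keepdims=True)

# Divide each row by its total; cells without native counts stay at zero
native_frequencies = np.divide(
    native_counts,
    row_totals,
    out=np.zeros_like(native_counts),
    where=row_totals > 0,
)
```

Each non-empty row of the result sums to 1, so it can be read as a per-cell probability distribution over features.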

Methods:

`train`([batch_size, train_size, shuffle, ...])	Train the scar model.
`inference`([batch_size, count_model_inf, ...])	Infer the expected native signals, noise ratios, Bayesfactors and expected native frequencies.
`assignment`([cutoff, moi])	Assign feature barcodes.

logger: logging.Logger, the logger for this class.

nn_layer1: int, number of neurons of the 1st layer.

nn_layer2: int, number of neurons of the 2nd layer.

latent_dim: int, number of neurons of the bottleneck layer.

dropout_prob: float, dropout probability of neurons.

feature_type: str, the feature to be denoised. One of the following:

‘mRNA’ – transcriptome

‘ADT’ – protein counts in CITE-seq

‘sgRNA’ – sgRNA counts for scCRISPRseq

‘tag’ – identity barcodes or any data types of super high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell.

‘CMO’ – Cell Multiplexing Oligo counts for cell hashing

‘ATAC’ – peak counts for scATACseq

By default “mRNA”

count_model: str, the model to generate the UMI count. One of the following:

‘binomial’ – binomial model,

‘poisson’ – poisson model,

‘zeroinflatedpoisson’ – zeroinflatedpoisson model.
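To make the difference between the ‘poisson’ and ‘zeroinflatedpoisson’ options concrete, here is a minimal sketch of the textbook zero-inflated Poisson probability mass function. The parameter names `rate` and `pi` are illustrative assumptions; how scar parameterizes its count models is not specified here.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Standard Poisson probability of observing k counts."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def zip_pmf(k: int, rate: float, pi: float) -> float:
    """Zero-inflated Poisson: with probability pi emit a structural zero,
    otherwise draw the count from Poisson(rate)."""
    base = (1.0 - pi) * poisson_pmf(k, rate)
    return pi + base if k == 0 else base


# Probability mass over counts 0..49 (the tail beyond 49 is negligible at rate=2)
total = sum(zip_pmf(k, rate=2.0, pi=0.3) for k in range(50))
```

With `pi = 0` the function reduces to the ordinary Poisson model; a larger `pi` shifts probability mass toward zero counts.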

sparsity: float, the sparsity of expected native signals. Range: (0, 1]. Forced to be one in the mode of “sgRNA(s)” and “tag(s)”.

raw_count: np.ndarray, raw count matrix.

ambient_profile: np.ndarray, the probability of occurrence of each ambient transcript.

runtime: int, runtime in seconds.

loss_values: list, loss values during training.

trained_model: nn.Module object, added after training.

native_counts: np.ndarray, denoised counts, added after inference

bayesfactor: np.ndarray, bayesian factor of whether native signals are present, added after inference

native_frequencies: np.ndarray, probability of native transcripts (normalized denoised counts), added after inference

noise_ratio: np.ndarray, noise ratio per cell, added after inference

feature_assignment: pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment

train(batch_size: int = 64, train_size: float = 0.998, shuffle: bool = True, kld_weight: float = 1e-05, lr: float = 0.001, lr_step_size: int = 5, lr_gamma: float = 0.97, epochs: int = 400, reconstruction_weight: float = 1, dropout_prob: float = 0, save_model: bool = False, verbose: bool = True)

Train the scar model.

Parameters

batch_size (int, optional) – batch size, by default 64
train_size (float, optional) – the size of training samples, by default 0.998
shuffle (bool, optional) – whether to shuffle the data, by default True
kld_weight (float, optional) – weight of KL loss, by default 1e-5
lr (float, optional) – initial learning rate, by default 1e-3
lr_step_size (int, optional) – period of learning rate decay, by default 5
lr_gamma (float, optional) – multiplicative factor of learning rate decay, by default 0.97
epochs (int, optional) – training iterations, by default 400
reconstruction_weight (float, optional) – weight on reconstruction error, by default 1
dropout_prob (float, optional) – dropout probability of neurons, by default 0
save_model (bool, optional) – whether to save trained models (under development), by default False
verbose (bool, optional) – whether to print the details, by default True
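The lr, lr_step_size and lr_gamma parameters together describe a learning rate that decays by a multiplicative factor at a fixed period. The sketch below assumes a step-wise schedule (the rate is multiplied by lr_gamma once every lr_step_size epochs, as in PyTorch's StepLR); whether scar applies exactly this schedule is an assumption, and `step_decay_lr` is a hypothetical helper.

```python
def step_decay_lr(
    epoch: int,
    lr: float = 1e-3,
    lr_step_size: int = 5,
    lr_gamma: float = 0.97,
) -> float:
    """Learning rate in effect at a given (0-based) epoch under step decay."""
    n_decays = epoch // lr_step_size  # number of completed decay periods
    return lr * lr_gamma**n_decays


# Learning rate at a few epochs using the documented defaults
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 4, 5, 100, 399)}
```

Under these assumptions the rate stays at 1e-3 for the first five epochs, drops to 9.7e-4 at epoch 5, and keeps shrinking geometrically, so later epochs take progressively smaller optimization steps.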

Return type

After training, a trained_model attribute will be added.

inference(batch_size=4096, count_model_inf='poisson', adjust='micro', cutoff=3, round_to_int='stochastic_rounding', clip_to_obs=False, moi=None)

Infer the expected native signals, noise ratios, Bayesfactors and expected native frequencies.

Parameters

batch_size (int, optional) – batch size, set a small value upon GPU memory issue, by default 4096
count_model_inf (str, optional) – inference model for evaluation of ambient presence, by default “poisson”
adjust (str, optional) –
Only used for calculating Bayesfactors to improve performance. One of the following:

‘micro’ – adjust the estimated native counts per cell. This can overcome the issue of over- or under-estimation of noise.

‘global’ – adjust the estimated native counts globally. This can overcome the issue of over- or under-estimation of noise.

False – no adjustment, use the model-returned native counts.

Defaults to “micro”
cutoff (int, optional) – cutoff for Bayesfactors, by default 3
round_to_int (str, optional) –
whether to round the counts, by default “stochastic_rounding”

New in version 0.4.1.
clip_to_obs (bool, optional) –
whether to clip the predicted native counts to the observation in order to ensure that denoised counts are not greater than the observation, by default False. Use it with caution, as it may lead to over-estimation of overall noise.

New in version 0.5.0.
moi (int, optional (under development)) – multiplicity of infection. If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016] for details, by default None

Return type

After inference, several attributes are added, including native_counts, bayesfactor, native_frequencies, and noise_ratio. A feature_assignment attribute is also added for the ‘sgRNA’, ‘tag’, or ‘CMO’ feature types.
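Two of the inference options, round_to_int='stochastic_rounding' and clip_to_obs, can be illustrated with generic NumPy code. This is a sketch of the general techniques (rounding up with probability equal to the fractional part, and capping predictions at the observed counts), not scar's own implementation; `stochastic_round` is a hypothetical helper.

```python
import numpy as np


def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round each value down, then add 1 with probability equal to its fractional part."""
    floor = np.floor(x)
    frac = x - floor
    return (floor + (rng.random(x.shape) < frac)).astype(int)


rng = np.random.default_rng(0)
expected = np.full(100_000, 2.3)  # the same expected count for many entries
rounded = stochastic_round(expected, rng)  # integers 2 or 3, averaging close to 2.3

# Clipping to the observation: denoised counts never exceed the raw counts
observed = np.array([3, 0, 5])
predicted = np.array([4, 0, 2])
clipped = np.minimum(predicted, observed)
```

Stochastic rounding keeps the expected value of each entry unchanged, whereas rounding to the nearest integer would map every 2.3 to 2. Clipping pulls entries downward only, which fits the caution above that it may lead to over-estimation of overall noise.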

assignment(cutoff=3, moi=None)

Assign feature barcodes. Re-running it lets you test different cutoffs for your experiments.

Parameters

cutoff (int, optional) – cutoff for Bayesfactors, by default 3
moi (float, optional) – multiplicity of infection. (under development) If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016], by default None

Return type

After running, a ‘feature_assignment’ attribute is added for the ‘sgRNA’, ‘tag’, or ‘CMO’ feature types.

Raises

NotImplementedError – if moi is not None
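The effect of the cutoff parameter can be pictured with a small thresholding sketch: for each cell, keep the features whose Bayes factor exceeds the cutoff. The function `assign_features`, the strict `>` comparison, and the output layout are assumptions for illustration only; the actual feature_assignment is a pd.DataFrame whose columns are not described here.

```python
from typing import Dict, List, Sequence


def assign_features(
    bayesfactor: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    cutoff: float = 3,
) -> List[Dict[str, object]]:
    """Hypothetical sketch: per cell, list the features whose Bayes factor exceeds the cutoff."""
    rows = []
    for cell_idx, cell_bf in enumerate(bayesfactor):
        hits = [name for name, bf in zip(feature_names, cell_bf) if bf > cutoff]
        rows.append({"cell": cell_idx, "assigned": hits, "n_assigned": len(hits)})
    return rows


# Toy Bayes factors: 3 cells x 3 guides
bf = [[5.5, 0.1, 0.2], [0.3, 4.2, 6.0], [0.1, 0.2, 0.5]]
guides = ["g1", "g2", "g3"]

default_calls = assign_features(bf, guides)           # cutoff=3
strict_calls = assign_features(bf, guides, cutoff=5)  # a more stringent cutoff
```

Raising the cutoff drops the weaker call in the second cell, which is the kind of comparison that re-running assignment with different cutoffs is meant to support.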