Denoising model#

The main module of scar

class scar.main._scar.model(raw_count: Union[str, numpy.ndarray, pandas.core.frame.DataFrame, anndata._core.anndata.AnnData], ambient_profile: Optional[Union[str, numpy.ndarray, pandas.core.frame.DataFrame]] = None, nn_layer1: int = 150, nn_layer2: int = 100, latent_dim: int = 15, dropout_prob: float = 0, feature_type: str = 'mRNA', count_model: str = 'binomial', sparsity: float = 0.9, device: str = 'auto', verbose: bool = True)#

The scar model

Parameters
  • raw_count (Union[str, np.ndarray, pd.DataFrame, ad.AnnData]) –

    Raw count matrix or AnnData object.

    Note

    scar takes raw UMI counts as input. Do not apply size normalization or log transformation.

  • ambient_profile (Optional[Union[str, np.ndarray, pd.DataFrame]], optional) – the probability of occurrence of each ambient transcript. If None, the ambient profile is estimated by averaging cells, by default None

  • nn_layer1 (int, optional) – number of neurons of the 1st layer, by default 150

  • nn_layer2 (int, optional) – number of neurons of the 2nd layer, by default 100

  • latent_dim (int, optional) – number of neurons of the bottleneck layer, by default 15

  • dropout_prob (float, optional) – dropout probability of neurons, by default 0

  • feature_type (str, optional) –

    the feature to be denoised. One of the following:

    ‘mRNA’ – transcriptome data, including scRNAseq and snRNAseq
    ‘ADT’ – protein counts in CITE-seq
    ‘sgRNA’ – sgRNA counts for scCRISPRseq
    ‘tag’ – identity barcodes or any data types of high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell
    ‘CMO’ – Cell Multiplexing Oligo counts for cell hashing
    ‘ATAC’ – peak counts for scATACseq

    New in version 0.5.2.

    By default “mRNA”

  • count_model (str, optional) –

    the model to generate the UMI count. One of the following:

    ‘binomial’ – binomial model,
    ‘poisson’ – Poisson model,
    ‘zeroinflatedpoisson’ – zero-inflated Poisson model.

    By default “binomial”

  • sparsity (float, optional) –

    range: [0, 1]. The sparsity of expected native signals. It varies between datasets: for example, if genes are prefiltered (only highly variable genes are used), the sparsity should be low; with unfiltered genes, it should be set high. Forced to be one in the “sgRNA(s)” and “tag(s)” modes. By default 0.9. Thanks to Will Macnair for the valuable feedback.

    New in version 0.4.0.
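The fallback described for ambient_profile (averaging cells when None is passed) can be illustrated with a minimal sketch. The helper name `estimate_ambient_profile` is hypothetical and the plain-list input is chosen only to keep the example dependency-free; scar's own estimation may differ in detail.

```python
def estimate_ambient_profile(raw_count):
    """Hypothetical helper: average cells, then normalize to probabilities."""
    n_cells = len(raw_count)
    n_features = len(raw_count[0])
    # Mean count of each feature across all cells.
    mean_counts = [
        sum(cell[j] for cell in raw_count) / n_cells for j in range(n_features)
    ]
    total = sum(mean_counts)
    # Normalize so that the profile sums to 1 (one probability per feature).
    return [m / total for m in mean_counts]


# Toy raw count matrix: 3 cells (rows) x 3 features (columns).
counts = [
    [4, 0, 1],
    [2, 1, 0],
    [6, 1, 1],
]
profile = estimate_ambient_profile(counts)
print(profile)
```

The result is a vector with one entry per feature that sums to 1, which matches the description of ambient_profile as a probability of occurrence.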

Raises
  • TypeError – if raw_count is not str or np.ndarray or pd.DataFrame or ad.AnnData

  • TypeError – if ambient_profile is not str or np.ndarray or pd.DataFrame or None

Examples

>>> # Real data
>>> import scanpy as sc
>>> from scar import model
>>> adata = sc.read("...")  # load an anndata object
>>> scarObj = model(adata, ambient_profile)  # initialize scar model; ambient_profile is a precomputed profile or None
>>> scarObj.train()  # start training
>>> scarObj.inference()  # inference
>>> adata.layers["X_scar_denoised"] = scarObj.native_counts   # results are saved in scarObj
>>> adata.obsm["X_scar_assignment"] = scarObj.feature_assignment   # only for 'sgRNA', 'tag' or 'CMO' feature types

Examples

# Synthetic data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scar import data_generator, model

# Generate a synthetic ADT count dataset
np.random.seed(8)
n_features = 50  # 50 ADTs
n_cells = 6000  # 6000 cells
n_celltypes = 6  # cell types
citeseq = data_generator.citeseq(n_cells, n_celltypes, n_features)
citeseq.generate()

# Train scAR
citeseq_denoised = model(citeseq.obs_count, citeseq.ambient_profile, feature_type="ADT", sparsity=0.6)  # initialize scar model
citeseq_denoised.train(epochs=100, verbose=False)  # start training
citeseq_denoised.inference()  # inference

# Visualization
sorted_noisy_counts = citeseq.obs_count[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # noisy observation
sorted_native_counts = citeseq.native_signals[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # native counts
sorted_denoised_counts = citeseq_denoised.native_counts[citeseq.celltype.argsort()][
            :, citeseq.ambient_profile.argsort()
        ]  # denoised counts

fig, axs = plt.subplots(ncols=3, figsize=(12,4))
sns.heatmap(
            np.log2(sorted_noisy_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[0],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[0].set_title("noisy observation")

sns.heatmap(
            np.log2(sorted_native_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[1],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[1].set_title("native counts (ground truth)")

sns.heatmap(
            np.log2(sorted_denoised_counts + 1),
            yticklabels=False,
            vmin=0,
            vmax=10,
            cmap="coolwarm",
            center=1,
            ax=axs[2],
            cbar_kws={"label": "log2(counts + 1)"},
        )
axs[2].set_title("denoised counts (prediction)")

fig.supxlabel("ADTs")
fig.supylabel("cells")
plt.tight_layout()
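[Figure: three heatmaps of log2(counts + 1), cells by ADTs: noisy observation, native counts (ground truth), denoised counts (prediction)]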

Attributes:

logger

logging.Logger, the logger for this class.

nn_layer1

int, number of neurons of the 1st layer.

nn_layer2

int, number of neurons of the 2nd layer.

latent_dim

int, number of neurons of the bottleneck layer.

dropout_prob

float, dropout probability of neurons.

feature_type

str, the feature to be denoised.

count_model

str, the model to generate the UMI count.

sparsity

float, the sparsity of expected native signals.

raw_count

np.ndarray, raw count matrix.

ambient_profile

np.ndarray, the probability of occurrence of each ambient transcript.

runtime

int, runtime in seconds.

loss_values

list, loss values during training.

trained_model

nn.Module object, added after training.

native_counts

np.ndarray, denoised counts, added after inference

bayesfactor

np.ndarray, Bayes factor of whether native signals are present, added after inference

native_frequencies

np.ndarray, probability of native transcripts (normalized denoised counts), added after inference

noise_ratio

np.ndarray, noise ratio per cell, added after inference

feature_assignment

pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment

Methods:

train([batch_size, train_size, shuffle, ...])

Train the scar model.

inference([batch_size, count_model_inf, ...])

Infer the expected native signals, noise ratios, Bayes factors, and expected native frequencies.

assignment([cutoff, moi])

Assign feature barcodes.

logger#

logging.Logger, the logger for this class.

nn_layer1#

int, number of neurons of the 1st layer.

nn_layer2#

int, number of neurons of the 2nd layer.

latent_dim#

int, number of neurons of the bottleneck layer.

dropout_prob#

float, dropout probability of neurons.

feature_type#

str, the feature to be denoised. One of the following:

‘mRNA’ – transcriptome
‘ADT’ – protein counts in CITE-seq
‘sgRNA’ – sgRNA counts for scCRISPRseq
‘tag’ – identity barcodes or any data types of super high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell.
‘CMO’ – Cell Multiplexing Oligo counts for cell hashing
‘ATAC’ – peak counts for scATACseq
By default “mRNA”

count_model#

str, the model to generate the UMI count. One of the following:

‘binomial’ – binomial model,
‘poisson’ – poisson model,
‘zeroinflatedpoisson’ – zeroinflatedpoisson model.
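For reference, the three count models named above correspond to the following textbook probability mass functions. This is a sketch of the standard definitions only; the function names are illustrative, and how scar parameterizes these distributions internally is not stated here.

```python
import math


def binomial_pmf(k, n, p):
    """P(K = k) for n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def zero_inflated_poisson_pmf(k, lam, pi):
    """Poisson mixed with extra zeros; pi is the probability of a structural zero."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


print(binomial_pmf(2, 10, 0.1))
print(poisson_pmf(2, 1.0))
print(zero_inflated_poisson_pmf(0, 1.0, 0.3))
```

The zero-inflated variant puts additional mass on zero, which is why it can suit very sparse count data.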

sparsity#

float, the sparsity of expected native signals, in the range (0, 1]. Forced to be one in the “sgRNA(s)” and “tag(s)” modes.

raw_count#

np.ndarray, raw count matrix.

ambient_profile#

np.ndarray, the probability of occurrence of each ambient transcript.

runtime#

int, runtime in seconds.

loss_values#

list, loss values during training.

trained_model#

nn.Module object, added after training.

native_counts#

np.ndarray, denoised counts, added after inference

bayesfactor#

np.ndarray, Bayes factor of whether native signals are present, added after inference

native_frequencies#

np.ndarray, probability of native transcripts (normalized denoised counts), added after inference

noise_ratio#

np.ndarray, noise ratio per cell, added after inference

feature_assignment#

pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment
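The relationship between native_counts and native_frequencies ("normalized denoised counts") can be sketched as a per-cell normalization. This assumes that "normalized" means dividing each cell's denoised counts by that cell's total; the helper `normalize_rows` is hypothetical, and in scar these attributes are np.ndarray objects rather than lists.

```python
def normalize_rows(native_counts):
    """Hypothetical helper: divide each cell (row) by its total count."""
    frequencies = []
    for row in native_counts:
        total = sum(row)
        # Guard against cells with no counts to avoid division by zero.
        frequencies.append([c / total if total else 0.0 for c in row])
    return frequencies


denoised = [
    [2, 2, 4],
    [0, 0, 0],
]
freqs = normalize_rows(denoised)
print(freqs)
```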

train(batch_size: int = 64, train_size: float = 0.998, shuffle: bool = True, kld_weight: float = 1e-05, lr: float = 0.001, lr_step_size: int = 5, lr_gamma: float = 0.97, epochs: int = 400, reconstruction_weight: float = 1, dropout_prob: float = 0, save_model: bool = False, verbose: bool = True)#

Train the scar model.

Parameters
  • batch_size (int, optional) – batch size, by default 64

  • train_size (float, optional) – the proportion of samples used for training, by default 0.998

  • shuffle (bool, optional) – whether to shuffle the data, by default True

  • kld_weight (float, optional) – weight of KL loss, by default 1e-5

  • lr (float, optional) – initial learning rate, by default 1e-3

  • lr_step_size (int, optional) – period of learning rate decay, by default 5

  • lr_gamma (float, optional) – multiplicative factor of learning rate decay, by default 0.97

  • epochs (int, optional) – training iterations, by default 400

  • reconstruction_weight (float, optional) – weight on reconstruction error, by default 1

  • dropout_prob (float, optional) – dropout probability of neurons, by default 0

  • save_model (bool, optional) – whether to save trained models (under development), by default False

  • verbose (bool, optional) – whether to print the details, by default True
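The interaction of lr, lr_step_size and lr_gamma can be sketched as a step decay schedule. This assumes the learning rate is multiplied by lr_gamma once every lr_step_size epochs, which is what the parameter descriptions suggest; the function `step_decay_lr` is illustrative and not part of scar.

```python
def step_decay_lr(epoch, lr=1e-3, lr_step_size=5, lr_gamma=0.97):
    """Learning rate at a given epoch under step decay (assumed schedule)."""
    # The rate drops by a factor of lr_gamma after every lr_step_size epochs.
    return lr * lr_gamma ** (epoch // lr_step_size)


for epoch in (0, 4, 5, 399):
    print(epoch, step_decay_lr(epoch))
```

Under this assumption and the default values, the learning rate at the last of 400 epochs is roughly one tenth of the initial rate.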

Return type

After training, a trained_model attribute will be added.
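As an illustration of train_size and shuffle, the sketch below splits cell indices into a training part and a held-out part. The helper `split_cells`, its seed argument, and the use of round() for the training count are assumptions made for this example; how scar performs the split and uses the held-out cells is not stated here.

```python
import random


def split_cells(n_cells, train_size=0.998, shuffle=True, seed=0):
    """Hypothetical helper: split cell indices into training and held-out sets."""
    indices = list(range(n_cells))
    if shuffle:
        # Shuffle with a fixed seed so the split is reproducible.
        random.Random(seed).shuffle(indices)
    n_train = round(n_cells * train_size)
    return indices[:n_train], indices[n_train:]


train_idx, held_out_idx = split_cells(6000)
print(len(train_idx), len(held_out_idx))
```

With 6000 cells and the default train_size, this leaves 12 cells held out.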

inference(batch_size=4096, count_model_inf='poisson', adjust='micro', cutoff=3, round_to_int='stochastic_rounding', clip_to_obs=False, moi=None)#

Infer the expected native signals, noise ratios, Bayes factors, and expected native frequencies.

Parameters
  • batch_size (int, optional) – batch size; set a smaller value if GPU memory issues occur, by default 4096

  • count_model_inf (str, optional) – inference model for evaluation of ambient presence, by default “poisson”

  • adjust (str, optional) –

    Only used for calculating Bayesfactors to improve performance. One of the following:

    ‘micro’ – adjust the estimated native counts per cell. This can overcome the issue of over- or under-estimation of noise.
    ‘global’ – adjust the estimated native counts globally. This can overcome the issue of over- or under-estimation of noise.
    False – no adjustment, use the model-returned native counts.

    Defaults to “micro”

  • cutoff (int, optional) – cutoff for Bayesfactors, by default 3

  • round_to_int (str, optional) –

    whether to round the counts, by default “stochastic_rounding”

    New in version 0.4.1.

  • clip_to_obs (bool, optional) –

    whether to clip the predicted native counts to the observation in order to ensure that denoised counts are not greater than the observation, by default False. Use it with caution, as it may lead to over-estimation of overall noise.

    New in version 0.5.0.

  • moi (int, optional (under development)) – multiplicity of infection. If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016] for details, by default None
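The round_to_int and clip_to_obs options can be illustrated with a small sketch. Stochastic rounding here follows the usual definition (round up with probability equal to the fractional part), and clipping takes the element-wise minimum with the observation. Both helpers are hypothetical and scar's own implementation may differ.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up at random so that the expected value equals x."""
    lower = math.floor(x)
    frac = x - lower
    # Round up with probability equal to the fractional part.
    return lower + (1 if rng.random() < frac else 0)


def clip_to_observation(denoised, observed):
    """Ensure denoised counts never exceed the observed counts."""
    return [min(d, o) for d, o in zip(denoised, observed)]


rng = random.Random(0)
expected_native = [0.2, 3.7, 5.0]
rounded = [stochastic_round(x, rng) for x in expected_native]
clipped = clip_to_observation(rounded, observed=[1, 3, 5])
print(rounded, clipped)
```

Unlike rounding to the nearest integer, stochastic rounding keeps the average of many rounded values close to the unrounded value.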

Return type

After inference, several attributes will be added, including native_counts, bayesfactor, native_frequencies, and noise_ratio. A feature_assignment attribute will also be added for the ‘sgRNA’, ‘tag’, or ‘CMO’ feature types.

assignment(cutoff=3, moi=None)#

Assign feature barcodes. Re-running it lets you test different cutoffs for your experiments.

Parameters
  • cutoff (int, optional) – cutoff for Bayesfactors, by default 3

  • moi (float, optional) – multiplicity of infection. (under development) If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016], by default None

Return type

After running, an attribute ‘feature_assignment’ will be added for the ‘sgRNA’, ‘tag’, or ‘CMO’ feature types.

Raises

NotImplementedError – if moi is not None
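The role of cutoff can be sketched as a simple threshold on per-cell Bayes factors. This is an illustration only: the helper `assign_features`, the direction of the comparison (>=), and the list-based output are assumptions, and scar stores the actual result as a pd.DataFrame in feature_assignment.

```python
def assign_features(bayesfactor, feature_names, cutoff=3):
    """Hypothetical helper: keep features whose Bayes factor reaches the cutoff."""
    assignments = []
    for row in bayesfactor:
        # One list of assigned feature names per cell.
        hits = [name for name, bf in zip(feature_names, row) if bf >= cutoff]
        assignments.append(hits)
    return assignments


# Toy Bayes factors: 3 cells (rows) x 2 features (columns).
bayes_factors = [
    [5.0, 0.1],
    [1.0, 2.0],
    [4.0, 3.5],
]
features = ["sgRNA_1", "sgRNA_2"]
print(assign_features(bayes_factors, features, cutoff=3))
print(assign_features(bayes_factors, features, cutoff=4))
```

Calling the function again with a different cutoff mirrors the idea of re-running assignment to compare thresholds.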