Denoising model
The main module of scar
- class scar.main._scar.model(raw_count: Union[str, numpy.ndarray, pandas.core.frame.DataFrame, anndata._core.anndata.AnnData], ambient_profile: Optional[Union[str, numpy.ndarray, pandas.core.frame.DataFrame]] = None, nn_layer1: int = 150, nn_layer2: int = 100, latent_dim: int = 15, dropout_prob: float = 0, feature_type: str = 'mRNA', count_model: str = 'binomial', sparsity: float = 0.9, device: str = 'auto', verbose: bool = True)
The scar model
- Parameters
raw_count (Union[str, np.ndarray, pd.DataFrame, ad.AnnData]) –
Raw count matrix or Anndata object.
Note
scar takes the raw UMI counts as input. No size normalization or log transformation.
ambient_profile (Optional[Union[str, np.ndarray, pd.DataFrame]], optional) – the probability of occurrence of each ambient transcript. If None, the ambient profile is estimated by averaging cells, by default None
nn_layer1 (int, optional) – number of neurons of the 1st layer, by default 150
nn_layer2 (int, optional) – number of neurons of the 2nd layer, by default 100
latent_dim (int, optional) – number of neurons of the bottleneck layer, by default 15
dropout_prob (float, optional) – dropout probability of neurons, by default 0
feature_type (str, optional) –
the feature to be denoised. One of the following:
'mRNA' – transcriptome data, including scRNAseq and snRNAseq
'ADT' – protein counts in CITE-seq
'sgRNA' – sgRNA counts for scCRISPRseq
'tag' – identity barcodes or any data types of high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell
'CMO' – Cell Multiplexing Oligo counts for cell hashing
'ATAC' – peak counts for scATACseq
New in version 0.5.2.
By default "mRNA"
count_model (str, optional) –
the model to generate the UMI count. One of the following:
'binomial' – binomial model
'poisson' – Poisson model
'zeroinflatedpoisson' – zero-inflated Poisson model
By default "binomial"
sparsity (float, optional) –
range: [0, 1]. The sparsity of expected native signals. It varies between datasets: if genes are prefiltered (e.g. only highly variable genes are used), the sparsity should be set low; with unfiltered genes, it should be set high. Forced to be one in the mode of "sgRNA(s)" and "tag(s)". Thanks to Will Macnair for the valuable feedback.
New in version 0.4.0.
- Raises
TypeError – if raw_count is not str or np.ndarray or pd.DataFrame
TypeError – if ambient_profile is not str or np.ndarray or pd.DataFrame or None
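The default for ambient_profile ("averaging cells") can be pictured with the sketch below. It is only an illustration under a stated assumption: counts are pooled over all cells and normalized to sum to one. The helper name estimate_ambient_profile is hypothetical and not part of scar, whose exact estimate may differ. Plain Python lists are used so the snippet runs without dependencies.

```python
def estimate_ambient_profile(raw_count):
    """Pool all cells into one frequency vector (hypothetical helper, not scar's API)."""
    n_features = len(raw_count[0])
    # Sum the counts of each feature over all cells.
    totals = [sum(cell[j] for cell in raw_count) for j in range(n_features)]
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("count matrix contains no counts")
    # Normalize so that the profile is a probability vector (sums to 1).
    return [t / grand_total for t in totals]


# Toy raw UMI counts: 3 cells (rows) x 3 features (columns).
raw_count = [
    [4, 0, 1],
    [2, 1, 0],
    [2, 1, 1],
]
ambient_profile = estimate_ambient_profile(raw_count)
print(ambient_profile)
```

An ambient profile measured from empty droplets, if available, can be passed explicitly instead of relying on this kind of fallback.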
Examples
>>> # Real data
>>> import scanpy as sc
>>> from scar import model
>>> adata = sc.read("...")  # load an anndata object
>>> scarObj = model(adata, ambient_profile)  # initialize scar model
>>> scarObj.train()  # start training
>>> scarObj.inference()  # inference
>>> adata.layers["X_scar_denoised"] = scarObj.native_counts  # results are saved in scarObj
>>> adata.obsm["X_scar_assignment"] = scarObj.feature_assignment  # 'sgRNA' or 'tag' feature type
Examples
# Synthetic data
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scar import data_generator, model

# Generate a synthetic ADT count dataset
np.random.seed(8)
n_features = 50  # 50 ADTs
n_cells = 6000  # 6000 cells
n_celltypes = 6  # cell types
citeseq = data_generator.citeseq(n_cells, n_celltypes, n_features)
citeseq.generate()

# Train scAR
citeseq_denoised = model(citeseq.obs_count, citeseq.ambient_profile, feature_type="ADT", sparsity=0.6)  # initialize scar model
citeseq_denoised.train(epochs=100, verbose=False)  # start training
citeseq_denoised.inference()  # inference

# Visualization
sorted_noisy_counts = citeseq.obs_count[citeseq.celltype.argsort()][
    :, citeseq.ambient_profile.argsort()
]  # noisy observation
sorted_native_counts = citeseq.native_signals[citeseq.celltype.argsort()][
    :, citeseq.ambient_profile.argsort()
]  # native counts
sorted_denoised_counts = citeseq_denoised.native_counts[citeseq.celltype.argsort()][
    :, citeseq.ambient_profile.argsort()
]  # denoised counts

fig, axs = plt.subplots(ncols=3, figsize=(12,4))
sns.heatmap(
    np.log2(sorted_noisy_counts + 1),
    yticklabels=False,
    vmin=0,
    vmax=10,
    cmap="coolwarm",
    center=1,
    ax=axs[0],
    cbar_kws={"label": "log2(counts + 1)"},
)
axs[0].set_title("noisy observation")
sns.heatmap(
    np.log2(sorted_native_counts + 1),
    yticklabels=False,
    vmin=0,
    vmax=10,
    cmap="coolwarm",
    center=1,
    ax=axs[1],
    cbar_kws={"label": "log2(counts + 1)"},
)
axs[1].set_title("native counts (ground truth)")
sns.heatmap(
    np.log2(sorted_denoised_counts + 1),
    yticklabels=False,
    vmin=0,
    vmax=10,
    cmap="coolwarm",
    center=1,
    ax=axs[2],
    cbar_kws={"label": "log2(counts + 1)"},
)
axs[2].set_title("denoised counts (prediction)")
fig.supxlabel("ADTs")
fig.supylabel("cells")
plt.tight_layout()
Attributes:
logger – logging.Logger, the logger for this class.
nn_layer1 – int, number of neurons of the 1st layer.
nn_layer2 – int, number of neurons of the 2nd layer.
latent_dim – int, number of neurons of the bottleneck layer.
dropout_prob – float, dropout probability of neurons.
feature_type – str, the feature to be denoised.
count_model – str, the model to generate the UMI count.
sparsity – float, the sparsity of expected native signals.
raw_count – np.ndarray, raw count matrix.
ambient_profile – np.ndarray, the probability of occurrence of each ambient transcript.
runtime – int, runtime in seconds.
loss_values – list, loss values during training.
trained_model – nn.Module object, added after training.
native_counts – np.ndarray, denoised counts, added after inference
bayesfactor – np.ndarray, Bayes factor of whether native signals are present, added after inference
native_frequencies – np.ndarray, probability of native transcripts (normalized denoised counts), added after inference
noise_ratio – np.ndarray, noise ratio per cell, added after inference
feature_assignment – pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment
Methods:
train([batch_size, train_size, shuffle, ...]) – train the scar model
inference([batch_size, count_model_inf, ...]) – infer the expected native signals, noise ratios, Bayes factors and expected native frequencies
assignment([cutoff, moi]) – assign feature barcodes
- logger
logging.Logger, the logger for this class.
- nn_layer1
int, number of neurons of the 1st layer.
- nn_layer2
int, number of neurons of the 2nd layer.
- latent_dim
int, number of neurons of the bottleneck layer.
- dropout_prob
float, dropout probability of neurons.
- feature_type
str, the feature to be denoised. One of the following:
'mRNA' – transcriptome
'ADT' – protein counts in CITE-seq
'sgRNA' – sgRNA counts for scCRISPRseq
'tag' – identity barcodes or any data types of super high sparsity. E.g., in cell indexing experiments, we would expect a single true signal (1) and many negative signals (0) for each cell.
'CMO' – Cell Multiplexing Oligo counts for cell hashing
'ATAC' – peak counts for scATACseq
By default "mRNA"
- count_model
str, the model to generate the UMI count. One of the following:
'binomial' – binomial model
'poisson' – Poisson model
'zeroinflatedpoisson' – zero-inflated Poisson model
- sparsity
float, the sparsity of expected native signals. Range: (0, 1]. Forced to be one in the mode of "sgRNA(s)" and "tag(s)".
- raw_count
np.ndarray, raw count matrix.
- ambient_profile
np.ndarray, the probability of occurrence of each ambient transcript.
- runtime
int, runtime in seconds.
- loss_values
list, loss values during training.
- trained_model
nn.Module object, added after training.
- native_counts
np.ndarray, denoised counts, added after inference
- bayesfactor
np.ndarray, Bayes factor of whether native signals are present, added after inference
- native_frequencies
np.ndarray, probability of native transcripts (normalized denoised counts), added after inference
- noise_ratio
np.ndarray, noise ratio per cell, added after inference
- feature_assignment
pd.DataFrame, assignment of sgRNA or tag or other feature barcodes, added after inference or assignment
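To make the relationship between native_counts, native_frequencies and noise_ratio more concrete, here is a rough sketch. The row normalization follows the description "normalized denoised counts" above. The noise-ratio formula (the fraction of a cell's observed counts not kept as native) is an assumption made for illustration only; in scar the noise ratio comes from the trained model and need not equal this quantity. Both helper functions are hypothetical.

```python
def native_frequencies(native_counts):
    """Row-normalize denoised counts so each cell sums to 1 (hypothetical helper)."""
    freqs = []
    for cell in native_counts:
        total = sum(cell)
        freqs.append([c / total if total else 0.0 for c in cell])
    return freqs


def noise_ratio(raw_count, native_counts):
    """Assumed definition: share of observed counts per cell not retained as native."""
    ratios = []
    for observed, native in zip(raw_count, native_counts):
        observed_total = sum(observed)
        ratios.append(1 - sum(native) / observed_total if observed_total else 0.0)
    return ratios


# Toy data: 2 cells x 3 features.
raw = [[10, 5, 5], [8, 0, 2]]
denoised = [[9, 3, 3], [8, 0, 0]]

freqs = native_frequencies(denoised)
ratios = noise_ratio(raw, denoised)
print(freqs)
print(ratios)
```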
- train(batch_size: int = 64, train_size: float = 0.998, shuffle: bool = True, kld_weight: float = 1e-05, lr: float = 0.001, lr_step_size: int = 5, lr_gamma: float = 0.97, epochs: int = 400, reconstruction_weight: float = 1, dropout_prob: float = 0, save_model: bool = False, verbose: bool = True)
Train the scar model.
- Parameters
batch_size (int, optional) – batch size, by default 64
train_size (float, optional) – the fraction of samples used for training, by default 0.998
shuffle (bool, optional) – whether to shuffle the data, by default True
kld_weight (float, optional) – weight of KL loss, by default 1e-5
lr (float, optional) – initial learning rate, by default 1e-3
lr_step_size (int, optional) – period of learning rate decay, by default 5
lr_gamma (float, optional) – multiplicative factor of learning rate decay, by default 0.97
epochs (int, optional) – training iterations, by default 400
reconstruction_weight (float, optional) – weight on reconstruction error, by default 1
dropout_prob (float, optional) – dropout probability of neurons, by default 0
save_model (bool, optional) – whether to save trained models (under development), by default False
verbose (bool, optional) – whether to print the details, by default True
- Return type
After training, a trained_model attribute will be added.
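The descriptions of lr, lr_step_size and lr_gamma read like a step decay schedule, and kld_weight and reconstruction_weight read like weights in a summed loss. The sketch below shows what the defaults would imply under those two assumptions. It does not reproduce scar's training loop, and both function names are hypothetical.

```python
def step_decay_lr(lr, lr_step_size, lr_gamma, epoch):
    """Learning rate at a given epoch, assuming it is multiplied by
    lr_gamma once every lr_step_size epochs (step decay)."""
    return lr * lr_gamma ** (epoch // lr_step_size)


def total_loss(reconstruction, kld, reconstruction_weight=1.0, kld_weight=1e-5):
    """Assumed combination: weighted sum of reconstruction error and KL term."""
    return reconstruction_weight * reconstruction + kld_weight * kld


# Learning rate over the default 400 epochs with the default schedule.
for epoch in (0, 4, 5, 100, 399):
    print(epoch, step_decay_lr(1e-3, 5, 0.97, epoch))

# With the default weights the KL term contributes very little.
print(total_loss(reconstruction=2.0, kld=100.0))
```

Under these assumptions the learning rate stays at its initial value for the first five epochs and then shrinks by 3% every five epochs.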
- inference(batch_size=4096, count_model_inf='poisson', adjust='micro', cutoff=3, round_to_int='stochastic_rounding', clip_to_obs=False, moi=None)
Infer the expected native signals, noise ratios, Bayes factors and expected native frequencies.
- Parameters
batch_size (int, optional) – batch size, set a small value upon GPU memory issue, by default 4096
count_model_inf (str, optional) – inference model for evaluation of ambient presence, by default “poisson”
adjust (str, optional) –
Only used for calculating Bayesfactors to improve performance. One of the following:
'micro' – adjust the estimated native counts per cell. This can overcome the issue of over- or under-estimation of noise.
'global' – adjust the estimated native counts globally. This can overcome the issue of over- or under-estimation of noise.
False – no adjustment, use the model-returned native counts.
Defaults to "micro"
cutoff (int, optional) – cutoff for Bayesfactors, by default 3
round_to_int (str, optional) –
whether to round the counts, by default “stochastic_rounding”
New in version 0.4.1.
clip_to_obs (bool, optional) –
whether to clip the predicted native counts to the observation in order to ensure that denoised counts are not greater than the observation, by default False. Use it with caution, as it may lead to over-estimation of overall noise.
New in version 0.5.0.
moi (int, optional (under development)) – multiplicity of infection. If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016] for details, by default None
- Return type
After inference, several attributes will be added, including native_counts, bayesfactor, native_frequencies, and noise_ratio. A feature_assignment attribute will also be added for the 'sgRNA', 'tag' or 'CMO' feature types.
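The round_to_int and clip_to_obs options can be illustrated with a generic sketch of stochastic rounding followed by clipping. This shows the standard technique, assuming "stochastic_rounding" means rounding up with probability equal to the fractional part. It is not taken from scar's source, and the helper names are hypothetical.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up at random so that the expected value equals x."""
    floor = math.floor(x)
    # Round up with probability equal to the fractional part of x.
    return floor + (1 if rng.random() < x - floor else 0)


def clip_to_observation(native, observed):
    """Cap each denoised count at the observed count for the same entry."""
    return [min(n, o) for n, o in zip(native, observed)]


rng = random.Random(0)  # seeded for reproducibility
expected_native = [2.3, 0.0, 7.9]  # toy expected native counts for one cell
observed = [2, 1, 7]               # toy observed counts for the same cell

rounded = [stochastic_round(x, rng) for x in expected_native]
clipped = clip_to_observation(rounded, observed)
print(rounded, clipped)

# Averaged over many draws, stochastic rounding recovers the expected value.
draws = [stochastic_round(2.3, rng) for _ in range(20000)]
print(sum(draws) / len(draws))
```

Clipping can only lower the denoised counts, which is consistent with the caution above that it may lead to over-estimation of overall noise.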
- assignment(cutoff=3, moi=None)
Assign feature barcodes. Re-running it lets you test different cutoffs for your experiments.
- Parameters
cutoff (int, optional) – cutoff for Bayesfactors, by default 3
moi (float, optional) – multiplicity of infection (under development). If assigned, it will allow optimized thresholding, which tests a series of cutoffs to find the best one based on distributions of infections under given moi. See Perturb-seq [Dixit2016], by default None
- Return type
After running, an attribute 'feature_assignment' will be added for the 'sgRNA', 'tag' or 'CMO' feature types.
- Raises
NotImplementedError – if moi is not None
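As a rough picture of cutoff-based assignment, the sketch below assigns to each cell every feature whose Bayes factor exceeds the cutoff. It assumes the Bayes factor values are directly comparable to the cutoff. The function assign_features and the feature names are hypothetical, and scar's own assignment logic and output format (a pd.DataFrame) may differ.

```python
def assign_features(bayesfactor, feature_names, cutoff=3, moi=None):
    """Return, per cell, the features whose Bayes factor exceeds the cutoff.

    Hypothetical helper for illustration; not scar's implementation.
    """
    if moi is not None:
        # Mirrors the documented behaviour: moi-based thresholding is not available.
        raise NotImplementedError("moi-based thresholding is under development")
    assignments = []
    for row in bayesfactor:
        hits = [name for name, bf in zip(feature_names, row) if bf > cutoff]
        assignments.append(hits)
    return assignments


# Toy Bayes factors: 3 cells x 3 guides.
feature_names = ["sgRNA_A", "sgRNA_B", "sgRNA_C"]
bayesfactor = [
    [10.0, 0.5, 1.0],  # one confident guide
    [0.2, 0.1, 0.3],   # nothing passes the cutoff
    [5.0, 4.0, 0.0],   # two guides pass the cutoff
]

assignments = assign_features(bayesfactor, feature_names, cutoff=3)
stricter = assign_features(bayesfactor, feature_names, cutoff=4.5)
print(assignments)
print(stricter)
```

Comparing the two results shows why re-running the assignment with different cutoffs is useful: a stricter cutoff drops the weaker second guide in the third cell.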