Preprocessing

Notebook-oriented helpers that turn raw single-cell counts into TwinCell's inputs. These are the exact functions used in the tutorial:

  • pseudobulk() — aggregate single cells into sample-level pseudo-bulk profiles.
  • pydeseq2() — compute differentially expressed genes (the disease signature) between two conditions.

from deeplife.twincell import pseudobulk, pydeseq2

pseudobulk

Aggregate cells into pseudo-bulk by grouping keys. In the tutorial it is called as:

pdata = pseudobulk(
    adata,
    perturbation="disease",      # obs column for the condition (psoriasis vs normal)
    cell_line="cell_type",       # obs column for the cell type / line
    batch_id="sample_id",        # obs column for the replicate / sample id
    n_min_replicates=20,         # drop groups with fewer cells
)

pseudobulk

pseudobulk(
    adata: AnnData,
    ad_obs: None | DataFrame = None,
    perturbation: None | str = None,
    cell_line: None | str = None,
    batch_id: None | str = None,
    n_min_replicates: int = 20,
    agg_layers: bool = False,
    agg_obsm: bool = False,
) -> AnnData

Aggregate single-cell counts into pseudobulks by grouping keys.

Parameters:

adata (AnnData, required)
    Input AnnData with counts in .X and annotations in .obs. If adata contains .layers or .obsm, they can also be aggregated (see agg_layers and agg_obsm).

ad_obs (None | DataFrame, default: None)
    Optional explicit observation table; defaults to adata.obs.

perturbation (None | str, default: None)
    obs key for the perturbation identifier (stratification).

cell_line (None | str, default: None)
    obs key for the cell line identifier.

batch_id (None | str, default: None)
    obs key for the batch identifier.

n_min_replicates (int, default: 20)
    Minimum number of replicates to aggregate.

agg_layers (bool, default: False)
    Whether to aggregate layers.

agg_obsm (bool, default: False)
    Whether to aggregate obsm.

Returns:

AnnData
    Pseudobulked AnnData with:

  • .X: sum of counts per pseudobulk
  • .layers: sum of layer counts per pseudobulk (if present in input)
  • .obsm: mean of embeddings/coordinates per pseudobulk (if present in input)
  • .obs: aggregated metadata with n_replicates
  • .var: copied from input adata
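
The aggregation described above can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: `toy_pseudobulk` is a hypothetical helper, the data are made up, and it assumes (as the tutorial comment suggests) that n_min_replicates counts the cells contributing to each group. It only reproduces the "sum of counts per group" and n_replicates behaviour listed in the return value.

```python
from collections import defaultdict


def toy_pseudobulk(counts, obs, keys, n_min=2):
    """Sum per-cell count vectors within each group defined by `keys`.

    Hypothetical illustration only; groups with fewer than `n_min`
    cells are dropped (an assumption about n_min_replicates).
    """
    groups = defaultdict(list)
    for row, meta in zip(counts, obs):
        groups[tuple(meta[k] for k in keys)].append(row)

    result = {}
    for group, rows in groups.items():
        if len(rows) < n_min:
            continue  # too few cells to form a pseudobulk
        result[group] = {
            "X": [sum(col) for col in zip(*rows)],  # per-gene sum of counts
            "n_replicates": len(rows),              # cells that were aggregated
        }
    return result


# Four cells x three genes, with made-up annotations.
counts = [[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 0]]
obs = [
    {"disease": "psoriasis", "cell_type": "T cell", "sample_id": "s1"},
    {"disease": "psoriasis", "cell_type": "T cell", "sample_id": "s1"},
    {"disease": "normal", "cell_type": "T cell", "sample_id": "s2"},
    {"disease": "normal", "cell_type": "T cell", "sample_id": "s2"},
]
pb = toy_pseudobulk(counts, obs, keys=("disease", "cell_type", "sample_id"))
```

Each key of `pb` is one (perturbation, cell line, batch) combination, which mirrors how the three grouping arguments of pseudobulk() define one pseudo-bulk profile per group.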


pydeseq2

Run PyDESeq2 on the two arms and flag significant DEGs. In the tutorial:

de_results = pydeseq2(
    adata=pdata_control.concatenate(pdata_pert),
    design_factor="disease",     # obs column with the conditions
    control_group="normal",      # reference level in design_factor
    log2fc_sig=1.0,              # |log2 fold change| threshold for `significant`
    mlog10pvalue_sig=1.3,        # -log10(adjusted p) threshold for `significant`
)
significant_degs = de_results[de_results["significant"]].index.tolist()
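
To make the two thresholds concrete, here is a small sketch of how a `significant` flag could be derived from them. It is an illustration under stated assumptions, not the package's code: `flag_significant` and the `log2fc` key are hypothetical names, and whether the real comparison is strict (>) or inclusive (>=) is not stated in this page. Note that mlog10pvalue_sig=1.3 corresponds to an adjusted p-value of about 0.05, since 10^-1.3 ≈ 0.0501.

```python
def flag_significant(rows, log2fc_sig=None, mlog10pvalue_sig=None):
    """Return copies of `rows`, adding `significant` only if both thresholds are set.

    Hypothetical helper: `log2fc` is an assumed key name; inclusive
    comparisons (>=) are an assumption as well.
    """
    if log2fc_sig is None or mlog10pvalue_sig is None:
        # Mirrors the documented rule: no column unless both are provided.
        return [dict(r) for r in rows]

    flagged = []
    for r in rows:
        passes = (
            abs(r["log2fc"]) >= log2fc_sig
            and r["mlog10pvalue_adj"] >= mlog10pvalue_sig
        )
        flagged.append({**r, "significant": passes})
    return flagged


# Made-up results for three placeholder genes.
rows = [
    {"gene_name": "GENE_A", "log2fc": 2.1, "mlog10pvalue_adj": 4.0},
    {"gene_name": "GENE_B", "log2fc": -1.4, "mlog10pvalue_adj": 0.7},
    {"gene_name": "GENE_C", "log2fc": 0.3, "mlog10pvalue_adj": 6.2},
]
flagged = flag_significant(rows, log2fc_sig=1.0, mlog10pvalue_sig=1.3)
significant_genes = [r["gene_name"] for r in flagged if r["significant"]]

# Adjusted p-value implied by a -log10 threshold of 1.3.
padj_cutoff = 10 ** -1.3
```

Only GENE_A passes both cut-offs here: GENE_B has a large enough fold change but a weak adjusted p-value, and GENE_C is the reverse.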

pydeseq2

pydeseq2(
    adata: AnnData,
    design_factor: str,
    control_group: str,
    log2fc_sig: float | None = None,
    mlog10pvalue_sig: float | None = None,
) -> DataFrame

Run DESeq2 and compute results for each perturbation vs control.

Parameters:

adata (AnnData, required)
    AnnData object with raw counts in adata.X.

design_factor (str, required)
    The column in adata.obs with experimental conditions (e.g. "perturbation").

control_group (str, required)
    The value in the design_factor column that represents the control (e.g. "control").

log2fc_sig (float | None, default: None)
    Absolute log2 fold-change threshold for the significant column. Both log2fc_sig and mlog10pvalue_sig must be provided to add the column.

mlog10pvalue_sig (float | None, default: None)
    -log10(padj) threshold for the significant column.

Returns:

DataFrame
    A pandas DataFrame indexed by gene_name, sorted by mlog10pvalue_adj descending.


Batch / CLI use

The two helpers above are thin wrappers over standalone packages that also ship command-line entry points for batch pipelines:

  • Pseudo-bulk: deeplife.pseudobulk (CLI: twincell-pseudobulk)
  • Differential expression: deeplife.differential_expression (CLI: twincell-diffexpr)

python -m deeplife.pseudobulk.main --help
python -m deeplife.differential_expression.main --help

For notebooks, prefer the pseudobulk() / pydeseq2() helpers documented above.