Preprocessing¶
Notebook-oriented helpers that turn raw single-cell counts into TwinCell's inputs. These are the exact functions used in the tutorial:
- pseudobulk(): aggregate single cells into sample-level pseudo-bulk profiles.
- pydeseq2(): compute differentially expressed genes (the disease signature) between two conditions.
pseudobulk¶
Aggregate cells into pseudo-bulk by grouping keys. In the tutorial it is called as:
pdata = pseudobulk(
    adata,
    perturbation="disease",  # obs column for the condition (psoriasis vs normal)
    cell_line="cell_type",   # obs column for the cell type / line
    batch_id="sample_id",    # obs column for the replicate / sample id
    n_min_replicates=20,     # drop groups with fewer cells
)
Signature:
pseudobulk(
    adata: AnnData,
    ad_obs: None | DataFrame = None,
    perturbation: None | str = None,
    cell_line: None | str = None,
    batch_id: None | str = None,
    n_min_replicates: int = 20,
    agg_layers: bool = False,
    agg_obsm: bool = False,
) -> AnnData
Aggregate single-cell counts into pseudobulks by grouping keys.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | Input AnnData with counts in … | required |
| `ad_obs` | `None \| DataFrame` | Optional explicit observation table; defaults to … | `None` |
| `perturbation` | `None \| str` | obs key for perturbation identifier (stratification). | `None` |
| `cell_line` | `None \| str` | obs key for cell line identifier. | `None` |
| `batch_id` | `None \| str` | obs key for batch identifier. | `None` |
| `n_min_replicates` | `int` | Minimum number of replicates to aggregate. Default is 20. | `20` |
| `agg_layers` | `bool` | Whether to aggregate layers. Default is False. | `False` |
| `agg_obsm` | `bool` | Whether to aggregate obsm. Default is False. | `False` |
Returns:
| Type | Description |
|---|---|
| `AnnData` | anndata.AnnData: Pseudobulked AnnData with: … |
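To make the grouping idea concrete, here is a minimal, dependency-free sketch of pseudo-bulk aggregation. It is not TwinCell's implementation: the function name `pseudobulk_sketch`, the use of summation as the aggregate, and the reading of the minimum-size filter as "drop groups with fewer cells" (taken from the tutorial comment above) are all assumptions made for illustration.

```python
from collections import defaultdict


def pseudobulk_sketch(counts, obs, keys, n_min_cells=20):
    """Sum per-cell count vectors within each group defined by `keys`.

    Hypothetical helper, not part of TwinCell.
    counts: list of per-cell count lists (cells x genes)
    obs: list of per-cell metadata dicts, aligned with `counts`
    keys: metadata fields that define a group
    n_min_cells: groups with fewer cells are dropped (assumed semantics)
    """
    grouped = defaultdict(list)
    for cell_counts, cell_obs in zip(counts, obs):
        group = tuple(cell_obs[k] for k in keys)
        grouped[group].append(cell_counts)

    profiles = {}
    for group, rows in grouped.items():
        if len(rows) < n_min_cells:
            continue  # too few cells in this group, skip it
        # Column-wise sum: one aggregated count per gene
        profiles[group] = [sum(gene_counts) for gene_counts in zip(*rows)]
    return profiles


# Toy data: 5 cells x 2 genes
counts = [[1, 0], [2, 1], [0, 3], [4, 4], [5, 0]]
obs = [
    {"disease": "normal", "cell_type": "T", "sample_id": "s1"},
    {"disease": "normal", "cell_type": "T", "sample_id": "s1"},
    {"disease": "psoriasis", "cell_type": "T", "sample_id": "s2"},
    {"disease": "psoriasis", "cell_type": "T", "sample_id": "s2"},
    {"disease": "psoriasis", "cell_type": "B", "sample_id": "s2"},
]
profiles = pseudobulk_sketch(
    counts, obs, keys=("disease", "cell_type", "sample_id"), n_min_cells=2
)
print(profiles)
```

With `n_min_cells=2`, the single psoriasis B cell forms a group that is too small and is dropped, while the two T-cell groups each collapse into one summed profile.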
pydeseq2¶
Run PyDESeq2 on the two arms (control and perturbed pseudo-bulks) and flag significant DEGs. In the tutorial:
de_results = pydeseq2(
    adata=pdata_control.concatenate(pdata_pert),
    design_factor="disease",   # obs column with the conditions
    control_group="normal",    # reference level in design_factor
    log2fc_sig=1.0,            # |log2 fold change| threshold for `significant`
    mlog10pvalue_sig=1.3,      # -log10(adjusted p) threshold for `significant`
)
significant_degs = de_results[de_results["significant"]].index.tolist()
Signature:
pydeseq2(
    adata: AnnData,
    design_factor: str,
    control_group: str,
    log2fc_sig: float | None = None,
    mlog10pvalue_sig: float | None = None,
) -> DataFrame
Run DESeq2 and compute results for each perturbation vs control.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `adata` | `AnnData` | AnnData object with raw counts in adata.X. | required |
| `design_factor` | `str` | The column in adata.obs with experimental conditions (e.g. … | required |
| `control_group` | `str` | The value in the design_factor column that represents the control (e.g. … | required |
| `log2fc_sig` | `float \| None` | Absolute log2 fold-change threshold for the … | `None` |
| `mlog10pvalue_sig` | `float \| None` | -log10(padj) threshold for the … | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A pandas DataFrame indexed by … |
| `DataFrame` | … |
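The two thresholds are easier to reason about with a small worked example. The sketch below assumes the `significant` flag is the conjunction of both cut-offs with inclusive comparisons, and that missing adjusted p-values are never significant. The helper `is_significant` and the gene names are hypothetical, and the wrapper's exact rule may differ.

```python
import math


def is_significant(log2fc, padj, log2fc_sig=1.0, mlog10pvalue_sig=1.3):
    """Assumed rule: |log2FC| >= log2fc_sig and -log10(padj) >= mlog10pvalue_sig."""
    if math.isnan(log2fc) or math.isnan(padj):
        return False  # genes without a usable statistic are not flagged
    # Guard against log10(0) for p-values that underflow to zero
    mlog10p = math.inf if padj <= 0.0 else -math.log10(padj)
    return abs(log2fc) >= log2fc_sig and mlog10p >= mlog10pvalue_sig


# A threshold of 1.3 on -log10(padj) corresponds to padj of roughly 0.05
padj_cutoff = 10 ** -1.3

# Toy results: gene -> (log2 fold change, adjusted p-value)
results = {
    "GENE_A": (2.1, 0.001),         # large effect, small padj
    "GENE_B": (0.4, 0.0001),        # effect too small
    "GENE_C": (-1.5, 0.2),          # padj too large
    "GENE_D": (-3.0, 0.01),         # strong down-regulation
    "GENE_E": (1.2, float("nan")),  # no adjusted p-value
}
significant_degs = [
    gene for gene, (lfc, padj) in results.items() if is_significant(lfc, padj)
]
print(significant_degs)  # ['GENE_A', 'GENE_D']
```

Because the fold-change test uses the absolute value, both up- and down-regulated genes can be flagged.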
Batch / CLI use¶
The two helpers above are thin wrappers over standalone packages that also ship command-line entry points for batch pipelines:
- Pseudo-bulk: deeplife.pseudobulk (CLI: twincell-pseudobulk)
- Differential expression: deeplife.differential_expression (CLI: twincell-diffexpr)
For notebooks, prefer the pseudobulk() / pydeseq2() helpers documented above.