TwinCell

The headline reference for TwinCell: the data requirements and the high-level TwinCell study class used throughout the tutorial and the use cases.

The data-preparation helpers (pseudobulk() / pydeseq2()) live on the Preprocessing page.

from deeplife.twincell import TwinCell, read_h5ad

Data requirements

TwinCell accepts transcriptomic data as .h5ad (AnnData). For target validation you bring a target and one or two cell states (e.g. disease vs. healthy).

Input files must contain:

  • Raw gene expression counts — not normalized. In adata.X or a named layer (via raw_layer_name, which refers to a counts layer in adata.layers, not adata.raw).
  • Condition labels — a column in adata.obs identifying each cell/sample's state (e.g. "ctrl" vs. "stim").
  • Batch / sample grouping — a column in adata.obs for sample-level grouping (e.g. replicate). Used via batch_id_col; these are sample identifiers, not sequencing-batch labels.
  • Cell type annotations (single-cell only) — each prediction analyzes one cell type at a time.

Example obs schema:

| Column | Required? | Example values | Notes |
|---|---|---|---|
| condition | yes | ctrl, stim | Two values; used as the contrast. |
| sample_id | yes | donor_1, donor_2, … | Sample-level grouping for pseudo-bulk / DE. |
| cell_type | single-cell only | CD4 T cells, Monocytes, … | One prediction per cell type. |

Other requirements:

  • One or two cell states, with at least two biological replicates per arm for differential expression.
  • Gene identifiers as standard gene symbols in adata.var_names.
  • The dataset must fit in memory during local preprocessing — subset to the cell types of interest for very large datasets.
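The replicate requirement above can be checked before any upload. The following is a minimal sketch, not part of the TwinCell API: `replicates_per_arm` and `check_replicates` are hypothetical helpers, and the column names are taken from the example obs schema.

```python
from collections import defaultdict


def replicates_per_arm(obs_rows, condition_col="condition", sample_col="sample_id"):
    """Count distinct sample identifiers per condition label."""
    samples = defaultdict(set)
    for row in obs_rows:
        samples[row[condition_col]].add(row[sample_col])
    return {arm: len(ids) for arm, ids in samples.items()}


def check_replicates(obs_rows, minimum=2, **cols):
    """Raise ValueError when any arm has fewer than `minimum` distinct samples."""
    counts = replicates_per_arm(obs_rows, **cols)
    short = {arm: n for arm, n in counts.items() if n < minimum}
    if short:
        raise ValueError(f"arms with fewer than {minimum} replicates: {short}")
    return counts


# One dict per cell. With AnnData, such records could come from
# adata.obs.to_dict("records").
obs_rows = [
    {"condition": "ctrl", "sample_id": "donor_1"},
    {"condition": "ctrl", "sample_id": "donor_1"},
    {"condition": "ctrl", "sample_id": "donor_2"},
    {"condition": "stim", "sample_id": "donor_3"},
    {"condition": "stim", "sample_id": "donor_4"},
]
counts = check_replicates(obs_rows)
```

Counting distinct sample identifiers rather than rows matters for single-cell data, where one donor contributes many cells.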

Raw counts only

TwinCell requires raw counts — the pipeline handles normalization internally. Passing pre-normalized data produces unreliable results.
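A quick heuristic can catch the most common mistake, passing log-normalized values. This sketch is an illustration only: `looks_like_raw_counts` is a hypothetical helper, and passing the check does not prove the data are raw (rounded or scaled values would also pass).

```python
def looks_like_raw_counts(rows):
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for row in rows for v in row)


# Small dense matrices as nested lists; for AnnData, a dense slice of
# adata.X (or of the counts layer) could be passed instead.
raw = [[0, 3, 12], [5, 0, 1]]
log_normalized = [[0.0, 1.39, 2.56], [1.79, 0.0, 0.69]]

raw_ok = looks_like_raw_counts(raw)
normalized_ok = looks_like_raw_counts(log_normalized)
```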


Reading data

read_h5ad

read_h5ad(
    path_or_url: str | Path,
    *,
    destination: str | Path | None = None,
    timeout_seconds: float = 300.0,
    sanitize: bool = True,
) -> AnnData

Read AnnData from a local .h5ad file or a remote URI.

Local: pathlib.Path or a filesystem string (including file:// URIs). The file must exist; destination is ignored.

Remote: http://, https://, or s3:// URIs. Streams to destination when set, otherwise to a temporary file that is removed after load. For s3://, uses boto3 with the default credential chain.
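The local/remote distinction can be pictured as a dispatch on the URI scheme. The sketch below is an assumption about the decision logic, not the library's code; `classify_source` and `REMOTE_SCHEMES` are hypothetical names.

```python
from pathlib import Path
from urllib.parse import urlparse

# Schemes the documentation lists as remote sources.
REMOTE_SCHEMES = {"http", "https", "s3"}


def classify_source(path_or_url):
    """Return "local" or "remote"; raise ValueError for other schemes."""
    if isinstance(path_or_url, Path):
        return "local"
    scheme = urlparse(str(path_or_url)).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return "remote"
    if scheme in ("", "file"):
        return "local"
    # Note: this sketch does not special-case Windows drive letters,
    # which urlparse reports as a one-letter scheme.
    raise ValueError(f"unsupported URI scheme: {scheme!r}")


kinds = [
    classify_source(Path("data/pbmc.h5ad")),
    classify_source("data/pbmc.h5ad"),
    classify_source("file:///tmp/pbmc.h5ad"),
    classify_source("https://example.org/pbmc.h5ad"),
    classify_source("s3://bucket/pbmc.h5ad"),
]
```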

Parameters:

  • path_or_url (str | Path, required): Path to a local .h5ad, or an https://, http://, or s3:// URI.
  • destination (str | Path | None, default None): Optional output path when downloading a remote object (parent dirs are created). Ignored for local paths.
  • timeout_seconds (float, default 300.0): HTTP client timeout for HTTP(S); connect + read timeouts for S3.
  • sanitize (bool, default True): When True, coerce obs and var columns to string dtype and ensure unique obs_names / var_names (in-place on the returned object). Pass False to return the object exactly as stored in the file.

Returns:

  • AnnData: In-memory anndata.AnnData.

Raises:

  • FileNotFoundError: When a local path does not exist or is not a file.
  • ValueError: When a remote URI uses an unsupported scheme or a malformed s3:// URI.
  • HTTPError: On HTTP(S) network or status errors.
  • ClientError: On S3 API errors (e.g. missing object).

When sanitize is True, duplicate obs / var name warnings from AnnData during file load are suppressed because names are uniquified immediately after.
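To make the uniquifying step concrete, here is a minimal sketch in the style of AnnData's obs_names_make_unique(), which appends a numeric suffix to repeated names. That sanitize uses this exact scheme is an assumption; `make_unique` is a hypothetical helper.

```python
from collections import Counter


def make_unique(names, join="-"):
    """Keep first occurrences; suffix later duplicates with -1, -2, ..."""
    seen = Counter()
    taken = set(names)
    result = []
    for name in names:
        if seen[name] == 0:
            # First occurrence keeps its original name.
            result.append(name)
        else:
            suffix = seen[name]
            candidate = f"{name}{join}{suffix}"
            # Skip suffixes that would collide with an existing name.
            while candidate in taken:
                suffix += 1
                candidate = f"{name}{join}{suffix}"
            taken.add(candidate)
            result.append(candidate)
            seen[name] = suffix
        seen[name] += 1
    return result


unique_names = make_unique(["TP53", "ACTB", "TP53", "TP53"])
```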


TwinCell study

The notebook-oriented entry point. Construct from a control + perturbed AnnData pair plus a DEG list, then call target_validation() and inspect the result with the score / causal-path / graph methods (see the tutorial).

The methods used in the tutorial are target_validation(), get_target_score(), get_causal_paths(), plot_causal_graph(), get_degs_impacted_by_target(), and get_all_degs().

TwinCell

TwinCell(
    *,
    pdata_control: AnnData,
    pdata_pert: AnnData | None = None,
    degs: list[str] | None = None,
    model: str = DEFAULT_TWINCELL_MODEL_VERSION,
    api_key: str,
    base_url: str | None = None,
    validate_on_init: bool = True,
    max_obs_per_anndata: int | None = None,
    check_api_on_init: bool = True,
    api_check_timeout_seconds: float = 15.0,
)

High-level handle for a split control vs perturbed AnnData pair and DEG list.

Provide expression matrices in adata.X and HGNC-style symbols in degs. This workflow expects two objects (not a merged pseudo-bulk matrix); see validate_twincell_split_anndata().

Suggested flow: run target_id() to submit data and obtain a prediction, then causal_analysis() for graph-style follow-up on a protein target. For the integrated validation path, use target_validation() instead: it submits the same split arms via POST /v1/predictions with job_type=target_validation (no separate causal POST).

plot_causal_graph() covers both paths. Use it for graphs from causal_analysis() when you hold a {"causal": ...} dict; for target validation, call it on the instance with top_n_degs (GET .../causal-graph).

On construction, local split validation runs and (by default) a quick API connectivity check (GET /health plus authenticated GET /v1/predictions). Progress and a final ready message are printed to stdout for notebooks.

Call close() when finished to release HTTP resources.
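One way to guarantee that close() runs, even when a call in between raises, is contextlib.closing from the standard library, which works with any object exposing close(). The sketch below uses a hypothetical `StudyStandIn` class in place of a real TwinCell instance so that it runs without an API key; whether TwinCell also supports the `with` statement directly is not stated here.

```python
from contextlib import closing


class StudyStandIn:
    """Hypothetical stand-in that mimics only the close() contract."""

    def __init__(self):
        self.closed = False

    def target_validation(self, *, target):
        # The real method returns a prediction id; this stub returns a string.
        if self.closed:
            raise RuntimeError("session already closed")
        return f"submitted {target}"

    def close(self):
        self.closed = True


study = StudyStandIn()
with closing(study) as tc:
    outcome = tc.target_validation(target="BRAF|PROTEIN")
# closing() has called study.close() at this point, also on exceptions.
```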

prediction_id property

prediction_id: UUID | None

Active prediction id on the internal TwinCellSession.

session property

session: TwinCellSession

Underlying HTTP session (shared client, active prediction_id).

close

close() -> None

Close the underlying TwinCellSession HTTP session if it was opened.

client

client(*, reuse: bool = True) -> Any

Return a bare DeepLifeClient.

target_id

target_id(
    *,
    pdata_pert: AnnData | None = None,
    degs: list[str] | None = None,
    label: str | None = None,
    model_version: str | None = None,
    wait: bool = True,
    timeout_seconds: float = 10 * 60,
    poll_interval_seconds: float = 2.0,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
    print_prediction_id: bool = True,
) -> Any

Submit split arms and DEGs, wait for the remote prediction, and return status.

Uses adata.X only; validation already ran at construction.

Parameters:

  • pdata_pert (AnnData | None, default None): Perturbed pseudo-bulk AnnData. Falls back to the instance attribute set at construction if not provided here.
  • degs (list[str] | None, default None): List of DEG gene symbols. Falls back to the instance attribute set at construction if not provided here.

build_differential_causal_graph

build_differential_causal_graph(
    *,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
    prediction_id: str | UUID | None = None,
) -> Any

Refresh prediction results and cache the influence preview matrix.

influence_matrix

influence_matrix() -> Any

Return the cached influence preview matrix, fetching from the service if needed.

last_prediction

last_prediction() -> PredictionStatusResponse | None

Last prediction payload held by the internal TwinCellSession.

use_prediction

use_prediction(prediction_id: str | UUID) -> DataFrame

Select a completed prediction for subsequent getters on this instance.

After target_validation(), the session already points at that run. Use this to switch to another row from list_predictions(). Validates the id against the API and returns a one-row summary (same columns as list_predictions()).

Raises:

  • NotFoundError: If the prediction does not exist or is not accessible.

list_predictions

list_predictions(
    *, limit: int = 50, cursor: str | None = None
) -> DataFrame

List past predictions (prediction_id, status, label, job_type, target).

Uses the same API key and base URL as this TwinCell instance. For the raw PredictionsListResponse, call TwinCellSession.list_predictions() on the session property.

Pagination: when the API returns next_cursor, it is stored in df.attrs["next_cursor"]; pass it as cursor= on the next call.

Example:

history = tc.list_predictions()
pid = history.iloc[0]["prediction_id"]
tc.use_prediction(pid)
tc.get_target_score(prediction_id=pid)
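The pagination rule described above can be wrapped in a small generator. This is a sketch under assumptions: `iter_prediction_pages` is a hypothetical helper, and `FakePage` / `fake_list_predictions` only imitate the documented contract (a result with an attrs mapping that holds next_cursor while more pages exist) so the loop can run offline.

```python
def iter_prediction_pages(list_fn, *, limit=50):
    """Yield pages from list_fn until no next_cursor is reported."""
    cursor = None
    while True:
        page = list_fn(limit=limit, cursor=cursor)
        yield page
        cursor = page.attrs.get("next_cursor")
        if not cursor:
            break


class FakePage:
    """Stand-in for the returned DataFrame: rows plus an attrs dict."""

    def __init__(self, rows, next_cursor=None):
        self.rows = rows
        self.attrs = {"next_cursor": next_cursor} if next_cursor else {}


def fake_list_predictions(*, limit=50, cursor=None):
    """Serve five fake prediction ids, using the offset as the cursor."""
    data = ["p1", "p2", "p3", "p4", "p5"]
    start = int(cursor or 0)
    end = start + limit
    next_cursor = str(end) if end < len(data) else None
    return FakePage(data[start:end], next_cursor)


pages = list(iter_prediction_pages(fake_list_predictions, limit=2))
all_ids = [pid for page in pages for pid in page.rows]
```

With a real instance, tc.list_predictions would be passed as list_fn in place of the fake.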

simulate

simulate(
    targets: list[str] | None = None,
    *,
    prediction_id: str | UUID | None = None,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> dict[str, Any]

Merge prediction scores with rows whose gene symbol is in degs.

causal_analysis

causal_analysis(
    *,
    target_id: PredictionStatusResponse | None = None,
    target: str,
    top_n_causal_degs: int = 1000,
    min_path_fraction: float = 0.1,
    min_path_probability: float = 0.0001,
    max_path_length: int | None = None,
    prediction_id: str | UUID | None = None,
    wait: bool = True,
    timeout_seconds: float = 10 * 60,
    poll_interval_seconds: float = 2.0,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> dict[str, Any]

Request a causal graph analysis for a protein target on a finished prediction.

Loads preview results locally to choose API parameters and to build a companion df_paths table. The remote job returns status and may include a graph image; use plot_causal_graph() to display it.

Normally call after target_id() for the same prediction. For the optional alternate mode, see target_validation().

Parameters:

  • target_id (PredictionStatusResponse | None, default None): Completed prediction handle. If omitted, the active session prediction is used.
  • target (str, required): Protein entity id, e.g. BRAF|PROTEIN.
  • top_n_causal_degs (int, default 1000): Caps ranked inputs used for min_degs_fold_uniform and for df_paths.
  • min_path_fraction (float, default 0.1): Minimum path mass fraction per DEG (API default 0.1).
  • min_path_probability (float, default 0.0001): Minimum raw path probability (API default 1e-4).
  • max_path_length (int | None, default None): Optional maximum path length in nodes.

Returns:

  • dict[str, Any]: df_paths (local preview table); causal (alias) and causal_analysis, the terminal deeplife.twincell.http.models.CausalAnalysisStatusResponse from GET /v1/causal-analysis/{id} (use .artifacts for all presigned outputs); plus target, prediction_id, top_degs, min_degs_fold_uniform, top_n_causal_degs.
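One plausible reading of the three path thresholds is sketched below. The exact server-side semantics are not documented here, so treat this as an assumption: "path mass fraction" is interpreted as a path's probability divided by the summed probability of all paths to the same DEG, and `filter_paths` plus the record layout are hypothetical.

```python
from collections import defaultdict


def filter_paths(paths, *, min_path_fraction=0.1,
                 min_path_probability=1e-4, max_path_length=None):
    """Keep paths that pass the probability, fraction and length limits."""
    # Total probability mass of all paths ending at each DEG.
    totals = defaultdict(float)
    for path in paths:
        totals[path["deg"]] += path["probability"]

    kept = []
    for path in paths:
        total = totals[path["deg"]]
        if path["probability"] < min_path_probability:
            continue
        if total == 0 or path["probability"] / total < min_path_fraction:
            continue
        if max_path_length is not None and len(path["nodes"]) > max_path_length:
            continue
        kept.append(path)
    return kept


# Illustrative records only; node names other than the documented
# example ids are placeholders.
paths = [
    {"deg": "NDRG4|RNA", "probability": 0.02,
     "nodes": ["BRAF|PROTEIN", "NODE_A|PROTEIN", "NDRG4|RNA"]},
    {"deg": "NDRG4|RNA", "probability": 0.001,
     "nodes": ["BRAF|PROTEIN", "NODE_B|PROTEIN", "NODE_C|PROTEIN", "NDRG4|RNA"]},
    {"deg": "NDRG4|RNA", "probability": 0.00005,
     "nodes": ["BRAF|PROTEIN", "NODE_D|PROTEIN", "NDRG4|RNA"]},
]
kept = filter_paths(paths)
short_only = filter_paths(paths, min_path_fraction=0.0, max_path_length=3)
```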

target_validation

target_validation(
    *,
    pdata_pert: AnnData | None = None,
    degs: list[str] | None = None,
    target_id: PredictionStatusResponse | None = None,
    target: str,
    label: str | None = None,
    top_n_causal_degs: int = 1000,
    deg_significance_fold: float = 1.0,
    min_path_fraction: float = 0.1,
    min_path_probability: float = 0.0001,
    max_path_length: int | None = None,
    prediction_id: str | UUID | None = None,
    wait: bool = True,
    timeout_seconds: float = 10 * 60,
    poll_interval_seconds: float = 2.0,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> UUID

Integrated target_validation run (split uploads via POST /v1/predictions).

Submits pdata_control, pdata_pert, and degs with job_type='target_validation'. The API creates a new prediction row and enqueues the prediction worker with job_type=target_validation (no separate causal-analysis row); this is not a follow-up on an existing inference prediction id.

When a local influence preview exists (see prediction_id / target_id below), fold parameters align with causal_analysis(). Otherwise defaults are used.

After completion, use get_target_score(), get_all_degs(), get_degs_impacted_by_target(), get_causal_paths(), get_intermediary_nodes(), and plot_causal_graph() for allowed outputs (external tiers do not expose presigned artifact URLs).

Parameters:

  • pdata_pert (AnnData | None, default None): Perturbed pseudo-bulk AnnData. Falls back to the instance attribute set at construction if not provided here.
  • degs (list[str] | None, default None): List of DEG gene symbols. Falls back to the instance attribute set at construction if not provided here.
  • target_id (PredictionStatusResponse | None, default None): Optional completed prediction handle used only to refresh local influence previews. Omit for standalone target validation (no prior target_id()).
  • target (str, required): Protein entity id, e.g. BRAF|PROTEIN.
  • label (str | None, default None): Optional run label stored on the prediction (visible in list_predictions()).
  • top_n_causal_degs (int, default 1000): Same role as in causal_analysis() when preview data is available.
  • deg_significance_fold (float, default 1.0): Fold for the embedded target_id step.
  • min_path_fraction (float, default 0.1): Minimum path mass fraction for the causal step.
  • min_path_probability (float, default 0.0001): Minimum raw path probability for the causal step.
  • max_path_length (int | None, default None): Optional maximum path length in nodes.

Returns:

  • UUID: prediction_id for this run (a uuid.UUID). The validation target is remembered for get_target_score() when target is omitted there.
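The wait, timeout_seconds and poll_interval_seconds arguments suggest a conventional polling loop. The sketch below shows that pattern in general terms and is not the library's implementation: `wait_until_done` is a hypothetical helper, and the status names "completed" and "failed" are assumptions.

```python
import time


def wait_until_done(fetch_status, *, timeout_seconds=600.0,
                    poll_interval_seconds=2.0,
                    terminal=("completed", "failed")):
    """Poll fetch_status() until a terminal status or the deadline."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds}s")
        time.sleep(poll_interval_seconds)


# Simulated job that finishes on the third poll.
statuses = iter(["queued", "running", "completed"])
final_status = wait_until_done(lambda: next(statuses),
                               poll_interval_seconds=0.0)
```

Using time.monotonic() for the deadline keeps the loop unaffected by system clock adjustments.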

get_target_score

get_target_score(
    *,
    target: str | None = None,
    prediction_id: str | UUID | None = None,
    max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
    max_columns: int
    | None = DEFAULT_PREDICTION_MAX_COLUMNS,
    reload: bool = False,
) -> Any

Return the target-validation score row for target (single-row DataFrame).

Uses the latest active prediction when prediction_id is omitted. The default target is the one passed to the most recent target_validation().

External API responses expose only id, score, and percentage_degs_significant (via target_validation_score on GET). Internal users with full results still receive all score columns.

get_causal_paths

get_causal_paths(
    *, prediction_id: str | UUID | None = None
) -> Any

Return causal path rows for impacted DEGs (DataFrame).

Uses GET /v1/predictions/{id}/causal-paths. The API returns redacted rows for DEGs from get_degs_impacted_by_target() (server-filtered).

Parameters:

  • prediction_id (str | UUID | None, default None): When omitted, the latest target_validation() run is used.

get_all_degs

get_all_degs(
    *, prediction_id: str | UUID | None = None
) -> list[str]

Return DEGs that mapped onto the interactome.

Uses GET /v1/predictions/{id}/mapped-degs.

Parameters:

  • prediction_id (str | UUID | None, default None): When omitted, the latest target_validation() run is used.

get_degs_impacted_by_target

get_degs_impacted_by_target(
    *, prediction_id: str | UUID | None = None
) -> list[str]

Return DEGs whose influence score on the prediction's target exceeds the worker threshold.

Uses GET /v1/predictions/{id}/degs-impacted. The target is always taken from the prediction record (the target submitted with that prediction_id).

Parameters:

  • prediction_id (str | UUID | None, default None): When omitted, the latest target_validation() run is used.

get_intermediary_nodes

get_intermediary_nodes(
    *,
    prediction_id: str | UUID | None = None,
    top_n_degs: int,
    deg: str | None = None,
) -> list[str]

Return unique gene symbols from all nodes on causal paths (Enrichr lists).

Uses GET /v1/predictions/{id}/intermediary-proteins. top_n_degs uses the same server-side filter as plot_causal_graph().

Parameters:

  • prediction_id (str | UUID | None, default None): When omitted, the latest target_validation() run is used.
  • top_n_degs (int, required): Top DEGs by score_deg_given_target (server-side filter).
  • deg (str | None, default None): Optional DEG filter (e.g. "NDRG4|RNA"); omit for all top DEGs.
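Entity ids in this reference follow a SYMBOL|TYPE pattern (for example BRAF|PROTEIN or NDRG4|RNA), while enrichment tools such as Enrichr expect plain gene symbols. If you hold entity ids from another table and need the same kind of list, a reduction along these lines may help; `symbols_from_entities` is a hypothetical helper, and the assumption is that the symbol is everything before the first "|".

```python
def symbols_from_entities(entity_ids):
    """Strip the |TYPE suffix and drop duplicates, keeping first-seen order."""
    symbols = (entity.split("|", 1)[0] for entity in entity_ids)
    # dict preserves insertion order, so this de-duplicates stably.
    return list(dict.fromkeys(symbols))


symbols = symbols_from_entities(["BRAF|PROTEIN", "NDRG4|RNA", "BRAF|RNA"])
```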

get_intermediary_proteins

get_intermediary_proteins(
    *,
    prediction_id: str | UUID | None = None,
    top_n_degs: int,
    deg: str | None = None,
) -> list[str]

Alias for get_intermediary_nodes() (API route name retained).

plot_causal_graph

plot_causal_graph(
    *,
    top_n_degs: int,
    prediction_id: str | UUID | None = None,
    dpi: int | None = None,
    figsize: tuple[float, float] | None = None,
    display_dpi: float | None = None,
) -> Any

Show the causal graph via GET .../causal-graph (server Parquet replay).

Use after target_validation(). top_n_degs is required (TwinCell plot_causal_graph(..., top_n_causal_degs=...) semantics).

Parameters:

  • top_n_degs (int, required): Top DEGs by score_deg_given_target (server-side filter).
  • prediction_id (str | UUID | None, default None): When omitted, the latest target_validation() run is used.
  • dpi (int | None, default None): PNG render resolution on the API (default 200, TwinCell parity).
  • figsize (tuple[float, float] | None, default None): Matplotlib figure size in inches for notebook display. When omitted, width follows the API PNG aspect ratio (TwinCell layout).
  • display_dpi (float | None, default None): Matplotlib figure DPI for display only (default 100).

The figure displays once in Jupyter when this call is the last line in a cell.

extract_causal_subgraph

extract_causal_subgraph(
    simulation: Mapping[str, Any],
    *,
    target: str | None = None,
    identification_result: TargetIdentificationResult
    | None = None,
    n_degs: int | None = None,
    significance_threshold: float | None = None,
    prediction_id: str | UUID | None = None,
    wait: bool = True,
    timeout_seconds: float = 10 * 60,
    poll_interval_seconds: float = 2.0,
) -> dict[str, Any]

Pick a focal protein target and run causal_analysis()-style remote analysis.

display

display(subgraph: Mapping[str, Any]) -> Any

Show a causal subgraph PNG from extract_causal_subgraph().

path_analysis

path_analysis(
    context: Mapping[str, Any], *, deg: str, target: str
) -> dict[str, Any]

Summarise the influence score between one RNA token and one protein target.

filter_predictions_by_degs

filter_predictions_by_degs(
    *, degs: list[str]
) -> TargetIdentificationResult

Filter prediction rows whose gene token intersects degs.

as_dict

as_dict() -> dict[str, Any]

Summary fields useful for logging or UI.