TwinCell
The headline reference for TwinCell: the data requirements and the high-level TwinCell study class used throughout the tutorial and the use cases.
The data-preparation helpers (pseudobulk() / pydeseq2()) live on the Preprocessing page.
Data requirements
TwinCell accepts transcriptomic data as .h5ad (AnnData). For target validation you bring a target and one or two cell states (e.g. disease vs. healthy).
Input files must contain:
- Raw gene expression counts, not normalized, in `adata.X` or a named layer (via `raw_layer_name`, which refers to a counts layer in `adata.layers`, not `adata.raw`).
- Condition labels: a column in `adata.obs` identifying each cell/sample's state (e.g. `"ctrl"` vs. `"stim"`).
- Batch / sample grouping: a column in `adata.obs` for sample-level grouping (e.g. replicate). Used via `batch_id_col`; these are sample identifiers, not sequencing-batch labels.
- Cell type annotations (single-cell only): each prediction analyzes one cell type at a time.
Example obs schema:
| Column | Required? | Example values | Notes |
|---|---|---|---|
| `condition` | yes | `ctrl`, `stim` | Two values; used as the contrast. |
| `sample_id` | yes | `donor_1`, `donor_2`, … | Sample-level grouping for pseudo-bulk / DE. |
| `cell_type` | single-cell only | `CD4 T cells`, `Monocytes`, … | One prediction per cell type. |
Other requirements:
- One or two cell states, with at least two biological replicates per arm for differential expression.
- Gene identifiers as standard gene symbols in `adata.var_names`.
- The dataset must fit in memory during local preprocessing; subset to the cell types of interest for very large datasets.
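The replicate requirement can be checked before any upload. The sketch below is a hedged illustration, not part of TwinCell: `check_replicates` is a hypothetical helper, the column names follow the example obs schema above, and the rows are assumed to come from something like `adata.obs.to_dict("records")`.

```python
from collections import defaultdict


def check_replicates(obs_records, condition_col="condition",
                     sample_col="sample_id", min_replicates=2):
    """Return a list of problems; an empty list means the check passed."""
    # Collect the distinct sample identifiers seen in each condition arm.
    samples_per_arm = defaultdict(set)
    for row in obs_records:
        samples_per_arm[row[condition_col]].add(row[sample_col])

    problems = []
    if not 1 <= len(samples_per_arm) <= 2:
        problems.append(
            f"expected one or two cell states, found {len(samples_per_arm)}"
        )
    for arm in sorted(samples_per_arm):
        n = len(samples_per_arm[arm])
        if n < min_replicates:
            problems.append(
                f"arm {arm!r} has {n} replicate(s), need at least {min_replicates}"
            )
    return problems


obs = [
    {"condition": "ctrl", "sample_id": "donor_1"},
    {"condition": "ctrl", "sample_id": "donor_2"},
    {"condition": "stim", "sample_id": "donor_1"},
]
print(check_replicates(obs))
```

Here the `stim` arm has a single donor, so the helper reports one problem; adding a second `stim` donor makes the list empty.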
Raw counts only
TwinCell requires raw counts — the pipeline handles normalization internally. Passing pre-normalized data produces unreliable results.
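A quick heuristic can catch obviously normalized input before submission. This is a minimal sketch under stated assumptions: `looks_like_raw_counts` is a hypothetical helper, not a TwinCell function, and it only tests that values are non-negative and integer-valued. With AnnData you might feed it the stored values of the matrix (for example `adata.X.data` for a SciPy sparse matrix or `adata.X.ravel()` for a dense array).

```python
def looks_like_raw_counts(values, tolerance=1e-9):
    """Heuristic: raw counts are non-negative and integer-valued."""
    seen_any = False
    for value in values:
        seen_any = True
        # Negative or fractional values indicate scaling or log-normalization.
        if value < 0 or abs(value - round(value)) > tolerance:
            return False
    # An empty input gives no evidence either way; treat it as a failure.
    return seen_any


raw = [0, 3, 17, 0, 1]
log_normalized = [0.0, 1.386, 2.890, 0.0, 0.693]
print(looks_like_raw_counts(raw))             # True
print(looks_like_raw_counts(log_normalized))  # False
```

Passing this check does not prove the data are raw (rounded or otherwise transformed integers would also pass), so treat it as a guard against the common mistake rather than a guarantee.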
Reading data
read_h5ad
read_h5ad(
path_or_url: str | Path,
*,
destination: str | Path | None = None,
timeout_seconds: float = 300.0,
sanitize: bool = True,
) -> AnnData
Read AnnData from a local .h5ad file or a remote URI.
Local: pathlib.Path or a filesystem string (including file:// URIs).
The file must exist; destination is ignored.
Remote: http://, https://, or s3:// URIs. Streams to destination
when set, otherwise to a temporary file that is removed after load. For s3://,
uses boto3 with the default credential chain.
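The local/remote split described above can be illustrated with the standard library. This is a sketch of the documented rules only, not `read_h5ad`'s actual implementation; `classify_source` and `REMOTE_SCHEMES` are hypothetical names, and treating a single-letter scheme as a Windows drive letter is an added assumption.

```python
from pathlib import Path
from urllib.parse import urlparse

REMOTE_SCHEMES = {"http", "https", "s3"}


def classify_source(path_or_url):
    """Return 'local' or 'remote'; raise ValueError for unsupported schemes."""
    if isinstance(path_or_url, Path):
        return "local"
    scheme = urlparse(str(path_or_url)).scheme.lower()
    if scheme in REMOTE_SCHEMES:
        return "remote"
    # No scheme, file://, or a one-letter scheme (assumed Windows drive, C:\...).
    if scheme in {"", "file"} or len(scheme) == 1:
        return "local"
    raise ValueError(f"unsupported URI scheme: {scheme!r}")


print(classify_source("data/pbmc.h5ad"))            # local
print(classify_source("file:///tmp/pbmc.h5ad"))     # local
print(classify_source("s3://bucket/pbmc.h5ad"))     # remote
```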
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path_or_url` | `str \| Path` | Path to a local | required |
| `destination` | `str \| Path \| None` | Optional output path when downloading a remote object (parent dirs are created). Ignored for local paths. | `None` |
| `timeout_seconds` | `float` | HTTP client timeout for HTTP(S); connect + read timeouts for S3. | `300.0` |
| `sanitize` | `bool` | When True (default), coerce | `True` |
Returns:
| Type | Description |
|---|---|
| `AnnData` | In-memory |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | When a local path does not exist or is not a file. |
| `ValueError` | When a remote URI uses an unsupported scheme or malformed |
| `HTTPError` | On HTTP(S) network or status errors. |
| `ClientError` | On S3 API errors (e.g. missing object). |
When sanitize is True, duplicate obs / var name warnings from AnnData
during file load are suppressed because names are uniquified immediately after.
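To make "uniquified" concrete, the sketch below shows one common scheme: first occurrences keep their name and repeats receive a numeric suffix. It is an assumption that `sanitize` behaves this way; AnnData's own `obs_names_make_unique()` / `var_names_make_unique()` follow a similar suffix convention, and `make_unique` here is a hypothetical stand-in written for illustration.

```python
from collections import Counter


def make_unique(names, join="-"):
    """Suffix repeated names with -1, -2, ...; first occurrences are unchanged."""
    taken = set(names)   # names that already exist and must not be reused
    seen = Counter()     # how many times each original name has appeared so far
    result = []
    for name in names:
        if seen[name] == 0:
            result.append(name)
        else:
            n = seen[name]
            candidate = f"{name}{join}{n}"
            # Skip suffixes that collide with an existing or generated name.
            while candidate in taken:
                n += 1
                candidate = f"{name}{join}{n}"
            taken.add(candidate)
            result.append(candidate)
        seen[name] += 1
    return result


print(make_unique(["TP53", "ACTB", "TP53", "TP53"]))
# ['TP53', 'ACTB', 'TP53-1', 'TP53-2']
```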
TwinCell study
The notebook-oriented entry point. Construct from a control + perturbed AnnData pair plus a DEG list, then call target_validation() and inspect the result with the score / causal-path / graph methods (see the tutorial).
The methods used in the tutorial are target_validation(), get_target_score(),
get_causal_paths(), plot_causal_graph(), get_degs_impacted_by_target(), and
get_all_degs().
TwinCell
TwinCell(
*,
pdata_control: AnnData,
pdata_pert: AnnData | None = None,
degs: list[str] | None = None,
model: str = DEFAULT_TWINCELL_MODEL_VERSION,
api_key: str,
base_url: str | None = None,
validate_on_init: bool = True,
max_obs_per_anndata: int | None = None,
check_api_on_init: bool = True,
api_check_timeout_seconds: float = 15.0,
)
High-level handle for a split control vs perturbed AnnData pair and DEG list.
Provide expression matrices in adata.X and HGNC-style symbols in degs.
This workflow expects two objects (not a merged pseudo-bulk matrix); see
validate_twincell_split_anndata().
Suggested flow: run target_id() to submit data and obtain a prediction,
then causal_analysis() for graph-style follow-up on a protein target.
Use target_validation() for the integrated validation path: it submits the
same split arms via POST /v1/predictions with job_type=target_validation
(no separate causal POST).
Use plot_causal_graph() for graphs from causal_analysis() when you hold a
{"causal": ...} dict. For target validation, use plot_causal_graph() on the
instance with top_n_degs (GET .../causal-graph).
On construction, local split validation runs and (by default) a quick API
connectivity check (GET /health plus authenticated GET /v1/predictions).
Progress and a final ready message are printed to stdout for notebooks.
Call close() when finished to release HTTP resources.
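One way to make sure `close()` always runs, even when a step in between raises, is `contextlib.closing` from the standard library. The sketch uses a hypothetical `FakeStudy` stand-in so it runs without the service; whether TwinCell itself supports the `with` statement directly is not stated here.

```python
from contextlib import closing


class FakeStudy:
    """Hypothetical stand-in for a TwinCell instance; only close() matters here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


study = FakeStudy()
with closing(study) as tc:
    # With a real instance, target_validation() and the getters would go here.
    pass

print(study.closed)  # True
```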
prediction_id (property)
Active prediction id on the internal TwinCellSession.
session (property)
Underlying HTTP session (shared client, active prediction_id).
target_id
target_id(
*,
pdata_pert: AnnData | None = None,
degs: list[str] | None = None,
label: str | None = None,
model_version: str | None = None,
wait: bool = True,
timeout_seconds: float = 10 * 60,
poll_interval_seconds: float = 2.0,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
print_prediction_id: bool = True,
) -> Any
Submit split arms and DEGs, wait for the remote prediction, and return status.
Uses adata.X only; validation already ran at construction.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdata_pert` | `AnnData \| None` | Perturbed pseudo-bulk AnnData. Falls back to the instance attribute set at construction if not provided here. | `None` |
| `degs` | `list[str] \| None` | List of DEG gene symbols. Falls back to the instance attribute set at construction if not provided here. | `None` |
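The `wait`, `timeout_seconds`, and `poll_interval_seconds` arguments describe a poll-until-finished pattern. The following is a generic sketch of that pattern, not the client's actual loop: `wait_until_done`, the `fetch_status` callable, and the status strings `"completed"` / `"failed"` are all hypothetical.

```python
import time


def wait_until_done(fetch_status, timeout_seconds=600.0,
                    poll_interval_seconds=2.0,
                    terminal=("completed", "failed")):
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        # Give up if the next sleep would run past the deadline.
        if time.monotonic() + poll_interval_seconds > deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} s")
        time.sleep(poll_interval_seconds)


statuses = iter(["queued", "running", "completed"])
print(wait_until_done(lambda: next(statuses),
                      timeout_seconds=5.0, poll_interval_seconds=0.01))
```

Using `time.monotonic()` keeps the deadline unaffected by wall-clock adjustments during a long wait.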
build_differential_causal_graph
build_differential_causal_graph(
*,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
prediction_id: str | UUID | None = None,
) -> Any
Refresh prediction results and cache the influence preview matrix.
influence_matrix
Return the cached influence preview matrix, fetching from the service if needed.
last_prediction
Last prediction payload held by the internal TwinCellSession.
use_prediction
Select a completed prediction for subsequent getters on this instance.
After target_validation(), the session already points at that run.
Use this to switch to another row from list_predictions(). Validates
the id against the API and returns a one-row summary (same columns as
list_predictions()).
Raises:
| Type | Description |
|---|---|
NotFoundError
|
If the prediction does not exist or is not accessible. |
list_predictions
List past predictions (prediction_id, status, label, job_type, target).
Uses the same API key and base URL as this TwinCell instance. For the
raw PredictionsListResponse, call
TwinCellSession.list_predictions() on the session property.
Pagination: when the API returns next_cursor, it is stored in
df.attrs["next_cursor"]; pass it as cursor= on the next call.
Example::
history = tc.list_predictions()
pid = history.iloc[0]["prediction_id"]
tc.use_prediction(pid)
tc.get_target_score(prediction_id=pid)
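The pagination rule (read `next_cursor` from `attrs`, pass it back as `cursor=`) can be wrapped in a small generator. This is a sketch under assumptions: `iter_prediction_pages` is a hypothetical helper, and `fake_list_predictions` is a stand-in that mimics only the `attrs["next_cursor"]` behaviour described above so the example runs offline.

```python
from types import SimpleNamespace


def iter_prediction_pages(list_predictions):
    """Yield pages until the returned page carries no next_cursor."""
    cursor = None
    while True:
        page = list_predictions() if cursor is None else list_predictions(cursor=cursor)
        yield page
        cursor = page.attrs.get("next_cursor")
        if not cursor:
            break


def fake_list_predictions(cursor=None):
    # Stand-in pages: objects with a rows list and a DataFrame-like attrs dict.
    pages = {
        None: SimpleNamespace(rows=["p1", "p2"], attrs={"next_cursor": "c1"}),
        "c1": SimpleNamespace(rows=["p3"], attrs={}),
    }
    return pages[cursor]


rows = [r for page in iter_prediction_pages(fake_list_predictions) for r in page.rows]
print(rows)  # ['p1', 'p2', 'p3']
```

With a real instance the callable would be `tc.list_predictions`, and each page would be a DataFrame rather than the stand-in object.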
simulate
simulate(
targets: list[str] | None = None,
*,
prediction_id: str | UUID | None = None,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> dict[str, Any]
Merge prediction scores with rows whose gene symbol is in degs.
causal_analysis
causal_analysis(
*,
target_id: PredictionStatusResponse | None = None,
target: str,
top_n_causal_degs: int = 1000,
min_path_fraction: float = 0.1,
min_path_probability: float = 0.0001,
max_path_length: int | None = None,
prediction_id: str | UUID | None = None,
wait: bool = True,
timeout_seconds: float = 10 * 60,
poll_interval_seconds: float = 2.0,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> dict[str, Any]
Request a causal graph analysis for a protein target on a finished prediction.
Loads preview results locally to choose API parameters and to build a companion
df_paths table. The remote job returns status and may include a graph image;
use plot_causal_graph() to display it.
Normally call after target_id() for the same prediction. For the optional
alternate mode, see target_validation().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `target_id` | `PredictionStatusResponse \| None` | Completed prediction handle. If omitted, the active session prediction is used. | `None` |
| `target` | `str` | Protein entity id, e.g. | required |
| `top_n_causal_degs` | `int` | Caps ranked inputs used for | `1000` |
| `min_path_fraction` | `float` | Minimum path mass fraction per DEG (API default 0.1). | `0.1` |
| `min_path_probability` | `float` | Minimum raw path probability (API default 1e-4). | `0.0001` |
| `max_path_length` | `int \| None` | Optional maximum path length in nodes. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `dict` | `dict[str, Any]` | … class: … from … outputs), plus … |
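The three path thresholds can be read as successive filters over candidate paths. The sketch below shows one plausible reading only: it assumes "path mass fraction per DEG" means a path's probability divided by the summed probability of all paths ending at the same DEG, which the reference does not spell out. `filter_paths`, the dictionary keys, and the gene names are hypothetical; the real filtering happens on the service.

```python
from collections import defaultdict


def filter_paths(paths, min_path_fraction=0.1, min_path_probability=1e-4,
                 max_path_length=None):
    """Keep paths that pass the probability, fraction, and length thresholds."""
    # Total probability mass of all paths reaching each DEG.
    total_by_deg = defaultdict(float)
    for path in paths:
        total_by_deg[path["deg"]] += path["probability"]

    kept = []
    for path in paths:
        total = total_by_deg[path["deg"]]
        fraction = path["probability"] / total if total > 0 else 0.0
        if path["probability"] < min_path_probability:
            continue
        if fraction < min_path_fraction:
            continue
        if max_path_length is not None and len(path["nodes"]) > max_path_length:
            continue
        kept.append(path)
    return kept


paths = [
    {"deg": "IL6", "nodes": ["JAK1", "STAT3", "IL6"], "probability": 0.02},
    {"deg": "IL6", "nodes": ["JAK1", "TYK2", "STAT1", "IL6"], "probability": 0.001},
    {"deg": "CXCL10", "nodes": ["JAK1", "STAT1", "CXCL10"], "probability": 0.00005},
]
print([p["nodes"] for p in filter_paths(paths)])
```

With the defaults, only the first path survives: the second carries under 10% of the IL6 path mass, and the third falls below the raw probability floor.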
target_validation
target_validation(
*,
pdata_pert: AnnData | None = None,
degs: list[str] | None = None,
target_id: PredictionStatusResponse | None = None,
target: str,
label: str | None = None,
top_n_causal_degs: int = 1000,
deg_significance_fold: float = 1.0,
min_path_fraction: float = 0.1,
min_path_probability: float = 0.0001,
max_path_length: int | None = None,
prediction_id: str | UUID | None = None,
wait: bool = True,
timeout_seconds: float = 10 * 60,
poll_interval_seconds: float = 2.0,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
) -> UUID
Integrated target_validation run (split uploads via POST /v1/predictions).
Submits pdata_control, pdata_pert, and degs with
job_type='target_validation'. The API creates a new prediction row and
enqueues the prediction worker with job_type=target_validation (no separate
causal-analysis row); this is not a follow-up on an existing inference
prediction id.
When a local influence preview exists (see prediction_id / target_id
below), fold parameters align with causal_analysis(). Otherwise defaults
are used.
After completion, use get_target_score(), get_all_degs(),
get_degs_impacted_by_target(), get_causal_paths(),
get_intermediary_nodes(), and
plot_causal_graph() for allowed outputs (external tiers do not expose
presigned artifact URLs).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `pdata_pert` | `AnnData \| None` | Perturbed pseudo-bulk AnnData. Falls back to the instance attribute set at construction if not provided here. | `None` |
| `degs` | `list[str] \| None` | List of DEG gene symbols. Falls back to the instance attribute set at construction if not provided here. | `None` |
| `target_id` | `PredictionStatusResponse \| None` | Optional completed prediction handle used only to refresh local influence previews. Omit for standalone TV (no prior | `None` |
| `target` | `str` | Protein entity id, e.g. | required |
| `label` | `str \| None` | Optional run label stored on the prediction (visible in | `None` |
| `top_n_causal_degs` | `int` | Same role as in | `1000` |
| `deg_significance_fold` | `float` | Fold for the embedded | `1.0` |
| `min_path_fraction` | `float` | Minimum path mass fraction for the causal step. | `0.1` |
| `min_path_probability` | `float` | Minimum raw path probability for the causal step. | `0.0001` |
| `max_path_length` | `int \| None` | Optional maximum path length in nodes. | `None` |
Returns:
| Type | Description |
|---|---|
| `UUID` | uuid.UUID: … remembered for … |
get_target_score
get_target_score(
*,
target: str | None = None,
prediction_id: str | UUID | None = None,
max_rows: int = DEFAULT_PREDICTION_MAX_ROWS,
max_columns: int
| None = DEFAULT_PREDICTION_MAX_COLUMNS,
reload: bool = False,
) -> Any
Return the target-validation score row for target (single-row DataFrame).
Uses the latest active prediction when prediction_id is omitted. The
default target is the one passed to the most recent target_validation().
External API responses expose only id, score, and
percentage_degs_significant (via target_validation_score on GET).
Internal users with full results still receive all score columns.
get_causal_paths
Return causal path rows for impacted DEGs (DataFrame).
Uses GET /v1/predictions/{id}/causal-paths. The API returns redacted rows
for DEGs from get_degs_impacted_by_target() (server-filtered).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prediction_id` | `str \| UUID \| None` | Defaults to the latest | `None` |
get_all_degs
Return DEGs that mapped onto the interactome.
Uses GET /v1/predictions/{id}/mapped-degs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prediction_id` | `str \| UUID \| None` | Defaults to the latest | `None` |
get_degs_impacted_by_target
Return DEGs whose influence score on the prediction's target exceeds the worker threshold.
Uses GET /v1/predictions/{id}/degs-impacted. The target is always taken from
the prediction record (the target submitted with that prediction_id).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prediction_id` | `str \| UUID \| None` | Defaults to the latest | `None` |
get_intermediary_nodes
get_intermediary_nodes(
*,
prediction_id: str | UUID | None = None,
top_n_degs: int,
deg: str | None = None,
) -> list[str]
Return unique gene symbols from all nodes on causal paths (Enrichr lists).
Uses GET /v1/predictions/{id}/intermediary-proteins. top_n_degs uses
the same server-side filter as plot_causal_graph().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `prediction_id` | `str \| UUID \| None` | Defaults to the latest | `None` |
| `top_n_degs` | `int` | Top DEGs by | required |
| `deg` | `str \| None` | Optional DEG filter (e.g. | `None` |
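The "unique gene symbols from all nodes" result can be pictured as an order-preserving de-duplication over path nodes. The service performs this server-side; the sketch below is only a local illustration with a hypothetical `unique_nodes` helper, hypothetical dictionary keys, and made-up example paths.

```python
def unique_nodes(paths, deg=None):
    """Collect each node symbol once, in first-seen order, optionally for one DEG."""
    seen = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for path in paths:
        if deg is not None and path["deg"] != deg:
            continue
        for node in path["nodes"]:
            seen.setdefault(node, None)
    return list(seen)


paths = [
    {"deg": "IL6", "nodes": ["JAK1", "STAT3", "IL6"]},
    {"deg": "CXCL10", "nodes": ["JAK1", "STAT1", "CXCL10"]},
]
symbols = unique_nodes(paths)
print(symbols)             # ['JAK1', 'STAT3', 'IL6', 'STAT1', 'CXCL10']
print("\n".join(symbols))  # newline-separated, convenient for pasting into Enrichr
```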
get_intermediary_proteins
get_intermediary_proteins(
*,
prediction_id: str | UUID | None = None,
top_n_degs: int,
deg: str | None = None,
) -> list[str]
Alias for get_intermediary_nodes() (API route name retained).
plot_causal_graph
plot_causal_graph(
*,
top_n_degs: int,
prediction_id: str | UUID | None = None,
dpi: int | None = None,
figsize: tuple[float, float] | None = None,
display_dpi: float | None = None,
) -> Any
Show the causal graph via GET .../causal-graph (server Parquet replay).
Use after target_validation(). top_n_degs is required (TwinCell
plot_causal_graph(..., top_n_causal_degs=...) semantics).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `top_n_degs` | `int` | Top DEGs by | required |
| `prediction_id` | `str \| UUID \| None` | Defaults to the latest | `None` |
| `dpi` | `int \| None` | PNG render resolution on the API (default 200, TwinCell parity). | `None` |
| `figsize` | `tuple[float, float] \| None` | Matplotlib figure size in inches for notebook display. When omitted, width follows the API PNG aspect ratio (TwinCell layout). | `None` |
| `display_dpi` | `float \| None` | Matplotlib figure DPI for display only (default 100). | `None` |
The figure displays once in Jupyter when this call is the last line in a cell.
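"Width follows the API PNG aspect ratio" amounts to simple arithmetic on the image's pixel dimensions. A minimal sketch, assuming a fixed display height is chosen first; `figsize_from_png` and its 6-inch default height are hypothetical and not taken from the library.

```python
def figsize_from_png(width_px, height_px, height_in=6.0):
    """Return (width, height) in inches, preserving the PNG's aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("PNG dimensions must be positive")
    return (height_in * width_px / height_px, height_in)


print(figsize_from_png(2000, 1000, height_in=5.0))  # (10.0, 5.0)
```

Passing an explicit `figsize` to plot_causal_graph() overrides this kind of automatic sizing, which is useful when the graph is wide and the notebook column is narrow.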
extract_causal_subgraph
extract_causal_subgraph(
simulation: Mapping[str, Any],
*,
target: str | None = None,
identification_result: TargetIdentificationResult
| None = None,
n_degs: int | None = None,
significance_threshold: float | None = None,
prediction_id: str | UUID | None = None,
wait: bool = True,
timeout_seconds: float = 10 * 60,
poll_interval_seconds: float = 2.0,
) -> dict[str, Any]
Pick a focal protein target and run causal_analysis()-style remote analysis.
display
Show a causal subgraph PNG from extract_causal_subgraph().
path_analysis
Summarise the influence score between one RNA token and one protein target.
filter_predictions_by_degs
Filter prediction rows whose gene token intersects degs.
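Conceptually this is a set-membership filter over prediction rows. The sketch below illustrates the idea on plain dictionaries; `filter_rows_by_degs`, the `"gene"` key, and the example scores are hypothetical, and the method's real input and output types are not documented here.

```python
def filter_rows_by_degs(rows, degs, gene_key="gene"):
    """Keep rows whose gene token is one of the DEG symbols."""
    wanted = set(degs)  # set lookup keeps the filter linear in the number of rows
    return [row for row in rows if row[gene_key] in wanted]


rows = [
    {"gene": "IL6", "score": 0.91},
    {"gene": "ACTB", "score": 0.12},
    {"gene": "CXCL10", "score": 0.77},
]
print(filter_rows_by_degs(rows, ["IL6", "CXCL10"]))
```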