psma package

PSMA surface modeling package.

class psma.PsmaMetrics(auc, auc_defined, mcc, threshold, confusion, metric_notes)

Bases: object

Summary metrics for PSMA test-set predictions.

Parameters:
  • auc (float | None)

  • auc_defined (bool)

  • mcc (float)

  • threshold (float)

  • confusion (dict[str, int])

  • metric_notes (dict[str, str])

auc: float | None
auc_defined: bool
mcc: float
threshold: float
confusion: dict[str, int]
metric_notes: dict[str, str]

class psma.PsmaParams(y_col, label_threshold, label_direction, similarity_method, embed_method, distance_k, tsne_perplexity, tsne_max_iter, grid_n, grid_lims_scale, eps, split_method, test_fraction, random_state, butina_distance_cutoff, mcc_prob_threshold, morgan_radius, morgan_n_bits)

Bases: object

Stable parameter payload describing one PSMA workflow run.

Parameters:
  • y_col (str)

  • label_threshold (float)

  • label_direction (str)

  • similarity_method (str)

  • embed_method (str)

  • distance_k (float)

  • tsne_perplexity (float)

  • tsne_max_iter (int)

  • grid_n (int)

  • grid_lims_scale (float)

  • eps (float)

  • split_method (str)

  • test_fraction (float)

  • random_state (int)

  • butina_distance_cutoff (float)

  • mcc_prob_threshold (float)

  • morgan_radius (int)

  • morgan_n_bits (int)

y_col: str
label_threshold: float
label_direction: str
similarity_method: str
embed_method: str
distance_k: float
tsne_perplexity: float
tsne_max_iter: int
grid_n: int
grid_lims_scale: float
eps: float
split_method: str
test_fraction: float
random_state: int
butina_distance_cutoff: float
mcc_prob_threshold: float
morgan_radius: int
morgan_n_bits: int

class psma.PsmaResult(indices, similarity, distance, coords, labels, grid, projection, prob_test, metrics, params, artifacts_dir, frames)

Bases: object

Top-level typed result returned by the PSMA workflow.

Parameters:
  • indices (PsmaIndices)

  • similarity (PsmaSimilarity)

  • distance (PsmaDistance)

  • coords (PsmaCoordinates)

  • labels (PsmaLabels)

  • grid (PsmaGrid)

  • projection (ProjectionDiagnostics)

  • prob_test (ndarray)

  • metrics (PsmaMetrics)

  • params (PsmaParams)

  • artifacts_dir (Path | None)

  • frames (PsmaFrames)

indices: PsmaIndices
similarity: PsmaSimilarity
distance: PsmaDistance
coords: PsmaCoordinates
labels: PsmaLabels
grid: PsmaGrid
projection: ProjectionDiagnostics
prob_test: ndarray
metrics: PsmaMetrics
params: PsmaParams
artifacts_dir: Path | None
frames: PsmaFrames

psma.compute_psma_surface(df, *, y_col, label_threshold, label_direction, similarity_method='rdkit_morgan_tanimoto', embed_method='pcoa', smiles_col='smiles', fp_col=None, emb_col=None, mol_id_col='mol_id', triples_df=None, distance_k=0.382, tsne_perplexity=30, tsne_max_iter=500, grid_n=128, grid_lims_scale=2.0, eps=1e-12, split_method='random', test_fraction=0.2, random_state=42, butina_distance_cutoff=0.4, mcc_prob_threshold=0.5, morgan_radius=2, morgan_n_bits=2048)

Run the PSMA computation path without writing artifacts.

Parameters:
  • df (DataFrame) – Input dataframe containing endpoint and feature columns.

  • y_col (str) – Continuous endpoint column name.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

  • similarity_method (str) – Similarity backend name.

  • embed_method (str) – Embedding backend name.

  • smiles_col (str | None) – Optional SMILES column for RDKit similarity.

  • fp_col (str | None) – Optional precomputed fingerprint column.

  • emb_col (str | None) – Optional embedding column for cosine similarity.

  • mol_id_col (str | None) – Optional molecule identifier column.

  • triples_df (DataFrame | None) – Optional triples dataframe for imported similarities.

  • distance_k (float) – Convex distance-transform parameter.

  • tsne_perplexity (float) – t-SNE perplexity when that backend is used.

  • tsne_max_iter (int) – t-SNE iteration cap when that backend is used.

  • grid_n (int) – Number of grid steps per axis.

  • grid_lims_scale (float) – Scale factor for the posterior grid bounds.

  • eps (float) – Numerical stabilizer used in posterior scoring.

  • split_method (str) – Split strategy, either 'random' or 'butina'.

  • test_fraction (float) – Fraction of compounds assigned to the test set.

  • random_state (int) – Random seed used across the workflow.

  • butina_distance_cutoff (float) – Distance threshold used by Butina clustering.

  • mcc_prob_threshold (float) – Probability threshold used for MCC.

  • morgan_radius (int) – Morgan fingerprint radius for RDKit similarity.

  • morgan_n_bits (int) – Morgan fingerprint size for RDKit similarity.

Returns:

Typed PSMA result payload containing workflow outputs, diagnostics, parameters, and data frames.

Return type:

PsmaResult
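
A minimal call sketch, assuming the package is importable as psma and that the input dataframe carries a smiles column plus a continuous endpoint column. The column name pIC50, the threshold 6.0, and the helper name run_example are illustrative assumptions; the keyword names and defaults follow the signature above. Reading 'ge' as "endpoint greater than or equal to the threshold is positive" is also an assumption.

```python
# Keyword arguments for one run; names follow the documented signature.
# "pIC50" and the 6.0 cutoff are hypothetical choices for illustration.
PSMA_KWARGS = {
    "y_col": "pIC50",
    "label_threshold": 6.0,
    "label_direction": "ge",  # assumed meaning: endpoint >= threshold is positive
    "similarity_method": "rdkit_morgan_tanimoto",
    "embed_method": "pcoa",
    "split_method": "random",
    "test_fraction": 0.2,
    "random_state": 42,
}


def run_example(df):
    """Run the in-memory computation path on a prepared dataframe."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from psma import compute_psma_surface

    result = compute_psma_surface(df, **PSMA_KWARGS)
    # metrics and frames are documented fields of PsmaResult.
    return result.metrics, result.frames.test
```

Because this path writes no artifacts, it is the natural choice for notebooks and tests; run_psma_surface accepts the same arguments plus output_dir.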

psma.run_psma_surface(df, *, y_col, label_threshold, label_direction, similarity_method='rdkit_morgan_tanimoto', embed_method='pcoa', smiles_col='smiles', fp_col=None, emb_col=None, mol_id_col='mol_id', triples_df=None, distance_k=0.382, tsne_perplexity=30, tsne_max_iter=500, grid_n=128, grid_lims_scale=2.0, eps=1e-12, split_method='random', test_fraction=0.2, random_state=42, butina_distance_cutoff=0.4, mcc_prob_threshold=0.5, morgan_radius=2, morgan_n_bits=2048, output_dir=None)

Run the PSMA workflow and optionally persist standard artifacts.

Parameters:
  • df (DataFrame) – Input dataframe containing endpoint and feature columns.

  • y_col (str) – Continuous endpoint column name.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

  • similarity_method (str) – Similarity backend name.

  • embed_method (str) – Embedding backend name.

  • smiles_col (str | None) – Optional SMILES column for RDKit similarity.

  • fp_col (str | None) – Optional precomputed fingerprint column.

  • emb_col (str | None) – Optional embedding column for cosine similarity.

  • mol_id_col (str | None) – Optional molecule identifier column.

  • triples_df (DataFrame | None) – Optional triples dataframe for imported similarities.

  • distance_k (float) – Convex distance-transform parameter.

  • tsne_perplexity (float) – t-SNE perplexity when that backend is used.

  • tsne_max_iter (int) – t-SNE iteration cap when that backend is used.

  • grid_n (int) – Number of grid steps per axis.

  • grid_lims_scale (float) – Scale factor for the posterior grid bounds.

  • eps (float) – Numerical stabilizer used in posterior scoring.

  • split_method (str) – Split strategy, either 'random' or 'butina'.

  • test_fraction (float) – Fraction of compounds assigned to the test set.

  • random_state (int) – Random seed used across the workflow.

  • butina_distance_cutoff (float) – Distance threshold used by Butina clustering.

  • mcc_prob_threshold (float) – Probability threshold used for MCC.

  • morgan_radius (int) – Morgan fingerprint radius for RDKit similarity.

  • morgan_n_bits (int) – Morgan fingerprint size for RDKit similarity.

  • output_dir (str | Path | None) – Optional output directory. When provided, standard artifacts are written to disk.

Returns:

Typed PSMA result payload. The artifacts_dir field is set only when artifact writing is requested.

Return type:

PsmaResult

Submodules

psma.cli module

Command-line interface for reproducible PSMA workflow runs.

psma.cli.main(argv=None)

Run the PSMA command-line interface.

Parameters:

argv (Sequence[str] | None) – Optional argument vector excluding the executable name.

Returns:

Process exit code where 0 indicates success.

Return type:

int
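
The argv=None convention and the integer return value can be sketched as follows. This is not the package's actual parser: the program name and all three flags are hypothetical, chosen only to mirror documented workflow parameters.

```python
import argparse
from typing import Optional, Sequence


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments and return a process exit code (0 on success)."""
    parser = argparse.ArgumentParser(prog="psma-sketch")
    # Hypothetical flags; the real CLI options are not documented here.
    parser.add_argument("--y-col", required=True)
    parser.add_argument("--label-threshold", type=float, required=True)
    parser.add_argument("--label-direction", choices=("le", "ge"), required=True)
    try:
        # argv=None makes argparse fall back to sys.argv[1:].
        args = parser.parse_args(argv)
    except SystemExit as exc:
        # argparse exits on bad input; convert that into a return value.
        return int(exc.code or 0)
    print(f"would run PSMA on {args.y_col!r} ({args.label_direction} {args.label_threshold})")
    return 0
```

Returning the code instead of exiting keeps the function easy to test; a script entry point can then call raise SystemExit(main()).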

psma.distance module

Similarity-to-distance transform utilities.

psma.distance.similarity_to_distance(similarity, *, distance_k=0.382)

Convert similarities to convex-transformed distances.

Parameters:
  • similarity (ndarray) – Similarity matrix or rectangular similarity block.

  • distance_k (float) – Convex transform parameter.

Returns:

Distance matrix or block with the same shape as the input.

Return type:

ndarray

psma.embed module

2D embedding utilities for PSMA workflows.

psma.embed.pcoa_embed(distance_matrix, *, n_components=2)

Compute a classical multidimensional scaling (principal coordinates analysis, PCoA) embedding.

Parameters:
  • distance_matrix (ndarray) – Square distance matrix for the training set.

  • n_components (int) – Number of embedding dimensions to keep.

Returns:

Embedded coordinates with shape (n_samples, n_components).

Return type:

ndarray
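
For orientation, the textbook form of classical multidimensional scaling is sketched below with NumPy. The function name pcoa_sketch and the handling of negative eigenvalues are assumptions; the package's own implementation may differ in detail.

```python
import numpy as np


def pcoa_sketch(distance_matrix: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Classical MDS: double-center squared distances, then eigendecompose."""
    d = np.asarray(distance_matrix, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    # Assumption: negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

When the input distances are Euclidean and the true dimensionality does not exceed n_components, the pairwise distances of the returned coordinates match the input.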

psma.embed.tsne_embed(distance_matrix, *, perplexity=30.0, max_iter=500, random_state=42)

Compute a 2D t-SNE embedding from precomputed distances.

Parameters:
  • distance_matrix (ndarray) – Square precomputed distance matrix.

  • perplexity (float) – t-SNE perplexity parameter.

  • max_iter (int) – Maximum number of t-SNE iterations.

  • random_state (int) – Random seed for reproducible fitting.

Returns:

Embedded coordinates with shape (n_samples, 2).

Return type:

ndarray

psma.embed.embed_train(distance_matrix, *, method, tsne_perplexity, tsne_max_iter, random_state)

Dispatch the requested embedding backend for training distances.

Parameters:
  • distance_matrix (ndarray) – Square training distance matrix.

  • method (str) – Embedding backend name.

  • tsne_perplexity (float) – t-SNE perplexity used when method='tsne'.

  • tsne_max_iter (int) – t-SNE iteration cap used when method='tsne'.

  • random_state (int) – Random seed for embedding backends that use one.

Returns:

Embedded training coordinates.

Raises:

ValueError – If method is not a supported embedding backend.

Return type:

ndarray
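
The dispatch-and-fail pattern described above might look like the sketch below. The helper name dispatch_embedding and the backends mapping are hypothetical; the documented function takes the t-SNE settings directly rather than a table of callables.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]


def dispatch_embedding(
    distance_matrix: Matrix,
    *,
    method: str,
    backends: Dict[str, Callable[[Matrix], Matrix]],
) -> Matrix:
    """Look up the backend by name and fail loudly on unknown names."""
    try:
        backend = backends[method]
    except KeyError:
        supported = ", ".join(sorted(backends))
        raise ValueError(
            f"Unsupported embed method {method!r}; expected one of: {supported}"
        ) from None
    return backend(distance_matrix)
```

Listing the supported names in the error message makes a mistyped method value easy to diagnose.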

psma.io module

Input/output helpers for PSMA surface workflows.

psma.io.ensure_output_dir(base_dir='data/processed/psma_surface')

Create and return the output directory.

Parameters:

base_dir (str | Path) – Relative or absolute path to the output directory.

Returns:

Created output directory path.

Return type:

Path

psma.io.validate_input_dataframe(df, *, y_col, smiles_col, fp_col, emb_col, similarity_method)

Validate dataframe columns required by the selected similarity backend.

Parameters:
  • df (DataFrame) – Input dataframe for one PSMA workflow run.

  • y_col (str) – Name of the continuous endpoint column.

  • smiles_col (str | None) – Name of the SMILES column when using RDKit similarity.

  • fp_col (str | None) – Name of the fingerprint column when fingerprints are precomputed.

  • emb_col (str | None) – Name of the embedding column for cosine similarity.

  • similarity_method (str) – Selected similarity backend name.

Raises:

ValueError – If the selected backend is unknown or required columns are missing.

Return type:

None

psma.io.extract_core_arrays(df, *, y_col, mol_id_col, smiles_col, fp_col, emb_col)

Extract aligned arrays used by the modeling pipeline.

Parameters:
  • df (DataFrame) – Input dataframe for one PSMA workflow run.

  • y_col (str) – Name of the continuous endpoint column.

  • mol_id_col (str | None) – Optional molecule identifier column.

  • smiles_col (str | None) – Optional SMILES column.

  • fp_col (str | None) – Optional precomputed fingerprint column.

  • emb_col (str | None) – Optional embedding column.

Returns:

Dictionary of aligned core arrays used across the workflow.

Raises:

ValueError – If the embedding column does not contain 1D vectors.

Return type:

dict[str, Any]

psma.io.save_json(path, payload)

Save a JSON payload with stable formatting.

Parameters:
  • path (Path) – Target JSON file path.

  • payload (dict[str, Any]) – JSON-serializable mapping to write.

Return type:

None
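
One common way to get stable JSON formatting is to sort keys and fix the indentation, as in this sketch. The name save_json_sketch and the specific choices (two-space indent, trailing newline, UTF-8) are assumptions rather than the package's documented behavior.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_json_sketch(path: Path, payload: Dict[str, Any]) -> None:
    """Write JSON with sorted keys and fixed indentation for stable diffs."""
    text = json.dumps(payload, indent=2, sort_keys=True)
    path.write_text(text + "\n", encoding="utf-8")
```

With sorted keys, two payloads that differ only in insertion order produce byte-identical files, which keeps version-control diffs quiet.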

psma.io.save_csv(path, frame)

Save a dataframe to CSV without an index column.

Parameters:
  • path (Path) – Target CSV file path.

  • frame (DataFrame) – Dataframe to persist.

Return type:

None

psma.io.write_psma_artifacts(results, *, output_dir)

Write standard PSMA artifacts for a computed result payload.

Parameters:
  • results (PsmaResult) – Computed PSMA result payload containing frames, grids, params, and metrics.

  • output_dir (str | Path) – Relative or absolute target directory for persisted artifacts.

Returns:

Created output directory containing the written artifacts.

Return type:

Path

psma.kde module

Kernel density estimation helpers for PSMA posterior surfaces.

class psma.kde.KDEResult

Bases: TypedDict

Typed output from class-conditional KDE fitting.

hx: float
hy: float
kde_pos: KernelDensity
kde_neg: KernelDensity
c_train_scaled: ndarray

class psma.kde.DensityGrid

Bases: TypedDict

Typed output for evaluated density grids.

grid_x: ndarray
grid_y: ndarray
pos_density_z: ndarray
neg_density_z: ndarray

psma.kde.fit_class_kdes(c_train, y_train_bin)

Fit class-conditional KDEs in bandwidth-scaled coordinate space.

Parameters:
  • c_train (ndarray) – Embedded training coordinates with shape (n_samples, 2).

  • y_train_bin (ndarray) – Binary training labels aligned to c_train.

Returns:

Fitted KDE models, bandwidths, and scaled coordinates.

Raises:

ValueError – If either the positive or the negative class is missing from the training labels.

Return type:

KDEResult

psma.kde.evaluate_density_grid(kde_pos, kde_neg, c_train, *, hx, hy, grid_n, grid_lims_scale)

Evaluate class density surfaces on a regular 2D grid.

Parameters:
  • kde_pos (KernelDensity) – KDE fitted on positive-class training coordinates.

  • kde_neg (KernelDensity) – KDE fitted on negative-class training coordinates.

  • c_train (ndarray) – Embedded training coordinates used to define grid limits.

  • hx (float) – Horizontal bandwidth used to scale coordinates.

  • hy (float) – Vertical bandwidth used to scale coordinates.

  • grid_n (int) – Number of grid steps per axis.

  • grid_lims_scale (float) – Scale factor applied to the training-coordinate bounding box.

Returns:

Evaluated positive and negative density grids.

Return type:

DensityGrid

psma.labels module

Label thresholding utilities for continuous endpoints.

class psma.labels.LabelResult

Bases: TypedDict

Typed output for thresholded labels and class priors.

y_train_bin: ndarray
y_test_bin: ndarray
prior_pos: float
prior_neg: float

psma.labels.validate_thresholded_training_labels(y_all_bin, y_train_bin, *, label_threshold, label_direction, split_method)

Validate that thresholding produced a trainable class structure.

Parameters:
  • y_all_bin (ndarray) – Thresholded labels for the full dataset.

  • y_train_bin (ndarray) – Thresholded labels for the training split.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

  • split_method (str) – Split strategy used to construct the training set.

Raises:

ValueError – If thresholding collapses the full dataset to one class or if the selected split produces a single-class training set.

Return type:

None

psma.labels.threshold_labels(y_train, y_test, *, label_threshold, label_direction)

Threshold continuous values into binary labels and class priors.

Parameters:
  • y_train (ndarray) – Continuous training endpoint values.

  • y_test (ndarray) – Continuous test endpoint values.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

Returns:

Thresholded train/test labels and class priors.

Raises:

ValueError – If label_direction is not supported.

Return type:

LabelResult
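
The thresholding step can be sketched with NumPy as below. Two points are assumptions: that 'le' and 'ge' mark values at or below and at or above the threshold as positive, and that the priors are the class frequencies of the training split. The name threshold_sketch is hypothetical.

```python
import numpy as np


def threshold_sketch(y_train, y_test, *, label_threshold, label_direction):
    """Binarize endpoints and estimate class priors from the training split."""
    if label_direction == "le":
        compare = np.less_equal
    elif label_direction == "ge":
        compare = np.greater_equal
    else:
        raise ValueError(
            f"label_direction must be 'le' or 'ge', got {label_direction!r}"
        )
    y_train_bin = compare(np.asarray(y_train, dtype=float), label_threshold).astype(int)
    y_test_bin = compare(np.asarray(y_test, dtype=float), label_threshold).astype(int)
    # Assumption: priors are the empirical class frequencies in the training set.
    prior_pos = float(y_train_bin.mean())
    return {
        "y_train_bin": y_train_bin,
        "y_test_bin": y_test_bin,
        "prior_pos": prior_pos,
        "prior_neg": 1.0 - prior_pos,
    }
```

Under these assumptions, a potency endpoint such as pIC50 would use 'ge', while an endpoint where lower is better would use 'le'.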

psma.main module

Main orchestration entrypoint for PSMA surface modeling.

psma.main.compute_psma_surface(df, *, y_col, label_threshold, label_direction, similarity_method='rdkit_morgan_tanimoto', embed_method='pcoa', smiles_col='smiles', fp_col=None, emb_col=None, mol_id_col='mol_id', triples_df=None, distance_k=0.382, tsne_perplexity=30, tsne_max_iter=500, grid_n=128, grid_lims_scale=2.0, eps=1e-12, split_method='random', test_fraction=0.2, random_state=42, butina_distance_cutoff=0.4, mcc_prob_threshold=0.5, morgan_radius=2, morgan_n_bits=2048)

Run the PSMA computation path without writing artifacts.

Parameters:
  • df (DataFrame) – Input dataframe containing endpoint and feature columns.

  • y_col (str) – Continuous endpoint column name.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

  • similarity_method (str) – Similarity backend name.

  • embed_method (str) – Embedding backend name.

  • smiles_col (str | None) – Optional SMILES column for RDKit similarity.

  • fp_col (str | None) – Optional precomputed fingerprint column.

  • emb_col (str | None) – Optional embedding column for cosine similarity.

  • mol_id_col (str | None) – Optional molecule identifier column.

  • triples_df (DataFrame | None) – Optional triples dataframe for imported similarities.

  • distance_k (float) – Convex distance-transform parameter.

  • tsne_perplexity (float) – t-SNE perplexity when that backend is used.

  • tsne_max_iter (int) – t-SNE iteration cap when that backend is used.

  • grid_n (int) – Number of grid steps per axis.

  • grid_lims_scale (float) – Scale factor for the posterior grid bounds.

  • eps (float) – Numerical stabilizer used in posterior scoring.

  • split_method (str) – Split strategy, either 'random' or 'butina'.

  • test_fraction (float) – Fraction of compounds assigned to the test set.

  • random_state (int) – Random seed used across the workflow.

  • butina_distance_cutoff (float) – Distance threshold used by Butina clustering.

  • mcc_prob_threshold (float) – Probability threshold used for MCC.

  • morgan_radius (int) – Morgan fingerprint radius for RDKit similarity.

  • morgan_n_bits (int) – Morgan fingerprint size for RDKit similarity.

Returns:

Typed PSMA result payload containing workflow outputs, diagnostics, parameters, and data frames.

Return type:

PsmaResult

psma.main.run_psma_surface(df, *, y_col, label_threshold, label_direction, similarity_method='rdkit_morgan_tanimoto', embed_method='pcoa', smiles_col='smiles', fp_col=None, emb_col=None, mol_id_col='mol_id', triples_df=None, distance_k=0.382, tsne_perplexity=30, tsne_max_iter=500, grid_n=128, grid_lims_scale=2.0, eps=1e-12, split_method='random', test_fraction=0.2, random_state=42, butina_distance_cutoff=0.4, mcc_prob_threshold=0.5, morgan_radius=2, morgan_n_bits=2048, output_dir=None)

Run the PSMA workflow and optionally persist standard artifacts.

Parameters:
  • df (DataFrame) – Input dataframe containing endpoint and feature columns.

  • y_col (str) – Continuous endpoint column name.

  • label_threshold (float) – Threshold used to define the positive class.

  • label_direction (str) – Threshold direction, either 'le' or 'ge'.

  • similarity_method (str) – Similarity backend name.

  • embed_method (str) – Embedding backend name.

  • smiles_col (str | None) – Optional SMILES column for RDKit similarity.

  • fp_col (str | None) – Optional precomputed fingerprint column.

  • emb_col (str | None) – Optional embedding column for cosine similarity.

  • mol_id_col (str | None) – Optional molecule identifier column.

  • triples_df (DataFrame | None) – Optional triples dataframe for imported similarities.

  • distance_k (float) – Convex distance-transform parameter.

  • tsne_perplexity (float) – t-SNE perplexity when that backend is used.

  • tsne_max_iter (int) – t-SNE iteration cap when that backend is used.

  • grid_n (int) – Number of grid steps per axis.

  • grid_lims_scale (float) – Scale factor for the posterior grid bounds.

  • eps (float) – Numerical stabilizer used in posterior scoring.

  • split_method (str) – Split strategy, either 'random' or 'butina'.

  • test_fraction (float) – Fraction of compounds assigned to the test set.

  • random_state (int) – Random seed used across the workflow.

  • butina_distance_cutoff (float) – Distance threshold used by Butina clustering.

  • mcc_prob_threshold (float) – Probability threshold used for MCC.

  • morgan_radius (int) – Morgan fingerprint radius for RDKit similarity.

  • morgan_n_bits (int) – Morgan fingerprint size for RDKit similarity.

  • output_dir (str | Path | None) – Optional output directory. When provided, standard artifacts are written to disk.

Returns:

Typed PSMA result payload. The artifacts_dir field is set only when artifact writing is requested.

Return type:

PsmaResult

psma.metrics module

Evaluation metrics for PSMA predictions.

psma.metrics.compute_metrics(y_true, prob, *, threshold=0.5)

Compute AUC, MCC, and confusion entries for test predictions.

Parameters:
  • y_true (ndarray) – Binary ground-truth test labels.

  • prob (ndarray) – Predicted positive-class probabilities.

  • threshold (float) – Probability threshold used for hard predictions.

Returns:

Metric summary containing AUC, MCC, threshold, and confusion counts.

Return type:

dict[str, Any]
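
As a reference for the hard-prediction metrics, here is a pure-Python sketch of the confusion counts and the Matthews correlation coefficient. The key names tp, fp, tn, fn, the tie rule at the threshold, and the zero fallback are assumptions; AUC is omitted, and the auc_defined field of PsmaMetrics suggests it may be undefined in some cases, presumably when the test labels contain a single class.

```python
import math
from typing import Dict, Sequence, Tuple


def confusion_and_mcc(
    y_true: Sequence[int], prob: Sequence[float], threshold: float = 0.5
) -> Tuple[Dict[str, int], float]:
    """Return confusion counts and MCC for hard predictions at a threshold."""
    # Assumed convention: a probability equal to the threshold counts as positive.
    pred = [1 if p >= threshold else 0 for p in prob]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed fallback: report 0.0 when a row or column of the table is empty.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}, mcc
```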

psma.plot module

Notebook-friendly plotting helpers for PSMA surfaces.

psma.plot.plot_posterior_2d(ax, *, grid_x, grid_y, posterior_z, pos_density_z, c_test, y_test_bin, posterior_palette='rdylgn_reversed', contour_levels=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), point_size=12, point_alpha=0.75)

Plot a 2D posterior surface with test points.

Parameters:
  • ax (Axes) – Matplotlib axes used for rendering.

  • grid_x (ndarray) – X-axis coordinates of the posterior grid.

  • grid_y (ndarray) – Y-axis coordinates of the posterior grid.

  • posterior_z (ndarray) – Posterior probability surface values.

  • pos_density_z (ndarray) – Positive-class density surface values.

  • c_test (ndarray) – Projected test coordinates.

  • y_test_bin (ndarray) – Binary test labels aligned to c_test.

  • posterior_palette (str) – Palette selector for the posterior surface.

  • contour_levels (tuple[float, ...]) – Explicit posterior levels used for contour lines.

  • point_size (int) – Size of the projected test-point glyphs.

  • point_alpha (float) – Opacity for the projected test-point glyphs.

Returns:

The input axes after rendering.

Raises:

ValueError – If the posterior palette name is unsupported.

Return type:

Axes

psma.plot.plot_posterior_3d(*, grid_x, grid_y, posterior_z, c_test, prob_test, posterior_palette='rdylgn_reversed')

Plot a 3D posterior surface with scored test points.

Parameters:
  • grid_x (ndarray) – X-axis coordinates of the posterior grid.

  • grid_y (ndarray) – Y-axis coordinates of the posterior grid.

  • posterior_z (ndarray) – Posterior probability surface values.

  • c_test (ndarray) – Projected test coordinates.

  • prob_test (ndarray) – Posterior probabilities for the test coordinates.

  • posterior_palette (str) – Palette selector for the posterior surface.

Returns:

Tuple of the created figure and 3D axes.

Raises:

ValueError – If the posterior palette name is unsupported.

psma.plot.plot_posterior_2d_interactive(*, grid_x, grid_y, posterior_z, pos_density_z, c_test, y_test_bin, prob_test=None, mol_ids=None, smiles=None, width=950, height=720, title='Interactive PSMA Posterior (toggle layers via legend)', posterior_palette='rdylgn_reversed', heatmap_palette=None, contour_levels=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), contour_width=5.0, contour_alpha=1.0, contour_color=None, point_size=4, point_fill_alpha=0.5, point_line_alpha=0.8, point_line_width=0.5)

Build an interactive Bokeh version of the 2D posterior plot.

Parameters:
  • grid_x (ndarray) – X-axis coordinates of the posterior grid.

  • grid_y (ndarray) – Y-axis coordinates of the posterior grid.

  • posterior_z (ndarray) – Posterior probability surface values.

  • pos_density_z (ndarray) – Positive-class density surface values.

  • c_test (ndarray) – Projected test coordinates.

  • y_test_bin (ndarray) – Binary test labels aligned to c_test.

  • prob_test (ndarray | None) – Optional posterior probabilities for the test points.

  • mol_ids (ndarray | None) – Optional molecule identifiers aligned to c_test.

  • smiles (ndarray | None) – Optional SMILES strings aligned to c_test for hover depictions.

  • width (int) – Width of the interactive plot in pixels.

  • height (int) – Height of the interactive plot in pixels.

  • title (str) – Figure title shown above the interactive plot.

  • posterior_palette (str) – Palette selector for the posterior heatmap. Use "rdylgn_reversed" for notebook-style parity or "viridis" for the previous default.

  • heatmap_palette (str | None) – Backwards-compatible alias for posterior_palette.

  • contour_levels (tuple[float, ...]) – Explicit posterior levels used for contour lines.

  • contour_width (float) – Width of the contour lines.

  • contour_alpha (float) – Opacity applied to the contour layer.

  • contour_color (str | None) – Optional fixed contour color. When None, contour levels are color-mapped instead.

  • point_size (int) – Size of the projected test-point glyphs.

  • point_fill_alpha (float) – Fill opacity for the projected test points.

  • point_line_alpha (float) – Outline opacity for the projected test points.

  • point_line_width (float) – Outline width for the projected test points.

Returns:

Configured Bokeh figure object.

Raises:
  • ImportError – If Bokeh is not installed or if SMILES hover depictions are requested without RDKit.

  • ValueError – If the heatmap palette name is unsupported.

Return type:

Any

psma.project module

Projection utilities for mapping test points into reference embedding space.

psma.project.project_test_coordinates(d_train: ndarray, d_test_train: ndarray, c_train: ndarray, *, return_diagnostics: Literal[False] = False) → ndarray
psma.project.project_test_coordinates(d_train: ndarray, d_test_train: ndarray, c_train: ndarray, *, return_diagnostics: Literal[True]) → tuple[ndarray, ProjectionDiagnostics]

Project test distances into the 2D embedding using a pseudoinverse map.

Parameters:
  • d_train (ndarray) – Square training distance matrix.

  • d_test_train (ndarray) – Rectangular test-vs-train distance block.

  • c_train (ndarray) – Embedded training coordinates.

  • return_diagnostics (bool) – Whether to return numerical stability diagnostics alongside projected coordinates.

Returns:

Projected test coordinates, optionally with diagnostics.

Raises:

ValueError – If projection inputs are malformed.
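
One plausible reading of a pseudoinverse map is a least-squares linear map from training distance rows to training coordinates, which is then applied to the test rows. The sketch below follows that reading only; it is an assumption, not the package's confirmed method, and the name project_sketch and the plain-dict diagnostics are hypothetical.

```python
import numpy as np


def project_sketch(d_train, d_test_train, c_train):
    """Map test distance rows into the embedding via a least-squares linear map."""
    d_train = np.asarray(d_train, dtype=float)
    d_test_train = np.asarray(d_test_train, dtype=float)
    c_train = np.asarray(c_train, dtype=float)
    if d_train.ndim != 2 or d_train.shape[0] != d_train.shape[1]:
        raise ValueError("d_train must be a square matrix")
    if d_test_train.ndim != 2 or d_test_train.shape[1] != d_train.shape[0]:
        raise ValueError("d_test_train must have one column per training sample")
    if c_train.ndim != 2 or c_train.shape[0] != d_train.shape[0]:
        raise ValueError("c_train must have one row per training sample")
    # Solve d_train @ w ~= c_train in the least-squares sense.
    w = np.linalg.pinv(d_train) @ c_train
    diagnostics = {
        "train_rank": int(np.linalg.matrix_rank(d_train)),
        "train_condition_number": float(np.linalg.cond(d_train)),
    }
    return d_test_train @ w, diagnostics
```

Under this reading, a training compound passed through the map lands on its own embedded coordinates whenever d_train is invertible, and a large condition number signals that the map is numerically fragile.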

psma.psma module

Posterior surface calculation functions.

psma.psma.compute_posterior(pos_density, neg_density, *, prior_pos, prior_neg, eps=1e-12)

Compute posterior probabilities from class densities and priors.

Parameters:
  • pos_density (ndarray) – Positive-class density values.

  • neg_density (ndarray) – Negative-class density values.

  • prior_pos (float) – Positive-class prior probability.

  • prior_neg (float) – Negative-class prior probability.

  • eps (float) – Numerical stabilizer added to the denominator.

Returns:

Posterior positive-class probabilities.

Return type:

ndarray
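
The computation described above corresponds to two-class Bayes' rule with eps added to the denominator. A minimal NumPy sketch, with the function name posterior_sketch as an assumption:

```python
import numpy as np


def posterior_sketch(pos_density, neg_density, *, prior_pos, prior_neg, eps=1e-12):
    """Two-class Bayes rule with a stabilized denominator."""
    pos = prior_pos * np.asarray(pos_density, dtype=float)
    neg = prior_neg * np.asarray(neg_density, dtype=float)
    # eps keeps the ratio finite where both weighted densities vanish.
    return pos / (pos + neg + eps)
```

In this form, grid cells far from all training points, where both densities are effectively zero, evaluate to 0 rather than NaN.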

psma.results module

Public typed result models for PSMA workflow outputs.

class psma.results.PsmaParams(y_col, label_threshold, label_direction, similarity_method, embed_method, distance_k, tsne_perplexity, tsne_max_iter, grid_n, grid_lims_scale, eps, split_method, test_fraction, random_state, butina_distance_cutoff, mcc_prob_threshold, morgan_radius, morgan_n_bits)

Bases: object

Stable parameter payload describing one PSMA workflow run.

Parameters:
  • y_col (str)

  • label_threshold (float)

  • label_direction (str)

  • similarity_method (str)

  • embed_method (str)

  • distance_k (float)

  • tsne_perplexity (float)

  • tsne_max_iter (int)

  • grid_n (int)

  • grid_lims_scale (float)

  • eps (float)

  • split_method (str)

  • test_fraction (float)

  • random_state (int)

  • butina_distance_cutoff (float)

  • mcc_prob_threshold (float)

  • morgan_radius (int)

  • morgan_n_bits (int)

y_col: str
label_threshold: float
label_direction: str
similarity_method: str
embed_method: str
distance_k: float
tsne_perplexity: float
tsne_max_iter: int
grid_n: int
grid_lims_scale: float
eps: float
split_method: str
test_fraction: float
random_state: int
butina_distance_cutoff: float
mcc_prob_threshold: float
morgan_radius: int
morgan_n_bits: int

class psma.results.PsmaIndices(train, test)

Bases: object

Train and test split indices for one PSMA workflow run.

Parameters:
  • train (ndarray)

  • test (ndarray)

train: ndarray
test: ndarray

class psma.results.PsmaSimilarity(s_full, s_train, s_test_train)

Bases: object

Similarity matrices produced during the PSMA workflow.

Parameters:
  • s_full (ndarray)

  • s_train (ndarray)

  • s_test_train (ndarray)

s_full: ndarray
s_train: ndarray
s_test_train: ndarray

class psma.results.PsmaDistance(d_train, d_test_train)

Bases: object

Distance matrices derived from workflow similarity matrices.

Parameters:
  • d_train (ndarray)

  • d_test_train (ndarray)

d_train: ndarray
d_test_train: ndarray

class psma.results.PsmaCoordinates(c_train, c_test)

Bases: object

Embedded train and projected test coordinates.

Parameters:
  • c_train (ndarray)

  • c_test (ndarray)

c_train: ndarray
c_test: ndarray

class psma.results.PsmaLabels(y_train_bin, y_test_bin, prior_pos, prior_neg)

Bases: object

Thresholded labels and derived training priors.

Parameters:
  • y_train_bin (ndarray)

  • y_test_bin (ndarray)

  • prior_pos (float)

  • prior_neg (float)

y_train_bin: ndarray
y_test_bin: ndarray
prior_pos: float
prior_neg: float

class psma.results.PsmaGrid(grid_x, grid_y, pos_density_z, neg_density_z, posterior_z)

Bases: object

Evaluated density and posterior surfaces on the PSMA grid.

Parameters:
  • grid_x (ndarray)

  • grid_y (ndarray)

  • pos_density_z (ndarray)

  • neg_density_z (ndarray)

  • posterior_z (ndarray)

grid_x: ndarray
grid_y: ndarray
pos_density_z: ndarray
neg_density_z: ndarray
posterior_z: ndarray

class psma.results.ProjectionDiagnostics(train_rank, train_condition_number, is_ill_conditioned, diagnostics)

Bases: object

Numerical diagnostics for pseudoinverse-based projection.

Parameters:
  • train_rank (int)

  • train_condition_number (float)

  • is_ill_conditioned (bool)

  • diagnostics (list[str])

train_rank: int
train_condition_number: float
is_ill_conditioned: bool
diagnostics: list[str]

class psma.results.PsmaMetrics(auc, auc_defined, mcc, threshold, confusion, metric_notes)

Bases: object

Summary metrics for PSMA test-set predictions.

Parameters:
  • auc (float | None)

  • auc_defined (bool)

  • mcc (float)

  • threshold (float)

  • confusion (dict[str, int])

  • metric_notes (dict[str, str])

auc: float | None
auc_defined: bool
mcc: float
threshold: float
confusion: dict[str, int]
metric_notes: dict[str, str]

class psma.results.PsmaFrames(train, test)

Bases: object

Notebook-friendly output frames for train and test compounds.

Parameters:
  • train (DataFrame)

  • test (DataFrame)

train: DataFrame
test: DataFrame

class psma.results.PsmaResult(indices, similarity, distance, coords, labels, grid, projection, prob_test, metrics, params, artifacts_dir, frames)

Bases: object

Top-level typed result returned by the PSMA workflow.

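Parameters:
  • indices (PsmaIndices)

  • similarity (PsmaSimilarity)

  • distance (PsmaDistance)

  • coords (PsmaCoordinates)

  • labels (PsmaLabels)

  • grid (PsmaGrid)

  • projection (ProjectionDiagnostics)

  • prob_test (ndarray)

  • metrics (PsmaMetrics)

  • params (PsmaParams)

  • artifacts_dir (Path | None)

  • frames (PsmaFrames)

indices: PsmaIndices
similarity: PsmaSimilarity
distance: PsmaDistance
coords: PsmaCoordinates
labels: PsmaLabels
grid: PsmaGrid
projection: ProjectionDiagnostics
prob_test: ndarray
metrics: PsmaMetrics
params: PsmaParams
artifacts_dir: Path | None
frames: PsmaFrames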

psma.score module

Direct test-point posterior scoring using fitted KDEs.

psma.score.score_test_points(c_test, *, kde_pos, kde_neg, hx, hy, prior_pos, prior_neg, eps=1e-12)

Score test points with Bayesian posterior probabilities.

Parameters:
  • c_test (ndarray) – Projected test coordinates.

  • kde_pos (KernelDensity) – KDE fitted on positive-class training coordinates.

  • kde_neg (KernelDensity) – KDE fitted on negative-class training coordinates.

  • hx (float) – Horizontal bandwidth used to scale coordinates.

  • hy (float) – Vertical bandwidth used to scale coordinates.

  • prior_pos (float) – Positive-class prior probability.

  • prior_neg (float) – Negative-class prior probability.

  • eps (float) – Numerical stabilizer added to the denominator.

Returns:

Posterior positive-class probabilities for the test points.

Return type:

ndarray

psma.similarity module

Similarity matrix construction backends.

psma.similarity.build_similarity_from_triples(*, n_samples, triples_df, mol_id_to_index, molid1_col='molid1', molid2_col='molid2', score_col='Sscore')

Build a full similarity matrix from a sparse triples table.

Parameters:
  • n_samples (int) – Number of samples expected in the full matrix.

  • triples_df (DataFrame) – Sparse triples dataframe containing pair scores.

  • mol_id_to_index (dict[str, int]) – Mapping from molecule id to matrix row index.

  • molid1_col (str) – Name of the first molecule id column.

  • molid2_col (str) – Name of the second molecule id column.

  • score_col (str) – Name of the similarity score column.

Returns:

Symmetric similarity matrix with unit diagonal.

Return type:

ndarray
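
The symmetric fill with a unit diagonal can be sketched as follows. To stay self-contained, the sketch takes plain (id1, id2, score) tuples instead of a dataframe; the name similarity_from_triples_sketch and the choice of 0.0 for pairs absent from the table are assumptions.

```python
import numpy as np


def similarity_from_triples_sketch(n_samples, triples, mol_id_to_index):
    """Fill a symmetric matrix from (id1, id2, score) rows; diagonal is 1."""
    # Assumption: pairs that never appear in the triples default to 0.0.
    s = np.zeros((n_samples, n_samples), dtype=float)
    for id1, id2, score in triples:
        i, j = mol_id_to_index[id1], mol_id_to_index[id2]
        s[i, j] = s[j, i] = float(score)  # mirror each score across the diagonal
    np.fill_diagonal(s, 1.0)
    return s
```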

psma.similarity.build_similarity_rdkit_morgan(*, smiles=None, fps=None, radius=2, n_bits=2048)

Build a full Tanimoto similarity matrix using Morgan fingerprints.

Parameters:
  • smiles (list[str] | None) – Optional SMILES strings used to generate fingerprints.

  • fps (list[Any] | None) – Optional precomputed fingerprints.

  • radius (int) – Morgan fingerprint radius.

  • n_bits (int) – Morgan fingerprint size.

Returns:

Full pairwise Tanimoto similarity matrix.

Raises:
  • ValueError – If neither SMILES nor fingerprints are provided, or if an invalid SMILES string is encountered.

  • ImportError – If RDKit is required but not installed.

Return type:

ndarray
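
Tanimoto similarity on bit fingerprints is the size of the intersection of on-bits divided by the size of their union. The sketch below uses plain Python sets of on-bit indices in place of RDKit fingerprint objects; the function names and the value returned for two empty fingerprints are assumptions.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two sets of on-bit indices."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def tanimoto_matrix(fps: List[Set[int]]) -> List[List[float]]:
    """Full symmetric pairwise similarity matrix."""
    return [[tanimoto(x, y) for y in fps] for x in fps]
```

For example, the fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct on-bits, giving a similarity of 0.5.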

psma.similarity.build_similarity_embedding_cosine(embeddings, *, map_to_unit=True)

Build a similarity matrix from cosine similarities of embeddings.

Parameters:
  • embeddings (ndarray) – Dense embedding matrix with one row per sample.

  • map_to_unit (bool) – Whether to map cosine similarity from [-1, 1] to [0, 1].

Returns:

Pairwise similarity matrix clipped to [0, 1].

Return type:

ndarray
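
A NumPy sketch of the cosine backend is given below. The linear rescaling (s + 1) / 2 is the most direct way to map [-1, 1] onto [0, 1], but it is an assumption about this package, as are the function name cosine_similarity_sketch and the zero-vector guard.

```python
import numpy as np


def cosine_similarity_sketch(embeddings, *, map_to_unit=True):
    """Pairwise cosine similarity, optionally rescaled, clipped to [0, 1]."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    s = unit @ unit.T
    if map_to_unit:
        s = (s + 1.0) / 2.0  # assumed mapping: [-1, 1] -> [0, 1]
    return np.clip(s, 0.0, 1.0)
```

With map_to_unit=False, negative cosines are simply clipped to 0, so anti-correlated and orthogonal embeddings become indistinguishable.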

psma.split module

Train/test splitting utilities.

psma.split.make_train_test_split(n_samples, *, split_method='random', test_fraction=0.2, random_state=42, similarity_matrix=None, butina_distance_cutoff=0.4)

Create train/test index arrays for the selected split strategy.

Parameters:
  • n_samples (int) – Total number of samples available for splitting.

  • split_method (str) – Split strategy, either 'random' or 'butina'.

  • test_fraction (float) – Fraction of samples assigned to the test split.

  • random_state (int) – Random seed used by the splitter.

  • similarity_matrix (ndarray | None) – Full similarity matrix required by Butina splitting.

  • butina_distance_cutoff (float) – Distance threshold used by Butina clustering.

Returns:

Sorted training and test index arrays.

Raises:

ValueError – If split configuration is unsupported or cannot produce a non-empty train/test split.

Return type:

tuple[ndarray, ndarray]
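
The 'random' strategy can be sketched as a seeded permutation followed by sorting. The name random_split_sketch and the rounding rule for the test-set size are assumptions, and the Butina strategy, which assigns whole clusters to one side of the split, is not covered here.

```python
import numpy as np


def random_split_sketch(n_samples, *, test_fraction=0.2, random_state=42):
    """Shuffle indices with a seeded generator; return sorted (train, test)."""
    # Assumed rounding rule for the number of test samples.
    n_test = int(round(n_samples * test_fraction))
    if n_test < 1 or n_test >= n_samples:
        raise ValueError("split would leave an empty train or test set")
    rng = np.random.default_rng(random_state)
    perm = rng.permutation(n_samples)
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])
```

Fixing random_state makes the split reproducible, which matters because every downstream quantity (embedding, priors, metrics) depends on which compounds land in the training set.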