Methods

Simi Search evaluates ligand-based retrieval through three linked stages: dataset standardization, similarity scoring, and enrichment analysis.

Dataset Preparation

The downloader converts the official LIT-PCBA SMILES files into normalized CSV files with the following columns:

ID,SMILES,Real_Class,Label,Target,Split

Label=1 denotes an active molecule and Label=0 denotes an inactive molecule. The processed directory follows this layout:

data/processed/lit_pcba_ave/
  PPARG/
    PPARG-LIT-PCBA-train.csv
    PPARG-LIT-PCBA-validation.csv
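
As a rough illustration of how the normalized files can be consumed, the sketch below parses the documented column layout with Python's standard csv module and pulls out the active molecules. The helper name read_molecules and the sample rows (IDs, SMILES, and Real_Class values) are made up for this example and are not part of Simi Search.

```python
import csv
import io

# Hypothetical sample rows in the documented column order. The IDs, SMILES
# strings, and Real_Class values are invented for illustration only.
SAMPLE_CSV = """ID,SMILES,Real_Class,Label,Target,Split
mol-1,CCO,active,1,PPARG,train
mol-2,CCN,inactive,0,PPARG,train
mol-3,c1ccccc1,inactive,0,PPARG,train
"""


def read_molecules(handle):
    """Return (smiles, label) pairs from an open processed-CSV handle."""
    reader = csv.DictReader(handle)
    return [(row["SMILES"], int(row["Label"])) for row in reader]


molecules = read_molecules(io.StringIO(SAMPLE_CSV))
# Label=1 marks an active molecule, Label=0 an inactive one.
actives = [smiles for smiles, label in molecules if label == 1]
print(len(molecules), actives)
```

For a file on disk, the same helper should work with a handle such as open(path, newline=""), where path follows the directory layout shown above.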

Similarity Retrieval

For each target, the benchmark follows a held-out retrieval protocol:

  1. Select active molecules from the training split.

  2. Read all molecules from the validation split.

  3. Fingerprint every molecule from its SMILES string.

  4. Compute Tanimoto similarity between each validation molecule and every active training molecule.

  5. Assign each validation molecule its maximum active-query similarity.

  6. Rank validation molecules by decreasing score.

  7. Compute enrichment and ranking metrics from validation labels.
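
Steps 4 to 6 can be sketched in a few lines, assuming fingerprints are plain Python integers used as bitsets. The function names below are illustrative and may not match the Simi Search internals; returning 0.0 for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity between two fingerprints stored as integer bitsets."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


def rank_by_max_similarity(query_bitsets, library_bitsets):
    """Score each library molecule by its best query match, then rank.

    Assumes at least one query fingerprint is supplied.
    """
    scores = [
        max(tanimoto(fp, query) for query in query_bitsets)
        for fp in library_bitsets
    ]
    # Indices of library molecules, highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy bitsets standing in for training actives and validation molecules.
queries = [0b11110000, 0b00001111]
library = [0b11110000, 0b11000000, 0b00111100, 0b10000001]
order, scores = rank_by_max_similarity(queries, library)
print(order, scores)
```

Taking the maximum over all active queries means a validation molecule only needs to resemble one known active to rank highly.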

Default Fingerprint

Simi Search supports two fingerprint backends:

HashedSmilesFingerprint

Builds deterministic integer bitsets from hashed SMILES n-grams and needs no external dependencies. This is the lightweight baseline.

RdkitMorganFingerprint

Computes RDKit Morgan/ECFP fingerprints and converts them into the same integer bitset representation used by the search engine. This backend requires RDKit; install it with pip install "simi-search[rdkit]" or conda install -c conda-forge rdkit.

from simi_search.fingerprints import HashedSmilesFingerprint, RdkitMorganFingerprint

# Lightweight baseline: no external dependencies
fingerprinter = HashedSmilesFingerprint(n_bits=2048)
bitset = fingerprinter.fingerprint("CCO")  # integer bitset for ethanol

# RDKit backend: requires the optional rdkit dependency
rdkit_fingerprinter = RdkitMorganFingerprint(radius=2, n_bits=2048)
rdkit_bitset = rdkit_fingerprinter.fingerprint("CCO")

Extension Point

Future molecular representations should implement the same contract:

class Fingerprinter:
    def fingerprint(self, smiles: str) -> int:
        ...

This keeps benchmark orchestration unchanged when adding learned molecular embeddings or other molecular fingerprints.
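
To make the contract concrete, here is a minimal sketch of a custom fingerprinter that satisfies it. CharNgramFingerprint is a hypothetical class written for this example: it hashes character n-grams with SHA-256, which is not necessarily how HashedSmilesFingerprint works internally.

```python
import hashlib


class CharNgramFingerprint:
    """Hypothetical fingerprinter that hashes character n-grams of a SMILES string."""

    def __init__(self, n: int = 3, n_bits: int = 2048):
        self.n = n
        self.n_bits = n_bits

    def fingerprint(self, smiles: str) -> int:
        bitset = 0
        # Strings shorter than n contribute one n-gram: the whole string.
        for start in range(max(1, len(smiles) - self.n + 1)):
            ngram = smiles[start:start + self.n]
            digest = hashlib.sha256(ngram.encode("utf-8")).digest()
            # Map the hash onto one of n_bits positions and set that bit.
            position = int.from_bytes(digest[:8], "big") % self.n_bits
            bitset |= 1 << position
        return bitset


custom = CharNgramFingerprint(n=3, n_bits=2048)
custom_bitset = custom.fingerprint("CCOCC")
print(bin(custom_bitset).count("1"))
```

Because the class exposes fingerprint(smiles) and returns an int, it should be usable wherever the two built-in backends are accepted, assuming the search engine relies only on that method.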

Metrics

| Metric | Interpretation |
| --- | --- |
| EF1% | Enrichment factor in the top one percent of the ranked library. |
| EF5% | Enrichment factor in the top five percent of the ranked library. |
| BEDROC20 | Early-recognition score with alpha 20. |
| ROC_AUC | Overall pairwise ranking quality. |
| PR_AUC | Precision-recall area for imbalanced active/inactive screens. |
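
For reference, the enrichment factor is commonly defined as the hit rate in the top fraction of the ranking divided by the hit rate of the whole library. The symbols below are introduced for this note only, and how Simi Search rounds the size of the top fraction is not specified here.

```latex
\mathrm{EF}_{x\%} = \frac{a_{x} / n_{x}}{A / N}
```

Here N is the number of ranked molecules, A is the number of actives among them, n_x is the number of molecules in the top x percent, and a_x is the number of actives among those n_x. A value of 1 corresponds to random selection, and larger values indicate stronger early enrichment.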

Interpretation

Early enrichment should drive method selection. A method that improves ROC_AUC but fails to improve EF1% or BEDROC20 may still be weak for practical virtual screening, where the top-ranked compounds matter most.
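
A toy comparison illustrates the point. The sketch below builds two synthetic rankings of 1000 molecules with 10 actives each; the helper functions are written for this example and assume rankings without ties. Ranking B reaches a ROC AUC of about 0.99 against 0.79 for ranking A, yet its EF1% is 0 while ranking A's is 30.

```python
def enrichment_factor(ranked_labels, fraction):
    """EF for the top `fraction` of a best-first list of 0/1 labels."""
    total = len(ranked_labels)
    n_top = max(1, round(total * fraction))
    hits_top = sum(ranked_labels[:n_top])
    hits_all = sum(ranked_labels)
    # (hits_top / n_top) / (hits_all / total), arranged to avoid rounding noise.
    return (hits_top * total) / (n_top * hits_all)


def roc_auc(ranked_labels):
    """Fraction of (active, inactive) pairs with the active ranked first."""
    n_actives = sum(ranked_labels)
    n_inactives = len(ranked_labels) - n_actives
    inactives_seen = 0
    misordered_pairs = 0
    for label in ranked_labels:
        if label == 1:
            # Every inactive already seen outranks this active.
            misordered_pairs += inactives_seen
        else:
            inactives_seen += 1
    return 1.0 - misordered_pairs / (n_actives * n_inactives)


def make_ranking(active_ranks, size=1000):
    """Build a 0/1 label list with actives at the given 1-based ranks."""
    active_ranks = set(active_ranks)
    return [1 if rank in active_ranks else 0 for rank in range(1, size + 1)]


# Ranking A: three actives at the very top, seven far down the list.
ranking_a = make_ranking([1, 2, 3] + list(range(301, 308)))
# Ranking B: all ten actives just outside the top one percent.
ranking_b = make_ranking(range(11, 21))

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(name, enrichment_factor(ranking, 0.01), round(roc_auc(ranking), 3))
```

If only the top ten compounds can be purchased and tested, ranking B delivers no actives at all despite its higher ROC AUC.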