Methods
Simi Search evaluates ligand-based retrieval through three linked stages: dataset standardization, similarity scoring, and enrichment analysis.
Dataset Preparation
The downloader converts official LIT-PCBA SMILES files into normalized CSV files:
ID,SMILES,Real_Class,Label,Target,Split
Label=1 denotes an active molecule and Label=0 denotes an inactive
molecule. The processed directory follows this layout:
data/processed/lit_pcba_ave/
PPARG/
PPARG-LIT-PCBA-train.csv
PPARG-LIT-PCBA-validation.csv
Similarity Retrieval
For each target, the benchmark follows a held-out retrieval protocol:
Select active molecules from the training split.
Read all molecules from the validation split.
Fingerprint every molecule from its SMILES string.
Compute Tanimoto similarity between each validation molecule and every active training molecule.
Assign each validation molecule its maximum active-query similarity.
Rank validation molecules by decreasing score.
Compute enrichment and ranking metrics from validation labels.
Default Fingerprint
Simi Search supports two fingerprint backends:
HashedSmilesFingerprintDependency-free deterministic integer bitsets from hashed SMILES n-grams. This is the lightweight baseline.
RdkitMorganFingerprintRDKit Morgan/ECFP fingerprints converted into the same integer bitset representation used by the search engine. Install with
pip install "simi-search[rdkit]"orconda install -c conda-forge rdkit.
from simi_search.fingerprints import HashedSmilesFingerprint, RdkitMorganFingerprint
fingerprinter = HashedSmilesFingerprint(n_bits=2048)
bitset = fingerprinter.fingerprint("CCO")
rdkit_fingerprinter = RdkitMorganFingerprint(radius=2, n_bits=2048)
rdkit_bitset = rdkit_fingerprinter.fingerprint("CCO")
Extension Point
Future molecular representations should implement the same contract:
class Fingerprinter:
def fingerprint(self, smiles: str) -> int:
...
This keeps benchmark orchestration unchanged when adding learned molecular embeddings or other molecular fingerprints.
Metrics
Metric |
Interpretation |
|---|---|
|
Enrichment factor in the top one percent of the ranked library. |
|
Enrichment factor in the top five percent of the ranked library. |
|
Early-recognition score with alpha 20. |
|
Overall pairwise ranking quality. |
|
Precision-recall area for imbalanced active/inactive screens. |
Interpretation
Early enrichment should drive method selection. A method that improves
ROC_AUC but fails to improve EF1% or BEDROC20 may still be weak for
practical virtual screening, where the top-ranked compounds matter most.