Simi Search
Simi Search: a ligand-similarity benchmark for LIT-PCBA virtual screening.
Simi Search provides a focused framework for preparing LIT-PCBA, running ligand-only retrieval, and evaluating early enrichment. The benchmark deliberately excludes docking, protein-ligand interaction fingerprints, random forest scores, CNN scores, and consensus scoring so the effect of ligand similarity can be measured directly.
Abstract
Ligand-based virtual screening is often used when structural models, docking poses, or target-specific scoring functions are unavailable or expensive to produce. However, the apparent performance of similarity search can be overstated when benchmarks contain analogue bias or when evaluation focuses on global ranking statistics rather than early enrichment. Simi Search addresses both problems by using the AVE-unbiased LIT-PCBA benchmark and by ranking validation compounds according to their maximum Tanimoto similarity to active training ligands. The package standardizes dataset preparation, scoring, and evaluation so that future molecular fingerprints can be compared against a transparent hashed-SMILES baseline.
Framework Overview
1. Download the AVE-unbiased target splits from the official archive.
2. Convert the active and inactive SMILES files into per-target CSV files.
3. Rank validation compounds by their maximum Tanimoto similarity to the active training molecules.
4. Report EF1%, EF5%, BEDROC20, ROC_AUC, and PR_AUC.
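The ranking step can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: the function names (`ngram_fingerprint`, `tanimoto`, `rank_by_max_similarity`), the n-gram length limit, the bit count, and the use of `zlib.crc32` as the hash are all assumptions chosen for illustration.

```python
import zlib


def ngram_fingerprint(smiles, n_max=3, n_bits=2048):
    """Hash every character n-gram (length 1..n_max) of a SMILES string to a bit index.

    Hypothetical stand-in for a hashed SMILES n-gram fingerprint; the real
    tokenization, hash function, and bit count may differ.
    """
    bits = set()
    for n in range(1, n_max + 1):
        for i in range(len(smiles) - n + 1):
            gram = smiles[i:i + n]
            bits.add(zlib.crc32(gram.encode("utf-8")) % n_bits)
    return frozenset(bits)


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bits: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_max_similarity(train_actives, validation):
    """Score each validation SMILES by its nearest active and sort descending.

    Validation labels are never consulted; only training actives are used.
    """
    references = [ngram_fingerprint(s) for s in train_actives]
    scored = []
    for smiles in validation:
        fp = ngram_fingerprint(smiles)
        score = max((tanimoto(fp, ref) for ref in references), default=0.0)
        scored.append((smiles, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy usage with made-up molecules (not LIT-PCBA data).
ranking = rank_by_max_similarity(
    train_actives=["CCO", "c1ccccc1O"],
    validation=["CCCCCC", "CCO", "c1ccccc1N"],
)
for smiles, score in ranking:
    print(f"{smiles}\t{score:.3f}")
```

Taking the maximum over training actives, rather than the mean, makes this a nearest-neighbor retrieval: a validation compound scores highly if it resembles any single known active.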
Benchmark Design
| Benchmark component | Implementation | Scientific rationale |
|---|---|---|
| Dataset | LIT-PCBA AVE-unbiased | HTS-derived targets with reduced analogue bias. |
| Query/reference set | Active training molecules | Known ligands define the target-specific similarity query set. |
| Screening set | Validation molecules | Held-out compounds are ranked without using validation labels. |
| Similarity score | Maximum Tanimoto to any active training molecule | Nearest-active retrieval baseline for ligand-based screening. |
| Fingerprints | Hashed SMILES n-grams and optional RDKit Morgan/ECFP | Lightweight default plus chemistry-aware RDKit backend. |
| Primary metrics | EF1%, EF5%, BEDROC20 | Early-recognition metrics aligned with practical virtual screening. |
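The primary metrics in the table can be sketched in a few lines of Python. This is an illustrative sketch, not the package's evaluation code: the function names are hypothetical, the top-fraction cutoff is rounded to the nearest integer (an assumed convention), and BEDROC20 is assumed to mean BEDROC with alpha = 20, written here as a min-max normalization of the robust initial enhancement (RIE).

```python
import math


def enrichment_factor(labels_ranked, fraction):
    """EF at a top fraction: hit rate in the top slice divided by the overall hit rate.

    labels_ranked holds 1 (active) or 0 (inactive), ordered best score first.
    The cutoff rounding below is an assumption; other conventions exist.
    """
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_total == 0 or n_actives == 0:
        return float("nan")
    n_top = max(1, int(round(fraction * n_total)))
    hits = sum(labels_ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


def bedroc(labels_ranked, alpha=20.0):
    """BEDROC as (RIE - RIE_min) / (RIE_max - RIE_min), which lies in [0, 1]."""
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_actives == 0 or n_actives == n_total:
        return float("nan")  # undefined without both classes
    ratio = n_actives / n_total
    # Exponentially weighted sum over the 1-based ranks of the actives.
    weighted = sum(
        math.exp(-alpha * rank / n_total)
        for rank, label in enumerate(labels_ranked, start=1)
        if label
    )
    # Expected per-active weight when actives are placed uniformly at random.
    random_weight = (1.0 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1.0))
    rie = weighted / (n_actives * random_weight)
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Toy usage: 5 actives among 100 compounds, all ranked at the top.
labels = [1] * 5 + [0] * 95
print(enrichment_factor(labels, 0.01), enrichment_factor(labels, 0.05), bedroc(labels))
```

Both metrics reward actives that appear early in the ranked list, which is why they are preferred here over global statistics such as ROC_AUC when the practical question is which compounds to test first.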
Evaluation Pillars
Explore the Documentation
Getting Started
Install the package, prepare LIT-PCBA, and run a target benchmark.
Methods
Understand dataset handling, fingerprinting, retrieval, and metrics.
Release
Publish PyPI, conda, and Docker artifacts from GitHub Actions.