Simi Search

Simi Search: a ligand-similarity benchmark for LIT-PCBA virtual screening.

Simi Search provides a focused framework for preparing LIT-PCBA, running ligand-only retrieval, and evaluating early enrichment. The benchmark deliberately excludes docking, protein-ligand interaction fingerprints, random forest scores, CNN scores, and consensus scoring so the effect of ligand similarity can be measured directly.

Abstract

Ligand-based virtual screening is often used when structural models, docking poses, or target-specific scoring functions are unavailable or expensive to produce. However, the performance of similarity search can be overstated when benchmarks contain analogue bias or when evaluation focuses on global ranking statistics rather than early enrichment. Simi Search addresses both problems by using the AVE-unbiased LIT-PCBA benchmark, by ranking validation compounds according to their maximum Tanimoto similarity to active training ligands, and by treating early-enrichment metrics as the primary readout. The package standardizes dataset preparation, scoring, and evaluation so that future molecular fingerprints can be compared against a transparent hashed-SMILES baseline.

Framework Overview

1. LIT-PCBA: Download AVE-unbiased target splits from the official archive.
2. Normalize: Convert active and inactive SMILES files into target CSV files.
3. Search: Rank validation compounds by max similarity to train actives.
4. Evaluate: Report EF1%, EF5%, BEDROC20, ROC_AUC, and PR_AUC.
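
To make the Evaluate step concrete, here is a minimal sketch of an enrichment factor calculation in plain Python. It is not the package's own code: the function name `enrichment_factor`, the ceiling-based choice of how many compounds count as the "top x%", and the tie handling (input order) are all assumptions made for illustration. EF is taken here as the hit rate in the top-ranked slice divided by the hit rate in the whole screening set.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor for the top `fraction` of a ranked list.

    scores: one similarity score per compound (higher means ranked earlier).
    labels: 1 for active, 0 for inactive, in the same order as `scores`.
    NOTE: hypothetical helper; rounding and tie-breaking are assumptions.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")

    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        return float("nan")  # EF is undefined without any actives

    # Number of compounds in the top slice (at least one); the small epsilon
    # guards against floating-point overshoot before taking the ceiling.
    n_top = max(1, math.ceil(fraction * n_total - 1e-9))

    # Sort indices by descending score; Python's sort is stable, so ties
    # keep their input order.
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    hits_top = sum(labels[i] for i in order[:n_top])

    return (hits_top / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(enrichment_factor(toy_scores, toy_labels, fraction=0.5))
```

Under this convention EF1% and EF5% correspond to `fraction=0.01` and `fraction=0.05`. Other implementations may round the slice size differently, so values on small screening sets can differ slightly between tools.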

Benchmark Design

Benchmark component	Implementation	Scientific rationale
Dataset	LIT-PCBA AVE-unbiased	HTS-derived targets with reduced analogue bias.
Query/reference set	Active training molecules	Known ligands define the target-specific similarity query set.
Screening set	Validation molecules	Held-out compounds are ranked without using validation labels.
Similarity score	Maximum Tanimoto to any active training molecule	Nearest-active retrieval baseline for ligand-based screening.
Fingerprints	Hashed SMILES n-grams and optional RDKit Morgan/ECFP	Lightweight default plus chemistry-aware RDKit backend.
Primary metrics	EF1%, EF5%, BEDROC20	Early-recognition metrics aligned with practical virtual screening.
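
The similarity score and default fingerprint rows above can be illustrated with a short, dependency-free sketch. Everything specific in it is an assumption rather than the package's actual implementation: the function names, the character n-gram length of 3, the 2048-bit folding, and the use of CRC32 as the hash are chosen only to show the idea of a hashed SMILES n-gram fingerprint combined with nearest-active (maximum Tanimoto) ranking.

```python
import zlib


def hashed_ngram_fingerprint(smiles, n=3, n_bits=2048):
    """Return the set of on-bit indices for hashed character n-grams.

    NOTE: hypothetical helper; n, n_bits, and CRC32 are illustrative choices.
    """
    if not smiles:
        return set()
    if len(smiles) < n:
        grams = [smiles]  # short strings contribute a single gram
    else:
        grams = [smiles[i:i + n] for i in range(len(smiles) - n + 1)]
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return {zlib.crc32(g.encode("utf-8")) % n_bits for g in grams}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_max_similarity(train_actives, validation):
    """Score each validation SMILES by its maximum Tanimoto similarity to
    any active training SMILES, then sort by descending score.
    Validation labels are never used.
    """
    reference_fps = [hashed_ngram_fingerprint(s) for s in train_actives]
    scored = []
    for smi in validation:
        fp = hashed_ngram_fingerprint(smi)
        score = max((tanimoto(fp, ref) for ref in reference_fps), default=0.0)
        scored.append((smi, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for smi, score in rank_by_max_similarity(["CCO"], ["c1ccccc1", "CCO"]):
        print(f"{smi}\t{score:.3f}")
```

Because the fingerprint works on raw SMILES characters, two different SMILES strings for the same molecule can receive different fingerprints. That limitation is one reason a chemistry-aware backend such as RDKit Morgan/ECFP is useful as an alternative.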
