Simi Search

Simi Search: a ligand-similarity benchmark for LIT-PCBA virtual screening.

Simi Search provides a focused framework for preparing LIT-PCBA, running ligand-only retrieval, and evaluating early enrichment. The benchmark deliberately excludes docking, protein-ligand interaction fingerprints, random forest scores, CNN scores, and consensus scoring so the effect of ligand similarity can be measured directly.

Abstract

Ligand-based virtual screening is often used when structural models, docking poses, or target-specific scoring functions are unavailable or expensive to produce. However, the apparent performance of similarity search can be overstated when benchmarks contain analogue bias or when evaluation focuses on global ranking statistics rather than early enrichment. Simi Search addresses both problems by using the AVE-unbiased LIT-PCBA benchmark, ranking validation compounds by their maximum Tanimoto similarity to active training ligands, and reporting early-enrichment metrics. The package standardizes dataset preparation, scoring, and evaluation so that future molecular fingerprints can be compared against a transparent hashed-SMILES baseline.
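
The nearest-active ranking described in the abstract can be illustrated with a small sketch. It is not the package's implementation: the function names, the n-gram length of 3, the 2048-bit space, and the use of CRC32 as the hash are all assumptions chosen for illustration.

```python
import zlib


def hashed_ngram_fingerprint(smiles, n=3, n_bits=2048):
    """Hash every character n-gram of a SMILES string into a set of bit indices.

    n, n_bits, and the CRC32 hash are illustrative assumptions.
    """
    text = smiles.strip()
    if len(text) < n:
        grams = [text] if text else []
    else:
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    # zlib.crc32 is deterministic across runs, unlike the built-in hash() on str.
    return frozenset(zlib.crc32(g.encode("utf-8")) % n_bits for g in grams)


def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |A & B| / |A | B|."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_max_similarity(train_actives, validation):
    """Score each validation SMILES by its best match among the training actives."""
    references = [hashed_ngram_fingerprint(s) for s in train_actives]
    scored = []
    for smiles in validation:
        fp = hashed_ngram_fingerprint(smiles)
        score = max((tanimoto(fp, ref) for ref in references), default=0.0)
        scored.append((smiles, score))
    # Highest similarity first; validation labels are never consulted.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A call such as `rank_by_max_similarity(["CCO"], ["c1ccccc1", "CCO"])` places `"CCO"` first, because a compound identical to a training active has a Tanimoto similarity of 1.0.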

Framework Overview

1. LIT-PCBA: Download AVE-unbiased target splits from the official archive.
2. Normalize: Convert active and inactive SMILES files into target CSV files.
3. Search: Rank validation compounds by max similarity to train actives.
4. Evaluate: Report EF1%, EF5%, BEDROC20, ROC_AUC, and PR_AUC.
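
As a rough illustration of the Normalize step, the sketch below merges active and inactive SMILES lines into one labelled CSV. The column names (`compound_id`, `smiles`, `label`), the function names, and the assumption that each input line holds a SMILES string followed by an optional identifier are hypothetical; the package's actual CSV schema may differ.

```python
import csv
import io


def parse_smi_lines(lines, label):
    """Turn whitespace-separated 'SMILES [id]' lines into labelled rows."""
    rows = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        compound_id = parts[1] if len(parts) > 1 else ""
        rows.append({"compound_id": compound_id, "smiles": parts[0], "label": label})
    return rows


def write_target_csv(active_lines, inactive_lines, handle):
    """Write actives (label 1) then inactives (label 0) to an open text handle."""
    rows = parse_smi_lines(active_lines, 1) + parse_smi_lines(inactive_lines, 0)
    writer = csv.DictWriter(handle, fieldnames=["compound_id", "smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Usage with in-memory data; real inputs would be opened .smi files.
buffer = io.StringIO()
n_rows = write_target_csv(["CCO 123"], ["c1ccccc1 456"], buffer)
```

Keeping the label in the same file as the SMILES makes it easy to score compounds first and attach labels only at evaluation time.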

Benchmark Design

| Benchmark component | Implementation | Scientific rationale |
|---|---|---|
| Dataset | LIT-PCBA AVE-unbiased | HTS-derived targets with reduced analogue bias. |
| Query/reference set | Active training molecules | Known ligands define the target-specific similarity query set. |
| Screening set | Validation molecules | Held-out compounds are ranked without using validation labels. |
| Similarity score | Maximum Tanimoto to any active training molecule | Nearest-active retrieval baseline for ligand-based screening. |
| Fingerprints | Hashed SMILES n-grams and optional RDKit Morgan/ECFP | Lightweight default plus chemistry-aware RDKit backend. |
| Primary metrics | EF1%, EF5%, BEDROC20 | Early-recognition metrics aligned with practical virtual screening. |
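
The primary metrics can be sketched from their standard definitions. The code below is illustrative rather than the package's own: the function names are hypothetical, the top-fraction cutoff is rounded to the nearest whole compound (an assumption), and BEDROC20 is assumed to mean BEDROC with α = 20. BEDROC is computed as the robust initial enhancement (RIE) rescaled between its minimum and maximum attainable values.

```python
import math


def rank_labels(scores, labels):
    """Return labels (1 = active, 0 = inactive) ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def enrichment_factor(ranked_labels, fraction):
    """Active rate in the top fraction divided by the active rate overall."""
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        return 0.0
    # Assumption: the cutoff is rounded to the nearest compound, minimum one.
    n_top = max(1, round(fraction * n_total))
    hits = sum(ranked_labels[:n_top])
    return (hits / n_top) / (n_actives / n_total)


def bedroc(ranked_labels, alpha=20.0):
    """BEDROC as RIE rescaled to [0, 1] between its worst and best values."""
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("BEDROC needs at least one active and one inactive")
    ratio = n_actives / n_total
    # Exponentially weighted sum over the 1-based ranks of the actives.
    weighted = sum(
        math.exp(-alpha * rank / n_total)
        for rank, label in enumerate(ranked_labels, start=1)
        if label
    )
    # Expected per-active weight under a uniformly random ranking.
    random_weight = (1.0 / n_total) * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = weighted / (n_actives * random_weight)
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)
```

For example, with 100 compounds of which 10 are active, a ranking that places an active first gives EF1% = (1/1) / (10/100) = 10, which is also the largest value EF1% can reach at that active rate.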

Evaluation Pillars

Similarity Retrieval

Fingerprint each SMILES string and rank validation molecules by their maximum Tanimoto similarity to any active training molecule.

Early Enrichment

Use EF1%, EF5%, and BEDROC20 as the primary virtual-screening readouts.

Extensible API

Use the hashed baseline, RDKit Morgan fingerprints, or custom fingerprinters.
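
One way to picture a pluggable fingerprinter is as a callable that maps a SMILES string to a set of on-bit indices. The sketch below assumes that interface; the `Fingerprinter` type, `character_set_fingerprint`, `morgan_fingerprint`, and `max_tanimoto` are hypothetical names, not the package's API. The Morgan variant imports RDKit lazily so the rest of the example runs without it, and it relies on `rdFingerprintGenerator.GetMorganGenerator`, which requires a reasonably recent RDKit release.

```python
from typing import Callable, FrozenSet, Iterable, Optional

# Hypothetical interface: SMILES -> set of on-bit indices, or None if unusable.
Fingerprinter = Callable[[str], Optional[FrozenSet[int]]]


def character_set_fingerprint(smiles: str) -> Optional[FrozenSet[int]]:
    """Toy custom fingerprinter: one feature per distinct character."""
    text = smiles.strip()
    return frozenset(ord(ch) for ch in text) if text else None


def morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048) -> Optional[FrozenSet[int]]:
    """Morgan/ECFP-style on-bits via RDKit (optional dependency, imported lazily)."""
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparsable SMILES
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return frozenset(generator.GetFingerprint(mol).GetOnBits())


def max_tanimoto(query: str, references: Iterable[str], fingerprinter: Fingerprinter) -> float:
    """Best Tanimoto similarity between the query and any reference molecule."""
    query_fp = fingerprinter(query)
    if not query_fp:
        return 0.0
    best = 0.0
    for reference in references:
        ref_fp = fingerprinter(reference)
        if not ref_fp:
            continue
        shared = len(query_fp & ref_fp)
        best = max(best, shared / (len(query_fp) + len(ref_fp) - shared))
    return best
```

Because the scoring function only sees sets of integers, swapping `character_set_fingerprint` for `morgan_fingerprint` changes the chemistry but not the retrieval or evaluation code.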

Explore the Documentation

Getting Started

Install the package, prepare LIT-PCBA, and run a target benchmark.

Methods

Understand dataset handling, fingerprinting, retrieval, and metrics.

Release

Publish PyPI, conda, and Docker artifacts from GitHub Actions.

© Copyright 2026, ThinhUMP.