PoseBusters: why your top-ranked pose can still be nonsense
RMSD under 2 angstroms is not the same as a pose you can hand to a chemist. PoseBusters showed how often deep-learning docking violates chemistry, and what to check.
For thirty years the de facto pass/fail for a docking pose has been “RMSD to the crystal pose under 2 Å.” That metric is fine when the underlying method respects bond lengths, angles, and ring planarity by construction, which is how classical, force-field-based docking programs work. It stops being fine the moment a neural network starts emitting coordinates directly. PoseBusters is the paper that made the field admit this in 2024, and it changed how docking benchmarks should be reported.
What PoseBusters actually checks
Buttenschoen et al. (Chem Sci, 2024) introduced PoseBusters as an open-source RDKit-based suite of geometric and chemical sanity tests applied to every predicted pose. The checks fall into three buckets:
- Chemical validity of the ligand — correct stereochemistry preserved from the input SMILES, no atoms flipped between R/S, no double bonds isomerized E/Z, no aromatic rings that have lost planarity.
- Internal geometry — bond lengths within tolerance of reference distributions, bond angles within ranges, no impossibly short non-bonded contacts within the ligand, no atoms overlapping.
- Protein-ligand consistency — no steric clashes with the receptor (van der Waals overlap below 0.4 Å), ligand sits inside the binding pocket rather than buried in backbone or floating in solvent.
A pose that fails any of those checks is called PB-invalid. A pose that passes them all is PB-valid. The metric the field has adopted is “RMSD < 2 Å and PB-valid”: a pose counts as a success only if it meets both conditions.
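To make the protein-ligand clash check concrete, here is a minimal sketch in plain Python. It is not the PoseBusters implementation. The function names, the toy coordinates, and the use of Bondi van der Waals radii are assumptions made for illustration, and the 0.4 Å tolerance is simply the figure quoted in the list above.

```python
import math

# Bondi van der Waals radii in angstroms for a few common elements.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def worst_overlap(ligand, protein):
    """Largest vdW overlap (angstroms) over all ligand-protein atom pairs.

    Each atom is (element, (x, y, z)). Overlap is the sum of the two
    radii minus the interatomic distance; a negative value means the
    atoms are not in contact.
    """
    worst = -math.inf
    for elem_l, xyz_l in ligand:
        for elem_p, xyz_p in protein:
            dist = math.dist(xyz_l, xyz_p)
            overlap = VDW_RADII[elem_l] + VDW_RADII[elem_p] - dist
            worst = max(worst, overlap)
    return worst


def passes_clash_check(ligand, protein, max_overlap=0.4):
    """True if no atom pair overlaps by max_overlap angstroms or more."""
    return worst_overlap(ligand, protein) < max_overlap


# Toy data (hypothetical): two receptor atoms and two one-atom "poses".
protein = [("O", (0.0, 0.0, 0.0)), ("C", (1.2, 0.0, 0.0))]
good_pose = [("N", (-2.9, 0.0, 0.0))]  # hydrogen-bond distance from the O
bad_pose = [("N", (-1.5, 0.0, 0.0))]   # deep inside the O's vdW sphere

print(passes_clash_check(good_pose, protein))  # True
print(passes_clash_check(bad_pose, protein))   # False
```

A real check would read elements and coordinates from the docked structure files and would usually skip pairs that are meant to be close, such as a covalent warhead and its target residue.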
What the benchmark revealed
The PoseBusters Benchmark set is a curated set of 308 protein-ligand complexes from the PDB released from 2021 onward, deliberately chosen to be temporally out-of-distribution for any deep-learning method trained on PDBbind. The headline result: on this hold-out set, no deep-learning method in the comparison outperformed the classical tools, and that includes DiffDock, the strongest of them. Many deep-learning poses that landed within 2 Å of the crystal ligand still failed one or more validity checks. The classical baselines in the paper, AutoDock Vina and Gold, produced PB-valid poses at much higher rates, and their combined “RMSD < 2 Å and PB-valid” success rates beat the deep-learning methods outright.
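The difference between the two success rates is easy to compute once each pose carries an RMSD and a validity flag. The sketch below assumes a hypothetical PoseResult record and uses invented toy values, not numbers from the paper. It only shows how a method can look strong on RMSD alone and weak once validity is required.

```python
from dataclasses import dataclass


@dataclass
class PoseResult:
    method: str
    rmsd: float     # heavy-atom RMSD to the crystal pose, in angstroms
    pb_valid: bool  # True only if every validity check passed


def success_rates(results, cutoff=2.0):
    """Return {method: (rmsd_rate, rmsd_and_valid_rate)} as fractions."""
    by_method = {}
    for r in results:
        by_method.setdefault(r.method, []).append(r)
    rates = {}
    for method, rows in by_method.items():
        n = len(rows)
        rmsd_ok = sum(r.rmsd < cutoff for r in rows)
        both_ok = sum(r.rmsd < cutoff and r.pb_valid for r in rows)
        rates[method] = (rmsd_ok / n, both_ok / n)
    return rates


# Invented toy records for illustration only.
results = [
    PoseResult("dl_model", 1.1, False),
    PoseResult("dl_model", 1.6, True),
    PoseResult("dl_model", 1.9, False),
    PoseResult("dl_model", 3.4, True),
    PoseResult("classical", 1.3, True),
    PoseResult("classical", 1.8, True),
    PoseResult("classical", 2.7, True),
    PoseResult("classical", 4.0, True),
]

rates = success_rates(results)
for method, (rmsd_rate, both_rate) in rates.items():
    print(f"{method}: RMSD<2 = {rmsd_rate:.0%}, RMSD<2 and PB-valid = {both_rate:.0%}")
```

In this toy data the hypothetical dl_model leads on RMSD alone (75% against 50%) but trails once validity is required (25% against 50%), which is why reporting only the first column can mislead.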
The reasons are mechanical. Classical docking either holds bond lengths and angles fixed and samples only torsions, or scores poses with bonded and Lennard-Jones terms that heavily penalize a 0.5 Å bond or an aromatic ring angle squeezed to 110°. A generative model that emits Cartesian coordinates directly has no such prior built in: it can produce a pose that looks plausible at low resolution but contains a tetrahedral nitrogen flattened into a plane, or a phenyl ring with bond angles ranging from 95° to 130°. The pose passes RMSD because the atoms land in roughly the right place; it fails PoseBusters because the molecule it represents could not physically exist.
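A small numerical sketch shows how a ring can be badly distorted while RMSD barely moves. It builds an ideal benzene-like hexagon, displaces one carbon by 0.5 Å, and compares RMSD with the ring's interior angles. The function names, the 10° tolerance, and the distortion itself are assumptions for illustration. RMSD is computed without superposition, since docking RMSD is normally measured in the receptor frame.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(sq / len(a))


def ring_angles(ring):
    """Interior angle in degrees at each atom of a ring given in bonded order."""
    n = len(ring)
    angles = []
    for i in range(n):
        prev, here, nxt = ring[i - 1], ring[i], ring[(i + 1) % n]
        v1 = [p - h for p, h in zip(prev, here)]
        v2 = [q - h for q, h in zip(nxt, here)]
        dot = sum(x * y for x, y in zip(v1, v2))
        cos = dot / (math.hypot(*v1) * math.hypot(*v2))
        # Clamp to [-1, 1] to guard against floating-point drift.
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    return angles


# Ideal aromatic ring: regular hexagon with 1.39 angstrom C-C bonds.
R = 1.39
ideal = [
    (R * math.cos(math.radians(60 * k)), R * math.sin(math.radians(60 * k)), 0.0)
    for k in range(6)
]

# Hypothetical "predicted" ring: atom 0 pulled 0.5 angstrom toward the ring centre.
distorted = list(ideal)
distorted[0] = (ideal[0][0] - 0.5, 0.0, 0.0)

tolerance = 10.0  # degrees; an illustrative threshold, not the PoseBusters one
bad_atoms = [
    i for i, a in enumerate(ring_angles(distorted)) if abs(a - 120.0) > tolerance
]

print(f"RMSD to ideal ring: {rmsd(ideal, distorted):.2f} A")
print("atoms with distorted angles:", bad_atoms)
```

The RMSD comes out at roughly 0.2 Å, far inside the 2 Å cutoff, yet the angle at the displaced atom opens to about 162° and its two neighbours close to about 99°. An RMSD-only benchmark would score this ring as essentially perfect.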
The wider lesson for docking workflows
A pose has to pass three independent tests before a medicinal chemist should trust it:
- Geometric accuracy (low RMSD to a reference when a reference exists, or low pose-pose RMSD across replicate dockings when it doesn’t).
- Physical plausibility (PoseBusters or equivalent — bond lengths, angles, no clashes, valid stereochemistry).
- Interaction recovery (does the pose make the interactions a known active should make, such as hinge hydrogen bonds for kinase inhibitors, the covalent bond for warhead chemistries, or the canonical hydrophobic stack with a specific residue?). ProLIF interaction fingerprints are a widely used way to score this.
A top-1 pose that fails any of the three is suspect, regardless of how good the docking score is. This is especially true for any pose generated by a generative method — diffusion-based, flow-matching, or otherwise.
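The three tests can be combined into a single triage step. The sketch below is one possible arrangement, not a standard tool: the function names, thresholds, and residue labels are hypothetical, the validity flag is assumed to come from PoseBusters or an equivalent, and the contact sets are assumed to come from an interaction fingerprint tool such as ProLIF. It uses replicate spread as the geometric test, for the case where no crystal reference exists.

```python
import math
from itertools import combinations


def pose_rmsd(a, b):
    """RMSD between two poses with identical atom ordering, no superposition."""
    sq = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(sq / len(a))


def replicate_spread(poses):
    """Largest pairwise RMSD among top poses from replicate docking runs."""
    return max(pose_rmsd(a, b) for a, b in combinations(poses, 2))


def interaction_recovery(pose_contacts, reference_contacts):
    """Fraction of the reference interactions that the pose reproduces."""
    if not reference_contacts:
        return 1.0
    return len(pose_contacts & reference_contacts) / len(reference_contacts)


def triage(poses, pb_valid, pose_contacts, reference_contacts,
           max_spread=2.0, min_recovery=0.5):
    """Return (trusted, gates); trusted is True only if all three gates pass."""
    recovery = interaction_recovery(pose_contacts, reference_contacts)
    gates = {
        "geometric": replicate_spread(poses) < max_spread,
        "physical": pb_valid,
        "interactions": recovery >= min_recovery,
    }
    return all(gates.values()), gates


# Toy inputs (hypothetical): three replicate poses of a two-atom ligand.
poses = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (1.7, 0.0, 0.0)],
    [(0.0, 0.3, 0.0), (1.5, 0.3, 0.0)],
]
reference = {("MET793", "hbond"), ("LEU718", "hydrophobic")}
observed = {("MET793", "hbond"), ("LEU718", "hydrophobic")}

trusted, gates = triage(poses, pb_valid=False,
                        pose_contacts=observed, reference_contacts=reference)
print(trusted, gates)
```

Here the replicates agree and the expected contacts are recovered, but the pose is still rejected because the validity flag is False. Returning the per-gate dictionary alongside the verdict makes it clear which test failed.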
What this means for benchmarks you read
Any paper claiming state-of-the-art docking after Buttenschoen et al. (2024) should be reporting PB-valid success rates, not just RMSD success rates. If a methods paper says “75% top-1 RMSD < 2 Å” without a PB-validity column, the right reaction is skepticism. More recent models (Boltz, AlphaFold 3, Chai-1) are starting to fold geometric constraints back into the model architecture (typed bond terms, equivariant losses on bond lengths), and their PB-valid rates have caught up substantially. The lesson still holds: RMSD alone is no longer sufficient evidence that a pose is correct.
Try the docking yourself
Liganx runs Vina and GNINA (when GPU-compatible) for scoring, and surfaces the PoseBusters checks alongside every result. Open Studio and dock any candidate against any target. After the run, the pose viewer flags PB-invalid poses with a warning chip — if you see it, the score is not trustworthy on its own and a re-dock with tighter exhaustiveness or a different initial conformer is the right next move. The point isn’t to memorize the checks; the point is that a docking score is one signal and pose validity is a separate, equally important one.
Because Liganx runs molecular docking online in the browser, the PoseBusters checks travel with every result, so you see pose validity next to the score rather than as a separate step.
Primary sources
- Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem Sci 15, 3130-3139 (2024). doi:10.1039/D3SC04185A
- Corso G, et al. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. ICLR 2023. arXiv:2210.01776
- Bouysset C, Fiorucci S. ProLIF: a library to encode molecular interactions as fingerprints. J Cheminform 13, 72 (2021). doi:10.1186/s13321-021-00548-6