Vina, GNINA, and Glide: what each scoring function buys you
An honest comparison of the three workhorse docking scoring functions in 2026 — what they actually optimize for, where they fail, and how to combine them.
Docking has three workhorse scoring functions in 2026: AutoDock Vina (empirical, open source), GNINA (Vina poses rescored by a convolutional neural network, open source), and Schrödinger Glide (empirical with force-field and proprietary terms, commercial). They get used interchangeably in the literature, which is a mistake. They optimize different things, fail in different ways, and the right move is usually to run more than one and look at where they agree.
What each one is actually doing
AutoDock Vina (Trott & Olson 2010, Eberhardt et al. 2021) is an empirical scoring function: a weighted sum of Gaussian steric, repulsion, hydrophobic, and hydrogen-bond terms, scaled by a penalty on the number of rotatable bonds, with the weights fit to reproduce known binding affinities on a training set of protein-ligand complexes. The search is an iterated local search: Monte Carlo perturbations, each followed by a local optimization. Vina is fast, widely benchmarked, and the de facto default for academic docking. On standard pose-reproduction benchmarks (PDBbind, CASF) it recovers a sub-2 Å pose for the top-scored result ~60–70% of the time when a ligand is redocked into its own crystal structure; cross-docking into a non-cognate structure is harder, and success rates there are markedly lower.
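The functional form described above can be sketched in a few lines. This is a minimal illustration rather than Vina's actual implementation: the weights and term shapes are the values published in Trott & Olson (2010) as we read them, `d` is the surface distance (interatomic distance minus the sum of the two atomic radii, in Å), and the `hydrophobic_pair` / `hbond_pair` flags are hypothetical inputs standing in for Vina's atom typing. Note that in the published form the rotatable-bond term enters as a divisor on the summed intermolecular energy, not as an additive term.

```python
import math

# Term weights as published for AutoDock Vina (Trott & Olson 2010).
WEIGHTS = {
    "gauss1": -0.0356,
    "gauss2": -0.00516,
    "repulsion": 0.840,
    "hydrophobic": -0.0351,
    "hbond": -0.587,
}
W_ROT = 0.0585  # weight on the number of rotatable bonds


def gauss1(d):
    return math.exp(-((d / 0.5) ** 2))


def gauss2(d):
    return math.exp(-(((d - 3.0) / 2.0) ** 2))


def repulsion(d):
    # Only overlapping atoms (negative surface distance) are penalized.
    return d * d if d < 0 else 0.0


def hydrophobic(d):
    # 1 below 0.5 A, 0 above 1.5 A, linear in between.
    if d < 0.5:
        return 1.0
    if d > 1.5:
        return 0.0
    return 1.5 - d


def hbond(d):
    # 1 below -0.7 A, 0 at or above 0 A, linear in between.
    if d < -0.7:
        return 1.0
    if d >= 0:
        return 0.0
    return -d / 0.7


def pair_energy(d, hydrophobic_pair=False, hbond_pair=False):
    """Weighted sum of the terms for one protein-ligand atom pair."""
    e = WEIGHTS["gauss1"] * gauss1(d)
    e += WEIGHTS["gauss2"] * gauss2(d)
    e += WEIGHTS["repulsion"] * repulsion(d)
    if hydrophobic_pair:
        e += WEIGHTS["hydrophobic"] * hydrophobic(d)
    if hbond_pair:
        e += WEIGHTS["hbond"] * hbond(d)
    return e


def score(pairs, n_rot):
    """pairs: iterable of (d, hydrophobic_pair, hbond_pair) tuples."""
    total = sum(pair_energy(d, hp, hb) for d, hp, hb in pairs)
    return total / (1.0 + W_ROT * n_rot)
```

Because the rotatable-bond count only rescales the sum, a flexible ligand making the same contacts as a rigid one simply has its score shrunk toward zero; nothing in the form distinguishes which torsions are actually frozen on binding.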
GNINA (McNutt et al. 2021, Francoeur et al. 2020) keeps Vina's sampling engine but rescores and re-ranks the sampled poses with a 3D convolutional neural network trained on ~25M poses labeled against crystallographic ground truth. The network sees a 3D density grid of the pocket plus ligand and outputs a CNNscore (probability the pose is correct) plus a CNNaffinity (predicted pK). On the CASF-2016 docking power benchmark, GNINA improves top-1 pose accuracy over Vina by roughly 10–15 percentage points on most target classes. Where it shines: highly flexible pockets and ligand series where pharmacophore complementarity matters more than coarse fit.
Glide (Friesner et al. 2004, 2006) uses a custom empirical scoring function (GlideScore) with terms for lipophilic contacts, hydrogen bonds, metal binding, rotatable bond penalty, and a Coulomb/vdW component. Its big differentiator is the sampling strategy: an exhaustive funnel from rigid initial placement through stepped refinement, with constraints from pharmacophore features that the user can specify. Glide SP and Glide XP are the two tiers — XP adds explicit water terms and more aggressive scoring penalties. Glide XP is the closest the industry gets to a “trusted” scoring function in prospective programs, at the cost of being closed source and license-gated.
Three things scoring functions are bad at
Before comparing them, it helps to be honest about what none of them do well.
- Absolute affinity prediction. Every workhorse scoring function correlates with experimental pK at around r = 0.5–0.7 on broad benchmarks (Su et al. 2019). That is useful for ranking within a congeneric series, and not at all useful for predicting whether a novel scaffold will hit at 10 nM. If you see a docking paper claiming predicted IC50, treat it the way you would treat a weather forecast from a Magic 8-Ball.
- Entropy. None of them model conformational entropy of the bound ligand with any rigor. The torsional term in Vina is a counting heuristic; Glide's rotatable-bond penalty is similar. Ligands with many rotatable bonds tend to be over-scored relative to rigidified analogs that perform better in cells.
- Explicit waters. Bridging waters in the binding site matter enormously and none of the standard scoring functions handle them well. Glide XP has a partial answer (desolvation penalties estimated by placing explicit waters around the docked pose); GNINA sees waters if you include them in the grid but most users don't; Vina ignores them entirely. If your target has a known crystallographic bridging water (kinase hinge waters, HIV protease flap waters), include it as part of the receptor or your scores are not telling you what you think.
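To make the absolute-affinity point concrete, the sketch below converts between a score in kcal/mol and pK using ΔG = −RT ln(10) · pK. Treating a docking score as a binding free energy is itself an assumption (the scores are calibrated against affinities, not derived from them), and the function names are ours, chosen for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def kcal_per_pk_unit(temp_k=298.15):
    """Free energy corresponding to one pK unit (a 10-fold affinity change)."""
    return R_KCAL * temp_k * math.log(10)


def score_to_pk(delta_g_kcal, temp_k=298.15):
    """Convert a (negative) binding free energy to pK."""
    return -delta_g_kcal / kcal_per_pk_unit(temp_k)


def fold_error(score_error_kcal, temp_k=298.15):
    """Multiplicative error in affinity implied by a scoring error."""
    return 10 ** (abs(score_error_kcal) / kcal_per_pk_unit(temp_k))


if __name__ == "__main__":
    print(round(kcal_per_pk_unit(), 3))   # kcal/mol per pK unit at 298.15 K
    print(round(score_to_pk(-10.9), 2))   # pK implied by -10.9 kcal/mol
    print(round(fold_error(2.0), 1))      # fold-error from a 2 kcal/mol miss
```

At room temperature one pK unit is worth about 1.36 kcal/mol, so a 10 nM binder (pK 8) sits near −10.9 kcal/mol and a scoring error of 2 kcal/mol corresponds to roughly a 30-fold error in affinity. That is why the scores are usable for ranking within a series and not for calling potency on a new scaffold.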
What the rescoring strategy buys you
The cleanest pattern in the literature is the “dock with Vina, rescore with GNINA” workflow. Vina's sampling engine is well-tuned and fast; GNINA's CNN scoring catches cases where Vina's linear-combination scoring overweights coarse contact area at the expense of specific interactions. Francoeur et al. (2020) showed that for cross-docking benchmarks the rescoring strategy outperforms either Vina-alone or GNINA-alone, with the biggest gains on protein families with sub-pocket plasticity (kinases, GPCRs, nuclear receptors).
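For readers running GNINA locally, the workflow above maps onto a single invocation, since GNINA samples with Vina-style scoring and applies the CNN to the final poses when `--cnn_scoring rescore` is set. The helper below only assembles the argument list; the flags are the ones we know from the GNINA README (`-r`, `-l`, `--autobox_ligand`, `--cnn_scoring`, `-o`), while the function name and file names are hypothetical. Check `gnina --help` on your build before relying on it.

```python
# Values accepted by --cnn_scoring according to the GNINA README
# (an assumption to verify against your installed version).
CNN_SCORING_MODES = {"none", "rescore", "refinement", "all"}


def build_gnina_command(receptor, ligand, autobox_ligand, out_path,
                        cnn_scoring="rescore", gnina_bin="gnina"):
    """Return the argv list for a GNINA docking run.

    receptor        : receptor structure file (e.g. PDB)
    ligand          : ligand(s) to dock (e.g. SDF)
    autobox_ligand  : reference ligand used to define the search box
    out_path        : where docked poses are written
    cnn_scoring     : "rescore" = Vina-style sampling, CNN re-ranks final poses;
                      "none"    = Vina-style scoring only
    """
    if cnn_scoring not in CNN_SCORING_MODES:
        raise ValueError(f"unknown cnn_scoring mode: {cnn_scoring!r}")
    return [
        gnina_bin,
        "-r", receptor,
        "-l", ligand,
        "--autobox_ligand", autobox_ligand,
        "--cnn_scoring", cnn_scoring,
        "-o", out_path,
    ]
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it, assuming a `gnina` binary is on the PATH. Setting `cnn_scoring="none"` gives a Vina-only baseline on the same inputs, which is the simplest way to see what the rescoring changed.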
Two practical caveats are worth flagging. First, the CNN has only seen its training set; novel chemotypes that are structurally far from anything in PDBbind can get penalized by the network for reasons that aren't physically meaningful. Always inspect the top pose visually before trusting a confident-looking score. Second, GNINA's CUDA kernels are sensitive to GPU architecture: the prebuilt binaries run cleanly on most consumer and datacenter cards, but the bleeding-edge Blackwell-generation (sm_100/sm_120) chips need a fallback path with CNN scoring disabled. If you are running on a B200 or RTX 5090 today, expect to either rebuild from source or fall back to Vina-only scoring, skipping the CNN rescoring step, until upstream patches land.
How we use them in practice
The pragmatic rule we have settled on:
- Discovery-phase, broad virtual screen of millions of compounds: Vina, top 1–5% triaged by CNN rescoring with GNINA. The throughput math doesn't work otherwise.
- Lead optimization within a series: Glide XP if you have a license, otherwise GNINA CNN with the caveat that ranking small ligand modifications is at the edge of what either tool reliably does. Wet-lab SAR is still the source of truth.
- Mutation-selectivity questions: rank by the difference in score between wild-type and mutant receptor (a ΔΔG-style comparison), not by absolute score. All three scoring functions are noisier on absolute numbers than on relative differences when the receptor chemistry is held mostly constant. The mutation deep-dives we have written on EGFR T790M and KRAS G12C both lean on this ΔΔ pattern.
- Pose validation before reporting: PoseBusters (Buttenschoen et al. 2024) on the top pose before publishing or moving to synthesis. It catches the chemistry-violation failure modes (impossible bond lengths, intermolecular clashes, steric impossibilities) that scoring functions sometimes wave through.
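Two of the rules above are simple enough to sketch: triaging the best-scoring fraction of a screen for rescoring, and ranking ligands by the wild-type versus mutant score difference. The function names and the dictionary layout (ligand id mapped to score, more negative is better) are assumptions made for the illustration, not part of any tool's API.

```python
import math


def top_fraction(scores, fraction):
    """Return ligand ids in the best-scoring fraction, best first.

    scores   : dict of ligand id -> docking score (more negative = better)
    fraction : e.g. 0.01 to 0.05 for the top 1-5% triage
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, key=scores.get)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


def delta_delta(wt_scores, mut_scores):
    """score(mutant) - score(wild type) for ligands docked to both receptors."""
    shared = wt_scores.keys() & mut_scores.keys()
    return {lig: mut_scores[lig] - wt_scores[lig] for lig in shared}


def rank_by_mutant_preference(wt_scores, mut_scores):
    """Ligand ids ordered from most mutant-preferring to most wild-type-preferring."""
    dd = delta_delta(wt_scores, mut_scores)
    return sorted(dd, key=dd.get)


if __name__ == "__main__":
    screen = {"a": -9.1, "b": -7.0, "c": -8.2, "d": -6.5}
    print(top_fraction(screen, 0.5))

    wt = {"a": -9.0, "b": -8.0, "c": -7.0}
    mut = {"a": -7.5, "b": -8.5}
    print(delta_delta(wt, mut))
    print(rank_by_mutant_preference(wt, mut))
```

Ligands scored against only one receptor are dropped from the ΔΔ comparison rather than imputed, since a difference against a missing score would reintroduce exactly the absolute-score noise the pattern is meant to cancel.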
Try the comparison yourself
Open Studio and pick any target from the catalog. Dock your candidate with Vina (default), then enable GNINA CNN rescoring on the same job. The result row shows both scores side-by-side, and the pose viewer overlays the top Vina pose and the top CNN-reranked pose so you can see whether the rescoring changed which pose won. On well-behaved kinase targets they agree most of the time; on flexible pockets with bridging waters they often disagree, and that disagreement is itself a useful signal. When the two functions point at the same pose with similar confidence, that is the case that deserves the most trust, though it still merits the visual check and pose validation described above.
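If you export the two top poses, the agreement check can be made quantitative. The sketch below computes a plain heavy-atom RMSD between two poses of the same ligand and applies the 2 Å threshold used for pose-reproduction benchmarks. It assumes both poses list the same atoms in the same order and share the receptor's coordinate frame, and it does no symmetry correction, so symmetric ligands can look further apart than they are. The function names are hypothetical.

```python
import math


def heavy_atom_rmsd(pose_a, pose_b):
    """RMSD in A between two poses given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and a shared coordinate frame;
    no alignment and no symmetry correction are applied.
    """
    if not pose_a or len(pose_a) != len(pose_b):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def poses_agree(pose_a, pose_b, cutoff=2.0):
    """True when the two poses are within the RMSD cutoff of each other."""
    return heavy_atom_rmsd(pose_a, pose_b) <= cutoff


if __name__ == "__main__":
    vina_top = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    cnn_top = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0)]
    print(heavy_atom_rmsd(vina_top, cnn_top), poses_agree(vina_top, cnn_top))
```

No alignment step is applied on purpose: docked poses already live in the receptor frame, and superimposing them first would hide the case where both functions pick the same conformation in different places in the pocket.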
Because Liganx offers molecular docking online and free, you can run Vina and GNINA on the same job in the browser and compare the two scoring functions without a local toolchain.
Primary sources
- Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 31, 455–461 (2010). doi:10.1002/jcc.21334
- McNutt AT, et al. GNINA 1.0: molecular docking with deep learning. J Cheminform 13, 43 (2021). doi:10.1186/s13321-021-00522-2
- Friesner RA, et al. Extra Precision Glide: Docking and Scoring Incorporating a Model of Hydrophobic Enclosure for Protein-Ligand Complexes. J Med Chem 49, 6177–6196 (2006). doi:10.1021/jm051256o
- Buttenschoen M, Morris GM, Deane CM. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem Sci 15, 3130–3139 (2024). doi:10.1039/D3SC04185A