Consensus scoring: why combining docking functions beats one
Every scoring function is wrong in its own way. Consensus scoring cancels the uncorrelated errors. Here is how rank-by-rank, rank-by-vote, and ECR actually work.
No docking scoring function is trustworthy on its own. Each one encodes a different set of approximations about electrostatics, desolvation, and entropy, and each fails on a different slice of chemical space. Consensus scoring is the pragmatic response: run several functions, combine their verdicts, and let the uncorrelated errors cancel. It is one of the oldest tricks in structure-based virtual screening and still one of the most reliable.
Why a single score lies to you
A docking score is a model of binding free energy, and every model cuts corners. Empirical functions like AutoDock Vina fit a handful of physically motivated terms to known affinities. Knowledge-based functions derive potentials from the statistics of the PDB. Machine-learning functions like GNINA’s CNN learn a scoring surface from thousands of co-crystal poses. They disagree because they are wrong in different places: Vina tends to over-reward buried hydrophobic surface, knowledge-based potentials inherit the biases of whatever crystal structures dominate the training set, and CNN scorers can be fooled by chemotypes unlike anything they saw in training.
The key statistical insight, formalized by Wang and Wang in 2001, is that if the errors of different functions are at least partly independent, averaging them reduces the variance of the estimate. The true binding signal is shared across functions; the noise is not. Average enough semi-independent estimates and the noise shrinks while the signal survives. That is the entire theoretical justification for consensus scoring, and it is why it only helps when the functions you combine are genuinely different in construction.
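The variance argument can be written down in a few lines. This is a textbook idealization rather than the model used in the 2001 paper: it assumes each of the n functions reports the true signal g plus an error with the same variance σ² and the same pairwise correlation ρ. The symbols s_k, g, σ, and ρ are introduced here for illustration only.

```latex
\[
s_k = g + \varepsilon_k, \qquad
\operatorname{Var}(\varepsilon_k) = \sigma^2, \qquad
\operatorname{Corr}(\varepsilon_j, \varepsilon_k) = \rho \quad (j \neq k)
\]
\[
\operatorname{Var}\!\left(\frac{1}{n}\sum_{k=1}^{n} s_k\right)
= \frac{\sigma^2}{n}\,\bigl(1 + (n-1)\,\rho\bigr)
\;\xrightarrow[\;n \to \infty\;]{}\; \rho\,\sigma^2
\]
```

With ρ = 0 the error variance falls as σ²/n; with ρ = 1 it stays at σ² no matter how many functions are averaged. Under these assumptions, the correlated part of the error is a floor that no amount of averaging removes.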
The three classic combination schemes
Charifson and colleagues introduced consensus scoring in 1999 and showed it cut false-positive rates against three targets. The schemes they and their successors use fall into three families:
- Rank-by-number — normalize each function’s raw scores onto a common scale and average the numbers. Simple, but sensitive to outliers and to the wildly different scales and offsets that scoring functions produce.
- Rank-by-rank — convert each function’s output to a rank order, then average the ranks. This throws away the magnitude of the scores but is immune to scale and unit mismatches, which is exactly the failure mode that wrecks rank-by-number.
- Rank-by-vote — give a compound one vote from each function that places it in, say, the top 10 percent, then rank by vote count. Brutally simple, surprisingly robust, and the easiest to reason about when you have only two or three functions.
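The three schemes above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes every function's scores are oriented so that lower is better (negate any score where higher is better before passing it in), uses z-scores as one possible normalization for rank-by-number, and the function names and example scores are hypothetical.

```python
from statistics import mean, pstdev

def ranks(scores):
    """1-based ranks (1 = best, lower score is better); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result

def rank_by_number(score_table):
    """Z-score each function's scores, then average per compound (lower = better)."""
    columns = []
    for scores in score_table:
        mu, sd = mean(scores), pstdev(scores)
        columns.append([(s - mu) / sd if sd else 0.0 for s in scores])
    return [mean(col[i] for col in columns) for i in range(len(score_table[0]))]

def rank_by_rank(score_table):
    """Average each compound's rank across functions (lower = better)."""
    rank_cols = [ranks(scores) for scores in score_table]
    return [mean(col[i] for col in rank_cols) for i in range(len(score_table[0]))]

def rank_by_vote(score_table, top_fraction=0.10):
    """Count the functions that place each compound in the top fraction (higher = better)."""
    n = len(score_table[0])
    cutoff = max(1, round(top_fraction * n))
    rank_cols = [ranks(scores) for scores in score_table]
    return [sum(1 for col in rank_cols if col[i] <= cutoff) for i in range(n)]

# Hypothetical scores for five compounds from two functions on very different scales.
func_a = [-9.1, -7.4, -8.3, -6.0, -7.9]
func_b = [-50.0, -55.0, -47.0, -30.0, -41.0]

print(rank_by_rank([func_a, func_b]))       # [1.5, 2.5, 2.5, 5.0, 3.5]
print(rank_by_vote([func_a, func_b], 0.4))  # [2, 1, 1, 0, 0]
```

Note that the first two schemes return values where lower is better, while the vote count is higher-is-better, so sort accordingly.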
A more modern refinement is Exponential Consensus Ranking (ECR), which scores each compound by summing, across functions, a decaying exponential of its rank in each one. ECR rewards molecules that land near the top of any function rather than demanding agreement everywhere, and Palacio-Rodriguez and colleagues showed in 2019 that it improves enrichment in both single-structure and receptor-ensemble docking. It is rank-based, so it inherits rank-by-rank’s immunity to scale problems while being less punishing toward a real binder that one function happens to score poorly.
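A minimal sketch of the exponential weighting, assuming the commonly quoted form in which a compound's ECR score is the sum over functions of exp(-rank/σ)/σ, with σ a tunable parameter that sets roughly how deep into each ranked list a function still contributes. The function name, the value of σ, and the example ranks are hypothetical; check the paper for the authors' own choices.

```python
import math

def ecr_scores(rank_table, sigma):
    """rank_table holds one list of ranks per scoring function (1 = best).

    Returns one ECR score per compound; higher is better.
    """
    n = len(rank_table[0])
    return [
        sum(math.exp(-func_ranks[i] / sigma) / sigma for func_ranks in rank_table)
        for i in range(n)
    ]

# Two compounds from a larger library, ranked by two functions:
#   compound 0: rank 1 in function A, rank 50 in function B
#   compound 1: rank 10 in both
rank_table = [[1, 10], [50, 10]]
scores = ecr_scores(rank_table, sigma=5.0)

# Plain rank averaging prefers compound 1 (mean rank 10 vs 25.5);
# the exponential weighting prefers compound 0.
print(scores[0] > scores[1])  # True
```

The example shows the behavior described above: one top placement outweighs two mediocre ones, whereas averaging the ranks would reverse the order.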
Pose consensus is a different thing
Consensus scoring combines affinity estimates. Consensus docking combines geometries: you dock with several engines and keep poses that multiple programs place in the same spot, usually defined as an inter-pose RMSD under 2 Å. The logic is the same variance-cancellation argument applied to coordinates instead of scores. A pose that three independent samplers converge on is far more likely to be the real binding mode than a pose only one engine likes. The strongest workflows do both: require geometric agreement first, then rank the survivors by a consensus score. RSC Advances published a clean demonstration in 2021 that combining pose consensus with rank consensus beats either alone.
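The geometric filter can be sketched as a pairwise RMSD check. This is a simplified illustration with several assumptions: each pose is a list of heavy-atom (x, y, z) coordinates in the same atom order, all poses are already in the same receptor frame (so no superposition is applied), and symmetry-equivalent atoms are not handled, which a production tool would need to correct for. The function names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def rmsd(pose_a, pose_b):
    """In-place RMSD in angstroms between two poses with identical atom ordering."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))

def poses_agree(poses, cutoff=2.0):
    """True when every pair of poses is closer than the cutoff."""
    return all(rmsd(a, b) < cutoff for a, b in combinations(poses, 2))

# Toy two-atom "ligand" placed by three hypothetical engines.
pose_1 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
pose_2 = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]  # shifted 1 A along x
pose_3 = [(0.0, 3.0, 0.0), (1.5, 3.0, 0.0)]  # shifted 3 A along y

print(poses_agree([pose_1, pose_2]))          # True
print(poses_agree([pose_1, pose_2, pose_3]))  # False
```

Requiring all pairs to agree is the strictest reading of "multiple programs place it in the same spot"; a looser variant would accept a pose once any two engines agree.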
When consensus does not help
Consensus scoring is not a free lunch. If your functions share a systematic bias, averaging them just averages the bias and gives you false confidence. Three knowledge-based functions all trained on the same skewed PDB slice will agree with each other and still be wrong together. The independence assumption is load-bearing: the more your functions resemble one another in construction, the less consensus buys you. Combine an empirical function, a knowledge-based one, and a learned one before you combine three flavors of the same idea.
It also costs compute. Each added function adds another full rescoring pass over the library, and past three or four functions the marginal enrichment gain usually flattens. For a focused mutation-selectivity question on a handful of ligands the cost is trivial; for a million-compound library it is a real budget line.
Try a two-function consensus yourself
You do not need a screening cluster to see the effect. The cheapest useful consensus is two functions of different lineage: an empirical search function and a learned rescorer. Open Studio and dock your ligand set with Vina, then rescore the top poses with GNINA’s CNN. Where the two functions agree on the top ranks, trust the result; where they disagree, that is exactly the molecule worth inspecting by hand in the pose viewer. That disagreement signal is the practical payoff of consensus thinking, and it shows up most sharply on the close ΔΔG calls between a wild-type pocket and its mutant.
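If you copy the two score columns out by hand, the agree-or-inspect triage takes only a few lines. This is a sketch under stated assumptions: Vina scores are treated as lower-is-better, the learned score is assumed to be oriented so that higher is better, and the ligand names, score values, and the `triage` helper are all hypothetical rather than anything exported by a particular tool.

```python
def rank_positions(scores, higher_is_better=False):
    """1-based rank of each compound (1 = best); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    positions = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        positions[idx] = rank
    return positions

def triage(names, vina_scores, cnn_scores, top_n=3):
    """Split compounds into those both functions put in the top_n and those only one does."""
    vina_rank = rank_positions(vina_scores)                        # more negative is better
    cnn_rank = rank_positions(cnn_scores, higher_is_better=True)   # assumed orientation
    agree, inspect = [], []
    for name, rv, rc in zip(names, vina_rank, cnn_rank):
        in_vina, in_cnn = rv <= top_n, rc <= top_n
        if in_vina and in_cnn:
            agree.append(name)
        elif in_vina or in_cnn:
            inspect.append(name)   # the functions disagree: look at this pose by hand
    return agree, inspect

names = ["lig_a", "lig_b", "lig_c", "lig_d", "lig_e"]
vina = [-9.1, -7.4, -8.3, -6.0, -7.9]
cnn = [0.91, 0.88, 0.35, 0.20, 0.67]

agree, inspect = triage(names, vina, cnn, top_n=2)
print(agree)    # ['lig_a']
print(inspect)  # ['lig_b', 'lig_c']
```

Here lig_a is the consensus pick, while lig_b and lig_c are the disagreement cases worth opening in the pose viewer.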
Liganx puts molecular docking online and free in the browser, so running a ligand set through two scoring functions and comparing their rankings takes a couple of clicks rather than a pipeline.
Primary sources
- Charifson PS, Corkery JJ, Murcko MA, Walters WP. Consensus scoring: a method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins. J Med Chem 42, 5100-5109 (1999). doi:10.1021/jm990352k
- Wang R, Wang S. How does consensus scoring work for virtual library screening? An idealized computer experiment. J Chem Inf Comput Sci 41, 1422-1426 (2001). doi:10.1021/ci010025x
- Palacio-Rodriguez K, Lans I, Cavasotto CN, Cossio P. Exponential consensus ranking improves the outcome in docking and receptor ensemble docking. Sci Rep 9, 5142 (2019). doi:10.1038/s41598-019-41594-3