Building better AI benchmarks: How many raters are enough?

The latest research from Google

Science, Technology & Innovation · Mar 31, 2026

Benchmark Design Should Align Annotation Depth Or Breadth with the Evaluation Metric

Choose annotation breadth or depth based on the evaluation metric: favor many items (“forest”) to optimize majority-vote accuracy, but favor many raters per item (“tree”) to capture disagreement, uncertainty, and distributional judgments—otherwise benchmark gains may reflect annotation design, not true model quality.
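
A toy simulation makes the trade-off concrete. The sketch below is an illustration with an assumed generative model (each item gets a latent label rate drawn from a Beta distribution; it is not the paper's code). It spends roughly the same annotation budget three ways and tracks two quantities: the run-to-run noise of the aggregate benchmark score, which shrinks mainly as the item count N grows, and the error in each item's estimated rater distribution, which shrinks only as the raters per item K grows.

```python
# Toy illustration of the breadth-vs-depth trade-off (assumed generative model,
# not the paper's code): items get a latent label rate p ~ Beta(2, 2), and each
# of K raters votes Bernoulli(p).
import numpy as np

rng = np.random.default_rng(0)

def run(n_items, k_raters, reps=200):
    """Build `reps` benchmarks and report (a) run-to-run noise of the aggregate
    score and (b) average error in the per-item rater distribution."""
    agg_scores, dist_errors = [], []
    for _ in range(reps):
        p_true = rng.beta(2, 2, size=n_items)              # latent per-item label rates
        votes = rng.binomial(k_raters, p_true) / k_raters   # observed rater fractions
        agg_scores.append(votes.mean())                     # benchmark-level score
        dist_errors.append(np.abs(votes - p_true).mean())   # per-item distribution error
    return np.std(agg_scores), np.mean(dist_errors)

# Three ways to spend roughly 3,000 annotations: wide, balanced, deep.
for n, k in [(1000, 3), (300, 10), (100, 30)]:
    agg_sd, item_err = run(n, k)
    print(f"N={n:4d} K={k:2d}  aggregate-score noise={agg_sd:.4f}  per-item error={item_err:.3f}")
```

In this toy setting the widest split gives the most stable aggregate score, while only the deepest split pins down the per-item distributions, mirroring the forest-versus-tree advice above.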


Reproducible AI Benchmarks Require Optimizing Item Count and Raters Per Item to Capture Human Disagreement

AI benchmark reproducibility is often limited by under-measured human disagreement on subjective labels. Google proposes a simulator-based framework that uses gold ratings to test thousands of annotation configurations and optimize the trade-off between item count (N) and raters per item (K) for statistical reliability; the recommendation is to treat benchmark design as an N-vs-K optimization that produces reproducible, disagreement-aware evaluations.
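
A minimal sketch of what such a resampling simulator might look like, assuming a densely rated gold pool and a simple accuracy-against-majority metric (the pool, the two systems, and all sizes below are illustrative, not from the paper): repeatedly draw N items and K ratings per item, score two systems, and record how often the same system wins.

```python
# Sketch of a resampling-based reproducibility check for a candidate (N, K) design.
# Everything here is an assumption for illustration: the gold pool is synthetic and
# the two "systems" are noisy functions of the latent item rates.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dense gold pool: 2,000 items with 50 binary ratings each.
n_pool, pool_depth = 2000, 50
p_true = rng.beta(2, 5, size=n_pool)                        # latent per-item "positive" rates
gold = rng.binomial(1, p_true[:, None], size=(n_pool, pool_depth))

# Two hypothetical systems; system_a tracks the item rates a bit more closely.
system_a = (p_true + 0.15 * rng.normal(size=n_pool) > 0.5).astype(int)
system_b = (p_true + 0.35 * rng.normal(size=n_pool) > 0.5).astype(int)

def replicate_once(n_items, k_raters):
    """One simulated annotation run: sample items and raters, score both systems."""
    items = rng.choice(n_pool, size=n_items, replace=False)
    raters = rng.choice(pool_depth, size=k_raters, replace=False)
    majority = gold[np.ix_(items, raters)].mean(axis=1) >= 0.5
    acc_a = np.mean(system_a[items] == majority)
    acc_b = np.mean(system_b[items] == majority)
    return acc_a > acc_b

def same_winner_rate(n_items, k_raters, reps=300):
    """How often do independent replications of the benchmark pick the same winner?"""
    wins_a = sum(replicate_once(n_items, k_raters) for _ in range(reps))
    return max(wins_a, reps - wins_a) / reps

# Candidate designs with roughly comparable annotation budgets.
for n, k in [(1000, 1), (333, 3), (100, 10), (50, 20)]:
    print(f"N={n:4d} K={k:2d}  same winner in {same_winner_rate(n, k):.0%} of replications")
```

A full framework would sweep many more (N, K) configurations and metrics; the point is that reproducibility becomes something you can estimate before committing to an annotation design.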


Higher Rater Counts Per Item Are Needed For Reproducible Subjective Benchmark Evaluations

Simulations indicate the common practice of using 1–5 raters per item is inadequate for reliable human-grounded evaluation on subjective tasks—often more than 10 raters are needed, with major implications for budgets, dataset design, and claims of reproducibility.
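
The intuition is easy to check on a single contested item. Assuming, purely for illustration, an item where 60% of the rater population would choose the majority label, the snippet below computes how often a K-rater majority recovers that label.

```python
# How stable is a majority vote on a contested item? Assumes (for illustration)
# that 60% of the rater population would pick the majority label.
from math import comb

p = 0.6                                   # assumed population rate of the majority label
for k in [1, 3, 5, 11, 21]:               # odd K avoids ties
    need = k // 2 + 1                     # votes needed for a strict majority
    stable = sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))
    print(f"K={k:2d}: a K-rater majority matches the population label {stable:.0%} of the time")
```

Even with 11 raters the recovered majority label flips on about a quarter of hypothetical re-annotations of such an item, which is why small rater counts can quietly undermine reproducibility claims on subjective tasks.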


Benchmarking Should Treat Disagreement As Signal In AI Evaluation

The research finds that the “single truth” benchmarking paradigm is a poor fit for many subjective or partially observable AI targets (e.g., harmful intent, safety, social interaction). It argues that benchmarks should treat annotator disagreement as signal and quantify it, enabling trustworthy comparisons across teams, locales, and safety-critical deployments.
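
One common way to operationalize this, shown below as an illustrative choice rather than the paper's exact metric, is to keep each item's full rater distribution and score models against it instead of collapsing to a single label first.

```python
# Keep disagreement as signal: score a model against the item's full rater
# distribution rather than a single collapsed label. Illustrative numbers only.
import numpy as np

# Hypothetical item: 7 of 10 raters flagged it as harmful.
rater_votes = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
human_rate = rater_votes.mean()                      # 0.7, not a hard 1 or 0

# Disagreement as signal: entropy of the rater distribution (0 = unanimous, 1 = even split).
eps = 1e-12
entropy = -(human_rate * np.log2(human_rate + eps)
            + (1 - human_rate) * np.log2(1 - human_rate + eps))

model_prob = 0.65                                    # model's predicted probability of "harmful"
majority_score = float((model_prob > 0.5) == (human_rate > 0.5))   # collapses disagreement
distributional_score = 1 - abs(model_prob - human_rate)            # rewards matching the split

print(f"disagreement (entropy): {entropy:.2f} bits")
print(f"score vs majority label: {majority_score:.2f}")
print(f"score vs rater distribution: {distributional_score:.2f}")
```

Under the collapsed view, a model that predicts 0.95 and one that predicts 0.65 look identical on this item; the distributional view separates them and preserves the 70/30 split as information rather than noise.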


Simulate The N/K Mix Before Increasing Annotation Spend To Achieve Reproducible Benchmarks

Benchmark reliability can be achieved on modest annotation budgets by allocating items versus raters (N/K) well; for example, roughly 1,000 total annotations can yield highly reproducible results. Teams should therefore simulate or pilot the N/K mix against their chosen metric, because reallocating the budget, not simply spending more, is what drives reliability.
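
As a back-of-the-envelope planning aid, a standard variance decomposition with assumed variance components (not the paper's simulator) already shows why the split matters at fixed spend: for a mean-over-items score, the squared standard error is roughly var_between/N + var_within/(N·K), and N·K is the budget.

```python
# Back-of-the-envelope comparison of (N, K) splits under a fixed ~1,000-annotation
# budget, using assumed variance components (illustrative numbers, not measured ones).
import math

budget = 1000
var_between = 0.05   # assumed variance of true item-level label rates
var_within = 0.20    # assumed rater (within-item) variance

for k in [1, 2, 5, 10, 20]:
    n = budget // k
    se = math.sqrt(var_between / n + var_within / (n * k))
    print(f"K={k:2d} raters x N={n:4d} items  ->  SE of a mean-score benchmark ≈ {se:.4f}")
```

For a simple mean-score metric this pushes toward breadth; a disagreement- or distribution-aware metric pushes back toward depth, which is why piloting or simulating the split against the metric you actually report is worth doing before buying more annotations.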