Building better AI benchmarks: How many raters are enough? · The latest research from Google
Science, Technology & Innovation · Mar 31, 2026
Choose annotation breadth or depth based on the evaluation metric: favor many items (the "forest") when the metric is majority-vote accuracy, but favor many raters per item (the "tree") when the goal is to capture disagreement, uncertainty, and distributional judgments. Otherwise, apparent benchmark gains may reflect annotation design rather than true model quality.
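To make the trade-off concrete, here is a minimal toy simulation (not from the paper; the Beta(2, 2) rater model and the 1,000-annotation budget are illustrative assumptions) that splits a fixed budget into N items times K raters and scores both views: majority-label recovery (the "forest" metric) and soft-label fidelity (the "tree" metric).

```python
import numpy as np

rng = np.random.default_rng(0)
BUDGET = 1_000  # total annotations, split as N items x K raters (hypothetical figure)

def simulate(n_items, k_raters):
    # Each item gets a latent agreement rate p_i; raters vote positive w.p. p_i.
    # The Beta(2, 2) prior over p_i is an illustrative assumption, not from the paper.
    p = rng.beta(2, 2, size=n_items)
    votes = rng.random((n_items, k_raters)) < p[:, None]
    soft = votes.mean(axis=1)
    # "Forest" view: does the sampled majority recover the latent majority label?
    majority_recovery = np.mean((soft > 0.5) == (p > 0.5))
    # "Tree" view: how closely does the empirical soft label track true disagreement?
    soft_label_mae = np.mean(np.abs(soft - p))
    return majority_recovery, soft_label_mae

for k in (1, 2, 5, 10, 25):
    n = BUDGET // k
    rec, mae = simulate(n, k)
    print(f"N={n:4d} x K={k:2d}: majority recovery {rec:.2f}, soft-label MAE {mae:.2f}")
```

Under this toy model, large K drives the soft-label error down, while small K leaves per-item disagreement essentially unmeasured even though the budget is identical.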
AI benchmark reproducibility is often limited by under-measured human disagreement on subjective labels. Google therefore proposes a simulator-based framework that uses gold ratings to test thousands of configurations for statistical reliability, optimizing the trade-off between item count (N) and raters per item (K), and recommends treating benchmark design as an N-versus-K optimization that produces reproducible, disagreement-aware evaluations.
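A sketch of the simulator idea, under assumed names and data (the dense gold pool and the mean-positive-rate metric are stand-ins, not the paper's actual setup): repeatedly subsample a candidate (N, K) design from densely annotated gold ratings and measure how much the benchmark metric varies across simulated replications.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dense "gold" pool: pool[i, r] is rater r's binary label for item i.
N_POOL_ITEMS, N_POOL_RATERS = 2_000, 40
pool = rng.random((N_POOL_ITEMS, N_POOL_RATERS)) < rng.beta(2, 2, (N_POOL_ITEMS, 1))

def sample_metric(n_items, k_raters):
    """Draw one simulated benchmark of n_items x k_raters and return its metric
    (here: mean positive rate, a stand-in for whatever metric a benchmark uses)."""
    items = rng.choice(N_POOL_ITEMS, size=n_items, replace=False)
    raters = rng.choice(N_POOL_RATERS, size=k_raters, replace=False)
    return pool[np.ix_(items, raters)].mean()

def reliability(n_items, k_raters, trials=500):
    """Spread of the metric across independent simulated benchmarks:
    a smaller spread means a more reproducible (N, K) design."""
    scores = [sample_metric(n_items, k_raters) for _ in range(trials)]
    return np.std(scores)

budget = 1_000
for k in (1, 5, 10, 20):
    n = budget // k
    print(f"N={n:4d}, K={k:2d}: metric std across replications {reliability(n, k):.4f}")
```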
Simulations indicate that the common practice of using 1–5 raters per item is inadequate for reliable human-grounded evaluation on subjective tasks; often more than 10 raters are needed, with major implications for budgets, dataset design, and claims of reproducibility.
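A quick back-of-the-envelope check of why a handful of raters falls short: if each rater independently matches the population-majority label with probability 0.65 (an illustrative agreement rate for a contested subjective item, not a figure from the paper), the chance that a K-rater majority recovers that label grows only slowly with K.

```python
from math import comb

def majority_match_prob(k, p=0.65):
    """P(majority of k raters matches the population-majority label) when each
    rater independently agrees with that label w.p. p (illustrative value only).
    Ties on even k are counted as failures for simplicity."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(k // 2 + 1, k + 1))

for k in (1, 3, 5, 11, 21):
    print(f"K={k:2d}: majority matches population label with prob {majority_match_prob(k):.3f}")
```

With K=5 the majority is still wrong roughly a quarter of the time on such items, which is consistent with the finding that reliable evaluation can require well over 10 raters.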
The research finds that the "single truth" benchmarking paradigm is a poor fit for many subjective or partially observable AI targets (e.g., harmful intent, safety, social interaction). It argues that benchmarks should treat annotator disagreement as signal and quantify it, enabling trustworthy comparisons across teams, locales, and safety-critical deployments.
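One way to operationalize "disagreement as signal" (a minimal sketch, not the paper's method): report a per-item disagreement score, here the normalized entropy of the binary votes, alongside the majority label instead of collapsing votes to a single truth.

```python
import numpy as np

def disagreement_report(votes):
    """Summarize a binary annotation matrix (items x raters) with both the
    majority label and a per-item disagreement score (normalized entropy),
    so disagreement is reported as signal rather than averaged away."""
    soft = votes.mean(axis=1)  # per-item positive rate
    eps = 1e-12                # guard against log2(0)
    entropy = -(soft * np.log2(soft + eps) + (1 - soft) * np.log2(1 - soft + eps))
    return {
        "majority_label": (soft > 0.5).astype(int),
        "soft_label": soft,
        "disagreement": entropy,  # 0 = unanimous, 1 = perfect 50/50 split
    }

votes = np.array([[1, 1, 1, 1, 1],   # unanimous: clear-cut item
                  [1, 1, 0, 1, 0],   # 3/5 split: genuinely contested item
                  [0, 1, 0, 1, 0]])  # 2/5 split: contested, majority negative
report = disagreement_report(votes)
for i in range(len(votes)):
    print(f"item {i}: majority={report['majority_label'][i]}, "
          f"soft={report['soft_label'][i]:.2f}, disagreement={report['disagreement'][i]:.2f}")
```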
Benchmark reliability can be achieved with modest annotation budgets by allocating items versus raters (N/K) optimally; roughly 1,000 total annotations can yield highly reproducible results. Teams should therefore simulate or pilot the N/K mix against their own metric, because reallocating budget, not simply spending more, is what drives reliability.
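Putting the pieces together, a hypothetical pilot-driven search over feasible N/K splits of a fixed budget (the pilot pool shape, the Beta(2, 2) rater model, and the 1,000-annotation budget are all assumptions for illustration): estimate each design's metric spread by resampling and keep the most reproducible one.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical densely annotated pilot set: 500 items x 30 raters of binary labels.
pilot = rng.random((500, 30)) < rng.beta(2, 2, (500, 1))

def metric_std(n_items, k_raters, trials=300):
    """Spread of the benchmark metric across resampled (n_items, k_raters) designs."""
    scores = []
    for _ in range(trials):
        items = rng.choice(pilot.shape[0], n_items, replace=False)
        raters = rng.choice(pilot.shape[1], k_raters, replace=False)
        scores.append(pilot[np.ix_(items, raters)].mean())
    return float(np.std(scores))

BUDGET = 1_000
# Enumerate splits N = BUDGET // K that the pilot pool can actually support.
designs = [(BUDGET // k, k) for k in range(1, 31) if BUDGET // k <= pilot.shape[0]]
best = min(designs, key=lambda nk: metric_std(*nk))
print(f"Most reproducible split of {BUDGET} annotations: N={best[0]} items, K={best[1]} raters")
```

The point of the sketch is the workflow, not the specific winner: the best split depends on the pilot data and the metric, which is why piloting against your own evaluation matters.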