Building better AI benchmarks: How many raters are enough?

Source: https://research.google/blog/building-better-ai-benchmarks-how-many-raters-are-enough/
Summary:
Google Research proposes a new AI evaluation framework: optimizing the allocation of annotation resources to make benchmarks more reliable
On March 31, 2026, Google Research scientists Flip Korn and Chris Welty published a study on machine learning model evaluation. Addressing a widespread problem in AI benchmarking, the tendency to ignore disagreement among human annotators, the study proposes an evaluation framework based on "gold" ratings data. By optimizing the balance between the number of rated items and the number of raters per item, the framework aims to produce AI benchmarks that are highly reproducible and that reflect the genuine complexity of human opinion.
The challenge: Ignoring human disagreement undermines evaluation reliability
The study notes that reproducibility in machine learning depends on consistent experimental conditions and results, while the "ground truth" behind many benchmarks relies on human annotation. Human opinion is naturally diverse, yet existing work tends to ignore this disagreement, partly because annotation budgets are limited: adding raters to each item significantly raises costs. As a result, the field typically assigns only 1 to 5 raters per item in the hope of finding a single "correct" answer. The study argues that this practice provides neither enough breadth to see the big picture nor enough depth to understand the nuance of human opinion.
The core experiment: Finding the optimum between breadth and depth
To address this, the team designed a simulation experiment asking whether, under a fixed budget, one should prioritize the number of rated items (the breadth, or "forest", strategy) or the number of raters per item (the depth, or "tree", strategy). The team built an open-source simulator and tested thousands of item-count/rater-count combinations at scale on several human-annotated datasets involving subjective tasks such as toxicity and hate speech detection, searching for the configurations with the highest statistical reliability, i.e., the most reproducible ones.
Key findings: Breaking the one-size-fits-all annotation convention
The study reaches three conclusions that challenge current machine learning evaluation practice:
- The usual number of raters is not enough: The common practice of assigning 1, 3, or 5 raters per item is often insufficient. To obtain results that are truly reliable and that reflect the nuance of human opinion, more than 10 raters per item are typically needed.
- The evaluation goal determines the optimal strategy: There is no universal "perfect" ratio; the optimal balance depends entirely on the evaluation goal:
  - If the goal is to measure whether a model matches the human "majority vote", the breadth strategy is better: adding items helps more than adding raters.
  - If the goal is to capture the full distribution of human opinion (for example, distinguishing a "maybe" from a "yes"), the depth strategy is more effective: adding raters is the only way to capture the "total variation" of human opinion.
- Efficient designs are within reach: Encouragingly, an unlimited budget is not required. With the annotation configuration properly optimized for the chosen evaluation goal, highly reproducible results can be achieved with a modest budget of roughly 1,000 total annotations. If the balance is wrong, however, even a larger research budget can yield unreliable conclusions.
Why it matters: Advancing the AI evaluation paradigm
This research matters for building reliable future AI systems. AI evaluation has long followed a "single truth" paradigm in which every input has one "correct" label. But as AI moves into more subjective territory, such as ethics, recognizing harmful intent, and judging the character of social interactions, that paradigm falls short. By moving from an emphasis on breadth alone toward depth as well, the field can build benchmarks that genuinely reflect the complexity and plurality of human perspectives. The study gives practitioners a roadmap for designing better, more reproducible tests, and it shows that understanding why humans disagree is as important as knowing where they agree, providing mathematical tools that capture both.
(This research was carried out in collaboration with PhD student Deepak Pandita and Prof. Christopher Homan of the Rochester Institute of Technology.)
English source:
Building better AI benchmarks: How many raters are enough?
March 31, 2026
Flip Korn and Chris Welty, Research Scientists, Google Research
We introduce an evaluation framework for ML models, based on “gold” ratings data, that optimizes the trade-off between the number of items and raters per item, providing a roadmap for building highly reproducible AI benchmarks that capture the nuance of human disagreement.
In machine learning, reproducibility measures how easy it is to repeat the same experiments — using the same code, data/distribution, and settings — and get the same results. A high level of reproducibility enables trust between teams and allows them to build on each other’s progress.
The challenge with reproducibility is that ground truth data usually relies on humans; and humans, unlike machines, approach all problems from a variety of perspectives and often disagree on the result. Surprisingly little research has studied the impact of effectively ignoring human disagreement, which is a common oversight in AI benchmarking. One reason for the lack of research is that budgets for collecting human-backed evaluation data are limited, and obtaining more samples from multiple raters for each example greatly increases the per-item annotation costs.
In “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” we investigate the reproducibility trade-off between the ratio of items being rated to the number of human raters per item. Is it better to have fewer raters for many items or many raters for fewer items? Think of this as a question between breadth and depth. The breadth (i.e., the forest) approach asks 1,000 different people to each try one meal at a restaurant to get an overall sense of quality. The depth (tree) approach asks 20 people to try the same 50 meals, revealing more about specific dishes, which might influence the overall rating.
Historically, AI evaluation has leaned toward the forest approach. Most researchers settle for 1 to 5 raters per item, assuming this is enough to find a single "correct" truth. Our research suggests this standard is often insufficient at capturing natural disagreement, and we provide a roadmap for building more reliable and cost efficient AI benchmarks.
The experiment: Simulating the budget
Subjectivity undermining empirical benchmarking is the primary challenge to reproducibility. If two different researchers run the same evaluation and get different results, the research isn't reproducible. To find the optimal balance between the number of items being rated and the number of raters per item, we developed a simulator based on real-world datasets that involve subjective tasks like toxicity and hate speech detection.
We essentially conducted a massive "stress test" to find the most efficient way to spend a given research budget (e.g., measured in cost, time, etc.). We changed two main levers to see which yielded the most reliable results:
- The Scale (N): Total number of items being rated (ranging from a small budget of 100 to a large budget of 50,000).
- The Crowd (K): How many people look at the same item (ranging from 1 person to a crowd of 500).
We used a simulator to test thousands of such combinations across various scales to see which configurations were the most statistically reliable (with p < 0.05) — and thus reproducible.
To support the broader community, we have open-sourced this simulator on GitHub.
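To make the setup concrete, here is a minimal sketch of the resampling idea, not the released simulator itself. It assumes a `ratings` dict mapping each item to its full list of binary human labels (all names are illustrative), draws two independent (N, K) replicates of the same study, and checks how often they agree:
```python
import random

def replicate_metric(ratings, n_items, k_raters, rng):
    """One simulated study: sample N items and K labels per item,
    return the fraction of items whose majority vote is positive."""
    items = rng.sample(list(ratings), n_items)
    positives = 0
    for item in items:
        labels = rng.sample(ratings[item], k_raters)
        if 2 * sum(labels) > len(labels):  # majority of sampled raters said 1
            positives += 1
    return positives / n_items

def reproducibility(ratings, n_items, k_raters, trials=200, tol=0.05, seed=0):
    """Fraction of trials in which two independent replicates of the same
    (N, K) design agree to within `tol` -- a simple reproducibility proxy."""
    rng = random.Random(seed)
    agree = 0
    for _ in range(trials):
        a = replicate_metric(ratings, n_items, k_raters, rng)
        b = replicate_metric(ratings, n_items, k_raters, rng)
        agree += abs(a - b) <= tol
    return agree / trials

# Sweep (N, K) designs that all spend the same budget of N * K annotations:
# for n, k in [(1000, 1), (200, 5), (100, 10), (20, 50)]:
#     print(n, k, reproducibility(ratings, n, k))
```
The released simulator additionally applies significance testing (p < 0.05) across replicates; this sketch only illustrates how a fixed budget can be split between items and raters.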
Datasets
We use multiple datasets, each comprising various categories with several responses per item:
- The Toxicity dataset consists of 107,620 social media comments labeled by 17,280 raters.
- The DICES Diversity in Conversational AI Evaluation for Safety dataset consists of 350 chatbot conversations rated for safety by 123 raters across 16 safety dimensions.
- D3code is a large crosscultural dataset comprising 4,554 items, each labeled for offensiveness by 4,309 raters from 21 countries and balanced across gender and age.
- Jobs is a collection of 2,000 job-related tweets labeled by 5 raters each. The raters answer 3 questions about each tweet, and the corresponding sets are denoted by JobsQ1/2/3. The categories in JobsQ1/2/3 represent the point of view of job-related information, employment status, and job transition events, respectively.
Using these datasets, we also tested what happens when the data is "messy". For instance, if 99% of emails are spam and only 1% are important (indicating high data skew), does that change the optimal rater distribution (breadth vs. depth)? In addition, we also explored the effect of having more data categories, e.g., toxicity tags, such as toxic, mildly offensive, neutral, etc.
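As a rough illustration of that skew condition, the following hypothetical helper (not from the paper; names are illustrative) generates synthetic per-item rating sets with a controllable positive rate and rater noise, which could then be fed into an (N, K) sweep like the one sketched above:
```python
import random

def synthetic_ratings(num_items, raters_per_item, pos_rate=0.01, noise=0.1, seed=0):
    """Each item gets a latent binary label (1 with probability `pos_rate`);
    each rater independently flips it with probability `noise`, producing
    rating sets with both class skew and human-style disagreement."""
    rng = random.Random(seed)
    ratings = {}
    for i in range(num_items):
        latent = 1 if rng.random() < pos_rate else 0
        ratings[i] = [latent if rng.random() > noise else 1 - latent
                      for _ in range(raters_per_item)]
    return ratings

# 99% negative / 1% positive, as in the spam example above:
skewed = synthetic_ratings(num_items=5000, raters_per_item=20, pos_rate=0.01)
```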
Key findings: One size does not fit all
Our study revealed three major insights that challenge the status quo of machine learning evaluation:
- The "standard" of 3–5 raters is not enough: Our results show that the common practice of using 1, 3 or 5 raters per item is often insufficient. This “low-rater” approach does not provide enough breadth to see the big picture, nor enough depth to understand the nuance of human opinion. To achieve truly reliable results that reflect human nuance, practitioners often need more than 10 raters per item.
- The metric dictates the strategy: There is no "perfect" ratio. Instead, the optimal trade-off depends entirely on what is being measured:
  - Accuracy – the majority vote: If the goal is simply to see if a model matches the "majority vote" of humans, the forest approach is generally better. Adding more items helps more than adding more raters.
  - Nuance – the range of opinions: If the full range of human opinion is desired — accounting for the fact that a "maybe" is different from a "yes" — the tree approach is more effective. Increasing the number of raters is the only way to capture the "total variation" of human thought (both targets are made concrete in the sketch after this list).
- Efficiency is within reach: The most encouraging finding is that one doesn’t need an infinite budget. We found that by optimizing the ratings-per-item ratio correctly based on the chosen metric, one can achieve highly reproducible results with a modest budget of around 1,000 total annotations. However, choosing the wrong balance can lead to unreliable conclusions, even with an increase in research budget.
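To make the two evaluation targets concrete, here is a small illustrative sketch (assumed names, not the paper's code): the majority vote is the breadth-friendly target, while total variation distance compares a model's predicted label distribution against the full empirical spread of human opinion on an item.
```python
from collections import Counter

def majority_vote(labels):
    """The single label humans chose most often."""
    return Counter(labels).most_common(1)[0][0]

def label_distribution(labels):
    """Empirical distribution of human opinions on one item."""
    counts = Counter(labels)
    return {lbl: c / len(labels) for lbl, c in counts.items()}

def total_variation(p, q):
    """Total variation distance: TV(p, q) = 0.5 * sum_x |p(x) - q(x)|."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

human = ["toxic", "toxic", "neutral", "mildly offensive", "toxic"]
model = {"toxic": 0.5, "neutral": 0.3, "mildly offensive": 0.2}

print(majority_vote(human))  # 'toxic' -- the breadth-friendly target
print(total_variation(label_distribution(human), model))  # 0.1 -- the depth target
```
Note why the depth metric demands more raters: with only K raters per item, the empirical distribution is quantized to multiples of 1/K (with K = 3 the only possible proportions are 0, 1/3, 2/3, and 1), so a faithful picture of the opinion spread is impossible at small K no matter how many items are rated.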
Why this matters for the future of AI
This research is vital for the future of reliable AI. For years, the field has operated under a "single truth" paradigm — the idea that for every input, there is one "right" label. But even when a single ground truth exists, it may not be possible to measure it. And as AI moves into more subjective areas such as ethics, where it must identify concepts like harmful intent or the character of a social interaction, that paradigm breaks down.
By moving away from the “forest” and embracing the “tree”, we can build benchmarks that actually reflect the complexity and different perspectives that lead to the natural disagreement found in the human world. This roadmap allows practitioners to design better, more reproducible tests without overspending. Ultimately, understanding why humans disagree is just as important as knowing where they agree, and our research provides the mathematical tools to capture both.
Acknowledgements
This work owes much credit to our collaborators PhD student Deepak Pandita and Prof. Christopher Homan at RIT.