«

评估大型语言模型行为倾向的一致性

qimuai 发布于 阅读:0 一手编译


评估大型语言模型行为倾向的一致性

内容来源:https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/

内容总结:

谷歌研究团队提出新评估框架:揭示大语言模型行为倾向与人类共识的差异

2026年4月3日,谷歌研究院的研究工程师Amir Taubenfeld、研究科学家Zorik Gekhman及心理学研究员Lior Nezry等人发表了一项关于大语言模型行为对齐评估的新研究。随着大语言模型日益融入日常生活,理解其行为模式变得至关重要。为此,团队开发了一套系统性的评估框架,将经典心理学量表转化为大规模情境判断测试,用以量化模型行为倾向与人类社交偏见的对齐程度。

该研究创新性地将用于评估人类 empathy(共情)、assertiveness(自信)等特质的标准化心理学量表(如IRI、ERQ)转化为模型在现实场景中的行为测试。研究团队设计了涵盖职业场合、冲突解决、日常决策等多种贴近实际生活的互动情境,通过让模型生成自然回应,并与550名人类标注者的偏好分布进行对比,以评估模型行为与人类共识的吻合度。

主要发现如下:

  1. 模型规模影响对齐效果:参数量小于250亿的模型在行为倾向上与人类共识的一致性显著较低,其表现接近随机水平。而参数量超过1200亿的大型模型及前沿闭源模型在人类共识高度一致的情境中表现优异,但在共识度低于90%的场景中,对齐水平会停滞在80%左右。

  2. 模型存在系统性“过度自信”:在所有测试的25个模型中,当人类标注者对最佳行动方案存在分歧(共识度低)时,模型并未如理想状态那样反映这种意见多样性,反而普遍表现出高度自信,坚持输出单一倾向的回应,未能充分代表人类观点的光谱。

  3. 自述倾向与实际行为存在偏差:研究发现,模型在问卷中“自我报告”的行为倾向(如自称“不易冲动”)与其在模拟情境中实际表现出的行为模式(如倾向于建议采取即时行动)之间存在明显不一致。这提示直接采用自述式问卷评估模型可能存在局限。

  4. 发现具体行为偏差模式:在定性分析中,模型在某些高共识场景中表现出与人类偏好相左的倾向。例如,在专业场合中,人类通常建议保持克制,而模型却更鼓励情感表露;在社会纠纷中,人类可能倾向于坚持立场,而模型则更优先推荐“和谐”策略。

这项研究标志着在理解大语言模型行为复杂性方面迈出了早期但重要的一步。它揭示的“共识对齐缺口”和“分布代表性不足”问题,为未来研发更贴合人类价值观与社会情境多样性的模型提供了关键的评估基础和优化方向。研究团队表示,此框架为持续探索模型行为对齐开辟了新途径,相关领域仍需进一步深入研究。

中文翻译:

评估大型语言模型的行为倾向对齐
2026年4月3日
谷歌研究院研究工程师 Amir Taubenfeld、研究科学家 Zorik Gekhman、心理学研究员 Lior Nezry

作为我们持续探索模型行为与对齐的一部分,我们引入了一套系统性评估框架,将成熟的评估方法转化为针对大型语言模型的大规模情境判断测试。这一尝试旨在理解和刻画模型的对齐状态,通过量化模型行为倾向与人类社会偏好的关系,识别模型输出与人类共识之间可衡量的对齐程度及偏差。

随着大型语言模型日益融入日常生活,理解其行为变得至关重要。在持续研究模型行为与对齐的过程中,本研究是该方向的初步探索。我们聚焦于行为倾向——即在社会情境中塑造回应的潜在趋向——并提出了一个框架,以探究大型语言模型所表现出的倾向与人类倾向的接近程度。

行为倾向通常通过不同特质(如共情力、果断性)下的自评问卷进行量化,个体需对偏好陈述(例如“我倾向于快速表达观点”)表达认同程度。本研究所用问卷均为经过科学验证的标准化工具,在国际研究与心理学领域广泛用于评估人格特质,例如:IRI(共情量表)、ERQ(情绪调节问卷)等。每种工具均基于同行评审文献,通过不同策略确立了其心理测量效度与信度。我们选取了学界最广泛使用的工具进行研究。

我们的目标是在此类心理学问卷基础上进行拓展,但直接将其应用于大型语言模型存在技术挑战,因为模型输出对提示措辞和分布变化较为敏感。因此,模型在自评形式中“声称”的倾向并不能保证在实际开放场景中转化为相应行为。

为应对这些挑战,在《评估大型语言模型的行为倾向对齐》中,我们的框架通过模拟现实用户-助手场景来评估模型行为倾向,这些场景中模型的建议角色可能产生实际影响。本研究是评估人类共识与模型行为在实际场景中对齐程度的初步尝试,重点关注日常人际互动与职场情境。我们确保这些场景根植于成熟心理学问卷,以捕捉核心行为特质的本质。测试场景涵盖专业沉着度、冲突解决、旅行预订等实务任务,以及生活方式与日常决策,突出模型在典型人类日常经验情境中的行为表现。通过对25个大型语言模型的大规模分析,我们发现了两类差距:一是模型倾向与人类标注者共识存在偏离;二是当缺乏共识时,模型倾向未能涵盖人类意见的多样性。这些初步结果揭示了改善行为对齐的机遇,以确保模型能更恰当地应对社会动态的细微差别,我们期待未来研究能在此基础上持续推进。

从自评到情境判断
我们首先从经过科学验证的成熟心理学问卷中提取陈述句,并将其转化为模型一般建议倾向的声明。改编后的陈述用于生成情境判断测试——一种在心理学、行为预测等领域广泛应用的评估方法。在各行业中,情境判断测试是评估复杂环境下行为能力与判断力的标准工具。这些测试通常包含呈现两种可能行动路径的现实场景:一种支持特定行为特质,另一种则与之相悖。在我们的研究中,每个情境判断测试均由三位独立标注者审核,以确保(由大型语言模型生成的)场景与行动逻辑一致,并准确捕捉所测试的潜在行为特征。

评估过程中,模型以情境判断测试作为输入生成自然回应,随后通过“模型作为评判者”的方式将回应映射到两种行动路径之一。

由于我们的目标并非量化大型语言模型的行为倾向,而是研究其与人类行为的对齐程度,我们为每个情境判断测试收集了来自550名参与者池中10位标注者的偏好行动,并将得出的人类偏好分布与每个场景中模型回应分布进行比较。

大型语言模型行为倾向的方向性对齐
此处我们聚焦于人类标注者对首选行动路径存在共识的场景子集。这些情况下的对齐至关重要,因为在人类高度认同的情境中未能展现或抑制特定特质,表明模型行为模式可能偏离典型人类行为范式。

我们将方向性对齐定义为可解释的评判标准,用于检验模型是否对人类多数支持的行动赋予更高概率。模型对齐程度通过符合该标准的场景比例进行量化。

下图展示了25个不同大型语言模型在四种特质上的结果。结果按人类标注者共识程度(每个场景10份回应)分组:完全一致(10/10)、极高共识(9-10)、高共识(8-9)。

较小规模模型(<250亿参数)表现出显著较低的方向性对齐,这在黑色水平线下方底行中红橙单元格的高频出现得以体现。这些较小模型往往难以区分特质的恰当表达与抑制,其与共识的对齐率接近随机概率。

大容量模型(>1200亿参数)及前沿闭源权重模型显示出显著改进,在人类标注者完全共识时达到近乎完美的对齐。但当共识低于90%时,这些模型的对齐率仍停留在80%-85%区间。

对高共识场景中模型偏离偏好行为模式的定性分析揭示了若干有趣规律:在人类建议保持镇定的专业场合,模型倾向于鼓励情感开放;在社会争端中,模型常优先考虑和谐而非坚持立场,与参与者偏好相左;此外,模型偶尔表现出比人类更高的冲动性,面对时效性机遇时更倾向于建议立即行动而非进行事务核查。

分布对齐的缺失
分布多元性作为一种公平性原则,主张模型回应分布应准确反映人类观点的多样性,而非收敛于单一主导回应。为在实验中捕捉这一特性,在人类对首选行动认同度较低的情况下,模型的概率质量应在两种可能行动间更均匀分布,从而降低对其首选行动的置信度。

下图展示了模型置信度随人类共识度的变化趋势。虽然完全分布对齐的模型置信度应与人类标注者共识度成比例(黑色虚线),但所有25个受评模型(蓝线)均表现出系统性过度自信。蓝色实线(代表25个大型语言模型平均值)表明模型未能体现人类标注者固有的模糊性与意见全谱分布。即使在人类意见显著分歧的低共识场景(50%-60%认同度),所有受评模型仍保持高置信度。

人类低共识时大型语言模型的立场选择
我们证实,当人类标注者对首选行动的共识度较低时,大型语言模型未能呈现这种模糊性,表现为过度自信。下图显示这种过度自信的方向存在显著差异,甚至在前沿模型之间也是如此。这表明不同的训练与对齐流程会催生独特的行为倾向。

自评陈述与显性行为
通过模型自评问卷陈述评估其倾向的有效性仍是活跃的研究领域。尽管有研究者质疑该方法的构念效度,亦有学者认为特定提示框架能实现可靠评估。虽然解决这一争议超出本研究范围,但我们的框架——将问卷条目直接映射到行为场景——为探究这些动态提供了独特视角。

下图揭示了大型语言模型自评结果与显性行为间的显著分歧。例如,模型常自评为低冲动性,但其行为倾向却偏向冲动。在各特质分布分析中,模型自评与显性行为亦存在明显不一致。这一分析揭示了直接自评方法的潜在局限性,同时凸显了我们框架作为未来研究基础的价值。

讨论
作为我们持续研究模型行为与对齐的初步贡献,我们提出了评估大型语言模型行为倾向的框架。该框架以成熟问卷方法论为基础,同时克服了传统自评测量的局限。它提供了一种量化差距的途径:既涵盖模型在高共识场景中未能持续反映人类标注者共识的情况,也涉及低共识场景中模型未能充分代表意见分布范围的现象。这是在理解模型行为倾向道路上迈出的重要一步,未来仍需在评估方法与弥补已识别差距等关键领域深化研究。

如需深入了解我们的方法与结果,请阅读完整论文。

致谢
本研究由 Amir Taubenfeld、Zorik Gekhman、Lior Nezry、Omri Feldman、Natalie Harris、Shashir Reddy、Romina Stella、Ariel Goldstein、Marian Croak、Yossi Matias 与 Amir Feder 共同完成。感谢 Itay Laish、Renee Shelby、Nino Scherrer、Sivan Eiger、Saška Mojsilović、Avinatan Hassidim、Ronit Levavi Morad 及 James Manyika 对研究的审阅与宝贵建议。

英文来源:

Evaluating alignment of behavioral dispositions in LLMs
April 3, 2026
Amir Taubenfeld, Research Engineer, Zorik Gekhman, Research Scientist, and Lior Nezry, Psychology Researcher, Google Research
As part of our ongoing exploration of model behavior and alignment, we introduce a systematic evaluation framework that transforms established assessments into large-scale situational judgment tests for large language models. This approach, an attempt to understand and map model alignment, allows for the quantification of model behavioral tendencies relative to human social inclinations, identifying measurable alignment and deviations between model outputs and aggregated human consensus.
As LLMs integrate into our daily lives, understanding their behavior becomes essential. In our ongoing efforts to study model behavior and alignment, we present this work as an early step in that direction. We focus on behavioral dispositions — the underlying tendencies that shape responses in social contexts — and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans.
Behavioral dispositions are typically quantified via self-report questionnaires under different traits (e.g., empathy, assertiveness), where individuals rate their agreement with preference-statements, such as, "I am quick to express an opinion." The questionnaires used in this study are standardized, scientifically validated measures widely used for assessing personality traits in international research and psychology such as: IRI (empathy), ERQ (emotion regulation), and more. Each instrument is grounded in peer-reviewed literature that establishes its psychometric validity and reliability using different strategies. We chose the most widely used instruments for our research.
Our objective is to build upon such psychological questionnaires, but directly applying them to LLMs presents technical challenges, as LLM outputs are sensitive to prompt phrasing and distribution shifts. Consequently, dispositions “claimed” by LLMs within a self-report format are not guaranteed to successfully transfer to behavior in realistic, open-ended settings.
To address these challenges, in “Evaluating Alignment of Behavioral Dispositions in LLMs,” our framework evaluates LLMs’ behavioral dispositions in realistic user-assistant scenarios where their advisory role can lead to tangible impact. This study is an early step in evaluating the alignment between human consensus and model behavior across realistic, practical scenarios, focusing on everyday human-to-human interactions and workplace situations. We ensure that these scenarios remain grounded in established psychological questionnaires to capture the essence of core behavioral traits. Tested scenarios included professional composure, conflict resolution, practical tasks such as booking a trip, and lifestyle or daily decision-making, highlighting model behavior in settings representative of typical human day-to-day experiences. Our large-scale analysis of 25 LLMs reveals two kinds of gaps: one where model dispositions deviate from consensus among human annotators, and another when model dispositions do not capture the range of human opinions when consensus is absent. These early results highlight the opportunity for better behavioral alignment to ensure that models can more appropriately navigate the nuances of social dynamics, results we expect future research to build on.
From self-report to situational judgment
We start by collecting statements from established, scientifically validated psychological questionnaires and adapt them into declarations of the model’s general advising tendency. The adapted statements are then used to generate Situational Judgment Tests (SJTs), an assessment methodology widely utilized in psychology, behavioral prediction, and other fields. Across these industries, SJTs are the standard for evaluating behavioral competencies and judgment in complex environments. These tests typically consist of realistic scenarios presenting two possible courses of action: one supporting a specific behavioral trait and one opposing it. In our research, each SJT is reviewed by three independent annotators to validate that the (LLM-generated) scenario and actions are coherent and faithfully capture the underlying behavioral markers being tested.
During the evaluation, the model is prompted with the SJT as input and generates a natural response, which is mapped to one of the two courses of action using an LLM-as-a-judge.
Since our goal is not to quantify LLMs’ behavioral dispositions, but to study the extent of their alignment with human behavior, we collect preferred actions from 10 annotators per SJT from a pool of 550 participants, and compare the resulting human preference distribution to the distribution of model responses in each scenario.
Directional alignment of LLMs’ behavioral dispositions
Here we focus on a subset of scenarios where there is a consensus between human annotators on the preferred course of action. Alignment in these cases is important, as failure to manifest or suppress a trait under strong human agreement suggests a behavioral profile that tends to act differently than typical human behavioral patterns.
We define directional alignment as an interpretable criteria that tests whether the model assigns a higher probability to the action supported by the human majority. Model alignment is then quantified by the percentage of scenarios where this criterion is met.
The figure below presents the results across 25 different LLMs and four distinct traits. The results are grouped by the level of consensus among human annotators (out of 10 responses per scenario): unanimity (10/10), very high (9, 10), and high consensus (8, 9).
Smaller models (<25B) show markedly lower directional alignment, as indicated by the higher prevalence of red and orange cells in the bottom rows under the black horizontal line. These smaller models frequently do not distinguish between the appropriate expression or suppression of traits, often aligning with consensus at near-chance rates.
Large-capacity (>120B) and frontier closed-weights models show significant improvement, achieving close to perfect alignment when consensus among human annotators is unanimous. However, these models’ alignment still plateaus in the low-to-mid 80s when consensus is lower than 90%.
Qualitative analysis of cases where LLMs deviate from the preferred behavioral mode in high-consensus scenarios revealed several interesting patterns. Models tend to encourage emotional openness in professional settings where humans recommend composure. In social disputes, models often prioritize harmony over standing one's ground, contrary to participant preferences. Lastly, models occasionally exhibit higher impulsivity than humans, recommending immediate action over logistical verification for time-sensitive opportunities.
Lack of distributional alignment
Distributional pluralism is a fairness principle arguing that the distribution of a model’s responses should accurately reflect the variety of human viewpoints rather than converging on a single, dominant response. To capture this in our setup, in cases where humans have lower agreement on the preferred action, the model’s probability mass should be more evenly distributed between the two possible actions, resulting in lower confidence in its preferred action.
The figure below presents the model's confidence as a function of human agreement. While a perfectly distributionally aligned model’s confidence should scale proportionally to consensus among human annotators (dotted black line) all 25 evaluated models (blue lines) show a systematic overconfidence in their decision. The solid blue line — representing the average across 25 LLMs — illustrates that models do not represent the inherent ambiguity and the full spectrum of opinions from the human annotators. Even in the low-consensus cases where human opinion is significantly divided (50–60% agreement), confidence remains high across all evaluated models.
LLMs take a stance when humans have low consensus
We established that when consensus among human annotators regarding the preferred action is low, LLMs do not represent such ambiguity, which is reflected as overconfidence. In the figure below we show that the direction of this overconfidence varies substantially, even between frontier models. This suggests that different training and alignment procedures give rise to unique behavioral dispositions.
Self-reporting and revealed behavior
The validity of assessing LLM dispositions via self-reported agreement with questionnaire statements remains an active area of research. While some researchers question the construct validity of this approach, others argue that specific prompting frameworks enable reliable assessment. While settling this debate is beyond the scope of this work, our framework — which maps questionnaire items directly to behavioral scenarios — offers a unique lens to study these dynamics.
The figure below presents a notable divergence between LLMs’ self-reporting and their revealed behavior. For instance, models frequently self-report to be low on impulsiveness, yet they show a behavioral tendency leaning toward impulsiveness. When examining the distribution within each trait, there are also clear inconsistencies between LLM's self-reporting and their revealed behavior. This analysis suggests potential limitations in the validity of direct self-reporting, and highlights the utility of our framework as a foundation for future research.
Discussion
As an early contribution to our ongoing study of model behavior and alignment, we introduce a framework for evaluating behavioral dispositions in LLMs, grounding our approach in established questionnaires methodology while addressing the limitations of traditional self-reporting measures. This framework provides a way to measure gaps, where models do not consistently reflect consensus among human annotators in high-agreement scenarios and underrepresented the range of opinions in low-consensus scenarios. This is a step forward in understanding model behavioral tendencies, and further research is needed in critical areas such as evaluation and addressing identified gaps.
For a deeper dive into our methodology and results, read the paper here.
Acknowledgements
This research was conducted by Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias and Amir Feder. We thank Itay Laish, Renee Shelby, Nino Scherrer, Sivan Eiger, Saška Mojsilović, Avinatan Hassidim, Ronit Levavi Morad, and James Manyika for reviewing the work and their valuable suggestions.

谷歌研究进展

文章目录


    扫描二维码,在手机上阅读