Towards developing future-ready skills with generative AI

Source: https://research.google/blog/towards-developing-future-ready-skills-with-generative-ai/
Summary:
Google releases an experimental AI skills-assessment platform to help cultivate core future-ready competencies
On April 13, 2026, Google Research scientist Gal Elidan and senior product manager Yael Haramaty announced that Vantage, their team's generative-AI skills-assessment experiment, is now live on Google Labs. Validation conducted with New York University found its AI scoring to be on par with human experts.
As AI evolves rapidly, "future-ready" skills, the durable human competencies that remain valuable regardless of technological change, are becoming a global focus in education. The OECD Learning Compass 2030 and the WEF Future of Jobs Report both list critical thinking, collaboration, and creative thinking as key competencies, yet these skills have long been hard to measure with traditional standardized tests.
Vantage builds a dynamic, simulated conversational environment: learners collaborate with multiple AI avatars on open-ended tasks while an Executive LLM steers the conversation in real time against a predefined assessment rubric and dynamically introduces targeted challenges. When a task ends, an AI Evaluator analyzes the transcript against the same rubric and produces a visual skill map and qualitative feedback for the learner.
In a validation study with New York University involving 188 testers aged 18-25 on collaboration-skill tasks, the AI Evaluator's agreement with human expert scores matched the level of agreement between the experts themselves. In a separate creativity study with the startup OpenMic, the AI's scores on 180 multimedia creative works also closely matched those of human experts.
The work points toward a measurable "skills layer" for education systems that could be integrated with traditional subject teaching. Teachers could design new kinds of assignments, for example having students debate social-science topics with AI avatars and receive immediate feedback on collaboration and critical thinking alongside their subject knowledge.
The Google team says future research will focus on how skills practiced in simulation transfer to real-world settings and on performance differences across cultural contexts, pushing the assessment toward greater inclusiveness and fairness. The experiment signals how AI is reshaping educational assessment, making it possible to measure hard-to-quantify core competencies scientifically and at scale.
English source:
Towards developing future-ready skills with generative AI
April 13, 2026
Gal Elidan, Research Scientist, and Yael Haramaty, Senior Product Manager, Google Research
Our new research demonstrates a novel approach to assess “future-ready” skills using GenAI. The results of our study with New York University found the AI scoring to be on par with human experts. This research experiment, Vantage, is now available on Google Labs.
As AI evolves at an unprecedented pace, there is a renewed focus on "future-ready" skills: the durable human competencies that will remain valuable regardless of technological shifts or automation. International frameworks, such as the OECD Learning Compass 2030 and the WEF's Future of Jobs Report, have identified a set of priority skills, both highlighting the same core competencies, including critical thinking, collaboration, and creative thinking. These skills were considered essential long before the rise of AI, and they are now more critical than ever.
Today we are sharing Vantage, a research experiment for assessing future-ready skills by leveraging generative AI to create conversations in simulated environments. Developed in partnership with pedagogy experts and researchers from New York University, Vantage is designed to offer high school and college students a sandbox environment for practice and validated assessment, built with the same systematic methodology traditionally used for core academic subjects, such as math or science. Vantage is now available in English for sign up on Google Labs.
Measuring what's difficult to measure
At the heart of any effective learning process are feedback and assessment, both essential for individual growth and effective teaching. In global education systems, it is often the case that what is measured is what is taught.
Future-ready skills, however, are notoriously hard to measure. Typical tests are too rigid to capture people's thought processes and interactions, and they are far removed from how these skills are used in the real world. While testing these skills in real human interactions would be ideal, it is too resource-intensive and hard to standardize and grade consistently across many students. For instance, how would you fairly assess conflict resolution if a group never disagrees, or the ability to build creatively on each other's ideas if the group settles on the first one that comes up?
Our research team set out to discover how to assess students’ future-ready skills using a scalable, validated approach that could empower educators to align lessons with these skills and support student growth.
Assessing skills with an AI simulated team
The experimental setup in Vantage places learners in dynamic, multi-party conversations with AI avatars working together to complete tasks. This setup allows us to control the assessment environment while simulating interactions that are more authentic and representative of real-world scenarios than existing standardized tests. It provides a sandbox to navigate complex interpersonal and situational challenges.
As users interact with AI avatars in open-ended scenarios, such as preparing for a debate or pitching a creative vision, an Executive LLM uses a provided assessment rubric to steer the AI avatars toward an effective assessment. The Executive LLM constantly analyzes the state of the conversation to dynamically introduce specific challenges — such as pushing back on an idea or introducing a conflict — providing the learner with targeted opportunities to demonstrate their skills. As such, it acts as a next-generation adaptive assessment engine, steering the dialogue so that by the end of the conversation, the information needed for assessing the user has been gathered.
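To make that orchestration loop concrete, here is a minimal Python sketch of how an Executive-LLM-style controller might choose the next challenge from a rubric. The class names, prompt wording, and the `call_llm` stub are illustrative assumptions for this post, not the actual Vantage implementation.

```python
# Hypothetical sketch of an Executive-LLM-style steering loop; names and
# prompts are illustrative, not the actual Vantage implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricCriterion:
    skill: str        # e.g., "conflict resolution"
    evidence: str     # what the transcript must show to count as evidence
    observed: bool = False

@dataclass
class ConversationState:
    transcript: List[str] = field(default_factory=list)
    rubric: List[RubricCriterion] = field(default_factory=list)

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-model call; returns a canned avatar reply.
    return "Avatar: I'm not sure that plan fits our deadline. Can you walk us through it?"

def next_avatar_turn(state: ConversationState) -> str:
    """Steer the avatars toward the first rubric criterion still lacking evidence."""
    pending = [c for c in state.rubric if not c.observed]
    if not pending:
        return "Avatar: I think we have what we need. Let's wrap up."
    target = pending[0]
    prompt = (
        "You orchestrate AI teammates in a group task.\n"
        "Transcript so far:\n" + "\n".join(state.transcript) + "\n"
        f"Introduce a challenge (for example, push back on an idea) that gives "
        f"the user a targeted opportunity to demonstrate {target.skill}: "
        f"{target.evidence}. Reply as one teammate, in character."
    )
    return call_llm(prompt)

state = ConversationState(
    transcript=["User: Let's split the work and start from my outline."],
    rubric=[RubricCriterion("conflict resolution", "responds constructively to pushback")],
)
print(next_avatar_turn(state))
```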
Upon completion of the task, an AI Evaluator analyzes the conversation transcript against the same rigorous assessment rubric used by the Executive LLM to identify and measure specific evidence of skill application. The learner then receives a detailed skill map, consisting of a visual score and qualitative feedback specific to the skills they demonstrated during the conversation. This makes the "invisible" progress of human skill development visible and actionable.
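As an illustration of what such a skill map could contain, the sketch below shows one plausible per-criterion record of score, cited evidence, and qualitative feedback; the field names and values are hypothetical, not the actual Vantage schema or rubric.

```python
# One plausible (hypothetical) shape for the skill map returned after a task.
from dataclasses import dataclass
from typing import List

@dataclass
class SkillScore:
    skill: str            # rubric criterion, e.g., "conflict resolution"
    score: float          # normalized score shown as the visual score
    evidence: List[str]   # transcript excerpts the AI Evaluator cited
    feedback: str         # qualitative, skill-specific feedback

skill_map = [
    SkillScore(
        skill="conflict resolution",
        score=0.7,
        evidence=["User: I hear your deadline concern; let's cut scope on step 2."],
        feedback="Acknowledged the disagreement and proposed a compromise; could "
                 "check more explicitly whether teammates accept the trade-off.",
    ),
]
for s in skill_map:
    print(f"{s.skill}: {s.score:.1f} - {s.feedback}")
```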
Working with partners to validate how we assess skills
To ensure academic and pedagogical rigor, we established a research partnership with New York University. Together we surveyed common rubrics and aligned them to the tasks in question. The primary focus of this collaboration was to set up and validate the assessment approach. We did this through a joint study with 188 testers ages 18-25 from the US who completed Vantage tasks assessing sample collaboration skills: conflict resolution and project management. We looked at two main research questions:
- Can we steer a conversation to test specific skills?
A key innovation of Vantage is the use of the Executive LLM to enable adaptive assessment. We evaluated how effectively LLMs could steer a conversation to target one specific skill at a time, such as conflict resolution or project management. We measured the volume of skill-related information the user demonstrated for that skill and compared it against the same tasks run with independent AI avatars that were not steered. Our findings indicated that the Executive LLM successfully guides the dialogue to produce high-density information, yielding significantly more information about the assessed skills while maintaining a natural conversational flow. This capability proved consistent across multiple simulation tasks. Further results and details about the methodology can be found in the technical report.
- How accurately can LLMs score future-ready skills?
To test the accuracy of the AI Evaluator, we compared its scores against those of New York University raters using the same pedagogical rubrics. The results showed that the agreement between the AI Evaluator and human experts was similar to the agreement between the two expert raters. This suggests that the AI Evaluator's conversation ratings are comparable to those of human expert raters, establishing Vantage as an effective automated system for skill assessment.
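The post does not say which agreement statistic was used. Purely as an illustration, a common way to compare two raters' rubric scores is a weighted Cohen's kappa, sketched here with made-up scores.

```python
# Illustrative only: comparing two raters' rubric scores with quadratically
# weighted Cohen's kappa. The scores are made up; the post does not specify
# which agreement statistic the study actually used.
from sklearn.metrics import cohen_kappa_score

human_scores = [3, 2, 4, 3, 1, 4, 2, 3]   # expert rater, 1-4 rubric levels
ai_scores    = [3, 2, 3, 3, 1, 4, 2, 4]   # AI Evaluator on the same transcripts

kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")
print(f"weighted kappa: {kappa:.2f}")     # compare against the human-human kappa
```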
We also collaborated with OpenMic, a startup developing AI-powered tools for assessing durable skills. Together we conducted a joint study on creativity and English language arts, to test the AI Evaluator in another context. We analyzed 180 students’ work on creative multimedia tasks, such as character interviews and media articles related to English literature, and compared the AI Evaluator's scores with those of OpenMic's internal experts. Here too, there was a high correlation between the AI Evaluator and human experts, demonstrating the AI Evaluator’s ability to provide valid scoring even on complex, real-world creative tasks.
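Likewise, the exact correlation coefficient is not reported here; as an illustration only, paired AI and expert scores are typically compared with a Pearson or Spearman correlation, as in this sketch with made-up numbers.

```python
# Illustrative only: correlating AI Evaluator scores with expert scores on the
# same student works. The numbers are made up; the post does not report the
# specific coefficient used.
from scipy.stats import pearsonr, spearmanr

expert = [78, 85, 62, 90, 71, 88, 66, 74]   # expert scores per submission
ai     = [75, 88, 60, 92, 70, 85, 68, 77]   # AI Evaluator scores, same order

r, p = pearsonr(expert, ai)
rho, p_s = spearmanr(expert, ai)
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f}")
```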
Looking ahead towards integration in classrooms
In a school setting, this kind of simulated environment could pave the way for a measurable "skills layer" that sits atop existing school curricula and is integrated into academic tasks. This would enable educators to imagine new forms of assignments, for example, debating a social science topic with AI avatars or taking on the role of a team lead planning a laboratory experiment. Students could receive feedback on both their understanding of the subject matter (e.g., the science of the lab experiment) and their skills (e.g., the quality of their collaboration and critical thinking). This approach would complement existing group projects with other students and has the potential to support the development of academic knowledge and durable skills in tandem.
Enabling future-readiness at scale
This research explores how we might transform essential, future-ready, durable skills from hard-to-measure to measurable at scale. By doing so, a more inclusive and accurate representation of future readiness becomes possible. This experiment is a step towards an assessment approach more closely aligned with future needs.
We also hope that our new infrastructure will support further research and efficacy studies across the ecosystem. Researchers will now be able to assess not only the impact of new tools on knowledge retention but also their direct influence on skill development. The potential of such studies is significant, offering a greater understanding of how different pedagogical interventions shape human competencies over time.
Looking ahead, we are expanding our research to tackle the crucial question of transferability — how skills demonstrated in a simulated sandbox translate to real-world human interactions. Furthermore, recognizing that human skills are culturally situated, we will focus on exploring performance across diverse settings to ensure our technology is inclusive and equitable. Beyond assessment, the next phase is to move towards skill growth, deepening our understanding and measuring the efficacy of skill development through practice in simulated environments.
Acknowledgements
Shout out to the Google team members who have contributed to this work: Alon Harris, Alex Moy, Amir Globerson, Anisha Choudhury, Anna Iurchenko, Ayça Cakmakli, Ben Witt, Cathy Cheung, Diana Akrong, Elisabeth Bauer, Hairong Mu, Julia Wilkowski, Lev Borovoi, Lucile Martini, Maya Alva, Nir Kerem, Noa Kerrem Gilo, Preeti Singh, Rajvi Kapadia, Rena Levitt, Roni Rabin, Rotem Yulzary, Shashank Agarwal, Sophie Allweis, Tal Oppenheimer, Taylor Goddu, Tracey Lee-Joe, Tzvika Stein, Yaniv Carmel, Yishay Mor, Yoav Bar Sinai, and Yuri Lev. Thanks to our New York University collaborator Yoav Bergner and his team, and to our partners from OpenMic: Aviad Segal, Eliad Carmi, Hadas Gelbart, and Yael Bar Moshe. We are grateful for the insights from Cristine Legare at The University of Texas at Austin, and J.D. LaRock, the President and CEO of the Network for Teaching Entrepreneurship (NFTE). Special thanks to our executive champions: Niv Efron, Avinatan Hassidim, Amy Keeling, Katherine Chou, Yossi Matias, Ronit Levavi Morad, Chris Phillips and Ben Gomes.