A new framework for auditing machine unlearning

Posted by qimuai. Compiled from the original source.

Source: https://research.google/blog/new-framework-for-auditing-machine-unlearning/

Summary:

Google releases a new AI auditing framework that measures how well models "forget" and tackles the challenge of verifying privacy

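On June 10, 2026, a team led by Google Research scientist Mónica Ribero proposed a new method called Regularized f-Divergence Kernel Tests. It aims to determine, with higher sensitivity and accuracy, whether two sets of data come from entirely different underlying distributions, giving auditors a more reliable tool for verifying "machine unlearning" in AI models.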

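As AI models process ever larger and more sensitive datasets, rigorously verifying that a model has truly "forgotten" specific training data has moved from a theoretical ideal to a strict requirement. Regulations such as the "Right to be Forgotten" in the EU's General Data Protection Regulation (GDPR) make unlearning essential for compliance, and developers are now expected to prove privacy mathematically. However, auditors often cannot access a model's internals or its original training data, so they can only verify the system by querying it and analyzing its outputs.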

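Traditional verification relies on two-sample testing: comparing the outputs of a model that never saw a given record with those of a model claimed to have "forgotten" it. If the two sets of outputs differ in a statistically significant way, unlearning is judged to have failed. As models grow in size and complexity, however, this approach runs into serious problems. First, separating a real violation from the random noise inherent in large models requires a very large number of samples, which is computationally expensive. Second, models also differ in distribution for reasons unrelated to privacy, such as retraining with a different batch size, which easily leads to false positives.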

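In addition, recent work shows that perfect unlearning cannot be achieved merely by adjusting a model's current parameters: unless the original training path is fully retraced, the deleted information always leaves a permanent footprint. The traditional standard of perfect "retrain equivalence" is therefore fundamentally unattainable for standard local unlearning algorithms.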

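In response, the Google team proposes a "relative distance test". Rather than requiring the unlearned model to match a model retrained from scratch, it measures whether the unlearned model's distribution is closer to the safely retrained model or to the original, compromised model. The method is built as an adaptive statistical toolkit that uses f-divergences to pinpoint specific kinds of distribution shift, and it uses kernel regularization to estimate differences between high-dimensional, complex datasets efficiently.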

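Experiments show strong results across several tasks. In high-energy physics anomaly detection, the framework caught subtle anomalies that standard tests missed entirely. In differential privacy auditing, the hockey-stick divergence test detected a specific privacy violation with only a few thousand samples, where earlier techniques needed millions of samples to approximate the same detection rate.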

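In machine unlearning evaluation in particular, traditional two-sample tests wrongly flag every model that shows any distributional difference, including perfectly safe retrained models, as an unlearning failure. The newly proposed "relative three-sample test" overcomes this flaw. It correctly identifies the safe models and finds that only the random label method passes the evaluation, while popular methods such as finetuning, pruning, and Selective Synaptic Dampening fail to effectively forget the targeted data.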

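The team stresses that the framework offers a more precise, flexible, and mathematically rigorous tool for examining ML behavior, allowing researchers and auditors to statistically prove whether a model is behaving unsafely or leaking data across a wide range of complex distribution shifts. Going forward, the team plans to establish theoretically which divergence is optimal for a given task and to derive tighter sample complexity bounds to make audits more efficient. The work was carried out by Mónica Ribero together with Antonin Schrab and Arthur Gretton, and was presented at AISTATS 2026.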

English source:

June 10, 2026
Mónica Ribero, Research Scientist, Google Research
We introduce a method designed to confidently determine whether there is statistically significant evidence that two sets of data observations come from entirely different underlying distributions.
Machine unlearning allows AI systems to "forget" specific parts of their training data without the massive cost of retraining a model from scratch. This is essential for regulatory compliance (like GDPR’s "Right to be Forgotten"), AI safety, and model quality.
As models process increasingly massive and highly sensitive datasets, verifying machine unlearning has moved from theoretical ideal to a strict requirement, where developers must now mathematically prove privacy. However, because auditors often don’t have access to the model's internal workings or original training data, they must verify the system strictly by querying it and analyzing the output samples.
One method data scientists and researchers rely on for verification is two-sample testing, a statistical method that determines if two sets of data observations come from entirely different underlying distributions. For example, to verify unlearning, auditors might compare outputs from a model that never saw a specific record against a model that supposedly "forgot" it. If the outputs differ by a statistically significant margin at the chosen significance level, the unlearning is judged to have failed.
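As a rough illustration of the two-sample testing idea described above, the sketch below runs a permutation test on scalar model outputs. The statistic (absolute difference in means), the function name `permutation_test`, and the synthetic Gaussian "outputs" are all assumptions chosen for illustration; this is not the procedure used in the paper.

```python
import random
import statistics


def permutation_test(xs, ys, num_permutations=500, seed=0):
    """Return a permutation p-value for H0: xs and ys share one distribution.

    The statistic is the absolute difference in sample means, an
    illustrative choice that only detects shifts in the mean.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(xs) - statistics.fmean(ys))
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(num_permutations):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        stat = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if stat >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, which keeps the
    # false positive rate at or below the chosen level at any sample size.
    return (exceed + 1) / (num_permutations + 1)


rng = random.Random(1)
# Hypothetical outputs: a model that never saw the record ...
retrained = [rng.gauss(0.0, 1.0) for _ in range(200)]
# ... and a supposedly unlearned model whose outputs are shifted.
unlearned = [rng.gauss(1.0, 1.0) for _ in range(200)]
p_value = permutation_test(retrained, unlearned)
print("reject H0 at the 5% level:", p_value < 0.05)
```

A mean-difference statistic is blind to shifts that leave the mean unchanged, which is one reason richer statistics are used in practice.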
As models grow in size and complexity, two-sample testing and other statistical tools used for machine unlearning auditing become harder to implement and lose statistical power. To distinguish a real violation from the random noise inherent in large-scale models with enough statistical significance, an auditor needs to extract a large number of samples. This makes real-world testing computationally very expensive.
To address this growing challenge, we introduce Regularized f-Divergence Kernel Tests, presented at AISTATS 2026, a new framework designed to make auditing ML models much more sensitive, flexible, and accurate. We theoretically prove that our tests naturally control for false positives for any sample size, and that the risk of false negatives reliably converges to zero as the number of available data samples increases.
Evaluating model safety often requires measuring the distance, or divergence, between two complex data sets. Different applications naturally require different notions of “distance”. While popular standard tools like maximum mean discrepancy (MMD) excel at detecting broad, global shifts across data (such as a model systematically generating brighter images than its counterpart), they often lack the necessary specificity to capture complex anomalies. For instance, if the addition of a specific person's data causes a model to generate a highly specific outlier output only when prompted in a very exact way — while having an equal distribution on all other samples — traditional MMD tests might completely overlook this local shift.
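To make the MMD discussion above concrete, here is a minimal sketch of a biased (V-statistic) estimate of squared MMD with a Gaussian kernel on scalar samples. The bandwidth of 1.0, the function names, and the synthetic data (a global mean shift versus a rare outlier at 6.0) are assumptions for illustration only.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD for scalar samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Global shift: every output moves by 0.5.
global_shift = [rng.gauss(0.5, 1.0) for _ in range(200)]
# Local shift: roughly 2% of outputs are replaced by an outlier at 6.0.
local_shift = [6.0 if rng.random() < 0.02 else rng.gauss(0.0, 1.0) for _ in range(200)]

print("global shift:", mmd_squared(reference, global_shift))
print("local shift: ", mmd_squared(reference, local_shift))
```

Because the kernel mean embedding is linear in the distribution, contaminating only a fraction ε of the outputs scales the population squared MMD by ε², so a rare, localized change produces a very small signal that is easily lost in sampling noise.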
Also, most existing testing frameworks force researchers to make error-prone manual choices, such as picking the specific statistic best suited for either global or local shifts or tuning complex settings like kernel bandwidths and regularization parameters.
In addition to being hard in practice, two-sample testing as a verification method is flawed when verifying unlearning of ML models. Consider the example below showing how two models trained from scratch on the exact same data can produce different distributions. The blue distribution is the distribution of a model retrained without compromised data. However, its distribution is different from the standard (green) due to retraining with different batch sizes. This results in a false positive: the test wrongly flags the safely retrained model as unsafe.
Furthermore, recent work shows that an AI model can never perfectly “forget” data just by tweaking its current settings; unless it re-traces every step of its original training, it will always leave behind a permanent footprint of the information it was supposed to delete. Accordingly, achieving perfect “retrain equivalence” is fundamentally impossible for standard, local unlearning algorithms, and a traditional two-sample test can always find a dependence on the “forget set”.
We resolve this challenge by proposing a relative distance test that measures whether an unlearned model is distributionally closer to a safely retrained model or to the original, compromised one.
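The relative comparison described above can be sketched as follows. This is a simplified stand-in, not the paper's test: it swaps in a Gaussian-kernel MMD for the regularized f-divergence statistic, and it uses a crude bootstrap with no formal error guarantee. The function names and the synthetic Gaussian outputs are hypothetical.

```python
import math
import random


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel (scalar samples)."""
    def mean_kernel(a, b):
        total = sum(math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))
                    for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def relative_closeness(unlearned, retrained, original, num_bootstrap=50, seed=0):
    """Compare d(unlearned, original) with d(unlearned, retrained).

    Returns (gap, support), where gap = d(unlearned, original) minus
    d(unlearned, retrained), and support is the fraction of bootstrap
    resamples in which that gap is positive.
    """
    rng = random.Random(seed)

    def gap(u, r, o):
        return mmd_squared(u, o) - mmd_squared(u, r)

    def resample(sample):
        return [rng.choice(sample) for _ in sample]

    observed = gap(unlearned, retrained, original)
    positive = sum(
        gap(resample(unlearned), resample(retrained), resample(original)) > 0
        for _ in range(num_bootstrap)
    )
    return observed, positive / num_bootstrap


rng = random.Random(2)
retrained = [rng.gauss(0.0, 1.0) for _ in range(40)]   # safe reference model
original = [rng.gauss(3.0, 1.0) for _ in range(40)]    # compromised model
good = [rng.gauss(0.2, 1.0) for _ in range(40)]        # near retrained, not identical
bad = [rng.gauss(2.8, 1.0) for _ in range(40)]         # still near the original

gap_good, support_good = relative_closeness(good, retrained, original)
gap_bad, support_bad = relative_closeness(bad, retrained, original)
print("good unlearning closer to retrained:", gap_good > 0, support_good)
print("bad unlearning closer to retrained: ", gap_bad > 0, support_bad)
```

The point of the three-sample design is that the "good" model is not required to match the retrained model exactly; it only has to sit closer to it than to the compromised one.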
Our test acts as a highly adaptable statistical toolkit that leverages f-divergences to allow auditors to pinpoint highly specific types of data shifts.
Calculating these divergences on high-dimensional, real-world data is notoriously difficult. To make these complex optimization problems tractable without requiring massive amounts of compute, we use kernel regularization methods to estimate the differences efficiently.
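For reference, the f-divergences mentioned above have a standard textbook form, written out below together with a few common choices of f and the variational (dual) representation that makes the estimation an optimization problem. The exact regularized objective used in the paper is not given in this post, so the closing remark about restricting the witness function is an assumption about how a kernel-regularized estimate can be formed in general.

```latex
% f-divergence between P and Q (P absolutely continuous w.r.t. Q),
% for a convex function f with f(1) = 0:
D_f(P \,\|\, Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ

% Standard choices of f:
%   KL divergence:                 f(t) = t \log t
%   Total variation:               f(t) = \tfrac{1}{2}\,|t - 1|
%   Hockey-stick (\gamma \ge 1):   f(t) = \max(t - \gamma,\, 0)

% Variational form, with f^* the convex conjugate of f and the
% supremum taken over measurable witness functions h:
D_f(P \,\|\, Q) = \sup_{h} \; \mathbb{E}_{X \sim P}[h(X)] - \mathbb{E}_{Y \sim Q}[f^{*}(h(Y))]
```

One generic way to make the supremum tractable from samples is to restrict h to a reproducing kernel Hilbert space and penalize its norm, which turns the problem into a regularized optimization over kernel functions.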
Our adaptive testing approach automatically selects the best divergence and the optimal hyperparameter configurations to maximize the reliability of the test, entirely eliminating the need for sample splitting.
Because our proposed tests are general, we experimented across a wide variety of problems. We evaluated our framework on perturbed uniforms (synthetic two-sample benchmarks), as well as the Expo1D outlier detection task within physics datasets, a specialized area that uses ML to search for new physical phenomena outside the standard model of particle physics. We used high-energy physics data because that field requires the world’s most precise “difference detectors”: if the framework can spot a rare particle that defies the laws of physics, it can spot a tiny privacy leak in an AI model.
We then shifted our primary focus to the critical, real-world applications of auditing differential privacy and evaluating machine unlearning.
Our framework successfully recovered or outperformed all previous baseline methods with significantly less manual tuning.
The experimental results demonstrated that no single test consistently outperforms the others across every possible scenario. Instead, different f-divergences act as specialized sensors that "light up" for different types of localized data shifts. By using an aggregated approach across diverse statistics, our framework successfully caught subtle errors and anomalies that standard tests completely missed.
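The idea of aggregating several tests can be illustrated with the simplest valid combination rule, a Bonferroni correction. This is a swapped-in technique for illustration; the aggregation procedure actually used in the paper is not described in this post. The function name `aggregate_bonferroni` and the p-values below are made-up examples, not results.

```python
def aggregate_bonferroni(p_values, alpha=0.05):
    """Combine several tests into one decision at overall level alpha.

    `p_values` maps a test name to its p-value. The global null is
    rejected if any single test rejects at the corrected level alpha / k.
    Returns (rejected, names_of_rejecting_tests).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    threshold = alpha / len(p_values)
    rejecting = [name for name, p in p_values.items() if p <= threshold]
    return bool(rejecting), rejecting


# Made-up p-values standing in for tests based on different divergences.
p_values = {"kl": 0.40, "hockey_stick": 0.004, "total_variation": 0.09}
rejected, rejecting_tests = aggregate_bonferroni(p_values)
print(rejected, rejecting_tests)
```

Bonferroni keeps the overall false positive rate at or below `alpha` under any dependence between the tests, at the cost of being conservative when many statistics are combined.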
For privacy auditing, the hockey-stick divergence test proved to be a powerful and effective tool. Because it directly aligns with the mathematical foundations of pure differential privacy, it allows auditors to tightly control the acceptable degree of data shift. Our adaptive testing framework successfully caught privacy violations using significantly fewer data samples and requiring far less hyperparameter tuning than previous baseline testers.
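The link between the hockey-stick divergence and pure differential privacy can be checked exactly on a small discrete mechanism. A mechanism satisfies ε-DP exactly when the hockey-stick divergence of order e^ε between its output distributions on neighboring inputs is zero in both directions. The sketch below uses binary randomized response as a stand-in mechanism and computes the divergence from known probabilities rather than from samples; the function names and the tolerance are assumptions for illustration.

```python
import math


def hockey_stick(p, q, gamma):
    """Hockey-stick divergence sum_x max(p(x) - gamma * q(x), 0) for discrete p, q."""
    return sum(max(pi - gamma * qi, 0.0) for pi, qi in zip(p, q))


def randomized_response(epsilon):
    """Output distributions of binary randomized response for inputs 0 and 1."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [keep, 1.0 - keep], [1.0 - keep, keep]


def violates_pure_dp(p, q, epsilon, tol=1e-12):
    """True if the pair (p, q) rules out epsilon-DP in either direction."""
    gamma = math.exp(epsilon)
    worst = max(hockey_stick(p, q, gamma), hockey_stick(q, p, gamma))
    # The tolerance absorbs floating-point rounding when the divergence is zero.
    return worst > tol


p, q = randomized_response(epsilon=1.0)
print(violates_pure_dp(p, q, epsilon=1.0))  # the level the mechanism is calibrated to
print(violates_pure_dp(p, q, epsilon=0.5))  # a stricter claim than it supports
```

Here the first check finds no violation and the second does, since the mechanism is calibrated to ε = 1 and cannot meet the stricter ε = 0.5. An auditor without access to the true probabilities has to estimate the same quantity from samples, which is where the sample efficiency of the test matters.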
In one notable instance, our framework detected violations in a specific sparse vector technique mechanism (SVT3) using only a few thousand samples, while previously studied techniques like DP-Auditorium required millions of samples to approximate the same violation detection rate.
Our findings also suggest a redefinition of how to evaluate machine unlearning. As shown in the table below, we observed that none of the approximate unlearning methods we evaluated were compliant with the strict, standard two-sample unlearning definition. Because two-sample tests simply look for any distributional difference, they incorrectly flagged perfectly safe, retrained models as unlearning failures.
In contrast, our proposed relative three-sample test successfully overcame this flaw. It correctly and consistently identified the safely retrained models as "safe". When evaluating the approximate unlearning algorithms, only the random label technique passed the evaluation.
Other popular methods, such as finetuning, pruning, and Selective Synaptic Dampening, were found to be ineffective at truly forgetting the targeted data. We emphasize that our primary goal in these experiments was the evaluation of the unlearning methodologies, rather than designing the algorithms themselves. Consequently, we used simplified implementations of these unlearning procedures; more rigorous setups will be required to rank unlearning methods in practical production environments.
Our newly proposed framework provides a much more precise, adaptable, and mathematically sound lens for examining ML behavior. By leveraging regularized f-divergence kernel tests, researchers and auditors can now statistically prove whether a model is behaving unsafely or leaking data across a massive class of problems and complex distributional shifts.
As this field evolves, theoretically grounding our empirical observations to characterize exactly which specific divergence is optimal for other novel tasks remains an exciting direction for future work. Establishing tighter sample complexity bounds will also be a key focus to make these audits even more efficient.
The work described here was done jointly with Antonin Schrab and Arthur Gretton. We thank Nicole Mitchell and Eleni Triantafillou for insightful feedback, and Kimberly Schwede for the graphics and Mark Simborg for helpful edits.
