The emergence of the web data infrastructure layer for AI

Posted by qimuai on 2026-6-25 07:00 · First-hand compilation

Source: https://www.technologyreview.com/2026/06/24/1139202/the-emergence-of-the-web-data-infrastructure-layer-for-ai/

Summary:

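AI's advance gives rise to a new web data infrastructure layer

As AI continues to advance, enterprises need real-time data at scale to capitalize on its potential. The web, however, was not designed for the automated discovery and retrieval that AI applications demand, and much of the relevant information is blocked or unstructured, which limits its use by AI models. The sponsored article, produced in partnership with Bright Data, argues that the next frontier in AI may depend on a new "web data infrastructure layer": one that can navigate hundreds of millions of existing web domains and the billions of new URLs created each week, delivering real-time information and overcoming technical barriers.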

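Real-time data becomes key to AI performance

Early AI breakthroughs were driven by scaling training data and model size, but organizations now face a fundamental bottleneck: they must keep pace with dynamic, unstructured, constantly evolving web data in order to ground outputs in current, verifiable information. Training on static snapshots is no longer sufficient for tracking fluctuations such as competitor pricing, consumer sentiment, and market trends. One survey cited in the article found that 56% of AI practitioners said businesses need access to real-time web data to improve trust in AI outputs. Retrieval-augmented generation (RAG) lets models pull in external data at query time, but retrieval has to be both large-scale and real-time, so latency remains a problem. Separately, Gartner is cited as predicting that 60% of AI projects not supported by AI-ready data (accurate, structured, organized, and contextualized) will be abandoned by the end of the year.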

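New infrastructure: emulating human browsing behavior

To get past these limits, the platforms described emulate human browsing behavior, presenting identifying information such as IP address, location, and, in Lenchner's words, "1,000 more parameters," at a scale he puts at 80 billion times a day across millions of websites. This lets them extract structured data in real time from JavaScript-heavy sites and from sites running aggressive antibot software. The article says such platforms can enforce compliance protocols aligned with privacy frameworks such as GDPR and CCPA, can be limited to openly accessible public information, and can rely on vetted, consent-based networks. Bright Data CEO Or Lenchner sums up the goal as "collecting data at scale, super-low latency, without being blocked," and warns that building this in-house "becomes a full-time engineering problem that competes with the actual AI work."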

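Industry impact and trends

Real-time data retrieval is changing what AI systems can do inside organizations: a retail company can power a dynamic pricing engine, and global brands can track trademark infringements. As the ecosystem matures, organizations that invest in this data infrastructure layer will be better positioned to build AI systems that adapt to real-world change. The article, produced by MIT Technology Review Insights in partnership with Bright Data, suggests that the distinction between AI models and the infrastructure that feeds them may eventually begin to disappear, and it points readers to Bright Data's Data for AI 2026 report.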



English source:

Sponsored
The emergence of the web data infrastructure layer for AI
As AI continues to advance, infrastructure must evolve to enable access and delivery of real-time information at scale.
In partnership with Bright Data
AI is booming. New use cases are emerging each day. To capitalize on the technology’s potential, enterprises require data at scale. In many cases, though, the relevant information is blocked or unstructured, which limits its use by AI models.
To understand this challenge, consider the foundation of the web itself. The web was not designed for the automated discovery and retrieval that new AI applications demand. Overcoming this inherent design constraint requires infrastructure.
The next frontier in AI may depend on a new web data infrastructure layer that can enable models to discover and map this ever-expanding digital realm. This layer must be able to navigate hundreds of millions of existing web domains and billions of new URLs created each week, delivering real-time information and overcoming technical barriers.
“The data suggests there's far more data out there,” says Or Lenchner, CEO of Bright Data, a web data collection platform. “Think of the universe: It's out there, but you don't know what you don't know.”
Enabling access to fresh, relevant, and trustworthy data
While early AI breakthroughs were driven by scaling training data and model size, organizations are now encountering a fundamental bottleneck: They need to keep pace with the dynamic, unstructured, and constantly evolving nature of web data in order to ground outputs in current and verifiable information. AI performance increasingly depends not just on model architecture but on a system’s compute, networking, retrieval, and data engineering capabilities—that is, the system’s ability to quickly and reliably retrieve data that is fresh, relevant, and trustworthy.
Traditional model training relies on snapshots of information collected at a particular point in time. Training AI on such static data is no longer sufficient. To track fluctuations such as competitor pricing, consumer sentiment, and market trends, companies need a constant feed of new information, pulling data in real time along with relevant context. Their infrastructure must therefore be able to handle millions of simultaneous interactions across websites that vary by geography, language, format, and access rules.
“If it can't retrieve real-time information, it lacks context,” Lenchner says. “In a business setting, that's not acceptable anymore. Stale answers lead to bad decisions and disappointed consumers.”
Speed is not merely a matter of convenience; it’s a matter of necessity. Today’s organizations operate in environments where prices, inventory, markets, security threats, and customer behavior change continuously. Delayed data retrieval can reduce the usefulness of an otherwise sophisticated model.
Using live, high-quality web data can also reduce AI hallucinations because the model has a more relevant knowledge base. This builds user trust. In fact, one survey found that 56% of AI practitioners said businesses need access to real-time web data to improve trust in AI outputs. To ensure the model runs efficiently and effectively, the information must also be pared down to the appropriate essentials.
Despite the introduction of retrieval-augmented generation (RAG), where models pull in external data at the moment of a query, many AI systems still struggle to deliver outputs that are current, contextually relevant, and trustworthy in operational settings. According to Gartner, 60% of AI projects that are not supported by AI-ready data—accurate, structured, organized, and contextualized—will be abandoned by the end of the year.
This is because large-scale retrieval alone does not solve the problem. As Lenchner puts it, “You need to retrieve data at scale, but also in real time. Latency becomes an issue because of the end user who is waiting for the output.”
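The retrieval step described above can be illustrated with a minimal sketch. It is not any vendor's implementation: the `Document` type, the word-overlap scoring, and the `max_age` freshness cutoff are all assumptions made for illustration. It shows the basic idea of pulling in external data at query time while discarding stale snapshots.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Document:
    """Hypothetical retrieved page: its text and when it was fetched."""
    text: str
    fetched_at: datetime


def retrieve(query: str, docs: list[Document], now: datetime,
             max_age: timedelta, top_k: int = 2) -> list[Document]:
    """Return up to top_k fresh documents, ranked by naive word overlap."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        if now - doc.fetched_at > max_age:
            continue  # stale snapshot: exclude it from the context
        overlap = len(query_words & set(doc.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query: str, context_docs: list[Document]) -> str:
    """Assemble the text a model would receive: context first, then the question."""
    context = "\n".join(f"- {doc.text}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    now = datetime(2026, 6, 24, tzinfo=timezone.utc)
    docs = [
        Document("widget price is 19 dollars", now - timedelta(hours=1)),
        Document("widget price is 25 dollars", now - timedelta(days=30)),
    ]
    hits = retrieve("current widget price", docs, now, timedelta(days=1))
    print(build_prompt("current widget price", hits))
```

A production system would typically replace the word-overlap score with a proper search index or embedding similarity; the freshness filter is the part that corresponds to the article's point about stale answers.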
Accessing fresh, AI-ready data at scale introduces technical and structural challenges. In practice, many enterprise systems combine public web retrieval with APIs, licensed datasets, and proprietary internal data in their AI applications. Integrating these fragmented sources into a timely and usable knowledge layer requires specialized capabilities. Some research has found that 97% of AI organizations depend on real-time web data infrastructure, but 90% feel boxed in by various restrictions. Companies are increasingly developing technical approaches to navigate these constraints.
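As a rough illustration of combining fragmented sources into one knowledge layer, the sketch below merges records from several feeds and keeps the most recently observed value per entity. The `Record` fields and the "newest wins" rule are assumptions chosen for simplicity; real systems often also weigh source reliability and licensing terms.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """Hypothetical normalized record shared by all sources."""
    key: str               # entity identifier, e.g. a product SKU
    value: str
    source: str            # e.g. "web", "api", "licensed", "internal"
    observed_at: datetime


def merge_freshest(*sources: list[Record]) -> dict[str, Record]:
    """Merge any number of record lists, keeping the newest record per key."""
    merged: dict[str, Record] = {}
    for records in sources:
        for record in records:
            current = merged.get(record.key)
            if current is None or record.observed_at > current.observed_at:
                merged[record.key] = record
    return merged
```

Keeping the `source` field on each merged record preserves provenance, which helps when an output later needs to be traced back to where its data came from.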
Lenchner draws this metaphor: “Think of the trained model as intelligence and relevant data as knowledge. A powerful intelligence layer sitting on top of a hollow knowledge layer is like a genius who knows nothing—useless in practice. Intelligence and knowledge have to come together.”
The promise of new infrastructure
A new layer of web data infrastructure can address this developing need for stronger AI inputs by enabling discovery of data, real-time access, and tailoring to a specific context. As Lenchner describes it, “It's all about collecting data at scale, super-low latency, without being blocked.”
Rather than relying on increased computing power, this type of platform emulates human browsing behavior to access available content and transform raw code into structured data feeds. It can work with websites that might not interact with traditional scraping tools, such as those heavy in JavaScript, or with aggressive antibot software.
As Lenchner explains, “It's basically having infrastructure that can mimic a web user with identifying information—IP address, location, and 1,000 more parameters. And at scale. Think of doing that 80 billion times a day for millions of websites. And every single time, you are looking exactly as the website expects you to look.”
Of course, continuous retrieval introduces new data governance challenges. To address them, platforms can enforce strict compliance protocols aligned with global privacy frameworks, such as the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). They can also be limited to openly accessible, public information, avoiding paywalls or private logins. Any networks used can be vetted and consent-based, and incentives can be provided to owners of IP addresses. In this way, systems can be designed to comply with tightening regulation.
Such complex capabilities do not come easy. “When this is critical infrastructure for a company,” Lenchner says, “doing it in-house becomes a full-time engineering problem that competes with the actual AI work.” Addressing this complexity requires organizations to commit significant resources, leading many to seek specialized platforms designed specifically for data retrieval, orchestration, and observability.
Infrastructure for the real world
Real-time data retrieval is changing what AI systems can do inside organizations. For example, a retail company can use public information to enable a dynamic pricing engine, and global brands can track trademark infringements.
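To make the dynamic pricing example concrete, here is a toy rule, assuming a retailer already has a list of current competitor prices from public pages. The function name `suggest_price` and the `min_margin` and `undercut` parameters are hypothetical; the article does not describe any particular pricing logic.

```python
def suggest_price(competitor_prices: list[float], cost: float,
                  min_margin: float = 0.10, undercut: float = 0.01) -> float:
    """Undercut the cheapest competitor slightly, but never go below cost plus margin."""
    floor = cost * (1 + min_margin)        # lowest acceptable price
    if not competitor_prices:
        return round(floor, 2)             # no market data: fall back to the floor
    target = min(competitor_prices) * (1 - undercut)
    return round(max(target, floor), 2)
```

For example, with competitors at 20.00 and 22.50 and a unit cost of 10.00, the rule targets just under 20.00; if the cheapest competitor were at 10.50, the margin floor would apply instead. The value of fresh data here is direct: the suggestion is only as current as the competitor prices fed into it.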
As the ecosystem matures, organizations that invest in this emerging data infrastructure layer will be better positioned to build AI systems that are more responsive, reliable, and aligned with real-world conditions—AI systems that can continuously adapt using current web data. Over time, the distinction between AI models and the infrastructure that feeds them may even begin to disappear.
As Lenchner says, “The world is changing. And everything that is happening in the world is being uploaded to the public web. The amount of new data that is being generated is growing and accelerating.”
To learn more from Bright Data, read the Data for AI 2026 report.
This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff. It was researched, designed, and written by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.
