Why the Human Genome's Tangled Physicality May Confound AI

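Source: https://www.quantamagazine.org/why-the-human-genomes-tangled-physicality-may-confound-ai-20260618/
Summary:
The human genome's physical tangle may leave AI baffled
DNA has been hailed as the "secret of life" since the 1950s, but scientists have found that the human genome is far from a simple blueprint or program script. Its intricate physical structure and regulatory machinery may confound even today's most advanced artificial intelligence algorithms.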
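Genes are not the whole story; regulation is the key
By 2003 the international Human Genome Project had worked out nearly the entire sequence of the roughly 3 billion DNA base pairs, with a surprising result: barely 2% of the genome consists of actual protein-coding genes. The deeper question is how those genes are regulated, that is, how they are switched on or off at the right time and in the right cell, so that one and the same genome can produce cell types as different as muscle, brain, and skin.
Karen Adelman, a biologist at Harvard Medical School, points out that the operating logic of the human genome is far more complex than that of bacteria. Bacterial regulation resembles an "OR" gate, in which a single signal flips the switch. Human regulation is more like an "AND" gate: many signals must be integrated to reach a decision, and the regulatory knobs are continuously tunable rather than simply on/off.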
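Enhancers and DNA loops: "remote controls" from far away
The genome is dotted with hundreds of thousands, perhaps millions, of "enhancer" segments. These are gathering points for transcription factors and determine whether a gene begins to be transcribed. One enhancer may influence several genes, one gene may be regulated by several enhancers, and many enhancers are "distal," sitting as much as millions of base pairs away from their target gene.
How is this remote control achieved? Part of the answer lies in the huge loops that DNA forms. A molecular motor called cohesin travels along the DNA strand and pulls distal enhancers close to their target genes, forming a loose, fast-changing "transcription hub." These structures differ widely from cell to cell, and even in cells of the same type the regulatory state is never exactly the same.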
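Three-dimensional folding and "epigenetic" annotations
The physical shape of chromatin is itself a means of regulation. Tightly packed "heterochromatin" silences genes, while loose "euchromatin" permits transcription. Inside the nucleus the genome is folded into "topologically associating domains," and genes within the same domain tend to be regulated together. In addition, chemical modifications (epigenetic marks) act like annotations on the DNA script, changing how chromatin is packaged and how active genes are. When a cell divides, these annotations are copied too.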
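RNA "checkpoints" and a self-referential "strange loop"
Even after a gene has been transcribed into messenger RNA, further hurdles remain. Noncoding RNAs such as microRNAs can degrade or modify messenger RNA and block its translation into protein. Messenger RNA must also be spliced, and in different cellular contexts splicing can yield protein variants with very different functions.
All the molecules involved in regulation, from transcription factors to noncoding RNAs, are themselves produced from the genome in the same context-dependent way. That makes the genome a recursive, self-referential system of the kind the computer scientist Douglas Hofstadter called a "strange loop": it acts on itself while being modulated by its own history and by signals from inside and outside the cell.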
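The limits of AI: black-box prediction cannot solve the riddle of life
Faced with a regulatory network this complex and this dependent on dynamic physical structure, some biologists are pinning their hopes on artificial intelligence. Genomic foundation models such as Evo 2, Genos, and Google DeepMind's AlphaGenome try to learn correlations between sequence and traits from vast amounts of data. But Wendy Bickmore, a genome biologist at the University of Edinburgh, points out that the diversity of cell types in the human body and their changes over time during development are precisely where the data are missing.
Adrian Woolfson, co-founder of the biotech company Genyro, stresses in his book that the functioning of an organism depends not only on the genome but also on an "informiome" that includes diet, environment, and the microbiome. He argues that genomic foundation models cannot even predict all the consequences of genetic mutations, because the relevant information is not in the genome sequence in the first place.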
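Conclusion: the genome is a "living organ"
Barbara McClintock, winner of a 1983 Nobel Prize, described the genome as "a highly sensitive organ of the cell" that monitors its own activity, corrects errors, and responds to the unexpected by restructuring itself. Bickmore believes that although the details are not yet fully clear, the field is coalescing around a shared view: the genome is not a static, linear set of instructions but an open informational system that responds dynamically to internal and external signals.
"McClintock was far more on point than people realized at the time," Adelman said. "What she said was that the genome isn't static; it's living." For those who truly want to understand how life works, AI may be a tool, but human reasoning will still be needed to uncover the fundamental principles.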
English source:
Why the Human Genome’s Tangled Physicality May Confound AI
Introduction
Since its molecular structure was deduced in the 1950s, DNA has been hailed by many biologists as the secret of life. They’ve read and studied the information stored in the DNA found in the cells of living organisms, known as their genomes, and claimed that this genetic database must be some kind of blueprint, code script, or computer. But if DNA really does harbor some greater secret about how life works, biologists have yet to find it.
In fact, the human genome is less a script than a puzzle that gets harder the closer they look. Knowing the entire sequence — the order of all 3 billion or so of our DNA’s chemical building blocks, nearly fully deduced by the international Human Genome Project between 1990 and 2003 — hasn’t helped much. That investigation showed that barely 2% of the human genome consists of actual genes, the information-coding sequences of DNA.
It’s now clear that understanding the human genome is no longer a matter of figuring out what each gene does. The deeper and much harder question is how those genes are used, or regulated, a question that seems to involve some and perhaps much of the rest of the genome. By switching suites of genes on and off, the many different cell types in our bodies can all be created from the same material. Cells also regulate their genes from moment to moment in response to a constant inflow of signals from their neighbors and surroundings. But the processes that govern gene regulation are proving so complex that some biologists wonder whether a full understanding of it — of how the genome really works — will ever be within the grasp of our puny minds.
Some are counting on outsourcing the analysis to artificial intelligence. Genomic “foundation models” such as Evo 2, Genos, and Google DeepMind’s AlphaGenome are trained on vast quantities of genomic data, and biologists use them to make predictions about how differences in DNA sequence affect biological processes and ultimately the traits (including disease risk) of a whole organism. These algorithms don’t worry about the complicated regulatory stuff going on; all of that is supposedly subsumed by the algorithm’s “training,” through which it deduces correlations from cases we already know about.
This approach is likely to be useful, but for those who crave real understanding of how the genome, and ultimately life itself, works, a computational black box will never suffice. And perhaps more to the point, the genome might not submit to the kind of straightforward input-output approach that such AI models ultimately assume.
That’s because the genome is no blueprint or algorithm. It is something else.
The Old View
Given that it’s the product of around 4 billion years of evolution, perhaps it’s not surprising that our genome is complicated. The surprise has been what those complications are. “Our genome is not what we might make it if we sat down at the drawing board,” said the biologist Karen Adelman, who studies gene regulation at Harvard Medical School.
The traditional view posits that a small proportion of our DNA holds the code for making the protein molecules that orchestrate our cells’ chemistry. Each instruction for a protein is held in a corresponding gene — we have around 20,000 of these — and gene sequences can range in length from a couple of dozen to almost 3 million DNA “letters” (representing molecules called nucleotides). Making a protein from its gene is a two-stage affair. First the DNA is read, letter by letter, by an enzyme called a polymerase, which creates a copy of that code in a related molecule called messenger RNA (mRNA). This is called transcription. The mRNA is then read by a piece of molecular machinery called the ribosome, which constructs the protein — a process called translation. The proteins made by the ribosome then go off to do their jobs in making and sustaining the organism.
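As a rough illustration of that two-stage picture, here is a toy sketch in Python. The 15-letter "gene" is invented for the example, the codon table is only a small excerpt of the standard genetic code, and the functions `transcribe` and `translate` are hypothetical names. Real cells do far more than this, as the rest of the article makes clear.

```python
# Toy model of the "traditional view": DNA -> mRNA -> protein.
# Assumption: the gene below is made up, and only five codons are listed.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "GCU": "Ala",
    "AAA": "Lys",
    "UGG": "Trp",
    "UAA": None,   # stop codon
}


def transcribe(coding_strand: str) -> str:
    """Copy a DNA coding-strand sequence into mRNA letters (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read the mRNA three letters at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: the ribosome lets go
            break
        protein.append(amino_acid)
    return protein


gene = "ATGGCTAAATGGTAA"    # hypothetical gene, for illustration only
mrna = transcribe(gene)     # stage 1: transcription
protein = translate(mrna)   # stage 2: translation
print(mrna, protein)
```

Nothing in this sketch depends on cell type, timing, or chromatin shape, which is exactly what the traditional view leaves out.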
This picture is still more or less correct. But it turns out that “the genes are probably not the most interesting part of the genome,” Adelman said.
What matters more is how our genes, many of which we share with simpler organisms, are regulated: turned on and off. Which proteins a cell needs changes over time and according to cell type: muscle, brain, skin, and so on. How the genes that encode those proteins are regulated depends on some of the genome that doesn’t code for proteins.
Biologists have known about gene regulation, and the involvement of “noncoding” DNA, since the 1960s. But for many years, most of what they understood about this came from studies of simple organisms like bacteria, where the principles are generally straightforward. It has gradually become clear, though, that in complex eukaryotic organisms like us, gene regulation is far more complicated, involving overlapping systems of oversight and control, each with its own intricacies.
Transcription Factors
Transcription gets started by proteins called transcription factors, which are like the operations managers of gene regulation. These proteins stick to sections of DNA (typically close to the target gene) and recruit the polymerase enzyme to make an mRNA copy. In bacteria, transcription factors are rather like keys that fit the locks of unique binding sites on DNA. But that’s not how they work in complex organisms. In us, the logic of transcription factors is more difficult to parse.
For one thing, our transcription factors don’t show strong preferences for particular DNA binding sites. What’s more, they tend to work in pairs or groups. And a given transcription factor might have different effects in different contexts, such as activating gene transcription in one cell type but suppressing it in another, depending on which other transcription factors are around.
In bacteria, regulation tends to have an “OR” logic, Adelman said, whereby a particular signal turns a gene on or off: It’s either this or that. But in the human genome the logic is more like what computer scientists designate “AND.” Many signals are integrated to reach a regulatory decision: this and that and also that other thing. In this case, regulation can be more responsive to nuances of context, and the regulatory knobs are tunable rather than being just on/off. “This is part of the beauty” of our regulatory complexity, Adelman said.
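The contrast Adelman draws can be sketched in a few lines of Python. This is a cartoon under stated assumptions, not a model of any real gene: the factor names (`TF_A`, `TF_B`, `TF_C`) and their levels are invented, and using the scarcest required factor as the output level is just one simple way to express a tunable "AND."

```python
def bacterial_style(signal_present: bool) -> float:
    """One signal flips the gene fully on or fully off."""
    return 1.0 if signal_present else 0.0


def human_style(factor_levels: dict[str, float], required: list[str]) -> float:
    """Tunable AND: every required factor must be present, and the output
    is limited by whichever required factor is scarcest (an assumption)."""
    return min(factor_levels.get(name, 0.0) for name in required)


required = ["TF_A", "TF_B", "TF_C"]  # hypothetical transcription factors

muscle_cell = {"TF_A": 0.9, "TF_B": 0.6, "TF_C": 0.8}
skin_cell = {"TF_A": 0.9, "TF_B": 0.7}  # TF_C is absent in this cell

muscle_output = human_style(muscle_cell, required)
skin_output = human_style(skin_cell, required)
print(bacterial_style(True), muscle_output, skin_output)
```

In the bacterial-style function the answer is always 0 or 1. In the human-style function the same gene gets an intermediate level in one cell and stays silent in another, depending on which combination of factors happens to be around.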
When they interact with the genome, transcription factors bind to pieces of DNA called enhancers — which present a puzzle of their own.
Enhancers
Enhancers are gathering points for transcription factors, and they are thought to be the decisive influence on transcription: They deliver the “go” signal for a waiting polymerase to make an mRNA version of the DNA sequence. Seems simple enough, but mapping enhancers to their respective genes is far from straightforward. Our genome has hundreds of thousands, perhaps millions, of enhancers. That means we have many more of them than we have genes. Each gene might be influenced by many enhancers, and each enhancer might influence multiple genes.
“It’s embarrassing that 25 years after the Human Genome Project, we don’t know where all the enhancers are in the genome, let alone what they do when they act and which genes they control,” said Wendy Bickmore, a genome biologist at the University of Edinburgh.
Biologists do know that most enhancers won’t respond to a single transcription factor. Their activation “requires a cocktail,” Bickmore said. “That’s what gives [an enhancer] that exquisite specificity — because it’s only in a particular cell at a particular time that you have the right combination of factors to bind and activate that enhancer.”
Some enhancers are, as you’d expect, close to the genes they regulate, or even sit on DNA inside a gene. But others sit far away from the gene — perhaps millions of nucleotides away, with more genes in between.
The existence of such so-called “distal” enhancers “seems bonkers,” Bickmore said. “How do you get that information from over there to over here, to the gene that needs to be activated? That’s a largely unanswered question.”
One of the answers comes in the form of a loop.
Loops and Hubs
Distal enhancers are brought to the gene they regulate on great loops of DNA or, more strictly, of chromatin: the combination of DNA and its packaging proteins, which is unraveled as if from a ball of wool. The loops are created by a protein motor called cohesin, which runs up and down the DNA strand and extrudes it as needed.
Once cohesin has formed a loop to bring elements together, what then? It was once thought that they then stick together or assemble into a molecular machine, but they don’t. Rather, the components appear to form a loose but dense blob in which they interact rather weakly, fleetingly, and indiscriminately — a sort of committee, sometimes called a condensate.
These transcription hubs are extremely fluid and differ from one cell to another. “There’ll be a bit of loop extrusion going on over here, in the next cell it might be over here, and the whole thing is turning over incredibly fast,” Bickmore said. Even if the cells are notionally identical — both skin cells, say — exactly what the gene-regulatory machinery is up to at any moment is never quite the same in any two of them.
Chromatin loops are just one reason why a gene’s transcription depends on the shape and structure of the chromatin around it.
Chromatin Shape
The textbook image of a chromosome — one of the 46 units into which our genomes are divided — is of a compact, X-shaped cluster of chromatin. But any time a cell is not actively dividing, its chromatin is unwound into what looks like a tangled mess. There is order to the chaos, however. Some parts of chromatin are densely packed into a form called heterochromatin. The compacted DNA there is relatively inaccessible to transcription factors; the genes it contains are typically silenced. Meanwhile, other parts are relatively loose, open, and accessible: This is called euchromatin.
There are special enzymes involved in packaging and repackaging chromatin, thereby controlling transcription. In other words, what matters is not just the encoded information in the DNA but also how it exists physically and dynamically in space. “We’ve stopped thinking about the genome as a linear piece of DNA code,” Bickmore said. “Thinking about this incredibly dynamic three-dimensional folding as absolutely inherent to regulation is a very exciting change.”
One aspect of this 3D organization is the clustering of segments of chromatin into compartments called topologically associating domains (TADs). Within a TAD, the genes seem to be coregulated: switched on or off in groups. Such groups keep suites of genes active or silent together to form and provide function in different cell types. Cohesin is also involved in the shuffling of chromatin to construct TADs — a dynamic process in which the chromatin is constantly rearranged in our cells.
Chromatin shape can also be influenced by chemical modifications called epigenetic marks: small molecules attached to DNA packaging proteins called histones or stuck directly to DNA. Some of these epigenetic modifications can alter the electrical charges on histones, which changes how the proteins attract or repel one another and so rejigs the chromatin packing. Epigenetic modifications to chromatin are like annotations of the DNA script that change its meaning in a given context. When cells divide, the epigenetic annotations are copied, too.
How and when the marks get added and changed, and what each type of mark means for gene activity, are complex questions with no simple answers. Some researchers talk of an “epigenetic code” governing this aspect of gene regulation, but it’s far from clear if anything so systematic really exists.
All of these processes and others can determine whether a gene gets transcribed into mRNA. But there are further layers of regulation that determine whether the mRNA is then translated into a corresponding protein — and which protein arises.
RNA Interventions
This post-transcriptional regulation is often controlled by RNA molecules that are said to be noncoding. These short-lived molecules aren’t templates for proteins, as mRNA is, but have other jobs of their own. While mRNA is produced from the protein-coding areas of DNA (so-called “coding genes”), noncoding RNAs are transcribed from other DNA regions now generally described as noncoding genes. These noncoding RNAs are versatile, taking on varied roles in a cell. Researchers are learning more about what they can do every day, and many if not most of them seem to be involved in gene regulation.
Small noncoding RNAs called microRNAs, for example, can silence mRNAs before they can be translated into proteins. They do this by guiding special enzymes to a particular mRNA to degrade or chemically modify it. The microRNAs don’t do this job alone but, not unlike transcription factors, act combinatorially, in groups, and in a rather promiscuous manner: A given microRNA might regulate many mRNAs, and a given mRNA might be regulated by many microRNAs.
Why make an mRNA only to stop it getting translated into a protein? This sort of post-transcriptional regulation is like having another checkpoint: Does the cell really need this protein? MicroRNAs can be mobilized to allow cells to adjust gene expression depending on the immediate context. In this way, the workings of the genome are less like a program’s inevitable progression and more like an adaptive and responsive process.
Another post-transcriptional complication is that mRNAs get translated to protein only after they have been reorganized. Fresh from transcription, an mRNA contains sequences that encode bits of protein, called exons, as well as sequences that shouldn’t be translated and need to be snipped out, called introns. (Strictly speaking, this pre-edited RNA is called pre-mRNA.) The job of editing introns out and splicing exons together is done by a molecular assembly called the spliceosome, which is made from several proteins together with various noncoding RNAs.
The spliceosome too can be sensitive to context, so that it might splice the pre-mRNA to encode one protein in one cell type and a slightly different protein in another. Sometimes these different protein “isoforms” can have very different roles. Transcription factors, for example, are often alternatively spliced in this way, and their isoforms can take on different regulatory tasks — some might activate gene expression, for instance, while others repress it.
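A minimal sketch of how one pre-mRNA could yield two isoforms, assuming a made-up transcript with three exons and two introns. The sequences, the `splice` function, and the idea of skipping a numbered exon are illustrative assumptions. The real spliceosome recognizes sequence signals and responds to cellular context in ways this toy ignores.

```python
# Hypothetical pre-mRNA: an ordered list of (segment kind, sequence) pairs.
pre_mrna = [
    ("exon", "AUGGCU"),
    ("intron", "GUAAGU"),
    ("exon", "AAAUGG"),
    ("intron", "GUCCAG"),
    ("exon", "UGGUAA"),
]


def splice(segments, skip_exons=frozenset()):
    """Drop every intron and join the exons, optionally skipping some.

    Exons are numbered from 1 in the order they appear in the pre-mRNA.
    """
    exon_number = 0
    kept = []
    for kind, sequence in segments:
        if kind != "exon":
            continue  # introns are always snipped out
        exon_number += 1
        if exon_number not in skip_exons:
            kept.append(sequence)
    return "".join(kept)


isoform_a = splice(pre_mrna)                  # e.g. in one cell type
isoform_b = splice(pre_mrna, skip_exons={2})  # e.g. in another cell type
print(isoform_a, isoform_b)
```

Both mature mRNAs come from the same stretch of DNA, yet they would be translated into different proteins. The choice is made by the cell at splicing time and is not spelled out in the gene's letters alone.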
Checks and Balances
All told, these and other regulatory mechanisms show that the genome is far from some automated program running in the background to build us and keep us alive. Our cells are, in effect, making complex decisions about how to use their genes — both the information they contain and the structure they assume.
Thus, cells need to assemble a rather loose and fuzzy committee of components, such as transcription factors and enhancers, to get transcription underway, which also depends on how the chromatin strand is shaped and molded at that moment. Then there are further layers of decision-making and action-taking in between mRNA and the final, functional protein.
Remember, too, that all the players — from transcription factors to noncoding RNAs — are themselves produced from the genome in the same kind of context-dependent process. That makes the genome rather like a recursive, self-referential system that the computer scientist Douglas Hofstadter dubbed “a strange loop.” It acts on itself, mindful of its own history (which determines chromatin conformation and epigenetic markings, say) and heedful of messages from inside and outside the cell. Not, then, a blueprint.
And for that reason, not at all easy to understand. “I wouldn’t have designed it this way if I was God,” Bickmore said. “But here we are!”
Why is gene regulation in animals like us so darned complicated? One potential answer is that evolution doesn’t have the foresight to design with efficiency and transparent logic, but merely tinkers with what it has already available. Maybe so — but eukaryotic gene regulation isn’t just a messy version of what happens in bacteria. It has different principles, and there’s surely a reason for them.
Bickmore suspects that the complexity of regulation and of genome organization might have been the only means of generating complexity in the organism. For example, organisms with many tissue types and varied lifestyles required more control over which genes were on or off in a given cell. One thing this demanded was more and more noncoding regulatory sequences in DNA. But then they couldn’t all fit close to the gene itself.
“As you get more complexity, you need to add more and more enhancers,” Bickmore said. “But where are you going to put them? You start to put them farther and farther away. Once they are [far enough], you start to need TADs and three-dimensional [chromatin] folding to allow those things to work.”
We also need regulatory complexity because, over evolutionary time, the human genome has accumulated DNA from parasitic viruses in the form of jumping genetic material called transposable elements. These sequences have inserted themselves all over our chromosomes and are good at replicating themselves. To sift the good DNA from the bad, we needed additional layers of regulation to ensure that cells weren’t translating RNAs they don’t really need or that could be actively harmful.
With so many context-dependent checks and balances in the workings of our genome, it is evidently not a program or algorithm that predictably generates the same outcome in every situation. It’s an open informational system that responds to external inputs and the genome’s dynamic internal conditions. This poses a challenge if AI relies solely on the genetic sequences within genomes to predict what genomes will do.
“A Highly Sensitive Organ”
Researchers developing AI-based genomic foundation models such as AlphaGenome hope that all these layers of regulation — transcription factors, splicing, epigenetic marks, loops, chromatin packing, and so on — will be implicitly included in the correlations that the algorithms learn between genetic sequence and organismal traits. They’re content for the complexity described above to be in a black box, so long as the model generates accurate predictions. But will that work?
“I’m sure [AlphaGenome] is going to be useful, but with limitations,” Bickmore said. “To me the big gap is in the complexity of the human body — in all the cell types and how they change over time in development. And all that data is missing.”
Fundamentally, the challenge is that the genome is not a set of static, linear instructions. It is highly dynamic, and it uses its information contextually, with combinatorial and promiscuous logic. “Whether we’ll ever be able to capture that aspect” in algorithms like AlphaGenome, “I don’t know,” she said.
Yet the problem goes even deeper because the functioning of specific organisms, including each of us, doesn’t just depend on genomes. Other factors, such as diet, environment, microbiome and, for us at least, culture, can matter hugely, too — not just in terms of how we act and how healthy we are but also in the state of our genome itself. The biologist Adrian Woolfson, co-founder of California-based biotech company Genyro, which aims to use AI systems for so-called “generative biology,” calls this information cloud the “informiome.”
“While the human genome forms the foundation of the human informiome, other layers of extra-genetic information are equally important,” Woolfson wrote in his book On the Future of Species, published in April 2026. Genomic foundation models won’t even be able to predict all the consequences of genetic mutations, he argued, because the relevant information is not in the genome sequence in the first place.
So how should we think about the genome? Maybe the only metaphors that can capture the way the genome really works must come from biology itself. In 2020, the biological historian Evelyn Fox Keller compared the genome to “an exquisitely sensitive reactive system.” Rather than a sequence of genes leading to the formation of traits, she said, it’s more of “a device for regulating the production of specific proteins in response to constantly changing signals it receives from its environment.”
That sounds close to the picture painted by the geneticist Barbara McClintock in the address she delivered upon being awarded the 1983 Nobel Prize in Physiology or Medicine for her discovery of transposons. The genome, she declared, is “a highly sensitive organ of the cell, monitoring genomic activities and correcting common errors, sensing the unusual and unexpected events and responding to them, often by restructuring the genome.”
Research since that time has fleshed out this image, revealing how the shape of chromatin can matter as much as the information its DNA sequences encode and how an army of molecules collaborates to reorganize it and make collective decisions about how to use its genetic information in context-dependent ways. There is no human technology that works this way, so metaphors such as blueprints, programs, or computers will always fall short.
Bickmore is optimistic that the workings of the genome are understandable, despite its complexity. “We’ve got a handle on it now,” she said. “We might not know the details, but I think the whole field is coalescing now into a framework where we’re thinking along similar lines.” AI can surely help with this sense-making, but in the end, human reasoning will be needed to discern the fundamental principles.
“McClintock was far more on point than people realized at the time,” Adelman said. “What she said was that the genome isn’t static — it’s living.”