Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage

Shu-Fang Zhang a,*, Yu-Hui Li a, Rui-Xian Zhang, Bing-Zhi Li b,c,*, Qing Wang a

Engineering ›› 2025, Vol. 48 ›› Issue (5) : 151-162. DOI: 10.1016/j.eng.2023.10.021
Article

Abstract

At present, file retrieval based on polymerase chain reaction (PCR) amplification is the most commonly used and effective method for retrieving files from DNA storage. The number of orthogonal primers limits the number of files that can be accurately accessed, which in turn affects the file storage density of a single oligonucleotide pool. This paper proposes a multi-mode DNA sequence design method for PCR-based file retrieval in a single oligonucleotide pool, aimed at high-capacity DNA data storage. First, an analysis of the maximum number of orthogonal primers at each predicted primer length shows that the maximum number of available primers does not grow linearly with primer length, and that the maximum number of orthogonal primers is on the order of $10^{4}$. Next, the paper analyzes the maximum address space capacity available for file mapping when DNA sequences carry different types of primer binding sites. When the capacity of the primer library is $\mathbb{R}$ (where $\mathbb{R}$ is even), the number of address spaces that can be mapped by the proposed single-primer DNA sequence design scheme is four times that of the previous scheme, and the proposed two-level primer DNA sequence design scheme can reach $\left[\frac{\mathbb{R}}{2} \cdot\left(\frac{\mathbb{R}}{2}-1\right)\right]^{2}$ times. Finally, a multi-mode DNA sequence generation method is designed according to the number of files to be stored in the oligonucleotide pool, so that target files can be randomly retrieved from a pool holding a large number of files. The performance of the primers generated by the proposed orthogonal primer library generator is verified: the average Gibbs free energy of the most stable heterodimer formed between the generated orthogonal primers is $-1\ \mathrm{kcal} \cdot (\mathrm{mol} \cdot \mathrm{L}^{-1})^{-1}$ (1 kcal = 4.184 kJ). In addition, when DNA sequences with two-level primer binding sites are randomly accessed by selective PCR amplification, the target sequence can be accurately read with a minimum of $10^{3}$ reads, provided that the primer binding site sequences at different positions differ from one another. This paper provides a pipeline for orthogonal primer library generation and for multi-mode mapping schemes between files and primers, which can help achieve precise random access to files in large-scale DNA oligonucleotide pools.
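To give a sense of scale for the two-level factor quoted in the abstract, the short Python sketch below evaluates $\left[\frac{\mathbb{R}}{2} \cdot\left(\frac{\mathbb{R}}{2}-1\right)\right]^{2}$ for a few even library capacities. The function name `two_level_gain` and the sample values of $\mathbb{R}$ are illustrative assumptions and do not come from the paper. The abstract states only the factor itself, so the sketch makes no attempt to compute absolute address-space counts.

```python
def two_level_gain(r: int) -> int:
    """Return [(r/2) * (r/2 - 1)]**2, the two-level factor quoted in the abstract.

    `r` is the primer library capacity, which the abstract assumes to be even.
    The function name and the input validation are illustrative assumptions.
    """
    if r < 2 or r % 2 != 0:
        raise ValueError("primer library capacity must be an even integer >= 2")
    half = r // 2
    return (half * (half - 1)) ** 2


if __name__ == "__main__":
    # Sample capacities are hypothetical, chosen only to show the growth rate.
    for r in (10, 100, 1000, 10000):
        print(f"R = {r:>5}: factor = {two_level_gain(r):,}")
```

For example, a library of 100 primers gives (50 × 49)² = 6,002,500, so the factor grows roughly as the fourth power of $\mathbb{R}/2$.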

Keywords

DNA storage / File retrieval / Orthogonal primer / PCR amplification / DNA sequence design

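Cite this article

Shu-Fang Zhang, Yu-Hui Li, Rui-Xian Zhang, Bing-Zhi Li, Qing Wang. Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage. Engineering. 2025, 48(5): 151-162 https://doi.org/10.1016/j.eng.2023.10.021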
