Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage

Shu-Fang Zhang , Yu-Hui Li , Rui-Xian Zhang , Bing-Zhi Li , Qing Wang

Engineering, 2025, Vol. 48, Issue 5: 159-171. DOI: 10.1016/j.eng.2023.10.021
Research Synthetic Biology: Article


Abstract

At present, polymerase chain reaction (PCR) amplification-based file retrieval is the most commonly used and effective means of DNA file retrieval. The number of orthogonal primers limits the number of files that can be accurately accessed, which in turn affects the density in a single oligo pool of digital DNA storage. In this paper, a multi-mode DNA sequence design method based on PCR file retrieval in a single oligonucleotide pool is proposed for high-capacity DNA data storage. First, by analyzing the maximum number of orthogonal primers at each predicted primer length, it was found that the maximum available primer number does not increase linearly with primer length, and the maximum number of orthogonal primers is on the order of $10^4$. Next, this paper analyzes the maximum address space capacity of DNA sequences with different types of primer binding sites for file mapping. In the case where the capacity of the primer library is R (where R is even), the number of address spaces that can be mapped by the single-primer DNA sequence design scheme proposed in this paper is four times that of the previous one, and the two-level primer DNA sequence design scheme can reach $\left[\frac{R}{2} \cdot \left(\frac{R}{2} - 1\right)\right]^2$ times. Finally, a multi-mode DNA sequence generation method is designed based on the number of files to be stored in the oligonucleotide pool, in order to meet the requirements of the random retrieval of target files in an oligonucleotide pool with large-scale file numbers. The performance of the primers generated by the orthogonal primer library generator proposed in this paper is verified, and the average Gibbs free energy of the most stable heterodimer formed between the orthogonal primers produced is −1 kcal·(mol·L$^{-1}$)$^{-1}$ (1 kcal = 4.184 kJ). At the same time, by selectively PCR-amplifying the DNA sequences of the two-level primer binding sites for random access, the target sequence can be accurately read with a minimum of $10^3$ reads when the primer binding site sequences at different positions are mutually different. This paper provides a pipeline for orthogonal primer library generation and multi-mode mapping schemes between files and primers, which can help achieve precise random access to files in large-scale DNA oligo pools.

Keywords

DNA storage / File retrieval / Orthogonal primer / PCR-amplifying / DNA sequence design

Cite this article

Shu-Fang Zhang, Yu-Hui Li, Rui-Xian Zhang, Bing-Zhi Li, Qing Wang. Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage. Engineering, 2025, 48(5): 159-171 DOI:10.1016/j.eng.2023.10.021


1. Introduction

In 1964, Wiener proposed the concept of DNA storage [1], and, since the beginning of the 21st century, scientists have made various attempts to store data in DNA [2], [3], [4], [5], [6], verifying the feasibility of DNA data storage. Compared with traditional silicon-based storage media, DNA data storage has the advantages of low maintenance cost, high storage density, and a long storage time, and it is expected to become a new generation of storage media for storing massive archival data. When accessing and decoding reconstructed file data stored in a DNA pool, it is usually necessary to amplify and decode all the data in the DNA pool. For massive data storage in DNA, the above reading method results in a significant waste of computational and sequencing resources.

In recent years, researchers have proposed random-access methods that selectively extract specific data from the DNA pool for retrieval, sequencing, and decoding based on actual application needs [7], [8], [9], [10], [11], thereby effectively reducing the cost and improving the convenience of data reading. Among such methods, the file-retrieval method based on polymerase chain reaction (PCR) amplification has a lower cost and simpler operation: primers of about 20 nucleotides are used to perform PCR amplification in order to extract specific files.

In 2015, Tabatabaei Yazdi et al. [12] proposed a file retrieval method based on PCR amplification. The team designed specific address strings in DNA sequences to support file retrieval. In 2018, Organick et al. [13] proposed an end-to-end DNA file storage solution that considered biological constraints in primer design and mapped different primers to different files. In 2019, Tomek et al. [14] used the controllable hybridization characteristics of primers and primer-binding sites under specific conditions to achieve file previewing by controlling the primer concentration and reaction temperature. Song et al. [15] studied multi-dimensional data organization and random access in large-scale DNA storage systems, effectively increasing the limit of DNA storage capacity.

In addition to PCR amplification-based file retrieval methods, other file retrieval methods have emerged. In 2019, Newman et al. [16] used microfluidics to achieve the file retrieval of DNA molecules. By using an electric field to precisely control the movement of micro-droplets to the slot where the target DNA molecule was located, the DNA molecules of the target file were extracted. In 2021, the University of Washington and Microsoft used biotin-labeled probe nucleotide sequences and magnetic bead affinity purification technology to achieve a similarity search of files stored in DNA [17]. In 2021, a team from Massachusetts Institute of Technology and Harvard University used silica microspheres to solidify plasmids that store actual file information and fixed multiple oligonucleotide probes as metadata on the microsphere surface. The target file was accessed through a fluorescence-activated cell sorter [18]. In recent years, research has also been conducted on various aspects of DNA storage [19], [20], [21], [22], [23], [24], [25], [26]. Compared with the PCR amplification retrieval method, these retrieval methods have high costs and complexity, so they are not conducive to large-scale promotion and application.

In order to accurately extract the required files from a single oligo pool containing massive data, primers must exhibit uniqueness as a screening condition with the PCR-based file retrieval method, where each primer corresponds to a specific file. Due to the strict biochemical reaction constraints involved in the design of primers, the number of available primers does not increase linearly as the primer length grows; rather, it has a corresponding capacity limit. For DNA storage with a file retrieval function, this capacity limit will seriously affect the storage capacity. Therefore, it is necessary to comprehensively analyze the primer length and the number of orthogonal primers under the primer design criteria, predict the upper limit of the orthogonal primer library capacity corresponding to different primer lengths, and reasonably design a template sequence corresponding to the stored file according to the capacity limit, thus effectively improving the file retrieval efficiency of DNA storage.

Based on the above analysis, this paper considers the constraints of primer design in DNA storage, uses the quasi-Monte Carlo simulation algorithm to simulate the number of available primers generated under different random primer generation frequencies, and predicts the maximum capacity of available primers (Fig. 1(a)). Furthermore, this paper theoretically analyzes the maximum address space of DNA sequence structures with multi-level primer binding sites and compares the relationship between the address space and the primer library capacity of three different DNA sequence structures (Fig. 1(b)). On this basis, this paper proposes a multi-mode DNA sequence design method for PCR file retrieval based on a single oligonucleotide pool (Fig. 1(c)).

We first design an orthogonal primer library based on the primer design criteria and then design DNA sequences with single or multiple primer binding sites, which facilitate the selection of corresponding DNA sequence design methods according to actual needs. Finally, we conduct simulation experiments on the mismatch of generated primers under different orthogonal constraint conditions. The experimental results show that, compared with the case when only the constraint of the primer itself is considered, the probability of primer mismatch will be significantly reduced for the case when both the constraint of the primer itself and the orthogonality constraint between primers are considered. In addition, the feasibility of the multi-primer binding site DNA sequence design method proposed in this paper is demonstrated through actual PCR.

By analyzing the relationship between primer length and the maximum capacity of the orthogonal primer library, we observe the problem that the storage capacity in a single oligo pool can easily reach a bottleneck due to the low orthogonal primer library capacity in primer-based selective PCR amplifications. Therefore, this paper proposes a multi-mode sequence design scheme to meet the storage needs in different application scenarios. With the method proposed in the paper, the upper limit of the file storage capacity in a single oligo pool is greatly increased.

2. Materials and methods

2.1. Primer design criteria in DNA storage

The design criteria for a primer in PCR include constraints on the primer length, melting temperature (Tm), annealing temperature, guanine–cytosine (GC) content, GC distribution, secondary structure, base repetition, run length, and 3′ end stability. The length of the primer should be between 15 and 30 nucleotides (nt): a primer that is too short reduces the specificity of the PCR, while a primer that is too long significantly increases the probability of homodimers and heterodimers and raises the Tm, which in turn impairs the function of the DNA polymerase.

The Tm of the primer should be between 52 and 58 °C. A Tm that is too low or too high has a negative impact on the PCR. In particular, a low Tm reduces the specificity of primer binding to the template sequence, resulting in spurious bands, while a high Tm may prevent the DNA polymerase from working properly and may make it difficult for the primer to bind to the template, or cause amplification to fail. As a result, the effect of the Tm needs to be considered in primer design.

In addition to the criteria listed above, the following constraints should also be considered:

(1) The GC content should be between 40% and 60%.

(2) No more than three purines or three pyrimidines should appear in the last five bases at the 3′ end of the primer.

(3) No secondary structures should be found, including hairpin structures, homodimers, or heterodimers.

(4) According to the base repetition constraint, which refers to the repeated appearance of a di-nucleotide group (e.g., ATATATAT), the maximum allowable number of repeated di-nucleotide groups in the primer sequence is four.

(5) The appearance of more than four consecutive identical bases in the primer should be avoided, and the Gibbs free energy at the 3′ end should be as high as possible.

These primer design principles are essential in order to improve the specificity of the PCR amplification reaction. It is noticeable that many of these constraints restrict the 3′ end of the primer. This is because the 3′ end is the starting point for DNA polymerase to perform DNA molecular extension, making the specificity of this part important.
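The sequence-level criteria above can be sketched as a small self-check. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the Tm, secondary-structure, and Gibbs free energy criteria are left out because they need a thermodynamic model that the text does not specify.

```python
PURINES = set("AG")


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def max_run_length(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def max_dinucleotide_repeats(seq: str) -> int:
    """Largest number of back-to-back repeats of any two-base unit (ATATAT -> 3)."""
    best = 1
    for start in range(len(seq) - 1):
        unit = seq[start:start + 2]
        count, pos = 1, start + 2
        while seq[pos:pos + 2] == unit:
            count += 1
            pos += 2
        best = max(best, count)
    return best


def three_prime_balanced(seq: str) -> bool:
    """At most three purines and at most three pyrimidines in the last five bases."""
    tail = seq[-5:]
    purines = sum(base in PURINES for base in tail)
    return purines <= 3 and len(tail) - purines <= 3


def passes_self_check(seq: str) -> bool:
    """Length, GC content, run length, di-nucleotide repeats, and 3' end balance."""
    return (15 <= len(seq) <= 30
            and 0.40 <= gc_content(seq) <= 0.60
            and max_run_length(seq) <= 4
            and max_dinucleotide_repeats(seq) <= 4
            and three_prime_balanced(seq))
```

For example, `passes_self_check("ATGCATGCATGCATGCATGC")` accepts the sequence, while a primer that opens with five AT repeats is rejected by the base repetition rule.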

When conducting file retrieval in DNA storage based on PCR amplification, it is not only necessary to consider traditional biological primer design criteria; orthogonality constraints between primers must also be considered. The number of template sequences for PCR in DNA storage is generally at least 10 000. Moreover, as the cost of DNA synthesis decreases, larger-scale sequence synthesis becomes possible, and the number of files stored in the same oligonucleotide pool will increase significantly. Therefore, in the design mode of one-to-one mapping between file and primer, if there is no orthogonality constraint between primers, primers will bind to incorrect binding sites and the PCR will amplify non-target sequences, which will affect the subsequent decoding work.

2.2. Improved orthogonal primer design method

To generate an orthogonal primer library more efficiently, we designed an orthogonal primer library generation system, as shown in Fig. 2, including four modules: a random primer generator, self-checker, inter-checker, and targeted mutation module. The random primer generator is used to generate random oligonucleotide sequences, and the generated sequence is input into the self-checker to check the primer’s self-constraint conditions. In the orthogonal primer library design system, if an oligonucleotide sequence does not meet a certain constraint condition, the sequence is input into the targeted mutation module. The targeted mutation module performs directed mutations on the bases in the local sequence that violate the constraint condition to make it meet the constraint condition after mutation. The primer sequence is then re-entered into the self-checker after targeted mutation; if the sequence passes the self-checker, it enters the inter-checker to check the constraint between the primers.
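The generate, check, mutate, and re-check loop described above might look roughly like the following sketch. It is deliberately reduced: the self-checker covers only GC content and run length, the inter-checker uses only a Hamming distance threshold, and the threshold value, the repair budget, and all function names are assumptions made for illustration.

```python
import random

BASES = "ACGT"


def gc_fraction(seq):
    return sum(b in "GC" for b in seq) / len(seq)


def first_long_run(seq, limit=4):
    """Index of the base that extends a homopolymer run beyond `limit`, or -1."""
    run = 1
    for i in range(1, len(seq)):
        run = run + 1 if seq[i] == seq[i - 1] else 1
        if run > limit:
            return i
    return -1


def self_check(seq):
    return 0.4 <= gc_fraction(seq) <= 0.6 and first_long_run(seq) == -1


def targeted_mutation(seq, rng):
    """Mutate only a position involved in the violated constraint."""
    s = list(seq)
    i = first_long_run(seq)
    if i != -1:                                   # break the over-long run
        s[i] = rng.choice([b for b in BASES if b != s[i]])
    elif gc_fraction(seq) < 0.4:                  # raise GC content
        s[rng.choice([k for k, b in enumerate(s) if b in "AT"])] = rng.choice("GC")
    elif gc_fraction(seq) > 0.6:                  # lower GC content
        s[rng.choice([k for k, b in enumerate(s) if b in "GC"])] = rng.choice("AT")
    return "".join(s)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def inter_check(seq, library, min_distance):
    return all(hamming(seq, p) >= min_distance for p in library)


def build_library(length, target_size, min_distance=6, max_attempts=20000, seed=0):
    rng = random.Random(seed)
    library = []
    for _ in range(max_attempts):
        if len(library) >= target_size:
            break
        seq = "".join(rng.choice(BASES) for _ in range(length))
        for _ in range(10):                       # bounded number of repairs
            if self_check(seq):
                break
            seq = targeted_mutation(seq, rng)
        if self_check(seq) and inter_check(seq, library, min_distance):
            library.append(seq)
    return library
```

Repairing a near-miss sequence instead of discarding it is what saves generator calls; the trade-off is that each repaired sequence has to pass the self-checker again before it reaches the inter-checker.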

Through the orthogonal primer library design system, orthogonal primers that meet all DNA storage primer constraint conditions can be automatically generated as needed, making it easy to generate appropriate primers based on actual storage needs. However, given the DNA storage primer constraint conditions and the analysis of the relationship between primer length and the number of orthogonal primers, it can be seen that the complexity of checking the constraint conditions between primers grows as the number of orthogonal primers increases, severely reducing the efficiency of primer library generation. On the other hand, the constraint strength of the Hamming distance constraint and non-homologous dimer constraint between primers will increase with the increase of the capacity of the orthogonal primer library, severely limiting the ability for file retrieval in DNA storage. Existing DNA sequence design methods cannot be used to perform file retrieval in large-scale DNA pools.

2.3. Multi-mode file DNA sequence design

Through the above analysis, it can be seen that traditional file-primer mapping methods can be used to extract the target file through a single round of PCR amplification when the number of files stored in the DNA pool is small. However, once the number of files stored in a single DNA pool becomes too large, the capacity of the orthogonal primer library needs to be increased, and the design difficulty will significantly increase. In addition, the number of primers required for file retrieval will be too large, leading to increased retrieval costs. At this point, the use of a DNA template sequence with multi-primer binding sites can break through the bottleneck of the capacity limit of the orthogonal primer library and exponentially expand the available address space by increasing the level of forward and reversed primer binding sites. Therefore, this paper proposes a multi-mode DNA sequence design method, as shown in the flow chart in Fig. 3.

To store a given set of files, the selected primer sequence length and whether to prioritize “one primer, one file” mapping to design the DNA sequence are first input. If so, the algorithm enters the binding site level least (BSLL) primer design strategy; otherwise, it enters the primer library least (PLL) design strategy. For the BSLL design strategy, the primer library generator will design as many orthogonal primers as possible to meet the storage requirements. If the capacity of the designed orthogonal primer library is sufficient, the final template sequence is output according to the one-primer–one-file mapping relationship. Otherwise, the number of primer binding site pairs in the template sequence needs to be calculated according to Eq. (1) to meet the storage requirements of the given file set.

$$n_{\min} = \left\lceil \frac{\lg N_f}{2 \cdot \lg \left( \frac{\mathrm{size}(P_N)}{2} \right)} \right\rceil \tag{1}$$

where $N$ is the number of bases in the oligonucleotide sequence, $N_f$ is the number of files to be stored, $n$ represents the number of levels of the primer binding site in the template sequence, $P_N$ is the orthogonal primer library, $n_{\min}$ is the minimum primer binding site level required, and $\lceil \cdot \rceil$ is the ceiling function.

For the PLL design strategy, the primer binding site level n needs to be input first. The minimum number of orthogonal primers that the primer library generator needs to design, size(PN), can be calculated using Eq. (2):

$$\mathrm{size}(P_N) = \left\lceil 2 \cdot N_f^{\frac{1}{2n}} \right\rceil \tag{2}$$

When the value of $\left\lceil 2 \cdot N_f^{\frac{1}{2n}} \right\rceil$ is odd, $\mathrm{size}(P_N) = \left\lceil 2 \cdot N_f^{\frac{1}{2n}} \right\rceil + 1$. Then, the orthogonal primer library generator is used to cyclically generate available primers. When the number of feasible primers in the generated orthogonal primer library reaches size($P_N$), the generation step is ended, and all template sequences of the file set to be stored are generated according to the primer binding site level n.
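A short numerical sketch of the two sizing rules follows. It assumes that Eq. (1) reads as the ceiling of lg(Nf) / (2·lg(size(PN)/2)) and that Eq. (2) reads as the ceiling of 2·Nf^(1/(2n)), rounded up to the next even number; both readings, and the function names, are assumptions. Library sizes of 2 or less are not handled, because the denominator of Eq. (1) would be zero.

```python
import math


def min_binding_site_levels(num_files: int, library_size: int) -> int:
    """BSLL strategy, assumed reading of Eq. (1)."""
    return math.ceil(math.log10(num_files) / (2 * math.log10(library_size / 2)))


def min_library_size(num_files: int, levels: int) -> int:
    """PLL strategy, assumed reading of Eq. (2); an odd result is increased by 1."""
    size = math.ceil(2 * num_files ** (1 / (2 * levels)))
    return size + 1 if size % 2 else size


# One million files with a 100-primer library need two levels of binding sites.
levels_needed = min_binding_site_levels(10**6, 100)

# With two levels fixed in advance, one million files call for about 64 primers.
primers_needed = min_library_size(10**6, 2)
```

Both rules approximate the address space as (size(PN)/2)^(2n), so a result that sits right at the boundary is worth checking against the exact address space before synthesis.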

2.4. Selective PCR

In the PCR amplification, 20 template sequences were added to the reaction system at equal concentrations as the initial substrate for subsequent PCR amplification. Six groups of experiments were conducted, with an annealing temperature of 53 °C, a primer concentration of 2 μL per 50 μL of substrate, and 35 cycles of PCR. The sequencing results were obtained through second-generation Illumina sequencing technology for analysis. The primer combinations added in each round of the six PCR reactions are shown in Table 1. In experiments 1–3, the two forward primers and two reversed primers added in the two rounds of PCR amplification were different. In experiments 4–6, the two forward primers or two reversed primers added in the two rounds of PCR amplification could be the same.

3. Results and discussion

3.1. Analysis of the relationship between primer length and the number of orthogonal primers

The one-to-one mapping between primers and files is the basis for implementing DNA storage file retrieval. The function of the primers is similar to that of the address bus in silicon-based storage media, which allows random access to a specific storage space on the storage chip. In file retrieval in PCR-based DNA storage, primers are used as unique identifiers to differentiate files through biochemical reactions. For an oligonucleotide sequence with N bases, the set of all oligonucleotide sequences is denoted as $C_N = \{s_1, s_2, \ldots, s_k, \ldots, s_{4^N}\}$, where $s_k$ represents the kth oligonucleotide sequence. Since each position of the oligonucleotide sequence can have four possible bases, the size of the set $C_N$ is $4^N$. By traversing all sequences in the set $C_N$, all oligonucleotide sequences that meet the DNA storage primer conditions are selected and added to the orthogonal primer library set $P_N = \{s_{p_1}, s_{p_2}, \ldots, s_{p_n}\}$ ($n < 4^N$, $P_N \subseteq C_N$), where the size of the orthogonal primer library set $P_N$ is n. The time complexity required to traverse $C_N$ is $O(4^N)$. When the primer length is 30 nt, the size of $C_N$ is $2^{60}$, making the sequential traversal and screening of sequences impractical, as the search space increases fourfold with each additional base in the primer sequence.
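To give a sense of scale, the search-space sizes for the 30-nt case mentioned above and for a 20-nt primer work out as follows:

$$4^{30} = 2^{60} = 1\,152\,921\,504\,606\,846\,976 \approx 1.15 \times 10^{18}, \qquad 4^{20} = 2^{40} = 1\,099\,511\,627\,776 \approx 1.10 \times 10^{12}.$$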

To investigate the relationship between primer length N and the size of the orthogonal primer library $P_N$, we use mathematical statistical methods to fit the functional relationship between the number of randomly generated primers i and the size of the resulting orthogonal primer library, size($P_{i,N}$); we then use this relationship to predict size($P_N$). The main modules of the analysis system include a random primer generator, a primer self-checker, and a primer inter-checker (Fig. 4(a)). The quasi-Monte Carlo primer generation module in the primer generator uses N-dimensional Hammersley point sets to generate random primer sequences, where N is the primer length. The primer generator produces random oligonucleotide sequences as input for the primer self-checker. The primer self-checker performs GC content detection for the entire sequence, GC clustering detection at the 3′ end, and the detection of consecutive di-nucleotide repeats; it also performs run-length constraint detection, homodimer detection, and Tm value detection (Fig. 4(b)). Primers that meet all the constraint conditions are output from the self-checker to the primer inter-checker. The primer inter-checker detects the Hamming distance constraint and the non-homologous dimer constraint between the current primer and all primers in the orthogonal primer library (Fig. 4(c)). If the current primer meets all the constraint conditions, it is stored as a feasible primer in the orthogonal primer library. The objective of this analysis system is to find the functional relationship between the number of randomly generated primers i and size($P_{i,N}$).

For the primer generator, we utilize the quasi-Monte Carlo simulation algorithm to solve complex system variables and predict size($P_N$) without traversing the set $C_N$. Fig. 4(d-i) shows that, when considering all primer design constraints for the set $C_N$ (blue elliptical part), a subset $P_N$ (green shape) can be screened. Due to the complexity of the primer constraint conditions, the subset $P_N$ will be distributed in $C_N$ in a certain way. To solve size($P_N$), $C_N$ needs to be randomly sampled m times (m = i); in addition, it must be determined whether each sampled sequence $s_k$ ($1 \le k \le 4^N$) satisfies the primer constraint conditions, and the set of sequences $P_{i,N}$ that satisfy these conditions must be filtered out. It follows that $P_{i,N} \subseteq P_N \subseteq C_N$. By changing the number of randomly generated primers i, a series of sets $P_{i,N}$ can be obtained. The size of the orthogonal primer library, size($P_{i,N}$), is fitted by regression analysis as a function of the number of random primers i generated; an estimate of size($P_N$) can then be obtained by substituting the known quantity size($C_N$) into the function.
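The sample-then-regress idea can be illustrated on a toy scale. The sketch below makes several assumptions that are not taken from the paper: it draws sequences with Python's pseudo-random generator instead of a Hammersley point set, it checks only GC content and a Hamming distance threshold, and it fits a power law in log-log space, whereas the regression model actually used is not stated in the text.

```python
import math
import random

BASES = "ACGT"


def gc_ok(seq):
    return 0.4 <= sum(b in "GC" for b in seq) / len(seq) <= 0.6


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def library_size_from_samples(i, length, min_distance, rng):
    """Sample C_N i times and greedily keep feasible sequences: size(P_{i,N})."""
    library = []
    for _ in range(i):
        seq = "".join(rng.choice(BASES) for _ in range(length))
        if gc_ok(seq) and all(hamming(seq, p) >= min_distance for p in library):
            library.append(seq)
    return len(library)


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = a + b*log(x); returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    return my - b * mx, b


def predict(a, b, x):
    return math.exp(a + b * math.log(x))


rng = random.Random(1)
length, min_distance = 10, 4                 # toy values, chosen only for speed
sample_counts = [50, 100, 200, 400, 800]
mean_sizes = []
for i in sample_counts:
    runs = [library_size_from_samples(i, length, min_distance, rng) for _ in range(10)]
    mean_sizes.append(sum(runs) / len(runs))  # each point averages 10 repetitions

a, b = fit_power_law(sample_counts, mean_sizes)
estimate = predict(a, b, 4 ** length)         # extrapolate to i = size(C_N)
```

A power law cannot saturate, so it probably overestimates the library size at i = size($C_N$); it is used here only to keep the regression step short.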

The key to using the quasi-Monte Carlo simulation algorithm is that the sampling must be uniformly distributed: the higher the uniformity, the better the approximation. The quasi-Monte Carlo simulation algorithm uses a set of low-discrepancy, uniformly distributed points in place of a uniformly distributed pseudo-random sequence for sampling. Fig. 4(d-ii) provides a schematic diagram of the sample point distribution obtained by two-dimensional (2D) random sampling from independent and identically distributed uniform variables, and Fig. 4(d-iii) is a schematic diagram of the sample point distribution obtained by 2D sampling using a Hammersley point set. It can be seen that the uniformity of the latter is better than that of the former, so using the Hammersley point set to estimate the value of size($P_N$) is more reliable.

The generation method of the d-dimensional Hammersley point set is as follows:

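Any non-negative integer $l$ can be expanded in a prime base $p$ as follows:

$$l = a_0 + a_1 p + a_2 p^2 + \cdots + a_r p^r \tag{3}$$

where $a_i$ is an integer in the interval $[0, p-1]$. Define the function $\Phi_p$ of $l$ as follows:

$$\Phi_p(l) = \frac{a_0}{p} + \frac{a_1}{p^2} + \frac{a_2}{p^3} + \cdots + \frac{a_r}{p^{r+1}} \tag{4}$$

Assuming $d$ to be the dimension of the sampling space, a sequence of functions $\Phi_{p_1}, \Phi_{p_2}, \Phi_{p_3}, \ldots, \Phi_{p_{d-1}}$ can be obtained from any sequence of prime numbers $p_1, p_2, p_3, \ldots, p_{d-1}$. From this, the d-dimensional Hammersley point set can be obtained:

$$\left( \frac{l}{n}, \Phi_{p_1}(l), \Phi_{p_2}(l), \ldots, \Phi_{p_{d-1}}(l) \right) \quad \text{for } l = 0, 1, 2, \ldots, n-1 \tag{5}$$

where $p_1 < p_2 < p_3 < \cdots < p_{d-1}$.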
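The construction above translates directly into code. In the sketch below, `radical_inverse` computes $\Phi_p(l)$ and `hammersley` assembles the point set from the first $d-1$ primes. The final helper, which turns a point into a base sequence by splitting [0, 1) into four equal bins per coordinate, is a hypothetical mapping added for illustration; the text does not say how points are converted to primers.

```python
def radical_inverse(l: int, p: int) -> float:
    """Phi_p(l): reflect the base-p digits of l about the radix point."""
    value, scale = 0.0, 1.0 / p
    while l > 0:
        l, digit = divmod(l, p)       # peel off the lowest base-p digit a_i
        value += digit * scale        # a_i contributes a_i / p^(i+1)
        scale /= p
    return value


def first_primes(count: int) -> list:
    """The `count` smallest primes, found by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % q for q in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def hammersley(n: int, d: int) -> list:
    """n points of the d-dimensional Hammersley set."""
    primes = first_primes(d - 1)
    return [(l / n, *(radical_inverse(l, p) for p in primes)) for l in range(n)]


def point_to_sequence(point) -> str:
    """Hypothetical mapping: each coordinate in [0, 1) selects one of four bases."""
    return "".join("ACGT"[min(int(4 * x), 3)] for x in point)


points = hammersley(4, 2)   # [(0.0, 0.0), (0.25, 0.5), (0.5, 0.25), (0.75, 0.75)]
```

For a primer of length N, `hammersley(n, N)` would give n candidate points, each of which maps to one N-base sequence under the assumed binning.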

Taking a 20-base oligonucleotide sequence as an example, the size of the primer library size($P_{i,N}$) is obtained by varying the number of randomly generated primers i (Fig. 4(e)). Each real data point represents the average of 10 size($P_{i,N}$) values. Considering that size($C_{20}$) is equal to $2^{40}$, the size of the orthogonal primer library set size($P_{20}$) can be estimated to be around 12 396 by substituting this value for i in the regression equation. From the above analysis, it can be seen that the maximum number of files that can be stored accurately in a single DNA pool with one-to-one correspondence between the files and the primers is quite limited.

Using the same method, a set of $P_N$ ($15 \le N \le 30$) is obtained by sampling $C_N$ ($15 \le N \le 30$), and the functional relationship between the number of randomly generated primers i and the average size of the orthogonal primer library size($P_{i,N}$) ($15 \le N \le 30$) is fitted. The endpoint of each predicted curve in Fig. 4(f) represents the size($P_N$) predicted when the number of randomly generated primers i is equal to size($C_N$).

Fig. 4(g) shows the relationship between the primer length N and the predicted size of the orthogonal primer library size($P_N$). It can be seen that the size of the orthogonal primer library that satisfies the primer constraints of the DNA storage system does not increase monotonically with the length of the primer. The experimental results show that size($P_{25}$), at a primer length of 25 nt, is the largest, and its upper limit is only about $10^4$.

For a 20-nt sequence, there are $4^{20}$ possibilities. However, basic constraints must be taken into account during primer design. After adding the orthogonality constraints, the number of primers that satisfy all the constraints is drastically reduced. This is the reason why only a small portion of primers can be used for accurate file retrieval.

When the number of files stored in the same DNA oligo pool far exceeds $10^4$, the design scheme of one-to-one correspondence between the files and primers will not meet the requirements of accurate file retrieval, because a file is represented by a unique primer in a classical one-to-one mapping relationship. If more files than the capacity of the orthogonal primer library are stored in the same DNA oligo pool, it will not be possible to find primers for the excess files that are orthogonal to the primers of the remaining files. Orthogonality between primers is an important factor in ensuring the specificity of PCR amplification. If two files have poorly orthogonal primer sequences, then random access to either file based on PCR amplification will result in a product containing non-target sequences. This phenomenon worsens as the number of stored files increases, eventually leading to the need for extensive post-processing such as sequence screening, or even to failure of the search for the target file due to too many irrelevant sequences.

3.2. Analysis of random access to DNA sequences with multi-primer binding sites

3.2.1. Design of a DNA structure with multi-primer binding sites and its random-access method

In order to improve the file retrieval capability, expand the address space, and reduce the difficulty of designing orthogonal primer libraries in large-scale DNA file storage systems, we designed flexible DNA sequence structures with different levels of primer binding sites. The structure is illustrated in Fig. 5(a), where Ln-FP BS is the forward primer binding site of the nth level, and Ln-RP BS is the reversed primer binding site of the nth level. The number of levels of the forward and reversed primer binding sites determines the number of PCRs required for file retrieval. For example, in the two-level forward and reversed primer binding site DNA sequence structure in Fig. 5(a-ii), the outermost primer binding site is the complementary sequence of the primers added in the first round of PCR, the inner primer binding site is the complementary sequence of the primers added in the second round of PCR, and the middle part of the DNA sequence is the payload sequence. Fig. 5(a-iii) shows the process diagram of using two-level forward and reversed primers to achieve the random retrieval of target sequences through PCR amplification.

The forward primer library {FP1, FP2} and reversed primer library {RP1, RP2} are used in Fig. 5(a-iii) to extract the green target sequence. In the first round of PCR, the primer pair used is FP1 and RP1, and the amplified product still includes all four sequences. However, in the second round of PCR amplification using the primer pair FP2 and RP2, non-target sequences cannot be amplified because they lack the reversed primer binding site, the forward primer binding site, or both. Therefore, after two rounds of selective PCR amplification, only the target sequence is amplified.
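The two-round selection can be mimicked with a toy filter over four templates. This is an idealized sketch: amplification is modeled as an exact match on both binding sites of a level, with no mispriming or amplification bias, and the labels `FP3` and `RP3` are hypothetical stand-ins for whatever inner sites the non-target sequences carry.

```python
from typing import NamedTuple


class Template(NamedTuple):
    l1_fp: str     # outer (level-1) forward primer binding site
    l2_fp: str     # inner (level-2) forward primer binding site
    payload: str
    l2_rp: str     # inner (level-2) reversed primer binding site
    l1_rp: str     # outer (level-1) reversed primer binding site


def pcr_round(pool, forward, reverse, level):
    """Keep only templates that carry both binding sites at the given level."""
    if level == 1:
        return [t for t in pool if t.l1_fp == forward and t.l1_rp == reverse]
    return [t for t in pool if t.l2_fp == forward and t.l2_rp == reverse]


pool = [
    Template("FP1", "FP2", "target", "RP2", "RP1"),
    Template("FP1", "FP2", "file_b", "RP3", "RP1"),   # lacks the RP2 site
    Template("FP1", "FP3", "file_c", "RP2", "RP1"),   # lacks the FP2 site
    Template("FP1", "FP3", "file_d", "RP3", "RP1"),   # lacks both inner sites
]

after_round_1 = pcr_round(pool, "FP1", "RP1", level=1)           # all four remain
after_round_2 = pcr_round(after_round_1, "FP2", "RP2", level=2)  # only the target
```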

3.2.2. Analysis of the maximum theoretical address capacity of the multi-primer binding site DNA sequence structure

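For a DNA sequence structure with u forward primer binding sites and v reversed primer binding sites, given an orthogonal primer library $P_N$, the primer library can be divided into the forward and reversed primer libraries $P_{N1}$ and $P_{N2}$, respectively, such that $P_{N1} \cup P_{N2} = P_N$ and $P_{N1} \cap P_{N2} = \emptyset$. The maximum address space size S that can be achieved by selecting target fragments through max(u, v) rounds of selective PCR amplification can be obtained using Eq. (6):

$$S = \prod_{m=0}^{u-1} \left( \mathrm{size}(P_{N1}) - m \right) \cdot \prod_{j=0}^{v-1} \left( \mathrm{size}(P_{N2}) - j \right) \tag{6}$$

Different levels of primer binding sites use different primers. Since the number of PCR amplification rounds needed for random file access depends on the maximum of u and v, Eq. (6) shows that, for given $P_{N1}$ and $P_{N2}$, S reaches its maximum value when $u = v$:

$$S = \prod_{m=0}^{u-1} \left( \mathrm{size}(P_{N1}) - m \right) \cdot \prod_{j=0}^{u-1} \left( \mathrm{size}(P_{N2}) - j \right)$$

$$S \le \frac{\left[ \prod_{m=0}^{u-1} \left( \mathrm{size}(P_{N1}) - m \right) + \prod_{j=0}^{u-1} \left( \mathrm{size}(P_{N2}) - j \right) \right]^2}{4}$$

Considering that $\mathrm{size}(P_{N1}) + \mathrm{size}(P_{N2}) = \mathrm{size}(P_N)$, it follows that

$$S \le \prod_{m=0}^{u-1} \left( \frac{\mathrm{size}(P_N)}{2} - m \right)^2$$

Therefore, the maximum theoretical address capacity of a DNA sequence structure with multiple primer binding sites is

$$S_{\max} = \prod_{m=0}^{u-1} \left( \frac{\mathrm{size}(P_N)}{2} - m \right)^2$$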

Fig. 5(b) shows four different mapping relationships between files and primers, and the relationship between the maximum address space and the size of the given orthogonal primer library. The functional relationship between the size of the orthogonal primer library size($P_N$) and the maximum address space $S_{\max}$ that can be used for the random access of files under the strategy of “one primer, one file” is shown in Fig. 5(b-i). The DNA file storage strategy using “one primer pair, one file” mapping is shown in Fig. 5(b-ii), the “two primer pairs, one file” mapping is shown in Fig. 5(b-iii), and the “three primer pairs, one file” mapping is shown in Fig. 5(b-iv). From Fig. 5(c), it can be seen that, with the same size of orthogonal primer library, the maximum address space that can be obtained using the multi-primer–one-file storage strategy grows exponentially compared with the traditional “one primer, one file” storage strategy. Therefore, the design scheme of using a DNA sequence structure with multiple primer binding sites can overcome the problem of insufficient address space for random access using existing methods and can greatly reduce the difficulty of primer library design and the cost of primer synthesis.
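To make the growth concrete, the maximum address space can be evaluated for a small library. The sketch assumes an even library size split equally into forward and reversed primers, u = v levels of binding sites, and an address space of one file per primer for the “one primer, one file” baseline; the function name is hypothetical.

```python
def max_address_space(library_size: int, levels: int) -> int:
    """S_max = product over m = 0..levels-1 of (library_size/2 - m)^2."""
    half = library_size // 2          # equal split into forward and reversed primers
    space = 1
    for m in range(levels):
        space *= (half - m) ** 2
    return space


library_size = 100
one_primer_one_file = library_size                     # assumed baseline: 100 addresses
one_pair = max_address_space(library_size, 1)          # 50^2           = 2 500
two_pairs = max_address_space(library_size, 2)         # (50*49)^2      = 6 002 500
three_pairs = max_address_space(library_size, 3)       # (50*49*48)^2   = 13 829 760 000
```

With only 100 orthogonal primers, two levels of binding sites already address about six million files, which is why a modest library can serve a large pool.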

3.3. Analysis of primer performance generated by the orthogonal primer designer

In order to verify the effectiveness of the proposed method for designing orthogonal primer libraries, we conducted a simulation experiment. First, we randomly generated 50 primer sequences that satisfied the primer self-constraint conditions without considering the orthogonality constraints between primers. Then, we used the improved orthogonal primer library design method to generate 50 sets of orthogonal primer sequences. The maximum complementary base pair distribution, maximum continuous complementary base pair distribution, and most stable heterodimer Gibbs free energy were analyzed for the 50 primer sequences generated under the two conditions.

As shown in Fig. 6(a), the range of the Gibbs free energy change (ΔG) was –9000 to 0 cal∙mol−1 (1 cal = 4.184 J) when only the primer self-constraint conditions were considered, and the average value fluctuated between –4000 and –2000 cal∙mol−1. A lower absolute value of ΔG indicates a less stable heterodimer, that is, a weaker tendency of two primers to bind each other. When the orthogonality constraints between primers were considered, the range was reduced to –4000 to 1000 cal∙mol−1, and the average value fluctuated between –2000 and 0 cal∙mol−1. The total average values under the two conditions were calculated to be –3130 and –1000 cal∙mol−1, respectively. In Fig. 6(b), when the orthogonality constraints between primers were considered, the average ratio of the Gibbs free energy of the most stable heterodimer generated by perfectly matched primers to that generated by mismatched primers was 21.52, while the average ratio was 5.96 when the orthogonality constraints between primers were not considered. This ratio quantifies how much more stable the binding is when primers are correctly matched than when they are mismatched.

As shown in Fig. 6(c), when only the primer self-constraint conditions were considered, the distribution range of the maximum complementary base pairs between primer sequences was 4–12 base pairs, and the average value fluctuated between 6 and 8 base pairs. When the orthogonality constraints between primers were considered, the distribution range of the maximum complementary base pairs was reduced to 1–5 base pairs, and the average value fluctuated between 3 and 5 base pairs. The total average maximum complementary base pairs for the two sets of 50 primers were calculated to be 7.3 and 3.8 base pairs, respectively. In Fig. 6(d), when only the primer self-constraint conditions were considered, the distribution range of the maximum continuous complementary base pairs between primer sequences was 2–7 base pairs, and the average value fluctuated between 3 and 4 base pairs. When the orthogonality constraints between primers were considered, the range was reduced to 0–4 base pairs, and the average value fluctuated between 1 and 3 base pairs. The total average values under the two conditions were calculated to be 3.8 and 2.3 base pairs, respectively.
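The two base-pairing statistics reported here can be sketched in Python as follows. The exact definitions used by the authors are not given in this section, so this is only one plausible reading: the two primers are aligned antiparallel without gaps at every possible offset, and the code reports the largest number of Watson-Crick pairs and the longest uninterrupted run of pairs found over all offsets. The function name `complementarity` is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementarity(p: str, q: str) -> tuple[int, int]:
    """Return (max complementary pairs, max continuous complementary pairs).

    Assumption: ungapped antiparallel alignments only; q is reversed so that
    position i of q_rev faces position i + shift of p.
    """
    q_rev = q[::-1]
    best_total = 0
    best_run = 0
    for shift in range(-(len(q_rev) - 1), len(p)):
        total = run = longest = 0
        for i, base in enumerate(q_rev):
            j = i + shift
            if 0 <= j < len(p) and COMPLEMENT[p[j]] == base:
                total += 1
                run += 1
                longest = max(longest, run)
            else:
                run = 0  # an overhang or a mismatch breaks the run
        best_total = max(best_total, total)
        best_run = max(best_run, longest)
    return best_total, best_run


print(complementarity("ACGTAC", "GTACGT"))  # q is the reverse complement of p
```

A primer library generator could reject a candidate whenever either value exceeds a chosen threshold against any primer already in the library; the thresholds themselves are a design choice not specified here.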

The above experiments show that, once the orthogonality constraints between primers are considered, the probability of inefficient or even failed PCR amplification due to primer mismatch is significantly reduced.

3.4. Experimental analysis of multi-mode DNA sequence design

To better illustrate the relationship between the total number of files and the minimum number of orthogonal primers required for file retrieval, Eqs. (1) and (2) were evaluated for the four different DNA sequence structures shown in Fig. 5(b). The maximum capacity of the primer library was set to that obtained for a primer length of 20 nt. The curves of the four file-to-primer mapping methods in Figs. 7(a) and (b) show that the single primer–file mapping method reached the maximum capacity of the orthogonal primer library when the number of files reached the order of $10^{4}$. As the number of files to be stored increased further, the “one primer pair, one file,” “two primer pairs, one file,” and “three primer pairs, one file” mapping methods all surpassed the maximum capacity of the orthogonal primer library. At this point, when the number of primer binding site levels of a DNA sequence is increased from n to n + 1, then, for the same number of files to be stored, the minimum number of orthogonal primers required for the latter was only about $N_f^{-1/[2n(n+1)]}$ times that of the former. Taking 5000 files as an example, as shown in Fig. 7(c), using the multi-mode DNA sequence design method proposed in this paper, the minimum number of orthogonal primers required under the BSLL strategy was 5000; under the PLL strategy, depending on the number of primer binding site levels of the output DNA sequence, the minimum numbers of orthogonal primers required were 142, 18, and 10, respectively.
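The 5000-file example can be checked with a short Python sketch. Eqs. (1) and (2) are not reproduced in this section, so the model below is an assumption: the library is split evenly into forward and reversed primers, and the address space of a sequence with n levels of primer pairs is approximated as (R/2)^(2n), which ignores the requirement that different levels use different primers. The function name `min_primers` is hypothetical.

```python
def min_primers(n_files: int, levels: int) -> int:
    """Smallest even library size R with (R/2)**(2*levels) >= n_files.

    Assumption: simplified capacity model (R/2)^(2n); integer search is used
    instead of floating-point roots to avoid rounding errors.
    """
    if n_files < 1 or levels < 1:
        raise ValueError("n_files and levels must be positive")
    half = 1
    while half ** (2 * levels) < n_files:
        half += 1
    return 2 * half


for levels in (1, 2, 3):
    print(levels, min_primers(5000, levels))
```

Under this simplified model, the search returns 142, 18, and 10 primers for one, two, and three levels, which agrees with the values quoted for Fig. 7(c).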

To verify the feasibility of DNA sequences with multi-level primer binding sites for random file retrieval, a series of experiments were designed and analyzed. Because PCR amplification is highly specific and sensitive, the principle of two rounds of PCR amplification is the same as that of more rounds, so two rounds were used without loss of generality. Therefore, in order to verify the effectiveness of the DNA file sequence structure with multi-primer binding sites proposed in this paper for PCR amplification file retrieval, 20 DNA file sequences with two levels of forward and reversed primer binding sites were designed in this experiment. The orthogonal primer library used in this experiment was P20 = {p1, p2, p3, p4, p5, p6}, which was divided into an orthogonal forward primer library P20,FP = {FP1 = p1, FP2 = p2, FP3 = p3} and an orthogonal reversed primer library P20,RP = {RP1 = p4, RP2 = p5, RP3 = p6}. The structure of the DNA file sequence is shown in Fig. 7(d), where L1-FP BS and L1-RP BS are the binding sites of the first-level forward and reversed primers, respectively, which serve to achieve the first round of selective PCR amplification, and L2-FP BS and L2-RP BS are the binding sites of the second-level forward and reversed primers, respectively, which serve to perform the second round of selective PCR amplification using the product of the first PCR reaction as a substrate. The selected two-level forward and reversed primer binding site sequences for the 20 sequences are shown in Table 2.

After sequencing the products of the two rounds of PCR amplification in experiments 1–3 and counting the reads of each sequence, a heatmap of groups 1–3 was obtained, as shown in Fig. 7(e). The experimental results showed that only one DNA sequence had a read ratio of > 98% in each group, and the read ratio of other sequences was significantly lower than that of the target DNA sequence (target sequence in Table 1). This indicates that, if the two forward primer binding sites and two reversed primer binding sites of the target DNA sequence are designed to be different, then the DNA sequence design scheme proposed in this paper can accurately extract the target DNA sequences.

The heatmap of the results from experiments 4–6, obtained with the same sequencing coverage as experiments 1–3 (Fig. 7(e), group index 4–6), shows that multiple sequences had significantly increased read ratios. The reason for this difference is that, unlike in experiments 1–3, the DNA sequences designed for the three experiments in group index 4–6 used identical sequences for different levels of forward or reversed primer binding sites. For example, in experiment 4, the primer FP1 was added in both the first and second rounds of PCR amplification.

From these experiments, the following rule can be drawn: If there are two or more files in the same DNA pool, and the different levels of forward or reversed primer binding site sequences in their DNA sequences are not completely different, then it will inevitably lead to the inability to uniquely amplify the target sequence after multiple rounds of selective PCR amplification. This is also the reason why this paper has established that the sequences used for different level primer binding sites must be different when analyzing the theoretical maximum address capacity of DNA file sequences with multi-level primer binding sites in previous sections.
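The rule can be expressed as a simple validity check on two-level addresses. The Python sketch below enumerates every address that can be built from three forward and three reversed primers when the two levels must differ, and flags a design that reuses a primer across levels. The primer labels follow the library described above; the tuple layout (L1-FP, L2-FP, L1-RP, L2-RP) and the function names are assumptions made for illustration.

```python
from itertools import permutations

FORWARD = ["FP1", "FP2", "FP3"]
REVERSE = ["RP1", "RP2", "RP3"]


def two_level_addresses(forward, reverse):
    """All (L1-FP, L2-FP, L1-RP, L2-RP) tuples whose two levels differ."""
    return [
        (f1, f2, r1, r2)
        for f1, f2 in permutations(forward, 2)
        for r1, r2 in permutations(reverse, 2)
    ]


def violates_rule(address):
    """True if the same primer is used at both levels on either side."""
    f1, f2, r1, r2 = address
    return f1 == f2 or r1 == r2


addresses = two_level_addresses(FORWARD, REVERSE)
print(len(addresses))  # number of usable two-level addresses
print(violates_rule(("FP1", "FP1", "RP1", "RP2")))  # FP1 reused at both levels
```

With six primers, the enumeration yields 36 valid addresses, which is enough for the 20 sequences used in the experiment.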

3.5. Discussion

The multi-mode file DNA sequence design scheme proposed in this paper can provide a solution to the need for accurate retrieval of large numbers of files with the limited capacity of orthogonal primer libraries. However, since each additional primer binding site reduces the storage density, it is critical to find a balance between the maximum address space and storage density. Fig. 7(f) shows the relationship between the maximum available address space and the storage density for a single template sequence length of 190 nt, using four different file DNA sequence structures. Here, storage density is defined as the ratio of the payload portion to the total template sequence.

Obviously, when the primer length or the level of primer binding sites increases, the storage density will decrease. However, as the level of primer binding sites increases incrementally, the maximum address space can grow exponentially. For the data organization structure of “one primer, one file,” the storage density is distributed between 0.68 (primer length of 30 nt) and 0.84 (primer length of 15 nt). When the “one primer pair, one file” organization method is used, the maximum address space can be significantly increased while keeping the storage density constant. However, due to the lack of a universal reversed primer binding site, different files may correspond to different reversed primers, which can lead to an increase in the complexity of the PCR amplifications.

If the first two organization schemes still do not satisfy the file storage requirements, additional primer binding sites must be added, which inevitably leads to a decrease in storage density. However, as shown in Fig. 7(f), the expansion of the maximum address space is exponential as the number of primer binding site levels increases. Although the storage density is only about 0.05 when using a “three primer pairs, one file” organization method with a primer length of 30 nt, the same address space capacity can be achieved by choosing a primer length of 20 nt and increasing the storage density to 0.37. From the perspective of DNA synthesis technology, with the development of synthetic biology technology, if the future synthesis of a single template sequence length reaches several hundred nucleotides or is even longer, then the storage density using the organization scheme proposed in this paper will also increase. After all, there is an inherent limit to the length of primers, and the longer the template sequence, the lower the proportion of primers in it.
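The storage densities quoted in this discussion can be reproduced with a few lines of Python. The sketch assumes that each level contributes one forward and one reversed binding site of the primer length, that the “one primer, one file” structure also carries two binding sites (a file-specific forward site and a universal reversed site), and that everything else in the 190 nt template is payload. The function name `storage_density` is hypothetical.

```python
def storage_density(template_len: int, primer_len: int, levels: int = 1) -> float:
    """Payload fraction of a template carrying `levels` pairs of binding sites.

    Assumption: overhead = 2 * levels * primer_len; no index or
    error-correction overhead is modeled.
    """
    overhead = 2 * levels * primer_len
    if overhead >= template_len:
        raise ValueError("binding sites leave no room for payload")
    return (template_len - overhead) / template_len


TEMPLATE = 190  # nt, as in Fig. 7(f)
print(round(storage_density(TEMPLATE, 30, levels=1), 2))  # one pair, 30 nt
print(round(storage_density(TEMPLATE, 15, levels=1), 2))  # one pair, 15 nt
print(round(storage_density(TEMPLATE, 30, levels=3), 2))  # three pairs, 30 nt
print(round(storage_density(TEMPLATE, 20, levels=3), 2))  # three pairs, 20 nt
```

Under these assumptions the four calls give 0.68, 0.84, 0.05, and 0.37, matching the figures in the text, and the same function shows how density recovers as the template length grows.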

The analysis in this paper reveals that the appropriate primer length and the level of primer binding sites are the key to balancing the storage density with the ability to retrieve files accurately. The quantitative variation relationship between the two is also given in this paper, facilitating the selection of appropriate sequence design patterns in different application scenarios.

Compared with previous studies, this paper provides a solution for random access when a large number of files are stored in DNA. Both Tabatabaei Yazdi et al. [12] and Organick et al. [13] proposed the use of PCR amplification for target file retrieval. However, the high requirements that PCR amplification places on primers make it difficult to retrieve target files accurately in the context of large-scale file storage. In recent years, many other researchers have explored different paths, such as microfluidics, DNA micro-disks, and silica microspheres, in an attempt to remove the capacity limitation of using primers as file identifiers. However, these attempts are still in the exploratory stage, and the related technology and equipment are not as widely available as PCR technology. Therefore, this paper still adopts the most widely used and mature PCR amplification and proposes a multi-mode DNA sequence design architecture to address its bottleneck, which can meet the demand for file retrieval under large-scale file storage while keeping the cost low.

The mapping relationship model between files and primers proposed in this paper both supports accurate one-file retrieval and implies another potential feature to support multi-mode file access. If there are multiple files using the same primer sequence for a given primer binding site—for example, in the “one primer pair, one file” architecture, assuming that the primer combination (FP1, RP1) is for file A, (FP1, RP2) is for file B, and (FP1, RP3) is for file C—then, if the same forward primer FP1 is added and a batch of reversed primers, such as RP1, RP2, and RP3, are added together for the PCR amplification reaction, the target batch of files with specific properties can be retrieved all at once. This may become a future research direction.
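The batch-retrieval idea can be sketched with a small lookup table in Python. The mapping for files A, B, and C follows the example in the text; file D and the function name `retrieve` are hypothetical additions used only to show that files with a different forward primer are not selected. The sketch models primer selection only and says nothing about amplification efficiency when several reversed primers are mixed.

```python
# (forward primer, reversed primer) -> file name
FILE_MAP = {
    ("FP1", "RP1"): "A",
    ("FP1", "RP2"): "B",
    ("FP1", "RP3"): "C",
    ("FP2", "RP1"): "D",  # hypothetical extra file sharing RP1 but not FP1
}


def retrieve(file_map, forward, reverses):
    """Files whose forward primer matches and whose reversed primer was added."""
    return sorted(
        name
        for (fp, rp), name in file_map.items()
        if fp == forward and rp in reverses
    )


print(retrieve(FILE_MAP, "FP1", {"RP1"}))                # single-file access
print(retrieve(FILE_MAP, "FP1", {"RP1", "RP2", "RP3"}))  # batch access
```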

4. Conclusions

This article proposed a multi-mode DNA sequence design method for PCR-based file retrieval in a single oligonucleotide pool. Firstly, the relationship between primer length and the capacity of an orthogonal primer library was analyzed. It was concluded that the primer length and the maximum number of available primers do not have a linear relationship, and the upper limit of the file capacity that can be supported by accurate retrieval in a single oligonucleotide pool under the “one primer, one file” mapping relationship was fitted. Secondly, the maximum theoretical address space supported by DNA sequences with multi-level primer binding sites was analyzed and compared with the address space under the “one primer, one file” mapping method. Then, a multi-mode DNA sequence design method was proposed to meet the random-access requirements of a single oligonucleotide pool containing different numbers of files. Finally, the performance of the primers generated by the orthogonal primer library design method based on a primer discriminator was analyzed through simulation experiments, and the feasibility of the accurate retrieval of target files under a two-level primer binding site was verified by selective PCR amplification experiments. When using the proposed method, a balance can be found between the difficulty of primer library design and the number of PCR reaction rounds, and file retrieval in large-scale DNA pools can be achieved with low primer library design difficulty and few rounds of PCR.

The design scheme presented in this paper is highly flexible in order to permit the selection of an appropriate design pattern as needed in practical scenarios. If the number of primers is small and no more primer sequences are designed and synthesized, using the PLL scheme proposed in this paper can provide sufficient address space for accurate file retrieval. If the number of files to be stored in the same DNA oligo pool is very large, and if the researchers prefer to design and synthesize a large number of orthogonal primers, the maximum address space can also be greatly extended using the scheme in this paper.

In conclusion, this paper breaks through the traditional method in which a strong correlation between the number of files and the number of primers is required in order to achieve accurate file retrieval. Using the BSLL and PLL sequence design methods proposed in this paper, the decoupling of the relationship between the two can be realized, providing high flexibility in the design of file retrieval for DNA storage.

In the future, as the cost of large-scale DNA synthesis falls and the achievable synthesis length increases, it may become possible to synthesize and store much larger numbers of files carrying much greater amounts of information. Moreover, the practical application of multi-level primers will be verified more thoroughly through further biochemical experiments.

CRediT authorship contribution statement

Shu-Fang Zhang: Writing - review & editing, Supervision, Resources, Project administration, Funding acquisition, Conceptualization. Yu-Hui Li: Writing - original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation. Rui-Xian Zhang: Writing - original draft, Validation, Investigation, Data curation. Bing-Zhi Li: Writing - review & editing, Supervision, Conceptualization. Qing Wang: Investigation, Formal analysis.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the fund from Tianjin Municipal Science and Technology Bureau (22JCYBJC01390).

References

[1] Machines smarter than men? Interview with Dr. Norbert Wiener, noted scientist. US News World Rep 1964 Feb:84–6.

[2] Pinho AJ, Pratas D, Ferreira PJSG. Bacteria DNA sequence compression using a mixture of finite-context models. In: Proceedings of 2011 IEEE Statistical Signal Processing Workshop; 2011 Jun 28–30; Nice, France. Piscataway: IEEE; 2011. p. 125–8.

[3] Goldman N, Bertone P, Chen S, Dessimoz C, LeProust EM, Sipos B, et al. Towards practical, high capacity, low-maintenance information storage in synthesized DNA. Nature 2013; 494(7435):77–80.

[4] Yim AKY, Yu ACS, Li JW, Wong AIC, Loo JFC, Chan KM, et al. The essential component in DNA-based information storage system: robust error-tolerating module. Front Bioeng Biotechnol 2014; 2(2):49–51.

[5] Savitri PAI, Murdiansyah DT, Astuti W. Digital medical image compression algorithm using adaptive Huffman coding and graph based quantization based on IWT-SVD. In: Proceedings of 2016 4th International Conference on Information and Communication Technology; 2016 May 25–27; Bandung, Indonesia. Piscataway: IEEE; 2016. p. 264–9.

[6] Shipman SL, Nivala J, Macklis JD, Church GM. Molecular recordings by directed CRISPR spacer acquisition. Science 2016; 353(6298):aaf1175.

[7] Shipman SL, Nivala J, Macklis JD, Church GM. CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 2017; 547(7663):345–349.

[8] Farzadfard F, Lu TK. Genomically encoded analog memory with precise in vivo DNA writing in living cell populations. Science 2014; 346(6211):1256272.

[9] Roquet N, Soleimany AP, Ferris AC, Aaronson S, Lu TK. Synthetic recombinase-based state machines in living cells. Science 2016; 353(6297):aad8559.

[10] Chen W, Han M, Zhou J, Ge Q, Wang P, Zhang X, et al. An artificial chromosome for data storage. Natl Sci Rev 2021; 8(5):nwab028.

[11] Song L, Geng F, Gong ZY, Chen X, Tang J, Gong C, et al. Robust data storage in DNA by de Bruijn graph-based de novo strand assembly. Nat Commun 2022; 13(1):5361.

[12] Tabatabaei Yazdi SMH, Yuan Y, Ma J, Zhao H, Milenkovic O. A rewritable, random-access DNA-based storage system. Sci Rep 2015; 5(1):14138.

[13] Organick L, Ang SD, Chen YJ, Lopez R, Yekhanin S, Makarychev K, et al. Random access in large-scale DNA data storage. Nat Biotechnol 2018; 36(3):242–248.

[14] Tomek KJ, Volkel K, Simpson A, Hass AG, Indermaur EW, Tuck JM, et al. Driving the scalability of DNA-based information storage systems. ACS Synth Biol 2019; 8(6):1241–1248.

[15] Song X, Shah S, Reif J. Multidimensional data organization and random access in large-scale DNA storage systems. Theor Comput Sci 2021; 894:190–202.

[16] Newman S, Stephenson AP, Willsey M, Nguyen BH, Takahashi CN, Strauss K, et al. High density DNA data storage library via dehydration with digital microfluidic retrieval. Nat Commun 2019; 10(1):1706.

[17] Bee C, Chen YJ, Queen M, Ward D, Liu X, Organick L, et al. Molecular-level similarity search brings computing to DNA data storage. Nat Commun 2021; 12(1):4764.

[18] Banal JL, Shepherd TR, Berleant J, Huang H, Reyes M, Ackerman CM, et al. Random access DNA memory using Boolean search in an archival file storage system. Nat Mater 2021; 20(9):1272–1280.

[19] Piantanida L, Hughes WL. A PCR-free approach to random access in DNA. Nat Mater 2021; 20(9):1173–1174.

[20] Löchel HF, Welzel M, Hattab G, Hauschild AC, Heider D. Fractal construction of constrained code words for DNA storage systems. Nucleic Acids Res 2022; 50(5):e30.

[21] Cao B, Li X, Zhang X, Wang B, Zhang Q, Wei X. Designing uncorrelated address constrain for DNA storage by DMVO algorithm. IEEE/ACM Trans Comput Biol Bioinf 2022; 19(2):866–877.

[22] Yin Q, Zheng Y, Wang B, Zhang Q. Design of constraint coding sets for archive DNA storage. IEEE/ACM Trans Comput Biol Bioinf 2022; 19(6):3384–3394.

[23] Nguyen BH, Takahashi CN, Gupta G, Smith JA, Rouse R, Berndt P, et al. Scaling DNA data storage with nanoscale electrode wells. Sci Adv 2021; 7(48):eabi6714.

[24] Takahashi CN, Nguyen BH, Strauss K, Ceze L. Demonstration of end-to-end automation of DNA data storage. Sci Rep 2019; 9(1):4998.

[25] Organick L, Nguyen BH, McAmis R, Chen WD, Kohll AX, Ang SD, et al. An empirical comparison of preservation methods for synthetic DNA data storage. Small Methods 2021; 5(5):2001094.

[26] Antkowiak PL, Koch J, Rzepka P, Nguyen B, Strauss K, Stark WJ, et al. Anhydrous calcium phosphate crystals stabilize DNA for dry storage. Chem Commun 2022; 58(19):3174–3177.
