In the age of information explosion, data is generated at a rapidly growing rate. According to a prediction by the International Data Corporation (IDC)[6], roughly 460 billion GB of data will be created per day. This growth challenges existing storage technology, so we need new storage media with higher data density. DNA has an extremely high storage density: in principle, about one kilogram of DNA could store all the data in the world. Moreover, DNA is structurally stable at ordinary temperatures, so it can store data for the long term without a continuous power supply such as electricity. These characteristics make DNA a promising material for data storage.
DNA storage works by encoding binary information into the four DNA bases A, T, C, and G, synthesizing DNA strands that carry this information, and reading the data back by sequencing the DNA.
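As a minimal illustration of this encoding step, a common (but here purely illustrative) convention maps two bits to one nucleotide; the particular bit-to-base assignment below is an assumption for the example, and a practical code must additionally satisfy the biochemical constraints discussed next.

```python
# Illustrative two-bits-per-nucleotide mapping (the assignment is arbitrary).
BIT2NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT2BIT = {v: k for k, v in BIT2NT.items()}

def bits_to_dna(bits):
    """Encode an even-length bit string into a DNA sequence."""
    return "".join(BIT2NT[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(seq):
    """Decode a DNA sequence back into the original bit string."""
    return "".join(NT2BIT[nt] for nt in seq)
```

For example, `bits_to_dna("0001")` yields `"AC"`, and decoding inverts the mapping exactly.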
However, DNA storage has several constraints. First, DNA synthesis is slow and expensive: storing just a few MB of data can cost thousands of dollars and take days. Second, keeping the DNA structure stable imposes constraints on GC content and run length: the fraction of G and C bases must lie between 40% and 60%, and no more than three identical bases may appear consecutively. Moreover, errors such as insertions and deletions occur when encoding data into DNA, so error-correcting codes specific to DNA storage are needed. Finally, since DNA can only be synthesized in short segments, it is hard to randomly access the data after it has been encoded into DNA. This constraint on DNA storage still needs to be solved.
To date, there have been several achievements in DNA storage around the world. In 1998, researchers at Harvard University first encoded the binary code of a picture into DNA, proving for the first time that DNA can serve as a storage medium for non-biological information. Later, Church et al.[1] of Harvard University successfully stored about 650 KB of data in DNA, thousands of times more than previous work. A year later, the European Bioinformatics Institute (EBI) translated 20 MB of data into DNA. Yazdi et al.[3] then created the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. Furthermore, Erlich and Zielinski[2] used the LT code to realize DNA Fountain, which enables an efficient DNA storage architecture and is 100 times faster than Church's method. In 2018, W. Song et al. proposed a method of encoding data into DNA that satisfies the constraints on GC content and run length[5]. As DNA is gradually recognized as a future storage material, research on DNA storage keeps growing. Microsoft and the University of Washington stored 200 MB of data in DNA in 2016[4], and they recently even built a fully automated DNA storage system, although the cost in time and money remains high. Thanks to these efforts, DNA storage methods are continuously improving.
In the next part, #2, we explore two main constraints of DNA encoding, the GC-content constraint and the run-length constraint, and run some experiments on them. In #3, an encoding strategy based on the Luby Transform will be introduced; it will be further explored in our future implementation. The fourth part, #4, describes our future plan on this topic.
To keep things simple at the very beginning, it is reasonable to first consider only the GC-content and run-length constraints, since these two are considered by most other studies.
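These two constraints are simple to state programmatically. The following sketch checks a candidate DNA sequence against the thresholds given earlier (GC fraction in [0.4, 0.6], no run of identical bases longer than 3); the function names are our own.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def max_run_length(seq):
    """Length of the longest run of consecutive identical bases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def satisfies_constraints(seq, gc_low=0.4, gc_high=0.6, max_run=3):
    """True iff the sequence meets both the GC-content and run-length constraints."""
    return gc_low <= gc_content(seq) <= gc_high and max_run_length(seq) <= max_run
```

For instance, `"ACGTACGT"` passes (GC = 0.5, no run longer than 1), while `"GGGGACGT"` fails on both counts.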
Song et al. have proposed a method that satisfies (only) these two constraints and can theoretically reach the highest code rate of
It is shown in [5] that if
Since
To generate the encoding table, i.e., the encoding function, the first step is to find a set, say
To encode a message, we first use an arbitrary one of the 64 encoding tables to encode the first segment. Then, for each subsequent segment, we choose the encoding function according to the previous three encoded nucleotides and use it to encode that segment.
To decode a message, we use the same function that encoded the first segment to decode the first encoded segment. The remaining procedure mirrors encoding.
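The control flow of this context-dependent table selection can be sketched as follows. Note that the 64 toy tables below are generated by a simple rotation and are NOT the actual tables constructed in [5]; they merely illustrate how the previous three nucleotides select the encoding function, and how the decoder replays the same selection.

```python
from itertools import product

BASES = "ACGT"

# One hypothetical table per 3-nucleotide context (4^3 = 64 tables).
# Each table bijectively maps a 2-bit symbol to one nucleotide, rotated by a
# context-dependent offset.  This is an illustrative stand-in, not [5]'s code.
TABLES = {}
for ctx in ("".join(p) for p in product(BASES, repeat=3)):
    offset = sum(BASES.index(b) for b in ctx) % 4
    TABLES[ctx] = {format(i, "02b"): BASES[(i + offset) % 4] for i in range(4)}

def encode(bits, first_ctx="ACG"):
    """Encode a bit string, re-selecting the table from the last 3 nucleotides."""
    ctx, out = first_ctx, []
    for i in range(0, len(bits), 2):
        nt = TABLES[ctx][bits[i:i + 2]]
        out.append(nt)
        ctx = (ctx + nt)[-3:]  # slide the 3-nucleotide context window
    return "".join(out)

def decode(seq, first_ctx="ACG"):
    """Decode by replaying the same context progression as the encoder."""
    ctx, out = first_ctx, []
    for nt in seq:
        inverse = {v: k for k, v in TABLES[ctx].items()}
        out.append(inverse[nt])
        ctx = (ctx + nt)[-3:]
    return "".join(out)
```

Because each table is a bijection and the decoder sees the same context sequence as the encoder, `decode(encode(bits))` always recovers the original bits.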
To find the
To store the reversed table, i.e., the decoding function, either a hashmap or binary search can be used. For binary search, the time complexity is
Hence, the expected time complexity for encoding and decoding
Due to space and time limits, we only performed tests on
Although some elements in the encoding table may not satisfy the GC-content constraint, as shown in Part 1, the encoding results in Part 2 show that the GC content in all cases falls into [0.4, 0.6], which satisfies the GC-content constraint. In Part 3, we found that the encoding speed is slow, only tens of kilobytes per second on average, and that it varies widely, from hundreds of bytes per second to hundreds of kilobytes per second. We also found that the encoding speed decreases as
- Part 1 Table GC-Content Test
- Part 2 Encoding GC-Content Test
- Part 3 Encoding Speed Test
#3 Implement a DNA storage encoding strategy based on DNA Fountain (including error-correcting codes)
DNA Fountain is a strategy for DNA storage with strong robustness against data corruption, developed by Erlich and Zielinski in 2017[2]. It can overcome both oligo dropouts and the biochemical constraints of DNA storage. The encoding process has three steps. First, the binary file to be encoded is divided into a group of non-overlapping segments of a certain length. Second, the Luby Transform (LT code) packages data into short messages named droplets. A droplet mainly contains a data portion holding the payload and a fixed-length seed used to identify the segments involved in generating the payload. Each iteration of the Luby Transform creates a droplet, which then goes through a screening procedure: the algorithm treats the binary droplet as a DNA sequence, i.e., a DNA oligo, and checks whether the oligo satisfies the GC-content and homopolymer-run constraints. Iteration continues until enough valid oligos have been created that the original data can be fully decoded from them.
The LT code is an erasure-correcting code that can be used to transmit digital data reliably over an erasure channel. The encoding algorithm can produce an unlimited number of message packets, i.e., it is rateless.
Divide the original message into n blocks of equal length. Then use a pseudorandom number generator to generate a random degree d (1 ≤ d ≤ n), the number of blocks to be XORed in the next iteration.
Select d blocks uniformly at random from the n blocks, then XOR the selected d blocks into a single block, the payload. The resulting packet contains both the payload and some extra information: the number of blocks in the original message (n) and a seed indicating which d blocks were chosen for the XOR operation.
Repeat these steps until the receiver can determine that the message can be successfully decoded.
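The droplet-generation loop above can be sketched as follows. For simplicity this sketch draws the degree uniformly at random, whereas a real LT encoder (and DNA Fountain) uses the robust soliton distribution; the function name and parameters are our own.

```python
import random

def lt_droplets(segments, num_droplets, max_degree=None):
    """Generate LT-code droplets from equal-length byte segments.

    Each droplet is (seed, payload): the seed deterministically identifies
    which blocks were XORed, so the decoder can regenerate the choice.
    Degrees are uniform here for simplicity (not the robust soliton).
    """
    n = len(segments)
    max_degree = max_degree or n
    droplets = []
    for seed in range(num_droplets):
        rng = random.Random(seed)          # seeded PRNG: choice is reproducible
        d = rng.randint(1, max_degree)     # degree: how many blocks to XOR
        chosen = rng.sample(range(n), d)   # which blocks form the payload
        payload = bytearray(len(segments[0]))
        for idx in chosen:
            for j, b in enumerate(segments[idx]):
                payload[j] ^= b
        droplets.append((seed, bytes(payload)))
    return droplets
```

On the decoding side, each seed is fed to the same PRNG to recover the chosen block set, after which degree-1 droplets reveal blocks directly and are "peeled" out of higher-degree droplets by XOR until all n blocks are known.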
Exclusive OR (XOR) is the bitwise operation used here to combine the selected blocks into the payload.
Implement another DNA storage encoding method based on DNA Fountain (including error-correcting codes).
Add error-correcting codes to our first project; they are expected to correct both insertion and deletion errors introduced during DNA synthesis.
Design a program, a random error generator, and use it to simulate the DNA storage process. The error generator produces the errors that commonly occur during DNA synthesis. The goal is to use the strategies mentioned above to recover the original file.
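Such an error generator could look like the following sketch, which injects the three common error types (substitution, insertion, deletion) at configurable per-position rates; the function name and default rates are assumptions for illustration, not measured synthesis error rates.

```python
import random

def inject_errors(seq, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, seed=None):
    """Simulate common DNA synthesis/sequencing errors on a base string.

    Each position may be deleted, substituted with a different base, or
    followed by a randomly inserted base, according to the given rates.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    out = []
    for nt in seq:
        r = rng.random()
        if r < del_rate:
            continue                                         # deletion
        if r < del_rate + sub_rate:
            nt = rng.choice([b for b in bases if b != nt])   # substitution
        out.append(nt)
        if rng.random() < ins_rate:
            out.append(rng.choice(bases))                    # insertion
    return "".join(out)
```

With all rates set to zero the sequence passes through unchanged, which makes the generator easy to sanity-check before running recovery experiments.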
For the first project we implemented, the time complexity becomes too large when n is large. We will try to optimize our algorithm to shorten the running time for large n, and we will also try to simplify the encoding and decoding tables to reduce storage space.
[1] G. M. Church, Y. Gao, and S. Kosuri, "Next-generation digital information storage in DNA," Science, vol. 337, no. 6102, p. 1628, 2012.
[2] Y. Erlich and D. Zielinski, "DNA Fountain enables a robust and efficient storage architecture," Science, vol. 355, no. 6328, pp. 950-954, 2017.
[3] S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, and O. Milenkovic, "A rewritable, random-access DNA-based storage system," Scientific Reports, vol. 5, 14138, 2015.
[4] L. Organick, S. D. Ang, Y. Chen, et al., "Random access in large-scale DNA data storage," Nature Biotechnology, vol. 36, pp. 242-248, 2018, doi:10.1038/nbt.4079.
[5] W. Song, K. Cai, M. Zhang, and C. Yuen, "Codes with run length and GC content constraints for DNA-based data storage," IEEE Communications Letters, vol. 22, no. 10, pp. 2004-2007, 2018.
[6] D. Reinsel and J. Gantz, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," IDC iView, 2012.