MPEG-G: The Future of Genetic Data Storage and Analysis

Chechi Amah
Apr 29, 2022

Note: This article originated as my AP Computer Science Principles Explore Task. It was written in late 2019 and does not reflect more recent research. Enjoy!

What is MPEG-G?

DNA data storage aims to process genetic information with MPEG-G, a data format that builds on the same compression technology as the MPEG media standards, to store the DNA of its subjects easily while reducing storage costs and commoditizing DNA analysis. The data would then be shared with medical professionals and used to diagnose patients, administer medication, and perform other treatments that require knowledge of a patient’s genetics and medical history [4]. As shown in my computational artifact, the MPEG technique compresses this information into sections small enough to be sent between personal devices.

Artifact

I used Google Slides to visualize the process of compressing any given strand of DNA into a file that can easily be shared. To acquire a picture of a DNA matrix, I used Google Images. For the chart pictures, I took screenshots of information from the website of Leonardo Chiariglione. I downloaded these images and inserted them into the Google Slide. I used the shapes feature to make lines, arrows, and word bubbles that illustrate the flow of the compression process, and I added text descriptions to explain it further. Lastly, I cited these images with numbers in brackets.

DNA Data Compression Process Diagram by Chechi Amah

Pros and Cons

DNA storage is a technology with the potential to revolutionize the way data is stored and shared. Unlike other forms of media, data in DNA can last for thousands of years without degrading. It also takes up less space and needs less power than its predecessors. Because of these factors, and because the data takes the shape of a double helix, it is less susceptible to corruption. With this computing innovation, companies such as the Swiss GenomSys can help medical professionals by turning the data collected from patients into a very small store of information [2]. How much this will benefit the lives of others remains to be seen.

Despite its many promising features, this new data storage technique is not yet perfect, and storing data this way will prove challenging for several reasons. First, the process of encoding data in DNA is very costly, at approximately US $12,400 per megabyte of data stored [2]. Second, although it seems more efficient than the storage techniques that came before it, data stored in DNA is read back slowly, so the process has so far proved tedious. Third, the data cannot be accessed in individual sections. To rewrite or reference one part of the data, one would have to sift through the entire data set and redo the process if necessary.

Tools for Input and Output

First, DNA information is taken from a subject and represented with four quaternary symbols, A, G, C, and T, one for each of the four bases of DNA; a program reads them much the way it would read binary code [7]. The program then puts each read from the input file into the bin that matches its place in the genome. Next, the data is further divided into classes and subsets and rearranged in a matrix. With the MPEG algorithm, the information is compressed into a small form that can later be decompressed. As in MPEG video, where the changes between frames are stored instead of every entire frame, the data records only what has changed rather than the whole.
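
The two ideas in this paragraph, treating the four bases as quaternary symbols and storing only what has changed, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the MPEG-G format itself: the 2-bit code assigned to each base and the function names `pack_bases`, `unpack_bases`, and `diff_against_reference` are hypothetical and chosen only for this example.

```python
# Hypothetical 2-bit code for each base (an assumption for illustration,
# not the mapping used by MPEG-G).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_bases(sequence):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the first `length` bases from bytes made by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def diff_against_reference(read, reference, offset):
    """List (position, base) pairs where the read differs from the reference."""
    return [
        (offset + i, base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


if __name__ == "__main__":
    read = "ACGTAC"
    packed = pack_bases(read)
    print(len(read), "bases stored in", len(packed), "bytes")
    print(unpack_bases(packed, len(read)))
    print(diff_against_reference("GTTC", "ACGTACGT", 2))
```

Packing four bases into each byte is where the saving over plain text comes from, and the difference list shows why a read that mostly matches a reference needs very little storage of its own.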

With MPEG-G, the data in each descriptor column is compressed with CABAC (context-adaptive binary arithmetic coding), an entropy-coding algorithm borrowed from MPEG video standards. In video coding, information can also be encoded with the Discrete Cosine Transform, or DCT, a technique that expresses raw sample values as a weighted sum of cosine waves of different frequencies. Compression works by eliminating redundancies from the information, and MPEG-G can also use lossy compression, which subtly discards information judged to be non-essential. The algorithm compresses the descriptors of each class in a bin into an Access Unit, or AU, with at most 6 per bin [8]. The MPEG-G file then contains all the AUs of all the bins that correspond to a segment of the genome. Though this design lets you access data more specifically and efficiently, it could also give hackers a way to transfer malware and corrupt the most personal of intellectual property [1][5].
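
To make the Discrete Cosine Transform mentioned above more concrete, here is a minimal sketch in plain Python. It illustrates the transform as used in video and image coding, not any part of MPEG-G, and the function names `dct_ii` and `inverse_dct` are assumptions made for this example. It uses the unnormalized DCT-II and its matching inverse.

```python
import math


def dct_ii(samples):
    """Unnormalized DCT-II: weight of each cosine frequency in the samples."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k)
            for i, x in enumerate(samples))
        for k in range(n)
    ]


def inverse_dct(coefficients):
    """Inverse of dct_ii (a scaled DCT-III): rebuild the original samples."""
    n = len(coefficients)
    return [
        (coefficients[0]
         + 2 * sum(coefficients[k] * math.cos(math.pi / n * (i + 0.5) * k)
                   for k in range(1, n))) / n
        for i in range(n)
    ]


if __name__ == "__main__":
    samples = [1.0, 2.0, 3.0, 4.0]
    coefficients = dct_ii(samples)
    print("cosine weights:", [round(c, 3) for c in coefficients])
    print("rebuilt samples:", [round(x, 3) for x in inverse_dct(coefficients)])
```

A flat signal puts all of its weight in the first coefficient and none in the others, which is why smooth data compresses well once the small coefficients are stored coarsely or dropped.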

References

[1] Bonfield, James. “MPEG-G: the Bad.” Data Geekdom, 27 Sept. 2018, http://datageekdom.blogspot.com/2018/09/mpeg-g-bad.html, Accessed 13 Dec. 2019.

[2] Caroll, Alex. “DNA: The Future of Digital Storage?” Lifeline Data Centers, Lifeline Data Centers, 22 Mar. 2013, https://lifelinedatacenters.com/data-center/dnas-digital-storage/, Accessed 16 Dec. 2019.

[3] Chiariglione, Leonardo. “Genome Is Digital, and Can Be Compressed.” Leonardo Chiariglione Blog, 13 Jan. 2019, http://blog.chiariglione.org/genome-is-digital-and-can-be-compressed/, Accessed 14 Dec. 2019.

[4] “GenomSys, the Swiss Startup Disrupting Genomics Information Handling, Closes a CHF 9.3 Million Series A Investment Round.” Business Wire, 18 Sept. 2019, https://www.businesswire.com/news/home/20190918005067/en/GenomSys-Swiss-Startup-Disrupting-Genomics-Information-Handling, Accessed 11 Dec. 2019.

[5] Langston, Jennifer. “DNA Sequencing Tools Lack Robust Protections against Cybersecurity Risks.” UW News, U of Washington P, 10 Aug. 2017, https://www.washington.edu/news/2017/08/10/dna-sequencing-tools-lack-robust-protections-against-cybersecurity-risks/, Accessed 17 Dec. 2019.

[6] Walsh, Karen McNulty. Brookhaven National Laboratory — a Passion for Discovery, 3 Mar. 2014, https://www.bnl.gov/techtransfer/news/news.php?a=24672, Accessed 16 Dec. 2019.

[7] Waltz, Emily. “With DNA Data Storage, 3D-Printed Bunnies Carry Their Blueprints.” IEEE Spectrum: Technology, Engineering, and Science News, 9 Dec. 2019, https://spectrum.ieee.org/the-human-os/biomedical/devices/dna-of-things, Accessed 10 Dec. 2019.

[8] Zielinski, Dina. “Transcript of ‘How We Can Store Digital Data in DNA.’” TED, https://www.ted.com/talks/dina_zielinski_how_we_can_store_digital_data_in_dna/transcript?language=en#t-73357, Accessed 11 Dec. 2019.
