DNA Sequence Compress Algorithm for Compression of Biological Sequences

Authors

  • Siva Phanindra DAGGUBATI, Andhra University, Visakhapatnam
  • Venkata Rao KASUKURTHI
  • Prasad Reddy PVGD

DOI:

https://doi.org/10.22399/ijcesen.675

Keywords:

DNA sequence compression, Bioinformatics, Compression ratio, GenCompress

Abstract

Rapid advances in high-throughput DNA sequencing technologies have ushered in a transformative era in genomics and bioinformatics. Vast DNA sequence datasets are now generated routinely and contribute to critical areas such as personalized medicine, drug discovery, and agricultural biotechnology. However, storing, transmitting, and processing these datasets poses significant challenges because DNA sequences are inherently large. Various DNA sequence compression algorithms have been proposed to address these challenges. This paper introduces DNASeqCompress, an algorithm tailored specifically to compressing DNA and RNA sequences. DNASeqCompress uses a statistical model-based approach to identify and compress repetitive sub-sequences: it selects frequent sub-sequences and stores them along with their positional information, which reduces storage and transmission sizes. We implement DNASeqCompress and evaluate it on various DNA sequences, comparing its performance with the existing GenCompress algorithm. Our experiments show that DNASeqCompress outperforms GenCompress in both compression and ease of implementation, particularly for sequences with repetitive patterns. Averaged across a diverse dataset of DNA sequences, the improvement of DNASeqCompress over GenCompress is approximately 15.52%. This research provides a comprehensive analysis and comparison of DNASeqCompress and GenCompress, contributing insights into DNA sequence compression algorithms.
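The idea of selecting a frequent sub-sequence and storing it together with its positions can be illustrated with a small Python sketch. This is not the DNASeqCompress algorithm itself, whose statistical model is not specified in the abstract. It is a toy illustration under stated assumptions: a single fixed-length motif, chosen by raw frequency, with the function names `most_frequent_kmer`, `encode`, and `decode` and the parameter `k` all hypothetical.

```python
from collections import Counter


def most_frequent_kmer(seq: str, k: int) -> str:
    """Return the most common length-k substring of seq (hypothetical helper)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts.most_common(1)[0][0]


def encode(seq: str, k: int) -> tuple[str, list[int], str]:
    """Store the motif once, its start offsets, and the leftover bases.

    Assumption: one motif of fixed length k, matched greedily left to right.
    """
    motif = most_frequent_kmer(seq, k)
    positions: list[int] = []   # offsets of the motif in the original sequence
    residual: list[str] = []    # bases not covered by any motif occurrence
    i = 0
    while i < len(seq):
        if seq.startswith(motif, i):
            positions.append(i)
            i += k
        else:
            residual.append(seq[i])
            i += 1
    return motif, positions, "".join(residual)


def decode(motif: str, positions: list[int], residual: str) -> str:
    """Rebuild the original sequence from the motif, its offsets, and the residual."""
    starts = set(positions)
    leftover = iter(residual)
    total = len(residual) + len(motif) * len(positions)
    out: list[str] = []
    i = 0
    while i < total:
        if i in starts:
            out.append(motif)
            i += len(motif)
        else:
            out.append(next(leftover))
            i += 1
    return "".join(out)


motif, positions, residual = encode("ACGTACGTTTACGTGG", 4)
assert decode(motif, positions, residual) == "ACGTACGTTTACGTGG"
```

In this toy example the 16-base input is represented by one 4-base motif, three offsets, and a 4-base residual. Whether that is a net saving depends on how offsets are encoded, a detail the abstract leaves open.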

Published

2024-12-11

How to Cite

Siva Phanindra DAGGUBATI, Venkata Rao KASUKURTHI, & Prasad Reddy PVGD. (2024). DNA Sequence Compress Algorithm for Compression of Biological Sequences. International Journal of Computational and Experimental Science and Engineering, 10(4). https://doi.org/10.22399/ijcesen.675

Section

Research Article