The Mathematics of Genomes
Just as we use the 26 letters of the alphabet to write text and the two bits 0 and 1 to write computer code, Nature uses the 4 basic DNA units (Adenine, Cytosine, Guanine, Thymine) to encode information as DNA strands. Formally, a DNA strand can be viewed as a “word” over the 4-letter alphabet {A, C, G, T}, and the mathematical structure of such words has implications for their biological structure and function.
This talk describes our research into the mathematical properties of genomic DNA sequences, exploring the connection between word frequencies in a genome and the type of organism that the genome belongs to. In particular, I describe our investigation into the Chaos Game Representation of a DNA sequence as a potential “genomic signature” for its species, and the usability of such genomic signatures for species identification and classification. The potential impact of such an alignment-free universal classification method could be significant, given that 86% of existing species on Earth and 91% of species in the oceans still await classification.
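
For readers unfamiliar with the two ideas mentioned above, here is a minimal Python sketch of word (k-mer) counting and of the Chaos Game Representation. It is an illustration under stated assumptions, not the implementation used in the research: the function names cgr_points and kmer_counts are hypothetical, the corner assignment (A bottom-left, C top-left, G top-right, T bottom-right) is one commonly used convention, and skipping non-ACGT characters is a simplification chosen here.

```python
from collections import Counter

# Assumed corner convention for the unit square; other layouts are possible.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the Chaos Game Representation points of a DNA sequence.

    Start at the centre of the unit square; for each nucleotide, move
    halfway from the current point towards that nucleotide's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # simplification: ignore ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def kmer_counts(seq, k):
    """Count every length-k word over {A, C, G, T} occurring in seq."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in words if set(w) <= set("ACGT"))


if __name__ == "__main__":
    print(cgr_points("AG"))
    print(kmer_counts("ACGAC", 2))
```

Plotting the points from cgr_points for a long sequence gives an image in which each sub-square at depth k corresponds to one length-k word, so the density of the image reflects the word frequencies computed by kmer_counts.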