
GenomeOcean Leverages AI to Decode Nature’s Secret Language

  • Writer: Hang Chang
  • Feb 27
  • 5 min read
GenomeOcean is a generative model available for researchers to analyze and create DNA sequences from microbial life. [Image: OpenAI]

Unlocking the intricate details of life’s genetic blueprint has long been a significant challenge in genomics. Enter GenomeOcean, a powerful generative model now available to the global research community that can not only analyze DNA sequences but also create new ones mimicking an array of microbial life. By leveraging massive user-generated datasets at the JGI and other publicly available data, GenomeOcean employs artificial intelligence for cutting-edge genomic research — and is capable of discovering and modeling complex biological sequences such as biosynthetic gene cluster structures.

Zhong Wang, a computational biologist at the U.S. Department of Energy (DOE) Joint Genome Institute (JGI), a DOE Office of Science User Facility located at Lawrence Berkeley National Laboratory (Berkeley Lab), led the project. He explained that large language models—artificial intelligence trained to understand, generate and manipulate human language—are well positioned to help enhance our understanding of life’s genomic code, advancing DOE efforts to achieve a predictive understanding of biological systems. “If we could model the genome as if it were a language, then we can take advantage of existing methodologies developed for natural languages to study genomes,” Wang explained. And since microbial genomes are simpler to model than large plant or animal genomes, they are a good starting point to test the hypothesis.

DNA as the World’s Oldest Language 

DNA is among the oldest forms of communication. This intricate code carries the information necessary for life's processes, making it a fundamental language of biology — and primed for integration with artificial intelligence tools. 

LLMs aren’t just for understanding human language and using it to communicate with machines. They have been applied in a variety of industries, from health care to finance — whether to identify studies and protocols or to analyze and predict market trends.

“Genomes do seem like a language, and the GenomeOcean model learned some structures and semantics from it,” Wang said. “Natural language has grammar and structure, and words set up in a certain way. Genomes work in a similar manner. From the gene, we have those regulatory elements like promoters in front of this protein coding gene, like a structure.” 

Think of verbs and nouns as the nucleotide bases of DNA, genes as paragraphs, and genomes as complete written works. Using large amounts of genomic data from a variety of microbial communities, the model learns to recognize patterns, structures and relationships within the data — similar to the way LLMs learn grammar and context.

This training empowers GenomeOcean to predict the biological meaning and relevance of new sequences. It can generate novel sequences based on the principles it learns from the data about genetic structure — carrying with it the potential to draft strands encoding new proteins or genetic pathways not observed in nature. Because of its enhanced neural network architecture, GenomeOcean has shown improved computational efficiency in early testing. By aiding the understanding of patterns in large, complex data, the JGI’s development of GenomeOcean supports the DOE Biological and Environmental Research (BER) program’s efforts to build predictive models for the behaviors and interactions of complex systems.
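For readers curious how prompting a genome foundation model of this kind might look in practice, here is a minimal sketch using the Hugging Face transformers library to feed a causal language model a DNA fragment and sample a continuation. The repository name, prompt sequence and generation settings below are illustrative assumptions, not GenomeOcean’s documented interface.

```python
# Minimal sketch: prompting a genome foundation model hosted on Hugging Face.
# The repository id, prompt, and sampling settings are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "example-org/genomeocean-4b"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# A short DNA fragment serves as the prompt; the model continues it token by token,
# much as a text LLM continues a sentence.
prompt = "ATGGCGACCCTGAAAGATT"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In a setup like this, sampling settings such as temperature would control how conservative or exploratory the generated sequence is.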

One of its key innovations is its use of a Byte-Pair Encoding (BPE) tokenization strategy. Imagine you’re at an arcade and need to convert your money into tokens to play the various machines. You’d go find the cash converter box, feed it a $1, $5 or $10 bill and receive a set amount of tokens, commensurate in value to your dollars. This is, in essence, what the BPE strategy achieves. It transforms DNA sequences into “tokens” that are then processed by the GenomeOcean model. Because of this increased efficiency, the model can process these tokens 150 times faster.
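As a rough illustration of the idea — a toy sketch, not code from GenomeOcean — byte-pair encoding starts from single nucleotides and repeatedly merges the most frequent adjacent pair, so recurring motifs collapse into single tokens and a long sequence is represented by far fewer units.

```python
from collections import Counter

def train_bpe(sequence, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(sequence)          # start from single nucleotides: A, C, G, T
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the frequent pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

seq = "ATGCGATGCGATATGCGA"
merges, tokens = train_bpe(seq, num_merges=4)
print(merges)   # learned pairs, e.g. ('A', 'T'), ('AT', 'G'), ...
print(tokens)   # far fewer tokens than the original 18 characters
```

Production tokenizers learn their merge table once over the whole training corpus; the toy version above just shows why grouping nucleotides shrinks the number of units the model has to process.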

Leveraging Large Data Sets for Smarter AI

Any artificial intelligence tool is only “as good as the data it is trained on,” Wang said. “Most models rely on reference genomes to train genome foundation models. However, the sheer vastness of microbial diversity is not covered by reference genomes.” 

Most of the more than 600 billion base pairs (Gigabases or Gbp) of genetic data used to train GenomeOcean are publicly available on JGI data platforms, generated from past studies conducted by JGI users. These datasets include an extensive 25-year study of microbial communities in freshwater environments, the long-term ecological Harvard Forest project, large datasets from soil microbiome studies collected across various terrains, and microbial life adapted to extreme conditions studied through the Antarctic Lake project. The model’s training also included early access to the largest biosynthetic gene cluster data set through the JGI's Secondary Metabolite Collaboratory.

In addition to the datasets generated by JGI users, GenomeOcean also utilized large-scale metagenomic assemblies from three terabase-scale (Tb) coassemblies that explore diverse environmental microbiomes. Together, these datasets represent a wide range of microbial life from various environments, including oceans, lakes, mammalian guts, forests, and soils, ensuring a robust and representative training set for GenomeOcean. 

Cumulatively, the data used to train GenomeOcean encompasses approximately 645.4 Gbp of high-quality contigs from diverse sources, derived from 219 Tb of raw environmental data.

Trained on numerous versions of genes and enzymes, GenomeOcean can generate diverse types of enzyme genes, or numerous variants of one gene, enabling biologists and engineers to draft innovative gene designs and facilitating new discoveries in microbial genome structures and functions.

Wang worked with the MetaHipMer team in Berkeley Lab’s Computing Sciences Area to develop an efficient metagenome assembler capable of assembling many terabases of sequences from large metagenome projects — ultimately assembling over 220 Tb of sequences and obtaining 700 Gbp of assemblies. The team skipped the binning process to prevent information loss, as the training does not require complete genomes.

“So, we used the raw assemblies directly to train the model,” Wang explained. “This is analogous to using a mixture of Twitter and Wikipedia, instead of Wikipedia text alone to train the natural language model.” 

This emphasis on efficiency and diversity in sampling helps avoid common pitfalls that similar models have historically faced.

“There are two things: We want to make sure our data set is high quality and also diverse — because we want the model to generalize to any genome from any habitat,” Wang said. 

Training a model of this scale with 4 billion parameters is not a task that can be handled by a small computer cluster. GenomeOcean was trained on a supercomputer operated by the National Energy Research Scientific Computing Center, through allocations funded by BER.

The View from GenomeOcean’s Horizon

With GenomeOcean still in its nascent stages, Wang’s top priority for the tool is to identify its usefulness among scientific researchers in order to enhance what it can offer. As of June 2025, GenomeOcean ranked as the most-downloaded genome foundation model on the Hugging Face platform, where it is available. Wang estimated that there are 500 to 800 genome foundation models available on the platform.

“We are collaborating with researchers from within the Berkeley Lab as well as the broader research community, some people from the Joint BioEnergy Institute (JBEI), because they really want to design new enzymes, new pathways and solve challenging environmental problems,” Wang said. “We are also seeking to work with biotechnology companies including the bioeconomy space, trying to explore whether we can leverage this model to help their research projects.”

There is still troubleshooting to be done — knowledge gaps in GenomeOcean’s ability to learn that need to be addressed, either through additional training or a different architecture. Wang seeks to optimize the model’s agility and efficiency, while minimizing its demand for substantial operational resources.

“The metagenomics data generated by the JGI are very high quality,” Wang said. “We will welcome the user community to come to build these AI models. I hope that the JGI will eventually not only host data, but also host these models — while helping its user community use them, or even build their own.”

Researchers from Northwestern University, Johns Hopkins University, the University of California, Berkeley, Illumina Inc., and Miami University were also involved in the work.


