Resources for DNA/RNA Sequence Modelling
Updated September 8th, 2024.
There is a lot of enthusiasm around generative protein design. EvolutionaryScale recently launched with a $142 million seed round. Profluent announced they had generated an improved Cas9 protein.
Meanwhile, a few of us are plugging away on modelling DNA and RNA regulatory processes. DNA and RNA therapeutics are still a smaller field than antibodies and small molecules, and the interest from the machine learning community is correspondingly smaller.
Still, a number of packages have been released over the past few months. This post rounds up some of the recent software and papers. It is in no way an exhaustive list.
Open-source software
Kipoi: The original model zoo for genomics. Currently contains 37 models, including state-of-the-art ones like Framepool.
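Kipoi models share a standard interface, so pulling one down and running it takes only a few lines. A rough sketch, where the model name and input shape are illustrative rather than a recommendation:

```python
# Minimal Kipoi sketch. "Basset" and the input shape are illustrative;
# run kipoi.list_models() to see what is actually in the zoo.
import kipoi
import numpy as np

model = kipoi.get_model("Basset")                # downloads and loads the model
x = np.random.rand(1, 600, 4).astype("float32")  # random stand-in for one-hot sequence
preds = model.predict_on_batch(x)                # numpy array of predictions
```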
tangermeme: Utilities for working with sequence models. Launched by Jacob Schreiber in April as “everything but the model”, it is now on version 0.2.0.
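The utilities are model-agnostic and operate on one-hot tensors. Something like the following, with function names as I remember them from the docs (verify against the current API):

```python
# Sketch of tangermeme's sequence utilities; names recalled from the
# docs, so double-check them before relying on this.
from tangermeme.utils import one_hot_encode   # DNA string -> (4, L) tensor
from tangermeme.ersatz import substitute      # implant a motif into sequences

X = one_hot_encode("ACGTACGTACGTACGTACGT").unsqueeze(0)  # add a batch dimension
X_motif = substitute(X, "ACGTGT")             # motif implanted at the centre
```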
gReLU: Tools for training and applying sequence models. Also includes a model zoo with unified access to state-of-the-art models such as Borzoi and Enformer, re-implemented in PyTorch.
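Loading a zoo model looks roughly like this; the project and model names are assumptions from my reading of the docs, so treat it as a sketch rather than gospel:

```python
# Hypothetical gReLU model-zoo usage; the project/model names are
# assumptions -- consult the gReLU docs for the actual catalogue.
import grelu.resources

model = grelu.resources.load_model(project="borzoi", model_name="human_fold0")
```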
Helical: Recently launched model zoo from the Helical team, a small startup out of Luxembourg aiming to democratize access to biology foundation models. Most of its current offerings are single-cell foundation models, but it also includes HyenaDNA.
Model releases
Borzoi: Transformer model to predict cell-type-specific RNA-seq coverage from DNA sequence at 32 bp resolution. Previous sequence-to-expression models have focused on modelling CAGE-seq, which only captures the 5’ end of the transcript. Expanding to RNA-seq enables the model to learn additional regulatory mechanisms like splicing and polyadenylation. The last author, David Kelley, is one of the giants in the field.
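To make the resolution concrete, the published input and bin sizes work out as follows (output cropping is ignored here for simplicity):

```python
# Back-of-the-envelope Borzoi shape arithmetic; sizes from the paper,
# cropping of the output ignored.
input_bp = 524_288         # ~524 kb of input DNA
bin_bp = 32                # output resolution per coverage bin
print(input_bp // bin_bp)  # 16384 bins per track
```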
Puffin: Explainable models of transcription initiation. The authors train a simple two-layer interpretable model to predict read coverage at the 5’ end of transcripts (CAGE, FANTOM, GRO-cap and PRO-cap), and use it to identify key motifs that influence transcription.
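I haven't dug into their code, but a two-layer motif-scanner-plus-combiner design can be sketched generically in PyTorch; the layer and kernel sizes below are invented for illustration, not Puffin's:

```python
# Generic "motif scanner -> combiner" coverage model in the spirit of
# interpretable models like Puffin. All sizes here are invented.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(4, 64, kernel_size=15, padding=7),     # layer 1: motif detectors
    nn.ReLU(),
    nn.Conv1d(64, 1, kernel_size=301, padding=150),  # layer 2: combine motif effects
)
coverage = model(torch.zeros(1, 4, 2000))            # (batch, 1, L) predicted coverage
```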
Nucleotide Transformer: Collection of language models pre-trained on DNA sequences, with varying numbers of parameters and both human-only and multi-species models. Unlike Borzoi, Puffin, and older models like Enformer, genomics language models like Nucleotide Transformer are trained on reference sequences alone, with no experimental data. Nucleotide Transformer was developed by InstaDeep, which was acquired by COVID-vaccine developer BioNTech last year.
HyenaDNA: Genomics language model with the Hyena architecture. Unlike Nucleotide Transformer, HyenaDNA uses single nucleotides as tokens and takes up to 1 million nucleotides as context.
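Single-nucleotide tokenization is simple enough to sketch in a few lines; the vocabulary below is illustrative, and HyenaDNA's real tokenizer also includes special tokens:

```python
# Character-level DNA tokenization, the scheme used (modulo special
# tokens) by single-nucleotide models. The vocabulary is illustrative.
VOCAB = {c: i for i, c in enumerate("ACGTN")}

def tokenize(seq: str) -> list[int]:
    return [VOCAB[c] for c in seq.upper()]

print(tokenize("acgtn"))  # [0, 1, 2, 3, 4]
```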
Evo: Another genomic foundation model, this one from the Arc Institute. Evo is trained on whole prokaryotic genomes, and uses a context length of 131 kb at single-nucleotide resolution.
Caduceus: The first genomics model to use the Mamba state-space architecture.
SegmentNT: A cool application of Nucleotide Transformer to segment the genome at single-nucleotide resolution. Each base pair is assigned a probability of being part of a gene, a splice site, a promoter, an enhancer, or a polyadenylation site.
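The output format is easy to picture: one probability per base per element class. A shape sketch, with the element list abbreviated and random values standing in for model output:

```python
# Shape sketch of per-nucleotide segmentation output; this element list
# is a subset of SegmentNT's classes, and the values are random.
import numpy as np

elements = ["gene", "splice_site", "promoter", "enhancer", "polyA_site"]
L = 10_000                                 # sequence length in bp
probs = np.random.rand(L, len(elements))   # stand-in for model output
print(probs[0])                            # per-class probabilities at base 0
```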
DNA-Diffusion: One of the few generative models of DNA. The paper focuses on generating 200 nt cell-type-specific regulatory elements.
Benchmarking papers
The Genomics Long-Range Benchmark: ICLR paper comparing Enformer to fine-tuned Nucleotide Transformer and HyenaDNA on three tasks: variant effect prediction, CAGE gene expression, and bulk RNA-seq gene expression. Even after fine-tuning, the language models underperform Enformer.
Sasse et al. 2023: Benchmarking paper that evaluates the ability of Enformer and other sequence-to-expression models to predict personalized gene expression from genotype. They found that the models often struggle to predict the correct direction of effect for variants.
Tang et al. 2024: Evaluations of language models without fine-tuning on various genomics assays, including an evaluation of zero-shot prediction of MPRA variant effects.
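Zero-shot variant scoring with a language model typically reduces to comparing sequence likelihoods between alleles. A sketch, where `log_likelihood` is a hypothetical stand-in for whatever scoring a given model exposes:

```python
# Zero-shot variant effect scoring sketch. `log_likelihood` is a
# hypothetical stand-in for a language model's sequence scorer.
def variant_effect(log_likelihood, ref_seq: str, alt_seq: str) -> float:
    # Positive = the alt allele is more surprising to the model.
    return log_likelihood(ref_seq) - log_likelihood(alt_seq)

# Toy demonstration with a stand-in "model" that scores GC content:
toy_ll = lambda s: sum(c in "GC" for c in s) / len(s)
print(variant_effect(toy_ll, "ACGTG", "ACATG"))  # ~0.2
```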
Other papers
Nucleotide dependency analysis of DNA language models: Very recent paper showing a new application of genomic language models to discover functional elements. Earlier papers have focused on using gradients to identify regions the model considers important; this paper instead perturbs each nucleotide and measures how the model's predictions change at every other position, mapping out pairwise dependencies.
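In toy form, with `predict` a hypothetical stand-in returning one score per position, the dependency map computation looks something like:

```python
# Toy nucleotide-dependency sketch: mutate position i and record how
# much per-position predictions move at every other position j.
# `predict` is a hypothetical stand-in for a DNA language model.
import numpy as np

def dependency_map(predict, seq: str, alphabet: str = "ACGT") -> np.ndarray:
    base = predict(seq)                          # (L,) per-position scores
    dep = np.zeros((len(seq), len(seq)))
    for i in range(len(seq)):
        for b in alphabet:
            if b == seq[i]:
                continue
            mutated = seq[:i] + b + seq[i + 1:]
            dep[i] = np.maximum(dep[i], np.abs(predict(mutated) - base))
    return dep                                   # dep[i, j]: effect of i on j

# Toy stand-in model: cumulative G count at each position.
predict = lambda s: np.array([s[: j + 1].count("G") for j in range(len(s))], float)
print(dependency_map(predict, "ACGGT"))
```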
Conferences
MLCB 2024: A more general machine learning in computational biology conference, but with substantial coverage of sequence-to-activity models. Recordings of both days of talks are available.