Engineering Custom Viruses with Genomic Language Models: Breakthroughs in RNA Design

Claude Directory December 29, 2025

Discover how MIT and Broad Institute researchers harness genomic language models to craft viruses with precise RNA folds, opening doors to advanced therapeutics while raising biosecurity questions.

## Can AI Design Custom Viruses from Scratch?

Imagine engineering a virus not through trial-and-error in a lab, but by instructing an AI model to generate its genetic code based on desired structural properties. This isn't science fiction; it's a recent achievement by researchers at MIT, the Broad Institute, and collaborating institutions. They employed genomic language models (gLMs), specialized AI systems trained on vast datasets of genetic sequences, to create synthetic viruses that fold into predetermined RNA structures.

This work pushes the boundaries of synthetic biology, offering potential for targeted therapies but also prompting careful consideration of risks. To understand this, let's break it down methodically: What are gLMs? How were they applied here? And what does it mean for the future?

## What Are Genomic Language Models and Why Do They Matter?

Genomic language models treat DNA or RNA sequences like text in natural language processing. Just as models like GPT process words to generate coherent sentences, gLMs learn patterns in nucleotide sequences (A, C, G, U for RNA) to predict structures, functions, or even generate new ones.

Traditional RNA design is challenging because sequence and structure are tightly linked: a small change in nucleotides can drastically alter how the molecule folds. Predicting folds relies on energy minimization algorithms, but inverse design (finding a sequence for a target fold) has been computationally intensive and error-prone.

Researchers addressed this with **inverse folding models**. These AIs are trained to map desired secondary structures (2D representations of base-pairing loops and stems) back to viable RNA sequences. Key datasets included RNAstructure for training and Eterna benchmarks for blind testing: crowdsourced puzzles where human players design RNA, validated experimentally.

This approach scales efficiently.
Instead of simulating physics for each candidate, the model generates thousands of sequences in seconds, filtering for those likely to fold correctly.

## How Did Researchers Build and Train These Models?

The core innovation came from the **EternaFoldInverse** model, developed by a team including Rhiju Das of Stanford. Trained on over 1.3 million RNA sequences paired with their predicted structures, it uses a transformer architecture fine-tuned for inverse design.

Here's a simplified view of the process:

1. **Data Preparation**: Curate sequences with known or predicted folds from public databases.
2. **Pre-training**: Learn forward folding (sequence → structure) as in EternaFold.
3. **Fine-tuning for Inverse**: Use reinforcement learning or supervised methods to optimize sequences for target structures, rewarding matches to desired dot-bracket notation (e.g., `(((....)))` for a stem-loop).
4. **Validation**: Test on unseen Eterna puzzles, achieving success rates far above random baselines.

You can explore the implementation yourself via the [EternaFold GitHub repository](https://github.com/eternagame/EternaFold), which includes pre-trained models, inference scripts, and training code in PyTorch.

For practical use, consider this example workflow (adapted from the repo docs):

```bash
# Install dependencies
pip install eternafold

# Generate sequence for a target structure
# (the dot-bracket string must be balanced: every "(" needs a matching ")")
python -m eternafold.inference --structure "(((.(((...)))..)))" --model eternafold
```

This outputs candidate RNA sequences, which can then be scored for folding accuracy using tools like ViennaRNA.

## Designing Custom Viruses: From Theory to Lab Reality

Building on inverse folding, the team designed entire viral genomes. They targeted **MS2 bacteriophage**, a harmless RNA virus that infects bacteria, as a model system. The goal: modify its ~3,500-nucleotide genome to fold into a custom secondary structure while preserving functionality (replication, packaging).
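
Target secondary structures like these are usually written in dot-bracket notation, and a design is judged by how closely its predicted fold matches the target. The article does not say how that match is scored, so the sketch below assumes a simple position-wise definition: a position counts as correct when it is unpaired in both structures or paired to the same partner in both. The names `parse_pairs` and `fold_fidelity` are illustrative and are not part of EternaFold or ViennaRNA. Plain parentheses cannot express pseudoknots, so this covers nested structures only.

```python
from typing import Dict


def parse_pairs(structure: str) -> Dict[int, int]:
    """Map each paired position to its partner; raise ValueError if malformed."""
    stack = []   # indices of "(" still waiting for a partner
    pairs = {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def fold_fidelity(target: str, predicted: str) -> float:
    """Fraction of positions whose partner (or unpaired state) matches the target."""
    if not target or len(target) != len(predicted):
        raise ValueError("structures must be non-empty and of equal length")
    t, p = parse_pairs(target), parse_pairs(predicted)
    # dict.get returns None for unpaired positions, so unpaired == unpaired matches
    matches = sum(1 for i in range(len(target)) if t.get(i) == p.get(i))
    return matches / len(target)


if __name__ == "__main__":
    target = "(((....)))"      # stem-loop with three base pairs
    predicted = "((......))"   # same loop, innermost pair missing
    print(parse_pairs(target))
    print(fold_fidelity(target, predicted))  # 8 of 10 positions agree -> 0.8
```

Validating the string before scoring catches unbalanced targets early, which matters because a single missing parenthesis changes every pairing that encloses it.
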
Key steps:

- Select a target structure: A novel motif not found in nature, like an extended pseudoknot or multi-loop.
- Generate variants: EternaFoldInverse produced sequences with 80-90% fidelity to the target fold in simulations.
- Synthesize and test: Transfect into E. coli cells; successful designs replicated as viable phages.
- Experimental validation: SHAPE-MaP sequencing confirmed the custom folds in vivo.

Results were striking: over 50% of AI-designed viruses were functional, compared to <10% for random mutants. This demonstrates gLMs can navigate the vast sequence space (4^3500 possible sequences for a genome of MS2's length) to find viable paths.

## Broader Tools in the Genomic AI Toolkit

This virus work leverages an ecosystem of gLMs:

- **[ChromoFold](https://github.com/shendurelab/ChromoFold)**: Predicts 3D chromatin folding from DNA sequences, trained on Hi-C data. Useful for understanding gene regulation; repo includes Jupyter notebooks for genomic visualizations.
- **[Evo](https://github.com/tamerrasha/Evo)**: Models evolutionary trajectories in protein sequences, aiding drug resistance predictions.

These tools interconnect: ChromoFold's epigenetic predictions could inform viral integration strategies, while Evo simulates long-term stability of designed genomes.

Real-world applications abound:

- **Therapeutics**: Engineer viruses for gene delivery with stability-enhancing folds.
- **Vaccines**: Custom mRNA structures for better immune evasion or expression (e.g., COVID boosters).
- **Sensors**: RNA nanostructures that change conformation on binding targets, like disease biomarkers.

Example: In cancer therapy, a gLM-designed oncolytic virus could selectively replicate in tumor cells due to fold-dependent promoters.

## Challenges, Risks, and Ethical Considerations

Success isn't without hurdles. Models excel on short RNAs (<500 nt) but struggle with long-range interactions in viral genomes.
Experimental success rates drop for complex motifs, requiring hybrid approaches (AI + wet-lab evolution).

Biosecurity looms large. Dual-use potential (beneficial for medicine, risky for bioweapons) demands safeguards like sequence watermarking or restricted model access. The researchers emphasize open-source for transparency, but advocate for governance.

## Exploring the Future: Actionable Next Steps

To dive in:

1. Clone the repos: Start with [EternaFold](https://github.com/eternagame/EternaFold) for inverse design.
2. Run benchmarks: Solve Eterna puzzles programmatically.
3. Extend: Fine-tune on custom datasets, e.g., viral UTRs.
4. Collaborate: Join Eterna's citizen science platform.

This research, published in *Nature Biotechnology* (2024), signals a paradigm shift. gLMs democratize genome engineering, much like protein design tools (e.g., RFdiffusion) transformed structural biology. Expect rapid iteration: multi-modal models incorporating 3D structure, epitranscriptomics, and host interactions.

In summary, from puzzle-solving AIs to virus factories, genomic language models are rewriting biology's code, one fold at a time. Experiment responsibly, and you might contribute to the next breakthrough.

---

<div style="text-align: center; margin-top: 2rem;">
<a href="https://www.deeplearning.ai/the-batch/researchers-use-genomic-language-models-to-create-custom-viruses/" target="_blank" rel="noopener noreferrer" class="view-full-resource-btn" style="display: inline-block; background-color: #f97316; color: white; padding: 12px 24px; border-radius: 8px; text-decoration: none; font-weight: 600; transition: background-color 0.2s;">View Full Resource</a>
</div>
