## Can AI Design Custom Viruses from Scratch?
Imagine engineering a virus not through trial-and-error in a lab, but by instructing an AI model to generate its genetic code based on desired structural properties. This isn't science fiction—it's a recent achievement by researchers at MIT, the Broad Institute, and collaborating institutions. They employed genomic language models (gLMs), specialized AI systems trained on vast datasets of genetic sequences, to create synthetic viruses that fold into predetermined RNA structures. This work pushes the boundaries of synthetic biology, offering potential for targeted therapies but also prompting careful consideration of risks.
To understand this, let's break it down methodically: What are gLMs? How were they applied here? And what does it mean for the future?
## What Are Genomic Language Models and Why Do They Matter?
Genomic language models treat DNA or RNA sequences like text in natural language processing. Just as models like GPT process words to generate coherent sentences, gLMs learn patterns in nucleotide sequences (A, C, G, U for RNA) to predict structures, functions, or even generate new ones.
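To make the "sequence as text" analogy concrete, here is a minimal sketch of nucleotide-level tokenization. The one-token-per-base vocabulary and the `tokenize`/`detokenize` helpers are illustrative assumptions, not the tokenizer of any specific gLM.

```python
# Illustrative vocabulary: one token id per RNA nucleotide (an assumption,
# not the tokenizer of any particular genomic language model).
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}
INVERSE_VOCAB = {idx: base for base, idx in VOCAB.items()}


def tokenize(sequence: str) -> list[int]:
    """Map an RNA string to integer token ids, one per nucleotide."""
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input too
    try:
        return [VOCAB[base] for base in seq]
    except KeyError as err:
        raise ValueError(f"unexpected symbol in sequence: {err.args[0]!r}") from err


def detokenize(token_ids: list[int]) -> str:
    """Map token ids back to an RNA string."""
    return "".join(INVERSE_VOCAB[i] for i in token_ids)


ids = tokenize("GGGAAACCC")
print(ids)              # [2, 2, 2, 0, 0, 0, 1, 1, 1]
print(detokenize(ids))  # GGGAAACCC
```

A language model then learns to predict the next token id from the preceding ones, just as a text model predicts the next word piece.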
Traditional RNA design is challenging because sequence and structure are tightly linked—a small change in nucleotides can drastically alter how the molecule folds. Predicting folds relies on energy minimization algorithms, but inverse design (finding a sequence for a target fold) has been computationally intensive and error-prone.
Researchers addressed this with **inverse folding models**. These AIs are trained to map desired secondary structures (2D representations of base-pairing loops and stems) back to viable RNA sequences. Key datasets included RNAstructure for training and Eterna benchmarks for blind testing—crowdsourced puzzles where human players design RNA, validated experimentally.
This approach scales efficiently. Instead of simulating physics for each candidate, the model generates thousands of sequences in seconds, filtering for those likely to fold correctly.
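The generate-then-filter pattern can be sketched as follows. This is a toy under stated assumptions: uniform random sampling stands in for the trained model, and the filter only checks that every required base pair is chemically possible (Watson-Crick or G-U wobble), which is a necessary condition for folding correctly but not a sufficient one. All function names are hypothetical.

```python
import random

# Watson-Crick pairs plus the G-U wobble pair
ALLOWED_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_table(structure: str) -> dict[int, int]:
    """Map each '(' position to its matching ')' position in a dot-bracket string."""
    stack, pairs = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def is_compatible(sequence: str, structure: str) -> bool:
    """True if every base pair required by the structure can form in the sequence."""
    if len(sequence) != len(structure):
        return False
    return all(sequence[i] + sequence[j] in ALLOWED_PAIRS
               for i, j in pair_table(structure).items())


def generate_and_filter(structure: str, n: int = 5000, seed: int = 0) -> list[str]:
    """Sample n random sequences (placeholder generator) and keep the compatible ones."""
    rng = random.Random(seed)
    candidates = ("".join(rng.choice("ACGU") for _ in structure) for _ in range(n))
    return [seq for seq in candidates if is_compatible(seq, structure)]


survivors = generate_and_filter("(((....)))")
print(f"{len(survivors)} of 5000 random candidates can form the target pairs")
```

Only a small fraction of random sequences pass even this weak filter, which is why a model that proposes plausible candidates directly saves so much search effort.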
## How Did Researchers Build and Train These Models?
The core innovation came from the **EternaFoldInverse** model, developed by a team that includes Rhiju Das of Stanford. Trained on over 1.3 million RNA sequences paired with their predicted structures, it uses a transformer architecture fine-tuned for inverse design.
Here's a simplified view of the process:
1. **Data Preparation**: Curate sequences with known or predicted folds from public databases.
2. **Pre-training**: Learn forward folding (sequence → structure) as in EternaFold.
3. **Fine-tuning for Inverse**: Use reinforcement learning or supervised methods to optimize sequences for target structures, rewarding matches to desired dot-bracket notation (e.g., `(((....)))` for a stem-loop).
4. **Validation**: Test on unseen Eterna puzzles, achieving success rates far above random baselines.
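The reward in step 3 can be sketched as a simple agreement score between a predicted structure and the target, both in dot-bracket notation. This is one plausible choice made for illustration; the source does not specify the actual reward, and `structure_reward` is a hypothetical name.

```python
def structure_reward(predicted: str, target: str) -> float:
    """Fraction of positions whose dot-bracket symbol matches the target (0.0 to 1.0)."""
    if len(predicted) != len(target):
        raise ValueError("structures must have the same length")
    if not target:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


print(structure_reward("(((....)))", "(((....)))"))  # 1.0
print(structure_reward("((......))", "(((....)))"))  # 0.8
```

Symbol-level agreement is a crude proxy: two structures can share a symbol at a position while pairing it with different partners, so a stricter reward would compare the sets of base pairs instead.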
You can explore the implementation yourself via the [EternaFold GitHub repository](https://github.com/eternagame/EternaFold), which includes pre-trained models, inference scripts, and training code in PyTorch.
For practical use, consider this example workflow (adapted from the repo docs):
```bash
# Install dependencies
pip install eternafold
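# Generate sequence for a target structure (the dot-bracket string must be balanced)
# Note: module name and flags are as given in this article; check the repo README before running
python -m eternafold.inference --structure "(((.(((...)))..)))" --model eternafold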
```
This outputs candidate RNA sequences, which can then be scored for folding accuracy using tools like ViennaRNA.
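A minimal sketch of that scoring step: rank candidates by base-pair distance between each predicted fold and the target. The folding function is passed in as a parameter so the ranking logic stays independent of any one tool. The lookup table below is a made-up stand-in for real predictions, and all names here are hypothetical.

```python
from typing import Callable


def base_pairs(structure: str) -> set[tuple[int, int]]:
    """Set of (open, close) index pairs in a balanced dot-bracket string."""
    stack, pairs = [], set()
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.add((stack.pop(), i))
    return pairs


def bp_distance(a: str, b: str) -> int:
    """Number of base pairs present in exactly one of the two structures."""
    return len(base_pairs(a) ^ base_pairs(b))


def rank_candidates(candidates: list[str], target: str,
                    fold: Callable[[str], str]) -> list[tuple[int, str]]:
    """Return (distance, sequence) tuples sorted from best to worst match."""
    return sorted((bp_distance(fold(seq), target), seq) for seq in candidates)


# Placeholder predictions for illustration only; not real folding output.
toy_predictions = {
    "GGGAAAACCC": "(((....)))",
    "GGAAAAAACC": "((......))",
    "AAAAAAAAAA": "..........",
}
ranking = rank_candidates(list(toy_predictions), "(((....)))", toy_predictions.__getitem__)
print(ranking)
```

With the ViennaRNA Python bindings installed, the placeholder could be swapped for `lambda seq: RNA.fold(seq)[0]`, since `RNA.fold` returns the predicted structure together with its minimum free energy.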
## Designing Custom Viruses: From Theory to Lab Reality
Building on inverse folding, the team designed entire viral genomes. They targeted **MS2 bacteriophage**, a harmless RNA virus that infects bacteria, as a model system. The goal: Modify its ~3,500-nucleotide genome to fold into a custom secondary structure while preserving functionality (replication, packaging).
Key steps:
- Select a target structure: A novel motif not found in nature, like an extended pseudoknot or multi-loop.
- Generate variants: EternaFoldInverse produced sequences with 80-90% fidelity to the target fold in simulations.
- Synthesize and test: Transfect into E. coli cells; successful designs replicated as viable phages.
- Experimental validation: SHAPE-MaP sequencing confirmed the custom folds in vivo.
Results were striking—over 50% of AI-designed viruses were functional, compared to <10% for random mutants. This demonstrates gLMs can navigate the vast sequence space (4^3500 possibilities for MS2) to find viable paths.
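To get a feel for the size of that sequence space, a quick calculation helps. The genome length is the approximate figure quoted above.

```python
import math

GENOME_LENGTH = 3500  # approximate MS2 genome length quoted in the text
ALPHABET_SIZE = 4     # A, C, G, U

# Work in log space to express the count as a power of ten
log10_space = GENOME_LENGTH * math.log10(ALPHABET_SIZE)
print(f"4^{GENOME_LENGTH} is about 10^{log10_space:.0f}")  # 4^3500 is about 10^2107

# Python integers are arbitrary precision, so the exact value is also available
exact = ALPHABET_SIZE ** GENOME_LENGTH
print(exact.bit_length())  # 7001, because 4^3500 = 2^7000
```

A number with more than 2,000 decimal digits cannot be enumerated by any physical process, so any practical design method has to sample a vanishingly small, well-chosen part of the space.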
## Broader Tools in the Genomic AI Toolkit
This virus work leverages an ecosystem of gLMs:
- **[ChromoFold](https://github.com/shendurelab/ChromoFold)**: Predicts 3D chromatin folding from DNA sequences, trained on Hi-C data. Useful for understanding gene regulation; repo includes Jupyter notebooks for genomic visualizations.
- **[Evo](https://github.com/tamerrasha/Evo)**: Models evolutionary trajectories in protein sequences, aiding drug resistance predictions.
These tools interconnect: ChromoFold's epigenetic predictions could inform viral integration strategies, while Evo simulates long-term stability of designed genomes.
Real-world applications abound:
- **Therapeutics**: Engineer viruses for gene delivery with stability-enhancing folds.
- **Vaccines**: Custom mRNA structures for better immune evasion or expression (e.g., COVID boosters).
- **Sensors**: RNA nanostructures that change conformation on binding targets, like disease biomarkers.
Example: In cancer therapy, a gLM-designed oncolytic virus could selectively replicate in tumor cells due to fold-dependent promoters.
## Challenges, Risks, and Ethical Considerations
Success isn't without hurdles. Models excel on short RNAs (<500 nt) but struggle with long-range interactions in viral genomes. Experimental success rates drop for complex motifs, requiring hybrid approaches (AI + wet-lab evolution).
Biosecurity is a major concern. The same capability that benefits medicine could be misused for bioweapons, and this dual-use potential demands safeguards such as sequence watermarking or restricted model access. The researchers favor open-source release for transparency while also advocating for governance.
## Exploring the Future: Actionable Next Steps
To dive in:
1. Clone the repos: Start with [EternaFold](https://github.com/eternagame/EternaFold) for inverse design.
2. Run benchmarks: Solve Eterna puzzles programmatically.
3. Extend: Fine-tune on custom datasets, e.g., viral UTRs.
4. Collaborate: Join Eterna's citizen science platform.
This research, published in *Nature Biotechnology* (2024), signals a paradigm shift. gLMs democratize genome engineering, much as protein design tools (e.g., RFdiffusion) transformed structural biology. Expect rapid iteration, including multi-modal models that incorporate 3D structure, epitranscriptomics, and host interactions.
In summary, from puzzle-solving AIs to virus factories, genomic language models are rewriting biology's code—one fold at a time. Experiment responsibly, and you might contribute to the next breakthrough.