How I Built an Autonomous Dataset Generator with CrewAI + Ollama (72-hour run, 1,065 entries)

## Background

I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($500-2,000 for decent-quality datasets). So I asked myself: *What if I could build a system that generates these datasets autonomously while I sleep?*

**Result:** 1,065 professional instruction/response pairs in 72 hours, 100% local, zero API costs.

Here's exactly how I did it.

---

## The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you face three options:

1. **Use generic datasets** → Don't match your domain
2. **Manual creation** → Exhausting, slow, doesn't scale
3. **Buy commercial datasets** → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with:

- Python/AI/ML/DevOps/Database topics
- Chain-of-Thought reasoning (not just answers)
- Real-world complexity (no template responses)
- Continuous generation (24/7 if possible)

None of the existing options delivered all four.

---

## The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

```plaintext
┌──────────────────┐
│     Curator      │ ← Selects topics from knowledge base
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Producer     │ ← Generates instruction + response + CoT
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      Critic      │ ← Validates logic, checks hallucinations
│  (DeepSeek 1.5B) │
└────────┬─────────┘
         │
         ▼
  [Accept/Reject]
```

### Agent Roles

**Curator Agent:**

- Reads ChromaDB vector store of potential topics
- Selects next topic based on diversity + priority
- Prevents duplicates via similarity search
- Think of it as a research librarian

**Producer Agent:**

- Generates the actual instruction/response pair
- Includes Chain-of-Thought reasoning
- Uses domain knowledge to create realistic scenarios
- Think of it as the expert writer

**Critic Agent:**

- Reviews Producer's output for:
  - Hallucinations
  - Logical errors
  - Incomplete reasoning
  - Generic template responses
- Binary decision: Accept or Reject
- Think of it as the peer reviewer

---

## Tech Stack

### Core Components

```python
# 1. Ollama (Local LLM Engine)
# Custom Modelfiles with 8k context:
#   qwen2.5-coder-8k:7b   # 4.7GB, main generator
#   deepseek-r1-8k:1.5b   # 1.1GB, validator

# 2. CrewAI (Agent Orchestration)
from crewai import Agent, Crew, Task

# 3. ChromaDB (Deduplication + Memory)
from chromadb import Client

# 4. Flask (Real-time Dashboard)
from flask import Flask, render_template
```

### Hardware

- **AMD Ryzen AI 9 HX 370** (12 cores, 32GB RAM shared as VRAM)
- **Mini PC** (Geekom A9 Max, ~$899)
- **NVMe SSD** (for fast model loading)

Total investment: **$899 + ~$3.60 electricity for 72 hours**

---

## Implementation Details

### Challenge 1: Memory Leaks in Long Runs

**Problem:** Agent instances accumulate state. After ~100 cycles, performance degrades and the run eventually crashes.

**Solution:** Recreate all agents every cycle.

```python
from crewai import LLM, Agent, Crew, Task

# Assumed setup: the original snippet does not show how these handles are built.
# CrewAI's LLM wrapper pointed at the local Ollama server is one way to do it.
ollama_qwen = LLM(model="ollama/qwen2.5-coder-8k:7b", base_url="http://localhost:11434")
ollama_deepseek = LLM(model="ollama/deepseek-r1-8k:1.5b", base_url="http://localhost:11434")

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks).
    # backstory and expected_output are required by CrewAI; the texts are placeholders.
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        backstory="Research librarian who tracks topic coverage",
        llm=ollama_qwen,
    )
    producer = Agent(
        role="Content Producer",
        goal="Generate high-quality instruction pair",
        backstory="Expert technical writer",
        llm=ollama_qwen,
    )
    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        backstory="Strict peer reviewer",
        llm=ollama_deepseek,
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", expected_output="One topic", agent=curator),
        Task(description="Generate content", expected_output="Instruction, response, and CoT", agent=producer),
        Task(description="Validate quality", expected_output="Accept or Reject", agent=critic),
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup
    del crew, curator, producer, critic
```

**Result:** Zero crashes in 72 hours. Stable RAM at 24.2 GB.

---

### Challenge 2: CrewAI + Ollama Compatibility

**Problem:** CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.
**Solution:** Embed configuration in Modelfiles (permanent fix).

```dockerfile
# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """You are an expert programmer..."""
```

Run:

```bash
ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek
```

This makes `num_ctx: 8192` permanent at the model level. No runtime config needed.

---

### Challenge 3: Duplicate Questions

**Problem:** Random topic generation creates duplicates.

**Solution:** ChromaDB similarity search + rejection.

```python
import chromadb

client = chromadb.Client()
# Cosine space, so that similarity = 1 - distance
collection = client.create_collection(
    "generated_questions",
    metadata={"hnsw:space": "cosine"},
)

def is_duplicate(new_question, threshold=0.85):
    """Check if question already exists via embedding similarity"""
    if collection.count() == 0:
        return False
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )
    if not results['ids'][0]:
        return False
    # Chroma returns distances (lower = closer), so convert before comparing
    similarity = 1 - results['distances'][0][0]
    return similarity > threshold

# Before saving (instruction, entry, unique_id, and save_to_dataset
# come from the surrounding generation loop)
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[unique_id]
    )
```

**Result:** 452 unique topics across 1,065 entries (zero duplicates).
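To make the similarity-threshold idea concrete without needing ChromaDB installed, here is a minimal, standard-library-only sketch. It is not the pipeline's actual code: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and `DedupIndex` and `add_if_new` are hypothetical names used only for illustration. The point is just the accept/reject logic around a cosine-similarity threshold of 0.85.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class DedupIndex:
    """Keeps accepted questions and rejects near-duplicates above a threshold."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.vectors: list[Counter] = []

    def is_duplicate(self, question: str) -> bool:
        vec = embed(question)
        return any(
            cosine_similarity(vec, seen) > self.threshold for seen in self.vectors
        )

    def add_if_new(self, question: str) -> bool:
        """Store the question and return True, or return False for a near-duplicate."""
        if self.is_duplicate(question):
            return False
        self.vectors.append(embed(question))
        return True


index = DedupIndex(threshold=0.85)
print(index.add_if_new("How to create a Dockerfile for a Python application?"))  # True
print(index.add_if_new("How to create a Dockerfile for a Python application"))   # False
print(index.add_if_new("Explain transactions and ACID guarantees in SQL"))       # True
```

A real embedding model would also catch paraphrases that share few words, which this toy version cannot; the threshold comparison itself works the same way.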
---

## Results After 72 Hours

### Metrics

| Metric | Value |
|--------|-------|
| **Total Entries** | 1,065 |
| **Success Rate** | 50.9% (605 accepted / 460 rejected) |
| **Unique Topics** | 452 |
| **Velocity** | 14.4 entries/hour |
| **RAM Usage** | 24.2 GB (stable) |
| **CPU Usage** | 31.1% average |
| **Crashes** | 0 |

### Distribution by Category

- **Python:** 346 entries (33%)
- **AI/ML:** 287 entries (27%)
- **Backend/API:** 126 entries (12%)
- **Database/SQL:** 99 entries (9%)
- **DevOps/Cloud:** 99 entries (9%)
- **Other:** 108 entries (10%)

---

## Quality Samples

### Example 1: Python Development

```json
{
  "instruction": "How to create a Dockerfile for a Python application?",
  "input": "",
  "output": "# Step-by-step guide to create a production-ready Dockerfile:\n\n1. **Choose Base Image**\n - Use official Python image: `python:3.11-slim`\n - For smaller size: `python:3.11-alpine`\n\n2. **Set Working Directory**\n ```dockerfile\n WORKDIR /app\n ```\n\n3. **Copy Requirements First** (leverage Docker cache)\n ```dockerfile\n COPY requirements.txt .\n RUN pip install --no-cache-dir -r requirements.txt\n ```\n\n4. **Copy Application Code**\n ```dockerfile\n COPY . .\n ```\n\n5. **Set Environment Variables**\n ```dockerfile\n ENV PYTHONUNBUFFERED=1\n ENV PORT=8000\n ```\n\n6. **Expose Port**\n ```dockerfile\n EXPOSE $PORT\n ```\n\n7. **Define Entrypoint**\n ```dockerfile\n CMD [\"python\", \"app.py\"]\n ```\n\n**Complete Example:**\n```dockerfile\nFROM python:3.11-slim\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nENV PYTHONUNBUFFERED=1\nEXPOSE 8000\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n**Best Practices:**\n- Use multi-stage builds for smaller images\n- Don't run as root (create user)\n- Use .dockerignore to exclude unnecessary files\n- Pin dependency versions in requirements.txt",
  "thought": "User needs containerization setup. Key considerations: 1) Layer caching optimization (copy requirements first), 2) Security (non-root user), 3) Image size (slim vs alpine tradeoff), 4) Production readiness (health checks, signals). Template should be practical and immediately usable."
}
```

**Character counts:**

- Output: 928 chars
- Thought: 266 chars

---

### Example 2: AI/ML Engineering

```json
{
  "instruction": "Explain the difference between EC2 instances and Lambda functions for ML model deployment",
  "input": "",
  "output": "**EC2 Instances vs Lambda for ML Deployment:**\n\n**EC2 Instances (Virtual Machines)**\n\n*Pros:*\n- Full control over environment (custom libraries, CUDA, etc.)\n- No execution time limits\n- Better for large models (multi-GB)\n- GPU support (P3, G4 instances)\n- Persistent connections (WebSockets, streaming)\n- Predictable latency\n\n*Cons:*\n- Always running = always paying ($50-500+/month)\n- Manual scaling configuration\n- Need to manage OS, security patches\n- Cold start when scaling from 0\n\n*Best for:*\n- Real-time inference with steady traffic\n- Models >1GB\n- GPU-intensive workloads\n- Long-running predictions (>15 min)\n\n**Lambda Functions (Serverless)**\n\n*Pros:*\n- Pay per invocation ($0.20 per 1M requests)\n- Auto-scaling (0 to thousands)\n- No server management\n- Built-in HA/fault tolerance\n\n*Cons:*\n- 15-minute timeout\n- Limited to 10GB RAM\n- No GPU support\n- Cold starts (1-5 seconds)\n- Package size limit (250MB unzipped)\n\n*Best for:*\n- Batch predictions\n- Small models (<100MB)\n- Sporadic/unpredictable traffic\n- Budget constraints\n\n**Hybrid Approach:**\nUse SageMaker Serverless Inference:\n- Lambda-like pricing\n- Support for larger models\n- Automatic scaling\n- GPU available\n\n**Decision Tree:**\n```\nModel size >1GB? → EC2\nGPU required? → EC2\nConstant traffic? → EC2\nBudget <$50/mo? → Lambda\nUnpredictable spikes? → Lambda\n```",
  "thought": "User comparing infrastructure options for ML deployment. Critical factors: cost structure (fixed vs pay-per-use), model size constraints, GPU requirements, latency tolerance, traffic patterns. Should provide concrete decision criteria and mention SageMaker Serverless as middle ground. Real-world cost examples help contextualize."
}
```

**Character counts:**

- Output: 1,876 chars
- Thought: 261 chars

---

## Cost Comparison

### My Approach (Local)

- **Hardware:** $899 one-time (reusable forever)
- **Electricity:** $3.60 for 72 hours
- **Per entry:** $0.85 amortized (or $0.003 electricity only)

### Commercial APIs (for 1,065 entries)

- **GPT-4 API:** ~$850 (800 tokens avg × $0.001/token × 1,065)
- **Claude API:** ~$1,200 (longer responses × higher price)
- **OpenRouter DeepSeek:** ~$150 (cheapest API option)

**Savings: 99%+ (after hardware pays for itself in 2-3 runs)**

---

## Lessons Learned

### 1. Quality Over Quantity

Rejecting roughly half of all candidates (50.9% success rate) seems harsh, but it's a **feature, not a bug**. The Critic agent catches:

- ❌ Hallucinated facts ("Python 4.0 was released...")
- ❌ Incomplete reasoning ("Just use library X" without explanation)
- ❌ Generic templates ("Here are 5 steps...")
- ❌ Logic errors (contradictory statements)

Only entries that pass rigorous validation make it through.

**Result:** Every entry is production-quality.

---

### 2. Hardware Performs Better Than Expected

I was skeptical about running two LLMs simultaneously on 32GB RAM. Turns out AMD's shared VRAM architecture works great:

- **During generation:** 24.2 GB RAM (87% utilization)
- **CPU:** 31.1% average (not maxed, room for optimization)
- **Cycle time:** ~6 minutes per entry (2 models sequentially)
- **Stability:** Zero crashes in 72 hours

For comparison, this would cost $40-100/month in cloud GPU time.

---

### 3. Automation Compounds

The system ran while I:

- Slept (8 hours × 3 nights = ~400 entries)
- Worked (8 hours × 3 days = ~400 entries)
- Coded other projects (8 hours × 3 days = ~265 entries)

Total: 1,065 entries with ~10 minutes of manual intervention (checking the dashboard once a day).

**That's the power of autonomous systems.**

---

## Free Datasets Released

I just uploaded **5 LITE versions (50% of data, 100% free)** to HuggingFace:

### 🐍 [Python Developer Dataset](https://huggingface.co/datasets/bernabepuente/python-instruction-dataset)

173 entries covering:

- Docker/containerization
- Testing patterns
- API development
- Package management
- Best practices

### 🤖 [AI/ML Engineering Dataset](https://huggingface.co/datasets/bernabepuente/ai-ml-instruction-dataset)

143 entries covering:

- Model deployment (EC2, Lambda, SageMaker)
- MLOps patterns
- Training optimization
- Cloud ML services
- LLM fine-tuning

### ☁️ [DevOps & Cloud Dataset](https://huggingface.co/datasets/bernabepuente/devops-cloud-instruction-dataset)

49 entries covering:

- CI/CD pipelines
- Infrastructure as Code
- Kubernetes/Docker
- Monitoring & logging
- AWS/GCP/Azure

### 🔌 [Backend & API Dataset](https://huggingface.co/datasets/bernabepuente/backend-api-instruction-dataset)

63 entries covering:

- REST/GraphQL design
- Authentication patterns
- Database optimization
- Caching strategies
- Microservices

### 💾 [Database & SQL Dataset](https://huggingface.co/datasets/bernabepuente/database-sql-instruction-dataset)

49 entries covering:

- Query optimization
- Schema design
- Transactions & ACID
- NoSQL patterns
- Performance tuning

**Format:** JSONL with `instruction`, `input`, `output`, `thought` fields

**License:** MIT (use however you want)

**Ready for:** HuggingFace Transformers, Axolotl, LLaMA Factory

---

## What's Next?

### V2 Features (in progress)

1. **Multilingual datasets** (Spanish, French, German)
2. **Difficulty levels** (Beginner/Intermediate/Advanced)
3. **Custom domains** (on-demand generation for specific niches)
4. **Video tutorials** (system walkthrough + setup guide)

### Open-Source Plans

If there's enough interest, I'll release:

- Complete agent definitions
- Modelfile configurations
- ChromaDB setup scripts
- Dashboard code (Flask + real-time metrics)
- Consolidation scripts (JSONL → categorized products)

**Update:** Based on comments, I'll publish a setup guide by the end of the week.

---

## Try It Yourself

### Quick Start

1. **Download a dataset:**

   ```bash
   wget https://huggingface.co/datasets/bernabepuente/python-instruction-dataset/resolve/main/dataset.jsonl
   ```

2. **Load in Python:**

   ```python
   import json

   with open('dataset.jsonl', 'r') as f:
       data = [json.loads(line) for line in f]

   print(f"Loaded {len(data)} entries")
   print(f"Sample: {data[0]}")
   ```

3. **Fine-tune with Transformers:**

   ```python
   from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer

   # Load your base model
   model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

   # Prepare dataset
   def format_prompt(entry):
       return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

   # Train!
   # (full training code in dataset README)
   ```

---

## Questions I'm Happy to Answer

Drop a comment if you want to know about:

- Optimizing Ollama for long runs
- CrewAI agent patterns for production
- ChromaDB deduplication strategies
- Dataset quality validation
- Hardware recommendations for local LLM workflows
- Monetization experiments (Scale AI, Gumroad, HF Premium)

---

## Conclusion

Building autonomous systems that work while you sleep is incredibly satisfying. This project taught me:

- Local LLMs can match commercial quality (with good prompts)
- Multi-agent validation catches errors humans would miss
- Hardware investment pays for itself quickly
- Automation compounds over time

**Total time invested:** ~20 hours (setup + monitoring)

**Total output:** 1,065 professional dataset entries

**Time per entry:** 1.1 minutes of human time

Try the datasets and let me know what you think!

---

**Links:**

- 🎁 Free Datasets: https://huggingface.co/bernabepuente
- 📧 Contact: (your email or Twitter)
- 💬 Discuss: Comments below!

**Tech Stack:** Ollama • CrewAI • ChromaDB • Flask • Python 3.11 • AMD Ryzen AI 9

**License:** MIT (all datasets)
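As a postscript, the per-entry figures quoted in the Cost Comparison and Conclusion can be recomputed from the raw numbers in this post. The sketch below is only a sanity check: it assumes the full hardware price is charged to this single 72-hour run, and the function and key names are made up for illustration.

```python
# Raw figures as reported in this post.
HARDWARE_USD = 899.00     # one-time mini PC cost
ELECTRICITY_USD = 3.60    # electricity for the 72-hour run
ENTRIES = 1065            # total entries produced
HUMAN_HOURS = 20          # setup + monitoring time


def per_entry_costs(hardware, electricity, entries, human_hours):
    """Return per-entry cost and effort, charging all hardware to this one run."""
    return {
        "amortized_usd": (hardware + electricity) / entries,
        "electricity_usd": electricity / entries,
        "human_minutes": human_hours * 60 / entries,
    }


costs = per_entry_costs(HARDWARE_USD, ELECTRICITY_USD, ENTRIES, HUMAN_HOURS)
print(f"Amortized cost per entry:   ${costs['amortized_usd']:.2f}")
print(f"Electricity cost per entry: ${costs['electricity_usd']:.3f}")
print(f"Human time per entry:       {costs['human_minutes']:.1f} min")
```

Reusing the same machine for further runs spreads the $899 over more entries, so the amortized figure only goes down from here.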
