How I Built an Autonomous Dataset Generator with CrewAI + Ollama (72-hour run, 1,065 entries) — DeepSeek Blog | Neura Market

    Bernabé Puente Moure April 14, 2026

## Background

I needed high-quality instruction datasets for fine-tuning local LLMs, but commercial options were prohibitively expensive ($500-2,000 for decent-quality datasets).

So I asked myself: *What if I could build a system that generates these datasets autonomously while I sleep?*

**Result:** 1,065 professional instruction/response pairs in 72 hours, 100% local, zero API costs.

Here's exactly how I did it.

---

## The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you face three options:

1. **Use generic datasets** → Don't match your domain
2. **Manual creation** → Exhausting, slow, doesn't scale
3. **Buy commercial datasets** → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with:

- Python/AI/ML/DevOps/Database topics
- Chain-of-Thought reasoning (not just answers)
- Real-world complexity (no template responses)
- Continuous generation (24/7 if possible)

None of the existing options delivered all four.

---

## The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

```plaintext
┌──────────────────┐
│     Curator      │ ← Selects topics from knowledge base
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│     Producer     │ ← Generates instruction + response + CoT
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      Critic      │ ← Validates logic, checks hallucinations
│ (DeepSeek 1.5B)  │
└────────┬─────────┘
         │
         ▼
  [Accept/Reject]
```

### Agent Roles

**Curator Agent:**

- Reads ChromaDB vector store of potential topics
- Selects next topic based on diversity + priority
- Prevents duplicates via similarity search
- Think of it as a research librarian

**Producer Agent:**

- Generates the actual instruction/response pair
- Includes Chain-of-Thought reasoning
- Uses domain knowledge to create realistic scenarios
- Think of it as the expert writer

**Critic Agent:**

- Reviews Producer's output for:
  - Hallucinations
  - Logical errors
  - Incomplete reasoning
  - Generic template responses
- Binary decision: Accept or Reject
- Think of it as the peer reviewer

---

## Tech Stack

### Core Components

```python
# 1. Ollama (Local LLM Engine)
# Custom Modelfiles with 8k context:
#   qwen2.5-coder-8k:7b   4.7GB, main generator
#   deepseek-r1-8k:1.5b   1.1GB, validator

# 2. CrewAI (Agent Orchestration)
from crewai import Agent, Crew, Task

# 3. ChromaDB (Deduplication + Memory)
from chromadb import Client

# 4. Flask (Real-time Dashboard)
from flask import Flask, render_template
```

### Hardware

- **AMD Ryzen AI 9 HX 370** (12 cores, 32GB RAM shared as VRAM)
- **Mini PC** (Geekom A9 Max, ~$899)
- **NVMe SSD** (for fast model loading)

Total investment: **$899 + ~$3.60 electricity for 72 hours**

---

## Implementation Details

### Challenge 1: Memory Leaks in Long Runs

**Problem:** Agent instances accumulate state. After ~100 cycles, performance degrades and the process eventually crashes.

**Solution:** Recreate all agents every cycle.

```python
from crewai import Agent, Crew, Task, LLM

# Assumed setup: LLM handles pointing at a local Ollama server on its default port
ollama_qwen = LLM(model="ollama/qwen2.5-coder-8k:7b", base_url="http://localhost:11434")
ollama_deepseek = LLM(model="ollama/deepseek-r1-8k:1.5b", base_url="http://localhost:11434")

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks)
    # backstory / expected_output values below are illustrative placeholders
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        backstory="Research librarian for the topic knowledge base",
        llm=ollama_qwen,
    )
    producer = Agent(
        role="Content Producer",
        goal="Generate high-quality instruction pair",
        backstory="Expert technical writer",
        llm=ollama_qwen,
    )
    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        backstory="Strict peer reviewer",
        llm=ollama_deepseek,
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", expected_output="One topic", agent=curator),
        Task(description="Generate content", expected_output="Instruction, response and CoT", agent=producer),
        Task(description="Validate quality", expected_output="Accept or Reject", agent=critic),
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup
    del crew, curator, producer, critic
```

**Result:** Zero crashes in 72 hours. Stable RAM at 24.2 GB.

---

### Challenge 2: CrewAI + Ollama Compatibility

**Problem:** CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.
**Solution:** Embed configuration in Modelfiles (permanent fix).

```dockerfile
# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """You are an expert programmer..."""
```

Run:

```bash
ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek
```

This makes `num_ctx: 8192` permanent at the model level. No runtime config needed.

---

### Challenge 3: Duplicate Questions

**Problem:** Random topic generation creates duplicates.

**Solution:** ChromaDB similarity search + rejection.

```python
import uuid

import chromadb

client = chromadb.Client()
# Cosine space, so that distance = 1 - cosine similarity
collection = client.get_or_create_collection(
    "generated_questions",
    metadata={"hnsw:space": "cosine"},
)

def is_duplicate(new_question, threshold=0.85):
    """Check if question already exists via embedding similarity"""
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )
    # query() returns one list per query text; an empty inner list means no match
    if not results['ids'][0]:
        return False
    # Chroma reports distances (lower = closer), so convert to a similarity
    similarity = 1.0 - results['distances'][0][0]
    return similarity > threshold

# Before saving (instruction, entry and save_to_dataset come from the pipeline)
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[str(uuid.uuid4())]
    )
```

**Result:** 452 unique topics across 1,065 entries (zero duplicates).
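To make the 0.85 threshold concrete, here is a minimal, dependency-free sketch of the same accept/reject check. It is not the pipeline's code: `toy_embed` is a hypothetical word-count stand-in for the real embedding model ChromaDB would use, and the in-memory `seen` list stands in for the collection.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_duplicate(new_question, seen_questions, threshold=0.85):
    """Reject a question if any stored question is more similar than the threshold."""
    new_vec = toy_embed(new_question)
    return any(
        cosine_similarity(new_vec, toy_embed(q)) > threshold
        for q in seen_questions
    )

seen = ["How to create a Dockerfile for a Python application?"]
print(is_duplicate("How to create a Dockerfile for a Python application?", seen))
print(is_duplicate("How do I tune a PostgreSQL index?", seen))
```

With a real embedding model the scores behave differently (paraphrases score high even with no shared words), so the threshold would likely need tuning against a sample of known duplicates.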
---

## Results After 72 Hours

### Metrics

| Metric | Value |
|--------|-------|
| **Total Entries** | 1,065 |
| **Success Rate** | 50.9% (605 accepted / 460 rejected) |
| **Unique Topics** | 452 |
| **Velocity** | 14.4 entries/hour |
| **RAM Usage** | 24.2 GB (stable) |
| **CPU Usage** | 31.1% average |
| **Crashes** | 0 |

### Distribution by Category

- **Python:** 346 entries (33%)
- **AI/ML:** 287 entries (27%)
- **Backend/API:** 126 entries (12%)
- **Database/SQL:** 99 entries (9%)
- **DevOps/Cloud:** 99 entries (9%)
- **Other:** 108 entries (10%)

---

## Quality Samples

### Example 1: Python Development

```json
{
  "instruction": "How to create a Dockerfile for a Python application?",
  "input": "",
  "output": "# Step-by-step guide to create a production-ready Dockerfile:\n\n1. **Choose Base Image**\n - Use official Python image: `python:3.11-slim`\n - For smaller size: `python:3.11-alpine`\n\n2. **Set Working Directory**\n ```dockerfile\n WORKDIR /app\n ```\n\n3. **Copy Requirements First** (leverage Docker cache)\n ```dockerfile\n COPY requirements.txt .\n RUN pip install --no-cache-dir -r requirements.txt\n ```\n\n4. **Copy Application Code**\n ```dockerfile\n COPY . .\n ```\n\n5. **Set Environment Variables**\n ```dockerfile\n ENV PYTHONUNBUFFERED=1\n ENV PORT=8000\n ```\n\n6. **Expose Port**\n ```dockerfile\n EXPOSE $PORT\n ```\n\n7. **Define Entrypoint**\n ```dockerfile\n CMD [\"python\", \"app.py\"]\n ```\n\n**Complete Example:**\n```dockerfile\nFROM python:3.11-slim\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nENV PYTHONUNBUFFERED=1\nEXPOSE 8000\nCMD [\"uvicorn\", \"main:app\", \"--host\", \"0.0.0.0\", \"--port\", \"8000\"]\n```\n\n**Best Practices:**\n- Use multi-stage builds for smaller images\n- Don't run as root (create user)\n- Use .dockerignore to exclude unnecessary files\n- Pin dependency versions in requirements.txt",
  "thought": "User needs containerization setup. Key considerations: 1) Layer caching optimization (copy requirements first), 2) Security (non-root user), 3) Image size (slim vs alpine tradeoff), 4) Production readiness (health checks, signals). Template should be practical and immediately usable."
}
```

**Character counts:**

- Output: 928 chars
- Thought: 266 chars

---

### Example 2: AI/ML Engineering

```json
{
  "instruction": "Explain the difference between EC2 instances and Lambda functions for ML model deployment",
  "input": "",
  "output": "**EC2 Instances vs Lambda for ML Deployment:**\n\n**EC2 Instances (Virtual Machines)**\n\n*Pros:*\n- Full control over environment (custom libraries, CUDA, etc.)\n- No execution time limits\n- Better for large models (multi-GB)\n- GPU support (P3, G4 instances)\n- Persistent connections (WebSockets, streaming)\n- Predictable latency\n\n*Cons:*\n- Always running = always paying ($50-500+/month)\n- Manual scaling configuration\n- Need to manage OS, security patches\n- Cold start when scaling from 0\n\n*Best for:*\n- Real-time inference with steady traffic\n- Models >1GB\n- GPU-intensive workloads\n- Long-running predictions (>15 min)\n\n**Lambda Functions (Serverless)**\n\n*Pros:*\n- Pay per invocation ($0.20 per 1M requests)\n- Auto-scaling (0 to thousands)\n- No server management\n- Built-in HA/fault tolerance\n\n*Cons:*\n- 15-minute timeout\n- Limited to 10GB RAM\n- No GPU support\n- Cold starts (1-5 seconds)\n- Package size limit (250MB unzipped)\n\n*Best for:*\n- Batch predictions\n- Small models (<100MB)\n- Sporadic/unpredictable traffic\n- Budget constraints\n\n**Hybrid Approach:**\nUse SageMaker Serverless Inference:\n- Lambda-like pricing\n- Support for larger models\n- Automatic scaling\n- GPU available\n\n**Decision Tree:**\n```\nModel size >1GB? → EC2\nGPU required? → EC2\nConstant traffic? → EC2\nBudget <$50/mo? → Lambda\nUnpredictable spikes? → Lambda\n```",
  "thought": "User comparing infrastructure options for ML deployment. Critical factors: cost structure (fixed vs pay-per-use), model size constraints, GPU requirements, latency tolerance, traffic patterns. Should provide concrete decision criteria and mention SageMaker Serverless as middle ground. Real-world cost examples help contextualize."
}
```

**Character counts:**

- Output: 1,876 chars
- Thought: 261 chars

---

## Cost Comparison

### My Approach (Local)

- **Hardware:** $899 one-time (reusable forever)
- **Electricity:** $3.60 for 72 hours
- **Per entry:** $0.85 amortized (or $0.003 electricity only)

### Commercial APIs (for 1,065 entries)

- **GPT-4 API:** ~$850 (800 tokens avg × $0.001/token × 1,065)
- **Claude API:** ~$1,200 (longer responses × higher price)
- **OpenRouter DeepSeek:** ~$150 (cheapest API option)

**Savings: 99%+ (after hardware pays for itself in 2-3 runs)**

---

## Lessons Learned

### 1. Quality Over Quantity

A 50.9% rejection rate seems harsh, but it's a **feature, not a bug**. The Critic agent catches:

- ❌ Hallucinated facts ("Python 4.0 was released...")
- ❌ Incomplete reasoning ("Just use library X" without explanation)
- ❌ Generic templates ("Here are 5 steps...")
- ❌ Logic errors (contradictory statements)

Only entries that pass rigorous validation make it through.

**Result:** Every entry is production-quality.

---

### 2. Hardware Performs Better Than Expected

I was skeptical about running two LLMs simultaneously on 32GB RAM. It turns out AMD's shared VRAM architecture works great:

- **During generation:** 24.2 GB RAM (87% utilization)
- **CPU:** 31.1% average (not maxed, room for optimization)
- **Cycle time:** ~6 minutes per entry (2 models sequentially)
- **Stability:** Zero crashes in 72 hours

For comparison, this would cost $40-100/month in cloud GPU time.

---

### 3. Automation Compounds

The system ran while I:

- Slept (8 hours × 3 nights = ~400 entries)
- Worked (8 hours × 3 days = ~400 entries)
- Coded other projects (8 hours × 3 days = ~265 entries)

Total: 1,065 entries with ~10 minutes of manual intervention (checking the dashboard once a day).

**That's the power of autonomous systems.**

---

## Free Datasets Released

I just uploaded **5 LITE versions (50% of data, 100% free)** to HuggingFace:

### 🐍 [Python Developer Dataset](https://huggingface.co/datasets/bernabepuente/python-instruction-dataset)

173 entries covering:

- Docker/containerization
- Testing patterns
- API development
- Package management
- Best practices

### 🤖 [AI/ML Engineering Dataset](https://huggingface.co/datasets/bernabepuente/ai-ml-instruction-dataset)

143 entries covering:

- Model deployment (EC2, Lambda, SageMaker)
- MLOps patterns
- Training optimization
- Cloud ML services
- LLM fine-tuning

### ☁️ [DevOps & Cloud Dataset](https://huggingface.co/datasets/bernabepuente/devops-cloud-instruction-dataset)

49 entries covering:

- CI/CD pipelines
- Infrastructure as Code
- Kubernetes/Docker
- Monitoring & logging
- AWS/GCP/Azure

### 🔌 [Backend & API Dataset](https://huggingface.co/datasets/bernabepuente/backend-api-instruction-dataset)

63 entries covering:

- REST/GraphQL design
- Authentication patterns
- Database optimization
- Caching strategies
- Microservices

### 💾 [Database & SQL Dataset](https://huggingface.co/datasets/bernabepuente/database-sql-instruction-dataset)

49 entries covering:

- Query optimization
- Schema design
- Transactions & ACID
- NoSQL patterns
- Performance tuning

**Format:** JSONL with `instruction`, `input`, `output`, `thought` fields

**License:** MIT (use however you want)

**Ready for:** HuggingFace Transformers, Axolotl, LLaMA Factory

---

## What's Next?

### V2 Features (in progress)

1. **Multilingual datasets** (Spanish, French, German)
2. **Difficulty levels** (Beginner/Intermediate/Advanced)
3. **Custom domains** (on-demand generation for specific niches)
4. **Video tutorials** (system walkthrough + setup guide)

### Open-Source Plans

If there's enough interest, I'll release:

- Complete agent definitions
- Modelfile configurations
- ChromaDB setup scripts
- Dashboard code (Flask + real-time metrics)
- Consolidation scripts (JSONL → categorized products)

**Update:** Based on comments, I'll publish a setup guide by the end of the week.

---

## Try It Yourself

### Quick Start

1. **Download a dataset:**

```bash
wget https://huggingface.co/datasets/bernabepuente/python-instruction-dataset/resolve/main/dataset.jsonl
```

2. **Load in Python:**

```python
import json

# One JSON object per line; skip any blank lines
with open('dataset.jsonl', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(data)} entries")
print(f"Sample: {data[0]}")
```

3. **Fine-tune with Transformers:**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer

# Load your base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prepare dataset
def format_prompt(entry):
    return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

# Train!
# (full training code in dataset README)
```

---

## Questions I'm Happy to Answer

Drop a comment if you want to know about:

- Optimizing Ollama for long runs
- CrewAI agent patterns for production
- ChromaDB deduplication strategies
- Dataset quality validation
- Hardware recommendations for local LLM workflows
- Monetization experiments (Scale AI, Gumroad, HF Premium)

---

## Conclusion

Building autonomous systems that work while you sleep is incredibly satisfying.
This project taught me:

- Local LLMs can match commercial quality (with good prompts)
- Multi-agent validation catches errors humans would miss
- Hardware investment pays for itself quickly
- Automation compounds exponentially

**Total time invested:** ~20 hours (setup + monitoring)

**Total output:** 1,065 professional dataset entries

**Time per entry:** 1.1 minutes of human time

Try the datasets and let me know what you think!

---

**Links:**

- 🎁 Free Datasets: https://huggingface.co/bernabepuente
- 📧 Contact: (your email or Twitter)
- 💬 Discuss: Comments below!

**Tech Stack:** Ollama • CrewAI • ChromaDB • Flask • Python 3.11 • AMD Ryzen AI 9

**License:** MIT (all datasets)
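One practical postscript for anyone loading the JSONL files: it may be worth checking that every record actually carries the four documented fields (`instruction`, `input`, `output`, `thought`) before handing it to a trainer. The sketch below is an illustration, not part of the released tooling; `parse_entries` and the sample lines are made up for the example, and `format_prompt` mirrors the Quick Start snippet.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output", "thought")

def parse_entries(lines):
    """Parse JSONL lines, keeping only records where every required field is a string."""
    valid, rejected = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as rejects
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            rejected += 1
            continue
        if isinstance(entry, dict) and all(
            isinstance(entry.get(key), str) for key in REQUIRED_FIELDS
        ):
            valid.append(entry)
        else:
            rejected += 1
    return valid, rejected

def format_prompt(entry):
    """Same prompt layout as the Quick Start snippet."""
    return f"Instruction: {entry['instruction']}\n\nResponse: {entry['output']}"

# Made-up sample lines: one valid record, one missing fields, one not JSON
sample_lines = [
    '{"instruction": "What is a list?", "input": "", "output": "An ordered collection.", "thought": "Define briefly."}',
    '{"instruction": "Missing fields"}',
    'not json',
]
entries, rejected = parse_entries(sample_lines)
print(f"{len(entries)} valid, {rejected} rejected")
print(format_prompt(entries[0]))
```

To run it against a downloaded file, pass the open file handle instead of `sample_lines`, since iterating a text file yields its lines.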

Tags: ai, automation, llm, machinelearning
