Multi-Agent Synthetic Data Generator Platform (Google Agent SDK + Gemini + Cloud Run)
1. Overview
The “Multi-Agent Synthetic Data Generator Platform” is a cloud-native system that lets users generate privacy-safe, domain-realistic synthetic datasets from natural-language instructions.
The platform uses a multi-agent workflow built with Google Agent SDK, powered by Gemini, deployed on Cloud Run, and backed by Firestore + GCS.
Users describe the dataset they need, provide a data dictionary, and approve a sample preview; the system then generates large-scale datasets for analytics, testing, and ML training.
2. Problem Statement
Organizations need high-quality datasets for:
ML model development
Analytics
Testing environments
Prototyping
However, real data contains:
PII
sensitive attributes
compliance constraints (GDPR, HIPAA, PCI)
security risks
Synthetic data greatly reduces the risk of leaking real records while preserving statistical realism.
But current synthetic data tools require:
Data science skills
Complex configurations
Manual schema engineering
This product replaces those steps with a natural-language workflow automated by multiple cooperating agents.
3. Goals
Primary Goals
Allow users to describe dataset requirements in English
Automatically convert descriptions to well-defined schema
Ask clarifying questions before generation
Generate 10-row synthetic preview
Generate full-scale synthetic dataset on approval
Store dataset + metadata in GCS + Firestore
Use multi-agent architecture with defined roles
Secondary Goals
Generate ML-ready datasets:
normalized
balanced classes
outlier/noise injection
train/val/test split
label generation
4. Non-Goals
(Not required for 1-day hackathon MVP)
Training deep generative models
Real-time stream data generation
Custom GAN training
Data encryption/key management for commercial use
5. Target Users
Data scientists
ML engineers
QA teams
Analytics teams
App developers needing seed/test data
Enterprises needing privacy-safe alternatives to production data
6. Multi-Agent System (Google Agent SDK)
The system uses multiple specialized agents, each responsible for a specific task.
Agent 1: NL Interpreter Agent
Purpose: Convert the user’s natural-language description into an initial structured schema.
Input:
Plain English description
Domain selection
Output:
Draft schema (fields, types, constraints)
Points of ambiguity
Tools: Gemini Flash / Pro
Agent 2: Schema Clarification Agent
Purpose: Ask follow-up questions to ensure schema correctness.
Responsibilities:
Validate fields
Identify missing constraints
Request domain-specific details
Output:
Final schema (validated, structured JSON)
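As a rough illustration of what a validated, structured JSON schema could look like, the sketch below models fields, types, and constraints with Python dataclasses. The attribute names (`name`, `dtype`, `nullable`, `constraints`) and the `retail_transactions` example are assumptions made for illustration; this document does not fix a schema format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class FieldSpec:
    # Hypothetical field layout; the real schema format is not specified here.
    name: str
    dtype: str  # e.g. "int", "float", "string", "date"
    nullable: bool = False
    constraints: dict[str, Any] = field(default_factory=dict)


@dataclass
class DatasetSchema:
    dataset_name: str
    domain: str
    fields: list[FieldSpec]

    def to_json(self) -> str:
        """Serialize to the structured JSON handed from one agent to the next."""
        return json.dumps(asdict(self), indent=2)


# Example schema with made-up fields for a retail domain.
schema = DatasetSchema(
    dataset_name="retail_transactions",
    domain="retail",
    fields=[
        FieldSpec("transaction_id", "string", constraints={"unique": True}),
        FieldSpec("amount", "float", constraints={"min": 0.0, "max": 5000.0}),
        FieldSpec("channel", "string", constraints={"enum": ["web", "store"]}),
    ],
)
schema_json = schema.to_json()
```

Keeping constraints in a free-form dictionary is one possible design choice: it lets the clarification step add domain-specific rules without changing the class definition.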
Agent 3: Schema Validator Agent
Purpose: Confirm that the schema has:
no conflicting constraints
no PII leakage
proper type mapping
only allowed distributions
Output:
Approved schema
Errors or corrections
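The four checks above could be approximated with a few deterministic rules before (or alongside) any Gemini-based review. The following is a minimal sketch under stated assumptions: the allowed type and distribution sets, the PII name hints, and the flat field dictionaries are all hypothetical, not part of the specification.

```python
# Assumed vocabularies; the document does not list the permitted values.
ALLOWED_TYPES = {"int", "float", "string", "date", "bool"}
ALLOWED_DISTRIBUTIONS = {"uniform", "normal", "categorical"}
# Illustrative, non-exhaustive list of name fragments that suggest PII.
PII_HINTS = ("ssn", "passport", "email", "phone", "address")


def validate_schema(fields: list[dict]) -> list[str]:
    """Return human-readable errors; an empty list means the schema is approved."""
    errors = []
    for f in fields:
        name = f.get("name", "<unnamed>")
        # Type mapping: every field must use a known type.
        if f.get("dtype") not in ALLOWED_TYPES:
            errors.append(f"{name}: unsupported type {f.get('dtype')!r}")
        # Conflicting constraints: a lower bound above the upper bound.
        lo, hi = f.get("min"), f.get("max")
        if lo is not None and hi is not None and lo > hi:
            errors.append(f"{name}: min {lo} exceeds max {hi}")
        # Distributions: only names from the allowed set pass.
        dist = f.get("distribution")
        if dist is not None and dist not in ALLOWED_DISTRIBUTIONS:
            errors.append(f"{name}: distribution {dist!r} not allowed")
        # PII: flag suspicious field names for review rather than rejecting.
        if any(hint in name.lower() for hint in PII_HINTS):
            errors.append(f"{name}: name suggests PII; confirm values are fictional")
    return errors
```

Returning a list of messages rather than raising on the first problem fits the agent's stated output of "errors or corrections", since every issue can be reported in one pass.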
Agent 4: Sample Generator Agent
Purpose: Generate a 10-row synthetic sample dataset.
Output format: CSV or JSONL
Tools: Gemini (structured output enforced)
Goal:
Provide user preview
Ensure quality before generating full dataset
Agent 5: Bulk Generator Agent
Purpose: Generate full dataset using batch calls.
Batching Strategy:
100–500 rows per batch
Merge in backend
Tools: Gemini
Storage: GCS
Output:
Final dataset (CSV)
Firestore job metadata
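The batching strategy can be sketched as two small helpers: one that splits the requested row count into per-call batch sizes, and one that merges the returned CSV fragments in the backend. This is an illustrative sketch only; the function names are hypothetical, and it assumes each batch comes back as CSV text with its own header row.

```python
import csv
import io


def plan_batches(total_rows: int, batch_size: int = 100) -> list[int]:
    """Split total_rows into batch sizes; the last batch takes the remainder."""
    if total_rows < 0 or batch_size <= 0:
        raise ValueError("total_rows must be >= 0 and batch_size must be > 0")
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def merge_csv_batches(batches: list[str]) -> str:
    """Merge CSV strings that each carry a header, keeping only the first header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in batches:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty responses
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("batch header does not match the first batch")
        writer.writerows(rows[1:])
    return out.getvalue()
```

For example, `plan_batches(1050, 500)` yields `[500, 500, 50]`, so a 1,050-row job would need three generation calls.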
Agent 6: Quality & Privacy Reviewer Agent
Checks:
Distribution alignment
Type validity
PII risk score
Outlier thresholds
Output: risk warnings and validation check results.
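A simple rule-based pass can cover part of this review. The sketch below counts type errors, out-of-range values, and rows containing email-like strings, then derives a crude PII risk score. The report keys, the email pattern, and the scoring rule (share of flagged rows) are assumptions for illustration; the document does not define how the score is computed.

```python
import re

# Deliberately loose pattern for email-like strings; an assumed heuristic.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def review_rows(rows: list[dict], numeric_bounds: dict[str, tuple[float, float]]) -> dict:
    """Check numeric columns against bounds and flag rows with email-like text."""
    report = {"type_errors": 0, "out_of_range": 0, "pii_rows": 0}
    for row in rows:
        for col, (lo, hi) in numeric_bounds.items():
            try:
                value = float(row[col])
            except (KeyError, TypeError, ValueError):
                report["type_errors"] += 1  # missing or non-numeric cell
                continue
            if not lo <= value <= hi:
                report["out_of_range"] += 1
        if any(isinstance(v, str) and EMAIL_RE.search(v) for v in row.values()):
            report["pii_rows"] += 1
    # Assumed scoring rule: fraction of rows with at least one PII-like value.
    report["pii_risk_score"] = report["pii_rows"] / len(rows) if rows else 0.0
    return report
```

A pattern match only shows that a value looks like an email address, not that it belongs to a real person, so a flagged row is best treated as a warning for human review.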
Agent 7: ML Data Agent (Phase 2)
Generates ML-specific data variants:
normalized numeric columns
class-label generation
add controlled noise/outliers
create train/val/test split
balanced dataset
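Two of these variants, normalized numeric columns and a train/val/test split, can be illustrated with standard-library Python. This is a sketch under assumptions: min-max scaling is one of several possible normalizations, and the 80/10/10 ratios and fixed seed are example defaults, not values taken from this document.

```python
import random


def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def split_rows(rows: list, train: float = 0.8, val: float = 0.1, seed: int = 42):
    """Shuffle deterministically, then cut into train/val/test partitions."""
    if train < 0 or val < 0 or train + val > 1:
        raise ValueError("train and val must be non-negative and sum to at most 1")
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains becomes the test set
    )
```

Fixing the seed makes the split reproducible across runs, which helps when the same synthetic dataset is regenerated or shared between team members.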
Agent 8: Storage Agent
Handles:
Firestore writes
GCS uploads
Signed URLs
Metadata linking
Agent 9: Orchestrator Agent
Coordinates overall workflow:
Interpret → Clarify → Validate → Sample → Approve → Generate → Store → Finalize
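The workflow above is essentially a fixed sequence of stages with one approval gate. The sketch below shows that control flow in plain Python, independent of any agent framework; the handler signature, the `halt` flag, and the `user_approved` key are hypothetical names chosen for this example and are not part of the Google Agent SDK.

```python
from typing import Callable

STAGES = [
    "interpret", "clarify", "validate", "sample",
    "approve", "generate", "store", "finalize",
]


def run_pipeline(handlers: dict[str, Callable[[dict], dict]], context: dict) -> dict:
    """Run each stage in order, stopping after the first stage that sets 'halt'."""
    completed = []
    for stage in STAGES:
        context = handlers[stage](context)
        completed.append(stage)
        if context.get("halt"):
            break
    return {**context, "completed": completed}


# Usage: pass-through handlers everywhere except the approval gate.
handlers = {stage: (lambda ctx: ctx) for stage in STAGES}
handlers["approve"] = lambda ctx: {**ctx, "halt": not ctx["user_approved"]}

rejected = run_pipeline(handlers, {"user_approved": False})
approved = run_pipeline(handlers, {"user_approved": True})
```

In this sketch a rejected preview stops the run after the approve stage, so no bulk generation call is made until the user accepts the sample.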
7. Technical Architecture
User → Web UI → Cloud Run API → Orchestrator Agent
         |                           |
         |              Multi-Agent Flow (Google Agent SDK)
         ↓                           ↓
Browser UI ← Firestore (metadata) ← Sample/Bulk Data Generation
         ↑                           ↑
      GCS (datasets) ← Bulk Generator Agent
Components:
Frontend
React / Next.js
File upload for data dictionary
Simple chat-like interface for agent clarification
Backend
Cloud Run Python/Node service hosting:
multi-agent orchestrator
batching logic
validation layers
Databases
Firestore → job metadata, schema versions
GCS → datasets
AI
Gemini Flash / Pro
Google Agent SDK multi-agent workflows
8. User Flow
User describes dataset in natural language
Selects domain
Uploads/provides data dictionary
System (Agents 1–3) asks clarifying questions
Sample agent generates 10-row preview
User accepts / modifies
Bulk generator creates full dataset
Dataset stored + downloadable
(Optional) ML-ready synthetic sets created
9. Firestore Data Model
Collection: jobs
job_id
user_id
schema
domain
status (pending/running/done/failed)
sample_preview_url
final_dataset_url
row_count
created_at
completed_at
error_logs
Collection: user_sessions
Tracks conversation and history.
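The fields of the jobs collection map naturally onto a small record type. The sketch below is one possible shape, assuming ISO 8601 strings for timestamps and a list of strings for error_logs; the Python types and the `transition` helper are assumptions, and only the field names and status values come from the model above.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

VALID_STATUSES = {"pending", "running", "done", "failed"}


def _now() -> str:
    """Current UTC time as an ISO 8601 string (assumed timestamp format)."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class JobRecord:
    job_id: str
    user_id: str
    schema: dict
    domain: str
    status: str = "pending"
    sample_preview_url: Optional[str] = None
    final_dataset_url: Optional[str] = None
    row_count: int = 0
    created_at: str = field(default_factory=_now)
    completed_at: Optional[str] = None
    error_logs: list[str] = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        """Move to a new status; terminal statuses also stamp completed_at."""
        if new_status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {new_status!r}")
        self.status = new_status
        if new_status in {"done", "failed"}:
            self.completed_at = _now()

    def to_document(self) -> dict:
        """Plain dict of the kind a document store write would accept."""
        return asdict(self)
```

Validating the status in one place keeps job documents limited to the four states listed in the data model.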
10. Prompt Specifications
Natural Language Interpretation Prompt
Convert this plain English description into a structured schema.
Identify ambiguities and ask clarifying questions.
Sample Generation Prompt
Generate exactly 10 rows of synthetic data.
Output strict CSV only.
Respect schema constraints and domain distributions.
No extra text.
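Because the sample prompt asks for exactly 10 rows of strict CSV with no extra text, the backend can verify the response before showing a preview. The following sketch is one way to do that with the standard csv module; the function name and error messages are hypothetical.

```python
import csv
import io


def parse_sample_csv(text: str, expected_columns: list[str], expected_rows: int = 10) -> list[dict]:
    """Parse model output as CSV and reject anything that is not exactly the sample."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    # Any leading prose or a wrong header shows up as a fieldnames mismatch.
    if reader.fieldnames != expected_columns:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows, start=1):
        # DictReader stores surplus cells under the None key and fills gaps with None.
        if None in row or None in row.values():
            raise ValueError(f"row {i} has the wrong number of cells")
    return rows
```

A failed check could trigger a retry of the generation call rather than surfacing a malformed preview to the user.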
Bulk Generation Prompt
Generate 100 rows per batch.
Return strict CSV.
Maintain distributions, uniqueness, and data types.
PII Safety Prompt
Ensure no real names, emails, addresses, or identifiable data.
Only use fictional yet realistic values.
ML Agent Prompt
Generate ML-ready synthetic data with:
normalized floats
balanced labels
controlled noise
clear label definitions
11. Hackathon Scoring Alignment
Cloud Run Usage (+5)
Backend and all agent orchestrations deployed fully on Cloud Run.
GCP Database Usage (+2)
Firestore stores schema, metadata, job history.
Google’s AI Usage (+5)
Gemini is core to:
schema interpretation
clarifications
sample & bulk generation
ML dataset creation
Functional Demo (+5)
Demo includes:
NL → Schema
Agents asking clarification
10-row preview
Full dataset generation
Firestore + GCS interactive view
Downloadable CSV
Blog Excellence (+5)
Blog includes:
Architecture diagrams
Multi-agent workflow breakdown
Screenshots
Sample datasets
Code snippets
Impact (+5)
Show impact in:
Healthcare
Fintech
Retail analytics
Enterprise AI
Apps requiring anonymized data
12. 1-Day Development Plan (Hackathon)
Hour 1–2
Set up Cloud Run + Firestore
Set up Google Agent SDK baseline
Build Orchestrator agent skeleton
Hour 3–4
Build NL Interpreter Agent
Build Schema Clarification Agent
Hour 5
Build Sample Generator Agent
Render preview in UI
Hour 6–7
Bulk Generator Agent + batching system
Hour 8
Firestore + GCS integrations
Hour 9
Demo UI polish
Fix issues
Hour 10
Write blog
Prepare pitch
13. Risks & Mitigations
Risk: Gemini outputs malformed CSV
Mitigation: require JSONL output from Gemini and convert it to CSV server-side.
Risk: Clarification loop too long
Mitigation: limit the loop to a maximum of 3 questions.
Risk: Large dataset token limits
Mitigation: batch generation of 100–500 rows per request.
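The first mitigation above (JSONL in, CSV out) can be sketched with the standard json and csv modules. This is a minimal illustration, assuming one JSON object per line whose keys match the schema columns exactly; the function name and the choice to skip bad lines rather than fail the whole batch are assumptions.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text: str, columns: list[str]) -> tuple[str, list[int]]:
    """Convert JSONL to CSV; return the CSV text and the rejected line numbers."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    rejected = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line_no)  # not valid JSON
            continue
        if not isinstance(record, dict) or set(record) != set(columns):
            rejected.append(line_no)  # missing or unexpected keys
            continue
        writer.writerow(record)  # the csv module handles quoting and escaping
    return out.getvalue(), rejected
```

Letting the csv writer handle quoting means values that contain commas or quotes no longer depend on the model escaping them correctly, which is the main failure mode this mitigation targets.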
14. Future Extensions
Full UI schema editor
Domain-specific presets
Auto-ML model training using synthetic data
Integration with BigQuery
Drift matching (synthetic data shaped like real samples)
**Version: 6.0 (FINAL)**