PRD

Multi-Agent Synthetic Data Generator Platform (Google Agent SDK + Gemini + Cloud Run)

Overview

The “Multi-Agent Synthetic Data Generator Platform” is a cloud-native system that lets users generate privacy-safe, domain-realistic synthetic datasets from natural-language instructions.

The platform uses a multi-agent workflow built with Google Agent SDK, powered by Gemini, deployed on Cloud Run, and backed by Firestore + GCS.

Users simply describe what dataset they need, provide a data dictionary, approve a sample preview, and the system generates large-scale datasets for analytics, testing, and ML training.

Problem Statement

Organizations need high-quality datasets for:

ML model development

Analytics

Testing environments

Prototyping

However, real data contains:

PII

sensitive attributes

compliance constraints (GDPR, HIPAA, PCI)

security risks

Synthetic data greatly reduces leakage risk while preserving statistical realism.

But current synthetic data tools require:

Data science skills

Complex configurations

Manual schema engineering

This product replaces all of that with an automated, natural-language workflow driven by multiple agents.

Goals

Primary Goals

Allow users to describe dataset requirements in English

Automatically convert descriptions to well-defined schema

Ask clarifying questions before generation

Generate 10-row synthetic preview

Generate full-scale synthetic dataset on approval

Store dataset + metadata in GCS + Firestore

Use multi-agent architecture with defined roles

Secondary Goals

Generate ML-ready datasets:

normalized

balanced classes

outlier/noise injection

train/val/test split

label generation

Non-Goals

(Not required for 1-day hackathon MVP)

Training deep generative models

Real-time stream data generation

Custom GAN training

Data encryption/key management for commercial use

Target Users

Data scientists

ML engineers

QA teams

Analytics teams

App developers needing seed/test data

Enterprises needing privacy-safe alternatives to production data

Multi-Agent System (Google Agent SDK)

The system uses multiple specialized agents, each responsible for a specific task.

Agent 1: NL Interpreter Agent

Purpose: Convert user’s natural language description into an initial structured schema.

Input:

Plain English description

Domain selection

Output:

Draft schema (fields, types, constraints)

Points of ambiguity

Tools: Gemini Flash / Pro
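As a rough illustration of what a draft schema with fields, types, constraints, and points of ambiguity could look like, here is a minimal sketch. The class names (FieldSpec, DraftSchema) and the example retail fields are hypothetical; they are not part of the Google Agent SDK or fixed by this PRD.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class FieldSpec:
    """One column in the draft schema (hypothetical shape)."""
    name: str
    type: str  # e.g. "int", "float", "string", "date"
    constraints: dict[str, Any] = field(default_factory=dict)


@dataclass
class DraftSchema:
    """Assumed output of the NL Interpreter Agent: fields plus open questions."""
    domain: str
    fields: list[FieldSpec]
    ambiguities: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested FieldSpec dataclasses.
        return json.dumps(asdict(self), indent=2)


draft = DraftSchema(
    domain="retail",
    fields=[
        FieldSpec("order_id", "string", {"unique": True}),
        FieldSpec("amount", "float", {"min": 0.0, "max": 5000.0}),
    ],
    ambiguities=["Which currency should 'amount' use?"],
)
```

Keeping the ambiguities next to the fields lets the Schema Clarification Agent pick up the open questions without re-reading the original description.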

Agent 2: Schema Clarification Agent

Purpose: Ask follow-up questions to ensure schema correctness.

Responsibilities:

Validate fields

Identify missing constraints

Request domain-specific details

Output:

Final schema (validated, structured JSON)

Agent 3: Schema Validator Agent

Purpose: Ensure:

no conflicting constraints

no PII leakage

proper type mapping

allowed distributions

Output:

Approved schema

Errors or corrections
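The checks above could be approximated with plain rule-based code before any model call. The sketch below is one possible shape; the allowed type set, the PII name hints, and the synthetic_only flag are assumptions made for illustration only.

```python
ALLOWED_TYPES = {"int", "float", "string", "date", "bool"}  # assumed type set
PII_HINTS = ("ssn", "passport", "email", "phone")  # assumed field-name hints


def validate_schema(fields: list[dict]) -> list[str]:
    """Return human-readable errors; an empty list means the schema is approved."""
    errors = []
    seen = set()
    for f in fields:
        name, ftype = f["name"], f["type"]
        c = f.get("constraints", {})
        if name in seen:
            errors.append(f"{name}: duplicate field name")
        seen.add(name)
        if ftype not in ALLOWED_TYPES:
            errors.append(f"{name}: unsupported type '{ftype}'")
        # Conflicting constraints: a range whose lower bound exceeds its upper bound.
        if "min" in c and "max" in c and c["min"] > c["max"]:
            errors.append(f"{name}: min {c['min']} exceeds max {c['max']}")
        # PII-looking fields must be explicitly marked as synthetic-only (hypothetical flag).
        if any(h in name.lower() for h in PII_HINTS) and not c.get("synthetic_only"):
            errors.append(f"{name}: looks like PII, needs synthetic_only flag")
    return errors
```

A non-empty result would map to the "Errors or corrections" output; an empty one to "Approved schema".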

Agent 4: Sample Generator Agent

Purpose: Generate 10-row sample synthetic dataset.

Output format: CSV or JSONL

Tools: Gemini (structured output enforced)

Goal:

Provide user preview

Ensure quality before generating full dataset
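Before the preview reaches the user, the backend could verify that the model's output really is a 10-row table with the expected columns. A minimal sketch, assuming CSV output; the function name check_preview is hypothetical.

```python
import csv
import io

PREVIEW_ROWS = 10  # the PRD fixes the preview at 10 rows


def check_preview(csv_text: str, expected_columns: list[str]) -> list[dict]:
    """Parse model output and raise ValueError if it is not a valid preview."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    if reader.fieldnames != expected_columns:
        raise ValueError(f"header mismatch: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != PREVIEW_ROWS:
        raise ValueError(f"expected {PREVIEW_ROWS} rows, got {len(rows)}")
    return rows
```

On a ValueError the orchestrator could simply retry the sample call rather than show a broken preview.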

Agent 5: Bulk Generator Agent

Purpose: Generate full dataset using batch calls.

Batching Strategy:

100–500 rows per batch

Merge in backend

Tools: Gemini

Storage: GCS

Output:

Final dataset (CSV)

Firestore job metadata
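The batching strategy can be sketched as two small helpers: one that plans batch sizes within the 100–500 range, and one that merges the returned CSV chunks in the backend. Both function names and the default batch size of 250 are assumptions; the line-based merge also assumes no field contains an embedded newline.

```python
def plan_batches(total_rows: int, batch_size: int = 250) -> list[int]:
    """Split a row count into per-request batch sizes (last batch may be smaller)."""
    if not 100 <= batch_size <= 500:
        raise ValueError("batch_size must be between 100 and 500")
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    full, rest = divmod(total_rows, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def merge_csv_batches(chunks: list[str]) -> str:
    """Concatenate CSV chunks, keeping the header from the first chunk only."""
    out: list[str] = []
    header = None
    for chunk in chunks:
        lines = chunk.strip().splitlines()
        if not lines:
            continue  # ignore empty responses
        if header is None:
            header = lines[0]
            out.append(header)
        elif lines[0] != header:
            raise ValueError("batch header differs from first batch")
        out.extend(lines[1:])
    return "\n".join(out) + "\n"
```

For example, plan_batches(1050, 500) yields three requests of 500, 500, and 50 rows.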

Agent 6: Quality & Privacy Reviewer Agent

Checks:

Distribution alignment

Type validity

PII risk score

Outlier thresholds

Outputs risk warnings + validation checks.
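One simple way to produce a PII risk score is a pattern scan over generated cells. The sketch below is a heuristic under stated assumptions, not the reviewer agent's defined method: it flags US-SSN-shaped strings and email addresses outside the example domains reserved by RFC 2606, and reports the flagged fraction of cells.

```python
import re

# example.com / .org / .net are reserved for documentation (RFC 2606), so they are treated as safe.
SAFE_DOMAINS = {"example.com", "example.org", "example.net"}
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def is_risky(cell: str) -> bool:
    """True if the cell contains an SSN-shaped value or a non-reserved email domain."""
    if SSN_RE.search(cell):
        return True
    return any(m.group(1).lower() not in SAFE_DOMAINS for m in EMAIL_RE.finditer(cell))


def pii_risk_score(rows: list[dict]) -> float:
    """Fraction of cells flagged as risky, in the range 0.0 to 1.0."""
    cells = [str(v) for row in rows for v in row.values()]
    if not cells:
        return 0.0
    return sum(is_risky(c) for c in cells) / len(cells)
```

A score above some agreed threshold (not defined in this PRD) would surface as a risk warning rather than block the job outright.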

Agent 7: ML Data Agent (Phase 2)

Generates ML-specific data variants:

normalized numeric columns

class-label generation

add controlled noise/outliers

create train/val/test split

balanced dataset
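Two of these variants, normalization and the train/val/test split, are easy to illustrate with the standard library. This is a minimal sketch; the 80/10/10 ratios, the fixed seed, and the function names are assumptions, since the PRD does not specify them.

```python
import random


def min_max_normalize(values: list[float]) -> list[float]:
    """Scale values linearly into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_val_test_split(rows: list, ratios=(0.8, 0.1, 0.1), seed: int = 42):
    """Shuffle deterministically, then cut into train, validation, and test parts."""
    shuffled = rows[:]  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Using a seeded random.Random instance keeps the split reproducible across reruns of the same job.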

Agent 8: Storage Agent

Handles:

Firestore writes

GCS uploads

Signed URLs

Metadata linking

Agent 9: Orchestrator Agent

Coordinates overall workflow:

Interpret → Clarify → Validate → Sample → Approve → Generate → Store → Finalize
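The workflow above can be modeled as a small state machine. The sketch below is independent of the Google Agent SDK and only shows stage ordering; sending a rejected preview back to Clarify is an assumption, as the PRD does not say where a modification re-enters the flow.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    INTERPRET = 1
    CLARIFY = 2
    VALIDATE = 3
    SAMPLE = 4
    APPROVE = 5
    GENERATE = 6
    STORE = 7
    FINALIZE = 8


def next_stage(current: Stage, approved: bool = True) -> Optional[Stage]:
    """Return the stage that follows `current`, or None once the job is finalized."""
    if current is Stage.APPROVE and not approved:
        return Stage.CLARIFY  # assumed re-entry point after the user requests changes
    if current is Stage.FINALIZE:
        return None
    return Stage(current.value + 1)


def run(approvals: list[bool]) -> list[str]:
    """Walk the pipeline; each visit to APPROVE consumes one user decision."""
    decisions = iter(approvals)
    stage: Optional[Stage] = Stage.INTERPRET
    trace = []
    while stage is not None:
        trace.append(stage.name)
        # Once the supplied decisions run out, the preview is treated as approved.
        approved = next(decisions, True) if stage is Stage.APPROVE else True
        stage = next_stage(stage, approved)
    return trace
```

In a real orchestrator each stage would dispatch to the corresponding agent; here run() only records the order in which stages are visited.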

Technical Architecture

User → Web UI → Cloud Run API → Orchestrator Agent
                     |                   |
                     |     Multi-Agent Flow (Google Agent SDK)
                     ↓                   ↓
Browser UI ← Firestore (metadata) ← Sample/Bulk Data Generation
                     ↑                   ↑
              GCS (datasets) ← Bulk Generator Agent

Components:

Frontend

React / Next.js

File upload for data dictionary

Simple chat-like interface for agent clarification

Backend

Cloud Run Python/Node service hosting:

multi-agent orchestrator

batching logic

validation layers

Databases

Firestore → job metadata, schema versions

GCS → datasets

Gemini Flash / Pro

Google Agent SDK multi-agent workflows

User Flow

User describes dataset in natural language

Selects domain

Uploads/provides data dictionary

System (Agents 1–3) asks clarifying questions

Sample agent generates 10-row preview

User accepts / modifies

Bulk generator creates full dataset

Dataset stored + downloadable

(Optional) ML-ready synthetic sets created

Firestore Data Model

Collection: jobs

job_id

user_id

schema

domain

status (pending/running/done/failed)

sample_preview_url

final_dataset_url

row_count

created_at

completed_at

error_logs

Collection: user_sessions

Tracks conversation and history.
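A document in the jobs collection could be represented in the backend roughly as follows. This is a sketch using only the fields listed above; the JobRecord class, its defaults, and the transition helper are hypothetical, and no Firestore client calls are shown.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

VALID_STATUSES = ("pending", "running", "done", "failed")


@dataclass
class JobRecord:
    job_id: str
    user_id: str
    schema: dict
    domain: str
    status: str = "pending"
    sample_preview_url: Optional[str] = None
    final_dataset_url: Optional[str] = None
    row_count: int = 0
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    completed_at: Optional[datetime] = None
    error_logs: list = field(default_factory=list)

    def transition(self, new_status: str) -> None:
        """Update status; terminal states also stamp completed_at."""
        if new_status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {new_status}")
        self.status = new_status
        if new_status in ("done", "failed"):
            self.completed_at = datetime.now(timezone.utc)

    def to_document(self) -> dict:
        """Plain dict suitable for handing to a document store."""
        return asdict(self)
```

The Storage Agent would be the natural place to write the result of to_document() to Firestore.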

Prompt Specifications

Natural Language Interpretation Prompt

Convert this plain English description into a structured schema. Identify ambiguities and ask clarifying questions.

Sample Generation Prompt

Generate exactly 10 rows of synthetic data. Output strict CSV only. Respect schema constraints and domain distributions. No extra text.

Bulk Generation Prompt

Generate 100 rows per batch. Return strict CSV. Maintain distributions, uniqueness, and data types.

PII Safety Prompt

Ensure no real names, emails, addresses, or identifiable data. Only use fictional yet realistic values.

ML Agent Prompt

Generate ML-ready synthetic data with:

normalized floats

balanced labels

controlled noise

clear label definitions
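These prompt fragments would typically be combined with the approved schema before each model call. The sketch below shows one way to assemble the sample-generation prompt; the function name, the section order, and the "Domain:" and "Schema:" labels are assumptions, while the instruction wording is taken from the specifications above.

```python
SAMPLE_PROMPT = (
    "Generate exactly {n} rows of synthetic data. Output strict CSV only. "
    "Respect schema constraints and domain distributions. No extra text."
)
PII_PROMPT = (
    "Ensure no real names, emails, addresses, or identifiable data. "
    "Only use fictional yet realistic values."
)


def build_sample_prompt(schema_json: str, domain: str, n: int = 10) -> str:
    """Join the generation rules, the PII rules, and the schema into one prompt."""
    parts = [
        SAMPLE_PROMPT.format(n=n),
        PII_PROMPT,
        f"Domain: {domain}",
        f"Schema:\n{schema_json}",
    ]
    return "\n\n".join(parts)
```

Keeping the PII Safety Prompt as a separate constant makes it easy to append to the bulk and ML prompts as well.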

Hackathon Scoring Alignment

Cloud Run Usage (+5)

The backend and all agent orchestration are deployed entirely on Cloud Run.

GCP Database Usage (+2)

Firestore stores schema, metadata, job history.

Google’s AI Usage (+5)

Gemini is core to:

schema interpretation

clarifications

sample & bulk generation

ML dataset creation

Functional Demo (+5)

Demo includes:

NL → Schema

Agents asking clarification

10-row preview

Full dataset generation

Firestore + GCS interactive view

Downloadable CSV

Blog Excellence (+5)

Blog includes:

Architecture diagrams

Multi-agent workflow breakdown

Screenshots

Sample datasets

Code snippets

Impact (+5)

Show impact in:

Healthcare

Fintech

Retail analytics

Enterprise AI

Apps requiring anonymized data

1-Day Development Plan (Hackathon)

Hour 1–2

Set up Cloud Run + Firestore

Set up Google Agent SDK baseline

Build Orchestrator agent skeleton

Hour 3–4

Build NL Interpreter Agent

Build Schema Clarification Agent

Hour 5

Build Sample Generator Agent

Render preview in UI

Hour 6–7

Bulk Generator Agent + batching system

Hour 8

Firestore + GCS integrations

Hour 9

Demo UI polish

Fix issues

Hour 10

Write blog

Prepare pitch

Risks & Mitigations

Risk: Gemini outputs malformed CSV

Mitigation: have the model emit JSONL and convert it to CSV server-side.

Risk: Clarification loop too long

Mitigation: limit to max 3 questions.

Risk: Large dataset token limits

Mitigation: batch generation of 100–500 rows per request.
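The JSONL-to-CSV mitigation can be sketched with the standard library alone. The version below skips lines that are not valid JSON objects or whose keys do not match the schema columns, and reports how many were dropped; the function name and the skip-and-count policy are assumptions.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text: str, columns: list[str]) -> tuple[str, int]:
    """Convert JSONL to CSV text; returns (csv_text, number_of_skipped_lines)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    skipped = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are not counted as errors
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        # Reject non-objects and rows with missing or unexpected keys.
        if not isinstance(record, dict) or set(record) != set(columns):
            skipped += 1
            continue
        writer.writerow(record)  # csv handles quoting of commas and quotes
    return buf.getvalue(), skipped
```

Because the csv module does the quoting, values containing commas or quotes cannot break the output, and the skipped count tells the Bulk Generator Agent how many rows to request again.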

Future Extensions

Full UI schema editor

Domain-specific presets

Auto-ML model training using synthetic data

Integration with BigQuery

Drift matching (synthetic data shaped like real samples)
