AI Agent Architecture for Croner App

Overview

This document outlines a scalable, provider-agnostic AI agent infrastructure for the Croner App Admin Portal. The architecture is designed to be flexible enough to support both OpenAI and Google Vertex AI, with the ability to expand to additional use cases over time.

Architecture Diagrams

1. High-Level AI Agent Architecture

Complete AI system showing:

Admin Portal: AI-enhanced UI components (chat, insights, survey assistant, data cleaner)
AI Service Layer: Agent orchestrator with specialized agents (Data Analyst, Survey Expert, Data Cleaner, Chat Assistant)
Provider Abstraction: Strategy pattern supporting both OpenAI and Vertex AI
Vector Database: RAG (Retrieval Augmented Generation) for grounded responses
Monitoring & Observability: Cost tracking, metrics, distributed tracing

Key Components:

✅ Provider-agnostic design (easy to switch between OpenAI/Vertex)
✅ Specialized agents for different use cases
✅ RAG integration for reducing hallucinations
✅ Built on existing infrastructure (Django, Celery, PostgreSQL)

2. AI Agent Interaction Flows

Four detailed sequence diagrams:

Use Case 1: AI Data Analysis

Admin requests AI insights on job data
System fetches data from PostgreSQL, retrieves similar analyses from vector DB
LLM generates insights with context
Results cached in Redis (30min TTL)

Use Case 2: Real-Time AI Chat

WebSocket-based streaming chat
Conversation history maintained in Redis
RAG retrieves relevant documentation
Streaming response for real-time UX

Use Case 3: AI Survey Question Generation

Admin provides survey topic
System retrieves similar surveys and best practices
LLM generates structured questions (JSON output)
Questions saved as drafts for editing

Use Case 4: Async AI Data Cleaning

CSV upload triggers Celery task
Data processed in batches (1000 rows)
LLM validates, cleans, and standardizes data
Email notification on completion

3. AI Provider Abstraction Layer

Strategy Pattern for provider flexibility:

Interface Methods:

class AIProviderInterface:
    def generate_completion(prompt, model, params)
    def generate_embeddings(texts)
    def stream_response(prompt, callback)
    def function_calling(prompt, functions)
    def get_token_count(text)

Implementations:

OpenAIProvider: GPT-4o, GPT-4-turbo, text-embedding-3-large
VertexAIProvider: Gemini 1.5 Pro (2M context), Gemini Flash, text-embedding-004
Future: Anthropic (Claude 3.5), Local (Llama 3.1)

Provider Router:

Cost optimization (select cheaper model for simple tasks)
Latency requirements (real-time vs batch)
Context window needs (large data analysis)
Fallback strategy (primary → secondary)

Monitoring:

Token consumption tracking
Cost per request (USD)
API latency (P50, P95, P99)
Error rates by provider
Provider distribution analytics

4. RAG (Retrieval Augmented Generation) Pipeline

Reduces AI hallucinations by grounding responses in real data:

Data Sources:

Survey templates and questions
Historical job data (JSONB)
Documents from Azure Blob
Variables and survey schema

ETL Pipeline (Celery scheduled jobs):

Extract data from PostgreSQL/Azure
Chunk text semantically (not fixed size)
Clean and normalize data
Enrich with metadata (tags, categories, dates)

Embedding Generation:

Batch processing for efficiency
Separate collections for different data types
Dimensions: 1536 (OpenAI) or 3072 (Vertex)

Vector Database:

Option 1: Pinecone (managed, cloud-native)
Option 2: Qdrant (self-hosted in Kubernetes)
Indexes: HNSW (fast approximate) or IVF (large-scale)

Query Pipeline:

Query rewriter (optimization)
Query expander (synonyms, related terms)
Query embedding generation
Vector search (semantic similarity)
Hybrid search (vector + keyword)
MMR reranking (diversity + relevance)
Cross-encoder reranker (precision refinement)

Context Assembly:

Top-K results (k=5-10)
Metadata filtering (date, survey type)
System instructions (role, guidelines)
User context (admin profile, history)

Feedback Loop:

User feedback (thumbs up/down)
Relevance metrics analytics
Periodic embedding retraining

5. AI Deployment Stack

Production-ready Kubernetes deployment:

NEW Services:

AI Service Pods: FastAPI-based, autoscaling (2-5 replicas)
- AI Orchestrator (agent router)
- Agent Runtime (LangChain/LlamaIndex)
- Provider Clients (OpenAI + Vertex SDK)

Integration with Existing Infrastructure:

Django API calls AI Service via internal gRPC/REST
Celery workers handle long-running AI tasks
Redis for conversation history and response caching
PostgreSQL for AI logs and metadata
Azure Blob for documents and exports

Security:

HashiCorp Vault for API key management
Azure Key Vault integration (existing)
Web Application Firewall with rate limiting
Service mesh (Istio/Linkerd) for traffic control

Observability (NEW):

Prometheus: Metrics collection (tokens, latency, cost)
Grafana: Dashboards for AI performance
Loki: Log aggregation for AI service
Jaeger: Distributed tracing across services
LangSmith/LangFuse: LLM-specific observability (prompt chains, token usage)

CI/CD:

GitHub Actions or Azure DevOps
Docker Registry for container images
ArgoCD for GitOps deployment
Separate deployment pipeline for AI service

Technology Stack

AI Frameworks

Component	Technology	Purpose
Agent Framework	LangChain / LlamaIndex	Agent orchestration, chains, memory
API Framework	FastAPI	High-performance async API
Vector DB	Pinecone / Qdrant	Embeddings storage & search
LLM Providers	OpenAI, Vertex AI	Language model inference
Observability	LangSmith / LangFuse	LLM monitoring & debugging

OpenAI Models

GPT-4o: Fast, cost-effective (recommended for most use cases)
GPT-4-turbo: Large context window (128k tokens)
text-embedding-3-large: High-quality embeddings (3072 dimensions)

Google Vertex AI Models

Gemini 1.5 Pro: Massive context (2M tokens), multimodal
Gemini 1.5 Flash: Fast responses, lower cost
text-embedding-004: Enterprise-grade embeddings

Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Goal: Set up core AI infrastructure

Deploy FastAPI AI service as Kubernetes pod
Implement provider abstraction layer (OpenAI + Vertex)
Set up vector database (Pinecone or Qdrant)
Create basic agent orchestrator
Implement API key management (Vault)
Set up monitoring (Prometheus + Grafana)
Deploy LangSmith for LLM observability

Deliverables:

AI service accepting requests from Django
Provider router dynamically selecting OpenAI/Vertex
Basic health checks and metrics

Phase 2: RAG Pipeline (Weeks 5-7)

Goal: Build retrieval augmented generation

Create ETL pipeline for embedding generation
Index surveys, job data, documents in vector DB
Implement semantic search with reranking
Build context assembly logic
Add caching layer (Redis)
Create feedback collection system

Deliverables:

Vector search returning relevant context
RAG-enhanced responses with citations
Scheduled jobs for re-indexing data

Phase 3: Specialized Agents (Weeks 8-12)

Goal: Implement use-case-specific agents

Agent 1: Data Analyst Agent

Build prompt templates for insights generation
Integrate with JobDataResults (JSONB)
Create structured output schemas (trends, outliers)
Add visualization data generation
Implement admin UI component

Agent 2: Survey Expert Agent

Build question generation prompts
Integrate with survey templates and variables
Create validation for generated questions
Add draft saving functionality
Implement admin UI component

Agent 3: Data Cleaning Agent

Build validation and cleaning prompts
Integrate with JobDataOriginal pipeline
Implement batch processing (Celery)
Create cleaning report generation
Add admin UI component

Agent 4: Chat Assistant Agent

Build conversational prompts
Implement conversation memory (Redis)
Add WebSocket streaming
Create chat UI component
Implement context-aware responses

Deliverables:

Four specialized agents operational in admin portal
UI components integrated with React 18
End-to-end workflows tested

Phase 4: Production Hardening (Weeks 13-16)

Goal: Production-ready AI system

Deliverables:

Production-ready AI service with 99.9% uptime
Cost controls and monitoring
Security and compliance measures
Admin analytics dashboard

Phase 5: Expansion (Future)

Goals: Scale to additional use cases

Client portal AI features (if needed)
Additional agents (e.g., Report Generator, Email Composer)
Multi-language support
Voice input/output
Advanced analytics (sentiment, classification)

Cost Estimation

OpenAI Pricing (as of 2024)

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-4o	$2.50	$10.00
GPT-4-turbo	$10.00	$30.00
text-embedding-3-large	$0.13	-

Vertex AI Pricing (as of 2024)

Model	Input (per 1M tokens)	Output (per 1M tokens)
Gemini 1.5 Pro	$1.25	$5.00
Gemini 1.5 Flash	$0.075	$0.30
text-embedding-004	$0.025	-

Monthly Estimate (100 admin users, moderate usage)

Component	Estimated Cost
LLM API calls (mixed)	$500 - $1,500
Embeddings generation	$50 - $150
Vector DB (Pinecone)	$70 - $200
Redis cache	$50 - $100
Monitoring (LangSmith)	$50 - $150
Additional compute (K8s)	$100 - $300
Total	$820 - $2,400/month

Cost Optimization:

Use GPT-4o (cheaper) for most tasks, GPT-4-turbo only for large context
Cache responses in Redis (30min TTL)
Use Vertex AI for batch processing (50% cheaper)
Implement smart routing (cost vs latency)

Security Considerations

API Key Management

Store keys in HashiCorp Vault or Azure Key Vault
Rotate keys monthly
Use separate keys for dev/staging/production
Implement key usage monitoring

Data Privacy

PII Detection: Scan inputs for sensitive data (SSN, credit cards)
Data Masking: Redact PII before sending to LLM
Audit Logging: Log all AI interactions (GDPR/SOC2)
Data Residency: Use Vertex AI if data must stay in specific regions

Rate Limiting

Per-user limits (e.g., 100 requests/hour)
Per-agent limits (e.g., 1000 requests/day)
Cost budgets (e.g., $500/day)
Circuit breakers for API failures

Prompt Injection Prevention

Validate user inputs
Use system message boundaries
Implement output validation
Monitor for suspicious patterns

Monitoring & Alerts

Key Metrics

Token Usage: Input/output tokens by agent, user, provider
Latency: P50, P95, P99 response times
Cost: Real-time spending by agent/user
Error Rate: 4xx, 5xx errors by provider
Cache Hit Rate: Redis cache effectiveness
Embeddings: Vector DB query latency

Alerts

Cost exceeds daily budget ($X/day)
Error rate > 5% for 5 minutes
Latency P95 > 10 seconds
API key approaching rate limit
Vector DB query failures

Dashboards

Executive Dashboard: Total cost, usage trends, ROI metrics
Operations Dashboard: Latency, errors, uptime by service
Agent Dashboard: Usage by agent type, performance metrics
Cost Dashboard: Spending by provider, agent, user

Success Metrics

Technical Metrics

API response time: < 3 seconds (P95)
Uptime: > 99.9%
Error rate: < 1%
Cache hit rate: > 60%

Business Metrics

Adoption: % of admins using AI features weekly
Efficiency: Time saved per task (e.g., survey creation -50%)
Quality: Accuracy of AI insights (human validation)
Satisfaction: NPS score for AI features

Cost Metrics

Cost per request: < $0.10
ROI: Admin time saved vs AI cost
Budget adherence: < 10% variance

FAQ

Why FastAPI for AI service instead of Django?

FastAPI offers:

Async support for LLM streaming
Better performance (3x faster than Django)
Native WebSocket support
Smaller container size (faster deployments)
Auto-generated OpenAPI docs

Django remains for core business logic; FastAPI handles AI-specific workloads.

Why both OpenAI and Vertex AI?

Redundancy: Fallback if one provider has outages
Cost optimization: Route tasks to cheaper provider
Compliance: Vertex AI for data residency requirements
Feature parity: Some features only available on one provider
Negotiation leverage: Multi-provider reduces lock-in

Why RAG instead of fine-tuning?

Cost: RAG is ~10x cheaper than fine-tuning
Flexibility: Update knowledge base without retraining
Transparency: Citations show data sources
Accuracy: Grounded in real data, less hallucination
Maintenance: No model retraining pipeline needed

Can we use open-source models (Llama, Mistral)?

Yes, the abstraction layer supports it! But consider:

Infrastructure cost: Need GPU instances ($500-2000/month)
Performance: Open models lag behind GPT-4/Gemini
Maintenance: Model updates, optimization, monitoring
Best for: High-volume, privacy-critical use cases

Start with OpenAI/Vertex, migrate to open-source if scale justifies it.

How do we handle AI hallucinations?

RAG: Ground responses in real data
Structured outputs: JSON schemas enforce format
Validation: Post-process outputs with business rules
Citations: Show data sources for verification
Human-in-the-loop: Admin reviews before applying changes
Feedback loop: Learn from corrections

Next Steps

Review architecture with engineering team
Choose vector DB: Pinecone (managed) vs Qdrant (self-hosted)
Decide AI provider: OpenAI, Vertex, or both
Prioritize agents: Which use case first?
Allocate budget: $1-3k/month for initial rollout
Set up dev environment: AI service + vector DB
Begin Phase 1: Foundation infrastructure

Estimated Timeline: 16 weeks to production-ready AI system

Team Requirements:

1x ML Engineer (AI service, agents, RAG)
1x Backend Engineer (Django integration, APIs)
1x Frontend Engineer (React UI components)
1x DevOps Engineer (K8s deployment, monitoring)

Resources

Documentation

Example Implementations

Ready to build the future of compensation analytics with AI! 🚀

AI Agent Architecture for Croner App

AI Agent Architecture for Croner App

Overview

Architecture Diagrams

1. High-Level AI Agent Architecture

2. AI Agent Interaction Flows

Use Case 1: AI Data Analysis

Use Case 2: Real-Time AI Chat

Use Case 3: AI Survey Question Generation

Use Case 4: Async AI Data Cleaning

3. AI Provider Abstraction Layer

4. RAG (Retrieval Augmented Generation) Pipeline

5. AI Deployment Stack

Technology Stack

AI Frameworks

OpenAI Models

Google Vertex AI Models

Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Phase 2: RAG Pipeline (Weeks 5-7)

Phase 3: Specialized Agents (Weeks 8-12)

Agent 1: Data Analyst Agent

Agent 2: Survey Expert Agent

Agent 3: Data Cleaning Agent

Agent 4: Chat Assistant Agent

Phase 4: Production Hardening (Weeks 13-16)

Phase 5: Expansion (Future)

Cost Estimation

OpenAI Pricing (as of 2024)

Vertex AI Pricing (as of 2024)

Monthly Estimate (100 admin users, moderate usage)

Security Considerations

API Key Management

Data Privacy

Rate Limiting

Prompt Injection Prevention

Monitoring & Alerts

Key Metrics

Alerts

Dashboards

Success Metrics

Technical Metrics

Business Metrics

Cost Metrics

FAQ

Why FastAPI for AI service instead of Django?

Why both OpenAI and Vertex AI?

Why RAG instead of fine-tuning?

Can we use open-source models (Llama, Mistral)?

How do we handle AI hallucinations?

Next Steps

Resources

Documentation

Example Implementations

Related Documents

youtube

🚀 Lovable AI & Cloud - Complete Setup Guide

Evaluation and Observability

LLM Judge — Setup & Operations