# AI Agent Architecture for Croner App
## Overview
This document outlines a **scalable, provider-agnostic AI agent infrastructure** for the Croner App Admin Portal. The architecture is designed to be flexible enough to support both OpenAI and Google Vertex AI, with the ability to expand to additional use cases over time.
## Architecture Diagrams
### 1. [High-Level AI Agent Architecture](14-ai-agent-architecture.mmd)
Complete AI system showing:
- **Admin Portal**: AI-enhanced UI components (chat, insights, survey assistant, data cleaner)
- **AI Service Layer**: Agent orchestrator with specialized agents (Data Analyst, Survey Expert, Data Cleaner, Chat Assistant)
- **Provider Abstraction**: Strategy pattern supporting both OpenAI and Vertex AI
- **Vector Database**: RAG (Retrieval Augmented Generation) for grounded responses
- **Monitoring & Observability**: Cost tracking, metrics, distributed tracing
**Key Components:**
- ✅ Provider-agnostic design (easy to switch between OpenAI/Vertex)
- ✅ Specialized agents for different use cases
- ✅ RAG integration for reducing hallucinations
- ✅ Built on existing infrastructure (Django, Celery, PostgreSQL)
### 2. [AI Agent Interaction Flows](15-ai-agent-interaction-flow.mmd)
Four detailed sequence diagrams:
#### Use Case 1: AI Data Analysis
- Admin requests AI insights on job data
- System fetches data from PostgreSQL and retrieves similar analyses from the vector DB
- LLM generates insights with context
- Results cached in Redis (30min TTL)
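The caching step above can be sketched as follows. This is a minimal illustration, not the project's implementation: it uses an in-memory dictionary as a stand-in for Redis so that it runs without a server, and `cache_key`, `TTLCache`, and the injectable clock are hypothetical names. With Redis, the equivalent write would set the key with an 1800-second expiry.

```python
import hashlib
import json
import time


def cache_key(agent: str, payload: dict) -> str:
    """Build a deterministic key from the agent name and request payload."""
    body = json.dumps(payload, sort_keys=True)  # key order must not matter
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"ai:{agent}:{digest}"


class TTLCache:
    """In-memory stand-in for Redis with per-entry expiry (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 30 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}

    def set(self, key: str, value) -> None:
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value
```

Hashing the sorted payload means two requests with the same parameters share one cached insight regardless of field order.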
#### Use Case 2: Real-Time AI Chat
- WebSocket-based streaming chat
- Conversation history maintained in Redis
- RAG retrieves relevant documentation
- Streaming response for real-time UX
#### Use Case 3: AI Survey Question Generation
- Admin provides survey topic
- System retrieves similar surveys and best practices
- LLM generates structured questions (JSON output)
- Questions saved as drafts for editing
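One way to guard the structured-output step is to validate the JSON before saving drafts. The sketch below assumes a hypothetical schema (a JSON array of objects with `text`, `type`, and optional `options`); the field names and the allowed types are illustrative assumptions, not the Croner survey schema.

```python
import json

# Assumed question types, for illustration only.
ALLOWED_TYPES = {"single_choice", "multiple_choice", "free_text", "numeric"}


def parse_generated_questions(raw: str) -> list[dict]:
    """Parse LLM output and return validated question drafts."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of questions")

    drafts = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"Question {i} is not an object")
        text = item.get("text")
        qtype = item.get("type")
        options = item.get("options") or []
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"Question {i} has no text")
        if qtype not in ALLOWED_TYPES:
            raise ValueError(f"Question {i} has unknown type {qtype!r}")
        if not isinstance(options, list):
            raise ValueError(f"Question {i} options must be a list")
        if qtype in {"single_choice", "multiple_choice"} and len(options) < 2:
            raise ValueError(f"Question {i} needs at least two options")
        # Everything the LLM produces starts life as an editable draft.
        drafts.append({"text": text.strip(), "type": qtype,
                       "options": options, "status": "draft"})
    return drafts
```

Rejecting the whole response on the first malformed question keeps the example short; a real pipeline might instead drop bad items or ask the model to retry.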
#### Use Case 4: Async AI Data Cleaning
- CSV upload triggers Celery task
- Data processed in batches (1000 rows)
- LLM validates, cleans, and standardizes data
- Email notification on completion
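The batching step could look roughly like the sketch below. It is written as plain functions so it runs on its own; in the architecture described here the outer function would be the body of a Celery task. `batched_rows`, `clean_csv`, and the `clean_batch` callable (a stand-in for the LLM cleaning call) are hypothetical names.

```python
import csv
import io
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched_rows(rows: Iterable, batch_size: int = 1000) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def clean_csv(text: str,
              clean_batch: Callable[[list], list],
              batch_size: int = 1000) -> list:
    """Read CSV text and pass it to clean_batch one batch at a time."""
    reader = csv.DictReader(io.StringIO(text))
    cleaned = []
    for batch in batched_rows(reader, batch_size):
        # One LLM request per batch keeps prompts bounded in size.
        cleaned.extend(clean_batch(batch))
    return cleaned
```

Because `csv.DictReader` is consumed lazily, only one batch of rows is held in memory before it is handed to the cleaning step.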
### 3. [AI Provider Abstraction Layer](16-ai-provider-abstraction.mmd)
**Strategy Pattern** for provider flexibility:
**Interface Methods:**
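```python
from typing import Protocol

class AIProviderInterface(Protocol):
    """Methods every provider implementation must expose."""
    def generate_completion(self, prompt, model, params): ...
    def generate_embeddings(self, texts): ...
    def stream_response(self, prompt, callback): ...
    def function_calling(self, prompt, functions): ...
    def get_token_count(self, text): ...
```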
**Implementations:**
- **OpenAIProvider**: GPT-4o, GPT-4-turbo, text-embedding-3-large
- **VertexAIProvider**: Gemini 1.5 Pro (2M context), Gemini 1.5 Flash, text-embedding-004
- **Future**: Anthropic (Claude 3.5), Local (Llama 3.1)
**Provider Router:**
- Cost optimization (select cheaper model for simple tasks)
- Latency requirements (real-time vs batch)
- Context window needs (large data analysis)
- Fallback strategy (primary → secondary)
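A minimal sketch of how these routing rules might combine, assuming the router is given each provider's input price, context limit, and a completion callable. `ProviderOption`, `AllProvidersFailed`, and `route` are hypothetical names, and the `complete` field is a stand-in for a real `generate_completion` call. Latency-based routing is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ProviderOption:
    name: str
    cost_per_1m_input: float        # USD per 1M input tokens
    max_context_tokens: int
    complete: Callable[[str], str]  # stand-in for generate_completion


class AllProvidersFailed(RuntimeError):
    """Raised when every eligible provider raised an error."""


def route(prompt: str, prompt_tokens: int,
          options: Sequence[ProviderOption]) -> tuple[str, str]:
    """Try eligible providers from cheapest to most expensive."""
    candidates = sorted(
        (o for o in options if o.max_context_tokens >= prompt_tokens),
        key=lambda o: o.cost_per_1m_input,
    )
    if not candidates:
        raise ValueError("No provider can fit this prompt in its context window")
    errors = []
    for option in candidates:
        try:
            return option.name, option.complete(prompt)
        except Exception as exc:  # any failure triggers the fallback
            errors.append(f"{option.name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Sorting by price implements the "cheaper model first" rule, the context filter handles large inputs, and the loop gives the primary-to-secondary fallback.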
**Monitoring:**
- Token consumption tracking
- Cost per request (USD)
- API latency (P50, P95, P99)
- Error rates by provider
- Provider distribution analytics
### 4. [RAG (Retrieval Augmented Generation) Pipeline](17-ai-rag-pipeline.mmd)
Reduces AI hallucinations by grounding responses in real data:
**Data Sources:**
- Survey templates and questions
- Historical job data (JSONB)
- Documents from Azure Blob
- Variables and survey schema
**ETL Pipeline (Celery scheduled jobs):**
1. Extract data from PostgreSQL/Azure
2. Chunk text semantically (not fixed size)
3. Clean and normalize data
4. Enrich with metadata (tags, categories, dates)
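As a rough illustration of steps 2 and 4, the sketch below splits text on paragraph boundaries rather than at fixed offsets and attaches metadata to each chunk. Paragraph-boundary packing is a simple approximation of semantic chunking, used here as an assumption; `chunk_by_paragraph`, `enrich`, and the metadata fields are hypothetical.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            # An oversized single paragraph still becomes its own chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def enrich(chunks: list[str], source: str, tags: list[str]) -> list[dict]:
    """Attach an id and metadata to each chunk before embedding."""
    return [
        {"id": f"{source}#{i}", "text": chunk, "source": source, "tags": list(tags)}
        for i, chunk in enumerate(chunks)
    ]
```

Paragraphs are never split mid-sentence, so each embedded chunk stays a coherent unit of meaning.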
**Embedding Generation:**
- Batch processing for efficiency
- Separate collections for different data types
- Dimensions: 3072 (OpenAI text-embedding-3-large) or 768 (Vertex text-embedding-004)
**Vector Database:**
- **Option 1**: Pinecone (managed, cloud-native)
- **Option 2**: Qdrant (self-hosted in Kubernetes)
- Indexes: HNSW (fast approximate) or IVF (large-scale)
**Query Pipeline:**
1. Query rewriter (optimization)
2. Query expander (synonyms, related terms)
3. Query embedding generation
4. Vector search (semantic similarity)
5. Hybrid search (vector + keyword)
6. MMR reranking (diversity + relevance)
7. Cross-encoder reranker (precision refinement)
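Step 6 can be illustrated with a small, dependency-free version of Maximal Marginal Relevance. This is a sketch under the assumption that candidates arrive as plain embedding vectors; the function names and the `lambda_` default are illustrative choices, not values taken from this design.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k: int = 5, lambda_: float = 0.7) -> list[int]:
    """Return indices of up to k documents, trading relevance for diversity.

    lambda_ = 1.0 ranks purely by relevance; lower values penalise
    documents that resemble ones already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_ * relevance[i] - (1 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a low `lambda_`, a near-duplicate of an already selected document loses out to a less relevant but different one, which is the diversity effect MMR is meant to provide.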
**Context Assembly:**
- Top-K results (k=5-10)
- Metadata filtering (date, survey type)
- System instructions (role, guidelines)
- User context (admin profile, history)
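The assembly step might be sketched as below, assuming each search result is a dictionary with `text`, `score`, `survey_type`, and `date` fields. Those field names, the prompt layout, and `assemble_context` itself are assumptions made for illustration; user context is left out for brevity.

```python
from datetime import date
from typing import Optional


def assemble_context(results: list[dict], *, question: str,
                     system_instructions: str, k: int = 5,
                     survey_type: Optional[str] = None,
                     not_before: Optional[date] = None) -> str:
    """Filter by metadata, keep the top-k results, and build the prompt."""
    filtered = [
        r for r in results
        if (survey_type is None or r["survey_type"] == survey_type)
        and (not_before is None or r["date"] >= not_before)
    ]
    top = sorted(filtered, key=lambda r: r["score"], reverse=True)[:k]
    # Numbered sources let the model cite what it used.
    sources = "\n".join(f"[{i}] {r['text']}" for i, r in enumerate(top, start=1))
    return f"{system_instructions}\n\nContext:\n{sources}\n\nQuestion: {question}"
```

Filtering before the top-k cut ensures that stale or off-topic results do not crowd out relevant ones.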
**Feedback Loop:**
- User feedback (thumbs up/down)
- Relevance metrics analytics
- Periodic re-embedding and re-indexing as source data and embedding models change
### 5. [AI Deployment Stack](18-ai-deployment-stack.mmd)
Production-ready Kubernetes deployment:
**NEW Services:**
- **AI Service Pods**: FastAPI-based, autoscaling (2-5 replicas)
  - AI Orchestrator (agent router)
  - Agent Runtime (LangChain/LlamaIndex)
  - Provider Clients (OpenAI + Vertex SDK)
**Integration with Existing Infrastructure:**
- Django API calls AI Service via internal gRPC/REST
- Celery workers handle long-running AI tasks
- Redis for conversation history and response caching
- PostgreSQL for AI logs and metadata
- Azure Blob for documents and exports
**Security:**
- HashiCorp Vault for API key management
- Azure Key Vault integration (existing)
- Web Application Firewall with rate limiting
- Service mesh (Istio/Linkerd) for traffic control
**Observability (NEW):**
- **Prometheus**: Metrics collection (tokens, latency, cost)
- **Grafana**: Dashboards for AI performance
- **Loki**: Log aggregation for AI service
- **Jaeger**: Distributed tracing across services
- **LangSmith/LangFuse**: LLM-specific observability (prompt chains, token usage)
**CI/CD:**
- GitHub Actions or Azure DevOps
- Docker Registry for container images
- ArgoCD for GitOps deployment
- Separate deployment pipeline for AI service
---
## Technology Stack
### AI Frameworks
| Component | Technology | Purpose |
|-----------|-----------|---------|
| Agent Framework | LangChain / LlamaIndex | Agent orchestration, chains, memory |
| API Framework | FastAPI | High-performance async API |
| Vector DB | Pinecone / Qdrant | Embeddings storage & search |
| LLM Providers | OpenAI, Vertex AI | Language model inference |
| Observability | LangSmith / LangFuse | LLM monitoring & debugging |
### OpenAI Models
- **GPT-4o**: Fast, cost-effective, 128k-token context window (recommended for most use cases)
- **GPT-4-turbo**: Previous-generation model with the same 128k-token context window at a higher price
- **text-embedding-3-large**: High-quality embeddings (3072 dimensions)
### Google Vertex AI Models
- **Gemini 1.5 Pro**: Massive context (2M tokens), multimodal
- **Gemini 1.5 Flash**: Fast responses, lower cost
- **text-embedding-004**: Enterprise-grade embeddings
---
## Implementation Phases
### Phase 1: Foundation (Weeks 1-4)
**Goal**: Set up core AI infrastructure
- [ ] Deploy FastAPI AI service as Kubernetes pod
- [ ] Implement provider abstraction layer (OpenAI + Vertex)
- [ ] Set up vector database (Pinecone or Qdrant)
- [ ] Create basic agent orchestrator
- [ ] Implement API key management (Vault)
- [ ] Set up monitoring (Prometheus + Grafana)
- [ ] Deploy LangSmith for LLM observability
**Deliverables:**
- AI service accepting requests from Django
- Provider router dynamically selecting OpenAI/Vertex
- Basic health checks and metrics
### Phase 2: RAG Pipeline (Weeks 5-7)
**Goal**: Build retrieval augmented generation
- [ ] Create ETL pipeline for embedding generation
- [ ] Index surveys, job data, documents in vector DB
- [ ] Implement semantic search with reranking
- [ ] Build context assembly logic
- [ ] Add caching layer (Redis)
- [ ] Create feedback collection system
**Deliverables:**
- Vector search returning relevant context
- RAG-enhanced responses with citations
- Scheduled jobs for re-indexing data
### Phase 3: Specialized Agents (Weeks 8-12)
**Goal**: Implement use-case-specific agents
#### Agent 1: Data Analyst Agent
- [ ] Build prompt templates for insights generation
- [ ] Integrate with JobDataResults (JSONB)
- [ ] Create structured output schemas (trends, outliers)
- [ ] Add visualization data generation
- [ ] Implement admin UI component
#### Agent 2: Survey Expert Agent
- [ ] Build question generation prompts
- [ ] Integrate with survey templates and variables
- [ ] Create validation for generated questions
- [ ] Add draft saving functionality
- [ ] Implement admin UI component
#### Agent 3: Data Cleaning Agent
- [ ] Build validation and cleaning prompts
- [ ] Integrate with JobDataOriginal pipeline
- [ ] Implement batch processing (Celery)
- [ ] Create cleaning report generation
- [ ] Add admin UI component
#### Agent 4: Chat Assistant Agent
- [ ] Build conversational prompts
- [ ] Implement conversation memory (Redis)
- [ ] Add WebSocket streaming
- [ ] Create chat UI component
- [ ] Implement context-aware responses
**Deliverables:**
- Four specialized agents operational in admin portal
- UI components integrated with React 18
- End-to-end workflows tested
### Phase 4: Production Hardening (Weeks 13-16)
**Goal**: Production-ready AI system
- [ ] Implement rate limiting per user/agent
- [ ] Add cost budgets and alerts
- [ ] Create fallback strategies (primary → secondary provider)
- [ ] Implement retry logic with exponential backoff
- [ ] Add comprehensive error handling
- [ ] Create admin dashboard for AI usage
- [ ] Implement PII detection and redaction
- [ ] Add audit logging for compliance
- [ ] Performance testing and optimization
- [ ] Security audit and penetration testing
**Deliverables:**
- Production-ready AI service with 99.9% uptime
- Cost controls and monitoring
- Security and compliance measures
- Admin analytics dashboard
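The retry item in the Phase 4 checklist could be sketched like this. The function name, the default delays, and the set of retryable exceptions are assumptions chosen for illustration; real provider SDKs raise their own error types, which would need to be mapped in.

```python
import random
import time


def retry_with_backoff(call, *, max_attempts: int = 4, base_delay: float = 0.5,
                       max_delay: float = 8.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep, rng=random.random):
    """Call `call()` and retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter scales the delay to between 50% and 100% of its value
            # so that many clients do not retry in lockstep.
            sleep(delay * (0.5 + rng() / 2))
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting.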
### Phase 5: Expansion (Future)
**Goals**: Scale to additional use cases
- [ ] Client portal AI features (if needed)
- [ ] Additional agents (e.g., Report Generator, Email Composer)
- [ ] Multi-language support
- [ ] Voice input/output
- [ ] Advanced analytics (sentiment, classification)
---
## Cost Estimation
### OpenAI Pricing (as of 2024)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| GPT-4o | $2.50 | $10.00 |
| GPT-4-turbo | $10.00 | $30.00 |
| text-embedding-3-large | $0.13 | - |
### Vertex AI Pricing (as of 2024)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|----------------------|------------------------|
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
| text-embedding-004 | $0.025 | - |
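To make the tables concrete, per-request cost is input tokens times the input price plus output tokens times the output price, divided by one million. The sketch below hard-codes the list prices from the tables above; the dictionary keys and the `request_cost` name are illustrative, and prices should be re-checked before budgeting.

```python
# (input USD, output USD) per 1M tokens, copied from the tables above.
PRICES_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4-turbo": (10.00, 30.00),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list price."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For example, a request with 2,000 input tokens and 500 output tokens costs about $0.01 on GPT-4o and about $0.005 on Gemini 1.5 Pro.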
### Monthly Estimate (100 admin users, moderate usage)
| Component | Estimated Cost |
|-----------|----------------|
| LLM API calls (mixed) | $500 - $1,500 |
| Embeddings generation | $50 - $150 |
| Vector DB (Pinecone) | $70 - $200 |
| Redis cache | $50 - $100 |
| Monitoring (LangSmith) | $50 - $150 |
| Additional compute (K8s) | $100 - $300 |
| **Total** | **$820 - $2,400/month** |
**Cost Optimization:**
- Use GPT-4o for most tasks; it is cheaper than GPT-4-turbo and has the same 128k-token context window. Route inputs that exceed it to Gemini 1.5 Pro
- Cache responses in Redis (30min TTL)
- Use Vertex AI for batch processing (Gemini 1.5 Pro is about 50% cheaper than GPT-4o at the list prices above)
- Implement smart routing (cost vs latency)
---
## Security Considerations
### API Key Management
- Store keys in HashiCorp Vault or Azure Key Vault
- Rotate keys monthly
- Use separate keys for dev/staging/production
- Implement key usage monitoring
### Data Privacy
- **PII Detection**: Scan inputs for sensitive data (SSN, credit cards)
- **Data Masking**: Redact PII before sending to LLM
- **Audit Logging**: Log all AI interactions to support GDPR and SOC 2 compliance
- **Data Residency**: Use Vertex AI if data must stay in specific regions
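A minimal sketch of the detection and masking steps, assuming regex matching for US Social Security numbers and a Luhn checksum to confirm card numbers. The patterns, placeholder strings, and function names are illustrative assumptions; a production system would likely cover more PII types than these two.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def redact_pii(text: str) -> str:
    """Replace SSNs and Luhn-valid card numbers with placeholders."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)

    def _mask_card(match: re.Match) -> str:
        # Only mask digit runs that are plausible card numbers.
        return "[REDACTED_CARD]" if luhn_ok(match.group()) else match.group()

    return CARD_RE.sub(_mask_card, text)
```

The Luhn check reduces false positives, so long numeric identifiers that are not card numbers are left untouched.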
### Rate Limiting
- Per-user limits (e.g., 100 requests/hour)
- Per-agent limits (e.g., 1000 requests/day)
- Cost budgets (e.g., $500/day)
- Circuit breakers for API failures
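The per-user limit could be enforced with a sliding-window counter along these lines. This sketch keeps state in process memory and uses a hypothetical `SlidingWindowLimiter` class; with several AI service replicas the counters would need to live in shared storage such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within `window_seconds`."""

    def __init__(self, limit: int = 100, window_seconds: float = 3600.0,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable for testing
        self._hits = defaultdict(deque)  # user_id -> request timestamps

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a user who keeps retrying while blocked does not extend their own lockout.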
### Prompt Injection Prevention
- Validate user inputs
- Use system message boundaries
- Implement output validation
- Monitor for suspicious patterns
---
## Monitoring & Alerts
### Key Metrics
- **Token Usage**: Input/output tokens by agent, user, provider
- **Latency**: P50, P95, P99 response times
- **Cost**: Real-time spending by agent/user
- **Error Rate**: 4xx, 5xx errors by provider
- **Cache Hit Rate**: Redis cache effectiveness
- **Embeddings**: Vector DB query latency
### Alerts
- Cost exceeds daily budget ($X/day)
- Error rate > 5% for 5 minutes
- Latency P95 > 10 seconds
- API key approaching rate limit
- Vector DB query failures
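Two of these alert rules can be expressed as a small check over a window of recent samples. The sketch assumes the caller supplies the latencies and error flags for the last five minutes and uses a nearest-rank percentile; in the stack described here such rules would more likely be written as Prometheus alerting rules, so the Python below only illustrates the logic.

```python
import math


def percentile(values, p: int):
    """Nearest-rank percentile of a non-empty sequence (p in 1..100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(latencies_s, error_flags, *,
                    p95_limit_s: float = 10.0,
                    error_rate_limit: float = 0.05) -> list[str]:
    """Return the names of alert rules that fire for this window."""
    alerts = []
    if latencies_s and percentile(latencies_s, 95) > p95_limit_s:
        alerts.append("latency_p95")
    if error_flags and sum(error_flags) / len(error_flags) > error_rate_limit:
        alerts.append("error_rate")
    return alerts
```

An empty window fires no alerts, which avoids false alarms during periods with no traffic.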
### Dashboards
- **Executive Dashboard**: Total cost, usage trends, ROI metrics
- **Operations Dashboard**: Latency, errors, uptime by service
- **Agent Dashboard**: Usage by agent type, performance metrics
- **Cost Dashboard**: Spending by provider, agent, user
---
## Success Metrics
### Technical Metrics
- API response time: < 3 seconds (P95)
- Uptime: > 99.9%
- Error rate: < 1%
- Cache hit rate: > 60%
### Business Metrics
- **Adoption**: % of admins using AI features weekly
- **Efficiency**: Time saved per task (e.g., survey creation time reduced by 50%)
- **Quality**: Accuracy of AI insights (human validation)
- **Satisfaction**: Net Promoter Score (NPS) for AI features
### Cost Metrics
- Cost per request: < $0.10
- ROI: Admin time saved vs AI cost
- Budget adherence: < 10% variance
---
## FAQ
### Why FastAPI for AI service instead of Django?
FastAPI offers:
- **Async support** for LLM streaming
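- **Better performance** for I/O-bound, high-concurrency workloads such as parallel LLM calls (the often-quoted 3x speedup over Django depends on the benchmark)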
- **Native WebSocket** support
- **Smaller container size** (faster deployments)
- **Auto-generated OpenAPI docs**
Django remains for core business logic; FastAPI handles AI-specific workloads.
### Why both OpenAI and Vertex AI?
- **Redundancy**: Fallback if one provider has outages
- **Cost optimization**: Route tasks to cheaper provider
- **Compliance**: Vertex AI for data residency requirements
- **Feature coverage**: Some features are only available on one provider
- **Negotiation leverage**: Multi-provider reduces lock-in
### Why RAG instead of fine-tuning?
- **Cost**: RAG is ~10x cheaper than fine-tuning
- **Flexibility**: Update knowledge base without retraining
- **Transparency**: Citations show data sources
- **Accuracy**: Grounded in real data, less hallucination
- **Maintenance**: No model retraining pipeline needed
### Can we use open-source models (Llama, Mistral)?
Yes, the abstraction layer supports it! But consider:
- **Infrastructure cost**: Need GPU instances ($500-2000/month)
- **Performance**: Open models may lag behind GPT-4o and Gemini on complex tasks
- **Maintenance**: Model updates, optimization, monitoring
- **Best for**: High-volume, privacy-critical use cases
Start with OpenAI/Vertex, migrate to open-source if scale justifies it.
### How do we handle AI hallucinations?
- **RAG**: Ground responses in real data
- **Structured outputs**: JSON schemas enforce format
- **Validation**: Post-process outputs with business rules
- **Citations**: Show data sources for verification
- **Human-in-the-loop**: Admin reviews before applying changes
- **Feedback loop**: Learn from corrections
---
## Next Steps
1. **Review architecture** with engineering team
2. **Choose vector DB**: Pinecone (managed) vs Qdrant (self-hosted)
3. **Decide AI provider**: OpenAI, Vertex, or both
4. **Prioritize agents**: Which use case first?
5. **Allocate budget**: $1-3k/month for initial rollout
6. **Set up dev environment**: AI service + vector DB
7. **Begin Phase 1**: Foundation infrastructure
**Estimated Timeline**: 16 weeks to production-ready AI system
**Team Requirements**:
- 1x ML Engineer (AI service, agents, RAG)
- 1x Backend Engineer (Django integration, APIs)
- 1x Frontend Engineer (React UI components)
- 1x DevOps Engineer (K8s deployment, monitoring)
---
## Resources
### Documentation
- [OpenAI API Docs](https://platform.openai.com/docs)
- [Vertex AI Docs](https://cloud.google.com/vertex-ai/docs)
- [LangChain Docs](https://python.langchain.com/docs/)
- [Pinecone Docs](https://docs.pinecone.io/)
- [LangSmith Docs](https://docs.smith.langchain.com/)
### Example Implementations
- [LangChain + FastAPI Template](https://github.com/langchain-ai/langchain/tree/master/templates)
- [OpenAI Cookbook](https://github.com/openai/openai-cookbook)
- [RAG Tutorial](https://github.com/run-llama/llama_index)
---
**Ready to build the future of compensation analytics with AI!** 🚀