# OPERATIONS RUNBOOK: AI-Driven Cultural Heritage Preservation App
## Table of Contents
1. [Overview](#overview)
2. [Deployment Procedures](#deployment-procedures)
   - [Infrastructure Overview](#infrastructure-overview)
   - [Deployment Steps](#deployment-steps)
   - [Deployment Diagram](#deployment-diagram)
   - [Trade-offs and Rationale](#trade-offs-and-rationale)
3. [Monitoring and Observability](#monitoring-and-observability)
   - [Metrics and Alerts](#metrics-and-alerts)
   - [Logging Strategy](#logging-strategy)
   - [Visualization Tools](#visualization-tools)
4. [Incident Response](#incident-response)
   - [Incident Categories](#incident-categories)
   - [Incident Handling Workflow](#incident-handling-workflow)
   - [Escalation Policy](#escalation-policy)
5. [Disaster Recovery](#disaster-recovery)
   - [Backup Strategy](#backup-strategy)
   - [Recovery Procedures](#recovery-procedures)
   - [Testing Recovery](#testing-recovery)
6. [Appendix](#appendix)
   - [Glossary](#glossary)
   - [Related Documentation](#related-documentation)
---
## Overview
The **AI-Driven Cultural Heritage Preservation App** is a production-grade system designed to digitize, analyze, and preserve cultural artifacts using AI technologies. This document serves as the operational runbook for deploying, monitoring, responding to incidents, and recovering the system in case of failures.
The system is built with a microservices architecture, leveraging containerized services orchestrated by Kubernetes. It integrates AI/ML pipelines for artifact recognition and metadata extraction, and it provides a web-based interface for users to interact with the system.
This document is intended for DevOps engineers, SREs (Site Reliability Engineers), and developers responsible for maintaining and extending the system.
---
## Deployment Procedures
### Infrastructure Overview
The system is deployed on a cloud-native architecture using **AWS** as the primary cloud provider. The infrastructure includes:
- **Kubernetes Cluster**: Orchestrates containerized microservices.
- **Amazon RDS**: Stores metadata and user data.
- **Amazon S3**: Stores digitized artifacts and AI model files.
- **Amazon SageMaker**: Hosts AI/ML models for artifact recognition.
- **Amazon CloudFront**: Serves static assets and provides CDN capabilities.
- **Amazon CloudWatch**: Monitors logs and metrics.
- **AWS Lambda**: Handles serverless tasks such as image preprocessing.
#### Key Components
1. **Frontend**: React-based web application served via CloudFront.
2. **Backend**: Node.js/Express API for business logic and communication with the database.
3. **AI/ML Pipeline**: Python-based services for artifact recognition and metadata extraction.
4. **Database**: PostgreSQL database hosted on Amazon RDS.
5. **Storage**: S3 buckets for storing large files and backups.
---
### Deployment Steps
1. **Prepare the Environment**:
   - Ensure the AWS CLI is installed and configured with appropriate IAM credentials.
   - Verify that the Kubernetes CLI (`kubectl`) and Helm are installed.
   - Confirm access to the Git repository and CI/CD pipeline.
2. **Infrastructure Setup**:
   - Use the Terraform scripts in `infra/terraform` to provision AWS resources. Save the plan and review it before applying:
     ```bash
     terraform init
     terraform plan -out=tfplan
     terraform apply tfplan
     ```
   - This creates the Kubernetes cluster, RDS instance, S3 buckets, and other required resources.
3. **Build and Push Docker Images**:
   - Build a Docker image for each microservice:
     ```bash
     docker build -t <repository>/<service-name>:<version> .
     ```
   - Authenticate Docker with Amazon Elastic Container Registry (ECR), then push the image:
     ```bash
     aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <repository>
     docker push <repository>/<service-name>:<version>
     ```
4. **Deploy to Kubernetes**:
   - Use the Helm charts in `infra/helm` to deploy each service. `helm upgrade --install` works for both first-time and repeat deployments:
     ```bash
     helm upgrade --install <release-name> ./infra/helm/<service-name>
     ```
   - Verify that the pods are running and the services are exposed:
     ```bash
     kubectl get pods
     kubectl get services
     ```
5. **Run Post-Deployment Checks**:
   - Verify that the application is accessible via the public endpoint.
   - Run the integration test suite:
     ```bash
     npm run test:integration
     ```
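The `kubectl get pods` verification in step 4 can also be scripted. The following is a minimal sketch, not part of the repository: it assumes the output of `kubectl get pods -o json` has already been parsed into a dictionary, and the function name `unready_pods` is hypothetical. Pods from completed Jobs (phase `Succeeded`) would also be reported, so filter them out first if the cluster runs batch workloads.

```python
def unready_pods(pod_list: dict) -> list[str]:
    """Return the names of pods that are not Running with every container ready.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`, e.g.
    json.load(sys.stdin) when the command output is piped into a script.
    """
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            not_ready.append(pod["metadata"]["name"])
    return not_ready
```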
---
### Deployment Diagram
Below is a high-level architecture diagram of the system:
```mermaid
graph TD
User -->|HTTP Requests| CloudFront -->|API Gateway| Backend
Backend -->|Queries| RDS[(PostgreSQL)]
Backend -->|Fetch| S3[(Artifact Storage)]
Backend -->|Invoke| SageMaker[(AI/ML Models)]
Backend -->|Logs| CloudWatch
```
---
### Trade-offs and Rationale
- **Kubernetes**: Chosen for its scalability and flexibility. It adds operational complexity, but each microservice can be scaled independently.
- **AWS Services**: Managed services (e.g., RDS, S3) reduce operational overhead. The trade-off is vendor lock-in.
- **Helm**: Simplifies Kubernetes deployments but adds a learning curve for new team members.
---
## Monitoring and Observability
### Metrics and Alerts
Key metrics to monitor:
- **Application Metrics**:
  - API response times (P95, P99 latencies).
  - Error rates (HTTP 4xx/5xx).
- **Infrastructure Metrics**:
  - CPU and memory usage of Kubernetes pods.
  - Disk I/O and free storage for RDS; bucket size and object count for S3.
- **AI/ML Metrics**:
  - Model inference latency.
  - Model accuracy (monitored via SageMaker).
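P95 and P99 are percentiles of the observed response times. In practice the monitoring stack computes these, but as an illustration of what the figures mean, the nearest-rank method can be sketched as follows. The function name `percentile` is hypothetical and the sample data is made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)  # 95
p99 = percentile(latencies_ms, 99)  # 99
```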
Set up alerts in **CloudWatch** for:
- High API error rates (>5% over 5 minutes).
- RDS CPU utilization > 80%.
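- S3 bucket size (`BucketSizeBytes`) growing beyond the expected budget. S3 has no fixed capacity, so this is a cost and anomaly signal rather than a capacity limit.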
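The error-rate rule above (more than 5% over 5 minutes) is normally configured as a CloudWatch alarm, but its logic can be sketched in a few lines. This is an illustration under stated assumptions, not the alarm definition: the function name is hypothetical, timestamps are epoch seconds, only 5xx responses are counted as errors by default (pass `min_error_status=400` to include 4xx), and an empty window does not fire.

```python
def error_rate_breached(requests, now, window_s=300, threshold=0.05,
                        min_error_status=500):
    """Return True if the share of error responses in the last `window_s`
    seconds is strictly greater than `threshold`.

    `requests` is an iterable of (timestamp_seconds, http_status) pairs.
    """
    recent = [status for ts, status in requests if now - window_s <= ts <= now]
    if not recent:
        return False  # assumption: no traffic means no alert
    errors = sum(1 for status in recent if status >= min_error_status)
    return errors / len(recent) > threshold
```

Because the comparison is strict, a window with exactly 5% errors does not trigger the alert, which matches the ">5%" wording of the rule.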
### Logging Strategy
- **Centralized Logging**: Use **Fluentd** to aggregate logs from all services and forward them to CloudWatch Logs.
- **Log Levels**:
  - `INFO`: General application events.
  - `WARN`: Non-critical issues.
  - `ERROR`: Critical failures requiring immediate attention.
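Log aggregators are easier to query when each service emits one structured record per line. The sketch below shows one way a Python-based service could do that with the standard `logging` module. It is an assumption, not the project's actual logging setup: the `JsonFormatter` class and the field names are hypothetical, and Python's built-in `WARNING` level is renamed to `WARN` to match the levels listed above.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Python calls the level WARNING; the runbook calls it WARN.
    LEVEL_NAMES = {"WARNING": "WARN"}

    def format(self, record):
        level = self.LEVEL_NAMES.get(record.levelname, record.levelname)
        return json.dumps({
            "level": level,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name):
    """Create a logger that writes JSON lines to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `build_logger("artifact-recognition").warning("inference took %d ms", 1200)` would then print a JSON line with `"level": "WARN"`; the logger name here is only an example.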
### Visualization Tools
- **Grafana**: Visualize metrics from Prometheus (integrated with Kubernetes).
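- **Kibana**: Analyze logs stored in Elasticsearch. This requires Fluentd to also ship logs to an Elasticsearch cluster, since the logging strategy above forwards them only to CloudWatch Logs.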
- **AWS CloudWatch Dashboards**: Monitor AWS-specific metrics.
---
## Incident Response
### Incident Categories
1. **Critical**: Complete system outage or data loss.
2. **High**: Partial system outage or degraded performance.
3. **Medium**: Non-critical issues affecting a subset of users.
4. **Low**: Minor issues with no immediate impact.
### Incident Handling Workflow
1. **Detection**:
   - Alerts are triggered via CloudWatch and routed to PagerDuty.
2. **Triage**:
   - On-call engineer assesses the severity and impact.
3. **Mitigation**:
   - Apply immediate fixes (e.g., scaling pods, restarting services).
4. **Resolution**:
   - Deploy permanent fixes via hotfix or scheduled release.
5. **Postmortem**:
   - Document the incident in the Incident Log.
### Escalation Policy
1. On-call engineer investigates and attempts resolution within 30 minutes.
2. If unresolved, escalate to the team lead.
3. If still unresolved after 1 hour, escalate to the engineering manager.
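The escalation ladder above can be expressed as a small lookup, for example to drive a reminder bot. This sketch assumes that both the 30-minute and the 1-hour thresholds are measured from the start of the incident, which the policy does not state explicitly. The function name and the returned labels are hypothetical.

```python
def escalation_owner(minutes_unresolved):
    """Return who should own an incident that has been open for
    `minutes_unresolved` minutes, measured from incident start."""
    if minutes_unresolved < 0:
        raise ValueError("minutes_unresolved must be non-negative")
    if minutes_unresolved < 30:
        return "on-call engineer"
    if minutes_unresolved < 60:
        return "team lead"
    return "engineering manager"
```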
---
## Disaster Recovery
### Backup Strategy
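- **Database Backups**:
  - Automated daily snapshots of RDS.
  - Retain backups for 30 days.
- **Artifact Backups**:
  - S3 versioning enabled for all buckets.
- **Configuration Backups**:
  - Store Kubernetes manifests and Terraform configuration in a version-controlled Git repository.
  - Keep Terraform state in a remote backend with locking (for example, a versioned S3 bucket) rather than in Git, because state files can contain secrets in plain text.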
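A daily snapshot schedule is only useful if someone notices when a snapshot is missed. The following sketch checks that the newest snapshot is recent enough. It is illustrative only: the function name is hypothetical, the snapshot timestamps are assumed to have been fetched elsewhere as timezone-aware `datetime` values, and the 26-hour limit (24 hours plus a grace period) is an assumed value, not one taken from this runbook.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(snapshot_times, now, max_age=timedelta(hours=26)):
    """Return True if the most recent snapshot is no older than `max_age`.

    An empty list means no backups exist, which is reported as not fresh.
    """
    if not snapshot_times:
        return False
    return now - max(snapshot_times) <= max_age


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
snapshots = [
    datetime(2024, 1, 8, 3, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 9, 3, 0, tzinfo=timezone.utc),
]
stale = backup_is_fresh(snapshots, now)  # False: newest snapshot is 33 hours old
```

A check like this could run on a schedule and raise an alert when it returns `False`.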
### Recovery Procedures
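1. **Database Recovery**:
   - Restore RDS from the latest snapshot via the AWS Console or CLI. A snapshot restore creates a new DB instance, so update the backend's database endpoint (or DNS record) once the restored instance is available.
2. **Artifact Recovery**:
   - Use S3 versioning to retrieve previous versions of files.
3. **Infrastructure Recovery**:
   - Reapply the Terraform scripts to recreate infrastructure.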
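For artifact recovery, the main decision is which stored version to bring back. The sketch below picks the newest version written at or before a chosen point in time, for example just before a corruption was introduced. It assumes the version list has the shape of the `Versions` entries returned by S3's ListObjectVersions call (each with `VersionId` and `LastModified`). The function name and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def version_to_restore(versions, cutoff):
    """Return the VersionId of the newest version last modified at or
    before `cutoff`, or None if every version is newer than the cutoff."""
    candidates = [v for v in versions if v["LastModified"] <= cutoff]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]


versions = [
    {"VersionId": "v3", "LastModified": datetime(2024, 3, 5, tzinfo=timezone.utc)},
    {"VersionId": "v2", "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
    {"VersionId": "v1", "LastModified": datetime(2024, 2, 1, tzinfo=timezone.utc)},
]
cutoff = datetime(2024, 3, 3, tzinfo=timezone.utc)
chosen = version_to_restore(versions, cutoff)  # "v2"
```

Restoring then means copying the chosen version back onto the same key, which makes it the current version again without deleting the newer ones.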
### Testing Recovery
- Perform quarterly disaster recovery drills.
- Simulate scenarios such as database corruption or S3 data loss.
- Document recovery times and update procedures as needed.
---
## Appendix
### Glossary
- **Kubernetes**: An open-source system for automating deployment, scaling, and management of containerized applications.
- **Helm**: A package manager for Kubernetes.
- **SageMaker**: AWS service for building, training, and deploying machine learning models.
### Related Documentation
- [Architecture Overview](docs/ARCHITECTURE.md)
- [API Documentation](docs/API.md)
- [CI/CD Pipeline Guide](docs/CI_CD.md)
- [Security Best Practices](docs/SECURITY.md)
---
This document provides a comprehensive guide for operating the AI-Driven Cultural Heritage Preservation App. For any questions or clarifications, please contact the DevOps team.