---
id: rfc-053
title: "RFC-053: Schematized Topic-Based Microservice Pipelines"
sidebar_label: "RFC-053: Topic Pipelines"
status: Draft
author: jrepp
created: 2025-11-10
updated: 2025-11-15
project_id: prism
doc_uuid: a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d
tags:
- kafka
- messaging
- microservices
- schema
- developer-experience
- observability
- debugging
- patterns
- architecture
---
# RFC-053: Schematized Topic-Based Microservice Pipelines
## Executive Summary
This RFC proposes support for **schematized topic-based microservice pipelines**, where data flows through a chain of Kafka topics with strongly-typed schemas at each stage. Prism will provide developer tooling to discover, debug, and optimize these workflows through:
1. **Pipeline Discovery & Visualization** - Automatic topology detection from topic schemas and consumer group subscriptions
2. **Schema-Aware Observability** - End-to-end tracing with schema validation at each pipeline stage
3. **Developer Productivity Tools** - Local replay, diff tooling, and pipeline testing frameworks
4. **Production Debugging** - Data lineage tracking, schema evolution management, and error root cause analysis
## Problem Statement
### Current Challenges with Topic-Based Pipelines
Event-driven microservice architectures rely on pipelines where:
- Services process data sequentially through Kafka topics
- Each stage reads from input topics, transforms data, and writes to output topics
- Schemas evolve independently across services
- Debugging requires manual correlation across topics, consumer groups, and traces
**Pain Points:**
1. **Pipeline Visibility**
- No automatic discovery of topic relationships
- Unclear which services consume from which topics
- Documentation becomes stale
- Topology difficult to understand
2. **Schema Management**
- Schema evolution breaks downstream consumers
- No validation that output matches next stage input
- Schema version compatibility unclear
- Version mismatches cause runtime failures
3. **Debugging**
- Errors in stage N+1 caused by bad data from earlier stages
- Manual message correlation across topics
- No tooling to replay message flows
- Production issues require extensive log analysis
4. **Developer Productivity**
- Local testing requires full infrastructure
- No pre-deployment validation of downstream impact
- Schema changes require manual team coordination
- Integration testing requires entire pipeline
5. **Operations**
- Consumer lag at one stage cascades downstream
- No unified pipeline health view
- Bottleneck identification requires manual analysis
- Inconsistent error handling across stages
## Goals
### Primary Goals
1. **Pipeline Discovery**: Automatically detect and visualize topic-based pipelines from:
- Kafka topic metadata and schema registry
- Consumer group subscriptions and topic assignments
- Producer/consumer configuration in Prism patterns
2. **Schema-Aware Tracing**: Provide end-to-end observability with:
- Message tracing across all pipeline stages
- Automatic schema validation at each topic boundary
- Schema evolution impact analysis
- Data lineage tracking from source to sink
3. **Developer Tools**: Enable productive local development with:
- Pipeline replay from any stage with historical data
- Schema diff tooling to identify breaking changes
- Local pipeline testing with mock topics
- Contract testing between pipeline stages
4. **Production Debugging**: Simplify troubleshooting with:
- Root cause analysis for data quality issues
- Message flow visualization for specific correlation IDs
- Schema mismatch detection and alerting
- Pipeline health dashboards with stage-specific metrics
### Non-Goals
- Building a general-purpose workflow orchestration engine (use Airflow/Temporal instead)
- Supporting non-Kafka messaging systems in initial version
- Replacing existing schema registry implementations
- Providing data transformation logic (Prism is infrastructure, not business logic)
- Real-time stream processing with complex joins (use Kafka Streams/Flink)
- Batch job orchestration (use Airflow for batch pipelines)
- Schema migration tooling (use schema registry features)
- Long-term data archival (use dedicated data lake solutions)
## Proposed Solution
### Architecture Overview
Prism will introduce a new **Pipeline Pattern** that extends the existing Producer/Consumer patterns with pipeline metadata and connects them to a central Pipeline Registry:
```text
┌─────────────────────────────────────────────────────────────────┐
│                     Prism Pipeline Registry                     │
│  - Topology Discovery                                           │
│  - Schema Evolution Tracking                                    │
│  - Pipeline Metadata Store                                      │
└─────────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
┌───────────▼──────┐   ┌──────▼────────┐   ┌────▼──────────┐
│ Producer         │   │ Transform     │   │ Consumer      │
│ Pattern          │   │ Pattern       │   │ Pattern       │
│ +Pipeline Ext    │   │ +Pipeline     │   │ +Pipeline     │
└──────────────────┘   └───────────────┘   └───────────────┘
         │                   │                 │
         │                   │                 │
    ┌────▼─────┐         ┌───▼────┐        ┌───▼────┐
    │ Topic A  │────────▶│Topic B │───────▶│Topic C │
    │ Schema 1 │         │Schema 2│        │Schema 3│
    └──────────┘         └────────┘        └────────┘
```
### Key Components
#### 1. Pipeline Discovery Service
**Responsibilities:**
- Scan Kafka cluster for topics and consumer groups
- Query schema registry for topic schemas
- Correlate producers/consumers using Prism pattern metadata
- Build a directed graph of topic relationships (expected to be acyclic, i.e. a DAG)
- Detect cycles and orphaned topics
**Implementation:**
```protobuf
syntax = "proto3";

import "google/protobuf/timestamp.proto";

// SchemaVersion, PipelineStageMetadata, and SchemaCompatibility are defined elsewhere.

message PipelineTopology {
  string pipeline_id = 1;
  repeated PipelineStage stages = 2;
  repeated TopicEdge edges = 3;
  map<string, SchemaVersion> schemas = 4;
  google.protobuf.Timestamp discovered_at = 5;
}

message PipelineStage {
  string stage_id = 1;
  string service_name = 2;
  repeated string input_topics = 3;
  repeated string output_topics = 4;
  string consumer_group = 5;
  PipelineStageMetadata metadata = 6;
}

message TopicEdge {
  string from_stage = 1;
  string to_stage = 2;
  string topic_name = 3;
  SchemaCompatibility compatibility = 4;
}
```
**Discovery Algorithm:**
1. Enumerate topics with schema registry entries
2. Identify subscribed topics per consumer group
3. Link consumers to output topics via Prism pattern metadata
4. Build directed graph: services as nodes, topics as edges
5. Validate schema compatibility at each edge
6. Detect cycles and orphaned topics
7. Store topology in Pipeline Registry with timestamp
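The graph-building and validation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_topology` and `has_cycle` are hypothetical names, stages are plain dicts shaped like `PipelineStage`, and "orphaned" is taken to mean a topic that has a producer but no consumer (or the reverse) among the known stages. A real implementation would likely exempt declared source and sink topics.

```python
from collections import defaultdict


def build_topology(stages):
    """Return (edges, orphaned_topics) for a list of stage dicts.

    Each stage dict has 'stage_id', 'input_topics', and 'output_topics'.
    """
    producers = defaultdict(list)  # topic -> stages that write it
    consumers = defaultdict(list)  # topic -> stages that read it
    for stage in stages:
        for topic in stage.get("output_topics", []):
            producers[topic].append(stage["stage_id"])
        for topic in stage.get("input_topics", []):
            consumers[topic].append(stage["stage_id"])

    # One edge per (producer, consumer) pair that shares a topic.
    edges = [
        (src, dst, topic)
        for topic in sorted(producers)
        for src in producers[topic]
        for dst in consumers.get(topic, [])
    ]
    # Assumption: a topic seen on only one side counts as orphaned.
    orphaned = sorted(set(producers) ^ set(consumers))
    return edges, orphaned


def has_cycle(edges):
    """Return True if the stage graph contains a directed cycle (Kahn's algorithm)."""
    nodes = {n for src, dst, _ in edges for n in (src, dst)}
    indegree = {n: 0 for n in nodes}
    adjacency = defaultdict(list)
    for src, dst, _ in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1

    ready = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for nxt in adjacency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Nodes that never reach indegree 0 sit on a cycle.
    return visited != len(nodes)
```

Kahn's algorithm is used here because it reports a cycle without recursion, which keeps the sketch safe for long pipelines.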
#### 2. Schema-Aware Tracing
**Extend existing message envelope with pipeline context:**
```protobuf
message PipelineContext {
  string pipeline_id = 1;
  string stage_id = 2;
  int32 stage_sequence = 3;   // Position in pipeline (0-based)
  string correlation_id = 4;  // End-to-end trace ID
  repeated SchemaValidation validations = 5;
  map<string, string> stage_metadata = 6;
  DataLineage lineage = 7;
}

message SchemaValidation {
  string stage_id = 1;
  string schema_name = 2;
  int32 schema_version = 3;
  bool validation_passed = 4;
  repeated string errors = 5;
  google.protobuf.Timestamp validated_at = 6;
}

message DataLineage {
  repeated DataSource sources = 1;
  repeated Transformation transformations = 2;
  repeated ModelVersion models = 3;
  DataQualityMetrics quality = 4;
  google.protobuf.Timestamp expires_at = 5;
  repeated string pii_tags = 6;
}

message DataSource {
  string source_id = 1;
  string source_type = 2;
  string source_location = 3;
  google.protobuf.Timestamp ingested_at = 4;
  map<string, string> metadata = 5;
}

message Transformation {
  string transform_id = 1;
  string transform_type = 2;
  string code_version = 3;
  map<string, double> metrics = 4;
}

message ModelVersion {
  string model_id = 1;
  string model_version = 2;
  string framework = 3;
  map<string, double> inference_metrics = 4;
}

message DataQualityMetrics {
  int64 records_processed = 1;
  int64 records_failed = 2;
  map<string, double> quality_scores = 3;
  repeated string anomalies = 4;
}
```
**Tracing Flow:**
1. Producer creates message with initial `pipeline_context`
2. Each stage validates input schema before processing
3. Validation results appended to `pipeline_context.validations`
4. Transformation metrics recorded in `pipeline_context.lineage`
5. OpenTelemetry spans link stages via `correlation_id`
6. Prism proxy logs validation failures with stage context
7. Failed messages routed to DLQ with full lineage
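As a rough illustration of steps 2, 3, and 7, the sketch below validates a payload at a stage, appends the result to the context, and routes failures to a dead-letter list. Everything here is an assumption made for illustration: `validate_at_stage` is a hypothetical helper, the dataclasses only mirror a subset of the protobuf messages, and a required-field check stands in for real schema validation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SchemaValidation:
    stage_id: str
    schema_name: str
    schema_version: int
    validation_passed: bool
    errors: list = field(default_factory=list)
    validated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class PipelineContext:
    pipeline_id: str
    correlation_id: str
    validations: list = field(default_factory=list)


def validate_at_stage(payload, context, stage_id, schema_name, schema_version,
                      required_fields, dlq):
    """Validate payload, record the result on the context, route failures to the DLQ.

    Returns True when the stage may go on to process the payload.
    """
    errors = [
        f"missing required field: {name}"
        for name in required_fields
        if name not in payload
    ]
    context.validations.append(
        SchemaValidation(stage_id, schema_name, schema_version, not errors, errors)
    )
    if errors:
        # Keep the full context with the failed message so lineage survives in the DLQ.
        dlq.append({"payload": payload, "context": context})
        return False
    return True
```

Because every stage appends to the same `validations` list, the history for one `correlation_id` can later be read back in stage order.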
**Benefits:**
- End-to-end message flow visibility
- Automatic schema mismatch detection
- Debugging context for production issues
- Data lineage for compliance (GDPR, SOC2)
#### 3. Developer Tooling
**A. Pipeline CLI (`prismctl pipeline`)**
```bash
# Discover and visualize pipelines
prismctl pipeline discover --cluster=prod

# Show topology for specific pipeline
prismctl pipeline show user-signup-flow --format=dot
prismctl pipeline show user-signup-flow --format=mermaid

# Validate schema compatibility
prismctl pipeline validate user-signup-flow

# Replay messages for testing with partition awareness
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --from-stage=email-validator \
  --correlation-id=abc123 \
  --partition=2 \
  --offset-range=1000-2000 \
  --preserve-order=true \
  --rate-limit=1000 \
  --target=local

# Replay with sampling for large datasets
prismctl pipeline replay \
  --pipeline=ml-feature-pipeline \
  --from-stage=feature-transformer \
  --time-range="2025-11-06T00:00:00Z/2025-11-07T00:00:00Z" \
  --sample-rate=0.1 \
  --target=local

# Diff schemas between stages
prismctl pipeline schema-diff \
  --pipeline=user-signup-flow \
  --from-stage=api-gateway \
  --to-stage=email-validator

# Sample messages for debugging
# (single quotes keep the shell from interpreting the JSONPath "$")
prismctl pipeline sample \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --filter='$.features.click_rate > 0.8' \
  --count=100 \
  --output=debug_samples.jsonl

# Compare outputs between versions
prismctl pipeline diff \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --baseline-version=v1.2.0 \
  --comparison-version=v1.3.0 \
  --correlation-ids=sample.txt \
  --show-distribution-stats
```
**B. Local Development Environment**
```yaml
# .prism/pipelines/user-signup.yml
pipeline:
  id: user-signup-flow
  stages:
    - id: api-gateway
      service: api-service
      output_topics:
        - name: raw-signups
          schema: SignupRequest
    - id: email-validator
      service: validator-service
      input_topics:
        - name: raw-signups
          schema: SignupRequest
      output_topics:
        - name: validated-signups
          schema: ValidatedSignup
    - id: user-creator
      service: user-service
      input_topics:
        - name: validated-signups
          schema: ValidatedSignup
      output_topics:
        - name: created-users
          schema: User

# Start the local pipeline with mocked topics (shell command, not part of the YAML):
#   prismctl pipeline dev user-signup-flow
```
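A definition like this can be checked mechanically before anything is deployed. The sketch below is a hypothetical helper, not part of `prismctl`: `check_stage_contracts` takes the parsed definition as a plain dict and assumes stages are listed upstream-first. It reports inputs with no upstream producer and inputs whose declared schema differs from what the upstream stage says it writes.

```python
def check_stage_contracts(pipeline):
    """Return a list of human-readable problems found in a pipeline definition dict."""
    produced = {}  # topic name -> (stage id, schema name)
    problems = []
    for stage in pipeline["stages"]:
        for topic in stage.get("input_topics", []):
            name, schema = topic["name"], topic["schema"]
            if name not in produced:
                problems.append(
                    f"{stage['id']}: input topic {name!r} has no upstream producer"
                )
            elif produced[name][1] != schema:
                upstream, upstream_schema = produced[name]
                problems.append(
                    f"{stage['id']}: expects {schema} on {name!r} "
                    f"but {upstream} produces {upstream_schema}"
                )
        # Register outputs after checking inputs so a stage cannot satisfy itself.
        for topic in stage.get("output_topics", []):
            produced[topic["name"]] = (stage["id"], topic["schema"])
    return problems
```

An empty list means every declared input lines up with an upstream output of the same schema name. Comparing names only is a simplification; version compatibility would need the schema registry.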
**C. Contract Testing Framework**
```python
# tests/test_pipeline_contracts.py
from prism_data.testing import PipelineContractTest


class TestUserSignupPipeline(PipelineContractTest):
    pipeline_id = "user-signup-flow"

    def test_email_validator_input_contract(self):
        """Ensure email-validator can consume from api-gateway output."""
        self.assert_schema_compatible(
            producer_stage="api-gateway",
            consumer_stage="email-validator",
            topic="raw-signups",
        )

    def test_end_to_end_message_flow(self):
        """Test complete pipeline with sample message."""
        result = self.send_message(
            stage="api-gateway",
            # example.com placeholder address
            message={"email": "test@example.com", "name": "Test User"},
        )
        self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="user-creator",
            timeout_seconds=10,
        )
        self.assert_schema_validations_passed(
            correlation_id=result.correlation_id
        )


class TestMLFeaturePipeline(PipelineContractTest):
    pipeline_id = "ml-feature-pipeline"

    def test_feature_transformer_data_quality(self):
        """Ensure feature transformer output meets quality thresholds."""
        result = self.send_message(
            stage="raw-data-ingestion",
            message={"user_id": 123, "clicks": 5, "timestamp": "2025-11-07"},
        )
        output = self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="feature-transformer",
            timeout_seconds=10,
        )
        self.assert_data_quality(
            output,
            rules={
                "completeness": {"threshold": 0.95},
                "range_checks": {
                    "click_rate": {"min": 0.0, "max": 1.0},
                    "feature_count": {"min": 10, "max": 100},
                },
                "no_nulls": ["user_id", "feature_vector"],
                "distribution": {
                    "feature_vector": {"mean_range": [0, 10], "std_range": [0, 5]}
                },
            },
        )

    def test_schema_drift_detection(self):
        """Alert on schema drift in feature pipeline."""
        self.assert_no_schema_drift(
            stage="feature-transformer",
            baseline_time_range="2025-10-01/2025-10-31",
            comparison_time_range="2025-11-01/2025-11-07",
            max_drift_score=0.1,
        )
```
#### 4. Production Debugging
**A. Pipeline Health Dashboard**
Grafana dashboard with:
- Pipeline topology visualization (auto-generated from discovery)
- Per-stage metrics: throughput, latency, error rate
- Consumer lag across all stages
- Schema validation failure rate
- Correlation ID search and trace viewer
**B. Root Cause Analysis**
When errors occur at stage N, Prism automatically:
1. Identifies messages that failed validation
2. Traces correlation ID back through pipeline
3. Highlights stage where data corruption occurred
4. Shows schema diff between expected and actual
5. Links to relevant service logs and traces
**Example Scenario:**
```text
Stage 3 (user-creator) failing with schema validation errors:
┌─────────────────────────────────────────────────────────┐
│ Root Cause: email-validator (Stage 2)                   │
│                                                         │
│ Expected output schema: ValidatedSignup v2              │
│ Actual output schema: ValidatedSignup v1                │
│                                                         │
│ Missing field: phone_number (required in v2)            │
│                                                         │
│ Action: Upgrade email-validator to produce v2 schema    │
│ Or: Downgrade user-creator to accept v1 schema          │
└─────────────────────────────────────────────────────────┘
```
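The trace-back in the scenario above can be approximated with a small function over the validation history of one correlation ID. This is a sketch under assumptions: `find_root_cause` is a hypothetical name, validations are plain dicts shaped like `SchemaValidation`, and the heuristic simply blames the stage immediately upstream of the first stage whose input validation failed.

```python
def find_root_cause(stage_order, validations):
    """Return (failing_stage, suspected_stage), or None if every validation passed.

    stage_order: stage IDs in pipeline order.
    validations: dicts with 'stage_id' and 'validation_passed' for one correlation ID.
    """
    position = {stage_id: i for i, stage_id in enumerate(stage_order)}
    failed = [v for v in validations if not v["validation_passed"]]
    if not failed:
        return None
    # The earliest failing stage is the first place bad data was observed.
    first = min(failed, key=lambda v: position[v["stage_id"]])
    index = position[first["stage_id"]]
    # Input validation failed here, so the data was most likely produced
    # incorrectly one stage earlier; the first stage can only blame itself.
    suspect = stage_order[index - 1] if index > 0 else first["stage_id"]
    return first["stage_id"], suspect
```

For the scenario above, a failure recorded at `user-creator` yields `email-validator` as the suspected stage.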
**C. ML-Specific Debugging**
**Feature Drift Detection:**
- Statistical tests (Kolmogorov-Smirnov, chi-squared) on feature distributions
- Alert when distributions deviate beyond configurable threshold
- Time-series visualization of feature statistics
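To make the drift check concrete, here is a dependency-free sketch of the two-sample Kolmogorov-Smirnov statistic for one numeric feature. The function names and the `0.1` threshold are illustrative assumptions. The code returns only the statistic (the largest gap between the two empirical CDFs), not a p-value; a library routine such as `scipy.stats.ks_2samp` would give both.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in set(a) | set(b):  # the gap can only change at observed values
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drifted(baseline, current, threshold=0.1):
    """Flag drift when the KS statistic exceeds a configurable threshold."""
    return ks_statistic(baseline, current) > threshold
```

Identical samples give a statistic of 0, and samples with no overlap give 1, so the threshold can be read as the tolerated fraction of distributional shift.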
**Model Performance Tracking:**
- Track inference latency, throughput, prediction distributions per stage
- Correlate model version with performance metrics
- Automatic rollback trigger on regression beyond threshold
- Record model metadata in pipeline context lineage
**Data Sampling for Debug:**
- JSONPath filters for targeted message sampling
- Stratified sampling for representative debugging datasets
- Export to JSONL for offline analysis
### Integration with Existing Patterns
**Backward Compatibility:**
- Existing Producer/Consumer patterns work unchanged
- Pipeline features opt-in via pattern configuration
- No breaking changes to protobuf message definitions
**Pattern Extensions:**
```yaml
# Producer pattern with pipeline support
pattern: producer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: api-gateway
  output_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: latest
---
# Consumer pattern with pipeline support
pattern: consumer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: email-validator
  input_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: 2
  output_topics:
    - name: validated-signups
      schema:
        registry: confluent
        subject: validated-signups-value
        version: 1
```
## Implementation Plan
### Phase 1: Foundation (Weeks 1-4)
**Deliverables:**
- [ ] Pipeline topology protobuf definitions
- [ ] Pipeline Registry service (basic CRUD)
- **Dependency**: Choose storage backend (SQLite vs Postgres - ADR needed)
- **Risk**: Schema evolution for registry itself
- [ ] Discovery service for Kafka cluster scanning
- **Dependency**: Kafka Admin API rate limits - test at scale first
- **Risk**: Large clusters (>5000 topics) may time out
- [ ] Schema compatibility validator
- **Dependency**: Integration with Confluent Schema Registry AND AWS Glue
- **Risk**: Different registries have incompatible APIs
- [ ] `prismctl pipeline discover` command
- [ ] Load test discovery with synthetic topics
**Success Criteria:**
- Can scan production Kafka cluster with throttling
- Generate pipeline topology graph in <30s for 1000 topics
- Detect schema incompatibilities with 100% accuracy vs manual verification
- Load test passes with 10,000 synthetic topics
### Phase 2: Developer Tooling (Weeks 5-8)
**Deliverables:**
- [ ] Pipeline CLI commands (show, validate, schema-diff, replay, sample)
- [ ] Local development environment support
- [ ] Pipeline visualization (DOT/Mermaid export)
- [ ] Contract testing framework with data quality validation
- [ ] Partition-aware replay with sampling support
**Success Criteria:**
- Developers can visualize and replay pipelines locally
- Schema diffs identify breaking changes before deployment
- Contract tests validate schema compatibility and data quality
- Replay supports partition targeting and rate limiting
### Phase 3: Tracing & Observability (Weeks 9-12)
**Deliverables:**
- [ ] Pipeline context in message envelope with data lineage
- [ ] Schema validation at each stage
- [ ] OpenTelemetry integration with pipeline spans
- [ ] Grafana dashboard templates with topology visualization
- [ ] Correlation ID search with trace aggregation
- [ ] ML-specific metrics (feature drift, model performance)
**Success Criteria:**
- End-to-end trace spans link all pipeline stages
- Schema validation failures logged and alerted with root cause
- Dashboards show pipeline health, consumer lag, validation rates
- Data lineage tracked for ML compliance
### Phase 4: Production Debugging (Weeks 13-16)
**Deliverables:**
- [ ] Root cause analysis with automatic stage identification
- [ ] Message replay with partition and offset control
- [ ] Schema evolution impact reports with migration plans
- [ ] Pipeline health alerting rules (lag, validation failures, drift)
- [ ] Feature drift detection for ML pipelines
- [ ] Dead-letter queue integration
**Success Criteria:**
- Reduce MTTR for pipeline issues
- Identify root cause stage for schema errors automatically
- Support partition-aware replay with sampling
- Detect and alert on feature drift for ML workloads
## Success Metrics
### Developer Productivity
- **Onboarding Time**: Time for new developers to understand pipeline topology
- **Debug Time**: Time to identify root cause of pipeline failures
- **Test Coverage**: Percentage of pipeline stages covered by contract tests
### Operational Excellence
- **MTTR**: Mean time to resolution for pipeline issues
- **Schema Errors**: Production schema validation failures caught pre-deployment
- **Pipeline Visibility**: Production pipelines automatically discovered and visualized
### Performance & Scale
- **Discovery Throughput**: Support clusters with 10,000+ topics without degradation
- **Discovery Latency**: Complete topology scan in <30s for 1,000 topics
- **Trace Overhead**: Pipeline context adds <5% message size overhead
- **Replay Performance**: Support replay at 10,000 msg/s for debugging
- **Memory Footprint**: Pipeline registry uses <1GB RAM for 100 active pipelines
## Schema Evolution Strategy
Prism enforces schema compatibility levels per topic to prevent breaking changes.
### Compatibility Levels
**BACKWARD** (default): New consumers read old data
- Add optional fields: allowed
- Remove optional fields: allowed
- Add required fields: forbidden (old data lacks the field and there is no default)
- Remove required fields: allowed (new consumers ignore the field in old data)
**FORWARD**: Old consumers read new data
- Add optional fields: allowed
- Remove optional fields: allowed
- Add required fields: allowed (old consumers ignore the new field)
- Remove required fields: forbidden (old consumers still expect the field)
**FULL**: Backward + Forward compatible
- Add optional fields: allowed
- Remove optional fields: allowed
- All other changes: forbidden
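The three levels can be derived from one question: can a consumer holding schema X decode data written with schema Y? The sketch below models that question under simplifying assumptions that are not taken from this RFC: schemas are dicts of field name to `"required"` or `"optional"`, readers ignore unknown fields, optional fields have defaults, and type changes are out of scope. `can_read` and `is_compatible` are hypothetical names, and Prism's actual rules may be stricter.

```python
def can_read(reader, writer):
    """True if a consumer using `reader` can decode data written with `writer`.

    An optional field has a default, so the reader can fill it in when absent.
    """
    return all(
        name in writer for name, kind in reader.items() if kind == "required"
    )


def is_compatible(old, new, level="BACKWARD"):
    """Check a schema change from `old` to `new` against a compatibility level."""
    backward = can_read(reader=new, writer=old)  # new consumers read old data
    forward = can_read(reader=old, writer=new)   # old consumers read new data
    if level == "BACKWARD":
        return backward
    if level == "FORWARD":
        return forward
    if level == "FULL":
        return backward and forward
    raise ValueError(f"unknown compatibility level: {level}")
```

For example, adding an optional `phone_number` field passes at every level, while adding it as a required field fails `BACKWARD` because old data does not contain it.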
### Breaking Change Workflow
When pre-deployment validation detects incompatibility:
1. CI/CD fails with schema diff showing incompatible changes
2. Developer chooses migration strategy:
- **Dual-write**: Write to both `topic` and `topic-v2` during transition
- **Blue-green**: Create `topic-v2`, migrate consumers incrementally
- **Consumer upgrade first**: Upgrade all consumers to support both schemas
3. Execute migration: `prismctl pipeline migrate --strategy=dual-write`
4. Monitor consumer lag and error rates on both topics
5. Validate all consumers migrated successfully
6. Deprecate old topic after configured transition period (default: 7 days)
### Implementation
- Add `schema_compatibility_level` to `PipelineStage` protobuf
- Validate compatibility during `prismctl pipeline validate`
- Block deployments that violate compatibility rules
- Generate migration plan with rollback steps
## Failure Scenarios & Recovery
### Scenario 1: Schema Registry Unavailable
**Impact**: Cannot validate schemas, discovery fails
**Detection**: Health check fails for schema registry client
**Recovery**:
1. Fall back to cached schemas (5-minute TTL)
2. Alert on-call engineer
3. Gracefully degrade: Allow pipeline to proceed with warning
4. Log all unvalidated messages for post-incident validation
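One way to picture the fallback is a small cache that serves fresh entries within the TTL and, when the registry cannot be reached, returns the last known schema flagged as stale. This is a sketch, not Prism's implementation: `SchemaCache` and its `fetch` callable are hypothetical, and serving entries older than the TTL during an outage is an assumption about what "gracefully degrade" means here.

```python
import time


class SchemaCache:
    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch      # callable: subject -> schema, raises on registry failure
        self._ttl = ttl_seconds  # 300 s matches the 5-minute TTL above
        self._clock = clock
        self._entries = {}       # subject -> (schema, fetched_at)

    def get(self, subject):
        """Return (schema, is_fresh). is_fresh is False for a stale fallback."""
        entry = self._entries.get(subject)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0], True
        try:
            schema = self._fetch(subject)
        except Exception:
            if entry is None:
                raise  # nothing cached, so there is nothing to degrade to
            # Registry unavailable: serve the stale copy; the caller should
            # warn and log the message for post-incident validation.
            return entry[0], False
        self._entries[subject] = (schema, now)
        return schema, True
```

Returning the freshness flag alongside the schema lets the caller decide whether to log the message as unvalidated.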
### Scenario 2: Consumer Lag Exceeds Threshold
**Impact**: Downstream stages starved, end-to-end latency increases
**Detection**: Consumer group lag exceeds configured threshold
**Recovery**:
1. `prismctl pipeline diagnose` identifies bottleneck stage
2. Auto-scale consumer instances if enabled
3. Trigger backpressure to slow producers
4. Alert if lag persists beyond threshold
### Scenario 3: Schema Validation Failure Rate Exceeds Threshold
**Impact**: Data quality issues, downstream failures
**Detection**: Validation failure rate metric crosses threshold
**Recovery**:
1. Identify failing stage via dashboard
2. Automatically pause consumer at failing stage
3. Quarantine failed messages to dead-letter queue (DLQ)
4. Root cause analysis links to recent deployments
5. Automated rollback if correlated with recent schema change
### Scenario 4: Pipeline Topology Drift
**Impact**: Discovered topology doesn't match declared configuration
**Detection**: Topology diff between registry and actual cluster state
**Recovery**:
1. Alert indicates drift detected
2. `prismctl pipeline reconcile` shows differences
3. Admin chooses: accept drift (update registry) or fix cluster
4. Audit log tracks all topology changes
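At its core, drift detection is a set difference between declared and discovered edges. A minimal sketch, assuming edges are `(from_stage, to_stage, topic_name)` tuples as in `TopicEdge`; the function name and result keys are hypothetical and do not describe the output of `prismctl pipeline reconcile`.

```python
def topology_drift(declared_edges, discovered_edges):
    """Compare two collections of (from_stage, to_stage, topic_name) tuples."""
    declared, discovered = set(declared_edges), set(discovered_edges)
    return {
        # Declared in the registry but not observed in the cluster.
        "missing_in_cluster": sorted(declared - discovered),
        # Observed in the cluster but never declared.
        "undeclared_in_registry": sorted(discovered - declared),
    }
```

Both lists empty means the registry and the cluster agree; otherwise each entry is a candidate for either "accept drift" or "fix cluster".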
## Multi-Tenancy & Security
### Namespace Isolation
Pipelines scoped to Prism namespaces:
- Prefix all topic names with namespace: `{namespace}.{topic_name}`
- Pipeline CLI respects namespace permissions
- Cross-namespace pipelines require explicit authorization
### Sensitive Data Handling
**PII Detection**: Automatically tag PII fields in schemas
**Data Masking**: Support redaction in replay and sampling
```bash
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --mask-fields="email,phone,ssn" \
  --target=local
```
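What masking does to a single record can be sketched as follows. This is a guess at the behavior for illustration only: `mask_fields` is a hypothetical helper, and the fixed `***` placeholder and recursion into nested objects are assumptions rather than documented `--mask-fields` semantics.

```python
def mask_fields(record, fields, placeholder="***"):
    """Return a copy of record with the named fields redacted at any nesting depth."""
    masked = {}
    for key, value in record.items():
        if key in fields:
            masked[key] = placeholder
        elif isinstance(value, dict):
            # Recurse so PII nested inside sub-objects is redacted too.
            masked[key] = mask_fields(value, fields, placeholder)
        else:
            masked[key] = value
    return masked
```

The input record is left untouched, so the unmasked original never has to leave the replay source.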
### Audit Logging
- Log all pipeline topology changes with who/what/when
- Track message replay operations for forensics
- Integrate with compliance reporting (SOC2, GDPR)
### Encryption
- Schema registry credentials encrypted at rest in Pipeline Registry
- Kafka credentials managed via Vault integration (ADR-028)
- TLS required for all inter-service communication
## Cost & Resource Planning
### Storage Costs
- **Pipeline Registry**: ~10MB per pipeline topology (protobuf compressed)
- **Trace Data**: With 1M msgs/day, 7-day retention (7M messages):
- Uncompressed: ~100 bytes per message = ~700MB
- Compressed (zstd): ~140MB (80% reduction)
- Sampling strategy: 100% validation failures, 1% successes = ~7MB plus failed messages
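The trace-data estimate can be checked with a few lines of arithmetic. The inputs are the figures stated above (1M messages per day, 7-day retention, ~100 bytes of trace data per message, 80% compression, 1% sampling of successes); decimal units (1 MB = 10^6 bytes) are an assumption, and failed messages are left out of the sampled figure because the failure rate is not given.

```python
MESSAGES_PER_DAY = 1_000_000
RETENTION_DAYS = 7
BYTES_PER_MESSAGE = 100        # approximate trace payload per message
COMPRESSION_REDUCTION = 0.80   # zstd, as assumed in the estimate
SUCCESS_SAMPLE_RATE = 0.01     # keep 1% of successful messages

MB = 10**6  # decimal megabyte

messages = MESSAGES_PER_DAY * RETENTION_DAYS
raw_bytes = messages * BYTES_PER_MESSAGE
compressed_bytes = raw_bytes * (1 - COMPRESSION_REDUCTION)
sampled_bytes = raw_bytes * SUCCESS_SAMPLE_RATE

print(f"messages retained: {messages:,}")
print(f"uncompressed:      {raw_bytes / MB:.0f} MB")
print(f"compressed:        {compressed_bytes / MB:.0f} MB")
print(f"1% sample:         {sampled_bytes / MB:.0f} MB")
```

Changing the constants makes it easy to re-run the estimate for higher-volume pipelines or longer retention.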
### Compute Costs
- **Discovery Service**: 1 CPU core, 512MB RAM (continuous scanning)
- **Schema Validation**: ~0.1-1ms CPU per message depending on schema complexity
- **Replay Service**: Scale linearly with replay throughput (provision on-demand)
- **Estimated compute**: $30-50/month for 100 pipelines (t3.medium + Lambda)
### Network Costs
- **Admin API Calls**: Discovery scans = ~1000 calls/5min = 288k calls/day
- Mitigation: Cache topology (5-minute TTL), incremental scanning
- **Cross-AZ Traffic**: Pipeline spanning AZs = $0.01/GB egress
- Mitigation: Same-AZ routing, replica placement awareness
- **Estimated network**: $5-20/month depending on cluster topology
## Adoption Path & Migration
### For New Pipelines
1. Define pipeline in `.prism/pipelines/*.yml`
2. Add `pipeline` section to Producer/Consumer patterns
3. Run `prismctl pipeline validate` in CI/CD
4. Deploy with automatic topology registration
### For Existing Pipelines
1. **Discovery Phase**: Run `prismctl pipeline discover` to map existing topology
2. **Annotation Phase**: Add pipeline metadata to patterns incrementally
3. **Validation Phase**: Enable schema validation (warning mode first)
4. **Enforcement Phase**: Block deployments on validation failures
### Pilot Project Recommendations
- **Start with**: Non-critical pipeline with 3-5 stages, <10k msg/s
- **Avoid**: High-throughput (>100k msg/s) or mission-critical pipelines initially
- **Measure**:
- MTTR for debugging pipeline issues
- Time to onboard new developer to pipeline
- Number of schema-related production incidents
- Developer satisfaction survey
- **Timeline**: 2-week pilot, 2-week retrospective with team feedback
- **Success criteria**: Developer feedback positive, no production incidents from pipeline features
## Risks & Mitigations
### Risk 1: Discovery Performance Impact
**Risk**: Scanning large clusters (5000+ topics) may degrade cluster performance or time out.
**Mitigation:**
- Use Kafka Admin API with rate limiting (10 requests/second)
- Cache topology with configurable TTL (default: 5 minutes)
- Incremental discovery: only scan topics modified since last scan
- Manual topology registration as fallback for large clusters
- Offload discovery to dedicated Kafka cluster replica if available
### Risk 2: Schema Registry Dependency
**Risk**: Tight coupling to specific schema registry implementations (Confluent, AWS Glue).
**Mitigation:**
- Abstract schema registry behind Prism interface
- Support multiple registry providers via plugins
- Allow manual schema definition as fallback
- Maintain local schema cache for offline development
### Risk 3: Message Overhead
**Risk**: Pipeline context adds 100-500 bytes per message (overhead varies with lineage depth).
**Mitigation:**
- Make pipeline context optional (opt-in per pattern)
- Use efficient protobuf encoding
- Compress validation histories when >10 validations
- Support Kafka header-based context (not in message body) to reduce payload size
- Prune old validation entries after configurable stage count (default: 10 stages)
### Risk 4: Adoption Resistance
**Risk**: Teams resist adding pipeline metadata to existing services.
**Mitigation:**
- Demonstrate value with pilot project (quantify MTTR reduction)
- Provide automated migration tooling for existing pipelines
- Support gradual rollout: works with partial pipeline metadata
- Discovery works without manual metadata (infers from Kafka consumer groups)
- Create clear documentation with real-world examples
- Provide optional features: teams enable schema validation when ready
## Alternatives Considered
### Alternative 1: Use Existing Tools (Kafka Streams, ksqlDB)
**Pros:**
- Mature ecosystem with wide adoption
- Built-in topology management
- Strong community support
**Cons:**
- Requires rewriting services in Kafka Streams
- Limited observability for custom microservices
- No schema-aware debugging tooling
- Doesn't integrate with existing Prism patterns
**Decision**: Build on Prism to leverage existing patterns and provide deeper integration.
### Alternative 2: Schema Registry as Single Source of Truth
**Pros:**
- Schema registry already tracks topic schemas
- No additional metadata storage needed
**Cons:**
- Schema registry doesn't track consumer relationships
- Can't represent pipeline topology
- Limited to schema-level metadata
- Doesn't support custom pipeline stages
**Decision**: Use schema registry as data source but maintain pipeline topology separately.
### Alternative 3: Kafka Connector-Based Approach
**Pros:**
- Kafka Connect has built-in topology concepts
- Integrates with existing connector ecosystem
**Cons:**
- Limited to Kafka Connect-based services
- Doesn't support custom microservices
- Poor observability for debugging
- No schema validation tooling
**Decision**: Support Kafka Connect as one type of pipeline stage but don't limit to connectors.
## Open Questions
1. **Multi-Cluster Pipelines**: Should Prism support pipelines spanning multiple Kafka clusters?
- How to correlate topics across clusters (identical topic names vs prefix)?
- Cross-cluster latency monitoring and alerting?
- MirrorMaker2 integration for topic replication?
2. **Pipeline Versioning**: How should pipeline topology versions be managed?
- Git-based versioning with declarative YAML?
- Immutable pipeline versions stored in registry?
- Rollback support: revert to previous topology version?
- Diff between topology versions for change tracking?
3. **Access Control**: How should pipeline-level permissions work?
- Per-stage access control: restrict who can modify each stage?
- Pipeline-wide RBAC: permissions for entire pipeline?
- Integration with existing Prism namespace AuthZ?
- Separate read/write permissions for pipeline metadata vs data access?
## References
- [ADR-031: Message Envelope Protocol](/adr/adr-031)
- [RFC-031: Message Envelope Protocol](/rfc/rfc-031)
- [RFC-037: Mailbox Pattern - Searchable Event Store](/rfc/rfc-037)
- [RFC-046: Consolidated Pattern Protocols](/rfc/rfc-046)
- Confluent Schema Registry Documentation
- Kafka Streams Topology Documentation
- OpenTelemetry Semantic Conventions for Messaging
## Changelog
- **2025-01-07**: Add performance targets, schema evolution strategy, data lineage for ML, partition-aware replay, data quality validation, ML debugging features, cost planning, failure scenarios, multi-tenancy/security, adoption path
- **2025-11-07**: Initial draft