id: rfc-053 title: "RFC-053: Schematized Topic-Based Microservice Pipelines" sidebar_label: "RFC-053: Topic Pipelines" status: Draft author: jrepp created: 2025-11-10 updated: 2025-11-15 project_id: prism doc_uuid: a1b2c3d4-e5f6-4a5b-8c9d-0e1f2a3b4c5d tags:

kafka
messaging
microservices
schema
developer-experience
observability
debugging
patterns
architecture

RFC-053: Schematized Topic-Based Microservice Pipelines

Executive Summary

This RFC proposes support for schematized topic-based microservice pipelines, where data flows through a chain of Kafka topics with strongly-typed schemas at each stage. Prism will provide developer tooling to discover, debug, and optimize these workflows through:

Pipeline Discovery & Visualization - Automatic topology detection from topic schemas and consumer group subscriptions
Schema-Aware Observability - End-to-end tracing with schema validation at each pipeline stage
Developer Productivity Tools - Local replay, diff tooling, and pipeline testing frameworks
Production Debugging - Data lineage tracking, schema evolution management, and error root cause analysis

Problem Statement

Current Challenges with Topic-Based Pipelines

Event-driven microservice architectures rely on pipelines where:

Services process data sequentially through Kafka topics
Each stage reads from input topics, transforms data, writes to output topics
Schemas evolve independently across services
Debugging requires manual correlation across topics, consumer groups, and traces

Pain Points:

Pipeline Visibility
- No automatic discovery of topic relationships
- Unclear which services consume from which topics
- Documentation becomes stale
- Topology difficult to understand
Schema Management
- Schema evolution breaks downstream consumers
- No validation that output matches next stage input
- Schema version compatibility unclear
- Version mismatches cause runtime failures
Debugging
- Errors in stage N+1 caused by bad data from earlier stages
- Manual message correlation across topics
- No tooling to replay message flows
- Production issues require extensive log analysis
Developer Productivity
- Local testing requires full infrastructure
- No pre-deployment validation of downstream impact
- Schema changes require manual team coordination
- Integration testing requires entire pipeline
Operations
- Consumer lag at one stage cascades downstream
- No unified pipeline health view
- Bottleneck identification requires manual analysis
- Inconsistent error handling across stages

Goals

Primary Goals

Pipeline Discovery: Automatically detect and visualize topic-based pipelines from:
- Kafka topic metadata and schema registry
- Consumer group subscriptions and topic assignments
- Producer/consumer configuration in Prism patterns
Schema-Aware Tracing: Provide end-to-end observability with:
- Message tracing across all pipeline stages
- Automatic schema validation at each topic boundary
- Schema evolution impact analysis
- Data lineage tracking from source to sink
Developer Tools: Enable productive local development with:
- Pipeline replay from any stage with historical data
- Schema diff tooling to identify breaking changes
- Local pipeline testing with mock topics
- Contract testing between pipeline stages
Production Debugging: Simplify troubleshooting with:
- Root cause analysis for data quality issues
- Message flow visualization for specific correlation IDs
- Schema mismatch detection and alerting
- Pipeline health dashboards with stage-specific metrics

Non-Goals

Building a general-purpose workflow orchestration engine (use Airflow/Temporal instead)
Supporting non-Kafka messaging systems in initial version
Replacing existing schema registry implementations
Providing data transformation logic (Prism is infrastructure, not business logic)
Real-time stream processing with complex joins (use Kafka Streams/Flink)
Batch job orchestration (use Airflow for batch pipelines)
Schema migration tooling (use schema registry features)
Long-term data archival (use dedicated data lake solutions)

Proposed Solution

Architecture Overview

Prism will introduce a new Pipeline Pattern that extends existing Producer/Consumer patterns with:

┌─────────────────────────────────────────────────────────────────┐
│                    Prism Pipeline Registry                       │
│  - Topology Discovery                                            │
│  - Schema Evolution Tracking                                     │
│  - Pipeline Metadata Store                                       │
└─────────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
┌───────────▼──────┐  ┌──────▼────────┐  ┌────▼──────────┐
│  Producer        │  │  Transform    │  │  Consumer     │
│  Pattern         │  │  Pattern      │  │  Pattern      │
│  +Pipeline Ext   │  │  +Pipeline    │  │  +Pipeline    │
└──────────────────┘  └───────────────┘  └───────────────┘
         │                    │                  │
         │                    │                  │
    ┌────▼─────┐         ┌───▼────┐        ┌───▼────┐
    │ Topic A  │────────▶│Topic B │───────▶│Topic C │
    │ Schema 1 │         │Schema 2│        │Schema 3│
    └──────────┘         └────────┘        └────────┘

Key Components

1. Pipeline Discovery Service

Responsibilities:

Scan Kafka cluster for topics and consumer groups
Query schema registry for topic schemas
Correlate producers/consumers using Prism pattern metadata
Build directed acyclic graph (DAG) of topic relationships
Detect cycles and orphaned topics

Implementation:

message PipelineTopology {
  string pipeline_id = 1;
  repeated PipelineStage stages = 2;
  repeated TopicEdge edges = 3;
  map<string, SchemaVersion> schemas = 4;
  google.protobuf.Timestamp discovered_at = 5;
}

message PipelineStage {
  string stage_id = 1;
  string service_name = 2;
  repeated string input_topics = 3;
  repeated string output_topics = 4;
  string consumer_group = 5;
  PipelineStageMetadata metadata = 6;
}

message TopicEdge {
  string from_stage = 1;
  string to_stage = 2;
  string topic_name = 3;
  SchemaCompatibility compatibility = 4;
}

Discovery Algorithm:

Enumerate topics with schema registry entries
Identify subscribed topics per consumer group
Link consumers to output topics via Prism pattern metadata
Build directed graph: services as nodes, topics as edges
Validate schema compatibility at each edge
Detect cycles and orphaned topics
Store topology in Pipeline Registry with timestamp

2. Schema-Aware Tracing

Extend existing message envelope with pipeline context:

message PipelineContext {
  string pipeline_id = 1;
  string stage_id = 2;
  int32 stage_sequence = 3;  // Position in pipeline (0-based)
  string correlation_id = 4;  // End-to-end trace ID
  repeated SchemaValidation validations = 5;
  map<string, string> stage_metadata = 6;
  DataLineage lineage = 7;
}

message SchemaValidation {
  string stage_id = 1;
  string schema_name = 2;
  int32 schema_version = 3;
  bool validation_passed = 4;
  repeated string errors = 5;
  google.protobuf.Timestamp validated_at = 6;
}

message DataLineage {
  repeated DataSource sources = 1;
  repeated Transformation transformations = 2;
  repeated ModelVersion models = 3;
  DataQualityMetrics quality = 4;
  google.protobuf.Timestamp expires_at = 5;
  repeated string pii_tags = 6;
}

message DataSource {
  string source_id = 1;
  string source_type = 2;
  string source_location = 3;
  google.protobuf.Timestamp ingested_at = 4;
  map<string, string> metadata = 5;
}

message Transformation {
  string transform_id = 1;
  string transform_type = 2;
  string code_version = 3;
  map<string, double> metrics = 4;
}

message ModelVersion {
  string model_id = 1;
  string model_version = 2;
  string framework = 3;
  map<string, double> inference_metrics = 4;
}

message DataQualityMetrics {
  int64 records_processed = 1;
  int64 records_failed = 2;
  map<string, double> quality_scores = 3;
  repeated string anomalies = 4;
}

Tracing Flow:

Producer creates message with initial pipeline_context
Each stage validates input schema before processing
Validation results appended to pipeline_context.validations
Transformation metrics recorded in pipeline_context.lineage
OpenTelemetry spans link stages via correlation_id
Prism proxy logs validation failures with stage context
Failed messages routed to DLQ with full lineage

Benefits:

End-to-end message flow visibility
Automatic schema mismatch detection
Debugging context for production issues
Data lineage for compliance (GDPR, SOC2)

3. Developer Tooling

A. Pipeline CLI (prismctl pipeline)

# Discover and visualize pipelines
prismctl pipeline discover --cluster=prod

# Show topology for specific pipeline
prismctl pipeline show user-signup-flow --format=dot
prismctl pipeline show user-signup-flow --format=mermaid

# Validate schema compatibility
prismctl pipeline validate user-signup-flow

# Replay messages for testing with partition awareness
prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --from-stage=email-validator \
  --correlation-id=abc123 \
  --partition=2 \
  --offset-range=1000-2000 \
  --preserve-order=true \
  --rate-limit=1000 \
  --target=local

# Replay with sampling for large datasets
prismctl pipeline replay \
  --pipeline=ml-feature-pipeline \
  --from-stage=feature-transformer \
  --time-range="2025-11-06T00:00:00Z/2025-11-07T00:00:00Z" \
  --sample-rate=0.1 \
  --target=local

# Diff schemas between stages
prismctl pipeline schema-diff \
  --pipeline=user-signup-flow \
  --from-stage=api-gateway \
  --to-stage=email-validator

# Sample messages for debugging
prismctl pipeline sample \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --filter="$.features.click_rate > 0.8" \
  --count=100 \
  --output=debug_samples.jsonl

# Compare outputs between versions
prismctl pipeline diff \
  --pipeline=ml-feature-pipeline \
  --stage=feature-transformer \
  --baseline-version=v1.2.0 \
  --comparison-version=v1.3.0 \
  --correlation-ids=sample.txt \
  --show-distribution-stats

B. Local Development Environment

# .prism/pipelines/user-signup.yml
pipeline:
  id: user-signup-flow
  stages:
    - id: api-gateway
      service: api-service
      output_topics:
        - name: raw-signups
          schema: SignupRequest

    - id: email-validator
      service: validator-service
      input_topics:
        - name: raw-signups
          schema: SignupRequest
      output_topics:
        - name: validated-signups
          schema: ValidatedSignup

    - id: user-creator
      service: user-service
      input_topics:
        - name: validated-signups
          schema: ValidatedSignup
      output_topics:
        - name: created-users
          schema: User

# Start local pipeline with mocked topics
prismctl pipeline dev user-signup-flow

C. Contract Testing Framework

# tests/test_pipeline_contracts.py
from prism_data.testing import PipelineContractTest

class TestUserSignupPipeline(PipelineContractTest):
    pipeline_id = "user-signup-flow"

    def test_email_validator_input_contract(self):
        """Ensure email-validator can consume from api-gateway output."""
        self.assert_schema_compatible(
            producer_stage="api-gateway",
            consumer_stage="email-validator",
            topic="raw-signups"
        )

    def test_end_to_end_message_flow(self):
        """Test complete pipeline with sample message."""
        result = self.send_message(
            stage="api-gateway",
            message={"email": "test@example.com", "name": "Test User"}
        )

        self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="user-creator",
            timeout_seconds=10
        )

        self.assert_schema_validations_passed(
            correlation_id=result.correlation_id
        )

class TestMLFeaturePipeline(PipelineContractTest):
    pipeline_id = "ml-feature-pipeline"

    def test_feature_transformer_data_quality(self):
        """Ensure feature transformer output meets quality thresholds."""
        result = self.send_message(
            stage="raw-data-ingestion",
            message={"user_id": 123, "clicks": 5, "timestamp": "2025-11-07"}
        )

        output = self.assert_message_reaches_stage(
            correlation_id=result.correlation_id,
            stage="feature-transformer",
            timeout_seconds=10
        )

        self.assert_data_quality(
            output,
            rules={
                "completeness": {"threshold": 0.95},
                "range_checks": {
                    "click_rate": {"min": 0.0, "max": 1.0},
                    "feature_count": {"min": 10, "max": 100}
                },
                "no_nulls": ["user_id", "feature_vector"],
                "distribution": {
                    "feature_vector": {"mean_range": [0, 10], "std_range": [0, 5]}
                }
            }
        )

    def test_schema_drift_detection(self):
        """Alert on schema drift in feature pipeline."""
        self.assert_no_schema_drift(
            stage="feature-transformer",
            baseline_time_range="2025-10-01/2025-10-31",
            comparison_time_range="2025-11-01/2025-11-07",
            max_drift_score=0.1
        )

4. Production Debugging

A. Pipeline Health Dashboard

Grafana dashboard with:

Pipeline topology visualization (auto-generated from discovery)
Per-stage metrics: throughput, latency, error rate
Consumer lag across all stages
Schema validation failure rate
Correlation ID search and trace viewer

B. Root Cause Analysis

When errors occur at stage N, Prism automatically:

Identifies messages that failed validation
Traces correlation ID back through pipeline
Highlights stage where data corruption occurred
Shows schema diff between expected and actual
Links to relevant service logs and traces

Example Scenario:

Stage 3 (user-creator) failing with schema validation errors:

  ┌─────────────────────────────────────────────────────────┐
  │ Root Cause: email-validator (Stage 2)                   │
  │                                                          │
  │ Expected output schema: ValidatedSignup v2               │
  │ Actual output schema: ValidatedSignup v1                 │
  │                                                          │
  │ Missing field: phone_number (required in v2)            │
  │                                                          │
  │ Action: Upgrade email-validator to produce v2 schema    │
  │ Or: Downgrade user-creator to accept v1 schema          │
  └─────────────────────────────────────────────────────────┘

C. ML-Specific Debugging

Feature Drift Detection:

Statistical tests (Kolmogorov-Smirnov, chi-squared) on feature distributions
Alert when distributions deviate beyond configurable threshold
Time-series visualization of feature statistics

Model Performance Tracking:

Track inference latency, throughput, prediction distributions per stage
Correlate model version with performance metrics
Automatic rollback trigger on regression beyond threshold
Record model metadata in pipeline context lineage

Data Sampling for Debug:

JSONPath filters for targeted message sampling
Stratified sampling for representative debugging datasets
Export to JSONL for offline analysis

Integration with Existing Patterns

Backward Compatibility:

Existing Producer/Consumer patterns work unchanged
Pipeline features opt-in via pattern configuration
No breaking changes to protobuf message definitions

Pattern Extensions:

# Producer pattern with pipeline support
pattern: producer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: api-gateway
  output_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: latest

# Consumer pattern with pipeline support
pattern: consumer
pipeline:
  enabled: true
  pipeline_id: user-signup-flow
  stage_id: email-validator
  input_topics:
    - name: raw-signups
      schema:
        registry: confluent
        subject: raw-signups-value
        version: 2
  output_topics:
    - name: validated-signups
      schema:
        registry: confluent
        subject: validated-signups-value
        version: 1

Implementation Plan

Phase 1: Foundation (Weeks 1-4)

Deliverables:

Pipeline topology protobuf definitions
Pipeline Registry service (basic CRUD)
- Dependency: Choose storage backend (SQLite vs Postgres - ADR needed)
- Risk: Schema evolution for registry itself
Discovery service for Kafka cluster scanning
- Dependency: Kafka Admin API rate limits - test at scale first
- Risk: Large clusters (>5000 topics) may timeout
Schema compatibility validator
- Dependency: Integration with Confluent Schema Registry AND AWS Glue
- Risk: Different registries have incompatible APIs
prismctl pipeline discover command
Load test discovery with synthetic topics

Success Criteria:

Can scan production Kafka cluster with throttling
Generate pipeline topology graph in <30s for 1000 topics
Detect schema incompatibilities with 100% accuracy vs manual verification
Load test passes with 10,000 synthetic topics

Phase 2: Developer Tooling (Weeks 5-8)

Deliverables:

Pipeline CLI commands (show, validate, schema-diff, replay, sample)
Local development environment support
Pipeline visualization (DOT/Mermaid export)
Contract testing framework with data quality validation
Partition-aware replay with sampling support

Success Criteria:

Developers can visualize and replay pipelines locally
Schema diffs identify breaking changes before deployment
Contract tests validate schema compatibility and data quality
Replay supports partition targeting and rate limiting

Phase 3: Tracing & Observability (Weeks 9-12)

Deliverables:

Pipeline context in message envelope with data lineage
Schema validation at each stage
OpenTelemetry integration with pipeline spans
Grafana dashboard templates with topology visualization
Correlation ID search with trace aggregation
ML-specific metrics (feature drift, model performance)

Success Criteria:

End-to-end trace spans link all pipeline stages
Schema validation failures logged and alerted with root cause
Dashboards show pipeline health, consumer lag, validation rates
Data lineage tracked for ML compliance

Phase 4: Production Debugging (Weeks 13-16)

Deliverables:

Root cause analysis with automatic stage identification
Message replay with partition and offset control
Schema evolution impact reports with migration plans
Pipeline health alerting rules (lag, validation failures, drift)
Feature drift detection for ML pipelines
Dead-letter queue integration

Success Criteria:

Reduce MTTR for pipeline issues
Identify root cause stage for schema errors automatically
Support partition-aware replay with sampling
Detect and alert on feature drift for ML workloads

Success Metrics

Developer Productivity

Onboarding Time: Time for new developers to understand pipeline topology
Debug Time: Time to identify root cause of pipeline failures
Test Coverage: Percentage of pipeline stages covered by contract tests

Operational Excellence

MTTR: Mean time to resolution for pipeline issues
Schema Errors: Production schema validation failures caught pre-deployment
Pipeline Visibility: Production pipelines automatically discovered and visualized

Performance & Scale

Discovery Throughput: Support clusters with 10,000+ topics without degradation
Discovery Latency: Complete topology scan in <30s for 1,000 topics
Trace Overhead: Pipeline context adds <5% message size overhead
Replay Performance: Support replay at 10,000 msg/s for debugging
Memory Footprint: Pipeline registry uses <1GB RAM for 100 active pipelines

Schema Evolution Strategy

Prism enforces schema compatibility levels per topic to prevent breaking changes.

Compatibility Levels

BACKWARD (default): New consumers read old data

Add optional fields: allowed
Remove optional fields: allowed
Add required fields: forbidden
Remove required fields: forbidden

FORWARD: Old consumers read new data

Add optional fields: allowed
Remove optional fields: allowed
Add required fields: forbidden
Remove required fields: forbidden

FULL: Backward + Forward compatible

Add optional fields: allowed
All other changes: forbidden

Breaking Change Workflow

When pre-deployment validation detects incompatibility:

CI/CD fails with schema diff showing incompatible changes
Developer chooses migration strategy:
- Dual-write: Write to both topic and topic-v2 during transition
- Blue-green: Create topic-v2, migrate consumers incrementally
- Consumer upgrade first: Upgrade all consumers to support both schemas
Execute migration: prismctl pipeline migrate --strategy=dual-write
Monitor consumer lag and error rates on both topics
Validate all consumers migrated successfully
Deprecate old topic after configured transition period (default: 7 days)

Implementation

Add schema_compatibility_level to PipelineStage protobuf
Validate compatibility during prismctl pipeline validate
Block deployments that violate compatibility rules
Generate migration plan with rollback steps

Failure Scenarios & Recovery

Scenario 1: Schema Registry Unavailable

Impact: Cannot validate schemas, discovery fails

Detection: Health check fails for schema registry client

Recovery:

Fall back to cached schemas (5-minute TTL)
Alert on-call engineer
Gracefully degrade: Allow pipeline to proceed with warning
Log all unvalidated messages for post-incident validation

Scenario 2: Consumer Lag Exceeds Threshold

Impact: Downstream stages starved, end-to-end latency increases

Detection: Consumer group lag exceeds configured threshold

Recovery:

prismctl pipeline diagnose identifies bottleneck stage
Auto-scale consumer instances if enabled
Trigger backpressure to slow producers
Alert if lag persists beyond threshold

Scenario 3: Schema Validation Failure Rate Exceeds Threshold

Impact: Data quality issues, downstream failures

Detection: Validation failure rate metric crosses threshold

Recovery:

Identify failing stage via dashboard
Automatically pause consumer at failing stage
Quarantine failed messages to dead-letter queue (DLQ)
Root cause analysis links to recent deployments
Automated rollback if correlated with recent schema change

Scenario 4: Pipeline Topology Drift

Impact: Discovered topology doesn't match declared configuration

Detection: Topology diff between registry and actual cluster state

Recovery:

Alert indicates drift detected
prismctl pipeline reconcile shows differences
Admin chooses: accept drift (update registry) or fix cluster
Audit log tracks all topology changes

Multi-Tenancy & Security

Namespace Isolation

Pipelines scoped to Prism namespaces:

Prefix all topic names with namespace: {namespace}.{topic_name}
Pipeline CLI respects namespace permissions
Cross-namespace pipelines require explicit authorization

Sensitive Data Handling

PII Detection: Automatically tag PII fields in schemas

Data Masking: Support redaction in replay and sampling

prismctl pipeline replay \
  --pipeline=user-signup-flow \
  --mask-fields="email,phone,ssn" \
  --target=local

Audit Logging

Log all pipeline topology changes with who/what/when
Track message replay operations for forensics
Integrate with compliance reporting (SOC2, GDPR)

Encryption

Schema registry credentials encrypted at rest in Pipeline Registry
Kafka credentials managed via Vault integration (ADR-028)
TLS required for all inter-service communication

Cost & Resource Planning

Storage Costs

Pipeline Registry: ~10MB per pipeline topology (protobuf compressed)
Trace Data: With 1M msgs/day, 7-day retention:
- Uncompressed: ~100 bytes per message = 700GB
- Compressed (zstd): ~20GB (80% reduction)
- Sampling strategy: 100% validation failures, 1% successes = ~7GB

Compute Costs

Discovery Service: 1 CPU core, 512MB RAM (continuous scanning)
Schema Validation: ~0.1-1ms CPU per message depending on schema complexity
Replay Service: Scale linearly with replay throughput (provision on-demand)
Estimated compute: $30-50/month for 100 pipelines (t3.medium + Lambda)

Network Costs

Admin API Calls: Discovery scans = ~1000 calls/5min = 288k calls/day
- Mitigation: Cache topology (5-minute TTL), incremental scanning
Cross-AZ Traffic: Pipeline spanning AZs = $0.01/GB egress
- Mitigation: Same-AZ routing, replica placement awareness
Estimated network: $5-20/month depending on cluster topology

Adoption Path & Migration

For New Pipelines

Define pipeline in .prism/pipelines/*.yml
Add pipeline section to Producer/Consumer patterns
Run prismctl pipeline validate in CI/CD
Deploy with automatic topology registration

For Existing Pipelines

Discovery Phase: Run prismctl pipeline discover to map existing topology
Annotation Phase: Add pipeline metadata to patterns incrementally
Validation Phase: Enable schema validation (warning mode first)
Enforcement Phase: Block deployments on validation failures

Pilot Project Recommendations

Start with: Non-critical pipeline with 3-5 stages, <10k msg/s
Avoid: High-throughput (>100k msg/s) or mission-critical pipelines initially
Measure:
- MTTR for debugging pipeline issues
- Time to onboard new developer to pipeline
- Number of schema-related production incidents
- Developer satisfaction survey
Timeline: 2-week pilot, 2-week retrospective with team feedback
Success criteria: Developer feedback positive, no production incidents from pipeline features

Risks & Mitigations

Risk 1: Discovery Performance Impact

Risk: Scanning large clusters (5000+ topics) may impact cluster performance or timeout.

Mitigation:

Use Kafka Admin API with rate limiting (10 requests/second)
Cache topology with configurable TTL (default: 5 minutes)
Incremental discovery: only scan topics modified since last scan
Manual topology registration as fallback for large clusters
Offload discovery to dedicated Kafka cluster replica if available

Risk 2: Schema Registry Dependency

Risk: Tight coupling to specific schema registry implementations (Confluent, AWS Glue).

Mitigation:

Abstract schema registry behind Prism interface
Support multiple registry providers via plugins
Allow manual schema definition as fallback
Maintain local schema cache for offline development

Risk 3: Message Overhead

Risk: Pipeline context adds 100-500 bytes per message (overhead varies with lineage depth).

Mitigation:

Make pipeline context optional (opt-in per pattern)
Use efficient protobuf encoding
Compress validation histories when >10 validations
Support Kafka header-based context (not in message body) to reduce payload size
Prune old validation entries after configurable stage count (default: 10 stages)

Risk 4: Adoption Resistance

Risk: Teams resist adding pipeline metadata to existing services.

Mitigation:

Demonstrate value with pilot project (quantify MTTR reduction)
Provide automated migration tooling for existing pipelines
Support gradual rollout: works with partial pipeline metadata
Discovery works without manual metadata (infers from Kafka consumer groups)
Create clear documentation with real-world examples
Provide optional features: teams enable schema validation when ready

Alternatives Considered

Alternative 1: Use Existing Tools (Kafka Streams, ksqlDB)

Pros:

Mature ecosystem with wide adoption
Built-in topology management
Strong community support

Cons:

Requires rewriting services in Kafka Streams
Limited observability for custom microservices
No schema-aware debugging tooling
Doesn't integrate with existing Prism patterns

Decision: Build on Prism to leverage existing patterns and provide deeper integration.

Alternative 2: Schema Registry as Single Source of Truth

Pros:

Schema registry already tracks topic schemas
No additional metadata storage needed

Cons:

Schema registry doesn't track consumer relationships
Can't represent pipeline topology
Limited to schema-level metadata
Doesn't support custom pipeline stages

Decision: Use schema registry as data source but maintain pipeline topology separately.

Alternative 3: Kafka Connector-Based Approach

Pros:

Kafka Connect has built-in topology concepts
Integrates with existing connector ecosystem

Cons:

Limited to Kafka Connect-based services
Doesn't support custom microservices
Poor observability for debugging
No schema validation tooling

Decision: Support Kafka Connect as one type of pipeline stage but don't limit to connectors.

Open Questions

Multi-Cluster Pipelines: Should Prism support pipelines spanning multiple Kafka clusters?
- How to correlate topics across clusters (identical topic names vs prefix)?
- Cross-cluster latency monitoring and alerting?
- MirrorMaker2 integration for topic replication?
Pipeline Versioning: How should pipeline topology versions be managed?
- Git-based versioning with declarative YAML?
- Immutable pipeline versions stored in registry?
- Rollback support: revert to previous topology version?
- Diff between topology versions for change tracking?
Access Control: How should pipeline-level permissions work?
- Per-stage access control: restrict who can modify each stage?
- Pipeline-wide RBAC: permissions for entire pipeline?
- Integration with existing Prism namespace AuthZ?
- Separate read/write permissions for pipeline metadata vs data access?

References

ADR-031: Message Envelope Protocol
RFC-031: Message Envelope Protocol
RFC-037: Mailbox Pattern - Searchable Event Store
RFC-046: Consolidated Pattern Protocols
Confluent Schema Registry Documentation
Kafka Streams Topology Documentation
OpenTelemetry Semantic Conventions for Messaging

Changelog

2025-01-07: Add performance targets, schema evolution strategy, data lineage for ML, partition-aware replay, data quality validation, ML debugging features, cost planning, failure scenarios, multi-tenancy/security, adoption path
2025-11-07: Initial draft

RFC-053: Schematized Topic-Based Microservice Pipelines

RFC-053: Schematized Topic-Based Microservice Pipelines

Executive Summary

Problem Statement

Current Challenges with Topic-Based Pipelines

Goals

Primary Goals

Non-Goals

Proposed Solution

Architecture Overview

Key Components

1. Pipeline Discovery Service

2. Schema-Aware Tracing

3. Developer Tooling

4. Production Debugging

Integration with Existing Patterns

Implementation Plan

Phase 1: Foundation (Weeks 1-4)

Phase 2: Developer Tooling (Weeks 5-8)

Phase 3: Tracing & Observability (Weeks 9-12)

Phase 4: Production Debugging (Weeks 13-16)

Success Metrics

Developer Productivity

Operational Excellence

Performance & Scale

Schema Evolution Strategy

Compatibility Levels

Breaking Change Workflow

Implementation

Failure Scenarios & Recovery

Scenario 1: Schema Registry Unavailable

Scenario 2: Consumer Lag Exceeds Threshold

Scenario 3: Schema Validation Failure Rate Exceeds Threshold

Scenario 4: Pipeline Topology Drift

Multi-Tenancy & Security

Namespace Isolation

Sensitive Data Handling

Audit Logging

Encryption

Cost & Resource Planning

Storage Costs

Compute Costs

Network Costs

Adoption Path & Migration

For New Pipelines

For Existing Pipelines

Pilot Project Recommendations

Risks & Mitigations

Risk 1: Discovery Performance Impact

Risk 2: Schema Registry Dependency

Risk 3: Message Overhead

Risk 4: Adoption Resistance

Alternatives Considered

Alternative 1: Use Existing Tools (Kafka Streams, ksqlDB)

Alternative 2: Schema Registry as Single Source of Truth

Alternative 3: Kafka Connector-Based Approach

Open Questions

References

Changelog

Related Documents

Design Document: BharatSeva AI

OpenClaw Enterprise Transformation Plan

Qwen Image and Edit: Open-sourcing and Local GGUF Generations with Lightning

Qwen3-TTS — Model Reference