From Monolith State to Stateless Microservices: A Real-World Refactor

Eslam Genedy April 21, 2026

---
title: From Monolith State to Stateless Microservices: A Real-World Refactor
published: true
description: ...
tags: architecture, microservices, aws, distributedsystems
cover_image: https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vj8ztr09i9x5p9zly9vd.jpg
---

Two years ago our team inherited **Atlas** — a B2B SaaS monolith with ~680,000 lines of Rails code, a decade of accumulated complexity, and a deployment pipeline that caused anxiety every Friday afternoon.

The symptoms were familiar:

- Single PostgreSQL instance handling transactional **and** reporting load
- In-process custom thread pool for background jobs
- Singleton service objects maintaining application-level state
- Session data stored in application memory behind a sticky-session load balancer
- Deployments caused ~4 minutes of session loss for active users

Our goal wasn't a big-bang rewrite. It was a **disciplined, iterative extraction** — shipping features throughout the process while incrementally achieving statelessness.

---

## Phase 1 — Audit: Finding State Where It Shouldn't Be

Before touching code, we needed a full inventory of where state lived. This sounds simple. It isn't.

### Identifying Stateful Singletons

The most dangerous pattern we found: service objects instantiated once at boot and shared across all requests.

```ruby
# config/initializers/rate_limiter.rb
RateLimiter = RateLimiterService.new(
  store: {},   # mutable hash in process memory
  window: 60,
  limit: 100
)
```

`RateLimiterService` held a mutable `store` hash accumulating per-user request counts. Under a single process: fine. Under autoscaling with multiple pods: each pod maintained independent counters, so a client whose requests were spread across N pods could make up to N × 100 requests per window. Rate limiting was effectively broken.

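To make the per-pod counter problem concrete, here is a minimal Python sketch (not Atlas code). `FixedWindowLimiter` and `admitted` are hypothetical names, the "pods" are plain objects in one process, the time window is ignored, and a shared `defaultdict` stands in for an external store such as Redis.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy limiter: admits `limit` requests per user against a given store."""

    def __init__(self, store, limit):
        self.store = store  # dict-like mapping user_id -> request count
        self.limit = limit

    def allow(self, user_id):
        self.store[user_id] += 1
        return self.store[user_id] <= self.limit


def admitted(limiters, user_id, requests):
    """Round-robin `requests` calls across `limiters`; count the admitted ones."""
    return sum(
        limiters[i % len(limiters)].allow(user_id) for i in range(requests)
    )


LIMIT, PODS, REQUESTS = 100, 4, 1000

# Each simulated pod owns its own hash, like the process-local `store: {}`
per_pod = [FixedWindowLimiter(defaultdict(int), LIMIT) for _ in range(PODS)]

# All simulated pods share one store (a stand-in for Redis in this sketch)
shared_store = defaultdict(int)
shared = [FixedWindowLimiter(shared_store, LIMIT) for _ in range(PODS)]

per_pod_admitted = admitted(per_pod, "user-1", REQUESTS)
shared_admitted = admitted(shared, "user-1", REQUESTS)
print(per_pod_admitted, shared_admitted)  # prints: 400 100
```

With four independent counters the same user gets 400 requests through a limit of 100; with one shared counter the limit holds.
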
We audited the entire codebase and found **23 such singletons** across four categories:

| Category | Count | Problem |
|---|---|---|
| Rate limiters / throttles | 6 | Per-process counters — ineffective at scale |
| Feature flag caches | 4 | Stale state — flags wouldn't propagate across pods |
| Third-party client pools | 8 | Connection state not shared, over-connecting |
| Tenant config aggregators | 5 | Mutations in one request leaked into others |

The fix for every category followed the same principle: **move state out of process memory into a shared external store**.

```ruby
# Before — process-local hash
RateLimiter = RateLimiterService.new(store: {})

# After — Redis-backed, shared across all pods
# (an explicit client is used here because Redis.current was deprecated in
#  redis-rb 4.6 and removed in 5.0; REDIS_URL is an assumed env var)
RateLimiter = RateLimiterService.new(
  store: RedisStore.new(
    client: Redis.new(url: ENV["REDIS_URL"]),
    namespace: "rate_limiter"
  )
)
```

---

### The Sticky Session Problem

Atlas used Rails' default `:cookie_store` — but with a twist. A developer years earlier needed to store non-serializable ActiveRecord objects in the session. Rather than fix serialization, they mirrored part of the session server-side in a process-level hash, then configured the load balancer with **IP-based sticky sessions** to make sure users always hit the same pod.

The consequences:

- Horizontal scaling didn't distribute load evenly
- Pod restarts caused immediate session loss
- Rolling deployments silently logged out active users

**Fix part 1** — Migrate session storage to Redis:

```ruby
# Gemfile
gem 'redis-session-store'

# config/initializers/session_store.rb
Rails.application.config.session_store :redis_session_store,
  key: '_atlas_session',
  redis: {
    expire_after: 2.hours,
    key_prefix: 'atlas:session:',
    # Explicit client: Redis.current was deprecated in redis-rb 4.6 and
    # removed in 5.0 (REDIS_URL is an assumed env var)
    client: Redis.new(url: ENV['REDIS_URL'])
  }
```

**Fix part 2** — Eliminate non-serializable objects from session entirely:

```ruby
# Before — storing the full ActiveRecord object
session[:current_user] = current_user

# After — store only the primitive ID, hydrate on each request
session[:current_user_id] = current_user.id

# In ApplicationController
def current_user
  @current_user ||= User.find_by(id: session[:current_user_id])
end
```

After this change we removed sticky sessions from the load balancer entirely. Session data became a first-class shared resource, not a pod-local artifact. **We saw an immediate 18% reduction in p99 login latency** by eliminating database joins that had been avoided via session caching with stale data.

---

## Phase 2 — Extraction: Breaking State Ownership

With state inventoried and moved to external stores, we began service extraction using a **strangler fig** pattern — new services intercepted traffic slice by slice while the monolith continued running.

### Choosing What to Extract First

We scored candidates across three dimensions:

```plaintext
Score = (Domain Clarity × Async Viability) / Shared State Coupling
```

High score = good first candidate. Notifications won. Notifications had:

- Clear bounded domain
- No need for synchronous response
- High enough volume to justify the infrastructure
- A clean, definable event contract

The original monolith code was fully synchronous and blocking:

```ruby
def complete_order(order)
  order.finalize!
  NotificationMailer.order_confirmation(order).deliver_now   # blocks thread
  WebhookService.deliver(:order_completed, order)            # blocks thread
  ActivityFeed.append(order.user, :order_completed, order)   # blocks thread
end
```

If the email provider was slow, order completion was slow. If a webhook endpoint timed out, the user waited. Classic **incidental coupling through synchrony**.

---

## Phase 3 — SQS for Async State Updates

We replaced the synchronous calls with event publishing to SQS. The monolith's job changed: finalize the order, emit a fact, return.

### The Event Bus Abstraction

We built a thin `EventBus` wrapper around the AWS SDK to keep the calling code clean:

```ruby
# lib/event_bus.rb
require "json"
require "time"        # Time#iso8601
require "aws-sdk-sqs"

class EventBus
  QUEUE_URLS = {
    "order.completed"      => ENV["SQS_ORDER_EVENTS_URL"],
    "user.registered"      => ENV["SQS_USER_EVENTS_URL"],
    "subscription.changed" => ENV["SQS_BILLING_EVENTS_URL"]
  }.freeze

  def self.publish(event_type, payload)
    url = QUEUE_URLS.fetch(event_type) do
      raise ArgumentError, "Unknown event type: #{event_type}"
    end

    sqs_client.send_message(
      queue_url: url,
      message_body: JSON.generate({
        event: event_type,
        payload: payload,
        timestamp: Time.now.utc.iso8601,
        version: "1.0"
      }),
      message_attributes: {
        "EventType" => {
          string_value: event_type,
          data_type: "String"
        }
      }
    )
  end

  def self.sqs_client
    @sqs_client ||= Aws::SQS::Client.new(region: ENV["AWS_REGION"])
  end
end
```

The refactored order completion:

```ruby
def complete_order(order)
  order.finalize!

  EventBus.publish("order.completed", {
    order_id: order.id,
    user_id: order.user_id,
    user_email: order.user.email,
    total: order.total.to_s,
    currency: order.currency,
    line_items: order.line_items.map { |li| { sku: li.sku, qty: li.qty } }
  })
end
```

The controller returns immediately. No email. No webhook. No feed update. All of those happen asynchronously, decoupled from the user's request thread.

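Because the envelope (`event`, `payload`, `timestamp`, `version`) is the contract between the monolith and every consumer, it can help to validate it once at the consumer boundary. The following is a minimal sketch, not the notification service's actual code: `Envelope` and `parse_envelope` are hypothetical names, and the only assumption about the producer is the JSON shape emitted by `EventBus.publish` above.

```python
import json
from dataclasses import dataclass
from datetime import datetime

REQUIRED_FIELDS = ("event", "payload", "timestamp", "version")


@dataclass(frozen=True)
class Envelope:
    event: str
    payload: dict
    timestamp: datetime
    version: str


def parse_envelope(raw_body: str) -> Envelope:
    """Parse an SQS message body and reject anything off-contract."""
    body = json.loads(raw_body)
    missing = [field for field in REQUIRED_FIELDS if field not in body]
    if missing:
        raise ValueError(f"envelope missing fields: {missing}")
    if not isinstance(body["payload"], dict):
        raise ValueError("payload must be a JSON object")
    # Ruby's Time#iso8601 ends UTC timestamps with "Z"; normalise it so
    # datetime.fromisoformat accepts it on older Python versions too
    timestamp = datetime.fromisoformat(body["timestamp"].replace("Z", "+00:00"))
    return Envelope(body["event"], body["payload"], timestamp, body["version"])


raw = json.dumps({
    "event": "order.completed",
    "payload": {"order_id": 42, "total": "49.00", "currency": "USD"},
    "timestamp": "2026-04-21T10:00:00Z",
    "version": "1.0",
})
envelope = parse_envelope(raw)
```

A malformed body raises `ValueError` (or `json.JSONDecodeError`, which is a subclass of it), so a consumer can treat both the same way: leave the message undeleted and let the redrive policy deal with it.
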

---

### The Notification Service Consumer

The extracted notification service is a standalone Python service (we used Python for new services — the team had strong Python ML tooling we wanted to leverage long-term). It polls SQS and processes events:

```python
# notification_service/consumer.py
import json
import logging

import boto3

from handlers import order_handlers, user_handlers

logger = logging.getLogger(__name__)

HANDLER_MAP = {
    "order.completed": order_handlers.handle_order_completed,
    "user.registered": user_handlers.handle_user_registered,
    "subscription.changed": user_handlers.handle_subscription_changed,
}


def poll(queue_url: str, sqs_client):
    while True:
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling — reduces empty receives
            MessageAttributeNames=["EventType"]
        )
        for message in response.get("Messages", []):
            process_message(message, queue_url, sqs_client)


def process_message(message: dict, queue_url: str, sqs_client):
    try:
        body = json.loads(message["Body"])
        event_type = body["event"]
        handler = HANDLER_MAP.get(event_type)

        if not handler:
            logger.warning(f"No handler registered for event: {event_type}")
        else:
            handler(body["payload"])

        # Only delete the message if processing succeeded
        # (unknown event types are logged and acknowledged as well)
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"]
        )
    except Exception as exc:
        # Do NOT delete — message returns to queue after visibility timeout
        # SQS dead-letter queue (DLQ) handles repeated failures
        logger.error(f"Failed to process message {message['MessageId']}: {exc}")
```

```python
# notification_service/handlers/order_handlers.py
from clients.email_client import send_transactional_email
from clients.webhook_client import deliver_webhook
from clients.feed_client import append_feed_event


def handle_order_completed(payload: dict):
    send_transactional_email(
        template="order_confirmation",
        to=payload["user_email"],
        context={
            "order_id": payload["order_id"],
            "total": payload["total"],
            "currency": payload["currency"],
            "line_items": payload["line_items"]
        }
    )
    deliver_webhook(
        event="order.completed",
        payload=payload
    )
    append_feed_event(
        user_id=payload["user_id"],
        event_type="order_completed",
        metadata={"order_id": payload["order_id"]}
    )
```

### SQS Configuration: Dead-Letter Queue

Idempotency and failure handling are non-negotiable in async systems. We configured a DLQ for every event queue:

```hcl
# terraform/sqs.tf
resource "aws_sqs_queue" "order_events" {
  name                       = "atlas-order-events"
  visibility_timeout_seconds = 30
  message_retention_seconds  = 86400 # 24 hours

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.order_events_dlq.arn
    maxReceiveCount     = 3 # 3 failed attempts → DLQ
  })
}

resource "aws_sqs_queue" "order_events_dlq" {
  name                      = "atlas-order-events-dlq"
  message_retention_seconds = 1209600 # 14 days — time to investigate
}
```

With `maxReceiveCount = 3`, a message that has been received three times without being deleted is moved to the DLQ automatically instead of being delivered a fourth time. We set up CloudWatch alarms on DLQ depth so failures surface as alerts rather than silent data loss.

---

## Phase 4 — Event-Driven Patterns Across Services

Once SQS was in place and the team had built muscle with it, we extended the pattern across the rest of the extraction effort. Three patterns proved especially valuable.

### Pattern 1: Event Sourcing for Audit Trails

The billing domain previously had an `updated_at` column and a loosely maintained `versions` table.
After extraction, every billing event became a first-class fact published to SQS and persisted in a dedicated event store:

```python
# billing_service/event_store.py
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("atlas-billing-events")


def record_event(aggregate_id: str, event_type: str, payload: dict):
    timestamp = datetime.now(timezone.utc).isoformat()
    table.put_item(Item={
        "pk": f"BILLING#{aggregate_id}",
        "sk": f"EVENT#{timestamp}",
        "event_type": event_type,
        "payload": payload,  # numeric values must be Decimal, not float
        "version": get_next_version(aggregate_id)  # helper defined elsewhere (not shown)
    })


def get_billing_history(aggregate_id: str) -> list:
    query_args = {
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": f"BILLING#{aggregate_id}",
            ":prefix": "EVENT#"
        },
        "ScanIndexForward": True  # oldest event first
    }
    items = []
    while True:
        response = table.query(**query_args)
        items.extend(response["Items"])
        # A single query call returns at most 1 MB; follow the pagination key
        if "LastEvaluatedKey" not in response:
            return items
        query_args["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```

This replaced what used to be a `SELECT * FROM billing_audits WHERE account_id = ?` query on the monolith's main database — which had become a serious bottleneck at reporting time.

---

### Pattern 2: Saga for Distributed Transactions

Distributed systems don't have ACID transactions across service boundaries. When we extracted the subscription service, we had to handle the scenario: charge succeeds → subscription activation fails.

We implemented a choreography-based saga using SQS + SNS:

```plaintext
charge.succeeded ──► [Subscription Service] ──► subscription.activated
                              │ (on failure)
                              │
                              ▼
                subscription.activation_failed
                              │
                              ▼
                      [Billing Service] ──► charge.reversed
```

```python
# subscription_service/saga.py
# activate_subscription, ActivationError and publish_event are
# service-internal helpers (not shown)

def handle_charge_succeeded(payload: dict):
    account_id = payload["account_id"]
    plan_id = payload["plan_id"]

    try:
        activate_subscription(account_id, plan_id)
        publish_event("subscription.activated", {
            "account_id": account_id,
            "plan_id": plan_id,
            "charge_id": payload["charge_id"]
        })
    except ActivationError as exc:
        # Compensating transaction — tell billing to reverse the charge
        publish_event("subscription.activation_failed", {
            "account_id": account_id,
            "charge_id": payload["charge_id"],
            "reason": str(exc)
        })
```

```python
# billing_service/saga_handlers.py
# reverse_charge and publish_event are service-internal helpers (not shown)

def handle_activation_failed(payload: dict):
    charge_id = payload["charge_id"]
    reverse_charge(charge_id)
    publish_event("charge.reversed", {
        "charge_id": charge_id,
        "account_id": payload["account_id"],
        "reason": payload["reason"]
    })
```

The key insight: **each service owns its own compensating logic**. There's no central orchestrator that becomes a bottleneck or single point of failure.

---

### Pattern 3: Read Model Projections (CQRS)

The reporting module was one of the last pieces consuming the monolith's main Postgres instance. We extracted it by building read-model projections — denormalized views rebuilt from events.

Each time an event arrived, a projection worker updated a read-optimized store:

```python
# reporting_service/projections/revenue_projection.py
import boto3

# Low-level client: the call below uses typed attribute values ({"S": ...})
dynamodb = boto3.client("dynamodb")


def on_order_completed(payload: dict):
    """
    Update the revenue projection whenever an order completes.
    This is eventually consistent — typically < 500ms behind real-time.
    """
    # NOTE: assumes `timestamp` and `total_cents` are present in the payload.
    # The `order.completed` payload shown earlier carries `total` as a string
    # and keeps `timestamp` on the envelope, so the publisher or the
    # dispatcher has to supply both fields.
    date_key = payload["timestamp"][:10]  # "2024-11-15"

    dynamodb.update_item(
        TableName="atlas-revenue-daily",
        Key={
            "date": {"S": date_key},
            "currency": {"S": payload["currency"]}
        },
        UpdateExpression="ADD total_cents :amount, order_count :one",
        ExpressionAttributeValues={
            ":amount": {"N": str(payload["total_cents"])},
            ":one": {"N": "1"}
        }
    )
```

The reporting API now reads from DynamoDB, never from Postgres. The main database load dropped by ~31% after this migration.

---

## Results After 14 Months

| Metric | Before | After |
|---|---|---|
| Deployment time | 18 min (full downtime) | 3 min (rolling, zero downtime) |
| Horizontal scaling | Not possible | Auto-scales to 12 pods |
| p99 API latency | 1,840ms | 290ms |
| Session-related incidents | ~3/month | 0 in last 6 months |
| Primary DB CPU (peak) | 94% | 41% |
| MTTR for notification failures | ~2 hours | ~8 minutes (DLQ alert → fix) |

---

## Lessons Worth Remembering

**1. State audit before extraction.** Do not extract a service and discover later it secretly depends on shared mutable state. Build the inventory first. Every singleton, every in-memory store, every sticky-session dependency.

**2. Make events self-contained.** An event payload should carry enough data that consumers never need to call back into the emitting service to process it. Avoid events like `{ "order_id": 123 }` that force consumers to make a synchronous call to the monolith. Include the data the consumers need.

**3. Idempotency is mandatory, not optional.** SQS delivers messages *at least once*. Consumers will see duplicates. Every handler must be safe to invoke multiple times with the same payload. Use a dedupe key:

```python
def handle_order_completed(payload: dict):
    dedupe_key = f"notification:order_completed:{payload['order_id']}"
    # SET NX claims the key atomically, so two concurrent deliveries cannot
    # both pass the check (`cache` is assumed to be a redis-py client)
    if not cache.set(dedupe_key, "1", nx=True, ex=86400):  # 24-hour TTL
        return  # Already processed — safe to skip
    try:
        ...  # process
    except Exception:
        cache.delete(dedupe_key)  # release the claim so the SQS retry can run
        raise
```

**4. DLQs are your observability layer.** A growing DLQ is a canary. Alert on it aggressively. Every message in the DLQ represents a business event that did not complete — treat it with the same urgency as a failed database write.

**5. Don't extract the hardest domain first.** Start with high-async-viability, low-shared-state domains (notifications, emails, audit logs). Build team confidence and tooling patterns before tackling the hard stuff (billing, auth, core domain logic).

---

## The Code That Made the Biggest Difference

If you take one pattern from this article, take this one: **the thin event bus wrapper with a strict schema contract**.

```ruby
# The monolith's entire async surface area lives here.
# Every service interaction is an event.
# Every event has a version field.
# Schema changes are additive, never breaking.
EventBus.publish("order.completed", {
  schema_version: "1.2",
  order_id: order.id,
  # ... fields
})
```

Versioning events from day one saved us multiple breaking-change incidents. When the notification service needed a new field, we added it to the payload (non-breaking). When a field needed to change shape, we bumped the version and ran both handlers in parallel until consumers migrated.

---

*The migration is ongoing — we still have three bounded contexts inside the monolith. But the hardest part wasn't the code. It was building shared mental models across the team about what "stateless" actually means in a distributed system, and why the discipline of event contracts pays for itself a hundred times over.*
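
As a closing aside, the "bump the version and run both handlers in parallel" step can be sketched as a small dispatch table. This is an illustration under stated assumptions, not Atlas code: `register`, `dispatch`, and the handler names are hypothetical, the 2.x payload shape is invented for the example, and the sketch assumes that only the major part of `schema_version` signals a breaking change.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], str]
HANDLERS: Dict[Tuple[str, int], Handler] = {}


def register(event_type: str, major: int):
    """Register a handler for one event type and one major schema version."""
    def decorator(fn: Handler) -> Handler:
        HANDLERS[(event_type, major)] = fn
        return fn
    return decorator


@register("order.completed", 1)
def order_completed_v1(payload: dict) -> str:
    # 1.x shape: total is a decimal string, currency is a sibling field
    return f"{payload['total']} {payload['currency']}"


@register("order.completed", 2)
def order_completed_v2(payload: dict) -> str:
    # Hypothetical 2.x shape: total became an object
    total = payload["total"]
    return f"{total['amount']} {total['currency']}"


def dispatch(event_type: str, payload: dict) -> str:
    """Route on the major version; additive minor bumps reuse the handler."""
    major = int(str(payload.get("schema_version", "1.0")).split(".")[0])
    handler = HANDLERS.get((event_type, major))
    if handler is None:
        raise LookupError(f"no handler for {event_type} v{major}")
    return handler(payload)


old = dispatch("order.completed",
               {"schema_version": "1.2", "total": "49.00", "currency": "USD"})
new = dispatch("order.completed",
               {"schema_version": "2.0",
                "total": {"amount": "49.00", "currency": "USD"}})
```

Both handlers stay registered while producers migrate; removing the 1.x entry is the last step, once no more 1.x events arrive.
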
