Day 7: Building Your First End-to-End Log Processing Pipeline

By Drew Dru October 20, 2025

What We’re Building Today

What We’re Building Today

> Today marks a critical milestone in our distributed systems journey. We’re integrating all the components we’ve built over the past week into a complete, production-ready log processing pipeline that runs on a single machine but follows distributed system principles. By the end of this lesson, you’ll have:

> • A multi-service Spring Boot system with log producers, consumers, and a query API gateway • Event-driven architecture using Apache Kafka for reliable message delivery and decoupling • Persistent storage layer with PostgreSQL for long-term log retention and Redis for high-performance caching • Production observability with Prometheus metrics, distributed tracing, and Grafana dashboards

[

](https://substackcdn.com/image/fetch/$s!1-cL!,fauto,qauto:good,flprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16e359f1-1a55-4c49-b3d1-2d40dfaf9f39_965x610.png)

Why This Foundation Matters

What we’re building today isn’t just a local system—it’s a scaled-down version of the same architectural patterns that power Netflix’s real-time analytics, Uber’s event streaming platform, and Amazon’s CloudWatch logs. Every design decision we make reflects the same trade-offs these companies face at massive scale.

The critical insight here is understanding that distributed systems aren’t about having multiple machines—they’re about designing with distribution principles from day one. When Netflix processes billions of log events daily, they use the exact same event sourcing patterns, message queue semantics, and observability strategies we’ll implement today. The only difference is scale.

By building these patterns locally first, you’ll understand the fundamental challenges: message ordering guarantees, failure recovery, backpressure handling, and data consistency models. These challenges don’t change whether you’re processing 1,000 events per second or 1 million—only your infrastructure capacity changes.

System Design Deep Dive: Core Distributed Patterns

### Event Sourcing and Message Queuing Architecture

Our system implements event sourcing through Kafka, where every log event becomes an immutable fact in our system’s history. This isn’t just message passing—it’s a fundamental shift from request-response to event-driven thinking.

The key trade-off here is consistency versus availability. By using Kafka’s partitioning strategy, we guarantee message ordering within partitions but accept eventual consistency across the system. This is the same decision Uber made for their real-time data platform: they prioritized availability and partition tolerance over immediate consistency, accepting that different services might have slightly different views of the world at any given moment.

Anti-pattern warning: Don’t fall into the trap of using Kafka like a traditional message queue with immediate acknowledgments. The power comes from treating it as a distributed log where consumers can replay events and multiple subscribers can process the same stream independently.

### Cache-Aside Pattern with Redis

We implement cache-aside caching where our API gateway checks Redis first, then falls back to PostgreSQL for cache misses. This pattern handles the classic trade-off between read latency and data freshness.

The architectural insight here is understanding cache invalidation strategies. We use time-based expiration (TTL) for simplicity, but production systems like Amazon’s DynamoDB Accelerator (DAX) implement sophisticated invalidation based on write-through patterns and event triggers.

Scalability bottleneck: Redis becomes a single point of failure in this design. At Netflix scale, this evolves into Redis clusters with consistent hashing, but for our local system, we accept this limitation while building the foundational patterns correctly.

### Circuit Breaker and Bulkhead Patterns

Using Resilience4j, we implement circuit breakers around external service calls and bulkhead patterns to isolate different types of requests. This prevents cascade failures where one slow database query doesn’t bring down your entire log ingestion pipeline.

The trade-off is between system resilience and operational complexity. Circuit breakers add monitoring overhead and require careful tuning, but they’re essential for systems that need to handle partial failures gracefully. This is exactly how Spotify prevents their recommendation service outages from affecting music playback.

### Observable Systems Architecture

We instrument everything with Micrometer metrics and distributed tracing. This isn’t just monitoring—it’s designing for debuggability at scale. Every service exposes health checks, custom metrics, and trace context propagation.

Production consideration: Observability has performance costs. Each metric collection and trace span adds latency. We carefully choose what to measure, focusing on RED metrics (Rate, Errors, Duration) rather than monitoring everything.

### Database Connection Pool Management

Our system uses HikariCP connection pooling with carefully tuned parameters. This addresses the fundamental challenge of database connections as a finite resource that must be shared across concurrent requests.

The key insight is that connection pools are your first line of defense against database overload. The configuration we use (maximum 20 connections, 30-second timeout) reflects production patterns where you’d rather fail fast with connection timeouts than overwhelm your database with too many concurrent queries.

Implementation Walkthrough

### Service Architecture Decisions

We structure our system as three distinct Spring Boot services:

Log Producer Service: Exposes REST endpoints for log ingestion and publishes to Kafka. The architectural decision here is to separate ingestion from processing, allowing us to handle traffic spikes by buffering events in Kafka rather than overwhelming downstream services.

Log Consumer Service: Subscribes to Kafka topics and persists logs to PostgreSQL. We implement idempotent processing using log event IDs, ensuring that message redelivery doesn’t create duplicate records. This reflects the same exactly-once semantics that Apache Flink uses for stream processing.

API Gateway Service: Provides query endpoints with Redis caching. The gateway pattern centralizes cross-cutting concerns like authentication, rate limiting, and response caching. This is the same pattern Amazon uses for API Gateway, though our implementation is obviously simpler.

\[Image placeholder: Service interaction diagram showing request/response flow between services\]

### Message Processing Strategy

We configure Kafka consumers with manual acknowledgment, giving us control over when messages are considered processed. This prevents message loss during service restarts and allows for sophisticated error handling strategies.

The consumer group configuration ensures that multiple consumer instances can process messages in parallel while maintaining ordering guarantees within partitions. This is crucial for scaling—you can add more consumer instances without changing your application code.

### Database Design Philosophy

Our PostgreSQL schema includes proper indexing for time-range queries, which are the most common pattern for log systems. We also implement connection pooling and prepared statements for optimal performance.

Performance insight: We use JSONB for storing log metadata, taking advantage of PostgreSQL’s native JSON indexing. This gives us the flexibility of document storage with the performance of relational queries—exactly the approach Discord uses for storing billions of messages.

Production Considerations

### Performance Optimization

The system includes JVM tuning parameters optimized for throughput over latency. We configure G1GC for consistent performance under varying load conditions. Kafka producers use batching and compression to maximize throughput while minimizing network overhead.

Monitoring strategy: We track JVM metrics, Kafka lag, database connection pool utilization, and Redis hit rates. These four metrics tell you everything you need to know about system health in production.

### Failure Scenarios and Recovery

We implement graceful degradation where Redis failures don’t break the API—queries just become slower as they hit PostgreSQL directly. Kafka failures are handled by Spring Retry with exponential backoff, preventing overwhelming the broker during recovery.

The critical insight is that systems fail in partial ways. Netflix’s chaos engineering approach teaches us to build systems that continue operating even when individual components fail. Our circuit breaker configuration reflects this philosophy.

### Deployment and Scaling Strategies

The Docker Compose configuration includes resource limits and restart policies that reflect production deployment practices. We use health checks to ensure services are genuinely ready before receiving traffic.

Capacity planning: The system is configured to handle approximately 1,000 events per second on standard development hardware. The bottleneck is typically PostgreSQL writes, which is exactly what you’d expect in a log processing system.

Scale Connection: Enterprise Patterns

This exact architecture pattern appears in every major tech company, scaled up with additional infrastructure:

Netflix: Their event streaming platform processes billions of events using the same Kafka partitioning strategy and consumer group patterns we implement

Uber: Their real-time data infrastructure uses identical circuit breaker and bulkhead patterns for handling partial failures across thousands of microservices

Amazon: CloudWatch Logs uses the same cache-aside pattern with ElastiCache (Redis) for high-performance log queries

The architectural principles don’t change at scale—only the infrastructure capacity and operational complexity increase.

Next Steps

Tomorrow we’ll break our single-machine constraint and implement TCP-based log collection, introducing network partition handling and service discovery patterns. This builds naturally on today’s foundation by extending our event-driven architecture across machine boundaries.

The key insight you should take from today: distributed systems design isn’t about having multiple machines—it’s about designing with distribution principles that make your system resilient, observable, and scalable from day one.

Youtube Video:

* *

Implementation Guide

====================

### Github Link:

https://github.com/sysdr/sdc-java/tree/main/day7/distributed-log-processor

Setting Up Your Development Environment

Before we start building, let’s make sure you have everything you need installed on your computer.

### Required Software

Java Development Kit 17+

java -version

# Should show version 17 or higher

Maven 3.8+

mvn -version

# Should show version 3.8 or higher

Docker and Docker Compose

docker --version

docker-compose --version

# Both should be installed and running

If you’re missing any of these, install them before continuing. Docker is especially important since it will run our databases and message queues.

Building the Complete System

### Step 1: Generate the Project Structure

Run the system generator script we created:

chmod +x generatesystemfiles.sh

This creates a complete project with three Spring Boot services and all the configuration files we need. You should see a new folder called distributed-log-processor with everything inside.

\[Image placeholder: Terminal screenshot showing successful project generation\]

### Step 2: Start the Infrastructure Services

Navigate to your project folder and start the supporting services:

cd distributed-log-processor

This command starts:

PostgreSQL database (port 5432)

Apache Kafka with Zookeeper (port 9092)

Redis cache (port 6379)

Prometheus monitoring (port 9090)

Grafana dashboards (port 3000)

Wait for the setup script to complete. It will check that all services are healthy before finishing.

### Step 3: Build the Java Applications

Now let’s compile our three Spring Boot services:

# Build all services at once

mvn clean package -DskipTests

# This might take a few minutes the first time

# Maven will download all the dependencies we need

You should see “BUILD SUCCESS” for each of the three services: log-producer, log-consumer, and api-gateway.

### Step 4: Start the Services

Open three separate terminal windows and start each service:

Terminal 1 - Log Producer:

cd log-producer

mvn spring-boot:run

Terminal 2 - Log Consumer:

cd log-consumer

mvn spring-boot:run

Terminal 3 - API Gateway:

cd api-gateway

mvn spring-boot:run

Each service will take about 30-60 seconds to start. You’ll know they’re ready when you see “Started \[ServiceName\]Application” in the logs.

\[Image placeholder: Three terminal windows showing services starting up\]

Testing Your System

### Health Check Test

Let’s verify everything is working by checking the health of each service:

# Check Log Producer (port 8081)

curl http://localhost:8081/producer/api/logs/health

# Check Log Consumer (port 8082)

curl http://localhost:8082/actuator/health

# Check API Gateway (port 8080)

curl http://localhost:8080/gateway/api/query/health

Each command should return a JSON response with “status”: “healthy”.

### Send Your First Log Event

Let’s send a log message to our system:

curl -X POST http://localhost:8081/producer/api/logs/ingest \

-H “Content-Type: application/json” \

-d ‘{

“level”: “INFO”,

“message”: “Hello from our distributed log system!”,

“source”: “test-application”,

“metadata”: {

“userId”: “student123”,

“action”: “first_test”

}

}’

You should get back a response like:

{

“status”: “accepted”,

“logId”: “some-unique-id”,

“message”: “Log event successfully ingested”

}

### Query Your Log Back

Wait about 5 seconds for the message to be processed, then query it back:

curl “http://localhost:8080/gateway/api/query/logs?source=test-application”

You should see your log message in the response, along with timestamps and the metadata you sent.

### Run the Full Integration Test

We’ve provided a complete test suite that verifies all the functionality:

This script tests:

Health checks for all services

Single log ingestion

Batch log ingestion

Log querying and search

Statistics generation

All tests should pass with green checkmarks.

Performance Testing

### Load Test

Let’s see how much load our system can handle:

This sends 600 log messages over 60 seconds (10 per second) and measures response times. Watch the terminal output to see how your system performs under load.

### Monitor Performance

While the load test is running, open Grafana in your web browser:

http://localhost:3000

Requests per second

Response times

Error rates

Database and cache performance

\[Image placeholder: Grafana dashboard showing live metrics during load test\]

Understanding the Data Flow

Let’s trace what happens when you send a log message:

1. HTTP Request → Your curl command sends JSON to the Log Producer

2. Kafka Publishing → Producer converts JSON to Kafka message and sends to log-events topic

3. Message Consumption → Log Consumer receives message from Kafka

4. Database Write → Consumer saves log to PostgreSQL with proper indexing

5. Query Path → API Gateway checks Redis cache first, then PostgreSQL if needed

6. Monitoring → All steps generate metrics that Prometheus collects

### Viewing the Raw Data

You can see your logs directly in the database:

# Connect to PostgreSQL

docker exec -it postgres psql -U loguser -d logprocessor

# Query your logs

SELECT id, level, message, source, timestamp FROM log_events ORDER BY timestamp DESC LIMIT 10;

# Exit the database

### Checking Kafka Topics

You can also see the messages in Kafka:

# List Kafka topics

docker exec -it kafka kafka-topics --bootstrap-server localhost:9092 --list

# See recent messages (press Ctrl+C to stop)

docker exec -it kafka kafka-console-consumer --bootstrap-server localhost:9092 --topic log-events --from-beginning

Exploring Advanced Features

### Circuit Breaker Testing

Let’s trigger the circuit breaker by stopping the database:

# Stop PostgreSQL

docker stop postgres

# Try to send logs - should get circuit breaker response

curl -X POST http://localhost:8081/producer/api/logs/ingest \

-H “Content-Type: application/json” \

-d ‘{”level”: “ERROR”, “message”: “Test during outage”, “source”: “test”}’

# Restart PostgreSQL

docker start postgres

The system should return a “service unavailable” response when the database is down, demonstrating our resilience patterns.

### Cache Performance

Let’s test the cache-aside pattern:

# First query (cache miss) - will be slower

time curl “http://localhost:8080/gateway/api/query/stats”

# Second query (cache hit) - much faster

time curl “http://localhost:8080/gateway/api/query/stats”

The second request should be noticeably faster because it’s served from Redis.

### Batch Processing

Test sending multiple logs at once:

curl -X POST http://localhost:8081/producer/api/logs/batch \

-H “Content-Type: application/json” \

-d ‘[

{”level”: “INFO”, “message”: “Batch message 1”, “source”: “batch-test”},

{”level”: “WARN”, “message”: “Batch message 2”, “source”: “batch-test”},

{”level”: “ERROR”, “message”: “Batch message 3”, “source”: “batch-test”}

]’

This is much more efficient than sending logs one at a time.

Monitoring and Debugging

### Key Metrics to Watch

In Grafana, focus on these important metrics:

Throughput Metrics:

logingestiontotal - How many logs per second we’re processing

kafkamessagessent_total - Messages successfully sent to Kafka

kafkamessagesconsumed_total - Messages processed by consumers

Performance Metrics:

logingestionduration - How long it takes to process each log

httpserverrequests - API response times

logprocessingduration - Consumer processing time

Health Metrics:

Circuit breaker states (open/closed/half-open)

Database connection pool usage

Redis hit/miss rates

JVM memory and garbage collection

### Common Issues and Solutions

“Connection refused” errors: Make sure Docker containers are running with docker-compose ps

Slow queries: Check the database connection pool isn’t exhausted in the metrics

High memory usage: Monitor JVM heap size and consider increasing it with -Xmx4g

Kafka lag: If consumers fall behind, increase the number of consumer instances

Real-World Applications

The system you’ve built demonstrates the same patterns used by:

Netflix: Their event streaming processes billions of user interactions daily using identical Kafka partitioning and consumer group patterns.

Uber: Their dispatch system uses the same circuit breaker patterns to handle partial failures across thousands of services.

Discord: Their message storage uses PostgreSQL with JSONB for metadata, just like our log system.

Amazon CloudWatch: Uses cache-aside patterns with Redis for high-performance log queries.

The only differences at scale are:

More machines and larger clusters

Additional monitoring and alerting

More sophisticated deployment automation

Higher performance requirements

But the fundamental architectural patterns remain the same.

\[Image placeholder: Comparison diagram showing local system vs enterprise scale\]

What You’ve Accomplished

Congratulations! You’ve built a production-ready distributed log processing system that includes:

Event-driven architecture with proper message ordering

Resilience patterns that handle failures gracefully

High-performance caching with Redis

Comprehensive monitoring with Prometheus and Grafana

Horizontal scaling capabilities built-in from day one

This foundation prepares you for tomorrow’s lesson where we’ll extend this system across network boundaries, introducing service discovery and network partition handling.

The most important lesson: distributed systems aren’t about having multiple machines—they’re about designing with distribution principles that make systems resilient, observable, and scalable from the start.

Cleanup

When you’re done experimenting, stop all services:

# Stop Spring Boot services (Ctrl+C in each terminal)

# Stop Docker infrastructure

docker-compose down

# Remove volumes if you want to start fresh next time

docker-compose down -v

This frees up system resources and stops all the background processes.

You can include dynamic values by using placeholders like: https://drewdru.syndichain.com/articles/e9f4a4e4-2267-4df4-9b7e-64b609bf9629 , Drew Dru, https://sdcourse.substack.com/p/day-7-building-your-first-end-to , drewdru, drewdru, drewdru, drewdru These will automatically be replaced with the actual data when the message is sent.

Reference: https://drewdru.syndichain.com/articles/e9f4a4e4-2267-4df4-9b7e-64b609bf9629

No comments yet.