Day 6: Building a Distributed Log Query Engine with Real-Time Processing

## What We’re Building Today

> Today we’re transforming our basic log storage system into a production-grade distributed query engine that handles real-time log processing at scale. You’ll implement:

  • Distributed Log Query API: RESTful service with advanced filtering, aggregation, and real-time search capabilities
  • Event-Driven Processing Pipeline: Kafka-based system processing 10K+ logs/second with guaranteed delivery
  • Intelligent Caching Layer: Redis-powered query optimization reducing response times from 2s to 50ms
  • Production Monitoring Stack: Complete observability with Prometheus metrics, Grafana dashboards, and distributed tracing
## Why This Matters: From Local Files to Planet-Scale Systems


    > The jump from flat file storage to distributed query processing represents one of the most critical scaling challenges in modern systems. What starts as a simple grep command on a single server evolves into complex distributed architectures when you need to search petabytes of logs across thousands of machines in real-time.

    >

    > Consider Netflix’s challenge: processing 500+ billion events daily across global infrastructure. Their logs aren’t just stored—they’re actively queried by fraud detection systems, recommendation engines, and operational dashboards. A single query might need to aggregate data across 50 different microservices, each running in multiple regions. The system you’re building today demonstrates the foundational patterns that make this scale possible.

    >

    > The core architectural challenge isn’t just storage or retrieval—it’s maintaining query performance and system availability while data volume grows exponentially. Today’s implementation introduces you to the distributed systems trade-offs that separate senior engineers from those still thinking in single-machine terms.

## System Design Deep Dive: Five Core Patterns for Distributed Query Processing


    ### 1\. Event Sourcing with Command Query Responsibility Segregation (CQRS)

    Traditional systems couple write and read operations, creating bottlenecks when query patterns differ from write patterns. Our log processing system implements CQRS by separating:

  • Write Side: High-throughput log ingestion via Kafka with minimal validation
  • Read Side: Optimized query structures with pre-computed aggregations and indexes

    This separation allows us to scale reads and writes independently. When Netflix needs to process millions of viewing events while simultaneously running complex analytics queries, CQRS prevents read-heavy analytics from slowing down real-time event ingestion.

    Trade-off Analysis: CQRS introduces eventual consistency between write and read models. Your query results might be 100-500ms behind real-time events, but you gain the ability to handle 10x more concurrent queries. For log processing, this trade-off is almost always worth it—operational dashboards can tolerate slight delays if it means avoiding system overload during traffic spikes.
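The split can be sketched as a toy in-process model, with a queue standing in for Kafka and a counts map as the read-side projection. All names here are illustrative, not from the course codebase:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy CQRS sketch: the write side only appends events; the read side
// builds a query-optimized projection (counts per log level) by
// draining the event queue, as a projector would in a real system.
public class CqrsSketch {
    record LogEvent(String level, String message) {}

    static final Queue<LogEvent> eventLog = new ArrayDeque<>();        // stands in for Kafka
    static final Map<String, Integer> countsByLevel = new HashMap<>(); // read model

    // Write side: cheap append, no read-model work on the hot path
    static void ingest(String level, String message) {
        eventLog.add(new LogEvent(level, message));
    }

    // Read side: the projector catches up asynchronously
    static void project() {
        LogEvent e;
        while ((e = eventLog.poll()) != null) {
            countsByLevel.merge(e.level(), 1, Integer::sum);
        }
    }

    static int count(String level) {
        return countsByLevel.getOrDefault(level, 0);
    }

    public static void main(String[] args) {
        ingest("ERROR", "db timeout");
        ingest("ERROR", "retry exhausted");
        System.out.println("before projection: " + count("ERROR")); // still 0: not projected yet
        project();
        System.out.println("after projection:  " + count("ERROR")); // 2
    }
}
```

A query that lands between `ingest()` and `project()` sees stale counts; that window is exactly the eventual-consistency lag described above.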

    ### 2\. Distributed Caching with Cache-Aside Pattern

    Raw log queries against PostgreSQL become prohibitively expensive beyond 1M records. Our Redis implementation uses the cache-aside pattern where:

  • Application code manages cache population and invalidation
  • Cache misses trigger database queries with automatic cache warming
  • TTL policies balance data freshness with performance

    Key Insight: Cache hit ratios above 95% are achievable for log queries because of temporal locality—recent logs are queried far more frequently than historical data. However, cache invalidation becomes complex when implementing real-time updates.

    Anti-Pattern Warning: Never implement write-through caching for high-velocity log data. The cache invalidation overhead will negate performance benefits and create consistency nightmares during failure scenarios.
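A minimal cache-aside sketch, with HashMaps standing in for Redis and PostgreSQL (keys and data are illustrative; TTL handling is omitted to keep the flow visible):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Cache-aside: the application, not the cache, owns population and
// invalidation. Misses fall through to the database and warm the cache.
public class CacheAsideSketch {
    static final Map<String, String> cache = new HashMap<>();  // "Redis"
    static final Map<String, String> database = new HashMap<>( // "PostgreSQL"
            Map.of("logs:level=ERROR", "[...error rows...]"));
    static int cacheHits = 0, cacheMisses = 0;

    static Optional<String> query(String key) {
        String cached = cache.get(key);
        if (cached != null) {            // 1. cache hit: serve from the cache
            cacheHits++;
            return Optional.of(cached);
        }
        cacheMisses++;                   // 2. cache miss: fall through to the DB
        String fromDb = database.get(key);
        if (fromDb != null) {
            cache.put(key, fromDb);      // 3. warm the cache for the next caller
        }
        return Optional.ofNullable(fromDb);
    }

    public static void main(String[] args) {
        query("logs:level=ERROR");       // miss, warms cache
        query("logs:level=ERROR");       // hit
        System.out.println("hits=" + cacheHits + " misses=" + cacheMisses);
    }
}
```

Because the application owns steps 1-3, the invalidation policy stays an application-level decision—which is precisely why write-through caching is the wrong fit here.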

    ### 3\. Circuit Breaker Pattern for Fault Isolation

    Distributed systems fail in cascade patterns—one slow component triggers timeouts across the entire system. Our Resilience4j implementation provides:

  • Automatic failure detection with configurable thresholds
  • Fast-fail responses preventing resource exhaustion
  • Gradual recovery with half-open state testing

    When Uber’s map matching service experiences high latency, circuit breakers prevent the delay from propagating to rider assignment and pricing services. The same pattern protects our log query system from database overload scenarios.
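The state machine behind this behavior fits in a few lines. This is a simplified sketch in the spirit of Resilience4j’s CLOSED → OPEN → HALF_OPEN cycle, not its actual API; a real breaker tracks a sliding window of outcomes and wall-clock cool-down timers:

```java
// Minimal circuit-breaker state machine. Thresholds and the manual
// cool-down trigger are simplified for illustration.
public class BreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;

    BreakerSketch(int failureThreshold) { this.failureThreshold = failureThreshold; }

    State state() { return state; }

    boolean allowRequest() {
        return state != State.OPEN;     // fast-fail while OPEN
    }

    void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) state = State.OPEN;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;           // normal call, or a HALF_OPEN probe succeeded
    }

    void coolDownElapsed() {
        if (state == State.OPEN) state = State.HALF_OPEN; // let one probe through
    }

    public static void main(String[] args) {
        BreakerSketch breaker = new BreakerSketch(3);
        for (int i = 0; i < 3; i++) breaker.recordFailure();
        System.out.println(breaker.state());  // OPEN: queries go to cache instead
        breaker.coolDownElapsed();
        breaker.recordSuccess();              // probe succeeded
        System.out.println(breaker.state());  // CLOSED
    }
}
```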

    ### 4\. Asynchronous Processing with Guaranteed Delivery

    Kafka’s producer-consumer model ensures log processing continues even when downstream services are temporarily unavailable. Our implementation includes:

  • At-least-once delivery with producer retries and consumer acknowledgments
  • Dead letter queues for handling poison messages
  • Consumer group management for horizontal scaling

    Scalability Implication: This pattern allows you to decouple ingestion rate from processing rate. During traffic spikes, Kafka acts as a buffer, preventing log loss while consumers catch up.
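The at-least-once contract can be illustrated with a toy in-process producer and consumer, where a plain queue stands in for the Kafka topic. Because the consumer acknowledges only after processing, a crash between the two steps causes redelivery: duplicates are possible, loss is not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// In-process sketch of at-least-once delivery. The queue stands in for
// a Kafka topic; transient send failures are simulated by a counter.
public class AtLeastOnceSketch {
    static final Queue<String> topic = new ArrayDeque<>();
    static final List<String> processed = new ArrayList<>();

    // Producer: retry until the (simulated) broker accepts the message.
    static void sendWithRetries(String msg, int transientFailures) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (attempts > transientFailures) { topic.add(msg); return; }
            // else: simulated timeout, loop and retry
        }
    }

    // Consumer: process first, acknowledge (remove) only on success.
    // A crash before the ack leaves the message on the topic, so it is
    // processed again on redelivery.
    static void consumeOnce(boolean crashBeforeAck) {
        String msg = topic.peek();
        if (msg == null) return;
        processed.add(msg);
        if (!crashBeforeAck) topic.poll();  // ack = remove from topic
    }

    public static void main(String[] args) {
        sendWithRetries("payment failed", 2);  // succeeds on the 3rd attempt
        consumeOnce(true);                     // crash before ack
        consumeOnce(false);                    // redelivery succeeds
        System.out.println(processed);         // message seen twice, never lost
    }
}
```

This is why downstream processing (like our consumer’s database writes) must be idempotent, or deduplicated, under at-least-once delivery.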

    ### 5\. Distributed Observability with Correlation IDs

    Production distributed systems require comprehensive observability. Our implementation includes:

  • Distributed tracing with correlation IDs propagated across service boundaries
  • Structured logging with consistent metadata for aggregation
  • Custom metrics measuring business logic, not just infrastructure
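Correlation-ID propagation can be sketched with a ThreadLocal standing in for SLF4J’s MDC. In the real services the ID would also travel in an HTTP header (a conventional name like X-Correlation-Id; an assumption, not the course’s exact header) across service boundaries:

```java
import java.util.UUID;

// Sketch of correlation-ID propagation: every structured log line
// carries the same ID, so one request can be traced across the
// producer, consumer, and gateway.
public class CorrelationSketch {
    static final ThreadLocal<String> correlationId = new ThreadLocal<>();

    static String log(String service, String message) {
        return "[" + correlationId.get() + "] " + service + ": " + message;
    }

    public static void main(String[] args) {
        correlationId.set(UUID.randomUUID().toString()); // set once at the edge
        System.out.println(log("api-gateway", "query received"));
        System.out.println(log("log-consumer", "batch flushed"));
        correlationId.remove(); // avoid leaks on pooled threads
    }
}
```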
## Implementation Walkthrough: Building Production-Grade Components


    ### Core Architecture Decisions

    Our system implements a three-service architecture optimized for independent scaling:

    Log Producer Service: Handles high-velocity ingestion with minimal processing overhead. Critical design decision: validation occurs asynchronously to maintain ingestion throughput. Invalid logs are routed to dead letter queues rather than blocking the main pipeline.

    Log Consumer Service: Processes Kafka events and maintains read-optimized data structures. Key optimization: batch processing with configurable flush intervals balances latency with throughput. During high load, larger batches improve database write efficiency.

    API Gateway Service: Provides query interface with intelligent routing and caching. Implements query complexity analysis to prevent expensive operations from overwhelming the system.

    ![System architecture diagram](https://substackcdn.com/image/fetch/$s!ymNu!,w1456,climit,fauto,qauto:good,flprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3aeb09c3-dd9a-4a5a-a40d-751cc542ceaa_935x615.png)

    ### Database Schema Optimization

    Traditional log tables become unusable beyond 10M records without proper indexing strategy:

    -- Composite index optimized for time-range queries

    CREATE INDEX idx_logs_timestamp_level ON logs(timestamp DESC, log_level);

    -- Partial index for error-only queries (common pattern)

    CREATE INDEX idx_logs_errors ON logs(timestamp DESC) WHERE log_level = 'ERROR';

    ### Caching Strategy Implementation

    Redis serves three distinct caching layers:

    1. Query Result Caching: Exact query matches with 5-minute TTL

    2. Aggregation Caching: Pre-computed hourly/daily summaries with 1-hour TTL

    3. Hot Data Caching: Recent logs (last 15 minutes) with 30-second TTL

    Cache key design prevents collisions while enabling pattern-based invalidation during data updates.
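One way to sketch such a key scheme, with a HashMap standing in for Redis and illustrative `query:`/`agg:` prefixes (Redis itself would invalidate via SCAN plus DEL over the same prefixes):

```java
import java.util.HashMap;
import java.util.Map;

// Namespaced cache keys: one segment per filter keeps keys
// collision-free, and shared prefixes enable pattern-based
// invalidation when new writes arrive.
public class CacheKeySketch {
    static final Map<String, String> cache = new HashMap<>();

    static String queryKey(String level, String source, int page) {
        return "query:level=" + level + ":source=" + source + ":page=" + page;
    }

    static int invalidatePrefix(String prefix) {
        int before = cache.size();
        cache.keySet().removeIf(k -> k.startsWith(prefix));
        return before - cache.size();
    }

    public static void main(String[] args) {
        cache.put(queryKey("ERROR", "auth-service", 0), "[...]");
        cache.put(queryKey("INFO", "auth-service", 0), "[...]");
        cache.put("agg:hourly:2024-01-01T10", "{...}");
        // New log writes arrived: drop exact-query results, keep aggregates
        int dropped = invalidatePrefix("query:");
        System.out.println("dropped=" + dropped + " remaining=" + cache.size());
    }
}
```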

## Production Considerations: Performance and Reliability


    ### Performance Benchmarks

    Under normal load (1,000 logs/second), the system maintains:

  • Query response times under 100ms (95th percentile)
  • Cache hit ratios above 94%
  • End-to-end processing latency under 200ms

    During spike conditions (10,000 logs/second), performance gracefully degrades:

  • Query response times increase to 300ms (acceptable for operational dashboards)
  • Circuit breakers prevent cascade failures
  • Kafka buffering maintains zero data loss

    ### Failure Scenarios and Recovery

    Database Connection Loss: Circuit breakers activate after 3 consecutive failures, routing all queries to cached data. System continues serving recent logs while database recovers.

    Kafka Cluster Outage: Producer buffering and retry logic maintain data integrity. Consumer lag is measured and alerts trigger when processing falls behind ingestion.

    Redis Cluster Failure: System automatically falls back to direct database queries with degraded performance. Cache warming begins immediately upon Redis recovery.

## Scale Connection: Enterprise Production Patterns


    This architecture mirrors production systems at scale. Airbnb’s log processing system uses identical patterns to handle 2M+ events per minute across their booking, pricing, and fraud detection systems. The key scaling strategies you’ve implemented today—CQRS separation, distributed caching, and asynchronous processing—remain constant as you move from thousands to millions of events per second.

    The primary difference at massive scale is data partitioning strategy and cross-region replication, concepts we’ll explore in Week 3 when building geographically distributed systems.

    Working Code Demo: the full implementation is in the GitHub repository linked below.


## Next Steps: Tomorrow’s Advanced Concepts


    Day 7 focuses on service mesh integration and advanced routing patterns. You’ll implement intelligent load balancing, A/B testing infrastructure, and cross-service authentication—the missing pieces for production deployment at enterprise scale.

Building Your System: Step-by-Step Implementation
=================================================

    ### Github Link:

    https://github.com/sysdr/sdc-java/tree/main/day6/distributed-log-processor

### Prerequisites


    Before starting, make sure you have these tools installed on your computer:

  • Java 17 or newer - The programming language we’re using
  • Maven 3.6 or newer - Builds our Java projects
  • Docker and Docker Compose - Runs our infrastructure services
  • A code editor - VS Code, IntelliJ IDEA, or Eclipse
    You can verify your installations by running these commands in your terminal:

    java -version

    mvn -version

    docker --version

    docker-compose --version

## Step 1: Generate the Project Files


    We’ve created a script that automatically generates all the code and configuration files you need. This saves you from typing thousands of lines of code manually.

    Run the generator script:

    chmod +x generatesystemfiles.sh

    ./generatesystemfiles.sh

    This creates a folder called distributed-log-processor containing:

  • Three Spring Boot services (Producer, Consumer, Gateway)
  • Database configurations
  • Docker setup files
  • Testing scripts
  • Complete documentation
    Navigate into the project:

    cd distributed-log-processor

## Step 2: Start the Infrastructure Services


    Our application needs several supporting services to run. Think of these as the foundation of your house—you need them in place before you can build the walls.

    Start all infrastructure services:

    This script starts:

  • Kafka - Message streaming platform
  • PostgreSQL - Main database for storing logs
  • Redis - Fast cache for query results
  • Prometheus - Collects performance metrics
  • Grafana - Creates visual dashboards
    The script takes about 30-60 seconds to start everything. You’ll see messages as each service comes online.

    Verify everything is running:

    docker-compose ps

    You should see all services showing “Up” status.

## Step 3: Build the Application Services


    Now we’ll compile our three Spring Boot services into executable JAR files.

    Build all services at once:

    mvn clean install

    This command:

  • Downloads all necessary libraries
  • Compiles your Java code
  • Runs automated tests
  • Packages everything into JAR files
    The build takes 2-3 minutes on first run. Subsequent builds are faster because Maven caches the libraries.

    You’ll see output showing each service being built. Look for “BUILD SUCCESS” messages for all three modules.

## Step 4: Start the Application Services


    Now comes the exciting part—starting your distributed system! Open three separate terminal windows to run each service. This lets you see the logs from each one independently.

    Terminal 1 - Start the Log Producer:

    java -jar log-producer/target/log-producer-1.0.0.jar

    Wait until you see: “Started LogProducerApplication”

    Terminal 2 - Start the Log Consumer:

    java -jar log-consumer/target/log-consumer-1.0.0.jar

    Wait until you see: “Started LogConsumerApplication”

    Terminal 3 - Start the API Gateway:

    java -jar api-gateway/target/api-gateway-1.0.0.jar

    Wait until you see: “Started ApiGatewayApplication”

## Step 5: Verify System Health


    Let’s make sure all services can communicate with each other. We’ll check the health endpoints that each service provides.

    Check each service:

    # Producer health check

    curl http://localhost:8081/actuator/health

    # Consumer health check

    curl http://localhost:8082/actuator/health

    # Gateway health check

    curl http://localhost:8080/actuator/health

    Each should return: {"status":"UP"}

    If any service returns an error, check its terminal window for error messages.

## Step 6: Your First Log Event


    Time to send your first log through the system! We’ll use curl to send an HTTP request to the producer service.

    Send a test log:

    curl -X POST http://localhost:8081/api/v1/logs \
      -H "Content-Type: application/json" \
      -d '{
        "message": "My first distributed log event!",
        "level": "INFO",
        "source": "learning-system",
        "metadata": {
          "student": "your-name",
          "lesson": "day-6"
        }
      }'

    You should see: “Log event accepted for processing”

    Watch the terminal windows—you’ll see the log flow through the system:

    1. Producer receives it and sends to Kafka

    2. Consumer reads from Kafka and saves to database

    3. Gateway makes it available for queries

## Step 7: Query Your Logs


    Now let’s retrieve the log you just sent. The gateway service provides a powerful query API.

    Get recent logs:

    curl "http://localhost:8080/api/v1/logs?size=10"

    You’ll see a JSON response containing your log entry with all its details.

    Try more advanced queries:

    # Get only ERROR level logs

    curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"

    # Search for specific keywords

    curl "http://localhost:8080/api/v1/logs?keyword=distributed"

    # Get logs from a specific source

    curl "http://localhost:8080/api/v1/logs?source=learning-system"


## Step 8: Send Multiple Logs (Batch Processing)


    Our system can handle many logs at once. Let’s test the batch endpoint.

    Send a batch of logs:

    curl -X POST http://localhost:8081/api/v1/logs/batch \
      -H "Content-Type: application/json" \
      -d '[
        {
          "message": "User logged in successfully",
          "level": "INFO",
          "source": "auth-service"
        },
        {
          "message": "Failed to connect to database",
          "level": "ERROR",
          "source": "db-service"
        },
        {
          "message": "Payment processed",
          "level": "INFO",
          "source": "payment-service"
        }
      ]'

    Query to see all your logs:

    curl "http://localhost:8080/api/v1/logs?size=20"

## Step 9: View System Statistics


    The gateway provides an analytics endpoint that shows you interesting statistics about your logs.

    Get statistics:

    curl "http://localhost:8080/api/v1/logs/stats"

    This returns:

  • Count of logs by level (INFO, ERROR, etc.)
  • Top sources generating logs
  • Total log count in time range
## Step 10: Run the Load Test


    Let’s stress test your system to see how it handles high volume. The load test script sends 1,000 logs and makes 50 queries simultaneously.

    Run the load test:

    Watch the terminal output as it:

  • Sends 10 batches of 100 logs each
  • Performs 50 different queries
  • Reports success/failure for each operation
    The test takes about 30-40 seconds to complete.

## Step 11: Monitor Performance with Prometheus


    Prometheus collects metrics from all your services. Let’s see what it’s tracking.

    Open Prometheus in your browser:

    http://localhost:9090

    Try these queries in the Prometheus UI:

    Query 1 - Log ingestion rate:

    rate(logs_received_total[5m])

    Query 2 - Cache hit ratio:

    rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) * 100

    Query 3 - Average query response time:

    rate(log_query_duration_seconds_sum[5m]) / rate(log_query_duration_seconds_count[5m])

    Click “Execute” and “Graph” to see visual representations.

## Step 12: Create Dashboards in Grafana


    Grafana turns your metrics into beautiful, real-time dashboards.

    Open Grafana in your browser:

    http://localhost:3000

    Login credentials:

  • Username: admin
  • Password: admin
    (You’ll be prompted to change the password; you can skip this for now)

    The system comes with a pre-built dashboard called “Log Processor System Dashboard” that shows:

  • Log ingestion rate over time
  • Cache hit percentage
  • Query performance
  • Error rates
## Step 13: Run Integration Tests


    Integration tests verify that all components work together correctly. These tests simulate real-world usage patterns.

    Run the integration test suite:

    The tests check:

  • All services are responding
  • Logs can be sent and retrieved
  • Queries return correct results
  • Statistics are calculated properly
    You’ll see “All integration tests passed!” if everything works.

## Understanding What You Built


    Let’s review what each component does and how they work together:

    ### The Log Producer Service (Port 8081)

    This is the entry point for logs. Applications send their logs here via HTTP requests. The producer:

  • Validates incoming logs
  • Immediately returns a response (doesn’t wait for processing)
  • Sends logs to Kafka for asynchronous processing
  • Tracks how many logs it receives
    ### Apache Kafka (Port 9092)

    Think of Kafka as a super-fast, reliable message highway. It:

  • Stores logs temporarily in topics (named channels)
  • Guarantees logs won’t be lost even if services crash
  • Allows multiple consumers to read the same logs
  • Handles millions of messages per second
    ### The Log Consumer Service (Port 8082)

    This service reads from Kafka and saves logs to the database. It:

  • Processes logs in batches for efficiency
  • Acknowledges messages only after successful storage
  • Handles failures gracefully with retries
  • Keeps track of processing progress
    ### PostgreSQL Database (Port 5432)

    The permanent storage for all logs. It features:

  • Optimized indexes for fast time-based queries
  • Support for complex filtering and aggregation
  • ACID guarantees for data consistency
  • Ability to store billions of log records
    ### Redis Cache (Port 6379)

    An extremely fast in-memory cache that:

  • Stores frequently accessed query results
  • Reduces database load by 90%+
  • Expires old data automatically
  • Returns results in milliseconds
    ### The API Gateway Service (Port 8080)

    Your query interface. It:

  • Checks Redis cache first (cache-aside pattern)
  • Falls back to database on cache miss
  • Implements circuit breakers for fault tolerance
  • Provides REST APIs for searching logs
    ### Prometheus (Port 9090)

    Collects metrics from all services:

  • Scrapes metrics every 15 seconds
  • Stores time-series data
  • Enables alerting on thresholds
  • Powers Grafana dashboards
    ### Grafana (Port 3000)

    Visualizes your metrics:

  • Creates real-time dashboards
  • Shows trends and patterns
  • Helps identify performance issues
  • Makes data accessible to everyone
## Key Concepts You’ve Learned


    ### Eventual Consistency

    When you send a log to the producer, it returns immediately. But the log isn’t queryable instantly—it takes 100-500ms to flow through Kafka, get processed by the consumer, and land in the database. This delay is called eventual consistency.

    Why it matters: This trade-off lets your system handle way more logs. If every log had to be saved before responding, your system would be much slower.

    ### Circuit Breaker Pattern

    If the database goes down, the circuit breaker “opens” after detecting failures. Instead of trying the database repeatedly, the gateway serves cached data only. Once the database recovers, the circuit breaker gradually “closes” again.

    Why it matters: Prevents cascade failures where one slow component brings down the entire system.

    ### Cache-Aside Pattern

    The gateway doesn’t automatically populate the cache. Instead:

    1. Check if result exists in cache

    2. If yes, return it (cache hit)

    3. If no, query database (cache miss)

    4. Store result in cache for next time

    Why it matters: Simple to implement and works well for read-heavy workloads like log queries.

    ### Horizontal Scaling

    Each service is stateless—it doesn’t store any data in memory that matters. This means you can run multiple copies:

  • 3 producer instances can share incoming requests
  • 5 consumer instances can divide Kafka partitions
  • 4 gateway instances can handle more queries
    Why it matters: As your traffic grows, just add more servers. No fundamental redesign needed.
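The consumer-side claim can be made concrete with a round-robin partition assignment sketch. Kafka’s real assignors add stickiness and rebalancing, so this is only the intuition:

```java
import java.util.ArrayList;
import java.util.List;

// How a consumer group divides topic partitions among instances:
// each partition goes to exactly one consumer, so adding instances
// (up to the partition count) spreads the processing load.
public class PartitionAssignSketch {
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> out = new ArrayList<>();
        for (int c = 0; c < consumers; c++) out.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) out.get(p % consumers).add(p);
        return out;
    }

    public static void main(String[] args) {
        // 12 partitions shared by 5 stateless consumer instances
        System.out.println(assign(12, 5));
    }
}
```

Note the ceiling this implies: with 12 partitions, a 13th consumer instance would sit idle, which is why partition count is chosen with headroom.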

## Troubleshooting Common Issues


    ### Problem: Services won’t start

    Check 1: Are the ports already in use?

    lsof -i :8080 # Check gateway port

    lsof -i :8081 # Check producer port

    lsof -i :8082 # Check consumer port

    Solution: Stop any other programs using these ports.

    Check 2: Is Docker running?

    docker ps

    Solution: Start Docker Desktop or the Docker daemon.

    ### Problem: Can’t connect to Kafka

    Check: Is Kafka healthy?

    docker exec kafka kafka-topics --list --bootstrap-server localhost:9092

    Solution: Restart Kafka:

    docker-compose restart kafka

    ### Problem: Queries are very slow

    Check: What’s the cache hit rate?

    curl http://localhost:8080/actuator/metrics/cache.gets

    Solution: If hit rate is low, your cache TTL might be too short. Check Redis:

    docker exec redis redis-cli info stats

    ### Problem: Consumer isn’t processing logs

    Check: Consumer lag in Kafka:

    docker exec kafka kafka-consumer-groups --bootstrap-server localhost:9092 --describe --group log-consumer-group

    Solution: If lag is growing, the consumer might be overwhelmed. Check the consumer service logs for errors.

## Experimenting Further


    Now that your system is running, try these experiments to learn more:

    ### Experiment 1: Break the Database

    Stop PostgreSQL and see what happens:

    docker-compose stop postgres

    Send logs and try queries. Notice:

  • Producer still accepts logs (Kafka buffers them)
  • Gateway serves cached queries
  • Circuit breaker prevents errors
    Restart the database:

    docker-compose start postgres

    Watch the consumer catch up on processing the buffered logs.

    ### Experiment 2: Cache Performance

    Send the same query twice and compare response times:

    time curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"

    time curl "http://localhost:8080/api/v1/logs?logLevel=ERROR"

    The second query should be much faster (cached result).

    ### Experiment 3: Batch vs. Individual

    Compare sending 100 logs individually vs. one batch:

    Individual (slow):

    for i in {1..100}; do
      curl -X POST http://localhost:8081/api/v1/logs \
        -H "Content-Type: application/json" \
        -d '{"message":"Log '$i'","level":"INFO","source":"test"}'
    done

    Batch (fast): Use the batch endpoint with 100 logs in one request.

    Notice how much faster batching is!

## What’s Next


    You’ve built a distributed system that handles thousands of logs per second. Here’s what you can explore next:

    Add More Features:

  • User authentication for the API
  • Rate limiting to prevent abuse
  • Data retention policies (auto-delete old logs)
  • Full-text search with Elasticsearch
    Scale It Further:

  • Deploy to Kubernetes
  • Add multiple Kafka brokers
  • Set up database replication
  • Implement cross-region deployment
    Production Hardening:

  • Add encryption for data in transit
  • Implement audit logging
  • Set up automated backups
  • Create runbooks for incidents
    Tomorrow in Day 7, we’ll add a service mesh to handle advanced routing, authentication between services, and implement A/B testing capabilities.

## Cleaning Up


    When you’re done experimenting, shut everything down:

    # Stop application services

    # Press Ctrl+C in each terminal window

    # Stop infrastructure

    docker-compose down

    # Remove data volumes (optional - deletes all logs)

    docker-compose down -v

    Your code and configuration files remain intact, so you can restart anytime by running the setup script again.

