Event-Driven Architecture with Kafka and Flink: Stream Processing Patterns for 2026
Real-time data processing has become a competitive necessity. Whether it’s fraud detection that needs sub-second decisions, personalization engines reacting to user behavior, or operational dashboards reflecting live system state — batch pipelines that run every hour simply don’t cut it anymore. Apache Kafka and Apache Flink remain the cornerstone of production event-driven architectures in 2026. This post covers the patterns, pitfalls, and practical implementations.
The Event-Driven Architecture Landscape in 2026
Core Technologies
| Layer | Technology | Alternatives |
|---|---|---|
| Message broker | Apache Kafka | Redpanda, Pulsar, AWS Kinesis |
| Stream processing | Apache Flink | Spark Streaming, Kafka Streams |
| Schema registry | Confluent Schema Registry | AWS Glue |
| Stream storage | Apache Iceberg | Delta Lake, Apache Hudi |
| Serving | Apache Pinot, Druid | ClickHouse, Materialize |
Why Kafka + Flink Still Dominate
- Kafka: Proven at 100M+ events/sec, log-based storage enables replay, broad ecosystem
- Flink: Stateful stream processing with exactly-once guarantees, SQL support, native Kubernetes operator
Event Design: The Foundation
Good event-driven systems live or die by event design. Events should be:
Immutable facts — not commands or state snapshots:
❌ Bad: UpdateUserPreferences { user_id: "123", dark_mode: true }
✅ Good: UserPreferenceChanged { user_id: "123", preference: "dark_mode", value: "true", timestamp: "..." }
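The "immutable fact" style above can be enforced in application code as well; here is a minimal sketch using a frozen dataclass (the field names are illustrative, not from a real codebase):

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone
import uuid

@dataclass(frozen=True)
class UserPreferenceChanged:
    """A past-tense, immutable fact: all context captured, no setters."""
    user_id: str
    preference: str
    value: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

event = UserPreferenceChanged(user_id="123", preference="dark_mode", value="true")
# Any attempt to mutate (event.value = "false") raises FrozenInstanceError
```

`frozen=True` plus past-tense naming keeps consumers honest: state changes arrive as new events, never as edits to old ones.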
Self-describing with schema evolution in mind:
// Using Protocol Buffers for Kafka messages
syntax = "proto3";

import "google/protobuf/timestamp.proto";

package com.myorg.events.v1;

message OrderPlaced {
  string event_id = 1;
  string order_id = 2;
  string user_id = 3;
  repeated OrderItem items = 4;
  Money total = 5;
  string currency = 6;
  google.protobuf.Timestamp placed_at = 7;

  // v2 additions — backward compatible
  optional string promo_code = 8;
  optional DeliveryAddress delivery_address = 9;
}

message OrderItem {
  string product_id = 1;
  string product_name = 2;
  int32 quantity = 3;
  Money unit_price = 4;
}

message Money {
  int64 amount_cents = 1;
}
Schema Registry with Avro
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField
schema_registry_conf = {'url': 'http://schema-registry:8081'}
schema_registry_client = SchemaRegistryClient(schema_registry_conf)
order_schema = """
{
    "type": "record",
    "name": "OrderPlaced",
    "namespace": "com.myorg.events",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "order_id", "type": "string"},
        {"name": "user_id", "type": "string"},
        {"name": "total_cents", "type": "long"},
        {"name": "placed_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
    ]
}
"""

avro_serializer = AvroSerializer(
    schema_registry_client,
    order_schema,
)

producer_conf = {'bootstrap.servers': 'kafka:9092'}
producer = Producer(producer_conf)

def delivery_report(err, msg):
    """Called once per message from poll()/flush(); surfaces broker-side failures."""
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")

def publish_order_placed(order: dict):
    producer.produce(
        topic='orders.placed',
        key=order['order_id'],
        value=avro_serializer(
            order,
            SerializationContext('orders.placed', MessageField.VALUE)
        ),
        on_delivery=delivery_report,
    )
    producer.flush()
Kafka Patterns for Production
1. Partitioning Strategy
Partitioning determines parallelism and ordering guarantees:
orders.placed topic — 24 partitions
├── Partition by user_id hash → all orders for a user are ordered
├── 24 partitions → up to 24 consumers processing in parallel (replication factor 3 for durability)
└── Consumer group with 24 consumers → each partition assigned to exactly one consumer
def get_partition_key(event: dict) -> str:
    """
    Choose partition key based on ordering requirements:
    - User events: user_id (preserves per-user ordering)
    - Order events: order_id (preserves per-order ordering)
    - Payment events: payment_id (no specific requirement, random distribution)
    """
    if event.get('user_id') and requires_user_ordering(event['event_type']):
        return event['user_id']
    return event.get('order_id', event['event_id'])
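Under the hood, Kafka's default partitioner hashes the key (murmur2 in the Java client) modulo the partition count; the property that matters is that equal keys always land on the same partition. A sketch with a stand-in stable hash:

```python
import hashlib

NUM_PARTITIONS = 24

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for the client's murmur2 hash. A stable hash is essential:
    # Python's built-in hash() is randomized per process and would scatter
    # the same key across partitions between restarts.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same user_id → same partition → per-user ordering is preserved
assert partition_for("user-123") == partition_for("user-123")
```

Note the flip side: increasing the partition count changes the modulus, so existing keys remap and per-key ordering across the resize is lost. Size partitions generously up front.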
2. Consumer Group Management
from confluent_kafka import Consumer, KafkaException
import json
consumer_conf = {
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'fraud-detection-service',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # Manual commit: at-least-once delivery (commit only after processing)
    'max.poll.interval.ms': 300000,
    'session.timeout.ms': 30000,
}
consumer = Consumer(consumer_conf)
consumer.subscribe(['orders.placed', 'payments.initiated'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise KafkaException(msg.error())
        try:
            event = json.loads(msg.value())
            process_event(event)
            consumer.commit(asynchronous=False)  # Commit after successful processing
        except ProcessingError as e:
            # Send to dead letter queue
            publish_to_dlq(msg, e)
            consumer.commit(asynchronous=False)  # Still commit to avoid an infinite retry loop
finally:
    consumer.close()
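`publish_to_dlq` above is left undefined. A minimal sketch of the record it might build, preserving the failure context needed for diagnosis and replay (the `.dlq` topic suffix and `dlq.*` header names are conventions assumed here, not a standard):

```python
import time

def build_dlq_record(topic: str, partition: int, offset: int,
                     value: bytes, error: Exception) -> dict:
    """Wrap a failed message with enough context to diagnose and replay it."""
    return {
        "topic": f"{topic}.dlq",  # assumed naming convention
        "value": value,           # original payload, untouched
        "headers": {
            "dlq.original.topic": topic,
            "dlq.original.partition": str(partition),
            "dlq.original.offset": str(offset),
            "dlq.error.class": type(error).__name__,
            "dlq.error.message": str(error),
            "dlq.failed.at.ms": str(int(time.time() * 1000)),
        },
    }
```

In `publish_to_dlq`, this dict would feed a `producer.produce(...)` call; keeping the original topic, partition, and offset in the headers lets an operator locate and replay the exact message later.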
3. Transactional Outbox Pattern
Writing to the database and producing to Kafka are two separate systems; a crash between the two writes leaves them inconsistent (the dual-write problem). The outbox pattern makes the event part of the database transaction and relays it to Kafka afterward:
-- 1. Application writes to DB and outbox atomically
BEGIN;
INSERT INTO orders (id, user_id, total) VALUES ($1, $2, $3);
INSERT INTO outbox (id, topic, key, payload, created_at)
VALUES (gen_random_uuid(), 'orders.placed', $1, $4, NOW());
COMMIT;
# 2. Outbox relay (Debezium CDC or custom poller) publishes to Kafka
# Debezium automatically captures INSERT into outbox table → Kafka topic
# No polling needed; CDC via Postgres logical replication
# debezium-connector-config.json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.dbname": "myapp",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.key": "key",
    "transforms.outbox.route.by.field": "topic",
    "transforms.outbox.table.expand.json.payload": "true"
  }
}
Apache Flink: Stateful Stream Processing
Core Concepts
Data Sources (Kafka, files, HTTP)
        ↓
Transformations: stateful operators on state backends (RocksDB, memory), checkpointed
        ↓
Sinks
Flink’s superpower is managed state — you can accumulate, aggregate, and query state across millions of keys without managing external databases.
Flink SQL: The 2026 Default
Most Flink work now starts with SQL:
-- Create Kafka source tables
CREATE TABLE orders (
    order_id STRING,
    user_id STRING,
    total_cents BIGINT,
    placed_at TIMESTAMP(3),
    WATERMARK FOR placed_at AS placed_at - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders.placed',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'flink-analytics',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'avro-confluent',
    'avro-confluent.url' = 'http://schema-registry:8081'
);

CREATE TABLE payments (
    payment_id STRING,
    order_id STRING,
    status STRING,
    completed_at TIMESTAMP(3),
    WATERMARK FOR completed_at AS completed_at - INTERVAL '10' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'payments.completed',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'avro-confluent',
    'avro-confluent.url' = 'http://schema-registry:8081'
);

-- Sink table (ClickHouse for analytics)
CREATE TABLE order_metrics (
    window_start TIMESTAMP(3),
    window_end TIMESTAMP(3),
    total_orders BIGINT,
    total_revenue_cents BIGINT,
    avg_order_value DOUBLE
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:clickhouse://clickhouse:8123/analytics',
    'table-name' = 'order_metrics'
);
-- 1-minute tumbling window aggregation, using the windowing TVF syntax
-- (the legacy GROUP BY TUMBLE(...) form is deprecated)
INSERT INTO order_metrics
SELECT
    window_start,
    window_end,
    COUNT(*) AS total_orders,
    SUM(total_cents) AS total_revenue_cents,
    AVG(CAST(total_cents AS DOUBLE)) AS avg_order_value
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(placed_at), INTERVAL '1' MINUTES))
GROUP BY window_start, window_end;
Stateful Fraud Detection with Flink DataStream API
Some patterns require the full DataStream API. Here’s a real-time velocity check for fraud:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.*;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Duration;

public class FraudDetectionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing for fault tolerance (every 10s)
        env.enableCheckpointing(10_000);

        DataStream<Transaction> transactions = env
            .fromSource(KafkaSource.<Transaction>builder()
                    .setBootstrapServers("kafka:9092")
                    .setTopics("payments.initiated")
                    .setGroupId("fraud-detection")
                    .setValueOnlyDeserializer(new TransactionDeserializer())
                    .build(),
                WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)),
                "Kafka Source"
            );

        DataStream<FraudAlert> alerts = transactions
            .keyBy(Transaction::getUserId)
            .process(new FraudDetector());

        alerts.sinkTo(KafkaSink.<FraudAlert>builder()
            .setBootstrapServers("kafka:9092")
            .setRecordSerializer(new FraudAlertSerializer("fraud.alerts"))
            .build());

        env.execute("Fraud Detection");
    }
}
class FraudDetector extends KeyedProcessFunction<String, Transaction, FraudAlert> {

    // State: transaction count since the last reset
    private transient ValueState<Long> countState;
    // State: total amount since the last reset
    private transient ValueState<Long> amountState;
    // State: handle of the pending cleanup timer
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("tx-count", Long.class));
        amountState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("tx-amount", Long.class));
        timerState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<FraudAlert> out)
            throws Exception {
        long count = countState.value() == null ? 0L : countState.value();
        long amount = amountState.value() == null ? 0L : amountState.value();

        count++;
        amount += tx.getAmountCents();
        countState.update(count);
        amountState.update(amount);

        // Register cleanup timer 60s in the future (overwrite previous), so state
        // resets 60s after the most recent transaction for this key
        long timer = ctx.timerService().currentProcessingTime() + 60_000L;
        if (timerState.value() != null) {
            ctx.timerService().deleteProcessingTimeTimer(timerState.value());
        }
        ctx.timerService().registerProcessingTimeTimer(timer);
        timerState.update(timer);

        // Velocity check: > 10 transactions OR > $1000 accumulated
        if (count > 10 || amount > 100_000L) {
            out.collect(new FraudAlert(
                tx.getUserId(),
                tx.getPaymentId(),
                String.format("Velocity exceeded: %d txns, $%.2f in 60s",
                    count, amount / 100.0),
                FraudAlert.Severity.HIGH
            ));
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<FraudAlert> out)
            throws Exception {
        // 60 seconds of inactivity for this key: clear the accumulated state
        countState.clear();
        amountState.clear();
        timerState.clear();
    }
}
The Kappa Architecture Pattern
Modern event-driven systems increasingly use the Kappa Architecture (streaming-only, no batch):
Raw Events (Kafka, retained long-term)
        ↓
Stream Processor (Flink)
        ├── Real-time materialized views → Serving DB (Pinot, Redis)
        ├── Aggregated metrics → OLAP (ClickHouse)
        └── Derived events → Other Kafka topics
Benefits over Lambda Architecture (batch + stream):
- Single codebase for historical and real-time processing
- No divergence between batch and streaming results
- Replay from Kafka log to reprocess historical data
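The replay point deserves emphasis: because the log is the source of truth, shipping a new or corrected view means re-running its logic over the retained events instead of writing a one-off backfill job. A toy illustration, with a list standing in for the Kafka log:

```python
# The retained log: append-only facts, never mutated.
log = [
    {"user_id": "u1", "total_cents": 5000},
    {"user_id": "u2", "total_cents": 12000},
    {"user_id": "u1", "total_cents": 3000},
]

def revenue_per_user(events):
    """v1 view: total revenue per user."""
    view = {}
    for e in events:
        view[e["user_id"]] = view.get(e["user_id"], 0) + e["total_cents"]
    return view

def order_count_per_user(events):
    """v2 view, shipped later: rebuilt from the same log, no backfill job."""
    view = {}
    for e in events:
        view[e["user_id"]] = view.get(e["user_id"], 0) + 1
    return view

# Both views derive from the identical event history:
assert revenue_per_user(log) == {"u1": 8000, "u2": 12000}
assert order_count_per_user(log) == {"u1": 2, "u2": 1}
```

In production the "re-run" is a Flink job started from the earliest offset; the shape of the argument is the same.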
Key Operational Considerations
Kafka Sizing (2026 Rules of Thumb)
| Factor | Guideline |
|---|---|
| Partitions per topic | `max(target throughput ÷ ~10 MB/s per partition, number of consumers)` |
| Replication factor | 3 (with 2, a single broker failure means either blocked writes or data-loss risk) |
| Retention | Driven by replay needs; 7-30 days is common |
| Consumer lag alert | Alert when lag exceeds ~60 seconds' worth of messages |
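The partition guideline in the table can be written down directly; a sketch (the ~10 MB/s per-partition figure is a rule of thumb, not a hard limit; measure on your own hardware):

```python
import math

PER_PARTITION_MB_S = 10  # rule-of-thumb sustained throughput per partition

def partitions_for(target_mb_s: float, num_consumers: int) -> int:
    """Partition count = max(throughput-driven count, consumer parallelism)."""
    by_throughput = math.ceil(target_mb_s / PER_PARTITION_MB_S)
    return max(by_throughput, num_consumers)

# 120 MB/s target with 8 consumers: throughput dominates
assert partitions_for(120, 8) == 12
# Low throughput but 24 consumers: parallelism dominates
assert partitions_for(5, 24) == 24
```

Remember that repartitioning later remaps keys (see the partitioning section), so round up rather than down.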
Flink Checkpointing Best Practices
# flink-conf.yaml (config.yaml in recent Flink releases)
execution.checkpointing.interval: 30s
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.timeout: 3min
execution.checkpointing.max-concurrent-checkpoints: 1
state.backend.type: rocksdb   # 'state.backend' is the deprecated legacy key
state.checkpoints.dir: s3://my-flink-checkpoints/
state.savepoints.dir: s3://my-flink-savepoints/
Conclusion
Event-driven architecture with Kafka and Flink gives you the building blocks for truly real-time systems. The patterns — outbox, schema registry, stateful stream processing, windowed aggregations — are well-understood and battle-tested in production.
The shift in 2026 is toward Flink SQL for most use cases, with the DataStream API reserved for complex stateful logic. Managed services (Confluent Cloud, Amazon MSK, Ververica Platform) have reduced operational burden significantly.
Start with a single event stream, design your events carefully, and build processing pipelines incrementally. Real-time is no longer a luxury — it’s a baseline expectation. ⚡
