📊 Data Engineering · ⚙️ AI Infrastructure · Advanced

Apache Kafka

A distributed event streaming platform for building real-time data pipelines and streaming applications at scale.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events per day. Originally developed at LinkedIn and open-sourced in 2011, it's now maintained by the Apache Software Foundation.

Core Concepts

Topics

A topic is a category/feed name to which records are published. Topics are partitioned for parallelism:

Topic: user-events
├── Partition 0: [event1, event4, event7...]
├── Partition 1: [event2, event5, event8...]
└── Partition 2: [event3, event6, event9...]
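
How a record lands in a partition: Kafka's default partitioner hashes the record key (using murmur2) modulo the partition count, so records with the same key always go to the same partition, preserving per-key order. A simplified, broker-free sketch of the idea, using CRC32 as a stand-in hash:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Sketch of keyed partitioning: hash the key, mod the partition count.
    Kafka's default partitioner uses murmur2; CRC32 here is just a stand-in."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, so all events for
# "user-123" are delivered in order relative to each other.
print(partition_for("user-123", 3))
```

Records sent without a key are instead spread across partitions (round-robin or sticky batching, depending on the client version).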

Producers & Consumers

[Producer A] ──┐
[Producer B] ──┼──► [Kafka Topic] ──┬──► [Consumer 1]
[Producer C] ──┘                    └──► [Consumer 2]

Consumer Groups

Consumers in a group share partitions for parallel processing:

Consumer Group: analytics
├── Consumer 1 → Partition 0, 1
└── Consumer 2 → Partition 2
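
The split shown above matches Kafka's range assignment strategy: partitions are divided as evenly as possible among the sorted members of the group, with the first members absorbing any remainder. A broker-free sketch (the function name is illustrative, not a Kafka API):

```python
def range_assign(partitions, consumers):
    """Sketch of range partition assignment: sort the group members,
    then hand each one a contiguous slice of the partition list."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)  # early members take the remainder
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

print(range_assign([0, 1, 2], ["consumer-1", "consumer-2"]))
# → {'consumer-1': [0, 1], 'consumer-2': [2]}
```

Whenever a consumer joins or leaves, the group rebalances and partitions are reassigned the same way.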

Brokers

Kafka runs as a cluster of servers (brokers) that store and serve data. Each partition is replicated across multiple brokers for fault tolerance.

Key Features

Feature       Description
Durability    Messages persisted to disk and replicated across brokers
Scalability   Horizontal scaling via partitions
Throughput    Millions of messages per second
Ordering      Guaranteed within a partition
Retention     Configurable by time or size
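
Retention is set per broker (with per-topic overrides). An illustrative broker-level fragment (server.properties) combining time- and size-based limits:

```properties
# Delete log segments older than 7 days...
log.retention.hours=168
# ...or once a partition exceeds ~1 GiB, whichever limit is hit first
log.retention.bytes=1073741824
```

Individual topics can override these defaults via the retention.ms and retention.bytes topic configs.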

Use Cases

  1. Event Sourcing - Store all state changes as events
  2. Log Aggregation - Collect logs from multiple services
  3. Stream Processing - Real-time data transformations
  4. Message Queue - Decouple microservices
  5. Change Data Capture - Sync database changes
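
To make the event-sourcing use case concrete, here is a minimal, broker-free sketch: current state is never stored directly but is rebuilt by replaying the event log from the beginning, exactly as a Kafka consumer would replay a topic. The event shapes and function names are illustrative, not a Kafka API:

```python
def apply_event(balance, event):
    """Apply one state-change event to an account balance."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    if event["type"] == "withdraw":
        return balance - event["amount"]
    return balance  # unknown event types are ignored

def replay(events, initial=0):
    """Rebuild current state by folding over the full event log."""
    balance = initial
    for event in events:
        balance = apply_event(balance, event)
    return balance

log = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]
print(replay(log))  # → 75
```

Because the log is the source of truth, new projections (analytics, audit views) can be derived later by replaying the same events.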

Python Example

# Uses the kafka-python client (pip install kafka-python)
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: serialize dicts to JSON bytes before sending
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("user-events", {"user_id": 123, "action": "login"})
producer.flush()  # block until buffered messages are delivered

# Consumer: join the "analytics" group; start from the earliest
# offset if the group has no committed position yet
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    print(f"Received: {message.value}")

Kafka vs Alternatives

Feature      Kafka          RabbitMQ   Redis Streams
Throughput   Very high      Medium     High
Persistence  Yes            Optional   Yes
Ordering     Per partition  Per queue  Yes
Replay       Yes            No         Yes
Complexity   High           Medium     Low
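
Replay is where Kafka differs most from a classic message queue: because the log is retained after consumption, a consumer can rewind its offset and re-read past records. A broker-free sketch of that behavior (class and method names are illustrative):

```python
class ReplayableLog:
    """Toy model of a retained, replayable log with a per-consumer offset."""

    def __init__(self):
        self.records = []   # append-only; records survive being read
        self.offset = 0     # this consumer's read position

    def append(self, record):
        self.records.append(record)

    def poll(self):
        """Return unread records and advance the offset."""
        batch = self.records[self.offset:]
        self.offset = len(self.records)
        return batch

    def seek(self, offset):
        """Rewind (or fast-forward) the consumer's position."""
        self.offset = offset

log = ReplayableLog()
for r in ["a", "b", "c"]:
    log.append(r)

first = log.poll()   # reads all three records
log.seek(0)          # rewind to the beginning
again = log.poll()   # same records again; a classic queue would have deleted them
print(first == again)  # → True
```

In real Kafka this is `consumer.seek(partition, offset)`; retention settings decide how far back replay can reach.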

Ecosystem

  • Kafka Streams - Stream processing library
  • Kafka Connect - Data integration framework
  • Schema Registry - Schema management (Avro, Protobuf)
  • ksqlDB - SQL interface for stream processing

When to Use Kafka

✅ High throughput requirements (100k+ msg/sec)
✅ Event sourcing / audit logs
✅ Real-time analytics pipelines
✅ Microservices communication at scale

❌ Simple pub/sub with few consumers
❌ Hard low-latency requirements (<10 ms)
❌ Small-scale applications

Example Usage

LinkedIn, where Kafka originated, has reported processing over 7 trillion messages per day through Kafka, powering real-time activity feeds and analytics.