# What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events per day. Originally developed at LinkedIn and open-sourced in 2011, it is now maintained by the Apache Software Foundation.
## Core Concepts

### Topics

A topic is a named category or feed to which records are published. Topics are split into partitions for parallelism:

```
Topic: user-events
├── Partition 0: [event1, event4, event7...]
├── Partition 1: [event2, event5, event8...]
└── Partition 2: [event3, event6, event9...]
```
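Which partition a record lands in is decided by the record's key: Kafka's default partitioner hashes the key (with murmur2), so all records with the same key go to the same partition and stay in order. A minimal sketch of the idea, using a deliberately simple byte-sum hash in place of Kafka's actual murmur2:

```python
# Illustrative only: Kafka's real default partitioner uses a murmur2 hash.
# A stable toy hash is enough to show the key property: same key -> same partition.

def assign_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition deterministically."""
    return sum(key) % num_partitions  # toy hash; Kafka uses murmur2

# Every event keyed by "user-123" lands in the same partition,
# so that user's events are consumed in the order they were produced.
p1 = assign_partition(b"user-123", 3)
p2 = assign_partition(b"user-123", 3)
assert p1 == p2
```

Records sent without a key are instead spread across partitions (round-robin or sticky batching), trading per-key ordering for even load.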
Producers & Consumers
[Producer A] ──┐
[Producer B] ──┼──► [Kafka Topic] ──┬──► [Consumer 1]
[Producer C] ──┘ └──► [Consumer 2]
### Consumer Groups

Consumers in a group divide a topic's partitions among themselves for parallel processing; each partition is read by exactly one consumer in the group:

```
Consumer Group: analytics
├── Consumer 1 → Partition 0, 1
└── Consumer 2 → Partition 2
```
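The split above is negotiated through Kafka's group coordinator; clients never implement it themselves. As a sketch, the "range" strategy (one of Kafka's built-in assignors) hands each consumer a contiguous slice of partitions. Function and consumer names here are illustrative, not the client API:

```python
# Sketch of range-style partition assignment within one consumer group.
# Illustrative only: real assignment is negotiated via the group coordinator.

def range_assign(consumers: list[str], partitions: list[int]) -> dict[str, list[int]]:
    """Split partitions into contiguous ranges, one range per consumer."""
    consumers = sorted(consumers)
    n, k = len(partitions), len(consumers)
    per, extra = divmod(n, k)
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = per + (1 if i < extra else 0)  # first `extra` consumers get one more
        assignment[c] = partitions[start:start + count]
        start += count
    return assignment

print(range_assign(["consumer-1", "consumer-2"], [0, 1, 2]))
# consumer-1 -> [0, 1], consumer-2 -> [2]
```

When a consumer joins or leaves, the group rebalances and partitions are reassigned, which is why a group can scale out only up to the number of partitions.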
### Brokers

Kafka runs as a cluster of servers (brokers) that store and serve data. Each partition is replicated across multiple brokers for fault tolerance, with one broker acting as that partition's leader.
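For example, creating a topic with three partitions, each replicated on two brokers, looks like this with the CLI tool shipped in the Kafka distribution (the broker address and topic name are illustrative):

```
# Create a topic with 3 partitions and a replication factor of 2
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 3 \
  --replication-factor 2
```

The replication factor cannot exceed the number of brokers in the cluster.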
## Key Features
| Feature | Description |
|---|---|
| Durability | Messages persisted to disk, replicated |
| Scalability | Horizontal scaling via partitions |
| Speed | Millions of messages/second |
| Ordering | Guaranteed within partition |
| Retention | Configurable (time or size based) |
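Retention, for instance, can be configured broker-wide or per topic, by time or by size (whichever limit is hit first). The values below are illustrative, not recommendations:

```
# Broker-wide defaults (server.properties)
log.retention.hours=168        # delete segments older than 7 days
log.retention.bytes=1073741824 # ...or once a partition exceeds ~1 GiB
log.segment.bytes=1073741824   # size of each on-disk log segment file
```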
## Use Cases
- Event Sourcing - Store all state changes as events
- Log Aggregation - Collect logs from multiple services
- Stream Processing - Real-time data transformations
- Message Queue - Decouple microservices
- Change Data Capture - Sync database changes
## Python Example

A minimal producer and consumer using the `kafka-python` client (`pip install kafka-python`), assuming a broker at `localhost:9092`:

```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: JSON-serialize values before sending
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("user-events", {"user_id": 123, "action": "login"})
producer.flush()  # block until buffered messages are delivered

# Consumer: joins the "analytics" group and deserializes JSON values
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics",
    value_deserializer=lambda m: json.loads(m.decode()),
)
for message in consumer:
    print(f"Received: {message.value}")
```
## Kafka vs Alternatives
| Feature | Kafka | RabbitMQ | Redis Streams |
|---|---|---|---|
| Throughput | Very High | Medium | High |
| Persistence | Yes | Optional | Yes |
| Ordering | Per partition | Per queue | Per stream |
| Replay | Yes | No | Yes |
| Complexity | High | Medium | Low |
## Ecosystem
- Kafka Streams - Stream processing library
- Kafka Connect - Data integration framework
- Schema Registry - Schema management (Avro, Protobuf)
- ksqlDB - SQL interface for stream processing
## When to Use Kafka
✅ High throughput requirements (100k+ msg/sec)
✅ Event sourcing / audit logs
✅ Real-time analytics pipelines
✅ Microservices communication at scale
❌ Simple pub/sub with few consumers
❌ Ultra-low-latency requirements (consistent sub-10 ms delivery)
❌ Small-scale applications