📊 Data Engineering · ⚙️ AI Infrastructure · Advanced

Apache Kafka

A distributed event streaming platform for building real-time data pipelines and streaming applications at scale.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform capable of handling trillions of events per day. Originally developed at LinkedIn and open-sourced in 2011, it's now maintained by the Apache Software Foundation.

Core Concepts

Topics

A topic is a category/feed name to which records are published. Topics are partitioned for parallelism:

Topic: user-events
├── Partition 0: [event1, event4, event7...]
├── Partition 1: [event2, event5, event8...]
└── Partition 2: [event3, event6, event9...]
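
How a record lands in a partition: Kafka's default partitioner hashes the record key (using murmur2) modulo the partition count, so records with the same key always go to the same partition, preserving per-key order. A simplified, broker-free sketch of the idea, using CRC32 as a stand-in hash:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Sketch of keyed partitioning: hash the key, mod the partition count.
    Kafka's default partitioner uses murmur2; CRC32 here is just a stand-in."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, so all events for
# "user-123" are delivered in order relative to each other.
print(partition_for("user-123", 3))
```

Records sent without a key are instead spread across partitions (round-robin or sticky batching, depending on the client version).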

Producers & Consumers

[Producer A] ──┐
[Producer B] ──┼──► [Kafka Topic] ──┬──► [Consumer 1]
[Producer C] ──┘                    └──► [Consumer 2]

Consumer Groups

Consumers in a group share partitions for parallel processing:

Consumer Group: analytics
├── Consumer 1 → Partition 0, 1
└── Consumer 2 → Partition 2
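
The split shown above matches Kafka's range assignment strategy: partitions are divided as evenly as possible among the sorted members of the group, with the first members absorbing any remainder. A broker-free sketch (the function name is illustrative, not a Kafka API):

```python
def range_assign(partitions, consumers):
    """Sketch of range partition assignment: sort the group members,
    then hand each one a contiguous slice of the partition list."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)  # early members take the remainder
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment

print(range_assign([0, 1, 2], ["consumer-1", "consumer-2"]))
# → {'consumer-1': [0, 1], 'consumer-2': [2]}
```

Whenever a consumer joins or leaves, the group rebalances and partitions are reassigned the same way.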

Brokers

Kafka runs as a cluster of servers (brokers) that store and serve data. Each partition is replicated across multiple brokers for fault tolerance.

Key Features

Feature       Description
Durability    Messages persisted to disk and replicated across brokers
Scalability   Horizontal scaling via partitions
Throughput    Millions of messages per second
Ordering      Guaranteed within a partition
Retention     Configurable by time or size
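
Retention is set per broker (with per-topic overrides). An illustrative broker-level fragment (server.properties) combining time- and size-based limits:

```properties
# Delete log segments older than 7 days...
log.retention.hours=168
# ...or once a partition exceeds ~1 GiB, whichever limit is hit first
log.retention.bytes=1073741824
```

Individual topics can override these defaults via the retention.ms and retention.bytes topic configs.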

Use Cases

  1. Event Sourcing - Store all state changes as events
  2. Log Aggregation - Collect logs from multiple services
  3. Stream Processing - Real-time data transformations
  4. Message Queue - Decouple microservices
  5. Change Data Capture - Sync database changes
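
To make the event-sourcing use case concrete, here is a minimal, broker-free sketch: current state is never stored directly but is rebuilt by replaying the event log from the beginning, exactly as a Kafka consumer would replay a topic. The event shapes and function names are illustrative, not a Kafka API:

```python
def apply_event(balance, event):
    """Apply one state-change event to an account balance."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    if event["type"] == "withdraw":
        return balance - event["amount"]
    return balance  # unknown event types are ignored

def replay(events, initial=0):
    """Rebuild current state by folding over the full event log."""
    balance = initial
    for event in events:
        balance = apply_event(balance, event)
    return balance

log = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]
print(replay(log))  # → 75
```

Because the log is the source of truth, new projections (analytics, audit views) can be derived later by replaying the same events.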

Python Example

# Uses the kafka-python client (pip install kafka-python)
from kafka import KafkaProducer, KafkaConsumer
import json

# Producer: serialize dicts to JSON bytes before sending
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("user-events", {"user_id": 123, "action": "login"})
producer.flush()  # block until buffered messages are delivered

# Consumer: join the "analytics" group; start from the earliest
# offset if the group has no committed position yet
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["localhost:9092"],
    group_id="analytics",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    print(f"Received: {message.value}")

Kafka vs Alternatives

Feature      Kafka          RabbitMQ   Redis Streams
Throughput   Very high      Medium     High
Persistence  Yes            Optional   Yes
Ordering     Per partition  Per queue  Yes
Replay       Yes            No         Yes
Complexity   High           Medium     Low
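
Replay is where Kafka differs most from a classic message queue: because the log is retained after consumption, a consumer can rewind its offset and re-read past records. A broker-free sketch of that behavior (class and method names are illustrative):

```python
class ReplayableLog:
    """Toy model of a retained, replayable log with a per-consumer offset."""

    def __init__(self):
        self.records = []   # append-only; records survive being read
        self.offset = 0     # this consumer's read position

    def append(self, record):
        self.records.append(record)

    def poll(self):
        """Return unread records and advance the offset."""
        batch = self.records[self.offset:]
        self.offset = len(self.records)
        return batch

    def seek(self, offset):
        """Rewind (or fast-forward) the consumer's position."""
        self.offset = offset

log = ReplayableLog()
for r in ["a", "b", "c"]:
    log.append(r)

first = log.poll()   # reads all three records
log.seek(0)          # rewind to the beginning
again = log.poll()   # same records again; a classic queue would have deleted them
print(first == again)  # → True
```

In real Kafka this is `consumer.seek(partition, offset)`; retention settings decide how far back replay can reach.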

Ecosystem

  • Kafka Streams - Stream processing library
  • Kafka Connect - Data integration framework
  • Schema Registry - Schema management (Avro, Protobuf)
  • ksqlDB - SQL interface for stream processing

When to Use Kafka

✅ High throughput requirements (100k+ msg/sec)
✅ Event sourcing / audit logs
✅ Real-time analytics pipelines
✅ Microservices communication at scale

❌ Simple pub/sub with few consumers
❌ Hard low-latency requirements (<10 ms)
❌ Small-scale applications

Example Usage

LinkedIn, where Kafka originated, has reported processing over 7 trillion messages per day through Kafka, powering real-time activity feeds and analytics.