Apache Pulsar

Cloud-native distributed messaging and streaming platform

💫 What is Apache Pulsar?

Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It was originally developed at Yahoo, open-sourced in 2016, and became a top-level Apache Software Foundation project in 2018. Pulsar pairs the delivery semantics of traditional messaging systems with a cloud-native architecture.

Pulsar's unique architecture separates serving and storage layers, enabling independent scaling and providing features like multi-tenancy, geo-replication, and tiered storage out of the box. This design makes it particularly well-suited for large-scale, mission-critical applications in cloud environments.

With built-in support for both messaging and streaming workloads, Pulsar offers a unified platform that can handle everything from simple pub-sub messaging to complex stream processing scenarios, making it an attractive choice for organizations looking to consolidate their messaging infrastructure.

⭐ Key Features

Multi-Tenancy

Native multi-tenant architecture with namespace isolation, authentication, and authorization.

Geo-Replication

Built-in cross-datacenter replication for disaster recovery and global data distribution.

Tiered Storage

Automatic data offloading to cheaper storage (S3, GCS) for cost-effective long-term retention.

Schema Evolution

Built-in schema registry with support for Avro, JSON, and Protobuf schema evolution.

Functions

Lightweight compute framework for stream processing without external dependencies.

Unified Messaging

Single platform for streaming, queuing, and pub-sub messaging patterns.

🏗️ Pulsar Architecture

Pulsar's architecture is built around the separation of concerns, with distinct layers for serving, storage, and coordination. This design enables independent scaling and provides operational flexibility.

Brokers

Stateless serving layer that handles client connections and message routing.

Client ↔ Broker ↔ BookKeeper

BookKeeper

Distributed log storage system that provides durability and replication.

Ledgers: [L1] [L2] [L3]
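
As an illustration of how replicated log storage can spread entries across storage nodes, here is a minimal sketch. It assumes BookKeeper's ensemble, write quorum, and ack quorum settings and a round-robin striping rule. The function names and bookie names are hypothetical, and this is a toy model rather than BookKeeper's implementation.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry, striped round-robin across the ensemble."""
    size = len(ensemble)
    if not 1 <= write_quorum <= size:
        raise ValueError("write quorum must be between 1 and the ensemble size")
    start = entry_id % size
    return [ensemble[(start + offset) % size] for offset in range(write_quorum)]


def is_durable(acks_received, ack_quorum):
    """An entry counts as written once enough bookies have acknowledged it."""
    return acks_received >= ack_quorum


# Hypothetical three-node ensemble; each entry is stored on two of the nodes.
ensemble = ["bookie-1", "bookie-2", "bookie-3"]
for entry_id in range(4):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
```

With a write quorum smaller than the ensemble, consecutive entries land on different pairs of nodes, which spreads write load while keeping two copies of every entry.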

ZooKeeper

Coordination service for metadata management and cluster coordination.

Metadata + Coordination

Key Concepts

Hierarchy:

• Tenant: Top-level namespace for multi-tenancy
• Namespace: Administrative unit within tenant
• Topic: Message feed within namespace
• Subscription: Named cursor on a topic, shared by one or more consumers (roughly a Kafka consumer group)

Example Structure:

tenant/namespace/topic
company/app1/user-events
company/app2/order-stream
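
The hierarchy above can be sketched as a small parser. This is a minimal illustration, assuming that a full topic name carries a `persistent://` or `non-persistent://` domain prefix in front of the `tenant/namespace/topic` path. `TopicName` and `parse_topic` are hypothetical helpers, not part of the Pulsar client.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str, default_domain: str = "persistent") -> TopicName:
    """Split a topic name into its parts (hypothetical helper, not a Pulsar API)."""
    domain, sep, rest = name.partition("://")
    if not sep:
        # Short form such as "company/app1/user-events": assume the default domain.
        domain, rest = default_domain, name
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic, got {name!r}")
    return TopicName(domain, *parts)


print(parse_topic("company/app1/user-events"))
print(parse_topic("persistent://company/app2/order-stream"))
```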

📬 Messaging Models

📡 Publish-Subscribe (Streaming)

Each subscription maintains its own cursor, so multiple subscriptions can process the same message stream independently and at their own pace. This model suits event streaming and real-time analytics.

Characteristics:

• Independent cursors per subscription
• Message replay capability
• Multiple consumers per subscription
• Fan-out message delivery

Use Cases:

• Event streaming
• Real-time analytics
• Audit logging
• Notification systems
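
The independent-cursor behavior can be sketched with a toy in-memory model. This is not the Pulsar client API: the `Topic` class and its methods are hypothetical and exist only to show how one log can serve several subscriptions, each with its own read position and the ability to replay.

```python
class Topic:
    """Toy in-memory topic: an append-only log plus one cursor per subscription."""

    def __init__(self):
        self.log = []          # messages in publish order
        self.cursors = {}      # subscription name -> index of next message to read

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, subscription, from_earliest=True):
        # A new subscription starts at the beginning or at the current end of the log.
        self.cursors.setdefault(subscription, 0 if from_earliest else len(self.log))

    def receive(self, subscription):
        """Return the next message for this subscription, or None if caught up."""
        position = self.cursors[subscription]
        if position >= len(self.log):
            return None
        self.cursors[subscription] = position + 1
        return self.log[position]

    def seek(self, subscription, position):
        # Rewinding the cursor replays messages for this subscription only.
        self.cursors[subscription] = position


topic = Topic()
topic.subscribe("analytics")
topic.subscribe("audit")
for event in ["login", "click", "logout"]:
    topic.publish(event)

print(topic.receive("analytics"), topic.receive("analytics"))  # analytics reads two
print(topic.receive("audit"))                                  # audit is unaffected
topic.seek("analytics", 0)                                     # replay from the start
print(topic.receive("analytics"))
```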

📋 Message Queuing

With a Shared subscription, messages are distributed among the attached consumers in round-robin fashion, which balances load across multiple worker instances.

Characteristics:

• Round-robin message distribution
• Load balancing across consumers
• Message acknowledgments
• Automatic failover

Use Cases:

• Task processing
• Job queues
• Background processing
• Work distribution
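
The queuing characteristics above (round-robin distribution, acknowledgments, failover) can be sketched as a toy dispatcher. This is a simplified in-memory model, not Pulsar's broker logic, and `SharedSubscription` with its methods is a hypothetical name.

```python
from collections import deque
from itertools import cycle


class SharedSubscription:
    """Toy dispatcher: round-robin delivery with acknowledgments and redelivery."""

    def __init__(self, consumers):
        self.consumers = list(consumers)
        self.backlog = deque()     # messages waiting to be dispatched
        self.unacked = {}          # message id -> (consumer, payload)
        self.next_id = 0
        self._rotation = cycle(self.consumers)

    def publish(self, payload):
        self.backlog.append((self.next_id, payload))
        self.next_id += 1

    def dispatch(self):
        """Hand every backlog message to the next consumer in the rotation."""
        deliveries = []
        if not self.consumers:
            return deliveries      # nothing to deliver to; messages stay queued
        while self.backlog:
            message_id, payload = self.backlog.popleft()
            consumer = next(self._rotation)
            self.unacked[message_id] = (consumer, payload)
            deliveries.append((consumer, message_id, payload))
        return deliveries

    def ack(self, message_id):
        # An acknowledged message is never delivered again.
        self.unacked.pop(message_id, None)

    def consumer_failed(self, consumer):
        """Requeue the failed consumer's unacknowledged messages for the others."""
        self.consumers.remove(consumer)
        self._rotation = cycle(self.consumers)
        for message_id, (owner, payload) in sorted(self.unacked.items()):
            if owner == consumer:
                del self.unacked[message_id]
                self.backlog.append((message_id, payload))


sub = SharedSubscription(["worker-a", "worker-b"])
for job in ["job-0", "job-1", "job-2", "job-3"]:
    sub.publish(job)
for delivery in sub.dispatch():
    print(delivery)
sub.ack(0)                       # worker-a finishes job-0
sub.consumer_failed("worker-a")  # job-2 was delivered but never acknowledged
print(sub.dispatch())            # job-2 is redelivered to worker-b
```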

⚙️ Pulsar Functions

Pulsar Functions is a lightweight compute framework that enables stream processing directly within the Pulsar cluster. Functions can be written in Java, Python, or Go, and are automatically managed by the Pulsar runtime.

Function Features

• Lightweight: No external frameworks required
• Multi-language: Java, Python, Go support
• Auto-scaling: Scales based on message backlog
• State Management: Built-in state store

Function Types

• Transform: Message transformation and enrichment
• Filter: Conditional message processing
• Aggregate: Windowed aggregations
• Route: Content-based message routing

Function Example

Simple transformation function (Python):

def process(input):
    # Transform the message
    result = input.upper()
    return result

Deployment:

pulsar-admin functions create \
  --py transform.py \
  --inputs input-topic \
  --output output-topic
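
The other function types can be sketched in the same plain-function style as the transformation above. This is a minimal illustration that assumes JSON string payloads and that the runtime publishes nothing when a function returns None. The field names (`amount`, `region`) and topic names are hypothetical. Actually publishing to a computed topic would need the Functions SDK context, which is not shown here, so `choose_topic` only returns the name.

```python
import json


def filter_large_orders(input):
    """Filter: forward only orders of 100 or more, return None to drop the rest."""
    order = json.loads(input)
    if order.get("amount", 0) < 100:
        return None
    return input


def enrich(input):
    """Transform: add a derived field to the JSON payload."""
    order = json.loads(input)
    order["priority"] = "high" if order["amount"] >= 1000 else "normal"
    return json.dumps(order)


def choose_topic(input):
    """Route: pick a destination topic name from the message content."""
    order = json.loads(input)
    return f"orders-{order.get('region', 'unknown')}"


# Local dry run with made-up messages; no Pulsar cluster is involved.
messages = ['{"amount": 50, "region": "eu"}', '{"amount": 2500, "region": "us"}']
for raw in messages:
    kept = filter_large_orders(raw)
    if kept is not None:
        print(choose_topic(kept), enrich(kept))
```

Keeping each function free of broker dependencies makes it easy to test locally before deploying it.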

🔌 Connectors & Ecosystem

Pulsar provides a rich ecosystem of connectors for integrating with external systems. Clients of other messaging protocols, including Kafka, can be served through protocol handler plugins.

Source Connectors

Import data from external systems

• Debezium (CDC)
• Kafka Connect
• File sources
• Database sources

Sink Connectors

Export data to external systems

• Elasticsearch
• HDFS/S3
• JDBC databases
• Cloud storage

Protocol Support

Multiple protocol compatibility

• Native Pulsar protocol
• Kafka protocol (via the KoP protocol handler plugin)
• WebSocket
• HTTP REST
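
For the WebSocket and HTTP options, the sketch below builds the endpoint URLs and a producer frame without opening any connection. It assumes the `/ws/v2/producer/...` and `/admin/v2/.../stats` path layouts and the default web service port 8080. The host name and helper function names are hypothetical.

```python
import base64
import json


def websocket_producer_url(host, tenant, namespace, topic, port=8080):
    """WebSocket endpoint for producing to a persistent topic."""
    return f"ws://{host}:{port}/ws/v2/producer/persistent/{tenant}/{namespace}/{topic}"


def topic_stats_url(host, tenant, namespace, topic, port=8080):
    """Admin REST endpoint that reports statistics for a persistent topic."""
    return f"http://{host}:{port}/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats"


def websocket_message(text, properties=None):
    """JSON frame for the WebSocket producer; the payload is base64-encoded."""
    frame = {"payload": base64.b64encode(text.encode("utf-8")).decode("ascii")}
    if properties:
        frame["properties"] = properties
    return json.dumps(frame)


# "broker.example.com" is a placeholder host.
print(websocket_producer_url("broker.example.com", "company", "app1", "user-events"))
print(topic_stats_url("broker.example.com", "company", "app1", "user-events"))
print(websocket_message("hello", {"source": "demo"}))
```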

🎯 Common Use Cases

Enterprise Applications

• Financial Services: High-throughput trading and risk management systems
• E-commerce: Order processing and inventory management
• Gaming: Real-time player event processing
• IoT Platforms: Sensor data collection and analysis

Cloud-Native Patterns

• Multi-Region Deployments: Global data replication and disaster recovery
• Serverless Computing: Event-driven function execution
• Microservices: Service-to-service communication
• Edge Computing: Distributed data processing at the edge

📊 Performance & Characteristics

Performance Metrics

Throughput: 2.5M+ messages/second

Latency: Sub-5ms publish latency

Durability: Synchronous replication

Availability: 99.99% uptime in production

Deployment Models

Self-managed: Full control and customization

Cloud Services: StreamNative Cloud, DataStax Astra Streaming

Kubernetes: Helm charts and operators

Containers: Docker Compose, Docker Swarm

🔗 Related Technologies

Pulsar fits into the broader messaging and streaming ecosystem alongside these technologies:

Alternative Platforms

• Apache Kafka - Distributed streaming platform
• RabbitMQ - Traditional AMQP broker
• Redis - In-memory pub/sub messaging

Complementary Tools

• Apache BookKeeper - Distributed log storage
• Apache Flink - Stream processing engine
• Presto/Trino - SQL query engine
