Introduction: Kafka as a Strategic Foundation for Data-Driven Companies
In today’s digital economy, shaped by real-time data and the demands of Industry 4.0, Chief Technology Officers (CTOs) and VPs of Engineering face the challenge of creating architectures that not only meet current business needs but also enable future growth and innovation. In this context, Apache Kafka has evolved from a niche tool into a fundamental technology used by over 80% of Fortune 100 companies.1 However, it is a common misconception to view Kafka merely as an enhanced messaging queue. In reality, Kafka is a distributed event streaming platform that enables companies to publish, store, process, and react to streams of events in real time.2 For technical leaders, understanding Kafka is not just a technical necessity, but a strategic one.
Implementing Kafka as the central nervous system of the data infrastructure offers three crucial strategic advantages:3
- Decoupling of Systems: Kafka enables the creation of a loosely coupled architecture where different systems and microservices act as independent producers or consumers of data streams.4 Producers send data to Kafka without needing to know which systems will later use that data. Likewise, consumers can process data without knowing the origin or implementation details of the producers.4 This decoupling is key to creating agile and maintainable system landscapes that can evolve independently.
- Real-Time Data Processing: Unlike traditional batch processing systems, Kafka allows for the processing of events as they occur.5 This is the foundation for a variety of business-critical use cases, from instant fraud detection in the financial sector and dynamic pricing in e-commerce to predictive maintenance in the industrial IoT.1
- Future-Proof Data Infrastructure: Kafka serves as a durable, fault-tolerant storage for business events (messages).1 These events can be retained for a configurable period—from seconds to years, or even indefinitely.6 This creates a “single source of truth” that new applications can access at any time to analyze historical data or implement new business logic without having to burden the original source systems again.
However, the true strategic importance of Kafka lies deeper than in mere performance metrics. The introduction of Kafka enforces architectural discipline. It shifts the paradigm from fragile point-to-point integrations, which often lead to a hard-to-maintain “spaghetti architecture,” to a robust, broadcast-based model.7 In a traditional architecture, adding a new data-consuming service often requires modifications to several existing systems to establish the new data flows. With Kafka as a central hub, a new service can simply subscribe to an existing data stream without a single line of code being changed in the producing systems. For a CTO, choosing Kafka is therefore not just a technology choice, but an investment in organizational agility, a reduction in long-term technical debt, and an acceleration of the time-to-market for future data-driven products and services.
The Core Architecture of Kafka: An Overview for Decision-Makers
To fully leverage the strategic advantages of Kafka, a fundamental understanding of its core components is essential. Each component plays a specific role in ensuring scalability, fault tolerance, and high performance. The architecture is deliberately designed to scale horizontally and handle massive data volumes.2
Message
The Message (also known as a record or event) is the atomic unit of data in Kafka. It represents an immutable fact that “something happened.”6 A message typically consists of a Key, a Value, a Timestamp, and optional Headers for metadata.8 The value contains the actual payload (e.g., a sensor reading or transaction information), while the key is crucial for partitioning and ensuring order.
Topic
A Topic is a logical channel or category in which messages are organized. You can think of it as an analogue to a table in a relational database.6 Producers write messages to topics, and consumers read messages from topics.9 Topics are the primary level of abstraction that developers interact with to manage and separate data streams.10
Partition
The Partition is the fundamental unit of parallelism and scalability in Kafka.11 Each topic is divided into one or more partitions. Each partition is an ordered, immutable, append-only log of messages.6 By splitting a topic across multiple partitions, Kafka can distribute the data load and processing requests across multiple servers in the cluster, thus overcoming the scalability limits of a single machine.12 Within a partition, the order of messages is strictly guaranteed.12
Broker & Cluster
A Kafka Cluster is a distributed system consisting of one or more servers, each known as a Broker.6 Each broker is a standalone Kafka server that hosts a subset of the partitions for various topics. It is responsible for handling write requests from producers, serving read requests from consumers, and replicating partition data to other brokers to ensure fault tolerance.13 The interplay of brokers in a cluster provides high availability and resilience.2
Producer
A Producer is a client application that writes (publishes) messages to Kafka topics.6 The producer is responsible for deciding which partition of a topic a message is written to. This can be based on the message key or done via a round-robin strategy to distribute the load evenly.9
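To make the producer's role concrete, here is a minimal sketch using Kafka's Java client. The broker address, topic name, key, and payload are illustrative assumptions; the key determines the partition, and the optional header carries metadata as described above.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key, value, and an optional header make up the record; the key drives partitioning.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("sensor-readings", "sensor-42", "{\"temperature\": 21.5}");
            record.headers().add("source", "demo-gateway".getBytes());

            producer.send(record, (RecordMetadata metadata, Exception e) -> {
                if (e != null) {
                    e.printStackTrace();
                } else {
                    System.out.printf("Written to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```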
Consumer & Consumer Group
A Consumer is a client application that reads messages from Kafka topics. To scale and parallelize processing, consumers are organized into Consumer Groups.11 Kafka ensures that each partition of a topic is read by exactly one consumer within a consumer group at any given time.14 When new consumers join or leave a group, Kafka automatically redistributes the partitions among the remaining members. This mechanism, known as rebalancing, enables dynamic load balancing and fault tolerance.
The Distributed Commit Log
At its core, Kafka is a distributed, replicated, append-only commit log.6 This design is the source of its exceptional performance, as it is optimized for sequential disk I/O operations, which are handled extremely efficiently by modern operating systems. It is also the basis for its durability and fault tolerance.
The interaction between partitions and consumer groups is a key differentiator of Kafka from traditional messaging systems. It allows Kafka to simultaneously realize the semantics of a message queue (work distribution) and a publish-subscribe system (broadcast). Within a single consumer group, the partitions of a topic are distributed among the members. Each message is therefore processed by only one consumer in that group, which corresponds to the model of a queue where multiple workers process tasks.15 At the same time, however, multiple independent consumer groups can subscribe to the same topic.5 Each group receives a complete, independent copy of all messages in the topic and processes them in parallel. This corresponds to the publish-subscribe model, where a message is sent to all subscribers.16 This dual nature is made possible because Kafka decouples the reading of data from its deletion. Messages are not removed after being read but are stored based on a retention policy.4 Each consumer (or consumer group) simply manages its own position (the so-called offset) in the log.17 For a CTO, this means that a single data stream—for example, from smart meters—only needs to be produced once and can then be consumed by different departments (e.g., billing, grid analytics, customer service) independently and without mutual interference. This maximizes data reuse and minimizes the load on the producing systems.
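The following sketch shows a consumer that joins a consumer group via its group.id, again assuming a local broker and the hypothetical sensor-readings topic. Starting a second instance with the same group.id splits the partitions between the two instances (queue semantics); starting an instance with a different group.id receives the full stream again (publish-subscribe semantics).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", "billing-service");            // consumers sharing this id split the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");           // start from the beginning of the retained log

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-readings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The offset tracks this group's position independently of any other group.
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                        record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}
```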
The Evolution of Metadata Management: From ZooKeeper to KRaft
One of the most significant architectural advancements in the history of Apache Kafka is the transition from its dependency on Apache ZooKeeper to self-managed metadata using KRaft (Kafka Raft Metadata mode). This change is of strategic importance for technical leaders as it fundamentally improves the operational complexity, scalability, and resilience of Kafka clusters.
The Historical Role of ZooKeeper
For many years, Apache ZooKeeper was an indispensable part of any Kafka deployment. It was a separate, distributed coordination service responsible for managing critical metadata for the Kafka cluster.9 Its main tasks included:
- Controller Election: Electing a broker to be the “controller,” which is responsible for managing partition leaders and coordinating state changes in the cluster.18
- Cluster Membership: Tracking the active brokers in the cluster.18
- Topic Configuration: Storing configurations for topics, including the number of partitions, replication factors, and other settings.18
- Access Control Lists (ACLs): Managing permissions for accessing topics.18
The Challenges with ZooKeeper
Although ZooKeeper fulfilled its role, the dependency on this external system brought significant challenges:
- Operational Complexity: Operating a Kafka cluster required deploying, managing, monitoring, and securing two separate distributed systems.19 Each system had its own configuration parameters, failure modes, and operational playbooks, which increased the Total Cost of Ownership (TCO).20
- Scalability Bottleneck: ZooKeeper became a bottleneck for very large Kafka clusters. In particular, the number of partitions a cluster could efficiently manage was limited by ZooKeeper’s performance, often to a few hundred thousand partitions.21
- Slow Failover: In the event of a failure of the acting controller broker, the failover process was slow. The new controller first had to load the entire metadata state from ZooKeeper before it could take over its duties. This led to a temporary unavailability of the cluster for metadata operations like creating topics or rebalancing consumer groups.18
Introduction to KRaft (Kafka Raft Metadata mode)
With KRaft, the dependency on ZooKeeper has been eliminated. Instead, Kafka now implements the Raft consensus algorithm directly within a dedicated group of nodes known as the controller quorum.22 The cluster’s metadata is no longer stored in ZooKeeper but in an internal, highly available Kafka topic named __cluster_metadata.21 Kafka thus uses its own proven mechanisms for replication and log storage to manage its own metadata.
Strategic Advantages of KRaft
The switch to KRaft offers decisive strategic advantages:
- Simplified Architecture and Operations: There is now only one system to deploy, manage, monitor, and secure. This significantly reduces the TCO and simplifies the entire operational landscape.19
- Massive Scalability: KRaft is designed to support clusters with millions of partitions, eliminating the previous scalability bottleneck and making the architecture future-proof.21
- Near-Instantaneous Failover: The standby controllers in the quorum continuously replicate the metadata log. In a failover, a new leader is elected and is immediately active because it already has the entire state in memory. This drastically improves cluster availability.18
- Unified Security Model: A single security model for both data and metadata simplifies administration and reduces potential security vulnerabilities.19
The introduction of KRaft is more than just a technical upgrade; it is a fundamental strengthening of the Kafka architecture. It transforms Kafka from a system dependent on a coordination service to a system that is itself a self-sufficient coordination service. This internalization of a critical dependency reduces systemic risk and operational fragility. For a CTO, this is a significant risk mitigation. The number of “moving parts” is reduced, the operational playbook is simplified, and the entire platform becomes more resilient and predictable. While the performance and scalability gains are immense, the reduction in operational complexity and risk is the overriding strategic advantage. Since Apache Kafka 3.3, KRaft mode is considered production-ready for new clusters, and migration from existing ZooKeeper-based clusters is actively being developed.19
Configuration and Best Practices: Guardrails for Stability and Performance
The true power of Kafka is only unlocked through thoughtful configuration. The default settings are often a compromise that is not optimal for every use case. For technical leaders, it is crucial to understand that these configurations are not merely technical details but levers to consciously align the architecture with business requirements. It’s about managing the fundamental trade-offs between durability, availability, latency, and throughput.
Durability vs. Availability: The Crucial Trade-off
The most important configuration decision in Kafka revolves around the guarantee that no data is lost, even in the event of server failures. This is controlled by the interplay of three central parameters.
- replication.factor: This is the total number of copies of each partition stored in the cluster. A typical and recommended value for production environments is 3.23 This means there is one “leader” replica and two “follower” replicas distributed across different brokers. This allows for the failure of up to two brokers without data loss.24 An even more robust configuration strategy, sometimes referred to as “RF++,” recommends setting the replication factor to min.insync.replicas + 2. With a min.insync.replicas value of 2, this would mean a replication.factor of 4. This configuration provides increased fault tolerance, allowing one broker to be taken down for planned maintenance while still tolerating the unexpected failure of another broker without compromising the cluster’s write availability.
- In-Sync Replicas (ISR): This is the set of replicas that are considered “fully synchronized” with the leader. A follower that falls too far behind in replication is removed from the ISR list.25
- Producer acks: This setting on the producer side determines when a write operation is confirmed as successful. It is the direct lever for trading off latency against durability.14
- acks=0 (“Fire-and-forget”): The producer sends the message and does not wait for any confirmation. This offers the lowest latency but no guarantee of delivery. Data can be lost during network problems or broker failures.26
- acks=1 (Leader confirmation): The producer waits for confirmation from the leader that the message has been written to its log. This is a good compromise between performance and safety. However, data can be lost if the leader fails before the followers have replicated the message.26
- acks=all (or -1): The producer waits until all replicas in the current ISR list have confirmed the message. This provides the highest durability guarantee but comes with the highest latency.27
- min.insync.replicas: This parameter at the broker or topic level sets the minimum number of replicas that must be in the ISR for a write with acks=all to be accepted. If the number of available ISRs falls below this value, the broker rejects the write with an error message. This prioritizes consistency over availability.23

The min.insync.replicas configuration is the direct technical implementation of a company’s risk tolerance. It translates a business requirement like “We must not lose financial transactions under any circumstances” into concrete system behavior: “Reject writes if redundancy is compromised.” For a business-critical data stream, the standard configuration for maximum durability is replication.factor=3, min.insync.replicas=2, and acks=all on the producer side.28 This combination ensures that every confirmed message is safely stored on at least two different machines. The system can survive the failure of one broker without losing data and without affecting write availability. For less critical data, such as web clickstreams for analytics, where low latency and high availability are more important than a 100% data guarantee, acks=1 could be a legitimate choice. These configurations are therefore not purely IT decisions but should be made in consultation with the business departments to reflect the value and criticality of the respective data. (A configuration sketch follows Table 1.)
Table 1: Trade-offs of the Producer acks Setting
| acks Setting | Latency | Throughput | Durability Guarantee | Typical Use Case |
|---|---|---|---|---|
| acks=0 | Lowest | Highest | None (data loss possible) | Logging, metrics (where occasional loss is tolerable) |
| acks=1 | Low | High | Good (data loss possible on leader failure before replication) | Standard use cases, web tracking |
| acks=all | Highest | Lower | Highest (no data loss as long as min.insync.replicas is met) | Financial transactions, critical business events |
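To make these settings concrete, here is a minimal Java sketch that creates a topic with replication.factor=3 and min.insync.replicas=2 via the AdminClient and configures a producer with acks=all. The broker address, topic name, and partition count are illustrative assumptions, not prescriptions from the text above.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        // Topic with replication.factor=3 and min.insync.replicas=2 (the "2 of 3" durability pattern).
        try (Admin admin = Admin.create(adminProps)) {
            NewTopic topic = new NewTopic("payments.transactions.v1", 6, (short) 3) // name/partitions illustrative
                .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }

        // Producer that only treats a write as successful once the ISR quorum has acknowledged it.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // ... send business-critical records here ...
        }
    }
}
```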
Partitioning Strategies for Scalability and Order
Partitioning is the key to Kafka’s scalability. The right strategy depends on whether the order of messages or even load distribution is the priority.
- Partitions as the Unit of Parallelism: The number of partitions in a topic defines the upper limit for consumer parallelism within a consumer group. If a topic has 10 partitions, up to 10 consumer instances can work in parallel to process the data. More partitions generally allow for higher overall throughput.10 The ratio between the number of consumers and partitions is crucial: if there are more consumers than partitions, the excess consumers will remain idle, as each partition can only be assigned to one consumer per group. Conversely, if there are fewer consumers than partitions, some or all consumers will process messages from multiple partitions to distribute the load.
- The Role of the Message Key: The key of a message is crucial for the partitioning strategy.
- With a Key: If a message has a key, the producer applies a hash function to the key to deterministically select a partition (typically hash(key) % num_partitions). This guarantees that all messages with the same key always land in the same partition. Since order is guaranteed within a partition, the order for all messages with that key is also ensured.4 (A small sketch of this principle follows this list.)
- Without a Key (null key): If no key is provided, the producer defaults to distributing messages in a round-robin fashion across all available partitions. This ensures very even load distribution, but there is no guarantee of order for related messages.9
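The following sketch illustrates the principle of key-based partitioning. It deliberately uses a simplified hash; the real Java client's default partitioner uses a murmur2 hash internally, so the exact partition numbers will differ, but the property that matters (the same key always maps to the same partition) is identical.

```java
import java.nio.charset.StandardCharsets;

public class KeyPartitioningSketch {
    // Simplified stand-in for the client's partitioner: hash(key) % num_partitions.
    // The real default partitioner uses murmur2; this sketch only demonstrates that
    // equal keys always map to the same partition, which preserves per-key ordering.
    static int partitionFor(String key, int numPartitions) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = 0;
        for (byte b : bytes) {
            hash = 31 * hash + b;                 // simple, deterministic hash
        }
        return Math.floorMod(hash, numPartitions); // non-negative partition index
    }

    public static void main(String[] args) {
        int partitions = 6;
        System.out.println(partitionFor("meter-1001", partitions));
        System.out.println(partitionFor("meter-1001", partitions)); // identical result: same partition
        System.out.println(partitionFor("meter-2002", partitions)); // may land on a different partition
    }
}
```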
Rules of Thumb for Producers and Consumers
- Producer Optimization:
- Batching for Throughput: To maximize throughput, producers should be configured to send messages in batches. The parameters batch.size (the size of a batch in bytes) and linger.ms (the maximum time the producer waits to fill a batch) are crucial here. Higher values for these parameters lead to larger batches, which improves throughput and compression efficiency but increases latency.23 (See the tuning sketch after this list.)
- Idempotence for Reliability: Enable idempotent producers (enable.idempotence=true). This has been the default since Kafka 3.0. This setting prevents message duplication during network retries and provides “exactly-once, in-order” semantics at the partition level without significantly impacting performance.29
- Consumer Optimization:
- Tune Parallelism: The number of consumer instances in a group should not exceed the number of partitions of the consumed topic. Additional consumers would remain idle.15
- Control Polling Behavior: The parameters fetch.min.bytes and fetch.max.wait.ms control how much data a consumer fetches from the broker with each request. Increasing these values can reduce the number of network round-trips, which lessens the load on the brokers and improves throughput, but can increase latency for individual messages.30
- Offset Management: For critical applications, automatic offset committing should be disabled (enable.auto.commit=false). Instead, offsets should be committed manually after a message has been successfully processed. This prevents data loss in case of failure and is the basis for “at-least-once” or “exactly-once” processing guarantees.31
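The tuning knobs above map directly to client configuration. The following sketch is illustrative only: the numeric values (64 KB batches, 20 ms linger, 1 MB fetches), the topic, and the group name are assumptions chosen to show where each parameter lives, not recommendations.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TuningSketch {
    public static void main(String[] args) {
        // Producer side: trade a little latency for throughput via batching; keep idempotence on.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);  // 64 KB batches (illustrative)
        producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 20);          // wait up to 20 ms to fill a batch
        producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // producerProps would be passed to a KafkaProducer as in the earlier sketch.

        // Consumer side: fewer, larger fetches and manual offset commits after processing.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");          // assumption
        consumerProps.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024);         // 1 MB per fetch
        consumerProps.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");          // commit manually
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("sensor-readings"));                            // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> { /* process the record here */ });
            consumer.commitSync(); // at-least-once: commit only after successful processing
        }
    }
}
```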
Practical Example from the Energy Sector: Smart Metering with IoT
To illustrate the concepts discussed so far, let’s consider a practical scenario from the German energy sector: the rollout of smart meters. This example shows how Kafka can serve as the backbone for a modern, scalable IoT data platform.
Scenario
A German energy provider needs to collect and process consumption data from millions of smart meters in real time. The requirements are diverse:
- Billing: Timely and accurate creation of consumption bills.
- Grid Monitoring: Live monitoring of the power grid to detect anomalies, load peaks, and potential outages.
- Load Forecasting: Analysis of consumption patterns to better predict energy demand.
- Customer Portal: Providing real-time dashboards for end customers to visualize their own consumption.
Topic and Key Design
A well-thought-out structure for topics and keys is the foundation for a maintainable and scalable architecture.
- Topic Naming Convention: A hierarchical naming convention creates clarity and allows for easy management and access control. A proven schema could look like this: {country}.{domain}.{region}.{datatype}.{object}.{version}.
  - Example Topic: en.energy.bavaria.readings.power.v1
  - This structure32 allows for targeted data access, e.g., all power readings (*.readings.power.*) or all data from a specific region (*.bavaria.*) can be consumed with a single pattern subscription. Versioning (.v1) is crucial for managing schema changes over the long lifespan of the devices.33
- Message Key Strategy: The unique identifier of the smart meter (e.g., the MaLo-ID or MeLo-ID in Germany) must be used as the Message Key.34 (A code sketch illustrating both conventions follows this list.)
  - Justification: Using the meter_id as the key ensures that all readings from a specific meter are deterministically assigned to the same partition. This is critically important as it guarantees a strict processing order for each individual meter.34 Without this guarantee, readings could be processed out of order, leading to incorrect consumption calculations, faulty billing, and unreliable analytics. The message key is therefore not just a technical detail but the fundamental enabler for stateful processing and per-device analytics at scale. Any stateful operation, such as calculating current consumption or detecting outages, relies on the correct order of messages for a single device. Partitioning by meter_id ensures that the state for a specific device can be efficiently managed locally in a single consumer thread, without the need for expensive and complex synchronization across the network.
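A short sketch of how the naming convention and key strategy appear in client code; the broker address, meter ID, and payload are illustrative assumptions.

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.regex.Pattern;

public class SmartMeterExample {
    public static void main(String[] args) {
        // Producing: the meter ID is the message key, so all readings of one meter
        // land in the same partition and keep their order.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>(
                "en.energy.bavaria.readings.power.v1",                // topic per the naming convention
                "DE0001234567890123456789012345678",                  // illustrative meter ID as key
                "{\"kwh\": 0.42, \"ts\": \"2025-01-01T00:15:00Z\"}"));
        }

        // Consuming: a pattern subscription picks up the power readings of every region.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "grid-analytics-dashboard");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Pattern.compile("en\\.energy\\..*\\.readings\\.power\\.v1"));
            // poll(...) as in the earlier consumer sketch
        }
    }
}
```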
Data Format Selection: Avro vs. Protobuf
The choice of data format is a long-term decision, especially in the IoT space where devices remain in the field for years. The ability to evolve the data schema over time is therefore a critical criterion.
- Argument for Avro: Avro’s greatest strength is its flexible and robust schema evolution.35 It supports both backward and forward compatibility, meaning old consumers can read new data and new consumers can read old data. This is invaluable in a long-lived IoT scenario where firmware updates on millions of devices cannot happen simultaneously. The schemas are defined in JSON and can be managed centrally in a Schema Registry, which simplifies data governance.36
- Argument for Protobuf: Protobuf generally offers slightly higher performance (lower latency, higher throughput) and produces more compact messages, which can be advantageous for resource-constrained devices.36 However, its schema evolution is more rigid; field numbers must not be reused after being deleted, and changing field types is risky.35
- Recommendation: For this use case, Avro is strongly recommended. The operational flexibility and guaranteed compatibility mechanisms over a long device lifecycle far outweigh the marginal performance benefits of Protobuf. The risk of data corruption from improper schema evolution with Protobuf is too high in a large-scale, long-term IoT deployment.
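To make the schema-evolution argument tangible, the following sketch uses Avro's own compatibility checker to verify that a new reader schema (with an added optional field) can still read data written with the old schema. The field names and the use of org.apache.avro.SchemaCompatibility are illustrative assumptions, not part of the scenario above.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    public static void main(String[] args) {
        // Version 1 of a meter-reading schema (illustrative field names).
        Schema v1 = new Schema.Parser().parse("""
            {"type":"record","name":"MeterReading","fields":[
              {"name":"meterId","type":"string"},
              {"name":"kwh","type":"double"}
            ]}""");

        // Version 2 adds an optional field with a default, which keeps old data readable.
        Schema v2 = new Schema.Parser().parse("""
            {"type":"record","name":"MeterReading","fields":[
              {"name":"meterId","type":"string"},
              {"name":"kwh","type":"double"},
              {"name":"voltage","type":["null","double"],"default":null}
            ]}""");

        // Can a v2 consumer read data written with v1? (backward compatibility)
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType()); // expected: COMPATIBLE
    }
}
```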
Architecture Sketch
The resulting architecture leverages Kafka’s pub-sub capabilities to provide a single data stream for multiple, independent business processes:
- Producers: Millions of smart meters send their readings. Since many IoT devices use protocols like MQTT, which are optimized for unreliable networks, an MQTT gateway is often used. This gateway receives data from the devices and then acts as a Kafka producer, writing the messages to the appropriate Kafka topic.37
- Kafka Cluster: A central, highly available Kafka cluster ingests the data into the en.energy.*.readings.power.v1 topic.
- Multiple Consumer Groups: Various applications access the same data stream by subscribing with their own unique group.id:
  - group.id=billing-service: A stream processing service (e.g., with Kafka Streams or Apache Flink) aggregates the readings per meter and prepares the data for monthly billing.
  - group.id=grid-analytics-dashboard: A real-time analytics engine reads the data to populate live dashboards for the grid control center. It calculates aggregated load profiles, identifies anomalies, and visualizes the state of the grid.38
  - group.id=data-lake-archiver: A Kafka Connect Sink Connector writes all raw data unchanged to a cost-effective long-term storage (e.g., Amazon S3 or HDFS). This data is then available for historical analysis, training machine learning models, and regulatory requirements.39
This architecture demonstrates the core strength of Kafka: data is produced once and can be reused an unlimited number of times for various purposes without affecting the source systems or other consuming applications.
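As a rough illustration of the gateway pattern described above, the following sketch bridges MQTT messages into Kafka. It assumes the Eclipse Paho MQTT client and an MQTT topic layout of meters/{meterId}/reading; broker addresses and names are placeholders. A production gateway would add error handling, security, and backpressure.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;

import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws MqttException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // Assumed MQTT topic layout: meters/{meterId}/reading
        MqttClient mqtt = new MqttClient("tcp://localhost:1883", "kafka-bridge"); // assumption: local MQTT broker
        mqtt.connect();
        mqtt.subscribe("meters/+/reading", (mqttTopic, message) -> {
            String meterId = mqttTopic.split("/")[1];                    // key = meter ID keeps per-meter order
            String payload = new String(message.getPayload(), StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("en.energy.bavaria.readings.power.v1", meterId, payload));
        });
    }
}
```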
Hardware Planning and Capacity Calculation: The Foundation for Operations
Careful planning of hardware resources is crucial for the performance, stability, and Total Cost of Ownership (TCO) of a Kafka cluster. This section provides a pragmatic guide to sizing storage, CPU, memory, and network.
Storage: SSD vs. HDD – A Cost-Benefit Comparison
A common assumption is that a high-performance system like Kafka must run on expensive Solid-State Drives (SSDs). However, the reality is more nuanced and depends heavily on Kafka’s I/O pattern.
- Kafka’s I/O Pattern: Kafka is optimized for sequential read and write operations. New messages are always appended to the end of log files, and consumers generally read data linearly. Modern operating systems and even traditional hard disk drives (HDDs) are extremely efficient at handling sequential I/O loads.40
- When are HDDs an Option? For high-throughput workloads with long retention periods, where cost per terabyte is a critical factor, high-quality server HDDs (ideally in a JBOD—Just a Bunch of Disks—configuration) can be a surprisingly cost-effective and performant solution.41
- The Advantages of SSDs: SSDs are superior when extremely low latency is business-critical. Their main advantage, however, is in non-sequential read access. This occurs when consumers have fallen far behind and need to “catch up,” reading older data that may no longer be in the operating system’s cache. In such scenarios, SSDs offer significantly more consistent and better performance.42
- Recommendation: For critical clusters with strict latency requirements, SSDs are the first choice. For high-volume use cases like data archiving or pure throughput workloads, an evaluation of high-performance HDDs should be considered to optimize TCO. In cloud environments, this decision is often abstracted by the choice of instance type (e.g., storage-optimized vs. general-purpose), but understanding the underlying principle is crucial for an informed selection.43
Table 2: Comparison of Storage Technologies (SSD vs. HDD) for Kafka Brokers
| Criterion | SSD (Solid-State Drive) | HDD (Hard Disk Drive) | Recommendation for CTOs |
|---|---|---|---|
| Sequential Throughput | High | Good (surprisingly competitive) | For pure throughput scenarios, HDDs are a cost-effective option. |
| Latency | Very Low | High | If latency is business-critical, SSDs are the only choice. |
| Random Read Performance | Excellent | Poor | Critical for “catching up” consumers; this is where the SSD advantage is greatest. |
| Cost per TB | Higher | Lower | The primary lever for cost optimization with long retention times. |
| Reliability | Higher (no moving parts) | Lower | RAID configurations are essential with HDDs to mitigate failure risk. |
Formula for Capacity Calculation
An accurate calculation of storage requirements is essential for budgeting and infrastructure planning. The following formula provides a reliable approach:44
- Calculate Daily Data Ingress: Daily Ingress = Message Rate per Second × Average Message Size × 86,400 (seconds per day)
- Calculate Total Data Volume Based on Retention Time: Data Retention = Daily Ingress × Retention Period in Days
- Account for Replication Factor: Replicated Data = Data Retention × Replication Factor
- Add a Safety Buffer: Total Storage = Replicated Data × (1 + Buffer), with the buffer expressed as a decimal fraction (e.g., 0.25 for 25%)
A buffer of 20-30% is recommended to handle load spikes, operational buffers (e.g., for rebalancing), and future growth.45
Table 3: Example Capacity Calculation for Smart Metering
| Parameter | Value | Calculation Step |
|---|---|---|
| Number of Meters | 1,000,000 | - |
| Readings per Meter per Day | 96 (every 15 minutes) | - |
| Average Message Size | 500 Bytes | - |
| Daily Ingress (Raw Data) | ≈48 GB | 1,000,000 × 96 × 500 Bytes |
| Retention Period | 14 Days | - |
| Stored Raw Data | 672 GB | 48 GB/day × 14 days |
| Replication Factor | 3 | - |
| Stored Replicated Data | 2.016 TB | 672 GB × 3 |
| Safety Buffer (25%) | 504 GB | 2.016 TB × 0.25 |
| Total Storage Requirement | ≈2.52 TB | 2.016 TB + 504 GB |
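The table's arithmetic can be reproduced with a few lines of code, which is handy for plugging in different message rates or retention periods. The values below mirror the smart-metering example; units are decimal (1 GB = 1,000,000,000 bytes).

```java
public class KafkaCapacityEstimate {
    public static void main(String[] args) {
        // Smart-metering example from Table 3.
        long meters = 1_000_000L;
        long readingsPerMeterPerDay = 96;          // every 15 minutes
        long messageSizeBytes = 500;
        int retentionDays = 14;
        int replicationFactor = 3;
        double bufferFraction = 0.25;              // 25% safety buffer

        double dailyIngressGB = meters * readingsPerMeterPerDay * messageSizeBytes / 1e9;
        double retainedGB     = dailyIngressGB * retentionDays;
        double replicatedGB   = retainedGB * replicationFactor;
        double totalGB        = replicatedGB * (1 + bufferFraction);

        System.out.printf("Daily ingress:     %.1f GB%n", dailyIngressGB);       // ~48 GB
        System.out.printf("Retained raw data: %.1f GB%n", retainedGB);           // 672 GB
        System.out.printf("Replicated data:   %.3f TB%n", replicatedGB / 1000);  // 2.016 TB
        System.out.printf("Total with buffer: %.2f TB%n", totalGB / 1000);       // ~2.52 TB
    }
}
```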
CPU, Memory, and Network: Practical Recommendations
- Memory (RAM): This is the most important resource for the performance of Kafka brokers. Kafka heavily utilizes the operating system’s page cache to serve read requests directly from RAM and to buffer writes. A counter-intuitive but crucial best practice is to keep the JVM heap for the Kafka process relatively small (e.g., 6-8 GB) and leave the majority of physical RAM to the OS for the page cache.46 A good starting point for a production broker is 32-64 GB of RAM.47 More RAM means a larger cache and thus better read performance for “hot” data.
- CPU: Kafka is typically I/O-bound or network-bound, not CPU-bound. However, CPU usage increases significantly when SSL/TLS encryption and/or compression are used. Here, more cores are more beneficial than a higher clock speed, as Kafka parallelizes many I/O threads.46 A modern multi-core processor with 12-24 cores is sufficient for most use cases.47
- Network: A high-bandwidth, low-latency network is essential. 10 GbE is the de facto standard for production environments. Network capacity can become a bottleneck, especially with many consumers fetching data simultaneously or with high replication traffic between brokers.30
Summary and Strategic Implications
Apache Kafka is far more than just another technology in the modern data stack; it is a fundamental architectural paradigm that empowers companies to make the shift towards real-time, data-driven organizations. For technical leaders, understanding the strategic implications of Kafka is crucial to unlocking its full potential.
The key takeaways can be summarized as follows:
- Kafka as an Architectural Principle: The adoption of Kafka promotes a loosely coupled, event-driven architecture. This reduces the complexity of integrations, increases development agility, and creates a robust foundation for scaling microservice landscapes.
- KRaft as an Operational Milestone: The replacement of ZooKeeper with KRaft mode is a decisive step towards simplifying operations. Reducing to a single system to manage lowers the Total Cost of Ownership (TCO), increases scalability to millions of partitions, and improves resilience through near-instantaneous controller failover. Migrating to modern Kafka versions is therefore a clear strategic recommendation.
- Configuration as a Business Decision: Kafka’s configuration parameters, especially the interplay of replication.factor, min.insync.replicas, and acks, are not purely technical settings. They are direct levers to technically map the business-required balance between data durability and system availability. These decisions must be made in a dialogue between technology and business departments and reflect the criticality of the respective data streams.
- Design for Scalability: A well-thought-out design of topics, partitions, and especially message keys from the outset is essential. Choosing the right key, as shown in the IoT example, is the foundation for ordered and stateful processing at scale and thus for the success of complex streaming applications.
- Hardware as the Foundation for Performance and Cost: Hardware planning must take into account Kafka’s specific I/O patterns. The central role of the OS page cache for performance means that generously sized memory is often more important than a large JVM heap. A differentiated view of SSDs and HDDs allows for a cost-optimized infrastructure adapted to the workload (latency vs. throughput).
For CTOs and VPs of Engineering, engaging with Kafka is an investment in the future-readiness of their IT landscape. It is about creating an infrastructure that not only handles today’s data volumes but is also flexible enough to support the unknown use cases of tomorrow. An evaluation of existing data architectures in light of the paradigms enabled by Kafka is a crucial step on the path to digital transformation and establishing a true competitive advantage through data.
Footnotes
1. Kafka Fundamentals – Grundlagen der Event-Streaming-Plattform erklärt - Thinkport, accessed on September 24, 2025, https://thinkport.digital/kafka-fundamentals/
2. Was ist Kafka? – Apache Kafka erklärt - AWS, accessed on September 24, 2025, https://aws.amazon.com/de/what-is/apache-kafka/
3. Kafka Fundamentals lernen mit Thinkport, accessed on September 24, 2025, https://thinkport.digital/kafka-fundamentals-lerne
4. Documentation - Apache Kafka, accessed on September 24, 2025, https://kafka.apache.org/documentation/
5. Apache Kafka for Smart Grid, Utilities and Energy Production | PDF - Slideshare, accessed on September 24, 2025, https://www.slideshare.net/slideshow/apache-kafka-for-smart-grid-utilities-and-energy-production/241332010
6. Intro to Apache Kafka®: Tutorials, Explainer Videos & More, accessed on September 24, 2025, https://developer.confluent.io/what-is-apache-kafka/
7. Introduction - Apache Kafka, accessed on September 24, 2025, https://kafka.apache.org/intro
8. Apache Kafka Architecture Deep Dive - Confluent Developer, accessed on September 24, 2025, https://developer.confluent.io/courses/architecture/get-started/
9. Starting out with Kafka clusters: topics, partitions and brokers | by Martin Hodges | Medium, accessed on September 24, 2025, https://medium.com/@martin.hodges/starting-out-with-kafka-clusters-topics-partitions-and-brokers-c9fbe4ed1642
10. Kafka topic partitioning strategies and best practices - New Relic, accessed on September 24, 2025, https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
11. Kafka Partitions: Essential Concepts for Scalability and Performance - DataCamp, accessed on September 24, 2025, https://www.datacamp.com/tutorial/kafka-partitions
12. Intro to Kafka Partitions | Apache Kafka® 101 - Confluent Developer, accessed on September 24, 2025, https://developer.confluent.io/courses/apache-kafka/partitions/
13. Apache Kafka cluster: Key components and building your first cluster - Instaclustr, accessed on September 24, 2025, https://www.instaclustr.com/education/apache-kafka/apache-kafka-cluster-key-components-and-building-your-first-cluster/
14. Kafka Best Practices Guide - Logisland - GitHub Pages, accessed on September 24, 2025, https://logisland.github.io/docs/guides/kafka-best-practices-guide
15. Consumer Group Protocol: Scalability and Fault Tolerance, accessed on September 24, 2025, https://developer.confluent.io/courses/architecture/consumer-group-protocol/
16. What is Kafka Consumer Group - GitHub, accessed on September 24, 2025, https://github.com/AutoMQ/automq/wiki/What-is-Kafka-Consumer-Group
17. Kafka Deep Dive for System Design Interviews, accessed on September 24, 2025, https://www.hellointerview.com/learn/system-design/deep-dives/kafka
18. Kafka’s Shift from ZooKeeper to Kraft | Baeldung, accessed on September 24, 2025, https://www.baeldung.com/kafka-shift-from-zookeeper-to-kraft
19. The Evolution of Kafka Architecture: From ZooKeeper to KRaft | by …, accessed on September 24, 2025, https://romanglushach.medium.com/the-evolution-of-kafka-architecture-from-zookeeper-to-kraft-f42d511ba242
20. From ZooKeeper to KRaft: How the Kafka migration works – Strimzi, accessed on September 24, 2025, https://strimzi.io/blog/2024/03/21/kraft-migration/
21. Apache Kafka’s KRaft Protocol: How to Eliminate Zookeeper and Boost Performance by 8x, accessed on September 24, 2025, https://oso.sh/blog/apache-kafkas-kraft-protocol-how-to-eliminate-zookeeper-and-boost-performance-by-8x/
22. Kafka Raft vs. ZooKeeper vs. Redpanda, accessed on September 24, 2025, https://www.redpanda.com/guides/kafka-alternatives-kafka-raft
23. Apache Kafka® broker: Key components, tutorial, and best practices - NetApp Instaclustr, accessed on September 24, 2025, https://www.instaclustr.com/education/apache-kafka/apache-kafka-broker-key-components-tutorial-and-best-practices/
24. stackoverflow.com, accessed on September 24, 2025, https://stackoverflow.com/questions/71666294/kafka-replication-factor-vs-min-in-sync-replicas#:~:text=Replication%2Dfactor%20is%20the%20total,and%20accepting%20new%20incoming%20messages.
25. Kafka Replication and Committed Messages - Confluent Documentation, accessed on September 24, 2025, https://docs.confluent.io/kafka/design/replication.html
26. Kafka Acknowledgment Settings Explained: acks=0,1,all - Dattell, accessed on September 24, 2025, https://dattell.com/data-architecture-blog/kafka-acknowledgment-settings-explained-acks01all/
27. Kafka Acks & Min Insync Replicas Explained - 2 Minute Streaming, accessed on September 24, 2025, https://blog.2minutestreaming.com/p/kafka-acks-min-insync-replicas-explained
28. How to Tune Kafka’s Durability and Ordering Guarantees - Confluent Developer, accessed on September 24, 2025, https://developer.confluent.io/courses/architecture/guarantees/
29. 12 Kafka Best Practices: Run Kafka Like the Pros - NetApp Instaclustr, accessed on September 24, 2025, https://www.instaclustr.com/education/apache-kafka/12-kafka-best-practices-run-kafka-like-the-pros/
30. How to Improve Kafka Performance: A Comprehensive Guide, accessed on September 24, 2025, https://community.ibm.com/community/user/blogs/devesh-singh/2024/09/26/how-to-improve-kafka-performance-a-comprehensive-g
31. Kafka replication factor vs min.insync.replicas - Stack Overflow, accessed on September 24, 2025, https://stackoverflow.com/questions/71666294/kafka-replication-factor-vs-min-in-sync-replicas
32. MUST follow kafka topic naming convention - OTTO Consumer API, accessed on September 24, 2025, https://api.otto.de/portal/guidelines/r200006
33. Kafka Topic Naming Conventions: Best Practices, Patterns, and …, accessed on September 24, 2025, https://www.confluent.io/learn/kafka-topic-naming-convention/
34. Kafka Message Key: A Comprehensive Guide - Confluent, accessed on September 24, 2025, https://www.confluent.io/learn/kafka-message-key/
35. Avro vs. JSON Schema vs. Protobuf: Choosing the Right Format for …, accessed on September 24, 2025, https://www.automq.com/blog/avro-vs-json-schema-vs-protobuf-kafka-data-formats
36. Avro vs Protobuf: A Comparison of Two Popular Data Serialization Formats - Wallarm, accessed on September 24, 2025, https://lab.wallarm.com/what/avro-vs-protobuf/
37. MQTT to Kafka: Benefits, Use Case & A Quick Guide - EMQX, accessed on September 24, 2025, https://www.emqx.com/en/blog/mqtt-and-kafka
38. How to Build Real-Time Apache Kafka® Dashboards That Drive Action - Confluent, accessed on September 24, 2025, https://www.confluent.io/blog/build-real-time-kafka-dashboards/
39. Apache Kafka® architecture: A complete guide [2025] – NetApp Instaclustr, accessed on September 24, 2025, https://www.instaclustr.com/education/apache-kafka/apache-kafka-architecture-a-complete-guide-2025/
40. Does Kafka really need SSD disk? [closed] - Stack Overflow, accessed on September 24, 2025, https://stackoverflow.com/questions/60651994/does-kafka-really-need-ssd-disk
41. 16 Ways Tiered Storage Makes Apache Kafka® Simpler, Better, and Cheaper - Aiven, accessed on September 24, 2025, https://aiven.io/blog/16-ways-tiered-storage-makes-kafka-better
42. SSD or HDD for Kafka Brokers? (Using SSD for Kafka) - Codemia.io, accessed on September 24, 2025, https://codemia.io/knowledge-hub/path/ssd_or_hdd_for_kafka_brokers_using_ssd_for_kafka
43. Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, accessed on September 24, 2025, https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/
44. Kafka Capacity Planning - Codemia, accessed on September 24, 2025, https://codemia.io/knowledge-hub/path/kafka_capacity_planning
45. Mastering Kafka Disk Capacity Planning for Peak Performance …, accessed on September 24, 2025, https://medium.com/@noel.B/effective-disk-capacity-planning-in-apache-kafka-explained-d1e8f6b2f180
46. Hardware requirement for apache kafka - Codemia.io, accessed on September 24, 2025, https://codemia.io/knowledge-hub/path/hardware_requirement_for_apache_kafka
47. Running Kafka in Production with Confluent Platform, accessed on September 24, 2025, https://docs.confluent.io/platform/current/kafka/deployment.html