

Partitioning your topic and using consumer groups to distribute partitions across a set of Consumers is the simplest and recommended way to distribute your processing workload. If you want to subdivide that processing further, you could route messages by consuming them from a source partition, producing them to a different sink partition, and then consuming them again downstream, but this model quickly bloats your end-to-end latency metrics, because you consume from and produce to Kafka for each step in your data processing pipeline. It also impacts the resource requirements of your Kafka cluster, as each additional message and replica takes up space and uses traffic on your network, broker disk, and broker memory. Even with partitioning, this model still requires you to poll and process messages in a synchronous manner. That works for simple consuming use cases that can be modeled as a single loop, but it's not sufficient when you want to distribute work across worker threads to better utilize local CPU and memory resources.
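To make the consumer-group model concrete, here is a minimal, self-contained sketch of how a range-style assignor might spread a topic's partitions across the members of a group. The class and method names are hypothetical illustrations, not Kafka's actual `RangeAssignor` implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Kafka's real assignor): divide a topic's
// partitions into contiguous ranges, one range per consumer in the group.
public class RangeAssignSketch {
    // Returns consumerId -> list of assigned partition numbers.
    static Map<String, List<Integer>> assign(int numPartitions, List<String> consumers) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int n = consumers.size();
        int perConsumer = numPartitions / n;   // base share per consumer
        int remainder = numPartitions % n;     // first `remainder` consumers get one extra
        int partition = 0;
        for (int i = 0; i < n; i++) {
            int count = perConsumer + (i < remainder ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int j = 0; j < count; j++) owned.add(partition++);
            assignment.put(consumers.get(i), owned);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 6 partitions over 4 consumers
        System.out.println(assign(6, List.of("c1", "c2", "c3", "c4")));
        // prints {c1=[0, 1], c2=[2, 3], c3=[4], c4=[5]}
    }
}
```

Each partition is owned by exactly one consumer in the group, which is why partition count caps the parallelism you can get this way.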

The standard way to consume records from Kafka is to call the poll method of the Consumer. Although this method is blocking (with an optional timeout), internally the Consumer requests and caches records asynchronously. Calling the poll method actually performs many different operations internally, but in simple terms, it will asynchronously create fetch requests to brokers to request Kafka records for assigned partitions. If there are records available to return (as a result of async fetch requests triggered by previous polls), then those records are returned immediately. If no records are available, then the Consumer will wait up to the user-defined timeout for fetch request responses and then return those records. The recommended model for flow control when using the Consumer is to poll and process in one synchronous loop, so that you only ever retrieve more records when it's possible to process them. The sequence diagram below depicts an ideal scenario where the Consumer is polled and has records returned within the timeout provided (10ms).
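The poll semantics described above can be sketched without Kafka at all. In the minimal, Kafka-free sketch below, a `BlockingQueue` stands in for the Consumer's internal fetch buffer: asynchronously fetched records land in the buffer, and poll drains it immediately when records are cached, otherwise waiting up to the caller's timeout. `ToyConsumer` and its methods are hypothetical names for illustration, not Kafka's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Kafka-free sketch of poll semantics: records fetched asynchronously land
// in an internal buffer; poll() returns cached records immediately, or
// waits up to the caller's timeout for some to arrive.
public class PollSketch {
    static class ToyConsumer {
        private final LinkedBlockingQueue<String> fetchBuffer = new LinkedBlockingQueue<>();

        // Stands in for an async fetch response filling the internal buffer.
        void onFetchResponse(String record) { fetchBuffer.add(record); }

        // Blocks up to timeoutMs, but returns immediately when records are cached.
        List<String> poll(long timeoutMs) throws InterruptedException {
            List<String> records = new ArrayList<>();
            fetchBuffer.drainTo(records);
            if (records.isEmpty()) {
                String r = fetchBuffer.poll(timeoutMs, TimeUnit.MILLISECONDS);
                if (r != null) {
                    records.add(r);
                    fetchBuffer.drainTo(records); // grab anything else that arrived
                }
            }
            return records; // may be empty if the timeout expired
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ToyConsumer consumer = new ToyConsumer();
        consumer.onFetchResponse("record-1"); // pretend a fetch already completed

        // Poll-and-process loop: only retrieve more records once the
        // current batch has been processed.
        List<String> batch = consumer.poll(10);
        for (String record : batch) {
            System.out.println("processing " + record);
        }
        // prints "processing record-1"
    }
}
```

The single loop in `main` mirrors the recommended flow control model: processing and polling alternate on one thread, so the buffer is only refilled when downstream work has finished.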


Apache Kafka 2.4.0 includes a fix contributed by Lightbend that solves these performance problems (KAFKA-7548). In our benchmarks we've observed a network traffic savings between the Broker and Consumer of 32%, and an improvement in Consumer throughput of 16%! This blog post begins with details about how the Consumer and Alpakka Kafka internals are designed to facilitate asynchronous polling and back-pressure. We'll discuss how the Consumer's flow control mechanisms were improved to provide better performance, and we conclude with some benchmarks that demonstrate the performance improvements before and after the new Consumer is used. If you're not very familiar with Akka Streams or Alpakka Kafka, you may find it worthwhile to read the whole post to learn how both of these tools work and integrate together.
