Message Delivery Semantics in Streaming Platforms

1. Overview

In this tutorial, we'll discuss message delivery semantics in streaming platforms.

First, we'll walk through the event flow through the core components of a streaming platform. Next, we'll discuss the common reasons for data loss and duplication in such platforms. Then, we'll look at the three major delivery semantics available.

We’ll discuss how we can achieve these semantics in streaming platforms, as well as how they deal with data loss and duplication issues.

For each of the delivery semantics, we'll briefly discuss how to achieve that delivery guarantee in Apache Kafka.

2. Basics of Streaming Platforms

Simply put, streaming platforms such as Apache Kafka and Apache ActiveMQ process events in a real-time or near real-time manner from one or multiple sources (also called producers) and pass them to one or multiple destinations (also called consumers) for further processing, transformation, analysis, or storage.

Producers and consumers are decoupled through brokers, and this enables scalability.

Typical use cases for streaming applications include high-volume user activity tracking on an eCommerce site, financial transactions with real-time fraud detection, and autonomous mobile devices that require real-time processing.

There are two important considerations in any message delivery platform:

  • Accuracy
  • Latency

Often, in distributed, real-time systems, we need to strike a balance between latency and accuracy, depending on what is more important to the system.

This is where we need to understand the delivery guarantees offered by streaming platforms out of the box or implement the desired ones using message metadata and platform configuration.

Next, let's briefly look at the issues of data loss and duplication in streaming platforms, which will lead us to a discussion of the delivery semantics used to manage these issues.

3. Potential Data Loss and Duplication Scenarios

To understand data loss and/or duplication in streaming platforms, let’s quickly step back and take a look at the high-level flow of events in streaming platforms:


Here, we can see that there can be potentially many points of failure along the flow from producer to consumer.

Often, this results in issues like data loss, lag and duplication of messages.

Let’s take a look at each component of the diagram above and see what can go wrong and the possible consequences for streaming systems.

3.1. Producer Failure

Producer failures can lead to a couple of issues:

  • The producer may fail after generating a message but before sending it over the network. This can cause data loss.
  • The producer may fail while waiting to receive an acknowledgment from the broker. When the producer recovers, it resends the message, assuming the broker never received it. This can lead to data duplication on the broker.

3.2. Network Issues Between Producer and Broker

A network failure between the producer and the broker can cause problems:

  • The producer may send a message which is never received by the broker due to network issues.
  • There may also be a scenario where the broker receives the message and sends the acknowledgment, but the producer never receives the acknowledgment due to network issues.

In both of these cases, the producer will resend the message, resulting in data duplication on the broker.
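To make the retry behavior concrete, here is a minimal sketch in Java of the producer settings that drive it, assuming a Kafka broker at localhost:9092 and a hypothetical topic named "orders": if no acknowledgment arrives in time, the producer resends the record, which is exactly what can duplicate the message on the broker.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "1");                  // wait for the leader's acknowledgment
        props.put(ProducerConfig.RETRIES_CONFIG, 3);                 // resend if the acknowledgment never arrives
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 30000); // give up after 30 seconds

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Any of these retries may land a second copy of the record on the broker
            // if the first write succeeded but its acknowledgment was lost.
            producer.send(new ProducerRecord<>("orders", "order-42", "created"));
        }
    }
}

Turning retries down reduces the risk of duplication but increases the risk of data loss; the delivery semantics discussed below formalize that trade-off.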

3.3. Broker Failure

Similarly, broker failure can also lead to data duplication:

  • A broker may fail after committing a message to persistent storage but before sending an acknowledgment to the producer. This can cause producers to resend the message, leading to data duplication.
  • A broker tracks the messages that consumers have read so far. The broker may fail before persisting this information. This can cause consumers to read the same message multiple times, leading to data duplication.

3.4. Message Persistence Issue

The broker may fail while writing in-memory messages to disk, which can lead to data loss.

3.5. Network Issues Between Consumer and Broker

A network failure between the broker and the consumer can result in:

  • Despite the broker sending the message and recording that it has sent the message, the consumer may never receive the message.
  • Similarly, a consumer may send an acknowledgment after receiving a message, but the broker may never receive the acknowledgment.

In both cases, the broker may resend the message, leading to data duplication.

3.6. Consumer Failures

  • The consumer may fail before processing the message.
  • The consumer may fail before recording in the persistence storage that it has processed the message.
  • The consumer may fail even after recording that it has processed the message but before sending the acknowledgment.

This may cause the consumer to request the same message from the broker again, which may lead to data duplication.

Next, let’s look at the delivery semantics available in streaming platforms to deal with these issues to meet different system requirements.

4. Delivery Semantics

Delivery semantics define how streaming platforms guarantee the delivery of events from source to destination in our streaming applications.

There are three different delivery semantics available:

  • At-most-once
  • At-least-once
  • Exactly-once

4.1. At-Most-Once Delivery

In this approach, the consumer first saves the state of the last received event and then processes it.

Simply put, if event processing fails midway, the restarted consumer cannot go back and re-read the old event.

As a result, there is no guarantee of successful event processing across all received events.

At-most-once semantics are ideal for situations where some data loss is not an issue and accuracy is not imperative.

Considering the example of Apache Kafka, which uses offsets for messages, the sequence for the at-most-once guarantee would be:

  • persist offset
  • persist result

To enable at-most-once semantics in Kafka, we'll need to set "enable.auto.commit" to "true" on the consumer.

If a failure occurs and the consumer is restarted, it will look at the last persisted offset. Since offsets are persisted before the actual event processing, we cannot establish whether every event received by the consumer was processed successfully. In this case, the consumer may miss some events.
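As an illustration, here is a minimal at-most-once consumer sketch in Java, assuming a Kafka broker at localhost:9092 and a hypothetical topic named "events". With auto-commit enabled, offsets are committed on a timer, so they may already be committed before the polled records are processed:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtMostOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-most-once-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");      // offsets are committed automatically
        props.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "1000"); // ... roughly every second

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // If processing fails here, the offset may already be committed,
                    // so this event is effectively lost: at-most-once.
                    System.out.println("Processing " + record.value());
                }
            }
        }
    }
}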

Let's visualize these semantics:



4.2. At-Least-Once Delivery

In this approach, the consumer processes the received event, persists the result somewhere, and then finally saves the state of the last received event.

Unlike at-most-once, here, in case of failure, the consumer can re-read and reprocess the old events.

In some scenarios, this can lead to data duplication. Let’s consider the example where the consumer fails after processing and saving an event, but before saving the last known event position (also known as offset).

On restart, the consumer will read from the last saved offset. Here, the consumer processes the event more than once because, even though the message was successfully processed before the failure, the state of the last received event was not saved:

This approach is ideal for any application that updates a ticker or gauge to show the current price. However, use cases that require precision in aggregates, such as sums and counters, are not ideal for at-least-once processing, mainly because duplicate events produce incorrect results.

Consequently, with these delivery semantics, no data is lost, but there may be situations where the same event is reprocessed.

To avoid processing the same event multiple times, we can use idempotent consumers.

Essentially, an idempotent consumer can consume a message multiple times but only process it once.

The combination of the following approaches enables idempotent consumers in at-least-once delivery (a sketch follows the list below):

  • The producer assigns a unique message id to each message.
  • The consumer keeps a record of all the processed messages in the database.
  • When a new message arrives, the consumer checks its message id against the existing ones in the persistent storage table.
  • If a match occurs, the consumer updates the offset without re-consuming, sends back the acknowledgment, and effectively marks the message as consumed.
  • When the message id does not already exist, a database transaction is started, and the new message id is inserted. After this, the new message is processed based on whatever business logic is required. Upon completion of message processing, the transaction is finally committed.
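The following Java sketch illustrates the idea; the names are hypothetical, and an in-memory set stands in for the persistent storage table and database transaction described above, purely to keep the example short:

import java.util.HashSet;
import java.util.Set;

public class IdempotentConsumer {

    // Stand-in for the "processed messages" table in persistent storage.
    private final Set<String> processedMessageIds = new HashSet<>();

    public void onMessage(String messageId, String payload) {
        if (processedMessageIds.contains(messageId)) {
            // Duplicate delivery: acknowledge it and move the offset forward, but skip processing.
            System.out.println("Skipping duplicate message " + messageId);
            return;
        }
        // In a real system, recording the id and processing the message
        // would happen inside a single database transaction.
        processedMessageIds.add(messageId);
        process(payload);
    }

    private void process(String payload) {
        System.out.println("Processing " + payload);
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        consumer.onMessage("msg-1", "order created");
        consumer.onMessage("msg-1", "order created"); // redelivered duplicate is consumed but not reprocessed
    }
}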

In Kafka, to ensure at-least-once semantics, the producer must wait for an acknowledgment from the broker.

If no acknowledgment is received from the broker, the producer resends the message.

Additionally, as the producer writes messages to the broker in batches, if that write fails and the producer tries again, messages within the batch may be written to Kafka more than once.

However, to avoid duplication, Kafka introduced the idempotent producer feature.

Essentially, to enable at-least-once semantics in Kafka, we have to (see the configuration sketch after this list):

  • set the "acks" property to the value "1" on the producer side
  • set the "enable.auto.commit" property to the value "false" on the consumer side
  • set the "enable.idempotence" property to the value "true"
  • attach a sequence number and the producer id to each message from the producer

The Kafka broker can identify message duplication on a topic by using the sequence number and producer id.
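Below is a minimal at-least-once configuration sketch in Java, assuming a Kafka broker at localhost:9092 and a hypothetical topic named "events". One caveat: recent Kafka clients require acks to be "all" when the idempotent producer is enabled, so the sketch uses "all" rather than "1":

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AtLeastOnceExample {
    public static void main(String[] args) {
        // Producer side: wait for acknowledgments and let the broker de-duplicate retries.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // sequence number + producer id
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");                // required by the idempotent producer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "key", "value"));
        }

        // Consumer side: commit offsets manually, only after the results have been persisted.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-demo");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(record -> System.out.println("Processing " + record.value()));
            consumer.commitSync(); // a crash before this line means the events are re-read and reprocessed
        }
    }
}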

4.3. Exactly-Once Delivery

These semantics are similar to at-least-once delivery: first, the received event is processed, and then the results are stored somewhere. In case of failure and restart, the consumer can re-read and reprocess the old events. However, unlike at-least-once processing, any duplicate events are dropped and not processed, resulting in exactly-once processing.

This is ideal for any application in which accuracy is important, such as precision counters or anything else that requires an event to be processed only once and without loss.

The sequence proceeds as follows:

  • persist result
  • persist offset

In the diagram below, let's see what happens when a consumer fails after processing the events but before saving the offset:

We can remove the duplication in exactly-once semantics by using:

  • Idempotent updates – we save the result against a unique key or id that is generated. In the case of duplication, the generated key or id will already be in the result store (a database, for example), so the consumer can drop the duplicate without updating the result (see the sketch after this list).
  • Transactional updates – we save the results in batches that require a transaction to be started and committed, so when a transaction commits, we know the events were processed successfully. Here, we simply skip the duplicate event without any result update.
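As an example of an idempotent update, here is a small Java/JDBC sketch; the table, column names, and connection details are hypothetical, and PostgreSQL's ON CONFLICT syntax is assumed. The result is keyed by the event id, so reprocessing a duplicate leaves the stored result untouched:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentUpdate {

    // Inserts the result once; a duplicate event_id makes the statement a no-op.
    public static void saveResult(Connection connection, String eventId, String result) throws SQLException {
        String sql = "INSERT INTO processed_events (event_id, result) VALUES (?, ?) "
                   + "ON CONFLICT (event_id) DO NOTHING";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, eventId);
            statement.setString(2, result);
            statement.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/streaming", "user", "password")) {
            saveResult(connection, "event-42", "sum=100");
            saveResult(connection, "event-42", "sum=100"); // reprocessed duplicate: the stored result is unchanged
        }
    }
}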

Let's see what we need to do to enable exactly-once semantics in Kafka (a configuration sketch follows the list):

  • enable the idempotent producer and the transaction feature on the producer by setting a unique value for "transactional.id" on every producer
  • enable the transaction feature on the consumer by setting the "isolation.level" property to "read_committed"
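A minimal configuration sketch in Java, assuming a Kafka broker at localhost:9092 and a hypothetical topic named "events", could look like this:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceExample {
    public static void main(String[] args) {
        // Producer side: a transactional id implicitly enables the idempotent producer.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-producer-1"); // unique per producer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("events", "key", "value"));
            producer.commitTransaction(); // read_committed consumers only see the record after this commit
        }

        // Consumer side: only read records from committed transactions.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "exactly-once-demo");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted transactions
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            // ... subscribe, poll, process, and commit offsets as part of the same processing flow
        }
    }
}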

5. Conclusion

In this article, we have looked at the differences between the three delivery semantics used in streaming platforms.

After a brief overview of the event flow in streaming platforms, we looked at the data loss and duplication issues. Then, we saw how to mitigate these issues by using different delivery semantics: first at-most-once, then at-least-once, and finally exactly-once.

       
