
How to retry failed Kinesis events using AWS Lambda dead letter queues for SNS

A Cloud Guru News

AWS announced DLQ support for Lambda — but only for async event sources such as SNS and not poll-based Kinesis streams

In a previous post, I discussed the challenges with processing Kinesis events with Lambda. The blog shared some of my lessons learned and tips for partial failures, using dead letter queues, and avoiding hot streams.

With regard to failed records, I discussed that while Lambda's DLQ support extends to asynchronous invocations such as SNS and S3, it does not cover poll-based invocations such as Kinesis and DynamoDB streams. This means developers still need to implement a retry strategy for Kinesis functions — potentially taking advantage of the retry flow with SNS.

Since SNS already comes with DLQ support, you can simplify your setup by sending the failed events to an SNS topic instead. Lambda would then process each failed event a further 3 times before passing it off to the designated DLQ.

Treat Kinesis and SNS as different entry points to the same business logic that reacts to an event in your domain.

In this post, I’ll provide some concrete examples to show how easily this can be accomplished. If you want to try it for yourself, the code for the examples in this blog is available on GitHub.

The Kinesis function

In our Kinesis function, we’ll need to keep track of events that we fail to process. At the end of the function, we can then publish them to SNS for further retries, and take advantage of the built-in “retry then Dead Letter Queue” flow SNS has with Lambda.
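Here's a minimal sketch of that flow in Python. The topic ARN, the `process` stub, and the `fail` flag are illustrative assumptions; the real project on GitHub will differ in the details.

```python
import base64
import json

RETRY_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:kinesis-retry"  # hypothetical

def decode_record(record):
    """Kinesis delivers the payload base64-encoded under kinesis.data."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))

def process(payload):
    """Stand-in for your business logic; raises on the demo's fail flag."""
    if payload.get("fail"):
        raise RuntimeError("simulated processing failure")

def publish_for_retry(payloads):
    """Send each failed payload to the retry topic. boto3 is imported
    lazily so the rest of the sketch runs without AWS credentials."""
    import boto3
    sns = boto3.client("sns")
    for payload in payloads:
        sns.publish(TopicArn=RETRY_TOPIC_ARN, Message=json.dumps(payload))

def handler(event, context):
    failed = []
    for record in event["Records"]:
        payload = decode_record(record)
        try:
            process(payload)
        except Exception:
            failed.append(payload)  # keep going so the shard isn't blocked
    if failed:
        publish_for_retry(failed)
    return {"received": len(event["Records"]), "failed": len(failed)}
```

Note that the function swallows per-record errors and only publishes the failures at the end — it never throws, so Lambda never re-runs the whole batch against the shard.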

As mentioned in the previous post, it’s often more important to keep the system processing events in real-time than to ensure every event is processed successfully, but you do still want to give a failed event a few retries before throwing in the towel.

As part of the demo project, I also included an API endpoint for you to push events into Kinesis. Once deployed, if you hit the endpoint with the query string parameter ?fail=true, you can observe the Kinesis function failing to process the record due to an error.
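That endpoint might look something like the following sketch — the stream name and payload shape are my assumptions, not the project's exact code:

```python
import json
import uuid

def build_payload(query):
    """?fail=true marks the record so the Kinesis function will raise on it."""
    query = query or {}
    return {"id": str(uuid.uuid4()), "fail": query.get("fail") == "true"}

def handler(event, context):
    payload = build_payload(event.get("queryStringParameters"))
    import boto3  # imported lazily; only needed when actually publishing
    boto3.client("kinesis").put_record(
        StreamName="retry-demo",  # hypothetical stream name
        Data=json.dumps(payload).encode(),
        PartitionKey=payload["id"],
    )
    return {"statusCode": 202, "body": json.dumps(payload)}
```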

The event would then be published to SNS.

The SNS function

In the SNS function, we’ll extract the original event object that was published to Kinesis, and attempt to process it again using the same business logic as our Kinesis function.
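A sketch of that handler — again, `process` is a stand-in for the shared business logic:

```python
import json

def extract_payload(sns_record):
    """The failed Kinesis payload travels as a JSON string in Sns.Message."""
    return json.loads(sns_record["Sns"]["Message"])

def process(payload):
    """Stand-in for the same business logic the Kinesis function uses."""
    if payload.get("fail"):
        raise RuntimeError("still failing")

def handler(event, context):
    for record in event["Records"]:
        # let exceptions escape: a failed invocation triggers Lambda's
        # built-in retries for SNS, and after those, the designated DLQ
        process(extract_payload(record))
    return {"processed": len(event["Records"])}
```

Unlike the Kinesis function, this one deliberately throws on failure — that's what hands the event over to Lambda's retry-then-DLQ flow.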

Out of the box, we get 3 attempts at processing the failed Kinesis event again with the SNS event source.

After all 3 attempts have been exhausted, there’s one more chance to save the data with the Dead Letter Queue (DLQ) support — which is only available for async event sources such as SNS.

As of version 1.22.0, the Serverless framework doesn’t support adding an SQS queue as the DLQ resource (via the onError attribute).

There is also the serverless-plugin-lambda-dead-letter plugin, but I never managed to get it to work. So, in the meantime, you might have to make do with setting the DLQ resource manually if you want to use SQS.
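One manual route is a sketch like the following, relying on the fact that the resources section of serverless.yml is merged into the generated CloudFormation template. The queue name and the function's logical ID here are assumptions (the framework derives logical IDs like SnsRetryLambdaFunction from the function name), so check your compiled template before copying this.

```yaml
resources:
  Resources:
    RetryDlq:
      Type: AWS::SQS::Queue
    # Merged over the generated function resource to attach the DLQ;
    # the function's IAM role also needs sqs:SendMessage on the queue.
    SnsRetryLambdaFunction:
      Type: AWS::Lambda::Function
      Properties:
        DeadLetterConfig:
          TargetArn:
            Fn::GetAtt: [RetryDlq, Arn]
```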

After our failed Kinesis event has been retried 3 more times via SNS, we can see that the SNS event has been pushed to the designated SQS queue, with the entire event payload for the SNS function as the message body.

The SNS message also has some additional message attributes, although they’re probably not useful to you, especially with a 200 ErrorCode. I know, I’m confused by it too!

So there you go, a simple way to piggyback off the SNS error handling flow for Lambda to retry failed Kinesis events a few more times, whilst doing it out-of-band to allow the stream to go ahead and keep the system real-time.

Trading consistency for realtime-ness

One of the key benefits with using Kinesis is that the order of events for a given partition key is preserved. For example, for a given user ID, it means we can process events related to this user in the same order as they’re received by Kinesis. We can even enforce the ordering of events by setting the SequenceNumberForOrdering attribute in the PutRecord request.
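For illustration, here's a small helper for building such chained PutRecord requests — the helper itself is hypothetical, not from the post:

```python
import json

def put_record_params(stream, payload, partition_key, prev_seq=None):
    """Build kwargs for kinesis.put_record(**params). Passing the
    SequenceNumber returned by the previous call as
    SequenceNumberForOrdering asks Kinesis to order this record strictly
    after it for the same partition key."""
    params = {
        "StreamName": stream,
        "Data": json.dumps(payload).encode(),
        "PartitionKey": partition_key,
    }
    if prev_seq is not None:
        params["SequenceNumberForOrdering"] = prev_seq
    return params

# usage with boto3 (not executed here):
#   resp = kinesis.put_record(**put_record_params("orders", evt, user_id, prev_seq))
#   prev_seq = resp["SequenceNumber"]
```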

The approach I have outlined here sacrifices this consistency because it defers the retries to a background process; retried events can therefore be processed out of order with respect to other events for the same partition key.

This is a conscious tradeoff to allow the Kinesis processing to continue in the face of persistent failures to process some events, and preserve the overall realtime-ness of the system, as I stated in my previous post:

we found it was more important to keep the system flowing than to halt processing for any failed events, even if for a minute.

Another thing to keep in mind is that, unless you can design away poison messages (and of course, that’s not always doable), eventually the decision to ditch persistently failing events will be made for you when they expire from the stream.

Consider impact of retries on downstream systems

As I discussed in the post about pub-sub and push-pull messaging patterns, another benefit of using Kinesis with Lambda is that it allows you to control the amount of load you pass onto your downstream systems.

In the event there is a sudden influx of failed events, the retries via SNS can overwhelm your downstream systems and in turn cause even more failed events and escalate a bad situation further.

In conclusion — keep it real

Because Kinesis manages concurrency and retries at the per-shard level, it forces us to make a choice between:

  1. keeping the overall system real-time, and discarding persistently failing events
  2. ensuring all events are processed successfully

If the retry-until-success behaviour could be narrowed down to a per-partition-key level, then maybe we wouldn’t have to make this tradeoff.

As things stand, I think all real time event processing systems should choose option 1 — otherwise, you’re not really building a “real time” system.

If that is your decision, then the approach outlined here makes it easier for you to implement a retry-thrice-then-DLQ flow as it piggybacks Lambda’s built-in retry behaviour for SNS.

If you choose option 2 to ensure all events are processed successfully, then this approach is definitely not for you! Instead, you should focus on monitoring and alerting so that you discover these poison messages quickly, and design your system to handle them better.

If, as it turns out, your only defense against poison messages is to “discard them” — then you could have chosen option 1 in the first place.

Like what you’re reading? You can find all my Serverless and AWS Lambda related posts here.
