Share on facebook
Share on twitter
Share on linkedin

3 Pro Tips for Developers using AWS Lambda with Kinesis Streams

A Cloud Guru News
A Cloud Guru News

TL; DR: Lessons learned from our pitfalls include considering partial failures, using dead letter queues, and avoiding hot streams

Yubl was a social networking app with a timeline feature similar to TwitterThe development team leveraged a serverless architecture where Lambda and Kinesis became a prominent feature of our design.

As part of the design, we tried to keep in mind that the characteristics that define a system that processes Kinesis events — which for me must have at least these 3 qualities:

AWS Lambda and Kinesis sitting on a tree
  1. The system should be real-time — as in “within a few seconds”
  2. The system should retry failed events — but retries should not violate the realtime constraint on the system
  3. The system should be able to retrieve events that could not be processed — so someone can investigate root cause or provide manual intervention
Yubl had around 170 Lambda functions running in production — gluing everything together

Whilst our experience using Lambda with Kinesis was great in general, there were a couple of lessons that we had to learn along the way. Here are 3 useful tips to help you avoid some of the pitfalls we fell into and accelerate your own adoption of Lambda and Kinesis.

ProTip #1: Consider Partial Failures

AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires 

Because of the way Lambda functions are retried, if you allow your function to err on partial failures, then the default behavior is to retry the entire batch until success or the data expires from the stream.

To decide if this default behavior is right for you, you have to answer certain questions:

  1. Can events be processed more than once?
  2. What if those partial failures are persistent? Perhaps due to a bug in the business logic that is not handling certain edge cases gracefully?
  3. Is it more important to process every event till success than keeping the overall system real-time?

In the case of Yubl, we found it was more important to keep the system flowing than to halt processing for any failed events, even if for a minute.

For instance, when a user created a new post, we would distribute it to all of your followers by processing the yubl-posted event. The 2 basic choices we’re presented with are:

  1. allow errors to bubble up and fail the invocation — we give every event every opportunity to be processed; but if some events fail persistently then no one will receive new posts in their feed and the system appears unavailable
  2. catch and swallow partial failures — failed events are discarded, some users will miss some posts but the system appears to be running normally to users; even affected users might not realize that they had missed some posts

Of course, it doesn’t have to be a binary choice. There’s plenty of room to add smarter handling for partial failures which we will discuss shortly.

When you create a new post in the Yubl app, your content is distributed to your followers’ feeds
Yubl’s architecture for distributing a user’s posts to his followers’ feeds

We encapsulated these 2 choices as part of our tooling so that we get the benefit of reusability and the developers can make an explicit choice for every Kinesis processor they create

Depending on the problem you’re solving, you would apply different choices. The important thing is to always consider how partial failures would affect your system as a whole.

ProTip #2 : Use Dead Letter Queues (DLQ)

AWS announced support for Dead Letter Queues (DLQ) at the end of 2016. While Lambda support for DLQ extends to asynchronous invocations such as SNS and S3, it does not support poll-based invocations such as Kinesis and DynamoDB streams. Until AWS updates the DLQ features, there’s nothing stopping you from applying the concepts to Kinesis streams yourself.

First, let’s roll back the clock to a time when we didn’t have Lambda. Back then, we’d use long running applications to poll Kinesis streams ourselves. Heck, I even wrote my own producer and consumer libraries because when AWS rolled out Kinesis they totally ignored anyone not running on the JVM!

Lambda has taken over a lot of the responsibilities — polling, tracking where you are in the stream, error handling, etc. — but as we have discussed above it doesn’t remove you from the need to think for yourself. Prior to using Lambda, my long running application to poll Kinesis would:

  1. poll Kinesis for events
  2. process the events by passing them to a delegate function (your code)
  3. failed events are retried 2 additional times
  4. after the 2 retries are exhausted, they are saved into a SQS queue
  5. record the last sequence number of the batch so that we don’t lose the current progress if the host VM dies or the application crashes
  6. another long running application would poll the SQS queue for events that couldn’t be process realtime
  7. process the failed events by passing them to the same delegate function as above (your code)
  8. after the max no. of retrievals the events are passed off to a DLQ
  9. this triggers CloudWatch alarms and someone can manually retrieve the event from the DLQ to investigate

Lambda function that processes Kinesis events should also:

  • retry failed events X times depending on processing time
  • send failed events to a DLQ after exhausting X retries

Since SNS already comes with DLQ support, you can simplify your setup by sending the failed events to a SNS topic instead. Lambda would then process it a further 3 times before passing it off to the designated DLQ.

Tip: Keep the functions that process Kinesis and SNS in the same service so they can share the same processing logic

ProTip #3 : Avoid “Hot” Streams

We found that when a Kinesis stream has 5 or more Lambda function subscribers we would start to see lots ReadProvisionedThroughputExceeded errors in CloudWatch. Fortunately these errors are silent to us as they happen to, and are handled by, the Lambda service polling the stream.

However, we occasionally see spikes in the GetRecords.IteratorAge metric, which tells us that a Lambda function will sometimes lag behind. This did not happen frequently enough to present a problem but the spikes were unpredictable and did not correlate to spikes in traffic or number of incoming Kinesis events.

Increasing the number of shards in the stream made matters worse and the number of ReadProvisionedThroughputExceeded increased proportionally.

According to the Kinesis documentation … each shard can support up to 5 transactions per second for reads, up to a maximum total data reads of 2 MB per second.

And the Lambda documentation … If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.

One would assume that each of the aforementioned Lambda functions would be polling its shard independently. Since the problem is having too many Lambda functions poll the same shard, it makes sense that adding new shards will only escalate the problem further.

All problems in computer science can be solved by another level of indirection. — David Wheeler

After speaking to the AWS support team about this, the only advice we received was to apply the fan out pattern — by adding another layer of Lambda function who would distribute the Kinesis events to others.

Applying the “fan out” pattern with Lambda functions and Kinesis

Whilst this is simple to implement, it has some downsides:

  • it vastly complicates the logic for handling partial failures (see above)
  • all Lambda functions now process events at the rate of the slowest function, potentially damaging the realtime-ness of the system

We also considered and discounted several other alternatives, including

  • have one stream per subscriber — this has a significant cost implication, and more importantly it means publishers would need to publish the same event to multiple Kinesis streams in a “transaction” with no easy way to rollback on partial failures since you can’t unpublish an event in Kinesis
  • roll multiple subscriber logic into one — this corrodes our service boundary as different subsystems are bundled together to artificially reduce the no. of subscribers

In the end, we didn’t find a truly satisfying solution and decided to reconsider if Kinesis was the right choice for our Lambda functions on a case by case basis.

For subsystems that do not have to be realtime, use S3 as source instead. All our Kinesis events are persisted to S3 via Kinesis Firehose. The resulting S3 files can then be processed by these subsystems using Lambda functions. As an example, one such subsystem would stream the events to Google BigQuery for BI.

For work that are task-based (i.e. order is not important), use SNS/SQS as source instead. SNS is natively supported by Lambda, and we implemented a proof-of-concept architecture for processing SQS events with recursive Lambda functions, with elastic scaling. Now that SNS has DLQ support, it would definitely be the preferred option provided that its degree of parallelism would not flood and overwhelm downstream systems such as databases, etc.

For everything else, continue to use Kinesis and apply the fan out pattern as an absolute last resort.

Wrapping up…

So there you have it, 3 pro tips from a group of developers who have had the pleasure of working extensively with Lambda and Kinesis. Want more cloud goodness? Check these out:

Want to improve your cloud skills and knowledge? Check out A Cloud Guru’s library of courses, labs and learning paths!


Get more insights, news, and assorted awesomeness around all things cloud learning.

Sign In
Welcome Back!

Psst…this one if you’ve been moved to ACG!

Get Started
Who’s going to be learning?