
I’m here to tell you the truth, the good, the bad and the ugly of AWS X-Ray and Lambda

A Cloud Guru News

While there is plenty of value in using AWS X-Ray with Lambda, its current limitations restrict its usefulness for complex systems

AWS announced general availability of AWS Lambda support for AWS X-Ray back in May — and it’s taken me a while to try it out. While I see plenty of value, its current limitations significantly restrict its usefulness in a complex system.

The AWS Lambda-specific documentation was a bit fragmented, so I relied on some experimentation and pieced together clues from several sources.

I discovered that the recording annotations and metadata didn’t work as advertised in the sample — although I could be doing something wrong.


Update July 03, 2017: after this post was published, the folks at AWS got in touch and kindly cleared up some of the issues highlighted here, which were caused by poor documentation that they’ll rectify in the near future. Scroll down to see the clarification on the relevant sections.


The Sample Application

The sample project I created centers on a Lambda function called service-a, which in turn calls a number of downstream systems:

  • publishing to a SNS topic
  • GET’ng and PUT’ng an object in S3
  • GET’ng and PUT’ng a row in DynamoDB
  • invoking another Lambda function (service-c) using the Lambda API
  • making an HTTP request to an API Gateway endpoint backed by another Lambda function (service-b)

The service-b, error and timeout functions in the diagram represent the success, error and timeout cases respectively.

The Good

Once I figured out the magic incantations, I was able to get the results. It took more time and energy than it should have — but by and large, most features worked as advertised at the first (or second) time of asking.

Below is a trace of the service-a function, which includes the time it takes for Lambda to initialize the function, and the various downstream systems it talked to — all nested under custom subsegments.

It even includes the trace of the service-c function — and the time it spent publishing to SNS — which was invoked using the Lambda API.

The service map for service-a includes service-c as a downstream dependency, as well as service-c’s dependency on SNS.

The Bad — it’s always a 200 response code

When the service-a function is invoked through its API Gateway endpoint and errors, the corresponding trace still reports a 200 response code.

Presumably what X-Ray sees is a 200 response from the Lambda service whose payload indicates a 502 response to the API Gateway invocation and so it thought “hey, it’s a 200!”.

The service-a endpoint errored with 502
But the trace reported a 200 response code because (presumably) the AWS Lambda service itself reported a 200

Oddly enough, the trace for the error function also reports a 200 even though its own status field indicates it had errored.

This behavior is both confusing and inconsistent to me — perhaps I have misunderstood how it works. Sadly, X-Ray’s concepts page doesn’t explain the difference between an Error and a Fault either.

While this might seem like a small nuisance now, the inability to quickly identify error traces will hurt you most when you need to diagnose problems in production, most likely when you’re under the most time pressure.


Update July 03, 2017: AWS confirmed that the reason the errors are reported as 200 is that the Lambda service returns a 200 response (with a payload that indicates an error). One workaround is to use the filter expression service() { fault }, which returns all traces that contain a fault.


Traces don’t span over API Gateway

When the service-a function makes an outgoing HTTP request to an API Gateway endpoint, the trace stops at the API Gateway endpoint and doesn’t extend to the Lambda functions that are triggered by API Gateway.

This behavior was consistent with all three endpoints I tested: service-b, error and timeout.

For this test, I followed the X-Ray documentation and used the X-Ray SDK to wrap the Node.js https module when making the HTTP request.
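For reference, the instrumentation looks roughly like this. A minimal sketch assuming the aws-xray-sdk package; the hostname and path are placeholders for the API Gateway endpoint backing service-b:

```javascript
// Wrap the built-in https module with the X-Ray SDK so outgoing
// requests carry the X-Amzn-Trace-Id header and are recorded as
// subsegments of the current trace.
const AWSXRay = require('aws-xray-sdk');
const https = AWSXRay.captureHTTPs(require('https'));

// Placeholder endpoint — substitute your own API Gateway URL.
https.get(
  {
    hostname: 'example.execute-api.us-east-1.amazonaws.com',
    path: '/dev/service-b'
  },
  res => console.log(res.statusCode)
);
```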

I can see the trace IDs are correctly passed along in the outgoing HTTP request and received by the handling Lambda function.

Trace IDs are passed along the HTTP headers in the GET request to the timeout function invocation.
The root trace ID correctly matches the segment ID of the originating function’s trace.

This is the service map I expected to see in this case — where service-a’s trace follows through the HTTP request to API Gateway and includes the invocation of the timeout function.

WARNING: this is not a screenshot! This is a composed image of what I would like to see — where traces follow through to Lambda invocations behind the failing HTTP request

Instead, the actual service map stops at the API Gateway.

The trace does not follow through to the Lambda function behind the outgoing HTTP request

However, when invoking another Lambda function directly (using the Lambda API and wrapped AWS SDK) the trace worked as expected.
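For completeness, the direct invocation went through the wrapped AWS SDK, along these lines (a sketch; the function name and payload are placeholders):

```javascript
// Wrapping the AWS SDK with the X-Ray SDK is what makes the
// service-c invocation appear as part of service-a's trace.
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const lambda = new AWS.Lambda();

// 'service-c' is a placeholder for the deployed function's name.
lambda.invoke(
  {
    FunctionName: 'service-c',
    Payload: JSON.stringify({})
  },
  (err, resp) => {
    if (err) console.error(err);
    else console.log(resp.StatusCode);
  }
);
```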

Perhaps the limitation lies with API Gateway?

Below you can see the breakdown of time spent in service-c when it’s invoked directly, but the tracing to service-b stopped at the API Gateway layer.

The Ugly — no sampling

According to Lambda’s documentation on X-Ray, requests should be sampled at 1 request per minute.


However, that wasn’t the case in my experiments. EVERY request was sampled, as you can see from the age of the traces in the screenshot below. This behavior was consistent when invoking Lambda via API Gateway as well as via the Lambda management console.

While the X-Ray service is not expensive per se — $5.00 per million traces — it’s nonetheless a cost that can easily sneak up on you if you are unwittingly tracing every request through your system.

As an example, I worked on a moderately successful social game at Gamesys with ~1M DAU. At roughly 250M user requests per day, X-Ray would have cost $5 * 250 * 30 days = $37500, which was more than our entire AWS bill at the time!
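The back-of-the-envelope maths, using the numbers above:

```javascript
// Rough monthly X-Ray cost if every request is traced.
// Numbers from the example above: $5 per million traces,
// ~250M user requests per day, 30-day month.
const pricePerMillionTraces = 5; // USD
const requestsPerDay = 250e6;    // ~250M requests/day
const daysPerMonth = 30;

const monthlyCost =
  (requestsPerDay / 1e6) * pricePerMillionTraces * daysPerMonth;

console.log(monthlyCost); // 37500
```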


Update July 03, 2017: this turns out to be a problem with the documentation, which doesn’t mention that sampling is volume-based and only kicks in once you reach a certain volume of requests.


Annotations and Metadata only work on subsegments

The one thing that just refused to work — even though I had followed the examples — was adding annotations and metadata to the root segment:

const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  ...
  let segment = AWSXRay.getSegment();
  let n = Math.random() * 3;
  segment.addMetadata('random', `${n}`);      // this doesn't work
  segment.addAnnotation('path', event.path);  // this doesn't work
  ...
}

Interestingly, adding annotations and metadata to subsegments works fine.

// sns is an SNS client promisified with bluebird (hence publishAsync);
// co provides generator-based flow control
const co = require('co');
const Promise = require('bluebird');
const AWSXRay = require('aws-xray-sdk');
const AWS = AWSXRay.captureAWS(require('aws-sdk'));
const sns = Promise.promisifyAll(new AWS.SNS());

let publishSNS = segment => {
  return new Promise((resolve, reject) => {
    console.log('publishing to SNS topic');
    
    let f = co.wrap(function* (subsegment) {
      ...
      subsegment.addAnnotation('topic', topicArn);
      subsegment.addMetadata('message', 'test');
      let req = {
        Message: message,
        TopicArn: topicArn
      };
      yield sns.publishAsync(req);
      subsegment.close();
      resolve();
    });
    
    AWSXRay.captureAsyncFunc("## publishing to SNS", f, segment);
  });
};

Looking at the logs, I found something interesting: the segment ID for the root segment doesn’t match the segment ID in the X-Ray trace.

For instance, I fetch the root segment for my function in the handler and log it to CloudWatch Logs.

const AWSXRay = require('aws-xray-sdk');
module.exports.handler = (event, context, callback) => {
  ...
  // this should be the root segment for my function
  let segment = AWSXRay.getSegment();
  console.log(JSON.stringify(segment));
  ...
}

In the logs I can see the segment ID is 05b2b9ac6c9e5682.

But in the X-Ray trace, the segment ID for the root segment is 2b7d5b4a2a2d96e9.

The segment ID of the root segment doesn’t match the one the function fetched at runtime.

Furthermore, the trace ID is also different:

  • in the logs it’s 1-59504311-d765e7accb8558871fa89d6d
  • in the X-Ray console it’s 1-59504312-5ef2a3eda0c1b2c4d64dda00

This was very odd, so I decided to track the trace ID in the logs vs in the X-Ray console, starting with a coldstart.

From the data, it’s clear that AWSXRay.getSegment() returns the root segment from the previous invocation.

Bingo! Looks like it’s a bug in the X-Ray SDK for Node.js where AWSXRay.getSegment() returns the root segment from the previous invocation.


Update July 03, 2017: while there was a bug in the X-Ray SDK with regard to tracking the trace ID, adding annotations and metadata to the root segment is simply not supported, which the documentation doesn’t explicitly state. You can work around this by creating a subsegment that covers the entire span of your function invocation to act as your custom root segment, and attach any annotations and metadata related to the invocation there.
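A sketch of that workaround follows. The subsegment name and the doWork helper are my own placeholders, not from AWS’s docs:

```javascript
// Open a subsegment at the start of the handler to act as a custom
// root segment, attach invocation-level annotations/metadata to it,
// and close it just before responding.
const AWSXRay = require('aws-xray-sdk');

module.exports.handler = (event, context, callback) => {
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('## invocation');

  // These work on subsegments, unlike on the root segment.
  subsegment.addAnnotation('path', event.path);
  subsegment.addMetadata('requestId', context.awsRequestId);

  doWork(event) // placeholder for the function's real logic
    .then(result => {
      subsegment.close();
      callback(null, { statusCode: 200, body: JSON.stringify(result) });
    })
    .catch(err => {
      subsegment.close(err); // record the error on the subsegment
      callback(err);
    });
};
```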


The Truth

So there you have it, my weekend escapade with AWS X-Ray and Lambda 🙂

“I came here to tell you the truth, the good, the bad and the ugly.” — Oliver North

Overall I’m impressed with what I saw, and feel X-Ray would have added a lot of value to the serverless architecture I was building at Yubl. However, the inability to span traces over API Gateway endpoints makes it a far less useful addition to our ecosystem.

Furthermore, the X-Ray service is focused on execution time and helping you identify performance bottlenecks. However, there’s another important aspect to distributed tracing that X-Ray does not concern itself with — helping you debug your system by ensuring a set of correlation IDs are captured in all log messages.

Although you can probably use the trace IDs that X-Ray provides, it’s still your responsibility to capture them in all log messages and propagate all your logs to one easily searchable place. We invested some effort into enabling distributed tracing in our serverless architecture at Yubl, which you can read about in detail in this post.

Are you using X-Ray in production? I’d love to hear your thoughts and feedback on the service, and any pitfalls to look out for, in the comments section!

