A cloud conversation with Blake Stoddard, Sr SRE at Basecamp / HEY. This interview has been edited and condensed for clarity.
Making HEY while the cloud shines
Forrest Brazeal: In the past few weeks, Basecamp’s new email app HEY has taken the tech world by storm. There’s also been a lot of talk about the “HEYstack” – your cloud infrastructure running on AWS and Kubernetes. Since Basecamp is known for choosing “boring technologies”, what made you consider running your latest app in the cloud?
Blake Stoddard: Basecamp has historically hosted our SaaS software in on-premises data centers, but a few years ago we decided to give the cloud a try for some of our workloads.
We knew we wanted to use containers to better orchestrate our infrastructure deployments, so we started out with AWS ECS back in 2016. In 2018, when we re-evaluated how we managed container orchestration, Kubernetes was the choice.
We gave Google Kubernetes Engine (GKE) a go — since they naturally jump to mind for managed Kubernetes — but we ran into some issues there and ultimately decided to leave GCP in early 2019.
Since we already had some other resources in AWS from our initial ECS migrations, it seemed natural to give EKS a shot. The AWS team has made great strides in fixing various concerns we’ve had with EKS since the product was first announced in 2017.
Fast-forward to today: HEY lives mostly on AWS managed services — EKS with lots of spot instances, but also Aurora MySQL, ElastiCache Redis, and AWS's managed Elasticsearch.
The HEY stack:
– Vanilla Ruby on Rails on the backend, running on edge
– Stimulus, Turbolinks, Trix + NEW MAGIC on the front end
– MySQL for DB (Vitess for sharding)
– Redis for short-lived data + caching
– ElasticSearch for indexing
– AWS/K8S
— DHH (@dhh), June 24, 2020
Wait, so you actually migrated your Kubernetes workloads from GCP to AWS? Is this the rare use case when “multicloud” architecture pays off?
[Laughing] I guess it is. We take portability pretty seriously, so we want to know we can migrate workloads between clouds (and on-prem) if necessary.
We’re pragmatic about this, though. Basecamp’s first stab at containers on AWS was with ECS, and we still run ECS for multiple production workloads. But we found that our preferred approach to deployment, monitoring, and logging worked better on EKS than on ECS.
We also find Kubernetes less of a black box than ECS. Even after a few years of running ECS, we still occasionally hit issues with tasks not launching. Then you spend minutes poking around the UI trying to figure out what’s going on: why isn’t X launching, why is Y stuck in pending? And even then, your options for actually intervening are slim.
With Kubernetes we can have a heavier touch in how things are scheduled, what happens when things go wrong, and a heightened ability to inspect what’s going on via a powerful CLI (on top of the deployment preferences I mentioned earlier).
And yet you are using managed EKS rather than, say, hosting Kubernetes on EC2 instances. How do you draw that line between portability and reduced management burden?
There’s no guarantee we’ll stay on EKS forever. But EKS as a service manages some low-level things, like the K8s masters, that we’d simply rather not deal with. I mean, the fee for the EKS control plane is ten cents an hour, which works out to about $876 per year per cluster. That’s pretty cheap compared to an engineer’s time, even with the number of clusters we’re running.
If we can pay a fee to have that managed so we can focus our time on engineering work that’s higher-value for the company, we’ll take that deal all day long.
Basecamp’s founder and CTO, David Heinemeier Hansson, has said he’s “never been so happy to be in the cloud as the past 2 weeks.” How has the public cloud helped you handle the traffic of an unexpectedly popular new service?
We could not have met the demand for HEY without the public cloud. I mean, we were planning for 50,000 active users within a couple of months, and we blew past that within two weeks.
In the traditional data center world, there’s no good way to keep pace with that kind of spiky growth. You’re either going to be paying for massive overprovisioning up front, or you’re going to be scrambling to rack and stack new hardware all the time.
Instead, AWS lets us spin up new compute capacity whenever we need it. This was great for testing, too — before the public launch of HEY, we load-tested the app based on models of our own internal usage from dogfooding the product for several months. The cloud was perfect for that — it makes experimenting with different compute and database sizes for the app super easy.
In the end, scaling was a lot smoother than I would have expected given the demand. We had done enough testing to know the infrastructure was solid. It was just a matter of scaling our clusters horizontally to handle frontend and backend loads.
Any places where cloud costs are becoming a pain point as you grow?
I actually have a hat hanging here in my office that says “Cloud spend czar”, because I spend a lot of time dealing with that! Look, from a raw infrastructure cost perspective, we can run our workloads more cheaply on-prem. We go to the cloud to create more value, not to spend less money per se.
But I am keeping an eye on a few things. We run about a 90% spot instance mix, which keeps our compute costs down. Data transfer is always a big line item. We pay more to transfer data out from S3 to the internet than for all our EC2 compute, if you can believe it.
Running spot instances effectively with Amazon EKS → https://t.co/UlkbWd8mJX
— Blake Stoddard (@t3rabytes), June 29, 2020
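Basecamp hasn’t published its actual node group definitions, but for readers who want to try a similar mix, a minimal Terraform sketch of a Spot-backed EKS managed node group might look like this (the cluster, IAM role, subnets, and instance sizes are all hypothetical placeholders):

```hcl
# Hypothetical example, not HEY's actual configuration. The cluster, IAM
# role, and subnets are assumed to be defined elsewhere in the same config.
resource "aws_eks_node_group" "app_spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "app-spot"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  # SPOT capacity type pulls nodes from the Spot market; listing several
  # instance types gives the allocator more capacity pools to choose from.
  capacity_type  = "SPOT"
  instance_types = ["c5.2xlarge", "c5a.2xlarge", "m5.2xlarge"]

  scaling_config {
    min_size     = 3
    desired_size = 10
    max_size     = 50
  }
}
```

A small on-demand node group alongside it (same shape, with `capacity_type = "ON_DEMAND"`) is a common hedge for workloads that can’t tolerate Spot reclamation.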
Another gotcha: some services can rack up bills when you least expect it, like CloudWatch Logs. It integrates so easily with EKS! It’s just one button to click! And then you’re looking at your bill, going “where did this extra $50-$60 per day come from?”
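One common mitigation, sketched below with hypothetical names rather than HEY’s configuration: CloudWatch Logs bills primarily per GB ingested, and log groups keep data forever by default, so it pays to enable only the control-plane log types you actually read and to cap retention on the log group before the cluster creates it.

```hcl
# Hypothetical sketch, not HEY's setup. The IAM role and subnets are
# assumed to exist elsewhere in the configuration.
variable "cluster_name" {
  default = "hey-example"
}

# Create the log group ahead of the cluster so the retention policy applies
# from day one (the pattern the Terraform AWS provider docs recommend).
resource "aws_cloudwatch_log_group" "eks" {
  name              = "/aws/eks/${var.cluster_name}/cluster"
  retention_in_days = 30
}

resource "aws_eks_cluster" "main" {
  depends_on = [aws_cloudwatch_log_group.eks]

  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn

  # Ship only the control-plane log types you actually look at.
  enabled_cluster_log_types = ["api", "audit"]

  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}
```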
With everything you know now about how HEY has scaled, would you make any different architecture choices if you could redesign the app from scratch?
Two mistakes I made when laying out the infrastructure for HEY jump to mind. We use Terraform, and I had written a new module to manage our VPCs as a replacement for the module we wrote back in 2016.
First mistake: I didn’t handle IPv6 subnets and it’s coming back to bite me now (and with native IPv6 support coming to EKS soon, it’d be nice to have that fixed by then).
The second mistake was that I planned for a static number of availability zones. Turns out not every AWS region has 4 AZs — us-east-1 and us-west-2 do, but us-east-2 does not!
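Neither fix is exotic in Terraform. A minimal sketch, with hypothetical names rather than Basecamp’s actual module, that discovers however many AZs the region offers and carves per-AZ IPv6 blocks out of the VPC’s Amazon-provided /56:

```hcl
# Hypothetical sketch, not Basecamp's module: discover however many AZs the
# region actually offers instead of hard-coding four, and assign IPv6 ranges.
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block                       = "10.0.0.0/16"
  assign_generated_ipv6_cidr_block = true # Amazon-provided /56
}

resource "aws_subnet" "private" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = aws_vpc.main.id
  availability_zone = data.aws_availability_zones.available.names[count.index]

  # Carve per-AZ slices out of the VPC's IPv4 block and its IPv6 /56
  # (8 extra bits turns the /56 into the /64s that subnets require).
  cidr_block      = cidrsubnet(aws_vpc.main.cidr_block, 4, count.index)
  ipv6_cidr_block = cidrsubnet(aws_vpc.main.ipv6_cidr_block, 8, count.index)

  assign_ipv6_address_on_creation = true
}
```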
So HEY is running in multiple regions right now?
Yes, it’s currently in us-east-1 and us-west-2. We hope to move soon to a true active-active deployment with latency-based routing between regions. We’ve actually run this in production several times, but it’s tricky to get right: requests that need to write incur about 80ms of latency to the primary region, and you have to make sure you’re not serving stale reads.
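The DNS half of that plan is the easy part. As a rough sketch (hypothetical names and zones, not HEY’s records), latency-based routing in Route 53 via Terraform looks something like this:

```hcl
# Hypothetical sketch: one latency-routed alias record per region, each
# pointing at that region's load balancer (assumed to exist elsewhere).
# Route 53 answers each query with whichever region has the lowest measured
# latency to the client's resolver.
locals {
  regional_lbs = {
    "us-east-1" = aws_lb.east
    "us-west-2" = aws_lb.west
  }
}

resource "aws_route53_record" "app" {
  for_each = local.regional_lbs

  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = each.key

  latency_routing_policy {
    region = each.key
  }

  alias {
    name                   = each.value.dns_name
    zone_id                = each.value.zone_id
    evaluate_target_health = true
  }
}
```

The hard part, as Stoddard notes, is the data layer: write latency to the primary region and read staleness are problems DNS routing can’t solve for you.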
On a similar note, we’ve spent a lot of time thinking about sharding and data locality. I’d love to get data closer to the end user, but it’s really a pipe dream. We have customers in Russia, and even though we’re rendering their requests quickly on our side, the round trip makes everything feel slower.
I prefer GCP’s approach here with global load balancers, where you can anycast across multiple active regions. Since we’re in AWS, we’re looking at Global Accelerator as the path there. We’re hoping it will speed up our outbound traffic, since the data stays on the AWS network significantly longer before hitting the public internet, but there’s a hefty additional cost, and no guarantee it actually fills the gap we’re hoping it will.
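For reference, and with the caveat that this is a hypothetical sketch rather than anything Basecamp has committed to, wiring Global Accelerator in front of a regional load balancer in Terraform takes roughly three resources:

```hcl
# Hypothetical sketch: static anycast IPs in front of regional load
# balancers, so client traffic enters the AWS network at the nearest edge.
resource "aws_globalaccelerator_accelerator" "app" {
  name            = "hey-example"
  ip_address_type = "IPV4"
  enabled         = true
}

resource "aws_globalaccelerator_listener" "https" {
  accelerator_arn = aws_globalaccelerator_accelerator.app.id
  protocol        = "TCP"

  port_range {
    from_port = 443
    to_port   = 443
  }
}

# One endpoint group per region; the us-east-1 group is shown here, and the
# referenced load balancer is assumed to be defined elsewhere.
resource "aws_globalaccelerator_endpoint_group" "us_east_1" {
  listener_arn          = aws_globalaccelerator_listener.https.id
  endpoint_group_region = "us-east-1"

  endpoint_configuration {
    endpoint_id = aws_lb.east.arn
    weight      = 100
  }
}
```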
Any final words of wisdom to share with those who want to implement the “HEYstack” on AWS?
Pay close attention to your DB schemas! We rely heavily on relational databases and borking a schema has awful cascading effects.
Second, cache as much as you can. HEY renders a lot of email, and we’ve gotten it to be quick, but at scale it’s compute-intensive.
I think you just hit on the two hardest problems in computer science: naming things and cache invalidation. (Off-by-one errors are the other hardest problem.) Got one more thing for us?
You shouldn’t pick up a given technology and integrate it with your stack just because it’s the trendy thing. We didn’t jump into the cloud 10 years ago just because it was cool; we were comparatively late to adopt Kubernetes; we don’t use a service mesh. We’ve let these technologies mature, and now we’re getting real value from them.
Our story (like many others!) shows that you don’t have to go past the bleeding edge to reap the benefits of cloud.