Affordable and Scalable Data Analytics on AWS Using Athena and Data Pipeline

A Cloud Guru News

Statehill uses AWS to turn legislative data into actionable intelligence, helping you advocate more effectively

If you're using Amazon Web Services, or even just keeping tabs on their service offerings, you can't have missed the latest addition to their suite of analytics services: Athena. It's a tool that fits very well with the recent trend of serverless cloud computing: essentially, computing without having to allocate the underlying resources yourself.

AWS Athena, like Google's BigQuery, is a serverless tool for querying big data sets without having to set up clusters of physical or virtual machines. A somewhat fair comparison in the AWS space would be that Athena is to EMR what Lambda is to EC2: you face plenty of limitations and can only do a subset of what the non-serverless option offers, but if you architect with those limitations in mind, you can build something that delivers results at a really low price point.

We decided that we wanted to port some of the legislator ideology calculations in our application to Athena, with the goal of scaling better and cutting costs.

Statehill connects legislative data from around the world to discover, analyze, and measure government's impact on you. We start by collecting and organizing legislative information into actionable intelligence that facilitates your ability to advocate effectively.

Essentially, what we're doing is taking historical voting records and determining how Republican- or Democrat-leaning legislators are across a number of categories. We're currently doing this in a number of US states, including Montana, Texas, and Wisconsin.

Ideology for Alan Redfield (R), Montana legislator in the 65th session

First and foremost, we had to confirm that we could accomplish the task using nothing but SQL, as that's how you query Athena. Once we had determined that, our first step was shipping data from our RDS instance to S3 in CSV format so we could create a table over it in Athena. And how do you efficiently ship data from RDS to S3 on a regular basis? Obviously using Data Pipeline!
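
Creating the table over those CSVs is a single DDL statement. Here's a minimal sketch with a hypothetical bucket and a simplified schema (our real table has more columns); the SQL can be pasted into the Athena console, or executed over the JDBC connection described below:

// Map the CSV files under an S3 prefix to a queryable Athena table.
// Schema and bucket name are illustrative only.
String ddl =
    "CREATE EXTERNAL TABLE votes ("
  + "  legislator_id string,"
  + "  bill_id string,"
  + "  vote string"
  + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
  + " LOCATION 's3://example-bucket/votes/'";
statement.execute(ddl); // statement is the JDBC Statement from the job below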

Data Pipeline

Data Pipeline is an incredibly powerful Swiss Army knife of a service, and thanks to its scheduling support it's a viable option for running recurring data transformation tasks, or even plain computing tasks, which is how we ended up using it to actually query Athena.

Athena may at first glance look like just a GUI. But surely you can run queries using the AWS SDK? Not so fast: the only way to actually run a query outside of the console is to connect over JDBC using the Athena driver. So to query Athena on a regular basis we ended up using Data Pipeline again, but this time to run a ShellCommandActivity that prepares the results of the query in S3 and later ships them back to our RDS instance.
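
Getting that connection looks roughly like this. A minimal sketch: the driver class name, endpoint, and property keys follow the AWS JDBC documentation of the time, so check them against your driver version, and the staging bucket is a made-up placeholder:

import java.io.FileWriter;
import java.sql.*;
import java.util.Properties;
import com.opencsv.CSVWriter; // used below; older OpenCSV releases package this as au.com.bytecode.opencsv

// Athena writes raw query results to the staging directory in S3;
// credentials are passed as driver properties.
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
Properties info = new Properties();
info.put("user", System.getenv("AWS_ACCESS_KEY_ID"));
info.put("password", System.getenv("AWS_SECRET_ACCESS_KEY"));
info.put("s3_staging_dir", "s3://example-bucket/athena-staging/");
Connection conn = DriverManager.getConnection(
    "jdbc:awsathena://athena.us-east-1.amazonaws.com:443", info);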

To accomplish this we had to build a custom AMI with Java 1.8 (the default EC2 AMI ships with 1.7) and a small Java class that connects to Athena, runs a query, and writes the results to a specified location. Now, I'm not a good Java developer by any means, but I can get around. In the end, we came up with something like this (loosely based on the sample provided by AWS):

// Stream the result set straight into a CSV file (CSVWriter is OpenCSV).
Statement statement = conn.createStatement();
ResultSet rs = statement.executeQuery(sql);
ResultSetMetaData rsmd = rs.getMetaData();
CSVWriter writer = new CSVWriter(
  new FileWriter(outputLocation),
  ',',
  CSVWriter.NO_QUOTE_CHARACTER,
  CSVWriter.NO_ESCAPE_CHARACTER
);
while (rs.next()) {
  String[] entries = new String[rsmd.getColumnCount()];

  // JDBC columns are 1-indexed, the output array is 0-indexed.
  for (int i = 1; i <= rsmd.getColumnCount(); i++) {
    entries[i - 1] = rs.getString(i);
  }
  writer.writeNext(entries);
}
writer.close();

This worked incredibly well, and since the sql variable and outputLocation come from args, we can reuse the same class for future queries and simply duplicate the pipeline.
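
Concretely, the entry point is just a thin wrapper around the positional arguments, so each pipeline's ShellCommandActivity only differs in what it passes on the command line. A minimal sketch; the class and method names are illustrative, not our actual code:

public class AthenaQueryRunner {
  public static void main(String[] args) throws Exception {
    String sql = args[0];            // the Athena query to run
    String outputLocation = args[1]; // where to write the CSV results
    runQuery(sql, outputLocation);   // the JDBC and CSVWriter logic above
  }
}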

Final setup, now serving data for our customers in production

Conclusion

Athena is an incredibly powerful tool, and I recommend you take a look at it. It's a cheap and fast way to run distributed queries, and it now powers production data for our application. At $5 per TB of data scanned, the barrier to entry for big data processing has been significantly lowered.
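
To put that price in perspective: a query that scans a 10 GB data set costs about five cents, and because you pay per byte scanned rather than per instance hour, you pay nothing while no queries are running.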

There are certainly a few ways to improve our setup; there almost always are. The main one would probably be swapping the CSVs we query in Athena for Parquet files, to save on cost and improve query performance, since Athena scans far fewer bytes against a compressed, columnar format. For more about that, Mark Litwintschik wrote an excellent post comparing different data formats on Athena that I recommend reading.
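
If your Athena deployment supports CREATE TABLE AS SELECT (CTAS), the conversion could even be done in place. A hypothetical sketch, reusing the JDBC statement from the job above (the table names and output location are made up):

statement.execute(
    "CREATE TABLE votes_parquet "
  + "WITH (format = 'PARQUET', "
  + "      external_location = 's3://example-bucket/votes-parquet/') AS "
  + "SELECT * FROM votes");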
