
Scaling Data Analytics with AWS Data Pipeline and Athena

A Cloud Guru News

If you’re using Amazon Web Services, or even just keeping tabs on their service offerings, you can’t have missed the latest addition to their suite of analytics services: Athena. It’s a tool that fits in very well with the recent trend of serverless cloud computing: essentially, computing without having to allocate the cloud-based resources yourself.

What is AWS Athena?

AWS Athena, like Google’s BigQuery, is a serverless tool for querying big data sets without having to set up clusters of physical or virtual machines. A fair comparison within the AWS space would be to say that Athena is to EMR what Lambda is to EC2.

There are a lot of limitations, and you can only do a subset of what the non-serverless options offer. But once you architect with those limitations in mind, you can build something that delivers results at a really low price point.

Athena at Work

We decided to port some of the legislator ideology calculations in our application to Athena, with the goal of scaling better and cutting costs.


Statehill connects legislative data around the world to discover, analyze, and measure government’s impact on you. We start with collecting and organizing legislative information into actionable intelligence to facilitate your ability to advocate effectively.

Essentially, we take historical voting records and determine how Republican- or Democrat-leaning legislators are across a number of categories. We’re currently doing this in a number of US states, including Montana, Texas, and Wisconsin.
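As an illustration only — the post doesn’t describe Statehill’s actual methodology — a naive lean score could be the share of a legislator’s votes that align with the Republican side minus the share that align with the Democratic side:

```java
// Illustrative sketch only: a naive ideology score in [-1, 1], where 1 means
// the legislator always voted with the Republican position and -1 means they
// always voted with the Democratic position. The real calculation used by
// Statehill is not described in this post.
public class LeanScore {
    static double lean(int votesWithRepublicans, int votesWithDemocrats) {
        int total = votesWithRepublicans + votesWithDemocrats;
        if (total == 0) {
            return 0.0; // no recorded votes: treat as neutral
        }
        return (votesWithRepublicans - votesWithDemocrats) / (double) total;
    }
}
```

A score like this is trivially expressible in SQL over a votes table, which is what makes the workload a candidate for Athena in the first place.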

Ideology for Alan Redfield (R), Montana legislator in the 65th session


How to Move Data to AWS Data Pipeline

First and foremost, we had to confirm that we could accomplish the task using nothing but SQL, as that’s how you query Athena. Once we had, our first step was shipping data from our RDS instance to S3 in CSV format so we could create a table in Athena. And how do you efficiently ship data from RDS to S3 on a regular basis? Data Pipeline, obviously!
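Once the CSVs land in S3, a table can be declared on top of them so Athena can query the files in place. This is a hypothetical sketch — the real table name, columns, and bucket path aren’t given in the post:

```sql
-- Illustrative DDL only: table name, columns, and S3 location are made up.
CREATE EXTERNAL TABLE IF NOT EXISTS votes (
  legislator_id string,
  bill_id string,
  category string,
  vote string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/rds-export/votes/';
```

Athena never ingests the data; the table is just metadata pointing at the CSV files, so re-running the export simply refreshes what the queries see.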

AWS Data Pipeline

Data Pipeline is an incredibly powerful Swiss Army knife of a service, and thanks to scheduling it can be a viable option for running recurring data transformation tasks, or even pure compute tasks. That’s how we ended up using it to actually query Athena.

At first glance, Athena may look like just a GUI. Surely you can run queries using the AWS SDK? Not so fast: the only way to actually run a query outside of the GUI is by connecting over JDBC using the Athena driver.

So to query Athena on a regular basis, we ended up using Data Pipeline again, this time running a ShellCommandActivity that prepares the results of the query in S3 and later ships them back to our RDS instance.

Using Data Pipelines to Query Athena

To accomplish this, we had to build a custom AMI with Java 1.8 (the default EC2 AMI ships with 1.7) and write a small Java class that could connect to Athena, run a query, and output the results to a specified location. Now, I’m not a good Java developer by any means, but I can get around. In the end, we ended up with something like this (loosely based on the sample provided by AWS):

Statement statement = conn.createStatement();
ResultSet rs = statement.executeQuery(sql);
ResultSetMetaData rsmd = rs.getMetaData();
CSVWriter writer = new CSVWriter(new FileWriter(outputLocation));
// Write each row of the result set as a line in the CSV file.
while ( {
  String[] entries = new String[rsmd.getColumnCount()];
  for (int i = 1; i <= rsmd.getColumnCount(); i++) {
    entries[i - 1] = rs.getString(i);
  }
  writer.writeNext(entries);
}
writer.close();
rs.close();
statement.close();

This worked incredibly well, and since the sql variable and outputLocation actually come from args, we can reuse the same class for future queries and simply duplicate the pipeline.
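The argument wiring might look something like this — a hypothetical sketch, since the post doesn’t show how the class reads its arguments:

```java
// Hypothetical sketch: the query and output path arrive as program arguments,
// so each Data Pipeline ShellCommandActivity can pass its own pair and the
// same class serves every pipeline. Class and field names are illustrative.
public class QueryArgs {
    final String sql;
    final String outputLocation;

    QueryArgs(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "usage: java -jar athena-query.jar <sql> <outputLocation>");
        }
        this.sql = args[0];
        this.outputLocation = args[1];
    }
}
```

Duplicating a pipeline then only means changing the two strings the ShellCommandActivity passes in; the AMI and the jar stay the same.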

Final setup, now serving data for our customers in production


Athena is an incredibly powerful tool, and I recommend that you take a look at it. It’s a cheap and fast way to run distributed queries, and it’s now powering production data for our application. At $5 per TB of data scanned, the barrier to entry for big data processing has been significantly lowered.
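To put that pricing in perspective, here is a quick back-of-the-envelope helper. The $5/TB figure is from the post; the scan sizes in the usage note are made up for illustration:

```java
// Rough cost model for Athena's pay-per-scan pricing: $5 per TB of data
// scanned by a query (the rate quoted in the post). Example inputs are
// illustrative, not taken from Statehill's actual workload.
public class AthenaCost {
    static final double USD_PER_TB = 5.0;

    // Cost in USD of a single query that scans the given number of gigabytes.
    static double queryCostUsd(double gbScanned) {
        return USD_PER_TB * gbScanned / 1024.0;
    }
}
```

Scanning a hypothetical 10 GB CSV export costs about five cents per query — and because Athena bills on bytes scanned, shrinking the scanned data (for instance with columnar formats) directly cuts the bill, which motivates the Parquet suggestion below.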

There are certainly a few ways to improve our setup; there almost always are. The main one would probably be swapping the CSVs we’re querying in Athena for Parquet files, to save on costs and improve query performance. For more on that, Mark Litwintschik wrote an excellent post comparing different data formats on Athena that I recommend reading.

