Affordable and Scalable Data Analytics on AWS Using Athena and Data Pipeline

A Cloud Guru News

Statehill uses AWS to turn legislative data into actionable intelligence, helping you advocate more effectively

If you're using Amazon Web Services, or even just keeping tabs on their service offerings, you can't have missed the latest addition to their suite of analytics services: Athena. It's a tool that fits very well with the recent trend of serverless cloud computing: essentially, computing without having to allocate the underlying resources yourself.

AWS Athena, like Google's BigQuery, is a serverless tool for querying big data sets without having to set up clusters of physical or virtual machines. A somewhat fair comparison in the AWS space would be that Athena is to EMR what Lambda is to EC2: you face plenty of limitations and can only do a subset of what the non-serverless option offers, but if you architect with those limitations in mind, you can build something that delivers results at a really low price point.

We decided that we wanted to port some of the legislator ideology calculations in our application to Athena, with the goal of scaling better and cutting costs.

Statehill connects legislative data from around the world to discover, analyze, and measure government's impact on you. We start by collecting and organizing legislative information into actionable intelligence that facilitates your ability to advocate effectively.

Essentially, what we're doing is taking historical voting records and determining how Republican- or Democrat-leaning legislators are across a number of categories. We're currently doing this in a number of US states, including Montana, Texas, and Wisconsin.

Ideology for Alan Redfield (R), Montana legislator in the 65th session

First and foremost, we had to confirm that we could accomplish the task using nothing but SQL, as that's how you query Athena. Once we had determined that, our first step was shipping data from our RDS instance to S3 in CSV format so we could create a table over it in Athena. And how do you efficiently ship data from RDS to S3 on a regular basis? Obviously using Data Pipeline!
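
Creating the table over those CSVs is a single DDL statement. Here's a minimal sketch with a hypothetical bucket and a simplified schema (our real table has more columns); the SQL can be pasted into the Athena console, or executed over the JDBC connection described below:

// Map the CSV files under an S3 prefix to a queryable Athena table.
// Schema and bucket name are illustrative only.
String ddl =
    "CREATE EXTERNAL TABLE votes ("
  + "  legislator_id string,"
  + "  bill_id string,"
  + "  vote string"
  + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
  + " LOCATION 's3://example-bucket/votes/'";
statement.execute(ddl); // statement is the JDBC Statement from the job below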

Data Pipeline

Data Pipeline is an incredibly powerful Swiss Army knife of a service, and thanks to its scheduling support it's a viable option for running recurring data transformation tasks, or even plain computing tasks, which is how we ended up using it to actually query Athena.

Athena may at first glance look like just a GUI. But surely you can run queries using the AWS SDK? Not so fast: the only way to actually run a query outside of the console is to connect over JDBC using the Athena driver. So to query Athena on a regular basis we ended up using Data Pipeline again, but this time to run a ShellCommandActivity that prepares the results of the query in S3 and later ships them back to our RDS instance.
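
Getting that connection looks roughly like this. A minimal sketch: the driver class name, endpoint, and property keys follow the AWS JDBC documentation of the time, so check them against your driver version, and the staging bucket is a made-up placeholder:

import java.io.FileWriter;
import java.sql.*;
import java.util.Properties;
import com.opencsv.CSVWriter; // used below; older OpenCSV releases package this as au.com.bytecode.opencsv

// Athena writes raw query results to the staging directory in S3;
// credentials are passed as driver properties.
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
Properties info = new Properties();
info.put("user", System.getenv("AWS_ACCESS_KEY_ID"));
info.put("password", System.getenv("AWS_SECRET_ACCESS_KEY"));
info.put("s3_staging_dir", "s3://example-bucket/athena-staging/");
Connection conn = DriverManager.getConnection(
    "jdbc:awsathena://athena.us-east-1.amazonaws.com:443", info);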

To accomplish this we had to build a custom AMI with Java 1.8 (the default EC2 AMI ships with 1.7) and a small Java class that connects to Athena, runs a query, and writes the results to a specified location. Now, I'm not a good Java developer by any means, but I can get around. In the end, we came up with something like this (loosely based on the sample provided by AWS):

// Stream the result set straight into a CSV file (CSVWriter is OpenCSV).
Statement statement = conn.createStatement();
ResultSet rs = statement.executeQuery(sql);
ResultSetMetaData rsmd = rs.getMetaData();
CSVWriter writer = new CSVWriter(
  new FileWriter(outputLocation),
  ',',
  CSVWriter.NO_QUOTE_CHARACTER,
  CSVWriter.NO_ESCAPE_CHARACTER
);
while (rs.next()) {
  String[] entries = new String[rsmd.getColumnCount()];

  // JDBC columns are 1-indexed, the output array is 0-indexed.
  for (int i = 1; i <= rsmd.getColumnCount(); i++) {
    entries[i - 1] = rs.getString(i);
  }
  writer.writeNext(entries);
}
writer.close();

This worked incredibly well, and since the sql variable and outputLocation come from args, we can reuse the same class for future queries and simply duplicate the pipeline.
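
Concretely, the entry point is just a thin wrapper around the positional arguments, so each pipeline's ShellCommandActivity only differs in what it passes on the command line. A minimal sketch; the class and method names are illustrative, not our actual code:

public class AthenaQueryRunner {
  public static void main(String[] args) throws Exception {
    String sql = args[0];            // the Athena query to run
    String outputLocation = args[1]; // where to write the CSV results
    runQuery(sql, outputLocation);   // the JDBC and CSVWriter logic above
  }
}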

Final setup, now serving data for our customers in production

Conclusion

Athena is an incredibly powerful tool, and I recommend you take a look at it. It's a cheap and fast way to run distributed queries, and it now powers production data for our application. At $5 per TB of data scanned, the barrier to entry for big data processing has been significantly lowered.
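
To put that price in perspective: a query that scans a 10 GB data set costs about five cents, and because you pay per byte scanned rather than per instance hour, you pay nothing while no queries are running.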

There are certainly a few ways to improve our setup; there almost always are. The main one would probably be swapping the CSVs we query in Athena for Parquet files, to save on cost and improve query performance, since Athena scans far fewer bytes against a compressed, columnar format. For more about that, Mark Litwintschik wrote an excellent post comparing different data formats on Athena that I recommend reading.
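
If your Athena deployment supports CREATE TABLE AS SELECT (CTAS), the conversion could even be done in place. A hypothetical sketch, reusing the JDBC statement from the job above (the table names and output location are made up):

statement.execute(
    "CREATE TABLE votes_parquet "
  + "WITH (format = 'PARQUET', "
  + "      external_location = 's3://example-bucket/votes-parquet/') AS "
  + "SELECT * FROM votes");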
