Data Analytics with Spark and EMR

1 hour
  • 4 Learning Objectives

About this Hands-on Lab

You work for an organization with a wide variety of users, and you have been tasked with running data analytics for an upcoming marketing campaign. The end goal is to determine the most common user groups by gender and age. All the data you need is stored in an S3 bucket. In this lab, you will run data analytics on hundreds to thousands of files containing CSV data about the users who interact with the application. To accomplish this, you will first create an EMR cluster and copy the user data into HDFS. Next, you will run a PySpark script to count the users, grouping them by age and gender. Finally, you will copy the results into S3 for further analysis.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Create an EMR Cluster

Navigate to the EMR console and create an EMR cluster with 1 primary node and 1 core node. Both nodes should use the m4.large instance type. Ensure the cluster has Spark and Hadoop installed.
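
The lab performs this step in the console. For readers who prefer to script it, the same cluster shape can be sketched as an `aws emr create-cluster` call. The following is a minimal sketch, not the lab's official method: the cluster name and release label are assumptions chosen for illustration, and `--use-default-roles` assumes the default EMR roles already exist in the account.

```python
import shlex


def build_create_cluster_command(name="user-analytics", release_label="emr-5.36.0"):
    """Return the argv for an `aws emr create-cluster` call matching the lab spec.

    `name` and `release_label` are illustrative assumptions, not lab values.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        # The lab requires both Spark and Hadoop on the cluster.
        "--applications", "Name=Spark", "Name=Hadoop",
        # With --instance-type and --instance-count 2, one node becomes the
        # primary node and the remaining node becomes a core node.
        "--instance-type", "m4.large",
        "--instance-count", "2",
        # Assumes the default EMR service role and instance profile exist.
        "--use-default-roles",
    ]


command = build_create_cluster_command()
# Print a shell-safe command line that could be pasted into a terminal.
print(shlex.join(command))
```

Building the command as a list keeps each flag and value separate, which avoids shell-quoting mistakes if the list is later passed to `subprocess.run`.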

Helpful Documentation

Getting Started with Amazon EMR

Apache Spark on EMR

Copy Data from S3 to HDFS Using s3-dist-cp

Using the s3-dist-cp command, copy the user data from S3 into HDFS. The user data is stored in a public S3 bucket.
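
As a rough illustration of the copy, the sketch below assembles an `s3-dist-cp` command line with its `--src` and `--dest` options. The bucket name and both paths are hypothetical placeholders, since the lab text does not name them; substitute the values provided in the lab environment.

```python
from urllib.parse import urlparse


def build_s3_dist_cp_command(src, dest):
    """Return the argv for s3-dist-cp after checking both URI schemes."""
    allowed_schemes = {"s3", "hdfs"}
    for uri in (src, dest):
        scheme = urlparse(uri).scheme
        if scheme not in allowed_schemes:
            raise ValueError(f"expected an s3:// or hdfs:// URI, got: {uri!r}")
    return ["s3-dist-cp", "--src", src, "--dest", dest]


# Hypothetical locations: replace with the lab's bucket and an HDFS path of your choice.
SOURCE = "s3://example-public-bucket/user-data/"
DESTINATION = "hdfs:///user-data/"

command = build_s3_dist_cp_command(SOURCE, DESTINATION)
# Print the command as it would be typed on the cluster's primary node.
print(" ".join(command))
```

The scheme check is only a convenience for catching a mistyped path before the job is launched; `s3-dist-cp` itself runs on the cluster, typically from an SSH session on the primary node or as an EMR step.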

Helpful Documentation

Using s3-dist-cp on EMR

How do I use s3-dist-cp?

Run a PySpark Script Using spark-submit

Using the spark-submit command, execute the PySpark script to group the user data by the dob.age and gender attributes. Count the records in each group, sort the results in ascending order, and write the results in CSV format to HDFS.
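
The lab supplies its own script, which is not reproduced here. The sketch below shows one way such a script might look, under several assumptions: the CSV files have a header row with columns named `dob.age` and `gender`, "ascending order" refers to the record count, and the input and output paths are passed as command-line arguments. The function names and the application name are hypothetical.

```python
import sys


def quote_column(name):
    """Backtick-quote a column name so Spark treats dots as literal characters."""
    return "`" + name.replace("`", "``") + "`"


def count_users(df):
    """Group by age and gender, count each group, and sort by count ascending."""
    from pyspark.sql import functions as F  # lazy import: requires a Spark install

    return (
        # Without backticks, Spark would read "dob.age" as field "age" of column "dob".
        df.groupBy(F.col(quote_column("dob.age")).alias("age"), F.col("gender"))
        .count()
        .orderBy(F.col("count").asc())
    )


def main(input_path, output_path):
    """Read the user CSV files, aggregate them, and write CSV results."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("UserCountByAgeAndGender").getOrCreate()
    users = spark.read.csv(input_path, header=True, inferSchema=True)
    count_users(users).write.csv(output_path, header=True, mode="overwrite")
    spark.stop()


# Run only when both paths are supplied, for example via spark-submit.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Saved under a hypothetical name such as `count_users.py`, the script could be launched with `spark-submit count_users.py hdfs:///user-data/ hdfs:///results/`, where both paths are placeholders. Aliasing `dob.age` to `age` keeps the dotted name out of the output header.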

Helpful Documentation

Using the command-runner.jar application on EMR to run spark-submit

Copy Data from HDFS to S3 Using s3-dist-cp

Copy the CSV results from HDFS to S3 using the s3-dist-cp command. You’ll need to create a new S3 bucket to store your results.
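
Besides running `s3-dist-cp` directly on the primary node, the copy can be submitted as an EMR step through `command-runner.jar`. The sketch below builds a step definition in the shape that the boto3 EMR client's `add_job_flow_steps` call accepts in its `Steps` list. The bucket name, paths, and step name are hypothetical, and submitting the step is left out so the example runs without AWS credentials.

```python
def build_copy_step(src, dest, name="Copy results to S3"):
    """Return an EMR step definition that runs s3-dist-cp via command-runner.jar."""
    return {
        "Name": name,
        # Keep the cluster running if the copy fails so it can be retried.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical paths: use your HDFS output directory and your newly created bucket.
step = build_copy_step("hdfs:///results/", "s3://example-results-bucket/results/")
print(step["HadoopJarStep"]["Args"])
```

Submitting the copy as a step records it in the cluster's step history, which can make a failed copy easier to diagnose than a command run in an interactive shell.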

Helpful Documentation

Using s3-dist-cp on EMR

How do I use s3-dist-cp?
