You work for an organization with a wide variety of users, and you have been tasked with running data analytics for an upcoming marketing campaign. The end goal is to determine the most common user groups by gender and age. All the data you need is stored in an S3 bucket. In this lab, you will run data analytics on hundreds or thousands of files containing CSV data about the users who interact with the application. To accomplish this, you will first create an EMR cluster and copy the user data into HDFS. Next, you will run a PySpark script to count the users, grouped by age and gender. Finally, you will load the results into S3 for further analysis.
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Create an EMR Cluster
Navigate into the EMR console and create an EMR cluster with 1 primary node and 1 core node. The instance types should both be m4.large. Ensure the EMR cluster has Spark and Hadoop installed on it.
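The lab uses the console, but the same cluster shape can also be written down programmatically. The following is a minimal sketch, assuming boto3 is installed and the default EMR roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in your account. The cluster name, release label, and region are hypothetical placeholders, not values given by the lab.

```python
def build_cluster_config(name="user-analytics-lab", release_label="emr-5.36.0"):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    The default name and release label are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # The lab requires both Hadoop and Spark on the cluster.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",  # API name for the primary node
                    "InstanceType": "m4.large",
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m4.large",
                    "InstanceCount": 1,
                },
            ],
            # Keep the cluster running so later steps can be submitted.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


def create_cluster(region="us-east-1"):
    """Create the cluster and return its ID (requires AWS credentials)."""
    import boto3  # imported here so the config builder works without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_cluster_config())
    return response["JobFlowId"]
```

Calling create_cluster() would start billable resources, so it is left for you to invoke deliberately; build_cluster_config() alone is safe to inspect.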
- Copy Data from S3 to HDFS Using s3-dist-cp
Using the s3-dist-cp command, copy the user data from S3 into HDFS. The user data is stored in a public S3 bucket.
- Run a PySpark Script Using spark-submit
Using the spark-submit command, execute the PySpark script to group the user data by the dob.age and gender attributes. Count all the records, ensure they are in ascending order, and report the results in CSV format back to HDFS.
Helpful documentation: Using the command-runner.jar application on EMR to run spark-submit
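The lab supplies its own PySpark script, so the following is only a sketch of what such a job might look like, not the lab's actual code. It assumes the CSV files have a header row with a column literally named dob.age, and that "ascending order" refers to the count. The function names, app name, and step name are hypothetical. The last helper builds an EMR step definition that runs spark-submit through command-runner.jar.

```python
import sys


def count_users_by_age_and_gender(spark, input_path, output_path):
    """Group users by age and gender, count them, and write CSV to HDFS."""
    users = spark.read.csv(input_path, header=True, inferSchema=True)
    counts = (
        # Rename first: a dot inside a column name is awkward to reference.
        users.withColumnRenamed("dob.age", "age")
        .groupBy("age", "gender")
        .count()
        .orderBy("count")  # orderBy sorts ascending by default
    )
    counts.write.csv(output_path, header=True, mode="overwrite")


def main(argv):
    """Entry point when run as: spark-submit <script> <input> <output>."""
    from pyspark.sql import SparkSession  # available on the EMR cluster

    spark = SparkSession.builder.appName("UserCounts").getOrCreate()
    count_users_by_age_and_gender(spark, argv[1], argv[2])
    spark.stop()


def spark_submit_step(script_path, input_path, output_path):
    """Build an EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": "Count users by age and gender",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", script_path, input_path, output_path],
        },
    }


# Run the job only when both an input and an output path are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

A dictionary returned by spark_submit_step could be passed in the Steps list of boto3's add_job_flow_steps call, or you can run the equivalent spark-submit command directly on the primary node.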
- Copy Data from HDFS to S3 Using s3-dist-cp
Copy the CSV results from HDFS to S3 using the s3-dist-cp command. You’ll need to create a new S3 bucket to store your results in.
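One possible way to script this final objective is sketched below, assuming boto3 and valid AWS credentials. The bucket name, region, and the results/ prefix are hypothetical and must be replaced with your own values; bucket names must be globally unique. The copy is expressed as an EMR step that runs s3-dist-cp through command-runner.jar.

```python
def create_bucket_kwargs(bucket, region):
    """Return keyword arguments for the boto3 S3 create_bucket call."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        # Outside us-east-1, S3 requires an explicit location constraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_results_bucket(bucket, region):
    """Create the results bucket (requires AWS credentials)."""
    import boto3  # imported here so the pure helpers work without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))


def export_step(hdfs_src, bucket, prefix="results/"):
    """Build an EMR step that copies results from HDFS to S3."""
    return {
        "Name": "Copy results to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                f"--src={hdfs_src}",
                f"--dest=s3://{bucket}/{prefix}",
            ],
        },
    }
```

The same s3-dist-cp arguments can also be typed directly on the primary node if you prefer not to submit a step.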