Running a PySpark Job on Cloud Dataproc Using Google Cloud Storage

30 minutes

About this Hands-on Lab

This hands-on lab introduces how to use Google Cloud Storage as the primary input and output location for Dataproc cluster jobs. Using GCS instead of the Hadoop Distributed File System (HDFS) lets us treat clusters as ephemeral: we can delete a cluster as soon as it is no longer needed while still preserving our data.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Prepare Our Environment
  1. First, we need to enable the Dataproc API:

    gcloud services enable dataproc.googleapis.com
  2. Then create a Cloud Storage bucket:

    gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
  3. Now create the Dataproc cluster:

    gcloud dataproc clusters create wordcount --region=us-central1 --zone=us-central1-f --single-node --master-machine-type=n1-standard-2
  4. And finally, download the wordcount.py file that will be used for the PySpark job:

    gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .
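
Before submitting the job, it can help to confirm that each resource is in place. A quick verification sketch in Cloud Shell, assuming the names used above:

    # Confirm the Dataproc API is enabled
    gcloud services list --enabled | grep dataproc
    # Confirm the bucket exists
    gsutil ls -b gs://$DEVSHELL_PROJECT_ID-data
    # Confirm the cluster was created (its status should be RUNNING)
    gcloud dataproc clusters list --region=us-central1
    # Confirm wordcount.py is in the working directory
    ls wordcount.py
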
Submit the PySpark Job to the Dataproc Cluster

In Cloud Shell, type:

gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- \
gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt \
gs://$DEVSHELL_PROJECT_ID-data/output/
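
Everything after the standalone -- separator is passed to wordcount.py itself rather than to gcloud: first the input text file from the lab's public bucket, then the output location in our own bucket. While the job runs, its state can be checked from Cloud Shell; a small sketch, assuming the cluster name above:

    # List jobs submitted to the wordcount cluster, with their current state
    gcloud dataproc jobs list --cluster=wordcount --region=us-central1
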
Review the PySpark Output
  1. In Cloud Shell, download output files from the GCS output location:

    gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .

    Note: Alternatively, we could download them to our local machine via the web console.
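
Spark normally writes results as one or more part-* files under the output prefix. Assuming wordcount.py uses that default text output, the word counts can be inspected directly:

    # List the output objects in GCS
    gsutil ls gs://$DEVSHELL_PROJECT_ID-data/output/
    # The gsutil cp above placed the part files in the current directory
    cat part-* | head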

Delete the Dataproc Cluster
  1. We don’t need our cluster any longer, so let’s delete it. In the web console, go to the top-left menu and into Big Data > Dataproc.

  2. Select the wordcount cluster, then click DELETE, and OK to confirm.

    Our job output remains in Cloud Storage, so we can delete Dataproc clusters when they are no longer in use to save costs, while preserving our input and output data.
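
    If you prefer to stay in Cloud Shell, the same cleanup can be done with the gcloud CLI; this is equivalent to the console steps above:

    # Delete the cluster (gcloud will ask for confirmation)
    gcloud dataproc clusters delete wordcount --region=us-central1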

Additional Resources

In this lab, you will create a single-node Dataproc cluster and a GCS bucket for your PySpark job output. Separating storage from compute lets you treat the cluster as ephemeral: you will delete it when you are done while preserving the results.

Launch your lab in incognito mode (or another browser's private browsing mode) to avoid issues with cached logins.

For detailed instructions on how to complete these tasks, expand each learning objective below, or click the Guide tab above the video player.

