This hands-on lab introduces how to use Google Cloud Storage as the primary input and output location for Dataproc cluster jobs. Using GCS instead of the Hadoop Distributed File System (HDFS) lets us treat clusters as ephemeral: we can delete clusters that are no longer in use while still preserving our data.
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Prepare Our Environment
First, we need to enable the Dataproc API:
gcloud services enable dataproc.googleapis.com
Then create a Cloud Storage bucket:
gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
Now create the Dataproc cluster:
gcloud dataproc clusters create wordcount --region=us-central1 --zone=us-central1-f --single-node --master-machine-type=n1-standard-2
And finally, download the wordcount.py file that will be used for the pyspark job:
gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .
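For reference, a word count in PySpark is short enough to sketch here. The lab provides the actual wordcount.py, which may differ in detail; the version below is a hypothetical minimal sketch that assumes the input text path and output path arrive as the first and second command-line arguments (matching how we pass them in the submit step that follows).

# Hypothetical minimal wordcount.py; the lab's real script may differ.
import sys
from pyspark.sql import SparkSession

def main():
    # Input text file and output directory are passed on the command line.
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Read the text from GCS, split lines into words, and count occurrences.
    lines = spark.sparkContext.textFile(input_path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Write the results back out to the GCS output location.
    counts.saveAsTextFile(output_path)
    spark.stop()

if __name__ == "__main__":
    main()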
- Submit the Pyspark Job to the Dataproc Cluster
In Cloud Shell, type:
gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt gs://$DEVSHELL_PROJECT_ID-data/output/
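Everything after the standalone -- is passed to wordcount.py as arguments: the Romeo and Juliet text as input and our bucket's output/ path as the destination. If we ever want to script the submission rather than use gcloud, a rough equivalent with the google-cloud-dataproc Python client library might look like the sketch below. This is only an illustrative assumption, not part of the lab: the project ID placeholder is ours to fill in, and this approach requires the script to already live in GCS, so it points at the copy in the lab's public bucket.

# Minimal sketch: submitting the same PySpark job with the Dataproc client library.
from google.cloud import dataproc_v1

project_id = "your-project-id"  # assumption: replace with your actual project ID
region = "us-central1"

job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "wordcount"},
    "pyspark_job": {
        # Unlike gcloud, the client library needs the script to be in GCS already.
        "main_python_file_uri": "gs://acg-gcp-labs-resources/data-engineer/dataproc/wordcount.py",
        "args": [
            "gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt",
            f"gs://{project_id}-data/output/",
        ],
    },
}

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()  # blocks until the job completes
print(f"Job finished with state: {result.status.state.name}")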
- Review the Pyspark Output
In Cloud Shell, download output files from the GCS output location:
gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .
Note: Alternatively, we could download them to our local machine via the web console.
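If we prefer to fetch the output programmatically, a minimal sketch with the google-cloud-storage Python client library is shown below; the project ID placeholder and the flattened local filenames are assumptions for illustration.

# Minimal sketch: listing and downloading the job output with the GCS client library.
from google.cloud import storage

project_id = "your-project-id"  # assumption: replace with your actual project ID

client = storage.Client(project=project_id)
bucket = client.bucket(f"{project_id}-data")

# The PySpark job writes its results as one or more part-* files under output/.
for blob in client.list_blobs(bucket, prefix="output/"):
    print(blob.name)
    # Flatten the object path into a local filename, e.g. output_part-00000.
    blob.download_to_filename(blob.name.replace("/", "_"))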
- Delete the Dataproc Cluster
We don’t need our cluster any longer, so let’s delete it. In the web console, open the navigation menu in the top left and go to Big Data > Dataproc.
Select the wordcount cluster, then click DELETE, and OK to confirm.
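If we would rather script the cleanup than click through the console, a minimal sketch with the google-cloud-dataproc client library, assuming the cluster name and region used in this lab, looks like this:

# Minimal sketch: deleting the wordcount cluster with the Dataproc client library.
from google.cloud import dataproc_v1

project_id = "your-project-id"  # assumption: replace with your actual project ID
region = "us-central1"

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

operation = cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": "wordcount"}
)
operation.result()  # blocks until the cluster is deleted
print("Cluster deleted; the job output remains in Cloud Storage.")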
Our job output remains in Cloud Storage, so we can delete Dataproc clusters when they are no longer in use to save costs while still preserving our input and output data.