One of the primary benefits of Dataflow is that it can handle both streaming and batch data processing in a serverless, fast, and cost-effective manner. In this hands-on lab, you'll establish the necessary infrastructure — including a Cloud Storage bucket, a Pub/Sub topic, and a BigQuery dataset — to execute a Dataflow template on real-time streaming data from New York City's ever-busy taxi service.
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Enable the Necessary APIs
Enable the Dataflow, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, and Cloud Resource Manager APIs, either through the console UI or the Cloud Shell (see the first sketch after this list).
- Create a Storage Bucket
Create a Cloud Storage bucket to hold Dataflow's temporary staging data (second sketch below).
- Create a Dataset and Table
Create a BigQuery dataset and table with the proper schema to hold the data generated by the streaming pipeline (third sketch below).
- Run a Pub/Sub to BigQuery Dataflow Job
Use the Google-provided Pub/Sub to BigQuery Dataflow template to process the streaming data (fourth sketch below).
- Query the Resulting Dataset
Write and run queries against the resulting BigQuery table to verify the streaming data (final sketch below).
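The sketches below illustrate one way to complete each objective from Cloud Shell. All project IDs, bucket names, and regions are placeholders, not values supplied by the lab. First, enabling the required APIs; the service IDs follow standard Google Cloud naming and can be confirmed with `gcloud services list --available`:

```sh
# Enable the five services this lab depends on in the current project.
gcloud services enable \
  dataflow.googleapis.com \
  storage-api.googleapis.com \
  bigquery.googleapis.com \
  pubsub.googleapis.com \
  cloudresourcemanager.googleapis.com
```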
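Next, a minimal sketch of creating the temporary-data bucket. The bucket name and region are assumptions; bucket names must be globally unique:

```sh
# Create a regional bucket for Dataflow's temporary files.
gsutil mb -l us-central1 gs://my-dataflow-temp-bucket
```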
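Then the BigQuery dataset and table. The `taxirides.realtime` naming and the schema below are assumptions modeled on the public NYC taxi stream; adjust them to match the lab's instructions:

```sh
# Create the dataset, then a table whose schema matches the taxi messages.
bq mk --dataset taxirides
bq mk --table taxirides.realtime \
  ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,\
ride_status:string,passenger_count:integer
```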
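With the infrastructure in place, the streaming job can be launched from the Google-provided template. `PROJECT_ID` is a placeholder, and the public taxi topic shown here (`projects/pubsub-public-data/topics/taxirides-realtime`) is an assumption based on the common version of this lab:

```sh
# Run the Pub/Sub to BigQuery streaming template, reading from the
# public taxi topic and writing rows into the table created above.
gcloud dataflow jobs run taxi-streaming-job \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --staging-location gs://my-dataflow-temp-bucket/temp \
  --parameters \
inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,\
outputTableSpec=PROJECT_ID:taxirides.realtime
```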
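Finally, a sample query to confirm that rows are arriving. The table name and `ride_status` column match the assumed schema above; substitute whatever query the lab asks for:

```sh
# Count ingested rides per status using standard SQL.
bq query --use_legacy_sql=false \
  'SELECT ride_status, COUNT(*) AS ride_count
   FROM taxirides.realtime
   GROUP BY ride_status'
```

Because the pipeline is streaming, re-running the query after a minute or two should show the row counts growing.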