Setting Up a Data Streaming Pipeline with Dataflow

45 minutes
  • 5 Learning Objectives

About this Hands-on Lab

One of the primary benefits of Dataflow is that it can handle both streaming and batch data processing in a serverless, fast, and cost-effective manner. In this hands-on lab, you’ll establish the necessary infrastructure (a Cloud Storage bucket, a Pub/Sub topic, and a BigQuery dataset) to execute a Dataflow template on real-time streaming data from New York City’s ever-busy taxi service.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Enable the Necessary APIs

Enable the Dataflow, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, and Cloud Resource Manager APIs, either through the user interface or the Cloud Shell.
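
From Cloud Shell, all five APIs can be enabled with a single `gcloud services enable` call. The Python sketch below only assembles that command so you can review it before running it. The service names are assumptions based on the usual identifiers for these APIs; confirm them with `gcloud services list --available` if any is rejected.

```python
import shlex

# Service names assumed to correspond to the five APIs named in this objective.
REQUIRED_SERVICES = [
    "dataflow.googleapis.com",              # Dataflow
    "storage-api.googleapis.com",           # Google Cloud Storage JSON
    "bigquery.googleapis.com",              # BigQuery
    "pubsub.googleapis.com",                # Cloud Pub/Sub
    "cloudresourcemanager.googleapis.com",  # Cloud Resource Manager
]


def enable_apis_command(services=REQUIRED_SERVICES):
    """Return the gcloud argument list that enables the given services."""
    return ["gcloud", "services", "enable", *services]


if __name__ == "__main__":
    # Print the command for pasting into Cloud Shell; nothing is executed here.
    print(shlex.join(enable_apis_command()))
```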

Create a Storage Bucket

Create a Cloud Storage bucket to hold the temporary Dataflow data.
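
As a rough sketch of this step, the helper below builds a `gcloud storage buckets create` command and the `gs://` path that Dataflow can later use for temporary files. The bucket name, the `us-central1` location, and the `temp` folder are hypothetical placeholders, not values prescribed by the lab. Bucket names must be globally unique, so pick your own.

```python
import shlex


def create_bucket_command(bucket_name, location="us-central1"):
    """Return the gcloud argument list that creates a Cloud Storage bucket."""
    return [
        "gcloud", "storage", "buckets", "create",
        f"gs://{bucket_name}",
        f"--location={location}",  # assumed region; use the one your lab specifies
    ]


def temp_location(bucket_name, folder="temp"):
    """Return the gs:// path to give Dataflow as its temporary location."""
    return f"gs://{bucket_name}/{folder}"


if __name__ == "__main__":
    # "my-dataflow-temp-bucket" is a placeholder name, not one from the lab.
    print(shlex.join(create_bucket_command("my-dataflow-temp-bucket")))
    print(temp_location("my-dataflow-temp-bucket"))
```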

Create a Dataset and Table

Create a BigQuery dataset and table with the proper schema to hold the Dataflow-generated data.
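
One way to do this from Cloud Shell is with the `bq mk` command, once for the dataset and once for the table. The sketch below assembles both commands. The project, dataset, and table names are hypothetical, and the schema is passed in as a path to a JSON schema file (here the placeholder `schema.json`) that you would fill in with the schema the lab provides.

```python
import shlex


def make_dataset_command(project, dataset):
    """Return the bq argument list that creates a dataset."""
    return ["bq", "mk", "--dataset", f"{project}:{dataset}"]


def make_table_command(project, dataset, table, schema):
    """Return the bq argument list that creates a table.

    `schema` is either a path to a JSON schema file or an inline
    "name:TYPE,name:TYPE" string; its contents are up to the caller.
    """
    return ["bq", "mk", "--table", f"{project}:{dataset}.{table}", schema]


if __name__ == "__main__":
    # All names below are placeholders, not values from the lab.
    print(shlex.join(make_dataset_command("my-project", "taxi_dataset")))
    print(shlex.join(make_table_command("my-project", "taxi_dataset", "rides", "schema.json")))
```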

Run a Pub/Sub to BigQuery Dataflow Job

Use the Pub/Sub to BigQuery Dataflow template to process the data.
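
If you prefer the command line to the console form, the job can be launched with `gcloud dataflow jobs run`. The sketch below builds that command under a few assumptions: the template path follows the regional `gs://dataflow-templates-REGION/latest/PubSub_to_BigQuery` pattern, the template takes `inputTopic` and `outputTableSpec` parameters, and the temporary files go under a `temp` folder in your bucket. The job name, region, bucket, topic, and table shown are placeholders; substitute the values from your lab.

```python
import shlex


def run_template_command(job_name, region, bucket, input_topic, output_table):
    """Return the gcloud argument list that starts a Pub/Sub to BigQuery job.

    input_topic:  full topic path, "projects/<project>/topics/<topic>"
    output_table: table spec, "<project>:<dataset>.<table>"
    """
    # Assumed location of the Google-provided template for the chosen region.
    template = f"gs://dataflow-templates-{region}/latest/PubSub_to_BigQuery"
    return [
        "gcloud", "dataflow", "jobs", "run", job_name,
        "--gcs-location", template,
        "--region", region,
        "--staging-location", f"gs://{bucket}/temp",
        "--parameters",
        f"inputTopic={input_topic},outputTableSpec={output_table}",
    ]


if __name__ == "__main__":
    # Every argument here is a placeholder, including the topic path.
    print(shlex.join(run_template_command(
        job_name="taxi-stream",
        region="us-central1",
        bucket="my-dataflow-temp-bucket",
        input_topic="projects/some-project/topics/some-topic",
        output_table="my-project:taxi_dataset.rides",
    )))
```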

Query the Resulting Dataset

Enter and run the desired queries against the resulting table.
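
A simple first check is to count the rows that have arrived so far, which needs no knowledge of the column names. The sketch below builds such a query and the matching `bq query` command using standard SQL. The project, dataset, and table names are placeholders.

```python
import shlex


def row_count_query(project, dataset, table):
    """Return a standard SQL query that counts the rows in a table."""
    return f"SELECT COUNT(*) AS row_count FROM `{project}.{dataset}.{table}`"


def bq_query_command(sql):
    """Return the bq argument list that runs a standard SQL query."""
    return ["bq", "query", "--use_legacy_sql=false", sql]


if __name__ == "__main__":
    # Placeholder names; replace them with your own project, dataset, and table.
    sql = row_count_query("my-project", "taxi_dataset", "rides")
    print(shlex.join(bq_query_command(sql)))
```

Running the count a few times while the job is streaming should show the number growing.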

Additional Resources

Your company has a new client, New York City. You have been tasked with establishing a way to analyze ongoing data from the city’s taxi cabs. You decide to use a Dataflow template to read the streaming data from a Pub/Sub topic and write it, properly formatted, to a BigQuery table.

To accomplish this task, you’ll need to complete the following steps:

  1. Enable the necessary APIs.
  2. Create a storage bucket.
  3. Create a dataset and table.
  4. Run a Pub/Sub to BigQuery Dataflow job.
  5. Query the resulting dataset.

Use the following schema for your BigQuery dataset table:


In the Dataflow template, set the Pub/Sub topic to:


