Create an Azure Machine Learning Pipeline

1.75 hours
  • 6 Learning Objectives

About this Hands-on Lab

With Azure Machine Learning Studio, we don’t need to write any code to get started using machine learning. With Machine Learning pipelines, we can drag and drop functionality into a graph to create repeatable processing units. In this lab, we create a Machine Learning pipeline to predict the progression of diabetes based on a set of health metrics, all without writing a single line of code.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Create a Training Cluster

A Machine Learning workspace has been provisioned at the start of the lab. View it in the Machine Learning Studio.

To perform any actions, we need available compute power. Since we’re not using a notebook, we don’t need a compute instance. Instead, we need the power a training cluster will provide.

  1. Click Compute in the left menu, then choose Training Cluster from the menu.

  2. Click the New button.

  3. Create a unique name for the training cluster.

  4. Use Standard_D2_v2 nodes for the virtual machine size.

  5. Set the minimum and maximum nodes to 2. (Note: For clusters outside the lab environment, it is better to set minimum nodes to 0 so the compute will deprovision when not in use. This saves money. We are using two so we don’t have to wait for compute to spin up and down in the middle of the lab.)

  6. Click Create.

The training cluster will take time to provision. While waiting, explore the Sample: Diabetes dataset available through Azure Open Datasets. Be sure to view the original data description for more information about what the data fields mean.

Register and Preview the Diabetes Dataset
  1. Click Datasets in the left menu, then choose Create Dataset.
  2. In the dropdown menu, choose From Open Datasets.
  3. Select Sample: Diabetes from the options grid, then click Next.
  4. Give the dataset a descriptive name, then click Create. For the rest of the lab, we will refer to this dataset as OpenDiabetesData.
  5. Click the OpenDiabetesData dataset, then click Generate profile.
  6. Select our training cluster as the compute target, then click Generate.
  7. Monitor the progress in the Experiments UI under the dataset_unregistered_datasets experiment.

This will take some time to complete, usually around 8-10 minutes. Grab a coffee in the meantime.

Explore the Dataset

These are arguably the most important steps in any Machine Learning project. We need to understand the data and what it represents to make an effective model from it. Take some time to become familiar with this data.

  1. Click Datasets in the left menu, then choose OpenDiabetesData.
  2. Open the Explore tab and browse the preview. Notice that there is a Path column provided by Azure. It is not part of the health data we want to use to train our model.
  3. Switch to the Profile tab and explore the distributions and statistics related to the data.
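The Profile tab reports per-column statistics. As a rough illustration of what those numbers mean, here is a minimal Python sketch using only the standard library. The `profile_column` helper and the sample values are hypothetical, made up for this example. They are not taken from the dataset or from Azure's profiling code.

```python
import statistics

def profile_column(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical values for a single numeric column (not real dataset rows).
sample_values = [21.5, 23.0, 26.2, 30.5, 32.1, 24.8, 28.9]
profile = profile_column(sample_values)

for name, value in profile.items():
    print(f"{name}: {value}")
```

A large gap between the mean and the median, or a standard deviation that is large relative to the range, is the kind of signal worth looking for when reading the real profile.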

Transform the Data
  1. Open the Designer from the left menu.
  2. Create a new pipeline using the Easy-to-use prebuilt modules.
  3. On the right menu, select the training cluster as the default compute for this pipeline.
  4. Give the pipeline a descriptive name. We will refer to this as the DiabetesPipeline for the lab.
  5. Open the Datasets submenu and drag the OpenDiabetesData node onto the canvas. Notice the output of this node is a DataFrameDirectory.
  6. The first transformation on the data is to remove the Path column.
    1. In the Data Transformation submenu, drag the Select Columns In Dataset node onto the canvas underneath the OpenDiabetesData node.
    2. Connect the output of OpenDiabetesData to the input of Select Columns in Dataset.
    3. Click the Select Columns in Dataset node, then click Edit column in the right menu.
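    4. For the first line, choose All Columns from the dropdown. Then add a second line that excludes the Path column by name, so that every column except Path passes through.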
    5. Click Save.
  7. Next, we need to split our data into training and test sets, with 70% going to training and 30% going to testing.
    1. Drag a Split Data node from the Data Transformation submenu on the left onto the canvas. Position it under the Select Columns in Dataset node.
    2. Connect the output of Select Columns in Dataset to the input of Split Data. Notice that both of these are DataFrameDirectory type.
    3. Change the Fraction of rows to 0.7. The first (left) output, 70% of the rows, will be our training set. The second (right) output holds the remaining 30% for testing.
    4. We want the data to be randomized, but in a repeatable way so that it is the same between training experiments. For this, set the Random seed to any non-zero value.
    5. Add a comment to the node: "Train/Test Split 70/30".
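For readers curious what these two nodes do conceptually, the transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `drop_column`, `split_rows`, the sample rows, and the seed value 42 are all hypothetical, and the designer's own implementation is not shown here.

```python
import random

def drop_column(rows, name):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

def split_rows(rows, fraction=0.7, seed=42):
    """Shuffle rows with a fixed seed, then split into (first, second) outputs."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows standing in for the registered dataset.
rows = [{"Path": "part-0", "X": 20.0 + i, "Y": 100 + 5 * i} for i in range(10)]

cleaned = drop_column(rows, "Path")
train, test = split_rows(cleaned, fraction=0.7, seed=42)

print(len(train), "training rows,", len(test), "test rows")
```

Because the shuffle uses a fixed seed, running the split again yields the same training and test rows, which is the repeatability the Random seed setting is there to provide.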

Train a Model
  1. Now that the data is transformed, let’s train a model using it.
    1. Drag a Train Model node onto the canvas from the Model Training submenu on the left. Position it under the Split Data node.
    2. Connect the left output of the Split Data node (our training data) to the right input of the Train Model node. In the Train Model settings, set the label column to Y, the value we want to predict.
  2. The Train Model node requires an UntrainedModelDirectory as its left input. This comes from a Machine Learning algorithm.
    1. Drag a Linear Regression node onto the canvas from the Machine Learning Algorithms submenu on the left.
    2. Connect the output of Linear Regression to the left input of Train Model.
  3. The output of Train Model is a ModelDirectory. This is the trained model, but it doesn’t tell us anything about how accurate the model is.
    1. Drag a Score Model node onto the canvas from the Model Scoring & Evaluation submenu on the left. Place it below the Train Model node.
    2. Notice that it takes two inputs, a ModelDirectory and a DataFrameDirectory. Connect the output of Train Model to the left input (the ModelDirectory).
    3. Connect the right output from Split Data (our test data) to the right input of Score Model.
    4. The Score Model node produces predictions, which is the test data with an extra column for the predicted values (Scored DataFrameDirectory). However, this does not tell us how accurate the predictions are.
  4. To quantitatively evaluate the model, drag an Evaluate Model node onto the canvas from the Model Scoring & Evaluation submenu on the left.
    1. Connect the output of Score Model to the left input of Evaluate Model (the DataFrameDirectory type).
  5. Now that our pipeline is fully constructed, we can run it. Click the Submit button at the top right of the page.
    1. In the popup, select Create new for a new Experiment.
    2. Give it a descriptive name. We’ll refer to it as DiabetesExperiment for the rest of the lab.
    3. In the Run description, add "Linear Regression" to the comment. This gives us an easy way to distinguish the run later.
    4. Finally, click Submit to start the pipeline!

This will take around 10-15 minutes to run. Time for another coffee. We can watch the progress in the Experiments UI.
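While the pipeline runs, it may help to see the train-then-score flow in miniature. The sketch below fits a one-feature ordinary least squares line in plain Python and then predicts values for held-out inputs. It is a simplified stand-in, not the designer's Linear Regression implementation, and `fit_line`, `score`, and the toy data are hypothetical names and values chosen for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def score(model, xs):
    """Apply a fitted (slope, intercept) model to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]

# Toy training data that follows y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
test_x = [5.0, 6.0]

model = fit_line(train_x, train_y)   # plays the role of Train Model
scored = score(model, test_x)        # plays the role of Score Model
print(model, scored)
```

The split between `fit_line` and `score` mirrors the pipeline: training sees only the training rows, and scoring applies the trained model to rows it has not seen.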

View the Results
  1. Once the pipeline has completed, open DiabetesPipeline in the designer.
  2. Right-click the Score Model node and choose Visualize Scored dataset.
  3. Scan through the rows and compare the actual value in column Y to the predicted value in column Scored Labels. Note that many predictions are not close to the actual value, and some are far off.
  4. Close the Score Model result visualization.
  5. Right-click the Evaluate Model node and choose Visualize Evaluation results. The Root Mean Squared Error (RMSE) summarizes how far our predictions deviate from the actual values, in the same units as Y. A perfect model would have an RMSE of 0.

Our actual labels (Y) range from approximately 0 to 350. When building the lab, we got an RMSE of around 58, which is quite high compared to the scale. This means the model did not do a good job of predicting the labels. There are many tweaks we can make to this pipeline to make the model more accurate, which we will do in a future lab.
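To make the metric concrete, here is a small Python sketch of how RMSE is computed and how the lab's figure compares to the label range. The `rmse` helper and the three sample rows are hypothetical. Only the values 58 and 350 come from the lab text above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical actual and predicted values, not real pipeline output.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 230.0]
error = rmse(actual, predicted)

# Lab figures: RMSE of about 58 on labels spanning roughly 0 to 350.
relative_to_range = 58 / 350

print(f"RMSE on sample rows: {error:.2f}")
print(f"Lab RMSE as a share of the label range: {relative_to_range:.0%}")
```

Squaring the errors before averaging means a few large misses raise RMSE more than many small ones, so it is worth checking individual rows in the scored dataset as well as the summary number.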

Additional Resources

NOTE: Make sure to create the training cluster using Standard_D2_v2 instances.

What if we need to take advantage of the power of machine learning, but don't have time to learn all of the code? What if we need a visual representation of our work to show our boss or other stakeholders? Azure Machine Learning pipelines help with both of these problems. We can start transforming data and training models, all in a visual way.

We will use the Sample: Diabetes dataset to train a model and predict the progression of the disease based on different health markers.

This lab uses the designer feature of Azure Machine Learning Studio. Designer is only available in an Enterprise Machine Learning workspace, one of which has been provisioned for the lab. If you repeat these steps with your own resources, be sure to use an Enterprise workspace.

Lab Goals

  1. Create a training cluster
  2. Register the diabetes dataset
  3. Generate a preview for the dataset
  4. Explore the dataset
  5. Transform the dataset
  6. Train a model
  7. View the results

Logging in to the lab environment

To avoid issues with the lab, use a new Incognito or Private browser window to log in to the lab. This ensures that your personal account credentials, which may be active in your main window, are not used for the lab.
