With Azure Machine Learning Studio, we don’t need to write any code to get started using machine learning. With Machine Learning pipelines, we can drag and drop functionality into a graph to create repeatable processing units. In this lab, we create a Machine Learning pipeline to predict the progression of diabetes based on a set of health metrics, all without writing a single line of code.
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Create a Training Cluster
A Machine Learning workspace has been provisioned at the start of the lab. View it in the Machine Learning Studio.
To perform any actions, we need available compute power. Since we’re not using a notebook, we don’t need a compute instance. Instead, we need the power a training cluster will provide.
Click Compute in the left menu, then choose Training Cluster from the menu.
Click the New button.
Create a unique name for the training cluster.
Use Standard_D2_v2 nodes for the virtual machine size.
Set the minimum and maximum nodes to 2. (Note: For clusters outside the lab environment, it is better to set minimum nodes to 0 so the compute will deprovision when not in use. This saves money. We are using two so we don’t have to wait for compute to spin up and down in the middle of the lab.)
Click Create.
The training cluster will take time to provision. While waiting, explore the Sample: Diabetes dataset available through Azure Open Datasets. Be sure to view the original data description for more information about what the data fields mean.
- Register and Preview the Diabetes Dataset
- Click Datasets in the left menu, then choose Create Dataset.
- In the dropdown menu, choose From Open Datasets.
- Select Sample: Diabetes from the options grid, then click Next.
- Give the dataset a descriptive name, then click Create. For the rest of the lab, we will refer to this dataset as OpenDiabetesData.
- Click the OpenDiabetesData dataset, then click Generate profile.
- Select our training cluster as the compute target, then click Generate.
- Monitor the progress in the Experiments UI under the dataset_unregistered_datasets experiment.
This will take some time to complete, usually around 8-10 minutes. Grab a coffee in the meantime.
- Explore the Dataset
These are arguably the most important steps in any Machine Learning project. We need to understand the data and what it represents to make an effective model from it. Take some time to become familiar with this data.
- Click Datasets in the left menu, then choose OpenDiabetesData.
- Open the Explore tab and browse the preview. Notice the Path column, which is provided by Azure. It is not part of the health data we want to use to train our model.
- Switch to the Profile tab and explore the distributions and statistics related to the data.
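To make the Profile tab's numbers less abstract, here is a minimal sketch of the kind of per-column summary a data profile reports. It is an illustration in plain Python, not Azure's implementation. The `profile_column` helper and the sample values are hypothetical and are not taken from the diabetes dataset.

```python
import statistics


def profile_column(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


# Made-up values standing in for one numeric column.
sample = [10.0, 20.0, 30.0, 40.0, 50.0]
profile = profile_column(sample)
print(profile)
```

Comparing the mean with the median, and the standard deviation with the min/max range, is a quick way to spot skewed columns or outliers before training.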
- Transform the Data
- Open the Designer from the left menu.
- Create a new pipeline using the Easy-to-use prebuilt modules.
- On the right menu, select the training cluster as the default compute for this pipeline.
- Give the pipeline a descriptive name. We will refer to this as the DiabetesPipeline for the lab.
- Open the Datasets submenu and drag the OpenDiabetesData node onto the canvas. Notice the output of this node is a DataFrameDirectory.
- The first transformation on the data is to remove the Path column.
  - In the Data Transformation submenu, drag the Select Columns in Dataset node onto the canvas underneath the OpenDiabetesData node.
  - Connect the output of OpenDiabetesData to the input of Select Columns in Dataset.
  - Click the Select Columns in Dataset node, then click Edit column in the right menu.
  - For the first line, choose All Columns from the dropdown.
  - Add a second line that excludes the Path column by name.
  - Click Save.
- Next, we need to split our data into training and test sets, with 70% going to training and 30% going to testing.
  - Drag a Split Data node from the Data Transformation submenu on the left onto the canvas. Position it under the Select Columns in Dataset node.
  - Connect the output of Select Columns in Dataset to the input of Split Data. Notice that both of these are DataFrameDirectory type.
  - Change the Fraction of rows to 0.7. This 70%, the first output, will be our training set.
  - We want the data to be randomized, but in a repeatable way so that it is the same between training experiments. For this, set the Random seed to any non-zero value.
  - Add a comment to the node: "Train/Test Split 70/30".
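Although the lab itself needs no code, the two transformations above can be sketched in a few lines of plain Python to show what they do conceptually. This is an illustration under stated assumptions, not what the Designer runs internally. The helpers `drop_column` and `split_rows`, the `feature` column, and the row values are hypothetical; only the column names Y and Path come from the lab.

```python
import random


def drop_column(rows, name):
    """Return copies of the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def split_rows(rows, fraction, seed):
    """Shuffle rows reproducibly, then split them into two parts."""
    shuffled = list(rows)
    # A fixed seed yields the same shuffle, and so the same split, every run.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up rows; "feature" is a placeholder for the health metrics.
rows = [{"feature": i, "Y": 100 + i, "Path": "placeholder"} for i in range(10)]

cleaned = drop_column(rows, "Path")
train, test = split_rows(cleaned, fraction=0.7, seed=42)
train_again, test_again = split_rows(cleaned, fraction=0.7, seed=42)

print(len(train), len(test))
print(train == train_again)
```

Reusing the same seed is what makes later experiments comparable: every run trains and tests on identical rows, so a change in the metrics reflects the model and not the shuffle.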
- Train a Model
- Now that the data is transformed, let’s train a model using it.
  - Drag a Train Model node onto the canvas from the Model Training submenu on the left. Position it under the Split Data node.
  - Connect the left output of the Split Data node (our training data) to the right input of the Train Model node.
- The Train Model node requires an UntrainedModelDirectory as its left input. This comes from a Machine Learning algorithm.
  - Drag a Linear Regression node onto the canvas from the Machine Learning Algorithms submenu on the left.
  - Connect the output of Linear Regression to the left input of Train Model.
- The output of Train Model is a ModelDirectory. This is the trained model, but it doesn’t tell us anything about how accurate the model is.
  - Drag a Score Model node onto the canvas from the Model Scoring & Evaluation submenu on the left. Place it below the Train Model node.
  - Notice that it takes two inputs, a ModelDirectory and a DataFrameDirectory.
  - Connect the output of Train Model to the left input of Score Model.
  - Connect the right output from Split Data (our test data) to the right input of Score Model.
  - The Score Model node produces predictions, which are the test data with an extra column for the predicted values (Scored DataFrameDirectory). However, this does not tell us how accurate the predictions are.
- To quantitatively evaluate the model, drag an Evaluate Model node onto the canvas from the Model Scoring & Evaluation submenu on the left.
  - Connect the output of Score Model to the left input of Evaluate Model (the DataFrameDirectory type).
- Now that our pipeline is fully constructed, we can run it. Click the Submit button at the top right of the page.
  - In the popup, select Create new for a new Experiment.
  - Give it a descriptive name. We’ll refer to it as DiabetesExperiment for the rest of the lab.
  - In the Run description, add "Linear Regression" to the comment. This gives us an easy way to distinguish the run later.
  - Finally, click Submit to start the pipeline!
This will take around 10-15 minutes to run. Time for another coffee. We can watch the progress in the Experiments UI.
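For readers curious about what the train and score stages do conceptually, the following is a heavily simplified sketch: ordinary least squares on a single feature, written in plain Python. It is not the Designer's Linear Regression implementation, which works on all feature columns at once. The function names `fit_line` and `score` and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def score(model, xs):
    """Apply a fitted (slope, intercept) model to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


# Made-up training data that lies exactly on y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
test_x = [5.0, 6.0]

model = fit_line(train_x, train_y)   # analogous to Train Model
scored = score(model, test_x)        # analogous to Score Model
print(model, scored)
```

The separation mirrors the pipeline: the algorithm plus training data produce a model, and the model plus held-out data produce predictions that can then be evaluated.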
- View the Results
- Once the pipeline has completed, open DiabetesPipeline in the designer.
- Right-click the Score Model node and choose Visualize Scored dataset.
- Scan through the rows and compare the actual value of column Y to the predicted value in column Scored Labels. Note that the predictions are not very close, and in some cases are quite far off from the actual value.
- Close the Score Model result visualization.
- Right-click the Evaluate Model node and choose Visualize Evaluation results. The Root Mean Squared Error (RMSE) measures how far our predictions deviate from the actual values. A perfect model would have an RMSE of 0.
Our actual labels (Y) range from approximately 0 to 350. When building the lab, we got an RMSE of around 58, which is quite high compared to the scale. This means the model did not do a good job of predicting the labels. There are many tweaks we can make to this pipeline to make the model more accurate, which we will do in a future lab.
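As a rough illustration of how RMSE is computed and why 58 is large on this scale, consider the sketch below. The `rmse` helper and the actual/predicted values are hypothetical. Only the RMSE of about 58 and the approximate label range of 0 to 350 come from the lab.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and predictions, not taken from the lab's results.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]
error = rmse(actual, predicted)

# Put the lab's reported RMSE in context of the approximate label range.
lab_rmse = 58
label_range = 350 - 0
relative_error = lab_rmse / label_range

print(round(error, 2), round(relative_error, 3))
```

An RMSE of roughly one sixth of the full label range suggests the typical prediction misses by a wide margin, which matches what the scored dataset shows row by row.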