Feature Selection Before Training in Azure Machine Learning

1 hour
  • 5 Learning Objectives

About this Hands-on Lab

If you are presented with a large number of distinct features to use to train your model, it is rarely a good idea to throw them all at the model. Many of the features will have no predictive power for your desired label. In the best-case scenario, using these extraneous features will only increase training times. At worst, they will increase model complexity, training time, and prediction error rates. To avoid these costly increases, we can use feature engineering to pick the most relevant features before we start training our model. In this lab, we will explore using Pearson’s correlation and the Chi Squared statistic to pick the best features for our model.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Set Up the Workspace
  1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.

  2. Create a Training Cluster of D2 instances.

  3. Create a new blank Pipeline in the Azure Machine Learning Studio Designer.

Prepare the Data
  1. Use the data from the Automobile price data (Raw) dataset.
  2. Remove the normalized-losses column.
  3. Remove rows that are missing the price. We can’t train the model using data missing the label.
  4. Replace all missing values with 0.
  5. Split the data into training and testing sets. Use 70% of the data for training. Be sure to set a random seed for repeatability.
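Outside the Designer, the same preparation steps can be sketched with pandas. This is a minimal illustration, not the Designer's implementation: the tiny `raw` frame below stands in for the real Automobile price data (Raw) dataset, and only the `normalized-losses`, `horsepower`, and `price` column names are borrowed from its schema.

```python
import pandas as pd

def prepare(df: pd.DataFrame, seed: int = 42):
    """Mirror the Designer steps: drop normalized-losses, drop rows
    missing the price label, zero-fill remaining gaps, 70/30 split."""
    df = df.drop(columns=["normalized-losses"])
    df = df.dropna(subset=["price"])  # can't train without the label
    df = df.fillna(0)
    train = df.sample(frac=0.7, random_state=seed)  # seeded for repeatability
    test = df.drop(train.index)
    return train, test

# Tiny illustrative frame standing in for the real dataset
raw = pd.DataFrame({
    "normalized-losses": [100, None, 120, 90],
    "horsepower": [111, 102, None, 115],
    "price": [13495, 16500, None, 17450],
})
train, test = prepare(raw)
print(len(train), len(test))  # 3 rows survive the label filter
```

Setting `random_state` here plays the same role as the random seed in the Designer's Split Data module: the same rows land in the same split on every run.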
Create the Baseline Model
  1. Create a model using the Boosted Decision Tree Regression algorithm.

    Note: Since we are comparing multiple models, we want them to be initialized the same way. For good science, we want there to be only one difference between the control and experiment groups, which, in our case, will be the features passed to the model. To accomplish this, set the random seed to any non-zero number.

  2. Train the model. Make sure to use the training data for this step.
  3. Generate predictions using the testing data.
  4. Generate statistics for the predictions.
  5. Submit the pipeline. This will take a couple of minutes to run.
  6. When the pipeline completes, view the prediction statistics.

    Note: Since we’re comparing models, we need a metric to compare against. Root Mean Square Error (RMSE) determines how far off our model is, on average, from the true price. It is measured in the same units as the label, which makes it very easy to work with. Lower values are better. This model produces the RMSE we will try to beat by engineering features.
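As a rough equivalent of this baseline step, scikit-learn's `GradientBoostingRegressor` can stand in for the Designer's Boosted Decision Tree Regression module (the Designer uses its own implementation; the synthetic data and seed values below are illustrative assumptions, not lab values):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Synthetic "price" driven by two of the four features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 70/30 split, mirroring the data-preparation step
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

model = GradientBoostingRegressor(random_state=1)  # same seed for every model
model.fit(X_train, y_train)
pred = model.predict(X_test)

# RMSE: same units as the label, lower is better
rmse = mean_squared_error(y_test, pred) ** 0.5
print(round(rmse, 3))
```

Whatever RMSE this baseline produces is the number the feature-selected models will try to beat.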

Select Features with Pearson’s Correlation
  1. Create another Boosted Decision Tree Regression model. Use the same random seed as the first model to initialize it in the same way.
  2. Using Pearson’s correlation, rank the features in the training data based on their correlation to the price data. Pass the top 5 to the model.

    Note: The algorithm for Pearson’s correlation requires numbers, so only numeric columns will be considered.

  3. Select the same features from the testing data as you did in the training data.

    Note: You cannot simply duplicate the feature-selection node for this, because it would run the selection algorithm again on the testing data, which can choose a different set of columns.

  4. Train the model using the selected features from the training data.
  5. Generate predictions on the testing data filtered to the same set of selected features.
  6. Generate statistics for the predictions.
  7. Submit the pipeline. This will take a couple of minutes to run, but should be faster since the earlier data-preparation steps don’t have to be rerun.
  8. When the pipeline completes, view the chosen features for both the training and test data sets to see if they line up.
  9. Find Pearson’s r values and see how strongly the selected features correlate to the price.

    Note: The closer the value is to 1, the more strongly positively correlated the feature is with the label, meaning it helps predict it. At 0, there is no correlation (non-numeric columns also receive a score of 0). The closer to -1, the more strongly negatively correlated the feature is, meaning it moves in the opposite direction of the label; strong negative correlations are still predictive.

  10. Check the RMSE (Root Mean Square Error). Did this model perform better or worse than our baseline?
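The Pearson-based selection above amounts to ranking numeric columns by the absolute value of r against the label, keeping the top k, and then reusing exactly those columns on the test set. A pandas sketch under those assumptions, with made-up column names loosely echoing the automobile schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
train = pd.DataFrame({
    "engine-size": rng.normal(size=n),
    "horsepower": rng.normal(size=n),
    "height": rng.normal(size=n),
    "make": ["audi", "bmw"] * (n // 2),  # non-numeric: excluded from scoring
})
# Synthetic label driven mostly by engine-size, then horsepower
train["price"] = (50 * train["engine-size"]
                  + 20 * train["horsepower"]
                  + rng.normal(size=n))

# Rank numeric features by |Pearson r| against the label, keep the top k
k = 2
r = train.corr(numeric_only=True)["price"].drop("price")
top = r.abs().sort_values(ascending=False).head(k).index.tolist()
print(top)

# Reuse the SAME columns on the test set -- re-running the ranking
# there could select different features.
test = train.sample(frac=0.3, random_state=1)
test_selected = test[top + ["price"]]
```

Note how `numeric_only=True` silently drops the `make` column, matching the lab's observation that Pearson's correlation only considers numeric columns.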
Select Features with Chi Squared
  1. Copy and paste the nodes we created in the previous step, then wire them up the same way as before.
  2. For this model, change the feature selection to use Chi Squared instead of Pearson’s correlation. Also, since we learned in the previous step that five features were not enough, try increasing the number of features to 10.

    Note: The Chi Squared algorithm does not require numerical data, so all columns will be considered.

  3. Submit the pipeline. This will again be quick since we don’t have to redo any of the previous steps.
  4. Once the pipeline completes, view the chosen features for both the training and testing data sets to see if they line up.
  5. Find the Chi Squared values for the chosen columns. Note that the top 5 fields chosen by Pearson’s correlation are still considered predictive of price, but they are no longer ranked in the same order.
  6. Lastly, check the RMSE. This model performed better than our previous experiment, so this is a better feature-selected model of this data. How does it compare to the baseline?
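For a rough sense of what the Chi Squared option does, scikit-learn's `SelectKBest` with the `chi2` score is a reasonable stand-in. Note this is an approximation: scikit-learn's `chi2` requires non-negative features and a discrete label, so this sketch uses count-like columns and a discretized price band, whereas the Designer's implementation handles categorical columns directly. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
# Non-negative, count-like features (scikit-learn's chi2 requires them)
X = pd.DataFrame({
    "num-of-doors": rng.integers(2, 5, size=n),
    "num-of-cylinders": rng.integers(4, 9, size=n),
    "noise": rng.integers(0, 10, size=n),
})
# Discretized "price" label: higher cylinder counts land in the expensive band
y = (X["num-of-cylinders"] > 6).astype(int)

# Keep the k columns with the highest chi-squared score against the label
selector = SelectKBest(chi2, k=2).fit(X, y)
chosen = X.columns[selector.get_support()].tolist()
print(chosen)
```

A higher chi-squared score means the feature's distribution differs more across label values, i.e. it carries more information about the label, which is why the ranking can reorder the features Pearson's correlation picked.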

Additional Resources

Let's say we want to predict car prices from a set of features about the car. There are lots of different things we might know about the car, from the make and model, to safety ratings, horsepower, fuel efficiency, and many more. Many of these probably have some effect on the price, but how much effect will they have?

Each consumer will have slightly different preferences for which feature is more important, which will ultimately affect their decision to purchase the car, but can we predict an average price? Furthermore, before we use a machine learning model to predict the price, can we predict which of these features will help us predict the price more accurately? Of course we can, using statistics!

Lab Goals

  1. Set Up the Workspace
  2. Create the Baseline Model
  3. Select Features with Pearson's Correlation
  4. Select Features with Chi Squared

Logging in to the lab environment

To avoid issues with the lab, use a new Incognito or Private browser window to log in to the lab. This ensures that your personal account credentials, which may be active in your main window, are not used for the lab.

What are Hands-on Labs

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
