Feature Selection After Training in Azure Machine Learning

45 minutes
  • 5 Learning Objectives

About this Hands-on Lab

Feature selection before training a model helps us estimate which features will have the most predictive power for the chosen label. We can also use feature selection after training a model to find which features the model learned to be most important. In this lab, we will explore the different options for feature selection on models that are already trained.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Set up the Workspace
  1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.

  2. Create a training cluster of D2 instances.

  3. Create a new blank pipeline in the Azure Machine Learning Studio Designer.

Predict the Car Price

Through exploration, we have determined that the automobile data includes a column called normalized-losses that is not relevant. The dataset is also missing some values.

Let’s clean up the data before passing it to our models. Also, remember to add comments to your nodes as you create the pipeline so you can visually keep track of what each step is doing.

  1. Use the data from the Automobile price data (Raw) dataset.

  2. Remove the normalized-losses column.

  3. Remove rows that are missing the price. We can’t train the model using data missing the label.

  4. Replace all missing values with 0.

  5. Split the data into training and testing sets. Use 70% of the data for training. Be sure to set a random seed for repeatability.

  6. Create a model using the Boosted Decision Tree Regression algorithm.

  7. Train the model. Make sure to use the training data for this step.

  8. Generate predictions using the testing data.

  9. Generate statistics for the predictions.

  10. Submit the pipeline. This will take a few minutes to complete.

  11. When the pipeline completes, view the prediction statistics and find the Root Mean Square Error (RMSE).

    Note: The RMSE measures how far off our model’s predictions typically are from the true price. It is measured in the same units as the label (here, the price), which makes it easy to interpret. Lower values are better.
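
The data preparation in steps 2 through 5 can be sketched in plain Python. This is only an illustration of the logic, not what the Designer modules run internally: the toy rows, the `clean` and `split` helpers, and the seed value 42 are all assumptions made for the example.

```python
import random

# Hypothetical toy rows standing in for the Automobile price data (Raw) dataset.
rows = [
    {"horsepower": 60 + 10 * i, "normalized-losses": 100 + i, "price": 5000 + 1000 * i}
    for i in range(11)
]
rows[2]["horsepower"] = None         # a missing feature value
rows[5]["price"] = None              # a missing label
rows[7]["normalized-losses"] = None  # missing, but the column is dropped anyway


def clean(rows, drop_column="normalized-losses", label="price"):
    """Drop one column, drop rows with no label, then fill remaining gaps with 0."""
    cleaned = []
    for row in rows:
        if row[label] is None:
            continue  # a row without a label cannot be used for training
        cleaned.append(
            {k: (0 if v is None else v) for k, v in row.items() if k != drop_column}
        )
    return cleaned


def split(rows, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed, then cut into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


cleaned = clean(rows)
train, test = split(cleaned)
print(len(cleaned), len(train), len(test))  # 10 7 3
```

Because the shuffle uses a fixed seed, calling `split(cleaned)` again returns the same training and testing rows, which is the repeatability that step 5 asks for.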

Select Features to Improve the Root Mean Square Error

The Root Mean Square Error is a good indicator of how well our model predicts the price. Let’s figure out which features are important for reducing our error.

  1. Determine which features most affect the Root Mean Square Error. Use the model you trained in the previous step. This also requires data, so let it work against the testing data.

  2. Submit the pipeline. This will be fairly quick because it can reuse most of the work done in the previous step.

  3. When the pipeline completes, view the results. The features are ranked by importance. Higher values mean more impact on the RMSE, and the values are roughly on the scale of the RMSE.

  4. Did you expect the top features to be the most important? Did you expect any other features to be higher on the list?
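
The idea behind this ranking can be sketched in plain Python: shuffle one feature column at a time in the testing data, score the model again, and record how much the RMSE gets worse. This is a minimal sketch of the general permutation approach, not the Designer module's implementation. The `toy_model` function is a hypothetical stand-in for the trained model, and the feature names and values are made up for the example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root Mean Square Error, in the same units as the label."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, labels, seed=0):
    """Increase in RMSE when each feature column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = rmse(labels, [predict(r) for r in rows])
    scores = {}
    for feature in rows[0]:
        column = [r[feature] for r in rows]
        rng.shuffle(column)  # break the link between this feature and the label
        permuted = [{**r, feature: v} for r, v in zip(rows, column)]
        scores[feature] = rmse(labels, [predict(r) for r in permuted]) - baseline
    return scores


# Hypothetical stand-in for a trained model: the price depends on horsepower only.
def toy_model(row):
    return 100 * row["horsepower"] + 2000


test_rows = [{"horsepower": 60 + 10 * i, "num_doors": 2 + 2 * (i % 2)} for i in range(10)]
test_labels = [toy_model(r) for r in test_rows]

scores = permutation_importance(toy_model, test_rows, test_labels)
for feature, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {score:.1f}")
```

In this toy setup, shuffling `num_doors` changes nothing because the stand-in model ignores it, so its score is 0. Shuffling `horsepower` makes the predictions worse, so its score is positive and is expressed in the units of the price.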

Select Features to Improve the Coefficient of Determination

The coefficient of determination (R²) is another measure of model quality: it indicates what proportion of the variation in the label the model’s features explain. It gives us another way to rank how important each feature is to the model. By choosing the highest-ranked features, we may be able to improve our model’s predictions.

  1. Determine which features most affect the coefficient of determination. Use the same model and the testing data as before.

  2. Submit the pipeline. This will again be fairly quick.

  3. When the pipeline completes, view the results. The features are once again ranked by importance, this time roughly on the scale of the coefficient of determination. That value usually falls between 0 and 1, though it can be negative (and can go to negative infinity in abnormal circumstances).

  4. Answer the same questions as above. Do the feature ranks make sense? Are there any that you did or didn’t expect? How are they different from the features deemed most important by RMSE?
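
To make the scale of these scores concrete, here is a minimal sketch of the coefficient of determination and of a permutation score based on it. Because a higher value is better for this metric, the sketch reports the decrease after shuffling rather than the increase. The `r2_drop` helper and the `toy_model` stand-in are hypothetical and are not taken from the Designer module.

```python
import random


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions: 1.0
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean: 0.0
print(r2([1, 2, 3], [3, 2, 1]))  # worse than predicting the mean: -3.0


def r2_drop(predict, rows, labels, feature, seed=0):
    """Decrease in the coefficient of determination when one feature is shuffled."""
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    permuted = [{**r, feature: v} for r, v in zip(rows, column)]
    baseline = r2(labels, [predict(r) for r in rows])
    return baseline - r2(labels, [predict(r) for r in permuted])


# Hypothetical stand-in for a trained model: the price depends on horsepower only.
def toy_model(row):
    return 100 * row["horsepower"] + 2000


test_rows = [{"horsepower": 60 + 10 * i, "num_doors": 2 + 2 * (i % 2)} for i in range(10)]
test_labels = [toy_model(r) for r in test_rows]

for feature in ("horsepower", "num_doors"):
    print(feature, r2_drop(toy_model, test_rows, test_labels, feature))
```

The three printed values show why the scores usually sit between 0 and 1 but are not bounded below: a model that does worse than predicting the mean has a negative coefficient of determination.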

Evaluate the Results

Note that we’ve found which features this particular model configuration learned to be most important. The statistic you choose to compare will affect your results, and so will the initialization values. This is why machine learning is as much art as it is science. It’s good to look at the different statistics and see where they agree and disagree, especially after determining which statistics are most important for your problem.

You can also use the Filter Based Feature Selection module to help confirm the rankings found here. When you use permutation feature importance to select features, make sure to compare the results to a model trained without feature selection as a baseline. Try this with the remaining time you have in the lab!

Additional Resources

Let's try to predict the price of a car from its features. We know many things about the car, including make, engine size, fuel efficiency, and number of doors. Once we predict the price, we will then score the model while dropping out individual features to see how much the model's performance changes. This will allow us to more intelligently pick the features we use to train the model.

Lab Goals

  1. Set up the Workspace
  2. Predict the Car Price
  3. Select Features to Improve the Root Mean Square Error
  4. Select Features to Improve the Coefficient of Determination

Logging in to the lab environment

To avoid issues with the lab, use a new Incognito or Private browser window to log in to the lab. This ensures that your personal account credentials, which may be active in your main window, are not used for the lab.

