Lab
A Cloud Guru

Feature Selection After Training in Azure Machine Learning

Feature selection before training a model helps us to forecast which features will have the most predictive power for the chosen label. We can also use feature selection after training a model to find which features the model learned to be most important. In this lab, we will explore the different options for feature selection on models that are already trained.


Path Info

Level: Advanced
Duration: 45m
Published: Sep 24, 2020

Challenge

Set up the Workspace
1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
2. Create a training cluster of Standard_D3_v2 instances.
3. Create a new blank pipeline in the Azure Machine Learning Studio Designer.
Challenge

Predict the Car Price
Through exploration, we have determined that the automobile data includes a column called normalized-losses that is not relevant. The dataset is also missing some values.

Let's clean up the data before passing it to our models. Also, remember to add comments to your nodes as you create the pipeline so you can visually keep track of what each step is doing.
1. Use the data from the Automobile price data (Raw) dataset.
2. Remove the normalized-losses column.
3. Remove rows that are missing the price. We can't train the model using data missing the label.
4. Replace all missing values with 0.
5. Split the data into training and testing sets. Use 70% of the data for training. Be sure to set a random seed for repeatability.
6. Create a model using the Boosted Decision Tree Regression algorithm.
7. Train the model. Make sure to use the training data for this step.
8. Generate predictions using the testing data.
9. Generate statistics for the predictions.
10. Submit the pipeline. This will take a few minutes to complete.
11. When the pipeline completes, view the prediction statistics and find the Root Mean Square Error (RMSE).
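
The cleaning and splitting steps above (steps 2 through 5) can be sketched in plain Python to make the logic concrete. This is only an illustration, not what the Designer runs internally: the sample rows are made up, missing values are assumed to be represented as `None`, and the helper names `clean_rows` and `split_rows` are hypothetical.

```python
import random


def clean_rows(rows, drop_column="normalized-losses", label="price"):
    """Drop one column, drop rows without a label, and fill other gaps with 0."""
    cleaned = []
    for row in rows:
        if row.get(label) is None:
            continue  # a row without the label cannot be used for training
        cleaned.append(
            {key: (0 if value is None else value)
             for key, value in row.items()
             if key != drop_column}
        )
    return cleaned


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed so the split is repeatable, then cut it."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up sample: ten rows with a price, one row without.
sample = [
    {"normalized-losses": None, "horsepower": 100 + i, "price": 10000 + 500 * i}
    for i in range(10)
]
sample[2]["horsepower"] = None  # a missing feature value, filled with 0 later
sample.append({"normalized-losses": 158, "horsepower": 140, "price": None})

cleaned = clean_rows(sample)
train, test = split_rows(cleaned)
print(len(cleaned), len(train), len(test))
```

Fixing the seed is what makes the split repeatable: calling `split_rows(cleaned)` again returns the same training and testing rows.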
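Note: The RMSE measures the typical size of the gap between our model's predictions and the true price. Because the errors are squared before they are averaged, large misses count more than small ones. It is measured in the same units as the label, which makes it easy to interpret. Lower values are better.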
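
As a small worked example of the statistic, the sketch below computes the RMSE by hand. The prices are invented for illustration and the function name `rmse` is our own, not part of any Azure API.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square the errors, average them, take the root."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented prices, in the same units as the label.
actual_prices = [10000, 15000, 20000, 25000]
predicted_prices = [11000, 14000, 20000, 27000]

error = rmse(actual_prices, predicted_prices)
print(round(error, 2))
```

The absolute errors here are 1000, 1000, 0, and 2000, which average to 1000, yet the RMSE comes out near 1224.74. The gap reflects the extra weight that squaring gives to the single 2000 miss.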
Challenge

Select Features to Improve the Root Mean Square Error
The Root Mean Square Error is a good indicator of how well our model predicts the price. Let's figure out which features are important for reducing our error.
1. Determine which features most affect the Root Mean Square Error. Use the model you trained in the previous step. This also requires data, so let it work against the testing data.
2. Submit the pipeline. This will be fairly quick because it can reuse most of the work done in the previous step.
3. When the pipeline completes, view the results. The features are ranked by importance. Higher values mean more impact on the RMSE, and the values are roughly on the scale of the RMSE.
4. Did you expect the top features to be the most important? Did you expect any other features to be higher on the list?
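
One common way to produce this kind of ranking is permutation feature importance: shuffle one feature column at a time and measure how much the error grows. The sketch below shows the idea in plain Python. It is an assumption-laden illustration rather than the Designer's implementation: `toy_model` is a hypothetical stand-in for the trained model, the rows are made up, and the helper names are our own.

```python
import math
import random


def rmse(actual, predicted):
    """Root Mean Square Error of the predictions."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def permutation_importance(predict, rows, labels, feature_names, seed=0):
    """Rank features by how much RMSE increases when each one is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(labels, [predict(row) for row in rows])
    scores = {}
    for name in feature_names:
        column = [row[name] for row in rows]
        rng.shuffle(column)  # break the link between this feature and the label
        permuted = [{**row, name: value} for row, value in zip(rows, column)]
        scores[name] = rmse(labels, [predict(row) for row in permuted]) - baseline
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def toy_model(row):
    """Hypothetical trained model: uses two features and ignores num-doors."""
    return 100 * row["horsepower"] + 2 * row["curb-weight"]


# Made-up testing data; the labels match the model exactly, so baseline RMSE is 0.
rows = [
    {"horsepower": 60 + 10 * i,
     "curb-weight": 2000 + 37 * ((i * 7) % 10),
     "num-doors": 2 + 2 * (i % 2)}
    for i in range(10)
]
labels = [toy_model(row) for row in rows]

ranking = permutation_importance(
    toy_model, rows, labels, ["horsepower", "curb-weight", "num-doors"]
)
for name, score in ranking:
    print(name, round(score, 2))
```

Because each score is a difference between two RMSE values, it is expressed in the same units as the RMSE. A feature the model ignores, such as `num-doors` here, scores exactly 0.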
Challenge

Select Features to Improve the Coefficient of Determination
The coefficient of determination indicates what proportion of the variation in the label the model's features can explain. It gives us another way to rank how important each feature is to the model. By choosing the highest ranked features, we may be able to improve our model's predictions.
1. Determine which features most affect the coefficient of determination. Use the same model and the same testing data as before.
2. Submit the pipeline. This will again be fairly quick.
3. When the pipeline completes, view the results. The features are once again ranked by importance, this time roughly on the scale of the coefficient of determination. That statistic usually falls between 0 and 1, though it can be negative (and can go to negative infinity in abnormal circumstances).
4. Answer the same questions as above. Do the feature ranks make sense? Are there any that you did or didn't expect? How are they different from the features deemed most important by RMSE?
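
To see why the coefficient of determination usually falls between 0 and 1 but can go negative, it helps to compute it by hand. The sketch below uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The prices are invented and the function name `r_squared` is our own.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    if ss_total == 0:
        # All labels are identical, so the ratio is undefined.
        raise ValueError("labels have no variance")
    return 1 - ss_residual / ss_total


actual_prices = [10000, 15000, 20000, 25000]

perfect = r_squared(actual_prices, actual_prices)
always_mean = r_squared(actual_prices, [17500] * 4)
reversed_guess = r_squared(actual_prices, [25000, 20000, 15000, 10000])
print(perfect, always_mean, reversed_guess)
```

Perfect predictions score 1, always predicting the mean price scores 0, and predictions that are worse than the mean score below 0. As the variance of the labels shrinks toward zero, the ratio can grow without bound, which is the abnormal case mentioned above.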
Challenge

Evaluate the Results

Note that we've found which features this particular model configuration learned to be most important. The statistics you choose to compare will affect your results, and so will the initialization values. This is why machine learning is as much art as it is science. It's good to look at the different statistics and see where they agree and disagree, especially after determining which statistics are most important for your problem.

You can also use the Filter Based Feature Selection module to help confirm the feature rankings. When using permutation feature importance to select features, make sure to compare the results to a model trained without feature selection as a baseline. Try this with the remaining time you have in the lab!
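
Filter-based selection scores each feature against the label directly, without consulting a trained model. As a rough illustration, the sketch below ranks features by the absolute Pearson correlation with the price. Choosing Pearson correlation as the scoring method is an assumption made for this example, the data is made up, and the helper names are our own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return covariance / (spread_x * spread_y)


def rank_features(rows, label, feature_names):
    """Rank features by the absolute correlation of each one with the label."""
    labels = [row[label] for row in rows]
    scores = {
        name: abs(pearson([row[name] for row in rows], labels))
        for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only.
rows = [
    {"horsepower": 60, "num-doors": 2, "price": 8000},
    {"horsepower": 80, "num-doors": 4, "price": 10000},
    {"horsepower": 100, "num-doors": 4, "price": 12500},
    {"horsepower": 120, "num-doors": 2, "price": 14000},
    {"horsepower": 140, "num-doors": 4, "price": 16500},
]

filter_ranking = rank_features(rows, "price", ["horsepower", "num-doors"])
for name, score in filter_ranking:
    print(name, round(score, 3))
```

If a feature ranks highly both here and in the model-based rankings, that agreement is some evidence the feature is genuinely informative rather than an artifact of one model configuration.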

Author

A Cloud Guru
