Feature selection before training a model helps us identify which features are likely to have the most predictive power for the chosen label. We can also apply feature selection after training a model to find which features the model actually learned to be most important. In this lab, we will explore the different options for feature selection on models that are already trained.
Successfully complete this lab by achieving the following learning objectives:
- Set up the Workspace
Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
Create a training cluster.
Create a new blank pipeline in the Azure Machine Learning Studio Designer.
- Predict the Car Price
Through exploration, we have determined that the automobile data includes a column called normalized-losses that is not relevant. The dataset is also missing some values.
Let’s clean up the data before passing it to our models. Also, remember to add comments to your nodes as you create the pipeline so you can visually keep track of what each step is doing.
Use the data from the Automobile price data (Raw) dataset.
Remove rows that are missing the price. We can’t train the model using data missing the label.
Replace all missing values with 0.
Split the data into training and testing sets. Use 70% of the data for training. Be sure to set a random seed for repeatability.
Create a model using the Boosted Decision Tree Regression algorithm.
Train the model. Make sure to use the training data for this step.
Generate predictions using the testing data.
Generate statistics for the predictions.
Submit the pipeline. This will take a few minutes to complete.
When the pipeline completes, view the prediction statistics and find the Root Mean Square Error (RMSE).
Note: The RMSE determines how far off our model is on average from the true price. It is measured in the same units as the label, which makes it very easy to work with. Lower values are better.
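The lab itself is built from drag-and-drop Designer modules, but the same pipeline can be sketched in scikit-learn. The snippet below is only an illustration: the dataset here is synthetic (the real lab uses the Automobile price data (Raw) dataset), and the model and column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Automobile price data (Raw) dataset.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "horsepower": rng.uniform(50, 300, 200),
    "curb-weight": rng.uniform(1500, 4000, 200),
})
df["price"] = 50 * df["horsepower"] + 5 * df["curb-weight"] + rng.normal(0, 500, 200)
df.loc[::17, "horsepower"] = np.nan  # simulate missing feature values
df.loc[::23, "price"] = np.nan       # simulate missing labels

# Remove rows missing the label, then replace remaining missing values with 0.
df = df.dropna(subset=["price"]).fillna(0)

# Split 70/30 with a fixed random seed for repeatability.
X, y = df.drop(columns="price"), df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Boosted decision tree regression (scikit-learn's gradient boosting here).
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Generate predictions on the testing data and compute the RMSE.
preds = model.predict(X_test)
rmse = mean_squared_error(y_test, preds) ** 0.5  # same units as price
print(f"RMSE: {rmse:.0f}")
```

Because the RMSE is in the same units as the label, you can read it directly as "on average, predictions are off by about this many dollars."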
- Select Features to Improve the Root Mean Square Error
The Root Mean Square Error is a good indicator of how well our model predicts the price. Let’s figure out which features are important for reducing our error.
Determine which features most affect the Root Mean Square Error. Use the model you trained in the previous step. This also requires data, so let it work against the testing data.
Submit the pipeline. This will be fairly quick because it can reuse most of the work done in the previous step.
When the pipeline completes, view the results. The features are ranked by importance. Higher values mean more impact on the RMSE, and the values are roughly on the scale of the RMSE.
Did you expect the top features to be the most important? Did you expect any other features to be higher on the list?
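In the Designer this step is a single module, but the underlying idea, permutation feature importance, can be sketched with scikit-learn. This is an assumed analogue on synthetic data, not the lab's module: each feature in the testing data is shuffled in turn, and the importance is how much the RMSE worsens.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: two real features plus one irrelevant "noise" feature.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "horsepower": rng.uniform(50, 300, 300),
    "curb-weight": rng.uniform(1500, 4000, 300),
    "noise": rng.normal(size=300),
})
y = 50 * X["horsepower"] + 5 * X["curb-weight"] + rng.normal(0, 500, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature of the *testing* data and measure the RMSE change.
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_root_mean_squared_error", n_repeats=10, random_state=0)

# Rank features by importance, highest impact on RMSE first.
for name, mean in sorted(zip(X.columns, result.importances_mean),
                         key=lambda t: -t[1]):
    print(f"{name:12s} {mean:8.1f}")
```

Shuffling a genuinely predictive feature inflates the RMSE by a large amount (roughly on the scale of the RMSE itself), while shuffling an irrelevant feature barely moves it.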
- Select Features to Improve the Coefficient of Determination
The coefficient of determination (R²) indicates the proportion of the label's variance that the model explains, and it gives us another way to rank how important each feature is to the model. By choosing the highest-ranked features, we may be able to improve our model's predictions.
Determine which features most affect the coefficient of determination. Use the same model and the testing data as before.
Submit the pipeline. This will again be fairly quick.
When the pipeline completes, view the results. The features are once again ranked by importance roughly on the scale of the coefficient of determination, usually between 0 and 1, though it can be negative (and can go to negative infinity in abnormal circumstances).
Answer the same questions as above. Do the feature ranks make sense? Are there any that you did or didn’t expect? How are they different from the features deemed most important by RMSE?
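The only change from the previous step is the scoring metric. In a scikit-learn sketch (again an assumed analogue on synthetic data, not the Designer module itself), this means swapping the scorer to `"r2"` in the same permutation procedure:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 carries most of the signal, feature 2 is noise.
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Same permutation procedure, scored with the coefficient of determination:
# the importance is how much R-squared drops when a feature is shuffled.
r2_imp = permutation_importance(
    model, X_te, y_te, scoring="r2",
    n_repeats=10, random_state=0).importances_mean
print(dict(zip(["feat0", "feat1", "noise"], r2_imp.round(3))))
```

The importances are now on the scale of R² (a drop of 0.5 means shuffling that feature cost the model half of its explained variance), which is why the rankings can disagree with the RMSE-based ones.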
- Evaluate the Results
Note that we’ve found which features this particular model configuration learned are most important. Which statistics you choose to compare will affect your results, as will the initialization values (such as the random seed). This is why machine learning is as much art as it is science. It’s good to look at the different statistics and see where they agree and disagree, especially after determining which statistics are most important for your problem.
You can also use the Filter Based Feature Selection module to help confirm the permutation feature importance results. When comparing the approaches this way, make sure to include a model trained without feature selection as a baseline. Try this with the remaining time you have in the lab!
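Unlike permutation importance, a filter method ranks features by a statistic computed directly from the data, before any model is trained. A rough scikit-learn analogue of that idea (an assumed sketch on synthetic data, not the Designer's Filter Based Feature Selection module) uses a univariate F-test:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: columns 0 and 1 carry signal, columns 2-3 are pure noise.
rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 4))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.3, 200)

# Score each feature with a univariate F-test (correlation-based), then
# keep the k highest-scoring features -- no trained model required.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(selector.scores_.round(1))   # per-feature scores
print(selector.get_support())      # which k features were kept
```

Because the filter scores each feature in isolation, its ranking can differ from permutation importance, which measures what the trained model actually relies on. That difference is exactly what comparing against a no-selection baseline helps you see.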