One of the hard truths of machine learning is that certain kinds of data that would be really useful are also really hard to get. Some data points are more plentiful than others, but understanding (and predicting!) the minority class of data is incredibly useful in many areas. What if you want to detect fraudulent transactions, diagnose rare medical conditions, or discover anomalous behavior in your networks? You’ll easily be able to gather plenty of examples of non-fraudulent transactions, common conditions, and normal user behavior, but you may only have a small amount of data for what you want to predict. In this lab, we will explore the Synthetic Minority Oversampling Technique, better known as SMOTE, as a way of boosting the signal of the minority class.
Successfully complete this lab by achieving the following learning objectives:
- Set Up the Workspace
Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
Create a training cluster.
Create a new blank pipeline in the Azure Machine Learning Studio designer.
- Create a Baseline Model
We need a baseline model to compare our SMOTE models against. For this, we will create a basic classification model.
The data you will be working with is in the CRM Churn Labels Shared and CRM Dataset Shared nodes. Set the data in CRM Churn Labels Shared as a label, and rename the column to "Label". Join the CRM Dataset Shared data with the labels.
Split the data into training and test sets. Use 70% of the data for training. Set a random seed so the split is reproducible.
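The designer's Split Data component handles this for you, but the mechanics are simple. Here is a minimal NumPy sketch (not the Azure ML implementation; the function name and seed value are illustrative) showing why a fixed seed makes the split repeatable:

```python
import numpy as np

def split_70_30(X, y, seed=42):
    """Shuffle with a fixed seed, then split 70% train / 30% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))       # same seed -> same shuffle every run
    cut = int(0.7 * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_tr, X_te, y_tr, y_te = split_70_30(X, y)
print(len(X_tr), len(X_te))  # 7 3
```

Because the seed pins down the shuffle, running the split twice yields identical train and test sets, which is what lets us compare the baseline and SMOTE models fairly.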
Using a Two-Class Boosted Decision Tree algorithm, train a classification model. Make sure to use the training data for this step.
Generate predictions for the testing data.
Generate statistics for the predictions.
Submit the pipeline. Create a new experiment to hold the results.
Once the pipeline completes, view the prediction statistics and find the area under the curve (AUC). This is what we will try to beat with our other models.
- Use SMOTE to Increase Underrepresented Samples
While we have established our baseline, we should be able to get better results if we have more examples of the underrepresented data. Let’s experiment with SMOTE to increase the churn data and see what impact it has on our model.
Create twice as many synthetic churn samples with SMOTE using the single nearest neighboring data point. Allow SMOTE to use all available data. Choose a non-zero random seed so we can start with the same initialization parameters across multiple experiments.
Note: SMOTE percentage is the percentage increase in the minority examples. This value must be in multiples of 100. A value of 100 means that we will create 100% extra examples, so we’ll effectively have twice as many examples. This will not affect the majority class. It will only change the proportion of data being used to train the model.
Number of nearest neighbors, frequently referred to as K, defines how many similar examples to draw from when creating the synthetic data. More neighbors means more varied synthetic samples, but too much variety can introduce noise into the synthetic data. For our first pipeline, we’ll only use 1 nearest neighbor.
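The designer's SMOTE component hides the mechanics, so here is a minimal NumPy sketch of the core idea (an illustration, not the Azure ML implementation): each synthetic point is an interpolation between a real minority sample and one of its K nearest minority-class neighbors.

```python
import numpy as np

def smote_sketch(minority, percent=100, k=1, seed=1):
    """Create percent% extra synthetic minority samples.

    Each synthetic point lies on the line segment between a real
    minority sample and one of its k nearest minority neighbors.
    """
    rng = np.random.default_rng(seed)
    n_new = len(minority) * percent // 100
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per sample
    synthetic = []
    for i in rng.integers(0, len(minority), n_new):
        j = nn[i, rng.integers(0, k)]     # pick one of the k neighbors
        gap = rng.random()                # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

churn = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
new = smote_sketch(churn, percent=100, k=1)
print(new.shape)  # (3, 2) -- 100% means one synthetic sample per real one
```

With `k=1` every synthetic point sits on the segment to a sample's single closest neighbor, which keeps the new data conservative; raising `k` lets the interpolation reach toward more distant neighbors, adding variety (and potential noise).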
Split the oversampled data into training and testing data sets. Use 70% of the data for training and set the same random seed as the baseline model’s split.
Train another classification model with the same architecture as before, using the oversampled training data.
Note: The synthetic data we are creating must only be used for training. This is crucially important. The data generated by SMOTE is statistically similar to the real data, but it is not the real data. When you see "synthetic data", you should be thinking "manufactured, imaginary data". We do not want to score the model using this. Instead, we will score the model using the test data split from the baseline model.
Use SMOTE to triple the churn examples using the two closest data points. Train another model using this data as above.
Use SMOTE to quadruple the churn examples using the three closest data points. Train another model using this data.
Submit the pipeline reusing the same experiment. This will take a few minutes since we’ve added so many nodes to the pipeline.
- Evaluate the SMOTE Results
The area under the curve (AUC) is a good proxy for how well the model performs across many threshold values, so we’ll use it to determine the best model. Higher values are better.
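If you want to see why AUC summarizes performance across all thresholds, note that it equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (the Mann–Whitney U formulation). A short NumPy sketch of that equivalence (illustrative, not how Azure ML computes it):

```python
import numpy as np

def auc(y_true, scores):
    """AUC = P(random positive outranks random negative); ties count half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
print(auc(y, s))  # 0.75 -- 3 of the 4 positive/negative pairs are ranked correctly
```

No threshold appears anywhere in the calculation, which is exactly why AUC is a threshold-free measure of ranking quality: higher values mean the model separates churners from non-churners more cleanly.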
Which model do you think will be best?
Check the AUC of the baseline model so it is fresh in your mind.
For each SMOTE pipeline, find the AUC. Using these, determine which model performed best. Did any of your models perform worse than you expected?
Adjust the threshold value of your best model up and down to balance the model performance. For this problem, we can use the F1 score to help us determine the right threshold. Determining the optimum threshold will involve balancing how much time your Customer Relations team has for outreach against how many customers you are willing to let churn.
Note: The F1 score tells us the balance between precision and recall, which will help us find a good ratio for our true positives (users correctly predicted to churn) and false positives (users that wouldn’t churn but are predicted to). This will help balance the amount of work the Customer Relations team will have to do.
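To make the threshold tuning concrete, here is a small pure-Python sketch (the data and candidate thresholds are made up for illustration) that sweeps thresholds and picks the one with the best F1 score:

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 = harmonic mean of precision and recall at a given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)   # of predicted churners, how many really churn
    recall = tp / (tp + fn)      # of real churners, how many we caught
    return 2 * precision * recall / (precision + recall)

y = [0, 0, 1, 1, 1]              # toy labels: 1 = churned
s = [0.2, 0.6, 0.4, 0.7, 0.9]    # toy model scores
best = max((round(f1_at_threshold(y, s, t), 3), t) for t in [0.3, 0.5, 0.7])
print(best)  # (0.857, 0.3)
```

A lower threshold flags more customers (more outreach work, fewer missed churners); a higher one flags fewer (less work, more churn slips through). F1 gives a single number for that trade-off, but the final choice still depends on your Customer Relations team's capacity.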