Machine learning is particularly good at finding patterns in data. One application for this is training a model to find anomalies, which are data points that don’t fit the discovered pattern. This has value across many industries, such as finance, information security, and medicine. Unfortunately, most anomalies are rare, so the quantity of examples for them is very small compared to normal data. In this lab, you will work with credit card transactions, which are labeled as either valid or fraudulent, and you need to create a model to identify the fraud. You will learn how to prepare data for training an anomaly detection model, as well as how to use one common anomaly detection algorithm.
Successfully complete this lab by achieving the following learning objectives:
- Set up the Workspace
- Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
- Create a Training Cluster of D2 instances. This lab will require a lot of compute for the pipeline, so set the max instances accordingly.
- Create a new blank Pipeline in the Azure Machine Learning Studio Designer.
- Prepare the Data
We need to do a few things to make the data suitable for training and evaluating the model. First, the label column has values of 1 for the normal class and 2 for the abnormal class. However, the scoring uses 0 for abnormal and 1 for normal, so we have to correct the abnormal value to evaluate our results properly. Next, we need to set the label column as a label. Finally, we need to split the data into training and test sets. However, unlike most other machine learning algorithms, Principal Component Analysis models are only trained on the normal data, so we have to further split the training data to only include normal examples.
- Start with the German Credit Card UCI dataset. The column named ‘Col21’ is the label. Inspect the label values.
- Use an Apply Math Operation node to change the label value of 2 to 0.
- Use an Edit Metadata node to rename the label column to ‘Label’ and set it as a label column.
- Create training and testing datasets using Split Data. The German Credit Card dataset is fairly small, so use 80% of the data for training. Make sure the training and test datasets contain a proportional number of examples of both normal and abnormal data.
- From the training dataset, Split Data again to only contain the normal examples. Remember, normal values are labeled as 1 in the data.
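The preparation steps above can be sketched in Python with pandas and scikit-learn. This is a rough stand-in for the Designer nodes, using a tiny synthetic DataFrame in place of the real dataset (the column name 'Col21' matches the lab; the other columns and values are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the dataset: 'Col21' is the label
# (1 = normal, 2 = abnormal), as in the German Credit Card UCI data.
df = pd.DataFrame({
    "Col1": range(20),
    "Col21": [1] * 14 + [2] * 6,
})

# Step 1: remap the abnormal label from 2 to 0 so it matches the
# scoring convention (0 = abnormal, 1 = normal).
df["Label"] = df["Col21"].replace({2: 0})
df = df.drop(columns=["Col21"])

# Step 2: 80/20 split, stratified so both sets keep a proportional
# mix of normal and abnormal examples.
train_df, test_df = train_test_split(
    df, train_size=0.8, stratify=df["Label"], random_state=42
)

# Step 3: PCA-based anomaly detection trains only on normal examples,
# so filter the training set down to label 1.
train_normal = train_df[train_df["Label"] == 1]
```

The stratified split is what the lab means by "a proportional number of examples of both normal and abnormal data": both the training and test sets preserve the original class ratio.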
- Train Anomaly Detection Models
Principal Component Analysis (PCA) reduces our feature space as an unsupervised process to make the model more efficient. This reduction inherently loses information. To compensate, we can oversample the data, generating examples statistically similar to the rest of our training data, which strengthens the signal the model learns from. See this research paper for a much more detailed explanation.
The oversampling rate is another hyperparameter to tune in your model. I’ve done this for you for the lab. For this dataset, an oversampling rate of 5 (which means 500% extra data) will produce decent results. We still have to tune the number of features needed to produce a good model.
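As a rough illustration of what an oversampling rate of 5 means in terms of data volume, here is a sketch using simple bootstrap resampling with replacement. The Designer node's actual mechanism for generating statistically similar examples may differ (see the linked research paper); this only shows the 500% size relationship:

```python
import numpy as np
from sklearn.utils import resample

# Stand-in for the normal training examples (values are illustrative).
X_train = np.arange(12, dtype=float).reshape(6, 2)

# An oversampling rate of 5 means 500% extra data: five additional
# examples drawn for every original example.
oversampling_rate = 5
extra = resample(
    X_train,
    replace=True,
    n_samples=oversampling_rate * len(X_train),
    random_state=7,
)
X_oversampled = np.vstack([X_train, extra])
```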
- With three PCA-Based Anomaly Detection nodes, select about 1/3 of the features, 1/2 of the features, and 2/3 of the features. Set the oversampling rate to 5. Do not use feature normalization for this dataset.
- Use a Train Anomaly Detection Model node to train each model. Make sure to only pass in the normal data to this training process.
- Use Score Model nodes to predict the testing data.
- Use Evaluate Model nodes to see prediction stats.
Submit the pipeline. Due to the large amount of processing required, this can take 10-15 minutes. Grab a coffee, read the linked research paper, or watch another lesson while you wait.
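While the pipeline runs, it may help to see the core idea the PCA-Based Anomaly Detection nodes rely on. The sketch below uses reconstruction error as the anomaly score: points close to the low-dimensional subspace learned from normal data reconstruct well, while anomalies do not. This is a simplified version of the idea, not the exact algorithm the Designer node implements:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic normal data lying near a 2-D subspace of a 3-D feature
# space, plus a small amount of noise.
normal = rng.normal(size=(200, 2)) @ np.array([[1.0, 2.0, 0.5],
                                               [0.3, 0.1, 1.0]])
normal += rng.normal(scale=0.05, size=normal.shape)

# Anomalies scattered freely through the full 3-D space.
anomalies = rng.normal(scale=3.0, size=(10, 3))

# Fit PCA only on the normal examples, keeping a subset of features,
# just as the lab's training step passes in only the normal data.
pca = PCA(n_components=2).fit(normal)

def anomaly_score(X):
    """Reconstruction error: distance between X and its PCA projection."""
    reconstructed = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - reconstructed, axis=1)

normal_scores = anomaly_score(normal)
anomaly_scores = anomaly_score(anomalies)
```

Because the model never saw anomalous data, anomalies fall far from the learned subspace and receive much higher scores than normal points.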
- Evaluate the Models
For anomaly detection, we are concerned primarily with True Negatives, which are correctly predicted anomalies. However, we must also pay attention to False Negatives, which are normal values incorrectly predicted as anomalous. A large number of False Negatives adds a lot of noise to the signal we are trying to find. The metric we want is called Negative Predictive Value, which is the ratio of True Negatives to all predicted negatives (True Negatives + False Negatives).
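A small worked example of Negative Predictive Value, using the label convention from this lab (0 = anomaly, the negative class; 1 = normal, the positive class). The labels below are made up for illustration:

```python
# Hypothetical actual labels and model predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

# True Negatives: predicted anomaly (0) and actually an anomaly.
tn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 0)

# False Negatives: predicted anomaly (0) but actually normal.
fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)

# Negative Predictive Value: True Negatives / all predicted negatives.
npv = tn / (tn + fn)
```

Here the model predicts three negatives, two of which are real anomalies, so the NPV is 2/3.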
- View the results of each Evaluate Model node. Which produces the most True Negatives?
- Which model has the best Negative Predictive Value?
There are many more combinations of features and oversampling that can be tried to produce an optimal model, but this gives you a good idea of the spectrum of possible results. You can also try with more or less data. There are plenty of options for hyperparameter tuning in this pipeline.