Detecting Anomalies in Azure Machine Learning

45 minutes
  • 4 Learning Objectives

About this Hands-on Lab

Machine learning is particularly good at finding patterns in data. One application for this is training a model to find anomalies, which are data points that don’t fit the discovered pattern. This has value across many industries, such as finance, information security, and medicine. Unfortunately, most anomalies are rare, so the quantity of examples for them is very small compared to normal data. In this lab, you will work with credit card transactions, which are labeled as either valid or fraudulent, and you need to create a model to identify the fraud. You will learn how to prepare data for training an anomaly detection model, as well as how to use one common anomaly detection algorithm.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Set up the Workspace
  1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
  2. Create a Training Cluster of D2 instances. This lab will require a lot of compute for the pipeline, so set the max instances to 4.
  3. Create a new blank Pipeline in the Azure Machine Learning Studio Designer.

Prepare the Data

We need to do a few things to make the data suitable for training and evaluating the model. First, the label column has values of 1 for the normal class and 2 for the abnormal class. However, the scoring uses 0 for abnormal and 1 for normal, so we have to change the abnormal value from 2 to 0 to evaluate our results properly. Next, we need to mark the label column as a label. Finally, we need to split the data into training and test sets. Unlike most other machine learning algorithms, Principal Component Analysis (PCA) anomaly detection models are trained only on normal data, so we have to split the training data a second time to keep only the normal examples.

  1. Start with the German Credit Card UCI dataset. The column named ‘Col21’ is the label. Inspect the label values.
  2. Use an Apply Math Operation node to change the label value of 2 to 0.
  3. Use an Edit Metadata node to rename the label column to ‘Label’ and set it as a label column.
  4. Create training and testing datasets using Split Data. The German Credit Card dataset is fairly small, so use 80% of the data for training. Make sure the training and test datasets contain a proportional number of examples of both normal and abnormal data.
  5. From the training dataset, Split Data again to only contain the normal examples. Remember, normal values are labeled as 1 in the data.
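
The lab performs these steps with Designer nodes, but the same logic can be sketched in plain Python to make each transformation concrete. This is only an illustration under stated assumptions: the toy rows, the helper names `relabel` and `stratified_split`, and the 70/30 class mix are hypothetical and are not part of the lab environment.

```python
import random
from collections import defaultdict


def relabel(rows, label_key="Col21"):
    """Copy each row, rename the label column to 'Label', and map 2 (abnormal) to 0."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != label_key}
        new_row["Label"] = 0 if row[label_key] == 2 else row[label_key]
        out.append(new_row)
    return out


def stratified_split(rows, train_fraction=0.8, seed=0):
    """Split rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["Label"]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy data: 70 normal rows (label 1) and 30 abnormal rows (label 2).
raw = [{"Col1": i, "Col21": 1 if i < 70 else 2} for i in range(100)]

rows = relabel(raw)                    # steps 2 and 3: fix the label values and name
train, test = stratified_split(rows)   # step 4: 80/20 split, proportional per class
train_normal = [r for r in train if r["Label"] == 1]  # step 5: normal rows only

print(len(train), len(test), len(train_normal))
```

The test set keeps both classes so the model can be evaluated on anomalies, while `train_normal` contains only rows labeled 1.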

Train Anomaly Detection Models

Principal Component Analysis (PCA) reduces our feature space as an unsupervised process to make the model more efficient. This reduction inherently loses information. To compensate for the lost information, we can oversample the data, generating statistically similar examples to the rest of our training data to help boost the anomalous information. See this research paper for a much more detailed explanation.

The oversampling rate is another hyperparameter to tune in your model. I’ve done this for you for the lab. For this dataset, an oversampling rate of 5 (which means 500% extra data) will produce decent results. We still have to tune the number of features needed to produce a good model.

  1. With three PCA-Based Anomaly Detection nodes, select about 1/3 of the features, 1/2 of the features, and 2/3 of the features. Set the oversampling rate to 5. Do not use feature normalization for this dataset.
  2. Use a Train Anomaly Detection Model node to train each model. Make sure to only pass in the normal data to this training process.
  3. Use Score Model nodes to predict the testing data.
  4. Use Evaluate Model nodes to see prediction stats.

Submit the pipeline. Due to the large amount of processing required, this can take 10-15 minutes. Grab a coffee, read the linked research paper, or watch another lesson while you wait.
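
While the pipeline runs, it may help to see the general idea behind PCA-based anomaly scoring in code. The sketch below is not the Designer module's implementation and does not model the oversampling parameter. It assumes a simple variant: fit principal components on normal data only, then score a point by its reconstruction error, which is the distance between the point and its projection onto those components. All function names and the toy data are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mean):
    """Sample covariance matrix of the rows around the given mean."""
    d, n = len(mean), len(rows)
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        c = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    return cov


def top_components(cov, k, iterations=200, seed=0):
    """Approximate the k leading eigenvectors by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]  # work on a copy
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
        )
        components.append(v)
        # Remove the component just found so the next pass finds the next one.
        for i in range(d):
            for j in range(d):
                cov[i][j] -= eigenvalue * v[i] * v[j]
    return components


def reconstruction_error(x, mean, components):
    """Distance between x and its projection onto the principal components."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    residual = centered[:]
    for v in components:
        proj = sum(ci * vi for ci, vi in zip(centered, v))
        residual = [r - proj * vi for r, vi in zip(residual, v)]
    return math.sqrt(sum(r * r for r in residual))


# Hypothetical normal data: points close to the line y = 2x.
normal = [[i, 2 * i + (0.1 if i % 2 else -0.1)] for i in range(20)]
mean = mean_vector(normal)
components = top_components(covariance(normal, mean), k=1)

score_on_line = reconstruction_error([5, 10], mean, components)
score_off_line = reconstruction_error([5, 20], mean, components)
print(score_on_line, score_off_line)
```

A point that follows the pattern of the normal data reconstructs almost perfectly and gets a low score, while a point that breaks the pattern gets a high one. Keeping more components lowers the error for every point, which is one way to see why the number of components is worth tuning.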

Evaluate the Models

For anomaly detection, we are concerned primarily with True Negatives, which are correctly predicted anomalies. Anomalies are the negative class here because they carry the label 0. However, we must also pay attention to False Negatives, which are normal values incorrectly predicted as anomalous. Having a large number of False Negatives adds a lot of noise to the signal we are trying to find. The metric we want is called Negative Predictive Value, which is the ratio of True Negatives to all predicted negatives (True Negatives + False Negatives).
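
As a quick check on the definition, the ratio can be computed by hand from actual and predicted labels. This is a minimal sketch assuming the lab's convention of 0 for anomalous and 1 for normal. The function name and the sample label lists are made up for illustration.

```python
def negative_predictive_value(actual, predicted):
    """Return (true_negatives, false_negatives, npv) with 0 = anomaly, 1 = normal."""
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tn + fn == 0:
        return tn, fn, None  # the model predicted no anomalies at all
    return tn, fn, tn / (tn + fn)


# Hypothetical labels: 3 real anomalies and 7 normal transactions.
actual = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

tn, fn, npv = negative_predictive_value(actual, predicted)
print(tn, fn, npv)
```

Here two of the three predicted anomalies are real, so the value is 2/3: one in three flagged transactions would be a false alarm.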

  1. View the results of each Evaluate Model node. Which produces the most True Negatives?
  2. Which model has the best Negative Predictive Value?

There are many more combinations of features and oversampling that can be tried to produce an optimal model, but this gives you a good idea of the spectrum of possible results. You can also try with more or less data. There are plenty of options for hyperparameter tuning in this pipeline.

Additional Resources

You have been given a set of data about credit card transactions that have been labeled as valid or fraudulent. You need to create a model to identify fraud. Any transaction this model predicts as fraudulent will automatically place a hold on the user's account, forcing them to contact the company to confirm their purchases. Your task is to catch as many fraudulent transactions as you can in order to prevent misuse, while inconveniencing the fewest customers making valid purchases.

Lab Goals

  1. Set up the workspace.
  2. Prepare the data.
  3. Train anomaly detection models.
  4. Evaluate the models.

Logging in to the Lab Environment

To avoid issues with the lab, use a new incognito or private browser window to log in to the lab. This ensures your personal account credentials, which may be active in your main window, are not used for the lab.
