Creating Custom Modules in Azure Machine Learning Designer

1 hour
  • 5 Learning Objectives

About this Hands-on Lab

While Azure Machine Learning Designer provides a wealth of transforms, not every need can be met by the predefined modules. For more specific or custom needs, you can write your own data manipulations directly in either Python or R. This gives you the power to use any standard features of those languages, such as conditionals and looping, as well as many libraries, including your own modules, to prepare, clean, and wrangle the data to your exact needs. In this lab, we will write a custom data transformation module in Python.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Set Up the Workspace
  1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.

  2. Create a training cluster of D2 instances.

  3. Create a new blank pipeline in the Azure Machine Learning Studio Designer.

Explore the Data
  1. Add an IMDB Movie Titles dataset node to the canvas. Visualize the data and try to find a pattern for the year at the end of the movie name.

  2. Use the Year at end of title query in an Apply SQL Transformation node to show outliers to the basic pattern.

  3. Submit the pipeline to apply the transform. Once finished, inspect the output of Apply SQL Transformation.

  4. Though the data is mostly uniform, there are some values that are different. Evaluate the results to find the edge cases that will need to be captured.
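
The same kind of check can be sketched outside the pipeline in plain Python. This is only an illustration of the idea, not the lab's Year at end of title query: the basic pattern (a four-digit year in parentheses at the very end of the name), the helper name `find_outliers`, and the sample titles are all assumptions.

```python
import re

# Assumed "basic" pattern: the name ends with a four-digit year in parentheses.
BASIC_PATTERN = re.compile(r"\(\d{4}\)$")


def find_outliers(titles):
    """Return the titles that do NOT end with a parenthesized four-digit year."""
    return [t for t in titles if not BASIC_PATTERN.search(t.strip())]


# Hypothetical sample values, made up for illustration only.
sample = ["Toy Story (1995)", "Heat (1995) (V)", "Hamlet (1996/I)", "Untitled"]
outliers = find_outliers(sample)
```

Anything returned by a check like this is a candidate edge case for the regular expressions written in the next steps.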

Extract the Year
  1. Add an Execute Python Script transform node to the canvas.

  2. Define a better regular expression pattern than what we used in the Year at end of title query above.

  3. Create a function to extract the year from the title based on the regular expression.

  4. Define the entry point to the code. Add "Year" as a new column in the data frame. If you have issues, refer to the Run Python code in Azure Machine Learning documentation.

  5. Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query.

  6. Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did the regular expression capture all of the values, or will we have to do further processing later?
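
As a rough guide, the script for this objective might look like the sketch below. It assumes the title column is named "Movie Name" and that the year appears as four digits in parentheses, optionally followed by a suffix such as "/I"; both are assumptions to check against what you saw while exploring the data. The function name `extract_year` is illustrative. The `azureml_main` entry point receives pandas data frames from the Designer and returns a one-element tuple, as in the default Execute Python Script template.

```python
import re

# Assumption: the year is four digits in parentheses, optionally followed by
# a suffix such as "/I" before the closing parenthesis.
YEAR_PATTERN = re.compile(r"\((\d{4})(?:/[IVX]+)?\)")


def extract_year(title):
    """Return the year as an int, or None when no year is found."""
    if not isinstance(title, str):
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1 is the pandas DataFrame connected to the first input port.
    # "Movie Name" is an assumed column name; adjust it to match the dataset.
    dataframe1["Year"] = dataframe1["Movie Name"].apply(extract_year)
    return dataframe1,
```

Titles that do not match keep an empty Year value, which is what the Null year query is meant to surface.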

Clean Up the Movie Titles
  1. Add another Execute Python Script transform node to the canvas.

  2. Define a regular expression pattern to strip the year and anything after it from the movie name.

  3. Create a function to parse the movie name using the regular expression you just created. This function should not lose any movie names if they don’t match the pattern.

  4. Define the entry point to the code. Replace the current movie name with the cleaned movie name.

  5. Add an Apply SQL Transformation node to evaluate the results. Use the Year at end of title query to show any results that were not cleaned up as expected.

  6. Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did your regular expression clean up the names properly? Did you lose any data in the process? For reference, there are 16,059 rows in the original dataset.
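
One possible shape for this script is sketched below. The pattern and the "Movie Name" column name are assumptions, and `clean_title` is an illustrative name. The function returns the original value whenever the pattern does not match, or when stripping would leave an empty name, so no movie names are lost.

```python
import re

# Assumption: strip optional whitespace, a parenthesized year (with any
# suffix inside the parentheses), and everything that follows it.
YEAR_AND_TAIL = re.compile(r"\s*\(\d{4}[^)]*\).*$")


def clean_title(title):
    """Remove the year and anything after it; never return an empty name."""
    if not isinstance(title, str):
        return title
    cleaned = YEAR_AND_TAIL.sub("", title).strip()
    # Fall back to the original value if nothing would be left.
    return cleaned if cleaned else title


def azureml_main(dataframe1=None, dataframe2=None):
    # "Movie Name" is an assumed column name; adjust it to match the dataset.
    dataframe1["Movie Name"] = dataframe1["Movie Name"].apply(clean_title)
    return dataframe1,
```

Because the function maps each value to exactly one value, the row count should be unchanged after the transform.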

Combine the Custom Modules
  1. With our code written and tested, combine the code into one module to make the pipeline more efficient.

  2. Define regular expression patterns for extracting the year and cleaning up the movie name.

  3. Create functions for applying those regular expressions to the data.

  4. Define your entry point function and add calls to both of your data cleaning functions. The order in which they are called will matter.

  5. Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query to show any rows that did not parse the year correctly.

  6. Submit the pipeline, then inspect the results from the Execute Python Script node first. Did the movie names and years come out as you expected? Did you lose any rows of data?

  7. Inspect the results from the Apply SQL Transformation node. This will show you data that still needs further processing.
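
A combined module could be sketched as follows, under the same assumptions as before (a "Movie Name" column and a parenthesized four-digit year; all function and pattern names are illustrative). The point to notice is the order of the calls: the year has to be read from the original name before the name is cleaned, because cleaning removes the text the year pattern looks for.

```python
import re

# Assumed patterns; tune them to the edge cases found while exploring the data.
YEAR_PATTERN = re.compile(r"\((\d{4})(?:/[IVX]+)?\)")
YEAR_AND_TAIL = re.compile(r"\s*\(\d{4}[^)]*\).*$")


def extract_year(title):
    """Return the year as an int, or None when no year is found."""
    if not isinstance(title, str):
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def clean_title(title):
    """Remove the year and anything after it; keep names that do not match."""
    if not isinstance(title, str):
        return title
    cleaned = YEAR_AND_TAIL.sub("", title).strip()
    return cleaned if cleaned else title


def azureml_main(dataframe1=None, dataframe2=None):
    # "Movie Name" is an assumed column name; adjust it to match the dataset.
    names = dataframe1["Movie Name"]
    # Order matters: extract the year first, then overwrite the name.
    dataframe1["Year"] = names.apply(extract_year)
    dataframe1["Movie Name"] = names.apply(clean_title)
    return dataframe1,
```

Calling the functions in the opposite order would leave every Year empty, since the cleaned names no longer contain a year.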

Additional Resources

Some data requires extra steps and custom logic to process. One such example is when the source data has not been normalized, meaning that a column contains multiple features that our model should see. Sometimes it is possible to fix this in the source, but it is usually easier to build a transform directly into your data pipeline.

The IMDB Movie Titles dataset contains the movie release year as part of the movie name feature. We want to extract this year as a separate feature, then remove it from the movie name. This will require parsing text to convert the years to integers, adding those integers as a new feature, and then manipulating the text in the existing feature. However, because we are parsing text, we need to make sure that all of the text fields follow the same pattern. You guessed it: we first need to spend time understanding our data.

Lab Goals

  1. Set Up the Workspace
  2. Explore the Data
  3. Extract the Year
  4. Clean Up the Movie Titles
  5. Combine the Custom Modules

Log in to the Lab Environment

Log in to the Azure portal using the credentials provided on the lab page. Be sure to use an incognito or private browser window to ensure you're using the lab account rather than your own.
