A Cloud Guru | Azure | Labs

Creating Custom Modules in Azure Machine Learning Designer

While Azure Machine Learning Designer provides a wealth of transforms, not every need can be met by the predefined modules. For more specific or custom needs, you can write your own data manipulations directly in either Python or R. This gives you the power to use any standard features of those languages, such as conditionals and looping, as well as many libraries, including your own modules, to prepare, clean, and wrangle the data to your exact needs. In this lab, we will write a custom data transformation module in Python.
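As a rough orientation, the Execute Python Script module calls an entry point named `azureml_main`, which receives up to two optional pandas DataFrames and returns a sequence of DataFrames. The sketch below is a minimal pass-through under that assumption; it is a starting skeleton, not the lab's solution.

```python
def azureml_main(dataframe1=None, dataframe2=None):
    """Entry point that the Execute Python Script module looks for."""
    # Work on a copy so the incoming data is not modified in place.
    result = dataframe1.copy()

    # Custom logic (conditionals, loops, library calls) would go here.

    # The module expects a sequence of outputs, hence the trailing comma.
    return result,
```

The second input, `dataframe2`, is unused here; it is only needed when a second dataset is wired into the node.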


Path Info

Level: Advanced
Duration: 1h 0m
Published: Sep 24, 2020


Table of Contents

  1. Challenge

    Set Up the Workspace

    1. Log in and go to the Azure Machine Learning Studio workspace provided in the lab.

    2. Create a training cluster of D2 instances.

    3. Create a new blank pipeline in the Azure Machine Learning Studio Designer.

  2. Challenge

    Explore the Data

    1. Add an IMDB Movie Titles dataset node to the canvas. Visualize the data and try to find a pattern for the year at the end of the movie name.

    2. Use the Year at end of title query in an Apply SQL Transformation node to show the titles that deviate from the basic pattern.

    3. Submit the pipeline to apply the transform. Once finished, inspect the output of Apply SQL Transformation.

    4. Though most titles follow the same pattern, some do not. Evaluate the results to find the edge cases that will need to be captured.
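The lab performs this check with a SQL query whose text is not reproduced here. As a rough local equivalent, the same idea can be sketched in plain Python: flag every title that does not end in a bare four-digit year in parentheses. The pattern, the function name `find_outliers`, and the sample titles are all assumptions made up for illustration, not values taken from the dataset.

```python
import re

# Basic pattern: a four-digit year in parentheses at the very end of the title.
BASIC_YEAR_AT_END = re.compile(r"\(\d{4}\)\s*$")


def find_outliers(titles):
    """Return the titles that do not end in a plain '(YYYY)'."""
    return [t for t in titles if not BASIC_YEAR_AT_END.search(t)]


# Hypothetical sample titles, invented for illustration only.
sample = [
    "Example Movie (1995)",
    "Example Series (2005) (TV)",
    "Example Short (1999/I)",
    "Untitled Example",
]

print(find_outliers(sample))
```

Whatever the real outliers look like, listing them first makes it easier to decide how permissive the extraction pattern in the next challenge needs to be.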

  3. Challenge

    Extract the Year

    1. Add an Execute Python Script transform node to the canvas.

    2. Define a regular expression pattern that handles more cases than the one used in the Year at end of title query above.

    3. Create a function to extract the year from the title based on the regular expression.

    4. Define the entry point to the code. Add "Year" as a new column in the data frame. If you have issues, refer to the Run Python code in Azure Machine Learning documentation.

    5. Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query.

    6. Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did the regular expression capture all of the values, or will we have to do further processing later?
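One possible shape for this module is sketched below. It assumes the title column is called `Movie Name` and that the year always follows an opening parenthesis; both are assumptions to verify against the data, and the pattern is only one of many that could work. `extract_year` is a hypothetical helper name.

```python
import re

# A four-digit year starting with 18, 19, or 20, directly after "(".
# It does not require ")" right after, so "(1999/I)" still matches.
YEAR_PATTERN = re.compile(r"\(((?:18|19|20)\d{2})")


def extract_year(title):
    """Return the year as an int, or None when no year is found."""
    if not isinstance(title, str):
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    # "Movie Name" is an assumed column name; adjust it to the dataset.
    df["Year"] = df["Movie Name"].apply(extract_year)
    return df,
```

Rows where `extract_year` returns `None` are the ones a null-year check would surface for further processing.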

  4. Challenge

    Clean Up the Movie Titles

    1. Add another Execute Python Script transform node to the canvas.

    2. Define a regular expression pattern to strip the year and anything after it from the movie name.

    3. Create a function to parse the movie name using the regular expression you just created. This function should not lose any movie names if they don't match the pattern.

    4. Define the entry point to the code. Replace the current movie name with the cleaned movie name.

    5. Add an Apply SQL Transformation node to evaluate the results. Use the Year at end of title query to show any results that were not cleaned up as expected.

    6. Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did your regular expression clean up the names properly? Did you lose any data in the process? For reference, there are 16,059 rows in the original dataset.
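A minimal sketch of the cleanup step follows, again assuming a `Movie Name` column and a year introduced by an opening parenthesis. The helper `clean_title` is hypothetical; it returns the original value whenever the pattern does not match or would leave nothing behind, so no movie name is lost.

```python
import re

# Everything from the first "(YYYY" to the end of the string,
# including any whitespace just before the parenthesis.
YEAR_AND_REST = re.compile(r"\s*\((?:18|19|20)\d{2}.*$")


def clean_title(title):
    """Strip the year and anything after it; keep the title if nothing matches."""
    if not isinstance(title, str):
        return title
    cleaned = YEAR_AND_REST.sub("", title).strip()
    # Fall back to the original so a title is never replaced by an empty string.
    return cleaned if cleaned else title


def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    # "Movie Name" is an assumed column name; adjust it to the dataset.
    df["Movie Name"] = df["Movie Name"].apply(clean_title)
    return df,
```

Because the function transforms values rather than filtering rows, the output should have the same number of rows as the input, which is a quick way to confirm nothing was dropped.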

  5. Challenge

    Combine the Custom Modules

    1. With our code written and tested, combine the code into one module to make the pipeline more efficient.

    2. Define regular expression patterns for extracting the year and cleaning up the movie name.

    3. Create functions for applying those regular expressions to the data.

    4. Define your entry point function and add calls to both of your data cleaning functions. The order in which they are called will matter.

    5. Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query to show any rows that did not parse the year correctly.

    6. Submit the pipeline, then inspect the results from the Execute Python Script node first. Did the movie names and years come out as you expected? Did you lose any rows of data?

    7. Inspect the results from the Apply SQL Transformation node. This will show you data that still needs further processing.
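The combined module might look like the sketch below, under the same assumptions as before (a `Movie Name` column, a year that follows an opening parenthesis, hypothetical helper names). The point it illustrates is the call order: the year has to be extracted before the title is cleaned, because cleaning removes the text the year is read from.

```python
import re

# Assumed patterns; tune them to the edge cases found in the data.
YEAR_PATTERN = re.compile(r"\(((?:18|19|20)\d{2})")
YEAR_AND_REST = re.compile(r"\s*\((?:18|19|20)\d{2}.*$")


def extract_year(title):
    """Return the year as an int, or None when no year is found."""
    if not isinstance(title, str):
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def clean_title(title):
    """Strip the year and anything after it; keep the title if nothing matches."""
    if not isinstance(title, str):
        return title
    cleaned = YEAR_AND_REST.sub("", title).strip()
    return cleaned if cleaned else title


def azureml_main(dataframe1=None, dataframe2=None):
    df = dataframe1.copy()
    # Step 1: read the year while it is still part of the title.
    df["Year"] = df["Movie Name"].apply(extract_year)
    # Step 2: only now remove the year from the title.
    df["Movie Name"] = df["Movie Name"].apply(clean_title)
    return df,
```

Swapping the two steps would leave the `Year` column empty for every title that was cleaned successfully.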

The Cloud Content team comprises subject matter experts hyper-focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and much more.
