While Azure Machine Learning Designer provides a wealth of transforms, not every need can be met by the predefined modules. For more specific or custom needs, you can write your own data manipulations directly in either Python or R. This gives you the power to use any standard features of those languages, such as conditionals and looping, as well as many libraries, including your own modules, to prepare, clean, and wrangle the data to your exact needs. In this lab, we will write a custom data transformation module in Python.
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Set Up the Workspace
Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
Create a training cluster of D2 instances.
Create a new blank pipeline in the Azure Machine Learning Studio Designer.
- Explore the Data
Add an IMDB Movie Titles dataset node to the canvas. Visualize the data and try to find a pattern for the year at the end of the movie name.
Use the Year at end of title query in an Apply SQL Transformation node to show outliers to the basic pattern.
Submit the pipeline to apply the transform. Once finished, inspect the output of Apply SQL Transformation.
Though the data is mostly uniform, some titles deviate from the basic pattern. Evaluate the results to identify the edge cases your code will need to handle.
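The named queries used in this step are supplied by the lab environment, so the exact SQL is not reproduced here. As a rough illustration only, a "year at end of title" outlier check might look like the sketch below; the query text and sample rows are assumptions, not the lab's actual query or data. Apply SQL Transformation uses SQLite syntax (with inputs exposed as tables t1, t2, and t3), so the same idea can be tried locally with Python's sqlite3 module:

```python
import sqlite3

# Hypothetical sample rows; the real lab data comes from the
# IMDB Movie Titles dataset.
titles = [
    ("Toy Story (1995)",),
    ("Heat (1995)",),
    ("Persuasion",),  # no year at all -- an outlier
    ("Twelve Monkeys (a.k.a. 12 Monkeys) (1995)",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (Title TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", titles)

# Assumed query: select titles that do NOT end in "(yyyy)".
# SQLite GLOB treats parentheses literally and supports [0-9] classes.
outliers = conn.execute(
    "SELECT Title FROM t1 "
    "WHERE Title NOT GLOB '*([0-9][0-9][0-9][0-9])'"
).fetchall()

for (title,) in outliers:
    print(title)
```

Rows like "Persuasion" surface as outliers, while titles ending in a parenthesized year pass the basic pattern.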
- Extract the Year
Add an Execute Python Script transform node to the canvas.
Define a better regular expression pattern than what we used in the Year at end of title query above.
Create a function to extract the year from the title based on the regular expression.
Define the entry point to the code. Add "Year" as a new column in the data frame. If you have issues, refer to the Run Python code in Azure Machine Learning documentation.
Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query.
Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did the regular expression capture all of the values, or will we have to do further processing later?
- Clean Up the Movie Titles
Add another Execute Python Script transform node to the canvas.
Define a regular expression pattern to strip the year and anything after it from the movie name.
Create a function to parse the movie name using the regular expression you just created. This function should not lose any movie names if they don’t match the pattern.
Define the entry point to the code. Replace the current movie name with the cleaned movie name.
Add an Apply SQL Transformation transform node to evaluate the results. Use the Year at end of title query to show any results that were not cleaned up as expected.
Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did your regular expression clean up the names properly? Did you lose any data in the process? For reference, there are 16,059 rows in the original dataset.
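A cleanup script along these lines is one way to approach the step; again, the regular expression and the "Movie Name" column are assumptions, and your own pattern may differ. The key property is that names which do not match the pattern pass through unchanged, so no rows are lost:

```python
import re
import pandas as pd

# Assumed pattern: capture everything before the trailing "(yyyy)" and
# drop the year plus anything after it. A sketch, not the lab's answer.
NAME_PATTERN = re.compile(r"^(.*?)\s*\(\d{4}\).*$")

def clean_name(title):
    """Strip the trailing year; return the title unchanged if no match."""
    match = NAME_PATTERN.match(title)
    return match.group(1) if match else title

def azureml_main(dataframe1=None, dataframe2=None):
    dataframe1["Movie Name"] = dataframe1["Movie Name"].apply(clean_name)
    return dataframe1,
```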
- Combine the Custom Modules
With the code written and tested, combine both transformations into a single module to make the pipeline more efficient.
Define regular expression patterns for extracting the year and cleaning up the movie name.
Create functions for applying those regular expressions to the data.
Define your entry point function and add calls to both of your data cleaning functions. The order in which they are called will matter.
Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query to show any rows that did not parse the year correctly.
Submit the pipeline, then inspect the results from the Execute Python Script node first. Did the movie names and years come out like you expected? Did you lose any rows of data?
Inspect the results from the Apply SQL Transformation node. This will show you data that still needs further processing.