Kesha's Korner

Preparing Your Dataset for Machine Learning

Kesha Williams

Hi, I’m Kesha Williams, an AWS Machine Learning Hero and Alexa Champion. This is Kesha’s Korner, where we learn about artificial intelligence and machine learning on AWS. Come learn with me as we transform your engineering skills and future-proof your career!

Today, we are going to talk about preparing your data for machine learning. 

The Data

Machine learning depends heavily on data! The data preparation stage is very important. The higher the quality of your data, the higher the quality of your predictions in the long run. In this prep stage, we have to focus on both quality and quantity. Just remember that bad data can limit the effectiveness of your machine learning models and diminish public trust in the technology, which we don’t want to happen.

Most of the work that goes into producing a machine learning model occurs in this data prep stage, which is shown as the pink blocks in our diagram.

Diagram: Machine Learning Model

It’s great that you have massive amounts of reputable data, but are they in a format that a machine can learn from and easily find trends and patterns in? Typically the answer is no, so you have to go through a labeling, cleaning, and transformation process. 

Developers and others transitioning to machine learning often underestimate the time needed to label, clean, and transform data. When dealing with data, domain expertise plays a big role. You would definitely be less effective when working with the data if you didn’t understand it. It’s important to understand which attributes (or features) are of real significance to a given dataset; it makes the process easier and the output better.


Data Labeling

When dealing with supervised learning, you’ll have to label your data. Let’s take the case of building a computer vision system that is reliable enough to identify cats, dogs, and birds. Well, the training data has to be labeled with the correct answer. So pictures of cats have to be labeled (or tagged) as cats, pictures of dogs have to be labeled as dogs, and pictures that have birds, well, have to be labeled as birds. 

Now, I know you’re thinking, wow, it’s going to take a long time to manually label the thousands and thousands of images a machine learning algorithm needs. Luckily, AWS has a service called Amazon SageMaker Ground Truth that can help speed up the process. Ground Truth is a service that uses machine learning to label data and also provides access to human labelers through Amazon Mechanical Turk.

Now let’s look at data cleaning and transformation.

Data Cleaning

Data cleaning is where you analyze the data and deal with missing or empty records, incorrect values, duplicates, typos, etc. 

Let’s say you have a dataset and you notice several null (or empty) values. You can get rid of the affected row (a single row is called an observation), which is called record sampling, or you can substitute the missing value with a dummy value. You may also find some features to be irrelevant to the problem you’re trying to solve, so those features can be removed (and that’s called feature selection, also known as attribute sampling).
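Here is a minimal sketch of those cleaning options in plain Python. The toy dataset, the feature names (age, city), and the helper function names are all made up for illustration, and filling with the mean is just one possible choice of substitute value.

```python
from statistics import mean

# Hypothetical toy dataset: each dict is one observation (row).
rows = [
    {"age": 34, "city": "Atlanta"},
    {"age": None, "city": "Dallas"},   # missing age
    {"age": 34, "city": "Atlanta"},    # exact duplicate of the first row
    {"age": 52, "city": None},         # missing city
]

def drop_incomplete(observations):
    """Keep only observations that have no missing (None) values."""
    return [r for r in observations if all(v is not None for v in r.values())]

def fill_missing(observations, feature, value):
    """Return copies where missing values of one feature are replaced by `value`."""
    return [
        {**r, feature: value if r[feature] is None else r[feature]}
        for r in observations
    ]

def drop_duplicates(observations):
    """Remove exact duplicate observations, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in observations:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def drop_feature(observations, feature):
    """Remove one feature (column) from every observation."""
    return [{k: v for k, v in r.items() if k != feature} for r in observations]

# One possible substitute value: the mean of the ages that are present.
mean_age = mean(r["age"] for r in rows if r["age"] is not None)
filled = fill_missing(rows, "age", mean_age)
```

Which option fits best depends on the data: dropping rows is simple but shrinks the dataset, while filling in values keeps every observation at the cost of introducing estimated numbers.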

Data Transformation

Now let’s talk about data transformation. This is where you change the features to make them more useful and easier for a machine to understand. Let’s take the example of categorical encoding, which is just a fancy way of saying that you take a feature whose values fall into a fixed set of categories and convert it into numbers.

Within categorical encoding, there’s a really cool-sounding data transformation technique called one-hot encoding. Let’s say you have a feature in your dataset called color that can have one of three values: red, white, or blue. Well, most machine learning algorithms can’t handle strings; the features have to be numeric.

How can you represent the colors red, white, and blue as numbers? Well, you can one-hot encode your categories. This technique creates a new column for each unique value. In our example, red, white, and blue will become three distinct columns. A row (or observation) with the color value of red will have a one in the Red column, while the White and Blue columns will have a zero. A machine can easily understand this format.
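To make that concrete, here is a small sketch of one-hot encoding in plain Python. The function name one_hot_encode and the color_red style column names are my own choices for illustration; machine learning libraries provide their own encoders with their own naming.

```python
def one_hot_encode(observations, feature):
    """Replace one categorical feature with a 0/1 column per unique value."""
    # Collect the unique categories; sorting keeps the column order stable.
    categories = sorted({r[feature] for r in observations})
    encoded = []
    for r in observations:
        # Copy every feature except the one being encoded.
        new_row = {k: v for k, v in r.items() if k != feature}
        for category in categories:
            new_row[f"{feature}_{category}"] = 1 if r[feature] == category else 0
        encoded.append(new_row)
    return encoded

# Hypothetical observations using the color feature from the example above.
rows = [{"color": "red"}, {"color": "white"}, {"color": "blue"}]
encoded = one_hot_encode(rows, "color")
print(encoded[0])  # {'color_blue': 0, 'color_red': 1, 'color_white': 0}
```

Each row ends up with exactly one column set to one, which is where the name "one-hot" comes from.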

Another part of data prep is called feature engineering. Feature engineering is the process of creating new features based on existing features. Do you recognize this? 

2020-04-30 15:28:03.829 

You’re right, it’s a date timestamp. Do you think a machine can easily find patterns when a feature is formatted like this? Not really. It would be easier if we split this feature out into three new features.

  • Month
  • Time of Day
  • Day of Week

Now, trends can be found across months, days of the week, or even time segments during a given day.
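The split into month, time of day, and day of week could look something like this sketch, which uses only Python’s standard library. The function names and the hour boundaries for the time-of-day segments are assumptions I picked for illustration; choose segments that make sense for your own data.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def time_of_day(hour):
    """Map an hour (0-23) to a coarse segment. Boundaries are an assumption."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"

def engineer_timestamp_features(timestamp):
    """Split a timestamp string into month, day of week, and time of day."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.%f")
    return {
        "month": parsed.month,
        "day_of_week": DAY_NAMES[parsed.weekday()],  # weekday(): Monday is 0
        "time_of_day": time_of_day(parsed.hour),
    }

features = engineer_timestamp_features("2020-04-30 15:28:03.829")
print(features)
# {'month': 4, 'day_of_week': 'Thursday', 'time_of_day': 'afternoon'}
```

The day of week and time of day are still strings here, so they would typically be one-hot encoded before training.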

Training & Evaluation Datasets

Now that you have a dataset that is labeled, cleaned, and transformed, the final step is to split your dataset into two distinct sets, one for training and one for evaluation. The training set typically contains 80% of the records and the evaluation set typically contains 20%. The evaluation dataset is used to evaluate the quality of the machine learning model produced during training.
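An 80/20 split can be sketched in a few lines of plain Python. The function name train_eval_split, the fixed seed, and the stand-in records are assumptions for illustration. Shuffling before splitting is my addition: it keeps the evaluation set from simply being the first or last 20% of the file.

```python
import random

def train_eval_split(observations, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into training and evaluation sets."""
    shuffled = list(observations)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical stand-in for 100 prepared observations.
records = list(range(100))
train, evaluation = train_eval_split(records)
print(len(train), len(evaluation))  # 80 20
```

Every record lands in exactly one of the two sets, so the model is never evaluated on data it saw during training.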

Today, we looked at data preprocessing and preparation with a focus on labeling, cleaning, and transformation. I hope you’ll join me in our next edition of Kesha’s Korner where we will talk about the inner workings of the training process. 
