Using Scikit-learn Random Forest Classifier (AWS SageMaker)

1 hour
  • 8 Learning Objectives

About this Hands-on Lab

Scikit-learn is a great place to start working with machine learning and artificial intelligence. In this activity, we will use Scikit-learn to create a Random Forest Classifier that performs a basic classification of people, predicting whether they are likely to prefer dogs or cats. The lab uses a fictitious data set that could easily be replaced with one of your own. The files used in this lab can be found on GitHub.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Navigate to the Jupyter Notebook

Navigate through the AWS console to the AWS SageMaker page. From there, load the Jupyter Notebook server that has been provided with this hands-on lab.

Load a CSV File into the Running Jupyter Notebook Environment
  1. Load the CSV file into a Pandas DataFrame and review:

    df = pd.read_csv("data.csv")

  2. Change the names of the DataFrame columns:

    df.columns = ['walk', 'run', 'color', 'distance', 'label']

  3. Look at the types of data:

    df.dtypes

  4. Change the data types of the data:

    from pandas.api.types import CategoricalDtype

    df['walk'] = df['walk'].astype('bool')
    df['run'] = df['run'].astype('bool')
    color_type = CategoricalDtype(categories=['red', 'green', 'blue'])
    df['color'] = df['color'].astype(color_type)
    df['label'] = df['label'].astype('bool')

  5. Re-review the data types:

    df.dtypes

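The prediction step at the end of this lab passes separate red, green, and blue inputs, which implies the categorical color column gets one-hot encoded before training. A minimal sketch of that step with pandas `get_dummies` (the toy values below are assumptions, not the lab's data.csv):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frame standing in for the lab's data; column names come from the lab
df = pd.DataFrame({
    "walk": [1, 0, 1],
    "run": [0, 1, 1],
    "color": ["red", "green", "blue"],
    "distance": [2, 10, 5],
    "label": [1, 0, 1],
})
color_type = CategoricalDtype(categories=["red", "green", "blue"])
df["color"] = df["color"].astype(color_type)

# One-hot encode the categorical column so the classifier sees numeric features
df = pd.get_dummies(df, columns=["color"])
print(list(df.columns))
```

Because `color` was declared categorical with explicit categories, the dummy columns come out in that category order: `color_red`, `color_green`, `color_blue`.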
Split the Training Data, and Then Use It to Train the Random Forest Model
  1. Use train_test_split to prepare the data, and then check that the data looks right:

    X_train, X_test, y_train, y_test = train_test_split(df.drop('label', axis=1), df['label'], test_size=0.2, random_state=10)
  2. Create the model object:

    model = RandomForestClassifier(max_depth=5)
  3. Fit the data to the model (train the model):

    model.fit(X_train, y_train)
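The split-and-fit steps above can be sketched end to end. This is a minimal, self-contained example on a synthetic DataFrame (the lab's data.csv is not included here, so the columns and values are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's DataFrame
df = pd.DataFrame({"walk": [i % 2 for i in range(100)],
                   "distance": list(range(100)),
                   "label": [i % 2 for i in range(100)]})

# Hold back 20% of the rows for testing, as in the lab
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("label", axis=1), df["label"], test_size=0.2, random_state=10)
print(len(X_train), len(X_test))  # 80 training rows, 20 test rows

model = RandomForestClassifier(max_depth=5, random_state=10)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # accuracy on the held-out rows
```

Checking the train/test lengths before fitting is a quick way to confirm the split did what you expected.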
Review the Trained Random Forest Model
  1. Obtain a single estimator (tree) from the model as well as the names of the features:

    estimator = model.estimators_[0]
    feature_names = [i for i in X_train.columns]
  2. Use export_graphviz to display the tree graphically:

    export_graphviz(estimator, out_file='tree.dot',
                    feature_names=feature_names,
                    rounded=True,
                    filled=True)
    call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
    Image(filename = 'tree.png')
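If the Graphviz `dot` binary is not available in the notebook environment, the fitted ensemble can still be reviewed in plain text. This sketch (synthetic data, not part of the lab script) pulls one tree and prints per-feature importances:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(max_depth=5, random_state=0).fit(X, y)

# One fitted tree from the ensemble, as in the lab
estimator = model.estimators_[0]
print(estimator.tree_.node_count)  # number of nodes in this tree

# Text-only summary: how much each feature contributed across all trees
for name, importance in zip(["f0", "f1", "f2", "f3"], model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The importances always sum to 1.0, so they are easy to compare at a glance.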
Perform Predictions with the Random Forest Model, and Produce a Confusion Matrix
  1. Pass test data into the model and get test results:

    y_predict = model.predict(X_test)
    y_pred_quant = model.predict_proba(X_test)[:, 1]
    y_pred_bin = model.predict(X_test)
  2. Use scikit-learn’s confusion_matrix function to create a confusion matrix:

    confusion_matrix = confusion_matrix(y_test, y_pred_bin)
  3. Optionally, format the confusion matrix with matplotlib:

    y_true = ["Dog", "Cat"]
    y_pred = ["Dog", "Cat"]
    df_cm = pd.DataFrame(confusion_matrix, columns=np.unique(y_true), index=np.unique(y_true)) = 'Actual' = 'Predicted'
    plt.figure(figsize=(8, 5))
    plt.title('Confusion Matrix')
    sn.set(font_scale=1.4)  # label size
    sn.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # annotation font size
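The layout of the matrix is worth confirming on a tiny example before reading the heatmap. The labels below are assumptions, not the lab's predictions:

```python
from sklearn.metrics import confusion_matrix

# Tiny assumed example
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]
cm = confusion_matrix(y_true, y_pred)
# scikit-learn convention: rows are actual classes, columns are predicted.
# cm[0,0] = true negatives,  cm[0,1] = false positives,
# cm[1,0] = false negatives, cm[1,1] = true positives
print(cm.tolist())  # [[2, 0], [1, 2]]
```

Here two 0s were predicted correctly, one 1 was missed, and two 1s were caught, which matches the printed matrix.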
Calculate the Sensitivity and Specificity of the Model
  1. Calculate the total number of results:

    total = sum(sum(confusion_matrix))

  2. Calculate the sensitivity:

    sensitivity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[1,0])
  3. Calculate the specificity:

    specificity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[0,1])
  4. Display the values:

    print('Sensitivity : ', sensitivity )
    print('Specificity : ', specificity)
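Working these two ratios through on a concrete matrix makes the formulas easier to check. This sketch uses an assumed 2x2 matrix and follows the lab's indexing convention above:

```python
import numpy as np

# Assumed confusion matrix, not the lab's results
cm = np.array([[50, 10],
               [5, 35]])

sensitivity = cm[0, 0] / (cm[0, 0] + cm[1, 0])  # 50 / 55
specificity = cm[1, 1] / (cm[1, 1] + cm[0, 1])  # 35 / 45
print(round(sensitivity, 3), round(specificity, 3))
```

Each ratio divides a diagonal entry by the total of its column, so both always fall between 0 and 1.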
Create an ROC Graph for our Random Forest Model, and Calculate the AUC
  1. Calculate the false positive rate (fpr) and the true positive rate (tpr) using scikit-learn’s roc_curve function:

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
  2. Use matplotlib to plot a ROC graph for our model:

    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for Dog vs Cat people')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
  3. And use this to calculate the AUC:

    auc(fpr, tpr)
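The roc_curve/auc pair can be exercised end to end on a four-sample toy case (the labels and probabilities below are assumptions; this small example also appears in scikit-learn's documentation):

```python
from sklearn.metrics import roc_curve, auc

# Assumed labels and predicted probabilities
y_test = [0, 0, 1, 1]
y_pred_quant = [0.1, 0.4, 0.35, 0.8]

# Sweep the decision threshold to get one (fpr, tpr) point per threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
print(auc(fpr, tpr))  # 0.75
```

`auc` simply integrates the curve with the trapezoidal rule, so a perfect classifier scores 1.0 and random guessing hovers around 0.5.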
Use the Random Forest Model to Make Predictions from “Real-World” Data
  1. Set the input variables for the model:

    like_walking = 0
    like_running = 1
    distance_run = 10
    # Only one of:
    red = 0
    green = 1
    blue = 0
  2. Use this data to make a prediction:

    prediction = model.predict([[like_walking, like_running, distance_run, red, green, blue]])
  3. Display the model output:

    if prediction[0]:
        print('This is a DOG person!')
    else:
        print('This is a CAT person!')
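The order of values in the prediction call matters: it must match the column order of the DataFrame the model was trained on. A self-contained sketch (synthetic data; the column names are assumptions based on the earlier steps):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the lab's one-hot-encoded training frame
X = pd.DataFrame({
    "walk": [1, 0, 1, 0],
    "run": [0, 1, 0, 1],
    "distance": [2, 10, 3, 8],
    "color_red": [1, 0, 0, 1],
    "color_green": [0, 1, 0, 0],
    "color_blue": [0, 0, 1, 0],
})
y = [1, 0, 1, 0]
model = RandomForestClassifier(max_depth=5, random_state=0).fit(X, y)

# The single input row must list features in the same order as the
# training columns: walk, run, distance, red, green, blue
row = [[0, 1, 10, 0, 1, 0]]
prediction = model.predict(row)
print(bool(prediction[0]))
```

If the input order were scrambled, the model would still run but would silently produce meaningless predictions, which is why exactly one of the color flags should be set.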

Additional Resources

This is a follow-along lab. The videos will walk through each of the steps, including loading the Jupyter Notebook server from AWS SageMaker.

Please make sure you are in the us-east-1 (N. Virginia) region when in the AWS console.

The files used in this lab can be found on GitHub.

At the end of the lab videos, take the rest of the time available for your own experimentation.
