Using Pandas Data Frames and Pivot Tables in Python

45 minutes
  • 4 Learning Objectives

About this Hands-on Lab

Pandas is a very popular library for performing data analysis with Python. Two common features that you’ll use when working with pandas are data frames and pivot tables. In this hands-on lab, you’ll go through the process of using data frames and pivot tables to analyze some employee information and provide some insight to others on your team.

*Warning:* This is a lab designed as part of a professional level course and is difficult. The lab asks you to accomplish something using exact methods and functionality of the `pandas` library that might not have been covered in lessons.

To feel comfortable completing this lab, you’ll want to know how to do the following:

* Use pandas data frames. Watch the “Creating and Using Dataframes” video from the [Using Python’s Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course.
* Create pivot tables. Watch the “Creating Pivot Tables” video from the [Using Python’s Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course.
* Comfortability reading the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) to find new functions and methods to use to accomplish your goal.

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Install Pandas

Before we can use pandas to analyze our employee data we’ll need to make sure that it is installed.

Create a data frame from the `employee_info` list and determine the number of unique job titles.

Our pay_analysis.py script already sets up a list of dictionary objects with our employee information named employee_info. We need to use this list to create a pandas.DataFrame for us to be able to perform our analysis. Get the column names from an item in the list and create a new data frame called employee_frame.

Note: We also need to import pandas.

Use a pivot table to determine the salary difference between people doing the same job depending on their gender.

To determine the average pay difference between men and women with the same job title, we’ll first need to determine which jobs have both men and women doing them. We can do this by grouping our data frame information by the job title first and then creating a pandas.Series that has the job titles as the index values and the number of unique genders as the values.

Next, using job_title_unique_gender_count, we’ll take all of the items that have a value of 2, drop the items that will be adjusted to a NaN (because they aren’t a 2) and then get a pandas.Index object with just the names by accessing the index of our pandas.Series. This seems a little complicated, but it’s a good way for us to get just a list of the job titles where the number of genders in that role is 2.

After that, we create a new data frame from our employee_frame data that only contains employees in a job that includes both genders.

The last two things we do are creating our pivot table based on the job title where we have a column for each gender and adding a column to the table to hold the pay difference between female and male employees.

Use a pivot table to determine the salary differences of people doing the same job depending on their age.

Our final pivot table will work a lot like the gender_difference_table, except we need to categorize each of our employees into an age range before we start making our comparisons. To do this, we’ll use the pandas.cut function and add the resulting series as a new column on our employee_frame object, calling it age_range. There are 2 possible categories for each employee to fall under "18 – 40" and "41 – 80". We’re going to also add a third, unused category to the list of categories for this column so that we can add a column to our pivot table later.

The data type of the age_range column is category, and it is important that when we’re adding a category to it that we use the inplace=True option. If we don’t do this, then we’ll get a new category returned to us and won’t modify the employee_frame at all.

Next, we need to determine which job titles have at least one employee from each of the age ranges. This code is virtually identical to what we did to determine while job titles had employees from each of the gender options.

Now that we have a new data frame, we’re ready to create our pivot table and populate a new ‘difference’ column. This will look very similar to our other pivot table, except we want to create columns for each of the values in our age_range column. Here’s the rest of the code that we need to write to create our table and calculate the differences in pay:

Additional Resources

Your HR department is in the processing of evaluating everyone's salary to make sure that there are no uneven pay differences and that everyone is getting paid their market value. They have asked you to analyze the employee information to determine if there are any large discrepancies in pay given employee age and gender. The information has been provided to you in the form of the employees.csv file, and you've decided that you can easily use pandas data frames and pivot tables to analyze the data.

You've been asked to determine:

  • The number of unique job titles that exist in the company.
  • The average pay difference between male and female workers with the same job title.
  • The average pay difference based on the age bracket for workers with the same job title. Grouping by ages less than or equal to 40 and those 41 to 80 years old.

The pay_analysis.py file contains some boilerplate to set up the data that you'll be working with. To achieve your goal, you'll need to leverage parts of the pandas library that you might not have learned about yet such as:

  • pandas.Series, nunique, dropna, index, where - For grouping and calculating values for a single column.
  • pandas.cut and pandas.Categorical - For categorizing employees into age brackets.

Here's the expected output that you're aiming for when running the pay_analysis.py script:

$ python3.8 pay_analysis.py
Number of Unique Job Titles: 183
                                   salary
gender                             Female      Male    difference
job_title
Account Coordinator          75666.666667   89000.0 -13333.333333
Account Executive            80333.333333   71500.0   8833.333333
Account Representative III   50500.000000  100500.0 -50000.000000
Accountant IV                60500.000000  112000.0 -51500.000000
Accounting Assistant I      121000.000000   38000.0  83000.000000
...                                   ...       ...           ...
Web Designer II              97333.333333   74000.0  23333.333333
Web Designer III             82500.000000   60000.0  22500.000000
Web Designer IV             102000.000000  121000.0 -19000.000000
Web Developer I             125000.000000   66000.0  59000.000000
Web Developer II             68666.666667   74000.0  -5333.333333

[129 rows x 3 columns]
                                   salary
age_range                         18 - 40        41 - 80    difference
job_title
Account Coordinator          76800.000000   86285.714286  -9485.714286
Account Representative I     82000.000000   66000.000000  16000.000000
Account Representative II   117000.000000   91000.000000  26000.000000
Account Representative III   38000.000000   88000.000000 -50000.000000
Accountant I                 99000.000000   80000.000000  19000.000000
...                                   ...            ...           ...
Web Designer II              88666.666667   87000.000000   1666.666667
Web Designer III             97000.000000   64000.000000  33000.000000
Web Developer I              66000.000000  125000.000000 -59000.000000
Web Developer II             43000.000000   89333.333333 -46333.333333
Web Developer III           118000.000000   58500.000000  59500.000000

[131 rows x 3 columns]

Logging In

There are a couple of ways to get in and work with the code. One is to use the credentials provided in the hands-on lab overview page, log in with SSH, and use a text editor in the terminal.

The other is using VS Code in the browser. If you'd like to go this route, then you will need to navigate to the public IP address of the workstation server (provided in the hands-on lab overview page) on port 8080 (example: http://PUBLIC_IP:8080). Your password will be the same password that you'd use to connect over SSH.

What are Hands-on Labs

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Sign In
Welcome Back!

Psst…this one if you’ve been moved to ACG!

Get Started
Who’s going to be learning?