Pandas is a very popular library for performing data analysis with Python. Two common features that you’ll use when working with pandas are data frames and pivot tables. In this hands-on lab, you’ll go through the process of using data frames and pivot tables to analyze some employee information and provide some insight to others on your team.
*Warning:* This is a lab designed as part of a professional level course and is difficult. The lab asks you to accomplish something using exact methods and functionality of the `pandas` library that might not have been covered in lessons.
To feel comfortable completing this lab, you’ll want to know how to do the following:
* Use pandas data frames. Watch the “Creating and Using Dataframes” video from the [Using Python’s Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course.
* Create pivot tables. Watch the “Creating Pivot Tables” video from the [Using Python’s Math, Science, and Engineering Libraries](https://linuxacademy.com/cp/modules/view/id/621) course.
* Comfortability reading the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) to find new functions and methods to use to accomplish your goal.
Successfully complete this lab by achieving the following learning objectives:
- Install Pandas
Before we can use pandas to analyze our employee data we’ll need to make sure that it is installed.
- Create a data frame from the `employee_info` list and determine the number of unique job titles.
pay_analysis.pyscript already sets up a list of dictionary objects with our employee information named
employee_info. We need to use this list to create a
pandas.DataFramefor us to be able to perform our analysis. Get the column names from an item in the list and create a new data frame called
Note: We also need to import
- Use a pivot table to determine the salary difference between people doing the same job depending on their gender.
To determine the average pay difference between men and women with the same job title, we’ll first need to determine which jobs have both men and women doing them. We can do this by grouping our data frame information by the job title first and then creating a
pandas.Seriesthat has the job titles as the index values and the number of unique genders as the values.
job_title_unique_gender_count, we’ll take all of the items that have a value of
2, drop the items that will be adjusted to a
NaN(because they aren’t a
2) and then get a
pandas.Indexobject with just the names by accessing the
pandas.Series. This seems a little complicated, but it’s a good way for us to get just a list of the job titles where the number of genders in that role is 2.
After that, we create a new data frame from our
employee_framedata that only contains employees in a job that includes both genders.
The last two things we do are creating our pivot table based on the job title where we have a column for each gender and adding a column to the table to hold the pay difference between female and male employees.
- Use a pivot table to determine the salary differences of people doing the same job depending on their age.
Our final pivot table will work a lot like the
gender_difference_table, except we need to categorize each of our employees into an age range before we start making our comparisons. To do this, we’ll use the
pandas.cutfunction and add the resulting series as a new column on our
employee_frameobject, calling it
age_range. There are 2 possible categories for each employee to fall under "18 – 40" and "41 – 80". We’re going to also add a third, unused category to the list of categories for this column so that we can add a column to our pivot table later.
The data type of the
category, and it is important that when we’re adding a category to it that we use the
inplace=Trueoption. If we don’t do this, then we’ll get a new category returned to us and won’t modify the
Next, we need to determine which job titles have at least one employee from each of the age ranges. This code is virtually identical to what we did to determine while job titles had employees from each of the gender options.
Now that we have a new data frame, we’re ready to create our pivot table and populate a new ‘difference’ column. This will look very similar to our other pivot table, except we want to create columns for each of the values in our
age_rangecolumn. Here’s the rest of the code that we need to write to create our table and calculate the differences in pay: