Xiaoyu Zhu




Data Enthusiast | Investment Specialist | Curious Human Being

Shanghai ✈ Pittsburgh ✈ Boston ✈ Shanghai



Salifort Motors: Predict and Improve Employee Retention

A capstone project for Google Advanced Data Analytics Certificate program

Table of Contents

Executive Summary

Overview

About the company

Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. Its global workforce of over 100,000 employees research, design, construct, validate, and distribute electric, solar, algae, and hydrogen-based vehicles. Salifort’s end-to-end vertical integration model has made it a global leader at the intersection of alternative energy and automobiles.

Business case

Currently, there is a high rate of turnover among Salifort employees. (Note: In this context, turnover data includes both employees who choose to quit their job and employees who are let go). Salifort’s senior leadership team is concerned about how many employees are leaving the company. Salifort strives to create a corporate culture that supports employee success and professional development. Further, the high turnover rate is costly in the financial sense. Salifort makes a big investment in recruiting, training, and upskilling its employees.

If Salifort could predict whether an employee will leave the company, and discover the reasons behind their departure, they could better understand the problem and develop a solution.

As a first step, the leadership team asks Human Resources to survey a sample of employees to learn more about what might be driving turnover. As a data specialist working for Salifort Motors, I have received the results of a recent employee survey. The senior leadership team has asked me to analyze the data and come up with ideas for how to increase employee retention. To help with this, I am asked to design a model that predicts whether an employee will leave the company based on their department, number of projects, average monthly hours, and any other data points that may be helpful. A good model will help the company increase retention and job satisfaction for current employees, and save the money and time spent training new employees.

Pre-Analysis Reflection

The primary stakeholder of the project is the leadership team of Salifort. The HR department is also an important stakeholder. On one hand, they designed and implemented the survey, and will be an important resource for us to gain business insights. On the other hand, the HR department is also a critical player in execution if any findings from our analyses result in business actions.

The goal of this project is to predict whether an employee will leave the company, and to discover the reasons behind their departure. At every step, we should be wary of any ethical concerns that may arise. It is important that our recommendations do not result in unfair treatment of any group of employees.

Exploratory Data Analysis

Data dictionary

This project uses a dataset called HR_capstone_dataset.csv. It represents 10 columns of self-reported information from employees of a multinational vehicle manufacturing corporation. The dataset can be found here on Kaggle.

The dataset contains: 14,999 rows – each row is a different employee’s self-reported information; and 10 columns.

Column name Type Description
satisfaction_level float64 The employee’s self-reported satisfaction level [0–1]
last_evaluation float64 Score of employee’s last performance review [0–1]
number_project int64 Number of projects employee contributes to
average_monthly_hours int64 Average number of hours employee worked per month
time_spend_company int64 How long the employee has been with the company (years)
work_accident int64 Whether or not the employee experienced an accident while at work
left int64 Whether or not the employee left the company
promotion_last_5years int64 Whether or not the employee was promoted in the last 5 years
department str The employee’s department
salary str The employee’s salary (low, medium, or high)

Inspecting the data

There are no missing values in the dataset, but there are quite a few duplicated rows. Because several variables (such as satisfaction level and last evaluation) are continuous and can take any value between 0 and 1, it is unlikely that two or more individuals would share the same values across all variables. We therefore removed these duplicates.
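The missing-value and duplicate checks described above can be sketched with pandas. This is only an illustration: the small DataFrame below is made-up data, not rows from HR_capstone_dataset.csv, and only the column names follow the data dictionary.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the real dataset).
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
    "last_evaluation": [0.53, 0.86, 0.53, 0.88],
    "number_project": [2, 5, 2, 7],
    "left": [1, 1, 1, 1],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Count rows that exactly repeat an earlier row across all columns.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
df_clean = df.drop_duplicates(keep="first").reset_index(drop=True)

print(int(missing_per_column.sum()), n_duplicates, len(df_clean))
```

Keeping the first occurrence is an assumption; the write-up only states that duplicates were removed.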

image

Outliers: We found potential outliers in time_spend_company, which is an employee's tenure with Salifort. Close to 95% of the surveyed employees have been with the company for under 6 years. The other numeric variables seem to be symmetrically distributed.
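One common way to flag such outliers is the 1.5 × IQR rule. The write-up does not say which rule was used, so treat the sketch below as an assumption. The tenure values are made up for illustration.

```python
import pandas as pd

# Made-up tenure values in years, for illustration only.
tenure = pd.Series([2, 3, 3, 3, 4, 4, 5, 6, 8, 10], name="time_spend_company")

# Interquartile range of the tenure values.
q1 = tenure.quantile(0.25)
q3 = tenure.quantile(0.75)
iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (tenure < lower) | (tenure > upper)

print(tenure[is_outlier].tolist())
```

The boolean mask `is_outlier` can then be used either to inspect the flagged rows or to filter them out before fitting a model that is sensitive to outliers.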

Class imbalance: All the categorical variables show class imbalance. When constructing a model, we need to make sure the minority class is represented in the train and test datasets.
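A stratified split is one way to keep the minority class represented in both the train and test sets. The sketch below uses scikit-learn's `train_test_split` on a synthetic target. Stratifying on the outcome, the 80/20 class ratio, and the 25% test size are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 80 stayed (0) and 20 left (1). Illustration only.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

With stratification, the share of employees who left is the same in both splits, so recall on the test set is computed on a realistic number of positive cases.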

Key findings

After inspecting the pairwise relationships between whether an employee left the company and all the other potential explanatory variables, we have the following findings:

[Figures: pairwise relationships between employee turnover and the explanatory variables]

All the explanatory variables seem to be good candidates for predicting whether or not an employee will leave the company. Contrary to what we suspected, we do not see strong evidence of pairwise correlation that could lead to multicollinearity. Next, we will use all variables as predictors of whether an employee will leave the company, and construct a model.

[Figure: correlation between variables]
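A pairwise correlation check of this kind can be sketched with pandas. The data below is made up and deliberately contains one strongly correlated pair, so it does not reflect the finding on the real dataset. The 0.8 threshold is an assumption as well.

```python
import pandas as pd

# Toy values for illustration only (not taken from the real dataset).
df = pd.DataFrame({
    "number_project": [2, 3, 4, 5, 6, 7],
    "average_monthly_hours": [150, 160, 200, 210, 250, 280],
    "satisfaction_level": [0.4, 0.7, 0.8, 0.6, 0.3, 0.1],
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Collect pairs whose absolute correlation exceeds the assumed threshold.
threshold = 0.8
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

print(high_pairs)
```

An empty `high_pairs` list on the real data would be consistent with the conclusion that multicollinearity is not a major concern.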

The outcome we are trying to predict is binary, so we can use either a logistic regression or a tree-based model. Let’s construct both kinds of model and evaluate them.

Model construction and selection

Before building any model, we establish recall as the main criterion for model selection. High recall means that the model is good at spotting employees who are likely to leave the company, so the leadership team and HR department can take action to retain valuable employees. Of course, we do not want to achieve high recall at the cost of precision, since acting on many false alarms would be costly for the company. Why does this matter? Because simply labelling everyone as “will leave” boosts recall to 100% while destroying precision.
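The trade-off is easy to verify by hand. The sketch below uses a made-up group of 10 employees, 2 of whom left, and a hypothetical helper `precision_recall` that scores a "model" labelling everyone as leaving.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'left'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: 2 of 10 employees actually left.
y_true = [1] * 2 + [0] * 8

# A naive "model" that predicts that everyone will leave.
y_pred_all_leave = [1] * 10

precision, recall = precision_recall(y_true, y_pred_all_leave)
print(precision, recall)  # 0.2 1.0
```

Every leaver is caught, so recall is 100%, but 8 of the 10 flags are false alarms, so precision drops to the base rate of 20%.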

Logistic regression model

We start by building a logistic regression model. The beauty of this model is that it is simple and easy to interpret - we can see which explanatory variables are statistically significant in predicting employee turnover, and we can quantify the relationship. One caveat is that logistic regression is sensitive to outliers, which we did observe in employee tenure. Therefore, outliers were removed before fitting the logistic regression model.

As it turns out, although we found meaningful explanatory variables in the EDA, such as hours worked, the model still does not do a good job at making predictions - the recall rate is merely 18%. Of all the employees who left, the model was only able to identify 18%. That would not be very helpful.
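To illustrate what quantifying the relationship can look like, the sketch below fits scikit-learn's `LogisticRegression` on synthetic data and converts the coefficient into an odds ratio. The data-generating process, the single feature, and the standardization step are assumptions for illustration, not the model built in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the chance of leaving rises with monthly hours. Illustration only.
rng = np.random.default_rng(0)
hours = rng.uniform(100, 300, size=500)
p_leave = 1 / (1 + np.exp(-(hours - 200) / 25))
left = (rng.random(500) < p_leave).astype(int)

# Standardize the feature so the coefficient is per one standard deviation.
hours_std = (hours - hours.mean()) / hours.std()
X = hours_std.reshape(-1, 1)

model = LogisticRegression()
model.fit(X, left)

# exp(coefficient) is the multiplicative change in the odds of leaving
# for a one-standard-deviation increase in hours.
odds_ratio = float(np.exp(model.coef_[0][0]))
print(f"Odds ratio per standard deviation of hours: {odds_ratio:.2f}")
```

Note that scikit-learn does not report p-values, so checking statistical significance would need a different tool, such as statsmodels.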

Random forest with cross validation

Next, we try a random forest model and tune its hyperparameters with a cross-validated grid search. For the random forest model, as well as the XGBoost model in the next step, we revert to the dataset that includes outliers so that more information is available. Tree-based models are not highly sensitive to outliers.

As a result, recall improved significantly, from 18% to 82%! Precision also improved to 97.5%. We could stop here and use this model, but I still want to try gradient boosting.
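A cross-validated grid search for a random forest can be sketched with scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid, the number of folds, and the choice of recall as the scoring metric are assumptions for illustration. The actual grid used in this project is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the employee survey. Illustration only.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small, hypothetical hyperparameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Pick the combination with the best mean cross-validated recall.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="recall",
    cv=4,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on all the data with the winning hyperparameters, which can then be evaluated on a held-out test set.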

Extreme gradient boosting (XGB)

While the random forest model trains its base learners independently of one another, the XGB model trains them sequentially, with each new tree correcting the errors of the previous ones. The XGB model improved recall by another 9 percentage points to 91.5%, while maintaining a high precision score.

Model selection

Here is a comparison of how the models perform on test data.

Model Accuracy Precision Recall F1
Logistic regression 0.833556 0.497207 0.178715 0.262925
Random forest CV 0.967419 0.975391 0.824503 0.893593
XGBoost 0.980874 0.968250 0.914935 0.940767

image

Our champion model is the tuned XGBoost model. While this tree-based model is not as easy to interpret as the logistic regression model, we can still find out which features may be most relevant in predicting employee turnover from this feature importance chart.

[Figure: feature importance chart]

Conclusion

With an XGBoost model, we can identify employees who will likely leave the company with strong accuracy and confidence, so management and HR department can take action when the model identifies someone who will likely leave. There are a few areas the management team should pay special attention to:
