Xiaoyu Zhu


Data Enthusiast | Investment Specialist | Curious Human Being

Shanghai ✈ Pittsburgh ✈ Boston ✈ Shanghai

Bellabeat Case Study

A capstone project for the Google Data Analytics Professional Certificate program

Background

In this case study, I will perform data analysis for Bellabeat, a high-tech manufacturer of health-focused products for women. I will analyze smart device data to gain insight into how consumers are using their smart devices. My analysis will help guide future marketing strategies for my team. Along the way, I will perform numerous real-world tasks of a junior data analyst by following the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

A detailed R-Markdown notebook can be found here.

To facilitate the planning stage of this project, I put together this mind map based on the guides in the packet and some action plans I jotted down as I read the requirements.

[Image: mind map of the project plan]

This document follows the same structure as the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Ask

Business Ask: The purpose of this study is to identify key characteristics of people who use wearable fitness trackers, and then use these insights to guide Bellabeat’s marketing strategy.

Stakeholders: The findings and recommendations will be shared with Bellabeat’s executive team, including Urška Sršen, cofounder and Chief Creative Officer, and Sando Mur, mathematician and cofounder. I will also collaborate and share results with the marketing analytics team, a group of data experts and analysts like myself.

Prepare

Urška pointed me to a dataset of Fitbit tracker data. Upon inspection, this dataset contains usage data for 30 individual users. That number is on the low side: it barely meets the common rule of thumb for a minimum sample size, so the sample may not be fully representative of the smart device market. This link provides a brief explanation of why a sample size greater than 30 is desirable.

I searched GitHub, Kaggle, Tableau, and Google for more data. The only other dataset I found records the activity of a single Redmi Fuel Band user over the past three years. While the data appear to be well maintained, and it is commendable that this person is willing to share them, I decided not to include them because they add very little to the Fitbit dataset. I was not surprised that health and activity data are scarce, since people are very vigilant about their privacy. Bellabeat should keep this in mind when it collects and uses user data in the future.

Moving on, let’s take a closer look at the Fitbit dataset. It is available on Kaggle, or alternatively here on Zenodo. The data were generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.

Here is my evaluation regarding how the data fare as good data, on a scale of 1-10 (10 being the best):

The data files cover two periods: one from Mar 12, 2016 to Apr 11, 2016, and the other from Apr 12, 2016 to May 12, 2016. The files do not necessarily exist for both periods. Where the same table exists for both (columns 3 and 4 of the following table are both “Yes”), I can simply stack the two batches.

| # | File Name | In Mar-Apr Batch | In Apr-May Batch | Number of Fields (Col) | Description of Data | Issues & Actions |
|---|---|---|---|---|---|---|
| 1 | dailyActivity_merged.csv | Yes | Yes | 15 | Id#️⃣, ActivityDate📅, TotalSteps#️⃣, TotalDistance#️⃣, TrackerDistance#️⃣, LoggedActivitiesDistance#️⃣, VeryActiveDistance#️⃣, ModeratelyActiveDistance#️⃣, LightActiveDistance#️⃣, SedentaryActiveDistance#️⃣, VeryActiveMinutes#️⃣, FairlyActiveMinutes#️⃣, LightlyActiveMinutes#️⃣, SedentaryMinutes#️⃣, Calories#️⃣ | (1) Some numerical data are stored as text strings and need type conversion. (2) TotalDistance may not equal the sum of its subcategories; needs closer inspection. (3) This table aggregates steps, distances, and calories from the tables below but does not capture all information, e.g. entries for some users or days are left out. |
| 2 | dailyCalories_merged.csv | No | Yes | 3 | Data already captured in file 1 | |
| 3 | dailyIntensities_merged.csv | No | Yes | 10 | Data already captured in file 1 | |
| 4 | dailySteps_merged.csv | No | Yes | 3 | Data already captured in file 1 | |
| 5 | heartrate_seconds_merged.csv | Yes | Yes | 3 | Id#️⃣, Time🕐, Value#️⃣ | Second-by-second heart rate for 30 users over two months; too many rows to process in a spreadsheet application. |
| 6 | hourlyCalories_merged.csv | Yes | Yes | 3 | Id#️⃣, ActivityHour🕐, Value#️⃣ | |
| 7 | hourlyIntensities_merged.csv | Yes | Yes | 4 | Id#️⃣, ActivityHour🕐, TotalIntensity#️⃣, AverageIntensity#️⃣ (hourly value / 60) | |
| 8 | hourlySteps_merged.csv | Yes | Yes | 3 | Id#️⃣, ActivityHour🕐, StepTotal#️⃣ | |
| 9 | minuteCaloriesNarrow_merged.csv | Yes | Yes | 3 | Similar to file 6, but broken down into minutes | |
| 10 | minuteCaloriesWide_merged.csv | No | Yes | 62 | Same data as file 9, but in wide format with each minute of the hour as a column | |
| 11 | minuteIntensitiesNarrow_merged.csv | Yes | Yes | 3 | Similar to file 7, but broken down into minutes | |
| 12 | minuteIntensitiesWide_merged.csv | No | Yes | 62 | Same data as file 11, but in wide format with each minute of the hour as a column | |
| 13 | minuteMETsNarrow_merged.csv | Yes | Yes | 3 | Id#️⃣, ActivityHour🕐, METs#️⃣ (metabolic equivalents, used to estimate activity intensity) | |
| 14 | minuteSleep_merged.csv | Yes | Yes | 3 | Id#️⃣, Date🕐, value#️⃣ (in fact category labels: 1=light, 2=deep, 3=REM) | |
| 15 | minuteStepsNarrow_merged.csv | Yes | Yes | 3 | Similar to file 8, but broken down into minutes | |
| 16 | minuteStepsWide_merged.csv | No | Yes | 62 | Same data as file 15, but in wide format with each minute of the hour as a column | |
| 17 | weightLogInfo_merged.csv | Yes | Yes | 8 | Id#️⃣, Date🕐, WeightKg#️⃣, WeightPounds#️⃣, Fat#️⃣, BMI#️⃣, IsManualReport🔤, LogId🔤 | |
| 18 | sleepDay_merged.csv | No | Yes | 5 | Id#️⃣, SleepDay🕐, TotalSleepRecords#️⃣, TotalMinutesAsleep#️⃣, TotalTimeInBed#️⃣ | Not available for Mar-Apr, but can be calculated from file 14 |
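Stacking two batches of the same table can be sketched as follows. The actual processing for this project was done in R; this is only an illustrative Python sketch using the standard library. The function name `stack_batches` and the sample rows are hypothetical, and dropping exact duplicate rows is an assumption about how overlapping records at the batch boundary should be handled.

```python
import csv
import io


def stack_batches(*csv_texts):
    """Stack CSV batches that share the same header, keeping duplicate rows once."""
    header, rows, seen = None, [], set()
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        batch_header = next(reader)
        if header is None:
            header = batch_header
        elif batch_header != header:
            raise ValueError("batches have different columns; cannot stack")
        for row in reader:
            key = tuple(row)
            if key not in seen:  # a row present in both batches is kept only once
                seen.add(key)
                rows.append(row)
    return header, rows


# Made-up rows in the shape of hourlySteps_merged.csv
mar_apr = "Id,ActivityHour,StepTotal\n1,4/11/2016 11:00:00 PM,120\n"
apr_may = (
    "Id,ActivityHour,StepTotal\n"
    "1,4/11/2016 11:00:00 PM,120\n"
    "1,4/12/2016 12:00:00 AM,0\n"
)
header, rows = stack_batches(mar_apr, apr_may)
print(len(rows))  # 2: the overlapping row is not duplicated
```

Refusing to stack when the headers differ keeps a mismatched file from silently shifting values into the wrong columns.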

Building on my purpose statement, and taking into account what data I have at hand, I decide to look into these research questions:

Next, I will use the following data files to address these questions:

  1. dailyActivity_merged.csv: The format and structure of this table are good, but the provided file has missing entries and wrong data types. So I will keep this format, but recalculate the values from the other relevant files.
  2. sleepDay_merged.csv: Since the Mar-Apr data do not include this file, I will use minuteSleep_merged.csv to calculate all the fields for Mar-Apr and append them to the Apr-May data.
  3. heartrate_seconds_merged.csv and weightLogInfo_merged.csv are both great for answering my question regarding health, but there are two issues:
    • Weight and BMI are self-reported, and there are a lot of missing data.
    • It is difficult to define a normal and a subnormal heart rate or BMI level, without knowing the person’s age and other health conditions.
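As a rough illustration of item 2, minute-level sleep records can be rolled up to one row per user per day. The sketch below is in Python with the standard library (the project itself uses R). The column names `Date` and `value` follow the table above, the timestamp layout in `TIME_FORMAT` is an assumption, and assigning each minute to its own calendar date is a simplification. How the per-value counts map onto TotalMinutesAsleep and TotalTimeInBed depends on how the value codes are interpreted, which this sketch deliberately leaves open.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed timestamp layout


def daily_sleep_minutes(csv_text):
    """Count logged sleep minutes per (user, day), broken down by value code."""
    per_day = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["Date"], TIME_FORMAT).date().isoformat()
        per_day[(row["Id"], day)][row["value"]] += 1  # one row is one minute
    return {key: dict(counts) for key, counts in per_day.items()}


# Made-up rows in the shape of minuteSleep_merged.csv
sample_sleep = (
    "Id,Date,value\n"
    "1,4/12/2016 2:47:30 AM,1\n"
    "1,4/12/2016 2:48:30 AM,1\n"
    "1,4/12/2016 2:49:30 AM,2\n"
    "1,4/13/2016 1:00:00 AM,1\n"
)
minutes = daily_sleep_minutes(sample_sleep)
total_logged = sum(minutes[("1", "2016-04-12")].values())  # all logged minutes that day
```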

Process

The dailyActivity_merged.csv file may be incomplete. For example, the first user, 1503960366, only has data from 2016-03-25 in this file, but the hourly and minute-level files have data for this user starting from 2016-03-12. I will therefore start from the hourly files and work my way up to the same format as dailyActivity_merged.csv.
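Rebuilding daily totals from the hourly files could look roughly like this. It is a Python sketch with the standard library only, not the R code used in the notebook. The function name `daily_steps`, the timestamp layout, and the sample step counts are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed timestamp layout


def daily_steps(csv_text):
    """Sum hourly StepTotal values into one total per (user, day)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["ActivityHour"], TIME_FORMAT).date().isoformat()
        totals[(row["Id"], day)] += int(row["StepTotal"])  # text to integer
    return dict(totals)


# Made-up step counts in the shape of hourlySteps_merged.csv
sample_hourly = (
    "Id,ActivityHour,StepTotal\n"
    "1503960366,3/12/2016 8:00:00 AM,350\n"
    "1503960366,3/12/2016 9:00:00 AM,650\n"
    "1503960366,3/13/2016 8:00:00 AM,400\n"
)
daily = daily_steps(sample_hourly)
```

The same grouping applies to hourly calories and intensities, with only the value column changing.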

During discovery, I found a few files in long format that are too large to process in a spreadsheet application. I will therefore use R throughout the data processing and visualization phases of this project.

Documentation of any cleaning or manipulation of data:

Please see detailed steps in Process phase in this RMD file.

Analyze

Please see detailed steps in Analyze phase in this RMD file.

Share

Key observation 1: A typical user tends not to be very active. On average, users take under 6,000 steps a day, while the CDC recommends at least 10,000 a day. This goal was hit only 31% of the time.
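The two figures in this observation are an average and a hit rate over user-days. A minimal Python sketch of that arithmetic follows. The function name `step_summary` is hypothetical and the sample values are made up, so they do not reproduce the numbers above.

```python
from statistics import mean


def step_summary(daily_totals, goal=10_000):
    """Return (average daily steps, share of days that reached the goal)."""
    average = mean(daily_totals)
    hit_rate = sum(steps >= goal for steps in daily_totals) / len(daily_totals)
    return average, hit_rate


sample_days = [4000, 12000, 5500, 10000]  # made-up daily step totals
average, hit_rate = step_summary(sample_days)
print(average, hit_rate)  # average of 7875 steps, goal reached on half the days
```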

Key observation 2: A typical user likely works a regular five-day week in an office. On any given day, the number of steps peaks around lunchtime and after 5 PM. Over a week, it is not surprising that most steps were taken on Saturdays. Wednesdays are not bad either, but Sundays and Mondays are when users walked the least. Somewhat counterintuitively, Friday also appears to be an inactive day in terms of steps taken.
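Both patterns come from grouping the hourly step data, once by hour of day and once by day of week. A Python sketch of the grouping is below. The function name `steps_profile`, the timestamp layout, and the sample rows are assumptions. The weekday totals are raw sums, so a real comparison would also need to account for how many of each weekday the data cover.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed timestamp layout
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def steps_profile(csv_text):
    """Return (mean steps per hour of day, total steps per weekday)."""
    by_hour = defaultdict(list)
    by_weekday = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.strptime(row["ActivityHour"], TIME_FORMAT)
        steps = int(row["StepTotal"])
        by_hour[ts.hour].append(steps)
        by_weekday[WEEKDAYS[ts.weekday()]] += steps
    mean_by_hour = {hour: sum(v) / len(v) for hour, v in sorted(by_hour.items())}
    return mean_by_hour, dict(by_weekday)


# Made-up rows in the shape of hourlySteps_merged.csv
sample_hourly = (
    "Id,ActivityHour,StepTotal\n"
    "1,4/12/2016 12:00:00 PM,900\n"
    "1,4/12/2016 6:00:00 PM,1100\n"
    "1,4/16/2016 12:00:00 PM,1500\n"
)
mean_by_hour, by_weekday = steps_profile(sample_hourly)
```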

Key observation 3: More often than not, users do not wear the fitness tracker to bed. We have data for 30 users wearing Fitbit trackers over two months (that is 1800+ user-days), yet the tracker logged sleep data on only 49% of those days. Sleep tracking may not be a top concern for these users, nor a key feature they look for when shopping for a fitness tracker.
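The coverage figure is the share of tracked user-days that also have a sleep record. A minimal Python sketch follows. The function name `sleep_coverage` and the sample pairs are hypothetical, and using tracked user-days as the denominator is an assumption about how the percentage is defined.

```python
def sleep_coverage(tracked_days, sleep_days):
    """Share of tracked (user, day) pairs that also have sleep data."""
    tracked = set(tracked_days)
    with_sleep = tracked & set(sleep_days)  # ignore sleep logs on untracked days
    return len(with_sleep) / len(tracked)


# Made-up (user, day) pairs
tracked = [("1", "2016-04-12"), ("1", "2016-04-13"),
           ("2", "2016-04-12"), ("2", "2016-04-13")]
slept = [("1", "2016-04-12"), ("2", "2016-04-13"), ("3", "2016-04-12")]
coverage = sleep_coverage(tracked, slept)
print(coverage)  # 0.5
```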

Key observation 4: Getting weight data is hard given everyone’s privacy concerns, but of the 13 users who did log weight data, a majority (9 out of 13) were overweight or obese. Combined with the fact that our sample users were not very active, this suggests that a typical user interested in a fitness tracker may be hoping to track their activity and motivate themselves to get fit.
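Classifying users as overweight or obese from the BMI field can be sketched as below, assuming the standard WHO adult cut-offs (18.5, 25, and 30). The source does not state which thresholds were used, the function name `bmi_category` is hypothetical, and the sample BMI values are made up.

```python
def bmi_category(bmi):
    """Classify a BMI value using the standard WHO adult cut-offs (assumed here)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


sample_bmis = [22.6, 27.4, 47.5, 25.0]  # made-up values, one per user
categories = [bmi_category(b) for b in sample_bmis]
overweight_or_obese = sum(c in ("overweight", "obese") for c in categories)
print(overweight_or_obese, "of", len(sample_bmis))  # 3 of 4
```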

I understand that a lot more can be done with the dataset, such as charts showing trends and relationships between different variables. But these are the four key observations I identified that can get us started on a profile of potential customers for our fitness products. There is certainly a lot more I want to uncover, to name a few:

Act

Based on my analysis, I would recommend taking these actions with Bellabeat Leaf and our smartphone application:

Marketing

Data Collecting

Product Features
