1 Introduction
- 1.1 Business Objective
2 Data Source
3 Data Cleaning
4 Data Validation
5 Feature Engineering
- 5.1 User-Level Aggregation
6 Exploratory Data Analysis (EDA)
7 User Segmentation
8 Business Recommendations
- 8.1 Personalized Engagement Strategies
- 8.2 Strategic Implications
9 Limitations
10 Conclusion

1 Introduction

This project analyzes user activity and sleep data recorded by Fitbit fitness trackers (see Section 2) to inform marketing and engagement strategy for Bellabeat smart devices.

1.1 Business Objective

The objective of this analysis is to identify behavioral patterns in user activity and sleep data in order to support data-driven marketing and engagement strategies.

2 Data Source

The analysis uses publicly available Fitbit datasets provided as part of the Google Data Analytics Case Study.

Datasets used:
- Daily activity
- Daily steps
- Daily calories
- Daily intensities
- Sleep data

Weight-related data was excluded from the core analysis due to limited coverage and missing values.

Data source: Kaggle: BellaBeat Dataset (https://www.kaggle.com/datasets/arashnic/bellabeat-dataset)

3 Data Cleaning

Data cleaning focused on datasets directly used in the analysis.
Duplicate checks and date-time standardization were applied only to activity and sleep datasets, as these directly impact the calculated metrics and user-level aggregations.

Sleep data required date-time formatting due to potential multiple records per day and overnight sleep sessions, while daily activity data was analyzed at the day level and did not require time granularity.

Weight-related data was excluded from the core analysis due to limited observations and high missingness.

# Load the tidyverse (readr, dplyr, ggplot2) used throughout the analysis
library(tidyverse)

# Reading data files
Daily_Activity <- read_csv("../Data/dailyActivity_merged.csv")

## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sleep_Day <- read_csv("../Data/sleepDay_merged.csv")

## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Remove duplicates from sleep data
Sleep_Day_clean <- Sleep_Day %>%
  distinct(Id, SleepDay, .keep_all = TRUE)

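# Convert date-time format for sleep data
# (tz = "UTC" keeps the calendar date stable when it is later converted with as.Date())
Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(SleepDay = as.POSIXct(SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"))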

# Convert activity date to Date format
Daily_Activity_clean <- Daily_Activity %>%
  mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"))
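The deduplication and date-parsing steps above can be audited before they are relied on. The sketch below is illustrative only: `sleep_toy` is a small made-up table in the layout of the sleep file, not the real data, and `tz = "UTC"` is an assumption that the timestamps carry no time zone of their own. It counts repeated `Id`/`SleepDay` keys and checks that no timestamp fails to parse.

```r
library(dplyr)

# Made-up rows in the same layout as sleepDay_merged.csv (illustration only)
sleep_toy <- tibble(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(400, 400, 380, 420)
)

# Keys that appear more than once, i.e. what distinct() would collapse
dup_keys <- sleep_toy %>%
  count(Id, SleepDay, name = "n_rows") %>%
  filter(n_rows > 1)

# Deduplicate, then parse; strings that do not match the format become NA
sleep_dedup <- sleep_toy %>%
  distinct(Id, SleepDay, .keep_all = TRUE) %>%
  mutate(SleepDay = as.POSIXct(SleepDay,
                               format = "%m/%d/%Y %I:%M:%S %p",
                               tz = "UTC"))

n_removed  <- nrow(sleep_toy) - nrow(sleep_dedup)
n_unparsed <- sum(is.na(sleep_dedup$SleepDay))

print(dup_keys)
cat("Rows removed:", n_removed, "| Unparsed dates:", n_unparsed, "\n")
```

Reporting the number of removed rows alongside the cleaned table makes it easier to confirm later row counts.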

4 Data Validation

Data validation included logical consistency checks. Activity data was filtered to remove negative step and active-minute values and to ensure that sedentary minutes did not exceed the 1,440 minutes in a day.

Sleep records were validated to ensure that minutes asleep did not exceed time in bed. Sleep efficiency was calculated as an additional behavioral metric to better understand sleep quality patterns.

# Daily_Activity validation
Daily_Activity_clean <- Daily_Activity_clean %>%
  filter(
    TotalSteps >= 0,
    SedentaryMinutes <= 1440,
    LightlyActiveMinutes >= 0,
    FairlyActiveMinutes >= 0,
    VeryActiveMinutes >= 0
  )
# Sleep_Day validation
Sleep_Day_clean <- Sleep_Day_clean %>%
  filter(
    TotalMinutesAsleep > 0,
    between(TotalTimeInBed, 1, 1440),
    TotalMinutesAsleep <= TotalTimeInBed
  )
# Sleep efficiency: share of time in bed spent asleep (between 0 and 1)
Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(
    SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed
  )
summary(Sleep_Day_clean$SleepEfficiency)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4984  0.9118  0.9426  0.9165  0.9606  1.0000
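The validation rules can also be audited before any rows are dropped, so that the effect of each rule is visible. The sketch below is a minimal illustration on made-up data (`activity_toy` is not the real file). It also computes the sum of the four minute columns (`total_minutes`, a hypothetical helper column) so that days whose combined minutes exceed 1,440 can be counted.

```r
library(dplyr)

# Made-up daily activity rows (illustration only)
activity_toy <- tibble(
  Id = c(1, 1, 2, 2),
  TotalSteps = c(8000, 0, 12000, 500),
  SedentaryMinutes = c(700, 1440, 600, 1400),
  LightlyActiveMinutes = c(200, 0, 250, 60),
  FairlyActiveMinutes = c(20, 0, 30, 0),
  VeryActiveMinutes = c(30, 0, 60, 0)
)

# Count how many rows each rule would flag, without filtering yet
audit <- activity_toy %>%
  mutate(total_minutes = SedentaryMinutes + LightlyActiveMinutes +
                         FairlyActiveMinutes + VeryActiveMinutes) %>%
  summarise(
    n_rows       = n(),
    n_negative   = sum(TotalSteps < 0 | SedentaryMinutes < 0 |
                       LightlyActiveMinutes < 0 | FairlyActiveMinutes < 0 |
                       VeryActiveMinutes < 0),
    n_over_day   = sum(total_minutes > 1440),
    n_zero_steps = sum(TotalSteps == 0)  # possible non-wear days
  )

print(audit)
```

Days with zero steps are counted rather than removed here, since whether they reflect non-wear is a judgment call.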

5 Feature Engineering

After cleaning and validating the datasets, daily activity and sleep data were merged using user ID and date variables.

An inner_join() was applied to retain only records where both activity and sleep data were available. This ensures that the analysis focuses on complete behavioral observations.

Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(ActivityDate = as.Date(SleepDay))

Activity_Sleep <- Daily_Activity_clean %>%
  inner_join(Sleep_Day_clean,
             by = c("Id", "ActivityDate"))

dim(Activity_Sleep)

## [1] 410  20

The merge was verified to ensure correct alignment of user IDs and dates.
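One way such a verification could be done is sketched below. It is not the check actually used in this analysis, and both tables are made-up stand-ins. `anti_join()` lists sleep records with no matching activity day, and a key count confirms that the join produced at most one row per user and date.

```r
library(dplyr)

# Made-up stand-ins for the cleaned activity and sleep tables
activity_toy <- tibble(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalSteps = c(9000, 7000, 4000)
)
sleep_toy <- tibble(
  Id = c(1, 2, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(410, 380, 450)
)

joined <- inner_join(activity_toy, sleep_toy, by = c("Id", "ActivityDate"))

# Sleep records that found no activity record on the same user and date
unmatched_sleep <- anti_join(sleep_toy, activity_toy,
                             by = c("Id", "ActivityDate"))

# Any user/date key appearing more than once would signal a duplicated join
dup_keys <- joined %>%
  count(Id, ActivityDate, name = "n_rows") %>%
  filter(n_rows > 1)

cat("Joined rows:", nrow(joined),
    "| Unmatched sleep rows:", nrow(unmatched_sleep),
    "| Duplicated keys:", nrow(dup_keys), "\n")
```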

5.1 User-Level Aggregation

To identify consistent behavior patterns rather than daily fluctuations, data was aggregated at the user level.

Average values were calculated for:

- Daily steps
- Minutes asleep
- Sleep efficiency

User_summary <- Activity_Sleep %>%
  group_by(Id) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    avg_sleep = mean(TotalMinutesAsleep),
    avg_sleep_efficiency = mean(SleepEfficiency)
  )

head(User_summary)

## # A tibble: 6 × 4
##           Id avg_steps avg_sleep avg_sleep_efficiency
##        <dbl>     <dbl>     <dbl>                <dbl>
## 1 1503960366    12406.      360.                0.936
## 2 1644430081     7968.      294                 0.882
## 3 1844505072     3477       652                 0.678
## 4 1927972279     1490       417                 0.947
## 5 2026352035     5619.      506.                0.941
## 6 2320127002     5079        61                 0.884

Aggregating the data at the user level enables the identification of long-term behavioral patterns. This transformation prepares the dataset for further analysis, including correlation assessment and user segmentation.
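User-level averages are only as reliable as the number of days behind them. The sketch below is illustrative only: `daily_toy` is made-up data and `min_days` is a hypothetical threshold. It adds a per-user day count to the aggregation so that users with very few matched days can be flagged before clustering.

```r
library(dplyr)

# Made-up merged activity/sleep rows (illustration only)
daily_toy <- tibble(
  Id = c(1, 1, 1, 2),
  TotalSteps = c(9000, 11000, 10000, 5000),
  TotalMinutesAsleep = c(420, 400, 440, 61)
)

min_days <- 3  # hypothetical cut-off, chosen only for this example

user_toy <- daily_toy %>%
  group_by(Id) %>%
  summarise(
    n_days    = n(),                       # days contributing to each average
    avg_steps = mean(TotalSteps),
    avg_sleep = mean(TotalMinutesAsleep)
  ) %>%
  mutate(sparse = n_days < min_days)       # TRUE when the average rests on few days

print(user_toy)
```

Whether flagged users are excluded or simply annotated is a design choice; the flag itself leaves the averages unchanged.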

6 Exploratory Data Analysis (EDA)

Exploratory Data Analysis was conducted to understand the distribution of key behavioral variables and identify potential relationships between activity and sleep patterns.

6.1 Distribution of Daily Steps

Understanding the distribution of daily steps helps assess overall activity variability among users and detect potential outliers.

steps_plot <- ggplot(Activity_Sleep, aes(x = TotalSteps)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Daily Steps",
    x = "Total Steps",
    y = "Count of Days"
  )

steps_plot

Daily step counts vary widely. Most observations fall within a moderate activity range, while a smaller number of highly active days form the upper tail of the distribution.

6.2 Distribution of Sleep Duration

Examining sleep duration provides insight into overall sleep behavior and potential inconsistencies.

sleep_plot <- ggplot(Activity_Sleep, aes(x = TotalMinutesAsleep)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(
    title = "Distribution of Sleep Duration",
    x = "Minutes Asleep",
    y = "Count of Days"
  )
sleep_plot

Most sleep observations fall within a typical healthy range (approximately 6–8 hours, or 360–480 minutes on the plot's axis). However, some much shorter sleep durations are observed, potentially indicating inconsistent device usage or irregular sleep patterns.

6.3 Relationship Between Activity and Sleep

To explore whether activity levels are associated with sleep duration, a scatter plot with a linear trend line was created.

relational_plot <- ggplot(Activity_Sleep, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(alpha = 0.4, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Daily Steps vs Sleep Duration",
    x = "Total Steps",
    y = "Minutes Asleep"
  )
relational_plot

## `geom_smooth()` using formula = 'y ~ x'

The visualization suggests a weak, slightly negative relationship between activity and sleep duration. Increased physical activity does not correspond to longer sleep duration, indicating that additional behavioral or lifestyle factors may influence sleep outcomes.

6.4 Correlation Analysis

To quantify the relationship between daily steps and sleep duration, Pearson correlation was calculated.

cor(Activity_Sleep$TotalSteps,
    Activity_Sleep$TotalMinutesAsleep,
    use = "complete.obs")

## [1] -0.1903439

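The correlation coefficient of about -0.19 confirms that the linear relationship between daily steps and sleep duration is weak and slightly negative; squared, it implies that steps account for under 4% of the variance in sleep duration. This suggests that segmentation analysis may provide more meaningful insights than direct linear relationships.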
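A point estimate alone does not show how uncertain the correlation is. As a minimal sketch on simulated data (the vectors `steps` and `sleep` below are generated for illustration and are not the study data), base R's `cor.test()` returns the same Pearson estimate together with a confidence interval and p-value, and squaring the estimate gives the share of variance explained.

```r
# Simulated data with a weak negative association (illustration only)
set.seed(42)
steps <- round(rnorm(200, mean = 8000, sd = 3000))
sleep <- 450 - 0.005 * steps + rnorm(200, sd = 80)

# Pearson correlation with a 95% confidence interval and p-value
ct <- cor.test(steps, sleep, method = "pearson")

r         <- unname(ct$estimate)  # same value as cor(steps, sleep)
r_squared <- r^2                  # share of variance explained by a linear fit

cat("r =", round(r, 3),
    "| 95% CI = [", round(ct$conf.int[1], 3), ",", round(ct$conf.int[2], 3), "]",
    "| r^2 =", round(r_squared, 3), "\n")
```

Because daily observations from the same user are not independent, an interval computed this way on the real data would likely be too narrow and should be read as indicative only.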

7 User Segmentation

While correlation analysis showed only a weak linear relationship between activity and sleep, user segmentation may reveal distinct behavioral patterns that are not visible through aggregate statistics.

To better understand user behavior, k-means clustering was applied at the user level.

7.1 Preparing Data for Clustering

Clustering algorithms are sensitive to variable scale. Since average daily steps and sleep minutes are measured in different units, the variables were standardized before applying k-means.

# Select clustering variables
Cluster_data <- User_summary %>%
  select(avg_steps, avg_sleep)

# Scale variables
Cluster_scaled <- scale(Cluster_data)
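To make the standardization step concrete, the sketch below applies `scale()` to a small made-up table (`cluster_toy`, not the real user summary). It then reproduces one column by hand as (value - mean) / standard deviation, showing that each scaled column ends up with mean 0 and standard deviation 1.

```r
# Made-up user-level averages (illustration only)
cluster_toy <- data.frame(
  avg_steps = c(3000, 6000, 9000, 12000),
  avg_sleep = c(300, 420, 480, 360)
)

# scale() centers each column on its mean and divides by its standard deviation
scaled <- scale(cluster_toy)

# The same transformation written out by hand for one column
manual_steps <- (cluster_toy$avg_steps - mean(cluster_toy$avg_steps)) /
  sd(cluster_toy$avg_steps)

print(round(scaled, 3))
print(round(colMeans(scaled), 10))    # column means after scaling
print(round(apply(scaled, 2, sd), 10)) # column standard deviations after scaling
```

Without this step, steps (in the thousands) would dominate the Euclidean distances used by k-means, and sleep minutes would barely affect cluster assignment.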

7.2 Determining the Optimal Number of Clusters

To determine an appropriate number of clusters, the Elbow Method was applied.
This approach evaluates the within-cluster sum of squares (WSS) across different values of k.
As k increases, WSS decreases because clusters become more compact.

# Elbow Method
set.seed(123)  # kmeans() uses random starts; fix the seed so the WSS curve is reproducible
wss <- sapply(1:10, function(k){
  kmeans(Cluster_scaled, centers = k, nstart = 25)$tot.withinss
})

plot(1:10, wss, type = "b",
     pch = 19,
     frame = FALSE,
     xlab = "Number of Clusters (k)",
     ylab = "Within-cluster Sum of Squares (WSS)")

The elbow plot shows a gradual decrease in WSS as the number of clusters increases, without a sharply defined inflection point.

While higher values of k continue to reduce within-cluster variance, selecting too many clusters would reduce interpretability and practical usability, especially given the relatively small dataset size.

Therefore, a 4-cluster solution was selected. This choice balances behavioral differentiation with clarity and business interpretability, enabling meaningful user segmentation without overfitting.
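When the elbow is not visually sharp, tabulating the relative drop in WSS for each added cluster can make the trade-off easier to discuss. The sketch below is a minimal illustration on simulated, standardized data; `toy` and the choice of k = 1 to 6 are assumptions for the example, not the study data.

```r
# Simulated user-level data with three loose groups (illustration only)
set.seed(123)
toy <- scale(cbind(
  avg_steps = c(rnorm(8, 4000, 500), rnorm(8, 9000, 500), rnorm(8, 12000, 500)),
  avg_sleep = c(rnorm(8, 450, 20),   rnorm(8, 400, 20),   rnorm(8, 330, 20))
))

ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Percentage reduction in WSS gained by moving from k - 1 to k clusters
drop_pct <- c(NA, -diff(wss) / wss[-length(wss)] * 100)

elbow_table <- data.frame(
  k        = ks,
  wss      = round(wss, 2),
  drop_pct = round(drop_pct, 1)
)
print(elbow_table)
```

A candidate k is one after which `drop_pct` becomes small; with only a few dozen users, the table supports the choice of k but does not settle it.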

7.3 Applying K-Means Clustering

set.seed(123)

kmeans_result <- kmeans(Cluster_scaled, centers = 4, nstart = 25)


# Add cluster labels to dataset
User_summary <- User_summary %>%
  mutate(cluster = factor(kmeans_result$cluster))

7.4 Visualizing Clusters

Cluster_Plot <- ggplot(User_summary, aes(x = avg_steps, y = avg_sleep, color = cluster)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "User Segmentation Based on Activity and Sleep",
    x = "Average Daily Steps",
    y = "Average Minutes Asleep",
    color = "Cluster"
  ) +
  theme_minimal()

Cluster_Plot

7.5 Cluster Profiles and Interpretation

Based on average activity and sleep metrics, four distinct user segments were identified.
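Cluster numbers returned by `kmeans()` are arbitrary labels, so segment descriptions are easiest to check against a table of per-cluster sizes and means. The sketch below shows how such a profile table could be produced; `user_toy` is simulated data in the shape of the user summary and does not reproduce the segments reported here.

```r
library(dplyr)

# Simulated user-level averages with four loose groups (illustration only)
set.seed(123)
user_toy <- tibble(
  Id = 1:24,
  avg_steps = c(rnorm(6, 8000, 400), rnorm(6, 12000, 400),
                rnorm(6, 3000, 400), rnorm(6, 4000, 400)),
  avg_sleep = c(rnorm(6, 430, 15), rnorm(6, 340, 15),
                rnorm(6, 300, 15), rnorm(6, 520, 15))
)

# Cluster on standardized variables, as in the preparation step
km <- kmeans(scale(select(user_toy, avg_steps, avg_sleep)),
             centers = 4, nstart = 25)

# Size and mean behavior of each cluster, in original units
profile <- user_toy %>%
  mutate(cluster = factor(km$cluster)) %>%
  group_by(cluster) %>%
  summarise(
    n_users    = n(),
    mean_steps = mean(avg_steps),
    mean_sleep = mean(avg_sleep)
  ) %>%
  arrange(desc(mean_steps))

print(profile)
```

Reporting `n_users` next to the means also shows when a segment rests on only a handful of users.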

7.5.1 Cluster 1 – Balanced Users

- Moderate daily steps
- Healthy sleep duration
- Represent stable and consistent device users

These users demonstrate balanced wellness behavior and may benefit from goal-based engagement features and performance tracking.

7.5.2 Cluster 2 – Highly Active but Sleep-Deprived Users

- High average steps
- Lower-than-average sleep duration

These users may prioritize physical activity over recovery. They represent an opportunity for recovery-focused insights, sleep education, and personalized wellness notifications.

7.5.3 Cluster 3 – Low Engagement Users

- Lower daily steps
- Short or inconsistent sleep duration

This group may represent users with irregular device usage. Engagement campaigns, gamification, and habit-building prompts could improve retention.

7.5.4 Cluster 4 – Low Activity but Long Sleep Users

- Lower activity levels
- Longer sleep duration

These users may benefit from light-activity encouragement programs and daily movement reminders to increase engagement.

The behavioral differences identified across clusters highlight meaningful variation in user needs and engagement patterns. These insights provide a strong foundation for developing targeted, data-driven business strategies aimed at improving retention, personalization, and overall product value.

8 Business Recommendations

The segmentation analysis revealed distinct behavioral profiles across user groups. Rather than applying a uniform engagement strategy, Bellabeat can leverage these behavioral insights to implement personalized interventions tailored to each segment.

The following recommendations are designed to align product features and marketing initiatives with observed user behavior patterns.

8.1 Personalized Engagement Strategies

8.1.1 Cluster 1 – Balanced Users

These users maintain moderate activity levels and healthy sleep duration.

Recommendation:
- Introduce achievement-based rewards
- Offer performance tracking dashboards
- Promote advanced analytics features

Goal: Increase loyalty and encourage premium feature adoption.

8.1.2 Cluster 2 – Highly Active but Sleep-Deprived Users

These users demonstrate high physical activity but lower sleep duration.

Recommendation:
- Implement recovery-focused insights
- Introduce sleep optimization notifications
- Provide educational content about rest and recovery

Goal: Position Bellabeat as a holistic wellness solution rather than just a fitness tracker.

8.1.3 Cluster 3 – Low Engagement Users

These users show low activity and inconsistent sleep patterns.

Recommendation:
- Gamified challenges (step streaks, weekly goals)
- Behavioral nudges and reminders
- Push notifications encouraging device usage

Goal: Improve engagement and reduce churn.

8.1.4 Cluster 4 – Low Activity but Long Sleep Users

These users sleep adequately but show lower physical activity.

Recommendation:
- Light-activity challenges (e.g., daily movement goals)
- Gentle motivational messaging
- Beginner-level fitness programs

Goal: Gradually increase activity without overwhelming the user.

8.2 Strategic Implications

This analysis highlights that user behavior varies substantially across segments.

Rather than applying a single marketing strategy to all users, Bellabeat should adopt a personalized engagement model driven by behavioral data.

By leveraging segmentation:
- Marketing campaigns can be targeted
- In-app experiences can be customized
- Retention strategies can be optimized

Segmentation enables Bellabeat to transition from generic tracking to data-driven personalization.

9 Limitations

While the analysis provides valuable behavioral insights, several limitations should be considered.

- Small Sample Size: The dataset contains a limited number of users, and only users with both activity and sleep records enter the clustering, which may reduce generalizability.
- Short Time Frame: The data covers a relatively short observation period and may not capture long-term behavior patterns.
- Device Usage Dependency: Missing or zero values may reflect non-usage rather than actual behavior. For example, one user in the user-level summary averages only 61 minutes asleep, which may reflect sparse tracking rather than true sleep behavior.
- Limited Behavioral Variables: The segmentation is based primarily on activity and sleep metrics. Additional features (e.g., stress levels, heart rate, demographics) could enhance cluster differentiation.

Therefore, findings should be interpreted as directional insights rather than definitive behavioral classifications.

10 Conclusion

This analysis explored Fitbit smart device usage data to identify behavioral patterns that can inform engagement strategies for Bellabeat users.

While a weak linear relationship was observed between physical activity and sleep duration, user segmentation revealed meaningful behavioral groups with distinct engagement profiles.

The clustering analysis identified four key user segments:
- Balanced users
- Highly active but sleep-deprived users
- Low engagement users
- Low activity but long sleep users

These findings support the implementation of personalized engagement strategies tailored to different behavioral patterns.

By leveraging user segmentation, Bellabeat can:
- Improve user retention
- Enhance product positioning
- Deliver personalized wellness experiences
- Strengthen its competitive advantage in the smart wellness market

Future analysis incorporating additional behavioral and demographic variables could further refine segmentation and support more targeted strategic initiatives.

BellaBeat Case Study: User Activity & Sleep Analysis

Kseniia Tyshchenko

2026-02-12