This project analyzes user activity and sleep data from smart fitness devices as part of a Bellabeat case study.
The objective is to identify behavioral patterns in activity and sleep that can support data-driven marketing and engagement strategies.
The analysis uses publicly available Fitbit datasets provided as part of the Google Data Analytics Case Study.
Datasets used:
- Daily activity
- Daily steps
- Daily calories
- Daily intensities
- Sleep data
Weight-related data was excluded from the core analysis due to limited coverage and missing values.
Data source: Kaggle: BellaBeat Dataset (https://www.kaggle.com/datasets/arashnic/bellabeat-dataset)
Data cleaning focused on datasets directly used in the
analysis.
Duplicate checks and date-time standardization were applied only to
activity and sleep datasets, as these directly impact the calculated
metrics and user-level aggregations.
Sleep data required date-time formatting due to potential multiple records per day and overnight sleep sessions, while daily activity data was analyzed at the day level and did not require time granularity.
Weight-related data was excluded from the core analysis due to limited observations and high missingness.
# Load packages (readr, dplyr, and ggplot2 are attached with the tidyverse)
library(tidyverse)
# Reading data files
Daily_Activity <- read_csv("../Data/dailyActivity_merged.csv")
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sleep_Day <- read_csv("../Data/sleepDay_merged.csv")
## Rows: 413 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove duplicates from sleep data
Sleep_Day_clean <- Sleep_Day %>%
  distinct(Id, SleepDay, .keep_all = TRUE)
# Convert date-time format for sleep data; the explicit time zone keeps a
# later as.Date() conversion from shifting midnight timestamps to another day
Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(SleepDay = as.POSIXct(SleepDay, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"))
# Convert activity date to Date format
Daily_Activity_clean <- Daily_Activity %>%
  mutate(ActivityDate = as.Date(ActivityDate, format = "%m/%d/%Y"))
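The duplicate removal and date parsing can be sanity-checked before moving on. The following is a minimal sketch on a made-up table: `sleep_raw` and its values are illustrative stand-ins, not rows from the dataset. It counts how many rows repeat an `Id`/`SleepDay` pair and how many timestamps failed to parse, since `as.POSIXct()` returns `NA` instead of raising an error when the format does not match.

```r
library(dplyr)

# Toy stand-in for the raw sleep table (illustrative values only)
sleep_raw <- tibble(
  Id = c(1, 1, 1, 2),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 327, 384, 412)
)

# Rows that repeat an (Id, SleepDay) pair already seen earlier in the table
n_duplicates <- sum(duplicated(sleep_raw[, c("Id", "SleepDay")]))

sleep_parsed <- sleep_raw %>%
  distinct(Id, SleepDay, .keep_all = TRUE) %>%
  mutate(SleepDay = as.POSIXct(SleepDay,
                               format = "%m/%d/%Y %I:%M:%S %p",
                               tz = "UTC"))

# Timestamps that did not match the format come back as NA
n_unparsed <- sum(is.na(sleep_parsed$SleepDay))

n_duplicates
n_unparsed
```

Reporting both counts alongside the cleaned data makes it clear how many records the cleaning step actually changed.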
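Data validation included logical consistency checks. Activity data was filtered to remove negative step counts and negative active minutes, and to ensure that sedentary minutes did not exceed 1,440 (the number of minutes in a day).
Sleep records were validated to ensure that minutes asleep were positive, that time in bed fell between 1 and 1,440 minutes, and that minutes asleep did not exceed time in bed. Sleep efficiency, the ratio of minutes asleep to time in bed, was calculated as an additional behavioral metric to better understand sleep quality patterns.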
# Daily_Activity validation
Daily_Activity_clean <- Daily_Activity_clean %>%
  filter(
    TotalSteps >= 0,
    SedentaryMinutes <= 1440,
    LightlyActiveMinutes >= 0,
    FairlyActiveMinutes >= 0,
    VeryActiveMinutes >= 0
  )
# Sleep_Day validation
Sleep_Day_clean <- Sleep_Day_clean %>%
  filter(
    TotalMinutesAsleep > 0,
    between(TotalTimeInBed, 1, 1440),
    TotalMinutesAsleep <= TotalTimeInBed
  )
Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(
    SleepEfficiency = TotalMinutesAsleep / TotalTimeInBed
  )
summary(Sleep_Day_clean$SleepEfficiency)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.4984  0.9118  0.9426  0.9165  0.9606  1.0000
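Because `filter()` drops failing rows silently, it can help to count how many records violate each validation rule before removing them. This is a minimal sketch on a made-up table: `sleep_toy` and the column names in `violations` are illustrative assumptions, and the rules mirror the sleep checks described above.

```r
library(dplyr)

# Toy sleep records (illustrative values only)
sleep_toy <- tibble(
  TotalMinutesAsleep = c(420, 0, 500, 300),
  TotalTimeInBed     = c(450, 30, 480, 1500)
)

# Count the rows that break each rule instead of dropping them right away
violations <- sleep_toy %>%
  summarise(
    asleep_not_positive = sum(TotalMinutesAsleep <= 0),
    bed_out_of_range    = sum(!between(TotalTimeInBed, 1, 1440)),
    asleep_exceeds_bed  = sum(TotalMinutesAsleep > TotalTimeInBed)
  )

violations
```

A row can break more than one rule, so the counts may sum to more than the number of rows eventually removed.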
After cleaning and validating the datasets, daily activity and sleep data were merged using user ID and date variables.
An inner_join() was applied to retain only records where
both activity and sleep data were available. This ensures that the
analysis focuses on complete behavioral observations.
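# Derive a Date key from the sleep timestamp so it matches ActivityDate
Sleep_Day_clean <- Sleep_Day_clean %>%
  mutate(ActivityDate = as.Date(SleepDay))
# Keep only user-days present in both tables
Activity_Sleep <- Daily_Activity_clean %>%
  inner_join(Sleep_Day_clean,
             by = c("Id", "ActivityDate"))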
dim(Activity_Sleep)
## [1] 410 20
The merge was verified to ensure correct alignment of user IDs and dates; the resulting table contains 410 rows and 20 columns, with one row per user and date.
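One way such a verification could look is sketched below on made-up tables. `activity_toy` and `sleep_toy` are illustrative stand-ins, and these particular checks are an assumption rather than the checks actually run for this report. The idea is to confirm that each `Id`/`ActivityDate` key is unique in both tables, so the join cannot multiply rows, and to list the sleep records that `inner_join()` discards.

```r
library(dplyr)

# Toy stand-ins for the cleaned activity and sleep tables
activity_toy <- tibble(
  Id = c(1, 1, 2, 3),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13",
                           "2016-04-12", "2016-04-12")),
  TotalSteps = c(13162, 10735, 8000, 4500)
)
sleep_toy <- tibble(
  Id = c(1, 1, 2),
  ActivityDate = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  TotalMinutesAsleep = c(327, 384, 400)
)

# Keys that occur more than once would duplicate rows in the join
dup_keys_activity <- activity_toy %>% count(Id, ActivityDate) %>% filter(n > 1)
dup_keys_sleep    <- sleep_toy %>% count(Id, ActivityDate) %>% filter(n > 1)

merged <- inner_join(activity_toy, sleep_toy, by = c("Id", "ActivityDate"))

# Sleep records without a matching activity day are dropped by inner_join()
unmatched_sleep <- anti_join(sleep_toy, activity_toy,
                             by = c("Id", "ActivityDate"))

nrow(merged)
unmatched_sleep
```

If both duplicate-key tables are empty, the merged row count can never exceed the row count of the smaller input.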
To identify consistent behavior patterns rather than daily fluctuations, data was aggregated at the user level.
Average values were calculated for:
- Daily steps
- Minutes asleep
- Sleep efficiency
User_summary <- Activity_Sleep %>%
  group_by(Id) %>%
  summarise(
    avg_steps = mean(TotalSteps),
    avg_sleep = mean(TotalMinutesAsleep),
    avg_sleep_efficiency = mean(SleepEfficiency)
  )
head(User_summary)
## # A tibble: 6 × 4
##           Id avg_steps avg_sleep avg_sleep_efficiency
##        <dbl>     <dbl>     <dbl>                <dbl>
## 1 1503960366    12406.      360.                0.936
## 2 1644430081     7968.      294                 0.882
## 3 1844505072     3477       652                 0.678
## 4 1927972279     1490       417                 0.947
## 5 2026352035     5619.      506.                0.941
## 6 2320127002     5079        61                 0.884
Aggregating the data at the user level enables the identification of long-term behavioral patterns. This transformation prepares the dataset for further analysis, including correlation assessment and user segmentation.
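User-level averages are only as reliable as the number of days behind them. As a minimal sketch (the table `activity_sleep_toy` and the column `n_days` are illustrative assumptions, not part of the original summary), the same aggregation can carry a per-user day count so that averages based on very few days are easy to spot:

```r
library(dplyr)

# Toy merged table: user 1 has three recorded days, user 2 only one
activity_sleep_toy <- tibble(
  Id = c(1, 1, 1, 2),
  TotalSteps = c(12000, 10000, 14000, 3000),
  TotalMinutesAsleep = c(360, 400, 380, 650)
)

user_summary_toy <- activity_sleep_toy %>%
  group_by(Id) %>%
  summarise(
    n_days    = n(),                       # days contributing to the averages
    avg_steps = mean(TotalSteps),
    avg_sleep = mean(TotalMinutesAsleep),
    .groups   = "drop"
  )

user_summary_toy
```

A user with a single recorded day gets an "average" that is really one observation, which matters when those averages feed into clustering.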
Exploratory Data Analysis was conducted to understand the distribution of key behavioral variables and identify potential relationships between activity and sleep patterns.
Understanding the distribution of daily steps helps assess overall activity variability among users and detect potential outliers.
steps_plot <- ggplot(Activity_Sleep, aes(x = TotalSteps)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Daily Steps",
    x = "Total Steps",
    y = "Count of Days"
  )
steps_plot
The distribution shows wide variation in daily activity levels. Most
observations fall within a moderate activity range, while a smaller
number of highly active days points to differences in user
engagement.
Examining sleep duration provides insight into overall sleep behavior and potential inconsistencies.
sleep_plot <- ggplot(Activity_Sleep, aes(x = TotalMinutesAsleep)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(
    title = "Distribution of Sleep Duration",
    x = "Minutes Asleep",
    y = "Count of Days"
  )
sleep_plot
Most sleep observations fall within a typical healthy range
(approximately 6–8 hours). However, some shorter sleep durations are
observed, potentially indicating inconsistent device usage or irregular
sleep patterns.
To explore whether activity levels are associated with sleep duration, a scatter plot with a linear trend line was created.
relational_plot <- ggplot(Activity_Sleep, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point(alpha = 0.4, color = "purple") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Daily Steps vs Sleep Duration",
    x = "Total Steps",
    y = "Minutes Asleep"
  )
relational_plot
## `geom_smooth()` using formula = 'y ~ x'
The visualization suggests a weak relationship between activity and
sleep duration. Increased physical activity does not necessarily
correspond to longer sleep duration, indicating that additional
behavioral or lifestyle factors may influence sleep outcomes.
To quantify the relationship between daily steps and sleep duration, Pearson correlation was calculated.
cor(Activity_Sleep$TotalSteps,
    Activity_Sleep$TotalMinutesAsleep,
    use = "complete.obs")
## [1] -0.1903439
The correlation coefficient (r ≈ -0.19) confirms that the linear relationship between daily steps and sleep duration is weak and slightly negative. This suggests that segmentation analysis may provide more meaningful insights than direct linear relationships.
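A point estimate alone does not show how uncertain the correlation is. As a minimal sketch on made-up numbers (the vectors `steps` and `sleep_minutes` are illustrative, not values from the dataset), base R's `cor.test()` returns the same Pearson coefficient as `cor()` together with a confidence interval and a p-value:

```r
# Toy daily observations (illustrative values only)
steps         <- c(4000, 6000, 8000, 10000, 12000, 14000)
sleep_minutes <- c(450, 470, 400, 430, 380, 410)

result <- cor.test(steps, sleep_minutes, method = "pearson")

result$estimate   # Pearson r, identical to cor(steps, sleep_minutes)
result$conf.int   # 95% confidence interval for r
result$p.value    # test of the null hypothesis r = 0
```

Applied to daily records, the interval and p-value would only be approximate, because several days from the same user are not independent observations.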
While correlation analysis showed only a weak linear relationship between activity and sleep, user segmentation may reveal distinct behavioral patterns that are not visible through aggregate statistics.
To better understand user behavior, k-means clustering was applied at the user level.
Clustering algorithms are sensitive to variable scale. Since average daily steps and sleep minutes are measured in different units, the variables were standardized before applying k-means.
# Select clustering variables
Cluster_data <- User_summary %>%
  select(avg_steps, avg_sleep)
# Scale variables
Cluster_scaled <- scale(Cluster_data)
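To illustrate what the standardization does, the sketch below applies `scale()` to a small made-up table (`cluster_toy` is an illustrative stand-in for the clustering variables). After scaling, each column has mean 0 and standard deviation 1, so steps (in the thousands) and sleep minutes (in the hundreds) contribute comparably to the distances k-means relies on.

```r
# Toy user-level averages (illustrative values only)
cluster_toy <- data.frame(
  avg_steps = c(12000, 8000, 3500, 1500),
  avg_sleep = c(360, 294, 652, 417)
)

cluster_toy_scaled <- scale(cluster_toy)

colMeans(cluster_toy_scaled)        # approximately 0 for every column
apply(cluster_toy_scaled, 2, sd)    # 1 for every column

# scale() stores the original means and standard deviations as attributes,
# which makes it possible to convert cluster centers back to real units
attr(cluster_toy_scaled, "scaled:center")
attr(cluster_toy_scaled, "scaled:scale")
```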
To determine an appropriate number of clusters, the Elbow Method was
applied.
This approach evaluates the within-cluster sum of squares (WSS) across
different values of k.
As k increases, WSS decreases because clusters become more compact. The "elbow" is the value of k beyond which adding another cluster yields only a marginal reduction in WSS.
# Elbow Method
set.seed(123)  # kmeans() uses random starts; fix the seed for reproducibility
wss <- sapply(1:10, function(k) {
  kmeans(Cluster_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:10, wss, type = "b",
     pch = 19,
     frame = FALSE,
     xlab = "Number of Clusters (k)",
     ylab = "Within-cluster Sum of Squares (WSS)")
The elbow plot shows a gradual decrease in WSS as the number of clusters
increases, without a sharply defined inflection point.
While higher values of k continue to reduce within-cluster variance, selecting too many clusters would reduce interpretability and practical usability, especially given the relatively small dataset size.
Therefore, a 4-cluster solution was selected. This choice balances behavioral differentiation with clarity and business interpretability, enabling meaningful user segmentation without overfitting.
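When the elbow is ambiguous, a second criterion can help cross-check the choice of k. The sketch below uses the average silhouette width from the `cluster` package, a technique that is not part of the original analysis. The data (`toy_scaled`) are simulated with three well-separated groups purely for illustration, so the result says nothing about the best k for the real user data.

```r
library(cluster)

set.seed(123)
# Simulated, already-scaled data with three visible groups (illustrative only)
toy_scaled <- rbind(
  matrix(rnorm(20, mean = -2, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean =  0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean =  2, sd = 0.3), ncol = 2)
)

# Average silhouette width for k = 2..6; higher means better-separated clusters
avg_sil <- sapply(2:6, function(k) {
  km  <- kmeans(toy_scaled, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(toy_scaled))
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(which.max(avg_sil)))
avg_sil
best_k
```

With a small number of users, the silhouette curve can be as flat as the elbow plot, in which case interpretability remains a reasonable tie-breaker.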
set.seed(123)
kmeans_result <- kmeans(Cluster_scaled, centers = 4, nstart = 25)
# Add cluster labels to dataset
User_summary <- User_summary %>%
  mutate(cluster = factor(kmeans_result$cluster))
Cluster_Plot <- ggplot(User_summary, aes(x = avg_steps, y = avg_sleep, color = cluster)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(
    title = "User Segmentation Based on Activity and Sleep",
    x = "Average Daily Steps",
    y = "Average Minutes Asleep",
    color = "Cluster"
  ) +
  theme_minimal()
Cluster_Plot
Based on average activity and sleep metrics, four distinct user segments were identified.
Balanced users:
These users demonstrate balanced wellness behavior and may benefit from goal-based engagement features and performance tracking.
Highly active but sleep-deprived users:
These users may prioritize physical activity over recovery. They represent an opportunity for recovery-focused insights, sleep education, and personalized wellness notifications.
Low engagement users:
This group may represent users with irregular device usage. Engagement campaigns, gamification, and habit-building prompts could improve retention.
Low activity but long sleep users:
These users may benefit from light-activity encouragement programs and daily movement reminders to increase engagement.
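Segment descriptions like these are easier to justify when each cluster's profile is reported in original units. The following is a minimal sketch on a made-up table: `user_summary_toy`, its three clusters, and the derived column names are illustrative assumptions, not the actual cluster results.

```r
library(dplyr)

# Toy user-level table with cluster labels already assigned
user_summary_toy <- tibble(
  avg_steps = c(12000, 11000, 3000, 2500, 7000, 7500),
  avg_sleep = c(350, 330, 640, 600, 430, 450),
  cluster   = factor(c(1, 1, 2, 2, 3, 3))
)

# Size and average behavior of each cluster, in steps and hours of sleep
cluster_profile <- user_summary_toy %>%
  group_by(cluster) %>%
  summarise(
    n_users          = n(),
    mean_steps       = mean(avg_steps),
    mean_sleep_hours = mean(avg_sleep) / 60,
    .groups          = "drop"
  )

cluster_profile
```

Reporting `n_users` next to the averages also shows when a segment rests on only one or two users.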
The behavioral differences identified across clusters highlight meaningful variation in user needs and engagement patterns. These insights provide a strong foundation for developing targeted, data-driven business strategies aimed at improving retention, personalization, and overall product value.
The segmentation analysis revealed distinct behavioral profiles across user groups. Rather than applying a uniform engagement strategy, Bellabeat can leverage these behavioral insights to implement personalized interventions tailored to each segment.
The following recommendations are designed to align product features and marketing initiatives with observed user behavior patterns.
Balanced users:
These users maintain moderate activity levels and healthy sleep duration.
Recommendation:
- Introduce achievement-based rewards
- Offer performance tracking dashboards
- Promote advanced analytics features
Goal: Increase loyalty and encourage premium feature adoption.
Highly active but sleep-deprived users:
These users demonstrate high physical activity but lower sleep duration.
Recommendation:
- Implement recovery-focused insights
- Introduce sleep optimization notifications
- Provide educational content about rest and recovery
Goal: Position Bellabeat as a holistic wellness solution rather than just a fitness tracker.
Low engagement users:
These users show low activity and inconsistent sleep patterns.
Recommendation:
- Gamified challenges (step streaks, weekly goals)
- Behavioral nudges and reminders
- Push notifications encouraging device usage
Goal: Improve engagement and reduce churn.
Low activity but long sleep users:
These users sleep adequately but show lower physical activity.
Recommendation:
- Light-activity challenges (e.g., daily movement goals)
- Gentle motivational messaging
- Beginner-level fitness programs
Goal: Gradually increase activity without overwhelming the user.
This analysis highlights that user behavior varies substantially across segments.
Rather than applying a single marketing strategy to all users, Bellabeat should adopt a personalized engagement model driven by behavioral data.
By leveraging segmentation:
- Marketing campaigns can be targeted
- In-app experiences can be customized
- Retention strategies can be optimized
Segmentation enables Bellabeat to transition from generic tracking to data-driven personalization.
While the analysis provides valuable behavioral insights, several limitations should be considered.
Small Sample Size:
The dataset contains a limited number of users, which may reduce
generalizability.
Short Time Frame:
The data covers a relatively short observation period and may not
capture long-term behavior patterns.
Device Usage Dependency:
Missing or zero values may reflect non-usage rather than actual
behavior.
Limited Behavioral Variables:
The segmentation is based primarily on activity and sleep metrics.
Additional features (e.g., stress levels, heart rate, demographics)
could enhance cluster differentiation.
Therefore, findings should be interpreted as directional insights rather than definitive behavioral classifications.
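The device-usage limitation can be made more concrete by flagging days on which the tracker was probably not worn. The sketch below uses made-up data and a simple heuristic that is an assumption, not a rule from the original analysis: a day with zero steps and a full 1,440 sedentary minutes is treated as likely non-wear.

```r
library(dplyr)

# Toy daily activity records (illustrative values only)
activity_toy <- tibble(
  Id = c(1, 1, 2, 2, 2),
  TotalSteps = c(0, 9000, 0, 0, 4000),
  SedentaryMinutes = c(1440, 700, 1440, 1440, 900)
)

# Heuristic (assumption): zero steps plus a fully sedentary day = not worn
wear_check <- activity_toy %>%
  mutate(likely_not_worn = TotalSteps == 0 & SedentaryMinutes == 1440) %>%
  group_by(Id) %>%
  summarise(
    n_days         = n(),
    share_not_worn = mean(likely_not_worn),
    .groups        = "drop"
  )

wear_check
```

Users with a high share of such days could be analyzed separately, since their low averages may reflect non-usage rather than low activity.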
This analysis explored smart-device usage data to identify behavioral patterns that can inform Bellabeat's engagement strategy.
While a weak linear relationship was observed between physical activity and sleep duration, user segmentation revealed meaningful behavioral groups with distinct engagement profiles.
The clustering analysis identified four key user segments:
- Balanced users
- Highly active but sleep-deprived users
- Low engagement users
- Low activity but long sleep users
These findings support the implementation of personalized engagement strategies tailored to different behavioral patterns.
By leveraging user segmentation, Bellabeat can:
- Improve user retention
- Enhance product positioning
- Deliver personalized wellness experiences
- Strengthen its competitive advantage in the smart wellness market
Future analysis incorporating additional behavioral and demographic variables could further refine segmentation and support more targeted strategic initiatives.