I explored how the VADER sentiment of Bluesky social media posts varies with time, specifically the time of day (Morning, Afternoon, Evening, Night) and the weekday (Monday to Sunday).
Because the posts are strongly imbalanced across these categories, I applied stratified downsampling to the dataset. Non-parametric statistical tests (Kruskal-Wallis and pairwise Wilcoxon) revealed significant differences in sentiment across times of day, along with a weaker effect across weekdays.
Further analysis with co-occurrence networks showed that post topics become more complex and diversified as the day progresses.
I loaded a variety of libraries essential for text processing and visualization.
library(tidyverse)
library(reticulate)
library(cld2)
library(tidytext)
library(wordcloud)
library(word2vec)
library(text2vec)
library(Rtsne)
library(plotly)
library(reshape2)
library(topicmodels)
library(igraph)
library(ggraph)
library(quanteda)
library(rvest)
library(hunspell)
library(textTinyR)
library(httr)
library(tm)
library(widyr)
library(FactoMineR)
library(factoextra)
library(patchwork)
The Alpindale Bluesky posts dataset contains about 2 million Bluesky posts. Using Python, I added a sentiment column to the dataset by scoring each post with the VADER sentiment lexicon. VADER is well-suited for social media text because it accounts for the text variations common in online communication. The resulting file, bluesky_data_VADER.csv, is imported in the code below.
VADER scores range from -1 (most negative) to 1 (most positive) and serve as the sentiment variable (DV - response).
While developing the analysis, I worked with a 100,000-post sample of the full dataset for efficiency; for the final run, I used the whole dataset.
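The Python scoring step itself is not part of this document; as a rough sketch, the compound score could be computed through reticulate along the lines below. Here raw_posts is a hypothetical tibble of the original, uncleaned posts, and this assumes the Python vaderSentiment package is installed; it is not the exact script used to build bluesky_data_VADER.csv.
# Sketch (assumption): scoring posts with the Python vaderSentiment package via reticulate
# raw_posts is a hypothetical tibble with a `text` column of original, uncleaned posts
vader_mod <- reticulate::import("vaderSentiment.vaderSentiment")
analyzer  <- vader_mod$SentimentIntensityAnalyzer()
raw_posts$sentiment <- vapply(
  raw_posts$text,
  function(x) analyzer$polarity_scores(x)$compound,  # compound score in [-1, 1]
  numeric(1)
)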
# Read CSV file into a tibble
tb<- read_csv("bluesky_data_VADER.csv")
## Rows: 2107530 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): text, author, uri, reply_to
## dbl (1): sentiment
## lgl (1): has_images
## dttm (1): created_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Sample only a portion of the data
# set.seed(123) # For reproducibility
# tb <- tb %>% sample_n(100000)
# View first few rows
head(tb)
## # A tibble: 6 × 7
## text created_at author uri has_images reply_to sentiment
## <chr> <dttm> <chr> <chr> <lgl> <chr> <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE <NA> 0.458
## 2 "Niet been, ie… 2024-11-27 07:53:46 did:p… at:/… FALSE at://di… 0
## 3 "よく考えたらchrがsbさ… 2024-11-27 07:53:46 did:p… at:/… FALSE <NA> 0
## 4 "#Br0kenColors… 2024-11-27 07:53:39 did:p… at:/… TRUE <NA> 0
## 5 "Congratulatio… 2024-11-27 07:53:47 did:p… at:/… FALSE at://di… 0.636
## 6 "Mi relación s… 2024-11-27 07:53:45 did:p… at:/… FALSE <NA> -0.410
The date and time are extracted from the created_at column and converted to the appropriate data types. To restrict the scope of the dataset, I focus only on (1) posts created in 2024 and (2) posts classified as English.
# Get time
tb <- tb %>%
mutate(
created_at = ymd_hms(created_at), # Convert to datetime format
date = as.Date(created_at), # Extract date
time = format(created_at, "%H:%M:%S") # Extract time
) %>%
filter(date >= as.Date("2024-01-01")) # Only get 2024 posts
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `created_at = ymd_hms(created_at)`.
## Caused by warning:
## ! 642 failed to parse.
tb %>%
  arrange(desc(created_at))
## # A tibble: 1,961,392 × 9
## text created_at author uri has_images reply_to sentiment
## <chr> <dttm> <chr> <chr> <lgl> <chr> <dbl>
## 1 "El carisma e… 2024-11-30 12:45:31 did:p… at:/… FALSE <NA> -0.296
## 2 "Förslagen oc… 2024-11-30 10:50:04 did:p… at:/… FALSE at://di… 0
## 3 "www.youtube.… 2024-11-29 02:11:49 did:p… at:/… FALSE <NA> 0
## 4 "Winslow Home… 2024-11-28 14:24:34 did:p… at:/… TRUE <NA> 0
## 5 "Name: Letiti… 2024-11-28 13:58:59 did:p… at:/… FALSE at://di… 0
## 6 "Daily post\n… 2024-11-28 13:44:03 did:p… at:/… FALSE <NA> 0
## 7 "No, free Pal… 2024-11-28 10:12:49 did:p… at:/… FALSE at://di… -0.046
## 8 "Free Palesti… 2024-11-28 09:10:34 did:p… at:/… FALSE <NA> -0.0516
## 9 "Top fan mess… 2024-11-28 08:53:42 did:p… at:/… FALSE at://di… 0.660
## 10 "Top fan mess… 2024-11-28 08:52:43 did:p… at:/… FALSE at://di… 0.660
## # ℹ 1,961,382 more rows
## # ℹ 2 more variables: date <date>, time <chr>
# Detect language and add to a column
tb <- tb %>%
mutate(language = cld2::detect_language(text)) %>%
filter(language == "en") # Only get English posts
head(tb)
## # A tibble: 6 × 10
## text created_at author uri has_images reply_to sentiment
## <chr> <dttm> <chr> <chr> <lgl> <chr> <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE <NA> 0.458
## 2 "#Br0kenColors… 2024-11-27 07:53:39 did:p… at:/… TRUE <NA> 0
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE at://di… 0.915
## 4 "I’m really ho… 2024-11-27 07:53:47 did:p… at:/… FALSE at://di… 0
## 5 "Choose 20 boo… 2024-11-27 07:53:42 did:p… at:/… FALSE <NA> 0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE <NA> 0.214
## # ℹ 3 more variables: date <date>, time <chr>, language <chr>
I cleaned the text by removing URLs, mentions, hashtags, non-ASCII symbols (including emojis), punctuation, and numbers; stop words are handled separately during tokenization below. This keeps the downstream text analysis focused on the content of the posts rather than on irrelevant elements (the VADER scores were already computed on the original post text).
tb <- tb %>%
mutate(
text = str_replace_all(text, "http[s]?://\\S+", ""), # Remove URLs
text = str_replace_all(text, "@\\w+", ""), # Remove mentions
text = str_replace_all(text, "#\\w+", ""), # Remove hashtags
text = str_replace_all(text, "[^\x01-\x7F]", ""), # Remove all non-ASCII characters (including emojis)
text = str_replace_all(text, "[[:punct:]]", ""), # Remove punctuation
text = str_replace_all(text, "\\d+", "") # Remove numbers
)
head(tb)
## # A tibble: 6 × 10
## text created_at author uri has_images reply_to sentiment
## <chr> <dttm> <chr> <chr> <lgl> <chr> <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE <NA> 0.458
## 2 "\nThis was su… 2024-11-27 07:53:39 did:p… at:/… TRUE <NA> 0
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE at://di… 0.915
## 4 "Im really hol… 2024-11-27 07:53:47 did:p… at:/… FALSE at://di… 0
## 5 "Choose books… 2024-11-27 07:53:42 did:p… at:/… FALSE <NA> 0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE <NA> 0.214
## # ℹ 3 more variables: date <date>, time <chr>, language <chr>
# Load stop words and convert to lowercase
data("stop_words")
stop_words <- tolower(stop_words$word)
# Remove apostrophes from stop words and create a new list with both forms
expanded_stop_words <- stop_words %>%
str_replace_all("'", "") %>% # Remove apostrophes
unique() # Get unique words
# Add the original stop words with apostrophes back to the list
expanded_stop_words <- unique(c(expanded_stop_words, stop_words))
# View the expanded stop words list
head(expanded_stop_words)
## [1] "a" "as" "able" "about" "above" "according"
# NOTE: This list will be used during the tokenization process for the data visualization since the VADER Sentiment Lexicon can handle stop words
I divided the 24-hour day (times are in 24-hour format) into four equal six-hour bins: Night (00:00-06:00), Morning (06:00-12:00), Afternoon (12:00-18:00), and Evening (18:00-24:00). This simplifies the hypothesis testing and serves as the time-of-day variable (IV - predictor).
I also extracted the weekday on which each post was created; this serves as the weekday variable (IV - predictor).
tb <- tb %>%
mutate(
hour = lubridate::hour(created_at),
weekday = weekdays(created_at),
weekday = factor(weekday, levels = c(
"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"
)),
time_of_day = case_when(
hour >= 0 & hour < 6 ~ "Night",
hour >= 6 & hour < 12 ~ "Morning",
hour >= 12 & hour < 18 ~ "Afternoon",
hour >= 18 & hour < 24 ~ "Evening"
),
time_of_day = factor(time_of_day, levels = c("Morning", "Afternoon", "Evening", "Night"))
)
head(tb)
## # A tibble: 6 × 13
## text created_at author uri has_images reply_to sentiment
## <chr> <dttm> <chr> <chr> <lgl> <chr> <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE <NA> 0.458
## 2 "\nThis was su… 2024-11-27 07:53:39 did:p… at:/… TRUE <NA> 0
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE at://di… 0.915
## 4 "Im really hol… 2024-11-27 07:53:47 did:p… at:/… FALSE at://di… 0
## 5 "Choose books… 2024-11-27 07:53:42 did:p… at:/… FALSE <NA> 0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE <NA> 0.214
## # ℹ 6 more variables: date <date>, time <chr>, language <chr>, hour <int>,
## # weekday <fct>, time_of_day <fct>
After categorizing the posts, I examined the distribution of posts across the time bins.
tb %>%
mutate(hour = hour(ymd_hms(created_at))) %>%
count(hour) %>%
ggplot(aes(x = hour, y = n)) +
geom_col(fill = "steelblue") +
scale_x_continuous(breaks = 0:23) +
labs(title = "Distribution of Posts by Hour",
x = "Hour of Day (24-hour format)",
y = "Number of Posts") +
theme_minimal()
weekday_counts <- tb %>%
count(weekday) %>%
mutate(percentage = round(100 * n / sum(n), 2))
ggplot(weekday_counts, aes(x = weekday, y = n)) +
geom_bar(stat = "identity", fill = "#F8766D") +
labs(title = "Post Count by Weekday",
x = "Weekday",
y = "Number of Posts") +
theme_minimal()
Looking at the distribution of posts over time, it seemed suspicious that most posts were created during the morning-to-afternoon hours and on Wednesdays.
My assumption is that whoever collected the data scraped it on Wednesday mornings/afternoons each week, so most posts fall into those bins; counting posts per calendar date (below) is a quick way to check this.
Thankfully, the dataset is large enough that every category still has an adequate number of posts.
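A small sketch of that check: if the data was pulled in weekly batches, a handful of dates should dominate the counts (output not shown here).
# Sketch: posts per calendar date, largest counts first
tb %>%
  count(date, sort = TRUE) %>%
  head(10)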
To handle the data imbalance, I applied stratified downsampling to ensure equal representation from each time-of-day and weekday group.
set.seed(123)
min_n_time <- tb %>%
count(time_of_day) %>%
pull(n) %>%
min()
time_balanced <- tb %>%
group_by(time_of_day) %>%
slice_sample(n = min_n_time) %>%
ungroup()
# Check balance
time_balanced %>%
count(time_of_day)
## # A tibble: 4 × 2
## time_of_day n
## <fct> <int>
## 1 Morning 782
## 2 Afternoon 782
## 3 Evening 782
## 4 Night 782
# Stratified downsampling by weekday
set.seed(123)
min_n_weekday <- tb %>%
count(weekday) %>%
pull(n) %>%
min()
weekday_balanced <- tb %>%
group_by(weekday) %>%
slice_sample(n = min_n_weekday) %>%
ungroup()
# Check balance
weekday_balanced %>%
count(weekday)
## # A tibble: 7 × 2
## weekday n
## <fct> <int>
## 1 Monday 482
## 2 Tuesday 482
## 3 Wednesday 482
## 4 Thursday 482
## 5 Friday 482
## 6 Saturday 482
## 7 Sunday 482
I will be testing whether there is a significant difference in sentiment across different times of the day (morning, afternoon, evening, and night).
Predictor Variable (IV - Cat): Time of Day (Morning, Afternoon, Evening, Night)
Response Variable (DV - Num): VADER Sentiment Score of Bluesky Posts (-1 to 1)
Null Hypothesis (H₀): There is NO significant difference in sentiment across different times of the day (morning, afternoon, evening, and night).
Alternative Hypothesis (H₁): There is a significant difference in sentiment across different times of the day.
# Violin plot
ggplot(time_balanced, aes(x = time_of_day, y = sentiment, fill = time_of_day)) +
geom_violin(trim = FALSE, scale = "width") +
stat_summary(fun = "median", geom = "point", shape = 21, size = 2, fill = "white") +
labs(title = "Sentiment Distribution by Time of Day (Balanced)",
x = "Time of Day",
y = "Sentiment Score") +
theme_minimal() +
theme(legend.position = "none")
time_balanced %>%
group_by(time_of_day) %>%
summarise(
median_sentiment = median(sentiment),
mean_sentiment = mean(sentiment),
n = n()
)
## # A tibble: 4 × 4
## time_of_day median_sentiment mean_sentiment n
## <fct> <dbl> <dbl> <int>
## 1 Morning 0.202 0.202 782
## 2 Afternoon 0.168 0.187 782
## 3 Evening 0 0.0774 782
## 4 Night 0 0.116 782
# Kolmogorov-Smirnov test for checking normality with large sample sizes
# Null Hypothesis (H₀): The data follows a normal distribution
# Alternative Hypothesis (H₁):The data does NOT follow a normal distribution
time_balanced %>%
group_by(time_of_day) %>%
summarise(ks_test = ks.test(sentiment, "pnorm", mean(sentiment), sd(sentiment))$p.value)
## Warning: There were 4 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `ks_test = ks.test(sentiment, "pnorm", mean(sentiment),
## sd(sentiment))$p.value`.
## ℹ In group 1: `time_of_day = Morning`.
## Caused by warning in `ks.test.default()`:
## ! ties should not be present for the one-sample Kolmogorov-Smirnov test
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
## # A tibble: 4 × 2
## time_of_day ks_test
## <fct> <dbl>
## 1 Morning 7.84e-12
## 2 Afternoon 2.73e-10
## 3 Evening 1.70e-16
## 4 Night 2.97e-20
Before selecting a test statistic, I explored the data further; the distributions above already suggest the sentiment scores are not normal.
The Kolmogorov-Smirnov test is a statistical test that checks whether a variable follows a normal distribution.
For each group (Morning, Afternoon, Evening, Night), the p-value is < 0.001, so we REJECT the null hypothesis (H₀: the data follow a normal distribution) for every group. The sentiment scores therefore do not follow a normal distribution.
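A visual check tells the same story; for instance, per-group QQ plots (a quick sketch, not rendered above) compare the sentiment scores against a normal reference line.
# Sketch: QQ plots of sentiment per time-of-day group against a normal reference
ggplot(time_balanced, aes(sample = sentiment)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ time_of_day) +
  theme_minimal()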
Therefore, instead of ANOVA we use the Kruskal-Wallis test, a non-parametric test suitable for comparing more than two groups based on their ranks, to test for significant differences in sentiment between the time-of-day categories (alpha = 0.05 as the standard).
# Kruskal-Wallis Test
kruskal.test(sentiment ~ time_of_day, data = time_balanced)
##
## Kruskal-Wallis rank sum test
##
## data: sentiment by time_of_day
## Kruskal-Wallis chi-squared = 40.651, df = 3, p-value = 7.755e-09
Based on our test results (Kruskal-Wallis chi-squared(3) = 40.651, p < 0.001), we reject the null hypothesis and conclude that there is a significant difference in sentiment across the different times of the day.
The pairwise Wilcoxon rank-sum test (Mann-Whitney U) is a non-parametric alternative to the two-sample t-test for comparing independent groups when the data are not normally distributed; here it serves as the post hoc test, with Benjamini-Hochberg (BH) adjustment for multiple comparisons.
# Perform post hoc test
time_pairwise <- pairwise.wilcox.test(time_balanced$sentiment, time_balanced$time_of_day, p.adjust.method = "BH")
time_pairwise
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: time_balanced$sentiment and time_balanced$time_of_day
##
## Morning Afternoon Evening
## Afternoon 0.70201 - -
## Evening 4.1e-07 3.8e-06 -
## Night 0.00023 0.00125 0.13051
##
## P value adjustment method: BH
At alpha = 0.05, there are significant differences in sentiment between Morning and Evening, Morning and Night, Afternoon and Evening, and Afternoon and Night.
There is no significant difference between Morning and Afternoon or between Evening and Night.
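The adjusted p-values can also be pulled out of the test object into a tidy table, which makes flagging significant pairs programmatic instead of reading them off the printed matrix (a small sketch using reshape2, already loaded above).
# Sketch: tidy the lower-triangular matrix of BH-adjusted p-values
time_pairwise_tidy <- reshape2::melt(
  time_pairwise$p.value,
  varnames = c("group1", "group2"),
  value.name = "p_adj"
) %>%
  filter(!is.na(p_adj)) %>%          # drop the empty upper triangle
  mutate(significant = p_adj < 0.05) %>%
  arrange(p_adj)
time_pairwise_tidy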
I will be testing whether there is a significant difference in sentiment across weekdays (Monday-Sunday).
Predictor Variable (IV - Cat): Weekday (Monday-Sunday)
Response Variable (DV - Num): VADER Sentiment Score of Bluesky Posts (-1 to 1)
Null Hypothesis (H₀): There is NO significant difference in sentiment across weekdays.
Alternative Hypothesis (H₁): There is a significant difference in sentiment across weekdays.
ggplot(weekday_balanced, aes(x = weekday, y = sentiment, fill = weekday)) +
geom_violin(trim = FALSE, scale = "width") +
stat_summary(fun = "median", geom = "point", shape = 21, size = 2, fill = "white") +
labs(title = "Sentiment Distribution by Weekday (Balanced)",
x = "Weekday",
y = "Sentiment Score") +
theme_minimal() +
theme(legend.position = "none")
# Kolmogorov-Smirnov test for checking normality with large sample sizes
# Null Hypothesis (H₀): The data follows a normal distribution
# Alternative Hypothesis (H₁):The data does NOT follow a normal distribution
weekday_balanced %>%
group_by(weekday) %>%
summarise(ks_test = ks.test(sentiment, "pnorm", mean(sentiment), sd(sentiment))$p.value)
## Warning: There were 7 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `ks_test = ks.test(sentiment, "pnorm", mean(sentiment),
## sd(sentiment))$p.value`.
## ℹ In group 1: `weekday = Monday`.
## Caused by warning in `ks.test.default()`:
## ! ties should not be present for the one-sample Kolmogorov-Smirnov test
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.
## # A tibble: 7 × 2
## weekday ks_test
## <fct> <dbl>
## 1 Monday 5.04e-10
## 2 Tuesday 4.05e-13
## 3 Wednesday 2.88e-10
## 4 Thursday 3.58e- 8
## 5 Friday 1.24e- 9
## 6 Saturday 6.51e-11
## 7 Sunday 3.07e-11
As before, the data was explored before selecting a test statistic; the distributions again indicate that the sentiment scores are not normal.
The Kolmogorov-Smirnov test showed that for each group (Monday-Sunday) the p-value is < 0.001, so we REJECT the null hypothesis (H₀: the data follow a normal distribution) for every group. The sentiment scores therefore do not follow a normal distribution.
Therefore, instead of ANOVA we again use the Kruskal-Wallis test, a non-parametric test suitable for comparing more than two groups based on their ranks, to test for significant differences in sentiment between the weekday categories (alpha = 0.05 as the standard).
# Kruskal-Wallis Test by weekday
kruskal.test(sentiment ~ factor(weekday), data = weekday_balanced)
##
## Kruskal-Wallis rank sum test
##
## data: sentiment by factor(weekday)
## Kruskal-Wallis chi-squared = 17.281, df = 6, p-value = 0.008304
Based on our test results (Kruskal-Wallis chi-squared(6) = 17.281, p = 0.008), the p-value falls below alpha = 0.05, so we reject the null hypothesis: sentiment also differs significantly across weekdays, although this effect appears much weaker than the time-of-day effect.
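To compare the strength of the two omnibus effects, a rank-based effect size such as epsilon-squared (H / (n - 1)) can be derived from the Kruskal-Wallis statistics; the sketch below recomputes both tests on the balanced samples purely to illustrate the comparison.
# Sketch: epsilon-squared effect sizes for the two Kruskal-Wallis tests
kw_time    <- kruskal.test(sentiment ~ time_of_day, data = time_balanced)
kw_weekday <- kruskal.test(sentiment ~ weekday,     data = weekday_balanced)
tibble(
  predictor  = c("time_of_day", "weekday"),
  H          = c(unname(kw_time$statistic), unname(kw_weekday$statistic)),
  n          = c(nrow(time_balanced), nrow(weekday_balanced)),
  epsilon_sq = H / (n - 1)   # values near 0 indicate a negligible effect, near 1 a strong one
)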
Given how small the weekday effect is compared to the time-of-day effect, I did not pursue pairwise post hoc tests for weekday; the rest of the analysis instead examines how post topics differ across the time-of-day bins using word co-occurrence networks built on the balanced sample.
# Add row ID
time_balanced <- time_balanced %>%
mutate(row_id = row_number())
# Filter morning and remove stop words
morning_words <- time_balanced %>%
filter(time_of_day == "Morning") %>%
select(row_id, text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) #stop word filter
# Calculate co-occurrence counts (within same post)
morning_pairs <- morning_words %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE)
# Filter for strong connections only
morning_pairs_filtered <- morning_pairs %>%
filter(n >= 3) # Only keep strong links
# Build graph
graph <- morning_pairs_filtered %>%
graph_from_data_frame()
# Plot with ggraph
ggraph(graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "skyblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
labs(title = "Morning Word Co-Occurrence Network") +
theme_void()
# Filter afternoon and remove stop words
afternoon_words <- time_balanced %>%
filter(time_of_day == "Afternoon") %>%
select(row_id, text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) #stop word filter
# Calculate co-occurrence counts (within same post)
afternoon_pairs <- afternoon_words %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE)
# Filter for strong connections only
afternoon_pairs_filtered <- afternoon_pairs %>%
filter(n >= 3) # Only keep strong links
# Build graph
graph <- afternoon_pairs_filtered %>%
graph_from_data_frame()
# Plot with ggraph
ggraph(graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "skyblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
labs(title = "Afternoon Word Co-Occurrence Network") +
theme_void()
# Filter evening and remove stop words
evening_words <- time_balanced %>%
filter(time_of_day == "Evening") %>%
select(row_id, text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) #stop word filter
# Calculate co-occurrence counts (within same post)
evening_pairs <- evening_words %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE)
# Filter for strong connections only
evening_pairs_filtered <- evening_pairs %>%
filter(n >= 3) # Only keep strong links
# Build graph
graph <- evening_pairs_filtered %>%
graph_from_data_frame()
# Plot with ggraph
ggraph(graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "skyblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
labs(title = "Evening Word Co-Occurrence Network") +
theme_void()
# Filter night and remove stop words
night_words <- time_balanced %>%
filter(time_of_day == "Night") %>%
select(row_id, text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) #stop word filter
# Calculate co-occurrence counts (within same post)
night_pairs <- night_words %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE)
# Filter for strong connections only
night_pairs_filtered <- night_pairs %>%
filter(n >= 3) # Only keep strong links
# Build graph
graph <- night_pairs_filtered %>%
graph_from_data_frame()
# Plot with ggraph
ggraph(graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
geom_node_point(color = "skyblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE, size = 4) +
labs(title = "Night Word Co-Occurrence Network") +
theme_void()
# Function to compute network
network_stats_by_time <- function(data, time_label) {
# Filter for one time_of_day group
word_pairs <- data %>%
filter(time_of_day == time_label) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE) %>%
filter(n >= 3)
# Build graph
g <- graph_from_data_frame(word_pairs, directed = FALSE)
# Compute metrics
data.frame(
time_of_day = time_label,
nodes = vcount(g),
edges = ecount(g),
density = edge_density(g),
components = components(g)$no,
modularity = modularity(cluster_louvain(g))
)
}
# Apply to all 4 time bins
network_summary <- bind_rows(
network_stats_by_time(time_balanced, "Morning"),
network_stats_by_time(time_balanced, "Afternoon"),
network_stats_by_time(time_balanced, "Evening"),
network_stats_by_time(time_balanced, "Night")
)
network_summary
## time_of_day nodes edges density components modularity
## 1 Morning 23 42 0.16600791 5 0.5136054
## 2 Afternoon 7 4 0.19047619 3 0.6250000
## 3 Evening 45 66 0.06666667 13 0.7470156
## 4 Night 51 55 0.04313725 13 0.7552066
# Bar chart of modularity
ggplot(network_summary, aes(x = factor(time_of_day, levels = c("Morning", "Afternoon", "Evening", "Night")),
y = modularity, fill = time_of_day)) +
geom_col(show.legend = FALSE) +
labs(title = "Topic Modularity by Time of Day",
x = "Time of Day",
y = "Modularity Score") +
theme_minimal()
Modularity: how well the network divides into distinct, densely connected groups.
Modularity is lower in the morning and afternoon networks and higher in the evening and night networks.
In other words, social media post topics become more complex and more partitioned as the day progresses.
# Build co-occurrence network per time bin
build_cooc_graph <- function(df, time_label) {
word_pairs <- df %>%
filter(time_of_day == time_label) %>%
unnest_tokens(word, text) %>%
filter(!word %in% expanded_stop_words,
hunspell_check(word, dict = "en_US")) %>%
pairwise_count(word, row_id, sort = TRUE, upper = FALSE) %>%
filter(n >= 3)
g <- graph_from_data_frame(word_pairs, directed = FALSE)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
# Add Louvain communities
  V(g)$community <- cluster_louvain(g)$membership
g
}
# Build graphs
g_morning <- build_cooc_graph(time_balanced, "Morning")
g_afternoon <- build_cooc_graph(time_balanced, "Afternoon")
g_evening <- build_cooc_graph(time_balanced, "Evening")
g_night <- build_cooc_graph(time_balanced, "Night")
# Plot function
plot_network <- function(graph, title) {
ggraph(graph, layout = "fr") +
geom_edge_link(alpha = 0.2) +
geom_node_point(aes(color = as.factor(community)), size = 3) +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
labs(title = title) +
theme_void() +
theme(legend.position = "none")
}
# Generate plots
p1 <- plot_network(g_morning, "Morning")
p2 <- plot_network(g_afternoon, "Afternoon")
p3 <- plot_network(g_evening, "Evening")
p4 <- plot_network(g_night, "Night")
# Combine plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Co-Occurrence: how often terms appear together within the same post for each group.
Community Detection: which terms cluster together (Louvain communities), roughly corresponding to topics.
We see fewer clusters early in the day and more clusters late at night.
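To see which words make up each detected community, the Louvain membership stored on each graph can be split by community; a short sketch for the night network (the other three graphs work the same way):
# Sketch: words assigned to each Louvain community in the night co-occurrence network
night_communities <- split(V(g_night)$name, V(g_night)$community)
length(night_communities)   # number of detected communities
head(night_communities, 3)  # peek at the first few word groups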