1 Abstract

I explored how the VADER sentiment of Bluesky social media posts varies with time, specifically the time of day (Morning, Afternoon, Evening, Night) and the weekday (Monday to Sunday).

Due to a strong imbalance in the post distribution, I applied stratified downsampling to the dataset. Non-parametric statistical tests (Kruskal-Wallis and pairwise Wilcoxon) revealed significant differences in sentiment across the time-of-day variable.

Further analysis with co-occurrence networks showed that social media post topics become more complex and diversified as the day progresses.

2 Data Processing

2.1 Import Libraries

I loaded a variety of libraries essential for text processing and visualization.

library(tidyverse)
library(reticulate)
library(cld2)
library(tidytext)
library(wordcloud)
library(word2vec)
library(text2vec)
library(Rtsne)
library(plotly)
library(reshape2)
library(topicmodels)
library(igraph)
library(ggraph)
library(quanteda)
library(rvest)
library(hunspell)
library(textTinyR)
library(httr)
library(tm)
library(widyr)
library(FactoMineR)
library(factoextra)
library(patchwork)

2.2 Import Data

The Alpindale Bluesky posts dataset contains about 2 million Bluesky posts. A sentiment column (variable) was added to the dataset with Python, where I computed a sentiment score for each post using the VADER sentiment lexicon. VADER is well suited for social media text, as it accounts for the text variations commonly used in online communication. The resulting dataset is bluesky_data_VADER.csv, imported in the code below.

VADER compound scores range from -1 (most negative) to 1 (most positive) and will serve as the sentiment variable (DV - response).
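The exact Python script that produced the sentiment column is not included here; the following is a minimal sketch (an assumption, not the original code) of how it could be reproduced from R via reticulate and the Python vaderSentiment package. The input file name bluesky_posts.csv is hypothetical.

# Minimal sketch: compute VADER compound scores via reticulate
# (requires a Python environment with the vaderSentiment package installed)
library(reticulate)
library(tidyverse)

vader    <- import("vaderSentiment.vaderSentiment")
analyzer <- vader$SentimentIntensityAnalyzer()

raw <- read_csv("bluesky_posts.csv")  # hypothetical raw post dump

scored <- raw %>%
  mutate(sentiment = map_dbl(text, ~ analyzer$polarity_scores(.x)$compound))

write_csv(scored, "bluesky_data_VADER.csv")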

Initially, I sampled a subset of 100,000 posts from the full 2-million-post dataset for efficiency while testing the analysis, but for the final run I used the whole dataset.

# Read CSV file into a tibble
tb <- read_csv("bluesky_data_VADER.csv")
## Rows: 2107530 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): text, author, uri, reply_to
## dbl  (1): sentiment
## lgl  (1): has_images
## dttm (1): created_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Sample only a portion of the data
# set.seed(123)  # For reproducibility
# tb <- tb %>% sample_n(100000)

# View first few rows
head(tb)
## # A tibble: 6 × 7
##   text            created_at          author uri   has_images reply_to sentiment
##   <chr>           <dttm>              <chr>  <chr> <lgl>      <chr>        <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE      <NA>         0.458
## 2 "Niet been, ie… 2024-11-27 07:53:46 did:p… at:/… FALSE      at://di…     0    
## 3 "よく考えたらchrがsbさ… 2024-11-27 07:53:46 did:p… at:/… FALSE      <NA>         0    
## 4 "#Br0kenColors… 2024-11-27 07:53:39 did:p… at:/… TRUE       <NA>         0    
## 5 "Congratulatio… 2024-11-27 07:53:47 did:p… at:/… FALSE      at://di…     0.636
## 6 "Mi relación s… 2024-11-27 07:53:45 did:p… at:/… FALSE      <NA>        -0.410

2.3 Date and Time

The date and time are extracted from the created_at column and converted to the appropriate data types. To restrict the scope of the dataset, I focused only on (1) posts created in 2024 and (2) posts detected as English.

# Get time
tb <- tb %>%
  mutate(
    created_at = ymd_hms(created_at),  # Convert to datetime format
    date = as.Date(created_at),        # Extract date
    time = format(created_at, "%H:%M:%S") # Extract time
  ) %>%
  filter(date >= as.Date("2024-01-01")) # Only get 2024 posts
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `created_at = ymd_hms(created_at)`.
## Caused by warning:
## !  642 failed to parse.
tb %>%
  arrange(desc(created_at))
## # A tibble: 1,961,392 × 9
##    text           created_at          author uri   has_images reply_to sentiment
##    <chr>          <dttm>              <chr>  <chr> <lgl>      <chr>        <dbl>
##  1 "El carisma e… 2024-11-30 12:45:31 did:p… at:/… FALSE      <NA>       -0.296 
##  2 "Förslagen oc… 2024-11-30 10:50:04 did:p… at:/… FALSE      at://di…    0     
##  3 "www.youtube.… 2024-11-29 02:11:49 did:p… at:/… FALSE      <NA>        0     
##  4 "Winslow Home… 2024-11-28 14:24:34 did:p… at:/… TRUE       <NA>        0     
##  5 "Name: Letiti… 2024-11-28 13:58:59 did:p… at:/… FALSE      at://di…    0     
##  6 "Daily post\n… 2024-11-28 13:44:03 did:p… at:/… FALSE      <NA>        0     
##  7 "No, free Pal… 2024-11-28 10:12:49 did:p… at:/… FALSE      at://di…   -0.046 
##  8 "Free Palesti… 2024-11-28 09:10:34 did:p… at:/… FALSE      <NA>       -0.0516
##  9 "Top fan mess… 2024-11-28 08:53:42 did:p… at:/… FALSE      at://di…    0.660 
## 10 "Top fan mess… 2024-11-28 08:52:43 did:p… at:/… FALSE      at://di…    0.660 
## # ℹ 1,961,382 more rows
## # ℹ 2 more variables: date <date>, time <chr>
# Detect language and add to a column
tb <- tb %>%
  mutate(language = cld2::detect_language(text)) %>%
  filter(language == "en") # Only get English posts

head(tb)
## # A tibble: 6 × 10
##   text            created_at          author uri   has_images reply_to sentiment
##   <chr>           <dttm>              <chr>  <chr> <lgl>      <chr>        <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE      <NA>         0.458
## 2 "#Br0kenColors… 2024-11-27 07:53:39 did:p… at:/… TRUE       <NA>         0    
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE      at://di…     0.915
## 4 "I’m really ho… 2024-11-27 07:53:47 did:p… at:/… FALSE      at://di…     0    
## 5 "Choose 20 boo… 2024-11-27 07:53:42 did:p… at:/… FALSE      <NA>         0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE       <NA>         0.214
## # ℹ 3 more variables: date <date>, time <chr>, language <chr>

2.4 Removing Symbols

I cleaned the text by removing unnecessary symbols, URLs, mentions, hashtags, punctuation, numbers, and stop words. This ensures that the subsequent analysis focuses on the content of the posts rather than on irrelevant elements.

tb <- tb %>%
  mutate(
    text = str_replace_all(text, "http[s]?://\\S+", ""),   # Remove URLs
    text = str_replace_all(text, "@\\w+", ""),             # Remove mentions
    text = str_replace_all(text, "#\\w+", ""),             # Remove hashtags
    text = str_replace_all(text, "[^\x01-\x7F]", ""),      # Remove all non-ASCII characters (including emojis)
    text = str_replace_all(text, "[[:punct:]]", ""),        # Remove punctuation
    text = str_replace_all(text, "\\d+", "")               # Remove numbers
  )

head(tb)
## # A tibble: 6 × 10
##   text            created_at          author uri   has_images reply_to sentiment
##   <chr>           <dttm>              <chr>  <chr> <lgl>      <chr>        <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE      <NA>         0.458
## 2 "\nThis was su… 2024-11-27 07:53:39 did:p… at:/… TRUE       <NA>         0    
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE      at://di…     0.915
## 4 "Im really hol… 2024-11-27 07:53:47 did:p… at:/… FALSE      at://di…     0    
## 5 "Choose  books… 2024-11-27 07:53:42 did:p… at:/… FALSE      <NA>         0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE       <NA>         0.214
## # ℹ 3 more variables: date <date>, time <chr>, language <chr>
# Load stop words and convert to lowercase
data("stop_words")
stop_words <- tolower(stop_words$word)

# Remove apostrophes from stop words and create a new list with both forms
expanded_stop_words <- stop_words %>%
  str_replace_all("'", "") %>%  # Remove apostrophes
  unique()  # Get unique words

# Add the original stop words with apostrophes back to the list
expanded_stop_words <- unique(c(expanded_stop_words, stop_words))

# View the expanded stop words list
head(expanded_stop_words)
## [1] "a"         "as"        "able"      "about"     "above"     "according"
# NOTE: This list will be used during the tokenization process for the data visualization since the VADER Sentiment Lexicon can handle stop words

2.5 Adding Time Bins

I divided the 24-hour day into four equal categories to simplify hypothesis testing (times are given in 24-hour format):

  1. 0 - 6 : Night
  2. 6 - 12 : Morning
  3. 12 - 18 : Afternoon
  4. 18 - 24 : Evening

This will serve as the time of day variable (IV - predictor).

I also extracted the specific weekday when the post was created.

  1. Monday
  2. Tuesday
  3. Wednesday
  4. Thursday
  5. Friday
  6. Saturday
  7. Sunday

This will serve as the weekday variable (IV - predictor).

tb <- tb %>%
  mutate(
    hour = lubridate::hour(created_at),
    weekday = weekdays(created_at),
    weekday = factor(weekday, levels = c(
      "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"
    )),
      time_of_day = case_when(
      hour >= 0 & hour < 6  ~ "Night",
      hour >= 6 & hour < 12 ~ "Morning",
      hour >= 12 & hour < 18 ~ "Afternoon",
      hour >= 18 & hour < 24 ~ "Evening"
    ),
    time_of_day = factor(time_of_day, levels = c("Morning", "Afternoon", "Evening", "Night"))
  )
head(tb)
## # A tibble: 6 × 13
##   text            created_at          author uri   has_images reply_to sentiment
##   <chr>           <dttm>              <chr>  <chr> <lgl>      <chr>        <dbl>
## 1 "This is reall… 2024-11-27 07:53:47 did:p… at:/… FALSE      <NA>         0.458
## 2 "\nThis was su… 2024-11-27 07:53:39 did:p… at:/… TRUE       <NA>         0    
## 3 "Haaland thriv… 2024-11-27 07:53:46 did:p… at:/… FALSE      at://di…     0.915
## 4 "Im really hol… 2024-11-27 07:53:47 did:p… at:/… FALSE      at://di…     0    
## 5 "Choose  books… 2024-11-27 07:53:42 did:p… at:/… FALSE      <NA>         0.400
## 6 "I like teasin… 2024-11-27 07:53:44 did:p… at:/… TRUE       <NA>         0.214
## # ℹ 6 more variables: date <date>, time <chr>, language <chr>, hour <int>,
## #   weekday <fct>, time_of_day <fct>

After categorizing the posts, I examined the distribution of posts across the time bins.

3 Sampling

3.1 Handling Data Imbalance

tb %>%
  mutate(hour = hour(ymd_hms(created_at))) %>%
  count(hour) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_col(fill = "steelblue") +
  scale_x_continuous(breaks = 0:23) +
  labs(title = "Distribution of Posts by Hour",
       x = "Hour of Day (24-hour format)",
       y = "Number of Posts") +
  theme_minimal()

weekday_counts <- tb %>%
  count(weekday) %>%
  mutate(percentage = round(100 * n / sum(n), 2))

ggplot(weekday_counts, aes(x = weekday, y = n)) +
  geom_bar(stat = "identity", fill = "#F8766D") +
  labs(title = "Post Count by Weekday",
       x = "Weekday",
       y = "Number of Posts") +
  theme_minimal()

I checked the distribution of the dataset over time. It seemed suspicious that most of the posts were created during the morning-to-afternoon hours and on Wednesdays.

I assume the data was scraped by its collector every week on Wednesday mornings and afternoons, which is why most of the posts fall into those bins.

3.2 Stratified Downsampling

Thankfully, the dataset is large enough that each category still contains a sufficient number of posts.

To handle the data imbalance, I applied stratified downsampling to ensure equal representation in each time-of-day and weekday group.

set.seed(123)

min_n_time <- tb %>%
  count(time_of_day) %>%
  pull(n) %>%
  min()

time_balanced <- tb %>%
  group_by(time_of_day) %>%
  slice_sample(n = min_n_time) %>%
  ungroup()

# Check balance
time_balanced %>%
  count(time_of_day)
## # A tibble: 4 × 2
##   time_of_day     n
##   <fct>       <int>
## 1 Morning       782
## 2 Afternoon     782
## 3 Evening       782
## 4 Night         782
# Stratified downsampling by weekday
set.seed(123)

min_n_weekday <- tb %>%
  count(weekday) %>%
  pull(n) %>%
  min()

weekday_balanced <- tb %>%
  group_by(weekday) %>%
  slice_sample(n = min_n_weekday) %>%
  ungroup()

# Check balance
weekday_balanced %>%
  count(weekday)
## # A tibble: 7 × 2
##   weekday       n
##   <fct>     <int>
## 1 Monday      482
## 2 Tuesday     482
## 3 Wednesday   482
## 4 Thursday    482
## 5 Friday      482
## 6 Saturday    482
## 7 Sunday      482

4 Hypothesis Testing - Time of Day

I will be testing whether there is a significant difference in sentiment across different times of the day (morning, afternoon, evening, and night).

4.1 Hypothesis

Predictor Variable (IV - Cat): Time of Day (Morning, Afternoon, Evening, Night)
Response Variable (DV - Num): VADER Sentiment Score of Bluesky Posts (-1 to 1)

Null Hypothesis (H₀): There is NO significant difference in sentiment across different times of the day (morning, afternoon, evening, and night).
Alternative Hypothesis (H₁): There is a significant difference in sentiment across different times of the day.

4.2 Test Statistic

# Violin plot
ggplot(time_balanced, aes(x = time_of_day, y = sentiment, fill = time_of_day)) +
  geom_violin(trim = FALSE, scale = "width") +
  stat_summary(fun = "median", geom = "point", shape = 21, size = 2, fill = "white") +
  labs(title = "Sentiment Distribution by Time of Day (Balanced)",
       x = "Time of Day",
       y = "Sentiment Score") +
  theme_minimal() +
  theme(legend.position = "none")

time_balanced %>%
  group_by(time_of_day) %>%
  summarise(
    median_sentiment = median(sentiment),
    mean_sentiment = mean(sentiment),
    n = n()
  )
## # A tibble: 4 × 4
##   time_of_day median_sentiment mean_sentiment     n
##   <fct>                  <dbl>          <dbl> <int>
## 1 Morning                0.202         0.202    782
## 2 Afternoon              0.168         0.187    782
## 3 Evening                0             0.0774   782
## 4 Night                  0             0.116    782
# Kolmogorov-Smirnov test for checking normality with large sample sizes

# Null Hypothesis (H₀): The data follows a normal distribution
# Alternative Hypothesis (H₁):The data does NOT follow a normal distribution

time_balanced %>%
  group_by(time_of_day) %>%
  summarise(ks_test = ks.test(sentiment, "pnorm", mean(sentiment), sd(sentiment))$p.value)
## Warning: There were 4 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `ks_test = ks.test(sentiment, "pnorm", mean(sentiment),
##   sd(sentiment))$p.value`.
## ℹ In group 1: `time_of_day = Morning`.
## Caused by warning in `ks.test.default()`:
## ! ties should not be present for the one-sample Kolmogorov-Smirnov test
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
## # A tibble: 4 × 2
##   time_of_day  ks_test
##   <fct>          <dbl>
## 1 Morning     7.84e-12
## 2 Afternoon   2.73e-10
## 3 Evening     1.70e-16
## 4 Night       2.97e-20

Before selecting a test statistic, the data were explored further. From the distributions above, it is apparent that the data are not normal.

The Kolmogorov-Smirnov test is a statistical test used to examine if variables are normally distributed.

The Kolmogorov-Smirnov test showed that for each group (Morning, Afternoon, Evening, Night), the p-value is < 0.001, so we REJECT the null hypothesis (H₀: the data follow a normal distribution) for all groups. Therefore, the data do not follow a normal distribution.

Therefore, we use an alternative to ANOVA: the Kruskal-Wallis test, a non-parametric test suitable for comparing more than two groups based on their ranks, to test for significant differences in sentiment between the time-of-day categories (alpha = 0.05 as the standard threshold).

# Kruskal-Wallis Test
kruskal.test(sentiment ~ time_of_day, data = time_balanced)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  sentiment by time_of_day
## Kruskal-Wallis chi-squared = 40.651, df = 3, p-value = 7.755e-09

Based on the test results (Kruskal-Wallis χ²(3) = 40.651, p-value < 0.001), we reject the null hypothesis and conclude that there is a significant difference in sentiment across the different times of the day.

4.3 Post Hoc Test

The pairwise Wilcoxon rank-sum test (Mann-Whitney U) is a non-parametric alternative to the two-sample t-test for comparing independent groups when the data are not normally distributed. P-values are adjusted for multiple comparisons with the Benjamini-Hochberg (BH) method.

# Perform post hoc test
time_pairwise <- pairwise.wilcox.test(time_balanced$sentiment, time_balanced$time_of_day, p.adjust.method = "BH")

time_pairwise
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  time_balanced$sentiment and time_balanced$time_of_day 
## 
##           Morning Afternoon Evening
## Afternoon 0.70201 -         -      
## Evening   4.1e-07 3.8e-06   -      
## Night     0.00023 0.00125   0.13051
## 
## P value adjustment method: BH

4.4 Results

There are significant differences in sentiment between Morning and Evening, Morning and Night, Afternoon and Evening, and Afternoon and Night (alpha = 0.05).

There is no significant difference between the remaining pairs (Morning vs. Afternoon and Evening vs. Night).
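For readability, the BH-adjusted p-value matrix can be reshaped into a long table and flagged at alpha = 0.05. A minimal sketch, reusing the time_pairwise object from above:

# Sketch: list each pairwise comparison and flag BH-adjusted p-values below 0.05
as.data.frame(time_pairwise$p.value) %>%
  rownames_to_column("group1") %>%
  pivot_longer(-group1, names_to = "group2", values_to = "p_adj") %>%
  filter(!is.na(p_adj)) %>%
  mutate(significant = p_adj < 0.05) %>%
  arrange(p_adj)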

5 Hypothesis Testing - Weekday

I will be testing whether there is a significant difference in sentiment across weekdays (Monday to Sunday).

5.1 Hypothesis

Predictor Variable (IV - Cat): Weekday (Monday-Sunday)
Response Variable (DV - Num): VADER Sentiment Score of Bluesky Posts (-1 to 1)

Null Hypothesis (H₀): There is NO significant difference in sentiment across weekdays.
Alternative Hypothesis (H₁): There is a significant difference in sentiment across weekdays.

5.2 Test Statistic

ggplot(weekday_balanced, aes(x = weekday, y = sentiment, fill = weekday)) +
  geom_violin(trim = FALSE, scale = "width") +
  stat_summary(fun = "median", geom = "point", shape = 21, size = 2, fill = "white") +
  labs(title = "Sentiment Distribution by Weekday (Balanced)",
       x = "Weekday",
       y = "Sentiment Score") +
  theme_minimal() +
  theme(legend.position = "none")

# Kolmogorov-Smirnov test for checking normality with large sample sizes

# Null Hypothesis (H₀): The data follows a normal distribution
# Alternative Hypothesis (H₁):The data does NOT follow a normal distribution

weekday_balanced %>%
  group_by(weekday) %>%
  summarise(ks_test = ks.test(sentiment, "pnorm", mean(sentiment), sd(sentiment))$p.value)
## Warning: There were 7 warnings in `summarise()`.
## The first warning was:
## ℹ In argument: `ks_test = ks.test(sentiment, "pnorm", mean(sentiment),
##   sd(sentiment))$p.value`.
## ℹ In group 1: `weekday = Monday`.
## Caused by warning in `ks.test.default()`:
## ! ties should not be present for the one-sample Kolmogorov-Smirnov test
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 6 remaining warnings.
## # A tibble: 7 × 2
##   weekday    ks_test
##   <fct>        <dbl>
## 1 Monday    5.04e-10
## 2 Tuesday   4.05e-13
## 3 Wednesday 2.88e-10
## 4 Thursday  3.58e- 8
## 5 Friday    1.24e- 9
## 6 Saturday  6.51e-11
## 7 Sunday    3.07e-11

Before selecting a test statistic, the data were explored further. From the distributions above, it is apparent that the data are not normal.

The Kolmogorov-Smirnov test is a statistical test used to examine if variables are normally distributed.

The Kolmogorov-Smirnov test showed that for each group (Monday through Sunday), the p-value is < 0.001, so we REJECT the null hypothesis (H₀: the data follow a normal distribution) for all groups. Therefore, the data do not follow a normal distribution.

Therefore, we use an alternative to ANOVA: the Kruskal-Wallis test, a non-parametric test suitable for comparing more than two groups based on their ranks, to test for significant differences in sentiment between weekdays (alpha = 0.05 as the standard threshold).

# Kruskal-Wallis Test by weekday
kruskal.test(sentiment ~ factor(weekday), data = weekday_balanced)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  sentiment by factor(weekday)
## Kruskal-Wallis chi-squared = 17.281, df = 6, p-value = 0.008304

Based on the test results (Kruskal-Wallis χ²(6) = 17.281, p-value = 0.0083), the p-value falls below alpha = 0.05, so we reject the null hypothesis and conclude that there is evidence of a significant difference in sentiment across weekdays.

5.3 Post Hoc Test

Although the omnibus Kruskal-Wallis test reaches significance at alpha = 0.05, pairwise post hoc comparisons were not part of the original analysis; the natural follow-up would be a pairwise Wilcoxon test with BH adjustment, as in Section 4.3 (see the sketch below).
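A minimal sketch of that follow-up, reusing the weekday_balanced data frame (results not shown here):

# Sketch (not run in the original analysis): pairwise Wilcoxon rank-sum tests
# across weekdays with Benjamini-Hochberg p-value adjustment
pairwise.wilcox.test(weekday_balanced$sentiment,
                     weekday_balanced$weekday,
                     p.adjust.method = "BH")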

6 Visualization

# Add row ID
time_balanced <- time_balanced %>%
  mutate(row_id = row_number())

# Filter morning and remove stop words
morning_words <- time_balanced %>%
  filter(time_of_day == "Morning") %>%
  select(row_id, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% expanded_stop_words,
         hunspell_check(word, dict = "en_US")) #stop word filter

# Calculate co-occurrence counts (within same post)
morning_pairs <- morning_words %>%
  pairwise_count(word, row_id, sort = TRUE, upper = FALSE)

# Filter for strong connections only
morning_pairs_filtered <- morning_pairs %>%
  filter(n >= 3)  # Only keep strong links

# Build graph
graph <- morning_pairs_filtered %>%
  graph_from_data_frame()

# Plot with ggraph
ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  labs(title = "Morning Word Co-Occurrence Network") +
  theme_void()

# Filter afternoon and remove stop words
afternoon_words <- time_balanced %>%
  filter(time_of_day == "Afternoon") %>%
  select(row_id, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% expanded_stop_words,
         hunspell_check(word, dict = "en_US")) #stop word filter

# Calculate co-occurrence counts (within same post)
afternoon_pairs <- afternoon_words %>%
  pairwise_count(word, row_id, sort = TRUE, upper = FALSE)

# Filter for strong connections only
afternoon_pairs_filtered <- afternoon_pairs %>%
  filter(n >= 3)  # Only keep strong links

# Build graph
graph <- afternoon_pairs_filtered %>%
  graph_from_data_frame()

# Plot with ggraph
ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  labs(title = "Afternoon Word Co-Occurrence Network") +
  theme_void()

# Filter evening and remove stop words
evening_words <- time_balanced %>%
  filter(time_of_day == "Evening") %>%
  select(row_id, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% expanded_stop_words,
         hunspell_check(word, dict = "en_US")) #stop word filter

# Calculate co-occurrence counts (within same post)
evening_pairs <- evening_words %>%
  pairwise_count(word, row_id, sort = TRUE, upper = FALSE)

# Filter for strong connections only
evening_pairs_filtered <- evening_pairs %>%
  filter(n >= 3)  # Only keep strong links

# Build graph
graph <- evening_pairs_filtered %>%
  graph_from_data_frame()

# Plot with ggraph
ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  labs(title = "Evening Word Co-Occurrence Network") +
  theme_void()

# Filter night and remove stop words
night_words <- time_balanced %>%
  filter(time_of_day == "Night") %>%
  select(row_id, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% expanded_stop_words,
         hunspell_check(word, dict = "en_US")) #stop word filter

# Calculate co-occurrence counts (within same post)
night_pairs <- night_words %>%
  pairwise_count(word, row_id, sort = TRUE, upper = FALSE)

# Filter for strong connections only
night_pairs_filtered <- night_pairs %>%
  filter(n >= 3)  # Only keep strong links

# Build graph
graph <- night_pairs_filtered %>%
  graph_from_data_frame()

# Plot with ggraph
ggraph(graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "skyblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, size = 4) +
  labs(title = "Night Word Co-Occurrence Network") +
  theme_void()

# Function to compute network
network_stats_by_time <- function(data, time_label) {
  # Filter for one time_of_day group
  word_pairs <- data %>%
    filter(time_of_day == time_label) %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% expanded_stop_words,
           hunspell_check(word, dict = "en_US")) %>%
    pairwise_count(word, row_id, sort = TRUE, upper = FALSE) %>%
    filter(n >= 3)

  # Build graph
  g <- graph_from_data_frame(word_pairs, directed = FALSE)

  # Compute metrics
  data.frame(
    time_of_day = time_label,
    nodes = vcount(g),
    edges = ecount(g),
    density = edge_density(g),
    components = components(g)$no,
    modularity = modularity(cluster_louvain(g))
  )
}

# Apply to all 4 time bins
network_summary <- bind_rows(
  network_stats_by_time(time_balanced, "Morning"),
  network_stats_by_time(time_balanced, "Afternoon"),
  network_stats_by_time(time_balanced, "Evening"),
  network_stats_by_time(time_balanced, "Night")
)

network_summary
##   time_of_day nodes edges    density components modularity
## 1     Morning    23    42 0.16600791          5  0.5136054
## 2   Afternoon     7     4 0.19047619          3  0.6250000
## 3     Evening    45    66 0.06666667         13  0.7470156
## 4       Night    51    55 0.04313725         13  0.7552066
# Bar chart of modularity
ggplot(network_summary, aes(x = factor(time_of_day, levels = c("Morning", "Afternoon", "Evening", "Night")), 
                             y = modularity, fill = time_of_day)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Topic Modularity by Time of Day",
       x = "Time of Day",
       y = "Modularity Score") +
  theme_minimal()

Modularity: how well the network is divided into distinct groups.

Modularity is low in the morning and afternoon, and high in the evening and night.

Social media post topics get more complex and partitioned as the day progresses.
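To make the modularity metric concrete, here is a toy sketch (unrelated to the Bluesky data, assuming igraph is loaded) contrasting a graph with a clean two-community split against a random graph of the same size:

# Toy illustration: modularity is high when a graph splits cleanly into
# communities and lower when edges are spread roughly evenly
set.seed(123)
two_cliques  <- make_full_graph(10) + make_full_graph(10) # two disconnected 10-node cliques
random_graph <- sample_gnp(20, p = 0.3)                   # Erdos-Renyi random graph, 20 nodes

modularity(cluster_louvain(two_cliques))   # exactly 0.5: a perfect two-way split
modularity(cluster_louvain(random_graph))  # typically much lower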

# Build co-occurrence network per time bin
build_cooc_graph <- function(df, time_label) {
  word_pairs <- df %>%
    filter(time_of_day == time_label) %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% expanded_stop_words,
           hunspell_check(word, dict = "en_US")) %>%
    pairwise_count(word, row_id, sort = TRUE, upper = FALSE) %>%
    filter(n >= 3)
  
  g <- graph_from_data_frame(word_pairs, directed = FALSE)
  g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
  
  # Add Louvain communities
  V(g)$community <- cluster_louvain(g)$membership
  g
}

# Build graphs
g_morning   <- build_cooc_graph(time_balanced, "Morning")
g_afternoon <- build_cooc_graph(time_balanced, "Afternoon")
g_evening   <- build_cooc_graph(time_balanced, "Evening")
g_night     <- build_cooc_graph(time_balanced, "Night")

# Plot function
plot_network <- function(graph, title) {
  ggraph(graph, layout = "fr") +
    geom_edge_link(alpha = 0.2) +
    geom_node_point(aes(color = as.factor(community)), size = 3) +
    geom_node_text(aes(label = name), repel = TRUE, size = 3) +
    labs(title = title) +
    theme_void() +
    theme(legend.position = "none")
}

# Generate plots
p1 <- plot_network(g_morning, "Morning")
p2 <- plot_network(g_afternoon, "Afternoon")
p3 <- plot_network(g_evening, "Evening")
p4 <- plot_network(g_night, "Night")

# Combine plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Co-Occurrence: How often terms appear together in a post for each group.

Community Detection: which terms cluster together into topic groups.

We see fewer clusters early in the day and more clusters late at night.