Textmining Friends: S1 E2 - The One with the Most Frequent Words

This Friends textmining effort in R was my Saturday project during a series of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, of Text Mining with R by Julia Silge and David Robinson (which is highly recommended), and of piping from the magrittr package (which makes things so much nicer). If you haven’t read part one, where we import and clean up the scripts, it can be found here.

Disclaimer – I do not claim to be an expert in textmining; there may be faster, smarter, or nicer ways to achieve any given step. Still, maybe you’ll find something interesting for your own projects - or just some funny tidbit about your favourite show. In this second “episode,” we’ll look at word volume and word “share” by character, as well as each character’s most frequent words - overall and across time. Enjoy!

Isabell Hubert 2018

website :: twitter


Prep

We’ll restore the workspace from part one and load the following libraries:

load(".RData")     # this is necessary for knitting, as knitr starts a separate R session
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)
library(magrittr)
library(ggplot2)

This is our starting point:

head(tokens)

##   Friend season episode lineNum    word
## 1 Monica      1       1       2 there's
## 2 Monica      1       1       2 nothing
## 3 Monica      1       1       2      to
## 4 Monica      1       1       2    tell
## 5 Monica      1       1       2    he's
## 6 Monica      1       1       2    just

Script source: https://fangj.github.io/friends/

Word Volume

Let’s first find out which characters talk more than others. For this, we are concerned with the total “volume” of words spoken by each character, so we’re not excluding stopwords just yet.

Let’s define a character vector containing our six Friends, and one containing the extended cast - feel free to include other characters you are interested in. We’ll also create a simple vector of the integers 1 through 10, which will come in handy for plotting:

friendsNames <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe")
friendsExtended <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe", "Janice", "Gunther")
seasons <- 1:10

Let’s find out how many words were spoken per character per episode. Note that these are absolute numbers - we’ll find a better way soon:

ep.vol <- tokens %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend, season, episode) %>%
    count()
head(ep.vol)

## # A tibble: 6 x 4
## # Groups:   Friend, season, episode [6]
##   Friend   season episode     n
##   <chr>     <int>   <int> <int>
## 1 Chandler      1       1   395
## 2 Chandler      1       2   233
## 3 Chandler      1       3   467
## 4 Chandler      1       4   353
## 5 Chandler      1       5   343
## 6 Chandler      1       6   841

Sometimes there are simply more words spoken in total in an episode, which wouldn’t show in the individual counts per character. What we actually want to know is each character’s share of the total words spoken by all Friends per episode:

ep.share <- tokens %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend, season, episode) %>%
    count() %>%
    ungroup() %>%
    group_by(season, episode) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100)

head(ep.share)

## # A tibble: 6 x 6
## # Groups:   season, episode [6]
##   Friend   season episode     n totalWords wordPerc
##   <chr>     <int>   <int> <int>      <int>    <dbl>
## 1 Chandler      1       1   395       3091     12.8
## 2 Chandler      1       2   233       1725     13.5
## 3 Chandler      1       3   467       2123     22.0
## 4 Chandler      1       4   353       2177     16.2
## 5 Chandler      1       5   343       2164     15.9
## 6 Chandler      1       6   841       2153     39.1

Let’s create a simple plot, facetted by season and with different colors for each Friend:

ggplot(ep.share, aes(x = episode, y = wordPerc, color = Friend, fill = Friend)) +
    geom_line() +
    # geom_smooth(method = "loess", se = FALSE) +
    facet_grid(season ~ .) +
    theme_minimal()

Now this is very small and we basically cannot see a thing, other than a few outliers (like Joey late in Season 8, and Chandler in S9 E10).
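As a quick check on those outliers (a sketch; output omitted), we can sort the per-episode shares directly:

# quick check on the outliers: the largest per-episode word shares (output omitted)
ep.share %>%
    ungroup() %>%
    arrange(desc(wordPerc)) %>%
    head(3)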

Let’s create a looping function (see e.g. Rich Majerus’s Using Loops with ggplot2) and loop over each Friend. We’ll also add some trendlines:

# create function
ep.share.viz <- function(df, na.rm = TRUE, ...){

  # create for loop 
  for (i in seq_along(friendsNames)) { 

    # plot for each Friend
    plot <- ggplot(subset(df, df$Friend==friendsNames[i]),aes(x = episode, y = wordPerc, color = Friend, fill = Friend)) +
    facet_wrap(~season) +
    theme_minimal() +
    geom_line(alpha = 0.2) +
    geom_smooth(method = "loess", color = "pink", fill = "gray") +
    #geom_smooth(method = "lm", color = "blue", fill = "blue", alpha = 0.2) +
    theme(legend.position = "none") +
  labs(title = friendsNames[i]) +
  scale_y_continuous(limits = c(0,50))

    # print plot
    print(plot)
  }
}

# run function
ep.share.viz(ep.share)

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

This is mostly just squiggles, and too grainy – other than noting that Phoebe has a slight uptick in volume at the ends of Seasons 3 and 4, for example, there isn’t really much to be said about these plots.

Let’s look at trends across the entire show instead, to compare the characters to each other:

s.share <- tokens %>%
    filter(Friend %in% friendsNames) %>%
    group_by(Friend, season) %>%
    count() %>%
    ungroup() %>%
    group_by(season) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100)
head(s.share)

## # A tibble: 6 x 5
## # Groups:   season [6]
##   Friend   season     n totalWords wordPerc
##   <chr>     <int> <int>      <int>    <dbl>
## 1 Chandler      1  8800      48528     18.1
## 2 Chandler      2  8931      52533     17.0
## 3 Chandler      3  8952      54459     16.4
## 4 Chandler      4  9278      51438     18.0
## 5 Chandler      5  8618      54998     15.7
## 6 Chandler      6 10811      60304     17.9

And plot this as well:

ggplot(s.share, aes(x = season, y = wordPerc, color = Friend, fill = Friend)) +
    geom_smooth(method = "loess", se = FALSE, alpha = 0.3) +
    #geom_smooth(method = lm, se = FALSE) +
    scale_x_continuous(breaks = seasons) +
    theme_minimal()

  • We see that Ross’s word share is U-shaped - he starts off strong, the writers then give him less space, and he recovers around Season 7.
  • Rachel and Joey show the opposite pattern; in fact, there are four seasons (5 to 8) where the two of them lead the pack. That’s a big deal considering Joey started off as the most minor character of them all!
  • Monica has the most stable “word share” of them all (a rough check follows below).
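As a rough check on Monica’s stability (a sketch; output omitted), we can compare how much each Friend’s season-level share varies:

# spread of each Friend's season-level word share; smaller = more stable (output omitted)
s.share %>%
    group_by(Friend) %>%
    summarise(shareSD = sd(wordPerc)) %>%
    arrange(shareSD)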

Let’s use violin plots to visualize word share by Friend and season, which can show us outlier episodes:

ggplot(ep.share, aes(x = season, y = wordPerc, color = Friend, fill = Friend, group = season)) +
    geom_violin(alpha = 0.8) +
    geom_boxplot(width  = 0.1, fill = "white") +
    facet_wrap(~ Friend, ncol = 3) +
    scale_x_continuous(breaks = seasons) +
    labs(title = "Share of Words per Episode across the Seasons", x = NULL, y = NULL) +
    theme(legend.position = "none")

Chandler and Joey have the two biggest outlier episodes. What are they?

ep.share %>%
    filter(Friend == "Chandler") %>%
    arrange(desc(n)) %>%
    ungroup() %>%
    top_n(2)

## Selecting by wordPerc

## # A tibble: 2 x 6
##   Friend   season episode     n totalWords wordPerc
##   <chr>     <int>   <int> <int>      <int>    <dbl>
## 1 Chandler      9      10  1312       2523     52.0
## 2 Chandler      1       6   841       2153     39.1

For Chandler, it is: S9, E10 (1312 words, out of a total of 2523 - that’s more than half of all the words spoken by our Friends in this episode!) - “The One with Christmas in Tulsa”:

Chandler must spend Christmas in Tulsa, where Monica worries he might have an affair with a former Miss Oklahoma runner-up. Chandler comes home early to announce that he quit his job.

That makes sense.

… and for Joey in Season 8?

ep.share %>%
    filter(Friend == "Joey") %>%
    arrange(desc(n)) %>%
    ungroup() %>%
    top_n(2)

## Selecting by wordPerc

## # A tibble: 2 x 6
##   Friend season episode     n totalWords wordPerc
##   <chr>   <int>   <int> <int>      <int>    <dbl>
## 1 Joey        8      19  1201       2157     55.7
## 2 Joey        6      20   894       2510     35.6

It’s Season 8, Episode 19, “The One with Joey’s Interview”:

Joey prepares for his epic interview with Soap Opera Digest, reminiscing about his past acting experiences, his friends and his romantic life.

He accounts for more than half of all words spoken among our Friends - which also definitely makes sense.

Removing Stop Words

The word volumes we looked at up until now include words like “the”, “a”, “an,” etc., which do not contribute much content, or tell us much about the character. They’re called stop words, and can easily be removed via an anti_join with the stop word list in tidytext:

nonstop <- tokens %>%
    anti_join(stop_words, by = "word")

However, this still leaves things like “yeah” and “uhm” in - let’s remove those as well, putting our own personal list of stop words into a small data frame:

# name the column "word" directly and keep it as character
# (this also avoids a factor-coercion warning in the anti_join)
moreStop <- data.frame(word = c("uh", "hey", "yeah", "um", "gonna", "oh", "ooh", "ya", "huh", "guy", "guys", "ah", "y'know", "wanna", "umm", "ohh"),
                       stringsAsFactors = FALSE)
friends <- nonstop %>%
    anti_join(moreStop, by = "word")

We can remove these intermediate dataframes, as we won’t need them anymore. We now have friends, the df to use for all analyses that exclude stop words, and tokens, the df to use whenever raw volume is the question.

rm(moreStop, nonstop)
write.csv(friends, file = "friends.csv")
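As a quick sanity check (a sketch; output omitted), comparing row counts shows how many tokens the two stop word filters removed:

# number of tokens dropped by the stop word filters (output omitted)
nrow(tokens) - nrow(friends)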

Most Frequent Words

… by Friend

So what are the most frequent words by Friend, according to absolute word counts?

friends %>%
    group_by(Friend, word) %>%
    count() %>%
    filter(Friend %in% friendsExtended) %>%
    ungroup() %>%
    group_by(Friend) %>%
    arrange(desc(n)) %>%
    top_n(5)

## Selecting by n

## # A tibble: 57 x 3
## # Groups:   Friend [8]
##    Friend word         n
##    <chr>  <chr>    <int>
##  1 Rachel ross       535
##  2 Rachel god        482
##  3 Rachel joey       335
##  4 Monica god        327
##  5 Joey   ross       326
##  6 Phoebe god        282
##  7 Monica chandler   281
##  8 Monica phoebe     246
##  9 Ross   rachel     238
## 10 Monica joey       224
## # … with 47 more rows

That is great, but again it might be more interesting to know the most frequent words by Friend relative to all the words spoken by that character. Let’s first count how many times each character said each word:

wordByFriend <- friends %>%
    group_by(Friend, word) %>%
    count() %>%
    filter(Friend %in% friendsExtended) 
head(wordByFriend)

## # A tibble: 6 x 3
## # Groups:   Friend, word [6]
##   Friend   word      n
##   <chr>    <chr> <int>
## 1 Chandler 0         1
## 2 Chandler 00        8
## 3 Chandler 007       2
## 4 Chandler 1         4
## 5 Chandler 1,000     1
## 6 Chandler 1,500     3

Now we’ll add a sum column, with a count of all the non-stopwords this character ever said, and a percentage column:

wordByFriend %>%
    group_by(Friend) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100) %>%
    ungroup() %>%
    arrange(desc(wordPerc))

## # A tibble: 31,416 x 5
##    Friend  word         n totalWords wordPerc
##    <chr>   <chr>    <int>      <int>    <dbl>
##  1 Gunther rachel      16        229     6.99
##  2 Janice  chandler    28        636     4.40
##  3 Rachel  ross       535      22196     2.41
##  4 Rachel  god        482      22196     2.17
##  5 Janice  god         13        636     2.04
##  6 Gunther birthday     4        229     1.75
##  7 Janice  bye         11        636     1.73
##  8 Monica  god        327      20035     1.63
##  9 Janice  bing        10        636     1.57
## 10 Janice  love        10        636     1.57
## # … with 31,406 more rows

Gunther knocks it out of the park – nearly 7% of all his “meaningful” words uttered are “Rachel”! He is closely followed by Janice, no doubt because (a) those two characters were fairly limited in their themes on the show, and (b) they were not given nearly as much opportunity as the core Friends to utter a larger variety of types and tokens.
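To illustrate that last point (a sketch; output omitted), we can compare how many distinct word types each character gets to utter:

# distinct word types vs. total tokens per character (output omitted)
friends %>%
    filter(Friend %in% friendsExtended) %>%
    group_by(Friend) %>%
    summarise(types = n_distinct(word), tokens = n()) %>%
    arrange(desc(types))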

We’ll save the top 5 words by friend:

top5 <- wordByFriend %>%
    group_by(Friend) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100) %>%
    top_n(5) %>%
    ungroup()

## Selecting by wordPerc

And plot those, comparing an untransformed y-axis to a log-transformed one:

ggplot(top5, aes(x = Friend, y = wordPerc, fill = Friend, label = word)) +
    geom_label(position = position_jitter(width = 0.3, height = 0.3), color = "white") +
    theme_minimal() +
    labs(title = "Five Most Frequent Words per Friend", x = NULL, y = "%") +
    theme(legend.position = "none")

# log scale:
ggplot(top5, aes(x = Friend, y = wordPerc, fill = Friend, label = word)) +
    geom_label(position = position_jitter(width = 0.3, height = 0.3), color = "white") +
    theme_minimal() +
    labs(title = "Five Most Frequent Words per Friend", x = NULL, y = "log(%)") +
    theme(legend.position = "none") +
    scale_y_continuous(trans = "log")

The log transform doesn’t really add value here - words are still crammed near the bottom, especially where a multitude of words are tied (see Gunther), and there are only two outliers near the top.

Instead of log-transforming the scale, let’s stay with the untransformed axis and the jittering, and reduce the clutter by plotting only the three most frequent words by Friend, which gives the labels some more space:

wordByFriend %>%
    group_by(Friend) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100) %>%
    top_n(3) %>%
    ungroup() %>%
ggplot(aes(x = Friend, y = wordPerc, fill = Friend, label = word)) +
    geom_label(position = position_jitter(width = 0.3, height = 0.3), color = "white") +
    theme_minimal() +   # apply the complete theme first, so it doesn't override the legend removal below
    labs(title = "The Three Most Frequent Words per Friend", x = NULL, y = "%") +
    theme(legend.position = "none")

## Selecting by wordPerc
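If the labels still overlap too much, an alternative to jittering (not used here; a sketch assuming the ggrepel package is installed) is to let ggrepel nudge them apart:

# alternative sketch: ggrepel repels overlapping labels automatically
# (assumes the ggrepel package is installed)
library(ggrepel)

ggplot(top5, aes(x = Friend, y = wordPerc, fill = Friend, label = word)) +
    geom_label_repel(color = "white") +
    theme_minimal() +
    labs(title = "Five Most Frequent Words per Friend", x = NULL, y = "%") +
    theme(legend.position = "none")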

… by Season

Do the most frequent words change over time?

Let’s look at how often our Friends said each word they said by Season:

wordByFriendSeason <- friends %>%
    group_by(Friend, word, season) %>%
    count() %>%
    filter(Friend %in% friendsExtended) 

We’ll again add a percentage column:

top3season <- wordByFriendSeason %>%
    group_by(Friend, season) %>%
    mutate(totalWords = sum(n), wordPerc = (n/totalWords)*100) %>%
    top_n(3) %>%
    ungroup()

## Selecting by wordPerc

Let’s plot the top three words by Friend and by season:

# create function
topwords.s.viz <- function(df, na.rm = TRUE, ...){

  # loop over the Friends
  for (i in seq_along(friendsNames)) {

    # create plot for each Friend
    plot <- ggplot(subset(df, df$Friend == friendsNames[i]),
                   aes(x = season, y = wordPerc, fill = factor(season), label = word, group = season)) +
      geom_label(position = position_jitter(width = 0.2, height = 0.2), color = "white", fontface = "bold", size = 3.2) +
      theme_minimal() +
      labs(title = paste0(friendsNames[i], "'s Top 3 Words by Season"), x = NULL, y = "% of all words by Friend") +
      theme(legend.position = "none") +
      scale_x_continuous(breaks = seasons)

    # print plot
    print(plot)
  }
}

# run function
topwords.s.viz(top3season)

  • I am enjoying the fact that Phoebe’s most frequent “word” in Seasons 1 and 2 is “la.”
  • Ross goes from “Marcel” via “Julie” and “Emily” to “Rachel.”
  • Chandler’s is mostly “Joey,” unless it’s “Monica” - or “Janice.”
  • I would not have thought that Joey’s most frequent word in six out of ten seasons is “Ross.”
  • Rachel’s is mostly “Ross.”
  • Monica’s is mostly “god” or “Chandler.”

Finally, let’s join our data with the nrc and bing sentiment lexicons and save the results - this sets up the next episode:

# attach NRC sentiment/emotion categories to each word
friends.nrc <- friends %>%
  inner_join(get_sentiments("nrc"))

## Joining, by = "word"

# attach bing positive/negative labels to each word
friends.bing <- friends %>%
  inner_join(get_sentiments("bing"))

## Joining, by = "word"

write.csv(friends.bing, file = "friends-bing.csv")
write.csv(friends.nrc, file = "friends-nrc.csv")

Next up: S1 E3, The One with the Sentiment Analysis.