Textmining Friends: S1 E4 - The One with the TF-IDF's

This Friends textmining effort in R was my Saturday project during a range of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, Text Mining in R by Julia Silge and David Robinson (which is highly recommended), and piping from the magrittr package (which makes things so much nicer.) If you haven’t read the previous three episodes, they are:

You can find a tutorial by Rich Majerus on how to loop with ggplot2 here.

Disclaimer – I do not claim to be an expert in textmining. There may be faster, smarter, or nicer ways to achieve a certain thing. Still, maybe you’ll find something interesting for your own projects - or just some funny tidbit about your favourite show. In this fourth “episode,” we’ll do some TF-IDF (term frequency–inverse document frequency) analyses - essentially, we’ll try to find out what the most characteristic words for each Season and each Friend are. See also here and here.

Isabell Hubert 2018

Prep

library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(magrittr)
library(ggplot2)

And the dataframes we’ll need:

tokens <- readRDS("tokens.rds")     # pre stopword anti-join - useful for word volume
friends <- readRDS("friends-df.rds")            # the cleaned-up, post stopword anti-join

And define some useful character vectors we can use later for filtering, plotting, and looping:

friendsNames <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe")
friendsExtended <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe", "Janice", "Gunther")
seasons = c(1:10)

Token Frequencies

We’ll start with calculating some overall token frequencies - basically how many words were said total in the entire show, and how many times each word was said:

token.count <- tokens %>%
count(word, sort = TRUE) %>%
mutate(total = sum(n))

We’ll also create some counts by season:

s.token.count <- tokens %>%
count(word, season, sort = TRUE) %>%
group_by(season) %>%
mutate(seasonTotal = sum(n))

We can then illustrate Zipf’s law:

ggplot(s.token.count, aes(n/seasonTotal, fill = factor(season))) +
geom_histogram(show.legend = FALSE, binwidth = 0.0001) +
facet_wrap(~season, ncol = 2, scales = "free_y") +
coord_cartesian(xlim = c(0, 0.005))

TF-IDF

The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.

(from Text Mining with R)[https://www.tidytextmining.com/tfidf.html]

by Season

s.tfidf <- s.token.count %>%
bind_tf_idf(season, word, n)

Let’s look at the highest TF-IDF values:

s.tfidf %>%
arrange(desc(tf_idf))

## # A tibble: 44,621 x 7
## # Groups:   season [10]
##    word       season     n seasonTotal    tf   idf tf_idf
##    <chr>       <int> <int>       <int> <dbl> <dbl>  <dbl>
##  1 fonzie          5    14       62328     1  1.39   1.39
##  2 resolution      5     8       62328     1  1.39   1.39
##  3 deed            5     7       62328     1  1.39   1.39
##  4 gala            5     7       62328     1  1.39   1.39
##  5 caitlin         5     6       62328     1  1.39   1.39
##  6 fogged          5     6       62328     1  1.39   1.39
##  7 godfather       5     6       62328     1  1.39   1.39
##  8 pbs             5     6       62328     1  1.39   1.39
##  9 deeds           5     5       62328     1  1.39   1.39
## 10 dropper         5     5       62328     1  1.39   1.39
## # … with 44,611 more rows

Fonzie, resolution, deed, gala… These sure are unique. Maybe a little too much, as we’ll see below.

Let’s visualize them:

s.tfidf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(season) %>%
top_n(5) %>%
ggplot(aes(word, tf_idf, fill = season)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~season, ncol = 2, scales = "free") +
coord_flip()

## Warning in mutate_impl(.data, dots, caller_env()): Unequal factor levels:
## coercing to character

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

We notice that his is a mess. There are simply too many words with a very high TF-IDF value. If you go back to the previous output, you’ll see that there are a great many words with the same high value - essentially, there are too many words that are “too unique,” what with being said only once and all. These may not necessarily be the words that define our Friends. Looking at the plot facets, “hmmmmmmmmmm”” and “nononononononnonoo” are not very distinctive “words” in general, and they are certainly not what we think about when we think Friends. So let’s try limiting the plots to words that were said more than 10 times:

s.tfidf %>%
filter(n > 10) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(season) %>%
top_n(10) %>%
ggplot(aes(reorder(word, tf_idf), tf_idf, fill = factor(season))) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = NULL, title = "Words with highest TF-IDF by Season") +
facet_wrap(~season, ncol = 5, scales = "free") +
coord_flip()

## Warning in mutate_impl(.data, dots, caller_env()): Unequal factor levels:
## coercing to character

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

Much better. Here we can actually see some topics that were pervasive in each season:

• Paolo, Marcel the monkey, and Ross’s ex-wife Carol feature prominently in S1.
• We meet Ross’s doppelganger Russ in S2, and Monica makes some Mockolate.
• S3 is full of dates and other side characters, and Monica *bang*s Ben’s head into the wall.
• Rachel meets Joshua in S4, and Chandler goes to Yemen.
• Ross and Rachel fight over the annulment in S6, and Joey gets a part on Mac and C.H.E.E.S.E.
• Ross meets Mona in S8, Joey’s sister Dina makes an appearance, and Rachel is pregnant.
• Chandler spends most of his time in* Tulsa, Oklahoma,* in S9. He and Monica are also trying to have a baby.
• Mike Hannigan features prominently in S10, and Chandler and Monica talk a lot about adoption. Emma is now in the picture.

by Friend

Instead of looking for distinctive words by Season, how about we find out what makes our Friends unique?

f.tfidf <- tokens %>%
count(word, Friend, sort = TRUE) %>%
ungroup() %>%
group_by(Friend) %>%
mutate(total = sum(n)) %>%
bind_tf_idf(Friend, word, n)

Let’s look at high TF-IDFwords:

f.tfidf %>%
filter(Friend %in% friendsNames) %>%
group_by(Friend) %>%
arrange(desc(tf_idf))

## # A tibble: 34,207 x 7
## # Groups:   Friend [6]
##    word      Friend     n total    tf   idf tf_idf
##    <chr>     <chr>  <int> <int> <dbl> <dbl>  <dbl>
##  1 barca     Monica     6 83634     1  1.10   1.10
##  2 lounger   Monica     5 83634     1  1.10   1.10
##  3 humidity  Monica     4 83634     1  1.10   1.10
##  4 meddle    Monica     4 83634     1  1.10   1.10
##  5 ache      Monica     3 83634     1  1.10   1.10
##  6 caesar    Monica     3 83634     1  1.10   1.10
##  7 calamari  Monica     3 83634     1  1.10   1.10
##  8 cries     Monica     3 83634     1  1.10   1.10
##  9 efficient Monica     3 83634     1  1.10   1.10
## 10 fund      Monica     3 83634     1  1.10   1.10
## # … with 34,197 more rows

And let’s visualize them:

f.tfidf %>%
filter(Friend %in% friendsNames) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(Friend) %>%
top_n(1) %>%
ggplot(aes(word, tf_idf, fill = Friend)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~Friend, ncol = 2, scales = "free") +
coord_flip()

## Warning in mutate_impl(.data, dots, caller_env()): Unequal factor levels:
## coercing to character

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

Again, there’s just too many words that are too unique. Let’s try limiting our search to words that were said more than five times:

f.tfidf %>%
filter(Friend %in% friendsExtended) %>%
filter(n > 5) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(Friend) %>%
top_n(7) %>%
ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Friend)) +
geom_col(show.legend = FALSE) +
labs(title = "Highest TF-IDF Words - Said more than 5 Times", x = NULL, y = NULL) +
facet_wrap(~Friend, ncol = 4, scales = "free") +
coord_flip()

## Warning in mutate_impl(.data, dots, caller_env()): Unequal factor levels:
## coercing to character

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

Getting better! We can now see some distinguishing features of our Friends.

• Gunther, quite obviously, uses the word Rachel more distinctively than other Friends.
• Same for Janice and Sid; but also Bing. And her own name.
• Joey is ahead of his time and *tweet*s a lot (this might have to do with the Duck and the Chick). He also says “neurosurgeon” a lot, due to Dr. Drake Remoray. Words from acting and commercials further distinguish him.
• Monica uses the barca lounger distinctively, makes Mockolate, and obviously talks a lot about her catering business.
• Phoebe “ree”s a lot, says uhuh more than the others, and uses the syllable “la” frequently. In her songs.
• Rachel shoops a lot before her skiing trip, and wants “closure.” She also works as a junior buyer and is infatuated with Joshua.
• Ross apparently enjoys typing; other distinctive words for him are Carol and job/science-related words like species, evolution, and paleontology.

Let’s see how things change when we look at words said even more often - will the words get more distinct/familiar, or will they blend together more?

f.tfidf %>%
filter(Friend %in% friendsExtended) %>%
filter(n > 10) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(Friend) %>%
top_n(8) %>%
ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Friend)) +
geom_col(show.legend = FALSE) +
labs(title = "Highest TF-IDF Words - Said more than 10 Times", x = NULL, y = NULL) +
facet_wrap(~Friend, ncol = 4, scales = "free") +
coord_flip()

## Warning in mutate_impl(.data, dots, caller_env()): Unequal factor levels:
## coercing to character

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

## Warning in mutate_impl(.data, dots, caller_env()): binding character and
## factor vector, coercing into character vector

• Janice is now all about saying bye to Chandler.
• Joey’s list is full of acting-related words - (Dr. Drake) *Remoray, agent, director, robot, *and Estelle.
• Monica’s list is led by Michelle (Burke), Richard’s daughter, presumably due to the message on her answering machine. Some of her most distinctive words are 7, 6, and 4, which I’m pretty sure we all know why. She also talks a lot about adoption, and the boots she paid so much money for and then hated. (I am surprised to not see Richard listed, but then again, the other Friends also mention him a fair bit.)
• I am delighted to see that la and smelly top Phoebe’s list. That’s the way it should be. She also talks about massage *client*s, Mike, her brother Frank, and her guitar.
• Rachel talks a lot about Joshua, and Joanna, a colleague at Bloomingdale’s. Not surprisingly, her almost-husband Barry also plays a role, along with his wife Mindy.
• Ross’s most distinctive words come primarily from the science-y realm and his relationship with Carol. Pets Chichi and Marcel are also listed.