This Friends text mining effort in R was my Saturday project during a series of very snowy Saturdays we had here in Edmonton in September. It makes heavy use of the tidyverse, Text Mining with R by Julia Silge and David Robinson (which is highly recommended), and piping from the magrittr package (which makes things so much nicer). If you haven’t read the previous three episodes, they are:

  1. The One with all the Import and Cleanup
  2. The One with the Most Frequent Words
  3. The One with the Sentiment Analysis

You can find a tutorial by Rich Majerus on how to loop with ggplot2 here.

Disclaimer: I do not claim to be an expert in text mining. There may be faster, smarter, or nicer ways to achieve certain things. Still, maybe you’ll find something interesting for your own projects, or just a funny tidbit about your favourite show. In this fourth “episode,” we’ll run some TF-IDF (term frequency–inverse document frequency) analyses; essentially, we’ll try to find out which words are most characteristic of each season and each Friend. See also here and here.

Isabell Hubert 2018

website | twitter


Prep

We’ll load the following libraries:

library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(tidytext)
library(magrittr)
library(ggplot2)

And the dataframes we’ll need:

tokens <- readRDS("tokens.rds")          # pre stopword anti-join - useful for word volume
friends <- readRDS("friends-df.rds")     # the cleaned-up, post stopword anti-join

And define some useful character vectors we can use later for filtering, plotting, and looping:

friendsNames <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe")
friendsExtended <- c("Monica", "Rachel", "Chandler", "Joey", "Ross", "Phoebe", "Janice", "Gunther")
seasons <- 1:10
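
For example, restricting a dataframe to the main cast then becomes a one-liner. A quick sketch, assuming the cleaned-up dataframe has a speaker column (called person here purely for illustration; adjust the name to whatever the import-and-cleanup episode actually produced):

# keep only tokens spoken by the six main characters
# ("person" is a placeholder for the actual speaker column)
mains <- friends %>%
  filter(person %in% friendsNames)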

Token Frequencies

We’ll start by calculating some overall token frequencies: basically, how many words were said in total across the entire show, and how many times each individual word was said:

token.count <- tokens %>%
  count(word, sort = TRUE) %>%
  mutate(total = sum(n))
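
The total column makes it easy to turn the raw counts into relative frequencies, which is the same idea as the “term frequency” part of the TF-IDF analysis below (there it will be computed per season). A quick sketch:

# share of all words in the show accounted for by each word
token.count %>%
  mutate(freq = n / total) %>%
  head()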

We’ll also create some counts by season:

s.token.count <- tokens %>%
  count(word, season, sort = TRUE) %>%
  group_by(season) %>%
  mutate(seasonTotal = sum(n))

We can then plot the distribution of term frequencies within each season, the long-tailed shape that Zipf’s law predicts:

ggplot(s.token.count, aes(n/seasonTotal, fill = factor(season))) +
  geom_histogram(show.legend = FALSE, binwidth = 0.0001) +
  facet_wrap(~season, ncol = 2, scales = "free_y") +
  coord_cartesian(xlim = c(0, 0.005))
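
Zipf’s law is usually stated in terms of rank: the most frequent word occurs roughly twice as often as the second most frequent one, three times as often as the third, and so on. A quick sketch of that view on log-log axes (the ranks work out because count(..., sort = TRUE) already ordered the words by frequency within each season):

s.token.count %>%
  group_by(season) %>%
  mutate(rank = row_number(),
         termFreq = n / seasonTotal) %>%
  ggplot(aes(rank, termFreq, colour = factor(season))) +
  geom_line(show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()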

TF-IDF

“The idea of tf-idf is to find the important words for the content of each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection or corpus of documents.”

(from [Text Mining with R](https://www.tidytextmining.com/tfidf.html))
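
Concretely, for our season counts: tf is a word’s share of all words in a season, idf is the (natural) log of the number of seasons divided by the number of seasons the word shows up in, and tf-idf is their product. A minimal hand-rolled sketch of essentially what tidytext’s bind_tf_idf() computes for us below:

# number of "documents" (seasons) in the data
n.seasons <- n_distinct(s.token.count$season)

s.token.count %>%
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n.seasons / n_distinct(season))) %>%   # rarer across seasons = higher idf
  ungroup() %>%
  mutate(tf = n / seasonTotal,                            # frequency within the season
         tf_idf = tf * idf)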

by Season

s.tfidf <- s.token.count %>%
  bind_tf_idf(word, season, n)

Let’s look at the highest TF-IDF values:

s.tfidf %>%
  arrange(desc(tf_idf))
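
A natural next step is to pull out the top TF-IDF words per season and plot them, reusing the faceting from above. A rough sketch (top_n() can return a few extra rows when values are tied):

s.tfidf %>%
  group_by(season) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = factor(season))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~season, ncol = 2, scales = "free")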