Who Mentions Who in the Office?
This project explores the relationships between different characters in the classic TV show The Office. Using transcript data newly released in Bradley H. Lindblad’s schrute
package, I’d like to see who mentions who in the Office. Is one character more popular than the others?
library(schrute)
library(tidyverse)
library(cr)
set_cr_theme()
Let’s take a look at the transcripts:
transcripts <- schrute::theoffice
knitr::kable(transcripts[1:3,])
index | season | episode | episode_name | director | writer | character | text | text_w_direction | imdb_rating | total_votes | air_date |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 |
2 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Oh, I told you. I couldn’t close it. So… | Oh, I told you. I couldn’t close it. So… | 7.6 | 3706 | 2005-03-24 |
3 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | 7.6 | 3706 | 2005-03-24 |
By using tidytext
, we can split the transcripts into their constituent parts (words).
transcripts_tokenized <- transcripts %>%
tidytext::unnest_tokens(word, text)
knitr::kable(transcripts_tokenized[1:3,])
index | season | episode | episode_name | director | writer | character | text_w_direction | imdb_rating | total_votes | air_date | word |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 | all |
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 | right |
1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | 7.6 | 3706 | 2005-03-24 | jim |
We can now use the text to see who mentions who. But first, let’s construct a vector with a list of characters to keep in the analysis. There are 485 characters in the transcripts, so its important we filter only those of relevance:
keep_characters <- transcripts %>%
group_by(character) %>%
count() %>%
arrange(desc(n)) %>%
head(9) %>%
pull(character)
knitr::kable(keep_characters)
x |
---|
Michael |
Dwight |
Jim |
Pam |
Andy |
Angela |
Kevin |
Erin |
Oscar |
This is an optional decision. One may be interested in seeing which characters talk about Jim most, including those characters who are otherwise less relevant. I decide to filter according to the main cast so that comparisons between characters (e.g., through a chord diagram) is feasible.
Who Mentions Who?
Jim: A Case Study
Who is talking to who in the Office?
Now that we have keep_characters
, we can filter according to it and spit out who mentions who among the most relevant Office characters.
transcripts_tokenized %>%
filter(character %in% keep_characters) %>%
mutate(jim = ifelse(word == "jim", 1, 0)) %>%
group_by(character) %>%
summarise(jim = sum(jim)) %>%
arrange(desc(jim)) %>%
mutate(character = reorder(character, jim)) %>%
ggplot(ggplot2::aes(character, jim)) +
geom_col() +
coord_flip() +
fix_bars() +
labs(title = "Who Mentions Jim?",
subtitle = "Counts of 'Jim' in The Office Transcripts",
x = element_blank(),
y = "Mentions")
The takeaway here is that Dwight mentions Jim the most, followed by Michael. No surprise there! What I find interesting is that only three characters really talk about/to Jim. After Dwight, Michael, and Pam (and Jim referencing himself, apparently), the mention rate for Jim’s name drops from over 200 to only 60 mentions. It seems as if the writers of the Office intentionally made Jim a subject of conversation among only a few characters!
Replicate for the rest of the cast
Next, we replicate that process for the rest of the cast. There is probably a better way to do this.
data_chord <- transcripts_tokenized %>%
filter(character %in% keep_characters) %>%
mutate(jim = ifelse(word == "jim", 1, 0)) %>%
mutate(michael = ifelse(word == "michael", 1, 0)) %>%
mutate(dwight = ifelse(word == "dwight", 1, 0)) %>%
mutate(pam = ifelse(word == "pam", 1, 0)) %>%
mutate(andy = ifelse(word == "andy", 1, 0)) %>%
mutate(angela = ifelse(word == "angela", 1, 0)) %>%
mutate(kevin = ifelse(word == "kevin", 1, 0)) %>%
mutate(erin = ifelse(word == "erin", 1, 0)) %>%
mutate(oscar = ifelse(word == "oscar", 1, 0)) %>%
# mutate(ryan = ifelse(word == "ryan", 1, 0)) %>%
# mutate(darryl = ifelse(word == "darryl", 1, 0)) %>%
# mutate(phyllis = ifelse(word == "phyllis", 1, 0)) %>%
# mutate(kelly = ifelse(word == "kelly", 1, 0)) %>%
# mutate(toby = ifelse(word == "toby", 1, 0)) %>%
group_by(character) %>%
summarise_at(vars(jim:oscar), funs(sum))
Visualize
Now, let’s make a chord diagram!
We first have to convert the data frame into a format chordDiagram
will recognize.
circlize_data <- as.data.frame(data_chord) %>%
pivot_longer(jim:oscar, names_to = "to", values_to = "value") %>%
rename(from = 'character') %>%
mutate(to = str_to_title(to))
This process pivots each row of data into a value-key combination, so that the data looks like this:
from | to | value |
---|---|---|
Andy | Jim | 48 |
Andy | Michael | 40 |
Andy | Dwight | 80 |
Andy | Pam | 34 |
Andy | Andy | 42 |
Andy | Angela | 33 |
Using that data, we can create a chord diagram quite easily, using a single command from the circlize
library. This chapter is helpful.
library(circlize)
chordDiagram(circlize_data, grid.col = c("#B997C7", "#824D99", "#4E78C4", "#57A2AC", "#7EB875", "#D0B541", "#E67F33", "#CE2220", "#521A13"))
Make It Interactive
With nine people, some of the data can get easily concealed (how often did Angela mention Michael’s name?). One way to fix this is to make the visualization interactive, so that a user can hover over chords to see relationships between characters.
First, we conduct some data cleaning. I found that the rownames and column names have to be of the same order; let’s do a little manipulation to get there:
int_chord <- as.data.frame(data_chord)
rownames(int_chord) <- int_chord$character
row.order <- c("Jim", "Michael", "Dwight", "Pam", "Andy", "Angela", "Kevin", "Erin", "Oscar")
#, "Ryan", "Darryl", "Phyllis", "Kelly", "Toby")
int_chord <- int_chord[row.order,]
Next, we load Matt Flor’s chorddiag
package, and construct a matrix according to its function’s liking:
# devtools::install_github("mattflor/chorddiag")
library(chorddiag)
m <- as.matrix(int_chord[-1])
dimnames(m) <- list(have = int_chord$character,
prefer = str_to_title(colnames(int_chord[-1])))
Finally, we add a color palette and construct the diagram.
groupColors <- c("#B997C7", "#824D99", "#4E78C4", "#57A2AC", "#7EB875", "#D0B541", "#E67F33", "#CE2220", "#521A13")
p <- chorddiag(m,
groupColors = groupColors,
groupnamePadding = 35,
tickInterval = 50,
groupnameFontsize = 12)
p
# save the widget
# library(htmlwidgets)
# saveWidget(p, file="chord_interactive.html")
Play around with the diagram here!