December 16, 2019
This project explores the relationships between different characters in the classic TV show The Office. Using transcript data newly released in Bradley H. Lindblad’s `schrute` package, I’d like to see who mentions whom in The Office. Is one character more popular than the others?
Let’s take a look at the transcripts:
index | season | episode | episode_name | character | text | text_w_direction |
---|---|---|---|---|---|---|
1 | 01 | 01 | Pilot | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | All right Jim. Your quarterlies look very good. How are things at the library? |
358 | 01 | 01 | Pilot | Jim | Oh, I told you. I couldn’t close it. So… | Oh, I told you. I couldn’t close it. So… |
715 | 01 | 01 | Pilot | Michael | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? | So you’ve come to the master for guidance? Is this what you’re saying, grasshopper? |
By using `tidytext`, we can split the transcripts into their constituent parts (words).
index | season | episode | episode_name | character | text_w_direction | word |
---|---|---|---|---|---|---|
1 | 01 | 01 | Pilot | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | all |
1 | 01 | 01 | Pilot | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | right |
1 | 01 | 01 | Pilot | Michael | All right Jim. Your quarterlies look very good. How are things at the library? | jim |
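
The tokenization step can be sketched roughly as follows. This is a minimal example on a toy data frame, not the original code: the `transcripts` object here is a hypothetical stand-in built from two lines of the table above, whereas the real analysis would start from the dataset shipped with `schrute`.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the transcripts (two lines from the table above)
transcripts <- tibble::tibble(
  index     = c(1L, 358L),
  character = c("Michael", "Jim"),
  text      = c(
    "All right Jim. Your quarterlies look very good. How are things at the library?",
    "Oh, I told you. I couldn't close it. So..."
  )
)

# One row per word; by default unnest_tokens() lowercases the words,
# strips punctuation, and drops the input column (`text`)
tokens <- transcripts %>%
  unnest_tokens(output = word, input = text)

head(tokens, 3)
```

Because the tokens are lowercased, later comparisons against character names need to use lowercase names as well.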
We can now use the text to see who mentions whom. But first, let’s construct a vector of the characters to keep in the analysis. There are 485 characters in the transcripts, so it’s important that we keep only the relevant ones:
x |
---|
Michael |
Dwight |
Jim |
Pam |
Andy |
Angela |
Kevin |
Erin |
Oscar |
This filtering is optional. One may be interested in seeing which characters talk about Jim most, including characters who are otherwise less relevant. I filter down to the main cast so that comparisons between characters (e.g., through a chord diagram) are feasible.
Who is talking to whom in The Office?
Now that we have `keep_characters`, we can filter by it and tally who mentions whom among the most relevant Office characters.
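A rough sketch of that filter-and-count step for Jim, assuming a token table with one row per spoken word. The `tokens` data below is made up for illustration (it is not the real transcript), and matching on the exact word `"jim"` is my assumption about how a mention is defined:

```r
library(dplyr)

keep_characters <- c("Michael", "Dwight", "Jim", "Pam", "Andy",
                     "Angela", "Kevin", "Erin", "Oscar")

# Hypothetical toy tokens: one row per word spoken by a character
tokens <- tibble::tibble(
  character = c("Dwight", "Dwight", "Michael", "Pam", "Creed", "Jim"),
  word      = c("jim", "jim", "jim", "hello", "jim", "pam")
)

# Keep main-cast speakers, keep only the word "jim", then count per speaker
jim_mentions <- tokens %>%
  filter(character %in% keep_characters, word == "jim") %>%
  count(character, name = "mentions", sort = TRUE)

jim_mentions
```

Exact matching like this would miss possessives such as "jim's", so the counts should be read as approximate.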
The takeaway here is that Dwight mentions Jim the most, followed by Michael. No surprise there! What I find interesting is that only three characters really talk about/to Jim. After Dwight, Michael, and Pam (and Jim referencing himself, apparently), the number of mentions of Jim’s name drops from over 200 to only 60. It seems as if the writers of The Office intentionally made Jim a subject of conversation among only a few characters!
Next, we replicate that process for the rest of the cast. There is probably a better way to do this.
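One possible way to avoid repeating the step per character is to count every speaker and mentioned-name pair in a single pass. This is only a sketch on made-up tokens; the `from`, `to`, and `value` column names are chosen here to match the chord diagram input and are my assumption, not the original code:

```r
library(dplyr)

keep_characters <- c("Michael", "Dwight", "Jim", "Pam", "Andy",
                     "Angela", "Kevin", "Erin", "Oscar")

# Hypothetical toy tokens: one row per word spoken by a character
tokens <- tibble::tibble(
  character = c("Dwight", "Dwight", "Michael", "Pam", "Creed", "Jim"),
  word      = c("jim", "michael", "jim", "dwight", "jim", "the")
)

mentions <- tokens %>%
  # main-cast speakers saying a main-cast name (tokens are lowercase)
  filter(character %in% keep_characters,
         word %in% tolower(keep_characters)) %>%
  # map the lowercase token back to the properly cased name
  mutate(to = keep_characters[match(word, tolower(keep_characters))]) %>%
  rename(from = character) %>%
  count(from, to, name = "value")

mentions
```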
Now, let’s make a chord diagram!
We first have to convert the data frame into a format `chordDiagram` will recognize.
This process pivots the data into long format, with one row per pair of characters and its mention count, so that the data looks like this:
from | to | value |
---|---|---|
Andy | Jim | 60 |
Andy | Michael | 47 |
Andy | Dwight | 92 |
Andy | Pam | 36 |
Andy | Andy | 65 |
Andy | Angela | 39 |
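
A pivot of this kind can be sketched with `tidyr::pivot_longer()`. The wide layout below (one row per speaker, one column per mentioned character) is my assumption about the intermediate data frame; the numbers are Andy's row from the table above:

```r
library(tidyr)

# Assumed wide layout: one row per speaker, one column per mentioned name
wide <- tibble::tibble(
  from    = "Andy",
  Jim     = 60,
  Michael = 47,
  Dwight  = 92,
  Pam     = 36,
  Andy    = 65,
  Angela  = 39
)

# Every column except `from` becomes a (to, value) pair
long <- wide %>%
  pivot_longer(cols = -from, names_to = "to", values_to = "value")

long
```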
Using that data, we can create a chord diagram quite easily, using a single command from the `circlize` library. This chapter is helpful.
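As a minimal sketch of that single command, `chordDiagram()` accepts a data frame whose first two columns name the two ends of each link and whose third column holds the value. The `edges` object is a hypothetical name, filled with Andy's rows from the table above:

```r
library(circlize)

# Long-format edge list (Andy's rows from the table above)
edges <- data.frame(
  from  = "Andy",
  to    = c("Jim", "Michael", "Dwight", "Pam", "Andy", "Angela"),
  value = c(60, 47, 92, 36, 65, 39)
)

# Draw the diagram, then reset circlize's global plotting state
chordDiagram(edges)
circos.clear()
```

Calling `circos.clear()` afterwards keeps the layout settings of one plot from leaking into the next.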
With nine people, some of the data can easily get hidden (how often did Angela mention Michael’s name?). One way to fix this is to make the visualization interactive, so that a user can hover over chords to see relationships between characters.
First, we conduct some data cleaning. I found that the row names and column names have to be in the same order; let’s do a little manipulation to get there:
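One way to get there, sketched in base R with a small made-up matrix (the values and the three-character subset are purely illustrative), is to index both dimensions by the same vector of names:

```r
# Hypothetical mention matrix whose rows and columns are in different orders
m <- matrix(
  1:9, nrow = 3,
  dimnames = list(c("Pam", "Jim", "Andy"),   # speakers (rows)
                  c("Andy", "Pam", "Jim"))   # mentioned names (columns)
)

# Indexing by name reorders rows and columns to one shared order
speakers <- c("Jim", "Andy", "Pam")
m_sq <- m[speakers, speakers]

m_sq
```

Indexing by name rather than by position means each cell keeps its original speaker and mentioned-name pairing; only the ordering changes.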
Next, we load Matt Flor’s `chorddiag` package, and construct a matrix in the form its main function expects:
Finally, we add a color palette and construct the diagram.
Play around with the diagram here!