Getting started with quanteda.tidy

Ken Benoit

Introduction

quanteda.tidy extends the quanteda package with dplyr-style verbs for manipulating corpus objects. These functions operate on document variables (docvars) while preserving the text content and structure of quanteda objects.

Note that quanteda.tidy is very different from tidytext. While tidytext converts text to data frames with one token per row, quanteda.tidy keeps your corpus intact and extends dplyr functions to work directly with quanteda objects.

library(quanteda.tidy)
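As a quick check of the point about keeping the corpus intact, the sketch below applies one verb and confirms that the result is still a quanteda corpus rather than a data frame. It assumes quanteda is installed alongside quanteda.tidy; `recent` is only an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Apply a dplyr-style verb directly to the built-in inaugural corpus
recent <- filter(data_corpus_inaugural, Year >= 2000)

# The result is still a corpus, with its texts and docvars
is.corpus(recent)
ndoc(recent)
names(docvars(recent))
```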

Overview of Functions

The functions in quanteda.tidy are organized into four categories, following the dplyr documentation:

quanteda.tidy functions by category

Category              Function                             Description
Rows                  filter()                             Subset documents based on docvar conditions
Rows                  slice(), slice_head(), slice_tail()  Subset documents by position
Rows                  slice_sample()                       Randomly sample documents
Rows                  slice_min(), slice_max()             Select documents with min/max docvar values
Rows                  arrange(), distinct()                Reorder documents; keep unique documents
Columns               select()                             Keep or drop docvars by name
Columns               rename(), rename_with()              Rename docvars
Columns               relocate()                           Change docvar column order
Columns               mutate(), transmute()                Create or modify docvars
Columns               pull()                               Extract a single docvar as a vector
Columns               glimpse()                            Get a quick overview of the corpus
Groups of rows        add_count()                          Add count by group as a docvar
Groups of rows        add_tally()                          Add total count as a docvar
Pairs of data frames  left_join()                          Join corpus with external data frame

Verbs That Operate on Rows

These functions subset, reorder, or select documents based on their document variables or positions.

Filtering documents

Use filter() to keep documents that match specified conditions:

# Keep only Roosevelt's speeches
data_corpus_inaugural %>%
  filter(President == "Roosevelt") %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
## 
##            Text Types Tokens Sentences Year President   FirstName      Party
##  1905-Roosevelt   404   1079        33 1905 Roosevelt    Theodore Republican
##  1933-Roosevelt   743   2057        85 1933 Roosevelt Franklin D. Democratic
##  1937-Roosevelt   725   1989        96 1937 Roosevelt Franklin D. Democratic
##  1941-Roosevelt   526   1519        68 1941 Roosevelt Franklin D. Democratic
##  1945-Roosevelt   275    633        27 1945 Roosevelt Franklin D. Democratic
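Conditions can also be combined. The sketch below relies on standard dplyr semantics, where comma-separated conditions must all hold and `%in%` matches any of several values. The object names `fdr` and `founders` are illustrative.

```r
library(quanteda)
library(quanteda.tidy)

# Comma-separated conditions are combined with logical AND
fdr <- data_corpus_inaugural %>%
  filter(President == "Roosevelt", FirstName == "Franklin D.")
docnames(fdr)

# %in% keeps documents matching any of the listed values
founders <- data_corpus_inaugural %>%
  filter(President %in% c("Washington", "Jefferson"))
docnames(founders)
```

The first call drops the 1905 speech by Theodore Roosevelt, which shares the surname but not the first name.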

Slicing documents by position

Use slice() and its variants to select documents by position:

# First 3 documents
slice(data_corpus_inaugural, 1:3)
## Corpus consisting of 3 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."

# First 10%
slice_head(data_corpus_inaugural, prop = 0.10)
## Corpus consisting of 6 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."

# Last 3 documents
slice_tail(data_corpus_inaugural, n = 3)
## Corpus consisting of 3 documents and 4 docvars.
## 2017-Trump :
## "Chief Justice Roberts, President Carter, President Clinton, ..."
## 
## 2021-Biden :
## "Chief Justice Roberts, Vice President Harris, Speaker Pelosi..."
## 
## 2025-Trump :
## "Thank you.  Thank you very much, everybody.  Wow.  Thank you..."

Random sampling:

set.seed(42)
slice_sample(data_corpus_inaugural, n = 5)
## Corpus consisting of 5 documents and 4 docvars.
## 1981-Reagan :
## "Senator Hatfield, Mr. Chief Justice, Mr. President, Vice Pre..."
## 
## 1933-Roosevelt :
## "I am certain that my fellow Americans expect that on my indu..."
## 
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1885-Cleveland :
## "Fellow citizens, in the presence of this vast assemblage of ..."
## 
## 1825-Adams :
## "In compliance with an usage coeval with the existence of our..."
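The `set.seed()` call is what makes the sample repeatable. A minimal sketch, assuming `slice_sample()` draws from R's standard random number generator as dplyr does:

```r
library(quanteda)
library(quanteda.tidy)

# Re-seeding before each call reproduces the same draw
set.seed(42)
first_draw <- slice_sample(data_corpus_inaugural, n = 5)
set.seed(42)
second_draw <- slice_sample(data_corpus_inaugural, n = 5)

# TRUE when both draws selected the same documents in the same order
identical(docnames(first_draw), docnames(second_draw))
```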

Select by minimum or maximum values of a docvar:

# Add token counts first
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(data_corpus_inaugural))

# Shortest speeches
slice_min(corp, n_tokens, n = 3)
## Corpus consisting of 3 documents and 5 docvars.
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1945-Roosevelt :
## "Chief Justice, Mr. Vice President, my friends, you will unde..."
## 
## 1865-Lincoln :
## "Fellow-Countrymen: At this second appearing to take the oath..."

# Longest speeches
slice_max(corp, n_tokens, n = 3)
## Corpus consisting of 3 documents and 5 docvars.
## 1841-Harrison :
## "Called from a retirement which I had supposed was to continu..."
## 
## 1909-Taft :
## "My fellow citizens: Anyone who has taken the oath I have jus..."
## 
## 1845-Polk :
## "Fellow citizens, without solicitation on my part, I have bee..."
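The printed corpus does not show the values that drove the selection. One way to see them is to combine `slice_min()` with `pull()`, as in this sketch. It tokenizes explicitly with `tokens()` before counting, which is a choice made here rather than something the examples above require; `shortest` is an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Store the per-document token count as a docvar
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(tokens(data_corpus_inaugural)))

# Token counts of the three shortest speeches
shortest <- slice_min(corp, n_tokens, n = 3)
pull(shortest, n_tokens)
```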

Arranging documents

Use arrange() to reorder documents:

# Sort alphabetically by president
data_corpus_inaugural[1:5] %>%
  arrange(President)
## Corpus consisting of 5 documents and 4 docvars.
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."

# Sort by year descending
data_corpus_inaugural[1:5] %>%
  arrange(desc(Year))
## Corpus consisting of 5 documents and 4 docvars.
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
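Several sort keys can be given at once, with later keys breaking ties in earlier ones. This sketch follows ordinary dplyr `arrange()` semantics; `sorted` is an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Sort by president, then newest speech first within each president
sorted <- data_corpus_inaugural[1:5] %>%
  arrange(President, desc(Year))

docnames(sorted)
```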

Keeping distinct documents

Use distinct() to keep only unique combinations of docvar values:

# Keep the first document for each distinct value of President (surname)
data_corpus_inaugural %>%
  distinct(President, .keep_all = TRUE) %>%
  summary(n = 10)
## Corpus consisting of 36 documents, showing 10 documents:
## 
##             Text Types Tokens Sentences Year  President     FirstName
##  1789-Washington   625   1537        23 1789 Washington        George
##       1797-Adams   826   2577        37 1797      Adams          John
##   1801-Jefferson   717   1923        41 1801  Jefferson        Thomas
##     1809-Madison   535   1261        21 1809    Madison         James
##      1817-Monroe  1040   3677       121 1817     Monroe         James
##     1829-Jackson   517   1208        25 1829    Jackson        Andrew
##    1837-VanBuren  1315   4158        95 1837  Van Buren        Martin
##    1841-Harrison  1896   9125       210 1841   Harrison William Henry
##        1845-Polk  1334   5186       153 1845       Polk    James Knox
##      1849-Taylor   496   1178        22 1849     Taylor       Zachary
##                  Party
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican
##  Democratic-Republican
##             Democratic
##             Democratic
##                   Whig
##                   Whig
##                   Whig
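Because `President` holds only the surname, `distinct(President)` treats presidents who share a surname as one. The sketch below adds `FirstName` as a second key to tell them apart. It assumes that multiple columns are passed through to dplyr's `distinct()`; the object names are illustrative.

```r
library(quanteda)
library(quanteda.tidy)

# One document per surname
by_surname <- distinct(data_corpus_inaugural, President, .keep_all = TRUE)

# One document per surname and first name combination
by_full_name <- distinct(data_corpus_inaugural, President, FirstName,
                         .keep_all = TRUE)

ndoc(by_surname)
ndoc(by_full_name)

# John Quincy Adams is kept only when FirstName is part of the key
"1825-Adams" %in% docnames(by_full_name)
```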

Verbs That Operate on Columns

These functions create, modify, rename, reorder, or select document variables.

Selecting docvars

Use select() to keep or drop docvars:

data_corpus_inaugural %>%
  select(President, Year) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences  President Year
##  1789-Washington   625   1537        23 Washington 1789
##  1793-Washington    96    147         4 Washington 1793
##       1797-Adams   826   2577        37      Adams 1797
##   1801-Jefferson   717   1923        41  Jefferson 1801
##   1805-Jefferson   804   2380        45  Jefferson 1805
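The example above keeps docvars by name. Dropping works the other way around, as in this sketch, which assumes the usual tidyselect negation with `-` is passed through unchanged; `no_party` is an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Drop Party and keep every other docvar
no_party <- data_corpus_inaugural %>%
  select(-Party)

names(docvars(no_party))
```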

Renaming docvars

Use rename() for direct renaming:

data_corpus_inaugural %>%
  rename(LastName = President, Given = FirstName) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences Year   LastName  Given
##  1789-Washington   625   1537        23 1789 Washington George
##  1793-Washington    96    147         4 1793 Washington George
##       1797-Adams   826   2577        37 1797      Adams   John
##   1801-Jefferson   717   1923        41 1801  Jefferson Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson Thomas
##                  Party
##                   none
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican

Use rename_with() to rename using a function:

data_corpus_inaugural %>%
  rename_with(toupper) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences YEAR  PRESIDENT FIRSTNAME
##  1789-Washington   625   1537        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2577        37 1797      Adams      John
##   1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
##                  PARTY
##                   none
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican

Relocating docvars

Use relocate() to change column order:

data_corpus_inaugural %>%
  relocate(Party, President) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences                 Party  President Year
##  1789-Washington   625   1537        23                  none Washington 1789
##  1793-Washington    96    147         4                  none Washington 1793
##       1797-Adams   826   2577        37            Federalist      Adams 1797
##   1801-Jefferson   717   1923        41 Democratic-Republican  Jefferson 1801
##   1805-Jefferson   804   2380        45 Democratic-Republican  Jefferson 1805
##  FirstName
##     George
##     George
##       John
##     Thomas
##     Thomas
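By default the named docvars move to the front. In dplyr, `relocate()` also accepts `.before` and `.after` to place them elsewhere; the sketch below assumes the corpus method forwards these arguments in the same way, which is not confirmed above. `moved` is an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Move Year so that it follows Party (assumes .after is forwarded to dplyr)
moved <- data_corpus_inaugural %>%
  relocate(Year, .after = Party)

names(docvars(moved))
```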

Creating and modifying docvars

Use mutate() to add new docvars or modify existing ones:

data_corpus_inaugural %>%
  mutate(
    fullname = paste(FirstName, President, sep = " "),
    century = floor(Year / 100) + 1
  ) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1537        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2577        37 1797      Adams      John
##   1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
##                  Party          fullname century
##                   none George Washington      18
##                   none George Washington      18
##             Federalist        John Adams      18
##  Democratic-Republican  Thomas Jefferson      19
##  Democratic-Republican  Thomas Jefferson      19
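The century formula can be checked by hand in base R. The sketch below compares it with an alternative that assigns years ending in 00 to the century they close. The two differ only for such years, so the choice matters only if a corpus contains them; the `years` vector is made up for illustration.

```r
# Illustrative years, including two that end in 00
years <- c(1789, 1801, 1900, 1901, 2000, 2001)

# Formula from the mutate() example: 1900 is assigned to century 20
century_simple <- floor(years / 100) + 1
century_simple

# Alternative: 1900 is assigned to century 19, 2000 to century 20
century_strict <- (years - 1) %/% 100 + 1
century_strict
```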

Use transmute() to create new docvars and drop all others:

data_corpus_inaugural %>%
  transmute(
    speech_id = paste(Year, President, sep = "-"),
    party = Party
  ) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences       speech_id                 party
##  1789-Washington   625   1537        23 1789-Washington                  none
##  1793-Washington    96    147         4 1793-Washington                  none
##       1797-Adams   826   2577        37      1797-Adams            Federalist
##   1801-Jefferson   717   1923        41  1801-Jefferson Democratic-Republican
##   1805-Jefferson   804   2380        45  1805-Jefferson Democratic-Republican

Extracting docvars

Use pull() to extract a single docvar as a vector:

data_corpus_inaugural %>%
  filter(Year >= 2000) %>%
  pull(President)
## [1] "Bush"  "Bush"  "Obama" "Obama" "Trump" "Biden" "Trump"
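Since `pull()` returns an ordinary vector, base R functions apply to the result directly. A small sketch that reduces the vector to one entry per surname (`recent_presidents` is an illustrative name):

```r
library(quanteda)
library(quanteda.tidy)

# Extract the docvar, then drop repeated surnames
recent_presidents <- data_corpus_inaugural %>%
  filter(Year >= 2000) %>%
  pull(President) %>%
  unique()

recent_presidents
```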

Getting an overview

Use glimpse() (from tibble) to see a compact summary:

glimpse(data_corpus_inaugural)
## Rows: 60
## Columns: 6
## $ doc_id    <chr> "1789-Washington", "1793-Washington", "1797-Adams", "1801-Je…
## $ text      <chr> "Fellow-Cit…", "Fellow cit…", "When it wa…", "Friends an…", …
## $ Year      <int> 1789, 1793, 1797, 1801, 1805, 1809, 1813, 1817, 1821, 1825, …
## $ President <chr> "Washington", "Washington", "Adams", "Jefferson", "Jefferson…
## $ FirstName <chr> "George", "George", "John", "Thomas", "Thomas", "James", "Ja…
## $ Party     <fct> none, none, Federalist, Democratic-Republican, Democratic-Re…

Verbs That Operate on Groups of Rows

These functions add counts, computed by group or over the whole corpus, as new docvars.

Counting observations

Use add_count() to add a count variable by group:

# Count speeches per president
data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(n_speeches > 1) %>%
  summary(n = 10)
## Corpus consisting of 44 documents, showing 10 documents:
## 
##             Text Types Tokens Sentences Year  President   FirstName
##  1789-Washington   625   1537        23 1789 Washington      George
##  1793-Washington    96    147         4 1793 Washington      George
##       1797-Adams   826   2577        37 1797      Adams        John
##   1801-Jefferson   717   1923        41 1801  Jefferson      Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson      Thomas
##     1809-Madison   535   1261        21 1809    Madison       James
##     1813-Madison   541   1302        33 1813    Madison       James
##      1817-Monroe  1040   3677       121 1817     Monroe       James
##      1821-Monroe  1259   4886       131 1821     Monroe       James
##       1825-Adams  1003   3147        74 1825      Adams John Quincy
##                  Party n_speeches
##                   none          2
##                   none          2
##             Federalist          2
##  Democratic-Republican          2
##  Democratic-Republican          2
##  Democratic-Republican          2
##  Democratic-Republican          2
##  Democratic-Republican          2
##  Democratic-Republican          2
##  Democratic-Republican          2
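Counts are computed on the grouping docvar exactly as stored. Because `President` holds only the surname, speeches by Theodore and Franklin D. Roosevelt are counted together, as this sketch shows (`roosevelt_counts` is an illustrative name):

```r
library(quanteda)
library(quanteda.tidy)

# Count per surname, then inspect the Roosevelt documents
roosevelt_counts <- data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(President == "Roosevelt") %>%
  pull(n_speeches)

# One value per Roosevelt speech, each equal to the pooled count
roosevelt_counts
```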

Use add_tally() to add the total count:

data_corpus_inaugural %>%
  slice(1:5) %>%
  add_tally() %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1537        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2577        37 1797      Adams      John
##   1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
##                  Party n
##                   none 5
##                   none 5
##             Federalist 5
##  Democratic-Republican 5
##  Democratic-Republican 5

Verbs That Operate on Pairs of Data Frames

These functions combine a corpus with an external data frame.

Joining with external data

Use left_join() to add columns from a data frame to your corpus:

# Create some external data
party_colors <- data.frame(
  Party = c("Democratic", "Republican", "none", "Federalist",
            "Democratic-Republican", "Whig"),
  color = c("blue", "red", "gray", "purple", "green", "orange")
)

# Join to corpus
data_corpus_inaugural %>%
  left_join(party_colors, by = "Party") %>%
  summary(n = 10)
## Corpus consisting of 60 documents, showing 10 documents:
## 
##             Text Types Tokens Sentences Year  President   FirstName
##  1789-Washington   625   1537        23 1789 Washington      George
##  1793-Washington    96    147         4 1793 Washington      George
##       1797-Adams   826   2577        37 1797      Adams        John
##   1801-Jefferson   717   1923        41 1801  Jefferson      Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson      Thomas
##     1809-Madison   535   1261        21 1809    Madison       James
##     1813-Madison   541   1302        33 1813    Madison       James
##      1817-Monroe  1040   3677       121 1817     Monroe       James
##      1821-Monroe  1259   4886       131 1821     Monroe       James
##       1825-Adams  1003   3147        74 1825      Adams John Quincy
##                  Party  color
##                   none   gray
##                   none   gray
##             Federalist purple
##  Democratic-Republican  green
##  Democratic-Republican  green
##  Democratic-Republican  green
##  Democratic-Republican  green
##  Democratic-Republican  green
##  Democratic-Republican  green
##  Democratic-Republican  green
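A left join keeps every document, including those with no match in the lookup table. The sketch below uses a deliberately incomplete lookup (the hypothetical `partial_colors`) and counts the documents left with a missing value, following the same pattern as the example above.

```r
library(quanteda)
library(quanteda.tidy)

# Lookup table covering only two of the parties
partial_colors <- data.frame(
  Party = c("Democratic", "Republican"),
  color = c("blue", "red")
)

joined <- left_join(data_corpus_inaugural, partial_colors, by = "Party")

# No documents are dropped; unmatched parties get NA
ndoc(joined) == ndoc(data_corpus_inaugural)
table(is.na(pull(joined, color)))
```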

Special handling of document names

left_join() provides special handling for joining on document names. Use "docname" in the by argument to match on document names even when "docname" is not a docvar:

# Create data with document name as key
doc_metadata <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)

# Join using docname
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata, by = "docname") %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1537        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2577        37 1797      Adams      John
##   1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
##                  Party              notes
##                   none    First inaugural
##                   none   Second inaugural
##             Federalist First Adams speech
##  Democratic-Republican               <NA>
##  Democratic-Republican               <NA>

You can also match document names to a differently named column:

doc_metadata2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)

data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata2, by = c("docname" = "doc_id")) %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
## 
##             Text Types Tokens Sentences Year  President FirstName
##  1789-Washington   625   1537        23 1789 Washington    George
##  1793-Washington    96    147         4 1793 Washington    George
##       1797-Adams   826   2577        37 1797      Adams      John
##   1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
##                  Party rating
##                   none      5
##                   none      4
##             Federalist     NA
##  Democratic-Republican     NA
##  Democratic-Republican     NA

Piping Operations

All quanteda.tidy functions work seamlessly with the pipe operator, allowing you to chain multiple operations:

data_corpus_inaugural %>%
  # Add metadata
  mutate(
    decade = floor(Year / 10) * 10,
    n_tokens = ntoken(data_corpus_inaugural)
  ) %>%
  # Filter to 20th century
  filter(Year >= 1900, Year < 2000) %>%
  # Keep only relevant columns
  select(President, Party, decade, n_tokens) %>%
  # Sort by speech length
  arrange(desc(n_tokens)) %>%
  summary(n = 10)
## Corpus consisting of 25 documents, showing 10 documents:
## 
##             Text Types Tokens Sentences  President      Party decade n_tokens
##        1909-Taft  1437   5821       158       Taft Republican   1900     5821
##    1925-Coolidge  1220   4440       196   Coolidge Republican   1920     4440
##      1929-Hoover  1090   3860       158     Hoover Republican   1920     3860
##     1921-Harding  1169   3719       148    Harding Republican   1920     3719
##      1985-Reagan   925   2909       123     Reagan Republican   1980     2909
##      1981-Reagan   902   2781       129     Reagan Republican   1980     2781
##  1953-Eisenhower   900   2743       119 Eisenhower Republican   1950     2743
##        1989-Bush   795   2674       141       Bush Republican   1980     2674
##      1949-Truman   781   2504       116     Truman Democratic   1940     2504
##    1901-McKinley   854   2437       100   McKinley Republican   1900     2437
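The order of steps matters in the chain above: `n_tokens` is computed from the full corpus, so `mutate()` has to run before `filter()` for the vector length to match the number of documents. If filtering comes first, one option is to count tokens on the filtered object instead, as in this sketch. It tokenizes explicitly with `tokens()`, a choice made here for illustration, and `twentieth` is an illustrative name.

```r
library(quanteda)
library(quanteda.tidy)

# Filter first
twentieth <- data_corpus_inaugural %>%
  filter(Year >= 1900, Year < 2000)

# Count tokens on the filtered corpus so there is one value per document
twentieth <- twentieth %>%
  mutate(n_tokens = ntoken(tokens(twentieth)))

ndoc(twentieth)
length(pull(twentieth, n_tokens))
```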