| Title: | 'tidyverse' Extensions for 'quanteda' |
| Version: | 0.4 |
| Description: | Enables 'tidyverse' operations on 'quanteda' corpus objects by extending 'dplyr' verbs to work directly with corpus objects and their document-level variables ('docvars'). Implements row operations for 'subsetting' and reordering documents; column operations for managing document variables; grouped operations; and two-table verbs for merging external data. For more on 'quanteda' see 'Benoit et al.' (2018) <doi:10.21105/joss.00774>. For 'dplyr' see 'Wickham et al.' (2023) <doi:10.32614/CRAN.package.dplyr>. |
| Depends: | R (≥ 3.5.0), quanteda (≥ 3.0.0) |
| Imports: | dplyr, rlang, tibble, tidyselect |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | covr, knitr, rmarkdown, spelling, testthat |
| VignetteBuilder: | knitr |
| Language: | en-GB |
| NeedsCompilation: | no |
| Packaged: | 2025-12-11 02:18:53 UTC; kbenoit |
| Author: | Kenneth Benoit [aut, cre, cph] |
| Maintainer: | Kenneth Benoit <kbenoit@smu.edu.sg> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-17 10:10:08 UTC |
quanteda.tidy: Tidyverse Extensions for quanteda
Description
Extends 'dplyr' verbs to work directly on 'quanteda' corpus objects,
enabling users to manipulate document-level variables ("docvars") using
familiar 'tidyverse' syntax. Implements row operations for subsetting and
reordering documents; column operations for managing document variables;
grouped operations via add_count() and add_tally(); and two-table verbs
(such as left_join()) for merging external data.
Author(s)
Maintainer: Kenneth Benoit kbenoit@smu.edu.sg [copyright holder]
References
Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). "quanteda: An R package for the quantitative analysis of textual data." Journal of Open Source Software, 3(30), 774. doi:10.21105/joss.00774
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4. doi:10.32614/CRAN.package.dplyr
Verbs that operate on groups of rows
Description
These functions operate on groups of rows (documents) of quanteda objects, typically counting or summarising documents by group.
Arguments
x |
a quanteda corpus object |
... |
additional arguments passed to methods |
Details
add_count() and add_tally() add a document variable containing the count
of observations in each group. See dplyr::add_count() for more details.
Value
A corpus with an additional document variable containing counts.
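The effect of add_count() can be sketched with plain 'dplyr' applied to the docvars data frame. This is an illustration of the idea only, not the package's implementation; the object names corp and dv are chosen for this example.

```r
library(quanteda)
library(dplyr)

# Work on a small subset of the built-in inaugural corpus
corp <- head(data_corpus_inaugural, 10)

# Count documents per President on the docvars data frame;
# dplyr::add_count() keeps the original row order
dv <- docvars(corp) %>%
  add_count(President)

# Store the counts as a new document variable named "n"
docvars(corp, "n") <- dv$n
head(docvars(corp), 3)
```

The texts and document names are untouched; only a document variable is added.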
Add count of observations to corpus
Description
add_count() and add_tally() are wrappers around dplyr::add_count() and
dplyr::add_tally() that add a new document variable with the number of
observations. add_count() is a shortcut for group_by() + add_tally().
Usage
## S3 method for class 'corpus'
add_count(x, ..., wt = NULL, sort = FALSE, name = NULL, .drop = NULL)
## S3 method for class 'corpus'
add_tally(x, ..., wt = NULL, sort = FALSE, name = NULL)
Arguments
x |
a quanteda corpus object |
... |
for |
wt |
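frequency weights. Can be NULL or a variable: if NULL (the default), counts the number of documents in each group; if a variable, computes sum(wt) for each group |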
sort |
if TRUE, will show the largest groups at the top |
name |
the name of the new column in the output. If omitted, it will
default to n |
.drop |
not used for corpus objects; included for compatibility with the generic |
Value
a corpus with an additional document variable containing counts
Examples
# Count documents by President and add as a variable
data_corpus_inaugural %>%
  add_count(President) %>%
  summary(n = 10)
# Add total count to each document
data_corpus_inaugural %>%
  head() %>%
  add_tally() %>%
  summary()
# Count by multiple variables
data_corpus_inaugural %>%
  add_count(Party, President) %>%
  summary(n = 10)
# Use custom name
data_corpus_inaugural %>%
  add_count(Party, name = "party_count") %>%
  summary(n = 10)
# Add tally to show total count
data_corpus_inaugural %>%
  slice(1:6) %>%
  add_tally() %>%
  summary()
Add count of observations to corpus
Description
add_tally is a generic function for adding a count column. The default
method calls dplyr::add_tally().
Usage
add_tally(x, ...)
Arguments
x |
an object |
... |
additional arguments passed to methods |
Value
A corpus with an additional document variable containing counts.
Verbs that operate on rows
Description
These functions operate on the rows (documents) of quanteda objects, subsetting, reordering, or selecting distinct documents based on document variables.
Arguments
.data |
a quanteda corpus object |
... |
additional arguments passed to methods |
Details
arrange() orders documents by values of document variables. See
dplyr::arrange() for more details.
distinct() subsets documents to keep only unique/distinct rows based on
document variable values. See dplyr::distinct() for more details.
filter() subsets documents that satisfy specified conditions on document
variables. See dplyr::filter() for more details.
slice() and its variants (slice_head(), slice_tail(), slice_min(),
slice_max(), slice_sample()) select documents by their (integer)
positions. See dplyr::slice() for more details.
Value
A corpus, subsetted or reordered according to the operation.
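Because each row verb returns a corpus, the verbs can be chained. The following is a minimal sketch that assumes 'quanteda.tidy' is installed; 'dplyr' is attached only to make the %>% pipe available, and the object name early is chosen for this example.

```r
library(quanteda)
library(dplyr)
library(quanteda.tidy)

# Keep speeches before 1850, order them newest first,
# then keep the first three documents of the reordered corpus
early <- data_corpus_inaugural %>%
  filter(Year < 1850) %>%
  arrange(desc(Year)) %>%
  slice_head(n = 3)

docnames(early)
docvars(early, "Year")
```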
Arrange the document order of a corpus by variables
Description
Order the documents in a corpus by variables, including document variables.
Usage
## S3 method for class 'corpus'
arrange(.data, ...)
Arguments
.data |
a corpus object whose documents will be sorted |
... |
comma-separated list of unquoted document variables, or expressions involving document variables. Use desc() to sort a variable in descending order. |
Value
A corpus with documents reordered according to the specified variables.
Examples
arrange(data_corpus_inaugural[1:5], President)
arrange(data_corpus_inaugural[1:5], c(3, 2, 1, 5, 4))
arrange(data_corpus_inaugural[1:5], desc(President))
Subset documents distinct/unique by document variables
Description
Select only documents that are unique/distinct with respect to values of their document variables.
Usage
## S3 method for class 'corpus'
distinct(.data, ..., .keep_all = FALSE)
Arguments
.data |
a corpus object with document variables |
... |
comma-separated list of unquoted document variables, or expressions involving document variables |
.keep_all |
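If TRUE, keep all document variables in .data. If a combination of ... is not distinct, this keeps the first document for that combination. |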
Value
A corpus containing only documents with unique combinations of the specified document variables.
Examples
distinct(data_corpus_inaugural[1:5], President) %>%
  summary()
distinct(data_corpus_inaugural[1:5], President, .keep_all = TRUE) %>%
  summary()
Wrappers to dplyr functions
Description
Wrapper functions for dplyr functions to preserve texts, document names, and corpus meta-data.
Usage
corpus_stv_byvar(.data, ..., fun)
corpus_stv_bydoc(.data, ..., fun)
Arguments
.data |
input quanteda object |
... |
arguments for the dplyr function |
fun |
reference to the dplyr function |
Value
a modified quanteda object
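The general pattern behind such a wrapper can be sketched as follows: extract the docvars, apply a 'dplyr' function to them, and write the result back so that texts and document names are left alone. The helper apply_to_docvars() is hypothetical and written for this illustration; it is not the package's implementation of corpus_stv_byvar() or corpus_stv_bydoc().

```r
library(quanteda)
library(dplyr)

# Hypothetical helper (an assumption for illustration only)
apply_to_docvars <- function(x, fun, ...) {
  dv <- docvars(x)          # document variables as a data frame
  new_dv <- fun(dv, ...)    # apply the dplyr verb to the data frame
  # Only column-wise changes are safe here: one row per document
  stopifnot(nrow(new_dv) == ndoc(x))
  docvars(x) <- new_dv      # write back; texts and docnames unchanged
  x
}

corp_small <- head(data_corpus_inaugural, 5)
corp_decade <- apply_to_docvars(
  corp_small,
  dplyr::mutate,
  decade = 10 * (Year %/% 10)
)
docvars(corp_decade, "decade")
```

Verbs that drop or reorder documents need a different approach, since the texts must be subset or reordered together with the docvars.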
Verbs that operate on columns
Description
These functions operate on the columns (document variables) of quanteda objects, creating, modifying, renaming, reordering, or selecting document variables.
Arguments
.data |
a quanteda corpus object |
... |
additional arguments passed to methods |
Details
mutate() creates new document variables or modifies existing ones.
transmute() creates new document variables and drops existing ones. See
dplyr::mutate() for more details.
pull() extracts a single document variable as a vector. See
dplyr::pull() for more details.
relocate() changes the column order of document variables. See
dplyr::relocate() for more details.
rename() changes the names of individual document variables using
new_name = old_name syntax. rename_with() renames document variables
using a function. See dplyr::rename() for more details.
select() keeps or drops document variables by name. See dplyr::select()
for more details.
Value
A corpus with modified document variables, or for pull(), a vector.
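A short sketch of the column verbs used together, assuming 'quanteda.tidy' is installed; 'dplyr' is attached only for the %>% pipe, and the names corp_cols, century, and centuries are chosen for this example.

```r
library(quanteda)
library(dplyr)
library(quanteda.tidy)

corp_cols <- data_corpus_inaugural %>%
  mutate(century = Year %/% 100 + 1) %>%   # add a document variable
  rename(LastName = President) %>%         # new_name = old_name
  select(Year, century, LastName)          # keep three docvars

names(docvars(corp_cols))

# pull() returns a plain vector rather than a corpus
centuries <- pull(corp_cols, century)
head(centuries)
```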
Context functions from dplyr
Description
These functions return information about the "current" group or "current"
variable, so they only work inside specific contexts such as summarise() and
mutate().
Arguments
... |
not used; present for compatibility with the generic |
Details
- n() gives the current group size.
- cur_data() gives the current data for the current group (excluding grouping variables).
- cur_data_all() gives the current data for the current group (including grouping variables).
- cur_group() gives the group keys, a tibble with one row and one column for each grouping variable.
- cur_group_id() gives a unique numeric identifier for the current group.
- cur_column() gives the name of the current column (in dplyr::across() only).
See dplyr::group_data() for equivalent functions that return values for all
groups.
Value
Context-dependent: n() returns an integer; cur_group_id() returns
an integer; cur_group() returns a tibble; cur_data() and cur_data_all()
return tibbles; cur_column() returns a character string.
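To show what two of these context functions return, the sketch below applies them to the docvars data frame with plain 'dplyr'. Running it on the data frame rather than on the corpus is a choice made for this illustration; the names dv_ctx, party_id, and party_size are not part of the package.

```r
library(quanteda)
library(dplyr)

dv_ctx <- docvars(data_corpus_inaugural) %>%
  group_by(Party) %>%
  mutate(
    party_id = cur_group_id(),  # integer identifier of the current group
    party_size = n()            # number of documents in the current group
  ) %>%
  ungroup()

head(dv_ctx)
```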
Verbs that operate on pairs of data frames
Description
These functions combine a quanteda object with a data frame, adding new document variables based on matching values.
Arguments
x |
a quanteda corpus object |
y |
a data frame to join with |
... |
additional arguments passed to methods |
Details
left_join() adds columns from y to the corpus x, matching documents
based on a key variable. All documents in x are kept. See
dplyr::left_join() for more details.
Value
A corpus with document variables from both x and y.
Vector functions from dplyr
Description
Selected vector functions, re-exported.
desc() reverses the sort order of a variable; see
dplyr::desc() for details.
Arguments
x |
a vector to transform |
Value
A transformed vector of the same length as the input.
Return documents with matching conditions
Description
Use filter() to select documents where conditions evaluated on document
variables are true. Documents where the condition evaluates to NA are
dropped. A tidy replacement for corpus_subset().
Usage
## S3 method for class 'corpus'
filter(.data, ..., .preserve = FALSE)
Arguments
.data |
a quanteda object whose documents will be filtered |
... |
Logical predicates defined in terms of the document variables in
.data |
.preserve |
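Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data; otherwise the grouping is kept as is. |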
Value
A corpus containing only documents that satisfy the specified conditions.
Examples
data_corpus_inaugural %>%
  filter(Year < 1810) %>%
  summary()
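Since filter() is described as a tidy replacement for corpus_subset(), the two can be compared side by side. This is a sketch that assumes 'quanteda.tidy' is installed; the object names by_filter and by_subset are chosen for this example.

```r
library(quanteda)
library(dplyr)
library(quanteda.tidy)

# Same condition, expressed with the dplyr verb and with quanteda's function
by_filter <- data_corpus_inaugural %>% filter(Year < 1810)
by_subset <- corpus_subset(data_corpus_inaugural, Year < 1810)

# Compare which documents were kept
identical(docnames(by_filter), docnames(by_subset))
```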
Get a glimpse of a quanteda object
Description
Implementation of glimpse for quanteda objects, allowing docvars to be viewed.
Usage
## S3 method for class 'corpus'
glimpse(x, width = NULL, ...)
Arguments
x |
a corpus or quanteda object |
width |
width of the output; defaults to the width of the console |
... |
unused |
Value
Invisibly returns the input corpus. Called primarily for its side effect of printing a summary to the console.
Examples
glimpse(data_corpus_inaugural)
Join corpus with a data frame
Description
left_join() adds columns from y to the corpus x, matching documents
based on document variables. This is a mutating join that keeps all documents
from x and adds matching values from y. If a document in x has no match
in y, the new columns will contain NA.
Usage
## S3 method for class 'corpus'
left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)
Arguments
x |
a quanteda corpus object |
y |
a data frame or tibble to join |
by |
a join specification. See |
copy |
if |
suffix |
if there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them |
... |
other arguments passed to dplyr::left_join() |
keep |
should the join keys from both x and y be preserved in the output? |
Value
a corpus with document variables from both x and y
Special handling of "docname"
This function provides special handling for joining on document names:
- If by = "docname" (or "docname" appears in the by vector), the function will use docnames(x) as the joining column from the corpus, even if "docname" is not a document variable.
- If using join_by(docname == other_col), the function will match docnames(x) to other_col in y.
- If "docname" exists as an actual document variable in x, that variable will be used instead of docnames(x).
Examples
# Create example corpus and data
corp <- data_corpus_inaugural[1:5]
# Create data to join with document names
doc_data <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  century = c(18, 18, 18),
  speech_number = c(1, 2, 1)
)
# Join using docname - matches docnames(corp) to doc_data$docname
left_join(corp, doc_data, by = "docname") %>%
  summary()
# Join using different column names with named vector
doc_data2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)
left_join(corp, doc_data2, by = c("docname" = "doc_id")) %>%
  summary()
# Regular join on existing docvars
year_info <- data.frame(
  Year = c(1789, 1793, 1797, 1801, 1805),
  decade = c("1780s", "1790s", "1790s", "1800s", "1800s")
)
left_join(corp, year_info, by = "Year") %>%
  summary()
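The "docname" handling described above can be approximated by hand with 'quanteda' and 'dplyr' alone. The sketch below is a manual equivalent written for illustration, not the package's implementation; the temporary docname column and the object names corp_manual, lookup, and dv_joined are assumptions of this example.

```r
library(quanteda)
library(dplyr)

corp_manual <- data_corpus_inaugural[1:5]
lookup <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  century = c(18, 18, 18)
)

dv_joined <- docvars(corp_manual) %>%
  mutate(docname = docnames(corp_manual)) %>%  # expose docnames as a key
  left_join(lookup, by = "docname") %>%        # keeps all rows, in order
  select(-docname)                             # drop the temporary key

docvars(corp_manual) <- dv_joined
docvars(corp_manual, "century")
```

Documents without a match in lookup receive NA in the new variable, as with any left join.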
Create or transform document variables
Description
mutate() adds new document variables and preserves
existing ones; transmute() adds new document variables and drops existing
ones. Both functions preserve the number of rows of the input. New variables
overwrite existing variables of the same name.
Usage
## S3 method for class 'corpus'
mutate(.data, ...)
## S3 method for class 'corpus'
transmute(.data, ...)
Arguments
.data |
a quanteda object whose document variables will be created or transformed |
... |
name-value pairs of expressions for document variable modification or assignment; see mutate. |
Value
A corpus with new or modified document variables.
Examples
data_corpus_inaugural %>%
  mutate(fullname = paste(FirstName, President, sep = ", ")) %>%
  summary(n = 5)
data_corpus_inaugural %>%
  transmute(fullname = paste(FirstName, President, sep = ", ")) %>%
  summary(n = 5)
Pull out a single document variable
Description
Works like $ for quanteda objects with document variables, or like
docvars(x, "varname").
Usage
## S3 method for class 'corpus'
pull(.data, var = -1, name = NULL, ...)
## S3 method for class 'tokens'
pull(.data, var = -1, name = NULL, ...)
## S3 method for class 'dfm'
pull(.data, var = -1, name = NULL, ...)
Arguments
.data |
a quanteda object with document variables |
var |
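A variable specified as:
- a literal variable name
- a positive integer, giving the position counting from the left
- a negative integer, giving the position counting from the right.
The default returns the last column (on the assumption that's the column you've created most recently). This argument is taken by expression and supports quasiquotation (you can unquote column names and column locations). |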
name |
An optional parameter that specifies the column to be used
as names for a named vector. Specified in a similar manner as var. |
... |
For use by methods. |
Value
A vector containing the values of the specified document variable.
Examples
tail(data_corpus_inaugural) %>% pull(President)
tail(data_corpus_inaugural) %>% pull(-1)
tail(data_corpus_inaugural) %>% pull(1)
toks <- data_corpus_inaugural %>%
  tail() %>%
  tokens()
pull(toks, President)
dfmat <- data_corpus_inaugural %>%
  tail() %>%
  tokens() %>%
  dfm()
pull(dfmat, President)
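The three ways of extracting a document variable mentioned in the description can be compared directly. This is a sketch that assumes 'quanteda.tidy' is installed; the object names are chosen for this example.

```r
library(quanteda)
library(dplyr)
library(quanteda.tidy)

corp_tail <- tail(data_corpus_inaugural)

via_pull <- pull(corp_tail, President)        # tidy interface
via_dollar <- corp_tail$President             # quanteda's $ accessor
via_docvars <- docvars(corp_tail, "President")

# The two quanteda accessors return the same vector
identical(via_dollar, via_docvars)
# pull() should return the same values
all(via_pull == via_dollar)
```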
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- tidyselect
all_of, any_of, contains, ends_with, everything, last_col, matches, num_range, one_of, starts_with
Change column order of document variables
Description
Use relocate() to change the column positions of document variables, using
the same syntax as select() to make it easy to move blocks
of columns at once.
Usage
## S3 method for class 'corpus'
relocate(.data, ...)
Arguments
.data |
a quanteda corpus object with document variables |
... |
<tidy-select> document variables to move |
Value
A corpus with document variables reordered.
Examples
data_corpus_inaugural %>%
  relocate(President, Party) %>%
  summary(n = 5)
data_corpus_inaugural %>%
  relocate(FirstName, President, .before = Year) %>%
  summary(n = 5)
Rename document variables
Description
rename() changes the names of individual document variables using new_name = old_name syntax; rename_with() renames columns using a function.
Usage
## S3 method for class 'corpus'
rename(.data, ...)
## S3 method for class 'corpus'
rename_with(.data, .fn, .cols = everything(), ...)
Arguments
.data |
a quanteda object with document variables |
... |
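For rename(): <tidy-select> use new_name = old_name to rename selected document variables. For rename_with(): additional arguments passed on to .fn. |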
.fn |
A function used to transform the selected .cols. Should return a character vector the same length as the input. |
.cols |
<tidy-select> document variables to rename; defaults to all document variables |
Value
A corpus with renamed document variables.
Examples
data_corpus_inaugural %>%
  rename(LastName = President) %>%
  summary(n = 5)
data_corpus_inaugural %>%
  rename_with(toupper) %>%
  summary(n = 5)
data_corpus_inaugural %>%
  rename_with(toupper, starts_with("P")) %>%
  summary(n = 5)
Subset docvars using their names and types
Description
Select (and optionally rename) document variables in a corpus, using a
concise mini-language that makes it easy to refer to variables based on their
name (e.g. a:f selects all columns from a on the left to f on the
right). You can also use predicate functions like is.numeric to select
variables based on their properties.
Usage
## S3 method for class 'corpus'
select(.data, ...)
Arguments
.data |
a quanteda object with document variables |
... |
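<tidy-select> one or more unquoted document variable names or selection expressions, separated by commas |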
Details
For an overview of selection features, see dplyr::select().
Value
A corpus with the specified subset of document variables.
Examples
data_corpus_inaugural %>%
  select(Party, Year) %>%
  summary(n = 5)
Subset documents using their positions
Description
slice() lets you index documents by their (integer) locations. It allows you
to select, remove, and duplicate documents. It is accompanied by a number of
helpers for common use cases:
- slice_head() and slice_tail() select the first or last documents.
- slice_sample() randomly selects documents.
- slice_min() and slice_max() select documents with the lowest or highest values of a document variable.
Usage
## S3 method for class 'corpus'
slice(.data, ..., .preserve = FALSE)
## S3 method for class 'corpus'
slice_head(.data, ..., n, prop)
## S3 method for class 'corpus'
slice_tail(.data, ..., n, prop)
## S3 method for class 'corpus'
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
## S3 method for class 'corpus'
slice_min(.data, ..., n, prop, with_ties = TRUE)
## S3 method for class 'corpus'
slice_max(.data, ..., n, prop, with_ties = TRUE)
Arguments
.data |
a quanteda corpus object |
... |
additional arguments passed to methods |
.preserve |
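Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data; otherwise the grouping is kept as is. |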
n, prop |
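Provide either n, the number of documents, or prop, the proportion of documents to select. If neither is supplied, n = 1 will be used. |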
weight_by |
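<data-masking> sampling weights; must evaluate to a vector of non-negative numbers the same length as the input |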
replace |
Should sampling be performed with (TRUE) or without (FALSE, the default) replacement? |
with_ties |
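Should ties be kept together? The default, TRUE, may return more documents than you request. Use FALSE to ignore ties and return the first n documents. |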
Value
An object of the same type as .data. The output has the following
properties:
Each document may appear 0, 1, or many times in the output. (If duplicated, then document names will be modified to remain unique.)
Document variables are not modified.
Examples
slice(data_corpus_inaugural, 2:5)
slice(data_corpus_inaugural, 55:n())
slice_head(data_corpus_inaugural, n = 2)
slice_tail(data_corpus_inaugural, n = 3)
slice_tail(data_corpus_inaugural, prop = .05)
set.seed(42)
slice_sample(data_corpus_inaugural, n = 3)
slice_sample(data_corpus_inaugural, prop = .10, replace = TRUE)
# store token counts in a new object rather than overwriting the built-in corpus
corp_ntoks <- data_corpus_inaugural %>%
  mutate(ntoks = ntoken(data_corpus_inaugural))
# shortest three texts
slice_min(corp_ntoks, ntoks, n = 3)
# longest three texts
slice_max(corp_ntoks, ntoks, n = 3)