Package: wordpiece.data
Title: Data for Wordpiece-Style Tokenization
Version: 2.0.0
Authors@R: c(
    person(given = "Jonathan",
           family = "Bratt",
           role = c("aut"),
           email = "jonathan.bratt@macmillan.com",
           comment = c(ORCID = "0000-0003-2859-0076")),
    person(given = "Jon",
           family = "Harmon",
           role = c("aut", "cre"),
           email = "jonthegeek@gmail.com",
           comment = c(ORCID = "0000-0003-4781-4346")),
    person(given = "Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning", 
           role = c("cph")),
    person(given = "Google, Inc", 
           role = c("cph"), comment = "original BERT vocabularies")
    )
Description: Provides vocabulary data used by the wordpiece algorithm to
    tokenize text into subword units. The included vocabularies were
    retrieved from
    <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and
    <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed
    into an R-friendly format.
License: Apache License (>= 2)
Encoding: UTF-8
RoxygenNote: 7.1.2
URL: https://github.com/macmillancontentscience/wordpiece.data
BugReports: https://github.com/macmillancontentscience/wordpiece.data/issues
Depends: R (>= 3.5.0)
Suggests: testthat (>= 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-03-03 15:50:03 UTC; jonth
Author: Jonathan Bratt [aut] (<https://orcid.org/0000-0003-2859-0076>),
  Jon Harmon [aut, cre] (<https://orcid.org/0000-0003-4781-4346>),
  Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph],
  Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <jonthegeek@gmail.com>
Repository: CRAN
Date/Publication: 2022-03-03 16:20:02 UTC
Built: R 4.5.1; ; 2025-10-06 01:21:56 UTC; windows
