---
title: "Retrieving and Processing PubMed Records using easyPubMed"
author: "Damiano Fantini, Ph.D."
date: "March 27, 2019"
output: html_document
vignette: >
  %\VignetteIndexEntry{Getting Started with easyPubMed}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
  % \VignetteDepends{easyPubMed}
---

PubMed is an online repository of references and abstracts of publications in the fields of medicine and life sciences. PubMed is a free resource that is developed and maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). PubMed homepage is located at the following URL: [https://www.ncbi.nlm.nih.gov/pubmed/](https://www.ncbi.nlm.nih.gov/pubmed/). Alternatively, PubMed can be programmatically queried via the NCBI Entrez E-utilities interface.

**easyPubMed** is an **R interface to the Entrez Programming Utilities** aimed at allowing an easy and smooth programmatic access to PubMed. The package is suitable for batch downloading large volumes of records (via the `batch_pubmed_download()` function), and comes with a set of functions to perform basic processing of the PubMed query output. *easyPubMed* can request and handle PubMed records in either XML or TXT format. 
This vignette covers the key functions of the package and provides informative examples.

# Installation

To install **easyPubMed** from CRAN, you can run the following line of code

```{r inst__0001, include = TRUE, echo = TRUE, eval = FALSE}
install.packages("easyPubMed")
```

Before using the functions included in **easyPubMed**, make sure to load the library. 

```{r inst___02, include = TRUE, echo = TRUE, eval = FALSE}
library(easyPubMed)
```

```{r include = FALSE}
library(easyPubMed)
data("EPMsamples")
```

### Dev Version

A *dev* version of this software may be found on GitHub. If you are interested in trying it out, you can install it using the `devtools`library as follows.

```{r inst___04, include = TRUE, echo = TRUE, eval = FALSE}
library(devtools)
install_github("dami82/easyPubMed")
```

# Getting started - Retrieving data from PubMed

The first section of this tutorial covers how to use easyPubMed for querying Entrez and how to retrieve or download PubMed records from the Entrez History Server. 

## A simple PubMed query via easyPubMed

Performing a standard PubMed search via easyPubMed is a two-step process: 

- the *PubMed query* step

- the *data retrieval* step

PubMed is queried via the `get_pubmed_ids()` function, which takes a **Query string** as argument. The standard PubMed synthax applies, i.e. you can use the same tags-filters as in the web search engine. A PubMed query by the `get_pubmed_ids()` function results in:  

- the query results are posted on the Entrez History Server ready for retrieval

- the function returns a list containing all information to access and download resuts from the server

Data can  be retrieved from the History Server via the `fetch_pubmed_data()` function. The records can be requested in either XML or TXT format. Here following you can find a very simple example. 

```{r message = FALSE, warning = FALSE, eval = FALSE}
my_query <- 'Damiano Fantini[AU] AND "2018"[PDAT]'
my_entrez_id <- get_pubmed_ids(my_query)
my_abstracts_txt <- fetch_pubmed_data(my_entrez_id, format = "abstract")
```

```{r message = FALSE, warning = FALSE, eval = TRUE, echo = FALSE, include=FALSE}
# Loading from the dataset attached to the package
# You may omit this conversion if your system supports UTF8
my_abstracts_txt <- iconv(EPMsamples$DF_papers_abs$pm_res, from = "UTF8", to = "ASCII", sub = ".")
```

```{r message = FALSE, warning = FALSE, eval = TRUE}
head(my_abstracts_txt)
```

Here, the PubMed records were retrieved in the *Abstract* format. The formats supported by Entrez and easyPubMed are the following: "asn.1", "xml", "medline", "uilist", "abstract". The following example shows how to retrieve PubMed records in XML format. 

**Note!!!** Unlike before, the `fetch_pubmed_data()` function returns its output as a *character*-class object, AND NOT as an *XMLInternalDocument* and *XMLAbstractDocument*-class object. To access fields in such object, we recommend using the `custom_grep()` function included in *easyPubMed*. For example, it is possible to extract the title of each Article as follows.

```{r message = FALSE, warning = FALSE, eval = FALSE}
my_abstracts_xml <- fetch_pubmed_data(pubmed_id_list = my_entrez_id)
```

```{r include=FALSE, echo = FALSE, eval = TRUE}
# Loading from the dataset attached to the package
# You may omit this conversion if your system supports UTF8
my_abstracts_xml <- iconv(EPMsamples$DF_papers_std$pm_res, from = "UTF8", to = "ASCII", sub = ".")
```

```{r message = FALSE, warning = FALSE, eval = TRUE}
class(my_abstracts_xml) 

my_titles <- custom_grep(my_abstracts_xml, "ArticleTitle", "char")

# use gsub to remove the tag, also trim long titles
TTM <- nchar(my_titles) > 75
my_titles[TTM] <- paste(substr(my_titles[TTM], 1, 70), "...", sep = "")

# Print as a data.frame (use kable)
head(my_titles)
```

## Downloading and saving records as XML or TXT files

Instead of retrieving PubMed records as character- or XML-class objects, it is also possible to download all records returned by a PubMed query, and save them as *txt* or *xml* files on the local machine. Downloaded records will be saved locally as one or more files with a common user-defined prefix followed by a sequential number and the *txt* or *xml* extension. If a destination folder is not specified, the current directory will be used as target directory for the download. The `batch_pubmed_download()` function is suitable for downloading very large volumes of PubMed records. 

**Note** that we included an argument (namely, `encoding`) to force the encoding of the retrieved records. Here, we recommend "UTF8". However, you can select different encodings (depending on the local platform). As an example, here we are specifying `encoding="ASCII"`.

```{r message = FALSE, warning = FALSE, eval=FALSE}
new_query <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]' 
out.A <- batch_pubmed_download(pubmed_query_string = new_query, 
                               format = "xml", 
                               batch_size = 20,
                               dest_file_prefix = "easyPM_example",
                               encoding = "ASCII")

```

```{r message = FALSE, warning = FALSE, include = FALSE, echo = FALSE, eval=TRUE}
# Loading from the dataset attached to the package
out.A <- EPMsamples$NUBL_dw18$pm_res
```

```{r message = FALSE, warning = FALSE, eval=TRUE}
# this variable stores the name of the output files
print(out.A) 
```

# Analyzing PubMed records

The second section of this tutorial covers those *easyPubMed* functionalities aimed at transforming and analyzing PubMed records. *easyPubMed* comes with a set of dedicated functions that perform these tasks and manipulate PubMed results. These functions will be covered in the following section.

## Extracting Affiliation data from a single PubMed record
 
To convert a whole set ofPubMed records (raw input from `fetch_pubmed_data()`, or `batch_pubmed_download()`) into a list of individual records (actually, a character vector of Pubmed records), the `articles_to_list()` function is used. This function identifies and splits the input at each occurrence of a <PubmedArticle> field. Individual records are returned as strings, and as elements of a character vector. Each element will include its XML tags. 

```{r message = FALSE, warning = FALSE, eval=TRUE}
my_PM_list <- articles_to_list(pubmed_data = my_abstracts_xml)
class(my_PM_list[1])
print(substr(my_PM_list[4], 1, 510))
```

Affiliations or other fields of interest can be extracted from a specific record using the `custom_grep()` function, that combines regular expressions (*regexpr*, *gsub*) and substring extraction (*substr*). The fields extracted from the record will be returned as elements of a list or a character vector. 

```{r message = FALSE, warning = FALSE, eval=TRUE}
curr_PM_record <- my_PM_list[1]
custom_grep(curr_PM_record, tag = "PubDate")

custom_grep(curr_PM_record, tag = "LastName", format = "char")
```

*easyPubMed* implements (out-of-the-box) a tool for extracting data from a PubMed record: the `article_to_df()` function. This function accepts a string as input (typically, an element of the vector returned by an `articles_to_list()` call) and returns a data.frame. Each row corresponds to a different author and/or paper; columns are features corresponding to the following fields: `c("pmid", "doi", "title", "abstract", "year", "month", "day", "jabbrv", "journal", "lastname", "firstname",` 
`"address", "email")`. One of these fields corresponds to the Article Abstract text (column n. 2). If the full text Abstract is not required, it is possible to limit the number of chars retrieved from this field by setting the max_chars argument. 

```{r message = FALSE, warning = FALSE, eval=TRUE}
# Select a single PubMed record from the internal dataset, NUBL_1618
curr_PM_record <- easyPubMed::EPMsamples$NUBL_1618$rec_lst[[37]]
my.df <- article_to_df(curr_PM_record, max_chars = 18)

# Fields extracted from the PubMed record
head(colnames(my.df))

# Trim long strings and then display some content: each row corresponds to one author
my.df$title <- substr(my.df$title, 1, 15)
my.df$address <- substr(my.df$address, 1, 19)
my.df$jabbrv <- substr(my.df$jabbrv, 1, 10)

# Visualize
my.df[,c("pmid", "title", "jabbrv", "firstname", "address")] 
```

When affiliation info are identical for multiple authors, they are usually omitted in the PubMed record. In these cases, addresses may be imputed for all authors in the data.frame by setting the `autofill` argument to `TRUE`.

```{r message = FALSE, warning = FALSE, eval=TRUE}
my.df2 <- article_to_df(curr_PM_record, autofill = TRUE)

# Trim long strings and then display some content: each row corresponds to one author
my.df2$title <- substr(my.df2$title, 1, 15)
my.df2$jabbrv <- substr(my.df2$jabbrv, 1, 10)
my.df2$address <- substr(my.df2$address, 1, 19)

# Visualize
my.df2[,c("pmid", "title", "jabbrv", "firstname", "address")]
```

We can recusively process PubMed records using a loop or `sapply()` or `lapply()`. Here's an example. For more info, check `?lapply` or `?do.call`.

```{r message = FALSE, warning = FALSE, eval=TRUE}
xx <- lapply(my_PM_list, article_to_df, autofill = TRUE, max_chars = 50)
full_df <- do.call(rbind, xx)

full_df[seq(1, nrow(full_df), by = 10), c("pmid", "lastname", "jabbrv")] 
```

## Automatic Data Extraction from XML PubMed Records

To retrieve author information and publication data from multiple XML records at once, it is possible to use the `table_articles_byAuth()` function. This function relies on the funcions discussed above and returns a data.frame including all the fields extracted in the previous example. The function accepts five arguments.

- **pubmed_data**: an XML file or an XML object with PubMed records  

- **max_chars** and **autofill**: same as discussed in the previous example

- **included_authors**: one of the following options c("first", "last", "all"). The function can return data corresponding to the first, the last or all the authors for each PubMed record.

- **dest_file**: if not NULL, the function attempts writing its output to the selected file. Existing files will be overwritten.

#### Approach num. 1: download records first
```{r takes_some_time, message = FALSE, warning = FALSE, eval=TRUE}
new_query <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]' 
out.B <- batch_pubmed_download(pubmed_query_string = new_query, 
                               dest_file_prefix = "NUBL_18_", 
                               encoding = "ASCII")

# Retrieve the full name of the XML file downloaded in the previous step
new_PM_file <- out.B[[1]]
new_PM_df <- table_articles_byAuth(pubmed_data = new_PM_file, 
                                   included_authors = "first", 
                                   max_chars = 0, 
                                   encoding = "ASCII")

# Printing a sample of the resulting data frame
new_PM_df$address <- substr(new_PM_df$address, 1, 28)
new_PM_df$jabbrv <- substr(new_PM_df$jabbrv, 1, 9)
sid <- seq(5, nrow(new_PM_df), by = 10)

new_PM_df[sid, c("pmid", "year", "jabbrv", "lastname", "address")]
```

#### Approach num. 2: fetch records on-the-fly!
```{r takes_some_time2, message = FALSE, warning = FALSE, eval=FALSE}
new_query <- 'Bladder[TIAB] AND Northwestern[AD] AND Chicago[AD] AND "2018"[PDAT]' 
new_query <- get_pubmed_ids(new_query)
fetched_data <- fetch_pubmed_data(new_query, encoding = "ASCII")
```

```{r takes_some_time2biz, include = FALSE, echo = FALSE, message = FALSE, warning = FALSE, eval=TRUE}
fetched_data <- EPMsamples$NUBL_1618$pm_res
```

```{r takes_some_time2triz, message = FALSE, warning = FALSE, eval=TRUE}
new_PM_df <- table_articles_byAuth(pubmed_data = fetched_data, 
                                   included_authors = "first", 
                                   max_chars = 0, 
                                   encoding = "ASCII")

# Printing a sample of the resulting data frame
new_PM_df$address <- substr(new_PM_df$address, 1, 28)
new_PM_df$jabbrv <- substr(new_PM_df$jabbrv, 1, 9)
sid <- seq(5, nrow(new_PM_df), by = 10)

new_PM_df[sid, c("pmid", "year", "jabbrv", "lastname", "address")] 
```

# More info, other examples and vignettes, and Advanced Guides

- **Getting started with easyPubMed - vignette** (https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_01_getting_started.html)

- **Advanced Features for Analysis of PubMed Records using easyPubMed - vignette** (https://www.data-pulse.com/projects/Rlibs/vignettes/easyPubMed_02_advanced_tutorial.html)


# References

- **easyPubMed official website** including news, vignettes, and further information [https://www.data-pulse.com/dev_site/easypubmed/](https://www.data-pulse.com/dev_site/easypubmed/)

- Sayers, E. **A General Introduction to the E-utilities** (NCBI) [https://www.ncbi.nlm.nih.gov/books/NBK25497/](https://www.ncbi.nlm.nih.gov/books/NBK25497/)

- **PubMed Help (NCBI)** [https://www.ncbi.nlm.nih.gov/books/NBK3827/](https://www.ncbi.nlm.nih.gov/books/NBK3827/)

- **Howto: basic usage of easyPubMed - an example** [Tutorial/Blog Post](http://www.biotechworld.it/bioinf/2016/01/05/querying-pubmed-via-the-easypubmed-package-in-r/)

- **Howto: using easyPubMed for a targeting campaign** [Tutorial/Blog Post](http://www.biotechworld.it/bioinf/2016/01/21/scraping-pubmed-data-via-easypubmed-xml-and-regex-in-r-for-a-targeting-campaign/)

- **Dev version** of easyPubMed on **GitHub** [Website](https://github.com/dami82/easyPubMed)

# Feedback and Citation

Thank you very much for using easyPubMed and/or reading this vignette. Please, feel free to contact me (author/maintainer) for feedback, questions and suggestions: my email is <damiano.fantini(at)gmail(dot)com>. More info about easyPubMed are available at the following URL: [www.data-pulse.com](http://www.data-pulse.com/).  

*easyPubMed* Copyright (C) 2017-2019 Damiano Fantini. *This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.*

**!!Note!!** If you are using *easyPubMed* for a scientific publication, please name the package in the Materials and Methods section of the paper. Thanks! Also, I am always open to collaborations. If you have an idea you would like to discuss or develop based on what you read in this Vignette, feel free to contact me via email. 
Thank you.

# SessionInfo

```{r message = FALSE, warning = FALSE, eval=TRUE}
sessionInfo()
```

```{r include = FALSE}
# cleaning
for (xfile in c(out.A, out.B)) {
   tryCatch(file.remove(xfile), error = function(e){NULL})  
}
```