This package provides tools to download records from the NCBI PubMed database based on user-specified search criteria, and to add CrossRef citation data for the returned records. The output is in tidy data format, facilitating downstream analysis using tools from the ‘tidyverse’.
This vignette illustrates the use of the package to download data for an author from PubMed and to build a word cloud from the titles of their publications.
In this example I will use the publications of Prof Rolf-Detlef Treede, a renowned scientist in the field of pain research.
Enter search parameters and perform a search using the get_records function. The function's parameters are:
search_terms: A character string of terms that defines the scope of the PubMed database query. Boolean operators (AND, OR, NOT) and search field tags may be used to create more complex search criteria. Commonly used search field tags include [AU] (author), [TI] (title), and [TA] (journal title abbreviation).
pub_type: A character string specifying the type of publication the search must return. The default value is ‘journal article’. For more information: PubMed article types.
min_date: A character string in the format ‘YYYY/MM/DD’, ‘YYYY/MM’ or ‘YYYY’ specifying the starting date of the search. The default value is ‘1966/01/01’.
max_date: A character string in the format ‘YYYY/MM/DD’, ‘YYYY/MM’ or ‘YYYY’ specifying the end date of the search. The default value is Sys.Date().
date_type: A character string specifying which publication date type the date range applies to. Available values are ‘PDAT’ (publication date), ‘EDAT’ (Entrez date) and ‘MDAT’ (modification date).
has_abstract: Logical specifying whether the returned records should be limited to those records that have an abstract. The default value is TRUE.
api_key: An API key character string obtained from the user's NCBI account. The key is not essential, but specifying one gives substantially faster record query rates.
Returning the records can take a while if there are many records, so I suggest that you run count_records before get_records (they take the same parameters) to check how many record queries will be made before executing a request.
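For example, the query used below can first be checked with count_records. This is a sketch only; the source states that count_records takes the same parameters as get_records, but the exact form of its return value is an assumption.

```r
# Check how many records the query will match before downloading anything.
# count_records takes the same parameters as get_records.
n_records <- count_records(search_terms = "Treede RD[AU] AND Pain[TA]",
                           min_date = '2000/01/01',
                           max_date = '2018/12/31',
                           pub_type = 'journal article',
                           date_type = 'PDAT')
n_records
```

If the count is very large, you may want to narrow the date range or search terms before calling get_records.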
# Search for journal articles by RD Treede in the journal "PAIN" and
# which were published between 1 January 2000 and 31 December 2018
df <- get_records(search_terms = "Treede RD[AU] AND Pain[TA]",
                  min_date = '2000/01/01',
                  max_date = '2018/12/31',
                  api_key = NULL, # Add only if you have one (see documentation)
                  pub_type = 'journal article',
                  date_type = 'PDAT')
Have a quick look at the output.
# Print first 10 lines
print(df)
## # A tibble: 1,933 × 12
## surname initi…¹ title journal status volume pages year_…² year_…³ pmid doi
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 zou l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 2 yu k the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 3 fan y the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 4 cao s the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 5 liu s the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 6 shi l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 7 li l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 8 yuan h the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 9 yang r the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 10 yi z the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## # … with 1,923 more rows, 1 more variable: abstract <chr>, and abbreviated
## # variable names ¹initials, ²year_published, ³year_online
# View structure
glimpse(df)
## Rows: 1,933
## Columns: 12
## $ surname <chr> "zou", "yu", "fan", "cao", "liu", "shi", "li", "yuan", …
## $ initials <chr> "l", "k", "y", "s", "s", "l", "l", "h", "r", "z", "y", …
## $ title <chr> "the inhibition by guanfu base a of neuropathic pain me…
## $ journal <chr> "acs chem neurosci", "acs chem neurosci", "acs chem neu…
## $ status <chr> "ppublish", "ppublish", "ppublish", "ppublish", "ppubli…
## $ volume <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "…
## $ pages <chr> "1318-1325", "1318-1325", "1318-1325", "1318-1325", "13…
## $ year_published <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019",…
## $ year_online <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018",…
## $ pmid <chr> "30475578", "30475578", "30475578", "30475578", "304755…
## $ doi <chr> "10.1021/acschemneuro.8b00399", "10.1021/acschemneuro.8…
## $ abstract <chr> "activation of satellite glial cells (sgcs) in the dors…
Each author of a paper is placed on a separate row, with the rest of the information duplicated across the authors of a given article. Making each row a unique author-article record keeps the data in a tidy format and makes filtering records by co-author easier. The downside is that the returned dataframe can be quite large because of all the duplicated information.
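Because each row is one author of one article, standard dplyr verbs can be used either to filter by co-author or to collapse the table back to one row per paper. A minimal sketch, assuming df is the tibble returned by get_records above (note that the package returns surnames in lower case):

```r
library(dplyr)

# Keep only the rows where Treede is listed as an author
treede_rows <- df %>%
    filter(surname == 'treede')

# Collapse back to one row per article, using the PubMed ID
# as the unique article identifier
one_per_paper <- df %>%
    distinct(pmid, .keep_all = TRUE)
```

The distinct(pmid, .keep_all = TRUE) call keeps the first author row of each article, which is enough for per-paper summaries where the author columns are not needed.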
Although not essential for this example, you can add CrossRef citation counts to the records using the get_citations function. This function requires you to pass it the output from get_records. Adding the citations can also take a while if there are many records.
# Add a column called "crossref_citations" to the first 6 observations
df_citations <- get_citations(head(df))
## Joining, by = "pmid"
# View structure
glimpse(df_citations)
## Rows: 6
## Columns: 13
## $ surname <chr> "zou", "yu", "fan", "cao", "liu", "shi"
## $ initials <chr> "l", "k", "y", "s", "s", "l"
## $ title <chr> "the inhibition by guanfu base a of neuropathic pai…
## $ journal <chr> "acs chem neurosci", "acs chem neurosci", "acs chem…
## $ status <chr> "ppublish", "ppublish", "ppublish", "ppublish", "pp…
## $ volume <chr> "10", "10", "10", "10", "10", "10"
## $ pages <chr> "1318-1325", "1318-1325", "1318-1325", "1318-1325",…
## $ year_published <chr> "2019", "2019", "2019", "2019", "2019", "2019"
## $ year_online <chr> "2018", "2018", "2018", "2018", "2018", "2018"
## $ pmid <dbl> 30475578, 30475578, 30475578, 30475578, 30475578, 3…
## $ doi <chr> "10.1021/acschemneuro.8b00399", "10.1021/acschemneu…
## $ abstract <chr> "activation of satellite glial cells (sgcs) in the …
## $ crossref_citations <dbl> 11, 11, 11, 11, 11, 11
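Note that the citation count, like the rest of the article information, is duplicated down every author row of a paper, so summaries should first collapse the data to one row per article. A small dplyr sketch, assuming df_citations from above:

```r
library(dplyr)

# One row per paper, then the total CrossRef citations across papers
df_citations %>%
    distinct(pmid, .keep_all = TRUE) %>%
    summarise(total_citations = sum(crossref_citations))
```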
Now that we have the data, we can generate the word cloud from the article titles.
# Packages needed for the tokenising and plotting steps
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)   # unnest_tokens, stop_words
library(wordcloud2)
library(ggthemes)   # tableau_color_pal

# Tokenise the article titles into bigrams and clean them up
tidy_words <- df %>%
    unnest_tokens(word, title, token = "ngrams", n = 2) %>%
    # Split the bigrams so that stopwords can be removed
    separate(word, into = c('word1', 'word2'), sep = ' ') %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    # Convert terms containing numerals to NA
    mutate(word1 = ifelse(str_detect(word1, '[0-9]'),
                          yes = NA,
                          no = paste(word1))) %>%
    mutate(word2 = ifelse(str_detect(word2, '[0-9]'),
                          yes = NA,
                          no = paste(word2))) %>%
    # Remove NA
    filter(!is.na(word1)) %>%
    filter(!is.na(word2)) %>%
    # Join the word columns back together to form bigrams
    unite(word, word1, word2, sep = ' ')

# Count the bigrams and keep the 100 most frequent for plotting
ngram_count <- tidy_words %>%
    count(word, sort = TRUE)

word_cloud <- ngram_count[1:100, ] %>%
    rename(freq = n)

wordcloud2(data = word_cloud,
           fontFamily = 'arial',
           size = 0.4,
           color = tableau_color_pal(palette = 'Color Blind')(10))