This package provides tools to download records from the NCBI PubMed database based on user-specified search criteria, and to add CrossRef citation data for the returned records. The output is in tidy data format, facilitating downstream analysis using tools from the ‘tidyverse’.
This vignette illustrates the use of the package to download data for an author from PubMed and to build a word cloud from the titles of their publications.
In this example I will use the publications of Prof Rolf-Detlef Treede, a renowned scientist in the field of pain research.
Enter search parameters and perform a search using the get_records function. The function's parameters are:
search_terms: A character string of terms that defines the scope of the PubMed database query. Boolean operators (AND, OR, NOT) and search field tags may be used to create more complex search criteria. Commonly used search field tags include [AU] (author), [TI] (title), and [TA] (journal title abbreviation).
pub_type: A character string specifying the type of publication the search must return. The default value is ‘journal article’. For more information: PubMed article types.
min_date: A character string in the format ‘YYYY/MM/DD’, ‘YYYY/MM’ or ‘YYYY’ specifying the starting date of the search. The default value is ‘1966/01/01’.
max_date: A character string in the format ‘YYYY/MM/DD’, ‘YYYY/MM’ or ‘YYYY’ specifying the end date of the search. The default value is Sys.Date().
date_type: A character string specifying which publication date type the date range applies to. Available values are ‘PDAT’ (publication date), ‘EDAT’ (Entrez date) and ‘MDAT’ (modification date).
has_abstract: Logical specifying whether the returned records should be limited to those records that have an abstract. The default value is TRUE.
api_key: An API key character string obtained from the user's NCBI account. The key is not essential, but specifying one gives substantially faster record query rates.
Returning the records can take a while if there are many records, so I suggest that you run count_records before get_records (they take the same parameters) to check how many record queries will be made before executing a request.
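For example, the query used below can first be checked with count_records. This is a sketch only; the source states that count_records takes the same parameters as get_records, but the exact form of its return value is an assumption.

```r
# Check how many records the query will match before downloading anything.
# count_records takes the same parameters as get_records.
n_records <- count_records(search_terms = "Treede RD[AU] AND Pain[TA]",
                           min_date = '2000/01/01',
                           max_date = '2018/12/31',
                           pub_type = 'journal article',
                           date_type = 'PDAT')
n_records
```

If the count is very large, you may want to narrow the date range or search terms before calling get_records.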
# Search for journal articles by RD Treede in the journal "PAIN" and
# which were published between 1 January 2000 and 31 December 2018
df <- get_records(search_terms = "Treede RD[AU] AND Pain[TA]",
                  min_date = '2000/01/01',
                  max_date = '2018/12/31',
                  api_key = NULL, # Add only if you have one (see documentation)
                  pub_type = 'journal article',
                  date_type = 'PDAT')
Have a quick look at the output.
# Print first 10 lines
print(df)
## # A tibble: 1,933 × 12
## surname initi…¹ title journal status volume pages year_…² year_…³ pmid doi
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 zou l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 2 yu k the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 3 fan y the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 4 cao s the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 5 liu s the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 6 shi l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 7 li l the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 8 yuan h the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 9 yang r the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## 10 yi z the … acs ch… ppubl… 10 1318… 2019 2018 3047… 10.1…
## # … with 1,923 more rows, 1 more variable: abstract <chr>, and abbreviated
## # variable names ¹initials, ²year_published, ³year_online
# View structure
glimpse(df)
## Rows: 1,933
## Columns: 12
## $ surname <chr> "zou", "yu", "fan", "cao", "liu", "shi", "li", "yuan", …
## $ initials <chr> "l", "k", "y", "s", "s", "l", "l", "h", "r", "z", "y", …
## $ title <chr> "the inhibition by guanfu base a of neuropathic pain me…
## $ journal <chr> "acs chem neurosci", "acs chem neurosci", "acs chem neu…
## $ status <chr> "ppublish", "ppublish", "ppublish", "ppublish", "ppubli…
## $ volume <chr> "10", "10", "10", "10", "10", "10", "10", "10", "10", "…
## $ pages <chr> "1318-1325", "1318-1325", "1318-1325", "1318-1325", "13…
## $ year_published <chr> "2019", "2019", "2019", "2019", "2019", "2019", "2019",…
## $ year_online <chr> "2018", "2018", "2018", "2018", "2018", "2018", "2018",…
## $ pmid <chr> "30475578", "30475578", "30475578", "30475578", "304755…
## $ doi <chr> "10.1021/acschemneuro.8b00399", "10.1021/acschemneuro.8…
## $ abstract <chr> "activation of satellite glial cells (sgcs) in the dors…
Each author of a paper is placed on a separate row, with the rest of the information duplicated across the authors of a given article. Making each row a unique author-article record keeps the data in a tidy format and makes filtering records by co-author easier. The downside is that the returned dataframe can be quite large because of all the duplicated information.
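Because each row is one author of one article, standard dplyr verbs can be used either to filter by co-author or to collapse the table back to one row per paper. A minimal sketch, assuming df is the tibble returned by get_records above (note that the package returns surnames in lower case):

```r
library(dplyr)

# Keep only the rows where Treede is listed as an author
treede_rows <- df %>%
    filter(surname == 'treede')

# Collapse back to one row per article, using the PubMed ID
# as the unique article identifier
one_per_paper <- df %>%
    distinct(pmid, .keep_all = TRUE)
```

The distinct(pmid, .keep_all = TRUE) call keeps the first author row of each article, which is enough for per-paper summaries where the author columns are not needed.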
Although not essential for this example, you can add CrossRef citation counts to the records using the get_citations function. This function requires you to pass it the output from get_records. Adding the citations can also take a while if there are many records.
# Add a column called "crossref_citations" to the first 6 observations
df_citations <- get_citations(head(df))
## Joining, by = "pmid"
# View structure
glimpse(df_citations)
## Rows: 6
## Columns: 13
## $ surname <chr> "zou", "yu", "fan", "cao", "liu", "shi"
## $ initials <chr> "l", "k", "y", "s", "s", "l"
## $ title <chr> "the inhibition by guanfu base a of neuropathic pai…
## $ journal <chr> "acs chem neurosci", "acs chem neurosci", "acs chem…
## $ status <chr> "ppublish", "ppublish", "ppublish", "ppublish", "pp…
## $ volume <chr> "10", "10", "10", "10", "10", "10"
## $ pages <chr> "1318-1325", "1318-1325", "1318-1325", "1318-1325",…
## $ year_published <chr> "2019", "2019", "2019", "2019", "2019", "2019"
## $ year_online <chr> "2018", "2018", "2018", "2018", "2018", "2018"
## $ pmid <dbl> 30475578, 30475578, 30475578, 30475578, 30475578, 3…
## $ doi <chr> "10.1021/acschemneuro.8b00399", "10.1021/acschemneu…
## $ abstract <chr> "activation of satellite glial cells (sgcs) in the …
## $ crossref_citations <dbl> 11, 11, 11, 11, 11, 11
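Note that the citation count, like the rest of the article information, is duplicated down every author row of a paper, so summaries should first collapse the data to one row per article. A small dplyr sketch, assuming df_citations from above:

```r
library(dplyr)

# One row per paper, then the total CrossRef citations across papers
df_citations %>%
    distinct(pmid, .keep_all = TRUE) %>%
    summarise(total_citations = sum(crossref_citations))
```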
Now that we have the data, we can generate the word cloud from the article titles.
# Packages needed for the tokenising and plotting steps
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)   # unnest_tokens, stop_words
library(wordcloud2)
library(ggthemes)   # tableau_color_pal

# Tokenise the article titles into bigrams and clean them up
tidy_words <- df %>%
    unnest_tokens(word, title, token = "ngrams", n = 2) %>%
    # Split the bigrams so that stopwords can be removed
    separate(word, into = c('word1', 'word2'), sep = ' ') %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    # Convert terms containing numerals to NA
    mutate(word1 = ifelse(str_detect(word1, '[0-9]'),
                          yes = NA,
                          no = paste(word1))) %>%
    mutate(word2 = ifelse(str_detect(word2, '[0-9]'),
                          yes = NA,
                          no = paste(word2))) %>%
    # Remove NA
    filter(!is.na(word1)) %>%
    filter(!is.na(word2)) %>%
    # Join the word columns back together to form bigrams
    unite(word, word1, word2, sep = ' ')

# Count the bigrams and keep the 100 most frequent for plotting
ngram_count <- tidy_words %>%
    count(word, sort = TRUE)

word_cloud <- ngram_count[1:100, ] %>%
    rename(freq = n)

wordcloud2(data = word_cloud,
           fontFamily = 'arial',
           size = 0.4,
           color = tableau_color_pal(palette = 'Color Blind')(10))