install.packages("openalexR")
install.packages("ggplot2")
install.packages("tibble")
## Static plot
install.packages("tidygraph")
install.packages("ggraph")
## Interactive plot
install.packages("networkD3")
install.packages("DT")
Technical Guideline Series
Snowball Search for Literature Search and Analysis using OpenAlex
In addition to a typical literature search based on search terms, one can conduct a snowball search. A snowball search starts from a set of identified key papers and identifies the publications cited in the key papers as well as the publications citing them. This search strategy identifies related articles using citation relationships (not keywords), building a citation network around the key papers. Here we demonstrate how a snowball search can be conducted using OpenAlex and R, with code examples, and discuss some advantages and shortcomings of this approach. We also outline some approaches to analysing a citation network, without going into too much detail.
Prepared by
- Rainer M. Krug - IPBES Task Force on Knowledge and Data.
Reviewed by
- Aidin Niamir - IPBES Technical Support Unit for Knowledge and Data
- Renske Gudde - IPBES Technical Support Unit for Knowledge and Data
- Yanina V. Sica - IPBES Technical Support Unit for Knowledge and Data
For any inquiries, please contact Aidin.Niamir@senckenberg.de
Version: 0.6.0
Last updated: 24 September, 2025
Introduction
In the context of literature search, a snowball search (also known as chain sampling, chain-referral sampling, or referral sampling) starts with a set of key papers and identifies the papers referenced in the key papers (backward search) as well as the papers citing one or more key papers (forward search).
These key papers form the core of the search: they should span the range of the topic that the resulting literature corpus is meant to cover and include the most relevant papers on that topic. Consequently, the selection of the key papers is essential, as it determines the quality and usability of the resulting corpus for the specific question asked.
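The two search directions can be illustrated on a toy citation table. This is only a minimal sketch in base R: the paper labels and the `citations` data frame are made up for illustration and do not come from OpenAlex.

```r
# Toy citation network: each row means that `from` cites `to`.
# All paper labels are hypothetical.
citations <- data.frame(
  from = c("key1", "key1", "p3",   "p4",   "key2", "p5"),
  to   = c("p1",   "p2",   "key1", "key1", "p2",   "key2"),
  stringsAsFactors = FALSE
)
key_papers <- c("key1", "key2")

# Backward search: works referenced in the key papers
backward <- unique(citations$to[citations$from %in% key_papers])

# Forward search: works citing at least one key paper
forward <- unique(citations$from[citations$to %in% key_papers])

# Resulting corpus: key papers plus everything found in either direction
corpus <- unique(c(key_papers, backward, forward))

print(backward)
print(forward)
print(corpus)
```

Note that a work cited by several key papers (here `p2`) appears only once in the corpus.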
Methods
Databases suitable for Snowballing
In principle, snowballing can be done with most (if not all) literature databases. Examples are Web of Science, Scopus, or PubMed, which allow searching for papers through their web interfaces or through API services (e.g. https://dev.elsevier.com/sc_apis.html). In the web interface, one can only search for one article at a time, so a snowball search for multiple key papers is very cumbersome. The APIs of the proprietary databases make it possible to automate the search, but this can quickly become very expensive due to high license fees, and usage restrictions still apply.
A good alternative is OpenAlex. OpenAlex is an open scholarly data source which does not charge fees for basic API usage (which is enough for most projects). The snowball search can therefore be scripted. API clients are available for many programming languages, including R (openalexR) and Python (pyalex). This technical guideline focuses on the use of OpenAlex (the literature data source recommended by the data TSU) through the R package openalexR, which is well suited for this task and easy to use.
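To give an idea of what such a client does under the hood, the following sketch builds (but does not send) request URLs for the OpenAlex REST API. The helper function names and the placeholder work id are hypothetical; the `works` endpoint and the `cites` / `cited_by` filters reflect our reading of the public OpenAlex API documentation and should be checked there before use. In practice, openalexR constructs and sends these requests for you.

```r
# Base URL of the OpenAlex works endpoint
openalex_base <- "https://api.openalex.org/works"

# Hypothetical helper: URL to look up a single work by its DOI
work_by_doi_url <- function(doi) {
  paste0(openalex_base, "/https://doi.org/", doi)
}

# Hypothetical helper: forward search, i.e. works that cite the given work
forward_url <- function(openalex_id) {
  paste0(openalex_base, "?filter=cites:", openalex_id)
}

# Hypothetical helper: backward search, i.e. works cited by the given work
backward_url <- function(openalex_id) {
  paste0(openalex_base, "?filter=cited_by:", openalex_id)
}

print(work_by_doi_url("10.1007/s13280-014-0582-z"))
print(forward_url("W0000000000"))  # placeholder id, not a real work
print(backward_url("W0000000000"))
```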
Snowballing in R
Installation and loading of the package
The package openalexR is available from CRAN. It can be installed, together with the additional packages used in this technical guideline, using the install.packages() calls listed at the top of this document,
and loaded using
library(openalexR)
openalexR v2.0.0 introduces breaking changes.
See NEWS.md for details.
To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
Snowballing
Let’s assume we have a list of DOIs of the key papers, which have been identified by experts with knowledge of the field and topic.
dois <- c(
"10.1016/j.marpol.2023.105710", "10.1007/s13280-014-0582-z",
"10.1016/j.cosust.2022.101160", "10.1146/annurev-environ-102014-021340"
# "10.1016/j.eist.2016.09.001", "10.1016/j.cosust.2019.12.004"
)
As the snowball search in OpenAlex is based on internal identifiers (OpenAlex ids), the DOIs need to be converted to these internal ids. This can be done using the function oa_fetch() of the package openalexR, which takes a list of DOIs and returns a list of works (works are documents in OpenAlex). The function has many other arguments, but I will only show its use for this snowball search. For more information, please refer to the help in R or the documentation.
key_works <- oa_fetch(
entity = "works",
doi = dois,
verbose = FALSE
)
Now we can do the snowball search by using the function oa_snowball() with key_works$id as the identifier.
It is always a good idea to wrap the snowball search (and other long-running calculations) in a construct which only runs the search if it has not been conducted before, as the snowball search can take some time, especially with many key papers.
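fn <- file.path("data", "snowball.rds")
if (file.exists(fn)) {
  # Reuse the cached result from an earlier run
  snowball <- readRDS(fn)
} else {
  ### Here comes the actual snowball search
  snowball <- oa_snowball(
    identifier = key_works$id,
    verbose = FALSE
  )
  ###
  # Make sure the target directory exists before caching the result
  dir.create(dirname(fn), showWarnings = FALSE, recursive = TRUE)
  saveRDS(snowball, fn)
}
The resulting list snowball has two elements: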
- nodes, which contains the nodes of the citation network, i.e. the individual publications (or works in the terminology of OpenAlex). This includes the key papers, which are identified by the field oa_input, which is TRUE for the key papers.
- edges, which contains the edges of the citation network, i.e. the citations between the works.
The edges only contain the citations from and to the key papers. The citation network does not include any citations between the newly identified works. We will come back to this later.
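This property can be checked, and each edge can be labelled by search direction, with a few lines of base R. The sketch below uses a hand-made toy object that only mimics the shape of an oa_snowball() result; the ids are hypothetical, and it assumes that a row in edges means that `from` cites `to`, which should be verified against the openalexR documentation.

```r
# Toy object mimicking the structure of an oa_snowball() result.
# Assumption: a row in `edges` means that `from` cites `to`.
snowball_toy <- list(
  nodes = data.frame(
    id = c("W1", "W2", "W3", "W4", "W5"),
    oa_input = c(TRUE, TRUE, FALSE, FALSE, FALSE),
    stringsAsFactors = FALSE
  ),
  edges = data.frame(
    from = c("W1", "W4", "W5", "W2", "W4"),
    to   = c("W3", "W1", "W1", "W3", "W2"),
    stringsAsFactors = FALSE
  )
)

key_ids <- snowball_toy$nodes$id[snowball_toy$nodes$oa_input]
edges <- snowball_toy$edges

# Check that every edge touches at least one key paper
touches_key <- edges$from %in% key_ids | edges$to %in% key_ids

# Label each edge: a key paper citing another work is a backward hit,
# another work citing a key paper is a forward hit
edges$direction <- ifelse(
  edges$from %in% key_ids & edges$to %in% key_ids, "between key papers",
  ifelse(edges$from %in% key_ids, "backward", "forward")
)

print(all(touches_key))
print(table(edges$direction))
```

With a real result, replacing snowball_toy by snowball should work as long as the column names match.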
Another useful function is snowball2df() which flattens the object returned by oa_snowball() into a tibble. It is usually much easier to work with this flat structure.
flat_snow <- snowball2df(snowball) |>
  tibble::as_tibble()
Plotting the network
Static Network Graph
There are many different packages in R to plot networks (igraph, ggraph together with tidygraph, and others).
For now, I will only show how to plot the network using ggraph and tidygraph, without going into detail on how the plot can be tweaked.
library(tidygraph)
Attaching package: 'tidygraph'
The following object is masked from 'package:stats':
filter
library(ggraph)
Loading required package: ggplot2
fn <- file.path("figures", "cited_by_count.png")
if (!file.exists(fn)) {
  p_cb <- ggraph::ggraph(
    graph = tidygraph::as_tbl_graph(snowball),
    layout = "stress"
  ) +
    ggraph::geom_edge_link(
      ggplot2::aes(alpha = ggplot2::after_stat(index)),
      show.legend = FALSE
    ) +
    ggraph::geom_node_point(
      ggplot2::aes(fill = oa_input, size = cited_by_count),
      shape = 21, color = "white"
    ) +
    ggraph::geom_node_label(
      ggplot2::aes(filter = oa_input, label = id),
      nudge_y = 0.2, size = 3
    ) +
    ggraph::scale_edge_width(range = c(0.1, 1.5), guide = "none") +
    ggplot2::scale_size(range = c(3, 10), guide = "none") +
    ggplot2::scale_fill_manual(
      values = c("#a3ad62", "#d46780"), na.value = "grey", name = ""
    ) +
    ggraph::theme_graph() +
    ggplot2::theme(
      plot.background = ggplot2::element_rect(fill = "transparent", colour = NA),
      panel.background = ggplot2::element_rect(fill = "transparent", colour = NA),
      legend.position = "bottom"
    ) +
    ggplot2::guides(fill = "none") +
    ggplot2::ggtitle("Cited by count")
  ggplot2::ggsave(
    file.path("figures", "cited_by_count.pdf"),
    plot = p_cb, device = cairo_pdf, width = 20, height = 15
  )
  ggplot2::ggsave(fn, plot = p_cb, width = 20, height = 15, bg = "white", dpi = 600)
}