This is a technical background document for the IPBES Thematic Assessment of the underlying causes of biodiversity loss, determinants of transformative change and options for achieving the 2050 vision for biodiversity. It provides technical details and implementation settings for the data management report of the Transformative Change Assessment Corpus and its usage. The sole purpose of this document is to document the workflows used to produce statistics, figures and maps, to record the sources of the data, and to make the process transparent and reproducible.
The Build No is automatically incremented by one each time the report is rendered. It is used to distinguish different renderings when the version stays the same.
To guarantee reproducibility, the data will be downloaded and extracted whenever the folder ch5_subsidies_reform/data does not exist.
All code to re-generate the data is included, but it might take a long time to run and may produce different numbers, as OpenAlex is updated continuously.
To re-generate the data, disable this block and delete all content in the folder ch5_subsidies_reform/data. The folder itself has to exist.
This code will only work after the approval of the assessment by the plenary, as the repository remains confidential until then.
The Subsidies Corpus was downloaded on 11 April 2024.
The corpus download will be stored in data/pages and the parquet database in data/corpus.
This is not on GitHub!
The corpus can be read by running read_corpus(), which opens the database so that it can be fed into a dplyr pipeline. After most dplyr functions, the actual data needs to be collected via dplyr::collect().
Only then is the actual data read!
Needs to be enabled by setting eval: true in the code block below.
The sector definitions are based on the subfields assigned to each work by OpenAlex; these subfields were grouped by experts into sectors. See this Google Doc for details.
```r
if (!dir.exists(params$corpus_topics_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus and the sector definitions as DuckDB tables
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()

  corpus_read(file.path("ch5_subsidies_reform", "input", "sectors_def.parquet")) |>
    arrow::to_duckdb(table_name = "sectors", con = con) |>
    invisible()

  # Unnest the topics of each work into one row per work x topic
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(topics).i AS i, ",
    "UNNEST(topics).score AS score, ",
    "UNNEST(topics).name AS name, ",
    "UNNEST(topics).id AS id, ",
    "UNNEST(topics).display_name AS display_name ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Attach the expert-defined sector to each topic
  select_sql <- paste0(
    "SELECT ",
    "corpus_unnest.*, ",
    "sectors.sector ",
    "FROM ",
    "corpus_unnest ",
    "LEFT JOIN ",
    "sectors ",
    "ON ",
    "corpus_unnest.id == sectors.id "
  )

  # dbGetQuery(con, paste(select_sql, "LIMIT 10"))

  # Write the result as a parquet dataset partitioned by publication year
  sql <- paste0(
    "COPY ( ", select_sql, ") TO '", params$corpus_topics_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  )
  dbExecute(con, sql)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
Authors
```r
if (!dir.exists(params$corpus_authors_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus as a DuckDB table
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()

  # Unnest the author list into one row per work x author
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(author).au_id AS au_id, ",
    "UNNEST(author).au_display_name AS au_display_name, ",
    "UNNEST(author).au_orcid AS au_orcid, ",
    "UNNEST(author).author_position AS author_position, ",
    "UNNEST(author).is_corresponding AS is_corresponding, ",
    "UNNEST(author).au_affiliation_raw AS au_affiliation_raw, ",
    "UNNEST(author).institution_id AS institution_id, ",
    "UNNEST(author).institution_display_name AS institution_display_name, ",
    "UNNEST(author).institution_ror AS institution_ror, ",
    "UNNEST(author).institution_country_code AS institution_country_code, ",
    "UNNEST(author).institution_type AS institution_type, ",
    "UNNEST(author).institution_lineage AS institution_lineage ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Write the result as a parquet dataset partitioned by publication year
  paste0(
    "COPY ( ",
    "SELECT * FROM corpus_unnest ",
    ") TO '", params$corpus_authors_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  ) |>
    dbExecute(conn = con)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
The number of hits refers to hits of the search terms across the whole OpenAlex corpus. Due to methodological issues, the numbers for R1 AND R2 are overestimates and contain some double counting.
government_financial_subsidies in OpenAlex: 134,167 hits
government_financial_subsidies in downloaded corpus: 124,517 hits
Subsidies corpus: 124,517 hits
Manual review of 50 papers
The file contains the id, doi, author_abbr and abstract of the papers. Two samples were generated:
works in the subsidies corpus AND in the TCA corpus, which can be downloaded here.
Publications over time
The red line is the cumulative proportion of publications; the blue line is the cumulative proportion of the whole OpenAlex corpus. Both use the secondary (red) axis.
The following calculations were done (count refers to the count of works per country in the subsidies corpus, count_oa to the count of works per country in the OpenAlex corpus):
**count** = ifelse(is.na(count), 0, count)
**log_count** = log(count + 1)
**p** = count / sum(count)
**p_output** = count / count_oa
**p_diff** = (p_oa - p) * 100
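The calculations above can be sketched in plain Python. The country counts below are made up for illustration, and p_oa is assumed to be the analogous proportion of works per country in the OpenAlex corpus, which the text does not define explicitly:

```python
import math

# Hypothetical per-country counts: subsidies corpus (count) and OpenAlex (count_oa).
# None stands in for R's NA.
rows = [
    {"country": "A", "count": 30, "count_oa": 300},
    {"country": "B", "count": None, "count_oa": 200},
    {"country": "C", "count": 70, "count_oa": 500},
]

# count = ifelse(is.na(count), 0, count)
for r in rows:
    r["count"] = 0 if r["count"] is None else r["count"]

total = sum(r["count"] for r in rows)
total_oa = sum(r["count_oa"] for r in rows)

for r in rows:
    r["log_count"] = math.log(r["count"] + 1)   # log_count = log(count + 1)
    r["p"] = r["count"] / total                 # share within the subsidies corpus
    r["p_output"] = r["count"] / r["count_oa"]  # share of a country's OpenAlex output
    p_oa = r["count_oa"] / total_oa             # assumed: analogous OpenAlex share
    r["p_diff"] = (p_oa - r["p"]) * 100         # difference in percentage points
```

A positive p_diff then means a country is under-represented in the subsidies corpus relative to its share of the OpenAlex corpus, and a negative p_diff means it is over-represented.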
These are based on three different counts:

- **count_oa**: Count of first authors of all papers in the OpenAlex corpus per country
- **count_first**: Count of first authors of all papers in the Subsidies Corpus per country
- **count_all**: Count of all authors of all papers in the Subsidies Corpus per country, weighted by 1/NO_AUTHORS_PER_PAPER
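The weighting behind count_all can be illustrated with a small, hypothetical example (the per-paper country lists are invented; each paper contributes a total weight of 1, split equally among its authors):

```python
from collections import defaultdict

# Hypothetical papers: one list of author countries per paper, first author first.
papers = [
    ["DE", "ES", "ES"],
    ["ES"],
    ["FR", "DE"],
]

count_first = defaultdict(float)  # first authors only
count_all = defaultdict(float)    # all authors, each weighted by 1 / n_authors

for countries in papers:
    count_first[countries[0]] += 1
    weight = 1 / len(countries)
    for country in countries:
        count_all[country] += weight
```

Because the weights per paper sum to 1, the total of count_all equals the number of papers, so multi-authored papers are not counted more than once.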
Two .parquet files containing the id, publication_year and ab (abstract) were extracted and are available upon request due to their size.
For analyzing the sentiments of the provided abstracts, we used the Python NLTK package and VADER (Valence Aware Dictionary and sEntiment Reasoner), an NLTK module that provides sentiment scores based on the words used. VADER is a pre-trained, rule-based sentiment analysis model in which terms are labelled according to their semantic orientation as either positive or negative.
The main advantage of, and reason for using, this model is that it does not require a labelled training dataset. The output of the model is four scores:
compound: composite score that summarizes the overall sentiment of the text, where scores close to 1 indicate a positive sentiment, scores close to -1 indicate a negative sentiment, and scores close to 0 indicate a neutral sentiment
negative: percentage of negative sentiments in the text
neutral: percentage of neutral sentiments in the text
positive: percentage of positive sentiments in the text
@report{krug,
author = {Krug, Rainer M. and Reyes Garcia, Victoria and Villasante,
Sebastian},
title = {Chapter 5 {Subsidies} {Reform} - {Technical} {Background}
{Report}},
doi = {10.5281/zenodo.11389482},
langid = {en}
}
For attribution, please cite this work as:
Krug, Rainer M., Victoria Reyes Garcia, and Sebastian Villasante. n.d.
“Chapter 5 Subsidies Reform - Technical Background Report.” IPBES Transformative Change Assessment. https://doi.org/10.5281/zenodo.11389482.