%The BuildNo is automatically increased by one each time the report is rendered. It indicates different renderings when the version number stays the same.%
Introduction
The following steps are documented in this report:
The Subsidies Corpus was downloaded on 11 April 2024.
The downloaded corpus is stored in data/pages and the Arrow database in data/corpus.
These directories are not tracked on GitHub!
The corpus can be read by running read_corpus(), which opens the database so that it can be fed into a dplyr pipeline. The pipeline is evaluated lazily: after most dplyr verbs, the actual data still needs to be collected via dplyr::collect().
Only then is the data actually read!
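A minimal, self-contained sketch of this lazy-evaluation pattern (an in-memory duckdb table stands in for the real corpus here; read_corpus() and the column publication_year are taken from the report, the toy data is invented for illustration):

```r
library(dplyr)

# Stand-in for read_corpus(): register a tiny in-memory table with duckdb
con <- duckdb::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(
  con, "corpus",
  data.frame(id = 1:4, publication_year = c(1999, 2005, 2005, 2010))
)
corpus <- tbl(con, "corpus")

# Lazy pipeline: no rows are read yet, only a query is built
yearly <- corpus |>
  filter(publication_year >= 2000) |>
  count(publication_year)

# collect() executes the query and returns an actual data frame
yearly_df <- collect(yearly)

duckdb::dbDisconnect(con, shutdown = TRUE)
```

Printing `yearly` before `collect()` only shows the pending query; `yearly_df` holds the materialized result.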
The code block below is disabled by default; it needs to be enabled by setting eval: true.
The Sectors definition is based on the subfields assigned to each work by OpenAlex. These were grouped by experts into sectors. See this Google Doc for details.
```r
#| eval: false
if (!dir.exists(params$corpus_topics_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus and the sector definitions as DuckDB tables
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()
  corpus_read(file.path("input", "ch_5_subsidies_reform", "sectors_def.parquet")) |>
    arrow::to_duckdb(table_name = "sectors", con = con) |>
    invisible()

  # Unnest the topics struct into one row per (work, topic)
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(topics).i AS i, ",
    "UNNEST(topics).score AS score, ",
    "UNNEST(topics).name AS name, ",
    "UNNEST(topics).id AS id, ",
    "UNNEST(topics).display_name AS display_name ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Attach the expert-assigned sector to each topic row
  select_sql <- paste0(
    "SELECT ",
    "corpus_unnest.*, ",
    "sectors.sector ",
    "FROM ",
    "corpus_unnest ",
    "LEFT JOIN ",
    "sectors ",
    "ON ",
    "corpus_unnest.id == sectors.id "
  )

  # dbGetQuery(con, paste(select_sql, "LIMIT 10"))

  # Write the result as a year-partitioned parquet dataset
  sql <- paste0(
    "COPY ( ", select_sql, ") TO '", params$corpus_topics_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  )
  dbExecute(con, sql)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
Authors
```r
if (!dir.exists(params$corpus_authors_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus as a DuckDB table
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()

  # Unnest the author struct into one row per (work, author)
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(author).au_id AS au_id, ",
    "UNNEST(author).au_display_name AS au_display_name, ",
    "UNNEST(author).au_orcid AS au_orcid, ",
    "UNNEST(author).author_position AS author_position, ",
    "UNNEST(author).is_corresponding AS is_corresponding, ",
    "UNNEST(author).au_affiliation_raw AS au_affiliation_raw, ",
    "UNNEST(author).institution_id AS institution_id, ",
    "UNNEST(author).institution_display_name AS institution_display_name, ",
    "UNNEST(author).institution_ror AS institution_ror, ",
    "UNNEST(author).institution_country_code AS institution_country_code, ",
    "UNNEST(author).institution_type AS institution_type, ",
    "UNNEST(author).institution_lineage AS institution_lineage ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Write the author-level table as a year-partitioned parquet dataset
  paste0(
    "COPY ( ",
    "SELECT * FROM corpus_unnest ",
    ") TO '", params$corpus_authors_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  ) |>
    dbExecute(conn = con)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
The hit counts are hits of the search terms across the whole OpenAlex corpus. Due to methodological issues, the counts for R1 AND R2 are overestimates and contain some double counting.
government_financial_subsidies in OpenAlex: 126,116 hits
government_financial_subsidies in downloaded corpus: 124,517 hits
Subsidies corpus: 124,517 hits
Manual review of 50 papers
The file contains the id, doi, author_abbr and abstract of the papers. Two samples were generated:
works in the subsidies corpus AND in the TCA corpus, which can be downloaded here.
Publications over time
The red line shows the cumulative proportion of publications in the subsidies corpus, the blue line the cumulative proportion of the whole OpenAlex corpus. Both use the secondary (red) axis.
The following calculations were done (count refers to the count of works per country in the subsidies corpus, count_oa to the count of works per country in the OpenAlex corpus):
**count** = ifelse(is.na(count), 0, count)
**log_count** = log(count + 1)
**p** = count / sum(count)
**p_output** = count / count_oa
**p_diff** = (p_oa - p) * 100
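The calculations above can be sketched in dplyr terms (the toy data frame, and the column p_oa holding each country's share of the OpenAlex corpus, are assumptions for illustration):

```r
library(dplyr)

# Hypothetical per-country counts for illustration
counts <- tibble::tibble(
  country  = c("A", "B", "C"),
  count    = c(10, NA, 30),
  count_oa = c(100, 50, 300),
  p_oa     = c(0.2, 0.1, 0.7)
)

result <- counts |>
  mutate(
    count     = ifelse(is.na(count), 0, count),  # countries without hits -> 0
    log_count = log(count + 1),                  # log-transformed count
    p         = count / sum(count),              # share within subsidies corpus
    p_output  = count / count_oa,                # share of a country's OA output
    p_diff    = (p_oa - p) * 100                 # difference in percentage points
  )
```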
These are based on three different counts:

- **count_oa**: count of first authors of all papers in the OpenAlex corpus, per country
- **count_first**: count of first authors of all papers in the Subsidies corpus, per country
- **count_all**: count of all authors of all papers in the Subsidies corpus, per country, weighted by 1/NO_AUTHORS_PER_PAPER
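The fractional weighting behind count_all can be sketched as follows (authors_df, a one-row-per-author table like the one produced in the Authors step, is a hypothetical stand-in):

```r
library(dplyr)

# Hypothetical author-level table: one row per (work, author)
authors_df <- tibble::tibble(
  work_id = c("W1", "W1", "W2"),
  country = c("CH", "DE", "CH")
)

count_all <- authors_df |>
  add_count(work_id, name = "n_authors") |>  # authors per paper
  mutate(weight = 1 / n_authors) |>          # each author contributes 1/N of a paper
  group_by(country) |>
  summarise(count_all = sum(weight))
```

Each paper thus contributes exactly one unit in total, split evenly across its authors' countries.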
Two .parquet files containing the id, publication_year and ab (abstract) were extracted and are available upon request due to their size.
For analyzing the sentiments of the provided abstracts, we used the Python NLTK package and VADER (Valence Aware Dictionary for Sentiment Reasoning), an NLTK module that provides sentiment scores based on the words used. VADER is a pre-trained, rule-based sentiment analysis model in which terms are labeled according to their semantic orientation as either positive or negative.
The main advantage of this model, and the reason for using it, is that it does not require a labeled training dataset. The model outputs four statistical scores:
compound: composite score that summarizes the overall sentiment of the text, where scores close to 1 indicate a positive sentiment, scores close to -1 indicate a negative sentiment, and scores close to 0 indicate a neutral sentiment
negative: percentage of negative sentiments in the text
neutral: percentage of neutral sentiments in the text
positive: percentage of positive sentiments in the text
@report{krug,
author = {Krug, Rainer M.},
title = {Report {Assessment} {Ch5} {Subsidies} {Reform}},
doi = {XXXXXX},
langid = {en},
  abstract = {A short description of what this is about. This is not a
traditional abstract, but rather something else ...}
}
For attribution, please cite this work as:
Krug, Rainer M. n.d. “Report Assessment Ch5 Subsidies
Reform.” IPBES Data Management Report. https://doi.org/XXXXXX.