This is a technical background document for the IPBES Thematic Assessment of the underlying causes of biodiversity loss, determinants of transformative change and options for achieving the 2050 vision for biodiversity. It provides technical details and implementation settings for the data management report of the Transformative Change Assessment Corpus and its usage. The sole purpose of this document is to document the workflows used to produce statistics, figures and maps, to record the sources of the data, and to make the process transparent and reproducible.
The Build No is automatically incremented by one each time the report is rendered. It is used to distinguish different renderings when the version stays the same.
To guarantee reproducibility, the data will be downloaded and extracted whenever the folder ch5_subsidies_reform/data does not exist.
All code to re-generate the data is included, but it might take a long time to run and may produce different numbers, as OpenAlex is updated continuously.
To re-generate the data, disable this block and delete all content in the folder ch5_subsidies_reform/data. The folder itself has to exist.
This code will only work after the approval of the assessment by the plenary, as the repository remains confidential until then.
The Subsidies Corpus was downloaded on 11 April 2024.
The corpus download will be stored in data/pages and the parquet database in data/corpus.
This is not on GitHub!
The corpus can be read by running read_corpus(), which opens the database so that it can be fed into a dplyr pipeline. After most dplyr functions, the actual data needs to be collected via dplyr::collect().
Only then is the actual data read!
Needs to be enabled by setting eval: true in the code block below.
The sector definitions are based on the subfields assigned to each work by OpenAlex; these subfields were grouped by experts into sectors. See this Google Doc for details.
```r
if (!dir.exists(params$corpus_topics_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus and the sector definitions as DuckDB tables
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()

  corpus_read(file.path("ch5_subsidies_reform", "input", "sectors_def.parquet")) |>
    arrow::to_duckdb(table_name = "sectors", con = con) |>
    invisible()

  # Unnest the topics of each work into one row per work x topic
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(topics).i AS i, ",
    "UNNEST(topics).score AS score, ",
    "UNNEST(topics).name AS name, ",
    "UNNEST(topics).id AS id, ",
    "UNNEST(topics).display_name AS display_name ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Attach the expert-defined sector to each topic
  select_sql <- paste0(
    "SELECT ",
    "corpus_unnest.*, ",
    "sectors.sector ",
    "FROM ",
    "corpus_unnest ",
    "LEFT JOIN ",
    "sectors ",
    "ON ",
    "corpus_unnest.id == sectors.id "
  )

  # dbGetQuery(con, paste(select_sql, "LIMIT 10"))

  # Write the result as a parquet dataset partitioned by publication year
  sql <- paste0(
    "COPY ( ", select_sql, ") TO '", params$corpus_topics_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  )
  dbExecute(con, sql)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
Authors
```r
if (!dir.exists(params$corpus_authors_dir)) {
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

  # Register the corpus as a DuckDB table
  corpus_read(params$corpus_dir) |>
    arrow::to_duckdb(table_name = "corpus", con = con) |>
    invisible()

  # Unnest the author list into one row per work x author
  paste0(
    "CREATE VIEW corpus_unnest AS ",
    "SELECT ",
    "corpus.id AS work_id, ",
    "corpus.publication_year AS publication_year, ",
    "UNNEST(author).au_id AS au_id, ",
    "UNNEST(author).au_display_name AS au_display_name, ",
    "UNNEST(author).au_orcid AS au_orcid, ",
    "UNNEST(author).author_position AS author_position, ",
    "UNNEST(author).is_corresponding AS is_corresponding, ",
    "UNNEST(author).au_affiliation_raw AS au_affiliation_raw, ",
    "UNNEST(author).institution_id AS institution_id, ",
    "UNNEST(author).institution_display_name AS institution_display_name, ",
    "UNNEST(author).institution_ror AS institution_ror, ",
    "UNNEST(author).institution_country_code AS institution_country_code, ",
    "UNNEST(author).institution_type AS institution_type, ",
    "UNNEST(author).institution_lineage AS institution_lineage ",
    "FROM ",
    "corpus "
  ) |>
    dbExecute(conn = con)

  # Write the result as a parquet dataset partitioned by publication year
  paste0(
    "COPY ( ",
    "SELECT * FROM corpus_unnest ",
    ") TO '", params$corpus_authors_dir, "' ",
    "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
  ) |>
    dbExecute(conn = con)

  duckdb::dbDisconnect(con, shutdown = TRUE)
}
```
The number of hits refers to hits of the search terms across the whole OpenAlex corpus. Due to methodological issues, the numbers for R1 AND R2 are overestimates and contain some double counting.
government_financial_subsidies in OpenAlex: 134,167 hits
government_financial_subsidies in downloaded corpus: 124,517 hits
Subsidies corpus: 124,517 hits
Manual review of 50 papers
The file contains the id, doi, author_abbr and abstract of the papers. Two samples were generated:
works in the subsidies corpus AND in the TCA corpus, which can be downloaded here.
Publications over time
The red line is the cumulative proportion of publications; the blue line is the cumulative proportion of the whole OpenAlex corpus. Both use the secondary (red) axis.
The following calculations were done (count refers to the count of works per country in the subsidies corpus, count_oa to the count of works per country in the OpenAlex corpus):
**count** = ifelse(is.na(count), 0, count)
**log_count** = log(count + 1)
**p** = count / sum(count)
**p_output** = count / count_oa
**p_diff** = (p_oa - p) * 100
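The calculations above can be sketched in plain Python. The country counts below are made up for illustration, and p_oa is assumed to be the analogous proportion of works per country in the OpenAlex corpus, which the text does not define explicitly:

```python
import math

# Hypothetical per-country counts: subsidies corpus (count) and OpenAlex (count_oa).
# None stands in for R's NA.
rows = [
    {"country": "A", "count": 30, "count_oa": 300},
    {"country": "B", "count": None, "count_oa": 200},
    {"country": "C", "count": 70, "count_oa": 500},
]

# count = ifelse(is.na(count), 0, count)
for r in rows:
    r["count"] = 0 if r["count"] is None else r["count"]

total = sum(r["count"] for r in rows)
total_oa = sum(r["count_oa"] for r in rows)

for r in rows:
    r["log_count"] = math.log(r["count"] + 1)   # log_count = log(count + 1)
    r["p"] = r["count"] / total                 # share within the subsidies corpus
    r["p_output"] = r["count"] / r["count_oa"]  # share of a country's OpenAlex output
    p_oa = r["count_oa"] / total_oa             # assumed: analogous OpenAlex share
    r["p_diff"] = (p_oa - r["p"]) * 100         # difference in percentage points
```

A positive p_diff then means a country is under-represented in the subsidies corpus relative to its share of the OpenAlex corpus, and a negative p_diff means it is over-represented.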
These are based on three different counts:

- **count_oa**: Count of first authors of all papers in the OpenAlex corpus per country
- **count_first**: Count of first authors of all papers in the Subsidies Corpus per country
- **count_all**: Count of all authors of all papers in the Subsidies Corpus per country, weighted by 1/NO_AUTHORS_PER_PAPER
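The weighting behind count_all can be illustrated with a small, hypothetical example (the per-paper country lists are invented; each paper contributes a total weight of 1, split equally among its authors):

```python
from collections import defaultdict

# Hypothetical papers: one list of author countries per paper, first author first.
papers = [
    ["DE", "ES", "ES"],
    ["ES"],
    ["FR", "DE"],
]

count_first = defaultdict(float)  # first authors only
count_all = defaultdict(float)    # all authors, each weighted by 1 / n_authors

for countries in papers:
    count_first[countries[0]] += 1
    weight = 1 / len(countries)
    for country in countries:
        count_all[country] += weight
```

Because the weights per paper sum to 1, the total of count_all equals the number of papers, so multi-authored papers are not counted more than once.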
Two .parquet files containing the id, publication_year and ab (abstract) were extracted and are available upon request due to their size.
For analyzing the sentiments of the provided abstracts, we used the Python NLTK package and VADER (Valence Aware Dictionary and sEntiment Reasoner), an NLTK module that provides sentiment scores based on the words used. VADER is a pre-trained, rule-based sentiment analysis model in which terms are labelled according to their semantic orientation as either positive or negative.
The main advantage of, and reason for using, this model is that it does not require a labelled training dataset. The output of the model is four scores:
compound: composite score that summarizes the overall sentiment of the text, where scores close to 1 indicate a positive sentiment, scores close to -1 indicate a negative sentiment, and scores close to 0 indicate a neutral sentiment
negative: percentage of negative sentiments in the text
neutral: percentage of neutral sentiments in the text
positive: percentage of positive sentiments in the text
@report{krug,
author = {Krug, Rainer M. and Reyes Garcia, Victoria and Villasante,
Sebastian},
title = {Chapter 5 {Subsidies} {Reform} - {Technical} {Background}
{Report}},
doi = {10.5281/zenodo.11389482},
langid = {en}
}
For attribution, please cite this work as:
Krug, Rainer M., Victoria Reyes Garcia, and Sebastian Villasante. n.d.
“Chapter 5 Subsidies Reform - Technical Background Report.” IPBES Transformative Change Assessment. https://doi.org/10.5281/zenodo.11389482.