Report Assessment Ch5 Subsidies Reform

Data Management Report

Author
Affiliation

Rainer M. Krug

Doi
Abstract

A short description what this is about. This is not a tracditional abstract, but rather something else …

Working Title

IPBES_TCA_Ch5_Subsidies_Reform

Code repo

Github - private

Build No: 289

%The BuidNo is automatically increased by one each time the report is rendered. It is used to indicate different renderings when the version stays the same%.

Introduction

The following steps will be done in documented in this report:

  • Do asearch with params$st_subsidies determine the number of hits
  • Download Subsidies Corpus
  • Identify works in the subsidies corpus AND TCA corpus
  • Do some analysis

Methods

The search terms are based on the shared google doc. They are cleaned up for the usage in OpenAlex.

Get and calculate Data from OpenAlex

The search terms is st

Show the code
#|

if (!file.exists(params$fn_count)) {
    st_count  <- list(
        timestamp = Sys.time()
    )
    st_count$st_count <- openalexR::oa_fetch(
        title_and_abstract.search = st,
        count_only = TRUE,
        verbose = TRUE
    )[, "count"]

    saveRDS(st_count, params$fn_count)
} else {
    st_count <- readRDS(params$fn_count)
}

Download Subsidies Corpus

The Subsidies Corpus was downloaded at the 11th of April 2024.

The corpus download will be stored in data/pages and the arrow database in data/corpus.

This is not on github!

The corpus can be read by running read_corpus() which opens the database so that then it can be fed into a dplyr pipeline. After most dplyr functions, the actual data needs to be collected via dplyr::collect().

Only then is the actual data read!

Needs to be enabled by setting eval: true in the code block below.

Show the code
#|

tic()

IPBES.R::corpus_download(
    pages_dir = params$pages_dir,
    title_and_abstract_search = st,
    continue = TRUE,
    delete_pages_dir = FALSE,
    set_size = 1000,
    verbose = TRUE,
    dry_run = TRUE,
    mc_cores = 6
)

toc()

Convert downloaded corpus to arrow

Show the code
tic()

IPBES.R::corpus_pages_to_arrow(
    pages_dir = params$pages_dir,
    arrow_dir = params$corpus_dir,
    continue = TRUE,
    delete_arrow_dir = FALSE,
    dry_run = FALSE,
    verbose = TRUE,
    mc_cores = 3
)

toc()

Check the number of dulicates before running this next block, and then verify the new corpus afterwards. RUN ONLY MANUALY!

Show the code
#|

ONLY RUN MANUALLY!!!!!!!!!!!!!!!!!!!!!!!

(read_corpus("./data/ch_5_subsidies_reform/corpus") |> group_by(id) |> summarize(n = n()) |> filter(n > 1) |> collect() |> nrow()) / (corpus_read("./data/ch_5_subsidies_reform/corpus") |> nrow())

years <- IPBES.R::corpus_read(params$corpus_dir) |>
    distinct(publication_year) |>
    collect() |>
    unlist() |>
    as.vector() |>
    sort()

lapply(
    years,
    function(y) {
        message("\nProcessing year: ", y)
        tic()
        dataset <- IPBES.R::corpus_read(params$corpus_dir) |>
            dplyr::filter(publication_year == y) |>
            dplyr::collect() |>
            group_by(id) |>
            slice_max(
                publication_year,
                n = 1,
                with_ties = FALSE,
                na_rm = TRUE
            )
        # unlink(
        #     file.path(params$corpus_dir, paste0("publication_year=", y)),
        #     recursive = TRUE,
        #     force = TRUE
        # )
        arrow::write_dataset(
            dataset = dataset,
            path = paste0(params$corpus_dir, "_deduplicated"),
            partitioning = c("publication_year", "set"),
            format = "parquet",
            existing_data_behavior = "overwrite"
        )
        toc()
    }
)

(read_corpus("./data/ch_5_subsidies_reform/corpus_deduplicated") |> group_by(id) |> summarize(n = n()) |> filter(n > 1) |> collect() |> nrow()) / (corpus_read("./data/ch_5_subsidies_reform/corpus_deduplicated") |> nrow())

id_org <- corpus_read("./data/ch_5_subsidies_reform/corpus") |>
    distinct(id) |>
    collect()

id_deduplicated <- corpus_read("./data/ch_5_subsidies_reform/corpus_deduplicated") |>
    distinct(id) |>
    collect()

(!(id_deduplicated$id %in% id_org$id)) |> sum() |> as.numeric()

NOW IF EVERYTHING IS OK, DELETE THE OLD CORPUS AND RENAME THE NEW ONE

Export Data from Subsidies Corpus

Export data for sentiment analysis

Show the code
#|

fn <- file.path("ch5_subsidies_reform", "data", "sent_analysis_subsidies.parquet")
if (!file.exists(fn)) {
    read_corpus(params$corpus_dir) |>
        dplyr::select(id, publication_year, ab) |>
        arrow::write_parquet(fn)
}

Export 250 random papers for manual analysis

Show the code
#|


fn <- file.path("ch5_subsidies_reform", "data", "random_250_subsidies_in_tca.xlsx")
if (!file.exists(fn)) {
    set.seed(15)
    read_corpus(params$corpus_dir) |>
        dplyr::select(id, doi, author_abbr, display_name, ab) |>
        dplyr::rename(abstract = ab, title = display_name) |>
        dplyr::collect() |>
        dplyr::slice_sample(n = 250) |>
        dplyr::mutate(
            abstract = substr(abstract, 1, 5000)
        ) |>
        writexl::write_xlsx(path = fn)
}


fn <- file.path("ch5_subsidies_reform", "data", "random_50_subsidies_in_tca.xlsx")
if (!file.exists(fn)) {
    set.seed(13)
    read_corpus(params$corpus_dir) |>
        dplyr::select(id, doi, author_abbr, display_name, ab) |>
        dplyr::rename(abstract = ab, title = display_name) |>
        dplyr::collect() |>
        dplyr::slice_sample(n = 50) |>
        dplyr::mutate(
            abstract = substr(abstract, 1, 5000)
        ) |>
        writexl::write_xlsx(path = fn)
}

Topics and Sectors

The Sectors definition is based on the subfields assigned to each work by OpenAlex. These were grouped by experts into sectors. See this Google Doc for details.

Show the code
#|

if (!dir.exists(params$corpus_topics_dir)) {
    con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

    corpus_read(params$corpus_dir) |>
        arrow::to_duckdb(table_name = "corpus", con = con) |>
        invisible()
    corpus_read(file.path("ch5_subsidies_reform", "input", "sectors_def.parquet")) |>
        arrow::to_duckdb(table_name = "sectors", con = con) |>
        invisible()

    paste0(
        "CREATE VIEW corpus_unnest AS ",
        "SELECT  ",
        "corpus.id AS work_id,  ",
        "corpus.publication_year AS publication_year,  ",
        "UNNEST(topics).i AS i,  ",
        "UNNEST(topics).score AS score,  ",
        "UNNEST(topics).name AS name, ",
        "UNNEST(topics).id AS id,  ",
        "UNNEST(topics).display_name AS display_name  ",
        "FROM  ",
        "corpus "
    ) |>
        dbExecute(conn = con)

    select_sql <- paste0(
        "SELECT ",
        "corpus_unnest.*, ",
        "sectors.sector ",
        "FROM ",
        "corpus_unnest ",
        "LEFT JOIN ",
        "sectors ",
        "ON ",
        "corpus_unnest.id  == sectors.id "
    )

    # dbGetQuery(con, paste(select_sql, "LIMIT 10"))

    sql <- paste0(
        "COPY ( ",
        select_sql,
        ") TO '", params$corpus_topics_dir, "' ",
        "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
    )

    dbExecute(con, sql)

    duckdb::dbDisconnect(con, shutdown = TRUE)
}

Authors

Show the code
#|

if (!dir.exists(params$corpus_authors_dir)) {
    con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)

    corpus_read(params$corpus_dir) |>
        arrow::to_duckdb(table_name = "corpus", con = con) |>
        invisible()

    paste0(
        "CREATE VIEW corpus_unnest AS ",
        "SELECT  ",
        "corpus.id AS work_id,  ",
        "corpus.publication_year AS publication_year,  ",
        "UNNEST(author).au_id AS au_id,  ",
        "UNNEST(author).au_display_name AS au_display_name, ",
        "UNNEST(author).au_orcid AS au_orcid,  ",
        "UNNEST(author).author_position AS author_position,  ",
        "UNNEST(author).is_corresponding AS is_corresponding,  ",
        "UNNEST(author).au_affiliation_raw AS au_affiliation_raw,  ",
        "UNNEST(author).institution_id AS institution_id,  ",
        "UNNEST(author).institution_display_name AS institution_display_name,  ",
        "UNNEST(author).institution_ror AS institution_ror,  ",
        "UNNEST(author).institution_country_code AS institution_country_code,  ",
        "UNNEST(author).institution_type AS institution_type,  ",
        "UNNEST(author).institution_lineage AS institution_lineage  ",
        "FROM  ",
        "corpus "
    ) |> dbExecute(conn = con)

    paste0(
        "COPY ( ",
        "SELECT * FROM corpus_unnest ",
        ") TO '", params$corpus_authors_dir, "' ",
        "(FORMAT PARQUET, COMPRESSION 'SNAPPY', PARTITION_BY 'publication_year')"
    ) |>
        dbExecute(conn = con)

    duckdb::dbDisconnect(con, shutdown = TRUE)
}

Publications over time

Show the code
#|

if (!file.exists(params$fn_publications_temporal)) {
    data <- read_corpus(params$corpus_dir) |>
        dplyr::select(publication_year) |>
        dplyr::arrange(publication_year) |>
        dplyr::collect() |>
        table() |>
        as.data.frame() |>
        mutate(
            publication_year = as.integer(as.character(publication_year)),
            p = Freq / sum(Freq),
            p_cum = cumsum(p)
        ) |>
        rename(
            count = Freq
        ) |>
        dplyr::inner_join(
            y = openalexR::oa_fetch(
                group_by = "publication_year",
                output = "tibble",
                verbose = FALSE
            ) |>
                dplyr::select(
                    key,
                    count
                ) |>
                dplyr::rename(
                    publication_year = key,
                    count_oa = count
                ) |>
                dplyr::arrange(publication_year) |>
                dplyr::mutate(
                    publication_year = as.integer(as.character(publication_year)),
                    p_oa = count_oa / sum(count_oa),
                    p_oa_cum = cumsum(p_oa)
                )
        )

    saveRDS(data)
}
Show the code
#|
if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "publications_over_time")) < 2) {
    figure <- readRDS(params$fn_publications_temporal) |>
        dplyr::filter(publication_year >= 1900) |>
        ggplot() +
        #
        geom_bar(aes(x = publication_year, y = p), stat = "identity") +
        geom_line(aes(x = publication_year, y = p_cum / 5, color = "Cumulative proportion"), linewidth = 1) +
        geom_line(aes(x = publication_year, y = p_oa_cum / 5, color = "Cumulative proportion OA"), linewidth = 1) +
        #
        scale_color_manual(values = c("Cumulative proportion" = "red", "Cumulative proportion OA" = "blue")) +
        #
        scale_x_continuous(breaks = seq(1900, 2020, 10)) +
        scale_y_continuous(
            "Proportion of publications",
            sec.axis = sec_axis(~ . * 5, name = "Cumulative proportion") # divide by 100 to scale back the secondary axis
        ) +
        labs(
            title = "Publications over time",
            x = "Year",
            y = "Number of publications"
        ) +
        theme_minimal() +
        theme(
            axis.text.y.right = element_text(color = "red"),
            legend.position = "inside" # Move the legend to the top
        )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "publications_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "publications_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}

Countries of Authors Spatial

Show the code
if (!file.exists(params$fn_publications_per_country_data)) {
    corpus <- corpus_read(params$corpus_authors_dir)

    data_first <- corpus |>
        dplyr::filter(
            author_position == "first"
        ) |>
        dplyr::select(
            work_id,
            institution_country_code,
        ) |>
        dplyr::group_by(
            work_id,
            institution_country_code
        ) |>
        dplyr::summarise(
            count_first = 1 / n(),
            .groups = "drop"
        ) |>
        dplyr::group_by(
            institution_country_code
        ) |>
        dplyr::summarise(
            count = sum(count_first),
            .groups = "drop"
        ) |>
        dplyr::mutate(
            position = "first"
        )

    data_all <- corpus |>
        dplyr::select(
            work_id,
        ) |>
        dplyr::group_by(
            work_id,
        ) |>
        dplyr::summarize(
            count = n()
        ) |>
        dplyr::right_join(
            y = corpus |>
                dplyr::select(
                    work_id,
                    institution_country_code
                ),
            by = "work_id"
        ) |>
        dplyr::group_by(
            institution_country_code
        ) |>
        dplyr::summarise(
            count = sum(count),
            .groups = "drop"
        ) |>
        dplyr::mutate(
            position = "all"
        )

    data_oa <- openalexR::oa_fetch(
        group_by = "authorships.countries",
        output = "tibble",
        verbose = FALSE
    ) |>
        dplyr::mutate(
            iso3c = countrycode::countrycode(
                key_display_name,
                origin = "country.name",
                destination = "iso3c"
            ),
            key_display_name = NULL,
            key = NULL,
            position = "oa"
        )

    dplyr::add_row(
        collect(data_first),
        collect(data_all)
    ) |>
        dplyr::mutate(
            iso3c = countrycode::countrycode(
                institution_country_code,
                origin = "iso2c",
                destination = "iso3c"
            ),
            institution_country_code = NULL
        ) |>
        dplyr::add_row(
            data_oa
        ) |>
        saveRDS(file = params$fn_publications_per_country_data)

    rm(data_first, data_all, data_oa, fn)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "maps"), pattern = "publications_countries")) < 2) {
    data <- readRDS(params$fn_publications_per_country_data) |>
        dplyr::group_by(iso3c) |>
        dplyr::summarize(
            count_first = sum(count[position == "first"], na.rm = TRUE),
            count_all = sum(count[position == "all"], na.rm = TRUE),
            count_oa = sum(count[position == "oa"], na.rm = TRUE)
        ) |>
        dplyr::mutate(
            count_oa = ifelse(is.na(count_oa), 0, count_oa),
            log_count_oa = log(count_oa + 1),
            p_oa = count_oa / sum(count_oa),
            #
            count_first = ifelse(is.na(count_first), 0, count_first),
            log_count_first = log(count_first + 1),
            p_first = count_first / sum(count_first),
            p_first_output = count_first / count_oa,
            p_first_diff = (p_oa - p_first) * 100,
            #
            count_all = ifelse(is.na(count_all), 0, count_all),
            log_count_all = log(count_all + 1),
            p_all = count_all / sum(count_all),
            p_all_output = count_all / count_oa,
            p_all_diff = (p_oa - p_all) * 100,
        )

    # data <- readRDS(params$fn_publications_per_country_data)

    map <- patchwork::wrap_plots(
        data |> map_country_codes(
            map_type = "countries",
            values = "count_oa",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("count of overall publications (count_oa)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "count_first",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("count of TCA publications (count_first)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "count_all",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("count of TCA publications (count_all)"),
        ####
        data |> map_country_codes(
            map_type = "countries",
            values = "log_count_oa",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("log(count + 1) of overall publications (log_count_oa)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "log_count_first",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("log(count + 1) of TCA publications (log_count_first)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "log_count_all",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("log(count + 1) of TCA publications (log_count_all)"),
        ####
        data |> map_country_codes(
            map_type = "countries",
            values = "p_oa",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("Overall research output (p_oa)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "p_first",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("TCA research output (p_first)"),
        #
        data |> map_country_codes(
            map_type = "countries",
            values = "p_all",
            geodata_path = params$gdm_dir
        ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", high = "#56B4E9") +
            ggplot2::ggtitle("TCA research output (p_all)"),
        ####
        ggplot() +
            theme_void(),
        data |>
            map_country_codes(
                map_type = "countries",
                values = "p_first_diff",
                geodata_path = params$gdm_dir
            ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
            ggplot2::ggtitle("difference (TCA - overall) output (p_oa - p_first)"),
        #
        data |>
            map_country_codes(
                map_type = "countries",
                values = "p_all_diff",
                geodata_path = params$gdm_dir
            ) +
            ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
            ggplot2::ggtitle("difference (TCA - overall) output (p_oa - p_all)"),
        ncol = 3
    )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "publications_countries.pdf"),
        width = 12,
        height = 8,
        map
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "publications_countries.png"),
        width = 12,
        height = 8,
        map
    )
}

Sentiment Analysis

Spatial

Show the code
#|


if (!file.exists(params$fn_sentiment_spatial_data)) {
    data <- read_corpus(params$corpus_authors_dir) |>
        dplyr::select(
            work_id,
            institution_country_code
        ) |>
        dplyr::filter(
            !is.na(institution_country_code)
        ) |>
        collect() |>
        mutate(
            iso3c = countrycode::countrycode(
                institution_country_code,
                origin = "iso2c",
                destination = "iso3c"
            ),
            institution_country_code = NULL
        ) |>
        dplyr::left_join(
            readRDS(params$fn_sentiment_results) |>
                select(
                    work_id = id,
                    neg,
                    neu,
                    pos,
                    compound
                ),
            by = "work_id"
        ) |>
        dplyr::group_by(iso3c) |>
        dplyr::summarize(
            mean_neg = mean(neg, na.rm = TRUE),
            mean_neu = mean(neu, na.rm = TRUE),
            mean_pos = mean(pos, na.rm = TRUE),
            mean_compound = mean(compound, na.rm = TRUE),
            n = n()
        ) |>
        dplyr::arrange(
            dplyr::desc(mean_neg)
        ) |>
        # Filter out missing countries - only necessary as analysis not completed yet
        dplyr::filter(
            !is.nan(mean_neg)
        ) |>
        saveRDS(file = params$fn_sentiment_spatial_data)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "maps"), pattern = "sentiment_neu_per_countries")) < 4) {
    data <- readRDS(params$fn_sentiment_spatial_data)

    map <- data |>
        map_country_codes(
            map_type = "countries",
            values = "mean_neu",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean neutral sentiment (0 - 1) - all countries")

    map_sel <- data |>
        dplyr::filter(n > params$min_count_sentiment_timeseries) |>
        map_country_codes(
            map_type = "countries",
            values = "mean_neu",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean neutral sentiment (0 - 1) - more than 10 works")

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neu_per_countries_all.pdf"),
        width = 12,
        height = 8,
        map
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neu_per_countries_all.png"),
        width = 12,
        height = 8,
        map
    )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neu_per_countries_10.pdf"),
        width = 12,
        height = 8,
        map_sel
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neu_per_countries_10.png"),
        width = 12,
        height = 8,
        map_sel
    )

    rm(map, data)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "maps"), pattern = "sentiment_pos_per_countries")) < 4) {
    data <- readRDS(params$fn_sentiment_spatial_data)

    map <- data |>
        map_country_codes(
            map_type = "countries",
            values = "mean_pos",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean positive sentiment (0 - 1) - all countries")

    map_sel <- data |>
        dplyr::filter(n > params$min_count_sentiment_timeseries) |>
        map_country_codes(
            map_type = "countries",
            values = "mean_pos",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean positive sentiment (0 - 1) - more than 10 works")

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_pos_per_countries_all.pdf"),
        width = 12,
        height = 8,
        map
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_pos_per_countries_all.png"),
        width = 12,
        height = 8,
        map
    )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_pos_per_countries_10.pdf"),
        width = 12,
        height = 8,
        map_sel
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_pos_per_countries_10.png"),
        width = 12,
        height = 8,
        map_sel
    )

    rm(map, data)
}
Show the code
#|
if (length(list.files(path = file.path("ch5_subsidies_reform", "maps"), pattern = "sentiment_neg_per_countries")) < 4) {
    data <- readRDS(params$fn_sentiment_spatial_data)

    map <- data |>
        map_country_codes(
            map_type = "countries",
            values = "mean_neg",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean negative sentiment (0 - 1) - all countries")

    map_sel <- data |>
        dplyr::filter(n > params$min_count_sentiment_timeseries) |>
        map_country_codes(
            map_type = "countries",
            values = "mean_pos",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean negative sentiment (0 - 1) - more than 10 works")

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neg_per_countries_all.pdf"),
        width = 12,
        height = 8,
        map
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neg_per_countries_all.png"),
        width = 12,
        height = 8,
        map
    )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neg_per_countries_10.pdf"),
        width = 12,
        height = 8,
        map_sel
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_neg_per_countries_10.png"),
        width = 12,
        height = 8,
        map_sel
    )

    rm(map, data)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "maps"), pattern = "sentiment_comp_per_countries")) < 4) {
    data <- readRDS(params$fn_sentiment_spatial_data)

    map <- data |>
        map_country_codes(
            map_type = "countries",
            values = "mean_compound",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean compound sentiment (-1: negative; 1: positive) - all countries")

    map_sel <- data |>
        dplyr::filter(n > params$min_count_sentiment_timeseries) |>
        map_country_codes(
            map_type = "countries",
            values = "mean_compound",
            geodata_path = params$gdm_dir
        ) +
        ggplot2::scale_fill_gradient2(low = "#E69F00", mid = "white", high = "#56B4E9", midpoint = 0) +
        ggplot2::ggtitle("Mean compound sentiment (-1: negative; 1: positive) - more than 10 works")

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_comp_per_countries_all.pdf"),
        width = 12,
        height = 8,
        map
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_comp_per_countries_all.png"),
        width = 12,
        height = 8,
        map
    )

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_comp_per_countries_10.pdf"),
        width = 12,
        height = 8,
        map_sel
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "maps", "sentiment_comp_per_countries_10.png"),
        width = 12,
        height = 8,
        map_sel
    )

    rm(map, data)
}

Timeseries

Show the code
#|

if (!file.exists(params$fn_sentiment_temporal_data)) {
    data <- readRDS(params$fn_sentiment_results) |>
        select(
            work_id = id,
            year = date,
            neg,
            neu,
            pos,
            compound
        ) |>
        dplyr::group_by(year) |>
        dplyr::summarize(
            neg = mean(neg),
            neu = mean(neu),
            pos = mean(pos),
            compound = mean(compound),
            n = n()
        ) |>
        saveRDS(file = params$fn_sentiment_temporal_data)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_over_time")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        tidyr::pivot_longer(cols = c(neg, neu, pos, compound), names_to = "type", values_to = "value") |>
        ggplot2::ggplot() +
        ggplot2::geom_line(aes(x = year, y = value, color = type, linetype = type)) +
        ggplot2::scale_color_manual(values = c("black", "red", "blue", "green")) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis Scores (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Score",
            color = "Type",
            linetype = "Type"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_over_time_neg_pos")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        tidyr::pivot_longer(cols = c(neg, pos), names_to = "type", values_to = "value") |>
        ggplot2::ggplot() +
        ggplot2::geom_line(aes(x = year, y = value, color = type, linetype = type)) +
        ggplot2::scale_color_manual(values = c("black", "red", "blue", "green")) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis Scores (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Score",
            color = "Type",
            linetype = "Type"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_over_time_neg_pos.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_over_time_neg_pos.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_neg_over_time")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        ggplot2::ggplot() +
        ggplot2::geom_line(ggplot2::aes(x = year, y = neg)) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis negative Score (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Negative score"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_neg_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_neg_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}
Show the code
#|
if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_neu_over_time")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        ggplot2::ggplot() +
        ggplot2::geom_line(ggplot2::aes(x = year, y = neu)) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis neutral Score (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Neutral score"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_neu_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_neu_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}
Show the code
#|

if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_pos_over_time")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        ggplot2::ggplot() +
        ggplot2::geom_line(ggplot2::aes(x = year, y = pos)) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis positive Score (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Positive score"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_pos_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_pos_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}
Show the code
#|
if (length(list.files(path = file.path("ch5_subsidies_reform", "figures"), pattern = "sentiments_comp_over_time")) < 2) {
    figure <- readRDS(params$fn_sentiment_temporal_data) |>
        dplyr::filter(
            n > params$min_count_sentiment_timeseries
        ) |>
        ggplot2::ggplot() +
        ggplot2::geom_line(ggplot2::aes(x = year, y = compound)) +
        ggplot2::labs(
            title = paste0("Sentiment Analysis Compound Score (n > ", params$min_count_sentiment_timeseries, ")"),
            x = "Year",
            y = "Compound score"
        ) +
        ggplot2::theme_minimal()

    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_comp_over_time.pdf"),
        width = 12,
        height = 6,
        figure
    )
    ggplot2::ggsave(
        file.path("ch5_subsidies_reform", "figures", "sentiments_comp_over_time.png"),
        width = 12,
        height = 6,
        figure
    )

    rm(figure)
}

Results

Number of hits

The number of hits are hits of the terms of the whole of the OpenAlex corpus. Due to methodological issues, the number of R1 AND R2 are overestimates and contain some double counting.

  • government_financial_subsidies in OpenAlex: 126,116 hits
  • government_financial_subsidies in downloaded corpus: 124,517 hits
  • Subsidies corpus: 124,517 hits

Manual review 50 paper

The file contains the id, doi, author_abbr and abstract of the papers. Two samples were generated:

  • works in the subsidies corpus AND in the TCA corpus which can be downloded here.

Publications over time

The red line is the cumulative proportion of publications, the blue line the cumulative proportion of all of the OpenAlex corpus. Both use the secondeary (red) axis.

To download high resolution, click here

Show the code
readRDS(params$fn_publications_temporal) |>
    IPBES.R::table_dt(fn = "publications_per_year")

Map Countries of Author Affiliations

Distribution of the author affiliation countries

The following calculations were done (count refers to the count of works per country in the subsidies corpus, count_oa to the count of works per country in the OpenAlex corpus):

  • **count** = ifelse(is.na(count), 0, count)
  • **log_count** = log(count + 1)
  • **p** = count / sum(count)
  • **p_output** = count / count_oa
  • **p_diff** = (p_oa - p) * 100

These are based on three different counts: - **count_oa**: Count of first authors all papers in the Open Alex Corpus per country - **count_first**: Count of first authors all papers in the Subsidies Corpus per country - **count_all**: Count of first authors of all papers in the Subsidies Corpus per country, weighted by 1/NO_AUTHORS_PER_PAPER

Countries of the works in different corpi. Left column: first author Open Alex Corpus; Middle Column: first author Subsidies Corpus; Right column: All authors weighted by inverse number of authors per paper of Subsidies Corpus” ”

To download high resolution, click here

Show the code
#|

readRDS(params$fn_publications_per_country_data) |>
    IPBES.R::table_dt(fn = "countries_first_author", fixedColumns = list(leftColumns = 2))

Sentiment Analysis

Two .parquet files containing the id, publication_year and ab (abstract) were extracted and are available upon request due to their size.

For analyzing the sentiments of the provided abstracts, we have used the Python NLTK package, and VADER (Valence Aware Dictionary for Sentiment Reasoning) which is an NLTK module that provides sentiment scores based on the words used. VADER is a pre-trained, rule-based sentiment analysis model in which the terms are generally labeled as per their semantic orientation as either positive or negative.

The main advantage/reason for using this model was that it doesn’t require a labbed training dataset. The output of the model is 4 statistical scores:

  • compound: composite score that summarizes the overall sentiment of the text, where scores close to 1 indicate a positive sentiment, scores close to -1 indicate a negative sentiment, and scores close to 0 indicate a neutral sentiment
  • negative: percentage of negative sentiments in the text
  • neutral: percentage of neutral sentiments in the text
  • positive: percentage of positive sentiments in the text
Show the code
#|

readRDS(params$fn_sentiment_results) |>
    IPBES.R::table_dt(fn = "sentiment_scores", fixedColumns = list(leftColumns = 2))
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Here is the per country table

Show the code
#|

readRDS(params$fn_sentiment_spatial_data) |>
    IPBES.R::table_dt(fn = "sentiments_comp_over_time")
Show the code
rm(data)
Warning in rm(data): object 'data' not found

Sentiments Over Time

This graphs shows the sentiment scores of the sentiment analysis over time.

To download high resolution, click here

For clarity, here only the positive and egative sentiments.

To download high resolution, click here

Negative Sentiment

Over Time

This graphs shows the negative score of the sentiment analysis over time. It only

To download high resolution, click here

Per country

To download high resolution, click here

To download high resolution, click here

Neutral Sentiment

Over Time

This graphs shows the compound score of the sentiment analysis over time. It only

To download high resolution, click here

Per country

To download high resolution, click here

To download high resolution, click here

Positive Sentiment

Over Time

This graphs shows the compound score of the sentiment analysis over time. It only

To download high resolution, click here

Per country

To download high resolution, click here

To download high resolution, click here

Compound Sentiment

Over Time

This graphs shows the compound score of the sentiment analysis over time. It only

To download high resolution, click here

Per country

To download high resolution, click here

To download high resolution, click here

Reuse

Citation

BibTeX citation:
@report{krug,
  author = {Krug, Rainer M.},
  title = {Report {Assessment} {Ch5} {Subsidies} {Reform}},
  doi = {XXXXXX},
  langid = {en},
  abstract = {A short description what this is about. This is not a
    tracditional abstract, but rather something else ...}
}
For attribution, please cite this work as:
Krug, Rainer M. n.d. “Report Assessment Ch5 Subsidies Reform.” IPBES Data Management Report. https://doi.org/XXXXXX.