Chapter 2 Before starting the analyses

In the context of evidence maps and meta-analyses, data files typically contain structured information derived from primary studies. A well-organized dataset is essential for ensuring transparency, reproducibility, and clarity in statistical analyses. The structure of these files plays a crucial role in data management and visualization, particularly when handling large datasets that summarize study characteristics, interventions, and outcomes. Below are some best practices and examples to follow when preparing and using such files.

2.1 Toward transparency and reproducibility

The FAIR principles (Findable, Accessible, Interoperable, and Reusable) provide a framework for improving the management and sharing of digital research assets. These principles ensure that data is discoverable through search engines, accessible with appropriate authorization, interoperable with other datasets, and reusable for various purposes. A key aspect is machine-actionability, enabling computers to process and understand data without significant human intervention.

France has made significant strides in promoting open science. The French National Plan for Open Science (2021-2024) mandates open access to both scientific publications and research data generated with public funding.

The overarching objective of this plan is to promote transparency, accessibility, and the preservation of scientific knowledge. By mandating open access, France ensures that research funded by public resources benefits the wider global community, fostering international collaboration and cross-disciplinary advancements.

A cornerstone of this effort is the adoption of the FAIR principles, which address common challenges in data management. Adhering to them makes research data more reliably reusable by improving documentation, accessibility, and compatibility across systems and disciplines.

Breakdown of FAIR Principles:

Findable: Data must be easy to locate for both humans and machines. This involves assigning globally unique identifiers (such as DOIs) to datasets and ensuring that metadata is searchable and indexed in databases.
Accessible: Data must be accessible under clear and transparent terms. This means storing datasets in repositories that guarantee long-term access, either through open-access platforms or specialized data journals that maintain the integrity of the data over time.
Interoperable: Data should be compatible with other datasets and tools. Standardized formats (e.g., CSV, JSON) and recognized metadata structures like Dublin Core help ensure that datasets can be integrated and compared across different systems.
Reusable: Data must be well-documented, with detailed metadata providing sufficient context to allow future researchers to reuse it effectively. This includes information on the dataset’s provenance, context, and usage conditions, ensuring that it can be reliably understood and repurposed in new research contexts.
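
As a minimal sketch of what these principles can look like in practice, the R code below saves a small dataset as CSV together with a machine-readable JSON metadata file whose fields are loosely modelled on Dublin Core elements. The dataset, file names, creator, and identifier are hypothetical placeholders, and the jsonlite package is an assumption (any JSON writer would do).

```r
library(jsonlite)

# Hypothetical dataset: two studies with one effect size each
dataset <- data.frame(
  Study_ID = c("S01", "S02"),
  Effect_Size = c(0.42, -0.10)
)

# Interoperable: store the data in a plain, standard format (CSV)
out_dir <- tempdir()
csv_path <- file.path(out_dir, "effect_sizes.csv")
write.csv(dataset, csv_path, row.names = FALSE)

# Findable and Reusable: describe the dataset in a sidecar metadata file.
# Field names follow Dublin Core elements; all values are placeholders.
metadata <- list(
  title = "Example effect-size dataset",
  creator = "Jane Doe (hypothetical)",
  identifier = "doi:PLACEHOLDER",
  date = "2024-01-01",
  format = "text/csv",
  rights = "CC BY 4.0",
  description = "Toy dataset illustrating a CSV file with sidecar metadata."
)
json_path <- file.path(out_dir, "effect_sizes_metadata.json")
write_json(metadata, json_path, auto_unbox = TRUE, pretty = TRUE)
```

In a real project the identifier would be the DOI assigned by the repository where the dataset is deposited, and the metadata schema would follow that repository's requirements.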

2.1.0.1 Reporting Standards in systematic reviews: PRISMA, ROSES, and Beyond

In systematic reviews and meta-analyses, following standardized reporting guidelines is essential for transparency and reproducibility. The most widely used framework is the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline, which outlines the minimum information that should be included in a systematic review, covering everything from search strategies to result synthesis. PRISMA encourages the use of flow diagrams to illustrate the study selection process, making the review process clear and replicable.

For environmental and social sciences, the ROSES (RepOrting standards for Systematic Evidence Syntheses) framework offers a tailored alternative. It includes checklists and flow diagrams similar to those of PRISMA, adapted to the specific challenges of conducting systematic reviews in complex, interdisciplinary fields like ecology, conservation, and agriculture.

Using these frameworks ensures:

Transparency in Study Selection and Data Extraction: Flow diagrams such as the PRISMA diagram clearly document how many studies were identified, screened, and ultimately included in the synthesis. This transparency helps prevent biases in study selection and allows future researchers to see the logic behind inclusion and exclusion criteria.
Comprehensive Reporting of Methods and Results: Both PRISMA and ROSES encourage detailed reporting of the data extraction process, statistical methods used in meta-analyses, and sensitivity analyses, which are crucial for assessing the robustness of results.
Enhanced Reproducibility: These guidelines ensure that other researchers can reproduce the review process, validate findings, and use the extracted data for new meta-analyses, secondary syntheses, or policy assessments.
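
The counts reported in a PRISMA or ROSES flow diagram can be derived directly from a screening log, which keeps the diagram consistent with the data. The sketch below uses a hypothetical log (the column names, stage labels, and exclusion reasons are illustrative assumptions, not part of either standard) and base R only.

```r
# Hypothetical screening log: one row per record retrieved by the search.
# stage_reached is the last stage each record reached before exclusion or inclusion.
screening <- data.frame(
  record_id = 1:8,
  stage_reached = c("title_abstract", "title_abstract", "full_text", "full_text",
                    "full_text", "included", "included", "included"),
  exclusion_reason = c("Unsuitable population", "Unsuitable outcomes",
                       "No control group", "Unsuitable outcomes", "No control group",
                       NA, NA, NA),
  stringsAsFactors = FALSE
)

# Counts for the boxes of the flow diagram
n_identified <- nrow(screening)
n_excluded_title_abstract <- sum(screening$stage_reached == "title_abstract")
n_full_text_assessed <- n_identified - n_excluded_title_abstract
n_included <- sum(screening$stage_reached == "included")

# Exclusion reasons for records rejected at the full-text stage
reasons_full_text <- table(
  screening$exclusion_reason[screening$stage_reached == "full_text"]
)

flow <- c(
  identified = n_identified,
  excluded_at_title_abstract = n_excluded_title_abstract,
  assessed_at_full_text = n_full_text_assessed,
  included = n_included
)
print(flow)
print(reasons_full_text)
```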

2.2 Publishing Fully Reproducible Protocols

Protocols describe the step-by-step methodology that researchers intend to follow before conducting a study. By pre-registering the research questions, the criteria for study inclusion, and the planned analytical methods, they ensure transparency, reproducibility, and consistency in systematic reviews, meta-analyses, and other research designs. This practice minimizes bias, prevents selective reporting, and enhances the credibility of findings.

Pre-registration of protocols has become standard practice in fields like medicine, facilitated by platforms such as PROSPERO, and systematic reviews and meta-analyses in medical research typically adhere to stringent pre-registration guidelines. In agronomy and ecology, by contrast, the practice is still in the early stages of adoption. Encouraging the use of pre-published protocols in these fields would improve methodological rigor, comparability of results, and overall transparency in environmental and agricultural research.

2.2.0.1 Key Components of a Research Protocol

Research Objectives and Questions: Clearly defines the goals of the study and the specific research questions to be addressed.
Eligibility Criteria: Specifies which studies will be included or excluded based on predefined parameters (e.g., study design, population characteristics, intervention type).
Search Strategy: Describes the databases, search terms, and timeframe for literature searches.
Data Extraction and Coding: Outlines the methods for extracting, coding, and managing data, including variable definitions and metadata structures.
Risk of Bias and Quality Assessment: Details the criteria and tools used to assess the quality and potential biases of included studies.
Analytical Plan: Pre-specifies statistical methods, models, and subgroup analyses to be used, ensuring that analytical choices are not influenced by observed results.

2.2.0.2 Importance in Meta-Analyses and Evidence Synthesis

Publishing a detailed protocol before initiating a meta-analysis or systematic review is crucial for avoiding bias and maintaining scientific rigor. Protocols act as a roadmap, guiding researchers through the review process and serving as a reference point against which deviations can be assessed. This is particularly important for high-stakes reviews, such as those informing policy decisions or large-scale evidence syntheses in public health and environmental sciences.

Well-developed protocols also enhance collaboration and standardization within research communities by enabling other researchers to replicate or build upon the same methodology. In ecological and agronomic meta-analyses, where diverse study designs and heterogeneous data sources are common, robust protocols are indispensable for harmonizing evidence and ensuring comparability across studies.

2.2.0.3 Standards and Guidelines

Several frameworks provide comprehensive guidance for developing and publishing reproducible protocols:

Cochrane Handbook for Systematic Reviews: The Cochrane Collaboration sets the gold standard for systematic reviews in health and medical research. Its protocols follow a highly structured format that emphasizes transparency, replicability, and methodological rigor.
ROSES (RepOrting standards for Systematic Evidence Syntheses): Tailored for ecological and environmental sciences, the ROSES framework outlines specific guidelines for planning and reporting systematic reviews and maps in these fields.
PRISMA-P (Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols): PRISMA-P is designed to standardize the reporting of protocols for systematic reviews and meta-analyses, ensuring all critical elements are included.

2.2.0.4 Journals Specializing in Protocols

Several journals, platforms, and registries publish or register research protocols, giving researchers a place to share detailed methodological plans and facilitate reproducibility:

BMC Systematic Reviews: Publishes protocols and reviews in health, social, and environmental sciences. BMC Systematic Reviews requires that all protocols adhere to PRISMA-P or similar reporting standards.
Protocols.io: An open-access platform that allows researchers to publish detailed experimental protocols, workflows, and analysis pipelines. It is widely used across disciplines to promote transparent research.
BMJ Open: Features protocols for any research area, including environmental, health, and social sciences. The journal emphasizes open science and reproducibility.
Nature Protocols: Focuses on detailed experimental protocols in life sciences. Although primarily designed for laboratory research, it offers high visibility for methodological papers.
PROSPERO: An international database for pre-registering protocols of systematic reviews focused on health and social care.

2.2.0.5 Example of a Protocol Publication

Rousset, C., Segura, C., Gilgen, A., Alfaro, M., Mendes, L.A., Dodd, M., Dashpurev, B., Bastidas, M., Rivera, J., Merbold, L. and Vázquez, E., 2024. What evidence exists relating the impact of different grassland management practices to soil carbon in livestock systems? A systematic map protocol. Environmental Evidence, 13(1), p.22.

This protocol describes a systematic map of the evidence relating grassland management practices to soil carbon in livestock systems. Like other published protocols, it pre-specifies the methodological details of the synthesis, such as eligibility criteria and data extraction strategies, supporting reproducibility and transparency throughout the study.

2.2.0.6 Useful Links for Protocol Standards and Templates

Cochrane Handbook for Systematic Reviews of Interventions: Cochrane Handbook
PRISMA-P Reporting Guidelines: PRISMA-P Checklist
ROSES Guidelines for Environmental Sciences: ROSES Reporting Standards
Equator Network: A comprehensive resource for research reporting guidelines and protocol standards: Equator Network

Useful link (Environmental Evidence guidance on preparing a systematic review protocol): https://environmentalevidencejournal.biomedcentral.com/submission-guidelines/preparing-your-manuscript/systematic-review-protocol

2.3 Publishing a Data Paper

A Data Paper is a publication dedicated to describing the structure, collection, and value of a dataset. Unlike traditional research papers, which focus on findings and interpretations, Data Papers emphasize the metadata, methodology, and potential uses of the dataset itself. They offer detailed insights into how the data was gathered, processed, and structured, which is essential for reproducibility in scientific studies.

Key Components of a Data Paper:

Dataset Overview: Provides a summary of the dataset, including its purpose and potential applications.
Metadata: Describes each variable, including units of measurement, data types, and any transformations applied.
Collection Methods: Details the experimental or observational methods used to gather the data.
Limitations and Uncertainties: Discloses any potential biases, gaps, or limitations in the dataset.
Data Access: Specifies how the data can be accessed and reused, often with a permanent DOI link.

In evidence mapping and meta-analyses, the publication of Data Papers ensures that large datasets, which could be difficult to interpret otherwise, are accompanied by clear, accessible documentation. This reduces barriers to data reuse and promotes collaboration across research communities.
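
The metadata component of a Data Paper can be drafted from the dataset itself. The following sketch, in base R, builds a data dictionary skeleton (variable name, type, number of missing values) that is then completed by hand with units and descriptions. The toy dataset and the wording of the descriptions are hypothetical.

```r
# Hypothetical extract of a meta-analysis dataset
dat <- data.frame(
  Study_ID = c("S01", "S02", "S03"),
  Outcome = c("Soil organic carbon", "Crop yield", NA),
  Effect_Size = c(0.42, -0.10, 0.15),
  stringsAsFactors = FALSE
)

# One row per variable: name, storage type, and number of missing values.
# unit and description start empty and are filled in manually.
data_dictionary <- data.frame(
  variable = names(dat),
  type = vapply(dat, function(x) class(x)[1], character(1)),
  n_missing = vapply(dat, function(x) sum(is.na(x)), integer(1)),
  unit = NA_character_,
  description = NA_character_,
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Fill in one entry by hand (illustrative wording)
is_es <- data_dictionary$variable == "Effect_Size"
data_dictionary$unit[is_es] <- "unitless"
data_dictionary$description[is_es] <- "Standardized effect size (metric to be stated)"

print(data_dictionary)
```

Saving this table alongside the data file gives reviewers and future users a single place to look up what each column means.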

Journals Publishing Data Papers

Several specialized journals focus on publishing Data Papers, promoting high-quality data curation and sharing. Scientific Data (Nature Research) and Data in Brief (Elsevier) are prominent examples, and Biodiversity Data Journal publishes data papers focused on biodiversity and ecological datasets. These journals often require the dataset to be archived in an open-access repository with rich metadata, and they apply rigorous peer review. This helps ensure that the shared data are of high quality, reusable, and consistent with best practices for transparency and openness.

Examples of Data Papers:

Beillouin, Damien, Marc Corbeels, Julien Demenois, David Berre, Annie Boyer, Abigail Fallot, Frédéric Feder, and Rémi Cardinael. “A global meta-analysis of soil organic carbon in the Anthropocene.” Nature Communications 14, no. 1 (2023): 3700.
Byun, E., Müller, C., Parisse, B., Napoli, R., Zhang, J.B., Rezanezhad, F., Van Cappellen, P., Moser, G., Jansen-Willems, A.B., Yang, W.H. and Urakawa, R., 2024. A global dataset of gross nitrogen transformation rates across terrestrial ecosystems. Scientific Data, 11(1), p.1022.

Useful links:
- CIRAD publier un Datapaper https://coop-ist.cirad.fr/gerer-des-donnees/publier-un-data-paper/1-qu-est-ce-qu-un-data-paper
- CINES, 2017. Les formats de fichier. https://www.cines.fr/archivage/des-expertises/les-formats-de-fichier/
- CNRS, 2023 (version 2.0). Guide de bonnes pratiques sur la gestion des données de recherche. Publier un Datapaper pour valoriser et expliciter les données. https://mi-gt-donnees.pages.math.unistra.fr/guide/00-introduction.html
- DoRANum, 2018. La minute Publier un Data paper. https://doi.org/10.13143/4mhn-mq42

2.3.1 Publishing in Open Access Journals

When submitting research for publication, consider choosing open access journals, particularly those that operate on a non-profit basis. This approach ensures that publicly funded research is readily accessible to the public, promoting transparency and facilitating broader dissemination of knowledge. Open access publishing removes paywalls, allowing researchers, practitioners, and policymakers to engage with your work without financial barriers, thereby enhancing the impact and reach of your findings. Prioritizing non-profit journals also supports sustainable publishing practices that align with the principles of open science.

2.4 Best practices for structuring meta-analysis data files

2.4.1 Generalities

Consistent Naming Conventions: Ensure that file and column names are clear, consistent, and meaningful. For example, naming columns Study_ID, Outcome, Intervention, and Effect_Size helps avoid confusion during data manipulation. Avoid spaces and special characters in column names, and pick one convention for readability, such as camel case (e.g., StudyName) or underscores (e.g., study_name), and apply it throughout.
Comprehensive Metadata: Metadata should accompany the main data file, providing explanations of each column and the coding used (e.g., what constitutes “intervention type” or “effect size unit”). A “Data Dictionary” should always be part of your dataset, explaining variables such as:
- Outcome: The primary outcome measured in the study.
- Intervention: Types of interventions, such as “land-use change” or “management.”
- Effect_Size: Numeric effect size, with the metric stated in the data dictionary (e.g., Hedges' g or Cohen's d).
Wide vs. Long Format: Choose the format that best suits your analysis:

Wide Format: Used when each row represents a study, with multiple columns for each outcome (e.g., separate columns for effect sizes).

Field	Soil pH	Nitrogen Content (%)	Crop Yield (kg/ha)
1	6.5	45	3000
2	6.8	50	3200
3	6.2	40	2800
Long Format: More suitable for meta-analysis and visualization in R. Each row contains a single observation or a study’s outcome, which allows for easier aggregation, filtering, and plotting.

Field	Variable	Value
1	Soil pH	6.5
1	Nitrogen Content (%)	45
1	Crop Yield (kg/ha)	3000
2	Soil pH	6.8
2	Nitrogen Content (%)	50
2	Crop Yield (kg/ha)	3200
3	Soil pH	6.2
3	Nitrogen Content (%)	40
3	Crop Yield (kg/ha)	2800
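
The conversion between the two layouts can be sketched with the tidyr package. The example reuses the three fields from the tables above; the column names are rewritten with underscores, which is an assumption made here so that they follow the naming advice given earlier.

```r
library(tidyr)

# Wide format: one row per field, one column per measured variable
wide <- data.frame(
  Field = 1:3,
  Soil_pH = c(6.5, 6.8, 6.2),
  Nitrogen_Content_pct = c(45, 50, 40),
  Crop_Yield_kg_ha = c(3000, 3200, 2800)
)

# Wide -> long: stack every column except Field into Variable/Value pairs
long <- pivot_longer(
  wide,
  cols = -Field,
  names_to = "Variable",
  values_to = "Value"
)

# Long -> wide: the reverse operation recovers one column per variable
wide_again <- pivot_wider(long, names_from = Variable, values_from = Value)

print(long)
print(wide_again)
```

The long table has one row per field and variable (nine rows here), which is the shape most plotting and modelling functions in R expect.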

Handling Missing Data: It’s common to encounter missing data in meta-analyses. Best practices include:
- Using a consistent code for missing values, such as NA.
- Avoiding empty cells, which can cause issues when importing data into R.
- Documenting missing data in the metadata.
Version Control: Ensure version control for your datasets. Tools like Git or a simple versioning system (e.g., dataset_v1.csv, dataset_v2.csv) can help track changes and maintain the integrity of your data over time.
Data Cleanliness: Ensure all numeric data are formatted correctly (e.g., avoid mixing numbers and text in the same column). Double-check for typographical errors, duplicates, and inconsistencies in categorical data. Tools like dplyr::mutate() and tidyr::pivot_longer() can aid in cleaning and restructuring data for analysis.
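
Several of the checks above can be scripted so that they run every time the data file is imported. The sketch below uses base R on a small, hypothetical CSV (the study labels and values are invented for illustration): it treats empty cells and the NA code as missing, confirms that the effect-size column is numeric, counts missing values, removes exact duplicate rows, and harmonises labels that differ only by case.

```r
# Toy CSV with an empty cell, an explicit NA code, a duplicated row,
# and an intervention label written with inconsistent capitalisation
csv_text <- "Study_ID,Intervention,Effect_Size
S01,Management,0.42
S02,management,
S03,Land-use change,NA
S03,Land-use change,NA
S04,Management,-0.10"

# Treat both empty cells and the string "NA" as missing on import
dat <- read.csv(text = csv_text, na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

# A numeric column that contains stray text would be imported as character
stopifnot(is.numeric(dat$Effect_Size))

# Number of missing values per column, to be reported in the metadata
missing_per_column <- colSums(is.na(dat))

# Remove exact duplicate rows
dat_clean <- dat[!duplicated(dat), ]

# Harmonise categorical labels that differ only by case
dat_clean$Intervention <- tolower(dat_clean$Intervention)

print(missing_per_column)
print(dat_clean)
```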

2.4.2 Harmonisable classifications of practices and outcomes

Meta-analysis and evidence synthesis require consistent and harmonized classifications of interventions, practices, and outcomes to ensure that findings are comparable across studies and geographic contexts. In agricultural and ecological research, the diversity of practices, variations in terminology, and the complex relationships between interventions and their impacts on multiple outcomes complicate this classification task. This section highlights the importance of employing ontologies as a foundational step in developing harmonizable classifications. Investing the time to establish clear definitions and boundaries between classes for practices, outcomes, and site descriptions is crucial. A well-defined research question can further refine the scope and make classification easier. By systematically categorizing agricultural practices and outcomes, researchers can enhance the rigor and relevance of meta-analytical studies, ultimately contributing to more robust evidence synthesis.
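
One simple way to apply such a classification is a lookup table that maps the terms used in primary studies to a harmonized class, and that flags any term not yet covered. The sketch below is in base R; the terms and class names are hypothetical and do not come from a published ontology.

```r
# Hypothetical lookup table: raw term (lower case) -> harmonized practice class
lookup <- c(
  "no-till" = "Tillage management",
  "zero tillage" = "Tillage management",
  "reduced tillage" = "Tillage management",
  "cover crop" = "Crop diversification",
  "intercropping" = "Crop diversification"
)

# Terms as they might appear in extracted studies
raw_practice <- c("No-till", "zero tillage", "Intercropping", "biochar")

# Normalise case and whitespace before matching against the lookup table
key <- tolower(trimws(raw_practice))
harmonised <- unname(lookup[key])

# Terms without a class need a decision: extend the lookup or exclude the study
unmatched <- unique(raw_practice[is.na(harmonised)])

print(data.frame(raw_practice, harmonised))
print(unmatched)
```

Keeping the lookup table in its own file, under version control, documents every classification decision and makes it easy to revise them later.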

2.5 Example: Meta-analysis datasets

To explore and use meta-analysis datasets, you can refer to the metadat package in R, which provides a comprehensive collection of datasets for teaching, illustrating meta-analytic methods, and validating published analyses. You can install the package from CRAN and, once it is installed, browse the available datasets:

# Install the metadat package from CRAN (run once)
#install.packages("metadat")

# Load the metadat package
library(metadat)

# List the datasets included in the package
help(package = metadat)

Each dataset is well-documented with metadata, including concept terms such as research field, outcome measures, and analytic models. These metadata provide insight into the structure and purpose of each dataset. Additionally, the datsearch() function allows you to search for datasets based on specific concept terms or perform a full-text search through their documentation.

The datasets in metadat follow structured formats, typically containing variables related to effect sizes, moderators, and sample information. To contribute or to explore more in-depth examples, visit the package's online documentation on the metadat GitHub page, where you can also view the output of example analyses for each dataset.

# Load the dat.curtis1998 dataset
dat <- dat.curtis1998

# Explore the dataset (skimr provides skim() for a detailed summary)
#install.packages("skimr")
library(skimr)
head(dat)

##   id paper   genus     species fungrp co2.ambi co2.elev units time pot method
## 1 21    44   ALNUS       RUBRA  N2FIX      350      650  ul/l   47 0.5     GC
## 2 22    44   ALNUS       RUBRA  N2FIX      350      650  ul/l   47 0.5     GC
## 3 27   121    ACER      RUBRUM  ANGIO      350      700   ppm   59 2.6     GH
## 4 32   121 QUERCUS      PRINUS  ANGIO      350      700   ppm   70 2.6     GH
## 5 35   121   MALUS   DOMESTICA  ANGIO      350      700   ppm   64 2.6     GH
## 6 38   121    ACER SACCHARINUM  ANGIO      350      700   ppm   50 2.6     GH
##   stock xtrt   level     m1i      sd1i n1i    m2i      sd2i n2i
## 1  SEED FERT    HIGH  6.8169 1.7699820   3 3.9450 1.1157970   5
## 2  SEED FERT CONTROL  2.5961 0.6674662   5 2.2512 0.3275839   5
## 3  SEED NONE       .  2.9900 0.8560000   5 1.9300 0.5520000   5
## 4  SEED NONE       .  5.9100 1.7420000   5 6.6200 1.6310000   5
## 5  SEED NONE       .  4.6100 1.4070000   4 4.1000 1.2570000   4
## 6  SEED NONE       . 10.7800 1.1630000   5 6.4200 2.0260000   3
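
The columns m1i, sd1i, n1i and m2i, sd2i, n2i hold the mean, standard deviation, and sample size of the two groups being compared, which is what is needed to compute an effect size. As a hedged illustration, the base R sketch below retypes the six rows printed above and computes a log response ratio and its sampling variance with the usual large-sample formula. Treating group 1 as the elevated-CO2 group and group 2 as the ambient group is an assumption to be checked against the dataset's help page, and the choice of this effect size is for illustration only.

```r
# The six rows shown above, retyped so the example runs without metadat
curtis_head <- data.frame(
  id = c(21, 22, 27, 32, 35, 38),
  m1i = c(6.8169, 2.5961, 2.9900, 5.9100, 4.6100, 10.7800),
  sd1i = c(1.7699820, 0.6674662, 0.8560000, 1.7420000, 1.4070000, 1.1630000),
  n1i = c(3, 5, 5, 5, 4, 5),
  m2i = c(3.9450, 2.2512, 1.9300, 6.6200, 4.1000, 6.4200),
  sd2i = c(1.1157970, 0.3275839, 0.5520000, 1.6310000, 1.2570000, 2.0260000),
  n2i = c(5, 5, 5, 5, 4, 3)
)

# Log response ratio: log of the ratio of the two group means
curtis_head$yi <- with(curtis_head, log(m1i / m2i))

# Sampling variance of the log response ratio
curtis_head$vi <- with(
  curtis_head,
  sd1i^2 / (n1i * m1i^2) + sd2i^2 / (n2i * m2i^2)
)

print(round(curtis_head[, c("id", "yi", "vi")], 4))
```

A positive yi means the group 1 mean is larger than the group 2 mean; the fourth row, where m1i is smaller than m2i, gives a negative value.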

2.5.1 Dataset for Our Exercises

We will be using the dataset titled A Global Database of Diversified Farming Effects on Biodiversity and Yield. This comprehensive dataset includes 4,076 comparisons of biodiversity outcomes and 1,214 comparisons of yield in diversified farming systems, contrasting these outcomes with two reference systems.

The dataset encompasses evidence from 48 countries and evaluates the effects of diversified farming systems on species across 33 taxonomic orders, including insects, plants, birds, mammals, eukaryotes, annelids, fungi, and bacteria. It specifically addresses systems that produce both annual and perennial crops across 12 commodity groups.

This dataset serves as a valuable resource for researchers and practitioners, facilitating access to critical information regarding the positive contributions of diversified farming systems to both biodiversity and food production outcomes.

2.5.2 Steps to Access the Dataset

Load the files from Harvard Dataverse:
Visit the Harvard Dataverse website.
Download all files and save them in a folder named data inside your current working directory, which is where the code below looks for them.

By following these steps, you will be well-equipped to utilize the dataset for the upcoming exercises in this module.
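
Before reading the files, it can help to confirm that they are where the code expects them. This short base R check assumes the two file names used in the import code of this section and a data folder inside the working directory.

```r
# File names as used by the import code in this section
expected_files <- file.path(
  "data",
  c("Dataset 1_sources.xlsx", "Dataset_2_outcomes.xlsx")
)

# Identify any file that is not present in the data folder
missing_files <- expected_files[!file.exists(expected_files)]

if (length(missing_files) > 0) {
  message("Missing files: ", paste(missing_files, collapse = ", "))
  message("Current working directory: ", getwd())
} else {
  message("All expected files were found.")
}
```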

# Install the package used to read data from Excel files (run once)
#install.packages("readxl")
# Load the package
library(readxl)
# Read the file
Meta_Data <- read_excel("data/Dataset 1_sources.xlsx", sheet = "Literature_screened")
head(Meta_Data)

## # A tibble: 6 × 19
##      ID Article_source  Inclusion_yes_no Exclusion_reasion_pico Exclusion_reason
##   <dbl> <chr>           <chr>            <chr>                  <chr>           
## 1    13 Stakeholder re… Yes              NA                     NA              
## 2    15 Scopus or WoS   No               Unsuitable outcomes    Effect on yield…
## 3    18 Stakeholder re… Yes              NA                     NA              
## 4    25 Scopus or WoS   No               Unsuitable population  Irrelevant      
## 5    30 Stakeholder re… Yes              NA                     NA              
## 6    34 Stakeholder re… Yes              NA                     NA              
## # ℹ 14 more variables: Authors <chr>, Year <dbl>, Title <chr>,
## #   Source.title <chr>, Volume <chr>, Issue <chr>, Art_No <chr>,
## #   Page_start <chr>, Page_end <chr>, Page_count <chr>, DOI <chr>, Link <chr>,
## #   ISSN <chr>, ISBN <chr>

# For a detailed summary, use:
#skim(Meta_Data)

Outcome <- read_excel("data/Dataset_2_outcomes.xlsx", sheet = "Data")
head(Outcome)

## # A tibble: 6 × 105
##      ID Experiment_stage Comparison_ID_C Comparison_class_C Crop_C Crop_FAO_C   
##   <dbl> <chr>            <chr>           <chr>              <chr>  <chr>        
## 1  1033 1                C1              Natural            Forest NA           
## 2  1033 1                C2              Simplified         Coffee 12 - STIMULA…
## 3  1033 2                C1              Natural            Forest NA           
## 4  1033 2                C2              Simplified         Coffee 12 - STIMULA…
## 5  1033 3                C2              Simplified         Coffee 12 - STIMULA…
## 6  1033 4                C1              Natural            Forest NA           
## # ℹ 99 more variables: Crop_ann_pen_C <chr>, Crop_woodiness_C <chr>,
## #   crops_all_common_C <chr>, crops_all_scientific_C <chr>,
## #   crops_all_scientific_level_C <chr>, System_raw_C <chr>,
## #   System_details_C <chr>, System_C <chr>, Fertiliser_C <chr>,
## #   Fertiliser_chem_C <chr>, Pesticide_C <chr>, Pesticide_quantity_C <chr>,
## #   Soil_management_C <chr>, Time_state_C <chr>, Study_length_C <chr>,
## #   Sampling_unit_C <chr>, B_error_measure_C <chr>, B_error_value_C <dbl>, …

# For a detailed summary, use:
#skim(Outcome)