The Labor-Intensive Nature of Evidence Synthesis
The Asian Development Bank (ADB) working paper Automating Evidence Synthesis: A Comparative Evaluation of Large Language Models for Data Extraction examines how effectively Large Language Models (LLMs) can automate data extraction for systematic reviews and meta-analyses (SRMAs).
SRMAs are vital for global evidence-based policy formulation, but manually extracting data from academic literature is slow, expensive, and prone to human inconsistency. The study evaluates advanced AI systems, including Gemini 2.5 Pro, GPT-5.0, and Sonnet 4.0, across research domains such as mobile health interventions and COVID-19 learning-loss studies. To standardize evaluation, the researchers designed an automated pipeline that converts academic PDF documents into machine-readable Markdown text before applying structured extraction instructions through YAML-based coding manuals.

[Academic PDF Document] ──► [Markdown Text Conversion] ──► [LLM + YAML Coding Manual] ──► [Reasoning Trace + Structured Data Output]
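That workflow can be pictured as a short script. The sketch below is a minimal, hypothetical reconstruction in Python: the PDF-to-Markdown converter and the LLM client are placeholder functions (the paper does not prescribe specific tools), and the prompt wording is illustrative only.

```python
import yaml  # requires the PyYAML package

def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Hypothetical PDF-to-Markdown step; the paper does not name a specific converter."""
    raise NotImplementedError("Plug in your preferred PDF-to-Markdown tool here.")

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever model API (Gemini, GPT, Claude) is in use."""
    raise NotImplementedError("Plug in the model client of your choice here.")

def extract_study_data(pdf_path: str, manual_path: str) -> str:
    # 1. Convert the academic PDF into machine-readable Markdown text.
    markdown_text = convert_pdf_to_markdown(pdf_path)

    # 2. Load the YAML coding manual that defines each field to extract.
    with open(manual_path, encoding="utf-8") as f:
        coding_manual = yaml.safe_load(f)

    # 3. Build a single structured prompt: coding manual first, paper text after.
    prompt = (
        "You are extracting data for a systematic review.\n"
        "Follow the coding manual exactly. First write your reasoning trace, "
        "then output the structured data.\n\n"
        f"CODING MANUAL:\n{yaml.safe_dump(coding_manual)}\n\n"
        f"PAPER (Markdown):\n{markdown_text}"
    )

    # 4. The model returns a reasoning trace plus the structured extraction.
    return call_llm(prompt)
```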
Qualitative Mastery vs. Quantitative Friction
The benchmark results uncovered a stark divergence between qualitative and quantitative AI capabilities. High-reasoning models, particularly Gemini 2.5 Pro, achieved near-perfect accuracy (up to 1.00) in extracting qualitative metadata such as study locations, age demographics, and intervention types.
Conversely, extracting raw quantitative data remains a critical failure point across all models, with accuracy levels frequently dropping below 0.5. AI models consistently struggled with complex tabular layouts, table orientations, and implicit statistical calculations (e.g., deriving effect sizes from raw means and standard deviations).
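To make concrete the kind of implicit calculation at issue, deriving a standardized mean difference (Cohen's d) from group means and standard deviations looks like the following; the numbers are invented for illustration and do not come from the paper.

```python
import math

def cohens_d(mean_treat: float, sd_treat: float, n_treat: int,
             mean_ctrl: float, sd_ctrl: float, n_ctrl: int) -> float:
    """Standardized mean difference computed from raw means and standard deviations."""
    # Pooled standard deviation across the two groups.
    pooled_sd = math.sqrt(
        ((n_treat - 1) * sd_treat ** 2 + (n_ctrl - 1) * sd_ctrl ** 2)
        / (n_treat + n_ctrl - 2)
    )
    return (mean_treat - mean_ctrl) / pooled_sd

# Illustrative numbers only (not taken from the ADB paper).
print(round(cohens_d(72.0, 10.0, 50, 68.0, 12.0, 50), 3))  # ≈ 0.362
```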
The paper concludes that while AI can radically accelerate qualitative screening, full automation of quantitative data synthesis is not yet reliable and demands strict human oversight.
Key Benchmarks & Findings
Research Scope: Evaluation of LLMs for systematic reviews and meta-analyses (SRMAs)
Models Assessed: Gemini 2.5 Pro, GPT-5.0, Sonnet 4.0
Pipeline Design: PDF → Markdown conversion → YAML-guided extraction workflow
Qualitative Strength: Near-perfect accuracy in metadata extraction and contextual classification
Quantitative Weakness: Low reliability in extracting statistical values and effect-size metrics
Cost Range: High-reasoning models cost approximately US$ 4.79–9.85 per paper
Auditability Mechanism: “Thinking prompts” and reasoning traces improved verification capacity
What is "Evidence Synthesis"?
Evidence synthesis is the scientific practice of combining data and findings from multiple independent studies to reach a comprehensive, overarching conclusion on a specific research question. Often operationalized through systematic reviews and meta-analyses, it allows policymakers to understand "what works" globally by pooling sample sizes and neutralizing localized bias. For instance, instead of relying on a single school's data, an evidence synthesis pools dozens of global education studies to evaluate the true impact of digital remedial learning on post-pandemic recovery.
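Under a simple fixed-effect model, "pooling" amounts to an inverse-variance weighted average of study estimates, so larger and more precise studies count for more. The sketch below is a minimal illustration with made-up numbers, not a reproduction of any analysis in the paper.

```python
def pooled_effect(effects: list[float], variances: list[float]) -> float:
    """Inverse-variance weighted average of study-level effect estimates."""
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

# Three hypothetical education studies: effect sizes and their variances.
print(round(pooled_effect([0.30, 0.45, 0.10], [0.02, 0.05, 0.01]), 3))  # = 0.2
```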
Policy Relevance
Accelerates Evidence-Based Policy: Automating qualitative metadata extraction allows government think tanks like NITI Aayog to synthesize thousands of global policy papers in days rather than months, drastically reducing the turnaround time for drafting national strategies.
Highlights the Limits of Pure Automation: The low accuracy (<0.5) in quantitative data extraction serves as a guardrail for state agencies, demonstrating that human statisticians remain indispensable for verifying complex economic or clinical data.
Standardizes Governance Frameworks: The study’s reliance on a YAML coding manual shows that AI performance depends heavily on highly prescriptive, step-by-step instructions rather than ambiguous prompts (see the illustrative manual excerpt after this list).
Improves Auditability in Public Data: Mandating a "reasoning trace" within the AI pipeline introduces a verifiable audit trail, ensuring that public policies are not built on unchecked, hallucinated machine outputs.
Identifies Overlooked Research Outcomes: Because LLMs strictly follow manuals without applying implicit human filters, they often capture secondary outcomes missed by human reviewers, ensuring a more comprehensive view of research data.
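For readers unfamiliar with the approach, a coding manual of the kind the paper describes might look like the YAML excerpt below. The field names, allowed values, and instructions are invented for illustration; the ADB paper's actual manual is not reproduced here.

```yaml
# Hypothetical excerpt of a YAML coding manual (field names are illustrative,
# not taken from the ADB paper).
fields:
  - name: study_location
    type: string
    instruction: >
      Report the country or countries where the intervention was implemented.
      If multiple sites are listed, report all of them, separated by semicolons.
  - name: intervention_type
    type: categorical
    allowed_values: [mobile_health, remedial_learning, other]
    instruction: >
      Classify the intervention using only the allowed values. If the paper
      does not clearly fit a category, choose "other" and explain why in the
      reasoning trace.
  - name: effect_size
    type: number
    instruction: >
      Extract the reported effect size exactly as published. If only raw means
      and standard deviations are reported, leave this field empty and flag it
      for human review.
```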
Follow the Full Report Here: Automating Evidence Synthesis: A Comparative Evaluation of Large Language Models for Data Extraction

