Aditya Retnanto, Yohan Iddawela, and Elaine S. Tan

ADB Economics Working Paper Series

Automating Evidence Synthesis: A Comparative Evaluation

Aditya Retnanto (aretnanto.consultant@adb.org) is a consultant, Yohan Iddawela (yiddawela@adb.org) is an economist (data science), and Elaine S. Tan (estan@adb.org) is the director of the Data Division, Economic Research and Development Impact

No. 845 | May 2026

The ADB Economics Working Paper Series presents research in progress to elicit comments and encourage debate on development issues in Asia and the Pacific. The views expressed are those of the authors and do not necessarily

© 2026 Asian Development Bank
6 ADB Avenue, Mandaluyong City, 1550 Metro Manila, Philippines
Tel +63 2 8632 4444; Fax +63 2 8636 2444

Some rights reserved. Published in 2026.

ISSN 2313-6537 (print), 2313-6545 (PDF)
Publication Stock No. WPS260200-2
DOI: http://dx.doi.org/10.22617/WPS260200-2

The views expressed in this publication are those of the authors and do not necessarily reflect the views and policies of the Asian Development Bank (ADB) or its Board of Governors or the governments they represent.

ADB does not guarantee the accuracy of the data included in this publication and accepts no responsibility for any consequence of their use. The mention of specific companies or products of manufacturers does not imply that they

By making any designation of or reference to a particular territory or geographic area in this document, ADB does not intend to make any judgments as to the legal or other status of any territory or area.

This publication is available under the Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO) https://creativecommons.org/licenses/by/3.0/igo/. By using the content of this publication, you agree to be bound by the terms of this license.
For attribution, translations, adaptations, and permissions, please read the provisions

This CC license does not apply to non-ADB copyright materials in this publication. If the material is attributed to another source, please contact the copyright owner or publisher of that source for permission to reproduce it.

Please contact pubsmarketing@adb.org if you have questions or comments with respect to content, or if you wish to obtain copyright permission for your intended use that does not fall within these terms, or for permission to use

ABSTRACT

Systematic reviews and meta-analyses (SRMAs) are important tools for evidence synthesis but have historically required substantial manual effort, particularly during the data extraction phase. To address this bottleneck, we developed and evaluated an automated pipeline that utilizes large language models (LLMs) to ingest full-text scientific articles and extract structured metadata. We benchmarked the performance of leading models, including Gemini 2.5 Pro, GPT-5, and Sonnet 4.0, across two distinct domains: mobile health interventions and education. Our results indicate that Gemini 2.5 Pro achieved the strongest performance in qualitative metadata extraction and outcome identification. However, quantitative metadata extraction remained a significant challenge. Models struggled to interpret complex data across multiple tables and failed to

Keywords: evidence synthesis automation, large language models (LLMs), data extraction benchmarking, systematic reviews and meta-analyses (SRMA)

JEL code: C88

1. INTRODUCTION

The rapid progress of large language models (LLMs) has created new opportunities to assist with systematic reviews and meta-analyses (SRMAs).
These reviews remain essential for evidence synthesis across many fields, yet manual approaches require substantial time and expert labor

The SRMA process is generally divided into two tasks: (1) screening and (2) data extraction. Screening involves identifying and evaluating studies according to strict inclusion criteria to filter out irrelevant or poor-quality papers. Data extraction involves converting information from

Recent research has examined the use of LLMs for screening and data extraction. Some configurations, such as GPT-4 with few-shot prompting, can screen titles and abstracts at levels comparable to human reviewers. Extracting detailed information from full articles, however, remains more challenging. Earlier natural language processing tools worked mainly at the phrase

In this paper, we propose and evaluate an end-to-end LLM pipeline designed to conduct automated SRMAs. The system ingests full-text Portable Document Format (PDF) files of scientific articles and outputs structured data files complete with LLM annotations. A key methodological innovation in our approach is the instruction for models to “think” and “reason” prior to generating

To assess general performance, we benchmark these models in two different domains: health and education. We compare their outputs against manual annotations, which serve as the reference standard. We measure accuracy for three tasks: (1) outcome extraction, (2) quali