
Automating Evidence Synthesis: A Comparative Evaluation of Large Language Models for Data Extraction (in English)

Machinery and Equipment 2026-05-01 Asian Development Bank (ADB) 董亚琴

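This study evaluates the ability of large language models (LLMs) to automatically extract structured metadata from full-text scientific articles, with the aim of accelerating systematic reviews and meta-analyses (SRMAs). It compares the performance of several leading models, including Gemini 2.5 Pro, GPT-5, and Sonnet 4.0, across two domains: mobile health and education. The results show that LLMs perform well at extracting qualitative metadata and identifying outcomes, with Gemini 2.5 Pro performing best. Extracting quantitative metadata (such as means, standard deviations, and confidence intervals) remains a major challenge, however: the models struggle to interpret complex tabular data and to carry out the necessary calculations. The study also finds that human annotators applied implicit screening criteria that were not documented in the codebook, which biased the benchmark results. It stresses that although LLMs can speed up the coding process, reliable automation requires a highly standardized codebook that strictly guides model behavior and allows fair benchmarking. The overall conclusion is that LLMs are a powerful tool for scaling SRMA capacity, but that quantitative data extraction still requires human review.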
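To illustrate the kind of calculation that quantitative extraction can involve, the sketch below recovers a standard deviation from a reported confidence interval around a mean. It is not taken from the paper: the function name `sd_from_ci`, the large-sample normal approximation (z = 1.96 for a 95% interval), and the numbers are all assumptions chosen for illustration.

```python
import math


def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """Approximate a sample SD from a confidence interval around a mean.

    Assumes a symmetric, normal-approximation interval (hypothetical helper,
    not the paper's method): CI = mean +/- z * SD / sqrt(n).
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if upper < lower:
        raise ValueError("upper bound must not be below lower bound")
    # Half the interval width divided by z gives the standard error.
    standard_error = (upper - lower) / (2 * z)
    # SD = SE * sqrt(n)
    return standard_error * math.sqrt(n)


# Hypothetical study arm: 95% CI of [4.0, 6.0] with 100 participants.
sd = sd_from_ci(lower=4.0, upper=6.0, n=100)
print(round(sd, 3))  # 5.102
```

For small samples a t-distribution quantile would be more appropriate than 1.96. That is one of the judgment calls that makes this step hard to automate without an explicit codebook rule.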

Aditya Retnanto, Yohan Iddawela, and Elaine S. Tan

ADB Economics Working Paper Series

Automating Evidence Synthesis: A Comparative Evaluation

Aditya Retnanto (aretnanto.consultant@adb.org) is a consultant, Yohan Iddawela (yiddawela@adb.org) is an economist (data science), and Elaine S. Tan (estan@adb.org) is the director of the Data Division, Economic Research and Development Impact [...]

No. 845 | May 2026

The ADB Economics Working Paper Series presents research in progress to elicit comments and encourage debate on development issues in Asia and the Pacific. The views expressed are those of the authors and do not necessarily [...]

© 2026 Asian Development Bank
6 ADB Avenue, Mandaluyong City, 1550 Metro Manila, Philippines
Tel +63 2 8632 4444; Fax +63 2 8636 2444

Some rights reserved. Published in 2026.

ISSN 2313-6537 (print), 2313-6545 (PDF)
Publication Stock No. WPS260200-2
DOI: http://dx.doi.org/10.22617/WPS260200-2

The views expressed in this publication are those of the authors and do not necessarily reflect the views and policies of the Asian Development Bank (ADB) or its Board of Governors or the governments they represent.

ADB does not guarantee the accuracy of the data included in this publication and accepts no responsibility for any consequence of their use. The mention of specific companies or products of manufacturers does not imply that they [...]

By making any designation of or reference to a particular territory or geographic area in this document, ADB does not intend to make any judgments as to the legal or other status of any territory or area.

This publication is available under the Creative Commons Attribution 3.0 IGO license (CC BY 3.0 IGO) https://creativecommons.org/licenses/by/3.0/igo/. By using the content of this publication, you agree to be bound by the terms of this license.
For attribution, translations, adaptations, and permissions, please read the provisions [...]

This CC license does not apply to non-ADB copyright materials in this publication. If the material is attributed to another source, please contact the copyright owner or publisher of that source for permission to reproduce it.

Please contact pubsmarketing@adb.org if you have questions or comments with respect to content, or if you wish to obtain copyright permission for your intended use that does not fall within these terms, or for permission to use [...]

ABSTRACT

Systematic reviews and meta-analyses (SRMAs) are important tools for evidence synthesis but have historically required substantial manual effort, particularly during the data extraction phase. To address this bottleneck, we developed and evaluated an automated pipeline that utilizes large language models (LLMs) to ingest full text scientific articles and extract structured metadata. We benchmarked the performance of leading models, including Gemini 2.5 Pro, GPT-5, and Sonnet 4.0, across two distinct domains: mobile health interventions and education. Our results indicate that Gemini 2.5 Pro achieved the strongest performance in qualitative metadata extraction and outcome identification. However, quantitative metadata extraction remained a significant challenge. Models struggled to interpret complex data across multiple tables and failed to [...]

Keywords: evidence synthesis automation, large language models (LLMs), data extraction benchmarking, systematic reviews and meta-analyses (SRMA)

JEL code: C88

1. INTRODUCTION

The rapid progress of large language models (LLMs) has created new opportunities to assist with systematic reviews and meta-analyses (SRMAs).
These reviews remain essential for evidence synthesis across many fields, yet manual approaches require substantial time and expert labor [...]

The SRMA process is generally divided into two tasks: (1) screening and (2) data extraction. Screening involves identifying and evaluating studies according to strict inclusion criteria to reduce irrelevant or poor quality papers. Data extraction involves converting information from [...]

Recent research has examined the use of LLMs for screening and data extraction. Some configurations, such as GPT-4 with few-shot prompting, can screen titles and abstracts at levels comparable to human reviewers. Extracting detailed information from full articles, however, remains more challenging. Earlier natural language processing tools worked mainly at the phrase [...]

In this paper, we propose and evaluate an end-to-end LLM pipeline designed to conduct automated SRMAs. The system ingests full text portable document formats (PDFs) of scientific articles and outputs structured data files complete with LLM annotations. A key methodological innovation in our approach is the instruction for models to “think” and “reason” prior to generating [...]

To assess general performance, we benchmark these models in two different domains: health and education. We compare their outputs against manual annotations, which serve as the reference standard. We measure accuracy for three tasks: (1) outcome extraction, (2) quali[...]
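The comparison of model outputs against manual reference annotations can be sketched as a per-field accuracy score. The excerpt does not state how the paper scores a match, so exact matching after light text normalization is an assumption here, as are the function names `normalize` and `field_accuracy`, the field names, and the sample records.

```python
from typing import Dict, List


def normalize(value: object) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(str(value).strip().lower().split())


def field_accuracy(
    predicted: List[Dict[str, object]],
    reference: List[Dict[str, object]],
) -> Dict[str, float]:
    """Share of records where the model's value equals the reference, per field.

    Hypothetical scoring rule: exact match after normalization. Records are
    paired by position, and a field missing from a prediction counts as wrong.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    fields = sorted({name for record in reference for name in record})
    scores: Dict[str, float] = {}
    for name in fields:
        hits = 0
        total = 0
        for pred, ref in zip(predicted, reference):
            if name not in ref:
                continue  # the annotators did not code this field for this study
            total += 1
            if name in pred and normalize(pred[name]) == normalize(ref[name]):
                hits += 1
        scores[name] = hits / total
    return scores


# Made-up records for two studies (illustrative only).
reference = [
    {"country": "India", "design": "RCT"},
    {"country": "Kenya", "design": "Quasi-experimental"},
]
predicted = [
    {"country": "india ", "design": "RCT"},
    {"country": "Kenya", "design": "RCT"},
]
print(field_accuracy(predicted, reference))  # {'country': 1.0, 'design': 0.5}
```

Exact matching suits categorical fields. Numeric fields such as means or standard deviations would more plausibly need a tolerance-based comparison, which this sketch does not attempt.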
