Enhancing Data Extraction with Generative AI

Information Technology 2025-05-23 EY Zhang Dongxu

Key Points and Challenges

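As data types diversify (structured databases, unstructured text, and multimedia), traditional search and retrieval methods struggle to cope with large volumes of complex data. Generative AI (gen AI) uses language embeddings and source grounding to optimize retrieval strategies, improving performance, speed, and scalability for data extraction. Current challenges include: difficult data extraction (unstructured, inconsistent, large-scale data); LLMs that may hallucinate; cost and speed constraints that limit scalability; and out-of-the-box LLMs and search engines that are hard to configure with suitable parameters.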

Technical Approach and Pipeline

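EY and Elastic collaborated to build an end-to-end solution using gen AI and retrieval strategies. The solution uses a language embedding model (such as the Elastic Learned Sparse EncodeR, ELSER) to convert natural language into vectors, handles multiple data types through a vector store, and combines similarity search (such as k-nearest neighbors, kNN) with a ranking model (such as Reciprocal Rank Fusion, RRF) to improve retrieval precision. The pipeline consists of an embedding model, a vector store, similarity search, and a ranking model, forming an efficient, scalable retrieval ecosystem.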
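The similarity-search and ranking stages described above can be illustrated with a minimal, library-free sketch. It is not the EY or Elastic implementation: the toy three-dimensional vectors, the document ids, the hand-written keyword ranking, and the constant `k=60` (a commonly used RRF default) are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query_vec, store, k=3):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda doc_id: cosine(query_vec, store[doc_id]),
                    reverse=True)
    return ranked[:k]


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical vector store: id -> embedding (toy values, not real embeddings).
store = {
    "scope1": [0.9, 0.1, 0.0],
    "scope2": [0.7, 0.3, 0.1],
    "governance": [0.0, 0.2, 0.9],
}

vector_hits = knn([1.0, 0.0, 0.0], store, k=2)
keyword_hits = ["scope2", "governance"]  # assumed output of a keyword search
fused = rrf([vector_hits, keyword_hits])
print(fused)
```

Because RRF uses only rank positions, it can merge result lists whose raw scores are not comparable, such as vector similarity and keyword relevance.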

Use Cases and Results

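ESG data extraction
The goal is to extract data from banks' annual ESG reports with high accuracy, speed, and scalability. The EY solution combines Elastic RAG with PDF parsing, indexing, and chunking to extract key variables such as Scope 1, 2, and 3 emissions. Compared with naive RAG, Elastic RAG markedly improves both context relevance and accuracy (based on data from five Canadian banks), and response speed improves threefold. Several retrieval methods (such as Elastic RAG with keyword filtering, and hybrid retrieval) maintain high accuracy, which supports the system's robustness and scalability.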
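As a rough illustration of the chunking and keyword-filtering steps mentioned above, the sketch below splits already-extracted report text into overlapping word windows and keeps only chunks that mention Scope 1, 2, or 3. The chunk size, the overlap, the regular expression, and the function names are hypothetical choices, not details from the EY solution, and PDF parsing itself is assumed to have happened upstream.

```python
import re


def chunk_words(text, size=40, overlap=10):
    """Split text into windows of `size` words, consecutive windows sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Assumed pattern: matches "Scope 1", "scope2", "SCOPE 3", and similar.
SCOPE_PATTERN = re.compile(r"\bscope\s*[123]\b", re.IGNORECASE)


def keyword_filter(chunks, pattern=SCOPE_PATTERN):
    """Keep only the chunks that match the pattern."""
    return [chunk for chunk in chunks if pattern.search(chunk)]
```

Overlapping windows reduce the chance that a figure and its label end up in different chunks. The keyword filter narrows the candidates before a more expensive retrieval or LLM step runs.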
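Financial data extraction
The goal is to extract more than 40 financial variables from quarterly reports, addressing the difficulty LLMs have with tabular data. EY uses chain-of-thought prompting and a verification step, combined with vector search and the BM25 algorithm, to improve extraction accuracy. On 2023 Q1 supplementary financial report data, accuracy improves by nearly 24% over a conventional RAG approach, raising the efficiency and quality of financial data analysis.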
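For readers unfamiliar with BM25, the keyword-scoring side of this combination can be sketched in a few lines. This is a plain-Python illustration of the Okapi BM25 formula, not the scoring code used in the EY pipeline or in Elasticsearch. The whitespace tokenization, the parameter values `k1=1.2` and `b=0.75` (common defaults), and the sample strings are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical snippets standing in for report chunks.
docs = [
    "net interest income table",
    "total revenue and net income",
    "branch opening hours",
]
scores = bm25_scores("net income", docs)
```

BM25 rewards exact term matches, which suits table labels and line-item names. Vector search covers paraphrases, so the two ranked lists are typically fused rather than used alone.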

Conclusions

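By combining gen AI with advanced search technology, the accuracy, speed, and scalability of data extraction in financial services improve significantly. The EY and Elastic solution addresses current data challenges and sets a new standard for future ESG and financial data analysis, underscoring the importance of AI-driven strategies in extracting value from data.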

EY and Elastic Collaboration

Abstract

The growing accessibility of diverse types of data, including structured databases, unstructured text, and multimedia, poses significant challenges for organizations that want to derive meaningful insights from complex data. Conventional search and retrieval methods are increasingly inadequate for managing the complexity and immense volume of data today. Let’s take a look at how generative AI (gen AI) can enhance retrieval strategies through language embeddings and source grounding, focusing on optimizing performance, speed, and scalability to effectively address these challenges.

To assess the effectiveness of these gen AI-driven strategies, we’ll explore a critical intersection between financial services and environmental, social, and governance (ESG). We’ll specifically focus on extracting data from unstructured documents, such as banks’ emissions reports and quarterly reports, and constructing a database from these data points that were previously difficult to access, demonstrating the practical applications and benefits of advanced data retrieval in the financial services sector.

Introduction

Data extraction has always been challenging, particularly when dealing with unstructured, inconsistent, and notably large amounts of data. Organizations have often relied on external data providers, which was not only costly but also not always up-to-date or live. Alternatively, organizations had to build their own extraction pipelines, an endeavour that came with its own challenges. But with the advent of gen AI, the entire financial services industry has been disrupted, resulting in a lasting change in the field of data extraction.

Gen AI can autonomously analyze and interpret vast amounts of unstructured data with unprecedented accuracy and speed, using natural language processing and machine learning algorithms. These innovative capabilities include contextual understanding, pattern recognition and the generation of coherent data summaries, which significantly reduce the time and resources required to extract data.

Organizations that have attempted to implement gen AI solutions have quickly encountered new challenges, including:

Large language models (LLMs) may generate hallucinations (responses that are out of context) that result in unreliable outcomes.
Cost and speed constraints can result in limited scalability across extensive source databases.
Out-of-the-box LLMs and search engines are difficult to set up for the most suitable parameters.

Let’s take a look at varying retrieval and language model strategies that can offer innovative information retrieval methods for the financial services sector.

Current state and main challenges

The recent surge in data availability has rendered traditional methods of data extraction and analysis obsolete. These legacy systems, once reliant on manual keyword searches and static queries, struggle when confronted with today’s vast, dynamic, and diverse data streams.

Key challenges in information retrieval include:

Keyword dependency: Limited to exact keyword matches, traditional systems struggle with the nuances of language, failing to capture context and semantic variations.

These challenges highlight the need for a sophisticated solution to data extraction and analysis. Such solutions should be designed to handle the intricacies of language, adapt to evolving data types, and scale in response to the increasing volume and complexity of data.

Gen AI and retrieval strategies

Throughout the pipeline, the technology was developed by EY’s gen AI professionals and enabled by Elasticsearch. For comparison, we’ll compare the efficiency, cost, and speed between EY’s approach and a naïve retrieval pipeline. Due to the distributed systems approach and overall design, EY’s solution showed superior performance alongside Elastic’s technology stack.

The process of data search, storage, and analysis is being revolutionized using advanced retrieval systems enabled by gen AI.
These systems, characterized by their scalability and high levels of performance, excel in real-time processing of various data types, including structured, unstructured text, numerical, and geospatial information. The use of sophisticated domain-specific queries in these systems enables intricate and detailed searches, unlocking profound insights from extensive datasets. These strategies are integral for a wide array of applications including log and event data analysis, full-text searches, security intelligence, business analytics, and operational intelligence.

The pipeline of these retrieval systems is a comprehensive collection of tools that enhance the core functionalities. It combines language embedding models and source groundings, data transformation and storage (including vectors), and data search and retrieval, all within a single ecosystem. It also encompasses tools for data security and provides integration capabilities with other software, including various data sources and LLMs. This integration is particularly valu
