Enhancing Data Extraction with Generative AI

Information Technology · 2025-05-23 · EY (安永) · Zhang Dongxu (张东旭)

EY and Elastic Collaboration

Abstract

The growing accessibility of diverse types of data, including structured databases, unstructured text, and multimedia, poses significant challenges for organizations that want to derive meaningful insights from complex data. Conventional search and retrieval methods are increasingly inadequate for managing the complexity and immense volume of data today. Let’s take a look at how generative AI (gen AI) can enhance retrieval strategies through language embeddings and source grounding, focusing on optimizing performance, speed, and scalability to effectively address these challenges.

To assess the effectiveness of these gen AI-driven strategies, we’ll explore a critical intersection between financial services and environmental, social, and governance (ESG). We’ll specifically focus on extracting data from unstructured documents, such as banks’ emissions reports and quarterly reports, and constructing a database from these data points that were previously difficult to access, demonstrating the practical applications and benefits of advanced data retrieval in the financial services sector.

Introduction

Data extraction has always been challenging, particularly when dealing with unstructured, inconsistent, and notably large amounts of data. Organizations have often relied on external data providers, which was costly and did not always deliver up-to-date or live data. Alternatively, organizations had to build their own extraction pipelines, an endeavor that came with its own challenges. But with the advent of gen AI, the entire financial services industry has been disrupted, resulting in a lasting change in the field of data extraction.

Gen AI can autonomously analyze and interpret vast amounts of unstructured data with unprecedented accuracy and speed, using natural language processing and machine learning algorithms. These innovative capabilities include contextual understanding, pattern recognition, and the generation of coherent data summaries, which significantly reduce the time and resources required to extract data.

Organizations that have attempted to implement gen AI solutions have quickly encountered new challenges, including:

- Large language models (LLMs) may generate hallucinations (responses that are out of context) that result in unreliable outcomes.
- Cost and speed constraints can result in limited scalability across extensive source databases.
- Out-of-the-box LLMs and search engines are difficult to configure with the most suitable parameters.

Let’s take a look at varying retrieval and language model strategies that can offer innovative information retrieval methods for the financial services sector.

Current state and main challenges

The recent surge in data availability has rendered traditional methods of data extraction and analysis obsolete. These legacy systems, once reliant on manual keyword searches and static queries, struggle when confronted with today’s vast, dynamic, and diverse data streams.

Key challenges in information retrieval include:

- Keyword dependency: Limited to exact keyword matches, traditional systems struggle with the nuances of language, failing to capture context and semantic variations.

These challenges highlight the need for a sophisticated solution to data extraction and analysis. Such solutions should be designed to handle the intricacies of language, adapt to evolving data types, and scale in response to the increasing volume and complexity of data.

Gen AI and retrieval strategies

The technology throughout the pipeline was developed by EY’s gen AI professionals and enabled by Elasticsearch¹. For comparison, we’ll evaluate the efficiency, cost, and speed of EY’s approach against a naïve retrieval pipeline. Owing to its distributed systems approach and overall design, EY’s solution showed superior performance alongside Elastic’s technology stack.

The process of data search, storage, and analysis is being revolutionized using advanced retrieval systems enabled by gen AI.
These systems, characterized by their scalability and high levels of performance, excel in real-time processing of various data types, including structured, unstructured text, numerical, and geospatial information. The use of sophisticated domain-specific queries in these systems enables intricate and detailed searches, unlocking profound insights from extensive datasets. These strategies are integral to a wide array of applications, including log and event data analysis, full-text searches, security intelligence, business analytics, and operational intelligence.

The pipeline of these retrieval systems is a comprehensive collection of tools that enhance the core functionalities. It combines language embedding models and source grounding, data transformation and storage (including vectors), and data search and retrieval, all within a single ecosystem. It also encompasses tools for data security and provides integration capabilities with other software, including various data sources and LLMs. This integration is particularly valu