
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review

Information Technology 2025-02-23 Pei Fu, Tongkun Guan, Zining Wang, Zhentao Guo, Chen Duan, Hao Sun, Boming Chen, Jiayao Ma, Qianyi Jiang, Kai Zhou, Junfeng Luo Meituan

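Survey of Multimodal Large Language Models for Text-rich Image Understanding

This paper presents a systematic and comprehensive survey of multimodal large language models (MLLMs) for text-rich image understanding (TIU), covering model architectures, training pipelines, datasets and benchmarks, and challenges and trends.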

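Model Architecture

TIU MLLMs typically comprise three core components: a visual encoder, a modality connector, and an LLM decoder.

Visual encoder: three approaches exist, namely OCR-free, OCR-based, and hybrid. OCR-free encoders such as CLIP, ConvNeXt, and SAM extract high-level visual features; OCR-based encoders such as BLIP-2 and LayoutLMv3 capture text and layout information through OCR; hybrid approaches combine the strengths of both.
Modality connector: aligns visual features with the language features of the LLM. Common strategies include linear projection, multilayer perceptrons (MLP), cross-attention, H-Reducer, C/D-Abstractor, and attention pooling.
LLM decoder: uses the LLM's strong comprehension ability for semantic reasoning to generate the final answer. Commonly used LLMs include LLaMA, Qwen, Vicuna, and InternLM.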
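
The three-component flow above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation of any model named in the survey: the helper names (`init_linear`, `mlp_connector`, `build_llm_input`), the tiny dimensions, the ReLU activation, and the random weights are all invented for the example. It shows a two-layer MLP connector projecting visual tokens into the LLM embedding width, after which the projected tokens are prepended to the text embeddings.

```python
import random


def linear_rows(rows, weight, bias):
    """Apply a linear layer to every token vector: out = row @ weight + bias."""
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) + bias[j]
         for j in range(len(bias))]
        for row in rows
    ]


def relu_rows(rows):
    """Element-wise ReLU (an assumed activation; real connectors may differ)."""
    return [[max(0.0, x) for x in row] for row in rows]


def init_linear(in_dim, out_dim, rng):
    """Hypothetical initializer: small random weights and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
              for _ in range(in_dim)]
    return weight, [0.0] * out_dim


def mlp_connector(visual_tokens, params):
    """Two-layer MLP that maps vision-width tokens to LLM-width tokens."""
    (w1, b1), (w2, b2) = params
    hidden = relu_rows(linear_rows(visual_tokens, w1, b1))
    return linear_rows(hidden, w2, b2)


def build_llm_input(visual_tokens, text_embeddings, params):
    """Prepend the projected visual tokens to the text embeddings."""
    return mlp_connector(visual_tokens, params) + text_embeddings


rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 6, 8, 4  # toy sizes, chosen arbitrarily
params = (init_linear(VISION_DIM, HIDDEN_DIM, rng),
          init_linear(HIDDEN_DIM, LLM_DIM, rng))
visual_tokens = [[rng.random() for _ in range(VISION_DIM)] for _ in range(5)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

llm_input = build_llm_input(visual_tokens, text_embeddings, params)
print(len(llm_input), len(llm_input[0]))  # 8 4
```

The sequence handed to the decoder has 5 visual plus 3 text tokens, all of the same width, which is the property a connector has to guarantee regardless of which strategy (projection, cross-attention, pooling) is used.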

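Training Pipeline

MLLM training proceeds in three stages: modality alignment, instruction alignment, and preference alignment.

Modality alignment: pre-training on OCR data under the supervision of traditional OCR tasks, with the aim of bridging the modality gap. Common alignment methods fall into three types: recognition, localization, and parsing.
Instruction alignment: supervised fine-tuning (SFT) improves the MLLM's multimodal perception and cross-modal reasoning, strengthens robustness to diverse instructions, and enables zero-shot generalization to unseen task scenarios. The main methods include visual-semantic anchoring, prompt diversity augmentation, and zero-shot generalization.
Preference alignment: preference alignment techniques optimize model outputs so that they better match human values and expectations. For example, InternVL2-MPO introduces a Mixed Preference Optimization (MPO) strategy that effectively improves multimodal reasoning.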
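
To make the preference-alignment stage concrete, the sketch below computes a Direct Preference Optimization (DPO) loss for one chosen/rejected response pair. DPO is used here as a widely known stand-in for a preference objective; the summary does not spell out the MPO formulation, so this should not be read as InternVL2-MPO's actual loss. The function name, the `beta` value, and the log-probabilities are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are summed sequence log-probabilities under the trainable policy
    and under a frozen reference model. beta scales how strongly the policy
    is pushed away from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Made-up log-probabilities: first the policy equals the reference, then the
# policy has shifted probability mass toward the preferred answer.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
print(improved < neutral)  # True
```

When the policy matches the reference the loss equals log 2; it falls as the policy raises the likelihood of the preferred response relative to the rejected one.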

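Datasets and Benchmarks

The rapid progress of TIU tasks owes much to the emergence of numerous specialized datasets and standardized benchmarks. Datasets fall into two types: domain-specific (documents, charts, scenes, tables, and GUIs) and comprehensive-scenario. Commonly used benchmarks include DocVQA, InfoVQA, ChartQA, and TextVQA.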
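
As an illustration of how such benchmarks are scored, the sketch below implements Average Normalized Levenshtein Similarity (ANLS), the metric commonly reported for DocVQA and InfoVQA, with the usual threshold of 0.5. The summary itself does not discuss metrics, so this is supplementary; the function names and the lowercase/strip normalization are assumptions of this sketch rather than an official evaluation script.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any answer."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            a = answer.strip().lower()
            longest = max(len(p), len(a))
            distance = levenshtein(p, a) / longest if longest else 0.0
            # Answers that are too far off score zero instead of partial credit.
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["Paris"], [["paris"]]))   # 1.0
print(anls(["london"], [["paris"]]))  # 0.0
```

The threshold lets minor OCR-style slips earn partial credit while preventing unrelated answers from accumulating score.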

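Challenges and Trends

Although current MLLMs have made notable progress, several challenges remain:

Long-document understanding: MLLMs perform well on single-page documents, but their performance on multi-page or long-document tasks still needs improvement.
Computational efficiency and model compression: current MLLMs are computationally demanding, so more efficient architectures are needed to balance performance against compute cost.
Multilingual document understanding: existing MLLMs are optimized mainly for English and other high-resource languages and underperform in multilingual and low-resource settings.

Future research directions include:

Optimizing visual feature representation: developing efficient visual encoders, adaptive token compression mechanisms, and advanced cross-modal feature fusion techniques.
Improving long-document understanding: advancing research through long-document understanding benchmarks such as MMLongBench-Doc.
Strengthening multilingual document understanding: building comprehensive multilingual datasets that cover diverse languages and cultural backgrounds, and applying cross-lingual transfer learning to improve performance in multilingual settings.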
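
A simple way to see what visual token compression buys is to average-pool neighbouring tokens on the patch grid. The sketch below uses a fixed 2x2 window, so it is a baseline rather than the adaptive mechanism the survey calls for; the function name `pool_tokens` and the toy grid are assumptions for illustration.

```python
def pool_tokens(grid, window=2):
    """Average-pool a rows x cols grid of token vectors with a square window.

    Edge windows that run past the grid are averaged over the tokens that
    actually exist, so grid sides need not be multiples of the window.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, window):
        pooled_row = []
        for c in range(0, cols, window):
            block = [grid[rr][cc]
                     for rr in range(r, min(r + window, rows))
                     for cc in range(c, min(c + window, cols))]
            pooled_row.append(
                [sum(token[d] for token in block) / len(block)
                 for d in range(dim)]
            )
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 grid whose tokens simply store their own (row, col) position.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pooled = pool_tokens(grid, window=2)
print(len(grid) * len(grid[0]), "->", len(pooled) * len(pooled[0]))  # 16 -> 4
```

A 2x2 window cuts the number of visual tokens passed to the LLM by a factor of four, trading fine spatial detail, which matters for small text, for a shorter sequence.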

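Conclusion

This paper systematically analyzes the state of research on TIU MLLMs and identifies directions for future work. As the technology matures, MLLMs are expected to play an increasingly important role in document understanding, chart interpretation, and natural scene text understanding.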

Pei Fu¹, Tongkun Guan², Zining Wang¹, Zhentao Guo³, Chen Duan¹, Hao Sun⁴, Boming Chen¹, Jiayao Ma¹, Qianyi Jiang¹, Kai Zhou¹, Junfeng Luo¹
¹Meituan, ²Shanghai Jiao Tong University, ³Beijing Institute of Technology, ⁴MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences
{fupei,duanchen02,wangzining03,chenboming,majiayao02}@meituan.com, {jiangqianyi02,zhoukai03,luojunfeng}@meituan.com, gtk0615@sjtu.edu.cn, hao.sun@cripac.ia.ac.cn, zt_guo1230@163.com
arXiv:2502.16586v1 [cs.CV] 23 Feb 2025

Abstract

The recent emergence of Multi-modal Large Language Models (MLLMs) has introduced a new dimension to the Text-rich Image Understanding (TIU) field, with models demonstrating impressive and inspiring performance. However, their rapid evolution and widespread adoption have made it increasingly challenging to keep up with the latest advancements. To address this, we present a systematic and comprehensive survey to facilitate further research on TIU MLLMs. Initially, we outline the timeline, architecture, and pipeline of nearly all TIU MLLMs. Then, we review the performance of selected models on mainstream benchmarks. Finally, we explore promising directions, challenges, and limitations within the field.

1 Introduction

Text-rich images play a pivotal role in real-world scenarios by efficiently conveying complex information and improving accessibility (Biten et al., 2019). Accurately interpreting these images is essential for automating information extraction, advancing AI systems, and optimizing user interactions. To formalize this research domain, we term it Text-rich Image Understanding (TIU), which encompasses two core capabilities: perception and understanding. The perception dimension focuses on visual recognition tasks, such as text detection (Liao et al., 2022), text recognition (Guan et al., 2025), formula recognition (Truong et al., 2024; Guan et al., 2024a), and document layout analysis (Yupan et al., 2022). The understanding dimension, conversely, requires semantic reasoning for applications like key information extraction and document-based visual question answering (e.g., DocVQA (Mathew et al., 2021b), ChartQA (Masry et al., 2022), and TextVQA (Singh et al., 2019)).

Historically, perception and understanding tasks were handled separately through specialized models or multi-stage pipelines. Recent advances in vision-language models have unified these tasks within Visual Question Answering (VQA) paradigms, driving research towards the development of end-to-end universal models.

Figure 1 presents an evolutionary timeline delineating critical milestones in unified text-rich image understanding models. The trajectory reveals two distinct eras: (a) The pre-LLM period (2019-2022) characterized by specialized architectures like LayoutLM (Xu et al., 2019) and Donut (Kim et al., 2021), which employed modality-specific pre-training objectives (masked language modeling, masked image modeling, etc.) coupled with OCR-derived supervision (text recognition, spatial order recovery, etc.). While effective in controlled settings, these models exhibited limited adaptability to open-domain scenarios due to their task-specific fine-tuning requirements and constrained cross-modal interaction mechanisms. (b) The post-LLM era (2023-present) is marked by the growing popularity of LLMs. Some studies propose Multi-modal Large Language Models (MLLMs), which integrate LLM with visual encoders to jointly process visual tokens and linguistic elements through unified attention mechanisms, achieving end-to-end sequence modeling.

This paradigm evolution addresses two critical limitations of earlier methods. First, the emergent MLLM framework eliminates modality-specific inductive biases through homogeneous token representation, enabling seamless multi-task integration. Second, the linguistic priors encoded in LLMs empower unprecedented zero-shot generalization and allow direct application to diverse tasks without task-specific tuning.

[...] all TIU MLLMs in four dimensions: Model Architectures (Section 2), Training Pipeline (Section 3), Datasets and Benchmarks (Section 4), Challenges and Trends (Section 5). This holds both academic and practical significance for advancing the field.

2 Model Architecture

TIU MLLM methods typically leverage pre-trained general visual foundation models to extract robust visual features or employ OCR engines to capture text and layout information from images. A modality connector is then used to align these visual features with the semantic space of the language features from the LLM. Finally, the combined visual-language features are fed into the LLM, which utilizes its powerful comprehension capabilities for semantic reasoning to generate the final answer. As illustrated in Figure 2, the framework of TIU MLLMs can be abstracted into three core components: Visual Encoder, Modality Connector, and LLM Decoder.

2.1 Visual Encoder

The Visual Encoder $F(\cdot)$ is responsible for transforming the input image $I$ into feature representations $V$, expressed as $V = F(I)$.
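
The mapping from an image I to features V can be illustrated with the first step of a ViT-style encoder: cutting the image into fixed-size patches and flattening each patch into a token. This is a toy stand-in for F under stated assumptions (a grayscale image stored as nested lists, a hypothetical `patchify` helper, no learned layers); real encoders such as CLIP apply a learned embedding and transformer layers on top.

```python
def patchify(image, patch_size):
    """Split an H x W image into non-overlapping patches, one flat token each.

    Tokens are emitted in row-major order over the patch grid, and pixels
    inside a patch are flattened row by row.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch_size)
                           for dx in range(patch_size)])
    return tokens


# Toy 4x4 "image" whose pixel values are their own row-major indices.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
V = patchify(image, patch_size=2)
print(len(V), V[0])  # 4 [0, 1, 4, 5]
```

Smaller patches yield more tokens and preserve finer text strokes, which is one reason text-rich inputs put pressure on token budgets.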
