Pei Fu¹, Tongkun Guan², Zining Wang¹, Zhentao Guo³, Chen Duan¹, Hao Sun⁴, Boming Chen¹, Jiayao Ma¹, Qianyi Jiang¹, Kai Zhou¹, Junfeng Luo¹
¹Meituan, ²Shanghai Jiao Tong University, ³Beijing Institute of Technology, ⁴MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences
{fupei,duanchen02,wangzining03,chenboming,majiayao02}@meituan.com
{jiangqianyi02,zhoukai03,luojunfeng}@meituan.com
gtk0615@sjtu.edu.cn, hao.sun@cripac.ia.ac.cn, zt_guo1230@163.com

arXiv:2502.16586v1 [cs.CV] 23 Feb 2025

Abstract

The recent emergence of Multi-modal Large Language Models (MLLMs) has introduced a new dimension to the Text-rich Image Understanding (TIU) field, with models demonstrating impressive and inspiring performance. However, their rapid evolution and widespread adoption have made it increasingly challenging to keep up with the latest advancements. To address this, we present a systematic and comprehensive survey to facilitate further research on TIU MLLMs. Initially, we outline the timeline, architecture, and pipeline of nearly all TIU MLLMs. Then, we review the performance of selected models on mainstream benchmarks. Finally, we explore promising directions, challenges, and limitations within the field.

1 Introduction

Text-rich images play a pivotal role in real-world scenarios by efficiently conveying complex information and improving accessibility (Biten et al., 2019). Accurately interpreting these images is essential for automating information extraction, advancing AI systems, and optimizing user interactions. To formalize this research domain, we term it Text-rich Image Understanding (TIU), which encompasses two core capabilities: perception and understanding. The perception dimension focuses on visual recognition tasks, such as text detection (Liao et al., 2022), text recognition (Guan et al., 2025), formula recognition (Truong et al., 2024; Guan et al., 2024a), and document layout analysis (Yupan et al., 2022). The understanding dimension, conversely, requires semantic reasoning for applications like key information extraction and document-based visual question answering (e.g., DocVQA (Mathew et al., 2021b), ChartQA (Masry et al., 2022), and TextVQA (Singh et al., 2019)).

Historically, perception and understanding tasks were handled separately through specialized models or multi-stage pipelines. Recent advances in vision-language models have unified these tasks within Visual Question Answering (VQA) paradigms, driving research towards the development of end-to-end universal models.

Figure 1 presents an evolutionary timeline delineating critical milestones in unified text-rich image understanding models. The trajectory reveals two distinct eras: (a) The pre-LLM period (2019–2022) was characterized by specialized architectures like LayoutLM (Xu et al., 2019) and Donut (Kim et al., 2021), which employed modality-specific pre-training objectives (masked language modeling, masked image modeling, etc.) coupled with OCR-derived supervision (text recognition, spatial order recovery, etc.). While effective in controlled settings, these models exhibited limited adaptability to open-domain scenarios due to their task-specific fine-tuning requirements and constrained cross-modal interaction mechanisms. (b) The post-LLM era (2023–present) is marked by the growing popularity of LLMs. Some studies propose Multi-modal Large Language Models (MLLMs), which integrate LLMs with visual encoders to jointly process visual tokens and linguistic elements through unified attention mechanisms, achieving end-to-end sequence modeling.

This paradigm evolution addresses two critical limitations of earlier methods. First, the emergent MLLM framework eliminates modality-specific inductive biases through homogeneous token representation, enabling seamless multi-task integration. Second, the linguistic priors encoded in LLMs empower unprecedented zero-shot generalization and allow direct application to diverse tasks without task-specific tuning.

In this survey, we review TIU MLLMs in four dimensions: Model Architectures (Section 2), Training Pipeline (Section 3), Datasets and Benchmarks (Section 4), and Challenges and Trends (Section 5). This holds both academic and practical significance for advancing the field.

2 Model Architecture

TIU MLLM methods typically leverage pre-trained general visual foundation models to extract robust visual features or employ OCR engines to capture text and layout information from images. A modality connector is then used to align these visual features with the semantic space of the language features from the LLM. Finally, the combined visual-language features are fed into the LLM, which utilizes its powerful comprehension capabilities for semantic reasoning to generate the final answer. As illustrated in Figure 2, the framework of TIU MLLMs can be abstracted into three core components: Visual Encoder, Modality Connector, and LLM Decoder.

2.1 Visual Encoder

The Visual Encoder $F(\cdot)$ is responsible for transforming the input image $I$ into feature representations $V$, expressed as $V = F(I)$.
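
The three-component framework described in Section 2 (Visual Encoder, Modality Connector, LLM Decoder) can be illustrated with a minimal, pure-Python sketch of the data flow. Everything here is an assumption made for illustration: the function names, the patch-and-project encoder, the single linear projection used as the connector, and the tiny dimensions are hypothetical stand-ins, not the design of any surveyed model. The LLM decoder itself is not implemented; the sketch stops at the token sequence such a decoder would consume.

```python
import random


def matmul(rows, weights):
    """Multiply a list of row vectors by a weight matrix given as a list of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in rows]


def random_matrix(n_in, n_out, rng):
    """Fixed random weights standing in for pre-trained parameters (assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]


def visual_encoder(image, patch, w_enc):
    """Toy F(.): cut the image into non-overlapping patches, flatten, project linearly."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return matmul(patches, w_enc)  # V: one feature vector per patch


def modality_connector(visual_features, w_proj):
    """Toy connector: linear projection into the LLM embedding width (assumption)."""
    return matmul(visual_features, w_proj)


def build_llm_input(visual_tokens, text_embeddings):
    """Prepend visual tokens to text embeddings; a real LLM decoder would consume this."""
    return visual_tokens + text_embeddings


rng = random.Random(0)
PATCH, D_VIS, D_LLM = 2, 3, 5                      # hypothetical sizes

image = [[rng.random() for _ in range(4)] for _ in range(4)]          # 4x4 image I
w_enc = random_matrix(PATCH * PATCH, D_VIS, rng)
w_proj = random_matrix(D_VIS, D_LLM, rng)
question = [[rng.random() for _ in range(D_LLM)] for _ in range(6)]   # 6 text tokens

V = visual_encoder(image, PATCH, w_enc)            # V = F(I)
visual_tokens = modality_connector(V, w_proj)      # aligned to the LLM width
llm_input = build_llm_input(visual_tokens, question)

print(len(V), len(V[0]), len(llm_input), len(llm_input[0]))  # prints: 4 3 10 5
```

The point of the sketch is the shape bookkeeping: the encoder yields one feature vector per image patch, the connector changes only the feature width so that visual tokens match the text embedding width, and the decoder then sees a single homogeneous token sequence.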