[2025-05-16]: We are excited to launch WildDoc, the first benchmark focusing on document understanding of MLLMs in the wild!
[2025-05-19]: WildDoc is now supported in VLMEvalKit!
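To try WildDoc through VLMEvalKit, you can launch its standard run.py entry point. The snippet below is only a minimal sketch that wraps that command from Python; the dataset identifier "WildDoc" and the model identifier "GPT4o" are assumptions here, so please check the names actually registered in your VLMEvalKit installation.

```python
import subprocess

# Minimal sketch: launch a WildDoc evaluation via VLMEvalKit's run.py.
# The dataset name "WildDoc" and model name "GPT4o" are assumptions;
# replace them with the identifiers registered in your installation.
subprocess.run(
    ["python", "run.py", "--data", "WildDoc", "--model", "GPT4o", "--verbose"],
    check=True,
)
```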
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting
real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs
on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding.
Our main contributions are summarized as follows:
(1) We establish WildDoc, a benchmark designed to systematically evaluate the document understanding ability of existing MLLMs, which provides the community with fresh insights on document understanding in the real world.
(2) To thoroughly evaluate existing models, we further propose a new robustness metric, which evaluates whether a model can consistently handle the same document under varying real-world conditions (a minimal sketch of this idea is given below the contribution list).
(3) We benchmark numerous MLLMs on WildDoc, revealing significant potential for improvement in robust document understanding.
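As a rough illustration of the consistency idea, the sketch below assumes each question is paired with the model's correctness on the four real-world captures of its source document, and counts a question as consistent only when all four captures are answered correctly. The function name and data layout are hypothetical and not part of the released evaluation code; see the paper for the exact formulation.

```python
from typing import Dict, List

def consistency_score(results: Dict[str, List[bool]]) -> float:
    """Hypothetical sketch: `results` maps each question id to the
    model's per-capture correctness (one boolean per real-world
    capture of the same document). A question counts as consistent
    only if it is answered correctly on every capture; the score is
    the percentage of consistent questions."""
    if not results:
        return 0.0
    consistent = sum(1 for per_capture in results.values() if all(per_capture))
    return 100.0 * consistent / len(results)

# Toy usage: two questions, each evaluated on four captures.
example = {
    "q1": [True, True, True, True],   # correct on all captures -> consistent
    "q2": [True, False, True, True],  # fails on one capture -> not consistent
}
print(consistency_score(example))  # 50.0
```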
We establish WildDoc, a benchmark focused on document understanding in the real world. All images are manually captured in real-world settings, and to cover a wide range of scenarios encountered in daily life, we consider five key factors: environment, illumination, view, distortion, and effect.
Examples of real-world document images captured under different conditions.
Based on WildDoc, we conduct experiments to evaluate numerous representative MLLMs, including general MLLMs (e.g., Qwen2.5-VL) and leading closed-source MLLMs (e.g., GPT-4o). The experimental results demonstrate that: (1) Existing MLLMs exhibit a large performance decline on WildDoc compared to traditional document benchmarks, with models like GPT-4o showing an average performance decrease of 35.3. (2) Existing MLLMs demonstrate inadequate robustness in document understanding, as evidenced by their low consistency scores, with Qwen2.5-VL-72B achieving the highest score of 49.7. (3) Some models show minimal performance variation and tend to saturate on the original benchmarks, yet they experience significant performance declines and disparities on WildDoc.
The performance of several state-of-the-art open-source and closed-source MLLMs. All models suffer a decline on all three subsets; for example, GPT-4o drops by 28.3, 56.4, and 21.3 on the three subsets, respectively. These results indicate that current MLLMs have not yet achieved satisfactory levels of document understanding capability when handling real-world scenarios. For performance comparison on WildDoc, we recommend the "Consistency" metric.
We provide further analysis of the different real-world factors. The results reveal a substantial performance degradation of MLLMs when facing documents affected by common real-world distortions such as wrinkles, bends, and creases.
@misc{wang2025wilddoc,
title={WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?},
author={An-Lan Wang and Jingqun Tang and Liao Lei and Hao Feng and Qi Liu and Xiang Fei and Jinghui Lu and Han Wang and Weiwei Liu and Hao Liu and Yuliang Liu and Xiang Bai and Can Huang},
year={2025},
eprint={2505.11015},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11015},
}