
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

An-Lan Wang*,1, Jingqun Tang*,1, Liao Lei*,1, Hao Feng1, Qi Liu1, Xiang Fei1, Jinghui Lu1, Han Wang1, Weiwei Liu1, Hao Liu1, Yuliang Liu2, Xiang Bai2, Can Huang*†,1

1ByteDance Inc., 2Huazhong University of Science and Technology,

*Core Contributors
†Corresponding authors: wanganlan@bytedance.com, tangjingqun@bytedance.com

Comparison of WildDoc with existing benchmarks for document understanding, highlighting the predominance of scanned or digital document images in current benchmarks versus the real-world captured document images in WildDoc.


(a) For every document, we manually capture four images under different setups. (b) Several representative examples that encompass different real-world conditions.

πŸ”” News

πŸš€[2025-05-16]: We are excited to launch WildDoc, the first benchmark focusing on document understanding of MLLMs in the wild!🌟
πŸš€[2025-05-19]: WildDoc is now supported in VLMEvalKit (see the evaluation sketch below)!🌟
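
For reference, a minimal sketch of running an evaluation through VLMEvalKit's run.py entry point. The dataset key "WildDoc" and the model key "Qwen2.5-VL-7B-Instruct" are assumptions; check your installed VLMEvalKit version for the exact supported names.

    # Hedged sketch: evaluate a model on WildDoc via VLMEvalKit.
    # Run from the VLMEvalKit repository root; the dataset and model
    # keys below are assumptions, not confirmed identifiers.
    import subprocess

    subprocess.run(
        [
            "python", "run.py",
            "--data", "WildDoc",                  # assumed dataset key
            "--model", "Qwen2.5-VL-7B-Instruct",  # assumed model key
            "--verbose",
        ],
        check=True,
    )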

Introduction

This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions. Evaluations of state-of-the-art MLLMs on WildDoc expose substantial performance declines and underscore the models' inadequate robustness compared to traditional benchmarks, highlighting the unique challenges posed by real-world document understanding.
Our main contributions are summarized as follows:
(1) We establish WildDoc, a benchmark designed to systematically evaluate the document understanding ability of existing MLLMs, which provides the community with fresh insights on document understanding in the real world.
(2) To thoroughly evaluate existing models, we further propose a new robustness metric, which evaluates whether a model can consistently handle varying real-world conditions (a sketch follows this list).
(3) We benchmark numerous MLLMs on WildDoc, revealing significant potential for improvement in robust document understanding.
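
As an illustration of the robustness metric, here is a minimal sketch of a consistency score, under the assumption that each question is evaluated on all four captures of the same document and counts as consistent only if every capture is answered correctly. The function and record layout are hypothetical, not the paper's implementation.

    from collections import defaultdict

    def consistency_score(records):
        """records: (question_id, condition, is_correct) triples,
        one per capture condition for each question (hypothetical layout)."""
        per_question = defaultdict(list)
        for qid, _condition, is_correct in records:
            per_question[qid].append(is_correct)
        # A question counts as consistent only if the model answers it
        # correctly under every capture condition of the same document.
        consistent = sum(1 for results in per_question.values() if all(results))
        return 100.0 * consistent / len(per_question)

    # Toy usage: two questions, four capture conditions each.
    records = [
        ("q1", "env", True), ("q1", "illum", True), ("q1", "view", True), ("q1", "distort", True),
        ("q2", "env", True), ("q2", "illum", False), ("q2", "view", True), ("q2", "distort", True),
    ]
    print(consistency_score(records))  # 50.0 -- only q1 is correct everywhere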

WildDoc Benchmark

Overview

We establish WildDoc, a benchmark focused on document understanding in the real world, where all images are manually captured in real-world settings. Considering the wide range of scenarios encountered in daily life, we select five key factors: environment, illumination, view, distortion, and effect.


Examples of real-world document images captured under different conditions.

Comparisons with Existing Benchmarks

Prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. WildDoc is the first benchmark focused on document understanding in the wild.

Statistics

Experiment Results

Based on WildDoc, we conduct experiments to evaluate numerous representative MLLMs, including general MLLMs (e.g., Qwen2.5-VL) and leading closed-source MLLMs (e.g., GPT-4o). The experimental results demonstrate that: (1) Existing MLLMs exhibit a large performance decline on WildDoc compared to traditional document benchmarks, with models like GPT-4o showing an average performance decrease of 35.3. (2) Existing MLLMs demonstrate inadequate robustness in document understanding, as evidenced by their low consistency scores, with Qwen2.5-VL-72B achieving the highest score of 49.7. (3) Some models exhibit minimal performance variations and tend to saturate on the original benchmarks, yet they experience significant performance declines and disparities on WildDoc.


The performance of several state-of-the-art open-source and closed-source MLLMs. All models suffer a decline on all three subsets; GPT-4o, for example, declines by 28.3, 56.4, and 21.3 on the three subsets, respectively. The results indicate that current MLLMs have not yet achieved satisfactory levels of document understanding capability when handling real-world scenarios. For performance comparison on our WildDoc, we recommend the "Consistency" metric.


We provide further analysis of the different real-world factors. The results reveal a substantial performance degradation of MLLMs when facing documents affected by common real-world distortions such as wrinkles, bends, and creases.

More Examples

Citation


      @misc{wang2025wilddoc,
            title={WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?},
            author={An-Lan Wang and Jingqun Tang and Liao Lei and Hao Feng and Qi Liu and Xiang Fei and Jinghui Lu and Han Wang and Weiwei Liu and Hao Liu and Yuliang Liu and Xiang Bai and Can Huang},
            year={2025},
            eprint={2505.11015},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2505.11015},
      }