Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Abstract
Chinese ancient documents preserve knowledge accumulated across millennia. However, most digitization efforts stop at the scanned-image level, limiting knowledge discovery and machine understanding. Existing document benchmarks focus primarily on English printed materials or Simplified Chinese and cannot fully assess the OCR and higher-level understanding capabilities of vision-language models (VLMs) in ancient-document scenarios. We introduce AncientDoc, the first systematic multi-task benchmark for Chinese ancient documents, consisting of five tasks: Page-level OCR, Vernacular Translation, Reasoning-based QA, Knowledge-based QA, and Linguistic-variant QA. The dataset spans 14 categories, more than 100 books, and about 3,000 pages. We evaluate mainstream VLMs with multiple metrics and complement them with LLM-based scoring that correlates strongly with human assessments, offering a unified framework for future research on ancient-document understanding.
Tasks
1) Page-level OCR
Transcribe the entire page directly in reading order, without explicit detection or cropping. Key challenges include vertical right-to-left layouts, marginalia and small annotation fonts, and robust handling of Traditional and variant characters.
2) Vernacular Translation
Translate Classical Chinese into modern vernacular Chinese. Difficulties include lexical disambiguation and semantic-aware segmentation/punctuation.
3) Reasoning-based QA
Answer questions requiring implicit reasoning (e.g., factual, causal, relational). Tests multi-step reasoning and contextual understanding.
4) Knowledge-based QA
Answer objective knowledge questions (people, places, terms, historical facts) grounded in the text; these require background knowledge of classical Chinese culture.
5) Linguistic-variant QA
Identify and analyze stylistic, rhetorical, and genre characteristics, assessing understanding and generation with respect to linguistic variants.
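To make the task suite concrete, the sketch below shows what one annotated page record could look like. The field names and structure are illustrative assumptions for exposition, not the benchmark's released schema.

```python
# Hypothetical example of a single AncientDoc page record.
# All field names and the file path are illustrative assumptions,
# not the benchmark's actual data format.
page_record = {
    "image": "pages/ming/0001.png",           # scanned page image (hypothetical path)
    "ocr_ground_truth": "...",                # full-page transcription in reading order
    "vernacular_translation": "...",          # modern vernacular rendering
    "reasoning_qa": [
        {"question": "...", "answer": "..."}  # factual/causal/relational questions
    ],
    "knowledge_qa": [
        {"question": "...", "answer": "..."}  # people, places, terms, historical facts
    ],
    "linguistic_variant_qa": [
        {"question": "...", "answer": "..."}  # style, rhetoric, genre analysis
    ],
}
```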
Coverage vs. Prior Benchmarks
| Task | DocVQA | TKH | MTH | OCRBench | OCRBench v2 | AncientDoc |
|---|---|---|---|---|---|---|
| Page-level OCR | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Vernacular Translation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Reasoning-based QA | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Knowledge-based QA | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Linguistic-variant QA | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
Dataset
Sources: primarily high-quality digital collections (e.g., Harvard Library). Selection criteria prioritize vertical layouts with Traditional characters, real-world degradation artifacts, and high semantic density suitable for annotation.
- Coverage: 14 categories, 100+ books, and ~3,000 pages (≈2,973–3,000).
- Dynasty distribution (largest four by page count): Ming (~1,148 pages), Qing (~778), Song (~540), Tang (~208).
- Script style: Regular script ≈97%, cursive ≈3%.
Metrics
Page-level OCR
- Character Error Rate (CER)
- Character Precision / Recall / F1
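As a reference for how these OCR metrics can be computed, here is a minimal sketch. CER follows the standard edit-distance definition; for character precision/recall/F1 we assume a simple bag-of-characters formulation, which may differ from the paper's exact alignment-based variant.

```python
from collections import Counter

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev = cur
    return dp[n] / max(m, 1)

def char_prf(ref: str, hyp: str) -> tuple[float, float, float]:
    """Character precision/recall/F1 via multiset overlap.
    Assumption: bag-of-characters matching; the benchmark may use
    an alignment-based definition instead."""
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    p = overlap / max(len(hyp), 1)
    r = overlap / max(len(ref), 1)
    f1 = 2 * p * r / max(p + r, 1e-12)
    return p, r, f1
```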
Other Tasks
- CHRF++
- BERTScore (BS-F1)
- LLM-based scores (0–10), selecting the LLM judge with the highest agreement with human ratings
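A sketch of how the text-generation metrics could be computed with common open-source packages (sacrebleu for chrF++, bert-score for BERTScore); the exact models and settings used in the paper are assumptions here.

```python
# pip install sacrebleu bert-score
import sacrebleu
from bert_score import score as bertscore

refs = ["参考白话译文"]   # reference translations (placeholder text)
hyps = ["模型生成的译文"]  # model outputs (placeholder text)

# chrF++ = chrF extended with word n-grams up to order 2 (word_order=2).
chrf = sacrebleu.CHRF(word_order=2)
print("chrF++:", chrf.corpus_score(hyps, [refs]).score)

# BERTScore F1; lang="zh" selects bert-score's default Chinese model.
# Whether the paper uses this default is an assumption.
P, R, F1 = bertscore(hyps, refs, lang="zh")
print("BS-F1:", F1.mean().item())
```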
LLM–Human Agreement (Illustrative)
We compare several LLM judges (e.g., GPT-4o, Gemini, Qwen-Plus, Doubao, Qwen2.5-72B) against human ratings using Pearson/Spearman/Kendall correlations, MSE/MAE, and bias, and select the judge with the best agreement.
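A minimal sketch of this judge-selection computation, assuming human and LLM scores are paired per sample; the calls are standard scipy/numpy APIs, and the selection rule (highest Pearson r) is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def agreement(human: np.ndarray, llm: np.ndarray) -> dict:
    """Agreement between one LLM judge's 0-10 scores and human ratings."""
    return {
        "pearson":  pearsonr(human, llm)[0],
        "spearman": spearmanr(human, llm)[0],
        "kendall":  kendalltau(human, llm)[0],
        "mse":      float(np.mean((llm - human) ** 2)),
        "mae":      float(np.mean(np.abs(llm - human))),
        "bias":     float(np.mean(llm - human)),  # >0 means the judge over-scores
    }

# Hypothetical selection: keep the judge with the best agreement,
# e.g. the highest Pearson correlation.
# best = max(judges, key=lambda j: agreement(human_scores, j.scores)["pearson"])
```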
Results (Selected)
Page-level OCR
Gemini 2.5-Pro achieves the best overall performance on this task (e.g., higher character F1 and lower CER), while the Qwen2.5-VL series is stable across settings. Notably, smaller models can sometimes outperform much larger ones on OCR-specific tasks.
Vernacular Translation
Gemini 2.5-Pro leads in BERTScore and LLM-based ratings; Qwen-VL-Max / Qwen2.5-VL-72B follow closely.
Reasoning-based QA
Qwen2.5-VL-72B reaches the highest BERTScore; the 7B variant approaches large-model performance with far fewer parameters.
Knowledge-based QA
GPT-4o tops BERTScore; Doubao-V2 and Gemini 2.5-Pro perform best under LLM-based scoring.
Linguistic-variant QA
GPT-4o and Gemini 2.5-Pro lead this task; notably, InternVL2.5 outperforms InternVL3 variants here.
BibTeX
```bibtex
@article{ancientdoc2025,
  title   = {Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning},
  author  = {},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}
```
Replace the author list and arXiv ID once the paper is public.