Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Abstract
Chinese ancient documents preserve knowledge accumulated across millennia. However, most digitization efforts stop at the scanned-image level, limiting knowledge discovery and machine understanding. Existing document benchmarks focus primarily on English printed materials or Simplified Chinese and cannot fully assess the OCR and higher-level understanding capabilities of vision-language models (VLMs) in ancient-document scenarios. We introduce AncientDoc, the first systematic multi-task benchmark for Chinese ancient documents, consisting of five tasks: Page-level OCR, Vernacular Translation, Reasoning-based QA, Knowledge-based QA, and Linguistic-variant QA. The dataset spans 14 categories, more than 100 books, and about 3,000 pages. We evaluate mainstream VLMs with multiple metrics and complement them with LLM-based scoring that correlates strongly with human assessments, offering a unified framework for future research on ancient-document understanding.
Tasks
1) Page-level OCR
Transcribe the entire page directly in reading order, without explicit detection or cropping. Key challenges include vertical right-to-left layouts, marginalia and small annotation fonts, and robust handling of Traditional and variant characters.
2) Vernacular Translation
Translate Classical Chinese into modern vernacular Chinese. Difficulties include lexical disambiguation and semantic-aware segmentation/punctuation.
3) Reasoning-based QA
Answer questions requiring implicit reasoning (e.g., factual, causal, relational). Tests multi-step reasoning and contextual understanding.
4) Knowledge-based QA
Answer objective knowledge questions (people, places, terms, historical facts) grounded in the text; these require background knowledge of classical Chinese culture.
5) Linguistic-variant QA
Identify and analyze stylistic, rhetorical, and genre characteristics, assessing understanding and generation with respect to linguistic variants.
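To make the task suite concrete, the sketch below shows what one annotated page record could look like. The field names and structure are illustrative assumptions for exposition, not the benchmark's released schema.

```python
# Hypothetical example of a single AncientDoc page record.
# All field names and the file path are illustrative assumptions,
# not the benchmark's actual data format.
page_record = {
    "image": "pages/ming/0001.png",           # scanned page image (hypothetical path)
    "ocr_ground_truth": "...",                # full-page transcription in reading order
    "vernacular_translation": "...",          # modern vernacular rendering
    "reasoning_qa": [
        {"question": "...", "answer": "..."}  # factual/causal/relational questions
    ],
    "knowledge_qa": [
        {"question": "...", "answer": "..."}  # people, places, terms, historical facts
    ],
    "linguistic_variant_qa": [
        {"question": "...", "answer": "..."}  # style, rhetoric, genre analysis
    ],
}
```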
Coverage vs. Prior Benchmarks
| Task | DocVQA | TKH | MTH | OCRBench | OCRBench v2 | AncientDoc |
|---|---|---|---|---|---|---|
| Page-level OCR | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Vernacular Translation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Reasoning-based QA | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Knowledge-based QA | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Linguistic-variant QA | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
Dataset
Sources: primarily high-quality digital collections (e.g., Harvard Library). Selection criteria prioritize vertical layouts with Traditional characters, real-world degradation artifacts, and high semantic density suitable for annotation.
- Coverage: 14 categories, 100+ books, and ~3,000 pages (≈2,973–3,000).
- Dynasty distribution (largest four by page count): Ming (~1,148 pages), Qing (~778), Song (~540), Tang (~208).
- Script style: Regular script ≈97%, cursive ≈3%.
Metrics
Page-level OCR
- Character Error Rate (CER)
- Character Precision / Recall / F1
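As a reference for how these OCR metrics can be computed, here is a minimal sketch. CER follows the standard edit-distance definition; for character precision/recall/F1 we assume a simple bag-of-characters formulation, which may differ from the paper's exact alignment-based variant.

```python
from collections import Counter

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev = cur
    return dp[n] / max(m, 1)

def char_prf(ref: str, hyp: str) -> tuple[float, float, float]:
    """Character precision/recall/F1 via multiset overlap.
    Assumption: bag-of-characters matching; the benchmark may use
    an alignment-based definition instead."""
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    p = overlap / max(len(hyp), 1)
    r = overlap / max(len(ref), 1)
    f1 = 2 * p * r / max(p + r, 1e-12)
    return p, r, f1
```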
Other Tasks
- CHRF++
- BERTScore (BS-F1)
- LLM-based scores (0–10), selecting the LLM judge with the highest agreement with human ratings
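A sketch of how the text-generation metrics could be computed with common open-source packages (sacrebleu for chrF++, bert-score for BERTScore); the exact models and settings used in the paper are assumptions here.

```python
# pip install sacrebleu bert-score
import sacrebleu
from bert_score import score as bertscore

refs = ["参考白话译文"]   # reference translations (placeholder text)
hyps = ["模型生成的译文"]  # model outputs (placeholder text)

# chrF++ = chrF extended with word n-grams up to order 2 (word_order=2).
chrf = sacrebleu.CHRF(word_order=2)
print("chrF++:", chrf.corpus_score(hyps, [refs]).score)

# BERTScore F1; lang="zh" selects bert-score's default Chinese model.
# Whether the paper uses this default is an assumption.
P, R, F1 = bertscore(hyps, refs, lang="zh")
print("BS-F1:", F1.mean().item())
```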
LLM–Human Agreement (Illustrative)
We compare several LLM judges (e.g., GPT-4o, Gemini, Qwen-Plus, Doubao, Qwen2.5-72B) against human ratings using Pearson/Spearman/Kendall correlations, MSE/MAE, and bias, and select the judge with the best agreement.
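A minimal sketch of this judge-selection computation, assuming human and LLM scores are paired per sample; the calls are standard scipy/numpy APIs, and the selection rule (highest Pearson r) is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def agreement(human: np.ndarray, llm: np.ndarray) -> dict:
    """Agreement between one LLM judge's 0-10 scores and human ratings."""
    return {
        "pearson":  pearsonr(human, llm)[0],
        "spearman": spearmanr(human, llm)[0],
        "kendall":  kendalltau(human, llm)[0],
        "mse":      float(np.mean((llm - human) ** 2)),
        "mae":      float(np.mean(np.abs(llm - human))),
        "bias":     float(np.mean(llm - human)),  # >0 means the judge over-scores
    }

# Hypothetical selection: keep the judge with the best agreement,
# e.g. the highest Pearson correlation.
# best = max(judges, key=lambda j: agreement(human_scores, j.scores)["pearson"])
```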
Results (Selected)
Page-level OCR
Gemini 2.5-Pro achieves the best overall performance on this task (e.g., higher character F1 and lower CER), while the Qwen2.5-VL series is stable across settings. Notably, smaller models can sometimes outperform much larger ones on OCR-specific tasks.
Vernacular Translation
Gemini 2.5-Pro leads in BERTScore and LLM-based ratings; Qwen-VL-Max / Qwen2.5-VL-72B follow closely.
Reasoning-based QA
Qwen2.5-VL-72B reaches the highest BERTScore; the 7B variant approaches large-model performance with far fewer parameters.
Knowledge-based QA
GPT-4o tops BERTScore; Doubao-V2 and Gemini 2.5-Pro perform best under LLM-based scoring.
Linguistic-variant QA
GPT-4o and Gemini 2.5-Pro lead this task; notably, InternVL2.5 outperforms InternVL3 variants here.
BibTeX
```bibtex
@article{ancientdoc2025,
  title   = {Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning},
  author  = {},
  journal = {arXiv preprint arXiv:},
  year    = {2025}
}
```
Replace the author list and arXiv ID once the paper is public.