We evaluate a range of MLLMs, considering both closed-source and open-source models. All baseline MLLMs are evaluated with their default generation settings; we do not study the effect of the generation configuration on the results. To make the model outputs more evaluation-friendly, we design the following prompt format to limit the output length: "Answer the question using a word or phrase in the language of the question. + [Question]", where "[Question]" denotes the actual question from the MTVQA test set. This keeps the answers as concise as possible. In addition, InternLM-XComposer2-4KHD is chosen as the base model for an instruction-tuning experiment on the MTVQA training set (denoted Xcomposer-SFT in the table below).
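For concreteness, the snippet below sketches how one such evaluation query could be issued. The helper functions, the OpenAI-style chat-completions client, and the "gpt-4o" model name are illustrative assumptions, not the exact evaluation harness used to produce the results reported here.

```python
# Illustrative sketch of the zero-shot evaluation query; the helpers and the
# OpenAI-style chat call are assumptions, not the exact harness used in this work.
import base64
from openai import OpenAI

PROMPT_PREFIX = "Answer the question using a word or phrase in the language of the question. "


def build_prompt(question: str) -> str:
    """Prepend the length-limiting instruction to the raw MTVQA question."""
    return PROMPT_PREFIX + question


def query_model(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one image-question pair to a chat-completions MLLM and return its short answer."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": build_prompt(question)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```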
Model | Overall | AR | DE | FR | IT | JA | KO | RU | TH | VI |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
GPT-4o(mni) | 27.8 | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 |
Claude3 Opus | 25.7 | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 |
Gemini Ultra | 23.2 | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 |
GPT-4V(ision) | 22.0 | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 |
Claude3 Sonnet | 21.1 | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 |
QwenVL Max | 21.1 | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 15.4 | 10.4 | 4.8 | 23.5 |
Xcomposer-SFT | 19.7 | 11.8 | 31.7 | 37.4 | 29.3 | 14.5 | 12.9 | 5.8 | 13.9 | 20.2 |
QwenVL Plus | 17.8 | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 |
MiniCPM-V 2.5 | 17.3 | 6.1 | 29.6 | 35.7 | 26.0 | 12.1 | 13.1 | 5.3 | 12.6 | 15.3 |
InternVL-V1.5 | 14.9 | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 |
GLM-4V | 13.6 | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 |
TextSquare | 13.6 | 3.7 | 27.0 | 30.7 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 |
Mini-Gemini-HD-34B | 13.0 | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 |
Xcomposer2-4KHD | 11.2 | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 |
Llava-Next-34B | 11.1 | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 |
TextMonkey | 9.9 | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 |
MiniCPM-V 2.0 | 7.4 | 1.3 | 12.7 | 14.9 | 17.0 | 3.7 | 5.6 | 2.2 | 2.2 | 6.8 |
mPLUG-DocOwl 1.5 | 7.2 | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 |
YI-VL-34B | 6.8 | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 |
DeepSeek-VL | 6.6 | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 |
Overall results of different models on the MTVQA test set. The best-performing model in each category is shown in bold. Performance is measured using accuracy (%). Language codes: AR (Arabic), DE (German), FR (French), IT (Italian), JA (Japanese), KO (Korean), RU (Russian), TH (Thai), VI (Vietnamese).
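For reference, the sketch below shows how the reported scores could be aggregated. The Overall column matches the unweighted mean of the nine per-language scores; the text normalization and the lenient containment criterion are assumptions about the scoring rule, not a specification of the official MTVQA evaluator.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation (assumed lenient rule)."""
    return text.lower().strip().strip(".,!?").strip()


def mtvqa_accuracy(samples):
    """samples: iterable of dicts with keys 'lang', 'answer' (ground truth), 'prediction'.

    A prediction is counted as correct when the normalized ground-truth answer is
    contained in the normalized prediction (an assumed criterion). Returns per-language
    accuracy (%) and an overall score computed as the unweighted mean over languages,
    which is how the Overall column relates to the per-language columns above.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        totals[s["lang"]] += 1
        if normalize(s["answer"]) in normalize(s["prediction"]):
            hits[s["lang"]] += 1
    per_lang = {lang: 100.0 * hits[lang] / totals[lang] for lang in totals}
    overall = sum(per_lang.values()) / len(per_lang)
    return per_lang, overall
```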