
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

Jingqun Tang*1, Qi Liu*1, Yongjie Ye*1, Jinghui Lu*1, Shu Wei1, Chunhui Lin1, Wanqing Li1, Mohamad Fitri Faiz Bin Mahmood1, Hao Feng1, Zhen Zhao1, Yanjie Wang1, Yuliang Liu2, Hao Liu*†,1, Xiang Bai2, Can Huang*†,1

1ByteDance Inc., 2Huazhong University of Science and Technology,

*Core Contributors
†Corresponding to: can.huang@bytedance.com, haoliu.0128@bytedance.com

Multilingual text-centric VQA example visualization for each language.


Left: overview of the various categories of text-rich images. Right: distribution of images and QA pairs over the 9 languages in the MTVQA benchmark.

🔔 News

🚀 [2024-06-04]: We are excited to launch MTVQA, the first multilingual visual text comprehension evaluation benchmark for MLLMs! MTVQA includes 9 widely-used but low-resource languages, i.e., AR, DE, FR, IT, JA, KO, RU, TH, and VI! 🌟

🔥 [2024-06-04]: GPT-4o achieves the best overall performance, and MiniCPM-V 2.5 achieves the best performance among open-source models! 😆

Introduction

We introduce MTVQA: the first benchmark focusing on Text-Centric Visual Question Answering (TEC-VQA) with high-quality human expert annotations across 9 diverse languages, comprising 6,778 question-answer pairs over 2,116 images. A dedicated team of annotators meticulously created this large-scale dataset to ensure high fidelity and linguistic diversity. Comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude 3, and Gemini, on the MTVQA benchmark shows that there is still substantial room for performance improvement, underscoring the value of MTVQA. Additionally, we supply multilingual training data within the MTVQA dataset and demonstrate that straightforward fine-tuning on this data can substantially enhance multilingual TEC-VQA performance. We hope MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension.

MTVQA Benchmark

Overview

We establish MTVQA, a novel and high-quality multilingual TEC-VQA benchmark, in which all images are collected from real-world scenarios and meticulously annotated by human experts in nine languages: Arabic (AR), Korean (KO), Japanese (JA), Thai (TH), Vietnamese (VI), Russian (RU), French (FR), German (DE), and Italian (IT). More concretely, to maximize visual-textual alignment, the annotation process follows a raise-then-correct paradigm: one group of human annotators raises several distinct questions, ranging from simple content extraction to text-related reasoning, and provides the corresponding answers; another group then double-checks these QA pairs to ensure accuracy and consistency. Consequently, we obtain 6,678 training images with 21,829 question-answer pairs and 2,116 test images with 6,778 question-answer pairs, covering more than 20 fine-grained scenarios from both documents and natural scenes, such as menus, logos, maps, bills, PPTs, and research papers. To our knowledge, MTVQA is the first TEC-VQA dataset to provide native human annotations for multilingual text-rich scenarios, especially for low-resource languages.


A brief diagram of the annotation process.
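
To make the data organization above concrete, below is a minimal Python sketch of one way to represent and load MTVQA-style annotations. The record fields (image, language, question, answer) and the JSON-lines file names are illustrative assumptions, not the official release format; consult the dataset release for the actual schema.

# A minimal sketch of one way to represent and load MTVQA-style annotations.
# The field names ("image", "language", "question", "answer") and the JSON-lines
# layout are assumptions made for illustration only.
import json
from dataclasses import dataclass

@dataclass
class MTVQASample:
    image: str      # path to the image file
    language: str   # one of AR, DE, FR, IT, JA, KO, RU, TH, VI
    question: str   # question written natively in `language`
    answer: str     # human-annotated answer in the same language

def load_split(path: str) -> list[MTVQASample]:
    """Read one split (e.g. a hypothetical train.jsonl / test.jsonl) into memory."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append(MTVQASample(**record))
    return samples

if __name__ == "__main__":
    test_set = load_split("test.jsonl")  # hypothetical file name
    print(f"{len(test_set)} QA pairs across {len({s.image for s in test_set})} images")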

Comparisons with Existing Benchmarks

  • MTVQA vs. General VQA Benchmarks:
    • MTVQA focuses specifically on multilingual text-centric visual question answering, addressing the "visual-textual misalignment" problem.
    • General VQA benchmarks primarily focus on visual questions that do not necessarily require understanding textual information within images.
    • MTVQA provides human expert annotations for multilingual text-centric scenarios, ensuring accurate and contextually appropriate translations.
  • MTVQA vs. TEC-VQA Benchmarks:
    • MTVQA targets a wider range of languages, including low-resource languages, unlike most TEC-VQA benchmarks that concentrate on high-resource languages (e.g., English, Chinese, Japanese).
    • MTVQA addresses the limitations of translation-based approaches in TEC-VQA, such as nuanced meaning loss, contextual distortion, language bias, and reduced question-type diversity.
    • MTVQA integrates human expert annotations to resolve the "visual-textual misalignment" problem, whereas other TEC-VQA benchmarks often rely on off-the-shelf translation engines.

Statistics

Experiment Results

Leaderboard

We evaluate a variety of models, including LLMs and MLLMs; within each type, we consider both closed-source and open-source models. We run the evaluation over the baseline MLLMs with their default settings, ignoring the effect of generation configuration on the results. To make the output of MLLMs more evaluation-friendly, we design the following prompt format to limit the output length: "Answer the question using a word or phrase in the language of the question. + [Question]", where "[Question]" is the actual question from the MTVQA test set. This keeps the answers as concise as possible. In addition, InternLM-XComposer2-4KHD is chosen as the base model for an instruction-tuning experiment on the MTVQA training set.
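
As a concrete illustration of this protocol, here is a minimal Python sketch of the prompt construction and a simple accuracy computation. The query_model callable is a placeholder for whatever inference API a given model exposes, and the containment-style matching shown is an illustrative scoring choice, not necessarily the exact metric behind the leaderboard numbers.

# A minimal sketch of the evaluation-friendly prompting and scoring described above.
# `query_model(image_path, prompt) -> str` stands in for the model under test; the
# containment-style match is an illustrative scoring choice.

PROMPT_PREFIX = "Answer the question using a word or phrase in the language of the question. "

def build_prompt(question: str) -> str:
    """Prepend the length-limiting instruction to a raw MTVQA question."""
    return PROMPT_PREFIX + question

def normalize(text: str) -> str:
    return text.strip().lower()

def score(prediction: str, answer: str) -> float:
    """1.0 if the normalized gold answer appears in the normalized prediction, else 0.0."""
    return float(normalize(answer) in normalize(prediction))

def evaluate(samples, query_model) -> float:
    """Average accuracy over samples exposing .image, .question, and .answer."""
    hits = [score(query_model(s.image, build_prompt(s.question)), s.answer) for s in samples]
    return sum(hits) / max(len(hits), 1)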

Model categories: Open-Source, Proprietary, SFT
Model Overall AR DE FR IT JA KO RU TH VI
GPT-4o (omni) 27.8 20.2 34.2 41.2 32.7 20.0 33.9 11.5 22.5 34.2
Claude3 Opus 25.7 15.1 33.4 40.6 34.4 19.4 27.2 13.0 19.5 29.1
Gemini Ultra 23.2 14.7 32.3 40.0 31.8 12.3 17.2 11.8 20.3 28.6
GPT-4V(ision) 22.0 11.5 1.5 40.4 32.3 11.5 16.7 10.3 15.0 28.9
Claude3 Sonnet 21.1 10.5 28.9 35.6 31.8 13.9 22.2 11.0 15.2 20.8
QwenVL Max 21.1 7.7 31.4 37.6 30.2 18.6 15.4 10.4 4.8 23.5
Xcomposer-SFT 19.7 11.8 31.7 37.4 29.3 14.5 12.9 5.8 13.9 20.2
QwenVL Plus 17.8 4.8 28.8 33.7 27.1 12.8 19.9 9.4 5.6 18.1
MiniCPM-V 2.5 17.3 6.1 29.6 35.7 26.0 12.1 13.1 5.3 12.6 15.3
InternVL-V1.5 14.9 3.4 27.1 31.4 27.1 9.9 9.0 4.9 8.7 12.4
GLM-4V 13.6 0.3 30.0 34.1 30.1 3.4 5.7 3.0 3.5 12.3
TextSquare 13.6 3.7 27.0 30.7 26.7 3.2 7.2 6.7 5.2 12.4
Mini-Gemini-HD-34B 13.0 2.2 25.0 29.2 25.5 6.1 8.6 4.1 4.3 11.8
Xcomposer2-4KHD 11.2 2.0 20.6 23.2 21.6 5.6 7.7 4.1 6.1 10.1
Llava-Next-34B 11.1 3.3 24.0 28.0 22.3 3.6 6.1 2.6 0.4 9.8
TextMonkey 9.9 2.0 18.1 19.9 22.1 4.6 7.2 3.2 0.9 11.1
MiniCPM-V 2.0 7.4 1.3 12.7 14.9 17.0 3.7 5.6 2.2 2.2 6.8
mPLUG-DocOwl 1.5 7.2 1.0 13.9 14.9 18.2 2.9 5.0 2.0 0.9 6.4
YI-VL-34B 6.8 1.7 13.5 15.7 12.1 4.8 5.2 0.8 3.5 4.1
DeepSeek-VL 6.6 0.6 14.2 15.3 15.2 2.9 3.8 1.6 0.9 5.2

Overall results of different models on the MTVQA test set. The best-performing model in each category is shown in bold. Performance is measured using accuracy.


Left: comparison of the overall performance of various MLLMs on the MTVQA benchmark. Right: comparison of the performance of MLLMs across the 9 languages of MTVQA.

Error Examples

Correct Examples

BibTeX


      @misc{tang2024mtvqa,
        title={MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering}, 
        author={Jingqun Tang and Qi Liu and Yongjie Ye and Jinghui Lu and Shu Wei and Chunhui Lin and Wanqing Li and Mohamad Fitri Faiz Bin Mahmood and Hao Feng and Zhen Zhao and Yanjie Wang and Yuliang Liu and Hao Liu and Xiang Bai and Can Huang},
        year={2024},
        eprint={2405.11985},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }