VCU-Bridge: Hierarchical Visual
Connotation Understanding via Semantic Bridging

Zhejiang University · Peking University · Sun Yat-sen University · CUHK

Abstract

While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge fine-grained details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Notably, the tuned model not only improves on HVCU-Bench but also benefits on general benchmarks (average +2.53%), with especially substantial gains on MMStar (+7.26%), underscoring the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.

Teaser Image

Showcase of the differing reasoning patterns of humans and models. A model can appear capable by correctly answering both concrete and abstract questions while fundamentally failing at the reasoning that bridges them, so current evaluation may miss critical reasoning failures even when answers at both levels are correct.

VCU-Bridge Framework

Motivated by the observation above, we formalize VCU-Bridge as the structured inference of abstract scene interpretations grounded in perceptual evidence, with each intermediate semantic justification made explicit and auditable. VCU-Bridge is represented as a structured triple of statements ordered by their level of abstraction: Foundational Perceptual Level (Lperc) → Semantic Bridge Level (Lbridge) → Abstract Connotative Level (Lconn).

🔍 Foundational Perceptual Level

Lperc involves identifying low-level, objective visual primitives like objects and their attributes that are directly observable in the image.

🌉 Semantic Bridge Level

Lbridge provides explanatory statements that causally link perceptual evidence to higher-level meaning, establishing the critical connection between concrete details and abstract interpretations.

💭 Abstract Connotative Level

Lconn captures subjective, high-level interpretations of the scene such as aesthetics, emotion, or symbolic meaning inferred from the visual content.

A valid hierarchy must satisfy pairwise support constraints, where each higher-level statement is sufficiently justified by the lower-level one. This formulation enables explicit modeling of the associative reasoning that connects concrete perceptual evidence to abstract connotative meaning.
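To make this concrete, here is a minimal Python sketch of the triple and its pairwise support constraints. The class, field, and function names (HierarchyTriple, supports, is_valid) are our own illustrative choices, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: names are hypothetical, not the paper's code.
@dataclass
class HierarchyTriple:
    l_perc: str    # objective visual primitives (objects, attributes)
    l_bridge: str  # explanatory statement linking evidence to meaning
    l_conn: str    # abstract interpretation (aesthetics, emotion, symbolism)

def is_valid(triple: HierarchyTriple,
             supports: Callable[[str, str], bool]) -> bool:
    """A hierarchy is valid iff each higher-level statement is
    sufficiently justified by the statement directly below it."""
    return (supports(triple.l_perc, triple.l_bridge)
            and supports(triple.l_bridge, triple.l_conn))
```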

HVCU-Bench: A Benchmark for VCU-Bridge

To systematically evaluate VCU-Bridge capabilities, we construct HVCU-Bench, a benchmark specifically designed to measure hierarchical visual reasoning ability. Unlike traditional benchmarks that test perception and reasoning in isolation, HVCU-Bench explicitly models the critical semantic bridge that connects low-level visual details to high-level abstract interpretations.

Overview of HVCU-Bench

Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects (top left). Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels (Perception, Bridge, Connotation), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension (bottom). While GPT-4o achieves top performance among MLLMs, it falls substantially short of human capability, exposing a significant gap (top right).

Task Design

HVCU-Bench comprises three task families covering fifteen fine-grained aspects:

  • Affective Reasoning (joy, affection, wonder, anger, fear, sadness)
  • Aesthetic Appreciation (color, composition, font, graphics)
  • Implication Understanding (metaphor, symbolism, contrast, exaggeration, dislocation)

All items follow a hierarchical multiple-choice QA format with four options per question.
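For illustration, a single benchmark item might be organized as below. The field names and file path are assumptions for exposition rather than the released data schema.

```python
# Hypothetical item layout: one question per level, four options each.
example_item = {
    "image": "samples/implication_001.jpg",  # illustrative path
    "task": "Implication Understanding",
    "aspect": "metaphor",
    "levels": {
        "perception":  {"question": "...", "options": ["A", "B", "C", "D"], "answer": "B"},
        "bridge":      {"question": "...", "options": ["A", "B", "C", "D"], "answer": "D"},
        "connotation": {"question": "...", "options": ["A", "B", "C", "D"], "answer": "A"},
    },
}
```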

Evaluation Metrics

We evaluate model performance using metrics that capture both level-specific correctness and full-chain consistency:

  • Per-Level Accuracy (Accperc, Accbridge, Accconn): The proportion of correct predictions at each individual level, providing insight into capability at different abstraction levels.
  • Full-Chain Accuracy (Accfull): Requires simultaneous correctness across all three levels, evaluating the model's ability to perform hierarchical reasoning from perception to connotation.
  • Overall Score (Score): The mean of Accfull scores across all tasks, enabling fair model comparison.
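Under the illustrative item layout sketched earlier, these metrics could be computed roughly as follows; the function and key names are ours, not the official evaluation code.

```python
def evaluate(predictions, items):
    """Per-level and full-chain accuracy for one task family.
    `predictions[i][level]` holds the model's chosen option at that level."""
    levels = ("perception", "bridge", "connotation")
    correct = {lv: 0 for lv in levels}
    full_chain = 0
    for pred, item in zip(predictions, items):
        hits = {lv: pred[lv] == item["levels"][lv]["answer"] for lv in levels}
        for lv in levels:
            correct[lv] += hits[lv]
        full_chain += all(hits.values())  # all three levels must be correct
    n = len(items)
    return {
        "acc_perc":   correct["perception"] / n,
        "acc_bridge": correct["bridge"] / n,
        "acc_conn":   correct["connotation"] / n,
        "acc_full":   full_chain / n,
    }

# Overall Score = mean of acc_full over the three task families:
# score = sum(evaluate(p, i)["acc_full"] for p, i in task_results) / len(task_results)
```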

Experimental Results

Overall Performance

Accuracy is reported per level (Perc/Bridge/Conn/Full) for Implication Understanding (IU), Aesthetic Appreciation (AA), and Affective Reasoning (AR); Score is the mean of Accfull across the three tasks.

| Model | Size | IU Perc | IU Bridge | IU Conn | IU Full | AA Perc | AA Bridge | AA Conn | AA Full | AR Perc | AR Bridge | AR Conn | AR Full | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Basic Reference* | | | | | | | | | | | | | | |
| Human | - | 99.25 | 96.00 | 86.50 | 86.00 | 99.14 | 92.29 | 90.29 | 88.86 | 99.33 | 93.33 | 88.67 | 86.67 | 87.18 |
| GPT-4o | - | 95.50 | 85.25 | 62.75 | 53.25 | 95.43 | 78.29 | 68.00 | 53.14 | 91.33 | 83.67 | 64.33 | 50.33 | 52.24 |
| *Open-Source MLLMs* | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 82.75 | 58.00 | 43.25 | 90.57 | 70.86 | 60.00 | 41.14 | 90.33 | 82.67 | 56.67 | 39.33 | 41.24 |
| Qwen3-VL-Instruct | 8B | 93.50 | 89.50 | 59.50 | 50.75 | 91.71 | 73.43 | 63.43 | 44.00 | 94.33 | 84.67 | 60.00 | 48.00 | 47.58 |
| LLaVA-1.6 | 7B | 81.75 | 58.00 | 40.25 | 18.75 | 79.14 | 36.86 | 33.14 | 9.43 | 92.00 | 58.00 | 19.33 | 12.00 | 13.39 |
| LLaVA-1.6 | 13B | 84.75 | 79.00 | 55.00 | 39.50 | 84.86 | 55.14 | 50.57 | 26.29 | 94.33 | 77.33 | 29.00 | 21.33 | 29.04 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 62.25 | 49.75 | 29.25 | 89.71 | 45.14 | 41.14 | 19.71 | 93.33 | 65.33 | 29.00 | 19.00 | 22.65 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 83.00 | 60.75 | 49.50 | 95.14 | 58.00 | 38.00 | 23.71 | 96.33 | 81.33 | 46.00 | 36.67 | 36.63 |
| Gemma3 | 4B | 76.50 | 72.00 | 49.75 | 30.75 | 68.86 | 62.86 | 68.00 | 29.14 | 87.00 | 76.00 | 51.00 | 36.00 | 31.96 |
| Gemma3 | 12B | 87.50 | 85.25 | 60.50 | 47.50 | 82.86 | 70.29 | 68.00 | 38.86 | 90.67 | 86.33 | 58.00 | 46.33 | 44.23 |
| InternVL3.5 | 4B | 82.50 | 83.75 | 58.50 | 42.00 | 82.86 | 64.57 | 40.00 | 23.43 | 91.00 | 81.67 | 60.67 | 47.67 | 37.70 |
| InternVL3.5 | 8B | 82.00 | 85.25 | 55.75 | 41.75 | 84.00 | 68.00 | 60.57 | 36.29 | 86.00 | 83.67 | 55.67 | 42.00 | 40.01 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 56.50 | 42.75 | 32.25 | 90.29 | 42.57 | 23.14 | 15.14 | 90.00 | 85.00 | 45.33 | 33.67 | 27.02 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 83.25 | 61.25 | 44.75 | 88.29 | 61.14 | 53.43 | 33.14 | 91.33 | 82.00 | 54.33 | 41.33 | 39.74 |

Context Mode Analysis

| Model | Task | Accperc | Accbridge | Accconn | Accfull | Score |
|---|---|---|---|---|---|---|
| GPT-4o (Base) | Impl. | 95.50 | 85.25 | 62.75 | 53.25 | 52.24 |
| | Aesth. | 95.43 | 78.29 | 68.00 | 53.14 | |
| | Affect. | 91.33 | 83.67 | 64.33 | 50.33 | |
| GPT-4o (Context) | Impl. | 95.50 | 89.75 | 76.50 | 65.00 | 68.18 (+15.94) |
| | Aesth. | 95.43 | 82.29 | 87.71 | 72.86 | |
| | Affect. | 91.33 | 86.00 | 80.67 | 66.67 | |
| Qwen3-VL-8B (Base) | Impl. | 93.50 | 89.50 | 59.50 | 50.75 | 47.58 |
| | Aesth. | 91.71 | 73.43 | 63.43 | 44.00 | |
| | Affect. | 94.33 | 84.67 | 60.00 | 48.00 | |
| Qwen3-VL-8B (Context) | Impl. | 93.50 | 90.00 | 74.75 | 62.75 | 62.28 (+14.70) |
| | Aesth. | 91.71 | 74.00 | 82.57 | 59.43 | |
| | Affect. | 94.33 | 89.00 | 76.00 | 64.67 | |
| Gemma3-4B (Base) | Impl. | 76.50 | 72.00 | 49.75 | 30.75 | 31.96 |
| | Aesth. | 68.86 | 62.86 | 68.00 | 29.14 | |
| | Affect. | 87.00 | 76.00 | 51.00 | 36.00 | |
| Gemma3-4B (Context) | Impl. | 76.50 | 78.25 | 63.50 | 40.75 | 36.39 (+4.43) |
| | Aesth. | 68.86 | 65.14 | 82.57 | 35.43 | |
| | Affect. | 87.00 | 75.00 | 50.00 | 33.00 | |

Key Findings from HVCU-Bench Evaluation

🔍 Significant Gap Between Humans and MLLMs

Humans nearly saturate all HVCU-Bench levels across all tasks, achieving an overall score of 87.18% with consistently strong performance at both Bridge and Connotation levels. In contrast, GPT-4o demonstrates near-human performance at Perception (-3.75%) but exhibits a substantial disparity at Connotation (-23.75%), resulting in an overall gap of -34.94%. This reveals that current MLLMs still lack a stable semantic bridge from concrete evidence to abstract meaning.

📉 Universal Performance Degradation

Most models exhibit a sharp, cascading decline when moving from Perception to Connotation. While nearly all achieve high accuracy at Perception, their performance systematically deteriorates at Bridge and typically drops precipitously at Connotation. On Implication Understanding, GPT-4o experiences a degradation of -32.75%, Qwen3-VL-8B-Instruct degrades by -34.00%, and Gemma3-12B suffers a -27.00% decline, validating that a critical weakness exists in bridging perception to abstract reasoning.

🔗 Hierarchical Context Brings Substantial Gains

Providing hierarchical context yields substantial performance gains across all evaluated models. GPT-4o demonstrates an overall improvement of +15.94%, while Qwen3-VL-8B-Instruct achieves a gain of +14.70%. This demonstrates that lower levels provide critical grounding for higher-level connotative reasoning, confirming that connotative inference fundamentally relies on a coherent chain of reasoning from perception through semantic bridging to abstract interpretation.
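One plausible realization of this context setting is sketched below: when querying a higher level, the lower-level question-answer pairs are prepended to the prompt. The exact prompt wording, and whether gold or model-predicted lower-level answers are supplied, are assumptions here; the sketch reuses the illustrative item layout from above.

```python
def build_connotation_prompt(item, lower_answers):
    """Hypothetical 'Context' prompt: prepend lower-level QA pairs before the
    connotation question. `lower_answers` maps level name -> answer text
    (gold or model-predicted; the paper's exact protocol is not assumed here)."""
    lv = item["levels"]
    context_lines = []
    for level in ("perception", "bridge"):
        context_lines.append(
            f"[{level.capitalize()}] Q: {lv[level]['question']} "
            f"A: {lower_answers[level]}"
        )
    options = "\n".join(lv["connotation"]["options"])
    return (
        "\n".join(context_lines)
        + f"\n\n[Connotation] Q: {lv['connotation']['question']}\n"
        + f"Options:\n{options}\nAnswer with A, B, C, or D."
    )
```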

📊 Scale Alone Cannot Resolve the Challenge

While increasing model scale generally improves performance, it does not resolve the fundamental challenges of hierarchical visual connotation understanding. Larger models possess stronger foundational capabilities but still lack the specialized knowledge required for connotative reasoning, suggesting that the challenge transcends mere model scale and points to a deeper gap in current multimodal understanding paradigms.

Hierarchical Data Generation Pipeline

Based on the above analysis, current MLLMs exhibit significant performance degradation when moving from perception to connotation, with a critical weakness in bridging concrete evidence to abstract meaning. To address this gap, we propose a data generation pipeline that produces hierarchical training data for instruction tuning, aimed at enhancing VCU-Bridge capabilities.

In contrast to the benchmark's top-down, sequential generation of individual samples, the training data is constructed through a large-scale bottom-up search: Monte Carlo Tree Search (MCTS) explores a tree of candidate reasoning paths and ultimately selects diverse, high-quality chains for training.

Pipeline

Overview of our hierarchical data generation pipeline. An MCTS-driven approach for generating high-quality hierarchical training data. The pipeline iteratively constructs a reasoning tree through three phases: Selection (based on UCB strategy), Expansion & Evaluation (generating and assessing candidate children), and Backpropagation (updating statistics for ancestors). After MCTS convergence, top-K highest-rated complete reasoning paths are extracted for instruction tuning.
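The schematic Python sketch below illustrates the MCTS loop described in the caption: UCB-based selection, expansion with candidate children, evaluation, backpropagation, and final top-K path extraction. The Node structure, the `propose_children` generator, and the `rate_path` scorer are placeholders standing in for the MLLM-driven components; this is not the released pipeline code.

```python
import math
import random

class Node:
    """One statement in the reasoning tree (levels: 0=perception, 1=bridge, 2=connotation)."""
    def __init__(self, statement, level, parent=None):
        self.statement = statement
        self.level = level
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb(self, c=1.4):
        # Unvisited nodes are explored first; otherwise balance exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def mcts_generate(root, propose_children, rate_path, iterations=200, top_k=10):
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a node with no children.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion & Evaluation: generate candidate next-level statements,
        # then score one newly added partial path with the evaluator.
        if node.level < 2:
            candidates = propose_children(node)
            if candidates:
                node.children = [Node(s, node.level + 1, node) for s in candidates]
                node = random.choice(node.children)
        reward = rate_path(node)
        # Backpropagation: update statistics for the node and all its ancestors.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the top-K highest-rated complete (connotation-level) paths.
    complete = [n for n in _leaves(root) if n.level == 2 and n.visits > 0]
    return sorted(complete, key=lambda n: n.value / n.visits, reverse=True)[:top_k]

def _leaves(node):
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from _leaves(child)
```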

Effectiveness of Hierarchical Data Generation

To validate the effectiveness of our data generation pipeline, we instruction-tune Qwen3-VL-4B-Instruct on approximately 10k hierarchical QA pairs generated from 1k images, yielding Qwen3-VL-4B-Bridge. Qwen3-VL-4B-Bridge exhibits consistent improvements across all three tasks of HVCU-Bench. Although training supervision is provided exclusively for Implication Understanding, Qwen3-VL-4B-Bridge achieves substantial gains not only on this task (+6.75% in Accfull) but also on Aesthetic Appreciation (+5.43%) and Affective Reasoning (+6.34%), where no direct training signals are given. This cross-task transfer strongly indicates that Qwen3-VL-4B-Bridge has learned to establish generalizable semantic connections, linking perceptual evidence to abstract connotations through intermediate factual reasoning, rather than memorizing task-specific templates. The overall HVCU-Bench score demonstrates an improvement of +6.17%, further confirming that bottom-up data generation with validation effectively teaches structured visual reasoning.

To assess whether these hierarchical reasoning improvements generalize beyond HVCU-Bench, we evaluate Qwen3-VL-4B-Bridge on four established general benchmarks: MMBench, HallusionBench, MMStar, and MMMU. Qwen3-VL-4B-Bridge demonstrates strong generalization, achieving substantial improvements on MMStar (+7.26%) and MMMU (+3.22%) while maintaining competitive performance on MMBench and HallusionBench, even though it was trained exclusively on HVCU-Bench's connotation-focused data. This cross-benchmark transfer demonstrates that grounding abstract interpretations in perceptual evidence through a structured semantic bridge enhances reasoning skills, benefiting tasks beyond connotative understanding.

Model Performance Chart

Performance improvements after instruction tuning. Qwen3-VL-4B-Bridge achieves substantial improvements on HVCU-Bench while maintaining strong performance on diverse general benchmarks, demonstrating the effectiveness of hierarchical training data.

HVCU-Bench Samples

BibTeX

@misc{zhong2025vcubridgehierarchicalvisualconnotation,
      title={VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging}, 
      author={Ming Zhong and Yuanlei Wang and Liuzhou Zhang and Arctanx An and Renrui Zhang and Hao Liang and Ming Lu and Ying Shen and Wentao Zhang},
      year={2025},
      eprint={2511.18121},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.18121}, 
}