VCU-Bridge: Hierarchical Visual
Connotation Understanding via Semantic Bridging

Zhejiang University · Peking University · Sun Yat-sen University · CUHK

Abstract

While Multimodal Large Language Models (MLLMs) excel on benchmarks, their processing paradigm differs from the human ability to integrate visual information. Unlike humans, who naturally bridge fine-grained details and high-level concepts, models tend to treat these elements in isolation. Prevailing evaluation protocols often decouple low-level perception from high-level reasoning, overlooking their semantic and causal dependencies, which yields non-diagnostic results and obscures performance bottlenecks. We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding: multi-level reasoning that advances from foundational perception through semantic bridging to abstract connotation, with an explicit evidence-to-inference trace from concrete cues to abstract conclusions. Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics. Comprehensive experiments demonstrate a consistent decline in performance as reasoning progresses to higher levels. We further develop a data generation pipeline for instruction tuning guided by Monte Carlo Tree Search (MCTS) and show that strengthening low-level capabilities yields measurable gains at higher levels. Notably, the tuned model not only improves on HVCU-Bench but also benefits on general benchmarks (average +2.53%), with especially substantial gains on MMStar (+7.26%), underscoring the significance of the hierarchical thinking pattern and its effectiveness in enhancing MLLM capabilities.

Teaser Image

Showcase of the differing reasoning patterns of humans and models. A model can appear capable by correctly answering both concrete and abstract questions while fundamentally failing at the reasoning that bridges them, so current evaluation may miss critical reasoning failures even when answers at both levels are correct.

VCU-Bridge Framework

Motivated by the observation above, we formalize VCU-Bridge as the structured inference of abstract scene interpretations grounded in perceptual evidence, with each intermediate semantic justification made explicit and auditable. VCU-Bridge is represented as a structured triple of statements ordered by their level of abstraction: Foundational Perceptual Level (Lperc) → Semantic Bridge Level (Lbridge) → Abstract Connotative Level (Lconn).

🔍 Foundational Perceptual Level

Lperc involves identifying low-level, objective visual primitives like objects and their attributes that are directly observable in the image.

🌉 Semantic Bridge Level

Lbridge provides explanatory statements that causally link perceptual evidence to higher-level meaning, establishing the critical connection between concrete details and abstract interpretations.

💭 Abstract Connotative Level

Lconn captures subjective, high-level interpretations of the scene such as aesthetics, emotion, or symbolic meaning inferred from the visual content.

A valid hierarchy must satisfy pairwise support constraints, where each higher-level statement is sufficiently justified by the lower-level one. This formulation enables explicit modeling of the associative reasoning that connects concrete perceptual evidence to abstract connotative meaning.
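To make this concrete, here is a minimal Python sketch of the triple and its pairwise support constraints. The class, field, and function names (HierarchyTriple, supports, is_valid) are our own illustrative choices, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch only: names are hypothetical, not the paper's code.
@dataclass
class HierarchyTriple:
    l_perc: str    # objective visual primitives (objects, attributes)
    l_bridge: str  # explanatory statement linking evidence to meaning
    l_conn: str    # abstract interpretation (aesthetics, emotion, symbolism)

def is_valid(triple: HierarchyTriple,
             supports: Callable[[str, str], bool]) -> bool:
    """A hierarchy is valid iff each higher-level statement is
    sufficiently justified by the statement directly below it."""
    return (supports(triple.l_perc, triple.l_bridge)
            and supports(triple.l_bridge, triple.l_conn))
```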

HVCU-Bench: A Benchmark for VCU-Bridge

To systematically evaluate VCU-Bridge capabilities, we construct HVCU-Bench, a benchmark specifically designed to measure hierarchical visual reasoning ability. Unlike traditional benchmarks that test perception and reasoning in isolation, HVCU-Bench explicitly models the critical semantic bridge that connects low-level visual details to high-level abstract interpretations.

Overview of HVCU-Bench

Overview of HVCU-Bench. We evaluate MLLMs across 3 task families spanning 15 diverse aspects (top left). Our benchmark employs hierarchical decomposition: each question is systematically broken down into sub-questions across three levels (Perception, Bridge, Connotation), with validation ensuring logical coherence. During evaluation, models progress from low to high levels, constructing inter-level reasoning chains that emulate human visual comprehension (bottom). While GPT-4o achieves top performance among MLLMs, it falls substantially short of human capability, exposing a significant gap (top right).

Task Design

HVCU-Bench comprises three task families covering fifteen fine-grained aspects:

  • Affective Reasoning (joy, affection, wonder, anger, fear, sadness)
  • Aesthetic Appreciation (color, composition, font, graphics)
  • Implication Understanding (metaphor, symbolism, contrast, exaggeration, dislocation)

All items follow a hierarchical multiple-choice QA format with four options per question.
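For illustration, a single benchmark item might be organized as below. The field names and file path are assumptions for exposition rather than the released data schema.

```python
# Hypothetical item layout: one question per level, four options each.
example_item = {
    "image": "samples/implication_001.jpg",  # illustrative path
    "task": "Implication Understanding",
    "aspect": "metaphor",
    "levels": {
        "perception":  {"question": "...", "options": ["A", "B", "C", "D"], "answer": "B"},
        "bridge":      {"question": "...", "options": ["A", "B", "C", "D"], "answer": "D"},
        "connotation": {"question": "...", "options": ["A", "B", "C", "D"], "answer": "A"},
    },
}
```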

Evaluation Metrics

We evaluate model performance using metrics that capture both level-specific correctness and full-chain consistency:

  • Per-Level Accuracy (Accperc, Accbridge, Accconn): The proportion of correct predictions at each individual level, providing insight into capability at different abstraction levels.
  • Full-Chain Accuracy (Accfull): Requires simultaneous correctness across all three levels, evaluating the model's ability to perform hierarchical reasoning from perception to connotation.
  • Overall Score (Score): The mean of Accfull scores across all tasks, enabling fair model comparison.
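Under the illustrative item layout sketched earlier, these metrics could be computed roughly as follows; the function and key names are ours, not the official evaluation code.

```python
def evaluate(predictions, items):
    """Per-level and full-chain accuracy for one task family.
    `predictions[i][level]` holds the model's chosen option at that level."""
    levels = ("perception", "bridge", "connotation")
    correct = {lv: 0 for lv in levels}
    full_chain = 0
    for pred, item in zip(predictions, items):
        hits = {lv: pred[lv] == item["levels"][lv]["answer"] for lv in levels}
        for lv in levels:
            correct[lv] += hits[lv]
        full_chain += all(hits.values())  # all three levels must be correct
    n = len(items)
    return {
        "acc_perc":   correct["perception"] / n,
        "acc_bridge": correct["bridge"] / n,
        "acc_conn":   correct["connotation"] / n,
        "acc_full":   full_chain / n,
    }

# Overall Score = mean of acc_full over the three task families:
# score = sum(evaluate(p, i)["acc_full"] for p, i in task_results) / len(task_results)
```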

Experimental Results

Overall Performance

Accuracy is reported per level (Perc/Bridge/Conn/Full) for Implication Understanding (IU), Aesthetic Appreciation (AA), and Affective Reasoning (AR); Score is the mean of Accfull across the three tasks.

| Model | Size | IU Perc | IU Bridge | IU Conn | IU Full | AA Perc | AA Bridge | AA Conn | AA Full | AR Perc | AR Bridge | AR Conn | AR Full | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Basic Reference* | | | | | | | | | | | | | | |
| Human | - | 99.25 | 96.00 | 86.50 | 86.00 | 99.14 | 92.29 | 90.29 | 88.86 | 99.33 | 93.33 | 88.67 | 86.67 | 87.18 |
| GPT-4o | - | 95.50 | 85.25 | 62.75 | 53.25 | 95.43 | 78.29 | 68.00 | 53.14 | 91.33 | 83.67 | 64.33 | 50.33 | 52.24 |
| *Open-Source MLLMs* | | | | | | | | | | | | | | |
| Qwen3-VL-Instruct | 4B | 86.75 | 82.75 | 58.00 | 43.25 | 90.57 | 70.86 | 60.00 | 41.14 | 90.33 | 82.67 | 56.67 | 39.33 | 41.24 |
| Qwen3-VL-Instruct | 8B | 93.50 | 89.50 | 59.50 | 50.75 | 91.71 | 73.43 | 63.43 | 44.00 | 94.33 | 84.67 | 60.00 | 48.00 | 47.58 |
| LLaVA-1.6 | 7B | 81.75 | 58.00 | 40.25 | 18.75 | 79.14 | 36.86 | 33.14 | 9.43 | 92.00 | 58.00 | 19.33 | 12.00 | 13.39 |
| LLaVA-1.6 | 13B | 84.75 | 79.00 | 55.00 | 39.50 | 84.86 | 55.14 | 50.57 | 26.29 | 94.33 | 77.33 | 29.00 | 21.33 | 29.04 |
| Deepseek-VL2-tiny | MoE 1B/3B | 88.25 | 62.25 | 49.75 | 29.25 | 89.71 | 45.14 | 41.14 | 19.71 | 93.33 | 65.33 | 29.00 | 19.00 | 22.65 |
| Deepseek-VL2 | MoE 4.5B/27B | 93.75 | 83.00 | 60.75 | 49.50 | 95.14 | 58.00 | 38.00 | 23.71 | 96.33 | 81.33 | 46.00 | 36.67 | 36.63 |
| Gemma3 | 4B | 76.50 | 72.00 | 49.75 | 30.75 | 68.86 | 62.86 | 68.00 | 29.14 | 87.00 | 76.00 | 51.00 | 36.00 | 31.96 |
| Gemma3 | 12B | 87.50 | 85.25 | 60.50 | 47.50 | 82.86 | 70.29 | 68.00 | 38.86 | 90.67 | 86.33 | 58.00 | 46.33 | 44.23 |
| InternVL3.5 | 4B | 82.50 | 83.75 | 58.50 | 42.00 | 82.86 | 64.57 | 40.00 | 23.43 | 91.00 | 81.67 | 60.67 | 47.67 | 37.70 |
| InternVL3.5 | 8B | 82.00 | 85.25 | 55.75 | 41.75 | 84.00 | 68.00 | 60.57 | 36.29 | 86.00 | 83.67 | 55.67 | 42.00 | 40.01 |
| Phi-4-Multimodal-Instruct | 6B | 90.25 | 56.50 | 42.75 | 32.25 | 90.29 | 42.57 | 23.14 | 15.14 | 90.00 | 85.00 | 45.33 | 33.67 | 27.02 |
| Phi-3.5-Vision-Instruct | 4B | 84.25 | 83.25 | 61.25 | 44.75 | 88.29 | 61.14 | 53.43 | 33.14 | 91.33 | 82.00 | 54.33 | 41.33 | 39.74 |

Context Mode Analysis

| Model | Task | Accperc | Accbridge | Accconn | Accfull | Score |
|---|---|---|---|---|---|---|
| GPT-4o (Base) | Impl. | 95.50 | 85.25 | 62.75 | 53.25 | 52.24 |
| | Aesth. | 95.43 | 78.29 | 68.00 | 53.14 | |
| | Affect. | 91.33 | 83.67 | 64.33 | 50.33 | |
| GPT-4o (Context) | Impl. | 95.50 | 89.75 | 76.50 | 65.00 | 68.18 (+15.94) |
| | Aesth. | 95.43 | 82.29 | 87.71 | 72.86 | |
| | Affect. | 91.33 | 86.00 | 80.67 | 66.67 | |
| Qwen3-VL-8B (Base) | Impl. | 93.50 | 89.50 | 59.50 | 50.75 | 47.58 |
| | Aesth. | 91.71 | 73.43 | 63.43 | 44.00 | |
| | Affect. | 94.33 | 84.67 | 60.00 | 48.00 | |
| Qwen3-VL-8B (Context) | Impl. | 93.50 | 90.00 | 74.75 | 62.75 | 62.28 (+14.70) |
| | Aesth. | 91.71 | 74.00 | 82.57 | 59.43 | |
| | Affect. | 94.33 | 89.00 | 76.00 | 64.67 | |
| Gemma3-4B (Base) | Impl. | 76.50 | 72.00 | 49.75 | 30.75 | 31.96 |
| | Aesth. | 68.86 | 62.86 | 68.00 | 29.14 | |
| | Affect. | 87.00 | 76.00 | 51.00 | 36.00 | |
| Gemma3-4B (Context) | Impl. | 76.50 | 78.25 | 63.50 | 40.75 | 36.39 (+4.43) |
| | Aesth. | 68.86 | 65.14 | 82.57 | 35.43 | |
| | Affect. | 87.00 | 75.00 | 50.00 | 33.00 | |

Key Findings from HVCU-Bench Evaluation

🔍 Significant Gap Between Humans and MLLMs

Humans nearly saturate all HVCU-Bench levels across all tasks, achieving an overall score of 87.18% with consistently strong performance at both Bridge and Connotation levels. In contrast, GPT-4o demonstrates near-human performance at Perception (-3.75%) but exhibits a substantial disparity at Connotation (-23.75%), resulting in an overall gap of -34.94%. This reveals that current MLLMs still lack a stable semantic bridge from concrete evidence to abstract meaning.

📉 Universal Performance Degradation

Most models exhibit a sharp, cascading decline when moving from Perception to Connotation. While nearly all achieve high accuracy at Perception, their performance systematically deteriorates at Bridge and typically drops precipitously at Connotation. On Implication Understanding, GPT-4o experiences a degradation of -32.75%, Qwen3-VL-8B-Instruct degrades by -34.00%, and Gemma3-12B suffers a -27.00% decline, validating that a critical weakness exists in bridging perception to abstract reasoning.

🔗 Hierarchical Context Brings Substantial Gains

Providing hierarchical context yields substantial performance gains across all evaluated models. GPT-4o demonstrates an overall improvement of +15.94%, while Qwen3-VL-8B-Instruct achieves a gain of +14.70%. This demonstrates that lower levels provide critical grounding for higher-level connotative reasoning, confirming that connotative inference fundamentally relies on a coherent chain of reasoning from perception through semantic bridging to abstract interpretation.
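One plausible realization of this context setting is sketched below: when querying a higher level, the lower-level question-answer pairs are prepended to the prompt. The exact prompt wording, and whether gold or model-predicted lower-level answers are supplied, are assumptions here; the sketch reuses the illustrative item layout from above.

```python
def build_connotation_prompt(item, lower_answers):
    """Hypothetical 'Context' prompt: prepend lower-level QA pairs before the
    connotation question. `lower_answers` maps level name -> answer text
    (gold or model-predicted; the paper's exact protocol is not assumed here)."""
    lv = item["levels"]
    context_lines = []
    for level in ("perception", "bridge"):
        context_lines.append(
            f"[{level.capitalize()}] Q: {lv[level]['question']} "
            f"A: {lower_answers[level]}"
        )
    options = "\n".join(lv["connotation"]["options"])
    return (
        "\n".join(context_lines)
        + f"\n\n[Connotation] Q: {lv['connotation']['question']}\n"
        + f"Options:\n{options}\nAnswer with A, B, C, or D."
    )
```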

📊 Scale Alone Cannot Resolve the Challenge

While increasing model scale generally improves performance, it does not resolve the fundamental challenges of hierarchical visual connotation understanding. Larger models possess stronger foundational capabilities but still lack the specialized knowledge required for connotative reasoning, suggesting that the challenge transcends mere model scale and points to a deeper gap in current multimodal understanding paradigms.

Hierarchical Data Generation Pipeline

Based on the above analysis, current MLLMs exhibit significant performance degradation when moving from perception to connotation, with a critical weakness in bridging concrete evidence to abstract meaning. To address this gap, we propose a data generation pipeline that produces hierarchical training data for instruction tuning, aimed at enhancing VCU-Bridge capabilities.

In contrast to the benchmark's top-down, sequential generation of individual samples, the training data is constructed through a large-scale bottom-up search: Monte Carlo Tree Search (MCTS) explores a tree of candidate reasoning paths and ultimately selects diverse, high-quality chains for training.

Pipeline

Overview of our hierarchical data generation pipeline. An MCTS-driven approach for generating high-quality hierarchical training data. The pipeline iteratively constructs a reasoning tree through three phases: Selection (based on UCB strategy), Expansion & Evaluation (generating and assessing candidate children), and Backpropagation (updating statistics for ancestors). After MCTS convergence, top-K highest-rated complete reasoning paths are extracted for instruction tuning.
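The schematic Python sketch below illustrates the MCTS loop described in the caption: UCB-based selection, expansion with candidate children, evaluation, backpropagation, and final top-K path extraction. The Node structure, the `propose_children` generator, and the `rate_path` scorer are placeholders standing in for the MLLM-driven components; this is not the released pipeline code.

```python
import math
import random

class Node:
    """One statement in the reasoning tree (levels: 0=perception, 1=bridge, 2=connotation)."""
    def __init__(self, statement, level, parent=None):
        self.statement = statement
        self.level = level
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb(self, c=1.4):
        # Unvisited nodes are explored first; otherwise balance exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def mcts_generate(root, propose_children, rate_path, iterations=200, top_k=10):
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a node with no children.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # Expansion & Evaluation: generate candidate next-level statements,
        # then score one newly added partial path with the evaluator.
        if node.level < 2:
            candidates = propose_children(node)
            if candidates:
                node.children = [Node(s, node.level + 1, node) for s in candidates]
                node = random.choice(node.children)
        reward = rate_path(node)
        # Backpropagation: update statistics for the node and all its ancestors.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the top-K highest-rated complete (connotation-level) paths.
    complete = [n for n in _leaves(root) if n.level == 2 and n.visits > 0]
    return sorted(complete, key=lambda n: n.value / n.visits, reverse=True)[:top_k]

def _leaves(node):
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from _leaves(child)
```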

Effectiveness of Hierarchical Data Generation

To validate the effectiveness of our data generation pipeline, we instruction-tune Qwen3-VL-4B-Instruct on approximately 10k hierarchical QA pairs generated from 1k images, yielding Qwen3-VL-4B-Bridge. Qwen3-VL-4B-Bridge exhibits consistent improvements across all three tasks of HVCU-Bench. Although training supervision is provided exclusively for Implication Understanding, Qwen3-VL-4B-Bridge achieves substantial gains not only on this task (+6.75% in Accfull) but also on Aesthetic Appreciation (+5.43%) and Affective Reasoning (+6.34%), where no direct training signals are given. This cross-task transfer strongly indicates that Qwen3-VL-4B-Bridge has learned to establish generalizable semantic connections, linking perceptual evidence to abstract connotations through intermediate factual reasoning, rather than memorizing task-specific templates. The overall HVCU-Bench score demonstrates an improvement of +6.17%, further confirming that bottom-up data generation with validation effectively teaches structured visual reasoning.

To assess whether these hierarchical reasoning improvements generalize beyond HVCU-Bench, we evaluate Qwen3-VL-4B-Bridge on four established general benchmarks: MMBench, HallusionBench, MMStar, and MMMU. Qwen3-VL-4B-Bridge demonstrates strong generalization, achieving substantial improvements on MMStar (+7.26%) and MMMU (+3.22%) while maintaining competitive performance on MMBench and HallusionBench, even though it was trained exclusively on HVCU-Bench's connotation-focused data. This cross-benchmark transfer demonstrates that grounding abstract interpretations in perceptual evidence through a structured semantic bridge enhances reasoning skills, benefiting tasks beyond connotative understanding.

Model Performance Chart

Performance improvements after instruction tuning. Qwen3-VL-4B-Bridge achieves substantial improvements on HVCU-Bench while maintaining strong performance on diverse general benchmarks, demonstrating the effectiveness of hierarchical training data.

HVCU-Bench Samples

BibTeX

@misc{zhong2025vcubridgehierarchicalvisualconnotation,
      title={VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging}, 
      author={Ming Zhong and Yuanlei Wang and Liuzhou Zhang and Arctanx An and Renrui Zhang and Hao Liang and Ming Lu and Ying Shen and Wentao Zhang},
      year={2025},
      eprint={2511.18121},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.18121}, 
}