VidHalluc

Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Arizona State University

An example of the video pair in VidHalluc.

Abstract

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations among all tasks.

🎯 Motivation

1. Current hallucination benchmarks are typically small in scale, containing fewer than 1K videos and 2K QA pairs.

2. The primary focus of existing benchmarks is on hallucinations in static elements of videos, e.g., objects, their attributes, and spatial relationships.

3. Existing benchmarks are constructed with limited question types, e.g., only binary QA.


💡 VidHalluc Pipeline


Overview of the VidHalluc benchmark construction process.

VidHalluc is constructed in four stages.

1. Semantic and Visual Similarity Filtering
Starting from existing video datasets (e.g., ActivityNet, YouCook2, VALOR32K), we measure semantic similarity with CLIP and visual similarity with DINOv2. Video pairs are selected when their semantic similarity score exceeds 0.9 while their visual similarity score stays below 0.6 (see the sketch after this list).

2. Quality Filtering
Filter out videos if (1) the duration is shorter than one second, or (2) GPT-4 identifies a mismatch between the caption and the depicted actions or scenes.

3. Human Validation
Filter out videos if (1) either video lacks a clear action, (2) either video contains multiple actions, or (3) both videos show identical actions.

4. Automatic Question Generation
We categorize three distinct hallucination types: (1) Action Hallucination (ACH), (2) Temporal Sequence Hallucination (TSH), and (3) Scene Transition Hallucination (STH). We design specific question formats to evaluate model performance: binary QA, multiple-choice questions, sorting questions, and open-ended questions.


📊 Statistics

Statistics           ACH     TSH     STH     Total
# of Videos          3957    600     445     5002
# of Questions       8250    600     445     9295
Avg. Duration (s)    21.79   41.19   28.72   24.70


๐Ÿฆ–๐Ÿฅ DINO-HEAL

We introduce DINO-HEAL, a training-free method that mitigates hallucinations by using saliency maps from DINOv2 to reweight features from the frozen visual encoder, focusing on key spatial regions. DINO-HEAL requires no architectural modifications or additional training.


DINO-HEAL pipeline. Since DINOv2 effectively captures salient regions in the input video, we leverage it to guide the reweighting of attention across spatial regions within the visual encoder's features.


This adaptive reweighting strategy enhances key visual features by focusing on the regions highlighted by the DINOv2 saliency map, mitigating hallucinations while preserving the original feature representation.
The saliency map visualizations below corroborate this hypothesis.
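
As a rough illustration, here is a minimal sketch of the saliency-based reweighting idea: a per-patch saliency map is taken from DINOv2 (using the last layer's CLS-to-patch attention as a proxy), resized to the MLLM visual encoder's patch grid, and used to rescale the frozen encoder's patch tokens frame by frame before they are passed to the language model. The checkpoint, the attention-based saliency proxy, and the residual weighting with a hypothetical alpha factor are assumptions for illustration; the paper's exact formulation may differ.

import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

# Eager attention so that per-head attention maps are returned.
dino = AutoModel.from_pretrained("facebook/dinov2-base", attn_implementation="eager").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def saliency_map(frame, grid_hw):
    """Per-patch saliency for one frame, resized to the MLLM encoder's patch grid (H, W)."""
    inputs = dino_proc(images=frame, return_tensors="pt")
    attn = dino(**inputs, output_attentions=True).attentions[-1]   # (1, heads, seq, seq)
    cls_to_patch = attn[0, :, 0, 1:].mean(dim=0)                   # average heads, drop CLS
    side = int(cls_to_patch.numel() ** 0.5)
    sal = cls_to_patch.reshape(1, 1, side, side)
    sal = F.interpolate(sal, size=grid_hw, mode="bilinear", align_corners=False).flatten()
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-6)      # normalize to [0, 1]

def reweight_visual_tokens(visual_tokens, saliency, alpha=1.0):
    """visual_tokens: (num_patches, dim) patch features from the frozen visual encoder."""
    # Hypothetical residual reweighting: emphasize salient patches, keep original features.
    return visual_tokens * (1.0 + alpha * saliency.unsqueeze(-1))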



Visualization of saliency maps.

🧪 Experimental Results

Model               # Params   Binary QA   MCQ     TSH     STH     Avg.
Human               -          95.14       93.29   90.17   87.43   91.51
GPT-4o              -          81.17       90.97   83.42   74.17   82.43
Gemini-1.5-Pro      -          75.02       79.04   82.67   64.11   75.21
VILA-1.5            13B        57.75       81.95   68.84   35.04   60.90
VideoLLaMA2         7B         48.23       83.79   22.50   65.22   54.94
LLaVA-NeXT-Video    34B        26.04       77.57   20.67   44.39   42.17
PLLaVA              13B        35.04       77.31   17.83   32.94   40.78
Video-LLaVA         7B         23.88       65.18   28.83   30.12   37.00
Chat-UniVi          13B        23.20       55.07   32.50   31.55   35.58
ShareGPT4Video      8B         29.58       44.83   49.00   17.08   35.12
Video-ChatGPT       7B         9.36        23.25   29.83   8.13    17.64

Binary QA: binary question answering, MCQ: multiple-choice question,
TSH: temporal sequence hallucination, STH: scene transition hallucination.


State-of-the-art MLLMs on VidHalluc.

👀 Qualitative Results

BibTeX


@inproceedings{li2024vidhalluc,
  author    = {Li, Chaoyu and Im, Eun Woo and Fazli, Pooyan},
  title     = {VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding},
  booktitle = {arXiv:0000.0000},
  year      = {2024}
}