VidHalluc

Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Arizona State University

An example of the video pair in VidHalluc.

Abstract

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, the problem of hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that the visual encoder of MLLMs often struggles to differentiate between video pairs that are visually distinct but semantically similar, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding tasks. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. VidHalluc consists of 5,002 videos, paired based on semantic similarity and visual differences, focusing on cases where hallucinations are most likely to occur. Through comprehensive testing, our experiments show that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency information from DINOv2 to reweight visual features during inference. Our results demonstrate that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations among all tasks.

🎯 Motivation

1. Current hallucination benchmarks are typically small in scale, containing fewer than 1K videos and 2K QA pairs.

2. The primary focus of existing benchmarks is on hallucinations in static elements of videos, e.g., objects, their attributes, and spatial relationships.

3. Existing benchmarks are constructed with limited question types, e.g., only binary QA.


💡 VidHalluc Pipeline


Overview of the VidHalluc benchmark construction process.

VidHalluc is constructed in four stages.

1. Semantic and Visual Similarity Filtering
Starting from existing video datasets (e.g., ActivityNet, YouCook2, VALOR32K), we measure semantic similarity with CLIP and visual similarity with DINOv2. Video pairs are selected when their semantic similarity score exceeds 0.9 while their visual similarity score stays below 0.6 (see the sketch after this list).

2. Quality Filtering
Filter out videos if (1) the duration is shorter than one second, or (2) GPT-4 identifies a mismatch between the caption and the depicted actions or scenes.

3. Human Validation
Filter out videos if (1) either video lacks a clear action, (2) either video contains multiple actions, or (3) both videos show identical actions.

4. Automatic Question Generation
We categorize three distinct hallucination types: (1) Action Hallucination (ACH), (2) Temporal Sequence Hallucination (TSH), and (3) Scene Transition Hallucination (STH). We design specific question formats to evaluate model performance: binary QA, multiple-choice questions, sorting questions, and open-ended questions.


📊 Statistics

Statistics           ACH     TSH     STH     Total
# of Videos          3957    600     445     5002
# of Questions       8250    600     445     9295
Avg. Duration (s)    21.79   41.19   28.72   24.70


๐Ÿฆ–๐Ÿฅ DINO-HEAL

We introduce DINO-HEAL, a training-free method that mitigates hallucinations by using saliency maps from DINOv2 to reweight features from the frozen visual encoder, focusing on key spatial regions. DINO-HEAL requires no architectural modifications or additional training.


DINO-HEAL pipeline. Since DINOv2 effectively captures salient regions in the input video, we leverage it to guide the reweighting of attention across spatial regions within the visual encoder's features.


This adaptive reweighting strategy enhances key visual features by focusing on the regions highlighted by the DINOv2 saliency map, mitigating hallucinations while preserving the original feature representation.
The saliency map visualizations below corroborate this hypothesis.
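
As a rough illustration, here is a minimal sketch of the saliency-based reweighting idea: a per-patch saliency map is taken from DINOv2 (using the last layer's CLS-to-patch attention as a proxy), resized to the MLLM visual encoder's patch grid, and used to rescale the frozen encoder's patch tokens frame by frame before they are passed to the language model. The checkpoint, the attention-based saliency proxy, and the residual weighting with a hypothetical alpha factor are assumptions for illustration; the paper's exact formulation may differ.

import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

# Eager attention so that per-head attention maps are returned.
dino = AutoModel.from_pretrained("facebook/dinov2-base", attn_implementation="eager").eval()
dino_proc = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

@torch.no_grad()
def saliency_map(frame, grid_hw):
    """Per-patch saliency for one frame, resized to the MLLM encoder's patch grid (H, W)."""
    inputs = dino_proc(images=frame, return_tensors="pt")
    attn = dino(**inputs, output_attentions=True).attentions[-1]   # (1, heads, seq, seq)
    cls_to_patch = attn[0, :, 0, 1:].mean(dim=0)                   # average heads, drop CLS
    side = int(cls_to_patch.numel() ** 0.5)
    sal = cls_to_patch.reshape(1, 1, side, side)
    sal = F.interpolate(sal, size=grid_hw, mode="bilinear", align_corners=False).flatten()
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-6)      # normalize to [0, 1]

def reweight_visual_tokens(visual_tokens, saliency, alpha=1.0):
    """visual_tokens: (num_patches, dim) patch features from the frozen visual encoder."""
    # Hypothetical residual reweighting: emphasize salient patches, keep original features.
    return visual_tokens * (1.0 + alpha * saliency.unsqueeze(-1))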



Visualization of saliency maps.

🧪 Experimental Results

Model               # Params   Binary QA   MCQ     TSH     STH     Avg.
Human               -          95.14       93.29   90.17   87.43   91.51
GPT-4o              -          81.17       90.97   83.42   74.17   82.43
Gemini-1.5-Pro      -          75.02       79.04   82.67   64.11   75.21
VILA-1.5            13B        57.75       81.95   68.84   35.04   60.90
VideoLLaMA2         7B         48.23       83.79   22.50   65.22   54.94
LLaVA-NeXT-Video    34B        26.04       77.57   20.67   44.39   42.17
PLLaVA              13B        35.04       77.31   17.83   32.94   40.78
Video-LLaVA         7B         23.88       65.18   28.83   30.12   37.00
Chat-UniVi          13B        23.20       55.07   32.50   31.55   35.58
ShareGPT4Video      8B         29.58       44.83   49.00   17.08   35.12
Video-ChatGPT       7B         9.36        23.25   29.83   8.13    17.64

Binary QA: binary question answering, MCQ: multiple-choice question,
TSH: temporal sequence hallucination, STH: scene transition hallucination.


State-of-the-art MLLMs on VidHalluc.

👀 Qualitative Results

BibTeX


@inproceedings{li2024vidhalluc,
  author    = {Li, Chaoyu and Im, Eun Woo and Fazli, Pooyan},
  title     = {VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding},
  booktitle = {arXiv:0000.0000},
  year      = {2024}
}