How to Detect AI-Generated Videos? A Review of Dynamic, Traceable, and Explainable Detection Systems
**How to Detect AI-Generated Videos: A Survey on Dynamic, Traceable, and Explainable Detection Systems**
With rapid advances in AI video generation (e.g., Sora, Veo), creating highly realistic, multi-minute videos is now possible, widening the gap with detection research. Current AI video detection, often limited to unreliable binary classifications, is insufficient. This survey, accepted at ACL 2026, reframes the goal as **"factual fidelity verification"**—checking if a video's content (who, when, where, what) aligns with the real world perceptually and cognitively.
It categorizes AI-generated videos into three paradigms: **Local Manipulation Videos (LMV**, e.g., face swaps), **Audio-Visual Editing (AVE**, e.g., lip-syncing), and **Generative Video Synthesis (GVS**, fully synthetic videos like Sora's). Detection challenges evolve from visual artifacts in LMV to multi-modal inconsistencies in AVE and higher-level world knowledge violations in GVS.
The core proposal is a **Vision-Language Dual-View framework** with four hierarchical layers:
1. **Layer 1 (Intrinsic Visual Cues):** Analyzes low-level signal statistics, noise patterns, and physiological signals.
2. **Layer 2 (Spatiotemporal Consistency):** Checks for temporal coherence in object motion and scene dynamics.
3. **Layer 3 (Cross-Modal Consistency):** Verifies alignment between video, audio, and text within the video.
4. **Layer 4 (Language-Guided World-Level Reasoning):** Uses external knowledge, facts, and physical laws to judge semantic plausibility and factual correctness.
The survey traces a shift in detection focus from lower layers (1 & 2) toward higher, language-involved layers (3 & 4). It also reviews evolving evaluation metrics and datasets tailored for each video paradigm.
The conclusion advocates for a **dynamic, evidence-first detection system** that moves beyond simple classification. Future trustworthy detection requires combining visual evidence (from CV) with semantic reasoning and explanation (from NLP & multimodal AI), ultimately creating traceable and explainable judgments about a video's adherence to real-world constraints.
marsbit27 хв тому