How to Detect AI-Generated Videos? A Review of Dynamic, Traceable, and Explainable Detection Systems

marsbit | Published 2026-06-26 | Last updated 2026-06-26

Abstract

**How to Detect AI-Generated Videos: A Survey on Dynamic, Traceable, and Explainable Detection Systems**

With rapid advances in AI video generation (e.g., Sora, Veo), creating highly realistic, multi-minute videos is now possible, and the gap between generation and detection research is widening. Current AI video detection, often limited to unreliable binary classification, is insufficient. This survey, accepted at ACL 2026, reframes the goal as **"factual fidelity verification"**: checking whether a video's content (who, when, where, what) aligns with the real world perceptually and cognitively. It categorizes AI-generated videos into three paradigms: **Local Manipulation Video (LMV)**, e.g., face swaps; **Audio-Visual Editing (AVE)**, e.g., lip-syncing; and **Generative Video Synthesis (GVS)**, fully synthetic videos like Sora's. Detection challenges evolve from visual artifacts in LMV to multi-modal inconsistencies in AVE and higher-level world knowledge violations in GVS. The core proposal is a **Vision-Language Dual-View framework** with four hierarchical layers:

1. **Layer 1 (Intrinsic Visual Cues):** Analyzes low-level signal statistics, noise patterns, and physiological signals.
2. **Layer 2 (Spatiotemporal Consistency):** Checks for temporal coherence in object motion and scene dynamics.
3. **Layer 3 (Cross-Modal Consistency):** Verifies alignment between video, audio, and text within the video.
4. **Layer 4 (Language-Guided World-Level Reasoning):** Uses external knowledge, facts, and ph...

Over the past two years, video generation models have evolved rapidly, from Sora's striking debut at the end of 2024 to the wave of releases earlier this year, including Google Veo, Sora 2, the Kling series, and Seedance 2.0. The quality of AI-generated video has taken a qualitative leap: models can now produce cinema-grade, realistic footage lasting several minutes, with multiple characters and complex scenes.

In contrast to the rapid progress on the generation side, research interest in AI-generated video detection has remained lukewarm.

Yet the social impact is already easy to observe, because the multimodal nature of video gives it far greater deceptive potential:

On social platforms, AI-generated fake videos appear frequently, and their quantity, quality, and reach are growing rapidly. When users ask foundation models such as Grok or Doubao "Is this video AI-generated?", the answers are often bare binary judgments that lack explainability and credibility. On platforms like Xiaohongshu, genuinely recorded videos are often labeled as "suspected AI-generated."

A vast gap separates the rapid development of generation from the limited attention paid to detection. Three questions need prompt answers: with AI video generation iterating this quickly, what stage has research on AI-generated video detection reached, what paradigm shifts is it undergoing, and what directions should it pursue?

Against this backdrop, researchers from MBZUAI, Renmin University of China, and Harvard University jointly published a comprehensive review. For the first time, it systematically organizes technical approaches from both the visual and the linguistic perspective, spanning low-level visual perception through high-level world reasoning. On that basis, the review analyzes the detection system that is urgently needed: one that couples evidence across multiple layers and is dynamic, traceable, explainable, and trustworthy. The work has been accepted for publication at ACL 2026.

Paper Link: https://www.researchgate.net/doi/10.13140/RG.2.2.31713.88168

GitHub Link: https://github.com/dxhou/AI-Generated-Video-Detection

Homepage Link: https://AIgcvdetection.github.io

Redefining the Goal of AI-Generated Video Detection

Figure 1 | The complete pipeline of AI-generated video detection: from generation side, dual-view detection, to evidence sets.

Before the explosion of generative AI, AI-generated videos left relatively obvious visual artifacts. Based on this premise, in early Deepfake scenarios represented by face-swapping, frame-level visual perception verification was sufficiently effective.

Over the past two years, however, the quality of generated video has outgrown this premise. The human eye is increasingly unable to judge the authenticity of realistic, complete videos, and detection that only outputs a binary label no longer meets the need. The urgent question becomes: on what evidence does the detector base a trustworthy judgment?

The review first extends the boundary of the detection problem. It argues that detection output needs to shift from a "true/false" binary classification towards an interpretable, trustworthy, structured judgment, so that the object of detection becomes the gap between the "virtual world" in the video and the "real world".

Accordingly, the review redefines the detection goal as "factual fidelity verification": verifying whether propositions about "who, when, where, what happened" in the video content are consistent with the real world both perceptually and cognitively. Beyond cross-modal verification between vision and other modalities, this requires judging whether those propositions conflict with external facts, physical laws, and world knowledge.
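As an illustration of what a structured judgment could look like in place of a single real/fake bit, here is a minimal Python sketch. The schema (`Proposition`, `FidelityReport`, the field names) is hypothetical and not taken from the review; it only mirrors the "who, when, where, what" propositions and the perceptual/cognitive split described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proposition:
    """One checkable claim extracted from a video (hypothetical schema)."""
    slot: str             # one of "who", "when", "where", "what"
    claim: str            # natural-language statement of the claim
    perceptual_ok: bool   # consistent with what is seen and heard
    cognitive_ok: bool    # consistent with external facts, physics, world knowledge
    evidence: List[str] = field(default_factory=list)  # pointers to supporting cues


@dataclass
class FidelityReport:
    propositions: List[Proposition]

    def violations(self) -> List[Proposition]:
        # A proposition fails if either the perceptual or the cognitive check fails.
        return [p for p in self.propositions
                if not (p.perceptual_ok and p.cognitive_ok)]

    def verdict(self) -> str:
        return "inconsistent" if self.violations() else "consistent"


report = FidelityReport([
    Proposition("who", "The speaker is person A", True, True,
                ["face match, frames 10-40"]),
    Proposition("what", "A dropped glass moves upward", True, False,
                ["trajectory contradicts gravity, 00:03-00:05"]),
])
print(report.verdict())                        # inconsistent
print([p.slot for p in report.violations()])   # ['what']
```

The point of such a structure is that the final label is derived from per-proposition checks, so every verdict can be traced back to the claims and evidence that produced it.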

Detection Objects: Three Paradigms of AI-Generated Videos

Figure 2 | Three paradigms of AI-generated videos defined in this review.

From 2020 to the present, AI-generated videos have undergone a paradigm shift: from local modifications via GANs in the early Deepfake era, to audiovisual recombination such as lip-syncing and voice cloning, and then to fully synthesized video (as with Sora) driven by latent diffusion models acting as "world simulators". The review classifies AI-generated videos into the following three paradigms:

Local Manipulation Video (LMV) with Real Carrier Retention

LMV has long been the most typical and mature paradigm in traditional Deepfake detection. An LMV modifies local regions of a genuinely recorded video, for example through face-swapping or background replacement, while most of the original structure (scenes, character actions, camera motion, lighting relationships) usually remains. Most early methods therefore focused on local artifacts, frequency-domain features, geometric anomalies, and regional consistency. As generative models grow stronger at local fusion, lighting adaptation, and identity transfer, and as platform processing and re-sharing erase many subtle traces, the detection focus for LMV is shifting towards robustness across different scenarios.

Audio-Visual Editing (AVE) under Cross-Modal Coupling Constraints

The AVE paradigm emerged mainly in 2024. In this type of AI-generated video, what is altered is the set of established correspondences within the video itself, such as those between visual content and sound, lip movements, speaker identity, speech rhythm, and subtitle content. Examples include speech-driven facial synthesis, re-dubbing original videos, modifying lip movements, and changing speakers. This shift forces detection to move from looking for visual artifacts to checking whether the relationships between modalities actually hold, examining audio, lip movements, identity, and content together to find discriminative clues.

End-to-End Generative Video Synthesis (GVS)

In the GVS paradigm, which exploded in 2025, models directly generate entire video sequences based on conditional information like text, images, or noise, no longer relying on a real video as a base, presenting entirely new challenges for detection.

These videos often appear very realistic in single frames or over short periods, but vulnerabilities tend to appear over long spatiotemporal sequences: for example, characters' actions or positions in scenes may fail to connect logically from start to finish; object shapes or movements may change in ways that violate physical laws; or the events depicted in the video may be impossible in the real world.

Correspondingly, detection approaches for the GVS paradigm cannot be confined to local or inter-modal consistency. They need to operate at higher levels: long-range consistency, common sense, physical laws, narrative and causality, proposition-level truthfulness, and traceability. Detection must verify over long sequences whether the content is plausible at every level under the constraints of the real world.

Four-Layer Taxonomy of Detection Methods from a Vision-Language Dual-View Perspective

Figure 3 | Vision-Language Dual-View four-layer framework: the first two layers lean towards the visual perspective, the latter two move towards the linguistic perspective.

Modal perspectives on AI-generated video detection have currently diverged into two core scientific problems. The first starts from the visual modality and focuses on low-level signal forensics and the spatiotemporal consistency of the visuals. The second starts from the language modality. It focuses on cross-modal linguistic information within the video itself, judging "whether the video is narrating coherently with good cross-modal alignment", and uses language to introduce reasoning about world knowledge and facts, judging "whether the video content can withstand scrutiny against external real-world knowledge, facts, and laws."

Capturing this trend, the review proposes organizing AI-generated video detection research methods and evaluation paradigms from a Vision-Language Dual-View perspective. Based on this, it further proposes the following four-layer landscape of methods, progressing from low-level perception to high-level cognition:

Layer 1, Intrinsic Cues Analysis: The First Screening Net

Methods in Layer 1 address the research question: At the level of low-level visual signals, does the video conform to the statistical patterns that real videos must satisfy, and does the video contain low-level cues introduced by AI model generation or editing operations?

Real videos satisfy characteristic low-level statistical properties: footage obtained through real capture and processing is naturally consistent with its acquisition, encoding, and post-processing pipeline. In contrast, the AI generation process often leaves clues that deviate from the real-video distribution, such as monotonous stylistic patterns, model-specific watermarks and artifacts, and detectably artificial physiological signals. Methods in this first layer take a visual perspective, performing forensics by modeling, extracting, and amplifying these low-level signals. This includes detecting:

Pixel and geometric anomalies like frequency domain patterns, textures, boundaries, noise patterns.
Physiological signals on human faces like pulse coupling, subtle muscle movements, and blinking rhythms.
Whether systematic shifts exist in the feature space between real and fake videos.
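To make the idea of a frequency-domain cue concrete, the sketch below computes the share of spectral energy in the upper part of the spectrum for a 1-D signal (for instance, one row of pixel intensities). This is a toy statistic chosen for illustration, not a method from the review; real Layer 1 detectors typically work on 2-D spectra or learned features. The function name and the `cutoff` parameter are assumptions.

```python
import cmath
import math
from typing import Sequence


def high_freq_energy_ratio(signal: Sequence[float], cutoff: float = 0.5) -> float:
    """Share of non-DC spectral energy above `cutoff` (fraction of Nyquist)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]          # remove the DC component
    half = n // 2
    energies = []
    for k in range(1, half + 1):                   # bins 1 .. Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        energies.append(abs(coeff) ** 2)
    total = sum(energies)
    if total == 0:
        return 0.0
    first_high = int(cutoff * half)                # energies[i] holds bin i + 1
    return sum(energies[first_high:]) / total


n = 32
smooth = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
with_ripple = [s + 0.5 * math.sin(2 * math.pi * 14 * t / n)
               for t, s in enumerate(smooth)]
print(high_freq_energy_ratio(smooth))        # close to 0
print(high_freq_energy_ratio(with_ripple))   # close to 0.2
```

A statistic like this would only ever be one input among many; on its own it says nothing about whether the high-frequency content comes from a generator, a codec, or the scene itself.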

Layer 2, Spatiotemporal Consistency: Checking "Does the Video Flow Smoothly?"

Methods in Layer 2 treat a video as a sequence of frames combined across space and time. The research question is: in the spatiotemporal dimension, does the image stream exhibit the characteristics that object motion in real videos must satisfy? Real captured videos are constrained by continuous camera trajectories and real environments, so objects and backgrounds change between adjacent frames in continuous, predictable ways consistent with physical feasibility and camera motion. In contrast, AI-generated videos may exhibit spatiotemporal discontinuities over longer sequences, such as object or background distortion and sudden local blurring. This includes detecting:

Temporal and motion inconsistencies like local object deformation, background drift, sudden blurring, motion residue anomalies.
Human behavior and interaction dynamics like expression changes, identity dynamics, interaction rhythms between characters in the scene.
Physical and frequency anomalies related to temporal frequency and visual continuity.
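A very simple way to picture a temporal-continuity check is to measure how much each frame differs from the previous one and flag outliers. The sketch below does this with a median-plus-MAD threshold. It is an illustrative baseline under the assumption that frames arrive as flattened grayscale intensity lists; it is not one of the surveyed methods, and the names and the factor `k` are hypothetical.

```python
from statistics import median
from typing import List, Sequence

Frame = Sequence[float]  # assumption: a frame flattened to grayscale intensities


def frame_differences(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute intensity change between consecutive frames."""
    return [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]


def flag_discontinuities(frames: Sequence[Frame], k: float = 5.0) -> List[int]:
    """Indices i where the change from frame i-1 to frame i is an outlier."""
    diffs = frame_differences(frames)
    med = median(diffs)
    mad = median([abs(d - med) for d in diffs])    # robust spread estimate
    threshold = med + k * max(mad, 1e-9)           # guard against zero spread
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]


# Eight 4-pixel frames that brighten slowly, with an abrupt jump at frame 5.
frames = [[float(t)] * 4 for t in range(5)] + [[float(t) + 50.0] * 4 for t in range(5, 8)]
print(flag_discontinuities(frames))  # [5]
```

A legitimate scene cut would trigger the same flag, which is one reason practical methods model motion and object identity rather than raw pixel change.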

Layer 3, Cross-Modal Consistency: Multi-Modal Verification Within the Video

Layer 3 represents a crucial turning point in the entire framework: detection begins to enter the realm of multi-modal verification within the video. The research question it focuses on is: Are the various modalities within the video—visuals, audio, subtitles—"telling the same story" across all levels?

Real videos often exhibit high alignment between accompanying audio, text, and visuals. AI-generated videos may exhibit systematic mismatches: lip movements–speech, identity–voiceprint, visuals–text. Third-layer methods perform fine-grained, multi-angle consistency analysis of inter-modal alignment. This includes three types:

Detecting consistency between sound and visuals.
Introducing subtitles, titles, transcribed text, or descriptive text for text–video semantic consistency reasoning.
Robust learning oriented towards temporally localizing inter-modal inconsistencies.
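The lip-movement/speech mismatch mentioned above can be illustrated with a lag search: slide an audio energy envelope against a mouth-opening signal and keep the offset with the highest correlation. This is a minimal sketch under the assumption that both signals have already been extracted and sampled at the same frame rate; the function names are hypothetical, and surveyed methods use learned audio-visual embeddings rather than raw correlation.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def best_av_lag(mouth: Sequence[float], audio: Sequence[float],
                max_lag: int = 5) -> Tuple[int, float]:
    """Return (lag, correlation); a positive lag means audio trails the mouth."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            m, a = mouth[: len(mouth) - lag], audio[lag:]
        else:
            m, a = mouth[-lag:], audio[: len(audio) + lag]
        n = min(len(m), len(a))
        r = pearson(m[:n], a[:n])
        if r > best[1]:
            best = (lag, r)
    return best


mouth = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
delayed_audio = [0.0] * 3 + mouth[:-3]    # audio arrives three frames late
lag, corr = best_av_lag(mouth, delayed_audio)
print(lag)  # 3
```

A large or unstable best lag, or a low peak correlation, would be a cue worth localizing in time; it is not proof of manipulation by itself, since capture and transcoding can also shift audio.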

Layer 4, Language-Guided World-Level Reasoning: Focusing on the Gap Between Video and the Real World

Layer 4 elevates the detection perspective from "internal consistency of the video" to "consistency with rules and knowledge in the external real world." The research question shifts to: At the semantic and factual level, is the video content plausible or possible in the real world?

All content in a real video should align with facts, physical laws, domain knowledge, common sense, etc., from the real world. AI-generated video content often struggles to fully align with the real world, which is precisely the detection space utilized by the fourth layer. This includes:

Using prompts, textual priors, text prototypes, or lightweight modules to recalibrate the model's representation space, making it easier for the model to correlate observed anomalies with more explicit semantic categories.
Treating detection as an investigation process, constructing an investigator agent that can consult sources, call tools, and revise judgments, linking judgments to evidence, tool outputs, and verification processes.
Using fine-tuning, preference learning, reward modeling, and reinforcement learning to train the model itself in "how to select evidence, how to organize explanations, how to reach conclusions," with a focus on producing detection outputs that are clear, structurally stable, and evidentially complete.

Evolution Map of Generation Side and Detection Side

Figure 4 | Evolution map of representative detection methods: escalating generation-side threats and advancing detection capabilities progress in parallel.

The figure above traces, along a timeline, the steadily rising "realism ceiling" of fake videos on the generation side. Set against the evolution of the foundation models underpinning detection (from deep convolutional and recurrent networks, to Vision Transformers, and then to reasoning-capable large vision-language models and agent systems), it shows detection progressing from visual forensics towards multimodal verification and high-level reasoning.

The review further provides temporal statistics on the distribution of detection methods across layers: the proportion of methods focusing on Layers 3 & 4 was only 7.7% in 2020, rose to 40.0% in 2023, and exceeded 50% in 2025.

Overall, the focus of detection methods is continuously shifting upwards: early efforts were concentrated primarily in Layers 1 and 2. As generated videos become smoother and more realistic, detection is increasingly moving into Layers 3 and 4.

Figure 5 | Statistical change in distribution of detection methods: proportion of language-perspective methods gradually rises.

Evaluation of Detection Methods

Facing the goal of factual fidelity verification, evaluating detection methods needs to answer: does the model capture transferable visual cues? Can it identify spatiotemporal and cross-modal inconsistencies? Can it effectively judge against facts, knowledge, and world constraints? The review systematically traces the evolution of evaluation metrics and datasets from the traditional Deepfake era to the present day.

Evaluation Metrics from a Vision-Language Dual-View

Shared Metrics: Acc / AUC Remain Necessary but Are Far from Sufficient

Accuracy, AUC, Precision, Recall, F1, Equal Error Rate (EER), PR-AUC, and aggregation methods (frame-level vs. video-level) remain the most basic common language for comparing different methods, enabling horizontal comparison across methods from different layers. However, while these fundamental evaluation metrics are still necessary, they are insufficient to meet the requirements for explainable, trustworthy evaluation under the goal of factual fidelity verification.
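The difference between frame-level and video-level aggregation is easy to see in a small example. The sketch below computes a rank-based AUC and an approximate EER in plain Python, then compares scoring every frame separately with mean-pooling frame scores per video. The scores and video names are made up for illustration, and mean pooling is just one possible aggregation.

```python
from typing import Dict, List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a fake (1) outscores a real (0); ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate EER: the threshold where FPR and FNR are closest."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        fnr = sum(p < t for p in pos) / len(pos)    # fakes scored below t
        fpr = sum(n >= t for n in neg) / len(neg)   # reals scored at or above t
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer


def video_level(frame_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean-pool frame scores into one score per video."""
    return {vid: sum(s) / len(s) for vid, s in frame_scores.items()}


frame_scores = {
    "real_a": [0.1, 0.2, 0.9], "real_b": [0.2, 0.3, 0.1],
    "fake_a": [0.8, 0.7, 0.9], "fake_b": [0.4, 0.6, 0.5],
}
video_label = {"real_a": 0, "real_b": 0, "fake_a": 1, "fake_b": 1}

flat_scores = [s for v in frame_scores.values() for s in v]
flat_labels = [video_label[vid] for vid, v in frame_scores.items() for _ in v]
pooled = video_level(frame_scores)

print(round(roc_auc(flat_scores, flat_labels), 3))                                  # 0.847
print(roc_auc(list(pooled.values()), [video_label[v] for v in pooled]))             # 1.0
```

The same detector looks noticeably different depending on the aggregation level, which is why the level should be reported alongside the metric.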

Metrics from the Visual Perspective: Assessing Robustness Under Real-World Interference

The evaluation focus here is on whether the detector's original cues remain valid when faced with distribution shifts, compression during dissemination, and real-world environmental interference. It is divided into two categories:

Robustness of Low-Level Cues: Includes metrics such as TPR at a fixed false-positive rate (TPR@FPR=α), cross-dataset testing, and perturbation stress tests.
Spatiotemporal and Physical Consistency: Focuses on video-level reporting, temporal perturbation drop, and motion ablation. These assess whether the model degrades significantly when temporal information is removed, which indicates whether the detector genuinely examines the continuity of the whole sequence rather than relying on single-frame shortcuts.
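Two of the metrics named above can be sketched in a few lines. `tpr_at_fpr` picks a threshold so that at most a fraction α of real videos are flagged, then reports the share of fakes caught. `temporal_perturbation_drop` compares accuracy before and after shuffling frame order. Both are minimal illustrations under stated assumptions: the exact definitions used in the surveyed benchmarks may differ, the 0.5 decision threshold is arbitrary, and `score_fn` stands in for any detector.

```python
import random
from typing import Callable, List, Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               alpha: float = 0.01) -> float:
    """TPR at a threshold that flags at most a fraction `alpha` of real videos."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed = int(alpha * len(neg))                # real videos we may misflag
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(p > threshold for p in pos) / len(pos)


def temporal_perturbation_drop(score_fn: Callable[[List[float]], float],
                               videos: Sequence[List[float]],
                               labels: Sequence[int], seed: int = 0) -> float:
    """Accuracy on original videos minus accuracy after shuffling frame order."""
    rng = random.Random(seed)

    def accuracy(vs: Sequence[List[float]]) -> float:
        return sum((score_fn(v) > 0.5) == bool(y)
                   for v, y in zip(vs, labels)) / len(labels)

    shuffled = [rng.sample(list(v), len(v)) for v in videos]
    return accuracy(videos) - accuracy(shuffled)


real = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.9]
fake = [0.5, 0.6, 0.95, 0.3]
scores, labels = real + fake, [0] * len(real) + [1] * len(fake)
print(tpr_at_fpr(scores, labels, alpha=0.1))   # 0.75
print(tpr_at_fpr(scores, labels, alpha=0.0))   # 0.25
```

A detector that only averages per-frame scores is unaffected by shuffling, so its drop is zero; that is the signature of a single-frame shortcut. A detector that actually uses temporal order would be expected to lose accuracy under the same perturbation.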

Metrics from the Language Perspective: Multimodal Localization and Reasoning Evaluation

The coverage of detection approaches from the language perspective is broader; a simple set of classification metrics can no longer summarize evaluation. The review proposes the following layered categorization:

Cross-Modal Alignment and Temporal Localization: These evaluation metrics assess the accuracy of detection in cross-modal alignment and the detector's ability to localize clues to specific time segments. Beyond basic Acc and AUC, common metrics also include Average Precision (AP), Average Recall (AR), Recall@K, mAP@IoU, etc.
World Knowledge and Reasoning: These metrics address the higher-level question "Can the events depicted in the video be supported by common sense, physical laws, external knowledge, and concrete evidence?" Evaluation here needs to introduce human judgments, pairwise preferences, question answering, and explanation-quality metrics such as BLEU, ROUGE-L, METEOR, CIDEr, and embedding-based similarity.
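For temporal localization, the building block behind AP and mAP@IoU is the overlap between a predicted time segment and an annotated one. The sketch below shows temporal IoU and a single-video average precision at one IoU threshold, using greedy matching in confidence order. The segment values are invented for illustration, and benchmark implementations may differ in matching and interpolation details.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision_at_iou(preds: Sequence[Tuple[Segment, float]],
                             truths: Sequence[Segment],
                             iou_thr: float = 0.5) -> float:
    """AP for one video: rank by confidence, match each truth at most once."""
    matched = set()
    hits, precisions = 0, []
    ranked = sorted(preds, key=lambda p: p[1], reverse=True)
    for rank, (seg, _conf) in enumerate(ranked, start=1):
        candidates = [(temporal_iou(seg, t), i)
                      for i, t in enumerate(truths) if i not in matched]
        if not candidates:
            continue
        iou, idx = max(candidates)
        if iou >= iou_thr:
            matched.add(idx)
            hits += 1
            precisions.append(hits / rank)   # precision at this recall point
    return sum(precisions) / len(truths) if truths else 0.0


truths = [(2.0, 4.0), (10.0, 12.0)]
preds = [((2.0, 4.0), 0.9), ((6.0, 7.0), 0.8), ((10.0, 13.0), 0.6)]
print(round(average_precision_at_iou(preds, truths, 0.5), 3))  # 0.833
print(average_precision_at_iou(preds, truths, 0.7))            # 0.5
```

Averaging this value over videos, and often over several IoU thresholds, gives the mAP@IoU figure reported in localization benchmarks.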

Datasets: Reorganized According to the Three Paradigms of Detection Objects

Most datasets used for training and evaluating detection methods naturally diverge along the aforementioned AI-generated video paradigms. The review organizes them as follows:

Datasets for the LMV Paradigm: Evaluation focus is primarily on the stability of visual cues used by detection methods and whether these cues remain effective under distortion, compression, and cross-domain dissemination conditions. These datasets are increasingly incorporating temporal reasoning and explainability evaluation to approach real-world conditions.
Datasets for the AVE Paradigm: These datasets often emphasize fine-grained temporal annotations, clearer cross-modal correspondences, and stronger modeling of local misalignments and semantic mismatches. They test whether models can detect when audio and video are not conveying the same content, locate the time segments where misalignments occur, and distinguish between synchronization issues, identity issues, and semantic issues.
Datasets for the GVS Paradigm: Fully synthetic videos, on one hand, continuously weaken explicit editing traces; on the other hand, they persistently present challenges to detection such as generator diversity, semantic misalignment, and transfer risks. Correspondingly, evaluation for this paradigm is evolving most rapidly—from early efforts collecting large volumes of fully synthetic videos to evaluate detection accuracy, to works like LOKI, GenWorld, DAVID-X, and DeeptraceReward that incorporate world simulation, defect-level annotations, and human-perceptible forgery cues into the evaluation system.

From "Can Distinguish" to "Can Provide Evidence"

High-fidelity AI-generated videos are continuously raising the realism ceiling of forged content. The problem facing the detection task is increasingly difficult to summarize with a simple real/fake score; it necessitates factual fidelity verification. Correspondingly, the evaluation stage and detection systems also need to expand along with this extended task boundary:

Evidence-First Dynamic Evaluation System

For newly emerging complex AI-generated videos with long temporal spans, evaluation needs to answer not just "can the model classify?" but also "on what evidence did the model base its correct or incorrect judgment?". Coarse-grained labels can obscure much of the truly critical information, so data annotation, model training, and result reporting need to advance together. Videos should be decomposed into verifiable propositional units, turning "long sequential narratives" into operable structured objects such as event chains, entity state trajectories, or event graphs, so that causal and constraint verification is possible over long timescales. This makes it possible to ask "which specific propositions did the detection capture" and "whether evidence and judgment correspond one-to-one."
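One way to picture an "entity state trajectory" is as a time-ordered list of states per entity, checked against transitions the real world does not allow. The sketch below is a deliberately tiny, hypothetical example: the `StateEvent` schema and the single forbidden transition are assumptions made for illustration, and extracting such events from video is the hard part that this code does not address.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StateEvent:
    t: float      # timestamp in seconds
    entity: str   # e.g. "cup"
    state: str    # e.g. "intact", "broken"


# Hypothetical world constraint: a broken object does not become intact again.
FORBIDDEN = {("broken", "intact")}


def find_violations(events: List[StateEvent]) -> List[Tuple[StateEvent, StateEvent]]:
    """Return consecutive per-entity state pairs that hit a forbidden transition."""
    by_entity: Dict[str, List[StateEvent]] = {}
    for e in sorted(events, key=lambda e: e.t):
        by_entity.setdefault(e.entity, []).append(e)
    violations = []
    for track in by_entity.values():
        for prev, cur in zip(track, track[1:]):
            if (prev.state, cur.state) in FORBIDDEN:
                violations.append((prev, cur))
    return violations


events = [
    StateEvent(1.0, "cup", "intact"),
    StateEvent(3.0, "cup", "broken"),
    StateEvent(8.0, "cup", "intact"),
    StateEvent(2.0, "person", "standing"),
    StateEvent(5.0, "person", "sitting"),
]
for prev, cur in find_violations(events):
    print(f"{prev.entity}: {prev.state}@{prev.t}s -> {cur.state}@{cur.t}s")
# cup: broken@3.0s -> intact@8.0s
```

Each reported violation names an entity, two timestamps, and the rule that was broken, which is the kind of proposition-level evidence the evaluation described above could score.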

Furthermore, most detectors are still evaluated under a "closed world" assumption. In real deployment scenarios, new video generation models, editing tools, and content styles continuously emerge, and different platforms introduce their own downsampling, transcoding, and filtering pipelines. To bridge this long-term robustness gap, there is a need to adopt arena/leaderboard-style continuous update mechanisms, incorporating newly released generators and new platform transcoding pipelines into the evaluation set in a streaming fashion.
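A continuously updated leaderboard can be thought of as a table of results keyed by generator and platform pipeline, to which new rows keep arriving. The sketch below is a minimal, hypothetical data structure for that bookkeeping; the class name, the slice keys, and the generator and pipeline labels are placeholders, not an existing benchmark's API.

```python
from collections import defaultdict
from typing import Dict, Tuple

Slice = Tuple[str, str]  # (generator name, transcoding pipeline)


class RollingLeaderboard:
    """Per-detector accuracy, broken down by (generator, pipeline) slice."""

    def __init__(self) -> None:
        # detector -> slice -> [correct, total]
        self._counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, detector: str, generator: str, pipeline: str,
               correct: bool) -> None:
        cell = self._counts[detector][(generator, pipeline)]
        cell[0] += int(correct)
        cell[1] += 1

    def slice_accuracy(self, detector: str) -> Dict[Slice, float]:
        return {k: c / n for k, (c, n) in self._counts[detector].items()}

    def worst_slice(self, detector: str) -> Tuple[Slice, float]:
        return min(self.slice_accuracy(detector).items(), key=lambda kv: kv[1])


board = RollingLeaderboard()
board.record("detector_x", "gen_v1", "raw", True)
board.record("detector_x", "gen_v1", "raw", True)
# A newly released generator and a platform transcode arrive later.
board.record("detector_x", "gen_v2", "platform_transcode", True)
board.record("detector_x", "gen_v2", "platform_transcode", False)
print(board.worst_slice("detector_x"))  # (('gen_v2', 'platform_transcode'), 0.5)
```

Reporting the worst slice next to the average keeps a detector from looking robust simply because older, easier generators dominate the test set.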

Collaborative Dual-View Trustworthy and Explainable Detection System

To achieve explainable detection under the factual fidelity goal, the perception and cognition pathways must be balanced: the visual perspective's ability to reveal visual artifacts and spatiotemporal inconsistencies is combined with the linguistic perspective's ability to perform structured reasoning, integrating the four-layer method landscape across both views. On one hand, current vision-language models and video understanding models perform relatively poorly on judgments of "perceptual fidelity" and need to be supplemented by visual-perspective methods. On the other hand, videos that are highly perceptually faithful, produced by stronger generation models and anti-detection techniques, require detection at the semantic and factual level from the linguistic perspective.

Further, it is essential to establish an explicit reasoning path of "identification–localization–explanation." This means that within the aforementioned dual-pathway system, every tool call or knowledge reference must be strictly bound to a specific argumentation step.
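The binding requirement can be checked mechanically once a reasoning trace and its evidence are written down. The sketch below validates a trace in both directions: every step must cite known evidence, and every piece of evidence (tool output, knowledge reference) must be cited by some step. The `Step` schema, stage names as strings, and evidence IDs are hypothetical conventions chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    stage: str      # "identification", "localization", or "explanation"
    claim: str
    evidence_ids: List[str] = field(default_factory=list)


def unbound_items(steps: List[Step], evidence: Dict[str, str]) -> Dict[str, List[str]]:
    """Report steps with no or unknown evidence, and evidence no step cites."""
    known = set(evidence)
    cited = {eid for s in steps for eid in s.evidence_ids}
    return {
        "steps_without_evidence": [s.claim for s in steps if not s.evidence_ids],
        "unknown_evidence": sorted(cited - known),
        "unused_evidence": sorted(known - cited),
    }


evidence = {
    "e1": "optical-flow spike at 00:03",
    "e2": "transcript does not match lip movements",
    "e3": "reverse image search result",
}
steps = [
    Step("identification", "Video is likely synthetic", ["e1"]),
    Step("localization", "Anomaly at 00:03-00:05", ["e1", "e4"]),
    Step("explanation", "Object motion violates gravity"),
]
print(unbound_items(steps, evidence))
# {'steps_without_evidence': ['Object motion violates gravity'],
#  'unknown_evidence': ['e4'], 'unused_evidence': ['e2', 'e3']}
```

An empty report in all three fields is the mechanical counterpart of "every tool call or knowledge reference is bound to a specific argumentation step"; it does not, of course, establish that the evidence is correct.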

Additionally, the "content-side" detection system described above needs to be cross-verified against any available "source-side" authentication signals, connecting content analysis with source tracing. Together, these form a cross-layer, multimodal detection system with a trustworthy, explainable evidence space.

Conclusion

AI video detection is a task that will only become more challenging.

For future AIGC-V detection research and practical applications, this review provides a map closer to real-world needs. It redefines the task of AI-generated video detection, proposes a "Vision–Language Dual-View" four-layer framework, and systematically organizes existing methods, related benchmarks, and evaluation metrics accordingly. It also connects these layers to challenges in real deployment, gaps in current evaluations, and emerging development directions.

Following this framework, it points out several key requirements for trustworthy detection, including evidence-first prioritization, traceable conclusions, and maintaining robustness across generators and real-world conditions.

Looking ahead, trustworthy AI video detection can hardly be accomplished by any single field independently. It is becoming a cross-disciplinary issue that requires joint attention from CV, NLP, multimodal understanding, and world model research: CV provides spatiotemporal evidence modeling and forensic robustness; NLP provides proposition decomposition, reasoning, evidence grounding, and explanatory capabilities; multimodal and world model research provides stronger cross-modal alignment capabilities and richer priors regarding physics, causality, and temporal consistency.

Only by truly integrating these capabilities can video detection gradually move beyond the search for local artifacts towards a more rigorous "view of reality": the question is no longer just whether a video looks plausible, but whether its entities, events, and dynamic processes remain faithful to the constraints of the real world—searching for the increasingly blurred boundary between the virtual world and the real world.

References: https://www.researchgate.net/doi/10.13140/RG.2.2.31713.88168

This article is from the WeChat public account "新智元", edited by LRST.

Related Questions

Q: According to the article, what are the three paradigms of AI-generated video defined by the survey?

A: The survey defines three paradigms of AI-generated video: 1) Local Manipulation Video (LMV), which involves modifying local areas of a real video, as in face swapping. 2) Audio-Visual Editing (AVE), which involves editing the correspondence between visual and audio elements, as in lip-syncing or re-dubbing. 3) Generative Video Synthesis (GVS), which involves end-to-end generation of entire videos from scratch using models like Sora.

Q: What is the proposed new goal for AI-generated video detection, as redefined in the article?

A: The article redefines the goal of AI-generated video detection as "factual fidelity verification." This means verifying whether the propositions about "who, when, where, and what happened" in the video content are both perceptually and cognitively consistent with the real world. It involves checking for conflicts with external facts, physical laws, and world knowledge, not just visual artifacts.

Q: What are the four layers of the Vision-Language Dual-View framework proposed for video detection methods?

A: The four layers of the Vision-Language Dual-View framework are: 1) Layer 1: Intrinsic Cues Analysis (low-level visual signal statistics). 2) Layer 2: Spatiotemporal Consistency (checking temporal and motion coherence). 3) Layer 3: Cross-Modal Consistency (verifying alignment between modalities like video, audio, and text within the video). 4) Layer 4: Language-Guided World-Level Reasoning (checking video content against real-world knowledge, facts, and physical plausibility).

Q: What key shift in the focus of detection methods does the article highlight based on the timeline statistics?

A: The article highlights a significant shift in the focus of detection methods from lower-level visual perspectives to higher-level language-guided reasoning. Statistics show that the proportion of methods focusing on the language-guided layers (Layers 3 and 4) increased from 7.7% in 2020 to over 50% in 2025. This indicates that the detection community is moving beyond visual forensics to address semantic and factual inconsistencies as AI videos become more perceptually realistic.

Q: What are the two main requirements for building a trustworthy and explainable detection system outlined in the conclusion?

A: The conclusion outlines two key requirements: 1) Establishing a dynamic, evidence-first evaluation system. This involves breaking videos down into verifiable propositional units for precise evidence-judgment mapping, and incorporating continuous, arena-style testing against new generators and real-world processing pipelines. 2) Building a collaborative dual-perspective system. This requires combining the low-level visual evidence from the visual perspective with the high-level structured reasoning from the language perspective, creating explicit "identification-localization-explanation" reasoning paths where each step is traceable to specific evidence.

Пов'язані матеріали

Dawn Song, The First Lady of Computer Security, Joins Meta

Computer security and AI safety pioneer Dawn Song has joined Meta's Superintelligence Labs as Vice President of AI Research, reporting to MSL head Nat Friedman. Song, a professor at UC Berkeley's EECS department and a MacArthur Fellow, is renowned for her foundational work in dynamic taint analysis and her leadership in adversarial machine learning and AI agent security. Her lab is considered a top training ground in computer security. Song is also a founder of Oasis Labs and Virtue AI, a company focused on enterprise AI safety infrastructure. Virtue AI co-founders Bo Li and Sanmi Koyejo, along with other team members, are also joining Meta, a move seen as strengthening Meta's safety measures for AI agents amid growing industry concerns. Separately, Denny Zhou, founder of Google's Gemini Reasoning Team and a key figure in advancing large language model reasoning techniques like Chain-of-Thought, reportedly joined Meta several months ago. These high-profile hires come as Meta seeks to deploy AI across its products while assuring regulators and the public of its models' robustness against misuse.

marsbit24 хв тому

Dawn Song, The First Lady of Computer Security, Joins Meta

marsbit24 хв тому
