How to Detect AI-Generated Videos? A Review of Dynamic, Traceable, and Explainable Detection Systems

marsbit · Published 2026-06-26 · Last updated 2026-06-26

Summary

**How to Detect AI-Generated Videos: A Survey on Dynamic, Traceable, and Explainable Detection Systems**

With rapid advances in AI video generation (e.g., Sora, Veo), highly realistic multi-minute videos are now possible, and the gap between generation and detection research is widening. Current AI video detection, often limited to unreliable binary classification, is insufficient. This survey, accepted at ACL 2026, reframes the goal as **"factual fidelity verification"**: checking whether a video's content (who, when, where, what) aligns with the real world perceptually and cognitively. It categorizes AI-generated videos into three paradigms: **Local Manipulation Video (LMV)**, e.g., face swaps; **Audio-Visual Editing (AVE)**, e.g., lip-syncing; and **Generative Video Synthesis (GVS)**, fully synthetic videos like Sora's. Detection challenges evolve from visual artifacts in LMV to multi-modal inconsistencies in AVE and higher-level world knowledge violations in GVS. The core proposal is a **Vision-Language Dual-View framework** with four hierarchical layers:

1. **Layer 1 (Intrinsic Visual Cues):** Analyzes low-level signal statistics, noise patterns, and physiological signals.
2. **Layer 2 (Spatiotemporal Consistency):** Checks for temporal coherence in object motion and scene dynamics.
3. **Layer 3 (Cross-Modal Consistency):** Verifies alignment between video, audio, and text within the video.
4. **Layer 4 (Language-Guided World-Level Reasoning):** Uses external knowledge, facts, and ph...

Over the past two years, video generation models have evolved rapidly, from the striking results of Sora at the end of 2024 to the near-simultaneous arrival earlier this year of Google Veo, Sora 2, the Kling series, and Seedance 2.0. The quality of AI-generated video has taken a qualitative leap: models can now produce cinema-grade, realistic footage lasting several minutes, with multiple characters and complex scenes.

In contrast to the rapid progress on the generation side, research interest in AI-generated video detection has remained lukewarm.

Yet the social impact is not hard to observe. Because videos are multimodal, their potential to deceive is far greater:

AI-generated fake videos appear frequently on social platforms, and their quantity, quality, and reach are growing rapidly. When users ask foundation models such as Grok or Doubao "Is this video AI-generated?", the answer is often a bare binary judgment with little explanation and little credibility. On platforms like Xiaohongshu, genuinely recorded videos are often labeled as "suspected AI-generated."

A wide gap separates the rapid development of generation from the limited attention paid to detection. Three questions need prompt answers: with AI video generation iterating this quickly, what stage has research on AI-generated video detection reached, what paradigm shifts is it undergoing, and what directions should it pursue?

Against this backdrop, researchers from MBZUAI, Renmin University of China, and Harvard University jointly published a comprehensive review. For the first time, it systematically organizes technical approaches from both visual and linguistic perspectives, spanning low-level visual perception to high-level world reasoning. On that basis, the review analyzes what an urgently needed trustworthy detection system should look like: one that couples evidence across multiple layers and is dynamic, traceable, and explainable. The work has been accepted for publication at ACL 2026.

Paper Link: https://www.researchgate.net/doi/10.13140/RG.2.2.31713.88168

GitHub Link: https://github.com/dxhou/AI-Generated-Video-Detection

Homepage Link: https://AIgcvdetection.github.io

Redefining the Goal of AI-Generated Video Detection

Figure 1 | The complete pipeline of AI-generated video detection: from generation side, dual-view detection, to evidence sets.

Before the explosion of generative AI, AI-generated videos left relatively obvious visual artifacts. Under that premise, frame-level visual verification was effective enough in early Deepfake scenarios, of which face-swapping is the typical example.

Over the past two years, however, video quality has gradually outgrown this premise. The human eye is increasingly unable to judge the authenticity of realistic, complete videos. Detection that outputs only a binary label no longer meets the need; the urgent question is what evidence a detector's judgment rests on and why that judgment can be trusted.

The review first moves the boundary of the detection problem: it argues that detection output should shift from "real/fake binary classification" to interpretable, trustworthy, structured judgment, so that the object of detection becomes the gap between the "virtual world" in the video and the "real world".

Accordingly, the review redefines the detection goal as "factual fidelity verification": verifying whether propositions about "who, when, where, what happened" in the video content are consistent with the real world, both perceptually and cognitively. Beyond cross-modal verification between vision and other modalities, this requires judging whether those propositions conflict with external facts, physical laws, and world knowledge.

Detection Objects: Three Paradigms of AI-Generated Videos

Figure 2 | Three paradigms of AI-generated videos defined in this review.

From 2020 to the present, AI-generated videos have gone through a paradigm shift: from local modifications via GANs in the early Deepfake era, to audio-visual recombination such as lip-syncing and voice cloning, and then to fully synthesized videos produced by latent diffusion models positioned as "world simulators" (as with Sora). The review classifies AI-generated videos into the following three paradigms:

Local Manipulation Video (LMV) with Real Carrier Retention

LMV has long been the most typical and mature target of traditional Deepfake detection. An LMV modifies local regions of a genuinely recorded video, for example through face-swapping or background replacement, while most of the original structure (scenes, character actions, camera motion, lighting relationships) usually remains. Most early methods therefore focused on local artifacts, frequency-domain features, geometric anomalies, and regional consistency. As generative models improve at local fusion, lighting adaptation, and identity transfer, and as platform processing and re-sharing erase many subtle traces, the focus for the LMV paradigm is shifting towards the robustness of detection methods across scenarios.

Audio-Visual Editing (AVE) under Cross-Modal Coupling Constraints

The AVE paradigm emerged mainly in 2024. In this type of AI-generated video, what is altered is the set of established correspondences within the video itself, such as the relationships among visual content, sound, lip movements, speaker identity, speech rhythm, and subtitle content. Examples include speech-driven facial synthesis, re-dubbing original videos, modifying lip movements, and changing speakers. This forces detection to move from looking for visual artifacts to checking whether the relationships between modalities actually hold, examining audio, lip movements, identity, and content together to find discriminative clues.

End-to-End Generative Video Synthesis (GVS)

In the GVS paradigm, which took off in 2025, models generate entire video sequences directly from conditioning inputs such as text, images, or noise. They no longer rely on a real video as a base, which presents entirely new challenges for detection.

These videos often appear very realistic in single frames or over short periods, but vulnerabilities tend to appear over long spatiotemporal sequences: for example, characters' actions or positions in scenes may fail to connect logically from start to finish; object shapes or movements may change in ways that violate physical laws; or the events depicted in the video may be impossible in the real world.

Correspondingly, detection approaches for the GVS paradigm cannot be confined to local or inter-modal consistency. They need to move towards higher levels, starting from long-range consistency, common sense, physical laws, narrative and causality, proposition-level truthfulness, and traceability. Detection must verify over long sequences whether the content itself is plausible, examining whether the video content can hold true across all levels within the constraints of the real world.

Four-Layer Taxonomy of Detection Methods from a Vision-Language Dual-View Perspective

Figure 3 | Vision-Language Dual-View four-layer framework: the first two layers lean towards the visual perspective, the latter two move towards the linguistic perspective.

Research on AI-generated video detection has split along two modal perspectives, each with its own core scientific question. The first starts from the visual modality and focuses on low-level signal forensics and the spatiotemporal consistency of the visuals. The second starts from the language modality. It examines cross-modal linguistic information within the video itself, asking "is the video narrating coherently, with good cross-modal alignment?", and it uses language to bring in reasoning about world knowledge and facts, asking "can the video content withstand scrutiny against external real-world knowledge, facts, and laws?"

Capturing this trend, the review proposes organizing AI-generated video detection research methods and evaluation paradigms from a Vision-Language Dual-View perspective. Based on this, it further proposes the following four-layer landscape of methods, progressing from low-level perception to high-level cognition:

Layer 1, Intrinsic Cues Analysis: The First Screening Net

Methods in Layer 1 address the research question: At the level of low-level visual signals, does the video conform to the statistical patterns that real videos must satisfy, and does the video contain low-level cues introduced by AI model generation or editing operations?

Real videos satisfy characteristic statistical properties at the signal level, and footage obtained through real capture and processing is naturally consistent with its acquisition, encoding, and post-processing pipelines. In contrast, the AI generation process often leaves clues that deviate from the real-video distribution: monotonous stylistic patterns, model-specific watermarks and artifacts, detectable artificial physiological signals, and so on. Layer 1 methods take a visual perspective, performing forensics by modeling, extracting, and amplifying these low-level signals. This includes detecting:

Pixel and geometric anomalies like frequency domain patterns, textures, boundaries, noise patterns.
Physiological signals on human faces like pulse coupling, subtle muscle movements, and blinking rhythms.
Whether systematic shifts exist in the feature space between real and fake videos.
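
As one illustration of the frequency-domain cues listed above, the sketch below computes a radially averaged log power spectrum for a single grayscale frame with NumPy. The function name `radial_power_spectrum`, the 16-bin default, and the choice of feature are illustrative assumptions rather than a method taken from the review; practical detectors typically learn from features like this instead of thresholding them directly.

```python
import numpy as np

def radial_power_spectrum(frame: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Radially averaged log power spectrum of one grayscale frame (H, W)."""
    h, w = frame.shape
    # Remove the mean so the DC term does not dominate, then centre the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(frame - frame.mean()))
    power = np.abs(spectrum) ** 2

    # Distance of every frequency coefficient from the spectrum centre.
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)
    max_r = min(h, w) / 2

    # Keep the inscribed disc and assign each coefficient to a radial bin.
    mask = radius < max_r
    bin_idx = (radius[mask] / max_r * n_bins).astype(int)
    bin_idx = np.minimum(bin_idx, n_bins - 1)

    sums = np.bincount(bin_idx, weights=power[mask], minlength=n_bins)
    counts = np.bincount(bin_idx, minlength=n_bins)
    return np.log1p(sums / np.maximum(counts, 1))
```

The low-index bins summarize coarse structure and the high-index bins summarize fine texture and noise. Averaging this vector over sampled frames would give a compact per-video descriptor that a downstream classifier could consume.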

Layer 2, Spatiotemporal Consistency: Checking "Does the Video Flow Smoothly?"

Layer 2 methods treat a video as a sequence of frames combined across space and time. Their research question is: in the spatiotemporal dimension, does the video's image stream exhibit the characteristics that the motion of real objects must satisfy? Real captured videos are constrained by continuous camera trajectories and real scenes; objects and backgrounds change between adjacent frames in continuous, predictable ways consistent with physical feasibility and camera motion. In contrast, AI-generated videos may show spatiotemporal discontinuities over longer sequences, such as object or background distortion or sudden local blurring. This includes detecting:

Temporal and motion inconsistencies like local object deformation, background drift, sudden blurring, motion residue anomalies.
Human behavior and interaction dynamics like expression changes, identity dynamics, interaction rhythms between characters in the scene.
Physical and frequency anomalies related to temporal frequency and visual continuity.
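
A very simple stand-in for the temporal cues above is to score how abruptly consecutive frames change relative to the rest of the clip. The sketch below is a minimal illustration under stated assumptions: frames arrive as a grayscale NumPy array of shape (T, H, W), and the name `temporal_jump_scores` and the median/MAD normalization are choices made here, not something the review prescribes. Methods in this layer generally rely on richer signals such as optical flow or learned motion features.

```python
import numpy as np

def temporal_jump_scores(frames: np.ndarray) -> np.ndarray:
    """Robust z-score of the mean absolute difference between adjacent frames.

    frames: array of shape (T, H, W), grayscale.
    Returns an array of length T - 1; large values mark abrupt transitions.
    """
    frames = frames.astype(np.float64)
    # Mean absolute pixel change for each pair of neighbouring frames.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
    # Normalize with median and MAD so a few outliers do not hide themselves.
    median = np.median(diffs)
    mad = np.median(np.abs(diffs - median))
    return (diffs - median) / (1.4826 * mad + 1e-8)
```

A transition whose score is far above the others is a candidate location for sudden blurring or deformation. Legitimate scene cuts also score high, so a score like this can only nominate segments for closer inspection.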

Layer 3, Cross-Modal Consistency: Multi-Modal Verification Within the Video

Layer 3 represents a crucial turning point in the entire framework: detection begins to enter the realm of multi-modal verification within the video. The research question it focuses on is: Are the various modalities within the video—visuals, audio, subtitles—"telling the same story" across all levels?

Real videos often exhibit high alignment between accompanying audio, text, and visuals. AI-generated videos may exhibit systematic mismatches: lip movements–speech, identity–voiceprint, visuals–text. Third-layer methods perform fine-grained, multi-angle consistency analysis of inter-modal alignment. This includes three types:

Detecting consistency between sound and visuals.
Introducing subtitles, titles, transcribed text, or descriptive text for text–video semantic consistency reasoning.
Robust learning oriented towards temporally localizing inter-modal inconsistencies.
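
To make the sound-visual consistency check concrete, the sketch below estimates the offset between a per-frame mouth-opening signal and a per-frame audio energy envelope by searching for the lag with the highest Pearson correlation. Both input signals, the function names, and the `max_lag` window are hypothetical; extracting such signals (for example from face landmarks and an audio track) is assumed to happen elsewhere, and published lip-sync detectors use learned audio-visual embeddings rather than this hand-built correlation.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def best_lag(mouth_opening, audio_energy, max_lag=5):
    """Return (lag, correlation) where mouth_opening[t] best matches audio_energy[t + lag]."""
    n = min(len(mouth_opening), len(audio_energy))
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        # Restrict to indices t for which both t and t + lag are valid.
        start, stop = max(0, -lag), min(n, n - lag)
        xs = mouth_opening[start:stop]
        ys = audio_energy[start + lag:stop + lag]
        score = pearson(xs, ys)
        if score > best[1]:
            best = (lag, score)
    return best
```

A positive lag means the audio trails the mouth motion by that many frames. A large offset, or a low peak correlation at every lag, is the kind of lip-speech mismatch this layer looks for.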

Layer 4, Language-Guided World-Level Reasoning: Focusing on the Gap Between Video and the Real World

Layer 4 elevates the detection perspective from "internal consistency of the video" to "consistency with rules and knowledge in the external real world." The research question shifts to: At the semantic and factual level, is the video content plausible or possible in the real world?

All content in a real video should align with facts, physical laws, domain knowledge, common sense, etc., from the real world. AI-generated video content often struggles to fully align with the real world, which is precisely the detection space utilized by the fourth layer. This includes:

Using prompts, textual priors, text prototypes, or lightweight modules to recalibrate the model's representation space, making it easier for the model to correlate observed anomalies with more explicit semantic categories.
Treating detection as an investigation process, constructing an investigator agent that can consult sources, call tools, and revise judgments, linking judgments to evidence, tool outputs, and verification processes.
Using fine-tuning, preference learning, reward modeling, and reinforcement learning to train into the model itself how to select evidence, organize explanations, and reach conclusions, with a focus on producing detection outputs that are clear, structurally stable, and complete in their evidence.

Evolution Map of Generation Side and Detection Side

Figure 4 | Evolution map of representative detection methods: escalating generation-side threats and advancing detection capabilities progress in parallel.

The figure above shows, along a timeline, how the generation side keeps raising the "realism ceiling" that fake videos can reach. It sets this against the evolution of the foundation models underpinning detection, from deep convolutional and recurrent networks, to Vision Transformers, to reasoning-capable large vision-language models and agent systems, and shows the detection side progressing from visual forensics towards multimodal verification and high-level reasoning-based detection.

The review further provides temporal statistics on the distribution of detection methods across layers: the proportion of methods focusing on Layers 3 & 4 was only 7.7% in 2020, rose to 40.0% in 2023, and exceeded 50% in 2025.

Overall, the focus of detection methods is continuously shifting upwards: early efforts were concentrated primarily in Layers 1 and 2. As generated videos become smoother and more realistic, detection is increasingly moving into Layers 3 and 4.

Figure 5 | Statistical change in distribution of detection methods: proportion of language-perspective methods gradually rises.

Evaluation of Detection Methods

Facing the goal of factual fidelity verification, evaluating detection methods needs to answer: does the model capture transferable visual cues? Can it identify spatiotemporal and cross-modal inconsistencies? Can it effectively judge against facts, knowledge, and world constraints? The review systematically traces the evolution of evaluation metrics and datasets from the traditional Deepfake era to the present day.

Evaluation Metrics from a Vision-Language Dual-View

Shared Metrics: Acc / AUC Remain Necessary but Are Far from Sufficient

Accuracy, AUC, Precision, Recall, F1, Equal Error Rate (EER), PR-AUC, and aggregation methods (frame-level vs. video-level) remain the most basic common language for comparing different methods, enabling horizontal comparison across methods from different layers. However, while these fundamental evaluation metrics are still necessary, they are insufficient to meet the requirements for explainable, trustworthy evaluation under the goal of factual fidelity verification.
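
For reference, the sketch below shows how three of these shared quantities can be computed from scratch: AUC via pairwise ranking, an approximate EER found by scanning thresholds, and video-level aggregation of frame-level scores by averaging. Labels use 1 for fake and 0 for real, higher scores mean "more likely fake", and the function names and the mean-pooling choice are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def roc_auc(labels, scores):
    """AUC as the probability that a random fake outscores a random real (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def equal_error_rate(labels, scores):
    """Approximate EER: error rate at the threshold where FPR and FNR are closest."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        fnr = sum(p < thr for p in pos) / len(pos)    # fakes scored below threshold
        fpr = sum(n >= thr for n in neg) / len(neg)   # reals scored at or above it
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer

def video_level(frame_scores):
    """Average frame-level scores per video: [(video_id, score), ...] -> {video_id: mean}."""
    by_video = defaultdict(list)
    for video_id, score in frame_scores:
        by_video[video_id].append(score)
    return {vid: mean(vals) for vid, vals in by_video.items()}
```

Reporting whether a number is frame-level or video-level matters because the two can differ substantially for the same detector; mean pooling is only one of several possible aggregation rules.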

Metrics from the Visual Perspective: Assessing Robustness Under Real-World Interference

The evaluation focus here is on whether the detector's original cues remain valid when faced with distribution shifts, compression during dissemination, and real-world environmental interference. It is divided into two categories:

Robustness of Low-Level Cues: Includes TPR at a fixed false-positive rate (TPR@FPR=α), cross-dataset testing, perturbation stress tests, etc.
Spatiotemporal and Physical Consistency: Focuses on video-level reporting, temporal perturbation drop, motion ablation, and assessing whether the model significantly degrades when temporal information is removed, thereby evaluating if the detector is genuinely examining the continuity of the entire video sequence rather than relying on shortcuts from single frames.
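
The two ideas above can be sketched in a few lines. `tpr_at_fpr` picks a threshold from the scores of real videos so that at most a fraction α of them are flagged, then reports the share of fakes caught. `shuffle_frames` and `temporal_perturbation_drop` outline a temporal perturbation test: score the detector on original and frame-shuffled clips and compare. All names and the shuffling protocol are illustrative assumptions; individual benchmarks define their own perturbations.

```python
import random

def tpr_at_fpr(labels, scores, alpha=0.01):
    """TPR when the threshold is set so that FPR on real videos does not exceed alpha."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    allowed = int(alpha * len(neg))  # how many real videos may be flagged
    # A video is flagged as fake when its score is strictly above the threshold.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    return sum(p > threshold for p in pos) / len(pos)

def shuffle_frames(frames, seed=0):
    """Return the frames in a random order, destroying temporal structure only."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled

def temporal_perturbation_drop(metric_original, metric_shuffled):
    """Drop in a metric after shuffling; a value near zero hints at single-frame shortcuts."""
    return metric_original - metric_shuffled
```

If a detector scores almost the same on shuffled clips, it is probably not using temporal continuity at all, which is exactly what this category of metrics is meant to expose.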

Metrics from the Language Perspective: Multimodal Localization and Reasoning Evaluation

The coverage of detection approaches from the language perspective is broader; a simple set of classification metrics can no longer summarize evaluation. The review proposes the following layered categorization:

Cross-Modal Alignment and Temporal Localization: These evaluation metrics assess the accuracy of detection in cross-modal alignment and the detector's ability to localize clues to specific time segments. Beyond basic Acc and AUC, common metrics also include Average Precision (AP), Average Recall (AR), Recall@K, mAP@IoU, etc.
World Knowledge and Reasoning: These metrics address the higher-level question "Can the events depicted in the video be supported by common sense, physical laws, external knowledge, and concrete evidence?" Evaluation here needs to bring in human judgments, pairwise preferences, question answering, and explanation-quality metrics such as BLEU, ROUGE-L, METEOR, CIDEr, and embedding-based similarity.
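
The temporal localization metrics mentioned above rest on the overlap between predicted and annotated time segments. The sketch below computes temporal IoU and a non-interpolated average precision at a single IoU threshold for one video; mAP@IoU figures are usually obtained by averaging such values over videos and thresholds. The function names, the greedy one-to-one matching, and the tuple layouts are assumptions for illustration, and benchmark toolkits may differ in these details.

```python
def temporal_iou(a, b):
    """IoU of two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_precision_at_iou(predictions, ground_truth, iou_threshold=0.5):
    """AP for one video. predictions: [(start, end, confidence)], ground_truth: [(start, end)]."""
    if not ground_truth:
        return 0.0
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    matched = set()
    true_pos, precision_sum = 0, 0.0
    for rank, (start, end, _) in enumerate(ranked, start=1):
        # Greedily match each prediction to the best still-unmatched annotation.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
            precision_sum += true_pos / rank
    return precision_sum / len(ground_truth)
```

Raising `iou_threshold` makes the metric stricter about segment boundaries, which is why results are commonly reported at several thresholds.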

Datasets: Reorganized According to the Three Paradigms of Detection Objects

Most datasets used for training and evaluating detection methods naturally diverge along the aforementioned AI-generated video paradigms. The review organizes them as follows:

Datasets for the LMV Paradigm: Evaluation focus is primarily on the stability of visual cues used by detection methods and whether these cues remain effective under distortion, compression, and cross-domain dissemination conditions. These datasets are increasingly incorporating temporal reasoning and explainability evaluation to approach real-world conditions.
Datasets for the AVE Paradigm: These datasets often emphasize fine-grained temporal annotations, clearer cross-modal correspondences, and stronger modeling of local misalignments and semantic mismatches. They test whether models can detect when audio and video are not conveying the same content, locate the time segments where misalignments occur, and distinguish between synchronization issues, identity issues, and semantic issues.
Datasets for the GVS Paradigm: Fully synthetic videos, on one hand, continuously weaken explicit editing traces; on the other hand, they persistently present challenges to detection such as generator diversity, semantic misalignment, and transfer risks. Correspondingly, evaluation for this paradigm is evolving most rapidly—from early efforts collecting large volumes of fully synthetic videos to evaluate detection accuracy, to works like LOKI, GenWorld, DAVID-X, and DeeptraceReward that incorporate world simulation, defect-level annotations, and human-perceptible forgery cues into the evaluation system.

From "Can Distinguish" to "Can Provide Evidence"

High-fidelity AI-generated videos are continuously raising the realism ceiling of forged content. The problem facing the detection task is increasingly difficult to summarize with a simple real/fake score; it necessitates factual fidelity verification. Correspondingly, the evaluation stage and detection systems also need to expand along with this extended task boundary:

Evidence-First Dynamic Evaluation System

Facing newly emerging AI-generated complex videos with long temporal spans, evaluation needs to answer not just "can the model classify?" but also "on what evidence did the model base its correct or incorrect judgment?". Coarse-grained evaluation labels can obscure a great deal of truly critical information. Data annotation, model training, and result reporting in evaluation need to advance together. There is a need to decompose videos back into verifiable propositional units, transforming "long sequential narratives" into operable structured objects like event chains, entity state trajectories, or event graphs, to facilitate causal and constraint verification over long timescales. This allows further interrogation of "which specific propositions did the detection capture" and "whether evidence and judgment correspond one-to-one."
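
One way to picture "verifiable propositional units" is as small records that judgments and evidence both point to. The sketch below is a hypothetical data model, not the review's schema: the `Proposition` and `Evidence` fields, the `event_chain` ordering, and the `audit` report are all names introduced here to show how one could check that every judgment is backed by evidence and every piece of evidence is attached to a judgment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    pid: str
    who: str
    what: str
    where: str
    start_s: float  # when the event starts within the video, in seconds
    end_s: float

@dataclass(frozen=True)
class Evidence:
    pid: str        # id of the proposition this evidence refers to
    layer: int      # 1-4: which layer of the framework produced it
    supports: bool  # True if the evidence supports the proposition
    note: str

def event_chain(propositions):
    """Order propositions by time so long-range checks can walk them in sequence."""
    return sorted(propositions, key=lambda p: (p.start_s, p.end_s))

def audit(propositions, judgments, evidence):
    """Compare judgments ({pid: bool}) with evidence and report mismatches."""
    known = {p.pid for p in propositions}
    with_evidence = {e.pid for e in evidence}
    return {
        "judged_without_evidence": sorted(set(judgments) - with_evidence),
        "evidence_without_judgment": sorted(with_evidence - set(judgments)),
        "unknown_ids": sorted((set(judgments) | with_evidence) - known),
    }
```

An empty report from `audit` is the structural counterpart of the one-to-one correspondence between evidence and judgment described above; it says nothing about whether the evidence itself is sound.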

Furthermore, most detectors are still evaluated under a "closed world" assumption. In real deployment scenarios, new video generation models, editing tools, and content styles continuously emerge, and different platforms introduce their own downsampling, transcoding, and filtering pipelines. To bridge this long-term robustness gap, there is a need to adopt arena/leaderboard-style continuous update mechanisms, incorporating newly released generators and new platform transcoding pipelines into the evaluation set in a streaming fashion.
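
A continuously updated evaluation of this kind can be sketched as a running tally keyed by generator and platform pipeline, so that results for a newly added generator or transcoding path appear as soon as samples arrive. The class below is a minimal, hypothetical illustration; `RollingArena`, its method names, and the worst-first ordering are assumptions, and a real arena would also need versioned test sets and protection against leakage into training data.

```python
from collections import defaultdict

class RollingArena:
    """Tracks detection accuracy per (generator, pipeline) as new samples stream in."""

    def __init__(self):
        self._correct = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, generator, pipeline, predicted_fake, is_fake):
        """Add one evaluated sample; unseen generators or pipelines get a new row."""
        key = (generator, pipeline)
        self._total[key] += 1
        self._correct[key] += int(predicted_fake == is_fake)

    def leaderboard(self):
        """Accuracy per (generator, pipeline), worst first, so blind spots surface early."""
        rows = [(key, self._correct[key] / self._total[key]) for key in self._total]
        return sorted(rows, key=lambda row: row[1])
```

Sorting worst-first keeps attention on the combinations where a detector trained under a closed-world assumption fails, instead of on a single averaged score.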

Collaborative Dual-View Trustworthy and Explainable Detection System

Achieving explainable detection under the factual fidelity goal requires balancing the perception and cognition pathways: combining the visual perspective's ability to reveal artifacts and spatiotemporal inconsistencies with the linguistic perspective's ability to perform structured reasoning, and so integrating the four-layer landscape across both views. On one hand, current vision-language models and video understanding models perform relatively poorly on judgments of "perceptual fidelity" and need to be supplemented by visual-perspective methods. On the other hand, videos produced by stronger generators and anti-detection techniques can be highly faithful perceptually, so detection at the semantic and factual level from the linguistic perspective is necessary.

Further, it is essential to establish an explicit reasoning path of "identification–localization–explanation." This means that within the aforementioned dual-pathway system, every tool call or knowledge reference must be strictly bound to a specific argumentation step.
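
The binding requirement can be made mechanical. In the sketch below, a reasoning trace is a list of steps tagged with one of the three stages, and every tool call must name the step it supports; `validate_trace` returns a list of violations. The `Step` and `ToolCall` records and the specific checks are hypothetical constructs for illustration, not an interface defined by the review.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ("identification", "localization", "explanation")

@dataclass
class Step:
    step_id: int
    stage: str   # one of STAGES
    claim: str

@dataclass
class ToolCall:
    tool: str
    output: str
    step_id: Optional[int] = None  # the argumentation step this call supports

def validate_trace(steps, tool_calls):
    """Return a list of problems; an empty list means the trace is well formed."""
    problems = []
    stages = [s.stage for s in steps]
    unknown = [s for s in stages if s not in STAGES]
    if unknown:
        problems.append(f"unknown stages: {unknown}")
    # Stages must appear in the order identification, localization, explanation.
    order = [STAGES.index(s) for s in stages if s in STAGES]
    if order != sorted(order):
        problems.append("stages out of order")
    for stage in STAGES:
        if stage not in stages:
            problems.append(f"missing stage: {stage}")
    # Every tool call or knowledge lookup must be bound to an existing step.
    step_ids = {s.step_id for s in steps}
    for call in tool_calls:
        if call.step_id not in step_ids:
            problems.append(f"tool call '{call.tool}' is not bound to a step")
    return problems
```

A check like this only enforces the shape of the reasoning path; whether each bound tool output actually justifies its step still has to be judged separately.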

In addition, this "content-side" detection system needs to be cross-checked against "source-side" authentication signals where they exist, connecting content analysis with source tracing. The end result is a cross-layer, multimodal detection system together with a trustworthy, explainable evidence space.

Conclusion

AI video detection is a task that will only become more challenging.

For future AIGC-V detection research and practical applications, this review provides a map closer to real-world needs. It redefines the task of AI-generated video detection, proposes a "Vision–Language Dual-View" four-layer framework, and systematically organizes existing methods, related benchmarks, and evaluation metrics accordingly. It also connects these layers to challenges in real deployment, gaps in current evaluations, and emerging development directions.

Following this framework, it points out several key requirements for trustworthy detection, including evidence-first prioritization, traceable conclusions, and maintaining robustness across generators and real-world conditions.

Looking ahead, trustworthy AI video detection can hardly be accomplished by any single field independently. It is becoming a cross-disciplinary issue that requires joint attention from CV, NLP, multimodal understanding, and world model research: CV provides spatiotemporal evidence modeling and forensic robustness; NLP provides proposition decomposition, reasoning, evidence grounding, and explanatory capabilities; multimodal and world model research provides stronger cross-modal alignment capabilities and richer priors regarding physics, causality, and temporal consistency.

Only by truly integrating these capabilities can video detection gradually move beyond the search for local artifacts towards a more rigorous "view of reality": the question is no longer just whether a video looks plausible, but whether its entities, events, and dynamic processes remain faithful to the constraints of the real world—searching for the increasingly blurred boundary between the virtual world and the real world.

References: https://www.researchgate.net/doi/10.13140/RG.2.2.31713.88168

This article is from the WeChat public account "新智元", edited by LRST.

Related Questions

Q: According to the article, what are the three paradigms of AI-generated video defined by the survey?

A: The survey defines three paradigms of AI-generated video: 1) Local Manipulation Video (LMV), which involves modifying local areas of a real video like face swapping. 2) Audio-Visual Editing (AVE), which involves editing the correspondence between visual and audio elements like lip-syncing or re-dubbing. 3) Generative Video Synthesis (GVS), which involves end-to-end generation of entire videos from scratch using models like Sora.

Q: What is the proposed new goal for AI-generated video detection, as redefined in the article?

A: The article redefines the goal of AI-generated video detection as 'Factual Fidelity Verification.' This means verifying whether the propositions about 'who, when, where, and what happened' in the video content are both perceptually and cognitively consistent with the real world. It involves checking for conflicts with external facts, physical laws, and world knowledge, not just visual artifacts.

Q: What are the four layers of the Vision-Language Dual-View framework proposed for video detection methods?

A: The four layers of the Vision-Language Dual-View framework are: 1) Layer 1: Intrinsic Cues Analysis (low-level visual signal statistics). 2) Layer 2: Spatiotemporal Consistency (checking temporal and motion coherence). 3) Layer 3: Cross-Modal Consistency (verifying alignment between modalities like video, audio, and text within the video). 4) Layer 4: Language-Guided World-Level Reasoning (checking video content against real-world knowledge, facts, and physical plausibility).

Q: What key shift in the focus of detection methods does the article highlight based on the timeline statistics?

A: The article highlights a significant shift in the focus of detection methods from lower-level visual perspectives to higher-level language-guided reasoning. Statistics show that the proportion of methods focusing on language-guided layers (Layer 3 & 4) increased from 7.7% in 2020 to over 50% in 2025. This indicates the detection community is moving beyond visual forensics to address semantic and factual inconsistencies as AI videos become more perceptually realistic.

Q: What are the two main requirements for building a trustworthy and explainable detection system outlined in the conclusion?

A: The conclusion outlines two key requirements: 1) Establishing a dynamic, evidence-first evaluation system. This involves breaking down videos into verifiable propositional units for precise evidence-judgment mapping and incorporating continuous, arena-style testing against new generators and real-world processing pipelines. 2) Building a collaborative dual-perspective system. This requires combining the low-level visual evidence from the visual perspective with the high-level structured reasoning from the language perspective, creating explicit 'identification-localization-explanation' reasoning paths where each step is traceable to specific evidence.

Nội dung Liên quan

Dữ Liệu Cho Thấy Các Ứng Dụng Phi Tập Trung (dApps) Trên Solana Tạo Ra 257 Triệu USD Doanh Thu Trong Quý 2

Dữ liệu từ các bảng phân tích DeFi cho thấy các ứng dụng phi tập trung (dApps) trên Solana đã tạo ra tổng doanh thu 257 triệu USD trong quý 2 năm 2026. Con số này đánh dấu quý thứ 9 liên tiếp Solana dẫn đầu các mạng Layer 1 và Layer 2 lớn về hoạt động tạo phí. Doanh thu là một chỉ số quan trọng vì nó phản ánh hoạt động kinh tế thực tế trên chuỗi, như giao dịch trên sàn phi tập trung, phát hành token và các hành động tần suất cao, thay vì chỉ là sự chú ý trên mạng xã hội. Thành tích này củng cố lập luận rằng hệ sinh thái Solana là một trong những nền tảng năng động nhất trong crypto, mang lại lợi thế so với các đối thủ cạnh tranh với Ethereum về mặt định lượng. Tuy nhiên, điểm cần lưu ý là doanh thu của Solana có liên quan chặt chẽ đến môi trường giao dịch tốc độ cao, đặc biệt là các hoạt động liên quan đến meme coin và đầu cơ. Đây là mức sử dụng thực tế, nhưng có thể mang tính chu kỳ và giảm mạnh nếu thị trường suy giảm. Sức mạnh bền vững của mạng lưới sẽ được kiểm chứng qua khả năng duy trì cơ sở doanh thu này trong điều kiện thị trường ít biến động hơn. Hiện tại, số liệu quý 2 cho thấy nền kinh tế dApp của Solana vẫn đang tạo ra giá trị hữu hình.

bitcoinist18 phút trước

Dữ Liệu Cho Thấy Các Ứng Dụng Phi Tập Trung (dApps) Trên Solana Tạo Ra 257 Triệu USD Doanh Thu Trong Quý 2

bitcoinist18 phút trước





