By Alphabet AI
The AI video track has been a bit cold lately. Seedance 2.0 is embroiled in copyright disputes, and OpenAI shut down Sora, casting a shadow over this field.
Right at this moment, Alibaba brought out a dark horse.
In April 2026, HappyHorse-1.0 surged to the top of the Artificial Analysis leaderboard, outperforming rivals like ByteDance and Kuaishou in both text-to-video and image-to-video (without audio) tracks.
Zhang Di returned to Alibaba in November 2025, taking up the position of Head of Taotian Group's Future Life Laboratory and reporting directly to Zheng Bo, CTO of Alimama.
This means that only about five months passed between Zhang Di's return and his making a name for himself.
The key point is that, like Alibaba's Qwen, HappyHorse released a commercially usable open-source version.
What is Qwen's status in Alibaba now? It is Alibaba Group's core general-purpose large model foundation, the absolute core carrier of its AI strategy. Everything Alibaba does now is centered around Qwen.
Therefore, the significance of HappyHorse to Alibaba might be far more than just a model that tops leaderboards to show off technology.
However, before understanding Alibaba's intentions, we should first talk about who Zhang Di is.
01 From Alibaba to Kuaishou and Back to Alibaba
Zhang Di graduated from Shanghai Jiao Tong University with a degree in Computer Science, completing a combined bachelor's and master's program. After graduating in 2010, he joined Alibaba, where he was long responsible for Alimama's big data and machine learning engineering architecture.
Alimama focuses on advertising, recommendation, search, and conversion, which involve large-scale data, massive distribution, and complex engineering systems. These things might not sound as exciting as large models, but they are precisely the places that later trained AI talent for Chinese internet companies.
Many people who can truly turn models into products did not purely come from laboratories. They earlier underwent training in systems like search, recommendation, advertising, and content distribution.
Let me give you a few examples to make this clear. Google CEO Sundar Pichai started out working on the search bar and YouTube content recommendations. Microsoft CEO Satya Nadella initially developed the Bing search engine and Microsoft's advertising system at Microsoft.
These systems process vast amounts of user behavior daily and require models to run stably in real business scenarios. They don't let engineers get away with a flashy demo; they force you to build something truly useful, while constantly balancing latency, cost, effectiveness, and feedback.
Zhang Di's ten years at Alibaba were largely spent in such an environment. At that time, the outside world hadn't yet started calling everything large models, but Alibaba internally already had a training ground centered around data, algorithms, and engineering.
In 2020, Zhang Di left Alibaba for Kuaishou.
At that time, short video platforms had already moved from traffic competition to technology competition. Zhang Di served at Kuaishou as Vice President of Technology, Head of the Large Model and Multimedia Technology Team, and later led the underlying architecture R&D and application deployment of the Kling large model.
Kling's significance to Kuaishou is very substantial.
Kling enabled Kuaishou to upgrade from a past "content distribution platform" to a "content production infrastructure provider," building a complete closed loop of "creative generation - video production - one-click distribution - traffic monetization - data iteration."
In April 2025, Kuaishou established the Kling AI Division and upgraded it to a first-level department reporting directly to CEO Cheng Yixiao, on par with the main short video business.
Therefore, when he briefly joined Bilibili in September 2025 and returned to Alibaba two months later, this move could hardly be seen as just ordinary talent mobility.
Bilibili needs video technology, and Alibaba also needs video technology, but Alibaba's needs are more complex.
For Kuaishou, doing video generation is essentially about distribution. But if Alibaba does video generation, it involves many more linked aspects: e-commerce, advertising, live streaming, cloud services, and overseas merchants.
As mentioned earlier, after returning to Alibaba in November 2025, Zhang Di took up the position of Head of Taotian Group's "Future Life Laboratory" at level P11.
This arrangement still has a strong Alibaba flavor. The video model wasn't simply placed in a pure research department; instead, it sits close to Taotian, a transaction scene.
In other words, from its conception, HappyHorse was a product emphasizing deployment and bound to Alibaba's existing ecosystem.
Five months later, HappyHorse appeared.
This speed was indeed fast. Alibaba gave Zhang Di a new business scenario and a team, and he once again opened up the video model route.
He neither started from scratch in AI video nor simply parachuted into Alibaba from outside.
His career path is like a line that went out and came back. He learned how large-scale commercial systems operate at Alibaba, then went to Kuaishou to turn video generation into a product, and then returned to Alibaba to integrate this capability into a larger commercial machine.
Many companies are scrambling for large model talent, but the scarce individuals are often those who can simultaneously understand models, business, and organization.
There are many people who solely know how to train models, and many who solely know how to talk strategy. The difficulty lies in finding someone who knows where every step from the technical route, to architecture design, to training and inference, to the product outlet, and finally to being used by merchants and users, might get stuck.
HappyHorse pushed Zhang Di back into the spotlight and also gave Alibaba's relatively dispersed AI narrative over the past few years a more concrete entry point through a person.
02 How an Open-Source Model Defeated Closed-Source Giants
The point that truly drew attention to HappyHorse is that it won too suddenly.
On the video generation track, overseas there are Runway, Pika, Luma, Google's Veo; domestically there are ByteDance's Seedance and Kuaishou's Kling. Alibaba wasn't even on the list.
So when HappyHorse first topped the charts, people were even more willing to believe it was a model developed by some startup than to believe it was an Alibaba model.
HappyHorse is in the first tier in both text-to-video and image-to-video tracks, with an Elo rating of 1333 for text-to-video and 1392 for image-to-video.
The Artificial Analysis leaderboard changes with ongoing user blind tests, and the scores have since been updated, but HappyHorse did outperform a group of established closed-source models in user preference tests.
This is actually quite unusual. Generally speaking, video generation is one of the directions that consumes the most money, data, and computing power.
Closed-source large companies can hide data, model details, inference systems, and product experience within their platforms, continuously iterating internally.
Open-source models face more practical constraints. Their parameters must be public, inference must be runnable, the community must be able to reproduce them, and the results must withstand side-by-side comparison.
So before HappyHorse appeared, most open-source video models were toys. The output videos weren't stable enough, and characters often experienced drift.
HappyHorse has 15 billion parameters and a 40-layer unified self-attention Transformer architecture, jointly modeling text, video, and audio by placing tokens from all three modalities into the same sequence.
This approach is very similar to Qwen, which also explains why Zhang Di managed to produce HappyHorse in just 5 months—it reused the high-quality native multimodal training methods left by Qwen.
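The "single sequence" idea can be sketched in a few lines. This is a toy illustration of the concept only, not HappyHorse's actual tokenizer or API; all names and token values below are made up.

```python
# Hypothetical sketch of "native multimodality": tokens from text, video,
# and audio are tagged with a modality ID and concatenated into ONE
# sequence, so a single self-attention stack can model cross-modal
# dependencies (e.g. lip movement vs. phonemes) directly.

MODALITIES = {"text": 0, "video": 1, "audio": 2}

def build_unified_sequence(text_tokens, video_tokens, audio_tokens):
    """Concatenate the three modality streams into one flat token list.

    Each element is (modality_id, position_within_modality, token) --
    the information a unified Transformer needs to attend across
    modalities in a single pass.
    """
    sequence = []
    for name, tokens in (("text", text_tokens),
                         ("video", video_tokens),
                         ("audio", audio_tokens)):
        mod_id = MODALITIES[name]
        for pos, tok in enumerate(tokens):
            sequence.append((mod_id, pos, tok))
    return sequence

# One flat sequence of 6 tokens: every token can attend to every other,
# regardless of modality -- the core idea behind synchronized audio-video
# generation.
seq = build_unified_sequence([101, 102], [7, 8, 9], [42])
```

The contrast with non-native pipelines is that there, video and audio are produced by separate models and aligned afterwards, which is where lip-sync drift creeps in.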
Non-native multimodal video generation models like Sora often experience issues like characters' mouths moving but the sound being half a beat late. Sometimes character expressions are rich, but the tone is wrong. Characters might also move before the sound is emitted.
The reason for HappyHorse's high rating lies in its solution to this problem through native multimodality.
HappyHorse natively supports lip synchronization for multiple languages including English, Mandarin, Cantonese, Japanese, Korean, German, and French. Its word error rate is also competitive with similar open-source models.
Why did Zhang Di do this? My understanding is that if Alibaba wants video generation technology to enter advertising, e-commerce, short dramas, education, and live streaming, it cannot rely solely on pretty pictures.
It must be able to speak, to dub, to make sound and picture hold together at the same time.
Another key point is cost and speed.
HappyHorse takes about 38 seconds to generate a 5-second 1080p video on a single H100 GPU and uses DMD-2 distillation technology to compress the denoising steps to 8.
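The arithmetic behind step distillation is worth making explicit. The ~50-step baseline below is a conventional figure for non-distilled video diffusion samplers, an assumption rather than a published HappyHorse number; the per-step cost is simply back-solved from the reported 38 seconds at 8 steps.

```python
# Back-of-the-envelope sketch of why compressing denoising steps matters.
# Sampling time in a diffusion model is roughly linear in the number of
# denoising steps, so 8 steps instead of ~50 is most of the speedup.

def generation_time(steps, seconds_per_step):
    """Total sampling time, assuming cost is linear in denoising steps."""
    return steps * seconds_per_step

seconds_per_step = 38 / 8                         # implied by 38 s at 8 steps
baseline = generation_time(50, seconds_per_step)  # hypothetical non-distilled run
distilled = generation_time(8, seconds_per_step)  # reported configuration
speedup = baseline / distilled                    # 50 / 8 = 6.25x
```

Under these assumptions, the same video would take roughly four minutes without distillation, which is the difference between a tool merchants use casually and one they queue overnight.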
This is an unavoidable hurdle for the commercialization of video generation. No matter how good the model effect is, if generating a short video costs too much or takes too long, it's hard to enter merchants' daily workflows.
Merchants won't wait half a day for each product, nor will they pay excessive costs for dozens of test materials.
So the significance of HappyHorse is not just "able to generate," but also its attempt to push generation speed and inference costs into the usable range.
For developers, open source means they can self-host, fine-tune, and integrate it into their own products. For platforms, open source also brings more community feedback.
The progress of a closed-source model relies mainly on the company's internal team. An open-source model gets subjected to all kinds of unusual tests by developers, so problems surface quickly and improvement directions multiply.
The Artificial Analysis video arena uses user preference voting. Often, it doesn't just look at a single technical indicator but rather at which of two videos users prefer.
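The mechanics behind such a preference arena can be sketched with the standard Elo update rule, which is what turns pairwise blind votes into ratings like the 1333 and 1392 cited above. The K-factor and starting ratings below are conventional illustrative choices, not the leaderboard's actual parameters.

```python
# Minimal Elo sketch: each blind vote ("which of these two videos do you
# prefer?") nudges both models' ratings toward the observed outcome.

def expected_score(r_a, r_b):
    """Probability model A is preferred over model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one pairwise preference vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1 - s_a) - (1 - e_a))
    return new_a, new_b

# Two equally rated models: the winner of a single vote gains k/2 points.
r_a, r_b = elo_update(1000, 1000, a_won=True)
```

Because ratings move with every vote, a lead on such a board reflects the current stream of user preferences, not a fixed benchmark score.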
Of course, Zhang Di cannot be too proud yet; topping the chart once does not mean leading forever.
Competitors won't stand still. HappyHorse has now only won one public test, not the entire war.
If HappyHorse were just a model that tops leaderboards, its significance would be limited. But if it can become the video generation foundation commonly used across Alibaba Cloud and Taotian's businesses, it will become an entry point.
Therefore, the most interesting part of HappyHorse defeating closed-source giants is not just the leading scores. What is truly worth paying attention to is that it gave Alibaba a way to re-enter the video generation game.
It didn't first make a C-end user APP, nor did it only do internal demos. Instead, it directly subjected the open-source model to industry-wide scrutiny.
This victory might not last long, but Zhang Di changed the external perception of Alibaba's capabilities in video generation models.
The new question becomes: where does Alibaba plan to use this capability?
03 The Significance of HappyHorse for Alibaba
The most direct landing point for HappyHorse is e-commerce.
In the past, when people talked about AI video, the first things that came to mind were film, television, short dramas, advertising blockbusters, and creator tools. Admittedly, these are all substantial markets, but they are still some distance from Alibaba's main business.
Alibaba's strength does not lie in building a video community itself, nor in getting ordinary users to open an AI video app daily to kill time. Alibaba's real advantage is that it holds China's most concentrated collection of products, merchants, transactions, and advertising systems.
This is also why many people care that HappyHorse was born in Taotian Group's "Future Life Laboratory."
Taotian deals daily with how merchants sell goods, how products get seen, why users click in, and why they place orders. Placing HappyHorse here naturally raises the questions: can it improve the efficiency of product content production? Can it improve conversion? Can it help the platform do more business?
For an ordinary merchant, video content has always been a hassle.
To shoot a 30-second product video, you need to find a location, hire a model, set up lighting, edit, and dub. Big brands can hire teams, but small and medium-sized merchants often have to make do on their own.
Many product selling points are not complicated; the problem is that no one films them. They look very ordinary against a white background, but once placed in a specific scene, users realize what they can be used for.
A recent example from overseas: a solar fountain pump sold out. It was originally just a small garden item with an unremarkable effect. But after AI video packaged it as a bird bath, a fish pond, and a children's bathtub with fun water-spraying toys, buyers went crazy for it.
AI didn't change the product itself, but it changed the way users understand the product. It turned "functional description" into "usage scenario."
This hits squarely at the pain point of e-commerce content.
Users may not have the patience to read product pages stuffed with parameters, and may not believe a host who talks at length. But a video of ten-odd seconds that clearly conveys the usage scene might convert far more efficiently.
More importantly, AI videos can be generated in batches. Merchants can generate children's version, family version, holiday version, outdoor version for the same product, or generate different languages, different characters, and different scenes for different countries.
This is more significant for Alibaba than simply making a video generation tool. Taobao and Tmall both host a large number of merchants, along with a large amount of product data and transaction feedback.
If an AI video tool only knows how to generate pretty pictures, it will quickly become just another asset-creation tool. If it can learn in which scenes a product is more likely to be clicked, which copy is more likely to lead to add-to-cart, and which opening seconds are more likely to retain viewers, it approaches being part of an e-commerce operating system.
What Alibaba has that other video generation model companies lack is precisely this feedback loop.
Product images, detail pages, reviews, Q&A, search terms, click-through rates, add-to-cart rates, refund reasons, live stream dwell time—these things seem fragmented but are all fuel for training e-commerce content capabilities.
If HappyHorse plugs into these feedback signals, it can evolve from "helping merchants generate a video" to "helping merchants generate videos that are more likely to sell goods."
For Taotian, it can handle main product-image videos, product-scenario short films, live stream clips, virtual hosts, and marketing materials.
In the past, when a merchant launched a new product, they might only upload a few pictures, at most shooting a rough short video. In the future, they can feed the product images, selling points, reviews, and audience tags to the system, let it generate multiple versions of the video, and then use real placement and transaction data to filter out the more effective ones.
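That generate-many, filter-by-real-data loop can be sketched minimally. The variant names, the impressions/conversions metric, and the selection rule here are all hypothetical, chosen only to make the workflow concrete.

```python
# Hedged sketch: generate several AI video variants for one product,
# collect real placement stats, keep the one that converts best.

def pick_best_variant(variants, stats):
    """Return the variant with the highest conversions-per-impression.

    `stats` maps variant -> (impressions, conversions) gathered from
    real ad placements; ties go to the earliest-listed variant.
    """
    def conversion_rate(v):
        impressions, conversions = stats[v]
        return conversions / impressions if impressions else 0.0
    return max(variants, key=conversion_rate)

variants = ["kids_version", "family_version", "holiday_version"]
stats = {
    "kids_version":    (1000, 30),   # 3.0% conversion
    "family_version":  (1000, 45),   # 4.5% conversion
    "holiday_version": (1000, 12),   # 1.2% conversion
}
best = pick_best_variant(variants, stats)
```

In a real system the selection would feed back into generation, biasing the next batch of variants toward the scenes and copy that already proved out.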
If this process runs smoothly, platform content supply will increase significantly, and the content threshold for small and medium-sized merchants will also decrease.
However, AI-generated product videos also carry risks. They can amplify selling points, but they can also amplify illusions. A fountain pump that sprays high in an AI video might not achieve that effect in reality.
Alibaba's opportunity should not be to let merchants use AI to sell dreams. The focus should be on using product parameters, real-shot footage, buyer reviews, and platform auditing to keep generated content within boundaries.
In late March, OpenAI announced the shutdown of the Sora standalone application and related APIs. The reason is realistic: video generation burns too much money, user retention cannot support the cost, and OpenAI needs to put computing power back into coding, enterprise services, and robotics directions.
Sora fell at the commercial hurdle.
ByteDance also encountered trouble on another front. Although Seedance 2.0's effects are also fierce, due to copyright issues, ByteDance paused the global release of Seedance 2.0.
The stronger a model is trained, the easier it is to step into the mire of copyright, portrait rights, and training data.
Looking back at HappyHorse, led by Zhang Di, it has a clear commercial scenario. Moreover, the product images, merchant materials, real-shot videos, and transaction feedback in Alibaba's hands are naturally more suitable for controlled generation than film and TV IP.
Therefore, the value of HappyHorse is not only in the leaderboard. It found a more stable landing point for AI video.