Why Did Zhipu Surge Nearly 30% in a Single Day?
Shares of "Global AI Model Unicorn" Zhipu surged nearly 30% in a single day, lifting the company's market cap to a new high. The catalyst was the launch of its GLM-5.1-highspeed API, which generates **400 tokens per second** and sets a new global benchmark.
This speed, roughly 3 to 5 times that of industry leaders like OpenAI's GPT-4o and Anthropic's Claude, is achieved **without compromising the full-scale model's capabilities**. For AI Agents that chain dozens of model calls to complete a single task, per-call latency compounds, so the reduction is critical: speed stops being just a system metric and becomes a determinant of how much intelligence an agent can apply in a given time.
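The compounding effect can be made concrete with a little arithmetic. The sketch below is illustrative only: the workload (30 sequential calls of 800 generated tokens each) and the 100 tokens-per-second baseline are assumptions chosen to sit inside the article's "3 to 5 times" range, and the function name is hypothetical. It counts generation time only and ignores prompt processing, network, and tool execution.

```python
def chain_generation_seconds(num_calls: int, tokens_per_call: int,
                             tokens_per_second: float) -> float:
    """Total generation time for a chain of sequential model calls."""
    return num_calls * tokens_per_call / tokens_per_second


# Hypothetical agent workload (assumed, not from the article).
NUM_CALLS = 30
TOKENS_PER_CALL = 800

fast = chain_generation_seconds(NUM_CALLS, TOKENS_PER_CALL, 400.0)  # rate quoted above
slow = chain_generation_seconds(NUM_CALLS, TOKENS_PER_CALL, 100.0)  # assumed 4x slower baseline

print(f"400 tok/s: {fast:.0f} s, 100 tok/s: {slow:.0f} s")
# prints "400 tok/s: 60 s, 100 tok/s: 240 s"
```

Under these assumptions the same task drops from four minutes of generation to one, which is the difference between an agent that feels interactive and one that does not.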
The breakthrough stems from a three-layer technical overhaul:
1. **TileRT Inference Engine**: Compiles the entire model into a continuous, always-on computation pipeline using "Warp Specialization," minimizing GPU idle time by having different processor groups handle data loading, computation, and communication in parallel.
2. **Heterogeneous Parallelism for MLA**: GLM-5.1 uses the MLA (Multi-head Latent Attention) mechanism, and TileRT runs it with a heterogeneous strategy instead of giving every GPU the same job. One GPU handles sparse indexing/routing, while the others perform dense computation, matching the division of labor to MLA's unique workflow.
3. **ZCube Network Architecture**: Replaces the standard Spine-Leaf (ROFT) network topology with a flat, dual-group interconnect. This design creates a single optimal path between any two GPUs, eliminating network congestion at scale and reducing latency.
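The idea behind the first layer, dedicating separate workers to loading, computing, and communicating so that no stage sits idle waiting for another, can be illustrated with a CPU-side analogy. The sketch below is not TileRT code and does not run on a GPU: it uses ordinary Python threads and bounded queues, and every name in it is hypothetical. It only shows the shape of a specialized, always-on pipeline.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker passed down the pipeline


def loader(tiles, out_q):
    """Stage 1: only fetches data and hands it to the compute stage."""
    for tile in tiles:
        out_q.put(tile)
    out_q.put(_DONE)


def computer(in_q, out_q):
    """Stage 2: only computes (here, a stand-in squaring operation)."""
    while (tile := in_q.get()) is not _DONE:
        out_q.put(tile * tile)
    out_q.put(_DONE)


def communicator(in_q, results):
    """Stage 3: only ships results out (here, appends to a list)."""
    while (value := in_q.get()) is not _DONE:
        results.append(value)


def run_pipeline(tiles, depth=2):
    """Run the three specialized stages concurrently over a stream of tiles."""
    load_q = queue.Queue(maxsize=depth)  # small buffers keep the stages in lockstep
    send_q = queue.Queue(maxsize=depth)
    results = []
    workers = [
        threading.Thread(target=loader, args=(tiles, load_q)),
        threading.Thread(target=computer, args=(load_q, send_q)),
        threading.Thread(target=communicator, args=(send_q, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


print(run_pipeline([1, 2, 3, 4]))
# prints [1, 4, 9, 16]
```

While the compute stage works on one tile, the loader is already fetching the next and the communicator is shipping the previous result. On a GPU, warp specialization assigns these roles to different groups of threads within a kernel rather than to operating-system threads.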
The business impact is significant: a 15% increase in cluster throughput (extra capacity at no additional hardware cost), a 40.6% reduction in tail latency (improved stability), and a one-third cut in networking hardware costs. In the long term, this innovation challenges the dominance of NVIDIA's integrated hardware-software stack (GPU+NVLink+InfiniBand). It could benefit manufacturers of high-density Leaf switches and optical modules, and it lowers the software barrier for domestic AI chips like Huawei's Ascend. The result shows that more can be achieved with the same compute, and that reshaping AI infrastructure involves more than just GPUs.
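To get a feel for what those three percentages mean in practice, the sketch below applies them to a made-up baseline cluster. The baseline figures (1,000 GPUs, a 100 ms tail latency, a networking budget of 3.0 in arbitrary units) are assumptions for illustration, the function name is hypothetical, and treating a 15% throughput gain as "equivalent GPUs" assumes capacity scales linearly with GPU count.

```python
THROUGHPUT_GAIN = 0.15     # cluster throughput increase reported above
TAIL_LATENCY_CUT = 0.406   # tail latency reduction reported above
NETWORK_COST_CUT = 1 / 3   # networking hardware cost reduction reported above


def apply_reported_gains(gpus: int, tail_latency_ms: float, network_cost: float) -> dict:
    """Apply the reported percentages to a baseline cluster (simplified, linear model)."""
    return {
        "equivalent_gpus": gpus * (1 + THROUGHPUT_GAIN),
        "tail_latency_ms": tail_latency_ms * (1 - TAIL_LATENCY_CUT),
        "network_cost": network_cost * (1 - NETWORK_COST_CUT),
    }


# Hypothetical baseline: 1,000 GPUs, 100 ms tail latency, networking budget of 3.0 units.
after = apply_reported_gains(1000, 100.0, 3.0)
print(f"equivalent GPUs: {after['equivalent_gpus']:.0f}")
print(f"tail latency:    {after['tail_latency_ms']:.1f} ms")
print(f"network cost:    {after['network_cost']:.1f} units")
```

Under these assumptions the cluster does the work of about 1,150 GPUs, tail latency falls to roughly 59 ms, and the networking bill drops from 3.0 to about 2.0 units.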
marsbit, 27m ago