Article | CloudSurge AI, Author | Huang Yunhao
One. After Google I/O 2026: The Four Major End-Device OSes Step into the Agent Era
On May 12, 2026, Google held the Android Show|I/O Edition press conference, an Android-focused event ahead of the I/O conference on May 19. Sameer Samat, President of the Android Ecosystem, set the tone for this conference: Android must transform from an operating system into a smart system. The concept carrying this main thread is Gemini Intelligence – a set of proactive AI capabilities at the Android system layer.
2026 Android Show|I/O Edition Press Conference Poster
Source: Android Headlines
Compared to last year's Gemini Nano + AICore combination, this time Google further embedded Agent capabilities for cross-App and contextual processing into the OS layer: cross-App task automation (ordering meals, shopping, placing orders), automatic form filling, webpage summarization, and custom widgets were successively written into the system-level capability list. Google also listed explicit user control, comprehensive data protection, and operational transparency as three product principles.
A week later, on May 19, in the I/O keynote speech, Google CEO Sundar Pichai started along the same line:
Welcome to the agentic Gemini era
In joining the wave of end-device OS agentization, Google was hardly an early starter.
Microsoft introduced Copilot+ PCs (a new category of Windows 11 devices equipped with 40+ TOPS NPUs) at Build 2024 in May 2024, embedding Agent capabilities into the OS on the basis of three capabilities: the on-device small model Phi Silica, the screen Agent capability Click to Do, and the system-level activity memory Recall.
At WWDC24 in June 2024, Apple formally announced "Apple Intelligence," which it positioned as a "personal intelligence system." Some AI-assisted features were subsequently rolled out, but the core Agent capabilities of Apple Intelligence have not yet materialized due to issues like delays in its own large model development and Siri's shortcomings.
Huawei, at HDC 2025 in June 2025, released HarmonyOS 6 and the Harmony Smart Agent Framework (HMAF), followed by the launch of the Xiaoyi Smart Agent Plaza featuring over 80 agents.
The major trend of end-device OS agentization has simultaneously emerged in mainstream operating systems like Android, iOS, HarmonyOS, and Windows.
Press conferences only showcase features; what OS vendors are truly competing over is the three-layered foundational capability underpinning the reliable operation and practical problem-solving of OS Agents: the system-level AI Runtime, controllable chips, and the end-cloud model matrix.
Two. Beyond the Press Conference: The Three-Layered Foundation Supporting OS Agents
System-Level AI Runtime: The Scheduling Hub for On-Device Intelligence
The Runtime is the inference engine and system service through which on-device models run within the operating system. Downwards, it interfaces directly with the NPU and system resource scheduling; upwards, it exposes inference capabilities to all Apps via stable APIs. It turns on-device models into "shared intelligence at the OS layer": it shares model weights across Apps, schedules computing power and memory uniformly, and supports the tool calling, guided generation, context management, and permission integration that Agents require. It determines whether an OS Agent is merely a chat button within an App or a resident service on the operating system capable of performing system-level operations.
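The "shared intelligence" idea can be illustrated with a minimal sketch. Everything here is hypothetical (the class names, the app identifiers, the model name); it is not the API of AICore or any other real runtime. It only shows the general pattern of a system service that loads model weights once and hands the same instance to every App that asks for it.

```python
from dataclasses import dataclass, field


@dataclass
class LoadedModel:
    """Stands in for model weights held in memory by the runtime (hypothetical)."""
    name: str
    clients: set = field(default_factory=set)


class SharedRuntime:
    """Toy system service: loads each model once and shares it across apps."""

    def __init__(self):
        self._models = {}
        self.load_count = 0  # how many times weights were actually loaded

    def acquire(self, app_id: str, model_name: str) -> LoadedModel:
        model = self._models.get(model_name)
        if model is None:
            # First request for this model: load the weights once.
            model = LoadedModel(model_name)
            self._models[model_name] = model
            self.load_count += 1
        model.clients.add(app_id)
        return model

    def release(self, app_id: str, model_name: str) -> None:
        model = self._models.get(model_name)
        if model is None:
            return
        model.clients.discard(app_id)
        if not model.clients:
            # No app is using the model any more: free the memory.
            del self._models[model_name]


runtime = SharedRuntime()
a = runtime.acquire("mail_app", "on_device_llm")
b = runtime.acquire("notes_app", "on_device_llm")
print(a is b, runtime.load_count)
```

Two Apps end up holding the same model object, and the weights are loaded only once. Without a shared service, each App would have to bundle and load its own copy.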
The most complete example within the Android system is Google's AICore. In December 2023, AICore went live as a system service in Android 14; in August 2025, Gemini Nano was opened to developers via ML Kit GenAI APIs. From a system service foundation to stable APIs for Apps, AICore has been polished for nearly two years.
Other OS vendors are on the same path, at different tempos. Apple opened the Foundation Models framework to developers at WWDC25. The framework comes with macros like @Generable, tool calling, guided generation, and stateful sessions, connecting to an on-device foundation model of about 3B parameters, with Private Cloud Compute providing cloud support. Microsoft integrated the on-device AI framework Foundry on Windows and Phi Silica into Windows 11, using Windows ML as the underlying inference backend. Huawei released the Agent Framework Kit (Harmony Smart Agent Framework, HMAF) at HDC 2025, opening up the intent system and Agent collaboration protocols.
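Guided generation is commonly implemented as constrained decoding: at each step, only tokens that keep the output valid for a target schema may be chosen. The sketch below is a toy illustration of that general technique, not Apple's implementation or API. The grammar, the token strings, and the scoring function are all assumptions made up for the example.

```python
# Allowed next tokens for each state of a tiny schema: {"rating": <1-5>}
# (hypothetical grammar, chosen only for illustration).
GRAMMAR = {
    "start": {'{"rating":': "value"},
    "value": {"1": "close", "2": "close", "3": "close", "4": "close", "5": "close"},
    "close": {"}": "done"},
}


def guided_decode(score_fn):
    """Greedy decoding where tokens outside the grammar are masked out."""
    state, output = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        scores = score_fn(output)
        # Pick the best-scoring token among those the schema permits.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return "".join(output)


def toy_scores(prefix):
    """Stand-in for model logits: prefers free-form text the schema forbids."""
    return {"Sure, I'd rate it": 9.0, "4": 5.0, "5": 3.0, "}": 1.0, '{"rating":': 0.5}


print(guided_decode(toy_scores))
```

Even though the toy scorer ranks free-form text highest, the mask forces the result into the schema, which is why a caller can parse it without defensive string handling.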
Android AICore as a system service, scheduling Gemini Nano inference on hardware accelerators
Source: Android Developers
Controllable Chips: The Fulcrum of Hardware-Software Synergy
At the Android Show|I/O Edition, Google set clear hardware thresholds for Gemini Intelligence: the full feature set debuted exclusively on a few latest flagships like the Pixel 10 series and Galaxy S26 series, with last year's models not included. This points to a simple fact: AI models are still evolving rapidly, and software continuously imposes new demands on hardware. Controllable chips are the foundation for meeting these demands, and the degree of control determines the space OS vendors have for hardware-software adaptation of end-device OS Agents.
Apple is the exemplar of the integrated hardware-software approach. iOS and macOS have evolved in tandem with the A-series and M-series chips from the start, and Core ML encapsulates the scheduling of CPU, GPU, and ANE into the framework layer. This path continues into the LLM era. Apple Machine Learning Research published a set of measurements: following Core ML's optimization path to deploy Llama 3.1 8B Instruct onto an M1 Max, local decoding speed can reach about 33 tokens/s. The "Apple Intelligence Foundation Language Models" technical report also disclosed that Apple performed architecture-level optimizations such as KV cache sharing and 2-bit quantization-aware training for its own chips, which made it possible to open the ~3B on-device foundation model to developers via the Foundation Models framework. This depth is only achievable when the vendor controls the chip itself. That is precisely the value of controllable chips for OS vendors: they dictate the depth of hardware-software synergy and raise the experience ceiling for end-device OS Agents.
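A rough calculation shows why low-bit quantization matters for a ~3B on-device model. The sketch below is a simplification under stated assumptions: it treats every weight as stored at the same bit width and ignores quantization scales, any layers kept at higher precision, the KV cache, and activations. The 16-bit and 4-bit rows are added only for comparison and do not come from the report.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a uniform bit width."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 3e9  # "~3B" parameters, taken as exactly 3 billion for the estimate

for bits in (16, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

Under these assumptions, going from 16-bit to 2-bit weights cuts weight storage by a factor of eight, which is the difference between a model that crowds out other Apps in phone memory and one that can stay resident.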
Entering the AI era, Google is doing the same thing: it has pursued its self-developed Tensor SoC path since the Pixel 6. The latest Tensor G5 boosts TPU performance by up to 60% and CPU performance by an average of 34%, landing in the Pixel 10 as the first SoC to fully run the latest-generation Gemini Nano. Of course, Tensor G5 also has weaknesses: Android Central's real-world tests show its memory configuration (RAM capacity) remains an AI performance bottleneck, and its Geekbench AI scores trail the Snapdragon 8 Elite; in Macworld's Geekbench 6 tests, G5's single-core and multi-core scores are lower than the A18 Pro's. Google is still catching up, but the synergistic path of self-developed Tensor plus on-device Gemini has taken shape.
Huawei's Kirin, paired with the Da Vinci NPU and the Pangu on-device model, is another controllable chip path running parallel to Apple and Google. Xiaomi, with its XRING O1, is a newer entrant moving in the direction of controllable chips.
End-Cloud Model Matrix: The Source of Intelligence for Agents
The end-cloud model matrix is the source of "intelligence" for end devices: cloud models support the capability ceiling for complex tasks, while on-device models underpin the baseline for daily operation – latency, battery life, privacy, and stability all rest on the on-device side. Both ends are indispensable; the difference lies in the depth of coupling with the OS. On-device models must be embedded into the OS of every terminal device and deeply coupled with the local NPU, assuming a dual identity within the OS: downwards, they are the local inference backend for the Runtime; upwards, they are exposed as system-level APIs to Apps via the Runtime's framework and SDK.
Self-development makes sense both in the cloud and on-device, but the returns are more tangible on-device. While cloud models can be sourced externally to support the capability ceiling, the advantages of self-development mainly manifest in routing control, commercial terms, and model iteration pace. The on-device side is different. On-device models are embedded into the OS and NPU of every device; the returns on self-development are directly reflected in product performance: KV cache sharing, 2-bit quantization-aware training specifically designed for a chip generation, Per-Layer Embedding (originating from Gemma 3n, incrementally loading embedding parameters layer-by-layer from fast storage), etc. – these are only conveniently realized when the model and hardware are designed synchronously; meanwhile, the synergy tempo is no longer constrained by third-party hardware vendors.
Tensor G5's TPU computing power rose by up to 60% over the previous G4, but Gemini Nano's improvement on the G5 far exceeds that: according to figures compiled from Google and Jon Peddie Research, local processing speed is 2.6 times that of the previous generation, energy consumption is halved, and the token window expanded from 12,000 to 32,000 (equivalent to digesting about a hundred screenshots at once). Gains this far beyond the raw hardware uplift stem from the Matryoshka Transformer elastic inference architecture adopted by Gemini Nano v3, combined with co-optimization for the Tensor G5 TPU.
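The gap between the hardware uplift and the end-to-end speedup can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above and makes one simplifying assumption that is not in the source: that hardware and non-hardware gains compose multiplicatively. Because the 60% figure is an upper bound ("up to"), the residual computed here is a lower bound on what the model and scheduling side contributes.

```python
tpu_gain = 1.6        # "up to 60%" TPU uplift, taken at its upper bound
speed_gain = 2.6      # reported local processing speed vs. previous generation
old_window, new_window = 12_000, 32_000   # token window, before and after
screenshots = 100     # "about a hundred screenshots"

# Share of the speedup not explained by raw TPU compute (assumed multiplicative).
residual_gain = speed_gain / tpu_gain
window_ratio = new_window / old_window
tokens_per_screenshot = new_window / screenshots

print(f"speedup beyond TPU uplift: at least {residual_gain:.3f}x")
print(f"token window growth: {window_ratio:.2f}x")
print(f"implied budget per screenshot: about {tokens_per_screenshot:.0f} tokens")
```

Read this way, well over half of the reported factor would have to come from architecture and scheduling rather than from the chip alone, which is the point the comparison is making.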
Performance Leap of Gemini Nano on Tensor G5 Compared to Previous Generation
Source: Google/Jon Peddie Research, CloudSurge AI Chart
In this layer of on-device models, the major OS vendors all hold their own cards: Google's Gemini Nano, Apple's ~3B parameter on-device foundation model, Microsoft's Phi Silica, Huawei's Pangu on-device model. Self-development is the default option for this layer.
Three. Between the Layers: Deeper Synergy, Greater Space for Differentiation
The three-layered capability foundation is coupled from bottom to top: Controllable Chip → On-Device/Cloud Models → Runtime → Agent. The controllable chip determines the achievable inference efficiency and power consumption for on-device models; on-device models determine the local intelligence schedulable by the Runtime; the Runtime determines the reliability of the Agent executing cross-App operations as a system service. The deeper the synergy among the three, the greater the product experience differentiation for OS vendors in on-device Agents, and the thicker the moat.
The more tightly the three layers interlock within the same hardware-software system, the more the product capabilities of OS Agents will exhibit differentiation that a single layer cannot achieve.
- Response latency and power consumption. The 2.6x processing speed and halved energy consumption achieved by Gemini Nano on Tensor G5 rely on mutual adaptation of model architecture, chip design, and Runtime scheduling within the same generation of hardware-software design – improvements of this magnitude only emerge from such synergy.
- Privacy and trust. Common tasks involving private data are handled locally by on-device models, while complex requests are passed to the cloud; this is the reasonable default posture for OS Agents regarding user data at the current stage. The three-layer coupling determines whether this "on-device first, cloud fallback" can be truly realized. It requires deep adaptation between the NPU and the on-device model, which is the key path for still-maturing on-device models to shoulder daily high-frequency inference; model quantization, compression, and KV cache sharing tailored to the NPU; and Runtime routing between on-device and cloud based on task complexity. If any of the three layers is inadequate, "on-device first" remains mere marketing talk.
- System-level context. OS vendors reorganizing cross-App and OS-layer user data (semantic indexing, screen perception, long-term memory) into a system-level personal context for the Agent is a prerequisite for the Agent to truly "understand the user" and a core characteristic differentiating OS Agents from single App-level Agents. Implementation depends on the three-layer interlock: the Runtime holds cross-App indexing and permissions, the on-device model stays resident to handle understanding and inference, and the NPU provides efficient local computing power. Apple's Core Spotlight builds semantic indexes on-device, Apps expose actions and data to the system via App Intents, and Agents will obtain context through Personal Context (Apple announced this capability will come with a future software update); Android's AppFunctions follows a similar path.
- Reliability as a system service. For an OS Agent to be invoked as a system-level service, it must remain usable in real-world conditions such as being offline, low battery, or thermal throttling. The on-device model residing on the device allows the Agent to work without a network; an NPU with deep hardware-software optimization handles low-power inference; and when device resources are tight, the Runtime falls back according to availability (switching to lighter models or routing requests to the cloud). If any of the three layers is missing, the OS Agent cannot sustain the form of a system service and can only revert to an App-level chat button.
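The routing and fallback behavior described in the last few points can be sketched as a small policy function. This is a minimal illustration under stated assumptions, not how any vendor's Runtime actually decides: the complexity score, the thresholds, the `DeviceState` fields, and the three route names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ON_DEVICE = "on_device"            # full on-device model
    ON_DEVICE_LITE = "on_device_lite"  # lighter on-device model
    CLOUD = "cloud"


@dataclass
class DeviceState:
    online: bool
    battery_pct: int
    thermal_throttled: bool


def choose_route(complexity: float, state: DeviceState,
                 complexity_limit: float = 0.7,
                 low_battery_pct: int = 15) -> Route:
    """On-device first, cloud fallback; degrade locally when resources are tight."""
    if complexity > complexity_limit and state.online:
        # Complex request and a network is available: hand off to the cloud.
        return Route.CLOUD
    # Either the task is simple or there is no network: stay local.
    constrained = state.thermal_throttled or state.battery_pct < low_battery_pct
    return Route.ON_DEVICE_LITE if constrained else Route.ON_DEVICE


healthy = DeviceState(online=True, battery_pct=80, thermal_throttled=False)
offline_hot = DeviceState(online=False, battery_pct=80, thermal_throttled=True)

print(choose_route(0.2, healthy))      # simple task, healthy device
print(choose_route(0.9, healthy))      # complex task, network available
print(choose_route(0.9, offline_hot))  # complex task, offline and throttled
```

The third case is the one that distinguishes a system service from a cloud-backed chat button: with no network and a throttled chip, the request is still answered, only by a lighter local model.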
Apple Intelligence presents a complete synergy paradigm: Apple Silicon, the ~3B on-device foundation model, and the Foundation Models framework interlock from bottom to top, handling common scenarios on-device and transferring complex requests to Private Cloud Compute. Google represents another form. Tensor G5, landing in the Pixel 10 as the first SoC to fully run the latest-generation Gemini Nano, is uniformly scheduled by AICore, so that system-level Agent features like Magic Cue and Pixel Screenshots can be enabled by default without relying on the cloud. Huawei is an exemplary case of building three-layer synergy in China: Kirin, Da Vinci NPU, Pangu on-device model, and HMAF are all self-owned, coupling from bottom to top into a complete three-layer foundation.
Interlocking Mechanism of the Three-Layered Foundation for End-Device OS Agents
Source: CloudSurge AI
Four. Above the Foundation: Other Key Variables for the Long-Term Moat
The three-layer synergy builds the core of the moat. Above the foundation, numerous other variables affect product competitiveness in the OS Agent era, including Agent-App interaction capabilities, privacy protection, etc.
The interaction between OS Agents and Apps is at the forefront of the contest between OS vendors and App vendors. Currently, two paths run in parallel. One is screen recognition and automation, including Gemini Live screen sharing, Apple Visual Intelligence, Circle to Search, etc. OS Agents intervene in Apps by reading the screen and clicking buttons. This works for single tasks, but each invocation lacks structured information, making it difficult to build stable multi-step workflows. The other is API deep integration, including Google AppFunctions, Apple App Intents, Huawei Intents Kit, etc. Apps expose core actions as structured interfaces to the system, enabling stable Agent calls and the building of multi-step workflows. Whether the API path can spread depends not on OS vendors but on App vendors. Handing over core functionalities to be called by Agents means users may no longer directly open the App, with risks of brand exposure, ad slots, behavioral data, and payment portals being intercepted by the OS. This will be a core battleground for the distribution of end-user traffic.
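The difference between the two paths is easiest to see in what the Agent receives back. With the API path, each step returns structured data that the next step can consume directly. The sketch below is a hypothetical illustration of that pattern only; the registry, the action names, and the food-ordering handlers are invented for the example and are not the AppFunctions, App Intents, or Intents Kit APIs.

```python
from typing import Callable, Dict


class ActionRegistry:
    """Toy system-side registry of actions that Apps expose to the Agent."""

    def __init__(self):
        self._actions: Dict[str, Callable[..., dict]] = {}

    def register(self, name: str, handler: Callable[..., dict]) -> None:
        self._actions[name] = handler

    def call(self, name: str, **params) -> dict:
        if name not in self._actions:
            raise KeyError(f"no App has exposed the action {name!r}")
        return self._actions[name](**params)


registry = ActionRegistry()

# A hypothetical food App exposes two core actions with structured results.
registry.register(
    "food_app.search",
    lambda query: {"restaurant_id": "r42", "name": f"Best {query}"},
)
registry.register(
    "food_app.order",
    lambda restaurant_id, item: {"order_id": f"{restaurant_id}-001", "item": item},
)

# The Agent chains the steps: the output of one call feeds the next.
found = registry.call("food_app.search", query="noodles")
order = registry.call("food_app.order",
                      restaurant_id=found["restaurant_id"], item="beef noodles")
print(order)
```

A screen-reading Agent would have to recover `restaurant_id` from pixels at every step, which is where multi-step workflows tend to break. The sketch also makes the App vendor's concern visible: once both actions are registered, the order completes without the user ever opening the App.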
Privacy protection is a key value proposition and bottom line for end-device systems. OS vendors hold the deepest system-level permissions and the most sensitive user data on the end-device side. Privacy is both a matter of principle and a prerequisite for the long-term advancement of everything discussed above. Apple has built an end-device-based privacy protection system through an integrated hardware-level security design shared between the on-device Secure Enclave, a dedicated security subsystem, and Private Cloud Compute nodes. This product strategy has turned "Privacy. That’s Apple." into a core brand label for Apple in the global premium market, thereby winning user trust.
Apple's "Privacy. That’s Apple." Label
Source: Apple Website
The three-layer synergy establishes the core of the moat, and these long-term variables above the foundation influence how deeply it can be fortified.
Five. More Than Just Remaking the OS
Under the trend of end-device OS agentization, the more solid the three-layered foundation of system-level AI Runtime, controllable chips, and the end-cloud model matrix, the higher the product baseline for OS vendors in this battle and the greater their space for differentiation. OS vendors that grasp this trend will have the opportunity to drive a reset in the distribution of traffic at the end-device entry point, securing a stronger competitive position.
This trend extends beyond phones and PCs. The underlying capabilities of OS Agents are spilling over into more terminals along the multi-device ecosystems already built by each company, especially IoT. Controllable chips are moving into scenarios like automotive SoCs; Huawei has already deployed vehicle-grade Kirin chips, and Xiaomi's HyperOS is entering its own vehicle models. On-device models are being lightened for migration to new form-factor hardware like glasses; the Android XR smart glasses jointly developed by Google, Samsung, Gentle Monster, and Warby Parker are set to launch in Fall 2026. Runtime and Agent synergy is expanding to device clusters via the "Super Terminal/Distributed" frameworks already deployed by each company, e.g., Huawei's 1+8+N and Harmony Distributed Soft Bus, Xiaomi's "Human-Vehicle-Home Full Ecosystem" and HyperConnect, Apple's Continuity, and Google's Cross device SDK and Cross device services. The battle over OS Agents is far from limited to the victory or defeat on phones and PCs.
AICore has been polished for nearly two years; Apple's OS and Apple silicon series chips have been co-evolving for over a decade; Tensor has been revised all the way to G5, with the Pixel 10 finally capable of shouldering the burden of Gemini Nano v3. The outcome of this battle never lies in the one or two hours of a press conference, but in the chips, models, and Runtime honed across generations.
References:
- Gemini Intelligence brings proactive AI to Android|Google Blog
- I/O 2026: Welcome to the agentic Gemini era|Google Blog
- Phi Silica, small but mighty on-device SLM|Windows Experience Blog
- Apple Delays Siri Upgrade Indefinitely|Bloomberg
- HarmonyOS 6 Developer Beta Launch Press Release (HDC 2025)|Huawei
- The latest Gemini Nano with on-device ML Kit GenAI APIs|Android Developers Blog
- Foundation Models framework documentation|Apple Developer
- Harmony Smart Agent Framework White Paper|Huawei Developer
- On-Device Llama 3.1 with Core ML|Apple Machine Learning Research
- Apple Intelligence Foundation Language Models Tech Report 2025|Apple Machine Learning Research
- Google Tensor G5: Benchmarks and everything you need to know|Android Central
- Google’s new M5 SoC(Tensor G5 detailed - Matryoshka Transformer)|Jon Peddie Research
- Private Cloud Compute: A new frontier for AI privacy in the cloud|Apple Security Engineering
- Overview of AppFunctions|Android Developers
- App Intents|Apple Developer
- Introduction to Intents Kit (HarmonyOS)|Huawei Developer
- The Google Pixel 10 Pro’s Tensor G5 chip is impressive—if you compare it to an iPhone 14|Macworld
- Gemma 3n model overview|Google AI for Developers













