Running Gemma 4 Locally on iPhone Goes Viral: How Far Are We from the Zero Token Era?

marsbit | Published 2026-04-06 | Updated 2026-04-06

Summary

Google's newly open-sourced Gemma 4 model, built on the same architecture as Gemini 3, has drawn attention for running locally on mobile devices such as the iPhone and Samsung Galaxy. The smaller versions, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), are natively multimodal and offer a 128K context window. Users report speeds of over 40 tokens per second on Apple chips with MLX optimization, fast enough that one described it as feeling "like magic." The model is available through Google's official AI Edge Gallery app, which makes it easy to install and eases security concerns. Gemma 4 handles text generation, code explanation, and image understanding well, but it struggles with agent workflows that depend on tool calling and structured output, where one tester found Qwen3-coder more reliable. Even with these limits, its local performance hints at a future where everyday tasks such as chat, simple reasoning, and code generation run offline without buying cloud tokens. Cloud models still lead in advanced reasoning and large-scale multi-agent work, but as hardware and quantization improve, on-device models are likely to take over more high-frequency simple tasks. That shift could undercut business models built on token sales and API subscriptions and push providers toward more complex, data-intensive capabilities. Gemma 4 is just the beginning of this transformation.

Machine Heart Editorial Department

Google's newly open-sourced model, Gemma 4, released a few days ago, gave the industry a huge surprise.

It uses the same underlying architecture as Gemini 3, is natively multimodal, ranked third globally on the Arena AI leaderboard, and comes in multiple sizes. The smaller models, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), can run locally on mobile devices with a 128K context window. They have been described as a "Gemini alternative that fits in your pocket".

As expected, the model quickly became a new toy for mobile users after its release.

One post on X was viewed hundreds of thousands of times. In it, the user shared a video of Gemma 4 running locally on an iPhone, where it processed images and audio and controlled the flashlight. He said Gemma 4 was incredibly fast and felt like magic.

Another user measured the speed on an iPhone 17 Pro and noted that on Apple silicon, with MLX (Apple's machine learning framework, which is optimized for these chips), inference can exceed 40 tokens per second.
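To put that figure in perspective, a back-of-the-envelope calculation helps. The sketch below assumes a constant decode rate of 40 tokens per second (the lower bound reported above) and ignores prompt processing time. The reply lengths are illustrative assumptions, not numbers from the post.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a constant decode rate (prompt processing ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPORTED_RATE = 40.0  # tokens per second, the figure reported for the iPhone 17 Pro

# Hypothetical reply lengths: a short answer, a long answer, a multi-page output.
for reply_tokens in (100, 500, 2000):
    secs = generation_seconds(reply_tokens, REPORTED_RATE)
    print(f"{reply_tokens:>5} tokens -> {secs:.1f} s")
```

At this rate a short chat reply arrives in a few seconds, which is roughly why users describe the experience as feeling instant.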

Others achieved similar speeds on a Samsung Galaxy, even with "thinking mode" enabled, prompting people to call it "unbelievably fast".

Speeds like these make running AI models on mobile devices a viable option, which is particularly useful in sensitive scenarios such as healthcare, where data is best kept on the device.

The 128K context window also makes these small models more attractive.

So how do you run it? It is simple and requires no technical expertise, because Google has released an official app, Google AI Edge Gallery. Anyone who wants to try it on a phone can download the app, download the desired model version, and run it.

Moreover, since it's officially released by Google, security concerns are naturally less of an issue.

Beyond these small models running on phones, some have tried larger versions of Gemma 4 on more powerful hardware, such as running Gemma 4 Mixture-of-Experts 26B on a MacBook Pro with an M5 Pro chip.

For direct conversation, this model is still very fast, with smooth text generation and code explanation.

But when the same tester tried to use Gemma 4 as a coding agent, problems arose. Running an agent requires a large context (Gemma 4 26B has a 256K context window), complex prompts, and stable tool calls, and Gemma 4 could not handle the combination: it often froze, reported errors, or produced incorrectly structured output.

Things changed when he switched the model to qwen3-coder. In the same environment, file creation, command execution, and multi-step tasks all ran normally. In his view, the problem lies not with the agent framework but with whether the model itself has been optimized for "tool calling + structured output". On this front Gemma 4 may fall short, or the developer may simply not have found the right setup yet.
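Why structured output matters so much for agents is easier to see in code. An agent loop typically has to parse the model's reply into a machine-readable call before it can act, so any stray prose or missing field stops the task. The sketch below is a minimal illustration using only the standard library. The JSON shape (`tool` and `arguments` keys) and the tool names `create_file` and `run_command` are hypothetical, not the format of any particular agent framework or model.

```python
import json

# Hypothetical tool registry: tool name -> argument names the tool requires.
TOOLS = {
    "create_file": {"path", "content"},
    "run_command": {"command"},
}


def parse_tool_call(raw: str):
    """Return (tool, arguments) for a well-formed call, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    tool = data.get("tool")
    args = data.get("arguments")
    if not isinstance(tool, str) or tool not in TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = TOOLS[tool] - set(args)
    if missing:
        raise ValueError(f"missing arguments: {sorted(missing)}")
    return tool, args


# A reply the agent can act on.
good = '{"tool": "run_command", "arguments": {"command": "ls"}}'
print(parse_tool_call(good))

# A reply wrapped in chatty prose: readable to a human, unusable to the agent.
bad = 'Sure! Here is the call: {"tool": "run_command"}'
try:
    parse_tool_call(bad)
except ValueError as err:
    print("rejected:", err)
```

A model that is fluent in conversation can still fail this kind of check often enough to stall a multi-step task, which matches the behavior the tester described.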

Additionally, some say that Gemma 4's intelligence level is still somewhat lacking.

Even so, the emergence of a "performance powerhouse" like Gemma 4 should not be underestimated. If in the future, a large number of daily queries, chats, simple reasoning, code generation, and image understanding tasks can all be run locally without needing to buy tokens, wouldn't vendors who sell tokens be in an awkward position?

Of course, the current situation is not that pessimistic yet. After all, there is still a gap between today's open-source models and the cutting-edge closed-source flagship models. Furthermore, most capable open-source models are still constrained by hardware and have not yet reached a usable level on-device.

But the future trend is clear. In the short term, cloud-based closed-source models will still lead in cutting-edge complex reasoning and ultra-large-scale multi-agent collaboration. In the long term, as hardware continues to advance and quantization techniques continue to improve, on-device models will gradually encroach on the cloud's high-frequency simple tasks.
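The role of quantization can be made concrete with a rough estimate of how much memory the model weights need at different precisions. The sketch below treats the effective parameter counts quoted earlier (2.3B and 4.5B) as the weights that must be held in memory, which is an assumption. It also ignores the KV cache, activations, and quantization metadata, so the results are rough lower bounds rather than measured footprints. The bit widths are illustrative.

```python
def weight_gigabytes(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_weight / 8 / 1e9


# Effective parameter counts quoted in the article.
MODELS = {"E2B": 2.3e9, "E4B": 4.5e9}

for name, params in MODELS.items():
    for bits in (16, 8, 4):  # illustrative precisions
        gb = weight_gigabytes(params, bits)
        print(f"{name} at {bits:>2}-bit: about {gb:.2f} GB")
```

Each halving of the bit width halves the weight storage, which is why better quantization directly expands the set of models that fit on a phone.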

Those vendors who rely solely on selling tokens and API subscriptions will have to compete more fiercely on the "truly tough" parts — super-powered Agents, ultra-long reliable context, and specialized capabilities requiring massive real-time data.

Gemma 4 is just the beginning. The next surprise might be an on-device model that, in daily use, leaves users unable to tell "local" from "cloud". When that day comes, the entire AI industry's business model will undergo a real reshuffle.

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Machine Heart

Related Questions

Q: What is the key feature of Google's newly open-sourced Gemma 4 model that makes it suitable for mobile devices?

A: Gemma 4 has smaller variants, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), which are designed to run locally on mobile devices with a 128K context window.

Q: What speed was reported for running Gemma 4 on an iPhone 17 Pro with Apple's MLX framework?

A: The model's inference speed was reported to exceed 40 tokens per second on an iPhone 17 Pro using Apple's MLX framework.

Q: What is the name of the official Google app that allows users to easily run Gemma 4 on their mobile devices?

A: The official app is called Google AI Edge Gallery. Users download the app, then download a model version and run it directly.

Q: What was a significant limitation observed when using the larger Gemma 4 26B model as a coding agent?

A: The Gemma 4 26B model struggled with agent tasks that require a large context (256K), complex prompts, and stable tool calls. It often froze, reported errors, or produced incorrectly structured output.

Q: According to the article, what long-term impact could the advancement of on-device models like Gemma 4 have on the AI industry?

A: On-device models could gradually take over high-frequency simple tasks from cloud-based models, forcing companies that sell tokens and API subscriptions to compete on harder problems such as super-powered agents, ultra-long reliable context, and capabilities requiring massive real-time data.

