Running Gemma 4 Locally on iPhone Goes Viral: How Far Are We from the Zero Token Era?

marsbitPublicado em 2026-04-06Última atualização em 2026-04-06

Resumo

Google's newly open-sourced Gemma 4 model, built on the same architecture as Gemini 3, has gained significant attention for its ability to run locally on mobile devices like the iPhone and Samsung Galaxy. With smaller versions such as E2B (2.3B parameters) and E4B (4.5B parameters), it supports native multimodal capabilities and offers a 128K context window. Users report impressive speeds—over 40 tokens per second on Apple chips with MLX optimization—making it feel "like magic." The model is accessible via Google’s official AI Edge Gallery app, ensuring ease of use and security. While Gemma 4 excels in tasks like text generation, coding, and image understanding, it struggles with more complex agent-based workflows, such as tool calling and structured outputs, where models like Qwen3-coder perform better. Despite some limitations in reasoning, Gemma 4’s local performance hints at a future where everyday AI tasks—chat, coding, reasoning—can be handled offline, reducing reliance on cloud-based token services. Although cloud models still lead in advanced reasoning and large-scale multi-agent tasks, the trend suggests that as hardware and quantization improve, on-device models will increasingly handle high-frequency simple tasks. This shift could disrupt the AI industry’s reliance on token sales and API subscriptions, pushing providers to focus on more complex, data-intensive capabilities. Gemma 4 is just the beginning of this transformation.

Machine Heart Editorial Department

Google's newly open-sourced model, Gemma 4, released a few days ago, gave the industry a huge surprise.

It adopts the same technological architecture as Gemini 3, supports native full-modality, ranked third globally on the Arena AI leaderboard, and comes in multiple model sizes. Several smaller models — E2B (2.3B effective parameters) and E4B (4.5B effective parameters) — can be deployed directly to run locally on mobile devices, with a context window of 128K. They can be described as a "Gemini alternative that fits in your pocket".

As expected, the model quickly became a new toy for mobile users after its release.

Among them, a post by an X user was viewed hundreds of thousands of times. In the post, he shared a video demonstrating how he ran Gemma 4 locally on an iPhone, including processing images, audio, and controlling the flashlight. He stated that Gemma 4 is incredibly fast, feeling like magic.

Someone quantified this speed on an iPhone 17 Pro, pointing out that if the phone uses Apple silicon, the model's inference speed can exceed 40 tokens per second with the help of MLX (Apple's machine learning framework) optimized for this chipset.

Others achieved similar speeds on a Samsung Galaxy, even with a 'thinking mode' enabled. This led people to exclaim that it's "unbelievably fast".

Such speeds make running AI models on mobile devices a viable option for the future, and are particularly useful in sensitive scenarios like healthcare.

The 128k context window also makes these small models more attractive.

So how do you run it? It's actually very simple and not exclusive to geeks, because Google released an official App — Google AI Edge Gallery. Those who want to experience it on their phone can directly download this App, then download the desired model version, and open it to run.

Moreover, since it's officially released by Google, security concerns are naturally less of an issue.

Beyond these small models running on phones, some have tried larger versions of Gemma 4 on more powerful hardware, such as running Gemma 4 Mixture-of-Experts 26B on a MacBook Pro with an M5 Pro chip.

For direct conversation, this model is still very fast, with smooth text generation and code explanation.

But when he actually tried to use Gemma 4 as a coding agent, problems arose. Because running an agent requires a large context (Gemma 4 26B has a 256k context window), complex prompts, and stable tool calls, Gemma 4 clearly couldn't handle it, often freezing, reporting errors, or outputting incorrect structures.

The turning point came when he switched the model to qwen3-coder. In the same environment, file creation, command execution, and multi-step tasks all ran normally. He believes the problem lies not with the agent framework, but with whether the model itself has been optimized for "tool calling + structured output". In this regard, Gemma 4 might not be sufficient, or perhaps this developer hasn't found the correct method yet.

Additionally, some say that Gemma 4's intelligence level is still somewhat lacking.

Even so, the emergence of a "performance powerhouse" like Gemma 4 should not be underestimated. If in the future, a large number of daily queries, chats, simple reasoning, code generation, and image understanding tasks can all be run locally without needing to buy tokens, wouldn't vendors who sell tokens be in an awkward position?

Of course, the current situation is not that pessimistic yet. After all, there is still a gap between the currently open-sourced models and the cutting-edge closed-source flagship models. Furthermore, most capable open-source models are still constrained by hardware capabilities and暂时 (zànshí - temporarily) haven't reached a usable level on the device side.

But the future trend is clear. In the short term, cloud-based closed-source models will still lead in cutting-edge complex reasoning and ultra-large-scale multi-agent collaboration. But in the long term, as hardware continues to advance and quantization techniques continue to optimize, on-device models will gradually encroach on the cloud's high-frequency simple tasks.

Those vendors who rely solely on selling tokens and API subscriptions will have to compete more fiercely on the "truly tough" parts — super-powered Agents, ultra-long reliable context, and specialized capabilities requiring massive real-time data.

Gemma 4 is just the beginning. The next surprise might be an on-device model that, in daily use, completely makes users unaware of the difference between "local" and "cloud". When that day comes, the entire AI industry's business model will undergo a real reshuffle.

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Machine Heart

Criptomoedas em alta

Haedal ProtocolHAEDAL

Perguntas relacionadas

QWhat is the key feature of Google's newly open-sourced Gemma 4 model that makes it suitable for mobile devices?

AGemma 4 has smaller variants, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), which are designed to run locally on mobile devices with a 128K context window.

QWhat speed was reported for running Gemma 4 on an iPhone 17 Pro with Apple's MLX framework?

AThe model's inference speed was reported to exceed 40 tokens per second on an iPhone 17 Pro using Apple's optimized MLX framework.

QWhat is the name of the official Google app that allows users to easily run Gemma 4 on their mobile devices?

AThe official app is called 'Google AI Edge Gallery', where users can download the model and run it directly.

QWhat was a significant limitation observed when using the larger Gemma 4 26B model as a coding agent?

AThe Gemma 4 26B model struggled with tasks requiring large context (256K), complex prompts, and stable tool calls, often leading to crashes, errors, or incorrect structured outputs.

QAccording to the article, what long-term impact could the advancement of on-device models like Gemma 4 have on the AI industry?

AOn-device models could gradually erode the market for cloud-based models on high-frequency simple queries, forcing API and token-selling companies to focus on more complex areas like super-powered Agents, ultra-long reliable context, and capabilities requiring massive real-time data.

Leituras Relacionadas

When the World Cup Collides with Agents: From Web2 to Web3, How Are Wallets Evolving into Agentic Wallets?

World Cup as a Catalyst for Agentic Wallets: From Web2 to Web3 This article explores how the World Cup provides a real-world scenario for observing the evolution of digital wallets from simple asset managers towards "Agentic Wallets"—intelligent, AI-powered interfaces. Using the example of prediction markets like Polymarket, it illustrates how AI Agents can lower the barrier to Web3 interaction. Instead of navigating complex DApps, users can express intent in natural language (e.g., "I think Portugal will win") within platforms like Discord or web pages. The Agent then interprets this intent, finds the relevant market, and seamlessly guides the user through the on-chain transaction via their wallet. The core shift is from wallets as mere "function menus" for signing transactions to "intent interpreters" that understand user goals. The article highlights parallel developments in traditional finance, such as Mastercard's "Agent Pay" and WeChat Pay's AI tests, which focus on granting AI controlled, authorized, and auditable payment capabilities. This underscores a broader trend of AI entering the financial layer. However, the article emphasizes that the primary challenge for Agentic Wallets in Web3 is not automation but establishing clear security boundaries. Unlike traditional systems with chargebacks, on-chain transactions are often irreversible. Therefore, future wallets must ensure users retain ultimate control and comprehension. They need to transparently communicate an Agent's permissions, spending limits, authorized durations, and provide easy ways to pause or revoke access. The World Cup experiments represent early steps toward wallets that are not just applications but ubiquitous, intelligent interfaces that simplify Web3 while keeping users securely in control.

marsbitHá 1h

When the World Cup Collides with Agents: From Web2 to Web3, How Are Wallets Evolving into Agentic Wallets?

marsbitHá 1h

Options Don't Work in DeFi? Vitalik Might Not Agree

For years, the prevailing view has been that options struggle to gain traction in DeFi due to complexity, fragmented liquidity, and lack of natural demand compared to products like perpetual futures. However, a recent algorithmic stablecoin design proposed by Vitalik Buterin presents a different perspective, using options not as a standalone trading product, but as foundational infrastructure for other financial instruments. In this design, one unit of ETH is split into two components: a "stable" side (P) that retains value up to a specified strike price, and an "upside" side (N) that captures all appreciation above that strike. Combined, they always equal one ETH, eliminating debt, margin, and liquidation risks inherent in typical collateralized debt position (CDP) stablecoins. The stable component essentially mimics the payoff of a covered call option. To function as a stablecoin, this structure requires continuously rolling deep in-the-money calls, which introduces challenges like rollover slippage, predictable transaction flow vulnerable to front-running, and persistent liquidity needs. A core hurdle is finding consistent buyers for the leveraged ETH upside exposure (N). While it offers leverage without funding rates or liquidation, it must compete with simpler alternatives like direct call options or perpetuals. The system's scalability depends on a sustained demand for this specific form of leverage. The author draws parallels to their experience with Rysk, where earlier versions of DeFi options protocols struggled. The breakthrough came with Rysk V12, which aligns incentives: asset holders generate yield by selling covered calls against their holdings, while market makers efficiently acquire the desired option exposure. This demonstrates that options can find product-market fit when embedded as a risk distribution and pricing engine within structured products, stablecoins, or yield-generating assets, rather than marketed as a complex direct trading instrument. Vitalik's proposal reinforces this architectural approach—using fully collateralized, non-custodial, and physically settled options as a fundamental building block. The real opportunity for options in DeFi may lie not in becoming the next perpetual swap, but in powering the next generation of on-chain financial products.

marsbitHá 1h

Options Don't Work in DeFi? Vitalik Might Not Agree

marsbitHá 1h

Ethereum Foundation Faces Another Departure as Hsiao-Wei Wang Steps Down

Hsiao-Wei Wang, a key researcher and contributor, has resigned from the Ethereum Foundation. Her work was pivotal in Ethereum's transition to Proof-of-Stake and projects related to scalability and validator engagement. This departure is part of a broader period of organizational restructuring and strategic reorientation for the Foundation, which has seen several other high-profile exits recently. The Foundation is implementing leadership and structural changes to better define its role within the ecosystem, while emphasizing that Ethereum's development remains decentralized. This transition occurs as developers continue work on major initiatives like the Glamsterdam update and Layer-1 scaling. Market observers view these changes as the Foundation adapting to support Ethereum's ongoing growth.

TheNewsCryptoHá 2h

Ethereum Foundation Faces Another Departure as Hsiao-Wei Wang Steps Down

TheNewsCryptoHá 2h

Conversation with Investor Zheng Di: MicroStrategy's Coin Sale Experiment, AI Economy, and Opportunities in US Stocks

Frontier tech investor Zheng "Didier" Di discusses the recent Bitcoin price drop, the financial strategy shift at MicroStrategy, the AI-driven surge in U.S. stocks, and the evolving role of crypto exchanges. Didier posits that the recent BTC decline stems less from macro factors or ETF outflows, and more from market repricing due to MicroStrategy's new financial structure. Following a wave of preferred stock and debt issuance (STRC, STRZ, etc.), MicroStrategy must now manage cash flow to pay dividends, potentially leading to a market expectation of sustained, small-scale BTC sales to maintain its "per-share bitcoin neutral" principle. Didier views this as a financial "experiment" testing market capacity for such recurring sell pressure, which, while creating near-term structural headwinds, likely avoids a true "death spiral" absent major new external shocks. Shifting to AI, Didier argues that tokens are becoming the new form of labor, with AI models and compute (tokenized inputs) increasingly replacing human roles in execution and middle-management. This drives enterprise efficiency and higher margins, fueling the sustained rally in U.S. semiconductor, data center, and infrastructure stocks. He foresees an emerging "machine economy" where automated agents transact and collaborate on-chain. Regarding crypto exchanges offering U.S. equities, Didier sees this as a natural evolution. With few crypto-native assets generating lasting value, exchanges are pivoting towards real-world assets (RWAs) like stocks and bonds. This doesn't necessarily cannibalize crypto but reflects a maturing industry focusing on blockchain's core utilities: decentralized choice and efficient settlement. He notes that trading logic for crypto natives doesn't need to drastically change, as meme-driven and fundamentalist strategies find analogs in U.S. markets. The "1011 event" (likely referring to a major market crash) severely damaged crypto market liquidity, marking a probable end to the altcoin speculative cycle, with capital flowing towards the deeper liquidity of U.S. markets. For the macro outlook, Didier is cautious about near-term market pressure from potential mega-IPOs (e.g., SpaceX) and the U.S. midterm elections, which could bring more regulatory scrutiny. Long-term, he remains bullish on AI's productivity gains and its convergence with blockchain/Web3, predicting a shift from speculative frenzy to a more institutionalized, industrial phase for the crypto sector.

marsbitHá 2h

Conversation with Investor Zheng Di: MicroStrategy's Coin Sale Experiment, AI Economy, and Opportunities in US Stocks

marsbitHá 2h

Playnance’s $GCOIN Lists on KoinBX Amid Rapid Growth in India

Playnance's native token, $GCOIN, has been listed on the cryptocurrency exchange KoinBX as of June 18. This move aims to enhance accessibility for its rapidly growing community, particularly in India, where the blockchain-powered Web3 iGaming ecosystem has gained significant traction. Over 130 partners in Playnance's "Be the Boss" program have built communities engaging thousands of active players in the region. The "Be the Boss" model allows participants to create and manage their own gaming communities, earning rewards tied to community activity. CEO Pini Peter noted India's high engagement, with community leaders successfully building player networks. One partner, Dr. Nicolas, reported earning over $57,000 through the program in recent months, highlighting both the financial rewards and the opportunity to grow an engaged community. $GCOIN serves as the ecosystem's core utility token, incentivizing participation and aligning the interests of players and community leaders ("Bosses"). The listing on KoinBX is part of Playnance's strategy to expand globally, increasing the token's utility and accessibility by combining community ownership, gamified engagement, and blockchain-based incentives. Founded in 2020, Playnance is a Web3 iGaming infrastructure company focused on creating live, non-custodial, on-chain products to onboard mainstream users. It currently processes approximately one million transactions daily, aiming to simplify the user experience while maintaining full on-chain transparency.

TheNewsCryptoHá 2h

Playnance’s $GCOIN Lists on KoinBX Amid Rapid Growth in India

TheNewsCryptoHá 2h

Trading

Spot

Futuros

Artigos em Destaque

Como comprar 4

Bem-vindo à HTX.com!Tornámos a compra de 4 (4) simples e conveniente.Segue o nosso guia passo a passo para iniciar a tua jornada no mundo das criptos.Passo 1: cria a tua conta HTXUtiliza o teu e-mail ou número de telefone para te inscreveres numa conta gratuita na HTX.Desfruta de um processo de inscrição sem complicações e desbloqueia todas as funcionalidades.Obter a minha contaPasso 2: vai para Comprar Cripto e escolhe o teu método de pagamentoCartão de crédito/débito: usa o teu visa ou mastercard para comprar 4 (4) instantaneamente.Saldo: usa os fundos da tua conta HTX para transacionar sem problemas.Terceiros: adicionamos métodos de pagamento populares, como Google Pay e Apple Pay, para aumentar a conveniência.P2P: transaciona diretamente com outros utilizadores na HTX.Mercado de balcão (OTC): oferecemos serviços personalizados e taxas de câmbio competitivas para os traders.Passo 3: armazena teu 4 (4)Depois de comprar o teu 4 (4), armazena-o na tua conta HTX.Alternativamente, podes enviá-lo para outro lugar através de transferência blockchain ou usá-lo para transacionar outras criptomoedas.Passo 4: transaciona 4 (4)Transaciona facilmente 4 (4) no mercado à vista da HTX.Acede simplesmente à tua conta, seleciona o teu par de trading, executa as tuas transações e monitoriza em tempo real.Oferecemos uma experiência de fácil utilização tanto para principiantes como para traders experientes.

624 Visualizações TotaisPublicado em {updateTime}Atualizado em 2026.06.02

Discussões

Bem-vindo à Comunidade HTX. Aqui, pode manter-se informado sobre os mais recentes desenvolvimentos da plataforma e obter acesso a análises profissionais de mercado. As opiniões dos utilizadores sobre o preço de 4 (4) são apresentadas abaixo.

Running Gemma 4 Locally on iPhone Goes Viral: How Far Are We from the Zero Token Era?

Resumo

Criptomoedas em alta

Perguntas relacionadas

Leituras Relacionadas

When the World Cup Collides with Agents: From Web2 to Web3, How Are Wallets Evolving into Agentic Wallets?

Options Don't Work in DeFi? Vitalik Might Not Agree

Ethereum Foundation Faces Another Departure as Hsiao-Wei Wang Steps Down

Conversation with Investor Zheng Di: MicroStrategy's Coin Sale Experiment, AI Economy, and Opportunities in US Stocks

Playnance’s $GCOIN Lists on KoinBX Amid Rapid Growth in India

Trading

Artigos em Destaque

Como comprar 4

Discussões

Categorias populares

Etiquetas Populares