Running Gemma 4 Locally on iPhone Goes Viral: How Far Are We from the Zero Token Era?

marsbit · Published 2026-04-06 · Last updated 2026-04-06

Abstract

Google's newly open-sourced Gemma 4 model, built on the same architecture as Gemini 3, has gained significant attention for its ability to run locally on mobile devices like the iPhone and Samsung Galaxy. With smaller versions such as E2B (2.3B parameters) and E4B (4.5B parameters), it supports native multimodal capabilities and offers a 128K context window. Users report impressive speeds—over 40 tokens per second on Apple chips with MLX optimization—making it feel "like magic." The model is accessible via Google’s official AI Edge Gallery app, ensuring ease of use and security. While Gemma 4 excels in tasks like text generation, coding, and image understanding, it struggles with more complex agent-based workflows, such as tool calling and structured outputs, where models like Qwen3-coder perform better. Despite some limitations in reasoning, Gemma 4’s local performance hints at a future where everyday AI tasks—chat, coding, reasoning—can be handled offline, reducing reliance on cloud-based token services. Although cloud models still lead in advanced reasoning and large-scale multi-agent tasks, the trend suggests that as hardware and quantization improve, on-device models will increasingly handle high-frequency simple tasks. This shift could disrupt the AI industry’s reliance on token sales and API subscriptions, pushing providers to focus on more complex, data-intensive capabilities. Gemma 4 is just the beginning of this transformation.

Machine Heart Editorial Department

Gemma 4, the model Google open-sourced a few days ago, has taken the industry by surprise.

It adopts the same technical architecture as Gemini 3, supports native full-modality, and ranked third globally on the Arena AI leaderboard. It comes in multiple sizes: the smaller models, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), can be deployed to run locally on mobile devices with a 128K context window. They could be described as a "Gemini alternative that fits in your pocket".

As expected, the model quickly became a new toy for mobile users after its release.

One post by an X user was viewed hundreds of thousands of times. In it, he shared a video of Gemma 4 running locally on an iPhone, processing images and audio and even controlling the flashlight. He said Gemma 4 is incredibly fast and feels like magic.

One user quantified this speed on an iPhone 17 Pro: with MLX (Apple's machine-learning framework, optimized for Apple silicon), the model's inference speed can exceed 40 tokens per second.

Others achieved similar speeds on a Samsung Galaxy, even with a 'thinking mode' enabled. This led people to exclaim that it's "unbelievably fast".

Such speeds make running AI models on mobile devices genuinely viable, which is particularly valuable in privacy-sensitive scenarios like healthcare, where data never has to leave the device.

The 128K context window also makes these small models more attractive.

So how do you run it? It's simple, and not just for geeks, because Google released an official app: Google AI Edge Gallery. Anyone who wants to try it on their phone can download the app, download the desired model version, and open it to run.

Moreover, since it's officially released by Google, security concerns are naturally less of an issue.

Beyond these small models running on phones, one developer tried a larger version of Gemma 4 on more powerful hardware: the 26B Mixture-of-Experts variant on a MacBook Pro with an M5 Pro chip.

For direct conversation, this model is still very fast, with smooth text generation and code explanation.

But when he tried to use Gemma 4 as a coding agent, problems arose. Running an agent demands long context (Gemma 4 26B offers a 256K window), complex prompts, and stable tool calls, and Gemma 4 clearly couldn't keep up, often freezing, throwing errors, or producing malformed structured output.

The turning point came when he switched to Qwen3-coder. In the same environment, file creation, command execution, and multi-step tasks all ran normally. He believes the problem lies not with the agent framework but with whether the model itself has been tuned for tool calling and structured output. Gemma 4 may fall short here, or perhaps this developer simply hasn't found the right setup yet.
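The distinction the developer draws, whether a model reliably emits well-formed tool calls, is easy to make concrete. Below is a minimal sketch of the kind of validation an agent loop performs on model output; the tool names and JSON schema are hypothetical, not tied to Gemma, Qwen, or any real agent framework. A model not tuned for structured output fails at exactly this step:

```python
import json

# Hypothetical tool registry: each tool name maps to the exact set of
# argument keys the agent expects the model to supply.
TOOLS = {
    "create_file": {"path", "content"},
    "run_command": {"command"},
}

def parse_tool_call(model_output: str):
    """Return (tool, args) if the model's output is a well-formed tool
    call, or None if it is malformed -- the failure mode described above."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # model emitted prose or broken JSON
    if not isinstance(call, dict):
        return None  # valid JSON, but not an object
    tool = call.get("tool")
    args = call.get("arguments")
    if tool not in TOOLS or not isinstance(args, dict):
        return None  # unknown tool, or arguments missing
    if set(args) != TOOLS[tool]:
        return None  # missing or extra argument keys
    return tool, args

# A well-formed call parses; a conversational answer does not.
ok = parse_tool_call('{"tool": "run_command", "arguments": {"command": "ls"}}')
bad = parse_tool_call("Sure! I will now run the ls command for you.")
```

An agent framework retries or aborts whenever this check returns None, which is why a model that drifts into prose mid-task appears to "freeze" or "report errors" even though the framework itself is fine.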

Additionally, some say that Gemma 4's intelligence level is still somewhat lacking.

Even so, the emergence of a "performance powerhouse" like Gemma 4 should not be underestimated. If in the future, a large number of daily queries, chats, simple reasoning, code generation, and image understanding tasks can all be run locally without needing to buy tokens, wouldn't vendors who sell tokens be in an awkward position?

Of course, the situation isn't that dire yet. There is still a gap between today's open-source models and the cutting-edge closed-source flagships. Furthermore, most capable open-source models remain constrained by hardware, so for now they haven't reached a truly usable level on-device.

But the future trend is clear. In the short term, cloud-based closed-source models will still lead in cutting-edge complex reasoning and ultra-large-scale multi-agent collaboration. But in the long term, as hardware continues to advance and quantization techniques continue to optimize, on-device models will gradually encroach on the cloud's high-frequency simple tasks.
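Why quantization matters so much for that encroachment is simple arithmetic. Here is a minimal sketch of symmetric int8 weight quantization, in pure Python for clarity; real on-device runtimes use far more sophisticated schemes (per-channel scales, 4-bit weights, fused kernels), so this only illustrates the principle:

```python
def quantize_int8(weights):
    """Map floats to int8 range [-127, 127] using one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid 0 for all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

weights = [0.12, -0.98, 0.55, 0.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)

# Each quantized weight needs 1 byte instead of 4 for float32: a 4x
# memory cut, at the cost of a small rounding error per weight.
max_err = max(abs(a - b) for a, b in zip(weights, approx))
```

A 4x reduction is what turns a model that needs 16 GB of RAM into one that fits in 4 GB, and 4-bit schemes push that further still, which is exactly the lever that moves "high-frequency simple tasks" from the cloud onto the phone.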

Those vendors who rely solely on selling tokens and API subscriptions will have to compete more fiercely on the "truly tough" parts — super-powered Agents, ultra-long reliable context, and specialized capabilities requiring massive real-time data.

Gemma 4 is just the beginning. The next surprise might be an on-device model that, in daily use, completely makes users unaware of the difference between "local" and "cloud". When that day comes, the entire AI industry's business model will undergo a real reshuffle.

This article is from the WeChat public account "Machine Heart" (ID: almosthuman2014), author: Machine Heart

Related Questions

Q: What is the key feature of Google's newly open-sourced Gemma 4 model that makes it suitable for mobile devices?

A: Gemma 4 has smaller variants, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), which are designed to run locally on mobile devices with a 128K context window.

Q: What speed was reported for running Gemma 4 on an iPhone 17 Pro with Apple's MLX framework?

A: The model's inference speed was reported to exceed 40 tokens per second on an iPhone 17 Pro using Apple's optimized MLX framework.

Q: What is the name of the official Google app that allows users to easily run Gemma 4 on their mobile devices?

A: The official app is called 'Google AI Edge Gallery', where users can download the model and run it directly.

Q: What was a significant limitation observed when using the larger Gemma 4 26B model as a coding agent?

A: The Gemma 4 26B model struggled with tasks requiring large context (256K), complex prompts, and stable tool calls, often leading to crashes, errors, or incorrect structured outputs.

Q: According to the article, what long-term impact could the advancement of on-device models like Gemma 4 have on the AI industry?

A: On-device models could gradually erode the market for cloud-based models on high-frequency simple queries, forcing API and token-selling companies to focus on more complex areas like super-powered Agents, ultra-long reliable context, and capabilities requiring massive real-time data.

