Apple Re-invented Image Compression with AI: Same Quality, One-Third the File Size

marsbitPublished on 2026-05-30Last updated on 2026-05-30

Abstract

Apple’s PICO: An AI-Powered Image Codec That Cuts File Size by Two-Thirds at Equal Perceived Quality In 2025, JPEG AI became the first international standard for learned image compression. However, it, like most codecs, still prioritizes mathematical metrics like PSNR over true perceptual quality—what the human eye finds pleasing. Apple researchers have introduced PICO (Perceptual Image Codec), a neural codec designed to optimize for human perception. It tackles key practical challenges: 1) Speed: A novel "one-shot context model" accelerates entropy encoding without sacrificing compression efficiency. 2) Artifacts: A dedicated TextFidelity loss preserves text clarity, and a TilingArtifact loss eliminates color seams between image tiles processed in parallel. 3) Control: It avoids the "hallucinations" common in GAN-based perceptual models. In a large-scale human evaluation (74,925 comparisons), PICO achieved the same perceived quality as standards like AV1, VVC, and JPEG AI while using only 30-43% of the bitrate. It also outperforms other learned perceptual codecs by 20-40%. Remarkably, it runs in 230ms (encode) and 150ms (decode) on an iPhone 17 Pro Max. While less efficient on synthetic graphics, PICO represents a significant shift from optimizing mathematical scores to directly targeting human visual experience, making high-quality perceptual compression practical for consumer devices. The work builds on expertise from WaveOne, whose team joined Apple and previously adv...

How small can an image be compressed?

In February 2025, the Joint Photographic Experts Group (JPEG) quietly announced a milestone celebrated within the industry: the official release of JPEG AI, the first end-to-end learned image coding international standard, which had been years in the making and was highly anticipated.

The news spread, with many researchers reposting on social media, adding comments like 'AI has finally entered the standards.'

The JPEG standard was born in 1992 and has been a fundamental language for digital images for over three decades. Now, artificial intelligence is starting to rewrite the grammar of this language.

However, behind the celebration lies a subtle reality: even JPEG AI still has considerable distance from true 'perceptual compression.'

Engineers know that traditional metrics like Peak Signal-to-Noise Ratio (PSNR) have little to do with what the human eye perceives as 'good-looking.' An image scoring high on PSNR might look mediocre to a person, while another image with lower PSNR might appear detailed and realistic. Optimizing mathematical metrics and optimizing for human perception are two entirely different things.

For decades, from JPEG to VVC and now JPEG AI, the design logic of almost all codecs has revolved within the framework of mathematical metrics. Perceptual compression (directly optimizing for the human visual experience) has always seemed like a distant goal in academic papers, not an engineering reality that could fit into a phone.

At this critical juncture, a team of engineers at Apple quietly published a paper with their answer, codenamed: PICO.

Paper Title: What Matters in Practical Learned Image Compression

Paper Address: https://arxiv.org/pdf/2605.05148

Why is 'Looking Better' Much Harder Than 'Scoring Higher'?

To understand PICO, one must first understand what image compression is actually doing.

Saving a photo as a file is essentially a problem of 'choosing what to forget and what to remember.' With limited storage space, some information must be discarded while making it as unnoticeable as possible to the viewer. Different codecs follow different 'discarding' rules.

Traditional codecs like JPEG, AV1, and VVC are manually designed rule-based systems. They divide the image into blocks, transform, quantize, and entropy code—each step based on decades of accumulated human expertise. These systems can perform excellently on mathematical metrics like PSNR, but their design is inherently oriented toward 'reducing pixel error,' not 'reducing visual discomfort for the human eye.'

The problem is that the human eye is not a pixel error meter. The human eye's sensitivity to texture, text, and detail is far more complex than mathematical formulas. When you compress a street scene photo heavily, the PSNR might still be respectable, but you might see blurred building edges or distorted text on street signs—precisely what the human eye detects first.

The emergence of learned codecs theoretically opened a new door: neural networks could be trained end-to-end directly for human perception, rather than for mathematical formulas. But before PICO, existing perceptual learned codecs were either too slow for practical use, lacked cross-device compatibility, or couldn't flexibly control bitrate, making them impossible to integrate into a consumer-grade product.

Three Core Problems, Three Solutions

The full name of PICO is Perceptual Image Codec. This name directly states its goal: to satisfy the human eye.

The research team systematically explored millions of model configurations and introduced several key technological innovations.

First Problem: Entropy Coding is Slow. What to Do?

A major challenge in image compression: to compress further, a codec needs an 'entropy model' to accurately estimate the information content of each pixel. The most accurate method is autoregressive coding: compressing each pixel requires looking at the surrounding already-compressed pixels for sequential prediction. It's like a chef checking the pot's state after adding each ingredient before deciding the next step. Accurate, but extremely slow.

PICO's solution is the 'One-shot Context Model': decoupling the crucial 'scale parameter' in entropy coding and computing it all in one forward pass, eliminating the need for waiting back and forth; other parameters can be computed in parallel. This retains the precision of autoregressive methods while circumventing their speed bottleneck. The result: removing this module degrades model performance by 10.28%; with it, speed is almost unaffected.

Second Problem: Perceptual Training Can Cause Hallucinations. What to Do?

Images trained with GANs (Generative Adversarial Networks) often 'look realistic,' but it might be a fabricated realism—hair strands turning into non-existent patterns, smooth surfaces gaining false textures. More troublesome, the human eye is extremely sensitive to text; even a slight distortion of a single letter is immediately noticeable.

PICO specifically designed TextFidelityLoss for text: using an off-the-shelf text detector to automatically find text regions in the image, then applying strict pixel fidelity constraints in these areas while suppressing the GAN's 'creative freedom' in text regions. Experiments showed that adding this loss function halved the absolute error in text regions.

Third Problem: Processing Images in Blocks Leaves Color Block Boundaries. What to Do?

To run fast on mobile phone chips, PICO divides images into 504×504 pixel tiles, processes them separately, and then stitches them back. However, GANs during training tend to ignore low-frequency color, often causing visible color discrepancies between adjacent tiles, similar to a poorly 'stitched' feeling in photo editing. The research team specifically introduced TilingArtifactLoss, a multi-resolution L1 loss, forcing the model to maintain color consistency across multiple spatial frequencies. This measure reduced errors at tile boundaries by more than half.

Experimental Results

The Apple team didn't rely solely on benchmark metrics. They commissioned a third-party platform, Mabyduck, to organize a large-scale human subjective evaluation.

The evaluation used a blind, pairwise comparison method: 610 screened evaluators (required to pass color blindness and compression artifact detection tests) compared reconstructed results of the same image using different codecs in paired comparisons, ultimately aggregated into a Bayesian ELO score. A total of 74,925 pairwise comparisons were collected.

The final numbers tell the story: At the same visual quality, PICO's file size is only one-third to one-half that of AV1, AV2, VVC, ECM, and JPEG AI—in other words, to store the same image, it requires only 30%-43% of the bits needed by these standards. Compared to the strongest existing perceptual learned codecs (HiFiC, MRIC, etc.), PICO also saves 20%-40% in file size.

In terms of speed, on an iPhone 17 Pro Max, PICO encodes a 12MP photo in just 230 milliseconds and decodes in 150 milliseconds. Most top-tier ML codecs running on NVIDIA V100 server GPUs are slower than this.

Notably, the paper also specifically recorded a 'counterexample': on the traditional PSNR metric, PICO performed average, even inferior to DCVC-RT and VVC. This恰好印证了团队的基本判断 perfectly illustrates the team's fundamental judgment: optimizing perceptual quality and optimizing mathematical metrics are inherently two different directions; you cannot have your cake and eat it too.

A Milestone, Not the Finish Line

PICO certainly has limitations. The paper acknowledges that for highly regular synthetic images like cartoons or schematic diagrams, PICO's compression efficiency is inferior to traditional codecs, as such content is inherently more suitable for rule-driven autoregressive modeling than perceptual generation.

But these limitations do not diminish the significance of this work.

For the past thirty years, technological progress in image compression has almost exclusively occurred on the track of 'making the numbers look better.' From JPEG to HEVC to VVC, engineers optimized metrics like PSNR and SSIM generation after generation. Human visual perception remained a 'difficult problem' that was circumvented.

PICO is the first time someone has systematically and directly tackled this difficult problem: from architecture search and loss function design to large-scale human subjective evaluation, culminating in a codec that can run in real-time on a mobile phone.

The next time you share a photo from an Apple device, you might not notice anything different. But perhaps within that quiet compression process, an algorithm tailored for human perception is deciding which information is worth keeping and which can be quietly forgotten.

The Team: From WaveOne to Apple

The corresponding author of this paper is Oren Rippel, an Apple researcher and a familiar face in the compression field.

His name first gained widespread attention in 2017. At that time, he was at the startup WaveOne, publishing a paper titled 'Real-Time Adaptive Image Compression,' using neural networks to outperform all mainstream codecs while maintaining real-time speeds. That paper caused significant waves in academia and established Rippel's standing in the field of learned compression.

Afterwards, the same core personnel continued their work at WaveOne, introducing ELF-VC for video compression, achieving a 44% bitrate saving compared to H.264 on the UVG video test set while running over five times faster than similar ML codecs.

This team from WaveOne later joined Apple as a group. And this PICO is their first systematic answer on perceptual image compression, backed by Apple's computing power and platform resources.

This article is from the WeChat public account "Almost Human" (ID: almosthuman2014), author: Compression is Intelligence

Related Questions

QWhat is the main difference between traditional image codecs and the new PICO codec developed by Apple?

ATraditional codecs like JPEG and VVC optimize for mathematical metrics such as PSNR (Peak Signal-to-Noise Ratio), focusing on minimizing pixel error. In contrast, Apple's PICO codec is a perceptual image codec designed to directly optimize for human visual perception, aiming for better visual quality even if mathematical scores are not the highest.

QWhat were the three key technical challenges addressed in the design of the PICO codec, and what solutions were implemented?

AThe three key challenges were: 1) Slow entropy coding: Solved using a 'One-shot Context Model' to avoid the speed bottleneck of autoregressive methods. 2) Artifacts/hallucinations in perceptual training, especially for text: Addressed with a dedicated 'Text Fidelity Loss' function. 3) Visible color block boundaries from tiled image processing: Mitigated through a 'Tiling Artifact Loss' function to enforce color consistency across tiles.

QHow does the file size of images compressed with PICO compare to other state-of-the-art codecs at similar visual quality, according to human subjective tests?

AAccording to large-scale human subjective evaluations, PICO achieves the same visual quality with only 30% to 43% of the file size (i.e., one-third to one-half) required by codecs like AV1, AV2, VVC, ECM, and JPEG AI.

QWhat are the limitations of the PICO codec mentioned in the article?

AThe article notes that for highly structured synthetic content like cartoons or schematic diagrams, PICO's compression efficiency is lower than traditional rule-based codecs. These types of content are naturally more suited for autoregressive modeling based on rules rather than perceptual generation.

QWhat is the historical background of the core research team behind the PICO project?

AThe core team, led by Oren Rippel, has a history in learned compression. They first gained prominence at the startup WaveOne in 2017, where they published work on real-time adaptive image compression. Later, they developed the ELF-VC video codec at WaveOne before the team was acquired by Apple, where they developed PICO with Apple's resources.

Related Reads

Alibaba 'Stocks Up', ByteDance 'Trains'

"In late May, two closely timed events in China's AI industry clearly revealed the divergent strategic approaches of two tech giants: Alibaba and ByteDance. Alibaba is aggressively integrating AI into its existing commercial ecosystem, prioritizing immediate monetization. Its Qwen App now fully integrates with Taobao, leveraging the platform's 4-billion-item database for AI-powered shopping features like virtual try-on and price comparison. Internally, Alibaba has reorganized to incentivize AI-driven business growth, notably through the 'Agentic Commerce Trust Protocol' to enable AI-agent transactions. Financially, it emphasizes ROI, with CEO Daniel Wu stating every AI chip purchased is generating revenue. Alibaba's strategy bets that foundational AI model capabilities won't be leapfrogged in the next five years, allowing its 'AI-as-a-utility' approach to succeed. In stark contrast, ByteDance's Seed division focuses on pushing the frontiers of AGI with a long-term, research-oriented mindset. Its video generation model, Seedance 2.0, topped international benchmarks. The division, led by researchers Wu Yonghui and product head Zhu Wenjia, is tasked with 'exploring the upper limits of intelligence,' even considering open-sourcing its models—a rare move among Chinese firms. ByteDance is investing heavily, with reports of its 2026 capital expenditure plan being nearly triple that of 2024, funded by its substantial private profits. This allows it to pursue projects like an 8-month research paper questioning if video models are true 'world models,' devoid of immediate commercial pressure. The core divergence is less about corporate philosophy and more about structural constraints. As a publicly traded company, Alibaba is bound to quarterly financial expectations, forcing a pragmatic, revenue-focused AI integration. As a private entity, ByteDance has the luxury to fund long-term, high-risk foundational research without answering to public markets. The article concludes that the true determinant of a Chinese company's AI path is its IPO status, suggesting that if ByteDance were public, or if Alibaba were private, their strategies might well be reversed."

marsbit1h ago

Alibaba 'Stocks Up', ByteDance 'Trains'

marsbit1h ago

Why More AI Agents Does Not Equal Higher Productivity?

Editor's Note: As AI Agents become cheaper and easier to use, a new constraint emerges: the cost isn't in launching more Agents, but in the human attention required to manage, judge, and integrate their outputs. This hidden cost is called the "orchestration tax." The article argues that a developer's cognitive bandwidth is the key bottleneck—a serial, non-parallelizable resource akin to a Global Interpreter Lock (GIL). While many Agents can run concurrently, their results ultimately require human judgment for review, conflict resolution, and final integration. Therefore, more Agents don't automatically mean higher productivity; they can simply create longer queues, lead to cognitive fatigue, and create the illusion of busyness without real output. The core solution is to design workflows around this scarce human attention. Key strategies include: scaling the number of Agents to match review capacity (not UI capacity), categorizing tasks (delegating independent ones, keeping complex judgment-heavy ones serial), batch reviewing results to minimize context-switching costs, automating verifiable checks to reserve human judgment for critical decisions, and protecting focused, uninterrupted thinking time. Ultimately, the critical skill is not launching many Agents, but architecting systems that respect the fundamental limit of human attention. Unpaid "orchestration tax" accumulates as both technical and cognitive debt, undermining system understanding and quality. True productivity comes from thoughtfully managing the single-threaded resource—your focus.

marsbit2h ago

Why More AI Agents Does Not Equal Higher Productivity?

marsbit2h ago

Three Years Later: Looking Back at My Predictions About ChatGPT in 2023

Three Years Later: Revisiting My 2023 Predictions on ChatGPT In March 2023, shortly after ChatGPT's launch, I made 20 predictions about its future. Now, in mid-2026, I've used AI agents to fact-check each one against the latest data. Overall, most major directional forecasts were correct, with only one outright error (incorrectly stating GPT-4 had 100 trillion parameters). Key successes included predicting that RAG and retrieval architectures would become the standard for handling knowledge and hallucinations, that natural language interfaces (LUI) would create a massive new industry layer beyond the models themselves, and that China would develop viable large language models, significantly closing the performance gap with Western counterparts within about three years. Predictions about the absence of mass unemployment, the rise of a new "robot network" for agent communication, and ChatGPT not possessing consciousness also held true in their core arguments. However, the "devil was in the details." Errors frequently involved specific numbers, timelines, or overlooking distributional effects. I tended to overestimate the speed of adoption (e.g., for agent networks) while underestimating the ultimate scale of capabilities or costs (e.g., AI winning IMO gold without tools, or the extreme capital required for frontier models). Other misjudgments included: underestimating how AI would reinforce, not dissolve, information filter bubbles; incorrectly assuming AI-generated content would easily circumvent copyright (it has instead triggered record-breaking settlements); and misidentifying where value would be captured (it accrued overwhelmingly to the compute layer, like Nvidia, not just the application or model layers). Key lessons from reviewing these predictions are: 1) Directional and mechanistic insights are far more reliable than precise numbers or absolute statements. 2) There's a consistent bias to overestimate short-term speed but underestimate long-term magnitude. 3) Errors often lie in missing distributional impacts within a generally correct aggregate trend. 4) Predictions phrased with nuance and caveats aged the best. 5) Some fundamental debates (e.g., on machine consciousness or the ultimate value chain) remain unresolved even after three years. This exercise is less about scoring the past and more about establishing rules for clearer thinking about the next three years of AI.

marsbit9h ago

Three Years Later: Looking Back at My Predictions About ChatGPT in 2023

marsbit9h ago

Trading

Spot
Futures

Hot Articles

Discussions

Welcome to the HTX Community. Here, you can stay informed about the latest platform developments and gain access to professional market insights. Users' opinions on the price of AI (AI) are presented below.

活动图片