Google's Deep Think Dominates Eight-Language Olympiads, Autonomously Solves Four Unsolved Problems, Research Barriers Collapse

marsbit · Published 2026-04-08 · Last updated 2026-04-08

Abstract

Google DeepMind's "Deep Think" AI system has demonstrated exceptional performance across eight languages in regional academic competitions, including mathematics and informatics Olympiads. It achieved perfect scores in Japanese and French contests, and high results in Chinese, Korean, Hindi, Vietnamese, Russian, and Portuguese exams. This multi-language capability aims to reduce linguistic barriers in scientific research, enabling non-English-speaking researchers to access advanced AI tools equally. Beyond competitions, Deep Think has solved four previously unsolved mathematical problems and contributed to breakthroughs in computer science, physics, and economics. It powers the Aletheia agent, which autonomously generates and verifies research-level mathematical solutions. Despite these achievements, the results are based on internal evaluations without third-party verification or detailed methodology disclosure. Google positions Deep Think as a "human intelligence multiplier," expanding AI's role in global scientific collaboration beyond English-dominated benchmarks.

"Deep Think has defeated/matched competitors in all competitions"!

Just now, Google DeepMind senior researcher Conglong Li posted 12 messages on the X platform, revealing an unprecedented scorecard.

One AI, the same brain, eight exam papers in different languages, all submitted with high scores.

Such results are rare for any model.

From IMO Gold Medals to Full Coverage of Regional Competitions

Deep Think's high scores across multiple leaderboards are not a sudden breakthrough but part of a nearly year-long evolution of capabilities.

First, it topped the most rigorous reasoning competitions.

In July 2025, Gemini Deep Think achieved the gold medal standard at the International Mathematical Olympiad (IMO) for the first time, scoring 35 out of 42 points. It also achieved similarly high-level performance at the ICPC World Finals around the same time.

Google DeepMind subsequently announced both results on its official blog, marking Deep Think's crossing of the "world-class competition threshold" in mathematics and programming.

Next, Deep Think began moving from "world-champion-level individual breakthroughs" to "systematic validation across languages, disciplines, and scenarios."

In February 2026, Google published three blog posts.

One introduced the Gemini 3.1 Pro model itself, one detailed a major upgrade to the Deep Think specialized reasoning mode, and one from the DeepMind scientific discovery team directly positioned Deep Think as a "human intelligence multiplier."

The upgraded Deep Think delivered a series of hard metrics:

48.4% on Humanity's Last Exam (without tool assistance), 84.6% on ARC-AGI-2 (officially verified by the ARC Prize Foundation), a Codeforces competitive programming Elo rating of 3455, and gold medal-level performance on the written portions of the 2025 International Physics and Chemistry Olympiads.

The strategy is very clear: first use world-class competitions like the IMO and ICPC to prove its powerful reasoning abilities, then use multi-language, regional competition, and cross-disciplinary Olympiad results to prove its general, deep reasoning ability that stably transfers across languages and domains.

Gemini Deep Think's capability evolution from IMO gold medals to PhD-level research acceleration

A Detailed Look at the 8-Language Scorecard

Now, let's take a closer look at this scorecard.

Japanese results are the most impressive.

2025 35th Japanese Mathematical Olympiad Finals (JMO Finals), perfect score.

ICPC Asia Japan Preliminary Contest, perfect score.

Among these, the JMO Finals score even exceeded the level corresponding to the top 80% of scores that year, meeting the official "gold medal equivalent" standard.

French results were also a perfect 100%.

The Chinese results are interesting.

At the 41st Chinese Mathematical Olympiad (CMO), Deep Think scored 86.3%, which is quite outstanding. But at the Chinese National Olympiad in Informatics (NOI), it only scored 63.3%.

The gap between 86.3% and 63.3% outlines the real boundaries of AI reasoning ability.

In math competitions, the model faces abstract deduction, proof construction, and multi-step reasoning, which happens to be Deep Think's strongest suit.

But in informatics competitions, the problem is not just "figuring it out," but also translating logic into executable code, controlling boundary conditions, considering complexity constraints, and avoiding implementation errors.

The former is closer to pure reasoning, while the latter requires "reasoning + algorithm design + engineering implementation" to be successful simultaneously.

In the other languages—Korean, Hindi, Vietnamese, Russian, Portuguese—Deep Think also achieved results that either defeated competitors or at least matched them.

Looking at Japanese, French, and Chinese together, the most unusual aspect this time is not necessarily scoring a perfect mark in any single subject, but rather that the same model, the same Deep Think reasoning system, delivered first-tier results on exam papers in multiple languages.

Is This Scorecard Reliable?

But there is a key omission:

Conglong Li did not list specific comparative data from competitors, and all results come from Google's own evaluations. There is no independent third-party replication, no official certification from the competitions, and the evaluation methodology is completely undisclosed.

Was each problem attempted once or many times with the best score taken? How much computational power was used during reasoning? Was there any manual prompt engineering involved?

These details, which directly affect the credibility of the results, were also not mentioned.
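To see why the attempt count matters, here is a minimal sketch in Python. It is not Google's evaluation method, which is undisclosed. It assumes a hypothetical per-attempt success probability `p` and fully independent attempts, and the function name `best_of_k_success` is made up for illustration. Under those assumptions, it shows how taking the best of `k` attempts raises the chance of solving a problem compared with a single attempt.

```python
def best_of_k_success(p: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumes (hypothetically) that every attempt succeeds with the same
    probability p and that attempts are independent of one another.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative values only; none of these come from the reported results.
    p = 0.10
    for k in (1, 4, 16, 64):
        print(f"k={k:>2}: success chance = {best_of_k_success(p, k):.3f}")
```

Under these assumptions, a problem the model solves only 10% of the time per attempt would be solved in at least one of 16 attempts about 81% of the time, so a reported score means little without the number of attempts behind it.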

Another easily overlooked point: these exams are all regional selection competitions, not international finals.

Regional competition problems and international finals differ in difficulty by an order of magnitude.

The researcher explicitly stated that these results "will be included in the model card." As of publication, the model card has not been officially updated.

So, for now, this still seems like a scorecard graded by the examinee themselves, announced by themselves, and not yet stamped by the academic affairs office.

Multilingual Research Equity: The Overlooked Real Battlefield

Why did Google specifically invest effort in evaluating 8 different regional languages?

Current evaluations of AI reasoning ability are almost entirely based on English.

MATH, GSM8K, HumanEval, ARC-AGI... these are all in English.

Mathematicians, physicists, and engineers worldwide whose native language is not English must first overcome a language barrier when using AI research tools.

Google's selection of these 8 languages is not random.

Japanese, Korean, and Chinese cover East Asian research powerhouses; Hindi and Vietnamese cover emerging markets; French, Russian, and Portuguese cover Europe and South America.

Together, this represents the majority of global research output.

In its official blog, DeepMind positioned Deep Think as a "human intelligence multiplier," saying it can "handle knowledge retrieval and rigorous verification, allowing scientists to focus on conceptual depth and creative direction."

Combined with these multi-language results, the subtext of this statement is not hard to understand: this multiplier is not just for scientists who use English.

More notable is how far Deep Think has already gone in real-world research applications.

DeepMind announced a mathematical research agent called Aletheia, powered by Deep Think, capable of autonomously generating, verifying, and revising solutions to research-level mathematical problems.

Aletheia, driven by Deep Think, capable of iterative generation, verification, and correction for research-level mathematical problems

Aletheia has already contributed to multiple research papers, one of which was completed entirely autonomously by the AI, calculating specific structural constants in arithmetic geometry.

Furthermore, in a semi-autonomous evaluation of 700 open mathematical problems, it independently solved 4 previously unsolved problems.

The Gemini Deep Think mode also shows great potential in computer science, physics, economics, and other fields.

In computer science, Deep Think helped refute a conjecture that had remained open for a decade; in physics, it found a new analytical solution for gravitational radiation from cosmic strings; in economics, it extended an auction theory theorem.

Schematic diagram of the AI reasoning process, showing how large-scale exploration of the solution space at the network layer is aggregated into structured reasoning and confirmed through automated and manual verification.

By collaborating with experts to solve 18 research challenges, the advanced version of Gemini Deep Think helped break through long-standing bottlenecks in algorithms, machine learning and combinatorial optimization, information theory, and economics.

This goes far beyond "solving competition problems."

While competitors are still competing on English benchmark leaderboards, Google has already found a new battlefield in the "AI research accelerator" field.

The most important thing here is not the scores. The real signal is that the language barrier for AI research tools is being treated as an engineering problem to be solved.

If this path succeeds, scientists conducting research in Japanese, Korean, Chinese, Hindi, and other languages will, for the first time, stand on the same starting line as native English speakers.

This time, Google has laid its cards on the table.

As for which competitors will follow suit, we expect to find out soon.

References:

https://blog.google/intl/ja-jp/company-news/technology/gemini-31-pro-gemini-31-pro-deep-think/

https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/

This article is from the WeChat public account "新智元" (New Zhiyuan), author: 新智元

Related Questions

Q: What is the key achievement of Google's Deep Think AI model as reported in the article?

A: Deep Think achieved top-tier results in academic competitions across eight languages, including perfect scores in Japanese and French math and programming contests, and high performance in Chinese, Korean, Hindi, Vietnamese, Russian, and Portuguese exams.

Q: In which world-class competitions did Deep Think first demonstrate its reasoning capabilities?

A: Deep Think first demonstrated its reasoning capabilities by reaching the gold medal standard at the International Mathematical Olympiad (IMO) with a score of 35 out of 42 in July 2025, and by achieving similarly high performance at the ICPC World Finals.

Q: What is the significance of Deep Think's performance across multiple languages according to the article?

A: Its performance across multiple languages signals progress in breaking down language barriers in AI research tools, potentially allowing non-English-speaking scientists worldwide to access advanced AI research assistance on an equal footing.

Q: What research breakthroughs achieved using Deep Think does the article mention?

A: Deep Think autonomously solved 4 previously unsolved mathematical problems, refuted a decade-old conjecture in computer science, found a new analytical solution for gravitational radiation from cosmic strings in physics, and extended an auction theory theorem in economics.

Q: What concerns does the article raise about the reliability of Deep Think's reported results?

A: The article notes that all results come from internal Google evaluations, without third-party verification, official certification from the competitions, or disclosure of the evaluation methodology, such as the number of attempts, the computational resources used, or any manual prompt engineering.
