Alibaba Releases T2I Benchmark Qwen-Image-Bench, GPT Image 2 Takes the Crown

05/29 10:15

According to monitoring by Dongcha Beating, Alibaba's Qwen team has open-sourced Qwen-Image-Bench, a new benchmark for evaluating how well large models generate images from text (text-to-image, or T2I). Alongside it, the team released Q-Judger, a unified visual judging model trained on top of Qwen3.6-27B. The benchmark simulates a professional artistic creation workflow and covers five major dimensions: image quality, aesthetics, and text-image alignment, plus the newly added real-world fidelity and creative generation. These break down into 23 sub-capabilities and 56 detailed metrics. Qwen-Image-Bench comprises 1,000 bilingual prompts in Chinese and English, 500 long and 500 short, and each prompt assesses more than four dimensions at once on average.

To make scoring precise, Q-Judger was trained on more than 130,000 bilingual expert-annotated pairs, produced through blind review and triple annotation under the supervision of 80 professional reviewers from art institutions. The model outputs structured scores across all 56 metrics and reaches a 92% agreement rate with human expert scores.

Results for the first batch of 18 mainstream image generation models show GPT Image 2 at the top of the list with a composite score of 64.69, ranking first in all five dimensions. Nano Banana 2.0 (59.82), GPT Image 1.5 (59.65), and Nano Banana Pro (59.45) placed second, third, and fourth respectively. Alibaba's self-developed Qwen Image 2.0 Pro ranked fifth with 57.84, while GLM Image came in last with 48.19. The data indicates that real-world fidelity and creative generation are the dimensions that most clearly separate the models.
The evaluation also reveals technical bottlenecks common across the industry: AI image generation models are generally prone to errors in depicting human skeletal structure, representing gravity and light, and handling details such as object interpenetration, with even the top models scoring below 44 on these dimensions.
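To put the reported composite scores in perspective, the short Python sketch below lists each named model's gap to the leader. It uses only the six scores quoted above (the other 12 evaluated models are not named in this report), and the helper name `gaps_to_leader` is purely illustrative, not part of Qwen-Image-Bench.

```python
# Composite scores as quoted in this report; only 6 of the 18 evaluated models are named.
reported_scores = {
    "GPT Image 2": 64.69,
    "Nano Banana 2.0": 59.82,
    "GPT Image 1.5": 59.65,
    "Nano Banana Pro": 59.45,
    "Qwen Image 2.0 Pro": 57.84,
    "GLM Image": 48.19,
}


def gaps_to_leader(scores):
    """Return (model, score, gap to top score) tuples, highest score first.

    Hypothetical helper for illustration; not part of the benchmark itself.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_score = ranked[0][1]
    return [(name, score, round(top_score - score, 2)) for name, score in ranked]


for name, score, gap in gaps_to_leader(reported_scores):
    print(f"{name:<20} {score:6.2f}  gap to leader: {gap:5.2f}")
```

On these figures, the second- through fourth-placed models sit within about 0.4 points of each other, while the leader is nearly 5 points clear of second place.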