GPT-5.5 '9.7T Parameter' Re-evaluated: Revised to Approximately 1.5T
05/02 14:25
According to Beating, AI researchers Lawrence Chan and Benno Sturgeon have published a review of the paper 'Incompressible Knowledge Probes: Estimating the Parameter Count of Black Box Large Language Models Based on Fact Capacity' by Pine AI Chief Scientist Li Bojie. The original paper used 1,400 trivia questions to 'weigh' closed-source models, estimating GPT-5.5 at about 9.7T parameters, Claude Opus 4.7 at around 4.0T, and o1 at approximately 3.5T. The reviewers argue that while the approach itself is valuable, the original figures were significantly inflated by the scoring criteria and question quality.

The main issue is the 'floor score.' The original paper divided the questions into seven difficulty levels; when a model answered enough questions at a given level incorrectly, its score for that level could in principle go negative, but the code clamped each level's minimum score to 0. This inflated the apparent advantage of cutting-edge models on difficult questions and, in turn, the inferred parameter counts. The paper states that scores were not handled this way, yet the code and the published results both applied the clamp.

After the floor score was removed, the fitted slope dropped from 6.79 to 3.56. The slope can be read as 'how much parameter growth each additional point of score translates into'; a smaller slope means the same score difference no longer corresponds to such an exaggerated parameter difference. The R² fell from 0.917 to 0.815, indicating that the score-to-parameter fit is less stable than the original paper suggested, and the 90% prediction interval widened from 3.0x to 5.7x, meaning the margin of error is larger and single-point figures should not be taken at face value.

The review also found that 131 of the 1,400 questions (9.4%) were ambiguous or had incorrect answers.
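The clamping mechanism at issue can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the scoring function, penalty weight, and per-level counts below are all hypothetical. It shows only the mechanical point: clamping each level's score at zero changes a model's total, which shifts the scores fed into the score-to-parameter fit.

```python
# Illustrative sketch of the "floor score" clamp described in the review.
# The scoring scheme and all numbers here are hypothetical, not the paper's code.
def level_score(correct: int, wrong: int, penalty: float = 1.0,
                floor: bool = False) -> float:
    """Score one difficulty level; wrong answers subtract points.
    With floor=True, a negative level score is clamped to 0."""
    raw = correct - penalty * wrong
    return max(raw, 0.0) if floor else raw

# Hypothetical (correct, wrong) counts per difficulty level for one model,
# which does poorly on the hardest level.
levels = [(10, 0), (6, 4), (1, 9)]

floored = sum(level_score(c, w, floor=True) for c, w in levels)
unfloored = sum(level_score(c, w) for c, w in levels)
print(floored, unfloored)  # 12.0 4.0
```

The hardest level's raw score of -8 is clipped to 0 under the floor, so the same answer sheet produces two different totals depending on a scoring detail; per the review, that detail was enough to move the fitted slope from 6.79 to 3.56.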
The flawed questions were concentrated among the difficult ones, precisely those used to differentiate cutting-edge closed-source models such as GPT-5.5 and Claude Opus 4.7. Under the reviewers' revised criteria, GPT-5.5 drops from the original paper's 9659B to 1458B, with a 90% prediction interval of 256B to 8311B; Claude Opus 4.7 drops from 4042B to 1132B; and GPT-5 drops from 4088B to 1330B. The reviewers also stress that 1.5T should not be taken as GPT-5.5's true parameter count. The more defensible conclusion is that this 'trivia weighing method' is highly sensitive to scoring details and question quality, and that figures like 9.7T cannot be used directly as a measure of closed-source models' size.
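The quoted interval for GPT-5.5 is consistent with reading the "5.7x" prediction-interval width as a multiplicative band around the point estimate, i.e. the estimate divided and multiplied by 5.7. A quick arithmetic check (the interpretation is an assumption, but it reproduces the reported bounds):

```python
# Sanity check: a multiplicative 5.7x band around the revised 1458B point
# estimate reproduces the quoted 90% prediction interval of 256B to 8311B.
point = 1458   # revised GPT-5.5 estimate, in billions of parameters
factor = 5.7   # reported 90% prediction-interval width (multiplicative)

lower, upper = point / factor, point * factor
print(round(lower), round(upper))  # 256 8311
```

A 5.7x multiplicative band means the upper bound is roughly 32 times the lower bound, which is why the reviewers caution against quoting any single number from this method.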