AI News CN (Telegram) - English Translation

The secrets of GPT-4o image generation that OpenAI didn't reveal: have netizens pieced together the truth?

OpenAI, however, has never been all that "open", and this time is no exception. The company released only an appendix (a supplementary document) to the GPT-4o system card, which mainly covers evaluation, safety, and governance. Address: https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf

On the technology itself, this 13-page appendix offers just one sentence, right at the beginning: "Unlike DALL·E, which is based on a diffusion model, 4o image generation is an autoregressive model embedded in ChatGPT."

Although OpenAI is keeping the technology confidential, that has not dampened people's curiosity about how GPT-4o works, and all kinds of speculation and reverse engineering have appeared online. For example, Jon Barron, a researcher at Google DeepMind, guessed from the way 4o renders its images that it might combine some multi-scale technique with autoregression.

It is worth noting, however, that Jie Liu, a Ph.D. student at the Chinese University of Hong Kong, found while examining GPT-4o's front end that the line-by-line reveal users see during generation is just a browser-side animation and does not accurately reflect how the image is actually generated. In fact, during each generation, OpenAI's server sends only 5 intermediate images to the client. You can even manually adjust the height of the blur effect in the browser console to change how much of the generated image appears blurred. The front-end display is therefore not a good basis for inferring how GPT-4o works.

Nevertheless, let's look at what researchers have speculated. Overall, the conjectures about GPT-4o's native image-generation ability fall into two main directions: autoregressive + diffusion, and non-diffusion autoregressive generation.
Below, we detail these conjectures and briefly introduce some related papers.

Conjecture 1: Autoregressive + Diffusion

Many netizens conjecture that GPT-4o's image generation follows an "autoregressive + diffusion" paradigm. For example, Sangyun Lee, a Ph.D. student at CMU, tweeted shortly after the feature launched that GPT-4o likely first generates visual tokens, which a diffusion model then decodes into pixel space. He further believes the diffusion method is a grouped diffusion decoder similar to Rolling Diffusion, decoding in top-to-bottom order, and gave two reasons for his conjecture.

Reason 1: Given a strong conditioning signal (such as text, and possibly visual tokens), users usually first see a blurred sketch of the content to be generated, so the regions yet to be generated already show a rough structure.

Reason 2: The UI suggests the image is generated from top to bottom. (Sangyun Lee had tried the bottom-to-top order in his own research.)

Lee guessed that under such a grouped scheme, FID would be better in the high-NFE (number of function evaluations) regime. When he observed this in his own research, he dismissed it as a bug rather than a feature; now, it seems, it may have been a feature all along. He concluded: "Therefore, this is a model between a diffusion and an autoregressive model. In fact, by setting num_groups = num_pixels, you can even recover autoregression!"

Some other researchers have reached similar judgments. If you are interested in this conjecture, you can refer to the following papers:

- Rolling Diffusion Models (arXiv:2402.09470)
- Sequential Data Generation with Groupwise Diffusion Process (arXiv:2310.01400)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039)
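To build intuition for the "grouped diffusion decoder" idea, here is a minimal toy sketch (the function name and parameters are hypothetical illustrations, not anything OpenAI has confirmed): rows of the image are split into top-to-bottom groups, and each group only starts denoising once the group above it has finished, so the image sharpens from the top down.

```python
def grouped_denoise_schedule(num_rows: int, num_groups: int, steps_per_group: int = 4):
    """Toy illustration of a grouped ("rolling") diffusion schedule.

    Rows are split into contiguous top-to-bottom groups; group g only starts
    denoising once group g-1 is finished, so the image sharpens from the top
    down. Returns, for each global step, the noise level of every row
    (1.0 = pure noise, 0.0 = fully denoised).
    """
    rows_per_group = num_rows // num_groups
    total_steps = num_groups * steps_per_group
    history = []
    for t in range(total_steps + 1):
        levels = []
        for g in range(num_groups):
            start = g * steps_per_group  # global step at which group g begins
            done = min(max(t - start, 0), steps_per_group)
            levels += [1.0 - done / steps_per_group] * rows_per_group
        history.append(levels)
    return history

hist = grouped_denoise_schedule(num_rows=8, num_groups=4)
mid = hist[len(hist) // 2]  # halfway: top rows are clean, bottom is still noise
print(mid)
```

Setting num_groups equal to the number of rows makes each row finish before the next one starts, i.e. the fully sequential limit Lee alludes to with num_groups = num_pixels; num_groups = 1 collapses to ordinary whole-image diffusion.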
Conjecture 2: Non-Diffusion Autoregressive Generation

Anyone who has used GPT-4o knows that during image generation the upper part of the image always appears first, and only then does the complete image emerge. Peter Gostev, AI director at Moonpig, believes that GPT-4o generates images by streaming tokens starting from the top of the image, much like text generation. Source: https://www.linkedin.com/feed/update/urn:li:activity:7311176227078172674/

Gostev said the key difference between GPT-4o and traditional image-generation models is that it is autoregressive: it streams image tokens one by one in sequence, just like generating text. In contrast, diffusion-based models (such as Midjourney, DALL·E, and Stable Diffusion) typically transform noise into a clear image all at once. The main advantage of the autoregressive approach is that the model does not need to commit to the entire image globally at once; instead, it can draw on the general knowledge embedded in its weights, and generate images more coherently by streaming tokens in sequence.

Gostev also noted that if you open ChatGPT, click Inspect, and navigate to the Network tab in your browser, you can monitor the traffic between the browser and the server. This lets you view the intermediate images ChatGPT sends during generation and thus gather some valuable clues.
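The streaming behavior Gostev describes can be sketched with a toy raster-order generator (purely illustrative: a real model would sample each token from a learned distribution conditioned on the text and all previous tokens, whereas here random codebook indices stand in for sampled visual tokens):

```python
import random

def generate_image_tokens(height: int, width: int, snapshot_every: int = 16):
    """Toy raster-order autoregressive generation (illustrative only).

    Tokens arrive left-to-right, top-to-bottom, so any intermediate snapshot
    has a finished upper region and an empty lower region, matching the
    top-first reveal users see in GPT-4o.
    """
    canvas = [[None] * width for _ in range(height)]
    snapshots = []
    n = 0
    for y in range(height):
        for x in range(width):
            # Stand-in for sampling p(token | text, previous tokens).
            canvas[y][x] = random.randrange(1024)
            n += 1
            if n % snapshot_every == 0:  # intermediate image sent to the client
                snapshots.append([row[:] for row in canvas])
    return canvas, snapshots

canvas, snaps = generate_image_tokens(8, 8)
first = snaps[0]  # after 16 of 64 tokens: top two rows filled, rest empty
print(first[0][0] is not None, first[7][0] is None)
```

The snapshot mechanism loosely mirrors Jie Liu's observation that the server sends only a handful of intermediate images per generation; the smooth line-by-line reveal in between is front-end animation.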
Gostev gave some preliminary (and possibly incomplete) observations:

- The image is generated from top to bottom.
- The process indeed involves streaming tokens, which is very different from the diffusion approach.
- The general outline of the image is visible from the beginning.
- Previously generated pixels may change significantly during generation, which may indicate some kind of coherence optimization, especially noticeable near completion.

He also mentioned some additional observations that cannot be seen directly from the images. For simple images, GPT-4o is much faster and usually sends only one intermediate image rather than several, which may hint at speculative decoding or a similar method. Image generation also supports background removal: at present, GPT-4o first renders the picture with a fake checkerboard background, and the actual background is removed only at the end, slightly reducing image quality. This appears to be an extra post-processing step rather than a capability of GPT-4o itself.

Developer @KeyTryer also offered his conjecture: 4o is an autoregressive model that generates images pixel by pixel over multiple passes, rather than performing denoising steps like diffusion models, and this ability is itself part of the GPT-4o LLM's neural network. In theory, it can grasp the concepts it is depicting better than diffusion systems, which start from guesses about random noise. GPT-4o can also draw on what the LLM "knows" to generate images. As a result, such models generalize better: they can use multiple messages for in-context learning, reproduce the same (or very similar) result under specific edits, and have a generalized sense of space and scene.
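The speculative-decoding guess can be illustrated with a toy acceptance loop. This is a generic, greedy sketch of the general technique (a small draft model proposes several tokens, the large model verifies them in one pass), not anything confirmed about GPT-4o; the function names and the integer "models" are invented for illustration.

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding, schematically.

    draft_next/target_next: functions mapping a token prefix to the next token.
    The draft proposes k tokens; the target checks them (here sequentially, for
    clarity) and accepts the longest matching prefix, then contributes one
    token of its own. Fewer target passes per token means fewer intermediate
    results to stream to the client.
    """
    proposal, p = [], list(prefix)
    for _ in range(k):
        t = draft_next(p)
        proposal.append(t)
        p.append(t)
    accepted, p = [], list(prefix)
    for t in proposal:
        if target_next(p) == t:  # target agrees with the draft's guess
            accepted.append(t)
            p.append(t)
        else:
            break
    accepted.append(target_next(p))  # target's own token after the mismatch
    return prefix + accepted

# Toy "models" over integer tokens: the target always emits the position
# index; the cheap draft agrees only for early positions.
target = lambda p: len(p)
draft = lambda p: min(len(p), 2)
out = speculative_step(draft, target, prefix=[], k=4)
print(out)  # draft tokens for positions 0-2 accepted, position 3 corrected
```

When the draft is accurate, one verification pass yields several tokens at once, which is consistent with Gostev's observation that simple images finish quickly with only a single intermediate image.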
Luigi Acerbi, an associate professor at the University of Helsinki in Finland, likewise pointed out that GPT-4o basically just uses a Transformer to predict the next token, and that its native image-generation ability has been there from the beginning but was never publicly released. Professor Acerbi also suggested, however, that OpenAI may use a diffusion model or some refinement model to clean up the images GPT-4o generates or add small details.

So how exactly is GPT-4o's native image generation implemented? In the end, we will have to wait for OpenAI to reveal the answer itself. What's your own conjecture?

PC version: https://www.cnbeta.com.tw/articles/soft/1489108.htm
Mobile version: https://m.cnbeta.com.tw/view/1489108.htm

via cnBeta.COM Chinese Industry Information Station - Telegram Channel
