Pictures with text are everywhere: in comic books, advertisements, and scientific articles, the text sharpens the image and the image enlivens the text. However, GenAI tools appear to struggle to create a picture with readable, correct, and complete text on it. I tested several GenAI tools to evaluate how they perform on this task.
As a first step, I entered “Generate an image of a monster on the Erasmus Bridge” as the basic prompt. ChatGPT, AI Image Generator, and ChatGLM all performed quite well, combining the real Erasmusbrug with a monster figure. Next, I tried to add text to the scene by requesting that “the monster is analyzing strategies to make the traffic on the bridge more reasonable and faster, expressed in words”. Speech bubbles with unrecognizable words appeared, and follow-up instructions such as “make the text clearer and more specific” failed to fix them. A lack of specific text input might explain this, so I then stated the exact words the image should contain. The results remained unsatisfying: ChatGPT and ChatGLM still produced something resembling alien writing.
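The same experiment can also be reproduced programmatically rather than through a chat window. The sketch below sends the appendix prompts to DALL·E 3 through the OpenAI Images API; the model name, image size, and environment-variable API key are assumptions about one possible setup, and because the Images API is stateless the prompts are accumulated instead of being sent as chat follow-ups.

```python
# Minimal sketch (assumes the openai Python SDK and an OPENAI_API_KEY in the
# environment). Runs the three appendix prompts cumulatively against DALL-E 3.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompts adapted from the appendix; each step adds more detail about the text.
steps = [
    "Generate an image of a monster on the Erasmusbrug.",
    "The monster is analyzing strategies to make the traffic on the bridge "
    "more reasonable and faster, expressed in words. Make the text clearer "
    "and more specific.",
    "The text on the picture contains the following: Optimize traffic lights "
    "with algorithm, Divide lanes for different speeds, Set up tram tracks in "
    "different time periods.",
]

cumulative = ""
for i, step in enumerate(steps, start=1):
    # The Images API has no conversation memory, so we resend the full prompt.
    cumulative = (cumulative + " " + step).strip()
    result = client.images.generate(
        model="dall-e-3",
        prompt=cumulative,
        size="1024x1024",
        n=1,
    )
    print(f"Step {i}: {result.data[0].url}")
```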
Given that AI excels at generating text and at generating creative pictures separately, what stands in the way of creating informative pictures that contain text? One possible reason is that when an image-focused GenAI model is trained, most pictures containing text are skipped to avoid copyright infringement, leaving the model with too few text-bearing training images (Growcoot, 2024). Another explanation is that the model has a limited ability to understand the request: when the description of the image and the text that should appear on it are packed into a single sentence, the model gets confused and makes mistakes. More fundamentally, generating images of text is inherently difficult. To a text-to-image model, text symbols are just particularly precise combinations of lines and shapes, not meaningful words. Because text comes in so many different styles, the model often does not know how to reproduce it effectively, and even minor imperfections in rendered text are immediately noticeable (Mirjalili, 2023).
Leading AI companies have been working on this problem by training GenAI models in more advanced and specialized ways. Research suggests that adding more parameters during training can dramatically improve text rendering (Growcoot, 2024). Stability AI released Stable Diffusion 3 in 2024, which incorporates a diffusion transformer architecture and is said to write text on pictures more reliably. How much these more advanced models actually help remains to be seen, but progress is being made. One day soon, GenAI might actually make ‘a picture worth a thousand readable words.’
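For readers who want to check Stable Diffusion 3’s claimed text-rendering improvement themselves, below is a minimal sketch using Hugging Face’s diffusers library. It assumes access to the gated stabilityai/stable-diffusion-3-medium-diffusers weights and a CUDA GPU, and the sign text is taken from the appendix prompts; treat it as an illustration of one way to run the test, not a benchmark.

```python
# Minimal sketch: generate the bridge scene with explicit sign text using SD3.
# Assumes: diffusers >= 0.29, a Hugging Face token with access to the gated
# SD3 Medium weights, and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Quoting the exact words the sign should carry (from the appendix prompts).
prompt = (
    "A monster on the Erasmusbrug holding a sign that reads "
    "'Optimize traffic lights with algorithm'"
)

image = pipe(prompt=prompt, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("sd3_monster_sign.png")
```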
References
Growcoot, M. (2024, March 6). Why AI image generators struggle to get text right. PetaPixel. https://petapixel.com/2024/03/06/why-ai-image-generators-struggle-to-get-text-right/
Mirjalili, S. (2023, July 5). If AI image generators are so smart, why do they struggle to write and count? TechXplore. https://techxplore.com/news/2023-07-ai-image-generators-smart-struggle.html
Silberling, A. (2024, March 21). Why is AI so bad at spelling? Because image generators aren’t actually reading text. TechCrunch. https://techcrunch.com/2024/03/21/why-is-ai-so-bad-at-spelling/
Appendix
1. Inputs to GenAI
1) Generates an image of a monster on the Erasmusbrug
2) The monster is analyzing strategies to make the traffic on the bridge more reasonable and faster, expressed in words. Make the text clearer and more specific
3) The text on the picture contains the following: Optimize traffic lights with algorithm, Divide lanes for different speeds, Set up tram tracks in different time periods
2. Output of ChatGPT (DALL·E 3)
3. Output of AI Image Generator (DeepAI)
4. Output of ChatGLM