Why OpenAI’s Text-to-image Model Falls Short

10 October 2024


Since OpenAI launched its text-to-image GenAI tool, DALL-E, I have been using it regularly. Initially, I was amazed at how accurately the tool could turn a short prompt into an image, depicting exactly what I had pictured in my mind. After using it a bit more, however, I started to realise that DALL-E has its limitations and is not yet ready to completely replace graphic designers or artists.

If you have been using DALL-E for a while, you have probably seen it produce garbled text inside an image. As shown below in Image 1 (prompt: Create an image of a tree in autumn in the Dutch city Utrecht, there are some stores with storefront names in the background), the store names are gibberish, even though the image itself looks stunning and of high quality.

Image 1: Created by DALL-E
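For readers who want to try the same experiment themselves, here is a minimal sketch using the Images endpoint of the official `openai` Python SDK. It assumes the `openai` package is installed and an `OPENAI_API_KEY` environment variable is set; `build_prompt` and `generate_image` are hypothetical helper names I've chosen for illustration, not part of the SDK.

```python
def build_prompt(city: str) -> str:
    """Assemble the autumn-street prompt used in this article."""
    return (
        f"Create an image of a tree in autumn in the Dutch city {city}, "
        "there are some stores with storefront names in the background"
    )


def generate_image(prompt: str) -> str:
    """Request one image from the API and return its URL (network call)."""
    # Imported here so the pure helper above works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        n=1,
        size="1024x1024",
    )
    return resp.data[0].url


if __name__ == "__main__":
    print(generate_image(build_prompt("Utrecht")))
```

Running this a few times makes the pattern easy to spot: the scenery is consistently convincing, while any storefront lettering differs from run to run and rarely spells a real word.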

Why can DALL-E create amazing visuals, yet fail to produce readable text on those images, when that seems so easy?

This comes down to several technical issues. While the model is effective at understanding prompts and generating visual elements, it largely lacks the ability to distinguish written content from visual content. The limitation is rooted in how the model is trained: when text is rendered as part of an image, it becomes a visual pattern rather than a linguistic one, so the model reproduces letter-like shapes instead of actual words. OpenAI has said that the next version, DALL-E 4, will be better at distinguishing linguistic from visual elements.

Another important issue to address is the biased output DALL-E produces. When asked to create an image of a CEO at work with an assistant, see Image 2 (prompt: Create an image of a CEO at work with an assistant), it shows a man as the CEO and a woman as the assistant. This happens because the model is trained on data in which CEOs are far more often represented as men than as women, leading to biased and discriminatory results that ultimately reinforce outdated gender stereotypes. The same goes for culturally insensitive and inappropriate results, because the model lacks the cultural awareness of a human.

Image 2: Created by DALL-E

To conclude, these are just two of DALL-E’s limitations; there are many more that I haven’t discussed. It is clear that DALL-E is not perfect and that OpenAI needs to improve the model before it can replace graphic designers or artists. Nonetheless, its future potential is immense, and for now it is incredibly fun to experiment with.

Thanks for reading! If you have an opinion on this topic, please leave a comment below.

(Disclaimer: this blog is written based on my personal experience)


1 thought on “Why OpenAI’s Text-to-image Model Falls Short”

  1. The article clearly explains two pitfalls of DALL-E. I was also wondering why DALL-E struggles to generate clear fonts in images, and now, thanks to your article, I fully understand the reason. As for the second drawback, yes, I’ve experienced similar problems with AI-generated images, such as gender bias – for example, generating a woman when depicting a child-care role. In addition, AI-generated images often suffer from human body anomalies, such as six fingers or awkward poses.
    I believe that as technology advances, image-generating AI will become more sophisticated, which will undoubtedly improve efficiency. But it also risks encroaching on the space of artists creating original works. I think there is a unique beauty and style in the works created by human artists that cannot be replaced by AI. So I hope that the market can regulate the use of AI in a way that allows artists to thrive.
