The Model Collapse Theory highlights serious concerns about the future of generative AI and its reliance on high-quality human data. Below, the number of PubMed papers containing the word “delve” is shown per year. It does not take a background in statistics to notice the sharp increase since 2023, the same year ChatGPT gained popularity. Not coincidentally, ChatGPT is known to use the word “delve” far more often than human writers do.
The chart suggests that AI-generated content has found its way into our everyday lives. The line between a response from ChatGPT and a piece written by a human is fading quickly. In 2023, one expert even estimated that 90% of online content could be AI-generated by 2025. This leads to a pressing question: can we still distinguish AI-generated content from human-created content?
The crux of the issue lies in how generative AI models are trained. These models require vast datasets of high-quality human content, which is more nuanced, creative, and reflective of the real world. However, as more AI-generated content is produced, it becomes increasingly difficult to keep it out of the training datasets. Training generative AI models on AI-generated data has been shown to decrease both the quality and diversity of their output (Briesch et al., 2023). This creates a feedback loop in which the AI recycles its own patterns without innovation, leading to stagnation. In short: using AI to generate content has polluted the very data sources needed to train future generative AI models.
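The diversity loss in such a feedback loop can be sketched with a toy statistical model. The following is my own illustration, not the experimental setup of Briesch et al. (2023): a "model" that fits a Gaussian to data and is then retrained on its own samples. In expectation, the fitted variance shrinks a little every generation, so the "content" the model can produce becomes steadily less diverse.

```python
import random
import statistics

def self_consuming_loop(generations=30, n=100, seed=0):
    """Toy sketch of a self-consuming training loop (my own illustration):
    each generation, fit a Gaussian to samples drawn from the previous
    generation's fit. The fitted standard deviation tends to shrink,
    mimicking the loss of output diversity described in the text."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # stand-in for the original human data distribution
    sigmas = []
    for _ in range(generations):
        # "Generate content" with the current model ...
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        # ... then "retrain" the next model on that generated content.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        sigmas.append(sigma)
    return sigmas

sigmas = self_consuming_loop()
print(f"diversity starts near 1.0; after 30 generations: {sigmas[-1]:.3f}")
```

The mechanism is simple: the maximum-likelihood variance estimate on a finite sample is biased low, so each retraining step loses a little spread on average, and the loop compounds the loss. It is a deliberately simplified caricature of the dynamics, not a claim about any specific model.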
In response, generative AI companies have been looking for solutions, sparking a race to secure exclusive partnerships with organizations that can provide human-created data. These partnerships may postpone the issue, but they do not solve it. At some point, more data will be needed, and clean data sources will become increasingly difficult to find. The only remaining option will be to use whatever content is available, which includes AI-generated content.
Some experts warn that the influx of AI-generated content will slow down generative AI advancement, or potentially even bring it to a halt. Without sufficient high-quality human data, we risk a future where generative AI models produce content that lacks depth and creativity. In my own use, I have found that the lack of creativity is one of generative AI's greatest limitations, and I think it could provide much more value if it were more innovative in its output. The thought that generative AI might become even less creative is a serious concern for me.
But how do you assess the risk of Model Collapse? Is Model Collapse a real threat, or do you have a solution in mind?
Sources:
Briesch, M., Sobania, D., & Rothlauf, F. (2023). Large language models suffer from their own output: An analysis of the self-consuming training loop. arXiv. https://arxiv.org/abs/2311.16822
Banner: https://www.thedigitalspeaker.com/content/images/2023/06/Danger-of-AI-Model-Collapse-Futurist-Speaker.jpg
Chart: https://www.linkedin.com/posts/marnimolina_the-rise-of-delve-in-scientific-literature-activity-7186362840360869888-Sz16
While the concern over Model Collapse is realistic, I think a solution will be found. The AI community is already testing techniques to mitigate this risk, such as advanced data filtering and models that can distinguish between human-written and AI-generated content. Human creativity is limitless, and as long as we continue to produce new, original content, AI models will have enough human data to learn from.
Super interesting topic to read about; I had never really thought about how this could impact the training data of these models. It would be interesting to learn how much of the new content on the internet is currently created by AI models and how much is still created by humans. At the moment, I feel that most content may still be human-created, but this may change in the future. To solve the problem of Model Collapse, we would need a tool that can reliably determine whether something was written by AI, and such a tool has not yet been developed. I do think it will arrive in the near future, solving the problem of Model Collapse by allowing AI-generated content to be filtered out. Thanks for the interesting read!