Many of us like to indulge in scrolling our phones in our free time. It's a quick, guaranteed way to kill some time while waiting around, or while gathering the strength to finally get out of bed. While our options were limited to classic picture-and-text posts until a few years ago, short-form video has since risen meteorically to dominate the doomscrolling niche. After repeatedly seeing one specific type of content (rehashed Reddit drama stories), I began to wonder: could you automate this with GenAI?
Down the rabbit hole
I first began by analyzing the catalyst for my idea, hoping to isolate the exact mental tasks that would need to be handed off to GenAI. However, I got sidetracked. Listening to that specific video closely, I noticed that the voice and sound design were extremely high quality; it was as if someone who narrates commercials for a living had taken time out of their working hours and used their studio setup to record the voiceover. The text display and editing also seemed off for a typical TikTok post: there were a few errors that were "no-brainers" to me, editing choices that made the story harder to follow. That's when I realized I wasn't looking at someone's genuine effort. I was staring right at my own idea, implemented before I ever got to it.
Shifting gears
At that point, I was both disappointed and amazed. My original idea would probably not find the success I had hoped for, as the market for AI-generated short stories in video form was presumably already saturated. Still, I wanted to know more about this topic. I never found this type of content to be of great artistic quality, but it could be the springboard for future GenAI development in entertainment. I decided to take the project to a proof-of-concept stage anyway, to capitalize on the learning opportunity.
When I finally had the code up and running, churning out videos with little to no human input in a fraction of the time (and with a fraction of the artistic integrity) a human would need, I felt like I had stumbled onto the crux of contemporary social media's emotional manipulation. I felt I had found something every one of my friends should at least know about. Even more disturbingly, in my research for this project I could not find any complete documentation of what this content is and how it's made. I had only my own experiences and thoughts to go on.
This blog post outlines my findings and opinions on how this content is made, in the hope that getting this knowledge out there will restore some fairness to social media: you should know what you are consuming.
However, there is still one question I could not answer even at the end of the project. Is this truly a new form of entertainment? Am I too quick to condemn something new and exciting? Throughout the blog post I will therefore try to give as much practical information as possible (without encouraging or condemning the non-illegal parts), and hope that the question will sort itself out in due time once more people have the ability to do this.
How the sausage is made
After conducting some research (if you can call scrolling TikTok for a few hours research), I found some elements that are common across many of these videos:
Profile picture
Automating the selection of a profile picture seems like an exercise in futility; after all, why spend any time automating something that will only happen once? Well, online speculation centers on the conjecture that most platforms purposefully de-amplify AI-generated content, so a channel like this will hit a low ceiling on how far the algorithm lets it climb. I therefore speculate that most organizations running channels like the one I was trying to create actually run multiple channels concurrently, cheating the "algorithmic ceiling" imposed on each of them. That makes channel setup, profile picture included, a recurring task worth automating after all.
Voice
The main way the underlying story is conveyed to the viewer is the AI-generated voiceover. It is extremely common, with only a small percentage of these videos opting for a musical background or no voice at all. These voices can be created by training a neural network on a few thousand hours of publicly available speech from one specific person, together with the transcript. The main objective is to find a voice that is both soothing and fitting for the type of story. After all, a voice can only really be done "wrong": by selecting one that is irritating or hard to understand, you exclude potential viewers from consuming the content.
Most posters use a third-party service (most commonly ElevenLabs) to produce this part of the post. Most AI voiceover services (ElevenLabs included) charge between 30 and 100 euros per month for API access (compared to a rumored income of $20-50 per million views), but offer a free tier, provided the user registers with a Google account. This creates a large incentive for unscrupulous (but strongly business-minded) posters to buy stolen Google accounts (which means supporting the theft of Gmail accounts) to generate the voiceover. Free alternatives exist, but they require a fairly powerful PC to run and generally slightly underperform the paid services.
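To make this concrete, here is roughly what fetching a voiceover from a third-party service looks like. This is a minimal sketch against ElevenLabs' public text-to-speech REST endpoint; the voice ID is a placeholder you'd pick from their voice library, and you need your own API key.

```python
import requests

API_KEY = "your-api-key"        # from the ElevenLabs dashboard
VOICE_ID = "your-voice-id"      # placeholder: any voice from their library

def generate_voiceover(text: str, out_path: str = "voiceover.mp3") -> None:
    """Send the story text to the text-to-speech endpoint and save the audio."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": API_KEY},
        json={"text": text},
        timeout=60,
    )
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

generate_voiceover("So this story happened a few years ago...")
```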
Story
Finding an interesting story to put into video form is hard work. However, as most of these stories are selected for strong language, unambiguous interpretation, and general relatability, there is a really good proxy to use when selecting them. Since Reddit has a rating system, these stories are easy to find: go to a subreddit (a sub-forum where one topic is discussed), sort by rating, set the filter to "top of all time", and just like that, you have a scoop... or at least that's what I initially thought.
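As a sketch of how simple that scraping step is: Reddit exposes a public JSON endpoint for every listing page, so a few lines of Python suffice. The subreddit name below is just an example, and a production scraper would rather use the official API through a library like PRAW.

```python
import requests

def top_stories(subreddit: str, limit: int = 10) -> list[dict]:
    """Fetch a subreddit's top-rated posts of all time via Reddit's public JSON endpoint."""
    response = requests.get(
        f"https://www.reddit.com/r/{subreddit}/top.json",
        params={"t": "all", "limit": limit},
        headers={"User-Agent": "story-scraper/0.1"},  # Reddit rejects the default user agent
        timeout=30,
    )
    response.raise_for_status()
    posts = response.json()["data"]["children"]
    return [{"title": p["data"]["title"],
             "text": p["data"]["selftext"],
             "score": p["data"]["score"]} for p in posts]

for story in top_stories("tifu", limit=5):
    print(story["score"], story["title"])
```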
The problem with this approach is that these stories are "internet famous": there is a high likelihood that your viewer has heard the story before. They'll listen for the first few seconds, conclude that they don't need to hear it again, swipe away, and tank your video's rating to the bottom of the algorithm.
Thankfully, if you are willing to abandon artistic integrity (or don't view it as such), you can fine-tune a generative AI model to write the stories for you. By collecting the top stories from a given story niche (or better yet, collecting them from many niches and using an unsupervised machine learning algorithm to cluster them into niches), you can make sure your generative model learns to create tall tales to entertain the world.
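For the niche-grouping step, here is one deliberately simple way it could be done: clustering TF-IDF vectors of the scraped stories with k-means via scikit-learn. This is an illustrative stand-in, not what any specific channel verifiably uses; each resulting cluster would then become a fine-tuning dataset for its own niche.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_into_niches(stories: list[str], n_niches: int = 5) -> list[int]:
    """Assign each story a niche label by clustering TF-IDF vectors with k-means."""
    vectors = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(stories)
    return list(KMeans(n_clusters=n_niches, n_init=10, random_state=0).fit_predict(vectors))
```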
Scenery
A defining feature of these short-form videos is the background video. Most posters strive to find something that occupies the part of the viewer's brain that isn't actively engaged in listening to the story. For this, bright colors and easy-to-interpret imagery are a must. The end goal is to engage the viewer on every level, keeping them locked onto the content. Familiar video content (the games Subway Surfers, Minecraft and GTA 5 work exceptionally well) makes interpretation easier and more fun for the viewer, evoking either nostalgia or active interest.
Generative AI does not play a role in this part of the content. While it is theoretically possible to train an AI model to play games to produce background video, it is currently far easier to take one large chunk of existing footage and cut it up into many small pieces.
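As a sketch of that cutting-up step, here is how one long gameplay recording could be chopped into fixed-length background clips with MoviePy 1.x (the file names and chunk length are arbitrary placeholders):

```python
from moviepy.editor import VideoFileClip

def chop_into_chunks(source_path: str, chunk_seconds: int = 60) -> None:
    """Cut one long gameplay recording into many short, silent background clips."""
    vod = VideoFileClip(source_path)
    for i, start in enumerate(range(0, int(vod.duration) - chunk_seconds + 1, chunk_seconds)):
        chunk = vod.subclip(start, start + chunk_seconds)
        chunk.write_videofile(f"background_{i:03d}.mp4", audio=False)  # drop the game audio
    vod.close()

chop_into_chunks("minecraft_vod.mp4")
```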
Again, the problem of the most obvious solution being suboptimal arises. There is a finite amount of explicitly royalty-free content that fits this medium well. If viewers recognize the specific background video from a competing channel, it engages them less. Therefore, some posters opt to either pay royalties (again, hopelessly eating away at the profit margin) or simply steal content that was never royalty-free.
As most of the footage originates from YouTube while the story content is posted on TikTok, this approach exploits a gap in content moderation policy. The two platforms rarely coordinate on copyright issues, especially for low-value content (say, someone's years-old Minecraft gameplay), so this method carries little risk.
Editing
The last component of a post of this nature is the editing. There are three main tasks to cover: cutting the background footage to last exactly as long as the story, displaying subtitles, and animating those subtitles. Several Python libraries currently offer rudimentary video editing that fits this purpose, such as MoviePy.
A simple approach would be to define static rules for these tasks. The subtitles in this type of content usually use a "pop" effect, where the text quickly enlarges and then slightly shrinks. Because humans are hard-wired to pay attention to fast-moving objects, this naturally draws the viewer's eyes to the subtitles.
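Here is a minimal sketch of that static-rules pop effect using MoviePy 1.x (TextClip requires ImageMagick to be installed). The text, font, and timings are placeholders; a real pipeline would generate one caption clip per phrase, timed against the voiceover.

```python
from moviepy.editor import AudioFileClip, CompositeVideoClip, TextClip, VideoFileClip

def pop(t: float) -> float:
    """Scale factor over time: grow past full size, then settle back -- a crude 'pop'."""
    if t < 0.1:
        return 0.5 + 6.5 * t            # 50% -> 115% over the first 100 ms
    if t < 0.2:
        return 1.15 - 1.5 * (t - 0.1)   # 115% -> 100% over the next 100 ms
    return 1.0

voiceover = AudioFileClip("voiceover.mp3")
# Assumes the background clip is at least as long as the voiceover.
background = VideoFileClip("background_000.mp4").subclip(0, voiceover.duration)

caption = (TextClip("AITA for this?", fontsize=70, color="white", font="Impact")
           .resize(pop)               # animate the size with the pop curve above
           .set_position("center")
           .set_start(0.0)
           .set_duration(2.0))

final = CompositeVideoClip([background, caption]).set_audio(voiceover)
final.write_videofile("post.mp4", fps=30)
```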
This leaves two tasks: cutting the video to fit the story (showing only attention-grabbing sequences) and displaying the subtitles (keeping related words together). Here, GenAI outperforms static rules: most general-knowledge large language models that support video RAG (retrieval-augmented generation) can be prompted to accomplish these tasks. Better yet, they can write code that they then interact with, setting the appropriate parameters for each video.
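What that prompting could look like for the subtitle task, sketched here with the OpenAI Python SDK: the model name and prompt are illustrative assumptions, the parsing optimistically trusts the model to return clean JSON, and the video-RAG variant for cut selection would additionally pass frames or embeddings.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def group_subtitle_words(transcript: str) -> list[str]:
    """Ask a general-purpose LLM to split a transcript into caption-sized phrases."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Split the user's transcript into caption phrases of one to four "
                        "words, keeping related words together. Reply with only a JSON "
                        "array of strings."},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(group_subtitle_words("I never thought my own code would beat me to the punch."))
```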
Running these models locally might not make business sense, as they require a powerful PC, which can prohibitively eat away at the profit margin. Third-party solutions do exist, and I trust that by now you have noticed a pattern: the token pricing is too high to be sustainable for the enterprise. All it takes to acquire this capability below market price is getting hold of a stolen API key for any state-of-the-art model, offloading the computation to a server far away.
FIN
With this, you now know all the details I uncovered about this type of content. I found that it is genuinely troublesome to make. My overarching theory about the creation pipeline is that, for any organization, it is simply too expensive to produce relative to the little income it brings in. TikTok reportedly pays around $20-50 per million views, which is simply not enough to support ethical creation of this content right now. However, I sincerely believe this will change, at which point the internet will have to collectively decide the fate of short-form storytime content. We all play a part in that conversation, so I encourage you to leave your opinion down in the comments section.