The Evolution of Sound in Gaming
Video game audio has come a long way since the simple beeps and 8-bit soundtracks of the late 1980s (Drake, 2019). Today, sound and voice overs play a crucial role in immersing players in rich, dynamic worlds and game scenarios (Gallacher, 2013; Bormann & Greitemeyer, 2015; Stingel-Voigt, 2020; Cesário et al., 2023). Another revolution is now underway, this one powered by artificial intelligence. AI-generated sounds and voice overs are rapidly transforming how game developers approach audio design, offering a new dimension of customization, scalability, and immersion (Filipović, 2023). This blog post dives into the emerging world of AI-generated sound and voice overs, highlights companies that already provide such solutions, and presents a case study on how major gaming companies such as Electronic Arts (EA) could harness this technology to enhance flagship franchises like FIFA (Nelva, 2024).
“And now of course we’re in the era of generative AI which is the most exciting yet by a fairly wide margin and something that we’re embracing deeply. We think about it in three core vectors: efficiency, expansion, and transformation.” – Andrew Wilson, CEO of EA.
Speaking of efficiency, EA CEO Andrew Wilson noted that the company’s business involves an incredibly iterative development cycle: pressing a button in a game doesn’t just need to trigger the desired effect on screen, it also needs to be fun. As a result, game development is very time-consuming, and new games still take a couple of years to fully develop (Morgan Stanley, 2024).
What is AI-Generated Sound and Voice Over?
At its core, AI-generated sound and voiceover technology relies on deep learning models that synthesize speech and sound effects after being trained on vast datasets. These models can be trained to replicate human voices, create new ones, or generate contextual soundscapes in real time. The output can then be applied to in-game NPCs, background noise such as stadium chanting, or interactions with other users (Replica Studios, 2024). AI voice models are particularly powerful because they scale easily, enabling developers to generate unique character voices or sound environments that evolve with player actions and game development. This technology is already being adopted by various companies, making sound design faster, cheaper, and more flexible than ever before (Filipović, 2023). I hope to see it implemented on a larger scale as soon as possible, as it brings many new functionalities and increased accessibility for all users.
A feasibility study across all of EA’s game development processes showed that about 60% of them have “high feasibility to be positively impacted by generative AI” (Morgan Stanley, 2024).
To give a more concrete example: building a stadium for a sports game such as FIFA used to take six months. In the past year it took six weeks, and it is not unreasonable to think that it will soon take less than six days. Wilson believes that extending this concept to every aspect of development could drive meaningful efficiency gains for EA (Nelva, 2024).
EA Sports uses many advanced technologies powered by revolutionary AI systems such as HyperMotion and AI Mimic (Molina, 2024).
Companies Leading the Way
- Replica Studios
Replica Studios is at the forefront of AI-generated voice overs, offering both text-to-speech and speech-to-speech solutions in multiple languages. They provide game developers with a library of AI voices that can deliver dialogue, narration, and character voices at scale. Replica’s platform allows developers to generate voice lines in minutes, which is a massive leap in efficiency compared to traditional voiceover production. They recently introduced a plug-in for Unreal Engine called Smart NPC, which lets players talk to any NPC through their microphone and receive a custom dialogue response in real time. It adjusts the emotional tone and intensity of the response and adds NPC facial expressions based on in-game events (Replica Studios, 2024). For games like FIFA, where commentators could dynamically react to player performance or key moments, this kind of AI-driven personalization could significantly elevate player engagement.
Voice Lab: Describe your voice, or the role or character you would like the AI to portray, and dream it into existence with Voice Lab, a prompt-to-voice design feature which can create a blend of up to 5 Replica voices which all contribute their unique accents, prosody, and other vocal features to the resulting new voice (Replica Studios, 2024).
- Eleven Labs
Eleven Labs specializes in deep-learning models that generate highly realistic and natural-sounding speech. Their ability to clone voices and synthesize speech in multiple languages has the potential to revolutionize localization for global games (Eleven Labs, 2024). For a company like EA, which regularly releases games in dozens of languages, Eleven Labs could dramatically reduce time and costs associated with localization. Moreover, Eleven Labs’ technology could allow players to customize their in-game avatars’ voices, adding an additional layer of personalization and immersion that would further enhance user experience across EA’s game portfolio.
- Sonantic
Sonantic, acquired by Spotify in 2022, focuses on generating emotionally modulated voice overs. Their AI voice models can express a range of emotions, from subtle sadness to intense excitement. This level of emotional depth is essential for creating believable character interactions and narratives in games (Virtucio, 2023). For a game like FIFA, Sonantic’s technology could enable commentators or characters to convey emotions based on real-time match scenarios, turning a victory celebration or last-minute goal into a more compelling experience for players.
Spotify to Acquire Sonantic, an AI Voice Platform (Spotify, 2022).
How does it work?
AI-generated voices and sounds rely on models that have been trained to understand and mimic human speech or to create sounds based on learned patterns. Here’s how it works step by step, using text-to-voice and in-game-action-to-sound as examples:
1. Training the AI Model
First, developers “teach” an AI how human voices sound. This is done by feeding the AI large amounts of voice recordings paired with the text spoken in those recordings. The AI learns to recognize patterns like how certain words are pronounced, different voice tones, and even emotions. This process is called machine learning. For sound effects, the same concept applies – the AI learns how different sounds (like footsteps, explosions, or wind) should sound based on data it’s trained on (Eleven Labs, 2023) (PlayHT, 2024).
2. Turning Text or Actions into Sound
Once the AI is trained, here’s what happens when it needs to turn text or an in-game action into sound or voice:
- Step 1 – Input: The game sends the AI a command based on what’s happening. This command can be a piece of text (for voiceovers) or an in-game action (like a character running). When a game such as FIFA supplies text, the AI system breaks it down into phonetic components. It then synthesizes these components, piecing them together to form words and sentences (Eleven Labs, 2023).
- Step 2 – Processing: The AI processes this input. It uses its training to understand how the text should be pronounced or what sound should be made based on the action. For example, it knows how to emphasize excitement when saying “Goal!” in a sports game. To enhance realism, some advanced AI voice generators incorporate techniques like Natural Language Processing (NLP). NLP helps the system understand and interpret the nuances of language, allowing it to modify its speech output accordingly. This includes adjusting for sarcasm, questions, or excitement, making the synthetic voice sound more natural and human-like (Eleven Labs, 2023).
- Step 3 – Sound Generation: Using this understanding, the AI generates the actual sound. For text, it creates a voiceover that sounds natural, as if a human said it. For in-game actions, it produces the appropriate sound effect, like the crowd cheering or the sound of the ball hitting the net (PlayHT, 2024).
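To make the three steps above a bit more tangible, here is a minimal Python sketch of the input and processing stages. Everything in it is an assumption made for illustration: the tiny `PHONEME_LEXICON`, the `SpeechPlan` structure, and the punctuation-based prosody rule are hypothetical stand-ins for the learned models a real voice generator would use. The sketch also stops before Step 3, since producing actual audio requires a trained synthesis model.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon. Real systems learn grapheme-to-phoneme mappings
# from data instead of looking words up in a hand-written table.
PHONEME_LEXICON = {
    "goal": ["G", "OW", "L"],
    "what": ["W", "AH", "T"],
    "a": ["AH"],
    "save": ["S", "EY", "V"],
}


@dataclass
class SpeechPlan:
    """What a synthesis model would receive: sounds plus delivery hints."""
    phonemes: list[str]
    pitch_scale: float  # 1.0 means neutral pitch
    rate_scale: float   # 1.0 means neutral speaking rate


def text_to_phonemes(text: str) -> list[str]:
    """Step 1 (input): break the text into phonetic components."""
    phonemes: list[str] = []
    for word in text.lower().split():
        word = word.strip("!?.,")
        # Unknown words fall back to being spelled out letter by letter.
        phonemes.extend(PHONEME_LEXICON.get(word, list(word.upper())))
    return phonemes


def plan_prosody(text: str) -> tuple[float, float]:
    """Step 2 (processing): crude stand-in for NLP-based tone analysis."""
    stripped = text.rstrip()
    if stripped.endswith("!"):
        return 1.3, 1.15  # excited: higher pitch, slightly faster
    if stripped.endswith("?"):
        return 1.15, 1.0  # question: raised pitch
    return 1.0, 1.0


def build_speech_plan(text: str) -> SpeechPlan:
    """Combine both stages into the plan handed to sound generation."""
    pitch, rate = plan_prosody(text)
    return SpeechPlan(text_to_phonemes(text), pitch, rate)


if __name__ == "__main__":
    print(build_speech_plan("Goal!"))
```

Keeping the phoneme step and the prosody step separate mirrors the description above: the same words can be delivered calmly or excitedly without changing what is being said.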
3. Real-Time Adjustments
One of the cool things about AI-generated sounds is that they can adjust in real time. In a game, AI can react immediately to what’s happening:
- If a player scores a goal, the AI might dynamically adjust the commentator’s tone to match the excitement of the moment.
- For sound effects, the AI might adjust the intensity or volume of crowd noises based on the importance of the goal.
In the end, all of this happens quickly and seamlessly in the background. The AI takes text or actions, interprets what they mean, and instantly turns them into sound or voiceover, creating a more immersive experience for the player.
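The two real-time adjustments described above (commentator tone and crowd intensity) can be sketched as a simple mapping from match context to audio parameters. This is only an illustration under my own assumptions: the `GoalEvent` fields, the scoring weights, and the decibel range are all hypothetical and not taken from any EA system. A production system would more likely learn such a mapping than hard-code it.

```python
from dataclasses import dataclass


@dataclass
class GoalEvent:
    """Hypothetical description of a goal, as the game might report it."""
    minute: int
    score_diff_after: int  # scoring team's lead after the goal (0 = equalizer)
    is_knockout: bool


def goal_importance(event: GoalEvent) -> float:
    """Score a goal's importance from 0.0 to 1.0 using made-up weights."""
    points = 40  # every goal matters to some degree
    if event.score_diff_after in (0, 1):
        points += 30  # equalizer or go-ahead goal
    if event.minute >= 85:
        points += 20  # late drama
    if event.is_knockout:
        points += 10  # higher stakes
    return min(points, 100) / 100


def audio_params(event: GoalEvent) -> dict[str, float]:
    """Turn importance into commentator and crowd settings."""
    importance = goal_importance(event)
    return {
        # 0.0 = flat delivery, 1.0 = maximum excitement
        "commentator_excitement": importance,
        # Crowd loudness offset, from -6 dB (routine) to +6 dB (decisive)
        "crowd_gain_db": round(-6.0 + 12.0 * importance, 1),
    }


if __name__ == "__main__":
    late_winner = GoalEvent(minute=90, score_diff_after=1, is_knockout=True)
    print(audio_params(late_winner))
```

The point of the design is that one importance score drives both voice and crowd, so the two stay in sync rather than being tuned separately.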
What are the differences between text-to-speech versus AI voice generation?
Feature | Text-to-Speech (TTS) | AI Voice Generation |
Technology | Synthesizes speech from text using basic digital voices. | Employs advanced machine learning algorithms to generate more natural-sounding voices. |
Customization | Limited to pre-set voices and basic adjustments in pitch and speed. | Offers extensive customization, including voice cloning and nuanced emotional tones. |
Realism | Often sounds robotic and less natural. | Produces highly realistic and human-like speech. |
Application | Widely used for reading text aloud in a straightforward manner. | Used for creating dynamic and engaging audio content, mimicking human speech patterns more accurately. |
Flexibility | Generally offers a one-size-fits-all approach. | Allows for creating unique voices tailored to specific needs or characters. |
User Interaction | Primarily unidirectional; reads text as-is. | Can interact more fluidly in conversational AI, adapting tone and style contextually. |
Development | Based on simpler speech synthesis technology. | Involves complex AI models like neural networks for voice generation. |
Use Cases | Useful in accessibility tools, GPS navigation, and basic voice assistants. | Ideal for high-quality voiceovers, virtual assistants, gaming, and personalized customer interactions. |
How EA and FIFA Could Leverage AI-Generated Voiceovers: Case Study
EA Sports is already a giant in the gaming industry, with FIFA being one of the best-selling video game franchises of all time. By integrating AI-generated voiceovers and sounds, EA could unlock several strategic advantages (Nelva, 2024).
EA Sports FC 24 has hundreds of run cycles for its players built by generative AI (Nelva, 2024).
- Enhanced Player Engagement with Dynamic Voiceovers
One of the most exciting applications of AI-generated voices for EA would be the introduction of dynamic, real-time voice overs in FIFA. Currently, FIFA’s commentators are pre-recorded, with a fixed number of responses to match events. With AI-generated voice overs, commentators could react dynamically to player actions, offering new commentary each time a similar event occurs. For example, in a high-stakes match, the AI commentator could offer unique insights based on the players’ performance history or their current standing in a tournament. This level of customization could lead to increased immersion, keeping players more engaged and extending the life cycle of each game. Replayability would also improve as players receive fresh commentary in every match.
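As a rough illustration of commentary that stays fresh and draws on player statistics, the Python sketch below picks a line for each goal while refusing to reuse recently spoken ones. It is deliberately simplified and rests on assumptions of mine: the `TEMPLATES`, the `CommentaryPicker` class, and the player name are invented for this example. A real AI commentator would generate the wording itself rather than fill in fixed templates, but the two ideas shown here (conditioning on context, avoiding repetition) would still apply.

```python
import random
from collections import deque

# Hypothetical commentary templates, filled in with live match context.
TEMPLATES = [
    "{player} scores! That is goal number {season_goals} this season.",
    "What a finish from {player}, now on {season_goals} for the campaign!",
    "{player} again! {season_goals} goals and counting.",
]


class CommentaryPicker:
    """Chooses a line for an event while avoiding recently used lines."""

    def __init__(self, templates, memory=2, rng=None):
        if memory >= len(templates):
            raise ValueError("memory must be smaller than the template count")
        self.templates = list(templates)
        self.recent = deque(maxlen=memory)  # last few templates spoken
        self.rng = rng or random.Random()

    def next_line(self, player: str, season_goals: int) -> str:
        # Only consider templates that were not used recently.
        candidates = [t for t in self.templates if t not in self.recent]
        template = self.rng.choice(candidates)
        self.recent.append(template)
        return template.format(player=player, season_goals=season_goals)


if __name__ == "__main__":
    picker = CommentaryPicker(TEMPLATES, rng=random.Random(0))
    for goals in (11, 12, 13):
        print(picker.next_line("Rossi", goals))
```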
- Cost-Effective Localization and Multi-Language Support
FIFA games are released in numerous languages, requiring extensive voice recording for each localized version. With tools like Eleven Labs, EA could significantly reduce the cost and time associated with this process. AI voice synthesis could generate high-quality localized commentary and dialogue for global markets more quickly and at a fraction of the traditional cost. This scalability would also allow EA to release more language options simultaneously, expanding its market reach and improving its presence in regions that currently have limited localization support.
- Monetization Opportunities Through Custom Voice Packs
AI-generated voices open up a new avenue for monetization through downloadable content (DLC). EA could sell custom voice packs, allowing players to download unique commentators, player voice overs, or even region-specific packs that provide a more personalized gaming experience. For example, fans could purchase special voice packs for their favorite leagues or teams, or even retro-style commentators from past FIFA games. This type of microtransaction could drive revenue while providing additional value to players.
Risks and Ethical Considerations
Despite the clear advantages, implementing AI-generated voices and sounds is not without risks. One concern is the potential displacement of voice actors, as AI-generated voices reduce the need for human talent. Companies like EA will need to balance innovation with the preservation of creative jobs, potentially by using AI voices as a supplement to human actors rather than a full replacement. Another ethical concern is the misuse of AI-generated voices, particularly when it comes to voice cloning. Companies must ensure that AI models are used transparently and ethically to avoid issues like deepfakes or unauthorized voice replication. In the case of EA, clear policies on voice data and AI usage will be necessary to maintain player trust.
Strategic Implications for EA
By adopting AI-generated voice overs, EA could further solidify its leadership in the gaming industry while enhancing its ability to innovate and scale. Key strategic benefits include:
- Faster Development Cycles: With AI handling more repetitive voiceover tasks, EA could release games and updates more quickly, maintaining its competitive edge.
- Expanded Market Reach: Efficient localization would allow EA to target more global markets, increasing the international appeal of its FIFA franchise.
- New Revenue Streams: Custom voice packs and AI-enhanced features could create additional microtransaction opportunities and further drive this classic’s popularity, supporting EA’s continued financial success.
Conclusion
AI-generated sounds and voice overs open a new era of game audio and represent a major step forward in how games are developed and experienced. Companies like Replica Studios and Eleven Labs are pushing the boundaries of what’s possible in the video game world, and large developers like EA stand to benefit immensely from these advancements. By embracing this technology, EA can not only improve its games but also shape the future of audio in gaming, creating richer, more personalized, and more immersive experiences for players all around the world. I’m excited to see how these concepts become reality in the near future.
References
Bormann, D., & Greitemeyer, T. (2015). Immersed in Virtual Worlds and Minds. Social Psychological and Personality Science. https://www.semanticscholar.org/paper/Immersed-in-Virtual-Worlds-and-Minds-Bormann-Greitemeyer/cd705ccbcb2d3316e8645ec05bf08e22974fbbce
Cesário, V., Ribeiro, M., & Coelho, A. (2023). Design Recommendations for Improving Immersion in Role-Playing Video Games. A Focus on Storytelling and Localisation. Interaction Design & Architecture(s) Journal. https://doi.org/10.55612/s-5002-058-009
Drake, J. (2019, August 11). The 10 Best Soundtracks From The 8-Bit Generation. TheGamer. Retrieved September 19, 2024, from https://www.thegamer.com/best-soundtracks-retro-games-8-bit/
Eleven Labs. (2023, January 11). This Voice Doesn’t Exist – Generative Voice AI. ElevenLabs. Retrieved September 19, 2024, from https://elevenlabs.io/blog/enter-the-new-year-with-a-bang
Eleven Labs. (2023, December 3). What is an AI voice generator and how does it work? ElevenLabs. https://elevenlabs.io/blog/what-is-an-ai-voice-generator
Eleven Labs. (2024). AI Dubbing: Free Online Video Translator. ElevenLabs. https://elevenlabs.io/dubbing
Filipović, A. (2023). The Role of Artificial Intelligence in Video Game Development. Kultura Polisa. https://www.ceeol.com/search/article-detail?id=1201751
Gallacher, N. (2013). Game audio — an investigation into the effect of audio on player immersion. The Computer Games Journal. https://link.springer.com/article/10.1007/BF03392342
Molina, D. (2024, March 11). The Dawning of a New Era: AI Takes the Field in EA Sports FC. FIFA Infinity. Retrieved September 19, 2024, from https://www.fifa-infinity.com/ea-sports-fc/the-dawning-of-a-new-era-ai-takes-the-field-in-ea-sports-fc/
Morgan Stanley. (2024). Tech, Media & Telecom 2024: The State of Generative AI. Morgan Stanley. https://www.morganstanley.com/Themes/tech-media-telecom-trends-insights-outlook
Nelva, G. (2024). EA Hopes to Use Generative AI to Drive Monetization and Make Development 30% More Efficient. TechRaptor. https://techraptor.net/gaming/news/ea-hopes-to-use-generative-ai-to-drive-more-monetization-and-make-development-30-more
PlayHT. (2024). What is an AI Voice Generator? PlayHT. https://play.ht/blog/what-is-an-ai-voice-generator/
Replica Studios. (2024). Smart NPCs | Ethical AI. Replica Studios. https://www.replicastudios.com/products/smart-npcs
Replica Studios. (2024). Voice Lab. Replica Studios. https://www.replicastudios.com/products/voice-lab
Spotify. (2022, June 13). Spotify to Acquire Sonantic, an AI Voice Platform — Spotify. Spotify Newsroom. Retrieved September 19, 2024, from https://newsroom.spotify.com/2022-06-13/spotify-to-acquire-sonantic-an-ai-voice-platform/
Stingel-Voigt, Y. (2020). Functions and Meanings of Vocal Sound in Video Games. Journal of Sound and Music in Games. https://online.ucpress.edu/jsmg/article-abstract/1/2/25/106828/Functions-and-Meanings-of-Vocal-Sound-in-Video?redirectedFrom=fulltext
Virtucio, M. (2023, January 20). Sonantic AI Voice Generator: Detailed. Softlist.io. Retrieved September 19, 2024, from https://www.softlist.io/sonantic-ai-voice-generator-detailed/