No more need for audiobook recordings? Comparing Speech Synthesis of Eleven Labs between English and German

13 October 2023


The AI text-to-speech generator Eleven Labs (https://llelevenlabs.com/) offers a total of 28 languages, including German and English. I used the default voice settings with the Eleven Multilingual v1 model and tried four different voices (see picture below) in the audiobook category, because my input text is taken from a book that is published in both English and German (Yogananda, 2005; Yogananda, 2006).

The first voice, “Mathilda”, is described as warm and American for the English speech. In my opinion, the generated speech had the same good style, flow, and warmth in both languages. However, to be extremely critical, it was not quite as playful and lively as I am used to from traditional audiobooks.

The next voice, “Grace”, is supposed to be calm and southern-American for the English speech. In German, the pauses and emphases seem slightly inauthentic, which makes the speech sound a little robotic. In English, by contrast, the speech is calm as promised because the pauses and emphases come at the right moments.

Next, I tried two male voices. “Matthew” is labelled warm and British for the English speech. In both languages, “he” was my favourite: calm, taking pauses, yet lively. It sounds like a genuinely recorded audiobook. I did not notice a difference in quality between German and English.

Lastly, “Michael” is described as orotund and American for the English speech. To me, the speech did sound orotund in both languages, but the German version again sounded slightly robotic.

The two images below show the Eleven Labs interface with my text (picture 1 in English and picture 2 in German) and my settings. After generation, the audio is available at the bottom.

To conclude, I am fascinated by how authentic the audio generated by Eleven Labs sounds. I found two of the four voices more authentic in English than in German, so the German speech seems to have more room for improvement before it stops sounding robotic. As AI text-to-speech generators will probably only improve further, AI appears to be a disruptive technology for the audiobook recording industry. The technology may also take root in similar industries: will we soon have the news on the radio read out by an AI?

References

Yogananda, P. (2005). Autobiography of a Yogi: The Original 1946 Edition plus Bonus Material (p. 13). Crystal Clarity Publishers.

Yogānanda. (2006). Autobiographie: Übersetzung der Originalausgabe von “Autobiography of a Yogi” aus dem Jahre 1946 (p. 13). Hans-Nietsch-Verlag.


Text-to-image + text? Testing the accuracy of text-to-image AI generators when the desired image includes text.

29 September 2023


Is inaccurate rendering of text within images currently a limitation of text-to-image AI generators? My motivation for testing this is the potential disruption these tools could cause in industries such as advertising and comics, where text plays an important role in images. I tested three different AI generators (Night Café, Canva, Dall-E 2) with two different input texts:

  1. “A poster in a sports bar that says “Happy Hour from 4-6pm””
  2. “A girl saying “Hello” in a speech bubble and a boy in a garden on a sunny day”

Night Café generated a sports bar with an illuminated wall panel. Overall, the text on the cardboard is blurry, with only one word clearly legible: “Hour”. Using input 2), it generated an image of a girl and a boy smiling at each other in a park on a sunny day, but without any sign of “Hello” (see two images below).

Compared to Night Café, the AI generator on Canva delivered more accurate output. For the first example, the words “Happy Hour” were generated in illuminated letters. Yet, instead of “from 4-6pm”, a second group of illuminated letters reads “Happur”. For 2), the generated image is almost accurate, except that “HELLO!” appears above the boy’s head instead of the girl’s (see two images below).

Lastly, Dall-E 2 performed similarly to Night Café on 1) and best of the three on 2). For 1), it generated an illuminated panel with the word “Happy” along with random letters and numbers underneath. As for 2), the picture was generated according to the request (see two images below).

To conclude, the technology is not yet at a point where text-to-image generators render text reliably. Text elements may be missing from the image entirely, blurry, or incorrect, and are only occasionally rendered correctly. Thus, AI cannot yet replace designers working in the advertising and comic industries. However, it can currently be seen as a “tool or collaborative assistant for creativity” (Anantrasirichai & Bull, 2022) that can increase the efficiency of the design process.

References

Anantrasirichai, N., & Bull, D. (2022). Artificial intelligence in the creative industries: A review. Artificial Intelligence Review, 1-68.
