
OpenVoice currently ranks second-to-last in the Huggingface TTS arena leaderboard, well below alternatives like styletts2 and xtts2:

https://huggingface.co/spaces/TTS-AGI/TTS-Arena

(Click the leaderboard tab at the top to see rankings)



Having gone through almost ten rounds of the TTS Arena, XTTS2 has tons of artifacts that instantly make it sound non-human. OpenVoice doesn't.

It wouldn't surprise me if people recognize different algorithms and purposefully promote them over others, or alter the page source with a userscript to see the algorithm before listening and click the one they're trying to promote. Looking at the leaderboard, it's obvious there's manipulation going on, because Metavoice is highly ranked but generates absolutely terrible speech with extremely unnatural pauses.
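For context on why a handful of targeted votes can matter: arena-style leaderboards typically use an Elo-like rating, where each pairwise vote nudges the winner up and the loser down. Here's a minimal sketch of that update rule (the actual TTS Arena scoring may differ in K-factor and details):

```python
# Minimal Elo-style rating update, as used by arena-style leaderboards.

def expected(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_winner, r_loser, k=32):
    """Return new (winner, loser) ratings after one vote for the winner."""
    e = expected(r_winner, r_loser)
    return r_winner + k * (1 - e), r_loser - k * (1 - e)

# Two models start equal; ten one-sided votes (e.g. from someone who can
# identify the model before voting) open a large rating gap.
a, b = 1000.0, 1000.0
for _ in range(10):
    a, b = update(a, b)
print(round(a), round(b))
```

With only a few hundred total votes per model, a coordinated trickle like this is enough to visibly reorder the board.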

Elevenlabs was scarily natural sounding and high quality; the best of the ones I listened to so far. Pheme's speech sounds really natural overall, but has terrible sound quality, which is probably why it isn't ranked higher. If Pheme's audio quality were better, it'd probably match Elevenlabs.


I would like to see the new VoiceCraft model on that list eventually (weights released yesterday, discussion at [1]).

[1] https://news.ycombinator.com/item?id=39865340


I haven't tried OpenVoice, but I did try WhisperSpeech and it does the same thing. You can optionally pass in a file with a reference voice, and the TTS uses it to clone that voice.

https://github.com/collabora/whisperspeech

I found it kind of creepy hearing it in my own voice. I also tried it with a friend of mine who has a French Canadian accent, and strangely the output didn't carry his accent.
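Passing the reference voice looks roughly like this, assuming the `Pipeline` API shown in the WhisperSpeech README (exact method names and defaults may differ by version, and `reference.wav` is a hypothetical file):

```python
# Hedged sketch of WhisperSpeech voice cloning; requires the whisperspeech
# package and a PyTorch install, and downloads model weights on first run.
try:
    from whisperspeech.pipeline import Pipeline

    pipe = Pipeline()
    # Clone the voice in reference.wav (hypothetical file) and speak new text:
    pipe.generate_to_file("cloned.wav", "Hello in a borrowed voice.",
                          speaker="reference.wav")
    status = "generated"
except Exception:
    # whisperspeech and its model weights may not be available here
    status = "unavailable"
print(status)
```

Omitting the `speaker` argument falls back to a default voice.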


Is there a benchmark for the compute these models need? Curious whether anyone is building (or has built) a Zoom filter or mobile app where I speak English and the listener hears Chinese.


The HF TTS Arena asks whether the text-to-speech sounds human-like. That's somewhat different from voice cloning: a model might produce audio that is less human-like, but still sounds closer to the target voice.


As someone who has used the arena maybe ~3 times, the subpar voice quality in the linked demo immediately stood out to me.


I'd like to see Deepgram Aura on here.



