Real Time Speech Recognition with Gradio (gradio.app)
68 points by ak391 on March 30, 2022 | hide | past | favorite | 14 comments


The title is very misleading. This is a thin 10-line Gradio GUI in front of the Huggingface Pipeline API, which in turn downloads 1000+ Python files, a professionally pre-trained 1 GB ASR model, and a 500 MB language model. Gradio contributes none of that; it is merely the GUI framework.

"Gradio GUI Python Package is compatible with Huggingface Inference Python Package"

Yeah, duh.

Also, I'm surprised that they chose Mozilla DeepSpeech, which was last updated in 2020, instead of wav2vec2, which is actually competitive in recognition quality.

EDIT: BTW if you're curious, you can try out many of the Huggingface pre-trained models here:

https://huggingface.co/spaces/huggingface/hf-speech-bench

and here, for example, is a Facebook pre-trained English model with good performance that you can easily embed into your own Python apps via the [Use in Transformers] button at the top right of the page:

https://huggingface.co/facebook/wav2vec2-base-960h


Do you know if there are any modern, open recognizers that are not "massively deep learning", i.e. that one could tweak and retrain without Google-level resources, and that are not completely black box (not just wave to text)?

It would be interesting to build a more clever voice assistant. E.g. you say "computer, play ..." and it would immediately call a recognizer for game titles, and one for song titles. And the song title recognizer would know that certain song names are in English, other in Spanish, and so on. "OK Google" constantly gets tripped up on this case for me. It seems to transcribe everything to my language (German in this case) and only then interpret it. If I ask for an English language song I get silly results.


Kaldi hybrid models (mixture of DNNs and HMM acoustic modelling with WFST based decoding) are still competitive. They’re only really outperformed by the largest models (and even then, that may be a function of investment in man-hours in tuning and data cleaning, rather than the capacity of the underlying models).


You can tweak and retrain many existing models with transfer learning, and that does not require you to have Google-level resources ... only the people who train the original model need those.


Sorry, but the short answer is: No.

If you want good recognition quality, there's no way around massive deep learning models and getting access to 64x V100 for a month is just expensive, no matter how you approach that.

On the other hand, if you are interested in German and English, you might be in luck :)

In Schleswig-Holstein, there'll be local government voting in May, and the chamber of commerce has pushed the topics towards "digitization" and "high tech research". But the problem is, even if they wanted to, most German companies cannot just use Google/Microsoft/Amazon for their speech recognition, due to data privacy and industry espionage concerns.

We're in the process of setting up a collaboration between multiple small companies to promote local AI usage (https://deutscheki.de/, work in progress). As part of that effort, we're building mixed-language German-English acoustic models and language models with the focus on them being re-usable and a heavy focus on privacy.

The financial approach is that the tools will be accessible for free for non-commercial use, and companies can buy them for integration into their own products and/or to install them onto their own private infrastructure. Simultaneously, we'll be lobbying the local government to buy out the source code, which would turn the whole thing into a government-paid open source product. (Horizon Europe is the funding budget for that.)

In our approach, the super-expensive acoustic model is going to output a real-time stream of phoneme- or character-level tokens. So if you say "Hallo" it'll output "H H A L L : L L O O O _" with : being the glottal stop and _ being the word end silence.

This data can then be easily transformed into "HALLO " and on average we see a 95% success rate on the character level.
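That collapsing step can be sketched in a few lines of plain Python. This is a toy illustration, not the project's actual code; it assumes ":" separates genuinely repeated letters (the glottal stop in the example above) and "_" marks word-end silence:

```python
# Collapse a repeated phoneme/character token stream into a word.
# ":" resets the repeat check so a doubled letter survives;
# "_" becomes a trailing space (word-end silence).

def collapse_tokens(tokens):
    out = []
    prev = None
    for tok in tokens:
        if tok == ":":
            prev = None          # next token counts even if it repeats
        elif tok == "_":
            out.append(" ")
            prev = None
        else:
            if tok != prev:
                out.append(tok)
            prev = tok
    return "".join(out)

print(collapse_tokens("H H A L L : L L O O O _".split()))  # "HALLO "
```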

For more advanced use-cases, the token sequence along with their likelihoods can be fed into a language model. That one would do stuff like turn "HALO" into "HALLO". Those language models are based on KenLM and you can train a good one with a 2TB SSD and 64 GB of RAM, so that'll be in the realm of regular people.
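To make the rescoring idea concrete, here is a toy word-frequency "language model" in plain Python. The real system described above uses KenLM, a proper n-gram toolkit; the training text here is invented purely for illustration:

```python
# Toy LM rescoring: among candidate transcripts of equal length, prefer
# the one whose words are most frequent in the training text.
from collections import Counter
import math

def train(text):
    return Counter(text.split())

def score(candidate, counts, alpha=1.0):
    """Add-one-smoothed log-probability of a candidate transcript."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab))
        for w in candidate.split()
    )

lm = train("HALLO WELT HALLO LEUTE GUTEN TAG HALLO WELT")
candidates = ["HALO WELT", "HALLO WELT"]   # acoustically similar outputs
best = max(candidates, key=lambda c: score(c, lm))
print(best)  # "HALLO WELT"
```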

The full pipeline currently ranks worse than Microsoft Azure but better than Google Cloud in word recognition accuracy. Estimated training costs so far are €20k.

Users will hopefully receive the whole thing as a pre-compiled Windows DLL / Linux SO with SWIG language bindings. The idea behind splitting things into "wav to phonemes" and "phonemes to words" is that it allows participating companies to train specialized word models for their field and also to do hotword boosting, which is kind of what you want. If the AI recognizes that you said "computer, play ..." then it can switch to a language model which only allows song titles for the rest of the sentence. Similarly, the acoustic part will work for both German and English, so by switching the language model, you can switch the recognition language on the fly.
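The hotword-switching idea can be sketched as a dispatcher that, once it sees the "computer, play ..." prefix, matches the remainder of the utterance only against a song-title vocabulary. The titles and the fuzzy-matching approach here are invented stand-ins for a real constrained language model:

```python
# Hypothetical sketch: after the hotword prefix, restrict the "language
# model" to a closed list of song titles via fuzzy matching.
import difflib

SONG_TITLES = ["Bohemian Rhapsody", "99 Luftballons", "Let It Be"]

def interpret(transcript):
    prefix = "computer play "
    text = transcript.lower().replace(",", "")
    if not text.startswith(prefix):
        return None
    query = text[len(prefix):]
    lowered = [t.lower() for t in SONG_TITLES]
    # Constrained decoding stand-in: only song titles are valid outputs.
    matches = difflib.get_close_matches(query, lowered, n=1, cutoff=0.4)
    if not matches:
        return None
    return SONG_TITLES[lowered.index(matches[0])]

print(interpret("computer, play bohemian rapsody"))  # "Bohemian Rhapsody"
```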

We are also preparing additional models like webcam virtual background replacement, spell- and grammar-checking, automated video subtitling, phone Q&A systems, etc.

Feel free to email us with the address on that page and I'll reply when we have a first release out.


Given the hf-speech-bench list, is there any easy way to tell whether a model supports streaming speech? Or do all models support streaming speech?


Most models sadly do not. If they rely on a beam search decoder (most do) then doing streaming data processing requires additional trickery to keep the search state up-to-date.

As such, I'm also not aware of any easy check. That said, you can probably convert most deep learning models to work with streaming data without having to retrain the AI.


Wow, this is really, really bad. Try this one to compare.

https://azure.microsoft.com/en-us/services/cognitive-service...

I don't work for MSFT.


This is not really a demo of the voice transcription quality but of the platform. It just shows how to take an open source machine learning library (deepspeech) and make it into a webapp quickly.


Ah ok. I have done that as well. DeepSpeech, SpeechBrain, and the other open source models for transcription are unfortunately not good, probably because they don't have enough training data relative to the big guys. You should show CLIP - probably the best open source model I have seen, as it was trained on a huge corpus.


Yeah not really working as expected.

Was interested in this as I'm looking to build a "swearing detector" to help me swear less in video calls but this could not pick up one sentence properly out of a couple and then it started throwing errors.

Think it needs some time back in the lab tbh.


TUSTING TUSTING TESTTEST ONE TO TEST TEST ONE DO ONE TWO HREE FOUR OUD FIVE SIX SEVEN EIGHT

I wouldn't exactly call this a success


Really nice work on the GUI, keep it up!


ASR on my Pixel 6 has been a game changer, combo of accuracy and speed.



