#Testing the quality of the new lightweight ASR


uncut junco
#

Hello. If you have some time and basic Python knowledge (not mandatory, you can ask an LLM for help), I suggest testing the new streaming models from Kroko. They claim impressive metrics, but practical testing is needed. I can’t evaluate all languages myself due to the language barrier.
The following languages are supported: [English, Dutch, French, Portuguese, Spanish, German, Italian, Swedish, Hebrew, Turkish].

Server: https://github.com/mitrokun/wyoming_streaming_asr/

Obtaining the models is made somewhat complicated by their creators, so I had to create an encoder. You need to download the model from HF, place it in the models/kroko directory, and run the script. There are detailed instructions in that directory. 64 and 128 refer to the window size in ms; you can try any model.
The server has absolutely no hardware requirements; almost any CPU will do.
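
To get a feel for what the 64 and 128 ms window sizes mean in terms of audio data, here is a small sketch. The 16 kHz sample rate and 16-bit mono PCM format are assumptions (they are common for ASR input), not something confirmed for these models.

```python
# Assumption: 16 kHz, 16-bit mono PCM input. Not confirmed for the Kroko models.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2


def samples_per_window(window_ms: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples that fit into one window of the given length."""
    return sample_rate_hz * window_ms // 1000


if __name__ == "__main__":
    for window_ms in (64, 128):
        n = samples_per_window(window_ms)
        print(f"{window_ms} ms window: {n} samples, {n * BYTES_PER_SAMPLE} bytes")
```

Under those assumptions, the 128 ms window covers twice as much audio per step as the 64 ms one.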

To start, specify the path to the directory, the language, and the --debug flag to conveniently evaluate the ASR performance. The default port is 10303.
The "command" option is a separate mechanism for enhancing user experience; leave it for later.
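
Once the server is started, a quick way to confirm that it is actually listening is a plain TCP connection check. This is only a sketch: it assumes the server runs on the same machine (`127.0.0.1`) and uses the default port 10303 mentioned above. It does not speak the Wyoming protocol, it only checks that the port accepts connections.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 10303, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("server reachable:", is_listening())
```

If this prints `False`, check the host, the port, and whether the server process is still running before looking at the ASR output itself.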

I would appreciate your feedback. If it is really as good as it claims to be, then we can make an add-on.

spare bridge
potent estuary
#

Did you try it yourself? Is it worth the hassle?
Good STT is something we all need, and one that doesn't require a GPU to work fast looks like a unicorn.

uncut junco

Of course, I did a lot of tests with the vosk model while setting up the server. I wouldn’t say a GPU is strictly necessary; models like Parakeet int8 (or gigaAM for ru) work well on CPU.
However, streaming models are more interesting because they allow schemes where recognition is cut short as soon as a command is recognized, without waiting for VAD (this may be a solution for noisy rooms), or where you cancel the input after a couple of words.
Also, Zipformer runs very fast even on old Celerons since operations are performed simultaneously with listening. You can speak for 10 seconds, and the final latency will still be 0.1 or 0.2 seconds.
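
Why the delay does not grow with the length of the utterance can be shown with a rough model. This is a simplified sketch, not a measurement: the real-time factor (`rtf`, processing time divided by audio duration) of 0.1 is a made-up illustrative value, and the function names are hypothetical. It ignores everything else (endpoint detection, finalization, network), so it is not meant to reproduce the 0.1 to 0.2 second figure.

```python
def batch_latency_s(speech_s: float, rtf: float) -> float:
    """Non-streaming: the whole utterance is processed after speech ends."""
    return speech_s * rtf


def streaming_latency_s(window_s: float, rtf: float) -> float:
    """Streaming: earlier windows were processed while the user was speaking,
    so only the last window is left once speech ends (assumes rtf < 1)."""
    return window_s * rtf


if __name__ == "__main__":
    rtf = 0.1        # assumption: illustrative value, not measured
    window_s = 0.128  # the 128 ms window size
    for speech_s in (2.0, 10.0, 30.0):
        print(
            f"{speech_s:>5.1f} s of speech: "
            f"batch {batch_latency_s(speech_s, rtf):.3f} s, "
            f"streaming {streaming_latency_s(window_s, rtf):.4f} s"
        )
```

In this model the batch delay scales with how long you speak, while the streaming delay depends only on the window size, which is the reason a slow CPU can still feel responsive as long as it keeps up with real time.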

uncut junco