QtSpeechRecognition API for Qt Using Pocketsphinx

It is really great to see the wide variety of APIs arising around Pocketsphinx. One recent addition is the QtSpeechRecognition API, implemented by Code-Q for assistive applications. The undertaking is quite ambitious; the main features include:

  • Speech recognition engines are loaded as plug-ins.
  • Engine is controlled asynchronously, causing only minimal load to the
    application thread.
  • Built-in task queue makes plug-in development easier and forces
    unified behavior between engine integrations.
  • Engine integration handles the audio recording, making it easy to use
    from the application.
  • Application can create multiple grammars and switch between them.
  • Setting mute temporarily disables speech recognition, allowing
    co-operation with audio output (speech prompts or audio cues).
  • Includes integration to PocketSphinx engine (latest codebase) as a

You can discuss the features and find more details in the following thread on the Qt mailing list. The sources are under review in the qtspeech project, branch wip/speech-recognition.

The implementation already includes some pretty interesting features; for example, it intelligently saves and restores the CMN (cepstral mean normalization) state for more robust recognition. So let us see how it goes.

New language model binary format

Expectations for vocabulary size in LVCSR have grown dramatically in recent years. 150 thousand words is a must for modern speech recognizers, while 10 years ago most systems operated with only 60 thousand words. In several morphologically rich languages like Russian, a vocabulary of this size is critical for a good OOV rate, but even in English it is important because of the variety of topics one can expect as input. With such a large vocabulary, ngram language models have to store millions of ngrams and their weights, which requires a memory-efficient data structure that allows fast queries. Ngram language models are also widely used in NLP and machine translation, so this topic has received a lot of attention in recent years. Several language modeling toolkits, like SRILM, IRSTLM, MITLM and BerkeleyLM, implement special data structures to hold the language model.

CMUSphinx decoders use their own ngram model data structure, which supports files in ARPA and DMP formats. While it has some fancy techniques like trie organization of ngrams, simple weight quantization and sorted ngram arrays, there is a serious shortcoming: the word ID is limited to the uint16 type, so the maximum vocabulary is 65k words. Simply replacing the ID type would seriously increase the size of the language models currently in use. Moreover, the current implementation is limited to a maximum ngram order of 3. So it was decided to implement a new state-of-the-art data structure. The KenLM reverse trie data structure was selected as the base for the CMUSphinx implementation. “Reverse” means that the last word of an ngram is looked up first:

In the example above, the trigram “is one of” is queried. Each node contains a “next pointer” to the beginning of its successor list. A separate array is maintained for each ngram order. Each node contains quantized weights: probability and backoff. Nodes are stored in a bit array, i.e. only the minimum number of bits required to store next pointers and word IDs is used.
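The lookup order described here can be sketched in a few lines of Python. This is only an illustration of the idea, not the sphinxbase or KenLM code: the names `build_levels`, `lookup` and `bits_needed` are made up for this example, words are kept as plain strings instead of integer IDs, weights are not quantized, backoff weights are left out, and the toy probabilities are invented. It assumes that every stored ngram also has its shorter suffix stored (for "is one of" that means "one of" and "of").

```python
from bisect import bisect_left


def build_levels(ngrams):
    """Build one sorted array per ngram order from {word tuple: log10 prob}."""
    max_order = max(len(g) for g in ngrams)
    # Reverse every ngram so the last word comes first, then sort per order.
    keys = [sorted(g[::-1] for g in ngrams if len(g) == n)
            for n in range(1, max_order + 1)]
    levels = []
    for k, level_keys in enumerate(keys):
        children = keys[k + 1] if k + 1 < max_order else []
        prefixes = [c[:-1] for c in children]
        # Node layout: (word, log probability, index of first successor).
        nodes = [(key[-1], ngrams[key[::-1]], bisect_left(prefixes, key))
                 for key in level_keys]
        # Sentinel node: its next pointer closes the last successor range.
        nodes.append((None, None, len(children)))
        levels.append(nodes)
    return levels


def lookup(levels, words):
    """Return (matched order, log prob) for the longest stored suffix."""
    lo, hi = 0, len(levels[0]) - 1  # all unigrams, sentinel excluded
    best = (0, None)
    for k, word in enumerate(reversed(words)):
        if k >= len(levels):
            break
        nodes = levels[k]
        ids = [node[0] for node in nodes[lo:hi]]
        i = bisect_left(ids, word)  # binary search inside the successor range
        if i == len(ids) or ids[i] != word:
            break  # a longer context is not stored; keep the shorter match
        pos = lo + i
        best = (k + 1, nodes[pos][1])
        # Successors live between this node's pointer and the next node's.
        lo, hi = nodes[pos][2], nodes[pos + 1][2]
    return best


def bits_needed(count):
    """Minimum number of bits that can address `count` distinct values."""
    return max(1, (count - 1).bit_length())


ngrams = {
    ("is",): -1.0, ("one",): -1.2, ("of",): -0.9,
    ("is", "one"): -0.5, ("one", "of"): -0.3,
    ("is", "one", "of"): -0.1,
}
levels = build_levels(ngrams)
print(lookup(levels, ["is", "one", "of"]))   # (3, -0.1)
print(lookup(levels, ["was", "one", "of"]))  # (2, -0.3)
print(bits_needed(65536), bits_needed(150000))  # 16 18
```

The query for "is one of" first finds "of" among the unigrams, then "one" among the successors of "of", then "is" among the successors of that node. When the longest context is missing, the search stops with the shorter match already in hand, which is where a real decoder would apply backoff weights. The last line shows why a bit array helps: a 150 thousand word vocabulary needs 18 bits per word ID, not a full 32.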

Those ideas were carefully implemented in both sphinxbase and sphinx4, and now one can use a language model of unlimited vocabulary size with improved memory consumption and query speed. It is still possible to use ARPA and DMP models, but to reduce loading time, convert your model into the trie binary format using sphinx_lm_convert:

sphinx_lm_convert -i your_model.lm.(dmp) -o your_model.lm.bin

It is worth mentioning that while it is still possible to read the DMP format, the ability to generate DMP files has been removed.

A new generic 70k language model has also landed in trunk. Check out its performance and report how it works on your tasks. It is expected to boost recognition accuracy somewhat by decreasing the OOV rate. For example, in a lecture transcription task the OOV rate decreased by a factor of 2.
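If you want to check the OOV rate on your own data, it is simply the share of running words in a text that are missing from the vocabulary. Here is a minimal sketch; the function name `oov_rate` and the toy sentence and vocabularies are made up for illustration and have nothing to do with the actual 70k model.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of running words in `tokens` that are not in `vocabulary`."""
    if not tokens:
        raise ValueError("cannot compute an OOV rate for an empty text")
    vocab = set(vocabulary)
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


text = "the lecture covers hidden markov models".split()
small_vocab = {"the", "lecture", "covers", "models"}
large_vocab = small_vocab | {"hidden"}

print(oov_rate(text, small_vocab))  # 2 of 6 words are unknown
print(oov_rate(text, large_vocab))  # 1 of 6 words is unknown
```

In this toy case adding a single word to the vocabulary halves the OOV rate. On real transcripts you would tokenize and normalize the text the same way the language model vocabulary was prepared.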

Looking forward to your feedback: opinions, suggestions, bug reports. Say thanks to Vyacheslav Klimkov for the amazing implementation!

Virtual Assistants in Games

There is a lot of discussion today about where the very hot virtual assistant market will head. There are assistants to ask about the weather, assistants to ask about sports games, and running assistants. Home assistants help you turn off the TV and watch the temperature. None of those seem too attractive. "Okay, Google, why doesn't Siri talk to me anymore?"

One interesting application of speech recognition technology is games. It is much more fun to run through dark dungeons, casting light with something like "Ekto Lumeh" and calling for dragons.

Unlike real-world users, players in games find it much more natural to speak with virtual characters and even forgive some recognition mistakes. So it is definitely something that will be popular in the near future. In that sense it is interesting to consider In Verbis Virtus, a game created by the talented Italian studio Indomitus Games.

It is fun that the game is implemented using CMUSphinx; you can read about the implementation details here.

Current state of offline speech recognition and SmartTV

Recently there was a lot of buzz on the internet about the Samsung SmartTV speech recognition policy. The part of the user agreement stating that everything you say in a room can be recorded and sent to a third-party company (not even to Samsung) raises reasonable concerns among users. So the question arises: what is the current state of technology for recognizing speech fully offline, without the internet? And what can CMUSphinx offer here?

Besides the privacy concern, server-based speech recognition has other disadvantages, for example unpredictable response time. Not everyone is willing to wait 1-2 seconds while data is sent to the server and results are sent back. An immediate response is way more attractive.

Modern phones, TVs and embedded devices are not supercomputers. With ARM processors running at 1-2 GHz, they are comparable to the good old Intel Pentium II CPUs in general computing performance. Of course they have hardware extension modules for more resource-consuming tasks like decoding an HD video stream, but for general computing their performance is not that high. Such extensions are proprietary and not always easy to use. Energy consumption also puts very strong limits on possible computation. CPUs have multiple cores, but if you use them all you drain the battery faster, so it will not last a day. Because of that, it is hard to expect a major boost in CPU power in the coming years.

Modern large vocabulary speech recognition solutions, like the one that runs at Google, require about 2 GB of memory per decoding stream and access to a language model about a terabyte in size. For that reason it is not practical to run them on a phone. Several decoders have been designed to work in low-resource environments, among them:

  • Dragon NaturallySpeaking, the first continuous dictation system, which appeared in 1997. It ran successfully in 25 MB of memory and was able to use a vocabulary of 25 thousand words
  • The Dynaspeak engine from SRI. Up to 50k vocabulary with pretty modest hardware requirements

Well, almost every decoder from the 90s might be considered a decoder with low requirements now. However, the expectations of speech recognition systems have changed since then. A decoder is expected to handle a vocabulary of up to 200 thousand words and recognize natural language with high accuracy. So old decoders do not match expectations either.

Among recent publications on efficient speech recognition, one must note a publication from Google about an offline mobile speech recognizer:

Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices by Xin Lei, Andrew Senior, Alexander Gruenstein and Jeffrey Sorensen

The key features of this recognizer are:

  • Accurate DNN models which score input features on multiple cores
  • Quantization everywhere
  • Very clever compression of the language model

Those features allow the decoder to recognize user queries with a 50 thousand word vocabulary at a pretty low error rate of just 15%. This is very impressive accuracy, but it is still below what users expect. So a lot of work is still required to recognize speech on a mobile phone efficiently and without the internet, unless there are means to restrict the vocabulary and increase the accuracy.

As for Pocketsphinx, our main embedded speech recognition engine, a few features are missing. First of all, Pocketsphinx uses less accurate GMM models. However, one should note that those models do not require multi-core scoring like DNNs, so their energy consumption is way lower.

Pocketsphinx model compression is a bit worse than Google's, but not significantly. The data in our models is compressed to 16 bits, scores are compressed to 22 bits. We can fit within 30 MB per language.

One of the advantages of Pocketsphinx is its configurable vocabulary, so you can create the models you need. With proper configuration and tuning of the beams, Pocketsphinx can recognize a 10 thousand word vocabulary on a device in real time with an error rate of around 20%. With a smaller vocabulary the accuracy is significantly higher: you can expect just 3% errors when you recognize fewer than 100 words, which approaches a practical system. You can also use keyword activation mode with Pocketsphinx, which allows a modern phone to listen continuously for the whole day.
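To compare such figures on your own recordings, you need to measure the error rate yourself. The sketch below assumes the numbers quoted here are word error rates, which the text does not state explicitly; it computes WER as the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The function name `word_error_rate` and the sample command are illustrative only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("turn the light on", "turn light on"))  # 0.25
```

One dropped word out of four gives 25%. Note that WER can exceed 100% when the hypothesis contains many inserted words.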

So if you really need a system which recognizes a dozen commands without the internet, Pocketsphinx is a solution to consider. There is no need to send user speech anywhere or to include strange items in the user agreement. If you want a bigger vocabulary, there is still some work to do.