Recently there has been a lot of buzz on the internet about the Samsung SmartTV speech recognition policy. The part of the user agreement stating that everything you say in the room can be recorded and sent to a third-party company (not even to Samsung) raises reasonable concerns among users. So the question arises: what is the current state of technology for recognizing speech fully offline, without the internet, and what can CMUSphinx offer here?
Besides the privacy concern, server-based speech recognition has other disadvantages, for example unpredictable response time. Not everyone wants to wait 1-2 seconds while data is sent to the server and the results come back; an immediate response is far more attractive.
Modern phones, TVs and embedded devices are not supercomputers; with ARM processors running at 1-2 GHz they are comparable to the good old Intel Pentium II CPUs in general computing performance. Of course they have hardware extension modules for more resource-consuming tasks such as decoding an HD video stream, but those extensions are proprietary and not always easy to use, and for general computing their performance is not that high. Energy consumption also puts strong limits on what can be computed: the CPUs have multiple cores, but if you use them all you drain the battery faster and it will not last a day. For these reasons it is hard to expect a major boost in available CPU power in the coming years.
Modern large-vocabulary speech recognition solutions like the one that runs at Google require about 2 GB of memory per decoding stream and access to a language model a terabyte in size, so it is not practical to run them on a phone. Several decoders have been designed to work in low-resource environments, among them:
Well, almost every decoder from the 90s could be considered low-resource now. However, the expectations for a speech recognition system have changed since then: a decoder is now expected to handle vocabularies of up to 200 thousand words and recognize natural language with high accuracy. So old decoders do not match expectations either.
Among recent publications on efficient speech recognition, one should note a paper from Google about an offline mobile speech recognizer:
The key features of this recognizer are:
Those features allow the decoder to recognize user queries over a 50-thousand-word vocabulary with a fairly low error rate of about 15%. This is impressive accuracy, but still below what users expect, so a lot of work is still required to recognize speech on a mobile phone efficiently and without the internet, unless there is a way to restrict the vocabulary and thereby increase the accuracy.
As for Pocketsphinx, our main embedded speech recognition engine, a few features are still missing. First of all, pocketsphinx uses less accurate GMM models. However, one should note that those models do not require multi-core scoring like DNNs, so their energy consumption is much lower.
Pocketsphinx model compression is a bit worse than Google's, but not significantly so. The data in our models is compressed to 16 bits and scores are compressed to 22 bits, so we can fit within about 30 MB per language.
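As a rough illustration of how this kind of quantization keeps models small (a sketch of the idea only, not the actual pocketsphinx model format), consider compressing Gaussian mean values from 32-bit floats to 16-bit integers:

```python
import numpy as np

# Illustrative only: this is not the actual pocketsphinx file format,
# just the general idea of quantizing model parameters to 16 bits.
means = np.random.randn(5000, 39).astype(np.float32)    # float32 parameters

lo, hi = float(means.min()), float(means.max())
scale = (hi - lo) / 65535.0                              # 2**16 - 1 levels
quantized = np.round((means - lo) / scale).astype(np.uint16)

# At load time the values are reconstructed approximately:
restored = quantized.astype(np.float32) * scale + lo
print("max quantization error:", float(np.abs(restored - means).max()))
print("size reduction: %.1fx" % (means.nbytes / quantized.nbytes))
```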
One of the advantages of Pocketsphinx is its configurable vocabulary, so you can create exactly the models you need. With proper configuration and tuning of the beams, pocketsphinx can recognize a 10-thousand-word vocabulary on a device in real time with an error rate around 20%. With a smaller vocabulary the accuracy is significantly higher: you can expect just 3% errors when recognizing fewer than 100 words, which approaches a practical system. You can also use keyword activation mode with pocketsphinx, which allows a modern phone to listen continuously for a whole day.
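For example, keyword activation can be set up with the pocketsphinx Python bindings in a few lines. This is a minimal sketch assuming the default en-us model shipped with the bindings; the keyphrase and the detection threshold are placeholders you would tune for your own application:

```python
from pocketsphinx import LiveSpeech

# Keyword activation: listen continuously for one keyphrase instead of
# decoding against a full language model. The phrase and threshold are
# placeholders; the threshold trades off misses against false alarms.
speech = LiveSpeech(lm=False, keyphrase='hey computer', kws_threshold=1e-20)
for phrase in speech:
    print('keyphrase detected:', phrase.segments(detailed=True))
```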
So if you need a system that recognizes a dozen commands without the internet, pocketsphinx is a solution to consider. There is no need to send user speech anywhere or to include strange items in the user agreement. If you want a bigger vocabulary, there is still some work to do.
To ensure the best quality it is critical to build a critical mass of projects using CMUSphinx. For that reason we are pleased to present the newly announced ILA Voice Assistant, which is interesting for several reasons.
First, it uses Sphinx4 for speech recognition; second, it can learn while you interact with it. This is an interesting idea that many developers have tried to implement, but it is not as trivial as it may seem. In our opinion, ILA has come closer than most here. Using the latest Sphinx4 features, ILA can learn new commands from you, and it also learns your voice while you speak with it, improving recognition accuracy.
ILA is designed to run on your desktop or media center PC and integrates into your home environment with ease. ILA can search the web for you, find locations, get directions, start timers, read news headlines, open programs, execute system commands and much more! It's especially fun when used with a Bluetooth headset or a microphone array, so you can move freely around your home while talking to ILA. The nice thing is that everything works offline, so your data remains on your own machine and is not sold to anyone else, which is rare these days.
ILA is updated frequently, with a long list of features implemented in every release, and it is exciting to track its progress. Not many voice assistant projects make it past two releases or go beyond simple command-and-control; we know numerous examples where development stopped after a month of work without reaching critical quality. Hopefully ILA will be different.
These days speech recognition technology is in the focus of many large corporations like Google, Baidu and Microsoft. It is an exciting time of shifting paradigms and approaches, which leads to quite significant improvements in the accuracy and stability of the technology. Many ideas that seemed fundamental are now being questioned; for example, modern research tries to replace Hidden Markov Models with recurrent neural networks and more complex structures like long short-term memory (LSTM) networks. There is a lot of marketing here too: LSTM papers often report that such networks significantly improve frame classification accuracy, which was never the quantity to optimize in an ASR system, because the word-error-rate results of LSTM networks are not as impressive as their frame accuracy results.
One idea that has been quite successfully challenged is the use of feature context in a speech recognition system. MFCCs with delta-deltas usually use just 7-9 frames around the current frame, while modern deep neural network (DNN) classifiers can successfully use up to 30 frames. That certainly improves recognition accuracy for DNN systems. One important point is that we still use breadth-first search in our decoders, so this context is only used in classification, not in search, which is a major drawback of modern systems.
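A small numpy sketch of the difference (dimensions are illustrative): delta/delta-delta features effectively see a narrow window around the current frame, while a DNN input is typically a stack of many neighboring frames:

```python
import numpy as np

mfcc = np.random.randn(1000, 13)       # 1000 frames of 13-dimensional MFCCs

def stack_context(features, left, right):
    """Concatenate each frame with its `left` and `right` neighbors."""
    padded = np.pad(features, ((left, right), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(features)]
                      for i in range(left + right + 1)])

narrow = stack_context(mfcc, 4, 4)     # ~9-frame context, 117-dim input
wide = stack_context(mfcc, 15, 15)     # ~31-frame context, 403-dim input
print(narrow.shape, wide.shape)
```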
If we consider best-path search in a graph, there are two approaches: breadth-first search, where we first explore all branches coming out of a node, and depth-first search, where we keep following a single path up to some point without looking at the other branches. Modern decoders mostly use breadth-first search (BFS), which in many cases is a suboptimal approach.
For example, if there is noise in one frame and the next frame is clean, we cannot recover, because the correct path has already been pruned on the noisy frame. Only by looking at several frames at once can we figure out which path is correct and which is not.
Another advantage of depth-first search is speed. By using a larger context we can quickly reduce the hypothesis space; for example, by looking at the 3-4 following phonemes we can cut down the number of words to search.
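As a toy illustration (hypothetical data, not decoder code): if a fast first pass predicts the next couple of phones, only the words consistent with that prefix need to be expanded in the detailed search:

```python
# Toy example: a phone prefix predicted by a fast first pass prunes the
# set of word candidates before any detailed acoustic scoring is done.
lexicon = {
    'hello':   ['HH', 'AH', 'L', 'OW'],
    'hunting': ['HH', 'AH', 'N', 'T', 'IH', 'NG'],
    'world':   ['W', 'ER', 'L', 'D'],
    'weather': ['W', 'EH', 'DH', 'ER'],
}

def lookahead_filter(predicted_phones, lexicon):
    """Keep only the words whose pronunciation starts with the predicted phones."""
    n = len(predicted_phones)
    return [word for word, pron in lexicon.items() if pron[:n] == predicted_phones]

print(lookahead_filter(['HH', 'AH'], lexicon))   # ['hello', 'hunting']
```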
Depth-first search was quite popular in the 90s; many decoders of that time used it, such as the Dragon decoder, the IBM stack decoder and the Noway decoder. Unfortunately, with the introduction of the WFST framework, those decoders declined.
The WFST framework is also an interesting case to consider here. The problem with triphone models is that they explode when we consider cross-word transitions: we do not know the following phone, so we have to consider all possible variants and every possible history, which grows the search space significantly. Previously this problem was solved in different ways; for example, developers used a multipass search without cross-word triphones first and then rescored with cross-word triphones (the pocketsphinx approach). This problem is discussed thoroughly in the Dynamic Network Decoding Revisited paper by H. Soltau and G. Saon.
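A back-of-the-envelope sketch of the explosion, using illustrative numbers (about 40 phones and a 100k-word vocabulary):

```python
# Rough arithmetic, illustrative numbers only: why cross-word triphones
# blow up the search network if nothing is shared or compressed.
phones = 40          # size of the phone set
vocab = 100_000      # vocabulary size

# Inside a word the left and right contexts are known from the pronunciation,
# so each phone position needs exactly one triphone model. At a word boundary
# the right context of the last phone (and the left context of the first phone)
# is the unknown neighboring word, so each boundary position must be
# instantiated for every possible context phone.
word_final_copies = vocab * phones
word_initial_copies = vocab * phones
print("extra word-boundary triphone instances:",
      word_final_copies + word_initial_copies)   # 8,000,000
```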
Graph reduction operations solve a very specific problem: they improve decoding speed by properly compressing cross-word contexts. With a free implementation in the OpenFST toolkit, this method became very popular. It improved decoding performance two- to three-fold, but it also has disadvantages. Due to the strict and simplistic WFST formalism it is not easy to perform more complex searches or to integrate more complex models into the search, for example a hierarchical language model. It is also quite memory-intensive, because the precompiled WFST graph has to reside in system memory during decoding. Attempts to overcome these WFST restrictions continue; one possible approach is dynamic context compression in the decoder, inspired by WFST ideas, which still requires recombination of word labels and careful tracking of context. And once we consider simple things like noises and paralinguistic fillers, the system becomes much more complex to implement in the WFST framework.
If we step outside the restrictions of breadth-first search, another solution appears. Just by looking at the following phones we can greatly reduce the number of hypotheses to consider during cross-word transitions, so complex WFST compression is no longer needed. We only need to tolerate a small delay in recognition. The acceptable delay can be derived from human response expectations: about 0.2 to 0.5 seconds is fine, which lets us look up to 50 frames ahead. This is an old idea, used in the 90s and partly supported in pocketsphinx: a phone-loop fast match. We first decode the audio with a very fast phonetic decoder, and only after a delay do we start the main large-vocabulary search. This way we not only shrink the word-transition search space but also improve decoding inside the lextree, because we can predict the following phones efficiently. For this reason we consider phone-loop search a very important feature of the decoder. Such ideas have long been advocated by Dr. James Baker and by Dr. Tony Robinson from CantabResearch.
Recently we improved the fast match in sphinx4 and fixed fast-match issues in pocketsphinx. As of today, phone-loop fast match is enabled by default in both sphinx4 and pocketsphinx. It greatly improves decoding speed, and combined with PTM models you can decode large-vocabulary speech on a desktop in real time with sphinx4 and with high accuracy. That is a pretty big improvement. Please check out either pocketsphinx or sphinx4, try it, and let us know what you think.
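A minimal sketch with the pocketsphinx Python bindings, assuming the default en-us model shipped with the bindings. The phone-loop fast match is on by default; the lookahead window option shown below (pl_window) is only an assumption about which knob to experiment with in your build, and the value is just a starting point:

```python
import os
from pocketsphinx import Pocketsphinx, get_model_path, get_data_path

model_path = get_model_path()
data_path = get_data_path()

# Phone-loop fast match is enabled by default in recent builds; pl_window
# (the phoneme lookahead window, in frames) is the parameter to experiment
# with if your build exposes it.
ps = Pocketsphinx(
    hmm=os.path.join(model_path, 'en-us'),
    lm=os.path.join(model_path, 'en-us.lm.bin'),
    dict=os.path.join(model_path, 'cmudict-en-us.dict'),
    pl_window=10,
)
ps.decode(audio_file=os.path.join(data_path, 'goforward.raw'))  # 16 kHz, 16-bit mono
print(ps.hypothesis())
```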
We are pleased to announce that we have just released two new noise-robust acoustic models for US English. Trained on a large amount of speech data, they take CMUSphinx accuracy and robustness to a new level.
You can download the new models from the downloads section.
Two models have been released: a traditional continuous model with 5000 senones and 32 mixtures, and a new PTM model with 5000 senones and 128 mixtures. The PTM model deserves some attention because it provides a great balance between decoding speed, accuracy and model size. We have recently added support for PTM models in sphinx4, so you can already use this PTM model with sphinx4 trunk and get decent decoding results.
The difference between PTM, semi-continuous and continuous models is the following. We use a mixture of Gaussians to compute the score of each frame; the difference is in how that mixture is built. In a continuous model every senone has its own set of Gaussians, so the total number of Gaussians in the model is about 150 thousand, which is too many to evaluate efficiently. In a semi-continuous model we have just about 700 Gaussians, far fewer than in a continuous model, and each senone reuses them with different mixture weights to score the frame. Due to the smaller number of Gaussians, semi-continuous models are fast. PTM models are the golden middle: they use about 5000 Gaussians, providing better accuracy than semi-continuous models while remaining significantly faster than continuous models, so they can be used on mobile devices.
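A rough numpy sketch of what the tying means for per-frame computation (dimensions are illustrative): in the semi-continuous and PTM cases the Gaussian densities are computed once for a shared codebook and then reused by every senone with its own mixture weights:

```python
import numpy as np

dim = 39
x = np.random.randn(dim)                       # one acoustic feature frame

def log_gauss(x, means, variances):
    """Diagonal-covariance Gaussian log densities of frame x (one per row)."""
    return -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=-1)
                   + np.sum((x - means) ** 2 / variances, axis=-1))

# Continuous: 5000 senones x 32 Gaussians each -> ~160k densities per frame.
# Semi-continuous: one shared codebook of ~700 Gaussians -> 700 densities.
# PTM: one codebook per tied phone class, e.g. ~40 x 128 -> ~5000 densities.
codebook_means = np.random.randn(700, dim)     # semi-continuous style codebook
codebook_vars = np.ones((700, dim))
densities = log_gauss(x, codebook_means, codebook_vars)

# Every senone reuses the same densities with its own mixture weights:
weights = np.random.dirichlet(np.ones(700), size=5000)        # 5000 senones
senone_scores = np.logaddexp.reduce(densities + np.log(weights), axis=-1)
print(senone_scores.shape)   # (5000,) scores from only 700 Gaussian evaluations
```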
So far the PTM model shows very good results: it decodes with almost continuous-model accuracy while running about 5 times faster. With the new PTM model you can decode speech over a 60k-word vocabulary in real time in Java with sphinx4, and up to 5000 words on a mobile phone in real time. We consider this model an important direction of development, so all our future models will use this format. Of course PTM cannot match the best results of deep neural networks yet, but it is considerably faster, and we are doing research to match DNN performance while keeping this impressive speed and model size.
That is good news, but we are also going to release a model for a new language soon. Guess which one.