These days speech recognition technology is in the focus of many large corporations like Google, Baidu and Microsoft. It is an exciting time of shifting paradigms and approaches which leads to quite significant improvements in the accuracy and stability of the technology. Many ideas which seemed fundamental are now being questioned; for example, modern research tries to replace Hidden Markov Models with recurrent neural networks and more complex structures like long short-term memory (LSTM) networks. There is a lot of marketing here too: LSTM papers often report that such networks significantly improve frame classification accuracy, which was never the parameter to optimize in an ASR system, while the word error rate results of LSTM networks are not as impressive as the frame accuracy results.
One idea that has been quite successfully challenged is the use of feature context in a speech recognition system. MFCC with delta-deltas usually covers just 7-9 frames around the current frame, while modern deep neural network (DNN) classifiers can successfully use up to 30 frames. That certainly improves the accuracy of speech recognition for DNN systems. One important thing here is that we still use breadth-first search in our decoders, thus the context is only used in classification, not in search. This is a major drawback of modern systems.
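As a rough illustration of what such a context window means in code, here is a minimal frame splicing sketch in Java (class and method names are invented for illustration, this is not CMUSphinx code):

// Minimal sketch of frame splicing: each feature frame is concatenated
// with `context` neighbouring frames on both sides so a classifier can
// look at a wider window. Edge frames are repeated at utterance boundaries.
public class FrameSplicer {

    public static float[][] splice(float[][] frames, int context) {
        int n = frames.length;
        int dim = frames[0].length;
        float[][] spliced = new float[n][(2 * context + 1) * dim];
        for (int t = 0; t < n; t++) {
            for (int c = -context; c <= context; c++) {
                int src = Math.min(Math.max(t + c, 0), n - 1); // clamp at the edges
                System.arraycopy(frames[src], 0, spliced[t], (c + context) * dim, dim);
            }
        }
        return spliced;
    }
}

With context = 15 each spliced vector covers about 30 frames, compared to the 7-9 frames effectively covered by delta-deltas.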
If we consider best path search in a graph, there are two approaches: breadth-first search, where we first explore all branches coming out of a node, and depth-first search, where we follow a single path up to some point without looking at the other branches. Modern decoders mostly use breadth-first search (BFS), and in many cases that is a suboptimal approach.
For example, if we have some noise in a frame and the next frame is correct, we cannot recover because the correct path was already pruned on the noisy frame. There is no way to bring it back. Only by looking at several frames at once can we figure out which path is correct and which is not.
Another advantage of depth-first search is speed. By using a larger context we can quickly reduce the hypothesis space; for example, by looking at the 3-4 following phonemes we can sharply reduce the number of words to search.
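To make this concrete, here is a toy sketch in Java (invented names, not decoder code) of how knowing the next few phonemes lets us drop most words before searching them:

import java.util.ArrayList;
import java.util.List;

// Toy illustration: keep only the words whose pronunciations start with
// the phonemes predicted for the next few frames.
public class PhoneLookaheadFilter {

    public static List<String[]> filter(List<String[]> pronunciations,
                                        List<String> predictedPhones) {
        List<String[]> survivors = new ArrayList<>();
        for (String[] phones : pronunciations) {
            int depth = Math.min(phones.length, predictedPhones.size());
            boolean matches = true;
            for (int i = 0; i < depth; i++) {
                if (!phones[i].equals(predictedPhones.get(i))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                survivors.add(phones);
            }
        }
        return survivors;
    }
}

With a phone set of roughly 40 phones, matching even 3 predicted phones cuts the candidate word list dramatically; a real decoder would of course allow several phone alternatives per position.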
Depth-first search was quite popular back in the 90s; many decoders of that time used it, like the Dragon decoder, the IBM Stack decoder or the Noway decoder. Unfortunately, with the introduction of the WFST framework, those decoders declined.
The WFST framework is also an interesting case to consider here. The problem with triphone models is that they explode when we consider cross-word transitions. We do not know the following phone, so we have to consider all possible right contexts and every possible history; with a phone set of roughly 40 phones, every word-final triphone has to be expanded for every possible right context, which grows the search space significantly. Before WFST this problem was solved in different ways; for example, developers first ran a multipass search without cross-word triphones and then rescored with cross-word triphones (the pocketsphinx approach). This problem is well covered in the paper "Dynamic Network Decoding Revisited" by H. Soltau and G. Saon.
By using graph reduction operations, WFST solves a very specific problem: it improves decoding speed by properly compressing cross-word contexts. With the free implementation in the OpenFST toolkit this method became very popular. It improved decoding speed two to three times, but it also has disadvantages. Due to the strict and simplistic formalism of WFST it is not easy to perform more complex searches or to integrate more complex models into the search, for example a hierarchical language model. It is also quite memory-intensive, because the precompiled WFST graph has to reside in system memory during decoding. Attempts to overcome WFST restrictions continue these days; one possible approach is dynamic context compression in the decoder inspired by WFST ideas, which still requires recombination of word labels and careful tracking of context. And if we consider simple things like noises and paralinguistic fillers, they become much harder to implement in the WFST framework.
If we step out of the restrictions of breadth-first search we get another solution. Just by looking at the following phones we can greatly reduce the number of hypotheses to consider during cross-word transitions, and then we do not need complex WFST compression anymore. We only need to tolerate a small delay in recognition. The acceptable delay can be derived from human response expectations: about 0.2 to 0.5 seconds is fine, so we can look up to 50 frames ahead. This is an old idea which was used in the 90s and was partially supported in pocketsphinx: a phone-loop fast match. We first decode the audio with a very fast phonetic decoder and only after a delay start the main large-vocabulary search. This way we not only reduce the word transition search space but also improve decoding inside the lextree, because we can predict the following phones efficiently. Because of that we consider phone-loop search a quite important feature of the decoder. Such ideas have long been advocated by Dr. James Baker and by Dr. Tony Robinson from CantabResearch.
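Conceptually the fast match pass can be pictured like this (a sketch with invented names, not the actual pocketsphinx or sphinx4 code): a cheap phonetic decoder scores every phone per frame, and the main search, running a lookahead's worth of frames behind, only enters words whose first phone scored within a beam of the best phone somewhere in that window:

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch of a phone-loop fast match; names are invented.
public class PhoneLoopFastMatch {

    private final int lookahead;     // how many frames ahead the phonetic pass runs
    private final double phoneBeam;  // log-score margin relative to the best phone

    public PhoneLoopFastMatch(int lookahead, double phoneBeam) {
        this.lookahead = lookahead;
        this.phoneBeam = phoneBeam;
    }

    // Phones that scored within the beam of the best phone anywhere in the
    // window [frame, frame + lookahead). phoneScores.get(t) maps each phone
    // to its log score at frame t.
    public Set<String> activePhones(List<Map<String, Double>> phoneScores, int frame) {
        Set<String> active = new HashSet<>();
        int end = Math.min(frame + lookahead, phoneScores.size());
        for (int t = frame; t < end; t++) {
            double best = Double.NEGATIVE_INFINITY;
            for (double score : phoneScores.get(t).values()) {
                best = Math.max(best, score);
            }
            for (Map.Entry<String, Double> e : phoneScores.get(t).entrySet()) {
                if (e.getValue() >= best - phoneBeam) {
                    active.add(e.getKey());
                }
            }
        }
        return active;
    }

    // The main search only expands a word whose first phone is active.
    public boolean allowWordEntry(String firstPhone, Set<String> activePhones) {
        return activePhones.contains(firstPhone);
    }
}

With 100 frames per second, a lookahead of 20 to 50 frames stays within the 0.2-0.5 second delay budget mentioned above.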
Recently we improved the fast match in sphinx4 and fixed fast-match issues in pocketsphinx. As of today the phone-loop fast match is enabled by default in both sphinx4 and pocketsphinx. It greatly improves decoding speed, and combined with PTM models you can decode large-vocabulary speech on a desktop in real time with sphinx4 with high accuracy. That is a pretty big improvement. Please check out either pocketsphinx or sphinx4, try it and let us know what you think.
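For reference, a minimal sphinx4 transcription setup looks roughly like the snippet below; the model, dictionary and language model paths are placeholders for the files you download, exact package names may differ between sphinx4 versions, and nothing special has to be configured for the fast match since it is now on by default:

import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Replace these placeholder paths with your acoustic model,
        // dictionary and language model.
        configuration.setAcousticModelPath("path/to/en-us-acoustic-model");
        configuration.setDictionaryPath("path/to/cmudict-en-us.dict");
        configuration.setLanguageModelPath("path/to/en-us.lm");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream("test.wav");
        recognizer.startRecognition(stream);

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.println(result.getHypothesis());
        }
        recognizer.stopRecognition();
    }
}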
We are pleased to announce that we have just released two new noise-robust acoustic models for US English. Trained on a large amount of speech data, they advance CMUSphinx accuracy and robustness to a new level.
You can download the new models from the downloads section.
Two models have been released: a traditional continuous model with 5000 senones and 32 mixtures, and a new PTM model with 5000 senones and 128 mixtures. The PTM model is worth some attention because it provides a great balance between decoding speed, accuracy and model size. We have recently added support for PTM models in sphinx4, so you can already use this PTM model with sphinx4 trunk and get a decent decoding result.
The difference between PTM, semi-continuous and continuous models is the following. We use a mixture of Gaussians to compute the score of each frame; the difference is how we build that mixture. In a continuous model every senone has its own set of Gaussians, so the total number of Gaussians in the model is about 150 thousand. That is too many to compute the mixture efficiently. In a semi-continuous model we have just about 700 Gaussians, far fewer than in a continuous model, and we only combine them with different mixture weights to score the frame. Due to the smaller number of Gaussians, semi-continuous models are fast. PTM models are a golden middle here: a PTM model uses about 5000 Gaussians shared across the senones of the same phone, with only the mixture weights being senone-specific. This provides better accuracy than semi-continuous, yet it is still significantly faster than a continuous model and thus can be used on mobile devices.
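To illustrate the difference, here is a simplified senone scoring sketch in Java (invented class names, diagonal covariances, no log-domain tricks): the score is a weighted sum of Gaussian densities from a pool, and the model types differ only in how large the pool is and how it is shared between senones.

// Simplified senone scoring against a pool of diagonal-covariance Gaussians.
// In a continuous model each senone owns its pool (~150k Gaussians in total),
// in a semi-continuous model a single pool (~700) is shared by all senones,
// and in a PTM model the senones of one phone share a pool (~5000 in total);
// only the mixture weights are senone-specific.
public class SenoneScorer {

    // Log density of one diagonal-covariance Gaussian.
    static double logGaussian(double[] x, double[] mean, double[] var) {
        double logDensity = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - mean[i];
            logDensity -= 0.5 * (Math.log(2 * Math.PI * var[i]) + d * d / var[i]);
        }
        return logDensity;
    }

    // Senone score: weighted sum over the pool it is tied to. A real decoder
    // works in the log domain with log-add to avoid underflow.
    static double senoneScore(double[] frame, double[][] poolMeans,
                              double[][] poolVars, double[] mixtureWeights) {
        double sum = 0.0;
        for (int g = 0; g < mixtureWeights.length; g++) {
            sum += mixtureWeights[g] * Math.exp(logGaussian(frame, poolMeans[g], poolVars[g]));
        }
        return Math.log(sum);
    }
}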
So far the PTM model demonstrates very good results: it decodes with almost the accuracy of the continuous model and works about 5 times faster. With the new PTM model you can decode speech with a 60k-word vocabulary in real time in Java with sphinx4, and you can decode up to 5000 words in real time on a mobile phone. We consider this model format an important direction of development, so all our future models will use it. Of course PTM cannot match the best results of deep neural networks yet, but it is significantly faster, and we are doing research to match DNN accuracy while keeping the impressive speed and model size.
That is good news, and we are going to release a model for a new language soon; guess which one.
Recently Professor Rudnicky updated CMUDict with the latest changes, and now we have the cmudict-0.7b version, which you are welcome to check out from subversion and use in your applications. CMUDict, the phonetic dictionary for US English, has been one of the major components of the CMUSphinx toolkit. It has a long history as a unique resource for English pronunciation and is used by many other speech projects, both commercial and open source.
There are a few things in CMUDict which we would love to improve; they would have a huge impact on speech recognition research and on speech recognition technology overall:
Phonemic vs phonetic
CMUDict is originally a phonetic dictionary as opposed to a phonemic one. It contains approximations of word pronunciations; it describes how a native US speaker would pronounce the word in read speech. In other conditions even a native US speaker would pronounce words differently. For example
uniform Y UW1 N AH0 F AO2 R M
with AH0 in the middle is already a reduced form of the
uniform Y UW1 N IH0 F AO2 R M
which the speaker would say if pronouncing the word slowly. On the other hand, the dictionary doesn't have the form
uniform Y UW1 N AH0 F AH0 R M
which a native speaker would use in fast conversational speech. There are many cases like that. Because of this structure the dictionary is ready to use in a speech recognition system, but it is hard to conduct research on real phonetic reduction in various contexts simply because the dictionary often doesn't have the original form. In modern systems, where phonetic reduction becomes more important, we need more information about it in the dictionary. Hopefully, one day we will be able to collect both the original phonemic representation of words and their possible phonetic realizations.
Newly appeared words
Words which appeared in the last few years and are now widely used are often missing in CMUDict: "skype", "ipad", "spotify"; there are so many important entries to add. Well, "spotify" has to be added to my spellchecker first. Hopefully we can keep the dictionary updated at a faster rate. A reasonable estimate of the required size of the dictionary is about 200-500 thousand US English words, so the dictionary has to roughly double in size. That is a lot of work to review.
Word origins and morphology
There are many research projects on modeling word pronunciation automatically. Still, for CMUDict the symbol error rate of such models is about 8%, which results in a word error rate of about 30%. Sadly, they often try to model words as black boxes without any attempt to add some sense to them. It matters a lot in what context a word is used and what its origin is: is it a surname, an abbreviation, a geographical term or a scientific term? Such information could greatly improve the quality of the dictionary and the accuracy of the prediction.
Other languages
There is growing interest in supporting other languages in the CMUSphinx toolkit: Spanish, French, British English. One of the serious problems is that we still lack a lot of data for them, and the dictionary is one big issue. Hopefully, we will be able to build an initial approximation of the dictionary with rule-based systems for at least some of the languages. Such data would enable research on multi-language and language-independent speech recognizers and would greatly benefit the speech recognition toolkits.
Automatic dictionary acquisition
This is still an emerging technology; however, there are already some advances in automatic dictionary collection with software by LIUM. One can imagine a tool which scans through audio, learns the words it encounters and generates pronunciations for them to add to the dictionary. Hopefully, such tools are not the distant future of speech recognition.
So there is a huge space for improvement in CMUDict alone, which is very important for speech recognition research regardless of the toolkit or recognition implementation. For that reason it is worth noting that CMUDict is also available on github, so you are welcome to clone the repository, make your changes and submit a pull request; that would be very much appreciated!
One of the main problems with existing open source speech recognition systems is that they are not really designed to be used in end-user software. They are mainly research projects created by universities and intended to support new research. They make it easy to quickly add new features and get the best results in various evaluations.
End-user software doesn't work like that: you might not need to demonstrate the best accuracy, but you do need to match user expectations. For example, a user expects to get a reasonable result even when speaking far from the microphone or whispering. No modern system can recognize whisper reliably, hence mismatched expectations, hence complaints. A lot of work is required to solve all the problems like this.
However, since many commercial companies have promoted speech recognition to end users, open source software has also got a chance. We can build software for the mass market and match commercial solutions in terms of accuracy and robustness. An important step here is to gain the audience's attention: instead of software for geeks we need to become software for everybody, which is a very hard problem to solve. It's great that there are projects with big ambitions here, in particular from the Mozilla Foundation.
Recently the Mozilla Foundation announced a project to support the WebSpeech HTML API in their browsers. Celebrating 10 years of Firefox development, Mozilla CTO Andreas Gal announced this and many other features coming to the Mozilla codebase. During a Google Summer of Code project by Andre Natal a base system was implemented, and Andre continues to work on the project. You can get some idea of where it is now and how it is developing from the following post. So we will probably see a speech interface in the Firefox browser and in Firefox OS pretty soon.
One of the main issues in wide adoption of speech interfaces is support for both big and small languages. Firefox considers this an important direction of development, in particular support for Indian languages. I hope we are going to see a lot of progress here.