Microsoft has traditionally had very good speech recognition technology, and the recently announced speech assistant Cortana is one of the best assistants available. However, it might lack support for your native language or simply not behave the way you expect (Siri, too, still doesn't support many languages).
Thanks to the wonderful work of Toine de Boer, you can now enjoy Pocketsphinx on the Windows Phone platform. It is as straightforward as on Android: just download the project from our GitHub at http://github.com/cmusphinx/pocketsphinx-wp-demo, import it into Visual Studio and run it on the phone. You get all the features of CMUSphinx on Windows Phone: continuous hands-free operation, switchable grammars, and support for custom acoustic and language models. There is no need to pause a game while waiting for speech recognition input. We hope this opens up possibilities for great new applications.
The demo continuously listens for the keyphrase "oh mighty computer" and, once the keyphrase is detected, switches to grammar mode to let you input some information. Let us know how it works for you.
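The Windows Phone demo itself is written against the platform's native API, but the same keyphrase-plus-grammar pattern can be sketched with the pocketsphinx Python bindings. The model paths, grammar file name and threshold below are placeholders, and the search-switching calls (set_jsgf_file, set_search) assume the current 5prealpha API, so treat this as an illustration rather than code taken from the demo:

from pocketsphinx.pocketsphinx import Decoder

# Placeholder model paths -- substitute your own acoustic model and dictionary.
config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')
config.set_string('-dict', 'model/cmudict-en-us.dict')
config.set_string('-keyphrase', 'oh mighty computer')
config.set_float('-kws_threshold', 1e-20)   # detection threshold, tune for your keyphrase
decoder = Decoder(config)

# Register a JSGF grammar as a second search and remember the name of
# the default keyphrase search so we can tell which mode we are in.
decoder.set_jsgf_file('commands', 'commands.gram')
keyphrase_search = decoder.get_search()

decoder.start_utt()
with open('microphone.raw', 'rb') as stream:   # 16 kHz, 16-bit mono audio
    while True:
        buf = stream.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
        if decoder.get_search() == keyphrase_search and decoder.hyp() is not None:
            # Keyphrase detected: restart the utterance in grammar mode.
            decoder.end_utt()
            decoder.set_search('commands')
            decoder.start_utt()
decoder.end_utt()
if decoder.hyp() is not None:
    print('Command:', decoder.hyp().hypstr)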
The Python programming language has been gaining remarkable popularity recently due to the elegance of the language, its wide range of tools for scientific computing, including scipy and NLTK, and the immediacy of a "scripting"-style language. We often get requests to explain how to decode with pocketsphinx and Python.
Another interesting activity around CMUSphinx is an updated acoustic model for the German language. Frequent updates are posted on the Voxforge website by Guenter; please check his new, much-improved German models here: http://goofy.zamia.org/voxforge/de/. With new improvements like the audio aligner tool you can build a very accurate model for almost any language in a week or so.
Bringing these new features together, one of our users, Matthias, provided a nice tutorial on getting started with Pocketsphinx, Python and the German models. With the new SWIG-based API we have increased the decoder functionality available in Python; you can now do almost everything from Python that you can do from C. If you are interested, please check his blog post here:
https://mattze96.safe-ws.de/blog/?p=640
If you have issues with the Python API or want to help with your language, let us know.
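For a quick taste of the SWIG-based API, here is a minimal decoding sketch. The model and file paths are placeholders to be replaced with the acoustic model, language model and dictionary you actually use (for example the German models mentioned above), and the import line may differ slightly depending on how the bindings were installed:

from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths -- point these at your own models, e.g. the German ones above.
config = Decoder.default_config()
config.set_string('-hmm', 'model/de/acoustic-model')
config.set_string('-lm', 'model/de/language-model.lm')
config.set_string('-dict', 'model/de/pronunciation-dictionary.dic')
decoder = Decoder(config)

# Feed raw 16 kHz, 16-bit mono audio to the decoder in small chunks.
decoder.start_utt()
with open('recording.raw', 'rb') as stream:
    while True:
        buf = stream.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
if hyp is not None:
    print('Recognized:', hyp.hypstr)
    # Word-level segmentation is available too, much like in C.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame, seg.end_frame)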
After three years of development we have finally merged an aligner for long audio files into trunk. The aligner takes an audio file and the corresponding text and dumps timestamps for every word in the audio. This functionality is useful for processing transcribed files such as podcasts, with further applications like better support for audio editing or automatic subtitle synchronization. Another important application is acoustic model training: with this new feature you can easily collect databases of thousands of hours for your native language from data on the Internet such as news broadcasts, podcasts and audio books. We therefore expect the list of supported languages to grow very quickly.
To access the new feature, check out sphinx4 from Subversion or from our new repository on GitHub at http://github.com/cmusphinx/sphinx4 and build the code with Maven using "mvn install".
For the best accuracy, download the En-US generic acoustic model from the downloads section, as well as the g2p model for US English.
Then run the alignment:

java -cp sphinx4-samples/target/sphinx4-samples-1.0-SNAPSHOT-jar-with-dependencies.jar \
    edu.cmu.sphinx.demo.aligner.AlignerDemo file.wav file.txt en-us-generic \
    cmudict-5prealpha.dict cmudict-5prealpha.fst.ser
The result will look like this:
+ of [10110:10180] there [11470:11580] are [11670:11710] - missing
Here + denotes an inserted word and - a missing word. The numbers are times in milliseconds.
Please remember that the input file must be 16 kHz, 16-bit mono. The text must be preprocessed; the algorithm doesn't handle numbers yet.
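If you want to post-process the aligner output, for instance to produce subtitles, a small parser is enough. The sketch below is not part of sphinx4; it simply assumes the format shown above, i.e. word tokens followed by optional [start:end] millisecond spans, with + and - marking inserted and missing words:

def parse_alignment(line):
    """Turn one aligner output line into (word, start_ms, end_ms, status) tuples."""
    results = []
    status = 'aligned'
    for token in line.split():
        if token == '+':
            status = 'inserted'
        elif token == '-':
            status = 'missing'
        elif token.startswith('[') and token.endswith(']'):
            # Timestamp for the most recent word, e.g. [10110:10180].
            start, end = token[1:-1].split(':')
            word, _, _, word_status = results[-1]
            results[-1] = (word, int(start), int(end), word_status)
        else:
            results.append((token, None, None, status))
            status = 'aligned'
    return results

print(parse_alignment('+ of [10110:10180] there [11470:11580] are [11670:11710] - missing'))
# [('of', 10110, 10180, 'inserted'), ('there', 11470, 11580, 'aligned'),
#  ('are', 11670, 11710, 'aligned'), ('missing', None, None, 'missing')]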
Work on long alignment started during a 2011 GSoC project with a proposal from Bhiksha Raj and James Baker. Apurv Tiwari made a lot of progress on it; however, we were not able to produce a robust alignment algorithm. It still failed in too many cases, and the failures were critical. Eventually we changed the algorithm to multipass decoding, and it started to work better and to survive errors in the transcription gracefully. Alexander Solovets was responsible for the implementation. The algorithm still doesn't handle some important tokens like numbers or abbreviations, and the speed needs improvement, but it is already useful enough for us to proceed with the further steps of model training. We hope to improve the situation in the near future.
It's interesting that as speech recognition becomes widespread, the approach to the architecture of speech recognition systems changes significantly. When only a single application needed speech recognition, it was enough to provide a simple library of speech recognition functions, like pocketsphinx, and link it into the application. That is still a valid approach for embedded devices and specialized deployments. However, the approach changes significantly when you start to plan a speech recognition framework for the desktop. There are many applications that require a voice interface, and we need to let all of them interact with the user. Each of them would need time to load the models into memory and memory to hold them. Since these requirements are quite high, it becomes obvious that speech recognition has to be placed in a centralized process. Naturally, the concept of a speech recognition server appears.
It's interesting that many speech recognition projects are starting to talk about a server:
Simon has been using a common daemon (SimonD), managed over sockets, to provide speech recognition functions.
Rodrigo Parra is implementing a dbus-based server for the TamTam Listens project, a speech recognition framework for the Sugar OLPC project. This is very active work in progress; subscribe to the Tumblr blog to get the latest updates.
Andre Natal is working on a speech recognition server for Firefox OS during his summer project.
Right now none of these solutions is stable; they are all works in progress. It would be great if such efforts could converge on a single point in the future; perhaps CMUSphinx can be the common denominator here and provide the desktop service for applications looking to implement voice interfaces. A standalone library is certainly still needed, and we shouldn't focus only on the service architecture, but a service would be a good addition too. It could provide a common interface for applications that just need to register their required commands with the service.
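To make the idea a bit more concrete, here is a purely hypothetical sketch of such a command-registration interface; the class and method names are invented for illustration and don't correspond to any existing CMUSphinx API:

class SpeechService:
    """Hypothetical desktop service: applications register command phrases
    and get a callback when the recognizer reports one of them."""

    def __init__(self):
        self._commands = {}   # phrase -> list of callbacks

    def register_command(self, phrase, callback):
        """Called by client applications to subscribe to a voice command."""
        self._commands.setdefault(phrase.lower(), []).append(callback)

    def on_recognition_result(self, hypothesis):
        """Called by the recognition backend; dispatches to interested clients."""
        for callback in self._commands.get(hypothesis.strip().lower(), []):
            callback(hypothesis)

service = SpeechService()
service.register_command('open mail', lambda text: print('mail client activated'))
service.on_recognition_result('Open Mail')   # -> mail client activated

In a real deployment the registration call would travel over dbus or a socket, as in the projects listed above, but the shape of the interface stays the same.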
Of course there is the option of putting everything in the cloud, but a cloud solution has its own disadvantages: privacy concerns remain, and data connections are still expensive and slow. There are similar issues with other resource-intensive APIs like text-to-speech, desktop natural language processing, translation and so on, so soon quite a lot of memory on the desktop will be spent on desktop intelligence. Reserve a few more gigabytes of memory in your systems; they will be needed pretty soon.