Please consider the Sphinx4-based spoken dialog system toolkit demonstration http://www.okkoblog.com/2010/11/04/inprotk-demonstration/ by the University of Potsdam and the Inpro Group. It includes a few interesting improvements to the core, such as prosody-driven end-of-turn classification, mid-utterance action execution and display of partial ASR hypotheses.
Are you one of those who think that CMUSphinx is built for mad scientists to support DARPA sniffing on Arabic broadcasts? Let me show you something interesting. Today's story is about a very important project that uses CMUSphinx - JVoiceXML, a VoiceXML browser written in Java. It serves hundreds of VoiceXML browsers in the VoIP world. Besides VoiceXML gateways, it provides development tools, including a VoiceXML plugin for Eclipse.
We asked the lead JVoiceXML developer, Dr. Dirk Schnelle-Walka, several questions about JVoiceXML and CMUSphinx. Here are his answers.
Q: When was JVoiceXML started?
I started JVoiceXML in 2005 as I pursued my PhD at Telecooperation Lab at Technical University of Darmstadt.
The goal of my PhD was to use speech interfaces in smart environments. The dilemma for a research institute is that commercial systems are too expensive. Free VoiceXML hosting exists but cannot be used without a telephone. So I decided: "Well, maybe I can do the programming myself". It was an easy start but there is still a lot to do for full compliance with the W3C standard. After a short time, Steve Doyle joined the project. He worked on the XML library and provided robust support for VoiceXML document authoring. In 2006 two new members joined the project: Torben Hardt and Christoph Bünte. Torben took care of the scripting engine and Christoph wrote his master's thesis about the grammar processor. Shortly after that, some people from INESC discovered JVoiceXML and pushed the project a lot. They were Hugo Monteiro, Renato Cassaca and David Rodriguez. In 2009, Zhang Nan joined and started to work on the W3C compliance test. Shortly thereafter, Spencer Lord from SpokenTech added initial support for MRCP and SIP. At the same time, the Telecom Labs sponsored support for the Mary Speech Synthesizer. In 2010, Cibek started to use this project to control installations in home environments. Their engagement is now leading to further improvements of the JSAPI 2 support.
Q: What are the advantages of JVoiceXML over commercial offerings?
The main advantage is that it can be used for free, even in commercial settings.
During the design of JVoiceXML we focused on extensibility, interoperability and support for standard interfaces. The use of JSAPI 1 makes it possible to use it without any telephony environment. The platform makes it easy to test VoiceXML applications quickly. You can even define your own custom tags and assign functionality to them. This also makes it ideal for teaching students about speech user interface development.
Q: Who are your users?
JVoiceXML is mostly used in teaching and by some smaller start-up companies, especially in India, China and the U.S. It is also used in several research projects.
Q: What is the status of VXML2.1 support?
There is initial support for nearly all tags. Missing tags are "link" and "transfer". Transfer has some initial code that has not been tested yet. By initial support I mean that not all attributes of each tag are supported, but the interpreter seems to behave as expected and can be used for simple applications.
Q: What about VXML 3.0?
Currently, I am investigating the publicly available working draft for an analysis. One of the next steps will be to adjust the interpreter to match the interfaces that are described there. So we will definitely go for it.
Q: What are the plans for speaker (voice) recognition which is part of VXML 3.0 specification?
JVoiceXML is only the voice browser. But the current architecture is designed for easy extensibility. It's easy to plug in external functionality. We are also looking for a suitable speaker recognition software. Marf could be a candidate, but it seems that it is no longer actively developed. So we are also interested in good hints about a decent piece of software filling that gap.
Q: What place does JVoiceXML take in the open source telephony stack?
JVoiceXML does not explicitly require a telephone. It can also be used without any telephony support on a desktop for example.
The extensible architecture allows the use of different PBXs. JVoiceXML has been successfully used together with Asterisk and currently we are working on an integration with FreeSwitch. Some recent contacts promised to offer support for Mobicents as well.
The integration with telephony applications is done by means of standard APIs and protocols. JTAPI can take over the role as a bridge for other protocols or telephony products, but MRCP will be needed to enable audio streaming to them. The MRCP stack used is identical to the one that is used in Zanzibar. Zanzibar, which is using JVoiceXML as its VoiceXML processor, offers an integrated product while JVoiceXML can also be used with other PBXs.
Q: What are the future plans for JVoiceXML development?
The major goal is to establish a framework that can be used for free. So all tags should be fully supported. Currently we are also working to use the W3C conformance test. Unfortunately, it will not be possible to get a certification since this has to be paid for and we do not earn money with JVoiceXML.
Q: What was your biggest issue with CMUSphinx?
Sphinx4 was easy to integrate. The trickiest part was to make it JSAPI 1.0 compliant. We developed some code that was submitted back to the Sphinx4 project.
Q: What feature do you miss the most in CMUSphinx?
Since VoiceXML utilizes SRGS it would be good to have this supported natively.
We hope to add this to our contribution of JSAPI 2 support for Sphinx4. Currently, we are trying to transform the grammars, which leads to some ambiguities.
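To make the grammar transformation mentioned above more concrete, here is a minimal sketch of turning a tiny subset of SRGS XML into JSGF text. It is an illustration only, not JVoiceXML's or Sphinx4's actual converter: the class name `SrgsToJsgfSketch` and its methods are hypothetical, and the sketch assumes the grammar has a `root` attribute and uses only `rule`, `one-of`, `item`, local `ruleref` and plain text. Anything else is rejected, which hints at where a real transformation gets difficult: constructs such as repeat ranges or semantic tags do not map one-to-one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: converts a small SRGS XML subset to JSGF text.
final class SrgsToJsgfSketch {

    static String convert(String srgsXml) {
        Element grammar;
        try {
            grammar = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(srgsXml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not parseable XML", e);
        }
        // Assumption: the grammar declares its root rule; we reuse it as the JSGF grammar name.
        String root = grammar.getAttribute("root");
        StringBuilder out = new StringBuilder("#JSGF V1.0;\ngrammar " + root + ";\n");
        NodeList rules = grammar.getElementsByTagName("rule");
        for (int i = 0; i < rules.getLength(); i++) {
            Element rule = (Element) rules.item(i);
            String id = rule.getAttribute("id");
            if (id.equals(root)) {
                out.append("public "); // only the root rule is exported
            }
            out.append('<').append(id).append("> = ").append(sequence(rule)).append(";\n");
        }
        return out.toString();
    }

    // Joins text and child elements of a rule or item into a JSGF sequence.
    private static String sequence(Element parent) {
        StringBuilder sb = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            String part = "";
            if (n.getNodeType() == Node.TEXT_NODE) {
                part = n.getTextContent().trim().replaceAll("\\s+", " ");
            } else if (n.getNodeType() == Node.ELEMENT_NODE) {
                part = expand((Element) n);
            }
            if (!part.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(part);
            }
        }
        return sb.toString();
    }

    private static String expand(Element e) {
        switch (e.getTagName()) {
            case "one-of": {
                List<String> alternatives = new ArrayList<>();
                NodeList items = e.getChildNodes();
                for (int i = 0; i < items.getLength(); i++) {
                    if (items.item(i) instanceof Element) {
                        alternatives.add(expand((Element) items.item(i)));
                    }
                }
                return "(" + String.join(" | ", alternatives) + ")";
            }
            case "item": {
                String body = sequence(e);
                String repeat = e.getAttribute("repeat");
                if (repeat.isEmpty()) {
                    return body;
                }
                if (repeat.equals("0-1")) {
                    return "[" + body + "]"; // optional item
                }
                throw new IllegalArgumentException("No mapping in this sketch for repeat=" + repeat);
            }
            case "ruleref": {
                String uri = e.getAttribute("uri");
                if (!uri.startsWith("#")) {
                    throw new IllegalArgumentException("Only local rule references are handled");
                }
                return "<" + uri.substring(1) + ">";
            }
            default:
                throw new IllegalArgumentException("Unsupported SRGS element: " + e.getTagName());
        }
    }
}
```

Calling `SrgsToJsgfSketch.convert(...)` on a grammar whose root rule contains an optional "please" followed by a choice between "yes" and "no thanks" yields a JSGF rule of the form `public <answer> = [please] (yes | no thanks);`.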
Q: What is the state of TTS in JVoiceXML?
We have support for different TTS engines. The most important ones are FreeTTS, OpenMary and the Microsoft Speech API.
Q: What do you think about natural language interfaces and open vocabulary communication?
We are coming closer and closer to this vision, but we are not there yet. In fact this would mean that the computer has the same communication capabilities as a human. To achieve this, we will also need to incorporate context into the evaluation. This is still a long way to go. VoiceXML tries to achieve this by what they call "mixed initiative". For me, this is slot-filling rather than real mixed initiative. It is possible to mimic the behavior of true human-to-human communication with the help of some voice user interface design techniques, but there are still some important aspects missing, like information overloading, intention recognition and partial speech recognition.
Q: Can you list a few projects a student can do as a term project at university?
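The slot-filling behavior described in the answer above can be sketched in a few lines. This is an illustration under stated assumptions, not the VoiceXML form interpretation algorithm or JVoiceXML code: the class `SlotFillingSketch` and its methods are hypothetical, and the recognizer result is assumed to arrive as a simple map from slot name to value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical form: the caller may fill several slots in one utterance,
// and the system simply prompts for whatever is still missing.
final class SlotFillingSketch {

    // Slot name -> value (null while unfilled), kept in prompt order.
    private final Map<String, String> slots = new LinkedHashMap<>();

    SlotFillingSketch(String... slotNames) {
        for (String name : slotNames) {
            slots.put(name, null);
        }
    }

    // Stores every recognized value that belongs to a known slot.
    void accept(Map<String, String> recognized) {
        recognized.forEach((slot, value) -> {
            if (slots.containsKey(slot)) {
                slots.put(slot, value);
            }
        });
    }

    // Returns the next slot to prompt for, or empty once the form is complete.
    Optional<String> nextUnfilled() {
        return slots.entrySet().stream()
                .filter(entry -> entry.getValue() == null)
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

With slots "origin", "destination" and "date", an utterance that supplies origin and date leaves only "destination" to be prompted for. The dialog never goes beyond filling a fixed set of fields, which is the limitation the answer points at.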
We have many interesting tasks, to list a few:
- completion of the SRGS processor
- implement a cache for resources
- complete the W3C conformance test
You definitely need to contact us if you are interested!
Want to share your experience with CMUSphinx? Contact us.
We are very pleased to see the ongoing progress on OpenEars. Please consider
OpenEars is an iOS library for continuous, multithreaded speech recognition and text-to-speech using CMU Pocketsphinx and CMU Flite, for use in iPhone and iPad development. OpenEars can:
• Do continuous speech recognition on a managed background thread that uses less than 10% CPU on average on an iPhone 3G while listening (decoding and text-to-speech use more CPU),
• Quickly suspend and resume continuous recognition on demand,
• Choose between 8 Flite voices for text-to-speech using a simple config file,
• Suspend recognition during Flite speech automatically when using the external speaker,
• Make use of a Cocoa-standard static library project, allowing SDK and architecture re-targeting from the application project,
• Do management and notification of the state of the Audio Session to handle microphone changes and interruptions like incoming calls,
• Return input/output decibel metering of the audio functions so it is ready for your UI,
• Let you use these features via Objective-C methods.
Please report any bugs at http://www.politepix.com/forums/
Since you are probably eager to know more, we asked the OpenEars developer, Halle Winkler, a few questions. Halle is a professional developer specializing in software development for the iPhone, iPad, and iPod Touch, as well as UX design, with an emphasis on usability and the emerging interaction possibilities of multitouch platforms.
Q: How are you going to expand this?
Halle: I'm going to see what users ask for, but my guess is that they will want in-app lm/dic generation or a RESTful API for creating new lm/dic files. Other features that I would consider would be switching lmsets on the fly in the course of the listening loop, or maybe an API for managing a logical tree of different outcomes from commands, which was definitely the most headache-causing aspect of AllEars to test. I'm very interested to see what people do with it and what they tell me they want. I'm not ready to publish any kind of roadmap yet.
Oh, something I definitely need to do for an upcoming version is improve the responsiveness/threading/CPU overhead for Flite processing and speech playback. I need to use a lower-level audio API for Flite and get Flite streaming working. On an iPhone 4 some of the voices can generate a sentence in a third of a second, which is impressive and definitely not in the range that is getting unresponsive from a UX perspective, but on my iPhone 3G it can take a second and a half plus the latency of the audio API I'm using, which is getting into "is anything happening?" UX territory (I know that is still an impressive speed given the CPU, but for endusers it's confusing).
Q: What are the most important issues users complain about?
Halle: There are no complaints yet, but I would be glad to hear any feedback from OpenEars users.
Q: What are the most important Pocketsphinx issues you've encountered?
Halle: Well, when I was originally using Pocketsphinx, a big frustration was the build_for_iphone.sh method of creating static libraries because it often didn't work for me since I often don't have my developer tools installed to the default location (which seems to be required by the script), and once I got it working I ended up having to make or copy 12 different static libraries in order to be able to target 3 different SDK versions while I was experimenting. Then in the middle of it, Apple shipped a beta iOS4 SDK that wiped out those libraries with its installer, which has nothing to do with Pocketsphinx but was a time-killer to figure out what had happened, which is the point at which I made a new method for linking to your libraries.
OpenEars ships with Cocoa static library projects for Pocketsphinx and Sphinxbase which are linked via cross-project references with the user's app project, so when they want to target the simulator versus the device, or target to one SDK but deploy with backwards-compatibility to an earlier SDK, the Cocoa static library project just gets that information passively from the main project and recompiles itself at the build time for the user's app project.
In general I think Pocketsphinx is fantastic! It runs really well on the iPhone in continuous mode. More documentation is always good. I tried to err on the side of overkill for the OpenEars docs since I think the topic of speech recognition can be very complex to first get into as an outsider, when you're just expecting that you'll compile the library and suddenly you'll have a device that understands an entire language's worth of vocabulary with no trouble, but actually there are lms, dics, hmms, all the arguments that you can run pocketsphinx with, etc. Easing developers into some complexities that would benefit them (or me) to understand on a more fundamental level while encapsulating other complexities that might not need to be grappled with in a run-of-the-mill speech app seems like the challenge.
Congratulations on the new release.
Get it here:
New Features and Improvements:
* Alignment demo and grammar to align long speech recordings to
transcription and get word times
* Lattice grammar for multipass decoding
* Explicit-backoff in LexTree linguist
* Significant LVCSR speedup with proper LexTree compression
* Simple filter to drop zero energy frames
* Graphviz for grammar dump visualization instead of AISee
* Voxforge decoding accuracy test
* Lattice scoring speedup
* JSAPI-free JSGF parser
* Insertion probabilities are counted in lattice scores
* Don't waste resources and memory on dummy acoustic model
* Small DMP files are loaded properly
* JSGF parser fixes
* Documentation improvements
* Debian package stuff
Antoine Raux, Marek Lesiak, Yaniv Kunda, Brian Romanowski, Tony
Robinson, Bhiksha Raj, Timo Baumann, Michele Alessandrini, Francisco
Aguilera, Peter Wolf, David Huggins-Daines, Dirk Schnelle-Walka.
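
As an illustration of the "simple filter to drop zero energy frames" item in the feature list above, here is a minimal sketch of the idea. It is not the Sphinx4 implementation: the class `ZeroEnergyFilterSketch`, its method and the fixed, non-overlapping frame layout are assumptions made for the example. One plausible motivation for such a filter is that an all-zero frame leads to a logarithm of zero in later feature computation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits 16-bit samples into fixed-size frames and
// keeps only frames whose summed squared amplitude is non-zero.
final class ZeroEnergyFilterSketch {

    static List<short[]> dropZeroEnergyFrames(short[] samples, int frameSize) {
        if (frameSize <= 0) {
            throw new IllegalArgumentException("frameSize must be positive");
        }
        List<short[]> kept = new ArrayList<>();
        // A trailing partial frame is ignored in this sketch.
        for (int start = 0; start + frameSize <= samples.length; start += frameSize) {
            long energy = 0;
            for (int i = start; i < start + frameSize; i++) {
                energy += (long) samples[i] * samples[i];
            }
            if (energy > 0) {
                kept.add(Arrays.copyOfRange(samples, start, start + frameSize));
            }
        }
        return kept;
    }
}
```

For example, with a frame size of 4, a buffer made of one silent frame, one frame with non-zero samples and another silent frame yields a list containing only the middle frame.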