CMUSphinx Open Source Speech Recognition

Feb 17, 2011

CMLLR Adaptation in SphinxTrain

ASR algorithms are so numerous and complex that it is very hard to implement them all. Some work better in certain conditions and some worse. For a specific application you can always pick the most reasonable approach, but it may not be readily available in your system, and implementing it yourself can be quite resource-consuming. That's why frameworks like CMUSphinx are valuable for both researchers and speech application developers, and that's why we are so happy to see your contributions to CMUSphinx.

A good example is the set of approaches to training an MLLR transform. Basically, there is unconstrained MLLR, where the transforms for the means and variances of the Gaussians are estimated separately, and constrained MLLR (CMLLR), where the means and variances of the Gaussian distributions share a single transform that is estimated jointly. CMLLR is more complex to estimate, but because it has fewer parameters it makes sense to apply it when your adaptation data is small. For example, if you have just a minute of speech to adapt on, CMLLR can give you better results than MLLR.
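
One practical consequence of tying the mean and variance transforms together is that CMLLR can be applied either to the model or, equivalently, to the features (plus a Jacobian term). The Python sketch below checks that equivalence numerically. It is only an illustration, not the SphinxTrain code: it assumes diagonal covariances and a per-dimension (diagonal) transform for brevity, whereas real CMLLR normally uses a full matrix, and all function names and numbers are made up for the example.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def adapt_features(x, a, b):
    """Feature-space form: x_hat = a * x + b, per dimension."""
    return [ai * xi + bi for xi, ai, bi in zip(x, a, b)]

def adapt_model(mean, var, a, b):
    """Model-space form: the same (a, b) moves both mean and variance."""
    new_mean = [(m - bi) / ai for m, ai, bi in zip(mean, a, b)]
    new_var = [v / (ai * ai) for v, ai in zip(var, a)]
    return new_mean, new_var

def log_jacobian(a):
    """Log |det A| for a diagonal transform."""
    return sum(math.log(abs(ai)) for ai in a)

# Hypothetical 3-dimensional Gaussian, transform and observation.
mean = [0.5, -1.0, 2.0]
var = [1.0, 0.25, 4.0]
a = [1.2, 0.8, -0.5]
b = [0.1, -0.3, 0.7]
x = [0.3, -0.9, 1.5]

# Score the observation with the adapted model ...
model_mean, model_var = adapt_model(mean, var, a, b)
ll_model = log_gauss_diag(x, model_mean, model_var)

# ... and with the original model on transformed features.
ll_feature = log_gauss_diag(adapt_features(x, a, b), mean, var) + log_jacobian(a)

print(abs(ll_model - ll_feature) < 1e-9)
```

In this simplified form the transform has only two numbers per feature dimension, shared by every Gaussian it is applied to, which gives a feel for why a constrained transform can be estimated from very little adaptation data.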

Why are we writing about this today, you'll ask? Easy. Today CMLLR estimation code landed in the SphinxTrain trunk. See the file cmllr.py. Thanks a lot to Stephan Vanni, who contributed that part; it's a really valuable addition! Enjoy!

Nov 30, 2010

We are happy to see every app using CMUSphinx

It's interesting that CMUSphinx gives developers all over the world the ability to build speech systems, interact by voice, and build something unique and useful.

This Kinect imitation is not really impressive or complicated speech recognition; it's all about freedom of creativity. If you want to build an interactive application, just write a few lines of code and it will work. In many languages, for many people.

And we'll be satisfied as well.

Nov 4, 2010

InproTK Demonstration

Please take a look at the Sphinx4-based spoken dialog system toolkit demonstration http://www.okkoblog.com/2010/11/04/inprotk-demonstration/ by the University of Potsdam and the Inpro Group. It includes a few interesting improvements to the core, such as prosody-driven end-of-turn classification, mid-utterance action execution, and display of partial ASR hypotheses.

Oct 11, 2010

JVoiceXML and CMUSphinx

Are you one of those who think that CMUSphinx is built for mad scientists to support DARPA snooping on Arabic broadcasts? Let me show you something interesting. Today's story is about a very important project that uses CMUSphinx: JVoiceXML, a VoiceXML browser written in Java. It serves hundreds of VoiceXML browsers in the VoIP world. Besides VoiceXML gateways, it provides development tools, including a VoiceXML plugin for Eclipse.

We asked the lead JVoiceXML developer, Dr. Dirk Schnelle-Walka, several questions about JVoiceXML and CMUSphinx. Here are his answers.

Q: When was JVoiceXML started?

I started JVoiceXML in 2005 while pursuing my PhD at the Telecooperation Lab at the Technical University of Darmstadt.

The goal of my PhD was to use speech interfaces in smart environments. The dilemma for a research institute is that commercial systems are too expensive. Free VoiceXML hosting exists but cannot be used without a telephone. So I decided: "Well, maybe I can do the programming myself". It was an easy start, but there is still a lot to do for full compliance with the W3C standard. After a short time, Steve Doyle joined the project. He worked on the XML library and provided robust support for VoiceXML document authoring. In 2006 two new members joined the project: Torben Hardt and Christoph Bünte. Torben took care of the scripting engine and Christoph wrote his master's thesis on the grammar processor. Shortly after that, some people at INESC discovered JVoiceXML and pushed the project a lot. They were Hugo Monteiro, Renato Cassaca and David Rodriguez. In 2009, Zhang Nan joined and started to work on the W3C compliance test. Shortly thereafter, Spencer Lord from SpokenTech added initial support for MRCP and SIP. At the same time, the Telecom Labs sponsored support for the Mary Speech Synthesizer. In 2010, Cibek started to use this project to control installations in home environments. Their engagement is now leading to further improvements of the JSAPI 2 support.

Q: What are the advantages of JVoiceXML over commercial offerings?

The main advantage is that it can be used for free, even in commercial settings.

During the design of JVoiceXML we focused on extensibility, interoperability and support for standard interfaces. The use of JSAPI 1 makes it possible to use it without any telephony environment. The platform makes it easy to test VoiceXML applications quickly; you can even define your own custom tags and assign functionality to them. This also makes it ideal for teaching students about speech user interface development.

Q: Who are your users?

JVoiceXML is mostly used in teaching and by some smaller start-up companies, especially in India, China and the U.S. It is also used in several research projects.

Q: What is the status of VXML2.1 support?

There is initial support for nearly all tags. Missing tags are "link" and "transfer". Transfer has some initial code that has not been tested yet. By initial support I mean that not all attributes of each tag are supported, but the interpreter seems to behave as expected and can be used for simple applications.

Q: What about VXML 3.0?

Currently, I am analyzing the publicly available working draft. One of the next steps will be to adjust the interpreter to match the interfaces that are described there. So we will definitely go for it.

Q: What are the plans for speaker (voice) recognition, which is part of the VXML 3.0 specification?

JVoiceXML is only the voice browser. But the current architecture is designed for easy extensibility. It's easy to plug in external functionality. We are also looking for suitable speaker recognition software. Marf could be a candidate, but it seems that it is no longer actively developed. So we are also interested in good hints about a decent piece of software to fill that gap.

Q: What place does JVoiceXML take in the open source telephony stack?

JVoiceXML does not explicitly require a telephone. It can also be used without any telephony support on a desktop for example.

The extensible architecture allows the use of different PBXs. JVoiceXML has been successfully used together with Asterisk, and currently we are working on an integration with FreeSwitch. Some recent contacts promised to offer support for Mobicents as well.

The integration with telephony applications is done by means of standard APIs and protocols. JTAPI can act as a bridge for other protocols or telephony products, but MRCP will be needed to enable audio streaming to them. The MRCP stack used is identical to the one that is used in Zanzibar. Zanzibar, which uses JVoiceXML as its VoiceXML processor, offers an integrated product, while JVoiceXML can also be used with other PBXs.

Q: What are the future plans for JVoiceXML development?

The major goal is to establish a framework that can be used for free. So all tags should be fully supported. Currently we are also working on using the W3C conformance test. Unfortunately, it will not be possible to get a certification, since this has to be paid for and we do not earn money with JVoiceXML.

Q: What was your biggest issue with CMUSphinx?

Sphinx4 was easy to integrate. The trickiest part was to make it JSAPI 1.0 compliant. We developed some code that was submitted back to the Sphinx4 project.

Q: What feature do you miss the most in CMUSphinx?

Since VoiceXML utilizes SRGS, it would be good to have this supported natively. We hope to add this to our contribution of JSAPI 2 support for Sphinx4. Currently, we are trying to transform the grammars, which leads to some ambiguities.
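
To give a feel for the kind of grammar transformation mentioned in this answer, here is a minimal Python sketch that turns a very small subset of SRGS XML (rules containing a single flat `one-of` list) into JSGF text. It is not the JVoiceXML or Sphinx4 code: the function name `srgs_to_jsgf`, the default grammar name and the sample grammar are assumptions made for illustration, and anything beyond the flat-list case is rejected rather than guessed at.

```python
import xml.etree.ElementTree as ET

# Namespace used by SRGS XML grammars.
SRGS_NS = "{http://www.w3.org/2001/06/grammar}"

def srgs_to_jsgf(srgs_xml, grammar_name="converted"):
    """Convert a tiny SRGS subset (flat <one-of> rules) to JSGF text."""
    root = ET.fromstring(srgs_xml)
    root_rule = root.get("root")
    lines = ["#JSGF V1.0;", f"grammar {grammar_name};"]
    for rule in root.findall(SRGS_NS + "rule"):
        rule_id = rule.get("id")
        one_of = rule.find(SRGS_NS + "one-of")
        if one_of is None:
            # Refuse constructs this sketch does not understand.
            raise ValueError(f"rule {rule_id!r} is not a flat one-of list")
        alternatives = [
            " ".join((item.text or "").split())  # normalize whitespace
            for item in one_of.findall(SRGS_NS + "item")
        ]
        # The SRGS root rule becomes the public JSGF rule.
        prefix = "public " if rule_id == root_rule else ""
        lines.append(f"{prefix}<{rule_id}> = {' | '.join(alternatives)};")
    return "\n".join(lines)

# Hypothetical sample grammar.
EXAMPLE = """<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="answer">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
      <item>maybe later</item>
    </one-of>
  </rule>
</grammar>"""

print(srgs_to_jsgf(EXAMPLE))
```

For the sample grammar this yields a JSGF header followed by the single rule `public <answer> = yes | no | maybe later;`.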

Q: What is the state of TTS in JVoiceXML?

We have support for different TTS engines. The most important ones are FreeTTS, OpenMary and the Microsoft Speech API.

Q: What do you think about natural language interfaces and open vocabulary communication?

We have come closer and closer to this vision, but we are not there yet. In fact, this would mean that the computer has the same communication capabilities as a human. To achieve this, we will also need to incorporate context into the evaluation. There is still a long way to go. VoiceXML tries to achieve this by what they call "mixed initiative". For me, this is rather slot-filling than real mixed initiative. It is possible to mimic the behavior of true human-to-human communication with the help of some voice user interface design techniques, but there are still some important aspects missing, like information overloading, intention recognition and partial speech recognition.
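
The slot-filling behavior described in this answer can be sketched in a few lines of Python. In a VoiceXML mixed-initiative form, one utterance may fill several fields at once, and the interpreter then prompts for whichever fields are still empty. The sketch below mimics only that loop; it is not how JVoiceXML is implemented, and the slot names, vocabulary and prompts are invented for the example.

```python
def fill_slots(slots, utterance, vocabulary):
    """Fill every still-empty slot whose vocabulary matches a spoken word."""
    words = utterance.lower().split()
    for name, allowed in vocabulary.items():
        if slots[name] is None:
            for word in words:
                if word in allowed:
                    slots[name] = word
                    break
    return slots

def next_prompt(slots, prompts):
    """Return the prompt for the first empty slot, or None when done."""
    for name, value in slots.items():
        if value is None:
            return prompts[name]
    return None

# Hypothetical travel form with two slots.
vocabulary = {
    "city": {"berlin", "darmstadt"},
    "day": {"monday", "friday"},
}
prompts = {"city": "Which city?", "day": "Which day?"}
slots = {"city": None, "day": None}

fill_slots(slots, "I want to go to Darmstadt", vocabulary)
first_prompt = next_prompt(slots, prompts)   # the day is still missing

fill_slots(slots, "on Friday", vocabulary)
final_prompt = next_prompt(slots, prompts)   # nothing left to ask

print(slots, first_prompt, final_prompt)
```

The system only ever checks which slots are empty; it never reasons about what the speaker intends, which is the gap the answer points to.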

Q: Can you list a few projects a student could do as a term project at university?

We have many interesting tasks, to list a few:

- completion of the SRGS processor
- implement a cache for resources
- complete the W3C conformance test

You definitely need to contact us if you are interested!

Want to share your experience with CMUSphinx? Contact us.
