Web Data Collection For Language Modeling - New plan, IRST LM and perplexity

Dr. Tony Robinson, one of the mentors, has come up with a new plan for the project. The core idea remains the same. The difference is that we are going to use Lucene to obtain text from a small set of websites that provide high-quality text, such as news websites. This avoids the problem with search engines' automated query policies and also speeds up data processing, since everything is queried locally. IRST LM was chosen as the language model toolkit. I have run some experiments on the training data provided with IRST LM. Details inside.

GSOC 2012 Accepted Projects Announced

We are happy to announce the list of students who will participate in Google Summer of Code 2012 with the CMUSphinx organization:

Letter to Phoneme Conversion in sphinx4


Currently sphinx4 can only work with a predefined dictionary. It is possible to build a phonetic dictionary automatically, but this requires applying machine learning for training, developing a decoder module, and testing. Models for various languages need to be trained as well. This work will implement letter-to-sound rules with OpenFST in sphinx4.

Student John Salatas

Pronunciation Evaluation


Implement a simple reading and pronunciation learning system.


Srikanth Ronanki and Troy Lee

Semantic language model

Current language models are very basic, which means they don't really understand what is transcribed, and that affects the error rate. Create a decoder over the lattices that selects a semantically correct path and creates a perfectly readable result.


Wencan Luo

Postprocessing punctuation and capitalization framework

Create a language-independent postprocessing framework that will turn ASR results into something readable, with punctuation, abbreviations and capitalization.



Alexandru-Dan Tomescu

Web Data Collection For Language Modeling

Write a crawler that can collect text data on a certain topic for language model training.


Emre Çelikten

We expect great features to be implemented this summer. Please stay tuned; the news will appear here.

Podcast About CMUSphinx History

Hello CMUSphinx Users and Developers

If you are interested in CMUSphinx history, or just want to become more familiar with the core CMUSphinx developers and hear them speak, you can now do so. The SourceForge team and Rich Bowen recently made a great podcast with the CMUSphinx team.

Check it out


CMUSphinx powers mobile dictation application

Sonalight, which showed off its product at this week’s Y Combinator Demo Day, thinks voice tech is better put to use tackling real issues users have with their mobiles in everyday settings, like texting while driving. Sonalight actually employs Google’s own existing voice recognition tech, in combination with the CMU Sphinx open source software, to achieve its results. This is a great use case for CMUSphinx.



To try it.
