A new English language model release

An updated English language model is available for download on our new torrent tracker.

This is a general-purpose trigram language model for transcription, trained on various open sources, for example Project Gutenberg texts.

It achieves good transcription performance on various types of texts; for example, the perplexities on the following test sets are:

  • TED talks from IWSLT 2012: perplexity 158.3
  • Lecture transcription task: perplexity 206.677
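
Lower perplexity means the model predicts the test text better. For readers unfamiliar with the metric, here is a minimal sketch of how it can be computed from the per-word log10 probabilities that an LM toolkit reports; this is illustrative only and not part of the released model:

    import math

    def perplexity(word_log10_probs):
        """Perplexity from the per-word log10 probabilities assigned by an LM."""
        avg_log10 = sum(word_log10_probs) / len(word_log10_probs)
        return 10.0 ** (-avg_log10)

    # Sanity check: a model that gives every word probability 1/158.3
    # has perplexity 158.3 on that text.
    print(perplexity([math.log10(1.0 / 158.3)] * 100))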

Besides the transcription task, this model should also be significantly better on conversational data such as movie transcripts.

The language model was pruned with a beam of 5e-9 to reduce its size. It can be pruned further if needed, or the vocabulary can be reduced to fit the target domain.
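
For those who want to try the model right away, the following is a minimal sketch of decoding with the pocketsphinx Python bindings. The file paths and model file names are placeholders, and the exact binding API may vary between releases:

    from pocketsphinx import Decoder

    # Placeholder paths: point these at your acoustic model, the downloaded
    # language model and a matching pronunciation dictionary.
    config = Decoder.default_config()
    config.set_string('-hmm', '/path/to/en-us-acoustic-model')
    config.set_string('-lm', '/path/to/en-us.lm')
    config.set_string('-dict', '/path/to/cmudict.dict')
    decoder = Decoder(config)

    # Decode a mono, 16 kHz, 16-bit raw PCM recording.
    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    hyp = decoder.hyp()
    print(hyp.hypstr if hyp is not None else '(no hypothesis)')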

Help distribute CMUSphinx data through BitTorrent

Modern speech recognition algorithms require an enormous amount of data to estimate speech parameters. Audio recordings, transcriptions, texts for language models, pronunciation dictionaries and vocabularies are collected by speech developers. While this may not be the case in the future, when better algorithms might require just a few examples, today you need to process thousands of hours of recordings to build a speech recognition system.

Estimates show that a human receives thousands of hours of speech before learning to understand it. Note that humans also have prior knowledge structures embedded in the brain that we are not yet aware of. Google trains its models on 100 thousand hours of audio recordings and petabytes of transcriptions, yet it is still behind human performance in speech recognition tasks: for search queries the word error rate is still about 10%, and for YouTube it is over 40%.


While Google has vast resources, so do we. Together we can collect, process and share even more data than Google has. The first step in this direction is to create shared storage for audio data and CMUSphinx models.

We created a torrent tracker specifically to distribute legal speech data related to CMUSphinx, speech recognition, speech technologies and natural language processing. Thanks to Elias Majic, the tracker is available at

http://cmusphinx.info

Currently the tracker contains torrents for the existing acoustic and language models, but new, more accurate models for US English and other languages will be released soon.

We encourage you to make other speech-related data available through our tracker. Please contact the cmusphinx-devel@lists.sourceforge.net mailing list if you want to add your data set to the tracker.

Please help us distribute the data: start a BitTorrent client on your host and keep the data available to others.

To learn more about BitTorrent, search the web; there is a vast amount of resources about it.

You might wonder what the next step is. Pretty soon we will be able to run a distributed acoustic model training system that trains acoustic models using vast amounts of distributed data and computing power. With a BOINC grid computation network running CMUSphinx tools, we will together create the most accurate models for speech. Stay tuned.

New release: sphinxbase-0.8, pocketsphinx-0.8 and sphinxtrain-0.8

We are pleased to announce that a set of CMUSphinx packages was released today:

  • sphinxbase-0.8
  • pocketsphinx-0.8
  • sphinxtrain-0.8

For the download links see:

https://cmusphinx.github.io/wiki/download

The biggest update in this release is the new sphinxtrain. Code sharing between sphinxbase and sphinxtrain has increased significantly, bringing a more consistent codebase and interface, more accurate memory management and improved usability.

Besides that, a single sphinxtrain binary is introduced to provide easy and flexible access to the whole training procedure. In the future we hope to reduce the number of Perl scripts in the training setup and to port everything to Python. This will open access to the advanced Python ecosystem, including scientific packages, graphics and distributed computing.

Another notable change in this release is a new OpenFst-based G2P framework implemented during Google Summer of Code. Credit for this should go to Josef Robert Novak and John Salatas. This framework is also supported by sphinx4 and provides a uniform and accurate algorithm to create dictionaries from word lists.

Numerous bug fixes and improvements were submitted by our contributors. We are grateful to the great developers who made this release possible. Many thanks to our star team, whose list is impressively long:

Alessandro Curzi
Alexandru-Dan Tomescu
Balkce
Bhiksha Raj
Blake Lemoine
Boris Mansencal
Douglas Bagnall
Erik Andresen
Evandro Gouvea
Glenn Pierce
Halle Winkler
Jidong Tao
John Salatas
Josef Novak
Kho-And-Mica
Kris Thielemans
Lionel Koenig
Marc Legendre
Melmahdy
Michal Krajnansky
Nicola Murino
Pankaj Pailwar
Paul Dixon
Pecastro
Peter Grasch
Riccardo Magliocchetti
Scott Silliman
Shea Levy
Tanel Alumae
Tony Robinson
Vassil Panayotov
Vijay Abharadwaj
Vyacheslav Klimkov
Yuri Orlov
Zheng6822

For more detailed information see the NEWS file in the corresponding packages.

The new sphinx4 package and an Android demo using pocketsphinx will be released soon, finalizing the release cycle. After that, great new features will start making their way into the codebase. Stay tuned.

A bunch of great CMUSphinx posts

For those who are interested in CMUSphinx on mobile, please check out the PolitePix blog, where you can find some interesting ideas about pocketsphinx on the iPhone:

OpenEars tips #1: create a language model before runtime from a text file or corpus

OpenEars tips #2: N-Best hypotheses with OpenEars

OpenEars tips #3: Acoustic model adaptation

OpenEars tips #4: Testing someone else’s recognition results using a recording

OpenEars is the easiest way to try open offline speech recognition on the iPhone platform. If you are interested in adding speech recognition to your iPhone application, you should definitely check it out.