Modern speech recognition algorithms require an enormous amount of data to estimate speech parameters. Speech developers collect audio recordings, transcriptions, texts for language models, pronunciation dictionaries and vocabularies. This may not be the case in the future, when better algorithms might need just a few examples, but for now you need to process thousands of hours of recordings to build a speech recognition system.
Estimates show that a human receives thousands of hours of speech data before learning to understand speech. Note that humans also have a prior knowledge structure embedded in the brain that we are not aware of. Google trains its models on 100 thousand hours of audio recordings and petabytes of transcriptions, yet it is still behind human performance in speech recognition tasks. For search queries Google still has a word error rate of 10%, and for YouTube its word error rate is over 40%.
Google has vast resources, but so do we. We definitely can collect, process and share even more data than Google has. The first step in this direction is to create shared storage for the audio data and CMUSphinx models.
We created a torrent tracker specifically to distribute legal speech data related to CMUSphinx, speech recognition, speech technologies and natural language processing. Thanks to Elias Majic, the tracker is available at
Currently the tracker contains torrents for the existing acoustic and language models, but new, more accurate models for US English and other languages will be released soon.
We encourage you to make other speech-related data available through our tracker. Please contact the email@example.com mailing list if you want to add your data set to the tracker.
Please help us distribute the data: start a client on your host and make the data available to others.
To learn more about BitTorrent, visit this link or search the web; there is a vast amount of resources about it.
You might wonder what the next step is. Pretty soon we will be able to run a distributed acoustic model training system that trains the acoustic model using a vast amount of distributed data and computing power. With a BOINC grid computation network of CMUSphinx tools, together we will create the most accurate models for speech. Stay tuned.
We are pleased to announce that a set of CMUSphinx packages was released today:
For the download links see:
The biggest update in this release is the new sphinxtrain. Code sharing between sphinxbase and sphinxtrain has increased significantly, bringing a more consistent codebase and interface, more accurate memory management and better usability.
Besides that, a single sphinxtrain binary is introduced to provide easy and flexible access to the whole training procedure. In the future we hope to reduce the number of Perl scripts in the training setup and to port everything to Python. This will open access to the advanced Python ecosystem, including scientific packages, graphics and distributed computing.
Another notable change in this release is a new OpenFst-based G2P framework implemented during Google Summer of Code. Credit for this goes to Josef Robert Novak and John Salatas. This framework is also supported by sphinx4 and provides a uniform and accurate algorithm to create dictionaries from word lists.
Numerous bug fixes and improvements were submitted by our contributors. We are grateful to the great developers who made this release possible. Many thanks to our star team, whose list is impressively long:
For more detailed information see the NEWS file in the corresponding packages.
The new sphinx4 package and an Android demo using pocketsphinx will be released soon, finalizing the release cycle. After that, great new features will start making their way into the codebase. Stay tuned.
For those who are interested in CMUSphinx on mobile, please check out the PolitePix blog, where you can find some interesting ideas about pocketsphinx on iPhone:
OpenEars is the easiest way to try open offline speech recognition on the iPhone platform. If you are interested in adding speech recognition to your iPhone application, you should definitely check it out.
(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Final Report)
This article briefly summarizes the implementation of the GSoC 2012 Pronunciation Evaluation project.
I started with Sphinx forced alignment and obtained spectral-matching acoustic scores and durations at the phone and word level using WSJ models. After that, I concentrated mainly on two things as part of the GSoC 2012 project: edit-distance neighbor phones decoding, and scoring routines for both text-dependent and text-independent systems.
Edit-distance Neighbor phones decoding:
1. I started with a single-phone decoder and then explored a three-phone decoder, a word decoder and a complete phrase decoder, providing neighbor phones as alternatives to the expected phone.
2. The decoding results showed that word-level and phrase-level decoding using JSGF give almost the same results.
3. This method helps to detect mispronunciations at the phone level, and to detect homographs as well, if the percentage of decoding errors can be reduced.
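To make the idea of neighbor phones as alternatives more concrete, here is a minimal sketch of how such a JSGF grammar could be generated. It is not the code from the project: the `NEIGHBORS` table, the function name `build_neighbor_grammar` and the grammar and rule names are assumptions made for illustration, and the neighbor sets are invented.

```python
# Hypothetical neighbor table: phones considered close to each expected phone.
# The entries are illustrative only, not the neighbor sets used in the project.
NEIGHBORS = {
    "W": ["V"],
    "IH": ["IY", "EH"],
    "TH": ["T", "S"],
}


def build_neighbor_grammar(phones, neighbors, name="neighbors"):
    """Return a JSGF grammar in which each expected phone may be
    replaced by one of its neighbor phones."""
    slots = []
    for phone in phones:
        # The expected phone comes first, followed by its alternatives.
        alternatives = [phone] + [n for n in neighbors.get(phone, []) if n != phone]
        if len(alternatives) == 1:
            slots.append(phone)
        else:
            slots.append("(" + " | ".join(alternatives) + ")")
    rule = " ".join(slots)
    return "#JSGF V1.0;\ngrammar {0};\npublic <phrase> = {1};\n".format(name, rule)


grammar = build_neighbor_grammar(["W", "IH", "TH"], NEIGHBORS)
print(grammar)
```

Decoding against a grammar like this lets the recognizer pick whichever alternative matches the audio best, so a decoded phone that differs from the expected one points to a possible mispronunciation.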
The text-dependent method is based on exemplars for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Then, given a test recording, each phone in the phrase is compared with the exemplar statistics: z-scores are calculated, and normalized scores are derived from them using the maximum and minimum z-scores of the exemplar recordings. All phone scores are aggregated into a word score, and all word scores are then aggregated with POS weights into a score for the complete phrase.
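The exemplar-based scoring can be sketched roughly as follows. This is only an illustration under several assumptions: it handles a single feature (the acoustic score; duration would be treated the same way), it uses min-max scaling clamped to [0, 1] as the normalization, a plain mean to combine phone scores into a word score, and a weighted mean for the phrase. The function names and all numbers are hypothetical, not taken from the project.

```python
from statistics import mean, pstdev


def phone_stats(exemplar_values):
    """Mean, deviation and z-score range of one phone over the exemplars."""
    mu = mean(exemplar_values)
    sigma = pstdev(exemplar_values) or 1.0  # avoid division by zero
    zs = [(v - mu) / sigma for v in exemplar_values]
    return mu, sigma, min(zs), max(zs)


def phone_score(value, stats):
    """Normalize the test z-score against the exemplar z-score range.

    Assumption: min-max scaling, clamped to the interval [0, 1].
    """
    mu, sigma, z_min, z_max = stats
    z = (value - mu) / sigma
    if z_max == z_min:
        return 1.0 if z >= z_min else 0.0
    return min(1.0, max(0.0, (z - z_min) / (z_max - z_min)))


def word_score(phone_scores):
    """Assumption: a word score is the plain mean of its phone scores."""
    return mean(phone_scores)


def phrase_score(word_scores, pos_weights):
    """Assumption: the phrase score is a POS-weighted mean of word scores."""
    total = sum(pos_weights)
    return sum(s * w for s, w in zip(word_scores, pos_weights)) / total


# Made-up acoustic scores of one phone in three exemplar recordings.
stats = phone_stats([-100, -120, -110])
print(phone_score(-110, stats))
print(phrase_score([1.0, 0.5], [2.0, 1.0]))
```

With this normalization, a test phone that scores below the worst exemplar gets 0 and one that scores above the best exemplar gets 1. Other choices of normalization and aggregation are equally possible.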
The text-independent method is based on predetermined statistics built from any corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (begin/middle/end) in the word. Given any test file, the acoustic score and duration of each phone are compared with the corresponding phone statistics based on this contextual information. The scoring method is the same as that of the text-dependent system.
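As a rough illustration of position-dependent phone statistics, the sketch below groups values by phone and by position in the word, then computes a z-score for a test phone. The data layout, the function names and the durations are assumptions made for this example and do not come from TIMIT. Treating the only phone of a one-phone word as "begin" is also an arbitrary choice.

```python
from collections import defaultdict
from statistics import mean, pstdev


def position(index, length):
    """Label a phone by its place in the word: begin, middle or end."""
    if index == 0:
        return "begin"  # also covers one-phone words (an assumption)
    if index == length - 1:
        return "end"
    return "middle"


def build_stats(words):
    """Collect (mean, deviation) per (phone, position).

    `words` is a list of words, each a list of (phone, value) pairs,
    where value is a duration or an acoustic score.
    """
    samples = defaultdict(list)
    for word in words:
        for i, (phone, value) in enumerate(word):
            samples[(phone, position(i, len(word)))].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in samples.items()}


def z_score(phone, pos, value, stats):
    """Compare a test phone with the statistics for the same context."""
    mu, sigma = stats[(phone, pos)]
    return (value - mu) / sigma if sigma else 0.0


# Made-up durations in seconds for two short words.
corpus = [
    [("K", 0.08), ("AE", 0.12), ("T", 0.07)],
    [("T", 0.05), ("AE", 0.10), ("K", 0.09)],
]
stats = build_stats(corpus)
print(z_score("AE", "middle", 0.13, stats))
```

The resulting z-scores can then be normalized and aggregated in the same way as in the text-dependent system.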
Please try our demo at http://talknicer.net/~ronanki/test/ and help us by giving feedback.
Documentation and Code
All code is uploaded to the CMUSphinx svn at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/ and raw documentation of the project can be found here.
The pronunciation evaluation system helps second-language learners improve their pronunciation through repeated attempts, and lets them correct themselves by giving the necessary feedback at the phone and word level. I couldn't complete some of the things I mentioned earlier during the project, such as CART modelling, but I hope to keep contributing to this project in the future.
This summer has been a great experience for me. Google Summer of Code 2012 has finally ended. As a final note, this article is just a summary of the work done during the project; an extensive set of documentation will be updated at https://cmusphinx.github.io/wiki/faq#qhow_to_implement_pronunciation_evaluation. You can also read more about this project and the weekly progress reports at http://pronunciationeval.blogspot.in/