CMUSphinx Open Source Speech Recognition

Jul 10, 2012

Scoring routines for pronunciation evaluation: part1

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Week 5)

The basic scoring routine for the pronunciation evaluation system is now available at http://talknicer.net/~ronanki/test/. The output is generated for each phoneme in the phrase and displays the total score.

1. Edit-distance neighbor grammar generation:

Earlier, I did this with:

(a) a single-phone decoder
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_1phone.txt

(b) a three-phone decoder (contextual)
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_3phones.txt

(c) an entire phrase decoder with neighboring phones
http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_compgram.txt

This week, I added two more decoders: a word-decoder and a complete phrase decoI used using each phoneme at each time

word-decoder: Use sox to split each wav file into worded based on forced-alignment output and then present each word as follows.

Ex: word - "with" is presented as

public = ( (W | L | Y) (IH) (TH) );

public = ( (W) (IH | IY | AX | EH) (TH) );

public = ( (W) (IH) (TH | S | DH | F | HH) );

The accuracy turned out to be better than single-phone/three-phone decoder, same as entire phrase decoder and the output of a sample test phrase is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_words.txt

Complete phrase decoder using each phoneme: This is again more similar to entire phrase decoder. This time I supplied neighboring phones for each phoneme at each time and fixed the rest of the phonemes in the phrase. Not a good approach, takes more time to decode. But, the accuracy is better than all the previous methods. The output is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt

The code for above methods are uploaded in cmusphinx sourceforge at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/

Please follow the README file in each folder for detailed instructions on how to use them.

2. Scoring paradigm:

Phrase_wise:
The current basic scoring routine which is deployed at http://talknicer.net/~ronanki/test/ aligns the test recording with the utterance using forced alignment in sphinx and generates a phone segmentation file. Each phoneme in the file is then compared with mean, std. deviation of the respective phone in phrase_statistics (http://talknicer.net/~ronanki/phrase_data/phrase1_stats.txt) and standard scores are calculated from z-scores of acoustic_score and duration.

Random_phrase:
I also derived statistics (mean score, std. deviation score, mean duration) for each phone in CMUphoneset irrespective of context using the exemplar recordings for all the three phrases (http://talknicer.net/~ronanki/phrase_data/phrases.txt) which I have as of now. So, If a test utterance is given, I can test each phone in the random phrase with respective phone statistics.

Statistics are at : http://talknicer.net/~ronanki/phrase_data/all_phrases_stats (column count represents number of times each phone occurred)

Things to do in the upcoming week:

1. Use of an edit-distance grammar to derive standard scores such that the minimal effective training data set is required.
2. Use of the same grammar to detect the words that are having two correct different pronunciation (ex: READ/RED)
3. In a random phrase scoring method, another column can be added to store the position of each phone with respect to word (or SILence) such that each phone will have three statistics and can be compared better with the exemplar phonemes based on position.
4. Link all those modules to try to match experts' scores.
5. Provide feedback to the user with underlined mispronunciations, or numerical labels.

Future tasks:

1. Use of CART models in training to do better match of statistics for each phoneme in the test utterance with the training data based on contextual information
2. Use of phonological features instead of mel-cepstral features, which are expected to better represent the state of pronunciation.
3. Develop a complete web-based system so that end user can test their pronunciation in an efficient way.

Jul 4, 2012

Web data collection for pronunciation evaluation

(author: Troy)

(status: week 4)

[Project mentor note: I have been holding these more recent blog posts pending some issues with Adobe Flash security updates which periodically break cross-platform audio upload web browser solutions. We have decided to plan for a fail-over scheme using low-latency HTTP POST multipart/form-data binary Speex uploads to provide backup in case Flash/rtmplite fails again in the future. This might also support most of the mobile devices. Please excuse the delay and rest assured that progress continues and will continue to be announced at such time as we are confident that we won't need to contradict ourselves as browser technology for audio upload continues to develop. --James Salsman]

The data collection website now can provide basic capabilities. Anyone interested, please check out http://talknicer.net/~li-bo/datacollection/login.php and give it a try. If you encounter any problems, please let us know.

Here are my accomplishments from last week:

1) Discussed the project schema design with the project mentor and created the database with MySQL. The current schema is shown at http://talknicer.net/w/Database_schema. During the development of the user interface, slight modifications were made to refine the database schema, such as the age field in for the users table: Storing the user's birth date is much better. Other similar changes were made. I learned that good database design comes from practice, not purely imagination.

2) Implemented the two types of user registration pages: one for students and one for exemplar uploaders. To avoid redundant work and allow for fewer constraints on types of users, the registration process involves two steps: one basic registration and one extra information update. For students, only the basic one is mandatory, but the exemplar uploaders have to fill out two separate forms.

3) Added extra supporting functionality for user management, including password reset and mode selection for users with more than one type.

4) Incorporated the audio recorder with the website for recording and uploading to servers.

Jul 4, 2012

edit-distance grammar decoding using sphinx3: Part 2

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Week 4)

The source code for the functions below [1] have been uploaded to http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/
Here are some brief notes on how to use those programs:

Method 1: (phoneme decode)
Path:
neighborphones_decode/one_phoneme/
Steps To Run:
1. Use split_wav2phoneme.py to split a sample wav file in to individual phoneme wav files
$ python split_wav2phoneme.py
2. Create split.ctl file using extracted split_wav directory
$ ls split_wav/* > split.ctl $ sed -i 's/.wav//g' split.ctl
3. Run feature_extract.sh program to extract features for individual phoneme wav files
$ sh feature_extract.sh
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg.
$ sh jsgf2fsg.sh
6. Run decode_1phoneme.py to get the required output in output_decoded_phones.txt
$ python decode_1phoneme.py

Method 2: (Three phones decode)
Path:
neighborphones_decode/three_phones/
Steps To Run:
1. Use split_wav2threephones.py to split a sample wav file in to individual phoneme wav files which consists of three phones the other two being served as contextual information for the middle one.
$ python split_wav2threephones.py
2. Create split.ctl file using extracted split_wav directory
$ ls split_wav/* > split.ctl $ sed -i 's/.wav//g' split.ctl
3. Run feature_extract.sh program to extract features for individual phoneme wav files
$ sh feature_extract.sh
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg
$ sh jsgf2fsg.sh
6. Run decode_3phones.py to get the required output in output_decoded_phones.txt
$ python decode_3phones.py

Method 3: (Single/Batch phrase decode)
Path:
neighborphones_decode/phrases/
Steps To Run:
1. Construct grammar file (JSGF) using my earlier scripts from phonemes2ngbphones [2] and then use jsgf2fsg in sphinxbase to convert from JSGF to FSG which serves as input Language Model to sphinx3_decode
2. Provide the input arguments such as grammar file, feats, acoustic models etc., for the input test phrase
3. Run decode.sh program to get the required output in sample.out
$ sh decode.sh

References:

[1] edit-distance grammar decoding using sphinx3: Part 1

[2] Input string of phonemes to CMUBet neighboring phones

Jul 1, 2012

Porting openFST to java: Part 3

(author: John Salatas)

Foreword

This article, the third in a series regarding, porting openFST to java, introduces the latest update to the java code, which resolve the previously raised issues regarding the java fst architecture in general and its compatibility with the original openFST format for saving models. [1]

1. Code Changes

1.1. Simplified java generics usage

As suggested in [1], the latest java fst code revision (11456), available in the cmusphinx SVN Repository [2], assumes only the base Weight class and modifies the State, Arc and Fst classes definition to simply use a type parameter.

The above modifications provide an easier to use api. As an example the construction of a basic FST in the class edu.cmu.sphinx.fst.demos.basic.FstTest is simplified as follows

... Fst fst = new Fst();

// State 0 State s = new State(); s.AddArc(new Arc(new Weight(0.5), 1, 1, 1)); s.AddArc(new Arc(new Weight(1.5), 2, 2, 1)); fst.AddState(s);

// State 1 s = new State(); s.AddArc(new Arc(new Weight(2.5), 3, 3, 2)); fst.AddState(s);

// State 2 (final) s = new State(new Weight(3.5)); fst.AddState(s); ...

1.2. openFST models compatibilty

Besides the simplified java generics usage above, the most important change is the code to load an openFST model in text format and convert it to a java fst serialized model. This is achieved also in the latest java fst code revision (11456) [2].

2. Converting openFST models to java

2.1. Installation

The procedure below is tested on an Intel CPU running openSuSE 12.1 x64 with gcc 4.6.2, Oracle Java Platform (JDK) 7u5, and ant 1.8.2.

In order to convert an openFST model in text format to java fst model, the first step is to checkout from the cmusphinx SVN repository the latest java fst code revision:

# svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/fst

Next step is to build the java fst code
cd fst # ant jar Buildfile: /fst/build.xml jar: build-subprojects: init: [mkdir] Created dir: /fst/bin build-project: [echo] fst: /fst/build.xml [javac] /fst/build.xml:38: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 10 source files to /fst/bin [javac] /fst/build.xml:42: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds build: [jar] Building jar: /fst/fst.jar BUILD SUCCESSFUL Total time: 2 seconds #

2.2. Usage

Having completed the installation as described above, and trained an openfst model named binary.fst as described in [3], with the latest model training code revision (11455) [4] the model is also saved in the openFST text format in a file named binary.fst.txt. The conversion to a java fst model is performed using the openfst2java.sh which can be found in the root directory of the java fst code. The openfst2java.sh accepts two parameters being the openfs input text model and the java fst output model as follows:

# ./openfst2java.sh binary.fst.txt binary.fst.ser Parsing input model... Saving as binary java fst model... Import completed. Total States Imported: 1091667 Total Arcs Imported: 2652251 #

The newly generated binary.fst.ser model can then be loaded in java, as follows:

try { Fst fst = (Fst) Fst.loadModel("binary.fst.ser"); } catch (ClassNotFoundException e) { // TODO Auto-generated catch block e.printStackTrace(); } catch (IOException e) { // TODO Auto-generated catch block e.printStackTrace(); }

3. Performance: Memory Usage

Testing the conversion and loading of the cmudict.fst model generated in [3], reveal that the conversion task requires about 1.0GB and the loading of the converted model requires about 900MB of RAM.

4. Conclusion – Future Works

Having the ability to convert and load an openFST model in java, takes the “Letter to Phoneme Conversion in CMU Sphinx-4” project to the next step, which is the port of phonetisaurus decoder to java which will eventually lead to its integration with cmusphinx 4.

A major concern at this point is the high memory utilization while loading large models. Although it is expected for java applications to consume more memory compared to a similar C++ application, this could be a problem especially when running in low end machines and needs further investigation and optimization (if possible).

References

[1] Porting openFST to java: Part 2

[2] Java fst SVN (Revision 11456)

[3] Automating the creation of joint multigram language models as WFST: Part 2

[4] openFST model training SVN (Revision 11455)

Newer

Older

Page 18 of 37