GSoC 2012: Pronunciation Evaluation using CMUSphinx – Project Conclusions

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Final Report)

This article briefly summarizes the implementation of GSoC 2012 Pronunciation Evaluation project.

Primarily, I started with Sphinx forced alignment and obtained spectral-matching acoustic scores and durations at the phone and word level using WSJ models. After that, I concentrated mainly on two things as part of the GSoC 2012 project: edit-distance neighbor-phone decoding, and scoring routines for both text-dependent and text-independent systems.

Edit-distance Neighbor phones decoding:

1. I started with a single-phone decoder and then extended it to a three-phone decoder, a word decoder, and a complete-phrase decoder, in each case providing neighboring phones as alternatives to the expected phone.
2. The decoding results showed that word-level and phrase-level decoding using JSGF grammars produce almost the same output.
3. This method helps detect mispronunciations at the phone level, and can also detect homographs if the decoding error rate can be reduced.
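The idea behind point 1 can be illustrated with a JSGF-style grammar in which each expected phone is grouped with its neighbors as alternatives. The sketch below is hypothetical: the neighbor table and phone sequence are made-up examples, not the project's actual confusion data.

```python
# Sketch: build a JSGF grammar offering neighbor phones as alternatives
# to each expected phone in a phrase. NEIGHBORS is an illustrative
# confusion table, not real data from the project.

NEIGHBORS = {
    "AH": ["AX", "ER"],
    "T":  ["D", "K"],
    "EH": ["IH", "AE"],
}

def jsgf_alternatives(phones):
    """Render each phone with its neighbors as a JSGF alternative group."""
    parts = []
    for p in phones:
        alts = [p] + NEIGHBORS.get(p, [])
        parts.append("( " + " | ".join(alts) + " )")
    return " ".join(parts)

def build_grammar(name, phones):
    body = jsgf_alternatives(phones)
    return f"#JSGF V1.0;\ngrammar {name};\npublic <phrase> = {body};"

print(build_grammar("neighbors", ["T", "EH", "S", "T"]))
```

The decoder's best path through such a grammar then reveals, phone by phone, whether the speaker produced the expected phone or one of its neighbors.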

Scoring Routines:

Text-dependent:
This method is based on exemplar recordings for each phrase. First, the mean acoustic score and mean duration, along with their deviations, are calculated for each phone in the phrase from the exemplar recordings. Given a test recording, each phone in the phrase is then compared with the exemplar statistics: z-scores are computed and then normalized using the maximum and minimum z-scores observed in the exemplar recordings. The phone scores are aggregated into word scores, and the word scores are aggregated with part-of-speech (POS) weights into a complete phrase score.
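A rough sketch of this scoring scheme follows; the statistics, the clamping, and the direction of the normalization are my own illustrative assumptions, not the project's exact formulas.

```python
# Sketch of the text-dependent scoring: per-phone z-scores against exemplar
# statistics, normalized by the exemplar z-score range. All numbers and the
# normalization direction are illustrative assumptions.

def z_score(value, mean, std):
    return (value - mean) / std if std > 0 else 0.0

def normalized_score(z, z_min, z_max):
    """Map a z-score onto [0, 1] using the min/max z-scores seen in exemplars."""
    if z_max == z_min:
        return 1.0
    z = max(min(z, z_max), z_min)   # clamp to the exemplar range
    return (z_max - z) / (z_max - z_min)

# Exemplar statistics per phone: (mean acoustic score, std, z_min, z_max)
STATS = {"T": (-2.0, 0.5, -1.5, 1.5), "EH": (-3.0, 1.0, -2.0, 2.0)}

def phone_score(phone, acoustic_score):
    mean, std, z_min, z_max = STATS[phone]
    return normalized_score(z_score(acoustic_score, mean, std), z_min, z_max)

def word_score(phones):
    """Aggregate phone scores into a word score (simple mean here)."""
    vals = [phone_score(p, s) for p, s in phones]
    return sum(vals) / len(vals)
```

Word scores would then be combined with POS weights into the phrase score, as described above.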

Text-independent:
This method is based on statistics precomputed from a speech corpus. In this project, I used the TIMIT corpus to build statistics for each phone based on its position (beginning/middle/end) within the word. Given any test file, each phone's acoustic score and duration are compared with the corresponding phone statistics selected by this contextual information. The scoring method is the same as in the text-dependent system.
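The position-dependent lookup could be organized along these lines; the statistics and field names below are invented for illustration, not actual TIMIT values.

```python
# Sketch: text-independent statistics keyed by (phone, position-in-word).
# The numbers are placeholders, not real TIMIT statistics.

STATS = {
    ("T", "begin"):  {"mean_score": -2.1, "mean_dur": 0.06},
    ("T", "middle"): {"mean_score": -2.4, "mean_dur": 0.05},
    ("T", "end"):    {"mean_score": -2.8, "mean_dur": 0.07},
}

def position_in_word(index, length):
    """Classify a phone's position within its word."""
    if index == 0:
        return "begin"
    if index == length - 1:
        return "end"
    return "middle"

def lookup(phone, index, word_length):
    """Fetch the statistics that match this phone's position context."""
    return STATS[(phone, position_in_word(index, word_length))]
```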

Demo:
Please try our demo at http://talknicer.net/~ronanki/test/ and help us by giving feedback.

Documentation and Code:
All code is uploaded to the CMUSphinx SVN at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/ and raw documentation of the project can be found here.

Conclusions:
The pronunciation evaluation system really helps second-language learners improve their pronunciation: they can try a phrase multiple times and correct themselves using the feedback given at the phone and word level. I couldn't complete some of the things I mentioned earlier during the project, such as CART modelling, but I hope to keep contributing to this project in the future.

This summer has been a great experience for me. Google Summer of Code 2012 has finally ended. As a final note, this article is just a summary of the work done during the project; an extensive set of documentation will be posted at https://cmusphinx.github.io/wiki/faq#qhow_to_implement_pronunciation_evaluation. You can also read more about this project and its weekly progress reports at http://pronunciationeval.blogspot.in/

GSoC 2012: Pronunciation Evaluation #Troy - Project Conclusions

(author: Troy Lee)

This article briefly summarizes the design and implementation of the Pronunciation Evaluation Web Portal for the GSoC 2012 Pronunciation Evaluation project.

The pronunciation evaluation system mainly consists of the following components:

1) Database management module: Store, retrieve and update all the necessary information, including both user information and data such as phrases, words, correct pronunciations, and assessment scores.

2) User management module: New user registration, information updates, password change/reset, and so on.

3) Audio recording and playback module: Recording the user's pronunciation for further processing.

4) Exemplar verification module: Determine whether a given recording is an exemplar or not.

5) Pronunciation assessment module: Provide numerical evaluations at the phoneme level (which can be aggregated into higher-level evaluation scores) in both the acoustic and duration aspects.

6) Phrase library module: Allow users to add new phrases to the database for evaluation.

7) Human evaluation module: Allow human experts to evaluate users' pronunciations, so their judgments can be compared with the automatically generated evaluations.

The website can be tested at http://talknicer.net/~li-bo/datacollection/login.php. Do let me know (troy.lee2008@gmail.com) if you encounter any problems, as the site needs quite a lot of testing before it works robustly. The complete setup of the website can be found at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy/. More detailed functionality and implementation notes can be found in a more manual-like report.

Although this is the end of GSoC, it is just the start of our project of leveraging open source speech technologies to improve people's lives around the world. We are currently preparing to use Amazon Mechanical Turk to collect more exemplar data through our web portal, both to build a rich database for improved pronunciation evaluation performance and to make learning much more fun through gamification.

GSoC 2012: Grapheme to Phoneme Conversion in Sphinx-4 – Project Conclusions

(author: John Salatas)

Foreword

This article summarizes the Grapheme-to-Phoneme (g2p) conversion in Sphinx-4 project, which was part of the GSoC 2012 program and can be thought of as an integration of the Phonetisaurus [1] g2p application with both SphinxTrain and Sphinx-4. The project can be divided into three parts: the g2p model training procedure, integrated into the SphinxTrain application; the Java g2p decoder, integrated into Sphinx-4; and finally the new Java FST framework created for the project's needs.

The training procedure

The training procedure is based on the original Phonetisaurus training procedure, using the OpenGrm NGram library instead of the MITLM toolkit. To use it, you first need to install the OpenFst [2] and OpenGrm NGram [3] libraries on your system and then build the SphinxTrain application, providing the --enable-g2p-decoder parameter to the autogen.sh script.

Training an acoustic model by following the instructions found at [4] can also train a g2p model. In addition to the steps in [4], after running the sphinxtrain -t an4 setup command, you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the generated configuration file to

$CFG_G2P_MODEL = 'yes';

With the g2p functionality enabled, the SphinxTrain application will, in its initial steps, train a new g2p model based on the provided dictionary, and will then use it to provide any missing pronunciations in the training transcription file.

The new java FST framework

To use the generated g2p model in Java, we needed to port the original Phonetisaurus decoder to Java. As a first step, a general-purpose Java FST framework was created, which is capable of handling FST models generated with the OpenFst library and which contains all the FST functionality and operations needed by the g2p decoder.

The Java FST framework is available in the CMUSphinx SVN repository [5].

Using the g2p models in sphinx-4

Given the text-format files of the g2p model created with SphinxTrain (the FST text file and the input/output symbol table files), we first need to convert them to the Java FST binary format. This can be done with the openfst2java.sh script, which is distributed with the Java FST framework. The script accepts two parameters: the first points to the base location (path and base filename, excluding extensions) of the trained model's text files, and the second gives the full path and filename to which the Java FST model will be saved.
After the conversion, in order to use the Java FST model, we need to add the following lines to the dictionary component in the configuration file.
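The configuration lines themselves did not survive in this copy of the post; based on the properties described below, a dictionary component might look roughly like this (the component type, URI, and values are illustrative, not the project's actual configuration):

```xml
<component name="dictionary"
           type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <!-- no "wordReplacement" property here -->
    <property name="g2pModelPath" value="file:///path/to/model.fst.ser"/>
    <property name="g2pMaxPron" value="2"/>
</component>
```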

Note that the "wordReplacement" property should not be present in the dictionary component. The "g2pModelPath" property should contain a URI pointing to the g2p model in Java FST format, and "g2pMaxPron" holds the maximum number of different pronunciations the g2p decoder generates for each word. More information about Sphinx-4 configuration can be found at [6].

Conclusion

Beyond the new g2p feature introduced in Sphinx-4, the new Java FST framework deserves emphasis. Its usage and extensive testing in the Sphinx-4 g2p decoder suggest that the implemented functionality is usable in general, although the framework may lack functionality required by other applications (e.g. additional FST operations), which in any case should not be hard to implement.

As a final note, the current article is just a summary of the work during the project, an extensive set of documentation is available at the GSoC project page [7].

References

[1] phonetisaurus A WFST-driven Phoneticizer

[2] OpenFst Library Home Page

[3] OpenGrm NGram Library

[4] Training Acoustic Model For CMUSphinx

[5] Java FST Framework SVN Repository

[6] Sphinx-4 Application Programmer’s Guide

[7] “GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4”

Postprocessing Framework

(author: Alex Tomescu)

The Postprocessing Framework project (part of GSoC 2012) is ready for use.

This project concentrates on the capitalization and punctuation recovery tasks, based on capitalized and punctuated language models. The current accuracy for comma prediction is 35%, and for periods it is 39%. Capitalization accuracy is around 94% (most words are lower-cased).

This project had two main parts: the language model and the main algorithm.

The language model

For the postprocessing task, the language model has to contain capitalized words and punctuation-mark word tokens: in the training data, commas and periods are replaced with their corresponding word tokens. Sentences should also be grouped into paragraphs, so that the start- and end-of-sentence markers (<s> and </s>) are not very frequent. The language model then needs to be converted from ARPA format to DMP format with sphinx_lm_convert (or sphinx3_lm_convert).
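A minimal sketch of this preprocessing step, assuming hypothetical token names (<COMMA>, <PERIOD>) since the exact tokens are not given here:

```python
# Sketch: format raw text for the punctuation/capitalization LM described
# above. The token names <COMMA> and <PERIOD> are assumptions for
# illustration; <s>/</s> are the usual ARPA sentence markers.

def prepare_line(line):
    """Replace punctuation characters with word tokens."""
    return (line.replace(",", " <COMMA>")
                .replace(".", " <PERIOD>"))

def prepare_paragraph(sentences):
    """Join several sentences into one <s> ... </s> unit."""
    body = " ".join(prepare_line(s) for s in sentences)
    return "<s> " + body + " </s>"

print(prepare_paragraph(["Hello, world.", "Goodbye."]))
```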

The gutenberg.DMP language model is correctly formatted and can be found in the language model download section on the project's sourceforge page (https://sourceforge.net/projects/cmusphinx/files/).

The Algorithm

The algorithm iterates through word symbols to build word sequences, which are scored and placed into stacks. When a stack gets full (a maximum capacity is set), it is sorted by sequence probability and the lowest-scoring part is discarded. This way, poorly scoring sequences are pruned and only the best ones are kept. The final solution is the highest-probability sequence with the same size as the input.
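The stack search can be sketched as follows; the expansion and scoring functions are placeholders for the real candidate generation and language-model scoring, not the project's actual code.

```python
# Sketch of the stack search described above: candidate sequences are
# extended word by word, scored, and pruned whenever a stack exceeds its
# capacity. `expansions` and `score` stand in for candidate generation
# (case variants, punctuation tokens) and LM probability.

import heapq

def stack_search(words, expansions, score, stack_size=64):
    stacks = [[] for _ in range(len(words) + 1)]
    stacks[0] = [((), 0.0)]                      # empty hypothesis
    for i, word in enumerate(words):
        for seq, _ in stacks[i]:
            for cand in expansions(word):
                new_seq = seq + (cand,)
                stacks[i + 1].append((new_seq, score(new_seq)))
        # keep only the best-scoring partial sequences
        stacks[i + 1] = heapq.nlargest(stack_size, stacks[i + 1],
                                       key=lambda e: e[1])
    best_seq, _ = max(stacks[-1], key=lambda e: e[1])
    return list(best_seq)
```

For example, with an expansion function that offers lower-case and capitalized variants of each word, the search returns the casing pattern the scoring function prefers.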

Usage:

The project is available for download at:

https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/ppf

To compile the project, install Apache Ant and be sure to set the required environment variables. Then type the following:

ant

To postprocess text, use the postprocessing.sh script:

sh ./postprocessing.sh -input_text path_to_file -lm path_to_lm