GSoC 2012: Pronunciation Evaluation #Troy - Project Conclusions

(author: Troy Lee)

This article briefly summarizes the design and implementation of the Pronunciation Evaluation Web Portal for the GSoC 2012 Pronunciation Evaluation Project.

The pronunciation evaluation system consists mainly of the following components:

1) Database management module: Store, retrieve and update all the necessary information, including user information and data such as phrases, words, correct pronunciations and assessment scores.

2) User management module: Handle new user registration, information updates, password changes and resets, and so on.

3) Audio recording and playback module: Record the user's pronunciation for further processing and play it back.

4) Exemplar verification module: Determine whether a given recording is an exemplar or not.

5) Pronunciation assessment module: Provide numerical evaluation at the phoneme level (which could be aggregated to form higher level evaluation scores) in both acoustic and duration aspects.

6) Phrase library module: Allow users to add new phrases to the database for evaluation.

7) Human evaluation module: Let human experts evaluate the users' pronunciations, so that their judgments can be compared with the automatically generated evaluations.
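
As an illustration of how the phoneme-level scores from the assessment module could be aggregated into higher-level scores, here is a minimal Python sketch. The tuple layout, the function name `aggregate_scores` and the use of a plain arithmetic mean are all assumptions made for this example; the portal's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(phoneme_scores):
    """Aggregate phoneme-level scores to word and phrase level.

    phoneme_scores: list of (word_index, phoneme, acoustic, duration)
    tuples. This layout is hypothetical, chosen for the example.
    Returns (word_scores, phrase_score), where each score is an
    (acoustic, duration) pair of arithmetic means.
    """
    per_word = defaultdict(list)
    for word_index, _phoneme, acoustic, duration in phoneme_scores:
        per_word[word_index].append((acoustic, duration))

    # Word level: average the scores of the word's phonemes.
    word_scores = {
        idx: (mean(a for a, _ in vals), mean(d for _, d in vals))
        for idx, vals in per_word.items()
    }
    # Phrase level: average the word-level scores.
    phrase_score = (
        mean(a for a, _ in word_scores.values()),
        mean(d for _, d in word_scores.values()),
    )
    return word_scores, phrase_score


if __name__ == "__main__":
    scores = [(0, "HH", 0.8, 0.6), (0, "AH", 0.6, 0.8), (1, "W", 1.0, 1.0)]
    print(aggregate_scores(scores))
```

Keying by word position rather than by the word itself keeps repeated words in a phrase separate.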

The website is available for testing. Do let me know if you encounter any problems, as the site needs quite a lot of testing before it works robustly. The complete setup of the website, and a more detailed, manual-like report on its functionality and implementation, are also available.

Although this is the end of this GSoC, it is just the start of our project, which leverages open source tools to improve people's lives around the world using speech technologies. We are currently preparing to use Amazon Mechanical Turk to collect more exemplar data through our web portal, in order to build a rich database for improved pronunciation evaluation performance, and to make learning much more fun through gamification.

GSoC 2012: Grapheme to Phoneme Conversion in sphinx-4 – Project conclusions

(author: John Salatas)


This article summarizes the Grapheme-to-Phoneme (g2p) in sphinx-4 project, which was part of the GSoC 2012 program and can be thought of as an integration of the phonetisaurus [1] g2p application with both SphinxTrain and Sphinx-4. The project can be divided into three parts: the g2p model training procedure integrated into the SphinxTrain application, the java g2p decoder integrated into Sphinx-4, and the new FST framework in java which was created for the project's needs.

The training procedure

The training procedure is based on the original phonetisaurus training procedure, but uses the openGRM NGram library instead of the MITLM toolkit. To use it, you first need to install the openFST [2] and openGRM NGram [3] libraries on your system, and then build the SphinxTrain application, passing the --enable-g2p-decoder parameter to the script.

While training an acoustic model following the instructions in [4], you can also train a g2p model. In addition to the steps in [4], after running the sphinxtrain -t an4 setup command, you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the same file to

$CFG_G2P_MODEL= 'yes';

With the g2p functionality enabled, the SphinxTrain application will train a new g2p model in its initial steps, based on the provided dictionary, and will then use it to supply any pronunciations that are missing for words in the training transcription file.
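
To make the idea of "missing pronunciations" concrete, the following Python sketch lists the transcription words that have no dictionary entry, which are the words a g2p model would be asked to pronounce. It is not SphinxTrain code. The function names are hypothetical, and the assumed file layouts are the usual CMU dictionary lines (`WORD PH1 PH2 ...`, alternates marked as `WORD(2)`) and transcription lines of the form `<s> WORDS </s> (utterance_id)`.

```python
def load_dictionary_words(dict_lines):
    """Collect the set of words that have at least one pronunciation."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue
        # Drop the alternate-pronunciation suffix, e.g. HELLO(2) -> HELLO.
        words.add(parts[0].split("(")[0])
    return words


def missing_words(transcript_lines, known_words):
    """Return the sorted transcript words that are not in the dictionary."""
    missing = set()
    for line in transcript_lines:
        for token in line.split():
            # Skip sentence markers (<s>, </s>) and the (utterance_id) field.
            if token.startswith("<") or token.startswith("("):
                continue
            if token not in known_words:
                missing.add(token)
    return sorted(missing)


if __name__ == "__main__":
    known = load_dictionary_words(
        ["HELLO HH AH L OW", "HELLO(2) HH EH L OW", "WORLD W ER L D"]
    )
    print(missing_words(["<s> HELLO BRAVE NEW WORLD </s> (utt_001)"], known))
```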

The new java FST framework

In order to use the generated g2p model in java, we needed to port the original phonetisaurus decoder to java. As a first step, a general-purpose java FST framework was created. It can handle FST models generated with the openFST library and contains all the FST functionality and operations needed by the g2p decoder.

The java FST framework is available in the CMUSphinx SVN repository [5].

Using the g2p models in sphinx-4

Given the text-format files of the g2p model created with SphinxTrain (the FST text file and the input/output symbol table files), we first need to convert them to the java FST binary format. This can be done using the script which is distributed with the java FST framework. The script accepts two parameters: the first points to the base location (path and base filename, excluding extensions) of the trained model's text-format files, and the second gives the full path and filename to which the java FST model will be saved.
After the conversion, in order to use the java FST model, we need to add the g2p properties to the dictionary component in the configuration file.

Note that the "wordReplacement" property should not exist in the dictionary component. The property "g2pModelPath" should contain a URI pointing to the g2p model in java FST format. The property "g2pMaxPron" holds the number of different pronunciations generated by the g2p decoder for each word. More information about sphinx-4 configuration can be found in [6].


Beyond the new g2p feature introduced in sphinx-4, we want to emphasize the new java FST framework. Its usage and extensive testing in the sphinx-4 g2p decoder suggest that its implemented functionality is usable in general, although it may lack functionality required by other applications (e.g. additional operations), which in any case should not be hard to implement.

As a final note, this article is just a summary of the work done during the project; an extensive set of documentation is available at the GSoC project page [7].


[1] Phonetisaurus: A WFST-driven Phoneticizer

[2] OpenFst Library Home Page

[3] OpenGrm NGram Library

[4] Training Acoustic Model For CMUSphinx

[5] Java FST Framework SVN Repository

[6] Sphinx-4 Application Programmer’s Guide

[7] GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4

Postprocessing Framework

(author: Alex Tomescu)

The Postprocessing Framework project (part of GSoC 2012) is ready for use.

This project concentrates on capitalization and punctuation recovery tasks, based on capitalized and punctuated language models. The current accuracy for comma prediction is 35% and for period it's 39%. Capitalization is at around 94% (most of the words are lower-cased).

This project had two main parts: the language model and the main algorithm.

The language model

For the postprocessing task, the language model has to contain capitalized words and punctuation marks as word tokens. In the training data, commas and periods are each replaced with a dedicated word token. Sentences should also be grouped into paragraphs so that the start and end of sentence markers (<s> and </s>) are not very frequent. The language model needs to be converted from ARPA format to DMP format with sphinx_lm_convert (or sphinx3_lm_convert).
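
The training-data preparation described above could look roughly like the following Python sketch. The token spellings `<COMMA>` and `<PERIOD>` are placeholders chosen for this example, not necessarily the tokens used by the project. Wrapping each paragraph in one pair of sentence markers is likewise an assumption about how the grouping might be done.

```python
# Hypothetical token spellings, used only for illustration.
COMMA_TOKEN = "<COMMA>"
PERIOD_TOKEN = "<PERIOD>"


def tokenize_punctuation(text):
    """Replace commas and periods with stand-alone word tokens."""
    text = text.replace(",", f" {COMMA_TOKEN} ")
    text = text.replace(".", f" {PERIOD_TOKEN} ")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


def to_training_lines(paragraphs):
    """Emit one training line per paragraph, so that sentence markers
    appear once per paragraph rather than once per sentence."""
    return [
        f"<s> {tokenize_punctuation(p)} </s>"
        for p in paragraphs
        if p.strip()
    ]


if __name__ == "__main__":
    for line in to_training_lines(["Hello, world. It works."]):
        print(line)
```

This naive replacement treats every period as a sentence end, so abbreviations and decimal numbers would need extra handling in real data. Capitalization is deliberately left untouched, since the model has to learn it.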

The gutenberg.DMP language model is correctly formatted and can be found in the language model download section on the project's SourceForge page.

The Algorithm

The algorithm iterates through the word symbols to create word sequences, which are evaluated and put into stacks. When a stack gets full (a maximum capacity is set), it is sorted by sequence probability and the lowest-scoring part is discarded. This way badly scoring sequences are discarded and only the best ones are kept. The final solution is the sequence with the same size as the input that has the best probability.
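
The stack-and-prune idea can be sketched in a few lines of Python. This is a simplified illustration, not the project's implementation: it keeps a single stack and prunes it after every input word, the candidate set in `expand` is an assumption, the token spellings are placeholders, and `toy_score` is a hand-written stand-in for a real language model score.

```python
COMMA, PERIOD = "<COMMA>", "<PERIOD>"  # hypothetical token spellings


def expand(word):
    """Candidate renderings of one input word: lower-cased or capitalized,
    optionally followed by a punctuation token (assumed candidate set)."""
    forms = list(dict.fromkeys([word.lower(), word.capitalize()]))
    candidates = []
    for form in forms:
        candidates.append([form])
        candidates.append([form, COMMA])
        candidates.append([form, PERIOD])
    return candidates


def postprocess(words, score, capacity=50):
    """Extend every kept sequence by each candidate, then sort by score
    and keep only the `capacity` best sequences."""
    stack = [(0.0, [])]
    for word in words:
        extended = []
        for _, seq in stack:
            for ext in expand(word):
                new_seq = seq + ext
                extended.append((score(new_seq), new_seq))
        # Sort best-first and discard the lowest-scoring part.
        extended.sort(key=lambda item: item[0], reverse=True)
        stack = extended[:capacity]
    # Every kept sequence covers the whole input; return the best one.
    return stack[0][1]


def toy_score(seq):
    """Stand-in for a language model: rewards a capital letter at the
    start of a sentence, lower case elsewhere, and penalizes punctuation."""
    total = 0.0
    prev = PERIOD  # treat the start of the text as a sentence boundary
    for tok in seq:
        if tok in (COMMA, PERIOD):
            total -= 0.5
        elif prev == PERIOD:
            total += 1.0 if tok[0].isupper() else 0.0
        else:
            total += 1.0 if tok.islower() else 0.0
        prev = tok
    return total


if __name__ == "__main__":
    print(postprocess(["hello", "world"], toy_score, capacity=4))
```

A small `capacity` makes the search faster but may discard a sequence that would have scored best once later words were added, which is the usual trade-off of this kind of pruning.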


The project is available for download at:

To compile the project, install Apache Ant and be sure to set the required environment variables. Then type the following:


To postprocess text use the script:

sh ./ -input_text path_to_file -lm path_to_lm

Enlarging In-Domain Data Using Crawled News Articles

In the previous post, we used the Gutenberg corpus for selecting sentences that resembled Jane Eyre. We need to see how it applies to real world problems such as constructing a language model for running speech recognition on podcasts.