GSoC 2012: Grapheme to Phoneme Conversion in sphinx-4 – Project conclusions

(author: John Salatas)


This article summarizes the Grapheme-to-Phoneme (g2p) in sphinx-4 project, which was part of the GSoC 2012 program and can be thought of as an integration of the phonetisaurus [1] g2p application with both SphinxTrain and Sphinx-4. The project can be divided into three parts: the g2p model training procedure integrated into the SphinxTrain application, the Java g2p decoder integrated into Sphinx-4, and the new Java FST framework that was created for the project's needs.

The training procedure

The training procedure is based on the original phonetisaurus training procedure, using the openGRM NGram library instead of the MITLM toolkit. To use it, you first need to install the openFST [2] and openGRM NGram [3] libraries on your system and then build the SphinxTrain application, providing the --enable-g2p-decoder parameter to the script.

Training an acoustic model by following the instructions found at [4] can also train a g2p model. In addition to the steps in [4], after running the sphinxtrain -t an4 setup command, you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the generated configuration file to

$CFG_G2P_MODEL= 'yes';

With the g2p functionality enabled, the SphinxTrain application trains a new g2p model from the provided dictionary during its initial steps, and then uses that model to provide any pronunciations missing from the training transcription file.

The new java FST framework

To use the generated g2p model in Java, we needed to port the original phonetisaurus decoder to Java. As a first step, a general-purpose Java FST framework was created. It can handle FST models generated with the openFST library and contains all the FST functionality and operations required by the g2p decoder.

The Java FST framework is available in the CMUSphinx SVN repository at [5].

Using the g2p models in sphinx-4

Given the text-format files of the g2p model created with SphinxTrain (the FST text file and the input/output symbol table files), we first need to convert them to the Java FST binary format. This can be done using the script that is distributed with the Java FST framework. The script accepts two parameters: the first points to the base location (path and base filename, excluding extensions) of the trained model's text format, and the second provides the full path and filename to which the Java FST model will be saved.

After the conversion, in order to use the Java FST model, we need to add the following lines to the dictionary component in the configuration file

Notice that the "wordReplacement" property should not exist in the dictionary component. The "g2pModelPath" property should contain a URI pointing to the g2p model in Java FST format. The "g2pMaxPron" property holds the number of different pronunciations generated by the g2p decoder for each word. More information about sphinx-4 configuration can be found at [6].


In addition to the new g2p feature introduced in sphinx-4, we need to emphasize the new Java FST framework. Its usage and extensive testing in the sphinx-4 g2p decoder suggest that its implemented functionality is usable in general, although it may lack functionality required by other applications (e.g. additional operations), which in any case should not be hard to implement.

As a final note, this article is just a summary of the work done during the project; an extensive set of documentation is available at the GSoC project page [7].


[1] phonetisaurus A WFST-driven Phoneticizer

[2] OpenFst Library Home Page

[3] OpenGrm NGram Library

[4] Training Acoustic Model For CMUSphinx

[5] Java FST Framework SVN Repository

[6] Sphinx-4 Application Programmer’s Guide

[7] “GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4”

Postprocessing Framework

(author: Alex Tomescu)

The Postprocessing Framework project (part of GSoC 2012) is ready for use.

This project concentrates on capitalization and punctuation recovery tasks, based on capitalized and punctuated language models. The current accuracy for comma prediction is 35% and for period it's 39%. Capitalization is at around 94% (most of the words are lower-cased).

This project had two main parts: the language model and the main algorithm.

The language model

For the postprocessing task, the language model has to contain capitalized words and punctuation mark word tokens. In the training data, commas and periods are each replaced with a dedicated word token. Sentences should also be grouped into paragraphs so that the start-of-sentence and end-of-sentence markers (<s> and </s>) are not too frequent. The language model needs to be converted from ARPA format to DMP format with sphinx_lm_convert (or sphinx3_lm_convert).
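The training-text preparation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names `<COMMA>` and `<PERIOD>`, the function name `prepare_training_text`, and the default paragraph size are hypothetical and are not taken from the project. The naive sentence split also treats every period as a sentence end, so abbreviations would need extra handling.

```python
import re

# Hypothetical token names; the project's actual tokens may differ.
COMMA_TOKEN = "<COMMA>"
PERIOD_TOKEN = "<PERIOD>"


def prepare_training_text(text, sentences_per_paragraph=5):
    """Turn raw text into LM training lines with punctuation word tokens.

    Capitalization is left untouched. Each output line holds one paragraph,
    so a toolkit that adds sentence markers per line adds them per paragraph
    rather than per sentence.
    """
    # Split on whitespace that follows a period, keeping the period.
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]
    lines = []
    for i in range(0, len(sentences), sentences_per_paragraph):
        paragraph = " ".join(sentences[i:i + sentences_per_paragraph])
        # Pad the tokens with spaces so they become separate words.
        paragraph = paragraph.replace(",", f" {COMMA_TOKEN} ")
        paragraph = paragraph.replace(".", f" {PERIOD_TOKEN} ")
        # Collapse the extra whitespace introduced by the padding.
        lines.append(" ".join(paragraph.split()))
    return lines
```

Grouping more sentences per line lowers the relative frequency of the sentence markers, which is the effect the paragraph grouping is meant to achieve.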

The gutenberg.DMP language model is correctly formatted and can be found in the language model download section on the project's SourceForge page.

The Algorithm

The algorithm iterates through the word symbols to create word sequences, which are evaluated and put into stacks. When a stack gets full (a maximum capacity is set), it is sorted by sequence probability and the lowest-scoring part is discarded. This way, badly scoring sequences are dropped and only the best ones are kept. The final solution is the sequence that covers the whole input and has the best probability.
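The stack-based search described above can be illustrated with a small Python sketch. It is not the project's implementation: the function `recover`, the token names `<COMMA>` and `<PERIOD>`, the toy bigram table, and the choice of one stack per input position are all assumptions made for the example. The `score` callback stands in for the language model and returns a log probability for a token given its history.

```python
def recover(words, score, max_stack=50, tokens=("<COMMA>", "<PERIOD>")):
    """Return the best capitalized and punctuated sequence for `words`.

    `score(history, token)` is a stand-in for a language model query and
    must return the log probability of `token` after the tuple `history`.
    """
    stack = [((), 0.0)]  # hypotheses: (sequence so far, log probability)
    for word in words:
        next_stack = []
        for seq, logp in stack:
            # Try each casing of the word (deduplicated, order preserved).
            for form in dict.fromkeys((word.lower(), word.capitalize())):
                base = seq + (form,)
                base_logp = logp + score(seq, form)
                next_stack.append((base, base_logp))
                # Optionally follow the word with a punctuation token.
                for tok in tokens:
                    next_stack.append(
                        (base + (tok,), base_logp + score(base, tok)))
        # A full stack is sorted and its lowest-scoring part discarded.
        if len(next_stack) > max_stack:
            next_stack.sort(key=lambda h: h[1], reverse=True)
            del next_stack[max_stack:]
        stack = next_stack
    best_seq, _ = max(stack, key=lambda h: h[1])
    return list(best_seq)


# Toy bigram log probabilities (made up for illustration only).
TOY_BIGRAMS = {
    ("<s>", "Hello"): -0.1,
    ("Hello", "<COMMA>"): -0.2,
    ("<COMMA>", "world"): -0.1,
    ("world", "<PERIOD>"): -0.1,
    ("<PERIOD>", "Goodbye"): -0.1,
}


def toy_score(history, token):
    """Look up a bigram score, with a flat penalty for unseen pairs."""
    prev = history[-1] if history else "<s>"
    return TOY_BIGRAMS.get((prev, token), -5.0)


print(" ".join(recover(["hello", "world", "goodbye"], toy_score)))
```

A smaller `max_stack` makes the search faster but increases the chance that the best sequence is pruned before it can be completed.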


The project is available for download at:

To compile the project, install Apache Ant and be sure to set the required environment variables. Then type the following:


To postprocess text use the script:

sh ./ -input_text path_to_file -lm path_to_lm

Enlarging In-Domain Data Using Crawled News Articles

In the previous post, we used the Gutenberg corpus for selecting sentences that resembled Jane Eyre. We need to see how it applies to real world problems such as constructing a language model for running speech recognition on podcasts.

Enlarging In-domain Data Using Perplexity Differences for Language Model Training

We are using the web to obtain extra language model training data for a topic, but what good is that data if it does not fit our domain? A paper published by Moore et al. [1] describes an interesting and relatively cheap way of enlarging an in-domain corpus using a general corpus.