[GSoC 2012: Pronunciation Evaluation #ronanki] Work prior to Official Start

Well, it has been a month since I was accepted into this year's Google Summer of Code. The community bonding period within the CMU Sphinx organization has been a great time for me.

It has been four days since GSoC 2012 officially started. Before that, I familiarized myself with a few different things with the help of my mentor. He created a wiki page for our projects at https://cmusphinx.github.io/wiki/pronunciation_evaluation. Troy and I are going to blog here and update the wiki there during this summer, so please check here for important updates.

Currently, my goal is to build a web interface which allows users to evaluate their pronunciation. Some of the sub-tasks have already been accomplished, and some of them are still ongoing:

Work accomplished:

  • Created an initial web interface which allows users to record and play back their speech using the open source wami-recorder, which is being developed by the Spoken Language Systems group at MIT.
  • When the recording is completed, the wave file is uploaded to the server for processing.
  • Sphinx3 forced alignment is used to align a phoneme string expected from the utterance with the recorded speech, to calculate time endpoints and acoustic scores for each phoneme.
  • I tried many different output arguments of sphinx3_align, as described at https://cmusphinx.github.io/wiki/sphinx4:sphinxthreealigner, and successfully tested producing the phoneme acoustic scores using two recognition passes.
    • In the first pass, I use -phlabdir as an argument to get a .lab file as output, which contains the list of recognized phonemes.
    • In the second pass, I use that list to get acoustic scores for each phoneme using -wdsegdir as an input argument.
  • Later, I integrated sphinx3 forced alignment with the wami-recorder microphone recording applet so that the user sees the acoustic scores after uploading their recording.
  • Please try this link to test it: http://talknicer.net/~ronanki/test/
  • Wrote a program to convert the project mentor's list of each phoneme's "neighbors," or most similar other phonemes, from the Worldbet phonetic alphabet to CMUbet.
  • Wrote a program to take a string of phonemes representing an expected utterance as input and produce a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes for automatic edit distance scoring.
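
As a rough illustration of the last item, the sketch below expands an expected phoneme string into a sequence of alternatives. The neighbor table and the parenthesized `(A | B)` notation are assumptions made for illustration only; they are neither the mentor's actual neighbor list nor the grammar file format that sphinx3 reads.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NeighborGrammarSketch {
    // Hypothetical neighbor table (illustrative values, not the real list).
    static final Map<String, List<String>> NEIGHBORS = new LinkedHashMap<>();
    static {
        NEIGHBORS.put("IH", List.of("IY", "EH"));
        NEIGHBORS.put("T", List.of("D"));
    }

    // Builds one slot of alternatives per expected phoneme:
    // the expected phoneme first, followed by its neighbors.
    static String alternatives(List<String> expected) {
        List<String> slots = new ArrayList<>();
        for (String phoneme : expected) {
            List<String> options = new ArrayList<>();
            options.add(phoneme);
            options.addAll(NEIGHBORS.getOrDefault(phoneme, List.of()));
            slots.add("(" + String.join(" | ", options) + ")");
        }
        return String.join(" ", slots);
    }

    public static void main(String[] args) {
        // A phoneme without neighbors in the table stays a single option.
        System.out.println(alternatives(List.of("IH", "T", "S")));
    }
}
```

A recognizer constrained by such alternatives can report which option matched best in each slot, which is what makes edit distance scoring against the expected string possible.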

Ongoing work:

  • Reading about Worldbet, OGIbet, ARPAbet, and CMUbet, the different ASCII-based phonetic alphabets and their mappings between each other and the International Phonetic Alphabet.
  • Will be enhancing the first pass of recognition described above using the generated alternative neighboring phoneme grammars to find phonemes which match the recorded speech more closely than the expected phonemes without using complex post-processing acoustic score statistics.
  • Trying more parameters and options to derive acoustic scores for each phoneme from sphinx3 forced alignment.
  • Writing exemplar score aggregation algorithms to find the means, standard deviations, and their expected error for each phoneme in a phrase from a set of recorded exemplar pronunciations of that phrase.
  • Writing an algorithm which can detect mispronunciations by comparing a recording's acoustic scores to the expected mean and standard deviation for each phoneme, and aggregating those scores to biphones, words, and the entire phrase.
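
The last two items can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the "expected error" is taken to be the standard error of the mean, a phoneme is flagged when its score falls more than k standard deviations below the exemplar mean, and all names and numbers are hypothetical rather than part of the project code.

```java
class ExemplarScoreSketch {
    // Mean acoustic score of one phoneme over all exemplar recordings.
    static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) sum += s;
        return sum / scores.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two exemplars.
    static double stdDev(double[] scores) {
        double m = mean(scores);
        double squares = 0.0;
        for (double s : scores) squares += (s - m) * (s - m);
        return Math.sqrt(squares / (scores.length - 1));
    }

    // Assumed meaning of "expected error": standard error of the mean.
    static double standardError(double[] scores) {
        return stdDev(scores) / Math.sqrt(scores.length);
    }

    // Flags a score lying more than k standard deviations below the mean.
    static boolean isSuspect(double score, double[] exemplars, double k) {
        return score < mean(exemplars) - k * stdDev(exemplars);
    }

    public static void main(String[] args) {
        double[] exemplars = {-2.0, -4.0, -6.0}; // made-up scores for one phoneme
        System.out.println("mean = " + mean(exemplars));
        System.out.println("sd   = " + stdDev(exemplars));
        System.out.println("suspect(-9) = " + isSuspect(-9.0, exemplars, 2.0));
    }
}
```

Per-phoneme flags or normalized scores of this kind could then be averaged over the phonemes of a biphone, a word, or the whole phrase.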

[GSoC 2012: Pronunciation Evaluation #Troy] Before Week 1

Google Summer of Code 2012 officially started this Monday (21 May). Our weekly reports are expected to begin next Monday, but here is a brief overview of the preparations we have accomplished during the "community bonding period."

We started with a group chat including our mentor James and the other student, Ronanki. The project details are becoming clearer to me from the chat and subsequent email communications. For my project, the major focuses will be:
1) A web portal for automatic pronunciation evaluation audio collection; and
2) An Android-based mobile automatic pronunciation evaluation app.
The core of these two applications is edit-distance-grammar-based automatic pronunciation evaluation using CMU Sphinx3.

Here are the preparations I have accomplished during the bonding period:

  1. Trying out the basic wami-recorder demo on my school's server;
  2. Adapting rtmplite for audio recording. Rtmplite is a Python implementation of an RTMP server with the minimum support needed for real-time streaming and recording using Adobe's AMF0 protocol. On the server side, the RTMP server daemon process listens on TCP port 1935 by default for connections and media data streaming. On the client side, the Flash client uses ActionScript 3's NetConnection class to set up a session with the server, and the NetStream class for audio and video streaming as well as microphone recording. The demo application has been set up at: http://talknicer.net/~li-bo/testClient/bin-debug/testClient.html
  3. Based on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started to write my own audio recorder, which is a key user interface component for both the web-based audio data collection and the evaluation app. The basic version of the recorder is hosted at: http://talknicer.net/~li-bo/audioRecorder/audioRecorder.html . The current implementation:
    1. Distinguishes recordings from different users with user IDs;
    2. Loads pre-defined text sentences to display for recording, which will be useful for pronunciation exemplar data collection;
    3. Performs real-time audio recording;
    4. Can play back the recordings from the server; and
    5. Has basic event control logic, such as to prevent users from recording and playing at the same time, etc.
  4. I have also learned from https://cmusphinx.github.io/wiki/sphinx4:sphinxthreealigner how to get phoneme acoustic scores from "forced alignment" using sphinx3. Two steps are needed to generate the phoneme alignment scores. The details of how to perform that alignment can be found in my more tech-oriented posts at http://troylee2008.blogspot.com/2012/05/testing-cmusphinx3-alignment.html and http://troylee2008.blogspot.com/2012/05/cmusphinx3-phoneme-alignment.html on my personal blog.

Currently, these tasks are ongoing:

  1. Set up the server side process to manage user recordings, i.e., distinguishing between users and different utterances.
  2. Figure out how to use ffmpeg, speexdec, and/or sox to automatically convert the recorded server side FLV files to PCM .wav files after the users upload the recordings.
  3. Verify the recording parameters against the recording and speech recognition quality, possibly taking the network bandwidth into consideration.
  4. Incorporate delays between network and microphone events in the recorder. The current version does not wait for network events (such as connection setup, data packet transmission, etc.) to finish successfully before processing the next user event, which can often cause the recordings to be clipped.
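
For the conversion task in item 2, one possible direction is to call ffmpeg from the server-side code. The sketch below only assembles and runs such a command. The 16 kHz mono 16-bit PCM target is an assumption about what the recognizer expects, the class and method names are hypothetical, and it presumes an ffmpeg build that can decode the audio codec stored in the FLV files.

```java
import java.util.List;

class FlvToWavSketch {
    // Builds an ffmpeg command line: drop video (-vn), decode audio to
    // 16-bit PCM (-acodec pcm_s16le), resample to 16 kHz (-ar) mono (-ac 1).
    static List<String> ffmpegCommand(String flvPath, String wavPath) {
        return List.of("ffmpeg", "-y", "-i", flvPath,
                "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
                wavPath);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: FlvToWavSketch <input.flv> <output.wav>");
            return;
        }
        // Runs ffmpeg and forwards its console output to this process.
        Process p = new ProcessBuilder(ffmpegCommand(args[0], args[1]))
                .inheritIO()
                .start();
        System.exit(p.waitFor());
    }
}
```

Whether sox or speexdec would be a better fit depends on the codec the recorder actually uses, which is exactly what this task still has to establish.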

My GSoC Project Page: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/troylee2008/1

Porting openFST to java: Part 2

(author: John Salatas)


This article, the second in a series on porting openFST to Java, briefly presents some additional base classes and raises some issues regarding the Java fst architecture in general and its compatibility with the original openFST binary format for saving models.

1. FST java library base architecture
The first article in the series [1] introduced the Weight class, the semiring operations related classes and interfaces and finally a concrete class implementation for the tropical semiring based on float weights. The last svn commit (revision 11363) [2] includes the edu.cmu.sphinx.fst.weight.LogSemiring class, which implements a logarithmic semiring based on Double weight values.

Furthermore, revision 11363 includes the edu.cmu.sphinx.fst.state.State<W extends Weight<?>>, edu.cmu.sphinx.fst.arc.Arc<W extends Weight<?>> and edu.cmu.sphinx.fst.fst.Fst<W extends Weight<?>> classes, which implement the state, arc and fst functionality, respectively.

2. Architecture design issues

2.1. Java generics support

As described in the first part [1], the edu.cmu.sphinx.fst.weight.Weight<T> class acts basically as a wrapper for the weight's value. The current implementations of the State, Arc and Fst classes take as a type parameter any class that extends the Weight base class. Although this approach provides great flexibility in building custom types of FSTs, the implementations can be greatly simplified if we assume only the base Weight class and modify the State, Arc and Fst class definitions to simply use the weight's value type as their type parameter.

As an example, the Arc class definition would be simplified to

    public class Arc<T> implements Serializable {

        private static final long serialVersionUID = -7996802366816336109L;

        // Arc's weight
        protected Weight<T> weight;
        // Rest of the code.....

instead of its current definition

    public class Arc<W extends Weight<?>> implements Serializable {

        private static final long serialVersionUID = -7996802366816336109L;

        // Arc's weight
        protected W weight;
        // Rest of the code.....

The proposed modification can also be applied to the State and Fst classes and would provide an easier-to-use API. In that case, the construction of a basic FST in the class edu.cmu.sphinx.fst.demos.basic.FstTest would be simplified as follows:

    // ...
    Fst<Double> fst = new Fst<Double>();

    // State 0
    State<Double> s = new State<Double>();
    s.AddArc(new Arc<Double>(new Weight<Double>(0.5), 1, 1, 1));
    s.AddArc(new Arc<Double>(new Weight<Double>(1.5), 2, 2, 1));

    // State 1
    s = new State<Double>();
    s.AddArc(new Arc<Double>(new Weight<Double>(2.5), 3, 3, 2));

    // State 2 (final)
    s = new State<Double>(new Weight<Double>(3.5));
    // ...

The code could be further simplified by completely dropping generics support in the State, Arc and Fst classes and just providing concrete implementations based on a single fixed Weight type.
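
To make the proposed design easier to picture, here is a self-contained sketch of what the simplified classes might look like. It is not the code in the repository: the field names, the lower-case `addArc` method, the `(weight, input label, output label, next state)` argument order and the use of a null final weight for non-final states are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SimplifiedFstSketch {
    // Thin wrapper around the weight's value, as in the first article.
    static class Weight<T> {
        final T value;
        Weight(T value) { this.value = value; }
    }

    // Arc parameterized on the value type T instead of on a Weight subclass.
    static class Arc<T> {
        final Weight<T> weight;
        final int iLabel, oLabel, nextState;
        Arc(Weight<T> weight, int iLabel, int oLabel, int nextState) {
            this.weight = weight;
            this.iLabel = iLabel;
            this.oLabel = oLabel;
            this.nextState = nextState;
        }
    }

    static class State<T> {
        final Weight<T> finalWeight; // null marks a non-final state in this sketch
        final List<Arc<T>> arcs = new ArrayList<Arc<T>>();
        State() { this(null); }
        State(Weight<T> finalWeight) { this.finalWeight = finalWeight; }
        void addArc(Arc<T> arc) { arcs.add(arc); }
    }

    // Rebuilds the three states of the FstTest example above.
    static List<State<Double>> buildExample() {
        List<State<Double>> states = new ArrayList<State<Double>>();
        State<Double> s = new State<Double>();
        s.addArc(new Arc<Double>(new Weight<Double>(0.5), 1, 1, 1));
        s.addArc(new Arc<Double>(new Weight<Double>(1.5), 2, 2, 1));
        states.add(s);
        s = new State<Double>();
        s.addArc(new Arc<Double>(new Weight<Double>(2.5), 3, 3, 2));
        states.add(s);
        states.add(new State<Double>(new Weight<Double>(3.5)));
        return states;
    }

    public static void main(String[] args) {
        System.out.println("states: " + buildExample().size());
    }
}
```

With this shape, client code only ever names the value type (`Double` here), which is the usability gain the proposal is after.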

2.2. Compatibility with the original openFST binary format

A second issue is the compatibility of the serialized binary format with the original openFST format. A compatible Java library that is able to load and save openFST models would give us the ability to share trained models between various applications. As an example, in the case of ASR applications, trained models could easily be shared between sphinx4 and kaldi [3], which is written in C++ and already uses the openFST library.

2.3. Logarithmic Semiring implementation issues

A final issue has to do with a possible inconsistency in the definition of the plus operation between the paper by Allauzen et al. [4] and the actual openFST code (version 1.3.1). The plus operation ( $latex \oplus_{\log} $ ) is defined in [4] as $latex x \oplus_{\log} y = -\log(e^{-x} +e^{-y}) $; in the code, however, it is implemented as follows:

    template <class T>
    inline T LogExp(T x) { return log(1.0F + exp(-x)); }

    template <class T>
    inline LogWeightTpl<T> Plus(const LogWeightTpl<T> &w1,
                                const LogWeightTpl<T> &w2) {
      T f1 = w1.Value(), f2 = w2.Value();
      if (f1 == FloatLimits<T>::kPosInfinity)
        return w2;
      else if (f2 == FloatLimits<T>::kPosInfinity)
        return w1;
      else if (f1 > f2)
        return LogWeightTpl<T>(f2 - LogExp(f1 - f2));
      return LogWeightTpl<T>(f1 - LogExp(f2 - f1));
    }
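
One way to probe this question is to evaluate both forms numerically. Algebraically, f1 - log(1 + e^{-(f2 - f1)}) = -log(e^{-f1}) - log(1 + e^{f1 - f2}) = -log(e^{-f1} + e^{-f2}), so the two expressions should agree. The Java sketch below, with hypothetical method names and double precision throughout, compares the paper's definition with a transcription of the code above.

```java
class LogPlusCheck {
    // Definition from the paper: -log(e^{-x} + e^{-y}).
    static double plusByDefinition(double x, double y) {
        return -Math.log(Math.exp(-x) + Math.exp(-y));
    }

    // Transcription of LogExp from the C++ code.
    static double logExp(double x) {
        return Math.log(1.0 + Math.exp(-x));
    }

    // Transcription of Plus from the C++ code.
    static double plusAsInCode(double f1, double f2) {
        if (f1 == Double.POSITIVE_INFINITY) return f2;
        if (f2 == Double.POSITIVE_INFINITY) return f1;
        if (f1 > f2) return f2 - logExp(f1 - f2);
        return f1 - logExp(f2 - f1);
    }

    public static void main(String[] args) {
        double[][] pairs = {{0.5, 1.5}, {3.0, 3.0}, {10.0, 0.1}, {1000.0, 1001.0}};
        for (double[] p : pairs) {
            System.out.println(p[0] + " (+) " + p[1]
                    + ": definition = " + plusByDefinition(p[0], p[1])
                    + ", code = " + plusAsInCode(p[0], p[1]));
        }
    }
}
```

For moderate weights the two values match up to rounding. For very large weights the direct definition underflows (both exponentials become zero, so the result is infinite) while the rearranged form stays finite, which suggests that the code is a numerically more stable rewriting of the same operation rather than a different definition.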


[1] “Porting openFST to java: Part 1”, last accessed: 18/05/2012.

[2] CMUSphinx g2p SVN repository

[3] Kaldi Speech recognition research toolkit , last accessed: 18/05/2012.

[4] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, M. Mohri, “OpenFst: a general and efficient weighted finite-state transducer library”, Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), pp. 11–23, Prague, Czech Republic, July 2007.

Porting openFST to java: Part 1

(author: John Salatas)


This article is the first part of a series of articles on porting openFST [1] to Java. OpenFST is an open-source C++ library for weighted finite-state transducers (WFSTs) [1], and having a similar Java implementation is a crucial first step for the integration of phonetisaurus g2p into sphinx 4 [2].

This article will briefly review some mathematical background of weighted finite-state transducers, describe the current implementation of openFST and then start describing the Java implementation, which will be completed in the articles that follow.

1. Weighted finite-state transducers

Weighted finite-state transducers have been used in speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, information extraction and retrieval among others. Having a comprehensive software library of weighted transducer representations and core algorithms is key for using weighted transducers in these applications and for the development of new algorithms and applications. [1]

A weighted finite-state transducer (WFST) is a finite automaton for which each transition has an input label, an output label, and a weight. Figure 1 depicts a weighted finite state transducer: [1]

Figure 1: Example weighted finite-state transducer

The initial state is labeled 0. The final state is 2 with final weight of 3.5. Any state with non-infinite final weight is a final state. There is a transition from state 0 to 1 with input label a, output label x, and weight 0.5. This machine transduces, for instance, the string ac to xz with weight 6.5 (the sum of the arc and final weights). [1]
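
The weight computation in this example can be traced with a few lines of Java. Only the a:x/0.5 arc, the final weight 3.5 and the total 6.5 are stated in the text; the c:z arc with weight 2.5 and the b:y arc with weight 1.5 are assumptions, inferred from the stated total and from the usual form of this example, and the class is purely illustrative.

```java
import java.util.List;

class WfstPathSketch {
    static final class Arc {
        final int from, to;
        final char in, out;
        final double weight;
        Arc(int from, char in, char out, double weight, int to) {
            this.from = from; this.in = in; this.out = out;
            this.weight = weight; this.to = to;
        }
    }

    // Arcs of the example machine (all but the first are assumed).
    static final List<Arc> ARCS = List.of(
            new Arc(0, 'a', 'x', 0.5, 1),
            new Arc(0, 'b', 'y', 1.5, 1),
            new Arc(1, 'c', 'z', 2.5, 2));
    static final int FINAL_STATE = 2;
    static final double FINAL_WEIGHT = 3.5;

    // Follows the input from state 0, summing arc weights and adding the
    // final weight; returns "output weight", or null if the input is rejected.
    static String transduce(String input) {
        int state = 0;
        double weight = 0.0;
        StringBuilder output = new StringBuilder();
        for (char symbol : input.toCharArray()) {
            Arc taken = null;
            for (Arc arc : ARCS) {
                if (arc.from == state && arc.in == symbol) { taken = arc; break; }
            }
            if (taken == null) return null;
            output.append(taken.out);
            weight += taken.weight;
            state = taken.to;
        }
        if (state != FINAL_STATE) return null;
        return output + " " + (weight + FINAL_WEIGHT);
    }

    public static void main(String[] args) {
        System.out.println(transduce("ac")); // 0.5 + 2.5 + 3.5
    }
}
```

Summing the weights along the path corresponds to the times operation of the tropical semiring, which is introduced next.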

The weights may represent any set so long as they form a semiring. A semiring $latex (\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1}) $ is specified by a set of values $latex \mathbb{K} $, two binary operations $latex \oplus $ and $latex \otimes $, and two designated values $latex \bar{0} $ and $latex \bar{1} $. The operation $latex \oplus $ is associative, commutative, and has $latex \bar{0} $ as identity. The operation $latex \otimes $ is associative, has identity $latex \bar{1} $, distributes with respect to $latex \oplus $, and has $latex \bar{0} $ as annihilator: for all $latex a \in \mathbb{K} , a \otimes \bar{0} = \bar{0} \otimes a = \bar{0} $. If $latex \otimes $ is also commutative, we say that the semiring is commutative. [1]

Table 1 below lists some common semirings. All but the last are defined over subsets of the real numbers (extended with positive and negative infinity). In addition to the familiar Boolean semiring, and the probability semiring used to combine probabilities, two semirings often used in applications are the log semiring, which is isomorphic to the probability semiring via the negative-log mapping, and the tropical semiring, which is similar to the log semiring except that the $latex \oplus $ operation is min. The left (right) string semiring, which is defined over strings, has longest common prefix (suffix) and concatenation as its operations, and has the (extended element) infinite string and the empty string for its identity elements. It only distributes on the left (right). [1]

Table 1: Semiring examples.
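
The properties described above can be checked numerically on small examples. The following Java sketch uses plain doubles and hypothetical method names; it encodes the tropical semiring (min, +, +infinity, 0) and the log semiring plus operation, and maps probabilities into the log semiring with the negative-log mapping.

```java
class SemiringSketch {
    // Tropical semiring over doubles: plus is min, times is ordinary addition.
    static final double TROPICAL_ZERO = Double.POSITIVE_INFINITY;
    static final double TROPICAL_ONE = 0.0;
    static double tropicalPlus(double x, double y) { return Math.min(x, y); }
    static double tropicalTimes(double x, double y) { return x + y; }

    // Log semiring: same times, but plus is -log(e^{-x} + e^{-y}).
    static double logPlus(double x, double y) {
        return -Math.log(Math.exp(-x) + Math.exp(-y));
    }
    static double logTimes(double x, double y) { return x + y; }

    // Negative-log mapping from the probability semiring to the log semiring.
    static double toLog(double probability) { return -Math.log(probability); }

    public static void main(String[] args) {
        double a = 1.5, b = 2.0, c = 0.25;
        // Identity, annihilator and distributivity in the tropical semiring.
        System.out.println(tropicalPlus(a, TROPICAL_ZERO) == a);
        System.out.println(tropicalTimes(a, TROPICAL_ZERO) == TROPICAL_ZERO);
        System.out.println(tropicalTimes(a, tropicalPlus(b, c))
                == tropicalPlus(tropicalTimes(a, b), tropicalTimes(a, c)));
        // Adding probabilities 0.25 and 0.5 corresponds to logPlus of their images.
        System.out.println(logPlus(toLog(0.25), toLog(0.5)) + " vs " + toLog(0.75));
    }
}
```

The last line illustrates the isomorphism: combining probabilities with + and × gives the same result as combining their negative logs with the log semiring's plus and times.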

2. The openFST C++ library: Representation and Construction

The motivation for OpenFst was to create a library as comprehensive and efficient as the AT&T FSM [3] Library, but that was an open-source project. We also sought to make this library as flexible and customizable as possible given the wide range of applications WFSTs have enjoyed in recent years. It is a C++ template library, allowing it to be both very customizable and efficient. [1]

In the OpenFst Library, a transducer can be constructed from either the C++ level using class constructors and mutators or from a shell-level program using a textual file representation. [1]

In order to create a transducer using openFST we need first to construct an empty VectorFst: [1]

    // A vector FST is a general mutable FST
    VectorFst<StdArc> fst;

The VectorFst, like all transducer representations and algorithms in this library, is templated on the transition type. This permits customization of the labels, state IDs and weights in a transducer. StdArc defines the library-standard transition representation:

    template <class W>
    class ArcTpl {
     public:
      typedef W Weight;
      typedef int Label;
      typedef int StateId;

      ArcTpl(Label i, Label o, const Weight& w, StateId s)
          : ilabel(i), olabel(o), weight(w), nextstate(s) {}

      ArcTpl() {}

      static const string &Type(void) {
        static const string type =
            (Weight::Type() == "tropical") ? "standard" : Weight::Type();
        return type;
      }

      Label ilabel;
      Label olabel;
      Weight weight;
      StateId nextstate;
    };

A Weight class holds the set element and provides the semiring operations. Currently openFST provides many different C++ template-based implementations, such as TropicalWeightTpl<T>, LogWeightTpl<T> and MinMaxWeightTpl<T>, which extend a base FloatWeightTpl<T> (see float-weight.h for implementation details), and others. Given these template-based implementations, openFST needs just a typedef to define a particular Weight such as TropicalWeight:

    // Single precision tropical weight
    typedef TropicalWeightTpl<float> TropicalWeight;

3. The proposed FST java library

Based on the above description and on technical implementation differences between C++ and Java, more specifically a) the differences between C++ templates and Java generics [4] and b) the lack of operator overloading in Java, the initial implementation includes the edu.cmu.sphinx.fst.weight.Weight<T> class, acting basically as a wrapper for the weight's value, which can be of any type. The Weight class can be extended in order to create more advanced implementations, if needed.

There is also a generics-based interface, edu.cmu.sphinx.fst.weight.Semiring<W extends Weight<?>>, which declares the Plus, Times and Divide semiring operations for a Weight class. In addition, it declares the zero and one elements of the semiring and the boolean isMember(Weight w) method, which should be implemented so that it returns true if w is a member of the semiring's set of values ($latex \mathbb{K} $) and false if it is not. The edu.cmu.sphinx.fst.weight.TropicalSemiring class implements this interface in order to create a concrete Tropical Semiring class based on the single-precision Float type for storing the weights' values.

Finally, the edu.cmu.sphinx.fst.demos.basic package contains a main class for testing the above functionality by instantiating a TropicalSemiring and performing some operations on various Weight values.
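
A rough picture of the design described in this section is sketched below. It is not the code from the repository: the lower-case method names, the nesting of the classes and the simplified `divide` (which ignores the special cases for infinite weights) are assumptions made to keep the example short and self-contained.

```java
class SemiringInterfaceSketch {
    // Wrapper for the weight's value, which can be of any type.
    static class Weight<T> {
        final T value;
        Weight(T value) { this.value = value; }
    }

    // Generics-based interface declaring the semiring operations and elements.
    interface Semiring<W extends Weight<?>> {
        W plus(W a, W b);
        W times(W a, W b);
        W divide(W a, W b);
        W zero();
        W one();
        boolean isMember(W w);
    }

    // Tropical semiring over single-precision Float values.
    static class TropicalSemiring implements Semiring<Weight<Float>> {
        public Weight<Float> plus(Weight<Float> a, Weight<Float> b) {
            return new Weight<Float>(Math.min(a.value, b.value));
        }
        public Weight<Float> times(Weight<Float> a, Weight<Float> b) {
            return new Weight<Float>(a.value + b.value);
        }
        public Weight<Float> divide(Weight<Float> a, Weight<Float> b) {
            return new Weight<Float>(a.value - b.value); // simplified
        }
        public Weight<Float> zero() {
            return new Weight<Float>(Float.POSITIVE_INFINITY);
        }
        public Weight<Float> one() {
            return new Weight<Float>(0.0f);
        }
        public boolean isMember(Weight<Float> w) {
            // Everything except NaN and negative infinity belongs to the set.
            return !w.value.isNaN() && w.value != Float.NEGATIVE_INFINITY;
        }
    }

    public static void main(String[] args) {
        TropicalSemiring ts = new TropicalSemiring();
        Weight<Float> a = new Weight<Float>(1.5f);
        Weight<Float> b = new Weight<Float>(0.5f);
        System.out.println("plus  = " + ts.plus(a, b).value);
        System.out.println("times = " + ts.times(a, b).value);
    }
}
```

Because Java has no operator overloading, the operations live on the semiring object rather than on the weights themselves, which is the main structural difference from the C++ design.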

4. Conclusion – Future work

This article tried to describe some basic theoretical background on weighted finite-state transducers, provide a brief description of the openFST architecture and foundation classes, and finally present an initial design for the FST Java library implementation. Following the general open-source philosophy “perform small commits often”, the library is available in the CMUSphinx repository created for the integration of phonetisaurus g2p into sphinx 4. [5]

The next step is to provide the Arc and Fst classes, which over time will be extended to provide the required functionality for the various FST operations needed for my GSoC 2012 project. Hopefully, over time, more functionality will be provided by the community.


[1] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, M. Mohri, “OpenFst: a general and efficient weighted finite-state transducer library”, Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), pp. 11–23, Prague, Czech Republic, July 2007.

[2] J. Salatas, “Phonetisaurus: A WFST-driven Phoneticizer – Framework Review”, last accessed: 08/05/2012.

[3] M. Mohri, F. Pereira, M. Riley, “The Design Principles of a Weighted Finite-State Transducer Library”, Theoretical Computer Science, pp. 15-32, 2000.

[4] Q. H. Mahmoud, “Using and Programming Generics in J2SE 5.0”, Oracle Technology Network, 2004, last accessed: 08/05/2012.

[5] CMUSphinx g2p SVN repository