Porting openFST to java: Part 4

(author: John Salatas)


This article, the fourth in a series regarding porting openFST to java, describes the latest version of the java fst library, which contains all the required code for porting phonetisaurus g2p decoder [1] to java and eventually integrate it with sphinx4.

1. Basic Usage

Creating as simple wfst like the one in in figure 1 below is a straightforward procedure

[caption id="attachment_339" align="aligncenter" width="497" caption="Figure 1: Example weighted finite-state transducer"]Figure 1: Example weighted finite-state transducer[/caption]

// Create a new Semiring
TropicalSemiring ts = new TropicalSemiring();
// Create the wfst
Fst wfst = new Fst(ts);
// Create input and output symbol tables
// and assign these to the wfst
Mapper isyms = new Mapper();
isyms.put(0, "a");
isyms.put(1, "b");
isyms.put(2, "c");

Mapper osyms = new Mapper();
osyms.put(0, "x");
osyms.put(1, "y");
osyms.put(2, "z");

// Create state 0
State state = new State(ts.zero());
// Add it to the wfst
// Set it as the start state
// Add arcs to state 0
state.addArc(new Arc(0, 0, 0.5, "1"));
state.addArc(new Arc(1, 1, 1.5, "1"));

// Create state 1
state = new State(ts.zero());
// Add it to the wfst
// Add arcs to state 1
state.addArc(new Arc(2, 2, 2.5, "2"));

// Create (final) state 2
state = new State(new Weight(3.5));
// Add it to the wfst

Then this wfst can be serialized as binary java fst
try {
} catch (IOException e) {

or exported to openFst text format

Convert.export(wfst, wfst.getSemiring(), "path/to/exported");

Finally it can also be used for several operations as described in the following section.

2. Fst Operations

As already mentioned, priority was given to operations that are required for the porting of phonetisaurus' g2p decoder to java. All operations described are defined in their own class having the same name with the operation under the edu.cmu.sphinx.fst.operations package.

2.1. ArcSort

This operation sorts the arcs in an FST per state. It is defined as follows:
public static void apply(Fst fst, Comparator cmp)

The Comparator cmp can be either ILabelCompare or OlabelCompare which short the arcs according to their input or output labels accordingly.

2.2. Compose

This operation computes the composition of two transducers [2]. It is defined as follows:
public static Fst get(Fst fst1, Fst fst2, Semiring semiring)

2.3. Connect

This operation removes all states that are not contained on a path leading from the initial state to a final state. It is defined as follows:
public static void apply(Fst fst)

2.4. Determinize

This operation determinizes a weighted transducer. The result will be an equivalent FST that has the property that no state has two transitions with the same input label. For this algorithm, epsilon transitions are treated as regular symbols [3]. It is defined as follows:
public static Fst get(Fst fst)

2.5. ExtendFinal

This operations extends a wfst by adding a new final state with weight 0 and adding epsilon transitions from the existing final states to the new one, with weights equal to the existing state's final weight. It is defined as follows:
public static void apply(Fst fst)

Furthermore there is also a procedure to undo this change, which is defined as:
public static void undo(Fst fst)

2.6. NShortestPaths

This operation produces an FST containing the n -shortest paths in the input FST [4], [5]. It is defined as follows:
public static Fst get(Fst fst, int n)

where n is the number of the best paths to return.

2.7. Project

This operation projects an FST onto its domain or range by either copying each arc's input label to its output label or vice versa. It is defined as follows:
public static void apply(Fst fst, ProjectType pType)

where pType is an enumeration talking values either INPUT or OUTPUT which project the input labels to the output or the output to the input accordingly.

2.8. Reverse

This operation reverses an FST. Internally it first extends the input to a single final state and this extension is undone before exiting leaving the input fst unchanged. It is defined as follows:
public static Fst get(Fst fst)

2.9. RmEpsilon

This operation removes epsilon-transitions (when both the input and output label are an epsilon) from a transducer. The result will be an equivalent FST that has no such epsilon transitions [2]. It is defined as follows:
public static Fst get(Fst fst)

3. Performance in g2p decoding

The performance of the fst java library, was evaluated in g2p decoding by porting the phonetisaurus' g2p decoder to java (available at [6]). N-gram fst models were created (for n=6,7,8,9 and 10) and then loaded in the decoder in order to phoneticize 5 different words. The loading time for each model and also the time of various operations taking place during the decoding, were measured and their average are summarized in the table below. The table also shows the memory used by the java VM after loading the model (this refers more or less to the memory needed by the model) and also the maximum amount of memory utilized during the g2p decoding. All tests were performed on an Intel Core Duo CPU running openSuSE 12.1 x64.

[caption id="attachment_570" align="aligncenter" width="371" caption="Table 1: Performance tests of java g2p decoder"]Performance tests of java g2p decoder[/caption]

The graphs below visualize the first four rows of the above table.

[caption id="attachment_571" align="aligncenter" width="528" caption="Figure 2: Performance graphs for the java g2p decoder"]Performance graphs for the java g2p decoder[/caption]

4. Conclusion – Future work

Studying the performance table in the previous section it is clear that the critical procedure in the decoding process is the model loading (deserialization) which usually take more than a minute to complete. Although this is an issue that needs to be fixed, a quick workaround is to load it in advance and keep a copy of it in memory for providing pronunciations for all required words. This of course comes with additional requirements for memory which are more or less equal to the amount shown in 3rd in the previous section's table (row “Memory usage after load”).

Next step is of course to evaluate the g2p decoder's ability to provide correct pronunciations for words and compare it with the original pronunciations produced by phonetisaurus. Having a same quality java g2p decoder will first of confirm the correctness of the java fst library code and enable us to continue with its integration with CMUSphinx 4.

[1] "Phonetisaurus: A WFST-driven Phoneticizer – Framework Review".
[2] M. Mohri, "Weighted automata algorithms", Handbook of Weighted Automata. Springer, pp. 213-250, 2009.
[3] M. Mohri, "Finite-State Transducers in Language and Speech Processing", Computational Linguistics, 23:2, 1997.
[4] M. Mohri, "Semiring Framework and Algorithms for Shortest-Distance Problems", Journal of Automata, Languages and Combinatorics, 7(3), pp. 321-350, 2002.
[5] M. Mohri, M. Riley, "An Efficient Algorithm for the n-best-strings problem", Proceedings of the International Conference on Spoken Language Processing 2002 (ICSLP '02).
[6] G2P java decoder SVN repository

Scoring routines for pronunciation evaluation: part1

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Week 5)

The basic scoring routine for the pronunciation evaluation system is now available at http://talknicer.net/~ronanki/test/. The output is generated for each phoneme in the phrase and displays the total score.

1. Edit-distance neighbor grammar generation:

Earlier, I did this with:

(a) a single-phone decoder

(b) a three-phone decoder (contextual)

(c) an entire phrase decoder with neighboring phones

This week, I added two more decoders: a word-decoder and a complete phrase decoI used using each phoneme at each time

word-decoder: Use sox to split each wav file into worded based on forced-alignment output and then present each word as follows.

Ex: word - "with" is presented as

public = ( (W | L | Y) (IH) (TH) );

public = ( (W) (IH | IY | AX | EH) (TH) );

public = ( (W) (IH) (TH | S | DH | F | HH) );

The accuracy turned out to be better than single-phone/three-phone decoder, same as entire phrase decoder and the output of a sample test phrase is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_words.txt

Complete phrase decoder using each phoneme: This is again more similar to entire phrase decoder. This time I supplied neighboring phones for each phoneme at each time and fixed the rest of the phonemes in the phrase. Not a good approach, takes more time to decode. But, the accuracy is better than all the previous methods. The output is at http://talknicer.net/~ronanki/phrase_data/results_edit_distance/output_phrases.txt

The code for above methods are uploaded in cmusphinx sourceforge at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/neighborphones_decode/

Please follow the README file in each folder for detailed instructions on how to use them.

2. Scoring paradigm:

The current basic scoring routine which is deployed at http://talknicer.net/~ronanki/test/ aligns the test recording with the utterance using forced alignment in sphinx and generates a phone segmentation file. Each phoneme in the file is then compared with mean, std. deviation of the respective phone in phrase_statistics (http://talknicer.net/~ronanki/phrase_data/phrase1_stats.txt) and standard scores are calculated from z-scores of acoustic_score and duration.

I also derived statistics (mean score, std. deviation score, mean duration) for each phone in CMUphoneset irrespective of context using the exemplar recordings for all the three phrases (http://talknicer.net/~ronanki/phrase_data/phrases.txt) which I have as of now. So, If a test utterance is given, I can test each phone in the random phrase with respective phone statistics.

Statistics are at : http://talknicer.net/~ronanki/phrase_data/all_phrases_stats (column count represents number of times each phone occurred)

Things to do in the upcoming week:

1. Use of an edit-distance grammar to derive standard scores such that the minimal effective training data set is required.
2. Use of the same grammar to detect the words that are having two correct different pronunciation (ex: READ/RED)
3. In a random phrase scoring method, another column can be added to store the position of each phone with respect to word (or SILence) such that each phone will have three statistics and can be compared better with the exemplar phonemes based on position.
4. Link all those modules to try to match experts' scores.
5. Provide feedback to the user with underlined mispronunciations, or numerical labels.

Future tasks:

1. Use of CART models in training to do better match of statistics for each phoneme in the test utterance with the training data based on contextual information
2. Use of phonological features instead of mel-cepstral features, which are expected to better represent the state of pronunciation.
3. Develop a complete web-based system so that end user can test their pronunciation in an efficient way.

Web data collection for pronunciation evaluation

(author: Troy)

(status: week 4)

[Project mentor note:  I have been holding these more recent blog posts pending some issues with Adobe Flash security updates which periodically break cross-platform audio upload web browser solutions. We have decided to plan for a fail-over scheme using low-latency HTTP POST multipart/form-data binary Speex uploads to provide backup in case Flash/rtmplite fails again in the future. This might also support most of the mobile devices. Please excuse the delay and rest assured that progress continues and will continue to be announced at such time as we are confident that we won't need to contradict ourselves as browser technology for audio upload continues to develop. --James Salsman]

The data collection website now can provide basic capabilities. Anyone interested, please check out http://talknicer.net/~li-bo/datacollection/login.php and give it a try. If you encounter any problems, please let us know.

Here are my accomplishments from last week:

1) Discussed the project schema design with  the project mentor and created the database with MySQL. The current schema is shown at http://talknicer.net/w/Database_schema.  During the development of the user interface, slight modifications were made to refine the database schema, such as the age field in for the users table: Storing the user's birth date is much better. Other similar changes were made. I learned that good database design comes from practice, not purely imagination.

2) Implemented the two types of user registration pages: one for students and one for exemplar uploaders. To avoid redundant work and allow for fewer constraints on types of users, the registration process involves two steps: one basic registration and one extra information update. For students, only the basic one is mandatory, but the exemplar uploaders have to fill out two separate forms.
3) Added extra supporting functionality for user management, including password reset and mode selection for users with more than one type.
4) Incorporated the audio recorder with the website for recording and uploading to servers.

edit-distance grammar decoding using sphinx3: Part 2

(Author: Srikanth Ronanki)
(Status: GSoC 2012 Pronunciation Evaluation Week 4)

The source code for the functions below [1] have been uploaded to http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki/scripts/
Here are some brief notes on how to use those programs:

Method 1: (phoneme decode)
Steps To Run:
1. Use split_wav2phoneme.py to split a sample wav file in to individual phoneme wav files
$ python split_wav2phoneme.py
2. Create split.ctl file using extracted split_wav directory
$ ls split_wav/* > split.ctl
$ sed -i 's/.wav//g' split.ctl

3. Run feature_extract.sh program to extract features for individual phoneme wav files
$ sh feature_extract.sh
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg.
$ sh jsgf2fsg.sh
6. Run decode_1phoneme.py to get the required output in output_decoded_phones.txt
$ python decode_1phoneme.py

Method 2: (Three phones decode)

Steps To Run:
1. Use split_wav2threephones.py to split a sample wav file in to individual phoneme wav files which consists of three phones the other two being served as contextual information for the middle one.
$ python split_wav2threephones.py
2. Create split.ctl file using extracted split_wav directory
$ ls split_wav/* > split.ctl
$ sed -i 's/.wav//g' split.ctl

3. Run feature_extract.sh program to extract features for individual phoneme wav files
$ sh feature_extract.sh
4. Java Speech Grammar Format (JSGF) files are already created in FSG_phoneme
5. Run jsgf2fsg.sh in FSG_phoneme to convert from jsgf to fsg
$ sh jsgf2fsg.sh
6. Run decode_3phones.py to get the required output in output_decoded_phones.txt
$ python decode_3phones.py

Method 3: (Single/Batch phrase decode)

Steps To Run:
1. Construct grammar file (JSGF) using my earlier scripts from phonemes2ngbphones [2] and then use jsgf2fsg in sphinxbase to convert from JSGF to FSG which serves as input Language Model to sphinx3_decode
2. Provide the input arguments such as grammar file, feats, acoustic models etc., for the input test phrase
3. Run decode.sh program to get the required output in sample.out
$ sh decode.sh


[1] edit-distance grammar decoding using sphinx3: Part 1

[2] Input string of phonemes to CMUBet neighboring phones