Phonetisaurus: A WFST-driven Phoneticizer – Framework Review

(author: John Salatas)


This article tries to analyze the phonetisaurus g2p [1], [2] code by describing it's main parts and algorithms behind these. Phonetisaurus is a modular system and includes support for several third-party components. The system has been implemented primarily in python, but also leverages the OpenFST framework [3].

1. Overall Architecture

The procedure for model training and evaluation in phonetisaurus consists by three parts [4]: the dictionary alignment, the model training and finally the evaluation of the model.

1.1. Dictionary Alignment
Manual G2P alignments are generally not available, thus it is necessary to first align the grapheme and phoneme sequences in a pronunciation dictionary, prior to building a pronunciation model. Phonetisaurus utilizes the EM-based many-to-many alignment procedure detailed in [5] that supports alignments from digraphs such as “sh” to a single phoneme, or the reverse case. Recently the dictionary alignment was reimplemented and upgraded using OpenFst.
The command line script that controls the alignment procedure interfaces with the M2MFstAligner class (M2MFstAligner.cpp) using Swig [6], in order to transform two sequences, one of graphemes and one of phonemes to an FST that encodes all possible alignments between the symbols in the two sequences.
The basic transformation of the sequence
void M2MFstAligner::Sequences2FST( VectorFst* fst, vector* seq1, vector* seq2 );
creates a VectorFst fst instance and iterates over all possible combinations which are added to the fst. It uitilizes the Plus semiring operation and the Connect optimization operation.

After the FSTs for all entries are created the procedure continus with the EM algorithm which is implemented in the
void M2MFstAligner::expectation( );
float M2MFstAligner::maximization( bool lastiter );
procedures. These procedures utilize the ShortestDistance search operation and the Divide, Times and Plus semiring operations.

1.2. Model Training

The command line script that controls the model training procedure uses the estimate-ngram utility of the MIT Language Modeling (MITLM) toolkit [7] in order to estimate an n-gram language model language model by cumulating n-gram count statistics,"smoothing observed counts, and building a backoff n-gram mode [8].
The estimate-ngramm utility produces a language model in ARPA format which is then converted to a FST textual represantion through the use of the script. This textual represantion is then parsed by the fstcompile command line utility of OpenFST and converted to the final binary representation.

1.3. Model Evaluation

The command line script that controls the model evaluation procedure utilizes the Phonetisaurus class (Phonetisaurus.cpp), through the phonetisaurus-g2p command line interface, for the g2p conversion, which is then evaluated. It uitilizes the Compose binary operation, Project unary operation, ShortestPath search operation, Times semiring operation and RmEpsilon optimization operation.
A pronunciation for a new word is achieved by compiling the word into a WFSA and composing it with the pronunciation model. The best hypothesis is just the shortest path through the composed WFST. [1]
The input word is converted to an acceptor I which has one arc for each of the characters in the word. I is then composed with M according to O = I ◦ M where ◦ denotes the composition operator. The n-best paths are extracted from O by projecting the output, removing the epsilon labels and applying the n-shortest paths algorithm with determinization. [2]

2. Conclusion – Future Work

This article tried to analyze the phonetisaurus g2p code and its main parts. Having this description will allow us to produce a more accurate and analytical planning and scheduling the tasks required for the integration of phonetisaurus g2p into sphinx 4 for my GSoC 2012 project [9].


[1] J. Novak, D. Yang, N. Minematsu, K. Hirose, "Initial and Evaluations of an Open Source WFST-based Phoneticizer", The University of Tokyo, Tokyo Institute of Technology

[2] D. Yang, et. al., “Rapid development of a G2Psystem based on WFST framework”, ASJ 2009
Autumn session, pp. 111-112, 2009.

[3] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, M. Mohri, "OpenFst: a general and efficient weighted finite-state transducer library", Proceedings of the 12th International Conference on Implementation and Application of Automata (CIAA 2007), pp. 11–23, Prague, Czech Republic, July 2007.

[4] J. Novak, README.txt, phonetisaurus source code, last accessed: 29/04/2012.

[5] S. Jiampojamarn, G. Kondrak, T. Sherif, “Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion”, NAACL HLT, pp. 372-379, 2007.

[6] Simplified Wrapper and Interface Generator, last accessed: 29/04/2012.

[7] MIT Language Modeling Toolkit, last accessed: 29/04/2012.

[8] D. Jurafsky, J. H. Martin, “Speech and Language Processing”, Prentice Hall, 2000.

[9] J. Salatas, GSoC 2012 Project: Letter to Phoneme Conversion in sphinx4, last accessed: 29/04/2012.