(author: John Salatas)
Foreword
This article reviews the OpenGrm NGram Library [1] and its use for language modeling in ASR. OpenGrm builds on functionality from the OpenFst library [2] to create, access and manipulate n-gram language models, and it can serve as the language model training toolkit when integrating phonetisaurus' model training procedures [3] into a simplified application.
1. Model Training
Starting from the aligned corpus produced from cmudict by the aligner code of our previous article [4], the first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with [5]:
# ngramsymbols < cmudict.corpus > cmudict.syms
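The resulting symbol table is a plain-text file mapping each token that appears in the corpus to an integer label, one pair per line, with <epsilon> conventionally mapped to 0. The entries below are purely illustrative placeholders, not the actual cmudict output:
<epsilon>	0
<token-1>	1
<token-2>	2
...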
Given the symbol table, the text corpus can be converted to a binary finite-state archive (FAR) [6] with:
# farcompilestrings --symbols=cmudict.syms --keep_symbols=1 cmudict.corpus > cmudict.far
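If something looks off later in the pipeline, the archive can be sanity-checked with the standard OpenFst FAR tools, for example by listing a few of the compiled strings:
# farprintstrings cmudict.far | head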
The next step is to count n-grams in the input corpus, now in FAR format; ngramcount produces the counts as an FST. The --order switch selects the maximum n-gram length to count.
The 1-gram through 9-gram counts for the cmudict.far finite-state archive created above can be computed with:
# ngramcount --order=9 cmudict.far > cmudict.cnts
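The counts are themselves stored as an FST; a few of them can be printed in human-readable form with ngramprint for a quick check, e.g.:
# ngramprint cmudict.cnts | head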
Finally, the 9-gram counts in cmudict.cnts above can be converted to a WFST model with:
# ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod
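Basic statistics of the resulting model (order, number of n-grams per order, number of states and arcs) can be inspected with ngraminfo:
# ngraminfo cmudict.mod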
The --method option selects the smoothing method [7] from the six available (the general backoff form most of them build on is sketched after the list):
- witten_bell: smooths using Witten-Bell [8], with a hyperparameter k, as presented in [9].
- absolute: smooths based on Absolute Discounting [10], using bins and discount parameters.
- katz: smooths based on Katz Backoff [11], using bins parameters.
- kneser_ney: smooths based on Kneser-Ney [12], a variant of Absolute Discounting.
- presmoothed: normalizes at each state based on the n-gram count of the history.
- unsmoothed: normalizes the model but provides no smoothing.
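As a rough sketch (not the exact formulation OpenGrm implements), the backoff-based methods above all estimate

P(w \mid h) =
\begin{cases}
\tilde{P}(w \mid h) & \text{if } c(h,w) > 0 \\
\alpha(h)\, P(w \mid h') & \text{otherwise}
\end{cases}

where h is the n-gram history, h' is the history shortened by one word, \tilde{P} is a discounted relative-frequency estimate and \alpha(h) is the backoff weight that makes the distribution normalize. The methods differ mainly in how the discount is computed, and interpolated variants mix the higher- and lower-order estimates instead of switching between them [7].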
2. Evaluation – Comparison with phonetisaurus
In order to evaluate the OpenGrm models, the procedure described above was repeated using the standard 90%-10% split of cmudict into a training and a test set respectively. The binary FST produced by ngrammake wasn't readable by the phonetisaurus evaluation script, so it was converted to ARPA format with:
# ngramprint --ARPA cmudict.mod > cmudict.arpa
and then back to a phonetisaurus binary fst format with:
# phonetisaurus-arpa2fst --input=cmudict.arpa --prefix="cmudict/cmudict"
Finally, the test set was evaluated with:
# evaluate.py --modelfile cmudict/cmudict.fst --testfile cmudict.dict.test --prefix cmudict/cmudict
Words: 13328 Hyps: 13328 Refs: 13328
##############################################
EVALUATION RESULTS
---------------------------------------------------------------------
(T)otal tokens in reference: 84955
(M)atches: 77165 (S)ubstitutions: 7044 (I)nsertions: 654 (D)eletions: 746
% Correct (M/T) -- %90.83
% Token ER ((S+I+D)/T) -- %9.94
% Accuracy 1.0-ER -- %90.06
---------------------------------------------------------------------
(S)equences: 13328 (C)orrect sequences: 8010 (E)rror sequences: 5318
% Sequence ER (E/S) -- %39.90
% Sequence Acc (1.0-E/S) -- %60.10
##############################################
3. Conclusion – Future Work
The evaluation results above are, as expected, almost identical to those produced by phonetisaurus' own procedure, which uses the MITLM toolkit instead of OpenGrm.
With this procedure in place, the next step is to integrate all of the commands above into a simplified application, combined with the dictionary alignment code introduced in our previous article [13].
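As a starting point, and assuming the commands behave exactly as shown above, the training steps can already be chained in a small shell script; the file names and the fixed order/method below are simply the choices used in this article:

#!/bin/sh
# Train a 9-gram Kneser-Ney model from an aligned corpus (sketch).
set -e
corpus=cmudict.corpus
ngramsymbols < "$corpus" > cmudict.syms
farcompilestrings --symbols=cmudict.syms --keep_symbols=1 "$corpus" > cmudict.far
ngramcount --order=9 cmudict.far > cmudict.cnts
ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod
# Export to ARPA so that phonetisaurus-arpa2fst can convert it for evaluation:
ngramprint --ARPA cmudict.mod > cmudict.arpa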
References
[1] OpenGrm NGram Library
[2] OpenFst library
[3] Phonetisaurus: A WFST-driven Phoneticizer – Framework Review
[4] Porting phonetisaurus many-to-many alignment python script to C++
[5] OpenGrm NGram Library Quick Tour
[6] OpenFst Extensions: FST Archives (FARs)
[7] S. F. Chen, J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", Harvard Computer Science Technical Report TR-10-98, 1998.
[8] I. H. Witten, T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression", IEEE Transactions on Information Theory 37 (4), pp. 1085–1094, 1991.
[9] B. Carpenter, "Scaling high-order character language models to gigabytes", Proceedings of the ACL Workshop on Software, pp. 86–99, 2005.
[10] H. Ney, U. Essen, R. Kneser, "On structuring probabilistic dependences in stochastic language modeling", Computer Speech and Language 8, pp. 1–38, 1994.
[11] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recogniser", IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), pp. 400–401, 1987.
[12] R. Kneser, H. Ney, "Improved backing-off for m-gram language modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). pp. 181–184, 1995.
[13] Porting phonetisaurus many-to-many alignment python script to C++