CMUSphinx Open Source Speech Recognition

Jun 19, 2012

edit-distance grammar decoding using sphinx3: Part 1

(Status: GSoC 2012 Pronunciation Evaluation Week 3)

I finally finished trying different methods for edit-distance grammar decoding. Here is what I have tried so far:

1. I used sox to split each input wave file into individual phonemes based on the forced alignment output. Then, I tried decoding each phoneme against its neighboring phonemes. The decoding output matched the expected phonemes only 12 out of 41 times for the exemplar recordings of the phrase "Approach the teaching of pronunciation with more confidence".

The accuracy for that method of edit distance scoring was 12/41 (29%) -- this naive approach didn't work well.

2. I used sox to split each input wave file into three-phoneme segments based on the forced alignment output and the position of the phoneme within its word. For a phoneme at the beginning of its word, I used a grammar of the form (current) (next) (next2next); for a phoneme in the middle, (previous) (current) (next); and for a phoneme at the end, (previous2previous) (previous) (current). In each case I supplied the neighboring phones as alternatives for the current phone and kept the other two phones fixed.
For example, phoneme IH in the word "with" is encoded as:
((W) (IH|IY|AX|EH) (TH))

The accuracy was 19/41 (46.3%) -- better because of more contextual information.

3. I used the entire phrase with each phoneme encoded in a sphinx3_decode grammar file for matching a sequence of alternative neighboring phonemes which looks something like this:

#JSGF V1.0;
grammar phonelist;
public = (SIL
(AH|AE|ER|AA) (P|T|B|HH) (R|Y|L) (OW|AO|UH|AW) (CH|SH|JH|T)
(DH|TH|Z|V) (AH|AE|ER|AA)
(T|CH|K|D|P|HH) (IY|IH|IX) (CH|SH|JH|T) (IH|IY|AX|EH) (NG|N)
(AH|AE|ER|AA) (V|F|DH)
(P|T|B|HH) (R|Y|L) (AH|AE|ER|AA) (N|M|NG) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH) (IY|IH|IX) (EY|EH|IY|AY) (SH|S|ZH|CH) (AH|AE|ER|AA) (N|M|NG)
(W|L|Y) (IH|IY|AX|EH) (TH|S|DH|F|HH)
(M|N) (AO|AA|ER|AX|UH) (R|Y|L)
(K|G|T|HH) (AA|AH|ER|AO) (N|M|NG) (F|HH|TH|V) (AH|AE|ER|AA) (D|T|JH|G|B) (AH|AE|ER|AA) (N|M|NG) (S|SH|Z|TH)
SIL);

The accuracy for this method of edit distance scoring was 30/41 (73.2%) -- the more contextual information provided, the better the accuracy.
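
The grammars used in methods 2 and 3 can be generated from a table of neighboring phones. The following is a minimal Python sketch, not the code used for the experiments. The function names are hypothetical, the rule name `<phonelist>` is an assumption, and so is the handling of words shorter than three phones (the window is simply clamped to the word). The neighbor table contains only the phones of "with", copied from the grammar above.

```python
# Neighbor phones copied from the grammar above; only the phones of "with"
# are listed to keep the sketch short.
NEIGHBORS = {
    "W": ["L", "Y"],
    "IH": ["IY", "AX", "EH"],
    "TH": ["S", "DH", "F", "HH"],
}


def alternatives(phone, neighbors=NEIGHBORS):
    """Return a JSGF alternation such as (IH|IY|AX|EH) for one phone."""
    return "(" + "|".join([phone] + neighbors.get(phone, [])) + ")"


def window_grammar(word, i, neighbors=NEIGHBORS):
    """Method 2: a three-phone window with alternatives only at position i."""
    n = len(word)
    if i == 0:
        start = 0                # (current) (next) (next2next)
    elif i == n - 1:
        start = max(0, n - 3)    # (previous2previous) (previous) (current)
    else:
        start = i - 1            # (previous) (current) (next)
    parts = []
    for j in range(start, min(n, start + 3)):
        if j == i:
            parts.append(alternatives(word[j], neighbors))
        else:
            parts.append("(" + word[j] + ")")
    return "(" + " ".join(parts) + ")"


def phrase_grammar(phones, neighbors=NEIGHBORS, rule="phonelist"):
    """Method 3: every phone of the phrase gets its alternatives."""
    body = " ".join(alternatives(p, neighbors) for p in phones)
    return (
        "#JSGF V1.0;\n"
        f"grammar {rule};\n"
        f"public <{rule}> = (SIL {body} SIL);"
    )


with_phones = ["W", "IH", "TH"]
print(window_grammar(with_phones, 1))  # ((W) (IH|IY|AX|EH) (TH))
print(phrase_grammar(with_phones))
```

Keeping the neighbor table separate from the grammar builders means the same table can serve both the per-window and the whole-phrase grammars.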

Here is some sample output, with the two phoneme sequences listed one below the other for comparison.

Forced-alignment output:
AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M

Decoder output:
ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG Z IY EY SH AH N W IH TH M

In this case, both are forced outputs. So, if someone skips or inserts something during phrase recording, it may not work well. We need to think of a method to solve this. Would a separate decoder pass with a grammar that tests for whole-word or syllable insertions and deletions work?
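
Because both sequences are forced to the same length, they can be compared position by position. The sketch below is a hypothetical helper, not part of the project code. It counts the matches in the 30-phoneme excerpt listed above and reports where the decoder chose a different phone.

```python
forced = ("AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N "
          "S IY EY SH AH N W IH TH M").split()
decoded = ("ER P R UH JH DH AH CH IY CH IY N AH V P R ER N AH NG "
           "Z IY EY SH AH N W IH TH M").split()


def compare(expected, observed):
    """Return (number of matches, list of (index, expected, observed))."""
    if len(expected) != len(observed):
        # A skipped or inserted phone breaks the one-to-one pairing.
        raise ValueError("sequences must have the same length")
    mismatches = [
        (i, e, o)
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]
    return len(expected) - len(mismatches), mismatches


matches, mismatches = compare(forced, decoded)
print(f"{matches}/{len(forced)} phonemes match")
for index, expected, observed in mismatches:
    print(f"position {index}: expected {expected}, decoded {observed}")
```

The length check makes the limitation discussed above explicit: as soon as a speaker skips or inserts a phone, a positional comparison no longer applies and some form of alignment is needed.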

Things to do for next week:

1. We are trying to combine acoustic standard scores (and duration) from forced alignment with an edit distance scoring grammar, which was reported to have better correspondence with human expert phonologists.

2. Complete a basic demo of pronunciation evaluation, without edit distance scoring, from exemplar recordings: convert phoneme acoustic scores and durations to normally distributed scores, then derive their means and standard deviations, so that we can produce per-phoneme acoustic and duration standard scores for newly uploaded recordings.

3. Finalize the method for mispronunciation detection at phoneme and word level.
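
The standard scores mentioned in item 2 can be illustrated with a short Python sketch. It is only an outline under stated assumptions: the exemplar values are made-up numbers, the function names are hypothetical, and the sketch skips any transformation towards a normal distribution and standardizes the raw values directly.

```python
from statistics import mean, stdev


def fit(exemplar_values):
    """Mean and sample standard deviation over the exemplar recordings."""
    return mean(exemplar_values), stdev(exemplar_values)


def standard_score(value, mu, sigma):
    """Number of standard deviations between value and the exemplar mean."""
    return (value - mu) / sigma


# Hypothetical acoustic scores of one phoneme in three exemplar recordings.
exemplar_scores = [-60000, -64000, -68000]
mu, sigma = fit(exemplar_scores)

# Hypothetical score of the same phoneme in a newly uploaded recording.
z = standard_score(-70000, mu, sigma)
print(f"mean={mu}, stdev={sigma}, standard score={z}")
```

The same two functions apply unchanged to phoneme durations, which gives one acoustic and one duration standard score per phoneme.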

Jun 13, 2012

Automating the creation of joint multigram language models as WFST

(author: John Salatas)

Foreword

Previous articles have introduced the C++ code to align a pronunciation dictionary [1] and how this aligned dictionary can be used in combination with the OpenGrm NGram Library for the encoding of joint multigram language models as WFST [2]. This article will describe the automation of the language model creation procedures as a complete C++ application that is simpler to use than the original procedures described in [2].

1. Installation
The procedure below is tested on an Intel CPU running openSuSE 12.1 x64 with gcc 4.6.2. Further testing is required for other systems (MacOSX, Windows).

The code requires the openFST library to be installed on your system. Having downloaded, compiled and installed openFST, the first step is to check out the code from the cmusphinx SVN repository:

$ svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/train

and compile it

$ cd train
$ make
g++ -c -g -o src/train.o src/train.cpp
g++ -c -g -o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/M2MFstAligner.cpp
g++ -c -g -o src/phonetisaurus/FstPathFinder.o src/phonetisaurus/FstPathFinder.cpp
g++ -g -L/usr/local/lib64/fst -lfst -lfstfar -lfstfarscript -ldl -lngram -o train src/train.o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/FstPathFinder.o
$

2. Usage
Having compiled the application, running it without any command line arguments will print out its usage:
$ ./train
Input file not provided
Usage: ./train [--seq1_del] [--seq2_del] [--seq1_max SEQ1_MAX] [--seq2_max SEQ2_MAX]
               [--seq1_sep SEQ1_SEP] [--seq2_sep SEQ2_SEP] [--s1s2_sep S1S2_SEP]
               [--eps EPS] [--skip SKIP] [--seq1in_sep SEQ1IN_SEP] [--seq2in_sep SEQ2IN_SEP]
               [--s1s2_delim S1S2_DELIM] [--iter ITER] [--order ORDER] [--smooth SMOOTH]
               [--noalign] --ifile IFILE --ofile OFILE

  --seq1_del,              Allow deletions in sequence 1. Defaults to false.
  --seq2_del,              Allow deletions in sequence 2. Defaults to false.
  --seq1_max SEQ1_MAX,     Maximum subsequence length for sequence 1. Defaults to 2.
  --seq2_max SEQ2_MAX,     Maximum subsequence length for sequence 2. Defaults to 2.
  --seq1_sep SEQ1_SEP,     Separator token for sequence 1. Defaults to '|'.
  --seq2_sep SEQ2_SEP,     Separator token for sequence 2. Defaults to '|'.
  --s1s2_sep S1S2_SEP,     Separator token for seq1 and seq2 alignments. Defaults to '}'.
  --eps EPS,               Epsilon symbol. Defaults to ''.
  --skip SKIP,             Skip/null symbol. Defaults to '_'.
  --seq1in_sep SEQ1IN_SEP, Separator for seq1 in the input training file. Defaults to ''.
  --seq2in_sep SEQ2IN_SEP, Separator for seq2 in the input training file. Defaults to ' '.
  --s1s2_delim S1S2_DELIM, Separator for seq1/seq2 in the input training file. Defaults to ' '.
  --iter ITER,             Maximum number of iterations for EM. Defaults to 10.
  --ifile IFILE,           File containing training sequences.
  --ofile OFILE,           Write the binary fst model to file.
  --noalign,               Do not align. Assume that the aligned corpus already exists. Defaults to false.
  --order ORDER,           N-gram order. Defaults to 9.
  --smooth SMOOTH,         Smoothing method. Available options are: "presmoothed", "unsmoothed", "kneser_ney", "absolute", "katz", "witten_bell", "unsmoothed". Defaults to "kneser_ney".
$

As in [1], the two required options are the pronunciation dictionary (IFILE) and the file in which the binary fst model will be saved (OFILE). The application provides default values for all other options, and a binary fst model for cmudict (v. 0.7a) can be created simply with the following command:

$ ./train --seq1_del --seq2_del --ifile --ofile

allowing for deletions in both graphemes and phonemes, and

$ ./train --ifile --ofile

not allowing for deletions.

3. Performance, Evaluation and Comparison with phonetisaurus
In order to test the new code's performance, tests similar to those in [1] and [2] were performed, with similar results in both resource utilization and its ability to generate pronunciations for previously unseen words.

4. Conclusion - Future Works
Having integrated the model training procedure into a simplified application, combined with the dictionary alignment code, the next step would be to create the evaluation code in order to avoid depending on the phonetisaurus evaluate Python script. Further steps include writing the code necessary to load the WFST binary model in Java and convert it to the Java implementation of openFST [3], [4].

References
[1] Porting phonetisaurus many-to-many alignment python script to C++.
[2] Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST.
[3] Porting openFST to java: Part 1.
[4] Porting openFST to java: Part 2.

Jun 10, 2012

Sphinx3 Forced Alignment with different arguments, edit-distance grammar generation

(Author: Srikanth Ronanki)

(Status: GSoC 2012 Pronunciation Evaluation Week 2)

Following last week's discussion describing how to obtain phoneme acoustic scores from sphinx3_align, here is some additional detail pertaining to two of the necessary output arguments:

1. Following up on the discussion at https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4583225, I was able to produce acoustic scores for each frame, and thereby also for each phoneme, in a single recognition pass. Add the following code to the write_stseg function in main_align.c and pass the state segmentation parameter -stsegdir as an argument to the program:


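/* Added to write_stseg() in main_align.c: print one line per entry of the
   state segmentation list. Assumes an int i is already in scope. */
char str2[1024];        /* buffer for the phone name */
align_stseg_t *tmp1;    /* cursor over the state segmentation list */

for (i = 0, tmp1 = stseg; tmp1; i++, tmp1 = tmp1->next) {
    mdef_phone_str(kbc->mdef, tmp1->pid, str2);    /* phone ID -> phone name */
    fprintf(fp, "FrameIndex %d Phone %s PhoneID %d SenoneID %d state %d Ascr %11d \n",
            i, str2, tmp1->pid, tmp1->sen, tmp1->state, tmp1->score);
}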

2. By using the phone segmentation parameter -phsegdir as an argument to the program, the acoustic scores for each phoneme can be calculated. The output sequence for the word "approach" is as follows:

SFrm  EFrm   SegAScr  Phone
   0     9    -64725  SIL
  10    21    -63864  AH SIL P b
  22    33   -126819  P AH R i
  34    39    -21470  R P OW i
  40    51    -69577  OW R CH i
  52    64    -55937  CH OW DH e

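Each entry in the "Phone" column is a triphone, written as the phone followed by its left context, its right context, and its position in the word (b for beginning, i for internal, e for end). The full command line usage for this output is: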


$ sphinx3_align -hmm wsj_all_cd30.mllt_cd_cont_4000 -dict cmu.dic -fdict phone.filler -ctl phone.ctl -insent phone.insent -cepdir feats -phsegdir phonesegdir -phlabdir phonelabdir -stsegdir statesegdir -wdsegdir aligndir -outsent phone.outsent
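
The phone segmentation rows shown above are easy to post-process. The following Python sketch parses them into records and derives each phone's duration in frames. It is a hypothetical helper that assumes the whitespace-separated layout shown in this post; lines that do not start with a frame number are skipped.

```python
from typing import NamedTuple, Optional


class PhoneSegment(NamedTuple):
    start_frame: int
    end_frame: int
    score: int                 # SegAScr column
    phone: str
    left: Optional[str]        # left context, None for context-free rows
    right: Optional[str]       # right context
    position: Optional[str]    # word position marker such as b, i or e

    @property
    def frames(self):
        """Duration in frames; both boundary frames are included."""
        return self.end_frame - self.start_frame + 1


def parse_phseg(text):
    """Parse rows of the form 'SFrm EFrm SegAScr Phone [left right pos]'."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # header, blank line, or anything else
        context = (fields[4:7] + [None] * 3)[:3]
        segments.append(PhoneSegment(
            int(fields[0]), int(fields[1]), int(fields[2]), fields[3], *context
        ))
    return segments


SAMPLE = """\
SFrm EFrm SegAScr Phone
0     9    -64725           SIL
10    21    -63864       AH SIL P b
22    33   -126819       P AH R i
34    39    -21470       R P OW i
40    51    -69577       OW R CH i
52    64    -55937       CH OW DH e
"""

for seg in parse_phseg(SAMPLE):
    print(seg.phone, seg.frames, seg.score)
```

Records like these provide the per-phoneme scores and durations needed to compute means and standard deviations over a set of exemplar recordings.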

Work in progress:

1. It's very important to weight word scores by each word's part of speech: omitted articles don't matter very much, while nouns, adjectives, verbs, and adverbs are the most important.

2. I put some exemplar recordings that the project mentor had collected for three phrases at http://talknicer.net/~ronanki/Datasets/, with one subdirectory for each of the three phrases. The description of the phrases is at http://talknicer.net/~ronanki/Datasets/files/phrases.txt.

3. I ran sphinx3_align on that sample data set. I wrote a program to calculate the means and standard deviations of phoneme acoustic scores, and the mean duration of each phoneme. I also generated neighbor phonemes for each of the phrases, and the output is written in this file: http://talknicer.net/~ronanki/Datasets/out_ngb_phonemes.insent

4. I also tried some of the other sphinx3 executables, such as sphinx3_decode, sphinx3_livepretend, and sphinx3_continuous, for mispronunciation detection. For the sentence "Approach the teaching of pronunciation with more confidence." (phrase 1), I used this command:


$ SPHINX3DECODE -hmm ${WSJ} -fsg phone.fsg -dict basicphone.dic -fdict phone.filler -ctl new_phone.ctl -hyp phone.out -cepdir feats -mode allphone -hypseg phone_hypseg.out -op_mode 2

The decoder, sphinx3_decode, produced this output:


P UH JH DH CH IY CH Y N Z Y EY SH AH W Z AO K AA F AH N Z

The forced alignment system, sphinx3_align, produced this output:


AH P R OW CH DH AH T IY CH IH NG AH V P R AH N AH N S IY EY SH AH N W IH TH M AO R K AA N F AH D AH N S

The sphinx3_livepretend and sphinx3_continuous commands produce output in words, using language models and acoustic models along with a complete dictionary of expected words:


Approach to teaching opponents the nation with more confidence

Plans for the coming week:

1. Write and test audio upload and pronunciation evaluation for per-phoneme standard scores.

2. Since there are many deletions in the edit distance scoring grammars tried so far, we need to modify the grammar file and/or the method we are using to detect whether neighboring phonemes match more closely. Here is my idea of finding neighboring phonemes by dynamic programming:

a. Run the decoder to get the best possible output

b. Align the decoder output to forced-alignment output using a dynamic programming string matching algorithm

c. The aligned output will have the same number of phones as the forced alignment output. So, we need to test two things for each phoneme:

If the phone is the same as the expected phoneme, nothing needs to be done.
If the phone differs from the expected phoneme, check whether it appears in the list of neighboring phonemes of the expected phoneme.

d. Then, we can run sphinx3_align with this outcome against the same wav file to check whether the acoustic scores actually indicate a better match.
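
Steps b and c can be sketched in Python as follows. This is a minimal illustration using a standard minimum edit distance alignment, not the project's implementation. The function names are hypothetical, and the neighbor lists are illustrative entries, not the contents of the neighbor file mentioned above. The example uses the first five expected phones and the first three decoded phones from the outputs shown earlier.

```python
def align_to_reference(ref, hyp):
    """Align hyp to ref by minimum edit distance.

    Returns a list as long as ref holding the hyp phone paired with each
    ref phone, or None where the ref phone has no counterpart (deletion).
    Extra hyp phones (insertions) are dropped.
    """
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/substitute
                cost[i - 1][j] + 1,                               # deletion
                cost[i][j - 1] + 1,                               # insertion
            )
    # Trace back from the bottom-right corner, preferring the diagonal.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            aligned[i - 1] = hyp[j - 1]
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return aligned


def classify(ref, aligned, neighbors):
    """Label each expected phone as match, neighbor, other or deleted."""
    labels = []
    for expected, observed in zip(ref, aligned):
        if observed is None:
            labels.append("deleted")
        elif observed == expected:
            labels.append("match")
        elif observed in neighbors.get(expected, ()):
            labels.append("neighbor")
        else:
            labels.append("other")
    return labels


# Illustrative neighbor lists (assumed for this example).
NEIGHBORS = {"OW": ["AO", "UH", "AW"], "CH": ["SH", "JH", "T"]}

expected = ["AH", "P", "R", "OW", "CH"]   # from forced alignment
decoded = ["P", "UH", "JH"]               # from the decoder
aligned = align_to_reference(expected, decoded)
labels = classify(expected, aligned, NEIGHBORS)
print(list(zip(expected, aligned, labels)))
```

When several alignments have the same cost, the traceback order decides which one is returned; a real system might weight substitutions between neighboring phones lower than other substitutions to make the choice less arbitrary.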

3. As an alternative to the above, I used sox to split each input wave file into individual phoneme wav files using the forced alignment phone labels, and then ran a separate recognition pass on each tiny speech segment. Now, I am writing separate grammar files for the neighboring phonemes of each phoneme. Once I complete them, I will check the decoder output for each phoneme segment. This should provide a more accurate assessment of mispronunciations.

4. I will update the wiki here at https://cmusphinx.github.io/wiki/pronunciation_evaluation with my current tasks and milestones.
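
The sox-based splitting described in item 3 can be sketched as a small Python helper that turns phone boundaries into `sox ... trim` commands. This is only an outline: the frame rate of 100 frames per second, the output naming scheme, and the function name are assumptions, not details taken from the project.

```python
FRAMES_PER_SECOND = 100  # assumption: one frame every 10 ms


def sox_trim_commands(wav_path, segments, out_dir="phones"):
    """Build one sox command per (start_frame, end_frame, phone) segment."""
    commands = []
    for index, (sfrm, efrm, phone) in enumerate(segments):
        start = sfrm / FRAMES_PER_SECOND
        duration = (efrm - sfrm + 1) / FRAMES_PER_SECOND
        out_path = f"{out_dir}/{index:03d}_{phone}.wav"
        # sox INPUT OUTPUT trim START LENGTH
        commands.append(
            ["sox", wav_path, out_path, "trim", f"{start:.3f}", f"{duration:.3f}"]
        )
    return commands


# Boundaries taken from the phone segmentation output for "approach".
segments = [(10, 21, "AH"), (22, 33, "P"), (34, 39, "R")]
for command in sox_trim_commands("phrase1.wav", segments):
    print(" ".join(command))
```

Each command list can be passed to `subprocess.run(command, check=True)` once the output directory exists.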

Jun 9, 2012

Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST.

(author: John Salatas)

Foreword
This article will review the OpenGrm NGram Library [1] and its usage for language modeling in ASR. OpenGrm makes use of functionality in the openFST library [2] to create, access and manipulate n-gram language models and it can be used as the language model training toolkit for integrating phonetisaurus' model training procedures [3] into a simplified application.

1. Model Training
Having the aligned corpus produced from cmudict by the aligner code of our previous article [4], the first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with [5]:

# ngramsymbols < cmudict.corpus > cmudict.syms

Given the symbol table, a text corpus can be converted to a binary finite state archive (far) [6] with:

# farcompilestrings --symbols=cmudict.syms --keep_symbols=1 cmudict.corpus > cmudict.far

The next step is to count n-grams in the input corpus, now converted to FAR format. The ngramcount command produces an n-gram count model in the FST format; its --order switch sets the maximum n-gram length to count.

The 1-gram through 9-gram counts for the cmudict.far finite-state archive file created above can be created with:

# ngramcount --order=9 cmudict.far > cmudict.cnts

Finally the 9-gram counts in cmudict.cnts above can be converted to a WFST model with:

# ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod

The --method option is used for selecting the smoothing method [7] from one of the six available:

witten_bell: smooths using Witten-Bell [8], with a hyperparameter k, as presented in [9].
absolute: smooths based on Absolute Discounting [10], using bins and discount parameters.
katz: smooths based on Katz Backoff [11], using bins parameters.
kneser_ney: smooths based on Kneser-Ney [12], a variant of Absolute Discounting.
presmoothed: normalizes at each state based on the n-gram count of the history.
unsmoothed: normalizes the model but provides no smoothing.
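
The four training commands in this section can be chained from a script. The Python sketch below only uses the commands and switches shown above; the function names and the derivation of the output file names from the corpus name are assumptions made for illustration.

```python
import subprocess


def opengrm_commands(corpus, order=9, method="kneser_ney"):
    """Return (argv, stdin file or None, stdout file) for each training step."""
    stem = corpus.rsplit(".", 1)[0]
    syms, far, cnts, mod = (f"{stem}.{ext}" for ext in ("syms", "far", "cnts", "mod"))
    return [
        (["ngramsymbols"], corpus, syms),
        (["farcompilestrings", f"--symbols={syms}", "--keep_symbols=1", corpus], None, far),
        (["ngramcount", f"--order={order}", far], None, cnts),
        (["ngrammake", f"--method={method}", cnts], None, mod),
    ]


def run_pipeline(steps):
    """Run each step, redirecting stdin and stdout as the shell commands do."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdin=stdin, stdout=out, check=True)
        finally:
            if stdin:
                stdin.close()


steps = opengrm_commands("cmudict.corpus")
for argv, stdin_path, stdout_path in steps:
    print(" ".join(argv), "<", stdin_path, ">", stdout_path)
# run_pipeline(steps)  # requires the OpenGrm and OpenFst tools on PATH
```

Separating command construction from execution keeps the smoothing method and n-gram order easy to vary when comparing models.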

2. Evaluation – Comparison with phonetisaurus

In order to evaluate OpenGrm models, the procedure described above was repeated using the standard 90%-10% split of the cmudict into a training and test set respectively. The binary fst format produced by ngrammake wasn't readable by the phonetisaurus evaluation script, so it was converted to ARPA format with:

# ngramprint --ARPA cmudict.mod > cmudict.arpa

and then back to a phonetisaurus binary fst format with:

# phonetisaurus-arpa2fst --input=cmudict.arpa --prefix="cmudict/cmudict"

Finally the test set was evaluated with

# evaluate.py --modelfile cmudict/cmudict.fst --testfile cmudict.dict.test --prefix cmudict/cmudict
Words: 13328 Hyps: 13328 Refs: 13328
##############################################
EVALUATION RESULTS
---------------------------------------------------------------------
(T)otal tokens in reference: 84955
(M)atches: 77165 (S)ubstitutions: 7044 (I)nsertions: 654 (D)eletions: 746
% Correct (M/T) -- %90.83
% Token ER ((S+I+D)/T) -- %9.94
% Accuracy 1.0-ER -- %90.06
---------------------------------------------------------------------
(S)equences: 13328 (C)orrect sequences: 8010 (E)rror sequences: 5318
% Sequence ER (E/S) -- %39.90
% Sequence Acc (1.0-E/S) -- %60.10
##############################################
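
The percentages in this report follow directly from the counts. As a quick check, the short Python sketch below recomputes them from the formulas printed in the report; the function names are hypothetical and this is not the code of evaluate.py.

```python
def token_metrics(total, matches, substitutions, insertions, deletions):
    """Percent correct (M/T), token error rate ((S+I+D)/T) and accuracy."""
    correct = 100.0 * matches / total
    error_rate = 100.0 * (substitutions + insertions + deletions) / total
    return {
        "correct": round(correct, 2),
        "token_er": round(error_rate, 2),
        "accuracy": round(100.0 - error_rate, 2),
    }


def sequence_metrics(sequences, errors):
    """Sequence error rate (E/S) and sequence accuracy."""
    error_rate = 100.0 * errors / sequences
    return {
        "sequence_er": round(error_rate, 2),
        "sequence_acc": round(100.0 - error_rate, 2),
    }


tokens = token_metrics(total=84955, matches=77165,
                       substitutions=7044, insertions=654, deletions=746)
sequences = sequence_metrics(sequences=13328, errors=5318)
print(tokens)
print(sequences)
```

Because insertions count as errors but are not part of the reference total, the percent correct (90.83) and the accuracy (90.06) differ.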

3. Conclusion – Future Works

The evaluation results above are, as expected, almost identical to those produced by the phonetisaurus procedures, which use the MITLM toolkit instead of OpenGrm.

With the procedure described above in place, the next step is to integrate all of these commands into a simplified application, combined with the dictionary alignment code introduced in our previous article [13].

References

[1] OpenGrm NGram Library

[2] openFST library

[3] Phonetisaurus: A WFST-driven Phoneticizer – Framework Review

[4] Porting phonetisaurus many-to-many alignment python script to C++

[5] OpenGrm NGram Library Quick Tour

[6] OpenFst Extensions: FST Archives (FARs)

[7] S. F. Chen, J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", Harvard Computer Science Technical Report TR-10-98, 1998.

[8] I. H. Witten, T. C. Bell, "The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression", IEEE Transactions on Information Theory 37 (4), pp. 1085–1094, 1991.

[9] B. Carpenter, "Scaling high-order character language models to gigabytes", Proceedings of the ACL Workshop on Software, pp. 86–99, 2005.

[10] H. Ney, U. Essen, R. Kneser, "On structuring probabilistic dependences in stochastic language modeling", Computer Speech and Language 8, pp. 1–38, 1994.

[11] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recogniser", IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), pp. 400–401, 1987.

[12] R. Kneser, H. Ney, "Improved backing-off for m-gram language modeling", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 181–184, 1995.

[13] Porting phonetisaurus many-to-many alignment python script to C++
