Sphinx3 phoneme acoustic scores, CMUbet neighboring phonemes

(Author: Srikanth Ronanki)

(Status: GSoC 2012 Pronunciation Evaluation Week 1)

Last week, I accomplished the following:

1. Successfully tested producing phoneme acoustic scores from sphinx3_align using two recognition passes. I was able to pass the state segmentation parameter -stsegdir as an argument to the program to obtain acoustic scores for each frame, and thereby for each phoneme as well. However, the program's output still needs to be decoded from its integer format, which I will try to do by the end of next week.
  2. Last week I wrote a program which converts a list of each phoneme's "neighbors," or most similar other phonemes, provided by the project mentor, from the Worldbet phonetic alphabet to CMUbet. But yesterday, when I compared both files manually, I found that some of the phones were mismatched. So I re-checked my code and fixed the bug. The corrected program takes a string of phonemes representing an expected utterance as input and produces a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring.

All the programs I have written so far are checked in at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/ronanki using subversion. (Similarly, Troy's code is checked in at http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/troy.)

Here is the procedure for using that code to obtain CMUbet neighboring phonemes from a file containing a string of phonemes:

  • To convert the Worldbet phonetic alphabet to CMUbet: python convert_world2cmu.py

  • To convert an input list of phonemes to their neighboring phones: python convert2_ngbphones.py

  • Example: "I had faith in them" (arctic_a0030), a sentence from the arctic database:

     AY HH AE D F EY TH IH N DH EH M (arctic_a0030)

     {AY|AA|IY|OY|EY} {HH|TH|F|P|T|K} {AE|EH|ER|AH} {D|T|JH|G|B} {F|HH|TH|V} {EY|EH|IY|AY} {TH|S|DH|F|HH} {IH|IY|AX|EH} {N|M|NG} {DH|TH|Z|V} {EH|IH|AX|ER|AE} {M|N} (arctic_a0030)
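The conversion performed by convert2_ngbphones.py amounts to a table lookup over the neighbor list. A minimal sketch is below; the NEIGHBORS mapping is a hypothetical subset reconstructed from the example above, not the actual mentor-provided table, and the function name is illustrative:

```python
# Hypothetical subset of the CMUbet neighbor table, reconstructed
# from the arctic_a0030 example; the real table comes from the
# mentor-provided Worldbet neighbor list.
NEIGHBORS = {
    "AY": ["AA", "IY", "OY", "EY"], "HH": ["TH", "F", "P", "T", "K"],
    "AE": ["EH", "ER", "AH"],       "D":  ["T", "JH", "G", "B"],
    "F":  ["HH", "TH", "V"],        "EY": ["EH", "IY", "AY"],
    "TH": ["S", "DH", "F", "HH"],   "IH": ["IY", "AX", "EH"],
    "N":  ["M", "NG"],              "DH": ["TH", "Z", "V"],
    "EH": ["IH", "AX", "ER", "AE"], "M":  ["N"],
}

def to_grammar(phonemes):
    """Turn an expected phoneme string into a string of alternatives:
    each phoneme becomes {phoneme|neighbor1|neighbor2|...}."""
    return " ".join(
        "{" + "|".join([p] + NEIGHBORS.get(p, [])) + "}"
        for p in phonemes.split()
    )
```

For the sentence above, to_grammar("AY HH AE D F EY TH IH N DH EH M") reproduces the bracketed alternatives shown; a phoneme with no known neighbors simply passes through as {phoneme}.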

Troy: GSoC 2012 Pronunciation Evaluation Week 1

The first week of GSoC 2012 has already been busy. Here is what I have accomplished so far:

  1. To measure the effect of the Speex recording "quality" parameter (which is set by the client from 0 to 10), I recorded the same Sphinx3 test utterance ("NO ONE AT THE STATE DEPARTMENT WANTS TO LET SPIES IN") with the quality varying from 0 to 10. As shown on the graph, the higher the Speex quality parameter, the larger the .FLV file. Judging from my own listening, greater quality parameter values do result in better quality, but it is difficult to hear the differences above level 7. I also tried to generate alignment scores to see whether the quality affects the alignment. However, from the results shown in the following graph, the acoustic scores seem essentially identical across the different recordings. But to be on the safe side in case of background and line noise, for now we will use a Speex recording quality parameter of 8.
  2. The rtmplite server is now configured to save its uploaded files to the [path_to_webroot]/data directory on the server. The initial audioRecorder applet will place its recordings in the [path_to_webroot]/data/audioRecorder directory, and for each user there will be a separate folder (e.g. [path_to_webroot]/data/audioRecorder/user1). For each recorded utterance, the file name is now in the format [sentence name]_[quality level].flv.
  3. The conversion from .FLV Speex uploads to .WAV PCM audio files is done entirely in the rtmplite server, using a process spawned by Python's subprocess.Popen() function calling ffmpeg. After rtmplite closes the FLV file, the conversion is performed immediately, and the converted WAV file has exactly the same path and name except for the suffix, which is .wav instead of .flv. Guillem suggested the sox command for the conversion, but it doesn't recognize .flv files directly. Other possibilities included speexdec, but that won't open .flv files either.
  4. In the audioRecorder client, the user interface now waits for NetConnection and NetStream events to open and close successfully before proceeding with other events. And a 0.5 second delay has been inserted at the beginning and end of the recording button click event to avoid inadvertently trimming the front or end of the recording.
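The conversion step in item 3 could be sketched roughly as follows, assuming ffmpeg is on the PATH; the helper names are illustrative, not the actual rtmplite hook:

```python
import subprocess
from pathlib import Path

def wav_path_for(flv_path):
    """Same directory and basename as the upload, .wav instead of .flv,
    e.g. .../audioRecorder/user1/arctic_a0030_8.flv -> ..._8.wav"""
    return Path(flv_path).with_suffix(".wav")

def convert_upload(flv_path):
    """Spawn ffmpeg to decode the Speex .flv into a PCM .wav file.
    Called after rtmplite has closed the uploaded FLV."""
    wav = wav_path_for(flv_path)
    # -y overwrites any stale output left by a previous upload attempt
    return subprocess.Popen(["ffmpeg", "-y", "-i", str(flv_path), str(wav)])
```

Because Popen() returns immediately, the upload handler is not blocked while ffmpeg runs; the caller can poll or wait on the returned process if it needs the WAV file before aligning.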
My plans for the 2nd week are:
  1. Solve a problem encountered when converting FLV files to WAV using ffmpeg with Python's Popen() function. If the main Python script (call it test.py, for example) is run from a terminal as "python test.py", everything works great. However, if I put it in the background and log off the server with "python test.py &", then every time Popen() is invoked the whole process hangs with a "Stopped + test.py &" message. I will try to figure out a way to work around this issue. Maybe if I start the process from cron (after checking whether it is already running via a process ID number in a .pid text file), it will start subprocesses without stopping as occurs when it is detached from a terminal.
  2. Finish the upload interface. There will be two kinds of interfaces: one for students and one for exemplar pronunciations. For the students, we will display from one to five cue phrases below space for a graphic or animation, assuming the smallest screen possible using HTML which would also look good in a larger window. For the exemplar recordings, we just need to display one phrase, but we should also have per-upload form fields (e.g., name, age, sex, native speaker (y/n?), where the speaker lived at ages 6-8 (which determines their accent), self-reported accent, etc.) which should persist across multiple uploads by the same user (perhaps using HTTP cookies). I want to integrate those fields with the MySQL database running on our server, so I will need to create a SQL schema with some CREATE TABLE statements to hold all those fields, the filenames, maybe recording durations, the date and time, and perhaps other information.
  3. Test the rtmplite upload server to make sure it works correctly and without race conditions during simultaneous uploads from multiple users, and both sequential and simultaneous recording uploads by the same user, just to be on the safe side.
  4. Further milestones are listed at https://cmusphinx.github.io/wiki/pronunciation_evaluation#milestones1
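The cron-based workaround in the first plan could be sketched like this; the pid-file location and helper names are assumptions for illustration, not existing project code:

```python
import os

PID_FILE = "/tmp/rtmplite_convert.pid"  # hypothetical location

def already_running(pid_file=PID_FILE):
    """Return True if the pid recorded in pid_file belongs to a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without sending anything
    except ProcessLookupError:
        return False     # stale pid file: no such process
    except PermissionError:
        return True      # process exists but is owned by another user
    return True

def write_pid(pid_file=PID_FILE):
    """Record our own pid so the next cron invocation can skip us."""
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
```

A cron job would call already_running() first and exit if it returns True; because the cron-started process has no controlling terminal, Popen() subprocesses should not receive the stop signal that hits backgrounded jobs after logout.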

Setting up Development Environment

This information is outdated and cannot be applied to recent versions. See our wiki for up-to-date documentation.

Sphinx4 is an open source speech recognition engine, which involves a wide variety of researchers and developers.

Here is an introduction to setting up the environment if you would like to contribute to the project. You can find guidelines for other platforms on the sphinx4 wiki. Here, I focus only on Ubuntu with Eclipse.

Ubuntu & Eclipse

This procedure has been tested on Ubuntu 12.04, but should also work for newer and older releases.

Required Software

  • JDK (Sphinx-4 is written in Java and therefore requires a JVM to run; the JDK is usually already installed on Ubuntu)
  • Eclipse (Integrated Development Environment)
  • Ant (to build the source code)
  • Subversion (svn, source code control)
  • Subclipse (svn for Eclipse)
  • sharutils (to unpack jsapi)

Step by Step

Install basic required software
$ sudo apt-get install eclipse subversion ant sharutils
Install Subclipse

  • Open Eclipse
  • "Help" -> "Install New Software"
  • Click "Add"
  • "Name" = "SVN", Location = "http://subclipse.tigris.org/update_1.8.x", click "OK"
  • Check "SVNKit", open the "Subclipse" submenu in the Name field and check "Subclipse (Required)", then click "Next"
  • Click "Next"
  • Check "I accept ...", Click "Finish"
  • Click "OK", "Restart Now"

Obtaining the Code

  • Open Eclipse
  • Click "File" → "New" → "Project" → "SVN", choose "Checkout Project from SVN"
  • Create a new repository location, click "Next"
  • URL = "https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx" (replace with your own source code directory)
  • Choose the code folder you want to check out, such as "branches/sphinx4_slm"
  • Finish

Building Sphinx-4

Set up JSAPI 1.0 (see the sphinx4 wiki):

  • Go to the lib directory.
  • Type $ chmod +x ./jsapi.sh
  • Type $ sh ./jsapi.sh, view the BCL, and accept it by typing "y"

Now you can build sphinx4 in Eclipse as follows:

  • "Run" → "External Tools" → "sphinx4 build.xml"

If you want to run a demo project, follow the steps below:

  • Project → Properties → Java Build Path → Source
  • Open "sphinx4/src/apps" in the "source folders on build path" tab
  • Double-click "Included: All"
  • Click "Add Multiple"
  • Select "edu", click "OK"
  • Click "Finish" in the "Inclusion and Exclusion Patterns" window
  • Click "OK"

Now, you can open the file "src/apps/edu/cmu/sphinx/demo/HelloWorld.java"

  • Click "Run" → "Run"

Committing Your Code

  • Right-click the file or folder that you want to commit → "Team" → "Commit"

Porting phonetisaurus many-to-many alignment python script to C++

(author: John Salatas)

Following our previous article on phonetisaurus [1] and the decision to use this framework as the g2p conversion method for my GSoC project, this article will describe the port of the dictionary alignment script to C++.

1. Installation
The procedure below was tested on an Intel CPU running openSuSE 12.1 x64 with gcc 4.6.2. Further testing is required for other systems (Mac OS X, Windows).

The alignment script requires the openFST library [2] to be installed on your system. Having downloaded, compiled, and installed openFST, the first step is to check out the alignment code from the cmusphinx SVN repository:

$ svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/train

and compile it:
$ cd train
$ make
g++ -c -g -o src/align.o src/align.cpp
g++ -c -g -o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/M2MFstAligner.cpp
g++ -c -g -o src/phonetisaurus/FstPathFinder.o src/phonetisaurus/FstPathFinder.cpp
g++ -g -l fst -l dl -o src/align src/align.o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/FstPathFinder.o

2. Usage
Having compiled the script, running it without any command-line arguments prints its usage, which is similar to that of the original phonetisaurus m2m-aligner python script:
$ cd src
$ ./align
Input file not provided
Usage: ./align [--seq1_del] [--seq2_del] [--seq1_max SEQ1_MAX] [--seq2_max SEQ2_MAX]
[--seq1_sep SEQ1_SEP] [--seq2_sep SEQ2_SEP] [--s1s2_sep S1S2_SEP]
[--eps EPS] [--skip SKIP] [--seq1in_sep SEQ1IN_SEP] [--seq2in_sep SEQ2IN_SEP]
[--s1s2_delim S1S2_DELIM] [--iter ITER] --ifile IFILE --ofile OFILE

--seq1_del, Allow deletions in sequence 1. Defaults to false.
--seq2_del, Allow deletions in sequence 2. Defaults to false.
--seq1_max SEQ1_MAX, Maximum subsequence length for sequence 1. Defaults to 2.
--seq2_max SEQ2_MAX, Maximum subsequence length for sequence 2. Defaults to 2.
--seq1_sep SEQ1_SEP, Separator token for sequence 1. Defaults to '|'.
--seq2_sep SEQ2_SEP, Separator token for sequence 2. Defaults to '|'.
--s1s2_sep S1S2_SEP, Separator token for seq1 and seq2 alignments. Defaults to '}'.
--eps EPS, Epsilon symbol. Defaults to ''.
--skip SKIP, Skip/null symbol. Defaults to '_'.
--seq1in_sep SEQ1IN_SEP, Separator for seq1 in the input training file. Defaults to ''.
--seq2in_sep SEQ2IN_SEP, Separator for seq2 in the input training file. Defaults to ' '.
--s1s2_delim S1S2_DELIM, Separator for seq1/seq2 in the input training file. Defaults to ' '.
--iter ITER, Maximum number of iterations for EM. Defaults to 10.
--ifile IFILE, File containing sequences to be aligned.
--ofile OFILE, Write the alignments to file.

The two required options are the pronunciation dictionary to align (IFILE) and the file in which the aligned corpus will be saved (OFILE). The script provides default values for all other options, so cmudict (v. 0.7a) can be aligned simply with one of the following commands:
$ ./align --seq1_del --seq2_del --ifile IFILE --ofile OFILE
allowing deletions in both graphemes and phonemes, or
$ ./align --ifile IFILE --ofile OFILE
not allowing deletions.

3. Performance
In order to test the new alignment script's performance, in both its results and its CPU and memory requirements, I performed two tests aligning the full cmudict (v. 0.7a), allowing deletions in both sequences:
$ ./align --seq1_del --seq2_del --ifile ../data/cmudict.dict --ofile ../data/cmudict.corpus.gsoc
and compared the results with the original phonetisaurus script:
$ ./m2m-aligner.py --align ../train/data/cmudict.dict -s2 -s1 --write_align ../train/data/cmudict.corpus

3.1. Alignment
Comparing the two outputs using the linux diff utility didn't reveal major differences. Minor differences were noticed in the alignment of double vowels and consonants to a single phoneme, as in the two following examples:
--- cmudict.corpus
+++ cmudict.corpus.gsoc
-B}B O}AO1 R}_ R}R I}IH0 S}S
+B}B O}AO1 R}R R}_ I}IH0 S}S
-B}B O}AO1 S|C}SH H}_ E}_ E}IY0
+B}B O}AO1 S|C}SH H}_ E}IY0 E}_
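In these entries each whitespace-separated token is a seq1}seq2 pair, with '|' joining multi-grapheme chunks and '_' as the skip/null symbol (the defaults shown in the usage above). A small sketch of a parser for this format (a hypothetical helper, not part of the ported code):

```python
def parse_alignment(entry):
    """Split one aligned entry such as 'B}B O}AO1 S|C}SH H}_ E}IY0 E}_'
    into ([graphemes], phoneme) pairs. '}' separates the two sequences,
    '|' separates grouped graphemes, and '_' marks a skip/null symbol."""
    pairs = []
    for token in entry.split():
        seq1, seq2 = token.split("}")
        pairs.append((seq1.split("|"), seq2))
    return pairs
```

Applied to the two diff lines above, such a parser makes the difference easy to see programmatically: the scripts disagree only on which of the doubled letters maps to the phoneme and which maps to the skip symbol.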

3.2. CPU and memory usage
On the system described above, the average running time (over two runs) for the new align command was 1h:14m, compared to an average of 1h:28m for the original phonetisaurus script. Both scripts consumed the same amount of RAM (~1.7 GB).

Conclusion – Future work
This article presented the new g2p align script, which seems to produce the same results as the original one and is slightly faster.
Although it should compile and run as expected on any modern Linux system, further testing is required for other systems (such as Mac OS X and Windows). We also need to investigate the alignment differences (compared to the original script) in the double vowels and consonants described above. Although they don't seem critical, they may cause problems later.

[1] Phonetisaurus: A WFST-driven Phoneticizer – Framework Review

[2] OpenFst Library, http://www.openfst.org/twiki/bin/view/FST/WebHome