Setting up Development Environment

This information is outdated and does not apply to recent versions. See our wiki for up-to-date documentation.

Sphinx4 is an open source speech recognition engine developed by a wide variety of researchers and developers.

Here is an introduction to setting up the development environment if you would like to contribute to the project. You can find guidelines for other platforms on the sphinx4 wiki. Here, I focus only on Ubuntu with Eclipse.

Ubuntu & Eclipse

This procedure has been tested on Ubuntu 12.04, but should also work for newer and older releases.

Required Software

  • JDK (Sphinx-4 is written in Java and therefore requires a JVM to run; a JDK is usually already installed on Ubuntu)
  • Eclipse (Integrated Development Environment)
  • Ant (to build the source code)
  • Subversion (svn, source code control)
  • Subclipse (SVN integration for Eclipse)
  • sharutils (to unpack JSAPI)

Step by Step

Install the basic required software
$ sudo apt-get install eclipse subversion ant sharutils
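If a JDK is missing, it can be installed from the standard Ubuntu repositories as well; the package name below is the generic metapackage and is given only as a likely example:
$ sudo apt-get install default-jdk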
Install Subclipse

  • Open Eclipse
  • "Help" -> "Install New Software"
  • Click "Add"
  • "Name" = "SVN", and Location is "", Click "OK"
  • Check "SVNKit" and Open the submenu of  "Subclipse" in the Name field, and check "Subclipse (Required)", Click "Next"
  • Click "Next"
  • Check "I accept ...", Click "Finish"
  • Click "OK", "Restart Now"

Obtaining the Code

  • Open Eclipse
  • Click "File" → "New" → "Project" → "SVN", choose "Checkout Project from SVN"
  • Create a new repository location, click "Next"
  • URL = "" (replace to your own source code directory)
  • Choose the code folder you want to check out, such as "branches/sphinx4_slm"
  • Finish
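If you prefer the command line to Subclipse, the same checkout can be done with plain svn; replace the placeholder below with the actual repository URL and choose any target directory name:
$ svn checkout <repository-url>/branches/sphinx4_slm sphinx4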

Building Sphinx-4

Set up JSAPI 1.0 (see the sphinx4 wiki)

  • Go to the lib directory.
  • Type $ chmod +x ./
  • Type $ sh ./ and view the BCL (Binary Code License); accept it by typing "y"
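As a concrete sketch, and assuming the setup script shipped in the lib directory is named jsapi.sh (the exact name may differ in your checkout), the sequence looks like this:
$ cd sphinx4/lib
$ chmod +x jsapi.sh
$ sh jsapi.sh
Read the BCL when prompted and type "y" to accept it; this should unpack the JSAPI jar into the lib directory (hence the sharutils requirement).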

Now you can build sphinx4 in Eclipse as follows:

  • "Run" → "External Tools" → "sphinx4 build.xml"

If you want to run a demo project, follow the steps below:

  • Project → Properties → Java Build Path → Source
  • Open "sphinx4/src/apps" in the "Source folders on build path" tab
  • Double click "Included: All"
  • Click "Add Multiple"
  • Select "edu", click "OK"
  • Click "Finish" in the "Inclusion and Exclusion Patterns" window
  • "OK"

Now, you can open the file "src/apps/edu/cmu/sphinx/demo/"

  • click "Run" → "Run"

Commit Your Code

  • Right-click the file or folder that you want to commit → "Team" → "Commit"
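If you prefer the command line, the equivalent svn commit looks like the following; the path and commit message are placeholders:
$ svn commit -m "Describe your change here" path/to/changed/file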

Porting the phonetisaurus many-to-many alignment Python script to C++

(author: John Salatas)

Following our previous article on phonetisaurus [1] and the decision to use this framework as the g2p conversion method for my GSoC project, this article will describe the port of the dictionary alignment script to C++.

1. Installation
The procedure below has been tested on an Intel CPU running openSuSE 12.1 x64 with gcc 4.6.2. Further testing is required for other systems (Mac OS X, Windows).

The alignment script requires the openFST library [2] to be installed on your system. Having downloaded, compiled and installed openFST, the first step is to check out the alignment code from the cmusphinx SVN repository:

$ svn co

and compile it
$ cd train
$ make
g++ -c -g -o src/align.o src/align.cpp
g++ -c -g -o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/M2MFstAligner.cpp
g++ -c -g -o src/phonetisaurus/FstPathFinder.o src/phonetisaurus/FstPathFinder.cpp
g++ -g -l fst -l dl -o src/align src/align.o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/FstPathFinder.o
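If the link step or the resulting binary cannot find libfst, make sure the openFST install location is on the library path; assuming the default /usr/local prefix, something like the following usually suffices:
$ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH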

2. Usage
Having compiled the script, you can run it without any command line arguments to print out its usage, which is similar to that of the original phonetisaurus m2m-aligner Python script:
$ cd src
$ ./align
Input file not provided
Usage: ./align [--seq1_del] [--seq2_del] [--seq1_max SEQ1_MAX] [--seq2_max SEQ2_MAX]
[--seq1_sep SEQ1_SEP] [--seq2_sep SEQ2_SEP] [--s1s2_sep S1S2_SEP]
[--eps EPS] [--skip SKIP] [--seq1in_sep SEQ1IN_SEP] [--seq2in_sep SEQ2IN_SEP]
[--s1s2_delim S1S2_DELIM] [--iter ITER] --ifile IFILE --ofile OFILE

--seq1_del, Allow deletions in sequence 1. Defaults to false.
--seq2_del, Allow deletions in sequence 2. Defaults to false.
--seq1_max SEQ1_MAX, Maximum subsequence length for sequence 1. Defaults to 2.
--seq2_max SEQ2_MAX, Maximum subsequence length for sequence 2. Defaults to 2.
--seq1_sep SEQ1_SEP, Separator token for sequence 1. Defaults to '|'.
--seq2_sep SEQ2_SEP, Separator token for sequence 2. Defaults to '|'.
--s1s2_sep S1S2_SEP, Separator token for seq1 and seq2 alignments. Defaults to '}'.
--eps EPS, Epsilon symbol. Defaults to ''.
--skip SKIP, Skip/null symbol. Defaults to '_'.
--seq1in_sep SEQ1IN_SEP, Separator for seq1 in the input training file. Defaults to ''.
--seq2in_sep SEQ2IN_SEP, Separator for seq2 in the input training file. Defaults to ' '.
--s1s2_delim S1S2_DELIM, Separator for seq1/seq2 in the input training file. Defaults to ' '.
--iter ITER, Maximum number of iterations for EM. Defaults to 10.
--ifile IFILE, File containing sequences to be aligned.
--ofile OFILE, Write the alignments to file.

The two required options are the pronunciation dictionary to align (IFILE) and the file in which the aligned corpus will be saved (OFILE). The script provides default values for all other options, and cmudict (v. 0.7a) can be aligned simply with one of the following commands:
$ ./align --seq1_del --seq2_del --ifile --ofile
which allows deletions in both graphemes and phonemes, or
$ ./align --ifile --ofile
which does not allow deletions.

3. Performance
In order to test the new alignment script's performance, both in its results and in its CPU and memory requirements, I performed two tests aligning the full cmudict (v. 0.7a) while allowing deletions in both sequences:
$ ./align --seq1_del --seq2_del --ifile ../data/cmudict.dict --ofile ../data/cmudict.corpus.gsoc
and compared it with the original phonetisaurus script
$ ./ --align ../train/data/cmudict.dict -s2 -s1 --write_align ../train/data/cmudict.corpus

3.1. Alignment
Comparing the two outputs with the Linux diff utility did not reveal major differences. Minor differences were noticed in the alignment of double vowels and consonants to a single phoneme, as in the two following examples:
--- cmudict.corpus
+++ cmudict.corpus.gsoc
-B}B O}AO1 R}_ R}R I}IH0 S}S
+B}B O}AO1 R}R R}_ I}IH0 S}S
-B}B O}AO1 S|C}SH H}_ E}_ E}IY0
+B}B O}AO1 S|C}SH H}_ E}IY0 E}_

3.2. CPU and memory usage
On the system described above, the average running time (over two runs) for the new align command was 1h:14m, compared to an average of 1h:28m for the original phonetisaurus script. Both scripts consumed about the same amount of RAM (~1.7 GB).

Conclusion – Future work
This article presented the new g2p alignment script, which seems to produce the same results as the original one and is slightly faster.
Although it should compile and run as expected on any modern Linux system, further testing is required for other systems (such as Mac OS X and Windows). We also need to investigate the alignment differences (compared to the original script) for double vowels and consonants as described above. Although this does not seem critical, it may cause problems later.

[1] Phonetisaurus: A WFST-driven Phoneticizer – Framework Review

[2] OpenFst Library,

[GSoC 2012: Pronunciation Evaluation #ronanki] Work prior to Official Start

Well, it has been a month since I got accepted into this year's Google Summer of Code. The community bonding period within the CMU Sphinx organization has been a great time for me.

It has been four days since GSoC 2012 started officially. Prior to that, I familiarized myself with a few different things with the help of my mentor. He created a wiki page for our projects; Troy and I are going to blog here and update the wiki there during this summer, so please check here for important updates.

Currently, my goal is to build a web interface which allows users to evaluate their pronunciation. Some of the sub-tasks have already been accomplished, and some of them are still ongoing:

Work accomplished:

  • Created an initial web interface which allows users to record and play back their speech using the open source wami-recorder, which is being designed by the Spoken Language Systems group at MIT.
  • When the recording is completed, the wave file is uploaded to the server for processing.
  • Sphinx3 forced alignment is used to align a phoneme string expected from the utterance with the recorded speech to calculate time endpoints and acoustic scores for each phoneme.
  • I tried many different output arguments to sphinx3_align and successfully tested producing the phoneme acoustic scores using two recognition passes (see the sketch after this list):
    • In the first pass, I use -phlabdir as an argument to get a .lab file as output, which contains the list of recognized phonemes.
    • In the second pass, I use that list to get acoustic scores for each phoneme using -wdsegdir as an input argument.
  • Later, I integrated sphinx3 forced alignment with the wami-recorder microphone recording applet so that the user sees the acoustic scores after uploading their recording.
  • Please try this link to test it:
  • Wrote a program to convert a list of each phoneme's "neighbors," or most similar other phonemes (provided by the project mentor), from the Worldbet phonetic alphabet to CMUbet.
  • Wrote a program to take a string of phonemes representing an expected utterance as input and produce a sphinx3 recognition grammar consisting of a string of alternatives representing each expected phoneme and all of its neighboring phonemes, for automatic edit distance scoring.
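A rough command-line sketch of the two-pass use of sphinx3_align mentioned above is shown below; all model, dictionary, control, and output paths are hypothetical placeholders, and only the -phlabdir and -wdsegdir arguments come from the description in the list.
Pass 1 (write .lab files containing the recognized phoneme list):
$ sphinx3_align -hmm model_dir -dict phoneme.dict -fdict filler.dict -ctl utts.ctl -cepdir feat -insent utts.transcription -phlabdir lab_out
Pass 2 (use that phoneme list to obtain per-phoneme segmentations and acoustic scores):
$ sphinx3_align -hmm model_dir -dict phoneme.dict -fdict filler.dict -ctl utts.ctl -cepdir feat -insent utts_phonemes.transcription -wdsegdir wdseg_out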
Ongoing work:
  • Reading about Worldbet, OGIbet, ARPAbet, and CMUbet, the different ASCII-based phonetic alphabets and their mappings between each other and the International Phonetic Alphabet.
  • Will be enhancing the first pass of recognition described above using the generated alternative neighboring phoneme grammars to find phonemes which match the recorded speech more closely than the expected phonemes without using complex post-processing acoustic score statistics.
  • Trying more parameters and options to derive acoustic scores for each phoneme from sphinx3 forced alignment.
  • Writing an exemplar score aggregation algorithm to find the means, standard deviations, and their expected error for each phoneme in a phrase from a set of recorded exemplar pronunciations of that phrase.
  • Writing an algorithm which can detect mispronunciations by comparing a recording's acoustic scores to the expected mean and standard deviation for each phoneme, and aggregating those scores to biphones, words, and the entire phrase.

[GSoC 2012: Pronunciation Evaluation #Troy] Before Week 1

Google Summer of Code 2012 officially started this Monday (21 May). Our weekly reports are expected to begin next Monday, but here is a brief overview of the preparations we accomplished during the "community bonding period."

We started with a group chat including our mentor James and the other student, Ronanki. The project details are becoming clearer to me from the chat and subsequent email communications. For my project, the major focuses will be:
1) A web portal for automatic pronunciation evaluation audio collection; and
2) An Android-based mobile automatic pronunciation evaluation app.
The core of these two applications is edit distance grammar-based automatic pronunciation evaluation using CMU Sphinx3.

Here are the preparations I have accomplished during the bonding period:

  1. Trying out the basic wami-recorder demo on my school's server;
  2. Modifying rtmplite for audio recording. Rtmplite is a Python implementation of an RTMP server with the minimum support needed for real-time streaming and recording using Adobe's AMF0 protocol. On the server side, the RTMP server daemon process listens on TCP port 1935 by default for connections and media data streaming. On the client side, the Flash client uses Adobe ActionScript 3's NetConnection function to set up a session with the server, and the NetStream function for audio and video streaming as well as microphone recording. The demo application has been set up at:
  3. Based on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started to write my own audio recorder, which is a key user interface component for both the web-based audio data collection and the evaluation app. The basic version of the recorder was hosted at: . The current implementation:
    1. Distinguishes recordings from different users with user IDs;
    2. Loads pre-defined text sentences to display for recording, which will be useful for pronunciation exemplar data collection;
    3. Performs real-time audio recording;
    4. Can play back the recordings from the server; and
    5. Has basic event control logic, such as to prevent users from recording and playing at the same time, etc.
  4. Also, I have learned how to get phoneme acoustic scores from "forced alignment" using sphinx3. To generate the phoneme alignment scores, two steps are needed. The details of how to perform that alignment can be found in my more tech-oriented posts and on my personal blog.
Currently, these tasks are ongoing:
  1. Set up the server side process to manage user recordings, i.e., distinguishing between users and different utterances.
  2. Figure out how to use ffmpeg, speexdec, and/or sox to automatically convert the recorded server-side FLV files to PCM .wav files after the users upload the recordings (see the sketch after this list).
  3. Verify the recording parameters against the recording and speech recognition quality, possibly taking the network bandwidth into consideration.
  4. Incorporate delays between network and microphone events in the recorder. The current version does not wait for the network events (such as connection setup, data package transmission, etc.) to finish successfully before processing the next user event, which can often cause the recordings to be clipped.
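As a sketch of the conversion in item 2, something like the following could decode the uploaded FLV audio and resample it to 16 kHz mono for recognition; the file names and target sample rate are assumptions rather than settled project parameters:
$ ffmpeg -i recording.flv recording.wav
$ sox recording.wav -r 16000 -c 1 recording-16k.wav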

My GSoC Project Page: