[GSoC 2012: Pronunciation Evaluation #Troy] Before Week 1

Google Summer of Code 2012 officially started this Monday (21 May). Our weekly reports are expected to begin next Monday, but here is a brief overview of the preparations we have accomplished during the "community bonding period."

We started with a group chat including our mentor James and the other student, Ronanki. The project details are becoming clearer to me from the chat and subsequent email communications. For my project, the major focuses will be:

1) A web portal for automatic pronunciation evaluation audio collection; and

2) An Android-based mobile automatic pronunciation evaluation app.

The core of these two applications is automatic pronunciation evaluation based on edit distance grammars, using CMU Sphinx3.
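To give a rough feel for the "edit distance" idea, here is a minimal Python sketch that computes the plain Levenshtein distance between an expected and an observed phoneme sequence. This is only an illustration under my own assumptions: it is not the Sphinx3 grammar-based method used in the project, the function name phoneme_edit_distance is hypothetical, and the ARPAbet-style phoneme labels are example data.

```python
def phoneme_edit_distance(expected, observed):
    """Levenshtein distance between two phoneme sequences (lists of strings).

    Hypothetical helper for illustration only; each insertion, deletion,
    and substitution costs 1.
    """
    # prev[j] is the distance between the expected phonemes consumed so far
    # (before the current one) and observed[:j].
    prev = list(range(len(observed) + 1))
    for i, exp_ph in enumerate(expected, start=1):
        curr = [i]  # distance from expected[:i] to an empty observed sequence
        for j, obs_ph in enumerate(observed, start=1):
            cost = 0 if exp_ph == obs_ph else 1
            curr.append(min(
                prev[j] + 1,         # expected phoneme was dropped by the speaker
                curr[j - 1] + 1,     # speaker inserted an extra phoneme
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]
```

For example, comparing the expected sequence HH AH L OW with an observed HH EH L OW counts a single substitution, so a lower distance suggests a pronunciation closer to the expected one.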

Here are the preparations I have accomplished during the bonding period:

Trying out the basic wami-recorder demo on my school's server;
Changing rtmplite for audio recording. Rtmplite is a Python implementation of an RTMP server with the minimum support needed for real-time streaming and recording using Adobe's AMF0 protocol. On the server side, the RTMP server daemon process listens on TCP port 1935 by default for connections and media data streaming. On the client side, the Flash user needs to use Adobe ActionScript 3's NetConnection class to set up a session with the server, and the NetStream class for audio and video streaming as well as microphone recording. The demo application has been set up at: http://talknicer.net/~li-bo/testClient/bin-debug/testClient.html
Based on my understanding of the demo application, which does real-time streaming and recording of both audio and video, I started to write my own audio recorder, which is a key user interface component for both the web-based audio data collection and the evaluation app. The basic version of the recorder is hosted at: http://talknicer.net/~li-bo/audioRecorder/audioRecorder.html . The current implementation:
1. Distinguishes recordings from different users with user IDs;
2. Loads pre-defined text sentences to display for recording, which will be useful for pronunciation exemplar data collection;
3. Performs real-time audio recording;
4. Can play back the recordings from the server; and
5. Has basic event control logic, such as preventing users from recording and playing back at the same time.
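The event control logic in the last item can be pictured as a small state machine. The recorder itself is a Flash application, so the following Python sketch is only an illustration of the idea under my own assumptions, not the actual recorder code; the names RecorderState and RecorderControl are hypothetical.

```python
from enum import Enum


class RecorderState(Enum):
    IDLE = "idle"
    RECORDING = "recording"
    PLAYING = "playing"


class RecorderControl:
    """Hypothetical controller that allows only one activity at a time."""

    def __init__(self):
        self.state = RecorderState.IDLE

    def start_recording(self):
        # Refuse to record while a recording or playback is in progress.
        if self.state is not RecorderState.IDLE:
            return False
        self.state = RecorderState.RECORDING
        return True

    def start_playback(self):
        # Refuse to play back while a recording or playback is in progress.
        if self.state is not RecorderState.IDLE:
            return False
        self.state = RecorderState.PLAYING
        return True

    def stop(self):
        # Stopping either activity returns the recorder to the idle state.
        self.state = RecorderState.IDLE
```

In a user interface, the return values could be used to enable or disable the record and play buttons, so that a request made in the wrong state is simply ignored.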
I have also learned from https://cmusphinx.github.io/wiki/sphinx4:sphinxthreealigner how to get phoneme acoustic scores from "forced alignment" using sphinx3. To generate the phoneme alignment scores, two steps are needed. The details of how to perform that alignment can be found in my more tech-oriented posts at http://troylee2008.blogspot.com/2012/05/testing-cmusphinx3-alignment.html and http://troylee2008.blogspot.com/2012/05/cmusphinx3-phoneme-alignment.html on my personal blog.

Currently, these tasks are ongoing:

Set up the server side process to manage user recordings, i.e., distinguishing between users and different utterances.
Figure out how to use ffmpeg, speexdec, and/or sox to automatically convert the recorded server side FLV files to PCM .wav files after the users upload the recordings.
Verify the recording parameters against the recording and speech recognition quality, possibly taking the network bandwidth into consideration.
Incorporate delays between network and microphone events in the recorder. The current version does not wait for the network events (such as connection setup, data packet transmission, etc.) to finish successfully before processing the next user event, which can often cause the recordings to be clipped.
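As a starting point for the FLV to WAV conversion task above, here is a minimal Python sketch that calls ffmpeg through subprocess. It is a sketch under assumptions rather than the final pipeline: the function names and example paths are hypothetical, and the 16 kHz, mono, 16-bit PCM output format is my assumption about what the recognizer will need, since the recording parameters are still to be verified.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(flv_path, wav_path, sample_rate=16000):
    """Build the ffmpeg argument list for one FLV to WAV conversion."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(flv_path),      # input FLV recording
        "-vn",                    # ignore any video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM (assumed format)
        "-ac", "1",               # mono (assumed)
        "-ar", str(sample_rate),  # output sample rate in Hz (assumed 16 kHz)
        str(wav_path),
    ]


def convert_flv_to_wav(flv_path, wav_dir, sample_rate=16000):
    """Convert one recording, keeping its base name; requires ffmpeg on PATH."""
    wav_path = Path(wav_dir) / (Path(flv_path).stem + ".wav")
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(flv_path, wav_path, sample_rate), check=True)
    return wav_path
```

Keeping the command construction separate from the subprocess call makes it easy to inspect or log the exact command, and to swap in sox or speexdec later if they turn out to work better.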

My GSoC Project Page: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/troylee2008/1