Pronunciation evaluation for GSoC 2012
Introductory conference paper: https://docs.google.com/open?id=0B73LgocyHQnfS0g5ZEw1aFNKT2s
Source code repository: http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/branches/speecheval/
Project blog: http://pronunciationeval.blogspot.com/
Please donate to support: http://talknicer.com/slics
|week 0 and 1|
|week 0 and 1|
|speex quality testing|
|Neighbor phoneme edit distance grammar generation|
|Base edit distance/alignment scoring routines|
|Database schema (see below)|
|completion level pending testing|
|User login/password/cookies framework|
|alpha release pending documentation||Troy|
|Need docs for UI parts to use this|
|Phrase data entry user interface|
|both algorithm and schema components|
|User interface for exemplar uploads|
|pending Flash/WebRTC audio upload pivot|
|User interface for learners|
|this is a general client implementation task|
|Manual scoring user interface|
|involves both schema and client UI interface|
|Fail-over from Flash/rtmplite when unavailable to
|Exemplar phoneme acoustic score and duration means and std. deviations|
|document score aggregation functions in the top-three useful modules|
|Calculation of exemplar quantity sufficiency|
|also converting Python to PHP|
|Aggregate acoustic score, duration, and neighbor phoneme to phoneme scores|
|only Ronanki at present; will swap during code reviews|
|Aggregate phoneme scores to biphone score|
|Aggregate phoneme scores to word scores|
|Aggregation from word (and phoneme?) scores to phrase scores|
|Validation of exemplar recordings (outlier scores)|
|Phrase library development|
|pending Flash/WebRTC/wami-recorder/type=file outcome||Troy|
|Measurement against panel of native speakers’ manual scores|
|Game authoring interface|
|Measurement of phonological features|
|pending fundraising||Troy and/or Guillem||?|
|similar to web client|
|have XO-1.75, getting -1 and -1.5|
|measure memory and speed|
|accepted for publication||everyone|
Please see http://talknicer.net/w/Database_schema which will be included here as its portions pass testing.
HTML/RTMP client protocol(s)
Update: RTMP from Flash microphone upload, or port 80 with multipart background HTTP form uploads, in Speex quality 8/10 or better quality, to PCM 16,000 samples/second, 16 bit samples. Looking forward to WebRTC, but until then https://code.google.com/p/wami-recorder/ is okay.
Web-Based Pronunciation Evaluation Using Acoustic, Duration and Phonological Scoring with CMU Sphinx3
Project Short Description
To build a website capable of collecting recorded audio speech utterances for use as both pronunciation exemplars and student utterances to be evaluated as part of the system’s user interface. I will use an open source Adobe Flex/Flash microphone recording and audio upload web client applet and rtmplite (Python) server which the project mentor has agreed to provide. I will use this website as the user interface to the project and to collect data which might not be available from my existing speech databases (or e.g. http://librivox.org and http://www.voxforge.org.) This website will also allow me to collect human judges’ pronunciation assessment scores for use in training and validation of the automatic pronunciation evaluation.
To build a pronunciation evaluation system that can detect mispronunciations at the phoneme level and provide feedback scores at the phoneme, bi-phone, word, and phrase level. I will use scoring techniques such as standardizing acoustic phoneme scores, edit distance scoring with alternate pronunciation grammars, phoneme duration standard scores, and standard score aggregation across biphones, words, and phrases.
If time permits, I hope to investigate mapping the acoustic speech features of each phoneme or diphone derived from machine phonetic transcription to articulation-based phonological features. Each phoneme is characterized by place of tongue articulation, manner of articulation, lip rounding, voiced or unvoiced, duration, height, frontness etc., These features represent the physical characteristics of the speech production process. Using this mapping, mispronunciations at the phone level can be identified using phonological features along with acoustic pronunciation scores and edit distances.
Benchmark the memory utilization and performance of Sphinx3 on forced alignment and edit distance scoring tasks on the initial website’s server equipment for parameter optimization and resource planning purposes.
Download the Adobe Actionscript Flex compiler and use it to build the open source Flex/Flash microphone audio input and upload applet to be provided by the mentor.
Install the rtmplite server on website and test it with microphone upload applet.
Write and test back-end server software for phrase word phoneme string selection, using CMUDICT, part-of-speech selection, and phrase audio exemplar collection.
Write and test phoneme acoustic score and duration standard score aggregation routines from exemplar utterance recordings.
Write and test audio upload and pronunciation evaluation for per-phoneme standard scores.
Write and test score aggregation by words, phrase, and biphones.
Collect sufficient transcribed phrases and corresponding exemplar recordings such that the system can be shown to be useful for general educational purposes.
Measure the agreement of the automatic evaluation scores with experts’ average subjective scores.
Write and test learner user accounts and adaptive feedback system to prompt learners to pronounce phrases which include phonemes, biphones, diphones, or words which they need to practice.
Publicize the system so that it can be tested by large numbers of users and monitored while in use.
Write and test an edit distance score enhancement system using an automatically generated mispronunciation grammar to the phoneme
Measure the benefit of including edit distance scoring in the calculation of phoneme scores.
(If the schedule at this point in the project permits:)
Build a CART model for the training dataset so that the test phonemes can be compared to the correct phone with reference to their context.
Build a table to transform phonemes and/or biphones derived from machine phonetic transcription to a new set of articulation-based phonological
Write and test a supplementary scoring routine to aggregate articulation-based phonological features with existing standard scores.
Measure the improvement of including articulation-based phonological features.
Write initial project documentation.
Proofread project documentation with mentor review.
Update the wiki at http://cmusphinx.github.io/wiki/faq#qhow_to_implement_pronunciation_evalua tion with the project description, pointing to the project documentation and including source code download instructions.
Prepare a research report on the project, proofread with mentor review, and submit it to a preprint server and a peer reviewed academic journal.
(Ongoing after the summer, if necessary.) Respond to journal submission comments, questions, and requests, and if rejected submit to other journal(s).
- May 21 to 27: milestones 1-4
- May 28 to June 3: milestones 5-7
- June 4 to 10: milestones 8 and 9
- June 11 to 17: milestones 10 and 11
- June 18 to 24: milestones 12 and 13
- June 25 to July 1: milestones 14
- July 2 to 8: milestone 15
- July 9 (firm): Complete and submit mid-term evaluation.
- July 9 to 15: milestone 16 part (a)
- July 16 to 22: milestone 16 part (b)
- July 23 to 29: milestone 16 parts (c) and (d)
- July 30 to August 5: milestones 17 and 18
- August 6 to 13: milestones 19 and 20
- August 14 to 19: buffer in case of schedule overrun
Mobile Pronunciation Evaluation for Language Learning Using Edit Distance Scoring with CMU Sphinx3, Copious Speech Data Collection, and a Game-Based Interface
Project Short Description
Pronunciation learning is one of the most important parts for second language acquisition. The aim of this project is to utilize the automatic speech recognition technologies to facilitate spoken language learning. This project will mainly focus on developing accurate and efficient pronunciation evaluation system using CMU Sphinx3 and maximizing the adoption population by implementing mobile apps with our evaluation system. Additionally, we also plan to design and implement game based pronunciation learning to make the learning process much more fun. Four specific sub-tasks are involved in this project, namely, automatic edit distance based grammar generation, exemplar pronunciation database building, Android pronunciation evaluation app interface implementation and game based learning interface development.
To automatically generate edit distance grammars for pronunciation evaluation. Language learners tend to make similar pronunciation mistakes, especially for learners from the same region or sharing the same mother tongue language. Identifying these mispronunciation patterns would greatly reduce the search space and improve the evaluation efficiency while maintaining the evaluation accuracy. This task would involve automatic mispronunciation pattern learning using native and non-native speech data, grammar generation using those patterns mined from speech data and testing the recognition grammar. The grammar will be finally represented as phoneme network/lattice. This would be in Python using Sphinx3 and would probably take 3-5 weeks.
To build a exemplar pronunciation database for pronunciation evaluation. As mentioned in the first task, to achieve efficient and accurate pronunciation evaluation, we need non-native speech data for mispronunciation pattern mining. In this task, we need to recruit people to come and visit a website and record their pronunciation of phrases. Then we need post-process the recorded speech samples and build a database for future system development. The recording website will be provided by the mentor and I will invite my friends to contribute their speech to this database and do data post-processing with some automatic approaches, such as outlier analysis for rejecting obvious bad pronunciations and speech detection for cropping the signal etc. This would be an ongoing thing to take 4-6 hours per week.
To implement an Android interface to a pronunciation evaluation system. To make our pronunciation evaluation system accessible to more users, we plan to build an Android app for our pronunciation evaluation system. This would be a Java task to take an existing pronunciation evaluation system for the web and make an Android interface to it. I will implement the audio recording and playback functions on Android platform and client-server interaction for transferring recorded speech signals and evaluation results. Simple HTTP based client-server communication will be adopted for this taks. This would take 2-4 weeks.
To develop a game front end for a pronunciation evaluation system. The best way to attract users is to implement the product as a game. We will explore this idea to make our system more popular. This task would be design and implementation of a simple game front end on web and/or Android to increase the attractiveness of working on pronunciation evaluation practice tasks for students. This task is a much more challenging one. The final game mechanism and implementation will be further discussed with the mentor. Generally speaking, I will design the basic game play functionalities and the interfaces. Similar client-server communication as the previous one will be also adopted here. This would take the remainder of the time, as fancy as you want to make it.
Get familiar with Sphinx3 and setup the baseline of the existing pronunciation evaluation system to be provided by the mentor.
Get both the native and non-native speech data from the mentor and extract MFCC features for recognition.
Automatically decode the speech data with acoustic model and language model provided by the mentor.
Extract mispronunciation patterns in those data.
Automatically construct grammars to include both the correct pronunciation and possible mispronunciation patterns.
Testing the generated grammars in the baseline evaluation system and comparing the performance.
Analyze the results, if necessary repeat 3-6 to improve the evaluation performance until a better grammar is learnt.
Collect information about the exemplar pronunciation data collection website and process from the mentor, if needed, help setup the website and promote the data collection among my friends.
Implement automatic data post-process programs for the recording data post-processing.
Guarding the post-processing process and occasionally do verifications manually if necessary.
Settle down the functionality design for the Android app.
Settle down the interface design for the Android app.
Implement the audio recording and playback on Android platform.
14) Implement the HTTP based client-server file communication.
Testing the Android app and fixing bugs.
Discuss with the mentor about the game mechanism.
Settle down the game play logic.
Settle down the functionality design.
Settle down the interface design.
Game implementation and testing.
Prepare the final project report.