Project Ideas
Note to students: Google Summer of Code proposals are closed.
This page was updated March 2017.
Please see also: summerofcodeideas and summerofcodestudents
For reference: http://write.flossmanuals.net/gsoc-mentoring/making-your-ideas-page/
About CMUSphinx
The CMUSphinx project is a leading automatic speech recognition project in the open source world. Since its release as open source code in 1999, it has provided a platform for building speech recognition applications. It is used in desktop control software, telephony platforms, smart homes, computer-assisted language learning tools, information retrieval and mobile applications. Traditionally, CMUSphinx has provided support for low-resource and under-served languages.
Tasks for Contributors
This is a list of potential tasks for new and intermediate contributors to the project. If you have some other idea, feel free to add it. For any questions, contact us on the cmusphinx-devel@lists.sourceforge.net mailing list or in the #cmusphinx IRC channel on Freenode.
Intelligibility remediation for pronunciation assessment
In 2014 and 2015, a few speech recognition labs working on pronunciation assessment in language learning and computer-assisted pronunciation training started using a new method for authentic intelligibility remediation. This technique is a vast improvement over scoring a learner’s pronunciation by trying to predict the assessment of pronunciation judges, the paradigm which has dominated the field since the Malvern UK STAR project in the late 1980s. Instead, we assess the quality of a student’s pronunciation by trying to predict whether a panel of native and non-native transcriptionists can correctly discern the words that the learner was supposed to say. This is a major improvement in the field which we want to implement with PocketSphinx.
Our initial plan is to provide working demonstration code in English and one other language (e.g., Kannada or Tamil), along with infrastructure for crowdsourcing transcriptions and exemplar pronunciations, and sufficiency index statistics for the number of transcripts and exemplar utterances provided per target phrase.
Mentors
James Salsman (GSoC 2012 mentor) jim at talknicer dot com
Srikanth Ronanki (GSoC 2012 student) srikanth dot 143 dot iiit at gmail dot com
Interested students: email James Salsman to add your name here.
Balaji VenkataRamanan - full-time Ph.D. scholar at Christ University, Bangalore, India - balajiv211 at yahoo dot com
Complexity
Medium
Skills
C (PocketSphinx) and Python (2012 GSoC pronunciation assessment project)
**PLEASE SEE pocketsphinx_pronunciation_evaluation**
Diphone alignment and acoustic scores
We would like to modify PocketSphinx so that it produces meaningful time alignments and accurate acoustic scores from HMM posterior probability scores for phonemes as well as diphones. (A diphone is the second half of one phoneme, other than a diphthong or other composite, followed by the first half of another. There are about 1,400 diphones in Basic English.)
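As a rough, hypothetical illustration (not existing PocketSphinx code), diphone spans can be derived from a phone-level alignment by splitting each phone at its midpoint and pairing the second half of one phone with the first half of the next:

```python
# Hypothetical sketch: derive diphone labels and time spans from a
# phone-level alignment given as (phone, start_frame, end_frame) triples.
def diphones(phone_alignment):
    """Yield (diphone_label, start, end), splitting each phone at its midpoint."""
    for (p1, s1, e1), (p2, s2, e2) in zip(phone_alignment, phone_alignment[1:]):
        mid1 = (s1 + e1) // 2   # second half of the first phone
        mid2 = (s2 + e2) // 2   # first half of the second phone
        yield ("%s-%s" % (p1, p2), mid1, mid2)

alignment = [("HH", 0, 10), ("EH", 10, 25), ("L", 25, 33), ("OW", 33, 50)]
print(list(diphones(alignment)))
# [('HH-EH', 5, 17), ('EH-L', 17, 29), ('L-OW', 29, 41)]
```

The real task is then to have the decoder report acoustic scores for these half-phone spans, not just for whole phones.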
Complexity
Easy
Skills
C
Possible mentors
James Salsman - jim at talknicer dot com
Interested potential students
Email James Salsman to add your name here.
DNN acoustic models for pocketsphinx
For the best accuracy we need to introduce a neural-network-based acoustic model.
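A minimal sketch of what such a model could look like, assuming Keras/TensorFlow as the toolkit; the feature dimensions and senone count below are placeholders, and a real hybrid system would still convert the softmax posteriors into scaled likelihoods for the decoder:

```python
# Illustrative sketch only (not an existing CMUSphinx component): a small
# feed-forward acoustic model mapping a context window of MFCC frames to
# senone posteriors. The sizes (13 cepstra, 11-frame window, 2000 senones)
# are placeholders chosen for the example.
import numpy as np
import tensorflow as tf

n_ceps, context, n_senones = 13, 11, 2000
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_ceps * context,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(n_senones, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy arrays standing in for aligned (feature window, senone id) pairs.
x = np.random.randn(32, n_ceps * context).astype("float32")
y = np.random.randint(0, n_senones, size=32)
model.fit(x, y, epochs=1, verbose=0)
```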
Complexity
Medium
Skills
C, Python, Tensorflow or other DNN toolkit
CNN-LSTM models for sphinx4
The same as above, but with heavier models for the large-vocabulary sphinx4 decoder.
Complexity
Medium
Skills
C, Python, Java, Deeplearning4j, Tensorflow or other DNN toolkit
ROS integration
The Robot Operating System (ROS) is a set of software libraries and tools that help you build robot applications. From drivers to state-of-the-art algorithms, and with powerful developer tools, ROS has what you need for your next robotics project. And it’s all open source.
ROS has support for PocketSphinx, but it is at a very early stage. Full-scale support for recent features would be very welcome.
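As a starting point, a hypothetical minimal ROS node could publish PocketSphinx hypotheses on a topic. The sketch below assumes the pocketsphinx Python package’s LiveSpeech helper and rospy; the node and topic names are made up for the example:

```python
#!/usr/bin/env python
# Rough sketch of a ROS node publishing PocketSphinx hypotheses as strings.
# Assumes the pocketsphinx Python package (LiveSpeech helper) and rospy;
# names and parameters are placeholders, not an existing ROS package API.
import rospy
from std_msgs.msg import String
from pocketsphinx import LiveSpeech

def main():
    rospy.init_node("pocketsphinx_asr")
    pub = rospy.Publisher("speech_hypotheses", String, queue_size=10)
    for phrase in LiveSpeech():            # default US English model
        if rospy.is_shutdown():
            break
        pub.publish(String(data=str(phrase)))

if __name__ == "__main__":
    main()
```

A full-scale integration would also expose language model and dictionary switching, keyword spotting and confidence scores as ROS parameters and messages.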
Complexity
Easy
Skills
C, Python
Note
http://www.ros.org http://wiki.ros.org/pocketsphinx
UniMRCP integration
UniMRCP is a popular telephony server for IVR speech recognition. It has some initial support for PocketSphinx, but, as usual, the configuration is not optimal and it is very complex to install. It would be nice to enable full PocketSphinx support in UniMRCP so that IVR developers can use accurate, open source speech recognition.
Complexity
Easy
Skills
C
Note
https://code.google.com/p/unimrcp/wiki/PocketSphinxPlugin
Speaker Verification and Identification framework
Speech biometrics has been gaining more and more attention recently. We don’t just need to decode the speech; we also need to understand who said what. Besides biometrics, speaker identification is important for speech recognition itself because it enables better adaptation.
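One classical baseline such a framework could start from is GMM-based speaker identification over MFCC features. The sketch below uses librosa and scikit-learn purely for illustration; it is not an existing CMUSphinx API:

```python
# Minimal sketch of classic GMM speaker identification: one Gaussian mixture
# per enrolled speaker over MFCC frames; a test utterance is assigned to the
# speaker whose model gives the highest average log-likelihood.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path):
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # frames x 13

def enroll(speaker_wavs):
    """speaker_wavs: {speaker_name: [wav_path, ...]} -> {speaker_name: GMM}"""
    models = {}
    for name, wavs in speaker_wavs.items():
        feats = np.vstack([mfcc_frames(w) for w in wavs])
        models[name] = GaussianMixture(n_components=16).fit(feats)
    return models

def identify(models, wav_path):
    feats = mfcc_frames(wav_path)
    return max(models, key=lambda name: models[name].score(feats))
```

A real framework would go further with i-vectors or similar techniques and share the feature pipeline with the decoder.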
Complexity
Medium
Skills
C, Java
Note
There are many publications about speaker identification; it’s probably better to start with a book:
Fundamentals of Speaker Recognition by Homayoon Beigi http://www.springer.com/de/book/9780387775913
Java LM Training code
All the C language model training tools have portability issues; it is very hard to run them on Windows. It would be nice to implement a full-scale LM training tool in Java to be used instead of CMUCLMTK. It should support a variety of unformatted text inputs and produce a competitive language model for further use in our decoders.
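To make the scope concrete, here is a toy sketch of the core counting step (written in Python only for brevity; the project itself calls for Java). It prints maximum-likelihood bigram probabilities in a simplified ARPA-like layout; a real tool would add text normalization, smoothing and backoff weights:

```python
# Toy bigram LM trainer: count unigrams/bigrams and print log10
# maximum-likelihood probabilities. Not a drop-in replacement for a
# properly smoothed ARPA model.
import math
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())
    print("\\data\\")
    print("ngram 1=%d" % len(unigrams))
    print("ngram 2=%d" % len(bigrams))
    print("\\1-grams:")
    for w, c in unigrams.items():
        print("%.4f %s" % (math.log10(float(c) / total), w))
    print("\\2-grams:")
    for (w1, w2), c in bigrams.items():
        print("%.4f %s %s" % (math.log10(float(c) / unigrams[w1]), w1, w2))
    print("\\end\\")

train_bigram_lm(["open the door", "close the door"])
```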
Complexity
Medium
Skills
Java
Note
The best toolkit nowadays is SRILM, http://www.speech.sri.com/projects/srilm/; we can take a lot of ideas from it. Language models and training algorithms are best explained in the “Spoken Language Processing” textbook.
Very large scale grid-based model training
Have you heard about the SETI@home project or Folding@home? The idea is to give everyone an easy way to donate CPU power to a project. People could simply install a screensaver on their desktops and help to train a very large acoustic model. This seems to be a great opportunity to increase the computational resources available to the project.
The goal is to port the CMUSphinx acoustic model training process to a distributed grid framework so everyone can donate resources towards the best possible speech recognition accuracy. The port should outsource the most computationally demanding tasks to the grid, mainly Baum-Welch estimation (the bw binary in the acoustic model training tutorial).
The project can use BOINC for the grid middleware or some Java analog of BOINC if you prefer to code in Java. We prefer Java.
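Conceptually, the port follows a split/accumulate pattern. In the sketch below, multiprocessing merely stands in for BOINC work units, and run_bw_on_chunk is a hypothetical wrapper around the bw binary, not a real API:

```python
# Conceptual sketch of the split/accumulate structure of parallel training.
from multiprocessing import Pool

def run_bw_on_chunk(chunk):
    """Would run Baum-Welch (the bw binary) on one slice of the training
    fileids and return the partial accumulator directory it produced."""
    part, fileids = chunk
    return "bwaccumdir/part_%d" % part     # placeholder path

def split(fileids, n_parts):
    return [(i, fileids[i::n_parts]) for i in range(n_parts)]

def train_iteration(fileids, n_parts=4):
    with Pool(n_parts) as pool:            # BOINC clients would play this role
        accum_dirs = pool.map(run_bw_on_chunk, split(fileids, n_parts))
    # A real port would now run the norm step over accum_dirs to re-estimate
    # the model, then start the next Baum-Welch iteration.
    return accum_dirs
```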
Resources
A server for grid experiments will be provided
Expected results
The training process can submit bw tasks and their features to the grid and train the acoustic model this way.
The setup and code are well documented in the project wiki.
Complexity
Easy
Skills
C, Java
Large scale grid-based model optimization
The previous project deals with the existing training framework, but we can try something very new: building the best possible acoustic model by brute force. You know, it’s better to spend electricity on speech recognition than on the search for aliens who have already contacted us first.
The acoustic model is just a set of about 1000 parameters. You can apply a simple large-scale optimization algorithm to create the best acoustic model possible. You need to distribute the evaluation data on the grid and simply select the best parameters. You can start from an existing model as the initial approximation and try to improve it.
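A hypothetical skeleton of such an optimization loop might look like the following, where evaluate_wer is a placeholder for decoding a held-out set on the grid with the candidate parameters:

```python
# Illustrative brute-force optimization sketch (not an existing CMUSphinx
# tool): perturb a parameter vector, evaluate it, keep the best candidate.
import numpy as np

def evaluate_wer(params):
    """Placeholder: decode a test set with these parameters and return WER."""
    return float(np.sum(params ** 2))   # stand-in objective for the sketch

def random_search(initial_params, iterations=100, step=0.1):
    best, best_wer = initial_params, evaluate_wer(initial_params)
    for _ in range(iterations):
        candidate = best + step * np.random.randn(*best.shape)
        wer = evaluate_wer(candidate)
        if wer < best_wer:
            best, best_wer = candidate, wer
    return best, best_wer

params, wer = random_search(np.random.randn(1000))
```

On the grid, each evaluate_wer call would become a batch of decoding work units, so smarter black-box optimizers than random search would pay off quickly.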
Resources
A server for grid experiments will be provided
Expected results
The setup is in place and the model training is running. The model demonstrates reasonable improvement over baseline on our task.
Complexity
Hard
Skills
C, Java, Optimization methods
Note
Numerical Optimization textbook
Collect pronunciation dictionaries from Wikipedia
It’s critical to reuse existing information sources to improve system performance and support more languages. We also need to quickly incorporate pronunciations for the many new words which appeared just last year and are missing from common dictionaries. Think about how to pronounce the word “gangnam”. One valuable source of phonetic pronunciations is Wikipedia, in particular the Wiktionary project. It often contains pronunciations for many words in IPA format, created by the dictionary authors. However, it’s not trivial to parse the pages in a uniform way, since the page format differs from language to language.
You need to write code to parse existing Wiktionary pronunciations and create dictionaries for many languages. It’s not as trivial a task as it might seem, since the code has to work for many Wiktionary languages.
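A minimal sketch for the English Wiktionary only: fetch a page’s wikitext through the MediaWiki API and pull slash-delimited transcriptions out of IPA templates. Other language editions lay the templates out differently, which is exactly the hard part of the project:

```python
# Rough sketch: extract IPA pronunciations from English Wiktionary wikitext.
import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def ipa_pronunciations(word):
    params = {"action": "parse", "page": word, "prop": "wikitext", "format": "json"}
    wikitext = requests.get(API, params=params).json()["parse"]["wikitext"]["*"]
    pronunciations = []
    for template in re.findall(r"\{\{IPA\|([^}]*)\}\}", wikitext):
        pronunciations += re.findall(r"/[^/|]+/", template)
    return pronunciations

print(ipa_pronunciations("dictionary"))
```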
Expected results
Phonetic dictionaries for 10 languages are collected from Wiktionary. The tool is documented and easy to use for discovering pronunciations of new words.
Complexity
Easy
Skills
Python
Note
http://en.wiktionary.org/wiki/Wiktionary:Main_Page
Implement Java trainer on Hadoop framework
Hadoop is a framework for distributed computation, and it enables processing of huge databases. Sphinx4 already implements training in Java, but this training is not parallel. Porting the Java trainer to run on Hadoop would allow us to scale significantly beyond simple training setups.
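The shape of the computation is map-reduce friendly: the map step emits per-utterance sufficient statistics and the reduce step sums them per model state, which is the structure of parallel Baum-Welch. The sketch below uses mrjob as one possible Python front-end purely for illustration (the project itself targets Java), with simple counts standing in for the real accumulator vectors:

```python
# Conceptual sketch of the accumulate step as a Hadoop (mrjob) job.
from mrjob.job import MRJob

class AccumulateStats(MRJob):
    def mapper(self, _, line):
        # Input line: "<utterance_id> <state_id> <count>" (placeholder format).
        utt, state, count = line.split()
        yield state, float(count)

    def reducer(self, state, counts):
        yield state, sum(counts)

if __name__ == "__main__":
    AccumulateStats.run()
```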
Expected results
Train Sphinxtrain acoustic models using Hadoop framework
Complexity
Easy
Skills
Java, Hadoop
Notes
Apache Mahout (machine learning using Hadoop), Apache Hadoop
http://en.wikipedia.org/wiki/Mixture_model
Implement WER evaluation framework
Our current tools for WER evaluation are pretty simplistic: they only accept preformatted text in a certain format, they do not provide the extended statistics required for model optimization, and they do not make it possible to estimate the quality of the engine or suggest help on model issues. The goal is to implement an advanced WER scoring tool which is both able to process large files and, at the same time, has extensive reporting functionality like the NIST sclite toolkit.
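The core computation such a tool builds on is word error rate via edit distance, with substitution, insertion and deletion counts recovered by backtracking; a small self-contained sketch:

```python
# Word error rate with per-type error counts from a Levenshtein alignment.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    # Backtrack to count error types.
    i, j, sub, ins, dele = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            sub += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return {"wer": d[len(ref)][len(hyp)] / float(len(ref)),
            "sub": sub, "ins": ins, "del": dele}

print(wer("this is a test", "this is the test"))
# one substitution out of four reference words -> WER 0.25
```

An advanced tool would add per-speaker and per-utterance breakdowns, confidence intervals and side-by-side alignment reports on top of this.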
Expected results
Advanced error scoring tool implemented
Complexity
Easy
Skills
Java, C, Python
Support PocketSphinx in JavaScript
https://github.com/syl22-00/pocketsphinx.js (live demo) has been progressing well, and it’s fast enough to handle many tasks in better than real time on low-end semi-loaded laptops, and probably tablets. It won’t be long before it’s a viable fully client-side solution on phones.
Expected results
Support the existing code base with examples and cookbook-style solutions.
Complexity
Easy to medium
Skills
JavaScript (including WebRTC getUserMedia), C, possibly Adobe Flash ActionScript
Contact
jim at talknicer dot com
Deep neural networks for Sphinx
In this project we aim to develop DNN frameworks for incorporation into the open source Sphinx systems. Over the course of the summer we will train, test and finally incorporate two frameworks into one or both of Sphinx4 and Pocketsphinx. The project is expected to require two students.
Mentors: Bhiksha Raj (bhiksha@cs.cmu.edu) and Evandro Gouvea (egouvea@gmail.com)
Javascript simulation of the brain’s motor control of the vocal tract
We need someone to do one of these, but translucent, with labels and more degrees of freedom to swivel and zoom, and with animations showing the projections from these three articles, using these polygons with these kinds of animations, built on canvas-based Phoria and including translucence, like a 3D version of this thing.
Expected results
We can embed such interactive JavaScript animations wherever we have difficult tongue position diagrams, among many other places, e.g. at https://en.wikipedia.org/wiki/Diphthong#Closing.2C_opening.2C_and_centering
Complexity
Medium
Skills
JavaScript
Contact
jim at talknicer dot com
Please see also http://dood.al/pinktrombone/
Computer-aided instruction system capable of teaching speaking skills
We would like to make a general, open-source computer-aided instruction system which can teach speaking skills along with reading, listening, and writing.
Expected results
Accomplish these tasks to make a general-purpose computer-aided instruction system capable of teaching speaking skills. Can you use the accuracy review system to provide default instructional content questions while allowing voice responses of phrases describing the approach to each question?
Complexity
Hard
Skills
JavaScript, Python, Android and iOS microphone input integration with JavaScript, possibly Adobe Flash ActionScript. Please consider integration with Moodle in your project proposal.
Contact
jim at talknicer dot com