Here comes the first update on the Long Audio Training project. Its aim is to enable SphinxTrain training on recordings that are hours long. Presently, SphinxTrain can only process files up to approximately 3 minutes long.
Full info on the project can be found at https://cmusphinx.github.io/wiki/longaudiotraining.
During the last week, a collection of audio files 5 to 10 minutes long was turned into a CMUSphinx training database in order to determine possible issues when training on longer recordings. The first experiments resulted in two main findings:
A branch named long-audio-training was created in the CMUSphinx SVN repository. All work done on this project will be committed into this branch.
The first change is a fix for the word-lookup problem, which was caused by the truncation of the transcription sentence and thus of its last word. The idea is to replace all usages of read_line in SphinxTrain with the lineiter_* set of functions from sphinxbase, which do not impose any such limit on the length of the data.
More updates to come soon, stay tuned!
Update:
Windows users, please check a more or less up-to-date manual here:
https://sites.google.com/site/opiatefuchs/home/pocketsphinxandroiddemo
I know you have been waiting for these instructions for a long time; here they are. Kudos to Matthew Cooper, who wrote this. These instructions should work on any GNU/Linux distribution, for example Fedora or Ubuntu. This guide is also written for phones running OS 2.2 or higher. Earlier OS versions require a different path to the sdcard, so it will be easy to adapt this guide.
Download and build pocketsphinx
To run pocketsphinx, download the following from https://cmusphinx.github.io/wiki/download/:
sphinxbase-0.8
pocketsphinx-0.8
Then download the PocketSphinxDemo archive from
PocketsphinxAndroidDemo - snapshot
Unzip all three archives into a place where you will remember to find them. You must unzip the pocketsphinx and sphinxbase folders into the same parent directory and rename them to just "sphinxbase" and "pocketsphinx", without version information.
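For example, if you unpack everything under a single directory (the ~/sphinx name below is just an illustration, any location works), the layout should look like:

~/sphinx
|
----> sphinxbase
|
----> pocketsphinx
|
----> PocketSphinxDemo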
Open up the terminal and type:
sudo -i
followed by your password. This will give you root access.
You will need swig later, so let's install it. You need swig 1.3; for now we do not support newer versions like swig 2.0:
apt-get install swig
or
yum install swig
On Windows install swig according to the following document:
http://www.swig.org/Doc1.3/Windows.html
Now cd into sphinxbase and run the following commands from the command line:
./configure
make
make install
cd into pocketsphinx and type:
./configure
make
make install
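Note that make install places the libraries under /usr/local/lib by default, and on some distributions that directory is not in the runtime linker path. If a later step complains that libsphinxbase cannot be found, refreshing the linker cache as root may help:

ldconfig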
On Windows you do not need to compile Sphinxbase, just unpack it.
Now cd into the PocketSphinxDemo/jni folder.
Open the Android.mk file, found in the jni folder, and change SPHINX_PATH (line #5) to the parent folder holding pocketsphinx and sphinxbase.
From the command line type:
the-path-to-your-ndk-folder/ndk-build -B
Of course, substitute the real path to your ndk folder for the-path-to-your-ndk-folder.
Eclipse
Now open Eclipse and import the PocketSphinxDemo project.
In the Navigator view, look for the PocketSphinxDemo project. Right-click on it and select Properties. When the properties screen pops up, select Builders. In the Builders screen you will see SWIG and NDK build.
Click on NDK build and then on Edit.
In the edit screen, change the Location field to point to the ndk folder on your machine. Click on the Refresh tab and select "The project containing the selected resource"
Click on the Build Options tab and deselect "Specify working set of relevant resources"
Apply changes and exit the configuration for NDK build.
Click on SWIG and then on Edit.
You will not need to change the Location, since you installed swig at the start of this tutorial.
In the refresh tab select "The folder containing the selected resource"
In the Build Options tab deselect "Specify working set of relevant resources"
Apply changes and exit the configuration for SWIG.
Phone
Connect your Android phone and create the edu.cmu.pocketsphinx folder at
/mnt/sdcard
You can do this by opening a terminal and typing:
adb shell
In the shell, type:
mkdir /mnt/sdcard/edu.cmu.pocketsphinx
Now cd into the edu.cmu.pocketsphinx folder that is located on your phone and create the following folder structure:
edu.cmu.pocketsphinx
|
----> hmm
| |
| ----> en_US
| |
| ----> hub4wsj_sc_8k
|
----> lm
|
-----> en_US
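If you prefer to create the whole structure in one go, the same tree can be built with plain mkdir commands from the adb shell (older Android shells may not support mkdir -p, so each level is created separately):

mkdir /mnt/sdcard/edu.cmu.pocketsphinx/hmm
mkdir /mnt/sdcard/edu.cmu.pocketsphinx/hmm/en_US
mkdir /mnt/sdcard/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k
mkdir /mnt/sdcard/edu.cmu.pocketsphinx/lm
mkdir /mnt/sdcard/edu.cmu.pocketsphinx/lm/en_US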
Now type exit to leave the adb shell.
While still in the terminal, you will need to push files from your computer onto the phone.
cd into pocketsphinx/model/hmm/en_US/ and type:
adb push ./hub4wsj_sc_8k /mnt/sdcard/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k
Now cd into pocketsphinx/model/lm/ and type:
adb push ./en_US /mnt/sdcard/edu.cmu.pocketsphinx/lm/en_US
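You can verify that everything landed in the right place with:

adb shell ls -R /mnt/sdcard/edu.cmu.pocketsphinx

Not every device's built-in ls supports -R; if yours does not, list each folder individually.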
Now open RecognizerTask.java, found in
/src/edu/cmu/pocketsphinx/demo
It declares paths to a directory structure that is not valid on a 2.2 phone. We will need to change the paths so that they work correctly. Here is my code for that section:
pocketsphinx.setLogfile("/mnt/sdcard/edu.cmu.pocketsphinx/pocketsphinx.log");
Config c = new Config();
/*
* In 2.2 and above we can use getExternalFilesDir() or whatever it's called
*/
c.setString("-hmm", "/mnt/sdcard/edu.cmu.pocketsphinx/hmm/en_US/hub4wsj_sc_8k");
c.setString("-dict", "/mnt/sdcard/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.dic");
c.setString("-lm", "/mnt/sdcard/edu.cmu.pocketsphinx/lm/en_US/hub4.5000.DMP");
c.setString("-rawlogdir", "/mnt/sdcard/edu.cmu.pocketsphinx"); // Only use it to store the audio
If your model is different, you will have to use different paths; for example, for the Mandarin model:
c.setString("-hmm", "/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/zh/tdt_sc_8k");
c.setString("-dict", "/sdcard/Android/data/edu.cmu.pocketsphinx/lm/zh_TW/mandarin_notone.dic");
c.setString("-lm", "/sdcard/Android/data/edu.cmu.pocketsphinx/lm/zh_TW/gigatdt.5000.DMP");
Now build and run the project.
After one week of steady work, I am finally making my first post on results and findings.
For a detailed description of the project, please read here.
I started with a few experiments on various grammars to see which performed best and in what scenarios. By manipulating the grammar alone, the best I could reach was a word error rate of almost 18% on audio files that were almost 6 minutes long. Some observations made from these experiments were:
A source of error in alignment comes from words in the text that are not in the dictionary (out-of-vocabulary words). It was therefore proposed to provide a Java-based module for generating phonetic representations of any such word. This module prepares an FST based on test data and uses it to make hypotheses for word pronunciations. As of now, the front end for this module is nearly complete; it depends on an automata library in Java, which, as it seems, does not exist yet. We now plan to implement this library and finally test the improvement in alignment due to this addition.
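To make the plan a bit more concrete, one way the module's front end could be shaped is sketched below; all names here are illustrative, not the actual project API:

// Hypothetical sketch of the planned pronunciation module's front end.
public interface PronunciationGenerator {

    // Build the FST from example (word, phone sequence) pairs.
    void train(java.util.List<WordPronunciation> examples);

    // Hypothesize phone sequences for an out-of-vocabulary word,
    // best hypothesis first.
    java.util.List<String> hypothesize(String word);
}

// Simple container pairing a word with one phonetic representation.
class WordPronunciation {
    final String word;
    final java.util.List<String> phones;

    WordPronunciation(String word, java.util.List<String> phones) {
        this.word = word;
        this.phones = phones;
    }
}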
We would like to thank applicants for putting the time and effort into creating GSoC applications to work on CMUSphinx. We were ultimately provided with two slots and had many great applications that made choosing very difficult. We hope that students who were not accepted will still get involved with CMUSphinx and look forward to receiving your applications next year.
We are pleased to announce that two spots were awarded to Michal Krajňanský and Apurv Tiwari.
Michal is a student at Masaryk University in Brno, Czech Republic, studying Informatics with a focus on Artificial Intelligence and Natural Language Processing. Michal will be working on training acoustic models on long audio files. He will optimize SphinxTrain for massively parallel hardware using the NVIDIA CUDA framework, which will reduce the memory requirements of the Baum-Welch algorithm and significantly speed it up. Lastly, he will also modify SphinxTrain to be able to process long input audio files.
Apurv is a student at the Indian Institute of Technology Delhi in New Delhi, India, studying Mathematics and Computing. Apurv will be working on adding Long Audio Alignment to CMUSphinx. The problem he will solve is to align a given approximate transcription to the corresponding audio data, as well as to improve the transcription at points of low confidence.
The mentor team includes Prof. James Baker and Prof. Bhiksha Raj, as well as all the members of our community.
Both Apurv and Michal will blog weekly about their experience. The blogs will appear here at https://cmusphinx.github.io/
We want to thank Google for providing this wonderful opportunity and the mentors for donating their valuable time. We eagerly anticipate great things from Apurv and Michal. Stay tuned!