The last update reported an error rate of almost 16% for sufficiently large audio files. Experiments were conducted to pinpoint the source of these errors, and it was then suggested that we align audio without first classifying its speech and non-speech segments. This configuration is now being tested with different grammars covering the various kinds of errors that can occur in the audio and/or the transcription.
Audio files with perfect transcriptions, for utterances up to 20 minutes long, have been aligned with the current state of the aligner, resulting in a word error rate close to 0%. The grammar used for this kind of alignment only allows transitions from a word to its immediate successor in the transcription.
We are currently classifying the different kinds of errors that occur in transcriptions and utterances, and modelling grammars that allow alignment in their presence.
The past days' work addressed the use of the restrictive read_line function in SphinxTrain. All occurrences of read_line were eliminated in favour of line iterators from sphinxbase. In the process, a decision was made to modify the lineiter interface so that it supports the original read_line functionality, e.g. comment skipping and whitespace trimming. The interface now provides the following methods:
Usages of line iterators in sphinxbase were also updated to comply with the interface modifications. These changes were necessary to enable SphinxTrain to train on audio files of unlimited size, and lineiter is meant to become the central input interface of SphinxTrain in the future.
Work on the examination of memory issues is ongoing. It will be followed by the implementation of a memory-optimized Baum-Welch algorithm, as described at https://cmusphinx.github.io/wiki/longaudiotraining, as well as by finding other ways to reduce the excessive memory demands of the current SphinxTrain version.
As indicated in my last update on the Long Audio Alignment project (https://cmusphinx.github.io/2011/05/long-audio-alignment-week-1/), my goal this week was to fix the problem of generating pronunciations for out-of-vocabulary (OOV) words. In the absence of a reliable Java-based automata library, I added an existing phone generator from FreeTTS to Sphinx 4 to generate pronunciation hypotheses for OOV words.
The module that generates these hypotheses is designed to ensure correct pronunciation hypotheses for:
A branch named long-audio-alignment has been created in SVN. The source files for this project can be found there.
With the current state of the aligner model (i.e. with pronunciations for all words in the dictionary), the word error rate (WER) was found to have improved to 0.16, compared to 0.18 without the pronunciation generator.
After a few more quick experiments with grammars, we next aim to model anchors based on a trigram language model.
Here comes the first update on the Long Audio Training project. Its aim is to enable SphinxTrain to train on recordings that are hours long; presently, SphinxTrain can only process files up to approximately 3 minutes.
Full info on the project can be found at https://cmusphinx.github.io/wiki/longaudiotraining.
During the last week, a collection of audio files 5 to 10 minutes long was turned into a CMUSphinx training database in order to determine the issues that arise when training on longer recordings. The first experiments resulted in two main findings:
A branch named long-audio-training was created in the CMUSphinx SVN repository. All work on this project will be committed to this branch.
The first change is a fix for the word-lookup problem, which was caused by the truncation of a transcription sentence and thus of its last word. The idea is to replace all usages of read_line in SphinxTrain with the lineiter_* family of functions from sphinxbase, which impose no such limits on the length of the input.
More updates to come soon, stay tuned!