Long Audio Training: Update 3

Over the last week, the memory management of the current SphinxTrain version was examined using tools from the Valgrind family.

The program was profiled on a single pass of Baum-Welch training on the whole an4 database (948 files) and on one recording from the custom 'rita' database. The total length of the training recordings was about 42 minutes and 5:31 minutes respectively. Only one file from the rita database was used because the training process takes a long time. In both cases training took about 8 minutes without Valgrind profiling, and about 15 times longer with profiling.

The first interesting result is that training on 42 minutes of short audio files (each a few seconds long) took about as long as training on a single file 5:31 minutes long.

During the experiments, an issue with excessive memory reallocation was found in the forward algorithm: about 760,000 calls to the __ckd_realloc__ function when training on the an4 database. Analysis of the source code revealed the cause: the initial allocation size of the active state set was repeatedly set too low and then grown substantially (by about 1.3 GB on the rita database) in steps of a few bytes. The algorithm was modified to grow the allocation geometrically, multiplying its size at each step, instead of linearly. The modification reduced the number of reallocation calls on the an4 database to 18,500, a reduction by a factor of about 40. The tradeoff was a slight increase in the total amount of memory allocated, from 5.7 MB to 6.2 MB on an4 and from 1.4 GB to 1.7 GB on rita. The multiplication factor was set to 2, but this choice was arbitrary and can be subject to optimization.
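The effect of the growth strategy on the number of reallocations can be illustrated with a small sketch. This is not the SphinxTrain code: the buffer sizes and the fixed step of 8 are hypothetical values chosen for illustration, and only the growth factor of 2 comes from the text above.

```python
def count_reallocations(final_size, initial_capacity, grow):
    """Count how many times a buffer must be regrown to hold final_size items.

    Returns (number of reallocations, final capacity).
    """
    capacity = initial_capacity
    reallocs = 0
    while capacity < final_size:
        capacity = grow(capacity)
        reallocs += 1
    return reallocs, capacity


LINEAR_STEP = 8    # hypothetical fixed increment (assumption, not from SphinxTrain)
GROWTH_FACTOR = 2  # multiplication factor mentioned in the text

# Grow a buffer from 16 slots until it can hold 100,000 items.
linear = count_reallocations(100_000, 16, lambda c: c + LINEAR_STEP)
geometric = count_reallocations(100_000, 16, lambda c: c * GROWTH_FACTOR)

print("linear growth:   ", linear)
print("geometric growth:", geometric)
```

In this toy setting the fixed step needs thousands of reallocations while doubling needs about a dozen, at the price of a final capacity larger than what is actually used. This mirrors the tradeoff reported above: far fewer calls, somewhat more memory allocated.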

In theory this modification could also reduce the running time of the algorithm, but no significant reduction was measured: training took approximately 8 minutes, give or take a few seconds, in all cases (an4 and rita, with and without the modification).

One question was whether a substantial fraction of the memory demands could be attributed to memory leaks. This turned out not to be the case. Some memory leaks were found, but nothing significant: under 1 MB in all cases. It was concluded that implementing the reduced-space technique is necessary.

Work on the implementation of reduced-space Baum-Welch has begun. This will involve a major rewrite of the forward, backward and Viterbi algorithms using the checkpoint method (as described in the article http://www.cse.ucsc.edu/research/compbio/papers/samspace.pdf). This method can be parametrized to reduce memory use from linear in T to an arbitrary integer root of T. The tradeoff is an increase in computation time that grows with the order of the root chosen. The memory leaks should also be eliminated during this rewrite.
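The checkpoint idea can be sketched for the simplest case, the square root of T. The code below is an illustrative toy and not the planned SphinxTrain implementation: the function names, the dense list-of-lists HMM representation, and the absence of scaling or pruning are all assumptions made to keep the example short. It stores one forward vector per block of about sqrt(T) frames, then recomputes the vectors inside each block on demand while walking backward through time, which is the order in which a backward pass would consume them.

```python
import math


def forward_step(alpha, trans, emit_t):
    """One forward step: alpha_t[j] = emit_t[j] * sum_i alpha_{t-1}[i] * trans[i][j]."""
    n = len(alpha)
    return [emit_t[j] * sum(alpha[i] * trans[i][j] for i in range(n))
            for j in range(n)]


def forward_full(init, trans, emits):
    """Reference version: stores all T forward vectors (memory linear in T)."""
    alphas = [[init[j] * emits[0][j] for j in range(len(init))]]
    for t in range(1, len(emits)):
        alphas.append(forward_step(alphas[-1], trans, emits[t]))
    return alphas


def forward_checkpointed_reverse(init, trans, emits):
    """Yield (t, alpha_t) for t = T-1 down to 0, storing about 2*sqrt(T) vectors.

    Assumes T >= 1. Each frame's forward step is computed at most twice.
    """
    T = len(emits)
    block = math.isqrt(T - 1) + 1  # ceil(sqrt(T)) frames per block

    # Pass 1: run the forward recursion, keeping only the first vector of each block.
    checkpoints = {}
    alpha = [init[j] * emits[0][j] for j in range(len(init))]
    for t in range(T):
        if t > 0:
            alpha = forward_step(alpha, trans, emits[t])
        if t % block == 0:
            checkpoints[t] = alpha

    # Pass 2: visit blocks from last to first, recomputing the vectors inside each.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + block, T)
        segment = [checkpoints[start]]
        for t in range(start + 1, end):
            segment.append(forward_step(segment[-1], trans, emits[t]))
        for t in range(end - 1, start - 1, -1):
            yield t, segment[t - start]


# Toy 2-state HMM with 10 frames of made-up emission likelihoods.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emits = [[0.9, 0.2] if t % 3 else [0.1, 0.8] for t in range(10)]

full = forward_full(init, trans, emits)
for t, alpha in forward_checkpointed_reverse(init, trans, emits):
    assert alpha == full[t]  # same values as the full-memory version
```

Applying the same trick recursively inside each block is what leads to higher-order roots, with one additional round of recomputation per level.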

To allow testing, this should be done in a few phases, ensuring at each one that the new implementation is correct. More information will be available on the project wiki at https://cmusphinx.github.io/wiki/longaudiotraining as the work progresses.

Long Audio Alignment : Week 3

The last update indicated an error rate of almost 16% for sufficiently large audio files. Experiments were conducted to pinpoint the source of these errors. It was then suggested to align audio without classifying its speech and non-speech components. Alignment with this configuration is now being tested with different grammars for different sorts of possible errors in the audio and/or transcription.

Audio with its perfect transcription, for utterances up to 20 minutes long, has been checked for alignment with the current state of the aligner, and this resulted in a word error rate close to 0%. The grammar used for this sort of alignment only allows transitions from a word to its immediate successor in the transcription.

We are currently classifying the different sorts of errors in transcription and utterance, and modelling grammars that allow alignment in their presence.

Long Audio Training: Update 2

Work over the past days addressed the use of the restrictive function read_line in SphinxTrain. All occurrences of read_line were eliminated by making use of line iterators from sphinxbase. In the process, a decision was made to modify the lineiter interface in order to support the original read_line functionality, e.g. skipping comments and trimming whitespace. The interface now provides the following functions:

  • lineiter_init - init iterator for reading without any preprocessing
  • lineiter_init_clean - init iterator for reading compatible with the read_line function: skip commented lines and trim leading and trailing whitespace
  • lineiter_next - read next line from the file
  • lineiter_free - finish reading and free resources
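The difference between the two reading modes can be illustrated with a small Python analogue. This is not the sphinxbase C interface: the function iter_lines, its clean flag, and the choice of '#' as the comment marker are assumptions made for this sketch only.

```python
import io


def iter_lines(stream, clean=False, comment_prefix="#"):
    """Yield lines from a text stream, one at a time, whatever their length.

    clean=False: lines are returned exactly as read (raw mode).
    clean=True:  lines starting with comment_prefix are skipped and the rest
                 are stripped of leading and trailing whitespace.
    The comment marker is a hypothetical choice for this illustration.
    """
    for line in stream:
        if clean:
            if line.startswith(comment_prefix):
                continue
            line = line.strip()
        yield line


source = "# header\n  a b  \nc\n"
print(list(iter_lines(io.StringIO(source))))              # raw mode
print(list(iter_lines(io.StringIO(source), clean=True)))  # cleaned mode
```

Because lines are produced lazily, the caller never has to size a buffer in advance, which is the property that matters for very long input files.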

Uses of line iterators in sphinxbase were also updated to comply with the interface modifications. These changes were necessary to enable SphinxTrain to train on audio files of unlimited size, and lineiter is meant to be the central input interface in SphinxTrain in the future.

Work on the examination of memory issues is ongoing and will be followed by the implementation of a memory-optimized Baum-Welch algorithm, as described at https://cmusphinx.github.io/wiki/longaudiotraining, as well as by a search for other ways to reduce the unreasonable memory demands of the current SphinxTrain version.

Long Audio Alignment : Week 2

As indicated in my last update regarding the Long Audio Alignment project (https://cmusphinx.github.io/2011/05/long-audio-alignment-week-1/), my goal this week was to fix the problem of generating pronunciations for out-of-vocabulary (OOV) words. Due to the lack of a reliable Java-based automata library, I added an existing phone generator from FreeTTS to Sphinx 4 to generate pronunciation hypotheses for OOV words.
The module for generating these hypotheses is designed to ensure correct pronunciation hypotheses for:

  • Abbreviations: A word like "USD" in the transcription could be uttered either as "United States Dollar" or as "U-S-D". Accuracy of this depends on the model used.
  • Numbers: "123" in the text could be uttered as "One Two Three" as well as "One Hundred Twenty Three".
  • OOV words that are neither abbreviations nor numbers are pronounced as is (i.e. with the default pronunciation generated by FreeTTS).
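The three cases above can be sketched as a function that maps a transcript token to its candidate spoken word sequences. This is a toy illustration in Python, not the Sphinx 4 or FreeTTS code: the function names, the EXPANSIONS table and the 0..999 limit on cardinal readings are hypothetical, and the final step of turning words into phones is left out.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical lookup table of abbreviation expansions (assumption).
EXPANSIONS = {"USD": ["united", "states", "dollar"]}


def cardinal(n):
    """Spell out an integer in the range 0..999 as a list of English words."""
    if n < 20:
        return [ONES[n]]
    if n < 100:
        words = [TENS[n // 10]]
        if n % 10:
            words.append(ONES[n % 10])
        return words
    words = [ONES[n // 100], "hundred"]
    if n % 100:
        words += cardinal(n % 100)
    return words


def spoken_variants(token):
    """Return the candidate word sequences for one transcript token."""
    variants = []
    if token.isascii() and token.isdigit():
        variants.append([ONES[int(d)] for d in token])   # digit by digit
        if int(token) < 1000:
            variants.append(cardinal(int(token)))        # cardinal reading
    elif token.isalpha() and token.isupper():
        variants.append([c.lower() for c in token])      # letter by letter
        if token in EXPANSIONS:
            variants.append(EXPANSIONS[token])           # expanded form
    else:
        variants.append([token.lower()])                 # passed through as is
    # Drop duplicates (e.g. "5" reads the same both ways), keeping order.
    unique = []
    for v in variants:
        if v not in unique:
            unique.append(v)
    return unique


print(spoken_variants("123"))
print(spoken_variants("USD"))
```

Each returned sequence would then be handed to the pronunciation generator, and the aligner is free to match whichever variant the speaker actually used.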

A branch long-audio-alignment has been created on SVN. Source files for this project can be found there.

With the current state of the Aligner model (i.e. with pronunciations for all words in the dictionary), the word error rate (WER) was found to have improved to 0.16 as compared to 0.18 without the pronunciation generator.

After a few more quick experiments with grammars, we next aim to model anchors based on a trigram language model.