Long Audio Training: Update 1


Here is the first update on the Long Audio Training project. Its aim is to enable SphinxTrain to train on recordings that are hours long. At present, SphinxTrain can process files of up to approximately 3 minutes.

Full info on the project can be found at https://cmusphinx.github.io/wiki/longaudiotraining.

During the last week, a collection of audio files 5 to 10 minutes long was turned into a CMUSphinx training database in order to identify possible issues when training on longer recordings. The first experiments resulted in two main findings:

  • Some components of SphinxTrain impose an arbitrary limit on the size of the data they process. For example, the function read_line reads a line only up to the constant size of its buffer. This caused the training process to crash in the Baum-Welch step because of a failed word lookup.
  • The training process currently requires a huge amount of memory: about 1.7 GB of RAM when training on a ~5-minute recording and well over 4 GB when processing a ~10-minute input. (I did not determine the actual value because this actually shut down my machine.) This suggests a flaw in the memory management of the training process, which will be examined in the following days.
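To make the first finding concrete, here is a minimal sketch of how a fixed-size line buffer can cut off the last word of a transcript. It is not the SphinxTrain code: read_fixed, last_word and BUF_SIZE are hypothetical names, and the tiny buffer size is chosen only to make the effect visible.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a fixed-size line reader.
   This is NOT SphinxTrain's read_line; the buffer size is an assumption. */
enum { BUF_SIZE = 16 };

/* Reads at most BUF_SIZE - 1 characters of the next line into buf.
   Returns 1 if the whole line fit, 0 if it was cut off, -1 on EOF. */
static int read_fixed(FILE *fp, char buf[BUF_SIZE])
{
    if (fgets(buf, BUF_SIZE, fp) == NULL)
        return -1;

    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';          /* complete line: strip the newline */
        return 1;
    }

    /* No newline in the buffer: peek at the next character to see
       whether the line really ended here or was truncated. */
    int c = fgetc(fp);
    if (c == EOF || c == '\n')
        return 1;
    ungetc(c, fp);
    return 0;
}

/* Returns a pointer to the last space-separated word in s. */
static const char *last_word(const char *s)
{
    const char *sp = strrchr(s, ' ');
    return sp ? sp + 1 : s;
}
```

With the line "hello world transcription", the buffer ends up holding "hello world tra", so the last word passed to a dictionary lookup would be "tra" rather than "transcription". A lookup of such a fragment is the kind of failure described above.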

A branch, long-audio-training, was created in the CMUSphinx SVN repository. All work done on this project will be committed to this branch.

The first change is a fix for the word-lookup problem, which was caused by the truncation of the transcription sentence and thus of its last word. The idea is to replace all uses of read_line in SphinxTrain with the lineiter_* set of functions from sphinxbase, which do not impose any such limit on the length of the data.
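The idea behind reading lines without a length limit can be sketched in plain C with a buffer that grows on demand. This does not reproduce the sphinxbase lineiter_* API; read_full_line is a hypothetical helper written only to illustrate the approach.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of sphinxbase: reads one full line of any
   length from fp. Returns a malloc'ed string without the trailing newline,
   or NULL on EOF or allocation failure. The caller frees the result. */
static char *read_full_line(FILE *fp)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';    /* complete line: strip the newline */
            return buf;
        }
        /* No newline yet: grow the buffer and keep reading. */
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }

    /* EOF: return what was read if the last line lacked a newline. */
    if (len > 0)
        return buf;
    free(buf);
    return NULL;
}
```

Because the buffer doubles whenever a line does not fit, a transcript sentence of any length arrives whole, and its last word is passed on intact.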

More updates to come soon, stay tuned!