In the last post regarding Long Audio Training it was indicated that there were still some problems in the reduced Baum-Welch implementation. Fortunately, these were identified as memory leaks introduced during the optimization and have been fixed. Over the past days I have run extensive tests, which show that the modified algorithms perform significantly better than the original version of SphinxTrain with respect to memory consumption.
See https://cmusphinx.github.io/wiki/longaudiotraining for more detailed evaluation.
A mixture of cool technologies can help create really innovative applications. Check this video for a demonstration of CMUSphinx's capabilities when it is combined with the OpenCV computer vision library.
Making subtitles from scratch usually consists of two tedious tasks: (1) figuring out the times when someone starts and ends speaking — in subtitle-length pieces — and (2) typing down the text corresponding to that speech. An approach often worth attempting is to automate a large part of this work by using speech recognition to generate subtitles from a given video. This method cannot be expected to produce release-quality subtitles on its own, but it should provide a rough first draft, which can be finished by the usual manual methods. With most video sources, the actual speech recognition cannot be expected to perform well, but voice recognition should provide decent results for the start and end times of subtitles.
Gaupol's speech recognition uses the excellent CMU Sphinx speech recognition toolkit developed at Carnegie Mellon University — to be exact, the pocketsphinx plugin for the GStreamer multimedia framework.
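For a flavor of how the pocketsphinx GStreamer element can be used, here is a minimal command-line pipeline sketch (not Gaupol's actual invocation; the input file name is a placeholder, and the element uses whatever default acoustic and language models are installed):

```sh
# Hypothetical sketch: feed the audio track of a video file through the
# pocketsphinx GStreamer element. "movie.mp4" is a placeholder path.
gst-launch-1.0 filesrc location=movie.mp4 ! decodebin \
    ! audioconvert ! audioresample ! pocketsphinx ! fakesink
```

Recognition results are posted as messages on the pipeline's bus, which is how an application such as Gaupol can collect both the recognized text and its timing information.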
My last update on phone loops indicated some significant improvements in out-of-grammar word recognition while aligning audio files with text, but it also noted that the alignment of large audio files needs special attention because of the huge search space generated by the linguist and the memory it consumes.
I have therefore modified the linguist to generate the search graph dynamically during recognition, adding a phone loop only at the current grammar state, which significantly reduces the memory that was previously needed to store one phone loop per word in the transcription. Tests were also conducted to determine the "Out of Grammar Branch" probability and the "Phone Insertion Probability" for audio files with some noise and close to 3% error in the transcription. The best results in out-of-grammar word recognition were achieved at an out-of-grammar branch probability of about 1E-8 and a phone insertion probability of about 1E-80. The word error rate thus obtained was close to 5%.
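These two values are set as ordinary properties in a Sphinx-4 style XML configuration. The fragment below is an illustrative sketch, not the project's exact configuration; the component name and class are assumptions, while the property names follow the ones Sphinx-4 linguists expose:

```xml
<!-- Illustrative Sphinx-4 config fragment (component name/class are assumptions) -->
<component name="linguist"
           type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <!-- enable the out-of-grammar (phone loop) branch -->
    <property name="addOutOfGrammarBranch" value="true"/>
    <!-- values that gave the best results in the tests described above -->
    <property name="outOfGrammarProbability" value="1E-8"/>
    <property name="phoneInsertionProbability" value="1E-80"/>
</component>
```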
The current version of the linguist is now available in the long-audio-aligner branch. We now proceed towards the implementation of a keyword-spotting algorithm to improve the alignment even further.