Over the last couple of weeks the Long Audio Alignment project has seen a lot of new developments. It became clear that the accuracy of audio alignment would improve even further if approximate time information for certain words were known before the actual alignment. Without such information, the decoder has to go through all the frames of the audio before it can tell which alignment hypothesis scores best. With approximate timed information, however, the decoder only has to wait until one (or more) of its hypothesis tokens agrees with this additional information, allowing it to prune out the rest of the tokens. This helps keep the beam size small.
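To make the idea concrete, here is a minimal Python sketch of anchor-based pruning. It is not the actual Sphinx4 code; the Anchor and Token structures and the tolerance value are hypothetical and only illustrate how a timed anchor lets the decoder drop disagreeing hypotheses:

from dataclasses import dataclass

@dataclass
class Anchor:
    word: str    # word with known approximate timing
    frame: int   # approximate frame where the word is expected

@dataclass
class Token:
    words: list  # words hypothesised so far
    frame: int   # current frame of this hypothesis
    score: float # accumulated acoustic + language score

def prune_with_anchor(active_tokens, anchor, tolerance=50):
    """Keep only tokens that agree with the timed anchor.

    A token 'agrees' if it has hypothesised the anchor word within
    `tolerance` frames of the anchor's expected position; all other
    tokens can be dropped, which keeps the beam small.
    """
    agreeing = [
        t for t in active_tokens
        if anchor.word in t.words and abs(t.frame - anchor.frame) <= tolerance
    ]
    # If nothing agrees yet, keep the beam unchanged and wait for more frames.
    return agreeing or active_tokens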
A highly configurable phrase spotter was therefore implemented to obtain this timed information. The phrase spotter creates a left-to-right, no-skips grammar from the words in the phrase and (like the aligner) uses a garbage model to model all out-of-grammar utterances. The grammar was chosen this way to ensure that a phrase is recognized only when all of its words are present in the utterance without a skip. Corresponding changes were also made in the linguist to allow and ensure that a Phone Loop is inserted only at the start of a phrase in the search graph.
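As a toy illustration of the grammar's shape (not Sphinx4's Grammar API), a left-to-right, no-skips grammar for a single phrase can be thought of as a linear chain of word arcs; the helper below is purely illustrative:

def build_phrase_grammar(phrase):
    """Build a left-to-right, no-skips word chain for a phrase.

    Returns a list of (state, word, next_state) arcs: the phrase is
    accepted only if every word is seen in order with no skips.
    A real implementation would also attach a garbage/phone-loop model
    in front of the phrase to absorb out-of-grammar speech.
    """
    words = phrase.split()
    return [(i, word, i + 1) for i, word in enumerate(words)]

# Example: the phrase is spotted only when all three words occur in order.
print(build_phrase_grammar("long audio alignment"))
# [(0, 'long', 1), (1, 'audio', 2), (2, 'alignment', 3)]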
The aligner search manager was then designed to exploit the phrase spotter's result and perform the audio alignment. As a result, even with a much more complicated grammar, the aligner can now align audio with much better accuracy (almost 0% error). However, the memory required for aligning very large texts while allowing large skips still poses a problem, so for now I am profiling the memory usage to locate and tackle this issue.
In the last post on Long Audio Training it was mentioned that there were still some problems in the reduced Baum-Welch implementation. Fortunately these were identified as memory leaks introduced during the optimization and have been fixed. Over the past days I have run some extensive tests, which show that the modified algorithms perform significantly better than the original version of SphinxTrain with respect to memory consumption.
See https://cmusphinx.github.io/wiki/longaudiotraining for a more detailed evaluation.
A mixture of cool technologies can help to create really innovative applications. Check out this video demonstrating CMUSphinx's capabilities when it is combined with the OpenCV computer vision library.
Making subtitles from scratch usually consists of two tedious tasks: (1) figuring out the times when someone starts and stops speaking, in subtitle-length pieces, and (2) typing down the text corresponding to that speech. An approach often worth attempting is to automate a large part of this work by using speech recognition to generate subtitles from the given video. This method cannot be expected to produce release-quality subtitles on its own, but it should provide a rough first draft that can be finished by the usual manual methods. With most video sources the actual speech-to-text output cannot be expected to be very accurate, but voice recognition should still provide decent results for the start and end times of subtitles.
Gaupol's speech recognition uses the excellent CMU Sphinx speech recognition toolkit developed at Carnegie Mellon University — to be exact, the pocketsphinx plugin for the GStreamer multimedia framework.
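For the curious, the kind of pipeline involved can be sketched roughly in Python as below. This is only an illustration, not Gaupol's actual code: the pocketsphinx GStreamer element comes from the separate GStreamer plugin shipped with pocketsphinx, the invocation relies on the plugin's default models, and the file name is a placeholder.

# Rough sketch: run a video's audio track through the pocketsphinx
# GStreamer element. Requires the gst-pocketsphinx plugin to be installed.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "filesrc location=video.avi ! decodebin ! audioconvert ! audioresample "
    "! pocketsphinx ! fakesink"
)

pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
# Block until the whole file has been processed or an error occurs.
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)

In the real plugin the recognition results and word timings are delivered asynchronously (through element messages or signals, depending on the plugin version), and it is that timing information which gets turned into subtitle start and end times.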