CMUSphinx Powers Video Subtitle Editing Tool

Making subtitles from scratch usually consists of two tedious tasks: (1) figuring out the times when someone starts and stops speaking, in subtitle-length pieces, and (2) typing out the text corresponding to that speech. An approach often worth attempting is to automate a large part of this work by using speech recognition to generate subtitles from a given video. This method cannot be expected to produce release-quality subtitles on its own, but it should provide a rough first draft that can be finished by the usual manual methods. With most video sources the actual speech recognition cannot be expected to perform well, but it should still provide decent results for the start and end times of subtitles.

Gaupol's speech recognition uses the excellent CMU Sphinx speech recognition toolkit developed at Carnegie Mellon University; to be exact, the pocketsphinx plugin for the GStreamer multimedia framework.
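
As a rough illustration of how the plugin is driven, here is a minimal sketch assuming the GStreamer 0.10 Python bindings and the plugin's documented 'vader' and 'pocketsphinx' elements; the input file name is a placeholder and error handling is omitted:

    import gobject
    import pygst
    pygst.require('0.10')
    import gst

    def on_result(asr, text, uttid):
        # A final hypothesis for one utterance; in a subtitle tool the
        # vader's speech start/stop events and the buffer timestamps
        # supply the subtitle start and end times.
        print(text)

    # 'video.avi' is a placeholder. vader segments speech from silence,
    # which is what yields usable subtitle boundaries even when the
    # recognized words themselves are rough.
    pipeline = gst.parse_launch(
        'filesrc location=video.avi ! decodebin ! audioconvert '
        '! audioresample ! vader name=vad auto-threshold=true '
        '! pocketsphinx name=asr ! fakesink')

    asr = pipeline.get_by_name('asr')
    asr.connect('result', on_result)
    asr.set_property('configured', True)

    pipeline.set_state(gst.STATE_PLAYING)
    gobject.MainLoop().run()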

Check it out

Long Audio Alignment: Dynamic Linguist and Phone Loops

My last update on phone loops reported significant improvements in out-of-grammar word recognition while aligning audio files with text, but it also noted that alignment of large audio files needs special attention because of the huge search space generated by the linguist and its associated memory requirements.

I have hence modified the linguist to generate the search graph dynamically during recognition, adding a phone loop only at the current grammar state. This significantly reduces the memory previously needed to store one phone loop per word in the transcription. Tests were also conducted to determine the 'out-of-grammar branch' probability and the 'phone insertion' probability for audio files with some noise and close to 3% error in the transcription. The best results in out-of-grammar word recognition were achieved at an out-of-grammar branch probability of about 1E-8 and a phone insertion probability of about 1E-80. The word error rate thus obtained was close to 5%.
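
As a toy illustration of the idea (this is not the actual Sphinx4 linguist code), successor states can be produced lazily, with a single phone loop attached only at the state currently being expanded; the phone set here is truncated, the state encoding is made up, and the two probabilities are the tuned values from the tests above:

    import math

    PHONES = ['AA', 'AE', 'B', 'D', 'IY', 'K', 'S', 'T']
    OOG_BRANCH_LOGPROB = math.log(1e-8)
    PHONE_INSERTION_LOGPROB = math.log(1e-80)

    def expand(state, transcript):
        """Lazily yield (successor_state, transition_logprob) pairs, so
        the search graph exists only where the search has reached."""
        word_index, phone = state
        if phone is not None:
            # Inside the loop: insert another phone, or rejoin the
            # grammar at the same word position.
            for ph in PHONES:
                yield (word_index, ph), PHONE_INSERTION_LOGPROB
            yield (word_index, None), 0.0
        else:
            # In-grammar arc to the next transcript word...
            if word_index + 1 < len(transcript):
                yield (word_index + 1, None), 0.0
            # ...or branch into a phone loop created here, on demand,
            # instead of being pre-built for every word.
            for ph in PHONES:
                yield (word_index, ph), OOG_BRANCH_LOGPROB + PHONE_INSERTION_LOGPROB

    transcript = ['the', 'quick', 'brown', 'fox']
    for succ, logp in expand((0, None), transcript):
        print(succ, logp)

Because the loop is created only at the frontier of the search and pruned with the beam, memory no longer scales with one loop per transcript word.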

The current version of the linguist is now available in the long-audio-aligner branch. We now proceed towards the implementation of a keyword-spotting algorithm to improve the alignment even further.

Long Audio Training: Reduced Baum-Welch Computation and a Move Towards CUDA

It's been a while since my last post. In the meantime I have been modifying the Baum-Welch algorithm into its reduced version, which is finally complete.

Forward, backward and Viterbi methods were changed in the following way:

  • A 'reduced' forward method was created. It computes the checkpoints needed for later re-computation of the actual alpha and scaling values (see the sketch after this list). The size of the reduced matrices is a function of the block size, which is taken as a parameter.
  • A 'local' forward method was created. It re-computes the alpha values for a particular checkpoint (block of values).
  • Since SphinxTrain has the Viterbi back-pointer computation embedded in the forward pass, modifying Viterbi was just a matter of using the reduced forward and re-computing the alpha values with the local forward.
  • The backward update was modified in a similar way as the Viterbi.
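
The following is a minimal sketch of the checkpointing idea, assuming a plain HMM with an N x N transition matrix A and per-frame emission scores B; the names and shapes are illustrative and are not the SphinxTrain data structures:

    import numpy as np

    def reduced_forward(A, B, pi, block_size):
        """Forward pass storing only every block_size-th scaled alpha
        column (a checkpoint) instead of the full T x N alpha matrix."""
        T = B.shape[0]
        alpha = pi * B[0]
        scale = alpha.sum()
        alpha = alpha / scale
        checkpoints = {0: (alpha.copy(), scale)}
        for t in range(1, T):
            alpha = (alpha @ A) * B[t]
            scale = alpha.sum()
            alpha = alpha / scale
            if t % block_size == 0:
                checkpoints[t] = (alpha.copy(), scale)
        return checkpoints

    def local_forward(A, B, checkpoints, t0, block_size):
        """Re-compute the alpha columns of one block from its checkpoint,
        for use by the backward/Viterbi updates inside that block."""
        alpha, _ = checkpoints[t0]
        block = [alpha]
        for t in range(t0 + 1, min(t0 + block_size, B.shape[0])):
            alpha = (alpha @ A) * B[t]
            alpha = alpha / alpha.sum()
            block.append(alpha)
        return block

    # Tiny smoke test: a random 3-state model over 100 frames, with
    # block_size near sqrt(100).
    rng = np.random.default_rng(0)
    A = rng.random((3, 3)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((100, 3))
    pi = np.full(3, 1.0 / 3.0)
    cps = reduced_forward(A, B, pi, block_size=10)
    block = local_forward(A, B, cps, 10, 10)

With block size b, peak storage is roughly T/b checkpoint columns plus the b columns of the block being re-computed, instead of all T columns; this sum is minimized at b close to the square root of T, at the cost of computing each in-block column twice.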

The modification was successfully tested on the an4 database. It performed somewhat slower, which was anticipated, since the modified algorithm does more computation.

I also tried the modification on the 'rita' (long audio) database. I was forced to quit the computation as it consumed all of my system's memory. This sadly suggests no improvement in memory demands; it might mean that some of the memory demands lie outside the forward/backward/Viterbi passes, or that I have simply introduced some memory leaks. During these brief tests the block_size parameter was set arbitrarily to 11, rather than to the square root of the number of time frames, which may also have performance consequences.

The actual slow-down and memory requirements are subject to more detailed tests.

Regarding CUDA, I have gained access to three CUDA machines. Two of them belong to Sitola, the Laboratory of Advanced Network Technologies; access to them is provided by MetaCentrum, the Czech academic grid organization that provides computation and storage resources. The cards are a GeForce GTX 285, a GeForce 8800 Ultra and a GeForce 8400M GS (a rather low-end one in my personal laptop). These are the devices the CUDA development and testing will take place on. More info to come; please check the project page.

Long Audio Alignment: Phone Loops

Our objective this week was to model the presence of words in the utterance that are not in the transcription. The approach taken was to model them using phone loops. A phone loop contains all the phones of an acoustic model and can therefore model any utterance (including the words that are in the transcription). Hence the key to good alignment with a phone loop is an optimal branch probability: large enough that the recognizer does not mistake an out-of-vocabulary (OOV) word for a word in the grammar, yet small enough that it does not replace a word in the grammar with an OOV word.
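
A toy sketch of this construction (not the Sphinx4 code) with one phone loop in parallel with every transcript word; the phone set is truncated, and the default probabilities are placeholders taken from the tuning results reported in the more recent update above:

    import math

    PHONES = ['AA', 'AE', 'B', 'D', 'IY', 'K', 'S', 'T']

    def build_alignment_graph(transcript, branch_prob=1e-8, insert_prob=1e-80):
        """Return the edge list of a word graph with one phone loop in
        parallel with every transcript word."""
        edges = []
        branch_logp = math.log(branch_prob)
        insert_logp = math.log(insert_prob)
        for i, word in enumerate(transcript):
            # In-grammar path: word i leads to word i + 1.
            edges.append((('w', i), ('w', i + 1), word, 0.0))
            # OOG path: branch into a loop that can emit any phone any
            # number of times, then rejoin the grammar before word i + 1.
            edges.append((('w', i), ('loop', i), None, branch_logp))
            for ph in PHONES:
                edges.append((('loop', i), ('loop', i), ph, insert_logp))
            edges.append((('loop', i), ('w', i + 1), None, 0.0))
        return edges

    graph = build_alignment_graph(['the', 'quick', 'brown'])

Since one loop is built per transcript word, the graph grows linearly with the transcription, which is exactly the memory problem that the dynamic generation described in the update above removes.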

A linguist satisfying the above criteria has been added to the long-audio-aligner branch. However, while the linguist performs quite well for small transcriptions, the size of the search graph it produces is too large for large transcriptions. We now plan to generate this search graph dynamically to solve this memory issue. That way, the memory required to generate and store these huge search graphs will be reduced to almost O(1).