Long Audio Alignment: Week 2


As indicated in my last update on the Long Audio Alignment project (https://cmusphinx.github.io/2011/05/long-audio-alignment-week-1/), my goal this week was to fix the problem of generating pronunciations for out-of-vocabulary (OOV) words. For lack of a reliable Java-based automata library, I added an existing phone generator from FreeTTS to Sphinx 4 to generate pronunciation hypotheses for OOV words.
The module that generates these hypotheses is designed to produce correct pronunciation hypotheses for:

  • Abbreviations: A word like "USD" in the transcription may be spoken either as "United States Dollar" or as "U-S-D". The accuracy of this depends on the model used.
  • Numbers: "123" in the text may be spoken as "One Two Three" or as "One Hundred Twenty Three".
  • Other OOV words: words that are neither abbreviations nor numbers are pronounced as is (i.e. using the default pronunciation generated by FreeTTS).
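
The three cases above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the branch: the class OovExpander and its methods are hypothetical names, the sketch produces alternative word sequences rather than phones, and it makes no FreeTTS or Sphinx 4 calls. It assumes numbers of at most three digits and treats any all-uppercase token as an abbreviation; the expanded reading of an abbreviation ("United States Dollar") would need a lookup table or model and is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Sphinx 4 or FreeTTS.
final class OovExpander {
    private static final String[] ONES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"};

    // Cardinal reading of a number in the range 0..999.
    static String cardinal(int n) {
        if (n < 20) {
            return ONES[n];
        }
        if (n < 100) {
            return TENS[n / 10] + (n % 10 != 0 ? " " + ONES[n % 10] : "");
        }
        String rest = n % 100 != 0 ? " " + cardinal(n % 100) : "";
        return ONES[n / 100] + " hundred" + rest;
    }

    // Character-by-character reading: "123" -> "one two three", "USD" -> "u s d".
    static String spellOut(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            if (c >= '0' && c <= '9') {
                sb.append(ONES[c - '0']);
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Returns the alternative word sequences a token could be spoken as.
    static List<String> expand(String token) {
        List<String> alternatives = new ArrayList<>();
        if (token.matches("[0-9]{1,3}")) {
            // Number: digit-by-digit reading plus cardinal reading.
            alternatives.add(spellOut(token));
            String asCardinal = cardinal(Integer.parseInt(token));
            if (!alternatives.contains(asCardinal)) {
                alternatives.add(asCardinal);
            }
        } else if (token.matches("[A-Z]{2,}")) {
            // Abbreviation: letter-by-letter reading only (expansion omitted).
            alternatives.add(spellOut(token));
        } else {
            // Any other OOV word: keep the word and let the phone generator handle it.
            alternatives.add(token.toLowerCase());
        }
        return alternatives;
    }
}
```

Under these assumptions, OovExpander.expand("123") yields "one two three" and "one hundred twenty three", and OovExpander.expand("USD") yields "u s d". Each alternative would then be passed to the phone generator and added as a separate pronunciation hypothesis.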

A branch named long-audio-alignment has been created in SVN. The source files for this project can be found there.

With the aligner in its current state (i.e. with pronunciations for all words in the dictionary), the word error rate (WER) improved to 0.16, compared to 0.18 without the pronunciation generator.
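
For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (substitutions, insertions and deletions) between the reference transcription and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition; WerSketch is a hypothetical name, the code is not the scoring tool used for the numbers above, and it assumes both strings are non-empty and whitespace-tokenized.

```java
// Hypothetical illustration of the usual WER definition; not the project's scorer.
final class WerSketch {
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().split("\\s+");
        String[] hyp = hypothesis.trim().split("\\s+");

        // d[i][j] = edit distance between the first i reference words
        // and the first j hypothesis words.
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) {
            d[i][0] = i; // i deletions
        }
        for (int j = 0; j <= hyp.length; j++) {
            d[0][j] = j; // j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int substitution = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(
                    d[i - 1][j - 1] + substitution,
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return (double) d[ref.length][hyp.length] / ref.length;
    }
}
```

For example, WerSketch.wer("a b c d", "a x c d") is 0.25: one substitution over four reference words.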

After a few more quick experiments with grammars, we next aim to model anchors based on a trigram language model.