Postprocessing Framework
Postprocessing Framework refers to the part of the speech recognition process in which the word stream produced by the basic recognition step is segmented into sentences, punctuation is recovered, capitalization is restored, and abbreviations are applied where needed. Numbers and other types of special data should also be converted from words back to their usual written form. The aim is to improve legibility and to enrich the output for further human and machine processing.
Segmenting speech into sentences is an important sub-problem of speech recognition and depends on context, grammar, and semantics. The task requires non-trivial techniques, such as statistical decision making.
Spoken language is typically less organized than written text, making it a challenge to bridge the gap between spoken and written material.
Inserting punctuation marks into spoken texts is a way of bringing them closer to written text, even if a given punctuation mark may behave slightly differently in speech. A large number of punctuation marks can be considered in text: full stops, commas, exclamation marks, question marks, colons, semicolons, and quotation marks. For our task one usually considers only full stops and commas, as they have a higher corpus frequency; the other punctuation marks occur rarely and are difficult to insert or evaluate. The capitalization task consists of rewriting every word in its proper case, depending on context. This is the opposite of the transform that text-to-speech systems usually perform.
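For example (a made-up sentence), the framework should turn lower-cased, unpunctuated recognizer output such as

i figure it will take fifteen minutes said frank sheldon

into

I figure it will take fifteen minutes, said Frank Sheldon.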
Testing data
As with any machine learning project, this project has been test-driven. Tests have been made using a language model built on 95% of the Gutenberg text database, evaluated on the remaining 5% of the texts. The input file has to be lower-cased and contain no punctuation.
Implementation
This project is based on capitalized and punctuated language models. A similar implementation is the disambig tool from SRILM (which works only for capitalization).
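For comparison, a capitalization-only setup with SRILM's disambig tool could look as follows; the file names are illustrative, and the map file lists each lowercase word together with its possible surface forms, letting the language model pick the most likely variant:

disambig -text lower.txt -map case.map -lm cased.lm -order 3

Here case.map would contain entries such as:

frank   frank Frank
sheldon sheldon Sheldon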
The algorithm iterates through the word symbols to build candidate word sequences, which are scored and placed into stacks. When a stack gets full (a maximum capacity is set), it is sorted by sequence probability and the lowest-scoring part is discarded. This way badly scoring sequences are dropped and only the best ones are kept. The final solution is the highest-probability sequence that spans the entire input.
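A minimal Java sketch of this stack search is shown below; the LanguageModel interface, the two case variants per word, and the fixed punctuation choices are simplifying assumptions for illustration, not the project's actual code:

import java.util.*;

public class StackDecoder {

    // Hypothetical scorer: log probability of a word given the words before it.
    interface LanguageModel {
        double logProb(List<String> history, String word);
    }

    // A partial hypothesis: the rewritten words so far and their total score.
    static class Hypothesis {
        final List<String> words;
        final double score;
        Hypothesis(List<String> words, double score) {
            this.words = words;
            this.score = score;
        }
    }

    // Candidate rewritings of one input word: as-is or capitalized,
    // each optionally followed by a punctuation token.
    static List<List<String>> expansions(String word) {
        String cap = Character.toUpperCase(word.charAt(0)) + word.substring(1);
        List<List<String>> out = new ArrayList<>();
        for (String w : new String[] { word, cap }) {
            out.add(List.of(w));
            out.add(List.of(w, "<COMMA>"));
            out.add(List.of(w, "<PERIOD>"));
        }
        return out;
    }

    static List<String> decode(List<String> input, LanguageModel lm, int capacity) {
        List<Hypothesis> stack = new ArrayList<>();
        stack.add(new Hypothesis(new ArrayList<>(), 0.0));
        for (String word : input) {
            List<Hypothesis> next = new ArrayList<>();
            for (Hypothesis h : stack) {
                for (List<String> ext : expansions(word)) {
                    List<String> words = new ArrayList<>(h.words);
                    double score = h.score;
                    for (String w : ext) {
                        score += lm.logProb(words, w);
                        words.add(w);
                    }
                    next.add(new Hypothesis(words, score));
                }
            }
            // Stack full: sort by probability and discard the low-scoring part.
            next.sort((a, b) -> Double.compare(b.score, a.score));
            if (next.size() > capacity) {
                next = new ArrayList<>(next.subList(0, capacity));
            }
            stack = next;
        }
        return stack.get(0).words; // best-scoring sequence covering the whole input
    }
}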
Language Model
For the postprocessing task the language model has to contain capitalized words and punctuation-mark word tokens. In the training data, commas are replaced with <COMMA> and periods are replaced with <PERIOD>. Sentences should also be grouped into paragraphs so that the start- and end-of-sentence markers (<s> and </s>) are not very frequent. The language model needs to be converted from ARPA format to DMP format with sphinx_lm_convert (or sphinx3_lm_convert).
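Assuming the ARPA-format model is named gutenberg.arpa (the file names here are illustrative), the conversion can be done with:

sphinx_lm_convert -i gutenberg.arpa -o gutenberg.DMP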
The gutenberg.DMP language model is already correctly formatted and can be found in the language model download section on the project's SourceForge page.
Example language model training data:
That will be in about fifteen minutes from now <COMMA> I figure <COMMA> murmured Frank Sheldon to his friend and comrade <COMMA> Bart Raymond <COMMA> as he glanced at the hands of his radio watch and then put it up to his ear to make sure that it had not stopped <PERIOD> It’ll seem more like fifteen hours <COMMA> muttered Tom Bradford <COMMA> who was on the other side of Sheldon <PERIOD> Tom’s in a hurry to get at the Huns <COMMA> chuckled Billy Waldon <PERIOD> He wants to show them where they get off <PERIOD>
Usage
The project is available for download at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/ppf.
To compile the project, install ant and be sure to set the required environment variables. Then type the following:
ant
To postprocess text, use the postprocessing.sh script:
sh ./postprocessing.sh -input_text path_to_file -lm path_to_lm
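For example, with the Gutenberg model converted above and a lower-cased test file (paths are illustrative):

sh ./postprocessing.sh -input_text test.txt -lm gutenberg.DMP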
Results
Results vary depending on the language model and the input text. Using the language model trained on the Gutenberg corpus on a few texts from the same project (the test data was not used for training), the accuracy of comma prediction is 35% and of period prediction 39%. Of course, for ASR output the evaluation is less strict, and the impact on readability is considerable.
These are the accuracy estimates for three large texts from the Gutenberg Project:
CAPITALIZATION
CORRECT: 494849
INCORRECT: 28328
COMMA
CORRECT: 17142
EXTRA COMMA: 10049
MISSING FROM OUTPUT: 22181
PERIOD
CORRECT: 13191
EXTRA PERIOD: 9844
MISSING FROM OUTPUT: 10802
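For reference, the quoted accuracies appear to be computed as correct / (correct + extra + missing):

comma:  17142 / (17142 + 10049 + 22181) = 17142 / 49372 ≈ 35%
period: 13191 / (13191 + 9844 + 10802) = 13191 / 33837 ≈ 39%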