Web Data Collection For Language Modeling - New plan, IRST LM and perplexity

Dr. Tony Robinson, one of the mentors, has come up with a new plan for the project.

The core idea remains the same: given some audio and its transcripts from a particular domain, we find additional text on the web that matches that domain, then build language models on the text we obtain.

We will use podcasts for this. Podcasts span very different domains and sometimes come with transcripts. The task therefore becomes building adaptive language models for podcasts.

The difference is that we are going to use Lucene to search text after obtaining it from a small set of websites that provide high-quality text, such as news sites. This avoids the problem of search engines' automated-query policies and also speeds up processing, since everything is queried locally.

The language model toolkit chosen was IRST LM. Installation of IRST LM is straightforward; the manual for version 5.60.01 explains the process quite well, with one minor error: caching has to be enabled in the configure step.

I have read some of “Speech and Language Processing” by Jurafsky and Martin to get some ideas about n-grams, smoothing and perplexity. Then I ran some experiments on the training data provided with IRST LM. I used the your-text-file and test files from the zip archive to get some results. Punctuation marks and suffixes like 's have already been separated by whitespace in the provided text, so no processing other than adding sentence boundary marks was necessary. I added sentence boundaries using the provided script add-start-end.sh and saved the result as training-text for convenience:

emre@ammit ~/gsoc/irstlm $ add-start-end.sh < your-text-file > training-text

(Note that you have to set the PATH environment variable correctly to access the IRST LM tools without specifying their folders.)
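For reference, the boundary-marking step simply wraps each input line in sentence-start and sentence-end symbols. Here is a minimal Python sketch of that idea (the <s> and </s> symbols are the ones IRST LM expects, but this function is my own illustration, not the actual add-start-end.sh):

```python
# Illustrative sketch of sentence-boundary marking, NOT the real
# add-start-end.sh: wrap every line of text in <s> ... </s> markers.
def add_start_end(lines):
    return ["<s> " + line.strip() + " </s>" for line in lines]

print(add_start_end(["the cat sat on the mat"]))
# -> ['<s> the cat sat on the mat </s>']
```

One input sentence per line goes in; one marked sentence per line comes out, which is the format the training step below consumes.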

Then I created a language model using trigrams and Witten-Bell smoothing, and evaluated it on the given test file:

emre@ammit ~/gsoc/irstlm $ tlm -tr=training-text -n=3 -lm=wb -te=test

which gave n=49984 LP=301734.5406 PP=418.4772517 OVVRate=0.05007602433 as output. PP is the perplexity of the language model on the test set, and OVVRate is the out-of-vocabulary (OOV) rate of the test set. When modified shift-beta smoothing is used instead, by setting the parameter -lm to msb,

emre@ammit ~/gsoc/irstlm $ tlm -tr=training-text -n=3 -lm=msb -te=test

the perplexity is considerably lower: n=49984 LP=287035.4908 PP=311.8578364 OVVRate=0.05007602433.
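As a sanity check on these numbers, LP appears to be the total natural-log probability cost over the test set, since PP = exp(LP/n) reproduces both reported perplexities. (That relation is my inference from the output values, not something I have confirmed against the IRST LM source.)

```python
import math

# Recover perplexity from the tlm output fields: n is the number of
# evaluated tokens and LP the accumulated natural-log probability cost.
def perplexity(lp, n):
    return math.exp(lp / n)

print(perplexity(301734.5406, 49984))  # ~418.48 (Witten-Bell run)
print(perplexity(287035.4908, 49984))  # ~311.86 (modified shift-beta run)
```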

The project proposal will be updated soon. My next task is to use some actual text and compare perplexities. I also need to read more about entropy, as I am confused about the idea that the cross entropy of a model of a language is always higher than the entropy of the language itself, i.e. the cross entropy of a language model is an upper bound on the entropy of the language.
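For the record, the upper-bound claim is a standard result: writing p for the true distribution of the language and q for the model, the cross entropy decomposes into the entropy plus a KL divergence term, which is never negative:

```latex
H(p, q) = -\sum_x p(x) \log q(x)
        = -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)}
        = H(p) + D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; H(p)
```

Equality holds only when q = p, so any imperfect model's cross entropy sits strictly above the language's entropy.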

- Emre