Enlarging In-Domain Data Using Crawled News Articles


(Author: Emre Çelikten)

In the previous post, we used the Gutenberg corpus to select sentences that resembled Jane Eyre. Here we look at how the method applies to a real-world problem: constructing language models for podcasts.

In this case, Democracy Now transcripts were used as in-domain data. We examine two sources of additional data: the Gutenberg corpus and news articles crawled from the web.

A custom multi-threaded web scraper was written for this purpose. The scraper works in a simple way: there is one crawler thread for each starting URL. Each thread only follows links that are on the same domain as its starting URL. In other words, it does not follow links that lead to other websites. Text is extracted from the pages using the Boilerpipe library.
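The same-domain rule and the one-thread-per-starting-URL layout can be sketched in a few lines of Python. This is only an illustration, not the scraper used for the experiments: the names `same_domain`, `links_to_follow`, `crawl`, and `crawl_all` are made up for this sketch, and `fetch_hrefs` is a hypothetical callable that returns the raw link targets found on a page. Downloading pages and extracting text with Boilerpipe are left out.

```python
import threading
from collections import deque
from urllib.parse import urljoin, urlparse


def same_domain(url, start_url):
    """Return True when url has the same host as start_url."""
    return urlparse(url).netloc.lower() == urlparse(start_url).netloc.lower()


def links_to_follow(page_url, hrefs, start_url):
    """Resolve raw hrefs against page_url and keep same-domain HTTP(S) links."""
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme in ("http", "https") and same_domain(absolute, start_url):
            kept.append(absolute)
    return kept


def crawl(start_url, fetch_hrefs, max_pages=100):
    """Breadth-first crawl that never leaves the domain of start_url.

    fetch_hrefs is a hypothetical callable: page URL -> list of raw hrefs.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in links_to_follow(page, fetch_hrefs(page), start_url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def crawl_all(start_urls, fetch_hrefs):
    """Run one crawler thread per starting URL and collect the visited pages."""
    results = {}

    def worker(url):
        results[url] = crawl(url, fetch_hrefs)

    threads = [threading.Thread(target=worker, args=(u,)) for u in start_urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing the page fetcher in as a function keeps the crawling logic testable without network access; a real crawler would also need politeness delays and error handling.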

A text corpus of around 1 billion words was obtained from English-language news websites. Due to time constraints, it was not possible to try the entire corpus.

Three experiments were conducted to obtain extra data: one using a 340M-word subset of the Gutenberg corpus, and two using 140M-word and 315M-word subsets of the crawled news. This lets us compare two different data sets and also see the effect of giving the selection algorithm more data to choose from.

The Democracy Now corpus used here has 1259987 words (59571 sentences). After splitting the in-domain corpus 90/10 into training and test sets, the perplexity of the test set against an LM trained on the training set was 125.63.
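For readers who want to see what this baseline measures, here is a minimal sketch of a 90/10 split followed by a perplexity computation. It makes several assumptions that are not from the experiments: the split takes the first 90% of sentences without shuffling, and the LM is a unigram model with add-one smoothing, which is far simpler than the LMs behind the numbers in this post. All function names are made up for the illustration.

```python
import math
from collections import Counter


def split_90_10(sentences):
    """First 90% of sentences for training, the rest for testing (no shuffling)."""
    cut = len(sentences) * 9 // 10
    return sentences[:cut], sentences[cut:]


def train_unigram(sentences):
    """Count tokens, appending an end-of-sentence marker to every sentence."""
    counts = Counter()
    for s in sentences:
        counts.update(s.split() + ["</s>"])
    return counts


def perplexity(counts, sentences):
    """Perplexity under add-one smoothing.

    One extra vocabulary slot is reserved for words unseen in training.
    """
    total = sum(counts.values())
    vocab = len(counts) + 1
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        for tok in s.split() + ["</s>"]:
            log_prob += math.log((counts[tok] + 1) / (total + vocab))
            n_tokens += 1
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / n_tokens)
```

Lower perplexity means the model is less surprised by the test set, which is why it serves as the yardstick for every experiment below.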

The total word counts, including sentence markers, are:

Gutenberg corpus: 371665822 (15999414 sentences)
Subset #1 of crawled news corpus: 158474509 (7257989 sentences)
Subset #2 of crawled news corpus: 349157708 (15988267 sentences)
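These totals are larger than the 340M, 140M and 315M figures quoted earlier because of the sentence markers. As a rough check, and assuming two markers per sentence (one at the start and one at the end, which is an assumption of this sketch), the markers can be subtracted back out:

```python
# (total words including markers, number of sentences), as listed above
CORPORA = {
    "Gutenberg": (371665822, 15999414),
    "Crawled news subset #1": (158474509, 7257989),
    "Crawled news subset #2": (349157708, 15988267),
}


def words_without_markers(total, sentences, markers_per_sentence=2):
    """Subtract the sentence markers (assumed: two per sentence) from the total."""
    return total - markers_per_sentence * sentences


for name, (total, sentences) in CORPORA.items():
    print(f"{name}: {words_without_markers(total, sentences)} words")
```

Under that assumption the counts come out at about 339.7M, 144.0M and 317.2M words, close to the rounded subset sizes used as labels in this post.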

For the Gutenberg case, the following results were obtained. Note that the word counts in the graphs include sentence markers.

Gutenberg case for words
Gutenberg case for sentences

After trying smaller LMs, the lowest perplexity was obtained with the smallest segment, which contained 4491752 words and gave a test set perplexity of 212.4.

The algorithm seems to favor shorter sentences.

Here are the results for the 140M subset of crawled data:

Crawled corpus 140M case for words
Crawled corpus 140M case for sentences

The crawled data works much better than the Gutenberg corpus.

Here are the results for the 315M subset of crawled data:

Crawled corpus 315M case for words
Crawled corpus 315M case for sentences

We can see that using more data results in better test set perplexity, since a larger pool contains more sentences similar to the in-domain data for the algorithm to select from.