The LIUM team, the main CMUSphinx contributor, announced today the release of TEDLIUM corpus version 2, an amazing database prepared from transcribed TED talks.
Details on this update can be found in the corresponding publication:
A. Rousseau, P. Deléglise, and Y. Estève, "Enhancing the TED-LIUM Corpus with Selected Data for Language Modeling and More TED Talks", in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 2014.
This database of 200 hours of speech lets you build a speech recognition system with very good performance using open source toolkits like Kaldi or CMUSphinx. A Kaldi recipe for TEDLIUM v1 is available in the repository, and we hope that an update for TEDLIUM v2 will be available soon.
Modern techniques such as automatic alignment of transcribed audio have made it easy to create very competitive databases, so it is easy to predict that the available databases will quickly grow to thousands of hours and that we will see a very significant improvement in the accuracy of open source recognition. The problem is that quite powerful training clusters will be required to work with such databases: it is not possible to train a model on a single server in an acceptable amount of time.