Training CMU Sphinx with LibriSpeech

TL;DR: Training is fast, easy and automated, but the accuracy is not good. You should not use CMU Sphinx for large-vocabulary continuous speech recognition.

The simplest way to train a CMU Sphinx model is on a single machine with multiple CPUs. This may not be as cost-effective as a cluster, but it is quite a bit simpler, as all of the cloud HPC cluster solutions seem to be unnecessarily difficult to set up and incredibly badly documented. This is a real shame, and once I figure out how to use one of them, I’ll write a document which explains how to actually make it work.

To get access to a machine, a good option is Microsoft Azure, though there are many others. The free credits you get on signing up are more than sufficient to train a few models, but after that it can get expensive quickly. I have tried to optimize this process so that “resources” (virtual machines, disks) are used only for as long as they are needed.

First you will need to install software and (possibly) download the data. This is going to take a while no matter what, and doesn’t need multiple CPUs, so when you initially create your VM, you can use a very small one.

Setting up the software

The Azure portal is slow and unwieldy, and the cloud shell isn’t much better, so it’s worth setting up the Azure CLI locally - then log in with:

az login

We will put everything in a “resource group” so that we can track costs and usage.

az group create --name librispeech-100 --location canadacentral

Now create the VM (assuming here you already have an SSH public key that you will use to log in - otherwise omit the --ssh-key-values line):

az vm create --resource-group librispeech-100 --name compute \
--image UbuntuLTS --size Standard_B1ls \
--ssh-key-values ~/.ssh/

This will print a blob of JSON with the information for the new VM. The publicIpAddress key is the one you want, and now you can log into the newly created server with it using ssh. One way to automate this (surely there are many) is:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" \
    --output tsv)
ssh $ipaddr

Run the usual OS updates:

sudo apt update
sudo apt upgrade

And install Docker, which we’ll need later:

sudo apt install
sudo adduser $USER docker

Setting up data

We will set up a virtual disk to hold the data. While it’s possible to put it in a storage account, this is unsuitable for storing large numbers of small files and just generally annoying to set up. Azure has cheap disks which don’t cost more than network storage and are still way faster, so we’ll use one of those. Log out of the VM, create the disk, and attach it (75GB is enough for the full 960 hours of LibriSpeech audio; you could make it smaller if you want):

az disk create --resource-group librispeech-100 --name librispeech \
    --size-gb 75  --sku Standard_LRS
az vm disk attach --resource-group librispeech-100 \
    --vm-name compute --name librispeech

The Standard_LRS option is very important here, as the default disk type is extremely expensive. Now log back in, then partition, format, and mount the disk. It should be available as /dev/sdc, but this isn’t guaranteed, so run lsblk to find it:

$ lsblk
sdc       8:16   0   75G  0 disk 
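If you would rather not eyeball the lsblk output, here is a minimal sketch that picks out whole disks by their size. The function name disks_of_size and the sample listing are made up for illustration, and it assumes the new disk is the only one of that size, so double-check before partitioning anything.

```shell
# Print /dev/NAME for every whole disk whose size matches $1 (e.g. 75G).
# Reads "NAME SIZE TYPE" lines on stdin, as printed by:
#   lsblk -dn -o NAME,SIZE,TYPE
disks_of_size() {
    awk -v want="$1" '$3 == "disk" && $2 == want { print "/dev/" $1 }'
}

# Canned sample input (invented for this example). On the VM you would run:
#   lsblk -dn -o NAME,SIZE,TYPE | disks_of_size 75G
sample='sda 30G disk
sdb 4G disk
sdc 75G disk
sr0 628K rom'
printf '%s\n' "$sample" | disks_of_size 75G
```

If this prints more than one device, fall back to checking lsblk by hand.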

You can now use parted to create a partition:

sudo parted /dev/sdc
mklabel gpt
mkpart primary ext4 0% 100%

And mkfs.ext4 to format it:

$ sudo mkfs.ext4 /dev/sdc1
mke2fs 1.44.1 (24-Mar-2018)
Discarding device blocks: done                            
Creating filesystem with 19660288 4k blocks and 4915200 inodes
Filesystem UUID: e8acc3b5-eec7-441e-96e8-50fa200471bb

The Filesystem UUID (which will not match the one above) is important, as it allows you to make the disk persistent. Add a line to /etc/fstab using the UUID that was printed when you formatted the partition:

UUID=e8acc3b5-eec7-441e-96e8-50fa200471bb # CHANGE THIS!!!
echo "UUID=$UUID /data/librispeech auto defaults,nofail 0 2" \
    | sudo tee -a /etc/fstab
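Copying the UUID by hand is error-prone, so one alternative is to read it back with blkid and build the fstab line from that. This is only a sketch: the helper name fstab_entry is hypothetical, and it assumes the partition really is /dev/sdc1.

```shell
# Build an /etc/fstab entry from a filesystem UUID ($1) and a mount point ($2).
fstab_entry() {
    printf 'UUID=%s %s auto defaults,nofail 0 2\n' "$1" "$2"
}

# On the VM (needs root; assumes the partition is /dev/sdc1):
#   UUID=$(sudo blkid -s UUID -o value /dev/sdc1)
#   fstab_entry "$UUID" /data/librispeech | sudo tee -a /etc/fstab
# Illustration using the example UUID from the mkfs.ext4 output above:
fstab_entry e8acc3b5-eec7-441e-96e8-50fa200471bb /data/librispeech
```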

And, finally, mount it and give access to the normal user:

sudo mkdir -p /data/librispeech
sudo mount /data/librispeech
sudo chown $(id -u):$(id -g) /data/librispeech

I promise you, all that was still way easier than using a storage account.

Now download and unpack the data directly to the data disk:

cd /data/librispeech
curl -L \
    | tar zxvf -
curl -L \
    | tar zxvf -

Setting up training

Make a scratch directory (we are going to save it and restore it when we restart the VM later):

sudo mkdir /mnt/work
sudo chown $(id -u):$(id -g) /mnt/work

Now set up the training directory and get a few extra files:

cd /mnt/work
mkdir librispeech
cd librispeech
docker run -v $PWD:/st dhdaines/sphinxtrain -t librispeech setup
ln -s /data/librispeech/LibriSpeech wav
cd etc

Edit a few things in sphinx_train.cfg. It is not recommended to train PTM models for this amount of data, as the training is quite slow, probably unnecessarily so.

$CFG_HMM_TYPE  = '.cont.';
$CFG_NPART = 16;
$CFG_G2P_MODEL= 'yes';
$DEC_CFG_LISTOFFILES    = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_dev.fileids";
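If you end up redoing this setup often, the edits can be scripted. The following is a rough sketch, not something from SphinxTrain itself: set_cfg is a hypothetical helper, and it assumes each variable already has an uncommented assignment on a single line in sphinx_train.cfg. Anything after the "=" on that line (including trailing comments) gets replaced.

```shell
# set_cfg FILE NAME VALUE: rewrite the active (uncommented) "$NAME = ...;"
# line in a SphinxTrain config file. VALUE is inserted verbatim, so it must
# not contain "|", "&" or backslashes, which are special to sed here.
set_cfg() (
    tmp=$(mktemp) || exit 1
    sed -E "s|^\\\$${2}[[:space:]]*=.*|\$${2} = ${3};|" "$1" > "$tmp" &&
        cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Demonstration on a throwaway file (contents invented for the example):
demo=$(mktemp)
printf '%s\n' '$CFG_NPART = 1;' '#$CFG_NPART = 4;' > "$demo"
set_cfg "$demo" CFG_NPART 16
cat "$demo"
rm -f "$demo"

# On the real file, the four edits above would look like:
#   set_cfg sphinx_train.cfg CFG_HMM_TYPE "'.cont.'"
#   set_cfg sphinx_train.cfg CFG_NPART 16
#   set_cfg sphinx_train.cfg CFG_G2P_MODEL "'yes'"
#   set_cfg sphinx_train.cfg DEC_CFG_LISTOFFILES \
#       '"$CFG_BASE_DIR/etc/${CFG_DB_NAME}_dev.fileids"'
```

Commented-out alternatives (lines starting with #) are left alone, since the pattern only matches lines that begin with the dollar sign.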

You will need the scripts from templates/librispeech in SphinxTrain, which unfortunately don’t get copied automatically… Check out the SphinxTrain source and copy them from there or simply download them from GitHub.

Create the transcripts and OOV list:

python3 \
    -l etc/librispeech-lexicon.txt --100 wav

Create dictionaries:

python3 etc/librispeech-lexicon.txt

Now we will save this setup to the system disk, because we are about to resize the VM so that training can run on more CPUs:

cd /mnt
tar zcf ~/work.tar.gz work
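Since the scratch directory disappears on resize, it may be worth checking that the archive actually contains what you expect before moving on. Here is a small sketch; archive_has is a made-up helper, and the path it looks for assumes the training directory lives at /mnt/work/librispeech.

```shell
# archive_has ARCHIVE PATH: succeed if the gzipped tar ARCHIVE lists PATH.
archive_has() {
    tar tzf "$1" | grep -qxF "$2"
}

# On the VM, after creating ~/work.tar.gz:
#   archive_has ~/work.tar.gz work/librispeech/etc/sphinx_train.cfg && echo OK

# Self-contained demonstration in a temporary directory:
tmp=$(mktemp -d)
mkdir -p "$tmp/work/librispeech/etc"
: > "$tmp/work/librispeech/etc/sphinx_train.cfg"
tar zcf "$tmp/work.tar.gz" -C "$tmp" work
archive_has "$tmp/work.tar.gz" work/librispeech/etc/sphinx_train.cfg \
    && echo "archive looks complete"
rm -rf "$tmp"
```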

Finally pull the Docker image we’ll use for training:

docker pull dhdaines/sphinxtrain

Running training

We can now run training! Let’s resize the VM to 16 CPUs (you can use more if you want, and it will be faster, though certain stages of training won’t use them all):

az vm resize --resource-group librispeech-100 --name compute \
    --size Standard_F16s_v2

Log into the updated VM (it will have a different IP address) and remake the scratch directory:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" \
    --output tsv)
ssh $ipaddr
cd /mnt
sudo tar xf ~/work.tar.gz

Now run training from the training directory. Note that in addition to the scratch directory, we also need to “mount” the data directory inside the Docker container so it can be seen:

cd /mnt/work/librispeech
docker run -v $PWD:/st -v /data/librispeech/LibriSpeech:/st/wav \
    dhdaines/sphinxtrain run

This should take about 5 hours including decoding (actually the training only takes an hour and a half…) You should obtain a word error rate of approximately 18.5%, which, it should be said, is pretty terrible. For comparison, a Kaldi baseline with this training set and language model gives 11.69%, the best Kaldi system with this language model (but trained on the full 960 hours of LibriSpeech) gets 5.99%, and using a huge neural network acoustic model and an even huger 4-gram language model, Kaldi can go as low as 4.07% at the moment.

Of course, wav2vec2000, DeepSpeech42, HAL9000 and company are somewhere under 2%.
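To put the Kaldi figures above in perspective, it helps to express them as a relative reduction in errors compared to the 18.5% obtained here, that is, 100 × (base − new) / base. A quick sketch of the arithmetic (the function name rel_reduction is just for illustration):

```shell
# rel_reduction BASE NEW: percentage by which NEW improves on BASE,
# i.e. 100 * (BASE - NEW) / BASE, printed with one decimal place.
rel_reduction() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", 100 * (base - new) / base }'
}

# Compare each Kaldi word error rate quoted above with the 18.5% from Sphinx.
for wer in 11.69 5.99 4.07; do
    printf '%s%% WER: %s%% fewer errors than 18.5%%\n' \
        "$wer" "$(rel_reduction 18.5 "$wer")"
done
```

In other words, the Kaldi systems make roughly 37%, 68% and 78% fewer errors respectively.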

What’s missing from CMU Sphinx? Well, it should be noted that what we’ve done here is the degree zero of automatic speech recognition, using strictly 20th-century technology. Kaldi is more parameter efficient, as it allows a different number of Gaussians for each phone, and its baseline model already includes speaker-adaptive training and feature-space speaker adaptation, which make a big difference. The Kaldi decoder is also faster and more accurate and supports rescoring with larger and more accurate language models (4-gram, 5-gram, RNN, etc).

On the other hand, SphinxTrain is easier to use than the Kaldi training scripts ;-)

You should, therefore, probably not use CMU Sphinx.

Saving the training setup

You will probably want to use the acoustic models (in model_parameters/librispeech.cd_ptm_5000) for something. You may also wish to rerun the training with different parameters. The most obvious solution, provided you have a couple of gigabytes to spare (for librispeech-100 you need about 3G) and a sufficiently fast connection, is to copy the training directory to your personal machine:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query "[0].virtualMachine.network.publicIpAddresses[0].ipAddress" \
    --output tsv)
rsync -av --exclude=__pycache__ --exclude='*.html' \
    --exclude=bwaccumdir --exclude=qmanager --exclude=logdir \
    $ipaddr:/mnt/work/librispeech .

Another option is to create a file share and mount it:

az storage share-rm create --resource-group librispeech-100 \
    --storage-account $STORAGE_ACCT --name librispeech --quota 1024
sudo mkdir -p /data/store
sudo mount -t cifs //$ /data/store \
    -o uid=$(id -u),gid=$(id -g),credentials=/etc/smbcredentials/$STORAGE_ACCT.cred,serverino,nosharesock,actimeo=30

You could also use an Azure Storage blob with azcopy, or other things like that. Now copy your training directory and models to a tar file (no need to compress it):

tar --exclude=bwaccumdir --exclude=logdir --exclude=qmanager \
    --exclude='*.html' -cf /data/store/librispeech-100-train.tar \
    -C /mnt/work librispeech

Notice that it is considerably smaller than the original dataset.

Shutting down

Now you must either deallocate the VM or convert it back to a cheap one to avoid paying for unused time:

az vm deallocate --resource-group librispeech-100 --name compute
# or
az vm resize --resource-group librispeech-100 \
    --name compute --size Standard_B1ls

Note that in either case your scratch directory will be erased. Note also that you will continue to pay a small amount of money for the data disk you created (as well as the storage account, if you have one). You can make a free snapshot of the data disk which will allow you to deallocate it and stop paying for it. I don’t know how to do that, though…
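One way the snapshot route might look with the Azure CLI is sketched below. This is untested: the snapshot name (the disk name plus "-snapshot"), the function names, and the AZ variable are all invented for illustration. Check current pricing first, since snapshots may well be billed too, just at a lower rate than a provisioned disk.

```shell
# Untested sketch. AZ defaults to the real CLI; set AZ=echo to print the
# commands instead of running them.
AZ=${AZ:-az}
RG=librispeech-100

# archive_disk DISK: snapshot the data disk, then detach and delete it.
archive_disk() {
    "$AZ" snapshot create --resource-group "$RG" \
        --name "$1-snapshot" --source "$1" &&
    "$AZ" vm disk detach --resource-group "$RG" \
        --vm-name compute --name "$1" &&
    "$AZ" disk delete --resource-group "$RG" --name "$1" --yes
}

# restore_disk DISK: recreate the disk from its snapshot and reattach it.
restore_disk() {
    "$AZ" disk create --resource-group "$RG" --name "$1" \
        --source "$1-snapshot" --sku Standard_LRS &&
    "$AZ" vm disk attach --resource-group "$RG" \
        --vm-name compute --name "$1"
}
```

Running `AZ=echo` followed by `archive_disk librispeech` prints the three commands without touching anything, which is a sensible first step given that the last one deletes the disk.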