Why I Removed pocketsphinx_continuous And What You Can Do About It, Part Two

Executive Summary: Voice Activity Detection is necessary but not sufficient for endpointing and wake-up word detection, which are different and more complex problems. One size does not fit all. For this reason, these tasks are better done explicitly and externally, outside the decoder.

Un jour j’irai vivre en Théorie, car en Théorie tout se passe bien.
(“One day I will go and live in Theory, because in Theory everything goes well.”)
– Pierre Desproges

Between the 0.8 and prealpha5 releases, PocketSphinx was modified to do voice activity detection by default in the feature extraction front-end, which caused unexpected behaviour, particularly when doing batch mode recognition. Specifically, it caused the timings output by the decoder in the logs and hypseg file to have no relation to the input stream, as the audio classified “non-speech” was removed from its input. Likewise, sphinx_fe would produce feature files which did not at all correspond to the length of the input (and could even be empty).
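
To make the timing problem concrete, here is a toy sketch (hypothetical code, not PocketSphinx internals) of what silence removal in the front-end does to timestamps: once “non-speech” frames are dropped, the decoder’s frame indices no longer map back to time in the input stream.

```python
# Toy illustration: dropping "non-speech" frames breaks the mapping
# from decoder frame index back to time in the input stream.

FRAMES_PER_SEC = 100  # the usual 10ms frame shift

def frame_to_time(frame_index):
    """Time of a frame, valid only if no frames were removed."""
    return frame_index / FRAMES_PER_SEC

# Input: 300 frames (3 seconds), of which frames 100-199 are silence.
input_frames = list(range(300))
is_speech = [not (100 <= i < 200) for i in input_frames]

# A front-end with silence removal passes only speech frames along.
decoder_input = [i for i in input_frames if is_speech[i]]

# The decoder believes its 150th frame is at 1.5 seconds...
print(frame_to_time(150))                 # 1.5
# ...but that frame actually came from input frame 250, i.e. 2.5 seconds.
print(frame_to_time(decoder_input[150]))  # 2.5
```

Nothing downstream of the front-end knows which frames were deleted, which is exactly why the hypseg timings stopped corresponding to the audio.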

When users noticed this, they were instructed to use the continuous listening API, which (in Theory) reconstructed the original timings. There is a certain logic to this if:

  • You are doing speech-to-text and literally nothing else
  • You are always running in live mode

Unfortunately, PocketSphinx is not actually very good at converting speech to text, so people were using it for other things, like pronunciation evaluation, force-alignment, or just plain old acoustic feature extraction using sphinx_fe, where timings really are quite important, and where batch-mode recognition is easier and more accurate. Making silence removal the default behaviour was therefore a bad idea, and hiding it from the user behind two command-line options, one of which depended on the other, was a bad API, so I removed it.

But why did we put voice activity detection in the front-end in the first place? Time For Some (more) Audio Theory!

Although we, as humans, have a really good idea of what is and isn’t speech (unless we speak Danish), at a purely acoustic level it is not nearly as obvious. There is a good, if slightly dated, summary of the problem on this website. In Theory, the ideal way to recognize what is and isn’t speech is just to do speech recognition, since by definition a speech recognizer has a good model of what is speech, which means that we can simply add a model of what isn’t and, in Theory, get the best possible performance in an “end-to-end” system. This is, in fact, an active research area.

There are fairly obvious drawbacks to doing this, primarily that speech recognition is computationally quite expensive, secondarily that learning all the possible types of “not speech” in various acoustic environments is not at all easy to do. So in practice what we do, simply put, is to create a model of “not speech”, which we call “noise”, and assume that it is added to the speech signal which we are trying to detect. Then, to detect speech, we can subtract out the noise, and if there is anything left, call this “speech”. And this is exactly what PocketSphinx prealpha5 did, at least if you enabled both the -remove_noise and -remove_silence options.
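
As a sketch of the idea (hypothetical code, far cruder than the actual PocketSphinx implementation): track a slowly-adapting estimate of the noise level, and call a frame “speech” when its energy sufficiently exceeds that estimate.

```python
import math

def frame_energy_db(frame):
    """Log energy of one frame of PCM samples."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-10)

def simple_vad(frames, margin_db=6.0, alpha=0.95):
    """Crude energy-based VAD: a frame is speech if its energy exceeds
    a slowly-adapting noise-floor estimate by margin_db decibels."""
    noise_db = frame_energy_db(frames[0])  # assume we start in noise
    decisions = []
    for frame in frames:
        e = frame_energy_db(frame)
        if e < noise_db + margin_db:
            # Looks like noise: adapt the noise floor toward it.
            noise_db = alpha * noise_db + (1.0 - alpha) * e
            decisions.append(False)
        else:
            decisions.append(True)
    return decisions

# Quiet frames (amplitude 10) around a loud burst (amplitude 1000).
quiet = [[10, -10] * 100] * 20
loud = [[1000, -1000] * 100] * 5
decisions = simple_vad(quiet + loud + quiet)
print(decisions[0], decisions[22])  # False True
```

The real implementation works per frequency band rather than on overall energy, but the principle (estimate the noise, detect what exceeds it) is the same.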

This is a reasonably simple and effective way to do voice activity detection. So why not do it?

First, because of the problem with the implementation mentioned at the top of this post, which is that it breaks the contract of frames of speech in the input corresponding to timestamps in the output. This is not insurmountable but, well, we didn’t surmount it.

Second, because it requires you to use the built-in noise subtraction in order to get voice activity detection, and you might not want to do that, because you have some much more difficult type of noise to deal with.

Third, because the feature extraction code in PocketSphinx is badly written (I can say this because I wrote it) and not easy to integrate VAD into, so… there were bugs.

Fourth, because while PocketSphinx (and other speech recognizers) use overlapping, windowed frames of audio, this is unnecessary and inefficient for doing voice activity detection. For speech segments, the overhead of a heavily-optimized VAD like the WebRTC one is minimal, and in non-speech segments we save a lot of computation by not doing windowing and MFCC computation.
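
The difference is easy to quantify. Assuming 16kHz input, compare typical ASR framing (25ms windows every 10ms, overlapping) with non-overlapping 30ms VAD frames in the WebRTC style:

```python
SAMPLE_RATE = 16000

# Typical ASR front-end framing: 25ms windows every 10ms (overlapping).
asr_window = SAMPLE_RATE * 25 // 1000   # 400 samples per window
asr_shift = SAMPLE_RATE * 10 // 1000    # 160 samples between windows
asr_frames_per_sec = SAMPLE_RATE // asr_shift   # 100 frames per second

# WebRTC-style VAD framing: non-overlapping 30ms frames.
vad_frame = SAMPLE_RATE * 30 // 1000    # 480 samples per frame
vad_frames_per_sec = SAMPLE_RATE // vad_frame   # 33 frames per second

# Samples windowed per second of audio in each case:
print(asr_frames_per_sec * asr_window)  # 40000: each sample processed ~2.5x
print(vad_frames_per_sec * vad_frame)   # 15840: each sample processed ~1x
```

And that is before counting the pre-emphasis, windowing function, FFT, and mel filterbank applied to every one of those overlapping ASR frames.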

And finally, because voice activity detection, while extremely useful for speech compression, is less useful for speech recognition.

A little like we saw previously with respect to audio hardware and APIs, the reason VAD was invented was not to do speech recognition, but to increase the capacity of telephone networks. Fundamentally, it simply tells you whether a short frame of audio contains speech (which should be transmitted) or not-speech (which can be omitted). For speech recognition, this isn’t ideal.

What you actually want to do for speech recognition really depends on what speech recognition task you’re doing. For transcription we talk about segmentation (if there is only one speaker) or diarization (if there are multiple speakers) which is a fancy word for “who said what when”. For dialogue systems we usually talk about barge-in and endpointing, i.e. detecting when the user is interrupting the system, and when the user has stopped speaking and is expecting the system to say something. And of course there is the famous “wake-up word” task where we specifically want to only detect speech that starts with a specific word or phrase.

Segmentation, diarization and endpointing are not the same thing, and none of them is the same thing as voice activity detection, though all of them are often built on top of a voice activity detector. Notably, none of them belong inside the decoder, which by its design can only process discrete “utterances”. The API for pocketsphinx-python, which provides the wrapper classes AudioFile for segmentation and LiveSpeech for endpointing, is basically the right approach, and something like it will be available in both C and Python for the 5.0 release, but with the flexibility for the user to implement their own approach if desired.
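
To illustrate the distinction, here is a hypothetical endpointer built on top of per-frame VAD decisions. This is not the pocketsphinx-python API, just a sketch of the usual “hangover” scheme: speech starts only after several consecutive speech frames, and ends only after a longer run of non-speech frames, so that short pauses don’t cut the utterance.

```python
class Endpointer:
    """Toy endpointer over per-frame VAD decisions (hypothetical API).

    An utterance starts after start_frames consecutive speech frames
    and ends after end_frames consecutive non-speech frames."""

    def __init__(self, start_frames=3, end_frames=10):
        self.start_frames = start_frames
        self.end_frames = end_frames
        self.in_speech = False
        self.run = 0  # consecutive frames contradicting the current state

    def process(self, is_speech):
        """Feed one VAD decision; return 'start', 'end', or None."""
        if is_speech != self.in_speech:
            self.run += 1
            if not self.in_speech and self.run >= self.start_frames:
                self.in_speech, self.run = True, 0
                return "start"
            if self.in_speech and self.run >= self.end_frames:
                self.in_speech, self.run = False, 0
                return "end"
        else:
            self.run = 0  # the contradicting run was broken; ignore it
        return None

ep = Endpointer()
frames = [False] * 5 + [True] * 20 + [False] * 15
events = [(i, e) for i, frame in enumerate(frames)
          if (e := ep.process(frame)) is not None]
print(events)  # [(7, 'start'), (34, 'end')]
```

Note that the VAD decision itself is only an input here; the policy (how long a pause ends an utterance, whether to allow barge-in, and so on) is application logic that has no business living inside the decoder.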

Why I Removed pocketsphinx_continuous And What You Can Do About It, Part One

Executive Summary: Audio input is complicated and a speech recognition engine, particularly a small one, should not be in the business of handling it, particularly when sox can do it for you.

For most of recorded history, PocketSphinx installed a small program called pocketsphinx_continuous which, among other things, would record audio from the microphone and do speech recognition on it. The badly formatted comment at the top of the code explained exactly what it was:

* This is a simple example of pocketsphinx application that uses continuous listening
* with silence filtering to automatically segment a continuous stream of audio input
* into utterances that are then decoded.

Unfortunately, though it was always intended as example code, because it was installed as a program you could run, people (and I am one of them) considered it to be the official command-line tool for PocketSphinx and tried to build “applications” around it. This usually ended in frustration. Why?

Time For Some Audio Theory!

Leaving aside the debatable usefulness of live-mode speech recognition for tasks other than hands-free automotive control (I don’t care how big the touchscreen is in your T*sla, I don’t want you touching it), it is nonetheless an audio application. But it is very much not like your typical audio application.

If you, as a speech developer/user/ordinary human being, try to read the documentation for a typical audio API you are likely to be deeply confused. You do not care about latency. You do not want to create a processing graph with multiplex reverb units chained into a multi-threaded non-blocking pipeline of HM-2s. You just want to get a stream of PCM data, preferably 16kHz and 16-bit, from the microphone. How the h*ck do you do that? The documentation will not help you, because the API will not help you either. It is not written for you, because audio hardware and software is not designed for you.

In the beginning, audio hardware on PCs existed for one reason: to play games. Later on, it was repurposed for recording and making music. Both of these applications have in common a single-minded focus on minimizing latency. When you jump on the boss monster’s head, it needs to go “splat” right now and not 100ms later. When you punch in the bass track, same thing (though I hope your bass doesn’t sound like the boss monster’s head exploding). As a consequence, audio APIs do singularly un-useful things like making you run your processing code in a separate real-time thread and only ever feeding it 128 samples at a time. (Speech processing uses frames that are generally at least 400 samples long.)

By contrast, while some speech applications like spoken dialogue care deeply about latency, and while it’s obviously good to have speech recognition that runs faster than real-time and gives incremental recognition results, by far the largest contributor to latency in these systems is endpointing - i.e. deciding when the user has finally stopped speaking, and this latency is at least two orders of magnitude greater than what game and music developers are worried about. Also, endpointing (and speech processing in general) is a language processing rather than an audio processing task.
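
The two-orders-of-magnitude claim is simple arithmetic (the 800ms endpointing timeout below is an illustrative figure, not a constant from any particular system):

```python
SAMPLE_RATE = 16000

# The latency a low-latency audio API worries about: one 128-sample buffer.
audio_api_latency_ms = 128 / SAMPLE_RATE * 1000   # 8.0 ms

# Endpointing latency: the trailing-silence timeout before the system
# decides you have stopped talking.
endpoint_latency_ms = 800.0

print(endpoint_latency_ms / audio_api_latency_ms)  # 100.0
```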

All this is to say that handling audio input in a speech recognition engine is super annoying and should be avoided if possible, i.e. handled by some external library or program or other part of the application code. Ideally this external thing should, as noted above, just provide a nicely buffered stream of plain old data in the optimal format for the recognizer.

Luckily, there is a program like that, and it is so perfect that development on it largely ceased in 2015. Yes, I am talking about good old SoX, the Sound eXchange tool. Think about it. Would you rather:

  • Create a device context (in a platform-specific way)
  • Create a processing thread (in a very platform-specific way)
  • Create a message queue or ring buffer sufficiently large to handle possibly slower than real-time processing (not knowing ahead of time how large this will be)
  • Write code to mix down the input, convert it to integers, and (maybe, though you don’t have to) resample it to 16kHz
  • Spin up your processing thread possibly with real-time priority
  • Then, maybe, recognize some speech

Or:

popen("sox -q -r 16000 -c 1 -b 16 -e signed-integer -d -t raw -", "r");

And get some data with fread()? From the point of view of someone who has stepped up to minimally restart maintenance of what is essentially abandonware, it’s pretty clear which one I would prefer to support.

So pocketsphinx_continuous (add .exe if you like) won’t be coming back. At the moment, in Python, you can just do this:

from pocketsphinx5 import Decoder
import subprocess
import os
MODELDIR = os.path.join(os.path.dirname(__file__), "model")
BUFSIZE = 1024

decoder = Decoder(
    hmm=os.path.join(MODELDIR, "en-us/en-us"),
    lm=os.path.join(MODELDIR, "en-us/en-us.lm.bin"),
    dict=os.path.join(MODELDIR, "en-us/cmudict-en-us.dict"),
)
sample_rate = int(decoder.config["samprate"])
soxcmd = f"sox -q -r {sample_rate} -c 1 -b 16 -e signed-integer -d -t raw -"
with subprocess.Popen(soxcmd.split(), stdout=subprocess.PIPE) as sox:
    decoder.start_utt()
    try:
        while True:
            buf = sox.stdout.read(BUFSIZE)
            if len(buf) == 0:
                break
            decoder.process_raw(buf)
    except KeyboardInterrupt:
        pass
    finally:
        decoder.end_utt()
    print(decoder.hyp().hypstr)

Or in C (see why we prefer to use Python? and no, C++ is NOT BETTER):

#include <pocketsphinx.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static int global_done = 0;
void
catch_sig(int signum)
{
    global_done = 1;
}

int
main(int argc, char *argv[])
{
    ps_decoder_t *decoder;
    cmd_ln_t *config;
    char *soxcmd;
    FILE *sox;
    #define BUFLEN 1024
    short buf[BUFLEN];
    size_t len;

    if ((config = cmd_ln_parse_r(NULL, ps_args(),
                                 argc, argv, TRUE)) == NULL)
        E_FATAL("Command line parse failed\n");
    ps_default_search_args(config);
    if ((decoder = ps_init(config)) == NULL)
        E_FATAL("PocketSphinx decoder init failed\n");
    #define SOXCMD "sox -q -r %d -c 1 -b 16 -e signed-integer -d -t raw -"
    len = snprintf(NULL, 0, SOXCMD,
                   (int)cmd_ln_float_r(config, "-samprate"));
    if ((soxcmd = malloc(len + 1)) == NULL)
        E_FATAL_SYSTEM("Failed to allocate string");
    if (signal(SIGINT, catch_sig) == SIG_ERR)
        E_FATAL_SYSTEM("Failed to set SIGINT handler");
    if (snprintf(soxcmd, len + 1, SOXCMD,
                 (int)cmd_ln_float_r(config, "-samprate")) != len)
        E_FATAL_SYSTEM("snprintf() failed");
    if ((sox = popen(soxcmd, "r")) == NULL)
        E_FATAL_SYSTEM("Failed to popen(%s)", soxcmd);
    free(soxcmd);
    ps_start_utt(decoder);
    while (!global_done) {
        if ((len = fread(buf, sizeof(buf[0]), BUFLEN, sox)) == 0)
            break;
        if (ps_process_raw(decoder, buf, len, FALSE, FALSE) < 0)
            E_FATAL("ps_process_raw() failed\n");
    }
    ps_end_utt(decoder);
    if (pclose(sox) < 0)
        E_ERROR_SYSTEM("Failed to pclose(sox)");
    if (ps_get_hyp(decoder, NULL) != NULL)
        printf("%s\n", ps_get_hyp(decoder, NULL));
    cmd_ln_free_r(config);
    ps_free(decoder);
        
    return 0;
}

What will come back for the release is a program which reads audio from standard input and outputs recognition results in JSON, so you can do useful things with them in another program. This program will probably be called pocketsphinx. It will also do voice activity detection, which will be the subject of the next in this series of blog posts. Obviously, if you want to build a real application, you’ll have to do something more sophisticated, probably a server, and if I were you I would definitely write it in Python, though Node.js is also a good choice and, hopefully, we will support it again for the release.
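
Assuming the new program emits one JSON object per line (the exact output format and field names aren’t final; the "text" field below is made up for illustration), consuming its output from another program might look like this:

```python
import json
import subprocess

def read_results(command=("pocketsphinx",)):
    """Yield recognition results from a recognizer subprocess, assuming
    it writes one JSON object per line. The command name and the field
    names are assumptions; the 5.0 output format may differ."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield json.loads(line)

# The parsing step itself, independent of any subprocess:
line = '{"text": "hello world"}'
result = json.loads(line)
print(result["text"])  # hello world
```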

Stay tuned!

Training CMU Sphinx with LibriSpeech

Executive Summary: Training is fast, easy and automated, but the accuracy is not good. You should not use CMU Sphinx for large-vocabulary continuous speech recognition.

The simplest way to train a CMU Sphinx model is on a single machine with multiple CPUs. It may not be as cost-effective as a cluster, but it is quite a bit simpler, as all of the cloud HPC cluster solutions seem to be unnecessarily difficult to set up and are incredibly badly documented. This is a real shame, and once I figure out how to use one of them, I’ll write a document explaining how to actually make it work.

To get access to a machine, a good option is Microsoft Azure, though there are many others. The free credits you get on signing up are more than sufficient to train a few models, but after that it can start to get expensive quickly. I have tried to optimize this process so that “resources” (virtual machines, disks) can be used only as long as they are needed.

First you will need to install software and (possibly) download the data. This is going to take a while no matter what, and doesn’t need multiple CPUs, so when you initially create your VM, you can use a very small one.

Setting up the software

The Azure portal is slow and unwieldy, and the cloud shell isn’t much better, so it’s worth setting up the Azure CLI locally - then log in with:

az login

We will put everything in a “resource group” so that we can track costs and usage.

az group create --name librispeech-100 --location canadacentral

Now create the VM (assuming here you already have an SSH public key that you will use to log in - otherwise omit the --ssh-key-values line):

az vm create --resource-group librispeech-100 --name compute \
--image UbuntuLTS --size Standard_B1ls \
--ssh-key-values ~/.ssh/id_rsa.pub

This will print a blob of JSON with the information for the new VM. The publicIpAddress key is the one you want, and now you can log into the newly created server with it using ssh. One way to automate this (surely there are many) is:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query [0].virtualMachine.network.publicIpAddresses[0].ipAddress \
    | tr -d '"')
ssh $ipaddr

Run the usual OS updates:

sudo apt update
sudo apt upgrade

And install Docker, which we’ll need later:

sudo apt install docker.io
sudo adduser $USER docker

Setting up data

We will set up a virtual disk to hold the data. While it’s possible to put it in a storage account, this is unsuitable for storing large numbers of small files and just generally annoying to set up. Azure has cheap disks which don’t cost more than network storage and are still way faster, so we’ll use one of those. Log out of the VM, create the disk, and attach it (75GB is enough for the full 960 hours of LibriSpeech audio, you could make it smaller if you want):

az disk create --resource-group librispeech-100 --name librispeech \
    --size-gb 75  --sku Standard_LRS
az vm disk attach --resource-group librispeech-100 \
    --vm-name compute --name librispeech

The Standard_LRS option is very important here, as the default disk type is extremely expensive. Now log back in, then partition, format, and mount the disk. It should be available as /dev/sdc, but this isn’t guaranteed, so run lsblk to find it:

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
...
sdc       8:16   0   75G  0 disk 

You can now use parted to create a partition:

sudo parted /dev/sdc
mklabel gpt
mkpart primary ext4 0% 100%

And mkfs.ext4 to format it:

$ sudo mkfs.ext4 /dev/sdc1
mke2fs 1.44.1 (24-Mar-2018)
Discarding device blocks: done                            
Creating filesystem with 19660288 4k blocks and 4915200 inodes
Filesystem UUID: e8acc3b5-eec7-441e-96e8-50fa200471bb
...

The Filesystem UUID (which will not match the one above) is important, as it allows you to make the disk persistent. Add a line to /etc/fstab using the UUID that was printed when you formatted the partition:

UUID=e8acc3b5-eec7-441e-96e8-50fa200471bb # CHANGE THIS!!!
echo "UUID=$UUID /data/librispeech auto defaults,nofail 0 2" \
    | sudo tee -a /etc/fstab

And, finally, mount it and give access to the normal user:

sudo mkdir -p /data/librispeech
sudo mount /data/librispeech
sudo chown $(id -u):$(id -g) /data/librispeech

I promise you, all that was still way easier than using a storage account.

Now download and unpack the data directly to the data disk:

cd /data/librispeech
curl -L https://www.openslr.org/resources/12/train-clean-100.tar.gz \
    | tar zxvf -
curl -L https://www.openslr.org/resources/12/dev-clean.tar.gz \
    | tar zxvf -

Setting up training

Make a scratch directory (we are going to save it and restore it when we restart the VM later):

sudo mkdir /mnt/work
sudo chown $(id -u):$(id -g) /mnt/work

Now set up the training directory and get a few extra files:

mkdir librispeech
cd librispeech
docker run -v $PWD:/st dhdaines/sphinxtrain -t librispeech setup
ln -s /data/librispeech/LibriSpeech wav
cd etc
wget https://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz
wget https://www.openslr.org/resources/11/librispeech-lexicon.txt

Edit a few things in sphinx_train.cfg. It is not recommended to train PTM models for this amount of data, as the training is quite slow, probably unnecessarily so.

$CFG_WAVFILE_EXTENSION = 'flac';
$CFG_WAVFILE_TYPE = 'sox';
$CFG_HMM_TYPE  = '.cont.';
$CFG_FINAL_NUM_DENSITIES = 16;
$CFG_N_TIED_STATES = 4000;
$CFG_NPART = 16;
$CFG_QUEUE_TYPE = "Queue::POSIX";
$CFG_G2P_MODEL= 'yes';
$DEC_CFG_LANGUAGEMODEL  = "$CFG_BASE_DIR/etc/3-gram.pruned.1e-7.arpa.gz";
$DEC_CFG_LISTOFFILES    = "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_dev.fileids";
$DEC_CFG_TRANSCRIPTFILE = 
    "$CFG_BASE_DIR/etc/${CFG_DB_NAME}_dev.transcription";
$DEC_CFG_NPART = 20;

You will need the scripts from templates/librispeech in SphinxTrain, which unfortunately don’t get copied automatically… Check out the SphinxTrain source and copy them from there or simply download them from GitHub.

Create the transcripts and OOV list:

python3 make_librispeech_transcripts.py \
    -l etc/librispeech-lexicon.txt --100 wav

Create dictionaries:

python3 make_librispeech_dict.py etc/librispeech-lexicon.txt

Now we will save this setup to the system disk because we will resize the VM in order to run compute on more CPUs:

cd /mnt
tar zcf ~/work.tar.gz work

Finally pull the Docker image we’ll use for training:

docker pull dhdaines/sphinxtrain

Running training

We can now run training! Let’s resize the VM to 16 CPUs (you can use more if you want, but remember to set $CFG_NPART to at least the number of CPUs - also, certain stages of training won’t use them all):

az vm resize --resource-group librispeech-100 --name compute \
    --size Standard_F16s_v2

Log into the updated VM (it will have a different IP address) and remake the scratch directory:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query [0].virtualMachine.network.publicIpAddresses[0].ipAddress \
    | tr -d '"')
ssh $ipaddr
cd /mnt
sudo tar xf ~/work.tar.gz

Now run training. Note that in addition to the scratch directory, we also need to “mount” the data directory inside the Docker image so it can be seen:

docker run -v $PWD:/st -v /data/librispeech/LibriSpeech:/st/wav \
    dhdaines/sphinxtrain run

This should take about 5 hours including decoding (the training itself only takes an hour and a half). You should obtain a word error rate of approximately 18.5%, which, it should be said, is pretty terrible. For comparison, a Kaldi baseline with this training set and language model gives 11.69%, the best Kaldi system with this language model (but trained on the full 960 hours of LibriSpeech) gets 5.99%, and using a huge neural network acoustic model and an even huger 4-gram language model, Kaldi can go as low as 4.07% at the moment.
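
To put those numbers in the usual terms, here are the same error rates expressed as relative error reductions over the SphinxTrain baseline:

```python
baseline_wer = 18.5  # SphinxTrain on train-clean-100

systems = {
    "Kaldi baseline (same data and LM)": 11.69,
    "Kaldi, full 960h training data": 5.99,
    "Kaldi, big NN + 4-gram LM": 4.07,
}

for name, wer in systems.items():
    # Relative reduction: what fraction of the baseline's errors go away.
    reduction = 100.0 * (baseline_wer - wer) / baseline_wer
    print(f"{name}: {wer}% WER, {reduction:.1f}% relative reduction")
```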

Of course, wav2vec2000, DeepSpeech42, HAL9000 and company are somewhere under 2%.

What’s missing from CMU Sphinx? Well, it should be noted that what we’ve done here is the degree zero of automatic speech recognition, using strictly 20th-century technology. Kaldi is more parameter efficient, as it allows a different number of Gaussians for each phone, and its baseline model already includes speaker-adaptive training and feature-space speaker adaptation, which make a big difference. The Kaldi decoder is also faster and more accurate and supports rescoring with larger and more accurate language models (4-gram, 5-gram, RNN, etc).

On the other hand, SphinxTrain is easier to use than the Kaldi training scripts ;-)

You should, therefore, probably not use CMU Sphinx.

Saving the training setup

You will probably want to use the acoustic models (in model_parameters/librispeech.cd_cont_4000, given the settings above) for something. You may also wish to rerun the training with different parameters. The most obvious solution, provided you have a couple of gigabytes to spare (for librispeech-100 you need about 3G) and a sufficiently fast connection, is to copy it to your personal machine, if you have one:

ipaddr=$(az vm list-ip-addresses -g librispeech-100 -n compute \
    --query [0].virtualMachine.network.publicIpAddresses[0].ipAddress \
    | tr -d '"')
rsync -av --exclude=__pycache__ --exclude='*.html' \
    --exclude=bwaccumdir --exclude=qmanager --exclude=logdir \
    $ipaddr:/mnt/work/librispeech .

Another option is to create a file share and mount it:

az storage share-rm create --resource-group librispeech-100 \
    --storage-account $STORAGE_ACCT --name librispeech --quota 1024
sudo mkdir -p /data/store
sudo mount -t cifs //$STORAGE_ACCT.file.core.windows.net/librispeech /data/store \
    -o uid=$(id -u),gid=$(id -g),credentials=/etc/smbcredentials/$STORAGE_ACCT.cred,serverino,nosharesock,actimeo=30

You could also use an Azure Storage blob with azcopy, or other things like that. Now copy your training directory and models to a tar file (no need to compress it):

tar --exclude=bwaccumdir --exclude=logdir --exclude=qmanager \
    --exclude='*.html' -cf /data/store/librispeech-100-train.tar \
    -C /mnt/work librispeech

Notice that it is considerably smaller than the original dataset.

Shutting down

Now you must either deallocate the VM or convert it back to a cheap one to avoid paying for unused time:

az vm deallocate --resource-group librispeech-100 --name compute
# or
az vm resize --resource-group librispeech-100 \
    --name compute --size Standard_B1ls

Note that in either case your scratch directory will be erased. Note also that you will continue to pay a small amount of money for the data disk you created (as well as the storage account, if you have one). You can make a free snapshot of the data disk which will allow you to deallocate it and stop paying for it. I don’t know how to do that, though…

CMUSphinx Maintenance Restarted

Thank you to all of the contributors who have kept CMUSphinx alive over the last decade, in particular Nickolay Shmyrev and the whole team at Alpha Cephei. As you may have noticed, active development has mostly ceased over the last few years, and the technological foundation of CMUSphinx has become quite antiquated.

For state-of-the-art speech recognition the Alpha Cephei team is now working exclusively on Vosk, and there are a number of other open source options, notably Coqui, wav2vec, Julius, TensorFlowASR, DeepSpeech and of course Kaldi.

Nonetheless, there are still many people using CMUSphinx, and PocketSphinx in particular, so there is some value in maintaining (if not actively developing) it. Its users frequently encounter difficulties due to the build system, which could be corrected by modernizing the codebase slightly. Due to the eternal "pre-alpha" status of the system, there are also many problems of portability and stability that should be addressed.

For this reason, we are preparing a true release of PocketSphinx, with a focus on a modern build system with no external dependencies, and a stable, documented, and easy to use API in C and Python. In addition, SphinxTrain will either continue to be maintained or will simply be integrated into PocketSphinx.

Finally, SourceForge is no longer a viable option for hosting. From now on, the GitHub Project is the official home of CMUSphinx, and we will soon migrate all the other downloads (models, etc) and close the SourceForge site.