CMUSphinx Open Source Speech Recognition

Sep 7, 2022

PocketSphinx 5.0.0 release candidate 3

Executive Summary: Try the new Python module (please).

Hot on the heels of the last one, there is another release candidate. You can also download it from PyPI.

There isn’t much to announce except that the pocketsphinx5 Python package doesn’t exist anymore. That’s right, the Python interface is now called just plain pocketsphinx. It should install properly on Windows and Mac OS X now, as well.

Pull requests and bug reports and such are welcome via https://github.com/cmusphinx/pocketsphinx.

Sep 1, 2022

PocketSphinx 5.0.0 release candidate 2

Executive Summary: This is a release candidate and the API is not yet stable so please don’t package it.

PocketSphinx now has a release candidate. You can also download it from PyPI.

Why release candidate 2? Because there was a release candidate 1, but it had various problems regarding installation, so I made another one. This one is relatively complete, but the documentation isn’t good, and it hasn’t been fully tested on Windows or Mac OS X. If you are courageous, you can try that. Installation should be a matter of:

cmake -S . -B build
cmake --build build
sudo cmake --build build --target install

The most important change versus 5prealpha is, as mentioned previously, the disappearance of pocketsphinx_continuous and the “live” API in general, which has been replaced with <pocketsphinx/endpointer.h>. The API is quite simple but it requires you to feed it data in precise quantities. The best way to do this is to ensure that you can read data from a file stream, as shown in the examples live.c and live.py.

For command-line usage there is a very Unixy program called pocketsphinx now, which nonetheless doesn’t have a man page yet (Update: it has a man page). Use it like this:

# From microphone
sox -d $(pocketsphinx soxflags) | pocketsphinx
# From file
sox audio.wav $(pocketsphinx soxflags) | pocketsphinx

There are no innovations with respect to modeling, algorithms, etc, and there will never be. But I am trying to make this into a decent piece of software nonetheless. All documentation, bug reports (that are actually bug reports and not just ‘how do i run the program’) and such are welcome via https://github.com/cmusphinx/pocketsphinx.

Aug 18, 2022

Why I Removed pocketsphinx_continuous And What You Can Do About It, Part Two

Executive Summary: Voice Activity Detection is necessary but not sufficient for endpointing and wake-up word detection, which are different and more complex problems. One size does not fit all. For this reason it is better to do it explicitly and externally.

Un jour j’irai vivre en Théorie, car en Théorie tout se passe bien.
– Pierre Desproges

Between the 0.8 and prealpha5 releases, PocketSphinx was modified to do voice activity detection by default in the feature extraction front-end, which caused unexpected behaviour, particularly when doing batch mode recognition. Specifically, it caused the timings output by the decoder in the logs and hypseg file to have no relation to the input stream, as the audio classified “non-speech” was removed from its input. Likewise, sphinx_fe would produce feature files which did not at all correspond to the length of the input (and could even be empty).

When users noticed this, they were instructed to use the continuous listening API, which (in Theory) reconstructed the original timings. There is a certain logic to this if:

You are doing speech-to-text and literally nothing else
You are always running in live mode

Unfortunately, PocketSphinx is not actually very good at converting speech to text, so people were using it for other things, like pronunciation evaluation, force-alignment, or just plain old acoustic feature extraction using sphinx_fe, where timings really are quite important, and where batch-mode recognition is easier and more accurate. Making silence removal the default behaviour was therefore a bad idea, and hiding it from the user behind two command-line options, one of which depended on the other, was a bad API, so I removed it.

But why did we put voice activity detection in the front-end in the first place? Time For Some (more) Audio Theory!

Although we, as humans, have a really good idea of what is and isn’t speech (unless we speak Danish)¹, at a purely acoustic level, it is not nearly as obvious. There is a good, if slightly dated summary of the problem on this website. In Theory, the ideal way to recognize what is and isn’t speech is just to do speech recognition, since by definition a speech recognizer has a good model of wnat is speech, which means that we can simply add a model of what isn’t and, in Theory, get the best possible performance in an “end-to-end” system. And this is an active research area, for example.

There are fairly obvious drawbacks to doing this, primarily that speech recognition is computationally quite expensive, secondarily that learning all the possible types of “not speech” in various acoustic environments is not at all easy to do. So in practice what we do, simply put, is to create a model of “not speech”, which we call “noise”, and assume that it is added to the speech signal which we are trying to detect. Then, to detect speech, we can subtract out the noise, and if there is anything left, call this “speech”. And this is exactly what PocketSphinx prealpha5 did, at least if you enabled both the -remove_noise and -remove_silence options.

This is a reasonably simple and effective way to do voice activity detection. So why not do it?

First, because of the problem with the implementation mentioned at the top of this post, which is that it breaks the contract of frames of speech in the input corresponding to timestamps in the output. This is not insurmountable but, well, we didn’t surmount it.

Second, because it requires you to use the built-in noise subtraction in order to get voice activity detection, and you might not want to do that, because you have some much more difficult type of noise to deal with.

Third, because the feature extraction code in PocketSphinx is badly written (I can say this because I wrote it) and not easy to integrate VAD into, so… there were bugs.

Fourth, because while PocketSphinx (and other speech recognizers) use overlapping, windowed frames of audio, this is unnecessary and inefficient for doing voice activity detection. For speech segments, the overhead of a heavily-optimized VAD like the WebRTC one is minimal, and in non-speech segments we save a lot of computation by not doing windowing and MFCC computation.

And finally, because voice activity detection, while extremely useful for speech compression, is less useful for speech recognition.

A little like we saw previously with respect to audio hardware and APIs, the reason VAD was invented was not to do speech recognition, but to increase the capacity of telephone networks. Fundamentally, it simply tells you if there is speech (which should be transmitted) or not-speech (which can be omitted) in a short frame of audio. This isn’t ideal, because:

It breaks the signal into acoustic rather than linguistic segments.
Speech contains things that don’t actually “sound like speech”, e.g. stop consonants (which are mostly silence), but really also anything unvoiced. Some languages like Georgian, Kwak’wala, and Nuxalk have lots of these things.

What you actually want to do for speech recognition really depends on what speech recognition task you’re doing. For transcription we talk about segmentation (if there is only one speaker) or diarization (if there are multiple speakers) which is a fancy word for “who said what when”. For dialogue systems we usually talk about barge-in and endpointing, i.e. detecting when the user is interrupting the system, and when the user has stopped speaking and is expecting the system to say something. And of course there is the famous “wake-up word” task where we specifically want to only detect speech that starts with a specific word or phrase.

Segmentation, diarization and endpointing are not the same thing, and none of them is the same thing as voice activity detection, though all of them are often built on top of a voice activity detector. Notably, none of them belong inside the decoder, which by its design can only process discrete “utterances”. The API for pocketsphinx-python, which provides the wrapper classes AudioFile for segmentation and LiveSpeech for endpointing, is basically the right approach, and something like it will be available in both C and Python for the 5.0 release, but with the flexibility for the user to implement their own approach if desired.

Danish is actually speech, and not that hard to learn, especially compared to English. ↩

Aug 16, 2022

Why I Removed pocketsphinx_continuous And What You Can Do About It, Part One

Executive Summary: Audio input is complicated and a speech recognition engine, particularly a small one, should not be in the business of handling it, particularly when sox can do it for you.

For most of recorded history, PocketSphinx installed a small program called pocketsphinx_continuous which, among other things, would record audio from the microphone and do speech recognition on it. The badly formatted comment at the top of the code explained exactly what it was:

* This is a simple example of pocketsphinx application that uses continuous listening
* with silence filtering to automatically segment a continuous stream of audio input
* into utterances that are then decoded.

Unfortunately, thought it was always intended as example code, because it was installed as a program you can run, people (and I am one) considered it to be the official command-line tool for PocketSphinx and tried to build “applications” around it. This usually ended in frustration. Why?

Time For Some Audio Theory!

Leaving aside the debatable usefulness of live-mode speech recognition for tasks other hands-free automotive control (I don’t care how big the touchscreen is in your T*sla, I don’t want you touching it), it is nonetheless an audio application. But it is very much not like your typical audio application.

If you, as a speech developer/user/ordinary human being, try to read the documentation for a typical audio API you are likely to be deeply confused. You do not care about latency. You do not want to create a processing graph with multiplex reverb units chained into a multi-threaded non-blocking pipeline of HM-2s. You just want to get a stream of PCM data, preferably 16kHz and 16-bit, from the microphone. How the h*ck do you do that? The documentation will not help you, because the API will not help you either. It is not written for you, because audio hardware and software is not designed for you.

In the beginning, audio hardware on PCs existed for one reason: to play games. Later on, it was repurposed for recording and making music. Both of these applications have in common a single-minded focus on minimizing latency. When you jump on the boss monster’s head, it needs to go “splat” right now and not 100ms later. When you punch in the bass track, same thing (though I hope your bass doesn’t sound like the boss monster’s head exploding). As a consequence, audio APIs do singularly un-useful things like making you run your processing code in a separate real-time thread and only ever feeding it 128 samples at a time. (Speech processing uses frames that are generally at least 400 samples long)

By contrast, while some speech applications like spoken dialogue care deeply about latency, and while it’s obviously good to have speech recognition that runs faster than real-time and gives incremental recognition results, by far the largest contributor to latency in these systems is endpointing - i.e. deciding when the user has finally stopped speaking, and this latency is at least two orders of magnitude greater than what game and music developers are worried about. Also, endpointing (and speech processing in general) is a language processing rather than an audio processing task.

All this is to say that handling audio input in a speech recognition engine is super annoying and should be avoided if possible, i.e. handled by some external library or program or other part of the application code. Ideally this external thing should, as noted above, just provide a nicely buffered stream of plain old data in the optimal format for the recognizer.

Luckily, there is a program like that, and it is so perfect that development on it largely ceased in 2015. Yes, I am talking about good old sox, the Sound eXchanger. Think about it, would you rather:

Create a device context (in a platform-specific way)
Create a procesing thread (in a very platform-specific way)
Create a message queue or ring buffer sufficiently large to handle possibly slower than real-time processing (not knowing ahead of time how large this will be)
Write code to mix down the input, convert it to integers, and (maybe, though you don’t have to) resample it to 16kHz
Spin up your processing thread possibly with real-time priority
Then, maybe, recognize some speech

Or:

popen("sox -q -r 16000 -c 1 -b 16 -e signed-integer -d -t raw -");

And get some data with fread()? From the point of view of someone who has stepped up to minimally restart maintenance of what is essentially abandonware, it’s pretty clear which one I would prefer to support.

So pocketsphinx_continuous (add .exe if you like) won’t be coming back. At the moment, in Python, you can just do this:

from pocketsphinx5 import Decoder
import subprocess
import os
MODELDIR = os.path.join(os.path.dirname(__file__), "model")
BUFSIZE = 1024

decoder = Decoder(
    hmm=os.path.join(MODELDIR, "en-us/en-us"),
    lm=os.path.join(MODELDIR, "en-us/en-us.lm.bin"),
    dict=os.path.join(MODELDIR, "en-us/cmudict-en-us.dict"),
)
sample_rate = int(decoder.config["samprate"])
soxcmd = f"sox -q -r {sample_rate} -c 1 -b 16 -e signed-integer -d -t raw -"
with subprocess.Popen(soxcmd.split(), stdout=subprocess.PIPE) as sox:
    decoder.start_utt()
    try:
        while True:
            buf = sox.stdout.read(BUFSIZE)
            if len(buf) == 0:
                break
            decoder.process_raw(buf)
    except KeyboardInterrupt:
        pass
    finally:
        decoder.end_utt()
    print(decoder.hyp().hypstr)

Or in C (see why we prefer to use Python? and no, C++ is NOT BETTER):

#include <pocketsphinx.h>
#include <signal.h>

static int global_done = 0;
void
catch_sig(int signum)
{
    global_done = 1;
}

int
main(int argc, char *argv[])
{
    ps_decoder_t *decoder;
    cmd_ln_t *config;
    char *soxcmd;
    FILE *sox;
    #define BUFLEN 1024
    short buf[BUFLEN];
    size_t len;

    if ((config = cmd_ln_parse_r(NULL, ps_args(),
                                 argc, argv, TRUE)) == NULL)
        E_FATAL("Command line parse failed\n");
    ps_default_search_args(config);
    if ((decoder = ps_init(config)) == NULL)
        E_FATAL("PocketSphinx decoder init failed\n");
    #define SOXCMD "sox -q -r %d -c 1 -b 16 -e signed-integer -d -t raw -"
    len = snprintf(NULL, 0, SOXCMD,
                   (int)cmd_ln_float_r(config, "-samprate"));
    if ((soxcmd = malloc(len + 1)) == NULL)
        E_FATAL_SYSTEM("Failed to allocate string");
    if (signal(SIGINT, catch_sig) == SIG_ERR)
        E_FATAL_SYSTEM("Failed to set SIGINT handler");
    if (snprintf(soxcmd, len + 1, SOXCMD,
                 (int)cmd_ln_float_r(config, "-samprate")) != len)
        E_FATAL_SYSTEM("snprintf() failed");
    if ((sox = popen(soxcmd, "r")) == NULL)
        E_FATAL_SYSTEM("Failed to popen(%s)", soxcmd);
    free(soxcmd);
    ps_start_utt(decoder);
    while (!global_done) {
        if ((len = fread(buf, sizeof(buf[0]), BUFLEN, sox)) == 0)
            break;
        if (ps_process_raw(decoder, buf, len, FALSE, FALSE) < 0)
            E_FATAL("ps_process_raw() failed\n");
    }
    ps_end_utt(decoder);
    if (pclose(sox) < 0)
        E_ERROR_SYSTEM("Failed to pclose(sox)");
    if (ps_get_hyp(decoder, NULL) != NULL)
        printf("%s\n", ps_get_hyp(decoder, NULL));
    cmd_ln_free_r(config);
    ps_free(decoder);
        
    return 0;
}

What will come back for the release is a program which reads audio from standard input and outputs recognition results in JSON, so you can do useful things with them in another program. This program will probably be called pocketsphinx. It will also do voice activity detection, which will be the subject of the next in this series of blog posts. Obviously, if you want to build a real application, you’ll have to do something more sophisticated, probably a server, and if I were you I would definitely write it in Python, though Node.js is also a good choice and, hopefully, we will support it again for the release.

Stay tuned!

Newer

Older

Page 3 of 37