Executive Summary: Alas, poor SphinxBase!
Yes, it’s that time of week again, time for another release candidate. You can also download it from PyPI.
In the spirit of total elimination, the major change here is the
disappearance of the <sphinxbase/*.h> headers. Some of them have
been relocated, so if you include <pocketsphinx.h> you can still do
useful things like load and save language models and parse JSGF. Oh,
and also do speech recognition, maybe.
There are a number of other things you can’t do, because the “utility”
headers were mostly unsuitable for public consumption. Really they
were a bit embarrassing, at least in 2022. A major rationale for
removing SphinxBase from circulation is that it just isn’t a good
foundation for you to build “applications” or anything else really.
Like, there are at least a dozen better implementations of pretty much
everything in there, and you should really use them. Command-line
parsing, for instance, should not be done with <cmd_ln.h>, so it
has been hidden from you to discourage you from trying that.
Which brings us to the other major breaking change here. Configuration is not done by parsing (possibly imaginary) command lines anymore. You can simply create a configuration and set values in it, e.g.:
ps_config_t *config = ps_config_init(NULL);
ps_config_set_str(config, "hmm", "/path/to/model");
ps_config_set_int(config, "samprate", 11025);
You can also parse JSON, or even a sort of degenerate “JSON”:
/* Strict JSON */
ps_config_t *config = ps_config_parse_json(
    NULL, "{\"hmm\": \"/path/to/model\"}");
/* Alternatively, the relaxed form: no braces, no quotes */
ps_config_t *config = ps_config_parse_json(
    NULL, "hmm: /path/to/model, samprate: 11025");
The configuration can be serialized to (actual) JSON as well:
const char *jconf = ps_config_serialize_json(config);
Creating a ps_config_t sets all of the default values, but does not
set the default model, so you still need to use
ps_default_search_args() for that. Also note that
ps_expand_model_config() no longer creates magical underscore
versions of the config parameters (e.g. "_hmm", "_dict", etc) but
simply overwrites the existing values.
Python code is entirely unaffected by these changes (though it has also acquired the JSON functions mentioned above), so you should maybe use Python instead of hurting yourself with the C API.
Pull requests and bug reports and such are welcome via https://github.com/cmusphinx/pocketsphinx.
Executive Summary: Try the new Python module (please).
Hot on the heels of the last one, there is another release candidate. You can also download it from PyPI.
There isn’t much to announce except that the pocketsphinx5 Python
package doesn’t exist anymore. That’s right, the Python interface is
now called just plain pocketsphinx. It should install properly on
Windows and Mac OS X now, as well.
Pull requests and bug reports and such are welcome via https://github.com/cmusphinx/pocketsphinx.
Executive Summary: This is a release candidate and the API is not yet stable so please don’t package it.
PocketSphinx now has a release candidate. You can also download it from PyPI.
Why release candidate 2? Because there was a release candidate 1, but it had various problems regarding installation, so I made another one. This one is relatively complete, but the documentation isn’t good, and it hasn’t been fully tested on Windows or Mac OS X. If you are courageous, you can try that. Installation should be a matter of:
cmake -S . -B build
cmake --build build
sudo cmake --build build --target install
The most important change versus 5prealpha is, as mentioned
previously, the disappearance of pocketsphinx_continuous and the
“live” API in general, which has been replaced with
<pocketsphinx/endpointer.h>.
The API is quite simple but it requires you to feed it data in precise
quantities. The best way to do this is to ensure that you can read
data from a file stream, as shown in the examples live.c and live.py.
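To make “precise quantities” a bit more concrete, here is a minimal sketch in plain Python (standard library only, it does not call PocketSphinx) of pulling fixed-size frames out of a stream. The name `read_exact_frames` and the 960-byte frame size are assumptions for illustration; in real code the required size should come from the endpointer, as in the linked examples.

```python
import io


def read_exact_frames(stream, frame_bytes):
    """Yield chunks of exactly frame_bytes bytes from a binary stream.

    Pipes and sockets may return fewer bytes than requested, so keep
    reading until the frame is full. A final, shorter chunk is yielded
    at end of input so the caller can decide how to handle it.
    """
    while True:
        buf = bytearray()
        while len(buf) < frame_bytes:
            chunk = stream.read(frame_bytes - len(buf))
            if not chunk:  # end of input
                break
            buf.extend(chunk)
        if not buf:
            return
        yield bytes(buf)
        if len(buf) < frame_bytes:
            return


# Hypothetical frame size: 30 ms of 16 kHz, 16-bit mono audio.
FRAME_BYTES = 960
audio = io.BytesIO(bytes(2500))  # stand-in for a file or a pipe from sox
frame_lengths = [len(f) for f in read_exact_frames(audio, FRAME_BYTES)]
```

Here `frame_lengths` ends up as two full frames followed by one short trailing chunk, which is the case your code has to handle explicitly at end of input.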
For command-line usage there is a very Unixy program called
pocketsphinx now, which nonetheless doesn’t have a man page yet
(Update: it has a man page). Use it like this:
# From microphone
sox -d $(pocketsphinx soxflags) | pocketsphinx
# From file
sox audio.wav $(pocketsphinx soxflags) | pocketsphinx
There are no innovations with respect to modeling, algorithms, etc, and there will never be. But I am trying to make this into a decent piece of software nonetheless. All documentation, bug reports (that are actually bug reports and not just ‘how do i run the program’) and such are welcome via https://github.com/cmusphinx/pocketsphinx.
Executive Summary: Voice Activity Detection is necessary but not sufficient for endpointing and wake-up word detection, which are different and more complex problems. One size does not fit all. For this reason it is better to do it explicitly and externally.
“One day I will go and live in Theory, because in Theory everything goes well.”
– Pierre Desproges
Between the 0.8 and 5prealpha releases, PocketSphinx was modified to
do voice activity detection by default in the feature extraction
front-end, which caused unexpected behaviour, particularly when doing
batch mode recognition. Specifically, it caused the timings output by
the decoder in the logs and hypseg file to have no relation to the
input stream, as the audio classified “non-speech” was removed from
its input. Likewise, sphinx_fe would produce feature files which
did not at all correspond to the length of the input (and could even
be empty).
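A toy illustration of the timing problem, in plain Python. This is not PocketSphinx code: the 10 ms frame shift, the speech/non-speech flags and the helper name `kept_frame_indices` are all made up for the example. Once non-speech frames are dropped, the decoder's frame numbers only make sense if you also keep a map back to the original frame indices.

```python
FRAME_SHIFT_MS = 10  # assumed frame shift, in milliseconds


def kept_frame_indices(is_speech):
    """Return the original indices of the frames that survive silence removal."""
    return [i for i, speech in enumerate(is_speech) if speech]


# Made-up input: 0.5 s silence, 0.3 s speech, 0.2 s silence, 0.1 s speech.
flags = [False] * 50 + [True] * 30 + [False] * 20 + [True] * 10
kept = kept_frame_indices(flags)

# The decoder only sees len(kept) frames, numbered from zero.
decoder_frame = 30  # first frame of the second burst, as the decoder counts it

naive_time_ms = decoder_frame * FRAME_SHIFT_MS       # what ends up in the logs
true_time_ms = kept[decoder_frame] * FRAME_SHIFT_MS  # position in the input
```

In this example the decoder would report the second burst at 300 ms, while it actually starts 1000 ms into the input.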
When users noticed this, they were instructed to use the continuous listening API, which (in Theory) reconstructed the original timings. There is a certain logic to this if:
Unfortunately, PocketSphinx is not actually very good at converting
speech to text, so people were using it for other things, like
pronunciation evaluation, force-alignment, or just plain old acoustic
feature extraction using sphinx_fe, where timings really are quite
important, and where batch-mode recognition is easier and more
accurate. Making silence removal the default behaviour was therefore
a bad idea, and hiding it from the user behind two command-line
options, one of which depended on the other, was a bad API, so I
removed it.
But why did we put voice activity detection in the front-end in the first place? Time For Some (more) Audio Theory!
Although we, as humans, have a really good idea of what is and isn’t speech (unless we speak Danish), at a purely acoustic level, it is not nearly as obvious. There is a good, if slightly dated, summary of the problem on this website. In Theory, the ideal way to recognize what is and isn’t speech is just to do speech recognition, since by definition a speech recognizer has a good model of what is speech, which means that we can simply add a model of what isn’t and, in Theory, get the best possible performance in an “end-to-end” system. And this is an active research area, for example.
There are fairly obvious drawbacks to doing this, primarily that
speech recognition is computationally quite expensive, secondarily
that learning all the possible types of “not speech” in various
acoustic environments is not at all easy to do. So in practice what
we do, simply put, is to create a model of “not speech”, which we call
“noise”, and assume that it is added to the speech signal which we are
trying to detect. Then, to detect speech, we can subtract out the
noise, and if there is anything left, call this “speech”. And this is
exactly what PocketSphinx 5prealpha did, at least if you enabled
both the -remove_noise and -remove_silence options.
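As a rough sketch of the idea (and emphatically not the PocketSphinx implementation), here is a toy detector in plain Python that works on per-frame energies. It assumes the first few frames contain only noise, uses their mean as the noise model, subtracts it, and calls whatever clearly remains “speech”. The function name `simple_vad` and the `noise_frames` and `margin` parameters are invented for this example.

```python
def simple_vad(energies, noise_frames=10, margin=2.0):
    """Return one speech/non-speech decision per frame energy.

    The noise level is estimated as the mean energy of the first
    noise_frames frames, which are assumed to contain no speech.
    A frame counts as speech if what is left after subtracting the
    noise estimate exceeds margin times that estimate.
    """
    if not energies:
        return []
    head = energies[:noise_frames]
    noise = sum(head) / len(head)
    return [(e - noise) > margin * noise for e in energies]


# Made-up energies: quiet, a loud burst, then quiet again.
energies = [1.0] * 10 + [10.0] * 5 + [1.0] * 5
decisions = simple_vad(energies)
```

With this input only the five loud frames are flagged as speech. A real detector would also track the noise level over time rather than fixing it from the start of the input.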
This is a reasonably simple and effective way to do voice activity detection. So why not do it?
First, because of the problem with the implementation mentioned at the top of this post, which is that it breaks the contract of frames of speech in the input corresponding to timestamps in the output. This is not insurmountable but, well, we didn’t surmount it.
Second, because it requires you to use the built-in noise subtraction in order to get voice activity detection, and you might not want to do that, because you have some much more difficult type of noise to deal with.
Third, because the feature extraction code in PocketSphinx is badly written (I can say this because I wrote it) and not easy to integrate VAD into, so… there were bugs.
Fourth, because while PocketSphinx (and other speech recognizers) use overlapping, windowed frames of audio, this is unnecessary and inefficient for doing voice activity detection. For speech segments, the overhead of a heavily-optimized VAD like the WebRTC one is minimal, and in non-speech segments we save a lot of computation by not doing windowing and MFCC computation.
And finally, because voice activity detection, while extremely useful for speech compression, is less useful for speech recognition.
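The fourth point can be made concrete with a little arithmetic. The sketch below is plain Python with assumed numbers: a 25 ms window with a 10 ms shift for the recognizer front-end (typical values, not read from any PocketSphinx configuration) and non-overlapping 30 ms frames for the VAD, which is one of the frame lengths the WebRTC VAD accepts. The helper names are hypothetical.

```python
def overlapping_frames(n_samples, window, shift):
    """Number of full windows of `window` samples, advancing by `shift`."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift


def disjoint_frames(n_samples, frame):
    """Number of full, non-overlapping frames of `frame` samples."""
    return n_samples // frame


SAMPLE_RATE = 16000
n = SAMPLE_RATE  # one second of audio

window, shift = 400, 160  # 25 ms window, 10 ms shift (assumed)
recognizer_frames = overlapping_frames(n, window, shift)
recognizer_samples = recognizer_frames * window  # samples windowed in total

vad_frame = 480  # 30 ms, non-overlapping
vad_frames = disjoint_frames(n, vad_frame)
vad_samples = vad_frames * vad_frame
```

Under these assumptions the front-end windows 98 frames (39200 samples in total) per second of audio, while the VAD looks at 33 frames (15840 samples), and that is before counting the cost of the MFCC computation itself.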
A little like we saw previously with respect to audio hardware and APIs, the reason VAD was invented was not to do speech recognition, but to increase the capacity of telephone networks. Fundamentally, it simply tells you if there is speech (which should be transmitted) or not-speech (which can be omitted) in a short frame of audio. This isn’t ideal, because:
What you actually want to do for speech recognition really depends on what speech recognition task you’re doing. For transcription we talk about segmentation (if there is only one speaker) or diarization (if there are multiple speakers) which is a fancy word for “who said what when”. For dialogue systems we usually talk about barge-in and endpointing, i.e. detecting when the user is interrupting the system, and when the user has stopped speaking and is expecting the system to say something. And of course there is the famous “wake-up word” task where we specifically want to only detect speech that starts with a specific word or phrase.
Segmentation, diarization and endpointing are not the same thing, and
none of them is the same thing as voice activity detection, though all
of them are often built on top of a voice activity detector. Notably,
none of them belong inside the decoder, which by its design can only
process discrete “utterances”. The API for pocketsphinx-python, which
provides the wrapper classes AudioFile for segmentation and LiveSpeech
for endpointing, is basically the right approach, and something like
it will be available in both C and Python for the 5.0 release, but
with the flexibility for the user to implement their own approach if
desired.
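To show what “built on top of a voice activity detector” can mean in practice, here is a minimal endpointing sketch in plain Python. It is not the PocketSphinx endpointer: the function name `find_segments` and the `start_frames` and `end_frames` thresholds are assumptions for illustration. It takes per-frame VAD decisions and only opens a segment after a few consecutive speech frames, and only closes it after a longer run of non-speech, so that short blips and short pauses are ignored.

```python
def find_segments(vad_flags, start_frames=3, end_frames=5):
    """Turn per-frame VAD decisions into (start, end) frame ranges.

    A segment starts once start_frames consecutive frames are speech
    and ends once end_frames consecutive frames are non-speech.
    The end index is exclusive.
    """
    segments = []
    in_speech = False
    run = 0  # length of the current run that could flip the state
    start = 0
    for i, speech in enumerate(vad_flags):
        if not in_speech:
            run = run + 1 if speech else 0
            if run >= start_frames:
                in_speech = True
                start = i - run + 1  # back up to where the run began
                run = 0
        else:
            run = run + 1 if not speech else 0
            if run >= end_frames:
                segments.append((start, i - run + 1))
                in_speech = False
                run = 0
    if in_speech:
        # Input ended mid-utterance: close the segment, trimming any
        # trailing non-speech frames that were still being counted.
        segments.append((start, len(vad_flags) - run))
    return segments


# Made-up decisions: speech with a 2-frame pause, then a 2-frame blip.
flags = ([False] * 5 + [True] * 10 + [False] * 2 + [True] * 4
         + [False] * 8 + [True] * 2 + [False] * 3)
segments = find_segments(flags)
```

Here the two-frame pause is bridged and the two-frame blip is dropped, leaving a single segment covering frames 5 to 21. A wake-up word detector or a diarizer would need quite different logic on top of the same VAD decisions, which is the argument for keeping all of this outside the decoder.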