Building an application with PocketSphinx
In this tutorial we will walk through a simple code example in C using PocketSphinx. This corresponds exactly to the live_portaudio.c example in the source code. So, TL;DR, you could just try to compile that:
cmake -G Ninja -S . -B build
cmake --build build --target live_portaudio
Building PocketSphinx
First, obtain the source code, either by downloading a release from the GitHub releases page or by cloning the source with git. For more details, see the download page or the README file.
PocketSphinx uses CMake to manage configuration
and building on multiple platforms. By default, it builds static
libraries and binaries which can simply be used from the source
directory. You may also install it system-wide or in a user
directory. In the case where it is installed, it will use pkg-config to allow you to find library and header directories and names, but this isn’t really necessary, since there is just one header file (<pocketsphinx.h>) and one library (-lpocketsphinx).
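If you do install it and want to use pkg-config, something along these lines should work with the prefix we use below (the lib/pkgconfig location is the conventional CMake install layout, so treat it as an assumption):
PKG_CONFIG_PATH=$HOME/cmusphinx/lib/pkgconfig pkg-config --cflags --libs pocketsphinx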
For the purposes of this tutorial we will install it in a local directory.
Installation on a Unix-like system (including MacOS)
Make sure you have CMake, PortAudio, and a working C compiler installed. On GNU/Linux you can use your package manager, whatever it is. For example on Ubuntu/Debian/etc:
sudo apt install build-essential cmake ninja-build portaudio19-dev
On MacOS this requires, at minimum, installing the “Xcode command-line tools”. If you have installed Homebrew then you have these already. You can then install CMake, either with the official installer or from Homebrew. For PortAudio… I dunno, use Homebrew, I guess:
brew install cmake portaudio
We will assume that you install PocketSphinx in a directory called cmusphinx inside your home directory. It should be as simple as (assuming CMake is installed):
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/cmusphinx
cmake --build build --target install
If you are lucky enough to have Ninja installed, you can make the build many times faster (it takes 1.7 seconds to complete on an Intel processor from 2009): simply add -G Ninja to the first line above.
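Concretely, the first line then becomes:
cmake -S . -B build -G Ninja -DCMAKE_INSTALL_PREFIX=$HOME/cmusphinx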
Windows
Of course, Windows is more complicated in every possible way. There
are many ways to do everything, all of them inconvenient and
frustrating. The path of least resistance is to just install
MSYS2, start the MSYS2 shell (preferably the UCRT64 version, whatever that is), then install CMake, PortAudio, and Ninja along with the compiler:
pacman -S mingw-w64-ucrt-x86_64-gcc cmake ninja mingw-w64-ucrt-x86_64-portaudio
Now build:
cmake -S . -B build -G Ninja -DCMAKE_INSTALL_PREFIX=$HOME/cmusphinx
cmake --build build --target install
Note that because you’re running the MSYS2 version of CMake, you cannot directly use a Windows-y path like $LOCALAPPDATA/cmusphinx or (horror) %LOCALAPPDATA%/cmusphinx as the CMAKE_INSTALL_PREFIX, as it won’t understand the drive letters. You can use cygpath for this, e.g.:
cmake -S . -B build -G Ninja -DCMAKE_INSTALL_PREFIX=$(cygpath $LOCALAPPDATA)/cmusphinx
cmake --build build --target install
Configuration
Now that you have installed PocketSphinx, you must set one little environment variable to make sure that it can find its models. When you ran the first cmake command above, you may have seen a line like this:
MODELDIR="/home/user/cmusphinx/share/pocketsphinx/model"
You’ll need to set POCKETSPHINX_PATH to this directory. In the near future the install target will also tell you about this, and will also create a little script to do it for you, but for now you have to do it manually:
export POCKETSPHINX_PATH=$HOME/cmusphinx/share/pocketsphinx/model
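Alternatively, once you have created the ps_config_t (see the next section), you can point the decoder at a model directly through configuration parameters rather than the environment. A sketch, with an illustrative path and the default en-us model layout assumed:
/* Illustrative: set the acoustic model directory explicitly.
   The en-us/en-us subdirectory is an assumption about the default
   model installation layout. */
ps_config_set_str(config, "hmm",
                  "/home/user/cmusphinx/share/pocketsphinx/model/en-us/en-us");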
Using the PocketSphinx API
Okay, let’s get to the code! As a reminder, you can see the whole thing at https://github.com/cmusphinx/pocketsphinx/blob/master/examples/live_portaudio.c
General Principles
The reference documentation for the API is available at https://cmusphinx.github.io/doc/pocketsphinx/. There are three opaque types (like classes) that we will create and use to configure the recognizer, detect speech segments in the input, and recognize speech:
- ps_config_t, which holds the configuration
- ps_endpointer_t, which detects speech segments in the input
- ps_decoder_t, which recognizes the speech
In general, PocketSphinx types have a function called TYPE_init, which creates an instance of the type, and a function called TYPE_free, which releases an instance of the type. The general rule is that if you create an instance in your code, you will always need to free it, and if you didn’t create it (i.e. it was returned to you by some API function), you shouldn’t free it. The sole exception to this is iterator types like ps_seg_t and ps_alignment_iter_t, which must be “freed” if you stop iterating over them before the end of the list.
That sounds confusing but it makes sense if you think about it! When in doubt, remember: “Memory leaks are quite acceptable in many applications”
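To make the ownership rules concrete, here is a minimal sketch (the early-exit condition is hypothetical):
ps_config_t *config = ps_config_init(NULL);  /* we created these two... */
ps_decoder_t *decoder = ps_init(config);
/* ... run some recognition ... */
ps_seg_t *seg;
for (seg = ps_seg_iter(decoder); seg != NULL; seg = ps_seg_next(seg)) {
    if (0 /* hypothetical early exit */) {
        ps_seg_free(seg);  /* stopped before the end: free the iterator */
        break;
    }
}
ps_free(decoder);                            /* ...so we free them ourselves */
ps_config_free(config);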
Initialization
First we will create the ps_config_t with the set of default arguments, which also includes a default acoustic and language model:
config = ps_config_init(NULL);
ps_default_search_args(config);
The ps_config_t has a bunch of associated functions to get information out of it. We will use one of these in particular to obtain the sampling rate, which we will use to initialize PortAudio later on. (Note that you can also do this the other way around, and use the sampling rate provided by your audio stream to initialize the recognizer, which makes more sense in some cases.)
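For example (the same call appears again below when we open the PortAudio stream):
int sample_rate = ps_config_int(config, "samprate");  /* 16000 by default */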
Skipping the details of initializing PortAudio, we will now initialize the recognizer, which is called a “decoder” for Reasons:
if ((decoder = ps_init(config)) == NULL)
    E_FATAL("PocketSphinx decoder init failed\n");
And the endpointer, which is the component that detects speech in the audio stream. Yes, that’s a lot of zeros. You can see what they mean in the documentation:
if ((ep = ps_endpointer_init(0, 0.0, 0, 0, 0)) == NULL)
    E_FATAL("PocketSphinx endpointer init failed\n");
Endpointing
The endpointer works by consuming a “frame” of audio data and returning a “frame” of speech data, if any was detected. Again, since we’re in C here, it’s important to remember that:
- The input data is owned by the caller
- The output data is owned by the endpointer
What this means is that you can simply allocate a single buffer for input audio and reuse it through your whole application, and that you should never do anything with the frames of speech returned by the endpointer except:
- Pass them directly to the decoder
- Copy them somewhere else if you want to save them
This means, by the way, that you should also never try to share an endpointer or a decoder between multiple threads.
How do you know what to allocate? The endpointer tells you, with ps_endpointer_frame_size:
frame_size = ps_endpointer_frame_size(ep);
if ((frame = malloc(frame_size * sizeof(frame[0]))) == NULL)
    E_FATAL_SYSTEM("Failed to allocate frame");
If you don’t care about compatibility with ancient C compilers, you can simply do this and not worry about freeing anything later on:
int16_t frame[ps_endpointer_frame_size(ep)];
(As an aside, a future version of PocketSphinx will likely drop C89 compatibility entirely, since it is dubiously C89-compliant already.)
It’s important to note that you can only ever pass buffers of this size to the endpointer. With certain low-level audio APIs, this means that you’ll have to do some buffering. Luckily, PortAudio allows you to request the buffer size, which we do when opening the audio stream:
if ((err = Pa_OpenDefaultStream(&stream, 1, 0, paInt16,
                                ps_config_int(config, "samprate"),
                                frame_size, NULL, NULL)) != paNoError)
    E_FATAL("Failed to open PortAudio stream: %s\n",
            Pa_GetErrorText(err));
Now, your application will wait to get some audio buffers, which usually involves either a callback function (yuck) or a nice, simple loop (hooray). Luckily, PortAudio gives us the second option. So, we sit in a loop, which looks like this:
1. Check if we are currently in a speech section with ps_endpointer_in_speech.
2. Wait for an audio buffer.
3. Pass it to the endpointer with ps_endpointer_process. If the result is NULL, go back to step 1.
4. If we weren’t in a speech section before, start recognizing speech.
5. Pass the speech buffer to the recognizer.
6. If we are no longer in a speech section (check with ps_endpointer_in_speech again), stop recognizing speech and get the recognition results.
Processing
Before we can recognize any speech, we need to start an “utterance” by calling ps_start_utt. Then we can pass buffers of audio (of any size, in this case) using ps_process_raw. This has a couple of options which we won’t use here for live-mode recognition, but which may be useful in other cases: you can instruct it to simply buffer the audio without actually doing any recognition (in the case of a very slow computer), or to treat the entire buffer as a single utterance (useful when recognizing entire files at once, as it gives better accuracy).
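For example, a sketch of the whole-file case, assuming buf and n_samples hold an entire pre-recorded utterance (both names are illustrative):
ps_start_utt(decoder);
/* full_utt = TRUE: treat the whole buffer as one utterance */
if (ps_process_raw(decoder, buf, n_samples, FALSE, TRUE) < 0)
    E_FATAL("ps_process_raw() failed\n");
ps_end_utt(decoder);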
Getting Results
Recognition results can be requested using ps_get_hyp at any point between a call to ps_start_utt and ps_end_utt. This function simply returns a string; if you want a word segmentation, you can use ps_seg_iter.
As with speech buffers, you didn’t allocate this string, you shouldn’t free it, and you shouldn’t do anything with it except:
- Print it out, or take some other immediate action.
- Copy it if you wish to save it for later use.
The same warnings about not sharing a ps_decoder_t between threads apply here as well.
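For example, a minimal word-segmentation sketch, run after ps_end_utt (start and end times are reported in frames, 100 per second by default):
ps_seg_t *seg;
for (seg = ps_seg_iter(decoder); seg != NULL; seg = ps_seg_next(seg)) {
    int sf, ef;
    ps_seg_frames(seg, &sf, &ef);  /* start and end frame of this word */
    printf("%s %d %d\n", ps_seg_word(seg), sf, ef);
}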
Code
Concretely, the whole thing looks like this:
while (!global_done) {
    const int16 *speech;
    int prev_in_speech = ps_endpointer_in_speech(ep);
    /* Wait for a frame of audio from PortAudio. */
    if ((err = Pa_ReadStream(stream, frame, frame_size)) != paNoError) {
        E_ERROR("Error in PortAudio read: %s\n",
                Pa_GetErrorText(err));
        break;
    }
    /* Feed it to the endpointer; NULL means no speech was detected. */
    speech = ps_endpointer_process(ep, frame);
    if (speech != NULL) {
        const char *hyp;
        /* Speech just started: begin an utterance. */
        if (!prev_in_speech)
            ps_start_utt(decoder);
        if (ps_process_raw(decoder, speech, frame_size, FALSE, FALSE) < 0)
            E_FATAL("ps_process_raw() failed\n");
        /* Speech just ended: finish the utterance and print the result. */
        if (!ps_endpointer_in_speech(ep)) {
            ps_end_utt(decoder);
            if ((hyp = ps_get_hyp(decoder, NULL)) != NULL) {
                printf("%s\n", hyp);
                fflush(stdout);
            }
        }
    }
}
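The loop above omits the teardown; following the ownership rules from earlier, a sketch of it looks like this:
if ((err = Pa_StopStream(stream)) != paNoError)
    E_ERROR("Failed to stop PortAudio stream: %s\n", Pa_GetErrorText(err));
Pa_Terminate();
free(frame);            /* not needed if you used the VLA variant */
ps_endpointer_free(ep); /* we created all of these, so we free them */
ps_free(decoder);
ps_config_free(config);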
How to compile? Well… the actual example of course can just be built with CMake as noted way up at the top, but for your own code you’ll need to know how to compile and link it. If you installed using the commands above, you’ll find the PocketSphinx headers in $HOME/cmusphinx/include and the libraries in $HOME/cmusphinx/lib. For PortAudio, they’ll be… somewhere, but pkg-config can tell you. So, if your code was in example.c:
cc -o example example.c \
-I$HOME/cmusphinx/include -L$HOME/cmusphinx/lib -lpocketsphinx -lm \
$(pkg-config --static --libs --cflags portaudio-2.0)
Advanced usage
For more advanced uses of the API, please check the API reference.
- For word segmentations, the API provides an iterator object which is used to iterate over the sequence of words. This iterator is an abstract type, with accessors provided to obtain timepoints, scores and, most interestingly, posterior probabilities for each word.
- The confidence of the whole utterance can be accessed with the ps_get_prob function (see the sketch after this list).
- You can access the lattice if needed.
- You can configure multiple searches and switch between them at runtime.
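As a quick sketch of the confidence case: ps_get_prob returns a value in the decoder’s log domain, which logmath_exp converts to a linear probability (note that this requires the bestpath option to be enabled):
int32 prob = ps_get_prob(decoder);
double confidence = logmath_exp(ps_get_logmath(decoder), prob);
printf("confidence: %f\n", confidence);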
Searches
As a developer you can configure several “search” objects with different grammars and language models and switch between them at runtime to provide an interactive experience for the user.
There are multiple possible search modes:
- keyword: efficiently looks for a keyphrase and ignores other speech. The detection threshold is configurable.
- grammar: recognizes speech according to a JSGF grammar. Unlike keyphrase search, grammar search doesn’t ignore words which are not in the grammar, but tries to recognize them.
- ngram/lm: recognizes natural speech with a language model.
- allphone: recognizes phonemes with a phonetic language model.
Each search has a name and can be referenced by that name; names are application-specific. The function ps_set_search activates a search that was previously added by name.
In order to add a search, you need to point to the grammar or language model describing it. The location of the grammar is specific to the application. If only simple recognition is required, it is sufficient to add a single search, or to just configure the required mode using configuration options.
The exact design of a search depends on your application. For example, you might want to listen for an activation keyword first, and once this keyword is recognized, switch to ngram search to recognize the actual command. Once you have recognized the command, you can switch to grammar search to recognize the confirmation, and then switch back to keyword listening mode to wait for another command.
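A sketch of that flow, assuming the search-module functions ps_set_keyphrase and ps_set_jsgf_file from the same API family as ps_set_search (the search names, keyphrase, and grammar file here are all illustrative):
/* Register the searches once, up front. */
ps_set_keyphrase(decoder, "wakeup", "oh mighty computer");
ps_set_jsgf_file(decoder, "commands", "commands.gram");
/* Start by listening for the activation keyword... */
ps_set_search(decoder, "wakeup");
/* ...and once it is detected, switch to the command grammar. */
ps_set_search(decoder, "commands");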