PocketSphinx
5.0.0
A small speech recognizer
This is the documentation for the PocketSphinx speech recognition engine. The main API calls are documented in ps_decoder_t and ps_config_t. The organization of this document is not optimal due to the limitations of Doxygen, so if you know of a better tool for documenting object-oriented interfaces in C, please let me know.
To install from source, you will need a C compiler and a recent version of CMake. If you wish to use an integrated development environment, Visual Studio Code will automate most of this process for you once you have installed C++ and CMake support as described at https://code.visualstudio.com/docs/languages/cpp
The easiest way to program PocketSphinx is with the Python module. See http://pocketsphinx.readthedocs.io/ for installation and usage instructions.
From the top-level source directory, use CMake to generate a build directory:
cmake -S . -B build
Now you can compile and run the tests, and install the code:
cmake --build build
cmake --build build --target check
cmake --build build --target install
By default CMake will try to install things in /usr/local, which you might not have access to. If you want to install somewhere else, you need to set CMAKE_INSTALL_PREFIX when running cmake for the first time, for example:
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local
On Windows, the process is similar, but you will need to tell CMake which build tool you are using with the -G option, and there are many of them. The build is known to work with nmake, but it is easiest just to use Visual Studio Code, which should automatically detect and offer to run the build when you add the source directory to your list of directories. Once built, you will find the EXE files in build\Debug or build\Release, depending on your build type.
By default, PocketSphinx does not build shared libraries, as there are not very many executables, and the library is quite smol. If you insist on building them, you can add BUILD_SHARED_LIBS=ON to the CMake configuration. This is done either in the CMake GUI, in Visual Studio Code, or with the -D option to the first CMake command-line above, e.g.:
cmake -S . -B build -DBUILD_SHARED_LIBS=ON
GStreamer support is not built by default, but can be enabled with BUILD_GSTREAMER=ON.
PocketSphinx uses a mixture of fixed- and floating-point computation by default, but can be configured to use fixed-point (nearly) exclusively with FIXED_POINT=ON.
Minimally, to do speech recognition, you must first create a configuration, using ps_config_t and its associated functions. This configuration is then passed to ps_init() to initialize the decoder, which is returned as a ps_decoder_t. Note that you must ultimately release the configuration with ps_config_free() to avoid memory leaks.
At this point, you can start an "utterance" (a section of speech you wish to recognize) with ps_start_utt() and pass audio data to the decoder with ps_process_raw(). When finished, call ps_end_utt() to finalize recognition. The result can then be obtained with ps_get_hyp(). To get a detailed word segmentation, use ps_seg_iter(). To get the N-best results, use ps_nbest().
When you no longer need the decoder, release its memory with ps_free().
A concrete example can be found in simple.c.
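If you just want the overall shape, here is a condensed sketch of that sequence. It assumes raw 16-bit, 16kHz, mono input; the file name audio.raw is a placeholder, and error checking is omitted for brevity:

#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    ps_config_t *config = ps_config_init(NULL);
    ps_decoder_t *decoder;
    const char *hyp;
    short buf[2048];
    size_t nread;
    FILE *fh;

    /* Fill in the default model files if none are configured. */
    ps_default_search_args(config);
    decoder = ps_init(config);

    /* Feed the audio to the decoder in chunks. */
    fh = fopen("audio.raw", "rb");
    ps_start_utt(decoder);
    while ((nread = fread(buf, sizeof(buf[0]), 2048, fh)) > 0)
        ps_process_raw(decoder, buf, nread, FALSE, FALSE);
    ps_end_utt(decoder);

    /* Print the best hypothesis, if there is one. */
    hyp = ps_get_hyp(decoder, NULL);
    if (hyp != NULL)
        printf("%s\n", hyp);

    fclose(fh);
    ps_free(decoder);
    ps_config_free(config);
    return 0;
}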
You may, however, wish to do more interesting things like segmenting and recognizing speech from an audio stream. As described below, PocketSphinx will not handle the details of microphone input for you, because doing this in a reliable and portable way is outside the scope of a speech recognizer. If you have sox, you can use the method shown in live.c.
Some APIs were intentionally broken by the 5.0.0 release. The most likely culprit here is the configuration API, where the old "options", which started with a -, are now "parameters", which do not, and where a cmd_ln_t is now a ps_config_t. There is no backward compatibility; you have to change your code manually. This is straightforward for the most part. For example, instead of writing:
cmdln = cmd_ln_init(NULL, ps_args(), TRUE, "-samprate", "16000", NULL);
cmd_ln_set_int32_r(cmdln, "-maxwpf", 40);
You should write:
config = ps_config_init(NULL);
ps_config_set_int(config, "samprate", 16000);
ps_config_set_int(config, "maxwpf", 40);
Another likely suspect is the search module API, where the function names have been changed to be more intuitive. Wherever you had ps_set_search you can use ps_activate_search(); it is the same function. Likewise, anything that was ps_set_* is now ps_add_*, e.g. ps_add_lm(), ps_add_fsg(), ps_add_keyphrase().
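As a quick sketch of the renaming, assuming ps is a ps_decoder_t and lm an ngram_model_t (the search name "mylm" is a placeholder):

/* Before 5.0.0: */
ps_set_lm(ps, "mylm", lm);
ps_set_search(ps, "mylm");

/* With 5.0.0: */
ps_add_lm(ps, "mylm", lm);
ps_activate_search(ps, "mylm");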
In general, you will get "Acoustic model definition is not specified" or "Folder does not contain acoustic model definition" errors if PocketSphinx cannot find a model. If you are trying to use the default model, perhaps you have not installed PocketSphinx. Unfortunately, it is not designed to run "in-place", but you can get around this by setting the POCKETSPHINX_PATH environment variable, e.g.:
cmake --build build
POCKETSPHINX_PATH=$PWD/model build/pocketsphinx single foo.wav
Why doesn't PocketSphinx print anything?
If by this you mean it doesn't spew copious logging output like it used to, you can solve this by passing -loglevel INFO on the command line, setting the loglevel parameter to "INFO", or calling err_set_loglevel() with ERR_INFO.
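As a short sketch of both API routes, assuming config is the ps_config_t you will pass to ps_init():

/* Raise the log level before creating the decoder... */
ps_config_set_str(config, "loglevel", "INFO");
/* ...or change it at any time after that. */
err_set_loglevel(ERR_INFO);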
If you mean that you just don't have any recognition result, you may have forgotten to configure a dictionary. Or see below for other reasons the output could be blank.
Why doesn't my audio device work?
Because it's an audio device. They don't work, at least for things other than making annoying "beep boop" noises and playing Video Games. More generally, I cannot solve this problem for you, because every single computer, operating system, sound card, microphone, phase of the moon, and day of the week is different when it comes to recording audio. That's why I suggest you use SoX, because (a) it usually works, and (b) whoever wrote it seems to have retired long ago, so you can't bother them.
It doesn't work!
That's not a question! But since this isn't Jeopardy, and my name is not Watson, I'll try to answer it anyway. Be aware that the answer depends on many things, first and foremost what you mean by "wrong".
If it sounds the same, e.g. "wreck a nice beach" when you said "recognize speech", then the issue is that the language model is not appropriate for the task, domain, dialect, or whatever it is you're trying to recognize. You may wish to consider writing a JSGF grammar and using it instead of the default language model (with the jsgf parameter). Or you can get an N-best list or word lattice and rescore it with a better language model, such as a recurrent neural network or a human being.
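For instance, assuming config is your ps_config_t and you have written a grammar file (the name digits.gram is a placeholder), a one-line sketch:

/* Use a JSGF grammar instead of the default language model. */
ps_config_set_str(config, "jsgf", "digits.gram");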
If it is total nonsense, or just blank, or the same word repeated, e.g. "a a a a a a", then there is likely a problem with the input audio. The sampling rate could be wrong, or even if it is correct, you may have narrow-band data. Look at the spectrogram (Audacity can show you this) and see if it looks empty or flat below the frequency given by the upperf parameter. Alternatively, it could just be very noisy. In particular, if the noise consists of other people talking, automatic speech recognition will nearly always fail.
Why don't you support WFST, CTC, LAS, or my favorite new technique?
Not because there's anything wrong with those things (except LAS, which is kind of a dumb idea), but simply because PocketSphinx does not do them, or anything like them, and there is no point in adding them to it when other systems exist. Many of them are also heavily dependent on distasteful and wasteful platforms like C++, CUDA, TensorFlow, PyTorch, and so on.
PocketSphinx was originally released by David Huggins-Daines, but is largely based on the earlier Sphinx-II and Sphinx-III systems, developed by a large number of contributors at Carnegie Mellon University and released as open source under a BSD-like license thanks to Kevin Lenzo. For some time, it was maintained by Nickolay Shmyrev and others at Alpha Cephei, Inc. See the AUTHORS file for a list of contributors.