PocketSphinx
5.0.0
A small speech recognizer
This is the documentation for the PocketSphinx speech recognition engine. The main API calls are documented in ps_decoder_t and ps_config_t. The organization of this document is not optimal due to the limitations of Doxygen, so if you know of a better tool for documenting object-oriented interfaces in C, please let me know.
To install from source, you will need a C compiler and a recent version of CMake. If you wish to use an integrated development environment, Visual Studio Code will automate most of this process for you once you have installed C++ and CMake support as described at https://code.visualstudio.com/docs/languages/cpp
The easiest way to program PocketSphinx is with the Python module. See http://pocketsphinx.readthedocs.io/ for installation and usage instructions.
From the top-level source directory, use CMake to generate a build directory:
cmake -S . -B build
Now you can compile and run the tests, and install the code:
cmake --build build
cmake --build build --target check
cmake --build build --target install
By default CMake will try to install things in /usr/local, which you might not have access to. If you want to install somewhere else, you need to set CMAKE_INSTALL_PREFIX when running cmake for the first time, for example:
cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local
On Windows, the process is similar, but you will need to tell CMake which build tool you are using with the -G option, and there are many of them. The build is known to work with nmake, but it is easiest just to use Visual Studio Code, which should automatically detect and offer to run the build when you add the source directory to your list of directories. Once built, you will find the EXE files in build\Debug or build\Release, depending on your build type.
By default, PocketSphinx does not build shared libraries, as there are not very many executables, and the library is quite smol. If you insist on building them, you can add BUILD_SHARED_LIBS=ON to the CMake configuration. This is done either in the CMake GUI, in Visual Studio Code, or with the -D option to the first CMake command-line above, e.g.:
cmake -S . -B build -DBUILD_SHARED_LIBS=ON
GStreamer support is not built by default, but can be enabled with BUILD_GSTREAMER=ON.
PocketSphinx uses a mixture of fixed- and floating-point computation by default, but can be configured to use fixed-point (nearly) exclusively with FIXED_POINT=ON.
Minimally, to do speech recognition, you must first create a configuration, using ps_config_t and its associated functions. This configuration is then passed to ps_init() to initialize the decoder, which is returned as a ps_decoder_t. Note that you must ultimately release the configuration with ps_config_free() to avoid memory leaks.
At this point, you can start an "utterance" (a section of speech you wish to recognize) with ps_start_utt() and pass audio data to the decoder with ps_process_raw(). When finished, call ps_end_utt() to finalize recognition. The result can then be obtained with ps_get_hyp(). To get a detailed word segmentation, use ps_seg_iter(). To get the N-best results, use ps_nbest().
When you no longer need the decoder, release its memory with ps_free().
A concrete example can be found in simple.c.
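If you just want the overall shape, here is a condensed sketch of that sequence. It assumes raw 16-bit, 16kHz, mono input; the file name audio.raw is a placeholder, and error checking is omitted for brevity:

#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    ps_config_t *config = ps_config_init(NULL);
    ps_decoder_t *decoder;
    const char *hyp;
    short buf[2048];
    size_t nread;
    FILE *fh;

    /* Fill in the default model files if none are configured. */
    ps_default_search_args(config);
    decoder = ps_init(config);

    /* Feed the audio to the decoder in chunks. */
    fh = fopen("audio.raw", "rb");
    ps_start_utt(decoder);
    while ((nread = fread(buf, sizeof(buf[0]), 2048, fh)) > 0)
        ps_process_raw(decoder, buf, nread, FALSE, FALSE);
    ps_end_utt(decoder);

    /* Print the best hypothesis, if there is one. */
    hyp = ps_get_hyp(decoder, NULL);
    if (hyp != NULL)
        printf("%s\n", hyp);

    fclose(fh);
    ps_free(decoder);
    ps_config_free(config);
    return 0;
}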
You may, however, wish to do more interesting things like segmenting and recognizing speech from an audio stream. As described below, PocketSphinx will not handle the details of microphone input for you, because doing this in a reliable and portable way is outside the scope of a speech recognizer. If you have sox, you can use the method shown in live.c.
Some APIs were intentionally broken by the 5.0.0 release. The most likely culprit here is the configuration API, where the old "options", which started with a -, are now "parameters", which do not, and where a cmd_ln_t is now a ps_config_t. There is no backward compatibility; you have to change your code manually. This is straightforward for the most part. For example, instead of writing:
cmdln = cmd_ln_init(NULL, ps_args(), TRUE, "-samprate", "16000", NULL);
cmd_ln_set_int32_r(cmdln, "-maxwpf", 40);
You should write:
config = ps_config_init(NULL);
ps_config_set_int(config, "samprate", 16000);
ps_config_set_int(config, "maxwpf", 40);
Another likely suspect is the search module API, where the function names have been changed to be more intuitive. Wherever you had ps_set_search you can use ps_activate_search(); it is the same function. Likewise, anything that was ps_set_* is now ps_add_*, e.g. ps_add_lm(), ps_add_fsg(), ps_add_keyphrase().
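As a quick sketch of the renaming, assuming ps is a ps_decoder_t and lm an ngram_model_t (the search name "mylm" is a placeholder):

/* Before 5.0.0: */
ps_set_lm(ps, "mylm", lm);
ps_set_search(ps, "mylm");

/* With 5.0.0: */
ps_add_lm(ps, "mylm", lm);
ps_activate_search(ps, "mylm");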
In general, you will get "Acoustic model definition is not specified" or "Folder does not contain acoustic model definition" errors if PocketSphinx cannot find a model. If you are trying to use the default model, perhaps you have not installed PocketSphinx. Unfortunately, it is not designed to run "in-place", but you can get around this by setting the POCKETSPHINX_PATH environment variable, e.g.:
cmake --build build
POCKETSPHINX_PATH=$PWD/model build/pocketsphinx single foo.wav
Why doesn't PocketSphinx print anything?
If by this you mean it doesn't spew copious logging output like it used to, you can solve this by passing -loglevel INFO on the command line, setting the loglevel parameter to "INFO", or calling err_set_loglevel() with ERR_INFO.
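As a short sketch of both API routes, assuming config is the ps_config_t you will pass to ps_init():

/* Raise the log level before creating the decoder... */
ps_config_set_str(config, "loglevel", "INFO");
/* ...or change it at any time after that. */
err_set_loglevel(ERR_INFO);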
If you mean that you just don't have any recognition result, you may have forgotten to configure a dictionary. Or see below for other reasons the output could be blank.
Why doesn't my audio device work?
Because it's an audio device. They don't work, at least for things other than making annoying "beep boop" noises and playing Video Games. More generally, I cannot solve this problem for you, because every single computer, operating system, sound card, microphone, phase of the moon, and day of the week is different when it comes to recording audio. That's why I suggest you use SoX, because (a) it usually works, and (b) whoever wrote it seems to have retired long ago, so you can't bother them.
It doesn't work!
That's not a question! But since this isn't Jeopardy, and my name is not Watson, I'll try to answer it anyway. Be aware that the answer depends on many things, first and foremost what you mean by "wrong".
If it sounds the same, e.g. "wreck a nice beach" when you said "recognize speech", then the issue is that the language model is not appropriate for the task, domain, dialect, or whatever it is you're trying to recognize. You may wish to consider writing a JSGF grammar and using it instead of the default language model (with the jsgf parameter). Or you can get an N-best list or word lattice and rescore it with a better language model, such as a recurrent neural network or a human being.
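For instance, assuming config is your ps_config_t and you have written a grammar file (the name digits.gram is a placeholder), a one-line sketch:

/* Use a JSGF grammar instead of the default language model. */
ps_config_set_str(config, "jsgf", "digits.gram");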
If it is total nonsense, or just blank, or the same word repeated, e.g. "a a a a a a", then there is likely a problem with the input audio. The sampling rate could be wrong, or even if it is correct, you may have narrow-band data. Look at the spectrogram (Audacity can show you this) and see if it looks empty or flat below the frequency given by the upperf parameter. Alternatively, it could just be very noisy. In particular, if the noise consists of other people talking, automatic speech recognition will nearly always fail.
Why don't you support WFST, CTC, LAS, or my favorite new technique?
Not because there's anything wrong with those things (except LAS, which is kind of a dumb idea), but simply because PocketSphinx does not do them, or anything like them, and there is no point in adding them to it when other systems exist. Many of them are also heavily dependent on distasteful and wasteful platforms like C++, CUDA, TensorFlow, PyTorch, and so on.
PocketSphinx was originally released by David Huggins-Daines, but is largely based on the earlier Sphinx-II and Sphinx-III systems, developed by a large number of contributors at Carnegie Mellon University and released as open source under a BSD-like license thanks to Kevin Lenzo. For some time, it was maintained by Nickolay Shmyrev and others at Alpha Cephei, Inc. See the AUTHORS file for a list of contributors.