PocketSphinx 5.0.0rc5
A small speech recognizer
PocketSphinx API Documentation
Author: David Huggins-Daines <dhdaines@gmail.com>
Version: 5.0.0rc5
Date: September 28, 2022

Introduction

This is the documentation for the PocketSphinx speech recognition engine. The main API calls are documented in <pocketsphinx.h>.

Installation

To install from source, you will need a C compiler and a recent version of CMake. If you wish to use an integrated development environment, Visual Studio Code will automate most of this process for you once you have installed C++ and CMake support, as described at https://code.visualstudio.com/docs/languages/cpp.

Python module install

The easiest way to program PocketSphinx is with the Python module. This can be installed in a VirtualEnv or Conda environment without affecting the rest of your system. For example, from the top-level source directory:

python3 -m venv ~/ve_pocketsphinx
. ~/ve_pocketsphinx/bin/activate
pip install .

There is no need to create a separate build directory as pip will do this for you.

Unix-like systems

From the top-level source directory, use CMake to generate a build directory:

cmake -S . -B build

Now you can compile and run the tests, and install the code:

cmake --build build
cmake --build build --target check
cmake --build build --target install

By default, CMake will try to install things in /usr/local, which you might not have write access to. If you want to install somewhere else, you need to set CMAKE_INSTALL_PREFIX when running cmake for the first time, for example:

cmake -S . -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local

In this case you may also need to set the LD_LIBRARY_PATH environment variable so that the PocketSphinx library can be found:

export LD_LIBRARY_PATH=$HOME/.local/lib

Windows

On Windows, the process is similar, but you will need to tell CMake which build tool you are using with the -G option, and there are many to choose from. The build is known to work with NMake, but it is easiest just to use Visual Studio Code, which should automatically detect and offer to run the build when you add the source directory to your list of directories. Once built, you will find the DLL and EXE files in build\Debug or build\Release, depending on your build type. If the EXE files do not run, ensure that pocketsphinx.dll is in the same directory as the executables.
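For example, to configure an NMake build from a Visual Studio command prompt (the generator name here is just one possibility; run cmake --help to list the generators available on your system):

cmake -S . -B build -G "NMake Makefiles"
cmake --build build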

Using the Library

The Python interface is documented at http://pocketsphinx.readthedocs.io/, where you will find a quick start guide as well as a full API reference.
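For C programs, the usual call sequence is: create a configuration, fill in the default model files, initialize a decoder, feed it 16kHz, 16-bit, mono PCM, and read back the hypothesis. The following is a minimal sketch of that sequence, with error handling abbreviated; it assumes the default model was installed along with the library:

#include <pocketsphinx.h>
#include <stdio.h>

int
main(void)
{
    ps_decoder_t *decoder;
    ps_config_t *config;
    short buf[512];
    const char *hyp;
    size_t len;

    /* Create a configuration and fill in the default model files. */
    config = ps_config_init(NULL);
    ps_default_search_args(config);
    if ((decoder = ps_init(config)) == NULL) {
        fprintf(stderr, "failed to initialize the decoder\n");
        return 1;
    }
    /* Decode 16kHz, 16-bit, mono PCM read from standard input. */
    ps_start_utt(decoder);
    while ((len = fread(buf, sizeof(buf[0]), 512, stdin)) > 0)
        ps_process_raw(decoder, buf, len, FALSE, FALSE);
    ps_end_utt(decoder);
    if ((hyp = ps_get_hyp(decoder, NULL)) != NULL)
        printf("%s\n", hyp);
    ps_free(decoder);
    ps_config_free(config);
    return 0;
}

Assuming the library is installed where pkg-config can find it, something like this should compile it:

gcc -o simple simple.c $(pkg-config --cflags --libs pocketsphinx)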

Frequently Asked Questions

Why doesn't my audio device work?

Because it's an audio device. They don't work, at least for things other than making annoying "beep boop" noises and playing Video Games. More generally, I cannot solve this problem for you, because every single computer, operating system, sound card, microphone, phase of the moon, and day of the week is different when it comes to recording audio. That's why I suggest you use SoX, because (a) it usually works, and (b) whoever wrote it seems to have retired long ago, so you can't bother them.
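For what it's worth, the usual way to use SoX with PocketSphinx is to pipe audio from the default input device into the command-line recognizer. Assuming the pocketsphinx program from this distribution is on your PATH, something along these lines should work:

sox -d $(pocketsphinx soxflags) | pocketsphinx -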

The recognized text is wrong.

That's not a question! But since this isn't Jeopardy, and my name is not Watson, I'll try to answer it anyway. Be aware that the answer depends on many things, first and foremost what you mean by "wrong".

If it sounds the same, e.g. "wreck a nice beach" when you said "recognize speech" then the issue is that the language model is not appropriate for the task, domain, dialect, or whatever it is you're trying to recognize. You may wish to consider writing a JSGF grammar and using it instead of the default language model (with the -jsgf flag). Or you can get an N-best list or word lattice and rescore it with a better language model, such as a recurrent neural network or a human being.
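For illustration, a toy JSGF grammar for a small command vocabulary (the grammar and rule names here are made up) might look like this:

#JSGF V1.0;
grammar commands;
public <command> = <action> [<direction>];
<action> = go | turn | stop;
<direction> = left | right | forward | backward;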

If it is total nonsense, or if it is just blank, or if it's the same word repeated, e.g. "a a a a a a", then there is likely a problem with the input audio. The sampling rate could be wrong, or even if it's correct, you may have narrow-band data. Try looking at the spectrogram (Audacity can show you this) to see if it looks empty or flat below the frequency given by the -upperf flag. Alternatively, it could just be very noisy. In particular, if the noise consists of other people talking, automatic speech recognition will nearly always fail.
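If a sampling-rate mismatch turns out to be the problem, SoX can convert your audio to the 16kHz, 16-bit, mono format the default model expects, for example (file names here are placeholders, and note that upsampling narrow-band audio will not restore the missing high frequencies):

sox input.wav -r 16000 -c 1 -b 16 output.wav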

Why don't you support (pick one or more: WFST, fMLLR, SAT, DNN, CTC, LAS, CNN, RNN, LSTM, etc.)?

Not because there's anything wrong with those things (except LAS, which is kind of a dumb idea) but simply because PocketSphinx does not do them, or anything like them, and there is no point in adding them to it when other systems exist. Many of them are also heavily dependent on distasteful and wasteful platforms like C++, CUDA, TensorFlow, PyTorch, and so on.

Acknowledgements

PocketSphinx was originally released by David Huggins-Daines, but is largely based on the previous Sphinx-II and Sphinx-III systems, developed by a large number of contributors at Carnegie Mellon University, and released as open source under a BSD-like license thanks to Kevin Lenzo. For some time, it was maintained by Nickolay Shmyrev and others at Alpha Cephei, Inc. See the AUTHORS file for a list of contributors.