This document contains an overview of SphinxTrain from the perspective of researchers or developers wishing to implement new features or training methods. The process of training is already covered by the Robust Tutorial and the Sphinx Manual, so it will not be covered here except as necessary.
Layout of SphinxTrain code
The SphinxTrain code is organized into a few static libraries which contain most of the “core” functionality, and a large number of tools which do manipulations on acoustic model files. In many cases, the tools mostly just call library functions and don’t themselves contain a lot of code.
Python Modules
The Python modules are located in the ‘python/sphinx’ directory of the source tree. There is a setup.py file in the ‘python’ top-level directory which can be used to install these modules. However, since all modules are written purely in Python, you can also simply set up your sys.path variable to point to the source directory (provided that you have first installed NumPy):
dhuggins@slim:~/Projects/Sphinx/SphinxTrain$ python
Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.path.insert(0, "python")
>>> from sphinx import *
>>>
Some automatically-generated (and incomplete) documentation on the Python modules can be found at http://lima.lti.cs.cmu.edu/pydoc/SphinxTrain/
Header Files
Nearly all of the header files useful in SphinxTrain development are in the ‘include/s3’ directory.
Libraries
The current directory layout of the library portion of the SphinxTrain code is:
src/libs/libcommon
src/libs/libio
src/libs/libs2io
src/libs/libcep_feat
src/libs/libmodinv
src/libs/libmllr
src/libs/libclust
src/libs/librpcc
A description of these libraries follows:
- libcommon: contains utility functions (many of which replicate or reimplement similar functions in SphinxBase)
- libio: contains reading and writing functions for every sort of file type dealt with in SphinxTrain.
- libs2io: contains reading and writing functions (or actually just writing, I think) for Sphinx-II format model files
- libcep_feat: contains dynamic feature computation code. This does the same thing (but with a different API) as libsphinxfeat in SphinxBase.
- libmodinv: contains model inventory (i.e. acoustic model) functions, including Gaussian density and mixture model computation.
- libmllr: contains MLLR adaptation functions (which ought to be in libmodinv, probably)
- libclust: contains clustering functions, primarily related to decision tree building for state tying
- librpcc: is obsolete (formerly contained one function, which reads the performance counter on DEC Alphas using the RPCC instruction)
Most of the code that you are likely to want to modify is in libmodinv and possibly also libclust and libcep_feat.
Tools
Source code for the training programs is in the ‘src/programs’
directory. The functionality of most of these programs is detailed elsewhere.
The ones which are the most important for the training process are ‘bw’ and
‘norm’. The ‘bw’ program collects expected state occupation and transition
counts using the Forward-Backward algorithm, while ‘norm’ performs Maximum
Likelihood updating of acoustic model parameters based on these counts.
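As a schematic illustration of that second step (plain NumPy arithmetic, not actual SphinxTrain code): for each Gaussian density, ‘bw’ accumulates a weighted sum of observation vectors along with a total occupancy count, and ‘norm’ essentially divides one by the other to obtain the new mean.
import numpy
# Hypothetical accumulators for a single Gaussian density with a
# 3-dimensional feature vector, as a Forward-Backward pass might produce them:
weighted_obs_sum = numpy.array([12.0, -3.0, 7.5])  # sum over t of gamma_t * o_t
occupancy = 4.0                                    # sum over t of gamma_t
# The Maximum Likelihood re-estimate of the mean is simply their ratio.
new_mean = weighted_obs_sum / occupancy
print new_mean   # [ 3.    -0.75   1.875]
The details of this (and the corresponding update for variances) are worked through below for the ‘norm’ program itself.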
Writing SphinxTrain programs in C
Basic structure of a SphinxTrain tool
Let’s look at the ‘norm’ tool to demonstrate the typical structure of a SphinxTrain tool written in C. This tool’s source code is in the directory ‘src/programs/norm’ inside the SphinxTrain source tree, and contains the following files:
src/programs/norm/main.c
src/programs/norm/Makefile
src/programs/norm/parse_cmd_ln.c
src/programs/norm/parse_cmd_ln.h
First, the file ‘Makefile’ contains instructions for compiling this tool. Most of the build logic is contained in the file config/common_make_rules, which is included at the end of each tool’s Makefile. The Makefile simply defines a number of variables which describe the source files. In the future, we may switch to using GNU Autotools, in which case there will be a ‘Makefile.am’ which operates similarly. Here is the important part of the Makefile:
TOP=../../..
DIRNAME=src/programs/norm
BUILD_DIRS =
ALL_DIRS= $(BUILD_DIRS)
SRCS = \
main.c \
parse_cmd_ln.c
H = \
parse_cmd_ln.h
FILES = Makefile $(SRCS) $(H)
TARGET = norm
ALL = $(BINDIR)/$(TARGET)
include $(TOP)/config/common_make_rules
The variables TOP and DIRNAME are required in all Makefiles in SphinxTrain.
BUILD_DIRS and ALL_DIRS are only needed if there are subdirectories within the
Makefile’s directory which need to be built. The SRCS variable lists all C
source files to be compiled while the H variable lists all the header files in
this directory. The FILES variable lists all files which should be included
in a distribution package. The TARGET variable is very important - it gives
the name of the program which is to be built from the source files in this
directory. The final two lines are required in order to make everything work.
Now, let’s look at parse_cmd_ln.c and parse_cmd_ln.h. The main function of these source files is to define the command-line arguments to the tool. For whatever historical reasons, these two files also contain a bunch of boilerplate code which is duplicated in every tool, with the only changes being the actual definition of arguments and the help text and description of the tool. Again, this may change in the near future. Finally, the main.c file contains the actual code. Let’s walk through what it does. First, the initialize() function parses the command line:
static int
initialize(int argc,
           char *argv[])
{
    /* define, parse and (partially) validate the command line */
    parse_cmd_ln(argc, argv);

    return S3_SUCCESS;
}
This sets the internal variables which are read by the various functions in <s3/cmd_ln.h>, such as cmd_ln_str(), cmd_ln_float32(), and so on. Unfortunately these access macros are not used consistently in SphinxTrain, so you will see a lot of code that just uses cmd_ln_access() and casts the result to some other type. This is BAD and should not be done, because in SphinxBase the return value of cmd_ln_access() is not a simple void pointer.
The actual work of the ‘norm’ program is done in the normalize() function.
There is no good reason to structure your code this way - you could just as
easily do all this stuff in the main() function, or preferably, you could split
this up into a number of smaller functions. But for whatever reason, this is
the way that ‘norm’ was written, so let’s look at what normalize() does.
First, it declares an utterly ridiculous number of variables. This is not
accepted best practice for programming in the 21st century. So, we’ll skip
that part. Most of these variables are simply used to hold values acquired
from the command-line. One important thing to note is that the command-line
variable -accumdir is defined in parse_cmd_ln.c as a list of strings (using
CMD_LN_STRING_LIST) and thus is stored in an array of pointers (char **). This is because there are typically multiple accumulator directories: the Forward-Backward part of training (using ‘bw’) is often run in multiple parts on a cluster of networked machines.
The most common use case for ‘norm’ is to generate a new set of acoustic model files from accumulator directories alone. It is also possible to specify input mean and variance files to be copied into the new model files in the case where some model parameters were not observed. After validating the command-line arguments, the code checks to see if these files were specified, and reads in the data from them:
if (in_mean_fn != NULL) {
    E_INFO("Selecting unseen density mean parameters from %s\n",
           in_mean_fn);
    if (s3gau_read(in_mean_fn,
                   &in_mean,
                   &n_mgau,
                   &n_gau_stream,
                   &n_gau_density,
                   &veclen) != S3_SUCCESS) {
        E_FATAL_SYSTEM("Couldn't read %s", in_mean_fn);
    }
    ckd_free((void *)veclen);
    veclen = NULL;
}

if (in_var_fn != NULL) {
    E_INFO("Selecting unseen density variance parameters from %s\n",
           in_var_fn);
    if (var_is_full) {
        if (s3gau_read_full(in_var_fn,
                            &in_fullvar,
                            &n_mgau,
                            &n_gau_stream,
                            &n_gau_density,
                            &veclen) != S3_SUCCESS) {
            E_FATAL_SYSTEM("Couldn't read %s", in_var_fn);
        }
    }
    else {
        if (s3gau_read(in_var_fn,
                       &in_var,
                       &n_mgau,
                       &n_gau_stream,
                       &n_gau_density,
                       &veclen) != S3_SUCCESS) {
            E_FATAL_SYSTEM("Couldn't read %s", in_var_fn);
        }
    }
    ckd_free((void *)veclen);
    veclen = NULL;
}
The functions s3gau_read() and s3gau_read_full() are defined in <s3/s3gau_io.h>.
Next, the code iterates over all of the accumulator directories given to it and builds a set of cumulative observation counts from the files in them (some obsolete code has been removed):
for (i = 0; accum_dir[i]; i++) {
    E_INFO("Reading and accumulating counts from %s\n", accum_dir[i]);

    if (out_mixw_fn) {
        rdacc_mixw(accum_dir[i],
                   &mixw_acc, &n_mixw, &n_stream, &n_density);
    }

    if (out_tmat_fn) {
        rdacc_tmat(accum_dir[i],
                   &tmat_acc, &n_tmat, &n_state_pm);
    }

    if (out_mean_fn || out_var_fn) {
        if (var_is_full)
            rdacc_den_full(accum_dir[i],
                           &wt_mean,
                           &wt_fullvar,
                           &pass2var,
                           &dnom,
                           &n_mgau,
                           &n_gau_stream,
                           &n_gau_density,
                           &veclen);
        else
            rdacc_den(accum_dir[i],
                      &wt_mean,
                      &wt_var,
                      &pass2var,
                      &dnom,
                      &n_mgau,
                      &n_gau_stream,
                      &n_gau_density,
                      &veclen);

        if (out_mixw_fn) {
            if (n_stream != n_gau_stream) {
                E_ERROR("mixw inconsistent w/ densities WRT # "
                        "streams (%u != %u)\n",
                        n_stream, n_gau_stream);
            }
            if (n_density != n_gau_density) {
                E_ERROR("mixw inconsistent w/ densities WRT # "
                        "den/mix (%u != %u)\n",
                        n_density, n_gau_density);
            }
        }
        else {
            n_stream = n_gau_stream;
            n_density = n_gau_density;
        }
    }
}
This is accomplished by simply adding together the counts from each directory
to produce a running total, which is done using the functions rdacc_mixw(),
rdacc_tmat(), and rdacc_den() (or rdacc_den_full() for full covariance
matrices). These functions are defined in <s3/s3acc_io.h>. The variables
mixw_acc, tmat_acc, wt_mean, wt_var, and wt_fullvar are used to store the
cumulative counts. You may or may not remember that the Baum-Welch update
formula for maximum-likelihood estimation of the means of a continuous-density
HMM is:
\begin{displaymath}
\hat\mu_{jk} = \frac{\sum_{t=1}^T \gamma_t(j,k) \vec o_t}{\sum_{t=1}^T \gamma_t(j,k)}
\end{displaymath}
The SphinxTrain variables wt_mean[i][j][k] and dnom[i][j][k] correspond exactly to the numerator (note that this is a vector) and the denominator (note that this is a scalar) of this equation. Therefore, to do normalization we basically just have to divide wt_mean by dnom. This is actually done in the file src/libs/libmodinv/gauden.c by the function gauden_norm_wt_mean(). In norm, it is called in the following piece of code:
if (wt_mean || wt_var || wt_fullvar) {
    if (out_mean_fn) {
        E_INFO("Normalizing mean for n_mgau= %u, n_stream= %u, n_density= %u\n",
               n_mgau, n_stream, n_density);
        gauden_norm_wt_mean(in_mean, wt_mean, dnom,
                            n_mgau, n_stream, n_density, veclen);
    }
    else {
        if (wt_mean) {
            E_INFO("Ignoring means since -meanfn not specified\n");
        }
    }

    if (out_var_fn) {
        if (var_is_full) {
            if (wt_fullvar) {
                E_INFO("Normalizing fullvar\n");
                gauden_norm_wt_fullvar(in_fullvar, wt_fullvar, pass2var, dnom,
                                       wt_mean, /* wt_mean now just mean */
                                       n_mgau, n_stream, n_density, veclen,
                                       cmd_ln_boolean("-tiedvar"));
            }
        }
        else {
            if (wt_var) {
                E_INFO("Normalizing var\n");
                gauden_norm_wt_var(in_var, wt_var, pass2var, dnom,
                                   wt_mean, /* wt_mean now just mean */
                                   n_mgau, n_stream, n_density, veclen,
                                   cmd_ln_boolean("-tiedvar"));
            }
        }
    }
    else {
        if (wt_var || wt_fullvar) {
            E_INFO("Ignoring variances since -varfn not specified\n");
        }
    }
}
else {
    E_INFO("No means or variances to normalize\n");
}
For variances, the standard Baum-Welch formula is:
\begin{displaymath}
\hat\Sigma_{jk} = \frac{\sum_{t=1}^T \gamma_t(j,k) (\vec o_t - \vec\mu_{jk})(\vec o_t - \vec\mu_{jk})^T}{\sum_{t=1}^T \gamma_t(j,k)}
\end{displaymath}
Note that the denominator of this equation is the same as in the mean
re-estimation formula, and therefore the SphinxTrain variable dnom
is used in
both re-estimations. In the case of diagonal covariances, the outer product
operation in the numerator reduces to a simple element-wise squaring.
Therefore the variance can be re-estimated independently in each dimension just
like the mean. There is one further wrinkle to do with two possible versions
of this formula, which will be discussed in more depth below.
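For a single dimension d of a diagonal covariance, the update above therefore reduces to the scalar form (this is just a restatement of the same formula in the same notation, not a new derivation):
\begin{displaymath}
\hat\sigma^2_{jkd} = \frac{\sum_{t=1}^T \gamma_t(j,k) (o_{td} - \mu_{jkd})^2}{\sum_{t=1}^T \gamma_t(j,k)}
\end{displaymath}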
Finally, we write out the newly re-estimated means and variances, which is done with the function s3gau_write(), in this code:
if (out_mean_fn) {
    if (wt_mean) {
        if (s3gau_write(out_mean_fn,
                        (const vector_t ***)wt_mean,
                        n_mgau,
                        n_stream,
                        n_density,
                        veclen) != S3_SUCCESS)
            return S3_ERROR;

        if (out_dcount_fn) {
            if (s3gaudnom_write(out_dcount_fn,
                                dnom,
                                n_mgau,
                                n_stream,
                                n_density) != S3_SUCCESS)
                return S3_ERROR;
        }
    }
    else
        E_WARN("NO reestimated means seen, but -meanfn specified\n");
}
else {
    if (wt_mean) {
        E_INFO("Reestimated means seen, but -meanfn NOT specified\n");
    }
}

if (out_var_fn) {
    if (var_is_full) {
        if (wt_fullvar) {
            if (s3gau_write_full(out_var_fn,
                                 (const vector_t ****)wt_fullvar,
                                 n_mgau,
                                 n_stream,
                                 n_density,
                                 veclen) != S3_SUCCESS)
                return S3_ERROR;
        }
        else
            E_WARN("NO reestimated variances seen, but -varfn specified\n");
    }
    else {
        if (wt_var) {
            if (s3gau_write(out_var_fn,
                            (const vector_t ***)wt_var,
                            n_mgau,
                            n_stream,
                            n_density,
                            veclen) != S3_SUCCESS)
                return S3_ERROR;
        }
        else
            E_WARN("NO reestimated variances seen, but -varfn specified\n");
    }
}
else {
    if (wt_var) {
        E_INFO("Reestimated variances seen, but -varfn NOT specified\n");
    }
}
Writing SphinxTrain programs in Python
The Python modules included with SphinxTrain make it easy to write small scripts that manipulate acoustic models. They can also be used to examine models interactively through the Python shell. We highly recommend installing the matplotlib and IPython packages, which, in combination, provide a convenient, MATLAB-like environment for manipulating and viewing numerical data.
Let’s look at the Python equivalent of the code for ‘norm’ described above.
We’ll skip command line parsing since you can do that using the Python getopt
module, and just assume that
all the accumulator directories live under the directory ‘bwaccumdir’ in the
current directory. We will also assume that the python modules are installed
in the ‘python’ directory under the current directory, which is what
SphinxTrain’s training setup script will do for you by default.
First, we have to set up the environment and load the necessary modules:
#!/usr/bin/env python
import os
import sys
sys.path.append('python')
from sphinx import s3gau, s3gaucnt
In this case the s3gaucnt and s3gau modules are the only ones we actually need. The first module provides the sphinx.s3gaucnt.S3GauCntFile class and some helper functions which we will use to process observation count files, while the second one provides the sphinx.s3gau.S3GauFile class which we will use to write out the re-estimated parameter files. Now we will use the sphinx.s3gaucnt.accumdirs function to accumulate counts from a list of directories (if we were dealing with full covariance accumulators, we would use sphinx.s3gaucnt.accumdirs_full instead):
# Accumulate observation counts from all accumulation directories
gauden = s3gaucnt.accumdirs([os.path.join('bwaccumdir', x) for x in os.listdir('bwaccumdir')])
The return value of this function is a sphinx.s3gaucnt.S3GauCntFile object which contains the merged counts for all the items in the list of directories passed to accumdirs. We can then obtain the mean, variance, mixture weight, and transition matrix counts from this object, as well as the normalization constant (“dnom”, which corresponds to the denominator of the Baum-Welch update formulae), using a set of accessor functions.
Let’s normalize the means first, since they are very simple. The means are obtained using the getmeans method, while the normalizer is obtained using getdnom. What you actually get is a list of lists of 2-dimensional arrays (of type numpy.ndarray). For more information on how to manipulate these, please see the NumPy documentation.
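If you want to convince yourself of this structure interactively, a quick sketch like the following will print it out (the exact dimensions will of course depend on your model):
means = gauden.getmeans()
dnoms = gauden.getdnom()
print len(means)          # number of codebooks
print len(means[0])       # number of feature streams in the first codebook
print means[0][0].shape   # 2-D array of mean counts: one row per density
print dnoms[0][0]         # the corresponding normalizers ("dnom") for that stream
With that structure in mind, the normalization itself looks like this: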
# Normalize means
outmeans = []
# For each codebook in the counts file and its associated set of normalizers:
for mgau, mgau_dnom in zip(gauden.getmeans(), gauden.getdnom()):
    # Create a list of re-estimated parameters for this codebook and append it to the output array
    outmgau = []
    outmeans.append(outmgau)
    # For each feature stream in this codebook and its associated set of normalizers:
    for feat, feat_dnom in zip(mgau, mgau_dnom):
        # Normalize the parameters (dividing a 2-D array by a 1-D array)
        outmgau.append(feat / feat_dnom)
# Write out the re-estimated means
s3gau.open("means", "wb").writeall(outmeans)
And that’s it, seriously. Note how we can use zip to ensure that we match up codebooks and features with their corresponding normalizers without ever having to deal with any index variables. itertools.izip might be a bit more efficient in this case.
Now we will do the variances. These are only slightly more complicated due to the fact that we have to distinguish between “two-pass” and “one-pass” variance accumulators. You may recall that there are two mathematically equivalent forms of the maximum likelihood estimator of variance:
\begin{displaymath}
\sigma^2 = E[(x-\mu)^2] = E[x^2] - \mu^2
\end{displaymath}
In the first case the sufficient statistic is $(x-\mu)^2$, while in the second case it is $x^2$ and does not depend on the means at all. In Sphinx speak, the first case is called “two-pass” estimation, and is used when the -2passvar yes flag is specified to bw. The count file contains a flag which describes which type of variance statistics it contains.
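For reference, written out with the accumulated counts in the same per-dimension notation used earlier (and with the newly re-estimated mean), the one-pass form of the update is:
\begin{displaymath}
\hat\sigma^2_{jkd} = \frac{\sum_{t=1}^T \gamma_t(j,k) o_{td}^2}{\sum_{t=1}^T \gamma_t(j,k)} - \hat\mu_{jkd}^2
\end{displaymath}
This is exactly what the one-pass branch of the code below computes.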
This flag is stored in the pass2var attribute of the sphinx.s3gaucnt.S3GauCntFile object. So, our variance normalization looks like this:
# Normalize variances
outvars = []
# For each codebook in the counts file and its associated set of normalizers:
for mgau, mgau_dnom, mgau_mean in zip(gauden.getvars(), gauden.getdnom(), outmeans):
    # Create a list of re-estimated parameters for this codebook and append it to the output array
    outmgau = []
    outvars.append(outmgau)
    # For each feature stream in this codebook and its associated set of normalizers:
    for feat, feat_dnom, feat_mean in zip(mgau, mgau_dnom, mgau_mean):
        if gauden.pass2var:
            # Two pass variance: statistic is (x-\mu)^2, we just have to normalize it
            outmgau.append(feat / feat_dnom)
        else:
            # One pass variance: statistic is x^2, we have to subtract the squared re-estimated mean
            outmgau.append(feat / feat_dnom - feat_mean * feat_mean)
# Write out the re-estimated variances
s3gau.open("variances", "wb").writeall(outvars)
And finally, to normalize the mixture weights and transition matrices, we do…
nothing. In fact that is what the original norm
program does as well. The
reason for this is that the normalization constant for these can be computed
quickly when the models are loaded, and there are some benefits to storing them
in unnormalized form (for one, it allows them to be used as priors for MAP
adaptation).
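To illustrate what that load-time normalization amounts to, here is a plain NumPy sketch with made-up counts (not the actual trainer or decoder code): each row of mixture weight counts is simply divided by its total, and transition matrix rows are handled the same way.
import numpy
# Hypothetical unnormalized mixture weight counts for one senone with
# two feature streams and four densities per stream, as 'bw' leaves them.
mixw_counts = numpy.array([[30.0, 10.0, 5.0, 5.0],
                           [12.0,  0.0, 6.0, 2.0]])
# Normalization: divide each row by its total so the weights for each
# stream sum to one.  The raw counts themselves remain available as
# priors for MAP adaptation, as noted above.
row_totals = mixw_counts.sum(axis=1)
mixw = mixw_counts / row_totals[:, numpy.newaxis]
print mixw                # normalized weights
print mixw.sum(axis=1)    # each row now sums to 1.0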