Segmentation and Diarization using LIUM tools

LIUM has released a free system for speaker diarization and segmentation, which integrates well with Sphinx. This tool is essential if you are trying to do recognition on long audio files such as lectures or radio or TV shows, which may also contain multiple speakers.

Segmentation means splitting the audio into manageable, distinct chunks of homogeneous audio - e.g. speech, silence, music. Diarization specifically means identifying the unique speakers in an audio file. The LIUM tool does both of these - in fact it is not very useful to do one without the other.

Getting Started

First, download the LIUM_SpkDiarization .jar file. Running it without arguments will print out a summary of its options, which looks like this:

dhuggins@lima-2:~/Projects/PyCon/freelt$ ../tools/LIUM_SpkDiarization-3.1.jar 
info[info] 	 ====================================================== 
info[program] 	 name = Diarization
info[info] 	 ------------------------------------------------------ 
info[show] 	 [options] show
info[ParameterFeature-Input] 	 --fInputMask 	 Features input mask = %s.mfcc
info[ParameterFeature-Input] 	 --fInputDesc 	 Features info (type[,s:e:ds:de:dds:dde,dim,c:r:wSize:method]) = audio2sphinx,1:1:0:0:0:0,13,0:0:0:0
info[ParameterFeature-Input] 	 	 type [sphinx,spro4,gztxt,audio2sphinx] = audio2sphinx (4)
info[ParameterFeature-Input] 	 	 static [0=not present,1=present ,3=to be removed] = 1
info[ParameterFeature-Input] 	 	 energy [0,1,3] = 1
info[ParameterFeature-Input] 	 	 delta [0,1,2=computed on the fly,3] = 0
info[ParameterFeature-Input] 	 	 delta energy [0,1,2=computed on the fly,3] = 0
info[ParameterFeature-Input] 	 	 delta delta [0,1,2,3] = 0
info[ParameterFeature-Input] 	 	 delta delta energy [0,1,2,3] = 0
info[ParameterFeature-Input] 	 	 file dim = 13
info[ParameterFeature-Input] 	 	 normalization, center [0,1] = 0
info[ParameterFeature-Input] 	 	 normalization, reduce [0,1] = 0
info[ParameterFeature-Input] 	 	 normalization, window size = 0
info[ParameterFeature-Input] 	 	 normalization, method [0 (segment), 1 (cluster), 2 (sliding), 3 (warping)] =0
info[info] 	 ------------------------------------------------------ 
info[ParameterSegmentationFile-Input] 	 --sInputMask 	 Output segmentation mask =
info[ParameterSegmentationFile-Input] 	 --sInputFormat 	 Output segmentation format = seg,ISO-8859-1
info[ParameterSegmentationFile-Output] 	 --sOutputMask 	 Output segmentation mask = %s.out.seg
info[ParameterSegmentationFile-Output] 	 --sOutputFormat 	 Output segmentation format = seg,ISO-8859-1
info[info] 	 ------------------------------------------------------ 
info[ParameterSystem] 	 --system = current
info[ParameterSystem] 	 --doCEClustering = false
info[ParameterSystem] 	 --saveAllStep = false
info[ParameterSystem] 	 --loadInputSegmentation = false
info[info] 	 ------------------------------------------------------ 

The LIUM segmentation tool can take a variety of file types as input. By default, it assumes that the input is audio - this can be a WAV file, and probably also MP3 and others. If you already have MFCCs and wish to use them instead, then pass --fInputDesc sphinx to it. The input segmentation arguments (--sInputMask and --sInputFormat) are not required - if they are not given, the tool will start with the entire file.

The default output format is the one used by LIUM in their evaluations. However, the tool can easily output the control files used by Sphinx3 and PocketSphinx (FIXME: and Sphinx4 too?) if you pass --sOutputFormat ctl. In theory it also supports Transcriber XML, but I haven’t figured out how to make that work yet.
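
Putting the options above together, a run on a WAV file that writes a Sphinx control file might look like the sketch below. The segment_show function name, the %s.ctl output mask and the use of java -jar are assumptions made for illustration; the jar path is simply the one from the session shown earlier. The JAVA variable can be overridden, for example with echo, to print the arguments instead of running the tool.

```shell
#!/bin/sh
# Sketch only: segment_show is a hypothetical helper, not part of LIUM.
# Assumes the jar can be started with "java -jar" and that <show>.wav
# sits in the current directory.
segment_show() {
    show=$1
    # %s in the masks is replaced by the show name given as the last argument.
    "${JAVA:-java}" -jar "${LIUM_JAR:-../tools/LIUM_SpkDiarization-3.1.jar}" \
        --fInputMask '%s.wav' \
        --sOutputMask '%s.ctl' \
        --sOutputFormat ctl \
        "$show"
}

# Usage (reads lecture01.wav, writes lecture01.ctl):
#   segment_show lecture01
```

For MFCC input, --fInputDesc sphinx would be added and the input mask pointed at the feature files instead of the audio.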

Here’s a little script that will turn ctl format output files into label files for Audacity:

#!/usr/bin/perl -w
use strict;

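# Each ctl line is: show start_frame end_frame utterance_id
while (<>) {
    my ($show, $sf, $ef, $uttid) = split;
    # Convert frames to seconds, assuming 100 frames per second
    $sf /= 100;
    $ef /= 100;
    # Audacity labels are: start <TAB> end <TAB> label
    print "$sf\t$ef\t$uttid\n";
}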
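
For a quick check where Perl is not at hand, the same conversion can be sketched as an awk one-liner wrapped in a shell function. This is an equivalent written for illustration, not one of the original scripts: ctl_to_labels is a made-up name, and unlike the Perl version it always prints times with two decimal places.

```shell
#!/bin/sh
# Sketch only: ctl_to_labels is a hypothetical name; awk stands in for Perl.
# Input lines:  show start_frame end_frame utterance_id
# Output lines: start_seconds <TAB> end_seconds <TAB> utterance_id
ctl_to_labels() {
    # Dividing by 100 assumes 100 frames per second, as in the Perl script.
    awk '{ printf "%.2f\t%.2f\t%s\n", $2 / 100, $3 / 100, $4 }' "$@"
}

# Usage:
#   ctl_to_labels lecture01.ctl > lecture01.txt
```

With a made-up input line such as "lecture01 0 1050 utt_a", this prints 0.00, 10.50 and utt_a separated by tabs, which Audacity can import as a label track.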

And, here’s one that will turn the output hyp files from PocketSphinx into label files, for transcription fixing:

#!/usr/bin/perl -w
use strict;

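# Each hyp line is: text (utterance_id [score])
while (<>) {
    my ($text, $uttid, $score) = /^(.*)\((\S+)(?:\s+(-?\d+))?\)$/
        or next;    # skip lines that are not in hyp format
    # The second and third dash-separated fields of the utterance ID
    # are used as the start and end times
    my ($time, $start, $end) = split /-/, $uttid;
    print "$start\t$end\t$text\n";
}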