SRILM-FAQ
NAME
SRILM-FAQ - Frequently asked questions about SRI LM tools
SYNOPSIS
man srilm-faq
DESCRIPTION
This document tries to answer some of the most frequently asked questions
about SRILM.
Build issues
- A1) I ran ``make World'' but the $SRILM/bin/$MACHINE_TYPE directory is empty.
-
Building the binaries can fail for a variety of reasons.
Check the following:
-
a)
-
Make sure the SRILM environment variable is set, or specified on the
make command line, e.g.:
make SRILM=$PWD
-
b)
-
Make sure the
$SRILM/sbin/machine-type
script returns a valid string for the platform you are trying to build on.
Known platforms have machine-specific makefiles called
$SRILM/common/Makefile.machine.$MACHINE_TYPE
If
machine-type
does not work for some reason, you can override its output on the command line:
make MACHINE_TYPE=xyz
If you are building for an unsupported platform create a new machine-specific
makefile and mail it to stolcke@speech.sri.com.
-
c)
-
Make sure your compiler works and is invoked correctly.
You will probably have to edit the
CC
and
CXX
variables in the platform-specific makefile.
If you have questions about compiler invocation and best options
consult a local expert; these things differ widely between sites.
-
d)
-
The default is to compile with Tcl support.
This is in fact only used for some testing binaries (which are
not built by default),
so it can be turned off if Tcl is not available or presents problems.
Edit the machine-specific makefile accordingly.
To use Tcl, locate the
tcl.h
header file and the library itself, and set (for example)
TCL_INCLUDE = -I/path/to/include
TCL_LIBRARY = -L/path/to/lib -ltcl8.4
To disable Tcl support set
NO_TCL = X
TCL_INCLUDE =
TCL_LIBRARY =
-
e)
-
Make sure you have the C-shell (/bin/csh) installed on your system.
Otherwise you will see something like
make: /sbin/machine-type: Command not found
early in the build process.
On Ubuntu Linux and Cygwin systems "csh" or "tcsh" needs to be installed
as an optional package.
-
f)
-
If you cannot get SRILM to build, save the make output to a file
make World >& make.output
and look for messages indicating errors.
If you still cannot figure out what the problem is, send the error message
and immediately preceding lines to the srilm-user list.
Also include information about your operating system ("uname -a" output)
and compiler version ("gcc -v" or equivalent for other compilers).
- A2) The regression test outputs differ for all tests. What did I do wrong?
-
Most likely the binaries didn't get built or aren't executable
for some reason.
Check issue A1).
- A3) I get differing outputs for some of the regression tests. Is that OK?
-
It might be.
The comparison of reference to actual output allows for small numerical
differences, but
some of the algorithms make hard decisions based on floating-point computations
that can result in different outputs as a result of different compiler
optimizations, machine floating point precisions (Intel versus IEEE format),
and math libraries.
Tests of this nature include
ngram-class,
disambig,
and
nbest-rover.
When encountering differences, diff the output in the
$SRILM/test/outputs/TEST.$MACHINE_TYPE.stdout file to the corresponding
$SRILM/test/reference/TEST.stdout, where
TEST
is the name of the test that failed.
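For example, with TEST replaced by the name of the failing test, the differences can be inspected with:
diff $SRILM/test/reference/TEST.stdout $SRILM/test/outputs/TEST.$MACHINE_TYPE.stdout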
Also compare the corresponding .stderr files;
differences there usually indicate operating-system related problems.
Large data and memory issues
- B1) I'm getting a message saying ``Assertion `body != 0' failed.''
-
You are running out of memory.
See subsequent questions depending on what you are trying to do.
-
Note:
-
The above message means you are running
out of "virtual" memory on your computer, which could be because of
limits in swap space, administrative resource limits, or limitations of
the machine architecture (a 32-bit machine cannot address more than
4GB no matter how many resources your system has).
Another symptom of not enough memory is that your program runs, but
very, very slowly, i.e., it is "paging" or "swapping" as it tries to
use more memory than the machine has RAM installed.
- B2) I am trying to count N-grams in a text file and running out of memory.
-
Don't use
ngram-count
directly to count N-grams.
Instead, use the
make-batch-counts
and
merge-batch-counts
scripts described in
training-scripts(1).
That way you can create N-gram counts limited only by the maximum file size
on your system.
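As a rough sketch (see training-scripts(1) for the exact arguments; the file and directory names here are made up), counting might look as follows, where file-list names the training text files, 10 files are counted per batch through a no-op filter, and the partial counts are written to and then merged in the directory counts:
ls corpus-part*.txt > file-list
make-batch-counts file-list 10 /bin/cat counts -order 3
merge-batch-counts counts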
- B3) I am trying to build an N-gram LM and ngram-count runs out of memory.
-
You are running out of memory either because of the size of ngram counts,
or of the LM being built. The following are strategies for reducing the
memory requirements for training LMs.
-
a)
-
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use
ngram-count
in "raw" form.
Instead, use the
make-big-lm
wrapper script described in the
training-scripts(1)
man page.
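A minimal sketch (assuming the merged counts from B2 are in corpus.counts.gz; see training-scripts(1) for the full set of options, which include the usual ngram-count options):
make-big-lm -read corpus.counts.gz -name biglm -order 3 -kndiscount -interpolate -lm big.lm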
-
b)
-
Switch to using the "_c" or "_s" versions of the SRI binaries.
For
instructions on how to build them, see the INSTALL file.
Once built, set your executable search path accordingly, and try
make-big-lm
again.
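As a sketch only (the INSTALL file has the authoritative instructions), the compact version is typically built by re-running the build with the OPTION variable set, e.g.:
make SRILM=$PWD OPTION=_c World
which should leave the binaries in $SRILM/bin/$MACHINE_TYPE_c.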
-
c)
-
Raise the minimum counts for N-grams included in the LM, i.e.,
the values of the options
-gt2min,
-gt3min,
-gt4min,
etc.
The higher order N-grams typically get higher minimum counts.
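For example, options along these lines (the values are purely illustrative) could be added to the make-big-lm or ngram-count command:
-gt2min 2 -gt3min 3 -gt4min 4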
-
d)
-
Get a machine with more memory.
If you are hitting the limitations of a 32-bit machine architecture,
get a 64-bit machine and recompile SRILM to take advantage of the expanded
address space.
(The MACHINE_TYPE=i686-m64 setting is for systems based on
64-bit AMD processors, as well as recent compatibles from Intel.)
Note that 64-bit pointers add memory overhead of their own, so you will
need a machine with significantly more than 4GB of memory, not just a
little more.
- B4) I am trying to apply a large LM to some data and am running out of memory.
-
Again, there are several strategies to reduce memory requirements.
-
a)
-
Use the "_c" or "_s" versions of the SRI binaries.
See 3b) above.
-
b)
-
Precompute the vocabulary of your test data and use the
ngram -limit-vocab
option to load only the N-gram parameters relevant to your data.
This approach should allow you to use arbitrarily
large LMs provided the data is divided into small enough chunks.
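For example (a sketch with made-up file names):
ngram-count -text test.data -write-vocab test.vocab
ngram -lm big.lm -vocab test.vocab -limit-vocab -ppl test.data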
-
c)
-
If the LM can be built on a large machine, but then is to be used on
machines with limited memory, use
ngram -prune
to remove the less important parameters of the model.
This usually gives huge size reductions with relatively modest performance
degradation.
The tradeoff is adjustable by varying the pruning parameter.
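For example (the threshold here is purely illustrative; useful values depend on the model and application):
ngram -order 3 -lm big.lm -prune 1e-8 -write-lm big.pruned.lm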
- B5) How can I reduce the time it takes to load large LMs into memory?
-
The techniques described in 4b) and 4c) above also reduce the load time
of the LM.
Additional steps to try are:
-
a)
-
Convert the LM into binary format, using
ngram -order N -lm OLDLM -write-bin-lm NEWLM
(This is currently only supported for N-gram-based LMs.)
You can also generate the LM directly in binary format, using
ngram-count ... -lm NEWLM -write-binary-lm
The resulting
NEWLM
file (which should not be compressed) can be used
in place of a textual LM file with all compiled SRILM tools
(but not with
lm-scripts(1)).
The format is machine-independent, i.e., it can be read on machines with
different word sizes or byte-order.
Loading binary LMs is faster because
(1) it reduces the overhead of parsing the input data, and
(2) in combination with
-limit-vocab
(see 4b above)
sections of the LM that fall outside the vocabulary can be skipped much more quickly.
-
Note:
-
There is also a binary format for N-gram counts.
It can be generated using
ngram-count -write-binary COUNTS
and has similar advantages as binary LM files.
-
b)
-
Start a "probability server" that loads the LM ahead of time, and
then have "LM clients" query the server instead of computing the
probabilities themselves.
The server is started on a machine named
HOST
using
ngram LMOPTIONS -server-port P &
where
P
is an integer < 2^16 that specifies the TCP/IP port number the
server will listen on, and
LMOPTIONS
are whatever options necessary to define the LM to be used.
One or more clients (programs such as
ngram(1),
disambig(1),
lattice-tool(1))
can then query the server using the options
-use-server P@HOST -cache-served-ngrams
instead of the usual "-lm FILE".
The
-cache-served-ngrams
option is not required but often speeds up performance dramatically by
saving the results of server lookups in the client for reuse.
Server-based LMs may be combined with file-based LMs by interpolation;
see
ngram(1)
for details.
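For example (a sketch with a made-up host name and port number): on the machine lmhost, start the server with
ngram -order 5 -lm big.lm -server-port 2525 &
and then, on any client machine, evaluate test data with
ngram -order 5 -use-server 2525@lmhost -cache-served-ngrams -ppl test.text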
- B6) How can I use the Google Web N-gram corpus to build an LM?
-
Google has made a corpus of 5-grams extracted from 1 tera-words of web data
available via LDC.
However, the data is too large to build a standard backoff N-gram, even
using the techniques described above.
Instead, we recommend a "count-based" LM smoothed with deleted interpolation.
Such an LM computes probabilities on the fly from the counts, of which only
the subsets needed for a given test set need to be loaded into memory.
LM construction proceeds in the following steps:
-
a)
-
Make sure you have built SRI binaries either for a 64-bit machine
(e.g., MACHINE_TYPE=i686-m64 OPTION=_c) or using 64-bit counts (OPTION=_l).
This is necessary because the data contains N-gram counts exceeding
the range of 32-bit integers.
Be sure to invoke all commands below using the path to the appropriate
binary executable directory.
-
b)
-
Prepare a mapping file for some vocabulary mismatches and call it
google.aliases:
<S> <s>
</S> </s>
<UNK> <unk>
-
c)
-
Prepare an initial count-LM parameter file
google.countlm.0:
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 15
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5
google-counts PATH
where
PATH
points to the location of the Google N-grams, i.e., the directory containing
subdirectories "1gms", "2gms", etc.
Note that the
vocabsize
and
totalcount
were obtained from the 1gms/vocab.gz and 1gms/total files, respectively.
(Check that they match and modify as needed.)
For an explanation of the parameters see the
ngram(1)
-count-lm
option.
-
d)
-
Prepare a text file
tune.text
containing data for estimating the mixture weights.
This data should be representative of, but different from your test data.
Compute the vocabulary of this data using
ngram-count -text tune.text -write-vocab tune.vocab
The vocabulary size should not exceed a few thousand to keep memory
requirements in the following steps manageable.
-
e)
-
Estimate the mixture weights:
ngram-count -debug 1 -order 5 -count-lm \
-text tune.text -vocab tune.vocab \
-vocab-aliases google.aliases \
-limit-vocab \
-init-lm google.countlm.0 \
-em-iters 100 \
-lm google.countlm
This will write the estimated LM to
google.countlm.
The output will be identical to the initial LM file, except for the
updated interpolation weights.
-
f)
-
Prepare a test data file
test.text,
and its vocabulary
test.vocab
as in Step d) above.
Then apply the LM to the test data:
ngram -debug 2 -order 5 -count-lm \
-lm google.countlm \
-vocab test.vocab \
-vocab-aliases google.aliases \
-limit-vocab \
-ppl test.text > test.ppl
The perplexity output will appear in
test.ppl.
-
g)
-
Note that the Google data uses mixed case spellings.
To apply the LM to lowercase data one needs to prepare a much more
extensive vocabulary mapping table for the
-vocab-aliases
option, namely, one that maps all
upper- and mixed-case spellings to lowercase strings.
This mapping file should be restricted to the words appearing in
tune.text
and
test.text,
respectively, to avoid defeating the effect of
-limit-vocab.
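One possible way to build such a mapping (a sketch only; it assumes tune.vocab and test.vocab were created as in steps d) and f), and PATH again stands for the Google data directory) is to collect the mixed-case entries of the Google unigram vocabulary whose lowercased form occurs in the tuning or test vocabulary:
sort -u tune.vocab test.vocab > needed.vocab
gunzip -c PATH/1gms/vocab.gz | cut -f1 | \
awk 'NR==FNR { need[$1]; next } { lc = tolower($1); if ((lc in need) && lc != $1) print $1, lc }' \
needed.vocab - >> google.aliases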
Smoothing issues
- C1) What is smoothing and discounting all about?
-
Smoothing
refers to methods that assign probabilities to events (N-grams) that
do not occur in the training data.
According to a pure maximum-likelihood estimator these events would have
probability zero, which is plainly wrong since previously unseen events
in general do occur in independent test data.
Because the probability mass is redistributed away from the seen events
toward the unseen events the resulting model is "smoother" (closer to uniform)
than the ML model.
Discounting
refers to the approach used by many smoothing methods of adjusting the
empirical counts of seen events downwards.
The ML estimator (count divided by total number of events) is then applied
to the discounted count, resulting in a smoother estimate.
- C2) What smoothing methods are there?
-
There are many, and SRILM implements a fairly large selection of the
most popular ones.
A detailed discussion of these is found in a separate document,
ngram-discount(7).
- C3) Why am I getting errors or warnings from the smoothing method I'm using?
-
The Good-Turing and Kneser-Ney smoothing methods rely on statistics called
"count-of-counts", the number of words occurring one, twice, three times, etc.
The formulae for these methods become undefined if the counts-of-counts
are zero, or not strictly decreasing.
Some conditions are fatal (such as when the count of singleton words is zero),
others lead to less smoothing (and warnings).
To avoid these problems, check for the following possibilities:
-
a)
-
The data could be very sparse, i.e., the training corpus very small.
Try using the Witten-Bell discounting method.
-
b)
-
The vocabulary could be very small, such as when training an LM based on
characters or parts-of-speech.
Smoothing is less of an issue in those cases, and the Witten-Bell method
should work well.
-
c)
-
The data was manipulated in some way, or artificially generated.
For example, duplicating data eliminates the odd-numbered counts-of-counts.
-
d)
-
The vocabulary is limited during count collection using the
ngram-count
-vocab
option, with the effect that many low-frequency N-grams are eliminated.
The proper approach is to compute smoothing parameters on the full vocabulary.
This happens automatically in the
make-big-lm
wrapper script, which is preferable to direct use of
ngram-count
for other reasons (see issue B3-a above).
-
e)
-
You are estimating an LM from N-gram counts that have been truncated beforehand,
e.g., by removing singleton events.
If you cannot go back to the original data and recompute the counts
there is a heuristic to extrapolate low counts-of-counts from higher ones.
The heuristic is invoked automatically (and an informational message is output)
when
make-big-lm
is used to estimate LMs with Kneser-Ney smoothing.
For details see the paper by W. Wang et al. in ASRU-2007, listed under
"SEE ALSO".
- C4) How does discounting work in the case of unigrams?
-
First, unigrams are discounted with the specified method, just like the
higher-order N-grams.
The probability mass freed up in this way
is then either spread evenly over all word types
that would otherwise have zero probability (this is essentially
simulating a backoff to zero-grams), or
if all unigrams already have non-zero probabilities, the
left-over mass is added to
all
unigrams.
In either case all unigram probabilities will sum to 1.
An informational message from
ngram-count
will tell which case applies.
Out-of-vocabulary, zeroprob, and `unknown' words
- D1) What is the perplexity of an OOV (out of vocabulary) word?
-
By default, any word not observed in the training data is considered
OOV, and OOV words are silently ignored by
ngram(1)
during perplexity (ppl) calculation.
For example:
$ ngram-count -text turkish.train -lm turkish.lm
$ ngram -lm turkish.lm -ppl turkish.test
file turkish.test: 61031 sentences, 1000015 words, 34153 OOVs
0 zeroprobs, logprob= -3.20177e+06 ppl= 1311.97 ppl1= 2065.09
The statistics printed in the last two lines have the following meanings:
- 34153 OOVs
-
This is the number of unknown word tokens, i.e. tokens
that appear in
turkish.test
but not in
turkish.train
from which
turkish.lm
was generated.
- logprob= -3.20177e+06
-
This gives us the total logprob ignoring the 34153 unknown word tokens.
The logprob does include the probabilities
assigned to </s> tokens which are introduced by
ngram-count(1).
Thus the total number of tokens on which this logprob is based is
words - OOVs + sentences = 1000015 - 34153 + 61031 = 1026893
- ppl = 1311.97
-
This gives us the geometric average of 1/probability of
each token, i.e., perplexity.
The exact expression is:
ppl = 10^(-logprob / (words - OOVs + sentences))
- ppl1 = 2065.09
-
This gives us the average perplexity per word excluding the </s> tokens.
The exact expression is:
ppl1 = 10^(-logprob / (words - OOVs))
You can verify these numbers by running the
ngram
program with the
-debug 2
option, which gives the probability assigned to each token.
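For example, continuing the session above (the output file name is arbitrary):
ngram -debug 2 -lm turkish.lm -ppl turkish.test > turkish.test.probs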
- D2) What happens when the OOV word is in the context of an N-gram?
-
Exact details depend on the discounting algorithm used, but typically
the backed-off probability from a lower order N-gram is used. If the
-unk
option is used as explained below, an <unk> token is assumed to
take the place of the OOV word and no back-off may be necessary
if a corresponding N-gram containing <unk> is found in the LM.
- D3) Isn't it wrong to assign 0 logprob to OOV words?
-
That depends on the application.
If you are comparing multiple language
models which all consider the same set of words as OOV it may be OK to
ignore OOV words.
Note that perplexity comparisons are only ever meaningful
if the vocabularies of all LMs are the same.
Therefore, if you want to compare LMs with different sets of OOV words
(such as when using different tokenization strategies for morphologically
complex languages), it becomes important
to take into account the true cost of the OOV words, or to model all words,
including OOVs.
- D4) How do I take into account the true cost of the OOV words?
-
A simple strategy is to "explode" the OOV words, i.e., split them into
characters in the training and test data.
Typically words that appear more than once in the training data are
considered to be vocabulary words.
All other words are split into their characters and the
individual characters are considered tokens.
Assuming that all characters occur at least once in the training data there
will be no OOV tokens in the test data.
Note that this strategy changes the number of tokens in the data set,
so even though the logprob remains meaningful, be careful when reporting ppl results.
- D5) What if I want to model the OOV words explicitly?
-
A better strategy may be to use a separate "letter" model for OOV words.
This can be easily created using SRILM by using a training
file listing the OOV words one per line with their characters
separated by spaces.
The
ngram-count
options
-ukndiscount
and
-order 7
seem to work well for this purpose.
The final logprob results are obtained in two steps.
First do regular training and testing on your data using
-vocab
and
-unk
options.
The resulting logprob will include the cost of the vocabulary words and an
<unk> token for each OOV word.
Then apply the letter model to each OOV word in the test set.
Add the logprobs.
Here is an example:
# Determine vocabulary:
ngram-count -text turkish.train -write-order 1 -write turkish.train.1cnt
awk '$2>1' turkish.train.1cnt | cut -f1 | sort > turkish.train.vocab
awk '$2==1' turkish.train.1cnt | cut -f1 | sort > turkish.train.oov
# Word model:
ngram-count -kndiscount -interpolate -order 4 -vocab turkish.train.vocab -unk -text turkish.train -lm turkish.train.model
ngram -order 4 -unk -lm turkish.train.model -ppl turkish.test > turkish.test.ppl
# Letter model:
perl -C -lne 'print join(" ", split(""))' turkish.train.oov > turkish.train.oov.split
ngram-count -ukndiscount -interpolate -order 7 -text turkish.train.oov.split -lm turkish.train.oov.model
perl -pe 's/\s+/\n/g' turkish.test | sort > turkish.test.words
comm -23 turkish.test.words turkish.train.vocab > turkish.test.oov
perl -C -lne 'print join(" ", split(""))' turkish.test.oov > turkish.test.oov.split
ngram -order 7 -ppl turkish.test.oov.split -lm turkish.train.oov.model > turkish.test.oov.ppl
# Add the logprobs in turkish.test.ppl and turkish.test.oov.ppl.
Again, perplexities are not directly meaningful as computed by SRILM, but you
can recompute them by hand using the combined logprob value, and the number of
original word tokens in the test set.
- D6) What are zeroprob words and when do they occur?
-
In-vocabulary words that get zero probability are counted as
"zeroprobs" in the ppl output.
Just as OOV words, they are excluded from the perplexity
computation since otherwise the perplexity value would be infinity.
There are three reasons why zeroprobs could occur in a
closed vocabulary setting (the default for SRILM):
-
a)
-
If the same vocabulary is used at test time as was used during
training, and smoothing is enabled, then the occurrence of zeroprobs
indicates an anomalous condition and, possibly, a broken language model.
-
b)
-
If smoothing has been disabled (e.g., by using the option
-cdiscount 0),
then the LM will use maximum likelihood estimates for
the N-grams, and any unseen N-gram is a zeroprob.
-
c)
-
If a different vocabulary file is specified at test time than
the one used in training, then the definition of what counts as an OOV
will change.
In particular, a word that wasn't seen in the training data (but is in the
test vocabulary) will
not
be mapped to
<unk>
and, therefore, not
count as an OOV in the perplexity computation.
However, it will still get zero probability and, therefore, be tallied
as a zeroprob.
- D7) What is the point of using the <unk> token?
-
Using
<unk>
is a practical convenience employed by SRILM.
Words not in the specified vocabulary are mapped to
<unk>,
which is equivalent to performing the same mapping
in a data pre-processing step outside of SRILM.
Other than that,
for both LM estimation and evaluation purposes,
<unk>
is treated like any other word.
(In particular, in the computation of discounted probabilities
there is no special handling of
<unk>.)
- D8) So how do I train an open-vocabulary LM with <unk>?
-
First, make sure to use the
ngram-count
-unk
option, which simply indicates that the
<unk>
word should be included in the LM vocabulary, as required for an
open-vocabulary LM.
Without this option, N-grams containing
<unk>
would simply be discarded.
An "open vocabulary" LM is simply one that contains
<unk>,
and can therefore (by virtue of the mapping of OOVs to
<unk>)
assign a non-zero probability to them.
Next, we need to ensure there are actual occurrences of
<unk>
N-grams
in the training data so we can obtain meaningful probability estimates
for them
(otherwise
<unk>
would only get probability via unigram discounting, see item C4).
To get a proper estimate
of the
<unk>
probability, we need to explicitly specify a vocabulary that is not
a superset of the training data.
One way to do that is to extract the vocabulary from an independent
data set; another is to include only words with some minimum count
(greater than 1) in the training data.
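For example, one way to do this (a sketch with made-up file names, along the lines of the recipe in D5) is to restrict the vocabulary to words seen at least twice in the training data, so that training-data singletons are mapped to <unk>:
ngram-count -text train.txt -write-order 1 -write train.1cnt
awk '$2 > 1' train.1cnt | cut -f1 > train.vocab
ngram-count -text train.txt -vocab train.vocab -unk -kndiscount -interpolate -order 3 -lm open.lm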
- D9) Doesn't ngram-count -addsmooth deal with OOV words by adding a constant to occurrence counts?
-
No, all smoothing is applied when building the LM at training time,
so it must use the
<unk>
mechanism to assign probability to words that are first seen in the
test data.
Furthermore, even add-constant smoothing requires a fixed, finite
vocabulary to compute the denominator of its estimator.
SEE ALSO
ngram(1), ngram-count(1), training-scripts(1), ngram-discount(7).
$SRILM/INSTALL
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
W. Wang, A. Stolcke, & J. Zheng,
Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, pp. 159-164, Kyoto, 2007.
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
BUGS
This document is work in progress.
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>
Deniz Yuret <dyuret@ku.edu.tr>
Nitin Madnani <nmadnani@umiacs.umd.edu>
Copyright 2007-2010 SRI International