ngram-class
NAME
ngram-class - induce word classes from N-gram statistics
SYNOPSIS
ngram-class [ -help ] option ...
DESCRIPTION
ngram-class
induces word classes from distributional statistics,
so as to minimize perplexity of a class-based N-gram model
given the provided word N-gram counts.
Presently, only bigram statistics are used, i.e., the induced classes
are best suited for a class-bigram language model.
The program generates the class N-gram counts and class expansions
needed by
ngram-count(1)
and
ngram(1),
respectively, to train and to apply the class N-gram model.
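As a reading aid, the objective can be written out. This assumes the standard class-bigram factorization of Brown et al. (1992), where c_i denotes the class of word w_i and N is the number of bigram tokens; the symbols are illustrative and are not taken from the program itself.

```latex
\[
P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1}) \, P(w_i \mid c_i)
\]
\[
\mathrm{PP} = \exp\Bigl( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1}) \Bigr)
\]
```

Merging two classes changes both factors, so a merge is judged by how much it raises PP on the supplied counts.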
OPTIONS
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or "-" to indicate
stdin/stdout.
- -help
-
Print option summary.
- -version
-
Print version information.
- -debug level
-
Set debugging output at
level.
Level 0 means no debugging.
Debugging messages are written to stderr.
A useful level to trace the formation of classes is 2.
Input Options
- -vocab file
-
Read a vocabulary from file.
Subsequently, out-of-vocabulary words in both counts and text are
replaced with the unknown-word token.
If this option is not specified, all words found are implicitly added
to the vocabulary.
- -tolower
-
Map the vocabulary to lowercase.
- -counts file
-
Read N-gram counts from a file.
Each line contains an N-gram of
words, followed by an integer count, all separated by whitespace.
Repeated counts for the same N-gram are added.
Counts collected by
-text
and
-counts
are additive as well.
Note that the input should contain consistent lower- and higher-order
counts (i.e., unigrams and bigrams), as would be generated by
ngram-count(1).
- -text textfile
-
Generate N-gram counts from text file.
textfile
should contain one sentence unit per line.
Begin/end sentence tokens are added if not already present.
Empty lines are ignored.
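The input conventions described for -counts and -text can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names read_counts and count_text are hypothetical, only unigrams and bigrams are collected, and the program's actual tokenization and vocabulary handling may differ in detail.

```python
from collections import Counter

def read_counts(lines, counts=None):
    """Parse lines of the form 'w1 [w2 ...] count'; repeated N-grams are summed."""
    counts = Counter() if counts is None else counts
    for line in lines:
        fields = line.split()
        if len(fields) < 2:          # skip blank or malformed lines
            continue
        *ngram, n = fields
        counts[tuple(ngram)] += int(n)
    return counts

def count_text(lines, counts=None, bos="<s>", eos="</s>"):
    """Collect unigram and bigram counts from one sentence per line."""
    counts = Counter() if counts is None else counts
    for line in lines:
        words = line.split()
        if not words:                # empty lines are ignored
            continue
        if words[0] != bos:          # add sentence tags only if missing
            words.insert(0, bos)
        if words[-1] != eos:
            words.append(eos)
        for w in words:
            counts[(w,)] += 1
        for pair in zip(words, words[1:]):
            counts[pair] += 1
    return counts

# Counts from both sources accumulate in the same table.
counts = read_counts(["a b 2", "a b 3", "a 5"])
counts = count_text(["a b", "", "<s> a b </s>"], counts)
```

Passing the same Counter to both functions mirrors the additive behaviour of combining -counts and -text.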
Class Merging
- -numclasses C
-
The target number of classes to induce.
A zero argument suppresses automatic class merging altogether
(e.g., for use with
-interact).
- -full
-
Perform full greedy merging over all classes starting with one class per
word.
This is the O(V^3) algorithm described in Brown et al. (1992).
- -incremental
-
Perform incremental greedy merging, starting with
one class each for the
C
most frequent words, and then adding one word at a time.
This is the O(V*C^2) algorithm described in Brown et al. (1992);
it is the default.
- -maxwordsperclass M
-
Limits the number of words in a class to
M
in incremental merging.
By default there is no such limit.
- -interact
-
Enter a primitive interactive interface when done with automatic class
induction, allowing manual specification of additional merging steps.
- -noclass-vocab file
-
Read a list of vocabulary items from
file
that are to be excluded from classes.
These words or tags do not undergo class merging, but their
N-gram counts still affect the optimization of model perplexity.
The default is to exclude the sentence begin/end tags (<s> and </s>)
from class merging; this can be suppressed by specifying
-noclass-vocab /dev/null.
- -read file
-
Read initial class memberships from
file.
Class memberships need to be stored in
classes-format(5)
with the additional condition that probabilities are obligatory
and that each membership definition must specify exactly one word.
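The idea behind -full merging can be illustrated with a brute-force sketch. It starts with one class per word and repeatedly applies the merge that keeps the class-bigram log-likelihood highest, which is equivalent to the lowest perplexity on fixed counts. The function names are hypothetical, the likelihood is recomputed from scratch for every candidate pair (far slower than the bookkeeping in Brown et al. (1992)), and -noclass-vocab exclusions are not modeled; it is not the program's actual implementation.

```python
import math
from collections import Counter
from itertools import combinations

def class_bigram_loglik(bigrams, word2class):
    """Log-likelihood of bigram counts under P(c2|c1) * P(w2|c2)."""
    cc, left, right, wright = Counter(), Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        cc[c1, c2] += n      # class bigram count
        left[c1] += n        # c1 as history
        right[c2] += n       # c2 as predicted class
        wright[w2] += n      # w2 as predicted word
    ll = 0.0
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        p = cc[c1, c2] / left[c1] * wright[w2] / right[c2]
        ll += n * math.log(p)
    return ll

def greedy_full_merge(bigrams, num_classes):
    """Merge classes greedily until num_classes remain (0 means no merging)."""
    words = sorted({w for pair in bigrams for w in pair})
    word2class = {w: w for w in words}   # each class is named after one member
    if num_classes <= 0:
        return word2class
    classes = set(words)
    while len(classes) > num_classes:
        best = None
        for a, b in combinations(sorted(classes), 2):
            # Candidate: fold class b into class a.
            trial = {w: (a if c == b else c) for w, c in word2class.items()}
            ll = class_bigram_loglik(bigrams, trial)
            if best is None or ll > best[0]:
                best = (ll, b, trial)
        _, merged_away, word2class = best
        classes.remove(merged_away)
    return word2class

# Toy counts: a1/a2 and b1/b2 occur in identical contexts.
toy = {("a1", "b1"): 1, ("a1", "b2"): 1, ("a2", "b1"): 1, ("a2", "b2"): 1,
       ("b1", "c"): 2, ("b2", "c"): 2, ("c", "a1"): 2, ("c", "a2"): 2}
print(greedy_full_merge(toy, 3))
```

On the toy counts, the interchangeable words a1/a2 and b1/b2 end up sharing classes because merging them costs no likelihood.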
Output Options
- -class-counts file
-
Write class N-gram counts to
file
when done.
The format is the same as for word N-gram counts, and can be
read by
ngram-count(1)
to estimate a class-N-gram model.
- -classes file
-
Write class definitions (member words and their probabilities) to
file
when done.
The output format is the same as required by the
-classes
option of
ngram(1).
- -save S
-
Save the class counts and/or class definitions every
S
iterations during induction.
The filenames are obtained from the
-class-counts
and
-classes
options, respectively, by appending the iteration number.
This is convenient for producing sets of classes at different granularities
during the same run.
The saved class memberships can also be used with the
-read
option to restart class merging at a later time.
S=0
(the default) suppresses the saving actions.
- -save-maxclasses K
-
Modifies the action of
-save
so as to only start saving once the number of classes reaches
K.
(The iteration numbers embedded in filenames will start at 0 from that point.)
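To make the notion of class definitions concrete, the sketch below derives membership probabilities from unigram counts and prints one word per definition. It assumes a "class probability word" field order and relative-frequency probabilities; both are assumptions for illustration (see classes-format(5) for the authoritative layout), and the class names and the function class_definitions are hypothetical.

```python
from collections import Counter

def class_definitions(unigrams, word2class):
    """Return 'class prob word' lines with P(word | class) as relative frequency."""
    totals = Counter()
    for w, c in word2class.items():
        totals[c] += unigrams[w]     # total count of each class
    lines = []
    for w in sorted(word2class, key=lambda w: (word2class[w], w)):
        c = word2class[w]
        p = unigrams[w] / totals[c]  # share of the class count owned by w
        lines.append(f"{c} {p:.6g} {w}")
    return lines

# Hypothetical class names and counts, for illustration only.
unigrams = {"mon": 3, "tue": 1, "the": 4}
word2class = {"mon": "DAY", "tue": "DAY", "the": "DET"}
for line in class_definitions(unigrams, word2class):
    print(line)
```

Within each class the probabilities sum to one, which is what lets a class N-gram model expand a class back into its member words.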
SEE ALSO
ngram-count(1), ngram(1), classes-format(5).
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,
"Class-Based n-gram Models of Natural Language,"
Computational Linguistics 18(4), 467-479, 1992.
BUGS
Classes are optimized only for bigram models at present.
AUTHOR
Andreas Stolcke <andreas.stolcke@microsoft.com>
Seppo Enarvi <seppo.enarvi@aalto.fi>
Copyright (c) 1999-2010 SRI International
Copyright (c) 2012-2014 Microsoft Corp.
Copyright (c) 2013-2014 Seppo Enarvi