disambig
NAME
disambig - disambiguate text tokens using an N-gram model
SYNOPSIS
disambig [ -help ] option ...
DESCRIPTION
disambig
translates a stream of tokens from a vocabulary V1 to a corresponding stream
of tokens from a vocabulary V2,
according to a probabilistic, 1-to-many mapping.
Ambiguities in the mapping are resolved by finding the V2 sequence with
the highest posterior probability given the V1 sequence.
This probability is computed from pairwise conditional probabilities P(V1|V2),
as well as a language model for sequences over V2.
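For example, a typical invocation (with hypothetical file names) might read the V1 text from words.v1, apply the mapping in v1-to-v2.map, and rescore with a trigram LM over V2:

     disambig -text words.v1 -map v1-to-v2.map -lm v2.lm -order 3 > words.v2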
OPTIONS
Each filename argument can be an ASCII file, or a
compressed file (name ending in .Z or .gz), or ``-'' to indicate
stdin/stdout.
- -help
-
Print option summary.
- -version
-
Print version information.
- -text file
-
Specifies the file containing the V1 sentences.
- -map file
-
Specifies the file containing the V1-to-V2 mapping information.
Each line of
file
contains the mapping for a single word in V1:
w1 w21 [p21] w22 [p22] ...
where
w1
is a word from V1, which has possible mappings
w21,
w22,
... from V2.
Optionally, each of these can be followed by a numeric string for the
probability
p21,
which defaults to 1.
The number is used as the conditional probability P(w1|w21),
but the program does not depend on these numbers being properly normalized.
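For example, a map file for a simple tagging-style task (hypothetical words, V2 labels, and probabilities) might contain lines like:

     can     MD 0.9 NN 0.1
     bank    NN 1.0
     the     DT

where the probability on the last line defaults to 1 since none is given.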
- -escape string
-
Set an ``escape string.''
Input lines starting with
string
are not processed; instead they are passed unchanged to stdout.
This allows associated information to be passed to scoring scripts etc.
- -text-map file
-
Processes a combined text/map file.
The format of
file
is the same as for
-map,
except that the
w1
field on each line is interpreted as a word
token
rather than a word
type.
Hence, the V1 text input consists of all words in column 1 of
file
in order of appearance.
This is convenient if different instances of a word have different mappings.
There is no implicit insertion of begin/end sentence tokens in this
mode. Sentence boundaries should be indicated explicitly by
lines of the form
</s> </s>
<s> <s>
An escaped line (see
-escape)
also implicitly marks a sentence boundary.
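A minimal text-map fragment (hypothetical tokens and probabilities) might look like:

     <s>     <s>
     the     the
     read    red 0.4 reed 0.6
     </s>    </s>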
- -classes file
-
Specifies the V1-to-V2 mapping information in
classes-format(5).
Class labels are interpreted as V2 words, and expansions as V1 words.
Multi-word expansions are not allowed.
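See classes-format(5) for the exact syntax; a minimal sketch, assuming the usual one-expansion-per-line class definitions with hypothetical labels and words, might be:

     COLOR 0.5 red
     COLOR 0.5 blue

Here COLOR is treated as a V2 word, and red and blue as the V1 words that map to it.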
- -scale
-
Interpret the numbers in the mapping as P(w21|w1).
This is done by dividing probabilities by the unigram probabilities of
w21,
obtained from the V2 language model.
- -logmap
-
Interpret numeric values in map file as log probabilities, not probabilities.
- -lm file
-
Specifies the V2 language model as a standard ARPA N-gram backoff model file in
ngram-format(5).
The default is not to use a language model, i.e., choose V2 tokens
based only on the probabilities in the map file.
- -use-server S
-
Use a network LM server (typically implemented by
ngram(1)
with the
-server-port
option) as the main model.
The server specification
S
can be an unsigned integer port number (referring to a server port running on
the local host),
a hostname (referring to default port 2525 on the named host),
or a string of the form
port@host,
where
port
is a port number and
host
is either a hostname ("dukas.speech.sri.com")
or IP number in dotted-quad format ("140.44.1.15").
For server-based LMs, the
-order
option limits the context length of N-grams queried by the client
(with 0 denoting unlimited length).
Hence, the effective LM order is the minimum of the client-specified value
and any limit implemented in the server.
When
-use-server
is specified, the arguments to the options
-mix-lm,
-mix-lm2,
etc. are also interpreted as network LM server specifications provided
they contain a '@' character and do not contain a '/' character.
This allows the creation of mixtures of several file- and/or
network-based LMs.
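For example, assuming an ngram(1) server has been started with -server-port 8888 on host dukas.speech.sri.com, the V2 model could be queried remotely with:

     disambig -text words.v1 -map v1-to-v2.map -use-server 8888@dukas.speech.sri.com -order 3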
- -cache-served-ngrams
-
Enables client-side caching of N-gram probabilities to eliminate duplicate
network queries, in conjunction with
-use-server.
This may result in a substantial speedup
but requires memory in the client that may grow linearly with the
amount of data processed.
- -order n
-
Set the effective N-gram order used by the language model to
n.
Default is 2 (use a bigram model).
- -mix-lm file
-
Read a second N-gram model for interpolation purposes.
- -factored
-
Interpret the files specified by
-lm
and
-mix-lm
as a factored N-gram model specification.
See
ngram(1)
for details.
- -count-lm
-
Interpret the model specified by
-lm
(but not
-mix-lm)
as a count-based LM.
See
ngram(1)
for details.
- -lambda weight
-
Set the weight of the main model when interpolating with
-mix-lm.
Default value is 0.5.
- -mix-lm2 file
-
- -mix-lm3 file
-
- -mix-lm4 file
-
- -mix-lm5 file
-
- -mix-lm6 file
-
- -mix-lm7 file
-
- -mix-lm8 file
-
- -mix-lm9 file
-
Up to 9 more N-gram models can be specified for interpolation.
- -mix-lambda2 weight
-
- -mix-lambda3 weight
-
- -mix-lambda4 weight
-
- -mix-lambda5 weight
-
- -mix-lambda6 weight
-
- -mix-lambda7 weight
-
- -mix-lambda8 weight
-
- -mix-lambda9 weight
-
These are the weights for the additional mixture components, corresponding
to
-mix-lm2
through
-mix-lm9.
The weight for the
-mix-lm
model is 1 minus the sum of
-lambda
and
-mix-lambda2
through
-mix-lambda9.
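For example, with -lambda 0.5 and -mix-lambda2 0.2 (and no further mixture weights), the main model receives weight 0.5, the -mix-lm2 model weight 0.2, and the -mix-lm model the remaining 1 - 0.5 - 0.2 = 0.3.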
- -context-priors file
-
Read context-dependent mixture weight priors from
file.
Each line in
file
should contain a context N-gram (most recent word first) followed by a vector
of mixture weights whose length matches the number of LMs being interpolated.
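For example, when interpolating two LMs, lines in the file (hypothetical contexts and weights) might read:

     the        0.6 0.4
     york new   0.7 0.3

where the second line gives the weights for the context ``new york'' (most recent word first).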
- -bayes length
-
Interpolate models using posterior probabilities
based on the likelihoods of local N-gram contexts of length
length.
The
-lambda
values are used as prior mixture weights in this case.
This option can also be combined with
-context-priors,
in which case the
length
parameter also controls how many words of context are maximally used to look up
mixture weights.
If
-context-priors
is used without
-bayes,
the context length used is set by the
-order
option, and Bayesian interpolation is disabled, as it is when the
scale
parameter (see the next option) is zero.
- -bayes-scale scale
-
Set the exponential scale factor on the context likelihood in conjunction
with the
-bayes
function.
Default value is 1.0.
- -lmw W
-
Scales the language model probabilities by a factor
W.
Default language model weight is 1.
- -mapw W
-
Scales the likelihood map probability by a factor
W.
Default map weight is 1.
Note: For Viterbi decoding (the default) it is equivalent to use
-lmw W
or
-mapw 1/W,
but not for forward-backward computation.
- -tolower1
-
Map input vocabulary (V1) to lowercase, removing case distinctions.
- -tolower2
-
Map output vocabulary (V2) to lowercase, removing case distinctions.
- -keep-unk
-
Do not map unknown input words to the <unk> token.
Instead, output the input word unchanged.
This is like having an implicit default mapping for unknown words to
themselves, except that the word will still be treated as <unk> by the language
model.
Also, with this option the LM is assumed to be open-vocabulary
(the default is closed-vocabulary).
- -vocab-aliases file
-
Reads vocabulary alias definitions from
file,
consisting of lines of the form
alias word
This causes all V2 tokens
alias
to be mapped to
word,
and is useful for adapting mismatched language models.
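For example, an alias file (hypothetical entries) mapping British spellings used as V2 tokens to their American forms might contain:

     colour  color
     theatre theater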
- -no-eos
-
Do not assume that each input line contains a complete sentence.
This prevents end-of-sentence tokens </s> from being appended automatically.
- -continuous
-
Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a sentence.
V2 tokens are output one-per-line.
This option also prevents sentence start/end tokens (<s> and </s>)
from being added to the input.
- -fb
-
Perform forward-backward decoding of the input (V1) token sequence.
Outputs the V2 tokens that have the highest posterior probability,
for each position.
The default is to use Viterbi decoding, i.e., the output is the
V2 sequence with the highest joint posterior probability.
- -fw-only
-
Similar to
-fb,
but uses only the forward probabilities for computing posteriors.
This may be used to simulate on-line prediction of tags, without the
benefit of future context.
- -totals
-
Output the total string probability for each input sentence.
- -posteriors
-
Output the table of posterior probabilities for each
input (V1) token and each V2 token, in the same format as
required for the
-map
file.
If
-fb
is also specified the posterior probabilities will be computed using
forward-backward probabilities; otherwise an approximation will be used
that is based on the probability of the most likely path containing
a given V2 token at a given position.
- -nbest N
-
Output the
N
best hypotheses instead of just the first best when
doing Viterbi search.
If
N>1,
then each hypothesis is prefixed by the tag
NBEST_n x,
where
n
is the rank of the hypothesis in the N-best list and
x
its score, the negative log of the combined probability of transitions
and observations of the corresponding HMM path.
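For example, with -nbest 3 the output for one input sentence might look roughly like this (hypothetical V2 tokens and scores, shown only to illustrate the NBEST_n x prefix convention described above):

     NBEST_1 12.34 w2a w2b w2c
     NBEST_2 13.01 w2a w2d w2c
     NBEST_3 14.57 w2e w2b w2c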
- -write-counts file
-
Outputs the V2-V1 bigram counts corresponding to the tagging performed on
the input data.
If
-fb
was specified these are expected counts, and otherwise they reflect the 1-best
tagging decisions.
- -write-vocab1 file
-
Writes the input vocabulary from the map (V1) to
file.
- -write-vocab2 file
-
Writes the output vocabulary from the map (V2) to
file.
The vocabulary will also include the words specified in the language model.
- -write-map file
-
Writes the map back to a file for validation purposes.
- -debug
-
Sets debugging output level.
BUGS
The
-continuous
and
-text-map
options effectively disable
-keep-unk,
i.e., unknown input words are always mapped to <unk>.
Also,
-continuous
doesn't preserve the positions of escaped input lines relative to
the input.
SEE ALSO
ngram-count(1), ngram(1), hidden-ngram(1), training-scripts(1),
ngram-format(5), classes-format(5).
AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>,
Anand Venkataraman <anand@speech.sri.com>.
Copyright 1995-2007 SRI International