N-gram models try to estimate the probability of a word z in the context of the previous n-1 words (a_), i.e., Pr(z|a_). We will denote this conditional probability using p(a_z) for convenience. One way to estimate p(a_z) is to look at the number of times word z has followed the previous n-1 words (a_):
(1) p(a_z) = c(a_z) / c(a_)

This is known as the maximum likelihood (ML) estimate. Unfortunately it does not work very well because it assigns zero probability to N-grams that have not been observed in the training data. To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams. Such redistribution is known as smoothing or discounting.
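To make this concrete, here is a minimal Python sketch (not SRILM code) of the ML estimate, assuming bigram and context counts are kept in plain dictionaries; the toy corpus and function name are made up for illustration:

    from collections import Counter

    def ml_estimate(ngram_counts, context_counts, ngram):
        """ML estimate: p(a_z) = c(a_z) / c(a_); unseen N-grams get probability 0."""
        context = ngram[:-1]
        if context_counts[context] == 0:
            return 0.0
        return ngram_counts[ngram] / context_counts[context]

    tokens = "the cat sat on the mat".split()
    bigrams = Counter(zip(tokens, tokens[1:]))       # c(a_z)
    contexts = Counter((w,) for w in tokens[:-1])    # c(a_)
    print(ml_estimate(bigrams, contexts, ("the", "cat")))   # 0.5
    print(ml_estimate(bigrams, contexts, ("the", "dog")))   # 0.0 -- the zero-probability problem

The second call shows why smoothing is needed: any N-gram absent from the training data gets probability zero under the ML estimate.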
Most existing smoothing algorithms can be described by the following equation:
(2) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

If the N-gram a_z has been observed in the training data, we use the distribution f(a_z). Typically f(a_z) is discounted to be less than the ML estimate, so we have some leftover probability for the z words unseen in the context (a_). Different algorithms differ mainly in how they discount the ML estimate to get f(a_z).
If the N-gram a_z has not been observed in the training data, we use the lower order distribution p(_z). If the context has never been observed (c(a_) = 0), we can use the lower order distribution directly (bow(a_) = 1). Otherwise we need to compute a backoff weight (bow) to make sure probabilities are normalized:

    Sum_Z p(a_z) = 1
Let Z be the set of all words in the vocabulary, Z0 be the set of all words with c(a_z) = 0, and Z1 be the set of all words with c(a_z) > 0. Given f(a_z), bow(a_) can be determined as follows:
(3)  Sum_Z  p(a_z) = 1
     Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1

     bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
             = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
             = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))
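As an illustration of Equations 2 and 3, here is a small Python sketch (not part of SRILM) that computes bow(a_) from given discounted distributions and then evaluates p(a_z) by backing off; the f() values and the four-word vocabulary are invented for the example:

    def backoff_weight(f_higher, f_lower, context):
        """Eqn. 3: bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))."""
        z1 = [z for (ctx, z) in f_higher if ctx == context]   # words observed after the context
        num = 1.0 - sum(f_higher[(context, z)] for z in z1)
        den = 1.0 - sum(f_lower[z] for z in z1)
        return num / den

    def backoff_prob(f_higher, f_lower, bow, context, z):
        """Eqn. 2: use f(a_z) if the N-gram was seen, otherwise bow(a_) * p(_z)."""
        if (context, z) in f_higher:
            return f_higher[(context, z)]
        return bow[context] * f_lower[z]

    # Invented discounted estimates for one bigram context, "the":
    f_higher = {("the", "cat"): 0.35, ("the", "mat"): 0.35}
    f_lower  = {"cat": 0.25, "mat": 0.25, "dog": 0.25, "the": 0.25}
    bow = {"the": backoff_weight(f_higher, f_lower, "the")}     # (1 - 0.7) / (1 - 0.5) = 0.6
    print(backoff_prob(f_higher, f_lower, bow, "the", "dog"))   # 0.6 * 0.25 = 0.15

Summing p(a_z) over the four-word vocabulary gives exactly 1, which is what the backoff weight is for.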
Smoothing is generally done in one of two ways. The backoff models compute p(a_z) based on the N-gram counts c(a_z) when c(a_z) > 0, and only consider lower order counts c(_z) when c(a_z) = 0. Interpolated models take lower order counts into account when c(a_z) > 0 as well. A common way to express an interpolated model is:
(4) p(a_z) = g(a_z) + bow(a_) p(_z)

where g(a_z) = 0 when c(a_z) = 0, and is discounted to be less than the ML estimate when c(a_z) > 0 to reserve some probability mass for the unseen z words. Given g(a_z), bow(a_) can be determined as follows:
(5)  Sum_Z  p(a_z) = 1
     Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1

     bow(a_) = 1 - Sum_Z1 g(a_z)
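A corresponding Python sketch for the interpolated form of Equations 4 and 5, again with invented g() values rather than any particular discounting method:

    def interp_bow(g_higher, context):
        """Eqn. 5: bow(a_) = 1 - Sum_Z1 g(a_z)."""
        return 1.0 - sum(v for (ctx, _z), v in g_higher.items() if ctx == context)

    def interp_prob(g_higher, p_lower, bow, context, z):
        """Eqn. 4: p(a_z) = g(a_z) + bow(a_) p(_z), with g(a_z) = 0 for unseen N-grams."""
        return g_higher.get((context, z), 0.0) + bow[context] * p_lower[z]

    g_higher = {("the", "cat"): 0.3, ("the", "mat"): 0.3}
    p_lower  = {"cat": 0.25, "mat": 0.25, "dog": 0.25, "the": 0.25}
    bow = {"the": interp_bow(g_higher, "the")}                  # 1 - 0.6 = 0.4
    print(interp_prob(g_higher, p_lower, bow, "the", "cat"))    # 0.3 + 0.4 * 0.25 = 0.4
    print(interp_prob(g_higher, p_lower, bow, "the", "dog"))    # 0.0 + 0.4 * 0.25 = 0.1

Note that seen N-grams also receive a share of the lower order probability, which is the defining difference from the backoff form.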
An interpolated model can also be expressed in the form of equation (2), which is the way it is represented in the ARPA format model files in SRILM:
(6)  f(a_z) = g(a_z) + bow(a_) p(_z)
     p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
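Continuing the previous sketch, Equation 6 amounts to folding the lower order term into the stored probability of every seen N-gram, which is how an interpolated model ends up in the backoff-style representation (values again invented):

    def interpolated_to_backoff(g_higher, p_lower, bow):
        """Eqn. 6: f(a_z) = g(a_z) + bow(a_) p(_z) for every observed N-gram."""
        return {(ctx, z): g + bow[ctx] * p_lower[z]
                for (ctx, z), g in g_higher.items()}

    g_higher = {("the", "cat"): 0.3, ("the", "mat"): 0.3}
    p_lower  = {"cat": 0.25, "mat": 0.25, "dog": 0.25, "the": 0.25}
    bow      = {"the": 0.4}
    print(interpolated_to_backoff(g_higher, p_lower, bow))
    # {('the', 'cat'): 0.4, ('the', 'mat'): 0.4}

After folding, unseen N-grams are still scored as bow(a_) p(_z), exactly as in Equation 2.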
Most algorithms in SRILM have both backoff and interpolated versions. Empirically, interpolated algorithms usually do better than the backoff ones, and Kneser-Ney does better than others.
This section describes the formulation of each discounting option in ngram-count(1). After giving the motivation for each discounting method, we will give expressions for f(a_z) and bow(a_) of Equation 2 in terms of the counts. Note that some counts may not be included in the model file because of the -gtmin options; see Warning 4 in the next section.
Backoff versions are the default but interpolated versions of most models are available using the -interpolate option. In this case we will express g(a_z) and bow(a_) of Equation 4 in terms of the counts as well. Note that the ARPA format model files store the interpolated models and the backoff models the same way using f(a_z) and bow(a_); see Warning 3 in the next section. The conversion between backoff and interpolated formulations is given in Equation 6.
The discounting options may be followed by a digit (1-9) to indicate that only specific N-gram orders be affected. See ngram-count(1) for more details.
With absolute discounting (-cdiscount D), a constant D is subtracted from each observed count:

    f(a_z)  = (c(a_z) - D) / c(a_)
    p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
    bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3

With the -interpolate option we have:
    g(a_z)  = max(0, c(a_z) - D) / c(a_)
    p(a_z)  = g(a_z) + bow(a_) p(_z)  ; Eqn.4
    bow(a_) = 1 - Sum_Z1 g(a_z)       ; Eqn.5
            = D n(a_*) / c(a_)

The suggested discount factor is:
    D = n1 / (n1 + 2*n2)

where n1 and n2 are the total number of N-grams with exactly one and two counts, respectively. Different discounting constants can be specified for different N-gram orders using options -cdiscount1, -cdiscount2, etc.
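A short Python sketch of these absolute-discounting formulas, assuming bigram counts in a dictionary keyed by (context word, word); it only illustrates the equations above, not SRILM's implementation:

    from collections import Counter

    def absolute_discount(bigram_counts, D):
        """f(a_z) = (c(a_z) - D) / c(a_); interpolated bow(a_) = D * n(a_*) / c(a_)."""
        c_ctx = Counter()   # c(a_)
        n_ctx = Counter()   # n(a_*): unique words seen after each context
        for (ctx, _z), c in bigram_counts.items():
            c_ctx[ctx] += c
            n_ctx[ctx] += 1
        f = {(ctx, z): (c - D) / c_ctx[ctx] for (ctx, z), c in bigram_counts.items()}
        bow = {ctx: D * n_ctx[ctx] / c_ctx[ctx] for ctx in c_ctx}
        return f, bow

    def suggested_D(bigram_counts):
        """D = n1 / (n1 + 2*n2) from the counts-of-counts."""
        n1 = sum(1 for c in bigram_counts.values() if c == 1)
        n2 = sum(1 for c in bigram_counts.values() if c == 2)
        return n1 / (n1 + 2 * n2)

    counts = {("the", "cat"): 2, ("the", "mat"): 1, ("a", "dog"): 1}
    D = suggested_D(counts)                  # 2 / (2 + 2*1) = 0.5
    f, bow = absolute_discount(counts, D)
    print(f[("the", "cat")], bow["the"])     # (2 - 0.5) / 3 = 0.5 and 0.5 * 2 / 3 ≈ 0.33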
Kneser-Ney discounting (-kndiscount and -ukndiscount) also subtracts a constant from the observed counts, but estimates the lower order distributions from the number of unique words that precede an N-gram rather than from raw counts:

    f(a_z) = (c(a_z) - D0) / c(a_)    ;; for highest order N-grams
    f(_z)  = (n(*_z) - D1) / n(*_*)   ;; for lower order N-grams

where the n(*_z) notation represents the number of unique N-grams that match a given pattern, with (*) used as a wildcard for a single word. D0 and D1 denote two different discounting constants, since each N-gram order uses its own discounting constant. The resulting conditional probability and backoff weight are calculated as given in equations (2) and (3):
    p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
    bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3

The option -interpolate is used to create the interpolated versions of -kndiscount and -ukndiscount. In this case we have:
    p(a_z) = g(a_z) + bow(a_) p(_z)   ; Eqn.4

Let Z1 be the set {z: c(a_z) > 0}. For highest order N-grams we have:
    g(a_z)  = max(0, c(a_z) - D) / c(a_)
    bow(a_) = 1 - Sum_Z1 g(a_z)
            = 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
            = D n(a_*) / c(a_)

Let Z2 be the set {z: n(*_z) > 0}. For lower order N-grams we have:
    g(_z)  = max(0, n(*_z) - D) / n(*_*)
    bow(_) = 1 - Sum_Z2 g(_z)
           = 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
           = D n(_*) / n(*_*)

The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each N-gram order. These constants are estimated as:
    D = n1 / (n1 + 2*n2)

where n1 and n2 are the total number of N-grams with exactly one and two counts, respectively.
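The distinctive part of Kneser-Ney is the lower order estimate f(_z) = (n(*_z) - D) / n(*_*), which counts how many distinct words precede z rather than how often z occurs. A toy Python sketch (with invented counts) of just that estimate:

    from collections import Counter

    def kn_lower_order(bigram_counts, D):
        """f(_z) = max(0, n(*_z) - D) / n(*_*): based on unique preceding words."""
        n_star_z = Counter(z for (_a, z) in bigram_counts)   # n(*_z)
        n_star_star = sum(n_star_z.values())                 # n(*_*)
        return {z: max(0.0, n - D) / n_star_star for z, n in n_star_z.items()}

    bigrams = {("san", "francisco"): 5, ("new", "york"): 3, ("the", "york"): 1}
    print(kn_lower_order(bigrams, D=0.5))
    # "york" follows two distinct words, "francisco" only one, so "york" gets the
    # larger lower order probability even though "francisco" has the higher raw count.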
The modified Kneser-Ney discounting (-kndiscount) uses three discounting constants for each N-gram order, one for one-count N-grams, one for two-count N-grams, and one for N-grams occurring three or more times:

    Y   = n1 / (n1 + 2*n2)
    D1  = 1 - 2Y (n2/n1)
    D2  = 2 - 3Y (n3/n2)
    D3+ = 3 - 4Y (n4/n3)
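These constants can be computed directly from the counts-of-counts of each N-gram order. A small Python sketch, assuming all of n1 through n4 are nonzero:

    def modified_kn_discounts(ngram_counts):
        """Y = n1/(n1+2*n2); D1 = 1-2Y(n2/n1); D2 = 2-3Y(n3/n2); D3+ = 3-4Y(n4/n3)."""
        n = [0] * 5                           # n[r] for r = 1..4
        for c in ngram_counts.values():
            if 1 <= c <= 4:
                n[c] += 1
        Y = n[1] / (n[1] + 2 * n[2])
        D1 = 1 - 2 * Y * n[2] / n[1]
        D2 = 2 - 3 * Y * n[3] / n[2]
        D3plus = 3 - 4 * Y * n[4] / n[3]
        return D1, D2, D3plus

    counts = {"a b": 1, "b c": 1, "c d": 2, "d e": 2, "e f": 2, "f g": 3, "g h": 4}
    print(modified_kn_discounts(counts))      # (0.25, 1.75, 2.0)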
Witten-Bell discounting (-wbdiscount) makes the weight given to the lower order model proportional to the probability of observing an unseen word in the current context (a_):

    bow(a_) = n(a_*) / (n(a_*) + c(a_))

Here n(a_*) represents the number of unique words following the context (a_) in the training data. Witten-Bell is originally an interpolated discounting method, so with the -interpolate option we get:
    g(a_z) = c(a_z) / (n(a_*) + c(a_))
    p(a_z) = g(a_z) + bow(a_) p(_z)    ; Eqn.4

Without the -interpolate option we have the backoff version, which is implemented by taking f(a_z) to be the same as the interpolated g(a_z):
    f(a_z)  = c(a_z) / (n(a_*) + c(a_))
    p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
    bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
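A Python sketch of the Witten-Bell quantities, computed directly from bigram counts; as above, this only illustrates the formulas, not the SRILM code:

    from collections import Counter

    def witten_bell(bigram_counts):
        """g(a_z) = c(a_z) / (n(a_*) + c(a_)),  bow(a_) = n(a_*) / (n(a_*) + c(a_))."""
        c_ctx = Counter()   # c(a_)
        n_ctx = Counter()   # n(a_*)
        for (ctx, _z), c in bigram_counts.items():
            c_ctx[ctx] += c
            n_ctx[ctx] += 1
        g = {(ctx, z): c / (n_ctx[ctx] + c_ctx[ctx]) for (ctx, z), c in bigram_counts.items()}
        bow = {ctx: n_ctx[ctx] / (n_ctx[ctx] + c_ctx[ctx]) for ctx in c_ctx}
        return g, bow

    counts = {("the", "cat"): 2, ("the", "mat"): 1, ("the", "dog"): 1}
    g, bow = witten_bell(counts)
    print(g[("the", "cat")], bow["the"])   # 2/7 ≈ 0.29 and 3/7 ≈ 0.43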
Ristad's natural discounting law (-ndiscount) uses:

               c(a_z)   c(a_) (c(a_) + 1) + n(a_*) (1 - n(a_*))
    f(a_z)  =  ------ * ---------------------------------------
               c(a_)          c(a_)^2 + c(a_) + 2 n(a_*)

    p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
    bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
Additive smoothing (-addsmooth D) simply adds a constant D to each observed count, where n(*) is the number of words in the vocabulary:

    p(a_z) = (c(a_z) + D) / (c(a_) + D n(*))
Good-Turing discounting estimates that an N-gram which occurs r times in the training data should be treated as if it had occurred r' times:

    r' = (r+1) n[r+1] / n[r]

Here n[r] is the number of N-grams that occur exactly r times in the training data.
N-grams with counts above the gtmax threshold are not discounted, i.e., the ML estimate is used:

    f(a_z) = (c(a_z) / c(a_))    if c(a_z) > gtmax

The lower counts are discounted proportional to the Good-Turing estimate, with a small correction A to account for the high-count N-grams not being discounted. If 1 <= c(a_z) <= gtmax:
                         n[gtmax + 1]
    A = (gtmax + 1) * ----------------
                             n[1]

                                n[c(a_z) + 1]
    c'(a_z) = (c(a_z) + 1) * -----------------
                                 n[c(a_z)]

               c(a_z)   (c'(a_z) / c(a_z) - A)
    f(a_z) =  ------- * ----------------------
               c(a_)            (1 - A)

The -interpolate option has no effect in this case; only a backoff version has been implemented. Thus:
    p(a_z)  = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)     ; Eqn.2
    bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))  ; Eqn.3
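A Python sketch of this Good-Turing scheme, following the formulas above; it assumes the relevant counts-of-counts n[r] are nonzero and that A < 1, which holds for realistically Zipf-like data but not for arbitrary toy counts:

    from collections import Counter

    def good_turing_f(ngram_counts, gtmax):
        """f(a_z) per the formulas above: ML estimate above gtmax, GT-discounted below."""
        n = Counter(ngram_counts.values())      # n[r]: counts-of-counts
        c_ctx = Counter()
        for (ctx, _z), c in ngram_counts.items():
            c_ctx[ctx] += c
        A = (gtmax + 1) * n[gtmax + 1] / n[1]
        f = {}
        for (ctx, z), c in ngram_counts.items():
            if c > gtmax:
                f[(ctx, z)] = c / c_ctx[ctx]    # high counts are not discounted
            else:
                c_prime = (c + 1) * n[c + 1] / n[c]
                f[(ctx, z)] = (c / c_ctx[ctx]) * (c_prime / c - A) / (1 - A)
        return f

    # Toy counts with a Zipf-like counts-of-counts profile for one context:
    counts = {("ctx", f"s{i}"): 1 for i in range(10)}        # ten singletons
    counts.update({("ctx", f"d{i}"): 2 for i in range(4)})   # four doubletons
    counts.update({("ctx", f"t{i}"): 3 for i in range(2)})   # two three-counts
    counts[("ctx", "q0")] = 4                                # one four-count
    f = good_turing_f(counts, gtmax=3)
    print(round(sum(f.values()), 3))   # ≈ 0.643 < 1: the remainder goes to unseen words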
A typical command to count N-grams in a training text is:

    ngram-count -order N -text file.txt -write file.cnt

The -order option determines the maximum length of the N-grams. The file file.txt should contain one sentence per line with tokens separated by whitespace. The output file.cnt contains the N-gram tokens followed by a tab and a count on each line:
    a_z <tab> c(a_z)

A couple of warnings:
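A minimal Python reader for this counts format (one N-gram per line, words separated by spaces, then a tab and the count); the function name is just for illustration:

    from collections import Counter

    def read_counts(path):
        """Read an ngram-count -write file into a Counter keyed by word tuples."""
        counts = Counter()
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                ngram, count = line.rstrip("\n").split("\t")
                counts[tuple(ngram.split())] += int(count)
        return counts

    # counts = read_counts("file.cnt")
    # counts[("the", "cat")]  ->  c("the cat")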
For most smoothing methods (except -count-lm) SRILM generates and uses N-gram model files in the ARPA format. A typical command to generate a model file would be:
    ngram-count -order N -text file.txt -lm file.lm

The ARPA format output file.lm will contain the following information about an N-gram on each line:
    log10(f(a_z)) <tab> a_z <tab> log10(bow(a_z))

Based on Equation 2, the first entry is the base-10 logarithm of the conditional probability (logprob) for the N-gram a_z. It is followed by the actual words in the N-gram separated by spaces. The last, optional entry is the base-10 logarithm of the backoff weight for (N+1)-grams starting with a_z.
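To tie the file format back to Equation 2, here is a small Python sketch (not one of the SRILM tools) that loads the logprobs and backoff weights from such a file and scores an N-gram by backing off; it assumes the tab-separated layout described above and ignores the \data\ header:

    def read_arpa(path):
        """Map each N-gram tuple to its log10 prob, and optionally its log10 bow."""
        probs, bows = {}, {}
        in_ngrams = False
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line.endswith("-grams:"):          # section header, e.g. "\2-grams:"
                    in_ngrams = True
                    continue
                if not line or line.startswith("\\") or not in_ngrams:
                    continue                          # \data\ header, blank lines, \end\
                fields = line.split("\t")
                words = tuple(fields[1].split())
                probs[words] = float(fields[0])       # log10(f(a_z))
                if len(fields) > 2:
                    bows[words] = float(fields[2])    # log10(bow(a_z))
        return probs, bows

    def logprob(probs, bows, ngram):
        """Eqn. 2 in log space: stored logprob if present, else log bow(a_) + backoff."""
        if ngram in probs:
            return probs[ngram]
        if len(ngram) == 1:
            return float("-inf")                      # unknown word
        return bows.get(ngram[:-1], 0.0) + logprob(probs, bows, ngram[1:])

    # probs, bows = read_arpa("file.lm")
    # logprob(probs, bows, ("the", "cat", "sat"))

A missing backoff entry is treated as log10(1) = 0, matching the convention that bow(a_) = 1 when the context has no stored weight.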