Zipf's Law

In the English language, the probability of encountering the $r$th most common word is given roughly by $P(r)=0.1/r$ for $r$ up to 1000 or so. The law must break down for less frequent words: because the harmonic series diverges, the cumulative sum $\sum_{k=1}^{r} 0.1/k$ grows without bound and eventually exceeds 1, which no probability distribution can do. Pierce's (1980, p. 87) statement that $\sum P(r)>1$ for $r=8727$ is incorrect, since the cumulative sum is still below 1 at that rank. Goetz states the law as follows: the frequency of a word is inversely proportional to its rank $r$, such that

\begin{displaymath}
P(r)\approx {1\over r\ln(1.78 R)},
\end{displaymath}

where $R$ is the number of different words.
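
Both forms of the law can be checked numerically. The following is a minimal sketch, not taken from Pierce or Goetz: the function names `overflow_rank` and `goetz_total` are hypothetical, and the choice of sample values for $R$ is an assumption made only for illustration. The first function finds the rank at which the cumulative sum of $0.1/r$ first exceeds 1; the second sums Goetz's form over all $R$ ranks to see how close the total is to 1.

```python
import math


def overflow_rank(c=0.1):
    """Return the smallest n for which sum_{r=1}^{n} c/r exceeds 1.

    This is the rank at which the simple form P(r) = c/r stops being
    a valid probability distribution.
    """
    total, r = 0.0, 0
    while total <= 1.0:
        r += 1
        total += c / r
    return r


def goetz_total(R):
    """Return sum_{r=1}^{R} 1/(r ln(1.78 R)), the total of Goetz's form."""
    norm = math.log(1.78 * R)
    return math.fsum(1.0 / (r * norm) for r in range(1, R + 1))


if __name__ == "__main__":
    # Rank at which the cumulative sum of 0.1/r first exceeds 1.
    print(overflow_rank())
    # Total probability under Goetz's form for a few vocabulary sizes
    # (the sizes are arbitrary illustrative values).
    for R in (1000, 10000, 100000):
        print(R, round(goetz_total(R), 4))
```

With $c=0.1$ the first function should return 12367, the first index at which the harmonic number exceeds 10, which is consistent with the cumulative sum still being below 1 at $r=8727$. The totals from the second function stay very close to 1, which suggests (as an observation, not a claim made by the source) that the constant 1.78 approximates $e^\gamma\approx 1.781$, where $\gamma$ is the Euler-Mascheroni constant, so that $\ln(1.78 R)\approx \ln R+\gamma$ matches the leading behavior of the $R$th harmonic number.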

See also Harmonic Series, Rank (Statistics)


References

Goetz, P. ``Phil's Good Enough Complexity Dictionary.'' http://www.cs.buffalo.edu/~goetz/dict.html.

Pierce, J. R. Introduction to Information Theory: Symbols, Signals, and Noise, 2nd rev. ed. New York: Dover, pp. 86-87 and 238-239, 1980.




© 1996-9 Eric W. Weisstein
1999-05-26