
Friday, August 20, 2004

Laws in Text Documents

The frequency with which different words appear across documents follows Zipf's Law. The law states that the frequency of the $i$-th most frequent word is $1/i^k$ times that of the most frequent word. This implies that in a collection with $n$ word occurrences in total and a vocabulary of $V$ distinct words, the $i$-th most frequent word appears $n/(i^k H_V(k))$ times, where $H_V(k)$ is the harmonic number of order $k$ of $V$, defined as $H_V(k)=\sum_{j=1}^{V} 1/j^k$. Values of $k$ between 1.5 and 2.0 fit real data quite well. Experimental data show that $m/(c+i)^k$ is a better model for the word distribution, where $c$ and $m$ are parameters. This is the Mandelbrot distribution.

The distribution of the number of appearances of a single word over a set of documents is $F(k)=\binom{a+k-1}{k}p^k(1+p)^{-a-k}$, which has the form of a negative binomial distribution. The formula gives the fraction of the document set that contains the word exactly $k$ times. Note that $k$ here is an occurrence count, not the Zipf exponent, and $a$ and $p$ are parameters that depend on the word and the collection. See the Brown Corpus (Frequency Analysis of English Usage).

The number of distinct words (the vocabulary size) appearing in a document set fits Heaps' Law: $V=Kn^b=O(n^b)$, where $K$ and $b$ are constants. In the TREC-2 dataset, $b$ lies in $(0.4, 0.6)$. See (Large text searching allowing errors; Block-addressing indices for approximate text retrieval).
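To make the Zipf and Heaps formulas a bit more concrete, here is a small Python sketch. The function names and the sample values (n, V, k, K, b) are my own illustrative assumptions, not taken from any of the cited datasets, and n is read as the total number of word occurrences.

```python
def generalized_harmonic(V, k):
    """H_V(k) = sum_{j=1}^{V} 1 / j**k (harmonic number of order k of V)."""
    return sum(1.0 / j**k for j in range(1, V + 1))


def zipf_expected_count(i, n, V, k):
    """Expected occurrences of the i-th most frequent word: n / (i**k * H_V(k))."""
    return n / (i**k * generalized_harmonic(V, k))


def heaps_vocabulary(n, K, b):
    """Vocabulary size predicted by Heaps' Law: V = K * n**b."""
    return K * n**b


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    n, V, k = 1_000_000, 50_000, 1.5
    for rank in (1, 2, 10, 100):
        print(rank, zipf_expected_count(rank, n, V, k))

    # K and b are assumed constants; b = 0.5 sits inside the (0.4, 0.6) range.
    print(heaps_vocabulary(n, K=10, b=0.5))
```

Dividing by H_V(k) is what makes the expected counts over all V ranks add up to n. For large vocabularies it would be cheaper to compute the harmonic number once and reuse it than to recompute it on every call as this sketch does.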
