### Laws in Text Documents

The distribution of word frequencies over documents follows Zipf's Law. The law states that the frequency of the *i*-th most frequent word is *1/i^k* times that of the most frequent word. This implies that in a text of *n* words, the *i*-th most frequent word appears *n/(i^k · H_V(k))* times, where *H_V(k)* is the harmonic number of order *k* of the vocabulary size *V*, defined as:

*H_V(k) = \sum_{j=1}^{V} 1/j^k*

Values of *k* in the range 1.5..2.0 fit real data quite well. Experimental data show that *m/(c+i)^k* is an even better model for word distribution, where *c* and *m* are parameters; this is the Mandelbrot distribution.

The distribution of the appearance frequency of a word within a set of documents is the negative binomial distribution:

*F(k) = \binom{a+k-1}{k} p^k (1+p)^{-a-k}*

The formula gives the fraction of the document set that contains the word exactly *k* times, where *a* and *p* are parameters depending on the word and the document set. See the Brown Corpus (Frequency Analysis of English Usage).

The number of distinct words (the vocabulary) appearing in a document set fits Heaps' Law: *V = K·n^b = O(n^b)*. In the TREC-2 dataset, *b ∈ (0.4, 0.6)*. See "Large text searching allowing errors" and "Block-addressing indices for approximate text retrieval".
