Laws in Text Documents
The frequency of words in a text collection follows Zipf's Law: the frequency of the $i$-th most frequent word is $1/i^k$ times that of the most frequent word. This implies that in a text of $n$ words with a vocabulary of $V$ distinct words, the $i$-th most frequent word appears $n/(i^k H_V(k))$ times, where $H_V(k)$ is the harmonic number of order $k$ of $V$, defined as $H_V(k)=\sum_{j=1}^{V} 1/j^k$. Dividing by $H_V(k)$ makes the counts of all $V$ words sum to $n$. Values of $k$ between 1.5 and 2.0 fit real data quite well. Experimental data show that $m/(c+i)^k$, where $c$ and $m$ are parameters, is a better model of the word distribution. This is the Mandelbrot distribution.

The distribution of the number of occurrences of a word across a set of documents is the negative binomial: $F(k)=\binom{a+k-1}{k} p^k (1+p)^{-a-k}$, where $p$ and $a$ are parameters that depend on the word and the collection. Here $k$ is the number of occurrences, not the Zipf exponent. The formula gives the fraction of documents in the set that contain the word exactly $k$ times. See the Brown Corpus (Frequency Analysis of English Usage).

The number of distinct words (the vocabulary size $V$) in a document set of $n$ words follows Heaps' Law: $V=Kn^b=O(n^b)$, where $K$ and $b$ are parameters. In the TREC-2 dataset, $b$ lies between 0.4 and 0.6. See (Large text searching allowing errors; Block-addressing indices for approximate text retrieval).
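The three formulas above can be evaluated directly. The following is a minimal Python sketch, not taken from the cited works: the function names are made up for illustration, and the parameter values in the usage lines are arbitrary assumptions rather than values fitted to any corpus.

```python
from math import exp, lgamma, log


def harmonic(V, k):
    """Harmonic number of order k of V: sum over j = 1..V of 1 / j^k."""
    return sum(1.0 / j**k for j in range(1, V + 1))


def zipf_count(i, n, V, k):
    """Expected occurrences of the i-th most frequent word (Zipf's Law)
    in a text of n words with V distinct words."""
    return n / (i**k * harmonic(V, k))


def neg_binomial_fraction(k, a, p):
    """Fraction of documents containing a word exactly k times.

    Uses log-gamma for the binomial coefficient because the
    parameter a does not have to be an integer.
    """
    log_coeff = lgamma(a + k) - lgamma(k + 1) - lgamma(a)
    return exp(log_coeff + k * log(p) - (a + k) * log(1.0 + p))


def heaps_vocabulary(n, K, b):
    """Vocabulary size predicted by Heaps' Law for a text of n words."""
    return K * n**b


if __name__ == "__main__":
    # Illustrative values only (assumptions, not fitted parameters).
    n, V, k = 100_000, 5_000, 1.5
    for rank in (1, 2, 10, 100):
        print(rank, round(zipf_count(rank, n, V, k), 1))
    print(heaps_vocabulary(n, K=10, b=0.5))
    print(neg_binomial_fraction(0, a=0.5, p=2.0))
```

Two quick sanity checks follow from the definitions: summing `zipf_count` over all ranks from 1 to `V` should give back `n`, and summing `neg_binomial_fraction` over all `k` should give 1.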