Sunday, August 29, 2004
Block analysis of web documents
Current link analysis techniques, such as HITS and PageRank, usually treat a web page as a single semantic unit. Portal pages, however, mix navigation, advertisements and several topic areas, so the single-unit assumption is too idealized to fit real pages. Recently, some work divides each web page into several semantic blocks (or layout blocks, on the assumption that one area of a page belongs to one semantic class) and analyzes the hyperlinks among blocks. There are three papers about block analysis in SIGIR 2004. In Block-level Link Analysis, the authors state that dividing web pages into blocks can improve ranking accuracy remarkably. The idea is quite simple: unlike classic HITS, which finds hub and authority nodes in a bipartite graph (each part is the set of pages to be ranked), it builds a bipartite graph of blocks and pages.
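To make the block-page idea concrete, here is a minimal sketch of a HITS-style iteration on a bipartite graph whose left side is blocks and whose right side is pages. It is my own reading, not the paper's exact formulation: the toy graph `edges`, the constant `n_pages`, and the choice to give hub scores to blocks and authority scores to pages are all assumptions made for illustration.

```ocaml
(* Hypothetical toy graph: edges.(b) lists the pages that block b links to. *)
let edges = [| [0; 1]; [1]; [1; 2] |]
let n_pages = 3

(* Scale a vector to unit Euclidean length (left unchanged if all zero). *)
let normalize v =
  let norm = sqrt (Array.fold_left (fun s x -> s +. x *. x) 0.0 v) in
  if norm > 0.0 then Array.map (fun x -> x /. norm) v else v

(* One round: each page sums the hub scores of the blocks linking to it,
   then each block sums the authority scores of the pages it links to. *)
let step hubs =
  let auth = Array.make n_pages 0.0 in
  Array.iteri
    (fun b targets ->
      List.iter (fun p -> auth.(p) <- auth.(p) +. hubs.(b)) targets)
    edges;
  let auth = normalize auth in
  let hubs' =
    Array.map
      (fun targets -> List.fold_left (fun s p -> s +. auth.(p)) 0.0 targets)
      edges
  in
  (normalize hubs', auth)

(* Run a fixed number of rounds, starting from uniform hub scores. *)
let hits iterations =
  let rec loop i hubs auth =
    if i = 0 then (hubs, auth)
    else
      let hubs', auth' = step hubs in
      loop (i - 1) hubs' auth'
  in
  loop iterations (Array.make (Array.length edges) 1.0) (Array.make n_pages 0.0)

let () =
  let hubs, auth = hits 20 in
  Array.iteri (fun p a -> Printf.printf "page %d authority %.3f\n" p a) auth;
  Array.iteri (fun b h -> Printf.printf "block %d hub %.3f\n" b h) hubs
```

In this toy graph every block links to page 1, so page 1 ends up with the highest authority score; a navigation block and a content block on the same page would get separate hub scores instead of being merged.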
Saturday, August 28, 2004
wedding related
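08-Aug-2004, book the wedding photo shoot
- Choose the photographer, wedding dress, outdoor locations (MARINE BAY, BOTANIC GARDEN, SENTOSA), and poses.
- 安平, false eyelashes, makeup box.
- Transportation, and the question of the film negatives
- Look online for other things to watch out for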
Wednesday, August 25, 2004
Tuesday, August 24, 2004
Programming Camlimages
The core manual is here. Please refer to it for how to program in OCaml with camlimages and other development kits.
An example, 1.ml, that writes an image to a BMP file:
    open Graphics;;
    open Images;; (* before version 2.4, it's Image *)
    open Rgb24;;

    open_graph "";;

    (* Draw random filled circles until a key is pressed. *)
    let sx, sy = size_x (), size_y () in
    let maxr = min (sx / 4) (sy / 4)
    and maxc = 256 in
    while not (key_pressed ()) do
      let x, y, r, color =
        Random.int sx, Random.int sy, Random.int maxr,
        rgb (Random.int maxc) (Random.int maxc) (Random.int maxc) in
      Graphics.set_color color;
      fill_circle x y r
    done;;

    (* Grab the window contents and save them as a BMP file named "ddd". *)
    let img = Graphic_image.image_of (get_image 0 0 (size_x ()) (size_y ()));;
    Bmp.save "ddd" [] (Images.Rgb24 img);;

Here, the third parameter of Bmp.save has type Images.t, while Graphic_image.image_of returns a value of type Rgb24.t. Images.t is a sum type:

    type t =
      | Index8 of Index8.t
      | Rgb24 of Rgb24.t
      | Index16 of Index16.t
      | Rgba32 of Rgba32.t
      | Cmyk32 of Cmyk32.t

Thus, (Images.Rgb24 img) has type Images.t.
To load the needed modules, there are several methods:
- at compile time: ocamlc -I /usr/lib/ocaml/camlimages ci_core.cma graphics.cma ci_graphics.cma ci_bmp.cma 1.ml
- in the top-level: start ocaml, and then enter: #directory "/usr/lib/ocaml/camlimages";; #load "ci_core.cma";; #load "graphics.cma";; #load "ci_graphics.cma";; #load "ci_bmp.cma";;
Furthermore, by adding these directives to the top of 1.ml, the file can be interpreted by ocaml directly.
Friday, August 20, 2004
Laws in Text Documents
The frequency of different words over documents follows Zipf's Law. The law states that the frequency of the $i$-th most frequent word is $1/i^k$ times that of the most frequent word. This implies that in a text of $n$ words with a vocabulary of $V$ distinct words, the $i$-th most frequent word appears $n/(i^k H_V(k))$ times, where $H_V(k)$ is the harmonic number of order $k$ of $V$, defined as $H_V(k)=\sum_{j=1}^{V} 1/j^k$. Values of $k$ between 1.5 and 2.0 fit real data quite well. Experimental data show that $m/(c+i)^k$ is a better model for the word distribution, where $c$ and $m$ are parameters. This is the Mandelbrot distribution.
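As a quick check of the Zipf estimate n/(i^k H_V(k)), the sketch below computes the harmonic number and the expected count for a few ranks. The function names `harmonic` and `zipf_count` and the collection size (one million word occurrences, 50,000 distinct words, k = 1.5) are made up for illustration, not measured data.

```ocaml
(* Generalised harmonic number H_V(k) = sum_{j=1}^{V} 1 / j^k. *)
let harmonic v k =
  let rec go j acc =
    if j > v then acc else go (j + 1) (acc +. 1.0 /. (float_of_int j ** k))
  in
  go 1 0.0

(* Expected number of occurrences of the i-th most frequent word
   in a text of n words with v distinct words. *)
let zipf_count ~n ~v ~k i =
  float_of_int n /. ((float_of_int i ** k) *. harmonic v k)

let () =
  (* Hypothetical collection: sizes chosen only for illustration. *)
  let n = 1_000_000 and v = 50_000 and k = 1.5 in
  List.iter
    (fun i ->
      Printf.printf "rank %d: about %.0f occurrences\n" i
        (zipf_count ~n ~v ~k i))
    [1; 2; 10; 100]
```

Dividing by the harmonic number is what makes the estimates over all V ranks add up to n, so the formula is a proper split of the n word occurrences among the vocabulary.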
The distribution of the number of occurrences of a word across a set of documents is: $F(k)=\binom{a+k-1}{k}p^k(1+p)^{-a-k}$. The formula gives the fraction of documents in the set that contain the word exactly $k$ times; $a$ and $p$ are parameters. See Brown Corpus (Frequency Analysis of English Usage).
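The per-document formula F(k) can be evaluated directly, even when a is not an integer, by writing the binomial coefficient as a product. This is only a sketch: the function name `neg_binomial` and the parameter values a = 0.5, p = 2.0 are assumptions picked for illustration, not values fitted to any corpus.

```ocaml
(* F(k) = C(a+k-1, k) * p^k * (1+p)^(-a-k), with the binomial coefficient
   computed as the product of (a + j - 1) / j for j = 1..k, so that a may
   be any positive real number. *)
let neg_binomial ~a ~p k =
  let rec coeff j acc =
    if j > k then acc
    else coeff (j + 1) (acc *. (a +. float_of_int j -. 1.0) /. float_of_int j)
  in
  coeff 1 1.0 *. (p ** float_of_int k)
  *. ((1.0 +. p) ** (-. a -. float_of_int k))

let () =
  (* Hypothetical parameters, for illustration only. *)
  let a = 0.5 and p = 2.0 in
  List.iter
    (fun k ->
      Printf.printf "fraction of documents with %d occurrences: %.4f\n" k
        (neg_binomial ~a ~p k))
    [0; 1; 2; 5]
```

Since F is a probability distribution over k = 0, 1, 2, ..., the fractions sum to 1, which is a handy sanity check when fitting a and p.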
The number of distinct words (the vocabulary) appearing in a document set fits Heaps' Law: $V=Kn^b=O(n^b)$, where $n$ is the size of the text in words. In the TREC-2 dataset, $b$ is in (0.4, 0.6). See (Large text searching allowing errors; Block-addressing indices for approximate text retrieval).
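The sublinear growth that Heaps' Law predicts is easy to see numerically. In the sketch below the function name `heaps_vocabulary` and the constants K = 10 and b = 0.5 are assumptions chosen for illustration (b only sits inside the range quoted for TREC-2); real values have to be fitted per collection.

```ocaml
(* Heaps' Law estimate V = K * n^b for a text of n words. *)
let heaps_vocabulary ~k ~b n = k *. (float_of_int n ** b)

let () =
  (* Hypothetical constants, for illustration only. *)
  let k = 10.0 and b = 0.5 in
  List.iter
    (fun n ->
      Printf.printf "n = %d words -> about %.0f distinct words\n" n
        (heaps_vocabulary ~k ~b n))
    [10_000; 1_000_000; 100_000_000]
```

With b = 0.5, making the text 100 times larger only makes the vocabulary 10 times larger.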
Thursday, August 19, 2004
Taxonomy of Information Retrieval Models
IR Models
Over the years, alternative modeling paradigms for several classic models have been proposed:
- Adhoc:
  - Boolean
    - Fuzzy
    - Extended Boolean
  - Vector
    - Generalized Vector
    - Latent Semantic Indexing
    - Neural Networks
  - Probabilistic
    - Inference Network
    - Belief Network
  - Boolean
- Filtering:
  - Non-Overlapping Lists
  - Proximal Nodes
Wednesday, August 18, 2004
My favorite HTML tags
Building HTML pages is much easier if one keeps these tags in mind.
For example, UL and LI.
There are three types of LI bullet:
- square
- circle
- disc
To use Latin symbols in HTML, see the specification; this page, for example.
To use icon-like symbols, see the specification.
For a simple reference of HTML4 tags, see here.