Li Zhao's Homepage: 2004-08

Sunday, August 29, 2004

Block analysis of web documents

Current link analysis techniques such as HITS and PageRank usually treat a web page as a single semantic unit. Portal pages, however, mix navigation, advertisements, and content on different topics, so the single-unit assumption is too idealized for real pages. Recent work divides each web page into several semantic blocks (or layout blocks, on the assumption that a given area of the page belongs to one semantic class) and analyzes the hyperlinks among blocks. There are three papers on block analysis in SIGIR 2004. In Block-level Link Analysis, the authors state that dividing web pages into blocks improves ranking accuracy remarkably. The idea is quite simple: unlike classic HITS, which finds hub and authority nodes in a bipartite graph in which each part is the set of pages to be ranked, it builds a bipartite graph of blocks and pages.
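As a rough illustration of the block-page idea, here is a minimal sketch of a HITS-style iteration in which blocks play the hub role and pages play the authority role. It is not the paper's algorithm: the edge-list representation, the L2 normalisation, and the names `normalize`, `block_page_hits`, `n_blocks`, `n_pages` and `iterations` are all assumptions made for illustration.

```ocaml
(* HITS-style iteration on a bipartite block -> page graph.
   Hypothetical sketch: blocks act as hubs, pages as authorities. *)

(* Scale a vector to unit L2 norm; an all-zero vector is returned unchanged. *)
let normalize v =
  let norm = sqrt (Array.fold_left (fun acc x -> acc +. x *. x) 0.0 v) in
  if norm > 0.0 then Array.map (fun x -> x /. norm) v else v

(* edges: list of (block, page) pairs meaning "this block links to this page". *)
let block_page_hits ~n_blocks ~n_pages ~iterations edges =
  let hub = ref (Array.make n_blocks 1.0) in
  let auth = ref (Array.make n_pages 1.0) in
  for _i = 1 to iterations do
    (* Authority of a page: sum of the hub scores of the blocks linking to it. *)
    let a = Array.make n_pages 0.0 in
    List.iter (fun (b, p) -> a.(p) <- a.(p) +. (!hub).(b)) edges;
    auth := normalize a;
    (* Hub score of a block: sum of the authority scores of the pages it links to. *)
    let h = Array.make n_blocks 0.0 in
    List.iter (fun (b, p) -> h.(b) <- h.(b) +. (!auth).(p)) edges;
    hub := normalize h
  done;
  (!hub, !auth)
```

For example, with edges `[(0, 0); (0, 1); (1, 1)]` (block 0 links to pages 0 and 1, block 1 links only to page 1), page 1 ends up with the higher authority score and block 0 with the higher hub score.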

Saturday, August 28, 2004

wedding related

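08-Aug-2004, book the wedding photo shoot

Choose the photographer, wedding dress, outdoor locations (MARINE BAY, BOTANIC GARDEN, SENTOSA), and poses.
Anping (安平), false eyelashes, makeup case.
Transportation, and the question of the negatives.
Search online for other things to watch out for.

28-Aug-2004, necklace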

Some links

These links are here to improve my PR :-D


Wednesday, August 25, 2004

Emacs Tuareg Mode for O'Caml

Emacs Tuareg Mode is quite good as an IDE for O'Caml.

My Calendar!

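High-resolution timing on LINUX & WINDOWS

Edward G. Bradford (IBM) compares high-resolution timing on Linux and Windows: Windows is slightly less precise and slightly less efficient, but both are acceptable.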

Tuesday, August 24, 2004

Programming Camlimages

The Core Manual is here. Please refer to it for how to program O'Caml with camlimages and other development kits. An example, 1.ml, that writes an image to a BMP file:

(* 1.ml: draw random circles until a key is pressed, then save the window as a BMP. *)
open Graphics;;
open Images;; (* before version 2.4, it's Image *)
open Rgb24;;

open_graph "";;

(* Fill the window with randomly coloured circles until a key is pressed. *)
let sx, sy = size_x (), size_y () in
let maxr = min (sx / 4) (sy / 4) and maxc = 256 in
while not (key_pressed ()) do
  let x, y, r, color =
    Random.int sx, Random.int sy, Random.int maxr,
    rgb (Random.int maxc) (Random.int maxc) (Random.int maxc) in
  Graphics.set_color color; fill_circle x y r
done;;

(* Grab the window contents as an Rgb24.t, wrap it as an Images.t, and save it. *)
let img = Graphic_image.image_of (get_image 0 0 (size_x ()) (size_y ()));;
Bmp.save "ddd" [] (Images.Rgb24 img);;

Here, the third parameter of Bmp.save has type Images.t, while Graphic_image.image_of returns a value of type Rgb24.t. Images.t is a sum type:

type t=
   | Index8 of Index8.t
   | Rgb24 of Rgb24.t
   | Index16 of Index16.t
   | Rgba32 of Rgba32.t
   | Cmyk32 of Cmyk32.t

Thus, the type of (Images.Rgb24 img) is Images.t.
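The wrapping step can be tried without camlimages installed. The sketch below uses stand-in modules: `Rgb24`, `Index8`, the `image` type and `describe` are hypothetical, much simpler than the real camlimages definitions, and only show how a constructor turns a value of one module's type into a value of the sum type.

```ocaml
(* Stand-in modules: hypothetical and far simpler than camlimages' real ones. *)
module Rgb24 = struct
  type t = { width : int; height : int }
end

module Index8 = struct
  type t = { width : int; height : int }
end

(* A cut-down sum type in the spirit of Images.t. Constructors and modules
   live in separate namespaces, so they can share names. *)
type image =
  | Index8 of Index8.t
  | Rgb24 of Rgb24.t

(* A function that accepts any image pattern-matches on the constructor. *)
let describe = function
  | Index8 i -> Printf.sprintf "Index8 %dx%d" i.Index8.width i.Index8.height
  | Rgb24 i -> Printf.sprintf "Rgb24 %dx%d" i.Rgb24.width i.Rgb24.height

let () =
  let raw : Rgb24.t = { Rgb24.width = 4; height = 3 } in
  (* Applying the constructor turns an Rgb24.t into an image. *)
  print_endline (describe (Rgb24 raw))
```

Passing `raw` directly to `describe` would be a type error; the constructor application is what makes the types agree, just as `Images.Rgb24 img` does for `Bmp.save`.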
To load needed modules, there are several methods.

At compile time: ocamlc -I /usr/lib/ocaml/camlimages ci_core.cma graphics.cma ci_graphics.cma ci_bmp.cma 1.ml
In the top-level: start ocaml, and then enter: #directory "/usr/lib/ocaml/camlimages";; #load "ci_core.cma";; #load "graphics.cma";; #load "ci_graphics.cma";; #load "ci_bmp.cma";;
Furthermore, by adding these directives to 1.ml, it can be interpreted by ocaml directly.

Friday, August 20, 2004

Laws in Text Documents

The frequency of words in text follows Zipf's Law. The law states that the frequency of the i-th most frequent word is $1/i^k$ times that of the most frequent word. This implies that in a text of $n$ words with a vocabulary of $V$ distinct words, the i-th most frequent word appears $n/(i^k H_V(k))$ times, where $H_V(k)$ is the harmonic number of order $k$ of $V$, defined as $H_V(k)=\sum_{j=1}^{V} 1/j^k$. Values of $k$ between 1.5 and 2.0 fit real data quite well. Experimental data show that $m/(c+i)^k$, where $c$ and $m$ are parameters, is a better model for the word distribution; this is the Mandelbrot distribution.

The distribution of the number of occurrences of a given word across a set of documents is $F(k)=\binom{a+k-1}{k} p^k (1+p)^{-a-k}$, where $a$ and $p$ are parameters and $k$ here denotes an occurrence count rather than the Zipf exponent. The formula gives the fraction of documents in the set that contain the word exactly $k$ times. See Brown Corpus (Frequency Analysis of English Usage).

The number of distinct words (the vocabulary) appearing in a document set fits Heaps' Law: $V=Kn^b=O(n^b)$, where $n$ is the size of the text in words. In the TREC-2 dataset, $b$ lies in (0.4, 0.6). See (Large text searching allowing errors; Block-addressing indices for approximate text retrieval).
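The Zipf and Heaps formulas above are easy to evaluate numerically. The following is a minimal sketch, assuming that n counts word occurrences in the text; the function names `harmonic`, `zipf_count` and `heaps_vocabulary`, and the label `big_k` for the Heaps constant K, are hypothetical.

```ocaml
(* Generalized harmonic number H_V(k) = sum_{j=1}^{V} 1 / j^k. *)
let harmonic v k =
  let rec go j acc =
    if j > v then acc
    else go (j + 1) (acc +. 1.0 /. (float_of_int j ** k))
  in
  go 1 0.0

(* Zipf estimate of how often the i-th most frequent word occurs in a text
   of n words with a vocabulary of v words: n / (i^k * H_V(k)). *)
let zipf_count ~n ~v ~k i =
  float_of_int n /. ((float_of_int i ** k) *. harmonic v k)

(* Heaps estimate of the vocabulary size: V = K * n^b. *)
let heaps_vocabulary ~big_k ~b n =
  big_k *. (float_of_int n ** b)

let () =
  (* Illustrative parameter values only, not measured on any corpus. *)
  Printf.printf "rank 1: %.1f occurrences\n" (zipf_count ~n:1000 ~v:50 ~k:1.5 1);
  Printf.printf "rank 2: %.1f occurrences\n" (zipf_count ~n:1000 ~v:50 ~k:1.5 2);
  Printf.printf "vocabulary: %.0f words\n" (heaps_vocabulary ~big_k:10.0 ~b:0.5 10000)
```

Dividing by $H_V(k)$ is what makes the estimates for ranks 1 to V add up to n, which is a quick sanity check on the formula.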

Posted by Li-Zhao 李钊 at 8:48 AM

Thursday, August 19, 2004

Taxonomy of Information Retrieval Models

IR Models
Over the years, alternative modeling paradigms for several classic models have been proposed:
Adhoc:
  Boolean
    Fuzzy
    Extended Boolean
  Vector
    Generalized Vector
    Latent Semantic Indexing
    Neural Networks
  Probabilistic
    Inference Network
    Belief Network
Filtering:
  Non-Overlapping Lists
  Proximal Nodes

Posted by Li-Zhao 李钊 at 7:09 PM

Wednesday, August 18, 2004

My favorite HTML tags

Building HTML pages is much easier if one keeps these tags in mind, for example UL and LI. There are three types of LI bullet:

square
circle
disc
For more information about these tags, please refer to HTML Codes Tutor for Easy Website Design.
To use Latin symbols in HTML, see the specification; this page is an example.
To use icon-like symbols, see the specification.
For a simple reference of HTML4 tags, see here.

Posted by Li-Zhao 李钊 at 4:09 PM
