
Sunday, August 29, 2004

Block analysis of web documents

Current link analysis techniques such as HITS and PageRank usually treat a web page as a single semantic unit. Portal pages, however, mix navigation, advertisements and several topical areas, so the single-unit assumption is too idealized to fit real pages. Recent work instead divides each web page into several semantic blocks (or layout blocks, on the assumption that a given area of the page belongs to one semantic class) and analyzes the hyperlinks among blocks. There are three papers about block analysis in SIGIR 2004. In Block-level Link Analysis, the authors state that dividing web pages into blocks improves ranking accuracy remarkably. The idea is quite simple: unlike classic HITS, which finds hub and authority nodes in a bipartite graph where each side is the set of pages to be ranked, it builds a bipartite graph of blocks and pages.
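To make the block-page idea concrete, here is a minimal sketch of a plain HITS iteration run on a block-to-page graph, where blocks play the hub role and pages play the authority role. It is only an illustration of the general idea, not the method of the SIGIR paper. The function name block_hits, the links layout (one list of target page indices per block) and the fixed iteration count are all assumptions made for this example.

```ocaml
(* Scale a score vector to unit Euclidean length (left unchanged if all zero). *)
let normalize v =
  let norm = sqrt (Array.fold_left (fun acc x -> acc +. x *. x) 0.0 v) in
  if norm > 0.0 then Array.map (fun x -> x /. norm) v else v

(* Hypothetical helper: links.(b) lists the indices of the pages that block b
   links to. Returns (hub scores of blocks, authority scores of pages). *)
let block_hits ~n_pages ~iters (links : int list array) =
  let hub = ref (Array.make (Array.length links) 1.0) in
  let auth = ref (Array.make n_pages 1.0) in
  for _i = 1 to iters do
    (* Authority of a page: sum of the hub scores of the blocks linking to it. *)
    let a = Array.make n_pages 0.0 in
    Array.iteri
      (fun b pages -> List.iter (fun p -> a.(p) <- a.(p) +. (!hub).(b)) pages)
      links;
    auth := normalize a;
    (* Hub score of a block: sum of the authority scores of the pages it links to. *)
    let h =
      Array.map
        (fun pages -> List.fold_left (fun acc p -> acc +. (!auth).(p)) 0.0 pages)
        links
    in
    hub := normalize h
  done;
  (!hub, !auth)
```

For instance, block_hits ~n_pages:3 ~iters:20 [| [0; 1]; [1] |] describes two blocks, the first linking to pages 0 and 1 and the second linking only to page 1. Page 1 is linked from both blocks, so it ends up with the largest authority score.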

Saturday, August 28, 2004

wedding related

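08-Aug-2004, booked the wedding photos

  • Anping, false eyelashes, makeup case.
  • Transportation, the question of the negatives
  • Look online for other things to watch out for
28-Aug-2004, necklace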

Some links

These links are here to improve my PR :-D

Free Links from Bravenet.com
Site Ring
Ring Owner: zhao li  Site:
Free Site Ring from Bravenet

Tuesday, August 24, 2004

Programming Camlimages

The core manual is here. Please refer to it for how to program in OCaml with camlimages and other development kits. An example, 1.ml, that writes an image to a BMP file:

(* 1.ml: draw random circles until a key is pressed, then save the window as BMP *)
open Graphics;;
open Images;; (* before version 2.4, the module is Image *)
open Rgb24;;

open_graph "";;

let sx, sy = size_x (), size_y () in
let maxr = min (sx / 4) (sy / 4) and maxc = 256 in
while not (key_pressed ()) do
  let x, y, r, color =
    Random.int sx, Random.int sy, Random.int maxr,
    rgb (Random.int maxc) (Random.int maxc) (Random.int maxc) in
  Graphics.set_color color;
  fill_circle x y r
done;;

(* Capture the whole window as an Rgb24.t and save it to the file "ddd" *)
let img = Graphic_image.image_of (get_image 0 0 (size_x ()) (size_y ()));;
Bmp.save "ddd" [] (Images.Rgb24 img);;
Here, the third parameter of Bmp.save has type Images.t, while Graphic_image.image_of returns a value of type Rgb24.t. Images.t is a sum type:
type t =
   | Index8 of Index8.t
   | Rgb24 of Rgb24.t
   | Index16 of Index16.t
   | Rgba32 of Rgba32.t
   | Cmyk32 of Cmyk32.t
Thus, the type of (Images.Rgb24 img) is Images.t.
To load the needed modules, there are several methods.
  1. At compile time: ocamlc -I /usr/lib/ocaml/camlimages ci_core.cma graphics.cma ci_graphics.cma ci_bmp.cma 1.ml
  2. In the top-level: start ocaml, and then: #directory "/usr/lib/ocaml/camlimages";; #load "ci_core.cma";; #load "graphics.cma";; #load "ci_graphics.cma";; #load "ci_bmp.cma";;
    Furthermore, by adding these commands to the top of 1.ml, the file can be run by ocaml directly.

Friday, August 20, 2004

Laws in Text Documents

The frequency distribution of words over documents follows Zipf's Law. The law states that the frequency of the $i$-th most frequent word is $1/i^k$ times that of the most frequent word. This implies that in a text of $n$ words with a vocabulary of $V$ words, the $i$-th most frequent word appears $n/(i^k H_V(k))$ times, where $H_V(k)$ is the harmonic number of order $k$ of $V$, defined as $H_V(k)=\sum_{j=1}^{V} 1/j^k$. Values of $k$ between 1.5 and 2.0 fit real data quite well. Experimental data show that $m/(c+i)^k$ is a better model for the word distribution, where $c$ and $m$ are parameters. This is the Mandelbrot distribution.

The distribution of the number of occurrences of a word across a set of documents is $F(k)=\binom{a+k-1}{k}\,p^k(1+p)^{-a-k}$, where $k$ now denotes an occurrence count rather than the Zipf exponent, and $p$ and $a$ are parameters. The formula gives the fraction of documents in the set that contain the word exactly $k$ times. See Brown Corpus (Frequency Analysis of English Usage).

The number of distinct words (the vocabulary) appearing in a document set of $n$ words fits Heaps' Law: $V=Kn^b=O(n^b)$. In the TREC-2 dataset, $b$ is in $(0.4, 0.6)$. See (Large text searching allowing errors; Block-addressing indices for approximate text retrieval).
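The Zipf and Heaps formulas are easy to try out numerically. The sketch below assumes that n is the total number of word occurrences in the text and V the vocabulary size. The function names (harmonic, zipf_count, heaps_vocab) and the parameter values in the usage part are made up for illustration and are not measured from any corpus.

```ocaml
(* Generalized harmonic number H_V(k) = sum_{j=1}^{V} 1 / j^k. *)
let harmonic v k =
  let rec go j acc =
    if j > v then acc
    else go (j + 1) (acc +. 1.0 /. (float_of_int j ** k))
  in
  go 1 0.0

(* Expected number of occurrences of the i-th most frequent word under
   Zipf's Law, for a text of n words over a vocabulary of v words. *)
let zipf_count ~n ~v ~k i =
  float_of_int n /. ((float_of_int i ** k) *. harmonic v k)

(* Heaps' Law estimate of the vocabulary size: V = K * n^b. *)
let heaps_vocab ~big_k ~b n = big_k *. (float_of_int n ** b)

let () =
  (* Illustrative parameters only. *)
  let n = 1_000_000 and v = 10_000 and k = 1.5 in
  List.iter
    (fun i ->
      Printf.printf "rank %d: about %.0f occurrences\n" i (zipf_count ~n ~v ~k i))
    [1; 2; 10];
  Printf.printf "Heaps estimate of V: about %.0f\n"
    (heaps_vocab ~big_k:10.0 ~b:0.5 n)
```

Because of the division by H_V(k), the counts over all V ranks sum to n, which is a quick sanity check on the formula.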

Thursday, August 19, 2004

Taxonomy of Information Retrieval Models

IR Models
Over the years, alternative modeling paradigms for several classic models have been proposed:

  • Ad hoc:
    • Boolean
      • Fuzzy
      • Extended Boolean
    • Vector
      • Generalized Vector
      • Latent Semantic Indexing
      • Neural Networks
    • Probabilistic
      • Inference Network
      • Belief Network
  • Filtering:
    • Non-Overlapping Lists
    • Proximal Nodes

Wednesday, August 18, 2004

My favorite HTML tags

HTML pages are much easier to build if one keeps these tags in mind. For example, UL and LI. There are three types of LI bullet:

  • square
  • circle
  • disc
For more information about these tags, please refer to HTML Codes Tutor for Easy Website Design.
To use Latin symbols in HTML, see the specification; this page is an example.
To use icon-like symbols, see the specification.
For a simple reference of HTML4 tags, see here.