MarkovPR Documentation

Introduction

The MarkovPR project consists of a collection of programs and datasets designed to permit the study of web document page ranking schemes. The two programs documented on this page are ripper and jack.

An overview of the full project is presented here.

Copyrights and Licenses

The distributed source code and programs are partly based upon code provided freely by Google, Inc., on the occasion of their First Annual Programming Contest, for research and noncommercial purposes. See the file LICENSE.Google for their terms. While Google retains the copyright for its supplied source files, including all modifications to them, the remaining source files (clearly marked) are distributed under the GPL.

The dataset required by the ripper program must be obtained separately from Google, Inc. At the time of writing, the necessary files can be downloaded directly. See instructions in the file README.Google.

Overview of Ripper

General

The ripper reads the (uncompressed) preparsed repository files on STDIN and builds a link graph in memory. Once the graph is built, the ripper enters an interactive mode in which it communicates with the jack program via two pipes.

The program can be compiled by issuing the command

 make all
from within the src directory. An optimized version without debug information is built by typing
 make optimized 
The full documentation can be regenerated by typing
 make doc 
which requires the programs doxygen and graphviz.

After compilation, a typical invocation would be

 bzcat data/pprepos.*.bz2 | src/ripper - 
Various options can be included on the ripper command line. These are listed by running the command
 ripper --help 

The ripper displays some status information on STDERR. For example, once all repositories have been read, it will say so. The user can then invoke the program jack to enter ripper's interactive mode.
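
As a rough illustration of what such an interactive mode could look like, the sketch below reads one command per line from an input stream and writes one reply per line to an output stream. It is not taken from the ripper sources: the function serve_commands, the Handler type, and every command name (including quit) are hypothetical, and the actual command syntax understood by ripper is not described on this page. Generic streams are used so that the same loop could in principle be attached to files, pipes, or in-memory strings.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// A handler receives the remainder of the command line and returns the reply.
// All command names used with this sketch are hypothetical.
using Handler = std::function<std::string(const std::string&)>;

// Read commands line by line from `in` and write one reply line per command
// to `out`. Returns the number of commands that were dispatched to a handler.
std::size_t serve_commands(std::istream& in, std::ostream& out,
                           const std::map<std::string, Handler>& handlers) {
    std::size_t handled = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string name;
        if (!(words >> name)) continue;   // skip blank lines
        if (name == "quit") break;        // hypothetical exit command
        std::string rest;
        std::getline(words >> std::ws, rest);  // arguments, may be empty
        auto it = handlers.find(name);
        if (it == handlers.end()) {
            out << "unknown command: " << name << '\n';
        } else {
            assert(it->second && "handler must be callable");
            out << it->second(rest) << '\n';
            ++handled;
        }
        out.flush();  // the peer on the other end of a pipe waits for this
    }
    return handled;
}
```

Flushing after every reply matters when the output stream is a pipe, since otherwise the reading process may block on buffered data.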

Web link graph construction

The structure of the ripper is roughly as follows. Ripper is responsible for reading the repositories and invoking callback functions for every web document encountered. The callbacks relevant to building the web link graph are handled by the class GraphParseHandler. Converting each parsed document into a WebNode is the job of the class GraphBuilder. Among its data members are the Trie, which allows all encountered URLs to be stored and searched, and the nodetable, a SimpleHashTable<WebNodePtr> which allows a specific WebNode to be found from its URL. This lookup is necessary because, for space reasons, a WebNode is identified by an ID number and *not* directly by a URL.

Once GraphBuilder has produced the WebLinkGraph, the latter is passed on to Talker, which instantiates several WebSampler classes. Talker's main purpose is to read commands from the pipes and execute them, which mostly means running WebSampler classes and supporting functions.
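
The idea of identifying nodes by ID while still being able to find them by URL can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's code: the names Node and LinkGraphSketch are hypothetical, and a single std::unordered_map stands in for both the Trie and the SimpleHashTable<WebNodePtr> described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a WebNode: it is identified by a numeric ID and
// stores its out-links as IDs, so no URL string is kept inside the node.
struct Node {
    std::size_t id;
    std::vector<std::size_t> out_links;
};

// Hypothetical stand-in for the graph builder.
class LinkGraphSketch {
public:
    // Return the ID for a URL, creating a new node the first time it is seen.
    std::size_t intern(const std::string& url) {
        auto it = ids_.find(url);
        if (it != ids_.end()) return it->second;
        std::size_t id = nodes_.size();
        ids_.emplace(url, id);
        nodes_.push_back(Node{id, {}});
        return id;
    }

    // Record one document together with the URLs it links to.
    void add_document(const std::string& url,
                      const std::vector<std::string>& links) {
        std::size_t from = intern(url);
        for (const std::string& target : links) {
            std::size_t to = intern(target);      // may grow nodes_
            nodes_[from].out_links.push_back(to); // so index again here
        }
    }

    const Node& node(std::size_t id) const {
        assert(id < nodes_.size());
        return nodes_[id];
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::unordered_map<std::string, std::size_t> ids_;  // URL -> node ID
    std::vector<Node> nodes_;                           // node ID -> node
};
```

Each URL string is stored once, in the lookup table, and every link costs only one integer. A link target that has not yet been read as a document still receives a node, so the graph can be built in a single pass over the repositories.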

Parallel computation

The ripper/jack programs can be used in conjunction with PVM to perform simulations in parallel. This requires the PVM libraries and include files to be installed. You can compile the ripper/jack programs with the command

 make distributed 
which also requires that the PVM_ROOT environment variable be set (see the installation instructions for PVM). Once the programs are compiled, you can run several copies of ripper on different machines, but only one of them, the master, is allowed to communicate with jack. Here is an example .pvmrc file for two machines, ender and enigma:
add enigma
conf
ps -axl
spawn -enigma ripper --pvm_slave 2 --name darth --temp_dir tmp/ data/pprepos.??
spawn -ender ripper --pvm_master 2 --name palpatine proj/cpp/google_contest/data/pprepos.??
This assumes that on the machine ender, the dataset files (uncompressed) can be found in $HOME/proj/cpp/google_contest/data. This directory is mounted remotely on enigma as the directory $HOME/data. Also, temporary files will be created in /tmp on ender, which should be remotely mounted (r/w) on enigma as $HOME/tmp.

To control the master, you should execute on ender the command

jack --name palpatine
(but wait until palpatine has finished processing the dataset).

Miscellaneous ripper options

The ripper can also be used to simply extract information from the repository files. For example, typing

bzcat ../data/pprepos.*.bz2 | ripper --catlinks -
extracts the link URLs and writes them to STDOUT. Other options which behave similarly include --cat, --caturl, and --catdate.

Overview of Jack

The program jack communicates with ripper via two pipes, whose default names are /tmp/ripper.input and /tmp/ripper.output respectively. Once jack is invoked, the user can type simple interactive commands which ripper will execute. These commands include running various Markov chains on the link graph, saving page ranking calculations to a file, etc. The jack program accepts two command line options which behave identically to their ripper namesakes, namely --temp_dir and --name. If you invoke ripper with either of these options, you should pass the same option to jack.
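
To give a concrete idea of what running a Markov chain on a link graph can mean, here is a minimal sketch of a random-surfer chain that counts how often each page is visited. It is an assumption-laden illustration and not one of the project's WebSampler classes: the names LinkGraph and sample_visits, the jump_prob parameter, and the rule of jumping to a uniformly chosen page from a page without out-links are all choices made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Adjacency list: links[i] holds the IDs of the pages that page i links to.
// Every ID is assumed to be a valid index into the list.
using LinkGraph = std::vector<std::vector<std::size_t>>;

// Simulate a random-surfer chain for `steps` transitions, counting one visit
// for the current page before each transition. With probability `jump_prob`,
// or whenever the current page has no out-links, the surfer jumps to a
// uniformly chosen page; otherwise it follows a uniformly chosen out-link.
std::vector<std::size_t> sample_visits(const LinkGraph& links,
                                       std::size_t start, std::size_t steps,
                                       double jump_prob, unsigned seed) {
    assert(!links.empty() && start < links.size());
    std::mt19937 rng(seed);
    std::bernoulli_distribution jump(jump_prob);
    std::uniform_int_distribution<std::size_t> any_page(0, links.size() - 1);

    std::vector<std::size_t> visits(links.size(), 0);
    std::size_t current = start;
    for (std::size_t i = 0; i < steps; ++i) {
        ++visits[current];
        const std::vector<std::size_t>& out = links[current];
        if (out.empty() || jump(rng)) {
            current = any_page(rng);
        } else {
            std::uniform_int_distribution<std::size_t> pick(0, out.size() - 1);
            current = out[pick(rng)];
        }
    }
    return visits;
}
```

Dividing each count by the number of steps gives an empirical visit frequency, which can serve as a simple page ranking score for long runs.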


Generated on Wed May 29 11:37:14 2002 for MarkovPR by doxygen 1.2.15