The MarkovPR project is a collection of programs and datasets for studying page ranking schemes for web documents. The programs include ripper, which builds the web link graph, and jack, which sends interactive commands to ripper.
The distributed source code and programs are partly based upon the code provided freely by Google, Inc., on the occasion of their First Annual Programming Contest, for research and noncommercial purposes. See their LICENSE.Google. While Google retains the copyright for its supplied source files, including all modifications to them, the remaining source files (clearly marked) are distributed under the GPL.
The dataset required by the ripper program must be obtained separately from Google, Inc. At the time of writing, the necessary files can be downloaded directly. See instructions in the file README.Google.
The ripper reads the (uncompressed) preparsed repository files on STDIN and builds a link graph in-memory. Once the graph is built, the ripper enters an interactive mode which allows it to communicate with the jack program via two pipes.
The program can be compiled by issuing the command
make all
Judging by their names, two further make targets build an optimized version and the documentation:
make optimized
make doc
After compilation, a typical invocation would be
bzcat data/pprepos.*.bz2 | src/ripper -
The available command line options are listed by
ripper --help
The ripper displays some status information on STDERR. For example, once all repositories have been read, it will say so. The user can then invoke the program jack to enter ripper's interactive mode.
The structure of the ripper is roughly the following. Ripper is responsible for reading repositories and invoking callback functions for every web document encountered. The callbacks relevant to building the web link graph are handled by the class GraphParseHandler. Converting each document into a WebNode is the purpose of the class GraphBuilder. Among its data members are the Trie, which allows all encountered URLs to be stored and searched, and the nodetable, a SimpleHashTable<WebNodePtr> that allows a specific WebNode to be found from its URL (this is necessary because, for space reasons, a WebNode is identified by an ID number and *not* directly by a URL). Once GraphBuilder has produced the WebLinkGraph, the latter is passed on to Talker, which instantiates several WebSampler classes. Talker's main purpose is to read commands from the pipes and execute them, which mostly means running the WebSampler classes and their supporting functions.
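The idea of identifying a node by an ID number and resolving URLs through a separate table can be sketched as follows. This is only an illustration and not the project's code: the names Node and LinkGraphSketch are hypothetical, and a std::unordered_map stands in for both the Trie and the SimpleHashTable<WebNodePtr>, whose real interfaces are not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a WebNode: it knows its own ID and the IDs of
// the pages it links to, but never stores a URL string.
struct Node {
    std::size_t id;
    std::vector<std::size_t> out_links;
};

// Hypothetical stand-in for the URL store plus the node table.
class LinkGraphSketch {
public:
    // Return the ID for a URL, creating a fresh node the first time it is seen.
    std::size_t intern(const std::string& url) {
        auto it = ids_.find(url);
        if (it != ids_.end()) return it->second;
        const std::size_t id = nodes_.size();
        ids_.emplace(url, id);
        nodes_.push_back(Node{id, {}});
        return id;
    }

    // Record a link between two URLs, interning both ends first.
    void add_link(const std::string& from, const std::string& to) {
        const std::size_t src = intern(from);
        const std::size_t dst = intern(to);
        nodes_[src].out_links.push_back(dst);
    }

    // Look up a node by URL; returns nullptr if the URL was never seen.
    const Node* find(const std::string& url) const {
        auto it = ids_.find(url);
        return it == ids_.end() ? nullptr : &nodes_[it->second];
    }

    // Look up a node by ID.
    const Node& node(std::size_t id) const {
        assert(id < nodes_.size());
        return nodes_[id];
    }

    std::size_t size() const { return nodes_.size(); }

private:
    std::unordered_map<std::string, std::size_t> ids_;  // URL -> ID
    std::vector<Node> nodes_;                           // ID -> node
};
```

Each URL string is stored once, in the lookup table, while every link costs only one integer. Calling add_link("http://a.example/", "http://b.example/") on an empty LinkGraphSketch creates two nodes and one link between them.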
The ripper/jack programs can be used in conjunction with PVM to perform simulations in parallel. This requires the PVM libraries and include files to be installed. You can compile the ripper/jack programs with the command
make distributed
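The programs can then be started from the PVM console. In the following session, the host names enigma and ender, the instance names, and the paths are specific to the original setup:
add enigma
conf
ps -axl
spawn -enigma ripper --pvm_slave 2 --name darth --temp_dir tmp/ data/pprepos.??
spawn -ender ripper --pvm_master 2 --name palpatine proj/cpp/google_contest/data/pprepos.??
To control the master, execute the following command on ender: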
jack --name palpatine
The ripper can also be used simply to extract information from the repository files, for example by typing
bzcat ../data/pprepos.*.bz2 | ripper --catlinks -
The program jack communicates with ripper via two pipes, whose default names are /tmp/ripper.input and /tmp/ripper.output. Once jack is invoked, the user can type simple interactive commands which ripper will execute. These commands include running various Markov chains on the link graph and saving page ranking calculations to a file. The jack program accepts two command line options, --temp_dir and --name, which behave identically to their ripper namesakes. If you invoke ripper with either option, invoke jack with the same one.
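To give a sense of what running a Markov chain on the link graph means, here is a minimal sketch of the classic random-surfer chain. This document does not say which chains ripper implements, so the chain is chosen purely as an illustration. The function name random_surfer_rank, the adjacency-list type LinkGraph, and the damping parameter are all assumptions and not part of the ripper or jack interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency list: links[i] holds the IDs of the pages that page i links to.
using LinkGraph = std::vector<std::vector<std::size_t>>;

// Apply `steps` transitions of the random-surfer chain, starting from the
// uniform distribution. With probability `damping` the surfer follows a
// random outgoing link; otherwise it jumps to a uniformly random page.
// Pages without outgoing links are treated as linking to every page.
std::vector<double> random_surfer_rank(const LinkGraph& links,
                                       double damping, int steps) {
    const std::size_t n = links.size();
    assert(n > 0 && damping >= 0.0 && damping <= 1.0);
    std::vector<double> rank(n, 1.0 / static_cast<double>(n));
    for (int s = 0; s < steps; ++s) {
        std::vector<double> next(n, 0.0);
        double dangling = 0.0;  // probability mass on pages with no links
        for (std::size_t i = 0; i < n; ++i) {
            if (links[i].empty()) {
                dangling += rank[i];
                continue;
            }
            const double share = rank[i] / static_cast<double>(links[i].size());
            for (std::size_t j : links[i]) {
                assert(j < n);
                next[j] += share;
            }
        }
        for (std::size_t j = 0; j < n; ++j) {
            next[j] = (1.0 - damping) / static_cast<double>(n)
                    + damping * (next[j] + dangling / static_cast<double>(n));
        }
        rank.swap(next);
    }
    return rank;
}
```

For a three-page graph in which pages 0 and 1 link to each other and page 2 links to page 0, calling random_surfer_rank with damping 0.85 and 100 steps ranks page 0 highest and page 2 lowest, because nothing links to page 2. The returned values always sum to one, since every step redistributes the full probability mass.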