#include <urlfilter.h>
Public Methods | |
URLFilter (bool rs) | |
const char * | DeindexURL (const char *anurl) |
const char * | CompressURL (const char *anurl) |
void | ParseURL (const char *anurl, char *schemebuf, char *netlocbuf, char *querybuf, char *paramsbuf, char *pathbuf) |
Decomposes a URL into its components for analysis. | |
void | NormalizeURLPath (char *apath) |
Fixes the path in case the document is of type index.html. | |
ContentType | ClassifyURLPath (const char *path) |
Classifies a file according to its extension. | |
const char * | FormatURL (const char *anurl, int anurl_len, URLComponents *baseurl, ContentType *foundtype) throw (domain_error) |
Private Attributes | |
char | scratchbuf0 [STRINGBUF_LEN0+1] |
char | scratchbuf1 [STRINGBUF_LEN2+1] |
char | scratchbuf2 [STRINGBUF_LEN2+1] |
char | scratchbuf3 [STRINGBUF_LEN2+1] |
char | scratchbuf4 [STRINGBUF_LEN3+1] |
char | scratchbuf5 [STRINGBUF_LEN1+1] |
char | scratchbuf6 [STRINGBUF_LEN1+1] |
char | comp_scratchbuf [STRINGBUF_LEN2+1] |
char | parse_scratchbuf [STRINGBUF_LEN1+1] |
char | deindex_scratchbuf [STRINGBUF_LEN2+1] |
struct { | |
bool remove_html_suffix | |
bool rearrange_components | |
} | flags |
Definition at line 47 of file urlfilter.h.
|
Definition at line 28 of file urlfilter.cc. References flags. |
|
ClassifyURLPath() classifies a file according to its extension.
Definition at line 261 of file urlfilter.cc. References CONTENT_APPLICATION_MS_POWERPOINT, CONTENT_APPLICATION_MSWORD, CONTENT_APPLICATION_PDF, CONTENT_APPLICATION_POSTSCRIPT, CONTENT_APPLICATION_XGZIP, CONTENT_AUDIO_MP3, CONTENT_GOOGLE_OTHER, CONTENT_IMAGE, CONTENT_TEXT_HTML, CONTENT_TEXT_PLAIN, CONTENT_TEXT_RTF, and ContentType. |
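As a rough illustration of extension-based classification, the sketch below maps a path's file extension to a content type. The enum `SketchContentType`, the function `classify_url_path_sketch`, and the extension table are all assumptions made for this example; the actual mapping used by ClassifyURLPath() and the value it returns for unknown extensions are not documented on this page.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for ContentType; only a few categories are modeled.
enum class SketchContentType { TextHtml, TextPlain, ApplicationPdf, Image, Other };

// Hypothetical sketch: classify a path by the extension of its last segment.
SketchContentType classify_url_path_sketch(const std::string &path) {
    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');
    // No dot at all, or the last dot belongs to a directory name.
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return SketchContentType::Other;
    }
    std::string ext = path.substr(dot + 1);
    for (char &ch : ext) {
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    // Assumed extension table, for illustration only.
    if (ext == "html" || ext == "htm") return SketchContentType::TextHtml;
    if (ext == "txt") return SketchContentType::TextPlain;
    if (ext == "pdf") return SketchContentType::ApplicationPdf;
    if (ext == "gif" || ext == "jpg" || ext == "jpeg" || ext == "png") {
        return SketchContentType::Image;
    }
    return SketchContentType::Other;
}

// Usage examples for the sketch.
void classify_url_path_sketch_examples() {
    assert(classify_url_path_sketch("/a/index.html") == SketchContentType::TextHtml);
    assert(classify_url_path_sketch("/docs/paper.PDF") == SketchContentType::ApplicationPdf);
    assert(classify_url_path_sketch("/a.b/readme") == SketchContentType::Other);
}
```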
|
CompressURL() compresses a URL whose characters are guaranteed to fit within seven bits by removing forward slashes, which are the most commonly used character. Every time a slash is removed, the *preceding* character has its eighth bit set. A slash is not removed if the previous character already has its eighth bit set. The compressed URL is always located in the special buffer comp_scratchbuf[]. The string anurl is not modified. Definition at line 107 of file urlfilter.cc. References comp_scratchbuf. Referenced by GraphBuilder::FindLeafNodeKey(), GraphBuilder::FindWebNode(), GraphBuilder::NodeSetURL(), and GraphBuilder::TrieInsertLinkURL(). |
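The slash-folding rule described for CompressURL() can be sketched as follows. This is a minimal illustration, not the code in urlfilter.cc: `compress_url_sketch` is a hypothetical name, it returns a `std::string` instead of writing into comp_scratchbuf[], and keeping a leading slash (which has no preceding character) is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the slash-folding rule.
std::string compress_url_sketch(const std::string &url) {
    std::string out;
    out.reserve(url.size());
    for (unsigned char c : url) {
        assert(c < 0x80 && "input characters must fit within seven bits");
        const bool prev_is_free =
            !out.empty() && (static_cast<unsigned char>(out.back()) & 0x80) == 0;
        if (c == '/' && prev_is_free) {
            // Drop the slash and record it in the eighth bit of the
            // preceding output character.
            out.back() = static_cast<char>(
                static_cast<unsigned char>(out.back()) | 0x80);
        } else {
            // Ordinary character, or a slash that cannot be folded
            // (no preceding character, or its eighth bit is already set).
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

With this sketch, `"a/b/c"` shrinks from five bytes to three, while in `"x//y"` only the first slash is folded because the preceding byte is already marked when the second slash is reached.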
|
DeindexURL() takes a standardized URL (see NormalizeURLPath()) and removes the trailing string /index.htm(l). This is used to compact the string before adding it to the trie (in a trie, common prefixes are harmless, but common suffixes waste space). If the remove_html_suffix flag is set, other common HTML endings are also tokenized to reduce space requirements. Note that this operation is irreversible (the suffix /index.html cannot be reinserted reliably in all cases). The string anurl is not modified. Definition at line 46 of file urlfilter.cc. References deindex_scratchbuf, and flags. Referenced by GraphBuilder::FindLeafNodeKey(), GraphBuilder::FindWebNode(), GraphBuilder::NodeSetURL(), and GraphBuilder::TrieInsertLinkURL(). |
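The basic suffix removal performed by DeindexURL() might look like the sketch below. The name `deindex_url_sketch` is hypothetical, the result is returned as a `std::string` rather than placed in deindex_scratchbuf, and the extra tokenization enabled by remove_html_suffix is not modeled because its rules are not described here.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: strip a trailing "/index.html" or "/index.htm".
std::string deindex_url_sketch(const std::string &url) {
    static const std::string suffixes[] = {"/index.html", "/index.htm"};
    for (const std::string &s : suffixes) {
        if (url.size() >= s.size() &&
            url.compare(url.size() - s.size(), s.size(), s) == 0) {
            return url.substr(0, url.size() - s.size());
        }
    }
    return url;  // nothing to strip
}

// Usage examples for the sketch.
void deindex_url_sketch_examples() {
    assert(deindex_url_sketch("http://a.com/dir/index.html") == "http://a.com/dir");
    assert(deindex_url_sketch("http://a.com/dir/index.htm") == "http://a.com/dir");
    assert(deindex_url_sketch("http://a.com/page.html") == "http://a.com/page.html");
}
```

The examples also show why the operation is irreversible: both `/dir/index.html` and `/dir/index.htm` collapse to the same string.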
|
FormatURL() formats anurl into a standard form. Its most important use is as a completion mechanism for URL fragments such as those found in anchor tags. The URL is completed relative to baseurl, which is typically the current document's URL. The return value is always a pointer to one of the scratch buffers, so copy the returned string before formatting another URL. Definition at line 333 of file urlfilter.cc. References ContentType, and NULL. Referenced by GraphBuilder::FormatURL(), and GraphBuilder::NodeSetURL(). |
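The advice to copy the result of FormatURL() follows from the scratch-buffer design. The sketch below uses a hypothetical stand-in class, `ScratchFormatterSketch`, with a single internal buffer (the real class has several, and the buffer size here is arbitrary) to show how a second call overwrites the memory behind a pointer returned earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a class that returns pointers into its own
// scratch storage; it performs no real URL formatting.
class ScratchFormatterSketch {
public:
    // The returned pointer stays valid only until the next call to Format().
    const char *Format(const char *text) {
        const int n = std::snprintf(buf_, sizeof buf_, "%s", text);
        assert(n >= 0 && static_cast<std::size_t>(n) < sizeof buf_);
        (void)n;  // silence unused-variable warnings when NDEBUG is defined
        return buf_;
    }

private:
    char buf_[256];
};

// Safe pattern: copy the result into storage the caller owns.
std::string format_and_copy_sketch(ScratchFormatterSketch &f, const char *text) {
    return std::string(f.Format(text));
}
```

Holding two raw pointers from consecutive `Format()` calls would leave both pointing at the text of the second call, whereas the copied `std::string` keeps its own contents.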
|
NormalizeURLPath() fixes the path in case the document is of type index.html. It maps paths of the form xxx/ and xxx/index.htm to the standard form xxx/index.html. WARNING: This function modifies the string apath. It is assumed that apath has STRINGBUF_LEN1 storage available. Definition at line 239 of file urlfilter.cc. |
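The two mappings performed by NormalizeURLPath() can be sketched as an in-place edit of a caller-supplied buffer. The names `normalize_url_path_sketch` and `kSketchBufLen` are hypothetical, and the capacity value is a placeholder because the actual value of STRINGBUF_LEN1 is not given on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Placeholder capacity standing in for STRINGBUF_LEN1 (real value unknown).
constexpr std::size_t kSketchBufLen = 1024;

// Hypothetical sketch: rewrite apath in place. The caller must provide at
// least kSketchBufLen bytes of storage.
void normalize_url_path_sketch(char *apath) {
    const std::size_t len = std::strlen(apath);
    const char *append = nullptr;
    if (len > 0 && apath[len - 1] == '/') {
        append = "index.html";  // xxx/ -> xxx/index.html
    } else if (len >= 10 && std::strcmp(apath + len - 10, "/index.htm") == 0) {
        append = "l";           // xxx/index.htm -> xxx/index.html
    }
    if (append != nullptr) {
        const std::size_t extra = std::strlen(append);
        // The grown string plus its terminator must still fit.
        assert(len + extra < kSketchBufLen);
        std::memcpy(apath + len, append, extra + 1);  // copies the NUL too
    }
}
```

Because the path can grow by up to ten characters, passing a pointer to a string literal or to a tightly sized array would be an error with this kind of interface.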
|
ParseURL() decomposes a URL into its components for analysis. Each of the supplied buffers must be STRINGBUF_LEN1 long. This function does not modify anurl. If flags.rearrange_components is true, the network location and file path are rearranged so that the suffix is placed first. Definition at line 150 of file urlfilter.cc. References parse_scratchbuf. Referenced by GraphBuilder::NodeSetURL(). |
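A simplified view of the decomposition done by ParseURL() is sketched below. The struct `UrlPartsSketch` and the function `parse_url_sketch` are hypothetical; the component names follow the buffer arguments (scheme, netloc, path, params, query), the layout `scheme://netloc/path;params?query` is an assumption, and neither fragments nor the reordering controlled by flags.rearrange_components is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical container mirroring the five output buffers of ParseURL().
struct UrlPartsSketch {
    std::string scheme, netloc, path, params, query;
};

// Hypothetical sketch: split "scheme://netloc/path;params?query".
UrlPartsSketch parse_url_sketch(const std::string &url) {
    UrlPartsSketch p;
    std::string rest = url;

    std::size_t pos = rest.find("://");
    if (pos != std::string::npos) {
        p.scheme = rest.substr(0, pos);
        rest = rest.substr(pos + 3);
        // The network location runs up to the first path/params/query marker.
        const std::size_t end = rest.find_first_of("/;?");
        p.netloc = rest.substr(0, end);
        rest = (end == std::string::npos) ? std::string() : rest.substr(end);
    }

    pos = rest.find('?');
    if (pos != std::string::npos) {
        p.query = rest.substr(pos + 1);
        rest.erase(pos);
    }
    pos = rest.find(';');
    if (pos != std::string::npos) {
        p.params = rest.substr(pos + 1);
        rest.erase(pos);
    }
    p.path = rest;
    return p;
}

// Usage example for the sketch.
void parse_url_sketch_examples() {
    const UrlPartsSketch p =
        parse_url_sketch("http://www.example.com/a/b.html;x=1?q=2");
    assert(p.scheme == "http");
    assert(p.netloc == "www.example.com");
    assert(p.path == "/a/b.html");
    assert(p.params == "x=1");
    assert(p.query == "q=2");
}
```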
|
Definition at line 70 of file urlfilter.h. Referenced by CompressURL(). |
|
Definition at line 72 of file urlfilter.h. Referenced by DeindexURL(). |
|
Referenced by DeindexURL(), and URLFilter(). |
|
Definition at line 71 of file urlfilter.h. Referenced by ParseURL(). |
|
Definition at line 76 of file urlfilter.h. |
|
Definition at line 75 of file urlfilter.h. |
|
Definition at line 62 of file urlfilter.h. |
|
Definition at line 63 of file urlfilter.h. |
|
Definition at line 64 of file urlfilter.h. |
|
Definition at line 65 of file urlfilter.h. |
|
Definition at line 66 of file urlfilter.h. |
|
Definition at line 67 of file urlfilter.h. |
|
Definition at line 68 of file urlfilter.h. |