dbacl − a digramic Bayesian classifier for text recognition. |
dbacl [-01dvnirmMNDX] [-T type ] -l category [-h size] [-H gsize] [-x decim] [-q quality] [-w max_order] [-e deftok] [-o online] [-L measure] [-g regex]... [FILE]... |
dbacl [-vnimNRX] [-h size] [-T type] -c category [-c category]... [-f keep]... [FILE]... |
dbacl -V |
dbacl is a Bayesian text document classifier, which uses a maximum entropy (minimum divergence) language model constructed with respect to a digramic reference measure (unknown words are predicted from digrams, i.e. pairs of letters). The following is a general summary of dbacl; see the USAGE section below for more practical examples.

If using the -l command form, dbacl learns a category when given one or more FILE names, which should contain readable ASCII text. If no FILE is given, dbacl learns from STDIN. If FILE is a directory, it is opened and all its files are read, but not its subdirectories. The result is saved in the binary file named category, and completely replaces any previous contents.

If using the -c command form, dbacl attempts to classify the text found in FILE, or STDIN if no FILE is given. Each possible category must be given separately, and should be the file name of a previously learned text corpus. When several (more than one) categories are specified, the text classification is performed by computing the Bayesian posterior probability for each model, given the input text, with a uniform prior distribution on categories. When only a single category is specified, dbacl calculates a score which is the product of the cross entropy and the complexity of the input text (this needs the -n and -v options). A low score (close to zero) indicates a good fit of the category model. The cross entropy measures the average compression rate which is achievable, under the given category, for the features of the input text.

The -X switch can be used when classifying, to output a confidence level in the computed scores. This is a percentage which estimates how likely the score is, assuming that the given category is correct. This is not a substitute for the actual score, and should not be used on its own to decide the best category. A low confidence (e.g. 10% or less) can be interpreted as an "unsure" flag. 
Not all category models support confidence scores, but "email" categories do. If you use the -o switch, then -X is disabled.

By default, dbacl will classify the input text as a whole. However, when using the -f option, dbacl can be used to filter each input line separately, printing only those lines which match one or more models identified by keep. Learning and classifying cannot be mixed in the same command invocation.

By default, in text mode (see the -T text switch) non-alphabetic characters in the input are always ignored, as is word case, unless one or more regular expressions are supplied. In email mode (-T email) the default tokens are more complicated (currently "adp", see the -e option). If the -w option is given, the default features consist of all n-grams up to max_order, where each first order component is tokenized according to the applicable -e option. You can always visualize the tokens by using the -D option while learning. If -w is not given, dbacl assumes max_order equals 1.

The default feature selection is completely overridden, however, if the -g switch is encountered. When supplying one or more regular expressions with the -g option, only those features of the input text or texts which match regex are analysed (with possible overlap between matches), and only those subexpressions of regex which are explicitly tagged with parentheses are used as features in the model (if several subexpressions are tagged simultaneously, the resulting feature is the concatenated string of these tagged expressions, in order of tag appearance). For regular expression syntax, see regex(7). For best results, keep the number of concatenated expressions strictly below 8. When learning with regular expressions, the -D option will print all matched text features, which is very useful to see what is going on.

Learning roughly works by tweaking probabilities until the training data is least surprising. The amount of tweaking can be selected by the -q option. 
More tweaking means longer learning and better accuracy, but usually the fastest setting is close enough. When relearning the same category several times, a significant speedup can be obtained by using the -1 switch, as this allows the previously learned probabilities to be read from the category and reused. Note that classification accuracy depends foremost on the amount and quality of the training samples, and only then on the amount of tweaking. |
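When several categories are given, dbacl selects the one with the highest posterior probability under a uniform prior. The conversion from scores to posteriors can be sketched with awk, under the assumption (for this sketch only) that each score behaves like a negative base-2 log probability; the score line below is made up for illustration:

```shell
# Convert a made-up "dbacl -n" style score line into posterior
# probabilities under a uniform prior. Lower score = better fit,
# so the category with the smaller score gets the larger posterior.
echo "twain 12.0 shake 14.0" | awk '{
  p1 = 2^(-$2); p2 = 2^(-$4)        # unnormalized likelihoods
  printf "%s %.2f %s %.2f\n", $1, p1 / (p1 + p2), $3, p2 / (p1 + p2)
}'
```

Here the score gap of 2 bits translates into a 4:1 posterior ratio in favour of the first category.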
When using the -l command form, dbacl returns zero. When using the -c form, dbacl returns a positive integer corresponding to the category with the highest posterior probability. In case of a tie, the first most probable category is chosen. If an error occurs, dbacl returns zero. |
-0 |
When learning, prevents weight preloading. Normally, dbacl checks if the category file already exists, and if so, tries to use the existing weights as a starting point. This can dramatically speed up learning. If the -0 (zero) switch is set, then dbacl behaves as if no category file already exists. This is mainly useful for testing. This switch is now enabled by default, to protect against weight drift which can reduce accuracy over many learning iterations. Use -1 to force preloading. |
||
-1 |
Force weight preloading if the category file already exists. See discussion of the -0 switch. |
||
-a |
Append scores. Every input line is written to STDOUT and the dbacl scores are appended. This is useful for postprocessing with bayesol(1). For ease of processing, every original input line is indented by a single space (to distinguish them from the appended scores), and the line with the scores (if -n is used) is prefixed with the string "scores ". If a second copy of dbacl needs to read this output later, it should be invoked with the -A switch. |
||
-d |
Dump the model parameters to STDOUT. In conjunction with the -l option, this produces a human-readable summary of the maximum entropy model. In conjunction with the -c option, displays the contribution of each token to the final score. Suppresses all other normal output. |
||
-e |
Select character class for default (not regex-based) tokenization. By default, tokens are alphabetic strings only. This corresponds to the case when deftok is "alpha". Possible values for deftok are "alpha", "alnum", "graph", "cef" and "adp". The last two are custom tokenizers intended for email messages. See also isalpha(3). |
||
-f |
Filter each line of input separately, passing to STDOUT only lines which match the category identified as keep. This option should be used repeatedly for each category which must be kept. keep can be either the category file name, or a positive integer representing the required category in the same order it appears on the command line. |
Output lines are flushed as soon as they are written. If the input file is a pipe or character device, then an attempt is made to use line buffering mode, otherwise the more efficient block buffering is used. |
-g |
Learn only features described by the extended regular expression regex. This overrides the default feature selection method (see -w option) and learns, for each line of input, only tokens constructed from the concatenation of strings which match the tagged subexpressions within the supplied regex. All substrings which match regex within a suffix of each input line are treated as features, even if they overlap on the input line. |
As an optional convenience, regex can include the suffix ||xyz which indicates which parenthesized subexpressions should be tagged. In this case, xyz should consist exclusively of digits 1 to 9, numbering exactly those subexpressions which should be tagged. Alternatively, if no parentheses exist within regex, then it is assumed that the whole expression must be captured. |
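The effect of tagged subexpressions can be previewed with ordinary tools. This sed sketch captures two groups from a line and combines them, roughly what a -g regex with two tagged groups would produce as a single feature (the input line is made up, and the space between the groups is a readability choice; dbacl concatenates the tagged strings itself):

```shell
# Capture the first two alphabetic words of the line and emit them
# as one combined feature, mimicking two tagged subexpressions.
echo "the quick fox" | \
  sed -E 's/^([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+).*/\1 \2/'
```

Using -D while learning shows the features dbacl actually extracted, which is the authoritative way to verify a -g expression.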
-h |
Set the size of the hash table to 2^size elements. When using the -l option, this refers to the total number of features allowed in the maximum entropy model being learned. When using the -c option together with the -M switch and multinomial type categories, this refers to the maximum number of features taken into account during classification. Without the -M switch, this option has no effect. |
||
-i |
Fully internationalized mode. Forces the use of wide characters internally, which is necessary in some locales. This incurs a noticeable performance penalty. |
||
-j |
Make features case sensitive. Normally, all features are converted to lower case during processing, which reduces storage requirements and improves statistical estimates for small datasets. With this option, the original capitalization is used for each feature. This can improve classification accuracy. |
||
-m |
Aggressively maps categories into memory and locks them into RAM to prevent swapping, if possible. This is useful when speed is paramount and memory is plentiful, for example when testing the classifier on large datasets. |
Locking may require relaxing user limits with ulimit(1). Ask your system administrator. Beware when using the -m switch together with the -o switch, as only one dbacl process must learn or classify at a time to prevent file corruption. If no learning takes place, then the -m switch for classifying is always safe to use. See also the discussion for the -o switch. |
-n |
Print scores for each category. Each score is the product of two numbers, the cross entropy and the complexity of the input text under each model. Multiplied together, they represent the log probability that the input resembles the model. To see these numbers separately, use also the -v option. In conjunction with the -f option, stops filtering but prints each input line prepended with a list of scores for that line. |
||
-q |
Select quality of learning, where quality can be 1,2,3,4. Higher values take longer to learn, and should be slightly more accurate. The default quality is 1 if the category file doesn’t exist or weights cannot be preloaded, and 2 otherwise. |
||
-o |
When learning, reads/writes partial token counts so they can be reused. Normally, category files are learned from exactly the input data given, and don’t contain extraneous information. When this option is in effect, some extra information is saved in the file online, after all input has been read. This information can be reread the next time that learning occurs, to continue where the previous dataset left off. If online doesn’t exist, it is created. If online exists, it is read before learning, and updated afterwards. The online file is typically at least three times larger than the learned category. |
In dbacl, file updates are atomic, but if using the -o switch, two or more processes should not learn simultaneously, as only one process will write a lasting category and memory dump. The -m switch can also speed up online learning, but beware of possible corruption. Only one process should read or write a file. |
-r |
Learn the digramic reference model only. Skips the learning of extra features in the text corpus. |
||
-v |
Verbose mode. When learning, print out details of the computation; when classifying, print out the name of the most probable category. In conjunction with the -n option, prints the scores as an explicit product of the cross entropy and the complexity. |
||
-w |
Select default features to be n-grams up to max_order. This is incompatible with the -g option, which always takes precedence. If no -w or -g options are given, dbacl assumes -w 1. Note that n-grams for n greater than 1 can straddle line breaks. If you’re learning single lines of text, it’s more accurate to use the -g option as appropriate. |
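The features considered under -w 2, for instance, can be sketched with awk; within a single line this mimics the unigrams and word bigrams dbacl would pick up (the sample line is made up, and recall that real dbacl n-grams can also straddle line breaks):

```shell
# Print order-1 and order-2 word features for one line of input,
# roughly what "-w 2" considers within a line.
echo "to be or not" | awk '{
  for (i = 1; i <= NF; i++) print $i            # order-1 features
  for (i = 1; i <  NF; i++) print $i, $(i+1)    # order-2 features
}'
```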
||
-x |
Set decimation probability to 1 - 2^(-decim). To reduce memory requirements when learning, some inputs are randomly skipped, and only a few are added to the model. Exact behaviour depends on the applicable -T option (default is -T "text"). When the type is not "email" (e.g. "text"), then individual input features are added with probability 2^(-decim). When the type is "email", then full input messages are added with probability 2^(-decim). Within each such message, all features are used. |
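The add and skip probabilities implied by a given decim value can be checked with awk; for example, -x 3 adds inputs with probability 2^(-3):

```shell
# For -x decim, inputs are added with probability 2^(-decim)
# and skipped with probability 1 - 2^(-decim).
awk 'BEGIN { decim = 3
             printf "add %.3f\n", 2^(-decim)
             printf "skip %.3f\n", 1 - 2^(-decim) }'
```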
||
-A |
Expect indented input and scores. With this switch, dbacl expects input lines to be indented by a single space character (which is then skipped). Lines starting with any other character are ignored. This is the counterpart to the -a switch above. When used together with the -a switch, dbacl outputs the skipped lines as they are, and reinserts the space at the front of each processed input line. |
||
-D |
Print debug output. Do not use normally, but can be very useful for displaying the list of features picked up while learning. |
||
-H |
Allow hash table to grow up to a maximum of 2^gsize elements during learning. Initial size is given by -h option. |
||
-L |
Select the digramic reference measure for character transitions. The measure can be one of "uniform", "dirichlet" or "maxent". Default is "uniform". |
||
-M |
Force multinomial calculations. When learning, forces the model features to be treated multinomially. When classifying, corrects entropy scores to reflect multinomial probabilities (only applicable to multinomial type models, if present). Scores will always be lower, because the ordering of features is lost. |
||
-N |
Print posterior probabilities for each category. This assumes the supplied categories form an exhaustive list of possibilities. In conjunction with the -f option, stops filtering but prints each input line prepended with a summary of the posterior distribution for that line. |
||
-R |
Include an extra category for purely random text. The category is called "random". Only makes sense when using the -c option. |
||
-T |
Specify nonstandard text format. By default, dbacl assumes that the input text is a purely ASCII text file. This corresponds to the case when type is "text". |
There are several types and subtypes which can be used to clean the input text of extraneous tokens before actual learning or classifying takes place. Each (sub)type you wish to use must be indicated with a separate -T option on the command line, and automatically implies the corresponding type.

The "text" type is for unstructured plain text. No cleanup is performed. This is the default if no types are given on the command line.

The "email" type is for mbox format input files or single RFC822 emails. Headers are recognized and most are skipped. To include extra RFC822 standard headers (except for trace headers), use the "email:headers" subtype. To include trace headers, use the "email:theaders" subtype. To include all headers in the email, use the "email:xheaders" subtype. To skip all headers except the subject, use "email:noheaders". To scan binary attachments for strings, use the "email:atts" subtype. When the "email" type is in effect, HTML markup is automatically removed from text attachments, except text/plain attachments. To also remove HTML markup from plain text attachments, use "email:noplain". To prevent HTML markup removal in all text attachments, use "email:plain".

The "html" type is for removing HTML markup (between <html> and </html> tags) and surrounding text. Note that if the "email" type is enabled, then "html" is automatically enabled for compatible message attachments only. The "xml" type is like "html", but doesn’t honour <html> and </html>, and doesn’t interpret tags (so this should more properly be called "angle markup" removal, and has nothing to do with actual XML semantics).

When "html" is enabled, most markup attributes are lost (for values of ’most’ close to ’all’). The "html:links" subtype forces link urls to be parsed and learned, which would otherwise be ignored. The "html:alt" subtype forces parsing of alternative text in ALT attributes and various other tags. The "html:scripts" subtype forces parsing of scripts, "html:styles" forces parsing of styles, "html:forms" forces parsing of form values, while "html:comments" forces parsing of HTML comments. |
-U |
Print (U)nambiguity. When used in conjunction with the -v switch, prints scores followed by their empirical standard deviations. When used alone, prints the best category, followed by an estimated probability that this category choice is unambiguous. More precisely, the probability measures lack of overlap of CLT confidence intervals for each category score (if there is overlap, then there is ambiguity). |
This estimated probability can be used as an "unsure" flag, e.g. if the estimated probability is lower than 50%. Formally, a score of 0% means another category is equally likely to apply to the input, and a score of 100% means no other category is likely to apply to the input. Note that this type of confidence is unrelated to the -X switch. Also, the probability estimate is usually low if the document is short, or if the message contains many tokens that have never been seen before (only applies to uniform digramic measure). |
-V |
Print the program version number and exit. |
||
-X |
Print the confidence in the score calculated for each category, when used together with the -n or -N switch. Prepares the model for confidence scores, when used with the -l switch. The confidence is an estimate of the typicality of the score, assuming the null hypothesis that the given category is correct. When used with the -v switch alone, factorizes the score as the empirical divergence plus the Shannon entropy, multiplied by complexity, in that order. The -X switch is not supported in all possible models, and displays a percentage of "0.0" if it can’t be calculated. Note that for unknown documents, it is quite common to have confidences close to zero. |
To create two category files in the current directory from two ASCII text files named Mark_Twain.txt and William_Shakespeare.txt respectively, type:

% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt

Now you can classify input text, for example:

% echo "howdy" | dbacl -v -c twain -c shake

Note that the -v option is necessary, otherwise dbacl does not print anything. The return value is 1 in the first case, 2 in the second. If you want to print a simple confidence value together with the best category, replace -v with -U.

Suppose a file document.txt contains English text lines interspersed with noise lines. To filter out the noise lines from the English lines, assuming you have an existing category shake say, type:

% dbacl -c shake -f shake -R document.txt > document.txt_eng

Note that the quality of the results will vary depending on how well the categories shake and random represent each input line. It is sometimes useful to see the posterior probabilities for each line without filtering:

% dbacl -c shake -f shake -RN document.txt > document.txt_probs

You can now postprocess the posterior probabilities for each line of text with another script, to replicate an arbitrary Bayesian decision rule of your choice. In the special case of exactly two categories, the optimal Bayesian decision procedure can be implemented for documents as follows: let p1 be the prior probability that the input text is classified as category1. Consequently, the prior probability of classifying as category2 is 1 - p1. Let u12 be the cost of misclassifying a category1 input text as belonging to category2, and vice versa for u21. We assume there is no cost for classifying correctly. Then the following command implements the optimal Bayesian decision: |
% dbacl -n -c category1 -c category2 | awk ’{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else { print $3; } }’ |
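The arithmetic of the awk fragment can be tried offline by substituting literal values for p1, u12 and u21 and feeding it a made-up score line in the "category1 score1 category2 score2" layout that dbacl -n produces:

```shell
# Simulated "dbacl -n" output line; p1=0.5 and u12=u21=1 are
# substituted literally. This only demonstrates the awk arithmetic;
# see the surrounding text for how the priors and costs weight it.
echo "twain 12.1 shake 10.4" | \
  awk '{ if ($2 * 0.5 * 1 > $4 * 0.5 * 1) { print $1 } else { print $3 } }'
```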
dbacl can also be used in conjunction with procmail(1) to implement a simple Bayesian email classification system. Assume that incoming mail should be automatically delivered to one of three mail folders located in $MAILDIR and named work, personal, and spam. Initially, these must be created and filled with appropriate sample emails. A crontab(1) file can be used to learn the three categories once a day, e.g.

CATS=$HOME/.dbacl

To automatically deliver each incoming email into the appropriate folder, the following procmailrc(5) recipe fragment could be used:

CATS=$HOME/.dbacl
# run the spam classifier
# send to the appropriate mailbox
:0:

Sometimes, dbacl will send the email to the wrong mailbox. In that case, the misclassified message should be removed from its wrong destination and placed in the correct mailbox. If it is left in the wrong category, dbacl will learn the wrong corpus statistics.

The default text features (tokens) read by dbacl are purely alphabetic strings, which minimizes memory requirements but can be unrealistic in some cases. To construct models based on alphanumeric tokens, use the -e switch. The example below also uses the optional -D switch, which prints a list of actual tokens found in the document:

% dbacl -e alnum -D -l twain Mark_Twain.txt | less

It is also possible to override the default feature selection method used to learn the category model by means of regular expressions. For example, the following duplicates the default feature selection method in the C locale, while being much slower: |
% dbacl -l twain -g ’^([[:alpha:]]+)’ -g ’[^[:alpha:]]([[:alpha:]]+)’ Mark_Twain.txt |
The category twain which is obtained depends only on single alphabetic words in the text file Mark_Twain.txt (and computed digram statistics for prediction). For a second example, the following command builds a smoothed Markovian (word bigram) model which depends on pairs of consecutive words within each line (but pairs cannot straddle a line break): |
% dbacl -l twain2 -g ’(^|[^[:alpha:]])([[:alpha:]]+)||2’ -g ’(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23’ Mark_Twain.txt |
More general, line-based n-gram models of all orders (up to 7) can be built in a similar way. To construct paragraph-based models, you should reformat the input corpora with awk(1) or sed(1) to obtain one paragraph per line. Line size is limited by available memory, but note that regex performance will degrade quickly for long lines.

Handy tip: create the hidden directory $HOME/.dbacl, and set DBACL_PATH=$HOME/.dbacl in your shell initialization script. Then you can refer to categories without writing full paths. |
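The one-paragraph-per-line reformatting mentioned above can be done with awk's paragraph mode; this sketch uses printf as a stand-in corpus:

```shell
# Collapse each blank-line-separated paragraph onto a single line,
# producing suitable input for paragraph-based dbacl models.
printf 'one\ntwo\n\nthree\nfour\n' | \
  awk 'BEGIN { RS = ""; ORS = "\n" } { gsub(/\n/, " "); print }'
```

Setting RS to the empty string makes awk treat blank-line-separated blocks as records, so each paragraph arrives as one record whose internal newlines are then replaced by spaces.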
When saving category files, dbacl first writes out a temporary file in the same location, and renames it afterwards. If a problem or crash occurs during learning, the old category file is therefore left untouched. This ensures that categories can never be corrupted, no matter how many processes try to simultaneously learn or classify, and means that valid categories are available for classification at any time. When using the -m switch, file contents are memory mapped for speedy reading and writing. This, together with the -o switch, is intended mainly for testing purposes, when tens of thousands of messages must be learned and scored in a laboratory to measure dbacl’s accuracy. Because no file locking is attempted for performance reasons, corruptions are possible, unless you make sure that only one dbacl process reads or writes any file at any given time. |
DBACL_PATH |
When this variable is set, its value is prepended to every category filename which doesn’t start with a ’/’ or a ’.’. |
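The prepending rule can be sketched in shell: names beginning with ’/’ or ’.’ are left alone, everything else gets the prefix. The resolve function and category names here are hypothetical, for illustration only:

```shell
# Mimic how DBACL_PATH is applied to a category filename.
DBACL_PATH=$HOME/.dbacl
resolve() {
  case "$1" in
    /*|.*) printf '%s\n' "$1" ;;                # absolute or dot path: unchanged
    *)     printf '%s\n' "$DBACL_PATH/$1" ;;    # otherwise the prefix is prepended
  esac
}
resolve twain      # resolves inside $DBACL_PATH
resolve ./twain    # left unchanged
```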
INT |
If this signal is caught, dbacl simply exits without doing any cleanup or other operations. This signal can often be sent by pressing Ctrl-C on the keyboard. See stty(1). |
HUP, QUIT, TERM |
If one of these signals is caught, dbacl stops reading input and continues its operation as if no more input was available. This is a way of quitting gracefully, but note that in learning mode, a category file will be written based on the incomplete input. The QUIT signal can often be sent by pressing Ctrl-\ on the keyboard. See stty(1). |
USR1 |
If this signal is caught, dbacl reloads the current categories at the earliest feasible opportunity. This is not normally useful at all, but might be in special cases, such as if the -f switch is invoked together with input from a long running pipe. |
dbacl generated category files are in binary format, and may or may not be portable to systems using a different byte order architecture (this depends on how dbacl was compiled). The -V switch prints out whether categories are portable, or else you can just experiment.

dbacl does not recognize functionally equivalent regular expressions, and in this case duplicate features will be counted several times.

With every learned category, the command line options that were used are saved. When classifying, make sure that every relevant category was learned with the same set of options (regexes are allowed to differ), otherwise behaviour is undefined. There is no need to repeat all the switches when classifying.

Specific documentation about the design of dbacl and the statistical models that it uses can be found in @PKGDATADIR@/doc/dbacl.ps. For a basic overview of text classification using dbacl, see @PKGDATADIR@/doc/tutorial.html. A companion tutorial geared towards email filtering is @PKGDATADIR@/doc/email.html.

If you get many digitization warnings, then you are trying to learn too much data at once, or your model is too complex. dbacl is compiled to save memory by digitizing final weights, but you can disable digitization by editing dbacl.h and recompiling.

dbacl offers several built-in tokenizers (see the -e switch), with more to come in future versions, as the author invents them. While the default tokenizer may evolve, no tokenizer should ever be removed, so that you can always simulate previous dbacl behaviour, subject to bug fixes and architectural changes.

The confidence estimates obtained through the -X switch are underestimates, i.e. they are more conservative than they should be. |
"Ya know, some day scientists are gonna invent something that will outsmart a rabbit." (Robot Rabbit, 1953) |
The source code for the latest version of this program is available at the following location: http://www.lbreyer.com/gpl.html |
Laird A. Breyer <laird@lbreyer.com> |
awk(1), bayesol(1), crontab(1), less(1), mailcross(1), mailfoot(1), mailinspect(1), mailtoe(1), procmailex(5), regex(7), stty(1), sed(1) |