DBACL
NAME
dbacl − a
digramic Bayesian classifier for text recognition.
SYNOPSIS
dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size] [-H gsize]
      [-x decim] [-q quality] [-w max_order] [-e deftok] [-o online]
      [-L measure] [-z fthresh] [-O ronline]... [-g regex]... [FILE]...

dbacl [-vnimNRXYP] [-h size] [-T type] -c category [-c category]...
      [-f keep]... [FILE]...

dbacl -V
OVERVIEW
dbacl is
a Bayesian text and email classifier. When using the
-l switch, it learns a body of text and produces a
file named category which summarizes the text. When
using the -c switch, it compares an input text stream
with any number of category files, and outputs the
name of the closest match, or optionally various numerical
scores explained below.
Whereas this
manual page is intended as a reference, there are several
tutorials and documents you can read to get specialized
information. Specific documentation about the design of
dbacl and the statistical models that it uses can be
found in dbacl.ps. For a basic overview of text
classification using dbacl, see tutorial.html. A
companion tutorial geared towards email filtering is
email.html. If you have trouble getting dbacl to classify
reliably, read is_it_working.html. The USAGE section of this
manual page also has some examples.
@PKGDATADIR@/doc/dbacl.ps
@PKGDATADIR@/doc/tutorial.html
@PKGDATADIR@/doc/email.html
@PKGDATADIR@/doc/is_it_working.html
dbacl
uses a maximum entropy (minimum divergence) language model
constructed with respect to a digramic reference measure
(unknown tokens are predicted from digrams, i.e. pairs of
letters). Practically, this means that a category is
constructed from tokens in the training set, while
previously unseen tokens can be predicted automatically from
their letters. A token here is either a word (fragment) or a
combination of words (fragments), selected according to
various switches. Learning roughly works by tweaking token
probabilities until the training data is least
surprising.
DESCRIPTION
When using the
-l command form, dbacl learns a category when
given one or more FILE names, which should contain readable
ASCII text. If no FILE is given, dbacl
learns from STDIN. If FILE is a directory, it is opened and
all its files are read, but not its subdirectories. The
result is saved in the binary file named category,
and completely replaces any previous contents. As a
convenience, if the environment variable DBACL_PATH contains
a directory, then that is prepended to the file path, unless
category starts with a ’/’ or a
’.’.
The input text
for learning is assumed to be unstructured plain text by
default. This is not suitable for learning email, because
email contains various transport encodings and formatting
instructions which can reduce classification effectiveness.
You must use the -T switch in that case so that
dbacl knows it should perform decoding and filtering
of MIME and HTML as appropriate. Appropriate switch values
are "-T email" for RFC2822 email input, "-T
html" for HTML input, "-T xml" for generic
XML-style input, and "-T text" for the default plain
text format. There are other values of the -T switch
that also allow fine tuning of the decoding
capabilities.
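For example, a minimal sketch of learning an email category from an
mbox file (the file name spam.mbox and the category name spam are
only illustrative) might be:

% dbacl -T email -l spam spam.mbox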
When using the
-c command form, dbacl attempts to classify
the text found in FILE, or STDIN if no FILE is given. Each
possible category must be given separately, and
should be the file name of a previously learned text corpus.
As a convenience, if the variable DBACL_PATH contains a
directory, it is prepended to each file path which
doesn’t start with a ’/’ or a
’.’. The visible output of the classification
depends on the combination of extra switches used. If no
switch is used, then no output is shown on STDOUT. However,
dbacl always produces an exit code which can be
tested.
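As a sketch (the category and message file names are hypothetical),
the exit code can be checked from the shell:

% dbacl -c work -c personal message.txt
% echo $?
2

Here 2 would mean that the second category listed (personal) is the
best match, 1 would mean the first (work), and 0 would indicate an
error.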
To see an
output for a classification, you must use at least one of
the
-v,-U,-n,-N,-D,-d
switches. Sometimes, they can be used in combination to
produce a natural variation of their individual outputs.
Sometimes, dbacl also produces warnings on STDERR if
applicable.
The -v
switch outputs the name of the best category among all the
choices given.
The -U
switch outputs the name of the best category followed by a
confidence percentage. Normally, this is the switch that you
want to use. A percentage of 100% means that dbacl is
sure of its choice, while a percentage of 0% means that some
other category is equally likely. This is not the model
probability, but measures how unambiguous the classification
is, and can be used to tag unsure classifications (e.g. if
the confidence is 25% or less).
The -N
switch prints each category name followed by its (posterior)
probability, expressed as a percentage. The percentages
always sum to 100%. This is intuitive, but only valuable if
the document being classified contains a handful of tokens
(ten or less). In the common case with many more tokens, the
probabilities are always extremely close to 100% and 0%.
The -n
switch prints each category name followed by the negative
logarithm of its probability. This is equivalent to using
the -N switch, but much more useful. The smallest
number gives the best category. A more convenient form is to
use both -n and -v which prints each category
name followed by the cross entropy and the number of tokens
analyzed. The cross entropy measures (in bits) the average
compression rate which is achievable, under the given
category model, per token of input text. If you use all
three of -n,-v,-X then an extra value
is output for each category, representing a kind of p-value
for each category score. This indicates how typical the
score is compared to the training documents, but only works
if the -X switch was used during learning, and only
for some types of models (e.g. email). These p-values are
uniformly distributed and independent (if the categories are
independent), so can be combined using Fisher’s chi
squared test to obtain composite p-values for groupings of
categories.
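A sketch of such an invocation (category and file names are
hypothetical, and this assumes the categories were learned with the
-X switch) is:

% dbacl -n -v -X -c work -c personal message.txt

which would list each category name followed by its cross entropy,
the number of tokens analyzed, and the p-value described above.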
The -v
and -X switches together print each category name
followed by a detailed decomposition of the category score,
factored into ( divergence rate + shannon entropy rate )*
token count @ p-value. Again, this only works in some types
of models.
The -v
and -U switches print each category name followed by
a decomposition of the category score into ( divergence rate
+ shannon entropy rate # score variance )* token count.
The -D
switch prints out the input text as modified internally by
dbacl prior to tokenization. For example, if a MIME
encoded email document is classified, then this prints the
decoded text that will be actually tokenized and classified.
This switch is mainly useful for debugging.
The -d
switch dumps tokens and scores while they are being read. It
is useful for debugging, or if you want to create graphical
representations of the classification. A detailed
explanation of the output is beyond the scope of this manual
page, but is straightforward if you’ve read dbacl.ps.
Possible variations include -d together with
-n or -N.
Classification
can be done with one or several categories in principle.
When two or more categories are used, the Bayesian posterior
probability is used, given the input text, with a uniform
prior distribution on categories. For other choices of
prior, see the companion utility bayesol(1). When a
single category is used, classification can be done by
comparing the score with a threshold. In practice, however,
much better results are obtained with several
categories.
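As a minimal sketch of single-category thresholding (the category
shake, the input file and the cutoff 40 are all arbitrary; a usable
threshold has to be calibrated on known documents):

% dbacl -n -c shake document.txt | awk '{ if ($2 < 40) print "shake"; else print "other" }'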
Learning and
classifying cannot be mixed on the same command invocation;
however, there are no locking issues, and separate
dbacl processes can operate simultaneously with
obvious results, because file operations are designed to be
atomic.
Finally, note
that dbacl does not manage your document corpora or
your computed categories. In particular, dbacl cannot
add or subtract a document from a category file directly. If
you want to learn a category incrementally, the standard way
is to keep adding to your document corpus, and learn the
whole corpus each time. By keeping control of your archives,
you can never lose the information in your categories, and
you can easily experiment with different switches or
tokenizations or sets of training documents if you like.
If the standard
incremental learning method is too slow, the -o
switch can help. This creates a data file named
online which contains all the document statistics
that have been learned. When you use the -l and
-o switches together, dbacl merges the online
data file (if it exists) with the new document(s) to be
learned, and recreates an updated version of online.
This is equivalent to adding the new documents to the corpus
and relearning the whole corpus, but faster. However,
documents cannot be removed if you change your mind. This is
a limitation of dbacl which cannot be changed for
mathematical reasons. You can work around this by making
backups of the online data file. It is also possible
to merge one or more extra online data files
simultaneously by using the -O switch one or more
times.
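A sketch of this workflow (all file names are illustrative):

% dbacl -T email -l spam -o spam.onl old_spam.mbox
% dbacl -T email -l spam -o spam.onl new_spam.mbox

The second command merges the statistics already stored in spam.onl
with the new messages, then rewrites both spam and spam.onl.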
SECONDARY SWITCHES
By default,
dbacl classifies the input text as a whole, ie it
only outputs a single result even if you specify several
input files. If you want to classify multiple input files
you can either call dbacl repeatedly (which is fast
when you use the -m switch), or use the -F
switch, which prints each input FILE followed by the result
for that FILE. Alternatively, you can classify each line of
the input individually, by using the -f option, which
prints only those lines which match one or more models
identified by keep (use the category name or number
to refer to a category). This last switch is useful if you
want to filter out some lines, but note that if the lines
are short, then the error rate can be high.
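For example, to print one classification result per input file
(category and file names are hypothetical):

% dbacl -v -F -c work -c personal report.txt notes.txt

and to keep only those lines of a file which match the first
category:

% dbacl -c work -c personal -f work notes.txt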
The
-e,-w,-g,-j switches are used
for selecting an appropriate tokenization scheme. A token is
a word or word fragment or combination of words or
fragments. The shape of tokens is important because it forms
the basis of the language models used by dbacl. The
-e switch selects a predefined tokenization scheme,
which is speedy but limited. The -w switch specifies
composite tokens derived from the -e switch. For
example, "-e alnum -w 2" means that tokens should
be alphanumeric word fragments combined into overlapping
pairs (bigrams). By default, all tokens are converted to
lowercase, which reduces the number of possible tokens and
therefore memory consumption; the -j switch preserves the
original capitalization (see the OPTIONS section).
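As an example of this kind of tokenization (the corpus file is the
one used in the USAGE section below), overlapping alphanumeric
bigrams could be learned with:

% dbacl -e alnum -w 2 -l twain Mark_Twain.txt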
If the
-g switch is used, you can completely specify what
the tokens should look like using a regular expression.
Several -g switches can be used to construct complex
tokenization schemes, and parentheses within each expression
can be used to select fragments and combine them into
n-grams. The cost of such flexibility is reduced
classification and learning speed. When experimenting with
tokenization schemes, try using the -d or -D
switches while learning or classifying, as they will print
the tokens explicitly so you can see what text fragments are
picked up or missed out. For regular expression syntax, see
regex(7).
The -h
and -H switches regulate how much memory dbacl
may use for learning. Text classification can use a lot of
memory, and by default dbacl limits itself even at
the expense of learning accuracy. In many cases if a limit
is reached, a warning message will be printed on STDERR with
some advice.
When relearning
the same category several times, a significant speedup can
be obtained by using the -1 switch, as this allows
the previously learned probabilities to be read from the
category and reused.
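For instance, relearning an existing category from a slightly grown
corpus (file and category names as in the USAGE section below):

% dbacl -1 -l twain Mark_Twain.txt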
Note that
classification accuracy depends foremost on the amount and
quality of the training samples, and only then on the amount
of tweaking.
EXIT STATUS
The normal shell exit conventions aren’t followed (sorry!).
When using the -l command form, dbacl returns zero on
success and nonzero if an error occurs. When using the -c
form, dbacl returns a positive integer (1,2,3...)
corresponding to the category with the highest posterior
probability. In case of a tie, the first most probable category
is chosen. If an error occurs, dbacl returns zero.
OPTIONS
-0
When learning, prevents weight
preloading. Normally, dbacl checks if the category
file already exists, and if so, tries to use the existing
weights as a starting point. This can dramatically speed up
learning. If the -0 (zero) switch is set, then
dbacl behaves as if no category file already exists.
This is mainly useful for testing. This switch is now
enabled by default, to protect against weight drift which
can reduce accuracy over many learning iterations. Use
-1 to force preloading.

-1
Force weight preloading if the category file already
exists. See discussion of the -0 switch.

-a
Append scores. Every input line is written to STDOUT and
the dbacl scores are appended. This is useful for
postprocessing with bayesol(1). For ease of
processing, every original input line is indented by a
single space (to distinguish them from the appended scores),
and the line with the scores (if -n is used) is
prefixed with the string "scores ". If a second
copy of dbacl needs to read this output later, it
should be invoked with the -A switch.

-d
Dump the model parameters to STDOUT. In conjunction with
the -l option, this produces a human-readable summary
of the maximum entropy model. In conjunction with the
-c option, displays the contribution of each token to
the final score. Suppresses all other normal output.

-e
Select character class for default (not regex-based)
tokenization. By default, tokens are alphabetic strings
only. This corresponds to the case when deftok is
"alpha". Possible values for deftok are
"alpha", "alnum", "graph",
"char", "cef" and "adp". The
last two are custom tokenizers intended for email messages.
See also isalpha(3). The "char" tokenizer
picks up single printable characters rather than bigger
tokens, and is intended for testing only.

-f
Filter each line of input separately, passing to STDOUT
only lines which match the category identified as
keep. This option should be used repeatedly for each
category which must be kept. keep can be
either the category file name, or a positive integer
representing the required category in the same order
it appears on the command line.
Output lines
are flushed as soon as they are written. If the input file
is a pipe or character device, then an attempt is made to
use line buffering mode, otherwise the more efficient block
buffering is used.
-g
Learn only features described by
the extended regular expression regex. This overrides
the default feature selection method (see -w option)
and learns, for each line of input, only tokens constructed
from the concatenation of strings which match the tagged
subexpressions within the supplied regex. All
substrings which match regex within a suffix of each
input line are treated as features, even if they overlap on
the input line.
As an optional
convenience, regex can include the suffix
||xyz which indicates which parenthesized
subexpressions should be tagged. In this case, xyz
should consist exclusively of digits 1 to 9, numbering
exactly those subexpressions which should be tagged.
Alternatively, if no parentheses exist within regex,
then it is assumed that the whole expression must be
captured.
-h
Set the size of the hash table
to 2^size elements. When using the -l option,
this refers to the total number of features allowed in the
maximum entropy model being learned. When using the
-c option together with the -M switch and
multinomial type categories, this refers to the maximum
number of features taken into account during classification.
Without the -M switch, this option has no effect when classifying.

-i
Fully internationalized mode. Forces the use of wide
characters internally, which is necessary in some locales.
This incurs a noticeable performance penalty.

-j
Make features case sensitive. Normally, all features are
converted to lower case during processing, which reduces
storage requirements and improves statistical estimates for
small datasets. With this option, the original
capitalization is used for each feature. This can improve
classification accuracy.

-m
Aggressively maps categories into memory and locks them
into RAM to prevent swapping, if possible. This is useful
when speed is paramount and memory is plentiful, for example
when testing the classifier on large datasets.
Locking may
require relaxing user limits with ulimit(1). Ask your
system administrator. Beware when using the -m switch
together with the -o switch, as only one dbacl
process must learn or classify at a time to prevent file
corruption. If no learning takes place, then the -m
switch for classifying is always safe to use. See also the
discussion for the -o switch.
-n
Print scores for each
category. Each score is the product of two numbers,
the cross entropy and the complexity of the input text under
each model. Multiplied together, they represent the log
probability that the input resembles the model. To see these
numbers separately, use also the -v option. In
conjunction with the -f option, stops filtering but
prints each input line prepended with a list of scores for
that line.

-q
Select quality of learning, where quality
can be 1,2,3,4. Higher values take longer to learn, and
should be slightly more accurate. The default quality
is 1 if the category file doesn’t exist or weights
cannot be preloaded, and 2 otherwise.

-o
When learning, reads/writes partial token counts so they
can be reused. Normally, category files are learned from
exactly the input data given, and don’t contain
extraneous information. When this option is in effect, some
extra information is saved in the file online, after
all input was read. This information can be reread the next
time that learning occurs, to continue where the previous
dataset left off. If online doesn’t exist, it
is created. If online exists, it is read before
learning, and updated afterwards. The file is approximately
3 times bigger (at least) than the learned
category.
In
dbacl, file updates are atomic, but if using the
-o switch, two or more processes should not learn
simultaneously, as only one process will write a lasting
category and memory dump. The -m switch can also
speed up online learning, but beware of possible corruption.
Only one process should read or write a file. This option is
intended primarily for controlled test runs. See also the
-O (big-oh) switch.
-r
Learn the digramic reference
model only. Skips the learning of extra features in the text
corpus.

-v
Verbose mode. When learning, print out details of the
computation, when classifying, print out the name of the
most probable category. In conjunction with the
-n option, prints the scores as an explicit product
of the cross entropy and the complexity.

-w
Select default features to be n-grams up to
max_order. This is incompatible with the -g
option, which always takes precedence. If no -w or
-g options are given, dbacl assumes -w
1. Note that n-grams for n greater than 1 do not straddle
line breaks by default. The -S switch enables line
straddling.

-x
Set decimation probability to 1 - 2^(-decim). To
reduce memory requirements when learning, some inputs are
randomly skipped, and only a few are added to the model.
Exact behaviour depends on the applicable -T option
(default is -T "text"). When the type is
not "email" (eg "text"), then individual
input features are added with probability 2^(-decim).
When the type is "email", then full input messages
are added with probability 2^(-decim). Within each
such message, all features are used.

-z
When learning, only take into account features whose
occurrence count is strictly greater than fthreshold. By
default, fthreshold is zero, so all features in the training
corpus are used. A negative value of fthreshold causes dbacl
to subtract from the maximum observed feature count, and to
use that if it is positive. For example, -z 1 means dbacl
only learns features which occur at least twice in the
corpus, and -z -5 means dbacl only learns the feature(s)
whose occurrence count is within 4 of the global
maximum.
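A full invocation might look like this (the corpus name is
illustrative):

% dbacl -z 1 -l twain Mark_Twain.txt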
-A
Expect indented input and scores. With this switch,
dbacl expects input lines to be indented by a single
space character (which is then skipped). Lines starting with
any other character are ignored. This is the counterpart to
the -a switch above. When used together with the
-a switch, dbacl outputs the skipped lines as
they are, and reinserts the space at the front of each
processed input line.
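As a sketch of chaining two invocations this way (category and file
names are hypothetical), the scores appended by the first copy are
skipped by the second copy, which appends its own:

% dbacl -a -n -c work -c personal message.txt | dbacl -A -a -n -c spam -c notspam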
-D
Print debug output. Do not use normally, but can be very
useful for displaying the list of features picked up while
learning.

-F
For each FILE of input, print the FILE name followed by
the classification result (normally dbacl only prints
a single result even if multiple files are listed as
input).

-H
Allow hash table to grow up to a maximum of
2^gsize elements during learning. Initial size is
given by the -h option.

-L
Select the digramic reference measure for character
transitions. The measure can be one of
"uniform", "dirichlet" or
"maxent". Default is "uniform".

-M
Force multinomial calculations. When learning, forces
the model features to be treated multinomially. When
classifying, corrects entropy scores to reflect multinomial
probabilities (only applicable to multinomial type models,
if present). Scores will always be lower, because the
ordering of features is lost.

-N
Print posterior probabilities for each category.
This assumes the supplied categories form an exhaustive list
of possibilities. In conjunction with the -f option,
stops filtering but prints each input line prepended with a
summary of the posterior distribution for that line.

-O
This switch causes the online data file named
ronline to be merged during learning. The
ronline file must be created using the -o (little-oh)
switch. Several -O data files can be merged simultaneously.
This is intended to be a read only version of -o, to allow
piecing together of several sets of preparsed data. See the
description of the -o switch.
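For example, merging two previously saved online files into a new
learning run (all names illustrative):

% dbacl -T email -l spam -o spam.onl -O spam_jan.onl -O spam_feb.onl extra_spam.mbox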
-R
Include an extra category for purely random text. The
category is called "random". Only makes sense when
using the -c option.

-P
Correct the category scores to include estimated prior
probabilities. The prior probability estimate for each
category is proportional to the number of documents or, if
that doesn’t make sense, the number of unique
features. This can help with "balancing" when one
category is learned from much more data than another. If all
categories are learned from approximately the same amount of
data (or maybe within a factor of 2), then this option
should have little qualitative effect.

-S
Enable line straddling. This is useful together with the
-w option to allow n-grams for n > 1 to ignore
line breaks, so a complex token can continue past the end of
the line. This is not recommended for email.

-T
Specify nonstandard text format. By default,
dbacl assumes that the input text is a purely
ASCII text file. This corresponds to the case
when type is "text".
There are
several types and subtypes which can be used to clean the
input text of extraneous tokens before actual learning or
classifying takes place. Each (sub)type you wish to use must
be indicated with a separate -T option on the command
line, and automatically implies the corresponding type.
The
"text" type is for unstructured plain text. No
cleanup is performed. This is the default if no types are
given on the command line.
The
"email" type is for mbox format input files or
single RFC822 emails. Headers are recognized and most are
skipped. To include extra RFC822 standard headers (except
for trace headers), use the "email:headers"
subtype. To include trace headers, use the
"email:theaders" subtype. To include all headers
in the email, use the "email:xheaders" subtype. To
skip all headers, except the subject, use
"email:noheaders". To scan binary attachments for
strings, use the "email:atts" subtype.
When the
"email" type is in effect, HTML markup is
automatically removed from text attachments except
text/plain attachments. To also remove HTML markup from
plain text attachments, use "email:noplain". To
prevent HTML markup removal in all text attachments, use
"email:plain".
The
"html" type is for removing HTML markup (between
<html> and </html> tags) and surrounding text.
Note that if the "email" type is enabled, then
"html" is automatically enabled for compatible
message attachments only.
The
"xml" type is like "html", but
doesn’t honour <html> and </html>, and
doesn’t interpret tags (so this should be more
properly called "angle markup" removal, and has
nothing to do with actual XML semantics).
When
"html" is enabled, most markup attributes are lost
(for values of ’most’ close to
’all’). The "html:links" subtype
forces link urls to be parsed and learned, which would
otherwise be ignored. The "html:alt" subtype
forces parsing of alternative text in ALT attributes and
various other tags. The "html:scripts" subtype
forces parsing of scripts, "html:styles" forces
parsing of styles, "html:forms" forces parsing of
form values, while "html:comments" forces parsing
of HTML comments.
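As an example (the corpus name is hypothetical), an email category
could be learned while also keeping standard headers and link URLs:

% dbacl -T email -T email:headers -T html:links -l work work.mbox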
-U
Print (U)nambiguity. When used
in conjunction with the -v switch, prints scores
followed by their empirical standard deviations. When used
alone, prints the best category, followed by an estimated
probability that this category choice is unambiguous. More
precisely, the probability measures lack of overlap of CLT
confidence intervals for each category score (If there is
overlap, then there is ambiguity).
This estimated
probability can be used as an "unsure" flag, e.g.
if the estimated probability is lower than 50%. Formally, a
score of 0% means another category is equally likely to
apply to the input, and a score of 100% means no other
category is likely to apply to the input. Note that this
type of confidence is unrelated to the -X switch.
Also, the probability estimate is usually low if the
document is short, or if the message contains many tokens
that have never been seen before (only applies to uniform
digramic measure).
-V
Print the program version number
and exit.

-W
Like -w, but prevents features from straddling newlines.
See the description of -w.

-X
Print the confidence in the score calculated for each
category, when used together with the -n or
-N switch. Prepares the model for confidence scores,
when used with the -l switch. The confidence is an
estimate of the typicality of the score, assuming the null
hypothesis that the given category is correct. When used
with the -v switch alone, factorizes the score as the
empirical divergence plus the shannon entropy, multiplied by
complexity, in that order. The -X switch is not
supported in all possible models, and displays a percentage
of "0.0" if it can’t be calculated. Note
that for unknown documents, it is quite common to have
confidences close to zero.

-Y
Print the cumulative media counts. Some tokenizers
include a medium variable with each token: for example, in
email classification the word "the" can appear in
the subject or the body of a message, but the subject is
counted as a separate medium from the body. This allows the
token frequencies to be kept separate, even though the word
is the same. Currently, up to 16 different media are
supported (0-15), with the following interpretation for
email:
0 unused.
1 default medium.
2 mail body or attachment in HTML format.
3 mail body or attachment in plain text format.
4 mail header unknown.
5 User-Agent, Comments, Keywords, Note
6 X-MS*, Categor*, Priority, Importance, Thread-*
7 X-*
8 List-*
9 MIME-Version, Content-*
10 Subject
11 To
12 Sender, Sent, BCC, CC, From
13 Resent-*, Original-*
14 Message-ID, References, In-Reply-To
15 Received, Return-Path, Return-Receipt-To, Reply-To
The -Y
switch prints the number of tokens observed in each separate
medium, in order from 0 to 15.
USAGE
To create two
category files in the current directory from two
ASCII text files named Mark_Twain.txt and
William_Shakespeare.txt respectively, type:
% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt
Now you can
classify input text, for example:
% echo "howdy" | dbacl -v -c twain -c shake
twain
% echo "to be or not to be" | dbacl -v -c twain -c shake
shake
Note that at least the -v option is necessary; otherwise
dbacl does not print anything. The return value is 1
in the first case, 2 in the second.
% echo "to be or not to be" | dbacl -v -N -c twain -c shake
twain 22.63% shake 77.37%
% echo "to be or not to be" | dbacl -v -n -c twain -c shake
twain 7.04 * 6.0 shake 6.74 * 6.0
These
invocations are equivalent. The numbers 6.74 and 7.04
represent how close the average token is to each category,
and 6.0 is the number of tokens observed. If you want to
print a simple confidence value together with the best
category, replace -v with -U.
% echo "to be or not to be" | dbacl -U -c twain -c shake
shake # 34%
Note that the
true probability of category shake versus category
twain is 77.37%, but the calculation is somewhat
ambiguous, and 34% is the confidence out of 100% that the
calculation is qualitatively correct.
Suppose a file
document.txt contains English text lines interspersed with
noise lines. To filter out the noise lines from the English
lines, assuming you have an existing category shake say,
type:
% dbacl -c shake -f shake -R document.txt > document.txt_eng
% dbacl -c shake -f random -R document.txt > document.txt_rnd
Note that the
quality of the results will vary depending on how well the
categories shake and random represent each input line. It is
sometimes useful to see the posterior probabilities for each
line without filtering:
% dbacl -c shake -f shake -RN document.txt > document.txt_probs
You can now
postprocess the posterior probabilities for each line of
text with another script, to replicate an arbitrary Bayesian
decision rule of your choice.
In the special
case of exactly two categories, the optimal Bayesian
decision procedure can be implemented for documents as
follows: let p1 be the prior probability that the
input text is classified as category1. Consequently,
the prior probability of classifying as category2 is
1 - p1. Let u12 be the cost of misclassifying
a category1 input text as belonging to
category2 and vice versa for u21. We assume
there is no cost for classifying correctly. Then the
following command implements the optimal Bayesian
decision:
% dbacl -n -c category1 -c category2 | \
    awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } else { print $3; } }'
dbacl
can also be used in conjunction with procmail(1) to
implement a simple Bayesian email classification system.
Assume that incoming mail should be automatically delivered
to one of three mail folders located in $MAILDIR and named
work, personal, and spam. Initially,
these must be created and filled with appropriate sample
emails. A crontab(1) file can be used to learn the
three categories once a day, e.g.
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
To
automatically deliver each incoming email into the
appropriate folder, the following procmailrc(5)
recipe fragment could be used:
CATS=$HOME/.dbacl
# run the spam classifier
:0 c
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

# send to the appropriate mailbox
:0:
* ? test -n "$YAY"
$MAILDIR/$YAY
:0:
$DEFAULT
Sometimes,
dbacl will send the email to the wrong mailbox. In
that case, the misclassified message should be removed from
its wrong destination and placed in the correct mailbox. The
error will be corrected the next time your messages are
learned. If it is left in the wrong category, dbacl
will learn the wrong corpus statistics.
The default
text features (tokens) read by dbacl are purely
alphabetic strings, which minimizes memory requirements but
can be unrealistic in some cases. To construct models based
on alphanumeric tokens, use the -e switch. The
example below also uses the optional -D switch, which
prints a list of actual tokens found in the document:
% dbacl -e alnum -D -l twain Mark_Twain.txt | less
It is also
possible to override the default feature selection method
used to learn the category model by means of regular
expressions. For example, the following duplicates the
default feature selection method in the C locale, while
being much slower:
% dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt
The category
twain which is obtained depends only on single alphabetic
words in the text file Mark_Twain.txt (and computed digram
statistics for prediction). For a second example, the
following command builds a smoothed Markovian (word bigram)
model which depends on pairs of consecutive words within
each line (but pairs cannot straddle a line break):
% dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' \
    -g '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt
More general,
line based, n-gram models of all orders (up to 7) can be
built in a similar way. To construct paragraph based models,
you should reformat the input corpora with awk(1) or
sed(1) to obtain one paragraph per line. Line size is
limited by available memory, but note that regex performance
will degrade quickly for long lines.
PERFORMANCE
The underlying
assumption of statistical learning is that a relatively
small number of training documents can represent a much
larger set of input documents. Thus in the long run,
learning can be allowed to stop without serious impact on
classification accuracy. While not true in reality, this
assumption is surprisingly accurate for problems such as
email filtering. In practice, this means that a well chosen
corpus on the order of ten thousand documents is sufficient
for highly accurate results for years. Continual learning
after such a critical mass results in diminishing returns.
Of course, when real world input document patterns change
dramatically, the predictive power of the models can be
lost. At the other end, a few hundred documents already give
acceptable results in most cases.
dbacl is
heavily optimized for the case of frequent classifications
but infrequent batch learning. This is the long run optimum
described above. Under ideal conditions, dbacl can
classify a hundred emails per second on low end hardware
(500MHz Pentium III). Learning speed is not very much
slower, but takes effectively much longer for large document
collections for various reasons. When using the -m
switch, data structures are aggressively mapped into memory
if possible, reducing overheads for both I/O and memory
allocations.
dbacl
throws away its input as soon as possible, and has no limits
on the input document size. Both classification and learning
speed are directly proportional to the number of tokens in
the input, but learning also needs a nonlinear optimization
step which takes time proportional to the number of unique
tokens discovered. At time of writing, dbacl is one
of the fastest open source mail filters given its optimal
usage scenario, but uses more memory for learning than other
filters.
MULTIPLE PROCESSES AND DATA CORRUPTION
When saving
category files, dbacl first writes out a temporary
file in the same location, and renames it afterwards. If a
problem or crash occurs during learning, the old category
file is therefore left untouched. This ensures that
categories can never be corrupted, no matter how many
processes try to simultaneously learn or classify, and means
that valid categories are available for classification at
any time.
When using the
-m switch, file contents are memory mapped for speedy
reading and writing. This, together with the -o
switch, is intended mainly for testing purposes, when tens
of thousands of messages must be learned and scored in a
laboratory to measure dbacl’s accuracy. Because
no file locking is attempted for performance reasons,
corruptions are possible, unless you make sure that only one
dbacl process reads or writes any file at any given
time. This is the only case (-m and -o together) when
corruption is possible.
MEMORY USE
When
classifying a document, dbacl loads all indicated
categories into RAM, so the total memory needed is
approximately the sum of the category file sizes plus a
fixed small overhead. The input document is consumed while
being read, so its size doesn’t matter, but very long
lines can take up space. When using the -m switch,
the categories are read using mmap(2) as
available.
When learning,
dbacl keeps a large structure in memory which
contains many objects which won’t be saved into the
output category. The size of this structure is proportional
to the number of unique tokens read, but not the size of the
input documents, since they are discarded while being read.
As a rough guide, this structure is 4x-5x the size of the
final category file that is produced.
To prevent
unchecked memory growth, dbacl allocates by default a
fixed smallish amount of memory for tokens. When this space
is used up, further tokens are discarded, which has the
effect of skewing the learned category, making it less usable
as more tokens are dropped. A warning is printed on STDERR
in such a case.
The -h
switch lets you fix the initial size of the token space in
powers of 2, ie "-h 17" means 2^17 = 131072
possible tokens. If you type "dbacl -V", you can
see the number of bytes needed for each token when either
learning or classifying. Multiply this number by the maximum
number of possible tokens to estimate the memory needed for
learning. The -H switch lets dbacl grow its
tables automatically if and when needed, up to a maximum
specified. So if you type "-H 21", then the
initial size will be doubled repeatedly if necessary, up to
approximately two million unique tokens.
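As a rough worked example (the 12 bytes per token below is purely
illustrative; the real figure is printed by "dbacl -V"), "-h 17"
reserves space for 2^17 = 131072 tokens, which at 12 bytes each comes
to about 1.5MB; "-H 21" would allow that to grow towards 2^21 =
2097152 tokens, or roughly 24MB under the same assumption.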
When learning
with the -X switch, a handful of input documents are
also kept in RAM throughout.
ENVIRONMENT
DBACL_PATH
When this variable is set, its
value is prepended to every category filename which
doesn’t start with a ’/’ or a
’.’.
SIGNALS
INT
If this signal is caught,
dbacl simply exits without doing any cleanup or other
operations. This signal can often be sent by pressing Ctrl-C
on the keyboard. See stty(1).
HUP, QUIT, TERM
If one of these signals is
caught, dbacl stops reading input and continues its
operation as if no more input was available. This is a way
of quitting gracefully, but note that in learning mode, a
category file will be written based on the incomplete input.
The QUIT signal can often be sent by pressing Ctrl-\ on
the keyboard. See stty(1).
USR1
If this signal is caught, dbacl reloads the
current categories at the earliest feasible opportunity.
This is not normally useful at all, but might be in special
cases, such as if the -f switch is invoked together
with input from a long running pipe.
NOTES
dbacl
generated category files are in binary format, and may or
may not be portable to systems using a different byte order
architecture (this depends on how dbacl was
compiled). The -V switch prints out whether
categories are portable, or else you can just
experiment.
dbacl
does not recognize functionally equivalent regular
expressions, and in this case duplicate features will be
counted several times.
With every
learned category, the command line options that were used
are saved. When classifying, make sure that every relevant
category was learned with the same set of options (regexes
are allowed to differ), otherwise behaviour is undefined.
There is no need to repeat all the switches when
classifying.
If you get many
digitization warnings, then you are trying to learn too much
data at once, or your model is too complex. dbacl is
compiled to save memory by digitizing final weights, but you
can disable digitization by editing dbacl.h and
recompiling.
dbacl
offers several built-in tokenizers (see -e switch)
with more to come in future versions, as the author invents
them. While the default tokenizer may evolve, no tokenizer
should ever be removed, so that you can always simulate
previous dbacl behaviour subject to bug fixes and
architectural changes.
The confidence
estimates obtained through the -X switch are
underestimates, ie are more conservative than they should
be.
BUGS
"Ya know,
some day scientists are gonna invent something that will
outsmart a rabbit." (Robot Rabbit, 1953)
SOURCE
The source code
for the latest version of this program is available at the
following locations:
http://www.lbreyer.com/gpl.html
http://dbacl.sourceforge.net
AUTHOR
Laird A. Breyer
<laird@lbreyer.com>
SEE ALSO
awk(1),
bayesol(1), crontab(1), hmine(1),
hypex(1), less(1), mailcross(1),
mailfoot(1), mailinspect(1),
mailtoe(1), procmailex(5), regex(7),
stty(1), sed(1)