Email classification with dbacl
Laird A. Breyer
Introduction
dbacl is a UNIX/POSIX command line
toolset which can be used in scripts to classify a single email among one
or more previously learned categories.
dbacl(1) supports several different statistical models and several different tokenization
schemes, which can be adjusted to trade in speed and memory performance for
statistical sophistication. dbacl(1) also permits the user to select cost weightings
for different categories, thereby permitting simple adjustments to the
type I and type II errors (a.k.a. false positives, etc.). Finally,
dbacl(1) can print confidence percentages which help decide when
a decision is ambiguous.
dbacl(1)
is a general purpose text classifier which can understand email message formats.
The tutorial explains general classification only. It is worth reading (of course :-), but doesn't describe the extra steps necessary
to enable the email functionality. This document describes the necessary switches and caveats from first principles.
You can learn more about the dbacl suite of utilities (e.g dbacl,
bayesol,
mailinspect,
mailcross) by typing for example:
% man dbacl
Before you begin
A statistical email classifier cannot work unless you show it some (as many as you can)
examples of emails for every category of interest. This requires some work, because you must
separate your emails into dedicated folders. dbacl works best with mbox style folders, which are
the standard UNIX folder type that most mailreaders can import and export.
We'll assume that you want to define two categories
called notspam and spam respectively. If you want dbacl(1) to recognize these categories, please take a moment to create two mbox folders named (for example) $HOME/mail/spam and $HOME/mail/notspam.
You must make sure that the $HOME/mail/notspam folder doesn't contain any unwanted messages, and
similarly $HOME/mail/spam must not contain any wanted messages. If you mix messages in the two folders, dbacl(1) will be somewhat confused, and its classification accuracy will drop.
If you've used other Bayesian spam filters, you will find that dbacl(1) is slightly different. While other filters can sometimes learn incrementally, one message at a time,
dbacl always learns from scratch all the messages you give it, and only those. dbacl is optimized for learning a large number of messages quickly in one go, and to classify messages as fast as possible afterwards. The author has several good reasons for this choice, which are beyond the scope of this tutorial.
As time goes by, if you use dbacl(1) for classification, you will probably set up your filing system so that messages identified as spam go automatically into the $HOME/mail/notspam folder,
and messages identified as notspam go into the $HOME/mail/notspam folder.
dbacl(1) is far from perfect, and can make mistakes. This will result in messages going to the wrong folder. When dbacl(1) relearns, it will become slightly confused and over time its ability to distinguish spam and notspam will be diminished.
As with all email classifiers which learn your email, you should inspect your folders regularly,
and if you find messages in the wrong folder, you must move them to the correct folder before relearning. If you keep your mail folders clean for learning, dbacl(1) will eventually make very
few mistakes, and you will have plenty of time to inspect the folders once in a while. Or so the theory goes...
Since dbacl(1) must relearn categories from scratch each time, you will
probably want to set up a cron(1) job to relearn your mail folders
every day at midnight. This tutorial explains how to do this below.
If you like, you don't need to relearn periodically at all. The author relearns
his categories once every few months, without noticeable loss. This works
as long as your mail folders contain enough representative emails for training.
Last but not least, if after reading this tutorial you have trouble to get
classifications working, please read is_it_working.html.
Basic operation: Learning
To learn your spam category, go to the directory containing your
$HOME/mail/notspam folder (we assume mbox type here) and at the command prompt (written %), type:
% dbacl -T email -l spam $HOME/mail/spam
This reads all the messages in $HOME/mail/notspam one at a time, ignoring certain mail
headers and attachments, and also removes HTML markup. The result is a binary
file called spam, which can be used by dbacl for classifications.
This file is a snapshot, it cannot be modified by learning extra mail messages.
(unlike other spam filters, which sometimes let you learn incrementally,
see below for discussion).
If you get warning messages about the hash size being too small,
you need to increase the memory reserved for email tokens. Type:
% dbacl -T email -H 20 -l spam $HOME/mail/spam
which reserves space for up to 2^20 (one million) unique words. The
default dbacl(1) settings are chosen to
limit dramatically the memory requirements (about 32 thousand tokens). Once the limit is reached,
no new tokens are added and your category models will be strongly skewed towards the first few emails read. Heed the warnings.
If your spam folder contains too many messages, you can tell dbacl to pick the emails to be learned randomly. This is done by using the decimation switch -x. For example,
% dbacl -T email -H 20 -x 1 -l spam $HOME/mail/spam
will learn only about 1/2 or the emails found in the spam folder. Similarly, -x 2 would learn about 1/4, -x 3 would learn about 1/8 of available emails.
dbacl(1) has several options designed to control the way email messages are
parsed. The man page lists them all, but of particular interest are the -T
and -e options.
If your email isn't kept in mbox format, dbacl(1) can open one or more directories
and read all the files in it.
For example, if your messages are stored in the directory $HOME/mh/, one file per email,
you can type
% dbacl -T email -l spam $HOME/mh
At present, dbacl(1) won't read the subdirectories, or look at the file names
to decide whether to read some messages and not others.
Another (but not necessarily faster) solution is to temporarily
convert your mail into an mbox format file and use that for learning:
% find $HOME/mh -type f | while read f; \
do formail <$f; done | dbacl -T email -l spam
It is not enough to learn $HOME/mail/notspam emails, you must also learn the
$HOME/mail/notspam emails. dbacl(1) can only choose among the categories which have been previously learned. It
cannot say that an email is unlike spam (that's an open ended statement), only that an
email is like spam or like notspam (these are both concrete statements). To learn
the notspam category, type:
% dbacl -T email -l notspam $HOME/mail/notspam
Make sure to use the same switches for both spam and notspam categories.
Once you've fully read the man page later, you can start to mix and match switches.
Every time dbacl(1) learns a category, it writes
a binary file containing the statistical model into the hidden directory $HOME/.dbacl,
so for example after learning the category spam, you will have a small file named $HOME/.dbacl/spam which contains everything dbacl(1) learned. The file is recreated from scratch each time you relearn spam, and is loaded each time you classify an email.
Basic operation: Classifying
Suppose you have a file named email.rfc which contains the body of a single email
(in standard RFC822 format).
You can classify the email into either spam or notspam by typing:
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
All you get is the name of the best category, the email itself is consumed.
A variation of particular interest is to replace -v by -U, which gives
a percentage representing how sure dbacl is of printing the correct category.
% cat email.rfc | dbacl -T email -c spam -c notspam -U
notspam # 67%
A result of 100% means dbacl is very sure of the printed category, while
a result of 0% means at least one other category is equally likely.
If you would like to see separate scores for each category, type:
% cat email.rfc | dbacl -T email -c spam -c notspam -n
spam 232.07 notspam 229.44
The winning category always has the smallest score (closest to zero). In fact, the numbers returned with
the -n switch can be interpreted as unnormalized distances towards each category from the input document, in a mathematical space of high dimensions.
If you prefer a return code,
dbacl(1) returns a positive integer (1, 2, 3, ...)
identifying the category by its position on the command line. So if you type:
% cat email.rfc | dbacl -T email -c spam -c notspam
then you get no output, but the return code is 2. If you use the bash(1) shell, the return code
for the last run command is always in the variable $?.
Basic operation: Scripts
There is generally
little point in running the commands above by hand, except if you want to understand
how dbacl(1) operates, or want to experiment with switches.
Note, however, that simple scripts often do not check for error and warning messages on
STDERR. It is always worth rehearsing the operations you intend to script, as
dbacl(1) will let you know on
STDERR if it encounters problems during learning. If
you ignore warnings, you will likely end up with suboptimal classifications,
because the dbacl system prefers to do what it is told predictably,
rather than stop when an error condition occurs.
Once you are ready for spam filtering, you need to handle two issues.
The first issue is when and how to learn.
You should relearn your categories whenever you've received an appreciable number of emails or whenever you like. Unlike other spam filters, dbacl cannot
learn new emails incrementally and update its category files. Instead, you
must keep your messages organized and dbacl(1) will take a snapshot.
This limitation is actually advantageous in the long run, because it forces you to
keep usable archives of your mail and gives you control over every message
that is learned. By contrast, with incremental learning you must remember
which messages have already been learned, how many times, and whether to unlearn
them if you change your mind.
A dbacl category model normally doesn't change dramatically if you add a single new email (provided the
original model depends on more than a handful of emails). Over time, you
can even stop learning altogether when your error rate is low enough.
The simplest strategy for continual learning is a cron(1) job run once a day:
% crontab -l > existing_crontab.txt
Edit the file existing_crontab.txt with your favourite editor and add the following
three lines at the end:
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -H 18 -l $CATS/spam $HOME/mail/notspam
10 0 * * * dbacl -T email -H 18 -l $CATS/notspam $HOME/mail/notspam
Now you can install the new crontab file by typing
% crontab existing_crontab.txt
The second issue is how to invoke and what to do with the dbacl classification.
Many UNIX systems offer procmail(1)
for email filtering. procmail(1) can pipe a copy of each incoming email into dbacl(1), and use the
resulting category name to write the message directly to the appropriate mailbox.
To use procmail, first verify that the file
$HOME/.forward exists and contains the single line:
|/usr/bin/procmail
Next, create the file $HOME/.procmailrc and make sure it contains something like this:
PATH=/bin:/usr/bin:/usr/local/bin
SHELL=/bin/bash
MAILDIR=$HOME/mail
DEFAULT=$MAILDIR/inbox
#
# this line runs the spam classifier
#
:0
YAY=| dbacl -vT email -c $HOME/.dbacl/spam -c $HOME/.dbacl/notspam
#
# this line writes the email to your mail directory
#
:0:
* ? test -n "$YAY"
# if you prefer to write the spam status in a header,
# comment out the first line and uncomment the second
$MAILDIR/$YAY
#| formail -A "X-DBACL-Says: $YAY" >>$DEFAULT
#
# last rule: put mail into mailbox
#
:0:
$DEFAULT
The above script will automatically file your incoming email into one of two folders named $HOME/mail/spam and $HOME/mail/notspam respectively (if you have a POP account, and your mailreader contacts your ISP directly, this won't work. Try using fetchmail(1)).
Advanced operation: Costs
This section can be skipped. It is here for completeness, but probably
won't be very useful to you, especially if you are a new user.
The classification performed by dbacl(1) as described above is known as a MAP
estimate. The optimal category is chosen only by looking at the email contents. What is missing is your input
as to the costs of misclassifications.
This section is by no means necessary for using dbacl(1) for most
classification tasks. It is useful for tweaking dbacl's algorithms only.
If you want to improve dbacl's accuracy, first try to learn bigger collections
of email.
To understand the idea, imagine that an email being wrongly marked spam is likely to be
sitting in the $HOME/mail/spam folder until you check through it, while an email wrongly marked notspam will prominently appear among your regular correspondence. For most people, the former case can mean a missed timely communication, while the latter case is merely an annoyance.
No classification system is perfect. Learned emails can only imperfectly predict never before seen emails. Statistical models vary in quality. If you try to lower one kind of error, you automatically increase the other kind.
The dbacl system allows you to specify how much you hate each type of misclassification, and does its best to accomodate this extra information. To input your settings, you will need a risk specification like this:
categories {
spam, notspam
}
prior {
1, 1
}
loss_matrix {
"" spam [ 0, 1^complexity ]
"" notspam [ 2^complexity, 0 ]
}
This risk specification states that your cost for misclassifying spam emails into notspam is 1 for every word of the email (merely an annoyance). Your cost for misclassifying regular emails into spam is 2 for every word of the email (a more serious problem). The costs for classifying your email correctly are zero in each case. Note that the cost numbers are arbitrary, only their relative sizes matter. See the tutorial if you want to understand these statements.
Now save your risk specification above into a file named my.risk, and type
% cat email.rfc | dbacl -T email -c spam -c notspam \
-vna | bayesol -c my.risk -v
notspam
The output category may or may not differ from the category selected via dbacl(1) alone, but over
many emails, the resulting classifications will be more cautious about marking an email as spam.
Since dbacl(1) can output the score for each category (using the -n switch),
you are also free to do your own processing and decision calculation, without using bayesol(1).
For example, you could use:
% cat email.rfc | dbacl -T email -n -c spam -c notspam | \
awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } \
else { print $3; } }'
where p1 is the a priori probability that an email is spam, u12 is the cost of misclassifying
spam as notspam, and u21 is the cost of seeing spam among your regular email.
When you take your misclassification costs into account, it is better to use
the logarithmic scores (given by the -n) instead of the true probabilities (given by the -N switch).
The scores represent the amount of evidence away from each model, so the smaller the score the better. For each category, dbacl outputs both the score and the complexity of the email (ie the number of tokens/words actually looked at). For example
% cat email.rfc | dbacl -T email -c spam -c notspam -vn
spam 5.52 * 42 notspam 5.46 * 42
would indicate that there are 5.46 bits of evidence away from notspam,
but 5.52 bits of evidence away from spam. This evidence is computed
based on 42 tokens (it's a small email :-), and one would conclude that the notspam category is a slightly better fit.
Advanced operation: Parsing
dbacl(1) sets some default switches which should be acceptable
for email classification, but probably not optimal. If you like to experiment,
then this section should give you enough material to stay occupied, but
reading it is not strictly necessary.
When dbacl(1) inspects an email message, it only looks at certain words/tokens.
In all examples so far,
the tokens picked up were purely alphabetic words. No numbers are picked up, or special characters such as $, @, % and punctuation.
The success of text classification schemes depends not only on the statistical models used, but also strongly on the type of tokens considered. dbacl(1) allows you to try out different tokenization schemes. What works best depends on your email.
By default, dbacl(1) picks up only purely alphabetic words as tokens (this uses the least amount of memory). To pick up alphanumeric tokens, use the -e switch as follows:
% dbacl -e alnum -T email -l spam $HOME/mail/notspam
% dbacl -e alnum -T email -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
You can also pick up printable words (use -e graph) or purely ASCII (use -e ascii) tokens.
Note that you do not need to indicate the -e switch when classifying, but you should make sure
that all the categories use the same -e switch when learning.
dbacl(1) can also look at single words, consecutive
pairs of words, triples, quadruples. For example, a trigram
model based on alphanumeric tokens can be learned as follows:
% dbacl -e alnum -w 3 -T email -l spam $HOME/mail/notspam
One thing to watch out for is that n-gram models require much more memory to learn in general. You will likely need to use the -H switch to reserve enough space.
If you prefer, you can specify the tokens to look at through a regular expression. The following
example picks up single words which contain purely alphabetic characters followed by zero or more numeric characters. It can be considered an intermediate tokenization scheme between -e alpha and -e alnum:
% dbacl -T email \
-g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
-l spam $HOME/mail/notspam
% dbacl -T email \
-g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
-l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
Note that there is no need to repeat the -g switch when classifying.
Cross Validation
This section explains quality control and accuracy testing, but
is not needed for daily use.
If you have time to kill, you might be inspired by the instructions above to write your own learning and filtering shell scripts. For example, you might
have a script $HOME/mylearner.sh containing
#!/bin/sh
dbacl -T email -H 19 -w 1 -l $@
With this script, you could learn your spam and notspam emails by typing
% ./mylearner.sh spam $HOME/mail/spam
% ./mylearner.sh notspam $HOME/mail/notspam
A second script $HOME/myfilter.sh might contain
#!/bin/sh
dbacl -T email -vna $@ | bayesol -c $HOME/my.risk -v
With this script, you could classify an email by typing
% cat email.rfc | $HOME/myfilter.sh -c spam -c notspam
What we've done with these scripts saves typing, but isn't necessarily
very useful. However, we can use these scripts together with another
utility, mailcross(1).
What mailcross(1) does is run an email n-fold
cross-validation, thereby giving you an idea of the effects of
various switches. It accomplishes this by splitting each of your spam and notspam folders (which normally are fully used up for learning categories) into
n equal sized subsets. One of these subsets is chosen for testing each category, and the remaining subsets are used for learning, since you already know if they are spam or notspam. This gives you an idea how good the filter is, and how it improves (or not!) when you change switches.
Statistically, cross-validation is on very shaky ground, because it violates the fundamental principle that you must not reuse data for two different purposes, but that doesn't stop most of the world from using it.
Suppose you want to cross-validate your filtering scripts above. The following
instructions only work with mbox files. First you should type:
% mailcross prepare 10
% mailcross add spam $HOME/mail/spam
% mailcross add notspam $HOME/mail/notspam
This creates several copies of all your spam and notspam emails
for later testing. Your original $HOME/mail/{,not}spam files are not
modified or even needed after these steps.
Next, you must indicate the learning and filtering scripts to use. Type
% export MAILCROSS_LEARNER="$HOME/mylearner.sh"
% export MAILCROSS_FILTER="$HOME/myfilter.sh"
Finally, it is time to run the cross validation. Type
% mailcross learn
% mailcross run
% mailcross summarize
The results you will see eventually are based on learning about 9/10th of the emails in $HOME/mail/spam and $HOME/mail/spam respectively for each category and testing with the other 1/10th.
Beware that the commands above may take a long time and have all the excitement of watching paint dry. When you get bored with cross validation, don't forget to type
% mailcross clean
Comparing Email Filters
Provided all runs use the same test corpora, you can compare any number of
email classifiers with the mailcross command. This helps in choosing the
best combination of switches for learning, although it can't be stressed enough
that the results will depend strongly on your particular set of corpora.
It quickly gets tedious to run the cross validator multiple times, however.
With this in mind, mailcross(1) has a "testsuite" subcommand,
which runs the mailcross commands successively on any number of filters.
These filters can be various versions of dbacl with different switches, or indeed
can be other open source email classifiers.
For every email classifier you wish to cross validate, you need a wrapper script
which performs the work of the scripts mylearner.sh and myfilter.sh
in the previous section. Since every open source email classifier has its own interface,
the wrapper must translate the mailcross instructions into something that the
classifier understands. At the time of writing, wrapper scripts exist for
dbacl, bogofilter,
ifile and
spambayes.
The interface requirements are described in the mailcross(1) manual page.
Note that the supplied wrappers can be sometimes out of date for the most
popular Bayesian filters, because these projects can change their interfaces
frequently. Also, the wrappers may not use the most flattering combinations
of switches and options, as only each filter author knows the best way
to use his own filter.
Besides cross validation, you can also test Train On Error and Full Online
Ordered Training schemes, via the mailtoe(1) and mailfoot(1) commands. Using
them is very similar to using mailcross(1).
TREC
The (United States) National Institute of Standards and Technology organises an annual conference on text retrieval called
TREC, which in 2005 began a new
track on spam filtering. A goal of this conference is to develop over
several years a set of standard methodologies for evaluating and comparing
spam filtering systems.
For 2005, the initial goal is to compare spam filters in a laboratory
environment, not directly connected to the internet. An identical
stream of email messages addressed to a single person
is shown in chronological order to all
participating filters, which can learn them incrementally and must predict
the type of each message as it arrives.
The spamjig
is the automated system which performs this evaluation. You can download it
yourself and run it with your own email collections to test any participating
filters. Special instructions for dbacl can be found in the
TREC subdirectory of the source package. Many other open source spam filters
can also be tested in this framework.
|