Comparing dbacl with other classifiers
Note: this tutorial shows one way of comparing classifiers, using the cross
validation method. A different approach, which compares spam filters more
realistically, was pioneered at the TREC 2005 conference; separate instructions exist for using dbacl with the TREC spamjig.
The key to comparing dbacl(1) with other
classifiers is the mailcross(1) testsuite command.
Simply put, this command allows you to compare the error rates of several
classifiers on a common set of training documents. The rates you obtain
are of course only estimates, and likely to vary somewhat depending on the
actual sample emails you use. Thus it is possible for one classifier to
perform better than another with one set of documents, while performing worse
with a different set.
Unfortunately, no set of email documents is truly representative for everyone
on the planet. Moreover, each person's email characteristics vary slowly over time. Consequently, it makes little sense to compare the
performance of different classifiers on different sets of documents. Instead,
choosing the best classifier for yourself can only be done reliably with your own email, by comparing classifiers on exactly the same messages.
The mailcross(1) testsuite must be given a set
of categories, with sample emails for each category in mbox format.
Once all the classifiers to be compared have been selected, it only remains to leave
the script running overnight; the summary is usually inspected the next morning.
The method used to estimate classification errors is standard cross validation.
The training emails are split into a number of roughly equal-sized subsets; all but one of these are used for learning, and the remaining subset, which wasn't learned, is predicted. The percentage of errors is then calculated for each category by averaging the results over all possible choices of the prediction subset.
Note that this is neither the only way to estimate prediction errors, nor
even accepted as a good way by all academics. However, it's independent of
the classifier, widely used around the world, and easy to program.
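mailcross performs this splitting itself (the numeric argument to mailcross prepare, used later in this tutorial, presumably chooses the number of subsets). Purely as an illustration of the idea, here is a minimal sketch, not part of mailcross, that splits an mbox into 5 round-robin folds by relying on the fact that every message in an mbox starts with a "From " line:
% awk -v folds=5 '/^From / { n++ } { print > ("fold." (n % folds) ".mbox") }' \
    $HOME/sample_spam.mbox
Each resulting fold.N.mbox would in turn play the role of the prediction subset, while the remaining folds are learned.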
You can cross validate as many categories as you like, provided the classifiers all support multiple categories. For example, you could compare dbacl(1) and ifile on many categories.
However, most email junk filters can only cope with two categories, representing junk mail and regular mail. When comparing such classifiers (bogofilter, for example), the mailcross(1) testsuite is hard-coded to work with two categories named spam and notspam. You must use these category names, or the results will not make sense.
An Example: Preparations
Before you can run the mailcross(1) testsuite,
you need a set
of sample emails for each category. Here, we shall test on two
categories, named spam and notspam.
Take a moment to sift through your local mail folders for sample emails.
The instructions below assume you have two Unix mbox format files
named $HOME/sample_spam.mbox and $HOME/sample_notspam.mbox, containing
junk email and ordinary email respectively. These will not be
modified in any way during the test.
Fill these mailboxes with
as many messages as you can. While this will lengthen the time it
takes for the cross validation to complete, it also gives more accurate results.
You should expect the tests to run overnight anyway.
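To get a rough idea of how many messages each mbox contains, you can count the mbox "From " separator lines (an approximation, since occurrences of "From " inside message bodies are normally escaped as ">From " in mbox files):
% grep -c '^From ' $HOME/sample_spam.mbox
% grep -c '^From ' $HOME/sample_notspam.mbox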
If your emails aren't in mbox format, you must convert them. For example,
if $HOME/myspam is a directory containing your emails, one file per email, you
can type:
% find $HOME/myspam -type f | while read f; \
do formail <"$f"; done > $HOME/sample_spam.mbox
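If your messages live in a Maildir folder instead, the same approach works, since Maildir also keeps one file per message. The path below is only an example; adjust it to your own setup:
% find $HOME/Maildir/.Junk/cur $HOME/Maildir/.Junk/new -type f | while read f; \
do formail <"$f"; done > $HOME/sample_spam.mbox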
Alternatively, if you don't have many emails for testing, you can download
samples from a public corpus. For example, SpamAssassin maintains
suitable sets of messages at
http://spamassassin.org/publiccorpus/. Be kind to their server!
The SpamAssassin corpus doesn't come in mbox format. Here's what you must do
to obtain usable files: Download a compressed message archive. For example,
you can download the file 20021010_hard_ham.tar.bz2, which contains
a selection of nonjunk messages. Type
% tar xfj 20021010_hard_ham.tar.bz2
which will extract the files into a directory named hard_ham. If you
inspect the directory by typing
% ls hard_ham
you will see many files named something like
0053.ccd1056dc3ff533d76a97044bac52087.
These are all individual messages. Watch out for files with unusual names:
some archives contain a file named cmds which is NOT a mail message.
Delete all such files before proceeding.
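For example, assuming cmds is the only offending file:
% rm hard_ham/cmds
Next, type: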
% find hard_ham -type f | while read f; \
do formail <"$f"; done > $HOME/sample_notspam.mbox
You can repeat this command for as many archives as needed, but remember to
change the destination mbox name each time, as it would otherwise be overwritten. Alternatively, append to an existing mbox as shown below.
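For instance, after extracting a second archive into a directory, here assumed to be named easy_ham, you could append its messages to the same mbox with >>:
% find easy_ham -type f | while read f; \
do formail <"$f"; done >> $HOME/sample_notspam.mbox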
An Example: Running the Tests
Before you cross validate, make sure you have ample disk space available.
As a rough rule, expect to require up to 20 times the combined size of your
$HOME/sample_*.mbox files if you do the following.
% mailcross prepare 10
% mailcross add spam $HOME/sample_spam.mbox
% mailcross add notspam $HOME/sample_notspam.mbox
Note that if you have several mbox files with spam, you can repeat the
add spam command once for each mbox file, as shown below. All this command
does is merge the contents of the mbox file into a specially created directory
named mailcross.d. Once this is done, you don't need the original
*.mbox files around any longer, at least for cross validation purposes.
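For instance, if your junk mail is spread over two mailboxes (the file names here are just placeholders), you could add both to the same category:
% mailcross add spam $HOME/spam_2003.mbox
% mailcross add spam $HOME/spam_2004.mbox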
You are now ready to select the classifiers you wish to compare. Type
% mailcross testsuite list
The following classification wrappers are selectable:
annoyance-filter - Annoyance Filter 1.0b with prune
antispam - AntiSpam 1.1 with default options
bmf - bmf 0.9.4 with default options
bogofilterA - bogofilter 0.15.7 with Robinson-Fischer algorithm
bogofilterB - bogofilter 0.15.7 with Graham algorithm
bogofilterC - bogofilter 0.15.7 with Robinson algorithm
crm114A - crm114 20031129-RC11 with default settings
crm114B - crm114 20031129-RC11 with Jaakko Hyvatti's normalizemime
dbaclA - dbacl 1.6 with alpha tokens
dbaclB - dbacl 1.6 with cef,headers,alt,links
dbaclC - dbacl 1.6 with alpha tokens and risk matrix
ifile - ifile 1.3.3 with to,from,subject headers and default tokens
popfile - POPFile (unavailable?) with default options
qsf - qsf 0.9.4 with default options
spamassassin - SpamAssassin 2.60 (Bayes module) with default settings
spambayes - SpamBayes x with default settings
spamoracle - SpamOracle x with default settings
spamprobe - SpamProbe v0.9e with default options
The exact list you see depends on the classifiers installed on your system.
If a classifier is marked unavailable, you must first download and install
it somewhere in your path. Once this is done, select the classifiers you
are going to test, for example:
% mailcross testsuite select dbaclB bogofilterA annoyance-filter
Note that some of these only work with the two categories spam and notspam. You can see the state of the testsuite by typing:
% mailcross testsuite status
The following categories are to be cross validated:
notspam.mbox - counting... 2500 messages
spam.mbox - counting... 500 messages
Cross validation is performed on each of these classifiers:
annoyance-filter - Annoyance Filter 1.0b with prune
bogofilterA - bogofilter 0.15.7 with Robinson algorithm
dbaclB - dbacl 1.5 with cef,headers,alt,links
Finally, to start the test, type
% mailcross testsuite run
The cross validation may take a long time, depending on the classifiers and the
number of messages. You can check progress by keeping an eye on the log
files in the directory mailcross.d/log/, for example as shown below.
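The exact log file names depend on the classifiers you selected, so it is easiest to list the directory first; these are ordinary Unix commands, not mailcross features:
% ls mailcross.d/log/
% tail -f mailcross.d/log/*
You can also watch disk usage with du -sh mailcross.d while the tests run.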
An Example: Viewing The Results
Once the cross validation test has completed, you can see the
results as follows:
% mailcross testsuite summarize
Each selected classifier is scored in two complementary ways.
The first question asked is Where do misclassifications go?,
which shows roughly how good the predictions are from an objective standpoint.
The percentage of notspam messages predicted as spam is sometimes
called the false negative rate, and the percentage of spam messages predicted as notspam the false positive rate. However, this terminology is not standardized and can be confusing (which rate is which depends on the purpose of the test), so it won't be used here.
The second question asked is What is really in each category after prediction?, which is the dual of the previous question.
Normally, the purpose of mail classification is to separate your messages so that you save time. Here you can see how "clean" your mailboxes would be after classification.
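The two views are linked by simple arithmetic. For instance, with 2500 notspam and 500 spam messages, a classifier that predicts 9.40% of the spam as notspam puts 0.094 × 500 = 47 spam messages among the 2500 + 47 = 2547 messages predicted as notspam, so the notspam category contains 47/2547, or about 1.85%, spam after prediction.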
Here is a sample summary output by mailcross(1) testsuite. Remember that results such as these make no sense unless you try them
out on your own emails. You have no idea what emails were used to obtain
these results, and I am not going to tell you.
---------------
Annoyance Filter 1.0b with prune
Fri Nov 14 11:26:58 EST 2003
---------------
Where do misclassifications go?
true | but predicted as...
* | notspam spam
notspam | 100.00% 0.00%
spam | 9.40% 90.60%
What is really in each category after prediction?
category | contains mixture of...
* | notspam spam
notspam | 98.15% 1.85%
spam | 0.00% 100.00%
---------------
bogofilter 0.15.7 with Robinson algorithm
Fri Nov 14 11:30:25 EST 2003
---------------
Where do misclassifications go?
true | but predicted as...
* | notspam spam
notspam | 100.00% 0.00%
spam | 8.40% 91.60%
What is really in each category after prediction?
category | contains mixture of...
* | notspam spam
notspam | 98.35% 1.65%
spam | 0.00% 100.00%
---------------
dbacl 1.5 with cef,headers,alt,links
Fri Nov 14 11:33:33 EST 2003
---------------
Where do misclassifications go?
true | but predicted as...
* | notspam spam
notspam | 100.00% 0.00%
spam | 5.80% 94.20%
What is really in each category after prediction?
category | contains mixture of...
* | notspam spam
notspam | 98.85% 1.15%
spam | 0.00% 100.00%