To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
  -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample1.txt -D | grep match
match d191e93e []acquired[](1) 1
match 8c56f142 []inquire[](1) 1
match 7a2ccda2 []inquiry[](1) 1
match 38f595f3 []consequently[](1) 1
match a52053f2 []questions[](1) 1
match 78a0e302 []question[](1) 1

This command lists the first few matches, one per line, which exist in the sample1.txt document. Obviously, a model whose only features are words containing the letter 'q' is a poor one. However, when you are trying out regular expressions, you can compare this output with the contents of the document to check whether your expression misses words or captures too many.

Sometimes it's convenient to use parentheses in a regular expression whose contents you don't want to keep. dbacl understands the special notation ||xyz, which you can place at the end of a regular expression, where x, y, z are digits identifying the parentheses you do want to keep. Here is an example for mixed Japanese and English documents, which matches alphabetic words and single ideograms:

% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \
  -g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'

Note that you need a multilingual terminal and Japanese fonts to view this, and your computer must have a Japanese locale available. In the table below, you will find a list of some simple regular expressions to get you started:
The last entry in the table above shows how to take word pairs as features. Such models are called bigram models, as opposed to the unigram models whose features are single words, and they are used to capture extra information. For example, in a unigram model the word pairs "well done" and "done well" have the same probability. A bigram model can learn that "well done" is more common in food related documents (provided this combination of words was actually found within the learning corpus). However, there is a big statistical problem: because there exist many more meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain meaningful statistics. One way around this is a technique called smoothing, which predicts unseen bigrams from already seen unigrams. To obtain such a combined unigram/bigram alphabetic word model, type

% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \
  -g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt

If all you want are alphabetic bigrams, trigrams, etc., there is a special switch -w you can use. The command

% dbacl -l slick -w 2 sample1.txt

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can). Let's look at the first few features: type

% dbacl -l slick -w 2 sample1.txt -D | grep match | head -10
match 818ad280 []tom[](1) 1
match 5d20c0e2 []tom[]no[](1) 2
match 3db5da99 []no[](1) 1
match 4a18ad66 []no[]answer[](1) 2
match eea4a1c4 []answer[](1) 1
match 95392743 []answer[]tom[](1) 2
match 61cc1403 []answer[]what[](1) 2
match 8c953ec2 []what[](1) 1
match 4291d86e []what[]s[](1) 2
match b09aa375 []s[](1) 1

You can see both pairs and single words, all lower case because dbacl converts everything to lower case unless you tell it otherwise. This saves a little memory. But what did the original document look like?

% head -10 sample1.txt
"TOM!"
No answer.
"TOM!"
No answer.
"What's gone with that boy, I wonder? You TOM!"

Now you see how the pairs are formed. But wait: the pair of words ("TOM!", No) occurs twice in the text, but only once in the list of matches. Did we miss one? No, look again at the line

match 5d20c0e2 []tom[]no[](1) 2

and you will see that the last value is 2, because we've seen this pair twice. dbacl uses the frequencies of features to build its model.

Obviously, all this typing is getting tedious, and you will eventually want to automate the learning stage in a shell script. Use regular expressions sparingly, as they can quickly degrade the performance (speed and memory) of dbacl. See Appendix A for ways around this.

Evaluating the models

Now that you have a grasp of the variety of language models which dbacl can generate, the important question is: which set of features should you use? There is no easy answer to this problem. Intuitively, a larger set of features seems always preferable, since it takes more information into account. However, there is a tradeoff. Comparing more features requires extra memory, but much more importantly, too many features can overfit the data. This results in a model which is so good at predicting the learned documents that virtually no other documents are considered even remotely similar. It is beyond the scope of this tutorial to describe the variety of statistical methods which can help decide which features are meaningful. However, to get a rough idea of the quality of a model, we can look at the cross entropy reported by dbacl.
The cross entropy is measured in bits and has the following meaning: if we use our probabilistic model to construct an optimal compression algorithm, then the cross entropy of a text string is the predicted number of bits which is needed, on average, for each separate feature after compression. This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself, and moreover what we mean by compression is the act of compressing the features, not the full document, which also contains punctuation and white space that is ignored. To compute the cross entropy of category one, type

% dbacl -c one sample1.txt -vn
one 7.42 * 678.0

The cross entropy is the first value (7.42) returned. The second value essentially measures how many features describe the document. Now suppose we try other models trained on the same document:

% dbacl -c slick sample1.txt -vn
slick 4.68 * 677.5
% dbacl -c smooth sample1.txt -vn
smooth 6.03 * 640.5

The first thing to note is that the complexity terms are not the same. The slick category is based on word pairs (also called bigrams), of which there are 677 in this document. But there are 678 words, and the fractional value indicates that the last word only counts for half a feature. The smooth category also depends on word pairs, but unlike slick, pairs cannot be counted if they straddle a newline (this is a limitation of line-oriented regular expressions). So in smooth, there are several missing word pairs, and various single words which count as a fractional pair, giving a grand total of 640.5.

The second thing to note is that both bigram models fit sample1.txt better. This is easy to see for slick, since its complexity (essentially the number of features) is nearly the same as for one, so the comparison reduces to seeing which cross entropy is lowest. Let's ask dbacl which category fits better:

% dbacl -c one -c slick sample1.txt -v
slick

You can do the same thing to compare one and smooth. Let's ask dbacl which category fits better overall:

% dbacl -c one -c slick -c smooth sample1.txt -v
slick

We already know that slick is better than one, but why is slick better than smooth? While slick looks at more features than smooth (677.5 versus 640.5), it needs only 4.68 bits of information per feature to represent the sample1.txt document, while smooth needs 6.03 bits on average. So slick wins based on economies of scale.

WARNING: it is not always appropriate to classify documents whose models look at different feature sets, as we did above. The underlying statistical basis for these comparisons is the likelihood, and it is easy to compare "apples and oranges" incorrectly. It is safest to learn and classify documents using exactly the same command line switches for every category.

Decision Theory

If you've read this far, then you probably intend to use dbacl to automatically classify text documents, and possibly execute certain actions depending on the outcome. The bad news is that dbacl isn't designed for this. The good news is that there is a companion program, bayesol, which is. To use it, you just need to learn some Bayesian decision theory.

We'll suppose that the document sample4.txt must be classified in one of the categories one, two and three. To make optimal decisions, you'll need three ingredients: a prior distribution, a set of conditional probabilities and a measure of risk. We'll get to these in turn.
The prior distribution is a set of weights, which you must choose yourself, representing your beforehand beliefs. You choose it before you even look at sample4.txt. For example, you might know from experience that category one is twice as likely as two or three, which corresponds to the weights one:2, two:1, three:1. If you have no idea what to choose, give each category an equal weight (one:1, two:1, three:1).

Next, we need conditional probabilities. This is what dbacl is for. Type

% dbacl -l three sample3.txt
% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two 100.00% three 0.00%

As you can see, dbacl is 100% sure that sample4.txt resembles category two. Such accurate answers are typical of the kinds of models used by dbacl. In reality, the probabilities for one and three are very, very small, and the probability for two is very close to, but not equal to, 1. See Appendix B for a rough explanation.

We combine the prior (which represents your own beliefs and experience) with the conditionals (which represent what dbacl thinks about sample4.txt) to obtain a set of posterior probabilities. In our example, the conditional probability for two dominates, so the posterior probabilities overwhelmingly favour two whatever reasonable prior we choose; the sketch below shows the arithmetic.
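To see concretely how the prior and the conditionals combine, here is a minimal back-of-the-envelope sketch in awk. It multiplies each prior weight by the corresponding conditional probability and renormalizes; the conditional values are the rounded percentages printed by dbacl above, so the result is only illustrative (dbacl and bayesol work with the exact, unrounded probabilities internally):

% awk 'BEGIN {
    n = split("one two three", cat, " ");
    prior["one"] = 2;   cond["one"] = 0.00;   # prior weights one:2, two:1, three:1
    prior["two"] = 1;   cond["two"] = 1.00;   # conditionals as reported by dbacl -N (rounded)
    prior["three"] = 1; cond["three"] = 0.00;
    for (i = 1; i <= n; i++) { post[cat[i]] = prior[cat[i]] * cond[cat[i]]; total += post[cat[i]] }
    for (i = 1; i <= n; i++) printf("%s %.2f%%\n", cat[i], 100 * post[cat[i]] / total)
  }'
one 0.00%
two 100.00%
three 0.00%

With these rounded inputs the posterior is driven entirely by the conditional for two; with the exact probabilities the other categories would receive tiny but nonzero posterior weights.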
Now comes the tedious part. What you really want to do is take these posterior probabilities under advisement, and make an informed decision. To decide which category best suits your own plans, you need to work out the costs of misclassification. Only you can decide these numbers, and there are many of them, but at the end of the exercise you will have worked out your risk. Here's an example:
We are now ready to combine all these numbers to obtain the True Bayesian Decision. For every category we could choose, we weigh the cost of each possible misclassification by the posterior probability of the corresponding true category, and add these up. Then we choose the category with the least expected posterior risk.
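As an illustration only, here is the same weighting carried out in awk, using the rounded posterior probabilities from above together with the toy loss matrix that appears in the toy.risk file of the next section (reading each row as the true category and each column as the chosen one, consistent with the phrase "the cost of misclassifying two as three" used later). The exact figures of a careful hand calculation will differ slightly, since the true posteriors are not exactly 0 and 1:

% awk 'BEGIN {
    n = split("one two three", cat, " ");
    post["one"] = 0.00; post["two"] = 1.00; post["three"] = 0.00;   # rounded posteriors
    split("0 1 2", L1); split("3 0 5", L2); split("1 1 0", L3);     # toy loss matrix rows
    for (j = 1; j <= n; j++) {
      loss["one", cat[j]] = L1[j]; loss["two", cat[j]] = L2[j]; loss["three", cat[j]] = L3[j]
    }
    for (j = 1; j <= n; j++) {   # expected risk of choosing cat[j]
      risk = 0;
      for (i = 1; i <= n; i++) risk += post[cat[i]] * loss[cat[i], cat[j]];
      printf("choose %s: expected risk %.2f\n", cat[j], risk)
    }
  }'
choose one: expected risk 3.00
choose two: expected risk 0.00
choose three: expected risk 5.00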
The lowest expected risk is for category two, so that's the category we choose to represent sample4.txt. Done! Of course, the loss matrix above doesn't really have an effect on the probability calculations, because the conditional probabilities strongly point to category two anyway. But now you understand how the calculation works. Below, we'll look at a more realistic example (but still one specially chosen to illustrate some points).

One last point: you may wonder how dbacl itself decides which category to display when classifying with the -v switch. The simple answer is that dbacl always displays the category with maximal conditional probability (often called the MAP estimate). This is mathematically completely equivalent to the special case of decision theory where the prior has equal weights and the loss matrix takes the value 1 everywhere except on the diagonal (i.e. correct classifications cost nothing, every misclassification costs 1).

Using bayesol

bayesol is a companion program for dbacl which makes the decision calculations easier. The bad news is that you still have to write down a prior and loss matrix yourself. Eventually, someone, somewhere may write a graphical interface. The good news is that for most classification tasks, you don't need to bother with bayesol at all, and can skip this section. Really.

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of misclassifications. For the toy example discussed earlier, the file toy.risk looks like this:

categories { one, two, three }
prior { 2, 1, 1 }
loss_matrix {
"" one   [ 0, 1, 2 ]
"" two   [ 3, 0, 5 ]
"" three [ 1, 1, 0 ]
}

Let's see if our hand calculation was correct:

% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v
two

Good! However, as discussed above, the misclassification costs need improvement. This is completely up to you, but here are some possible suggestions to get you started.

To devise effective loss matrices, it pays to think about the way that dbacl computes the probabilities. Appendix B gives some details, but we don't need to go that far. Recall that the language models are based on features (which are usually kinds of words). Every feature counts towards the final probabilities, and a big document will have more features, hence more opportunities to steer the probabilities one way or another. So a feature is like an information bearing unit of text.

When we read a text document which doesn't accord with our expectations, we grow progressively more annoyed as we read further into the text. This is like an annoyance interest rate which compounds on the information units within the text. For dbacl, the number of information bearing units is reported as the complexity of the text. This suggests that the cost of reading a misclassified document could have the form (1 + interest)^complexity. Here's an example loss matrix which uses this idea:

loss_matrix {
"" one   [ 0,                (1.1)^complexity,  (1.1)^complexity ]
"" two   [ (1.1)^complexity, 0,                 (1.7)^complexity ]
"" three [ (1.5)^complexity, (1.01)^complexity, 0 ]
}

Remember, these aren't monetary interest rates, they are value judgements.
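To get a feeling for how quickly such a cost compounds with the size of the document, here is a quick check in awk; the complexities 100, 300 and 500 are arbitrary round numbers chosen purely for illustration:

% awk 'BEGIN { for (c = 100; c <= 500; c += 200)
    printf("complexity %d: 1.01^c = %.1f, 1.1^c = %.3g\n", c, 1.01^c, 1.1^c) }'
complexity 100: 1.01^c = 2.7, 1.1^c = 1.38e+04
complexity 300: 1.01^c = 19.8, 1.1^c = 2.62e+12
complexity 500: 1.01^c = 144.8, 1.1^c = 4.97e+20

Even a 1% "interest rate" grows noticeably over a few hundred features, while 10% quickly becomes astronomical, so small differences in the chosen rates express strong preferences.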
You can see this loss matrix in action by typing

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
three

Now if we increase the cost of misclassifying two as three from 1.7 to 2.0, the optimal category becomes

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v
two

bayesol can also handle infinite costs. Just write "inf" where you need it. This is particularly useful with regular expressions. If you look at each row of loss_matrix above, you see an empty string "" before each category. This indicates that the row is to be used by default in the actual loss matrix. But sometimes, the losses can depend on seeing a particular string in the document we want to classify. Suppose you normally like to use the loss matrix above, but if the document contains the word "Polly", then the cost of misclassification is infinite. Here is an updated loss_matrix:

loss_matrix {
"" one      [ 0,                (1.1)^complexity,  (1.1)^complexity ]
"Polly" two [ inf,              0,                 inf ]
"" two      [ (1.1)^complexity, 0,                 (2.0)^complexity ]
"" three    [ (1.5)^complexity, (1.01)^complexity, 0 ]
}

bayesol looks in its input for the regular expression "Polly", and if it is found, then for misclassifications away from two it uses the row with the infinite values; otherwise it uses the default row, which starts with "". If you have several rows with regular expressions for a category, bayesol always uses the first one from the top which matches within the input. You must always have at least a default row for every category.

The regular expression facility can also be used to perform more complicated document dependent loss calculations. Suppose you like to count the number of lines of the input document which start with the character '>', as a proportion of the total number of lines in the document. The following perl script transcribes its input and appends the calculated proportion.

#!/usr/bin/perl
# this is file prop.pl
$special = $normal = 0;
while(<STDIN>) {
    $special++ if /^ >/;
    $normal++;
    print;
}
$prop = $special/$normal;
print "proportion: $prop\n";

If we used this script, then we could take the output of dbacl, append the proportion of lines starting with '>', and pass the result as input to bayesol. For example, the following line is included in the example3.risk specification

"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]

and through this, bayesol reads, if present, the line containing the proportion we calculated and takes it into account when it constructs the loss matrix. You can try this like so:

% dbacl -T email -c one -c two -c three sample6.txt -nav \
  | perl prop.pl | bayesol -c example3.risk -v

Note that in the loss_matrix specification above, $1 refers to the numerical value of the quantity inside the parentheses. Also, it is useful to remember that when using the -a switch, dbacl outputs all the original lines from unknown.txt with an extra space in front of them. If another instance of dbacl needs to read this output again (e.g. in a pipeline), then the latter should be invoked with the -A switch.

Miscellaneous

Be careful when classifying very small strings. Except for the multinomial models (which include the default model), the dbacl calculations are optimized for large strings with more than 20 or 30 features. For small text lines, the complex models give only approximate scores. In those cases, stick with unigram models, which are always exact.

In the UNIX philosophy, programs are small and do one thing well.
Following this philosophy, dbacl essentially reads only plain text documents. If you have non-textual documents (word, html, postscript) which you want to learn from, you will need to use specialized tools to first convert these into plain text. There are many free tools available for this. dbacl has limited support for reading mbox files (UNIX email) and can filter out html tags in a quick and dirty way, but this is only intended as a convenience and should not be relied upon to be fully accurate.

Appendix A: memory requirements

When experimenting with complicated models, dbacl will quickly fill up its hash tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, which is enough for 32,000 unique features and produces a 512K category file on my system. You can use the -h switch to select the hash table size, as a power of two. Beware that learning takes much more memory than classifying: use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying but 17 bytes for learning. For testing, I use the collected works of Mark Twain, which is a 19MB pure text file. Timings are on a 500MHz Pentium III.
As can be seen from this table, including bigrams and trigrams has a noticeable memory and performance effect during learning. Luckily, classification speed is only affected by the number of features found in the unknown document.
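If you do run out of hash table room while learning a large n-gram model, the simplest workaround is to allocate a bigger table up front with the -h switch described above. A minimal sketch, where the category name big and the file Twain.txt are just placeholders; since the table size is a power of two, -h 17 provides four times as many feature slots as the default -h 15:

% dbacl -l big -w 3 -h 17 Twain.txt
% dbacl -c big -v unknown.txt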
The heavy memory requirements during learning of complicated models can be reduced at the expense of the model itself. dbacl has a feature decimation switch, which slows down the hash table filling rate by simply ignoring many of the features found in the input.

Appendix B: Extreme probabilities

Why is the result of a dbacl probability calculation always so accurate?

% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two 100.00% three 0.00%

The reason for this has to do with the type of model which dbacl uses. Let's look at some scores:

% dbacl -c one -c two -c three sample4.txt -n
one 13549.34 two 8220.22 three 13476.84
% dbacl -c one -c two -c three sample4.txt -nv
one 26.11 * 519.0 two 15.84 * 519.0 three 25.97 * 519.0

The first set of numbers is minus the logarithm (base 2) of each category's probability of producing the full document sample4.txt. It represents the evidence away from each category, and is measured in bits. one and three are fairly even, but two has by far the lowest score and hence the highest probability (in other words, the model for two is the least bad at predicting sample4.txt, so if there are only three possible choices, it's the best).

To understand these numbers, it's best to split each of them up into a product of cross entropy and complexity, as is done in the second line. Remember that dbacl calculates probabilities about resemblance by weighing the evidence for all the features found in the input document. There are 519 features in sample4.txt, and each feature contributes on average 26.11 bits of evidence against category one, 15.84 bits against category two and 25.97 bits against category three.

Let's look at what happens if we only take the first 25 lines of sample4.txt:

% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 20.15 * 324.0 two 15.18 * 324.0 three 20.14 * 324.0

There are fewer features in the first 25 lines of sample4.txt than in the full text file, but the picture is substantially unchanged.

% head -25 sample4.txt | dbacl -c one -c two -c three -N
one 0.00% two 100.00% three 0.00%

dbacl is still very sure, because it has looked at many features (324) and found small differences which add up to quite different scores. However, you can see that each feature now contributes less information (20.15, 15.18, 20.14 bits) compared to earlier (26.11, 15.84, 25.97 bits).

Since category two is obviously the best (closest to zero) choice among the three models, let's drop it for a moment and consider the other two categories. We also reduce dramatically the number of features (words) we look at. The first line of sample4.txt has 15 words:

% head -1 sample4.txt | dbacl -c one -c three -N
one 25.65% three 74.35%

Finally, we are getting probabilities we can understand! Unfortunately, this is somewhat misleading. Each of the 15 words gave a score, and these scores were added up for each category. Since both categories here are about equally bad at predicting words in sample4.txt, the difference in the final scores for categories one and three amounts to less than 3 bits of information, which is why the probabilities are mixed:

% head -1 sample4.txt | dbacl -c one -c three -nv
one 16.61 * 15.0 three 16.51 * 15.0

So the interpretation of the probabilities is clear. dbacl weighs the evidence from each feature it finds, and reports the best fit among the choices it is offered. Because it sees so many features separately (usually hundreds), it believes its verdict is very sure. Wouldn't you be, after hundreds of checks?
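For the curious, here is a rough reconstruction in awk (not dbacl's actual code) of how such scores turn into the extreme percentages printed by -N. Since each score is minus a base 2 logarithm of a probability, the relative probabilities are 2 raised to the negative score differences, and a gap of even a few dozen bits is already overwhelming; here the gap exceeds 5000 bits, so the ratio underflows to zero in any floating point representation:

% awk 'BEGIN {
    n = split("one two three", cat, " ");
    score["one"] = 13549.34; score["two"] = 8220.22; score["three"] = 13476.84;
    best = score[cat[1]];
    for (i = 2; i <= n; i++) if (score[cat[i]] < best) best = score[cat[i]];
    for (i = 1; i <= n; i++) { p[cat[i]] = 2^(-(score[cat[i]] - best)); total += p[cat[i]] }
    for (i = 1; i <= n; i++) printf("%s %.2f%%\n", cat[i], 100 * p[cat[i]] / total)
  }'
one 0.00%
two 100.00%
three 0.00%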
Of course, whether these features are independent, and whether they are the right features to look at for the best classification, is another matter entirely, and one you must decide for yourself. dbacl can't do much about its built-in assumptions. Last but not least, the probabilities above are not the same as the confidence percentages printed by the -U switch. The -U switch was developed to overcome the limitations above by looking at dbacl's calculations from a higher level, but that is a topic for another time.