Last Modified: 2004-07-27 (Since: 2004-01-09)
Shiro Kawai (shiro@acm.org)
Scbayes is a library for scmail that implements the Bayesian spam filter described in Paul Graham's "A Plan for Spam". It is bundled with scmail and works within scmail. A separate script, 'scbayes', is provided to manage the database.
To filter spam with the Bayesian filter in scmail, you first need to take the following preparation steps.
First of all, you need to build a statistics table. You need to have separate folders for spams and non-spam mails. To learn non-spam mails, run scbayes as follows:
% scbayes --learn-nonspam folder ...
To learn spams, run scbayes as follows:
% scbayes --learn-spam folder ...
These commands build a statistics table (under ~/.scmail/ by default; you can specify the location of the table in ~/.scmail/config). If you already have a table, the newly learned data is added to it.
For example, suppose you have non-spam saved mails in the 'inbox-save' folder, and discarded (but non-spam) mails in the 'trash' folder, and spam mails in 'spam' folder. Then you can build the statistics table by the following commands.
% scbayes --learn-nonspam inbox-save trash
% scbayes --learn-spam spam
On the author's machine (Pentium 4, 2GHz), it took about 15 minutes to learn 15000 spam mails.
Scbayes remembers the mails it has learned, so you can run the learning command again on the same folder to learn only the mails added since the last run.
If you mistakenly learn mails in the wrong category, you can unlearn them with the --unlearn-spam and --unlearn-nonspam options. These make scbayes 'forget' all mails in the given folder(s).
For the time being, learning and unlearning work only on whole folders. If you want to deal with a particular message, create a folder, move the message into it, then run the command on that folder.
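The single-message workaround above can be scripted. The following is only a sketch under stated assumptions, not part of scmail: the helper name learn_single_message, the scratch folder name "scbayes-single", and the idea that folders are subdirectories of a mail root (like ~/Mail in the --check-mail example) are all hypothetical. It uses only the --learn-spam and --learn-nonspam options described above.

```python
import shutil
import subprocess
from pathlib import Path


def learn_single_message(message, category, mail_root,
                         scratch="scbayes-single", run=subprocess.run):
    """Move one message into a scratch folder and run scbayes on that folder.

    Assumptions (not taken from the scbayes documentation): mail folders are
    subdirectories of `mail_root`, and scbayes resolves the folder name
    `scratch` relative to that same directory.
    """
    if category not in ("spam", "nonspam"):
        raise ValueError("category must be 'spam' or 'nonspam'")
    folder = Path(mail_root) / scratch
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / Path(message).name
    if target.exists():
        # Refuse to overwrite a message that is already staged.
        raise FileExistsError(target)
    shutil.move(str(message), str(target))
    # `run` is injectable so the helper can be tried without scbayes installed.
    run(["scbayes", f"--learn-{category}", scratch], check=True)
    return target
```

The message stays in the scratch folder afterwards; where to file it next is left to you.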
If you write
(add-bayesian-filter-rule!)
in ~/.scmail/refile-rules and/or ~/.scmail/deliver-rules, scmail-refile/scmail-deliver applies the Bayesian filter to the message it is processing, and refiles it to the "spam" folder (you can specify the folder in ~/.scmail/config) if the message is judged to be spam.
Note that scmail applies rules in the order they appear in the configuration file, so the location of (add-bayesian-filter-rule!) matters. I found that it worked well to first apply explicit rules that refile emails from known senders, then apply the Bayesian filter to the rest.
If you need to write a filtering rule as a lambda expression, you can use the predicate 'mail-is-spam?' to check whether the mail is spam. For example, you can write the following rule.
(lambda (mail)
  (and (mail-is-spam? mail)
       (mail 'from #/\.kp\b/)
       (refile mail "spam-from-north-korea")))
Bayesian filtering doesn't work well if you have too few mails to learn from. Mails that are not spam but that you don't need are also useful learning material, so you'd better keep them separately, e.g. in a "trash" folder.
Some types of non-spam mails, such as periodical ad mails, or free mailing-list messages with an ad attached, may look suspiciously like spams statistically. A good strategy is to refile them by explicit rules before applying the bayesian filter.
The "scbayes" script has a few other management features.
You can view statistics about the table with the --table-stat option.
% scbayes --table-stat
lang        nonspam              spam
#t   : 184657w/ 3129m   453409w/14061m
jp   :  88720w/ 3834m    67379w/  813m
total: 273377w/ 6963m   520788w/14874m
Each row shows the number of learned words (w) and messages (m) for each category (nonspam/spam). The "#t" row shows non-Japanese messages, the "jp" row shows Japanese messages, and the "total" row is the sum of the two.
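If you want to process these numbers in a script, the rows are easy to parse. The sketch below is an illustration only: the function name parse_table_stat is hypothetical, and the regular expression assumes one row per line in the "NNNw/ NNNm" layout of the sample above. It also checks that the "total" row really is the sum of the "#t" and "jp" rows.

```python
import re

# Sample text copied from the --table-stat output shown above.
SAMPLE = """\
lang        nonspam              spam
#t   : 184657w/ 3129m   453409w/14061m
jp   :  88720w/ 3834m    67379w/  813m
total: 273377w/ 6963m   520788w/14874m
"""

# label : <words>w/<messages>m  <words>w/<messages>m
ROW = re.compile(r"^\s*(\S+?)\s*:\s*(\d+)w/\s*(\d+)m\s+(\d+)w/\s*(\d+)m\s*$")


def parse_table_stat(text):
    """Return {row: {"nonspam": (words, messages), "spam": (words, messages)}}."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:  # the header line does not match and is skipped
            name, nw, nm, sw, sm = m.groups()
            rows[name] = {"nonspam": (int(nw), int(nm)),
                          "spam": (int(sw), int(sm))}
    return rows


stats = parse_table_stat(SAMPLE)
for category in ("nonspam", "spam"):
    words = stats["#t"][category][0] + stats["jp"][category][0]
    messages = stats["#t"][category][1] + stats["jp"][category][1]
    # The "total" row should equal the sum of the per-language rows.
    assert (words, messages) == stats["total"][category]
```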
You can check the "spamness" of a particular message without running scmail. Use the --check-mail option of scbayes. Scbayes will show the spam probability of the message, together with the most significant words and their spam probabilities that contribute to the spamness of the message.
% scbayes --check-mail ~/Mail/spam/13500
/home/shiro/Mail/spam/13500 1.0
wwww3 : 0.9999
html4 : 0.9999
style6 : 0.9999
style5 : 0.9999
discreetly : 0.9999
delobel : 0.9999
bastapharma : 0.9999
meds : 0.9999
estes : 0.9998
overnight : 0.9685370077823972
needed! : 0.9652974189631429
medication : 0.9608080329456689
prescription : 0.9600886615864224
prescribed : 0.9578025262665094
medications : 0.9520815441868073
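To see how per-word probabilities like these can lead to an overall value of 1.0, here is a minimal sketch of the combining formula from "A Plan for Spam": the product of the word probabilities divided by that product plus the product of their complements. This is not scbayes's code, and whether scbayes combines the values in exactly this way is an assumption; the function name is hypothetical.

```python
from math import prod


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities as in "A Plan for Spam"."""
    p_spam = prod(word_probs)                     # all words point to spam
    p_ham = prod(1.0 - p for p in word_probs)     # all words point to non-spam
    return p_spam / (p_spam + p_ham)


# The fifteen word probabilities from the sample output above.
probs = [0.9999] * 8 + [
    0.9998,
    0.9685370077823972,
    0.9652974189631429,
    0.9608080329456689,
    0.9600886615864224,
    0.9578025262665094,
    0.9520815441868073,
]
print(combined_spam_probability(probs))
```

With fifteen words that all lean strongly toward spam, the non-spam product is vanishingly small, so the combined value is indistinguishable from 1.0, in line with the figure printed next to the file name.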
If you want to know how accurate scbayes is, you can run it over a specific folder with the --check-spam and --check-nonspam options.
% scbayes --check-spam folder
This scans all messages in the folder and reports the messages that are not categorized as spam. So if you run it on the spam folder, you can see how many messages scbayes would have failed to recognize as spam.
% scbayes --check-nonspam folder
This also scans the folder, but reports the messages that are categorized as spam. If you run it on a non-spam folder, you can check for false positives. If you're getting too many false positives, consider applying explicit rules to refile the suspicious messages. In my experience, this check just shows the spam that ended up in my non-spam folders through my own mistakes.
Scbayes is a pretty straightforward implementation of Paul Graham's method. However, it applies some special treatment to Japanese messages.
For further details, see Gauche:SpamFilter.