scbayes: a Bayesian filtering library for scmail

Last Modified: 2004-07-27 (Since: 2004-01-09)

Shiro Kawai (shiro@acm.org)

What is scbayes?

Scbayes is a library for scmail that implements the Bayesian spam filter described in Paul Graham's "A Plan for Spam". It is bundled with scmail and works within it. A separate script, 'scbayes', is provided to manage the database.

To filter spam with the Bayesian filter in scmail, you first need to complete two preparation steps:

1. Use the scbayes script to build a statistics table by learning from spam and non-spam mail.
2. Add a Bayesian filtering rule to your scmail configuration file.

Learning

First of all, you need to build a statistics table. You need separate folders for spam and non-spam mail. To learn non-spam mail, run scbayes as follows:

% scbayes --learn-nonspam folder ...

To learn spam, run scbayes as follows:

% scbayes --learn-spam folder ...

These commands build a statistics table (under ~/.scmail/ by default; you can specify the location of the table in ~/.scmail/config). If you already have a table, the newly learned data is added to it.

For example, suppose you keep saved non-spam mails in the 'inbox-save' folder, discarded (but non-spam) mails in the 'trash' folder, and spam in the 'spam' folder. Then you can build the statistics table with the following commands.

% scbayes --learn-nonspam inbox-save trash
% scbayes --learn-spam spam

On the author's machine (Pentium 4, 2GHz), it took about 15 minutes to learn 15000 spam messages.

Scbayes remembers the mails it has learned, so you can run the learning command again on the same folder to learn only the mails added since the last run.

If you mistakenly learn mails in the wrong category, you can unlearn them with the --unlearn-spam and --unlearn-nonspam options. These make scbayes 'forget' all mails in the given folder(s).
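
The bookkeeping described above (incremental learning plus unlearning) can be pictured with a small Python sketch. This is not scbayes's actual code or storage format: the StatTable class, its methods, and the message IDs are hypothetical, and plain whitespace splitting stands in for the real tokenizer.

```python
from collections import Counter


class StatTable:
    """Hypothetical statistics table: per-category word counts plus learned mails."""

    def __init__(self):
        self.words = {"spam": Counter(), "nonspam": Counter()}
        # message ID -> tokens learned from it, kept so the message can be unlearned
        self.learned = {"spam": {}, "nonspam": {}}

    def learn(self, category, msg_id, text):
        # Skip mails learned earlier, so re-running on a folder only adds new mail.
        if msg_id in self.learned[category]:
            return False
        tokens = text.lower().split()  # assumption: naive whitespace tokenizer
        self.words[category].update(tokens)
        self.learned[category][msg_id] = tokens
        return True

    def unlearn(self, category, msg_id):
        # Subtract exactly what this mail added, then forget the mail itself.
        tokens = self.learned[category].pop(msg_id, None)
        if tokens is None:
            return False
        self.words[category].subtract(tokens)
        self.words[category] += Counter()  # drops words whose count fell to zero
        return True


table = StatTable()
table.learn("spam", "spam/1", "cheap meds overnight")
table.learn("spam", "spam/1", "cheap meds overnight")  # ignored: already learned
table.learn("nonspam", "spam/2", "meeting notes")      # learned in the wrong category
table.unlearn("nonspam", "spam/2")                     # undo the mistake
```

Keeping the tokens of each learned mail is only one way to make unlearning exact; a real implementation could instead re-read the message from the folder.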

For the time being, scbayes can only learn or unlearn whole folders. If you want to handle a particular message, create a folder, move the message into it, and then run the command on that folder.

Configuring scmail

If you write

(add-bayesian-filter-rule!)

in ~/.scmail/refile-rules and/or ~/.scmail/deliver-rules, scmail-refile/scmail-deliver applies the Bayesian filter to each message it processes and, if the message is judged to be spam, refiles it to the "spam" folder (you can specify the folder in ~/.scmail/config).

Note that scmail applies rules in the order they appear in the rules file, so the location of (add-bayesian-filter-rule!) matters. I found that it worked well if I first applied explicit rules to refile emails from known senders, then applied the Bayesian filter to the rest.
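
Why the position of the rule matters can be shown with a toy model in Python. This is only an illustration of ordered, first-match rule processing, which is an assumption about scmail's behavior here; the rule functions, the mail dictionaries, the "spam_score" field, and the 0.9 threshold are all hypothetical and not part of scmail.

```python
def known_sender_rule(mail):
    # Explicit rule: mail from a known sender goes to a fixed folder.
    if mail["from"].endswith("@example.org"):
        return "friends"
    return None


def bayesian_rule(mail):
    # Stand-in for the Bayesian filter; the score field and threshold are made up.
    if mail["spam_score"] > 0.9:
        return "spam"
    return None


def refile(mail, rules, default="inbox"):
    # Try the rules in order; the first one that names a folder wins.
    for rule in rules:
        folder = rule(mail)
        if folder is not None:
            return folder
    return default


rules = [known_sender_rule, bayesian_rule]
newsletter = {"from": "news@example.org", "spam_score": 0.97}
stranger = {"from": "x@example.com", "spam_score": 0.97}
print(refile(newsletter, rules))  # friends
print(refile(stranger, rules))    # spam
```

With the two rules swapped, the newsletter would land in "spam" even though its sender is known, which is the situation the ordering advice above avoids.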

If you need to write a filtering rule as a lambda expression, you can use the predicate 'mail-is-spam?' to check whether the mail is spam. For example, you can write the following rule.

(lambda (mail)
  ;; Refile only mail that is judged spam and whose From: field matches ".kp".
  (and (mail-is-spam? mail)
       (mail 'from #/\.kp\b/)
       (refile mail "spam-from-north-korea")))

Tips

Bayesian filtering doesn't work well if you have too few mails to learn from. Mails that are not spam but that you don't need are also useful learning material, so you'd better keep them in a separate folder, e.g. "trash".
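
One reason a small corpus hurts can be seen in the per-word formula from "A Plan for Spam". The Python sketch below follows the essay, not scbayes's source: the function name is hypothetical, and the constants (doubling non-spam counts, the minimum of 5 occurrences, clamping to 0.01 and 0.99) are the essay's and may differ from what scbayes uses.

```python
def word_probability(good, bad, ngood, nbad):
    """Per-word spam probability, following the formula in "A Plan for Spam".

    good, bad:   occurrences of the word in non-spam and spam mail
    ngood, nbad: number of non-spam and spam messages learned (both assumed > 0)
    """
    g = 2 * good  # the essay counts non-spam occurrences double
    if g + bad < 5:
        return None  # seen too rarely; the essay gives such words no probability
    bad_freq = min(1.0, bad / nbad)
    good_freq = min(1.0, g / ngood)
    p = bad_freq / (good_freq + bad_freq)
    return max(0.01, min(0.99, p))  # never let a single word be fully decisive


print(word_probability(1, 1, 10, 10))        # None
print(word_probability(0, 20, 1000, 1000))   # 0.99
```

With only a handful of learned mails, most words stay below the occurrence cutoff and contribute nothing, so the filter has little evidence to combine.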

Some types of non-spam mail, such as periodical ad mails or free mailing-list messages with an ad attached, may look statistically very much like spam. A good strategy is to refile them with explicit rules before applying the Bayesian filter.

Other features of scbayes

The "scbayes" script has a few other management features.

You can view statistics about the table with the --table-stat option.

% scbayes --table-stat
lang       nonspam           spam
#t  :  184657w/ 3129m   453409w/14061m
jp  :   88720w/ 3834m    67379w/  813m
total: 273377w/ 6963m   520788w/14874m

Each row shows the number of learned words (w) and messages (m) for each category (nonspam/spam). The "#t" row shows non-Japanese messages, the "jp" row shows Japanese messages, and the "total" row is the sum of the two.
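
As a worked example of reading this table, the following Python sketch parses the sample output above and computes the share of spam among learned messages per language. The parsing is an assumption based only on the sample shown here; the function names are hypothetical and the real output format may vary.

```python
import re

# The sample --table-stat output shown above.
REPORT = """\
lang       nonspam           spam
#t  :  184657w/ 3129m   453409w/14061m
jp  :   88720w/ 3834m    67379w/  813m
total: 273377w/ 6963m   520788w/14874m
"""


def parse_table_stat(report):
    """Return {row label: {"nonspam": (words, msgs), "spam": (words, msgs)}}."""
    rows = {}
    for line in report.splitlines():
        cells = re.findall(r"(\d+)w/\s*(\d+)m", line)
        if len(cells) != 2:
            continue  # header line or anything unexpected
        label = line.split(":", 1)[0].strip()
        nonspam, spam = [(int(w), int(m)) for w, m in cells]
        rows[label] = {"nonspam": nonspam, "spam": spam}
    return rows


def spam_ratio(row):
    # Fraction of learned messages in this row that are spam.
    spam_msgs = row["spam"][1]
    return spam_msgs / (spam_msgs + row["nonspam"][1])


stats = parse_table_stat(REPORT)
for label in ("#t", "jp"):
    print(label, round(spam_ratio(stats[label]), 2))
```

For the sample data this gives about 0.82 for non-Japanese mail and about 0.17 for Japanese mail.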

You can check the "spamness" of a particular message without running scmail by using the --check-mail option of scbayes. Scbayes shows the spam probability of the message, along with the most significant words that contribute to its spamness and their spam probabilities.

% scbayes --check-mail ~/Mail/spam/13500
/home/shiro/Mail/spam/13500
  1.0
    wwww3 : 0.9999
    html4 : 0.9999
    style6 : 0.9999
    style5 : 0.9999
    discreetly : 0.9999
    delobel : 0.9999
    bastapharma : 0.9999
    meds : 0.9999
    estes : 0.9998
    overnight : 0.9685370077823972
    needed! : 0.9652974189631429
    medication : 0.9608080329456689
    prescription : 0.9600886615864224
    prescribed : 0.9578025262665094
    medications : 0.9520815441868073
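
To see how the per-word values relate to the overall figure, here is a minimal Python sketch of the combining formula from "A Plan for Spam", applied to the 15 word probabilities in the sample output above. It follows the essay, not scbayes's source, so treat it as an approximation of what scbayes computes; the function name is hypothetical.

```python
from math import prod

# Word probabilities copied from the sample --check-mail output above.
word_probs = [
    0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9998,
    0.9685370077823972, 0.9652974189631429, 0.9608080329456689,
    0.9600886615864224, 0.9578025262665094, 0.9520815441868073,
]


def combine(probs):
    """Combined spam probability: prod(p) / (prod(p) + prod(1 - p))."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


print(combine(word_probs))  # 1.0
```

Because every listed word is strongly spam-like, the product of the (1 - p) terms is vanishingly small and the result is 1.0, the same value as in the sample.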

If you want to know how accurate scbayes is, you can run it over a specific folder with the --check-spam and --check-nonspam options.

% scbayes --check-spam folder

This scans all messages in the folder and reports those that are not categorized as spam. So if you run it on your spam folder, you can see how many messages scbayes would have failed to recognize as spam.

% scbayes --check-nonspam folder

This also scans the folder, but reports messages that are categorized as spam. If you run it on a non-spam folder, you can check for false positives. If you're getting too many false positives, consider adding explicit rules to refile the suspicious messages. In my experience, this check only shows spam that ended up in a non-spam folder through my own mistake.

What scbayes does

Scbayes is a fairly straightforward implementation of Paul Graham's method. However, it treats Japanese messages specially in the following ways.

- It tokenizes Japanese messages into bigrams. It also recognizes several Japanese punctuation characters as delimiters. In addition, a transition from a non-kanji Japanese character to a kanji character is very likely to be a word boundary, so scbayes treats it as one.
- Scbayes canonicalizes the mail's character encoding, honoring the charset attribute of the content-type field first. However, not all mails have a proper charset attribute. When character encoding conversion or tokenization fails, scbayes rescans the message as a byte sequence.
- For typical Japanese users, the spam ratio usually differs greatly between Japanese and non-Japanese mails. A single statistics table would introduce bias (e.g. if almost all the English emails one receives are spam, most English words are likely to get a high spam probability). So scbayes uses separate tables for Japanese and non-Japanese mails.
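
The bigram tokenization with the kana-to-kanji boundary rule can be sketched in Python as follows. This is an illustration of the idea only, not scbayes's code: the delimiter set, the Unicode ranges used to classify characters, and all function names are assumptions.

```python
# Assumed delimiter set and Unicode ranges; scbayes's actual choices may differ.
DELIMITERS = set("、。「」（）！？")


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"  # CJK Unified Ideographs


def is_kana(ch):
    return "\u3040" <= ch <= "\u30ff"  # hiragana and katakana blocks


def split_segments(text):
    """Split at delimiters, whitespace, and kana-to-kanji transitions."""
    segments, current, prev = [], "", None
    for ch in text:
        if ch in DELIMITERS or ch.isspace():
            if current:
                segments.append(current)
            current, prev = "", None
            continue
        if prev is not None and is_kana(prev) and is_kanji(ch):
            segments.append(current)  # likely word boundary
            current = ""
        current += ch
        prev = ch
    if current:
        segments.append(current)
    return segments


def bigrams(segment):
    # A one-character segment is kept as a single token.
    return [segment[i:i + 2] for i in range(len(segment) - 1)] or [segment]


def tokenize(text):
    return [tok for seg in split_segments(text) for tok in bigrams(seg)]


print(tokenize("私は学生です。"))  # ['私は', '学生', '生で', 'です']
```

Note that no bigram spans the boundary between "は" and "学": the transition from kana to kanji ends one segment and starts the next.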

For further details, see Gauche:SpamFilter.