Some internals, and further reading

Antispam decision

Clapf uses a statistical algorithm* to decide whether an incoming email is spam. It only marks the message, inserting a new message header called “X-Clapf-spamicity” that carries the spam probability of the email. If the probability is above a certain limit, clapf also adds the “X-Clapf-spamicity: Yes” header to the message. This second header exists only to make spam easy to recognize.

*: inverse chi-square

You may use the following snippet in ~/.mailfilter with maildrop to move spam into your junk folder:

if(/^X-Clapf-spamicity: Yes/:h)
{
        to "mail/junk"
}

Clapf only marks the message instead of rejecting or removing it. Although statistical filters are known for their high accuracy, no anti-spam measure is 100% perfect. It is up to the user what to do with messages marked as spam: they can be delivered to the user's junk folder or moved to the spam quarantine.
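If you do not use maildrop, the same check is easy to script. The following is a minimal sketch in Python that looks for the “X-Clapf-spamicity: Yes” header with the standard email module. The function name is_marked_spam and the sample message are made up for illustration, and the numeric header value shown in the sample is only an assumption about what the probability header looks like.

```python
from email import message_from_string


def is_marked_spam(raw_message: str) -> bool:
    """Return True if any X-Clapf-spamicity header value starts with 'Yes'."""
    msg = message_from_string(raw_message)
    # A message may carry the header twice (probability and flag), so read all of them.
    values = msg.get_all("X-Clapf-spamicity", [])
    return any(v.strip().lower().startswith("yes") for v in values)


# Hypothetical sample message; the "0.9999" value format is an assumption.
sample = (
    "From: a@example.com\n"
    "X-Clapf-spamicity: 0.9999\n"
    "X-Clapf-spamicity: Yes\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

if __name__ == "__main__":
    print(is_marked_spam(sample))
```

A delivery script could use the result to pick the target folder, which mirrors what the maildrop rule does.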

Training

A statistical filter must be trained before it is used. Read the training paper to learn how to create the initial token database and how to train it further.

You may test the antispam decision mechanism with a single RFC-822 format message:

# test with your current unix uid as the clapf uid
spamdrop -D -m messagefile

# test for jim@aaaa.fu
spamdrop -D -r jim@aaaa.fu -m messagefile

Statistical internals

ham probability = (the number of ham emails containing the word w) / (the total number of ham emails)
spam probability = (the number of spam emails containing the word w) / (the total number of spam emails)

probability of a given word (pn) = spam probability / (ham probability + spam probability)
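As a worked example of the three formulas above, here is a minimal Python sketch. The function and parameter names are illustrative and not taken from the clapf source, and returning the neutral 0.5 for a word that was never seen is an assumption of this sketch.

```python
def token_probability(spam_with_w, total_spam, ham_with_w, total_ham):
    """Spam probability of a single word w, following the formulas above."""
    ham_prob = ham_with_w / total_ham
    spam_prob = spam_with_w / total_spam
    if ham_prob + spam_prob == 0:
        # Assumption of this sketch: a word never seen in training is neutral.
        return 0.5
    return spam_prob / (ham_prob + spam_prob)


if __name__ == "__main__":
    # The word appears in 30 of 100 spam emails and in 10 of 200 ham emails.
    print(round(token_probability(30, 100, 10, 200), 4))
```

With these counts the spam probability is 0.3, the ham probability is 0.05, and the word probability is 0.3 / 0.35, roughly 0.857.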

Then clapf computes the spaminess and the haminess of the most interesting tokens:

spaminess: P = (1-p1) * (1-p2) * ... * (1-pn)

non-spaminess: Q = p1 * p2 * ... * pn

Now clapf applies the inverse chi-square algorithm and computes a combined indicator (I):

H = chi2inv(-2 * ln Q, 2*n);

S = chi2inv(-2 * ln P, 2*n);

I = (1 + H - S) / 2;
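The combination step can be sketched in Python as follows. This is not the clapf implementation. It assumes that chi2inv(x, 2*n) is the upper-tail probability of a chi-square distribution with 2*n degrees of freedom, as in the Linux Journal article listed under further reading, and it clamps the probabilities away from 0 and 1 so that the logarithms stay finite. The names chi2q, combined_indicator and EPS are made up for this example.

```python
import math

EPS = 1e-12  # assumption: clamp so that ln(p) and ln(1-p) are always defined


def chi2q(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even number")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combined_indicator(probs):
    """Combine token probabilities p1..pn into I = (1 + H - S) / 2."""
    clamped = [min(max(p, EPS), 1.0 - EPS) for p in probs]
    n = len(clamped)
    ln_q = sum(math.log(p) for p in clamped)        # ln Q, Q = p1 * ... * pn
    ln_p = sum(math.log(1.0 - p) for p in clamped)  # ln P, P = (1-p1) * ... * (1-pn)
    h = chi2q(-2.0 * ln_q, 2 * n)
    s = chi2q(-2.0 * ln_p, 2 * n)
    return (1.0 + h - s) / 2.0


if __name__ == "__main__":
    print(combined_indicator([0.99, 0.95, 0.90]))  # spammy tokens, close to 1
    print(combined_indicator([0.01, 0.05, 0.10]))  # hammy tokens, close to 0
    print(combined_indicator([0.5, 0.5, 0.5]))     # neutral tokens, 0.5
```

Under this reading an indicator near 1 means spam, near 0 means ham, and values around 0.5 mean the filter is unsure.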

What to do with rare words?

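f(w) = (s * x + n * p(w)) / (s + n)

where x = 0.5, s = 1, n is the number of occurrences of the word w, and f(w) is the adjusted probability used in place of p(w). Note that this n is not the token count n used in the formulas for P, Q, H and S.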
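A short Python sketch shows how the rare-word adjustment behaves. The function name is illustrative; the defaults s=1 and x=0.5 are the values given above.

```python
def adjust_for_rarity(p_w, n, s=1.0, x=0.5):
    """Pull the raw probability p_w towards x when the word was seen only n times."""
    return (s * x + n * p_w) / (s + n)


if __name__ == "__main__":
    for n in (0, 1, 10, 100):
        print(n, adjust_for_rarity(1.0, n))
```

A word that was never seen gets the neutral 0.5, a word seen once with a raw probability of 1.0 gets 0.75, and the adjusted value approaches the raw probability as n grows.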

Implementation issues

Clapf parses and tokenizes the incoming message. For better accuracy it also creates additional phrases from consecutive tokens. Consider the following text: “How nice day!”

Now we have 5 tokens: “How”, “nice”, “day!”, “How+nice” and “nice+day!”
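The phrase generation for this example can be sketched in Python as follows. Splitting on whitespace is an assumption made for brevity (the real parser handles headers, MIME parts and so on), and the function name is illustrative.

```python
def tokens_with_phrases(text):
    """Return the single tokens followed by the two-token phrases joined with '+'."""
    words = text.split()
    phrases = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return words + phrases


if __name__ == "__main__":
    print(tokens_with_phrases("How nice day!"))
    # ['How', 'nice', 'day!', 'How+nice', 'nice+day!']
```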

Clapf creates two tables: one for phrases only and one for phrases and single tokens. It first calculates a spamicity value from the phrase table and uses the 'mix' table only if it is unsure. It then chooses the value with the greater deviation from the neutral 0.5.
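The two-table decision could look like the following Python sketch. The bounds of the "unsure" range (0.2 and 0.8 here) and all names are assumptions made for illustration; clapf's actual limits are not stated on this page.

```python
def choose_spamicity(phrase_score, compute_mix_score, low=0.2, high=0.8):
    """Use the phrase score when it is decisive, otherwise consult the mix table."""
    if phrase_score <= low or phrase_score >= high:
        # The phrase table alone is confident, so the mix table is not consulted.
        return phrase_score
    mix_score = compute_mix_score()
    # Keep whichever value lies further from the neutral 0.5.
    if abs(mix_score - 0.5) > abs(phrase_score - 0.5):
        return mix_score
    return phrase_score


if __name__ == "__main__":
    print(choose_spamicity(0.95, lambda: 0.10))  # 0.95, mix table not used
    print(choose_spamicity(0.60, lambda: 0.90))  # 0.9, mix value deviates more
```

Passing the mix calculation as a callable keeps the second lookup lazy, so it only runs when the phrase score falls into the unsure range.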

Clapf discards some tokens:

  • numeric-only tokens
  • tokens shorter than 3 characters
  • tokens longer than MAX_WORD_LEN
  • tokens occurring only once in the spam or in the ham folder

Note that the clapf parser converts all tokens to lower case.

Clapf also degenerates a token to a simpler form if it ends with a punctuation character, e.g. both “Free!!!!!” and “FrEE!!” become “free!”
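The discard rules and the normalization above can be sketched in Python like this. The value of MAX_WORD_LEN is a placeholder (the real value is defined in the clapf source), collapsing a trailing run of punctuation to its last character is an assumption that reproduces the “free!” example, and the "occurring only once" rule is left out because it needs the token database.

```python
import string

MAX_WORD_LEN = 25  # hypothetical value, for illustration only


def normalize(token):
    """Lower-case the token and collapse trailing punctuation to one character."""
    token = token.lower()
    stripped = token.rstrip(string.punctuation)
    if len(stripped) < len(token):
        return stripped + token[-1]
    return token


def keep_token(token):
    """Apply the discard rules that do not need the token database."""
    if token.isdigit():
        return False  # numeric-only token
    if len(token) < 3:
        return False  # too short
    if len(token) > MAX_WORD_LEN:
        return False  # too long
    return True


if __name__ == "__main__":
    print(normalize("Free!!!!!"), normalize("FrEE!!"))  # free! free!
    print([t for t in ("12345", "ab", "free!") if keep_token(t)])  # ['free!']
```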

Performance

You should put clapf's tmp/ directory on a separate (lightly loaded) disk. If you want even better performance, put it on a memory filesystem or an SSD.

The tmp/ directory needs at least max_message_size * (max_paralel_delivery + 1) of space, e.g. 8M * (10+1) = 88M.

/etc/fstab:

/dev/shm        /var/lib/clapf/tmp       tmpfs   defaults,size=128M       0	0
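The sizing rule can be checked with a few lines of Python. The function and parameter names are chosen for this example and are not clapf configuration options.

```python
MIB = 1024 * 1024


def min_tmp_size(max_message_size, max_parallel_delivery):
    """Minimum tmp/ size in bytes: one slot per parallel delivery plus one spare."""
    return max_message_size * (max_parallel_delivery + 1)


if __name__ == "__main__":
    print(min_tmp_size(8 * MIB, 10) // MIB)  # 88
```

With 8M messages and 10 parallel deliveries the minimum is 88M, so the size=128M in the fstab entry leaves some headroom.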

Notes on parsing email

There is a customizable array called invalid_junk_characters (see ijc.h) that you can fill with characters that appear in messages but are completely invalid for your language or display. I myself fill it with Chinese, Korean, Japanese, … characters. If the amount of such junk in a message exceeds a certain (configurable) limit, clapf marks the message as spam.

Further reading

Bayesian Filtering Example, Using Bayes' Formula to keep spam out of your Inbox: http://www.process.com/precisemail/bayesian_example.htm

Spam Detection: http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

A Statistical Approach to the Spam Problem: http://www.linuxjournal.com/article/6467

A Plan for Spam: http://www.paulgraham.com/spam.html

Better Bayesian Filtering: http://www.paulgraham.com/better.html

Filters vs. Blacklists: http://www.paulgraham.com/falsepositives.html

The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It: http://crm114.sourceforge.net/Plateau_Paper.html