2005-06-28

Jaakko Hyvätti Anti-spam software

These are my experiments. Sorry, I do not have many statistics on how well they work.

Helper programs

normalizemime.cc - 2007-03-20 version. This is a MIME email message parser to be used as a preprocessor for email classification software, version 1.18.

It tries to normalize the content to 8-bit encoding with the UTF-8 character set. It also appends a copy of the message body with HTML removed (IMG and A tags remain unaffected). Changes by version:

2003-08-06: also decodes HTML entities, like &auml; or &#228;, and limits the size of attached binary files.
2003-08-19: also decodes URL encoding in HREF and SRC parameters in HTML, and fixes a MIME decoding bug.
2003-09-18: fixes a core dump bug, and filters away the X-Spam-Status and X-CRM114-Status headers.
2003-09-22: filters my Pine status headers away and inserts information in the text about malformed base64 padding.
2003-10-05: fixes header decoding and marks unnecessary header encoding with a token.
2004-04-26: deletes null characters and limits the message size to 1 MB.
2005-12-13: recognizes some more misspelled charsets.
2007-03-20: allows easier tweaking of the header purge list, see around line 1500.
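To illustrate the entity and URL decoding steps, here is a small Python sketch. It is not taken from normalizemime.cc (which is C++). The function names, the regular expression, and the restriction to double-quoted attribute values are my own assumptions for the example.

```python
import html
import re
from urllib.parse import unquote

# Matches href="..." or src="..." (double-quoted values only, to keep it short).
ATTR_RE = re.compile(r'\b(href|src)="([^"]*)"', re.IGNORECASE)

def decode_entities(text):
    """Turn named and numeric HTML entities into the characters they stand for."""
    return html.unescape(text)

def decode_url_attributes(markup):
    """Percent-decode the values of HREF and SRC attributes."""
    def repl(match):
        return '{}="{}"'.format(match.group(1), unquote(match.group(2)))
    return ATTR_RE.sub(repl, markup)

# Hypothetical input: an obfuscated link with one named and one numeric entity.
sample = '<a href="http://example.com/%76%69agra">caf&eacute; &#228; &auml;</a>'
decoded = decode_entities(decode_url_attributes(sample))
print(decoded)
```

After decoding, a word that was hidden behind percent escapes or entities shows up as plain text, which is what a token-based classifier needs to see.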

2004-09-17 version changes:

Thanks for these ideas go to Tony Godshall.

2004-06-28 version changes:

2005-06-28 version 1.16 fixes a core dump on null chars. Thanks to Richard Carver for pointing that out.

These text strings on the output come from base64 decoding, and indicate possible attacks against decoders and virus/spam scanners:

X-warn: jHnnb3URVED5UgX9fxnZfAsV invalid base64 padding
X-warn: ksU7AwpcqQoiCC84ceueEqKn padding inside base64
If the base64 string was inside a header, the headers get mangled totally, so this is not strictly speaking a header but just a word token that crm114 could learn.
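
How normalizemime decides that padding is wrong is not described here, so the following Python sketch only shows one plausible set of checks. The rules (at most two trailing "=" characters, total length a multiple of four, no "=" before the end) and the function name are assumptions. Only the two warning texts come from the output above.

```python
def check_base64_padding(data):
    """Return a list of warning strings about '=' padding in a base64 string."""
    warnings = []
    compact = "".join(data.split())        # ignore line breaks and spaces
    body = compact.rstrip("=")             # everything before the trailing padding
    pad = len(compact) - len(body)         # number of trailing '=' characters

    if "=" in body:
        # A '=' in the middle suggests several encoded pieces glued together.
        warnings.append("padding inside base64")
    if pad > 2 or len(compact) % 4 != 0:
        # Valid base64 has 0-2 padding characters and a length divisible by 4.
        warnings.append("invalid base64 padding")
    return warnings

for candidate in ("aGVsbG8=", "aGVsbG8", "aGk=aGk="):
    print(candidate, check_base64_padding(candidate))
```

A lenient decoder and a strict one may disagree on such input, which is why the anomaly itself is worth turning into a token.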

This token is a spam indicator: it marks a header that was encoded even though it consisted only of US-ASCII characters:

ONLYASCIIKFrjuZnFvmJJdrRkeXrd95wu
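
The check behind this token can be sketched in Python with the standard email.header module. This is an illustration only, not the C++ logic of normalizemime. The function name is made up, and the sketch has a known weakness noted in the comments.

```python
from email.header import decode_header

def needlessly_encoded(header_value):
    """True if any RFC 2047 encoded word in the header decodes to pure ASCII bytes.

    Caveat (assumption of this sketch): charsets such as ISO-2022-JP carry
    non-ASCII text in 7-bit bytes, so they would be flagged wrongly here.
    """
    for chunk, charset in decode_header(header_value):
        # Encoded words come back as bytes together with their charset name.
        if charset is not None and isinstance(chunk, bytes):
            if all(byte < 0x80 for byte in chunk):
                return True
    return False

print(needlessly_encoded("=?utf-8?B?aGVsbG8=?="))        # base64 of "hello"
print(needlessly_encoded("=?iso-8859-1?Q?Hyv=E4tti?="))  # really needs encoding
print(needlessly_encoded("plain subject"))
```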

String ICONVERROR5iorjkfewfmkdfs2lklkfsd is added when the first charset conversion error is detected.
String UTFATTACK45809jkHJSD82rk8903jdfj3 is added if suspicious UTF-8 encodings are used.
String BADHEADERCHARSETckW2eAWEEyAGmHQK is added if the encoding in header was not recognized.
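
Which encodings count as suspicious is not spelled out here. One common case is the overlong form, where a character is written with more bytes than necessary to slip past scanners. The Python sketch below detects only that case. Treating "suspicious" as "overlong" is my assumption, and so is the function name.

```python
def has_overlong_utf8(data):
    """Detect overlong UTF-8 sequences in a bytes object.

    0xC0 and 0xC1 can only start overlong two-byte forms. 0xE0 followed by
    0x80-0x9F and 0xF0 followed by 0x80-0x8F start overlong three- and
    four-byte forms. None of these lead bytes can occur as continuation bytes.
    """
    for i, byte in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        if byte in (0xC0, 0xC1):
            return True
        if byte == 0xE0 and nxt is not None and 0x80 <= nxt <= 0x9F:
            return True
        if byte == 0xF0 and nxt is not None and 0x80 <= nxt <= 0x8F:
            return True
    return False

print(has_overlong_utf8(b"\xc0\xaf"))            # overlong encoding of "/"
print(has_overlong_utf8("Hyvätti".encode("utf-8")))
```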

This header is added if the message body charset is not recognized:

X-warn: 3j94twCXM5njkztE bad charset in body

These are the headers that are removed from the messages. The list is around line 1500 in the source.

      "X-Spam-",           // Added by SpamAssassin for example
      "X-CRM114-",         // Added by CRM114
      "X-Virus-",          // Added by ClamAV
      "X-UID:",            // added by Pine mail user agent
      "Status:",           // added by Pine mail user agent
      "X-Status:",         // added by Pine mail user agent
      "X-Keywords:",       // added by Pine mail user agent

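The effect of the purge list can be sketched in Python as a prefix filter over header lines. This is not the C++ code from normalizemime.cc. Case-insensitive matching and dropping folded continuation lines together with their header are assumptions of the sketch.

```python
# Prefixes copied from the list above.
PURGE_PREFIXES = (
    "X-Spam-", "X-CRM114-", "X-Virus-",
    "X-UID:", "Status:", "X-Status:", "X-Keywords:",
)
_LOWER_PREFIXES = tuple(p.lower() for p in PURGE_PREFIXES)

def purge_headers(header_lines):
    """Return the header lines with purged headers (and their continuations) removed."""
    kept = []
    dropping = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Folded continuation: shares the fate of the header it belongs to.
            if not dropping:
                kept.append(line)
            continue
        dropping = line.lower().startswith(_LOWER_PREFIXES)
        if not dropping:
            kept.append(line)
    return kept

headers = [
    "From: a@example.com",
    "X-Spam-Status: Yes, score=9.1",
    "\ttests=FOO",
    "Status: RO",
    "Subject: hi",
    "X-Status: A",
]
print(purge_headers(headers))
```

Removing these headers keeps the classifier from learning the verdict of an earlier filter instead of the message content.
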
After this filtering, the email message no longer conforms to any standard, and formatting information is irreversibly lost. Even the MIME message structure may be corrupted, because the encodings are decoded and message separators may then appear inside the data.

This filter is useful for preprocessing messages for spam filters that classify by content, such as crm114.

CRM114 scripts

mailfilter.crm - normalizes the email and then filters it. Modified from some old crm114 source distribution.

mailfilterconfig.crm - used by both mailfilter.crm and learnspam.crm below.

procmail.txt - .procmailrc example to be used with the above scripts.

learnspam.pl and learnspam.crm - split a mailbox into individual emails, then use normalizemime to remove MIME encodings and HTML. Each email is then learned unless the current css files already classify it correctly. This is TOE (train on error) behaviour. I removed all blacklist processing from mailfilter.crm when converting it to learnspam.crm, as it is bad to use blacklists when learning.
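
A single TOE pass can be sketched in Python as follows. The ToyClassifier is a made-up word counter standing in for crm114 and its css files, and the sample messages are invented. Only the learn-if-misclassified rule reflects what the scripts do.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical stand-in for crm114: counts words per class."""
    def __init__(self):
        self.counts = {"spam": Counter(), "nonspam": Counter()}

    def classify(self, text):
        words = text.lower().split()
        spam = sum(self.counts["spam"][w] for w in words)
        ham = sum(self.counts["nonspam"][w] for w in words)
        return "spam" if spam > ham else "nonspam"

    def learn(self, text, label):
        self.counts[label].update(text.lower().split())

def train_on_error(classifier, messages):
    """One TOE pass: learn a message only if it is currently misclassified."""
    learned, ignored = Counter(), Counter()
    for text, label in messages:
        if classifier.classify(text) == label:
            ignored[label] += 1          # already right, leave the model alone
        else:
            classifier.learn(text, label)
            learned[label] += 1
    return learned, ignored

messages = [
    ("cheap pills now", "spam"),
    ("meeting notes attached", "nonspam"),
    ("cheap pills today", "spam"),
    ("notes from the meeting", "nonspam"),
]
learned, ignored = train_on_error(ToyClassifier(), messages)
print("learned:", dict(learned), "ignored:", dict(ignored))
```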

These scripts need a recent CRM114 version. The Perl script reads the text files one message at a time and feeds each message to crm114. Spam and nonspam text chunks are trained alternately. Multiple messages are not fed to crm114 in one go, as that does not seem to work well with the current TRE regexp bugs with UTF-8 text. That's a shame, as it slows the process down by a factor of 10. Note also that you need to remove the msync() call from crm_markovian.c to get any performance out of crm114 learning.
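
The alternating order can be sketched in Python like this. learnspam.pl itself is Perl, so this is only an illustration. Starting with spam, and what happens when one mailbox runs out first, are assumptions.

```python
from itertools import zip_longest

def alternate(spam_messages, nonspam_messages):
    """Yield (label, message) pairs, switching class each step while both lists last."""
    for spam, nonspam in zip_longest(spam_messages, nonspam_messages):
        if spam is not None:
            yield ("spam", spam)
        if nonspam is not None:
            yield ("nonspam", nonspam)

order = list(alternate(["s1", "s2", "s3"], ["n1"]))
print(order)
```

Alternating keeps either class from getting far ahead of the other while the statistics files are still nearly empty.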

Have spamtext.txt and nonspamtext.txt mailbox files ready before running the script. Usage example:

./learnspam.pl
./learnspam.pl rerun
./learnspam.pl rerun
./learnspam.pl rerun
A couple of reruns with the same material, as in the example above, seem to work for me. I created a shell script to do this and generate statistics: learnspamtest.sh. Here are my latest results with repeated TOE, that is, TUNE (train until no errors):
nonspam learned: 55
spam learned:    39
nonspam ignored: 2012
spam ignored:    1130

nonspam learned: 18
spam learned:    29
nonspam ignored: 2049
spam ignored:    1140

nonspam learned: 9
spam learned:    12
nonspam ignored: 2058
spam ignored:    1157

nonspam learned: 3
spam learned:    5
nonspam ignored: 2064
spam ignored:    1164

nonspam learned: 0
spam learned:    0
nonspam ignored: 2067
spam ignored:    1169
The fifth run is usually clean.
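
The rerun loop can be sketched in Python as below. This is not the content of learnspamtest.sh, which is a shell script. The run_pass callable is a hypothetical placeholder for one invocation of learnspam.pl. In the demo it simply replays the learned totals from the listing above (nonspam plus spam per run) instead of calling crm114.

```python
def train_until_no_errors(run_pass, max_runs=10):
    """Repeat TOE passes until one pass learns nothing, or max_runs is reached.

    run_pass is any callable that performs one pass and returns how many
    messages it had to learn.
    """
    history = []
    for _ in range(max_runs):
        learned = run_pass()
        history.append(learned)
        if learned == 0:
            break                      # a clean run: nothing was misclassified
    return history

# Replay of the totals reported above: 55+39, 18+29, 9+12, 3+5, 0+0.
recorded = iter([94, 47, 21, 8, 0])
history = train_until_no_errors(lambda: next(recorded))
print(history, "runs:", len(history))
```

The max_runs limit is there because nothing guarantees that a clean run is ever reached on other material.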
