Spam Reduction Notes
These are notes from a talk given at the
Tokyo Linux Users Group
technical meeting on January 18th, 2003. It was intended
as a survey of the current techniques that you might like
to try... with an emphasis on what works for me.
|Chapman:||Have you got anything without spam?|
|Jones:||Well, there's spam egg sausage and spam, that's not got much spam in it.|
|Chapman:||I don't want ANY spam!|
|Monty Python's Flying Circus, Episode 25, June 25, 1970|
What is SPAM?
SPAM is a registered trademark of
Hormel Foods, LLC, for luncheon meat.
What is spam?
- typical definition
- Unsolicited Commercial Email (UCE)
- I prefer to stress the untargeted, bulk nature of the mail
- unsolicited, automated email
|Nov 2002:||Brightmail says 36% of traffic is spam|
|Jan 2003:||MessageLabs predicts spam will exceed ham by July|
- very effective
- trivial setup
- extremely low false sorting rate
Multiple Email Addresses
- suffixes common with many modern MTA/MDAs
- throw-away (free) accounts
- easy with your own mail server
- good for the curiosity factor, use a different
address or suffix and you know where the spammer got your address
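As a rough illustration of the curiosity factor, the sketch below splits a suffixed recipient address into mailbox, suffix and domain, so the suffix on an incoming spam shows which address leaked. The separator characters (`+` and `-`) and the example addresses are assumptions; what your MTA actually accepts depends on how it is configured.

```python
def split_suffix(address, separators="+-"):
    """Return (mailbox, suffix, domain); suffix is '' when none is present."""
    local, _, domain = address.rpartition("@")
    for i, ch in enumerate(local):
        if ch in separators:
            # Everything after the first separator is treated as the suffix.
            return local[:i], local[i + 1:], domain
    return local, "", domain


if __name__ == "__main__":
    # Hypothetical addresses: one suffix per place the address was handed out.
    for rcpt in ("jim+tlug@example.org", "jim-shop@example.org", "jim@example.org"):
        mailbox, suffix, domain = split_suffix(rcpt)
        print(rcpt, "->", suffix or "(no suffix)")
```

Logging the suffix of every message that lands in the spam folder is usually enough to see which site or list passed the address on.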
ad hoc content filtering
- Exim filter files
- system filter
- user filter
- competitive... you against the spammer
- can rapidly get out of hand
- cooperative effort: many smart people against the spammer
- whitelists and blacklists
- rule-based filtering
- recent versions also include Bayesian filtering
- identify good guys and bad guys
- don't use for filtering... use them to reduce computational load
- can be implemented in server or client tools
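The load-reduction idea can be sketched as follows: known good and known bad senders are settled by cheap set lookups, and the expensive content filter runs only on everything else. The function name `triage`, the list contents and the stand-in filter are all hypothetical.

```python
def triage(sender, message, whitelist, blacklist, content_filter):
    """Settle known senders with cheap lookups; filter only the rest."""
    sender = sender.lower()
    domain = sender.rpartition("@")[2]
    if sender in whitelist or domain in whitelist:
        return "accept"
    if sender in blacklist or domain in blacklist:
        return "reject"
    # Unknown sender: only now pay for the expensive check.
    return content_filter(message)


if __name__ == "__main__":
    filtered = []

    def slow_filter(message):
        # Stand-in for a costly rule-based or Bayesian check.
        filtered.append(message)
        return "filtered"

    white = {"friend@example.org"}
    black = {"spam.example"}
    mail = [
        ("friend@example.org", "lunch?"),
        ("bulk@spam.example", "buy now"),
        ("stranger@example.net", "hello"),
    ]
    for sender, body in mail:
        print(sender, "->", triage(sender, body, white, black, slow_filter))
    print("content filter ran", len(filtered), "time(s)")
```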
# Accept mail to postmaster or abuse in any local domain,
# regardless of source.
accept local_parts = postmaster:abuse
domains = +local_domains
# Reject if the sender is a known spammer.
deny senders = @@cdb;/etc/exim/spam-domains.cdb
message = message from spammer rejected
# Deny unless the sender address can be verified.
require verify = sender
Public (realtime) Blacklists
# DNS blacklists.
# There are a variety of realtime blacklists that attempt to
# identify spam sources, open relays, and even dialup address
# blocks. You really need to check the policy published by
# each list before deciding to use it.
deny dnslists = list.dsbl.org : \
relays.ordb.org
message = rejected because $sender_host_address is in the blacklist at $dnslist_domain
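For readers curious what a DNS blacklist check does under the hood, here is a minimal sketch of the usual convention: reverse the octets of the connecting host's IPv4 address, append the list's zone, and look the name up. An answer means the address is listed. The zone `dnsbl.example` is a placeholder, not a real list, and `is_listed` needs network access.

```python
import socket


def dnsbl_query_name(ipv4, zone):
    """Build the lookup name: reversed octets followed by the list's zone."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ipv4, zone):
    """True if the zone returns an address record for the host (needs network)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4, zone))
        return True
    except OSError:
        # Resolution failed: treated here as "not listed".
        return False


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; the zone is hypothetical.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example"))
```

Treating every lookup failure as "not listed" is a simplification; a real server would want to tell a timeout apart from a genuine negative answer.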
Tagged Message Delivery Agent
- confirmation system
- suffix (tag)
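To make the suffix (tag) idea concrete, the sketch below builds an address whose tag carries an expiry time plus a keyed checksum, so the receiving side can accept mail to a valid tag without a confirmation round trip. This is an illustration only: the address layout, the secret and the function names are assumptions, not TMDA's actual format or code.

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # hypothetical per-user secret key


def _mac(payload, secret=SECRET):
    """Short keyed checksum over the tag payload."""
    return hmac.new(secret, payload.encode(), hashlib.sha1).hexdigest()[:8]


def make_dated_address(user, domain, lifetime, now=None, secret=SECRET):
    """Return an address that is valid for `lifetime` seconds."""
    start = now if now is not None else time.time()
    expires = int(start) + lifetime
    return f"{user}-dated-{expires}.{_mac(str(expires), secret)}@{domain}"


def check_dated_address(address, now=None, secret=SECRET):
    """True if the tag is well formed, untampered and not yet expired."""
    local = address.rpartition("@")[0]
    try:
        tag = local.rsplit("-dated-", 1)[1]
        expires, mac = tag.split(".")
        expires_at = int(expires)
    except (IndexError, ValueError):
        return False
    current = now if now is not None else time.time()
    genuine = hmac.compare_digest(mac.encode(), _mac(expires, secret).encode())
    return genuine and current <= expires_at


if __name__ == "__main__":
    addr = make_dated_address("jim", "example.org", lifetime=5 * 24 * 3600)
    print(addr, check_dated_address(addr))
```

Mail to an address that fails the check would then fall through to the confirmation system instead of being delivered directly.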
Make it computationally expensive for spammers (or at least people
not on your whitelist) to send you mail.
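One way to impose that cost is a proof-of-work stamp in the spirit of Hashcash: the sender must search for a value whose hash starts with a given number of zero bits, while the recipient verifies it with a single hash. The stamp layout below (`recipient:nonce`) is a simplification chosen for illustration and is not the real Hashcash format; the difficulty of 12 bits is likewise arbitrary.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource, bits=12):
    """Search for a stamp whose SHA-1 digest has at least `bits` leading zeros."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp, resource, bits=12):
    """One hash is enough to check a stamp minted for `resource`."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(resource + ":") and leading_zero_bits(digest) >= bits


if __name__ == "__main__":
    stamp = mint("jim@example.org")
    print(stamp, verify(stamp, "jim@example.org"))
```

Each extra bit roughly doubles the sender's expected work, which is negligible for one message and significant for a bulk run. Whitelisted senders could simply be exempted from the check.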
Vipul's Razor (AKA Spamnet)
- cooperative effort
- 20 character SHA digest sent to local catalog server
- Mad-lib problem
- recent performance problems
- tokenize (punctuation? HTML? stopwords?)
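The questions in that last bullet are tokenizer design choices. A minimal sketch with each choice exposed as a switch might look like this; the regular expressions, the set of characters kept inside a token and the parameter names are assumptions, not taken from any particular filter.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")
# Keep some punctuation inside tokens, since "money!!!" may be more telling than "money".
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text, strip_html=True, stopwords=frozenset(), lowercase=True):
    """Split text into tokens according to the chosen policy."""
    if strip_html:
        text = TAG_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    return [t for t in TOKEN_RE.findall(text) if t not in stopwords]


if __name__ == "__main__":
    sample = "<b>FREE</b> money!!! Click here"
    print(tokenize(sample, stopwords={"here"}))
    print(tokenize(sample, strip_html=False))
```

Whether stripping HTML or dropping stopwords helps is an empirical question, since tags and common words can themselves be useful evidence.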
A Plan for Spam
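For reference, the scoring described in Paul Graham's essay of this name can be sketched roughly as below: each token gets a spam probability from its counts in the ham and spam corpora (ham counts are doubled to bias against false positives), and the most extreme token probabilities are combined into one score. The function and parameter names are my own, and the sketch assumes both corpora are non-empty.

```python
from math import prod


def token_probability(good, bad, ngood, nbad):
    """Spam probability of one token, or None if it was seen too rarely.

    good/bad: occurrences in ham/spam; ngood/nbad: number of ham/spam messages.
    """
    g = 2 * good  # ham counts are doubled to bias against false positives
    if g + bad < 5:
        return None
    b_ratio = min(1.0, bad / nbad)
    g_ratio = min(1.0, g / ngood)
    return max(0.01, min(0.99, b_ratio / (g_ratio + b_ratio)))


def spam_probability(probs, most_interesting=15):
    """Combine the token probabilities that lie furthest from a neutral 0.5."""
    picked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:most_interesting]
    spam = prod(picked)
    ham = prod(1 - p for p in picked)
    return spam / (spam + ham)


if __name__ == "__main__":
    probs = [token_probability(0, 10, 100, 100),   # only ever seen in spam
             token_probability(20, 1, 100, 100),   # mostly seen in ham
             0.4]                                  # the essay's value for unseen tokens
    print(spam_probability(probs))
```

The essay treats a message as spam when the combined value exceeds 0.9. `math.prod` needs Python 3.8 or later.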
Graphs from SpamBayes background information
produce two numbers
- H - ham (good) probability
- S - spam probability
- the user chooses the thresholds for classifying "unsures" according to their tolerance for false negatives and false positives
- Stupid beats smart.
- avoid magic constants
- build rules into tokenizer, not analyzer policy
Chi-squared distribution explanation
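As a sketch of how chi-squared combining can produce the H and S numbers, the code below applies Fisher's method to the per-token spam probabilities, once for the spam hypothesis and once for the ham hypothesis, then maps the pair onto ham, unsure or spam. It follows my understanding of the SpamBayes approach rather than reproducing its code; the cutoffs 0.2 and 0.9 are illustrative, and every probability is assumed to lie strictly between 0 and 1.

```python
from math import exp, log


def chi2q(x2, df):
    """Probability that a chi-squared variable with even `df` exceeds `x2`."""
    assert df % 2 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (H, S): ham and spam indicators from per-token spam probabilities."""
    n = len(probs)
    S = 1.0 - chi2q(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2q(-2.0 * sum(log(p) for p in probs), 2 * n)
    return H, S


def classify(H, S, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map (H, S) to a label; strong evidence both ways ends up as 'unsure'."""
    score = (S - H + 1.0) / 2.0
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    for probs in ([0.99] * 10, [0.01] * 10, [0.99] * 5 + [0.01] * 5):
        H, S = combine(probs)
        print(classify(H, S))
```

The useful property is that a message with strong clues in both directions yields high H and high S together, so it lands in the middle as "unsure" instead of being forced to one extreme.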
- hammiefilter with procmail
- Outlook2000 Plugin
Mailing List Applications?
Will filters end spam?
- If you can successfully filter -- is it worth sending?
- Will the gullible filter?
Tools in other domains
- CRM114 for syslog monitoring: look for outlying events
- Server Based
- Client Based
- Fixed Rule Based
- Probability (Trained) Based
Watch the video archives of the experts speaking on the subject
at the Spam Conference held at MIT on January 17th, 2003.