Email Filtering for Nonprofits
Panel Presentation at 2003 N-TEN Roundup Emerging Technologies Session
Brent Emerson | brent@electricembers.net
http://electricembers.net/pubs/filtering/


Email filtering: what can we filter?
Viruses
 Why filter viruses in email?
 New developments in AV
Spam
 Why filter spam?
 The old way: Centralized DNS Blacklists
 The new way: A wide-spectrum approach
 Why is wide-spectrum better?
A prototype filtering system
Evaluative criteria for filtering solutions
 Do you need to filter?
 DIY v. ASP
 Comparing specific AV/AS solutions
Appendix: A sample spam message as scanned by SpamAssassin
Appendix: Email filtering software/vendors/ASPs/resources

Email filtering: what can we filter?

Ideally, we can filter out, or will develop the means to filter out, anything an organization or a user doesn't want to receive. Usually, these messages fall into three general categories: viruses (malicious code), spam (bulk, automated, or commercial unsolicited email), and abuse (targeted offensive or abusive messages). Email has become a crucial tool of communication, and anything that reduces its effectiveness reduces the effectiveness of our organizations. This is especially true in the nonprofit sector, where we may be dealing with more email communication relative to our reduced human resources than many for-profit entities.

The problem is also becoming more intense and complicated because these three categories, while still very useful, are starting to blur. It's increasingly difficult to draw the distinctions, as viral, spammy and abusive characteristics are more and more often found together. Examples of this blurring: viruslike spam - spam so rampant and irritating that it is seen as an infection affecting the functioning of the email system; spamlike viruses - "ad-ware" and other code which doesn't cause damage but just advertises and annoys; abuselike spam - spam that is offensive to users; and spamlike abuse - automated bulk targeted harassment.

The problem of email viruses, spam and abuse may be developing into one problem rather than three, and the future may see it coalesce further. So our approach will need to respond to this blurring - an active response with a coordinated strategy is becoming more and more relevant.

I'll focus here primarily on anti-spam (AS) filtering technology since there's less "emerging technology" in the anti-virus (AV) world and anti-abuse filtering is mostly a subset of AS. But first a few notes on virus filtering.

Viruses

Why filter viruses in email?

At first glance, it's clear that viruses are annoying and destructive and we want to get rid of them any way we can. But why bother with the extra expense of removing viruses from email?
  1. To minimize infection, our goal should be to remove viruses from our systems as early as is possible and feasible. For email-borne viruses, the earliest point is as email is first coming over SMTP into the organization, before it ever reaches a user's computer.

  2. To minimize the network/mailspool congestion caused by floods of email-borne viruses, we need to be able to remove viruses within the email system.

New developments in AV

Virus scanning and disinfection in general haven't changed much lately, except that over the last several years more and more viruses are actually email worms, so antiviral email filtering has become more useful. A plethora of mature, commercially available server-side filtering solutions and discounts on client AV software (as in Compumentor's DiscountTech program) mean AV shouldn't be a big challenge for nonprofits.

One big change that I'm excited about is the rise of open source / free antivirus software. AV is an area that open source has been slow to penetrate, probably because virus pattern databases must be kept up to date if the software is to be successful. ClamAV and OpenAntiVirus are newly developed examples of open-source virus scanning engines, as opposed to open-source tools that use commercial engines or pattern databases. Anecdotal evidence about their effectiveness is conflicting, and there is not much conclusive evidence -- they need more adopters and more widespread testing before it's clear whether they're as effective as commercial products.

Spam

Why filter spam?

Opinions are divided about how big a problem spam is: some organizations don't get much, some get a lot. The impact is often divided even within an organization, as maintainers of public addresses (like an info@ account) wade through spam all day and others with more private addresses aren't aware of it at all. An organization's response is mostly based on how many public addresses it has and on the traffic patterns of its users, which mirrors the internet-wide divide between Hotmail accounts that receive 100 spams per day and new accounts in small domains that don't get spam for their first year. In addition, definitions of spam vary - it can be hard to pin down a universally accepted definition. (For more on this issue, see the recent forum Spammers and Scammers on techsoup.com, especially the thread on Spam v. Internet Activism.) But if we were to plow through 1000 candidate spam messages here as a group today, I imagine we'd agree on at least 90% of them. And one thing is certain: spam traffic is growing. Spam volume is increasing Internet-wide and it's increasing by account as each account ages and is ever more exposed to the world.

As long as the cost of sending spam is lower than the cost of rejecting it, spam will continue to be a problem, and a growing and more widespread one. So we have two connected goals: increase the cost and difficulty of sending spam effectively, and decrease the cost and difficulty of not receiving the spam you're sent.

The old way: Centralized DNS Blacklists

The use of centralized DNS blacklists to block spam is the technique most people think of when I mention AS filtering, which makes sense as it's the oldest and best-known. Blacklists are databases of IP addresses of open SMTP relays or other machines frequently used by spammers - users block all mail coming from these addresses to preemptively block spam. Unfortunately, this technique blocks only a small percentage of spam. It was very successful compared to nothing when spam was new, but it looks like a failure against today's spam problem. Because spammers can jump from IP to IP, a filtering strategy that pays attention only to the network routing information in a message is doomed to mediocrity at best. To distinguish spam effectively, a machine needs to identify it the same way we do: by inspecting its content.
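The lookup itself is simple, which is part of the technique's appeal. The following is a minimal sketch, assuming the common convention of reversing the octets of the sender's IPv4 address and prepending them to the blacklist's DNS zone. The zone name is a placeholder, not a real blacklist, and the function only builds the query name; a real check would resolve that name and treat a positive answer as "listed".

```python
import ipaddress

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a blacklist lookup would query for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# 'dnsbl.example.org' is a hypothetical zone used only for illustration.
print(dnsbl_query_name("192.0.2.25", "dnsbl.example.org"))
# prints: 25.2.0.192.dnsbl.example.org
```

Note that nothing in the query depends on the message itself, only on the connecting address, which is exactly the limitation described above.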

The new way: A wide-spectrum approach

A new way to identify and handle spam is taking hold - instead of blocking incoming mail based on where it's coming from (which often results in false positives that mean users never receive legitimate email), each message is tested against a variety of different conditions. If a message meets enough conditions to be considered spam, it can be deleted. Better yet, it can be marked and delivered normally so that the recipient can decide how to handle it. So even if there are false positives, users never lose any mail. Here are the four main message test techniques:

IP-BASED TESTS

centralized/authoritative: DNS-queried blacklist databases of IP addresses of open SMTP relays, machines frequently used by spammers, dynamic dialup lines, whitelist warrant abusers, and other high-spam-risk IP addresses can be used to reject or identify suspect mail.

distributed/trust-based: Like centralized blacklists, only these are maintained by the community that uses them instead of an external authority. Community members and their contributions are trusted and admitted to the system based on their agreement with other trusted members' contributions.

self-administered: Firewalls (for IP or SMTP), MTA access rules (for SMTP) or custom IP address checks in an AS scanner can be used to block or identify traffic that one has independently determined to be undesirable based on individual experience.

PATTERN MATCHING

header patterns: A text processor can scan through message headers looking for addresses, address patterns or header irregularities that are often found in spam, or whitelisted addresses or whitelist warrants that are not often found in spam.

body patterns: A text processor can scan through message bodies (the content most users see) looking for text, formatting and stylistic characteristics that are often found in spam or not often found in spam.
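Pattern matching of this kind can be sketched in a few lines. The rule names, patterns and scores below are invented for illustration and are not taken from any real ruleset; the point is only that each matching header or body pattern contributes a small score, and the scores are summed.

```python
import re

# Hypothetical rules: (name, message part, pattern, score).
RULES = [
    ("SUBJECT_ALL_CAPS", "subject", re.compile(r"^[^a-z]*[A-Z]{5,}[^a-z]*$"), 1.5),
    ("FROM_DIGITS", "from", re.compile(r"\d{4,}@"), 1.0),
    ("BODY_CLICK_HERE", "body", re.compile(r"click here", re.I), 0.5),
    ("BODY_FREE", "body", re.compile(r"100% (free|guaranteed)", re.I), 1.0),
]

def score_message(parts: dict) -> tuple:
    """Return (total score, names of the rules that matched)."""
    hits = [(name, score) for name, part, pattern, score in RULES
            if pattern.search(parts.get(part, ""))]
    return sum(score for _, score in hits), [name for name, _ in hits]

message = {
    "from": "deals88213@example.com",
    "subject": "LOSE WEIGHT NOW",
    "body": "Click here for a 100% free trial.",
}
total, matched = score_message(message)
print(total, matched)
```

A whitelist rule would work the same way with a negative score, pulling the total down for senders known to be legitimate.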

DISTRIBUTED WHOLE-MESSAGE CHECKSUMS

Community-contributed hashes of entire spam messages are stored in a database on a central server. When an MTA receives a message, it can query the server to determine whether the message has been previously reported as spam. Since these systems rely on an actual human being judging a specific message to be spam, they can be very effective. Because these systems rely on contributions from a widespread community, they become more and more effective as more and more people use them.
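The idea can be illustrated with a toy version. This sketch assumes an exact SHA-256 hash over a lightly normalized body and uses an in-memory set as a stand-in for the shared server database; actual checksum services define their own signature schemes and network protocols, and generally need signatures that tolerate small per-recipient variations in a message.

```python
import hashlib
import re

def message_digest(body: str) -> str:
    """Hash a message body after collapsing whitespace and letter case."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Stand-in for the central database of community-reported spam.
reported_spam = set()

def report_spam(body: str) -> None:
    """Called when a human judges a message to be spam."""
    reported_spam.add(message_digest(body))

def previously_reported(body: str) -> bool:
    """Called by the receiving system for each new message."""
    return message_digest(body) in reported_spam

report_spam("Make money fast!\nReply today.")
print(previously_reported("make money   FAST!  reply today."))  # True
print(previously_reported("Lunch on Thursday?"))                # False
```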

STATISTICAL COMPARISONS TO KNOWN GOOD/BAD EMAIL

In August 2002, a programmer named Paul Graham proposed using statistical analysis of a user's old messages to judge the probability that a new message was spam. He fed the algorithm with a large body of non-spam email he had received and a large body of spam he had received to teach it how often different words occurred in each. His algorithm, a variant of the Naive Bayesian algorithm, could then estimate the probability that a message was spam or non-spam given that it contained a certain word, by comparing it to the kinds of messages he ordinarily received. When the analysis is extended to all the words in a message (or even just the fifteen most clear-cut), the estimate is strikingly accurate. It wasn't actually a new idea, but Graham's results were impressive (he reports that his implementation correctly distinguished over 99% of the next 1000 spams he received with no false positives) and the idea caught on quickly. His algorithm and similar statistical analyses are now being implemented and experimented with on a variety of platforms.
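A simplified sketch of this kind of statistical filter follows. It is not Graham's implementation: the tokenizer, the clamping of word probabilities to the range 0.01 to 0.99, and the value of 0.4 assigned to unseen words are assumptions of this sketch, and the training messages are made up. It does follow the outline above: learn per-word spam probabilities from two bodies of mail, then combine the fifteen most clear-cut words in a new message.

```python
import re
from collections import Counter
from math import prod

def tokens(text):
    """Distinct lowercase words in a message."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))

def train(spam_msgs, ham_msgs):
    """Estimate P(spam | word) for every word seen in training."""
    spam_counts = Counter(w for m in spam_msgs for w in tokens(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokens(m))
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word] / len(spam_msgs)  # share of spam containing word
        h = ham_counts[word] / len(ham_msgs)    # share of non-spam containing word
        probs[word] = min(0.99, max(0.01, s / (s + h)))
    return probs

def spam_probability(text, probs, top_n=15, unknown=0.4):
    """Combine the top_n word probabilities farthest from the neutral 0.5."""
    ps = sorted((probs.get(w, unknown) for w in tokens(text)),
                key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spammy = prod(ps)
    hammy = prod(1 - p for p in ps)
    return spammy / (spammy + hammy)

probs = train(
    spam_msgs=["Buy cheap pills now", "Cheap loans, click now"],
    ham_msgs=["Board meeting moved to Tuesday", "Draft grant report attached"],
)
print(spam_probability("cheap pills click now", probs))
print(spam_probability("grant report for the board meeting", probs))
```

Because the probabilities come from one recipient's own mail, the same word can count as spammy for one user and innocent for another, which is why this technique works best when trained per user.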

Why is wide-spectrum better?

Opinions differ about how appropriate or promising each of these techniques is, and some people prefer to use just one. But there are several very good reasons to use a variety of techniques:

  1. Strategies that use more techniques will detect more spam. Since the techniques all work on different message contexts, they can be assumed to have roughly independent detection patterns. Thus each one will pick up messages missed by a previously-applied method.

  2. Strategies that use more techniques will produce fewer false positives. In my experience, users' primary complaint about AS filtering concerns false positives - they hate to see legitimate email marked as spam. However, it's very difficult to adjust thresholds on single AS techniques to get acceptable results in marking true spam while minimizing false positives. But it's usually pretty easy to completely eliminate false positives if you don't need great detection results. Assuming you can tune each method to reject 50% of incoming spam with no false positives, and that the methods miss independently of one another, using all four will let through only 0.5 x 0.5 x 0.5 x 0.5, or about 6%, of spam. That eliminates approximately 94% of spam with absolutely zero false positives.

  3. Strategies that use more techniques will detect more types of spam. Since the techniques all work on different message contexts, some can specifically target characteristics of messages that will slip through other methods. For example, any means of modifying content to evade body checks creates a distinctive signature that can be easily detected with a statistical analyzer. Spams that slip through IP and pattern checks will be more likely to be seen by humans and reported to whole-message checksum services.

  4. Different methods work best at different scales. IP checks should happen as early as possible in the network path, making them useful from the ISP or WAN level. Pattern-matching and whole-message checks are universal enough to be applied best enterprise-wide, while statistical checks work best at the recipient, since each user has a distinctive statistical email content signature.
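The arithmetic behind the 94% figure in point 2 can be checked directly. This sketch assumes, as the argument does, that the techniques miss spam independently of one another; in practice their misses are likely to be somewhat correlated, so the figure is best read as an upper bound on what stacking buys you.

```python
def combined_detection_rate(per_technique_rate: float, techniques: int) -> float:
    """Fraction of spam caught by at least one of several independent tests."""
    miss_rate = (1 - per_technique_rate) ** techniques
    return 1 - miss_rate

for n in range(1, 5):
    print(n, combined_detection_rate(0.5, n))
# 1 0.5
# 2 0.75
# 3 0.875
# 4 0.9375
```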

And indeed, the wide-spectrum approach has proven very effective: SpamAssassin (currently using three of the four technique-families) claims to differentiate between spam and non-spam in 95-99% of cases. In my own experience I've seen these tools detect 90-95% of spam with nearly zero false positives, and the few false positives that do occur are on spamlike messages.

A prototype filtering system

Here I'll quickly describe a prototype email filtering server resembling the setup at NPOshield, the email filtering service I've built over the past year:
  1. SMTP mail is received by the MTA (Mail Transport Agent) and saved into an incoming queue for the mail scanning service.

  2. The mail scanner checks every few seconds for new incoming files - when it finds one, it calls the virus scanning engine on it to determine if it contains a virus.

  3. The virus scanner checks its virus pattern database and reports back with a status and a disinfected version of the message body if an infection was found.

  4. The mail scanner then calls the spam scanning engine on the original message to determine if it's likely spam.

  5. The spam scanner runs a variety of tests on the message body, determines whether it believes the message to be spam, and reports back with a status.

  6. Based on the reports from the AV and AS engines, the mail scanner then modifies the message, adding attachments and headers so the recipient will know that a virus was removed, a spam identification was positive, or the message was found to be clean.

  7. The new message is placed in an outgoing queue, which is then delivered by the MTA over SMTP to the destination mail server.

  8. The message is ultimately requested over POP or IMAP or any client protocol by the user, where the mail client can use spam ID markings to filter likely spam into a Junk mail folder.
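Steps 2 through 7 can be sketched as a small pipeline. Everything here is a hypothetical stand-in: the two scanner functions are placeholders rather than real engines, the header names and the threshold are chosen for illustration, and the queues are plain lists. The sketch shows only the control flow, in which every message is scanned, annotated and passed on rather than dropped.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    body: str
    headers: dict = field(default_factory=dict)

def virus_scan(msg):
    """Placeholder AV engine: returns (infected, disinfected body)."""
    infected = "EVIL-ATTACHMENT" in msg.body
    return infected, msg.body.replace("EVIL-ATTACHMENT", "")

def spam_scan(msg):
    """Placeholder AS engine: returns a spam score."""
    return 6.0 if "buy now" in msg.body.lower() else 0.5

def process(msg, spam_threshold=5.0):
    """Scan one message, then annotate it instead of discarding it."""
    infected, cleaned = virus_scan(msg)
    score = spam_scan(msg)  # scored on the original message, as in step 4
    headers = dict(msg.headers)
    headers["X-Virus-Status"] = "Disinfected" if infected else "Clean"
    headers["X-Spam-Status"] = "Yes" if score >= spam_threshold else "No"
    return Message(body=cleaned, headers=headers)

incoming_queue = [Message("Buy now! EVIL-ATTACHMENT"), Message("Minutes attached.")]
outgoing_queue = [process(m) for m in incoming_queue]
for m in outgoing_queue:
    print(m.headers)
```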

At NPOshield I use Sendmail as the MTA, MailScanner as the mail scanner, Clam AntiVirus as the AV engine, and SpamAssassin as the AS engine. Some sample (extra-verbose) output from SpamAssassin can be seen in the accompanying materials for this presentation, showing positive tests and how the message was scored. Interesting things to note: all three methods were used to score the spam (IP check, header/body patterns, whole-message report); although the tests all have low individual scores, the combined score is far above the threshold; and the X-Spam-Level header allows advanced users to filter spam on their own threshold instead of the centrally defined one if they prefer.
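To illustrate the last point, here is a sketch of how a recipient-side filter might apply a personal threshold. It assumes the X-Spam-Level header carries one marker character (typically an asterisk) per whole point of spam score; the function names, the sample message and the thresholds are made up for the example.

```python
from email import message_from_string

def spam_level(raw_message: str) -> int:
    """Count the marker characters in X-Spam-Level (0 if the header is absent)."""
    msg = message_from_string(raw_message)
    return len((msg.get("X-Spam-Level") or "").strip())

def folder_for(raw_message: str, personal_threshold: int = 8) -> str:
    """Pick a folder using the recipient's own threshold."""
    return "Junk" if spam_level(raw_message) >= personal_threshold else "Inbox"

raw = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spam-Level: ******\n"
    "\n"
    "Body text.\n"
)
print(spam_level(raw), folder_for(raw), folder_for(raw, personal_threshold=5))
# prints: 6 Inbox Junk
```

Most mail clients can express the same rule without code, as a filter that matches a run of marker characters of the desired length in that header.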

Evaluative criteria for filtering solutions

Do you need to filter?

Are the reasons to filter against viruses (added protection, reduction in unnecessary network traffic) important enough to justify the costs? Are the reasons to filter against spam (reduction in staff irritation, increased email productivity, reduction in unnecessary network traffic / disk storage) important enough to justify the costs? Are there privacy issues within your organization that would make it inappropriate or controversial to have a machine scanning email content?

DIY v. ASP

If you decide to filter, you'll need to decide between installing and maintaining a do-it-yourself (DIY) software product (running either on your mail server or your mail clients) or a server-based scanning service hosted at an ASP. Some advantages/disadvantages of different solutions and questions to consider:

DIY SOFTWARE

server-based:

  • Only an option if you already administer your own mail server.

  • Do you have the resources to administer it? Single point of success or failure: easy to administer compared to client-based, but filtering failure occurs enterprise-wide and could impact email delivery.

  • You may be able to choose a modular architecture, so you're not locked into one AS or AV approach.

client-based:

  • Only available for certain email clients.

  • Client-based solutions may require extra administrator time to install, extra user resources and extra user training.

cost (server-based and client-based): Software + in-house labor may be more cost-effective than subscription service.

HOSTED SERVICE (ASP)

filtering and email server hosted:

  • Do you trust the ASP to handle your email server?

  • Do you have other reasons to outsource your entire email server? (e.g., dissatisfaction with ISP hosting, lack of resources for internal hosting.)

  • What does the ASP interface let you administer yourself?

filtering hosted only:

  • Do you trust the ASP to handle your email?

  • Will the service work with your current SMTP architecture?

  • What happens if the filtering service fails - does email stop flowing, or is it delivered unfiltered?

  • Can the service be used in a secure configuration where no unfiltered mail can be sent to your server? Do you want to use it this way?

cost (both hosted options): Subscription service may be more cost-effective than software + in-house labor.

Comparing specific AV/AS solutions

AV CRITERIA
  • Effectiveness (how many viruses are caught/disinfected by the product) is always important. But organizations should maintain their desktop AV software (to protect against non-email routes of infection and in case of email filter failure), so 100% effectiveness in the filter is not a requirement. (This makes it easy to evaluate several solutions for effectiveness yourself by trying them out in series with your current AV software.)

  • How much is your mail processing slowed?

  • How intrusive (and how configurable) is the recipient- or sender-bound notification generated in the event of a virus disinfection?

  • Does the software need to be monitored or restarted often?

AS CRITERIA

  • How wide-spectrum is the spam identification engine? (i.e., how many different kinds of techniques does it use?) Wide-spectrum solutions work best.

  • Can you choose multiple actions for suspected spam? Recipients appreciate tagged spam that is delivered instead of being blocked or deleted.

  • Does the engine identify messages as binary spam/non-spam or assign a score indicating its probability of being spam? Score-based identifications are most flexible.

  • How much is your mail processing slowed?

  • If you're going to use only one technique, use statistical analysis. You should bear in mind that statistical methods work best at the scale of the individual, and must be trained on a corpus of user-received spam and non-spam to work most effectively. This may involve additional effort on the part of your users.