
5. How to Filter

5.1. Political Approaches

There are more "advocacy groups" than you can shake a stick at. Unfortunately, enforcement of anything is inherently expensive.

One person going to court for a day keeps at least six "professionals" on the job (two lawyers, a judge, a bailiff, a court clerk, and a secretary), which is expensive.

5.2. Other Advocacy Organizations

Here are some URLs:

5.2.1. Rule-based, matching header information

As a vast generalization, information about the sender is the best single indicator of how to classify the message, at least from a quality perspective.

For instance, many people used to set up "elimination rules" for messages coming from [anyone]@aol.com.

Looking at news, the quality of a posting tends to be highly correlated with the identity of the author. If I see something written by Linus Torvalds, that gets "high points" for validity. On the other hand, if it's written by Toronto's conspiracy theorist Bob Allistat, it's highly probable that I will consider it to be worthless drivel.

Subject: information is far less useful, particularly with news, because people seem incapable of comprehending the need to update the subject line as discussion flows to new topics.

Some people keep looking for ways to "nuke" all messages detectably written in Netscape's "pseudo-HTML" format.
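
To make this concrete, here is a minimal Python sketch of the sort of matching a header-rule filter performs. The rules and folder names are invented for illustration, not taken from any particular filtering package.

    # Minimal sketch of header-rule filtering, assuming a message arrives as
    # RFC 2822 text on stdin; the rules and folder names are illustrative only.
    import email
    import sys

    # Each rule pairs a header name, a substring to look for, and a target folder.
    RULES = [
        ("From", "@aol.com", "junk"),            # the old "elimination rule" example
        ("From", "torvalds@", "high-priority"),  # sender identity as a quality signal
        ("Content-Type", "text/html", "html-suspect"),
    ]

    def classify(msg):
        """Return the folder named by the first matching rule, or 'inbox'."""
        for header, pattern, folder in RULES:
            value = msg.get(header, "")
            if pattern.lower() in value.lower():
                return folder
        return "inbox"

    if __name__ == "__main__":
        message = email.message_from_file(sys.stdin)
        print(classify(message))

Each rule is cheap to evaluate, which is why this approach stays fast as long as the rule list stays short.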

5.2.1.1. Pros

  • So long as there aren't too many rules, it's fast.

  • It's often easy to share rules. Periodically grab an updated list of notable spammers from Rahul Dhesi's web site or some such place.

  • It's easy to associate actions with rules.

5.2.1.2. Cons

  • Spammers now build messages with random, misleading header information, so it's tough to come up with a header rule that will kill only their messages while letting "good" mail through.

  • You have to design a lot of rules. News readers typically have tools to ease this process; mail readers don't.

5.2.2. Analyze message text

Methods that look at the body of the message have more material to work with, and thus can provide much better classification...
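
Here is a tiny Python sketch of the word-frequency ("naive Bayes") style of classification that a tool like Ifile performs. The folder names and training text are made up, and real implementations are considerably more refined.

    # Sketch of word-frequency classification: manually-filed messages build a
    # per-folder word database, and new messages go to the best-scoring folder.
    import math
    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    class WordClassifier:
        def __init__(self):
            self.word_counts = defaultdict(Counter)  # folder -> word -> count
            self.message_counts = Counter()          # folder -> messages trained

        def train(self, folder, text):
            """Manually-classified messages build the 'learning database'."""
            self.word_counts[folder].update(tokenize(text))
            self.message_counts[folder] += 1

        def classify(self, text):
            """Return the folder with the highest log-probability for this text."""
            words = tokenize(text)
            total_msgs = sum(self.message_counts.values())
            best_folder, best_score = None, float("-inf")
            for folder, counts in self.word_counts.items():
                total_words = sum(counts.values())
                # Folder prior plus per-word likelihoods, with add-one smoothing.
                score = math.log(self.message_counts[folder] / total_msgs)
                for w in words:
                    score += math.log((counts[w] + 1) / (total_words + len(counts) + 1))
                if score > best_score:
                    best_folder, best_score = folder, score
            return best_folder

    clf = WordClassifier()
    clf.train("spam", "make money fast guaranteed free offer")
    clf.train("linux", "kernel patch for the scheduler posted to the list")
    print(clf.classify("free offer, guaranteed money"))   # -> 'spam'

Because scoring uses every word in the body, a forged From: line or a badly chosen newsgroup does not throw it off the way it throws off a header rule.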

5.2.2.1. Pros

  • Provides extremely good classification, once tuned;

  • As it uses the full text of the message, it is not confused by misleading header information including poor newsgroup selections;

  • New "kinds" of messages can be processed with an expectation of some reasonable classification.

5.2.2.2. Cons

  • There needs to be an initial "training period" where messages are classified manually in order to build some sort of database of "classification rules."

  • Hefty resource consumption

    • Memory

      For Ifile, performance improvements have primarily come by moving from "naive" methods to algorithms that minimize RAM usage.

    • Disk space for the database

    • CPU

  • Fuzzy results

    All you generally get with full-text methods is a "best classification." Since the success rate is not quite 100%, it is not safe to take drastic action on any particular message, such as automatically sending a complaint to the sender's postmaster or tossing it out (see the sketch after this list). I've had "pearls" dropped in my Spam folder now and again.

    Most importantly, that piece of email might actually be a response from an ISP to a complaint about Spam.
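
One way to hedge against that fuzziness is to act automatically only when the best score beats the runner-up by a comfortable margin. A minimal Python sketch follows; the scores and the margin value are invented for illustration.

    # Only file a message automatically when the classifier is clearly confident;
    # otherwise leave it in the inbox for a human to look at.
    def choose_folder(scores, margin=5.0):
        """scores: mapping of folder name -> classifier score (higher is better)."""
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if best[1] - runner_up[1] >= margin:
            return best[0]        # confident enough to file automatically
        return "inbox"            # too close to call; let a human decide

    print(choose_folder({"spam": -40.0, "linux": -52.0}))   # -> 'spam'
    print(choose_folder({"spam": -41.0, "linux": -43.0}))   # -> 'inbox'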

5.2.3. Query a server for ratings

The GroupLens news evaluation system allows users to rate articles. It's been active for Linux newsgroups. The ratings are collected on a server, and other users can query the server for those ratings.

Unfortunately, someone has to be the first one to rate those articles, and these people get no benefit.

But tie in a "full text classifier" that works like Ifile, so that rated messages get thrown into the "learning database"; then new messages can get a rating straight off, and the raters will get some benefit...
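
A rough Python sketch of that tie-in, assuming a classifier like the WordClassifier sketched earlier; the 1-to-5 rating scale, the bucket names, and the server interface are all assumptions, not part of GroupLens or Ifile.

    # Articles already rated on the server seed the learning database, so that
    # brand-new articles can get a provisional rating before anyone rates them.
    def seed_from_ratings(classifier, rated_articles):
        """rated_articles: iterable of (article_text, rating) pairs from the server."""
        for text, rating in rated_articles:
            bucket = "worth-reading" if rating >= 4 else "skip"   # arbitrary 1-5 cutoff
            classifier.train(bucket, text)

    def provisional_rating(classifier, article_text):
        """Score an as-yet-unrated article so early readers still get guidance."""
        return classifier.classify(article_text)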
