
5. How to Filter

5.1. Political Approaches

There are more "advocacy groups" than you can shake a stick at. Unfortunately, enforcement of anything is inherently expensive.

One person going to court for a day keeps at least six "professionals" on the job (two lawyers, a judge, a bailiff, a court clerk, and a secretary), which is expensive.

5.2. Other Advocacy Organizations

Here are some URLs:

5.2.1. Rule-based, matching header information

As a vast generalization, information about the sender is the best single indicator of how to classify the message, at least from a quality perspective.

For instance, many people used to set up "elimination rules" for messages coming from [anyone]@aol.com.

Looking at news, the quality of a posting tends to be highly correlated with the identity of the author. If I see something written by Linus Torvalds, that gets "high points" for validity. On the other hand, if it's written by Toronto's conspiracy theorist Bob Allistat, it's highly probable that I will consider it to be worthless drivel.

Subject: information is far less useful, particularly with news, because people seem incapable of comprehending the need to update the subject line as discussion flows to new topics.

Some people keep looking for ways to "nuke" all messages detectably written in Netscape's "pseudo-HTML" format.
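
To make this concrete, here is a minimal Python sketch of the sort of matching a header-rule filter performs. The rules and folder names are invented for illustration, not taken from any particular filtering package.

    # Minimal sketch of header-rule filtering, assuming a message arrives as
    # RFC 2822 text on stdin; the rules and folder names are illustrative only.
    import email
    import sys

    # Each rule pairs a header name, a substring to look for, and a target folder.
    RULES = [
        ("From", "@aol.com", "junk"),            # the old "elimination rule" example
        ("From", "torvalds@", "high-priority"),  # sender identity as a quality signal
        ("Content-Type", "text/html", "html-suspect"),
    ]

    def classify(msg):
        """Return the folder named by the first matching rule, or 'inbox'."""
        for header, pattern, folder in RULES:
            value = msg.get(header, "")
            if pattern.lower() in value.lower():
                return folder
        return "inbox"

    if __name__ == "__main__":
        message = email.message_from_file(sys.stdin)
        print(classify(message))

Each rule is cheap to evaluate, which is why this approach stays fast as long as the rule list stays short.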

5.2.1.1. Pros

  • So long as there aren't too many rules, it's fast.

  • It's often easy to share rules. Periodically grab an updated list of notable spammers from Rahul Dhesi's web site or some such place.

  • It's easy to associate actions with rules.

5.2.1.2. Cons

  • Spammers now build messages with random, misleading header information, so it's tough to come up with a header rule that will kill only their messages while letting "good" mail through.

  • You have to design a lot of rules. News readers typically have tools to ease this process; mail readers don't.

5.2.2. Analyze message text

Methods that look at the body of the message have more material to work with, and thus can provide much better classification...
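
Here is a tiny Python sketch of the word-frequency ("naive Bayes") style of classification that a tool like Ifile performs. The folder names and training text are made up, and real implementations are considerably more refined.

    # Sketch of word-frequency classification: manually-filed messages build a
    # per-folder word database, and new messages go to the best-scoring folder.
    import math
    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        return re.findall(r"[a-z']+", text.lower())

    class WordClassifier:
        def __init__(self):
            self.word_counts = defaultdict(Counter)  # folder -> word -> count
            self.message_counts = Counter()          # folder -> messages trained

        def train(self, folder, text):
            """Manually-classified messages build the 'learning database'."""
            self.word_counts[folder].update(tokenize(text))
            self.message_counts[folder] += 1

        def classify(self, text):
            """Return the folder with the highest log-probability for this text."""
            words = tokenize(text)
            total_msgs = sum(self.message_counts.values())
            best_folder, best_score = None, float("-inf")
            for folder, counts in self.word_counts.items():
                total_words = sum(counts.values())
                # Folder prior plus per-word likelihoods, with add-one smoothing.
                score = math.log(self.message_counts[folder] / total_msgs)
                for w in words:
                    score += math.log((counts[w] + 1) / (total_words + len(counts) + 1))
                if score > best_score:
                    best_folder, best_score = folder, score
            return best_folder

    clf = WordClassifier()
    clf.train("spam", "make money fast guaranteed free offer")
    clf.train("linux", "kernel patch for the scheduler posted to the list")
    print(clf.classify("free offer, guaranteed money"))   # -> 'spam'

Because scoring uses every word in the body, a forged From: line or a badly chosen newsgroup does not throw it off the way it throws off a header rule.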

5.2.2.1. Pros

  • Provides extremely good classification, once tuned;

  • As it uses the full text of the message, it is not confused by misleading header information including poor newsgroup selections;

  • New "kinds" of messages can be processed with an expectation of some reasonable classification.

5.2.2.2. Cons

  • There needs to be an initial "training period" where messages are classified manually in order to build some sort of database of "classification rules."

  • Hefty resource consumption

    • Memory

      For Ifile, performance improvements have primarily come by moving from "naive" methods to algorithms that minimize RAM usage.

    • Disk space for the database

    • CPU

  • Fuzzy results

    All you generally get with full-text methods is a "best classification." Since the success rate is not quite 100%, it is not safe to take drastic action on any particular message, such as automatically sending a complaint to the sender's postmaster or tossing it out (see the sketch after this list). I've had "pearls" dropped in my Spam folder now and again.

    Most importantly, that piece of email might actually be a response from an ISP to a complaint about Spam.
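
One way to hedge against that fuzziness is to act automatically only when the best score beats the runner-up by a comfortable margin. A minimal Python sketch follows; the scores and the margin value are invented for illustration.

    # Only file a message automatically when the classifier is clearly confident;
    # otherwise leave it in the inbox for a human to look at.
    def choose_folder(scores, margin=5.0):
        """scores: mapping of folder name -> classifier score (higher is better)."""
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        if best[1] - runner_up[1] >= margin:
            return best[0]        # confident enough to file automatically
        return "inbox"            # too close to call; let a human decide

    print(choose_folder({"spam": -40.0, "linux": -52.0}))   # -> 'spam'
    print(choose_folder({"spam": -41.0, "linux": -43.0}))   # -> 'inbox'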

5.2.3. Query a server for ratings

The GroupLens news evaluation system allows users to rate articles. It's been active for Linux newsgroups. The ratings are collected on a server, and other users can query the server for those ratings.

Unfortunately, someone has to be the first one to rate those articles, and these people get no benefit.

But tie in a "full text classifier" that works like Ifile, so that rated messages get thrown into the "learning database"; then new messages can get a rating straight off, and the raters will get some benefit...
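
A rough Python sketch of that tie-in, assuming a classifier like the WordClassifier sketched earlier; the 1-to-5 rating scale, the bucket names, and the server interface are all assumptions, not part of GroupLens or Ifile.

    # Articles already rated on the server seed the learning database, so that
    # brand-new articles can get a provisional rating before anyone rates them.
    def seed_from_ratings(classifier, rated_articles):
        """rated_articles: iterable of (article_text, rating) pairs from the server."""
        for text, rating in rated_articles:
            bucket = "worth-reading" if rating >= 4 else "skip"   # arbitrary 1-5 cutoff
            classifier.train(bucket, text)

    def provisional_rating(classifier, article_text):
        """Score an as-yet-unrated article so early readers still get guidance."""
        return classifier.classify(article_text)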
