Christopher Browne's Web Pages
Internet Data Filtering

7. News Filtering

In the beginning, there was readnews.

There were a couple hundred newsgroups, and anyone capable of reading without drooling on their terminal could read all the news in a not-unreasonable amount of time. (This is referenced in the Hackers Test...)

Today, with thousands of newsgroups and millions of posters, you can't review a list of all the newsgroups in a reasonable period of time.

7.1. Kill Files

The second news reader, rn, provided "kill files" that allowed messages to be premarked as read based on article headers.

The "kill rules," highly representative of the "rule-based" approach, come in two varieties:

Local, applying to a particular newsgroup
Global, applying across all newsgroups

This split reduces the work, as there will likely be a limited number of "global" rules applied everywhere, and then a proliferation of rules that apply only to a single newsgroup.
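To make the local/global split concrete, here is a minimal Python sketch. It is not rn's kill file syntax or implementation; the rule table, the patterns in it, and the `is_killed` helper are all assumptions made up for illustration. A group pattern of `*` marks a rule as global, and any other pattern makes it local.

```python
import re
from fnmatch import fnmatch

# Hypothetical rule table: (newsgroup pattern, header name, regex).
# "*" as the group pattern makes a rule global; anything else is local.
KILL_RULES = [
    ("*", "From", r"spammer@example\.com"),
    ("*", "Subject", r"make money fast"),
    ("comp.os.linux.*", "Subject", r"\bwindows\b"),
]


def is_killed(newsgroup, headers):
    """Return True if any rule that applies to this group matches."""
    for group_pattern, header, regex in KILL_RULES:
        if not fnmatch(newsgroup, group_pattern):
            continue  # a local rule belonging to some other group
        if re.search(regex, headers.get(header, ""), re.IGNORECASE):
            return True  # the article would be premarked as read
    return False


article = {"From": "someone@example.org", "Subject": "Windows vs. Linux"}
print(is_killed("comp.os.linux.misc", article))  # local rule applies here
print(is_killed("comp.databases", article))      # only global rules apply
```

The same article is killed in one group and kept in the other, which is the whole point of keeping the per-group rules separate from the handful of global ones.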

7.2. Score Files

This is an "obvious" extension of kill files; various keywords can be combined with different weights to build up an article "score."

Articles with high scores are likely the "most interesting" and should be read first; articles with poor scores (typically below some threshold) may be eliminated forthwith.

One might have things like the following SLRN scoring rules:

% These rules are for all newsgroups...
[*]
% AOL has a slight tendency towards having bozos...
Score: -5
From: aol.com
% Slight reversal for MIT...
Score: 5
From: mit.edu
% I certainly want to see anything coming from linus...
Score: 5000
From: Linus Torvalds
% And nothing from Bob Allistat...
Score: -9999
From: Bob Allistat
% If Linux is mentioned, favor the article a bit...
Score: 5
Subject: Linux

% In a Linux group, a "Linux" subject is rather uninformative...
[*linux*]
Score: -5
Subject: Linux

% In database newsgroups, I want to highlight Linux stuff...
[comp.database*]
Score: 100
Subject: linux

% And vice-versa
[comp.os.linux.*]
Score: 100
Subject: data
Subject: base
Score: 100
Subject: dbm

% Ditto for spreadsheets for Linux
[comp.apps.spreadsheets]
Score: 100
Subject: linux

% And the converse rule
[comp.os.linux.*]
Score: 100
Subject: spread
Subject: sheet
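How rules like these add up to an article score can be sketched in a few lines of Python. This is not SLRN's implementation. The in-memory `RULES` table, the `KILL_THRESHOLD` cutoff, the case-insensitive matching, and the `score` and `triage` helpers are all assumptions for illustration; only the rule contents are taken from the score file above.

```python
import re
from fnmatch import fnmatch

# Hypothetical in-memory form of a few of the rules above:
# (newsgroup pattern, points, [(header, regex), ...]).
# Every header test in a rule must match for its points to count.
RULES = [
    ("*", -5, [("From", r"aol\.com")]),
    ("*", 5, [("From", r"mit\.edu")]),
    ("*", -9999, [("From", r"Bob Allistat")]),
    ("*", 5, [("Subject", r"Linux")]),
    ("*linux*", -5, [("Subject", r"Linux")]),
    ("comp.os.linux.*", 100, [("Subject", r"data"), ("Subject", r"base")]),
]

KILL_THRESHOLD = -10  # assumed cutoff; articles scoring lower are dropped


def score(newsgroup, headers):
    """Sum the points of every rule that applies to the group and matches."""
    total = 0
    for pattern, points, tests in RULES:
        if not fnmatch(newsgroup, pattern):
            continue
        if all(re.search(rx, headers.get(h, ""), re.IGNORECASE)
               for h, rx in tests):
            total += points
    return total


def triage(newsgroup, articles):
    """Drop articles below the threshold; list the rest, best score first."""
    scored = [(score(newsgroup, a), a) for a in articles]
    kept = [pair for pair in scored if pair[0] >= KILL_THRESHOLD]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


articles = [
    {"From": "user@aol.com", "Subject": "hello"},
    {"From": "user@mit.edu", "Subject": "Linux database question"},
    {"From": "Bob Allistat <bob@example.com>", "Subject": "Linux"},
]
for points, art in triage("comp.os.linux.misc", articles):
    print(points, art["Subject"])
```

In a Linux group the bare word "Linux" in a subject nets out to zero (the global +5 is cancelled by the local -5), so the database question rises to the top on the strength of the "data" and "base" rule.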

7.3. Gnus

The most sophisticated scoring system comes with the Gnus news reader, which integrates into GNU Emacs and XEmacs.

Anyone needing inspiration for improved features in a news reader should consult the Gnus documentation, and "steal" features from there.

Neat idea: Dynamic Scoring

If I read an article, give the topic/author a little positive score.
If I don't read an article, give 'em a little negative score.
If I follow up or reply, give a big positive score, as I clearly found the article interesting.
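The three rules above can be sketched as a small Python class. Gnus documents a feature along these lines under the name adaptive scoring, but this is not Gnus code. The adjustment sizes, the `DynamicScores` class, and the `normalize` helper that folds "Re:" prefixes together are all assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed adjustment sizes; the text only says "little" and "big".
DELTAS = {"read": 1, "skipped": -1, "replied": 10}


def normalize(subject):
    """Fold 'Re: Re: Foo' and 'foo' into the same topic key."""
    return re.sub(r"^(re:\s*)+", "", subject.strip(), flags=re.IGNORECASE).lower()


class DynamicScores:
    def __init__(self):
        self.by_author = defaultdict(int)
        self.by_topic = defaultdict(int)

    def record(self, article, action):
        """Nudge the author and topic scores after the reader acts."""
        delta = DELTAS[action]
        self.by_author[article["From"]] += delta
        self.by_topic[normalize(article["Subject"])] += delta

    def score(self, article):
        """Predicted interest in a new article from past behaviour."""
        return (self.by_author.get(article["From"], 0)
                + self.by_topic.get(normalize(article["Subject"]), 0))


scores = DynamicScores()
scores.record({"From": "alice@example.org", "Subject": "Kernel 2.2"}, "read")
scores.record({"From": "alice@example.org", "Subject": "Re: Kernel 2.2"}, "replied")
scores.record({"From": "bob@example.org", "Subject": "Off topic"}, "skipped")

print(scores.score({"From": "alice@example.org", "Subject": "Re: Kernel 2.2"}))
print(scores.score({"From": "bob@example.org", "Subject": "Off topic"}))
```

Because both the author and the topic are adjusted, a follow-up by the same person in the same thread picks up the bonus twice, which may or may not be what a real reader wants.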

7.4. Other Scoring Systems

7.5. adcomplain

adcomplain is a shell script through which you can redirect "offending" news and mail messages.

The script has the ability (in some cases) to search for the "real" identity of the offender, as well as a best guess of someone to contact at their service provider.

I installed the script as /home/cbbrowne/bin/spam. Character-based news readers can typically pipe the current message to spam by typing: |spam

I have helped get a number of spammers' ISP accounts removed by using this utility.

7.6. Pass the news feed thru Ifile

I have experimented with passing my news through the Ifile system. "Spam" gets quite accurately dumped to my MH Spam folder; the sexually oriented material even more successfully flows to Spam/phonesex for later purging.

Beyond that, I defined "virtual newsgroups" relating to topics of interest. (And created some for "trash" to flow into.)

Material looking like Linux advocacy tended to flow to Linux/Advocacy, regardless of where it was originally posted.
Material on Linux hardware gets automatically (and fairly accurately) split up into:
- Linux/Hardware/CPU
- Linux/Hardware/CDROM (mostly junked)
- Linux/Hardware/Cameras (digital cameras)
- Linux/Hardware/Disk
- Linux/Hardware/Ethernet (mostly junked)
- Linux/Hardware/Modems (mostly junked) ...
- Linux/Hardware/SCSI
- Linux/Hardware/Tape (some keepers)
- Linux/Hardware/Video (mostly junked)
This allowed me to fairly quickly target useful hardware information for reading.
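Ifile is usually described as a naive Bayes text classifier. The following Python sketch shows that general technique for routing messages into folders; it is not Ifile's code or interface, and the `FolderClassifier` class, the `tokens` helper, and the tiny training texts are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class FolderClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> messages seen
        self.vocabulary = set()

    def train(self, folder, text):
        """Record one message that the user filed into `folder`."""
        words = tokens(text)
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the folder with the highest log-probability for the text."""
        words = tokens(text)
        total_docs = sum(self.doc_counts.values())
        best_folder, best_logp = None, -math.inf
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            denom = sum(counts.values()) + len(self.vocabulary)
            logp = math.log(n_docs / total_docs)  # folder prior
            for w in words:
                # Laplace smoothing keeps unseen words from zeroing the score.
                logp += math.log((counts[w] + 1) / denom)
            if logp > best_logp:
                best_folder, best_logp = folder, logp
        return best_folder


clf = FolderClassifier()
clf.train("Linux/Hardware/SCSI", "Adaptec SCSI controller not detected by kernel")
clf.train("Linux/Hardware/Video", "XFree86 video card resolution flicker")
clf.train("Linux/Advocacy", "Linux is better than Windows and always will be")
clf.train("Spam", "make money fast call now free offer")

print(clf.classify("Which SCSI controller works with my tape drive?"))
print(clf.classify("free money offer"))
```

A classifier like this only picks the best folder; it says nothing about whether the message is worth keeping, which is where the "mostly junked" annotations above come from.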

Unfortunately, the filtering process still dropped a lot of garbage into my "mail feed," and required that I go through and delete a whopping lot of messages once they were read. It works, but needs further refinement to make it of "production" quality.

There needs to be some sort of "scoring" ability, with "score thresholds" to indicate that messages can be discarded. I can't think of a decent interface to feed in information to this effect.
