cleanfeed(8) Cleanfeed - Because spam sucks cleanfeed(8)

NAME
Cleanfeed - spam filter for Usenet news servers

SYNOPSIS
INN: Installed as filter_innd.pl, location is configured into INN at compile time.

Highwind servers: <command line> -program cleanfeed -body

NNTPRelay: ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

DESCRIPTION
A spam filter for Usenet servers. Cleanfeed blocks spam on the way into your server, before it is written to disk or propagated to
outbound feeds. It can also block binaries in non-binary newsgroups and includes several other features to keep your newsfeed clean.

Cleanfeed currently works with INN, Cyclone, Typhoon, Breeze, and NNTPRelay servers. See my webpage (listed at the end of this document)
for pointers to information about using Cleanfeed with CNews, Diablo, Collabra, or INN versions earlier than 1.5.1.

USAGE
For all versions, place the cleanfeed.conf configuration file somewhere, then edit the Cleanfeed source file and change the $config_dir
option at the top to point to the directory where the config file lives.

INN
Install the filter file (called cleanfeed) as filter_innd.pl, and cleanfeed.conf, in the location you specified in config.data (INN
1.7.2 and earlier) or when configuring INN 2.x (usually the bin/filter directory under the installation root). Make sure both files
are readable by the news user. Once in place, the filter is loaded with the command ctlinnd reload filter.perl meow. Filtering can be
turned on with ctlinnd perl y and turned off with ctlinnd perl n.

Cyclone/Typhoon/Breeze
Add the -program <file> and -body options to the bin/start script, where <file> is the location and name of the Cleanfeed program.
Restart the server. Cleanfeed will run as an external process (standalone mode). IMPORTANT: make sure both cleanfeed and
cleanfeed.conf are readable by the news user! Double-check the permissions as this is a fairly common mistake!

NNTPRelay
Find the ExternalFilter directive in config.txt and make it look like:

ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

Cleanfeed will run as an external process (standalone mode).

More detailed installation instructions are provided later in this document.

CONFIGURATION OPTIONS
Configuration is accomplished by setting the various options in the cleanfeed.conf configuration file. This file is evaluated as Perl
code, so comments can be included in the usual Perl # syntax. A sample default file is included with the distribution.

If you would rather not use cleanfeed.conf, you can set its location to "undef" in the source and edit the configuration variables directly
in the source file.

cleanfeed.conf has two sections (which define perl hashes): %config_local and %config_append. Entries in %config_local will override the
default settings of the same name in the Cleanfeed source. Entries in %config_append can be used to add to most of the default regular
expressions, for items such as badguys, bin_allowed, poison_groups, etc. Settings in %config_append for these items will be appended to
the default regexps, separated by "|" (or).

If you want to completely override the default regexps for these options, rather than just add to the defaults, you can add an entry for
them into the %config_local section of cleanfeed.conf.

All of this is done quite blindly, so if you do anything odd, be careful. (Cleanfeed will remove the common mistake of including two "|"
(or) signs in a row.) All config options are exposed to %config_local, including any that may not be present in the sample file. Only the
defined list of options is exposed to %config_append.
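
As a rough illustration of the append behaviour described above, here is a small Python sketch (not Cleanfeed's own Perl code). The function name merge_pattern is made up for this example, and stripping a leading or trailing "|" is an extra assumption on my part; the text above only promises that doubled "|" signs are removed.

```python
import re

def merge_pattern(default, extra):
    """Append `extra` to `default` as one more alternative (hypothetical helper)."""
    # Skip empty pieces so an unset option does not add a stray "|".
    parts = [p for p in (default, extra) if p]
    merged = "|".join(parts)
    # Collapse runs of "|" into one, as the documentation says Cleanfeed does.
    merged = re.sub(r"\|{2,}", "|", merged)
    # Assumption: also drop a leading or trailing "|", which would be just as harmful.
    return merged.strip("|")

print(merge_pattern("foo|bar", "baz"))
print(merge_pattern("foo|", "|baz"))
```

Like the real thing, this is done blindly: it does not understand escaped "\|" sequences, so a pattern containing one could be changed in meaning.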

Options that are on/off or yes/no should be set to 1 for on/yes, or 0 for off/no.

First, you need to tell Cleanfeed which news server software you are using. At the top of the file, set the appropriate variable to 1.
For INN, set $inn; for Cyclone, Typhoon, or Breeze, set $highwind; and for NNTPRelay, set $nntprelay. Ensure the other two (the ones
you're not using) are set to 0.

General Settings

aggressive
Set this to 0 to disable all content-based filters. Helpful to please paranoid lawyers, or paranoid customers.

active_file
Set this to the full path to an active file, to allow Cleanfeed to know what groups are moderated. This is normally your server's
active file, but it doesn't have to be; it is possible, for example, to run Cyclone with no active file, but give one to Cleanfeed
anyway.

MD5 Body Filter Settings

do_md5
When turned on, the MD5 EMP checks will be done. This should be left on unless you have a really good reason to turn it off. If
you're running Hippo along with Cleanfeed, you might feel Cleanfeed's MD5 checks are redundant and want to turn them off, for
example. It would probably be better to leave it on with the history turned down, instead.

md5maxmultiposts
Start rejecting articles after we have seen this many copies, according to the MD5 checksum filter.

MD5History
How many articles to remember for MD5-based EMP comparison. Since the MD5 filter is not prone to false positives, setting this
higher is a good idea to catch more spam, if you have the RAM to spare.

MD5maxlife
When a spam is identified by the MD5 EMP filter, it is saved for continual rejection. MD5maxlife specifies how long, in hours, to
keep a saved MD5 id which is no longer getting any hits. (A spam id which is still getting matches will be saved regardless of
age.) 24 hours works well.

fuzzy_md5
When turned on, the message bodies will be munged up a bit before MD5 checksums are generated. Whitespace and other non-
alphanumeric characters are stripped and letters are forced to lowercase, as well as a couple other bits of treachery to try to
defeat the "hashbuster" spam-bots. This adds a bit of "fuzziness" to the MD5 filter, and results in a performance hit as well.

Since the smarter spammers have discovered hashbusting, I recommend that this be turned on.

fuzzy_max_length
Sets the maximum number of lines for an article body to be subject to the fuzzy_md5 munging (above). This keeps extremely large
articles out of those nasty regular expressions.

md5_skips_followups
Determines whether the MD5 filter checks articles with References headers. The default is to skip them. Setting this option to 0
will result in all articles passing through the MD5 filter, which can result in a major performance hit, but does close another
hole in the filter. If you turn this off, you should increase MD5History as well to avoid shortening your "window".

MD5HistSize
The maximum allowed size of the EMP memory for the MD5-checksum EMP filter. Use this as a "sanity check" to prevent a sudden burst
of spam from eating up all of your memory. It should be set high enough so that you normally never hit this number; use
MD5maxlife to expire the hash instead.
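
To make the fuzzy_md5 and fuzzy_max_length descriptions above more concrete, here is a minimal Python sketch. It covers only the steps named in the text (strip non-alphanumeric characters, force lowercase, then take the MD5); the "other bits of treachery" are not documented here and are not reproduced. The function name and the choice to hash over-long bodies unmodified are assumptions.

```python
import hashlib
import re

def fuzzy_md5(body, max_lines=None):
    """Return an MD5 hex digest of a normalised article body (illustrative only)."""
    lines = body.splitlines()
    if max_lines is not None and len(lines) > max_lines:
        # Assumption: bodies longer than the limit are hashed as they are.
        text = body
    else:
        # Lowercase, then drop everything that is not an ASCII letter or digit.
        text = re.sub(r"[^a-z0-9]", "", body.lower())
    return hashlib.md5(text.encode("utf-8")).hexdigest()

a = fuzzy_md5("Buy NOW!!!\n  Cheap pills.\n")
b = fuzzy_md5("buy now\ncheap   pills")
print(a == b)
```

Two bodies that differ only in case, spacing, and punctuation hash to the same value, which is the point of the option.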

Header-Based EMP Filter Settings

do_phl
Turns on the NNTP-Posting-Host/Lines EMP filter. This filter identifies spam by identical posting-host headers and article sizes
in a short period of time. You really don't want to turn this off.

do_fsl
Turns on the From/Subject/Lines EMP filter. This filter identifies spam by identical From and Subject headers and article sizes in
a short period of time. This is the one that gets the fewest hits these days, so you won't lose much by shutting it off.

maxmultiposts
Start rejecting articles after we have seen this many copies, according to the header-based EMP filter. Since false positives are
somewhat more likely with this filter than with MD5, this should be set appropriately higher to reduce the odds.

ArticleHistory
How many ids to remember for header-based EMP comparison. Setting this higher will catch more spam because there will be a larger
"window" to look at. Larger settings will also consume more memory and have a (small) impact on performance, as well as slightly
increase the chance of a false positive (since the sample size will be larger). Most articles will actually take up two entries in
this history because there are two different header-based filters.

EMPmaxlife
Same as MD5maxlife but for the header-based EMP filter.

EMPHistSize
Same as MD5HistSize but for the header-based EMP filter. If you are running the header-based filter but not the MD5 filter for
whatever reason, set this high.
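
The interplay of maxmultiposts and ArticleHistory can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not the filter's actual code: the class and function names are invented, the history is a simple oldest-first list of keys, and time-based expiry (EMPmaxlife) is left out.

```python
from collections import OrderedDict

class EmpHistory:
    """Bounded memory of recently seen keys with a copy counter (hypothetical)."""

    def __init__(self, max_copies, history_size):
        self.max_copies = max_copies      # plays the role of maxmultiposts
        self.history_size = history_size  # plays the role of ArticleHistory
        self.counts = OrderedDict()

    def seen(self, key):
        """Record one sighting; return True once the key exceeds max_copies."""
        count = self.counts.pop(key, 0) + 1
        self.counts[key] = count              # re-insert as the newest entry
        while len(self.counts) > self.history_size:
            self.counts.popitem(last=False)   # forget the oldest entry
        return count > self.max_copies

def emp_keys(headers):
    """Build one key per header-based filter, so each article uses two entries."""
    lines = headers.get("Lines", "")
    return [
        ("phl", headers.get("NNTP-Posting-Host", ""), lines),
        ("fsl", headers.get("From", ""), headers.get("Subject", ""), lines),
    ]

def is_emp(history, headers):
    # Update both entries before deciding, so neither history falls behind.
    results = [history.seen(key) for key in emp_keys(headers)]
    return any(results)

history = EmpHistory(max_copies=2, history_size=100)
post = {"NNTP-Posting-Host": "dialup.example", "From": "a@example",
        "Subject": "MAKE MONEY FAST", "Lines": "12"}
print([is_emp(history, post) for _ in range(3)])
```

A larger history_size widens the window, which is why raising ArticleHistory catches more spam at the cost of memory.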

Excessive Crosspost Settings

maxgroups
Reject articles crossposted so that followups will be to more than this many newsgroups.

low_xpost_maxgroups
Specify a special, lower crosspost limit for certain groups, specified by regular expression in low_xpost_groups (below). Useful
for being more strict in groups plagued by crossposting, such as sex, binaries, and jobs groups. (Replaces the old tfjmaxgroups
option.)
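
One way to read the two crosspost limits above is sketched below in Python. Several details are guesses and should be treated as such: that the count is taken from Followup-To when present and from Newsgroups otherwise, and that the lower limit applies as soon as any group in Newsgroups matches low_xpost_groups. The function names are invented for the example.

```python
import re

def split_groups(value):
    """Split a comma-separated header value into a list of group names."""
    return [g.strip() for g in value.split(",") if g.strip()]

def too_many_groups(headers, maxgroups, low_xpost_groups, low_xpost_maxgroups):
    """Return True if followups would go to more groups than the limit allows."""
    newsgroups = split_groups(headers.get("Newsgroups", ""))
    # Assumption: followups go to Followup-To if set, else to Newsgroups.
    followups = split_groups(headers.get("Followup-To", "")) or newsgroups
    limit = maxgroups
    # Assumption: one matching group is enough to trigger the lower limit.
    if low_xpost_groups and any(re.search(low_xpost_groups, g) for g in newsgroups):
        limit = low_xpost_maxgroups
    return len(followups) > limit

post = {"Newsgroups": "a.one,a.two,a.three,a.four,a.five,a.six,misc.jobs.offered"}
print(too_many_groups(post, 10, r"\.jobs\.|\.forsale", 6))
```

Setting a Followup-To header that narrows replies to one group lets the same article pass in this model.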

Misplaced Binaries Filter

block_binaries
Enables blocking of binary posts in non-binary newsgroups. Which newsgroups allow binaries is configured with bin_allowed (below).

max_encoded_lines
Sets the number of uuencoded or base64-encoded lines to allow before considering a post to be a binary. This should be set high
enough to pass regular PGP signatures. (Those satanic Netscape crypto-sigs can die along with the other binaries.) Default is 15
lines, which may be a little low if you are lenient, which you're not.

binaries_in_mod_groups
If set, binaries are allowed in spite of block_binaries if they are posted only to moderated groups (requires active_file).
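
The max_encoded_lines test can be approximated as counting lines that look like full uuencoded or base64 lines. The Python sketch below uses two simple patterns of my own choosing; the patterns Cleanfeed actually uses are not given in this document and may well differ.

```python
import re

# A full uuencoded line: "M" followed by 60 characters in the uuencode alphabet.
UUENCODE_LINE = re.compile(r"^M[\x20-\x60]{60}$")
# A typical base64 line: 60 to 76 alphabet characters, optionally padded.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")

def count_encoded_lines(body):
    """Count body lines that look uuencoded or base64-encoded."""
    return sum(
        1 for line in body.splitlines()
        if UUENCODE_LINE.match(line) or BASE64_LINE.match(line)
    )

def looks_binary(body, max_encoded_lines=15):
    """True once the body has more encoded lines than the allowed number."""
    return count_encoded_lines(body) > max_encoded_lines

sample = "\n".join(["M" + "A" * 60] * 16)
print(looks_binary(sample))
```

A PGP signature block contributes a handful of base64-looking lines, which is why the threshold should not be set too low.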

HTML

block_mime_html
Enables blocking of MIME-encapsulated HTML posts. This does NOT affect straight text/html or multipart/alternative posts of the
type created by misconfigured Netscape and Microsoft "newsreaders", but ONLY posts which are MIME-encapsulated HTML, a favorite
format of sex spammers which often sneaks in under the EMP radar.

block_html
Enables blocking of HTML and multipart/alternative posts. You can specify group patterns where HTML is allowed by setting
html_allowed (below).

Cancel Message Filtering

block_late_cancels
If turned on, cancels for recently rejected articles will be rejected. Set the window with MIDmaxlife (below). This will result
in a huge number of rejections if you have multiple full feeds and you aren't backlogging. If you are concerned about your
downstream sites receiving the cancels, leave this off. If you need a performance boost, turn it on.

MIDmaxlife
How long to remember rejected message-ids so cancels for these posts can later be rejected. Specified in hours. This only has an
effect if block_late_cancels is enabled (above).
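
The relationship between block_late_cancels and MIDmaxlife can be modelled as a timed memory of rejected message-ids. This Python sketch is only a model: the class name is invented, timestamps are passed in explicitly as seconds, and the Control header is assumed to have the usual "cancel <message-id>" form.

```python
class RejectedIds:
    """Remember rejected message-ids for a limited time (hypothetical model)."""

    def __init__(self, max_life_hours):
        self.max_life = max_life_hours * 3600  # MIDmaxlife is given in hours
        self.rejected = {}                     # message-id -> rejection time

    def remember(self, message_id, now):
        self.rejected[message_id] = now

    def is_late_cancel(self, control_header, now):
        """True if this is a cancel for an article rejected within the window."""
        parts = control_header.split()
        if len(parts) != 2 or parts[0].lower() != "cancel":
            return False
        rejected_at = self.rejected.get(parts[1])
        if rejected_at is None:
            return False
        if now - rejected_at > self.max_life:
            del self.rejected[parts[1]]        # too old, forget it
            return False
        return True

memory = RejectedIds(max_life_hours=1)
memory.remember("<spam1@example>", now=0)
print(memory.is_late_cancel("cancel <spam1@example>", now=1800))
```

Cancels for articles the filter never rejected are unaffected and pass through to downstream sites as usual.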

Disabling Other Filters

do_scoring_filter
Enables the (new) "scoring" filter. You probably want to leave this on, even if you need to turn off aggressive mode (turning off
aggressive mode will disable the content-based parts of the scoring filter).

do_mid_filter (INN only)
Enables the message-id filter. This requires an additional patch to INN 1.7.2, which is included with Cleanfeed (but optional).
The patch adds a new Perl hook to check message-id's during the NNTP CHECK transaction, and decide whether to refuse the article.
There is a patch for this for INN 2.0 which may get incorporated into the INN distribution at some point. The default is off.

do_bot_checks
Enables the filters that check for spam bot signatures. The only reason you would ever want to turn this off is if you've written
your own version, or something. Otherwise, leave it on.

do_supersedes_filter
Enables the Excessive Supersedes filter, to catch rogue Supersedes attacks. This filter begins dropping articles with Supersedes
headers if too many appear from the same posting-host in a short time. Moderated groups are given a higher limit (if active_file
is set), as is news.answers. Default is on.

check_supersedes_path
If set, bad_cancel_paths will also be applied to Supersedes articles. Articles with Supersedes headers, where a path element
matches the regexp in bad_cancel_paths, will be dropped. Default is on.

drop_useless_controls
If set, all control messages of types sendsys, senduuname, and version will be dropped. These are no longer useful and are a hole
for denial-of-service attacks due to the way INN and some other servers handle them. On by default.

drop_ihave_sendme
If set, control messages of types ihave and sendme will be dropped. See drop_useless_controls. If you use these types of control
messages, turn this off. If you're not sure, then you're not using them.

drop_control_with_supersedes
Drops any and all control messages which contain a Supersedes header. Since control messages are not passed through the same
filters as regular messages, a rogue Supersedes attack can use control messages to avoid filtering; this option closes this hole.
Legitimate control messages don't have Supersedes headers. On by default.

Hash-Trimming

trimcycles
The EMP memories are trimmed every trimcycles times through the filter.

EMPstarttrimming
Tells the filter not to waste time trimming the EMP memories until they have this many entries. Just a minor performance
enhancement during the first hours the filter is running or when you first start innd.

Logging

verbose
When turned on, verbose logging to news.notice will happen; spam domains will be listed, etc. When off, only general messages will
be logged, making the news.daily summaries less interesting but much shorter and more to the point. (There is, alas, no way to
shut off news.notice logging entirely.) (news.notice only applies to INN.) Note that this will not reduce the number of log
entries, but only their verbosity.

logfile (Standalone Mode)
If set to the path to a file, this will enable logging of message-ids of all articles processed by the filter. Rejections will be
logged with the reason for rejection. Note that this will create a very large logfile which you will need to rotate or delete (see
max_log_size, below).

reportfile (Standalone Mode)
If set to the path to a file, this will enable generation of a simple report of articles accepted and rejected. The report file
will contain one entry per line with the start time, end time, number of articles accepted, and number of articles rejected, tab-
separated.
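
Since the report file is plain tab-separated text, it is easy to post-process. The Python sketch below parses lines in the field order given above and computes an overall rejection ratio. The time fields are kept as opaque strings because their format is not stated here, and the sample line is invented.

```python
from typing import NamedTuple

class ReportEntry(NamedTuple):
    start: str      # start time, format not specified in this document
    end: str        # end time, same caveat
    accepted: int
    rejected: int

def parse_report_line(line):
    """Parse one tab-separated report line into a ReportEntry."""
    start, end, accepted, rejected = line.rstrip("\n").split("\t")
    return ReportEntry(start, end, int(accepted), int(rejected))

def rejection_ratio(entries):
    """Fraction of all processed articles that were rejected."""
    accepted = sum(e.accepted for e in entries)
    rejected = sum(e.rejected for e in entries)
    total = accepted + rejected
    return rejected / total if total else 0.0

entry = parse_report_line("900000000\t900000600\t750\t250\n")  # made-up sample
print(entry.accepted, entry.rejected, rejection_ratio([entry]))
```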

log_accepts (Standalone Mode)
When using the above logfiles, this setting determines whether articles accepted should be logged. When disabled, only rejections
will be logged.

max_log_size (Standalone Mode)
The size at which to rotate the logfile. This will be replaced by time-based rotation at some point.

statfile
If this is set to the full path of a file, a crude stats file will be written each time the filter is reloaded with ctlinnd reload
filter.perl meow (for INN) or whenever the Cleanfeed process receives a SIGUSR1 (for standalone mode). The file shows how many
entries are present in each of the EMP histories, MID history and excessive supersedes history; timer information if enabled (see
timer_info); and the contents of all configuration settings. Posting-hosts for each supersedes entry will be listed, along with
their counts; these are not being rejected unless they are over the threshold. The default for this is undef, which disables
creation of the stat file.

More comprehensive stats are planned for the future.

Timing Info

timer_info
When enabled, Cleanfeed will generate timing statistics telling you how many articles per second are being examined by the filter
and being accepted by the filter. This information will appear in the statfile if this is enabled, and in the output of INN's
ctlinnd mode if the mode.patch is applied to INN. Note that the accepted/second rate is not necessarily the rate at which your
server is accepting articles; articles can be rejected by the server after Cleanfeed passes them, for example if they are posted to
groups not in your active file.

timer_interval
The period over which to average timing information, in seconds. The default is 600 seconds, or 10 minutes.

Debugging

debug_batch_directory
Specifies a directory where debugging "batchfiles" can be written. See the Hacker's Guide in this document for more information.

debug_batch_size
The maximum size of a debugging batchfile before it gets rotated. Rotation is done by renaming the file to file.1, file.2, etc.,
using the lowest number that doesn't already exist.
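
The rotation rule described for debug_batch_size (rename to the lowest unused numeric suffix) can be sketched in Python as follows. The function names are invented, and treating "size at or above the limit" as the trigger is an assumption.

```python
import os

def rotation_target(path):
    """Return path.N for the lowest N (starting at 1) that does not exist yet."""
    n = 1
    while os.path.exists(f"{path}.{n}"):
        n += 1
    return f"{path}.{n}"

def rotate_if_needed(path, max_size):
    """Rename the file out of the way once it reaches max_size bytes."""
    if os.path.exists(path) and os.path.getsize(path) >= max_size:
        target = rotation_target(path)
        os.rename(path, target)
        return target
    return None
```

Note that with this rule a gap left by a deleted file (say file.2 missing between file.1 and file.3) is filled first, so the numeric suffix does not always reflect age.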

Regular Expressions

You can add to most of these regular expressions in the %config_append section of cleanfeed.conf; settings you add there will be added to
the defaults, rather than overriding them. If you want to completely override the default settings you can add entries for these to the
%config_local section instead.

bin_allowed
This is a regular expression telling the anti-binary filter in which newsgroups binaries are allowed. If all groups in the
Newsgroups header match this pattern, binaries are allowed through the filter. (This obviously has no effect when the binary
filter is disabled.) If the binary filter is enabled and this is set to a null string (by overriding the default in the local
config) the result will be blocking all binaries regardless of where they are posted.
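
The "all groups must match" rule for bin_allowed is worth spelling out, since a single non-binary group in a crosspost is enough to fail it. A minimal Python sketch, with an invented function name and an example pattern that is not the shipped default:

```python
import re

def binaries_allowed(newsgroups, bin_allowed):
    """True only if every group in the Newsgroups header matches bin_allowed."""
    groups = [g.strip() for g in newsgroups.split(",") if g.strip()]
    # A null pattern means binaries are allowed nowhere.
    if not bin_allowed or not groups:
        return False
    pattern = re.compile(bin_allowed)
    return all(pattern.search(g) for g in groups)

example = r"^alt\.binaries\."   # illustrative pattern only
print(binaries_allowed("alt.binaries.pictures,alt.binaries.sounds", example))
print(binaries_allowed("alt.binaries.pictures,comp.lang.perl", example))
```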

poison_groups
If any groups in the Newsgroups header match this regexp, the article will be rejected. Thus you can reject crossposts to certain
groups even if they are also posted to groups you carry.

html_allowed
This is a regular expression telling the anti-HTML filter in which newsgroups HTML and multipart/alternative posts are allowed.
This only has an effect if block_html is turned on (above). The default (to allow HTML in microsoft.* groups) can be added to in
cleanfeed.conf.

If you don't want to allow HTML anywhere, not even the microsoft.* groups, override this setting in the local configuration and
set it to a null string or undef.

md5exclude
If an article is posted only to groups matching this regexp, the MD5 EMP filter will not be applied. Useful for "test" groups
where it's okay for lots of the posts to be the same.

allexclude
If an article is posted only to groups matching this regexp, NO checks are applied at all.

low_xpost_groups
If a group matches this regular expression, it gets a special crosspost limit, set in low_xpost_maxgroups, rather than the general
crosspost limit set in maxgroups. This is useful for groups plagued by excessive crossposting, such as sex, binaries, and jobs
groups. The default is to limit crossposts to 6 groups in test, forsale, and jobs groups. Setting this to a null string, or
undef, will disable this feature.

badguys
This is a monster regular expression containing domains of known spammers. Only the "middle" parts of the domains are listed; these
are checked as email addresses in From headers by appending a list of top-level domains to the end, and as URLs by prepending
http:// and an optional "www.". If you modify this list, be very careful not to end up with "||" in there (two "or" signs in a
row); this will match every single post that comes through, which is Bad.
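
The reason "||" is so dangerous is that it creates an empty alternative, and an empty alternative matches anywhere. The Python sketch below shows the effect with a bare pattern; the real filter wraps the list with top-level domains and URL prefixes, which this example leaves out. The domain fragments and the helper names are made up, and the checker is deliberately naive (it would also flag an escaped "\|" at the end of a pattern).

```python
import re

def matches_badguys(text, badguys):
    """Apply the pattern the way a simple content check would."""
    return re.search(badguys, text) is not None

def has_empty_alternative(pattern):
    """Naive check for "||" or a leading or trailing "|"."""
    return "||" in pattern or pattern.startswith("|") or pattern.endswith("|")

good = "spammer-one|spammer-two"      # invented entries
broken = "spammer-one||spammer-two"   # same list with a doubled "|"

print(matches_badguys("an innocent post", good))
print(matches_badguys("an innocent post", broken))
print(has_empty_alternative(broken))
```

Running a check like has_empty_alternative over a modified list before reloading the filter is a cheap safeguard.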

baddomainpat
If a post contains a URL for a site whose domain name matches this pattern (in .com, .net, and .nu TLDs only) the post will be
rejected. For example, there are hundreds of spamming porn sites whose domain names begin or end with "xxx". This prevents us
from having to keep up with their nonsense. Yes, it's a little aggressive, but it works.

exempt
Regular expression of NNTP-Posting-Hosts that are exempt from the posting-host-based EMP filter. This is for high-output systems
where all posts contain the same NNTP-Posting-Host header, such as AOL, which if not exempted would end up hitting the posting-host
EMP filter with all of their posts. There aren't many of these out there; a "regular" multi-user system does not present a problem
because the filter doesn't kick in until it sees a large number of posts from the same posting-host and also of the same length, in
a short period of time.

supersedes_exempt
Regular expression of NNTP-Posting-Hosts that are exempt from the excessive supersedes filter. Generally this will be systems
which post a lot of FAQs.

bad_cancel_paths
Cancel messages will be rejected if the Path header contains elements matching this regular expression. Also applied to the
NNTP-Posting-Host. If check_supersedes_path is set, this will also be checked against the Path header of articles with Supersedes
headers. This list contains sites which are or have recently been the source of rogue cancel attacks.

refuse_messageids (INN only)
If you have do_mid_filter (above) enabled, and you have the optional message-id patch applied to INN (or otherwise have obtained
the hook for filter_messageid in INN 2.0), this regular expression will be applied to message-ids as they are offered to your
server, and they will be refused if it matches.

net_abuse_groups

spam_report_groups
These regular expressions are used to exempt certain groups from certain filters; for example, groups expected to contain spam
reports, example spams, NoCeM notices, etc. These are not in cleanfeed.conf; if you need to add to them please let me know.

After modifying the filter file, always check for mistakes by typing:

perl -cw filter_innd.pl (or cleanfeed or whatever you called it)

There should be no errors and no warnings.

You can check cleanfeed.conf with:

perl -cw cleanfeed.conf

You will get several warnings about variables being used only once; these can be ignored.

If you are running INN, you can modify the file and reload it with ctlinnd reload filter.perl meow while the server is running. The
configuration in cleanfeed.conf will be reloaded at this time as well.

With the Highwind servers, modifying the program will require a server restart (use the bin/restart script). Note that this will result in
all connections (including newsreader clients) being dropped. This is not my fault. :)

When in standalone mode, configuration from cleanfeed.conf can be reloaded by sending Cleanfeed a SIGHUP.

I have no idea what NNTPRelay does, but I'm guessing it needs a restart as well.

IMPORTANT NOTE: A common mistake is not setting file permissions on cleanfeed/filter_innd.pl, cleanfeed.conf, and cleanfeed.local so that
they are readable by the news user. Please double-check your permissions! If Cleanfeed is running, and fails to successfully load
cleanfeed.conf, it will use the default settings instead of those you specified in the config file.

INSTALLATION - INN
These instructions assume you have the Perl hooks compiled into INN. If you don't, you will need to add them and rebuild the INN
distribution before proceeding.

With INN, Perl is embedded into the innd program. The filter file defines subroutines that are called by innd at the appropriate times.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with INN, you will need:

o INN 1.5.1 or later (1.7.2+insync1.1d or 2.1 recommended)

o Perl 5.004 or later

o Perl hooks compiled into INN

o The MD5 Perl module

INN is available from:
http://www.isc.org/inn.html

The Insync distribution of INN (highly recommended if you aren't running INN 2.1) is available from:
http://www.insync.net/~aos/inn.html

The MD5 Perl module is available from:
http://www.perl.com/CPAN-local/modules/by-module/MD5/

Perl itself is available from the Perl home page:
http://www.perl.com/

PATCHES AND STUFF

INN 2.0 includes everything you need to run Cleanfeed, except the MD5 Perl module.

With earlier versions, Cleanfeed requires some patches to INN in order to function properly.

If you are running INN 1.7.2+insync1.1d, you already have the original filter.patch and the dynamic-load.patch; you need only apply the
upgrade.patch.

None of these patches are against INN 2.1; the "extra feature" ones like mode.patch may not apply to 2.1. Ports are always welcome.

filter.patch
This patch provides the basic functionality for Cleanfeed by making some extra headers available to the Perl filter, as well as message
bodies. This patch was changed in version 0.95.3. It is against INN 1.7.2 and should be applied in the innd directory. This patch is
included in the insync "megapatch" for INN as of version 1.1c, so if you are running this version of INN you need not apply this patch.
Not necessary for INN 2.x.

dynamic-load.patch
This patch enables INN's Perl interpreter to load dynamic modules. It is necessary for MD5 support. The patch is against INN
1.7+insync and should be applied in the lib directory (NOT the innd directory). It applies cleanly to other versions of INN including
1.5.1 and 1.7.2. This patch is included in the insync "megapatch" for INN as of version 1.1d, so if you are running this version of
INN you need not apply this patch. Not necessary for INN 2.x.

If you are still using INN 1.5.1, you can use dynamic-1.5.1.patch instead.

In order to compile INN with the new patch, you need to edit the PERL_LIB entry in config.data. Type this command at the shell, and
paste its output into config.data as PERL_LIB:

perl -MExtUtils::Embed -e ldopts

Most systems also allow you to simply enter that line in backquotes as PERL_LIB.

This patch requires Perl 5.004 or later! INN will not compile linked with Perl 5.003 after following these instructions!

AIX: There is a problem with Perl dynamic loading from INN under the AIX operating system. In simple terms, it doesn't work. This
seems to be a problem with the gcc compiler. Success has been reported by rebuilding both Perl and INN with IBM's commercial compiler
CSet (a.k.a. xlC).

Solaris: There have been multiple reports of Cleanfeed not working under Solaris if any part of the system -- INN, Perl, or the MD5
module -- is compiled using egcs. Success has been reported by recompiling everything with gcc, and by upgrading to the very newest
egcs.

upgrade.patch
For current users of Cleanfeed, this is a patch for an already-patched INN, or for 1.7.2+insync1.1d, to bring you up to the new version
of the Cleanfeed patch. Not applying this patch right now will only lose you a couple of filters, and nothing will break if you don't
apply it (no changes to the filter source or configuration will be required).

messageid.patch
This is a patch which adds a new Perl hook to innd, filter_messageid. This allows you to run a Perl subroutine against each message-id
as it is offered to your server, and decide whether to refuse the article before it is even sent to your server. Cleanfeed includes a
small filter_messageid. This patch is entirely optional.

mode.patch
This patch adds a line to INN's ctlinnd mode output for Perl filter status. The output line is generated by the filter_stats
subroutine. The default output contains the number of articles accepted, rejected and refused since the filter started, and the sizes
of the EMP, Message-ID, and Excessive Supersedes hashes. If timer_info is enabled, this will also include the rate in articles per
second (rounded to the nearest tenth) at which articles were examined (total sent through the filter) and accepted by the filter,
averaged over the timer_interval number of seconds.

After applying the patches, rebuild all of INN and do a "make update". The first patch (filter.patch) only requires innd to be rebuilt,
but the dynamic-load.patch requires you to rebuild the whole distribution. Current users upgrading with upgrade.patch need only rebuild
innd and reinstall that executable.

Thus:

cd inn [to the top-level source directory]
make clean
cd innd
cp wherever/filter.patch . [from the Cleanfeed distribution]
patch <filter.patch
cd ../lib
cp wherever/dynamic-load.patch . [from the Cleanfeed distribution]
patch <dynamic-load.patch
cd ../config
emacs config.data [edit the PERL_LIB entry as above]
make all
make update

Finally, you need to install the MD5 Perl module, no matter what version of INN you are running.

INSTALLING CLEANFEED - INN

In INN 1.7.2 and earlier, the location where INN looks for the Perl filter is set in config.data, as _PATH_PERL_FILTER_INND. By default,
the filename is filter_innd.pl. The Cleanfeed filter program file should be installed in this location. INN comes with an example
filter_innd.pl file; move this file (or whatever other filter is in place) out of the way first.

Before putting the filter in place, edit the file, changing $config_dir to the location of your cleanfeed.conf file.

After editing the file, always check for errors with the command:

perl -cw filter_innd.pl

Once the file is in place, tell innd to reload it:

ctlinnd reload filter.perl meow

And, if Perl filtering is currently disabled, enable it:

ctlinnd perl y

Now, you can watch it working by looking at your news.notice log:

tail -f /var/log/news/news.notice

If your server is running a full feed, you should start seeing a constant stream of rejections almost immediately.

INSTALLATION - HIGHWIND SERVERS
The various Highwind server packages (Cyclone, Typhoon, and Breeze) all have the same external filter interface. The filter runs as its
own process, reading from standard input and writing to standard output.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with a Highwind server, you will need:

o Cyclone, Typhoon or Breeze

o Perl 5.003 or later

o The MD5 Perl module

The Highwind servers are commercial products. For more information:
http://www.highwind.com/

The MD5 Perl module is available from:
http://www.perl.com/CPAN-local/modules/by-module/MD5/

Perl itself is available from the Perl home page:
http://www.perl.com/

INSTALLING CLEANFEED - HIGHWIND

The Cleanfeed program file should be installed as "cleanfeed" in your news server's bin directory (cyclone/bin, etc). Make it owned by
news:news and make it executable.

Before putting the filter in place, edit the file, changing $config_dir to the location of your cleanfeed.conf file. Also ensure that the
shebang line (the first line of the file, starting with #!) points to the correct location of your perl executable.

After editing the file, always check for errors with the command:

perl -cw cleanfeed

There should be no warnings.

Now, edit your bin/start script. You need to add two options to the command line that starts up the server process, the -program option to
tell it what program to use as a filter, and the -body option to tell it to send the bodies as well as the headers.

typhoond -program /typhoon/bin/cleanfeed -body

...along with whatever else you have cluttering up the command line.

(Highwind has indicated that this may/will be a config file option in a future release.)

Now you can restart the server with the bin/restart script. Check to make sure Cleanfeed is running, with "ps -ef" or "top". If
Cyclone/Typhoon is unable to start the filter for some reason, it will log an error via syslog. The error will not be terribly helpful.

You can make Cleanfeed reload its configuration from cleanfeed.conf and local code from cleanfeed.local by sending it a SIGHUP.

INSTALLATION - NNTPRELAY
Please note that I do not have an NNTPRelay server, nor access to one, nor much interest in mucking around with Windows NT, and thus I have
not tested the NNTPRelay filtering support myself. The necessary changes and notes were contributed by someone else. Additions and
improvements to this documentation would be most welcome.

The filter interface in NNTPRelay is pretty much the same as in the Highwind servers.

SYSTEM REQUIREMENTS

In order to run Cleanfeed with NNTPRelay, you will need:

o NNTPRelay version 1.1b4 or later

o Perl 5.003 or later

o The MD5 Perl module

NNTPRelay is available from:
http://nntprelay.maxwell.syr.edu/

An NT binary release of Perl 5.004, which apparently includes the MD5 module, can be found at:
http://www.perl.com/CPAN/ports/win32/Standard/x86

The MD5 module (in source code) can be found at:
http://www.perl.com/CPAN-local/modules/by-module/MD5/

INSTALLING CLEANFEED - NNTPRELAY

Before putting the filter in place, edit the file, changing $config_dir to the location of your cleanfeed.conf file.

Install the Cleanfeed program file wherever is appropriate on your system, as "cleanfeed.pl". Edit NNTPRelay's config.txt file, adding an
entry like this:

ExternalFilter=c:/perl/bin/perl.exe c:/news/cleanfeed.pl

Of course, use the correct path to your Perl executable and to the Cleanfeed program file. Now restart NNTPRelay. If you defined a
logfile in Cleanfeed, it should appear.

THE HACKER'S GUIDE
Cleanfeed will look for a file called cleanfeed.local, in the same directory as cleanfeed.conf. If this file exists, it will be loaded and
evaluated as Perl code right after the config file. This enables you to provide your own local filter code which will survive an upgrade
of the main Cleanfeed source.

It will be reloaded when the filter is reloaded with ctlinnd reload filter.perl meow (for INN), or when configuration is reloaded with a
SIGHUP (in standalone mode). This means that you can modify the running code without restarting Cleanfeed.

cleanfeed.local can define a number of different subroutines, which, if defined, will be called at various points in the filter process.
Other subroutines can, of course, be defined as required by your code.

The file is simply re-evaluated each time. So, if you remove a subroutine from the file completely, that subroutine will remain defined
after the reload, because nothing replaced it. You will need instead to define it as an empty subroutine, or explicitly undef it, to make
it go away.
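
The reload behaviour can be seen in a small stand-alone sketch. It does not run Cleanfeed: the three strings are made-up stand-ins for successive versions of cleanfeed.local, and checking with defined() is only an assumption about how a caller might test for a hook.

```perl
use strict;
use warnings;

# Version 1 of the (pretend) local file defines a hook.
my $version1 = q{ sub local_filter_last { return 'rejected by old code' } 1; };
# Version 2 no longer mentions the hook at all.
my $version2 = q{ 1; };
# Version 3 removes the hook explicitly.
my $version3 = q{ undef &local_filter_last; 1; };

eval $version1 or die $@;
my $after_first = defined(&local_filter_last) ? 1 : 0;    # hook exists

eval $version2 or die $@;
my $after_second = defined(&local_filter_last) ? 1 : 0;   # still exists: nothing replaced it

eval $version3 or die $@;
my $after_third = defined(&local_filter_last) ? 1 : 0;    # gone after the explicit undef

print "after v1: $after_first, after v2: $after_second, after v3: $after_third\n";
```

Replacing the body with an empty subroutine that returns "" would work as well; the explicit undef is shown only because it is the less familiar of the two options.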

STUFF YOU CAN DEFINE

Cleanfeed will call the following subroutines, if they are defined. See the section on return values for instructions on what your code
should return.

local_config
This is called after configuration is loaded, each time. It will be called when the filter is reloaded (with INN) or when
configuration is reloaded with SIGHUP (running standalone), as well as when the filter is first run. No return value is expected.

local_filter_before_emp
Called for each (non-control) article, before any other filters. General-purpose spam filters shouldn't go here, because you really
want to populate the EMP hashes first.

local_filter_after_emp
Called for each (non-control) article, after the EMP filters but before any other filters.

local_filter_middle
Called for each (non-control) article, after the "simple" filters but before the "expensive" body checks.

local_filter_scoring
Called during the scoring filter. Return the value, positive or negative, by which to adjust the article's score.

Warning: Here there be dragons! If you're going to play with this please examine the existing source, and use the debugging routines
to watch what you're doing.

local_filter_last
Called for each (non-control) article, after all other filters are done.

local_filter_cancel
Called for all cancel control messages.

local_filter_newrmgroup
Called for all newgroup and rmgroup control messages.
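
As a rough illustration of how local_config and one of the filter hooks might work together, here is a minimal sketch of a cleanfeed.local. The host list, the $bad_host_re variable, and the rejection text are all invented for the example. The "our %hdr" line is only a stand-in so the snippet runs by itself; Cleanfeed supplies that hash.

```perl
use strict;
use warnings;

our %hdr;   # stand-in; provided by Cleanfeed in a real cleanfeed.local

# Hypothetical local data: hosts we do not want to see in the Path header.
my @bad_hosts = ('spam.example', 'junk.example');
my $bad_host_re;

# Runs at first load and on every reload, so the pattern is rebuilt
# whenever the list above is edited.
sub local_config {
    my $alternatives = join '|', map { quotemeta } @bad_hosts;
    $bad_host_re = qr/(?:^|!)(?:$alternatives)(?:!|$)/i;
    return;
}

# Runs per article; an empty string means "accept".
sub local_filter_last {
    my $path = defined $hdr{'Path'} ? $hdr{'Path'} : '';
    return 'Bad host in Path' if $path =~ $bad_host_re;
    return '';
}
```

Doing the expensive setup in local_config rather than in the per-article hook keeps the work out of the filtering path.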

RETURN VALUES

The general filtering subroutines you can define (local_filter_before_emp, local_filter_after_emp, local_filter_middle, local_filter_last,
local_filter_cancel, and local_filter_newrmgroup) are expected to return a value indicating whether you want to accept the article being
examined. If the article is okay, you should return "" (empty string), in which case filtering will proceed as usual. If you want to
reject the article, you return any other string, which will be used as the reason.

The rejection code actually expects two return values -- the first string is the "verbose" rejection message, and the second is the
"non-verbose" message (see the verbose configuration option). If only one is supplied, it will be used for both purposes.
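
A filter that returns both forms of the reason might look like the following sketch. The subject pattern and the message texts are made up for illustration, and "our %hdr" is a stand-in for the hash Cleanfeed provides.

```perl
use strict;
use warnings;

our %hdr;   # stand-in; provided by Cleanfeed in a real cleanfeed.local

sub local_filter_middle {
    my $subject = defined $hdr{'Subject'} ? $hdr{'Subject'} : '';

    if ($subject =~ /\bmake money fast\b/i) {
        # Verbose reason first, short reason second.
        return ("Local subject filter (matched: $subject)",
                'Local subject filter');
    }

    return '';   # accept; filtering proceeds as usual
}
```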

The scoring filter calls local_filter_scoring, which is expected to return the value, positive or negative, by which the article's score
should be adjusted.
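
A scoring hook might be sketched like this. The patterns and weights are arbitrary examples, not values taken from Cleanfeed's own scoring filter, and the two "our" declarations stand in for variables Cleanfeed provides.

```perl
use strict;
use warnings;

our %hdr;      # stand-ins; provided by Cleanfeed in a real cleanfeed.local
our @groups;

sub local_filter_scoring {
    my $adjust  = 0;
    my $subject = defined $hdr{'Subject'} ? $hdr{'Subject'} : '';

    $adjust += 2 if $subject =~ /!{3,}/;    # three or more exclamation marks in a row
    $adjust += 1 if $subject =~ /\$\$\$/;   # literal "$$$"
    $adjust -= 1 if @groups == 1;           # not crossposted at all

    return $adjust;   # positive raises the score, negative lowers it
}
```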

WHAT YOU GET

Your subroutines get information about the article in several variables.

%hdr
A hash containing the article headers. The key is the header name, in "canonical" case as INN likes them; the value is the content of
the header. When running under INN, only headers known to INN will be included in the hash (which includes any header used anywhere in
Cleanfeed). In standalone mode, all headers will be present, but only the known headers will be sent in canonical case; others will
have the header name (and thus hash key) in whatever case they are in the article itself, making them difficult to find and use
consistently.

The message body is in this hash under the key __BODY__. If running INN 2.x with storageapi, it will be provided in wireformat, with
lines terminated in \r\n rather than just \n. With the traditional spool format (and in all cases with INN prior to 2.x) lines will be
terminated only with \n.

Examples:

To get the Subject header as a scalar: $hdr{'Subject'}

To get the entire message body as a scalar: $hdr{'__BODY__'}
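
Two small helpers can smooth over the differences described above: one splits the body into lines whether they end in CRLF or a bare LF, and one looks a header up without regard to the case of its name. Both helper names (body_lines and header_ci) are invented for this sketch, and "our %hdr" is a stand-in for the hash Cleanfeed provides.

```perl
use strict;
use warnings;

our %hdr;   # stand-in; provided by Cleanfeed in a real cleanfeed.local

# Return the body as a list of lines, accepting CRLF or bare LF.
# Note that split drops trailing empty lines.
sub body_lines {
    my $body = defined $hdr{'__BODY__'} ? $hdr{'__BODY__'} : '';
    return split /\r?\n/, $body;
}

# Return the value of a header whatever case its name arrived in,
# or undef if it is not present.
sub header_ci {
    my ($name) = @_;
    for my $key (keys %hdr) {
        return $hdr{$key} if lc $key eq lc $name;
    }
    return;
}
```

The linear scan in header_ci is only worth paying for headers outside the known set; known headers can be read directly from %hdr.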

%lch
A hash containing lowercased versions of some of the article headers. The hash keys are the header names in all lowercase; the values
are the contents of the headers, with all letters forced to lowercase.

Currently, the only headers added to this hash are From, Organization, Subject, Content-Type, X-Newsreader, X-Newsposter, Message-ID,
and Sender.

This hash is not available to local_filter_before_emp.

@groups
An array containing the newsgroups the article is posted to (from the Newsgroups header). You can find out how many groups the article
is crossposted to with "scalar @groups".

@followups
An array containing the newsgroups to which followups are set (from the Followup-To header). If the article has no Followup-To header,
this array will be identical to @groups. You can find out how many groups followups are set to with "scalar @followups". This is the
preferred way to limit crossposting, because limiting only by the Newsgroups header will catch FAQs and such.
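
A followup-based limit could be sketched as follows. The limit of 5 and the choice of hook are arbitrary, and the "our" declaration stands in for the arrays Cleanfeed provides.

```perl
use strict;
use warnings;

our (@groups, @followups);   # stand-ins; provided by Cleanfeed

my $max_followups = 5;       # hypothetical local limit

sub local_filter_after_emp {
    # Count where replies will go, not where the article itself was posted.
    if (@followups > $max_followups) {
        return 'Too many followup groups (' . scalar(@followups) . ')';
    }
    return '';
}
```

A widely crossposted FAQ with Followup-To set to a single group passes this check, which is the point of counting @followups instead of @groups.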

$lines
The number of lines in the message body. This is not taken from the Lines header as that can be client-supplied to fool filtering;
this is determined by counting the lines in the message body.

%gr
A hash containing information about the groups the article is posted to. This isn't very straightforward and may not be useful to you,
but I'm including it in this documentation for completeness. The following entries may be present in this hash:

$gr{'net'} - the number of net.* (Usenet II) newsgroups the article is posted to, if any.

$gr{'other'} - the number of non-net.* groups the article is posted to.

$gr{'md5skip'} - true if the article should be exempted from the MD5 body checks (if all newsgroups match the regexp in md5exclude).

$gr{'binary'} - true if the article is posted only to groups where binaries are allowed (if all newsgroups match bin_allowed).

$gr{'html'} - true if the article is posted only to groups where html is allowed (if all newsgroups match html_allowed).

$gr{'poison'} - number of 'poison' newsgroups this article is posted to (matching poison_groups). If this is present, you'll only see
this entry in local_filter_before_emp and local_filter_after_emp because the article will be rejected after that.

$gr{'abuse'} - number of 'net abuse' newsgroups this article is posted to (matching net_abuse_groups).

$gr{'reports'} - number of 'spam reports' newsgroups this article is posted to (matching spam_report_groups).

$gr{'low_xpost'} - number of 'low crosspost limit' groups this article is posted to (matching low_xpost_groups).

$gr{'mod'} - number of moderated groups this article is posted to (requires that Cleanfeed have an active file).

$gr{'allmod'} - true if this article is posted only to moderated groups.

$gr{'faq'} - true if this article is crossposted to news.answers.
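
Because entries in %gr may or may not be present, code that reads them should supply defaults. The following sketch applies a made-up local crosspost policy (the limit of 8 and the halving for low-crosspost groups are invented). The "our" declaration stands in for variables Cleanfeed provides.

```perl
use strict;
use warnings;

our (%gr, @groups);   # stand-ins; provided by Cleanfeed

my $max_groups = 8;   # hypothetical local limit

sub local_filter_last {
    # Leave moderated-only postings and FAQs alone.
    return '' if $gr{'allmod'} || $gr{'faq'};

    # Counters may be absent, so default them to 0.
    my $low_xpost = $gr{'low_xpost'} || 0;
    my $limit     = $low_xpost ? int($max_groups / 2) : $max_groups;

    return "Local crosspost limit ($limit) exceeded" if @groups > $limit;
    return '';
}
```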

%config
A hash containing all configuration options.

DEBUGGING

When you make filtering changes, you should always check the results for false positives. I've provided two subroutines to help you do
this: writeheaders() and writefull().

First, make sure debug_batch_directory is set in your configuration. Set this to a directory that is writable by the news user.

Call either of these subroutines with one argument, the basename of the batch file you want to write the current article to. writeheaders
will dump the article's headers out to the file (with INN this will only give you the known headers). writefull will dump the full
article, headers (again, known headers with INN) and body. The file will be rotated if it becomes larger than debug_batch_size, set in
your configuration. The rotation is simple: a number is appended to the end of the file name, and incremented until the resulting
filename does not exist. You'll have to delete the old files yourself.
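
The rotation scheme can be sketched as below. This is not Cleanfeed's own code: the ".N" suffix form, the starting number 1, and both function names are assumptions made for the example.

```perl
use strict;
use warnings;

# Return the first "$path.N" (N = 1, 2, ...) that does not exist yet.
sub next_rotated_name {
    my ($path) = @_;
    my $n = 1;
    $n++ while -e "$path.$n";
    return "$path.$n";
}

# Rename $path out of the way once it has grown past $max_bytes.
# Returns the new name, or nothing if no rotation was needed.
sub rotate_if_large {
    my ($path, $max_bytes) = @_;
    return unless -e $path && (-s $path) > $max_bytes;
    my $target = next_rotated_name($path);
    rename $path, $target or die "rename $path: $!";
    return $target;
}
```

Nothing here ever removes a rotated file, which matches the note above that old files must be cleaned up by hand.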

When testing a new filter, simply call writeheaders ("batchfile") or writefull ("batchfile") when you're going to reject an article. Then
you can look at the file to make sure you're doing what you think you're doing.
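
In a local filter, that might look like the sketch below. The Organization pattern, the batch file name "local-org", and the rejection text are invented. The writeheaders defined here is only a stub so the snippet runs by itself; in a real cleanfeed.local you would leave it out and call the one Cleanfeed provides.

```perl
use strict;
use warnings;

our %hdr;       # stand-in; provided by Cleanfeed
our @written;   # demo only: records what the stub was asked to write

# Stub standing in for Cleanfeed's writeheaders(); omit in real use.
sub writeheaders {
    my ($basename) = @_;
    push @written, $basename;
    return;
}

sub local_filter_last {
    my $org = defined $hdr{'Organization'} ? $hdr{'Organization'} : '';

    if ($org =~ /\bexample spamhaus\b/i) {
        writeheaders('local-org');   # keep a copy of what we are about to reject
        return 'Local organization filter';
    }
    return '';
}
```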

SIGNALS
When running under Cyclone, Typhoon, Breeze, or NNTPRelay (standalone mode), Cleanfeed will catch SIGHUP, and reload its configuration from
cleanfeed.conf. It will also reload and reevaluate cleanfeed.local if you're using it. Note that, unlike INN, there is no way to reload
the filter code itself without restarting the server.

Cleanfeed in standalone mode will also catch SIGUSR1 and write its crude current-status file (see statfile in the config section) on the
next cycle through the filter.
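
The usual Perl pattern for this kind of handling is for the signal handlers to set flags that the main loop acts on at its next pass. The sketch below shows that pattern only; it is not taken from the Cleanfeed source, and the variable and function names are invented.

```perl
use strict;
use warnings;

my ($reload_requested, $status_requested) = (0, 0);

# Handlers only set flags; the real work happens at a safe point later.
$SIG{HUP}  = sub { $reload_requested = 1 };
$SIG{USR1} = sub { $status_requested = 1 };

# Called once per cycle of the main loop. Returns the list of actions
# that were pending, clearing the flags as it goes.
sub handle_pending_signals {
    my @done;
    if ($reload_requested) {
        $reload_requested = 0;
        push @done, 'reload';    # re-read config and local code here
    }
    if ($status_requested) {
        $status_requested = 0;
        push @done, 'status';    # write the status file here
    }
    return @done;
}
```

From a shell, the signals would be sent with kill -HUP or kill -USR1 followed by the process ID of the running filter.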

(I honestly don't know if SIGUSR1 and SIGHUP are things which exist on NT for NNTPRelay.)

CREDITS
Written by Jeremy Nixon <jeremy@exit109.com>.

Originally based on Jeff Garzik's EMP filter.

I can't possibly mention everyone who has submitted ideas or fixes for the filter, but I'd like to acknowledge the substantial
contributions of several people: Danhiel Baker, Frank Copeland, Brian Moore, John Payne, Russ Allbery, David Riley, and SeokChan LEE.
Thanks, guys.

dynamic-load.patch is from Piers Cawley. The body-filtering portion of the INN filter.patch is from Jeff Garzik. messageid.patch is from
Ed Mooring. mode.patch is from John Payne.

LICENSE
This software may be distributed freely, provided it is intact (including all the files from the original archive). You may modify it, and
you may distribute your modified version, provided the original work is credited to the appropriate authors, and your work is credited to
you (don't make changes and pass them off as my work), and that you aren't charging for it.

AVAILABILITY
This filter is available at:

http://www.exit109.com/~jeremy/news/antispam.html
ftp://ftp.exit109.com/users/jeremy/

3rd Berkeley Distribution Version 0.95.7b cleanfeed(8)
