Keyword-based blacklisting

Created on Thursday, January 1, 1970.
Filed under Software, Web Backend.
 

Mainstream anti-spam technology is not future-proof to any degree: spammers change domains and servers too often for blacklists of URLs and IPs to keep up. I believe keyword-based blacklisting is the best approach to spam protection, and I’ll explain why (and how). This article is meant for developers who want to improve the anti-spam measures of their software, and for users who want to know more about keyword-based blacklisting.

What is a spamment?

A spam comment (or spamment) is a comment whose purpose is to list as many links as possible to promote a webpage or product.

The most important element of a spamment is the link. The purpose of a spamment is not to get users to click on the link, but to make search engines raise the ranking of a webpage through the collective weight of hundreds of thousands of spamments, all pointing to the same URL.

Why is current anti-spam technology lacking?

Software often relies on a blacklist of URLs or IPs to protect its commenting system. I call such blacklists literal because each entry must be the identifying string itself, either whole or modified with wildcards. Literal blacklists have two major shortcomings:

  1. It is difficult to add new spam URLs as they arrive and to remove redundant URLs as they are changed.
  2. You need to know what to block before you block it; the process is reactive rather than proactive.

Keyword-based blacklisting is the future

Keyword-based blacklisting is the most viable option for future-proof spam protection. It checks each entry of a newline-delimited blacklist to see whether that banned string appears anywhere within the URLs submitted with the comment. Only links are searched, so a commenter won’t be locked out simply for using a banned word in the text of the comment.
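As a rough sketch of the links-only rule (this is not the Writer’s Block code, and the function names extract_link_urls() and links_contain_keyword() are made up for illustration), the check might pull the href values out of the comment first and then look for keywords in those values alone:

```php
<?php
// Hypothetical helper: collect only the URLs found in href attributes.
function extract_link_urls(string $comment): array
{
    preg_match_all('/href\s*=\s*["\']?([^"\'\s>]+)/i', $comment, $matches);
    return $matches[1];
}

// Hypothetical helper: true if any keyword appears in any linked URL.
function links_contain_keyword(string $comment, array $keywords): bool
{
    foreach (extract_link_urls($comment) as $url) {
        foreach ($keywords as $keyword) {
            // stripos() is a case-insensitive substring search.
            if (stripos($url, $keyword) !== false) {
                return true;
            }
        }
    }
    return false;
}

$keywords = ['enlarge'];

// The banned string appears only in the prose, so the comment passes.
$honest = 'Can I enlarge the thumbnails? See <a href="https://example.com/gallery">here</a>.';
var_dump(links_contain_keyword($honest, $keywords)); // bool(false)

// The banned string appears inside a link, so the comment is caught.
$spam = 'Nice post! <a href="https://example.com/enlarge-now">click</a>';
var_dump(links_contain_keyword($spam, $keywords)); // bool(true)
```

The example.com addresses are placeholders. The point is only that the word ‘enlarge’ in ordinary text never reaches the keyword check.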

Literal blacklists require the domain name (maybe even the full URL) of each spam link in order to effectively block the spamments:

https://www.penismedical.com
https://www.altpenis.com/
https://www.enlargepenisguide.com/
https://www.allabout-penis-enlargement.com
https://www.Penis-Devices.com/pumps.html
https://www.penis-enlargement-planet.com
and so on, ad nauseam.

A keyword-based blacklist can block all of those URLs with the single entry penis, and adding enlarge blocks many more. You can replace dozens of literal entries with a few keywords while actually improving security and lessening the blacklist’s footprint on the server. Because the matching is case-insensitive, all variants of capitalisation are caught.

penis
enlarge

What’s more, keywords let you block spam without even knowing the URL. If spammers try to send you any URL containing ‘penis’ or ‘enlarge’, it is automatically blocked.
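To make the comparison concrete, here is a minimal sketch that runs the two keywords against the six literal entries listed earlier. The helper url_is_blocked() and the final test URL are invented for this example:

```php
<?php
// The six URLs from the literal blacklist above.
$spam_urls = [
    'https://www.penismedical.com',
    'https://www.altpenis.com/',
    'https://www.enlargepenisguide.com/',
    'https://www.allabout-penis-enlargement.com',
    'https://www.Penis-Devices.com/pumps.html',
    'https://www.penis-enlargement-planet.com',
];

// Hypothetical helper: case-insensitive substring match against each keyword.
function url_is_blocked(string $url, array $keywords): bool
{
    foreach ($keywords as $keyword) {
        if (stripos($url, $keyword) !== false) {
            return true;
        }
    }
    return false;
}

$keywords = ['penis', 'enlarge'];

// Keep only the URLs that the two keywords catch.
$blocked = array_filter($spam_urls, function ($url) use ($keywords) {
    return url_is_blocked($url, $keywords);
});
echo count($blocked), ' of ', count($spam_urls), " blocked\n"; // 6 of 6 blocked

// A made-up URL that was never added to any list is still caught.
var_dump(url_is_blocked('https://www.ENLARGE-it-today.example/', $keywords)); // bool(true)
```

Two entries do the work of six, and the last call shows the proactive part: the URL is blocked even though nobody has seen it before.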

Are there disadvantages to keyword-based blacklisting?

A keyword can easily be found in a non-spam URL, like ‘rape’ in www.western-grapes.com. In such cases it is better to blacklist a longer fragment of the spam URL than to have an innocent user locked out, though remember that an unreasonable lock-out is the exception rather than the rule. The keywords you end up putting in the blacklist are often those that occur only in spam-related or otherwise unpleasant URLs.

To reduce the chance of blocking innocent users, each blacklist entry matches only as broadly as it is written. A one-word entry like porn will match any instance of ‘porn’ in the URLs, while a full URL like https://www.Penis-Devices.com/pumps.html will match only that URL.
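The trade-off can be seen in a small sketch. The helper matching_entries() is hypothetical, and the rings.html address is invented to stand in for a page that the full-URL entry should not catch:

```php
<?php
// Hypothetical helper: list the blacklist entries that occur in a URL.
function matching_entries(string $url, array $entries): array
{
    return array_values(array_filter($entries, function ($entry) use ($url) {
        return stripos($url, $entry) !== false;
    }));
}

$entries = ['rape', 'porn', 'https://www.Penis-Devices.com/pumps.html'];

// A short keyword can hit an innocent URL: 'rape' occurs inside 'grapes'.
$false_positive = matching_entries('http://www.western-grapes.com', $entries);
print_r($false_positive);

// A full-URL entry matches that exact page (case-insensitively)...
$exact_page = matching_entries('https://www.penis-devices.com/pumps.html', $entries);
print_r($exact_page);

// ...but not another page on the same site.
$other_page = matching_entries('https://www.penis-devices.com/rings.html', $entries);
print_r($other_page);
```

The first call returns the ‘rape’ entry, the second returns only the full-URL entry, and the third returns nothing. The longer the entry, the narrower its reach.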

How do I implement it?

Here is the PHP code I use for keyword-based blacklists, pulled from my CMS Writer’s Block. The code assumes that you have a newline-delimited list of keywords at include/blacklist.txt.

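// Remove ##comments## from blacklist.txt.
    $blacklist = preg_replace('/##.+?##/', '',
         file_get_contents('include/blacklist.txt'));

// Split on whitespace, escape each entry so that characters like '.', '?'
// and '/' are matched literally, then join the entries with vertical bars.
    $entries = preg_split('/\s+/', trim($blacklist), -1, PREG_SPLIT_NO_EMPTY);
    $BlockedUrls = implode('|', array_map(function ($entry) {
        return preg_quote($entry, '/');
    }, $entries));

// Get all hrefs from comment body and put them into array
    preg_match_all('/href\s*=\s*(.{0,1})[^ >]+/i', $_POST['comment'],
        $urls_in_text, PREG_PATTERN_ORDER);

// Stringify array (sans 'href=') and append URL field to the end.
    $given_urls = implode(' % ', $urls_in_text[0]);
    $given_urls = preg_replace('/href\s*=\s*/i', '', $given_urls,
        -1).' % '.$_POST['url'];

// Search string for blocked URL fragments and kill script if found.
// preg_match() with the 'i' modifier replaces eregi(), which was removed
// in PHP 7. An empty blacklist is skipped so it cannot match everything.
    if ($BlockedUrls !== '' && preg_match('/'.$BlockedUrls.'/i', $given_urls)) {
        echo 'Comment blocked.';
        exit();
    }
// If not exited, continue with the script.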

I put the links from the comment and the input of the URL field into a single string because it is faster to run one regex over a single string than to loop multiple calls over an array. Perl-compatible regex (the preg_* functions) is preferred because it is faster than POSIX Extended, and because the POSIX functions such as eregi() were deprecated in PHP 5.3 and removed in PHP 7.0.
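For anyone who would rather test the idea in isolation, the same single-pass check can be sketched as two functions that take their inputs as arguments instead of reading $_POST and the blacklist file directly. This assumes PHP 7.1 or later. The names build_blacklist_pattern() and urls_are_blocked() are hypothetical, and the ##comment## format is inferred from the regex in the code above:

```php
<?php
// Hypothetical helper: turn blacklist text into one case-insensitive pattern.
// Returns null when the blacklist has no entries.
function build_blacklist_pattern(string $blacklist_text): ?string
{
    // Strip ##comments##, then split what is left on whitespace.
    $text = preg_replace('/##.+?##/', '', $blacklist_text);
    $entries = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($entries) === 0) {
        return null;
    }
    // Escape every entry so it is matched literally, then join with "|".
    $quoted = array_map(function ($entry) {
        return preg_quote($entry, '/');
    }, $entries);
    return '/' . implode('|', $quoted) . '/i';
}

// Hypothetical helper: one preg_match() call over all URLs joined together.
function urls_are_blocked(array $urls, ?string $pattern): bool
{
    if ($pattern === null) {
        return false; // An empty blacklist blocks nothing.
    }
    return preg_match($pattern, implode(' % ', $urls)) === 1;
}

$pattern = build_blacklist_pattern("## body parts ##\npenis\nenlarge\n");
echo $pattern, "\n"; // /penis|enlarge/i

var_dump(urls_are_blocked(['https://example.com/', 'https://www.altpenis.com/'], $pattern)); // bool(true)
var_dump(urls_are_blocked(['https://example.com/'], $pattern)); // bool(false)
```

Returning null for an empty blacklist is a deliberate choice in this sketch: an empty regex would match every string, and every comment would be blocked.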

© Desi Quintans, 2002 – 2022.