Four quick PHP filters to reduce contact form spam.

I maintain the code for a fairly popular, if localized, blog about the Hintonburg area called Miss Vicky's Offhand Remarks. It has been around for over five years now. It has a contact form so that neighbourhood residents, and anyone else, can send in tips and requests. I've resisted attaching a captcha to it, as I find them annoying, and as a result we get occasional waves of bot spam. By studying that spam, I have been able to block most of the seriously egregious scripts out there. I've pulled out (and simplified) the actual code I use, so these snippets aren't going to work as is, but they should be enough to illustrate the methodology.

First, I filter the referrer to ensure any form posted on the site appears to come from my server. Yes, this is easily faked, but it does cut out a surprising number of poorly written (lazy) scripts.

...
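if ($_SERVER['REQUEST_METHOD'] == 'POST') {
    // Build a pattern for this server's own origin, e.g. /^https:\/\/example\.com/
    $srv_rx = '/^http';
    $srv_rx .= (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] != 'off') ? 's' : '';
    $srv_rx .= ':\/\/' . preg_quote($_SERVER['SERVER_NAME'], '/') . '/';
    // Many bots send no referrer at all, so treat a missing header as a mismatch.
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    if (!preg_match($srv_rx, $referer)) {
        //should track this, since it's probably a hacker/script
        //instead, i simply bail out without sending anything.
        $action = 'return';
    }
}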
...
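
To make the referrer check easier to try in isolation, here is a minimal sketch of it as a standalone function. The name `referer_is_local` and its parameters are my own invention for illustration, not part of the site's code. It also adds one thing the snippet above does not do: a lookahead so that a host like `example.com.evil.net` does not pass as `example.com`.

```php
<?php
// Hypothetical helper: true when $referer starts with this server's own origin.
function referer_is_local(string $referer, string $serverName, bool $https): bool
{
    $origin = ($https ? 'https' : 'http') . '://' . $serverName;
    // preg_quote() escapes every regex metacharacter, not just the dots.
    // The lookahead requires the host name to end right after the origin.
    $pattern = '/^' . preg_quote($origin, '/') . '(?=[\/:?#]|$)/i';
    return preg_match($pattern, $referer) === 1;
}
```

In a real handler the arguments would presumably come from `$_SERVER['HTTP_REFERER']`, `$_SERVER['SERVER_NAME']` and `$_SERVER['HTTPS']`, with the referrer defaulting to an empty string when the header is absent.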

Then I do three checks specific to the contact form. The first thing I look for is an inordinate number of links. If more than half of the text is made up of URLs, I throw it back. Again, easily worked around, but this seems to catch most scripts. I do let a legitimate user know that their message has failed to send, in case they want to reformat the message. Most scripts don't really care what you return, so I'm not giving away a trade secret here.

...
// Matches a URL (scheme optional) sitting between two non-word characters.
$url_rx = '/(\W|\s)(?:(?:ht|f)tp(?:s?)\:\/\/)?(?:\w+:\w+@)?'
. '(?:(?:[-\w]+\.)+'
. '(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))'
. '(?::[\d]{1,5})?(?:\/(?:[-\w~!$+|.,\:\*\/&?#=]|%[a-f\d]{2})*)?(\W|\s)/i';
// Strip the URLs (keeping the characters around them), then compare lengths.
if ($email_text != ''
&& strlen(preg_replace($url_rx, '$1$2', $email_text))/strlen($email_text) < (1/2)) {
//text is more than 1/2 urls. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (url to "real text" ratio too high).<br />Sorry... sort of';
break;
}
...
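
The ratio idea can be sketched on its own with a much simpler pattern. Everything here is an assumption for illustration: the function name `url_ratio` is made up, and the regex is a deliberately crude stand-in (explicit scheme or `www.` prefix only), not the expression used above.

```php
<?php
// Hypothetical helper: fraction of $text (by byte length) taken up by URL-like strings.
function url_ratio(string $text): float
{
    $total = strlen($text);
    if ($total === 0) {
        return 0.0; // nothing to measure, and this avoids dividing by zero
    }
    // Simplified stand-in pattern: a scheme or "www." followed by non-space characters.
    $stripped = preg_replace('/\b(?:(?:https?|ftp):\/\/|www\.)\S+/i', '', $text);
    return ($total - strlen($stripped)) / $total;
}
```

A message would then be rejected when `url_ratio($email_text) > 0.5`, which is the same threshold as the check above, just expressed from the other side.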

Since this is a contact form I'm not expecting any formatting. It should just be text. If the email is more than one third HTML, I throw it back. Again, I let the user know, since I have had users send me bits of info and code about the site itself, when they've found bugs.

...
if ($email_text != ''
&& strlen(strip_tags($email_text))/strlen($email_text) < (2/3)) {
//text is more than 1/3 html. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (HTML to "real text" ratio too high).<br />Sorry... sort of';
break;
}
...
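
As a rough sketch of the same measurement in reusable form, the function below reports how much of a message `strip_tags()` removes. The name `html_ratio` is hypothetical, and like the original check it counts bytes rather than characters.

```php
<?php
// Hypothetical helper: fraction of $text (by byte length) removed by strip_tags().
function html_ratio(string $text): float
{
    $total = strlen($text);
    if ($total === 0) {
        return 0.0; // empty message: no markup, and no division by zero
    }
    return ($total - strlen(strip_tags($text))) / $total;
}
```

Rejecting when `html_ratio($email_text) > 1/3` corresponds to the two-thirds test above.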

Next, I look for both an HTML anchor and a BBCode url or link tag. If they both exist, it's spam. Again, I send it back, because you never know. Some people are confused.

...
if (preg_match('/<a(?:[^>])*href/',$email_text)
&& preg_match('/\[(?:url|link)=/',$email_text)) {
//text contains both anchor tag and bbcode link. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (wacky linking).<br />Sorry... sort of';
break;
}
...
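
Here is a small sketch of that last test as a function, mostly so it can be exercised against sample messages. The name `mixes_link_styles` is made up, and the patterns vary slightly from the ones above as my own assumption: they are case-insensitive and require whitespace after `<a`, so `<A HREF=...>` is caught and other tags starting with "a" are not.

```php
<?php
// Hypothetical helper: true when $text mixes an HTML anchor with a BBCode link tag.
function mixes_link_styles(string $text): bool
{
    $hasAnchor = preg_match('/<a\s[^>]*href/i', $text) === 1;
    $hasBbcode = preg_match('/\[(?:url|link)=/i', $text) === 1;
    return $hasAnchor && $hasBbcode;
}
```

A message with only one style of link passes this particular test, which leaves it to the ratio checks to decide.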

And that cuts down on the majority of our contact form spam. The rest will have to wait until I can figure out how to write a regular expression that detects 'crazy-talk'.

2009.09.02 03:09 PM