Blog

Posts tagged with “PHP”

The Ultimate Date Validation RegExp

Now, I'm not claiming to be a god at regular expressions, but I do own this shirt for a reason. That reason is that I like using RegExp to do a lot of heavy lifting for me. Case in point: validating a date format. It's one thing to pull out year, month and date, allowing for multiple delimiters; it's quite another to ensure that the actual values match expected results. Month values are only between 01 and 12. Days are not only between 01 and 31, but also exclude 29, 30, or 31 if the months aren't supposed to include them — including checking for leap years. I can do that. With Regular Expressions.

Here's how (for simplicity's sake, all years must start with 19xx or 20xx):

var yyyymmdd = /(?:((?:19|20)[0-9]{2})[\/\\\-. ]?(?:(0[1-9]|1[0-2])[\/\\\-. ]?(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])[\/\\\-. ]?(29|30)|(0[13578]|1[02])[\/\\\-. ]?(31))|(19(?:[0][48]|[2468][048]|[13579][26])|20(?:[02468][048]|[13579][26]))[\/\\\-. ]?(02)[\/\\\-. ]?(29))/;
//if doing a replace: year is $1$8, month is $2$4$6$9, day is $3$5$7$10
var ddmmyyyy = /(?:(?:(0[1-9]|1[0-9]|2[0-8])[\/\\\-. ]?(0[1-9]|1[0-2])|(29|30)[\/\\\-. ]?(0[13-9]|1[0-2])|(31)[\/\\\-. ]?(0[13578]|1[02]))[\/\\\-. ]?((?:19|20)[0-9]{2})|(29)[\/\\\-. ]?(02)[\/\\\-. ]?(19(?:[0][48]|[2468][048]|[13579][26])|20(?:[02468][048]|[13579][26])))/;
//if doing a replace: year is $7$10, month is $2$4$6$9, day is $1$3$5$8
var mmddyyyy = /(?:(?:(0[1-9]|1[0-2])[\/\\\-. ]?(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])[\/\\\-. ]?(29|30)|(0[13578]|1[02])[\/\\\-. ]?(31))[\/\\\-. ]?((?:19|20)[0-9]{2})|(02)[\/\\\-. ]?(29)[\/\\\-. ]?(19(?:[0][48]|[2468][048]|[13579][26])|20(?:[02468][048]|[13579][26])))/;
//if doing a replace: year is $7$10, month is $1$3$5$8, day is $2$4$6$9
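
The group numbers in those comments can be put to work in a replace call. Here is a minimal sketch, assuming the goal is to normalize input to YYYY-MM-DD. The yyyymmdd pattern is repeated so the block runs on its own, with two details that are assumptions of this sketch: `^` and `$` anchors so the whole string has to be a date, and the 01-28 day group spelled as `0[1-9]|1[0-9]|2[0-8]`. The helper name `normalizeYmd` is hypothetical.

```javascript
// yyyymmdd pattern, anchored (an assumption of this sketch) so that
// a date buried inside a longer string is rejected.
var yyyymmdd = /^(?:((?:19|20)[0-9]{2})[\/\\\-. ]?(?:(0[1-9]|1[0-2])[\/\\\-. ]?(0[1-9]|1[0-9]|2[0-8])|(0[13-9]|1[0-2])[\/\\\-. ]?(29|30)|(0[13578]|1[02])[\/\\\-. ]?(31))|(19(?:[0][48]|[2468][048]|[13579][26])|20(?:[02468][048]|[13579][26]))[\/\\\-. ]?(02)[\/\\\-. ]?(29))$/;

// Hypothetical helper: returns "YYYY-MM-DD", or null if the input is not a valid date.
function normalizeYmd(input) {
  if (!yyyymmdd.test(input)) {
    return null;
  }
  // Groups that took no part in the match are substituted as empty strings,
  // so $1$8 (and so on) yields whichever alternative actually matched.
  return input.replace(yyyymmdd, '$1$8-$2$4$6$9-$3$5$7$10');
}

console.log(normalizeYmd('2012/06/07')); // 2012-06-07
console.log(normalizeYmd('2011-02-29')); // null (2011 is not a leap year)
```

The same approach should carry over to ddmmyyyy and mmddyyyy by swapping in the group numbers from their comments.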

I'll break down the first example.
The entire expression looks for most likely values first, then looks for less likely, yet still valid values.

  • First it looks for a year followed by one of:
    • a month value between 01 and 12 and a day value between 01 and 28;
    • failing that, a month value of 01 or 03-12 and a day of either 29 or 30;
    • failing that, a month of 01, 03, 05, 07, 08, 10, or 12 and a day of 31.
  • Failing that, it does one last sanity check to see if the year is a leap year (only years starting with 19xx or 20xx are checked, so the values are easy to calculate) and the date is 02-29.

If all of that fails, it isn't a valid date.
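
For comparison, the same four checks can be written as plain number comparisons. This is only a sketch to make the order of the checks easier to follow; the function names `isValidDate` and `existsOnCalendar` are hypothetical, and the loop at the end simply compares the logic against JavaScript's built-in Date object.

```javascript
// Hypothetical helper: the four checks from the breakdown, on numbers instead of text.
function isValidDate(year, month, day) {
  if (!Number.isInteger(year) || year < 1900 || year > 2099) return false;
  if (!Number.isInteger(month) || month < 1 || month > 12) return false;
  if (!Number.isInteger(day)) return false;
  // 1. any month, day 01-28
  if (day >= 1 && day <= 28) return true;
  // 2. any month except February, day 29 or 30
  if (month !== 2 && (day === 29 || day === 30)) return true;
  // 3. a 31-day month, day 31
  if ([1, 3, 5, 7, 8, 10, 12].indexOf(month) !== -1 && day === 31) return true;
  // 4. February 29 in a leap year
  var leap = year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0);
  return leap && month === 2 && day === 29;
}

// Reference check: Date rolls impossible dates over (Feb 30 becomes Mar 1 or 2),
// so a date is real only if it survives the round trip unchanged.
function existsOnCalendar(year, month, day) {
  var d = new Date(Date.UTC(year, month - 1, day));
  return d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day;
}

var mismatches = 0;
for (var y = 1900; y <= 2099; y++) {
  for (var m = 1; m <= 12; m++) {
    for (var d = 1; d <= 31; d++) {
      if (isValidDate(y, m, d) !== existsOnCalendar(y, m, d)) mismatches++;
    }
  }
}
console.log(mismatches); // 0 when the two agree for every date tried
```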

2012.06.07 10:23 AM

Quick date validation with a Regular Expression

/(?:19|20)[0-9]{2}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])|(?:(?!02)(?:0[1-9]|1[0-2])-(?:29|30))|(?:(?:0[13578]|1[02])-31))/

Here's the breakdown of what it does:

Given a date format of YYYY-MM-DD (standard MySQL date format and easiest format for sorting) it makes sure that

  1. the year is numeric and starts with 20 or 19, and
  2. the month is numeric and is one of
    1. between 01 - 12 and followed by a numeric day value between 01-28;
    2. between 01 - 12 but not 02 and followed by a day value of 29 or 30; or
    3. one of 01,03,05,07,08,10,12 and followed by a day value of 31

I have left out Feb. 29th so that you are forced to do a secondary leap year check.
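
One way that secondary check might look, as a sketch rather than a finished validator: the quick pattern handles everything except Feb. 29th, and a small leap year test covers the rest. The `^` and `$` anchors and the names `quickDate`, `isLeapYear`, and `isValidIsoDate` are assumptions added for this example.

```javascript
// The quick pattern from above, anchored so the whole string must be a date.
var quickDate = /^(?:19|20)[0-9]{2}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])|(?:(?!02)(?:0[1-9]|1[0-2])-(?:29|30))|(?:(?:0[13578]|1[02])-31))$/;

// Gregorian rule: divisible by 4, except century years not divisible by 400.
function isLeapYear(year) {
  return year % 4 === 0 && (year % 100 !== 0 || year % 400 === 0);
}

// Hypothetical wrapper: the regular expression first, then the Feb. 29th fallback.
function isValidIsoDate(text) {
  if (quickDate.test(text)) {
    return true;
  }
  var feb29 = /^((?:19|20)[0-9]{2})-02-29$/.exec(text);
  return feb29 !== null && isLeapYear(Number(feb29[1]));
}

console.log(isValidIsoDate('2010-01-28')); // true
console.log(isValidIsoDate('2012-02-29')); // true
console.log(isValidIsoDate('2011-02-29')); // false
```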

2010.01.28 11:43 AM

Four quick PHP filters to reduce contact form spam.

I maintain the code for a fairly popular, if localized, blog about the Hintonburg area called Miss Vicky's Offhand Remarks. It's been around for over 5 years now. It has a contact form to allow neighbourhood residents, and anyone else, to send in tips and requests. I've resisted attaching a captcha to it, as I find them annoying. As a result we get occasional waves of bot spam. I have found that by studying the spam, I have been able to cut out most of the seriously egregious scripts out there. Now, I've pulled out (and simplified) the actual code I use, so these snippets aren't going to work as is, but they should be enough to illustrate the methodology.

First, I filter the referrer to ensure any form posted on the site appears to come from my server. Yes, this is easily faked, but it does cut down on a surprising number of poorly written (lazy) scripts.

...
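if ($_SERVER['REQUEST_METHOD'] == 'POST') {
    //build a pattern for this site's own address, http or https to match the request
    $srv_rx = '/^http';
    $srv_rx .= (!empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off') ? 's' : '';
    $srv_rx .= ':\/\/' . preg_quote($_SERVER['SERVER_NAME'], '/') . '/';
    //the referrer header is optional, so treat a missing one as a mismatch
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
    if (!preg_match($srv_rx, $referer)) {
        //should track this, since it's probably a hacker/script
        //instead, i will simply die.
        $action = 'return';
    }
}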
...
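
To show the shape of that check outside of a live request, here is a rough sketch of the same idea in JavaScript (the language of the date patterns on this page) rather than PHP. The function names `escapeRegExp` and `refererMatchesHost` are hypothetical, and the trailing boundary in the pattern is an extra of this sketch, added so a look-alike host such as example.com.evil.net is not accepted.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\\/-]/g, '\\$&');
}

// Hypothetical helper: does the referrer start with this site's own address?
function refererMatchesHost(referer, serverName, isHttps) {
  if (typeof referer !== 'string') {
    return false; // a missing referrer header counts as a mismatch
  }
  var pattern = new RegExp(
    '^http' + (isHttps ? 's' : '') + ':\\/\\/' +
    escapeRegExp(serverName) +
    '(?:[\\/:?#]|$)' // host must end here (assumption of this sketch)
  );
  return pattern.test(referer);
}

console.log(refererMatchesHost('http://example.com/contact', 'example.com', false)); // true
console.log(refererMatchesHost('http://example.com.evil.net/', 'example.com', false)); // false
```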

Then I do three comment form specific checks. The first thing I look for is an inordinate number of links. If more than half of the text is URLs, I throw it back. Again, easily worked around, but this seems to catch most scripts. I do let a legitimate user know that their message has failed to send, in case they want to reformat the message. Again, most scripts don't really care if you've returned anything, so I'm not giving away a trade secret here.

...
if (strlen(preg_replace('/(\W|\s)(?:(?:ht|f)tp(?:s?)\:\/\/)?(?:\w+:\w+@)?'
. '(?:(?:[-\w]+\.)+'
. '(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))'
. '(?::[\d]{1,5})?(?:\/(?:[-\w~!$+|.,\:\*\/&?#=]|%[a-f\d]{2})*)?(\W|\s)/',
'$1$2',$email_text))/strlen($email_text) < (1/2)) {
//text is more than 1/2 urls. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (url to "real text" ratio too high).<br />Sorry... sort of';
break;
}
...
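
The arithmetic behind that ratio is easier to see with a much smaller pattern. The sketch below is in JavaScript rather than PHP, and its URL pattern is a deliberately simplified stand-in (it only catches links that start with http, https, or ftp), not the one used above. The names `urlPattern`, `nonUrlRatio`, and `looksLikeLinkSpam` are hypothetical.

```javascript
// Simplified URL pattern (an assumption of this sketch): scheme, then anything up to whitespace.
var urlPattern = /\b(?:https?|ftp):\/\/[^\s<>"']+/gi;

// Share of the text that is left once the URLs are removed (1 = no URLs at all).
function nonUrlRatio(text) {
  if (text.length === 0) {
    return 1; // nothing to judge, and this avoids dividing by zero
  }
  return text.replace(urlPattern, '').length / text.length;
}

// Less than half of the text surviving means more than half of it was URLs.
function looksLikeLinkSpam(text) {
  return nonUrlRatio(text) < 0.5;
}

console.log(looksLikeLinkSpam('The street light at the corner is out, see http://example.com/p')); // false
console.log(looksLikeLinkSpam('http://a.example/x http://b.example/y buy')); // true
```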

Since this is a contact form I'm not expecting any formatting. It should just be text. If the email is more than one third HTML, I throw it back. Again, I let the user know, since I have had users send me bits of info and code about the site itself, when they've found bugs.

if (strlen(strip_tags($email_text))/strlen($email_text) < (2/3)) {
//text is more than 1/3 html. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (HTML to "real text" ratio too high).<br />Sorry... sort of';
break;
}
...
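
Here is the same two-thirds rule as a small JavaScript sketch, to show where the threshold bites. JavaScript has no built-in equivalent of PHP's strip_tags, so `stripTags` below is a crude stand-in that removes anything shaped like `<...>`; it and `looksLikeHtmlSpam` are hypothetical names.

```javascript
// Crude stand-in for strip_tags (an assumption of this sketch).
function stripTags(text) {
  return text.replace(/<[^>]*>/g, '');
}

// Flag the message when less than 2/3 of it survives tag stripping,
// i.e. when more than 1/3 of it was markup.
function looksLikeHtmlSpam(text) {
  if (text.length === 0) {
    return false; // avoids dividing by zero on an empty message
  }
  return stripTags(text).length / text.length < 2 / 3;
}

console.log(looksLikeHtmlSpam('The <b>search</b> box throws an error on the archive page.')); // false
console.log(looksLikeHtmlSpam('<a href="http://spam.example/">buy</a><br><br>')); // true
```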

Next, look for both an HTML anchor, and a BBCode url or link tag. If they both exist, it's spam. Again, I send it back, because you never know. Some people are confused.

...
if (preg_match('/<a(?:[^>])*href/',$email_text)
&& preg_match('/\[(?:url|link)=/',$email_text)) {
//text contains both anchor tag and bbcode link. probably a bot.
$problem = 'Email not sent because the text looked too spammy'
. ' (wacky linking).<br />Sorry... sort of';
break;
}
...
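
The two patterns in that check are plain enough to try out directly. A short JavaScript sketch using the same two expressions; the names `htmlAnchor`, `bbcodeLink`, and `mixesLinkSyntaxes` are hypothetical.

```javascript
// Same two patterns as the PHP check: an HTML anchor with an href, and a BBCode url/link tag.
var htmlAnchor = /<a(?:[^>])*href/;
var bbcodeLink = /\[(?:url|link)=/;

// Only a message that uses both link syntaxes at once is flagged.
function mixesLinkSyntaxes(text) {
  return htmlAnchor.test(text) && bbcodeLink.test(text);
}

console.log(mixesLinkSyntaxes('<a href="http://x.example">x</a> [url=http://x.example]x[/url]')); // true
console.log(mixesLinkSyntaxes('See <a href="http://x.example">this page</a>')); // false
```

Both patterns are case sensitive as written, so a bot that shouts `<A HREF=` would slip past; adding the `i` flag would be one way to tighten that.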

And that cuts down on the majority of our contact form spam, and the rest will have to wait until I can figure out how to write a regular expression that detects 'crazy-talk'.

2009.09.02 03:09 PM