/PHP/ Form validation

04/07/2006

Discover the holy grail of PHP – A bullet-proof email validation script

Every PHP developer needs a bullet-proof email validation script, which is easier said than done: it’s tricky to build one and it’s dangerous to use a friend’s (especially if they’ve borrowed it). It might be difficult to see why a foolproof validation script is so essential, but the reasons will soon become clear...

Almost every site you’ll ever build will need a user login system with an email address and password. When people create these accounts, they’ll need to enter an email address. Sometimes users make typos or enter invalid addresses, which stops them from retrieving their login details or receiving updates. By checking for invalid addresses, you’ll stop your database from filling up with invalid data, and it might save you from legal trouble, too: the Data Protection Act requires that information is ‘accurate and, where necessary, kept up to date’. Visit the Information Commissioner’s Office for more details at tinyurl.com/e4goo.

The next step is digging into a scary document called RFC 822. This defines what is and what is not a valid email address. You can see the whole document at www.faqs.org/rfcs/rfc822.html. In this tutorial, we’re just going to focus on section 6.1 (shown below), which explains the formatting of email addresses (specifically, the addr-spec rule):

6.1. SYNTAX
addr-spec = local-part "@" domain ; global address
local-part = word *("." word) ; uninterpreted
                              ; case-preserved
domain = sub-domain *("." sub-domain)
sub-domain = domain-ref / domain-literal
domain-ref = atom ; symbolic reference

According to the RFC, an addr-spec is made up of local-part “@” domain, and a local-part is in turn made up of word *(“.” word). We can use these definitions to start building our regular expression. To make this process easy to understand, we’ll create each rule in a regular PHP string, like this:

// addr-spec = local-part "@" domain
$addr_spec = "$local_part\\x40$domain";

We need to replace the dash with an underscore, as PHP won’t permit dashes in variable names. Also, we’re using hex character references (\\x40 for “@”, \\x2e for “.”), so we can easily distinguish between literals and metacharacters. We now need to define local_part. It will look like this:

// local-part = word *("." word)
$local_part = "$word(\\x2e$word)*";
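As a quick check of the hex references used in the two rules above, the sketch below shows what PHP actually hands to the regex engine. The mini-pattern and the sample strings are made up for illustration; only the \\x40 and \\x2e escapes come from the rules themselves.

```php
<?php
// Inside a double-quoted PHP string, "\\x40" is the four characters \x40.
// PCRE then reads \x40 as a literal "@" and \x2e as a literal ".".
$at  = "\\x40";
$dot = "\\x2e";

// Hypothetical mini-pattern: letters, "@", letters, ".", letters.
$pattern = "!^[a-z]+{$at}[a-z]+{$dot}[a-z]+$!";

var_dump(strlen($at));                               // int(4)
var_dump(preg_match($pattern, 'ryan@example.com'));  // int(1)
var_dump(preg_match($pattern, 'ryan@exampleXcom'));  // int(0)
```

The last call fails because \x2e only matches a real full stop. A bare "." in the pattern would be a metacharacter and would happily match the "X".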

Continue this, so all rules are defined. The last one will be:

# domain-ref = atom
$domain_ref = $atom;

Now that we’ve defined all the rules, we need to define atom, quoted_string and domain_literal according to the lexical tokens.

Copy the code:
PHP EMAIL VALIDATION IS SIMPLE ONCE YOU KNOW WHAT NEEDS TO BE DONE

Once we’ve gone through the entire process of defining all the rules and values that they depend on, we’ll end up with the following script. Note that it is in reverse order, so the sub-rules are defined when the parents need to use them. Thanks to Cal Henderson (www.iamcal.com) for his direction on the code.

$qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
$dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
$atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c'.
    '\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
$quoted_pair = '\\x5c[\\x00-\\x7f]'; // a backslash followed by any CHAR
$domain_literal = "\\x5b($dtext|$quoted_pair)*\\x5d";
$quoted_string = "\\x22($qtext|$quoted_pair)*\\x22";
$domain_ref = $atom;
$sub_domain = "($domain_ref|$domain_literal)";
$word = "($atom|$quoted_string)";
$domain = "$sub_domain(\\x2e$sub_domain)*";
$local_part = "$word(\\x2e$word)*";
$addr_spec = "$local_part\\x40$domain";

We can now plug this into the PHP regular expression and validate email addresses!

1. Lines 4-20: First, we need to plug our regular expression into a handy function. We’re going to call it is_valid_email_address(). This receives an email address as a string and returns one if the email is valid, and a zero if it’s not.

function is_valid_email_address($email)
{
    $qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
    $dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
    $atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c'.
        '\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
    $quoted_pair = '\\x5c[\\x00-\\x7f]'; // a backslash followed by any CHAR
    $domain_literal = "\\x5b($dtext|$quoted_pair)*\\x5d";
    $quoted_string = "\\x22($qtext|$quoted_pair)*\\x22";
    $domain_ref = $atom;
    $sub_domain = "($domain_ref|$domain_literal)";
    $word = "($atom|$quoted_string)";
    $domain = "$sub_domain(\\x2e$sub_domain)*";
    $local_part = "$word(\\x2e$word)*";
    $addr_spec = "$local_part\\x40$domain";
    return preg_match("!^$addr_spec$!", $email) ? 1 : 0;
}

2. Lines 6-9: Let’s break that function down into bite-sized chunks. Using the Lexical Tokens section of the RFC, we can tell how to define qtext, dtext and atom. For example, qtext is defined as ‘any CHAR excepting <">, "\" & CR, and including linear-white-space’. We can then convert that statement into hex character references. CHAR is defined as any byte between 0x00 and 0x7f, so the qtext character class excludes 0x0d (CR), 0x22 (the double quote), 0x5c (the backslash) and everything from 0x80 to 0xff.

$qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
$dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
$atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c'.
    '\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
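To make the translation from the RFC wording to a character class concrete, here is a small sketch that runs a handful of single bytes through the qtext class. The helper name is_qtext() and the sample bytes are assumptions chosen for illustration; they are not part of the original script.

```php
<?php
// qtext from the listing: any CHAR except CR, the double quote and the backslash.
$qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';

// Hypothetical helper: true if the single byte $ch is allowed as qtext.
function is_qtext(string $ch, string $qtext): bool
{
    // \z anchors at the very end of the string.
    return preg_match("!^$qtext\\z!", $ch) === 1;
}

// A letter, a space, a double quote, a backslash, CR and a byte above 0x7f.
foreach (['a', ' ', '"', '\\', "\r", "\xe9"] as $ch) {
    printf("0x%02x %s\n", ord($ch), is_qtext($ch, $qtext) ? 'allowed' : 'excluded');
}
```

Only the first two bytes (0x61 and 0x20) come back as allowed. The other four are exactly the exclusions written into the class.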

3. Lines 10-12: We can then continue to define the rules by their hex character references. You can see on line 11 that we use the $quoted_pair variable, which is why it’s defined on the previous line (10).

$quoted_pair = '\\x5c[\\x00-\\x7f]'; // a backslash followed by any CHAR
$domain_literal = "\\x5b($dtext|$quoted_pair)*\\x5d";
$quoted_string = "\\x22($qtext|$quoted_pair)*\\x22";

4. Line 19: On line 19 we use a regular expression with our final string, $addr_spec. Note that we’re using the ^ and $ metacharacters to make sure we’re matching the entire string.

return preg_match("!^$addr_spec$!", $email) ? 1 : 0;
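The effect of the ^ and $ anchors is easiest to see with a much shorter pattern. The sketch below uses a hypothetical stand-in for $addr_spec (just letters, "@", letters) and made-up input strings.

```php
<?php
// Hypothetical stand-in for $addr_spec, kept short for readability.
$rule = '[a-z]+\\x40[a-z]+';

$input = 'junk ryan@example junk';
var_dump(preg_match("!$rule!", $input));    // int(1): a substring is enough
var_dump(preg_match("!^$rule$!", $input));  // int(0): the whole string must match

// In PCRE, $ also matches just before a final newline...
var_dump(preg_match("!^$rule$!", "ryan@example\n"));   // int(1)
// ...unless the D (dollar-end-only) modifier is added.
var_dump(preg_match("!^$rule$!D", "ryan@example\n"));  // int(0)
```

If a trailing newline matters for your form handling, trimming the input first or adding the D modifier is one way to close that gap.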

5. Lines 22-26: Next we’re going to create a simple wrapper function for is_valid_email_address() called test(). This accepts an email address as a string, displays the email address, and calls is_valid_email_address(). Essentially, this outputs the results in a table to make them easy to read.

function test($email)
{
    echo "\t<tr>\n\t\t<td>".HtmlEntities($email)."</td>\n";
    echo "\t\t<td>".(is_valid_email_address($email) ? 'Yes' : 'No')."</td>\n\t</tr>\n";
}

6. Lines 36-50: We’re going to be outputting the results into an easy-to-read table. This is just to make it easy to see some examples of valid and invalid email addresses. We’ve used quite a few different email addresses to give you a good idea of what is valid and what isn’t.

<table border="1">
<tr>
<th>Email</th>
<th>Valid?</th>
</tr>
<?php test('ry@carsonsystems.com'); ?>
<?php test('>ryan+carson@carsonsystems.com'); ?>
<?php test('ryan carson@carsonsystems.com'); ?>
<?php test('"ryan carson"@carsonsytems.com'); ?>
<?php test('ryan@carsonsystems'); ?>
<?php test('ryan@carsonsystemscom'); ?>
<?php test('ryan@carsonsystems.com'); ?>
<?php test('ryan@[carsonsystems].com'); ?>
<?php test('abcdefghijklmnopqrstuvwxyz@abcdefghijklmnopqrstuvwxyz'); ?>
</table>

 

 

Comments

Fransjo Leihitu / 03/08/2006 / 13:12 / http://www.leihitu.nl

hmm I've tried this tutorial,

But when I look at "abcdefghijklmnopqrstuvwxyz@abcdefghijklmnopqrstuvwxyz" ... the script returned "Yes" ... how is this possible?

my test version : http://www.leihitu.nl/xperiments/email_test/

the code: http://www.leihitu.nl/xperiments/email_test/index.txt

AmberV / 15/08/2006 / 19:42

Fransjo, I think I fixed it. I do not completely understand this function, so test it well, but I believe the error is in the description of $domain. The quantifier on the end should be '+' not '*'.
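The change AmberV describes can be tried in isolation. RFC 822 defines domain as sub-domain *("." sub-domain), so a domain with no dot at all is valid, which is why the script answers "Yes". The sketch below uses deliberately simplified stand-ins for $local_part and $sub_domain (assumptions, not the article's RFC 822 rules) and changes only the quantifier on $domain.

```php
<?php
// Simplified, hypothetical stand-ins for the article's rules.
$local_part = '[a-z0-9]+';
$sub_domain = '[a-z0-9-]+';

$domain_rfc    = "$sub_domain(\\x2e$sub_domain)*";  // quantifier as in the article
$domain_dotted = "$sub_domain(\\x2e$sub_domain)+";  // quantifier as AmberV suggests

// Hypothetical helper: does $email match local-part "@" domain as a whole?
function accepts(string $local_part, string $domain, string $email): bool
{
    return preg_match("!^$local_part\\x40$domain$!", $email) === 1;
}

var_dump(accepts($local_part, $domain_rfc, 'ryan@carsonsystems'));        // bool(true)
var_dump(accepts($local_part, $domain_dotted, 'ryan@carsonsystems'));     // bool(false)
var_dump(accepts($local_part, $domain_dotted, 'ryan@carsonsystems.com')); // bool(true)
```

With '+' the check is stricter than the RFC. In the full script it would also reject an address whose domain is a single domain-literal, so test it against the addresses you expect to see.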

Mathew Browne / 16/05/2007 / 13:07 / http://www.mbwebdesign.co.uk

This is very useful. Form validation is a necessary evil - I like designing things, and this coding lark is all very time consuming and not at all fun. Articles like these help take the pain away!

Scrivna / 20/04/2008 / 00:19 / http://scrivna.com/blog

Just to inform anyone that's looking at this now, PHP5 has this feature built in using the filter_var function...

filter_var('bob@example.com', FILTER_VALIDATE_EMAIL); // returns bob@example.com
filter_var('bob@example', FILTER_VALIDATE_EMAIL); // returns false

Easy peasy.
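For anyone wanting a drop-in yes/no check along these lines, a minimal sketch follows. The wrapper name is_valid_email_builtin() is hypothetical; filter_var() and FILTER_VALIDATE_EMAIL are the built-ins mentioned in the comment above.

```php
<?php
// filter_var() returns the filtered string on success and false on failure,
// so compare against false explicitly.
function is_valid_email_builtin(string $email): bool  // hypothetical wrapper name
{
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email_builtin('bob@example.com'));  // bool(true)
var_dump(is_valid_email_builtin('not an email'));     // bool(false)
```

The built-in filter does not follow RFC 822 rule for rule, so the two approaches can disagree on unusual addresses.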

Martin Bean / 25/06/2008 / 22:39 / http://www.mcbstudios.co.uk

Although I myself do use FILTER_VALIDATE_EMAIL when feeling lazy, it does not do a good job of checking the TLD of the email address. So entering something like martin@mcbstudios.xxx will actually validate according to FILTER_VALIDATE_EMAIL.

Unfortunately it appears there will never be a dead-certain email validation method, especially when new TLDs are introduced (such as .mobi, .name, etc).
