FIEO: Filtering Input with PHP’s Filter Functions

Out Of Date Warning

Languages change. Perspectives are different. Ideas move on. This article was published on August 28, 2009 which is more than two years ago. It may be out of date. You should verify that technical information in this article is still current before relying upon it for your own purposes.

Brand-new PHP developers have drilled into their heads the concept of Filter Input, Escape Output (FIEO). This concept essentially insists that all user-provided content be filtered on the way in and escaped on the way out, without exception. With the release of PHP 5.2.0, this got a lot easier, because PHP now ships with the Filter library enabled by default.

Before the Filter library, doing something such as validating an email address often required an ugly regular expression along the lines of this:

<?php
$email = 'firstname.lastname@aaa.bbb.com';
$regexp = "/^[^0-9][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[@][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[.][A-Za-z]{2,4}$/";

if (preg_match($regexp, $email)) {
    echo "Email address is valid.";
} else {
    echo "Email address is invalid";
}
?>

The Filter library makes this easy for us by providing a built-in filter that we can use to validate an email address:

<?php

$email = 'firstname.lastname@aaa.bbb.com';

if(filter_var($email, FILTER_VALIDATE_EMAIL) !== false)
{
    echo "Email address is valid.";
}
else
{
    echo "Email address is invalid.";
}

This approach has a number of benefits. First, it makes the code more readable: you don’t have to know regular expressions to see what we’re validating. Second, it reduces the likelihood of errors. Since the same filter is applied every time, and has had the benefit of being reviewed by the core developers, you can feel reasonably confident that PHP has a working validation function.

There are a number of other validation and sanitizing filters you can use, including sanitizing a string, applying addslashes(), and validating an integer or a boolean. Because these filters are implemented in C as part of PHP, they will generally be faster than an equivalent custom function written in PHP, and they provide a great deal of benefit when filtering and validating data. Check them out today!
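
As a rough sketch of what two of those other filters look like in use, the snippet below validates an integer within a range and a boolean. The input values and the 0 to 150 range are made up for illustration; the constants and the min_range/max_range options are part of the Filter library.

```php
<?php

// Validate an integer and constrain it to a range (the bounds are illustrative).
$age = filter_var('42', FILTER_VALIDATE_INT, [
    'options' => ['min_range' => 0, 'max_range' => 150],
]);

// Compare against false with !== so that a legitimate 0 is not rejected.
if ($age !== false) {
    echo "Age is {$age}.\n";
}

// FILTER_VALIDATE_BOOLEAN returns false both for "no" and for unrecognised input,
// so FILTER_NULL_ON_FAILURE is used here to tell the two cases apart.
$optIn = filter_var('yes', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);
$junk  = filter_var('maybe', FILTER_VALIDATE_BOOLEAN, FILTER_NULL_ON_FAILURE);

var_dump($optIn); // bool(true)
var_dump($junk);  // NULL
```

Note that the validation filters hand back the converted value (here an integer, not the string '42'), which is why the strict comparison against false matters.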

Richard Harrison (@pluggable) wrote at 8/28/2009 8:35 am:

Hi Brandon,

It’s worth noting that filter_var returns the “filtered” variable or false. If you use it for validation and don’t check against === false then you might bump into the null/false/0 problem.

Consider trying to validate the integer 0:-

if(!filter_var(0, FILTER_VALIDATE_INT)){
    echo "Not a valid integer"; // huh, 0 is a valid int!
}

It’s probably a good habit to use a === check against false:-

if(false === filter_var(0, FILTER_VALIDATE_INT)){
    echo "Not a valid integer";
}

I got bitten by this recently :P

Andy Walpole (@http://www.suburban-glory.com/blog.html) wrote at 8/28/2009 9:25 am:

Brandon,

As we know strip_tags() doesn’t stop a lot of malicious code so I used HTML Purifier instead.

How good is the PECL filter FILTER_SANITIZE_STRING at stripping out cross-site scripting and other nasty stuff? Could I safely use it instead of HTML Purifier?

Brandon Savage (@brandonsavage) wrote at 8/28/2009 10:23 am:

Andy, I think it’s perfectly fine for stripping out HTML, and probably faster than whatever you’re using, just by virtue of being compiled into PHP.

Ray Paseur (@Ray_Paseur) wrote at 8/28/2009 10:43 am:

Great post, Brandon. I love the idea of filters, not only because they simplify and standardize code, but also because if they are wrong, it WILL get corrected and the benefit will come forth to all of us.

There are literally thousands of email validation REGEX strings and almost all of them are wrong in one way or another, probably including mine. In the example above, the TLD is limited to 4 characters, and this is technically incorrect since “.museum” is a valid TLD. Practically speaking, I know of no museum that is not already a “.org” ;-)

I have been filtering email addresses with this, and now I can spare my eyes the strain of reading that REGEX:

// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE
function check_valid_email($email)
{
    // IS THE PATTERN OF THE EMAIL ADDRESS OK?
    if (!preg_match('/^[A-Z0-9_-][A-Z0-9._-]*@([A-Z0-9][A-Z0-9-]*\.)+[A-Z]{2,6}$/i', $email)) return FALSE;

    // IS THE DOMAIN OF THE EMAIL ADDRESS ROUTABLE OVER THE INTERNET FOR MX OR A RECORDS?
    $emaila = explode('@', $email);
    if ( checkdnsrr($emaila[1],"MX") || checkdnsrr($emaila[1],"A") ) return TRUE;

    // NOT ROUTABLE
    return FALSE;
}

Does FILTER_VALIDATE_EMAIL give any insight into the routability of the domain?
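
For what it is worth, FILTER_VALIDATE_EMAIL only checks the syntax of the address and performs no DNS lookup, so routability still has to be tested separately. A minimal sketch of combining the two checks follows. The function name is_routable_email and the injectable $dnsCheck parameter are assumptions made for illustration, not part of PHP; only filter_var() and checkdnsrr() are built in.

```php
<?php

/**
 * Hypothetical helper: syntax check via the Filter library, then a DNS lookup.
 * $dnsCheck is injectable so the function can be exercised without network access.
 */
function is_routable_email($email, $dnsCheck = 'checkdnsrr')
{
    // Step 1: syntax only. filter_var() does not contact any server.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }

    // Step 2: take everything after the last "@" as the domain.
    $domain = substr($email, strrpos($email, '@') + 1);

    // Step 3: accept the domain if it has an MX record or, failing that, an A record.
    return $dnsCheck($domain, 'MX') || $dnsCheck($domain, 'A');
}

// Usage with a stub resolver (no network): pretend only example.com has records.
$stub = function ($host, $type) {
    return $host === 'example.com';
};

var_dump(is_routable_email('user@example.com', $stub));
var_dump(is_routable_email('user@nowhere.invalid', $stub));
```

Calling is_routable_email($email) without the second argument falls back to the real checkdnsrr(), which does go out to the network.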

Herman Radtke (@hermanradtke) wrote at 8/28/2009 1:54 pm:

According to the guys at HTML Purifier, the filter extension is not sufficient for Andy’s needs: http://htmlpurifier.org/phorum/read.php?2,2903

Les wrote at 8/28/2009 1:54 pm:

Thanks but I’ll stick to using my own classes simply because the work is already done; I certainly ain’t gonna gut out my own script just to shave a few millionths of a second off what is already optimised script.

Besides… I firmly believe these filters arrived way too late; they should have been with PHP since version 4 but…

Brandon Savage (@brandonsavage) wrote at 8/28/2009 2:00 pm:

Herman, I’m not sure I buy the people at HTML Purifier as unbiased sources of information. I’d personally be wary about including 842 items in my application. HTML Purifier is almost 4.5 MB alone.

Les, no need to gut your existing script. This is just one option out of many for securing your application. I agree, the filters should have been included sooner.

Keith Casey (@CaseySoftware) wrote at 8/28/2009 4:00 pm:

@Les

Agreed that they should have been added sooner, but personally, I look forward to stripping out some of my (potentially incorrect) code in favor of this. The speed is there but then it’s one more thing that I don’t have to worry about and I can spend my time on other things.

Lars Johansson wrote at 8/29/2009 10:47 am:

Hi Brandon,
Thanks, I just implemented a function parsing mailaddresses using ‘your’ filtering technique. Real nice :)

Brandon Savage (@brandonsavage) wrote at 8/29/2009 10:51 am:

Lars, you’re welcome. Glad I could help. :-)

Les wrote at 8/29/2009 5:26 pm:

> HTML Purifier is almost 4.5 MB alone

I would stay away from HTML Purifier; not because it’s badly developed but as stated, it’s not exactly compact and from my own experience there are performance concerns.

I did suggest to the developers that they break it down into more modular, manageable components a while back, giving the end user greater control and flexibility over the level of purification they do.

Also, it’s way over the top for most applications; I beg to differ that there are [now] better options available.

Adrian wrote at 8/30/2009 5:01 am:

Isn’t this function more like a validator than a filter?

Ray Paseur (@Ray_Paseur) wrote at 8/30/2009 10:14 am:

@Adrian: Yes, both. See the different types here:
http://us2.php.net/manual/en/filter.filters.php

artur ejsmont wrote at 8/31/2009 4:31 am:

1st of all good point Adrian …. title of this post should say ‘validating email with PHPs filter function’ ;-)

but jokes aside, the only thing that worries me about such innovations (they should be in php for ages) is that it always makes me wonder. How actually good are these functions?

I love to use open source as i save time but sometimes i just dont know what is in there.

With email there are all those different encodings of domain names and special chars in non-english languages. Its not that simple any more. Will they be accepted or not? How well unit tested is it? (and i dont mean coverage which does not mean anything in this case!). Why did they not add any information (doc says: ‘Validates value as e-mail.’)

Finally, what if im not happy with it any more? What if there is bug or just filter is crap? Your approach does not isolate your code. So what i would do is add my own wrapper. If i change my mind i will just replace implementation not ripping entire site again.

In addition i would add a few simple unit tests to make sure i know what to expect and i can easily replace implementation without major risk or waste of time.
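
A minimal sketch of that wrapper-plus-tests idea follows. The interface EmailValidator, the class FilterEmailValidator, and the helper check_email_validator are hypothetical names chosen for illustration; a real project would more likely put the checks in a proper unit test case.

```php
<?php

// Hypothetical interface: the rest of the application depends only on this.
interface EmailValidator
{
    /** @return bool */
    public function isValid($email);
}

// One implementation, backed by the Filter library. Swapping it out later
// means writing another class, not touching every call site.
class FilterEmailValidator implements EmailValidator
{
    public function isValid($email)
    {
        return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
    }
}

// Minimal checks documenting the behaviour the application relies on.
// Returns the addresses whose result differed from the expectation.
function check_email_validator(EmailValidator $validator)
{
    $expectations = [
        'firstname.lastname@aaa.bbb.com' => true,
        'no-at-sign.example.com'         => false,
        'two@@example.com'               => false,
    ];

    $failures = [];
    foreach ($expectations as $email => $expected) {
        if ($validator->isValid($email) !== $expected) {
            $failures[] = $email;
        }
    }

    return $failures;
}

$failures = check_email_validator(new FilterEmailValidator());
if ($failures) {
    echo 'Unexpected result for: ' . implode(', ', $failures) . "\n";
} else {
    echo "All expectations held.\n";
}
```

The same check_email_validator() call can be pointed at any replacement implementation to see whether it behaves the way the application expects.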

With PHP libs its a bit different story. I can easily lookup whats under the hood thats why i like zend framework so much :-) but reading PHP extensions code is usually much more frustrating. But i still would wrap whenever possible.

To summarize …. yes its super cool they added it and that people can use it but its totally not cool you dont know what you get ;-)

art

Mikael (@mikaelgramont) wrote at 9/4/2009 4:58 am:

I have started using HTMLPurifier on a project I’m working on and yes it’s heavy. I don’t plan on people POSTing stuff too much, so I’m not too worried about overhead.

However, I would like to hear suggestions of other libraries that I could use, should I find out HTMLPurifier takes up too many resources.

Thanks,

Mikael

Brandon Savage (@brandonsavage) wrote at 9/4/2009 1:52 pm:

Even including the HTML Purifier code through the require() or include() statement has a performance hit, because the compiler must still compile it. So its heft might still present a problem, even if your users don’t actually make use of the code. Something to keep in mind.

Mikael (@mikaelgramont) wrote at 9/4/2009 3:00 pm:

In that case, you just slap APC on the server, don’t you ?

Brandon Savage (@brandonsavage) wrote at 9/4/2009 4:04 pm:

You’ll still face some issues with performance, though they won’t be as bad.

Every time you invoke the PHP parser, APC will stat() the files included (and the files included by that script and so on) to see if they’ve changed. Chances are good that they haven’t, but APC still has to check (unless you disable that functionality as I’ve described here: http://www.brandonsavage.net/to-stat-or-not-to-stat/)

Additionally, as APC fills up, it will dump files off the memory to ensure it doesn’t overrun its limit. This might mean that some files get cached, then dumped, then cached again, etc.

You might be able to solve this by writing an autoload function (see http://us3.php.net/manual/en/function.spl-autoload-register.php and http://www.brandonsavage.net/making-life-better-with-the-spl-autoloader/) but as soon as you hit the HTML Purifier script it will include what it’s been written to include unless you strip out the includes, which would make upgrading difficult.
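
A rough sketch of such an autoload function is below. It assumes a library whose class names map to file paths by turning underscores into directory separators, which is an assumption to check against the actual layout of whatever library is being loaded; the function name register_library_autoloader and the lib directory are hypothetical.

```php
<?php

/**
 * Hypothetical helper: registers an autoloader for classes under $libraryRoot.
 * Files are only read from disk the first time a class is actually used.
 */
function register_library_autoloader($libraryRoot)
{
    spl_autoload_register(function ($class) use ($libraryRoot) {
        // Map class names to paths, e.g. Foo_Bar_Baz -> Foo/Bar/Baz.php
        $relative = str_replace(['_', '\\'], DIRECTORY_SEPARATOR, $class) . '.php';
        $file = $libraryRoot . DIRECTORY_SEPARATOR . $relative;

        // Only load files that exist, so other registered autoloaders get a turn.
        if (is_file($file)) {
            require $file;
        }
    });
}

// Illustrative location; nothing is loaded until a class under it is referenced.
register_library_autoloader(__DIR__ . '/lib');
```

This only defers the cost for code that asks for classes by name; it does nothing about include statements inside the library itself, which is the limitation described above.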

I think a lot of the time you don’t need the functionality of something as robust as HTML Purifier, and leaving it out is better than including it and then trying to manage the performance and/or access questions that arise. You only need it in select circumstances (namely, when saving input to the database or getting it from the database) so you might be better off using the Lite version, rolling your own, using the filtering functions, or including it ONLY when ABSOLUTELY necessary.

Andy Walpole (@http://www.suburban-glory.com/blog.html) wrote at 9/4/2009 4:20 pm:

The biggest attraction for HTML Purifier is that it works at one of the most critical jobs in an application, namely it stops malicious code – http://htmlpurifier.org/comparison

I don’t have any confidence in the PHP filters for this job and I’m not really sure I want to start experimenting with my own filter for such a crucial job when a tried and tested script is already out there.

Les wrote at 9/6/2009 6:57 pm:

I agree HTML Purifier does an excellent job but in my opinion not at the cost of performance.

To get around any security issues what I do is to encode the POSTed data (coming from TEXTAREAs) with base64 and leave it to a human to clean up any mess etc.

Not a perfect solution (you can’t sort or search data) by any means but it ain’t going to 1) break your box and 2) piss off your host.

Paul (@paulyg76) wrote at 9/18/2009 10:37 am:

I find the syntax of the filter ext to be a little… well ugly. Who wants to remember all those constant names? So in my case I wrote a wrapper class around the extension with method names like sanitizeString, validateEmail, etc… which is easier for me to remember.

I also find HTML Purifier to be too big.