
Validation Blind Spots Hurt Real Users

Out Of Date Warning

Languages change. Perspectives are different. Ideas move on. This article was published on April 3, 2011 which is more than two years ago. It may be out of date. You should verify that technical information in this article is still current before relying upon it for your own purposes.

A friend of mine lives on Bonieta Harrold Drive. I live on Windsor Hill Drive. We have a problem in common: poorly designed software cannot accept the full length of our street addresses. American Express refuses to accept more than “WINDSOR HILL D” for mine, though mail addressed that way still arrives at our home. I can’t imagine what would happen if my friend ever got an American Express card; given the maximum length available for an address, he would live on “BONIETA HARROL”. If you live in a place where directions (e.g. NW, SW, SE) matter, a field that is too short for the whole address can seriously interfere with the delivery of mail and packages.

Clearly, these software systems have a design flaw: the programmers who wrote them assumed that 20 characters (house number and street information) was long enough for a standard address. In the best case, the developers picked 20 characters based on some experience (e.g. they considered all the street names in their own town together with known house number lengths, and came to an answer); in the worst case, they simply picked a number out of thin air. Real users are worse off because of it.

This problem isn’t limited to address length; it shows up in many forms. Names are prohibited from containing spaces or hyphens, even though many people have names that DO contain spaces or hyphens. Some people have names that are not even alphabet-based; others have names that are not in Latin-based alphabets. Programmers are also famous for insisting that data be presented in certain formats: phone numbers that must have dashes or dots or be formatted like “(xxx) xxx-xxxx”, or dates that must be “mm/dd/yyyy”. This completely ignores the reality that most of the rest of the world represents dates differently (usually “dd/mm/yyyy”), and that anyone outside North America probably doesn’t have a ten-digit phone number and almost certainly doesn’t format one the way Americans do.

This isn’t to say that there shouldn’t be some limitations on the lengths we accept. But our blind spots about validation can and do harm real users if they’re poorly or incompletely thought out. What might seem like a completely rational limit to us might hurt a real user who needs to exceed that limit, through no fault of their own. How many women are “BETTYJEAN” because their first name “can’t” have a space in it?

Let’s end the blind spots. Here are four steps to take in order to accept valid data from real users.

1. Accept valid data in any form provided by the user.

It’s pretentious of programmers to push the task of formatting data the way we like it onto the user, just because we don’t want to write a function that does the formatting. Demanding phone numbers in “(xxx) xxx-xxxx” format requires the user to do work that the system should do for them. Programmers are smart people; they can figure out that phone numbers contain only digits (props to people who accept letters and convert them, but this isn’t necessary) and strip out everything else. Once you have the digits, you can format them any way you want. But don’t push this burden onto the user.
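As a minimal sketch of that idea, the Python function below accepts a US phone number in whatever form the user typed it, keeps only the digits, and applies the display format itself. The function name `normalize_us_phone` and the choice to tolerate a leading country code of 1 are assumptions made for illustration; they are not taken from any particular framework.

```python
import re


def normalize_us_phone(raw: str) -> str:
    """Accept a US phone number in any common form and return "(xxx) xxx-xxxx"."""
    # Keep only the ASCII digits; dots, dashes, spaces and parentheses are dropped.
    digits = re.sub(r"[^0-9]", "", raw)
    # Assumption for this sketch: a leading country code "1" is allowed and removed.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

With this in place, "555.867.5309", "555 867 5309" and "+1 (555) 867-5309" all end up stored and displayed the same way, and the user never has to care which format the system prefers.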

2. Where possible, use well-developed validation libraries.

It’s pretty easy to recognize a valid email address…or so we think. And so we write regular expressions that expect email addresses to have a localpart, a domain, and a three-letter extension. The problem is the resulting regular expression might exclude the following valid addresses:

  • abc123@something.com (numbers, if we don’t explicitly allow them)
  • abc@something.us (two-letter extension)
  • abc@def.something.com (extra prefix)
  • abc@something.co.uk (two extensions)

The same rule applies to other standard validation we perform: URLs, phone numbers, credit card numbers, social security numbers, driver license numbers, postal codes, and other data points that must follow a format. Almost every framework on earth contains a library that is likely to implement standards we’ve never heard of; use them.
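To see how easily a hand-written pattern goes wrong, here is a short Python sketch. The pattern `NAIVE_EMAIL` is a deliberately naive, hypothetical regular expression of the kind described above (letters only, one domain label, exactly three letters at the end); it is not taken from any real library. It is run against the four valid addresses from the list.

```python
import re

# Hypothetical hand-rolled pattern: letters, "@", letters, ".", three letters.
NAIVE_EMAIL = re.compile(r"^[a-z]+@[a-z]+\.[a-z]{3}$")

# The four valid addresses from the list above.
VALID_ADDRESSES = [
    "abc123@something.com",   # digits in the local part
    "abc@something.us",       # two-letter extension
    "abc@def.something.com",  # extra prefix
    "abc@something.co.uk",    # two extensions
]


def naive_is_valid(address: str) -> bool:
    """Return True if the naive pattern accepts the address."""
    return NAIVE_EMAIL.match(address) is not None


# Every valid address the naive pattern turns away.
rejected = [a for a in VALID_ADDRESSES if not naive_is_valid(a)]
```

Here `rejected` ends up containing all four addresses, even though the pattern happily accepts "abc@something.com". A maintained validation library exists precisely so that nobody has to rediscover these cases one complaint at a time.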

3. Do not place artificial limits on valid data.

This might seem like a repeat of the first point in this list, but it’s not. That point is about format; this point is about length. Placing artificial length limits on valid data is precisely the thing that gets billing statements sent to “BONIETA HARROL” instead of Bonieta Harrold Drive. American Express, for what it’s worth, limits passwords to eight characters. That’s an artificial limit on valid data that makes security worse.

Obviously, certain data fields need a length in order to be created properly. But the days of being forced into a 255 character limit are long behind us; it is possible to offer a larger limit or, in the case of document datastores, no limit at all. And by no means is it a good idea to have artificially small limits just for legacy support of older applications; if older applications are so horribly broken that they require you to impose artificial and silly limitations, fix them.
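The arithmetic behind the two truncated street names is easy to reproduce. The Python sketch below mimics a fixed-width field that silently drops whatever does not fit. The limit of 14 characters is an assumption, inferred from the fact that both “WINDSOR HILL D” and “BONIETA HARROL” are exactly 14 characters long; the function names are hypothetical.

```python
# Assumption: inferred from the two truncated examples, each 14 characters long.
STREET_FIELD_LIMIT = 14


def truncate_street(street: str, limit: int = STREET_FIELD_LIMIT) -> str:
    """Mimic a fixed-width field: upper-case the value and cut it at the limit."""
    return street.upper()[:limit]


def lost_characters(street: str, limit: int = STREET_FIELD_LIMIT) -> str:
    """Return the part of the street name the fixed-width field throws away."""
    return street[limit:]


windsor = truncate_street("Windsor Hill Drive")
bonieta = truncate_street("Bonieta Harrold Drive")
```

`windsor` is "WINDSOR HILL D" and `bonieta` is "BONIETA HARROL". The second one loses "d Drive", which is enough to turn the street into a name that does not exist.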

4. Do place valid limits on specific data.

Certain data will have valid and reasonable limits. For example, in the United States, no phone number will ever be longer than sixteen characters (including a country code, parentheses around the area code, and dashes between the groups). This of course excludes extensions, which can add an arbitrary number of digits, but it illustrates a point: certain data can be limited in length and scope. Social security numbers are another perfect example of this: they’ll never be longer than 11 characters, including dashes, and what’s more, there are rules about what a social security number can contain. Credit card numbers are another example; accepted cards start with particular, predictable digits that can be validated, and the account numbers are generally a standard length.

It’s wise and practical to limit, validate and enforce restrictions on data that should be and must be specific. And while format should never be a consideration here, limiting the length, character content and acceptable values most certainly is a legitimate aim of validating specific data.
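A rough Python sketch of what this can look like for the two examples above follows. It ignores format (spaces and dashes are stripped first) and enforces only length, character content and acceptable values. The function names and the `CARD_RULES` table are assumptions made for illustration; the table is deliberately incomplete and checks prefix and length only, so it is not a substitute for a maintained validation library.

```python
import re


def normalize_ssn(raw: str) -> str:
    """Return the SSN as NNN-NN-NNNN, or raise ValueError if it cannot be valid."""
    digits = re.sub(r"[\s-]", "", raw)  # format is the system's job, not the user's
    if not re.fullmatch(r"[0-9]{9}", digits):
        raise ValueError("an SSN has exactly nine digits")
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    # Values that are never issued: area 000, 666 or 9xx; group 00; serial 0000.
    if area in ("000", "666") or area.startswith("9"):
        raise ValueError("area number is never issued")
    if group == "00" or serial == "0000":
        raise ValueError("group or serial number is never issued")
    return f"{area}-{group}-{serial}"


# Illustrative and incomplete: brand -> (leading digits, allowed lengths).
CARD_RULES = {
    "visa": (("4",), (13, 16, 19)),
    "mastercard": (("51", "52", "53", "54", "55"), (16,)),
    "amex": (("34", "37"), (15,)),
}


def identify_card(raw: str):
    """Return the brand whose prefix and length match, or None if none does."""
    digits = re.sub(r"[\s-]", "", raw)
    if not digits.isascii() or not digits.isdigit():
        return None
    for brand, (prefixes, lengths) in CARD_RULES.items():
        if digits.startswith(prefixes) and len(digits) in lengths:
            return brand
    return None
```

For example, `normalize_ssn("123 45 6789")` returns "123-45-6789" and `identify_card("4111 1111 1111 1111")` returns "visa", while an SSN beginning with 000 is rejected no matter how it is punctuated.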

Conclusion

Validation is a complex art, requiring developers to think beyond their own, and oftentimes their team’s, predetermined notions of what is correct and what is acceptable. While format should never be something arbitrarily enforced, correcting and assisting the user in providing data is our responsibility. Making that data entry easy, fast and convenient, while more difficult for us on the backend, makes our applications more usable, useful and desirable.


Wiseguy wrote at 4/3/2011 10:54 pm:

I had the same problem with American Express. I called, hoping that the limit was just in the web form and that they could enter longer addresses, but they just cut my street name in the middle and spread it onto the second line. When I joined several years ago, I noticed the 8-char password thing, too. I wrote them to complain. I was recently able to change my password to a longer one, so I guess that’s since been changed for the better.

Worse, I had a bank that required a password to be exactly 8 chars, contain no symbols/punctuation/spaces (so, alphanumeric only), and not start with a number. Seriously? Sheesh.

Predrag Supurović wrote at 4/4/2011 3:38 am:

There is one point that you mentioned lightly, but where lots of developers actually fail: support for international characters.

Nowadays, it is pretty easy to support almost every alphabet in the world. All you have to do is use UTF-8 encoding. It should be used both on the web page and in the database, and that would help anyone type in his name, address or whatever else using his own language and alphabet if necessary.

Chris Shiflett (@shiflett) wrote at 4/4/2011 12:00 pm:

Slightly tangential to your post, but here are a few facts about US addresses that you might find interesting:

1. US addresses consist of two lines, an address line and a last line. Every form that collects city, state, and ZIP (collectively, the last line) separately only needs one address line.

2. Any secondary unit designator (apartment, suite, etc.) is part of the address line. If a form collects it separately, you can opt to just write it at the end of the address line instead. I always do that, because it’s faster.

3. The longest valid address line is 49 characters. Limiting this to 20 is especially dumb, but even limiting it to 50 can be problematic, because standardized addresses use standardized abbreviations, and users might spell those out.

The more you know. :-)

Jani Hartikainen (@jhartikainen) wrote at 4/5/2011 4:10 pm:

Excellent points there. These are some things that have baffled me as well.

Something you didn’t mention is passwords.

Why oh why can’t I have special characters in my password, and why does it have to be between 6 and 12 characters and not any longer?

If I want to have a password of 100 characters, you should let me. It shouldn’t matter to you what’s in it, because you should be hashing it anyway, so all passwords would match your required length (be it 40 with sha1 or whatever).

Tim Swann (@faffyman) wrote at 4/11/2011 3:41 am:

Nice article…

I’m probably guilty of a few of those offences myself in the past.
I know I’ve been guilty of strict validation on credit card numbers. Why I didn’t allow spaces was daft when I think about it, it’s so easy to strip the spaces out.

It’s a perfect example of programmers living in a programmatic rather than real world. It’s good to have it pointed out to us every now and then that we need to stop over-thinking, and put the user first.

@Jani – totally agree, my password is mine, and so it should contain whatever I want – and not be limited to or even enforced by the system.

Em wrote at 4/11/2011 4:21 pm:

What do you think about validating and sanitizing email with the built-in PHP filter functions:
filter_var($email, FILTER_VALIDATE_EMAIL), filter_var($email, FILTER_SANITIZE_EMAIL).