Designing Databases: Picking The Right Data Types
Out Of Date Warning
Languages change. Perspectives are different. Ideas move on. This article was published on November 20, 2009 which is more than two years ago. It may be out of date. You should verify that technical information in this article is still current before relying upon it for your own purposes.
When I started writing this blog post, I had titled it “Tips for Designing Databases” and I planned to talk about various database design techniques. However, as I did more and more research, it dawned on me that one of the most crucial, and most overlooked, components of database development, is the selection of data types for columns.
Much of the information presented in this article was taken from presentations by Jay Pipes and a talk by Ronald Bradford. The talks are The Top 20 Design Tips For MySQL Enterprise Data Architects, Join-fu: The Art of SQL Tuning and SQL Query Tuning: The Legend Of Drunken Query Master.
MySQL supports a large number of data types (with Postgres supporting even more). For example, MySQL supports some 10 different numeric data types (INTEGER, TINYINT, SMALLINT, MEDIUMINT, BIGINT, DECIMAL, NUMERIC, FLOAT, REAL, and DOUBLE PRECISION), meaning that database designers need to know and understand how to use each one of them properly. Using them improperly adds stress to the database and generally reflects bad database design.
Since it would be impossible to discuss every data type in a blog post, I will instead discuss some of the most common MySQL (and this applies to other database platforms as well) mistakes, as highlighted by the presentations and blog posts I will cite.
Understanding how disk I/O affects databases
Databases are typically stored on disk (with the exception of some, like MEMORY databases, which are stored in memory). This means that in order for the database to fetch information for you, it must read that information off the disk and turn it into a results set that you can use. Disk I/O is extremely slow, especially in comparison to other forms of data storage (like memory).
When your database grows to be large, the read time begins to take longer and longer. This is a natural occurrence as the engine must read over more and more of the disk in order to find the information you have requested. Poorly designed databases exacerbate this problem by allocating more space on the disk than they need; this means that the database occupies space on the disk that is being used inefficiently.
Picking the right data types can help by ensuring that the data we are storing makes the database as small as possible. We do this by selecting only the data types we need, rather than choosing data types willy-nilly. This helps reduce the size of our rows, and by extension, our database, making reads and writes faster and more efficient.
Picking the right numeric type
The type of integer you select affects the amount of space that integer occupies on the disk – regardless of the value of the number you actually store in it.
For example, a BIGINT occupies 8 bytes of space, while a TINYINT occupies 1 byte of space. While the 8 byte integer gives you the ability to store huge numbers, it also means that you must store all eight bytes every time you store a record into that table. If you’re storing numbers like 2,000 or 5,000, you’re wasting lots and lots of bytes. This will inevitably make your reads slower, because the database must read over multiple sectors of the disk.
Also, many people assume that they must pick a larger integer size because the integer size they picked might not allow for enough values. For example, a SMALLINT will allow you to store up to 32,767 – if you leave it signed. Leaving it signed leaves the very last bit as a determination of whether or not the value is a positive or negative value. But if you declare the value to be UNSIGNED, you make that bit available for use, allowing a SMALLINT to store up to 65,535 – twice what it can store as a signed integer. Your primary keys should always be unsigned, especially if they are auto-incremented; MySQL will never assign a negative value as a primary key.
Refer to the documentation when picking an integer type, and use it to determine how big an integer you need.
Storing text in a database
Obviously a database is less useful if we only store numbers in it. MySQL, as well as other databases, allow us to store text values (strings or full text values) as a way of making our databases more useful. However, when doing so we must be cautious.
Pop quiz: what’s the difference between CHAR(4) and VARCHAR(4)? If you said that CHAR(4) will always be 4 bytes while VARCHAR(4) will be a variable number of bytes, you’re right! That’s a central difference between the two.
CHAR will right-pad any value that does not fit the width of the column defined, making all values the same number of bytes, while VARCHAR allows for variable length values. However, this variable nature comes with a cost – VARCHAR stores additional information, meaning that the overall bytes are higher than the CHAR for the same data.
For the most part, CHAR should be used when the length of the data is known and fixed (for example, as a MD5 hash, which is always 32 characters long). VARCHAR should be used when the length of the data is not fixed. You should also be careful to ensure that the VARCHAR is not wider than it has to be; for example, if you allow for usernames up to 12 characters, your VARCHAR should not be 255 characters wide.
Another favorite (and bad) development practice is to use TEXT columns to store large blocks of text (for example, this blog uses TEXT columns to store the contents). TEXT columns are extremely inefficient, however. When using these types of columns, it is a good idea to consider abstracting the TEXT columns to a different table, especially if fast lookup is required on the primary table. And if you are defining indexes on the TEXT columns (for example, for full-text searching), this is doubly important (indexes slow down write time, and the more data you store, the slower writing will be).
Oh, and as for BLOB columns – use the file system for what it was designed to do: store files. If you’re storing them in the database, to paraphrase, “ur doin it rong.” This does, of course, have room for interpretation; however, for the most part, you should use these columns extremely sparingly.
ENUM and SET make sense to developers for a number of different reasons: with ENUM you can enforce certain types of data being stored (since you can only store data that’s in the enumerated list). With SET you can store multiple values from a specified list, which makes the SET seem almost like an array.
The problem here is that databases aren’t designed to work like this. If you ever decide you have to add something else to the ENUM or SET declaration, MySQL must rebuild the entire table which, if you have 3 million records, could take you a week or two (hyperbole). This is clearly not optimal.
This is clearly a case where having the programming language do the work is preferable to having the database do it. Sure, it’s easy to validate whether or not the value is allowed by having the database enforce a constraint, but ultimately you will cause yourself more trouble than it’s worth.
Wait, isn’t this micro optimization?
Yes and no.
The ways in which this is similar to micro optimization are that the performance boost most sites will get is negligible. It will take a fair bit of time to make these changes, thus it might qualify as micro optimization.
But where this is clearly not micro optimization is that this is an adoption of a best practice. Choosing the proper data type is a crucial component of good database design, not simply an optimization technique. While this will have major performance benefits in high traffic databases, it is still the given best practice for all databases, regardless of traffic.