Scaling Up: Reducing Drag, Increasing Lift

Out Of Date Warning

Languages change. Perspectives are different. Ideas move on. This article was published on February 24, 2009 which is more than two years ago. It may be out of date. You should verify that technical information in this article is still current before relying upon it for your own purposes.

Now that we know we want to scale our application, we first need to make sure it’s running at peak performance. There are a number of things we can and must do to ensure that the newly scaled application uses resources appropriately, runs efficiently, and, most importantly, does not consume excess resources that translate into extra cost.

The observant reader will note that many if not most of these suggestions are performance enhancements, not scaling techniques. Why then are they in a series about scaling? Scaling is about more than just adding hardware. It’s also about making sure your system runs better. You can add lots and lots of hardware, but at some point no amount of it will compensate for bad queries and poor optimization. So before we start adding servers, let’s take a look under the hood.

Are there any errors in the logs? Look again – what about notices?
Every time an error is raised – whether it be a PHP error, a web server error, a 404 Not Found error, or any other server-side error – an error log is opened and the error information is written to it. This functionality is great for diagnosing why your application is failing, and logging should stay turned on. But if you’re logging “stupid” errors like notices and warnings on every request, you’re spending time opening and writing to error logs that you shouldn’t be spending.

In your development environment, you should turn on E_ALL for error_reporting. You may even consider turning on display_errors so that those annoying notices break the way your page looks and make you fix them. Just make sure that display_errors is turned off on production.
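As a rough sketch of the settings described above, the same configuration can be applied at runtime. The function name `configureErrorReporting` and the `$isProduction` flag are assumptions made for illustration; in practice these values are often set in php.ini instead.

```php
<?php

/**
 * Hypothetical helper: the function name and the $isProduction flag are
 * assumptions made for this sketch, not part of the original article.
 */
function configureErrorReporting($isProduction)
{
    // Report every error level, notices included, in all environments.
    error_reporting(E_ALL);

    // Show errors in the page during development only.
    ini_set('display_errors', $isProduction ? '0' : '1');

    // Keep writing errors to the log in both environments.
    ini_set('log_errors', '1');
}

configureErrorReporting(false); // development settings
```

Calling `configureErrorReporting(true)` early in the production bootstrap keeps the same reporting level while hiding errors from visitors.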

Most people think that notices are just an inconvenient part of PHP but this is not true. While many times they do create a headache, they can also help you track down or prevent mistakes by diagnosing the use of variables that haven’t been initialized, array keys that don’t exist, and function calls that aren’t accomplishing anything. Resolve notices in your code.
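To illustrate the kind of notice worth resolving, here is a minimal sketch of reading an array key that may not exist. The `getParam` helper and the `$request` array are hypothetical names used only for this example.

```php
<?php

/**
 * Hypothetical helper: returns $source[$key] when it is set,
 * otherwise the supplied default.
 */
function getParam(array $source, $key, $default)
{
    return isset($source[$key]) ? $source[$key] : $default;
}

// Stand-in for request data; the 'page' key is deliberately missing.
$request = array('sort' => 'date');

// Reading $request['page'] directly would raise an undefined index notice
// (a warning as of PHP 8.0). An explicit default avoids it and documents
// what the code expects.
$page = (int) getParam($request, 'page', 1);
$sort = getParam($request, 'sort', 'title');
```

The point is not the helper itself but the habit: every notice marks a spot where the code assumed something that was not true.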

For those developers who might be tempted to simply make error_reporting report all errors sans notices, this is a poor choice. Notices may be raised legitimately when things fail, and should be addressed. Turning them off removes that check.

There is an exception to this rule: in the event that you are using a third-party application (like Drupal, WordPress, etc.) that you do not want to modify to remove all notices, you should consider turning notice reporting off ONLY in the directory containing this particular application.

With regard to other types of errors – 404 Not Found and others – see what you can do to reduce them. Are there images being linked that are not being found? Files that don’t exist? Some errors will come as the result of spammers and bots, but others can and should be fixed. This will reduce the drag on the file system from errors being written.

Are you configuring individual directories with .htaccess files?
Many people configure individual directory settings using .htaccess files. What they don’t realize is that, when overrides are allowed, Apache must search for – and parse – .htaccess files in every directory of the hierarchy on every request. For example, if you place an .htaccess file in /usr/www, one in /usr/www/mysite, and one in /usr/www/mysite/admin, and then access /usr/www/mysite/admin/index.php, Apache must find and read three .htaccess files, merge their rules, and come up with a common set of policies to apply. This is inefficient.

Instead, it is better to define rules in the virtual host (vhost) configuration, which Apache reads only once, at startup. Every rule you can set in an .htaccess file can be set in the vhost configuration too. Also, be sure to set AllowOverride to None so that Apache does not go looking for override files.

There are some cases where you might want to set a directive in an .htaccess file, but these are rare. Since you can do everything (including requiring authentication for a directory) from the vhost configuration, that is where the vast majority (99%) of directives should be set.
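A minimal sketch of what this might look like, assuming the directory layout from the example above. The server name, the password file location, and the specific options are placeholders, not recommendations.

```apache
# Hypothetical virtual host: the server name and paths are placeholders.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /usr/www/mysite

    <Directory /usr/www/mysite>
        # Stop Apache from looking for .htaccess files in this tree.
        AllowOverride None
        # A setting that might otherwise live in /usr/www/mysite/.htaccess.
        Options -Indexes
    </Directory>

    <Directory /usr/www/mysite/admin>
        # Authentication for the admin area, read once at startup.
        AuthType Basic
        AuthName "Admin"
        AuthUserFile /usr/www/.htpasswd
        Require valid-user
    </Directory>
</VirtualHost>
```

The trade-off is that changes made here take effect only after Apache reloads its configuration, whereas .htaccess changes apply on the next request.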

Note: While this applies specifically to Apache, be sure that your configuration is optimized if you use other webservers. Apache is by far the most popular, serving more than half the internet, which is why I address it specifically here.

How many database queries are you executing on each page load?
One of the things that really slows a site down is repeated connections and requests to MySQL or other database servers. The problem is compounded if the request is made over TCP/IP to another host (as opposed to a local connection), though both are fairly slow. Many developers make multiple requests on a page to get its various components; this causes slowdown and should be fixed.

There are a number of strategies for hitting the database more efficiently. One that I recommend is caching, either file-based or in memory. File-based caching can be around 2x faster than either form of MySQL querying, and the memory options only get faster from there. We’ll discuss caching in another section.
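As a small preview of the file-based idea, here is a minimal sketch rather than a production cache. The function name `cachedFetch`, the cache file naming, and the use of the system temp directory are all assumptions for illustration.

```php
<?php

/**
 * Hypothetical file-based cache: returns the stored value for $key if it
 * is younger than $ttlSeconds, otherwise calls $producer and stores the
 * result for next time.
 */
function cachedFetch($key, $ttlSeconds, callable $producer)
{
    $file = sys_get_temp_dir() . '/cache_' . md5($key) . '.ser';

    // Serve from the cache file while it is still fresh.
    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        return unserialize(file_get_contents($file));
    }

    // Cache miss: do the expensive work once and store the result.
    $value = call_user_func($producer);
    file_put_contents($file, serialize($value), LOCK_EX);

    return $value;
}

// Usage: the closure stands in for the real database query.
$entries = cachedFetch('recent_entries', 300, function () {
    return array(array('id' => 1, 'title' => 'Hello'));
});
```

With a five-minute lifetime, the query behind `recent_entries` runs at most once every five minutes instead of once per page load.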

Another problem that’s often seen in code (and should never ever be done regardless of the page) is looping queries. For example:

<?php

$conn = mysql_connect('localhost','user','passwd');
mysql_select_db('mydatabase',$conn);

$sql = 'SELECT * FROM blog_entries ORDER BY addDate DESC LIMIT 15';
$resource = mysql_query($sql, $conn);

while($array = mysql_fetch_assoc($resource))
{
    // Get comments
    $commentCountSql = 'SELECT COUNT(*) FROM blog_comments WHERE entryID = ' . $array['id'];
    $commentCountResource = mysql_query($commentCountSql, $conn);
    $commentCountArr = mysql_fetch_row($commentCountResource);
    $commentCount = $commentCountArr[0];

    // Do some other stuff here.
}

This can have very troublesome implications for efficiency (Eli White pointed out that it can also be useful for not overloading your database server) and unless you have a good reason you should avoid it. This is, of course, an overly simplified example for the purposes of illustration, but I think you probably get the point.

The two SQL statements can be combined into one by using a JOIN. For example:

SELECT blog_entries.*, COUNT(blog_comments.entryID) AS commentCount
FROM blog_entries LEFT JOIN blog_comments ON blog_comments.entryID = blog_entries.id
GROUP BY blog_entries.id ORDER BY blog_entries.addDate DESC LIMIT 15

Now, how do you figure out how many queries each script is executing without reading every single line of code (though at some point you probably should do this anyway to refactor)? Well, by use of a profiler. I recommend Xdebug as it’s robust and well-documented, and it’s my favorite. I’ve written extensively on its use. Your best bet is to do a function profile on each script and count the number of times you connect, query, or otherwise interface with MySQL, then address the reasons why.
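If running a profiler is not an option right away, a rough count can also be collected in code. The sketch below is not a replacement for Xdebug; the `QueryCounter` class is hypothetical, and it assumes all queries pass through one place and use placeholders, so that repeated statements share the same text.

```php
<?php

/**
 * Hypothetical wrapper that counts the queries issued during one page load.
 * $execute is whatever actually runs SQL in your application.
 */
class QueryCounter
{
    private $execute;
    private $counts = array();

    public function __construct(callable $execute)
    {
        $this->execute = $execute;
    }

    public function query($sql)
    {
        if (!isset($this->counts[$sql])) {
            $this->counts[$sql] = 0;
        }
        $this->counts[$sql]++;

        return call_user_func($this->execute, $sql);
    }

    /** Total number of queries issued so far. */
    public function total()
    {
        return array_sum($this->counts);
    }

    /** Statements issued more than once, a hint of a looping query. */
    public function repeated()
    {
        return array_filter($this->counts, function ($n) {
            return $n > 1;
        });
    }
}

// Usage with a stand-in executor that returns no rows.
$db = new QueryCounter(function ($sql) {
    return array();
});
$db->query('SELECT * FROM blog_entries ORDER BY addDate DESC LIMIT 15');
for ($i = 0; $i < 3; $i++) {
    $db->query('SELECT COUNT(*) FROM blog_comments WHERE entryID = ?');
}
```

Logging `total()` and `repeated()` at the end of a request gives a quick list of pages worth profiling properly.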

How efficient are your database queries?
It’s a fairly common belief that reducing the number of queries an application executes is synonymous with making it better. This thinking is wrong. Reducing the overall number of queries matters, but it is just as critical to identify which individual queries, when simplified, make your application run faster.

In the previous example, we looped two SQL queries that could have been joined together with relative ease. In that case, the benefit is easy to see. But sometimes it’s more efficient not to join two queries together.

How then, as a developer, do you know the difference? MySQL has a built-in statement called EXPLAIN that you can put in front of pretty much any SELECT query. It shows how the engine plans to find the data (which tables it reads, which indexes it uses, and roughly how many rows it examines), which in turn tells you what MySQL is having trouble with.

Learning how to use EXPLAIN is an important skill for any programmer who uses MySQL. Jay Pipes has a few great talks that illustrate the main points, and I highly encourage you to read them.
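As a brief illustration, here is how EXPLAIN might be applied to the comment-count query from the earlier example. The entry ID 42 and the index name idx_entryID are placeholders, and the actual output depends on your schema and data.

```sql
-- Ask MySQL how it would execute the per-entry comment count.
EXPLAIN SELECT COUNT(*) FROM blog_comments WHERE entryID = 42;

-- If the output shows type = ALL and key = NULL, MySQL is scanning the
-- whole table. An index on the filtered column usually changes that.
-- The index name idx_entryID is a hypothetical choice.
CREATE INDEX idx_entryID ON blog_comments (entryID);

-- Run EXPLAIN again and compare the type, key, and rows columns.
EXPLAIN SELECT COUNT(*) FROM blog_comments WHERE entryID = 42;
```

The columns to look at first are `type`, `key`, `rows`, and `Extra`; a large `rows` estimate with no `key` is the usual sign of a missing index.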

To confirm whether queries actually run faster, you should also benchmark your code. Learn how to use a good debugger for this, as most also include profiling tools that keep track of how long code takes to execute.
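For a quick comparison without any tooling, a small timing helper is often enough. This is a minimal sketch; the function name `averageSeconds` and the two sample tasks are assumptions, standing in for the two query strategies you want to compare.

```php
<?php

/**
 * Hypothetical helper: runs $task $iterations times and returns the
 * average wall-clock time per run, in seconds.
 */
function averageSeconds(callable $task, $iterations = 1000)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($task);
    }

    return (microtime(true) - $start) / $iterations;
}

// Stand-ins for "one joined query" versus "many small queries".
$joined = averageSeconds(function () {
    return array_sum(range(1, 100));
});
$looped = averageSeconds(function () {
    $sum = 0;
    foreach (range(1, 100) as $n) {
        $sum += $n;
    }
    return $sum;
});

printf("joined: %.8f s, looped: %.8f s\n", $joined, $looped);
```

Averaging over many runs smooths out noise, but results from a development machine are only a hint; measure again under realistic data volumes before deciding.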

What kind of caching engine do you have installed?
If you’re not using a cache you’re sacrificing some of the easiest performance gains you’ll ever make.

Real life story: I benchmarked my website with APC turned on and turned off. The turned on version was able to withstand some 230 hits per second, while the APC-less version cracked at roughly 80 hits per second. Are you willing to give up roughly 65% of your capacity? I certainly wasn’t.

There are lots of caching engines out there, though I personally recommend APC. But you should check them out for yourself, including eAccelerator and XCache. If you install APC you should also download the APC source and install the apc.php file that comes included with APC; trust me, it’s worth it.

How many file system reads are you executing on each page load?
How many of us have seen code that begins like this?

<?php

/** My Index Page **/

require_once 'config.php';
require_once FILE_PATH . '/dblogin.php';
require_once FILE_PATH . '/doThisFunction.php';
require_once FILE_PATH . '/doThatFunction.php';
require_once FILE_PATH . '/userObject.php';
require_once FILE_PATH . '/databaseObject.php';
require_once FILE_PATH . '/stats.php';

echo 'ok';

?>

Each time a file is included, the system executes a stat() call to determine if the file has been updated and thus needs to be recompiled. With require_once() it also checks a hash value to see if the file has already been included elsewhere. While the typical site will never notice these calls, good planning still suggests being careful with the files and libraries you include, to reduce the number of system calls you ultimately have to make. Xdebug has a great code coverage tool that you can use to see how well you’re doing on this. I recommend you check out the documentation.

File system reads are very fast, but they are still slow compared with memory. Reducing the number of file reads will help speed up performance.
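One common way to avoid loading files a request never uses is to let PHP load class files on demand. The sketch below uses `spl_autoload_register`; the `registerAutoloader` function and the one-class-per-file naming convention are assumptions, and files that only define functions would still need an explicit include.

```php
<?php

/**
 * Hypothetical autoloader: loads <baseDir>/<ClassName>.php the first time
 * a class is used, instead of including every file up front.
 */
function registerAutoloader($baseDir)
{
    spl_autoload_register(function ($class) use ($baseDir) {
        // Map namespace separators to directory separators.
        $file = $baseDir . '/' . str_replace('\\', '/', $class) . '.php';

        // A plain require is enough: the autoloader only runs for classes
        // that are not yet defined, so the *_once check is unnecessary.
        if (is_file($file)) {
            require $file;
        }
    });
}

// Assumed layout: class files live in a lib directory next to this script.
registerAutoloader(__DIR__ . '/lib');
```

A page that only echoes 'ok' would then touch none of the class files, while a page that needs a given class pays for exactly that one file.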

Updated at 11:11 AM on 2/24/2009 as a result of feedback from Eli White

Is your application ready for concurrency?
One thing that many developers overlook is concurrency. Designing for concurrency means making sure that your site behaves correctly when the same data is acted upon by many different users at the same time.

For example, imagine that you have three friends who all share the same checkbook, and you have the following function:

<?php

function updateBalance($acctID, $oldBalance, $change)
{
	$balance = $oldBalance + $change; // Note if change is negative this will reduce the balance.
	$sql = 'UPDATE accounts SET balance = ' . $balance . ' WHERE id = ' . $acctID;
	return mysql_query($sql);
}

?>

Then imagine that these friends (Jason, John, and Jacob) withdraw money at almost exactly the same time: Jason withdraws money, but before the balance can be updated, John withdraws money too. Jason takes $25 and John takes $30, and the account has $100 in it. The balance now should be $45, right?

Except here’s the problem: if Jason’s transaction has not completed when John’s starts, the functions are executed as follows:

FUNCTION TRACE:

0.001 withdrawMoney(12345, 25) // Jason's transaction
0.002   -> getBalance(12345) // returns 100
0.003 withdrawMoney(12345, 30) // John's transaction
0.004   -> getBalance(12345) // returns 100
0.005 updateBalance(12345, 100, -25) // Jason's transaction
0.006 updateBalance(12345, 100, -30) // John's transaction

The resulting balance is $70, because John’s action overwrote Jason’s (because John had the old balance).

How would we refactor this to prevent this problem from happening again? Like this:

<?php

function updateBalance($acctID, $change)
{
	$sql = 'UPDATE accounts SET balance = balance+'. $change . ' WHERE id = ' . $acctID;
	return mysql_query($sql);
}

?>

Why would this be better? Because it ensures that each time the transaction executes, the database is updated based on the database’s last known balance, not the application’s last known balance. This is an extremely important concept, because having multiple servers makes it increasingly likely that the same data will be acted upon by multiple people at once (this can even happen with a single server). This time, when Jason and John do their transactions together, the database applies both changes directly, meaning the balance is correct at $45.00.
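The difference between the two approaches can be replayed without a database. The sketch below uses an in-memory array as a stand-in for the accounts table; the function names `applyAbsolute` and `applyRelative` are hypothetical and mirror the two versions of updateBalance() discussed above.

```php
<?php

// Mirrors the first version: writes a balance computed from a value the
// application read earlier.
function applyAbsolute(array &$accounts, $acctID, $oldBalance, $change)
{
    $accounts[$acctID] = $oldBalance + $change;
}

// Mirrors the refactored version: "SET balance = balance + change".
function applyRelative(array &$accounts, $acctID, $change)
{
    $accounts[$acctID] += $change;
}

// In-memory stand-in for the accounts table.
$accounts = array(12345 => 100);

// Replay the function trace: both reads happen before either write.
$jasonSees = $accounts[12345];
$johnSees = $accounts[12345];
applyAbsolute($accounts, 12345, $jasonSees, -25);
applyAbsolute($accounts, 12345, $johnSees, -30);
$lostUpdateBalance = $accounts[12345]; // 70: Jason's withdrawal is lost

// Reset and replay using relative updates.
$accounts[12345] = 100;
applyRelative($accounts, 12345, -25);
applyRelative($accounts, 12345, -30);
$relativeBalance = $accounts[12345]; // 45: both withdrawals are applied
```

The relative version works because the arithmetic happens where the current value lives, so neither caller depends on a balance that may already be stale.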

Final Points
There are lots of ways to squeeze out a bit of extra performance from the systems we already have running. Before adding extra hardware take some time to add some extra performance. While you’re still going to need to scale, if your application is running at peak performance, you know that the resources you add will be necessary, and that’s time and money well spent.

Eli White (@EliW) wrote at 2/24/2009 10:59 am:

Overall, a decent writeup; however, I disagree on some of the points. The SQL code you first present for example, is IMO the more scalable code (if perhaps not more performant). It is my experience that doing many smaller compact & direct queries allows you to scale better. They can be farmed out to multiple DBs, they have a greater chance of hitting a DB cache, they are easier for you to cache yourself, they create less DB load overall by requiring less ‘complicated work’ of the DB, etc.

Secondly, you state that if you are going to include a number of files, that you should turn APC off because it’s not helping you. Again, I disagree. Sure if you happen to include a dozen (or hundreds) of separate PHP files, you are requiring some stat commands to happen, and if using require_once, then a hash lookup. But those are so minor when compared to recompiling each file over and over again. Trying to avoid those stat/hash things is a good concept, but is a real fine-tune performance measure. And can harm your code maintainability in the process as you attempt to clump files together.

Finally, you have a very nice section on concurrency, something that many people overlook. (Also Slave Lag is often overlooked and can be seen as related to concurrency). However, having code designed for concurrency has nothing to do with scalability (nor performance)

Matthew Weier O'Phinney (@mwop) wrote at 2/24/2009 11:42 am:

require_once calls are expensive — but when you use an opcode cache and PHP >= 5.2.0, the impact becomes trivial. PHP 5.2.0 adds a realpath cache, which greatly speeds up the lookups and reduces the number of hits to the filesystem. Even if you’re loading dozens or hundreds of files, under APC or Zend Accelerator, the performance impact is greatly mitigated.

Another tip: strip the require_once calls and use autoloading. I did this when profiling Zend Framework, and the performance impact was dramatic. If you have a good autoload strategy and use an opcode cache, I often saw performance increases of 4-10x.

Brandon Savage (@brandonsavage) wrote at 2/24/2009 11:47 am:

Thanks Eli. You’ll note that I modified the article somewhat to reflect your statements, and I know you’re right about some of them. I do agree with Matthew Weier O’Phinney, though, relating to require_once, and I think that good planning can help with these things.

Thank you both for your comments. I welcome the discussion.

Wesley Mason (@1stvamp) wrote at 2/25/2009 7:06 am:

@weierophinney: Opcode caches such as APC and Xcache have trouble caching opcodes called using require_once and include_once due to the architecture of the Zend engine and the way it handles these, so even when using a cache it can be more optimal to avoid the extra IO of a *_once function call.
One way I get around this is by emulating the use of _once within an spl_autoloader class, which just calls require instead.

Matthew Weier O'Phinney (@mwop) wrote at 2/25/2009 8:28 am:

@1stvamp I should have mentioned that when I mentioned usage of autoloaders. Zend Framework’s Zend_Loader::loadClass() actually uses include() internally (after first checking against class_exists()), so when used within an autoloader, it’s doing exactly what you suggest — and that’s what I was benchmarking.