Anatomy of a Failure: CFA Website

 

As a result of the disastrous 2009 fires in our state, known as Black Saturday, there was a Royal Commission into the fires, their management and the deaths that occurred as a result.  Coming out of this was a set of recommendations, and changes were instigated by a number of parties.  One of the main parties of interest in the handling of fire services in the state is the CFA (Country Fire Authority).

The CFA has to be congratulated for at least the intent of some of their initiatives post Black Saturday.  One of these is their Fire Ready app for mobile phones, and another was a revamp of their website.  A lot of time, and presumably public funds, went into these developments, which ended in a catastrophic failure of the website and app.  So what went wrong?

Initial reports suggest that capacity planning was the issue, with the statement made that they had planned for up to 350 hits per second, but the site received 700 hits per second.  Much of the blame was placed on the fact that the app was connecting to the same server as the main site, and part of the solution was to put the app on its own server.  Note the singular 'server' there.  I'm not sure if this was just a reporting problem, an attempt to make things easier for journalists to understand, or whether we are really talking about an under-resourced system that is supposedly meant to assist people in times of greatest need and stress.  So, aside from the fact that both were on the same infrastructure (let us assume there is actually more than one server there), what can we determine about the possible problems that these 700 hits per second were causing?

Firstly, looking at the site details, the app now uses osom.cfa.vic.gov.au, which appears to be fronted by a Squid proxy cache, often used as a web site accelerator, so there could be more than one server behind that.  The main website, www.cfa.vic.gov.au, is behind an F5 BIG-IP appliance, which in its base form is a load balancer, meaning that unless they are spending money for the sake of it, there is more than one server providing the service.  The BIG-IP appliance is quite a useful beasty, and can provide web site acceleration, using compression and other techniques to make life easier.  So why was there a problem?  Apologies for getting a bit technical from here on in.

Let's first define what a "hit" might be.  When you load a web page, like www.vic.gov.au, you might think that by the time it finishes loading that is one "hit".  And you'd be wrong.  A web page is made up of a number of files.  Each image on the site is likely to be a file, as is each stylesheet and JavaScript file.  Each of these files is requested separately and each of these requests results in a "hit".  Taking a look at the structure of the CFA site shows that there are a number of unnecessary files on there.  For instance, there are six JavaScript files that could be combined into one.  There are images called from CSS that could be combined into "sprites".  Doing that alone would reduce the number of hits needed to load a single page, and in doing so increase the site's capacity - without resorting to extra hardware or tweaking networking stacks.  Indeed, it looks like you could increase the number of page views supported by a factor of at least three, simply by removing these extraneous hits.
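To make the idea of "hits per page" concrete, here is a rough sketch in Python that counts the requests a page would trigger.  The sample markup and file names are invented for illustration and are not taken from the CFA site, and the count ignores browser caching and images referenced from CSS, so treat it as a lower bound rather than a measurement.

```python
from html.parser import HTMLParser


class AssetCounter(HTMLParser):
    """Count the separately requested assets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.stylesheets = 0
        self.images = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.scripts += 1
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.stylesheets += 1
        elif tag == "img" and attrs.get("src"):
            self.images += 1


def hits_for_page(html):
    """Return one hit for the HTML document plus one per referenced asset."""
    counter = AssetCounter()
    counter.feed(html)
    return 1 + counter.scripts + counter.stylesheets + counter.images


# Hypothetical page with three separate script files.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="site.css">
<script src="a.js"></script>
<script src="b.js"></script>
<script src="c.js"></script>
</head><body><img src="logo.png"></body></html>
"""

# The same hypothetical page with the scripts combined into one file.
COMBINED_PAGE = """
<html><head>
<link rel="stylesheet" href="site.css">
<script src="all.js"></script>
</head><body><img src="logo.png"></body></html>
"""

print(hits_for_page(SAMPLE_PAGE))    # 6
print(hits_for_page(COMBINED_PAGE))  # 4
```

If a server can only answer a fixed number of hits per second, every request removed from the page translates directly into more page views per second.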

There is more interesting stuff under the covers.  There is no compression used on any of the assets (page data, images, scripts, stylesheets), so the amount of data having to be sent down the wire is far greater than it needs to be.  A conservative estimate suggests that the site could handle twice the current load simply by turning on compression.  This is usually just a configuration change in the base web server software.  All modern browsers support compression, and it is a rookie mistake not to be using it.
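As a rough illustration of what compression buys, the sketch below gzips a block of repetitive markup and reports the size ratio.  The sample markup is made up, and because it is far more repetitive than a real page the ratio it shows will be much better than a real site would see; the point is only that text assets shrink substantially.

```python
import gzip


def compression_ratio(payload: bytes) -> float:
    """Return original size divided by gzip-compressed size."""
    compressed = gzip.compress(payload)
    return len(payload) / len(compressed)


# Hypothetical, highly repetitive markup standing in for a page body.
sample = b"<div class='incident'><span>Location</span><span>Status</span></div>\n" * 200

print(f"original bytes:   {len(sample)}")
print(f"compressed bytes: {len(gzip.compress(sample))}")
print(f"ratio:            {compression_ratio(sample):.1f}x")
```

In practice this work is done by the web server or the load balancer rather than by application code, which is why it is usually just a configuration change.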

On the same front, most browsers also try to help by actively caching items that don't change often, meaning the next time you go to the site it is far quicker to load as the browser already has a copy.  For this to work you need to make sure your web server is set up to help out.  The CFA site is not.  There are two factors here, ETags and Expires headers.  ETags are supposedly unique ids for assets that allow the browser to check whether the copy it holds is still current, so that it doesn't need to re-download it (providing it hasn't expired).  The trouble with these is that if you have more than one server supplying the same data - as you would behind, say, a BIG-IP appliance - these ETags are likely to be different for the same resource, meaning that the browser thinks the item has changed and needs to be downloaded again.  The CFA site has ETags turned on.
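The ETag mismatch can be sketched as follows.  This assumes a server that builds ETags from file metadata such as the inode number, which is one common way they end up differing between machines; I have not confirmed that this is how the CFA servers generate theirs.  The function names and the inode, size and timestamp values are invented for illustration.

```python
import hashlib


def file_metadata_etag(inode: int, size: int, mtime: int) -> str:
    """ETag built from file metadata; the inode differs from server to server."""
    return '"%x-%x-%x"' % (inode, size, mtime)


def content_etag(body: bytes) -> str:
    """ETag built from the content itself; identical on every server."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


body = b"body { font-family: sans-serif; }\n"

# Same file deployed to two hypothetical servers: only the inode differs.
server_a = file_metadata_etag(inode=1001, size=len(body), mtime=1357516800)
server_b = file_metadata_etag(inode=2002, size=len(body), mtime=1357516800)

print(server_a == server_b)                      # False
print(content_etag(body) == content_etag(body))  # True
```

When the browser's stored ETag came from one server and the revalidation request lands on the other, the values don't match and the full file is sent again.  Turning ETags off, or deriving them from something that is the same on every server, avoids this.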

Expiry headers tell the browser how long it should hold an item in its cache before checking for a new version.  For items that don't change much (like images, stylesheets and scripts) it is common to set a "far future" expiry header, say a month or even a year into the future.  This means the browser doesn't need to worry about these items after the first load.  The CFA site doesn't use Expiry headers.  Fixing these two problems could see at least a 50% increase, if not a doubling, of the capacity of the servers.
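For reference, a far-future expiry looks something like the sketch below, which builds the header values for a response.  The helper name and the 365-day default are my own choices, and I have added the Cache-Control max-age equivalent alongside Expires since the two are normally set together; in a real deployment these would be set in the web server configuration rather than in code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def far_future_headers(now: datetime, days: int = 365) -> dict:
    """Build caching headers that let a browser keep an asset for `days` days.

    `now` must be a timezone-aware datetime in UTC.
    """
    expires = now + timedelta(days=days)
    return {
        # HTTP dates are always expressed in GMT.
        "Expires": format_datetime(expires, usegmt=True),
        # max-age is the same lifetime expressed in seconds.
        "Cache-Control": f"public, max-age={days * 24 * 60 * 60}",
    }


headers = far_future_headers(datetime.now(timezone.utc))
for name, value in headers.items():
    print(f"{name}: {value}")
```

The catch with far-future expiry is that a changed file needs a new name (for example a version number in the file name), otherwise browsers will keep using the old copy.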

Now, without looking at hardware, operating system or web server software, and multiplying the gains above together (three times from fewer requests, two times from compression, and one and a half to two times from caching), it looks like the site could have handled between 9 and 12 times the traffic it received on Friday.  All it would have taken was someone with experience in developing high-capacity websites, or even a sysadmin with capacity planning skills, to have foreseen this and averted what was an obviously avoidable calamity.  What a pity.  I hope someone at the CFA is reading this, as despite the assurances made about things being done - I don't see any evidence of even the basics being covered.
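The 9 to 12 times figure comes from multiplying the individual estimates given earlier.  The sketch below spells that out, on the assumption that the three improvements are independent and so compound; the factors are my rough estimates from this post, not measurements.

```python
import math

# Estimated capacity gains from each fix, as discussed in this post.
FEWER_REQUESTS = 3.0   # combining scripts and using sprites
COMPRESSION = 2.0      # turning on compression
CACHING_LOW = 1.5      # fixing ETags and expiry headers, lower estimate
CACHING_HIGH = 2.0     # fixing ETags and expiry headers, upper estimate


def combined_gain(*factors: float) -> float:
    """Multiply independent capacity gains together."""
    return math.prod(factors)


low = combined_gain(FEWER_REQUESTS, COMPRESSION, CACHING_LOW)
high = combined_gain(FEWER_REQUESTS, COMPRESSION, CACHING_HIGH)

print(f"combined gain: {low:.0f}x to {high:.0f}x")
```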

Update 2013-01-15:

The minister has announced today that the site, and the app, have been improved such that there will be no repeat of the problems previously seen.  The details of what they've changed were unclear; however, a quick check shows that the only thing addressed on the site itself was the compression.  As mentioned above, this was an easy fix and should have been applied before the site went live.  The other points were not addressed.

The wording of the minister's announcement suggests that more hardware has been added.  Perhaps the compression module on the BIG-IP box?  I haven't run any tests to identify whether extra servers have been added, but I suspect that has happened as well.  Yet compression and the other suggestions I've made would take a few minutes of a competent sysadmin's time, so why did it take more than a week for compression to be turned on?  Why aren't the other problems being addressed?  Why are we getting vague assurances that the problems are resolved with absolutely no detail as to what was addressed?  My suspicion is that there is a salesperson somewhere driving around in a very expensive car funded by the commission they have received on upselling the CFA.
