Anatomy of a Failure: CFA Website

 

As a result of the disastrous fires in our state in 2009, known as Black Saturday, there was a Royal Commission into the fires, their management, and the deaths that occurred as a result.  Coming out of this was a set of recommendations, and changes were instigated by a number of parties.  One of the main parties of interest in the handling of fire services in the state is the CFA (Country Fire Authority).

The CFA has to be congratulated for at least the intent of some of their initiatives post Black Saturday.  One of these is their Fire Ready app for mobile phones, and another was a revamp of their website.  A lot of time, and presumably public funds, went into the developments that resulted in a catastrophic failure of the website and app. So what went wrong?

Initial reports suggest that capacity planning was at issue, with the statement made that they had planned for up to 350 hits per second, but the site received 700 hits per second. Much of the blame was placed on the fact that the app was connecting to the same server as the main site, and part of the solution was to put the app on its own server. Note the singular 'server' there.  I'm not sure if this was just a reporting problem, trying to make it easier for journalists to understand, or whether we are really talking about an under-resourced system that is supposedly meant to assist people in the times of greatest need and stress.  So, aside from the fact that both were on the same infrastructure (let us assume there is actually more than one server there), what can we determine about the possible problems that these 700 hits per second were causing?

Firstly, looking at the site details, the app now uses osom.cfa.vic.gov.au, which appears to be fronted by a Squid proxy cache, often used to act as a web site accelerator, so there could be more than one server behind that.  The main website, www.cfa.vic.gov.au, is behind an F5 BIG-IP appliance, which in its base form is a load balancer, meaning that unless they are spending money for the sake of it, there is more than one server providing the service.  The BIG-IP appliance is quite a useful beasty, and can provide web site acceleration, using compression and other techniques to make life easier.  So why was there a problem?  And apologies for starting to get a bit technical from here on in.

Let's first define what a "hit" might be.  When you load a web page, like www.vic.gov.au, you might think that by the time it finishes loading that is one "hit".  And you'd be wrong.  A web page is made up of a number of files. Each image on the site is likely to be a file, as is each stylesheet and JavaScript file. Each of these files is requested separately and each of these requests results in a "hit".  Taking a look at the structure of the CFA site shows that there are a number of unnecessary files on there.  For instance there are 6 JavaScript files that could be combined into one.  There are images called from CSS that could be combined into "sprites".  Just performing that alone would reduce the number of hits needed to load a single page, and in doing so increase the site's capacity - without resorting to extra hardware or tweaking networking stacks.  Indeed it looks like you could increase the number of page views supported by a factor of at least three, simply by resolving these extraneous hits.
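To put rough numbers on that, here is a small sketch of how page views relate to hits. The 350 hits per second is the planned capacity quoted above; the 30 and 10 requests per page are purely hypothetical figures chosen to illustrate a threefold reduction, not measurements of the CFA site.

```php
<?php
/**
 * Page views a server can deliver per second, given how many
 * requests ("hits") it can answer and how many each page needs.
 */
function page_views_per_second(float $hitsPerSecond, int $requestsPerPage): float
{
    return $hitsPerSecond / $requestsPerPage;
}

// Hypothetical request counts: 30 files per page before combining
// scripts and sprites, 10 files per page afterwards.
$before = page_views_per_second(350, 30);
$after  = page_views_per_second(350, 10);

printf("before: %.1f page views/sec, after: %.1f page views/sec\n", $before, $after);
```

The hit capacity is unchanged in both cases; only the number of hits each visitor consumes is different.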

There is more interesting stuff under the covers.  There is no compression used on any of the assets (page data, images, scripts, stylesheets), so the amount of data having to be sent down the wire is far greater than it needs to be.  A conservative estimate suggests that the site could handle twice the current load simply by turning on compression.  This is usually just a configuration change in the base web server software.  All modern browsers support compression, and it is a rookie mistake not to be using it.
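As an illustration of the kind of saving involved, the sketch below gzips a piece of text in PHP and compares sizes. The stylesheet text is a made-up stand-in, and because it is very repetitive it compresses far better than a real asset would, so treat the ratio as exaggerated. On a real site the web server's own compression setting would do this work rather than application code.

```php
<?php
// Hypothetical stand-in for a text asset such as a stylesheet.
$asset = str_repeat(".nav li a { color: #333; padding: 4px 8px; }\n", 200);

// gzencode() produces gzip output, the format a browser asks for
// when it sends "Accept-Encoding: gzip". Requires the zlib extension.
$compressed = gzencode($asset, 6);

printf(
    "original: %d bytes, gzipped: %d bytes\n",
    strlen($asset),
    strlen($compressed)
);
```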

On the same front, most browsers also try to help by actively caching items that don't change often, meaning the next time you go to the site it is far quicker to load as the browser already has them.  For this to work you need to make sure your web server is set up to help out.  The CFA site is not.  There are two factors here, ETags and Expires headers.  ETags are supposedly unique ids for assets that allow the browser to check that its copy is still current and therefore doesn't need to be re-downloaded (providing it hasn't expired).  The trouble with these is that if you have more than one server supplying the same data, as you would behind, say, a BIG-IP appliance, these ETags are likely to be different for the same resource, meaning that the browser thinks it needs to download the item again as it has changed.  The CFA site has ETags turned on.

Expires headers tell the browser how long it should hold an item in its cache before checking for a new version.  For items that don't change much (like images and stylesheets and scripts) it is common to set a "far future" expiry header, say a month or even a year into the future.  This means the browser doesn't need to worry about these items after the first load.  The CFA site doesn't use Expires headers.  Fixing these two problems could see at least a 50% increase, if not a doubling, of the capacity of the servers.
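A minimal sketch of what a far-future expiry looks like at the header level follows. The function name is hypothetical, and the Cache-Control header is an addition not mentioned above; it is included because it is the usual companion to Expires. Static assets would normally get these headers from web server configuration rather than from PHP.

```php
<?php
/**
 * Build far-future caching headers for an asset that rarely changes.
 * Hypothetical helper for illustration only.
 *
 * @param int $lifetimeSeconds How long the browser may reuse its copy.
 * @param int $now             Current Unix timestamp.
 */
function far_future_headers(int $lifetimeSeconds, int $now): array
{
    return [
        'Cache-Control' => 'public, max-age=' . $lifetimeSeconds,
        // HTTP dates are always expressed in GMT.
        'Expires' => gmdate('D, d M Y H:i:s', $now + $lifetimeSeconds) . ' GMT',
    ];
}

// One year, as in the "far future" example.
$headers = far_future_headers(365 * 24 * 3600, time());

// In a web request each pair would be sent with header("$name: $value").
foreach ($headers as $name => $value) {
    echo $name . ': ' . $value . "\n";
}
```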

Now, without looking at hardware, operating system or web server software, it looks like the site could have handled between 9 and 12 times the traffic it received on Friday.  All it would have taken was someone with experience in developing high capacity websites, or even a sysadmin with capacity planning skills to have foreseen this and averted what was an obviously avoidable calamity.  What a pity.  I hope someone in CFA is reading this, as despite the assurances made about things being done - I don't see any evidence of even the basics being covered.
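The 9 to 12 times figure comes from multiplying the individual estimates given earlier: a factor of three from removing extraneous hits, a factor of two from compression, and between 1.5 and 2 from fixing caching. The sketch below just does that multiplication; treating the gains as independent of one another is a simplifying assumption.

```php
<?php
// Estimates taken from the paragraphs above.
$fewerRequests = 3.0;  // combining scripts and image sprites
$compression   = 2.0;  // turning on compression
$cachingLow    = 1.5;  // "at least a 50%" gain from caching headers
$cachingHigh   = 2.0;  // "if not doubling"

$low  = $fewerRequests * $compression * $cachingLow;
$high = $fewerRequests * $compression * $cachingHigh;

printf("estimated capacity gain: %.0fx to %.0fx\n", $low, $high);
```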

Update 2013-01-15:

The minister has announced today that the site, and the app, have been improved such that there will be no repeat of the problems previously seen.  The details of what they've changed were unclear, however a quick check shows that the only thing addressed on the site itself was the compression.  As mentioned above, this was an easy fix and should have been applied before the site went live.  However, the other points were not addressed.

The wording of the minister's announcement suggests that more hardware has been added.  Perhaps the compression module on the BIG-IP box?  I haven't run any tests to identify if there have been extra servers added, but I suspect that has happened as well.  Yet compression and the other suggestions I've made would take a few minutes of a competent sysadmin's time, so why is it that it took more than a week for compression to be turned on?  Why aren't the other problems being addressed?  Why are we getting vague assurances that the problems are resolved with absolutely no detail as to what was addressed?  My suspicion is that there is a salesperson somewhere driving around in a very expensive car funded by the commission they have received on upselling the CFA.

License confusion and the stripping of rights.

I really thought that the time of license confusion was well past, and that we had all settled down and understood the basis and ramifications of each of the open source licenses.  It seems I was wrong.

Having a look at a piece of code recently I noticed some very familiar code - it was my code, originally released under GPL with the copyright notice clearly stating that it was "part of the collected works of Adam Donnison".  I had done this deliberately and done so with a number of pieces of code that I had built over the years, knowing that they were useful and could be used in other projects.  Imagine my surprise to see that same code, only slightly modified, in another project with the copyright notice removed and the explanation:

 * Note: Previously, this class was mis-licensed as GPL in an otherwise BSD
 *   application. The GPL attempt was in 2003 while the project itself was not
 *   relicensed from BSD to GPL in 2005. In 2007, all further development was
 *   done under the Clear BSD license and all GPL modifications were removed.

Really?  Since when did the BSD license become viral?  Wasn't that the entire reason people complained about GPL and wanted to move to BSD?  There is nothing in the BSD license, or the Clear BSD license, that demands that all code in a project be covered by the same license.  Indeed even prior to 2001 dotProject had code that was under the Voxel Public License (ticketsmith), which was more restrictive than BSD, so there had been precedents for differently licensed parts of the code.  Indeed none of the BSD licenses even has the concept of a "project" or "greater work". Dropping the copyright notice is also a violation of the BSD license, as it is of the GPL, so no matter how you cut it, this action was against both the spirit and the letter of the licenses it purports to uphold.

As to "all GPL modifications were removed", an interesting and, on the face of it, erroneous statement.

I believe I am within my rights to demand that the copyright notice be reinstated, or the code removed.  Now I don't want to get heavy with anyone, but these licenses only work based on strong copyright protection.  My copyright has been violated, and I am now considering what action to take.

This is not the first run-in I've had with this project's developers on their cavalier attitude to copyright notices, but this is by far the most egregious.  I believe their actions were to allow them to make money on the project - in which case I also believe that damages could be sought.  I have no problem with them making money - only not by stripping me of my rights.

Update: I've since spoken with the project lead on the project in question and we've come to an understanding on the issue.

Telstra BigPond Accounting Failure?

This morning I was considering downloading a site image to work on locally, something I do every now and then.  As usual I checked my download quota on BigPond as we are on a limited bandwidth, being out in the sticks and not having access to real internet.  I was astonished to see that the monthly bandwidth was exceeded, by a large margin, and that I was being shaped down to 64kbps.  In fact, the system had started shaping on Sunday.  This was around 2/3rds of the way through the current billing period, and I knew that neither my wife nor I had been doing anything that would account for the whopping 4GB supposedly downloaded since the shaping started.  Indeed, at 64kbps, 4GB in 3 days is theoretically impossible.

Speaking to Telstra accounts was, as usual, tricky.  All they could tell me was exactly what I could see in the usage graph, that I was being shaped and had exceeded my usage.  I explained how it was not theoretically possible to have downloaded the data claimed and explained why - so they decided I would be better talking to tech support.  The person on tech support was really helpful, which is not something I'm used to.  Normally they run through a set script and then give me grief because I don't use Windows, cannot run Internet Explorer, and generally don't like having to explain a technical issue to someone who is obviously non-technical.  Anyway, after exploring the issues, they suggested resetting the wifi password and SSID.  I pointed out that since I live on 20 acres, and unless someone parked their car in my front yard they would have no chance of even seeing the router, it was unlikely to be an issue - although I agreed to do so.  Just before talking to Telstra I checked the usage meter (which updates hourly).  There was 906MB downloaded today.  After the call, and at least an hour after I last checked, I checked again. 1474MB.

Whoa, 568MB in one hour?

I grabbed my calculator.  64kbps is 8KB per second, which is 28MB per hour.  568MB is 20 times that.  Sorry guys, not remotely possible, unless you are telling me that I can be both shaped and download at speeds that I find almost impossible to achieve even when the wind is in the right quarter and the schoolkids are still at school.
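For anyone who wants to check the arithmetic, here is a short sketch using decimal units (1 MB = 1000 KB). It also covers the earlier claim that 4GB in three days is out of reach at the shaped speed: the ceiling works out to a little over 2GB.

```php
<?php
const SHAPED_KBPS = 64;

// 8 bits per byte: 64 kilobits/sec is 8 kilobytes/sec.
$kbPerSecond = SHAPED_KBPS / 8;

// Most that can arrive in one hour, in megabytes.
$mbPerHour = $kbPerSecond * 3600 / 1000;

// Most that can arrive in three full days, in gigabytes.
$gbInThreeDays = $kbPerSecond * 86400 * 3 / 1000000;

// How many times over the ceiling the metered 568MB hour was.
$ratio = 568 / $mbPerHour;

printf("max per hour: %.1f MB\n", $mbPerHour);
printf("max in three days: %.2f GB\n", $gbInThreeDays);
printf("568 MB in an hour is %.1f times the shaped maximum\n", $ratio);
```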

I managed to convince them that there was a real issue, and that we needed to monitor this.  I also changed the password on my account - which I suspect is more likely to be the issue.  They promised to lift the limit on my account, although so far that hasn't happened.  I suspect, once again, I'll have to spend hours on the phone trying to find someone who understands the issue and can do something about it.  Stay tuned for updates.

Open letter to Yahoo! account holders

Yahoo! account holders are having information withheld that they signed up to receive. This is not the fault of the services they signed up with, but rather of the arcane and obscure mechanisms by which Yahoo! supposedly identifies spam sources. If you are a Yahoo! account holder, and you suspect that you are not getting email you should, chances are you are correct. Notify Yahoo! that their mechanisms for problem resolution simply do not work and you are going to vote with your fingers and move to another email provider.

The detail behind this is that I have two virtual servers with a VPS provider, on two completely different subnets, and from the time they were commissioned no email from them has been accepted by Yahoo! for its account holders.  Every message is rejected with an error message that appears to be for persistent spamming.  This is clearly an error, one that I have tried for weeks now to resolve with Yahoo! without success.  They will not admit that their detection of spamming cannot be for mail they have yet to receive, but is instead most likely for a previous owner of the IP addresses in question. They will not provide any clear information even as to the reason that we are - in their quaint and completely incorrect terminology - "deprioritized".

I don't really want prioritised mail services; any level of mail service would be fine with me, and I suspect with those users who in good faith have signed up with one of the services we host.  "Deprioritized" suggests that there is a chance the mail will get through, just not in a timely fashion. What we have is in fact a complete embargo on mail from our servers.

It doesn't matter how you wrap it up or what policies you quote ad infinitum, Yahoo!  If your system starts blocking mail on the first attempt then there is something wrong with your system, not mine.  If it is for prior usage of the IP address, then provide me with a way of showing that the IP address changed hands so you can reset your system. Don't lie to me and tell me that it is temporary and will resolve itself once I get my systems in line with your policies.  It is not temporary, it is a complete block.  My systems are in line with your policies and have been for quite some time.  You will not answer any of my questions with anything other than a pointer to or extract from your policy documents - so I have no idea (and I suspect neither do you) of why I am listed at all, let alone how to resolve it.

So if you are a Yahoo! account holder, stop this nonsense by switching provider. I can't be the only service provider that is affected this way, so even if you are not on one of the services we host, I can assure you there is a good probability you are affected.

If you are Yahoo!, stop this nonsense by acknowledging your system is flawed and fixing it.

Adding dynamic fields to Signups on Drupal

In my day job at SkySQL I work with Drupal as our content management system.  One thing we often need to do is provide a way for people to sign up for events and the like.  One such event is the upcoming SkySQL and MariaDB: Solutions Day for the MySQL® Database, and unlike other events, we needed to take into account the dietary requirements of those wishing to attend.

For events registration we use the Signup module and use a theme template function to provide a set of standard fields.  The code looks something like this:

function ourtheme_signup_user_form($node) {
  $form = array();
  // If this function is providing any extra fields at all, the following
  // line is required for the form to work -- DO NOT EDIT OR REMOVE.
  $form['signup_form_data']['#tree'] = TRUE;

  $form['signup_form_data']['FirstName'] = array(
    '#type' => 'textfield',
    '#title' => t('First Name'),
    '#size' => 40, '#maxlength' => 64,
    '#required' => TRUE,
  );
  $form['signup_form_data']['LastName'] = array(
    '#type' => 'textfield',
    '#title' => t('Last Name'),
    '#size' => 40, '#maxlength' => 64,
    '#required' => TRUE,
  );

  // ... further standard fields are defined the same way ...

  return $form;
}

And so on, building up the elements and then returning the form.  This is great because it allows us to have a standard set of fields for all signup pages, making life a lot simpler when creating content that requires registration.  But the Solutions Day event required an extra field.  I could have done this a number of ways, including putting logic in the template file to check for that particular node and only display the field then, or perhaps some other hack specific to this node.  I, however, don't like specifics and tend to look for a generic solution, as the exception invariably becomes the rule.

For this exercise I wanted to be able to have a way of specifying for a particular node any extra fields that are available for this form.  So I now have in the template.php file the following code:

  // If there is a special field required for this, check and display
  if (!empty($node->field_signup_extra) && !empty($node->field_signup_extra[0]['value'])) {
    // One field definition per line.
    $extras = explode("\n", $node->field_signup_extra[0]['value']);
    foreach ($extras as $field_def) {
      $field_def = trim($field_def);
      if (empty($field_def)) {
        continue;
      }
      // The first element is the field name, the rest are key=value pairs.
      $elems = explode('|', $field_def);
      $field_name = array_shift($elems);

      $form['signup_form_data'][$field_name] = array();
      foreach ($elems as $field_element) {
        // Limit to 2 parts so a value may itself contain '='.
        list($key, $val) = explode('=', $field_element, 2);
        if ($key === 'options') {
          // Options arrive as a comma-separated list.
          $val = explode(',', $val);
        }
        $form['signup_form_data'][$field_name]['#' . $key] = $val;
      }
    }
  }

Now all I need to do is create a field that is non-displayable but contains information to build extra fields.  For example the content that describes the Dietary Requirements field is:

dietary_requirements|title=Dietary Requirements|size=40|type=textfield
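To show what that one line turns into, here is a standalone sketch of the same parsing that runs outside Drupal. The function name parse_signup_extra is hypothetical, and so is the check that skips parts without an "="; the point is only to show the shape of the resulting form element.

```php
<?php
/**
 * Parse one "name|key=value|key=value" definition into a field name
 * and a Form API style element array. Hypothetical helper.
 */
function parse_signup_extra(string $fieldDef): array
{
    $elems = explode('|', trim($fieldDef));
    // The first element is the field name.
    $fieldName = array_shift($elems);

    $element = [];
    foreach ($elems as $pair) {
        if (strpos($pair, '=') === false) {
            continue; // skip malformed parts with no key=value shape
        }
        list($key, $val) = explode('=', $pair, 2);
        if ($key === 'options') {
            $val = explode(',', $val);
        }
        $element['#' . $key] = $val;
    }
    return [$fieldName, $element];
}

list($name, $element) = parse_signup_extra(
    'dietary_requirements|title=Dietary Requirements|size=40|type=textfield'
);
echo $name . "\n";
print_r($element);
```

Note that every value comes through as a string (the size is '40', not 40), and that the title is not passed through t() here.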

The production version does a little more analysis of the input to ensure there are no possible attack vectors, but I've left that out for clarity's sake.

Now, if I have an event (or other content type) that needs extra signup fields, I ensure that the content type has the new Signup Extras field and fill it on the new content with a simple field definition that Signup can use.
