2012/08/12

Battling the WhizKids: How I kept my website up in spite of the Prime Software Contractor

1999, the lead-up to Y2K, was a horrendous year for me. Too many long weeks, too much travel, tight deadlines, multiple competing projects, not enough sleep, and a level of exhaustion that compromised my short-term memory and took around 5 years to mostly recover from...

So at the end of 1999, when a long-term friend, Peter, offered me a part-time contract looking after a few web-servers, it sounded like a doddle. Easy work, reasonable hours and good people to work with. What's not to like?!?!

I got more 80+ hour weeks, a horror project and an exceedingly difficult Software supplier to work with. But because of my efforts, the site stayed up, remained usable and our area was the only part of the overall project to not make the papers (in a bad way). Apparently the project won a Gold Government Technology Award as well.

I later presented a paper and a conference presentation on the Performance aspects at CMG-A.

All this and a very high-profile website: the first large-scale on-line Transaction of the Australian Government: the initial ABN on-line registrations.

While the website and project were done by the Tax Office (ATO), Business Entry Point (BEP), which contracted me, ran a website hosting service out of the Secure Gateway (SGE) facility in the Edmund Barton Building (EBB), then home of the Department of Primary Industries and Energy. 15-25 Agencies used BEP for their websites, maybe more. The ATO had decided, when the GST legislation was passed, that it wouldn't be able to build an ABN registration website (integrated and secure) in the time-frame available and opted to use BEP.

I've never tracked down the cost of the project: BEP had bought around $1M of SUN kit for the ATO servers and firewalls. DSD security had required separate machines for the webserver and Database, with two firewalls, one to the Wild Wild Web and another between the web-servers and DB.
There were another 6-10 small SUN servers at Wizard used for testing.
And multiple Oracle and Netscape licenses.

We added an Alteon load-balancer ($10-20k?) between the firewall and webservers around Jan 2000. One of the best decisions we made.

The whole ATO project, with the BEP systems, could've been $2-3M, with perhaps $1M going to Wizard for its software.
Which, as was pointed out to me, at $5 per completed registration, was very good value for the Australian Government.

BEP was under the "Small Business Branch" of the Department of Employment, Workplace Relations and Small Business. The main offices were in town with 30-50 people, IIRC. It wasn't long before I started working first 40-hour weeks, then much longer - banished, by my own choice, to the EBB basement, home of the SGE.

The GST was to be introduced on 1-Jul-2000, and all entities wishing to pay or claim back input costs needed Australian Business Numbers (ABNs) by then. ABN applications would be processed within a month, giving a deadline of 31-May-2000 to register in time. The ATO expected about 1 million (1M) applications; there were 3.3M. We processed 600,000 complete applications on-line. There were other processing streams via the accountants' electronic submission systems and on paper. The ATO project manager was to receive a bonus if 20,000 (I think) registrations were done via the Web. Their projections were for 2% of applications on-line, not the ~20% we ended with. After the deadline, around 50% of new ABN applications came through the website.

The irony, and I didn't spot this at the time, was that the ATO promised a two-week turnaround for electronically lodged applications - via the Web or accountants. We had another mini-rush around 15-May.

Just to make things interesting, we had to handle 1/1/2000 early on and "someone" had decided that although the systems were all patched and OK, we had to take them off-line for the rollover. I negotiated that back to "leave up a simple static page telling the world what we were doing".

And yet another wrinkle was the Federal Government Outsourcing program: they decided to sell the SGE in Jan 2000. Right in the middle of the lead-up to our deadline. All the SGE staff were let go, but not before training the staff of the successful tenderer. Who of course promptly moved those staff on.

Chaotic doesn't do justice to the initial state of the Secure Gateway under the new owners... They had their own share of political in-fighting, with one of the co-owners/directors eventually being sacked and banned from the site.

The reason I'd been hired is that the long-term Systems Administrator of the Prime Software Contractor, Wizard Information Systems, had quit. This was to be a recurring theme: people we needed left or were reassigned and unavailable. We had multiple people titled "Project Manager" to deal with. None of them overwhelmed us with confidence, good customer service or helpfulness.

The head of BEP at the time was Dr Guy Verney, who navigated this whole storm exceedingly well. I had my own troubles dealing with the "WhizKids", but was unaware that (all) the others in BEP had similar problems. After the ABN project, the BEP staff presented their head with a "Show Cause" letter they wanted given to Wizard - exceptional within the Public Service. Dr Verney met with the head of Wizard, Tony Robey, and the letter was discussed but not served. The BEP staff were promised "it'll all be different". Unsurprisingly, nothing changed...

In October, 2007, Wizard Information Services were put into Administration. A very sad loss to the ICT Industry in Canberra. I'd been employed by Dave Schwartz early on when he was "Wizard and Liveware". One of the best, most honest agents I'd ever met or had.

From what I'd been told, the ATO project started with a closed tender to three suppliers, with Wizard selected. The ATO project manager volunteered to me at one stage that he'd asked Wizard to requote to write the system in 'C'. Wizard had elected to implement the system in a scripting language they'd written, then charged us run-time licenses for it on each platform (web and database servers).

Wizard refused to reimplement their code in 'C', although there appeared to be quite adequate time and resources available.

The Application was wholly CGI (Common Gateway Interface - start a new program for each webpage), which burnt considerable CPU resource for us on the web server. It also posed performance problems because SPARCs at the time could create maybe 100 new processes a second, even on our 8-way web-server.

Half our system load was just starting the CGI processes for each page.
A simple optimisation by Wizard, to reuse a small number of existing processes, would've improved our performance considerably. But that wasn't to be.
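
To make the difference concrete, here is a minimal sketch - my own illustration, not Wizard's code; the names and the one-request-per-line protocol are invented - of a reused worker process versus fork-per-request CGI:

    import sys

    def handle_request(raw_form: str) -> str:
        """Parse a submitted form and return the next HTML page (application logic elided)."""
        return "<html><body>next form</body></html>"

    def worker_loop(stream_in=sys.stdin, stream_out=sys.stdout):
        # Classic CGI pays a fork/exec (and interpreter start-up) for every page:
        # at ~100 process creations a second on late-90s SPARC, that alone caps
        # throughput.  A long-lived worker pays the start-up cost once and then
        # just loops, handling request after request.
        for line in stream_in:
            stream_out.write(handle_request(line))
            stream_out.flush()

    if __name__ == "__main__":
        worker_loop()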

Wizard had not thought to instrument their Application in any way or build any tools, let alone real-time ones, to monitor the systems or the load. For a critical new application, this seemed an incredible oversight to me. Because we used HTTPS/SSL connections, timing requests on the wire wasn't an option; start/stop timers could have been added to the Application trivially, but that was not to be.

I had to invent a method to estimate the response time of the Application.
Luckily, the poor Software Engineering methods of Wizard actually helped me.
We had Unix "process accounting" turned on for security and audit reasons, so I was able to extract some reasonably detailed and fine-grained information on process execution.

The Application consisted of a first web form: "New Application or Continuation".
For 'Continuing' Applications, clients entered a number plus a password. For new Applications, they were issued a number and were requested to enter a password.

Following the initial form, clients would fill in 20-30 additional forms, driven by their individual situation, with each form containing roughly 20 fields. Each Application consisted of ~400 user-entered fields and around 100 extra system fields, such as when the Application was started (and hence when it would be automatically cancelled) and time stamps for a series of events. e.g.: completed, scheduled for transmission, transmitted, response received.

The bizarre thing that Wizard did, that made it trivial for me to identify all the processes of their Application, was that they handled ALL forms in the ONE program. At the start was a massive "if then else" statement or set of "gotos" to figure out "where are we up to" and "what do we need to do this time". It may have been that their wonderful scripting language didn't support libraries or including common code, or they really were just that bad at Software Engineering.

Normal good design would create multiple executables, each one tailored to a single form or a small set of related input forms. This reduces the "footprint" of each running process and separates the various parts of the source code, allowing changes to be tested and deployed quickly, easily and with high confidence.

So I just picked out the process accounting records that matched and made the assumption that the wall-clock execution time would be related to the "response time" experienced in the users' web browsers. This isn't quite right, but without extensive JavaScript such as Google uses, it's a good first approximation.
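
The original scripts are long gone; this is only a sketch of the idea, assuming the accounting records have already been dumped to text (Solaris's acctcom can do that) with the command name in the first column and the elapsed "real" seconds somewhere to the right - the column positions would need adjusting to the real output:

    import sys

    CGI_NAME = "bep"        # the single monolithic Wizard CGI - every form went through it
    ELAPSED_COL = -1        # assumed position of the 'real time' column; adjust to suit

    def cgi_elapsed_times(acct_lines):
        """Yield wall-clock run time (seconds) for each completed CGI process."""
        for line in acct_lines:
            fields = line.split()
            if fields and fields[0].startswith(CGI_NAME):
                try:
                    yield float(fields[ELAPSED_COL])
                except ValueError:
                    pass                     # header or malformed line - skip it

    if __name__ == "__main__":
        times = list(cgi_elapsed_times(sys.stdin))
        if times:
            # One line per minute was enough: average "response time" and page count.
            print(f"{sum(times) / len(times):.2f} {len(times)}")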

I then fed this data - the process accounting logfile, scanned every minute - into a free real-time graphing tool, RRDtool (Round Robin Database tool).
I'd taken it on myself to create a simple system/health monitoring system that graphed the important system load variables - including user response time, as derived above.
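
The graphing side was nothing exotic. Roughly this - the file name, data-source names and retention are illustrative, not the originals:

    import subprocess

    RRD = "abn_response.rrd"

    def create_rrd():
        # One-minute step; 'resp' = average CGI wall-clock time, 'hits' = pages/minute.
        # Keep 24 hours of 1-minute averages (1440 rows).
        subprocess.run(["rrdtool", "create", RRD, "--step", "60",
                        "DS:resp:GAUGE:120:0:U",
                        "DS:hits:GAUGE:120:0:U",
                        "RRA:AVERAGE:0.5:1:1440"], check=True)

    def update_rrd(resp_seconds: float, hits: int):
        # Called once a minute from cron with the numbers derived from process accounting.
        subprocess.run(["rrdtool", "update", RRD, f"N:{resp_seconds}:{hits}"], check=True)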

This simple set of pages was one of the biggest learnings I had from BEP/ABN registrations.
On Feb-29, the ATO advertised the "business.gov.au" website, and the next day we were flooded.
Which, not unpredictably, caused the Wizard code to fail and overload our systems.

"Crashed" is too strong a word. We had an "avalanche" or "cascade failure".
Users would be entering data and getting responses in 3-5 seconds.
And then the site would stop responding - so, automatically, users would click the form's 'submit' button again after around 5 seconds, probably multiple times after an extended delay.

The problem with that is we couldn't stop each of those 'submits' from being processed afresh, because of the combined effects of SSL and CGI. The web server queued the HTTP requests from each user on their dedicated link and processed them in order, regardless of what was already in progress.

So just when the system was slowing down and struggling with the load, users suddenly started madly clicking their mice... I never did track down what the Wizard code did under these circumstances. Given how they handled other "unexpected" situations, I guess very badly.

The overload was resolved in a few days by bringing forward an update of the Wizard scripting engine that improved performance around 10-fold. Which makes you wonder why they didn't do that in the first place. Wizard went on to sell systems with the scripting engine/language that had been paid-for by the ABN project to some other Agencies, notably the ACT Government. They still called the CGI script "bep.....cgi".

This event gained the Ops team three things:
  • Within 2 hours we had a purchase order to SUN for a $100,000 upgrade to the servers. We'd been trying to justify this for some time.
  • I'd expected an overload situation and written a specification for a "Busy Tone" system: when the system was flat-out, reduce demand by telling users to come back again later (a minimal sketch of the idea follows this list). This had been ignored until now, then was scheduled to be written. Wizard, in their inimitable fashion, didn't start work on the code for months. We were delivered the first attempt 3-4 weeks before "Busy Day" [testing? testing? how could we load-test that?], and within a matter of days the "Busy Tone" triggered for the first time. They'd almost scuppered us. Wizard had ignored a very important part of the specification: the "Busy Tone" page needed to be low-impact - a static page, not CGI, and not loading the database. What we were delivered did both of the things it shouldn't... When raised with the developer, the response was "I wasn't told that". Apparently reading detailed specifications wasn't part of the job description, or maybe it was never passed on.
  • The Executive were shown the real-time monitoring system and guided roughly through what the various metrics meant.
    • This and the load-balancer were the two most important technical improvements for Operations.
    • The Executive and senior managers were able to call up the system load charts any time they liked. On our "busy day", 31-May, they monitored the system intensively all day [it was low overhead and ran on a non-production server]. They were able to reassure themselves "all is well" in real-time and without needing to interrupt any technical staff. This empowered and reassured the management team and allowed them to be better informed whilst spending a lot less time and effort doing so. Everyone was very happy.
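
The Busy Tone idea itself was simple. A minimal sketch of what the specification asked for - the threshold, paths and symlink swap here are illustrative; the real mechanism had to fit the Netscape server configuration:

    import os

    BUSY_THRESHOLD = 0.9                              # fraction of measured capacity (illustrative)
    LANDING   = "/www/htdocs/index.html"              # what the outside world requests
    NORMAL    = "/www/htdocs/index.normal.html"       # the real entry page, leading into the CGI
    BUSY_PAGE = "/www/htdocs/index.busy.html"         # static "please try again later" - no CGI, no DB

    def set_busy_tone(load_fraction: float) -> None:
        """Point the landing page at the static busy page when the system is flat out."""
        target = BUSY_PAGE if load_fraction >= BUSY_THRESHOLD else NORMAL
        tmp = LANDING + ".tmp"
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(target, tmp)      # build the new link...
        os.rename(tmp, LANDING)      # ...then swap it in atomically
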
After starting, an application was assigned a unique number and the user's password was stored - probably not 'in clear', but as a hash, like the Unix passwd file.
As well, a random number was generated that was stored as a hidden field in the HTML form.
This security device prevented URL-hacking, as happened to confidential information on some other Government websites at the time.

Oracle supports a virtual table called a 'Sequence' that is guaranteed to return a unique, monotonically increasing sequence of numbers, handling race conditions and real-time demands.
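
For anyone unfamiliar with them, roughly how a sequence is defined and used - shown via cx_Oracle purely for illustration; the sequence name and connection details are made up:

    import cx_Oracle   # any Oracle client library works the same way

    # Defined once by the DBA:
    #   CREATE SEQUENCE abn_app_seq START WITH 1 INCREMENT BY 1;

    conn = cx_Oracle.connect("bep", "secret", "dbhost/ABNDB")   # made-up credentials
    cur = conn.cursor()

    # Every new application grabs the next number - atomic and race-free, no matter
    # how many CGI processes hit the database at the same moment.
    cur.execute("SELECT abn_app_seq.NEXTVAL FROM dual")
    app_number, = cur.fetchone()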

In March or April, there was a serious Application fault: a client had restarted their Application, but incorrectly entered their application number (transposed digits). Somehow the wrong password was accepted and they were able to look around the confidential information of another person/entity - a massive security failure.

This event showed 3 problems in the Wizard process:
  • They used a simple Change Request system suitable for Development, but not for Operations.
    • We were unable to classify the fault as "Severity 1" and insist on an immediate work-around, or call an Emergency Operations meeting and consider taking the site off-line.
    • Despite multiple attempts to escalate the fault, nothing happened.
    • In the longer term we requested changes directed to addressing Operational issues properly, even getting access to their Fault Reporting system, but nothing came of it.
  • The fault had occurred before a public holiday and a weekend. Wizard did nothing to resolve this most serious fault for most of a week.
  • Over the weekend, without knowing anything of the Application or source-code, I guessed that the fault might be related to a race-condition in the initial page: someone 'reconnecting' to an already open session would always get 'password OK' (a sketch of the suspected logic follows this list). It took 5-10 minutes to confirm this thesis on Monday morning and inform Wizard.
    • When given a reproducible test, Wizard were able to fix their code and deliver an update that day.
    • The Wizard team leader volunteered during the next weekly meeting, "You only beat us to finding the fault by 20 minutes". That statement defies logical explanation... How could they have known when their testing or troubleshooting process would've led them to this insight?
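
I never saw the source, so this is only a sketch of the kind of logic I suspected - the reproducible test confirmed the behaviour, not the implementation:

    import hashlib

    open_sessions = {}     # app_number -> in-progress session state
    stored_hashes = {}     # app_number -> password hash saved when the application began

    def check_password(app_number: str, password: str) -> bool:
        h = hashlib.sha256(password.encode()).hexdigest()    # hashing scheme is illustrative
        return stored_hashes.get(app_number) == h

    def login_suspected(app_number: str, password: str) -> bool:
        # Suspected flaw: an already-open session short-circuits the check, so a
        # mistyped (transposed) number that matched someone else's open session
        # came back as "password OK".
        if app_number in open_sessions:
            return True                                      # wrong!
        return check_password(app_number, password)

    def login_correct(app_number: str, password: str) -> bool:
        # The password must be verified on every reconnection, open session or not.
        return check_password(app_number, password)
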
The Wizard scripting language was an on-going source of concern.

One of the later problems occurred when we asked for a copy of the current source code. The updated versions that Wizard loaded onto the Systems were "compiled" versions, and we were not given a compiler or any of the build tools.

Following around a month of disagreement with the (one of many!) Wizard Project Managers - including an attempt first to deny us access to 'their' Intellectual Property and then an attempt to charge us for what we'd already paid for - someone managed to get Wizard to read the (Govt. standard) contract, which assigned Copyright and all Intellectual Property to the client, BEP. Wizard had been duty bound to supply us the source code, but had not. When asked to rectify the problem, they'd been combative and recalcitrant.

Wizard had disbanded the ABN development team and were not interested in addressing performance concerns. Following a lot of cajoling, we got around a 20-fold performance improvement out of the scripting engine and simple caching was added to the Application.

The 'performance tuning' addressed the original approach: an initial SQL select, then reading all records (400-500 per application) ONE by ONE with a 'cursor' - each read being a separate read/write transaction to the database, through the internal firewall.
That network and SQL processing load alone would've reduced our processing capacity 20-50 fold, but I never did test that.

After we'd pounded on them for a few months about this and other problems, Wizard very proudly came in one day and announced they'd implemented this massive speed-up in the Application. They'd issued a single SQL "select *" statement when the Application started and loaded the full set of records returned into a buffer for later use in a single transaction. This reduced the network/DB load to a single read/write, which made it possible for us to serve the load we finally handled on our "Busy Day".
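
In database terms the change was roughly this - a cx_Oracle-style cursor, with the table and column names invented for illustration:

    def read_application_slow(cur, app_number):
        # Before (roughly): fetch the records one at a time, each fetch a separate
        # round trip (and transaction) through the internal firewall - 400-500
        # times for every page served.
        cur.execute("SELECT field_key, field_value FROM abn_data "
                    "WHERE app_number = :1", [app_number])
        values = {}
        row = cur.fetchone()
        while row is not None:
            values[row[0]] = row[1]
            row = cur.fetchone()
        return values

    def read_application_fast(cur, app_number):
        # After: the same select, but the whole application pulled back and
        # cached in one operation - a single round trip per page.
        cur.execute("SELECT field_key, field_value FROM abn_data "
                    "WHERE app_number = :1", [app_number])
        return dict(cur.fetchall())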

In 1999, we had 300MHz SPARC processors that weren't optimised to start large numbers of new processes - but were very good at handling multiple threads within processes.

Wizard were unable or unwilling to re-architect their code to suit the SPARC and use threads.

The WhizKids made some stunning statements underlining their poor grasp of good Software Engineering and complete lack of Operational experience:
  • Forget about handling errors. [In response to a shell script I was crafting]
  • We won't do that, we store everything in the database. [A Project Manager commenting on code that was already written and moving into production.]
  • Why isn't our test site the same as Production? [Really, do you have to ask? It does a whole bunch more things and is a bunch more expensive and larger.]
  • Why would I waste my time doing that for you? [yep, a Project Manager refusing to organise a log book]
  • We did that for performance reasons... [Writing two copies of various DB log files into different places on the same RAID filesystem.]
  • We've tested the code and it's OK. How does recompiling it with new parameters change any of that? [They'd hard-coded path names and Database constants into the code, rather than abstract them into a control or configuration file. They couldn't be convinced that once a binary was tested and accepted, then that exact set of bits was the 'configuration item', not a recompiled version.] 
Although Wizard proudly displayed their ISO 9001 accreditation and everything they produced had those Change Control and Release pages, they had almost no understanding of good quality processes.

More than once they ambushed us with "the bi-monthly major update depends entirely on this new piece of unapproved code being added within 5 days". This did not go down well in a high-security, highly controlled environment where every change had to be carefully inspected, justified and thoroughly tested.

The other "trick" they pulled was: "you signed off/accepted that. It's not our problem."
Using "Quality Systems" as a "Not My Problem"... [And pay us more money, and more, and more to attempt to get it right.]

Every two months, IIRC, Wizard did a major update of their systems.
They were supposed to be fully tested on the in-house test systems we'd supplied.
What they supplied was a bunch of tapes, very long sets of (paper) instructions and commands, and a team of people. These 'upgrades' started with all systems going offline around 9AM and coming back 4-8 hours later. Every upgrade needed a frantic dash back to their office for extra materials/files to be brought across; they never went perfectly or quickly.

The commands were typed on the console of the machines as "super-user" ('god' or 'root').
I eventually refused to be part of these upgrades as they were insecure, unreliable and dangerous.
The WhizKids co-opted a non-technical IT co-ordinator as the "BEP representative" responsible for overseeing the operation and for authorising the new site to "go live". They were logged in as "super-user" to all production systems by the SGE staff.

That day I got a call at home, "how do we recover system libraries we've deleted?"

They'd blindly typed "rm -rf ./" as "super-user" into a command window without confirming the system they were on, or the directory they were in.
This is pretty much the right command to destroy a whole system - which they almost did...

Early on I asked for a detailed set of actions for each role/person in a major upgrade so that it could be co-ordinated. They failed to disclose that they already had extensive documentation and instructions, and charged us for the same document, slightly reordered.

I got an incredulous look when I asked Wizard to install the changes into a PreProduction environment that could be tested beforehand, and to leave the Previous Working Version available as a fallback. With the load-balancer, we could've implemented changes at 09:00 on a Monday, if a proper multi-version release system had been provided.

For high-security production systems, this should be a minimum.
Wizard, although they had extensive test systems, were unable to package their changes in a simple, reversible and auditable manner.

Competent admins would've been able both to derive exactly the file changes between releases, and to package those changes as a set of files plus, if needed, scripts to install them and configure the systems.

The last time I'd seen this level of incompetent chaos, where every system upgrade required a further upgrade or changes to configuration items, was at Customs with "Management Solutions" of Hall and their FINEST accounting/financial management system. They went bust before Wizard.

The big omission in the Wizard solution was Performance: instrumentation, monitoring, reporting and Capacity planning were all missing.

As mentioned above, I designed and implemented a lightweight performance graphing system that proved very useful.

When I first arrived at BEP (September 1999?), I suspected that the design might be prone to failure under excessive load, and did some load forecasting based on a couple of months' data, using some modelling and analysis techniques I'd recently learned.

The results concerned my boss and me: my predictions were for ~22,000 registrations on the final day.
That we were within 5% of this figure was sheer luck. It could've easily been 30% higher or lower.

When BEP approached Wizard for a paper on the performance of their system, this was produced:
  • "our target throughput is 1200 transactions a day, spread evenly over 24 hours. Or 50 transactions an hour. Our test system performs at that level".
  • they ignored extra load from abandoned or restarted applications, or users editing pages.
  • they ignored the additional load from system processes to transmit and track applications to the ATO gateway
  • they ignored the multiple firewalls between the web and database servers
  • And they failed to account for the difference between the test and production databases. [by a factor of 100-1000]
Wizard were unconcerned and unresponsive when we came up with a projection of roughly 20 times their design maximum (~22,000 registrations on the peak day against their 1,200) and pointed out that their system design guaranteed a catastrophic collapse when this occurred - causing severe embarrassment for the ATO, BEP, the Treasurer and the Prime Minister.

My boss spent considerable effort in addressing aspects of this problem:
  • he hired Oracle to perform some real database performance tuning
  • he hired a firm in Sydney to analyse our daily load figures and model our response to load. This would tell us when the system was reaching capacity.
  • he approached SUN about extra capacity, disks and even machines
  • he got Wizard to implement multiple speed-ups of their code.
Oracle identified one simple and important change for 'tuning'. The database server was spending half its CPU time parsing SQL statements - if Wizard chose to preparse their SQL code, it would've doubled the capacity of our backend. But they wouldn't change their code.
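
The 'pre-parsing' Oracle were talking about is what bind variables (prepared statements) give you. A sketch, with invented table and column names:

    def store_field_literal(cur, app_number, key, value):
        # Presumably what the Application did: a unique SQL text for every call,
        # forcing Oracle to hard-parse each statement.
        cur.execute(f"INSERT INTO abn_data (app_number, field_key, field_value) "
                    f"VALUES ({app_number}, '{key}', '{value}')")

    def store_field_bound(cur, app_number, key, value):
        # With bind variables the SQL text never changes: Oracle parses it once,
        # caches the plan and reuses it - in our case roughly doubling the
        # capacity of the database server.
        cur.execute("INSERT INTO abn_data (app_number, field_key, field_value) "
                    "VALUES (:1, :2, :3)", [app_number, key, value])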

As mentioned above, we caught a break when the system overloaded at the start of March. It was the catalyst that allowed us to proceed with Busy Tone, purchase a substantial CPU upgrade and pressure Wizard to continue improving their software.

The absolute best piece of idiocy in the Wizard design/implementation was the database design.

Since Codd and Date came up with the theory behind Relational Databases and SQL, the standard first step with any database design has been reducing it to "third normal form": the most compact and computationally efficient database schema.

What did the WhizKids leave us?? Nothing like a good DB design.

It consisted of just 3 fields: the primary key (a number plus a text 'key') and a value stored in a 255-character "varchar". This might be good for a quick hack or a small test, but for a $1M project it's not just ridiculous, it's as close to professional negligence as I've seen.

The 'key' was some sort of identifier for the various field names on the forms, plus all the system states.
Heaven knows how they dreamt up and kept a master list of all those names. My estimate was 400-500 fields. Perhaps they concatenated standard field names with the originating form, even creating two sub-fields.
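
To make the contrast concrete, a sketch of the two approaches - the column names, sizes and key format are my guesses, not the real schema:

    EAV_TABLE = """
        -- What we got: one generic key/value table for everything.
        CREATE TABLE abn_data (
            rand_id     NUMBER,            -- the random number (see below)
            field_key   VARCHAR2(64),      -- e.g. 'form07.postcode' - guessed format
            field_value VARCHAR2(255),
            PRIMARY KEY (rand_id, field_key)
        )
    """

    NORMALISED_TABLES = """
        -- Roughly what third normal form would look like: one row per application
        -- plus a handful of related tables, all keyed by a small integer.
        CREATE TABLE application (
            app_seq     NUMBER PRIMARY KEY,    -- from the Oracle sequence
            started_at  DATE,
            status      VARCHAR2(20)
        );
        CREATE TABLE applicant_address (
            app_seq     NUMBER REFERENCES application(app_seq),
            addr_type   VARCHAR2(10),          -- postal, business, ...
            street      VARCHAR2(80),
            suburb      VARCHAR2(40),
            postcode    VARCHAR2(4)
        )
    """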

A reasonable estimate for the space needed for all data for one Application might be 8-12Kb.
Instead, we got ~500 records of up to 256 bytes each - on the order of 75-125Kb.

The killer though was the number of records in the single monster database table.
Instead of 2-8 short records per Application, we had 400-500.

The final database for 600,000 applications had over 250 Million records, consuming up to 30Gb (massive at the time), instead of 1-2M records and 3-5Gb.

But it gets worse...

There were two numbers generated by the Application for every new client application: a 'sequence' (an integer) and a random number.

Which should be used as the primary key? The sequence - that's part of what it's designed for.
Databases can optimise indexes for sequential keys, especially integers, very well. For hybrid keys - integers plus text fields - it's not nearly as efficient.

Wizard used the random number as the primary key of the database, hybridised with a text string.
This is such bad practice it defies belief.
It maximises the work that the database has to do to create an index and guarantees worst-case lookup performance.

Why would any knowledgable, competent professional do that? I can't guess.


But it gets worse...

The ATO needed client applications electronically transferred to a front-end system, where they were copied to tape and transferred to the secure back-end processing system. When the records had been loaded, an acknowledgement was sent back to the ABN system to mark the client application as complete and lodged.

The client application was written into a standard interchange format (I never looked into what they used. Could've been XML, could've been structured text, could've been an encrypted binary format).

These messages were sent via email, packaged in a standard "message digest" format with an included digital signature (an MD5 hash?) for end-to-end checking.

As a backup, in case the email system failed, the messages could be dumped to files that were then written to CD-ROM and imported directly into the ATO back-end system.

The files were named for the sequence number; the emails used the random number as the message identifier.

When we created a test disk and tried to cross-check and verify, we found it impossible - there was no common identifier to match the files against the messages.

But it gets worse...

The ATO had carefully specified a message digest format for the emails.
This allows multiple files to be included in a single email message.
Wizard only included one file in each email message.
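
Batching would have been straightforward. A sketch of packing several lodged applications into one message with a per-file checksum - the real ATO format was precisely specified, so this shows only the shape of the idea, with made-up addresses:

    import hashlib
    import smtplib
    from email.mime.multipart import MIMEMultipart
    from email.mime.application import MIMEApplication

    def build_batch(file_paths, batch_id):
        """Pack many lodged applications into a single multi-part message."""
        msg = MIMEMultipart()                            # the ATO spec used a digest layout
        msg["Subject"] = f"ABN batch {batch_id}"
        msg["From"] = "abn-gateway@example.gov.au"       # illustrative addresses
        msg["To"] = "abn-frontend@example.gov.au"
        for path in file_paths:
            with open(path, "rb") as f:
                data = f.read()
            part = MIMEApplication(data, Name=path)
            # Per-file checksum so the receiver can verify each application end-to-end.
            part["X-Checksum-MD5"] = hashlib.md5(data).hexdigest()
            msg.attach(part)
        return msg

    # Usage sketch:  smtplib.SMTP("mailhost").send_message(build_batch(paths, 42))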

At the time, 20,000 messages being sent during the day was a reasonable load on the SGE.
The SGE handled the email traffic for a large number of Agencies.

At one point, the process that selected applications and emailed them to the ATO front-end (it ran on the database server) failed, and this went undetected for a day or two without raising an alarm or being noticed.

Someone from Wizard visited the datacentre, noticed, and "released the queue" (another bone of contention: they failed to inform us of their visits or of the work they performed while there, or to document it sufficiently that it could be duplicated).

At which point, in the middle of the day, it flooded the SGE mail system with 5-10,000 messages, creating a 2-3 hour delay for all email for all Agencies using the SGE.

Wizard implemented some 'fix'.
Then promptly repeated the error... We were NOT popular with the SGE folks.

There must have been more, but that's quite enough.

And what happened next?

  • We survived "Rush Hour". The system hit 100% before 9AM and stayed maxed out until I went home at midnight.
    • The back-end filesystem was overloaded, something I hadn't anticipated, nor instrumented.
    • Part of the effect of this was that DNS requests were being lost, causing email to fail.
  • David and Charles built a replacement system for $100,000 total in 3 months.
    • duplicated hardware, firewalls, load balancer, database, licenses, contractor costs.
    • As with everything involving Wizard: 10 times smaller, cheaper and probably faster without them.
  • I tried to interest the ATO project manager in writing up what we'd learnt as the first major Government on-line transaction.
    • My contract was terminated before it got anywhere; years later I followed up and found nothing had ever been published.
  • In 2006 I found a fragment of the process accounting data and reanalysed it, resulting in the CMG-A paper.
    • In that follow-up study, I understood how close we'd come to a secondary failure: saturation of the database by the Busy Tone pages.
