Whether you’re providing a support service or a web app, things will go wrong (a cynical comment to start the new year with, I know!). What’s important is how you deal with them. Part of that is making sure your customers understand what’s going on, and know that you care about getting back up and running so they can carry on with business as usual. It never fails to amaze me how few companies have a plan to tell customers about outages and so minimise the damage to their reputation. Big companies actually seem to fail at this more often than little ones, and there are massive news items to support this, in the form of the “Great Sony Outages of 2010” etc.
What’s particularly frustrating is how simple it is to plan for such scenarios: you need a communication plan. Whilst every issue is different, in every case you need to work out the MEDIUM (how you tell your customers about the problem, be it Twitter, email or otherwise), the UPDATE frequency (how often you’ll update people, and once agreed, make sure you stick to it), and the DISASTER PLAN (what you do if it’s really bad: failover, a DR site, financial remuneration for your customers). If you have all of this in place before disaster strikes, you have one less thing to worry about (after all, you’ve already got an outage or a disaster of some kind on your hands), or at least your initial panic can be played down. As an example, one recent partner of mine was discussing how best to cope with future incidents after a prolonged outage (scheduled maintenance went wrong; even the planned stuff doesn’t always go to plan). I suggested they needed to put together a:
Communication Agreement – in which it was agreed who to contact, how to contact them, and after how long a period of problems. We figured an online bulletin board would be great for frequent updates, as it made it easy to communicate with all customers at once and could be hosted for free on a simple single-page site.
Update Agreement – in the event of a prolonged outage, it was suggested that the communication page be updated every 30 minutes as a reasonable start. A further escalation element was worked into the communication agreement, adding email notification of prolonged downtime.
Finally, the DR plan came into the conversation, and whilst financial recompense was discussed, I came away feeling that if they followed the agreements set out above, it wasn’t such a big issue. Downtime is bad, but NOT KNOWING is worse: it stops the customer from planning around the outage and aggravates them further. It’s also a good chance to think about the issues even the big guys face (we’ve seen Amazon and Microsoft fail in the past few years, and many more giants struggle, and this won’t change any time soon) and how you’d cope. DR is getting increasingly affordable as long as you plan it in advance: backups, DR tests, and failover of KEY or all elements of your infrastructure are entirely possible (I’ve even seen an MS house with a Google Docs DR plan!), so don’t forget to be ready for the worst case scenario!
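For anyone who wants to automate the update agreement rather than rely on someone watching the clock, a minimal sketch in Python might look like the one below. Only the 30-minute interval comes from the agreement described above; the two-hour email threshold, the channel names and the function name are assumptions made purely for illustration, so swap in whatever you actually agree with your customers.

```python
from datetime import datetime, timedelta

# The 30-minute interval is from the update agreement described in the post.
UPDATE_INTERVAL = timedelta(minutes=30)
# Hypothetical threshold: the post only says "prolonged downtime".
EMAIL_ESCALATION_AFTER = timedelta(hours=2)


def channels_due(outage_start, now, last_update=None, email_sent=False):
    """Return the channels (illustrative names) that need a message right now."""
    due = []
    # Status page: post straight away, then again every UPDATE_INTERVAL.
    if last_update is None or now - last_update >= UPDATE_INTERVAL:
        due.append("status_page")
    # Escalation: send one email once the outage counts as prolonged.
    if not email_sent and now - outage_start >= EMAIL_ESCALATION_AFTER:
        due.append("email")
    return due
```

You would call this from whatever scheduler or monitoring loop you already have, passing in when the outage started and when you last posted. The point is not the code itself but that the rules are written down before the outage, so nobody has to make them up mid-panic.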
Happy new year all (and I hope you don’t need my advice!)!