Recently, Forbes published an article about the Immortal Enterprise, the one that never stops and never gets disrupted. To quickly sum it up:
– Unplanned downtimes can kill your enterprise
– Planned downtimes prevent 24/7 operations
– You need a good Disaster Recovery (DR) solution
– Real-time duplication is the key to a good DR
– A downtime is a downtime, planned or not, and should be considered business critical
I would say that those “tips” from Forbes are almost common sense, something anyone who deals with infrastructure (critical or not) should bear in mind. Nevertheless, in my experience every major company, startup or project I’ve worked for lacked those common-sense rules. Also, the younger the team and the more capable it is of handling new technologies, the worse the DR plan tends to be, if there is one at all.
If you go even deeper into the world of startups (or new companies), disaster recovery is seen as a mitigation, just like security, and not as part of the plan or as a regular task like development or systems operations. The question “if something fails, what should we do?” always comes last, after deployment.
Bottom line, disaster recovery deals with a major, unplanned downtime that has severe consequences for the infrastructure, some of them unrecoverable. With that in mind, we should look at all the kinds of downtime:
– scheduled, when you are doing some maintenance, upgrade or deployment
– planned, when you expect that something might go wrong
– unplanned, when none of the above applies
So a good downtime or disaster recovery plan should always answer these questions:
– How critical is it? Business, infrastructure, application?
– Can we roll back, relaunch or restart?
– Do we have all data safely stored and quickly available? And what about infrastructure?
– How fast can we recover?
– Do we have the necessary resources (technical and human) to do the recovery safely and fast?
– Who should be informed before, during and after the recovery?
– Who is in charge? How does communication flow between technical teams, management and customers?
– Do we have alarms and monitoring in place to quickly alert everyone?
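One way to keep these questions from being forgotten is to write the answers down in a form a script can check. What follows is only a minimal Python sketch of that idea: the `RecoveryPlan` class, its field names and the sample values are all made up for illustration and do not come from any particular tool or from the Forbes article.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class RecoveryPlan:
    """Hypothetical checklist: one field per question, empty means unanswered."""
    criticality: Optional[str] = None            # business, infrastructure or application
    recovery_action: Optional[str] = None        # roll back, relaunch or restart
    data_location: Optional[str] = None          # where the data is safely stored
    infrastructure_source: Optional[str] = None  # how the infrastructure is rebuilt
    recovery_time_minutes: Optional[int] = None  # how fast we expect to recover
    resources: Optional[str] = None              # technical and human resources
    people_to_inform: List[str] = field(default_factory=list)
    person_in_charge: Optional[str] = None
    communication_flow: Optional[str] = None     # teams, management, customers
    alerting: Optional[str] = None               # alarms and monitoring in place


def unanswered(plan: RecoveryPlan) -> List[str]:
    """Return the names of the questions that still have no answer."""
    missing = []
    for f in fields(plan):
        value = getattr(plan, f.name)
        if value is None or value == "" or value == []:
            missing.append(f.name)
    return missing


def is_ready(plan: RecoveryPlan) -> bool:
    """A plan counts as ready only when every question has an answer."""
    return not unanswered(plan)


if __name__ == "__main__":
    # Sample values are invented for the example.
    draft = RecoveryPlan(criticality="business", person_in_charge="ops lead")
    print("Ready:", is_ready(draft))
    print("Still unanswered:", ", ".join(unanswered(draft)))
```

A check like this could run as part of a deployment, so the question of what to do when something fails is raised before going live and not after.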
To help answer these, I’ll just give a few pointers:
– ITSM: helps with all the procedures and flows
– Disaster Recovery: the basics
– Netflix API Deployment: a good approach to continuous deployment
– eBay Zero Downtime: how DevOps can be in charge without causing disruptions
– Wix Media: an example of how to use Google Cloud for disaster recovery
– AWS for DR: how to use Amazon for your disaster recovery solution