If you've been in IT for very long, you've been asked the dreaded question: "So, how long until the server is back up?" This is the moment where you need to shine. At no other time are you more important than when IT stuff isn't working. Sure, you add value to the organization by deploying new applications and streamlining processes to save money, but when it comes down to it, they need you when stuff isn't working...
That being said, are you ready? Or are you lying awake at night worrying about a key application server going down? Rather than stress, get energized and prepared for when things aren't working so you can make them right, fast.
RTO/RPO Primer:
You've heard these before, RTO and RPO, but have you really given much thought to how they apply to your organization? Your business line applications? Your user types?
First let's define them: (Right from Wikipedia, no reason to reinvent the wheel)
RPO - Recovery Point Objective. Recovery point objective (RPO) describes the acceptable amount of data loss measured in time.
RPO Real world:
Ask yourself this question for each application, group of users and individual user you want to protect data for:
What would happen if the ____________ application/group/user were to lose ___________ minutes/hours/days of data?
Examples:
What would happen if the Accounting Department (group) were to lose 4 hours (time) of data?
What would happen if the online orders web store database (application) were to lose 24 hours (time) of data?
When you answer these questions you will understand the pain this causes the organization (and you), and you can usually quantify it in dollars and cents.
Scenario 1: Accounting has 10 people making an average wage of $40,000 a year (~$20/hour). The work they did for 4 hours would have to be redone (best case assuming it can be redone) possibly causing overtime to redo it, getting the department backlogged on new work and possibly irritating a few people. At a simple hourly rate 4 hours x 10 people x $20/hour = $800. Not much money in this scenario but possibly avoidable.
Scenario 2: The online web store goes down hard due to a power supply failure, the database is corrupted and data has to be restored. Depending on what the online store is selling and how much is sold in an average day, the lost sales could be significant or insignificant, but the other factor is customer perception. The customers that placed orders online are expecting their products. Does each customer have to be contacted to re-place their order? Do you even know who the customers were, since the database was corrupted? Hmmmm? How much data, measured in time, can be lost before it hurts too much? This is what you need to figure out.
RTO - Recovery Time Objective. The recovery time objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
RTO Real world:
Ask yourself this question for each application, group of users and individual user you want to protect data for:
What would happen if the ____________ application/group/user could not access their data/application(s) for ___________ minutes/hours/days?
Examples:
What would happen if the Accounting Department (group) could not access their data/application(s) for 2 hours (time)?
What would happen if the online orders web store database (application) could not access its data/application(s) for 16 hours (time)?
Keep in mind, this is not the same as RPO. RTO has nothing to do with how much data (measured in time) you are willing to lose. It has to do with how long you can live without access to a specific application/data set/computer/server.
As with RPO, once you answer these questions for RTO, you will understand the pain involved when each specific resource is inaccessible, so you can quantify it in dollars and cents.
Scenario 1: Accounting has 10 people making an average wage of $40,000 a year (~$20/hour). Losing access to the system means they cannot do their work. At an hourly rate 2 hours x 10 people x $20/hour = $400.
This is in addition to the RPO loss of $800. In essence, this application going down and having to be recovered to a recovery point (RPO) of 4 hours with an RTO of 2 hours costs the organization, at a bare minimum, $800 + $400 = $1,200.
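The Scenario 1 arithmetic can be sketched as a small Python helper. This is only a rough model under the same simplifying assumptions as the text (a flat hourly rate, no overtime or backlog costs); the function name `outage_cost` and its parameters are made up for illustration.

```python
def outage_cost(people, hourly_rate, rpo_hours, rto_hours):
    """Rough labor cost of an outage, in dollars.

    rpo_hours: hours of work lost and redone (data loss window)
    rto_hours: hours staff cannot work while the system is down
    Returns (rework_cost, downtime_cost, total_cost).
    """
    rework = people * hourly_rate * rpo_hours
    downtime = people * hourly_rate * rto_hours
    return rework, downtime, rework + downtime


# Scenario 1: 10 accounting staff at ~$20/hour, 4 hours of data lost, 2 hours down
rework, downtime, total = outage_cost(people=10, hourly_rate=20, rpo_hours=4, rto_hours=2)
print(f"RPO rework: ${rework:,}  RTO downtime: ${downtime:,}  Total: ${total:,}")
# prints: RPO rework: $800  RTO downtime: $400  Total: $1,200
```

Plugging in your own headcount, rates and candidate RPO/RTO values gives a quick floor on what an outage costs for each group.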
Scenario 2: The same online web store went down hard due to a power supply failure. Unfortunately you didn't have a spare power supply, so you ordered one with overnight shipping. 16 hours later, you have the replacement. Then you have to recover the database from your tape backups, which are a day old because of your 24 hour RPO, and that takes 2 more hours. 24 hours RPO (age of last tape backup) + RTO of 18 hours downtime (16 hours waiting for the power supply and 2 hours for data recovery) = 42 hours of lost past/redone/projected income. If this online store does a few hundred dollars a day, no big deal, but if it is the main income generator for the organization, this could be extremely expensive.
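To put a rough dollar figure on Scenario 2, the hours can be combined and scaled by daily revenue. The sketch below is Python; the function names and both daily revenue figures ($300 and $50,000) are hypothetical values chosen only to show the contrast, and it assumes income is spread evenly across the day.

```python
def exposure_hours(rpo_hours, wait_hours, restore_hours):
    """Return (rto_hours, total_exposure_hours) for an outage."""
    rto = wait_hours + restore_hours
    return rto, rpo_hours + rto


def exposure_cost(total_hours, daily_revenue):
    """Income at risk, assuming revenue is spread evenly over 24 hours."""
    return total_hours / 24 * daily_revenue


# Scenario 2: 24 hour old backup, 16 hours waiting for the part, 2 hours restoring
rto, total = exposure_hours(rpo_hours=24, wait_hours=16, restore_hours=2)
print(rto, total)  # prints: 18 42

# Hypothetical daily revenue figures, for illustration only
print(exposure_cost(total, 300))     # prints: 525.0
print(exposure_cost(total, 50_000))  # prints: 87500.0
```

The same 42 hours is a rounding error for the small store and a serious loss for the large one, which is why the right RPO and RTO differ from application to application.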
What's Ideal?
What is an ideal RPO? Zero, no data loss, no changes in data lost.
What is an ideal RTO? Zero, recovery time of zero is ideal.
The problem is, Zero, when it comes to RPO and RTO, can be very expensive and in some cases nearly impossible to achieve. Data is always changing. Some applications are only installed on a single system; if that system goes down, so does the application. In lieu of Zero, you have to get real and find out what RTO and RPO are right for your applications/groups/users, one by one. Then get from where you are now to where you need to be. You'll sleep better, you'll be doing a great service to your organization and you'll have happy users/customers. Sounds pretty good, right?
How do you get there from here?
First: Ask yourself the RPO and RTO questions for each application, group of users and individual user you want to protect.
What would happen if the ____________ application/group/user were to lose ___________ minutes/hours/days of data?
What would happen if the ____________ application/group/user could not access their data/application(s) for ___________ minutes/hours/days?
Second: Focus on what hurts most, then work your way backward to the least important items.
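One simple way to find "what hurts most" is to list a cost estimate for each application/group/user and sort from highest to lowest. The Python sketch below assumes you already have such estimates; the function name `rank_by_pain` is made up, and apart from the Accounting figure from Scenario 1 the entries are hypothetical placeholders.

```python
def rank_by_pain(estimates):
    """Sort (name, estimated_cost) pairs from most to least costly."""
    return sorted(estimates, key=lambda item: item[1], reverse=True)


estimates = [
    ("Accounting Department", 1_200),  # Scenario 1 total from the text
    ("Online web store", 87_500),      # hypothetical estimate
    ("Marketing file share", 300),     # hypothetical estimate
]

for name, cost in rank_by_pain(estimates):
    print(f"{name}: ${cost:,}")
# prints:
# Online web store: $87,500
# Accounting Department: $1,200
# Marketing file share: $300
```

Work down the list from the top, tightening RPO and RTO where the dollars justify it.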