Disaster recovery (DR) for SAP has always been a hot topic, since SAP is one of the most mission-critical environments for any organization. Research company IDC analyzed what it costs a company when a critical application fails. For a Fortune 1000 company, they calculated that cost to be between $500,000 and $1 million per hour of downtime. They further calculated that unplanned application downtime (across all applications) costs a Fortune 1000 company between $1.25 billion and $2.5 billion every year.
I believe that in the case of SAP, the cost of hourly downtime is on the higher side, and can go even higher than $1 million per hour. Failure and downtime of the SAP environment can bring an entire organization, or at least large parts of it, to a halt.
For that reason, significant thought is invested in a DR plan for SAP. If your main site suffers a disaster, such as flooding, fire, or a major earthquake, the DR plan should get you back to normal operations in as short a time as possible and with minimal data loss.
There are several DR options in the SAP world. The emergence of cloud technologies has added options that did not exist just a few years ago; there is also more flexibility today in choosing the DR option that’s right for you. Of course, the various options also carry different price tags. At oXya, we’ve been using all of the DR solutions I’ll cover here with our various customers, according to their needs and budget. There is no single solution that fits everyone. In this blog post, I’ll list the various DR options we’re using with our customers, as well as the pros and cons of each option.
But before diving into the various DR options, let’s clarify two things:
1. We focus on the Production landscape. The SAP environment can be huge, with many landscapes and multiple servers. When speaking about DR, we limit the discussion to the Production environment. Due to cost considerations, organizations are usually not thinking about DR solutions for other landscapes.
2. DR in the SAP world means copying the database log files. All the information handled by SAP is stored in the database (e.g. Oracle, DB2, MS-SQL). Any changes to the database are recorded in the logs. By sending the log files to the remote/DR site, we can recover all the information SAP needs to run properly on the DR site.
About Disaster Recovery Technologies: Synchronous and Asynchronous
This post deals with DR options for SAP, so I don’t want to dive too deep into core replication technologies. However, we can’t skip them entirely, as these technologies play a critical role later on, especially when dealing with HANA. So, here’s a very short description of the two main replication technologies, which also explains which one we use for DR:
Synchronous: the database at your main site will not commit any database change until it has received confirmation that the change has also been replicated to and committed at the DR site. This creates, in essence, two identical sites.
Asynchronous: the database at your main site acts normally and commits changes. All the changes are sent to the DR site, but whether or not they are committed at the DR site does not affect the main site. By definition, there is always a lag between the two sites: the DR site lags behind the main site. The size of the lag depends on latency, which in turn depends mostly on the distance between the two sites.
For a more thorough explanation of synchronous versus asynchronous technologies, see this HDS document; page 4 has a great explanation of synchronous versus asynchronous replication, including a nice comparison table.
As already mentioned, the latency between the two sites depends on the distance between them. If the two sites are distant enough, as required for true DR, the latency will be significant. If you try to implement synchronous replication over that distance, it will bring the database performance on the main site to a crawl and make the database unusable. The reason is that every change, however small, has to wait until it is also committed to the remote database, and those waits quickly pile up. For that reason, a typical DR solution for SAP uses asynchronous replication.
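To put rough numbers on the distance argument, here is a minimal sketch in Python. It assumes light travels through optical fiber at about 200,000 km per second and ignores routing and equipment delays, so real latencies will be higher. The function names are illustrative only and do not come from any SAP or database tool.

```python
# Rough estimate of how distance limits synchronous replication.
# Assumption: light travels through optical fiber at about 200,000 km/s
# (roughly two thirds of its speed in vacuum); real links add routing
# and equipment delays on top of this best case.

FIBER_SPEED_KM_PER_S = 200_000.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case network round trip between two sites, in milliseconds."""
    return 2.0 * distance_km / FIBER_SPEED_KM_PER_S * 1000.0


def max_serial_commits_per_second(distance_km: float) -> float:
    """Upper bound on commits per second for a single session that must
    wait for a remote acknowledgement before each commit completes."""
    return 1000.0 / round_trip_ms(distance_km)


for km in (10, 100, 1000, 3000):
    print(f"{km:>5} km: {round_trip_ms(km):6.2f} ms round trip, "
          f"at most {max_serial_commits_per_second(km):8.0f} serial commits/s")
```

Under these assumptions, two sites 1,000 km apart cannot do better than a 10 ms round trip, so a single session waiting on remote confirmation tops out at around 100 commits per second, before any real-world overhead.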
Another thing I’d like to clarify is the difference between a High-Availability (HA) solution (also known as Active-Active) and a Disaster Recovery solution (Active-Passive), because I have heard people refer to an HA solution as if it were also a DR solution. A high-availability solution means two servers, usually within the same datacenter or in very close proximity, that form a cluster and let you access and use both at the same time. A high-availability solution is not a DR solution, or at least it’s a very bad one. Think about some major disasters of the last decade, such as Hurricane Katrina in New Orleans, Hurricane Sandy in New Jersey, and the tsunami in Japan; a high-availability solution would have been totally destroyed in these cases. That brings us back to the point I made above: for a true DR solution, the DR site must be far away, hundreds or even thousands of miles, in order to avoid the impact of the disaster. It also means, by definition, that the DR solution must be an asynchronous one.
Traditional Disaster Recovery for SAP
Traditionally, what we had in the SAP world was a main server at our main datacenter. In addition, we had a DR datacenter in which there was another server, usually identical to the server in the main datacenter. The traditional SAP approach to DR was to use “log shipping”. This means you gather the database logs at the main datacenter, and ship (send) them over the network (usually over MPLS) to the DR datacenter.
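The shipping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two directories stand in for the database’s archived-log location and a location reachable at the DR site, the file names are made up, and the function ship_new_logs is hypothetical. A real setup transfers the files over the network and has the DR database apply them, which is not shown here.

```python
import shutil
import tempfile
from pathlib import Path


def ship_new_logs(archive_dir: Path, dr_dir: Path) -> list[str]:
    """Copy log files that are not yet present at the DR location.

    Returns the names of the files shipped in this pass, in sorted order.
    """
    dr_dir.mkdir(parents=True, exist_ok=True)
    shipped = []
    for log_file in sorted(archive_dir.iterdir()):
        if not log_file.is_file():
            continue
        target = dr_dir / log_file.name
        if target.exists():
            continue  # already shipped in an earlier pass
        # Copy under a temporary name first, then rename, so the DR side
        # never picks up a half-written log file.
        partial = dr_dir / (log_file.name + ".part")
        shutil.copy2(log_file, partial)
        partial.rename(target)
        shipped.append(log_file.name)
    return shipped


# Small demonstration using temporary directories in place of real sites.
with tempfile.TemporaryDirectory() as tmp:
    main_site, dr_site = Path(tmp) / "main", Path(tmp) / "dr"
    main_site.mkdir()
    (main_site / "log_0001.arc").write_text("changes 1")
    (main_site / "log_0002.arc").write_text("changes 2")
    first_pass = ship_new_logs(main_site, dr_site)
    second_pass = ship_new_logs(main_site, dr_site)
    print(first_pass)   # both files are shipped on the first pass
    print(second_pass)  # nothing new to ship on the second pass
```

Running the function on a schedule gives the basic behavior of log shipping: each pass only moves the logs the DR site has not seen yet.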
The traditional approach has been around since the beginning of SAP, and we’ve been using it for many years. It works great, and many customers are still using this method. This approach is very sturdy and works with any type of infrastructure, but it is the old-fashioned way.
There are at least three drawbacks to the traditional approach: cost, uptime speed, and audits:
1. Cost: using this approach, we need to own both datacenters, and have servers in both of them (or we can lease space in a datacenter for DR purposes, but that’s still a major cost).
2. Uptime speed: newer DR technologies enable us to get the DR site up, running and operational in a shorter amount of time, compared to the traditional approach. I’ll discuss it shortly.
3. Audits: the traditional approach only replicates the database log files. While these are sufficient to get the SAP environment at the DR site up and running, there are additional log files that are created by the SAP system itself whenever a job is performed, an error occurs, and so forth. These files are not replicated when using the traditional approach. These SAP logs can be important for audits, for example.
The rise of cloud solutions has given us additional, newer options for DR. I’ll describe them from the cheapest to the most expensive one; all of these solutions have been used by oXya customers.
DR to the Public Cloud
Generally speaking, one of the cheapest solutions for DR would be to use a public cloud service, such as Amazon Web Services (AWS). If your server needs to be backed up, and it doesn’t have frequent changes (e.g. a web server, front-end applications, or interface applications), then a public cloud can be an option. You back up to a server on AWS, and have that server “sit” there, turned off, so you don’t pay for that service/server until you bring the server up. You only bring it up when a disaster strikes your main servers, or once in a while for updates.
This is a very cheap option to achieve some type of Disaster Recovery, because you would pay very little so long as your servers are turned off (down).
For SAP, however, you’ll need to keep your databases in sync, which means the database server on Amazon must stay online and continuously receive updates.
It’s important to emphasize that this method is NOT recommended for everyone, due to the security concerns of having your production data on a public cloud. You may also run into some difficulties in setting up all your SAP interfaces for the DR within a public cloud (bank interfaces, etc.). Still, this option can become relevant in cases where budgets are very small, and customers can’t afford to invest in one of the other, more expensive DR solutions for SAP. In such a case, some DR is better than none, so this solution can be considered.
How does it work, in practice? The method is quite similar to the Traditional DR method. You install all your DR servers on AWS, shut down all the application servers (to avoid ongoing payment), and keep only the database server live, on a continuous basis, to receive ongoing updates of the log files. You then send the database logs from the main customer site to the DR database server on AWS. Once a disaster occurs, you bring the other servers up and can operate your SAP environment directly from Amazon.
This setup enables budget-strained SAP customers to obtain a fairly cheap SAP disaster recovery option. It is far cheaper than having a physical server at another datacenter, because the only ongoing payment is for the uptime of the database server, which is kept in sync with the logs. All the other servers on Amazon are shut down and cost almost nothing (you still have to pay for the storage used by these servers).
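The cost logic can be illustrated with a small Python calculation. Every rate and size below is a made-up placeholder, not a real price from AWS or any other provider, and the function names are hypothetical. The sketch only shows the shape of the saving: compute is paid for the database server alone, while storage is paid for everything.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def standby_monthly_cost(db_hourly_rate: float,
                         total_storage_gb: float,
                         storage_rate_per_gb: float) -> float:
    """Monthly cost when only the database server is running.

    Stopped application servers are charged for storage only.
    """
    compute = db_hourly_rate * HOURS_PER_MONTH
    storage = total_storage_gb * storage_rate_per_gb
    return compute + storage


def all_running_monthly_cost(db_hourly_rate: float,
                             app_hourly_rates: list[float],
                             total_storage_gb: float,
                             storage_rate_per_gb: float) -> float:
    """Monthly cost if every DR server were kept running all the time."""
    compute = (db_hourly_rate + sum(app_hourly_rates)) * HOURS_PER_MONTH
    storage = total_storage_gb * storage_rate_per_gb
    return compute + storage


# Placeholder figures, chosen only to make the arithmetic easy to follow.
standby = standby_monthly_cost(2.0, 2000, 0.10)
always_on = all_running_monthly_cost(2.0, [0.5, 0.5, 0.5], 2000, 0.10)
print(f"standby: {standby:.2f}, always on: {always_on:.2f}")
```

With these placeholder figures, the stopped application servers account for the whole difference between the two totals, which is the point of keeping them shut down until a disaster occurs.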
This solution can be used with various providers, not necessarily just AWS. oXya, for example, provides this service through its own cloud, and there are additional solutions in the market such as Microsoft Azure. The idea behind all of these is similar: you only pay for servers that are actually being used.
SAP DR using VMware SRM
Another DR option to use with SAP is VMware SRM (Site Recovery Manager). For some of our customers, we implemented and are now using the VMware SRM method, instead of using the Traditional DR method. The difference is that the traditional method uses database-level replication, by sending the database log files. With the VMware SRM method, we perform a full server replication. This means we include all the additional files that are created as part of the SAP operations, such as the SAP logs. All of that additional data can now be replicated directly to the DR site.
With VMware SRM, you have a VMware farm at your primary datacenter. You would also have a VMware farm at the DR site, though this is probably a smaller farm that only satisfies the needs of the Production environment. Then, you perform a VMware SRM replication across these two VMware farms (in other words, you duplicate the full VMs, including SAN replication and the VM setups).
VMware SRM can be based either on Storage replication or on the vSphere hypervisor. Without going into the technicalities behind these two options, Storage replication is usually used when a very low Recovery Point Objective (RPO) is required. That option is somewhat less flexible and requires having the same type of high tier storage infrastructure on both sites.
The VMware SRM method gives you full server replication for SAP, whereas before, with the traditional method, you only had database-level replication. In most cases, the database-level replication is enough, but you still have some work to do before you can get the DR system up and on par with the main, original site.
Therefore, a VMware-based replication will allow for a quicker/shorter Recovery Time Objective (RTO), which is the time for a business process (SAP, in our case) to be restored after a disruption. In addition, you keep all the files that are not residing within the database, and which are lost when using the Traditional DR for SAP.
HANA-specific Disaster Recovery
The last type of DR we should discuss is HANA-specific disaster recovery, because this one is a bit different. HANA usually runs on its own application server (its own appliance), or it can be installed as a Tailored Datacenter Integration (TDI) setup.
However, HANA has its own replication method. For customers who have HANA and want a DR solution, HANA offers a tool called HANA Replication, which replicates the entire HANA appliance to another site. There are several ways of doing that, but first let’s describe the typical setup for HANA.
In a typical setup on the main, Production site, you have one application server running the SAP application. In addition, you have the HANA database running on its own appliance. This database setup is similar to how you did it prior to HANA: you could have had your database server separate from the application server (running an Oracle database, for example).
On the DR site, you need to have another HANA appliance, in order to replicate HANA, and also another application server. Let’s cover the HANA replication first, and then I’ll relate to the replication of the application server.
To replicate HANA from your main datacenter to a DR datacenter, you must have a second operational HANA database on the DR site (and yes, it’s quite costly, as my friend and colleague Melchior du Boullay covered last week in his blog post, Considerations Before Migrating To SAP HANA). You can have any combination of HANA appliance or TDI at your main datacenter and your DR site, that doesn’t matter, so long as both databases are operational and the DR database is at an equal or higher release level.
In theory, there are two ways of performing that replication, as with any other replication: synchronous and asynchronous, the two technologies we started with. However, due to HANA’s enormous speed and performance (after all, it’s an in-memory technology), any attempt to implement synchronous replication without sub-millisecond round-trip latency will practically bring the HANA database performance to a crawl, and make it unusable. You will lose all the benefits of HANA. This is why, in HANA’s case, asynchronous replication is the only practical solution for DR. Synchronous replication is really only an option for High Availability, where both HANA databases sit next to each other; and even then it requires careful consideration.
As for the application server itself, there are two methods for replication:
1. Using VMware SRM: if you have your primary application server on VMware, you can use the SRM method we mentioned above to keep the DR copy in sync with your original application server.
2. Install another server: alternatively, if you’re not using any kind of virtualization, all you need is to install a fresh application server that has the exact same SID and system number as your original application server, and then shut this DR server down. You can bring it up when you need to switch to the DR site, and it will work fine. The only things you would lose are the SAP logs and potentially some interface files. Still, the SAP environment will work just fine: it will allow you to log into the system, and you will see all the transactions and all of your data.
Handling RPO in SAP
Recovery Point Objective (RPO) is defined as the maximum targeted period of time in which data might be lost, due to a major incident (disaster). In other words, how much data can you “afford” to lose in case of a disaster, as defined in your Business Continuity Plan. How is this handled in various SAP disaster recovery methods?
The answer is that it varies, depending on your DR solution. In the Traditional method, your RPO depends on the size of your database logs. The bigger the logs, the bigger the RPO. The smaller the logs, the smaller the RPO. However, if your logs are too small, you’ll see a performance impact, because the database needs to create a lot of files very frequently. Hence, there’s a balance to be struck. The RPO is always the result of a discussion between the customer and oXya’s SAP consultants, to define what RPO is acceptable for the customer. Once the RPO is defined, oXya’s experts set the size of the database logs to match that RPO.
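The relationship between log size and RPO can be sketched with a simplified model in Python. The model assumes a log file is shipped only once it is full, so the worst-case loss is the time needed to fill one log plus the time needed to transfer it. It ignores time-based log switches and other database-specific behavior, and the function names and figures are illustrative only.

```python
def worst_case_rpo_minutes(log_size_mb: float,
                           change_rate_mb_per_min: float,
                           link_mb_per_min: float) -> float:
    """Worst-case data loss window under the simplified model:
    time to fill one log file plus time to transfer it."""
    fill_time = log_size_mb / change_rate_mb_per_min
    transfer_time = log_size_mb / link_mb_per_min
    return fill_time + transfer_time


def max_log_size_mb(target_rpo_minutes: float,
                    change_rate_mb_per_min: float,
                    link_mb_per_min: float) -> float:
    """Largest log size that still meets the target RPO (inverse of the
    function above)."""
    minutes_per_mb = 1.0 / change_rate_mb_per_min + 1.0 / link_mb_per_min
    return target_rpo_minutes / minutes_per_mb


# Illustrative figures: 50 MB of changes per minute, a link that moves
# 500 MB per minute, and 500 MB log files.
rpo = worst_case_rpo_minutes(500, 50, 500)
size_for_15_min = max_log_size_mb(15, 50, 500)
print(f"worst-case RPO: {rpo:.1f} min")
print(f"largest log size for a 15 minute RPO: {size_for_15_min:.0f} MB")
```

The same model shows the other side of the balance: halving the log size halves the worst-case RPO, but it also doubles the number of log files the database has to create for the same workload.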
For VMware SRM, the RPO can vary between zero (using synchronous storage level replication; again, not recommended across long distances) and 24 hours, depending on the replication settings. It’s important to clarify that 24 hours is not a realistic RPO in the SAP world, but rather the maximum RPO that is set by VMware SRM (page #48). A typical RPO if not using the storage replication option is 15 minutes.
So which DR method is preferred for SAP?
This is the million-dollar question, which oXya’s SAP experts are frequently asked.
The answer: there is no single disaster recovery solution that will be best for all cases. The DR solution needs to be adapted to the customer’s environment, and most importantly to the constraints of each customer. DR is always a compromise between how much money you’re willing to pay, and how much protection you get. oXya’s experts work within the constraints that you set, in order to build the best DR solution possible for your SAP environment.
The Traditional method is still used by many of our customers. It works very well, and it has proven over the years to be highly reliable. If a customer comes to us and asks about DR, and this customer has no specific constraints, then we start the discussion with the Traditional method, and explain the various constraints of that method. If the customer is comfortable with these constraints, then we move forward with it.
If the customer requires a more sophisticated DR solution, then we discuss the VMware replication solution. However, implementing the VMware solution depends on whether the customer has already virtualized their SAP environment. If they are still running SAP on physical servers and are not considering virtualization, then SRM is irrelevant.
And if the customer has severe budget constraints, we will talk about an AWS-type DR solution. This is a relatively cheap DR solution, but it comes with its own constraints, which I listed above.
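The starting points described above can be summarized as a small decision function in Python. This is only a sketch of the discussion flow in this post, not a tool oXya uses: the function name, the three inputs, and the order in which they are checked are assumptions, and a real decision also weighs RPO, RTO, security, and HANA-specific needs.

```python
def suggest_dr_starting_point(severe_budget_constraints: bool,
                              needs_full_server_replication: bool,
                              virtualized_on_vmware: bool) -> str:
    """Return the DR option to discuss first, following the post's flow."""
    if severe_budget_constraints:
        # Cheapest option, with the public cloud constraints noted earlier.
        return "public cloud DR"
    if needs_full_server_replication and virtualized_on_vmware:
        # SRM is only relevant when SAP already runs on VMware.
        return "VMware SRM"
    # Default starting point when there are no specific constraints.
    return "traditional log shipping"


print(suggest_dr_starting_point(False, False, False))
print(suggest_dr_starting_point(False, True, True))
print(suggest_dr_starting_point(True, False, True))
```

Note that a customer who wants full server replication but still runs SAP on physical servers falls back to the traditional method in this sketch, since SRM is not an option for them.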
Dominik Herzig is VP of Delivery for US & Canada at oXya. He has 10 years of SAP experience, starting as a Basis Admin and then moving to SAP project management and account management. He was one of the first few people to open oXya’s US offices back in 2007, and has performed numerous projects moving customers’ SAP environments to a private cloud, including disaster recovery solutions.