Disaster recovery (DR) can be really expensive these days. Think about having a mirror site at a remote location. Every single server, disk array, SAN fabric component, and networking box is duplicated at the DR facility. Expensive replication technology is used to write data simultaneously at both sites - and that takes an awful lot of high-priced bandwidth. Pretty soon, the budget has skyrocketed and plans have to be pared down.
Making matters worse, there may be a disconnect between the IT side and the rest of the business. A recent study conducted by Harris Interactive reveals that 71 percent of IT respondents identified Disaster Recovery/Business Continuity (DRBC) as very important or crucial to business success, versus 49 percent of business respondents. As a result, IT isn't getting the budgets necessary to provide uninterrupted service.
The good news is that there are some technologies around that provide a decent level of protection at reduced cost. In addition, there are ways to cut corners without incurring serious risk. These include strategies that prioritize systems and allocate dollars according to business needs.
Fidelity Bank of Edina, Minnesota, for example, found networking and bandwidth to be major constraints to its DR planning. It began replicating data to a remote site in order to reach its recovery objectives. However, it quickly discovered that just adding replication boxes at each site wasn't enough.
"We started sending 40-45 GB of data per day between our main bank and DR facility," says Rick Erickson, assistant network administrator at Fidelity Bank. "As the volume of data increased, we realized that we would have to double or triple our bandwidth to accommodate demand."
Initially, Fidelity planned to replicate entire server images across the WAN. But that idea had to be scrapped. The company reevaluated its priorities and decided to relay only specific system data. Email, for instance, was found to be less mission-critical than IT originally believed. On the other hand, a few basic applications weren't initially rated highly, yet experience revealed them to be crucial to fast recovery. Erickson says, therefore, that it is essential to take a fresh look at the entire infrastructure during DR planning to evaluate priorities appropriately.
"A wrong guess can cost significant time and money," says Erickson. "You find out through the planning process what applications have to be up sooner than later. "
While this prioritization strategy avoided a major additional infrastructure build-out, it scaled back the bank's recovery agility. In the event of a disaster, this approach alone would mean rebuilding a lost server from recovered data. As this considerably slowed the process, it defeated the purpose of rapid recovery.
After a rethink, the organization turned its attention to WAN optimization. It selected NX appliances by Silver Peak Systems Inc. of Santa Clara, CA.
This is representative of an ongoing trend - the rise of bandwidth and latency optimization techniques such as those used by Fidelity.
"WAN optimization continues to be popular as people understand issues and requirements centered upon data movement and effective bandwidth," says Greg Schulz, an analyst at StorageIO in Stillwater, MN.
Fidelity's Silver Peak machines sit between system resources and the WAN infrastructure. NX appliances cut down on the amount of data moving across the WAN. They also improve application performance and encrypt the data.
"We have achieved a consistent percent reduction in traffic across the WAN," says Erickson. "1 GB transfers have been reduced from 70 minutes to 4 minutes."
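As a rough check on those figures, the arithmetic below converts the quoted transfer times into effective throughput and applies them to the 40-45 GB daily volume Erickson mentions. It is a back-of-the-envelope sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) and a steady transfer rate; the function names are illustrative and not part of any Silver Peak tooling.

```python
# Back-of-the-envelope WAN arithmetic using the figures quoted in the article.
# Assumptions: 1 GB = 10**9 bytes, and the link runs at a steady rate.

BYTES_PER_GB = 10**9


def effective_mbps(gigabytes: float, minutes: float) -> float:
    """Effective throughput in megabits per second for a timed transfer."""
    bits = gigabytes * BYTES_PER_GB * 8
    return bits / (minutes * 60) / 1e6


def hours_to_send(gigabytes: float, minutes_per_gb: float) -> float:
    """Hours needed to move a daily volume at a given per-GB transfer time."""
    return gigabytes * minutes_per_gb / 60


before = effective_mbps(1, 70)   # about 1.9 Mbps
after = effective_mbps(1, 4)     # about 33.3 Mbps
time_saved_pct = (70 - 4) / 70 * 100  # about 94 percent less transfer time

print(f"before: {before:.1f} Mbps, after: {after:.1f} Mbps")
print(f"transfer time reduced by {time_saved_pct:.0f}%")
print(f"45 GB/day before: {hours_to_send(45, 70):.1f} h")
print(f"45 GB/day after:  {hours_to_send(45, 4):.1f} h")
```

If the 70-minutes-per-gigabyte rate applied to all replication traffic, 45 GB would need about 52.5 hours of link time, more than a day can supply. That is consistent with Erickson's concern about having to double or triple bandwidth. At 4 minutes per gigabyte, the same volume fits in about 3 hours.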
Schulz says this story highlights why remote data replication has become so important. Even small and mid-sized businesses (SMBs) can take advantage of this technology as it becomes more affordable.
"Many entry-level storage systems are also supporting some form of remote replication built into the solution as opposed to requiring external appliances or host-based software," says Schulz. "The reason why is that SMB data is just as much exposed and at risk as larger environments."
Other DR Options
Some companies, however, don't want critical business application availability dependent on the WAN. For those that are happy with once-a-day backup or have a significant investment in an existing tape infrastructure, a Virtual Tape Library (VTL) might be a good option. A VTL integrates seamlessly into current backup programs such as NetBackup, Backup Exec or Legato. It can drastically reduce the size of backup windows.
"We have a couple of large servers that hold about 1.5 terabytes of data," says Andrew Ferguson, Enterprise Operations Manager at Brookhaven National Laboratory in Upton, N.Y. "Doing a full backup on it would take days."
The Laboratory has around 160 servers in its administrative data center along with a SAN fabric. To address the dual issues of backup and restoration, Brookhaven installed an S2100-ES VTL from Sepaton, Inc. of Southborough, Mass. The laboratory uses EMC Corporation's (Hopkinton, Mass.) NetWorker to back up about 45 TB. Most of it goes to tape, but the most critical data goes to the Sepaton VTL. Backing up a 1.5 TB server now takes less than 22 hours.
"We can start it on Friday night and it is done sometime Saturday, which is perfect," says Ferguson. "We recently restored that data to another server in a 24-hour window."
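To put the backup window in perspective, the sketch below works out the sustained throughput implied by moving 1.5 TB in 22 hours. It assumes decimal terabytes (1 TB = 10^12 bytes) and a constant rate; because the article says "less than 22 hours," the result is a lower bound on the actual rate. The helper names are illustrative.

```python
# Rough backup-window arithmetic. Assumes 1 TB = 10**12 bytes and a constant rate.

BYTES_PER_TB = 10**12


def throughput_mb_per_s(terabytes: float, hours: float) -> float:
    """Average throughput in MB/s needed to move a volume within a window."""
    return terabytes * BYTES_PER_TB / (hours * 3600) / 1e6


def window_hours(terabytes: float, mb_per_s: float) -> float:
    """Hours needed to move a volume at a sustained throughput."""
    return terabytes * BYTES_PER_TB / (mb_per_s * 1e6) / 3600


rate = throughput_mb_per_s(1.5, 22)   # 1.5 TB in 22 hours
print(f"sustained rate: {rate:.1f} MB/s")
print(f"window at the same rate: {window_hours(1.5, rate):.1f} h")
```

The same two functions can be used in reverse during planning: pick the window the business can tolerate, then see what sustained rate the backup target has to deliver.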
And he says restorations from the VTL are more reliable than restoring from tape.
"When we first got it everyone was a little nervous about keeping everything on disk with no physical tapes," says Ferguson. "We've had this for over two years now and have upgraded and expanded on it because we haven't lost anything."
For applications that need more rapid recovery - say, in less than 30 minutes - Continuous Data Protection (CDP) is another possibility. A wide range of vendors, including FalconStor, offer CDP. This technology continuously captures changes made to data and can transfer them offsite. For remote offices, this can be done periodically.
"Most customers take 4 to 8 snapshots a day, depending on the amount of data change and size of the remote office," says Christopher Poelker, vice president of enterprise solutions at FalconStor Software. "CDP can protect files or complete servers depending on the requirements of the applications."
FalconStor Replication is used to replicate data from remote offices to the main data center. This can all be managed centrally.
"Continuous data protection (CDP) has gotten a lot of hype," says Schulz. "However, adoption has been limited, but that should change with more organizations leveraging it with data backup as part of an overall data protection strategy."
Virtualization is another possible way to cut DR costs. Lowell, MA-based Acopia Networks, Inc., for example, provided a file virtualization solution to Wiley Publishing, Inc., a provider of print and electronic products (including the "For Dummies" book series). Acopia ARX systems and FreedomFabric software are used to support its storage tiering and replication for DR.
Long backup windows and the high cost of its tape backup infrastructure led Wiley to Acopia to find a way to automatically replicate data to a remote site.
"We needed to refine our backup strategies to ensure the availability of our growing data stores," says James Sample, director of IT Infrastructure at Wiley. "We recognized that if one of our filers had a problem, we faced the real possibility of it taking one to two days to restore operations. This was an unacceptable risk."
Using file virtualization, data is automatically tiered based on its age. This alone provides significant savings in tape costs and backup time. Older files are archived or removed as needed. To date, Wiley has replicated 10 TB of file data.
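The idea of age-based tiering can be sketched in a few lines. This is a minimal illustration of the concept only, not how the Acopia ARX implements it; the tier names, the 90-day and 365-day thresholds, and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FileInfo:
    path: str
    last_modified: date


# Hypothetical thresholds; a real policy would come from business requirements.
PRIMARY_MAX_AGE_DAYS = 90
SECONDARY_MAX_AGE_DAYS = 365


def choose_tier(f: FileInfo, today: date) -> str:
    """Pick a storage tier from file age alone."""
    age_days = (today - f.last_modified).days
    if age_days <= PRIMARY_MAX_AGE_DAYS:
        return "primary"
    if age_days <= SECONDARY_MAX_AGE_DAYS:
        return "secondary"
    return "archive"


today = date(2008, 1, 1)
files = [
    FileInfo("/projects/current/draft.doc", date(2007, 12, 15)),
    FileInfo("/projects/2007/report.doc", date(2007, 5, 1)),
    FileInfo("/projects/2004/old.doc", date(2004, 3, 1)),
]
for f in files:
    print(f.path, "->", choose_tier(f, today))
```

Only the "primary" tier would need frequent backup and replication under a policy like this, which is where the savings in tape and backup time would come from.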
"The Acopia ARX also automatically replicates our most critical files to failover servers, ensuring recoverability from disaster," says Sample.
Cost Reduction Through Data Center Consolidation
Many organizations have evolved sprawling IT infrastructures over time. Due to rapid growth or acquisition, these organizations have data centers and small computer rooms spread around the nation. When they review their entire network in a comprehensive manner, therefore, it's hardly surprising that many come to the same conclusion: DR capabilities can be considerably improved at lower cost by consolidating into a couple of data centers. Further, this often enables the company to deploy higher-octane DR than they would if they retained dozens of smaller data centers.
Case in point: Banc of America Securities, a unit of Bank of America, suffered poor server and storage array utilization rates due to having multiple data centers. According to Gary Berger, vice president of technology at Banc of America Securities, this resulted in over-provisioning of data and was very hard to manage. He decided to consolidate into two facilities and based these around IBM blade servers and storage arrays from 3PARdata Inc. of Fremont, CA.
"Virtualization and consolidation has given us a 95 percent reduction in storage administration," says Berger. "We are able to offer each application and business unit their own virtual slice with high performance and availability."
The 3PAR arrays allow the company to stripe data across anywhere from 60 to 100 drives in order to improve I/O and balance workloads. The 3PAR box also includes a replication engine which helps the bank to copy data between data centers for added protection. Each data center, in essence, contains a copy of the data from the other data center. The result is that the organization can recover rapidly from any event.
"We now have much better DR due to cost-effective replication," says Berger. "This new architecture has also resulted in a 50 percent reduction in the amount of storage capacity we need to purchase."
Ashok Singhal, CTO of 3PAR, stresses that while virtualization can be done at many layers, it really needs to be implemented at the base layer in order to be broadly effective.
"Virtualization has almost become a bad word as it is subjected to so much hype," says Singhal. "It is used to mean a wide range of things. You have to figure out the right layer to address."
In his view, virtualization is a simple concept: provide the user with a single logical view of storage while taking advantage of a complex physical hierarchy. The benefits are resource aggregation/sharing, cost and performance optimization, and improved data availability.
Singhal believes that storage virtualization should not be done in multiple ways by multiple systems. He prefers a streamlined approach at the block storage layer. But where to virtualize block storage: in the host OS, the SAN switch, an appliance or the storage array? He advocates the latter.
"Virtualization should be done in the storage subsystem as there you can directly address disk drives, power, capacity, etc.," says Singhal. "It is much harder to achieve at the upper levels."
A 3PAR array, for example, lets the administrator change the RAID type or drive type while applications are running. Its Dynamic Optimization feature enables users to transition from one service level to another non-disruptively with one command, says Singhal.
In support of his argument, he cites gains from companies adopting virtualization at a higher layer. While they have experienced benefits, these typically fall in the 10 to 20 percent range in terms of time and cost savings. Done in the array and at block level, Singhal says, far greater results are being achieved. He cites Banc of America Securities as proof of this.
Large-Scale Replication
One final example highlights the importance of good homework prior to the execution of any DR strategy. Dmitri Ryutov, storage architect at JPMorganChase, recently had to adopt new DR technology to meet the stringent requirements of the financial services industry. This involved a seven-figure sum and lots of data.
He spent a while investigating four different options for DR. These were evaluated based on their ability to handle the complete loss of one site, their ability to replicate across hundreds of kilometers, their recovery point objective (RPO - the point in time to which data will be recovered; an RPO of zero would mean no data loss) and their recovery time objective (RTO - how long it will take to get systems back online).
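The two objectives can be made concrete with a small sketch. It measures the RPO and RTO actually achieved in a single failure and compares them with targets. The timestamps and the one-minute and four-hour objectives are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_protected_write: datetime, failure: datetime) -> timedelta:
    """Data-loss window: work done after the last protected write is lost."""
    return failure - last_protected_write


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime: how long it took to get systems back online."""
    return service_restored - failure


def meets_objective(achieved: timedelta, objective: timedelta) -> bool:
    """True when the measured value is within the agreed objective."""
    return achieved <= objective


failure = datetime(2008, 1, 1, 12, 0, 0)
last_copy = datetime(2008, 1, 1, 11, 59, 30)   # last write that reached the DR site
restored = datetime(2008, 1, 1, 14, 0, 0)

rpo = achieved_rpo(last_copy, failure)    # 30 seconds of data lost
rto = achieved_rto(failure, restored)     # 2 hours of downtime
print(rpo, meets_objective(rpo, timedelta(minutes=1)))
print(rto, meets_objective(rto, timedelta(hours=4)))
```

An RPO of zero corresponds to `last_copy` being equal to `failure`: every write acknowledged before the disaster already exists at the DR site.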
JPMorganChase has 9 PB of usable SAN and NAS storage on its 10 or so data center floors. It also has 33,000 servers. Like Banc of America Securities, it decided to consolidate to two core sites - one for production and one for DR. These were positioned 160 miles apart so as to be able to cope with a disaster affecting a wide area.
"We evaluated four possible solutions," says Ryutov. "Synchronous replication, multi-hop replication, replication of point-in-time copies (asynchronous) and guaranteed write order replication (asynchronous)."
With synchronous replication, everything on the production site is sent to the production storage array and then replicated to a DR array. Once the I/O is complete, an acknowledgment is sent back. The great thing about this approach was that RPO would be zero and RTO low. The downside was the distance limitation of this technology - around 20 to 100 miles.
"The system had too much latency, which impacted end users and caused some database applications to time out," says Ryutov. "It was also not cheap, though it did fall within our budget."
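Distance alone accounts for part of that latency, because each synchronous write waits for an acknowledgment from the remote array. The sketch below estimates the minimum round-trip delay for the distances mentioned in the article. It assumes light travels through fiber at roughly 5 microseconds per kilometer and ignores switch, protocol and array overhead, so real figures would be higher.

```python
# Minimum added latency per synchronous write from distance alone.
# Assumption: light travels through fiber at roughly 5 microseconds per km
# (about two-thirds of its speed in a vacuum). Real links add switch,
# protocol, and array overhead, and fiber routes are longer than straight lines.

KM_PER_MILE = 1.609344
FIBER_US_PER_KM = 5.0


def round_trip_ms(miles: float) -> float:
    """Round-trip propagation delay in milliseconds over a fiber link."""
    one_way_us = miles * KM_PER_MILE * FIBER_US_PER_KM
    return 2 * one_way_us / 1000


def max_serial_writes_per_s(miles: float) -> float:
    """Upper bound on back-to-back writes when each waits for a remote ack."""
    return 1000 / round_trip_ms(miles)


for miles in (20, 100, 160):
    print(f"{miles:>3} miles: {round_trip_ms(miles):.2f} ms round trip, "
          f"at most {max_serial_writes_per_s(miles):.0f} serial writes/s")
```

At 160 miles, the spacing JPMorganChase chose for its two sites, propagation alone adds roughly 2.6 ms to every write that must wait for the remote acknowledgment. For a database that issues dependent writes one after another, that ceiling is low enough to explain timeouts.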
Multi-hop replication gets around this distance problem by adding another array at an intermediate site. Its zero RPO and fast RTO made it an initial candidate. But it dropped out quickly, as it meant a major surge in costs due to the need to build the entire storage environment at the intermediate location. Further, it involved too many copies (five) of data being made.
For point-in-time replication, the database is quiesced while a local point-in-time copy is taken of the storage array. This copy is replicated to the DR site. A point-in-time copy is also taken at the DR site. While costs were moderate with this approach, RPO could vary and it proved to be cumbersome to implement.
The technology adopted by JPMorganChase turned out to be asynchronous guaranteed write order replication. When data is written from the host to the storage array, it is asynchronously written to the DR site. A point-in-time copy is taken at the DR site in case the network link is lost.
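The behavior described here can be illustrated with a toy model: the production side acknowledges each write immediately and queues it, while the DR side applies writes strictly in their original order. This is a conceptual sketch only and does not represent how any vendor's product works internally; the class and method names are hypothetical.

```python
from collections import deque


class DrSite:
    """Applies replicated writes strictly in sequence order."""

    def __init__(self):
        self.data = {}
        self.next_seq = 1
        self.pending = {}  # writes that arrived ahead of their turn

    def receive(self, seq, key, value):
        self.pending[seq] = (key, value)
        # Apply every write whose predecessors have all been applied.
        while self.next_seq in self.pending:
            k, v = self.pending.pop(self.next_seq)
            self.data[k] = v
            self.next_seq += 1


class ProductionArray:
    """Acknowledges writes locally, then ships them to DR later."""

    def __init__(self):
        self.data = {}
        self.seq = 0
        self.outbound = deque()

    def write(self, key, value):
        self.seq += 1
        self.data[key] = value
        self.outbound.append((self.seq, key, value))
        return "ack"  # host is not held up by the WAN

    def drain(self, dr, max_writes=None):
        """Send queued writes to the DR site, oldest first."""
        sent = 0
        while self.outbound and (max_writes is None or sent < max_writes):
            dr.receive(*self.outbound.popleft())
            sent += 1


prod, dr = ProductionArray(), DrSite()
prod.write("balance", 100)
prod.write("balance", 80)
prod.write("log", "debit 20")
prod.drain(dr, max_writes=2)   # link drops after two writes
print(dr.data)                 # {'balance': 80}: behind, but consistent
```

The DR copy may lag production, which is why RPO is small rather than zero, but it never reflects a later write without the earlier ones it depends on. That ordering guarantee is what keeps a database at the DR site recoverable.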
"The pros of asynchronous replication are sub-minute RPO, no performance impact and no distance limitation," says Ryutov. "It also wasn't as expensive as some of the other options and had an acceptable RTO."
In practice, this works as follows: three clustered Sun E25k production servers transmit data across the SAN fabric to an EMC DMX 3000 array. Cisco 9216 switches (GbE) with redundant paths are used to relay the data to the same EMC array at the DR facility. EMC SRDF is used for asynchronous replication. A further cluster of Sun E25k servers stores a point-in-time copy at the remote site. A tape backup is also done and taken off site.
Lessons Learned
Ryutov offers several lessons he's learned along the way. He lets management or his business units determine the RPO. If they require zero, he highlights the costs and lets them decide whether they are willing to pay the high price to achieve it.
Perhaps the most important lesson, however, concerns DR testing. Regardless of how good the technology is, there is no substitute for regular testing.
"Test, test and retest the DR plan," says Ryutov. "And be prepared to come in weekends to test again."
Drew Robb is a freelance writer specializing in information technology.