Hu Yoshida

How do you back up and restore a Petabyte?

Blog Post created by Hu Yoshida on Jul 2, 2015

The simple answer is you don’t.


It would take weeks to back up a petabyte and much longer to restore one. Even if time were not a factor, the costs could be prohibitive: studies we have done show that data protection costs for onsite storage are twice the cost of primary storage, while the cost of protection for remote offices may be three times that of primary storage. Data protection costs include the total cost of backup media, backup servers, libraries, software licensing, operations, administration, maintenance, and failures. Since backup windows are a thing of the past, a point-in-time copy is made for a backup server to use for backup, and multiple point-in-time backups are kept in case restores have to fall back to an earlier point in time. This is just the cost of data protection using traditional backup. Disaster recovery can cost an order of magnitude more.
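
To see why the time alone is a problem, here is a rough back-of-the-envelope sketch. The throughput figures are hypothetical sustained end-to-end rates chosen for illustration, not measurements from any particular backup system, and the function name `transfer_days` is made up for this example.

```python
def transfer_days(size_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Days needed to move size_bytes at a constant sustained throughput."""
    seconds = size_bytes / throughput_bytes_per_sec
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # decimal petabyte

# Hypothetical sustained rates (assumptions, not vendor figures).
for label, rate in [("500 MB/s", 500e6), ("1 GB/s", 1e9), ("5 GB/s", 5e9)]:
    print(f"{label}: {transfer_days(PETABYTE, rate):.1f} days")
```

Even at a sustained 1 GB/s, which is optimistic for many backup paths, one full pass over a petabyte takes more than 11 days, before counting verification, media handling, or the slower restore path.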



Hyper-scale data centers like Google and Amazon handle exabytes of data without doing backups. They do this by copy on write, in the sense that copies are made at the time the data is written: by using a file system that writes to multiple storage nodes, applications can continue to run when one node fails. The failed node is replaced and rebuilt in the background. While copy on write adds costs, it is not nearly as costly as traditional backup. With copy on write, the data is always under the management of the primary system and immediately available.


Hitachi Content Platform (HCP) delivers a complete range of data protection essentials. It is a true backup-free platform that protects content without the need for tape copies, using sophisticated data preservation technologies. Unparalleled, highly active data protection is built into the object store for state-of-the-art retention, replication and self-healing. HCP can scale to 80 federated nodes and 80 petabytes of capacity, and it protects your data in place using a host of technologies including:


  • Configurable Data Redundancy

Hitachi Content Platform was built to relieve concerns about data loss, with configurable redundant local object copies. Dynamic data protection levels (DPL) provide high reliability with up to 4 replicas of the original data object. For organizations with highly sensitive or valuable data, the redundancy factor is indispensable. Software mirroring is used to store data for each object in multiple locations on different nodes. The nodes are grouped into protection sets so that all copies of the data for an object are stored across the nodes within a single set. This level of protection service automatically enforces the required data redundancy. Maintaining replica copies helps tolerate simultaneous points of failure. While the default is to maintain 2 redundant copies, storage administrators can configure HCP settings for DPL 3 or DPL 4.
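
The idea of keeping every copy of an object on different nodes within a single protection set can be sketched as follows. This is only an illustration: the function name `place_replicas`, the hash-based choice of set and starting node, and the node names are all assumptions, not HCP's actual placement logic.

```python
import hashlib


def place_replicas(object_id: str, protection_sets: list[list[str]], dpl: int) -> list[str]:
    """Pick `dpl` distinct nodes, all from one protection set, for an object."""
    if not 1 <= dpl <= 4:  # the post describes up to 4 copies; the lower bound is assumed
        raise ValueError("DPL must be between 1 and 4")
    # Hypothetical placement rule: hash the object ID to choose a set and a start node.
    digest = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    nodes = protection_sets[digest % len(protection_sets)]
    if dpl > len(nodes):
        raise ValueError("protection set has fewer nodes than the DPL")
    start = digest % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(dpl)]


# Two made-up protection sets of four nodes each.
sets = [["n1", "n2", "n3", "n4"], ["n5", "n6", "n7", "n8"]]
print(place_replicas("invoice-2015-07.pdf", sets, dpl=2))  # default of 2 copies
print(place_replicas("invoice-2015-07.pdf", sets, dpl=4))  # DPL 4
```

Because all copies land on distinct nodes inside one set, losing a single node never removes every copy of an object as long as the DPL is 2 or higher.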


  • Metadata Protection

Hitachi Content Platform uses configurable redundancy to protect valuable metadata, too. Metadata can be critically important to the overall value of data stored on HCP. It is essential for assigning the right object policies and for building and managing enormous unstructured data stores. The Content Platform provides the industry's only integrated metadata query engine, which enables thorough data discovery and integrated search. The metadata protection level (MDPL) is a system-wide setting that specifies the number of copies of metadata for HCP to retain. As with DPL, the default for MDPL is 2 copies. A secondary copy of the metadata is created and stored with each copy of the object data, but is managed independently. If the original metadata is lost or corrupted, it can be reconstructed from the second copy.


  • Backup Free File Synch and Share
    HCP serves as the core repository for HCP Anywhere, which provides file synch and share for mobile users. With HCP Anywhere, the data remains secure and protected behind your firewall while it is accessible to authorized users on their mobile devices and browsers. In the event of device loss or failure, the user’s files are available via any web browser and can be easily recovered to a new device.


  • Backup Free Remote Office
    The Hitachi Data Ingester (HDI) is software that runs on a remote server or virtual machine and looks like a filer to a remote user or application. When a file is written to HDI, the file is replicated over RESTful interfaces to an HCP in your data center, which eliminates the need for backup. The local HDI has limited capacity, so when a capacity threshold is reached, it begins to stub out the oldest files, which makes it look like a bottomless filer. If the HDI server or the remote site is unavailable for any reason, a new server can be installed and connected to the network, and HCP Anywhere can remotely install and configure the new HDI software. As soon as the connection is made, the new HDI can immediately continue operations, creating new files that are replicated to HCP while accessing the files that were copied to HCP prior to the outage.
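
The replicate-then-stub behavior described for HDI can be illustrated with a toy model. Everything here is hypothetical: the class `StubbingCache`, the in-memory dictionary standing in for the central HCP, and the rule that "oldest" means least recently written are assumptions made for the sketch, not HDI's implementation.

```python
class StubbingCache:
    """Toy local filer: every write is copied centrally; oldest files become stubs."""

    def __init__(self, central: dict, threshold_bytes: int):
        self.central = central          # stands in for the central repository
        self.threshold = threshold_bytes
        self.local = {}                 # name -> bytes, or None for a stub

    def used(self) -> int:
        """Bytes held locally, not counting stubs."""
        return sum(len(d) for d in self.local.values() if d is not None)

    def write(self, name: str, data: bytes) -> None:
        self.central[name] = data       # replicate first, so stubbing never loses data
        self.local.pop(name, None)      # re-insert so the newest write is last
        self.local[name] = data
        self._stub_oldest()

    def _stub_oldest(self) -> None:
        for name in list(self.local):   # dicts keep insertion order: oldest first
            if self.used() <= self.threshold:
                break
            self.local[name] = None     # keep the name, drop the local bytes

    def read(self, name: str) -> bytes:
        data = self.local[name]
        if data is None:                # stub: fetch the central copy
            data = self.central[name]
        return data


central = {}
filer = StubbingCache(central, threshold_bytes=10)
for name in ("a.txt", "b.txt", "c.txt"):
    filer.write(name, b"12345")
print(filer.local["a.txt"] is None)     # the oldest file was stubbed
print(filer.read("a.txt"))              # still readable via the central copy
```

The same central copy is what makes the recovery story work: a replacement filer that starts with an empty local cache can still serve every file that was replicated before the outage.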


  • Disaster Recovery with HCP and Globally Dispersed Cloud Services
    The cost of disaster recovery can be greatly reduced through HCP's open interfaces, which include REST, Amazon S3, and OpenStack Swift, and which enable globally dispersed cloud services to replace traditional disaster recovery sites. HCP encrypts data, so data in a cloud can be moved to different devices without fear of violating privacy requirements. Photobucket, a global leader in online photo hosting, sharing and printing services where more than 2.25 million images are shared daily, chose Hitachi Cloud Services, powered by Hitachi Content Platform, to integrate its patented application to store two copies of original media in a globally dispersed Hitachi cloud.


  • A new way to look at Recovery
    Instead of recovering everything on a backup tape before operations can restart, HCP's custom metadata enables faster, more accurate access to content and provides meaningful information needed by IT to efficiently and intelligently process data. IT is better equipped to support stringent recovery point objectives (RPO), recovery time objectives (RTO), and service level agreements (SLAs).


  • Content Validation

Data integrity refers to a verifiable guarantee that the data retrieved for a given name or ID is exactly the same data that was stored using that name or ID. For data being stored over years or decades, content validation is an absolute must. Validation is normally accomplished by hashing the data using cryptographic hash algorithms. The hash function maps large data sets of variable length to smaller data sets of fixed length, returning what is known as hash values or hashes. HCP performs continuous data integrity checking and proactive data repair. Each data object has an ID or digital fingerprint that the hash algorithms use to compare it to other copies of the data. If there is any discrepancy or integrity breach, HCP automates object repair to fully restore the original data object.
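
A minimal sketch of hash-based validation and repair, assuming SHA-256 as the hash (the post does not say which algorithm HCP uses) and using the made-up helper names `fingerprint` and `verify_and_repair`:

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Fixed-length digest of variable-length data (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(stored_hash: str, copies: list[bytes]) -> list[bytes]:
    """Replace any copy whose hash no longer matches with an intact copy."""
    good = next((c for c in copies if fingerprint(c) == stored_hash), None)
    if good is None:
        raise ValueError("no intact copy left to repair from")
    return [c if fingerprint(c) == stored_hash else good for c in copies]


original = b"patient-record-0001"
stored_hash = fingerprint(original)   # recorded when the object is first stored

# Simulate one copy being silently corrupted on disk.
copies = [original, b"patient-recorc-0001"]
repaired = verify_and_repair(stored_hash, copies)
print(all(fingerprint(c) == stored_hash for c in repaired))
```

Repair only works while at least one copy still matches the stored hash, which is why validation and redundant copies go together.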


  • Object Versioning Protection

A chief mission of data protection is to prevent deletions or unintended changes before the retention period expires. Creating versions of the data objects helps to accomplish this mission and eliminate data tampering. Data stored in most object systems is fundamentally "write once, read many" (WORM). The object's unique identifier, that digital fingerprint, guarantees its immutability. To modify an object, HCP allows a new, different object to be created from the original. The original can then be assigned for deletion or kept for versioning history. This versioning ability is critical to the backup-free HCP attributes, as it allows past versions of a file to be recovered easily and in a self-service manner by end users.
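
The versioning idea can be sketched with a toy write-once store in which an update never overwrites existing data but adds a new version. The class `VersionedStore` and its methods are hypothetical and are not the HCP API.

```python
from typing import Optional


class VersionedStore:
    """Toy WORM store: updates append a new version; old versions are never overwritten."""

    def __init__(self):
        self._versions = {}  # name -> list of immutable bytes, oldest first

    def put(self, name: str, data: bytes) -> int:
        """Store data as a new version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(bytes(data))
        return len(versions)

    def get(self, name: str, version: Optional[int] = None) -> bytes:
        """Return the latest version, or a specific earlier one."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise IndexError("no such version")
        return versions[version - 1]


store = VersionedStore()
store.put("report.txt", b"draft")
store.put("report.txt", b"final")
print(store.get("report.txt"))             # latest version
print(store.get("report.txt", version=1))  # self-service recovery of the earlier one
```

Recovering from an unwanted change is then a read of an earlier version rather than a restore from backup media.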


HCP provides backup-free protection for files, objects, and archive data, which make up the majority of data that is backed up today. High-performance, block-based systems still require backup and disaster recovery solutions for active data. However, inactive data in those systems should be archived into HCP to reduce the cost and time required for backup. Check with your Hitachi Data Systems representative or channel partner to see how HCP can reduce your backup and disaster recovery costs. You may be surprised at how much you could save and how much you could improve your SLAs.