We recently announced the HCP S10 Storage node as part of our HCP portfolio of object storage solutions. See my previous post on the changes that the S10 introduced in the way that we access, manage and protect data.
S10 introduced a high capacity commodity disk system with higher availability and faster rebuild times than enterprise disk systems with RAID 6 protection. This was done through the use of erasure coding and a patented approach to rebuild.
“Erasure Coding” (EC) is a method of data protection that is well known in the networking world but is relatively new in the storage world, although storage has been using a form of erasure coding for a number of years. Erasure coding is not a code for the erasing of data. It is a method for detecting when portions or extents of data have been erased (made inaccessible due to a failure) and recovering the erased extents through the use of redundant extents. Erasure coding divides data into “n” extents, to which are added “m” extents for data protection. The “m” protection extents are computed from the “n” data extents by an algorithm chosen so that if any “m” of the “n + m” extents are erased, the data can be reconstructed from the “n” extents that remain. The “n + m” extents are distributed to different locations and can sustain up to “m” concurrent failures or erasures.
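To make the “n + m” idea concrete, here is a minimal sketch of one well-known family of erasure codes, a Reed-Solomon style code built from polynomial interpolation over a small prime field. This is an illustration only: the post does not say which algorithm the S10 uses, and the function names (`encode`, `decode`, `lagrange_eval`) and the field size 257 are assumptions made for this example.

```python
from itertools import combinations

P = 257  # small prime field, chosen only for illustration (assumption)

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through points [(xi, yi)], mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data, m):
    """Return the n data symbols followed by m protection symbols."""
    n = len(data)
    points = list(enumerate(data))  # data symbol i sits at position x = i
    return list(data) + [lagrange_eval(points, x) for x in range(n, n + m)]

def decode(survivors, n):
    """Rebuild the n data symbols from any n surviving {position: symbol}."""
    points = list(survivors.items())[:n]
    return [lagrange_eval(points, x) for x in range(n)]

data = [72, 101, 108, 108, 111]        # n = 5 data symbols
codeword = encode(data, m=3)           # n + m = 8 symbols in total

# Erase every possible set of m = 3 positions and check that the data returns.
for erased in combinations(range(len(codeword)), 3):
    survivors = {i: s for i, s in enumerate(codeword) if i not in erased}
    assert decode(survivors, n=len(data)) == data
print("recovered the data from every combination of 3 erasures")
```

Production systems work on whole extents rather than single symbols and use faster arithmetic, but the property demonstrated is the one described above: any “n” of the “n + m” pieces are enough.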
The S10 software uses raw disks and slices them into 64 MB extents. A group of 20 (n) data extents plus 6 (m) protection extents forms an extent group. This 20+6 erasure coding means it can sustain the concurrent erasure of 6 extents in an extent group of 26 extents. This provides a data reliability of 15 nines with a storage efficiency of 77% (20 of every 26 extents hold data). Each extent in an extent group is written to a different hard disk drive (HDD). When a disk fails, only the damaged extents are rebuilt. New data is written around the failed disk and is fully protected, and rebuild activity is distributed across all available disks. Rebuild priority goes to the extent group with the largest number of damaged extents. There is no need to reserve disks for spares or wait to rebuild the whole disk. No immediate disk replacement is required, which reduces maintenance costs. By rebuilding the data and not the disk, we have faster rebuild, less vulnerability, and lower maintenance costs.
The example above shows an extent group of 3 extents to illustrate the principle of rebuild and of writing new data extents around the failed disk. Notice that the failed disk does not affect the yellow extent group, so a rebuild of that extent group is not required. The blue, red and green extent groups only have to rebuild the extents on the failed disk and rewrite them to one of the remaining good disks. When a disk fails, new writes continue to be written with full protection. The purple extent group shows how a new extent group can be written around the failed disk while the rebuild is being done.
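The rebuild behavior described here can be sketched in a few lines. The disk numbering, the placement of the colored extent groups, and the `rebuild_plan` function are all hypothetical, chosen to mirror the description rather than the actual S10 implementation. The sketch skips unaffected groups, handles the most damaged groups first, and spreads rebuilt extents across surviving disks that do not already hold an extent of the same group.

```python
from collections import Counter

def rebuild_plan(groups, failed, all_disks):
    """Plan extent rebuilds after disk failures.

    groups:    {group name: [disk id of each extent]}  (hypothetical layout)
    failed:    set of failed disk ids
    all_disks: every disk id in the system
    Returns a list of (group, lost_disk, target_disk) tuples.
    """
    survivors = [d for d in all_disks if d not in failed]
    load = Counter()  # rebuilt extents assigned to each surviving disk so far
    damaged = {g: [d for d in disks if d in failed] for g, disks in groups.items()}
    # Groups with the most damaged extents are rebuilt first.
    order = sorted((g for g in damaged if damaged[g]),
                   key=lambda g: -len(damaged[g]))
    plan = []
    for g in order:
        in_use = set(groups[g]) - failed
        for lost in damaged[g]:
            # Keep every extent of a group on a different disk.
            candidates = [d for d in survivors if d not in in_use]
            if not candidates:
                raise RuntimeError(f"no free disk left for group {g}")
            target = min(candidates, key=lambda d: (load[d], d))
            load[target] += 1
            in_use.add(target)
            plan.append((g, lost, target))
    return plan

# Five disks, extent groups of 3 extents each, disk 2 fails.
groups = {
    "blue":   [0, 1, 2],
    "red":    [1, 2, 3],
    "green":  [2, 3, 4],
    "yellow": [0, 3, 4],   # not on disk 2, so it needs no rebuild
}
for step in rebuild_plan(groups, failed={2}, all_disks=range(5)):
    print(step)
```

Only three extents are rewritten, each to a different surviving disk, and the yellow group never appears in the plan.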
While “n + m” erasure codes are similar in that they can all sustain “m” concurrent failures, there are differences in implementation which affect read and write performance, recovery time, rebuild time, and maintenance costs. The challenge with erasure-coded storage systems that slice each object into n+m fragments is that the total number of fragments to be managed on disk varies with the average size of the objects. When fragments are small, the amount of work a system can process is bound by the number of fragments and no longer by the total bytes. So a lost 4 TB disk that was full of large fragments is much faster to rebuild than the same disk filled with small fragments. With HCP S10 the fragment size is fixed at an optimal size, which provides a very predictable and high rebuild throughput, independent of object size.
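A quick calculation shows how strongly the fragment count depends on object size when each object is sliced individually. The numbers below are assumptions for illustration: decimal units (1 MB = 10^6 bytes), a hypothetical object-slicing system with n = 20 data fragments per object, and the three sample object sizes. Only the 4 TB disk and the 64 MB fixed extent size come from the text.

```python
TB = 10**12
MB = 10**6

def fragments_per_disk(disk_bytes, fragment_bytes):
    """How many fragments of a given size fit on one disk."""
    return disk_bytes // fragment_bytes

def sliced_fragment_bytes(object_bytes, n):
    """Fragment size when one object is cut into n data fragments (rounded up)."""
    return -(-object_bytes // n)

disk = 4 * TB

# Fixed 64 MB extents: the count does not depend on object size.
fixed_count = fragments_per_disk(disk, 64 * MB)
print(f"fixed 64 MB extents: {fixed_count:,} per disk")

# Object-sliced fragments: the count swings with the average object size.
for object_size in (1 * MB, 100 * MB, 10_000 * MB):
    frag = sliced_fragment_bytes(object_size, n=20)
    count = fragments_per_disk(disk, frag)
    print(f"objects of {object_size // MB:>6,} MB -> {count:>12,} fragments per disk")
```

Under these assumptions the per-disk fragment count ranges over several orders of magnitude for the object-sliced layout, while the fixed-extent layout always rebuilds the same number of equally sized pieces.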
RAID 5 and RAID 6 can be considered forms of erasure coding where “m” equals 1 for RAID 5 and 2 for RAID 6 and the extent group is a RAID stripe on a disk. The erasure coding for RAID is based on the XOR of a parity extent with the remaining data extents to recover the extent that was erased. RAID stands for “Redundant Array of Independent Disks” and rebuild is done on a disk basis, which requires a spare drive onto which the failed extents are rewritten. The recovery and rebuild for RAID used to require special controllers and software, but is now done in hardware.
This example shows 3D plus 1P RAID stripes where the parity is rotated across the disks. The failed drive is rebuilt onto the spare disk.
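The XOR recovery used by single-parity RAID can be shown with a minimal sketch of one 3D plus 1P stripe. The block contents and the helper name `xor_blocks` are made up for the example, and real controllers work on much larger blocks in hardware.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks (3D) and one parity block (1P).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The disk holding data[1] fails. XOR the parity with the surviving data
# blocks to regenerate the lost block.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
print("recovered block:", recovered)
```

With a single parity block this works for exactly one lost block per stripe, which is why “m” equals 1 for RAID 5.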
While RAID 6 has 2 parity drives and can protect against 2 concurrent disk failures, the rebuild is done on a disk basis and all the RAID groups on the failed disks are impacted while the disks are being rebuilt. The 20 plus 6 extent group on the S10 protects against 6 concurrent failures, and only the extents on a failed disk are affected while the failed extents are rebuilt. The big difference between erasure codes and RAID is that rebuilds are done by extent and not by disks and erasure code algorithms can support more than two concurrent failures. Hitachi has filed two patents in the areas of fast online recovery and fast rebuild.
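The trade-off between failure tolerance and capacity can be checked with simple arithmetic. The 20+6 layout is the one described for the S10. The 6D plus 2P RAID 6 group is a hypothetical configuration picked only for comparison, since the text does not specify a RAID group size.

```python
from fractions import Fraction

def layout_summary(n, m):
    """Failure tolerance and storage efficiency of an n data + m protection layout."""
    return {"tolerated_failures": m, "efficiency": Fraction(n, n + m)}

layouts = {
    "S10 20+6": layout_summary(20, 6),
    "RAID 6 6D+2P (hypothetical)": layout_summary(6, 2),
}

for name, summary in layouts.items():
    print(f"{name}: tolerates {summary['tolerated_failures']} failures, "
          f"efficiency {float(summary['efficiency']):.1%}")
```

Under this assumption the two layouts have similar storage efficiency, while the 20+6 layout tolerates three times as many concurrent failures.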
The ability to tolerate “m” concurrent failures enables the use of commodity disks, which are prone to more frequent failures than expensive enterprise disks. Rebuilding extents, rather than disks, makes the rebuild time independent of the size of the disk. Depending on the number of extents that are affected in a disk failure, rebuilding these extents may take only a few minutes, compared to the hours or days needed to rebuild all the RAID stripes on a multi-TB disk. With the S10, there is no need for idle spare drives. All drives are used all the time. When failures occur, missing extents are simply re-mapped to free areas on surviving disks. When this completes, the S10 returns to a fully optimal and protected state, even when missing multiple drives. However, erasure coding does require more time and processing cycles to generate and write the erasure codes and keep track of the changing locations of extents, so it is better suited to data that is referenced than to data that is actively updated. This makes the S10 very suitable as an archive tier for the Hitachi Content Platform. RAID protection, with high availability enterprise disks, is still the best choice for database and transactional applications since the simpler RAID parity generation can be done in hardware with little impact on performance. However, as disk capacities increase, the time to rebuild a disk also increases, which creates an exposure to additional disk failures during the rebuild, leading to loss of data. As the technology for erasure coding advances, it will eventually replace RAID. Today, enterprise storage controllers like the Hitachi VSP G1000 can mask the RAID rebuild times with active/active Global Access Devices.