Requirement #3 for guaranteed Quality of Service (QoS): RAID-less data protection
Ensuring Quality of Service (QoS) is an essential part of hosting business-critical applications in a cloud. But QoS just isn’t possible on legacy storage architectures. As we’ve been discussing in this QoS Benchmark blog series, guaranteeing true QoS requires an architecture built for it from the beginning, starting with all-SSD and scale-out architectures. Now let’s explore the third requirement to deliver guaranteed performance: data protection that doesn’t rely on standard RAID.
The invention of RAID 30+ years ago was a major advance in data protection, allowing “inexpensive” disks to store redundant copies of data and rebuild onto a new disk when a failure occurred. RAID has advanced over the years with multiple approaches and parity schemes in an effort to maintain relevance as disk capacities have increased dramatically. Some form of RAID is used on virtually all enterprise storage systems today. However, the problems with traditional RAID can no longer be glossed over, particularly when you want a storage architecture that can guarantee performance even when failures occur.
The problem with RAID
When it comes to QoS, RAID causes a significant performance penalty when a disk fails, often 50% or more. This penalty occurs because a failure causes a 2-5X increase in IO load to the remaining disks. In a simple RAID10 setup, a mirrored disk now has to serve double the IO load, plus the additional load of a full disk read to rebuild into a spare. The impact is even greater for parity-based schemes like RAID5 and RAID6, where a read that would have hit a single disk now has to hit every disk in the RAID set to rebuild the original data – in addition to the load from reading every disk to rebuild into a spare.
The performance impact from RAID rebuilds is compounded by the long rebuild times incurred by multi-terabyte drives. Since traditional RAID rebuilds entirely onto a new spare drive, the write speed of that single drive becomes a massive bottleneck, combined with the read bottleneck of the few other drives in the RAID set. Rebuild times of 24 hours or more are now common, and the performance impact is felt the entire time.
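The rebuild-time math above is easy to sketch. The following is a back-of-the-envelope estimate, not a measurement; the 4 TB capacity and 50 MB/s sustained write rate are illustrative assumptions for a spare drive that is rebuilding while the array also serves foreground IO:

```python
# Back-of-the-envelope rebuild time for a traditional RAID spare.
# All figures are illustrative assumptions, not measurements.

def raid_rebuild_hours(capacity_tb: float, write_mb_s: float) -> float:
    """Time to fill a single spare drive at a sustained write rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB -> MB
    return capacity_mb / write_mb_s / 3600

# A 4 TB drive rebuilt at an assumed 50 MB/s takes roughly 22 hours,
# consistent with the 24-hour-plus rebuilds described above.
print(f"{raid_rebuild_hours(4, 50):.1f} h")
```

Because the entire failed drive's contents funnel through one spare, the rebuild time scales linearly with drive capacity, which is why growing disk sizes keep making the problem worse.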
How can you possibly meet a performance SLA when a single disk failure can lead to hours or days of degraded performance? In a cloud environment, telling the customer “the RAID array is rebuilding from a failure” is little comfort. The only option available for service providers is to dramatically under-provision the performance of the system and hope that the impact of RAID rebuilds goes unnoticed.
Introducing SolidFire Helix™ data protection
SolidFire’s Helix data protection is a post-RAID distributed replication algorithm. This solution spreads redundant copies of each disk’s data throughout all the other disks in the cluster rather than across a limited RAID set. Data is distributed in such a way that when a disk fails, the IO load it was serving spreads out evenly among every remaining disk in the system, with each disk needing to handle only a few percent more IO – not double or triple what it served before, as with RAID. Furthermore, data is rebuilt in parallel to the free space on all remaining disks rather than to a dedicated spare drive. Each drive in the system simply needs to share 1-2% of its data with its peers, allowing for rebuilds in a matter of seconds or minutes rather than hours or days.
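The same arithmetic shows why a distributed rebuild is so much faster. This is a minimal sketch, not SolidFire’s actual algorithm; the 1 TB drive, 50-drive cluster, and 100 MB/s per-drive rate are illustrative assumptions:

```python
# Sketch of distributed rebuild: the failed drive's data is re-replicated
# in parallel by all surviving drives, each handling one small slice.
# Figures are illustrative assumptions, not SolidFire specifications.

def distributed_rebuild_minutes(capacity_tb: float, drives: int,
                                per_drive_mb_s: float) -> float:
    """Rebuild time when (drives - 1) peers each rewrite an equal slice."""
    slice_mb = capacity_tb * 1_000_000 / (drives - 1)
    return slice_mb / per_drive_mb_s / 60

def extra_load_percent(drives: int) -> float:
    """Extra IO each survivor absorbs from the failed drive's share."""
    return 100 / (drives - 1)

# A 1 TB drive in a 50-drive cluster, each peer rewriting at 100 MB/s:
# the rebuild finishes in minutes, and each drive takes on ~2% more IO.
print(f"{distributed_rebuild_minutes(1, 50, 100):.1f} min, "
      f"+{extra_load_percent(50):.1f}% load per drive")
```

The key design point is that both the read and the write sides of the rebuild scale with cluster size: adding drives makes rebuilds faster and the per-drive load increase smaller, the opposite of traditional RAID.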
The combination of even load redistribution and rapid rebuilds allows SolidFire to continue to guarantee performance even when failures occur, something that just isn’t possible with traditional RAID.
Stay tuned to this blog as we discuss the other critical architectural requirements for guaranteed QoS, and read our free e-book to learn more about unlocking the secret to QoS in the cloud.