George Crump at Storage Swiss recently wrote a blog comparing the merits of scale-up versus scale-out architectures in all-flash array designs. The article starts with the assertion that scale-out systems are more expensive to build, implement, and maintain. However, with the significant innovations in scale-out architecture over the past few years, combined with rapid data growth in enterprise datacenters, that view is seriously outdated. In fact, the cost and complexity argument has flipped for most environments and now significantly favors scale-out.

The cost of scaling

For small-capacity (<50TB) environments that don’t need to scale over time, fixed controller-based systems can indeed be cheaper and simpler. But as soon as you add scale, growth, or long-term TCO to the equation, a good scale-out architecture will be significantly less expensive and simpler. With a scale-up architecture, you either need to move to a faster controller for more performance or add entirely new storage systems. Both options carry significant costs. Moving to a faster controller usually involves data migration, and you are left with an unused controller. Adding new storage systems exposes customers to data migration as well as the burden of ever more islands of storage to manage. A good scale-out architecture allows you to scale UP or DOWN by adding and removing nodes, with no data migration and no increase in management burden. Storage systems can be sized based on the logical, physical, and application requirements of a datacenter, rather than on arbitrary vendor-specified configurations.
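To make the cost argument concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is hypothetical and purely illustrative, not pricing for any real product; the point is only the shape of the comparison: a scale-up growth step bundles a controller swap, a migration project, and a stranded asset, while a scale-out growth step is just the nodes you actually need.

# Hypothetical back-of-the-envelope comparison; all figures are illustrative.

def scale_up_growth_cost(new_controller_pair=120_000,
                         migration_project=30_000,
                         stranded_old_controller=40_000):
    """Growing a scale-up array: buy a bigger controller pair, fund the
    data migration, and write off the controller being replaced."""
    return new_controller_pair + migration_project + stranded_old_controller

def scale_out_growth_cost(nodes_needed=2, cost_per_node=45_000):
    """Growing a scale-out cluster: buy only the nodes you need; no
    migration project, nothing stranded."""
    return nodes_needed * cost_per_node

print("Scale-up growth step:  $", scale_up_growth_cost())
print("Scale-out growth step: $", scale_out_growth_cost())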

Scaling performance and capacity

Due to the significant performance available in an all-flash array, most systems will run out of capacity before performance. However, in planning for growth it’s important to retain the flexibility to scale both. In a scale-up design you are limited by the performance of the controllers: regardless of how much capacity you have left, if you need more performance you are stuck buying another controller pair. Getting an ideal balance of capacity and performance is nearly impossible. In contrast, a scale-out architecture can offer a wide range of capacity and performance points, particularly if it can mix heterogeneous nodes with different capacity and performance levels. Relying on Moore’s law and yearly controller upgrades may be a good strategy for a flash vendor, but it’s a substantial inconvenience for customers who want to get the maximum life out of their investments.
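As a rough illustration of how mixed-node scaling lets capacity and performance grow independently, the Python sketch below aggregates cluster totals as nodes of different sizes are added. The node models and their numbers are hypothetical, chosen only to show that the same cluster can be grown capacity-heavy or performance-dense depending on which nodes you add.

# Hypothetical node models; capacity (TB) and IOPS figures are illustrative.
NODE_MODELS = {
    "small":  {"capacity_tb": 10, "iops": 50_000},
    "medium": {"capacity_tb": 20, "iops": 75_000},
    "large":  {"capacity_tb": 40, "iops": 100_000},
}

def cluster_totals(nodes):
    """Sum usable capacity and performance across a heterogeneous cluster."""
    capacity = sum(NODE_MODELS[n]["capacity_tb"] for n in nodes)
    iops = sum(NODE_MODELS[n]["iops"] for n in nodes)
    return capacity, iops

baseline = ["small"] * 4
print(cluster_totals(baseline))                       # starting point
print(cluster_totals(baseline + ["large"]))           # capacity-heavy growth
print(cluster_totals(baseline + ["small", "small"]))  # performance-dense growth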

Consistent performance

Scale-up designs suffer from the well-known problem that as capacity is added, the controller’s performance is spread over more data and applications. Even if the system isn’t fully “maxed out” to start, early applications get fewer controller resources (CPU, cache, network bandwidth) as more applications are added. Scale-up designs are particularly prone to noisy-neighbor problems, where a small number of applications can monopolize the controller’s resources.

By contrast, scale-out designs add more CPU, network connectivity, and memory with each node, ensuring that performance doesn’t degrade as more capacity is added. Of course, obtaining linear performance growth in larger clusters is a difficult engineering problem. SolidFire’s architecture is uniquely designed to avoid performance loss as the cluster scales, and our Guaranteed QoS eliminates noisy neighbors and adds fine-grained performance control.
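SolidFire’s QoS is configured with per-volume minimum, maximum, and burst IOPS settings; the real implementation is far more involved, but a minimal sketch of the core idea, clamping each volume’s demand to its configured ceiling so no single workload can starve the rest, might look like the following. The volume names and limits here are hypothetical.

# Minimal sketch of per-volume IOPS ceilings; not SolidFire's implementation.
VOLUME_QOS = {
    "db-prod":   {"min_iops": 5_000, "max_iops": 15_000},
    "analytics": {"min_iops": 1_000, "max_iops": 4_000},
    "dev-test":  {"min_iops": 500,   "max_iops": 1_000},
}

def admit_iops(volume, requested_iops):
    """Clamp a volume's demand to its ceiling so a noisy neighbor cannot
    consume the whole cluster's performance."""
    return min(requested_iops, VOLUME_QOS[volume]["max_iops"])

# A runaway analytics job asks for 50,000 IOPS but is held to its cap,
# leaving headroom for the volumes around it.
print(admit_iops("analytics", 50_000))  # -> 4000
print(admit_iops("db-prod", 8_000))     # -> 8000 (within its ceiling)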

Data protection

Controller-based shared-disk systems use redundant components for HA (redundant controllers, power supplies, etc.), but despite claims of no single point of failure, they generally share a weak point: the shared disk shelf. That shelf, and the backplane within it, is a key point of failure without full redundancy. Historically, high-end disk-based systems used dual-ported FC or SAS drives to allow independent backplane connections to each drive, reducing (if not completely eliminating) this single point of failure. Flash arrays that use SATA-based flash drives can’t do that. A shared-nothing scale-out system, with no disks shared between nodes, doesn’t have this limitation and can truly offer no single point of failure. In addition, a good scale-out architecture can self-heal without requiring “extra” redundant components, removing the fire drills associated with storage component failures. At small scale these differences may not matter much to customers: disk shelf failures are fairly rare, and a 4-hour data outage for parts replacement won’t kill most customers. But at large scale, and in environments that need five nines of availability or better, shared-disk flash systems represent an added risk.
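One way to picture self-healing in a shared-nothing design: each piece of data is kept as replicas spread across independent nodes, and when a node fails, the surviving nodes recreate the missing copies from the replicas they already hold, with no shared shelf or backplane in the picture. The toy Python sketch below illustrates that idea under those assumptions; it is not SolidFire’s actual data-placement or protection algorithm.

import random

REPLICAS = 2  # toy assumption: every block lives on two independent nodes

def place_blocks(blocks, nodes):
    """Assign each block to REPLICAS distinct nodes (toy placement)."""
    return {b: random.sample(nodes, REPLICAS) for b in blocks}

def heal(placement, failed_node, surviving_nodes):
    """Re-replicate any block that lost a copy, using only surviving nodes."""
    for block, homes in placement.items():
        if failed_node in homes:
            homes.remove(failed_node)
            candidates = [n for n in surviving_nodes if n not in homes]
            homes.append(random.choice(candidates))  # rebuild the lost copy
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(range(10), nodes)
placement = heal(placement, "node3", [n for n in nodes if n != "node3"])
assert all(len(homes) == REPLICAS for homes in placement.values())
# Full redundancy is restored automatically, without waiting on a parts swap.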

End-of-life upgrades

Moving from one controller-based storage system to a new generation after a 3-5 year service cycle is the bane of storage administrators’ existence. It can often take 6 months or more of planning, testing, and execution, along with application downtime. As environments get larger, the cost and time required only increase. With scale-out architectures that allow mixing of hardware generations, hardware upgrades become a trivial process: simply add the new nodes to the cluster and remove the old ones. No data migration, no downtime. Put the old nodes in a lab, resell or recycle them, and get back to productive work. The ability to mix generations also means you can add “new” storage nodes that offer higher capacity and performance (and lower cost) as you grow, rather than being stuck with old technology for 5 years.
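Under the hood the cluster does real rebalancing work, but from an administrator’s point of view a generation swap reduces to membership changes on a live cluster: join the new nodes, let data rebalance onto them, then retire the old ones. The toy Python sketch below shows that sequence with hypothetical node names; the rebalancing itself is abstracted away.

# Toy model of a rolling, migration-free hardware refresh; node names are
# hypothetical and rebalancing is treated as implicit in membership changes.
cluster = ["gen1-node1", "gen1-node2", "gen1-node3", "gen1-node4"]

def add_nodes(cluster, new_nodes):
    """Join new-generation nodes; data rebalances onto them in the background."""
    return cluster + new_nodes

def retire_nodes(cluster, old_nodes):
    """Drain and remove old nodes once their data has moved elsewhere."""
    return [n for n in cluster if n not in old_nodes]

cluster = add_nodes(cluster, ["gen2-node1", "gen2-node2", "gen2-node3"])
cluster = retire_nodes(cluster, ["gen1-node1", "gen1-node2",
                                 "gen1-node3", "gen1-node4"])
print(cluster)  # only gen2 nodes remain: no downtime, no forklift migration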


There are places in the market for both scale-up and scale-out flash systems, but the balance is shifting rapidly toward scale-out as the historical disadvantages are architected out and a more agile enterprise datacenter runs into the fundamental disadvantages of scale-up. This shift is a key reason why nearly every new storage architecture from a major vendor in the last 10 years has been scale-out (XIV, 3PAR, LeftHand, EqualLogic, XtremIO, Atmos, Isilon, etc.). The early rush of scale-up flash architectures is an aberration: a reflection of startups looking for fast time to market, rather than a market shift back to the antiquated storage paradigm of 20 years ago.


Learn more about scale-out vs. scale-up in this video featuring Cloud Solutions Architect Ed Balduf.


Dave Wright

Dave Wright, SolidFire CEO and founder, left Stanford in 1998 to help start GameSpy Industries, a leader in online videogame media, technology, and software. GameSpy merged with IGN Entertainment in 2004 and Dave served as Chief Architect for IGN and led technology integration with FIM / MySpace after IGN was acquired by NewsCorp in 2005. In 2007 Dave founded Jungle Disk, a pioneer and early leader in cloud-based storage and backup solutions for consumers and businesses. Jungle Disk was acquired by leading cloud provider Rackspace in 2008 and Dave worked closely with the Rackspace Cloud division to build a cloud platform supporting tens of thousands of customers. In December 2009 Dave left Rackspace to start SolidFire.