Scalability is often marketed as a feature of a storage system. But scale is not a checkbox feature, nor is it a single number like capacity. Scale is a set of constraints that operate across every metric and feature of a system. Within large cloud environments, all parts of the infrastructure are expected to operate against this backdrop of scale. In two recent posts we touched briefly on the magnitude of the challenges presented by scale and why EMC spent $430 million to acquire scale. Because scale is a critical consideration in any cloud infrastructure build-out, we want to discuss in more depth how we solve those challenges.
As it relates to storage, two of the most critical dimensions of scale in a cloud environment are performance and capacity. Using traditional storage systems, optimizing for either one of these resources almost always comes at the expense of the other. The best visual depiction of this dilemma can be seen in this graphic.
Flash-based designs today are IOPS-rich but lack the capacity, high availability, or shared characteristics required to scale to the broader demands of a large-scale cloud environment. Meanwhile, hard disk-based systems have plenty of capacity but lack the IOPS needed to adequately service their full capacity footprint. Unfortunately, a storage infrastructure containing lots of underutilized disk is unsustainable from both a cost and a management perspective.
Properly architecting for scale in a multi-tenant cloud environment requires a system design that can manage the mixed workload profile inherent to this environment. Unlike an on-premises architecture, which has a more controlled binding between application and storage, the economics of cloud are predicated on a shared infrastructure across many applications. Rather than optimizing the underlying storage for a single application, a cloud infrastructure must be able to accommodate the unique performance and capacity requirements of thousands of applications. Modern hypervisors provide this level of flexibility for compute resources today. It is about time storage caught up.
So what are the defining characteristics of a storage system designed to operate under the constraints of scale? Here are some of the design objectives we have based our system around:
- Performance and capacity balance- Rather than force a suboptimal tradeoff at the system level (i.e., performance or capacity), we designed an architecture with a more balanced blend of the two. Armed with our performance virtualization technology, service providers can carve up this system to serve the unique needs of many different applications across a wide mix of performance and capacity requirements. This more granular level of provisioning allocates storage resources far more efficiently than traditional system-centric alternatives, which force a capacity-or-performance decision up front on every application.
- Incremental growth- The recurring nature of the service provider business model necessitated an incremental approach to scale. Each node added to the SolidFire cluster adds equal parts performance and capacity to the global pool. With a more balanced and linearly scalable resource pool at its disposal, a cluster can more easily span environments both small and large. Traditional controller-based architectures require a large up-front investment in redundant controllers. Adding more disk shelves can increase capacity, but in many of these architectures the performance benefit is limited or a complex reconfiguration is required.
- Dynamic change- Capacity and performance allocations within the cluster need to be dynamic and non-disruptive to account for the only two constants in the cloud: growth and change. This requirement applies at both the node and the volume level. Node additions to a SolidFire cluster are done non-disruptively, with data rebalanced across the newly added footprint. Performance QoS settings for individual volumes can be adjusted dynamically, in real time, through the SolidFire REST-based APIs.
- Single management domain- As a storage environment scales it is critically important that the management burden does not do the same. The clustered nature of the SolidFire architecture ensures a single management domain as the cluster grows. Alternative architectures often require additional points of management for each new storage system. Even worse, scale limitations often prevent vendors from addressing such a broad range of capacity and performance requirements from within the same product family. The complexity resulting from multiple points of management across multiple product families can have crippling effects at scale. Multiple clusters can be set up in different fault domains or availability zones as required, but the key decision point about what scale to place in each domain is determined by the customer, not by the storage system.
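To make the first three objectives concrete, here is a minimal Python sketch of a shared pool in which every node adds a fixed amount of both capacity and IOPS, and each volume draws independently on the two. It is a toy model for illustration only, not SolidFire's implementation or API: the `Cluster` and `Volume` classes, their method names, and the per-node figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    size_gb: int
    min_iops: int  # performance floor reserved for this volume


@dataclass
class Cluster:
    # Per-node figures are illustrative placeholders, not product specs.
    node_capacity_gb: int
    node_iops: int
    nodes: int = 0
    volumes: dict = field(default_factory=dict)

    @property
    def total_capacity_gb(self) -> int:
        return self.nodes * self.node_capacity_gb

    @property
    def total_iops(self) -> int:
        return self.nodes * self.node_iops

    def add_node(self) -> None:
        """Incremental growth: one node adds capacity and IOPS together."""
        self.nodes += 1

    def _committed(self, skip=None):
        """Capacity and IOPS already reserved, optionally ignoring one volume."""
        others = [v for v in self.volumes.values() if v.name != skip]
        return sum(v.size_gb for v in others), sum(v.min_iops for v in others)

    def provision(self, name: str, size_gb: int, min_iops: int) -> Volume:
        """Carve a volume out of the pool with its own capacity and IOPS."""
        if name in self.volumes:
            raise ValueError(f"volume {name!r} already exists")
        used_gb, used_iops = self._committed()
        if used_gb + size_gb > self.total_capacity_gb:
            raise ValueError("not enough capacity left in the pool")
        if used_iops + min_iops > self.total_iops:
            raise ValueError("not enough performance left in the pool")
        self.volumes[name] = Volume(name, size_gb, min_iops)
        return self.volumes[name]

    def set_min_iops(self, name: str, min_iops: int) -> None:
        """Dynamic change: adjust one volume's reservation in place."""
        _, used_iops = self._committed(skip=name)
        if used_iops + min_iops > self.total_iops:
            raise ValueError("not enough performance left in the pool")
        self.volumes[name].min_iops = min_iops


if __name__ == "__main__":
    cluster = Cluster(node_capacity_gb=10_000, node_iops=50_000)
    for _ in range(4):
        cluster.add_node()
    # A capacity-heavy tenant and a performance-heavy tenant share one pool.
    cluster.provision("archive", size_gb=20_000, min_iops=1_000)
    cluster.provision("database", size_gb=500, min_iops=150_000)
    print(cluster.total_capacity_gb, cluster.total_iops)
```

In this model the capacity-heavy and performance-heavy volumes coexist in one pool, and a request that exceeds the remaining IOPS becomes satisfiable after a single `add_node()` call rather than a new system purchase.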
The scale challenges of cloud environments mandated different design choices for us at SolidFire compared to solutions intended for more traditional enterprise use cases. Delivering such a balanced pool of performance and capacity with a single management domain is unique in the storage industry today. Layering our performance virtualization technology into the architecture allows service providers to flexibly host a much broader range of application requirements from start to scale. Consequently, I would urge anyone building a scale-out cloud infrastructure to at least consider the above criteria as a starting point for any discussion around scale.