Where we’re coming from: racks and racks of commodity hardware taking up hundreds of tiles on the data center floor.

Traditionally, the big data ecosystem has been composed of software designed to run on commodity hardware and spinning disk. A few years ago this was a great place to start. But a lot has changed, and the infrastructure required to drive big data projects needs to change, too. So what has really changed? And where are we headed?

The biggest sea change in big data has come with the maturity of analytics systems and their value to the day-to-day business operations of many organizations. Projects that started out as greenfield explorations have grown into fundamental, business-critical information systems.

The value statements around big data infrastructure have morphed from "How cheaply can we store all this stuff?" to "How fast can we process this data and feed the results back to the core business?" What used to be overnight batch processing has evolved into a need for real-time analytics. Just as legacy relational systems drove the adoption of faster, specialized hardware for database systems, the pattern is repeating itself in the world of distributed big data clusters.

This shift has created a second generation of platform architectures, adopted by consumers who are serious about big data performance. The speed and agility that high-memory compute, flash storage, and orchestration systems bring to bear are being leveraged to build mission-critical systems that deliver result sets an order of magnitude faster than first-generation commodity hardware implementations. Beyond performance, these next-generation consolidation platforms are making a huge impact on data center resource consumption: shrinking overall rack space and driving down cooling costs, all while increasing energy efficiency.

In a parallel motion, large-scale enterprises across a variety of verticals are turning to cloud orchestration to automate the platforms that underlie distributed analytics systems. From telco and security/federal to leaders in retail, health, and finance, organizations are pivoting from racks of basic server nodes to infrastructure designed for the next-generation data center.

Software-defined everything: virtualized compute, programmable storage systems, elastic deployment models. The convergence of these trends with the key technologies in big data, such as Hadoop, Cassandra, and MongoDB, is evidenced by the uptake of virtualized distributed database systems in OpenStack-powered clouds and AWS.

In particular, Database as a Service (DBaaS) is emerging as a solution to the massive deployment challenges associated with large-scale big data analytics projects. In the OpenStack ecosystem, the Trove DBaaS and Sahara Hadoop projects are leading the way in database deployment and management automation, providing open-source frameworks for the massive scale that big data requires. Right-sizing CPU, memory, and storage IOPS allocations to closely match the needs of the processing and aggregation nodes is best achieved when those resources are virtualized. Customizing these factors is the key to achieving peak efficiency while avoiding over-provisioning. Additionally, the software-defined nature of next-gen systems allows for agile responsiveness to changing requirements and faster development iteration.
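The right-sizing idea above can be sketched in a few lines of code. This is an illustrative toy, not a Trove or Sahara API: the `WorkloadProfile`, `Flavor`, and `right_size` names, the headroom factor, and all the numbers are assumptions chosen to show how measured peaks plus a safety margin translate into the smallest virtual flavor that fits, rather than an over-provisioned bare-metal box.

```python
# Hypothetical sketch of right-sizing a virtual "flavor" (vCPUs / RAM / IOPS)
# to a node's measured workload profile. All names and numbers here are
# illustrative assumptions, not part of the OpenStack Trove or Sahara APIs.
import math
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    peak_cpu_cores: float   # measured peak core usage
    peak_memory_gb: float   # measured peak resident memory
    peak_iops: int          # measured peak storage IOPS

@dataclass
class Flavor:
    vcpus: int
    memory_gb: int
    iops_limit: int

def right_size(profile: WorkloadProfile, headroom: float = 0.25) -> Flavor:
    """Pick the smallest flavor covering measured peaks plus a headroom margin."""
    return Flavor(
        vcpus=math.ceil(profile.peak_cpu_cores * (1 + headroom)),
        memory_gb=math.ceil(profile.peak_memory_gb * (1 + headroom)),
        iops_limit=math.ceil(profile.peak_iops * (1 + headroom)),
    )

# Example: an aggregation node that peaks at 6 cores, 22 GB RAM, 8,000 IOPS
flavor = right_size(WorkloadProfile(peak_cpu_cores=6, peak_memory_gb=22, peak_iops=8000))
print(flavor)  # Flavor(vcpus=8, memory_gb=28, iops_limit=10000)
```

The point of the sketch is the contrast with commodity deployments: with fixed hardware, every node gets the same (usually oversized) box, while virtualized resources can be allocated per node role and re-tuned as the workload changes.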

Sahara Hadoop
The Sahara Hadoop orchestration system within OpenStack: a next-generation approach to virtualized big data platform deployment.

As distributed analytics systems move further into the critical path for business, performance, agility, and flexibility become key decision factors in platform architecture. The massive amounts of data generated by mobile devices, sensor collection, and the Internet of Things (IoT) create problems that require far more creative solutions than simply stacking racks upon racks of last-generation hardware to store and crunch the ones and zeroes.

What is the overall cost of the solution? It goes well beyond simple dollars-per-terabyte calculations, reaching from the data center footprint all the way up to the operational expenses and staffing for the platform itself.

For organizations looking to implement big data solutions, or early adopters revisiting their investment in analytics technologies, it is critical to look past the base component costs of server platforms and ask:

How critical is timeliness in the analysis process?

Is a day or two a reasonable timeframe to gather metrics, or do you require something closer to real-time data?

Is commodity hardware truly a good fit for the future direction of your organization?

Do you require flexibility, performance, and the ability to quickly copy vast amounts of data from one section of the platform to another?

These questions go well beyond the myopic focus of "How cheaply can we store all this stuff?" and address the true concerns of the organization for the near and long term.

If you’re interested in focusing on the bigger picture of big data, get in touch with our Solutions specialists and schedule a conversation today.


Chris Merz

A seasoned Internet services database veteran, Chris Merz is the Chief Database Systems Engineer at SolidFire. Chris develops benchmarking methodologies and database system certifications for SolidFire’s next-gen storage products.