By Val Bercovici, Senior Director, Office of the CTO, NetApp
Today most organizations use analytical frameworks such as Hadoop to ingest, store, and analyze their new big data (typically characterized by the three Vs: volume, velocity, and variety). But what about running Hadoop analytics to gain insights from the large amount of useful data stored on existing NFS storage? Setting up an analytics cluster usually requires a separate set of infrastructure, over and above the existing storage infrastructure for enterprise data. That approach forces enterprises to invest in additional hardware, copy or move existing data, and then maintain that data across two systems. Those requirements are costly, inefficient, and time-consuming, and they ultimately blunt the impact of the original analytics project.
Introducing NetApp NFS Connector for Hadoop
As you might have guessed, NetApp has a compelling product that is designed to optimize the preceding scenario. With the NetApp® NFS Connector for Hadoop, users can immediately analyze data on existing NFS-based storage, such as NetApp FAS storage arrays. The NFS Connector enables analyzing this NFS data without moving the data into the analytics cluster, saving expense, time, and effort. Without the need to copy and manage data across different silos, IT administrators and operations can support Apache Hadoop analytics without additional storage hardware. Also, data workflows are accelerated and simplified, increasing agility for the desired business goals of the underlying analytics project or projects.
Analyzing both sets of data (incoming and existing) can give much more powerful cross-referenced business insights about customer buying patterns, competitive behavior, and new market opportunities, just to name a few. In that way, companies can leverage their existing investments in enterprise storage and enable analytics incrementally, at a manageable pace. Many types of file-based data exist, such as source-code repositories, e-mails, and log files. These files are generated by traditional applications but currently require a cumbersome workflow to ingest the data into a separate analytics file system. NetApp NFS Connector for Hadoop allows a single storage back end to manage and service data for both enterprise and analytics workloads. Data analytics, using the same file system namespace, can analyze enterprise data with no additional ingest workflows.
With the NFS Connector, NFS storage can be used in either of two principal ways:
- As a secondary file system, in which Apache Hadoop uses the Hadoop Distributed File System (HDFS) for its primary storage (as a cache of a subset of the data), connecting through NFS for secondary storage to the rest of the data
- As the primary file system, in which Apache Hadoop runs entirely on NFS storage
Used as a secondary file system, the connector allows users to read and write data between HDFS storage and NFS storage, enabling easy data sharing between storage systems running either file system.
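To make the two modes concrete, here is a minimal Python sketch of how a Hadoop-style client decides which file system serves a given path. It is a toy model, not the connector's code. It relies on the general Hadoop convention that unqualified paths go to the configured default file system while fully qualified URIs go to the file system named in the URI. The `nfs://` scheme, the host names, and the ports are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical endpoints, used only for illustration.
HDFS_URI = "hdfs://namenode.example.com:8020"
NFS_URI = "nfs://filer.example.com:2049"


def resolve_filesystem(path: str, default_fs: str) -> str:
    """Return the scheme://authority of the file system that would serve `path`.

    Fully qualified paths name their own file system; unqualified paths
    fall back to the default file system.
    """
    parsed = urlparse(path)
    if parsed.scheme:
        return f"{parsed.scheme}://{parsed.netloc}"
    default = urlparse(default_fs)
    return f"{default.scheme}://{default.netloc}"


# Secondary mode: HDFS stays the default, NFS data is reached by full URI.
print(resolve_filesystem("/user/etl/staging", default_fs=HDFS_URI))
print(resolve_filesystem(NFS_URI + "/logs/web", default_fs=HDFS_URI))

# Primary mode: NFS is the default, so plain paths land on NFS storage.
print(resolve_filesystem("/logs/web", default_fs=NFS_URI))
```

In the secondary mode, a job can read its input from an `nfs://` path and write results to an unqualified path on HDFS, or the other way around. In the primary mode, no HDFS endpoint appears at all.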
Works with key analytics open-source projects with an open implementation
The NetApp NFS Connector for Hadoop works specifically with MapReduce for the compute or processing part of the Apache Hadoop framework. It can also support other Apache projects, such as Apache HBase (columnar database) and Apache Spark (another processing engine that is compatible with Hadoop). The NFS Connector also works with Tachyon, an in-memory file system that can run with Apache Hadoop and Apache Spark. Being able to support all these old and new big data analytics platforms speaks to the versatility and investment protection of the NetApp NFS Connector.
Gives back to the big data community
The connector works with any NFS-based storage system, and it has no proprietary features that are designed specifically for the NetApp clustered Data ONTAP® operating system. It is fully open source, it is hosted on GitHub, and NetApp plans to contribute the code to the main Hadoop trunk. NetApp has been an innovator in NFS. We have pioneered NFS standards to advance file-based storage access in UNIX and Linux environments since our inception. Our engineers lead the NFS3 standards efforts, and our company was first to market with pNFS (4.x) support for NAS. NetApp storage solutions come heavily pretested against the leading NFS RFC standards.
Provides a scalable single copy with data availability
The NetApp NFS Connector for Hadoop decouples storage from compute, thereby allowing several optimizations. First, it allows analytics on data stored on other file systems, such as NFS. Second, it improves storage efficiency by leveraging existing technologies such as NetApp RAID DP® and SnapMirror® for data protection (rather than the three copies that HDFS uses). Third, it allows the use of all the data management features of NetApp clustered Data ONTAP (such as compression, deduplication, NetApp FlexClone® volumes, and NetApp Snapshot® copies) with greater than “five 9s” (99.999%) availability.
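The efficiency and availability claims above can be put into rough numbers. The following Python sketch is back-of-the-envelope arithmetic only. It compares the raw capacity needed for three full copies with the raw capacity needed by a dual-parity RAID group (RAID DP uses two parity disks per group), and it converts an availability percentage into downtime per year. The 100 TB data set and the 14-data-disk group size are hypothetical values, and the model ignores spares, file system overhead, compression, deduplication, and any SnapMirror replica.

```python
def raw_capacity_replicated(usable_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def raw_capacity_dual_parity(usable_tb: float, data_disks: int,
                             parity_disks: int = 2) -> float:
    """Raw capacity needed when each RAID group adds `parity_disks` parity disks."""
    return usable_tb * (data_disks + parity_disks) / data_disks


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime implied by an availability fraction (365-day year)."""
    return (1.0 - availability) * 365 * 24 * 60


usable_tb = 100.0  # hypothetical data set size
print(f"3 copies:    {raw_capacity_replicated(usable_tb):.1f} TB raw")
print(f"dual parity: {raw_capacity_dual_parity(usable_tb, data_disks=14):.1f} TB raw")
print(f"five 9s:     {downtime_minutes_per_year(0.99999):.2f} minutes/year")
```

Under these assumptions, the same 100 TB needs 300 TB of raw disk with three copies but only about 114 TB with dual parity, and 99.999% availability corresponds to a little over five minutes of downtime per year.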
To deploy a Hadoop cluster when you want to perform conventional ingest and storage of incoming data, you can use the NetApp Open Solution for Hadoop (NOSH). NOSH provides simple, reliable, and scalable storage for Hadoop with solutions that validate Hadoop distributions on NetApp E-Series storage.
Where to download the connector and get more information
You can download the NFS Connector from the GitHub repository. Refer to the NFS Connector Technical Report for more information about configuration, use cases, and the underlying architecture. Note that it is a beta version, so feedback is welcome; please e-mail email@example.com with any comments or suggestions.