This blog summarizes some of the results from a recent ESG Lab validation of NetApp Solutions for Hadoop.

ESG Lab performed a series of hands-on tests and evaluations of the NetApp Solutions for Hadoop. The intent was to demonstrate that the NetApp Solutions for Hadoop can perform and scale linearly as data volumes and loads increase, and can recover from a single node failure with no disruption.

 

The performance and scalability benefits of using NetApp E-Series hardware-based RAID and a lower Hadoop replication count were evaluated, as were the performance benefits of NetApp dynamic disk pooling (DDP) and solid-state disks (SSD). Testing was performed using open source software, workload generators, and monitoring tools. The detailed test report is here. A summary of some of the test results follows.

 

Table 1 Scaling Performance

 

Metric                            4 DataNodes   8 DataNodes
NetApp E5660 arrays               1             2
NetApp E5660 drives               60            120
Raw Capacity                      360           720
Hadoop Data Set Size              0.5           1.0
Data load with TeraGen (h:m:s)    00:05:35      00:05:25
Data sort with TeraSort (h:m:s)   00:25:40      00:22:51

 

  • As Table 1 shows, when the number of DataNodes and the volume of generated data both doubled, the TeraGen data load completion time stayed flat at approximately five and a half minutes. This demonstrates the linear performance scalability of the NetApp Solution for Hadoop.
  • Over the same doubling of DataNodes and data volume, the TeraSort data sort completion time decreased 11%, from 25:40 to 22:51, further demonstrating the performance scalability of the NetApp Solution for Hadoop.
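One way to read "linear scaling" from these numbers is to compare the data processed per second before and after the cluster doubled. The sketch below is only an illustration, not part of the ESG methodology: the helper names (to_seconds, scaling_efficiency) are made up for this post, and the inputs are the completion times and data set sizes reported above. Because only the ratio of the data set sizes matters, the unit of the size column does not affect the result. A value of 1.0 means throughput grew exactly in step with the node count.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:m:s (or m:s) string to a number of seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def scaling_efficiency(size_small, time_small, nodes_small,
                       size_large, time_large, nodes_large):
    """Throughput growth divided by node-count growth (1.0 = perfectly linear)."""
    rate_small = size_small / to_seconds(time_small)
    rate_large = size_large / to_seconds(time_large)
    return (rate_large / rate_small) / (nodes_large / nodes_small)


# Inputs as reported above: data set size, completion time, DataNode count.
teragen = scaling_efficiency(0.5, "00:05:35", 4, 1.0, "00:05:25", 8)
terasort = scaling_efficiency(0.5, "00:25:40", 4, 1.0, "00:22:51", 8)

print(f"TeraGen scaling efficiency:  {teragen:.2f}")
print(f"TeraSort scaling efficiency: {terasort:.2f}")
```

Both values come out slightly above 1.0, which is another way of saying that twice the nodes handled twice the data in the same time or less.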

 

Table 2 E-Series DDP Performance vs. RAID 5

 

Storage Configuration      Per Node Throughput (MB/Sec)   Performance Gain over RAID 5
RAID 5 (6+1)               154                            -
NetApp DDP w/ 60 Drives    663                            330%
NetApp DDP w/ 120 Drives   981                            536%
NetApp DDP w/ 180 Drives   1044                           577%

 

 

  • As Table 2 shows, the DDP configurations distributed data from a single node across 60, 120, or 180 drives. Because each node could aggregate the throughput of a significantly larger number of drives than a 6+1 RAID 5 set, performance improved.
  • Using DDP with 60 drives resulted in 330% better performance than RAID 5, while DDP with 180 drives resulted in 577% better performance than RAID 5.
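The gain percentages can be approximately reproduced from the published throughput figures. The following is a minimal sketch, assuming the gain is calculated as (DDP throughput / RAID 5 throughput - 1) x 100; the report does not spell out its formula, and the names below are invented for illustration. The per-drive figure is an extra derived number, not something reported by ESG.

```python
# Per-node throughput in MB/s, copied from Table 2.
RAID5_MBPS = 154
DDP_MBPS = {60: 663, 120: 981, 180: 1044}  # drive count -> throughput


def gain_percent(value: float, baseline: float) -> float:
    """Percentage improvement of value over baseline (assumed formula)."""
    return (value / baseline - 1) * 100


gains = {drives: gain_percent(mbps, RAID5_MBPS) for drives, mbps in DDP_MBPS.items()}
per_drive = {drives: mbps / drives for drives, mbps in DDP_MBPS.items()}

for drives in DDP_MBPS:
    print(f"DDP, {drives} drives: {gains[drives]:.1f}% over RAID 5, "
          f"{per_drive[drives]:.1f} MB/s per drive")
```

The recomputed gains land within about one point of the published percentages; the small differences presumably come from rounding in the published throughput values. The per-drive share shrinks as drives are added, so the step from 120 to 180 drives adds less throughput than the step from 60 to 120.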

 

Table 3 Impact of drive failure

 

Test Scenario                         Healthy Cluster Job Completion Time (hh:mm:ss)   Drive Failure Job Completion Time (hh:mm:ss)   Impact
Hadoop cluster with internal drives   0:29:29                                          1:00:14                                        104%
NetApp solution for Hadoop            0:27:02                                          0:27:12                                        1%

 

  • As Table 3 shows, with the traditional Hadoop configuration (internal drives), a drive failure roughly doubled the job completion time.
  • Using the NetApp Solution for Hadoop, the E5660 detected the drive failure and automatically deployed a hot spare, providing continued data protection while rebuilding the RAID set.
  • A drive failure with the NetApp Solution for Hadoop only affected the performance of the attached DataNode. The DataNode was still able to participate in all Hadoop operations.
  • The drive failure with the NetApp Solution for Hadoop resulted in only a 1% longer job completion time, vs. 104% longer for a Hadoop cluster with internal drives.
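The impact percentages follow directly from the completion times in Table 3. As a rough sketch, assuming impact means the extra time under drive failure relative to the healthy run (the function names here are hypothetical, chosen for this post):

```python
from datetime import timedelta


def parse_hms(hms: str) -> timedelta:
    """Parse an h:mm:ss string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def impact_percent(healthy: str, degraded: str) -> float:
    """Extra completion time under drive failure, as a percentage of the healthy run."""
    base = parse_hms(healthy)
    return (parse_hms(degraded) - base) / base * 100


# (healthy, drive failure) completion times from Table 3.
scenarios = {
    "Hadoop cluster with internal drives": ("0:29:29", "1:00:14"),
    "NetApp solution for Hadoop": ("0:27:02", "0:27:12"),
}

for name, (healthy, degraded) in scenarios.items():
    extra = parse_hms(degraded) - parse_hms(healthy)
    print(f"{name}: +{extra} ({impact_percent(healthy, degraded):.1f}%)")
```

In absolute terms, the internal-drive cluster lost a little over half an hour to the failure, while the E-Series configuration lost about ten seconds.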

 

Summary

 

The ESG analysis and hands-on testing demonstrated the tangible benefits that an organization can achieve with a distributed, open application framework (e.g., Hadoop, NoSQL) that leverages purpose-built, direct-attached NetApp E-Series storage.  ESG recommended that you consider NetApp E-Series storage for your next distributed open Big Data application project.


Mike McNamara

Mike McNamara is a senior manager of product and solution marketing at NetApp with over 25 years of storage and data management marketing experience. Before joining NetApp over 10 years ago, Mike worked at Adaptec, EMC, and Digital Equipment Corporation. Mike was a key leader driving the launch of the industry’s first unified scale-out storage system (NetApp), iSCSI and SAS storage system (Adaptec), and Fibre Channel storage system (EMC CLARiiON). In addition to his past role as marketing chairperson for the Fibre Channel Industry Association, he is a member of the Ethernet Technology Summit Conference Advisory Board, a member of the Ethernet Alliance, a regular contributor to industry journals, and a frequent speaker at events.