ESG Lab performed a series of hands-on tests and evaluations of the NetApp Solutions for Hadoop. The intent was to demonstrate that the NetApp Solutions for Hadoop can perform and scale linearly as data volumes and loads increase, and can recover from a single node failure with no disruption.
The performance and scalability benefits of using NetApp E-Series hardware-based RAID and a lower Hadoop replication count were evaluated, as were the performance benefits of NetApp dynamic disk pooling (DDP) and solid-state disks (SSD). Testing was performed using open source software, workload generators, and monitoring tools. The detailed test report is here. A summary of some of the test results follows.
Table 1 Scaling Performance
| ||4 DataNodes||8 DataNodes|
|Hadoop Data Set Size||0.5||1.0|
- As depicted in Table 1, when the number of DataNodes doubled and the volume of generated data doubled with it, the TeraGen data loading completion time remained flat at approximately five and a half minutes. This demonstrates the linear performance scalability of the NetApp Solution for Hadoop.
- As the number of DataNodes increased and the volume of data generated increased linearly, the TeraSort data sorting completion time decreased 11%, from 25:40 to 22:51, demonstrating the performance scalability of the NetApp Solution for Hadoop.
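The arithmetic behind these two bullets can be checked with a short sketch. It uses only the figures quoted above (data set sizes of 0.5 and 1.0, and TeraSort times of 25:40 and 22:51); the helper names are illustrative, and the throughput ratio is a derived figure, not one stated in the ESG report.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_decrease(before: float, after: float) -> float:
    """Percentage by which 'after' is smaller than 'before'."""
    return (before - after) / before * 100.0


# TeraSort completion times quoted in the text.
terasort_4_nodes = to_seconds("25:40")
terasort_8_nodes = to_seconds("22:51")
terasort_drop = percent_decrease(terasort_4_nodes, terasort_8_nodes)

# The data set doubled (0.5 -> 1.0) while the node count doubled (4 -> 8).
data_ratio = 1.0 / 0.5

# TeraGen time stayed flat, so aggregate load throughput scaled with the data.
teragen_throughput_ratio = data_ratio

# TeraSort finished faster on twice the data, so its aggregate throughput
# grew by more than the data ratio (derived here, not reported by ESG).
terasort_throughput_ratio = data_ratio * terasort_4_nodes / terasort_8_nodes

print(f"TeraSort time decrease: {terasort_drop:.1f}%")
print(f"TeraGen throughput ratio (8 vs. 4 nodes): {teragen_throughput_ratio:.2f}x")
print(f"TeraSort throughput ratio (8 vs. 4 nodes): {terasort_throughput_ratio:.2f}x")
```

A flat completion time on twice the data is what linear scaling looks like in practice: each added DataNode contributes roughly the same throughput as the existing ones.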
Table 2 E-Series DDP Performance vs. RAID 5
|Storage Configuration||Per Node Throughput (MB/Sec)||Improvement vs. RAID 5|
|RAID 5 (6+1)||154|
|NetApp DDP w/ 60 Drives||663||330%|
|NetApp DDP w/ 120 Drives||981||536%|
|NetApp DDP w/ 180 Drives||1044||577%|
- As depicted in Table 2, the DDP configuration distributed data from a single node across 60, 120, or 180 drives. This demonstrates that the DDP configurations were able to aggregate the throughput of a significantly larger number of drives for better performance.
- Using DDP with 60 drives resulted in 330% higher per-node throughput than RAID 5 (663 vs. 154 MB/sec), while DDP with 180 drives resulted in 577% higher per-node throughput than RAID 5 (1044 vs. 154 MB/sec).
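The percentages in Table 2 follow from the throughput column. The sketch below recomputes them from the published MB/sec values, assuming the improvement is measured relative to the RAID 5 (6+1) baseline; the variable names are illustrative. The recomputed values agree with the reported ones to within about a percentage point, with the small differences presumably due to rounding in the published throughput figures.

```python
# Per-node throughput values from Table 2, in MB/sec.
RAID5_MBPS = 154
ddp_mbps = {60: 663, 120: 981, 180: 1044}

# Improvement figures as reported in Table 2, in percent.
reported_pct = {60: 330, 120: 536, 180: 577}


def percent_improvement(baseline: float, value: float) -> float:
    """Percentage by which 'value' exceeds 'baseline'."""
    return (value - baseline) / baseline * 100.0


computed_pct = {
    drives: percent_improvement(RAID5_MBPS, mbps)
    for drives, mbps in ddp_mbps.items()
}

for drives, pct in computed_pct.items():
    print(
        f"DDP w/ {drives} drives: computed {pct:.1f}% "
        f"(reported {reported_pct[drives]}%)"
    )
```

Note that going from 120 to 180 drives adds far less throughput than going from 60 to 120, so the gain from spreading a single node's data across more drives tapers off at the high end.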
Table 3 Impact of Drive Failure
|Test Scenario||Healthy Cluster Job Completion Time|
|Hadoop cluster with internal drives||0:29:29|
|NetApp solution for Hadoop||0:27:02|
- As depicted in Table 3, with the traditional Hadoop configuration (internal drives), a drive failure roughly doubled the job completion time.
- Using the NetApp Solution for Hadoop, the E5660 detected the drive failure and automatically deployed a hot spare, providing continued data protection while rebuilding the RAID set.
- A drive failure with the NetApp Solution for Hadoop only affected the performance of the attached DataNode. The DataNode was still able to participate in all Hadoop operations.
- The drive failure with the NetApp Solution for Hadoop resulted in only a 1% longer job completion time, vs. 104% longer with a Hadoop cluster with internal drives.
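Table 3 as reproduced here lists only the healthy-cluster times, so the following sketch estimates what the quoted slowdowns (104% and 1%) would imply for the failure-case job times. These are derived approximations based on the rounded percentages, not measurements from the ESG report, and the function names are illustrative.

```python
from datetime import timedelta


def parse_hms(text: str) -> timedelta:
    """Parse an 'H:MM:SS' duration string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def with_slowdown(healthy: timedelta, percent_longer: float) -> timedelta:
    """Estimate a job time that is 'percent_longer' percent above 'healthy'."""
    return healthy * (1 + percent_longer / 100.0)


# Healthy-cluster job completion times from Table 3.
internal_healthy = parse_hms("0:29:29")
netapp_healthy = parse_hms("0:27:02")

# Slowdowns quoted in the text for a single drive failure.
internal_degraded = with_slowdown(internal_healthy, 104)
netapp_degraded = with_slowdown(netapp_healthy, 1)

print(f"Internal drives, estimated time with drive failure: {internal_degraded}")
print(f"NetApp solution, estimated time with drive failure: {netapp_degraded}")
```

Under these assumptions the internal-drive cluster needs about an hour to finish the job after a drive failure, while the NetApp configuration finishes within seconds of its healthy time.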
The ESG analysis and hands-on testing demonstrated the tangible benefits that an organization can achieve with a distributed, open application framework (e.g., Hadoop, NoSQL) that leverages purpose-built, direct-attached NetApp E-Series storage. ESG recommends that you consider NetApp E-Series storage for your next distributed, open Big Data application project.