In July, my colleague Bob Lofton described NetApp IT’s early success in using the new NetApp All Flash FAS (AFF) to speed up our AutoSupport (ASUP) application. My role at NetApp allows me to see firsthand the impact that AFF has had on ASUP. Here’s more insight into what we’ve learned about relieving performance constraints on critical enterprise applications with AFF, using ASUP as the example.

 

ASUP is a key monitoring, troubleshooting, and reporting tool that continuously checks the health of installed NetApp systems at our customers’ locations. The information ASUP provides to customers and the NetApp team is invaluable: it helps us collectively plan for upgrades, gain visibility into system health, and draw insights from metrics on product quality and usage.

 

When the ASUP subsystem notifies NetApp teams of an issue, automated workflows enable us to resolve the issue proactively. Any delays in this processing would affect our ability to respond quickly to our customers.

 

The Challenge

The ASUP ecosystem is a sophisticated integration of customer site support, partner site support, NetApp data centers, and tools for facilitating customer relationship management.

 

An ASUP file is processed by 20 different application components before the data is available to users. The process can be broadly classified into four areas: reception, ingestion, processing, and end-user access (real-time and reporting). Each week during our six-hour peak workload window, we typically process thousands of ASUP files every minute. To ensure speedy response times, the ASUP files need to be available to the NetApp team within five minutes, and cases need to be created within two minutes of receipt. In addition, both the size and volume of ASUP files are doubling every year.
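To make the two response-time targets concrete, here is a minimal Python sketch of how a single file could be checked against them. Only the five-minute and two-minute limits come from the text above; the function name, its parameters, and the sample timestamps are hypothetical and do not describe the actual ASUP tooling.

```python
from datetime import datetime, timedelta

# Targets stated in the article.
AVAILABILITY_TARGET = timedelta(minutes=5)   # file visible to the NetApp team
CASE_CREATION_TARGET = timedelta(minutes=2)  # case opened after receipt


def meets_targets(received, available, case_created=None):
    """Return (availability_ok, case_ok) for one ASUP file.

    Hypothetical helper: `case_created` is None when the file
    did not trigger a support case, which counts as passing.
    """
    availability_ok = (available - received) <= AVAILABILITY_TARGET
    case_ok = (
        case_created is None
        or (case_created - received) <= CASE_CREATION_TARGET
    )
    return availability_ok, case_ok


# Made-up timestamps, for illustration only.
received = datetime(2000, 1, 1, 2, 0, 0)
on_time = meets_targets(
    received,
    available=received + timedelta(minutes=4),
    case_created=received + timedelta(seconds=90),
)
late = meets_targets(
    received,
    available=received + timedelta(minutes=7),
    case_created=received + timedelta(minutes=3),
)
print(on_time, late)
```

A file that becomes available after four minutes with a case created after 90 seconds passes both checks; one that takes seven and three minutes fails both.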

 

The ASUP file ingestion process relies on simultaneous access to file system metadata and directory operations. Ingestion performance is sensitive to any competing workloads on the storage subsystem, so contention can leave a backlog of unprocessed files. When a backlog occurs, manual intervention is required. This manual workaround affects both operational support costs and the quality of customer service.
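The way a backlog builds up can be sketched with a simple queue model: whenever files arrive faster than ingestion can process them, the difference carries over to the next minute. The Python below is an illustration only; the arrival and capacity figures are invented, not measured ASUP numbers, and the function name is hypothetical.

```python
def backlog_per_minute(arrivals, capacity):
    """Track unprocessed files minute by minute.

    arrivals: files received in each minute
    capacity: files ingestion can process per minute
    """
    backlog = 0
    history = []
    for arrived in arrivals:
        # Anything not processed this minute carries over to the next.
        backlog = max(0, backlog + arrived - capacity)
        history.append(backlog)
    return history


# Invented figures: a steady 3,000 files per minute for four minutes.
arrivals = [3000, 3000, 3000, 3000]

# Enough headroom: the queue stays empty.
print(backlog_per_minute(arrivals, capacity=3500))

# Capacity cut by a competing workload: the queue grows every minute.
print(backlog_per_minute(arrivals, capacity=2500))
```

With capacity below the arrival rate the queue never drains on its own, which is why a sustained slowdown during the peak window ends in manual recovery.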

 

The Solution

Any storage layer change in this complex landscape would normally require months of planning. In this case, however, NetApp® OnCommand® Insight let us quickly identify the two competing workloads: ingestion and reporting. We moved the more critical ingestion workload to a new storage controller. With the non-disruptive capabilities of NetApp clustered Data ONTAP®, we were able to add two new NetApp AFF storage controllers to our cluster.

 

Because most of the data in these volumes is temporary until the ASUP files are ingested, we were able to create new volumes on the new storage controllers and restart the reception and ingestion processes to use them. The new controllers immediately eliminated intermittent disk latency issues during the ingest process. The entire migration effort, from planning through implementation, took less than one week.

 

Measuring Success

Since moving to AFF, backlogs have been eliminated. Files are being processed smoothly, and we no longer need to batch-process files manually.

From an operations perspective, the biggest gain has been in time savings. By eliminating the resource-intensive triage and recovery process, we no longer spend the 100+ hours that each backlog incident required, which totaled more than 1,000 hours over the last two to three years.

 

Using hard drives, we saw 0.5ms latency under normal conditions, but latency could reach 22ms under heavy load. With AFF, we now see 0.09ms latency on average and no higher than 0.4ms under extreme load. This equates to roughly a 50X performance improvement under extreme conditions, but, more importantly, it means consistent and predictable performance for our applications and services.
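The improvement factor can be checked directly from the four latency figures quoted above. The short Python sketch below does only that arithmetic; the function and variable names are chosen for illustration.

```python
def improvement_factor(before_ms, after_ms):
    """How many times lower the latency is after the change."""
    return before_ms / after_ms


# Latency figures quoted in the article, in milliseconds.
hdd_normal_ms, hdd_peak_ms = 0.5, 22.0
aff_average_ms, aff_peak_ms = 0.09, 0.4

normal = improvement_factor(hdd_normal_ms, aff_average_ms)
peak = improvement_factor(hdd_peak_ms, aff_peak_ms)

print(f"normal load: about {normal:.1f}x lower latency")
print(f"peak load:   about {peak:.0f}x lower latency")
```

The peak ratio is 22 / 0.4 = 55, consistent with the roughly 50X figure, while the typical-load ratio is about 5.6. The gap between the two shows that most of the benefit comes from removing the latency spikes.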

 

[Figure: All Flash FAS with ASUP workload]

The storage team uses OnCommand Insight to monitor and analyze response times to ensure there are no bottlenecks and that our workloads are distributed across the storage layer.

Replacing spinning disks with flash has also reduced the power, space, and cooling requirements for this system in our data center. The result is a 64% reduction in power and a 50% reduction in space.

 

Most importantly, customer support cases no longer experience processing delays, and NetApp support teams can access the data they need to assist customers right away. This translates to better productivity, more accurate assessments of system abnormalities, and higher customer satisfaction.

 

This article is another in our NetApp-on-NetApp blog series, which features advice from NetApp IT subject matter experts who share their real-world experiences using NetApp solutions to support business goals. Learn more at www.NetAppIT.com.

Ranga_Nathan