I came to NetApp in 2014, joining a wonderfully talented team of UNIX admins. The environment is quite large, consisting of roughly 3,000 servers. The charter given to the team was both simple and daunting: make the configuration landscape more consistent while making that landscape easier to manage.
The work we’ve done in UNIX Engineering since then has gotten us to a point where nearly everything we deliver is software-defined. This lets us test changes in sub-production that are identical to what we roll out into the IT environment. Tools such as Ansible configuration management software, combined with a well-defined delivery pipeline, have allowed us to test and deliver feature requests and break-fixes quickly and efficiently.
Adopting a software-defined approach has enabled us to ensure consistency across our global enterprise environment and flexibility to keep pace with business requests along with the rest of IT. It’s important that our servers are in sync with the rest of IT so we don’t delay upgrades in other areas, especially storage and networking. Our approach to solution delivery is yielding some big wins that I describe below.
Starting Small – NTP Client Configurations
We started our configuration management journey by picking an easy-win use case: managing Network Time Protocol (NTP) client configuration. As I stated before, we are ultimately responsible for roughly 3,000 servers. However, these machines aren’t all the same. Some are VMware® virtual machines running some form of Linux (mostly RHEL 5 and 6), others are Solaris non-global zones, others are AIX LPARs, and so on. We picked a representative subset of these platforms to test against in our non-production landscape.
We used configuration management software to write code that expresses the desired state for NTP client configuration. Within a day or so, we had a good code base and were ready to push our changes to the sub-production environment.
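To make the idea of "desired state" concrete, here is a minimal Python sketch. It is not the code we ran (that lived in our configuration management tool); the server names are hypothetical and the helper functions are made up for illustration. The point it shows is idempotence: a host that already matches the desired state is left alone.

```python
# Hypothetical NTP servers; the real ones are not named in this article.
DESIRED_SERVERS = ["ntp1.example.com", "ntp2.example.com"]


def render_ntp_conf(servers):
    """Render the desired ntp.conf content from a list of server names."""
    lines = ["driftfile /var/lib/ntp/drift"]
    lines += [f"server {name} iburst" for name in servers]
    return "\n".join(lines) + "\n"


def converge(current, desired):
    """Return (changed, content); content is replaced only when it differs."""
    if current == desired:
        return False, current
    return True, desired


desired = render_ntp_conf(DESIRED_SERVERS)

# A host with a stale server entry is corrected on the first run...
changed, content = converge("server old-ntp.example.com\n", desired)
print(changed)

# ...and a second run against the corrected content reports no change.
changed_again, _ = converge(content, desired)
print(changed_again)
```

A real tool also has to handle per-platform file locations and service restarts, which this sketch leaves out.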
After we pushed our NTP client configuration code out to the approximately 1,000 sub-production machines, the configuration software registered changes left and right. As we had suspected, significant client configuration inconsistencies existed. In just a couple of hours we were able to automate the configuration across our servers. After deploying our changes to production, we saw a dramatic difference: work that had taken hours now took minutes.
From this point forward, all changes to the environment were made using configuration management software. The result is a compute environment that’s more consistent than ever before and that takes fewer man-hours to implement.
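The "changes left and right" we saw amount to a drift report: a list of hosts whose current configuration differs from the desired state. A rough Python sketch of that comparison follows. The hostnames, the inventory, and the server names are all invented for illustration, and the comparison is deliberately strict, so a host with the right servers in the wrong order still counts as drifted.

```python
# Hypothetical desired state and inventory (hostname -> configured NTP servers).
DESIRED = ("ntp1.example.com", "ntp2.example.com")

inventory = {
    "rhel6-app01": ("ntp1.example.com", "ntp2.example.com"),
    "sol-zone07": ("old-ntp.example.com",),
    "aix-lpar03": ("ntp2.example.com", "ntp1.example.com"),
}


def drifted_hosts(inventory, desired):
    """Return sorted hostnames whose configured servers differ from desired.

    The comparison is order-sensitive: same servers in a different order
    is still treated as drift.
    """
    return sorted(
        host
        for host, servers in inventory.items()
        if tuple(servers) != tuple(desired)
    )


print(drifted_hosts(inventory, DESIRED))
```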
Going Big – ASUP
Since our start-small use case, we’ve standardized multiple aspects of system configuration: logging, monitoring, and system accounts, just to name a few. In fact, configuration management has become an integral part of our build and lifecycle management process.
In the fall of 2016, my team was asked to deliver a comprehensive solution that would maintain systems in our ASUP environment per a desired state. (ASUP is NetApp’s monitoring and reporting application for verifying the health of customer systems.)
The scope included everything: service users, storage multipath management, filesystems, network stack, etc. In addition, the server count across the ASUP landscape is in the hundreds. Working with the ASUP application owner and our storage engineering team, we defined a desired state for all servers in the application ecosystem. Then we used our configuration management process to ensure all servers met this desired state.
Armed with a well-defined desired state for these builds, we were able to make substantial improvements to both delivery time and consistency. Previous build efforts were almost entirely manual, required 4 hours of effort per server, and often produced inconsistent results due to human involvement. Now build time is 15 minutes per server and results are consistent every time.
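The arithmetic behind those numbers is worth spelling out. Going from 4 hours to 15 minutes per server is a 16x reduction. The short Python sketch below scales that to a fleet; the count of 200 servers is a hypothetical stand-in for "hundreds", not our actual figure, and it treats manual effort and automated build time as directly comparable.

```python
MANUAL_MINUTES = 4 * 60      # 4 hours of manual effort per server
AUTOMATED_MINUTES = 15       # automated build time per server


def build_hours(server_count):
    """Return (manual_hours, automated_hours, hours_saved) for a fleet."""
    manual = server_count * MANUAL_MINUTES / 60
    automated = server_count * AUTOMATED_MINUTES / 60
    return manual, automated, manual - automated


speedup = MANUAL_MINUTES / AUTOMATED_MINUTES
print(speedup)            # 16.0

# Hypothetical fleet size, chosen only to illustrate the scale.
print(build_hours(200))   # (800.0, 50.0, 750.0)
```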
As the project progressed, issues were discovered that required adjustments. One such issue was our storage multipath configuration. As a team, we discussed what the new desired state for the storage multipath configuration should be. Once this was decided, we were able to use our configuration management routine to update hundreds of ASUP servers in a matter of minutes and in an utterly consistent fashion.
As we continue our hybrid cloud journey, we may need to maintain more servers than ever before. However, we should be able to scale out and manage this growing environment with ease. We are also looking further down the road. Key areas of focus will include shrinking our mean time-to-delivery, preventing systems from going into an unstable state, and applying this methodology to an environment where applications are containerized. But our primary focus will remain on ensuring compute can advance at the same pace as storage, networking, and the rest of IT, and can support the pace of NetApp’s business.
The NetApp IT blog series features advice from subject matter experts from NetApp IT who share their real-world experiences using NetApp’s industry-leading storage solutions to support business goals. Want to learn more about the program? Visit www.NetAppIT.com.