One of the three goals for the GlobalNOC Renewal Program is to become one of our community’s leaders in network automation. To that end, we’ve set goals for year one of the automation objective. One of the key results for this objective is to automate, within the first year, 80% of the configuration changes GlobalNOC makes to core L2/L3 devices.
Our Goal: within 2019, automate 80% of the changes GlobalNOC makes to core L2/L3 devices
Breaking that key result down, defining who those clients are is straightforward, as is defining core L2/L3 equipment, and “in the first year” is self-explanatory. Defining what counts as “automation” and what counts as a “change,” however, takes more work. To build a roadmap for the next year, we started by determining what changes were actually being made to the equipment, so we could work out what it would take to reach the 80%.
What is a “change”?
This was non-trivial and ultimately involved more hand-waving than is, at first glance, comfortable. We don’t yet have a rigorous change management system in place to examine the changes and determine what they were for, and our ticketing system isn’t designed for this type of analysis. So we turned to the syslogs. CLI accounting was enabled for each client, meaning that every command typed on the equipment was sent to a syslog server. It also turned out that the vast majority of the equipment in scope are Junipers. In Juniper-land, a configuration value is changed with a “set” command, and that changed value is pushed into production with a “commit” command; one commit can thus push several set commands into production at once. At this point it was clear that our metrics probably needed to compare the number of set and commit commands performed by automated tools to the number performed manually by staff.
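As a rough sketch, the set-vs.-commit tally might look like the following. The log lines here are invented examples (real CLI-accounting formats vary by platform and syslog configuration), and identifying automated tools by a dedicated username is an assumption, not how GlobalNOC necessarily does it:

```python
import re
from collections import Counter

# Hypothetical CLI-accounting syslog lines for illustration only.
LOGS = [
    "Jan 10 12:00:01 core1 mgd[1234]: UI_CMDLINE_READ_LINE: User 'jdoe', command 'set interfaces xe-0/0/0 description customer-a'",
    "Jan 10 12:00:05 core1 mgd[1234]: UI_CMDLINE_READ_LINE: User 'jdoe', command 'commit'",
    "Jan 10 12:01:00 core1 mgd[1234]: UI_CMDLINE_READ_LINE: User 'automation', command 'set firewall family inet filter protect term 1'",
    "Jan 10 12:01:02 core1 mgd[1234]: UI_CMDLINE_READ_LINE: User 'automation', command 'commit'",
]

AUTOMATION_USERS = {"automation"}  # assumed naming convention for tool accounts
LINE_RE = re.compile(r"User '(?P<user>[^']+)', command '(?P<cmd>[^']*)'")

counts = Counter()
for line in LOGS:
    m = LINE_RE.search(line)
    if not m:
        continue
    cmd = m.group("cmd")
    verb = cmd.split()[0] if cmd else ""
    if verb not in ("set", "commit"):
        continue  # ignore show/edit/etc.
    who = "auto" if m.group("user") in AUTOMATION_USERS else "manual"
    counts[(who, verb)] += 1

print(counts)
```

Dividing the "auto" totals by the overall totals then gives the automation percentage the key result asks for.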
What kinds of changes do we actually make?
But what activities do we need to automate in order to hit our 80% goal? The logs make it relatively easy to see the commits, but not WHY the commits were made. The JunOS “comment” feature allows a comment to be recorded alongside the commit, a reason the work was done. In practice, however, these comments were not useful. Some referred to tickets or Kanban cards with more information; some just summarized the change in a few words. There was little uniformity in how the feature was used across the various networks or the staff working on them. We made a note to standardize this functionality across clients as one of our automation deliverables, then moved on to a different area to try to answer our 80% question.
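One way such a standard could be enforced is a simple lint on the comment text. The convention below (a ticket or Kanban reference at the start of every commit comment) is purely hypothetical, sketched here to show the idea:

```python
import re

# Hypothetical convention: every commit comment must begin with a
# ticket or Kanban reference, e.g. "TICKET-4521: add peer AS65000".
TICKET_RE = re.compile(r"^(TICKET|KANBAN)-\d+\b")

def comment_ok(comment: str) -> bool:
    """Return True if the commit comment follows the assumed convention."""
    return bool(TICKET_RE.match(comment.strip()))

print(comment_ok("TICKET-4521: add peer AS65000"))  # conforming
print(comment_ok("fixed the thing"))                # non-conforming
```

With comments in a predictable shape, tying commits back to the activity that prompted them becomes a join against the ticketing system rather than guesswork.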
The set commands were next. Each set command names the portion of the configuration it applies to: set interfaces, set firewall, set protocols, set policy-options, and so on. This was relatively easy to count with a simple query. Getting beyond that, though, proved more difficult. While we could determine which portion of the configuration a change touched, we could not tell WHY the change took place. What activity was it related to?
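The “simple query” amounts to bucketing each set command by its second token, which names the top-level configuration stanza. A minimal sketch, with made-up commands standing in for the real log data:

```python
from collections import Counter

# Hypothetical set commands; in practice these come from the syslogs.
SET_COMMANDS = [
    "set interfaces xe-0/0/0 unit 0 family inet address 192.0.2.1/30",
    "set interfaces xe-0/0/1 description backbone",
    "set firewall family inet filter protect-re term ntp from source-prefix-list ntp-servers",
    "set policy-options prefix-list customer-a 198.51.100.0/24",
    "set protocols bgp group peers neighbor 192.0.2.2",
]

# The token after "set" names the top-level config stanza.
stanzas = Counter(cmd.split()[1] for cmd in SET_COMMANDS)
print(stanzas.most_common())
```

This gives the per-stanza counts, but, as the next paragraphs describe, nothing about the activity behind them.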
This step turned out to be more art than science. A senior engineer reviewed the various set commands in time sequence, making a judgment call to tie each to an activity: these “set policy-options” commands seem tied to a per-BGP-session prefix-list, while this other batch seems applied to prefix-lists that protect the core node; these interface changes appear to belong to a backbone interface augment, while those appear tied to a customer/peer interface. This worked about as well as you would expect: it was a mess. There were too many set commands to examine, and the context just wasn’t there without much greater effort. But we did learn some things. Very valuable things.
First, the number of one-time changes eclipsed the number of routine changes. We looked at our “set protocols” figures to get an idea of how many BGP peerings we brought up, one of our routine tasks, or, more accurately, a task everyone cited as routine. There were very few changes in that part of the configuration. In contrast, we expected the “set firewall” portion of the config to be rather static, yet we made several thousand changes in that area. These turned out to have happened over two days in a single month, when some of the core firewall filters on the equipment were updated. A one-time event; a change, if you will, to how we deliver the service.
And yet as we dug deeper, we found that these one-time service changes were a huge part of our work. When questioned, everyone cited new customer/peering/interface turn-ups as a major part of their work. The set commands, though, showed that one-time changes to services, or the enablement of a new feature (like uRPF fail filters), were a much larger share of the work.
There were some successes in tying set commands to activities. We determined, to our satisfaction, that prefix-list modifications for customers and peers were common tasks, as were changes to the “global” prefix-lists that control access to SNMP, NTP, and so on.
There were other successes as well. We determined that “set interfaces” accounted for two-thirds of our work; combined with the one-time firewall work, that alone reaches our 80% goal. Mapping commit commands to activities is harder, but we think the “set interfaces” work, combined with our high volume of “set policy-options” prefix-list additions and deletions, will get us to 80%.
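The coverage arithmetic is straightforward once the per-stanza counts exist. The figures below are invented for illustration (only the two-thirds interfaces share echoes the text); the point is the calculation, not the numbers:

```python
# Illustrative change counts per stanza -- NOT real GlobalNOC figures.
changes = {
    "interfaces": 670,      # ~two-thirds of all changes, per the analysis
    "firewall": 150,
    "policy-options": 100,
    "protocols": 30,
    "other": 50,
}
total = sum(changes.values())

# Share of changes covered by automating the targeted stanzas.
targeted = changes["interfaces"] + changes["firewall"] + changes["policy-options"]
print(f"{targeted / total:.0%}")
```

With numbers shaped like these, automating the top two or three stanzas clears the 80% bar with room to spare.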
As you can see in this chart, the vast majority of changes fell into just a few areas: interfaces, firewall filters, and policy changes.
The next step for us is simple: use this information to formulate a work plan for the next year, aiming automation projects at the most common areas of our configurations.
As the wise stop-motion Santa Claus once said, “if you put one foot in front of the other, soon you’ll be walking out the door!”