Preparing for AI and Automation
It is undeniable that Artificial Intelligence and Automation are on the minds of the public. With major corporations such as Google, Amazon, Facebook, and Microsoft making the news for their artificial-intelligence research and products, and personalities such as Elon Musk, Bill Gates, and Stephen Hawking giving interviews warning of an A.I. apocalypse, it's no wonder people are talking about it.
Artificial Intelligence has recently migrated into Information Technology, with several companies providing solutions for IT Operations. Executives and managers are quickly eyeing it up, excited by its potential to make employees more efficient, reduce downtime, and minimize staffing. The marketing for these products is very positive, extolling their simplicity of operation and their effectiveness. The algorithms, as it is explained, will handle everything.
What management may not see at first is the configuration cost of getting such a platform up and running, and of keeping it running smoothly. There is no "Easy" button here. Depending on the organization, implementing an A.I. and automation platform may require thousands of hours of work. This article aims to provide some thoughts on prerequisites to using A.I. in your IT infrastructure.
The first requirement is management access. These A.I. algorithms work with large amounts of data. They want to see everything, so that it can potentially be correlated. Thus, the A.I. system needs access to everything from wherever it will be installed. All devices need to be reachable over some form of management network, including servers, switches, routers, firewalls, power strips, UPSs, KVMs, and more. Effectively, anything that can take an Ethernet cable and an IP address needs to have both.
Unless you have an existing inventory of every device that uses a power cable, this step will probably also require a full inventory of all equipment at every location. Many of these devices may be managed by other departments as well, requiring internal resources and collaboration. This is also an important step for many other reasons, and is highly recommended before continuing.
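As a rough illustration of what that inventory is for, here is a minimal Python sketch that reports, per location, which devices still have no management address. The Device fields, the device names, and the addresses are all made up for the example; a real inventory would come from a spreadsheet, CMDB, or discovery tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    # Hypothetical inventory record; field names are assumptions.
    name: str
    site: str
    kind: str
    mgmt_ip: Optional[str] = None  # None or "" means not yet on the management network


def missing_management(devices):
    """Return {site: [device names]} for devices without a management IP."""
    gaps = {}
    for device in devices:
        if not device.mgmt_ip:
            gaps.setdefault(device.site, []).append(device.name)
    return gaps


# Made-up sample data for illustration only.
inventory = [
    Device("nyc-core-sw-01", "nyc", "switch", "10.10.0.2"),
    Device("nyc-rack4-pdu-01", "nyc", "pdu"),
    Device("lon-edge-fw-01", "lon", "firewall", "10.20.0.1"),
    Device("lon-rack1-ups-01", "lon", "ups", ""),
]

if __name__ == "__main__":
    for site, names in missing_management(inventory).items():
        print(site, "needs management access for:", ", ".join(names))
```

A report like this gives each department a concrete list of what still has to be cabled and addressed.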
Be sure to name these devices in a consistent manner. Most of the algorithms in use match devices more reliably when devices in the same logical or physical area share similar wording in their names. This will require the formulation of a corporation-wide naming standard, and potentially renaming hundreds or thousands of devices.
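To make the naming standard concrete, the sketch below checks device names against one possible convention, site-role-type-index (for example nyc-core-sw-01). The convention itself, the list of type codes, and the sample names are assumptions for illustration, not a recommendation from any vendor.

```python
import re

# Hypothetical standard: <3-letter site>-<role>-<type code>-<2-digit index>
NAME_PATTERN = re.compile(
    r"(?P<site>[a-z]{3})-(?P<role>[a-z0-9]+)-"
    r"(?P<kind>sw|rtr|fw|srv|pdu|ups|kvm)-(?P<index>\d{2})"
)


def check_names(names):
    """Split names into those that follow the standard and those that do not.

    Returns (compliant, non_compliant) where compliant maps each name to its
    parsed parts and non_compliant is a list of names needing a rename.
    """
    compliant, non_compliant = {}, []
    for name in names:
        match = NAME_PATTERN.fullmatch(name)
        if match:
            compliant[name] = match.groupdict()
        else:
            non_compliant.append(name)
    return compliant, non_compliant


if __name__ == "__main__":
    ok, bad = check_names(["nyc-core-sw-01", "NYC_Switch1", "lon-edge-fw-02", "printer"])
    print("rename needed:", bad)
```

Because the site and role are parsed out of the name, the same pattern can later be used to group devices by location when reviewing alerts.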
Regarding the network itself, depending on your environment, you may not have a management network, or you may have an unfinished one. In that case you'll need to design and create one for each of your locations, and get it routed properly. Or maybe you have a very large environment with many management networks for various purposes and departments. Those will need to be identified, routes may need to be created, VPN SAs may need reconfiguration, and ACLs may need to be opened toward the location of the A.I. system.
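When several management networks have to be identified and routed, two quick checks are useful: does every device address fall inside its site's management range, and do any ranges overlap (which would make routing to the collector ambiguous)? The following is a small sketch using Python's standard ipaddress module; the site names and address ranges are invented for the example.

```python
import ipaddress

# Hypothetical management ranges per site.
MGMT_NETWORKS = {
    "nyc": ipaddress.ip_network("10.10.0.0/24"),
    "lon": ipaddress.ip_network("10.20.0.0/24"),
}


def outside_management(devices, networks=MGMT_NETWORKS):
    """devices: iterable of (name, site, ip). Return names not inside their site's range."""
    strays = []
    for name, site, ip in devices:
        network = networks.get(site)
        if network is None or ipaddress.ip_address(ip) not in network:
            strays.append(name)
    return strays


def overlapping(networks=MGMT_NETWORKS):
    """Return pairs of sites whose management ranges overlap."""
    sites = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(sites)
        for b in sites[i + 1:]
        if networks[a].overlaps(networks[b])
    ]
```

Anything returned by outside_management either needs re-addressing or needs its own route and ACL entry toward the A.I. system.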
Now that there is a management network that can communicate between all devices and your A.I. system, you need to provide management services over it. The first thing that comes to mind is SNMP. A modern network should have SNMPv3 configured wherever a device supports it, which requires some security design effort as well. MIBs may have to be found, or OIDs walked. Devices will need to be configured to send every available SNMP trap to the A.I. collector, and to allow polling from it.
Next up would be Syslog, preferably with encryption on each device that supports it. This is best designed as a series of Syslog collection servers, one local to each location, which then forward their collections to the A.I. collector. Such a distributed Syslog system requires design and implementation time. It would most likely also include an ELK stack (Elasticsearch, Logstash, Kibana) on top of it for additional analysis, which can be very involved.
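To give a sense of what a local collection server does before forwarding, here is a minimal sketch that decodes the priority value at the start of a syslog line and tags the message with its site. In the syslog format, the number in angle brackets equals facility times 8 plus severity. The tag_for_forwarding function and its output fields are assumptions for illustration; an actual deployment would normally rely on rsyslog, syslog-ng, or Logstash rather than custom code.

```python
import re

# Severity names in numeric order, 0 (most severe) through 7.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>")


def parse_pri(line):
    """Return (facility, severity_name) from a syslog line, or None if absent/invalid."""
    match = PRI_RE.match(line)
    if not match:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


def tag_for_forwarding(line, site):
    """Wrap a raw line with site and decoded priority (hypothetical record shape)."""
    parsed = parse_pri(line)
    if parsed is None:
        return None
    facility, severity = parsed
    return {"site": site, "facility": facility, "severity": severity, "raw": line}
```

Tagging at the local collector means the site information survives even when the device's own hostname does not follow the naming standard yet.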
There may be other monitoring systems already in place, performing up/down detection, resource utilization alerting, and synthetic transactions. Similarly, systems such as vCenter and Amazon CloudWatch may be in use. Each of these systems would need to be configured to copy all alerts to the A.I. collector. These configurations may also need to be customized for the collector, as the A.I. will want to know about events sooner and more frequently than an email alert to IT personnel would deliver them.
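Copying alerts from several monitoring systems usually means mapping each system's fields onto one common shape before they reach the collector. The sketch below shows the idea with two invented sources; the source names, field names, and target schema are all hypothetical and do not reflect the real payloads of vCenter, CloudWatch, or any particular A.I. platform.

```python
# Hypothetical per-source field mappings: common field -> source-specific key.
FIELD_MAPS = {
    "uptime_monitor": {"device": "host", "message": "status_text", "ts": "checked_at"},
    "cloud_alarm": {"device": "resource", "message": "alarm_name", "ts": "timestamp"},
}


def normalize(source, payload):
    """Translate one source-specific alert into the common event shape.

    Raises KeyError for an unknown source or a payload missing a mapped field,
    so that integration gaps are noticed rather than silently dropped.
    """
    mapping = FIELD_MAPS[source]
    event = {field: payload[key] for field, key in mapping.items()}
    event["source"] = source
    return event
```

Failing loudly on unknown sources is a deliberate choice here: a monitoring system that was never mapped is exactly the kind of gap this preparation work is meant to find.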
These reporting systems very likely send alerts to a ticketing system or collaboration service, which should also be integrated into the A.I. platform as an output. Once the algorithms detect a highly probable issue, a ticket can be created for front-line personnel. This may also require configuration and scaling considerations for your email server, depending on how it is integrated.
So far, we’ve talked about setting up the networked devices to allow for detection of issues. Once an alert has been investigated, it needs an action performed. If an organization wishes to enable automation, that is, the automatic resolution of alerts from these A.I. systems, remote management access must be provided to all devices. This is not data flowing from the networked devices, but remote access into them. Remote access methods such as SSH and PowerShell are most common today. If a device is too old or not licensed to run SSH or PowerShell, for example, that device will need to be replaced or upgraded. Configuring this remote access may also be lengthy.
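A simple first audit of remote access is to test whether the relevant TCP ports answer from where the automation system will run. The sketch below assumes SSH on its default port 22 and PowerShell remoting over WinRM with HTTPS on its default port 5986; your environment may use different ports, and an open port only shows reachability, not that credentials or licensing are in order.

```python
import socket

# Default ports; adjust if your environment differs.
REMOTE_PORTS = {"ssh": 22, "winrm-https": 5986}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False


def audit(hosts, method="ssh"):
    """Return the hosts that do not answer on the port for the given method."""
    port = REMOTE_PORTS[method]
    return [host for host in hosts if not port_open(host, port)]
```

The hosts returned by audit are the candidates for reconfiguration, upgrade, or replacement.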
The automation methods provided usually rely on scripts of some kind. You may want to run these through an automation system such as Ansible, rather than as individual shell scripts. Again, we find a system that needs planning and implementation. It also requires personnel who know how to code to write resolution scripts and playbooks for each issue that is detected, which will certainly take a lot of time initially.
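One way to picture the link between a detected issue and its resolution is a lookup from alert type to playbook. The sketch below builds an ansible-playbook command line for a single host, defaulting to Ansible's --check (dry-run) mode. The alert types, playbook paths, and inventory file name are hypothetical, and how a given A.I. platform actually triggers automation will vary.

```python
import shlex

# Hypothetical mapping of alert types to resolution playbooks.
RUNBOOKS = {
    "disk_full": "playbooks/clean_tmp.yml",
    "service_down": "playbooks/restart_service.yml",
}


def build_command(alert_type, host, inventory="inventory.ini", dry_run=True):
    """Return the ansible-playbook argument list for an alert, or None if no playbook exists."""
    playbook = RUNBOOKS.get(alert_type)
    if playbook is None:
        return None  # nobody has written a resolution for this issue yet
    command = ["ansible-playbook", "-i", inventory, playbook, "--limit", host]
    if dry_run:
        command.append("--check")  # report what would change without changing it
    return command


if __name__ == "__main__":
    cmd = build_command("disk_full", "nyc-app-srv-01")
    print(shlex.join(cmd))  # pass cmd to subprocess.run() to actually execute it
```

Every alert type that returns None here represents a playbook someone still has to write and test.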
Finally, these A.I. alerts and resolutions only happen when the algorithm has a high level of confidence that an issue is real. That means personnel need to train the system, especially in the beginning. There are usually many algorithms working together, each one using a different set of rules, which requires care and validation. Algorithms are diverse and may include the ability to detect relationships between alerts based on source type, physical or logical proximity, time, language usage, and topology analysis.
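To illustrate just two of those relationships, time and physical proximity, here is a deliberately simple sketch that groups alerts from the same site when they arrive within a set number of seconds of each other. It is a toy stand-in, not how any commercial platform works; the alert fields and the 60-second window are assumptions.

```python
from collections import defaultdict


def correlate(alerts, window=60):
    """Group alerts by site, splitting a group when the gap between alerts exceeds window.

    alerts: list of dicts with 'site', 'ts' (seconds), and 'msg' keys (hypothetical shape).
    Returns a list of groups, each a list of alerts ordered by time.
    """
    by_site = defaultdict(list)
    for alert in alerts:
        by_site[alert["site"]].append(alert)

    groups = []
    for site in sorted(by_site):
        current = []
        for alert in sorted(by_site[site], key=lambda a: a["ts"]):
            # Start a new group when this alert is too far from the previous one.
            if current and alert["ts"] - current[-1]["ts"] > window:
                groups.append(current)
                current = []
            current.append(alert)
        groups.append(current)
    return groups
```

Even this crude rule shows why the window and grouping keys need tuning and validation by people who know the environment: too wide a window merges unrelated problems, too narrow a one splits a single outage into many tickets.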
As you can see, there is no "Easy" button here. A.I. platforms, their automation systems, and their algorithms are extremely powerful today, but they require planning, lots of preparatory work, and training once running. They cannot be implemented quickly as a fix for a shortage of personnel and, in fact, will require more personnel during the implementation and configuration phase. When properly planned for and implemented, an A.I. system can be an important enhancement to IT Operations.