One of the best things a client of mine has ever done is to put in automated monitoring for their IBM i systems.
This client used to run like a 1980s-1990s Data Center. They had after-hours and weekend night shift operators pretty much doing nothing but watching the system, looking for problems. When they saw IBM i messages that needed answering, the operators were responsible for dealing with the problem.
(Originally published 7-14-13, updated 1-3-14)
But the operators didn’t have any authority to involve the programming staff in the response or change the system. So a lot of application issues were answered by a program dump (reply ‘D’) and the programmers would look at the issue in the morning. This caused more problems because the issue frequently resulted in data corruption which rippled to other programs. It could also result in issues where automated jobs that were dependent on other jobs completing correctly either didn’t run or didn’t run correctly. Some of these issues might have been avoided if the right people had gotten on the job in the first place.
When I started working with them in 2007, management wanted to get the applications group more involved so that correct responses could be formulated on the spot and critical dependent job streams could be held or modified to deal with the issue. They were also looking at more automation so the system itself responded correctly when a problem occurred.
We installed Bytware’s MessengerConsole in the shop and set it up in conjunction with a call tree. An on-call person would get the alert that a program needed a response, check the call tree to see who should be alerted, and make the call. The on-call person was instructed not to release the call until they had handed the issue off to someone who would take responsibility for the problem.
This avoided a lot of overnight issues and started to improve system reliability. It had another side effect: when the people who could solve a problem were woken in the middle of the night, they became more motivated to address the underlying issues so the problem wouldn’t occur again. It wasn’t easy, but it brought ignored issues to the surface, and everyone on the call tree became a stakeholder in making sure overnight problems were answered.
In the years since, the Bytware setup has been improved. The client has configured MessengerConsole to be more granular, so that it more precisely targets the people who are needed to solve a problem and doesn’t disturb anyone unnecessarily.
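To make the idea of granular targeting concrete, here is a rough sketch in Python. It is not MessengerConsole’s configuration format, which isn’t shown here; the job-name patterns, the contact names, and the `route` function are all made up for illustration. The idea is simply that the first matching rule decides who gets the alert, with a catch-all at the end.

```python
from fnmatch import fnmatchcase

# Hypothetical routing rules: (job-name pattern, who to alert).
# Order matters: the first pattern that matches wins.
ROUTING_RULES = [
    ("PAYROLL*", "payroll on-call programmer"),
    ("NIGHTLY*", "batch scheduling on-call"),
    ("*", "operations on-call"),  # catch-all so no alert goes unowned
]


def route(job_name):
    """Return the contact for the first rule whose pattern matches job_name."""
    for pattern, contact in ROUTING_RULES:
        if fnmatchcase(job_name, pattern):
            return contact
    return None  # only reachable if the catch-all rule is removed


print(route("PAYROLL01"))
print(route("QSYSOPR"))
```

Putting the most specific rules first and the catch-all last is what keeps a payroll problem from waking up the whole call tree.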
Another improvement is that they could use Bytware to set up escalation pagers, so that if a problem went to one person and wasn’t solved within a set amount of time, it would go to the next person, who then had a set amount of time to solve it. These escalation trees would continue up through the higher-ups until eventually a vice-president was alerted that an open problem existed on the system. That got more people interested in resolving the problem, and it removed the temptation to ignore overnight issues.
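The escalation logic described above can be sketched in a few lines of Python. This is only an illustration under assumed values, not how Bytware implements it: the contact titles, the time limits, and the names `EscalationStep` and `contacts_alerted` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str             # who is paged at this level (made-up titles)
    minutes_to_resolve: int  # time allowed before the next level is paged


# Hypothetical chain; a real shop would choose its own people and limits.
CHAIN = [
    EscalationStep("on-call programmer", 15),
    EscalationStep("applications manager", 30),
    EscalationStep("IT director", 30),
    EscalationStep("vice-president", 60),
]


def contacts_alerted(chain, minutes_open):
    """Return everyone who has been paged for a problem open this long."""
    alerted = []
    deadline = 0  # minutes after which the current step gets paged
    for step in chain:
        if minutes_open < deadline:
            break
        alerted.append(step.contact)
        deadline += step.minutes_to_resolve
    return alerted


print(contacts_alerted(CHAIN, 20))
```

With these assumed limits, a problem left open for 20 minutes has already been escalated from the on-call programmer to the applications manager, and one left open for 75 minutes has reached the vice-president.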
Another side effect is that the client was able to retool those overnight operator positions and replace them with higher-value roles (more programmers, Help Desk personnel, etc.). Weeding out their 1980s-1990s habits made them a better shop.
Overall, I’d say installing a good monitoring system was an excellent move. It improved system stability, let the client tackle problems as they occurred so they didn’t spiral out of control, increased visibility into system issues, and allowed the client to reallocate their resources to more productive uses. I would recommend that any IBM i shop look seriously at putting in a full-fledged monitoring system like this, if they don’t have one already.