One of my clients has what I’d consider a mature IBM i High Availability environment, using Mimix software on a Capacity BackUp (CBU) system. It was first implemented in 2006 and has been tested every year since then (with 1-2 extended switch-overs where the system ran live for several days to a week).
But after a recent user-certification HA exercise, we noticed one finding that may be particular to mature HA environments like this.
The biggest issue is no longer making sure that all critical items are replicated; the biggest issue now is making sure all the non-replicated items are kept in sync with the production machine.
All the issues we found in our last test revolved around non-replicated items that weren’t in sync, possibly because they were set up some time ago and never updated on the CBU. Some examples included:
- Host table entries on the production system that weren’t present on the CBU system
- A Web-based terminal emulator that depended on the CBU’s system certificates being the same as the production machine’s certificates. Some of them weren’t: two certificates had expired and been updated on the production machine, while the HA machine’s certificates weren’t updated.
- IP devices that had changed their addresses on the production machine due to a move, but those addresses hadn’t been changed on the CBU machine
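
A host table comparison like the first example above is easy to script once both tables are available as text. The following is only a minimal Python sketch, assuming each system's host table has been exported to plain text in "address hostname" form. The export step itself isn't shown, and the function names and sample entries are hypothetical, not taken from the client's systems.

```python
def parse_host_table(text):
    """Parse 'address hostname [aliases...]' lines into {hostname: address}."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            entries[name.lower()] = address
    return entries


def compare_host_tables(production, cbu):
    """Report host names that are missing, extra, or pointing elsewhere on the CBU."""
    prod_names, cbu_names = set(production), set(cbu)
    return {
        "missing_on_cbu": sorted(prod_names - cbu_names),
        "extra_on_cbu": sorted(cbu_names - prod_names),
        "address_differs": sorted(
            name for name in prod_names & cbu_names
            if production[name] != cbu[name]
        ),
    }


# Hypothetical exports, made up for illustration only.
PROD_EXPORT = """
10.1.1.10 appsrv01
10.1.1.20 printer01
10.1.2.30 scanner01
"""
CBU_EXPORT = """
10.1.1.10 appsrv01
10.1.9.20 printer01
"""

report = compare_host_tables(parse_host_table(PROD_EXPORT),
                             parse_host_table(CBU_EXPORT))
for finding, names in report.items():
    print(finding, names)
```

In this made-up data, scanner01 is reported as missing on the CBU and printer01 as having a different address, which are the same two kinds of drift we hit in the test.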
The lesson I took out of this HA test was that besides auditing to ensure all your data and applications are synchronized between machines, you also have to make sure the supporting operating system items are in sync. These items are likely to change over time, so your non-replicated CBU setup can degrade and will occasionally need to be refreshed from production.
So don’t just focus on your replication data groups to the exclusion of everything else. This is important for both mature and new IBM i High Availability setups.
Make sure your non-replicated items, including those listed above and other critical items such as subsystem descriptions, output queues, job queues, etc., are also in sync. These are important audit points and can cause application failures in a CBU switch-over, especially in a mature environment where they may not be looked at for a number of years.
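
One way to keep these audit points from going unexamined for years is to record when each one was last verified against production and flag anything overdue. This is a rough Python sketch of that idea, not a feature of Mimix or IBM i. The audit log, its dates, and the one-year threshold are all assumptions made up for illustration.

```python
from datetime import date


def stale_items(last_verified, today, max_age_days=365):
    """Return names of items not verified within max_age_days, oldest first."""
    overdue = [
        (verified_on, name)
        for name, verified_on in last_verified.items()
        if (today - verified_on).days > max_age_days
    ]
    return [name for verified_on, name in sorted(overdue)]


# Hypothetical audit log: item name -> date it was last compared with production.
AUDIT_LOG = {
    "host table": date(2013, 1, 15),
    "system certificates": date(2011, 6, 1),
    "subsystem descriptions": date(2009, 3, 10),
    "output queues": date(2013, 8, 1),
}

overdue = stale_items(AUDIT_LOG, today=date(2013, 9, 22))
print(overdue)
```

With these sample dates, the subsystem descriptions and system certificates come back as overdue, while the items checked within the last year do not.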
Originally posted on September 22, 2013. Updated and reposted on March 6, 2014.