Article of the week 5 – 2023

CRAZY MONKEY – things are about to get crazy in your new system environment

Today I would like to tell you about one of my current customer projects, in which I accompanied the introduction of a new omni-channel system.

When implementing a new TC/ACD or omni-channel system, a lot of thought goes into reliability in advance, and the architecture is planned around the expected requirements. This matters in particular for meeting the service level agreements (SLAs) that you, as the customer, agree with your provider as part of the new solution. Relevant questions arise, such as “Should system components be duplicated?” or “Which operating strategy is right for these machines, and what options are there?”

I would like to introduce some of the most important operating models:

Active-active operation: Several servers run in parallel. If one server fails, one of the others can take over directly, and operation continues almost without interruption.
Active-passive operation: This variant also provides fail-safety, but if the active server fails, some time is needed to activate the passive server before it can take over all services.
Stand-alone server or database: These servers handle either control functions or various special tasks. If such a server fails, all services it operated fail with it. Trouble-free operation can usually be resumed only through recovery measures, that is, by reinstalling the server and importing a backup.
3rd party systems: Third-party systems connected via interfaces can also affect overall operation if they fail. This is especially important for systems that provide a central frontend for employees, for example when a CRM system is merged with an ACD environment so that the telephone and other contact channels can be controlled from a common frontend.
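
To make these four models easier to compare during planning, the expected behaviour on failure can be written down per component. The following is only a minimal sketch: the component names are hypothetical, and the expected-impact texts merely paraphrase the descriptions above rather than any manufacturer's specification.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_PASSIVE = "active-passive"
    STAND_ALONE = "stand-alone"
    THIRD_PARTY = "3rd party"


# Expected behaviour on failure, paraphrased from the list above.
EXPECTED_ON_FAILURE = {
    Mode.ACTIVE_ACTIVE: "another server takes over directly, almost no interruption",
    Mode.ACTIVE_PASSIVE: "interruption until the passive server has been activated",
    Mode.STAND_ALONE: "all services on this server fail until it is restored from a backup",
    Mode.THIRD_PARTY: "depends on the interface, a shared frontend may become unusable",
}


@dataclass(frozen=True)
class Component:
    name: str  # hypothetical component name
    mode: Mode


def expected_impact(component: Component) -> str:
    """Describe what the architecture leads us to expect if this component fails."""
    return f"{component.name} ({component.mode.value}): {EXPECTED_ON_FAILURE[component.mode]}"


# Hypothetical inventory, for illustration only.
inventory = [
    Component("acd-core", Mode.ACTIVE_ACTIVE),
    Component("media-server", Mode.ACTIVE_PASSIVE),
    Component("reporting-db", Mode.STAND_ALONE),
    Component("crm-frontend", Mode.THIRD_PARTY),
]

for component in inventory:
    print(expected_impact(component))
```

Such a list is only the expectation. Whether the system really behaves this way is what a fail-over test has to show.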

Depending on the selected architecture, manufacturers and service providers normally set out the resulting reliability and assure you that the agreed SLAs will be met in the event of a malfunction.

But this is exactly where I put the question to you: Is this really true? Is this exactly what happens? Is everything as rosy as promised, and what does the worst case look like? Or, put differently:

Why CRAZY-MONKEY?

As soon as the components are set up in the data centre and available for the first tests, it is advisable to carry out an intensive fail-over test. That is what we did in the customer project mentioned above. All conceivable failure scenarios were written down and summarised in a test document. A test team was formed with representatives of the service providers for all systems, the responsible people from local IT and the data centre, and system testers.
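
Writing down the failure scenarios can be supported by a small script. The sketch below assumes that a scenario is simply a set of servers that are switched off together, and that combinations of up to two simultaneous failures are of interest. The server names and the limit of two are placeholders, not details from the project.

```python
from itertools import combinations


def failure_scenarios(servers, max_simultaneous=2):
    """Return every combination of servers to switch off together,
    from single failures up to max_simultaneous failures at once."""
    scenarios = []
    for size in range(1, max_simultaneous + 1):
        scenarios.extend(combinations(servers, size))
    return scenarios


# Hypothetical server names, for illustration only.
servers = ["acd-1", "acd-2", "db-1"]

for number, scenario in enumerate(failure_scenarios(servers), start=1):
    print(f"Scenario {number}: switch off {', '.join(scenario)}")
```

For three servers this yields three single failures and three double failures. The list grows quickly with more servers, so in practice the team would still select the combinations worth testing.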

I like to call this the Crazy Monkey Test, because the procedure looks as if it were being carried out by a “crazy monkey”. During full operation, the various servers are switched off, one at a time or several at once, and the testers record every kind of change in the behaviour of the entire system. The results can range from “we hardly notice” to “nothing works”. You would be amazed at how varied the test results can be.
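
The procedure itself can be outlined as a simple loop. This is a sketch under stated assumptions, not the tooling used in the project: `shut_down`, `start_up` and `observe` are hypothetical callables that you would replace with whatever your environment offers (a person at the VM console works just as well), and a plain set stands in for the real servers here.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    scenario: tuple
    notes: list = field(default_factory=list)


def run_crazy_monkey(scenarios, shut_down, start_up, observe):
    """Switch off the servers of each scenario, record what the testers
    see, and bring the servers back before the next scenario starts."""
    results = []
    for scenario in scenarios:
        for server in scenario:
            shut_down(server)
        try:
            results.append(Observation(scenario, list(observe(scenario))))
        finally:
            # Always restore the servers, even if recording fails.
            for server in scenario:
                start_up(server)
    return results


# Stand-ins for illustration: a set tracks which servers are "down".
down = set()
results = run_crazy_monkey(
    scenarios=[("acd-1",), ("acd-1", "db-1")],
    shut_down=down.add,
    start_up=down.discard,
    observe=lambda scenario: [f"{len(down)} server(s) down, tester notes go here"],
)

for result in results:
    print(result.scenario, result.notes)
```

Restoring the servers in a `finally` block is a deliberate choice: each scenario should start from a fully working system, otherwise the observations of later scenarios cannot be attributed to a specific failure.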

Our findings included, for example:

Ongoing calls remained connected and were not dropped, although the manufacturer did not guarantee this and had predicted that they would be.
Switching times were longer than guaranteed by the manufacturer.
Many other valuable insights.
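
A finding such as the longer switching times becomes easier to discuss with the manufacturer when measured and guaranteed values are placed side by side. The following sketch shows one way to do that. All component names and numbers are placeholders for illustration only and are not results from the project.

```python
def switchovers_over_guarantee(measured, guaranteed):
    """Return {component: (measured, guaranteed)} for every component whose
    measured switching time in seconds exceeded the guaranteed one."""
    return {
        name: (seconds, guaranteed[name])
        for name, seconds in measured.items()
        if name in guaranteed and seconds > guaranteed[name]
    }


# Placeholder values in seconds, for illustration only.
guaranteed = {"acd-core": 5.0, "media-server": 60.0}
measured = {"acd-core": 4.0, "media-server": 95.0}

for name, (seen, promised) in switchovers_over_guarantee(measured, guaranteed).items():
    print(f"{name}: measured {seen:.0f} s, guaranteed {promised:.0f} s")
```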

But we also gain insights such as “So how do I log on now?” or “Do the configured emergency routings take effect, and where do the calls actually arrive now?” Often, one or another documented process for operating in the event of an error turns out to be missing as well.

Isn’t it very time-consuming to implement?

Especially nowadays, when many installations run in a VM environment, it is easy to shut down a virtual server with a mouse click and then bring it back up. The Crazy Monkey test described here then takes only about two hours. It is time well spent, because once you are in live operation and your employees are working on the new system, you will hardly have an opportunity to carry out such a fail-over test.

Conclusion

I can recommend carrying out a Crazy Monkey test before the start-up of a new system environment to everyone. For a manageable amount of time and effort, you gain very important knowledge about how the system may behave in the event of failures, and you avoid being caught out by unexpected failure scenarios and the costs that result from them.

Udo Ociepka – Senior Consultant

junokai