Today I would like to tell you about one of my current customer projects, in which I supported the introduction of a new omni-channel system.
When new TC/ACD or omni-channel systems are implemented, a lot of thought goes into reliability in advance, and the architecture is planned to match the expected requirements. This matters above all for meeting the service level agreements (SLAs) that you as a customer agree with your provider as part of the new solution. Relevant questions arise, such as „Should system components be duplicated?“ or „What is the right strategy for operating these machines, and which strategies are there?“
I would like to introduce you to some of the most important procedures:
Depending on the selected architecture, manufacturers and service providers normally demonstrate the resulting reliability and guarantee compliance with the agreed SLAs in the event of a malfunction. That, at least, is the assurance.
But this is exactly where I ask you: Is this really true? Is this exactly what happens? Is everything as rosy as promised, and what is the worst-case scenario? Or, put differently:
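How such a reliability figure typically comes about can be sketched with a little arithmetic. The following is only an illustration, not the calculation used in this project: it assumes that duplicated components fail independently of each other, and the 99 % availability is an invented example value.

```python
def parallel_availability(single, copies=2):
    """Availability of `copies` redundant components.

    Assumption: the copies fail independently of each other.
    """
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1 - availability) * 365 * 24


if __name__ == "__main__":
    single = 0.99  # invented example value, not a figure from the project
    duplicated = parallel_availability(single, copies=2)
    print(f"single server:     {downtime_hours_per_year(single):.1f} h downtime per year")
    print(f"duplicated server: {downtime_hours_per_year(duplicated):.1f} h downtime per year")
```

On paper, duplicating a server with 99 % availability reduces the expected downtime from roughly 87.6 hours to under one hour per year. That only holds if the second machine really takes over when the first one fails.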
Why CRAZY-MONKEY?
As soon as the components are set up in the data centre and available for the first tests, it is advisable to carry out an intensive fail-over test. That is what we did in the customer project mentioned above. All conceivable failure scenarios were written down and summarised in a test document. A test team was formed with representatives of the service providers for all systems, those responsible in local IT and the data centre, and system testers.
I like to call this the Crazy Monkey test, because the procedure looks as if it were carried out by a „crazy monkey“. During full operation, the various servers are switched off one by one or several at once, and the testers record every change in the behaviour of the overall system. The results range from „we hardly notice“ to „nothing works“. You would be amazed at how varied the test results can be.
Our findings were, for example:
But the test also yields insights such as „How do I log on now?“ or „Do the administered emergency routings take effect, and where do the calls actually arrive now?“ Often it also turns out that one or another documented process for operating in the event of a fault is missing.
Isn’t it very time-consuming to implement?
Especially nowadays, when many installations run in a VM environment, it is easy to shut down a virtual server with a mouse click and then bring it back up. The Crazy Monkey test described here then takes only about two hours. That is time well spent, because once you are in live operation and your employees are working on the new system, you will hardly have another opportunity to carry out such a fail-over test.
Conclusion
I recommend that everyone carry out a Crazy Monkey test before a new system environment goes live. For a manageable amount of time and effort, you gain very important knowledge about how the system behaves when components fail, and so avoid unexpected failure scenarios and the resulting costs.
Udo Ociepka – Senior Consultant
junokai