Ensuring the stable operation of IT services that undergo frequent changes is a bit like navigating shallow waters, but it is easily manageable with the right approach
Being responsible for the operation of an application often feels like navigating shallow waters. Configuration changes, interface adaptations, load spikes and security updates are just some examples of the shallows you need to steer around. Your team has to manage all of this while having more than enough wind in its sails, generated by stakeholders' incessant demands for ever more features and adaptations to your application. With the right approach, balancing the need for constant change against operational stability is manageable.
A Clash of Cultures?
I remember switching, a few years ago, from a frontend development team to a team responsible for the operations handover of (often legacy) IT systems. The cultural differences could not have been larger. The frontend team consisted almost exclusively of twentysomethings who worked according to DevOps practices and were, in general, accustomed to moving fast and breaking things.
In my new team, by contrast, I was all of a sudden confronted with organising training sessions for our 1st-level support, defining processes for Incident Management, participating in Change Advisory Boards (CABs) and composing Operational Level Agreements (OLAs), to name just a few tasks.
Coming from such a dynamic world, I found this experience a genuine culture shock at first. Back then, I did not know much about the Information Technology Infrastructure Library (ITIL), a framework of operational best practices that is still used by many companies across the globe to structure and optimise their IT operations. Over the following years, I came to realise that many facets of these seemingly slow and bureaucratic processes really do make sense and ensure the stable operation of IT services (the correct term, which I will use from here on). Of course, as with all frameworks, roles and processes should never replace customer orientation and common sense.
In my opinion, DevOps and ITIL practices do not necessarily contradict each other; rather, they can complement each other. In any case, their goals are the same: stakeholders should be provided with available, stable and high-performance IT services that best fit their needs. If something goes wrong, the error should be fixed as quickly as possible and with minimal impact on the stakeholders.
Let the Wind Blow You Further
One of the most common mistakes your team can make is trying to talk stakeholders out of constantly changing an application because the changes may destabilise it. That is true in most cases, but it is definitely not a reason for your IT service to stop providing new functionality or adapting to users' ever-evolving needs.
Your team's task is to provide stakeholders with a funnel through which they can communicate their needs and afterwards receive feedback on how those needs are being considered and when they are likely to become available. The easiest way to do this is to organise the resulting User Stories on a Kanban board, which makes their journey from ideation through creation to deployment transparent to your stakeholders.
You should also consider building your application such that you can obtain insights into users' behaviours and thereby anticipate their wishes. Additionally, this gives your team the chance to observe how users interact with new features or adaptations and to react accordingly.
Now that we have made the case for allowing frequent changes to your IT service, how do we ensure that they do not result in outages and subpar system performance? There are a few ways, of course :).
Preconditions
Before you start iterative development of your IT service, you may want to ask yourself the following questions:
Does your team have the appropriate tools available to frequently integrate and deploy changes to your application?
How can your team ensure that, for instance, each change to the code, to the operating systems of both your clients and servers, to any interfaces to other systems or to the system load conditions can be appropriately tested?
Can your team build the IT service such that any abnormal application behaviour is recognised before or shortly after it occurs?
I could go on, but these questions illustrate a simple fact: frequent changes to your IT service require your application to be designed "properly" (I will elaborate on that below) and your team to work efficiently. Without these preconditions, you will run aground on the next sandbank in no time.
Design Your IT Service Keeping Operations in Mind
Your IT service's operation should be as transparent and as testable as possible. To achieve transparency, your team could set up use-case-centred monitoring of your application and establish (preferably automated) processes for handling abnormal system behaviour. Monitoring tends to be system-centric, tracking for example specific server parameters or the status of individual services. Try shifting its focus to use cases instead, so that you can properly assess how users' work is affected when a problem occurs.
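To make the difference between system-centric and use-case-centred monitoring more tangible, here is a minimal Python sketch. It assumes that each use case can be broken down into steps, each with a hypothetical probe function that returns True if a user could complete that step. The names `UseCase`, `assess` and `summarise`, as well as the example use cases, are purely illustrative and not part of any specific monitoring product. The point is that alerts are phrased in terms of what users cannot do, rather than which server parameter crossed a threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical probe: returns True if the step a user would perform succeeds.
Probe = Callable[[], bool]


@dataclass
class UseCase:
    """A user-facing use case (e.g. 'log in') and the steps it consists of."""
    name: str
    steps: Dict[str, Probe] = field(default_factory=dict)


def run_probe(probe: Probe) -> bool:
    """Run a probe; an exception raised by the probe counts as a failed step."""
    try:
        return bool(probe())
    except Exception:
        return False


def assess(use_cases: List[UseCase]) -> Dict[str, List[str]]:
    """Map each use case to its failed steps (an empty list means it works)."""
    return {
        uc.name: [step for step, probe in uc.steps.items() if not run_probe(probe)]
        for uc in use_cases
    }


def summarise(result: Dict[str, List[str]]) -> List[str]:
    """Turn the assessment into alert lines phrased in terms of user impact."""
    return [
        f"Users cannot currently '{name}' (failed steps: {', '.join(failed)})"
        for name, failed in result.items()
        if failed
    ]


if __name__ == "__main__":
    # Stand-in probes; in a real setup these would exercise the application end to end.
    def database_reachable() -> bool:
        return True

    def report_service_up() -> bool:
        raise TimeoutError("report service did not answer")

    use_cases = [
        UseCase("log in", {"authenticate": database_reachable}),
        UseCase("submit report", {"store draft": database_reachable,
                                  "render PDF": report_service_up}),
    ]
    for line in summarise(assess(use_cases)):
        print(line)  # Users cannot currently 'submit report' (failed steps: render PDF)
```

In this sketch, a timeout in a single backend service surfaces as "users cannot submit reports", which is the information that matters when judging the severity of an incident.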
Testability means that, from unit tests all the way up to system and performance tests, your IT service must be designed such that whenever a change is made to a system component, or even to a connected system, your team can test its effects before they reach production. For the sake of efficiency, this requires your deployment architecture to contain several stages, with the last stage before production being a 1:1 copy of your production environment.
For a mission-control system, for instance, we decided to have a DEV system used by our developers, a TEST system on which automated tests of the application could be run, and an INT system that fully mimicked the production system, all the way down to the hardware of the physical client. Only in this way could we ensure that, for example, monthly operating system patches to the backend did not cause any disruption.
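The staged approach described above can be sketched as a simple promotion rule: a change is deployed to one stage at a time and only moves on once the checks on that stage have passed. The following Python sketch is a simplified illustration, not the actual pipeline of the mission-control system. The stage names are taken from the example above; the gate functions, the `promote` helper and the change identifiers are hypothetical, and a missing gate is deliberately treated as a failure.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class Stage(IntEnum):
    """Deployment stages in promotion order, as in the example above."""
    DEV = 0
    TEST = 1
    INT = 2
    PROD = 3


# Hypothetical quality gate: given a change identifier, returns True if the
# checks required on that stage passed.
Gate = Callable[[str], bool]


def promote(change: str, gates: Dict[Stage, Gate]) -> List[Stage]:
    """Deploy a change stage by stage and stop at the first failing gate.

    Returns the stages the change was deployed to. PROD is only reached
    if the gates of DEV, TEST and INT have all passed.
    """
    deployed: List[Stage] = []
    for stage in Stage:
        deployed.append(stage)  # deploy the change to this stage
        if stage is Stage.PROD:
            break
        gate = gates.get(stage)
        if gate is None or not gate(change):
            break  # missing or failed gate: do not promote any further
    return deployed


if __name__ == "__main__":
    gates = {
        Stage.DEV: lambda change: True,   # e.g. unit tests
        Stage.TEST: lambda change: True,  # e.g. automated application tests
        # e.g. a system test on the production-like INT stage that detects
        # a (hypothetical) faulty backend OS patch
        Stage.INT: lambda change: change != "faulty-os-patch",
    }
    print([s.name for s in promote("new-feature", gates)])      # ['DEV', 'TEST', 'INT', 'PROD']
    print([s.name for s in promote("faulty-os-patch", gates)])  # ['DEV', 'TEST', 'INT']
```

Treating a missing gate as a failure is a conservative design choice: a stage without defined checks blocks promotion instead of silently letting changes through to production.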
How Do You Stay Off the Sandbanks?
Frequent, or better still, continuous releases of IT services are a necessity nowadays. At the same time, your stakeholders want your IT service to function without any faults or interruptions. As much as that might sound like wishful thinking at first, there are ways to achieve it. First of all, your team needs to continuously "listen" to its stakeholders and even anticipate their needs in order to prepare. Speaking of preparation: design your IT service such that its correct functioning can be verified easily and each and every little change can be tested.
Care to Know More?
Obviously, the ship in the picture above did not make it to port. The shipwreck Dimitrios, a cargo vessel presumably used for smuggling cigarettes between Greece and Turkey, is nowadays a tourist attraction in the south of the Greek Peloponnese peninsula.
Would you like to know more about this post's topic? If so, please post your questions in the comments below and I will answer them. If your question concerns a hot topic, I will dedicate a future post to it.