By Feisal Ismail, Principal Consultant, Sapience Consulting
Network Reliability Engineering (NRE) is an offshoot of Site Reliability Engineering (SRE), the practice pioneered by Google and adopted by many organisations that need to scale without compromising the reliability and quality of the user and customer experience, typically through extensive use of automation. NRE extends SRE practices and behaviours to the network domain.
NRE focuses on maintaining the network infrastructure so that it is available when needed. Network reliability engineers work to prevent outages and ensure that data travels smoothly through the network. SRE, on the other hand, focuses on maintaining the applications and services that run on top of the network. Site reliability engineers work to ensure that these applications and services are available when needed and are performing as expected.
DevOps Institute Ambassador Feisal Ismail sat down (virtually!) with Wesley Situ, an NRE practice manager at an American multinational investment bank and financial services holding company, to learn more about his understanding of NRE and his experience implementing it in his organisation.
Why did you consider Network Reliability Engineering as a suitable practice for your organisation and did you have prior experience with NRE?
It started in October 2019 at my previous organisation, which was transforming its IT services and moving to the cloud in a big way. As part of the cloud-first approach, there was a strong move toward embedding the complementary practices of DevSecOps and Site Reliability Engineering. I got my feet wet in SRE practices there.
With my present employer, our operating structure is heavy in terms of numbers. Network infrastructure components are growing rapidly, and unless we fundamentally change the way we run things, it would be extremely challenging to manage them without scaling up our technical resources. We do have a scaling issue, and modernising our workforce and the way we run network operations is the key. This presents a tremendous opportunity for us to review how we would like to scale network operations and, at the same time, elevate the skillsets of our engineers.
What unforeseen challenges did you face as part of the transformation to an NRE way of working?
Hands down, transforming culture and mindset is the biggest challenge. It takes time and effort. What we aspire to do is develop our workforce to exhibit elements of the generative culture described in the Westrum model, which is well covered in the DevOps Institute DevOps Foundation and SRE courses. It’s not easy to shift from a never-fail mindset to a fail-often-learn-often way of thinking.
A case in point is our attempt to put blameless postmortems in place. A blameless postmortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened and brainstorm how to improve the process to prevent similar incidents from happening again. It’s easy to conceptualise, but human conditioning over many years makes it difficult to pull off. The mindset of assigning blame for human mistakes is still embedded in the organisational psyche. It’s tough to balance accountability with growth and learning, but it is a balance that we need to learn to strike.
Another challenge is monitoring. The organisation is steeped in very traditional monitoring approaches with little scope for engineering. Engineers spend their time on the issues they know are most pressing and on those that are already hurting the organisation. It takes time to move engineers conditioned for firefighting to work on fire prediction.
I also did not anticipate the dearth of talent available in this area. We need people who understand network automation and have an affinity for programming. In my experience, it’s quite tough to find these attributes in Asia, whereas my counterparts in the US and UK found it less of an issue to fill these roles. I am looking to expand our technical capabilities in the Asia-Pacific region to catalyse the transformation. It’s challenging, but we are getting there, albeit more slowly than I would have preferred. My immediate task now is to keep the troops focused on what we are doing and planning to achieve with NRE.
Transforming the existing workforce to the NRE ways of working initially faced a roadblock due to the team’s lack of exposure to “novel” practices. This led to initial skepticism and a lack of faith in the path forward. Getting everyone on the same page by learning and understanding the common practices and terminology helps to grease things quite a bit. We are planning investments in this area.
Any areas affected that pleasantly surprised you?
Yeah! I underestimated the reactions that I would get from the existing team. Although there was pessimism and skepticism in the beginning, the team was prepared to change, and there was not that much resistance! The team was more sensible and open than I thought they would be. The culture of the bank helped immensely: we are a people-centric and dynamic organisation where change is viewed as normal.
A top-down approach will not work for an NRE implementation. You’ve got to listen to the people doing the work, embrace new ideas and be adaptive.
What would be your advice to anyone exploring implementing SRE or NRE in their organisation?
Be clear about why you are doing it. It is a huge undertaking. Does it even make sense to begin this journey? Is the practice even suitable for the organisation? Recognise that it is a long journey and you must be prepared to be adaptive along the way. Don’t go for immediate results but look for desired behavioural changes and sustainability.
Have the right road map, have a clearly communicated vision and be patient. Good luck!
This article was originally posted on Sapience Consulting