Incident response is a structured approach to managing operations after an incident, that is, a disruption to the normal functioning of an application. Such an incident may have caused, or may still be causing, a significant impact on the user experience or the business.
The main goals of incident response are to minimize damage, contain the incident, and investigate the root cause through a root cause analysis (RCA).
In large organizations, context is distributed across teams, which makes it difficult for any one engineer or manager to take ownership of an issue. In these cases, designated people are trained to act as “incident commanders” or “incident managers”.
They own the “resolution” effort of an incident, including coordinating between different engineering teams and communicating updates to business, leadership, and customers. In tense situations, it sometimes becomes critical to separate the managers who are waiting for updates from the engineers who are trying to fix the issue, because direct contact between the two can create undue pressure and stress.
Distributed systems are complex, and incidents can happen. A philosophy that has been widely advocated lately is that of blameless root cause analysis (RCA) and Learning from Incidents (LFI).
The objective here is to avoid pointing fingers at individuals who might otherwise be held “responsible” for the issue, and instead to focus on the process, practice, and technical changes needed to avoid a similar issue in the future.
You can read more about this practice and the thought process behind it here.
Additionally, as in our other articles, here’s a link to the incident response workbook by Google SRE. (Haha, we do love the work by Google SRE.)
Doctor Droid helps companies monitor critical KPIs associated with their operations and product, so they can keep the focus on customer experience.
Our team has deep experience helping companies set up their monitoring and observability stacks. If you need help setting yours up, we are happy to assist; you can reach out to us here.