
Your Tools Are Working. Your Incident Is Still a Disaster. Here's Why
I've sat in a lot of war rooms over twenty years in IT operations, fourteen of them leading teams.
And one of the things I noticed early on - something that took me a while to properly articulate - is that the worst incidents I ever dealt with were rarely the most technically complex ones. The ones that caused the most damage, cost the most money, and left the most difficult aftermath were the ones where the communication collapsed. Where the coordination broke down. Where the human side of the response fell apart while the technical side was still being fought.
The tools were working. Everything else wasn't.
What £11,000 a minute actually looks like in the room
Every alert fired on time. Service Desk woke the right people up. Skype, Teams, or Slack channels were live within minutes. Monitoring (Check_mk, Zabbix, Prometheus/Grafana, Datadog) was showing exactly where the problem was.
And the incident still ran for three hours when it should have taken forty-five minutes. At £11,000 a minute, those extra two and a quarter hours are roughly £1.5 million of avoidable cost.
Two senior engineers were duplicating each other's work without realising it. A customer update went out ninety minutes late, written in technical language that made the client more anxious, not less. The CEO found out about the outage from a client complaint rather than from his own IT team.
I'm not describing a fictional scenario. I'm describing a pattern I've seen repeat itself across industries, across company sizes, and across teams that by every measure had the right tools in place.
The tools did their job. The human systems around them didn't exist.
The gap nobody budgets for
The incident management software market is worth over £4 billion. Companies are spending serious money on alerting, routing, monitoring, and status pages. And they should - those tools matter.
But there is a gap in what those tools can do. And in my experience, it's a gap that almost nobody talks about openly until after a bad incident forces the conversation.
Tools handle the technical signal. They tell you something is broken and who should look at it. What they cannot do is tell your Head of Engineering or Operations - whoever owns the response - how to run a war room with six teams active simultaneously. They cannot write a calm, clear update for your clients that reduces panic rather than amplifying it. They cannot stop your Incident Manager from freezing when the CEO calls asking what to tell the board. They cannot make sure the right person is coordinating rather than everybody coordinating at once and nothing getting decided.
That gap - between the technical alert and the structured human response - is where most of the damage happens. I've watched it happen more times than I can count.
What actually extends an incident
When I look back at the incidents that ran longest, the technical problem was rarely the whole story. The fix was often found within the first hour. What extended it was everything around the fix.
Duplicate effort because nobody assigned clear roles at the start. Time lost because the person who knew the most about that particular system was unavailable and nobody else had access to the runbook. Escalation that happened too late because the engineer on-call was nervous about waking up their manager. A stakeholder update that should have gone out at the thirty-minute mark, delayed because nobody knew who was supposed to write it or what it should say.
None of that is a tool problem. All of it is a process and communication problem. And all of it is fixable - but only if you treat it as a separate problem that requires a separate solution.
The part that costs the most
Here is what rarely makes it into post-mortems.
The technical outage might last three hours. The reputational damage from how it was communicated - or wasn't - can last months. Clients who were kept informed and felt like someone was in control will forgive almost anything. Clients who found out about the problem from their own customers, or who received no communication at all until the issue was resolved, remember that. They question the relationship. Sometimes they start looking elsewhere.
I've seen companies recover from serious technical incidents with their client relationships intact - because the communication was handled well. I've also seen companies lose long-standing clients after incidents that were technically resolved quickly - because the silence during those hours was deafening.
The cost of the outage is the cost of the outage. The cost of the communication failure is often much larger, and much harder to measure.
This is not a criticism of your tools
If you are running PagerDuty, Datadog, or anything like them - in any combination - keep running them. They do what they are designed to do.
What I'm pointing at is a different problem. One that sits alongside your tools, not instead of them. The problem of what happens after the alert fires. Who runs the response. How the teams coordinate. What goes to stakeholders and when. How decisions get made when the pressure is highest and the information is most incomplete.
That is a human systems problem. And in fourteen years of building and leading IT teams, I've never seen a piece of software solve it.
What good actually looks like
A well-run incident, from a communication standpoint, looks like this.
Within five minutes of a P1 being declared, there is a named incident manager - not a rota default, but a specific person with a clear role. Within fifteen minutes, a first stakeholder update has gone out. Not a full diagnosis. An acknowledgement that the issue is known, that it is being worked on, and that the next update will arrive at a specific time. The war room has clear lanes: one team on the technical fix, one person handling all communication outward. Nobody is doing both.
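That fifteen-minute update is far easier to hit when the first message is written long before the incident. Here is a minimal sketch, in Python, of what a pre-written holding update might look like - the Slack webhook URL, the service name, and the wording are all placeholders, not a prescription. The template is the point, not the script.

```python
# Pre-writing the holding statement takes the fifteen-minute update off the
# critical path. Everything specific here is a placeholder: the webhook URL,
# the service name, the wording.
import datetime

import requests  # third-party HTTP library

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def send_holding_update(service: str, minutes_to_next_update: int = 30) -> None:
    """Acknowledge the issue, say it is being worked on, commit to a time."""
    next_update = datetime.datetime.now() + datetime.timedelta(
        minutes=minutes_to_next_update
    )
    message = (
        f"We are aware of an issue affecting {service}. "
        "Our engineers are actively working on it. "
        f"Next update by {next_update:%H:%M}."
    )
    # One person owns outward communication; this only removes the blank page.
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)

send_holding_update("the client portal")
```

Notice the one thing the template always commits to: a specific time for the next update. That single line is what separates "someone is in control" from silence.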
The incident gets resolved. The post-mortem happens within forty-eight hours. Actions are assigned, tracked, and closed - not documented and forgotten.
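"Assigned, tracked, and closed" is less about tooling than about a few fields that never go missing: a named owner, a due date, a status that someone reviews. A sketch of that structure follows - a shared spreadsheet or ticket queue does the same job; the entry below is illustrative.

```python
# A post-mortem action only counts if it has a named owner and a due date.
# Minimal sketch of the tracking structure; the discipline is in reviewing it.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Action:
    description: str
    owner: str          # a named person, not "the platform team"
    due: date
    closed: bool = False

def overdue(actions: list[Action], today: date) -> list[Action]:
    """The list to read out at the next ops review: open and past due."""
    return [a for a in actions if not a.closed and a.due < today]

# Illustrative entry, echoing a failure mode from earlier in this piece.
actions = [
    Action(
        "Give the secondary on-call access to the affected system's runbook",
        owner="a named engineer",
        due=date.today() + timedelta(days=14),
    ),
]
print(overdue(actions, date.today()))  # empty today; not empty if it slips
```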
None of that requires new software. All of it requires a process, built and practised before the pressure arrives.
That process is what most companies don't have. Not because they don't care about it. Because nobody ever built it for them.
