Why Incident Playbooks Fail: The Gap Between Process and Behaviour

June 30, 2026 · 6 min read

Companies Who Got the Playbook - And Still Had a Disaster. Here's What Happened.

I want to tell you about a pattern I've observed that doesn't get talked about enough in conversations about incident management.

A company recognises it has a problem. They invest in fixing it. They bring in support, build it themselves, or undergo a serious internal process improvement effort. A playbook gets written. Roles get defined. Templates get created. Everyone agrees it's a significant improvement on what existed before.

And then a major incident happens. And it's still a disaster.

Not as bad as it might have been. But bad enough that the people involved are left wondering whether the investment made any difference at all.

I've seen this happen. And I want to talk about it honestly - because understanding why it happens is more useful than pretending it doesn't.

The Playbook in the Folder

The most common version of this story goes like this.

The playbook was built during a period of focused effort. A workshop, or a consulting engagement, or a serious internal project. The output was good - clear roles, solid templates, a documented escalation path, a post-mortem process with accountability built in.

Then the work moved on. The next sprint started. The next product deadline arrived. The playbook went into a Confluence space, a shared drive, or a folder that everyone knew existed but had no reason to open in the normal course of their work.

Six months later, an incident hit. The people who responded had either not read the playbook, had read it once and not retained the specifics, or were new to the team and hadn't been onboarded onto it. The templates existed but weren't used - because under pressure, people revert to habit, and the habit was to improvise.

The playbook was there. The behaviour didn't reflect it.

Why Process Doesn't Automatically Become Behaviour

This is the thing I most want to be honest about, because it's the thing that is most often underestimated when organisations invest in process improvement.

A document is not a behaviour. A template is not a habit. A defined role is not a practised capability.

The gap between what a process says and what people actually do under pressure is bridged by one thing: repetition before the pressure arrives. Specifically, repeated practice of the process in conditions that approximate the real situation - under time pressure, with incomplete information, with the discomfort of making decisions without certainty.

In most organisations, that practice doesn't happen. The playbook gets built, and the assumption is that the building was the work. The implementation - the repeated practice, the testing, the adjustment, the embedding into muscle memory - is treated as something that will happen naturally over time.

It doesn't happen naturally. It happens intentionally or not at all.

The Specific Failure Modes I've Seen

Beyond the folder problem, a few patterns consistently undermine well-designed processes.

The key person assumption. The playbook was built around the assumption that specific people would fill specific roles. When those people are unavailable - on holiday, off sick, or no longer with the company - the people who step in don't have the same familiarity. The incident exposes the dependency that the playbook was supposed to eliminate, but didn't, because the practice was never distributed widely enough.

The threshold ambiguity. The P1 definition exists, but it's written in a way that leaves room for interpretation in edge cases. The on-call engineer at 2 am spends too long deciding whether the situation qualifies. By the time the P1 is declared, fifteen minutes have passed, and the first stakeholder communication is already late.

The partial implementation. The technical response process was implemented well. The communication layer wasn't. The communications lead role was never properly established - because it felt like overhead when there was no incident in progress, and nobody wanted to formally assign it to someone already fully occupied.

The update that drifted. The process was accurate when built. Twelve months later, infrastructure had changed, people had joined and left, and escalation contacts were different. The process hadn't been updated because no one had an explicit responsibility for it.
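One way to narrow the threshold ambiguity described above is to write the P1 definition as explicit, checkable criteria instead of prose. The sketch below is illustrative only: the field names and the numeric thresholds (`P1_ERROR_RATE`, `P1_AFFECTED_SHARE`) are assumptions invented for the example, not values from any real playbook.

```python
from dataclasses import dataclass

# Hypothetical thresholds; a real playbook would set its own values.
P1_ERROR_RATE = 0.05       # 5% of requests failing
P1_AFFECTED_SHARE = 0.25   # 25% of customers affected


@dataclass
class IncidentSignal:
    error_rate: float          # fraction of failing requests, 0.0 to 1.0
    affected_share: float      # fraction of customers affected, 0.0 to 1.0
    data_loss_suspected: bool  # any suspicion of data loss counts


def classify_severity(signal: IncidentSignal) -> str:
    """Return "P1" if any explicit criterion is met, otherwise "P2"."""
    if signal.data_loss_suspected:
        return "P1"
    if signal.error_rate >= P1_ERROR_RATE:
        return "P1"
    if signal.affected_share >= P1_AFFECTED_SHARE:
        return "P1"
    return "P2"
```

The point is not the specific numbers. It is that the on-call engineer at 2 am checks a short list of conditions and does not have to interpret a paragraph.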

What Actually Makes the Difference

I'm not telling this story to discourage investment in process. I'm telling it because that investment is only the first part of the work, and knowing what the second part looks like determines whether the first part pays off.

The second part is embedding. And embedding has specific requirements.

Regular practice. At a minimum, a tabletop exercise every quarter. Not a full simulation - an hour-long walkthrough of a fictional incident scenario using the actual roles, templates, and escalation paths. The goal is not to test people. The goal is to find the friction before a real incident does.

Distributed familiarity. Every person who might be on-call needs to have read and understood the playbook - not once at onboarding, but as a recurring expectation. The process should not be something only three people know well.

An assigned owner. The process needs someone whose explicit responsibility includes keeping it current - reviewing it quarterly, updating it when infrastructure or the team changes, ensuring new members are brought up to speed.

An honest post-incident review. After every significant incident, one of the post-mortem questions should be: did we follow the process? Where did we deviate, and why? That deviation is useful information that should be made explicit rather than ignored.
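The owner's quarterly review can be backed by a simple automated reminder. This is a minimal sketch, assuming the playbook records a last-reviewed date and that "quarterly" is read as 90 days; both are assumptions for illustration, and `review_overdue` is a hypothetical helper.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is interpreted as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def review_overdue(last_reviewed: date, today: date) -> bool:
    """Return True if more than REVIEW_INTERVAL has passed since the last review."""
    return today - last_reviewed > REVIEW_INTERVAL
```

A check like this could run on a schedule and notify the named owner. It does not replace the review itself; it only makes a missed one visible.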

The Honest Conclusion

A well-designed process, properly embedded and regularly practised, makes a significant and measurable difference. The difference between a team that has built and embedded good incident communication and one that hasn't is not subtle.

But the process alone - designed, documented, and left to fend for itself - is not enough. The playbook in the folder is not an incident communication process. It is the beginning of one.

The work of embedding is less visible than the work of building. It doesn't produce a deliverable you can point at. But it is where the real return on the investment lives.

Frequently Asked Questions

Why do incident management processes fail even when they're well-designed?

The most common reason is the gap between documentation and behaviour. Under real pressure, people revert to their most deeply ingrained habits - not to processes they learned in a workshop. Bridging that gap requires repeated practice in conditions that approximate the real situation, not just reading the playbook once.

How often should an incident playbook be reviewed?

At minimum, quarterly - and whenever something significant changes: a team member joins or leaves, infrastructure is significantly updated, or a post-incident review reveals a gap between the documented process and what actually happened. A good playbook isn't static; it evolves with the organisation.

What is a tabletop incident exercise, and how do I run one?

A tabletop exercise is a structured walkthrough of a fictional incident scenario, usually lasting 60–90 minutes, in which participants execute the actual processes: declaring severity, filling roles, drafting communications, and making escalation decisions. The goal is to find friction before a real incident does, not to assess individual performance.

How do I make sure my team actually uses the incident playbook?

Three things: distribute familiarity broadly (not just to one or two people), practise it regularly (at least quarterly tabletops), and assign a named owner to maintain and update it. A process that only a few people know well is a fragile process - and the person who knows it best is often unavailable when you most need them.
