open thought and learning

Kübler-Ross model – Tailored for operations teams


Unless a direct alert for an application fires, the default assumption is almost always that the problem does not lie with the application. This happens most often with alerts triggered on upstream tiers in the application stack. I first came across the Kübler-Ross model while watching House and reading some articles online. It’s widely known around the world as the ‘5 Stages of Grief’.

From our perspective, here’s how the application team’s responses change, in line with the Kübler-Ross model:

Denial: “Nothing’s wrong with our tier. Why did you even call us?”, “This is a *false alarm*”.

This is only a temporary defense for the application team. The feeling is generally replaced by concern about the kind of impact this incident might have.

Anger: “How can my application fail?!”, “Not a single alert fired!”, “Check the freakin’ network!!”

Once in the second stage, the team recognizes that denial cannot continue and moves on to getting other teams on the line. Surely it cannot be responsible alone.

Bargaining: “This isn’t really a user-facing problem!”, “This is actually a dissatisfaction report, not an incident, come on!”

The third stage involves the hope that the team can somehow postpone the impact or the creation of an ‘incident’. Usually, the negotiation to ignore the incident is made with Tier 2 in exchange for improved alerting, network-bashing and other personal favors.

Depression: “[TIER2] XYZ APPTEAM, are you looking at it?…[TIER2] Ping? …..[TIER2] You there?….[APPTEAM] Still Looking ….. [TIER2] Any update?”

During the fourth stage, the application team begins to understand the certainty of an incident. Because of this, the team representative may become silent, refuse to be disturbed, and spend more time looking at application counters, Cactus etc., trying to determine what went wrong with their beloved application. Didn’t they love it enough? Did it catch ‘the bug’?

Acceptance: “Yes, we appear to be losing X dollars per 100 page hits”, “Can’t fix the code, might as well fail over traffic and mitigate impact.”

In this last stage, the application team begins to come to terms with the ‘mortality’ of the feature and understands that mitigation needs to be done.


Written by mohitsuley

April 16, 2011 at 12:49 am

Posted in sysadmin
