Tooling

Is email your monitoring and alerting mechanism?
There is no monitoring solution; you wait for your users to tell you that your service is unresponsive
Does your infrastructure use a lot more self-signed certificates than you think?
For any N applications, at most N/2+1 use the same certificate bundle
You end up using shell for "complex stuff" because it's easier that way
Your /etc/hosts is all colored with various rules
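
For the "users are your monitoring" item above, even a tiny probe beats waiting for a complaint. The following is a minimal sketch, not a monitoring solution: the function name `is_healthy`, the 3-second default timeout, and the rule that only a 2xx response counts as healthy are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout.

    Hypothetical helper: the name, the timeout default, and the
    "2xx means healthy" rule are illustrative assumptions.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # URLError covers refused connections and HTTP error statuses,
        # OSError covers socket-level failures such as timeouts,
        # ValueError covers malformed URLs.
        return False
```

Run from a scheduler and wired to something other than email, a check like this at least tells you about an outage before your users do.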

Culture

The only person who knows the script or procedure to resolve an issue is on vacation
You often hear: "We've always done it this way."
"Prod" is just another name for "staging".
Nobody knows what exactly it is you do.

Leadership

If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.
Your quarterly planning has no meaning when the next re-org rolls around.
Management will always happily spend $$$ on outside consultants to tell them what you've been saying for years.
Management will much rather invest in inventing a new, square wheel than fixing an old round one.

Security

Your restricted shells are not as restricted as you think
Your network team has a way into the network that your security team doesn't know about.

Confidence

Quite often, you see "human error" as the root cause

Ownership

Your culture is: "If you break it, you own it - for now; if you fix it, you own it - forever."
You cannot turn things off permanently because there is no one who can "approve" it.
And so they keep lingering around

The most critical services are maintained by a handful of people, and nobody else dares to go near them

Documentation

Your runbook for fixing things says: "Turn it off and on, and you should see it working again."
The only documentation that exists is a README.md file
A document points to another document, which in turn takes you to yet another document
And this goes on forever until you give up

The document is marked obsolete but does not reference its replacement
And the current version of the document is nowhere to be found

Somewhere, somebody ran into this exact problem, but they never bothered to post a solution.

Code

The source you are looking at is not the code running in the production environment
The condition of any backup is unknown until a restore is attempted
That completely automated solution you set up requires at least three manual steps you didn't document.
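
The backup item above suggests the obvious remedy: attempt the restore before you need it. The sketch below assumes the backup is a tar archive of a single directory, stored with paths relative to that directory. The helper names `file_digests` and `restore_matches_source` are hypothetical. It restores into a scratch directory and compares SHA-256 digests against the live source.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_matches_source(archive: Path, source: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source.

    Assumes the archive stores paths relative to the source directory,
    for example one created with tar.add(source, arcname=".").
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source)
```

A check like this only proves the archive is readable and matches the source at the time of comparison; a source that keeps changing after the backup was taken will legitimately differ.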

Help Us Improve!

If you have any suggestions to improve this checklist, please let us know by filling out this form.