This post is a list-style summary of an online class I took in 2022. I’m writing it as a note/cheatsheet for myself and for anyone who may find it useful.
- IDENTIFY the problem and determine the scope.
  - Question all stakeholders concerned.
  - Identify the scale of the problem.
  - Collect additional logs/reports.
  - If possible, try to replicate the problem.
  - Perform backups.
  - Escalate if necessary. Asking for help is not a weakness.
- ESTABLISH A THEORY of probable cause.
  - Question the obvious.
  - Does the problem stem from a central point, or is it isolated?
  - Escalate if needed.
- TEST THE THEORY to determine the cause.
  - Establish a new theory if the current one is not confirmed.
  - Undo previous steps.
  - Escalate if needed.
- ESTABLISH A PLAN of action to resolve the problem.
  - Notify the users/stakeholders who will be impacted.
- IMPLEMENT THE SOLUTION or escalate.
  - Make one change at a time, then test and confirm.
  - Reverse the change if the problem is not resolved.
- VERIFY FULL SYSTEM functionality.
  - Implement preventive measures.
- PERFORM ROOT CAUSE ANALYSIS.
  - Did you treat the symptom and not the root cause?
- DOCUMENT your findings, actions and outcomes.
  - Create a wiki, KBA or FAQ for the admin team.
  - Use the Notes/Remarks inputs.
  - Share the knowledge!
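
For the "make one change at a time, test, and reverse if not resolved" part of the list, here is a minimal Python sketch of how I picture that loop. Nothing here comes from the class itself: the `Change` structure, the `try_changes` helper, and the toy `config` dictionary are all hypothetical names made up for illustration, and a real fix would of course touch a real system rather than a dict.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Change:
    """One candidate fix (hypothetical structure): how to apply it and how to undo it."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


def try_changes(changes: list[Change], is_resolved: Callable[[], bool]) -> Optional[str]:
    """Apply changes one at a time; keep the first one that resolves the problem."""
    for change in changes:
        change.apply()
        if is_resolved():
            return change.name  # confirmed: stop here and keep this change
        change.revert()  # not resolved: reverse before trying the next theory
    return None  # nothing worked: time to escalate


# Toy system state standing in for a real configuration (assumed values).
config = {"dns": "10.0.0.1", "mtu": 9000}


def set_key(key: str, value) -> Callable[[], None]:
    """Build a zero-argument function that sets one config key."""
    def _set() -> None:
        config[key] = value
    return _set


changes = [
    Change("lower MTU", set_key("mtu", 1500), set_key("mtu", 9000)),
    Change("switch DNS", set_key("dns", "10.0.0.2"), set_key("dns", "10.0.0.1")),
]

# In this toy scenario the problem counts as resolved once DNS points to 10.0.0.2.
fixed_by = try_changes(changes, is_resolved=lambda: config["dns"] == "10.0.0.2")
print(fixed_by)  # switch DNS
print(config)    # {'dns': '10.0.0.2', 'mtu': 9000}
```

The MTU change is applied, tested, and rolled back before the DNS change is tried, so the final state only contains the change that actually fixed the problem. That also makes the documentation step easier, since you know exactly which change to write down.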