I want to share two great documents that we have created.
Here is the Post Mortem Template we use. Below it is a link to download a spreadsheet version.
| Field name | Description |
| --- | --- |
| Title | As title |
| Report status | Status of the incident analysis report: draft, reviewed, or published |
| Outage reference number | As title |
| Executive Summary | As title |
| Start time | When the outage started |
| End time | When the outage ended |
| Duration | Total duration of the outage |
| Product(s) | Major product(s) impacted |
| Other product(s) impacted | Additional product(s) impacted that are outside the scope of the analysis |
| Ops On-Call | On-call engineer during the outage |
| Ops Contact #1 | Additional Ops engineer involved |
| Ops Contact #2 | Additional Ops engineer involved |
| Outage Resolution | How the outage was resolved |
| Last Outage | When we last had a similar outage |
| Recent application builds | Related application changes within a given time period |
| Related change and maintenance | Related infrastructure changes and maintenance within a given time period |
| Timeline Analysis | Detailed timeline of the incident and the corresponding measurements. Time to Detect (TTD): time it takes for the monitoring system to detect the problem. Time to Notify (TTN): time it takes for the monitoring system to notify Operations. Time to Respond (TTR’): time it takes for Operations to respond. Time to Troubleshoot (TTT): time it takes for Operations to diagnose the problem. Time to Repair (TTR): time it takes for Operations to recover the system. |
| Log Analysis | Analysis of application logs and system logs |
| Monitoring Correlation | Correlation of monitoring data from: application availability and system monitoring (e.g. Zenoss); user experience monitoring (e.g. Truesight); application profiling (e.g. dynaTrace) |
| Review and Recommendation | Specific recommendations to improve outage handling, each classified as one of the following types: process change; Dev change request; Ops change request; infrastructure enhancement |
| Root Cause | Root cause of the incident |
| Reference | Documents used during incident analysis: in-house design documents, vendor documents, white papers, etc. |
| Contributors | Others involved in the incident analysis |
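
The Duration and Timeline Analysis fields come down to simple timestamp arithmetic. Here is a minimal Python sketch of how they might be computed. It assumes each metric is the gap between two consecutive milestones and that the outage ends when the repair completes. The class name, field names, and sample times are hypothetical and not part of the template.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    # Hypothetical milestone timestamps; the template does not name these fields.
    started: datetime    # outage begins
    detected: datetime   # monitoring system detects the problem
    notified: datetime   # monitoring system notifies Operations
    responded: datetime  # Operations starts working the incident
    diagnosed: datetime  # Operations has identified the problem
    repaired: datetime   # system is recovered (assumed to be the outage end time)

    def metrics(self) -> dict:
        """Return each timeline measurement as a timedelta."""
        return {
            "TTD": self.detected - self.started,
            "TTN": self.notified - self.detected,
            "TTR'": self.responded - self.notified,
            "TTT": self.diagnosed - self.responded,
            "TTR": self.repaired - self.diagnosed,
            "duration": self.repaired - self.started,
        }


# Made-up sample incident, for illustration only.
t0 = datetime(2024, 1, 1, 10, 0)
timeline = IncidentTimeline(
    started=t0,
    detected=t0 + timedelta(minutes=4),
    notified=t0 + timedelta(minutes=5),
    responded=t0 + timedelta(minutes=15),
    diagnosed=t0 + timedelta(minutes=45),
    repaired=t0 + timedelta(minutes=60),
)
m = timeline.metrics()
for name, value in m.items():
    print(f"{name}: {value}")
```

Under these assumptions the five measurements add up to the total duration, which makes a handy sanity check when filling in the template.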