Incident postmortem process: Track, document, and improve
Key takeaways
Incident postmortems help teams understand what happened, why it happened, and what needs to change to prevent repeat issues.
Using Jira Service Management, Confluence, and Jira together creates a connected workflow for response, documentation, and follow-up.
A consistent postmortem template makes incident reviews easier to document, compare, and learn from over time.
Turning corrective actions into Jira work items with owners and deadlines helps teams turn lessons learned into real improvements.
When something goes wrong in production, the fix is only the beginning. What matters just as much is understanding why it happened and making sure it doesn’t happen the same way again.
An incident postmortem is a structured review of the incident from start to finish, covering what broke, how the team responded, and what needs to change going forward.
With an incident response plan template guiding the process, your team can document every incident consistently, so nothing important gets missed and every review leads to real improvements.
How it works: Running incidents and capturing postmortems
Good incident management isn’t just about putting out fires. It’s about building a system where every incident feeds back into better processes, better tooling, and better preparation for next time. Using Jira Service Management, Confluence, and Jira together gives your team a connected workflow that covers the full incident response lifecycle, from the moment an alert fires through the postmortem and into follow-up work.
This approach maintains consistent documentation across incidents and establishes a clear chain of accountability. Instead of incident details scattered across Slack messages, emails, and someone’s memory, everything lives in a single, connected ecosystem where it can be reviewed, referenced, and acted on. That consistency also means your incident response plan template stays central to the process rather than being something the team fills out when they get around to it.
Here’s how the process breaks down across each stage:
Run the incident in Jira Service Management
Jira Service Management is where your incident response starts. As soon as an issue comes in, log it in JSM, set the severity level, and assign the right responders.
During the incident, teams can use JSM to:
Track actions, decisions, and escalations in real time
Maintain a clear record of who was involved and what changed
Capture the details that will later support the postmortem
Keep leadership informed without interrupting responders
Because JSM integrates with Confluence and Jira, the data collected during the incident can flow directly into postmortem documentation and follow-up work. For teams using JSM as part of a broader ITSM software setup, the incident data also feeds into the larger service management picture.
JSM also supports strong incident communication throughout the response by helping teams:
Keep customers, support teams, and stakeholders updated
Reduce confusion during active incidents
Provide visibility into status and impact
Communicate more clearly during high-severity events or crisis management scenarios
By the time the incident is resolved, the team already has a detailed record of how it played out, which makes the postmortem easier to document and more useful for future improvement.
Capture the postmortem in Confluence
Once the incident is resolved, document it while the details are still fresh — ideally within 24 to 48 hours. The longer you wait, the more context slips away, and the less useful the postmortem becomes.
Create a dedicated Confluence page using an incident postmortem template and work through each section: timeline, root cause analysis, impact assessment, and lessons learned. The incident response template included on this page provides a complete framework your team can copy and fill in for every new incident so you don’t have to figure out what to document from scratch each time.
Keeping postmortems in Confluence offers several practical benefits:
Team-wide visibility: Anyone from engineering to leadership can review what happened without chasing down the on-call responder for a verbal recap.
Pattern identification: When every incident is documented in the same format, it gets much easier to spot recurring failures and systemic weaknesses across quarters.
Blameless documentation: A structured incident response template keeps the conversation focused on systems and processes rather than pointing fingers, which leads to more honest and useful reporting.
Faster ramp-up for new hires: New team members can read through past postmortems to understand how your systems behave under pressure and what the team has already learned from previous incidents.
For a more in-depth guide to running productive postmortem reviews, read our incident postmortem handbook.
Track follow-ups as Jira work items
A postmortem is only as useful as the action it drives. Every corrective action and recurring issue identified during the review should be converted into a Jira work item with a clear owner and a deadline.
This is the step that separates teams who actually improve from those who keep running into the same problems. When corrective actions live as trackable work items in Jira, managers can monitor progress, and teams can hold each other accountable for completing the improvements they agreed to. It also helps with prioritization. When incident-driven work sits alongside the rest of your backlog, it’s easier to weigh it against other priorities rather than letting it quietly fall to the bottom of the list.
The right incident management tools connect this entire workflow end-to-end. When your response, documentation, and follow-up systems are integrated, the gap between detecting a problem and preventing it from recurring gets significantly smaller.
Incident response template
Below is an incident response plan template your team can copy and adapt for each new incident. It walks through every phase of a postmortem, from the initial summary and timeline through root cause analysis, lessons learned, and corrective actions. Using a consistent structure for every incident ensures that nothing gets overlooked and that your postmortems are easy to compare over time.
The examples in the template are a starting point, not a rigid script. Adjust the language and level of detail to match how your organization operates. The goal is to document enough context that anyone reading the postmortem months later can understand exactly what happened and what the team did about it.
Incident summary
Write a summary of the incident in a few sentences. Include what happened, why, the severity of the incident and how long the impact lasted.
EXAMPLE:
Between the hours of {time range of incident, e.g. 15:45 and 16:35} on {DATE}, {NUMBER} users encountered {EVENT SYMPTOMS}.
The event was triggered by a {CHANGE} at {TIME OF CHANGE THAT CAUSED THE EVENT}.
The {CHANGE} contained {DESCRIPTION OF OR REASON FOR THE CHANGE, such as a change in code to update a system}.
A bug in this code caused {DESCRIPTION OF THE PROBLEM}.
The event was detected by {MONITORING SYSTEM}. The team started working on the event by {RESOLUTION ACTIONS TAKEN}.
This {SEVERITY LEVEL} incident affected {X%} of users.
There was further impact, as shown by {e.g. NUMBER OF SUPPORT TICKETS SUBMITTED, SOCIAL MEDIA MENTIONS, CALLS TO ACCOUNT MANAGERS} raised in relation to this incident.
Leadup
Describe the sequence of events that led to the incident, for example, previous changes that introduced bugs that had not yet been detected.
EXAMPLE:
At {16:00} on {MM/DD/YY}, ({AMOUNT OF TIME BEFORE CUSTOMER IMPACT, e.g. 10 days before the incident in question}), a change was introduced to {PRODUCT OR SERVICE} in order to {THE CHANGES THAT LED TO THE INCIDENT}.
This change resulted in {DESCRIPTION OF THE IMPACT OF THE CHANGE}.
Fault
Describe how the change that was implemented didn't work as expected. If available, attach screenshots of relevant data visualizations that illustrate the fault.
EXAMPLE:
{NUMBER} responses were sent in error to {XX%} of requests. This went on for {TIME PERIOD}.
Impact
Describe how the incident impacted internal and external users during the incident. Include how many support cases were raised.
EXAMPLE:
For {XXhrs XX minutes} between {XX:XX UTC and XX:XX UTC} on {MM/DD/YY}, our users experienced {SUMMARY OF INCIDENT}.
This incident affected {XX} customers (X% OF {SYSTEM OR SERVICE} USERS), who experienced {DESCRIPTION OF SYMPTOMS}.
{XX NUMBER OF SUPPORT TICKETS AND XX NUMBER OF SOCIAL MEDIA POSTS} were submitted.
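The duration and percentage placeholders above come down to simple arithmetic over the incident window and your user counts. Here is a minimal Python sketch of that calculation. The function name, timestamps, and user counts are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def impact_summary(start, end, affected_users, total_users):
    """Return the impact duration and the share of users affected."""
    if end < start:
        raise ValueError("incident end precedes start")
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    minutes = int((end - start).total_seconds() // 60)
    hours, mins = divmod(minutes, 60)
    percent = round(100 * affected_users / total_users, 1)
    return {"duration": f"{hours}hrs {mins} minutes", "percent_affected": percent}


# Hypothetical values for illustration only, not taken from a real incident.
start = datetime(2024, 1, 15, 15, 45, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 16, 35, tzinfo=timezone.utc)
summary = impact_summary(start, end, affected_users=1200, total_users=48000)
print(summary)
```

Keeping both timestamps in UTC avoids off-by-an-hour errors when responders sit in different time zones.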
Detection
When did the team detect the incident? How did they know it was happening? How could we improve time-to-detection? Consider: How could we have cut that time by half?
EXAMPLE:
This incident was detected when the {ALERT TYPE} was triggered and {TEAM/PERSON} was paged.
Next, {SECONDARY PERSON} was paged, because {FIRST PERSON} didn't own the service writing to the disk, delaying the response by {XX MINUTES/HOURS}.
{DESCRIBE THE IMPROVEMENT} will be set up by {TEAM OWNER OF THE IMPROVEMENT} so that {EXPECTED IMPROVEMENT}.
Response
Who responded to the incident? When did they respond, and what did they do? Note any delays or obstacles to responding.
EXAMPLE:
After receiving a page at {XX:XX UTC}, {ON-CALL ENGINEER} came online at {XX:XX UTC} in {SYSTEM WHERE INCIDENT INFO IS CAPTURED}.
This engineer did not have a background in the {AFFECTED SYSTEM}, so a second alert was sent at {XX:XX UTC} to {ESCALATIONS ON-CALL ENGINEER}, who came into the room at {XX:XX UTC}.
Recovery
Describe how the service was restored and the incident was deemed over. Detail how the service was successfully restored and how you knew what steps you needed to take to recover.
Depending on the scenario, consider these questions: How could you improve time to mitigation? How could you have cut that time by half?
EXAMPLE:
We used a three-pronged approach to the recovery of the system:
{DESCRIBE THE ACTION THAT MITIGATED THE ISSUE, WHY IT WAS TAKEN, AND THE OUTCOME}
Example: Increasing the size of the BuildEng EC2 ASG to add nodes for the workload and reduce the likelihood of scheduling on oversubscribed nodes
Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down
Reverting the Build Engineering scheduler to the previous version
Timeline
Detail the incident timeline. We recommend using UTC to standardize for timezones.
Include any notable lead-up events, any starts of activity, the first known impact, and escalations. Note any decisions or changes made, and when the incident ended, along with any post-impact events of note.
EXAMPLE:
All times are UTC.
11:48 - K8S 1.9 upgrade of control plane is finished
12:46 - Upgrade to V1.9 completed, including cluster-autoscaler and the BuildEng scheduler instance
14:20 - Build Engineering reports a problem to the KITT Disturbed
14:27 - KITT Disturbed starts investigating failures of a specific EC2 instance (ip-203-153-8-204)
14:42 - KITT Disturbed cordons the node
14:49 - BuildEng reports the problem as affecting more than just one node. 86 instances of the problem show failures are more systemic
15:00 - KITT Disturbed suggests switching to the standard scheduler
15:34 - BuildEng reports 200 pods failed
16:00 - BuildEng kills all failed builds with OutOfCpu reports
16:13 - BuildEng reports the failures are consistently recurring with new builds and were not just transient.
16:30 - KITT recognizes the failures as an incident and runs it as an incident.
16:36 - KITT disables the Escalator autoscaler to prevent the autoscaler from removing compute to alleviate the problem.
16:40 - KITT confirms ASG is stable, cluster load is normal and customer impact resolved.
TEMPLATE:
XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN
XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN
XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN
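If your timeline entries come from several tools or time zones, a small script can normalize them to UTC and put them in order before you paste them into the template. The sketch below is one possible approach, not a required step. The `render_timeline` helper, the date, and the UTC-5 offset are assumptions made for illustration, and the event notes are paraphrased from the example above.

```python
from datetime import datetime, timedelta, timezone


def render_timeline(events):
    """Sort (timestamp, note) pairs and render them as 'HH:MM UTC - note' lines."""
    lines = []
    for ts, note in sorted(events, key=lambda e: e[0]):
        utc = ts.astimezone(timezone.utc)  # normalize any offset to UTC
        lines.append(f"{utc:%H:%M} UTC - {note}")
    return lines


day = dict(year=2024, month=1, day=15)  # hypothetical date
utc_minus_5 = timezone(timedelta(hours=-5))  # hypothetical responder time zone
events = [
    (datetime(**day, hour=16, minute=40, tzinfo=timezone.utc), "Customer impact resolved"),
    (datetime(**day, hour=9, minute=20, tzinfo=utc_minus_5), "Build Engineering reports a problem"),
    (datetime(**day, hour=16, minute=30, tzinfo=timezone.utc), "Failures recognized as an incident"),
]
timeline = render_timeline(events)
print("\n".join(timeline))
```

The entry recorded at 09:20 in UTC-5 is rendered as 14:20 UTC, so it sorts ahead of the two later entries.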
Root cause identification: The Five Whys
The Five Whys is a root cause identification technique. Here’s how you can use it:
Begin with a description of the impact and ask why it occurred.
Note the impact that it had.
Ask why this happened, and why it had the resulting impact.
Then, continue asking “why” until you arrive at a root cause.
List the "whys" in your postmortem documentation.
EXAMPLE:
The application had an outage because the database was locked
The database locked because there were too many writes to the database
Because we pushed a change to the service and didn’t expect the elevated writes
Because we don't have a development process established for load testing changes
Because we never felt load testing was necessary until we reached this level of scale.
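If you keep postmortem data in a structured form, the chain of whys can be stored as an ordered list, with the last answer treated as the candidate root cause. This is a minimal Python sketch using the example answers above. The `summarize_whys` helper is hypothetical and only illustrates the shape of the data.

```python
def summarize_whys(answers):
    """Number each answer in the chain and return the last one as the candidate root cause."""
    if not answers:
        raise ValueError("provide at least one answer")
    numbered = [f"Why #{i}: {answer}" for i, answer in enumerate(answers, start=1)]
    return numbered, answers[-1]


answers = [
    "The application had an outage because the database was locked",
    "The database locked because there were too many writes to the database",
    "Because we pushed a change to the service and didn't expect the elevated writes",
    "Because we don't have a development process established for load testing changes",
    "Because we never felt load testing was necessary until we reached this level of scale",
]
numbered, root_cause = summarize_whys(answers)
print("\n".join(numbered))
print("Candidate root cause:", root_cause)
```

The chain does not have to be exactly five entries long. Stop when the answer points at something the team can actually change.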
Root cause
Note the final root cause of the incident, the thing identified that needs to change in order to prevent this class of incident from happening again.
EXAMPLE:
A bug in
Backlog check
Review your engineering backlog to find out if there was any unplanned work there that could have prevented this incident, or at least reduced its impact.
A clear-eyed assessment of the backlog can shed light on past decisions around priority and risk.
EXAMPLE:
No specific items in the backlog that could have improved this service. There is a note about improvements to flow typing, and these were ongoing tasks with workflows in place.
There have been tickets submitted for improving integration tests but so far they haven't been successful.
Recurrence
Now that you know the root cause, can you look back and see any other incidents that could have the same root cause? If yes, note what mitigation was attempted in those incidents and ask why this incident occurred again.
EXAMPLE:
This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452.
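When every postmortem records its root cause in a consistent way, checking for recurrence can be as simple as grouping incident keys by root cause. The sketch below assumes each postmortem carries a short root-cause label. The labels and the `HOT-20001` entry are hypothetical, and only the first three incident keys come from the example above.

```python
from collections import defaultdict


def recurring_root_causes(postmortems, min_count=2):
    """Group incident keys by root-cause label and keep labels seen at least min_count times."""
    by_cause = defaultdict(list)
    for incident_key, root_cause in postmortems:
        by_cause[root_cause].append(incident_key)
    return {cause: keys for cause, keys in by_cause.items() if len(keys) >= min_count}


postmortems = [
    ("HOT-13432", "missing-load-test"),    # label is hypothetical
    ("HOT-14932", "missing-load-test"),
    ("HOT-19452", "missing-load-test"),
    ("HOT-20001", "expired-certificate"),  # hypothetical incident and label
]
recurring = recurring_root_causes(postmortems)
print(recurring)
```

Any label that shows up more than once is a prompt to ask why the earlier mitigation did not hold.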
Lessons learned
Discuss what went well in the incident response, what went poorly, and where there are opportunities for improvement.
EXAMPLE:
Need a unit test to verify the rate-limiter for work has been properly maintained
Bulk operation workloads which are atypical of normal operation should be reviewed
Bulk ops should be started slowly and monitored, increasing when service metrics appear nominal
Corrective actions
Describe the corrective action ordered to prevent this class of incident in the future. Note who is responsible and when they have to complete the work and where that work is being tracked.
EXAMPLE:
Manual auto-scaling rate limit put in place temporarily to limit failures
Unit test and re-introduction of job rate limiting
Introduction of a secondary mechanism to collect distributed rate information across cluster to guide scaling effects
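To make sure each corrective action ends up with an owner and a deadline, some teams script the creation of follow-up work items. The Python sketch below only builds the request bodies and does not call any API. The payload shape follows our understanding of the Jira Cloud REST issue-create format, so verify it against the API reference for your site. The project key, account IDs, due dates, and labels are hypothetical, and whether fields such as assignee or due date can be set depends on your project configuration.

```python
from datetime import date


def to_jira_issue_fields(action, project_key, owner_account_id, due, incident_key):
    """Build the JSON body for one corrective action as an issue-create request."""
    return {
        "fields": {
            "project": {"key": project_key},
            "issuetype": {"name": "Task"},
            "summary": action,
            "assignee": {"accountId": owner_account_id},
            "duedate": due.isoformat(),  # YYYY-MM-DD
            "labels": ["postmortem", incident_key],
        }
    }


# Actions come from the example above; owners and dates are hypothetical.
actions = [
    ("Unit test and re-introduction of job rate limiting", "account-id-1", date(2024, 2, 1)),
    ("Introduction of a secondary mechanism to collect distributed rate information",
     "account-id-2", date(2024, 3, 1)),
]
payloads = [
    to_jira_issue_fields(summary, "OPS", owner, due, "HOT-13432")
    for summary, owner, due in actions
]
print(payloads[0])
```

Tagging each work item with the incident key makes it easy to list every follow-up for a given postmortem later.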