Digital Incident Response: The OODA Loop In Action
It’s 2 am. You’re sound asleep, dreaming of marshmallow clouds, taco cats, and rainbow unicorns. Suddenly, your pleasant dreams are shattered by the piercing notes of your phone’s alert tone. Your eyes snap open, but your brain isn’t fully engaged yet. You look at the screen and the funny shapes slowly resolve into letters, then words, and then finally thoughts. Your site is down!
Paralyzed by the shock of the rude awakening, you’re not sure what to do. Is it a problem with the order manager again? Is the hosting provider down? Did someone deploy code that broke everything? Did the payment gateway break? What do you do? Where do you look?
These are among the most stressful moments in eCommerce, but a standard incident response plan eases them by giving responders clarity and concrete tasks the moment an incident occurs.
What Makes A Good Response Plan?
A good incident response plan looks like the OODA loop. Created by US Air Force Colonel John Boyd, OODA stands for Observe, Orient, Decide, and Act.
After retiring from the USAF, Colonel Boyd distilled his experience as a master strategist and tactician into the concept of the OODA loop. His goal was to create a framework that helps individuals and organizations make decisions quickly in uncertain environments.
The Digital OODA Loop
So, what does this have to do with eCommerce? Critical incidents are some of the most uncertain business environments around. Here’s how the loop applies to them.
First, we observe. What error generated the alert? What does that error mean in context? What other outside influences are impacting your site at the moment? Does this alert have a business impact? Are all users of the platform impacted, or just a specific subset? These observations are important first steps toward resolving the issue.
Next, we orient to the situation. How does the current situation compare to the expected state of the eCommerce platform? Have we dealt with a similar situation in the past? Have we made any recent changes to the platform? Did something change “upstream”?
Based on what we learned in the previous two steps, we decide what action to take.
Finally, we act on that decision.
If we’ve done things right, that action will have changed the situation, so we loop back to the top of the OODA loop to observe the new situation. If we’ve mitigated or resolved the issue, then we’re done. If the issue persists, we step back through the process. We continue this until the incident is resolved.
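The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Observation` type, the `run_ooda_loop` function, the `max_cycles` limit, and the toy "bad deploy" scenario are all hypothetical names invented for this example, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    healthy: bool  # is the site currently serving customers?
    error: str     # what the alert or health check reported


def run_ooda_loop(
    observe: Callable[[], Observation],
    orient: Callable[[Observation], str],
    decide: Callable[[str], str],
    act: Callable[[str], None],
    max_cycles: int = 5,  # assumed cap so the loop cannot spin forever
) -> Tuple[bool, List[str]]:
    """Repeat Observe, Orient, Decide, Act until the site looks healthy."""
    journal: List[str] = []
    for cycle in range(1, max_cycles + 1):
        observation = observe()            # Observe: what is happening now?
        if observation.healthy:
            journal.append(f"cycle {cycle}: site healthy, stopping")
            return True, journal
        diagnosis = orient(observation)    # Orient: compare with the expected state
        action = decide(diagnosis)         # Decide: pick one action
        act(action)                        # Act: carry it out, then loop back
        journal.append(
            f"cycle {cycle}: {observation.error} -> {diagnosis} -> {action}"
        )
    return False, journal


# Toy scenario (hypothetical): a bad deploy broke checkout.
site = {"bad_deploy": True}


def observe() -> Observation:
    broken = site["bad_deploy"]
    return Observation(healthy=not broken,
                       error="HTTP 500 on checkout" if broken else "")


def orient(observation: Observation) -> str:
    return "recent deploy"


def decide(diagnosis: str) -> str:
    return "roll back"


def act(action: str) -> None:
    site["bad_deploy"] = False  # pretend the rollback worked


resolved, journal = run_ooda_loop(observe, orient, decide, act)
print(resolved)
for entry in journal:
    print(entry)
```

The journal matters as much as the loop itself: it records what was observed and what was tried on each pass, which is the raw material for the documentation discussed later. When the loop runs out of cycles without reaching a healthy state, it returns `False`, which is a natural signal that the responders need more help.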
Applying the OODA Loop to Incident Response
In practice, applying the paradigm is a bit more complicated. Several outcomes must be met before we can call an incident resolved. We also need to document everything as we go, so we can tell whether we have simply mitigated the incident or whether more work is required to fully resolve the issue.
We also have to communicate with stakeholders right away, and we need processes for situations where the initial response is insufficient and must be escalated to bring in more resources.
Advanced Incident Response
To cover all these points, we combine our OODA loop with a larger problem management process consisting of three steps: Identify, Mitigate, and Resolve. Communication and escalation steps run through all three.
Escalation is required at any point in the incident response process when the current responders are unable to effectively execute the OODA loop. When this happens, the incident responders need a clear path for asking for help. Having a plan in place that defines this escalation path is a key component to instilling confidence in your incident responders. They need to know that there is help available should they get stuck.
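A defined escalation path can be as simple as an ordered list of who gets pulled in next. The sketch below assumes three made-up tiers and a hypothetical `can_resolve` callback that stands in for "this responder was able to work the OODA loop to a result"; real escalation policies usually also involve timeouts and paging tools, which are left out here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical tiers, ordered from first responder to last resort.
ESCALATION_PATH = ("on-call engineer", "platform lead", "vendor support")


def respond_with_escalation(
    can_resolve: Callable[[str], bool],
    path: Sequence[str] = ESCALATION_PATH,
) -> Tuple[Optional[str], List[str]]:
    """Walk the escalation path until someone can handle the incident.

    Returns the responder who handled it (or None if the path was
    exhausted) along with everyone who was engaged along the way.
    """
    engaged: List[str] = []
    for responder in path:
        engaged.append(responder)      # bring this tier into the incident
        if can_resolve(responder):     # could they execute the loop effectively?
            return responder, engaged
    return None, engaged               # nobody on the path could handle it


# Example: the on-call engineer is stuck, the platform lead is not.
handled_by, engaged = respond_with_escalation(lambda r: r == "platform lead")
print(handled_by)
print(engaged)
```

Keeping the list of engaged responders makes it easy to tell stakeholders who is working on the incident at any moment.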
Communication throughout the process is critical. Incidents are stressful, and stakeholders and customers want to know that their commerce platform is being cared for. When things go wrong, they need confidence that the incident is being handled properly and that they will be back up and running as soon as possible. Regular updates keep stakeholders informed and confident in both the resolution process and their support provider.
Incident response ends when an incident is mitigated, meaning the condition that created the incident is no longer present. Mitigation is not always the end of the story: service may be restored while the underlying conditions could still return and cause a new incident. In that case, the incident is documented and re-packaged as a problem to be resolved through more robust platform changes. Once the platform is made resistant to the conditions that created the incident, the problem is considered resolved.
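The difference between a mitigated incident and a resolved problem can be made explicit by tracking a stage for each incident. The following is a minimal sketch, assuming a hypothetical `Stage` enum built from the Identify, Mitigate, and Resolve steps and two made-up flags, `service_restored` and `root_cause_fixed`.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = "identified"  # we know something is wrong
    MITIGATED = "mitigated"    # service is back, cause may still lurk
    RESOLVED = "resolved"      # platform hardened against the cause


def advance(stage: Stage, service_restored: bool, root_cause_fixed: bool) -> Stage:
    """Move an incident forward one stage at a time, never skipping mitigation."""
    if stage is Stage.IDENTIFIED and service_restored:
        return Stage.MITIGATED
    if stage is Stage.MITIGATED and root_cause_fixed:
        return Stage.RESOLVED
    return stage  # nothing changed, or not enough has been done yet


# Example: a rollback restores service, and a later platform change fixes the cause.
stage = Stage.IDENTIFIED
stage = advance(stage, service_restored=True, root_cause_fixed=False)
print(stage.value)
stage = advance(stage, service_restored=True, root_cause_fixed=True)
print(stage.value)
```

In this sketch an incident stays at the mitigated stage until the follow-up work is done, so it cannot quietly drop off the list once the site is back up.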
The Internet is a dynamic place to do business. A solid incident response plan brings stability and confidence to your eCommerce platform: incident responders will know what to do, and each incident will feed a process that makes the platform stronger.