
The Morning Mind-Meld
How Asking "What Went Right?" Teaches Us About Safety
By Jaime Woo and Emil Stolarsky • Issue #4
During the Second World War, the United States Army Air Forces needed to add armour to its planes to limit the number being shot down. But the armour could only be added sparingly, due to weight restrictions. The answer appeared intuitive: strengthen the most heavily damaged areas observed on returning planes. 
However, the mathematician Abraham Wald had the insight that the planes being studied included only the survivors: the results would look very different if the bullet-hole patterns of downed planes could also be inspected. The planes should instead be fortified where the returning ones were least damaged, since significant damage in those areas meant a plane went down. That insight saved the lives of many pilots. 
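Wald's reasoning is easy to see in a toy simulation (this is our own illustration, not from any historical data; the areas and survival odds are made up for the example). Hits land evenly across the plane, but hits to critical areas are more likely to down it, so the holes visible on returning planes cluster in the non-critical areas:

```python
import random

random.seed(42)

AREAS = ["engine", "cockpit", "fuselage", "tail"]
# Hypothetical odds that a single hit downs the plane, chosen for
# illustration only: engine and cockpit hits are far more dangerous.
DOWN_PROB = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.05, "tail": 0.1}

def fly_mission(num_hits=3):
    """One plane takes num_hits random hits; return (survived, hit areas)."""
    hits = [random.choice(AREAS) for _ in range(num_hits)]
    survived = all(random.random() > DOWN_PROB[a] for a in hits)
    return survived, hits

all_hits = {a: 0 for a in AREAS}        # hits across every plane
surviving_hits = {a: 0 for a in AREAS}  # hits visible on returners only

for _ in range(100_000):
    survived, hits = fly_mission()
    for a in hits:
        all_hits[a] += 1
        if survived:
            surviving_hits[a] += 1

# Hits are spread roughly evenly overall, yet survivors show far fewer
# engine and cockpit holes: studying only returners misleads you.
total = sum(surviving_hits.values())
for a in AREAS:
    print(f"{a:8s} share of holes on returning planes: {surviving_hits[a] / total:.2%}")
```

Looking only at `surviving_hits`, you would conclude the fuselage needs armour; looking at `all_hits` shows the damage was uniform, and the "clean" areas on returners are exactly where hits were fatal.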
The story highlights our evolving understanding of safety. The better-known definition is “the absence of accidents and incidents (or as an acceptable level of risk),” writes Erik Hollnagel, Senior Professor of Patient Safety at Jönköping University, Sweden. He calls this Safety-I, where we ask “What went wrong, and how do we not do that anymore?” 
Yet focusing only on the areas with bullet holes tells an incomplete, and potentially misleading, story. Instead, it was asking “What went right?” and noticing where the returning planes had avoided damage that led to Wald’s Aha! moment. This shift in perspective is revelatory yet intuitive: after all, if you wanted to learn how to consistently cook a delicious meal, you wouldn’t succeed by studying only terrible ones. Observing how things go right, as often as possible, forms the basis of Safety-II. 
The Power of a Safety-II Approach
Safety-II is at the heart of a new report from researchers at the NASA Engineering and Safety Center, which they apply to civil aviation safety [PDF]. As Jon Holbrook, a cognitive scientist at NASA, told Forbes: “For every well-scrutinized accident, millions of flights in which things go right receive very little attention.” Given the wide range of trying conditions flight crews deal with, Holbrook led a team to figure out what behaviours were responsible for things going right as often as they did.
Although fatal accidents involving commercial jets draw a large amount of attention, the probability of being in one is incredibly low: approximately 1 in 5 million. (Odds of getting struck by lightning in your lifetime? 1 in 3,000.) Of course, examining those accidents matters; however, solely focusing on when things go wrong—especially in ultra-safe systems where incidents are rare—leaves a massive trove of data off the table: data that could help illuminate why things don’t go wrong more often.
It might seem obvious why things don’t go wrong: everyone followed orders as expected, so everything turned out perfectly. This might be true for simpler jobs, but for anyone working in complex environments, “as expected” becomes trickier. (For example, aboard Skylab.) In fact, Safety-I often views humans as part of the problem, while in Safety-II, Hollnagel says, “humans are consequently seen as a resource necessary for system flexibility and resilience.” 
Embracing the Adaptive Capacity of People
The NASA team interviewed pilots and air traffic controllers (ATCs) about their day-to-day behaviour to learn what they did in order for things to go well. First, they discovered just how frequent Safety-II behaviour was: 83% of ATCs reported demonstrating resilient behaviour “at least once per session.” This underlines how frequently humans adapt in order to succeed in their roles.
Surprisingly, people may not be aware of their adaptive behaviour. From Forbes:
On any given day, flight crews may experience mechanical delays, weather problems, sick passengers, or even the occasional aircraft malfunction. “Because pilots perform this way day in and day out,” Holbrook said, “they often don’t realize how exceptional and critical to safety their behavior is.”
One example the report gives is how pilots respond to their aircraft striking a bird, a potentially dangerous event. They’ll share that information with other pilots in the area: 
That information was immediately actionable to a plane behind them and actionable to ATC for warning other pilots and airport staff. However, this action did not directly benefit the plane that struck the bird, and, in fact required additional resources to communicate.
This isn’t something they’re told to do as part of their job; they do it because it matters for safety. It wouldn’t show up in Safety-I reviews, and the researchers concluded it was vital to build processes and frameworks that include a Safety-II approach. A complete view of safety includes both Safety-I and Safety-II, along with an understanding of when to apply each. We must understand how things go wrong, but also stay curious about how things go right. 
There’s something compassionate, and almost hopeful, about recognizing humans as wonderfully adaptive creatures, with many lessons to draw from to make things better.

Digging In Deeper
This is an Incident Labs project, with new issues every two weeks. We’re interested in figuring out the best practices for incident management for software companies. We also produce the Post-Incident Review, a zine focused on outages. If you use PagerDuty and Slack, our software project Ovvy simplifies scheduling and overrides, and is currently in private beta and free to use.
Jaime Woo and Emil Stolarsky

The Morning Mind-Meld is a chance to build context between the conversations happening around DevOps and SRE, and to hopefully create some inspiration, even—or, especially!—during a hectic week.
