NOEL: Hello and welcome to Episode 45 of the Tech Done Right Podcast, Table XI's podcast about building better software, careers, companies and communities. I'm Noel Rappin. My guest this time is Nickolas Means. Nick is a software engineering manager who's fascinated by engineering disasters and he's shared that knowledge in a number of conference talks, each of which focuses on a specific disaster and what we can learn both from the disaster and from how it was handled. In this podcast, we talk about what you can learn from studying disasters, and how to create a company culture in calm times that will work smoothly in stressful times. We talk about how a successful engineering team works with stories and how they handle their mistakes. Along the way, we talk about the recent incident at Seattle airport, the Citicorp Building in Manhattan and a few other engineering and team missteps. We have what we hope is a successful show about failure, which I have gotten off to a rousing start because this is the third time I have recorded this intro. Before we start the show, a few quick messages. Table XI is now offering training for developer and product teams. If you want me to come to your place of business and run a workshop on testing or working with legacy code or Agile team structure or career development, that is a thing that can happen and for more information, you can hit TableXI.com/Workshops or email us at Workshops@TableXI.com. If you like the show, telling a friend or a colleague or your social media network or telling me, those are all very helpful and leaving a review on Apple Podcasts will help people find the show. Thanks and here is my conversation with Nickolas. Nickolas, would you like to introduce yourself to the podcast audience? NICKOLAS: Sure. Hi, everybody. I'm Nick. I'm a VP of engineering at Muve Health and I am a frequent speaker on disasters and the things that we can learn from them. NOEL: Right. Nickolas is here to talk about failure and disaster -- topics near and dear to all of our hearts, I think. NICKOLAS: It makes everybody feel warm and fuzzy. NOEL: Well, there's nothing like a good near-miss disaster story... How did you come to become a connoisseur of engineering team failure and what do you feel like you learned from it? NICKOLAS: I don't know if you remember the stories or the TV show, Seconds From Disaster. I watched way too much of that show as a kid. I think it's sort of wired my brain to be very fascinated by things that are, like you said, near-misses. They're just fascinating to me, near-misses or not. As I got deeper into my career in software engineering, one of the things that I realized is that the aviation industry, in particular, is really good about learning from failure, learning from accidents. It's a skill that benefits us quite a bit. NOEL: There's a couple of different ways to go here. Let's first talk a little bit about what kinds of things you feel like you learn from studying these failures, which are not typically software team failures. You're usually talking about other types of complex system failures, but how do you generally feel like that informs the way that you handle a software team? NICKOLAS: One of the things that I like best about the way that failure is handled in aviation in particular is that they spend a lot of time looking into the sequence of events of the accident. They don't try to find immediately whose fault it was, who did something wrong. 
They try to understand the full story and then they approach it from a perspective of what needs to change to keep this from happening in the future. I think that's something that most software engineering teams try to do but aren't always great at doing. NOEL: Yeah. We should mention, maybe go into a little bit of detail about the National Transportation Safety Board process here because it's really interesting and I suspect a lot of people here don't know very much about it. NICKOLAS: The way that accident investigations work with the NTSB, there is a go team that is constantly on call. As soon as the NTSB accident investigation board decides that an accident has occurred that's of enough significance to warrant an NTSB presence on the ground, that team is ready to go within an hour or two. They hop on a private plane and go to wherever the accident has happened, so that they can start gathering information almost immediately. There's also -- everybody's obviously heard of the black box recorder and the cockpit voice recorder -- so there's a real-time record of what happened in the cockpit leading up to an accident. They focus on gathering as much objective fact as they can versus making subjective judgments about what happened in a scenario. They try to understand what actually happened, who knew what at the time and approach it from the pilot's perspective or the air traffic controller's perspective. NOEL: My understanding is there's also kind of a robust expectation of self-reporting of near accidents. Am I remembering that correctly? NICKOLAS: Yeah, absolutely there is. There is an absolute expectation of self-reporting and one of the things they do to incentivize that is there's really no penalty for self-reporting. If there's a near-miss, they want it reported so that they can learn from it. NOEL: Yeah. I listened to a podcast, I think from Vox, about healthcare a few months ago. We'll try to find it for the show notes. They were talking about how a hospital had dramatically decreased infection deaths, and what they had done, at a cultural level, the way that they described it, was that they stopped treating deaths as car crashes and started treating them as plane crashes. Which is to say that rather than treating them as kind of inevitable, every time there was an incident that led to a patient death, they walked through it in a very NTSB-style fashion, to try and figure out what the decision point was, what things could have been changed to improve conditions, and then implemented changes based on each individual patient incident. I thought that was really interesting. The metaphor of car crash problems versus plane crash problems really struck me and it seems to me like a lot of the software teams that I've been on treat problems as car crashes when they should be treating them as plane crashes. Is that the kind of thing that you are hoping to bring to your software teams? NICKOLAS: Yeah, absolutely. When something goes wrong, I want us to immediately switch into learning mode, even as we're fixing whatever happened, so that we can figure out what we should do differently in the future. There's definitely a lot of crossover between healthcare and aviation and the ideas of human factors and resilience engineering and just culture -- just culture is sort of the idea that punishment is often not the most helpful way to deal with something that went wrong, and neither is accepting it as inevitability. 
The goal is any time something goes wrong, to learn everything you can from it, get people to give their account of what happened and then figure out how to change going forward. NOEL: Yeah and in the talk that you gave about Three Mile Island, you talk about, I think it's "first story" and "second story". Could you describe those terms for a sec? NICKOLAS: Yeah, sure. The idea is anytime an accident happens, human nature is to immediately look for the human cause, who did something wrong. That's how our brains are wired to approach problems. That's the first story of an accident but buried beneath that first story is the reason that the humans in the story made the decisions that they made -- the systemic causes that informed the way that they approached the situation, and that is the second story. NOEL: Right and so, what we're trying to do is create a culture of finding that second story and drawing lessons from the second story, which are systemic changes, and not drawing lessons from the first story, which is blaming people for the systems that they're embedded in. NICKOLAS: The book that that idea of first and second stories came out of is The Field Guide to Understanding Human Error by Sidney Dekker. One of the other things he talks about in that book is 'the no bad apples idea,' and that's that everybody comes to work intending to do a good job. Nobody comes to work intending to screw up and nobody is particularly incompetent, unless your hiring process is really, really bad. That's a whole separate issue. NOEL: Yeah. Malevolence or actual bad actors, in my experience, are pretty rare. Not zero but pretty rare. NICKOLAS: Very, very rare. The idea is that if you'll pause long enough to figure out why somebody made the decision that they made, to look at it from their perspective, they probably had a really good reason for doing what they did. NOEL: Right and you do sometimes see this in the software world, where an infrastructure company like GitHub or somebody will publish incident reports after an outage, where they talk about what happened and try to talk about what changes they'll make going forward, and that kind of thing is really interesting. I know that you wanted to talk about a systems failure that was in the news a couple of days ago as we record this. NICKOLAS: Yes. The story I want to tell is fascinating because it's not a systems failure on the surface. It wouldn't be obvious to a casual observer that it is a systems failure. It's the story of Richard Russell and the Bombardier Dash 8 that he stole in Seattle and subsequently crashed in Puget Sound. I think the fascinating thing about it is that even though it's clearly an individual that took a plane and intended to crash it into the ground, the aviation industry's immediate response is still to try and figure out what to learn from it, so that they can prevent this kind of tragedy from ever happening again. I've not heard any public blame at all for Richard Russell. NOEL: No, what I've mostly heard is praise for the air traffic controllers in particular, and other pilots, for handling this situation calmly and trying to prevent larger accidents from happening, but I think it would be easy to see a response where on the one hand you say, "Well, this is just a one-off. This is just a person who obviously had issues. 
There's nothing we need to do about it," or you could see the issue going the other way and sort of coming up with a response where, "What we need to do is dramatically increase surveillance of maintenance workers." NICKOLAS: The interesting thing about it is this is the rare case where there is a true malevolent actor. When he took off in that plane, he didn't intend to land it, but instead of being upset that he crashed a $32 million aircraft, they're upset that they lost one of their own employees, that they didn't figure out how to help him and that they didn't figure out how to prevent that plane from taking off unauthorized. NOEL: As somebody who is not tremendously well-versed in the aviation industry, I think that one of the surprises there was that this was even possible in the first place. I don't know, do you have insight on what kind of systems were in place or weren't in place that allowed this to even happen? NICKOLAS: There's a lot going on there. I think it's a classic case of security by obscurity. Planes don't have keys on their doors, they don't have ignition keys, but you do have to know a pretty complicated series of switches and settings in order to get the turbine engines on that plane to start. You also have to have access to the secure area of the airport to start with, which he obviously did because he was a ground agent. Beyond that, if you have the knowledge to start a plane and the ability to get to the plane, you can get on a plane and start a plane and take off in it. There's a lot of assumed good actors in the industry and I think that's probably where some of the changes from this will come: how do we prevent people who are not authorized to start up and take a plane from starting up and taking a plane? NOEL: I think that we don't realize how much of our daily -- security is not quite the right word -- but how much we depend on mostly having good actors around us on the road, walking through the streets, in a way that when that is shattered, the response to that I think becomes genuinely challenging. I don't know what the question is here except like how do you think about how to frame a response to the one bad actor in years and years? NICKOLAS: That's the question -- how much security is necessary in response to this? After 9/11 happened, the entire aviation industry worldwide retrofitted every commercial aircraft with code-secured cockpit doors. That was something that hadn't been in place before. Nobody had ever thought to put a lock on a cockpit and that was a pretty drastic reaction. What's the reaction here? It would be really complicated and really expensive to essentially key every commercial aircraft in the world and make it so you have to have a key to start it up. NOEL: Right, and you're just adding another failure mode to the normal operation of the system. NICKOLAS: Yeah. The cost of that, as you just alluded to, is probably higher than the benefit, in that there's no telling how many flights would end up canceled because that ignition key would not work on a given day. NOEL: "Ladies and gentlemen, the pilot lost their key." NICKOLAS: And the other challenge for that is that a plane doesn't belong to a particular pilot. Pilots are interchangeable pieces in the grand aviation puzzle. There's times that a replacement pilot will show up to an airport five minutes before the flight's supposed to take off because the pilot that was intended for that plane is sick that day, so who owns the keys, who carries them around? 
Who decides when it's okay to give them to a pilot? NOEL: Right and then somebody is going to say, "Well, you make it biometric," and that causes a whole 'nother set of problems. NICKOLAS: Let's add software and really get this complicated. NOEL: One of the stories that you tell, that resonates with me both as a really neat story and also a story that I think has really interesting lessons for the kind of culture that we build on our teams, is the story of the Citicorp Building in Manhattan. Do you want to sketch that story out briefly? NICKOLAS: The Citicorp Building is an interesting building in that they acquired the land for that building from a church on the condition that the church could remain on its corner and the building would essentially have to be built over it. In order to build the biggest building possible, that meant that it had to be essentially elevated on multistory stilts. This is not an uncommon construction technique but typically, if you did that, the stilts would be at the corners of the lot. In this case, they couldn't do that because of the church, so the stilts are at the midpoint of the lot and they're at the midpoint of the walls of the building going up, and that resulted in a potential failure modality when the wind hit the building on the corner, and it was one the building codes didn't recognize or didn't test for at the time. Bill LeMessurier, the structural engineer that designed it, actually found out about the issue from a grad student who was doing a presentation on the unique structure of this building and it resulted in a massive amount of going back in and retrofitting and adding bracing to this building, in order to keep it structurally stable in these quartering winds. NOEL: If you want more detail in the story, we'll put it in the show notes; there's Nickolas's talk about this, which is excellent, and there's also a really good episode of the 99% Invisible Podcast, which I think is where I heard that story first, before I saw the talk. NICKOLAS: Yeah. That 99% Invisible episode is excellent. NOEL: It's a fascinating story and what resonates with me, I think, is the idea that here is this tremendous achievement that you thought you built, whether it's this building or whether it's a software system, and you suddenly discover that it has a potentially hugely damaging flaw. Then the question becomes like, "What do you do with that?" and you want to have built the kind of culture where when somebody discovers that kind of thing, they surface it so that it can be fixed, rather than be defensive or try to hide it, so that it doesn't get fixed. I think to me, the interesting lesson of that story and the most interesting part of that story is the reaction of the engineers in surfacing it in a way that it could be fixed. Do you see that as having that particular kind of resonance with the teams we build and what kinds of things do you think you can do to foster that kind of culture? NICKOLAS: One of the classic engineering failure modalities is somebody that drops a table in the production database and then tries to restore it themselves and ends up dropping the rest of the database in the process. Things like that happen all the time. When something happens, you want somebody to raise their hand so that the full resources of the team can be focused on solving the problem. 
In aviation, there's a term called crew resource management, which essentially means exactly that: when something goes wrong in the cockpit, everybody in the cockpit should be having input into the situation and working towards the solution, versus just the captain dictating what's going to happen. On a software team, I think the key to building this is to focus on blamelessness. If people are afraid they're going to be punished or they're going to get their hands slapped because they did something wrong, then they're probably going to try to hide it. But if instead, you build a learning culture where people aren't afraid of punishment and people know that mistakes happen and that's part of everyday engineering life, and when a mistake happens, you're going to try to learn from it, not beat them up about it, then they'll be much more willing to step forward and say, "This thing happened. Let's get it fixed and let's figure out how to keep it from happening again." NOEL: Yeah, I feel like in software, this is particularly resonant in security or cases where there's a continual race. You can easily have security issues that are impossible to notice in advance but they can be fixed once they get found, as long as you're willing to be up front about them. I feel like there are whole classes of software problems that sort of fall into that category, that they're really only visible in hindsight, and you want to have a culture where you're not blaming people for not being able to foresee the future but instead, the team can come together. Blamelessness in retrospectives -- one of the key rules of most agile retrospectives is no personal blame, for exactly this kind of reason. NICKOLAS: One follow-on is there's this idea of hindsight bias where when something happens, our instinct is to look back at the entire chain of events and go, "How did they miss that?" NOEL: Everything looks suspicious once you know the ending is bad. NICKOLAS: Absolutely. It's human nature to think that if you were the person that had been in those shoes, that you could have seen it coming and you could've done something to prevent it. In reality, you probably couldn't because you're not taking into account the limited information that person had available to them at the time they were making decisions. NOEL: Right. One of the things about many systems disasters is they come to a point where people make what would in hindsight come off as bad decisions because they don't have critical information. Do you find that your studying these kinds of incidents leads you to make different decisions about how to surface information to a team on an ongoing basis? NICKOLAS: I don't know that it affects how I surface information to the team, as much as it affects how the team surfaces information to themselves. One of the things that you learn when you study these kinds of accidents is that a lot of them happen because people don't trust their instrumentation. You see an elevated error rate or you see exceptions coming in and you think they're one-offs and you just ignore them. In truth, those things are sometimes the first warning signs we have that something really bad is about to happen to our system. The more observability you bake in up front, the more ability you bake in to view logs and access information about the behavior and health of your system, the easier it is to respond in a crisis. Because you don't want to be figuring out how to use Splunk when your system is down. 
NOEL: Yeah, Splunk being a log analysis tool. I think that that also suggests that there's real value in working to remove noise from that kind of data. I am currently dealing with a system where there's an integration path that will just throw routine errors a bunch. The people in the team, they're kind of habituated to it and then you are explaining it to somebody and they're like, "How do you know which errors to ignore?" and that's a bad question to have to answer. Stuff should only get raised if it's important enough to be acted on. NICKOLAS: Absolutely. Charity Majors had a great thread on Twitter the other day about the overpaging of most engineering teams and it got into very much this: when you page people at all hours for things that are trivial and inconsequential, they'll miss the page that really matters. NOEL: Yeah. Table XI is a consulting firm. We continue to host and provide ongoing maintenance for many of our clients, dating back a while, and we don't have a huge on-call culture. For a long time, our on-call was handled by one person. Eventually it got to be too much and it started to transition to basically a rotating team of mostly all the senior engineers, and that was when a lot of the pruning of the alerts happened. Because when it was just one person and this person had all the domain knowledge of all of our infrastructure for all of our clients, then it was super easy to say, "Yeah, I don't need to do anything with that." But as soon as other people are getting beeped at two in the morning, we really rapidly put a lot of work into making those things a lot less chatty. NICKOLAS: Yeah, absolutely. That pain is a huge motivator to get that particular part of your house in order. NOEL: But at all different levels -- that's paging, that's logging, it's error tracking with Rollbar, it's dashboards that have alerts -- it's about focusing attention on the things that matter and not triggering the natural human tendency to ignore things that are consistently there, which, when applied to alarms, is bad. NICKOLAS: Yeah, absolutely. The same thing is true of the applications that we design and build. So many of the things that we build have some type of notification or alerting system built in. I work in healthcare so if we show too many alerts, then there's some pretty significant consequences to that. We work very hard to make sure that the information that we are elevating as an alert to our clinicians is actually actionable and valuable information for them. NOEL: There's a security alarm company making the rounds on podcast advertisements right now and one of their selling points is basically that they differentiate between things you don't have to deal with and things that you do. NICKOLAS: Interesting. NOEL: And I think it took a really long time for people to take the usability of this kind of alert seriously, as something that really could make a difference in incident response or crisis response, but I do think that a lot of people do take it seriously now. NICKOLAS: As our digital lives get noisier -- and work is certainly a part of that -- one of the things that we as a society seem to be increasingly focused on, and need to be more focused on, is what we're willing to allow to get our attention when it's focused elsewhere. NOEL: Yeah, that comes back to general phone alerts. My pager, which is now my phone -- if that's just one alert alongside 14 bajillion other social media things, that's a completely different kind of problem. 
In some ways, a problem that would have been almost impossible to foresee a few years ago. NICKOLAS: Yup, absolutely. I've got VictorOps set up with a special ring group on my phone so that it has a ring all unto itself. NOEL: Dan-daran-dan. Once upon a time, a very long time ago, I wound up having to change my normal cellphone's ringtone. This is pre-smartphone, but I was getting calls from a client at all hours and I developed such a stress reaction to the ringtone that I wound up having to change it, just because every time the phone rang, my heart would race. NICKOLAS: Oh, my goodness. That's terrible. NOEL: Yeah, it was bad. It was not a good project. A completely different kind of engineering failure. One of the things that is important in how engineering teams work together or don't work together is the stories they tell themselves. What do you think that successful teams do differently from unsuccessful teams in that respect? NICKOLAS: I think a lot of it comes down to communicating in stories in the first place. We deal with a lot of information in our teams and the temptation is to break it down into pre-digested bullet points, but it turns out that the human brain doesn't understand bullet points all that well. NOEL: PowerPoint is going to be really mad when they find that out. NICKOLAS: Oh, I know. There's an article that was making the rounds... I don't know, about a month or so ago, about Jeff Bezos and how he bans PowerPoint in executive meetings at Amazon. That's been a known fact for a long time but the new piece of information in this article was that it's replaced by a multi-page memo. If you're in a meeting with Jeff Bezos, you're probably going to be sitting there for about the first 30 minutes reading this memo. It replaces the time that the PowerPoint presentation would normally have occupied in the meeting. The reason that he does this is because he is convinced that people take in information much better that way than they do sitting through a 30-minute bullet point presentation. NOEL: What's the kind of story that a team might tell themselves to improve their communication or work? NICKOLAS: I think the story that often gets lost in engineering teams is how particular work is important to the business. It's easy to get in this modality where we're so over-processed that we're just grabbing cards off a backlog, completing them, committing them and then moving on to the next card. I think the engineering teams that are truly impactful are the ones that understand why the work they're doing matters to the business, what it's going to contribute to the business and then, once it's deployed, once it's been in production for a while, they get the follow-up story. They understand how that worked out, if it had the impact that it was expected to have. NOEL: An engineering team that has that kind of context can make better decisions. They can give the product team or the client what they actually need rather than what they say they need, if they have that context. Where the context is really valuable is in the case where resources are tight and the product owner or the client says the things that they want and the engineering team is much more empowered to say like, "Here is a way for you to get 80% of that at 10% of the cost," but I can only do that if I understand what the actual goal is, not just what the story says. NICKOLAS: Yeah, absolutely. 
I think one of the reasons that software engineers get into the profession in the first place is that it's a high leverage way to solve problems. One of the traits that seems to be common in engineers is sort of this good kind of laziness that looks for a faster way to get to an end result that doesn't cost as much. I think that's exactly what you're talking about. NOEL: Right but I can only do that if I know what the goal is. NICKOLAS: Right, and that's getting at which stories are beneficial. When you're talking to stakeholders, there's always this temptation for stakeholders to be prescriptive and give you solutions, so you have to work to get those stories out of them, to get like, "I hear that that's a potential solution. What is the problem that that would solve for you?" NOEL: Yeah. There's actually a question that I often find myself asking with the product owners that I deal with -- What is your goal? What problem does this solve? I know you want to use X technology but... NICKOLAS: That's blockchain right now. I want blockchain in my product pool. Why? What does that solve for you? NOEL: It blocks my chain, doesn't it? That's good, right? I don't know. I'm sheltered enough that I have yet to have somebody actually, legitimately ask me for blockchain on a project so I'm very excited about that. NICKOLAS: Blockchain is an interesting solution to a very, very narrow scope of problems and it is much narrower than the things that it is typically proposed to be used for. NOEL: Yes. I did hear that somebody was trying to use it for voting machines in West Virginia and that, I find terrifying. NICKOLAS: Absolutely. NOEL: Speaking of failure modes that you can anticipate -- or what's the opposite of an un-anticipatable failure? -- I think this ties together the discussion of context and the discussion of failure modes. To me, it ties together with the idea that having the right information and the right culture can help people make better decisions in the moment. Engineering, to some degree, is about problem solving under constraints, whether that constraint is, "We're in a crisis mode and we're trying to figure out what's going on," or whether that constraint is, "We only have $50,000 to solve this business problem." Having the culture of open communication and understanding of how to present context serves you well in both situations, I think. NICKOLAS: Absolutely. I've been a software manager for a while now and sort of how I view my job is building context and facilitating relationships so that the engineers I'm working with can make the best possible decisions. My job is never to make the technical decisions for the team because I'm not the one in the code. I'm not the one that has to live with those decisions. They have much better context from which to make informed choices about what technologies we use and how we use them. My job is to make sure they have the information that they need to make those decisions and they have contact with the right people to talk to, to get the information they need for those decisions. NOEL: Right, because the best kind of crisis management is the crisis you managed four months earlier so that it doesn't ever actually become a crisis. NICKOLAS: Absolutely right, just like the best code is no code. NOEL: Yeah. It seems like the habits that you build on your team in times of low stress are the things that serve you best in times of high stress, or cause you to prevent times of high stress because your habits have been so good up until then. 
NICKOLAS: Absolutely. There's this natural ebb and flow to the way software teams work, where there's times when you have lots of margin and lots of slack in the system and there's times where you don't have much margin and slack. In those times when you do have the breathing room to think about these things, the habits and the processes that you put in place are the ones that you'll use when you don't have time to think about them. NOEL: Yeah. I have this thing that I call the boring software manifesto sometimes, and this was partially caused by a manager I had very early on who really only thought things were happening if there was a crisis. You could only convince him that work was being done if everybody was running around like crazy. He was an exaggerated version of that but I do feel like there's a certain sense of that throughout the software industry. As you say, engineers like to solve problems and we like to solve complicated problems and sometimes, in the absence of complicated problems, we'll create complicated problems for ourselves. I had this idea that the best projects in some sense are the ones that are boring or the ones where we see these things coming and put processes in place at the beginning, so that we don't have a crisis mode. Obviously, you can't always foresee the things that are going to happen but again, I think it's putting in the work with the understanding of trying to create a smooth environment and quiet notifications and open communication and the lack of blame, and those habits continue to serve you when something goes wrong. NICKOLAS: Absolutely. To the idea of boring software engineering and boring technology, there's a famous blog post by Dan McKinley, who was the VP of Engineering at Etsy a while back, called Choose Boring Technology. He talks about this idea of innovation tokens and he says, "Each team really only gets two or three innovation tokens, so you have to pick the places where, in your stack, it really makes sense to bring in a new technology. But for most of your stack, it should be the boring tools that your team knows how to use really well, understands the ins and outs of and understands the weak spots of, so that they can work around them. But if you go and you build an entire stack around new technology, you're introducing a bunch of unknown unknowns." NOEL: Yeah, our normal practice around new technology is to take things out for test drives on internal projects or very small projects that have a relatively low risk and only then bring a new technology into something that has more time and effort behind it. NICKOLAS: It takes a while to build the operational competency to use a new technology well, to know how to employ it in the way that it's best suited to be employed and to know what it's going to be like when you put it in production. NOEL: I did remember -- there's another thing that I want to talk about, as I was talking about habits and things like that, and I think we kind of touched on it. It's the idea of how you behave personally when you have made a mistake. Whether it was a foreseeable mistake or an unforeseeable mistake, I feel like an important lesson to get across in client relations is how to behave when things go wrong because it is a moment where you can really build long term trust or destroy long term trust, depending on how you handle it. What are your models for how software teams should behave in the wake of having made a mistake? 
NICKOLAS: I think if you have done the work upfront to build trust with your stakeholders, the way that a software team ought to behave when a mistake is made is to be as transparent as possible, to share the mistake that happened, the consequences of that mistake and what's been learned from it. In my organizations, we write up a post-mortem, not a root cause analysis, because there is rarely a single root cause for anything. But we write up a document that focuses on the things that we have learned and the things that we're going to change going forward. But the document also has a full timeline of who did what and when and, to the extent that we can capture it, why those things happened. NOEL: I feel like this becomes very, very hard for a couple of different reasons. One of which is my own ego. I don't want to be vulnerable. Another one of which is in a consulting kind of position or even in a product kind of position, there's a career risk. I'm admitting a mistake and therefore, the client might rage quit or I might get fired or transferred to a permanent on-call hell or something like that. Also, there's the challenge in an engineering context of even being able to explain the mistake in a way that clearly delineates, to the non-technical people who need to know about it, what the scope of the mistake is -- one that doesn't undersell it but also doesn't oversell it. I feel like if you do that very well, then you can really build a long term loyalty out of those kinds of interactions. My experience has been that when you really go to somebody and say, "This thing happened," you're ahead of it. Especially if you see it before they do, before the product team does: "We did this thing. It was a mistake -- we made this mistake, or we had the best intentions, or we missed this one thing. We didn't understand that this interaction would happen the way that it happened. But in response, we are doing X to make sure this doesn't happen again and we're comping Y or whatever to make sure that you feel like you are seen and your needs continue to be met." That's the kind of thing that really can bring a team together or tear a team apart. NICKOLAS: Absolutely. One of the things that I try to do is I try to make sure that when our team has successes, I promote the team. When our team makes a mistake, I try to own as much of that mistake as I can, personally, as the leader of the team. I think that sort of gets to the trust building that you're talking about. NOEL: Yeah. I'm thinking about being at a restaurant. When a restaurant makes a mistake, there's a moment where the manager of the restaurant can determine whether you, A, tell everybody that you know to never go to that restaurant again or, B, become a customer for life, and it is all in that incident response. The stakes can be really high in that context. I agree that it is a valuable thing for senior people with organizational privilege or other kinds of privilege to take on blame for the teams that they are managing. I think that's often a good thing for a culture and it can help in that kind of communication. Is there a specific kind of communication or a specific example that you bear in mind when you are trying to talk about a mistake that you've made, or recovering from a failure or a crisis? Is there a crisis recovery, in either the literature or your experience, that stands out to you as being particularly strong? 
NICKOLAS: There's a great counter-example in the Three Mile Island story and the public relations debacle that followed that incident. The spokesperson for the company that ran the plant was not being completely transparent about the dangers to the public in the day or two after the accident and the Nuclear Regulatory Commission spokesperson was. Very quickly, the public and the media got the idea that they needed to be listening to the Nuclear Regulatory Commission spokesperson and not the spokesperson for the plant operator. NOEL: I feel like the gold standard that I can think of -- and it's not really an engineering failure -- was Johnson & Johnson in the wake of the Tylenol scare of the early-to-mid 80s, I'm trying to remember. Do you know what I'm talking about? NICKOLAS: I do, yeah. I don't know a lot of the facts around that story but I'm familiar. NOEL: I actually know a lot of the facts of the story because I was about the right age to be a trick-or-treater, and in Chicago, where this was happening at the time, but there was... the trick-or-treating part will come in a second. There was an incident -- I'm going to get the year wrong but it was in the early 80s -- where Tylenol capsules were tampered with in-store and caused some tragic deaths, and there was a tremendous scare about whether this was going to be extended more widely, and it was around Halloween, so basically, trick-or-treating was cancelled because you never knew who was doing what to what. I don't know. I never did quite get that part. But Johnson & Johnson, who is the manufacturer, had what is normally praised as a really strong PR response in terms of owning up to the things in their manufacturing process that might have contributed to making this possible and rather swiftly implementing changes that would make it much more difficult in the future, which eventually were adopted industry-wide. That's one of the things that I think is pointed to as a really strong response to a huge incident, but I'm trying to think of a comparable industry response to an engineering or software failure. NICKOLAS: One of the ones that comes to the top of my mind is GitLab, not too long ago -- I think they deleted their database. I'm trying to remember exactly what happened, but they essentially livestreamed their response to that. It was sort of the most extreme example of transparency that I've seen from an engineering team. NOEL: Nick, if people want to talk to you some more about engineering failure or anything else, where can they reach you? NICKOLAS: Twitter is probably the best place. I'm @NMeans on Twitter. NOEL: Great. I'm really glad that we got a successful talk about failure. I'm glad to have gotten a chance to talk to you and that will be it for Tech Done Right this time. Thanks. NICKOLAS: Awesome. Thanks for having me on the show. NOEL: Tech Done Right is a production of Table XI and is hosted by me, Noel Rappin. I'm @NoelRap on Twitter and Table XI is @TableXI. The podcast is edited by Mandy Moore. You can reach her on Twitter @TheRubyRep. Tech Done Right can be found at TechDoneRight.io or downloaded wherever you get your podcasts. You can send us feedback or ideas on Twitter @Tech_Done_Right. Table XI is a UX design and software development company in Chicago, with a 15-year history of building websites, mobile applications and custom digital experiences for everyone from startups to storied brands. Find us at TableXI.com where you can learn more about working with us or working for us. 
As I record this, we have a job opening online for a project strategist. You can find that at TableXI.com/Careers and we'll be back in a couple of weeks with the next episode of Tech Done Right.