MIKE: Hello, and welcome to another episode of the Acima Development Podcast. I'm Mike, and I'm hosting again today. We've got a good crew here today, and I'm excited about this one. We've got Kyle Archer, Eddy Lopez. We've got Dave Brady. Hello, Justin Ellis, Thomas Wilcox. We've got Ramses Bateman, and Will Archer. So, I think we've all been here before multiple times [chuckles]. We've got a familiar crew to talk about an important topic that's always fresh because [chuckles] there's a constant need. I was racking my brain what story to tell for this, and I ended up going back to...I don't even remember exactly when it was, but it was somewhere in my late teens, early twenties, in that era. So, admission, that's quite a long time ago [laughs]. That's more than halfway back [laughs]. And I was helping out at my parents' house with some remodeling they were doing. They were tearing out the...they were redoing the bathroom. And so, they were tearing out...they had a wall that had some tile on it, and they were tearing out the tile. And they were going to put some new...I don't even remember. They shifted things around, but they were tearing out the tile. That's the important part. And I had my little brother with me nearby. He was too young to really help. He was, like, six. And, you know, he was just hanging out and chatting with me, and I was taking a...they call it a dead blow hammer. It's a hammer with sand in it, so when you hit, it just stops. So, it's a weighted hammer, but it has a soft landing, so it doesn't have a...it doesn't bounce back, right? It just kind of stops, rather than having a strong bounce. It's good for situations where you want to do that, right, where you don't...you really don't want it bouncing back and hitting you in the face. And I was breaking up the tile wall. Context, there I am with, like, a six-year-old breaking up a tile wall. And there was some wire mesh behind it, and I was gradually peeling back. As I broke it, I was peeling back this wire mesh that was embedded in some sort of mortar. And I was pulling out [inaudible 02:26] the cement behind the tile. And so, as I'm banging, I pull back a piece, you know, pull it back because I'm making some progress, and I swing in. And because that broken tile is now hanging out and mounted on that wire behind, with the, you know, the cement that's holding it together, when I swing with that hammer at full force, right after peeling, you know, an extra layer back, I sunk my knuckle of my pinky finger right into a piece of broken tile. And I go, oh, and I look down. And I look down into my knuckle, maybe five eighths of an inch, a couple of centimeters more than you should be looking down into a knuckle [laughs]. Oh [laughs], that moment, that's not good. And then the blood starts, right? A rather remarkable amount of blood, I'll say [laughs], was coming out of the finger. Remember, there's a six-year-old here in the room with me. And he yells, "Mom, dad, come help Mike. He's really hurt bad." And, of course, they're thinking the worst. I'm like, "No, no, no, no, no, it's okay [laughs]," yelling. But, you know, there's the moment of panic there. And so, I had some choices in that moment, right? What do I do? Luckily, I think I handled it pretty well. I comforted the people around me to let them know this isn't a disaster. I'm going to need to do something, but you don't need to, you know, call 911. Unfortunately...so, we got everything up, went to one of those urgent care places. They stitched me up. 
I could tell some other weird stories about it there. A few weeks later, I noticed a little white mark on my finger, and I started pulling, and it was a piece of the thread from the gauze that had somehow got stuck in my finger. And I pulled out, like, a foot [laughs] of this string out of my finger, and then it snapped down near the bottom, and some of it zipped back in. I've never seen it again, like, oooh [vocalization] [laughs]. And I still, when I touch my knuckle, I feel weird sensations all the way down the rest of my finger. It's a [inaudible 04:21] impact of that one. But my poor little brother [chuckles], he got sick from seeing it, and he was throwing up and just not okay. And I felt bad, and I had to comfort him, "This is really okay. I get some stitches, and it'll be fine [chuckles]. It will be fine." And [chuckles] I felt really bad because I was not really even thinking about it. I didn't realize that he was not okay. So, when I discovered before I left, like, 10 minutes later, he wasn't okay, you know, I gave him a hug, you know, tried to help him feel like things were okay, get a ride over to the urgent care facility. They stitched me up, and I'm fine. Today, we're going to talk about dealing with production incidents. And I bring up this example because it's outside of software, but it's a production incident, right? You've got the bad things happen, and what do you do? What do you do now? And I think that there's some aspects to that story we can riff on as well as others. But it helps set the stage for a lot of what happens when we have these production incidents and what we do in that moment because it matters a lot. And how some of the reactions, you know, there's a variety of reactions to this moment among the various parties in place that had some better, some worse, you know, impact. So, servers are down, you know, how do you keep cool? Things are on fire. And that's our topic today. And I've got definitely some thoughts on this. I've written down some notes, but, as usual, I don't want to...I've told the story, right? I've laid out the context. So, I am really hoping some of you all will have some initial thoughts to lead out with. EDDY: Sorry, is the answer not ask AI to see what's wrong with your server [inaudible 06:02]? MIKE: [laughs] DAVE: How do you think the server went down? EDDY: I was thinking, is that not the go-to answer now? I'm sorry, podcast over. Ask the LLM. [laughter]. WILL: Not not the answer. DAVE: The AI is going to say, "You are absolutely right to be upset that the server is down." JUSTIN: So, related to that -- WILL: I mean, I'm just saying that's not not the answer. Like, AI is great at reading a log. Like, it took me -- DAVE: Yeah, actually. WILL: Years, if not decades, to get, like, pretty decent at reading log vomit, you know what I mean, like, filtering through the chicken innards that [laughter], you know, a log will, like, throw up all over you and just be like, "Oh yeah, that's actually it." AI is actually super duper at that. I don't trust it, especially in an emergency but, like, do that. Sure. Yes. Do it. EDDY: I was literally pairing with someone, and we were looking at a Grafana log, right? And I'm like, "Oh, it's because of this." And they're like, "Where? Where is that?" And I'm like, "Oh, I read it somewhere here. Hold on, let me find it again." And, like, you get so good at ignoring all the clutter, you know, and just filtering everything. But, oh my God, dude, like, AI can sift through, like, raw JSON, like candy. 
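A minimal sketch of the log-triage idea Will and Eddy are describing, assuming the OpenAI Python client is available; the model name and prompt are placeholders for whatever your team actually uses, and the output is a lead to verify, not an answer to trust:

```python
# Sketch: ask an LLM to triage a raw log excerpt during an incident.
# Assumes the official OpenAI Python client; swap in whichever LLM client
# your team actually uses. The model name below is a placeholder.
from openai import OpenAI

def triage_log_excerpt(log_text: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are helping triage a production incident. "
                        "Given raw application logs, list the most likely "
                        "root causes and quote the lines that support them."},
            {"role": "user", "content": log_text[-20000:]},  # most recent chunk only
        ],
    )
    return response.choices[0].message.content

# Usage: paste the noisy Grafana/JSON excerpt in, read the summary, then go
# verify the cited lines yourself before acting on them.
```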
DAVE: I have a thought to throw out. I have a bunch. I always do. But one of the things that...and this is not really a production thing, well, maybe it is: loyalty. The thing that makes somebody loyal, a customer, in particular, is you get this graph of, like, did they have a good time, or did they have a bad time? And then did they receive good support, or did they receive bad support? And the most vehement haters of any product are the people who had a bad time and got bad support, right? Just got told, "You go away, not our problem." We've all had examples of this. The most loyal customers, this is interesting, are not the ones who had a good experience with good support. They're the ones who had a bad experience and had fantastic support. These are the rabidly loyal fans. Imagine you've got a car, and you blow a tire on the road, okay? And you call AAA, and they're like, "We're busy. Go away." You're like, "I'm canceling my AAA membership immediately," right? You buy new tires at Big O. You drive along. You're great. You never have a problem with it. Okay, they're tires. They're supposed to be tires. I expect them to be tires. Now you're driving down the road. You blow a tire, and by the time you've hung up the phone, two tow trucks have arrived, one of them with a spare tire and a change and a mechanic, and the other one's ready to tow your car if the tire change won't work. They take care of your tire. They replace it. They get you back on the road in 5 minutes, plus a $10 coupon to, you know, to Chili's or whatever, for, you know, "We apologize for the impact on your time." Would you ever buy another brand of tire? I wouldn't, not in a minute. So, what does this have to do with production incidents? This is the story I tell myself in my head of I want to be that guy when my code breaks. I want to be the guy that absolutely had no ego about, you know, how the server went down. I'll talk story on myself here a little bit. We had an outage about a month ago. I'm very, very proud of the fact that I had gone...I've been here for five years. I've never taken out prod. I'm a very cautious engineer, and I'm kind of proud of that. And prod went down about a month ago, and, man, then there was, like, a five-hour incident call because stuff was going on and things were...oh my gosh. What are we going to do? And I joined in the call. And I'm spearheading. I'm like, "Well, it could be this. It could be..." and I'm, like, reaching, well, I might have screwed this up. It could be this other...oh, man, I didn't consider this thing. Let me go test that. And I basically was Johnny on the spot. With any resource you needed, I will tear apart my own pull request and anything in it. I don't care. I'm not here to be proud to be the best engineer. I know the server's down. I care that the server is back up, and I want everyone in the room to know that Dave was the guy who showed up with two tow trucks, a change of tires, and a $10 gift card to Chili's. And then when it turned out that the server went down three minutes before my deploy, and everyone went, "It can't be Dave's deploy," it went from, "Wow, Dave is really carrying this," to, "Holy crap, Dave is carrying this, and he didn't have to." And Andy gave me a pat on the head at architecture for really showing up and driving the ball on that, and that's how you turn an absolute crisis into a huge opportunity. What people remember is what you were like when things went bad. 
How you behave when things are good is a terrible predictor of how you will behave when things go bad. And how you behave when things are bad is the best predictor of long-term relationship success. And can I trust you, and do I want you around forever? So, that's my inspiring speech about that. I'm not trying to blow my own horn, because, I mean, obviously, my...I deployed something, and things went down and could have been me. But it's who you are when it goes bad that people remember. MIKE: You know, you talked about how you respond. In my initial story, I mentioned, you know, a few parties here. You got the little kids. JUSTIN: Mike, are we just going to let David, like, drink out of a beaker here [laughter]? DAVE: It's not a beaker. It's an Erlenmeyer flask [laughter]. I do do mad science. JUSTIN: What kind of a, you know, show you got going on there [laughter]? DAVE: For those of you listening at home, which I guess is everybody because we don't actually publish the videos, I have a magnetic stirrer. You got to [inaudible 11:31] tell the story. I'll tell everybody. I have a magnetic stirrer. I bought it for resin and, you know, paint and stuff like that. And every once in a while, I thought, you know, I could mix, you know, my Kool-Aid, or I could mix, you know, my Liquid I.V., or my LMNT. I could mix that in it. But if you put it in a regular cup, it splashes it everywhere. And I'm like, I might as well just buy the stupid lab equipment that goes with the stupid stirrer [laughter]. And so, yes, I do have this. Now [laughs], this does absolutely nothing to excuse the fact that this is root beer with hot sauce in it. I'm not kidding. I am a monster. I have a reputation to live up to. So, there you go. WILL: Don't drink out of the resin beaker, man [laughter]. DAVE: You're not my real dad. WILL: Do you want microplastics? That's how you get microplastics [laughter]. You get macroplastics [laughter]. DAVE: Exactly. These are culinary only. WILL: [inaudible 12:22] army man. DAVE: Yeah, these are culinary only. These are my portable flasks [laughter]. JUSTIN: [inaudible 12:27] you keep the labels correctly on those [laughter]. DAVE: Oh, jeez. I'll switch to this one. EDDY: I mean, how many of us actually drink from a plastic water bottle, you know what I mean? You'll [inaudible 12:39] way. It's inevitable. MIKE: Honestly, I drink out of, like, a mason jar a lot. It's glass. It's not going to give you the microplastics. It looks funny [laughs], yep. But -- JUSTIN: Mike, back to you. I was just very -- [laughter] MIKE: So, aside completed, segueing back...the responses. So, the response of somebody who was overwhelmed by the situation and just went and started vomiting. He couldn't control that, right? Like, that was a reaction that was completely outside of his...out of his voluntary control, and that's fine. You should, you know, you're in a situation where millions of dollars are on the line. You're not okay, bow out. And I think that that's the responsible thing to do. If you find yourself in that situation, delegate to somebody who's got a cool head and do that because that's, like, the first note that I wrote down. If you can't maintain focus and be like, okay, that's okay because you can't help it, like, there's not shame in that, but there is shame in not admitting it, right? You know, pretending that you're okay. Because, under stress, sometimes we have unexpected reactions. Usually, you're not the only one, right? You're part of a team. Bring the team in. Give it to somebody else. 
But having that cool head, I think, matters tremendously because you've got some important decisions to make, and the order you make those decisions in matters a lot. I would argue that, you know, the next...you probably got three things you've got to do. You can always...I wrote down five, but the first thing that you do matters a lot because, a lot of times, people say, "Oh, wow, things are broken. What went wrong?" And then they'll spend the next six hours trying to figure out what went wrong when the servers are down and your business is losing money [laughs]. DAVE: Yeah, we don't care what's wrong. We care about the servers. Yeah, give me cash flow. MIKE: Exactly. DAVE: Stop the bleeding then take the bullet out. Yes. MIKE: Bingo. And I was thinking, literally, that's what made me think of my incident [chuckles] back in my youth because, literally, I had to stop the bleeding. Nothing else really mattered, right? I put direct pressure on that. I went, and I got the stitches. And they asked me. I remember that, like, "Do you have feeling in your finger? Do you think it severed a nerve?" I didn't actually realize that I had at the time [laughs], but that didn't matter as much as, you know, let's get rid of this gaping hole in this guy's hand. That matters a lot. Stopping the bleeding should go first. Go ahead. JUSTIN: Yeah, and when you talk stopping the bleeding, I think a lot of this is, like, in the prep work that you do. And 9 times out of 10, for production releases, for me, if you do a production release and something goes bad, you've got to have that back-out plan ready to go. And whatever that is, hopefully, you're doing installs multiple times a day, and your back-out plan is just hitting a button, you know, just getting back to normal, which was, you know, whatever it was before you did that deploy. And, you know, if you have that up and running, that's a sign, I think, of a really mature business. It's like, hey, I can go into prod, and if something breaks, I can back out of prod within 30 seconds," and life goes on. And then you, like you said, then you could figure out what...dig out the bullet. WILL: Right. Well, yeah, I mean, but it's always, you know, I don't know. I mean, I'm always hesitant to, like, hop in the Wayback Machine, right? Because, like, if we're going to be like, all right, step one is go back in time and make sure that you can claw back that deploy [laughter], no. Step one is, like, don't write the bug in the first place. I mean, you know -- DAVE: I actually call this the time machine problem. WILL: If I'm [inaudible 16:19] I'll fix it all the way [laughs]. DAVE: Because everyone's solution is, well, don't do that again. Well, don't do that. I'm like, well [laughter], where were you an hour ago? MIKE: Well, it's also tricky if you're deploying an app. So, Will, you're working with mobile apps, right? -- WILL: Oh yeah, oh yeah. Like -- MIKE: You don't get to go to the App Store and say, "No, I didn't mean that. You downloaded that to somebody's phone. Please bring it back." That's not on your list of options. WILL: You can get done wrong. You can get done real, real nasty if you bungle a mobile app. I think it's only happened to me maybe one time in my career, where you get the dreaded crash loop, where your state in the app is corrupted, and it's not fixed with a hard reboot, right? Where, like, your state has gotten corrupted. 
And it didn't happen to everybody, but there was an edge case where we had some people crash looping, and, like, that app's got to get smoked, like, you got to pull it off your phone. You can get burned super bad, to a degree, that is. EDDY: What's the rollback strategy in a mobile environment, right? Like, because you have to follow certain standards, you know, in the marketplace, right? Whether that's Play Store or the App Store, right? Like, if I remember correctly, they have, like, certain criteria and waves that you can release updates to your application, and they've got to approve that every single time, right? So, if something leaks, right, in that deploy, like, do they have, like, a fallback where you can be like, oh, crap, it's not working; let me just deploy the previous version on the application? Like, how -- WILL: Well, it depends, you know, there's rules for some people, and there's rules for other people. So, I started out as a very, very small fish in the App Store pond, a minnow. And you don't get nothing, like, they'll review it when they review it, you know what I mean? And you can beg, and you can grovel, and maybe they'll get to it, maybe in a day or two, or whatever. But, like, there's just a lot of minnows in the store, and, you know, the dog is always eating their homework, so you know what I mean? Like, you just...they'll get you when they get you, right? Android turned things around, has historically turned things around pretty quick, because I don't think they have a lot of, like, human beings looking at it. Android, you know what I mean, you can really usually get it down same day. But, like, you know, App Store, it could be days, you know what I mean? We're talking, like, you know, three to five business days. But, you know, I got into a bigger fish, you know, maybe, like, a trout, you know what I mean? And I had a number. You don't call that number very often, you know what I mean? But you can call the number, and there's a person, you know, at Apple Corporate, and you could grovel. You could grovel to a person, versus, like, just, like, groveling to this email where it's just like, I don't, you know. And now, you know, and now I work for some pretty big dogs and people you know. And, like, I can grovel internally to the VP who could talk to, you know, another VP, and they can make things happen, you know. And all my lickings happen, like, you know, in-house. And it'll just be like, "Hello. I'm the SVP of technology. And let's talk about how you shit the bed, Will [laughs]." You know, which is, you know, I mean, like, I don't know. I mean, like, if it has to be that way, it has to be that way. But things have evolved, right? Like, I'm not just some sort of, like, cowboy. And when you're working with, like, sort of big money and big engineering staffs, everything you do is feature flagged, right? So, like, you have a, you know, a live dynamic CMS, and anything I put out, anything I put out ever, you know, I've got an off switch. You just have to have that. That's, you know what I mean, like, at this scale, you've got to have a panic button. And there was also, like, you know, the app deployment infrastructure has evolved rather significantly since I've been doing mobile apps, in that, like, you're not blasting it out to 100% of your customer base. That's crazy. Like, that's psychopath work. You roll it out to, like, 1%. Let's see how it does. Let's let it simmer for a little while, right? So, it's good and bad, right? 
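A minimal sketch of the kill switch and percentage rollout Will describes, assuming a remote-config endpoint the app polls on launch; the URL, flag names, and percentages are hypothetical stand-ins for whatever remote-config/CMS service is actually in place:

```python
# Sketch: client-side feature flag with a remote kill switch and a
# percentage rollout. The config URL and flag names are hypothetical;
# a real app would call its remote-config/CMS service here.
import hashlib
import json
from urllib.request import urlopen

CONFIG_URL = "https://config.example.com/flags.json"  # hypothetical endpoint

def fetch_flags() -> dict:
    try:
        with urlopen(CONFIG_URL, timeout=2) as resp:
            return json.load(resp)
    except Exception:
        # If config can't be fetched, fail safe: new features stay off.
        return {}

def is_enabled(flags: dict, flag_name: str, user_id: str) -> bool:
    flag = flags.get(flag_name, {})
    if not flag.get("enabled", False):        # the panic button: flip to False server-side
        return False
    rollout = flag.get("rollout_percent", 0)  # e.g., 1 means "1% of users"
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout

# Usage: the new code path runs only when the flag is on AND the user falls
# inside the rollout bucket, so a bad release can be switched off remotely
# without waiting on an app-store review.
flags = fetch_flags()
if is_enabled(flags, "new_checkout_flow", user_id="user-123"):
    pass  # new behavior
else:
    pass  # old, known-good behavior
```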
But, you know, there are best practices which, you know, to a web development shop might seem, you know, kind of primitive and anxiety-panic-inducing, which there are, right? I mean, because you've got to remember, like, if you're on a mobile app, you're running on somebody else's server, right? Like, it's their hardware. It's their machine. They could do anything. Anything. DAVE: Including nothing. Including nothing when it goes down. WILL: Anything. Yeah. You're out of hard disk, baby. Sorry, no more hard disk for you. Oh, you got a little greedy with the RAM. We're pulling your card. MIKE: [laughs] WILL: Sorry, no no, you know. Like, hey, like, oh, you had the network. You had the network, huh? That's cool. That's cool. But I'm going in a tunnel now [laughter], you know. Like, there are levels to the game. And, like, when, you know, like, your app, you know, your distributed application, you know, is in no way a guaranteed stable internet connection, no, no, no, no, no. No. Nobody's even pretending that that's the case. And things can get really difficult, and getting accurate telemetry can be very, very difficult, you know. Because there are certain crashes where you're just done. You're done now. You're finished. The operating system is stepping in. Daddy's home, and everybody's going to their room right now. So, those can get more difficult. But again, you know what I mean, because, like, you know, there are bigger dogs. You know, there are a lot of really delightful, you know, third-party mobile app telemetry gathering solutions. They'll give you screenshots now. It was great. It's so cool. I could be like, "Oh, it crashed," and I could just be like, "Oh, what are the, like, last, you know, few things that they have done in the app?" And I'm just like, oh. You know, where have you been all my life? MIKE: [laughs] WILL: Sorry. Thank you for coming to my TED Talk. DAVE: No, all good. I have season tickets. MIKE: You did talk about several things, though, that goes back to what we talked about a minute ago, or this ongoing conversation. What you do ahead of time matters a great deal. You say you don't push out changes that go live. What? Are you mad? You say, you know, push out changes that are behind a feature flag, and then the rollout is independent. A rollout of the feature is independent of rollout of the app, right? So, you've changed the cycle so that you actually do control the rollout. Or, as was said, when you actually have a web app, you have the ability to roll back. You press the button, "Oh, wait, yeah, now it's back." Problem solved. That prep work ahead of time goes a long way to making things right. Now, let's say things have gone wrong anyway, right? You've got unexpected traffic that's 10x your normal level, and now you've got a database query that's unhappy. There's no rollback, right [chuckles]? You've got live traffic, and you probably want to be doing something with that 10x traffic, right? You probably want to be making some money. What do you do? JUSTIN: That's where prep work comes in again, horizontal scaling. Well, unless it's hitting the only copy of your database, then you've got to do more. EDDY: It should probably stem from writing an ORM query versus just a raw query. Just saying, there's a lot of magic that happens when you write ORMs under the hood. MIKE: Oh, and it's always the database. It's always the database [laughter]. There is maybe sometimes it isn't, but, yeah, it always is [laughs]. It's something you've done with the database. 
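When it really is the database, the usual culprit is a hot query with no index behind it. A minimal sketch of confirming and fixing that under live traffic, assuming PostgreSQL via psycopg2; the table and column names are hypothetical:

```python
# Sketch: confirm the sequential scan, then add the missing index without
# blocking writes. Assumes PostgreSQL + psycopg2; "orders"/"customer_id"
# are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=app")  # connection string is a placeholder
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
cur = conn.cursor()

# 1. Look at the plan for the slow query -- a "Seq Scan" on a big table
#    under 10x traffic is the usual smoking gun.
cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = %s", (42,))
for (line,) in cur.fetchall():
    print(line)

# 2. Build the index without taking an exclusive lock on the table,
#    so live traffic keeps flowing while it builds.
cur.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id "
            "ON orders (customer_id)")
```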
You're missing an index. You've done something that you could undo with the database, but now you're in a bad spot, right? You're in the bad spot. We talked about stopping the bleeding. You get in the call, a bunch of people upset. You've got three or four business stakeholders who are in the call asking you for a status update. You don't even know what's wrong yet, but you know the app is down, and it's all on you. Step one, what do you do? EDDY: Roll back, unless it's a database. MIKE: There's been no deploy. Things are down. What do you do? WILL: What changed? Something changed. DAVE: You just answered some first questions. WILL: We were happy, right? And then we became unhappy, right? So, what is the delta? What is the delta between happy and not happy, right? Like, could be just a lot of traffic, right? That's okay. Like, I went from happy to very happy to very unhappy, right? It could be a deployment, right? Dave was talking about the deployments, like, "Okay, I changed this thing," right? Okay, that's an issue, right? I mean, and so, like, identifying the last time that you saw the sunlight, that you felt human joy, you know, okay, well, there we go. And then you just sort of, like, narrow that delta down to, like, "Okay, it was here, and then it was here." All right, now you've got a stew going. JUSTIN: So, you're talking a lot about, you know, identifying this stuff. It goes back to, again, planning and making sure you have appropriate monitors in place such that you can go look at those logs and you can have that dig-in ability, and something other than just, "Oh, prod is down." It's like, where are my alerts? You know, I should be able to go into the logs and say, "Oh, the traffic is hitting the firewall here, and it's hitting the VPC, and then it's hitting, you know, the application, and then it's hitting the database." You know, is that traffic consistent all the way down the thing? And can I see all that in the logs? DAVE: How is the system down, right? Are you CPU-bound? Are you disk-bound? Are you network-bound? Are you hung? Yeah. MIKE: Notice that we're talking about going and looking at our metrics to see what's wrong, not going and doing a deep, like, root cause analysis necessarily, like, what's hurting here? DAVE: Right. This is symptoms and triage at this point. MIKE: Yeah, exactly. DAVE: Don't prescribe until you've diagnosed. MIKE: And that's the triage, exactly. And as mentioned repeatedly, you go to your data; you pull up your dashboards, right? Whatever you've got, that's where you have to go get some visibility into that. Whatever you've done to observe, that's the first place you look, like an instinct [chuckles]. JUSTIN: Actually, the first thing I usually do is I go hit it myself on the browser if it's down [laughter]. DAVE: For real. For real. MIKE: Verify. DAVE: Works on my machine is a valid bit of data. I mean, it's a terrible excuse, but, like, it is actually up from here. Okay. Are you on the VPN? Are you? Yeah. MIKE: Absolutely. JUSTIN: That's really what I do first is [laughs], like, "Oh, I can [inaudible 27:58] [laughs]." DAVE: "Confirm the bug." EDDY: "Wait, it's broken? Hold on. I don't believe you. Let me go to the website and see if I can replicate your problem [laughs]." DAVE: I had a support call. I worked for Jostens Learning. They were, like, an e-learning thing back in the '90s. And so, we would go in, and we would string Ethernet like radio, like RF cable, a 10BASE-T cable, if you remember that, like, coax off the back of these things. 
And the students would...for, like, middle schools, they would kick the plugs. They would kick the routers. And some of the students figured out that if they kicked the plug, they didn't have to study that day. So, they started getting...and the teachers got real good about going in and reconnecting the plug and saying, "Do your darn lessons," right? And we had one server that just...they came in on a Monday and nothing. Like, it just came up to, like, an "operating system not found" message. And I'm like, oh my, and so I did everything over the phone that I could possibly think of. I finally had to dispatch an engineer to the site. Engineer walked in, looked at the server, reached down, and ejected the floppy disk that somebody had plugged into the computer so that they could play Doom on the LAN over the weekend, and forgot to pop the disk out. And I got a lambasting from the engineer of, "Check the A drive next time that the computer won't boot, if it's booting to the wrong operating, you know, to the wrong disk." But everybody else's system was working, so it wasn't...I knew it wasn't on our side. But yeah, this turned out it was just the one server. No other servers in the building were affected because that was the one that Jose had decided was going to be the Doom server. EDDY: Would it be valid to say, "Grow callus, and then you won't feel it anymore," as a valid response to being cool during a fire? I don't necessarily quantify that as a valid...I don't want you to grow callous on the fact that you've broken it so many times that you don't feel it anymore. DAVE: Right. You're not wrong, though. EDDY: Yeah, exactly. It's sort of like [inaudible 29:55] under the pressure after you've done it so many times kind of grows numb a little bit, right? Like -- DAVE: I had a manager teach us how to get calluses instantly. It was fantastic. Servers were down. We were losing money. And the president of our unit walked in. And we were running around like chickens with their heads cut off, right? And he walks in, and he goes, "All right, we knew this was going to happen." And we went, "Hey, you're right. You're right. We knew this could happen, okay." And all he did was just normalize it. It's not the end of the world. This is a thing that can happen. Let's take this back into the catastrophic level. There's a thing that they tell 747 pilots. "In an emergency, wind your watch." If you're at 30,000 feet and you blow all 4 engines, they just stop for no reason, and you don't know why, you've got 20 minutes before you die. And in that 20 minutes, you have to find the right solution. I mean, you have to find the right solution. But there's a million things that it could be. Now you've got checklists that you can work. But they basically say the first thing you need to do is stay calm. Machines break. So, when you're at 30,000 feet, and all 4 engines stop for no reason, it's not for no reason. It's because it's a machine, and something has gone wrong. We knew this could happen. This is normal. It's not great; it's not ideal, but it's not supernatural. It's not lightning bolts from the sky. And that gets you into a resourceful mindset so that when the answer goes right by out of the side of your vision, you're not tunnel visioned on, my next attempt at the...oh, oh, oh, oh. It's that, it's that -- WILL: Yeah. You know, I would add on to that, like, does anybody in this call know of anybody who shipped a prod bug, screwed something up, and they lost their job? 
Can you think of somebody that that has happened to? We have decades of experience here, right? DAVE: One time. WILL: Because, for me, nobody, nobody. I can't think of a single one. DAVE: One time. And it'll be real clear that it wasn't the prod bug. It was...we have a thing, when we ship code here at Acima, you have to have reviewers review your code. And I introduced it at architecture, a couple of weeks back, that, you know, at CoverMyMeds, we called this "sticking your head in the noose with the developer." And you had to have a review from an associate, you know, a coworker, and you had to have a review from an engineering manager. And the engineering manager rubber-stamped a review. I'm going to say it was his own code, rubber-stamped it, shipped it 4:00 o'clock on a Friday, took out the fax machines, and he went home and didn't come back and check. And we were down all weekend. This was 10, 15 years ago. We didn't have any observability. We didn't know the fax machines were down, but it was his job to know that it was down. So, he did not get fired for taking out prod. He could have taken out the whole fax bank if he had just checked his work, or if somebody else had reviewed it, or if he had just turned around and fixed it. He got fired for criminal neglect, you know what I mean? Gross neglect, gross negligence. My definition of gross negligence is: if we fired you and replaced you with nobody, we'd be better off. That's gross negligence. That's what he did. He didn't get fired for taking out prod. WILL: I mean, so it's just something to, you know, if you happen to be, like, sort of like a [inaudible 33:23] developer, right? DAVE: I see your point. You're not going to get fired. Yeah. WILL: We've got a literal lifetime of, you know, dev experience. And if I'm wrong, just, you know, open your mouth and say, like, no, you're not here, but this is, like, a lifetime of experience. We don't know anybody who got fired for taking out prod. And I don't know if there's anybody on this call, you know, at a senior level who hasn't shipped a prod bug before. EDDY: Okay. Can you define the parameters on what you mean by taking down prod? Our gateway for API traffic is completely haywire, kind of thing? Or are you talking about, like, oh, our hosted AWS server -- MIKE: I'll tell you my first one. I had been there a few months, and I was asked to restart the service. I ran the wrong script and turned off the server. This is back when your server was in a physical data center, and the only way to get that thing back on was to drive to the data center and turn that server back on. And I turned it off. So, when I say down, I mean it was off [laughs]. And my manager said, "What did you do [chuckles]?" And then we figured it out, and we fixed it, and nobody was fired [chuckles]. DAVE: I don't have a black and white definition for taking out prod, Eddy. But as a sliding gray scale, the more money the company is not making, the more your taking out prod was. And related to the nobody getting fired, I once heard a CEO say to someone, this was, like, 20 years ago, somebody wiped out the system and came in, resignation letter written, hat in hand, hangdog expression. And the CEO said, "I just paid $12 million to train you. Why the heck would I fire you?" And I tell you what, he was the most diligent engineer after that. He'd gone through a $12 million training. WILL: That isn't to say, like, you know, like, YOLO, send it, right [laughter]? But just like -- DAVE: Yeah, that's the guy that got fired, yeah. 
If you're gambling with...if you lose $12 million, you're not going to get fired. You're going to get fired for gambling with $12 million of not your money. KYLE: I've always looked at whether or not prod is down as whether or not you're affecting your five nines. If it's something that you can report on for your SLA, then you've successfully taken prod down. WILL: Yeah, yeah. This week, and I'm still a little bit salty about it, and it'll be, you know what I mean, it'll be fine. But I had a thing where there's some analytics telemetry stuff in the code review process. I had to refactor it, like, three times for no reason at all. People wanted it, oh, what would it look like if the couch was over there? What would it look like if the couch [chuckles] was on the ceiling? What would it look like if the couch was on the front porch? And I'm like, okay, man, all right, you know, whatever. And so, I moved it three times. In the course of that, I missed some telemetry. There's some telemetry on, like, campaign reporting that isn't going to get out until the next release. And, I don't know, in my mind, that's a prod bug, you know, because, like, they're not going to know which campaign for, like, you know, two weeks. I'm really grumpy about that. I'll probably be over it by Monday. DAVE: You've heard the rule "Fail early, fail loud," right? It's just observability from the other end of it. It's like, if something's down, I want to know. I've had two times in my career when the CEO found the bug before anybody in QA or engineering or anyone. And it's awful when that happens. EDDY: I do want to backpedal to what Will said. You probably had that in mind when you first started, but you probably did it, like, three different times, three different iterations. You were so far in with refactoring that you probably forgot by the end, right? And I think that was more of a symptom, you know, of the work of the refactor. DAVE: And if I was two levels up, I would want to know who made you change it three times and why because those aren't free. There's clearly not free. I'm not a machine. WILL: It was my fault. It was my fault. It was my fault, like, I did it. DAVE: And you fixed it, right? WILL: Yeah, I did it. I fixed it. It was, like, a 10-second thing. It was just, I don't know. Anyway. DAVE: So, as a CEO, I'd be like, who made Will change this three times? Because if you make him roll enough times, he's going to roll in that one eventually. WILL: Yeah. Anyway, anyway, you know, it's fine. It's fine. Like, somebody else took down the dev server for, like, a 24-hour period, like, the very next day. So, if anybody is looking for somebody to, like, grump at -- DAVE: Yeah, you don't have to outrun the bear. WILL: It was me only very briefly [laughs]. JUSTIN: So, you guys chatted about, like, you know, moving on to what you do. You fix it immediately, right, and then you dig out the bullet. You know, digging out the bullet is kind of like the postmortem, and kind of mature organizations have a postmortem process. And that's always interesting. That's where you truly find out where your policies and your processes are lacking because, you know, you shouldn't have shipped the bug. Something caused that bug. When I brought down prod, I was lucky because it was after hours. Otherwise, somebody may have been fired. But the postmortem was painful, but nobody felt terrible about it because it was like, it was my fault. 
It was like a string change, and the string was...I had changed it to...the string was supposed to be "production," and I had changed it to "pro," so "pro" versus "production." And we found out the reason why I changed it to "pro" was because in all of our other environments, it was "uat" or "dev." And I was like, oh, that's convention. We just use the three-letter word, but no, "production" was the whole word spelled out. And this was when I was working at Fidelity. We'd done the install. It went out. We got calls almost right away. And, luckily, we'd done the install at, like, 4:00 o'clock in the afternoon, so the trading day was over. But it was, you know, the conversation with my boss the next day was just, like, sweating bullets and everything. But it was just like, you know, like you guys said, it was like, oh, as long as you learn from this and don't assume. That was the extent of the postmortem. In other places -- EDDY: Also, like, I think that speaks volumes to, like, the brittleness of their [chuckles] system, right? Like, if you can change something... JUSTIN: Oh, it was...You'd be amazed at what our financial system is running on. It's, like, duct tape and very, very brittle -- EDDY: I'm not surprised [inaudible 40:27] tell us [laughs]. WILL: I would not. I would not. I want to kind of digress, and I'm very curious about this. Like, we talked about, like, sort of, like, you know, like the developer, like, blowing things up thing, right? But, like, Justin, you're working in security. What about security breaches? How do we deal with, like, a security breach? How do you even know there's a security breach? How do you, you know what I mean, do a postmortem for, like, a security thing? Like, oh, we had a compromised system. What do I do about that? The server's up happily spilling its guts to anybody. [laughter] JUSTIN: Happily divulging all the secrets [laughter]. So, again, it goes back to monitoring because you got to be able to know when you are being attacked. Because if you don't know that you are being attacked in some way, you just think it's normal traffic. You got to have monitors on what you are interested in because if you don't have the monitors on, they'll just take all your secrets. They'll take all your money and everything. A good example: when I worked at Coinme, which is a cryptocurrency company...Is it still okay? Yeah, they were bought out by somebody else. Okay, I can talk about this. When I worked there, it seemed like at least once a month our servers were under attack, either denial-of-service or password, you know, people were attacking, trying to steal passwords, or that sort of thing. And cryptocurrency is probably like the wild west, the most wild west financial industry there is right now. But we had to go in, and we had to...on the denial-of-service attacks, we were on the call with Cloudflare and trying to figure out, oh, what could we block to, you know, stop this denial-of-service attack, whether it's whole swaths of the earth. You know, we're going to block all of Russia. We're going to block all of Eastern Europe. Or if we decide that, you know, oh, we can block a certain type of browser tags or, you know, all those sorts of things were considered. And sometimes we actually had to do a live install to add custom tags to our traffic so that we know what was good from us, and that would block these bots that were under attack. And so, it was nuts. Like, there were several times when we were, like, all night long fighting this sort of thing. 
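A minimal sketch of the "custom tag" idea Justin mentions, assuming a Flask application; the header name and value are hypothetical, and in a real incident this rule would usually be pushed to the CDN or WAF edge rather than living in the app:

```python
# Sketch: during a denial-of-service incident, require a custom header that
# only your own clients send, and drop everything else before it reaches
# expensive code paths. Header name and value are hypothetical; real setups
# would enforce this at the CDN/WAF edge.
from flask import Flask, request, abort

app = Flask(__name__)
EXPECTED_TAG = "x-acme-client-tag"   # hypothetical header name
EXPECTED_VALUE = "rotate-me-often"   # hypothetical shared value, rotated during the incident

@app.before_request
def drop_untagged_traffic():
    if request.headers.get(EXPECTED_TAG) != EXPECTED_VALUE:
        abort(403)  # bots without the tag never reach the database

@app.route("/health")
def health():
    return "ok"
```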
But you basically just had to figure out, okay, what's their avenue of attack? And then, you know, figure out ways to block that traffic that was coming in. And sometimes we had whole swaths of our customers who got locked out because they were under password attack. So, it is a wild west, depending on, you know, what could happen. And then, you know, the next week, because it usually happens on a weekend, the next week we'd have a postmortem about, you know, what could we do to defend against that kind of attack? And sometimes that postmortem was, you know, done with our security company, or with the companies that we contracted with to help us block that sort of thing. So, it was interesting, and it was very, very detailed and kind of a crazy thing that we had to deal with in those cases. MIKE: What you're saying there is interesting, and you're hitting on something that I was wanting to bring up, because it's kind of a gap in our conversation. We said, oh yeah, you stop the bleeding, and then, you know, you figure things out. Well, sometimes stopping the bleeding is not an instant process. You talked about, you know, part of the triage: okay, I know they're bleeding. You know, you've looked at the metrics. You see, okay, I know something's going wrong. There's internal bleeding here. Or, you know, obviously, you know, we're getting a denial-of-service attack. What next? Because there's usually different options, and they have different value. There's a difference in what you do. You got the database issue. Do you add an index? Do you rewrite your query? What do you do? There are different options, and those different options have different costs -- JUSTIN: I actually want to bring up one point that you have here. You're investigating what the cause is. You got to have the contact information for all the people that you might need to contact on a Saturday night in order to solve the problem because you can't be an expert at everything, right? So, make sure that you have the contact information of these people and that you treat them nicely, and you [laughs] reward them because you are intruding upon their time that perhaps they were not on call. MIKE: Ramses is on the call. He hasn't said anything. I'm always glad when he's on the call because he knows everything [laughs], which may not be quite literally true, but it's close. RAMSES: It's far off. MIKE: [laughs] You know, having the right people in the room matters a lot. That's a really good point. And you better have a process for calling those people. DAVE: He doesn't know where all the bodies are, but he knows where the memorial services are held. MIKE: Making those choices matters. And it's really easy to get rabbit-holed on something because you're like, okay, we need to come up with a solution. How do we make this work? And you don't want to explore every option. That takes too long. So, there's a delicate balancing act that you're performing during that time, whether it's all night with a security issue or your database is down. Every minute's costing you a million dollars or whatever it is, right? You better be making a choice quickly. We've talked a lot about having presence of mind. Well, it matters a lot. And I think it's really important that you give yourself the mental space to explore that and find the right option. And that can go really wrong really easily. 
It's very common when you have that incident call, you have a lot of people who join in, and maybe you do have several business stakeholders who are coming in who are asking questions repeatedly. And they want to know, and rightfully so. But they should not be in that incident call when you're resolving the problem. You jump in with somebody else to have the discussion. I think it's critical that, whatever it is, and, you know, there are business stakeholders who actually can be really good, and they'll back up when they need to. But you need to get the people who can solve that problem, those people you mentioned, into a place where they can legitimately think and make a good decision. They can evaluate those options and pursue the best option. A few months ago, I was involved in a production incident, and I saw a lot of noise and people getting focused on something, or not knowing what to focus on. You know, there was lots of bouncing around, and helping people make a choice, "We're going to go this way," went a great way toward getting that solved in a much shorter amount of time versus hours, days, right? You need to get that. That's a big deal. Have you all seen that same dynamic? WILL: Honestly, like, most of the time that I've been in these calls, people have weighed in, you know, I haven't seen anybody sort of freaking out. I think I've been pretty lucky in that, you know, the people from, you know, like, the higher upper-level managers are just sort of like, what's going on? I mean, I don't know. I mean, maybe a piece of it is just me, you know what I mean, in that, like, I will tell you exactly what I know in clear and concise ways. This is what I know. This is what happened. This is what I'm going to do, you know what I mean? And then I just sort of, like, and now I'm going to go do it. And they're just like, it's everything I needed, and I'm going to leave, you know? So, I mean, I think, you know, kudos to them. And, like, I see, too, you, as, like, sort of, like, a first responder, let's say, need to be aware of, you know, what their ask is, right? And, I mean, you're going to talk to their boss. And they're going to talk to their boss, and then they're going to talk to potentially, you know, the big boss. And everybody needs to know what's going on, what's being done, you know what I mean? Because the CEO, like, in a big company at least, right? CEO's hands are tied. They can't do anything. They couldn't fix that server if they wanted to nor, you know, in most instances, could your boss, or at least your boss's boss. If you don't have dirt under your fingers, you're useless, you know. And so, your job is to communicate. No offense. I mean, it's not like they don't do anything at work, but when the server room is on fire, like, if you're not helpful, just, yeah, let me get the fire out, and then we can do manager stuff, you know, later [laughs]. JUSTIN: Yeah. And there's generally a playbook for incidents. If you guys have an on-call rotation, which I believe you guys do, we have one. You have a playbook that clearly designates, you know, oh, somebody is on call, and they have the power to declare an incident. They are in there. They're the incident quarterback. That's what we call 'em here. And they have access to all the people they need to call. And they are also responsible for communicating up and handling, you know, the managers that may come in and, like, throw their arms around, or whatever. 
And the incident quarterback, I think, is really key to maintaining a calm, you know, demeanor during this incident. And it's key that anybody who has the potential to be that has that right training so that they know how to use that playbook, what they need to do. And, you know, it's really nice if they do know how to do that. If they don't know how to do that, then you're doing on-the-call training if you happen to be on that call. WILL: I'll take it for granted that there is, like, a binder that one could open up and handle the incident, you know, or God help you, training [laughs]. Your training is, this is the spreadsheet for your days, and maybe there's an email or something. Yeah, yeah, I don't know. I've seen training. I have seen it. I've witnessed it where, like, they're just like, "Okay, this is the training stuff." Like, I know it can happen. However, however [laughter], however, like, as often as not, it is just a Slack message, like, "Hey, server's down. Can you get on this call [laughter]?" And I'm like, "Yeah, yeah, I can". DAVE: I pushed really hard to get, at CoverMyMeds, to get...we called it the 3:00 AM playbook, which is just a checklist, right? You know, do this, do this; do this; do this; look at this. If it's this, do that. And we literally had to write it for somebody with no context, no knowledge of the system, other than, you know, generic familiarity with the tools. And it's 3:00 o'clock in the morning. You're sleepy. And all you want is to go back to bed. And literally, the outcome of the 3:00 AM playbook is to stop the bleeding. It's not even pull out the bullet. It's literally get the server back up, watch it for a few minutes. If the server looks like it's going up, it's still up, go back to bed. We'll dig the bullet out in the morning at 8:00 o'clock. MIKE: So, you stop the bleeding. What next? That's the key thing first, right? It's very easy. Well, maybe even have some partial fix in place. It's very easy for people to say, "Oh yeah, problem solved," and walk away. And then, two weeks later, it's still, you know, your Band-Aid's in place, and the Band-Aid falls off [laughs]. DAVE: We've all worked on systems that's got the little donut spare tire that's been there for seven years because it works. MIKE: Yeah [chuckles]. How do you deal with this in the long term, [inaudible 52:13] as soon it's happened? How do you end up stronger going out of it than you went in? WILL: It depends on how fast you got to drive down the highway, man. Like, there have been plenty of sort of, like, robust failover systems that had, like, a kind of a slow, you know, peptic ulcer memory leak, where they just cook for, you know, a couple of weeks or a month or so. And, eventually, you'd get to a point where it's just like, yeah, that one's got to go. And you just, you know, you vote somebody out of the pool. You keep on going. There's no, you know, it could be bad, like, you'd just be like, ah, it'll be fine, you know? There's no one-size-fits-all there, you know. Some stuff's like, we're working all weekend, baby, and other stuff is just like, nah, it'll be fine. DAVE: We did a couple of systems where we needed to know that, like, we called it Meteor Strike Level Readiness. So, we literally had our entire cluster, like, 700 servers running in a data center in Atlanta and another cluster in Chicago. They were not synced. Like, the databases weren't slaved to each other. They weren't, you know, synchronizing. We ran off of the one, and it just sent backups to the other one. 
And, every six months, we would fail over to the other data center and use the other one as the backup. And, in three years of doing that every six months, by the time I parted ways with the company, it was still an all-night. And at 4:00 in the morning, we were all writing down the stuff that didn't work that needed to be fixed over the next six months. And it was awful because we would take down prod to do the fail...I mean, we were simulating, like, literally, a meteor has hit Chicago, and we've got to switch over to Atlanta now. How fast can we go? And we still had long lists of things to do, but we got very good at triaging: what's the most important thing? And the most important thing was, how fast can we get Atlanta up and running and then figure out how much is left in Chicago, and what can we do with it? So, that's a lot of money. So, that's another element, right? You slap the donut on that spare tire. And, all of a sudden, the CFO is like, "Why are we spending money on this? I'm still making money." Well, you're not going to be CFO for long. WILL: I mean, I don't know. There haven't been a lot of meteor strikes, you know, in the past 20 or so years. Like, you know, like, Atlanta, both Atlanta and Chicago have been, like, remarkably durable. We haven't burned them to the ground in 100 years, 150 [laughter]. DAVE: And, honestly, it's historically had the same amount of likelihood that they both get hit at the same time, honestly. So, I'm not even sure what we're doing, so... WILL: Yeah, yeah [laughs]. Who would nuke just one? DAVE: Right [laughter]? It's like Lay's potato chips. You can't nuke just one [laughs]. WILL: Nobody who would do it is going to be short. DAVE: That's right. That's right. JUSTIN: Yeah. And I think that has to do with, like, a realistic evaluation of what could happen. Because you could sit there and prepare so much for any sort of thing that could happen, but there's a cutoff point. And I think a reasonable level of risk is acceptable to the business because the business has to survive and be profitable. And, you know, if you're spending all your time, like, thinking of the worst-case scenarios, one, you got to get a life, and two, you're going to spend way too much time, and your engineers' time trying to solve hypotheticals. DAVE: To be fair, the reason...so it wasn't hypothetical. The reason we came up with the meteor strike scenario is...I'll have to dig it up. There was a data center in Houston that had a transformer, like, the main power transformer inside the building shorted out. And it heats...it superheated the cooling oil, and it detonated. It didn't kill anybody because it happened in the middle of the night. JUSTIN: Wow. DAVE: But it was in the center of the building, took out all the servers around it in, like, a 20-foot thing, and then punched a hole through the ceiling. And the servers in there literally fell in the hole. And I can't remember who was on it. I was web admining for Schlock Mercenary. My best friend was doing a web comic. And all of Keenspace and Keenspot, like entire companies, like, their whole data center was just gone. I can't remember the name of it, but it's a name that you might recognize, especially if you're in networking. You would go, "Oh, I know them." WILL: I've got some [inaudible 56:52] words you guys might recognize: us-east-1 is down [laughter]. If you know, you know. I wish you could see the face that Kyle and Mike are making. MIKE: Yeah, instant recognition. 
You got to East, and I knew where you were going [laughs]. WILL: I don't think I'm allowed to use proper nouns, but us-east-1. Everybody knows who I'm talking about, and everybody knows what they do every other year. DAVE: Rack Shack, EV1 Servers was the one. 2008, they had a transformer explode. Sorry, 2003. And then it happened again in 2008. So, wow, wow. There's somebody who needed to be fired there, clearly. MIKE: Somebody needed to do the postmortem and take action. We're kind of reaching a good time to be shutting down. That cleanup matters. And maybe you determine, hey, you know, we can keep rolling on this donut for years [chuckles]. But you probably have some customers who need to be helped. And, you know, it's important not to neglect saying, "Okay, yeah, we've stopped the bleeding. What's the cleanup need to be?" Because there may be some important cleanup. And there's some, you know, people are going to care. People are going to care. Any final thoughts you all have? We've talked a lot about keeping our composure [chuckles], a lot about that, about the importance of having, like, an engineering sort of mindset. How do I fix this? Triaging, stopping the bleeding, fixing it, and pragmatically, you know, and then not neglecting the after, what comes after. Anything else you want to cover? DAVE: I have a strong religious belief that it is more important to be able to fix the problem than to correctly prevent the problem. Because if you correctly prevent a problem, you have not improved your capacity for dealing with something that you didn't correctly predict. But if you get good at solving the problems, you suddenly can stop worrying about missing something because you start to realize, "We'll handle it." You don't get cavalier. You don't deploy at 4:00 PM on a Friday and go home because you'll handle it. You don't get stupid, but it can help calm you down and say, "Yeah, this is what happens." JUSTIN: So, what you're saying, David, is, like, you should take prod down just a little, and then [laughs] and then that little inoculation [laughs]. DAVE: I wrote a tool called Tour Bus, which, over a conference room Wi-Fi, over a T1, so, like, 256K, with 200 people in the room surfing the internet over it. And I took out prod with it from my laptop while I was giving a talk on stress testing your server. And I did not get a talking to from the CTO because it was his pants that were down on the internet, not mine. And I didn't yank his pants down maliciously. I genuinely didn't think I would take out our prod servers. But there you go. So, give the emperor's new clothes a tug every once in a while. MIKE: So, Netflix, famously, I believe it was Netflix, correct me if I have anything wrong here, who had the tool called Chaos Monkey. DAVE: Chaos Monkey. MIKE: They would go and just break their system here and there, all the time, so that they knew their system would be resilient because unless they were testing it, they didn't know. WILL: I really like having a boring day at work [laughter]. DAVE: Me too. WILL: I like boring days at work [laughter]. I'm thinking I can ride with you on that one, Dave [laughter]. MIKE: I will say that, you know, they say, "Oh, it's always the thing you didn't think of." It doesn't matter how much preparation you do; there's going to be something you didn't think of. And we've talked some about monitoring along this and observability. 
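A minimal sketch of the observability Mike keeps coming back to, assuming the prometheus_client library; the metric name and port are placeholders, and the alert rule that watches the counter would live in whatever alerting stack you already run:

```python
# Sketch: count errors per endpoint and expose them for scraping, so the
# thing you didn't think of still shows up on a dashboard within minutes.
# Assumes prometheus_client; metric name and port are placeholders.
from prometheus_client import Counter, start_http_server

REQUEST_ERRORS = Counter(
    "app_request_errors_total",
    "Requests that raised an unhandled exception",
    ["endpoint"],
)

def handle_request(endpoint: str, handler):
    try:
        return handler()
    except Exception:
        REQUEST_ERRORS.labels(endpoint=endpoint).inc()
        raise  # still fail loudly; the metric feeds the alert, it doesn't swallow the error

if __name__ == "__main__":
    start_http_server(9000)  # Prometheus scrapes /metrics on this port
    # ...start the real application here...
```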
I'm of a mindset that, given the choice between the two, observability is more important than hardening, not that they're not both important. But you're going to miss something. You're going to miss something when you're trying to prepare for whatever the attack is, because it's going to be some attack you weren't thinking of. And I say attack. It may not be malicious, right? Whatever bad thing happens, it's likely you didn't think about it. If you did think about it, you would've fixed it. But if you have really good systems to figure out what happened, you can solve that quickly, and if you don't, then you can't solve it quickly, and you're in a really bad spot. I've, for a long time, been of the strong belief that monitoring, that observability is the more important of the two. DAVE: Observability leads to good hardening. Good hardening does not lead necessarily to good observability. KYLE: Just to go along with your last point, I would say that monitoring is what you do to prevent historical events from re-happening. WILL: Ooh, I'm stealing that. I love that. MIKE: I like that. Hopefully, in your next production incident, you've taken something from this that helps you out. Until next time on the Acima Development Podcast.