*intro music* CHRIS: Here we go. AMOS: Hello. Welcome to the show, Christopher. CHRIS: Yeah, sorry, I had to publish the episode for this week and I had to write the summary and description. AMOS: Thank you very very much. CHRIS: Yeah. Hey, you have a pop filter this week. AMOS: Uhh, yeah, I do. Um. So, last week I had been travelling the week before and accidentally left my mic at home, and I was sitting at my office so I had to pull out my old microphone and it did not have a pop filter. And this week I'm travelling again, and I'm in Kansas City, but I brought my microphone with me and my pop filter, so I think we should be good to go. Tonight I'm going to go to the Kansas City Elixir meetup and they're talking about writing languages that run on the BEAM, so I bet we'll talk about BEAM bytecode and stuff like that. What else is going on? What are you guys up to? I know there's lots of crazy stuff going on in the community right now. CHRIS: I'm officially about to be one coffee too many. That's what's happening to me. ANNA: Uh-oh. Already? Well, I guess you're three hours ahead. CHRIS: It's twelve o'clock here. ANNA: I haven't even had coffee yet. AMOS: Are you shaking? CHRIS: No. It's about to happen though. You know how you can tell? AMOS: That's when you go get a big glass of water. CHRIS: This one is going to put me right over the edge, so it's about to get lit, as the kids say. ANNA: I'm ready. So ready for that. AMOS: So, I know we wanted to talk about supervision trees - or, we talked about that being something to talk about, but there was one other thing that I think was important for us to talk about, and then there's one fun thing. So let's talk about the fun thing first. Honeypot is releasing a mini documentary on Elixir. CHRIS: Oh, did you hear about that too? ANNA: Oh, yeah, yeah. CHRIS: Sorry, that wasn't sarcastic, but good Lord! Everyone is talking about it. I've got like three emails about it. I got Twitter DMs about this Honeypot documentary. AMOS: What I want to know is - are you in it? CHRIS: No, no. It's the same people that you would expect to be in it. That are in everything. But everyone has talked about this thing. It just - now I don't even want to watch it because I've been sent the link too many times. Now my iconoclastic nature is going to rear up in me like some sort of dark force and it's like 'I don't want to do this now. This is stupid'. AMOS: So, it got overhyped on you. It's like a new Star Wars film. CHRIS: Yep. One hundred percent. This is our version of Solo right here. AMOS: Hey, I don't know. That Han Solo death scene was pretty awesome. CHRIS: Spoilers. AMOS: If you don't know that - Han Solo died. Sorry. ANNA: Thank you, Amos. CHRIS: We're coming in hot this week. Comin' in hot. AMOS: At the end of The Wizard of Oz, Dorothy makes it home. Spoilers. CHRIS: Spoilers. Especially spoilers for Titanic, is what you're saying? AMOS: Yeah, you know how far you can make it on the Titanic? About halfway. CHRIS: I just don't know why they couldn't have shared that board. It really seems like they both could have fit on that door or whatever it was. AMOS: Oh, yeah. Oh yeah. My daughter has watched that movie so many times, and I make Titanic jokes and she got mad at me and I thought 'how long ago did that happen? I think it's okay'. ANNA: Aww, but if she's just watching… CHRIS: It's like it's happening all over again for her. AMOS: So I - I'm actually still excited to see the documentary.
I know - I think it'd be interesting to actually see the history of it instead of just reading about it. I like videos. ANNA: Yeah, I'll watch it. What was the other thing you wanted to talk about, Amos? AMOS: Oh! So, uh, so… eslint-scope - there's an issue out there, issue 39 - I already added it to our show links - there's a virus, well, what appears to be a virus - somebody injected some code that loads up some code from Pastebin. Not exactly sure what the problem is, I haven't been able to read up on it completely. I just heard about it this morning. CHRIS: I think it steals npm credentials. AMOS: Ooo… ANNA: That's not good. CHRIS: As far as I read, it steals npm credentials. AMOS: Oh, man. ANNA: That's bad. CHRIS: Seems bad, fam. AMOS: Anybody know how long this problem was there - how long the issue was around? CHRIS: It was only open for like an hour. The issue was only open for like an hour. AMOS: Yeah, but how long was the actual malicious code in there? CHRIS: I have no idea. I don't know if you know this, but this is an Elixir podcast, so I have no idea how long some JavaScript library was out in the wild. AMOS: Well, yeah, but it's ESLint. CHRIS: What is that? Besides a linter. I can gather that it's a linter. AMOS: It's the ECMAScript linter. CHRIS: It's like the linter? AMOS: Yeah, but, well, a lot of people use that if they're doing Phoenix and other front-end things. So I think it's fairly related. Um. But that's - I just found it pretty crazy - I had never actually seen an open source project that had actual malicious code in it. I don't mean like 'oops, we accidentally ran you out of memory, sorry' - not on purpose. ANNA: I wonder if we just don't hear about it. Like, maybe it happens more often, or maybe it's fairly unusual. CHRIS: I mean, most of the time that stuff gets caught fairly quickly is part of it, but I think the other thing is - we're in a very specific group of people who are working in a relatively small language. It makes sense to go attack one of the biggest communities in the world, which is JavaScript, and attack one of the biggest tools used by that community - I presume - whereas, even if you were to steal credentials to publish something in, like, Phoenix or something like that, the damage is probably less than what it is for, like, a linter in the JavaScript world. AMOS: Well, the crazy thing is, someone would have had to put that into a pull request, and someone would have had to accept it. CHRIS: No - no way. That's not true. AMOS: Well, how did it get into the Git repo? CHRIS: Well, it doesn't have to be in the Git repo to publish it to npm. Somebody stole someone's credentials and then published something to steal more credentials. AMOS: Oh, you've read more about this. CHRIS: That's what I - I mean, I skimmed it. I think that's what happened. I didn't read it that in-depth. But, I mean, there's no attachment to GitHub or any of that sort of stuff to publish to npm. You just sort of push code. AMOS: Okay, I didn't know it wasn't in the actual source. CHRIS: Well, it might be, I mean… ANNA: Yeah, somebody's saying it looks like some npm credentials got compromised. AMOS: Aw, man. CHRIS: And then, it's possible some more npm credentials got compromised. AMOS: So. Here's what I learned. Don't use Node. Oh wait, I already try to avoid that. CHRIS: That's what your takeaway from this was? AMOS: I think that's a pretty good takeaway. CHRIS: Yeah, that's the life lesson we should all learn.
AMOS: I think the main thing is - I mean, really, you need to be - if people trust npm, you're trusting the stuff coming from that server, but maybe you need to verify it. Maybe some of the package managers should check repositories versus the code that got uploaded and do checksum checks. I mean, you can do MD5 checksums, but all that does is make sure there was no man in the middle, 'cause you're usually getting the expected MD5 sum from the same source. Like, npm. CHRIS: Yeah, but at some point, you have to trust the credentials. I mean, if I publish something, I should be able to publish it if I have my credentials. I mean, that's the core problem here, right? Somebody's credentials got comped, and now you're effectively that person and you can do whatever you want. AMOS: Well, my thing is, if this ESLint - if some malicious code got added to that, what else out there in the Node environment, or in Hex, or anywhere else could potentially have this exact same problem going on right now? It's just that somebody noticed this one. CHRIS: Oh, I'm sure there is. Just think of the numerous amounts of packages that are on npm. I'm sure there's stuff on there that's malicious. AMOS: Yeah, I wanna add it to left-pad. It'll get downloaded everywhere if you add it to left-pad, right? CHRIS: I mean, honestly, I bet there's stuff in Hex that's malicious. AMOS: Man. CHRIS: I mean, I suspect, if you go look, you'll find something that does something nefarious. ANNA: Why are people terrible? CHRIS: I don't know. AMOS: So, what do you do to protect yourself? CHRIS: I mean, I could be wrong. They could be doing a ton of security auditing of the code itself, but my intuition is there's probably something on there. There's probably some sort of package on there that's supposed to send telemetry data back up to a server or whatever. AMOS: So, what do we do as users of all of these libraries to protect ourselves? I mean, I don't want to live in fear of pulling things down from Hex, but this makes me a little more cautious than I might have been. But, you know, we're using so much code. You're not going to sit down and read every line of it to make sure this stuff's not happening. It's not possible. CHRIS: Not with that attitude… Comin' in hot. ANNA: That caffeine is working. CHRIS: It's kicking in. AMOS: If only I had a better attitude, I'd be just fine. CHRIS: The power is upon me, from this coffee. ANNA: I love that you're telling Amos, who's like the most positive person on the planet, given some of the stories he'd shared, to have a better attitude. AMOS: At two o'clock in the morning, after a long day with my kids, my wife might not tell you I'm the most positive person on the planet. I'm very mean after I've driven all night while the kids slept for vacation. The first day of vacation is 'stay away from Dad. He's a bear'. ANNA: Oh… Well, given all your other stories. AMOS: I said maybe we can share a little - slight background on that is that I was telling Anna and Chris that I was in the Air National Guard for thirteen years and my call sign was Vic, and it was short for Victim, because there were always bad things that seemed to happen to me that were completely outside of my control, like dysentery and bedbugs. Twice. I got fiberglass in my eye, and there are just more stories like this. CHRIS: It was a long list. It was a sizeable list. AMOS: And you don't even know all of it. CHRIS: It was like naming all the sons of Abraham. It was a long list.
AMOS: Somebody came up to me one time and was like 'how are you so positive and happy all the time?' and I was like 'If you had my life and you weren't positive and happy, you'd be dead'. I've had a lot of blessings in my life too, and I've been on both ends of the spectrum, so I can deal with a little bit of bedbugs to get the great friends and family that I have. I'll be okay. ANNA: See, so positive. CHRIS: And yet, won't read all the code that he pulls down for his dependencies. ANNA: Do you read every single line you pull down for your dependencies? CHRIS: No, not every single line. AMOS: Maybe that's why I'm so positive. CHRIS: I read maybe like 90% of lines. AMOS: He reads them as they're coming across the wire. CHRIS: I spin up Wireshark. AMOS: I turn on Wireshark every time I pull. CHRIS: It's like looking at the Matrix. You just, you know, you start to see inside the code at some point. AMOS: Look, a virus! CHRIS: I will say - and this is not as much of a joke - I actually do read a lot of the Elixir code that I pull in, but I also just don't add that many dependencies in Elixir. I typically skim through most of the dependencies that I bring into an Elixir project. AMOS: I frequently will skim the library, like on GitHub, or I might even pull it down and skim it. But when I pull it from Hex, I don't go in and look at it and make sure that it pulled what I expected. CHRIS: Oh no, yeah, 100%. AMOS: And that's where this problem lies. And to be clear - I don't know if we were very clear, I think we were, but I just want to reiterate - this was not something found in Hex. It doesn't mean that there's not a possibility of something like this out there, but it was not found in Hex. This is an npm issue. ANNA: Yeah, it's good to clarify that. CHRIS: Silver lining here: they were really fast about fixing it, from what I could tell. AMOS: And so if you have a project that's using anything from npm that has eslint-scope - check your version and fix it if it's the affected one. But even if it doesn't say that it's that version, you might as well go out and change your credentials. Just to be on the safe side. CHRIS: Yeah, absolutely. ANNA: Supervision trees, huh? AMOS: Yeah. Was that the only other thing I had on my list? ANNA: What was next on the list? AMOS: I think that was it. CHRIS: I wanna talk about - I want to talk a little bit - we'll cut this out. Hang on. I want to talk a little bit more about reading dependencies that you bring in. ANNA: That's important, we should talk about that. CHRIS: The only reason I do it is - number one, the amount of dependencies I'm bringing in is relatively low. The other problem, though, is that, at least in Elixir - what's the most correct and nice way to say this? I've had too much coffee and this could get pointed really quickly. AMOS: I would expect no less from you, Chris. CHRIS: Well, so I think that there's tiers of dependencies in Elixir right now, and I think that's true for any language. It's just that the stratification of dependencies in terms of quality is very wide for languages that have been around for a long time or have a lot of users or whatever.
And I think in Elixir we have that same stratification, but it's a little bit narrower. So for certain dependencies, the farther down you go in that stratification, the more careful you have to be about looking at what it is you're bringing in and analyzing it, to find out if you're going to have bottlenecks, or unforeseen issues that the original author didn't intend, or race conditions and that kind of stuff. I've definitely pulled in dependencies where all of a sudden our throughput just, like, tanks, and it tanks because they're serialising inside this library - they're serialising, like, every message through a single GenServer or something like that. AMOS: That's not OK. *extended pause* ANNA: I really wish I could capture those reactions via audio. CHRIS: Yeah. We should put the video up at some point. AMOS: We can totally leave in all of that silence and I think people would get it. CHRIS: Coming in hot. ANNA: I'm lacking caffeine. I think that's compensating for all of us. AMOS: I've only had one coffee, but I think I should have had five more. CHRIS: I'm trying to make up for last week, where I was so foggy and out of it that I couldn't even make coherent sentences. AMOS: I think we were all pretty out of it. ANNA: Yeah, I think we were all out of it last week. AMOS: Alright, anyway. CHRIS: 3, 2, 1, back to the show. So, there are typically - not typically. There are often performance issues with libraries that just haven't seen as much real-world usage yet, and I do go in and I look for that kind of stuff, especially if I know they're interacting with outside resources or outside services and stuff along those lines. Because that's often a place where you'll see people serialising calls, or trying to, like, save state in a single GenServer and then passing everything through that one GenServer. You know, yesterday I was looking at a library that we're using for sort of monitoring and stuff like that, and they're using the process dictionary, like, kind of a lot inside of it, and that was a little bit of a red flag - I think it should always be a red flag when you see somebody using the process dictionary, because of the dangers that are inherent in the process dictionary and that kind of stuff. ANNA: Can you explain that a little bit more for folks listening? AMOS: And for me. CHRIS: Yeah, so the process dictionary - actually, there was a really good talk on it at Lonestar Elixir, which I'll link to in the show notes and which provides a lot more information - but the process dictionary is a place to put mutable state. Or, it's a way to have mutable state inside of a process, and because of that it comes with all these disclaimers about how to use it, where to use it, the dangers of using it, that kind of stuff. It makes debugging really, really hard. AMOS: So if you're from Ruby, we would call that thread local. CHRIS: I think we just call it state. AMOS: I was saying thread local is kind of the 'run from this if you see this in a Ruby library'. CHRIS: Right, yeah, yeah, so it's similar. It comes with some baggage. So, you know, I think it's important to go look for that kind of stuff. There's definitely been libraries - and we saw this stuff at Le Tote, too - where we would go find libraries and then realize they needed optimizations, or they had race conditions, or they weren't closing connections to external services or whatever.
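For anyone who hasn't bumped into it, here's a minimal sketch of the process dictionary Chris is describing - it's per-process mutable state that doesn't show up in function arguments or return values, which is a big part of why it makes debugging hard. The module and key names here are just illustrative.

```elixir
# The process dictionary is mutable state scoped to the calling process.
# Nothing about this state appears in arguments or return values, which is
# why heavy use of it inside a library can be a debugging headache.
defmodule PDictExample do
  def bump_counter do
    count = Process.get(:request_count, 0)  # read, defaulting to 0
    Process.put(:request_count, count + 1)  # mutate in place
    count + 1
  end
end

# In one process:
#   PDictExample.bump_counter()  #=> 1
#   PDictExample.bump_counter()  #=> 2
# A different process calling the same function starts from 0 again -
# the state is invisible outside the process that owns it.
```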
And, you know, we try to ship those fixes back upstream for people, too. And I also just find that I don't pull in that many dependencies, so it's not a burden on my time, to some degree. I try not to pull in a dependency unless I'm really using, like, 90% of it. You know, if I'm going to use one function in something that is giving me 500 things, I'm probably not going to pull it in. I'm either going to rewrite it or look for a different solution. But I do have a tendency, when I'm looking at libraries, to pull it into an empty project and start it up. If the library has GenServers, I'll go look in Observer, see what's going on. Look for anything where it's talking to the outside world. But, you know, performance too. I recently have been using Benchee a little bit, but I see a lot going on, especially from Michal and Devon of - you know - what is it - Fast Elixir, where they have been trying to push the envelope of performance in other libraries that are big. They go work on performance issues, and so I think that's a good thing to watch and learn from for your own libraries. 'Cause not pulling in a library doesn't guarantee you're going to be fast and not make the same mistakes. So, we discussed talking about supervision trees. I kind of got onto this because Frank put up a post in the Elixir Forum asking for guidelines for supervision trees and setting restart intensity parameters. I put a link to it up for you guys and we'll put one in the show notes. He's really asking: how do you decide on the two main settings for your supervisors - how many restarts before they give up? Those two settings - I just blanked on what they're called, but it's restarts per time period, right? Restarts and intensity. Erlang by default has one restart every 5 seconds, and Elixir does 3 restarts every 5 seconds. I know Frank put in there that Erlang gives the rationale for the choices that they made, but he couldn't really find much in Elixir for why those defaults were chosen. So I don't know - when you guys are dealing with supervision trees and restarts, how do you decide what to set these things to? I know there's no simple 'oh, this is exactly what you should set it to' - that's why they're configurable. I start with the defaults and then go from there, unless I really know right up front exactly what I need, so I do some testing. I try to make things fail on purpose, especially with outside connection stuff. If I'm talking to outside servers, I might let it fail a lot more. But what do you guys do? CHRIS: I don't think I have anything super insightful here when it comes to how you pick the intensity and restarts and stuff like that. It really is case dependent, and it totally depends on what you're supervising. So I don't know that I have any great wisdom other than to think about the problem that you're trying to solve. AMOS: How about you, Anna, any ideas? CHRIS: This is a bad topic. This is a dumb topic. ANNA: I would say my strategy is similar to yours, Amos. I don't know that I have anything super insightful. I mean, yeah, it really is very case dependent - where in the supervision tree something lies, how worried you are about that process failing, and what you need from it. So, yeah, I don't know.
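For reference, the two knobs Amos is reaching for are called max_restarts and max_seconds in Elixir (intensity and period in Erlang's :supervisor). Here's a minimal sketch of where they go - the child module and the numbers are placeholders, not a recommendation:

```elixir
# Hypothetical supervisor showing where the restart-intensity settings live.
# Elixir's defaults are max_restarts: 3, max_seconds: 5; Erlang's :supervisor
# defaults to intensity 1, period 5, as mentioned on the show.
defmodule MyApp.WorkerSupervisor do
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      {MyApp.ExternalConnection, []}  # placeholder child
    ]

    # If children crash more than 5 times within 10 seconds, this supervisor
    # gives up and terminates, escalating the failure to its own supervisor.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 10)
  end
end
```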
AMOS: I think one of the hard things in coming up with this stuff is, you know, you have something where you say, 'oh, it shouldn't fail that often, maybe I give it 5 restarts in 3 minutes', because you want to be able to handle a burst, so you give it 5 restarts. But you say 3 minutes, and what if it has a burst of five restarts at the beginning, and that's fine, and then one second before the three minutes is up it fails one more time? Now you just shut down - and is that okay? So it's not as straightforward a problem, when you're thinking about it, as it seems to be. I think that's a big problem. And then you get trees going, and you have supervisors under supervisors under supervisors, and really, at that bottom level, you have the ability to restart even more, mainly because the one right above it will restart, and then you get to start over on your whole count on the lower one. So let's say you have an external service that you're being rate limited on - if it's lower on the tree and could restart more, you might actually have to pull it out into its own part of the tree to control it, so you don't get kicked off the service for too high a rate. Does that make sense? I don't know, I might just be rambling. CHRIS: Well, I mean, I think it makes sense, but you are also rambling, yes, I will say. No, I mean, I defer back to the old standby, which is our friend - dear friend of the show - Fred, and his amazing blog post, It's About the Guarantees, which I think we've even linked to before. I think that's the right way to think about it. What are the guarantees that you're going to provide inside of the supervisors? You know, if you're monitoring external connections, what are the guarantees you need to provide in order for that service to be stable and working in a good state, and that sort of stuff? And if you're thinking about it from the point of view of those guarantees, then I think that can help frame the problem correctly for you. So, I mean, let's say you set your restart time for three minutes or something like that. Is that the guarantee you wanna give - if you get 3 failures within 3 minutes, then you're going to fail? Because if that's the guarantee you're giving, then yes, the correct thing is to shut down. And maybe you actually want to restrict that guarantee even further and say, maybe you don't want it to be 3 minutes - maybe you want to restrict that to a much shorter time span in order to maintain, again, a system-wide guarantee. I also think this has a lot to do with the ways in which we build systems, and thinking about at what levels we enforce certain guarantees. Right, because I think your guarantees change over the life cycle of the system. The first level is: I can start my application. So I have an OTP app, I can start it up, and it does whatever it needs to do to reach a sort of baseline level. And maybe what it needs to do is attach to a database, attach to Kafka, start a web server, bind the port, and start receiving web requests, and maybe that's, you know, state 0 for your application, 'cause that's the correct guarantee that you want to provide - in which case you're going to design a supervision tree to accommodate that, and you're going to design a system that needs all of those things at boot time.
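A rough sketch of the "baseline zero" Chris is describing, assuming a typical OTP application: children in the list start in order, and each one blocks the next, so the ordering itself encodes which guarantees must hold before the app counts as "up". The module names here are placeholders.

```elixir
# Hypothetical application whose supervision tree encodes boot-time guarantees.
# Children start in order and each start blocks the next, so if the web
# endpoint is listed last, "the app is running" implies the database and
# Kafka attachments already succeeded.
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,           # database connection pool comes up first
      MyApp.KafkaConsumer,  # then the Kafka attachment (placeholder name)
      MyAppWeb.Endpoint     # only then bind the port and take web requests
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```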
AMOS: That baseline 0 is really the starting up of every individual application - whether it's a database repo, right, that's a separate application with database connections, or a Phoenix app, or something else. When I'm talking about applications here, I mean the Erlang/Elixir version of what an application is. But sometimes that startup can even have problems, and base OTP doesn't really give you a way to put restart strategies on applications at that top level. SSH is one that I've talked about before. It has, I wanna say, 10 restarts in 3600 seconds, and sometimes when the system is coming up you just can't get a connection. Like, if you're on a firmware device, as the device boots up you just don't have a network connection yet for it to grab onto, so it fails too fast and crashes. And maybe you don't actually want to have to start that application up yourself, 'cause that can be a little painful too, especially if other parts are depending on it. So there's Shoehorn, from the Nerves team. CHRIS: This is the Nerves library that basically circumvents app startup in order to restart failed applications, right? AMOS: Right, so it circumvents part of the application start process and allows you to catch application exits - only during startup. After startup, I believe it goes away; after everything's already started, it undoes itself. But it allows the user to put in handlers and build handlers that do different things. It could be a circuit breaker handler. It could say, hey, go ahead and try to start all the other applications, and we'll just try to start this one later. And you could do lots of different things. So it kind of gives you a little bit of supervision-level control at the application startup level, which can be pretty complex, especially on an embedded device, which is why they did it. CHRIS: So let's put aside for just a moment some of the built-in OTP applications that ship with Erlang and OTP, because they get special treatment in certain cases and you don't always get to control their stuff. Let's put those aside for just a second and talk about the stuff that we do control, which I think actually applies to a lot of the Nerves stuff, right - the different Nerves applications. So let's talk about just that problem right now. The problem that people are attempting to solve is: I've got this application, it tries to start up, and when it starts up it needs access to these external things. And you can think about this - it's not a lot different than starting a web service that needs to connect to RabbitMQ, Redis, Kafka, and a database somewhere, like a Postgres or something like that, and also start a web server when it boots up. Now, my take on it is - number one, why is that stable system zero? Why is the baseline level of stability all of that? Why does it need all those things when the application comes up? It seems like you could start up your application and just start the web service - nothing else could be running but the web service - and you could still return errors and just say 'we're not started yet' or whatever from your web service, and that could be baseline 0. And then, once you've established that, you can move down to the next phase, which is, you know, stable system level 1,
wherein we now have a Kafka connection, and we now have a RabbitMQ connection, and we now have a Redis connection. And that gives you the same sort of resilience that you're looking for, just with a different way of thinking about it. And I get that that's kind of what Shoehorn's trying to do, but what I don't necessarily understand is why Shoehorn is a better way of solving it - why are people looking for application-level restart strategies, as opposed to just saying: when my Kafka application starts, it doesn't start anything yet. It can just start, and when I want to start my actual stuff I can manage that myself, or it can manage its own intensities. Or maybe it's all dynamic - maybe all the connections should just be dynamic, and all the Kafka application does is start a singular dynamic supervisor and then just spin up workers. AMOS: I think that gets back to what we talked about earlier about using libraries - you're using some library, and maybe you don't know how long it's gonna take to get a change upstream, or a configuration option that allows you to do these kinds of things, and it's set up as an application that's going to start. So you can either change it to - oh, what is that, where you start the application from your own? I can't remember what it's called right now. CHRIS: Where you add it to, like, your included applications? AMOS: Included applications, that's what I was looking for. Yeah, thank you. So you either add it to that and then put some controls around it, or you have to depend on what they have going on. And I think that Shoehorn, in a lot of ways, is not there to fix the things that you're writing - you shouldn't necessarily be using Shoehorn to restart an application that you control the code inside of. It's for - you know, you said set aside the standard Erlang stuff, but I'm using SSH. Yeah, I mean, that stuff is - yeah. CHRIS: It's a different problem. AMOS: That is really, I think, the wheelhouse Shoehorn is trying to cover - the applications I don't control. Maybe I want all their security fixes, but it might be a long time before they take my fix upstream, and forking gets to be a big pain in the butt, trying to keep things up to date until they pull in my change. So instead, Shoehorn is really a band-aid until I can get a better fix in place. That's how I've used it. CHRIS: Yeah, and in so many ways this feels like the config problem all over again. We're inventing solutions to solve a problem that, to some degree, the community has created for itself because of patterns that we've used. I don't know, maybe that's laying blame or something like that, but I just think we're building tools to solve a problem as opposed to changing the way we think about designing systems, which is understandably a lot harder and a lot more time consuming. I mean, I'm not really saying that we shouldn't have these solutions. AMOS: I don't think it means that people are going to stop thinking about those problems. I know I still think about them. I worked on Shoehorn a lot, and the whole time I was working on it, I kept telling Justin Schneck - hey, Justin, friend of the show - I feel really dirty.
Every time we were working on this, I felt dirty, and I didn't know that we should do it. I spent a lot of time trying to come up with other solutions to the issue, even after we had the PR, and I kind of every day would come back and say, hey, are we sure we're doing the right thing? And all of us felt a little dirty about it, but. CHRIS: Did y'all try to submit this to OTP, to try to get them to change the way that applications work so you could actually do this? AMOS: Oh, I don't know. I don't know if they did that. CHRIS: I feel like Justin actually did talk to people about this, 'cause this happened - I think it happened when we were at Le Tote, before the great diaspora that was the Le Tote Elixir team. I think I remember him talking about that - he actually went and talked to, I don't know, someone, maybe José or somebody, about adding the ability to do restart strategies with applications, so that effectively you wouldn't have to have a Shoehorn because it would just be built in, yeah? AMOS: I can imagine that he did. I believe that, but I just was not privy to that conversation, so I can't tell you what was said or what the outcome was. I don't know. It is a hard problem, and it was something that we needed solved right then - it wasn't something my team could wait on. ANNA: I think there's two things, and I think this goes back to the config discussion, right? The solution to solve the current problem, and then the longer-term, harder solution of changing how we actually build systems so that these problems don't arise in the same way. And I don't know an effective means of communicating that quickly through an entire community. AMOS: Yeah, I don't know how you make that change quickly, but, I mean, start getting the knowledge out. You know, Chris was talking earlier about changing your baseline zero, and a lot of that is - when you look at examples of supervisors, they're not supervisors where the supervisor starts and then later starts another piece of the supervision tree, right, which is what you need for that baseline 0. Your first examples when you start working with supervisors, and the code that we write, is often 'hey, I configure this in, and at the end of init the supervisor starts up all of these children as part of its initialization process'. But in order to get the baseline zero that Chris was talking about - where, up a level, we start just part of it and then later start those submodules - first you need, in his case, the web interface to know that those other submodules aren't started and how to respond when they're not. You also have to have those submodules' supervisors start without starting their children, which is very rarely found in any examples online. CHRIS: Well, and it's much harder to do, to be clear - it's much, much harder. It's much more involved. And honestly, there's good reason that supervision trees start the way they do, which is depth-first, and they block until everything starts, and the reason you do that is to enforce guarantees. You want these things to block. Maybe you need to read a file into memory and that's a crucial part of your system working correctly, and so if you need to do that, you need to block the rest of your processes from starting until that thing is done. And if it takes 30 minutes to do that, if it takes an hour to do that, well.
That's still part of our guarantee as a stable system, so we're just gonna do that, and we're gonna block everything else until it's done. AMOS: If you can, push it to the right and down in your tree - if not everybody else is dependent on it - to make it one of the last things that starts. Or, one other strategy I've used for something that took a long time to start: I wanted it to start way early on in the tree so that it would have time to finish, but I didn't want it to block. At the time, you know, the new continue stuff wasn't there, but I had the process do a send_after to itself, so that other things can start starting and the scheduler will still give it time to finish its startup work after init. And that can be super helpful, too. That can be a way that you could actually start up your whole tree, but your leaf nodes of those trees - the ones that actually have your functionality, that are not just supervisors - could have after-init callbacks, so everything starts up in this uninitialized manner. That's another way to do it, besides having a supervisor that has to be told to start its children later - it can start its children, but the children push off their actual full initialization. But then you have to deal with a state-machine type thing, where it's like 'hey, you send me messages and I answer uninitialized', or 'you send me messages and I queue them up', and that depends on your memory and how fast you actually think you're going to start. If it's gonna be five minutes, you probably want to tell them you're not started. If you're going to be quick to start, you're probably going to want to hold those things in your mailbox - maybe just put them back on your mailbox, or some other way to handle that. The other thing is you can watch processes, right - you can monitor processes, so you can see when things are starting, failing, initialized. One of the things that I do want to touch on with these supervisors is using some kind of circuit breaker pattern. Sometimes you can't just restart over and over and over. Maybe, you know, whenever I go down three or four times in a row, if I keep trying to restart - even though I want to restart, and maybe it's not a super important part of the system - maybe I run out of memory, or I run into all sorts of other issues from trying to restart over and over and over. So one of the things is: put in a circuit breaker pattern. There's a pretty good library that I've used for it called Fuse, and a few others that I can't remember right now. And a circuit breaker pattern here is like an automatic circuit breaker - you can get these for a car or your house, where the breaker will kick back on automatically, where the circuit will restart automatically if something fails. But if you have a circuit breaker in there, the process can fail, and you can have it configured with a cool-down time period. Maybe if it fails five times in three seconds, you say, 'hey, we're not gonna try to restart for a minute'. Or you can have exponential backoffs on that, or whatever. And that process will kinda get into that initialization we talked about earlier, where it starts up, but it can't really do anything, so it doesn't crash again - and then it just sits, or it will fully restart. That's one way to implement it yourself. Those are pretty good strategies. And really - really, you can make your own pretty easily.
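A sketch of the deferred-start trick Amos describes, under the assumption of a plain GenServer worker: today you'd likely reach for handle_continue (the "continue stuff" that didn't exist at the time he's talking about), and the older pattern is the send_after-to-yourself from init. Module and message names are placeholders.

```elixir
# Hypothetical worker that returns from init/1 immediately so the rest of the
# tree keeps booting, then finishes its slow setup afterwards and answers
# {:error, :not_ready} until it's done.
defmodule MyApp.SlowStarter do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg) do
    # Modern option: hand the slow work to handle_continue without blocking
    # the supervisor. (Older pattern: Process.send_after(self(), :finish_init, 0)
    # and do the work in handle_info/2 instead.)
    {:ok, %{arg: arg, ready?: false}, {:continue, :finish_init}}
  end

  @impl true
  def handle_continue(:finish_init, state) do
    # ...do the expensive startup work here: load files, open connections, etc.
    {:noreply, %{state | ready?: true}}
  end

  @impl true
  def handle_call(:do_work, _from, %{ready?: false} = state) do
    # While uninitialized, either answer with an error (as here) or queue the request.
    {:reply, {:error, :not_ready}, state}
  end

  def handle_call(:do_work, _from, state), do: {:reply, :ok, state}
end
```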
Get a GenServer, monitor the process, get something to talk to the supervisors. Setting the child to temporary, instead of restarting it over and over, and having the monitoring process tell it when to restart, is a good way to implement that. CHRIS: Yeah, I really like Fuse. I tend to use it around all of our external communication stuff, just to stop cascading failures, so you can kinda gracefully handle them. It gives you nice tooling and nice messaging around it, so you can take different paths when you're talking about success or failure. So, say I'm talking to an external web service, and it's okay for the data to be a little stale. We'll send a request - and this is all hitting on the circuit breaker - we'll send a request to the external service, we'll get the response back, and we'll stuff the response into ETS or some kind of durable cache. If at any point those requests fail, or start to fail, we'll blow the circuit, depending on how intense the failures are or how many happen in a given time period, and we'll just read from the cache at that point. The data might be a little stale, but we're still servicing requests. And that goes along with those guarantees - it might be okay for us to return stale data in that case. The alternative would be to just return an error immediately, and then you're not wasting time trying to send a request to an external service that you don't control that's down or whatever; you can just immediately return an error, which might be more along the lines of the guarantee that you care about. AMOS: I really like that - the two different pieces there, and the mitigation strategy of returning something cached. CHRIS: And you could use a durable cache - like, put it in Redis or whatever. But, you know, it works pretty well for a lot of things. Well - I think Anna's gotta run. I think she's late. ANNA: Yeah, I am running late. I have to run. CHRIS: Alright. We should wrap this up. AMOS: Well, thank you all. Have a wonderful day. ANNA: Later, y'all.
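To make the pattern Chris describes a bit more concrete, here's a hedged sketch combining the fuse circuit-breaker library with an ETS cache fallback. It assumes the fuse Hex package (the Erlang library) is a dependency; the module name, fuse options, and the do_request/1 placeholder are illustrative, not code from the show - check the fuse docs before leaning on the exact option format.

```elixir
# Sketch: call an external service behind a circuit breaker, cache good
# responses in ETS, and fall back to the (possibly stale) cache when the
# fuse is blown. Assumes something like {:fuse, "~> 2.4"} in mix.exs.
defmodule MyApp.PricingClient do
  @fuse __MODULE__
  @cache :pricing_cache

  def setup do
    :ets.new(@cache, [:named_table, :public, read_concurrency: true])
    # Blow the fuse after 5 failures within 10 seconds; reset it after 30 seconds.
    :fuse.install(@fuse, {{:standard, 5, 10_000}, {:reset, 30_000}})
  end

  def fetch(key) do
    case :fuse.ask(@fuse, :sync) do
      :ok -> call_service(key)
      :blown -> stale_from_cache(key)
    end
  end

  defp call_service(key) do
    case do_request(key) do
      {:ok, value} ->
        :ets.insert(@cache, {key, value})
        {:ok, value}

      {:error, _reason} ->
        :fuse.melt(@fuse)      # record the failure against the fuse
        stale_from_cache(key)  # degrade to cached (possibly stale) data
    end
  end

  defp stale_from_cache(key) do
    case :ets.lookup(@cache, key) do
      [{^key, value}] -> {:ok, {:stale, value}}
      [] -> {:error, :unavailable}
    end
  end

  # Placeholder for the real HTTP/RPC call to the external service.
  defp do_request(_key), do: {:error, :not_implemented}
end
```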