AMOS (00:08): Welcome to Elixir Outlaws, the hallway track of the Elixir community.

Paul (00:16): My dad's side of the family, he had nine brothers and sisters, so ten total. In that environment you have to be the loudest in the room, right? And of course me and my sister inherited that, which is really annoying for the people who interact with me in conversations now, because they're like, "Just stop interrupting me, I'm trying to say something."

AMOS (00:38): Are you saying I can blame that on inheritance?

Paul (00:41): Well, I have that excuse. I don't know about you, man.

AMOS (00:44): My mom has nine brothers and sisters.

Paul (00:47): Oh, well, yeah, it sounds like it. See, it's inherited from the parents because they do it to you. As a kid you had this problem where they're talking over you whenever you're having a conversation, so you end up fighting back a little bit.

CHRIS (01:05): My dad is one of eight as well, and it's the same thing. Our whole family, all of us kids, me and my sisters, we all yell at each other right over the top of everything. And it drives my wife crazy. The first time she came home she was like, "You guys fight all the time." I'm like, no, this is just us talking. You have to do that in order to be heard, right?

AMOS (01:28): Yeah. I have five kids, so my house is that way.

Paul (01:31): You're padded on both sides with this problem, so you definitely have an excuse.

AMOS (01:36): And my wife was the quiet kid who just sits back and listens.

Paul (01:40): That's probably going to change eventually, if it hasn't already.

CHRIS (01:44): It's a survival mechanism, right?

Paul (01:46): Yeah, exactly.

CHRIS (01:47): You just have to do it.

AMOS (01:48): Yep. She's one of those people that starts to say something, and then if somebody talks over top of her, she's done, and she'll never tell you what she was going to say, no matter how important it was.
She's like, no, you live with that.

Paul (01:59): She'll make you pay for interrupting.

AMOS (02:03): Yeah. She's so strong-willed that, you know, they say you can't change other people. Well, she's really good at it.

Paul (02:12): That's amazing. You have one of two options to retaliate in an environment like that, right? Either you become louder than everybody else, or you withhold the information you were in the middle of saying when you were interrupted, so then people feel bad.

AMOS (02:28): Yep, I constantly feel bad. She's really good at that, but I love her to death. She also likes to take me seriously whenever I'm joking around, which breaks me of a lot of jokes.

Paul (02:42): My wife will do that sometimes too. I have a problem with being sarcastic too often, and so to break me of that, she'll every once in a while just take me a hundred percent dead serious.

AMOS (02:54): Yep. That's what mine does.

CHRIS (02:56): Hashtag my elixir status.

AMOS (03:01): Now we've got the family show out of the way.

Paul (03:04): Right? I mean, you've got to kick things off with that, get that out of the way, and then get to the Elixir stuff.

CHRIS (03:10): The Elixir parts are boring compared to the family stuff. That's the fun part, the family.

Paul (03:14): It definitely can be much more interesting to hear about what people have going on in their lives, for sure. Particularly when, in the community, you see faces and names and you don't really know anything about any of those people. You feel like you get to know people just interacting on Slack or IRC or whatever, but you actually don't necessarily know a whole lot about what's going on in their lives. It's funny how that works.

AMOS (03:35): When I meet people at conferences, I try to ask them about their personal life. I have had people be like, "That's a little personal." I'm like, what, asking if you have children? Okay?
Paul (03:44): So you discover the people that take privacy really seriously and want to keep their work and personal life separate.

CHRIS (03:54): That's part of the benefit of going to conferences. You can watch a talk on YouTube and get whatever factual information you were looking for out of that. The benefit of going is not the talks; it's that you go to meet people, make connections, get to know people, go have dinner, and whatever else.

Paul (04:12): For sure.

AMOS (04:13): Strange Loop gives out different lanyards, and you can have a lanyard that says, "Do not take my picture, I don't even want to be in any pictures." Then they have one that's like, "You must ask me," and one that's like, "Hey, take all the pictures you want." I usually put that one on and take my shirt off.

Paul (04:31): Right? That's interesting. I haven't seen that yet, I guess, but it makes sense. Some people are particularly sensitive about that.

AMOS (04:42): Yeah, you don't know what kind of background somebody is coming from. They could be thinking, "I don't want my family to even know where I am," like my extended family; there are people in my family that I would prefer to never speak to.

Paul (04:53): Yeah, or somebody who's got a darker past and might even have somebody that's trying to find them, so they want to stay low key under an anonymous name. But like Chris said, the whole point of the conference is to be able to get together and talk with people, not so much the photo-op side of it. That's nice for the conference organizers, but for the rest of us I think it's mostly just a chance to actually put faces to names and get to know a little bit about the people you're working with on projects and the open source side of things.

CHRIS (05:26): That's where we met. I think it was at Lone Star. Yeah.
I mean, that conference was fine, whatever, but the things that I got out of that conference were that I got to hang out with you, I got to sit and talk with Dave for a while, I talked with Chris for a while. We got to establish these sorts of relationships, and I think that's where a lot of the conversations we've had since then came from: just the fact that we got to hang out at one conference.

Paul (05:49): Lone Star in particular, I feel, is great for that because it's a smaller conference, but it's also kind of where things started. So for me, at least, it's got some of that, and it seems like a lot of the people who show up there are either locals, because it's convenient, or people who were there fairly early on in Elixir's life and just want to come back and see people again. You were giving a talk on consistent state, right? And that spawned a whole bunch of conversations since then; we were talking at Code BEAM a little bit about some of that stuff, and that was when you were working on Raft.

AMOS (06:26): Was that last year?

Paul (06:27): That was the spring, or spring? Winter.

AMOS (06:29): I guess right around ElixirDaze.

CHRIS (06:32): Or no, was that last year?

Paul (06:34): No, it was February and March for those two conferences.

CHRIS (06:40): Yeah, that's right. ElixirDaze was that for me, because it was the first time I felt like I had gone to a conference where I could just go and hang out with people. That was something that was really, really important and special about ElixirDaze: there was no wall between speakers and everybody else. Nobody was flying in and then leaving the same day, which is a thing that happens at big conferences when you get these big-name speakers or keynote speakers, and they literally fly in, give the talk, and then fly out because they've got other stuff to go do.
And there was none of that, so you get to hang out with all these people. At one point I was just looking around and I was like, oh, I'm at dinner with Dave Thomas and Francesco and James Fish, and Sasha and Paul, all these people, having these important conversations. I don't know, it was really, really wild for me. That's basically why I have my job, because I met Ben there, and then we just started talking to each other, hanging out, met up at other conferences and stuff like that. And that's how I'm at Bleacher Report now.

AMOS (07:39): I moved, at ElixirDaze, from doing web development and Phoenix stuff to mostly doing Nerves work now, because I ended up at the same table as Frank Hunleth, and we just talked. I like hardware; I've played with it a lot. So from there it turned into, now for two years, I've done hardware work.

Paul (07:57): Speaking of Frank, he's like my favorite person in the Elixir community. Truly the most awesome individual I've ever met.

AMOS (08:05): Yeah. I always tell him that I can't figure out who's nicer in the Elixir community, him or José. Frank has been super awesome. So if you're listening, Frank, thank you.

Paul (08:16): Thank you for being you, Frank.

AMOS (08:18): That's right. He's like the nicest person ever.

Paul (08:21): As far as Elixir things these days, I literally just released Distillery 2.0 yesterday. Technically it was the night before, but I had a conversation with José where I tweaked some things, so 2.0.1 is the final release. I think that one is going to be pretty big. It's taking configuration in a direction that I think the core team wants to go eventually, but it gives us sort of a platform for testing out the ideas and seeing what works and what doesn't.

AMOS (08:56): My intern was super excited last week when I told him you were coming on. I said, "Hey, is there anything you want me to ask him?" And the very top thing on the list was, "When can we expect an official Distillery 2.0 release?" Then this morning I said, "Hey, why don't we make sure these are still relevant?" And he pulls it up and he's like, "Well, 2.0 was released."

Paul (09:14): Yeah, sorry to ruin the question. I am going to write a post, more of an official announcement, this afternoon, but I haven't gotten around to it, mostly because I was cleaning up small things and tweaking the tests yesterday. I figured I'd at least put the release out there for the people who were already using the RC releases and helping me test things. That was awesome; a lot of good feedback came out of that and kind of shaped things. The way I was originally going to take it, the big feature in 2.0, was this thing called config providers. What it basically is, is a behaviour that you can implement that, when run, receives whatever arguments you defined for it in your release configuration, and is supposed to push whatever config it resolves into the application environment. This runs early in boot, at least initially that was my plan, before really anything has started except for the really core stuff. All the providers would run, and then at that point all of your applications could boot and work with a fully reified config.

AMOS (10:20): So are these runtime configurations?

Paul (10:23): Yeah, exactly. It's for runtime config. They're runtime-ish. Runtime-ish, yeah.

CHRIS (10:30): They're early enough in the boot that you can stop the boot process if the stuff isn't there, which is part of the guarantee that you want to give to people. Let's just use environment variables, because that's what most people end up using.
So you have to have these environment variables present, or we stop the boot process, to some degree. That's a guarantee you can give people with the way you've constructed this stuff.

Paul (10:55): Yeah, exactly. You can do all your validation and transformation before anything is booted, and once you've gotten past that point you at least have that out of the way: the configuration should be valid for your application. Now, you might not want to put all of that in a provider; it depends. I would probably put as much of the validation and transformation there as I can, but you may want some of it to be part of your application, because you may not necessarily know if configuration is valid until you try to connect to a database, for example. Your providers shouldn't really be doing that kind of thing at all, but you can at least make sure the values are the right types, that they're sane, that sort of thing. The intent really is to still adhere to the way we want people in the community to do configuration, which is a very functional paradigm, right? You make sure all your configuration is ready to go during your startup, and then you push it down through your supervision tree. Once you've got your big config object, or whatever the data structure happens to be, it gets pushed through your tree in whatever shape it needs to be, the idea being that once you've initialized your supervision tree with this configuration, it's fixed. You can predict the behavior of the system because the configuration is not going to change underneath you because of some global state. That's the problem with the application environment, right? You change some value here and it potentially changes how your system behaves, which some people might, on the surface, think is what they want: "I want to change the config and have the system adapt to that." But in practice that's not what happens; instead it just breaks stuff, or it doesn't actually change anything. You change the listen port for your HTTP listener and it doesn't actually change the port, but it appears like the configuration changed, that sort of thing. So the way config providers have ended up is that it's taken a step further now. Pre-boot, before we ever boot the release itself, we run an instance of the VM with all the code loaded and enough applications started to run the providers. The providers push everything into the application environment, and after all the providers have run, that gets persisted into the sys.config file. The release then boots with that. This allows us to leverage all the internals in OTP for dealing with application upgrades and that kind of thing, which are all built around sys.config, but it also forces people to make sure that their configuration really is decided when the release boots. You still have all the power of being able to decide things at runtime, but it prevents you from doing things like rerunning the providers or trying to use them for global state. We're pushing stuff into the application environment at boot of the release, and it should never change after that. That should be fixed data; if you're using it for some kind of mutable state, that's really not an appropriate place. So where I want people to go, and I know José feels the same, and you guys as well, I think, from some of the conversations you've had, is really this: decide as much config as you can at boot, push it through your application, and if you have to change it at runtime, that's fine, but there's an OTP mechanism for doing that, and that's the config_change callback in the application module.
So you implement this, and when the config changes, whether that's via a hot upgrade or whether it's manually triggered, you can instruct parts of your application to restart or do whatever they need to do to react to the changes. And I know in Phoenix, if you look at, I think, the application module that it generates, or maybe it was the endpoint implementation, and I don't know if it still does this, it actually generated that config_change callback for you so that Phoenix itself could deal with those config changes, because it needed to react to some of them. It was only doing it for Phoenix itself, but in that callback you can do multiple things. It's per-application, so each application has an opportunity to react to changes in its config. I don't think a lot of people know that it's there, or how it works, or how it's intended to be used, but it's well worth investigating, because it's a very, very powerful tool. You get this opportunity to have a deterministic configuration: you push all your config into your supervision tree, and then it is the way it is, which allows you to reason about your system from this source config. But if you do need to change it at runtime, there is a mechanism for that which doesn't necessarily require you to completely restart your release; you just have to follow the OTP architecture for it. So anyway, that's the direction all of that has gone, and it's the major thing that's in Distillery 2.0. There are some other features as well, but that's the one where a lot of the work really went. Ultimately it was the prerequisite for getting some release tooling into core itself: we needed to figure out a solution to config, see how it plays out in practice, and then formalize those things. Once that's been formalized and the prerequisites are in Elixir, then we can build the core tools out.
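For reference, a minimal sketch of that callback. The module and the `:listen_port` key are hypothetical, and `MyApp.Endpoint` stands in for a process that re-reads its config, mirroring the `config_change/3` Phoenix generates for its own endpoint:

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    # Children like an endpoint or a repo would normally go here.
    Supervisor.start_link([], strategy: :one_for_one, name: MyApp.Supervisor)
  end

  # Invoked by OTP after this application's env changes (e.g. during a
  # release upgrade): `changed` is a keyword list of updated keys, `new`
  # lists added keys, and `removed` lists deleted keys.
  def config_change(changed, _new, removed) do
    # MyApp.Endpoint here is illustrative, not a real Phoenix module.
    if changed[:listen_port] || :listen_port in removed do
      MyApp.Endpoint.config_change(changed, removed)
    end

    :ok
  end
end
```

With nothing relevant changed, the callback is a no-op that returns `:ok`, which is the contract OTP expects.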
So that's kind of the direction.

AMOS (16:01): I think that config_change callback, if it gets documented in the Elixir documentation so people aren't going off to the Erlang documentation to find it, will probably start to be used a lot more too.

Paul (16:11): Yeah, I had been meaning to write a bit more about that in the Distillery docs. I'm also working on a book for Pragmatic Programmers where I dig into a little bit more of the meat of that stuff, but I still feel like it should be obvious stuff that you see right out of the gate when you're starting to build applications, so I want to at least make it part of the Distillery docs as part of the rewrite. I did write sort of a guideline on configuration in general, not necessarily specific to releases: just how to architect your application and deal with runtime config, that sort of thing. And then it ties into, of course, the features that Distillery has to let you do configuration. Ultimately it's more about: how do I even design an application when I have parts of it that may change at runtime? I think once that becomes a little more commonplace, there are more examples, books pick it up, it's present in talks, that sort of thing, it will become much more second nature. That's why there's so much momentum behind some of the bad patterns, if you will: a lot of those evolved early on in response to just the state of the world at that time, trying to figure out how to cope with some of the problems. Unfortunately it ended up in a direction that we later realized was not ideal, but because of all the existing content and directions around that, now it's kind of this confusing mix of advice. So speaking loudly and clearly about where things are going and how to do things now is important, I think, because it will help dispel some of the confusion.

CHRIS (17:47): I completely agree.
I mean, these discussions have been happening more and more, which is good, and they've been getting louder and louder. I think that forum post from forever ago, about what if we had this pre-boot, on-boot thing as part of config.exs or whatever, sparked a lot of these conversations too. So it's good that people are actually talking about this stuff now.

Paul (18:06): Yeah. And that brings up another thing I forgot about, which is that there is a config provider for Elixir's config files. It's not enabled by default; none of the config providers are. They're all opt-in: you choose the ones that you want, or build one of your own design. But the Mix config one is designed so that you can write a release-specific config.exs, stored separately from your other Mix config files, like under rel/config or something; that's how I've been doing it in my examples and in some of the test applications I've built around this. In there you just set the things that change at runtime, right? Your Mix config should be all static stuff for the most part; compile-time config really is its purpose, along with defining the config settings that your application needs. The release-time config file would be just the things that change at runtime, and you do whatever you need to do to set those in that config file. It at least allows you to reuse all the intuition you have about Mix config files at this point. The one thing that still is not going to work is dumping functions in Mix config. You can call functions, but setting them as values for config is not going to work; you still need to use something like an MFA tuple, because ultimately it's going to be persisted to sys.config. But the problem historically with the sys.config we generated was that we were converting the Mix config file to it, and you didn't have any runtime control over it.
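A sketch of what such a provider can look like. The module name, the `{app, key, env_var}` argument shape, and the `required!/1` helper are all made up for illustration; Distillery's own codebase defines the actual behaviour and callback names:

```elixir
defmodule MyApp.Providers.Env do
  # A provider receives its arguments from the release configuration and,
  # when run early in boot, pushes resolved values into the application
  # environment so every app later boots against a fully reified config.
  def init(vars) do
    for {app, key, env_var} <- vars do
      Application.put_env(app, key, required!(env_var), persistent: true)
    end

    :ok
  end

  # Fail loudly pre-boot instead of letting the app start half-configured.
  defp required!(env_var) do
    System.get_env(env_var) ||
      raise "missing required environment variable #{env_var}, refusing to boot"
  end
end
```

In Distillery you would opt into a provider like this from your release config (the `config_providers` setting), passing it the list of variables to resolve, e.g. `[{:my_app, :port, "PORT"}]`.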
That's why we had the REPLACE_OS_VARS business and all that jazz, and that still exists, but it's there primarily for the vm.args file now; there are still some useful tools there. I would like to eventually extend the config provider stuff to somehow let you influence the vm.args that gets passed to the final release. I'm not really sure what that would look like at this point, but there are some potential options there. sync_nodes_mandatory and sync_nodes_optional, if you're using distributed applications, all take node names, so it would be nice to be able to use config providers to figure those out and then inject them into the vm.args that ultimately gets generated. Right now there are some tricks you have to pull to make that work. Actually, that was another thing I added to the Distillery documentation in the rewrite: instructions on how to do that specifically. So if you are using those settings, which I've seen quite a bit more often than I thought I would, you at least have some information on how to get there.

CHRIS (20:45): Based on what I've seen inside the Distillery docs so far, because currently we don't live the release lifestyle, it looks like exactly the kind of thing that you want to provide, in the sense that most people don't actually want to boot their apps if they don't have the environment variables, or if they can't talk to etcd and pull in their initial config; that's just not a guarantee. We don't have applications constructed in such a way that they can boot in degraded modes that often, because typically, if you're building a stock, run-of-the-mill Phoenix, Ecto, whatever app, you boot that thing up and it's going to start all these workers, and if it can't start them in some way, it needs to just fail, because it can't run in degraded mode.
That gets to a conversation I wanted to have with you anyway, about this whole stacking theory, like how do we layer applications. That's a really, really good guarantee to give people about how their applications are going to boot. And it probably removes a lot of the implicit REPLACE_OS_VARS, dollar-sign-variable stuff, and it works for things that aren't just environment variables. That's really awesome.

Paul (21:52): And being able to convert some friendlier form of config to whatever internal structure it needs. Recently I wrote a TOML library for Elixir, just because I felt like writing another parser.

CHRIS (22:08): As you do.

Paul (22:10): Yeah, as you do. And there was an opportunity there to make a config provider, so I wrote a provider for that library as part of it. And I added an extension for doing arbitrary transformation of the keys. You parse out a config file and it has all the values in it; TOML has some primitive types, but you may want to convert, say, an IP address string into an actual IP address, something like that. So it has this idea of transforms that you can configure it with. In general, the idea of taking a value such as a string and converting it to the actual data structure that your application needs is super common, and I think that's where providers really have a lot of benefit as well, beyond just making sure that you have config: giving your application the opportunity to convert all that stuff into the form that's needed, so that the rest of your application doesn't have to care about it. You can code very assertively, knowing that if that config wasn't set, the application never booted in the first place, and if it was set, you're going to get the right type here.
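That assertive style can be sketched like this. `ConfigTransforms` and the keys under `:my_app` are hypothetical; `:inet.parse_address/1` is the stock Erlang parser:

```elixir
defmodule ConfigTransforms do
  # Convert raw strings (from env vars, TOML, etc.) into the typed values
  # the rest of the app expects, failing at boot rather than deep inside
  # a worker at some later point.
  def ip!(raw) do
    case :inet.parse_address(String.to_charlist(raw)) do
      {:ok, ip} -> ip
      {:error, _} -> raise ArgumentError, "#{inspect(raw)} is not a valid IP address"
    end
  end

  def port!(raw) do
    case Integer.parse(raw) do
      {port, ""} when port in 1..65_535 -> port
      _ -> raise ArgumentError, "#{inspect(raw)} is not a valid port"
    end
  end
end

# A provider would install the converted values, so callers can pattern
# match on real tuples and integers instead of re-parsing strings:
Application.put_env(:my_app, :bind_ip, ConfigTransforms.ip!("0.0.0.0"))
Application.put_env(:my_app, :http_port, ConfigTransforms.port!("4000"))
```

After this runs, downstream code can write `{a, b, c, d} = Application.get_env(:my_app, :bind_ip)` with no defensive conversion at all.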
So you don't have to spend a lot of time going, okay, I'm going to try to pull a value from the application environment, and maybe it's one of those REPLACE_OS_VARS strings, or maybe it's a string that's supposed to be an integer, and you have to do all this work to convert it and then deal with failures converting it. It was just a nightmare. Like you were saying, Chris, that's really the big benefit of providers, and it ties into the stacking theory side of things, where the baseline of your system includes configuration. If you've gotten to the baseline, config is taken care of; your system should never have booted if you were missing required configuration.

CHRIS (23:52): Right. You shouldn't try to start the application and have Ecto try to connect to a database when the database host environment variable is nil or whatever, and then just blow up for some obscure reason. You should never get to that point; if the stuff's not there, break.

Paul (24:05): Yeah, that's the ideal for sure. That's kind of how I think some of this config stuff evolved as well, just knowing that that was going to have to be the case. And we've seen that in practice too; that's what people are expecting from their config: that it's all decided when the release boots. As far as the stacking theory stuff goes in general, and starting in a degraded condition, I think that should be more common than not. I think it's not common because people don't necessarily want to think about the implications, but it's not always as difficult as it appears. I wrote, as part of the new docs as well, a guide for deploying to AWS, and in that it bootstraps a whole infrastructure using CloudFormation, so that people don't really have to deal with that.
The general setup is an auto-scaling group tied to a CodePipeline/CodeDeploy CI setup, with a load balancer sitting in front of it. It's a Phoenix application that talks to an RDS database; I tweaked the standard TodoMVC front-end thing to talk to Phoenix and actually write to the database as its backing store. That example application is the one the guide is built around. But one of the things I threw in there, just because I wanted to see if it would fit into the guide as an interesting idea, is that it's using Ecto, but rather than starting Ecto as part of the application supervision tree, as the default generated application does, I have another module that's a proxy for it: a database module that boots, and that thing tries to start Ecto in its init callback. If it fails, that's fine; it sets a flag in an ETS table that is publicly accessible, so callers can check whether the database is available. Then it just manages watching that Ecto supervisor: it tries to start it every five seconds if it's not available, and once it has started successfully, it monitors it. If Ecto goes down, it sets the flag appropriately and will attempt to bring it back again. That's one of the ideas of stacking theory, right? You have the baseline of your system, and you're trying to step up a level of operational functionality by trying to do something that's a little riskier than the baseline your system can support. Foundationally it's almost always a database or a message queue of some kind, right? Rabbit, Kafka, something like that, exactly. So you want most of your application to interact with a proxy of that resource, because of how fragile those resources typically are. You don't want every part of your application that needs the queue or the database to be exposed to its fragility.
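The proxy described here might look roughly like the sketch below. `MyApp.Repo` is assumed to be an Ecto repo, and the table name, retry interval, and module names are illustrative:

```elixir
defmodule MyApp.Database do
  use GenServer

  @table :my_app_db_status
  @retry_ms 5_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Callers check availability via public ETS, never touching the proxy process.
  def available?, do: :ets.lookup_element(@table, :available, 2)

  @impl true
  def init(_opts) do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
    :ets.insert(@table, {:available, false})
    {:ok, %{}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state), do: try_start(state)

  @impl true
  def handle_info(:retry, state), do: try_start(state)

  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # The repo died: flip the flag and schedule another attempt.
    :ets.insert(@table, {:available, false})
    Process.send_after(self(), :retry, @retry_ms)
    {:noreply, state}
  end

  defp try_start(state) do
    case MyApp.Repo.start_link() do
      {:ok, pid} ->
        Process.monitor(pid)
        :ets.insert(@table, {:available, true})
        {:noreply, state}

      {:error, _reason} ->
        # Not available yet; stay booted in degraded mode and retry.
        Process.send_after(self(), :retry, @retry_ms)
        {:noreply, state}
    end
  end
end
```

Callers then ask `MyApp.Database.available?/0`, or simply attempt the operation and handle the error, instead of assuming the repo is up.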
You want them to be able to say, "I want this written to the database," or "I want to read this from the database," and get back either the result, or "yes, that was successful," or "there was a problem, and I don't know what you want to do about it," and let the client deal with whether it wants to retry or just explode, whatever it wants to do. But you need to give those choices to the clients, and that's what the proxy module is for. The proxy might be individual workers in a pool that each maintain their own connection to the database and do that monitor-and-restart thing if necessary, or something like what I threw together, which just monitors Ecto itself, since Ecto has its own pool management. The idea is that your application should be able to start regardless. To use the example, when the RDS database isn't available, the application will still start up and the endpoints will work, but they'll return errors for operations which are unavailable because the database is down. The health check endpoint will return, "Hey, the application is up, but it's degraded." That's kind of the deal, particularly with things like liveness and readiness checks: you want your application to be able to respond to whatever is looking at that and say, "Okay, this application is degraded." Then, for whatever reason, that thing can fire off an alert, disconnect it from the load balancer, and keep checking until it says it's okay again. I think that sort of architecture makes a lot of sense. Now, it gets a little hairy if you have nested levels of this, right? But that's what stacking theory is: the intent is that you can layer this, so that once you've gotten to your database, the things that depend on that database to do something else, another layer of services in the system, can do the same thing.
They use that database connectivity as their baseline, and if that database goes down, then they can crash, restart, and try to come back. It's not that every layer has to care about everything below it and monitor all of it; all it needs to do is care about its baseline. If the database goes down, and that's my assumed baseline, then I need to just die, just crash. And then once the database comes back, the way the application is structured, that process will be restarted and can try to build back up from there. It's a more complicated architecture, for sure. But if you want your application to be resilient, which is oftentimes the whole purpose of using Elixir or Erlang, this is kind of table stakes for a really resilient system. I mean, if you can get away with applications that just kind of crash and burn, because you're running a bunch of them and you've got the load balancer dealing with that, or an orchestrator like Kubernetes doing it, then maybe it's not as important. But if you're running a heavily stateful cluster where each node matters, then you can't afford to have some transient failure blow up all your nodes; they have to be able to deal with it and self-heal. Otherwise you don't get a whole lot of benefit out of supervision.

CHRIS (29:48): Yeah, exactly. I can think of plenty of scenarios, even with stuff at work, where we need to be able to heal from things like databases just going down, or dropping connections, or massive latency spikes, all these kinds of things that can kill some of the different clusters, depending on the application. And you have to be able to degrade from that; it's a problem if your only method of degrading is either we're up or we're not, if it's all or nothing.

Paul (30:17): Yeah. I was just reading something, I forget who wrote it. I think her name was Cynthia something.
I, I wish I remembered her name, but CHRIS (30:25): We can find it and add it to the notes. Paul (30:27): I can track it down for sure. Um, she wrote a really interesting article on designing systems from more of an infrastructural point of view, where flow control and back pressure and all that are part of the system, like how you need more active feedback between things like a load balancer and downstream systems. You can't just have, like, this on/off switch, because then you get really weird behavior in the system as a whole, right? Things aren't dealing well with higher-than-normal load, like they're overloaded, but they're not crashing yet. If you just have this on/off condition, traffic is going to continue to be routed to those things until they just explode. And that's not ideal. What you want is to be able to say, okay, that node can't deal with much more traffic, so I'm not going to route anything more to it for now, until it says, like, okay, my load is going down. And that might be as simple as just tying, like, CPU usage into your load balancer's feedback, so it can say, oh, this app is at like 70% CPU and these others are at 20 and 30%. I'm going to route all traffic to the 20 and 30 until things balance out again. And I know HAProxy for sure has a thing that can hook into that. Uh, it's got this thing called agent checks, I think, or something like that. It's just a TCP socket that you write some, you know, well-structured text to, basically, but you can feed it information that it will then pull into its load balancing algorithm. And the result of a system like that is that it deals much more sanely with load, because things are smoother; the way it reacts to load and to things breaking is not as uneven or chaotic.
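The HAProxy agent check Paul mentions is, at its core, a tiny TCP responder: HAProxy connects periodically, reads a one-line reply like `up 70%`, and uses the percentage to adjust that server's weight. A minimal sketch in Elixir, with the load number hard-coded as a stand-in for a real CPU measurement (consult the HAProxy `agent-check` documentation for the full reply grammar):

```elixir
defmodule AgentCheck do
  # Sketch of an HAProxy agent-check responder. HAProxy connects, we
  # answer with one line and close. "up 70%" means: server is healthy,
  # but weight it at 70%.
  def listen(port) do
    {:ok, socket} =
      :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])

    accept_loop(socket)
  end

  defp accept_loop(socket) do
    {:ok, client} = :gen_tcp.accept(socket)
    :gen_tcp.send(client, response())
    :gen_tcp.close(client)
    accept_loop(socket)
  end

  # Report less available weight as CPU climbs, so the balancer shifts
  # traffic away before the node actually falls over.
  def response do
    "up #{max(100 - load_percent(), 1)}%\n"
  end

  # Stand-in value; a real implementation would sample CPU, e.g. via
  # the :os_mon application's :cpu_sup.
  defp load_percent, do: 30
end
```

Started with `AgentCheck.listen(9999)` and pointed at from the backend's `agent-check` settings, this is the "feed CPU usage into the load balancer" loop from the conversation.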
You have an opportunity to actually shift things around in the system, or even slow things down at the load balancer before it gets to your application, depending on how you want to deal with that stuff. But that idea was something that I hadn't really considered before. I don't know why; it seems kind of obvious in retrospect, but I guess it's just that the tooling isn't necessarily there for that. CHRIS (32:27): I mean, first of all, you have to be able to monitor your system in such a way that you know what your failure scenarios are and you know how they're going to get overloaded. Because if you have a fully IOPS-bound problem, that's different than if you have a fully CPU-bound problem, and realistically services are some mix of the two, right? And then it's just hard, because you have to build all of that at almost an application level. You have to be able to get the observability out of there and then pipe it into your load balancers. And I think we'd like to think of these things as just being really dumb. Oh, I'll just put another cache in front of it, or I'll just put another load balancer in front of it, and then I don't have to worry about it. AMOS (33:00): You hop into an embedded system and you have to build all this stuff internally yourself. You're not going to throw HAProxy in front of something. You actually have to build this into your Elixir application. Like, if you're passing stuff off to a pool, you have to be able to balance stuff in that pool based on each individual worker sometimes. Paul (33:20): I mean, there are some great Erlang tools even for doing that within an application. From just an observability standpoint, you have alarms, which I rarely see people use, but which are a hugely beneficial operational tool, because you have the ability to trigger sort of this global state that can be monitored, but can also be used to drive decisions in the system. So you can trigger an alarm
in one part of the system, and another part of the system can be watching for that alarm condition and be able to react to it and say, okay, I'm not going to route traffic to this pool anymore, or this worker, whatever it happens to be, because it's in, like, an overloaded condition. AMOS (33:58): Are the alarms a set group of things, like number of messages in a queue? Or is it something where it's a framework? Paul (34:05): You can attach metadata to alarms, but alarms in general are more of an on/off type situation. So an alarm would be triggered when, say, CPU usage has been above whatever for, like, 30 seconds. So you'd set the alarm at that point, with maybe some metadata about what that is, or why you care about it, whatever your system needs with regards to that alarm. Other parts of the system, say whatever is doing ingest for the data coming into the system, can then see that that alarm has been triggered and factor that into their decisions about back pressure, or whether they're going to even accept requests, or how they schedule work. Even, like, drawing down the amount of concurrency in the system if there are too many processes or something like that. You know, how you react to it is not set in stone. It's really up to your application what you need it for. The benefit of alarms, though, is that they're not just a metric. It's not something that you push into a time series store and later on come in and look at. These are things that your system needs to deal with now. Whether it's notifying an actual human being, like you could drive PagerDuty alerts off of alarms or an alerting system, or you can actually, within the system itself, automatically react to alarms. It depends on what the alarms are and how you want to deal with them, but they are stateful. So you can turn them on, and then later say, oh, this has been addressed. Or you can even say, like, I'm looking at this right now.
Actually, that feature might be part of the alarm library, not... CHRIS (35:40): The alarm handler that's built in? It doesn't give you much more than it's on or off, but that's the cool part. From an embedded systems perspective, you know, you don't have to be reliant on getting your metrics out to a warehouse or wherever they are, to some statsd or stats collector or wherever, and then triggering alarms off of that. You can do this literally on the device, because the alarm handler is just built into SASL, right? AMOS (36:07): And you can throw all your MQTT messages on the floor when your CPU's spiked. CHRIS (36:11): I mean, you can choose exactly what you want to do. Like, you could do some of that based on metrics you collect from Exometer. You could literally do all of this inside of the BEAM if you needed to. Paul (36:22): Yeah. I know Francesco touched on it a little bit in his Designing for Scalability book. And I think that's part of the whole OAM concept, you know, I forget what it all stands for, observability, administration, management, or whatever. But the point is that it's part of this suite of tools that metrics are part of, which gives you sort of longer-term observability: the ability to go in and say, okay, we saw this spike in HTTP requests in this timeframe, and here's the impact it had on the system, or, the system crashed after getting a bunch of messages, and here are all the different things that changed in the system over this time period. That's super useful for kind of coming back in and diagnosing a problem, right? But you have to know the problem happened. And that's where alarms come in. I mean, if your system crashes, then yeah, you're going to know that something went wrong. CHRIS (37:13): Yeah. Well, hopefully. Paul (37:17): You've got a bigger problem probably, but yeah. Alarms are a key part of that.
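For reference, the built-in handler they're talking about is `:alarm_handler` in OTP's SASL application. A quick sketch of setting, reading, and clearing an alarm from Elixir; the alarm id and metadata here are made up, and reading alarms back via `:gen_event.call/3` assumes the default simple handler is still installed:

```elixir
# SASL owns the alarm_handler process, so make sure it's running.
{:ok, _} = Application.ensure_all_started(:sasl)

# Raise an alarm: a tuple of {alarm_id, description_or_metadata}.
:alarm_handler.set_alarm({:db_overloaded, %{node: node(), cpu: 92}})

# The default simple handler keeps active alarms in its state; other
# parts of the system can poll this and, say, shed load while it's set.
alarms = :gen_event.call(:alarm_handler, :alarm_handler, :get_alarms)

# Clear by alarm id once the condition passes.
:alarm_handler.clear_alarm(:db_overloaded)
```

This is the on/off-plus-metadata shape Paul describes: `set_alarm` and `clear_alarm` are essentially the whole write API, and anything richer, like routing alarms to PagerDuty, means adding your own `gen_event` handler alongside or instead of the simple one.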
You need that sort of instant, this-is-happening-right-now signal, and then you need to be able to go and say, well, why did that happen? You know, if this is part of the system's design and we just reached some threshold, that's fine. And actually, I think alarms are used even internally for some things. Um, I believe that if you use the os_mon application to do some monitoring on the system, like CPU and so on, if garbage collection is taking too long, or, there's a few different flags that you can set, uh, I'm pretty sure those use alarms to set those conditions. Uh, they also get logged to standard out, I think. The ability to choose whether it's automatically dealt with, or you deal with it yourself as a human logging in with a remote shell or whatever, is part of the benefit, too. You don't necessarily have to make your system automatically react to every condition. But I think you do want to tie some observability tools into that. Whether it's something like PagerDuty or whatever, maybe it's just part of your health check endpoint returning any active alarms. CHRIS (38:20): We do that for some alarms. Like, we have some custom alarms that we throw from different services, and we use the alarm handler for this. And then we don't always push, like, pages directly, but we push events that can then trigger a page from our sort of unified stats and event collector. Paul (38:40): Yeah. And if you've got a lot of alarms that are triggering really often, you for sure don't want those tied into that kind of alerting. CHRIS (38:47): There's a level of indirection there, just because then we can choose how the pages actually work, and, like, whether we need to page, or if we need to add extra data. I'm a big fan of putting diagnostic steps in all my pages. So you start getting a list of, like, here's how you go get the logs, here's the page to go look at the metrics, here's how you start to observe stuff. I put all that stuff in my alerts.
When they go out on a page, I throw, like, Datadog charts in there and whatever else, so you can start to diagnose it from that. That is super useful. So that's easier to manage outside of the system. But if all you have is the system, then, you know, you can probably make something work. Paul (39:23): Yeah. I think in general, the approach that you've taken is kind of where you'd want things to be, like the alarms from one node in your system shouldn't necessarily be driving alerts directly. You want them to be dealt with in aggregate. So whatever is watching the whole system is checking for alarms and then dealing with whether it needs to aggregate those into one alert, or, you know, just one for that particular node, who knows. But I do think that, yeah, it's super beneficial to have, if you're on the hook for fixing problems, a lot of those steps: here's the logs, here's the charts, you know, of the top metrics that we care about for this thing that broke. CHRIS (40:04): Or SLAs: here's where they got violated, like, here's why. Paul (40:09): Super, super beneficial. Yeah. CHRIS (40:11): Especially, like, you know, when you have a team that is of a size where not everybody has that sort of hard-won knowledge, and you need to start diagnosing problems quickly for whatever reason, like you're not on call, or someone else is having to pick this up, or whatever it is. Even with the hard-won knowledge, it's nice to have that stuff in your face and not have to go dig it out. Just to get clarity, I click links and all of a sudden I'm looking at stuff that matters. Paul (40:33): For sure. Yep. Yeah. It's almost a dashboard of sorts for that particular problem. AMOS (40:39): Hello everyone, it's Amos. We had a really long conversation with Paul, so we've split this into two episodes. So I guess you'll have to come back next week to hear the rest.