Paul: Hi there and welcome to PodRocket. I'm your host, Paul, and with me today is Jimmy Bogard. Jimmy is a consultant and creator. Some things he's made: AutoMapper and MediatR. And am I saying that right, Jimmy?
Jimmy Bogard: I don't know. It's close enough.
Paul: Well, it's spelled M-E-D-I-A-T-R, for people listening, if you want to go search it up on his GitHub. And we're going to be getting into talking about platform choreography, orchestration, how microservices talk, and focusing on the talk that you gave at the DevTernity conference, Effective Microservice Communication and Conversation Patterns. Welcome to the podcast, Jimmy.
Jimmy Bogard: Thanks for having me.
Paul: I'm really excited to get into this topic because everybody deals with this, microservice communication, everybody. I don't care what level of the stack you work on, unless you have some weird amazing job that I'm jealous of, you have to think about services.
Jimmy Bogard: Exactly. None of us really want to deal with it, but it's there now, I guess.
Paul: Did you step into being a consultant specifically in this microservice realm or is this just how the cookie crumbled over the years?
Jimmy Bogard: Pretty much. My career started out in a startup area, then I went to a product company, and I hated both of those things. And then in my first gig of big corporate IT, that was my first foray into like, "I'm not just dealing with a single app and a single database. Now I have to deal with other people's systems and other people's databases." And that was my first introduction to distributed systems: big corporate IT, how things can work, and how things cannot work as well. But corporate IT wasn't for me, so I got into consulting to try to escape that world.
Paul: Did you step into consulting and go back into the corporate IT world, just on your own terms, or did you maybe reel it in a little bit and start small again, single systems?
Jimmy Bogard: Actually, it was like six months later I was consulting back at the exact same company I just left.
Paul: Go figure. Wow.
Jimmy Bogard: But then I could leave whenever I wanted. I wasn't stuck at a desk job there or in whatever cube farm.
Paul: That's like a life graduation moment. You come back six months later, bigger, shinier, newer.
Jimmy Bogard: Happier.
Paul: Yes. So one of the big things you're talking about with distributed systems is making them coordinate. How do they talk to each other? If you have one system talking to another, you get failures and side effects.
Jimmy Bogard: Exactly. My first introduction to even the idea that there are different patterns you apply and different... this idea of different ways that systems can communicate came from going back to that company, who were going through this big... "We're rewriting the monolith in microservices, and it's going to be this utopia, and everything's going to be so great because it's small, deployable, whatever." And it was just a complete failure, just a huge, massive disaster. And, based on my experiences, I realized a lot of it came back to them not really picking the right communication patterns for the different kinds of use cases they were hitting. They had just basically picked, "We're going to do this one, single communication pattern for everything." And at the time, everybody did APIs for everything, I guess was the... Jeff Bezos had that famous memo he sent out that's like, "We're going to build services and everything's going to have an API." Well, they took that to an extreme. "Everything is going to be this synchronous web API," and that led to a lot of problems when they went to production because they decided on, "This one single way that all of our services are going to interact is this one single way of synchronous APIs."
Paul: Do you think that this illustrates a common pitfall of how engineers specifically think about architecting a system, where it's like if you can, in an ideal utopian world in your head, invent a standard communication that everything falls under, it feels really great, but often it's not? Is that common? Is that why you find people and teams falling into that position?
Jimmy Bogard: Yeah, because it's easier if you just pick one way. You don't have to deal with how to deal with the other ways of doing things. And I've also found, especially early on in microservices, people doing these kinds of architectures really centered around a single style of integration and communication, which is this API style. It's just a very common, easy way of doing things. But what I tended to find was that those kinds of communications do have trade-offs versus other ways of communicating, and unless we're really paying attention to those trade-offs and understanding the limitations of each of those, then when we get to production, or any other really distributed environment, those don't work. And when I talked to the team that had failed, I didn't understand how they got so far with something that just was never going to work. And it was because everything worked great on their local machine because everything was local. They didn't have to cross network boundaries, or servers, or racks, or anything like that. So everything was great. As soon as they went to an environment where things were actually distributed physically, then they started to see all these different slowdowns magnified and multiplied across their applications.
Paul: What's an example of a slowdown that I wouldn't see on my local host? Let's say I'm developing a three-service thing. I know there's unlimited, but I'm just trying to think of one that would crop up that's common.
Jimmy Bogard: So the big thing that you don't see locally is just the latency between services.
You're not crossing a network boundary, you're not leaving your laptop, everything just stays local. So latency between services is effectively zero. Even if I'm running a hundred services locally, it's going to be pretty fast and I won't see any issues, but as soon as you start to introduce that latency between different services, and as you multiply out the number of services that you're calling, suddenly that latency becomes a huge, huge problem. And in this case, we saw that locally it was something like 150 services for one page they were looking at, but as soon as they multiplied that out, it was like 10 to 20 milliseconds per call, but then this one was calling that service, and that service was calling that service, and suddenly it just multiplied out to like... Oh gosh, it was so bad. I think nine minutes it took to get a single response when they finally got to production. And it was just because of that multiplication effect: when you multiply out that latency between calls, suddenly you're going to have this really long effect.
Paul: Correct me if I'm wrong on this intuition, but one reason why that can catch so many people by surprise is you get this thought of like, "Okay, well I'm running services. It's just a little laptop, whatever." But the difference of a multiplier between a value of zero and 1.2 seconds is profound, as you say, as you scale it out. There's this misconception about how much of a difference that minutiae makes between running locally and running in production across network boundaries.
Jimmy Bogard: And it's not completely obvious to you as a developer, if you're calling one API, whether that API is having to call other APIs, and whether those APIs are having to call other APIs. If you're developing locally, you don't necessarily see that you could even trigger a denial-of-service attack internally. We also saw that there was something that caused a big explosion of API calls that, again, locally we just never see.
But suddenly when I'm talking about actual network connections, we're starving the sockets. We can't actually make connections, and we're causing these internal denial-of-service attacks. It's just none of those things you see when you're running everything locally. And these are all well-documented things to look out for. There's this list of the fallacies of distributed computing that predates microservices and service-oriented architecture that says these are things that, when you're developing locally, you assume to be true, and one of them is "latency is zero." And when you're developing locally, latency is effectively zero, but as soon as you get somewhere else, then that starts to multiply out.
Paul: What is one way that you would advise an individual or a team to start thinking about re-architecting something if they're starting from this classic position of, "I have a bunch of things that are talking to other things over HTTP"?
Jimmy Bogard: One of the big challenges there is, as we're defining the different communication between these different services, someone has to define, "Well, what in the world should these services be?" And one of the things we saw with this initial organization I was working with is they didn't really have any guidance about where they should draw the boundaries for these different APIs. It was almost like if they saw a database table, they're like, "Well, we'll just have a service for that database table because we need a... I don't know, we need an order service and a product service." And what they didn't necessarily realize is that, from the usage perspective of that information, it didn't necessarily map out very well: "If I just create an API around that entire set of information, that could lead to some very chatty back and forth." Because some things need that information at different times, in different places. That was one of the things we ran into with the denial-of-service attack.
There's this common problem of using ORMs where you can have what's known as a select N+1 problem, where instead of using a database join, you're calling back and forth to the database multiple times to get the data. Well, what they were doing was that exact thing except in APIs. So they were calling the API to get the order, and then calling the product service for the product details for every single thing in the order. And they didn't realize like, "Oh, this can work really great locally, but it didn't work at the database level and it's not going to work out when we're building APIs as well." So a lot of what I look at is not necessarily just the different interaction styles we should choose, but also making sure the boundaries of the services we choose make sense from more of a business perspective and not just from an informational perspective. So you see a lot of talk in the microservice community upfront about, "Well, what should these microservices be?" And if we can align those service boundaries with business domains or business concerns... Even if we look at the organizational structure of the business, if the business has already decided that this is a good way to organize our capabilities and our people, that also is a good starting point to say, "Then perhaps the ownership of that information, and those capabilities, and the APIs should also be aligned to those groups as well." So that gets back to this idea of Conway's Law, which has been bastardized over the years, but in general, we can design... Or the design of information systems often aligns to the communication patterns of the business. And today that roughly means the organizational structure of the business also aligns to the systems that they build as well. So when we look at designing those service boundaries, we want to make sure that the communication patterns are appropriate and make sense.
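As a rough illustration of the chatty order/product pattern described above, here is a minimal sketch. Everything in it is hypothetical: the `ProductService` class, its `get_product` and `get_products` methods, the 50-line order, and the 15 ms per-call figure (picked from the 10 to 20 millisecond range mentioned earlier in the conversation). It does not sleep or touch a network; it only counts round trips and adds up what each one would cost if the calls ran one after another.

```python
from dataclasses import dataclass

LATENCY_MS = 15  # assumed cost of one network round trip


@dataclass
class ProductService:
    """Hypothetical stand-in for a remote product API."""
    catalog: dict
    round_trips: int = 0

    def get_product(self, product_id):
        # One round trip per product: the API version of select N+1.
        self.round_trips += 1
        return self.catalog[product_id]

    def get_products(self, product_ids):
        # One round trip for the whole batch, comparable to a database join.
        self.round_trips += 1
        return [self.catalog[pid] for pid in product_ids]


def order_details_chatty(service, order_lines):
    """Fetch product details one call at a time."""
    return [service.get_product(pid) for pid in order_lines]


def order_details_batched(service, order_lines):
    """Fetch product details for the whole order in one call."""
    return service.get_products(order_lines)


catalog = {i: f"product-{i}" for i in range(50)}
order_lines = list(range(50))

chatty = ProductService(catalog)
chatty_result = order_details_chatty(chatty, order_lines)

batched = ProductService(catalog)
batched_result = order_details_batched(batched, order_lines)

# Sequential calls pay the round-trip latency once per call.
chatty_ms = chatty.round_trips * LATENCY_MS
batched_ms = batched.round_trips * LATENCY_MS
print(f"chatty:  {chatty.round_trips} calls, ~{chatty_ms} ms")
print(f"batched: {batched.round_trips} calls, ~{batched_ms} ms")
```

Both versions return the same data. On a laptop, where a round trip costs next to nothing, they look equally fast; the gap only shows up once each call crosses a real network, and it grows again if the product service makes downstream calls of its own.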
I first look to see, "Well, how does the business organize themselves?" And if the business has already organized themselves in a logical manner, then that's a good starting point for that. Out of there, in terms of the interaction models, a lot of those will just naturally fall in place to how the business expects those communication models to work.
Paul: When you're looking at optimizing this, it sounds like it's very much at an application-logic level. We're not asking the question, "Do we use this underlying protocol or that underlying protocol?" Like TCP or UDP, which might come up in the design conversation. First and foremost, it's this common problem of architecting your microservices wrong, or over-architecting, or over-segmenting, or just not drawing those lines in the correct location. That's the first place that you typically want to tackle.
Jimmy Bogard: Yeah, and unfortunately that's the hard thing to get right, and it's also a hard thing to fix after the fact. So I tend to spend a bit more time upfront just understanding how the business has organized itself, and the maturity of the different areas of the business as well. I might see an area that's a little less refined, or maybe they haven't standardized their processes, or standardized how they decide to organize themselves. And it doesn't make sense to invest a lot of time building these well-crafted services when they might change their mind in two months and say, "Actually no, we don't want to do business like that or do this process like this. Just do something else." But if I'm able to start with those more mature areas of the business, I tend to find that the APIs out of there, and not just web APIs, but any contracts and means of exposing information and capabilities, tend to be a lot better aligned and better able to be exposed to other areas of the business. When it comes time to choosing the nature of those communications,
I do often go back to understanding, "Well, if I were to rewind time 50 years, what was the nature of the communications back then? How would they have accomplished these different tasks? How would they have accomplished these different workflows and things like that?" I tend to find that we as humans are pretty good... Well, I don't know about that. We're halfway decent at optimizing our communications based on the different needs we have around those communications. So if I have something urgent, and I need to get an answer immediately, I will choose that synchronous form of communication to say, "I need to get this answer right away, so I will call someone or I'll..." Whatever, some sort of synchronous communication to ask that question. But if it's not something I need an answer to right away, then I'll choose an asynchronous form of communication. I'll email, or text, or I'll Slack, something where I don't expect to have that immediate answer. And if I rewind time 50 years, we tend to find that there was a lot of synchronous communication inside of an organizational boundary, and a lot less synchronous communication between organizational boundaries. So if I'm in accounting and I need to have something done by invoicing, 50 years ago I might have filled out a form or sent an interoffice mail to them, and then a day later I get an answer back. But I've optimized my workflow to understand that there's going to be a bit of delay, with the trade-off that I can send the mail off, and then I can keep going on with my life and not have to wait for that other person to get an answer back to me. If we translate that into this idea that everything is an API, that'd be equivalent to saying that if I need to get anything done at work, I would have to get everybody on some synchronous call to be able to get this answer, which is meetings.
So if I need to place an order and have to get everybody on the phone to answer the question, "Can this person place an order?", well, that's going to be a very expensive call, a very expensive meeting to have, and it's going to be a very expensive request in my system as well if everything needed to answer that question is synchronous.
Paul: One thing I love about this example is I think this gives me a rule of thumb, and I love rules of thumb because they're things that I can take away from this conversation we're having and use as a tool in my toolbox later as I'm working on my systems and stuff. So, correct me if I'm wrong, but I bring myself back to thinking of how a human organization would organize. In your examples you're like, "Well, 50 years ago, how would it have been done?" If I view each one of my services as a little department and I'm picturing little humans working in there, where would they end up wasting the time I'm paying them for? I don't know. Thinking about how a human organization would interact, I think, brings to the forefront of my mind some of these areas where APIs and services might break down in their efficiency. It humanizes it.
Jimmy Bogard: For me, it's at least a very good starting point, because obviously metaphors are good to a point, but they break down at some point. But at least it brings to the forefront and humanizes the real-world limitations of the communications that we have every day. If I'm talking about synchronous communication in a system, that's API calls, whether it's gRPC or HTTP: one system makes a call to another system and is waiting for that other system to return a response before it continues its execution. Well, that's the same thing as me talking to you today. We're both sitting here blocked, not doing anything else, and as we're going back and forth, neither of us is doing anything else. We're waiting for the other's response to come back.
And you do see this in your daily interactions with other businesses. One famous example is Starbucks. There's this famous paper, "your coffee shop doesn't use two-phase commits." And it's this idea that not everything in the world is a synchronous interaction, but there are some synchronous ones and some asynchronous ones. And so when you go to a coffee shop, you make that initial synchronous interaction of placing your order, but once that initial interaction is complete, the processing of that order is asynchronous to you. So you go mill around over there, in the milling around area next to the counter, and then behind the scenes there are these asynchronous operations that are happening to fulfill your order. You're okay with that, because you know eventually you will get your coffee or you'll complain and leave a bad review, but it allows you to have a mix: this synchronous interaction where it's highly efficient to take your order and pay for your coffee, and then the fulfillment of that order can be optimized for that process. So I can have multiple baristas fulfilling different kinds of orders. There's a frappuccino person, because that blender's super annoying. Someone's just working the espresso machine. They can optimize these individual tasks behind the scenes for you, and it just comes with a bit of an understanding from you as a customer that the fulfillment of the order is now asynchronous, which means I don't get my coffee when I'm paying, I get it sometime in the future. So there's some agreement between us, like, "Well, I'm not getting it now. It's in the future sometime, but the line goes faster, and they'll make more money." And so I see this in our distributed systems as well. I've seen people go the other way, like, "Oh, APIs sucked, so we're going to do everything via messaging. Everything's going to be on Kafka, Rabbit [inaudible 00:18:44]."
Well, you could do that, but then now every single interaction has to wait some indeterminate time in the future for a response.
Paul: You're bound to a memory-tied application that is not inherently stateful. If that goes down, what happens to your brain? You have a whole other set of side effects to manage that piece of infrastructure.
Jimmy Bogard: Going back to the design of these distributed systems, that's what I look at. I look at each of these interactions and then try... I first start from the metaphor of like, "Well, what if these are people performing this job? How would they do these interactions? Would they have a synchronous interaction? Does it make the most sense there?" And a lot of times it comes down to this: if I'm asking a question, I expect an immediate response, so for the queries in my system, "I've asked you a question, and it should be a very easy thing for you to give me a response to answer my question." Or, in the case of placing an order, you're not doing any big lifting, or heavy lifting, to take someone's order. The most you're doing is just writing it down and taking the money, and that's it. All the processing is behind the scenes. And so then I'm looking at, "Okay, from then on out, how do we then coordinate these activities behind the scenes? Do we use synchronous communication? Now we have to worry about both sides being up and available." Just like in real life. If I'm telling someone to do something and they're busy, well then I have to ask them again, and I have to ask them again, and ask them again in case they're still busy. Versus durable, where I have more guarantees, but I'm now disconnected, like, "Well, I don't know when they'll get this order or when they'll get this request, but it's sometime in the future, so that's good enough for me."
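The coffee shop split described above, a quick synchronous step to take the order followed by asynchronous fulfillment, can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `place_order`, `barista`, `orders` and `ready` are made up for the example, and Python's in-memory `queue.Queue` stands in for a durable message queue, which it is not (nothing here survives a process restart).

```python
import queue
import threading

orders = queue.Queue()        # in-memory stand-in for a durable message queue
ready = {}                    # ticket number -> finished drink
ready_lock = threading.Lock()
_next_ticket = 0


def place_order(drink):
    """Synchronous step: write the order down and hand back a ticket at once."""
    global _next_ticket
    _next_ticket += 1
    orders.put((_next_ticket, drink))
    return _next_ticket


def barista():
    """Asynchronous step: fulfill orders whenever they come off the queue."""
    while True:
        item = orders.get()
        if item is None:      # sentinel value: the shop is closing
            orders.task_done()
            break
        ticket, drink = item
        with ready_lock:
            ready[ticket] = f"{drink} for ticket {ticket}"
        orders.task_done()


worker = threading.Thread(target=barista)
worker.start()

# The caller gets its tickets immediately; the drinks arrive some time later.
tickets = [place_order("latte"), place_order("frappuccino")]

orders.join()                 # wait until every queued order has been processed
orders.put(None)              # tell the barista to stop
worker.join()

for ticket in tickets:
    print(ready[ticket])
```

The caller only blocks for as long as it takes to record the order. The trade-off is the one described in the conversation: the ticket says nothing about when the drink will be ready, only that it will be at some point.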
So I look at each of those interactions and try to figure out what makes the most sense for these two systems or services, as well as what makes the most sense for the overall business transaction I'm trying to achieve, which is getting my coffee at the end.
Paul: It feels to me as if it's a lack of perspective, a lack of point of view. As a system architect, your point of view needs to remain homogenous throughout each piece as you walk through, and you're determining, is this synchronous, is it asynchronous, is it durable, is it not? You need to... Because what I heard from you is you first think about the customer, you put yourself in the mind of the customer. I'm getting my coffee. What does this top-level point of view look like? Once that's established, you have your synchronous bubbles, your asynchronous bubbles, and then you can go into each one. And if I'm designing microservices for my own little app, I run into this problem. If I'm coding at 10:00 AM and then at 10:00 PM that same day, my point of view about who is my main constituent in the system is different. It's difficult to maintain that. One interesting framework I'd love to get your thoughts on, which I think has a heavy-handed approach to forcing my point of view, like, "This is going to be your top level, this is a workflow," is Temporal.io. I'm sure you've heard of it. If you're the microservice man, I think it's a really interesting answer to this problem: here's your point of view, you have a workflow object, now we're going to proxy activities out. What's your experience in the enterprise world working with Temporal? What are your thoughts? Are you a yayer or a nayer?
Jimmy Bogard: So I haven't had the privilege of using Temporal.io. I've used systems like it, though, in the past. Some of those come down to how much you are willing to have centralized orchestration of some processes, versus wanting to have those a little bit more distributed, or less coupled to some central system.
And the other thing I hit is that not a lot of the customers I work with are equally mature in all the systems that would connect to those kinds of workflow activities. And so we have to drop down to a less heavy-handed approach, because they're just not able to say... "Well, for this one spot, their fulfillment engine is SAP or something. So that thing's not going to talk to Temporal.io, so we've got to figure something else out for those pieces over there." There's also a tendency, even in looking at asynchronous messaging, to also just say, "We're going to have one single interaction style between different services." To say, "Okay, if we're going to go messaging, then only do events between different services, because that has a high level of process decoupling: I'm not telling services what to do, I'm instead just telling them what happened with me and then they can decide how they want to manage it on their own side." But again, what I found is that there's still no hard and fast rule to say, "Always use events, or always use command messages." It is always coming back to, what is that overall workflow trying to achieve? And what is the most appropriate interaction style for these sets of interactions? Does it make more sense to use events, or is it the case where I do need to tell the system to do something, and I don't want them to care about me at all? I need to tell them what to do. Because if they were to do events, then they would have to learn about the thing that's on my side. I don't want them to care about me, I just want to be able to tell them what to do. And it is always that push and pull, that tug of war, to say what makes the most sense there.
Paul: I like your answer, because I think it just sheds light on the very fundamental fact that Temporal doesn't solve everything. It's new. It works so well for a lot of people and we hear great success stories like Uber. All of Uber runs on Temporal.
If you're listening and you don't know what it is, all of Uber runs on this framework. It's a beast and it does this orchestration itself, but it doesn't solve everything. You're talking to SAP, like how are you going to get that integrated into the TypeScript runtime? It has other languages, of course, and I think it has something to be said. This is beyond your typical, "I wrote a bunch of Next apps and TypeScript services that are trying to communicate." There are legacy systems, legacy pieces of hardware even, that have different push and pull models. When you're going out to a client and you're like, "Okay, let's tackle this. Let's look at slowness," what is your step one of indexing and discovering the topology? 'Cause if you're going into an org and you're taking over an org, that's a disaster as it is. You're like, "What team does what?" And that takes you two weeks to answer. So do you write it down? Do you have a notebook? How does that work?
Jimmy Bogard: I guess in pre-COVID days it was get in a room with the whiteboards and stuff. But my virtual whiteboard skills are still heavily lacking. We're still trying to get better at that. A lot of times the customers I work with don't even have distributed tracing for me even to have... I can't have the system tell me what's going on. So sometimes what we'll do is even add instrumentation to the systems so that we can prove what they tell us is actually true, that this is how things actually flow through the system. A lot of it starts out with talking, doing those levels of interviews, and cataloging the different systems that make up whatever applications they have. So getting with the dev managers, or C levels, who have a high-level understanding of how things are put together, and then going down from there. Oftentimes, though, I'm not trying to draw the most detailed map of Europe.
I'm just like, "Okay, let me draw the country boundaries and then if I want to zoom in on something, then I'll zoom in on that." But it's not often important to get the most detailed map. For the whole thing, we'll just get the 10,000-foot view and then if we see problem areas, we'll start to drill into those specific areas of concern.
Paul: There's a whole new set of software startups coming out right now whose whole thing is just based on mapping your AWS environment, 'cause it's such a behemoth of a task, and it changes, and that's the problem. You spend time and resources doing it and then it changes.
Jimmy Bogard: That's why as much as possible, we try to get some instrumentation in so that the system tells us how things are put together, and we're not having to go, "Oh wait, there's this one config value for this URL. Where is that being called?" "Oh, it's over here. That's calling for this." And reverse engineering that by going through source code is, unless I'm getting paid by the hour, not very fun.
Paul: I think it's an interesting job too, because you're basically a software engineer, but it's just like, "What do you do?" "Well, it changes every time. It changes every time, but it's in the same domain. It just changes every time because I don't know what system I'm going to be touching." My last question is a career-oriented one. When you're stepping into an organization, do you find that the source of difficulties, when it comes to either the existing architecture or the architecture currently being planned, is often groupthink, where people are having an offhanded approach and they're just moving forward blindly and slowly as a mass, and it's yielding inefficient designs? Or is it more like you have one leader who's very much like, "Ah, I have this idea, we're going to make it great," and people sign on and then it ends up having a lack of checks and balances?
Because I feel like those are two very different entry points of disaster.
Jimmy Bogard: I'd say in terms of things going off the rails, which again, as a consultant, I'm never brought in for systems that already work, so I only get to see the systems that don't work. But in those cases, I'd say most of the time it is that one champion that has achieved some level of, either they made it up the org chart very high, or they achieved some level of success where they've now been put in a position of, "Okay, now you're going to be the one designing this new system." But then they might not have any experience in designing those systems, so they're also learning as they go. So it's not necessarily any negative intentions or even necessarily checks and balances. A lot of times it's a lack of feedback into whether these designs actually work. Is this appropriate? So that's one of the things I do try to put in, and as we're trying to do something different it's, let's not try to start 50 at once, because we won't have much feedback. It'll just be too much feedback to understand whether this is right. Let's start small, and prove that approach, and then grow it out a little bit more. There's a desire to scale the approach too quickly before we've really had a chance to let those ideas bake, and take hold, and understand where the limitations were [inaudible 00:29:01] the limitations.
Paul: That was my guess, too. So it's interesting to hear. It's interesting that you say it's not inherently a malignant action, it's maybe over trust, lack of experience, or some combination of the two that makes people just blindly move forward with something.
Jimmy Bogard: I wouldn't say it's never malignant. So one of the things I do when I talk to my teams is I also try to understand what their compensation structure is. What are they given? What are they rewarded for? What are they punished for? Because I've had teams... one example was an organization that had way too many microservices.
People raised the flag and said, "We got too many, this is going in the wrong direction." But it turns out in that organization you got promotions by how many services you owned. The more services you defined and owned, the more you got in your bonus. And so there was a negative, how would you call that?
Paul: Bad incentive.
Jimmy Bogard: They're financially incentivized to do the wrong thing. So that is one thing I try to understand going in: I could make suggestions on changes, but if they're not financially motivated to do so, it's probably not going to go well. And so people will make their own choices based on their own selfish reasons, so I want to understand what those extrinsic motivators are that they have going on in the organization, and that also gives me an understanding of how things got to where they are.
Paul: It's funny how these sorts of problems fizzle out, not completely, but a lot, when we step into the open-source realm. And I think that speaks a lot for the quality of the software that gets developed there. And, boy, do we have so many episodes on that, riffing on open source. That's a topic for another time. But just the incentives, that's an interesting example you brought. It just feels unique that, wow, you get bonuses based on the microservices and that caused an actual technical cost in the end. Well, Jimmy, before we close out, if people wanted to learn more about what you do, I know you have a website. What is the website name?
Jimmy Bogard: It's really complicated. It's just jimmybogard.com.
Paul: God, nobody's going to remember that one. And then you're on LinkedIn, you're on GitHub.
Jimmy Bogard: Twitter, I guess for now. I don't know.
Paul: For now.
Jimmy Bogard: While it's still online. I don't know. We'll see.
Paul: Well, thank you for taking your time to come on. It was really great to pick your brain about how microservices are done wrong, maybe, and some ways that we might want to think about them and the problem areas.
Jimmy Bogard: Thanks for having me. Glad to be here.