Anna (00:00:05): Welcome to Zero Knowledge. I'm your host Anna Rose. In this podcast, we will be exploring the latest in zero knowledge research and the decentralized web, as well as new paradigms that promise to change the way we interact and transact online. Anna (00:00:27): In this week's episode, James and I explore the Cosmos Stargate upgrade. We do three interviews with three different participants in the upgrade, that is Zaki Manian, Tess Rinearson and Shahan Khatchadourian. We talk about what the upgrade was all about, how it went and what kind of changes it brought to the Cosmos tech stack. But before we start in, I want to remind you to subscribe to the zkMesh newsletter for the latest in zk news, research, events, jobs and more. The newsletter comes out monthly. Be sure to check it out. I've added the link in the show notes. I also want to say thank you to this week's sponsor Least Authority. Least Authority is a security consulting company known for their dedication to pushing the limits on how to build privacy-respecting solutions. They are a team of security researchers, cryptographers and open source developers, passionate about advancing security projects in the blockchain, cryptocurrency and DeFi space. They're known for their audits of the Eth 2.0 specification, Protocol Labs' Gossipsub protocol, Mina protocol, Blockstack's Investor Wallet, and more. They want to help teams improve the security of their protocol and use of cryptography, including zero knowledge proofs, from design to implementation. Their independent reviews improve the security of the technology by immediately helping developers of the projects along with their users. And when the reports are published, the broader community can benefit from this shared knowledge as well. With over 50 completed public audits to date, join the growing list of audited projects. To do this, book a free consultation at leastauthority.com/consult. I've added the link to this in the show notes. 
So thank you again, Least Authority, for supporting this podcast. Now here is James and my exploration of the Stargate upgrade. Anna (00:02:11): So today James and I will be exploring the Cosmos Stargate upgrade. And we're going to do this by interviewing three different people. First off, I want to say hi to James. Hey James. James (00:02:20): Hi, Anna. Anna (00:02:21): And I want to introduce our first guest Zaki. Welcome back to the show. Zaki (00:02:26): It's a pleasure to be here. Anna (00:02:29): And Zaki, you've been on the show before. We actually have an episode that I'm going to add in the show notes where we had a full hour, we got to talk about all the work that you've been doing, actually at least up to that point. But since you are somebody who wears many hats, I feel like we should say that for this episode, you're going to be wearing the Iqlusion hat, as we talk about the Stargate upgrade. So why don't you just introduce yourself and what Iqlusion is doing with Stargate? Zaki (00:03:00): So I'm Zaki Manian. I build blockchain protocols. I work on Cosmos and Cosmos-related protocols and other protocols. I'm a co-founder of a company called Iqlusion. I co-founded it with myself, Kristi Põldsam, Shella Stephens and Tony Arcieri. And we started this all the way back in 2017, basically with the idea of running a validator and trying to figure out how to be a validator. We do some combination of new protocol development, running validators, supporting Cosmos and the Cosmos ecosystem, consulting for other ecosystems. And basically the only thing that we don't do that other people do in the space is we don't run a fund. Anna (00:03:39): Got it. So many hats. So let's kick this off. What exactly is Stargate? What is this upgrade and how did you actually interface with it? Zaki (00:03:50): Yeah, so the history of Stargate is Cosmos underwent this significant reorganization in 2020, where a lot of dev teams moved around, a lot of people moved between orgs. 
And we did Game of Zones, which was an incentivized test that was a follow-up to Game of Stakes, which was the first incentivized Testnet. And after Game of Zones, which was the big public test of IBC, I had checked back in on what was going on in Cosmos software development. And I had known that there were a lot of big changes that were planned and that there had been a lot of engineering discussion and planning and work. And what I realized in the Cosmos SDK process, in the process of doing Game of Zones, is I was like: "Oh, there is no way to ship IBC." I've always felt an almost religious obligation to ship the Cosmos white paper. And I was like: "We can't ship IBC. We can't ship the Cosmos white paper without also bringing in all of these other changes." And what I realized was those changes were going to have a really dramatic impact on exchanges and validators and wallets. And all of that stuff was going to be this big, really painful upgrade that we had to do in order to ship the Cosmos white paper. And so I decided that we were going to take this on, that we were going to do this, that all of these upgrades were all gonna get bundled together into this one big upgrade that also enabled IBC, and that I was going to go and lead the charge on making sure that exchanges and wallets and validators were all ready for this big disruptive change, which is basically the biggest and most disruptive change that has ever been done on a widely integrated blockchain in the history of this industry. And that's what Stargate was. Anna (00:05:42): Let's talk about this white paper, like what Cosmos was and what it is now. So you mentioned IBC. I actually have done an entire episode on IBC with Chris Goes, I'll add the link in the show notes to that. And IBC, as I've always understood it, is the thing that allows for this interoperability in the Cosmos ecosystem to actually exist. Before this point, what was Cosmos actually? 
Zaki (00:06:08): Cosmos was the first BFT (fast finality proof of stake) blockchain. It was the first framework for building blockchains, or at least the first framework that was widely adopted to build your own blockchain. So people built their own blockchain. Kava, Terra, Band all built their own blockchains on the Cosmos SDK. And then we were this proof of stake BFT consensus system, that showed that systems like this could exist. And that was Cosmos before. James (00:06:39): The promise of Cosmos and the white paper was this internet of blockchains. And it sounds like you're saying IBC and the Stargate upgrade are the last big push to achieve that goal that you set out to do five years ago. Zaki (00:06:54): Yeah. So five years ago, I helped with editing and writing and made comments on the original Cosmos white paper. And it was basically this idea of interoperable blockchains. And Stargate, the Stargate upgrade that happened on the Cosmos Hub back in February, completed the Cosmos white paper. The Cosmos white paper is done. The software is shipped, the system is running, it's actually finished. And it's been a big deal for me personally to have that burden lifted. Anna (00:07:26): What do you and Iqlusion actually do in something like this? What does it mean to do an upgrade from where you're sitting? Zaki (00:07:34): There are more than a hundred exchanges that integrate with Cosmos. There are probably about a dozen different independent wallets that integrate with Cosmos and the Cosmos Hub. There are block explorers. There's a whole ecosystem of software that is not produced by, let's say, anyone who commits code to the Cosmos repository or goes to any software development meetings or is in the Discord channels. And so there's this whole peripheral ecosystem that is really fundamental to like: what makes a blockchain work? Why does the blockchain exist? 
And the way in which Cosmos organically evolved in the ecosystem is there were dozens of different bespoke integrations between exchanges and wallets with Cosmos software. And what I realized when we were doing Game of Zones and getting ready and trying to figure out what it would be like to launch IBC, we realized that this was going to disrupt every single one of those integrations. And you had this sinking feeling in your stomach. You're like: "Fuck!" And so I came up with this brand of the Stargate, and the brand of the Stargate is: we're going through this thing, there is a before and there's an after. And once we're after, everything is different and nothing will ever be the same again. And I thought the name would be helpful for convincing all of these different entities that they were all going to have to put hundreds of hours into their integrations, fixing the problems and the disruptions with their software that this upgrade was going to cause. And that did work. James (00:09:15): So the name was a way to sell the idea to all of these exchanges and external developers who are not part of the core team? Zaki (00:09:23): Yep. So essentially you set up calls and Slack channels and Telegram groups with the relevant technical contact at all of these different organizations. I have an entire Telegram folder of just exchanges that I'm tech supporting with Stargate. I have lots of Slack channels. I have all kinds of different ways of communicating with these folks. And you start, and then you run testnets, and you communicate clearly. We published, I think, 15 weeks of weekly updates on the Stargate launch repository, where we would just update everyone. And we would tell people and then we would just get people to report back. So the IBC feature is the feature for the token holders. It's the one big thing for the token holders. 
What Stargate was mostly about was changing the way Cosmos works from a developer point of view. And one of the other things that we were doing is we maintained a backwards compatibility layer between the new way that transactions work, the new way that blocks work in Cosmos and the old way. And basically a lot of this process was essentially working with the developers at various exchanges on a "trial by fire" basis, to get them all to tell us what is broken about this new software and where the compatibility layer doesn't work for them. And then going to the Cosmos dev teams and being like: "Okay, this thing is not working, exchanges need it, such and such large exchange needs it. Can we fix this bug, get that bug out, get a new testnet launched, get the exchange to test against the new testnet, get to the next bug, fix that." And the plan was that we were going to land all of the exchanges and their integration issues, and the software being QA-ed and ready, all of these pieces together on basically the same day. And that's what we did. James (00:11:24): This is interesting for me because at Summa we did a significant amount of work on a Cosmos SDK module, and we are currently in the process of upgrading that to the new version of the SDK. And just to take a relatively simple extension to the Cosmos ecosystem and upgrade it is taking probably a few dozen hours. It is not surprising to me to hear that exchanges are spending hundreds of hours on this, because it is much more far reaching through the tech stack than any other hard fork I've seen. Zaki (00:11:59): Yup. That is the reality of the situation. I probably hadn't really looked very closely at what's going on in the Cosmos code base for four or five months, because I've been dealing with both trying to set strategic direction inside of Tendermint and then leaving Tendermint, and then reorganizing the entire Cosmos ecosystem and trying to make all of that work. 
And so Game of Zones happened, I started to look at what is actually going on here. And I realized: "Oh, you guys have decided to break everything". And for all good reasons that will make a huge difference in the future. And all of this stuff is reasonable and logical and makes a lot of sense. And any point in time in which we'd waited even longer to make these changes, they would have been harder to do. So, all in all, I'm not expressing remotely any regret about it, but the Stargate upgrade was this immense lift to try to upgrade a very widely integrated, very complex blockchain that has lots and lots of different integrations. The other thing about Cosmos, is that Cosmos never really participated in the listing process with the team or Tendermint, or no one ever really went out and got Cosmos listed on any exchange. Every exchange permissionlessly decided to integrate Cosmos and every single exchange came up with their own completely unique integration, which has been so much fun to support. Anna (00:13:31): Oh my gosh, this sounds like such a huge endeavor. Let's do something for our listeners though. And also for me, actually. Would you define Stargate as an upgrade, like one upgrade or is it a set of upgrades? Do you bundle this all into one thing? And I guess the other question is, is it over? Zaki (00:13:49): That was the reality of it: it was over. There is a point in which you go through the Stargate. You're on one side. The name has really turned out to work really, really well. That visual metaphor of: I'm on one side of the Stargate, the Stargate opens, I go through it and I'm on the other side. I don't know how many people remember the TV show, it's on Netflix now so there's lots of people who can watch it. I mean, I loved that TV show so much growing up, but this metaphor of: there's like a membrane and you're on the other side and everything is different on the other side, is extremely accurate. Anna (00:14:29): So are we on the other side? 
Zaki (00:14:31): Yeah, we are on the other side today. Anna (00:14:34): Okay. So let's go into this. So back to that question of is it one upgrade or a series of upgrades? Zaki (00:14:41): Yeah, it was one upgrade. It was one event. That was the realization that all of these changes that people had made or were in the process of making, had to be bundled together into one big painful change that all had to be done at once, because everything was so tightly coupled together that there was no other way. James (00:15:02): So there were a number of major features that needed to be added or refactoring or things that needed to be changed. And the decision was to bundle them all into one extremely painful upgrade and go through it all at once, right? Zaki (00:15:16): Yep. Anna (00:15:17): What makes it painful? I want to understand this. You described the exchanges having to put all of these hours in, but what does one of these exchanges actually do? Don't they just take new software that already exists and just spin it up? Zaki (00:15:32): That is not, in general, what an exchange integration looks like. And it makes sense to me that very few people are familiar with this. So the first thing you can do, if you wanted to understand what an exchange integration looks like, is: Coinbase has done this Rosetta standard, which is essentially attempting to formalize and standardize the interface between a cryptocurrency exchange and a blockchain piece of software. And Cosmos is working on having a Rosetta integration. The scope of what's in Rosetta is the scope of what an exchange has to do. They have to be able to send and receive funds, manage keys, manage all these hot and cold wallets, move funds back and forth between them. Many exchanges are integrated into proof of stake systems, so they need to be able to handle staking rewards and unbonding. 
And as your blockchain becomes more featureful, there are more opportunities for exchanges to integrate the features of your blockchain into their software. And exchanges may choose to do anything from use the queries and RPC calls of your network, that your node provides, to interact with the network, to use the API that your node uses to interact with the network, to write thousands of lines of code directly in the Cosmos SDK to interact with the network. Again, every single exchange pretty much has a unique version of this. The world of "all exchanges just use Rosetta" sounds amazing and magical and fantastic, and would make something like the Stargate upgrade a lot easier, because there would only be one API surface area that you had to migrate over the upgrade. But I did not get to experience that world. I got to experience the world of "everybody has a unique upgrade" and there are over a hundred exchanges that list ATOMs, I have different relationships with them, and I had to help them all with random troubleshooting problems. And I'm probably still helping a bunch of them today. James (00:17:38): This is really interesting to me. Working on some of these core blockchain clients like Geth and Gaia and Celo, I feel like we're targeting these at specific users, people who are independent devs, running them at home or in the cloud. And exchanges are so far outside of that class of users, they have completely different needs for a blockchain node. So it sounds like the pain point here is that Gaia itself is not serving the exchanges' needs, and they have all resolved that in different ways. And so you and the other Gaia and Cosmos developers have to go out to each exchange and help them update their bespoke setup? That sounds incredibly painful. Anna (00:18:28): Let's talk about what's in it. What is it doing? I had Chris Goes on the show, I think, about a year ago, maybe it was a year and a half ago, where he talked about IBC. 
So I know that IBC is a huge component. IBC meaning basically the ability for the zones to actually interoperate... Zaki (00:18:46): Yeah. For Cosmos to be an internet of blockchains. Anna (00:18:49): Exactly. For it to become what it set out to be. It's a huge thing. But is there something else to this? Are there other key changes that would even be recognizable to an outside viewer? Zaki (00:19:02): Yeah. So most of these changes are designed to improve quality of life. Not really for end users of our software, like ATOM holders, et cetera. IBC is probably the most ATOM holder-facing feature of this whole system. There are a couple of other minor things that are related to how the community pool can work and stuff like that. But IBC is the big feature for token holders. Really what the focus of Stargate was, is totally reconceiving the developer experience and the node operator experience of running the Cosmos software stack. So virtually, I mean, this is very much in the weeds, but when you write a blockchain, a big part of writing a blockchain is how data and transactions are actually written to the network and read from the network and written to disk. What are the transactions that are actually in the blocks? What do the blocks look like? And like virtually every blockchain protocol, Cosmos had an in-house, homegrown solution for what the transactions actually look like as bytes on the wire and what the blocks actually look like. And this homegrown solution was an incredible pain point for developers. And the goal, the biggest thing in Stargate was: we are going to reconceive this homegrown solution into something that uses a standard format that is widely used in the entire computing industry and is no longer a unique thing to blockchains. 
And it represents the overall goal of the direction of the Cosmos developer community, to make Cosmos look less and less like a special, unique blockchain snowflake, produced by some weirdos, and more like a standard piece of internet infrastructure. And that change is the reason why I had to talk to all of the exchanges and all of the wallets and all of the custodians, because what they had done is they'd all figured out their own ways of dealing with our unique blockchain snowflake software. And the other thing that we did, that was extremely painful, was we changed the entire inner workings. So every transaction that's on the network is all different, but we maintained a backwards compatible interface for all of that. That is like 99% backwards compatible with the old version, allowing people to sometimes not have to do all of this upgrade all at once. But part of the reality is, if you build something that is 99% compatible, you're going to run into the situation where the 1% that's not compatible is going to break interfaces for someone, and then you're going to have to deal with it. James (00:22:08): Getting into how transactions and blocks are stored and transmitted between nodes, I have a bit of a personal anecdote here. Early on, in my Cosmos SDK development experience, we were at a hackathon trying to put together a very simple Cosmos module. And we kept running into an issue where a specific transaction would always fail. And it wouldn't just fail during processing, it would fail signature checks on the node. And we couldn't figure out why, because we were building it and signing it just the same as all the others. What it turned out is that the client was sending an empty list. And when it went through the serialization and deserialization, the node was receiving, instead of the empty list, nothing. So some step in there was taking what I had sent and signed, which was the empty list, and replacing it with a nothingness. 
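The failure James describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Amino's or the Cosmos SDK's actual code path: `lossy_encode` is a hypothetical toy codec that omits empty fields, JSON with sorted keys stands in for the canonical sign bytes, and an HMAC stands in for a real signature scheme so that the sketch needs nothing beyond the standard library.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # assumption: stand-in for a real signing key


def sign_bytes(msg: dict) -> bytes:
    """Canonical bytes that get signed: compact JSON with sorted keys."""
    return json.dumps(msg, sort_keys=True, separators=(",", ":")).encode()


def sign(msg: dict) -> str:
    # HMAC is a stand-in for a real signature scheme (assumption).
    return hmac.new(SECRET, sign_bytes(msg), hashlib.sha256).hexdigest()


def lossy_encode(msg: dict) -> bytes:
    """Hypothetical wire codec that omits empty fields, so [] is never sent."""
    kept = {k: v for k, v in msg.items() if v not in ([], None)}
    return json.dumps(kept).encode()


def decode(wire: bytes) -> dict:
    return json.loads(wire)


def node_verifies(wire: bytes, signature: str) -> bool:
    """The node rebuilds the sign bytes from what it decoded, then compares."""
    return hmac.compare_digest(sign(decode(wire)), signature)


tx = {"sender": "alice", "amounts": []}
signature = sign(tx)                  # client signs the message with the empty list
received = decode(lossy_encode(tx))   # node decodes a message with no "amounts" field

tx_ok = {"sender": "alice", "amounts": [5]}
ok = node_verifies(lossy_encode(tx_ok), sign(tx_ok))      # round-trips unchanged
broken = node_verifies(lossy_encode(tx), signature)       # sign bytes differ
```

In this toy setup the transaction with a non-empty list survives the round trip, while the one with an empty list comes back without the field, so the bytes the node checks no longer match the bytes the client signed.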
And that's why the signature was no longer working, because the serialization and deserialization actually changed what we had sent. Zaki (00:23:14): So I have two responses to this. One is, these are weird edge cases, that's why it's better to use internet standard software. So we have switched from Amino to Protobuf, which everyone uses. But the nightmare of Stargate is that we have maintained a lot of Amino backwards compatibility. So there's actually a translation layer in the Stargate software between the new Protobuf stuff and the Amino stuff, that has to be one-for-one bug compatible with the old Amino stuff. And somehow this works. And credit goes to Aaron and the Regen team for pulling off this miracle. It's pretty amazing that this part of it went through the Stargate, and we succeeded, and it was painful and it was a lot of work and a lot of hours, but also pretty amazing that we were able to pull this off. James (00:24:13): Yeah, definitely. It's a really impressive bit of work. And I have to say that the new Protobuf stuff is much nicer to work with, and I am a huge fan of outsourcing maintenance of this to Google. Anna (00:24:28): I wonder, with the Stargate upgrade... So far, I think of this as it's focused mainly on the Cosmos Hub, but is it also across all the zones that have been built using the Cosmos SDK? This is what I want to understand, did they all have to upgrade something as well? Zaki (00:24:46): So in order for them to participate in IBC, to start sending tokens, participate in the internet of blockchains, they need to upgrade to Stargate. So that's one piece of the story. A number of zones have either upgraded to Stargate, like Akash upgraded to Stargate, or have announced that they are going to upgrade to Stargate. There's quite a lot of demand right now for engineers who can do Stargate upgrades on Cosmos modules. 
So there's quite a lot of consulting going on doing that, and the work that James is doing, upgrading his Cosmos module to Stargate. So in order to get the internet of blockchains, you got to do Stargate. And by doing the Cosmos Hub stuff first, we paved the way. Every exchange that has an integration, their integration with the hub is the basis for their integrations with all of the other chains. The other thing about Stargate is: once we went through this, we're not going back. The Cosmos dev teams are only building stuff for the post-Stargate world. If people want the latest improvements to Tendermint and the Cosmos stack, they got to upgrade to Stargate too. The other reality of it is, in general, all of development slowed down across the Cosmos ecosystem because the Stargate upgrade was ongoing. And people either didn't want to build stuff, knowing that they would have to do the Stargate upgrade later, or didn't want to build stuff on top of half completed versions of Stargate, which was probably not a bad decision for most people. And so now that we're on the other side of Stargate, there's actually a lot more scope for building stuff on top. Anna (00:26:27): It's exciting times in the Cosmos ecosystem then. Zaki, thank you so much for the interview and giving us your perspective on the Stargate upgrade. Zaki (00:26:36): Yeah. Thanks guys. Anna (00:26:42): So now we're sitting with Tess Rinearson, who's the VP of Engineering at Interchain GmbH and product owner for Tendermint Core. Welcome to the show, Tess. Tess (00:26:51): Hi, thank you so much. Anna (00:26:52): I'd love to find out a little bit, and also to give some context to our audience, what exactly do you do as VP of Engineering at Interchain? And how do you exactly interface with the Stargate upgrade? Tess (00:27:06): Yeah, totally. That's a good question. And definitely merits a little bit of explanation. 
So at Interchain GmbH we focus on three pillars of core technology for Cosmos. One of them is Tendermint Core, which is the consensus engine. One of them is IBC, which everyone listening to this knows what IBC is. And then there's also Gaia, which is the daemon that actually runs on the Cosmos Hub. So it's the software that ties the Cosmos SDK together, for the Cosmos Hub specifically. And I actually spend like 90 or 95% of my time focused on Tendermint Core. So I'm the product owner for it, I'm the engineering manager for that team. I also manage the managers of the other teams at Interchain GmbH and work on general engineering management practice, for lack of a better word, for the company, but I actually really focus mostly on Tendermint Core. And that's where my technical passions and interests lie. James (00:28:09): For people a little less familiar with the architecture, would you mind talking a little bit about the separation between Tendermint Core and Gaia or the Cosmos SDK? Tess (00:28:17): Totally. So if you think of them as an infrastructure stack, it's an imperfect analogy, but there is this dependency direction to them. If you think of them as a stack, Tendermint Core is at the bottom, it is an algorithm for state machine replication. Usually when we say Tendermint, we're talking about the algorithm, which is this proof of stake consensus algorithm that also borrows heavily from academic distributed consensus algorithms. And that's actually why I like it. It's like this funny, very cool hybrid of these ideas from blockchains and these incentivized systems. And then also this very academic, classic distributed systems thinking. So that's one of the things I really like about it. So it's this almost agnostic protocol for replicating state in this way. You don't necessarily even really need a blockchain or a cryptocurrency to use something like Tendermint. 
So that's the bottom and it just lets you, again, replicate state from one node to another, in a network. And then you have the Cosmos SDK, which provides an application layer on top of Tendermint. So most people who are building Cosmos blockchains are going to use the Cosmos SDK. There are some other blockchain applications that use Tendermint directly, but, for the sake of this conversation, I'll focus on the Cosmos side of things. So Cosmos is one layer up. And the analogy that people sometimes like to make with Cosmos, an aspirational analogy in many ways, but people like to say it's like Rails for blockchains. It's really meant to be this framework, this tooling that makes it really easy for anyone to spin up a blockchain pretty quickly. Again, a little bit aspirational, but that's the goal, and I think it's a good one. And then Gaia, like I said, it's the daemon. So it's literally the software process that runs on the Cosmos Hub. So if you are trying to join the Cosmos Hub, you need to run an instance of Gaia, and Gaia is built using the Cosmos SDK. So you have this set of layers going up almost, with Tendermint at the bottom. And then there's the Cosmos SDK and then there's Gaia. Now this does beg the question, where does IBC fit into all of this? James (00:30:28): I was just about to ask. Tess (00:30:31): It's the natural thing to wonder next. And so I'll channel Chris Goes here for a moment, who's the original author of IBC. And I'll point out that IBC is really like a blockchain network agnostic protocol. So when we talk about IBC as a protocol, we're really talking about the specification, the paper, the protocol that has to be spoken and not the implementation. But the way that IBC is implemented for the Cosmos network is in the Cosmos SDK. 
So there's a module, the Cosmos SDK is comprised of all of these modules that let you do different things, whether that's staking or I think there's account management, or different consensus modules, if you don't want to use Tendermint for some reason, it breaks my heart, but I understand. Anyway, there's an IBC module within the Cosmos SDK and that is where IBC for the Cosmos Hub is implemented. Again, that distinction is a little bit pedantic, but I think Chris would be upset with me if I didn't say it. So I'll make that really clear. James (00:31:42): I suspect he would. Anna (00:31:45): I want to go back. I have just another question about Gaia. I just want to take a quick step back. Is that the client, or is the client in it somehow? Is it a version of the SDK or is it a specific client software? This is what I'm not getting. Tess (00:32:00): I think it depends on what you mean by the client. Anna (00:32:04): I guess I'm thinking always in the Ethereum context of like you're running a Geth client, you're running a node with this particular software. Tess (00:32:13): I think it is. If you want to both know more about Gaia specifically, and also understand the comparisons to Ethereum, I know you're talking to Shahan next, and he both comes from the Ethereum world and works all the time on Gaia. So he's our guy. He will tell you conclusively. James (00:32:32): Can I take a shot at an analogy here? Tess (00:32:34): Go for it, please. James (00:32:35): Tendermint Core is the engine at the center of this machine, Cosmos SDK is going to be how you get the gas in and the energy or movement out of the engine. And then Gaia is the actual machine that we've built. Tess (00:32:53): Yeah. I think that's a pretty good analogy. James (00:32:55): It's not perfect, needs work. Anna (00:33:00): Maybe by the end of this episode you'll have it. James (00:33:02): Yeah, we can reshoot later, take this part out. Anna (00:33:04): So let's now turn our attention to Stargate. 
We learned just before from Zaki his perspective on what this upgrade looked like and the kinds of work he had to do, but I'm really curious, from where you're sitting, what did this upgrade actually mean? Tess (00:33:26): Yeah. So I think, especially coming from the Tendermint Core perspective, the lower level, core engineer side of things, Stargate really represented the culmination of probably a year and a half of work to prepare the Cosmos infrastructure for IBC. So as some people will know, the Cosmos Hub had launched in early 2019, and it launched, as many systems launch, like a first draft. There were a lot of things that were like everyone's best guess of how the system should work and how it should be laid out and what encoding algorithm it should use for things and how you serialize data or what data structures should be used. And some of those things turned out not necessarily to have been the right choice, which is just how this goes. That said, because this is a blockchain, it's actually very difficult sometimes to make these changes. If we were working in a system that had mutability or didn't need to have this long immutable, consistent history, things would be a little bit easier to change. But with blockchains often you have to treat these kinds of changes as huge breaking changes. And so that, I think, is the reason that Stargate became a whole thing. It's like the core teams had to rethink a few of the assumptions that turned out to maybe not be as usable as they'd hoped or the performance wasn't there. And again, this is a very natural part of software evolution, but I think one of the things that's kind of novel for blockchains versus more traditional software development is that you have to figure out how to get around this immutability issue. 
And so I think it was actually Zaki's idea to say: "We should really brand this launch, because we're going to need so many people on board and all the clients are going to break, and all the wallets are going to break, and it's going to be a huge thing for validators". And I think a lot of that was really driven by this change to the serialization format. So basically the first version of Cosmos and Tendermint and all of that used an in-house encoding algorithm for serializing data. And it just wasn't as performant or as usable as we wanted it to be. And so there was a huge migration process that took a really long time. And it took many smart people working for many, many hours to make the switch over and to do it in a way that wouldn't be breaking. And so that, again, this is really, really biased as someone who's working on low-level, the stuff that's at the bottom of the stack. But to me that was actually the motivating thing, even the branding of it as Stargate and all of that, it was like: this is the thing that's so breaking. We know it's really important because this is so foundational, but it's going to be so breaking. We can't just rip the Band Aid off or pull the rug out from everyone. This is actually going to become something where we need everyone to buy into it together. And the amazing thing is that people like Zaki, and I think chiefly Zaki, were able to go around and really rally that support from not only all of the development teams across the ecosystem, but partners and wallets and exchanges, and do all of that work. I mean, it's the most boring part, this huge branded release, but from where I sit that really was one of the biggest, if not the biggest, motivations for this whole thing. James (00:36:57): Okay. I have a couple of questions about this. First, was the idea behind branding it that you couldn't rally companies and the ecosystem around a numbered release, and so you had to make this something special that people would get behind? 
Tess (00:37:13): I don't want to say it's like: "Oh, people couldn't get excited about a numbered release", because that makes it sound like people are frivolous or something. I think it's more that it's just confusing. It's better to have something that's tangible. And, frankly, we had to change the numbered release, that we thought it was going to be, a couple of times. And also it was useful to have something that was like a moment that we could point to. And we could say: "This is the Tendermint core version, that's going to support Stargate. This is the Cosmos SDK version, that's going to support Stargate. And this is the Gaia version." Those numbers are all different. So if you can package them all up and say: "This is the Stargate release bundle, effectively. This is the set of software that we will run when we all go through the Stargate together." If you can do that, it's just an easier story to tell. James (00:38:03): Because it touches so many pieces. It's not just a couple of releases here and there, it's an ongoing project. Anna (00:38:11): One thing we didn't get a chance to talk about with Zaki, cause what we talked about mostly with him was that he was doing a lot of communicating, exactly this thing you were saying of sending the message out, getting everybody together. But I'm curious on the timeline. Was the work already done? Had a lot of the build been finished and then there was this period of communication? Or was it communicated and then you had to work? I'm curious how that happened. Tess (00:38:36): There was some overlap. So I joined the project in late fall 2019, which is to say, I haven't been with it for that long. And at that time a lot of these changes had actually already been decided, but it wasn't really clear how they were going to actually end up on the hub.
So people knew they wanted to change the serialization format or they knew that there was a path forward for IBC, or there are a few major Tendermint core features that we knew we wanted to do. So there was a path there. And the big question mark was: what does it actually look like to get this onto the hub? And we had been writing the software and doing the development for several months before the communications started. But the comms phase, that Zaki ran and managed, definitely began before the software was finished. So he was working on rallying the troops, starting in July or August of last year, I think. If he said something different, he's going to remember more clearly than I do, but it was last summer. And then at that time, the Tendermint core team, and again, cause it's the stack, the Tendermint core team has to finish their work first. Again, you can start building on top of it, but you have to cut the releases in order. So the Tendermint core team at that time was mostly done with the work. We were wrapping up a few outstanding things. And really we were putting out what we call release candidates, so the things that we think will be the software that will run on the hub for the next major Tendermint release. That process lasted a few months and we had lots of people try the RCs out, find problems, and then we go back and fix them. And so I think it was the first week of November, or maybe the end of October, when we actually cut that final release for Stargate. Anna (00:40:23): Was the Game of Zones during that era, or was it after, or before? Tess (00:40:30): It was a little bit before. And the reason that we could do that was that, that was really testing the IBC piece of it specifically, which didn't depend on fully ratified changes all the way down the stack, if that makes sense. You can test that piece a little bit in isolation, and that's what Game of Zones was all about.
That said, there were so many later testnets that happened, maybe with slightly smaller prizes or a little bit less advertised, but many, many testnets were run, leading up to Stargate, that Zaki and other teams also managed. And those were instrumental in finding bugs and helping us get stuff fixed. James (00:41:12): Interesting. You mentioned that there were a lot of people feeling like they would eventually move away from Amino, people wanted to do this. And I remember from being involved in Cosmos a few years back, that everybody disliked Amino, but nobody was willing to do anything about it. Would you say there was a specific person or event that turned this from mild distaste to actual action on a big scale? Tess (00:41:42): I don't know that there was any single person or event. I will say that this work, when I joined the project, and I joined the project again to lead Tendermint core, when I joined the project, that feeling was very, very strong and the work had already begun to remove Amino and replace it with Protocol Buffers. So that was not my decision. I also stood by the team in that decision and I was like: "This is something that we're going forward with." This was not something that every single person in the ecosystem wanted. And there was a Tendermint fork, briefly called "Tendermint Classic", that had Amino in it, still. I don't think that project is still under development, but I'll just be honest, this was not a universally beloved decision, but it was something the team really wanted to do. And I know a lot of users really wanted it, too. And so I think it was helpful to have someone who was at least nominally in a leadership position. I was brand new to the project. So what did I know? But I was nominally in a leadership position and I said: "I really trust this team and I trust the way they want to execute on this". And so I think that helped.
Around the same time that that decision really was ratified, that was also the time that Interchain GmbH was formed. And that was another thing too. Interchain GmbH, it's a safe space for core engineering. That's the joke. And it's a place where people can make decisions like that, that might be a little bit politically unpopular, but technically correct. Obviously, I'm biased when I say that also. But that's really what it's meant to be. It's like we're liberated from other initiatives that are happening in the ecosystem, and we can really just try to focus on building the best core technology, that's going to serve as many people as possible. That's the mandate. And that's really why Interchain GmbH exists. James (00:43:33): That's a great mandate. Tess (00:43:35): Thanks. Anna (00:43:35): What you've just described... You said there was this very short fork, but was that fork when you did the Stargate upgrade? Or was that fork when you did something else in the past? Cause I thought it was all bundled into this one thing. Tess (00:43:54): I won't beat around the bush here. Jae Kwon, the original author of Tendermint, wanted to keep using Amino. And so he forked the project, I think at the point at which it became clear that this Amino removal initiative was going to proceed. And so I don't remember exactly when that happened. I think it was maybe about a year ago, I'm not totally sure. There's no bad blood there or hard feelings. This is just one of those things that happens in blockchain development. Sometimes you have a political disagreement or a technical disagreement, and someone forks the code base. And sometimes those things reconcile again later and sometimes they don't. In this case, I think, from a community perspective, there wasn't an enormous amount of contention. And again, I think it also really speaks to organizing scale and also the strength of the Cosmos community, that there wasn't any confusion there.
People were like: "No, we know what we want. We know that we want to move forward with Stargate." This is not a big issue in any way. But I just mentioned this just to say, sometimes there are technical disagreements and this is how this goes. Anna (00:44:56): I think when you said fork, I thought fork the blockchain, you just meant fork the code base. Tess (00:45:00): No, no, sorry. The word "fork" is so overloaded. Tess (00:45:09): I'm just glad that I work in a system where we don't get the forks that are an accident of consensus and these probabilistic, proof of work-type chains. So at least we don't have to worry about that kind of fork. No, I just meant a fork of the repository, a fork of the code into the project. James (00:45:27): Are you worried about consensus failures between the Go implementation and Rust one, if that ever gets completed? Tess (00:45:34): I think the Rust implementation has been a great forcing factor for the whole project to get really, really scrupulous about the specification. And fortunately there's a lot of other reasons to also be scrupulous about the specification. I mean, there's the usual table stakes, like if you have a good spec, then you're going to have better code. That's just table stakes. But there's also all this work that Informal Systems is doing to actually formally specify Tendermint and then formally verify that specification. So there's lots of reasons to have an implementation and a spec that are very true to each other. And so for that reason, because that work is happening, I'm less worried about the Rust implementation and Go implementation diverging from each other. It is a risk, but there's always a risk in these consensus systems, that something breaks consensus. That is our one job, preventing that. And so it's something that we'll be careful about, but I'm not super worried about it at the moment. James (00:46:36): Interesting. It's an extremely difficult job.
One of the recent Geth ones was caused by Shallow Copy - Deep Copy optimization. Tess (00:46:45): There's so many ways to screw things up. James (00:46:50): Yes, there are. Anna (00:46:53): One thing that I'm still curious about is: during this upgrade, the actual upgrade time, what was that like anyway? What was that like, in general? How was that really rolled out? And also, what was it like for you? What were you actually doing? Tess (00:47:06): Oh man. So it was extremely stressful, I'm not gonna sugarcoat it. I mean, it was one of those things where I knew that it was potentially going to be "all hands on deck" and that people are going to be standing by for it. The scheduled upgrade time was also like 6:00 AM my time, which is just kind of brutal. I mean, it's a global community, it's a global project. I think it was timed for the convenience of the exchanges in Asia explicitly, for some reason, I don't know the details there. But it's a political decision actually to decide when the upgrade goes. And I drew the short stick basically. So it was this tricky timing and, from a Tendermint core side, our software had been pretty much stabilized and we cut the Stargate version of Tendermint back in November, and the upgrade was in February. And what was actually great about that is that it gave the software some time to mature. And like I said, there are non-Cosmos users of Tendermint, and some of them started using the Stargate version of Tendermint before Cosmos did. And some of them found some bugs, which was great. And we were able to work with them to get those bugs fixed. So I'll give a particular shout-out here to Crypto.com, their team surfaced a security issue and helped us test it, and they were just fantastic partners. They were awesome. Anyway, some things had been surfaced, but on the whole it's like: "Eh, this software has been around for a few months. It seems pretty stable. It's pretty okay."
And then at about 11:30 PM, the night before this upgrade, I got a message from someone that said: "Hey, I submitted a security issue to your HackerOne, and it looks like it's stuck in triage. I just wanted to let you know that my private keys are all over my logs." For his validators. Like, his consensus keys, which let you sign blocks, and if you misuse them, the validator gets punished, were being leaked all over his logs. And logs sometimes get shipped off to log providers, a log is not a secure place. And I was like: "Oh God!" And what had happened was that, like I said, the Cosmos SDK is a downstream dependency of Tendermint core. And I'm going to get a little bit detailed here, I think it'll be interesting to any programmer. What had happened is that the Cosmos SDK changed their logging library to something that, when you passed it an object, would use reflection to look at the internals of the object, while Tendermint core was using a logging library that assumed that you'd called the toString method on any object, and you'd get the string representation. So we had, in the Tendermint core code base, this particular private key type, and the one that we were using, or the one that was a problem, is local key storage, which no one really should be using in production. So probably it's not super secure anyway, so probably not that big of a deal, but just a really bad vibe and look to have these keys here. So what had happened is that Tendermint core expected this thing to be stringified, and the SDK was actually looking at the object itself, and the toString method in Tendermint would strip out the private key material. But if you look at the object itself, you can get that private key material. And so that's what had happened. It was this discrepancy between the logging libraries, which is the most mundane thing you can think of, but it had this really ugly effect. It just looked catastrophic.
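The logging mismatch Tess describes can be illustrated with a minimal Go sketch. It is not the actual Tendermint or Cosmos SDK code: the `FilePVKey` type, its fields, and the two `logVia...` helpers are hypothetical names, and `encoding/json` is used only as a stand-in for any logger that walks an object's fields by reflection instead of calling its string method.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FilePVKey is a hypothetical stand-in for a file-backed validator key.
type FilePVKey struct {
	Address string
	PrivKey string
}

// String returns a representation that deliberately omits the private key.
func (k FilePVKey) String() string {
	return fmt.Sprintf("FilePVKey{Address: %s, PrivKey: <redacted>}", k.Address)
}

// logViaString mimics a logger that calls String() on whatever it is given.
func logViaString(v fmt.Stringer) string {
	return "key=" + v.String()
}

// logViaFields mimics a logger that serializes the value's fields directly.
// encoding/json walks the struct by reflection and never calls String().
func logViaFields(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "key=<unserializable>"
	}
	return "key=" + string(b)
}

func main() {
	k := FilePVKey{Address: "ABCD", PrivKey: "s3cr3t"}
	fmt.Println(logViaString(k)) // private key stays out of the log line
	fmt.Println(logViaFields(k)) // private key ends up in the log line
}
```

The same value is safe under one logger and leaky under the other, which is why swapping a logging library in a downstream project can expose material the upstream type assumed would always be stringified first.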
Even if, again, you probably shouldn't be using this type of key, this file storage for your keys in production. We were like: probably if you tell programmers not to do something, they're going to do it. And so basically, with about six or seven hours on the clock, I was like: "All right, we gotta figure this out." And it wasn't obvious at that point, when I saw this problem, that the problem was the logging library thing. It was just like: something is wrong. And so fortunately I was able to rope in a few folks: Alexander Bez, who is known as Bez, he's a longtime Cosmos contributor who's working on Tendermint core now. He's in New York, so he wasn't asleep yet. So I called him up and I was like: "What's going on?" And he and I looked at it, and he has worked on both Cosmos and the Tendermint sides. And so he was the one who was able to diagnose it, because he knew both code bases really well. And then we tried to cut a new release and there had been a Go release that week, which broke all of our release tooling. So our release tooling expected a different version of Go, than we were using, because a new version of Go had been released two days earlier. And it was just like a series of unfortunate events, basically, that unfolded at three or four in the morning, two hours before the release. Anna (00:52:01): You know what's wild? As much as it's a technical thing, there is a right way that this works, but it is this emotional thing as well, where if things like this happen, it just throws off the entire process. It sounds like you can't just push through that stuff, because it will hurt, it will do something. Tess (00:52:21): Totally. And so I was up really late, Alessio who works at Tendermint, Inc was up really late, Shahan was up super late. Well, I guess it wasn't super late for him yet, but he still had to get up really early the next day to deal with all this stuff.
And so I basically got the release done at like 4:30 in the morning, and then I went to bed and then I woke up at six to just see if the release had started. And I was too tired to function. So I went back to sleep and then we did a little launch party, which was super fun, but that was a couple of hours later. I got up for that and then went back to bed, which is all to say, that I pretty much slept through the actual Stargate release. So I had a lot of stuff to deal with right up until that point. And then it was actually amazing to look back on, because the validators and the community members were so incredibly supportive of each other and of helping each other out. And there were people from core teams, who were standing by and helping out with things, but there were a number of people, who are community members and who are running validators, who were just helping each other out, answering questions. I don't actually know the details of this again, partly because I slept through it, but there were some things that just took longer with the upgrade than expected. And I think Shahan will be able to speak to that. But people were really able to help each other out and say like: "Oh, this is what worked for me." Or like: "Oh, it took me two hours to get through this phase, just hang in there." And that was really, really amazing to see. James (00:53:55): That's an incredible story. And I have to say that I have literally had nightmares about dealing with the Go module release. So I am so sorry that you had to go through that at midnight, right before a major launch. Tess (00:54:12): Yeah. Well, it is funny, because the story is, the Go module system, like blockchains, has a certain aspect of immutability to it. And so if you look at our change logs in our Go modules and all of that, you can actually see the history of a couple of release attempts that didn't happen. 
And so Tendermint 0.34.6 and 0.34.7 are just no-op releases, they just don't actually exist, they're artifacts of a failed build process, or failed release process. Anna (00:54:43): The story you just told, that was the actual night of the actual upgrade. But the upgrade had also been delayed. That was a different thing, right? That was two weeks prior to that. Tess (00:54:53): So one of the things that, and I don't remember the exact cause of that delay, but the way that these upgrades get scheduled, is someone has to submit a governance proposal. And then there's like a voting period for the governance proposal. And then two weeks after the governance proposal goes through, then the upgrade is scheduled to happen. And I think that the first governance proposal went out and, even in that first proposal, at some point, while it was live and we were waiting for the release, that was when that security issue that Crypto.com found and helped the Tendermint core team out with, that was when that surfaced. So there was a date scheduled. And the thing about these scheduled upgrades is that they point to a specific commit hash on Gaia. So everyone's committed to running a specific version of the software. And if you change it, that's not part of the governance, that's not what people voted for. In the case of that particular Tendermint core issue that I just described, we actually were able to make the change for that in such a way that you could still run the commit hash in the proposal, and it would be compatible with the patched version. And so we told people: you really should upgrade to the patch version, and you can run the one in the proposal, if you want to verify that that's what's happening, you can do that, but we recommend you run this patch one. So that seemed fine. And then there was another issue that emerged, I think. In Cosmos SDK or Gaia itself, I don't remember the details of this one, but that was one where we could not get around it in that way. 
And so that was why it got rescheduled, it wasn't just that there was a bug. It was that there was a bug, where patching the bug was going to be incompatible with the scheduled version. And so it was at that point that we said: "Okay, we need to restart this process." Anna (00:56:51): Well, thank you so much for sharing this story of your experience going through the Stargate and going through this upgrade. Our next interview is with Shahan and we hope that in that interview, we can actually hear a little bit more on that exact time of the upgrade and what's happened since. Thanks so much for your part of the story and congrats! Tess (00:57:14): Yeah. This was super, super fun. Thank you so much for having me. Anna (00:57:20): Cool. Anna (00:57:25): So I want to welcome our last guest of this three-part interview series, Shahan, who's the Cosmos Hub lead. Welcome to the show, Shahan. Shahan (00:57:34): Thank you for inviting me to your podcast. I'm very excited to chat with you today. James (00:57:37): We're very excited to have you. Anna (00:57:40): So we've had an interview with Zaki, who shared his perspective on what was happening, from the perspective of Iqlusion, mainly. We heard from Tess about the actual Stargate launch day, in some detail. Now what we want to do with this portion of the episode, is to hear from you as the Cosmos Hub lead, what it was like for you to go through it, and also hear a little bit about what's happened since. But before we start into that, maybe just say a few words about what you actually work on and what being the Cosmos Hub lead actually means. Shahan (00:58:14): Sure. In my role as Cosmos Hub lead, I'm responsible for communicating and gaining agreement on the Hub roadmap from the diverse dev teams and the validators that participate with the Hub. So I'm new to the role, it was around 3-4 months, when the Stargate upgrade was taking place. So I was quite new to the ecosystem. 
And what that meant was communicating a lot with the dev teams at the different layers, the Tendermint team, the Cosmos SDK team, and those who had been managing the Gaia application over the years, and trying to understand what exactly the upgrade required. The upgrade itself was quite straightforward on paper. And the most complex equation of this upgrade was the communication with the validators, with the dev teams, making sure everything was in sync and making sure that everybody knew what to do, when. This Stargate upgrade was quite intense and required the stopping of the chain. And it was a coordinated effort. Anna (00:59:25): You're pretty new to this role, you started and then very soon after were faced with this upgrade. What was that like? When did you actually start? Shahan (00:59:35): My start date was end of November, beginning of December. The first two months of my onboarding was essentially talk to as many people as I could and go through the stack. Prior to joining Cosmos, I was aware of the Tendermint consensus algorithms, so I had a strong understanding of that, but it was the rest of the stack, which was a very deep and very interesting scan through code. At the same time, the Stargate upgrade was a migration of old code to new code. So it was very interesting exploration trying to figure out which parts were old, which parts were new. And trying to answer the questions was very hard for me, but essentially my focus was on the upgrade itself and the coordination around that, and the communication, making sure that all parties were aware of what their responsibilities were and making sure that if there were any issues, they would be addressed by the right parties. James (01:00:38): I mean, that's a really interesting position to be in. You've just joined the ecosystem, you have almost no idea what's going on and you're trying to catch up to speed really quickly. And your job is to coordinate everyone, all at once. 
Anna (01:00:55): Did you know what you were getting into? Shahan (01:00:57): Yeah. My background is from the Ethereum land. I have a lot of expertise with enterprise blockchains and Ethereum mainnet stuff, but what is different here is that the Cosmos ecosystem runs a different form of consensus algorithm. So the way that the network upgrades is different than how things proceed in Ethereum land. And what I enjoy about this community is that everybody is on board and in tune, and there is an on-chain governance mechanism that people participate in. So there's a lot of activity and involvement from the validators, and the development and the roadmap and agreeing with what's coming up with the Hub. So this social coordination mechanism and the on-chain governance, they work very well together to signal what's coming up and introduces a fairly straightforward process amongst the parties. Anna (01:01:54): When you say, you're the Cosmos Hub lead.... We learned from Tess about the different layers of the stack and the fact that like Gaia, we were confused about whether you call that a client or if it's an interpretation. And maybe you can explain a little bit exactly, what is the Cosmos Hub and how should we understand that? Shahan (01:02:17): A great question. This was a question I didn't even know to ask in my first month. The Cosmos stack, as far as I can see, is built up of the Tendermint consensus layer, which is essentially a way for validators to agree on some data. And the Cosmos SDK is the layer on top, which relies on Tendermint to communicate the business, somewhat business logic, key stuff. And then Gaia imports the Cosmos SDK, and is actually the application, or the client, that runs the Cosmos Hub, the current version of the Cosmos Hub. So essentially Tendermint is a consensus library that can operate for other business logic. And Cosmos SDK is typically the library that's used by partner chains in order to develop their own chains or zones. 
Anna (01:03:13): But the Hub also depends on it, I guess. If you change the Cosmos SDK or add something to it, the Hub could potentially use that. Shahan (01:03:22): That's correct. When there is a change in the underlying libraries, Gaia is essentially the downstream partner of these development teams. And we do some coordination, and when there's a new release, Gaia also will tend to have a new release. In the most recent weeks, there has been one upgrade, a security release that required some coordination in that manner. And it's straightforward to do a release, you just bump the version and cut a new release. Right now, Gaia, as a client or application, has some functionality, mainly for on-chain governance, for token transfers. But what's coming up are some exciting modules, like a bridge with Ethereum and a decentralized exchange. And including those modules into the Hub will require increasing amounts of coordination. And that's when things will get interesting. So the Stargate upgrade, as a pretty involved process, was a good first step in that. Anna (01:04:24): Trial by fire training ground. Shahan (01:04:26): Yeah. Trial by fire is probably the most heard expression. Anna (01:04:34): So now we understand the Hub, we understand how the Hub interacts with the Cosmos SDK, but what actually happened during the Stargate upgrade? From where you were sitting, were you just updating Gaia the whole time, or what was it exactly? Shahan (01:04:49): So I was just sitting and watching, really. So as you heard from Tess, there was a security patch issued, and Gaia applied an update that included that. And once the release was available, the nodes were then informed that this is the version that they would have to update to. The actual process by the validators would be: they stopped their client, they export their state, they migrate their state, they install the latest version of the client and then they start their client. So just a few commands, really.
But the process took some time. Anna (01:05:33): And I bet if you have enough people, they'll find every way to do that wrong somehow, am I right? Shahan (01:05:42): Yes, of course. Starting from reading or not reading documentation. Anna (01:05:50): It's very obvious that you'd have the validators doing the upgrading. And I know that that's how it worked, but it's so interesting that, when you think of this Hub, it's not like you updated anything. That's how I had asked it, and sorry about that, it's completely like every single validator going through the process. So your work then is really preparing all of that documentation and being ready to answer all the questions. Shahan (01:06:14): That's right. And being aware of potentially some contingencies. So if the version had to change as it did from the original governance proposal, the original proposal had a particular commit hash of Gaia, which then changed and the original proposal couldn't change. So there needed to be some way to communicate the version to update to. Of course, some folks who are diehard on-chain governance supporters would be like: "Well, I didn't vote for that version. So I'm going to stick to this version." And of course people are free to do what they want. But essentially the upgrade was very smooth. It took some time because the migration itself, some of the steps took a few hours, and it also required some time zone coordination, because the validators are spread globally. So probably after 5-6 hours, there was over 50% of the validators that had upgraded. And I think it took around 6.5 hours for the 66% mark to be passed. Anna (01:07:26): And that's what you were looking for, I guess. That was the threshold of the upgrade. Shahan (01:07:30): That's right. Exactly. Anna (01:07:33): But what happened to the folks that didn't upgrade? Are they still not upgraded? Shahan (01:07:36): I don't think there is a way to find out which validators are running an old version.
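The "66% mark" mentioned here reflects the requirement in Tendermint-style BFT consensus that more than two-thirds of voting power be online before blocks can be committed again. A minimal Go sketch of that check follows; the `Validator` type, its fields, and the sample numbers are hypothetical and are not taken from the Hub's actual validator set.

```go
package main

import "fmt"

// Validator is a hypothetical record of one validator's stake-weighted
// voting power and whether it has finished the upgrade.
type Validator struct {
	Name        string
	VotingPower int64
	Upgraded    bool
}

// hasSupermajority reports whether strictly more than 2/3 of total voting
// power belongs to upgraded validators. Integer math avoids rounding issues.
func hasSupermajority(vals []Validator) bool {
	var total, upgraded int64
	for _, v := range vals {
		total += v.VotingPower
		if v.Upgraded {
			upgraded += v.VotingPower
		}
	}
	return total > 0 && upgraded*3 > total*2
}

func main() {
	vals := []Validator{
		{Name: "a", VotingPower: 50, Upgraded: true},
		{Name: "b", VotingPower: 16, Upgraded: true},
		{Name: "c", VotingPower: 34, Upgraded: false},
	}
	fmt.Println(hasSupermajority(vals)) // 66 of 100 is not yet enough

	vals[2].Upgraded = true
	fmt.Println(hasSupermajority(vals)) // all power upgraded
}
```

Note that the check is weighted by voting power, so a few large validators finishing late can hold the chain back even when most operators are already done.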
But any validators that didn't upgrade, they wouldn't connect to the network. Anna (01:07:46): So they're just not earning anything. Shahan (01:07:48): That's right. Anna (01:07:51): They might notice maybe. James (01:07:55): Can you talk a little bit more about what makes this upgrade different from an ETH hard fork? The typical hard fork process is that we choose a flag height at which all of the nodes smoothly switch over. What specifically made Stargate such a big deal that we had to just stop the chain for 6 hours? Shahan (01:08:16): Yeah, it's a good question. It starts to get into the more technical aspect. The Tendermint algorithm is a BFT style algorithm, Byzantine fault tolerant style algorithm. And what that means is that the validators are known, and in order for the consensus to change or the state to be able to change via this hard fork, the validators, the majority of those validators need to conduct their upgrade. In Ethereum land, a hard fork is essentially trying to convince the validators to update their client. And if they choose not to, then they can continue a separate chain. There will be a chain split and some validators will be on the old fork and some will be on the new fork. Anna (01:09:06): But when you say validators, do you mean miners in current Ethereum? Not validators in future Ethereum. Shahan (01:09:12): That's right. Anna (01:09:15): I mean, they play the same role, so... Shahan (01:09:17): That's right. And it's a bit interesting because staking is available in Hub, but it runs a BFT style. So it is a proof of stake system, but not like Ethereum, which runs a different consensus mechanism, essentially. James (01:09:35): Casper versus pBFT? Shahan (01:09:36): Yeah, that's right. Anna (01:09:39): When we spoke to Tess, she shared this story of how kind of hour by hour this process was going. And there was like a bit of a hiccup or at least optically, there was something that looked strange.
I mean, she mentioned you, you were in it with her, or you were also online at this time. What was it like from your perspective, what was happening? Shahan (01:10:00): From my perspective, I saw a lot of devs diving into code, trying to resolve the issue. And they got that figured out very quickly. They are true experts. They are very familiar with the SDK and Tendermint stack. And so from my perspective, I was just really waiting and looking for the latest release of the Tendermint or Cosmos SDK libraries. Once that release took place, it was essentially a few steps to cut a new Gaia release, to bump the versions, and then issue a new version. After that it was essentially communicating that version to the validators. So we have a verified validators channel on Discord, where validators were asked to keep watch for events like these. And once that took place, we were essentially waiting for the moment of time to stop the blockchain. There's perhaps an optimization we can do next time. For the Stargate upgrade we had selected a time-based halt versus a block height-based halt. And we were trying to do that, so that the time-based coordination would work out more easily, so that we would know which time zones would be most affected or not. And make sure there were enough people online. Because with a block-based halt, it means that the time could drift. So the side effect of choosing a time-based halt was that when the chain actually stopped, there were a couple of validators that had already passed the time of halt. And essentially there was this discussion of which blocks should be picked for the migration to occur. So there was a very interesting voting and social coordination of: okay, well, there's this block that was past the halt height and Sunny pointed out very clearly: "Any blocks after the halt height shouldn't be included in the migration. So let's pick this number." That was probably a key moment of the upgrade.
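The rule Shahan attributes to Sunny, that any block produced after the halt point stays out of the migration, can be illustrated with a small sketch. This is a minimal illustration only, not the actual Gaia or Cosmos SDK halt logic: the `Block` type, the function name, the heights, and the timestamps are all hypothetical, and it assumes each block carries a single agreed timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Block:
    """Hypothetical block header: just a height and an agreed timestamp."""
    height: int
    time: datetime


def last_block_before_halt(blocks, halt_time):
    """Return the highest block whose timestamp is not after halt_time.

    Blocks stamped after the halt time are ignored, mirroring the
    "nothing after the halt goes into the migration" rule from the episode.
    """
    eligible = [b for b in blocks if b.time <= halt_time]
    if not eligible:
        raise ValueError("no block at or before the halt time")
    return max(eligible, key=lambda b: b.height)


def utc(hour, minute, second):
    # Placeholder date; only the time of day matters for this example.
    return datetime(2021, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Illustrative chain tip: block 102 was produced just after the halt time,
# which is the kind of "extra" block discussed in the interview.
HALT_TIME = utc(6, 0, 0)
BLOCKS = [
    Block(height=100, time=utc(5, 59, 48)),
    Block(height=101, time=utc(5, 59, 54)),
    Block(height=102, time=utc(6, 0, 1)),
]

export_block = last_block_before_halt(BLOCKS, HALT_TIME)
print(export_block.height)
```

With a height-based halt the export height is fixed in advance and the wall-clock time drifts; with a time-based halt the wall-clock time is fixed and the export height has to be agreed afterwards, which is the coordination step described above.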
James (01:12:07): I think this is such an interesting example of how difficult it is to even talk about time in a distributed system. Your software is running on a hundred nodes worldwide. Each of those computers is going to have their clock set slightly differently. Shahan (01:12:23): Yeah, definitely. It was very interesting. Even with the halt height set to whatever it was, the blocks still take a few seconds to get produced. And it was very interesting to see that even with that, a majority of validators did produce an additional block past the height. Anna (01:12:44): Okay. I was wondering about that. Was anything lost in that upgrade? What could happen? Are there transactions that get lost sometimes or what can happen? Shahan (01:12:55): That's a good question. Yes. If there were transactions issued in that last block, when the Hub was restarted, those transactions would not be included in the migration. Anna (01:13:05): Wow. Can you imagine someone sent by accident tokens in a direction and then magically, they were back. That would be the best case scenario or something like that. Shahan (01:13:15): Yeah, that's right. Well, if that was for a payment that was validated for only 5 seconds, then that would be a risk. What other things could have gone wrong? I think one of the other challenges that some of the validators ran into was, in order to have robust systems, they would have a few nodes running and some nodes might be in front of their validator nodes. So because of these architectural decisions that they've made, and sometimes we have some clarity into the architectures, trying to figure out which nodes should be upgraded first, which nodes don't have to be upgraded. Those types of instructions also had to be shared prior to or during the upgrade. James (01:14:04): Is this something where you had to do a little bit of a forensic examination to figure out what setup the validator was using and how to upgrade it safely?
Shahan (01:14:13): We didn't do so much of that. I think it was, at least for the core devs, it was fairly straightforward about what had to be upgraded at what point. Essentially, if a node was acting as a validator, it had to be updated, but all other nodes did not necessarily have to be updated in that manner. They didn't have to go through the export and migration. They would just start with a new binary and then they would sync on the new chain. But given that, with the current security upgrade that happened recently and trying to improve the process and documentation for future upgrades, there is a plan to have these playbooks for how to upgrade these validators. What are common architectures that companies are using, and providing more support as the number of validators increases. And just getting them used to this whole experience. It's fairly straightforward to run a validator, in my opinion, but of course I've not run one, but there are some nuances involved. Like what happens when nodes fall out of sync, what kind of hardware requirements are needed? What if we want systems to be more robust? And of course the security aspect of key management. In the future, the aim of trying to do these upgrades as smoothly as possible, is to try and integrate upcoming modules more regularly. This Stargate upgrade was probably a year and a half in the making and involved many organizations. And the future of the Hub involves a similar set and a growing set. There's development taking place. These modules have to go through testing, getting more awareness for the validators about what they need to do and how frequently they need to do it, would be very helpful. This added functionality will increase the usage of the Hub, it'll make things a lot more exciting, add features. James (01:16:12): So hopefully the future upgrades are not quite as painful as this one. Shahan (01:16:20): That's right. We are exploring different mechanisms for making these upgrades smoother.
And currently there are some mechanisms that try to automate some of this. There is an upgrade module that allows somebody to issue a governance proposal, which would update a client based on a particular commit hash. There's also another mechanism, which is being used by the IBC module, which is in the process of being activated. And what happens in that scenario is, a module's code does get integrated into the Cosmos stack, either in the SDK or Tendermint or Gaia eventually. And once that code's present, somebody can issue a governance proposal to issue a parameter change, in order to activate that module. So this way people have some awareness of, and some mechanism to activate that module. Anna (01:17:18): So what's happened since this upgrade? We've described it. You also just mentioned that the IBC still needs to be activated. What does that actually mean? And what's happened since for you? Shahan (01:17:30): The IBC code, essentially, is part of the Gaia stack now, the Gaia, Cosmos SDK stack, which means that modules that rely on IBC will be able to use it. And right now, even though the code is part of Gaia, what needed to happen was, there needed to be an on-chain flag indicating to the validators that: okay, at this point in time, IBC should be enabled for any modules they wanted. So essentially, somebody submitted a governance, I think it was Zaki, submitted a governance proposal onto the Hub saying: "Time to activate IBC". And people will have to vote on it. And after two weeks of that vote, the IBC module will look at that flag and the on-chain state, and if that flag is set, then IBC will be available. James (01:18:26): Very interesting. So one of the outcomes of this big, painful upgrade is that you've started building in more ways like this to upgrade specific parts of the system. Anna (01:18:37): Or almost turn on. You don't have to go through the process of upgrading anymore, I guess. James (01:18:42): Right.
And that gives you a lot more flexibility for future upgrades, to put them in the code base much sooner and then activate them when they're ready. Would you expect that this is the long-term direction that Cosmos upgradability is going to go? Is it more towards module specific and very specific governance, rather than big upgrades? Shahan (01:19:06): I think in terms of the Stargate upgrade, the majority of those changes, as I understand them, were with respect to the underlying data format. So it was a performance optimization, it was other kinds of optimizations. Future upgrades could be as involved, especially if state has to evolve in order to accommodate the new modules. I wouldn't say that there's going to be one path for upgrades. I think there are numerous ways for upgrades to take place and they should be available, depending on who's involved and what kind of coordination is required. The coordinated Hub upgrade, we're trying to minimize those, but at the same time, if they do need to take place because of breaking changes, we want them to take place much faster. So trying to get the export faster. And some of the updates that have been applied to the software since the upgrade make that process much faster. James (01:20:05): So it sounds like these big coordinated state upgrades are expensive, but you're not scared of doing more of them. Shahan (01:20:14): No, I'm not. I think these are system-level admin things that should be fairly well understood. In particular, we want the validators to be comfortable with upgrading. We don't want them to think that: "Oh, I haven't touched the system in a year and a half, if I touch it now something's going to happen." We want them to feel confident. And the process of this coordinated mechanism, and in some cases, there are also uncoordinated mechanisms. For instance, I think it was a week or two ago, there needed to be a security patch applied and it required a 24 hour notice to the validators.
And that 24 hour notice was essentially: within 24 hours, a new version of Gaia will be released, please update your software and you don't have to export, migrate, you don't have to go through that process. And it's uncoordinated. So the validators don't have to turn off their software at the same time. They can all do it at their pace. Of course, the sooner the better, just to avoid the potential scenario where the chain halts. And in the scenario where the chain would halt, a majority of the validators would have had to update their client, and then the network would come back up, the chain would start back up. James (01:21:35): We've talked a lot about adding modules to the Hub and about this significant upgrade. What kind of QA processes or what other processes do modules undergo in order to ensure that they're ready to run on a globally distributed consensus system? Shahan (01:21:51): Great question. So we are working on a module readiness checklist, that will help relay the information about the module's readiness to the community at large. In addition to the technical QA, that we would expect from anything being deployed on chain, whether it's a third-party audit, intensive performance testing, there are other aspects, such as: who are the downstream partners impacted by this, who are the upstream partners that this module is dependent on, what kind of coordination is required as these modules get developed. And we're working very hard to have this checklist in a very easy manner for module teams to fill out. And this complements the technical checklist that some of the dev teams in the community are working on, including Region. Anna (01:22:47): Well, Shahan, thanks for coming on the show and sharing your experience of the Stargate upgrade. Shahan (01:22:52): Thank you very much for having me on here, Anna and James. Anna (01:22:55): James, I have a quick last question for you, since we're wrapping up this episode.
From what you've seen, looking across different ecosystems over the years, how does this upgrade compare? What did you think about it, also now that we've done these three interviews? James (01:23:11): Well, outside of the major Bitcoin and Ethereum forks, I think that this is by far the largest, most distributed and best coordinated network upgrade that I've seen. I don't want to underestimate, how much work goes into something like this. And talking to several dev teams and a hundred validators is incredibly difficult. Celo, my employer, we're going to be undergoing a hard fork next month, or late April - early May. And it is an extremely worrying thing. We're doing all of this testing and all of this QA. And I am happy to know that it is possible and to see someone and teams that have gone through it with poise... Anna (01:24:01): And come out the other side of the Stargate, huh? Shahan (01:24:04): They are in good hands with you there, James. Anna (01:24:11): Thanks to both of you and to our previous guests for doing this episode. And to our listeners, thanks for listening.