Alexander Gallego: People are just drowning in complexity, and I would say the next wave of software, just because hardware is so good, if it can do anything, it's really just about helping people tame the complexity of enterprise infrastructure. Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson. Well, I'm joined today by Alex Gallego, who is the CEO of Vectorized and one of the creators of Redpanda. Today we're going to talk all about Redpanda and stream processing, and Alex's history. He's famous in the world of streaming. Alex, thanks for coming on. Alexander Gallego: Thanks, Eric, for having me. Yeah, where to begin? I feel like I've been streaming for about 12 years. First, I was part of an early stage startup in New York called Yieldmo, where we were doing real-time streaming. I did a little bit of streaming in finance before that, but that's where big data and data streaming merged for me, as we were trying to grow with the biggest publishers and the biggest advertisers in the world. I was using Apache Storm then. And then the next step for us, after that, was my next company, which was called Concord. It was a distributed stream processing engine written in C++, on top of Mesos, and, frankly, Kubernetes won. Alexander Gallego: That's a whole other story. We sold that company in 2016 to Akamai, and saw some pretty big use cases there. But throughout my tenure at Concord and Yieldmo, really the last nine years, we couldn't find any storage engine that would keep up with the volumes that we were trying to push at Concord. So in 2017, I took two edge computers at Akamai, and I just put a Kafka server and a Kafka client on two edge computers. Then I measured how fast and how much I could push for, whatever, five minutes. I was just trying to understand, what are the breaking points?
What are the limits of what the software can give me? Alexander Gallego: And then I wrote something in C++, using DPDK, which is a kernel bypass for the networking stack. And I was using this library in C++ called Seastar. It's a super modern C++ library for doing IO. But one of the advantages of it was that it gave me primitives to do direct memory access to the storage layer. In layman's terms, it means that you could potentially bypass the kernel on the networking side and potentially bypass ... Well, you definitely bypass the page cache in the kernel for writing on the storage side. Alexander Gallego: So it's really as close as you can get to the metal, before writing to, I don't know, maybe an NVMe controller or something like that. So I just measured, what is the gap between the state of the art and what the hardware is capable of? And on the first try, I was shocked because it was a 34X [inaudible 00:03:06] performance improvement. Of course it was just a prototype. It was something really, really simple, but I just wanted to understand. So that was the beginning of that. And in 2019, we started Vectorized, and the project Redpanda. Eric Anderson: Awesome. Let's have you tell us what Redpanda is, just so we all have context. And then I have some more questions about the story you've just told. Alexander Gallego: Redpanda is a drop-in replacement for Kafka. What we are aiming to do, we're trying to advance the conversation in streaming where, I think 10 years ago, people were having to choose between safety and speed. So you either run your streaming without writing to disk, with potential data loss, and whatever. And what we discovered in the last few years is that hardware is so, so capable. What happened, though, is that, and I always tell this story, hardware is the platform, right? Fundamentally, software doesn't run on category theory. It runs on these superscalar CPUs, on these super fast multi-queue NVMe SSD devices.
So that's the platform. Alexander Gallego: But a lot of the software that existed, and exists today, that is still leading in streaming, was really built for decade-old hardware, right? Where spinning disk was the main thing, where the Linux kernel had a totally different block scheduler algorithm, which had an impact on how software behaved and how you interacted with the storage devices, etc. So what we found is that by rewriting a new storage engine from the ground up, in C++, we were able to really extract every ounce of performance out of the hardware. And what it gave us was this property that you can now run workloads that are safe, so no data loss, and that are just as fast. So you no longer have to choose between safety and speed. That's at the lowest level. So you could think of Redpanda as a drop-in replacement for Kafka. Alexander Gallego: However, what we've stumbled upon, and the way we're trying to advance the conversation in streaming, is really three things. Well, what happens when you start doing real-time streaming for fraud detection? You really need three things to turn a data stream into a data product. And it's not just what the standard Kafka API provides. Our observation was this: people love the Kafka API. A lot of people have a hard time running Kafka, the system, but they love the API. And here's the insight. There are millions of lines of code that were written against the Kafka API. So you can turn out a product overnight, because you can take Spark ML and TensorFlow and Elasticsearch, and you can just basically have a product overnight. It's incredible. And developers love that power; they can just plug and play with this super large ecosystem. Alexander Gallego: So that's the first thing in streaming: for us, Redpanda was all about interoperability with, effectively, millions of lines of code of the enterprise, right?
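That interoperability claim can be pictured with a small sketch. The broker addresses, settings, and client name below are all hypothetical; the point is only that a Kafka client configuration works unchanged against Redpanda once the broker address list is swapped, because Redpanda speaks the Kafka wire protocol:

```python
def client_config(bootstrap_servers):
    # Settings an off-the-shelf Kafka client might use. Because Redpanda
    # implements the Kafka wire protocol, the same configuration works
    # against either system once the broker addresses are swapped.
    return {
        "bootstrap_servers": bootstrap_servers,
        "acks": "all",                    # wait for full acknowledgment
        "enable_idempotence": True,
        "client_id": "payments-service",  # hypothetical application name
    }

kafka_cfg = client_config(["kafka-1:9092", "kafka-2:9092"])
redpanda_cfg = client_config(["redpanda-1:9092", "redpanda-2:9092"])

# Everything except the address list is identical: no code changes needed.
strip = lambda cfg: {k: v for k, v in cfg.items() if k != "bootstrap_servers"}
assert strip(kafka_cfg) == strip(redpanda_cfg)
```

The same holds for the rest of the ecosystem the conversation mentions: connectors, Spark jobs, and Elasticsearch sinks written against the Kafka API see only the protocol, not the implementation behind it.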
So one of our users, [Enly 00:06:14], who's a subsidiary of Snapchat, they had hundreds of thousands of lines of code written against the Kafka API. And they plugged it into Redpanda, and everything just worked. And that's a first-class experience. Then you need two more things to turn a data stream into a product, which is the vision for the company. The next thing is unifying historical and real-time access. So what happens in production clusters, and this is actually what I think Pulsar got right, is their contribution to the streaming space was the disaggregation of compute and storage, except the context for Pulsar was Yahoo and HDFS and ... Alexander Gallego: Well, HDFS wasn't successful and S3 is successful. So basically for most people, S3 became the data lake. I'm sure HDFS has some large users today, but really S3 and Google Cloud Storage became the true disaggregation of compute and storage. So when I say unifying historical and real-time access, I mean the same Kafka API to access both historical data and real-time data. So what happens is, as a developer, you get Kafka up and running, or Redpanda up and running, great. Then what happens? Let's say the cluster crashes or something catastrophic happens. You need to be able to restore the data. So that transparent tiering of storage is handled automatically by Redpanda. Alexander Gallego: We're probably maybe 50% of the way there, so we just have archival. We're working on transparent re-fetch. So that's step two. The last step is our idea of what most people refer to as stored procedures for streaming. Everything, I feel, that's invented these days, somebody thought of in the 1970s, like everything in computer science. Stored procedures have just been a thing in databases for a really long time.
And the intuition here is, you save a row in your database and there's a piece of logic that lives in the database that will manipulate that data and write it to a different table, or do something with it. But it's logic that lives in the database. Alexander Gallego: So what we've done is take that and modernize it, using WebAssembly, and say, "Oh, you can now have these stored procedures for streaming." The reason why that's powerful is, imagine you're a food delivery company, and you're trying to strip the PII information. So your social security number, or maybe your credit card. I guess a food delivery company wouldn't have your social, but they would have your credit card number. What they want is to run these machine learning and recommendation algorithms, but they don't want to run the risk of private information leaking. Alexander Gallego: So what they do is they now ship a little JavaScript snippet to the Redpanda engine, so it lives on the storage side, to just do the simple filtering, enrichments. So this is not meant to replace the Apache Flink or the Spark Streaming. I think it adds to the richness of streaming. These are one-shot functions that live in the storage engine. Anyway, so to summarize, it's really three things. We call the combination of these three things the intelligent data API. The point is, how do you take a bunch of data streams and turn them into a data product, in a way that's self-service? Eric Anderson: You're right. It defies categories a bit, which is why you have to elaborate on what exactly it is. So Redpanda, to some, is a streaming service, a drop-in Kafka replacement, but in many ways, it's much more, including those two or three additional modes, phases you described. Alexander Gallego: What is interesting, though, is streaming is evolving, and when people think of streaming, the ideas of using streaming from a decade ago are totally different.
Coming from having built Concord, in the conversation six years ago, people were like, "Oh, how's Concord related to complex event processing?" Basically, people still hadn't moved on to the new ideas of putting these immutable events in your infrastructure as the source of truth that you can then ... Basically, in the paper from Amazon on Aurora, they said the log is the source of truth and the database just becomes caches. So that architecture of separating the log of what is happening with your business from the materialization that the consumers end up talking to, like a SQL database or a Postgres database, is really enriching the conversation in and around streaming. Alexander Gallego: So the idea for disaster recovery and historical unification, all of that concept by [Touchbar 00:10:39], is because what developers want is to write your app. And let's say it uses Spark ML, for simplicity. It consumes data from Kafka, and it pushes it to a different Kafka topic. In our case, Redpanda. That's the mental model. Developers don't really care or want to have two jobs, one that consumes historical data from S3 and then pushes into Kafka, and does a bunch of ... That logic could just be handled by the storage engine. And by the way, when you do that, you also unify access control lists and security, and tenancy, and throughput. So there's a lot of benefits to a storage engine giving a standard interface, a standard API that developers can program against. Eric Anderson: Got it. And earlier you said, as you made these other nascent streaming efforts, the storage engine was the thing that you struggled with. So Redpanda, the innovation there is largely in the storage engine. Is that ... Okay. Alexander Gallego: Yeah, correct. So the innovation is, can we treat the Kafka API like SQL engines treat SQL? So I would say we're like the CockroachDB to the Kafka API, right?
So Cockroach takes the Postgres API and adds a bunch of very cool primitives on top, like geo-replication, follow-the-workload, just some really neat things that people want out of a SQL database. Similarly, we're just treating the Kafka API as the communication layer. So all of the Kafka ecosystem, it just works ... But what we give is new primitives. So one of the fundamental differences, though, in design, is that we operate in a consistent mode, versus the Kafka alternative, which is an AP mode. We use Raft, which is a strong, [inaudible 00:12:27] protocol. Alexander Gallego: So it means that the developer understands, mathematically, what it means to have two out of three replicas up and running, or three out of five, or one out of five. It doesn't matter what the combination is. There is a well-known and exact understanding of the state of your system. That is so powerful, because then you can build actual primitives on top of it. Then you understand, "Oh, if a node goes down," you have an exact mental model that is proven with a mathematical proof. It has a design paper that is easy to understand. And because it's Raft, we hook into this huge ecosystem of how to verify that the code that we wrote empirically actually delivers on those claims. So it's, I think, the right foundation to build streaming systems on top of. Eric Anderson: Got it. Yeah, no, you've assembled a great stack there. So you've got your new storage engine and Raft. And then remind me, we did an episode recently on ScyllaDB and Seastar. Remind me of the role Seastar plays for you. Alexander Gallego: I love Seastar. I've contributed to Seastar for four years, little patches here and there. They have done a fantastic job at creating a library that allows you to build software for modern hardware. And it's really very low level, very fundamental primitives. But Seastar is a futures library; you can think of it like an actor framework, almost.
So every CPU has a thread, and they don't move around. In fact, they're locked. So if you have four cores, you have four threads. Full stop. There's no other form of parallelism. If you have 96 cores, you have 96 threads. Full stop. There's nothing there. It means that in order to communicate with code that is executed on a different core, you use this idea of structured message passing. So this explicit communication has some really powerful properties for the developer. Alexander Gallego: The main one being that there is no implicit synchronization, which is obvious because everything is explicit. But it forces you to write your code in a way where you worry about how the program is structured, so the concurrency, and you let parallelism be a free variable that is determined at runtime. So you write your code in a way that can be parallelized, right? The structure of your code has that particular thread-per-core primitive for you. So you always have to think in that way, and there's only one way to do things in Seastar. It's really opinionated, but what's powerful is that then you can take the same code and scale it across 96 CPUs or 30 CPUs, or 14. And the code doesn't change when you run on more cores. So I think it's just a powerful primitive for building systems software. Eric Anderson: Wonderful. You mentioned that people start with Redpanda, often as a Kafka replacement, which is a super interesting idea. One that we've explored in other shows, and seems to be a trend today, partly because of cloud and partly just because it's better for users. But we're retaining the interfaces of old. As you mentioned, Postgres compatibility; there's drop-in Elasticsearch replacements using the Elastic API. And now there's an increasing number of Kafka drop-in replacements. Earlier, you mentioned Apache Pulsar. Maybe help us understand where you fit in the Kafka drop-in replacement world. Alexander Gallego: Sure.
So from a protocol perspective, we're a drop-in replacement, and where we improved the state of the art was in removing the complexity of running Kafka at scale, right? So to run Kafka, you need Zookeeper and you need Kafka. And I understand there is a new KIP, KIP-500, that is attempting to remove Zookeeper. However, the deployment model still has the same number of fault domains. The metadata service, which is what KIP-500 refers to, is a separate process, therefore a separate fault domain. So it's not as though that binary was embedded into the same Kafka binary; that is not the case. So they rebranded the [inaudible 00:16:44] with a Kafka metadata requirement, and then they added some services. So there's genuine value there. Alexander Gallego: But what we added onto the deployment model for users was just the single binary. So people just loved that. They loved the fact that you can just SCP a file around and you have a cluster. The ops model for us is super, super easy. And I think that's what a lot of the generation Z developers love the most, some JavaScript developers, some Python developers, some Ruby developers. They don't want to become experts in streaming. They don't want to become experts in the JVM. They want to write code in their native programming languages, and they understand that. So that's what people love about us; it's really that single binary. Alexander Gallego: Now, let me give you a comparison between Pulsar, Kafka and Redpanda. Pulsar is a system that depends on three distributed systems. It depends on Zookeeper, it depends on Bookkeeper, and it depends on, I think, the front-end servers. So the number of fault domains for Apache Pulsar is basically three, right? You have three distinct fault domains that can fail independently and have an impact on your system. Kafka has two. So Kafka has Zookeeper, and of course the Kafka brokers. Redpanda has one.
It has one binary, and we onboard the complexity by giving roles dynamically to servers, so that there is ... You can do the same type of operations that you could do with the other streaming providers. Alexander Gallego: So from an architectural perspective, we're really easy to deploy and use. We mostly see Kafka in the wild. We see very little Pulsar, to be honest, but we have seen some Pulsar, and that's why they came to us. So there's a financial services organization in New York, and they were having issues with Pulsar, partly because they didn't understand the system. It's much easier to understand a system when it's a single binary than when there are three distinct distributed systems that you need to keep up in order for you to do real-time streaming. Alexander Gallego: I think people are just drowning in complexity, and I would say the next wave of software, just because hardware is so good, if it can do anything, it's really just about helping people tame the complexity of enterprise infrastructure. So hopefully that gives you a hint at what it is. I will say, though, that Pulsar's innovation was the disaggregation of compute and storage, where they allow the storage engine to scale independently of the other servers. In practice, disaggregation, for the customers that we talked to, didn't turn out to be a problem, because they ended up using Amazon S3 as the true disaggregation. Alexander Gallego: So I think what ended up being more meaningful to the customers we talk to is actually transparent tiered storage, where all data just goes into a stream. They have disaster recovery, they can fetch data, they can fetch old data, all transparent on the user side. So I think that cloud technologies have actually played a role in the development of streaming, which I think is an interesting angle. Eric Anderson: Yeah. There's also maybe something subtle in there that I feel could be important.
You mentioned that app developers don't want to worry about Zookeeper, Bookkeeper. And then you also mentioned, and I saw on your website, you've got your stored procedures, as you call them, in WebAssembly, and you have docs on how to do this, how to use Redpanda with Node. It feels like where others have sold to data teams or have been used by data teams, you're being perhaps used by app dev teams, or at least there's some alignment between a straightforward installation process and your docs for consumption. Alexander Gallego: That's actually a difference in product philosophy, to a large extent. Here's my take on this. In order for teams to be successful at building a data product, the things that they need to use have to be self-service. The developer is the person with the most context when they're developing an application. Let me give you an example. If you're trying to do fraud detection, you have your data, you have your web servers pushing data into Redpanda, and then etc. What we're trying to enable is, what are the things that the developer, whose job is to create a fraud detection service for a credit card company, needs to deliver that fraud detection as a service? Right? What are the tools that they need? Alexander Gallego: The first thing is, it has to be self-service. So this idea of stored procedures was really a way to help a lot of developers navigate the usual data ping-pong that happens in the enterprise. The example I give here is, let's say you were just doing a simple filter and an aggregation. Actually, the simplest thing is data goes into Redpanda, there's a Flink job that consumes it and then saves it onto Elasticsearch. Let's take that example. The data ping-pongs often between Redpanda and a stream processor, and often back to Redpanda, and then to a different system and so on. So what stored procedures allow developers to do is eliminate that data ping-pong altogether.
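A minimal sketch of the kind of one-shot, inline transform being described, here a PII filter of the sort the food delivery example earlier calls for. The field names and sample events are made up; the point is that this logic runs next to the storage engine instead of round-tripping through an external stream processor:

```python
def strip_pii(record):
    # One-shot transform: drop sensitive fields from an event before it
    # is materialized. This is the shape of logic a "stored procedure for
    # streaming" would run inside the storage engine itself.
    sensitive = {"credit_card", "ssn", "email"}  # hypothetical field names
    return {k: v for k, v in record.items() if k not in sensitive}

events = [
    {"order_id": 1, "amount": 42.50, "credit_card": "4111-1111-1111-1111"},
    {"order_id": 2, "amount": 13.00, "email": "a@example.com"},
]

# Materialize a cleaned stream locally: no network hop, no second system.
cleaned = [strip_pii(e) for e in events]
assert all("credit_card" not in e and "email" not in e for e in cleaned)
```

Contrast this with the ping-pong version: produce to a topic, have Flink consume it, filter, and write the result back, with access control and capacity managed in two systems instead of one.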
Alexander Gallego: You push a stored procedure to Redpanda, and then you can have that data materialized locally. So there's no network transfer. You have the same set of access control lists that Kafka gives you, and it's super scalable, right? Again, it's not meant to do the multi-way merges that other stream processing frameworks do, like Flink or Beam, etc. It's meant for this one-shot transformation. Because if you ask the belly of the market, right? The sophisticated companies, they have the expertise in-house. But the belly of the market, they just consume the data into Redpanda, and then they just have a little callback that makes an RPC to Elasticsearch. That's the majority of how people are doing streaming, right? They make an RPC to Postgres, to save the data on a table, or they notify something because they saw an anomaly in the data stream. Alexander Gallego: So the simple use cases are really well-served with this idea of stored procedures. What we did is we just modernized it, right? Databases didn't have WebAssembly because WebAssembly is still so nascent and young. So what we took is the V8 engine that runs in your Chrome browser, and we shipped that with the binary, and it allows you to make this transformation. So you could use the full Node API. That's the first one that we're working on. And then we're also working on allowing people to have lower latency transformations in actually any language that compiles to WebAssembly. Does that make sense? Eric Anderson: Yeah, and very exciting. I think in the models with Zookeeper and Bookkeeper, you may have already had some experience with those systems for other distributed systems. Maybe you can convince yourself that you're reusing skills and that you have multiple uses for these distributed systems.
But I think, for a new app dev team, looking to scale and stand something up, it's very appealing to have a single binary and docs that are Node-centric and web-centric. Very exciting. Alexander Gallego: And that was an accident that we discovered, to be honest. Maybe I should figure out a way to give myself credit. No, I'm kidding. We wrote this thing in C++, and we wanted it to be a single binary. We just started talking to customers and they were like, "Oh, I love this and I could get started. Can we add transformations?" Actually, the first version of this was just an embedded Lua engine. You can look at the history of the code. And then we kept iterating on it and just making it more powerful, as we had more feedback. Now it's this full-blown engine that is stateful. So I think it's a really nice complementary technology to do real-time streaming for people, in a way that is just super easy to do. Alexander Gallego: Here's what these developers love. The majority of people aren't pushing the boundaries of hardware, throughput and latency. That's just a fact. The belly of the businesses are not at Google or Facebook scale, right? So what they want is one less system to manage. They don't want to manage another Flink deployment. They don't want to manage another Kafka Streams deployment, which is a separate set of services. They don't want to manage a Spark Streaming. They just want something simple for the simple things that they're doing, which is make an RPC to Elasticsearch or save the data to a database. I think these simple things should just be easy, and we're just enabling people to do what they want. Eric Anderson: Awesome. One topic for you. I don't know if you have much to say on this; we've been talking a bit about Kafka and Pulsar, which are both in the Apache foundation. That used to be the norm. Increasingly, we're seeing fewer open-source projects jump into foundations.
How have you thought about governance, licensing, and whether there's a role for a foundation? Alexander Gallego: That's a really great question, and very hard. I spent two years thinking about this. When I started the company, I met with a cloud VP, and, which I still find a little bit cheeky, for what it's worth, they said, "If you open source this with a permissive license, I will take your product and run it." I was scared, and I didn't open source until I had a clear idea of how to monetize the company. Monetizing infrastructure is really hard. And of course, we want to pay developers well and do all of that. So how do we create a sustainable business building infrastructure? I think you can. I think Cockroach is a great example of it. They just did a round at, like, a two-point-something billion dollar valuation. Alexander Gallego: And what I think is a good compromise is ... I wrote a blog post about this. You see, I was really well mentored in this decision, and I feel fortunate. So everyone that helped me along the way, thank you for taking that hour that you did take, speaking with me. I was lucky in that I got to learn from open source. And just one thing is, the relationship between open source projects and the cloud vendors is really fundamentally different from what it was, whatever, 30 years ago, when the open source movement started. So we couldn't open source the technology with a permissive license. We chose a source-available license, which is the same as Cockroach. The name of the license is BSL, where it says that we are the only company that is allowed to offer Redpanda as a service. Otherwise, people can go ham. They can just go embed it. They can make money, they can put it in a product. If you're an ad tech company, you don't have to pay us. Alexander Gallego: So that's where we see the revenue for us: it's really on our cloud.
And the reason for that is that the cloud allows infrastructure companies to monetize every part of the stack: the free, the paid, the commercial, the enterprise features. Everything about the product can be monetized in the cloud. Maybe a little bit, but you still can monetize a little bit of that. It's better than zero. The next stage of the company, for us, is the cloud. But maybe that was a long-winded answer to say that we couldn't have open sourced the technology with a permissive license. So we chose a balance where in four years, which we need to update some of the licensing terms on the project too, it'll become Apache 2. Alexander Gallego: But no one wants to run four-year-old software. So I think, in my opinion, it strikes a balance between letting people modify, letting people learn, letting people play with the technology, see the code. There's no secrets. We want to tell the world what we're working on, what we're building. It's exciting and we're making a lot of progress. And to that extent, for example, we have one of the most scalable Raft implementations in the world. So we want to be part of that community and we want people to contribute as well, but we also need to make sure that we can pay the bills. That's what we thought was a reasonable compromise. And some people will disagree with us, and that's fine. That's what I feel comfortable with, and ultimately, that's how decisions are made. Eric Anderson: Awesome. I'm excited that I think we may get more and better code available to developers and end users under source-available licenses than we would under open-source licenses, in part because they have means to make sustainable businesses from them. I think we all may win. Alexander Gallego: Right. And the alternative for us was to stay closed source. We were closed source; we just open sourced eight weeks ago. Well, made it source available, for whoever's listening. I'm well aware of the differences.
So I think that was the alternative, and I think it's better to have that. In four years or whenever, it'll become Apache 2. So in my opinion, I think it's a really good compromise there. Eric Anderson: Awesome. Alex, as we wrap up here, tell us where the project's headed in the future. What are the plans? What does 2021 look like? Alexander Gallego: The base of the product is just really starting to settle. We had two years where we were just heads down, building and engineering, and testing, and making sure that we didn't lose data, that it delivered on the promise. Because as an engineer, I can tell you that when you try something and it doesn't work, it sucks. You're just like, "Oh, man. You let me down with this thing." So when we released the software, the thing we cared the most about was not losing data. So we spent a ton of time testing Raft, injecting faults and whatever. The next stage for us is actually to push the boundary on streaming at the algorithmic level. Alexander Gallego: Let me give you two examples. One of them is proactive saturation detection. So what happens is, when you have a cluster that is large and you allocate some partitions to a subset of the machines, let's say you have 10 machines, and three of them happen to be overloaded. But from the controller's perspective, they have an even balance. It just happens to be that at 10:00 in the morning, those three computers get hammered for whatever reason. Maybe it's a Product Hunt launch or something like that, and they just get hammered. So what we're starting to do is, how do we proactively heal the cluster? One of the ways is that we want to start to detect CPU saturation, disk saturation, network saturation, and start to shift data around in the cluster in a safe way, right? We still have to give the same guarantees of no data loss with Raft. Alexander Gallego: So really being truly autonomous and hands-off.
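The proactive-healing idea can be sketched as a toy rebalancer: watch per-node utilization and, when a node crosses a saturation threshold, propose moving load to the coolest node. The threshold, the single utilization number, and the assumption that one partition move shifts about 10% of load are all invented for illustration; a real implementation would combine CPU, disk, and network signals and move data only through the consensus protocol so the no-data-loss guarantee holds.

```python
def propose_moves(load_by_node, threshold=0.80):
    # Toy saturation detector: for each node over the threshold, propose
    # moving one partition's worth of load to the least-loaded node.
    moves = []
    load = dict(load_by_node)
    for node, util in sorted(load_by_node.items(), key=lambda kv: -kv[1]):
        if util <= threshold:
            continue
        coolest = min(load, key=load.get)
        if coolest == node:
            break  # everything is hot; nothing useful to do
        moves.append((node, coolest))
        # Assume moving one partition shifts roughly 10% utilization.
        load[node] -= 0.10
        load[coolest] += 0.10
    return moves

# Three of the machines hammered at 10:00 a.m.; the rest are mostly idle.
load = {"n1": 0.95, "n2": 0.91, "n3": 0.88, "n4": 0.40, "n5": 0.35}
print(propose_moves(load))  # three moves draining n1, n2, n3
```

The interesting part of the real problem is everything the toy omits: deciding when a spike is transient versus sustained, and making each move safely through Raft.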
There should be no reason why anyone should log into a cluster, whether Redpanda or Kafka or Pulsar, it doesn't matter what it is, and do this idea of partition rebalancing. You should just give the engine goals: "I want this kind of data distribution. I want this kind of latency. I want this kind of throughput." The system has more information than the human could possibly imagine. Of course, we allow out-of-band escape hatches. Alexander Gallego: To summarize, the direction of the product this year is ... The first step was, okay, let's get compatibility. Let's advance the conversation in streaming by being able to run safe and fast workloads. Let's do unification of historical data. Let's do stored procedures. Those are just features that are required for the product to work in the enterprise. So now it's really exciting, because we get to work on, well, now that we're here, where do we want to go next? I think where we want to go next is to be this autonomous type of infrastructure for real-time streaming. Eric Anderson: Very exciting. For listeners who are excited about this, how can they try out Redpanda or get involved? Alexander Gallego: Try it now. We built a reproducible package, so the binaries that you get from Vectorized.io are fantastic. We actually compile the compiler to compile the libraries, to compile Redpanda. So just go to Vectorized.io. You can try it on Docker, you can try it with DEBs and RPMs, Fedora and all of that. That's just free for you to download and use. There are no restrictions on anything, so you can use all of the features. And then you can reach out to us on Twitter. We have a Slack community that is growing. So if you tried the product and you have questions, just join the Slack. I'm there all the time. The engineers are there all the time. We can just answer questions. We're trying to grow, so we'd love to have you be part of the community if you're listening to this. Eric Anderson: Awesome.
Alex, thanks for taking the time today. As an aside, it's been fun to follow your career. When I was at Google, working on Dataflow, we interacted on streaming. Great to see you still at it, and all the progress the Redpanda project's making. Alexander Gallego: Thanks for having me here, Eric. Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson, and this has been Contributor.