Travis Nielsen:
It's just an ever-growing project. And I keep being amazed actually by how much more there's always to do. It's not like we'll ever finish the project. It just keeps going and going and people keep using it more and more.

Eric Anderson:
This is Contributor. A podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson. We are live today with Travis Nielsen who comes to us by way of Red Hat and is the creator of Rook and works on the CEPH team there at Red Hat. Travis, thanks for coming on the show.

Travis Nielsen:
Yeah. Hi, Eric. Great to be with you. Good to be on the show.

Eric Anderson:
This is one of our first fan mail introductions. So we have a common acquaintance who is a user of Rook and at a big American corporate company that I was talking to recently. And they requested that we get the story of Rook from Travis himself on Contributor. So thanks for obliging Travis. So as we always do, let's level set with the group and our listeners and explain to us what Rook is, and I imagine you may have to cover a bit of CEPH and then we'll start there.

Travis Nielsen:
Okay. So the first question I always like to ask is, well, what is it about storage anyway? Who needs storage? Why is storage so important? So when we talk about this, I'm talking about for Kubernetes deployments. So Kubernetes being that basic distributed platform for deploying your applications in the cloud or on prem, wherever you have Kubernetes running. The question is, well, what do I do for storage, right? And Rook is all about bringing storage to Kubernetes. So storage... I mean, if you're running in a cloud provider, though, you're up in AWS or Google Cloud, Azure or wherever, and those clouds, they provide storage for you. They've got EBS and all sorts of storage solutions. Now, if you go into your own data center, you don't have all of these nice cloud services that are so conveniently available and dynamic and all of that. So you have to provision your own bare metal servers or VMs, but you can't just say attach this disk and that's all you need.

Travis Nielsen:
You need the storage to be durable and survive outages of nodes or even data centers, right? You want your storage to be durable. So storage is not something that's deployed with Kubernetes, but it's something that's traditionally kind of this external plugin. You go buy an appliance and you connect your cluster to it and you go, or if you're in the cloud, again, you connect to that cloud provider storage. So why does it have to be that way? Why not manage the storage just like any other Kubernetes application? But you want to do it on a time-tested platform that you can trust. You don't want to... Just somebody who's built a new data platform, because you probably shouldn't trust your data with a new platform that's just come out.

Eric Anderson:
And maybe you'll get into this Travis, but the original Kubernetes vision, if I remember correctly, was kind of stateless by design. That was like, we'll ignore storage. And that gives us horizontal scalability and all these benefits. And so maybe that's part of the reason that here we are years into Kubernetes and still haven't really solved storage.

Travis Nielsen:
Yeah, exactly. Kubernetes was designed for stateless applications because of all of those things you have to know. And the reason is, storage is a really hard problem. And the second you start running all your stateless applications, well, you can't run very long or very interesting applications that are actually stateless. You need state. You need to store your data somewhere. So with Rook then, in the early days we set out, we said, Hey, we need to make storage natively available to Kubernetes as a cluster. It should look like any other Kubernetes application and it can run alongside all of your applications. We still want to consume it, just like any other storage, with the CSI plugins. That's the way you plug in your storage. But that storage doesn't need to be external to Kubernetes is the point. It can be inside the same cluster.

Travis Nielsen:
But in order to accomplish that now we really need automated management, which means, okay, the way you automate things in Kubernetes, is you have with [inaudible 00:04:28] operators, and then you have CRDs that define the state for the application. So fundamentally that's how something becomes [inaudible 00:04:40] is if it's deployed automatically, but you can also tell Kubernetes how to deploy it with this desired state. I guess we call it with the CRDs. CRD is short for Custom Resource Definitions, and they really just define sort of the schema for the settings you need to apply for whatever application. And specifically an operator is this pod that's running this automation that looks at those settings in the CRDs and then goes and applies those settings to make it happen.
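
To make the operator idea concrete, here is a minimal Python sketch of the pattern described above: settings declared as desired state, and a loop that compares them with what is running and applies the difference. It is an illustration only, not Rook's code. The names `DesiredState`, `ClusterState`, `daemon_count`, `enable_filesystem`, and `reconcile` are hypothetical, and a real operator would watch custom resources through the Kubernetes API rather than plain Python objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DesiredState:
    """Hypothetical stand-in for the settings a custom resource declares."""
    daemon_count: int = 3
    enable_filesystem: bool = False


@dataclass
class ClusterState:
    """Hypothetical stand-in for what is actually running right now."""
    running_daemons: int = 0
    filesystem_running: bool = False


def reconcile(desired: DesiredState, actual: ClusterState) -> List[str]:
    """Move the actual state toward the desired state; return the actions taken."""
    actions = []
    # Scale the number of daemons up or down until it matches the declaration.
    while actual.running_daemons < desired.daemon_count:
        actual.running_daemons += 1
        actions.append(f"start daemon {actual.running_daemons}")
    while actual.running_daemons > desired.daemon_count:
        actions.append(f"stop daemon {actual.running_daemons}")
        actual.running_daemons -= 1
    # Only run optional services that the settings ask for.
    if desired.enable_filesystem and not actual.filesystem_running:
        actual.filesystem_running = True
        actions.append("start filesystem service")
    elif not desired.enable_filesystem and actual.filesystem_running:
        actual.filesystem_running = False
        actions.append("stop filesystem service")
    return actions
```

Calling `reconcile` a second time with the same desired state returns an empty list, which is the point of the declarative style: the settings describe the end result, and the automation works out what, if anything, still has to change.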

Eric Anderson:
Okay, this is the configuration for the deployment.

Travis Nielsen:
That's right. So when you create an application in Kubernetes, you're telling it, Hey, go create this resource, create this pod, create this service. And you do it very declaratively. So the CRDs let you declare some other type of application. So for Rook, we have these CRDs where you tell Kubernetes, Hey, create these things. And that's how you start Rook. Almost declaratively. And then [inaudible 00:05:50] going and applying that configuration to your cluster. It manages upgrades, it manages a lot of the complexities of storage that people just don't want to think about. Well, it's really complicated, so they shouldn't have to think about it.

Travis Nielsen:
So then the last bullet there, ultimately, I mean, we looked at CEPH as a stable data platform. We said, Hey, CEPH's been around for a long time. It's been in production since what year? Around 2012, I think going on 10 years in production now, and it provides this data platform we need. And we said, okay, let's bring CEPH into Kubernetes. Let's deploy CEPH as a Kubernetes application, run the CEPH daemons as pods, create all the other resources we need. So CEPH plus Kubernetes equals Rook. That's how I like to say it. We're managing storage for Kubernetes.

Eric Anderson:
When I think of Kubernetes, I imagine I think about the ability to survive failure. You know, you can lose nodes and the application lives on, spawns new nodes as necessary, or I don't know if I'm using the right nouns, but it will spawn new instances. And how does that work with storage? Because I'm kind of imagining that... is the storage kind of getting tossed around in order to ensure that it's always going to exist in the event of an instance failure?

Travis Nielsen:
Yeah, exactly. If you're going to design storage for Kubernetes, that means it has to be fault tolerant. If a node goes down or you lose all nodes, your data is still safe in the cluster, because there are multiple replicas of the data, or it's erasure coded, which means it's broken up into pieces and has redundancy built in. But yeah, ultimately you can lose nodes. You can lose individual disks and CEPH and Rook know how to go repair that and make sure it stays safe. So you bring up a new node, you bring up new disks and Rook and CEPH recover. So we declared Rook stable with CEPH back in December, 2018. So going on our third year here. There's many people running it in production. Because it is open source, I don't even really know how many people or exactly what scenarios, but there are a lot... Surprises me how many people out there are running clusters with hundreds of nodes, multiple petabytes of data. Yeah. I'm just excited how often I get to hear about people running it in production.
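
As a rough illustration of "broken up into pieces with redundancy built in", the Python sketch below splits data into `k` chunks plus one XOR parity chunk and rebuilds a single lost piece from the survivors. This is the simplest possible single-parity scheme, chosen for clarity. It is an assumption for illustration and not how CEPH implements erasure coding, which can tolerate more than one lost piece. The function names `encode` and `recover` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def recover(pieces: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing piece (marked None) by XOR-ing the survivors."""
    missing = [i for i, p in enumerate(pieces) if p is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost piece")
    repaired = list(pieces)
    if missing:
        survivors = [p for p in pieces if p is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired
```

Each piece would live on a different disk or node, so losing any one of them leaves enough information to rebuild it. Plain replication reaches the same goal by storing full copies instead, at a higher storage cost.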

Eric Anderson:
I like to make sure I leave bugs in my open source so that my users will reach... It's kind of like an analytics measure. They'll reach out, let me know they're using it-

Travis Nielsen:
Right? Yeah. Most people only reach out when there's a problem, right?

Eric Anderson:
That's right, exactly.

Travis Nielsen:
Yeah. There are definitely always bugs to go find and everybody else finds them. That's a little bit more about CEPH, what CEPH is as a data platform. So CEPH provides the three basic different kinds of storage. So you've got block storage, which is generally used as a read-write-once volume in Kubernetes. Just one pod attached to a single volume. There's the shared file system or read-write-many, which might be shared by many pods. And then there's the object storage or an S3 endpoint that CEPH can provide. CEPH provides the data layer for all of these and, underneath the covers, stores them all in a consistent format in what we call the OSDs in CEPH, the object storage daemons. But yeah, that's what we liked about CEPH. One platform for all these three types of storage.
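
The difference between the two volume styles can be sketched in a few lines of Python. Kubernetes names these access modes `ReadWriteOnce` and `ReadWriteMany`; strictly speaking, `ReadWriteOnce` is scoped to a single node rather than a single pod, which is what the sketch models. The `Volume` class and its `attach` method are hypothetical and only illustrate the rule, they are not a Kubernetes or CEPH API.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Volume:
    """Hypothetical model of a volume and the nodes it is attached to."""
    name: str
    access_mode: str  # "ReadWriteOnce" (block style) or "ReadWriteMany" (shared file system)
    attached_nodes: Set[str] = field(default_factory=set)

    def attach(self, node: str) -> bool:
        """Attach the volume to a node if the access mode allows it."""
        already_elsewhere = bool(self.attached_nodes - {node})
        if self.access_mode == "ReadWriteOnce" and already_elsewhere:
            return False  # a read-write-once volume cannot be shared across nodes
        self.attached_nodes.add(node)
        return True
```

A block volume created with `ReadWriteOnce` accepts its first node and refuses a second, while a shared file system volume with `ReadWriteMany` accepts both.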

Travis Nielsen:
We are a CNCF graduated project now. I think we'll talk more about that later. And it's exciting to see all the GitHub stars and downloads from Docker Hub and how many people are contributing. I've always wondered where those downloads come from. Like, wow, who's actually downloading this thing millions of times. It's a lot of Docker pulls.

Eric Anderson:
Maybe you're in a test suite somewhere.

Travis Nielsen:
Yeah. I know there is a lot of CI use case to pull this, but as a project in Rook, we really do like to put community first. That's everybody that works on it. We want to hear from the community, what features do people want? What bugs are they seeing? We want to go fix them and get releases out regularly so that people can keep it running in production and have a stable storage solution. It is open source with the Apache 2.0 license. And we do quarterly releases with regular patch releases.

Travis Nielsen:
Currently we've got maintainers from four companies. So Cloudical, Cybozu, Red Hat and Upbound. Just to get back to the question of where Rook runs. So it really is wherever Kubernetes is deployed, that's where Rook can manage storage with CEPH. Okay. So if you're running in your own data center with bare metal or VMs, Rook can consume raw devices or local PVs in that environment, or if you're running in the cloud, Rook can actually consume cloud provider PVs in order to overcome limitations of the cloud provider storage. Like I think the cloud providers have limitations with like 32 volumes per node, things like that.

Travis Nielsen:
But CEPH doesn't have any of that limit. You can have hundreds or thousands of PVs in your cluster. No matter how small or large, you can get the same performance characteristics too that come from the underlying PVs in the cloud provider. So that's the environment in a nutshell. Rook is the management layer for CEPH. And then the CSI driver is what actually provisions and then mounts the storage to the application pods. And finally CEPH actually provides the data layer. So anytime you're reading, writing data from the application, it goes straight to CEPH and Rook is out of the picture at that point. Rook is just managing the higher level in order to get CEPH going. But ultimately, CEPH is this stable data layer.

Eric Anderson:
Good. Well, Travis help us. Now that we know what Rook is and we're reminded of CEPH, how did this come to be? We talked a little about the motivation. Did you personally run into some of these needs?

Travis Nielsen:
All right. So where did this start? Maybe I'll go way back. Tell me if it's too far, but-

Eric Anderson:
It's a way back.

Travis Nielsen:
It's a lot of fun history here as I was thinking today. So back in 2004, I was a lot earlier in my career. I started at Microsoft in Seattle. I was on a team with a couple of guys, Bassam and Jared were the names, and we were on the Windows Server team. So we dealt some with storage, that wasn't my primary focus, but anyway, Windows Server and server environments was what I worked on. Over time, so 2011, then I left Microsoft and went and joined a startup, doing storage, that Bassam had gone and started. And Jared had gone with him as well a couple of years before I got there. I joined that. That startup was called Symform. You've probably never heard of it, but it was kind of a peer-to-peer storage solution across the internet, which was really cool architecturally.

Travis Nielsen:
And so worked on that for a while. Then in 2014, we got acquired by the Quantum Corporation. Jared, Bassam and I were still there. And at Quantum, we were really part of the [inaudible 00:13:31] team to kind of plan, well, what's next with storage? What should the company take a bet on for storage? And so then, as we were researching it, that's where we started looking at CEPH really closely and saying, okay, we really see CEPH as a good solid foundation. Let's go start prototyping, working with it. And Bassam was, he's the head architect, the head thinker. So Jared and I, anyway, were right there with him, but I have to give credit to Bassam for coming up with all the big ideas. So we started to put together this project with CEPH and we thought, well, how can we deploy CEPH for cloud native environments? Kubernetes was just really young at that point and not a big thing.

Travis Nielsen:
So we thought we were trying to build something even independent from Kubernetes. We built it on etcd and it was really turning into a nightmare, trying to manage our own platform on etcd, a distributed storage application, which was a lot of fun. But at some point we realized, okay, Kubernetes really does have the support we need as a storage application to run the storage. We decided to open source Rook after we created it in 2016 and then went to KubeCon, that was our first conference we really went to. It was KubeCon Seattle in November, 2016. Wow, that long ago already? And only a thousand people were there. Now KubeCon's like 12,000, however many people.

Travis Nielsen:
And at that conference we learned about operators and CRDs. It was basically a new concept, well, they weren't called CRDs back then, but this new concept coming out and we thought, oh, well, that's all we need to do for CEPH. We need to build an operator and just make it work natively with Kubernetes. And then, so a few months later in our 0.3 release, early 2017, Rook was really born at that point. Before that it was just a nice idea, I think, or hadn't really found its place. But we really focused on Kubernetes at that point. And so Bassam, Jared and I were the primary creators of this project.

Eric Anderson:
And remind us. You were at Red Hat this whole time and Bassam, he was at a startup, right?

Travis Nielsen:
No, so let me clarify that. So I was not at Red Hat yet. So we were at Quantum Corporation during this time of actually creating Rook and going and first open sourcing it and creating the operator and things. So we were still all on the same team at the same company there where we created Rook.

Eric Anderson:
And what were the aspirations for Rook? This is worthy of your time. This would be a big deal. And as you open source it, are you looking for new users? What was the goal or the hope?

Travis Nielsen:
And we really had a vision, again back to Bassam. The vision was that things that are open source really have a much better chance of succeeding because, being open source, you get community members coming and other people can contribute. People believe in them because they can contribute and even fix their own bugs, and CEPH being open source underneath Rook as well kind of built on that synergy. Open source, community first. We really wanted to have this community project, an upstream project. And we started also, in parallel, building a downstream product at Quantum, that would be something where, oh, we ship you an appliance type of solution where you plug in the storage to Kubernetes and this product would help you even provide more UI and management on top of what Rook was providing. And be the apps product [inaudible 00:17:27]. That product never came to fruition, but that was the goal of the initial thing. Like you've got an upstream project and then the downstream product that our company and other companies eventually create.

Eric Anderson:
Maybe just to round out the history, eventually you made your way to Red Hat?

Travis Nielsen:
Right. So it was just over three years ago, 2018, where Rook was progressing, building community and all that. And then Quantum decided actually that funding was out for our project, kind of the future of storage for Quantum. So they decided to cut our team. And so we dispersed that. So that event around February, 2018 is where I took a look around and I said, Hey, I really like this Rook project. The community has really picked it up. We had just barely donated it, finished that process to donate to CNCF.

Travis Nielsen:
So CNCF officially had ownership of the project, which was good timing for us. And given our close relationship with CEPH and the bet we'd taken on CEPH I thought, Hey, I'm going to go talk to this CEPH team. And it worked out. I joined the CEPH team right after that to continue working on Rook full-time. And I saw that as an important step because we did feel like we were lacking a bit in having deep CEPH knowledge, trying to deploy Rook and CEPH, because we did things that we thought, but we didn't have any direct team members who really knew CEPH deeply. So around that same time, Red Hat and the CEPH team basically made a bet like, yeah, Rook looks like it's the right thing to do. And so it just kind of worked out timing-wise. So I joined the team and we continued the project on the same team.

Eric Anderson:
Yeah and Red Hat saying that Rook is the right thing to do is not obvious, in part, because there's a few other kind of storage solutions out there in open source if I recall.

Travis Nielsen:
Yeah. CNCF has storage projects in various stages, sandbox and incubation too. And even at Red Hat, there were multiple storage solutions like with Gluster and CEPH. And, you know, the CEPH team specifically had said they'd tried some things out with Helm charts before they took the bet on Rook and that's like, yeah, Helm just isn't quite dynamic enough for us. We really need an operator.

Travis Nielsen:
And Rook is the operator and yeah, at KubeCon Austin in 2017, right before all this went down where I joined Red Hat, that's when we kind of started discussions with them and were showing the CEPH team how great Rook would be for CEPH. And then I joined. And yeah, the other maintainers... So Jared and Bassam, they decided to start another startup. They liked the startup world a little more than me, I guess. And yeah, so they started Upbound at that point about three years ago and that's where they still are. And they're still contributing to Rook, well, Bassam is officially an emeritus maintainer. He's busy with Upbound, but Jared is still contributing, participating in discussions and things, still interested in making sure Rook continues and everything. So that's the relationship with Upbound. Yeah.

Eric Anderson:
We have Quantum, in some ways, to thank for the gift of Rook and Travis and Bassam and Jared, and all of you for shepherding it this far. Tell us about the decision to go into the CNCF and kind of the path within the CNCF. That seems to be an increasingly kind of common and important part of an open source project's life cycle.

Travis Nielsen:
I think pretty early as we went to KubeCon and started down the path of the operator, we realized that, Hey, the CNCF, it was a pretty new organization at that point, but it was kind of shepherding all these projects that were being built for Kubernetes. So it's like, well, we built this thing for Kubernetes. Let's go down that path. It gives more community adoption. The community has more confidence in these projects that are showing that they're following security best practices, and running in production and have good open source practices and governance. And so we wanted to go down that path to really build the community and build the community's trust in the project. So we started, yeah. Sandbox was... It was a long time ago, like four years ago and then incubation and then graduation just happened last fall, somewhere. We finished all of the last things, crossed our T's and dotted our I's and got the sign off.

Eric Anderson:
And the significance there is you get a bit more... It's signaling to the community that you're at CNCF level quality of community development, governance, security, as you mentioned.

Travis Nielsen:
Yeah. I think it's a good testament to that. The CNCF board, the TOC, believes in the project; they see those good qualities in the project itself. And an important part of that is that lots of people are running it in production. There's proof of that. And people were willing to give us their testimonies or their stories. What's the right thing?

Eric Anderson:
So in some ways you've kind of arrived, right? I mean, it's now you're CNCF graduated. Rook has a home with a whole community. Does your role change, are you now kind of a cog in the community wheel or do you still have a kind of driving force in the vision?

Travis Nielsen:
That's a good question. Well, it's like we graduated. So what is next with the CNCF in the community, right? I feel like that's just kind of a stamp of approval on this journey that we're already making and isn't going to change much anyway. We're providing storage that's reliable for the community to build on. So we just keep adding features. CEPH keeps adding features and then Rook needs to add or expose those features to Kubernetes through these CRDs. Always fixing bugs and there's always more features to add, as people are expanding their usage, expanding the scenarios. It's just an ever-growing project. And I keep being amazed actually by how much more there's always to do. We can't ever... It's not like we'll ever finish the project. It just keeps going and going and people keep using it more and more.

Eric Anderson:
And, so, when do we get to declare victory on storage for Kubernetes? Is it now solved? I think for some time over the last few years, it's been kind of in limbo. Where do you see us at today, Travis? Is it time for us to move all of our production databases to a Kubernetes powered, Rook powered storage solution?

Travis Nielsen:
I'd love to see that of course. That's my vision. There's still so much more to do. There's always gaps. Like we're looking at next, for example, Windows clients coming into Kubernetes, right? So CEPH is looking at adding support for Windows clients and things like that. So.

Eric Anderson:
But there was a time when we debated whether Kubernetes should support stateful workloads and that debate's over, right? We've kind of decided there's a place, if not a real opportunity, for Kubernetes to shine in the capacity that Rook is running.

Travis Nielsen:
Yeah, absolutely. And there's so many different ways to run it too. Somebody can dedicate the whole Kubernetes cluster to running the storage platform if they want and connect to it externally. They can run it alongside of their other applications in the same cluster. And there's just so many different ways to configure it that it's like, yeah, most scenarios are covered. It actually doesn't cover all scenarios because I'll just give you one. If a database must have data locality to a local disk and it provides its own replication, or maybe it doesn't even want replication or data safety, right? It just wants highest possible throughput with no guarantees for data safety. Then that's not what Rook and CEPH are about. You just want to mount your local disk and, if it dies, it dies. And that's the application's problem. But CEPH really gives you that durability and data safety. And that's the promise that people want.

Eric Anderson:
Earlier, you mentioned that CEPH also has these different modes of operation or, you can do block, you can do file storage. You mentioned another. Are these also bubbled up in Rook? Rook is this kind of multimodal storage thing?

Travis Nielsen:
That's right. Yeah. And that comes back to the CRDs as well. You tell Rook, Hey, go configure these storage layers with these CRDs. And then Rook will enable those storage layers that you want. So you don't have to run extra things that you're not using. For example, if you're not using the file system, then we won't start that service up.

Eric Anderson:
And so a database company or database offering might build on Rook, I presume. And, they might spend quite a bit of time tuning everything so that things work as you might expect coming from a traditional environment to this distributed Kubernetes one. Is that right? So you'd probably have pretty long engagements and people will take a heavy dependency on you. And so this is an ongoing relationship, I imagine, with some of your users.

Travis Nielsen:
Yeah. And I'd say that close relationship is more of a downstream relationship that I'm not engaged in as much. So the upstream community really is about, you know, we answer questions about getting it going and people do have tuning questions, but I think if you really want to tune CEPH to get every bit out of it, that's not my area of expertise. There are CEPH experts who would get in there and Red Hat has a support team to help you go do that. So that's kind of more of that at that level, get into downstream version of CEPH and Rook to let you tune it. While you can tune it, if you can figure it out. But whether you can get to that layer level of tuning on your own, maybe some people can. And I don't hear a lot of chatter though, honestly, about needing to tune it for performance because people find that it works pretty well out of the box if you get the right hardware behind it.

Eric Anderson:
And now tell us what it's like doing kind of corporate open source. You have the pleasure of doing open source on Quantum's dime and now on Red Hat's dime and the CEPH team that you joined has been doing this for a decade it sounds like. So you have even a certain amount of the feeling of job security here. Like you could keep doing this for quite a while.

Travis Nielsen:
Yeah. So yeah, I love open source. You know, when I was at Microsoft and earlier in my career, I didn't do open source. It was proprietary. But so the open source way... When we started at Quantum, it was interesting because Quantum didn't have an open source philosophy. We kind of were leading the way for the company there.

Travis Nielsen:
And so when I joined Red Hat, I got a completely different perspective on that because Red Hat's philosophy is fundamentally open source first, upstream community first, and then downstream products follow. So in my day job now for Red Hat, I mean my purpose or what I'm supposed to do is make sure the community stays happy with Rook, that upstream we're solid, we're fixing things. We're getting done what needs to get done. Like that is my official job to keep Rook [inaudible 00:29:00]. And I do spend some time on downstream discussions to say, oh, well, because we do have a product at Red Hat that we ship on top of Rook, but that's not really my... I need to support that for my job, right? But it doesn't take the majority of my time by any means.

Eric Anderson:
Yeah, is that a separate person or group's responsibility? The productizing of Rook?

Travis Nielsen:
Right, yeah, there's a productizing team. Like there's the PM team and the QE to make sure it's good quality. And that's one thing, one difference about the upstream versus the downstream product. I mean, upstream, we don't have a QE team that actually signs off on our releases. We make them the highest quality we can and we trust our CI processes and automatic tests to make sure the quality stays there. But at the end of the day, we don't have a QE team that signs off upstream and that's something downstream that is there to make sure things are solid.

Eric Anderson:
I guess that's what you're paying for.

Travis Nielsen:
I feel like we have such reliable processes upstream after this long now. So about five years having it iterate or yeah, we really do have good success with our releases, I feel like, and that quality.

Eric Anderson:
That's fantastic. As we wrap up here, Travis, in the case that we have listeners who are excited about Rook and want to get involved, is there a place where the community hangs out?

Travis Nielsen:
Yep. We have a Slack group for Rook. So go to slack.rook.io to join there. On GitHub, you can find us at github.com/rook. And, of course, our website rook.io.

Eric Anderson:
Good. Anything I didn't cover today we wanted to get through?

Travis Nielsen:
Well, I just thought of one more thing. Sometimes people ask, where does Rook come from? Why do we have that name, right?

Eric Anderson:
Right. Tell us.

Travis Nielsen:
So it is kind of interesting. So back when we were at Quantum, we had a code name for our storage solution, which was called Castle because a castle is somewhere that is safe and you want to keep your data safe if you store it in a castle, right? So data safety around castle and that's where it started. But then it turns out because of some copyright or whatever, we couldn't use Castle as a product name. So we thought, how can we use our same logo that looks like a castle? And so we came up with Rook and rook.io was available. And it's a nice, short word that people can pronounce and type. And it's like, Hey, that just worked perfectly. So that's where Rook comes from.

Eric Anderson:
Sometime, we'll have to talk about how most open source logos are pretty lousy, but you've got a nice one going on there at Rook.

Travis Nielsen:
Well, thanks!

Eric Anderson:
The logo. The name, the website. Very good.

Travis Nielsen:
Yeah. We had a designer with the team. He's at Upbound now. So if you look at Upbound designs, it's the same guy. He creates good stuff.

Eric Anderson:
Yeah. Travis, thanks for coming out today. Your contribution and Bassam and Jared's to Rook is awesome, inspiring and paves the way for a lot of future distributed stuff.

Travis Nielsen:
Yeah. Excited to be with you and talk a little bit about it, or maybe too much.

Eric Anderson:
You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson and this has been Contributor.