Haoyuan Li: Say there is a potential user there, and they want to leverage the open source software, and the community's goal is to really make those users very successful. With that goal in mind, the community itself will have different people with different types of knowledge and different types of functions to help the community. The goal of the mechanism is to enable these people, and to let these people have fun doing their work while at the same time being acknowledged.

Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson.

Eric Anderson: Today I'm joined by HY, the creator of Alluxio, an open source data orchestration solution. HY, thanks for joining us.

Haoyuan Li: Thank you, Eric.

Eric Anderson: Maybe the best place to start, HY, is to tell us about day one of Alluxio. How did Alluxio begin?

Haoyuan Li: Day one means the launch day or-

Eric Anderson: No, like when the first line of code was written.

Haoyuan Li: Oh, that's a really good question. I don't remember, actually.

Eric Anderson: This will be a short episode.

Haoyuan Li: I don't remember the day when I wrote the code, but I do remember day one. At the time I was a PhD student at the AMPLab at UC Berkeley, and doing something like Alluxio had always been what I wanted to do. I actually put it in my statement of purpose when I applied to the PhD program at Berkeley. And in my first year there, I co-created Spark Streaming, and it became a [inaudible 00:01:51].

Haoyuan Li: After that, my advisors and I had a conversation about starting something from scratch, and at the time, this concept was what we wanted to start. I remember the very first group conversation we had. It was my advisor, Professor Ion Stoica, and my friends Matei and Ali from the AMPLab at UC Berkeley, the four of us. We had an early conversation on a patio in Santa Cruz.
At the time we had an unplanned summer retreat. It was really fun. Actually, I don't remember exactly what we talked about, but it's a good memory. It was a fun conversation for us.

Eric Anderson: Okay. So maybe just to summarize: you had ideas that you wanted to work on this type of thing going into grad school, you worked on Spark Streaming and made significant contributions there, and then as you were moving off that project is when you first started talking about what this project might be.

Haoyuan Li: Exactly.

Eric Anderson: And how definitive was it at this point? I mean, did you know exactly what you wanted to build, or was it just...

Haoyuan Li: No.

Eric Anderson: "I want to do something new, and it's in this general space."

Haoyuan Li: So it's funny. When I applied to the PhD program at Berkeley, I said in the statement of purpose that I wanted to create a new storage system. Alluxio is not a storage system. It's a data orchestration system. It's a new thing, a completely new concept started from scratch. But the reason I wanted to do storage, it's all kind of a lie, the reason why I wanted to do a storage system was that I was very fascinated that we're in this data revolution, and we're still at a very early stage. And in the revolution, I feel like the value is in the data.

Haoyuan Li: Technology systems are enablers; they enable people to extract the value from the data itself. So that's the key. I thought about it before I went to Berkeley: if that's the truth, what system will have the most strategic value here and will help people create the future faster and better? To answer that question, before my Berkeley time, I thought, look, what's the life cycle of data? Most of the time, data stays in its storage system. Sometimes, when some people or some program want to read the data or process it, they read it out and put it back. And then if there's some output, they put that back as well.
So the majority of the time, the data is in storage. So I thought that's the key. And there are so many innovations happening in the storage industry and in storage systems.

Haoyuan Li: So many new systems are being created all the time. So I thought that's probably something worth doing. Before my Berkeley time, that's what I wanted to do. Then in my first year at Berkeley, at a very early stage, roughly the time before starting this project, I was looking at the whole industry's history, particularly the history of storage systems, over 40 years. And you will find, very interestingly, that every five to 10 years there's another wave of new storage systems disrupting the previous generation. That is a cycle. It has always been like that for the past 40 years. And I figured there was a reason behind it. There are three dimensions to the reason, fundamental trends always happening in our industry that decide that. Those three dimensions are hardware advancement, architecture advancement, and workload change. So that triggered this five-to-10-year change, and in the storage industry the single story being repeated has always been, "Oh, I have a better storage system; use us."

Haoyuan Li: And what does that mean? The better storage is cheaper, faster, or easier to use: one of the three, or a combination of the three, is the reason to use this new thing. Then I thought, look, if we create another storage system, we can be innovative as well. But if we do this and create another one, probably 10 years later we'll be disrupted as well, also by someone creating something new. But what if we can do something further, more fundamental? That's the reason we didn't do storage.

Haoyuan Li: We also looked at the trend on the compute side. From the application perspective, it follows a similar trend: every three to five years, there's another wave of new workloads coming up, either disrupting the previous generation or being added to the ecosystem of workloads.
So because of that, we felt that in this industry, in order to really enable and empower all those different types of data applications to fully leverage data that is siloed and stored in various storage deployments, possibly in different storage systems, the only way we could probably solve this problem was to create a new layer, which we call the data orchestration system. That's how this idea evolved, and it took some time.

Eric Anderson: Yeah. I can see why Berkeley took you. You thought this all out.

Haoyuan Li: Not all out.

Eric Anderson: Thought about it.

Haoyuan Li: Thought about it. Yes.

Eric Anderson: Okay. So you had this conversation, eventually you did some work. You have this conversation on the patio with Matei and others. Take me from there to how you got to working on this project.

Haoyuan Li: I mean, it's a lot of work, and to realize this vision takes a long time as well. It's not a one- or two-person job. It cannot get done by one or two people, right? So you've got to start from somewhere. And I was very fortunate. The lab has a very strong history in terms of innovation. As far as the results, to this day the ecosystem and Apache Spark are hugely popular. I was also surrounded by great technologies and people in the lab, like my advisors and my colleagues there, and they gave me enough freedom to explore whatever I wanted to explore. The very natural way was to start from the Spark ecosystem, and the very first version of Alluxio, at the time, was called Tachyon.

Eric Anderson: Yeah. I'd kind of forgotten. I knew there was a name and I had forgotten.

Haoyuan Li: Yeah, Tachyon. The very first version targeted only one workload: sharing data in memory between different Spark applications. That was the initial, first version. A lot of people, when they saw that announcement, thought it was a caching system for Spark, and based on our functionality at the time, it was like that.
But it started from there, and today we're supporting so many different types of workloads with so many different storage systems behind us. It's a different world today.

Eric Anderson: So it sounds like you had some bigger vision and you knew that you wanted this initial use case. So who's working on it? It's you and other people in the lab at this point?

Haoyuan Li: So many people. I was advised by my advisors, and people like Matei, who has tremendous experience with open source, and they all helped me a lot.

Haoyuan Li: I'm very grateful for that. But I coded the very first version twice. There was a first version that was never released. I don't remember what happened to that version. Time flies. Actually, I don't remember why I wanted to redo that version. So anyway, I coded that version over the summer, or maybe over the fall, and then during Christmas time I did another POC, or rather I coded it again, completely. Maybe the first version wasn't beautiful. That's the reason why, when we announced our first open source release, it was called version 0.2 instead of 0.1.

Eric Anderson: I could see that in the versioning. Any thoughts around that initial launch? Were there preparations you had to make? What goes into putting code out for the first time?

Haoyuan Li: So those were very early days. It's kind of crazy.

Haoyuan Li: I mean, thinking about it again, it was very small, not a big deal, but at the time I was a little bit nervous about the reaction from the ecosystem, whether people would pick it up or try it out. And we had another colleague, Andy Konwinski. He's very good at marketing as well. He helped market the first launch and push it in a very good way, very successfully. First of all, it was picked up by a lot of media, which was very encouraging. Lots of people started to try it right away, on the day of the launch or the second day.
Like other open source projects, we had a community forum, etc. We used a Google group as the community forum, the user forum, and the first external post to that forum was, "This doesn't work."

Eric Anderson: That's great, very encouraging.

Haoyuan Li: Yeah, very encouraging. So I was very nervous when I saw that. But today, if someone tells me the software doesn't work, I'm very happy, in the sense that people are trying it out. I want to hear what doesn't work. Tell me what doesn't work, and we'll either tell you it isn't the right use case or we'll fix it. We have many interactions in our community Slack channel every day, with lots of people reporting, "Oh, this is awesome. We love this tool," or, "Oh, this doesn't work. Why is that?" So going back to the first launch, that was the first reaction from the community. Yeah.

Eric Anderson: I imagine you're continuing to evolve the project. Now, you have academic interests you're trying to satisfy and fulfill at the same time, and there are a few users that are nagging you or curious about development. How do you manage those two things, the academic requirements as well as the community? Are they in contention at all?

Haoyuan Li: At the end of the day, they're competing for time. On one side, I'm very passionate about this. I believe in this, and it will be done. And on that side, to curate the community, you need to love the community, help the community, and enable the community to grow, and that requires a significant amount of time.

Haoyuan Li: And there's the other side: I was a PhD student back then, with a lot of responsibility in terms of getting papers out, doing research, and doing collaborative research projects on Alluxio, back then called Tachyon, right? It's all just competing for time. I think I was very lucky at the time. Lots of people had an interest in this, from both industry and academia.
And there were a lot of people contributing to this project, to this community, in various ways. For example, from the academia perspective, there were several very high-profile research projects built on top of this work and published in top-notch conferences like OSDI, SOSP, and NSDI, the best conferences in systems, networking, databases, etc. I was very fortunate to work with those people, and the people who led those projects are today professors at CMU, Stanford, etc.

Haoyuan Li: So that's the academia side. From the community side, in the very early days, very quickly, we attracted people from various companies, and companies like Intel and Red Hat all contributed to this in a non-trivial way. I remember they flew me out. I was a student, so I was very happy. They flew me to the different cities where their major engineering resources related to this were located at the time, and let us communicate with each other and exchange ideas. Then their engineers would try to implement things and contribute them to the open source project. And we just communicated and kept in touch all the time. Those people grew in the community and had more impact in the community as well. Those are the two things. At the same time, from the resource perspective, you have a 40-hour working week, and if you work slightly longer than that, you essentially still divide your time over the week, and you can still make things move forward. Each side still has enough time to carry on and move forward.
Haoyuan Li: And besides these two things, I think the AMPLab at UC Berkeley also provided a very good platform for academic research to interact with industry, through the lab's sponsorship program as well as the twice-a-year retreat, where the lab invites all the sponsors and the engineering and technical folks from the sponsoring companies to one venue. Students in the lab present their results and their work, and the sponsors give feedback, those types of things. So this all helped in the early days.

Eric Anderson: And I imagine at some point the project's mature enough that you hit some milestones. You probably got some customers or users in production, and that could have been exciting. I don't know if you have any stories around early users in production, or, as you were considering leaving university and deciding what to do with this project, any stories around your decision to spend your career, I guess, continuing the project.

Haoyuan Li: I mean, along with all the community engagement and industry engagement, there were very naturally many presentations. I remember in the first year I probably did a hundred presentations myself, maybe sometimes three presentations a day, to do it effectively. I mean, we're living in the Valley, so that's more effective, up and down, [inaudible 00:16:48]

Haoyuan Li: At one point, I remember it was the summer of '14, I got some inbound requests from the venture community. They wanted to understand what was going on and what the vision was, and we had some conversations. Also at that time, towards the end of 2014, I had really sort of become the bottleneck for this growing community. Growing it in a much more scalable way required much more full-time resource than what we had at the time. That was when I started thinking very seriously that maybe it was the right time to start a company to carry this forward, to realize the vision, to bring more interesting stuff to the industry, and to let more people use it.

Eric Anderson: And are there any other individuals that took leading roles in this? I imagine it's hard to get people on board, and once they're on board, are there people that help carry the banner for you?

Haoyuan Li: We have very strong people in the community, both from industry and from academia. On the academia side, for example, in my lab we had the luxury of being able to get some very talented research assistants. They were either master's students or undergraduate students at UC Berkeley. The very first person we got was Calvin, Calvin Jia. He's a very talented individual, super smart. He was the very first internal person I could rely on. In fact, after I founded the company, he was the first person to join, and he's with us today.

Haoyuan Li: And he carries a very big role in pushing this forward. That's an example of an internal resource at the time. From an external perspective, I remember people from Intel in particular. There's a very strong team at Intel Shanghai, a very big team working on various open source projects, and on that team a person called Grace carried a significant role in evangelizing and helping grow this project. And from Red Hat, and from Pivotal, which was part of EMC back then, before the Dell EMC acquisition. That was a very early stage, and people started to get involved in the project, and all of that contributes significantly to where this is today.

Eric Anderson: Fast forward a bit now to how you've kind of wrapped up the evolution of the project. Governance, for example: at some point people might want to know how that's going to operate going forward. We can talk through that.
Haoyuan Li: First of all, what's the goal? That's the most important thing. With a goal, you define, say, the mechanism. The goal is really to realize this vision, and realizing it requires significant collaboration in a big community, in a big ecosystem. At the same time, by realizing it, this software can have a very significant impact on the world. That is one goal. Besides that, you need to realize the technical vision of being the data orchestration system, orchestrating all the data. Alluxio is to data what Kubernetes is to compute. That's the goal, right? With that, you want to build a very big community. In the community, there are different parties involved, and the fundamental part is that we want to enable more and more users.

Haoyuan Li: Say there is a potential user there, and they want to leverage the open source software. The community's goal is to help them, to really make those users very successful. That's the goal. With that goal in mind, the community itself will have different people with different types of knowledge and different types of functions to help the community. There will be people focusing more on the software development side, the contributing side, the feature side, the documentation side, the testing side, all critical, all very important. That's one side. On the other side, you also have people helping evangelize the software, users coming out to speak for the software and share their experience of how they use it with other users as well as potential users. All these people are critical. So the goal of the mechanism is to enable these people, and to let these people have fun doing their work while at the same time being acknowledged.

Haoyuan Li: So the way we're doing this today is that we have the maintainers, we have the PMC, and then we have the contributors. Today we have 1,000-plus contributors.
People can see that from our [inaudible 00:22:14] repository. I think it's almost 1,100 today. That's the contributor side. And we have probably 40 PMC members and around 10 maintainers to hold up the code standard. That's the current structure for this part. There are also a lot of people testing the software and using the software, and you want to enable those people to provide a feedback loop to the community. Today, the best channel for us for doing this is Slack. Developers like Slack. Whenever people say this doesn't work, that doesn't work, this is confusing, that's confusing,

Haoyuan Li: or I need another feature, all those types of things, then besides the conversation in the Slack channel, we encourage those people to really contribute, like filing an issue on our GitHub, and then other people will take care of that issue. That's the Slack channel side, the community engagement side with the user community. Then after some people become users, we encourage them to talk. We have a global online meetup besides the physical, local meetups in different places. Now we have this online meetup to enable users to share their stories. And of course we have users in different cities and metro areas, and they sometimes host meetups by themselves. Two weeks ago in Beijing, there was a meetup organized by the community. Of course we helped as well. There was one presenter from Alibaba, one presenter from JD.com, and one presenter from Baidu, and they all presented their production use cases. Super encouraging, super exciting material. As for the global online meetup, here's an example from probably four or five weeks ago.

Haoyuan Li: We had an online meetup where ING Group presented their production use case of Presto, Alluxio, and DC/OS with their S3 storage systems, etc.
We try to make this whole thing flow as smoothly as possible, but there's still a lot of work to be done to really achieve what we want to achieve.

Eric Anderson: Great. Maybe walk me through the contributing experience. So as you described, if I get interested in Alluxio, I try it out, and I have an idea for something to be added, I can engage with people on Slack. I can raise an issue on GitHub. If there's some interest around those activities, I could create a PR, and then the maintainers you mentioned can choose to pull that in. That's more or less the...

Haoyuan Li: There are three types of users. One type of user is super technical. Besides using our software, they want to write code for us, doing new features like you said. Typically these types of developers are either from very strong startups or from tech giants, like the three companies I mentioned, with thousands of strong engineers in the company. So this is category one, and they will do what you said. When that happens, some relationships are pretty strategic. For example, one company I mentioned, JD, is trying to use us in a very big way. They have already deployed 1,000-plus nodes, and they want to deploy many more than that. They have some joint development with us to contribute to the community, and they are testing new features, all those types of things.

Haoyuan Li: Super, super exciting. That's a very close engagement. The maintainers and PMC will work with those people very closely. And the developers doing this work from those companies will gradually be promoted to PMC members. Then if they do more work and hold a high standard for the code, they'll be promoted to maintainer.

Haoyuan Li: That's the first category. We have done that, in fact. The second category is users as well, who want to use the software, and they find some bugs with the integration with their environment, or some small features they want to add.
And that's relatively straightforward. As long as our side, the community, has a mechanism to make sure their requests, their code, their features, and their conversations are being handled in a timely way, they will be very happy. And that's what we are doing every day as well. The third category is more towards the user side, not trying to modify the code at all. They are trying to solve an issue by using the software without [inaudible 00:27:21] the software. And that's actually the majority in the end. For them, what we encourage them to do is file bugs and file feature requests. Those are the three different categories, and we encourage them in different ways.

Eric Anderson: HY, I want to make sure we cover anything you wanted to cover. That was the story we wanted to capture today: the beginnings of the project, how the project evolved, and where you're at today, especially as you engage with the community. Is there anything in addition you want to say about Alluxio or its community?

Haoyuan Li: Yeah. This has been more the story of how this software and community grew up, but maybe I will share a little bit more regarding what exactly we do today, so that hopefully the audience, if they find it interesting, can check it out and possibly join our community as a user or contributor, etc. We are a data orchestration system. Essentially, in the stack, we are in the middle. On top of us are all the data-driven applications, and below us are all the storage systems. Good examples of the frameworks on top of us are Presto, Apache Spark, TensorFlow, all the data analytics as well as machine learning tools, and in the future more data workloads. Those are the workloads running on top of us.

Haoyuan Li: And below us, there are all the file systems and object stores of today: file systems like HDFS, NFS, and so on.
And for object storage, there's Amazon S3, GCP storage, Microsoft Azure storage, or EMC ECS, etc. This new layer, our orchestration system in the middle, essentially virtualizes the data from all these different storage systems. Sometimes it caches the data as well, and it abstracts and presents all this data to all the upper-layer applications. This enables any of the data-driven applications we mentioned to use Alluxio to interact with data from any storage deployment. It brings data accessibility to these applications, as well as performance gains. Some very simple and common use cases are, for example: if you have a hybrid cloud and you want to leverage the compute in the cloud environment, you run your Presto, Spark, or TensorFlow in the cloud on top of Alluxio, co-located.

Haoyuan Li: And Alluxio will help manage how the data moves around between the on-premise environment and the cloud environment. That's the hybrid cloud use case, very common, and it enables many leading companies to leverage cloud compute resources easily without moving the data at all. The second common use case is the single-cloud data acceleration use case. Essentially, when you are running Presto or Spark in a single cloud environment and you have data in S3 or some other storage, you can use Alluxio to provide performance acceleration as well as a consistent performance SLA. We have people running this type of workload who see, like, a 50% performance improvement, and some a 10-times performance improvement, in the SLA-guaranteed case. That's very popular as well. Beyond these two, there are many other use cases. I would encourage people to go to our website, Alluxio.io, to check them out.

Haoyuan Li: And we have maybe 40 or 50 or even more detailed production deployment use case studies in the Powered By section, and people can click to see the slides, blogs, and talks about how they are using Alluxio.
And hopefully those can be useful to other potential users.

Eric Anderson: The Powered By examples are impressive. It sounds like, maybe in summary, the two value propositions are, one, just connectivity: being able to connect to various systems [crosstalk 00:31:42] through kind of one paradigm, or being able to swap between those easily. And then the other is performance.

Haoyuan Li: Exactly. Those are the two strongest values at the moment. There are other people who, for various different use cases, leverage more value, like multi-cloud, but those are the more advanced use cases, where we encourage people to look at the website.

Eric Anderson: Awesome. HY, thank you so much for coming today and telling us your story and Alluxio's story. Keep up the good work.

Haoyuan Li: Great. Thank you for having me.

Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson, and this has been Contributor.