Alexey Milovidov: What if our competitors benefit from our product? The answer is, this is fine. If our competitors do not use our product, they will use some other product, and that other product will benefit.

Eric Anderson: This is Contributor, a podcast telling the stories behind the best open-source projects and the communities that make them. I'm Eric Anderson.

Eric Anderson: We're here today talking about ClickHouse, an analytical database from the Yandex team. I'm joined by Alexey and Ivan and I'll have them introduce themselves.

Alexey Milovidov: Hi, everyone. I am Alexey. I am a developer of ClickHouse. That's all that I can say about myself.

Ivan Blinkov: Hey, everyone. I am Ivan. I'm the technical manager of ClickHouse. I joined the team not from the very beginning but from the moment it became open source.

Eric Anderson: Wonderful, thank you. Let's start by having you tell us the story of ClickHouse. How did it begin?

Alexey Milovidov: Actually, it's a very interesting story about how we came to develop ClickHouse and how we decided that we needed to make it open source.

Alexey Milovidov: It all started in 2008. The first commits to the ClickHouse code base were more than 12 years ago, or 11 years ago, but I was unaware that the code I committed would eventually become ClickHouse as you know it right now.

Alexey Milovidov: I was working at Yandex. Do you know Yandex, by the way?

Eric Anderson: Yes, yes. This is the Russian search engine.

Alexey Milovidov: Yeah, a Russian search engine, and not only a search engine. Also self-driving cars, many different technologies. One of these technologies was an analytical system for web analytics. I'm sure you know web analytics. Probably the first product is Google Analytics. At Yandex, we have a similar product, named Yandex.Metrica. If you compare market share by the number of websites, by the amount of traffic, you will be surprised because Yandex.Metrica is the second.
The second web analytics system in the world by the number of websites or by the amount of traffic. I was working on the data processing engine of this system.

Alexey Milovidov: Actually, it was a very challenging job because if you want to collect as much traffic from the internet as possible, what will the data volume be? How to process this data, how to store this data? How to structure it to allow users to see flexible, customizable reports? If you just count the amount of traffic, it will be tens of billions of events each day. If you imagine how to store these events, what database do you need to use? What can you implement to do this task? Can you use, for example, MySQL or Postgres or Oracle just to pipe this data, tens of billions of events, to some tables and generate reports directly from this database? If you just try to do this and calculate how quickly you can generate reports, you will find that this task is almost impossible.

Eric Anderson: Just curious. Did you try those other options when you were experimenting with how to implement this?

Alexey Milovidov: What were the options? There are different kinds of databases. Transactional databases, analytical databases, and there are many open-source database services, and there are open-source analytical databases. Even 12 years ago, we considered some options like the Infobright database management system, InfinityDB, MongoDB. We started testing and evaluating these systems for our use cases.

Alexey Milovidov: We found out that every system had some drawbacks. Some systems had no support for scale-out, so you cannot run them on a cluster. But you cannot use just a single server or a single shard with multiple replicas to store the whole amount of internet traffic that comes to our system.

Alexey Milovidov: Some systems didn't have compression, but it is crucial because if you can use about five to 10 times less storage, you should do it if the amount of data is counted in petabytes.
Otherwise you will just spend too much money on HDDs.

Eric Anderson: Got it. Maybe just one quick note. I'm hearing you say we, so I'm imagining you're in a small team that's considering how to do this.

Alexey Milovidov: My team was about five developers, if I remember correctly. Five developers worked on Yandex.Metrica. We started the evaluation in parallel with an experiment. The goal of this experiment was, what if it is possible to develop some specialized prototype database directly for the task that we need? The task was to make some data structure where we can simply store non-aggregated events with many, many attributes, and generate different reports on the fly without pre-aggregation.

Alexey Milovidov: This experiment was successful, so we started to use that prototype. But that prototype was not named ClickHouse. It was a completely different system. We named that system OLAP server, but it doesn't matter. It's just some placeholder name. It was not open source. We started to use it, and we started to use it successfully, but it was quite limited. It supported only one kind of table with a predefined structure and predefined types of reports. It was more confined.

Alexey Milovidov: I just had a dream. What if we can make something from this prototype? Something that will be more usable for external teams inside my company. Something that will be more usable even for other tasks in the same department. How can we generalize this data structure to make it better?

Alexey Milovidov: When we evaluated different systems, I got familiar with the notion of column-oriented databases. Column-oriented databases are nothing new. They have existed since probably the 1980s, from Sybase IQ and Vertica and whatever. Our prototype was obviously also a column-oriented database. So if we want to generalize the system, obviously we need to create some good column-oriented database management system.

Eric Anderson: As I remember, Google's Dremel paper was in 2010.
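
Editor's note: the prototype Alexey describes above stored raw, non-aggregated events and built reports at query time, and he calls it column-oriented. The Python sketch below is only an illustration of that general idea, not ClickHouse's actual implementation. The event fields and the helpers `to_columns` and `count_by` are hypothetical names made up for this note.

```python
from collections import defaultdict

# Hypothetical raw page-view events; the field names are illustrative only.
EVENTS = [
    {"site_id": 1, "country": "RU", "browser": "Firefox", "duration_ms": 1200},
    {"site_id": 1, "country": "US", "browser": "Chrome", "duration_ms": 300},
    {"site_id": 2, "country": "RU", "browser": "Chrome", "duration_ms": 4500},
    {"site_id": 1, "country": "RU", "browser": "Chrome", "duration_ms": 800},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per attribute (column layout)."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def count_by(columns, group_col, filter_col=None, filter_value=None):
    """Count events per group, reading only the columns the query names."""
    groups = columns[group_col]
    if filter_col is None:
        mask = [True] * len(groups)
    else:
        mask = [value == filter_value for value in columns[filter_col]]
    counts = defaultdict(int)
    for keep, key in zip(mask, groups):
        if keep:
            counts[key] += 1
    return dict(counts)


COLUMNS = to_columns(EVENTS)

# Report computed on the fly from raw events: page views per country for site 1.
report = count_by(COLUMNS, "country", filter_col="site_id", filter_value=1)
print(report)
```

Because each attribute lives in its own list, this query touches only `site_id` and `country` and never reads `browser` or `duration_ms`. That is the usual argument for a column layout when events carry many attributes and each report needs only a few of them.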
Alexey Milovidov: I have read this paper and I was struck by how nice it is. Why is it not open source? Why can we not use it?

Eric Anderson: Yeah. So you were aware of that work. Maybe it was some inspiration, but you certainly couldn't use it. There was no code base.

Alexey Milovidov: Yeah. About inspiration, the practice is, you just read everything you can find and grab all the best ideas. But why is ClickHouse different from other column-oriented database management systems? The first reason is that ClickHouse was created to be a performance-oriented system. Why do we really need that? Because if you have some really big dataset, how many reports can you generate and how precise will these reports be? It depends directly on how fast you can filter, aggregate, join, whatever, with this dataset.

Alexey Milovidov: If you have multiple billions of records and you have a time limit of about one second, can you process hundreds of millions of records in one second? Probably no. Sometimes yes, if you have a large enough cluster, but sometimes you have to apply data sampling to generate approximate reports. But you want to generate reports that are as precise as possible, so you have to optimize every code path inside the system. Probably this is one of the differences. So, ClickHouse was created from very practical needs.

Ivan Blinkov: Though, Yandex.Metrica didn't have huge budgets to buy as many servers as the core products of Yandex, so it was not... Business and [inaudible 00:11:14] performance, you also had to fulfill these business goals with a limited budget. Every optimization allowed us to provide better service for the same amount of expenses.

Alexey Milovidov: For example, we have 100 servers. You can think this is too many, but we have just a limited amount and we have to cover all our data processing needs within this limited amount. We just have to optimize the code.

Alexey Milovidov: It... about how ClickHouse was created initially.
Initially, it was a proprietary system available only inside Yandex, and actually, it was initially available only inside my small team. It was developed very quietly. So, we have the system inside our team. No one else knows about it.

Alexey Milovidov: Eventually, some gossip started to spread around the company because of some reasons, like from the beginning, we have the [inaudible 00:12:29]. That's no surprise, but the system appeared to be easy to install and use for different departments. I was surprised that some people from different floors in my office, from different departments, started to use ClickHouse, and they did not tell me about it. So they started to use the system, they were happy, and everything was fine.

Alexey Milovidov: At that time, I started to have an idea that probably we can get more if we make ClickHouse available outside of Yandex, but this idea looked very crazy, very risky.

Eric Anderson: Does Yandex have a history of open sourcing projects?

Alexey Milovidov: Yeah. Today we have two main open-source products. One of them is ClickHouse. Another is CatBoost. But before, there also were some products, for example the BEM framework for front-end development. Probably you did not hear about it. BEM is a methodology for HTML markup. Or, another product named [Elliptics 00:14:05]. Elliptics is a distributed key-value store. It was developed at first outside of Yandex, then the developer joined Yandex, continued to develop it, and then left and continued developing it outside. Actually yes, Yandex has some track record for open-source products, but not all of it successful. Some of them were not successful. Actually, it depends not only on the company. It depends on the people who are actually responsible for this specific product.

Ivan Blinkov: Some projects were published, but with only one commit and no development at all. Those mostly looked negative, but that's what it is.

Alexey Milovidov: So, open source is not about the company. It's about people.
About developers and community. How much time you are ready to spend supporting your community, and how much time you have to spend actually developing the code. Some developers are just not ready for this type of work.

Eric Anderson: The developers on your team, you mean.

Alexey Milovidov: No. It's not about my team. I am talking about open source in general, what the differences are between successful and unsuccessful open-source products. The first difference is usually the right time and the right niche. The right use case. The system must solve an actual problem in an efficient way.

Eric Anderson: Yup.

Alexey Milovidov: This is just one reason for successful open source. Another reason is about the people who are responsible for this work, for development. Sometimes, some open-source product can start, but developers can lose interest, the situation may change. Life may change and so on. If you want to make a successful open-source product, you have to work on it day by day and never stop.

Eric Anderson: Yeah.

Alexey Milovidov: By the way, I did not tell you how we made the idea of open source look good to our managers. Initially, this idea can look crazy. Why should we spend our working time inside our company to do this separate job? I prepared a list of motivation points. I don't have this list right now, but I can remember some of these points.

Alexey Milovidov: For example, better motivation for developers. If some developers work on an open-source product, it is usually more interesting. It is usually more satisfying. You see that your code is used by multiple different companies.

Ivan Blinkov: Worldwide.

Alexey Milovidov: Yeah. You usually work with multiple people who are also interested in your project. In this sense, open source is a good way to make developers more satisfied with their job, so our developers are happier.
Probably, and actually it is so, some developers from other departments tend to think that in our department we have the most interesting job.

Eric Anderson: There you go.

Alexey Milovidov: I think that's right.

Eric Anderson: That was the main point that you gave your managers? Or just the first point?

Alexey Milovidov: No, it's just one of about 10 points. There are other points. For example, open source makes for a better quality product. Better quality of code and better coverage of use cases. This also appeared to be true because sometimes inside our company there is demand for some feature, and the feature was actually developed by an external contributor. This is not so frequent a case, but there are such cases.

Eric Anderson: Okay.

Alexey Milovidov: It is very surprising. So, if some company makes a good open-source product that is used in many other companies, this company will benefit from the community.

Ivan Blinkov: It's also worth adding that it's not only about features and contributions but also, well, whatever issues. I think nowadays, more often troubles may come with performance or features or categorization or something, but often it's found by external people, not by the production cluster of Yandex.Metrica, which is still the most important for Yandex. So we're now able to find issues earlier and fix them earlier, so they never come close to our production.

Eric Anderson: Bug discovery. Perfect. Anything else important that you tried to sell the management on?

Alexey Milovidov: Yeah. So, better motivation of developers, better quality of product, and one of the most important points is to occupy a specific market niche. There was high demand for good quality open-source analytical databases, and we figured out that at that time, there were no mature enough solutions.
There was a chance to just be in this niche, and if we just waited a few years, probably this niche would be filled with different systems from different companies. But today, if you list open-source analytical database management systems that you can use both in the cloud and on premises, what can you name? ClickHouse first. What else?

Eric Anderson: What about Presto?

Alexey Milovidov: Yeah, about Presto. By the way, Presto is more like a data processing engine. If you store data in Hadoop HDFS, you can use Presto to process your data. Also, as far as I know, you can use Presto to process data in local files, but you cannot use Presto as an analytical database where you can insert data continuously and which will be responsible for data storage and data indexing, so you can process range queries efficiently and so on.

Alexey Milovidov: In contrast, ClickHouse is an integrated solution. It comes with more operational simplicity. You don't have to use a Hadoop cluster. Just install ClickHouse, upload your data, and it will be instantly available for analytics.

Eric Anderson: Got it. Okay. So Presto is slightly different or in a different class?

Alexey Milovidov: Yeah. A slightly different use case. We have some intersections, but in some use cases Presto is not an option. In some use cases, ClickHouse can suit better.

Eric Anderson: How about Druid?

Alexey Milovidov: Yeah. Also a very interesting solution. By the way, for Druid, the first difference is probably the lack of an SQL dialect. Probably my information is a bit outdated, but if I remember correctly, that's it.

Alexey Milovidov: Another difference is operational complexity. To use Druid you have to install, as I remember, several different systems, and stick this all together, and probably it will work. It will do some data aggregations and whatever.

Alexey Milovidov: The third difference is that the main use case for Druid is to do incremental data aggregations and serve reports from aggregated data.
In contrast, ClickHouse was developed to be relational, to look like a relational database where you can upload your data in non-aggregated form, in [raw 00:23:38] form, and do aggregations on the fly, as far as possible on the fly from raw data.

Alexey Milovidov: Of course, ClickHouse has support for incremental aggregation with materialized views, but that's a long story.

Eric Anderson: Just to wrap up the part about pitching management on open sourcing, you came with these, I think we've got three points now. Developers will love it. There's an opportunity in the marketplace, and it will improve our code base.

Alexey Milovidov: So. There were also points that were named as potential risks. Potential risks of open source. For example, what if our competitors benefit from our product? The answer is that this is fine. Our competitors will use our product, other companies will use our product. We will benefit from the spread of the open-source product. If our competitors do not use our product, they will use some other product, and that other product will benefit. It's difficult to explain, but actually it's not a concern.

Alexey Milovidov: Now, if we count the approximate number of companies that are using ClickHouse, it will be thousands of companies across the world. In Russia there are multiple companies that benefit from ClickHouse, but they compete with Yandex in some different areas. That's not a problem.

Eric Anderson: Great.

Alexey Milovidov: Another possible concern is, what if developers spend too much time supporting the needs of the open-source product, the needs of the community, and not on the internal tasks that we need? This concern is actually unresolved. It appeared to be true and we actually spend more than half of our time on the needs of open source. But I think the amount of benefits is much more important than some drawbacks of open source.

Eric Anderson: Got it. Maybe if we're coming to the conclusion of the list, how does management feel?
Are they worried about the time spent on open source?

Alexey Milovidov: They are excited and probably overwhelmed.

Eric Anderson: Yeah. Good!

Ivan Blinkov: At first it was like an experiment, but now there are all the numbers that we have about the popularity of ClickHouse, about how widespread it is worldwide. Many people, I think, still don't believe it, but it's growing all over the world.

Eric Anderson: Yeah, it's incredible. I can imagine everyone's very excited about the adoption.

Alexey Milovidov: Probably another point that I wanted to mention is that there are different kinds of open-source products and we want ClickHouse to be a true open-source product. What do I mean? Not only is the code open source. Also, the development happens in an open repository. Compare that with some repositories of Google, where code development happens inside and then they just publish it to an open-source repository.

Alexey Milovidov: For ClickHouse, it's 100% open source in mind. Also, it's about an open-source continuous integration system that is available to external contributors. External contributors should feel that they are first-class citizens, the same as me. The same as my team. This is the way to build a really good community and to make the product live well.

Ivan Blinkov: The documentation is also open source, which is important. Everyone can extend it or [inaudible 00:28:04] example that's useful for them, like share. So the point is also to make the threshold to enter the community or to start using the product as low as possible. To ask, to bring new people, new companies to try, to see how well it fits their needs. In general, it helps, being open to the community. It helps Yandex and everyone in the world.

Eric Anderson: Yeah. Agreed. It's fantastic. Maybe take us to the launch for open source. Management agreed to go forth with the open-source launch, and then what happened?

Alexey Milovidov: Yeah. I prepared some lists of to-do points on how to make a great launch.
What we need to do. To prepare an article, first in Russian. How we can spread information in social media. What we have to do with our repository. How we can prepare it and how we can split the repository from internal code. What we will do at the beginning. How and where we can host our website.

Alexey Milovidov: We started to check off these lists, and it took about half a year. We made our launch in July 2016 but we got approval from our managers in about January, or at the beginning of 2016. I think it's a good schedule, half a year. For a big company it's perfect.

Eric Anderson: Yeah. Okay. So, July comes. You launch. What's the reception like?

Alexey Milovidov: Mostly positive. Actually, we prepared only an article in Russian, but some people just posted a link on Hacker News to a machine translation of this article, and even that worked.

Eric Anderson: Okay!

Alexey Milovidov: Yeah. So it got some points and people started to discuss. The discussion was about internals, about comparison with different systems and what it looks like and whatever.

Eric Anderson: Mostly technical discussion.

Alexey Milovidov: Yeah, mostly technical. As for Hacker News, yeah, it's a good community of engineers.

Alexey Milovidov: After another few months, some people started to really try our system for their production needs. By the way, if you think about our users, what is the picture of the best user of ClickHouse? The best user of ClickHouse is some company that was suffering too much and they tried everything that is available. If they have used algorithms for analytics as I usually tried [inaudible 00:31:20] DB, they tried Greenplum, and now there is ClickHouse. They tried ClickHouse and were just surprised that it works, with unusual performance characteristics. "Probably we can use it, even if it is quite a new system, but it works and it can solve our problems." This is the picture of the best ClickHouse user.
Alexey Milovidov: Actually, there are many companies that look like this.

Eric Anderson: Yeah. Basically, you prefer somebody who's tried all the other options.

Alexey Milovidov: Yeah, because I like it when people use ClickHouse because they really need that kind of system. They understand why they need ClickHouse. Actually, we are trying to spread ClickHouse as widely as possible. Sometimes we even create some hype, but the best clients are those who understand they need our system.

Eric Anderson: Yeah. So, the Hacker News launch went well. Then you get a few people reaching out to you telling you they're using it. Did it just grow organically from there?

Alexey Milovidov: The next step, I remember a Russian conference named [inaudible 00:32:55] plus plus. At this conference, I met Alexander from Altinity. At that time there was no such company as Altinity. Alexander was working at the company Lifestreet Media, which does some web advertising. They had the same demand. Alexander told me how they tried to use ClickHouse for their needs inside their company. There were many pain points, and that's normal because ClickHouse was a young product, but we met and started to work together on solving these pain points.

Alexey Milovidov: This conference is one of the biggest technology conferences in Russia, so ClickHouse got some traction. Also, ClickHouse started to get some traction not only in Russia. In the United States, in Europe. Probably the main driver is developers with a Russian background.

Alexey Milovidov: About China, it's very unusual because ClickHouse is very popular in China and China is the second largest market for ClickHouse usage. China is unique because the big data in China is always big. Not like big data in, for example, I don't want to name some small countries but sometimes [inaudible 00:34:48] about big data. It's not really. But in China, as I said, we have a small mobile application and we want to collect logs, and our mobile application has one billion users daily.
Eric Anderson: Yeah. That's big data. So, you have adoption among Russian companies, I imagine in part because of your personal networks and the language matching: the documentation, your explanations and the community are predominantly Russian. Then you have adoption in China because of their big data needs.

Ivan Blinkov: Can I tell a story? In China there was a contest for a lot of calculations. Like, some famous company in China made a contest for who makes the best system to complete some specific marketing funnel, like a specific task, and they made a huge advertisement for this contest, and whoever makes the fastest system wins and gets huge coverage. Some guys found Russian ClickHouse, the [inaudible 00:35:54] Russian, and they used it to beat famous products like Apache Spark and everything else that was on the market, and they beat it by a huge margin. So, they built specific reports much faster than any other open-source system available.

Ivan Blinkov: When they won this contest, every engineer who observed it, who participated, wanted to learn more about how they did it. They also started training and many became enthusiasts in their companies. It was kind of funny that it was growing, started from one moment, and we started being adopted by various people, various engineers who are excited and want to contribute. Some people started translating the documentation to Chinese, which is important because many people in China have trouble with English. From there, it very quickly got to huge Chinese companies that everyone knows even outside of China.

Eric Anderson: Wow. So you've got your community winning high-profile contests with the product and then you have important documentation contributions. Translation into Chinese.

Alexey Milovidov: Yeah. Documentation, translation and code. There are multiple good engineers in China working on ClickHouse and I am happy.

Ivan Blinkov: Also one important point.
We started hosting, organizing meetups of ClickHouse users and developers all over the world. Like, some people from China invited us there, so we made some events in Russia, then in Europe and the States. Meeting with the community [inaudible 00:37:36] also played a huge role because we started to get a better feel for how it is actually working for people, talking with them. Not only online but [inaudible 00:37:44]. We started to do regular talks at large conferences and regular meetups all over the world.

Eric Anderson: As this community is getting bigger, you're getting people who are using you in production and they have mission-critical needs. You said that your model for developing new features is to do it in the open.

Alexey Milovidov: Yeah.

Eric Anderson: Talk to me more about how governance works, how you manage those. This is all just posted on GitHub, or what's the mechanism that you use to accomplish that?

Alexey Milovidov: Yes. We have such contributions as [pull 00:38:20] requests, and the first step is just the continuous integration system that will run the tests that we have. Multiple different kinds of tests. Some are pretty simple, and others are complex integration tests that create multiple instances of ClickHouse and multiple instances of different systems like Kafka, MySQL, Mongo, whatever. We have static analysis, whatever.

Alexey Milovidov: Then there is code review. We start to discuss the code, and sometimes we also try to test the new feature from a usability standpoint. We start to discuss how well this contribution integrates into the overall design of the system. Are there some risks? Is it consistent? So, we discuss, we test, and eventually the new feature is merged. Then there is also a process where we deploy the new version of ClickHouse to a testing environment inside our company, but it's not that feasible for open source, so that's it.

Eric Anderson: What's the governance of the project? Do you have committers?
Are those all committers within Yandex today? Does Yandex own the code base, and have you considered what the future looks like in that regard?

Alexey Milovidov: The main direction is to make this product as open as possible. What does that mean? Currently, Yandex owns the product legally, but probably we will do something to make it even more open. Apparently it's an open question, so I cannot say more about it.

Alexey Milovidov: About current governance, we have many committers inside my team and we also have some dedicated committers from the community, but actually all contributions must be accepted by my team.

Eric Anderson: Got it. Maybe in our last minutes, a concluding question here. How do you feel about the decision to open source it? I imagine you're pretty excited and this has gone as well as you would have hoped.

Alexey Milovidov: Yeah. I'm very excited and very happy that I made this decision.

Eric Anderson: How big is the ClickHouse team in Yandex today? You mentioned it started at five people.

Alexey Milovidov: It is just 14 developers.

Eric Anderson: Wow.

Alexey Milovidov: When I came to China, they said that 14 developers is a very small team, but I think it's not very small. It's quite normal. But I always want more.

Eric Anderson: Yeah. Don't we all? Well Ivan, Alexey, this has been fantastic to have you share the story with us today. Your contribution in ClickHouse to the world is amazing. As you mentioned, thousands of companies are relying on it. I feel like we're just seeing the beginning. The growth is astronomical and I imagine we'll have much more to discuss in the future.

Alexey Milovidov: Okay. Thank you.

Ivan Blinkov: Thank you.

Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson and this has been Contributor.