Frank Liu: We're going in a direction I like to call "embedding everything." And I think that's really exciting, not just for machine learning application developers, but also for a wide range of existing use cases as well. So really, a vector database is, I would say, a database for the AI era.

Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson. Today, we're talking about Milvus, a vector database, and we have with us Frank Liu, who's an architect at Zilliz, the company that backs Milvus. Frank, thanks for joining.

Frank Liu: Yeah, thanks for having me on the show, Eric.

Eric Anderson: So Frank, I think it's imperative that we first clarify what a vector database is. This is kind of a component new to us in the world of machine learning, right?

Frank Liu: Yeah, absolutely. I'm actually going to take a quick step back. I'm going to talk about embeddings.

Eric Anderson: Yeah.

Frank Liu: And just machine learning in general. First, I know a lot of listeners probably are familiar with this concept, but for those of you out there who are fairly new to machine learning, there's this idea called embedding space. What we're able to do is take human-generated data, or unstructured data, and embed it as a high-dimensional vector. A lot of these vectors can be 256 dimensions, 512 dimensions, but it's essentially a vector of floating-point values. And these are called embeddings, right?

Frank Liu: In general, I would say the vast majority of embeddings that you'll see in industry or in academia today are generated using machine learning models. So in the case of computer vision, either a ConvNet or a transformer, a vision transformer, and in the case of NLP, you have LSTMs, transformers, so on and so forth.
But the idea is that, with these embeddings, you are able to capture the semantic information in all of this unstructured data, all of this human-generated data: images, video, audio, text.

Frank Liu: So let's say I have two pictures of German shepherds. They can be very, very different in terms of actual pixel values. But if I were to use machine learning and map both of these into embedding space, we end up getting two vectors that are very, very close to each other, just in terms of Euclidean distance. So that's really the power of machine learning. That's the power of embeddings, and what a vector database hopes to do is capture the power of those embeddings and allow users and applications to search, index, and store those embeddings. Right? I will say one thing, which is that there are other ways of generating embeddings beyond just machine learning models.

Frank Liu: So, we do have a Milvus user, for example, who actually has a way of generating embeddings with a handcrafted algorithm, turning an executable into this embedding space. And what they're able to do is scan for viruses or malware just by seeing: if I have this new executable coming in and I turn that into an embedding and it's very close to other existing pieces of malware, that's probably something that we should flag for review.

Frank Liu: So just to summarize that real quick: an embedding is a high-dimensional vector that captures the semantic information of the input data. Now, what a vector database such as Milvus is able to do is store, index, and search across those vectors. And that is actually a critical component for a lot of AI applications today. And I'll give a quick example, right?

Frank Liu: So a very popular one is reverse image search. In reverse image search, what you want to do is, given an input photo, and I'm going to go back to the German shepherd example, you want to be able to search for semantically similar images.
So I think if you go to, let's say, Google image search today, I think that's still very much based on a combination of color features, probably some text-based features as well, but we want to be able to search these images based on the semantic content.

Frank Liu: If I have a picture of a German shepherd, I want to be able to search for other German shepherds similar to that as well. And with a vector database, if you imagine you have this huge data set, maybe a hundred million or a billion images, and you want to search for other similar content, similar images within that vector database, that's something Milvus allows you to do. So that was really a crash course on embeddings, and on vector databases as well. So I hope that's mostly clear there.

Eric Anderson: Yeah. You know, a couple follow-ups. So, vectors and embeddings in this context are synonymous. You've used them a bit interchangeably.

Frank Liu: Yeah, absolutely, absolutely.

Eric Anderson: And then, you mentioned that you'd have a vector for this German shepherd dog, but basically anything that might have a label in the image could have a vector, like eyes, paws. Could those also have vectors associated with them, or is it just kind of one vector per image or-

Frank Liu: Yeah, they could. If you have, let's say, an object detector and an image, and you run that image through your object detector, you could have individual components inside of that picture which have embeddings as well. So let's say I have a picture of a conference room. There are many chairs in there, many different types of chairs. There's probably, let's say, a television mounted on the wall. There's a conference table. All of those could be turned into embeddings.

Frank Liu: There is one point that I forgot to mention earlier, which is that embeddings nowadays are actually multimodal, which makes them even more powerful. There's an idea where I can actually embed text and images into the same space.
So if I were to take, let's say, the sentence "a picture of a chair," and I were to take a picture of a chair, I could turn that image...

Frank Liu: And I could turn that sentence, that piece of text, into vectors which are close to each other, through the power of machine learning. We're going in a direction I like to call "embedding everything." I think probably a couple years ago had a tweet where he said, "Just embed the world," something along those lines. And I think that's really exciting, not just for machine learning application developers, but also for a wide range of existing use cases as well.

Frank Liu: Because what you were previously able to do only through, probably, some human curation or probably having descriptions, you're now able to do by combining a vector database with machine learning models, with AI. So really, a vector database is, I would say, a database for the AI era, right?

Eric Anderson: Yeah. Now that we're talking about the database a little bit, is this a component used to generate the model? I think that's where a lot of people feel like they can understand where tooling fits: in my process of generating models, I need certain things.

Eric Anderson: Or is it more on the serving side, or is it used for these second-order use cases? You described maybe if I wanted to create an image search engine, I could put a bunch of embeddings in a database. And so in that sense, it's not quite a model that's serving as much as it is the database that's serving.

Frank Liu: That's actually a great question. If you have a model, or you have a way of turning your unstructured data into embeddings, then I think a vector database really becomes extremely useful for you. That's a great question, by the way. And I'll talk a little bit deeper about that. Oftentimes it's more than just wanting to create a reverse image search engine, or it's more than wanting to do recommendations.
Frank Liu: Oftentimes, you want to simply analyze the data that you have coming in as input. So I'm going to take a different approach from machine learning: if you have, let's say, protein structures, or you want to do AI drug discovery, you can turn these molecular structures into embeddings and be able to analyze all of those for drug discovery in a vector database.

Frank Liu: And beyond that as well, I think a vector database is really meant to supercharge a lot of the applications where you have huge, enormous quantities of data, and you're not able to analyze all of it effectively. So I would say it's more that, if you already have a model or you already have a way of turning a piece of unstructured data into an embedding, turning images, video, audio, text, a graph, or a protein structure into an embedding, that's where Milvus really becomes useful for you.

Frank Liu: I will follow up on that as well and talk a bit about the whole idea of unstructured data, which I just realized I forgot to do earlier. In the 1970s and '80s, when computers were still very much in their infancy, the idea that computers could be used to search, store, and analyze data, I think that was really at the forefront of a lot of use cases. You have MySQL, which I think came out in 1995, and PostgreSQL came out in '96, I believe. All of these traditional databases are meant to store structured data.

Frank Liu: So data that you can store either in a table-based format, or data that you can store in, let's say, an object-based format, or, let's say, a NoSQL database, JSON, like MongoDB, key-value stores like Redis, so on and so forth. But all of these are really considered structured data. We have a very unique data model, and there's a way for us to be able to turn that data into a form where we can go by these key-value pairs and get that data out.
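
To make the contrast concrete, here is a minimal Python sketch, not Milvus code, of looking data up by an exact key versus ranking it by Euclidean distance in embedding space. The three-dimensional vectors and every name below are made up for illustration; real embeddings would come from a model and typically have hundreds of dimensions.

```python
import math

# Structured lookup: the key has to match exactly, or nothing comes back.
courses = {"CS101": "Intro to Computer Science", "EE101": "Circuits"}

# Hypothetical hand-written embeddings (assumption: a real system would
# produce these with a model, not by hand).
embeddings = {
    "german_shepherd_photo_1": [0.90, 0.10, 0.05],
    "german_shepherd_photo_2": [0.85, 0.15, 0.10],
    "conference_chair_photo": [0.05, 0.20, 0.95],
}

def nearest(query, store, k=1):
    """Return the k item names whose vectors are closest to the query."""
    ranked = sorted(store, key=lambda name: math.dist(query, store[name]))
    return ranked[:k]

# A query vector that is not stored anywhere still finds its neighbors.
query = [0.88, 0.12, 0.07]
top_two = nearest(query, embeddings, k=2)
```

Here `courses.get("cs101")` finds nothing because the key differs in case, while `nearest` ranks both German shepherd photos ahead of the chair photo. A brute-force scan like this only works for small collections, which is the gap a vector index is meant to close.
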
Frank Liu: And unstructured data, being able to search through, let's say, raw pixels in a database, or being able to search through pieces of text in a database, that is not something that these traditional databases were meant to do. So I'll give an example. The other day I was talking to a friend about... in college, I was an electrical engineering major.

Frank Liu: Now I would say I'm firmly in the field of computer science. We were talking about how these two fields really blend very well with each other. So down at Stanford, the Packard building and the Gates building, these are literally right next to each other, the EE and CS halls, right? But in a traditional database, you'd probably have a phrase like "computer science" be closer to "social science," or be closer to some other traditional form of science, right?

Frank Liu: And you wouldn't really be able to tell that electrical engineering and computer science are actually very, very related fields. Now, if you were able to turn both of these phrases into embeddings, you'd actually find, again through the power of machine learning, through the magic of machine learning, that these are actually two very, very related concepts. And that's really where we want to be able to leverage that power over hundreds of millions or billions of items, billions of embeddings, store those in a vector database, and be able to index and query across that.

Eric Anderson: That's an interesting analogy. And you're right. So, these vectors are conceptual representations of things, and you're pointing out that unstructured data doesn't fit in a database, but we can extract conceptual representations of the things in the unstructured data, in the images, and then store those in a database. Well, good. Let's switch gears a bit and now talk about Milvus. Milvus is this impressive project. I know, Frank, you weren't there at the very beginning, but maybe you could let us in on what you do know about its origins and how it came to be?
Frank Liu: Sure. Yeah. At the very beginning, I want to say this is back in 2018, we discovered through a lot of user engagement that vector storage and vector search are really two very important components in a lot of AI applications, but there were really no ready-made tools to solve this problem. There were a lot of vector indexes that already existed. Back at Yahoo, we were using something called locally optimized product quantization, LOPQ. There were these indexes that existed, but there was never a fully managed service.

Frank Liu: There was never a fully managed vector database out there. And we found that a lot of these users really wanted some type of vector management solution beyond just these pure indexes, right? So something like early 2019, we ended up completing the first prototype design and development. A little bit later, there was testing done with a lot of early users, bug fixes, improving the functionality, and really laying the groundwork for what this is today.

Eric Anderson: When you say "we," this is the Zilliz team, right?

Frank Liu: Yes.

Eric Anderson: So-

Frank Liu: Yes.

Eric Anderson: ... Zilliz is the company, formed first, and you wanted to build some solutions for machine learning engineers, and through these interviews happened into this need around vector databases.

Frank Liu: Yeah, that's correct. That's correct.

Eric Anderson: At some point you set to work on Milvus, is that right?

Frank Liu: That's right. Yep. Yep.

Eric Anderson: Okay. Got it.

Frank Liu: And that actually ties in really well with the next point, which is that in early 2020, we actually ended up migrating Milvus over to the LF AI & Data Foundation. It's an umbrella organization under the Linux Foundation, and Milvus became an incubation project in early 2020. And it graduated, I just want to say, last year.
So I think it was the first half, maybe May or June of last year, I'll have to double check that. But in becoming a graduated project, we really wanted to take Milvus to the next level.

Frank Liu: We wanted to make it a fully distributed, cloud-native vector database. And this ties in a little bit with the technology as well, which is that the very, very early versions of Milvus were very much single-instance. And again, they were very prototype-ish, and we were just really trying to get a product out there, get something that users could use. And in that sense, you see a lot of these big, very well-known database companies, let's say Snowflake is a big one, going with a fully distributed, fully cloud-native solution. And we in the Milvus community really believe that is the future of databases, and the same goes for vector databases as well.

Frank Liu: So I would say 2021 all the way to today is really when we made a shift in our thinking, and we really wanted to be a complete solution for the future. Something that will be horizontally scalable, scalable to one day maybe even trillions of embeddings. That's our hope, really taking the phrase "embed the world," I think, to the next level. And that is really our vision for Milvus within the Milvus community: that we're able to take something from a small-scale vector index, something that's used to search over maybe a hundred thousand or a million elements, and bring that into very much a fully managed service, something that you can scale to infinity and beyond, quite literally.

Eric Anderson: Yeah. And maybe while we're at it, I think Zilliz has a couple other open source projects. Is it worth mentioning them? Are they related to Milvus in any way?

Frank Liu: Yeah. I will mention them briefly. Zilliz as a company, we're all about the vector database ecosystem. And Milvus, I would say, is very much the core open source project.
But I think if you want to be able to build an application that utilizes AI, you need more than just a database, right? You also need ETL. You also probably want visualization tools to see what's going on within the database. You want a management console. And really, I will mention three other projects which tie in very well with Milvus, but I won't spend too much time on them.

Frank Liu: There is something called Towhee, which I'm involved in a lot more. That project is all about ETL for unstructured data. We have another one called Attu, which is a Milvus management GUI, and we have another fairly new one called Feder. That one is meant to help you visualize a lot of these vector indexes, let's say HNSW, or, I won't go too much into the details of those. But we have these projects which are meant to complement Milvus and help turn it into a full vector database ecosystem rather than something that's purely there for storage.

Eric Anderson: Awesome. And then if we've covered the history, Frank, what are the things that users are doing with Milvus today? I imagine you have some interesting use cases.

Frank Liu: Oh yeah. It's funny that you bring that up. We had logo search. So we had a user that was trying to recognize brands. They were using Milvus there. We had, as I mentioned earlier, AI drug discovery. There were these slightly more niche use cases that don't necessarily 100% utilize machine learning. I saw a recent project where we had a user who would turn crypto prices into a time series embedding to try to see if they could predict future prices of Bitcoin, Ethereum.

Eric Anderson: In that example, to make sure I understand it, there might be a certain time series pattern that would be captured as an embedding. And then if that pattern were going to occur, you'd be able to see in the similarity search that it was reoccurring. And then, you could potentially forecast where it's going based on where past patterns ended up.
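
The pattern-matching idea Eric summarizes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the user's project and not Milvus code: the "embedding" here is simply a window of relative price changes, the prices are invented and assumed positive, and both function names are hypothetical.

```python
import math

def window_embedding(prices, start, size):
    """Embed a window as `size` step-to-step relative price changes."""
    window = prices[start:start + size + 1]
    return [(b - a) / a for a, b in zip(window, window[1:])]

def closest_past_pattern(prices, size):
    """Compare the most recent window against earlier windows whose
    price changes do not overlap it.

    Returns (start index of the best match, distance, relative change
    that followed that match), or None if there is no earlier window.
    """
    latest_start = len(prices) - size - 1
    if latest_start < size:
        return None
    query = window_embedding(prices, latest_start, size)
    best = None
    for start in range(0, latest_start - size + 1):
        dist = math.dist(query, window_embedding(prices, start, size))
        if best is None or dist < best[1]:
            end = start + size
            followed_by = (prices[end + 1] - prices[end]) / prices[end]
            best = (start, dist, followed_by)
    return best

# Invented prices: the last three values repeat the shape of the first three.
prices = [100, 102, 101, 105, 110, 108, 100, 102, 101]
match = closest_past_pattern(prices, size=2)
```

In this toy series the latest window matches the very first one, and what followed that earlier window was a rise of about 4%. Whether past patterns say anything about future prices is a separate question; the sketch only shows the similarity search step.
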
Frank Liu: Yeah, well, no, it was a really, really interesting use case. And as the developer is this... would it be possible to apply it to, let's say, the stock market? But I think that ended up being a little bit out there for us. But on that topic as well, there's also text search, paragraph search, sentence search. Text embeddings are a huge component of our user base. We also have video de-duplication.

Frank Liu: So you see on a lot of platforms, let's say TikTok or YouTube Shorts, they want to be able to find duplicate short videos, turning those short videos into embeddings and seeing if there are any embeddings whose distance is underneath the threshold to be considered a duplicate video. That's another use case that one of our users was looking at. There's the malware one, detecting these executables that might be trying to steal data.

Frank Liu: That's just, I think, a small number of the ones that I can list off the top of my head. We also have users that build QA systems. So you can imagine embedding questions and answers into the same space. I was actually playing around with Siri the other day, and you've seen The Matrix, right?

Eric Anderson: Yeah.

Frank Liu: And I was asking, should I take the red pill or the blue pill? But I was asking this in a lot of different ways. And Siri was still able to answer with a very Matrix-esque answer. And I think that really shows the power of being able to turn a lot of this context into embeddings and just be able to search across those inside of a vector database.

Eric Anderson: Because your question about the red pill or the blue pill might generate a vector that would be associated with the movie The Matrix, or maybe it generates a vector that's associated with, like, pharmaceutical companies or certain drugs, red or blue?

Frank Liu: Yeah. No, exactly. I don't claim to know how Siri works on Apple's side, but it is one of the potential applications for vector databases, right?
And I think there are more and more users coming to us every day. We had another one where they were trying to do music recognition. So something similar to what Shazam does.

Eric Anderson: In all of these, it seems like determining similarity is the thing. You describe it as search. And sometimes search means there's a very specific thing I want to find, and I use search in order to see a list of similar things so I can identify the one thing. I guess what I'm saying is that a recommendation engine is like, I want to discover something I don't know yet, but it's similar to this.

Frank Liu: Yeah.

Eric Anderson: And then search for some people means I want to find this book.

Frank Liu: Yeah.

Eric Anderson: And I know what it is before I go into it, but I want you to help me. It seems like we've heard more of the recommendation engine style things, but I suppose this could be useful. No, I guess in Shazam, you're looking for a specific thing, aren't you?

Frank Liu: That's actually a great point. I'll talk about one of the key features that Milvus tries to implement, which is tunable consistency, right? So we have a lot of applications where we want very much exact matches, but at the cost of maybe a little bit of extra run time, and then we have other ones where they want really, really fast query speeds, but perhaps at the cost of a little bit of accuracy.

Frank Liu: And that's what I mean by tunable consistency, where you can have these exact matches with a long run time. If I have, let's say, an input image or an input piece of text, I want to be able to find something within the database that matches exactly what that input image or text is, or the closest neighbors to that. That is one option that we have in Milvus.

Frank Liu: And another one is if you want to be able to have maybe, quote, unquote, "less accurate results," but you want to be able to have your user base... you're developing an application.
You want to be able to have your user base see more of the potential pieces of unstructured data out there. We have these knobs within Milvus that you can turn to be able to fulfill your specific application needs.

Frank Liu: And it's a fairly new concept within vector databases that we're really trying to push forward within Milvus as well. So I hope that makes sense, that whole concept of tunable consistency.

Eric Anderson: Yeah. Good. So I feel like we've covered the use cases a bit and the history. What's Milvus, or even the company Zilliz, up to these days? What are the big features or challenges that the project's tackling, and what do we have to look forward to?

Frank Liu: Yeah. So Milvus these days, we are looking to integrate... I think we recently released a very cool feature called Time Travel, which is where you can restore the database to a previous state at a previous point in time. We're planning on releasing GPU support very soon, so you'll be able to really speed up the search process if you have machines that have GPUs in them. We're continuing to iron out a lot of the bugs, continuing to add some features that improve the entire user experience.

Frank Liu: And as you mentioned, on the Zilliz side, we're looking to have a managed service that utilizes Milvus along with some of the other open source projects that I mentioned earlier as well, Towhee, Attu, Feder. We want to try to really integrate all of those and make a complete vector database ecosystem, and have our users be able to, very much similar to what you can do with Snowflake or some of these other cutting-edge database companies, spin up a Milvus instance very quickly and just get up and running developing their application.
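
The trade-off Frank describes a little earlier, exact matches at the cost of run time versus faster queries at the cost of some accuracy, can be illustrated with a toy partitioned index in Python. This is a sketch under stated assumptions, not Milvus's implementation or API: the centroids are fixed by hand rather than learned, the data is invented, and the function names are hypothetical. Only the knob name `nprobe` is borrowed from the search parameter commonly used with IVF-style indexes.

```python
import math

def build_buckets(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        closest = min(range(len(centroids)),
                      key=lambda i: math.dist(vec, centroids[i]))
        buckets[closest].append(vid)
    return buckets

def search(query, vectors, centroids, buckets, nprobe, k=1):
    """Scan only the nprobe buckets whose centroids are nearest the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in buckets[i]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]

# Invented 2-D data; "a" sits near the boundary between the two buckets.
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = {"a": [4.9, 0.0], "b": [1.0, 0.0], "c": [9.0, 0.0]}
buckets = build_buckets(vectors, centroids)

query = [5.2, 0.0]
fast = search(query, vectors, centroids, buckets, nprobe=1)   # one bucket
exact = search(query, vectors, centroids, buckets, nprobe=2)  # all buckets
```

With `nprobe=1` the search looks only in the bucket nearest the query and returns `c`, missing the true nearest neighbor `a`, which sits just across the partition boundary. With `nprobe=2` every bucket is scanned, so the result is exact but more vectors are compared.
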
Eric Anderson: One question before we wrap up, which I guess takes us in some ways back to the beginning, but maybe just to illustrate the value of a vector database: why can't I just put vectors in Postgres, or in Cassandra, or any database? What is special about vector databases?

Frank Liu: That's a great question. You'll see a lot of traditional databases or, like, traditional search solutions also integrating these vector plugins. So ClickHouse, and I think Elasticsearch, I think they both now have vector plugins. It's not limited to those two services, but those are really traditional database or traditional search system models that are leveraging their existing architecture. They're expanding it to support vectors.

Frank Liu: Whereas for Milvus in particular, as a community, we've designed Milvus from the ground up to be a fully managed vector search solution. So on top of improvements in performance, on top of cloud nativity, there are also aspects of Milvus that allow it to really give the user a great amount of flexibility. So we have different types of vector indexes that you can select from: standard IVF, or some of the graph-based ones.

Frank Liu: We plan to implement Google's ScaNN in there as well, Scalable Nearest Neighbors. And really, Milvus is meant to be something that is centered around the idea of vector search, of these embeddings, and built on top of that, rather than something that is meant for structured data, I'll say a traditional database, and extended to support vector search.

Eric Anderson: Makes sense. And a speculation for us before you go, Frank: does this replace a lot of traditional search? So, I mean, we've used Elasticsearch or things like it for a while to do traditional searching of text. Will we ever do that in the future? Or is vector search the way to do things like text?

Frank Liu: I would say, from my perspective, the future is a combination of both of these search solutions.
So there's always going to be tons of structured data out there, but I think as time goes on, and I think we've all been able to see this in the past decade, you have this flood of human-generated data. Again, text, videos. And the proportion of unstructured data in the world today is only going to be growing more and more.

Frank Liu: I think I read a statistic somewhere that says, like, "80% of the data generated today is going to be unstructured." And to be able to search across that is something that is going to be very important for a lot of applications. But that doesn't mean that a vector database is meant to replace a lot of traditional ones. That doesn't mean that we hope to be your one-stop, full-on database solution. I think a lot of these traditional databases and vector databases are meant to work side by side, hand in hand, to help users develop these applications, to help us understand and search across the world, so to speak, a little bit better.

Eric Anderson: Awesome. Anything, Frank, you also wanted to add about the project, or anything that we didn't get to talk about yet?

Frank Liu: I just want to say, I know there's a lot of information that I put out there. If any folks listening today are interested in coming out and checking out a little bit more about Milvus or seeing some of the applications that you can develop using Milvus, we encourage everybody to go to our GitHub, milvus-io, and yeah, we'd love to chat more with some of the folks out there.

Eric Anderson: Frank, thanks so much for coming today, and pass our gratitude on to the Milvus team. It's quite a project and we're all excited to be able to enjoy it.

Frank Liu: Absolutely. Thanks so much for having me on today, Eric.

Eric Anderson: You can subscribe to the podcast and check out our community Slack and newsletter at contributor.fyi.
If you like the show, please leave a rating and review on Apple Podcasts, Spotify, or wherever you get your podcasts. Until next time, I'm Eric Anderson, and this has been Contributor.