Milos Rusic: Technology like this must be somehow accessible. And in order for people to really trust it, we felt it's beneficial if it's simply transparent: people know how it works, people know what happens. We want it to be open and really establish it as something that everyone trusts and everyone wants and can use.
Eric Anderson: This is Contributor, a podcast telling the stories behind the best open-source projects and the communities that make them. I'm Eric Anderson. Well, we're excited to have Milos from Haystack on the show with us today. Milos, thanks for coming.
Milos Rusic: Thanks, Eric.
Eric Anderson: Haystack is an interesting project. We've talked a little bit about it and I'm curious, Milos, how you would describe it to people.
Milos Rusic: When you think about all of these new capabilities that are actually quite hyped right now, everything we can see around, for example, ChatGPT, natural language processing, then we always talk about models as one essential component, of course, of these powerful natural language processing capabilities. But usually, to really have a full application that supports this NLP functionality, what you need is many more components, probably more models, you will need more functionalities. You will need a separate way to store your data for this NLP functionality. And all of these components, this is part of Haystack. And with Haystack, you can not only access the components, you can also combine them into, let's say, an application or a service or, as we call it in the Haystack world, a pipeline.
Eric Anderson: Got it. So Haystack is an aggregation of a half dozen components or maybe more that you can mix and match to create a pipeline or application. And generally, these are all requisites in order to produce a natural language-aware application.
Milos Rusic: Exactly, exactly. That's what it is. And it's not just half a dozen, actually. I actually don't even know right now how many there are. But Haystack as a standard, so to say, also allows developers, in case there's something that needs to be a bit more custom, to also develop custom functions that are then part of their applications. They can then be reused and they usually fit into this, let's say, pipeline logic, so into the way you glue together these different components.
Eric Anderson: And maybe just to ground this abstract idea in a couple of examples. If I wanted to build an application that could search documents using AI, I need a document store to put the documents I want searched in.
Milos Rusic: Exactly.
Eric Anderson: Haystack can help me with that or at least give me the primitives to turn a common store like Elasticsearch into a proper document store for these purposes.
Milos Rusic: Yeah. Think about it this way: when you want to add, let's say, question-based search, you really want people to be able to search on your website the way they're searching on Google. They ask a question, for example, I don't know, "In which stages does Scale actually invest, or in which areas?" And then you want the answer highlighted on your webpage. Well, the thing is, one part of the solution is, of course, the so-called question-answering model you would need, but the question-answering model itself is probably not enough, because you will need to feed it with your data. This model hasn't been trained on your data; it probably still needs to find the right curated information in your data.
And this is when you can, for example, pick something like Elasticsearch as a document store. Now, the problem is that the way your website is designed and the way you write it into Elasticsearch, this also has to be, let's say, a bit special, because NLP models have certain requirements for how they deal with text data. This means we need a way from your website to Elasticsearch where the original text data is converted, preprocessed, prepared for an NLP model. And then comes the next thing: a website probably consists of quite a bit of text, so it might make sense to not just have one question-answering model on top of all the data, because that would probably be a bit too slow. You maybe want to add some kind of a filter. This means you don't want this question-answering model to be fed with all your website information. Because if people are asking about the investment stages, then it's probably not relevant that the model is searching through, let's say, the team page. And at that moment we can add another machine learning model, for example, a so-called retriever model that is then chained, so to say, in between the document store and this question-answering model. And now you see from my long explanation how many steps it takes until we can present the answer to this question to someone on your website. Now, all of these components, this is something you have in Haystack, you put them together in Haystack, and you can choose whatever you want to use for each component, whichever technology. We were talking about Elasticsearch as the document store. If you feel like for your use case it might make sense to use something like Milvus, then you can also simply switch to Milvus. Same for models. If you are using, let's say, a model from the Hugging Face model hub, you can pull it in. If you want some generative capability, you can also simply pull in something like GPT-3.
Eric Anderson: Fantastic. Exciting. This is certainly the way things will be built in the future. How did you get into this? Or maybe more properly, how did Haystack come to be?
Milos Rusic: Actually by building before having it, so to say. We started deepset as a company back in 2018, it was June 2018, and this was before we had transformer models. Today when we look at ChatGPT, GPT-3, everything that we have on Hugging Face, there's always this T for transformer. Transformers came to life, so to say, at the end of 2018. Back then, Google released BERT, Hugging Face started the Transformers library. And we were early contributors to Hugging Face's Transformers at that time and we used transformer models to build NLP applications for enterprises already. Organizations like, for example, Airbus, big companies. And I remember we had this one customer and we wanted to actually build a question-based search engine the way I described it to you. And then we realized that, oh, that's all a bit cumbersome, because wait, we first of all need to take care of the 1,000 documents they're having. Okay, that's a bit difficult. We need a document store. And then we learned this is not going to scale. We need this filter in between. And then we deployed this one pipeline, and then we had the following problem: we had decided on one model, but we felt like, well, maybe this other model is going to perform better, it would be great to change this model. But this then took us two days until we had picked a new model; we had to re-implement everything.
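As a concrete illustration of the retriever-reader question-answering pipeline Milos describes above, here is a minimal sketch using the Haystack 1.x API; the Elasticsearch host, index name, and reader model are illustrative assumptions, not details from the conversation.

```python
# Minimal sketch of a retriever-reader (extractive QA) pipeline with Haystack 1.x.
# Host, index, and model names are illustrative assumptions.
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Preprocessed website text is assumed to already be indexed in Elasticsearch.
document_store = ElasticsearchDocumentStore(host="localhost", index="website_pages")

# The retriever is the "filter": it narrows thousands of documents down to a few candidates.
retriever = BM25Retriever(document_store=document_store)

# The reader is the question-answering model that extracts the exact answer span.
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

result = pipeline.run(
    query="In which stages does the fund invest?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)
for answer in result["answers"]:
    print(answer.answer, "|", answer.context)
```

Swapping Elasticsearch for Milvus, or the reader for a different model from the Hugging Face model hub, amounts to replacing the corresponding component while the rest of the pipeline stays the same.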
And then we thought that actually, first of all, we need more than just a model to really deliver these applications at scale. And we also would probably like to try out more things, because natural language processing is still machine learning, and machine learning is pretty much based on the idea of experimentation. And this was when we felt that it really makes sense to have an abstraction just for our own work where we can simply think about the functionalities we want to put together and then pull the technologies in as needed. And this is actually how Haystack was born. So we built the first version of Haystack just for us; it was a retriever-reader pipeline back then. It was very simple, and that's also where the name comes from, because we had a lot of demand for search and question-answering applications at that point in time. So we said, "Okay, then let's give it a name that somehow refers to finding information, something like finding the needle in the haystack," and that's actually how this happened. Yeah, I would say we were dogfooding our own project before we had it. One anecdote maybe: when we deployed, when we thought, in the beginning, we'd just take a model and that works, and we just send 1,000 documents to this model, what we realized was that it's taking very long. We had a demo for the customer and we couldn't really demo real-time queries because a query was taking, I think, one minute. So it was absolutely not practical to leave it that way. And then we said, "Okay, we need another model in between to filter out irrelevant files such that this scales a bit and gets a bit faster." This was pretty much how Haystack was born.
Eric Anderson: I can imagine going through that process and wondering if this is going to be the standard model for how people build applications or if we're just kind of hacking things together and making it work. What maybe led you to have some confidence that this was the way everyone would build pipelines like this?
Milos Rusic: That's maybe now a bit bullish or whatever, but at that point-
Eric Anderson: Yeah, premature.
Milos Rusic: At that point in time, this was a very early technology. We're talking about the end of 2019 when we started the work on Haystack. This was maybe one year after Google's BERT, the first-ever transformer of its kind, was released. And at that point in time, there weren't many people who were actually building or who were so familiar with building this. We were one of the few around at that point in time. And in the end, I think this is how standards are created. Some very early adopters are throwing themselves at a technology the way we did, are working with it, really trying to industrialize it. We were working against hard requirements. We had to build enterprise-grade software, and this is why we felt like, okay, we were early in this, maybe we're not the smartest people around, but we're also not the dumbest people around. Probably what we thought about here makes sense. And then, to be honest, once it was released, you can see if it makes sense, because then you see if you can grow a community around it, and this is what happened. In May 2020, we released it and then we saw actually how people adopted it. This was a time we didn't have big marketing spend or anything. People simply found it. People used it, people improved it, gave feedback, appreciated it.
And that helped us, of course, also to make sure that through 2020 and until today, everything is the way it should be, because if you give it away and the community starts to build, then you have many builders and then you have many opinions and a lot of feedback, and then you can make sure that you are really converging to the way people want to build NLP applications at scale.
Eric Anderson: I think you mentioned that you started doing this before transformer models were available. This was June 2018. What was the plan for producing results then? Did transformer models surprise you and change your approach?
Milos Rusic: Let me say it like this. When we founded deepset back then, the future of the world was obvious to us. It was obvious that something like what we see today with, for example, ChatGPT, that this is going to be possible and this is going to become an essential part of each and every product. And around that time, we had the hypothesis that something like a transformer model, transfer learning in NLP, is going to be around the corner, so we were expecting something. At that point in time, we were using, let's say, the older, more traditional methods of NLP that required way more training. But the idea when we started was pretty much to build, to learn what is needed to build, and to hope that some technological tailwind and technological revolution slash evolution is going to happen. And then by the end of 2018, half a year after we started, we were lucky enough to see that this happened. And then we thought, okay, great. Hypothesis became reality. That's great.
Eric Anderson: And at some point, you decided to open-source Haystack, presumably in maybe 2019. What was the thought process there? Was that clearly the plan as soon as you started working on it or was there some evolution in thinking?
Milos Rusic: Honestly, there was a rationale, but it was always clear that something like this should be open source. When we started the company, we only used open-source technologies. We were contributors to Hugging Face Transformers around that time. We open-sourced models before we open-sourced Haystack as well. And for us, it was very obvious that technology like this must be somehow accessible. And in order for people to really trust it, we felt it's beneficial if it's simply transparent: people know how it works, people know what happens. This is why it was always intuitive for us. Of course, the bigger rationale behind it was also that we felt like there is a space to create a standard in the way applications are built, and standards are always somehow open source in IT. When we think about databases, it's usually open-source databases that are the ones that establish a standard in the way people are solving certain problems and which databases they're using for certain problems. We had this aspiration for sure, and this is why it was somehow clear that we don't want this to be a proprietary service or proprietary platform. We want it to be open and really establish it as something that everyone trusts and everyone wants and can use.
Eric Anderson: The response has been extremely positive. There are over 6,000 GitHub stars. You've got a growing Discord community. Maybe you can highlight some of the things that got you there. As you released this open source, what were the first users and use cases and how did things spread?
Milos Rusic: In the beginning, I remember that...
The first use cases, we tailored it a lot around search, question answering, and Haystack enables developers to be very pragmatic. You can get started very fast, you can take a template pipeline and you have a template application. The early users I remember were very diverse, and it feels still so diverse even today when we look into the communities. I remember we had some practitioners right away in the project. There was one contributor I remember from Netflix who shaped the project a lot in the beginning. We also had some researchers, of course, who were helping to keep it a bit state of the art. That's also important. The one thing is it needs to be good software. The other thing is, of course, how do you really ensure that the technologies you're providing, especially in a fast-moving field like NLP, are always, let's say, up to the latest stuff? And then when you ask me how did this happen, how did this adoption go forward? I'm a strong believer that marketing is not so important for an open-source project. I think that for good projects, the quality of the project and the quality of the software you write, the developer experience, is what actually matters. And this is what people simply appreciate, and this is what has a very sustainable effect. And we always focused a lot on this practicability. We were rethinking our primitives from time to time, where we thought, okay, is this now really easily usable, or is Haystack already a science in itself, so to say, and actually you now need its own handbook just for using it? This was always something we took a lot of care about and focused on. And then at some point in time also, I think, you get this word of mouth. So two people talk to each other, and when people ask, "Oh, I have this problem. What are you using?" then people are talking about Haystack. And we saw that happening by the end of 2020, because this was also when our inbound from the Haystack community was growing. Many enterprises reached out to us because they heard of it or used it. Many startups were starting to use Haystack, were talking about it openly. We saw then also, again, researchers from the community putting it into talks they were giving, into research papers. At some point in time, after you have these early adopters, it amplifies a little bit. People start talking about it, and this word of mouth is then pretty much what helps to at least grow the awareness and the interest. I would say, to maintain the community and to really grow the community, to get contributions, to get people to use it, you constantly need to work on the project, to reinvent yourself. Also, the latest advancements with transformer models that we are seeing are something that we simply need to be aware of, and we need to adapt Haystack to these changes and advancements that are happening in order to really be a valuable framework for people who want to build NLP applications. Because people want to use the latest stuff, and people should use the latest stuff if it's good stuff. And we always need to take care of this and adapt it constantly. You cannot stop developing it further. People will get bored, and once your community gets bored, your word of mouth becomes less. And then I would expect to probably also see way less usage and adoption.
Eric Anderson: Switching gears a bit to the industry, NLP has exploded.
Well, I guess maybe ChatGPT really has captured the attention of everybody, not just developers, but it's in the news and that sort of thing. But you've thought a lot about what the applications for NLP can be in businesses. Maybe you could walk us through what your customers are trying to do with deepset. I imagine you've thought a lot more about the use cases for NLP in business than many of us have.
Milos Rusic: I think in the end, one of the biggest, let's say, umbrella use cases, so to say, is simply information access via the most natural way humans are interacting anyway, which is language. Language is a very powerful protocol. You and I are using language; it's the best way for me to express things. We can express so much, in every little detail, so to say. And that's what makes it so powerful. And I think we have a lot of information around in all kinds of formats. Also in images that we can explain with language, or in text, of course, or even in structured forms like SQL databases. And I think there's this idea that all of this information, these terabytes that are in organizations, becomes easily accessible to everyone. If I want to know, I don't know, what is the best way to solve a certain problem, and this information is around in my organization or my company, that I can really be sure I find this information and I find it fast, because usually it would take me forever to go through the whole index; even to go through two documents would be too much of an effort, or actually would simply be a big effort. And I think this information accessibility in a very natural, easy, fast way, that's a big thing and a big promise. And ChatGPT is somehow that, because you ask the question and, out of, let's say, the knowledge it has, you get the information you want. You can also create things, of course, but it's mostly information access. That's, I think, the big topic. And, of course, that has different flavors. That can be a classic way of searching files, that can be a way to extract information. For example, in financial services, we know many banks who are Haystack users and also commercial customers of deepset who are using NLP tools in risk management processes to extract core information. They want to understand, for example, for a certain company, what are the risks this company is exposed to. What is the outlook for their business? What do they maybe project in revenues for the next year? And then you have many sources on these companies: websites, company reports, proprietary reports, public reports, all of that stuff. All of this is easily accessible, very fast, and helps risk analysts make a thorough, quick decision and get a full picture of a company. Knowledge management: a lot of effort has been made to consolidate data lakes, lakehouses, data warehouses, whatever. Wikis, I think there are so many wikis around. Every good startup uses Notion. Every good enterprise probably uses something like Confluence or SharePoint. There's so much knowledge in there, probably years, decades of knowledge, project reports, best practices. All of this is now really accessible, and you simply plug in an NLP pipeline and you give the power to everyone to simply ask. I simply want to know how do I solve a turbine failure that I have on my Airbus machine. What's the best fix? Or what proved to be the best thing to do when a customer is facing a certain bug in our software? Or, I don't know, in selling:
Which materials should I send over when a customer has security questions, or whatever. So all of this.
Eric Anderson: Yeah. Well, one of the things I've been trying to figure out is how these natural language applications gain access to all the necessary data in an organization. You've talked about bundling up PDFs, makes perfect sense. They're in English or something, a language the models can understand. But what about some of the SaaS applications? You mentioned a few, but their data is still in English, but Salesforce records. Will these natural language models understand what's in my Salesforce? And how do they get that information and make sense of it if it's more structured data as opposed to text?
Milos Rusic: It works. You already see how, for example, you can also perform natural language queries, at least on simple tables. That's actually working quite well already. If you have tables and documents and you want to know what's the revenue projection for the year 2024, then these models are usually already performing very well. For SQL, or even for something like graph databases, which you could argue is even more complex than a SQL database, even there you see that natural language processing is working. There are different approaches to it. For example, what you can do is, of course, models can translate, so to say, a natural language query into an SQL query. That's almost like a translator that translates from one language to the other. This is one approach of doing it. The world of NLP is built on the idea of embedding vectors. Everything is somehow put into a characteristic vector. And then you're looking for these vectors. And of course, there are also approaches to take structured data and somehow vectorize it and then translate, so to say, language, images, structured data, everything into one semantic space where it's represented by vectors. And this is what we're seeing, for example, with multimodal models. Stability AI, for example, where you use language and get an image out. This is already something like, hey, I use one way to describe something and I get a description in another form out. This means we have two different modes of data, so to say, in this case both unstructured, but with structured data it could be the same. And they are somehow in the same semantic space. And we are seeing it, we're seeing it works, and this is only going to improve over the next years.
Eric Anderson: And what about ChatGPT, the chat element, this idea that the model isn't just a single request-response, but now the API maintains a session and that session has state. Increasingly, it almost seems like rather than the application adding a model via API, it's almost like the application is ChatGPT and what we call the application is really just the API that accesses the stateful session. Does this chat idea, a session, break the model for the way people are building NLP applications?
Milos Rusic: Let's say it like this. The question I'm always asking myself is, with ChatGPT or whatever, machines are machines, and I want information from them or I want to create something with them. I don't necessarily want to chat with them. I want to chat with humans. I want to have a conversation with you; that's probably more engaging for me. Where it is useful, of course, is... And this is the great thing also about something like ChatGPT: it narrows down the space.
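To make the embedding-vector idea Milos described a moment ago concrete, here is a small sketch using the sentence-transformers library; the model name and example texts are illustrative assumptions, and any embedding model would work the same way.

```python
# Texts are mapped to characteristic vectors in one semantic space,
# and search becomes a nearest-vector lookup via cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

documents = [
    "Our revenue projection for 2024 is 120 million euros.",
    "The team page lists all employees and their roles.",
    "Risk factors include supply chain disruptions and currency exposure.",
]
query = "What revenue do we expect next year?"

# Encode documents and query into the same vector space.
doc_vectors = model.encode(documents, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

# Rank documents by how close their vectors are to the query vector.
scores = util.cos_sim(query_vector, doc_vectors)[0]
best = scores.argmax().item()
print(documents[best], float(scores[best]))
```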
If you can ask it, for example, "Oh, what do you mean by this," or it is helping me again as a user connect to-
Eric Anderson: Yeah, refine the initial query. Yeah.
Milos Rusic: Exactly, exactly. To navigate me a little bit, that's for sure a great thing. I think the big opportunity in the end is when something like a model with the capabilities of ChatGPT can really be used on your own data. If you ask it today about your SWIFT number, it'll probably give you a response, I hope. I hope it's not your SWIFT number, because otherwise it would be publicly available. The moment we can actually make your SWIFT number accessible in this model, maybe even just for you, that's actually when I would say another big thing is happening. There's this idea of retrieval augmentation. It's already possible today. In other scenarios, also generative scenarios, you can already build these generative models that are, let's say, retrieval augmented, where you also know, for example, what the source was: where does this information come from? And that's also very good because it protects you a bit from this hallucination, so you don't run the risk that there's information you cannot trust or that is maybe even completely made up, because everything is based on actual data that you can name, where you know where it is, which file, and you can read the whole source document. I think that's the big next step. Or actually, it's already possible today. And I think the big next thing is then having ChatGPT, full ChatGPT capabilities, on that kind of data. Again, it is possible already today with other models; you can build something similar in Haystack. And that being said, I think this is probably the way forward for how to build applications. So really having these conversational experiences on your own data, that's, I think, essential.
Eric Anderson: Yeah. I suppose we're talking too much about ChatGPT, but presumably, they have built a pipeline around GPT-3.5 or 4, whatever they're using as their model, like Haystack, in order to serve up the application. And it's trained on the internet. That's their document store, presumably.
Milos Rusic: Exactly.
Eric Anderson: So you're imagining a world where every application or person can have a ChatGPT based on some repository of information. There's going to be some collection of folks who give us models, give us bots, and tell us to bring our data to them. And there's going to be people who already own data and they'll expose bots on their app. Basically, I'll either ask my bank for my SWIFT number or I'll just ask my smart assistant for my SWIFT number. And we're not quite sure which model will be most prevalent at this point.
Milos Rusic: Probably it's going to be a little bit of both, I would expect, I think, because you probably have a personal finance app, which is a third-party application, and you still have your online banking. I'm usually not very dogmatic. I use tools completely randomly. Sometimes I go into my online banking and sometimes I go into my personal finance app, whatever's easier at the moment, so to say. I think for this, there's probably a coexistence of some features. I would really expect these capabilities in a few years... I'm not making any real prediction. A few years, whatever that is, two years, five years, 10 years, maybe not more than 10, I would say, there's going to be an omnipresence of these capabilities. I was just talking about my daughter who was recently born.
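As a library-agnostic sketch of the retrieval-augmentation pattern described above: retrieved passages from your own data are placed into the prompt, and the answer is returned together with its sources. The `search_documents` and `call_generative_model` parameters are hypothetical placeholders for whichever retriever and generative model are plugged in.

```python
# Sketch of retrieval-augmented generation with source attribution.
# `search_documents` and `call_generative_model` are hypothetical stand-ins,
# not functions from any particular library.

def retrieval_augmented_answer(question: str, search_documents, call_generative_model) -> dict:
    # 1. Retrieve a handful of relevant passages from your own data,
    #    e.g. [{"text": "...", "source": "handbook.pdf"}, ...].
    passages = search_documents(question, top_k=3)

    # 2. Put the retrieved passages into the prompt so the model answers from them,
    #    not only from whatever it memorized during training.
    context = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered passages below, "
        "and cite the passage numbers you used.\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    answer = call_generative_model(prompt)

    # 3. Return the answer together with its sources, so the user can open the
    #    original file instead of having to trust an unsupported claim.
    return {"answer": answer, "sources": [p["source"] for p in passages]}
```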
I think for her, when she's in her teenage years, it's going to be very normal to interact with any kind of software just by language.
Eric Anderson: It's wild. And today it's mostly typed, but it certainly seems like it has to become a verbal, audio thing sooner rather than later.
Milos Rusic: Yes. Yeah, probably.
Eric Anderson: Milos, this is fantastic. We've learned a lot today about Haystack. Where is the project going next and how can folks who are listening and getting excited about this get involved?
Milos Rusic: Of course, the whole generative advancements, that's something that we support even more than before. This was always a part of Haystack. We're now focusing even more on generative capabilities. We just released some new functionalities there, some new integrations. That's, of course, something where we're really looking forward to anyone who wants to use it, but also, of course, to contribute to it. And outside of that, NLP is not just models; it's databases, document stores, it's data preparation. There are a lot of things that need to be in place in the pipeline. And also those people who are maybe not so much into NLP per se, into models, can find a home with Haystack and also contribute. Yeah, we're happy about everyone who makes the project better.
Eric Anderson: Awesome. Anything we covered or didn't cover you'd want to mention here?
Milos Rusic: From my standpoint, quite comprehensive. We had quite some things about ChatGPT, but I think it's going to be somehow cut and edited anyway.
Eric Anderson: Yeah.
Milos Rusic: Yeah, we're not making an OpenAI marketing show here. But yeah, that was good. It was fun.
Eric Anderson: Fantastic. Thanks for all you're doing. I think it's fun that these things are open source, everybody can... I think we saw with Stability just the power that the open-source community has in advancing what's going on here. It's quite a contribution you've made similarly to bring NLP to the people. So thank you.
Milos Rusic: Thank you. Thanks a lot.
Eric Anderson: You can subscribe to the podcast and check out our community Slack and newsletter at contributor.fyi. If you like the show, please leave a rating and review on Apple Podcasts, Spotify, or wherever you get your podcasts. Until next time, I'm Eric Anderson, and this has been Contributor.