Adam Carrigan: It was sort of no fireworks, and I guess that's how most open source projects start, right? They start as a bit of a non-event, side project. And then people find them useful, and they interact with that project and the founders, and you then realize actually, we've got something on our hands. Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects, and the communities that make them. I'm Eric Anderson. Eric Anderson: All right, we're here today with the co-founders of MindsDB, Jorge Torres and Adam Carrington. Good to have you both on the show. Adam Carrigan: It's Carrigan, sorry. Eric Anderson: I'm sorry, Adam Carrigan. Adam Carrigan: I was going to leave it, but Carrigan. Eric Anderson: Off to a good start. Jorge, Adam, thanks for being here. Adam Carrigan: No worries. Jorge Torres: Thank you for inviting us, Eric. Eric Anderson: So, I was intrigued. I met Jorge recently, and the concept behind MindsDB is certainly clever and interesting. Explain for us quickly, what MindsDB is. Jorge Torres: Yeah, MindsDB essentially allows you to do machine learning in the database. Mostly because you already have a lot of data in your database, it doesn't make sense to do it elsewhere. Whatever you want to predict, it's also coming from the database. Eric Anderson: And interestingly, it does so via, like, a SQL table, this kind of SQL view artifact. You don't learn a new programming language or new API, it just acts like another table in your database. Is that right? Jorge Torres: Yeah, that's exactly it. Adam Carrigan: Yeah, we speak a standard wire protocol, so it looks just like an existing database, and you can connect to it like you normally would a database. Eric Anderson: Awesome. And how did you get to this point? What's the story behind MindsDB? And maybe part of that story is how the two of you came together. I don't know if one came before the other. Adam Carrigan: Yeah.
So, I'll take the first part and talk about how we came together, and then I'll hand over to Jorge, and he can talk about the start of the company. So, Jorge and I have been friends for many years, I think probably it's coming on to 10 years now. We met in Australia, actually, I was living there and decided to do a master's, as did Jorge. We ended up living together in college, became really good friends, worked together on a few projects, became drinking buddies, and just generally hung out. And I was studying economics and finance at the time, and he was doing computer science, specializing in machine learning and AI, or the early versions. Adam Carrigan: And I then went on to Cambridge to do my dissertation in natural language processing, and helping predict equity market movements, and Jorge started working for a number of very successful startups. So, he was one of the first engineers at CouchSurfing, Skillshare, and has had a lot of experience in that area. I went into consulting for a few years, and then realized that consulting was not for me. And Jorge dragged me out and introduced me to the world of startups, and that's where we started our journey. We actually have a previous company together that we decided to close down, and there's probably a whole podcast just on the learnings from a failed startup. But I'll hand it to Jorge, and he can talk a little bit about the rationale behind MindsDB, and why we started it. Jorge Torres: I have a bit to add to that story though. 12 years ago, Adam was pioneering in home automation. And when I met him, he was like, "You should come to my room, I want to show you something." And he was showing me how he could control from his... This is 12 years ago, so Nest and all of this were not in the game. And he could change all of the lights in his room. And then he's like, "And I have this special button," and he would press this button, all the lights will turn pink and music start playing. 
He's like, "This is when I bring dates to my room," so when he showed me that I was like, I need to be this guy's business partner. We're going to build something, and it's going to be fun. Jorge Torres: Anyway, many, many years later, after doing a lot of work on machine learning, we started to realize that people were reinventing the wheel in two verticals: the first one is all the ETL-ing that needs to happen when you have to do machine learning, mostly because you build machine learning applications essentially as an application. And in many cases, the data is already in the database, so they have to do a whole reinvention, have to do ETL-ing, even though databases are really good for that. And mostly because the data scientist or the machine learning engineer starts in a Python notebook, and they have an ideal pandas DataFrame of how they want their data. But then they end up doing a whole bunch of tricks to get to how that data has to be. Jorge Torres: And then the second part that they end up doing, is they have to reinvent a whole bunch of infrastructure to deploy this model. So, they essentially wrap those models in a web service, and this web service has to be consumed, again, to put the predictions wherever you want the predictions to happen. And it just felt that this dance for reinvention could be simplified, if people could see machine learning models as tables in a database. And that's one of the things that we decided to provide. And in that journey as well, we started to understand that there are problems that are really hard when your data is in the database. And essentially, say for instance, you may have a lot of time series problems over and over, and these problems tend to be really hard for machine learning engineers as well. And we decided to automate a lot of the work for getting those particular problems solved.
So, MindsDB allows you to not only deploy models as tables and databases, but also to generate those models in just a few queries. Eric Anderson: And presumably you're generating the models from the data already there in the database? Jorge Torres: Yeah. Eric Anderson: It sounds like you solved this whole slew of problems, which is the primary work around creating models: extracting data, munging it, cleaning it, creating the model, and then you deploy it. And you're saying, I just create this table, describe it alongside, and as long as it depends on the data already in the database, you're good to go? Jorge Torres: That's exactly it. Eric Anderson: Yeah. And we didn't really get to cover all the... Go back to some of the elements in the story. Adam, I have yet to find somebody who sees themselves as a consulting person. Everybody who is in consulting is looking for a way out, or trying to figure out where they want to be. And sometime, you'll have to give us a tour of your current home automation. Adam Carrigan: Yeah. I'll tell you, it's much easier these days with Alexa and Google Home. It was much more difficult back in the day, but yeah, happy to do that. Another episode, maybe. Eric Anderson: So, you had this idea for the model as a table in the database. And how do you go about getting started on such a thing? Is this something you and Adam were writing together, and you just take a crack at it on a weekend? Jorge Torres: Yeah. So, on our previous startup, even though we failed fantastically, we were doing a few things very well. And a lot of those were around automating the work, and augmenting the capacity of the engineers that we had, for machine learning workflows. So, we did learn a great deal about automating machine learning, and essentially not having to build the same castle over and over.
And that's essentially where we understood that even though in that startup, we weren't doing something that had a good market, we understood that the problem that we solved for ourselves was a problem that many people had. And that's essentially how we decided to start this one. Eric Anderson: Got it. And what's the... I was going to ask you about the first line of code, what's the first push? Are you able to construct this thing and test it? Did it have any role in your prior company, or is this a long effort to produce such a prototype? Jorge Torres: Yeah. So, in our previous company, this thing was always a [inaudible 00:07:25] prototype, we were building this thing just as a support engine for what we were working on. So, when we decided to start this company, we started from scratch. We were like, "Okay, this is something that was working for us. It just needs to be actually built for that purpose." We had a long debate whether we were going to open source it or not. At this time, we were in Berkeley doing research for ML automation as well. And Berkeley has this philosophy for open sourcing stuff, that it's contagious, and we just let ourselves be carried away by that. And actually, it was probably the best decision, because it really forced us to not only just start putting a lot of code out there, but to improve this code at a frequency that we wouldn't have otherwise. Eric Anderson: Remind me when open sourcing happened, and how do you prepare for such an event, pushing something to the open? Adam Carrigan: Yeah, I think we really didn't put much thought into it. I mean, we certainly put a lot of thought into whether we should open source or not, because we knew there were many implications after that. But in terms of the actual making our repo on GitHub public, it was a fairly non-event. At the time there were probably three stars: there was myself, there was Jorge, and there was our lead engineer. Jorge Torres: My mom.
Adam Carrigan: Yeah, it was four then. It was four. And that was it, and it was sort of no fireworks. And I guess that's how most open source projects start, right? They start as a bit of a non-event, side project, and then people find them useful and they interact with that project and the founders, and you then realize actually, we've got something on our hands. And so when you look at the growth in stars, we've got 3,500, 3,600 stars now over the past couple of years. And in the first couple of days, there was very little. And then it kind of caught on, and students started to use it through Berkeley, that was our first niche market. Students were finding it useful. Adam Carrigan: And then word of mouth got out, and more and more people joined the project. And before we knew it, we were at a thousand stars in a not too long period of time. And so, we knew we were onto something when we started to get this kind of traction, and the number of people using it, and people contributing, and the stars and the forks. And so yeah, initially non-event, and then as these things do, they snowball and a couple of years later, here we are. Eric Anderson: And are these initial users, are they coming from a world where they've been deploying models the old way, ETL and data munging and pandas, and are they relieved to find that they can just put it in the database? Or are these folks who had never thought about doing ML, they have all this data, and now that the barrier to entry is so low, they give it a go? Adam Carrigan: Yeah. I think when we first started the project, we started with the open source for the ML components. So, it wasn't really, we always had the vision, and credit to Jorge, he always had the vision of bringing machine learning to the data layer. But we started off with the core product, which is the AutoML, making it easier for anybody.
And so, the types of people that were using it initially, as I said, were the students that were experimenting with CSVs and just trying it out. It then caught on to people who knew the benefits of machine learning, but hadn't really... They can't afford data science teams, or if they had them, they were relatively small or overworked. They had ideas like, we've got all this data, because it's kind of been in vogue to collect; even most traditional industries have been collecting data for five or six years, but they're not really doing anything with it. Adam Carrigan: And so, with a tool like MindsDB, at least in the first instance, they were able to build those predictive models. And now as we have spoken to countless users and customers, and now that we've brought it to the data layer, it takes it one step further. These businesses who are perhaps a little bit more sophisticated in their machine learning endeavors, can now also get the benefits. It's not just the AutoML component, it's the whole end to end package that we provide. And so, there's people who are students who have never touched machine learning before, all the way through to very sophisticated organizations, who now can make the process a lot easier than it ever has been. Eric Anderson: Maybe just to ground what you're doing in an example, so imagine I have a bunch of time series events, and maybe these are click views. I don't know what would be a good representative example, say in a thing like ClickHouse, how do I make a model out of those? Jorge Torres: Yeah. So, let's go to the events and let's assume that you're using those events to also forecast if there's going to be demand on a product. Let's say that you want to tie this to inventory. So, you have all the stream of actual clicks, on the different products, and you also have information about what you have in inventory.
So, you have all these tables, and maybe ClickHouse is not your operational database, it's kind of like your analytical database, so you end up dumping everything there. And it's really good, because ClickHouse is very good for you to make the transformations that you need to make. Jorge Torres: So there is no real intention in MindsDB to reinvent that wheel, because ClickHouse does it really, really well. So, you can then aggregate all this information, and then, being the domain expert, you understand what you want to predict, in this case inventory. And you figure that clicks, as well as the history of all of your products, and how the inventory moves, is what you want to use for this prediction. So, then you can aggregate all of that into one single query, then you can tell MindsDB, "Well, from this query, I would like to predict the volume of inventory. And I would like to predict this, of course, for the next day or for the next week." Jorge Torres: And with those simple statements, and again, you can summarize it in simple statements. You can insert into a magic table that we have on databases, called predictors, and you say, "I want to create a new magic table called Inventory Prediction Table, and I want to learn from this query, and the values I want to learn to predict are," let's say, inventory or units in stock. And once you do that, then MindsDB, since it's actively listening to the entries on this table, realizes that that is an instruction for MindsDB. And then it starts trying many different models, and it will come up with one that actually works well for you. Jorge Torres: Now, one of the things that MindsDB does really well here, is that it will understand that this is a problem that is actually many problems at the same time.
Because you may be clicking into many products, you may have many products in your database, so it's not that you want to create one predictor for just a single product, say the iPhone, you want to do it for the thousands of products that you have there. So at the end, it figures out how to create a meta model that can contextualize, and it will go and train it, and give you some information about how good this model is. Then as soon as it's done training, it publishes that model as a table in ClickHouse, but you can also publish this to all the databases that you can connect. Jorge Torres: Say for instance, if your production database is MySQL, then you can also tell MindsDB to go and publish this model in MySQL. Even though you trained it from ClickHouse, you can still go and publish it in MySQL, and it will allow you to query that model for, let's say, now I want to forecast, select from model, product ID is iPhone, and date is tomorrow, and it will give you an estimate of what the stock is. Jorge Torres: One of the cool things that we're doing now, is we started to realize that since people have a stack that is plentiful in terms of different solutions for different types of data problems, they may want to train, say for instance, from information that they have on ClickHouse, but they would like to make predictions in real time. Say for instance, they would like to forecast anomalies that may happen in inventory. And what they can do, is they can also project these models or deploy these models to streaming technologies like Redis and Kafka. And all they have to do is say, okay, well now you're going to listen constantly to an input stream, say the input stream for inventory, and for the clicks and all of that. And the output you're going to get is the prediction from this model. Jorge Torres: And essentially MindsDB finds a way to pipe it all together, and now you have a stream with the actual forecasting for inventory.
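[Editor's note: the create-and-query workflow Jorge describes can be sketched in MindsDB's SQL interface. This is an illustrative sketch only: the table and column names (inventory_history, units_in_stock, and so on) are invented for the example, and the exact statements vary by MindsDB version.]

```sql
-- Train a predictor by inserting into the "magic" predictors table.
-- MindsDB sees the new row, runs the training query against the
-- connected database (ClickHouse here), tries many models, and
-- publishes the best one as a new queryable table.
INSERT INTO mindsdb.predictors (name, predict, select_data_query)
VALUES (
    'inventory_forecast',         -- name of the resulting model table
    'units_in_stock',             -- the target column to predict
    'SELECT product_id, clicks, units_in_stock, order_date
       FROM inventory_history'    -- aggregation query run in ClickHouse
);

-- Once trained, the model behaves like any other table:
-- "select from model where product is iPhone and date is tomorrow".
SELECT units_in_stock
FROM mindsdb.inventory_forecast
WHERE product_id = 'iphone'
  AND order_date = '2021-06-02';
```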
And you can also tell it, put in another stream whenever you see an anomaly. And that allows you to essentially have a solution that plays very well with the data that you have, wherever you have it, exposes the machine learning capabilities in a construct that is native to you, either a table or a stream. And then the dissonance that you have to have for building machine learning, is not the current one where you have to have a different team that specializes just in that, and that team has a high bar to move everything that they do into production. So, you can essentially play with the same tools and the same knowledge that you have. Eric Anderson: Got it. There's a couple concepts there that are kind of mind-blowing to a degree around, we often think of models as a thing, you ask a question, you get one answer, single response results. But with databases, sometimes I think of like a select all, and give me a whole table. Can you select all a MindsDB magic table, and it just gives you predictions for every scenario? What would that look like? Jorge Torres: I think that a good example of what we're doing right now, is you can join models with tables. And what happens, is that it will take everything there on the table and it will make a prediction for each one of those, and then it will give you the output of the equivalent of joining a table with another table. This is useful, say for instance, if you want to visualize how your inventory is going to look across all your stores in a map, so you can essentially join the current inventory with the model, and then aggregate by store, and then you can just visualize this on, I don't know, a heat map or whatever you want to do. Jorge Torres: So actually that point that you bring up, it's one of the reasons why we started building partnerships with BI tools. So, we have one partnership with Looker, Looker being this SQL-native BI tool.
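[Editor's note: the model-joined-with-a-table pattern Jorge mentions can be sketched the same way. Again, the table, store, and column names are invented for the example, and the join shape is a sketch of how such a query tends to look in MindsDB rather than exact syntax for any one version.]

```sql
-- Batch predictions: every row of current_inventory is fed through
-- the model, then results are aggregated per store, ready to drive
-- a map or heat-map visualization in a BI tool.
SELECT inv.store_id,
       SUM(fc.units_in_stock) AS forecast_units
FROM clickhouse.current_inventory AS inv
JOIN mindsdb.inventory_forecast   AS fc  -- each inventory row becomes model input
GROUP BY inv.store_id;
```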
Looker plays really well with MindsDB, because then you treat predictive capabilities as you treat tables. So, now you can have predictive dashboards with these capabilities of joining tables with models, and then being able to visualize the output of that in your dashboard, is essentially what we provide for that realm. Eric Anderson: Yeah. Giving Looker or Tableau a point-and-click forecast button, seems certainly compelling. Jorge Torres: Yeah. And I think that that's what we understand now, we start to see that you may consume predictions in different scenarios, but SQL and databases are essentially the lingua franca for analytics as well as for applications, and all of the stack and data. And by being able to portray machine learning as a building block within all of that, it just binds in a very natural way, to the way that people are doing things already. It makes it very easy for us to make these types of partnerships and integrations. Eric Anderson: And characterize for us a bit on when we say machine learning, the types we're talking about. Certainly this kind of tabular numerical forecasts, that makes sense. It sounds like Adam had a history in natural language processing, do you have NLP, neural net type use cases? I think a lot of us come to SQL tables, and imagine numbers. What's the scope of machine learning that you see? Jorge Torres: Yeah, so you're certainly right. In most relational databases, you have columns that have numbers or categories, which you can map into some sort of numbers. And then the challenge is that you may not have all the information that you need in just one single row, whatever you need may depend on other rows, so that's something that you can solve through MindsDB. Now, you may have also columns that have text. Say for instance, you have description and reviews for a given product, and demand for that product may change as well based on the reviews.
So, you may want to take that information into account for your predictive models, and MindsDB right now, can do inference from columns that are either numerical, categorical, and then a combination of those in a time series manner. Jorge Torres: We are starting to also understand that people have the need to extract structure from text. So say for instance, you have descriptions for your products, and in the description it says, "This iPhone has this fantastic processor," whatever processor, "It has an 8.5 inch screen," and all of this information that may be in the text description. And ideally in many use cases, what you would like to do is select the display size from the description, and it should give you a number. So, being able to extract structure from text is one of the things that we are actually working on with some of our partners, and that is yet to come. Jorge Torres: But truly, being able to portray these capabilities in a way that still looks native to the database user, is powerful. They don't have to go and train language models, and then from those language models, do kind of like entity extraction, and then those [inaudible 00:20:07] them into the database. So, the trick here is being able to portray those features in a way that still looks very natural to the way that people do queries in SQL. Eric Anderson: Got it. And you've given us the e-commerce scenario, where else do you find people getting excited about this? I wonder, Adam, you've got this background in equity, finance. Is there a place for this there, and where else? Adam Carrigan: Yeah, absolutely. I think one of the difficulties is, with finance, people automatically turn to hedge funds predicting the equity markets. What stocks should I buy today and tomorrow? And it's an extremely complex problem, because there is so much data out there that you need to take into account.
No one individual firm holds all of the data. There's a lot of public data available, from Reuters, and from national statistics and macroeconomic data that needs to be taken into account. Adam Carrigan: We haven't yet had anyone try to do those sorts of tough problems, although because you can combine data sources from a variety of different places, it's something that you could certainly try to do. In terms of finance capabilities, a lot of people try to use things internal to business, projecting what costs will be like over the next 12 months, in terms of revenue numbers and margins, et cetera. And that's something that we can do very, very easily. We have had a few people who have reached out to us in the open source community who have tried to predict really random things, like the lottery numbers, which we can't help with unfortunately, wish we could. Jorge Torres: Also, I think that even to tap into that, we've had people using MindsDB for risk, financial services risk is one of those things where the ROI is very clear. If you can forecast risk very well, then you can act upon it very effectively. And MindsDB actually plays very well with these risk forecasting models for financial services as well. Eric Anderson: Certainly. So, I guess that's one dimension you would operate in, the various sectors. And then the other is, we've already touched on this a bit, but you help databases, and it sounds like you've got integrations with quite a few. And then you also help the visualization layer. Are there other kind of layers in the stack? I suppose anything that's SQL oriented could benefit, is that right? Jorge Torres: Yeah. So, we are now starting to understand that beyond SQL, say for instance, the Mongos of the world, we're starting to see that there are challenges that are hard even if you know what you're doing in terms of machine learning, but once you crack them, then you can automate them, like machine learning and semi-structured data.
So, that's why we're now actively working with Mongo to bring these capabilities to them. There is of course, the streaming technologies out there that are an animal that takes a lot more skills to kill, because you have to be really good at time series data, time series data with all the different challenges that we described before. But then also being able to pipe all of this infrastructure, so that we can make these predictions on the fly. So, latency of the models and all of those challenges we're solving too. Jorge Torres: Actually for Redis' conference, which should be happening two weeks from now, we're launching with them, Redis SAIL, a streaming, scalable, artificial intelligence layer with MindsDB. So, if you have streaming data on Redis we can do it, thanks to MindsDB as well. Eric Anderson: Yeah. So, you're helping me realize that you're this AutoML framework that has these friendly interfaces, SQL being the primary one, which puts you in the visualization tools and many of the databases, relational databases. But you're not limited there, you can have other interfaces that are more streaming oriented, that are more document DB-like oriented, and they all benefit from one another, because you can train on ClickHouse, and deploy on a stream, on Kafka for example, as you described earlier. Jorge Torres: Yeah, yeah. And speaking of Kafka, today we became an official partner for Kafka as well. So yeah, Kafka is certainly very interesting for us, because the way that they see streaming data, at some point also found a merger with SQL. One of the big key features that Kafka is portraying [inaudible 00:24:12] now is called ksqlDB, which is a way for you to see streaming data with the mindset of a SQL user. And it just ties so nicely with what we're doing as well, so we're very excited about the work that we're doing with them too. Eric Anderson: So, tell us where the project's headed in the future.
I also wanted to ask about, you have quite a few contributors, and I imagine some of those are beyond the scope of your organization. Are these users that want to add a feature, or plug a hole? Who do you find jumping into the project? Adam Carrigan: Yeah. We have all sorts of users, as you can see from the number of people who contribute. Some of those people are certainly internal to MindsDB, as we obviously have public repos. But a large chunk of those people are people who just want to build something for themselves. Want to add a feature, or want to report a bug, or fix a bug, want to add to documentation. And so, even though we are now looking at monetization, we've still focused a lot of effort there. We're still a small team of 18 people, but we have people inside the team that are dedicated specifically to fostering and looking after that open source community. Adam Carrigan: One of the things that we try to do is make machine learning much easier for people who don't know machine learning. And so behind the scenes, what we do is extremely complex. The code behind the scenes is hugely complex, and a big tribute to our developers inside the organization who have been able to condense this extremely difficult problem into something that is available to SQL users, literally you just need to know SQL. And so, a lot of our users are not the kind of people who are able to contribute to the main code base that we have, because it is, as I said, extremely difficult to do, and you need very specific skills. Adam Carrigan: But people find ways to contribute in many, many ways, from reporting bugs to submitting datasets to our benchmarking suites. So, we welcome all of those contributions. Not everybody can code in Python and PyTorch, and build extremely sophisticated models, and the kind of things that our internal team does. But you can contribute in many, many ways, and we always want help from our community.
It's very important to us. Eric Anderson: One of the things you have to wrestle with as you start to grow an open source project, is licensing and governance as you get bigger. How have you thought about these situations, and where have you ended up? And what is your plan going forward for managing the community? Jorge Torres: I'll let Adam answer the community part, he manages that. But licensing, we started with a very permissive license, MIT, and then we started to move to a license that was less permissive, just so that people wouldn't build products that are based on MindsDB, without MindsDB being aware of those. But soon, we realized that our go-to-market was essentially through the cloud. And by being able to produce MindsDB as a multi-tenant cluster kind of environment, that is where we develop our enterprise features, so they don't even make it to the source code. We're happy with the license that we have. A lesser license would work just as well, just because of the understanding that we now have of the global markets. We were very lucky to have a few investors that came from heavy open-source projects, like MySQL, MariaDB, and they've been instrumental in helping us understand that licensing. But I don't know, I'll pass off. Eric Anderson: Yeah. Adam, tell us about the community, and where people can get involved and that sort of thing. Adam Carrigan: Yeah, absolutely. And so, we have a number of areas and ways that people can get involved. The first point of contact is one of two places: you can visit our website, MindsDB.com, hopefully there will be a link below wherever this is posted. Or you can visit our GitHub, which is github.com/mindsdb/mindsdb. We have a number of public repos, but that is the main one.
And from those two places, you can join our Slack channel, you can join our forums, you can join a beta testing list, so we have a thriving community of people who love to test out the product, the new features before they actually get rolled out to the rest of the community. And so, there are many, many ways that people can get involved. Even if you're not particularly technical and just want to get involved in the community, then reach out to those, join our beta tester list, join our Slack channel and we'll be happy to point you in the right direction. Eric Anderson: Great. Anything we didn't cover today that we should have? Adam Carrigan: No, I think this has been fantastic. Good fun. Eric Anderson: MindsDB, I just find is really clever and unique. I think we see a lot of open source projects that make sense. It takes a lot of work to build a stream processing service, or a what have you, distributed engine of some sort, and so I don't want to diminish those efforts. But this appears fairly novel, and it's really awesome that you're bringing machine learning to people who otherwise, maybe would struggle to get there, or certainly a much more efficient approach. Jorge Torres: Thank you, thank you very much. I think that we're trying very hard to ensure that that happens, so glad to hear it from you. Adam Carrigan: Thank you, Eric, for the kind words. Eric Anderson: Appreciate you coming on the show. Jorge Torres: Thank you. Adam Carrigan: Pleasure to be here. Eric Anderson: Any time. Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson, and this has been Contributor.