Paul: Hi there and welcome to PodRocket. I'm Paul, and I'm very excited to have André Kohn with us today. Hi André. How are you doing? André: Great. Thank you so much for having me. Paul: Thanks for coming on. We're going to talk about WebAssembly today. We're going to talk about DuckDB. These are all really exciting things. Because I feel like WebAssembly isn't talked about a lot in the internet media that I might find on YouTube and such. So this is going to be an exciting delve into what's coming to what we know is the most common development and content platform in the world. Paul: André is from the Technical University of Munich. Before the podcast he wanted to make sure that we're saying "of Munich" because that's the correct way to pronounce it. And he's from the database group over there. So you're doing your PhD right now. André: Yes. Paul: And so your PhD's in databases, right? André: Yes. Paul: That's the general topic. Is it good? Have you enjoyed it? André: Yeah, so far it has been a fun ride. I started my PhD four years ago, and we have a database here that we developed, which is called HyPer, and the successor to HyPer, Umbra, which is very fast, and it's quite interesting to work on these projects. But last year I took a side turn and explored WebAssembly a little bit. I came in touch with the founders of DuckDB, Hannes and Mark. And we came out with DuckDB-Wasm, which turned out to be a quite exciting project. Paul: Right. So databases are ... There's such a breadth of use-case applications out there. Before we get into the main topic of the podcast, which is DuckDB and WebAssembly, I'd love to just take a quick excursion and ask out of my own selfish interest here. What are some examples of the different types of problems that you maybe need to think about as a database scientist? Because there are all these terms: we have data lakes, we have columnar stores for aggregations.
And I know there are these different use-cases and things. And I was wondering if you could just speak a little bit on the breadth of use-cases and implementations. Because I feel like that matters a lot. Right? André: Yeah, so the database landscape is very wide, as you said. There are very many systems for very many use-cases, but usually the blunt answer is that there is not one system that is good in every situation. There are these sweet spots for the different systems where they are fast, and then there are situations where they are not the best. André: Interestingly, the database that we talk about today, DuckDB, is in a sweet spot where not too many systems are. It is an in-process analytical and relational database system. And the keyword here that you can focus on a little bit is in-process, because in contrast to many systems out there, you can just use it more or less as a query engine if you want. So a large user group of DuckDB is, for instance, in the Python world, where you just spin up your Python script, pip install DuckDB, and then you can just go from there, querying with SQL whatever data you have. André: This convenience is a very strong argument for DuckDB, because existing systems usually ask you to install something, have a server running, things like that. So DuckDB focuses on very fast analytics in situations where you want to be ad hoc or embedded, situations where you don't want the hassle of dealing with a server. Or yeah, you can use it as a server- Paul: That's what in-process means. It means it's in the Python process? Can I think about it that way? André: Yeah. So we would just- Paul: In memory. André: No, it's not only in memory. It is in memory, but you can also use it out of core. Paul: Gotcha. André: So with writing to disk. But the key difference is if you have a Python script and you want to use DuckDB, you can just use it as if it were a Python library.
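To make the in-process idea concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3, another in-process database that comes up later in the episode, as a stand-in; DuckDB's own Python API (after `pip install duckdb`) follows the same pattern of running the engine inside your process, with no server to set up. The table and numbers are invented for illustration.

```python
import sqlite3

# An in-process database: no server to install or connect to over the
# network. The engine runs inside this Python process.
con = sqlite3.connect(":memory:")

# Data that already lives in the program is handed straight to the engine,
# without shipping it over a socket to an external server.
rows = [("MA", 120), ("NY", 300), ("CA", 95)]
con.execute("CREATE TABLE cases (state TEXT, count INTEGER)")
con.executemany("INSERT INTO cases VALUES (?, ?)", rows)

# Query it with SQL instead of hand-written loops.
total = con.execute("SELECT SUM(count) FROM cases").fetchone()[0]
print(total)  # 515
```

The same script with DuckDB would look nearly identical, which is the approachability point being made here: the database behaves like any other library import.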
So it feels very integrated into your program. Whereas if you have an external database, let's say Postgres or so, you would set up Postgres, you would connect to Postgres, and then fire some SQL queries against Postgres. The big difference here is that the approachability of this concept is very, very strong. Because if you want to use DuckDB in your script you can just, without much setup, take it from there. Paul: So you can use it as a library. Now, when you say that, does that mean I'm going to get tab completion in my code editor? Or when you say I can use it as a library, is it more like we're talking lower level? André: No, no, I'm talking about, if you have your data sitting in your process, for instance in a Pandas data frame or so, and you want to ... Pandas is very widely used in Python, for instance. If you just want to query this existing data, you don't need to ship your data somewhere else; you can just use DuckDB to consume the data that you already have, which is, A, faster, and B, more convenient if you just want to get the answers to your queries. Paul: And is the in-process benefit here, the speed up, because ... Is it one of those things, I'm just thinking lower level because, oh, we don't need a ... there's no inter-process communication. There's no network communication. There's no paging as heavily in different sectors of memory on the CPU level. André: Mm-hmm (affirmative). Paul: Okay. So it's like, we're getting all these benefits into ... André: Yes, of course. Beyond the convenience, the second big point is that if you have your data sitting in your process, you don't need to spend time shipping it over the wire, or over a domain socket or whatever communication channel you use. Paul: Because a socket has latency as well, right? André: Yes. Yes. Paul: I feel like usually as a higher level programmer myself, I think of a socket. I'm like, oh, it's a direct ...
It's like I'm talking to the other guy in the room. It's like he's right there. But there's still something to be gained there if we're doing massive volumes of work. Gotcha. Paul: Great. So, all right. We got a little bit of database speak in here before we talk about what you're really focusing on, which is bringing DuckDB into WebAssembly, and exposing it to a bunch more use-cases all over the place. So do you want to maybe start off by talking a little bit about how you got started with the project, and what WebAssembly is for people that are just stepping into it? André: So, as I said earlier, my PhD actually is not with the CWI people who invented DuckDB. Instead, we are developing the main memory engines HyPer and Umbra. But the big advantage of DuckDB is that it is more or less self-contained. So with WebAssembly becoming more and more popular, I got excited about the fact that you could pick this fast analytical database and bring it to WebAssembly without dealing with the very many problems that you would have to solve with a database that is more complex. André: So, for instance, HyPer and Umbra depend on libraries that are very large, and porting them to WebAssembly would be a lot of work. And with DuckDB, it was a lot easier. So how did I get to WebAssembly in the first place? Well, it started as a side project. So I had a pet peeve, you could say. When COVID started, what you saw very often were these rather simple dashboards, and you would see loading spinners all over the place. Yeah, exactly. And you had loading spinners all over the place initially because, usually, in the world where you have analytics servers presenting analytical data to users, you would have a server doing the processing and then the frontend doing just more or less the dumb rendering of the results.
André: This turns out to be a large bottleneck, especially in the COVID situation, where we had 10,000 or even more users wanting to look at a dashboard very frequently. And then you are just left with the question, how do you scale up? How do you throw hardware at the problem? How can you get the requests answered quickly? So this is a problem. And I was a little bit surprised that it is, for these scenarios, surprisingly hard to do all the analytical processing within the browser itself. André: So the analytical questions that you ask in these scenarios are rather simple. The problem is that JavaScript is not really well equipped to do analytics. So JavaScript is, even though it is, of course, a very flexible and dynamic and very widespread language, not the most efficient one. And if you want to crunch data or do more complex joins, JavaScript is just not the right tool for the task. Paul: Sorry, if I may interject for a second. When you're talking about processing with JavaScript, are we saying, all right, in this example that we're talking about, in the back end, it's JavaScript grabbing this data, grabbing that data, combining, aggregating, summing, making out lines to- André: Yeah. Paul: Okay. Gotcha. André: Exactly. So these are the tasks that databases do all the time. So you just fire your SQL query against your database of choice. And what the database internally will do is join the rows using hash tables, aggregate them, and so on. Of course you can do this in JavaScript as well, but JavaScript is just not the right fit. It's just too slow. André: With WebAssembly this picture shifts a little bit, because WebAssembly is quite fast. So if you bring a database to WebAssembly, my hope was that you can get fast analytical processing in the frontend. And thus, in many cases, just eliminate the central analytical server. So in these situations, you would just serve the very same dashboard.
For the user it would be exactly the same output, but you no longer have a server processing in the background that you need to scale; instead, everything is computed in the frontend. André: Of course, this has implications. This is not working for every situation. But the cool thing is that this was, let's say, a first stepping stone towards an area that you could explore. Because with the browser or frontend being more powerful, you can build much more exciting things. The cool thing is that this data processing becomes more of a tandem operation. In the frontend, you have analytical processing power. In the back end, you have analytical processing power. And then your decision of where you do the work depends more or less on what your optimization target is. André: The big strength of doing things in the frontend is latency. If you just want to have, let's say, a filter in your dashboard, and you want your snappy interface giving you the new range, you can decide if you want to just send it to some remote server, which immediately incurs the cost of the round trip to the server, or you do it in the browser and you immediately have your snappy result. So the difference there is surprisingly cool. Once you have that, it really feels very snappy to have the processing done in the browser. Paul: So how is this different, I guess, if I had a dashboard and I said, all right, I want to display all the COVID data from Massachusetts? I brought it all in and I had, God, I don't know, like a 30 or 50 megabyte array of numbers in memory. Now with JavaScript ... I feel like I'm pretty good at slicing that with JavaScript. I'm pretty good at combining it. And it's pretty snappy. But what we're talking about is on another level, right? We're talking about intense processing and joins that need a database engine, but we're bringing it to the front. Is that kind of where this- André: Yeah, there are ... sorry.
There are two points here that I want to make. Of course, I totally agree with you. Scanning through 30 megabytes of rows, even in JavaScript, is not a terribly expensive operation, but there are two sides to it. First, usually, of course, if you just compute a single sum or a count or so, or minimum/maximum, blah, blah, blah, this is simple. Even to write this in code, this is usually just a single loop in JavaScript. You just compute something. This is easily done. André: But let's imagine you have three or four of these 30 megabyte sets of rows, and you decide, I want to join my COVID data with the states and then find something out over a certain time range, and then join this with population or age range or so. These kinds of parameters immediately bring up the question of how you evaluate joins efficiently. And this is a business that databases are already very, very good at. If you just do it by hand in JavaScript, you will end up, A, being slower, and B, being probably not optimal. André: For instance, evaluating joins is a very well explored problem in the database community, which has many rabbit holes if you try to build this yourself. I can almost guarantee you, if you just write this naively, you will end up with an implementation that is not the optimal one. André: But this is nothing you actually have to worry about. Especially if you know SQL. This is the second point that I wanted to talk about. If you know SQL, you don't need to write a for loop over your 30 megabytes at all. You just load your data into your database, and then you just write your SQL query and the database takes care of everything else. So it's not only about the fact that you are efficient when processing; it's also the convenience of just using the query language that many are familiar with. Paul: Right. That's widespread and well known. Right. Okay. So that totally makes sense.
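The two points above can be sketched side by side. Below is a toy version in Python (sqlite3 as a stand-in in-process engine; tables and figures are invented): first the naive hand-rolled nested-loop join André warns about, then the same question expressed declaratively in SQL, where the engine chooses the join and aggregation strategy.

```python
import sqlite3

# Toy data, invented for illustration: per-day case counts and populations.
cases = [("MA", 1, 10), ("MA", 2, 20), ("NY", 1, 50)]
population = [("MA", 7_000_000), ("NY", 19_000_000)]

# Hand-rolled approach: a naive nested-loop join, O(n * m) comparisons.
# Easy to write, but none of the strategies (hash join, etc.) a database picks.
def naive_join(cases, population):
    out = []
    for state, day, count in cases:
        for pstate, pop in population:
            if state == pstate:
                out.append((state, day, count, pop))
    return out

# Declarative approach: load the data into an in-process engine and let it
# evaluate the join and aggregation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE cases (state TEXT, day INTEGER, count INTEGER)")
con.execute("CREATE TABLE population (state TEXT, pop INTEGER)")
con.executemany("INSERT INTO cases VALUES (?, ?, ?)", cases)
con.executemany("INSERT INTO population VALUES (?, ?)", population)

result = con.execute("""
    SELECT c.state, SUM(c.count) * 1.0 / p.pop AS per_capita
    FROM cases AS c JOIN population AS p ON c.state = p.state
    GROUP BY c.state, p.pop
    ORDER BY c.state
""").fetchall()
print(result)
```

On data this small both versions are instant; the argument in the conversation is about what happens when each table is tens of megabytes and the query involves several joins, where the engine's join strategies win.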
And we're taking this, and in the COVID example, I guess one possible ... there are a bunch of ways you could do this, but one possible architecture is you're running WebAssembly, DuckDB. You bring in the data, you load it live on that very loading of the webpage. And then you can query that database to populate your graphs very quickly. Conversely, could you have a webpage with DuckDB that stores something in the local browser memory or cache, and then you load that up and it's already there, sort of? André: Yeah. Paul: Cool. Okay. André: We are exploring this quite a bit at the moment. Of course, this is something ... Local persistency in the browser is a difficult topic. Usually people have something like IndexedDB that they use, which is not really fast. But there are scenarios that we could focus on in the future which would, for instance, allow us to just have a locally persistent version of DuckDB that would behave just like a normal database. André: This is, for some reasons that are out of reach for us, not working today. For instance, the browser APIs are evolving a little bit at the moment. What we would need is local file system access that is fully synchronous. There is a current origin trial in Chrome that gives us synchronous file system access. But unfortunately, the authors decided to have a few calls in there be asynchronous, truncation, for instance, being one. And this prevents us at the moment from going fully into the native persistency world. But this is something we want to explore in the future. André: I wouldn't consider this to be, especially today, the main selling point of DuckDB-Wasm, but this is a very exciting opportunity for the future. If we are able to get it fully persistent locally, you would get IndexedDB on steroids with DuckDB-Wasm. Yeah. It is like that, right? So you have just your local database writing to your local browser file system.
And then whenever you visit the website again, you can just pick up the data that you left and just consume it with SQL, as you are used to with ... Native applications on mobile devices have done this for ages, but in the browser you are usually very limited in what you can do. So yeah, this could be quite exciting. Paul: That could be very exciting. That could totally change the way we think about websites. André: So let me just get one additional thing out. One strong point for DuckDB-Wasm is that we focused on consuming data partially. So one thing, if you have a website and you have your CSV or Parquet files sitting, for instance, on AWS S3 or so, and you want to query them, the classical stack or architecture for how you would build this is that your JavaScript application would load the data into the browser and then do whatever you want to do with it. André: But databases traditionally are quite strong at just picking up those parts of the file that you are interested in. So maybe you only read a subset of the columns within a Parquet file. Or, Parquet is a format that is also structured in what they call row groups, and you are only interested in a certain part of the Parquet file because, for instance, it's ordered by time and you are only interested in, let's say, the last 10,000 entries in there. Then DuckDB-Wasm will be able to query these files and read only the parts that qualify. Which is cool, because then you can have your remote file being one gigabyte large, but the relevant data for your query is in the order of, let's say, 10 to 20 megabytes or so, which is then in a manageable area. Paul: That's critical if you want it on the client side browser, because you could really have performance capabilities and hardware that's down in the dumps, for lack of a better phrase, and you want it to work for as many people as possible. André: Yeah, exactly.
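The partial-consumption idea above boils down to fetching only the byte ranges a query needs instead of the whole file. Here is a minimal Python sketch of the pattern; the in-memory buffer stands in for a remote Parquet file, and the offset and length are invented numbers standing in for what a real reader would get from the file's footer metadata.

```python
import io

# Stand-in for a one-gigabyte remote file, scaled down to 1 MB here.
# In the real setting the offsets below would come from Parquet footer
# metadata describing where each column chunk / row group lives.
remote_file = io.BytesIO(bytes(1_000_000))
column_offset, column_length = 750_000, 4_096  # hypothetical metadata

# Instead of downloading all 1,000,000 bytes, seek to the range the query
# needs and read only that. Over HTTP, the same idea becomes a request
# with a "Range: bytes=750000-754095" header against S3 or any web server.
remote_file.seek(column_offset)
chunk = remote_file.read(column_length)
print(len(chunk))  # 4096
```

That is why a one gigabyte remote file can serve a browser query while transferring only tens of megabytes: the engine asks for the row groups and columns that qualify, and skips the rest.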
This also blends into what I said earlier with the tandem between the browser and server side processing. I believe one very exciting use-case for DuckDB-Wasm is not to fully replace the server, but just to work together with a server side database to improve the latency and scalability of the system. André: This is something where we will be able to help users as well, with, let's say, a more distributed query optimization that spans across multiple nodes. But the architecture in the future could just look like this: you have your local browser-side database, you have your server-side database. And then you answer queries with the lowest possible latency based on where the data sits. André: So if you have, for instance, already spent the time to download a subset of the data, you can just filter it very instantly, or almost instantly, with the browser based database. And if you want to query gigabyte or terabyte large data sitting somewhere in the cloud, then of course it won't be an option to download it to the browser. Then you use the server-side version. So this tandem could lead to very interesting new ways of designing these applications, ways that would get rid of very many of today's loading spinners. Yeah. Paul: I know I've used D3 before for massive data sets, and it's done a pretty good job, but that's only on some data. This is going to open up every type of data, every type of bit and byte that you can store. And D3 definitely has its use-cases. So maybe those people who did the spinners should have checked out some of the new capabilities of D3. André: It's interesting that you mention that. We developed DuckDB-Wasm together with a professor at CMU, Dominik Moritz, who is the author of Vega. Do you know Vega? Paul: Yeah. André: Vega is a visualization framework, which is based on D3.
And the cool thing there is that it, for instance, speaks Arrow, and we can just feed or integrate DuckDB-Wasm rather easily into Vega to display whatever data you want to display, which allows you to formulate your data processing with SQL instead of the Vega transforms. And then let Vega do the visualization business of displaying your charts. Paul: Which sounds how it should be. It's visualized, and the things that are good at what they do are doing the things that they're good at. André: Yeah. Emily: Hey, this is Emily, one of the producers for PodRocket. I'm so glad you're enjoying this episode. You probably hear this from lots of other podcasts, but we really do appreciate our listeners. Without you there would be no podcasts. And because of that, it would really help if you could follow us on Apple Podcasts so we can continue to bring you conversations with great devs like Evan You and Rich Harris. In return, we'll send you some awesome PodRocket stickers. So check out the show notes on this episode and follow the link to claim your stickers as a small thanks for following us on Apple Podcasts. All right. Back to the show. Paul: I would love to take a little excursion also off the database. We'll probably get back into it, but about WebAssembly specifically. So what is it? It's this new sort of compiled assembly language that runs in the browser. But I was wondering if you could enlighten us a little bit more about some of its capabilities, and maybe some of its downfalls. Because every time I hear about something running some new feature in the browser I'm like, all right, who's going to hack it first? What are they going to find? How are they going to ... What's the new way they're going to control my computer, if it's not Flash games? So I wonder about the security as well. André: Yeah. So WebAssembly is very interesting because it's, in the first place, a binary instruction format that all browser vendors pretty much agreed to implement.
So in the browser world you always have the problem that you need to find the common denominator across all these different use-cases, which is why JavaScript won't disappear for a very, very, very long time: everyone started using JavaScript, everyone can speak JavaScript. It is an excellent language for the purpose of being platform independent. André: But the problem that I mentioned earlier already is that JavaScript is slow. So with that said, WebAssembly is the answer to JavaScript being slow. It's just an instruction format that is spoken by all the major browser platforms, plus a few native runtimes already. And it can be compiled to machine code very, very efficiently. Conceptually, it's a stack based virtual machine, but you can lower it to machine code very well. And this allows browsers to get to near native speeds. André: It is also secure. So WebAssembly is designed in a way that, for instance, memory won't shoot you in the foot. If you take a C++ program and you just let the user write whatever he wants to, it would be a very dangerous endeavor to run this code in your browser if you don't trust it. But the WebAssembly folks have done a good job keeping this isolated and safe. And this gives you near native speed with the security of a browser sandbox, which makes it the ideal target for higher level languages like C++ to compile to. André: So DuckDB is written in C++; we compile it to WebAssembly and then just rely on the browser to get this up to speed as much as it can. And the very interesting aspect here is that the Chrome developers behind V8, for instance, developed a compiler called Liftoff. And Chrome is insanely good at executing WebAssembly. André: I once attended a talk here in Munich by the Liftoff people, and they talked a little bit about how they compile WebAssembly, and it was amazing.
So for instance, they have this very quick first compilation and then upgrade to a higher quality compilation output later when they see that your code is hot, and so on. André: So these things are insanely complicated to build, but your browser just gives them to you. All you need to do is compile your code, in whatever language you want to write it, to WebAssembly, and then just throw it into the browser and let the browser do the rest, which is, of course, amazing. This brings us to speeds that haven't been seen in the browser before, and combined with the focus on analytical processing of DuckDB, we can evaluate analytical queries very efficiently, which hasn't been done before. André: There is a WebAssembly version of SQLite, for instance, which is another database, one that is not tailored towards analytics specifically, but they also use WebAssembly to evaluate queries. And it's very exciting how this space is evolving. So now you have very different choices starting to emerge in the browser space, and you no longer need to resort to JavaScript for things that are probably better done in a database. Paul: So do you think any application that has any sort of processing could benefit from looking at WebAssembly? And I'm thinking, like, right now we're on a high definition video chat software. That's something that is maybe not completely database related, but is this something that could benefit a type of application, maybe a good target for somebody to think about writing in WebAssembly? Or is that off center here? André: So one thing, cycling back to the Chrome developers, this always strikes me: they did an excellent job getting JavaScript up to speed. Even with JavaScript being such a dynamic and flexible language, they did a pretty good job at executing JavaScript code very efficiently.
That means that for many use-cases JavaScript will just be perfectly fine. André: So right now, WebAssembly is still very young, which means that you are still fighting very many childhood problems of WebAssembly. And in the meantime, if you had just picked JavaScript, it would have done the job almost as well as WebAssembly, and you would have already been done with the task. So JavaScript is still a very solid choice. André: Also, one thing that WebAssembly cannot do very well is interacting, for instance, with the DOM. So if you have the document object model of your website, and you just want to have, for instance, React representing your website state, this is nothing you would do with WebAssembly today. So in all those cases where you, for instance, want to react to a user event or things like this, WebAssembly is just not the right fit today. André: So for the question of whether you should use WebAssembly: usually, I think if you are processing a lot of data, then WebAssembly is probably something you could consider, with something like DuckDB-Wasm. For cases where you are not processing bound, I would probably not use it today, unless you want to unify your stack. There are these projects, like Blazor in the .NET world, where you can just basically write ASP.NET code and run it in your browser. André: These things are an upside because you can have almost the same code base and run it over on the client side. But this is not an argument about performance anymore. This is more an argument about convenience. And I, at least, see the near term and midterm future of WebAssembly being more of an accelerator. So you have cases that benefit from better processing efficiency. Then you write the main application still in JavaScript with React or whatever framework you choose. And then, for these performance sensitive things, you just glue your database or your WebAssembly code to your application.
Paul: So right now it's very much in the phase of: it works and it has very tangible benefits, but how realistic is it to use? It's inserted in very specific use-cases right now that relate to, maybe, processing. Because, yeah, JavaScript is so mature at this point. And like you said, I've watched other YouTube videos of conferences and stuff, of people talking about, well, we had this problem, but the Chrome guys just solved it by making JavaScript faster again. They just keep doing a very good job at making JavaScript faster. André: Yeah, but today more and more things are about data. So even if you just pick our use-case of accelerating analytical processing in the browser, this already touches a very large variety of use-cases. So with data being ever more present and more and more relevant to users, the data processing alone is a very strong argument to consider, sooner or later, using DuckDB-Wasm or a database in WebAssembly, whatever it will be. Because this is still something that we are good at. André: But yeah, as you said, if you just have an application or a CRM tool or whatever, like a very complex interface where you need to guide users and so on, this is nothing where you would benefit from WebAssembly immediately. Paul: Gotcha. And one last thing about WebAssembly, just as a topic in general. It's a standardized compiled format that these engines can run, as you said, at near native speeds. That's a big thing for a language like Go, where you can write something and it runs everywhere. Do you think that WebAssembly is going to start stepping into that role in any way, of being just a common runtime that people are going to use to perform the work that they need to do? André: Yeah. It's interesting that you focused on runtime. One thing that strikes me as a very interesting future environment where WebAssembly will have a deep impact is actually function as a service.
So if you look at, for instance, Cloudflare functions, I'm not quite sure whether you've- Paul: Or Workers. André: Yeah, sorry, Cloudflare Workers, and Netlify also has an offering. What they do, in contrast to AWS Lambda, is use just WebAssembly and JavaScript to build function as a service in a very cheap way. So there is a cool article about how Cloudflare Workers work internally. What they basically do is, while you do your HTTP handshake with Cloudflare, they already prepare your function to be executed. And then when the HTTP handshake is done, they are ready. Which basically eliminates this cold start problem that you have with AWS Lambda. André: The interesting aspect of this is, if you have WebAssembly sitting on this stack, you get near native execution speeds with the flexibility of the isolation of V8 in these edge functions. So these edge functions work that way because you don't need a virtual machine to isolate your tenants and so on. They are very snappy and can start up very quickly because you just run them on V8. With WebAssembly being in the stack, you get the luxury of these very fast and cheap functions at near native speed. Paul: So it's like V8 VMs. André: Yeah, yeah, yeah, yeah. So these Cloudflare Workers are quite interesting. Of course, this is also still very young. So this is nothing we have explored excessively today, but I totally see that running SQLite or DuckDB-Wasm in there is a huge opportunity, because then you just introduce another layer close to where your data is sitting in S3, one that can scale much better when answering your SQL queries at a lower price.
André: So you could, for instance, cycling back to this tandem thing: if you have your processing in your browser, your processing on your nifty backend server, and your processing in your edge servers in between, in the function as a service, you could just decide, I do my processing there, the closest to the user, or the closest to the data, whatever is necessary in terms of download performance or bandwidth. André: So for instance, if you can't afford to ship the data to the user, maybe you can afford to just ship it to the edge worker, and you don't need to pay the price for the larger VM sitting in the background. So this multi tier processing is actually, I think, a quite exciting opportunity for an embedded or in-process analytical database, because it allows you to run complex SQL queries on all of these layers. Yeah. Paul: So let's hope Cloudflare puts that in because ... André: Yeah, we looked at Cloudflare Workers more closely, actually. And one downside is they currently implement the service worker API, which doesn't allow us to implement these synchronous remote fetches that we talked about before. Like the layer we built on top of the file system that would partially read remote files. Right now we can't do this with the current version of Cloudflare Workers, but I'm cautiously optimistic that this will at some point be sorted out. And then, yeah, we are happy to work with them. This would be quite nice. Paul: That would be cool. As a user of Cloudflare Workers myself, that would be ... I'd love to just dump my computational needs on them. I don't know. They just make it really easy. If there are simple aggregations and transformations, put it close to your customer. André: Yes, totally. And the costs are so small. So they are very... Paul: Yeah. They definitely did something special, taking from the Lambda people and making something novel. Anyway, we're coming up on time here, André.
Before we go through our closing ceremonies here, is there anything that you would like to link for our listeners, point them to, or people you want to shout out to? André: So, yeah. I would love to just mention Mark and Hannes, who have been very generous and very exciting to work with. Of course I'm working remotely with them, which was a quite interesting endeavor. And those two have very big plans for DuckDB, and I hope for the best for them. They have this company called DuckDB Labs, which is doing support contracts. So if you are interested in DuckDB or DuckDB-Wasm, this is the best point to start. Just chat with them, or approach us. We have a Discord server. Paul: Oh, so we can link that for sure. André: Yeah. Paul: If you have this. André: If you have anything coming up with DuckDB or DuckDB-Wasm, just ping us there and we are happy to talk. So, yeah. Thanks for all the time. Paul: Thank you for being on the show, André. And hopefully some people will start looking into WebAssembly and these new kinds of website architectures that are going to shape tomorrow. Thanks for coming on. Kate: Thanks for listening to PodRocket. You can find us at PodRocketPod on Twitter, and don't forget to subscribe, rate, and review on Apple Podcasts. Thanks.