The following is a rough transcript which has not been revised by Vanishing Gradients or the guests. It is a transcript of [the entire YouTube livestream here](https://www.youtube.com/live/c0gcsprsFig). Please check with us before using any quotations from this transcript. Thank you. hugo bowne-anderson 0:03 All right. Hi everyone, Hugo Bowne-Anderson here. I'm so excited I could barely sleep last night. I'm here today with Eugene Yan, Shreya Shankar, and Hamel Husain, who I'll introduce in due course; their co-authors will join us very soon. We're going to start in a minute or two, so if you could introduce yourself in the chat, let us know where you're calling in from, what your interest in this type of stuff is, and what you'd like to get out of the session. Are you building LLM systems currently, or working in AI or data science? Let us know in the chat, and we'll get started. My name is Hugo Bowne-Anderson, and I've worked in data science, machine learning, and education for a while now. This is a live stream for a podcast I do called Vanishing Gradients, and I'm just going to put a link in the chat here. We have a lot more live streams coming up, which I put on Luma, so I'm going to put the Luma link there as well. Excitingly, two of my next guests coming up soon will be Shreya Shankar solo, who I haven't spoken with in public solo for some time; we're going to be talking about evaluations, how to evaluate the evaluators, and what that means for the cool products and research she's doing. Another guest will be Hamel Husain, who we have here. hamel 1:37 Again. Wow, dude. Yeah, exactly. hugo bowne-anderson 1:38 Everyone has been on a real journey of late, but Hamel and our dear friend and colleague Dan Becker recently taught a course which blew up into a conference, and everyone here spoke at it. Dan and Hamel, correct me if I'm wrong, but you thought maybe you were going to teach a couple of hundred people; it turns out 2,000 people enrolled, and it became a multi-sided marketplace with products and vendors and open source frameworks and speakers and learners. So we're going to have a podcast episode on what they learned teaching LLMs to thousands of people around the world. I've just put the Luma calendar link in the chat. We've got lots of people tuning in from around the world, doing all types of different stuff with LLMs, which is super cool. Without further ado, we should get started. I do want to say, if you enjoy this type of thing, please do share with friends and hit like and subscribe, because we do a lot of stuff like this. But without further ado, let's jump in. To set the scene: over the past year, Eugene, Shreya, Hamel, and their co-authors have been building real-world applications on top of LLMs, and have identified crucial and often neglected lessons that are essential for building and developing AI products. They've written a very meaty report on the lessons they've learned, which I'm going to share in the chat now. But I would love to start by going around and introducing yourselves, maybe starting with Eugene, and letting us know why you're even interested in LLMs. eugene 3:28 Alright. Hi everyone, I'm Eugene.
I sell books at a bookstore. Literally, I work for Amazon Books, but that's it. Opinions here are my own. My work is really around recommendation systems and search, but recently I think large language models can help readers, customers, understand their recommendations and search a little bit more, and just add a little bit more life to it. So that's how I'm thinking of using large language models: to try to help customers understand their recommendations and search. hugo bowne-anderson 3:58 Amazing. And to be clear, you build RecSys, among other things, that millions of people use worldwide, right? eugene 4:08 Yeah, I used to do a lot of RecSys. I think my first three or four years were really focused on RecSys. For the last 18 months I've been trying to catch up with what's happening in language modeling, and I'm still struggling to catch up, so I can figure out how to serve it reliably. hugo bowne-anderson 4:24 Absolutely. How about you, Shreya? What's your background? What led you to this wonderful world of LLMs, and to thinking about evals so much in particular? shreya 4:32 Yeah, so I am a researcher and also an ML engineer. I'm doing my PhD in kind of data management and UX, or HCI, for machine learning. Why am I interested in LLMs? I think most people, myself included, want intelligent software, and many people want to build intelligent software, and LLMs make it super easy to do this, not just to prototype it, but really to get it into the hands of other people and create a very simple flywheel to learn from: you know, what are the errors, and how can we really quickly iterate on it? hugo bowne-anderson 5:09 Right. And in particular, you've always kind of had a deep pathological interest in databases. No, I'm joking about the pathology of it all, but hopefully something we'll get to is, once again, the importance of the data. So maybe you could just speak to your interest in databases more generally. shreya 5:28 Yeah, well, I think all ML problems are data management problems to some extent, and you can look at it in a lot of ways, right? What are the right samples of data that we use to solve problems? How do we quickly incorporate data? How do we quickly iterate on these systems to improve them? A lot of it just comes down to: how do you help people manage their data better so they can build these intelligent products? Databases are one way of thinking about doing it, but databases do a lot of things, so they can't really compete with that whole monolith of software. hugo bowne-anderson 6:02 Absolutely. And just before we move on to Hamel, which I'm really excited to get to, can you please introduce your friend? shreya 6:08 Oh, this is my dog, Papaya. He's very happy that I came home from the lab. hugo bowne-anderson 6:14 What a cutie. What's his interest in LLMs? shreya 6:18 He doesn't even know what it is. It's probably for the best. hugo bowne-anderson 6:22 The good old days, when we were all naive; ignorance is bliss. So, Hamel, what are you up to? What's your interest in LLMs? I mean, you have a decades-long history in ML and these types of things. hamel 6:35 Yeah, so I've been doing ML for 25 years. I've worked a lot on infrastructure, like tools for data scientists, machine learning infrastructure, and developer tools.
I started working with large language models early on at GitHub; I led some research around large language models and code understanding that led up to GitHub Copilot. Yeah, I've just been doing it for a while at this point. Does it make sense to do something else? So it's not so much "why is it interesting?" It would feel kind of dumb not to keep doing it, because the technology is so powerful. I haven't really stopped to think about why it's interesting. It's like, well, yeah, it's obviously a powerful technology, and very helpful in many respects. hugo bowne-anderson 7:39 Yeah, I love that: it's just what you're doing now. But then you speak to the power of these systems that we're still working hard to understand. We talked about this recently as part of your course, right? Simon Willison has a video where he referred to LLMs as fractally interesting: you can look at them at a lot of different scales, and all types of interesting things emerge. And I do think there's so much to discover in there. hamel 8:06 Yeah, the thing I'm probably most interested in is how to really speed up the software development lifecycle with LLMs, like how you can code faster and launch applications faster with a smaller team. I haven't worked specifically on that area in a while, but I think it's pretty interesting. hugo bowne-anderson 8:32 Yeah, absolutely. And I do think some of the things you all think about, and what you work on, Shreya, in terms of human-computer interaction, speak to this, and to making sure that developer experiences are as frictionless and smooth as possible. But hey, we've got a lot of people saying they've read the report recently, or they've read part of it, and someone says "lovely dog". So, Papaya. Clearly everyone thinks Papaya is a lovely dog. But this is a beast of a report, and I love how you go through tactical, operational, and strategic lessons, positioned towards the different parts of an org who may really need to wrestle with all of these issues. But I'm interested: with you six authors, I know you've all known each other and been friends for some time, but how did this even get started? What's the origin story of this report? eugene 9:37 That's great. Okay, I guess we were just chatting in our group chat, a small group chat. Bryan was talking about something, he was talking about creating a Slack, and we said, let's maybe discuss during office hours. And then he said he was thinking about writing about a year of LLMs. And I was like, dude, I'm drafting that right now, and I took a screenshot of my Obsidian note. It's like, haha, okay, and we just left it at that. Then Charles gave us the next nudge: guys, it'd be cool if we just collabed. And of course Hamel and Jason were just, yeah, let's do it. And of course the first person I think about is Shreya. Shreya, I really wanted to work with you again. I mean, I've worked with Shreya on several of my own pieces of writing.
Shreya has been an amazing editor, and I've also chatted with her about her own work. So it was really exciting, and that's how it came about. And then it was, okay, everyone, write your ideas, and we just figured out how to mishmash and merge them, and that's how it happened. You spoke about tactical, operational, strategic, right? That was how Bryan thought about organizing it, and I'm so glad he did, because there's just a lot in there. I think there are 40-something different lessons, and by organizing it at different levels, it just becomes easier to consume. So that's how it came about. And the moment I asked Shreya, she was like, yeah, I'm coming in. She was just so game, and I was just so happy that that happened. hugo bowne-anderson 11:11 Awesome. So, we've already mentioned evals, and I've got a question for you, Shreya. The report really highlights a lot of the work you're doing on evaluating LLM output quality. Can you share more about your approach to this challenge and how it differs from what we've seen in traditional data quality? shreya 11:32 Yeah. First of all, that's very kind; other authors wrote about my work, and there's no greater pleasure than having other people write about things that I'm researching. I think there were two main things I was thinking about. One was: when you consider real-world data quality, right, so not Paul Graham's essays, which every single LLM has been trained on, or collections of data that have permeated the internet a lot, how do a bunch of evals perform? For example, the canonical needle-in-a-haystack queries. I just had this question, tried to conduct a bunch of experiments, and learned some really interesting things. For example, when there are typos in the input data, or the casing is a little bit different: when you ask Mistral to retrieve a name from a document and the name is lowercase, Mistral fails to retrieve it. Very small idiosyncrasies like this are really hard to uncover in traditional evals, and I don't yet know how to study this further. I think the report we wrote makes one step towards pointing something out, but there's still quite a long way to go. The other thing around evals that we talked about in this report was, just generally, how do you think about validating the validators? How do you construct a flow that gives you the right assertions and evals to deploy? This is really hard because, one, people don't know what concepts they should check for in their outputs, right? This itself is a function of the LLM outputs: you have to look at hundreds of outputs to even know what the weird failure modes in LLMs are, and what, at the end of the day, you care about as a developer. So, kind of doing this process in flow (the dog is stretching), we wrote about this in our EvalGen paper, and that kind of flow made its way into this Applied LLMs report, which I thought was very exciting, and I'm happy that it resonated with Hamel, Eugene, and the others. hamel 13:38 Well, yeah, I actually wrote about it. I'm maybe the biggest cheerleader, or one of the biggest. It's because every time I talk, when I'm helping people in the wild with large language models, they really struggle with evals. Like, how do you get started?
How do you think about it? What do you mean, write a test? What does a test look like? Should I just write unit tests? Is it like pytest? They get, mentally, a kind of writer's block and just don't know where to begin. People shut down. They're like, this is so hard, I don't even know, I need some expert to help me, I'm stuck, there's no way I can do this, this is beyond my capabilities. It can be really easy to get overwhelmed. But actually, okay, my daughter, who is five years old, is learning programming, and it's really interesting: there are these amazing applications, or languages, like Scratch, that are really fun and approachable, and teach you that programming doesn't have to be hard; it can be fun. And one of the things I noticed with Shreya's tools, the ones that were part of her research, was that they were kind of like Scratch for evaluations. It visually showed you how to think about it. You could build these building blocks, and it would help you generate different kinds of evals and validate them. And when I saw that, I was like, wow, okay, this is kind of like Scratch for evals. I would show it to my clients, and they would immediately say, oh, I get it, I finally understand what to do. Super powerful teaching tool, honestly. And I think when I told Shreya, she was maybe a little bit surprised at first that that's the angle I take. I'm like, yeah, it's teaching; that's what I use it for. shreya 15:49 I've been thinking about it a lot: we collectively are teaching people the process. We're not teaching people tools. We're not teaching people how to use LLMs. We're teaching people the strategy around, okay, I'm committed to building this AI product, how do I go about doing it and making sure that in one year from now it works, it's still there, and people are using it? There's so much work, I think, to do in that process-oriented mindset. hugo bowne-anderson 16:16 So when Hamel said he might be one of the biggest celebrators of Shreya's work, I was going to joke that Hamel just doesn't stop talking about it, but then he launched into it immediately, which I love. Eugene, you actually prepared a question for Shreya which is related to this conversation, so maybe you could ask that now. eugene 16:37 Yeah, I guess the question I have for you, Shreya, is: why are you so bullish on LLM-based evals, LLM-as-a-judge, or LLM as an evaluator? shreya 16:48 It's like we've never talked about this before, ever. I'm kidding. Oh man, lots of reasons. So I think there are a lot of people who want to build LLM-powered software who don't have the resources to do the traditional monitoring and evaluation that traditional ML products would have: it's a one-person team, a two-person team, a very small startup, a very early-stage product, and they just want to ship quickly. I think there are also people who just cannot collect the data they need to write their own evaluations or fine-tune their own evaluation models, and having all of these barriers seems very challenging and at odds with the simplicity of deploying LLMs in the first place.
Like, if LLMs are so simple to deploy, we need to have a somewhat simple evaluation method; otherwise, what's the point? And the other argument here is that these LLMs are getting much cheaper and much faster for the same, I hate to use the word intelligence, but I don't have a good substitute, because I don't want to go down the AGI route. GPT-4o is great. This new Claude Sonnet model came out this morning. It's just getting faster and cheaper, and you don't have to run it on all of your data. But if you sample your data every day and run GPT-4o on it, it's affordable. It could be good. hugo bowne-anderson 18:17 Awesome. So we've really jumped in, in a nice way, getting into the weeds with evals. But I'm interested, Hamel, maybe you could tell us why it's so important to talk about evals as soon as possible. hamel 18:31 Yeah, so actually, sometimes I think that maybe I should just not even say the word evals, because when you say the word evals, people start looking at you like you're some kind of rocket scientist or something. Like, oh, he's using this word, what does it mean? And really, it's not about evals. It's like, hey, let's take a systematic approach to making your AI better. Just stop fooling around and start having a process. Eyeballing it can only take you so far, you know. So it's not really "evals"; it's: let's have some way to know whether we're making progress or not, and let's make that measurement as frictionless as possible, so that every time you make a change you can know whether you're making progress or not. Otherwise there's no other way to make progress. It's not really about evals; it is just how you work on AI. It is not a component of AI. It's not optional. It's not, oh, you should do evals, or maybe you can do it, or it's good if you do it. If you're not doing it, you're not really doing any AI stuff; you're kind of just fiddling with the prompt and saying, huh, I'm using AI. That's fine, you can use AI, whatever. But if you're building a product around it and you're not measuring anything, then you don't have a way to make it better, and if you don't have a way to make it better, how are you going to build? So it's really just how you build, and then you can drill into that and say, what are the steps? eugene 20:23 Yeah, I think it's just machine learning 101. I mean, the first thing you learn in machine learning: training set, validation set. This is the same thing. It's just the validation set. hugo bowne-anderson 20:33 Yeah, and in some ways it's a proxy for a loss function, because in the movement from classical machine learning to generative AI, it isn't obvious what the loss functions are anymore, what we need to fit to. As we mentioned, the report covers a lot of ground across the tactical, operational, and strategic aspects of building with LLMs. I'd like to hear from each of you: which area do you think is currently most underappreciated or overlooked in the industry, from tactical, operational, strategic? shreya 21:09 I don't want to go first, because my answer would change every week. hamel 21:15 I think Eugene should go first.
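A minimal sketch of the sample-and-judge idea Shreya describes above: pull a random slice of each day's logs and ask a judge model for a binary verdict, so the cost stays affordable. It assumes the official `openai` Python client; the judge prompt, sample size, and log fields (`question`, `answer`) are illustrative placeholders, not anything prescribed in the report.

```python
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a customer-support assistant.
Question: {question}
Answer: {answer}
Is the answer relevant to the question and free of refusals?
Reply with a JSON object: {{"pass": true/false, "reason": "..."}}"""


def judge_daily_sample(logs: list[dict], sample_size: int = 50, model: str = "gpt-4o") -> float:
    """Sample today's logs, ask the judge model for a binary verdict, return the pass rate."""
    sample = random.sample(logs, min(sample_size, len(logs)))
    passes = 0
    for record in sample:
        response = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(**record)}],
        )
        verdict = json.loads(response.choices[0].message.content)
        passes += bool(verdict.get("pass"))
    return passes / max(len(sample), 1)
```

Because the verdict is binary rather than a rating, the daily pass rate can be tracked over time like any other metric.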
eugene 21:18 I mean, I'm going to go first, but I'm probably going to say what he was going to say. I think what is overlooked is how to bring everyone along. How do you train your existing software engineers and equip them with the skills to go do this, right? I think about this a lot. I work with a lot of excellent software engineers. How can I teach them to do very basic evals, you know, create some kind of synthetic data, or use data from Kaggle, etc.? How can I teach them to understand that generation is autoregressive, so a longer output will lead to longer latency? It's just some of the basics of how it works. How can I teach them that context is just conditioning, so they can be more effective in using it? Right now, I just don't see a lot of that. I don't know why, and I think about it both on my job and outside of my job, and that's what I think we haven't done a lot of. So, yeah. Shreya? shreya 22:25 I'll tell you the one thought that I had last week. So we conclude by saying something like, oh, there are so many demos, and we've got to... I don't remember exactly what we say, but: enough with the demos, let's start putting things into production. And I heard a talk at a conference that was saying the same thing. And then I really thought, there's a new demo-plus-plus, or prototype-plus-plus, out there, where you will see things in production that somebody hacked together over three days, and then it's, like, a catalog product at a large cloud retailer. I don't even know what to say. And now I'm thinking, how do we move from prototype-plus-plus to a real product? The bar for production has lowered a lot with generative AI. Anyway, this is such a rant; it's a tangent, it's not answering your question. hugo bowne-anderson 23:33 In what way do you think? Tell us more about how the bar for production has changed. shreya 23:39 Well, I think you can hack together something in three days that, you know, passes a sniff test for VPs, and it could go to production depending on the culture at the company. And this can also be large companies, right? I'm not saying ChatGPT is like this, or Bard, or whatever, Gemini, I don't remember what the Google equivalent is; there are both of them, actually. hamel 24:07 Yeah. It's really confusing, actually. shreya 24:12 Yeah. So I think these products have not really been launched with evals, or some sort of rigorous way to quantify improvements, and I wouldn't call them production. I now want to change my definition of production: you also have a way of systematically improving the product, and a way to convince yourself and other stakeholders of this. And I think we're pretty far from that. I hope that the report helps people get towards that. hugo bowne-anderson 24:43 But is this also related to the issue that, with generative AI, shiny demos are a lot easier to build, but then we have this tail of all the things that arise, from hallucinations, to difficulty getting into prod, to not having robust evaluation, this type of stuff? So you get a flash, and then kind of a relaxation of some sort, maybe. shreya 25:06 But I think, still, people are putting their first things into production without having any form of evals.
eugene 25:13 And I think... oh, go ahead, yeah. shreya 25:17 To me, that's not a commitment to the process, right? That's a commitment to showcasing your demo. hamel 25:26 Yeah, I like that we've touched on learning the process, which I think is key, and it's worth lingering on for a second. A lot of people in the space are really obsessed with tools. You say, hey, we should do evals, we should improve your RAG, and the first question you get asked, more often than not, is: oh, what vector database should I use? What embeddings should I use? What vendor do you recommend for evals? We get really hyper-obsessed with tools, and we're not learning the process, and that is where it's going very wrong. That's part of a narrative which we can drill into, but I think that's where a lot of people are shooting themselves in the foot: not learning the process and focusing on tools. hugo bowne-anderson 26:25 So Hamel, you've been working in and out of consulting for decades now, in classic data science and ML, and now all of this generative AI and LLM stuff. As someone who's advised a wide range of clients on AI strategies, what are some of the most common misconceptions or knowledge gaps you encounter when orgs are first exploring the use of these technologies? hamel 26:49 Yeah, so there's one that is probably the big elephant in the room. Really, what it is, is a skills issue, and that's related to a certain narrative. And this narrative is... so you may be familiar with the role of AI engineer, and the way that the role of AI engineer was coined. I should probably share a screen or something. hugo bowne-anderson 27:19 Do it, man. shreya 27:21 You're giving away the keynote. eugene 27:23 Hamel's bringing out his big slide. Is it ready, man? hamel 27:26 No, I'm not gonna show the whole slide deck. Let me just... eugene 27:30 Hold on. I guess, for the audience, this is a sneak preview of the keynote that we'll be giving at a conference next week. hamel 27:42 Yeah, the AI Engineer World's Fair. hugo bowne-anderson 27:50 I didn't even know you believed in the AI engineer, Hamel. hamel 27:53 There you go. Yeah, so okay, there's this very popular characterization of the skills that you need and should focus on in this new era of generative AI. And it's really taken off; it's been adopted by many people across the industry. It's an articulation of skills where you have a spectrum of different roles and skills, and you have this API-level boundary. On the right-hand side of the API boundary, you have something called an AI engineer. And the AI engineer is differentiated in this new era because, unlike years before, to deploy an ML product you don't need to know about the things listed here, training, evals, inference, data, and you should mostly be concerned with tooling and infrastructure, chains, agents, and so on. That's the thing you should pay attention to. And this seemed very reasonable to a lot of people. It was very persuasive in some degrees and in some circles.
And it really took off, and lots of people hired their talent according to this narrative. Now, one thing that went a little bit sideways with this narrative is that you don't even have to have a model that you trained to need evals; even if it's someone else's model, evals are just: how do you measure stuff? The notion of evals and unit tests are actually pretty close; the notion of having evals and writing software are very close in nature. And then data is another thing, so, data literacy. Data is not necessarily about training models; that's not the only use of data. It's actually useful to look at lots of data to debug a system, and it takes some data literacy to be able to navigate data. Data is kind of messy, as you know, Hugo, and as everybody else knows; data is really messy. And having good tools to parse your data, navigate through it, filter it, and so on, it's surprising, but that takes some skill built up over time. So by sort of ignoring evals and data... and you don't need training and inference, for sure. Those things I would agree with: training, you don't need it; inference, if you're using an API, you don't need it. The thing that has really gotten people stuck is this evals thing and this data thing, and by making them not your concern, this AI engineer role sort of gets stuck after the MVP. They immediately stagnate, and that really hurts. And then the title also really hurts. The AI engineer title: if you name someone AI engineer, it's the same thing as naming them AI King; the expectations are really high. If something is going on with the AI, like, your company is building AI stuff and you hit stagnation, your CEO is going to look at the AI engineers: you're the AI engineer, I thought your title was AI engineer. Everyone else is going to look at you and say, hey, I thought you were the AI engineer. And the engineer is like, I don't know what to do now. I've mastered the chains, the agents, the tooling. I have all the tools; I'm an expert at all the tools. But what do I do? And that gap is where, I would say, 100% of all the consulting business comes from. And I'm talking about, between Jason and me (Jason's not here), there are millions of dollars in consulting just because of this one thing: the talent gap. And I would say this is very impactful, because if you're building AI, or really anything, not even AI, the talent that you have is the biggest lever you can pull: what talent you hire, what skills, how you hire. And this is the thing people are getting wrong. So I'll just stop for a second. I think I ranted on. hugo bowne-anderson 33:07 Firstly, no, that was really, really fantastic, and I think it helped elucidate one of the major issues. I am concerned now that I can feel job listings going up as we speak for AI Kings, and that makes me slightly uncomfortable.
I also would have loved... I mean, you and I talk about evals and data a lot, and I know how you feel about them; everyone here now knows as well. But I would have loved to be a fly on the wall when you first saw this figure, to see what happened. I'm also interested in all of your thoughts: it looks like there are just so many false dichotomies in this. It's useful in some ways, but if we're talking about product, we're clearly talking about evals as well in some sense, or hopefully some way of measuring the impact or effectiveness, or however you're measuring a product, right? So I do feel like there are some false dichotomies there. I'm also interested... actually, Eugene, you got to ask your question to Shreya. Shreya, you've prepared a question for Eugene as well, which I think will help us step back and consider the report as a whole. shreya 34:27 Yeah, okay. Eugene, my question was: as you kind of primarily led this effort, and it exploded into I don't even know how many lessons, if you had to pick, if this report could only be about three lessons, which three would they be? eugene 34:52 All right, so I'm going to try to pick the three lessons that I think have brought the most value to people every time I spoke about them. I think the first one is going to be evals, not going to lie. I was just talking to a founder yesterday, no, I think Tuesday, and he was telling me, here's how my process looks, here's how my team's process looks. And I was asking him, why do you only do evals at the end? He's like, wait, that's not what we're supposed to do? You build the product... No, why are you not evaluating throughout your development cycle? It's like, huh, why would we think about that? Because it's almost like training a machine learning model, where the artifact, the model, is the product, and you do the eval at the end, I guess. But now you're building a product: you do the evals throughout your iteration cycle. As you update your RAG, as you update your fine-tuning, as you update your product, you do the evals. I think the second thing is lexical search. I'm on this soapbox; I think Joe is also on this soapbox. Not everyone agrees, but I really think everyone should have BM25 as a baseline. Now, it may not always work, but most of the time it does. Here's a story: there was a team asking me, hey, Eugene, how can we improve our retrieval? And I asked, what are you using? They said, oh, these embeddings, some embeddings from Hugging Face. I was like, why don't you consider BM25? And I was pretty insistent about it: okay, try it, and then let me know how it goes. The next time they got back to me, I think one or two weeks later, they said, hey Eugene, did you know that BM25 accounts for 80% of our retrieved relevant documents? Flip it the other way: if they were not using BM25, they'd have left 80% of the juice on the table. It's like, holy crap, that's a lot. BM25 is such an easy and mindless baseline, very easy to tune; I mean, you don't even need to tune it, honestly. Just use it. I think the last one is really salient because of what happened today. Today, Claude 3.5 Sonnet dropped. Okay, we're going to update the model. Just change the model ID.
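As a concrete illustration of the BM25 baseline Eugene describes above, here is a minimal sketch assuming the `rank_bm25` package (`pip install rank-bm25`); the corpus, query, and naive whitespace tokenization are placeholders, and a real system would plug in its own documents and tokenizer.

```python
from rank_bm25 import BM25Okapi

# Toy corpus standing in for your real document collection.
corpus = [
    "Return policy: items can be returned within 30 days of delivery.",
    "Shipping times vary between 2 and 7 business days.",
    "Gift cards are non-refundable and never expire.",
]

# Naive whitespace tokenization; good enough for a first baseline.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "how long do I have to return an item"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=2)
print(top_docs)  # compare this hit rate against your embedding-based retriever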
I think the point I'm trying to make here is that the models will keep coming and going. When Llama 4 drops, Hamel's gonna kick off his fine-tuning pipeline and, bam, it's gonna be done. Or, like, GPT-3.7, Hamel's gonna kick off his fine-tuning pipeline. The models will come and go. But what's constant? Your evals, your guardrails, your fine-tuning pipeline, your retrieval system, right? These are your moat, the durable parts of your system; the model is not. A lot of the time, some teams are like, okay, this model is not good enough, let's go fine-tune our own model, and they place all their bets on the model, like it's all about the model. But no, I think it's really about your pipeline that generates the model, which is the artifact; your evals that evaluate the model; your retrieval system that augments the model with context. I think that makes a more durable system in the long run, and it's just lower effort in the long run. So those are my three things. hugo bowne-anderson 37:58 Are you satisfied, Shreya? shreya 38:02 Yeah, yeah. Well, the evals thing is very unsatisfying to me, not for anything Eugene said, but just the fact that we all talk so much about it, and still it's such a problem. It makes me wonder: what are we doing? eugene 38:18 Sometimes, for example, for me, I am crippled if I don't have evals when I start a new use case. I mean, how would I try to prompt, how would I try to retrieve? So sometimes when I'm advising someone, I ask them, how do you know if you're getting better? Sometimes they don't; they just look at it. Oh my god, dude, that's so tiring. If you look at it over time, you just get numb. You need an assertion-based way, a programmatic way, to evaluate it. I mean, maybe I'm just lazy, but for the audience: do try it. Having the evals up front as your insurance, your test harness, just simplifies your development cycle. hugo bowne-anderson 38:59 Absolutely. hamel 39:00 I'll say, about evals: I predict that roughly a third of people, when this podcast is published, will be like, okay, evals, and they're gonna look up evals and be like, okay, let me get the tools, what are the generic evals? Oh yeah, there's this conciseness metric and this toxicity metric, and they're going to pull all the generic evals off the shelf and be like, yeah, I've got evals. That is actually worse, probably, than doing no evals. So again: it's domain-specific evals for your problem, and also looking at data. What I tell people all the time is, don't be lazy, you have to look at your data. There's a dichotomy people fall into: okay, AI automates everything in life, but it doesn't automate looking. You have to look at your data, otherwise you don't know if something is going wrong, and you have to look at it for debugging. And that's a very interesting gap that people succumb to. They just want a tool that just does it, let me just do it with some pip install something, and I checked that list that Hamel gave, okay, I'm doing it. No. hugo bowne-anderson 40:20 I'm just... that was going to be a question. shreya 40:22 To some extent I understand this mentality, though, because it feels like AI should be able to help you look at your data.
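A minimal sketch of Eugene's earlier point that the model is swappable while the pipeline and evals are durable: the model ID lives in config, and the same assertion-based harness runs against whichever model is plugged in. `call_model` is a hypothetical stand-in for a provider SDK, and the eval cases are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineConfig:
    model_id: str           # the part you swap when Claude 3.5 Sonnet or Llama 4 drops
    temperature: float = 0.0


def call_model(config: PipelineConfig, prompt: str) -> str:
    """Placeholder for the actual provider call (OpenAI, Anthropic, vLLM, ...)."""
    return f"[{config.model_id}] response to: {prompt}"


# The durable part: simple assertion-based cases that never change when the model does.
EVAL_CASES = [
    {"input": "Summarize our refund policy.", "check": lambda out: len(out.split()) <= 120},
    {"input": "List three shipping options.", "check": lambda out: len(out) > 0},
]


def run_evals(generate: Callable[[str], str]) -> float:
    """Run every case through whichever model sits behind `generate` and return the pass rate."""
    return sum(case["check"](generate(case["input"])) for case in EVAL_CASES) / len(EVAL_CASES)


for model_id in ["claude-3-5-sonnet", "gpt-4o", "llama-4"]:
    config = PipelineConfig(model_id=model_id)
    print(model_id, run_evals(lambda prompt, c=config: call_model(c, prompt)))
```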
I think it's very funny that the AI engineering job description, or whatever that diagram is, does not have data literacy on it. As you said, Hamel, data literacy means you can look at your outputs and your inputs and make conclusions about them that quantify your performance, that can help you drive up the performance, stuff like that. It feels like AI should be able to help you do this, but I don't think people should rely on ChatGPT or AI to look at your data. hamel 41:03 It's very interesting you said that. I was talking with Bryan Bischoff; he's going to be on this panel after us. Bryan actually posted a job titled AI engineer very early on in this cycle, and he was like, hey man, you hate this job title so much. We kind of went back and forth, and I drilled into, okay, how in the hell is it working for you? Because usually when people post the job title AI engineer, they're basically saying they want a unicorn, and it doesn't really work out. But for him it worked out, because Bryan is a machine learning engineer, and in his process, concretely, one of his tests is about cleaning data; the take-home exam is cleaning data. Bryan is probably very rare as a hiring manager in giving someone a take-home on cleaning data. That takes a lot; it's very thoughtful. He's correcting for, I think, a lot of things, and that's why it works really well. He has a superb process. eugene 42:15 And it also suggests that Bryan's definition of AI engineer includes data literacy, which is different from the graphic that Hamel showed, right? So there are different definitions of AI engineer. Bryan happened to, Bryan thoughtfully, cover data literacy, which I completely agree is absolutely essential, and which I've found not so easy to train. hugo bowne-anderson 42:42 I also just love that we can say "Bryan says" anything right now. I've got so many things I want to put in Bryan's mouth; when he arrives, we can explore this further. But I feel like something Bryan would say is: look at your data in notebooks as well, right? Data literacy for him... and for him, notebooks are one of the best, well, no, the best eval tool. When you're generating data, or you've got data you're playing with, get it in a notebook, explore it, experiment with it, and then start to automate and validate things more programmatically. eugene 43:19 I think you might even say that notebooks are magic. Okay, bad joke. I don't know if anyone got the pun, but he's building a tool called Magic, in notebooks. hugo bowne-anderson 43:30 Great joke, Eugene, and I will cast a hex on all of you with my notebooks. I also put two links in the show notes, sorry, in the YouTube chat. One is Hamel's post, "Your AI Product Needs Evals". I'm saying this is a panel about evals now, but more seriously, it provides a very nice way of thinking through how to softly and slowly build evals into your system. And it's actually fun: Hamel and I talked about all this on a podcast last year, and kind of came up with the story just beforehand.
Then we had Emil from Rechat join as well, and I remember Shreya was in the chat, really forcing Hamel to break down his ideas more, which led to this blog post. I also shared a link to Eugene's "Prompting Fundamentals and How to Apply Them Effectively". I read these blogs all the time; I think these guys have newsletters as well, and you can follow them on Twitter. I can't get enough. But something Eugene writes here in the second paragraph is his usual workflow when building: manually label around 100 eval examples, then write the initial prompt, then run evals and iterate on the prompt and the evals, and then eval on a held-out test set before deployment. So I've put Eugene's words in his own mouth. Hamel and Shreya, I'd love you to reflect on this kind of rapid iteration, particularly when we're thinking about how to improve the velocity of the lifecycle, and all the things you think about with humans in the loop, Shreya. shreya 45:11 That's a lot of content, and I'm not sure I have anything new to add. One thing I will say is that in schools, in classes, and in research, they don't teach this process; they have standard, canonical benchmark datasets for everything, already labeled and cleaned for you. So you can't do the same thing here that you did in school or for a class project; you have to actually look at your data first. One thing that I feel pretty strongly about is binary indicators: having an assessment of whether an output is good that is a combination of binary indicators of quality. So let's say I care about conciseness, tone, whether or not a specific phrase is present, or whether a specific phrase like "as an AI language model, I cannot" is present, which should not be there, right? If you're able to characterize what makes for a good output with some combination of binary indicators, evaluation becomes a lot easier, because now you just need a way to evaluate each of these binary things. I think a lot of the ML literature, and even people in practice, are trying to train these Likert-scale-style evaluators, and that gets really hard; it's hard to calibrate AI evaluators. How does it know what a seven out of seven is, versus a five out of seven, versus a three out of seven? These kinds of rating-based things are just way too hard to train, but yes-or-no is much easier. So I don't know if that answers your question. hugo bowne-anderson 46:43 No, that's super helpful. And could you actually tell us a bit more about Likert scales and the alternatives, such as binary, but also choosing between options, kind of multiple-choice vibes, this type of stuff? shreya 46:54 Yeah, there are a lot of ways to evaluate things. I like to think about it as: maybe you're comparing two methods that produce the same kind of output, so maybe two different models or two different pipelines. In this case, pairwise comparisons are great. You should look at the outputs and directly compare whether one is better than the other, instead of trying to rate each individually and seeing if one rating is higher than the other. The reasons for this go back to the crowdsourcing days: when you do crowdsourced data management, there are a ton of strategies to improve inter-rater reliability, consistency, stuff like that. The same stuff holds here.
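A minimal sketch of the pairwise comparison Shreya describes above, with the order of the two outputs swapped to control for position bias. It assumes the official `openai` Python client; the judge prompt and model name are illustrative placeholders.

```python
from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """Question: {question}

Response A:
{a}

Response B:
{b}

Which response better answers the question? Reply with exactly "A" or "B"."""


def judge_once(question: str, a: str, b: str, model: str = "gpt-4o") -> str:
    """Ask the judge model which of two responses is better for the same question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(question=question, a=a, b=b)}],
    )
    return response.choices[0].message.content.strip().upper()[:1]


def pairwise_compare(question: str, out_1: str, out_2: str) -> str:
    """Judge both orderings; only count a win if the verdict survives the position swap."""
    first = judge_once(question, out_1, out_2)   # out_1 shown as A
    second = judge_once(question, out_2, out_1)  # out_1 shown as B
    if first == "A" and second == "B":
        return "pipeline_1"
    if first == "B" and second == "A":
        return "pipeline_2"
    return "tie_or_position_biased"
```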
I think there's another form of evaluation that is simply: how good is the output with respect to my implicit constraints? That's not a pairwise comparison problem per se; that's, how do I get a rating, or some definition of good, that aligns with what I think would be good, or what my user thinks would be good? In that case, I think we should break things down into, you know, five different criteria, good or bad. We might be able to expand to multiple choice, but in the beginning, assertions are just the easiest to start out with, and then you can begin trying to calibrate. hugo bowne-anderson 48:11 Very cool. Hamel, I know you've thought about this a lot. Is there anything you want to add? hamel 48:17 Yeah, what Eugene described is actually very similar to a meme. Let me share my screen, share the meme. eugene 48:25 I also know how much Hamel loves making memes. hamel 48:29 I can't actually find the real meme; I can just find one that's somewhat related. Someone can maybe find the one for data scientists. But you've seen this meme before, right? What my friends think I do, what my mom thinks I do, what society thinks I do, and so on, and in the lower right-hand corner, what I actually do. Everybody thinks the real work in AI looks like this Matrix thing, some magical thing; that's how everyone articulates it, and you can find evidence that even this post thinks that's what machine learning people do, if you read it carefully. But really, what Eugene is saying is: in that lower right-hand corner is looking at lots of data. That's what it is, very unglamorous, very tedious. That's what you actually do. So when anyone tells me their process is swamped by looking at lots of data, like 50% of their time looking at data, I'm like, okay, that person is actually doing the real work of AI. That's the signal to me. And it's really funny that this meme exists; it's hilarious, because when you see the meme you go, oh, of course, whatever. But it actually plays out in real life. People actually have really gross misconceptions of the role and what it entails. So Eugene is absolutely right: looking at data is where you should be spending most of your time, and it might even feel like this meme. So yeah, those are my thoughts. hugo bowne-anderson 50:14 Awesome. And we actually have a related and great question in the chat, once again about evals: how can you have evals up front if you haven't had your app used by people in the wild yet? They've asked Eugene, but I'm opening this to everyone. eugene 50:29 I think Hamel and Shreya can easily answer this question. hamel 50:34 Yeah, I can take a stab at it. I actually work on these tasks all the time, where no one has used the app yet. There are a lot of things you can do. One is synthetic data: you can generate synthetic inputs into your system.
You know, you can outline: what are all the scenarios that might happen? What are the different features that you want your AI application to handle, and the scenarios within those that might occur? And then you generate lots of synthetic user inputs, what user inputs might look like, and from there you have good test data. Then you can also start writing tests: you can write some evals that measure that, and kind of go back and forth. So that's a good way to bootstrap yourself. But also, honestly, you don't even need to do synthetic data. The process that Shreya shows is actually quite wonderful: the tool, the Scratch-like thing that I talked about, doesn't necessarily suppose that you have any users at all, and it does the exact same thing. It shows you how you can brainstorm tests based on just your own prompt. If your own prompt says, hey, the output should be Markdown, or should have three bullets, then, yeah, write a test for that. I'll hand it to Shreya; I don't want to take the wind out of her sails. shreya 52:03 Yeah, okay. One nice thing to think about is: for every instruction in your prompt, do you think an LLM is going to follow it more than, like, nine out of ten times? Probably not. Write some tests around this, like having at least three bullet points; these kinds of instructions, LLMs just cannot follow consistently over batches of outputs, so think about the at-scale case. Another thing I would say is: talk to the end users to come up with potential sample queries or tests. I'm always very surprised when people try to build things for other people to use without talking to those end users; it seems very wild that this would not happen. We're doing a kind of HCI research project on evals for RAG systems, and you can already classify successful RAG systems by the people who got a sample workload of at least, like, 20 different queries, versus the people who did not at all and tried to build something for their users anyway. You don't need that many; you need some, and you need some confidence that this is what people want, right? And I think LLMs are a great tool to scale up what you think is useful, like if you want to come up with additional test cases or data points. I would hesitate to say outsource everything to the LLM, but yeah. eugene 53:29 The one last thing I would add to that is that you are your own user. You should be dogfooding the heck out of your app. You should be trying to break it, and collect all those edge cases where you actually broke it, to make sure you fix it so it doesn't break again: happy cases, bad cases. And based on that, I think in a day you could probably just write 100 evals. Would that be good enough? I think it's good enough, not the best; you ideally want more, but 100 should be fine. hugo bowne-anderson 54:03 So I'm very excited to soon be bringing on your co-authors. But before we do that, what else can we put in Bryan's mouth? No, I'm kidding. I've got one final, forward-looking question for the three of you. What do you think some of the most promising opportunities for LLMs to have impact are, and what are some of the challenges you think we'll be facing? eugene 54:30 I can go first.
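A minimal sketch of the bootstrapping flow Hamel and Shreya describe above: enumerate features and scenarios, generate synthetic user inputs for each combination, and assert the instructions that already appear in your own prompt (Markdown output, at least three bullets). It assumes the official `openai` Python client; the features, scenarios, and checks are illustrative placeholders, not anything prescribed in the report.

```python
import itertools
import re
from typing import Callable

from openai import OpenAI

client = OpenAI()

FEATURES = ["order status lookup", "refund request"]
SCENARIOS = ["happy path", "missing information", "angry customer"]


def synthetic_inputs(n_per_combo: int = 5, model: str = "gpt-4o") -> list[str]:
    """Generate synthetic user messages for every feature x scenario combination."""
    inputs: list[str] = []
    for feature, scenario in itertools.product(FEATURES, SCENARIOS):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Write {n_per_combo} realistic user messages for a support bot. "
                           f"Feature: {feature}. Scenario: {scenario}. One message per line.",
            }],
        )
        inputs.extend(line.strip() for line in response.choices[0].message.content.splitlines() if line.strip())
    return inputs


def has_min_bullets(output: str, n: int = 3) -> bool:
    """Assertion taken from an instruction in the prompt itself: at least n Markdown bullets."""
    return len(re.findall(r"^\s*[-*] ", output, flags=re.MULTILINE)) >= n


def run_bootstrap_evals(generate: Callable[[str], str], inputs: list[str]) -> float:
    """Run the app under test on synthetic inputs and report how often its own instructions are followed."""
    results = [has_min_bullets(generate(text)) for text in inputs]
    return sum(results) / max(len(results), 1)
```

In practice each instruction in the prompt would get its own check like `has_min_bullets`, and the synthetic inputs double as the first labeled examples once you start looking at the outputs by hand.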
I guess a lot of companies and startups are thinking about how to use LLMs to do really fancy things, right? Okay, I'm going to SF next week, can I extend my stay? Or, I want to have breakfast with Shreya and Hamel, figure out a place around Dolores Park, etc. I actually don't think about that at all. I try to think about: what are the things that are unsexy, expensive, and slow that we can now delegate to an LLM? Simple things like classification and information extraction. Or maybe, for example, given the DMV handbook, how can we extract quizzes from it to help people learn the DMV handbook better? LLMs are extremely good at that, but for a person to actually do it, it can be quite slow. So I'm thinking about all these unsexy tasks that are now very feasible at low cost and high scale. hugo bowne-anderson 55:29 Great. Hamel, Shreya, anything to add to that before we introduce our new arrivals? hamel 55:37 Yeah, I think having good UX is underrated. A lot of people think of AI as a one-shot wonder: the UX is, you ask it and it gives you exactly what you want, and even then, when you don't get it, you're kind of stuck. I think more of the copilot mentality, and people abuse the word copilot, so I hesitate to use it, but essentially having a graceful failure mode, where you ask an AI to do something, it shows you what it's doing, it maybe gets it wrong, but you can edit what it's doing and have it assist you on the task, rather than it just completely failing and you saying, okay, that output is garbage, it's not working. More thoughtful UX is going to be really key, and people are, I think, ignoring that. shreya 56:37 Mine is very similar to that. I think AI is never going to be able to read your mind. What I think is very exciting is that end users can all be programmers to some extent, and that's actually where the successful AI technologies are, right? With ChatGPT, as the end user you are kind of a programmer: you keep re-prompting when it's not understanding your intent properly, you go back and edit your previous messages. I love, and Bryan's now in the chat, I love notebook-like interfaces for using LLM or AI products for this reason: it's inherently a workspace and programming environment, even for non-technical people. So I'm very excited about that for the future. hugo bowne-anderson 57:20 Well, thank you all. And what a great introduction to our new arrivals. What's up, Bryan? What's up, Charles? charles 57:26 Yeah! hugo bowne-anderson 57:29 That is what's up. This is actually really exciting. It's a shame Jason can't be with us today, but he's living a 40-hour day with everything he's up to. Very briefly: Bryan is the head of AI at Hex, where he leads the team of engineers building Magic. He has a long and illustrious background. As a coffee fiend myself, I'm really excited that he started the data team at Blue Bottle Coffee; he worked at Stitch Fix before that, and built the data teams at Weights & Biases before that as well. But even more importantly, Bryan is the only person I know in the ML space who has a background in non-commutative algebraic geometry, like myself.
I know a few other algebraic geometers, but not on the non-commutative side. And it's so great to have Charles here. Charles does many things; he's working at Modal currently, but broadly speaking, Charles teaches people to build AI applications. He's worked in psychopharmacology and neurobiology, and got his PhD from Berkeley, Shreya's institution, is that correct? Someone in the chat said Shreya looks like she's at Hogwarts. I don't even know enough about Harry Potter to know if that's funny or not. shreya 58:57 This campus does not look like Hogwarts. Maybe other people will disagree. charles 59:04 The Campanile is the most Hogwarts-like component of the campus, which otherwise, yeah, kind of resembles diet Stanford. hugo bowne-anderson 59:13 I love it. So, as we did with all the other guests, Bryan and Charles, maybe Bryan, you can go first: I'd just love to know why you're interested in LLMs, why you're even working on them and spending a lot of time thinking about them. bryan 59:25 Oh, interested in, or now working on them? Maybe both. About a year and a half ago, I went and did a short research program at the Redwood Institute related to interpretability, trying to understand, using mechanistic interpretability, what was going on with LLMs. I was part of a demo of GPT-2, using it in production, when I worked at Stitch Fix, and I remember thinking, these completion models are kind of fun, but man, they're flaky. But everyone had been so excited about ChatGPT that I went and did some mech interp research, and I found it to be incredibly exciting, the things that it had suddenly gotten really good at. One specific example is asking it questions that reference earlier parts of the conversation, and looking at the attention patterns, what it's looking at when it's making these implicit associations. And I started realizing, okay, this feels a lot more like intelligent conversation than anything I'd seen before. And I asked myself, ultimately, if this is really such a big step change, where should I be interested in applying it? As a long-time data scientist and someone who cares a ton about answering questions with data, this seemed like the biggest opportunity and the most exciting thing to apply it to. So: why I'm interested is because I think the technology has gone through a step change, and why I'm working on these problems is because I think they're the most important problems I can possibly have an impact on. hugo bowne-anderson 1:01:09 Awesome. And how about you, Charles? charles 1:01:12 Yeah, I think my answer most closely resembles Shreya's answer, which is: I want intelligent software. I kind of pivoted into machine learning about a decade ago from neuroscience. I did some psychology research, shot some lasers into mouse brains, and I was like, I want to understand minds and brains, but it's like trying to study stars without a telescope; we haven't invented the measurement apparatus to understand animal brains. So it's like, okay, what do we need to do? I guess we need to invent artificial brains first and get that working. And so then I did my PhD on studying neural networks. And I guess I thought that maybe by the time I died, I would be able to have a reasonably intelligent conversation with a computer.
And then it happened shortly after I left grad school. That was a wake-up call — cold water — like, oh wow, this isn't just about the broad intellectual goal of understanding thought and cognition. It's also about commoditizing cognition, putting it into computers, and filling the world with intelligent systems. So yeah, similar to Brian: this feels like the most important thing to possibly be working on — everything from the model-foundry people to folks like us in the application layer, and spanning that all the way up to turning these things into products. hugo bowne-anderson 1:02:42 So if you thought there was a chance of you speaking with something computational that was reasonably intelligent and creative before you died, and it happened right after you left grad school, what do you now think could happen before you die? charles 1:02:59 Yeah — I've become much less confident about predicting the future. I've decided to just flatten my posterior over future events: a much heavier discount rate, for example, on future value, because who knows what might be around in ten years or not. hugo bowne-anderson 1:03:20 Wow. I don't know if that's glass-half-full or half-empty, but — charles 1:03:29 The glass is in an indeterminate state in the future. Who could possibly plan for the state of the glass? hugo bowne-anderson 1:03:34 You're not telling me about Schrödinger's glass now, are you? Because I won't have it. What I am interested in is this: you all have created something very valuable and wonderful here, as far as I'm concerned. Your report represents a collaborative effort to distill a huge amount of insight and lessons learned about something that's really cutting-edge. As a collaborative effort, what was the most valuable or surprising thing you gained from the process of working together? Perhaps Brian, you could go first. bryan 1:04:08 Yeah — a lot of Twitter notifications. hugo bowne-anderson 1:04:12 I'm still getting them right now. bryan 1:04:14 Yeah, shout out to the infinite group chat Twitter notifications. No — God, this is so cheesy, but I think it's the connections and the community of the group. We talk about a lot of very real things, not just about generating these particular artifacts, but about ideas, and we discuss them at a pretty high level of sophistication — not high-level as in abstract and useless. Honestly, just the ability, when I have a deep, hard question about AI, to have five people on call who will not only tell me very transparently if they don't know, but are ready to battle if we disagree about something — that is very valuable. hugo bowne-anderson 1:05:09 How about you, Charles? charles 1:05:17 Yeah, I think maybe I was surprised by how much alignment there was. People wrote separate documents, then we pushed it all together and had the job of turning it into a single article. And I was expecting there to be, like, oh geez — page one says prompt engineering is the best, light your GPUs on fire, and page ten says you're just a GPT wrapper, you're ngmi.
But it turned out that, while there was disagreement and distinction and daylight between people, there was not really any merge conflict to resolve. The core insights we'd all gained lined up, even though we were working at different parts of the stack: I was doing a lot of infrastructure work and advising venture capital; Brian was working on one product really intensely for a long time; Hamel and Jason were looking at many projects over that time; Eugene, also a single product; Shreya, in research, taking a very different lens from everybody else in industry. And yet we had all grabbed different parts of the elephant, and nobody had grabbed onto the elephant and declared that it was actually a crocodile. We were all like, oh yeah, that's a very nice tusk you've got over there — I had hypothesized the existence of tusks while I was fondling this ear. That was really great to see. hugo bowne-anderson 1:06:57 Awesome. I'd love to hear from the rest of you. I do want to say Hamel seems to have disappeared from the call, but he's making comments in the chat, trolling Charles. He says: Charles reverse-trolling — Charles is a GPT wrapper, or some kind of higher intelligence, still trying to figure it out. And Brian, you're right, "fondling" is definitely going to be a pull quote for this. charles 1:07:20 Hamel's just trolling me because I'm an AI engineer and he feels threatened, as the unicorn AI engineer whose claims I prove incorrect. So he's coming into the chat real hot because he's mad. eugene 1:07:37 Hamel just has more fun heckling. charles 1:07:39 Oh, my God — that's how we'll be saved. hugo bowne-anderson 1:07:44 Welcome back, Hamel. hamel 1:07:47 I won't say anything. You can't argue with unicorns. hugo bowne-anderson 1:07:55 I want to step back and watch this play out. No — I am interested: Hamel, Shreya, Eugene, what's some of the most valuable stuff you've gotten out of this experience? eugene 1:08:07 I think I can go. The experience was really fun. I don't know how many of you play MMORPGs — World of Warcraft — but this thing was like a raid boss, and we assembled a really solid team. It was so satisfying seeing it come together, and everyone was so high-agency. Charles was talking about how there were no merge conflicts — well, there were no merge conflicts because Charles fixed all the merge conflicts. We had like 60 things and he shrank it down to 30. No, literally, he shrank it down from 40 pages to 30 pages. And then we posted it as three different pieces on O'Reilly, and he was like, guys, you need a site — two hours later the site was up, domain bought, everything set up. It's really satisfying to work on a team like that. I felt so excited and inspired every single day while we were working on this, seeing it come together — it just got bigger and bigger and went nuts. And I'm really thankful the community enjoyed it and found it useful. So yeah, I really enjoyed the journey. hugo bowne-anderson 1:09:27 Awesome.
hamel 1:09:29 I just really enjoyed becoming better friends with everybody — that was the best. hugo bowne-anderson 1:09:36 That's cool. How about you, Shreya? shreya 1:09:38 Yeah, similarly — I echo what everyone said. I feel like I learned so much from this group, who are building on a day-to-day basis, while I probably build much less and talk or write a lot more as a researcher. So it was very eye-opening to learn from the group. I will say, this piece had impact in verticals I did not even expect. Academics, for example, just don't read industry blogs, but this is one piece that professors I've met have talked to me about — they were like, oh my god, this is really good. I think it was the right time to release such a piece, because people are curious about this; people really believe this technology is changing computing as a field. And I'm very proud and grateful to be on a project that was able to communicate this to so many people. hugo bowne-anderson 1:10:39 Yeah, I think that's critical. And this is something I wanted to avoid earlier, but you mentioned, Shreya, that this type of stuff isn't taught in college. I, for one, am entirely over-educated — I think a lot of people on this call are probably entirely over-educated as well, spent way too long on campuses. Loved it, though. But I actually think there are strong arguments that the way education evolves should be conservative in a variety of ways: you want something robust and dependable before you start teaching it to that many people. And of course — this is a slightly cynical, slightly joking take — one take is that as soon as we rebranded all the math and stats departments as data science, data science went the way of the dodo, and it's all AI now, right? So should we have rebranded everything as data science? I don't know. charles 1:11:40 I actually — I don't know, I feel like that's exactly the skill set people need to be effective. It turns out you can probably prompt your way through a lot of software engineering problems now, but understanding the problem — which a good data science program teaches — is much harder. So I like to think that the rise of the data-centric AI application developer will successfully consume all of the data science grads these programs are pumping out, and they will be in a notebook-like environment, going back and forth with a SQL database and a language model and a Python kernel to solve real business problems. hamel 1:12:28 Man — can we say it again? Data— charles 1:12:30 I think you did. I like "application developer," because "AI engineer" sort of implies that you can actually engineer the system. A database engineer is somebody who can debug lock contention in Postgres, not somebody who can query Postgres for information. So I would say it's more "AI application developer." hamel 1:12:57 I like that. That's really — I'm gonna write that down.
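A toy version of that loop — a Python session bouncing between a SQL database and a language model — might look like the sketch below. The database file, table, and model name are assumptions made up for illustration, and the OpenAI client is just one possible stand-in for "a language model."

```python
# Round trip: SQL database -> Python -> language model -> answer.
import sqlite3
from openai import OpenAI  # any model client would do here

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

conn = sqlite3.connect("orders.db")  # hypothetical database
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
).fetchall()

prompt = (
    "Total order amounts by region:\n"
    + "\n".join(f"{region}: {total}" for region, total in rows)
    + "\n\nWhich regions look unusually low, and what should we check next?"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```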
hugo bowne-anderson 1:13:01 I love that. And I do think that's what a lot of these programs are teaching — we've all discussed this in various ways. It was several years ago that the term data-centric AI started to gain critical mass, because we'd become model-focused; then all the generative AI stuff took off and we became model-focused again, and it's about bringing it back to the conversations around data. I think you all are doing the good work, but this type of stuff — and this is one of the reasons I do this podcast — doesn't happen in college. You write what you write, including this report and your blogs, for similar reasons. So I'm trying to figure out: besides your report, of course, and this podcast, what are the best resources? Where should people be looking to figure out how to work this way? bryan 1:13:54 I have a couple of thoughts here. One: if this podcast is really supposed to be about data, why is it called Vanishing Gradients? Why are you always trying to optimize? Okay, let's leave that to the side. A couple of comments. I've been an adjunct at Rutgers, and I teach data science — to master's students who want to get jobs. That is literally their focus. When they come to me, they are not thinking "I want to really deeply understand the Adam optimizer." They're also not telling me "diffusion is pretty cool because I like PDEs." They're saying, "I would like to get a better job than my current one." That is what drives them. And there's one amazing thing about having that kind of student: they know what they want. That was never something I had, and I have so much respect for them for knowing what they want. When I try to teach them, the core things I focus on are problem framing, data framing, and objective framing. Those are the three lessons in my class, and I repeat them over and over, ad nauseam. All of those are the exact same essential tools of the AI application developer — which I have recently been harangued into calling it. So this feels identical to what I've always tried to teach in data science; it feels like a zero-shift operator. And so I'm sometimes a little surprised at how much people think this is fundamentally different from data science. A lot of us — in fact, all of us — gave talks in Hamel's course with Dan, this really wonderful course that turned into a conference, which turned into a meme. When I prepared my talk for it, I felt self-conscious, because one of the first things on my slides was "look at your data." I feel really strongly about this, but I was a little self-conscious — I thought I'd get laughed out of the room, that people would say "shut up." And then, on the day I presented, three people had that explicitly on their slides. And that wasn't the only day that phenomenon occurred.
This was a recurring theme across the lectures. So I think there is a lot more similarity to what makes a great data scientist than people give it credit for, and I feel unscandalized by the notion that AI application development is simply data science. hamel 1:16:56 Can I ask a question, or is it going to derail too much? So we have some debate about "AI engineer," and you've been harangued into "application developer." But you have posted an AI engineer job, and you have successfully hired engineers. Can you talk a little about the process you used that made it successful? bryan 1:17:18 So — very ironically — about a year and a half ago I started hunting for AI application developers, which I was calling AI engineers. This was before the famous blog post was written; I genuinely came up with the term AI engineer on my own. I thought of it because I wanted to encapsulate a couple of ideas. The first was that they were going to be working on applied LLMs — at that time people had already started calling that AI. What I didn't want to do was put "applied LLM engineer," because I was worried it might give the impression of model development — model tuning or deploying — and I knew we wouldn't be deploying the models, or even fine-tuning our end models, initially, so I wanted to steer away from that. The second reason I went with AI engineer is that I was thinking about a good analogy for the skill set they needed. It needed to be related to data engineering, because they'd be manufacturing pipelines of data that process and clean; and it also needed to be related to ML engineering — and what's ML engineering mostly about? Processing data and building evaluations. In March of 2023, what made sense in my brain was: if it's sort of about ML engineering and about data engineering, but in this applied LLM space, then AI engineering felt like a reasonable title. I'll just go with that — what's the worst that could happen? No one could ever make a conference or anything around it. So I rolled with it. One thing I'll share: in my job posting there's a little disclaimer — Shreya, please cover your ears — that says "No researchers, please." I really did that; it's really in my job posting. And I say: I really respect the work researchers do, I care about that work, it's really important — but that's not this job. I wanted that to be really clear, because one of the things I've felt guilty about is when people apply to jobs I've posted and they're wonderful people, but they're applying to a job I literally don't have. That always makes me sad. So I wanted to be explicit: hey, we're not going to be trying to improve Flash Attention. That's not this job, trust me. The other thing I wanted to make clear is that they should be aware they're going to have to make contributions to a production stack. So I put TypeScript in the job description. I did not put "you must know TypeScript," and I also did not put "you must know Python."
What I said was: you must be good at one of these two things, and you must be willing to engage with the other. Because, in my opinion, I'm only hiring people to do work that eventually leads to product. So they need to be able to look at the production stack and the production code, make minor changes, infer what's going on, and have opinions about how it's built. They don't necessarily have to be doing a bunch of back-end or front-end coding, but they have to be able to interact with it. And on the flip side, if they're really strong on the back-end side, that's great, but they need to be willing to engage with what I expected would happen mostly in the Python environment — because, after all, we're a notebooks company. That was my mental model, and I stand by that job description; I reposted it three months ago. hamel 1:21:15 So when you read the AI engineer blog post, how did you feel? If you were to write that blog post, would you tweak anything about it? How would you frame it? bryan 1:21:26 The one thing that has always stood out to me about that framing — that has not resonated — is that I don't think a lot of the good work is simply interacting with the model. I think a lot of the good work is really digesting the data, and that emphasis was never there for me. There was never a deep enough focus on the types of things data scientists think about. That's always been my criticism: it was a little too focused on "anybody could do it." And that's not a gatekeeping thing — I don't want to gatekeep. I don't want to say product managers can't do AI; my wife is a product manager who does AI, we do it at home together, it's fun — now it sounds like I'm making a euphemism, but truly, I'm really excited for a lot more people to be involved in application development with artificial intelligence and language models. But when I'm trying to hire top professionals for my team — which I'm very privileged to get to do — I need those people to have really strong priors about looking at data, making inferences based on what they see, and deciding what to do next. hugo bowne-anderson 1:22:49 What I want to know is: does anyone disagree with Brian? charles 1:22:55 I will — I'll take a slightly counter position here, which is that the user experience around models like this, the user experience of the application, is very critical, and you can't really do that part in Python. I work at Modal — it's for serving Python stuff — and I have built full-stack applications on it using Gradio and Streamlit as front ends. Even as somebody whose primary language is Python, I feel frustrated using that stack when I know how much nicer it is to do things in React or Svelte. So maybe the piece that feels a little bit missing here is that with other types of infrastructure — databases, messaging services, etc. — there is a clean API divide. A full-stack JavaScript developer should know SQL — most people seem to agree with that — but they don't have to know about operator planning.
They certainly don't need to know about SIMD vector instructions for OLAP. They can ignore all that and learn the post-it-note version: use DuckDB if it's big, SQLite if it's small. So I feel like what the AI engineer blog post is pointing towards — what people are trying to imagine — is a world in which there is some kind of divide that splits up responsibilities. Because otherwise you will need these full-stack unicorns who can follow a problem all the way from a 400 in the front end, or an angry user, all the way back to floating-point numbers on a GPU. hugo bowne-anderson 1:25:02 Great. So — we've talked about all the wonderful collaborative work you've done together, and it seems like you're all aligned on so many things. What I want to know is: during the writing process, were there any key areas where you found yourselves disagreeing or debating different perspectives, or had you all come to similar conclusions in your work? eugene 1:25:26 Sure — one was LLM-as-judge. I think the other one is that I strongly believe, even right now, that fine-tuning smaller models is probably more cost-effective and more performant in the longer run. Not everyone agrees with that; some people say, I don't know about all that, and other people champion it. bryan 1:25:49 Yeah, I'm not super bullish on fine-tuning, still. I'm still hesitant. I mean, I get a new model every two weeks — how much faster can I possibly need one? I'm sure Sonnet 3.5 is good enough, right? hugo bowne-anderson 1:26:09 Well, and to that point — Hamel, who may be the king of fine-tuning in this group, at least — eugene 1:26:15 the AI king — hamel 1:26:22 Oh no. hugo bowne-anderson 1:26:23 Hamel is the AI king of fine-tuning, and he may also be one of the people who's as bearish as possible on fine-tuning in the larger scheme of things. That doesn't mean he won't do it — you can tell us more about this, Hamel — but if fine-tuning is necessary, you'll do it. And by virtue of that, as the king, you understand the limitations of your empire and your moat, right? hamel 1:26:47 Yeah. I mean, look, I'm not a fine-tuning zealot. hugo bowne-anderson 1:26:53 You're not a zealot. You're a king with a great — yeah. hamel 1:26:56 I only do it when it makes sense, and it tends to make sense when the use case is fairly narrow — a very specialized task — then it makes a ton of sense. And when the thing you're trying to do is not moving — there's not immense drift — then, yeah, I find it makes a lot of sense. But it's not just a technology thing. There's a people thing, a business thing, and all these other factors that can derail a fine-tuning project. So it's not just "is it faster and cheaper?" It's all the people involved: can you actually use a fine-tuned model in this environment and deploy it in this company? That is a key consideration for all technology.
That's the reason why the OpenAI and Anthropic APIs, off the shelf, are so popular: you don't have to deal with your own company's internal red tape. So I would say, even more so than the use case, the biggest barrier is often the people. charles 1:28:24 I'm interested in — yes, I would like to bring up the possibility, for Brian and Hamel, of a skill issue. Sure, every two weeks one of the foundation model providers will put out a new model, but you can probably set up a fine-tuning run that finishes in an hour. You might be able to run a hyperparameter sweep at large scale that finishes in an hour. So what fundamentally prevents us from having single-minute exchange of models — to steal from my new favorite sensei, Shigeo Shingo? And Hamel, you also mentioned that there are many things that can derail a fine-tuning project — sure, but if your fine-tuning apparatus is fast and automated, supported by a high-quality evaluation framework, then it should be pretty straightforward. hamel 1:29:30 Yeah — I have a client — actually, Emil from Rechat: yesterday he got alpha access to GPT-4o fine-tuning, and he was able to deploy a fine-tuned version of it within an hour. And that is a case where that particular client of mine is insanely aligned — they're like, yeah, great, let's do it. That's the easiest kind of fine-tuning you can think of. We're not just fine-tuning OpenAI models; it's a powerful pipeline, and we're starting to fine-tune open models too, because once you have the pipeline you can go anywhere with it. But it doesn't work for everybody. I would say, if I look at the gradient across everything, the difference is often not the use case or the technology. A lot of times it's: hey, to deploy a fine-tuned model at such-and-such startup, you're going to have to have a hundred meetings and get some VP approval and some SOX compliance — and it's like, okay, let's just stop. That's what I meant about people. hugo bowne-anderson 1:30:54 I also just want to give a shout-out — I'll put a link to modal.com in the chat. If you want to do fine-tuning and get spun up really quickly, with some really nice tutorials and documentation, Modal is a really fun place to do it. I don't know how much of the dev-facing content you created, Charles, but it's really beautiful. charles 1:31:15 Thank you. My fingers bled to get TRT-LLM running. I've seen segfaults no human should ever lay eyes on, and I did it for you. hamel 1:31:33 I feel the love, man. TRT-LLM. hugo bowne-anderson 1:31:37 Bleeding like — hamel 1:31:39 Torture, yeah. hugo bowne-anderson 1:31:41 This is the AI-king-torture vibe, I think. I was also really grateful to have a livestream with Emil and Hamel about everything Hamel was just discussing, so I'll put a link to that as well.
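For the hosted-API case Hamel describes, the mechanics of kicking off a fine-tune really are small once you have curated data. A minimal sketch with the OpenAI Python client is below; the file name and base model are assumptions, and this is the shape of such a pipeline, not Rechat's or Hamel's actual one.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# train.jsonl: one {"messages": [...]} chat example per line (hypothetical file).
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off the job against a base model you actually have fine-tuning access to.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumption
)

# Poll until the job finishes, then deploy by pointing at job.fine_tuned_model.
print(client.fine_tuning.jobs.retrieve(job.id).status)
```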
I am interested — Eugene answered this briefly, but I would love to go around and hear from everyone: what's your favorite lesson in the report? And I just want to say, schedule-wise: we've had the first third of the panel, this is the second third with everyone, and the final third will be with Brian and Charles. So Eugene, Hamel, Shreya — feel free to leave at any point, but if you want to stick around, feel free to do that too. I know you all have duties, and three hours is a long time. So yeah — favorite lessons, let's go. charles 1:32:49 Who's got one ready to go? hugo bowne-anderson 1:32:52 My favorite is "no PMF before GPUs." charles 1:32:58 Other way around. bryan 1:33:00 The other way around. hugo bowne-anderson 1:33:05 I don't like that one anymore, actually. No — I prefer it the other way. Sorry. charles 1:33:11 I mean, if you can raise six billion dollars from various sovereign wealth funds, then sure, you don't need PMF for your chatbot. But for everybody else, who needs to tie capital to outcomes, you probably don't want to take on that capex or operational expense before you have some sort of business model that supports it. hugo bowne-anderson 1:33:37 Yeah — I rarely get embarrassed. I say a lot of dumb things, and I rarely say something so dumb that I blush under my beard, and that's what's happening now. So clearly the lesson is "no GPUs before PMF." But perhaps we can hear your favorite lessons, now that I've put my foot in my mouth. charles 1:34:06 It's hard, because they're all such bangers. Go ahead, Shreya. shreya 1:34:08 Yeah, I can go, and then I'm gonna start cooking dinner. Mine is the discourse between RAG and fine-tuning. I don't like it — I don't like other people's discourse on "should I do RAG or should I do fine-tuning," or "oh, we have a million-token context length, so long live whatever." I think it's all stupid. We have a whole section on it. What I particularly like is the emphasis on really putting what's useful and relevant in your prompt, not anything and everything just because you can. I know Brian has spoken extensively about this, and in general people are not optimizing for what's relevant in the prompt — and it does boost model performance a lot. charles 1:34:58 Yeah — somehow we made RAM too cheap, and now it takes eight gigabytes to have a Firefox tab. Let's not do that with RAG. Just because you can cache a million context tokens doesn't mean you should. shreya 1:35:17 Yeah. And one thing people don't talk about: when you're debugging your system, are you going to read the one million tokens in your prompt to figure out what's going on? I don't know. hamel 1:35:30 That's a great question, which leads into: no, you're not debugging your system — which is very often the case, except for the legends on this call. I see Brian looking at me very skeptically. I always remind Brian and Charles: you're some of the few people doing things correctly in this world, and not only correctly but at a very high level of excellence — maybe you've lost touch with that, you're at such a high level of excellence.
bryan 1:36:07 Hamel's always — oh, sorry, go ahead. eugene 1:36:10 I was saying Brian's gonna pull out his Hex notebook: look, guys, half a million tokens of context, no problem. bryan 1:36:22 No — totally aligned with Shreya. I want to use my Hex notebooks to deep-dive into this stuff, not shake my head at a million tokens. shreya 1:36:31 You don't want to put your whole SQLite database in your prompt. bryan 1:36:36 I've — you know, I benchmarked it. I really have, I promise you. Also, fun fact: here's the thing people underestimate about "just jam the SQL schema into the prompt." People say, oh, that's silly, that's a lot of tokens. But context windows are big. And I ask: how many columns do you think people have in their schema? People say, like, a hundred. I'm like: an order of magnitude — two orders of magnitude — higher. People have ten thousand columns in their schema. I can't get into specifics, but that's not even the biggest we've seen in terms of orders of magnitude. People have no idea of the actual scope of these things. Anyway, that's a tangent. hugo bowne-anderson 1:37:33 Well, it also reminds me of one of my other favorite lessons, besides the one I messed up, which is: look at— bryan 1:37:40 Compute's getting more expensive? No— hugo bowne-anderson 1:37:43 Exactly, no. Look at your prompts. Particularly if you're putting a lot of stuff into your prompts from RAG or whatever it is — actually have a look at what your prompts look like, and you'll see all types of really wacky stuff, right? hamel 1:37:59 Oh my God. So one day a thought occurred to me: there are all these tools we're using, right? And I like looking through the documentation — how do these things work? Because I know there's an API call to OpenAI happening somewhere in there. I started looking at the code base — where does this API call happen? I just want to know what the prompt is. And finally I was like, you know what, I'm going to do a man-in-the-middle attack. I'm going to put a proxy on my laptop, use a bunch of these tools, and intercept what the API call is. I was just curious what was going on. And I was shocked by some of the things I saw. I saw some dirty shit going on in some of these libraries, and I was like, wow. And that's why I wrote the blog post. eugene 1:39:04 And that's how the AI king, the MKBHD of AI, wrote "Fuck You, Show Me The Prompt." hugo bowne-anderson 1:39:09 Exactly — and I love the way the king framed it. The blog post isn't "oh yes, please show me the prompt." It's darker: "fuck you, show me the prompt." Can I put it in the chat as well?
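A minimal sketch of that kind of interception, assuming mitmproxy: run the script with mitmdump, route the tool's traffic through the proxy (and trust mitmproxy's CA certificate for HTTPS), and the prompts show up in your terminal. The file name is made up, and this is the general pattern rather than the code from Hamel's post.

```python
# show_me_the_prompt.py -- run with: mitmdump -s show_me_the_prompt.py
# then point the tool at the proxy, e.g. HTTPS_PROXY=http://localhost:8080
import json
from mitmproxy import http

def request(flow: http.HTTPFlow) -> None:
    # Only look at traffic headed for the OpenAI API.
    if "api.openai.com" not in flow.request.pretty_host:
        return
    try:
        body = json.loads(flow.request.get_text() or "")
    except ValueError:
        return
    # Print the messages the library actually sent -- i.e. the real prompt.
    for message in body.get("messages", []):
        print(f"[{message.get('role')}]\n{message.get('content')}\n")
```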
Awesome. I suppose there are two places I want to go. We've used the term "eval" so much, but I think I was the first person to use the term RAG in nearly two hours — a minute ago — which is incredible. Seeing as we have some real RAG heads and RecSys legends on the call — I wish Jason were here for this — Brian, and perhaps Eugene, or anyone else who wants to chime in: I know a lot of you like to think of RAG as RecSys in disguise, so I'm wondering if you'd tell me a bit more about that. bryan 1:40:14 Yeah. This is a phrase I coined last June, just being straight up. And that's because, ultimately, what you're trying to do in a RAG system is provide the most relevant content to the agent — you're making a recommendation to the agent of what might be helpful. So I don't think it's actually that big-brained an idea. I know it's gotten varying levels of reaction, but I ultimately think it's straightforward to the point where a lot of people feel this way; I just happened to say it on a podcast. But where I think the value comes from is not the memetics of it — it's the type of evaluation you can do on a RAG system. If you remember that you're making recommendations to the agent, you can ask questions like: does the order of my recommendations matter? You can ask: how often do I even get the right information to the agent for it to have a chance? If you're trying to write SQL queries against a particular schema, but you haven't shown that schema to the agent, I can give you a hint as to how often it's going to get it right. So you can answer questions like: what's my eval rate when I don't give it the right table? Close to zero. What's my eval rate when I do give it the right table? Hopefully a lot higher. Now you have two problems, and breaking it into those smaller problems is easier. So the point of that headcanon is to give you more opportunities to dive deeper into your problem. Where people have gotten maybe a little too excited is that they jump straight to things like rerankers, and straight to building an entire ecosystem around tuning RAG. I now hear about people training rerankers very early on, and I find myself thinking: have you actually tested how much ranking matters for your agent-based pipeline? One fundamental difference between a RAG system and a RecSys system: if I'm talking to Charles and he says "I'm going to the beach, I need a book recommendation," I can give him three recommendations. If I give him more than three, there's no chance in hell he's going to remember them — and that's not a statement about Charles, that's just the way we are with recommendations. Agents have longer context windows; three is not that much. Let's be reasonable and not give them 100,000, but I can give an agent 15 recommendations.
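A minimal sketch of that two-part evaluation — retrieval hit rate, plus answer accuracy conditioned on whether the right table made it into the prompt — assuming you log one record per eval question. The records below are made up for illustration.

```python
# Each record: (question_id, right_table_retrieved, answer_correct)
records = [
    ("q1", True, True),
    ("q2", False, False),
    ("q3", True, False),
    ("q4", True, True),
]

hits = [r for r in records if r[1]]
misses = [r for r in records if not r[1]]

def accuracy(rs):
    # Fraction of records whose answer was judged correct.
    return sum(ok for _, _, ok in rs) / len(rs) if rs else float("nan")

print("retrieval hit rate:             ", len(hits) / len(records))
print("answer accuracy | right table:  ", accuracy(hits))
print("answer accuracy | wrong/missing:", accuracy(misses))
```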
So I do think top-k accuracy is a very different situation in agent recommendation. And I find myself in this weird, combative stance where the people who don't think of it as a recommendation system at all — I'm sorry, you're wrong — and the people who say it's just a recommendation system — I'm sorry, you're also wrong. It requires that nuance and that familiarity with the field. I'm very fortunate that this particular group has some OG RecSys people — both Eugene and Jason are very experienced in RecSys — so within this group we didn't have a lot of clashing. But yeah, those are my feelings on the whole RAG/RecSys dichotomy. Eugene, does that resonate? eugene 1:44:33 Yeah, it does. This is going to be my last comment before I drop off: I think Brian just said everything you need to know about RAG. And Hugo mentioned this is the first time we've actually mentioned RAG on this podcast. A lot of people ask me about this — they chat with me and say, Eugene, what's your take on RAG? Why are you not talking about it? Because RAG — I don't have a prompt, I have never run a single query, that doesn't involve RAG. It's just a way of life. You want to be feeding your query with the best possible data you have, be it your company policy, your previous chat, your database schema, your product reviews, etc. It's just a way of life — so why talk about it? Everything should just be retrieval-augmented generation now. Then Brian spoke about how you actually want to evaluate it — well, with retrieval-augmented generation it's easy to evaluate the retrieval, easier than evaluating the generation, I think. So this just makes sense. All right, peace out. hugo bowne-anderson 1:45:45 That's definitely more than two cents. Appreciate your time and wisdom and spirit, Eugene — see you on the other side. I've got one more question for Hamel. Do you want to come say hi, Hamel? Because a lot of what we've been talking around is the importance of data and evals — data on one side, evals on the other, but really part and parcel of the same way of thinking, and very closely aligned. I want you to tell us a story — it's story time with the king — and the story I want you to tell figures slightly in the report. It's the story of you working with Honeycomb, and working with Philip: how you got an LLM to be a judge, and then had some meta inception of LLM and human judges going on. What was your process there? Maybe start by just telling us a bit about the text-to-SQL — sorry, text-to-Honeycomb-Query-Language — problem. hamel 1:46:49 Yeah, okay, let me see if I can tell the story. So Honeycomb is an observability platform that you can log traces to — traces are just sequences of events — and it's used a lot in the DevOps community, especially in distributed systems like Kubernetes, where you have lots of different events and you want to correlate them. This database has a domain-specific query language called the Honeycomb Query Language.
And the problem with the Honeycomb Query Language is that no one knows it, so it's a big onboarding problem. So Honeycomb said: okay, we have AI — maybe we can just do natural language to Honeycomb queries. It started with that. And one thing they wanted to build was a private version of the model, because not all customers can send their data to OpenAI — so is it possible to have a private model? So when I began the project, it was basically: can I fine-tune a model? It had to be small, it had to be really performant; and to make it really appealing I wanted to see how far I could push it — can a really small model be just as performant as OpenAI on this task, and so on. So the first thing I did was get the data. And here is a very classic data science mistake. It's like: hey, do you have data with good examples? And my client says, yeah, I have curated this data very carefully. I went through it, and there were some really good examples — here's a thousand examples. I'm like, great — it's like a Kaggle competition: you have the data, and I'm going to fine-tune a model. Okay, he told me the data is good, so let me just go with it, let me trust that the data is good. So I fine-tuned the model, deployed it, and then asked him: can you just eyeball it, test it? And he's like, oh, these queries are actually not that good — kind of mediocre compared to what I'd expect. I'm like, huh, okay, can you give me some examples? He gives me some examples, and then I go look in the data he gave me, and there are very similar natural language questions with very similar queries. I'm like, do you see this? This is almost the same thing in your data as the thing you're flagging. He's like, oh yeah — yeah, the data sucks. Okay. So in a classical machine learning problem, you'd just get stuck there. You'd be like, you know what, that's it, I give up — now I have to go get data, it's going to be a long, expensive process, I have to deeply understand this domain. And so, the magic of large language models: okay, wait, hold on a second. I need to know whether each of these examples is good or not, and I don't have all the knowledge my client has as a domain expert. I wish I had him — his name is Philip — I wish I had Philip in a box, some artificial version of Philip that I could just query. So you know where this is going. hugo bowne-anderson 1:50:33 I hope I know where it's going. You're not literally putting Philip in a box. hamel 1:50:39 If you're not in the gutter, then you know where it's going. So basically I said: hey, Philip, can you label some examples? I think I need a few hundred examples to start with. He's like, you know, I don't have time for that —
I'd have to read each query. And I'm like: I need a critique of each query — why it's bad — so I can understand what the hell is going on. He's like, okay, I really don't have time for this; it's going to take up all my time to write these detailed critiques. I'm like, okay, that's pretty fair. So what I did was prepare a spreadsheet and say: every day I'm going to email you 20 examples, and all you have to do is those 20 — it's not going to destroy your life. So I had him label 20 queries a day and write critiques. Then, on each iteration, I would make the LLM-as-judge align more with Philip, based on his critiques — just through prompt engineering. Eventually I got the LLM judge to write critiques that he agreed with at a very high rate — enough that I basically had Philip in a box at that point. Then I used that to make all the data better, because I could have the LLM judge write the critique, respond to the critique, improve the data, and also help me generate massive amounts of synthetic data — and all of that data is now much higher quality. Then we were able to fine-tune the model and get really good results. What's really interesting is getting unblocked on data — that's one thing that's super exciting with LLMs: you can get unblocked on data. I think Brian has some examples of how he's generated entire worlds. hugo bowne-anderson 1:52:43 I'd love to hear that right now. Someone has started the hashtag #freePhilip, which I think is important. But thank you for sharing that. One thing I love about that story is that it really demonstrates the importance of looking at data as a practice — Zen and the art of looking at data, or something like that. It's not a one-off thing you do; it's like drinking water, or breathing, or doing tai chi before breakfast, whatever it is. So, having said all that, I would love to hear from you, Brian, about building synthetic worlds. bryan 1:53:28 I can't get into all the details, but I can say a couple of things. One thing we definitely do a fair bit of is human-labeled data here at Hex. I call it homework — I joke that we have homework — and everyone on the team gets an interactive Hex application where they can see whatever it is we need more signal on from humans: whether it's generations from the model, where we say good generation or bad generation; whether it's two generations and saying which one is better; whether it's a classification problem we're asking the agent to start doing for us, so we do manual classification — we have a couple of hard classification tasks with, you know, seven classes, and we'll use humans to bootstrap those. And to be clear, the humans are people on my team. These are not random people we find — though I do need Philip's phone number so I can let him know he has some homework. But besides Philip, they're all people on my team.
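A minimal sketch of the alignment loop Hamel describes — an expert labels a small batch each day, and the judge prompt gets edited until the judge's verdicts agree with the expert at a high rate. The judge below is a stub (in practice it would be an LLM call), and the data format is an assumption for illustration, not Hamel's actual setup.

```python
# Stub judge: swap in a real LLM call that returns a verdict and a critique.
def llm_judge(judge_prompt: str, question: str, query: str) -> dict:
    return {"good": "WHERE" in query, "critique": "stub critique"}

def agreement_rate(judge_prompt: str, labeled_batch: list) -> float:
    hits = 0
    for ex in labeled_batch:  # each ex has: question, query, expert_good (bool)
        verdict = llm_judge(judge_prompt, ex["question"], ex["query"])
        hits += verdict["good"] == ex["expert_good"]
    return hits / len(labeled_batch)

judge_prompt = "You are a Honeycomb query expert. Critique this query..."  # edited by hand each round
daily_batches = [  # ~20 expert-labeled examples per day in the real process
    [{"question": "slow requests?", "query": "WHERE duration_ms > 500", "expert_good": True},
     {"question": "error rate?", "query": "COUNT", "expert_good": False}],
]

for day, batch in enumerate(daily_batches, start=1):
    score = agreement_rate(judge_prompt, batch)
    print(f"day {day}: judge agrees with the expert on {score:.0%} of examples")
    # Read the disagreements, edit judge_prompt, repeat with the next batch.
```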
So I do think this is very much part of our process, 100%. And one of the things Hamel is alluding to here is that once you've got these initial bootstrapped data sets, you can use them to calibrate, or align, your LLM-as-judge — your LLM evaluation — like he was talking about. It feels like a souped-up version of bootstrapping from the olden days of statistics, and it is quite powerful. I will just say it has been a very valuable process. The other thing you can do is make synthetic data sets where the inspiration for the initial data set is real and human-labeled, and then you expand on that corpus more and more. One thing you might realize is that our evaluations for SQL need to query databases. So something that's really nice to be able to do is say: what are some particular shapes of data that we're struggling with? We want more and more examples of solving problems in that ecosystem. Well, it turns out synthetic data generation is something LLMs are quite good at. So being able to go from a vague statement to a data set that's custom-built for that particular shape, that particular pathology, is something we're capable of doing — and in fact we do. It lets us go from 25 examples, which is usually our lower bound, up to a human eval set of around 200 examples, and then bootstrap that to — recently — 7,000 in one case. So I think there's a lot of power in this process. And it's fun, because Hamel and I haven't talked about this, but we kind of co-evolved into the same process: use a human, learn from the human, benchmark a little, bootstrap a little, get that to a state, then use a human to pull it back into alignment, and bootstrap from there. I've had to do evaluation for other kinds of ML models over the years, and I've never felt so empowered to build phenomenal evaluation suites as I do today. hugo bowne-anderson 1:57:11 Awesome. Something I love about that response, and about a lot of this conversation — and the report as well — is that we're really focusing on processes and methodologies rather than tools, besides the glorious notebook. But I'm wondering, Charles, from your experience and vantage point: in terms of methodologies and processes, what are some of the key process-level considerations most critical for ensuring success when building with LLMs? charles 1:57:48 I think the ability to bring back information from production seems like the most critical piece. Going all the way back to the before times — people training their own neural networks before the fire nation attacked, i.e. the release of foundation models — it's a non-trivial lift to set that up. Production data is treated very carefully; you frequently have an iron wall between your ML teams and your production teams, with a simple handoff of an artifact and no communication back. Honeycomb is actually a great system for this — it's a generic system for this kind of event tracking.
And that's kind of what you want to do. There are more specialized tools — the LLMOps tools are oriented to solving this problem of getting information back from production, from the production behavior of your LLM systems. What makes that useful is then also an experimentation-oriented workflow: the ability to try things out very quickly and easily. Evals should be easy to run and straightforward — like your tests, where you run a single command-line command with minimal configuration and it runs your evals, and you can drill down and pick a specific one, or adjust your flakiness setting or whatever. The ability to quickly run those things, iterate on them, and track what you were doing as you go — that feels like the other critical piece, both from a cultural perspective of caring about and supporting experimentation, and from the tooling that supports it. hugo bowne-anderson 2:00:05 With respect to the tooling — I think I asked Brian to prepare a question for you, and that leads really nicely into it. Brian, do you have the question? bryan 2:00:16 Yeah. It's basically: in this LLMOps world, what do you think is actually required? What do we really have to buy, and what can we build and bootstrap from gum and shoelaces? charles 2:00:36 That's a good question. I think there are maybe two components, a back end and a front end, and it's probably hard to build both of them yourself unless you have a lot of resources. By back end I mean the ability to collect this information from production or from annotation teams. It sounds like you have a really great system for that — from everything you've said in your talks about what you do at Hex, maybe unsurprisingly for such a data-centric organization. I don't know that everybody can build that themselves. And then on the other side you have the front end, which is: how do you interact with this information? How do you discover the patterns in this data? bryan 2:01:30 That one you have to buy, and it's Hex, and it's the only way. charles 2:01:34 Oh, Hex — how's that spelled? H, E, C, K, S — is that it? bryan 2:01:41 That's what we say internally sometimes, but not like that. H, E, X, I think, is what you're shopping for. charles 2:01:49 I was actually using that platform recently to analyze some data, and there was this nice AI copilot that wrote several queries for me, which was very nice. But then my computer ground to a halt because it consumed eight gigabytes of RAM, which was all the RAM on the machine. But boy, did I get some data science insights out of that cache, I'll tell you that much. The trick is: buy a better computer — it's actually, you know, a RAM issue. But yeah, I think you really are onto something with the idea that it's a notebook environment.
When we were at Weights & Biases, you may recall that I was a very big proponent of the flexibility of the Weave query language and the Weave platform. It was like: this is exactly what I need — I send a logging service high-cardinality, semi-structured everything I could possibly need to figure out the problems in this system, and then there's an interface for me to discover the information in there. I think you can dissociate the front end and the back end: you can have some Grafana / Prometheus / OTel kind of thing pulling information in, and then consume that from a front end you build yourself — or I expect people to build more LLMOps tooling that consumes those sorts of things — or the people trying to be LLMOps vendors are going straight for "let's put all of that in a single package." Depending on the team, building one part of that yourself might make more sense. For others, if they already have the lift of learning data science — having previously convinced themselves they only needed to know software — then they probably shouldn't also build a platform they don't know how to build; the resource requirements are really high. bryan 2:03:49 Yeah. For what it's worth, since you kind of implicitly asked: we write a lot of our events as either Segment events or traditional events to a database, we transform them with Fivetran into Snowflake, and then we read directly from Snowflake. And how do you read from Snowflake? That one's with Hex. charles 2:04:13 What do you know — so that's the notebook environment you're referring to. bryan 2:04:18 Indeed, indeed. So yeah, our stack is incredibly lean. We have Fivetran separately for the data team — mixed feelings there — and we have Segment, which is also for traditional product analytics. charles 2:04:38 Yeah, that makes sense. You need to define your traces. And the thing I like about, for example, LangSmith — or, going way back, only the real 90s kids will remember this, but Elicit, or close to it, had a tracing tool from when they were trying to get early GPT-3 models to work well for things. They were all about evals; they were also all about shrinking problems down so they're easier to solve, which is another thing that came up in our report — Eugene is a big advocate of decomposing problems, and you've mentioned it as well. They had their Interactive Composition Explorer, ICE, and it would just grab every async Python call. The assumption was: any call you make async is probably important enough to log — it's probably a call to an external service for information retrieval, or a call to a language model — so let's just trace all of those. The nice thing about that tool, and about LangSmith, which hooks into the LangChain ecosystem and automatically grabs things, is that it defines all of your trace events for you, it defines your spans for you, and it puts that a little bit closer to the Python programmer.
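A toy version of that "trace every async call" idea might look like the decorator below — not ICE's or LangSmith's actual implementation, just the pattern, with an in-memory span list standing in for a real logging backend.

```python
# Wrap async functions so each call is recorded as a span with name and timing.
import asyncio
import functools
import time
import uuid

SPANS = []  # in a real system these would go to your tracing/logging backend

def traced(fn):
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        span = {"id": uuid.uuid4().hex[:8], "name": fn.__name__,
                "args": repr(args)[:200], "start": time.time()}
        try:
            return await fn(*args, **kwargs)
        finally:
            span["duration_s"] = time.time() - span["start"]
            SPANS.append(span)
    return wrapper

@traced
async def retrieve(query: str):
    await asyncio.sleep(0.01)  # stand-in for a search/retrieval call
    return ["doc-1", "doc-2"]

@traced
async def answer(query: str):
    docs = await retrieve(query)
    await asyncio.sleep(0.01)  # stand-in for the model call
    return f"answer based on {docs}"

asyncio.run(answer("why is checkout latency up?"))
print(SPANS)
```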
bryan 2:06:21 But let me maybe push back on one side of this. Okay, so, sure, I hear you on the Segment thing — I have a lot of mixed feelings — but let's come back to this trace thing. I love traces. I've been beating the drum for traces since I was at Stitch Fix, where we had Lightstep, and it's super convenient. When I was at Weights and Biases, I had suggested that we have a trace viewer at Weights and Biases as a core feature. For a long time, you and I were very aligned here.

charles 2:06:51 I did basically demand a PyTorch trace viewer.

bryan 2:06:55 I remember — and yes, that was wonderful. But there are a lot of great reasons that, when you're building these kinds of applications, you need to write events to a database, and those events can be used for control flow. And what I claim is that there's not a lot of difference between the events that you write for control flow and the events that you need in your trace log anyway. Where it can be a little bit funny is if you're using a microservice architecture, because then you have spans across different services — but they all have trace IDs anyway. So I guess my only pushback here, and why we're not currently a customer of LangSmith, which I think is a cool product, is that we have trace IDs on all these events and we use these events for control flow. And so what I end up finding is: okay, all I've got is all my spans tied together by trace IDs; it's five minutes in Hex to make a data visualization that gives me my trace viewer anyway. So I just don't see the extra value in a real trace viewer. And you might say, well, cool, are you not a customer of Datadog, then? And the answer is: no, we are a customer of Datadog. So, I mean, I know that there's a limit here, and maybe it's a billing thing, and maybe it's a, you know, fidelity thing. But I don't know — I think most of the kinds of things you want to trace for LLM stacks are damn similar to the control flow.

charles 2:08:53 Okay, fair enough, I guess. Are you saying there is not a trace viewer visualization in Hex notebooks that I can turn on? Oh, there is? Good, good, good.

bryan 2:09:06 I mean, it's Python. You just build it yourself.

charles 2:09:10 Oh, I mean, yeah, I guess — you've got that Tony Stark in a cave with some rocks kind of energy. Build it.

bryan 2:09:24 Yeah, I'm lacking some of the charisma — or riz, sorry, the rizz, as they say. But other than that, yeah: brutish and somewhat evil.
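[Editor's note: a toy version of Bryan's claim above that events carrying trace IDs are a few notebook lines away from a trace viewer. The event schema is made up, and the DataFrame stands in for a query against your warehouse.]

```python
# Events already carry trace_id, span name, and timestamps, so a Gantt-style
# trace view is just a filter plus a horizontal bar chart.
import matplotlib.pyplot as plt
import pandas as pd

events = pd.DataFrame([
    {"trace_id": "t1", "span": "plan",     "start": 0.00, "end": 0.12},
    {"trace_id": "t1", "span": "retrieve", "start": 0.12, "end": 0.48},
    {"trace_id": "t1", "span": "generate", "start": 0.48, "end": 1.90},
    {"trace_id": "t1", "span": "postproc", "start": 1.90, "end": 1.95},
])

trace = events[events["trace_id"] == "t1"].sort_values("start")
fig, ax = plt.subplots(figsize=(6, 2))
ax.barh(trace["span"], trace["end"] - trace["start"], left=trace["start"])
ax.set_xlabel("seconds since trace start")
ax.set_title("trace t1")
plt.tight_layout()
plt.show()
```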
charles 2:09:35 Yeah, hmm, yeah. Okay — wait, maybe one last point on this. One of the things that I found surprising and interesting about the product direction of LangSmith is the orientation to enrichment: you can either trigger enrichment semi-automatically, which involves humans going to an annotation flow, or humans can just go in there, type a little note, and be like, "this one looks sus." And that is in the same database that is storing all of your trace information — you add these kinds of semi-structured JSON keys and tags and metadata. What I like about that is that the thing LangSmith is trying to build is pretty close to what we describe in our report as the place where value accumulates in LLM applications: it's got information from production, it can be turned into evaluation sets, it brings in product information, and it's accessible to a product manager and to the technical team delivering stuff. So, maybe to bring up a point that some of the others, like Hamel, were making while you and I were seemingly out of the room: if we were to build this ourselves, we would build that. But not everybody would — they would maybe build something that doesn't include that, that doesn't actually serve as the full store of value. So there's utility in buying the right thing when you might build the wrong thing.

bryan 2:11:30 Yeah. And on that one — again, now I feel like I'm in my cave, and everything is peaceful and warm in my cave — but you're like, oh, I want to pull down all these traces, I want to manually inspect them, I want to dive into them, and at a certain moment I want to add back some data. And I'm like: cute. I pull that into my Hex notebook, I do my deep dive, and then I use my write-back cell and literally write it back to Snowflake. It's in my same-ass data warehouse. And I know the modern data stack is dead — you know, RIP — but I do think this is one of the things we loved about the modern data stack: it's all in the data warehouse. Where is it? It's in there. You can find it. It's in there. So, I don't know — like I said, I do feel a little bit heavily biased, but yeah, that's my rejoinder, I guess.
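[Editor's note: a generic sketch of the enrichment-and-write-back loop discussed above — pull a trace, attach a human note or tag, and store the annotation next to the trace so it can later become an eval set. SQLite stands in for Snowflake or any warehouse; the table and column names are invented.]

```python
# Annotations live next to the traces they describe, so flagged production
# cases can be queried back out later as evaluation data.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traces (trace_id TEXT, input TEXT, output TEXT)")
conn.execute("CREATE TABLE annotations (trace_id TEXT, label TEXT, note TEXT, meta TEXT)")
conn.execute("INSERT INTO traces VALUES ('t1', 'how do I cancel?', 'Sure! Click Cancel twice.')")

# "This one looks sus": a reviewer flags a trace and adds semi-structured metadata.
annotation = {
    "trace_id": "t1",
    "label": "needs_review",
    "note": "tone is fine but the steps are wrong",
    "meta": json.dumps({"reviewer": "pm", "add_to_eval_set": True}),
}
conn.execute("INSERT INTO annotations VALUES (:trace_id, :label, :note, :meta)", annotation)

# Later: join traces and annotations to build an eval set from flagged cases.
rows = conn.execute("""
    SELECT t.input, t.output, a.note
    FROM traces t JOIN annotations a USING (trace_id)
    WHERE a.label = 'needs_review'
""").fetchall()
print(rows)
```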
charles 2:12:25 Got it. Well, you know, I've only been using Hex for about two weeks, so give me some time to get Hex-pilled fully. Okay?

bryan 2:12:33 Okay, yeah — we usually estimate like 19 days, so I'll see you in a couple.

hugo bowne-anderson 2:12:40 Got it. So you've seen the hashtag — hashtag hex-pilled — as well, Charles, I presume.

charles 2:12:46 Oh yeah. I mean, we all know what Brian's alt is on Twitter.

hugo bowne-anderson 2:12:52 I know at least three of them. I love that you referred to Brian as Tony Stark in the cave, banging rocks together or something like that, because Eugene said in the chat, prior to you saying that, that you can tell Charles is getting angry when his mohawk starts to glow and his face turns green, like the Hulk. So we're really bringing some Avengers energy to this. Whoa — that was so Avengers, dude. Oh, wow, you have done it. Yeah, you win. You just won. Ding. We may have already kind of covered this, but in the interest of symmetry, you did prepare a question for Brian as well, and I would like you to ask him that. You've already been talking around LangSmith.

charles 2:13:41 That's true. I've effectively asked it already, but I'll ask it literally. Hey, Brian, I really like LangSmith, and I have used it to debug LLM applications, and I have to admit that I have recommended it to people. Are you mad at me? Yes, no, maybe?

bryan 2:14:03 Absolutely not. Here's why. One, I think that team is really great, and I think they've got good intuitions, and when a team has good intuitions, they're certainly going to be building things that have potential for value. What I've seen of the product so far, I think, is incredibly aligned with what I want. If I weren't in the cave with my crappy iron that I have to forge into an ugly suit, would I be entertaining something like LangSmith? Maybe. I'm certainly not mad, and I'm certainly not even skeptical that it's wrong for everybody, or anything like that. I feel very passionate that the time to buy tools is when you have identified the delta between what you can get to in a short amount of time with your resources, and what can really make an impact on your product and your team velocity. So if I look over at LangSmith and that workflow is going to accelerate my team's ability to act on user interactions — oh yes, then it's time. But not until then; I tend to be a little bit of a "build it, and then buy it when it really becomes that transition time" person. So I just think that, for me, it hasn't been that time yet. I talked to Harrison about whether Hex would be a LangSmith customer, and I find that the production-environment side doesn't feel like it hits for me — but actually some of the offline workflow development, and the evaluation loop it can build, some of that was really exciting to me.

charles 2:16:22 And it's not because the Weights and Biases Weave offline experimentation platform met your needs?

bryan 2:16:29 So, yeah, let me officially do my disclaimers of my conflicts of interest. One, I have a conflict of interest in that Harrison is a very nice guy and very smart. Two, I am an ex-employee of Weights and Biases, and I am very aware of the smarts and the knowledge of that team. So those are my disclaimers. But I think what's interesting about Weave is it feels very notebook-focused, and that really, really appeals to me. What's also interesting about Weave is that it interfaces directly with the rest of the Weights and Biases ecosystem, which — let's be real — is the fucking best for experiment tracking.

charles 2:17:22 Conflict of interest? Obviously not — and I'm also not a Weights and Biases ex-employee, so I would agree. I've never met the man before in my life.

bryan 2:17:31 Right. But genuinely, for experiment tracking, let's not even have this conversation — it's pointless. But that relationship between experiment tracking for LLMs and how much of our LLM workflows are experiments — there's a really deep and nuanced thing there. I think that interface is really exciting and interesting. And so what I love about the Weave approach is precisely how it links back to the rest of your experimentation flow. We're doing experiment flows in Hex, and we're tracking them very similarly. So if you ask me what's more important here, LangSmith or Weave, for the iteration and experimentation?
Yeah, I think if I were more of an expert on both, I could give you a very blunt answer. But from my outside take, I think I kind of do lean towards Weave — sorry. But on the production flow side, I don't really care how you build your trace viewer. I don't really care how you get that data, as long as you're integrating with it and interacting with it — that's what's really wonderful. Whether you're using Braintrust or Zeno or Patronus or LangSmith or Log10 or Humanloop or PromptLayer — oh, good one, yeah — or Autoblocks: these are all my beautiful children, and I buy them all. I am a customer of all of them, and they are important parts of my stack. Please invite me to your AI dinners.

hugo bowne-anderson 2:19:33 Just to be clear, this podcast is not sponsored by anyone. This is a labor of love of me and all of us, and none of those vendors have sponsored it.

charles 2:19:44 It's not sponsored by any of the venture capitalists, you know — in the same way that the city of San Francisco is not sponsored by the venture capitalists, or in the same way that the room is not sponsored by oxygen. Haha.

hugo bowne-anderson 2:20:00 Yeah, we're totally on the same page. So I do want to step back a bit, because there are several other moving parts of the report that I really like, and I'd like your thoughts on them — stepping back from the more technical side of things. You all discuss the importance of building trust with stakeholders and users when deploying AI-powered software, or LLM-powered applications. What strategies or best practices have you found effective for establishing and maintaining trust?

charles 2:20:32 Yeah: get people in the room. I think that's one point that I liked — get designers and UX people in the room from the beginning. A lot of the implicit position of the report is for applications where the developers are sufficient domain experts to confidently grade LLM outputs. But for, you know, medicine or law or other very specialized domains, that might not be the case, so getting those people in allows you to find the shibboleths — the very easy, low-hanging fruit where, if you violate it, you will lose their trust — and to co-design with them. There's some good stuff from Carrie Cai, who does a lot of great human-computer interaction work at Google and with folks at Stanford, about techniques for bringing these people in and involving them in application development. And I guess, also, rolling stuff out slowly. I think frequently issues or problems are discovered only after several iterations of interaction, so if you roll something out to a small group, somebody in that small group will interact at least ten times with the system and then discover this bad pattern. There's more value to that repeated-measures interaction than to going to ten times as many users at the beginning. That gives you the benefit of operating at a smaller scale at first, and of correcting issues before
you tell people to eat pizza with glue on it. You only tell one person, who's your friend, to eat pizza with glue on it, and then they help you fix it instead of dunking on you on social media.

hugo bowne-anderson 2:22:32 Absolutely. Brian, is there anything you'd like to add to that?

bryan 2:22:41 I think the antidote for demo-itis is user feedback. Last summer, I went to as many of the little AI meetups as I possibly could, and my motivation was not to eat really crappy finger food. It was actually to just quickly turn my laptop around, do one thing with Magic, and see how people reacted. This was while we had an active beta going on, and just seeing people's reactions to different things was super, super interesting. I had people who immediately said, "I don't know what's happening," and I was like, okay, great, that's lovely to hear. And then I had people who were immediately like, "Can I do this? Can I do this?" And this was while we were building a new product in secret, so I couldn't show the new product, but I was gaining feedback with the thing that we already had in beta, and seeing how people were thinking about what it was doing. I learn an absolutely ridiculous amount from every bit of feedback that I get. So, I don't know, I'm terrified by people who spend two years in stealth before launching anything. And as much as we're all tired of the AI hype demo, what I am very excited for is the AI hype demo with a sign-up list, where they start rolling it out to users. That's a whole different game.

hugo bowne-anderson 2:24:29 Yeah, absolutely. And something we've been talking around also is building systems — focusing on systems and not models. Brian, I know you have a lot of experience building end-to-end systems with LLMs, so I'm wondering if you could share an example, or ideas of examples, where a systems-level view was particularly important for achieving the goals of the project.

bryan 2:24:56 Yeah. I mean, I started at Hex on the last day of February 2023, and at the time there was one engineer on my team. He flew to California, I think two weeks later, and we went to the whiteboard and drew out our prompt templating architecture, our evaluation architecture, and our context construction and RAG architecture. That was what we did with that week. The RAG architecture is still in prod, with a lot of changes to certain aspects of it, but the fundamental architecture hasn't changed. The prompt architecture hasn't changed almost at all, including the appendix thing that we have: we inserted this thing into our system called a conformer layer, and we were like, this is going to be really important. And it's very much an appendix, in the sense that it has a purpose, but it sure is easy to misunderstand why. And then our eval architecture — that lasted us nine months, and then we ultimately did have to rewrite it. But having seen ML systems meant that, two weeks into this job, I knew: okay, evals are important; building something that's composability-first for our context construction is important.
And then thinking about our prompt construction flow as, essentially, meta-programming. Those were the three things where I was just like, cool, I'm stealing this from my previous experience. And then, of course, the RAG-is-RecSys thing. Those were all things I pulled straight from textbook ML — if you go and pull down the book Machine Learning Design Patterns, everything I said in my first months at Hex could have come straight from that book. Shout out to the authors of that wonderful book; it's still my all-time favorite O'Reilly book.

charles 2:27:26 Yeah — hopefully not your all-time favorite O'Reilly document, though.

hugo bowne-anderson 2:27:37 I believe so. Charles, Brian — one of the things Brian mentioned was composability, and last time we spoke this is something you and I chatted about quite a bit.

bryan 2:27:47 Oh, do you have opinions on this, Charles?

charles 2:27:49 Oh, do I have the monad laws on a chalkboard in my room in the back of this shot? Oh, yeah. Actually — oh, wow. Brian, if you could indicate in which Kleisli category your contexts compose, that would help me understand your previous claim regarding compositionality.

bryan 2:28:11 We used to call it "clee-slee." Is it "kly-slee"?

charles 2:28:18 I went with a German pronunciation, but that's probably wrong.

bryan 2:28:22 No, I'm not actually sure.

charles 2:28:25 I feel like you're probably stonewalling because you don't know in which monad your contexts compose.

bryan 2:28:32 No, it's a totally valid criticism. I'm not sure — I usually just stick to the monads. I mean, let's see. All I really care about is that these arrows — well, okay. I care about the types of the interfaces, and I care about composition purely in the sense that there is an arrow in the union category. So that is a stronger monad than, like, the weakest one.

hugo bowne-anderson 2:29:12 So for people not versed in monads and category theory —

bryan 2:29:16 Who isn't? Sorry, wait, what audience do you appeal to?

hugo bowne-anderson 2:29:20 Oh, dude, you know, I eat spectral sequences for breakfast, but it's —

bryan 2:29:25 I don't think that's hard.

hugo bowne-anderson 2:29:27 Horrifying, actually.

charles 2:29:31 Oh, I guess maybe the thing I would dial in on, in what Brian said: you mentioned unions, which are sort of splits of things, or pairs of things where you have one or the other — you know, Either in Haskell, or Left and Right. So you're saying your notion of contexts composing is that there are lots of forking paths, as opposed to, say, concatenation as a form of composition, right? That's a more product-y or intersection-y approach.

bryan 2:30:04 Right, right, yeah. I am not just concatenating. I'm sort of saying — I have to do some currying sometimes, and I have to — yeah, this is a little bit more —

charles 2:30:24 Got it, got it. Sounds like a bicartesian closed category. We should whiteboard together.
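[Editor's note: setting the category-theory banter aside, here is one illustrative reading of "prompt construction as meta-programming" with composable context builders. This is not Hex's actual architecture; the builder functions and state keys are invented.]

```python
# Each builder takes the request state and returns a block of context (or
# nothing); the prompt is assembled programmatically rather than as one giant
# hand-written string, and builders compose into bigger builders.
from typing import Callable, Optional

ContextBuilder = Callable[[dict], Optional[str]]

def schema_context(state: dict) -> Optional[str]:
    tables = state.get("tables")
    return f"Available tables: {', '.join(tables)}" if tables else None

def history_context(state: dict) -> Optional[str]:
    turns = state.get("history", [])[-3:]  # keep only the most recent turns
    return "Recent conversation:\n" + "\n".join(turns) if turns else None

def compose(*builders: ContextBuilder) -> ContextBuilder:
    def combined(state: dict) -> Optional[str]:
        blocks = [b(state) for b in builders]
        return "\n\n".join(block for block in blocks if block) or None
    return combined

build_context = compose(schema_context, history_context)

prompt = "\n\n".join(filter(None, [
    "You write SQL for analysts.",
    build_context({"tables": ["orders", "users"], "history": ["user: weekly revenue?"]}),
    "Task: answer the user's question.",
]))
print(prompt)
```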
hugo bowne-anderson 2:30:32 How are these things applicable to the world of LLMs, more generally? I'm so sorry to ask this question as well — I feel rude.

bryan 2:30:40 Thing go into thing.

charles 2:30:44 If you're going to do that much meta-programming, you should have a nice little programming language theory to go with it. You know, might as well.

hugo bowne-anderson 2:30:54 Um, did you say —

bryan 2:30:56 — that we were over-educated at some point, Hugo? Is that what I think I heard?

charles 2:31:00 We do have only PhDs on the call at this time, I think. Yeah, it shows.

hugo bowne-anderson 2:31:07 That's what's up. That's what's up. And, look — I've definitely never felt well versed in category theory, because the people who are well versed are pretty epic, but at least three people here can talk category theory. I mean, somebody else needs to join, essentially. So I do want to go back: we have been talking around data literacy, AI literacy, and education. Charles, I'd love your thoughts on this first, given what we've talked about — being able to look at data and evals. What are some of the key skills or conceptual understandings that you believe everyone working with LLMs should prioritize developing?

charles 2:31:50 Yeah, I think the primary conceptual gap that I see between software engineers and data scientists — or between people who have no data literacy and people who do — is the tendency to care about the semantic content of bytes. When I work with software engineers — you know, front-end engineers who are playing with a form — they will key-smash to fill in all the fields in the form. And when working with more systems-platform-engineer type people, they'll frequently have these mock byte sequences that are just random bytes or whatever. That speaks to an orientation to the computer system that says: the content of these bytes doesn't matter; they just have the right type, and therefore the system will run correctly. And that's precisely the opposite of what you need in data. What we do with models is learn a probability distribution over the data domain, so what makes our model good or bad is not the types of its inputs but the relationships to the specific values in that type. It's a different orientation, where an extra capitalization versus not capitalizing, or a slightly different framing of a question, will change the behavior of the system. Or: the way you examine data is not just by checking that it was encoded and decoded correctly and that there aren't stray escape sequences — you do still need to do things like that — but you also need to be worrying about whether it accurately represents the problem domain you're trying to solve. Does the synthetic data cover the part of the domain that I care about? Does it have the right proportion of these things? It's been surprising to me how big of a perspective gap that is — how big of a jump that is for traditional software engineers. It took me a lot of effort to learn the software engineer's perspective, but I guess I was surprised that the other way around was so hard.
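[Editor's note: a small illustration of the "semantic content of bytes" point. Both datasets below pass a purely structural check; only the second check asks whether the values cover the domain in sensible proportions. The categories and thresholds are made up.]

```python
# A type check tells you the shape is right; a coverage check tells you whether
# the data actually represents the problem domain in the proportions you expect.
from collections import Counter

synthetic = [("How do I reset my password?", "account")] * 95 + \
            [("Why was I double charged?", "billing")] * 5

def type_check(rows) -> bool:
    # What a purely structural test looks at: shapes and types only.
    return all(isinstance(q, str) and isinstance(label, str) for q, label in rows)

def coverage_check(rows, expected: dict[str, float], tolerance: float = 0.15) -> dict:
    # What a data-literate test looks at: the distribution of values.
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    report = {}
    for label, expected_share in expected.items():
        actual = counts.get(label, 0) / total
        report[label] = (round(actual, 2), abs(actual - expected_share) <= tolerance)
    return report

print(type_check(synthetic))                                        # True, but tells you little
print(coverage_check(synthetic, {"account": 0.5, "billing": 0.5}))  # billing badly under-covered
```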
bryan 2:34:34 One thing there — and this might just be reframing it in slightly simpler language — but I have this thing where I can look at a distribution of data, and if I know what generator supposedly produced that data, I can sometimes tell. I can't always put my finger on it, but it has happened a large number of times, especially in the back nine of my career. Early on I didn't always have this intuition, and sometimes people would be like, "Something's wrong here, Brian." But these days, when someone on my team says, "Hey, I did what you suggested, and here are the similarity scores for the RAG system," I'm like: no, it isn't. No, it isn't. That's wrong. I don't necessarily have the theorem at hand, from the information-theoretic perspective, for why that can't possibly be the way the data is distributed, but I know it's wrong. And they go back and dig into it, and they're like, "Sorry" — and it's okay. But I find that really interesting, because I do have that intuition, and I think I've observed that it's tied to spending a ton of time making the same crappy matplotlib plots.

charles 2:36:02 Yeah, yeah — funny, interesting experience with that. I demoed a project I had only half finished: this vector analogy thing that could find a Wikipedia article via analogies. I demoed it — I don't know, were you there that time, at the Friday-afternoon podcast? No? Yeah, I showed it; Jason and Hamel were there, maybe Eugene as well. They suggested a query, I typed it in, and I did a search to try and find the article — it was Bill Gates or something, GitHub was another one — and the result didn't come up. And I was like, oh, I probably broke something in the search pipeline or whatever. But Hamel was like, "Something is very messed up here." And he was right: it turned out I had dropped about 40% of the rows at the end of this big ingestion job — it just stopped, and I had not noticed, because it was a hidden crash. And I was like, oh man. Yeah, there's a certain scent where it's like: something very serious has gone wrong.

hugo bowne-anderson 2:37:23 What gave it away to Hamel?

charles 2:37:26 The smell, yeah.

bryan 2:37:28 He's all about it — it's like a mouthfeel.

charles 2:37:31 Yeah, yeah. I don't have as much experience with text search, you know — I think text search kind of sucks, so I assumed it was probably the text search. But Hamel believes in BM25 and lexical search, so he was like, "No, I can't accept that."
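[Editor's note: hypothetical sanity checks of the kind that would have caught the silently dropped 40% of rows and the implausible similarity-score distribution described above. The thresholds and numbers are invented.]

```python
# Two cheap checks: compare loaded counts against the source, and eyeball
# summary statistics of retrieval scores before trusting the pipeline.
import statistics

def check_ingestion(source_row_count: int, loaded_row_count: int, max_loss: float = 0.01) -> None:
    lost = 1 - loaded_row_count / source_row_count
    assert lost <= max_loss, f"ingestion silently dropped {lost:.0%} of rows"

def describe_scores(scores: list[float]) -> dict:
    return {
        "n": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": round(statistics.mean(scores), 3),
        "stdev": round(statistics.stdev(scores), 3),
    }

# Near-identical similarity scores for every query is the kind of "no, it
# isn't" distribution a reviewer should push back on.
suspicious = [0.991, 0.992, 0.991, 0.993, 0.992, 0.991]
print(describe_scores(suspicious))

try:
    check_ingestion(source_row_count=1_000_000, loaded_row_count=600_000)
except AssertionError as err:
    print(err)
```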
hugo bowne-anderson 2:37:51 I can't do that. Hey — what's up, Shreya? Welcome back. So, it will be time to wrap up shortly. If Hamel or Eugene are around and want to come say hi again, please feel free to; otherwise, don't. But I've got a couple of last questions for you all. As LLMs become more widely adopted and integrated into products, services, and life, what are some of the things you're really excited about — technology-wise, process-wise, or conceptually? What do you want to be doing in the future?

charles 2:38:32 I can go first: robots, man. Yeah, robots. We all dreamed of building robots and then ended up building business-to-business software as a service, but inside of me is a small child who dreams of robots. More seriously, multimodal models solve this problem — Pulkit Agrawal at MIT has a great paper about this. One of the core problems of robotics is not the mechanics of keeping your servos well calibrated, and it's not the policy learning; it's getting information from humans into the machine, so that a human can say, "Please do not go in the corner, the dog has pooped there" — you know, being able to communicate that to your Roomba. So I think there's much more work to do on multimodal models, and much more work to do on robotics, but that fundamental gap, which felt like a quantum research leap that needed to be crossed — we have, in fact, made that jump. That's what I'm most excited about. I already said I don't predict the future, so I won't say how long it will take — but long before I die.

hugo bowne-anderson 2:39:53 Awesome. How about you, Shreya?

shreya 2:39:57 I'm really excited for people who can use Excel to be able to do very cool things with LLMs over large collections of data. I think we have the building blocks; we don't quite have the interfaces, or even the tooling, required for it, and certainly these tools don't have interactive latencies. But I think we'll get there, and that will be very cool.

hugo bowne-anderson 2:40:24 Super cool. And — I just had a vision. Excel took me to Microsoft, and I had a vision of — remember Clippy, the little paper clip? — having a Hamel Clippy in LLMs, with a little crown. That would be pretty —

charles 2:40:41 It's actually pretty easy to code up, because it just pops up a little speech bubble that says "look at your data" every five or ten minutes. Yeah, I think I could probably implement that with davinci; I don't even need 3.5-turbo.

hugo bowne-anderson 2:40:54 Occasionally you need to intersperse it with "show me the prompt" as well.

charles 2:40:59 Oh, yeah, yeah. We'll need one of those forking contexts that Brian was talking about.

hugo bowne-anderson 2:41:05 I mean, we've got the team here to do it. So — hey, Eugene, what's up, man? What are you most excited about in the coming year or two, with everything that's going on, LLM- and gen-AI-wise?

eugene 2:41:18 I guess my answer is going to be a little bit boring: I'm really excited about more people using it and benefiting from it. Right now it's just not so easy, and there are a lot of sharp edges — that's a big motivator for why we wrote the report. Right now some tech companies, maybe in SF, are using it, but I would really love to see the SPCA use it to write better copy for dogs up for adoption, to increase adoption rates, or to screen adopters; or helping children — helping people build quizzes out of textbooks so they can study; or letting a child talk to Dumbledore — just put in all of the book content and then create a Dumbledore context. That would be pretty cool, I think. Or I would love to talk to Marcus Aurelius if I could. So there are a lot of fun things here that are now within reach and just possible to do.

hugo bowne-anderson 2:42:27 I appreciate that. And that was straight from Eugene the Stoic — I say that because I remember that, at some point, your website actually flashed up "the Stoic."
And of course, I imagine you, like Marcus Aurelius on the battlefield at dawn, you know, engaging in your philosophy. I didn't know that explicitly, but it makes perfect sense. And, totally — I was going to say "I approve," but that sounds condescending. Anyway, I don't care: I do approve and encourage.

bryan 2:43:05 I don't approve. Just to counter you, Hugo: I don't. Good.

hugo bowne-anderson 2:43:08 I like that. Wrong.

charles 2:43:09 I like hearing it. Brian Bischof, what is the thing you're most excited about in the next couple of years? Yeah, Brian.

bryan 2:43:23 So, the internet was exciting because it was the whole world's knowledge at our fingertips, but then, I don't know, I feel like it never materialized for a lot of people that way. And then phones were exciting because it was the whole world's knowledge in our hand, and that kind of did materialize that way. But I think the exciting thing about AI is that it's all that knowledge, accessible and useful, but a lot more niche and specific. The example I usually give here is memory extenders. I've been really passionate about the Memex since high school, and I've always thought that was my ultimate wish-genie thing: give me a Memex, because my memory sucks, spoiler alert. So that's always been what I wanted more than anything, and I think these days it is actually possible. I've been working on a side hack project, just trying to build a personal Memex, and it's amazing how many things are just very doable now. That's the thing I'm most excited about, just because I hate not having access to knowledge that I should have. I think that's going to transform the way we socialize, the way we interact, and the way we carry out our days.

hugo bowne-anderson 2:44:44 Very cool. And I love — I mean, as we've figured out, you all agree on a lot of things, but I love the variance of responses here as well. We've touched on this in a variety of by-the-ways, but as a final question, I'd just love to know: if you had to distill all your key lessons down into one piece of advice for someone who's embarking on a journey of building LLM-powered software, what would it be? Perhaps, Shreya, we can start with you.

shreya 2:45:12 Start with evals. Yeah — don't write anything else. Write some sample inputs and ideal outputs, and then do whatever else you want.

hugo bowne-anderson 2:45:26 And would you start with, like, basic assertions?

shreya 2:45:31 I would start with getting some sort of gold-standard dataset, even if it's small, and then, as you're iterating, thinking about what makes for a good output — because you're going to run into a lot of failure modes that you can't anticipate up front. I think enumerating assertions is actually really hard to do up front, because you don't know the specific weird ways LLMs are going to fail, or say weird things that aren't good vibes for the end user. So start — but start with some tests, I think.

hugo bowne-anderson 2:46:07 Awesome. Eugene?
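[Editor's note: a minimal sketch of Shreya's "start with a small gold set plus basic assertions" advice. The cases and the stand-in `app()` are placeholders, and the assertion style is just one simple option.]

```python
# A handful of sample inputs with ideal outputs, plus a couple of assertions
# you grow over time as you discover new failure modes.
GOLD = [
    {"input": "Summarize: the meeting moved to 3pm Friday.",
     "ideal": "The meeting is now at 3pm on Friday.",
     "must_include": ["3pm", "Friday"]},
    {"input": "Summarize: refunds take 5-7 business days.",
     "ideal": "Refunds take 5-7 business days.",
     "must_include": ["5-7", "business days"]},
]

def app(text: str) -> str:
    # Stand-in for the real LLM pipeline under test.
    return text.removeprefix("Summarize: ")

def evaluate(cases) -> None:
    for case in cases:
        output = app(case["input"])
        missing = [s for s in case["must_include"] if s not in output]
        verdict = "PASS" if not missing else f"FAIL (missing {missing})"
        print(f"{verdict}: {case['input'][:40]!r}")

evaluate(GOLD)
```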
eugene 2:46:11 I'm going to interpret your question as: what advice would I give to someone who's trying to get into this space, trying to learn and pick it up? I would say three things. First, read a lot — there are people I know who spend a ton of time trying to put all their condensed knowledge into a very distilled, punchy article; read that. Second, build a lot. Get your hands dirty, right? See how it feels, see all the edge cases. Maybe this comes across a bit like demo-itis — like, try to build a demo — but yes, that's the best way to learn, and it's so easy to do that right now. Build a lot; build a simple front end. And finally, share about it. The best way to learn is to write it down and share it — when one person teaches, two people learn. So yeah, do that. Put your voice out there.

hugo bowne-anderson 2:47:01 Beautiful. Charles and/or Brian?

eugene 2:47:03 I'm gonna popcorn it to Charles.

charles 2:47:07 All right, love it. What I've continually come back to is that the core idea of building any complex system is validated iterative improvement. It's like the zen of gradient descent. The way you solve a large linear system is not to manipulate algebraic symbols until the answer pops out; the way you solve a large linear system is by iteratively following a gradient — you know, the Krylov subspace methods, for example. It's step by step. The way you divide a number is one step at a time. And that also seems to be how people build businesses, how people run factories, and how people build full software systems. With LLMs, the complexity — the epistemic uncertainty — is thrown immediately in your face, and so it feels novel. But I think if you pay attention to any complex system, you discover that the only way forward is one step at a time, in a way where you can be sure that step is forward. And so, with that conceptual, big, Chicken-Soup-for-the-Engineer's-Soul idea: for LLMs, that means data, experimentation, operationalization, and getting out to production.

hugo bowne-anderson 2:49:01 Awesome. Thanks, Charles. How about you, Brian? I don't want to plant any seeds, so to speak, but we haven't actually talked a huge amount about experimentation, and I know how much of a fan you are of experimental processes.

bryan 2:49:14 Yeah. So I do think these are all really, really fantastic, and they're all ones I wish I'd said first. I think the thing that we are all thinking, though, and no one has said, is that the most important thing to start with is $50 million in VC money and training a foundation model from scratch. Like, I think we can all be honest: that is really the key alpha.

charles 2:49:44 Yeah — and he was telling me he actually bought a bunch of Nvidia stock right before this call, so now I understand why.

bryan 2:49:52 Oh, yeah. So, after your $50 million in VC money and your foundation model that is truly bespoke to you — I think, you know, you gave me that lead, and I certainly can't help myself: I am a fish, and it is a shiny piece of metal. So, ultimately, I think it is not dissimilar to what Charles is saying, which is iterating to success. What are experiments? Experiments, traditionally, are: you put a new thing in prod and you see if it resonates with the user better on a particular metric that you've aligned on. Neat. But now we've also got offline experimentation, which is a very quick cycle.
And the way you do that is by building evals: you start with your evals to instrument your experimentation, and then you experiment like mad to try to get to something that doesn't suck. Ultimately, the most important thing you're looking for here is how to constrain the problem down to one that actually delivers value to your user, and then you just iterate like hell. One of the things that has been most important is thinking about what each individual chunk that can go from zero to one is, and then iterating on that little chunk. Charles loves the phrase "zero to one," and that's what he cares about in this world. I'm kidding.

hugo bowne-anderson 2:51:34 Charles is a Peter Thiel fanboy, right? I mean —

bryan 2:51:38 As much as Charles loves going from one to n, I think Charles would agree that getting from one to two is zero to one, and from two to three is zero to one, and that has been the decomposition that has been very valuable for us at Hex — and, frankly, for everything I've ever worked on that didn't totally suck. So, yeah.

hugo bowne-anderson 2:52:01 Nice, awesome. Sorry — I'm just engaged in the chat as well, with Eugene's comment about making a short video of several quotes that have happened during this call, which I think is gonna —

eugene 2:52:11 Actually, there's only one quote. We all know what that quote is, so just make a video of that quote.

hugo bowne-anderson 2:52:15 Yeah, well, there's the AI one — actually, there is "doing AI with Brian," doing AI with his wife as well.

eugene 2:52:22 And Charles fondling the ear. I don't know, yeah.

bryan 2:52:26 Yeah, yeah. Honestly, "fondling the ear" is a pretty good one.

hugo bowne-anderson 2:52:32 Look, this has been — I've never done anything like this before. I've been doing podcasts and this type of stuff for years, and I've never had —

eugene 2:52:41 I bet you say that to all the girls.

hugo bowne-anderson 2:52:45 Okay, well, you may not be far from the truth there, but, you know, the context is slightly different — as is the context window. Look, I'm so grateful for all of you. Irrespective of what we've done today, you've given a huge amount to the community. You put in a huge amount of work to come together and put out this meaty report, which not only has had a huge amount of impact on me and people I know in the past few weeks, but will continue to, and I'm very excited for that. I want to thank the 80 to 100 people who are still around — sticking around for three hours is a huge effort — so thank you so much for your presence and patience. I'm putting a link once again to the report in the chat, and I'll put it in the show notes with the podcast. There's an "about the authors" page as well, where you can go and find links to everyone's websites; follow their blogs, follow them on social media. We will be putting out the podcast in the next week, probably early next week, so keep your eyes open for that, and we'll post it all on social media. If you do enjoy these things, please subscribe to the channel, give it a like, and share with friends. But most of all, thank you all for your time and wisdom and good vibes. This has been so much fun, and it's been great to see a bunch of friends get together and talk about stuff they love. So, really, very much appreciate you all.

charles 2:54:22 Thanks for having us,
Hugo.

hugo bowne-anderson 2:54:23 Absolutely — thank you. Awesome. And so, thanks, everyone, and we'll see you in the next live stream. Take care.

Transcribed by https://otter.ai