The following is a rough transcript which has not been revised by Vanishing Gradients or Rachael Tatman. Please check with us before using any quotations from this transcript. Thank you. hugo bowne-anderson Hi, Rachael, and welcome to the show. rachael tatman Thank you for having me, happy to be here. hugo bowne-anderson It's such a pleasure. And I'm so excited to talk about a bunch of things in the data and NLP space, in particular how to make NLP boring again, including all the work you're doing on conversational AI and chatbots. I'm really excited to talk about demystifying the use of machine learning, like when to use it and when not to. And the other thing I know you're interested in, which I'm really excited to hear about, is how we approach the globalization of NLP, particularly given that so much training data and so many models are in English when it accounts for so little of the language spoken globally. But before we dive into this, I thought you could give us a rundown of how you got into the data world and what your trajectory has been so far. rachael tatman Yeah, great question. It's been a little bit winding, but I would say that my focus has really been language technology since the beginning. In undergrad I was really interested in language technology, but I was also double majoring in English literature and linguistics. I took the basic computer science series, but I didn't have enough credit hours to also get a CS degree. The reason I originally got interested in linguistics (I found my undergraduate statement of purpose a while ago) was that I was really interested in machine translation at the time. And I thought, well, we have to understand how language works to be able to translate between languages, so obviously linguistics is the field that's going to have the most bearing on this. Which, I would say, has not been borne out. In graduate school I realized that NLP researchers really aren't in conversation with the linguistics literature. I had a lot of interest in speech variability, particularly in units of language below the level of the word, so I worked predominantly in speech, and then more and more in text over the course of graduate school. I worked some with sign language data, some with acoustic data, and I also did some work on emoji and punctuation and capitalization: sort of the sub-lexical features. Though "sub" indicates some sort of hierarchical relationship, and I think there's not always a hierarchical relationship there. And then towards the end of graduate school, I realized that just as NLP researchers weren't usually in conversation with the linguistics research, the people who were building the language technology itself often weren't in conversation with the NLP research. I think that's changed a little bit, particularly starting around 2017, with the transformer paper, the BERT paper, GPT-2, and the discussion around those; I think more people working on language technology got interested in the research side of things. But I wanted to help build really helpful language technology, and industry was really the place where I could do that. So I finished my PhD in 2017, and then I started as a contractor at Google and converted there to Developer Advocate, which I wouldn't recommend as a career path.
If you're listening to this and thinking, "Oh, I'll start as a contractor and convert": it was a rough process for me, and I think it's only gotten rougher. Now I'm a senior developer advocate at a company called Rasa. We're a startup that builds an open source framework for building chatbots, and we also have a closed source tool set for maintaining, updating, and doing human-in-the-loop improvement of your chatbot once it's in production, so that over time it gets better and better and a better fit for your users' needs. In the best case scenario. hugo bowne-anderson That's awesome. There's so much in there I'd like to unpack. One thing I'm interested in is that you mentioned transformers. This is the current state of the art: there's a lot of buzz on one side, maybe hype on the other. I wish we had words which could carry both the positive and negative connotations; maybe we can discuss that. Maybe you could give us a rundown of some basics of the history of conversational AI, which could provide a framework for discussing the relationship between machine learning and rule-based chatbots, for example. rachael tatman Yeah, absolutely. So the history of conversational AI is very, very deep. I don't remember if we discussed this before, but pretty much as long as computers have been something that people have been envisioning (a lot of people say that the first reference to a computer was in Gulliver's Travels by Jonathan Swift), they've been imagined as something that we can talk to, with that talking happening in an automatic way. So that's always sort of been the dream of interacting with computers, and I would say we got sidetracked by compilers, just to look at it from a historical viewpoint. That is not a dig at people who work on compilers; it is a joke. So it's got a very deep history in terms of people thinking about it. Probably the first dialogue system that there was a lot of excitement around was ELIZA, which, I don't know, I feel like a lot of people in my circles have heard of, but I know some people probably haven't. This was a very early system, and it was based on a type of psychoanalysis where somebody tells you something and then you rephrase it back to them as a question. And this was done completely rule-based. The token string was taken in and parsed; parsing is an NLP task where you create a hierarchical structure of relationships between places in a string and other places in the string. I'm trying to avoid saying things like "words" and "sentences", because it's not always words and sentences, and a focus on words is a very English-centric thing to do. So it took the input, parsed it, and transformed it into a sentence that was a question. In linguistics (this was back when linguistics and computer science were much more closely related) we have a lot of different models of the transformations that occur to take a statement and turn it into a question. And that's usually how it's phrased: you have the basis, the statement, and then you turn it into a question. So it was completely rule-based, and it was very popular with the people who used it.
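To make the rule-plus-transformation idea concrete, here is a minimal ELIZA-style sketch in Python: a regular expression matches part of the user's statement, pronouns get swapped, and a template rephrases it as a question. The rules below are invented for illustration; this is not Weizenbaum's original script.

```python
import re

# Swaps so the echoed fragment reads from the listener's perspective.
PRONOUN_SWAPS = {"i": "you", "my": "your", "am": "are", "me": "you"}

# Illustrative ELIZA-style rules: a pattern plus a question template.
RULES = [
    (re.compile(r"i feel (.*)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]

def swap_pronouns(fragment: str) -> str:
    """Flip first-person forms to second-person so the echo sounds natural."""
    return " ".join(PRONOUN_SWAPS.get(word, word) for word in fragment.lower().split())

def respond(statement: str) -> str:
    """Apply the first matching rule; fall back to a generic prompt."""
    for pattern, template in RULES:
        match = pattern.search(statement)
        if match:
            return template.format(swap_pronouns(match.group(1)))
    return "Please, go on."

print(respond("I feel anxious about my thesis"))
# -> Why do you feel anxious about your thesis?
```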
So even though they knew it was an automated system (obviously no ghost in the machine, a very transparent process -- well, transparent depending on how comfortable you are thinking about parsing -- for producing the next turn), people found genuine value in it, right? hugo bowne-anderson Its intention was for therapy, is that right? Or it was used that way? rachael tatman It was modeled after it. I don't believe it was ever intended to be a therapeutic system, but that is how it was used, yes. I don't think the original paper makes any claims about it solving the need for psychotherapy or talk therapy. hugo bowne-anderson Yeah. Although reflection is an incredibly important part of therapy. rachael tatman Honestly, I would probably call it something like a journaling aid. hugo bowne-anderson Yeah, I love that. So that was the first of the rule-based methods. rachael tatman I don't want to say that it was the first rule-based system, because I'm sure someone will come at me with an earlier citation. But to guess at what you're asking: the next big shift was from these completely rule-based systems, where the transformations and rules are written by hand by linguists, to systems where you basically counted things. So a lot of corpus-based methods, things like Markov models and hidden Markov models if you're familiar with those, a lot of what were usually called statistical methods, came to the forefront. And these were much fuzzier, right? The output was a little bit less predictable, but still deterministic within a distribution. Then around 2016, when Facebook launched their support for chatbots on their Messenger service, there was a huge boom in people building chatbots. I was actually still in grad school, so I wasn't working in dialogue systems (I was working predominantly in speech), but I believe at that time there probably would have been a mix of statistical methods and deep learning methods in play. And statistical methods are definitely still used; people are still building rule-based systems, people are still building systems to build systems. But the next big new set of methods was deep learning. When I was in grad school, around 2016 and 2017, the state-of-the-art architecture that everyone was really excited about was the bidirectional LSTM, a specific type of recurrent neural network, with attention. In the speech community there was a little bit more discussion of CNNs for some speech tasks, but a little bit less on the text side. And then in 2017 the transformer paper came out: "Attention Is All You Need", Vaswani et al., 2017 -- it wasn't a single-author paper. Transformers (this may be review for some of the listeners) are a family of deep neural networks that are basically fully connected. So instead of, as with a recurrent neural network, having to train on time point one before you can train on time point two, because it depends on time point one, with transformers you train on all time points at the same time. The benefit is that you can parallelize to a much greater degree, so the training-time bottlenecks aren't there anymore. The drawback is that you have way more parameters, orders of magnitude more parameters than you would with an equivalent recurrent neural network.
So that was, I guess, a sub-part of the third wave, if you want to talk about this in waves: the deep learning wave. And there has been a lot of work, particularly with large language models, on treating them like dialogue models. They're not designed for that; language models are not designed to be dialogue systems. These are things like BERT, GPT-3, T5... hugo bowne-anderson And you may have said this, but the T in BERT and GPT-3 stands for transformer? rachael tatman Yes, absolutely. hugo bowne-anderson Could you remind us what a language model is? rachael tatman Yeah, absolutely. So language modeling is a task from the statistical era of natural language processing where, given a token set that you know about and an input string, you are trying to find the probability, across the token set, of the next token in the string. Just as an example: if I had a very small vocabulary of, let's say, "I", "bought", "a", "cat", and "dog", and you have equal numbers of examples of "I bought a cat" and "I bought a dog", then if you gave me the sentence "I bought a ___" and I was trying to predict that word, I would give 50% likelihood to both "cat" and "dog". That's a very high-level explanation of what's going on. Transformer-based language models aren't statistical language models in the same way: they usually just generate the output, and they do not produce, and are not intended to produce, a statistical model of the language as a whole. You can still use some measurements to compare them. Perplexity is a really common one, which is sort of a measure of entropy, which is an information-theoretic blah, blah, blah; we can talk about it. hugo bowne-anderson Why do you think neural networks have had the success that they've had for these questions? rachael tatman Yeah, great question. I think there are three answers. One is the machine learning answer, which is that they're universal function approximators: if your function potentially exists, you can get there. So the machine learning answer is that they're just very flexible models. The history-of-computing answer is that we had access to data and to hardware that made parallelization possible (graphics cards basically being readily available at the consumer level), which made it computationally tractable for these models to be trained. You had enough data to get past an underbaked model and something useless; you had enough data to get results that were, if not exactly what you wanted, at least a useful proxy, a reasonable proxy thereof. So that was the computer science answer, to go with the machine learning answer. And then I guess there's the cynical answer, which is that if you have enough data and enough compute, and you can just throw money and labels at a problem, you don't need to spend a lot of developer time or researcher time hand-building stuff. I'm not trivializing building deep learning systems; I think anyone who's done it knows that it's not trivial, and there's a lot of secret sauce that goes into it, even to this day. We don't have good theoretical guarantees for a lot in the machine learning space, and in the deep learning space specifically.
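As a toy version of the "I bought a cat / I bought a dog" example, here is a minimal sketch of a counting-based (bigram-style) language model in Python. Real statistical language models add smoothing and longer contexts; this just shows the counting idea.

```python
from collections import Counter, defaultdict

# Tiny corpus mirroring the example from the conversation.
corpus = ["I bought a cat", "I bought a dog"]

# Count which token follows each token (a bigram model).
continuations = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        continuations[prev][nxt] += 1

def next_token_probs(prev: str) -> dict:
    """Estimate P(next token | previous token) from raw counts."""
    counts = continuations[prev]
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}

print(next_token_probs("a"))  # -> {'cat': 0.5, 'dog': 0.5}
```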
So there's a lot of guess-and-test. But you don't have to write a pronunciation dictionary by hand; you don't have to sit down and develop the parsing algorithm by hand. And that takes a lot of specialized knowledge, and it takes a lot of time. There's something we were talking about a little bit earlier: someone was saying, hey, we're using the same machinery in computer vision and NLP and all of these different fields, where you can get started in NLP and really not have a lot of language-specific knowledge. You may make a lot of boneheaded, I guess, a lot of beginner, errors, right? And knowing a lot about language absolutely helps, and I think is important. I mean, I've spent a career learning about language, so clearly I believe it is a very special type of data that we should treat as such. But if you don't need a linguistics degree, if you don't need to understand grammar, if you don't need to be able to do the specialized work, then it does open up the applications more. And I don't think that's a bad thing. I think it's good for things to be accessible; I think it's good for people to have access to tools to build things for themselves, particularly if you're outside of, let's say, the Bell Labs of the '70s or the Stanford of today, and you don't have access to 3,000 willing undergrads who would happily learn your very specific annotation system and then annotate 50,000 lines of data for you. But I don't know, maybe I'm just a curmudgeon. I do worry that not having a lot of understanding of how language as a type of data works has led to some wasted work, basically, on problems that are either not solvable or were solved in the '60s and have a very complete, very easy solution. So those are my thoughts on that. hugo bowne-anderson Thank you for breaking that down. You don't sound curmudgeonly; you definitely don't sound as curmudgeonly as I feel I sound on an increasing number of occasions these days. I think the wasted work is very important to talk about, not only because of wasted resources, but because of how people in these positions feel their work is meaningless when it's wasted in the end. The other side, which I hope we'll get to, is the fact that we're talking about deeply human data as well, and what the rights and responsibilities of everyone involved are when working with human data. But before we get to that, I really want to jump in. A lot of people say, let's throw neural networks at this, let's use a lot of machine learning. And these techniques and tools are incredibly valuable. But I know you think in a very principled way about trade-offs between rule-based methods and machine learning methods, and you do this in your work as well. And there are even regulations involved in this, right? So maybe you could tell us a bit about where we are today, because I think a lot of people think rule-based is the past and machine learning is the present and the future. I'd like for us to open up that myth and discuss it. rachael tatman Yeah, absolutely. I'll start off by saying that when I talk to someone and that is their stance, usually they are someone quite junior in the field. I'm not dunking on people who are new to the field; we all have to learn somehow, right?
But I think anyone who's been working with language data in a professional capacity for, let's say, five years: you've run into regexes, and they've been appropriate, the appropriate tool for the job at the time. So I think it's a misapprehension, a misunderstanding, that rule-based systems aren't still used; they absolutely are. And the benefit of deep learning systems is that they do make things that were previously intractable tractable. They do help with systems where you want flexible behavior, and where you either can't, or don't want to, or it's infeasible to, maintain a system of rules to create the illusion of that flexibility. And I would also say, speaking of wasted work, avoiding it, if you can automate it, is great. Talking about the misapprehension that rules are gone for good: they surely aren't. They're here; you're going to need to learn Perl at some point, at least a little bit. I'm sorry if that is news to you. Fortunately, there are lots of great learning resources out there. And will deep learning eat the world? That was sort of the second question. No, it's a specialized tool, and it's good for what it's good for. It's good for those flexible situations; it's good for reducing the amount of boring, repetitive work, basically. A good litmus test that I use is: if I can teach a person to do something in two to three minutes, I can probably do it with machine learning if I have enough labeled examples. And if it's easy to teach a person to do pretty quickly, I can probably get my labeled examples pretty quickly, because it's probably pretty simple. I think there's also a lot of, I don't want to say purposeful disinformation. One of misinformation and disinformation is intentional and the other one isn't; I don't remember which is which. hugo bowne-anderson I can never remember either! rachael tatman Right, it's one of the two. But I think that there are a lot of people who are, or have been, oversold on the abilities of deep learning systems. They will always make a guess, right? There's always going to be some sort of output: if you're doing softmax, the outputs somehow have got to sum to one, right? hugo bowne-anderson Yeah. rachael tatman But that doesn't mean they can capture a signal that's not there. It does mean that, to the untrained eye, it might look like it is. I think a great example of this (this is an example from computer vision) would be the "gaydar" paper. It was a while ago; it was a paper that claimed to be able to tell whether someone was gay or straight from looking at a photo of their face. And what it was actually picking up on was things like the shape of your glasses, right? And I believe Margaret Mitchell, and co-authors whom I do not remember off the top of my head, I'm so sorry, showed that they were able to fool the system by changing the angle the photo was taken at, or changing their makeup, even though I think it is hopefully clear to people that if I take a photo at a different angle, it does not change my sexual orientation. So, especially if you don't have deep domain knowledge of the problem you're working on, I think you can convince yourself that you're doing something that you're not. And unfortunately, I do think that some machine learning systems, deep learning systems, that are out there are being sold as doing things that they can't do.
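A quick illustration of the "it will always make a guess" point: softmax turns any vector of scores, even pure noise, into a tidy probability distribution that sums to one, which can look like confidence where no signal exists. A minimal sketch:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Map arbitrary real-valued scores to probabilities summing to 1."""
    exps = np.exp(scores - scores.max())  # shift for numerical stability
    return exps / exps.sum()

# Scores from a model fed pure noise: there is no signal here,
# but softmax still produces a confident-looking distribution.
rng = np.random.default_rng(0)
noise_scores = rng.normal(size=3)

probs = softmax(noise_scores)
print(probs, probs.sum())  # one class always "wins"; probabilities sum to 1.0
```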
hugo bowne-anderson We'll include a link to that paper in the show notes. It reminds me, there was some computer vision problem where it was animal classification of some sort, and one of the animals lives in snowy habitats. And what the algorithm was actually doing was just detecting that snow was in the photo, essentially. So we see this crop up again and again. The other thing, and I guess this is a slight tangent, but I think it's probably very important for us to talk about right now, is: why are we working on questions like someone's sexuality from a photo? Should we be more principled and self-reflective in thinking about the types of things that we're actually working on and putting our limited energetic resources into? rachael tatman Yes. What a great segue. Absolutely, yeah. Not everything is worth building, and even things that may have some value in building have a real, genuine potential to cause harm. I think we talked about this a little bit earlier, but I don't believe that in the United States software engineers or machine learning engineers have a duty of care towards their users in the same way that, say, a civil engineer does. I personally feel that I have a direct duty towards people who end up using things that I work on: to not make their lives materially worse, and to not perpetuate existing cycles of oppression. That's my strong, principled personal stance, and I don't think everyone in machine learning is going to share my exact principled personal stance. But yeah, what questions we're spending our time on is a great thing to think about. And with language data, I think something a lot of folks, again those who are coming to language from a machine learning perspective, don't realize is that in every single piece of language data there are reflexes of your social identity; there are reflexes of your past, of your national origin potentially, your gender potentially. I'm not saying that you can necessarily detect these things directly from the language, but they are latent variables that might affect your prediction in a way that you didn't intend and might not have looked for. So speaking of unintended effects, I think being very aware of the relevant social factors that are reflected in language use, for the population from whom your training data is drawn, is extremely important, and it can help prevent a lot of, I mean, unjust systems, and systems that simply don't work, being built, which would be good not to do. I think a great example of that, and you may have read this, is that a while ago, I believe, Amazon had a system they were trying to build to do, like, resume screening. Screaming? Screening. hugo bowne-anderson I actually really love that term, "resume screaming"; I feel like it encompasses modern HR in general. Like the Munch painting, The Scream, but just throwing resumes in the air. rachael tatman But the most predictive feature they found for whether an offer would be made, based on past candidates, was, I think, whether you were on the men's lacrosse team. hugo bowne-anderson Yeah. It was also women's colleges: if you had been to a women's college, it marked you down completely. rachael tatman Yes. Which I don't think is indicative of the abilities of these people. It was indicative of a history of bias in the system. And they did not end up putting this into production. Also, it would have been straight-up illegal.
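To make the latent-variable point concrete, here is a hypothetical sketch, with clearly synthetic data and invented feature names, of how a proxy feature can dominate a model: a classifier fit to biased historical decisions learns the proxy, not the ability it is supposed to measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 1000

# Synthetic "resumes": one feature that actually measures skill, and one
# proxy feature (say, membership in some club) correlated with who
# historically got offers rather than with skill.
skill = rng.normal(size=n)
proxy = rng.integers(0, 2, size=n).astype(float)

# Historical offers were driven mostly by the proxy: a biased process.
offers = (0.2 * skill + 2.0 * proxy + rng.normal(scale=0.5, size=n)) > 1.0

X = np.column_stack([skill, proxy])
model = LogisticRegression().fit(X, offers)

# The learned weights reveal what the model actually uses.
print(dict(zip(["skill", "proxy"], model.coef_[0])))
# The proxy coefficient dwarfs the skill coefficient: the model has
# reproduced the historical bias, not measured candidate ability.
```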
hugo bowne-anderson Yeah. I'm so cynical that part of my mind goes to: is this even real, or is it a PR move on Amazon's part? Because they made a big deal about it publicly. The other thing that I'll include, and I don't know if you've seen this, is a game that came, I think, out of a collaboration with Mozilla Labs. I can't remember who made it, but it's called Survival of the Best Fit, which is of course a great name. In this video game you play a CEO who's making hiring decisions, and then the hiring becomes automated, and you see the bias that creeps in there. rachael tatman Oh, that sounds like a great computing-and-society project. hugo bowne-anderson Yeah, absolutely. I think this conversation dovetails nicely, going back to rule-based versus machine learning. Sorry, let me collect this and get it right. Machine learning methods, especially for conversational AI and chatbots, are not always predictable. And when you have a lack of predictability, a system can say stuff which is super offensive, for example. What role does rule-based NLP play in these types of products, particularly at the end of the production pipeline? rachael tatman That is a great question. So we can talk a little bit about the Rasa approach here. The first thing that I would say is: never serve raw generated text to your users. This would be the raw output of a language model. We know that they were trained on the internet, and if you have done much research into harassment, I think it is not a surprise to you that a lot of the data they were trained on is deeply toxic. hugo bowne-anderson Yeah, I kind of joke about how idealistic it was that we thought allowing everyone to write anything in broadcast mode to the entire world at any time was a good idea. I mean, we've seen the limitations of Web 2.0. And comments: who would have thought that allowing everyone to comment would be bad? rachael tatman Yeah, I have a lot of opinions on the importance of moderation, a strong code of conduct, and being consistent, in nurturing communities. It's important and you cannot skip it. If you try to skip it, you are not nurturing a community; you're setting it up for failure. Absolutely. So that's the first thing: just don't serve raw language model output to users, ever. That doesn't mean language models don't have a place. As a task, language modeling was intended as part of a pipeline. For example, in automatic speech recognition, traditionally you had the signal processing part of the problem, where you took the raw audio waveform and generated some candidate sounds or words that someone might have said, and then with the language model you ranked those candidates based on how likely they were. Balancing those two systems then gave you the most likely output based on both the waveform and your trained language model, and whatever corpus it was originally trained on. So, as a task, language models weren't originally intended to generate language, and they certainly weren't intended to generate language as, like, the main thing that they did.
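Here is a minimal sketch of that traditional pipeline idea: an acoustic stage proposes candidate transcripts with scores, and a language model score reranks them. All the candidates, probabilities, and the weight below are invented for illustration.

```python
import math

# Hypothetical candidates from the acoustic/signal-processing stage,
# each with a made-up acoustic log-probability.
candidates = [
    ("recognize speech", -4.0),
    ("wreck a nice beach", -3.5),  # acoustically plausible, linguistically odd
]

# Made-up language model probabilities for each candidate string.
lm_probs = {"recognize speech": 1e-4, "wreck a nice beach": 1e-9}

LM_WEIGHT = 1.0  # how much to trust the language model vs. the acoustics

def combined_score(transcript: str, acoustic_logp: float) -> float:
    """Combine acoustic and language model evidence in log space."""
    return acoustic_logp + LM_WEIGHT * math.log(lm_probs[transcript])

best, _ = max(candidates, key=lambda c: combined_score(*c))
print(best)  # -> "recognize speech": the language model rescues the right reading
```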
But we've gotten to the point with large language models where people are doing that, and I strongly recommend against it. And I think there's a larger question there of where you use rule-based systems and where you use the guessing systems, the learning systems, and how you use them together. A good example here is entity extraction: in a string of text, I want to extract part of that string and assign a label to it. For example, if I say "my name is Rachael", I might want to extract "Rachael" and save it as, say, this person's name, and use it to refer to them later on, or pre-fill a form, or something like that. You can do that with machine learning; you can use a neural named entity recognition approach, and we do do that at Rasa. We have DIET, the architecture that our research team developed, which does both intent classification and entity extraction; it's a dual task. hugo bowne-anderson Is that also an acronym, or initialism? rachael tatman It is. I believe it's Dual Intent and Entity Transformer. I can look it up if you're interested; I believe the paper is on arXiv, and we also have some YouTube videos on it that Vincent, a research advocate (sorry, he's not a developer advocate), made. But at the same time, for things like addresses, email addresses, numbers, and times, we just recommend you use Duckling, which is a regular-expression-based entity recognition program (not just named entities, but entity recognition) that was, I believe, originally written in Clojure and has since been rewritten in Haskell by Facebook, which has taken over maintenance of the open source project. And I would say, when I'm building a system, probably 80% of the time I end up using the transformer-based model and about 20% of the time the regex-based model. But then, I'm not asking for people's emails and addresses, things that have a really strongly templatic... a strongly predictable template that they follow, as often. And then the other place where we use both is in deciding what to say next, usually called dialogue policies: given the history of the conversation and the current turn, what do you say? In a platform based on something like GPT-3, you might just use the conversation as the prompt, and then whatever the model puts out, you send to the user. Don't do that. Please don't do that. I think there are some fun uses for that sort of thing, like AI Dungeon, but that is a game setting, and the stakes if you get it wrong are very, very, very low. So I would strongly recommend against it for anything in a commercial application. The way that we handle that balance is, in part, that you provide conversation flows (we call them stories) of ways that the conversation could go. If the user is following a pattern that the system has already seen, it's just going to keep following the pattern it's been given. But if there's a time when your user, let's say, interrupts themselves with a different stream of thought, that's when the transformer model would look at the two of them, the conversation and the stories, which are the training data, to decide what to say next. I would say, at least in the systems that I'm building, maybe 10% of the time, at most, you'll get the transformer model hopping in.
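For the regex side of that 80/20 split, here is a minimal sketch of rule-based entity extraction: a couple of simplified, illustrative patterns, not Duckling itself (real-world email and time grammars are much hairier, which is part of why tools like Duckling exist).

```python
import re

# Simplified patterns for strongly templated entities.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "time": re.compile(r"\b\d{1,2}:\d{2}\b"),
}

def extract_entities(text: str) -> list:
    """Return every pattern match with its label, value, and span."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {"entity": label, "value": match.group(), "span": match.span()}
            )
    return entities

print(extract_entities("Email me at rachael@example.com before 14:30"))
# -> [{'entity': 'email', 'value': 'rachael@example.com', 'span': (12, 31)},
#     {'entity': 'time', 'value': '14:30', 'span': (39, 44)}]
```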
In most conversations, particularly if you really have a good understanding of your users' needs, you just follow the templates. It's good to have both, and it helps you build more robust systems with way less training data. Imagine trying to train a neural network to extract emails and only emails. hugo bowne-anderson The other thing that comes to mind with respect to template responses is that, unless you require some sort of validation, like confirming that the input actually fits whatever template, in my experience users will continually thwart what you think the response will be. You can think you're going to get an email address, and you can get anything but. You'll be amazed at the types of things people put down, right? rachael tatman Yeah, that's a great point. So, again, this is speaking in terms of what we recommend at Rasa: we recommend something called CDD, conversation-driven development. Like tests, except they're real human conversations. The idea is that as soon as you have a prototype that kind of works, you get it in front of test users who know that they're test users (don't just launch immediately; we're not recommending that), and then you go through and annotate those conversations, adding them as stories if you can't handle that flow, and then retrain and redeploy. So, CI/CD, but with adding conversations, annotations, and training data over time. I mentioned we have a closed source thing: Rasa X is the community version of that. It's free to use, but it's not open source, and it's for streamlining that annotation process, basically: going through and saying, this turn was wrong, it should have been this; let's retrain and redeploy. hugo bowne-anderson And of course, congratulations on the recent launch of Rasa 3, is that right? rachael tatman Yes, it is. It's been a long time coming. So this is Rasa Open Source 3, the third major version of our open source framework. We've had some major changes. Probably the biggest one from our end is that we did a big refactor and tidied some stuff up, a lot of work that I think won't affect most users. But one of the things that we did is that, previously, when you had language data that you were inputting, it went linearly through the steps in your pipeline. Now it does not have to be linear: you can draw your own custom graph. You don't have to, though; it ships with a default one, and you don't have to touch it at all if you don't want to. hugo bowne-anderson So is this some sort of DAG? rachael tatman It is a DAG! So yeah, a completely customizable DAG architecture. The other big benefit of that is training time, because, of course, we recommend retraining quite a bit. To help keep the time down, and also to reduce the environmental impact, only the components that have changed, and then their parents and dependents, will be retrained, to make things faster and easier. Who doesn't want that? We also changed how slots work. These are the variables that you store over the course of the conversation. We made them easier to reset; you just have to be more explicit when you're setting them, describing what they're going to be like to begin with. So, more Pythonic.
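To illustrate the DAG idea (retraining only what changed, plus everything downstream of it), here is a toy sketch of the dependents half of that. The component names and graph are invented; this is not Rasa's actual implementation.

```python
# Toy pipeline graph: component -> components that consume its output.
DEPENDENTS = {
    "tokenizer": ["featurizer"],
    "featurizer": ["intent_classifier", "entity_extractor"],
    "intent_classifier": [],
    "entity_extractor": [],
}

def needs_retraining(changed: set) -> set:
    """A changed component and everything downstream of it must be retrained."""
    stale, stack = set(), list(changed)
    while stack:
        component = stack.pop()
        if component not in stale:
            stale.add(component)
            stack.extend(DEPENDENTS[component])
    return stale

# Only the featurizer's config changed: the tokenizer's cached output can be
# reused, but everything downstream of the featurizer is stale.
print(needs_retraining({"featurizer"}))
# -> {'featurizer', 'intent_classifier', 'entity_extractor'}
```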
And then our third biggest change, which is still experimental (so if you try it out and run into any issues, please come to our forums or give us a GitHub issue; we'd love that), is a way to track specific events that happen once your system is in production. So you have a tracker store, which is basically a database that stores all of your conversation histories, and markers is this new feature. It's a way to set a counter: like, hey, every time somebody successfully orders a car through my car-service-ordering chatbot, I want to add one to that marker row in the table, right? Or if somebody tries to order a car and is unsuccessful, I want to track that. It's hard to say whether a conversation went well or not, but it is possible to say: people could do this thing, or people failed to do this thing, or people asked to be switched to a human. And I think all of those things are really helpful in determining, from a product leadership and design perspective, what you are doing well and where you need to improve. So that's the third big change. hugo bowne-anderson I actually recall from one of your talks, maybe (I can't remember; there are just too many things in my head right now, my neurons are firing), that another way of figuring out whether people were pleased with talking with a chatbot is whether they choose, the next time they come, to chat again. rachael tatman Absolutely. Yeah: did they like it enough to come back? Which doesn't necessarily always mean that they had a good experience, because ideally, if it's a customer support thing, they never come back, because their problem was solved right away and they never have another one. And then, in terms of user satisfaction, the most robust measure: there's a paper from, I want to say, 2019, maybe, that found that the single most reliable indicator of whether somebody liked using a chatbot or not was whether it saved them time. So if you can help somebody get done with something faster: good chatbot. If it takes longer than it would have otherwise: it drives people up the wall. And I'm sure everyone listening, hopefully, has had both experiences; we've all had the "things took longer" experience. But yeah, I dream of a time when people don't talk about chatbots because they're fine, they work, just like, you know, clicking a button works. hugo bowne-anderson Something that just came to mind: is there an ethical obligation, and in some places a legal obligation, for chatbots to identify themselves as such? rachael tatman Great question. I'm not a lawyer, so I'm not going to answer the legal question, because I don't know. I would say that there's always an ethical obligation for a chatbot to identify itself as such. And actually, at Rasa, we have a set of ethical principles. I'll pull this up and give you the link. hugo bowne-anderson Please do. rachael tatman Yeah, it's rasa.com/community/principles. This is something we talk about a lot internally. So: first, you shouldn't cause harm using an assistant. It should not encourage or normalize harmful behavior from users; so, for example, you shouldn't create a chatbot that encourages people to sexually harass people, or that rewards sexual harassment, which has been a big discussion in the chatbot community. And it should always, always, always identify itself as a chatbot. Actually, when you create a new Rasa assistant, that's baked in to begin with; you would have to remove it to have it not happen.
And then finally, it should have a way for the system to prove its identity, so that you're not impersonating someone with an assistant. So those are the four main ethical principles. And we also talk quite a bit internally about stereotyping: avoiding chatbots that perpetuate harmful stereotypes or misrepresent a group. hugo bowne-anderson A very good modern example is how, as far as I know, nearly all voice assistants are female. rachael tatman By default, yep. Yes. And I think this has caused real harm to real women, in particular women called Alexa. There's actually been a spate of women called Alexa changing their names, because having the same name has made their lives awful. hugo bowne-anderson I hesitate to say this, but last February, pre-COVID (you're in Seattle; I was in one of these weird Starbucks Reserve places in New York), there was a couple sitting next to me, and her name was Alexa, and her boyfriend kept asking her questions and ending each one with her name. He was like, "You know, you want to go and see a movie, Alexa?" And I was sitting there just going, what is happening here? Deeply troubling. And my intention wasn't to stereotype anyone from Seattle; it was just a point of connection. And these Starbucks Reserve places are truly, truly bizarre. So where I'd like to go next: I think there are probably a substantial number of listeners who know a bunch about data science but haven't really gotten involved in NLP, and I'd love to talk about how people can get started with NLP. But before that, I was wondering if you could maybe tell us a bit, at a conceptual level, about the different things you need to think about when working with language data, and the paradigm shift there. rachael tatman Great question. Yes. So there are, I think, a couple of big things that may or may not seem obvious. The first is that there's no such thing as unstructured language data. All language data is produced by humans; it is produced according to some set of structures. I will remain agnostic as to what those structures are, so as not to start arguments among linguists. hugo bowne-anderson And we generally say "unstructured" to refer to non-tabular data, is that correct? But a lot of the time non-tabular data is highly structured, actually. rachael tatman Yes. Images are structured in a lot of ways as well. So that's the first thing: there is structure there. I think it can be tempting to ignore things like order, and in some languages that is not as big an issue as it is in something like English; English is a highly ordered language. hugo bowne-anderson Is that why LSTMs, for example, had early successes? rachael tatman There is some evidence that LSTMs in particular are better than transformers at capturing hierarchical relationships between items, so things like parsing, like I mentioned. I don't know whether that finding has held up with newer and larger transformer models; that paper would have been from, I'm going to say, 2018, I think. But yes. So that's the first thing: the structure. It's there; you may be ignoring it, but it's there. And that includes things like markers of social identity. An easy way of thinking about this is that if I have produced some language output, you can, at the very least, make a general guess about the language that I'm using, right? You can tell: ah, Rachael is speaking English.
She must be an English speaker, right? And that's a social identification of me. And you might identify: sounds like she might be a woman, possibly. You might identify: oh, sounds like she's American. And if you have a very carefully trained ear, you might be able to identify where in the US I am from by dialect, which is a little bit harder. hugo bowne-anderson I think I know this, though: Virginia? But I know that because I know where you're from, not because I can recognize it. rachael tatman Yeah, yeah. And in the same way, someone might be able to identify, oh, you know; I might be able to make some guesses about your gender, about your national origin. hugo bowne-anderson Which is incredibly frustrating as well, by the way. As an urban Australian, people say, "You don't sound Australian." No, I know I don't. Yeah. rachael tatman Absolutely. Yeah, most people have not had training in dialectology, and that is perfectly fine; you don't all need it. The second thing, when working with language data to build language technology specifically, is that your technology will always be held up to my standard of language use, which is another human. That is the bar. Part of the reason why chatbots are so deeply frustrating is: I know how conversations work. I have conversations all the time. This looks like a conversation, but it's failing my basic expectations of how it should work. And like I mentioned, I dream of a day when nobody talks about or has to think about chatbots, because they just sort of work. And that means we're going to have to meet those expectations of how conversations work, and we're going to have to do it consistently, at high quality. hugo bowne-anderson But if chatbots identify themselves, we may actually be happier with them being less human-like. It could be a slightly uncanny-valley conversation and you're okay with that, because you know it's a chatbot and you don't require all the human norms, right? rachael tatman Yes, there is a lot of support for this. But I also think that that is going to require that you've had previous experiences with chatbots. And something that makes me think about the technology is that it has the potential to really increase accessibility. So thinking about it: if you have used voice technology recently in your own life, it was probably for a reason where you didn't want to or couldn't input text. So imagine all the different folks who, for whatever reason, don't want to or can't input text, or who can input text but maybe don't have a comfortable relationship with GUIs, or even mice. Using a computer mouse is a learned skill. I think as technologists it can be very easy for us to forget the enormous range of people who have not learned that skill, or no longer have it. hugo bowne-anderson Actually, related to this: as you know, and some listeners may, I lived in the US for the best part of a decade. And when I would call my bank or my insurer or, heaven forbid, the IRS, I had to learn to do a horrible American accent in order to be understood by their voice prompts. I mean, trying to call the IRS is deeply frustrating in general, but having to modulate my own accent in order to be understood was pretty frustrating. rachael tatman You have triggered one of my trap cards.
That was the topic of my dissertation, more or less: variation in accents, and how humans learn to understand and handle that difference. And then building; I didn't build a full system, but I built some machine learning models using cue weighting in the same way that we know humans do. If you use similar cue weighting, so not just the linguistic input but also broader knowledge about the person, you can make human-like mistakes instead of non-human-like mistakes. Yes, accessibility of voice technology varies tremendously by region, by ethnic group; I shouldn't say by ethnic group, but by use of ethnolect. So in the United States, some of the relevant ones are Chicano English and African American English, and these are language varieties that have consistent internal structure and that are often associated with members of specific ethnicities or races. hugo bowne-anderson That's fascinating. And wait, I know this is going slightly away from where I was headed, but this is important and I find it fascinating. What's the role of dictionaries in propagating, I suppose... I hesitate to say power structures among dialects, but that is essentially what I mean; I wanted to avoid that, but I think that's probably the best way to say it. For example, Chicano English in a lot of dictionaries might be referred to as idiomatic or something along those lines. How does that play into this entire discussion? rachael tatman I think that's been an ongoing discussion within the lexicography community. I don't want to speak for historical lexicographers, because that's not a group I'm part of or deeply familiar with. But certainly there's a current ideal in lexicography. Lexicography is making dictionaries; I don't know if I said that. hugo bowne-anderson A very short note: my mum has had many careers in her time, but she was a lexicographer for some time, for the Macquarie Dictionary in Australia. And a very small note: when they did an abridged version, each lexicographer on the team got to include one of their favorite words that wasn't in the abridged version, and she chose the word "defenestration". So hi, Mum. Love you, Mum. But yeah, let's go back to lexicography and dictionaries. rachael tatman So the current state of lexicography, to my understanding as an outsider to the field, is to reflect usage in practice, and that may also include reflecting the variety from which the term comes. Something that's related (it's really hard for me not to get sidetracked on language stuff, because of course that is my passion in life) is the annual American Dialect Society word of the year, which of course is for America, and which often will be drawn from a collection of words that might include African American English in particular, and other very socially salient forms of language use. But yes, I definitely agree that dictionaries can have a very normative effect on language and can, unfortunately, be used as kind of weapons. But that's in the published-book sense. Dictionaries also exist in NLP, for things like stop word lists, which I think people don't use quite as much now, but which were at one point extremely common. These are ways of removing very common words. hugo bowne-anderson They might not be used much in practice, but if you're learning this stuff in a bootcamp or an online course, you'll probably be introduced to NLTK and their stop word list at some point, in an introductory sense. So it still occupies kind of a top level in the education system.
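For anyone meeting stop words in a course, the classic NLTK pattern looks something like this sketch (it assumes you've fetched the stop word list once with nltk.download("stopwords")):

```python
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

# The classic intro-course move: filter out very common words
# before counting or featurizing.
STOP_WORDS = set(stopwords.words("english"))

tokens = "the cat sat on the mat".split()
content_tokens = [t for t in tokens if t not in STOP_WORDS]

print(content_tokens)  # -> ['cat', 'sat', 'mat']
```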
rachael tatman Yes. And I'd say the main reason that we don't use stop word lists as much anymore is that we have enough memory to store the stuff that perhaps doesn't have as much signal in it. I've got a whole spiel on data cleaning and removing relevant information, but I'm going to answer the question that you asked, like, twenty minutes ago. hugo bowne-anderson I just want to say thank you to the listener for going on this wonderful journey with us. So yeah, let's take it back. Thank you, Rachael. rachael tatman Yeah, so for people who are new to NLP, my usual recommendation is the textbook Speech and Language Processing by Dan Jurafsky and James Martin. hugo bowne-anderson We'll include the link in the show notes as well. rachael tatman It's available online for free. I'm just looking at the webpage now; I think chapters one and eleven might not be up, and maybe also chapter sixteen, but the rest of it is. It's a living text, constantly updated, and it's sort of the standard textbook in natural language processing, and it goes through a wide variety of methods. And that's the second thing that I would say: learn about, and try, a wide variety of methods. A very common beginner mistake that I see is wanting to immediately go to using whatever the most popular current neural network architecture is. And I don't think it's wrong to learn that. But many times, in an industrial setting, you won't have the time, the compute, or enough data to justify using, for example, transformers. And in those situations, having a variety of different NLP techniques that were developed in a time of more compute scarcity and more data scarcity will really serve you well. So those are my recommendations. hugo bowne-anderson So many things came to mind then. The first is, when you mentioned that a lot of people might not have all the compute all the time: I think we are in many ways living in the long tail of FAANG, as a discipline. rachael tatman Or shadow. hugo bowne-anderson Exactly, yeah, that's right: the giant overwhelming shadow, which of course also brings a huge amount to the industry. But the two things that come to mind are, first, the scale. A lot of businesses, a lot of people trying to build stuff, don't even need the scale, let alone have the resources to do it. That's scale with respect to compute, with respect to models, with respect to resources in terms of human bandwidth. And on top of that, the size of the data. So I don't really have a concrete question; maybe you can tell me a bit about how you do more with less. Data scarcity! And compute scarcity. rachael tatman Yeah, absolutely. Which, I will say first of all, is the reality of day-to-day life for people outside of the Global North. I talk to lots of developers who write code exclusively on their phones (which, like, mad props if that is you), or who have only intermittent internet access, for example, or who are working in a language where there is maybe one corpus, published twenty years ago, available to them. So it is not a hypothetical; it is many people's daily reality. I would say my number one thing is to look to methods developed before the advent of widespread availability of enormous amounts of data and compute. And when I say enormous amounts of data, I mean specifically for English. NLP has a huge, I would call it a problem; I think some people wouldn't.
But I think English-centricity is a problem. So there was a tweet recently; let me find the person who tweeted it. hugo bowne-anderson While you're looking for that, I might just add that I don't know what the number is, but it's something like 75% of the internet is in English, or something like that. And when we use such things for training data, of course (who would have thought?) it's going to be deeply skewed towards the Anglosphere. rachael tatman Absolutely. And the tweet was by Leon Derczynski (speaking of English-centrism, apologies if I said that wrong), who was talking about how the fact that we don't say "in English" in NLP is very similar to how a lot of medical studies don't necessarily say "in mice". But it's very important that it only works in mice, right? And it's very important that it only works in English. Because if I'm trying to build a system for, say, Bambara or something, and I just don't have that level of access to data, it means nothing to me. Like, it's nice for you, I guess, but I'm not a mouse; I'm not using English. hugo bowne-anderson Absolutely. I actually want to set the context a bit more for this; I was hoping we'd get here. I've recently been reading a book called something like The Information Society by a Spanish sociologist called Manuel Castells, and I'm just going to read a small section from it. He writes that scientific research in our time is either global or ceases to be scientific. Yet while science is global, the practice of science is skewed towards issues defined by advanced countries. And this is to make the point of how Anglocentric and Western-centric a lot of the resources and a lot of the research have been historically, not just in NLP and not only in data science. He continues: most research findings end up diffusing throughout planetary networks of scientific interaction, but there is a fundamental asymmetry in the kinds of issues taken up by research. For instance (he wrote this pre-2020, of course), an effective malaria vaccine could save the lives of tens of millions of people, particularly children, but fewer resources have been dedicated to the same sustained effort of finding one. Another example: AIDS medicines developed for HIV in the West are too expensive to be used in Africa, while about 95% of HIV cases are in the developing world. So I think this speaks to a very serious context in which we're operating, and a lot of forces at play there. So maybe it's worth going into that in your world, in the NLP space, now. rachael tatman Yes, there's a lot to unpack there. I spend a lot of time thinking about technological imperialism and the ways that it affects people. So (again, I'm going to start talking about my very strongly held personal moral values) I really believe in self-determination, and in particular self-determination within communities. I fundamentally cannot know what is most important to, say, a developer in Lagos without asking them, right? And it's not my place to know; it's not my place to make that decision about what the most important thing is. I'm a developer advocate; we talked a little bit about what that means.
So my foundational responsibility is to help developers with their needs, and to build technology that genuinely solves the problems that they're having and that their users are having, and that, I hope, makes the world a better place, improves people's lives, and is a good use of their time and effort and resources. And I can't tell you (and I shouldn't want to) what the biggest problems in Nepal are right now from a language technology standpoint. But what I can do is elevate those things when I hear about them. So if you don't follow Rest of World (I think you do, Hugo), I would strongly recommend it. It is in English, and it is a newspaper, an online newspaper I might call it, that reports on technology outside of the Global North, and I have learned so much, and I want to continue learning, I want to continue listening. On the one hand, I think that it's very dangerous for me, or people like me, to say what the fundamental problems are and what the most pressing problems are, because I don't know, and I risk diverting effort and resources (of which I have more than my fair share) away from the things that would be the most impactful. And I think AIDS research and malaria research are great examples. Self-driving cars are something I feel very strongly about: we are spending a lot of time on them, but I don't think we're going to get a good return on investment, even at a national level, even in the US. So there's that; I think there's a lot of hubris. As technologists, we want to solve problems, and I think sometimes the problem you've got to solve is you. hugo bowne-anderson There are several, a whole constellation of, challenges and concerns here. What is the trade-off between trying to build, quote-unquote, scalable products, whatever that means, and trying to actually engage with users and stakeholders and other humans and figure out with them what would help them? I do think that's a problem endemic in our industry. The other is that we've seen the fruits of globalization, but also the deep problems associated with globalization in terms of not empowering... rachael tatman And the fruits and the problems have not been distributed equally. hugo bowne-anderson I wonder if it's worth talking about: you enlightened me with respect to an institution; I might pronounce this incorrectly, but a first approximation is Masakhane. rachael tatman I also don't know how to pronounce Masakhane; we can spell it. I believe it's M-A-S-A-K-H-A-N-E. hugo bowne-anderson Yep. And on Twitter it's @MasakhaneNLP; we'll include that in the show notes as well. They're doing what looks like a lot of incredible work on natural language processing for African languages and dialects. rachael tatman By African developers and researchers. Something I've been thinking about is: okay, I genuinely believe it's important that we as a field very broadly include people from everywhere, and that we empower people to solve their own problems and also make sure that they have the resources to do that, which are two separate things. I think this is a great example of what such an organization can look like. And one of my worries: I think a lot of people in linguistics, a lot of people in NLP, are trying very hard to make it very clear that solving English is not solving language.
Also, anyone who claims to have solved English, or to have solved language, is lying to you, and, you know, keep your hand on your wallet. That's a general piece of advice from me. Something that I do worry about — this is an issue that's been around for quite a while in linguistic anthropology — is you have a situation where Western researchers come into a language community, work with local people — often they're called informants, which I think is a great way to describe that relationship — and create a resource about the language: maybe a grammar, maybe a dictionary; in an NLP setting, you can imagine maybe some language models, some tooling, a corpus. And then they leave, and, you know, this dictionary is published by an academic publisher and it costs $300, and it's not accessible to anybody in the community. So their cultural heritage has been exploited for gain by someone who's not part of their community, and it doesn't help the people whose language it is in the long run. And I do worry about the same thing happening in NLP. I think the way we're going to avoid that is, you know, supporting local grassroots efforts to build the language tools that are genuinely needed, in situ. Like, I don't think it makes sense — again, to use the Bambara example — to create a huge language model for Bambara. I think it's much more important to build things like tokenizers, right — those important pieces of NLP pipelines — where ownership is, you know, through open source, but the resource stays within the community and is guided and governed by the community that needs to benefit from it, instead of being extracted. hugo bowne-anderson Bambara is a language from West Africa, spoken in Mali. I picked that example because — well, I don't know if you know this about me, but I went to Mali 15 years ago and had a life-changing experience there, actually. Maybe that's a conversation for another time, but it's a very, very beautiful country with incredibly resilient people. So I think this dovetails nicely with something I wanted to really drill down into: our responsibilities as data practitioners when working with human data. And I want to frame it this way: we live in an age which is all about individual rights, but I don't think there's enough of a conversation around responsibilities — and we've seen that in the past couple of years, right, in a very serious way. But I think rights and responsibilities are deeply intertwined. So our responsibilities as data practitioners are, in some ways, the flip side of the coin of the rights of the people — the humans — whose data we're using. So what are your thoughts on this, particularly with respect to the work you do, but also more generally? rachael tatman That's a great question. So I think we talked a little bit about my path into language technologies. My background is in social science, and I have received a lot of ethics training around data. It's something that I've thought very deeply about — and, full disclosure, there were things I did in my research that I think were ethically ill-founded, that I would not do now, and that I regret, in terms of, you know, extracting data from social media specifically. I've talked about this at length in other places, and I'm happy to talk about it now. But yes: data is people. All language data is people. Even language data that's generated by language models is predicated on data that was originally produced by humans.
And you can also absolutely extract PII from large language models. Just as an aside here, to keep everybody up at night: there is a large and growing body of research showing that large language models — or anything that encodes a sufficiently large corpus and can generate language — contain pretty big security risks. hugo bowne-anderson I presume most people understand, but I just want to unpack what happened there. Models use training data, and PII is personally identifiable information. But more importantly, people may not realize that if you train a model and then deploy it, giving no access to the training data, there's a growing body of research on extraction attacks, which allow you to draw out bits and pieces of that training data — including PII [see the sketch below]. I think we don't have any cases yet of it actually being exploited. rachael tatman Not that we know about. hugo bowne-anderson No — and that's my point. I'm setting up a straw man in a certain way: we haven't heard of it happening yet, but the fact that it's possible is incredibly troubling. rachael tatman Yes, yeah. And that would be things like addresses, and I expect things like, you know, crypto wallet material to be a prime target in the future. Don't do crypto — that's my personal aside; I would strongly recommend against it as an individual. So even if you think that you've never touched the data directly, you definitely have, in a way that is traceable. And one way that I really like thinking about this — and this is not my original idea; I don't know who to cite for it, so my apologies — is to think about the human data that we use in our work as being given to us by data donors. And the thing about a donor relationship is that it comes with, again, you know, certain moral responsibilities. Whether or not these are also legal responsibilities is, I think, a separate question. hugo bowne-anderson But also, legal responsibilities in the end come from norms. We try to, you know, influence legal systems and legal regulations, but that isn't always the way — as you and I have discussed, establishing norms among practitioners in certain fields can arguably be more impactful, because people hack legislation all the time. Right. So yeah, definitely. Then, once again, I just want to state that there are deep historical precedents for these types of things. Listeners may know I used to work in the biological sciences, in cell biology and biophysics. A lot of my former colleagues worked with a human cell line called HeLa cells, which were taken without consent from an African American woman called Henrietta Lacks, who passed away from cancer in her early 30s, I think. People are trying to figure out different ways to right this wrong and to make sure it doesn't happen in the future, and her family, who are still around, are working a lot to figure this out. I'll link to an editorial in the journal Nature from last year about it, particularly with respect to policy review and action on consent going forward. Her family's position is: look, these things have happened, but we want to make sure they don't happen in the future. And I think, once again, it's worth establishing that this is not just relegated to the world of tech — though of course it's dire there.
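[Editor's note: to make the extraction attacks mentioned above a little more concrete, here is a deliberately naive Python sketch in the spirit of that research — sample many completions from a public GPT-2 checkpoint and flag output that looks like PII. The prompt, regexes, and sampling settings are illustrative assumptions, not a real attack recipe.]

```python
# A minimal illustration of the idea behind training-data extraction:
# sample many completions from a generative model, then flag anything
# that looks like PII. Real attacks are far more sophisticated.
import re
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical prompt chosen to coax out memorized contact details.
inputs = tok("Contact information: ", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # random sampling, not greedy decoding
    max_new_tokens=40,
    num_return_sequences=25,  # many samples raise the odds of a "hit"
    top_k=40,
    pad_token_id=tok.eos_token_id,
)

# Crude PII detectors: email addresses and US-style phone numbers.
pii_patterns = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    re.compile(r"\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}"),
]
for seq in outputs:
    text = tok.decode(seq, skip_special_tokens=True)
    for pattern in pii_patterns:
        for match in pattern.findall(text):
            print("possible memorized PII:", match)
```

[Published extraction research additionally ranks candidate strings by the model's own confidence, to separate genuinely memorized text from coincidental generation; this sketch only shows the sample-and-scan core of the idea.]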
And I think a point you've made — I can't remember where — is that when we create Twitter accounts, we essentially sign away the fact that all our tweets can be accessed via the Twitter API, and these types of things. So what's the role of, I suppose, companies in general — I'm trying to be careful here. My point is that even when we provide consent, most of the time it isn't really informed consent. rachael tatman Yeah, definitely. So informed consent has several components that I think not everyone is necessarily familiar with. One is that it can be withdrawn. That's part of the reason why, when scraping Twitter, you don't want to store the tweets themselves; you only want to store a reference to each tweet [see the sketch after this exchange], so that if someone deletes one, you're not holding a copy, because they have withdrawn their consent. It's also ongoing: at any point, I can stop participating. And I should know what's going to happen with my data — what are you going to do with it, and why — and I should be able to say yes to this part of it and no to that part of it. And, honestly, you know, the standard right now is the huge EULA — End User License Agreement — that's just a wall of text written in very formal, very impenetrable language, which, as a dyslexic person, I often find physically difficult to read. It's usually in a serif font, for some reason — not always, but often enough that I notice. hugo bowne-anderson Did you see — it was some NLP project, I think The New York Times did it — that ranked the terms of service we all sign by some metric of reading complexity and compared them to famous philosophical works? I'm going to get this slightly wrong, but most of them came out more complex, by whatever the metric was, than Kant's Critique of Pure Reason — something like that, right? rachael tatman Yes. And that's a decision, right? Somebody made a decision that this is how we're going to tell users what we do with their data, and that when we update it, they may or may not be informed. I mean, I think at this point everyone should be informed — I believe there's relevant legislation; I'm not a lawyer, don't listen to me about the law. hugo bowne-anderson Yeah. Also, would you speak to — you've made a good point about this — the asymmetry in legal representation and legal knowledge? I mean, big companies have huge legal counsel who can design these things, and as an end user, I honestly don't... rachael tatman Yeah. So I started — I don't know if it was a flame war, but certainly a discussion that some people felt very passionate about — about self-driving cars recently on my Twitter; it's something I feel very passionately about. And one of the things that sort of shook out of it is this idea that, well, you know, when something awful happens to somebody as the result of a self-driving car, we'll figure it out in the courts. Well, I'm going to guess that the pedestrian who gets hit by a car and is pressing charges — or, you know, their survivors pressing charges — are not going to be able to afford as much lawyer as a company owned by one of the richest men on Earth. And that's what's going to determine — not the history, I mean, it'll become the history, but the future direction of tech policy. And — we're getting a little bit into US politics here — how friendly US courts tend to be towards corporations keeps me up at night. Among a long list of things. Yeah.
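[Editor's note: circling back to Rachael's point above about scraping and consent withdrawal — here is a minimal sketch of the store-only-a-reference pattern she describes. The fetch_tweet_by_id helper is hypothetical; any Twitter API client could fill that role.]

```python
# Sketch of consent-respecting tweet storage: keep only tweet IDs,
# re-fetch ("rehydrate") the text on demand, and drop IDs whose tweets
# have been deleted, since deletion withdraws consent.
from typing import Optional

def fetch_tweet_by_id(tweet_id: str) -> Optional[str]:
    """Hypothetical API call: returns tweet text, or None if deleted."""
    raise NotImplementedError("wire up your Twitter API client here")

def rehydrate(stored_ids: list[str]) -> tuple[dict[str, str], list[str]]:
    """Return live tweets keyed by ID, plus the IDs we must now discard."""
    live, withdrawn = {}, []
    for tweet_id in stored_ids:
        text = fetch_tweet_by_id(tweet_id)
        if text is None:
            withdrawn.append(tweet_id)  # deleted: stop referencing it
        else:
            live[tweet_id] = text       # still public: OK to process
    return live, withdrawn
```

[The point is architectural: the corpus on disk is a list of IDs rather than copies of the text, so a deletion upstream automatically propagates to you the next time you rehydrate.]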
hugo bowne-anderson I did see that on Twitter. It wasn't quite a flame war, but maybe it was getting that way. And if any listener strongly disagrees with anything we're saying, I'd actually encourage you to send me a DM and come on the podcast and have a chat, instead of starting a flame war on Twitter — I'd much prefer that. And actually, those kinds of semi-secret conversations on Twitter at the moment were part of the impetus for this. I'd like to move on soon to discuss junior engineers, but I'm wondering if there are any other points you'd like to hit around our responsibilities as data practitioners before we go there? rachael tatman Yeah. I mean, I think we have more than we, as a field, currently want to admit. And something that, again, we touched on very briefly earlier: we have entered into a social contract with the people who are, you know, using technology — and with the public as a whole — in our role as engineers who build things that the public interacts with. And public trust in a sector is a limited resource. If we squander it — by building models that leak data, by building untrustworthy software, by building genuinely harmful things — we don't get it back. And, yeah, I think at some point we'll also deserve to lose it. And I don't want that to happen, right? I genuinely believe in the power of language technology to make people's lives easier and better. It's made my life easier and better. I think I mentioned I'm dyslexic — I don't think I would have even finished high school without, you know, word processing technology, and I certainly would not have finished my PhD. Anyone who's ever had to deal with handwritten or hand-edited documents from me is acutely aware of how much assistance I need. But yeah: A, I think it's the right thing to do, to be mindful and careful and good stewards of the trust that has been placed in us; and B, I think it's also the self-serving thing to do, if you'd like an additional reason. hugo bowne-anderson Yeah, I love that. For me — and for you — the most important reasons we're discussing are the altruistic ones. But even if you're not really interested in those, whatever your motivations are, these things are aligned with self-interest in the medium to long term, if not necessarily in the short term. And framing public trust as a limited commodity, a limited resource, I think is incredibly important there. I had questions around junior engineers getting started in the space, and also what I thought were different questions around developer advocacy — but I wonder if we can combine them in some way? I don't know if that's possible. rachael tatman Maybe! Yeah. So, developer advocacy, for those of you who aren't familiar with the field: it's really only something that exists at companies that build developer tooling. I mentioned I'm at a company that builds a chatbot framework — we don't build chatbots; we build the software to help people build chatbots. And it's a peer educational role. In my case, it's a machine learning and NLP product, and I'm a machine learning and NLP practitioner.
As a field, we've sort of come to the conclusion that it shouldn't be tied to revenue goals. So we don't sell things; we help people build better technology. And it's a great fit for me, because one of the reasons I went to grad school is that I was really interested in teaching — I really like it. So I get to teach and talk to a lot of folks and hopefully help them. That's my goal. hugo bowne-anderson I'll build on that by just adding that it is very important in the tooling space, but of course a lot of companies — whether their main line of business is tooling or not — do work in tooling. You can think about PyTorch and TensorFlow, and how Facebook in particular is investing a lot more in developer advocacy these days. But that's another conversation for another time. Rachael, on top of that — I work in developer advocacy and developer relations and evangelism as well, as you know — some of the most inspiring developer advocates for me are actually open source maintainers. One of the first, actually, back when I was working in academia, was Andreas Müller, who works on scikit-learn, and going to his tutorials made me go, oh wow, this is so important. And you mentioned conversation-driven development — he was telling me a while ago that they practice documentation-driven development, which I love as a concept: you write the docs, and then build the API to reflect how you want to communicate with people. I've also run developer advocacy functions in companies, and just for all your listeners out there who are into data science, and also potentially into communicating and that type of stuff: developer advocate is one of the hardest roles I've ever had to fill, and I think there's a huge space that's opening up. So if you want to chat about developer advocacy, or are thinking of going into it at some point, I'd really love to talk. But it is very difficult to find people who do this, right, Rachael? rachael tatman It's a very specialized sub-career. We're also hiring advocates — or trying! Absolutely: you need someone who has, not necessarily a depth of experience, but a depth of knowledge, and who is a thoughtful engineer, so that there are developers you can give guidance to. And you also need to be good at giving the guidance. I should say, not everyone does what I do. I do a lot of video content; not everyone does. Some folks do a lot of blogging; some folks do a lot of live talks — not so much recently, as you might imagine; some folks do a lot of documentation work, although often that gets shuffled into other roles as well. So it's a pretty diverse field. And yes — hard to fill. hugo bowne-anderson And the role is very much community building in a way as well, right? Which is fun. And since you do a lot of live streaming, I would encourage everyone to check out Rachael's Twitch channel and all of her YouTube content — is it rctatman at most places? rachael tatman Yeah, most places, although the majority of my YouTube content is on the Rasa channel right now. There's also quite a bit on the Kaggle YouTube channel. hugo bowne-anderson Awesome. The other important thing to mention, I think, is that the role involves a lot of conversations, listening, and learning, because developer advocacy is a two-way street between frameworks, libraries, product, and the community of users — getting feedback back to the product. So maybe you can talk a bit about what you like there and what the challenges are?
rachael tatman Well, I do like it! There are a lot of challenges. This is the part of the work that, if you are a developer in the community, you're not necessarily going to see. There are a couple of big challenges. One is that developer advocates might, you know, help out on issues, but we're not usually in charge of the products. It's pretty rare for developer advocacy to run the product team — I'm sure some places do it, but usually they don't. So if I'm trying to convince, let's say, our head of product that we need to make a change, I need to have a relationship of trust with them. They need to have seen that I'm not giving them just a frivolous laundry list of random asks people sent me — that I've really sat down and thought about it, that I have an idea of the scale of the problem, like how much of our community it's affecting, and that I have evidence of that. So I actually do a weekly report out to our product and engineering teams, and then to developer marketing and customer success as well, because they also work with developers — like, hey, here are the big issues; here are the blog posts community members wrote that you might want to know about; this is what people are angry about on the forums. Not that people are often angry about things on the forums, but that's a good way to say: oh, you should know, we should address this ASAP. So it's a lot of communication, it's a lot of cross-functional trust building, and then it's a lot of filtering. I once heard developer advocates referred to as filters for product teams — if I passed on literally every user request I got to the product team, they'd just tune me out. You know, it takes experience, and it takes developing product knowledge, which I think is part of the reason why the path into developer relations is often from engineering, or from a developer position where you've been there for a while. There are definitely people who've started their career in DevRel, but I would say I started mine in grad school, right — that's when I started building language technology — then I did data stuff at Kaggle, and then I switched into developer advocacy. So I also followed that career trajectory. hugo bowne-anderson One thing I would add is the day-to-day cultural aspect of this. One way I think about it is that you want to establish a dev advocacy culture within the organization — where everyone's doing their job, and has to be doing their job, particularly at the size of startups you and I work for, but making sure that everyone's aware that they're able to contribute in a variety of ways. And what I mean by that: let's say we have a new feature, a new version of the API. Is a DevRel or evangelist or dev advocacy person the best person to write about that? Or is collaborating with the person who released the feature the best way to do it? And figuring that out. The other aspect of culture — you mentioned that it's all based around, I mean, people would say soft power. I don't really like the term power here, because it's a two-way street. But the real truth is that, in our industry, a head of product has to leverage all types of social relationships with engineers in order to get stuff done. You can't just tell people —
if a head of product is just telling an engineering team what to do, there are problems already, right? I think definitely letting people know as much as possible about what's happening in the space — particularly what's relevant to the work they're doing and the problems they're solving — is incredibly important. We've talked about the challenges junior engineers face, and this, arguably, is a very large number of people. So maybe you can tell us your concerns there and potential paths forward. rachael tatman Yeah. So, one thing we talked about a little earlier, specifically in the machine learning space: machine learning as a discipline is fairly old — not compared to, like, agriculture, but within computer science it's been around for a bit, right? I think the Eliza system we were talking about was developed in the 50s or 60s. Does that sound right? hugo bowne-anderson The 60s, I think, yeah. rachael tatman The first big wave of funding for AI as generally envisioned came, in the United States, from defense money — so, early speech work. hugo bowne-anderson Also the internet, yeah. And that speaks to the trends you mentioned before — you know, rule-based systems to statistical methods to neural networks to transformers. This type of stuff isn't only in the world of NLP; these are big trends around, kind of, technologization and informationalism in general. rachael tatman Definitely. In machine learning in particular, each of those waves has also been accompanied by a bunch of hype when it was starting, and then a bunch of disappointments, and then the money disappearing. hugo bowne-anderson We call them AI winters. We've had three or four of them, and maybe the fourth is coming! Winter is coming. rachael tatman Eventually, right? But for now we're sort of in, shall we say, the summer — the AI summer. I guess it's the actual summer for you down in Australia. hugo bowne-anderson It is — it's heating up. But London, yeah, is keeping it a bit cool. rachael tatman Yeah. So, because we're in a stage of rapid growth and expansion of the field, there are way more junior engineers than senior engineers in machine learning specifically. I would say that right now, five years of experience in machine learning is a lot. Which is really interesting, because think about, say, databases: five years of experience in databases is fine — it's something, for sure — but it's very common to, you know, talk to somebody, or at least it's common for me to talk to somebody, who's been doing databases for 40 years, and they're going to meetups and telling stories about the one time, you know, a newt got into the server room and shorted something, or whatever. hugo bowne-anderson And if you're a beginner, learn some SQL — that's the best advice I can possibly give. rachael tatman 100%, yes. SQL is mandatory, unfortunately; at some point, you're going to have to know it. If you already use R and are familiar with the tidyverse, a lot of the verbiage is pretty similar [see the sketch below]. hugo bowne-anderson The other thing: going from junior engineer to, like, senior engineer — I don't know what's on the y-axis here, but there's some sort of step function, right, that we want to turn into a little logistic curve or something. We want to make that trajectory — like, learning and building your career — a lot easier in the early stages, or provide support for people doing this, right, which doesn't quite exist now in a lot of parts of the industry.
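[Editor's note: as promised above, a small illustration of the SQL advice — and of how closely SQL clauses mirror tidyverse verbs like filter, group_by, summarise, and arrange. The toy table and column names are made up; Python's built-in sqlite3 module keeps it self-contained.]

```python
# A tiny, self-contained SQL example using Python's standard library.
# The query below is roughly dplyr's:
#   models %>% filter(language == "en") %>% group_by(task) %>%
#     summarise(avg_f1 = mean(f1)) %>% arrange(desc(avg_f1))
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE models (task TEXT, language TEXT, f1 REAL)")
con.executemany(
    "INSERT INTO models VALUES (?, ?, ?)",
    [
        ("ner", "en", 0.91),
        ("ner", "bm", 0.62),   # e.g., Bambara
        ("parsing", "en", 0.88),
        ("parsing", "en", 0.84),
    ],
)

rows = con.execute(
    """
    SELECT task, AVG(f1) AS avg_f1   -- summarise
    FROM models
    WHERE language = 'en'            -- filter
    GROUP BY task                    -- group_by
    ORDER BY avg_f1 DESC             -- arrange
    """
).fetchall()

for task, avg_f1 in rows:
    print(f"{task}: mean F1 = {avg_f1:.2f}")
```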
rachael tatman Yeah, it doesn't — which has pros and cons, right? If you're a junior engineer now, there's a lot of space above you that's empty. But the downside is that that space does not necessarily have, you know, a friendly helping hand reaching down to help you up. So I think it can be really hard to find mentorship, and it can be really hard to find guidance. And, especially in a rapidly changing landscape, it can be hard to manage the expectations coming down to you from, you know, upper management — in a way that, in a team with a lot more variation in seniority, a very senior IC can push back on with a lot more clarity than someone who has just entered their first job. For example, the first data scientist at a company where the CEO was sold a bill of goods by someone fairly unscrupulous — or perhaps misinformed — about the capabilities of a specific system or type of system. hugo bowne-anderson I spoke to someone recently who's a junior data scientist at a real estate company here in Australia, and one of their executives sent them something the other day saying, hey, I've just seen this TFX pipeline stuff — can you use that? And they were just like, oh, come on, please. So I think the proliferation of technologies also contributes to the hype and the unreasonable expectations that can get baked in. This is related to the manufacturing of demand by a lot of companies that, you know, are in the space. And, arguably — I just want to be self-reflexive here — I do think in some ways I've been part of the problem. Like, my incentive, in running evangelism and marketing and this type of stuff at companies, is to earn search engine optimization and build content that I think — I hope — will help people. But it also does, perhaps, manufacture demand. We have a role on marketing teams called demand generation, Rachael. I mean, this is a very cynical stage of capitalism, in a lot of ways. rachael tatman Yeah — "we're going to create problems." To be fair, I don't think that's what marketing does; I think that marketing can be a force for good. I don't want, like, the marketers out there coming after me. hugo bowne-anderson No, no — I was being mean to myself there; I want to be self-reflexive about what I'm comfortable with and what I'm not. I do think, hopefully, the companies I've chosen to work for and the work I've done have been helpful for most people. I actually view marketing, at its best, as increasing the signal-to-noise ratio for people who will find the product useful. Now, I've heard marketers of, you know, toxic financial products say that as well — I don't work in that space, but I think it's important to be self-critical in that sense. How can we help junior engineers more? rachael tatman Great question. Great question. I mean, from a developer relations standpoint: a change that I see — or a signal that I see that someone's probably ready to move up, or, you know, was perhaps under-leveled, and hopefully that didn't happen — is moving beyond thinking that being a developer is writing code. So sometimes you'll see — I'm sure you've seen these sort of very snippy, rude comments on Twitter, where someone who's pretty senior is talking about something, and someone's like — hugo bowne-anderson I've never seen a rude comment on Twitter.
rachael tatman You live in a blessed timeline — where someone very senior will be talking about, like, high-level architecture stuff, and someone will reply: well, you're not sitting there writing code. And yes — I would hope that staff engineers aren't sitting around writing code all day; something would be deeply broken. So, having an understanding of levels: if I can hand you a PRD — a product requirements document — with the implementation laid out, and you can sit down and code it up, that's great; I'd say that's, you know, what I'm looking for in a junior engineer. If I can hand you a PRD that says, hey, here's what I want the product to be able to do, and you can sit down and write out different implementations, talk about the pros and cons, talk about, you know, maybe some estimation of timelines — always dangerous — and complexity and feasibility given, let's say, budgets: that, I'd say, is a more senior regular engineer. And then there's being able to take a vague description of what the product might be able to do and figure out what's feasible and what's a good idea to do — because some things that may be feasible I would strongly recommend against on, let's say, legal grounds or ethical grounds. Like, from my standpoint, there's no reason to put facial recognition in a product, and you should just not do it. There's never any reason to guess gender — just don't do it. hugo bowne-anderson I actually noted this down when you said it — I hadn't heard it put quite like this, and you phrased it beautifully: not everything is worth building. And I love that, because we've talked about some of the potential harms and pitfalls — not all of them, of course; we know there are a great deal more — but it isn't only that these things can cause serious harm. It's: why are we doing this in the first place? And that's worth reconsidering every time. rachael tatman Absolutely, yeah. And then being able to think about things at that level of complexity and scope and, you know, impact. hugo bowne-anderson Speaking of the impact of your work: I am interested in what excites you most in the space currently. rachael tatman That's a great question. Recently I've been struggling a little bit, because I feel like there's a lot of harm being done, much of it unintentionally. And — again, talking about public trust — my trust is also a limited resource, and it can be, you know, pretty draining to hear about awful thing after awful thing after awful thing. I think it's important that we're aware; I think it's important that we constantly consider the possible harms that our work could do. I think security folks are a really good model for this, specifically for machine learning folks — I don't know if you've ever spent a lot of time with a security person; they are very paranoid, and I think we should be paranoid on behalf of our users. But that's not the whole field, right? There's a lot that works. There are a lot of accessibility features that work. And we talked about Masakhane, and the sort of development that's driven by the community it impacts, to meet their genuine, immediate needs, which I think is fabulous.
I love seeing, you know, different ways of approaching building software — and not just machine learning. There's an app-slash-website called Miiriya — M-I-I-R-I-Y-A, I believe — and it sells Black-owned goods: things produced by Black-owned businesses. It's a single developer who's building it, and he's basically building it as a community good, right? So there's a Patreon to pay for his living expenses; otherwise, pretty much everything is folded back into the business. Recently he added a way for people to, you know, provide mutual aid to people in need — so someone's like, hey, you know, I'm sick, I can't work, I need $200 to help make rent — a way for the community to come together and help fulfill that need. And I think that is a radically different approach to building software, and one that really, genuinely interests me, and that feels like a feasible, sustainable, long-term way for us to exist in a society where software is integral. I think there's a lot of that sort of ethos in the open source community as well — that this is something we are doing for our community, to create good. And of course, that comes with responsibilities — I think the log4j thing is at the top of everybody's minds right now. It can be a heavy burden, I should say. And I am not saying that anyone is beholden to contribute to open source; I think that's an awful way of approaching it. I think you should see it as, you know, a gift and a calling. hugo bowne-anderson And particularly as you've framed part of your driving philosophy as self-determination, particularly within communities — we shouldn't compel people to be involved in things that don't work for them. I think you will have seen this recently: a bunch of wonderful people have just created the Distributed AI Research Institute. rachael tatman Yes, yes. hugo bowne-anderson So this is Timnit Gebru and others, and on their website they have a wonderful sentence: AI is not inevitable. And I think there's a huge, growing movement of people interested in figuring out paths forward for, let's say, responsible AI — so definitely follow their work. There's also great work from the Algorithmic Justice League, and from Black in AI — there are many institutions, and we'll include some in the show notes as well. But if that's something that interests you, listeners — and I was going to say, arguably, it should interest you, but that sounds too moralistic for my liking — it's an important conversation to be having, I think, and the more people that are part of it, from all around the place, the better off we'll all be, I feel. rachael tatman And something that I think about whenever I'm in that situation where, you know, I'm feeling particularly pessimistic: I'll look around at the other people in the community who do also genuinely care, and they're talking about it, and they're bringing it up, even at great personal and professional cost. And that makes me hopeful. hugo bowne-anderson Yeah, me too. And that's another motivation for why I actually love having these long conversations and long-format podcasting. As you know, I've done a lot of online education in my time, and I get messages from people all over the place saying thanks for the course — and it's so wonderful; it's something that really gives my life meaning.
But occasionally someone will message me saying, thank you, I learned so much Python from you, and I'll respond, what are you using it for? And they'll say, I'm working on high-frequency trading now. And — I'm not putting any judgment on anyone else; I'm talking about my own personal reaction to a message sent to me — I always, naively, in my 20s, thought education was simply a great thing. But once again, education can be used for good and for harm. And I do think having long-format conversations where we delve into these types of issues can help bring out the depth of the lives involved and impacted — as opposed to teaching, which a lot of the time is teaching APIs, right, because that's what people need to learn in order to do stuff; but what else we include in education is important. Actually, this is a really important point: I spoke with Jeremy Howard recently, and he mentioned to me that — maybe it's on YouTube, I can't remember where the videos are — most of his lectures have really high ratings, except the ethics lectures, where there's a significant drop. I think the point is that people are interested, but they don't view it as necessary to learning these tools. And I honestly understand: if I'm working a full-time job, and I'm learning stuff in the evenings as well, I need to learn the stuff required to do the job. Arguably, we need to open up space and time for more people to learn more things. rachael tatman Absolutely. And if you want to work in the field, and you want the field to still be here in five years, I would say start the ethics training now. hugo bowne-anderson Absolutely. Rachael, this has been a fantastic conversation. I'm wondering if you have a final call to action for our listeners, or something you'd encourage people to think about? rachael tatman Yeah. I would say, just as a way to get started with thinking about technology ethically, a very concrete to-do is to sit down and think about what applications of technology you believe, in your personal moral framework, to be immoral or unethical — and decide what you won't build. We're in the hottest labor market of my adult life right now. If you have a technology background, hopefully you have, you know, lots of offers and opportunities and options. Your labor is limited; you get to choose where you spend it. So pick things that you feel good about. hugo bowne-anderson I couldn't agree more. Thank you so much, once again, for a wonderful conversation, Rachael. Transcribed by https://otter.ai