[00:00:06] What's up, everybody, welcome to the @TheArtistsOfDataScience Happy Hour. It is Friday, July 2nd, and I'm super excited to have all of you here, funneling into the room. Hopefully you got a chance to tune into the podcast. I released an episode today with the one and only Dr. Jordan Ellenberg. He's the New York Times best-selling author of How Not to Be Wrong: The Power of Mathematical Thinking, as well as this book right here, Shape. It is an amazing book. I absolutely enjoyed reading it and absolutely enjoyed discussing it with him. I'm actually going to be giving away this very copy. Dr. Ellenberg was generous enough to hand me two copies, so I'm going to be giving one away. How can you win one? It's easy: just go on LinkedIn and share this video with your network. I'm going to randomly select somebody who has shared this video or this live stream on LinkedIn, and that's how we'll figure out who wins this copy. Great book, I really enjoyed it, and I hope you enjoy it too. If you want to win your copy, go share this live stream right now on LinkedIn. The room is packed, and I'm looking at all these wonderful friends here. Mark, what's up? Eric, what's up? Russell? Rashad, what's up, man? Super excited to [00:01:31] have all you guys here. Hey, I've got a question we can kick off with. I've got a few of them, but let's start with this one: what's a topic that you've decided it's finally time to go all in on? We all know that not only is data science a broad field, but, just like life in general, there's so much stuff to learn out there. What's something you've [00:02:00] finally decided it's time to go in on, and why that topic? For me, this week I decided: you know what, it's been far too long, I've been putting it off for way too long, I'm going in on natural language processing. I feel like there's a need for it for me personally. With the podcast and all the transcriptions I have, I feel like I can apply natural language processing to that data and extract some really valuable information and insights. So I figured it's time for me to go in on that. What about you guys? Let's start with Rashad, and shout out to you for being in the house, Rashad. What's something that it's just been far too long on and you've decided it's time to go in? [00:02:47] The "far too long" one is really easy: it's audio analysis, because when I first got into data science, well, I love music. In fact, I was strumming my guitar right before this, and I really wanted to work for Spotify or something and create the ultimate music recommender. It's also unlike, say, a lot of the real estate finance analysis that I do at work. It's a very different starting point: it's signals, it's decompositions, a very different way of thinking, so it's very refreshing. So I've been getting into that, actually just going on LinkedIn and finding some presentations that people have put up. I'm sure I'll find talks eventually too, but that's what I've been going in on. [00:03:25] That's so cool, man.
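Since audio analysis keeps coming up, here is a minimal sketch of the signal-decomposition side Rashad describes: pulling track-level features out of a raw audio file with librosa. The library choice and the file name are my assumptions (he doesn't name a tool), but vectors like these are the kind of thing you could later stack into a music recommender.

```python
# Rough sketch: decompose one track into summary features for downstream modeling.
# "song.mp3" is a hypothetical local file; librosa is an assumed (not stated) library choice.
import numpy as np
import librosa

y, sr = librosa.load("song.mp3", duration=60)              # first minute is enough for a demo

tempo, _ = librosa.beat.beat_track(y=y, sr=sr)             # rough beats-per-minute estimate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # timbre summary
chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # harmonic / pitch-class content
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # "brightness" of the sound

# Collapse the time axis so each track becomes one fixed-length feature vector.
features = np.concatenate([
    np.atleast_1d(tempo).astype(float),
    mfcc.mean(axis=1),
    chroma.mean(axis=1),
    centroid.mean(axis=1),
])
print(features.shape)   # stack one of these per track to build a similarity-based recommender
```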
Like I said, that's something that I'd probably do next, because I'm into exactly what you're talking about, man. I'm super, super into music itself, and doing music analysis and building recommendation engines like you're talking about would be freaking awesome. Have you played around with Spotify's API at all? They have so many rich data sets that have audio features for all these different tracks and things like that. Have you checked that out at all? [00:03:51] I have not, but I will put a note to do so in my task-tracking app. [00:03:55] There's even an integration with it, called spotipy, for [00:04:00] Python, so it makes it easy to do that. Shout out to Greg in the house. Alexandra, what's going on? Joe's joining us while taking a walk. Happy to have you guys here. And anybody that's just tuning in: if you want to win your copy of Shape, you have to share this live stream right now. I'm going to randomly select somebody at a later date to win this book, but the only way you can win it is if you share this live stream. Mark, man, what's something that you've decided it's finally time to go in on? [00:04:34] For me, it's business strategy, really diving in on business metrics and the business side of things. A lot of my work has been focused on the early stage, startups and partnerships, and I've shied away from the inner workings of companies once they're established. I just talked to my manager, and for the direction I'm trying to go, this is the piece I really need to upskill on to take me to the next level. So I'm trying to read some more books on this, go through various courses and training, and think more deeply about it instead of just jumping into the work. Business strategy is really interesting, too, seeing all these different levers and frameworks people use to frame these business problems. [00:05:18] Yeah. I was actually catching up with Vin earlier today, catching up on that session I had to miss. Have you checked out the episode I did with the author of How to Be Strategic? That whole book is about strategy. [00:05:31] I was able to get my colleagues to choose it for our book club, so we're reading it. [00:05:38] There's another book too that I like, called Cracking Complexity. I interviewed the author of that book, and that episode should be released sometime later this year, so keep an eye out for that as well. Alexandra, welcome, first time I've seen you here, happy to have you here. What's something that you feel like you've been putting off for too long but have decided it's time to start picking up and studying? And while Alexandra is giving us [00:06:00] her answer, if anybody else has questions, put them right there in the chat, wherever it is you're enjoying this from, whether that's live in the Zoom room, on Twitch, on YouTube, or on LinkedIn, and I'll get to your questions. Alexandra, go for [00:06:11] it. Yeah, thank you, and thank you to Eric, who invited me on LinkedIn earlier this morning. So thank you for the invitation. I'm similar to you, I'm trying to work on my natural language processing skills.
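For the Spotify side, here is a small sketch with spotipy, the Python integration mentioned above. The client credentials are placeholders for your own Spotify developer app, and the track query is just an example; it pulls the precomputed audio features (danceability, energy, tempo, and so on) that make a handy starting table for a recommender.

```python
# Minimal spotipy sketch: look up a track, then pull Spotify's precomputed audio features.
# CLIENT_ID / CLIENT_SECRET are placeholders for your own Spotify developer credentials.
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(client_id="CLIENT_ID", client_secret="CLIENT_SECRET")
)

hit = sp.search(q="track:Bohemian Rhapsody artist:Queen", type="track", limit=1)
track = hit["tracks"]["items"][0]
features = sp.audio_features([track["id"]])[0]   # danceability, energy, valence, tempo, ...

df = pd.DataFrame([features])
print(df[["danceability", "energy", "valence", "tempo"]])
# Loop this over a playlist and you have a feature table ready for clustering or
# a similarity-based recommender.
```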
I come from a marketing background. So being able to understand how consumers are talking about your products or even I've been dabbling with the Twitter API a little bit to try to do some tweet analysis has been fun and to connect to Richard's point about also being a Spotify music junkie. I've been using the Genius API a little bit to try to look at the sentiment of song lyrics and things along that line. So trying to really dove into that space [00:06:48] Is the genius at the same thing that like Apple did. I'm beyond what Apple does genius anymore with their music. But is that the same thing? [00:06:54] Yeah, I'm not sure if it's exactly the same as what Apple's doing any anymore, but the concept is the same where they're just pulling song lyrics. [00:06:59] Basically nice. That's cool. And so shout at everybody else in the in the room. If anybody got questions, let me know. We'll get to all those questions. Let's say let's hear from a man. I hope I don't put your name here, but is it too early to make a memo. That's right. Please help me out here. [00:07:18] It is safe fairly. [00:07:20] Do I do what? It's something that, you know, you feel like it's it's finally time to go in and start studying. Yeah. [00:07:26] Hi, everyone. And just. Yeah, it's I attended a few sessions here. It's 12:00 midnight. Sometimes I have my video on but it's normally too late so I'm really happy to be here. Yeah. Well in fact I am also doing a bit of NLP my Dening Data science is still quite a new eyewitnessed artistic environment, though. Recently I started using NLP in analyzing survey data. You know, these critics [00:08:00] that you normally need to analyze and get some information from them. So I've used that this Indiana survey that would be on that. It's quite useful because normally statisticians don't as much as they collect information. You know, when you fill in a questionnaire, sometimes you have a section and it's a bit more descriptive. And in a statistical environment, you deal with numbers. So that takes normally they collect information, but they don't analyze it. So that's a process that really helps me. Just did a basic and basic basic basic coding. And you're really able to do some modeling and you can extract quickly information depending on the subject. So that was quite useful. But I'm still learning. I still want to explore more and maybe do a bit more modeling and explore some packages. I actually use AIs in most cases, so I do Python, but I mostly analyze and use ah in my day to day work. [00:09:09] Yeah. So I used to be a statistician as well. I worked as a statistician, clinical trial statistician, studied statistics in grad school and all that stuff. So you kind of cut from the same cloth. Eric, let's hear from you. And by the way, if anybody has questions, let us know there's some questions rolling in on LinkedIn that we'll get to. But first, after we hear from Eric on this topic, we'll go to a question from Spencer. Spencer emailed in question, but I'll just have you like yourself in training video on Spencer, if you don't mind, but go for it. [00:09:40] Yeah. So I think I would have to say, kind of like Mark knows that I'm probably going to say because I was going to ask a question today about bias in data sets of machine learning. So that's that's the thing that I'm really interested in and want to dig more into, because like we hear about it, we know it's there. And it's a thing that there is much hand-wringing done [00:10:00] over the topic of bias and machine learning. 
And yet I don't. I know a lot about quantifying it and mitigating it through algorithmically, and so that's that's my next that's my next big push. [00:10:12] It's an interesting topic. And I like to add to see what you learn about that and follow along on your journey for sharing that on on LinkedIn through and stuff like that. And that would be absolutely awesome to hear about a shout out to everybody that's just joining in. Remember, you can win a copy of Shape by Jordan Ellenberg if you share this on LinkedIn. Just go ahead, share this entire live stream. Doesn't even matter if it's a couple of days later. Just make sure you share this with your network and you can win this book right here. It is an amazing read. This is actually a book that has an advanced version with uncorrected proof. OK, given that one away, though, shout out to Spencer, Holly Spencer, go for it. Greg, I've got you added in the question queue as well. And then friends on LinkedIn. I will get to your questions as well. Spencer, go for it. [00:11:03] All right. Thank you. So I've been working on getting started with an independent project of doing an independent project. And I have two ideas in mind that I've dabbled with them a little bit, but I'm going to really, like, move forward with one of them. And so one of them is to do lead lead scoring type of thing for a software company that I have in mind. And the other one is to do it like some YouTube analytics. What I have in mind right now is predicting how many sales, like a certain YouTube video can do, like let's say let's say we have a YouTube channel and there are business that sells like a course that teaches you Python. And then you're like trying to predict how many poor sales and individual YouTube could make or could be attributed to that video in between those two projects. I think that the second one seems [00:12:00] more doable, but I'm more excited about the first one. And I'm kind of worried, though, if I do the first one, like, for example, I can find out which within the company, which which customer or which leads did convert or like like who the customers are. But I can't really find out like who is not a customer. So I've got like ways to I thought of some ways to kind of fill in the gaps and make a really scrappy project. And I'm kind of my concern is that if I do that, I'm just going to have this like kind of odd looking project where I'll have a bunch of gaps and say like, well, in a normal case I'd do this, but I'm missing all this stuff. So I don't know what kind of my question is like how the trade off between that really scrappy project that might have a bigger impact versus smaller project that's out, but it's actually like has the Data boy there. [00:12:58] That's a great question. That definitely this Exane better to have an odd looking project than no project at all. I definitely do with that as well within. What are your thoughts on this and this? Hear from Ben then after then we'll go to Rishard and then Joe, if if you're if you want to chime in on this, I'd love to hear from you as well. [00:13:15] Yeah. I like projects that are real world. And so as you're listing, like you're my problems here, my problems, that's all that is. That is absolutely perfect because that's what we deal with. You know, I don't have enough data yet and we never do. 
And so any project that you look at and you say, well, it's not going to be perfect, it's not going to be great, those are actually perfect because what you're going to do throughout the course of that independent project is learn and display your capabilities of handling just real world issues that we come up against all the time. And that makes for such a more rich project, especially when and I know you want to probably think of the perfect project. We have the perfect results. And that's no project [00:14:00] ever. The caveats that you can put in where you say, you know, I started out thinking I could do this. And I realized about ten percent of the way through Data analysis that I couldn't. And so I pivoted. Those are amazing because that's real world where you didn't waste the time. You got to a certain point and you said, hey, I can't do exactly what I initially did, but here are some suggestions where I could still do value that stuff that we do all the time. And so think about it that way. Even if it's not the perfect project, think about it in terms of how it does work in the real world, because those capabilities that you're showcasing to handle ambiguity, setbacks, creatively solve problems that we run into, especially around Data. Amazing. Those are great. [00:14:41] Thank you very much. In Rishard. [00:14:43] Yeah, I agree with everything Ben said. I think of projects as being facsimiles, a way to demonstrate your performance on a facsimile of the real job. I also think of the interview process. You try to get as close to like, OK, what are you actually going to be doing day to day and how can I test that? So a side project is like an extension of that. And so if you are working with more imperfect Data. You actually have like a real business case for some actual thing like problem you're trying to solve, that's probably the most valuable. And then I'd say you take a couple steps forward. Oh, I can't do that. And then you do this. If you were able to write about that, that actually shows. Wow, that separates you from all the bootcamp people because then you're like is that it's a very difficult thing to train unless you actually experience it and do it. So if you experience that, it's differentiates you. And if you're able to communicate effectively on the on the fact that, oh, I had to pivot from here to here, that I'd say that would definitely be a part of my AIs hiring. Right. So that I think that I would definitely go for more real world. [00:15:44] Some awesome, awesome advice coming in so far. So, Spencer, hopefully you taking notes. Joe, what do you think? [00:15:48] I agree with what everyone's saying here. Hopefully my reception's OK. I'm kind of I've got it perfect. But yeah, I would say one hint, too. I think everyone agrees on your first project that you mentioned being awesome. [00:16:00] I agree. I would also say that this gives you a good opportunity to help define one of the things that is my biggest pet peeve with these types of projects is defining what is the customer. Right. And so I think if you can go through that exercise and really show your ability to not just process it Data go through the rigorous definition of being a customer, which is actually that question is a very difficult question for I think a lot of companies to answer. It was a customer who isn't a customer. When are you customer one or are you not one? So if you demonstrate an ability to ask that sort of a question, I know somebody's hiring manager or somebody like myself or vendor. 
Greg. Yeah, that's awesome, that you're not just doing the analysis but actually taking the time to define the customer you're trying to qualify. [00:16:52] So awesome. Thank you very much. Spencer, how do you feel about that? Some good advice there. Anything else you want help with? [00:17:02] Yeah, definitely good advice. That helped, it definitely gives me more clarity, just knowing that I'm doing a project that other people think at least has the potential to be worth something, even if I can't really pull off the full project. When I was going through the early stages of just writing up the readme and stuff, I was writing that I'm going to have to be flexible while doing this because I'm not sure how it's going to work. So I think it's important for my understanding to be flexible, but still definitely having the goal in mind to do the project, and also accepting that maybe you can't necessarily force it to work, but you can at least analyze the data and find some kind of value [00:17:55] somewhere. Right. Well, I'm looking forward to seeing what you come up with, and if you ever do want to do a YouTube [00:18:00] ad or any type of ad project, get in touch with me, I might have something for you. So let's go to Greg next. And I hope you guys don't mind if I jump over to LinkedIn first to grab some of the questions from there; some great questions are coming in. And by the way, speaking of LinkedIn, if you guys want to win a copy of Shape by Jordan Ellenberg, you can win this exact book right here by sharing this live stream on LinkedIn. Make sure you guys do that. So there are some questions coming in from LinkedIn. One of them is: how do you go about finding data quality issues? Does anyone use the pandas-profiling or Sweetviz libraries? This is coming in from Akaash. Oh, I've used both pandas-profiling and Sweetviz. So, yes. How would you go about finding data quality issues? Let's hear from Mark on this one and then go to Greg. [00:18:50] Yeah, I mean, I've used pandas-profiling before. I normally use it as a quick check just to get started, for a very quick, scrappy project. But for the most part there's the CRISP-DM method, and I forget the various steps, but there's essentially this methodology for identifying what your data set is, what it looks like, and all the issues. Key things I always do: I'll quickly view the data set, even if it's large, with a quick head, so I get like ten rows and an idea of what I'm looking at. I'll get the value types, so for each field, is this a string, a character, whatever it may be. I'm looking for null values, so getting those counts, looking at various distributions, and then, when working with text or categorical data, checking whether things are consistent across those values. That will probably get you pretty far for the most part, for a lot of things, for data quality. I guess another question is: are you thinking about data quality for a one-off project, or data quality for a pipeline? Those are two different problems with two different ways of approaching them, depending on where you are. For me, at my job, I work with a lot of data quality [00:20:00] issues from a pipeline perspective, and as a data scientist I'm not the one creating the data, so I have to work with engineering to get these things solved.
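A minimal sketch of that first-pass routine in pandas, plus the one-line report generators from Akaash's question. The CSV path is a placeholder for whatever table you are inspecting.

```python
# Quick first-pass data quality checks, roughly the routine described above.
# "data.csv" is a placeholder for the file or table you're inspecting.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.head(10))                       # eyeball ten rows
print(df.dtypes)                         # value type of each field
print(df.isna().sum())                   # null counts
print(df.describe(include="all").T)      # rough distributions / basic stats
for col in df.select_dtypes("object"):   # categorical consistency (typos, stray categories)
    print(col, df[col].value_counts(dropna=False).head())

# The report libraries from the question wrap most of this in one call:
from pandas_profiling import ProfileReport   # newer releases ship as ydata-profiling
ProfileReport(df, title="Profile").to_file("profile.html")

import sweetviz as sv
sv.analyze(df).show_html("sweetviz.html")
```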
And so it's just really dependent on like what what I wish I have more clarity on where your question is for kind of like we mean by data quality that was seemed all over the place. That's how I think about data quality. [00:20:21] That's just a huge, huge area. Right. So definitely I mean, all the suggestions make complete sense. Greg, how about you? [00:20:28] Oh, I'm not the most versed in those tools to Data truth. So I couldn't tell you anything more than what Mark was saying in terms of one of Data cleaning's when he was talking, I was thinking about, oh, yeah, those are the things that you can do, even with a simple tool that's been around for the longest, something like Excel. Right. So you want to spot check the missing values, et cetera, look at the distribution of your data, etc. and then for the data pipeline quality piece, this is one thing that I typically work on also working with Data engineers. This is key to partnering with business folks because we want to understand the origin of of of those triggers of Data events, understanding where they coming from and understanding why they're changing over time and then making long term strategy for capturing those changes and making sure that they are addressed properly so that you come up with a set of tools, whether it's data sharing pipelines or data processing pipelines or Data our industry versioning pipelines, where you can properly monitor those things. So that's a different ballgame. But to Marc's point, I don't think I can add anything else. [00:21:53] Yeah, thank you very much. I agree that's extremely valuable. Hello, Russell. Eric, do you have any tips here on finding [00:22:00] Data? The issues crickets from. [00:22:06] And I was waiting for everybody else to jump in, so what I would say with Data quality issues is I want to add something to what Greg was saying with the Providence piece. You really can't do too much, has to surface level assessment in tools like the ones you mentioned with pandas and so on. But there should be some metadata associated with every data set that you have. And a lot of what you have in pandas doesn't really allow you to understand the provenance side of it. And a lot of Data quality is understanding how it was collected and how it was gathered, because the gathering methodology could completely invalidate everything you're trying to do with it. So you definitely want to use the standard tools that are out there, but also just asking some questions, sometimes not even getting to the point of using tools, just asking questions like where was this gather together that when was together? What was the thinking behind gathering? You know, just go through some of those questions with whoever it is that owns the data. And if it's like a homework assignment that you're working on or classwork assignment that you're working on, obviously less relevant. But this is more of a real world applied situation. [00:23:14] Yeah, absolutely. I was waiting for somebody to jump in and talk about the humans in the loop. Right. Talking about how did this data come to be, where is it from and getting that kind of background knowledge, as it were, Joe, or [00:23:27] There is something in there that reminded me of bringing a human into it is a good idea. So like I was maybe this is about three weeks ago. So I was like doing one of my very first analyzes here in my new job. And I like did whatever, you know. And when I showed my manager, my managers like those numbers look weird. 
And of course, I have no frame of reference because I've never seen these numbers before it. So I didn't know what looked weird or what didn't look weird. And she's like, oh, that's because this particular place that you [00:24:00] were pulling the data from only holds two years worth of data and you were pulling two and a half years worth of data. And so it held two years of data for this chunk and then like two plus years for this other chunk. And so I didn't realize it's just like some institutional knowledge that I didn't have. And I don't know, it's probably written down somewhere, but I had seen it. And so, yeah, just talking to fellow humans is sometimes a good idea. [00:24:25] So people also codifying this stuff, too, right? So, I mean, you're starting to see data catalogs and Data dictionaries suddenly become popular. I mean, but I think anybody who's has enough data, i.e. enough that you had to repeatedly use it, have been kind of documenting it in some form or another. If they had, there are somewhat smart about it. Even just defining what what are these fields? What are the expected values, these fields, what's expected distribution? Where does it come from? XPoint. Right. All these things that were and so [00:25:00] We're having some issues that [00:25:02] I think a lot of open source projects like Data collaborative use or so I'll type and chat. [00:25:09] Ok, yeah. No worries. No worries. I know you out there. [00:25:11] I think Eric made me think about something that used to burn me all the time is that when you're establishing a Data strategy is understanding the I guess the footprint that you're projecting for a company or a department over time. Right. So if you know you're going to be in the US, then you can think about a very uniform way for the timestep. Right. So but if you're going global, right. So I've collected data for global operations and some missing values were due to the manufacturing plants were closed at the time, right asleep. So we're not collecting data at the time because the factories are closed. And I was sitting here wondering why we can't see anything. And simply because the data wasn't collected at the uniform UTC, for example, time [00:26:00] frame. So it is good to understand how these systems are staged to collect data in a uniform time frame or time zone so you can quickly translate it and be uniform across the board, especially if you have a global impact. So that's a great reminder. [00:26:15] Yeah, absolutely. Russell, go for it. Also, Alexandra or Nisha, if either of you want to chime in on this, please, by all means, do but go for it. Russell. Yeah, I've [00:26:25] Got a couple of follow up points to the human aspect of it. So I think that we can try and do a lot of proactive analysis on the quality of the data so we can look at the the sorts of data, the formats of the data and try and predict the stability level of that data. Can you see me? Okay, I think I've got some connection issues today also. So, yes, one of the biggest issues I have with Data stability is the owners of the Data that are providing that to us do wanting to change it because it suits them and not informing the Data pipeline. So at. New fields changing the titles of fields, computers, etc., you know, adding new lines, just basically breaking the quality of the Data. 
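One lightweight way to catch the kind of upstream breakage Russell describes, new fields, renamed columns, changed types, is to pin the schema you expect and check every new delivery against it. A rough sketch; the column names and file name are hypothetical.

```python
# Rough sketch: flag upstream schema drift by comparing each new delivery
# against the schema you expect. Column names and the file name are hypothetical.
import pandas as pd

EXPECTED_SCHEMA = {          # column -> expected pandas dtype
    "record_id": "int64",
    "event_date": "object",
    "amount": "float64",
    "region": "object",
}

def check_schema(df: pd.DataFrame) -> list:
    """Return human-readable problems; an empty list means the delivery looks sane."""
    problems = []
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    unexpected = set(df.columns) - set(EXPECTED_SCHEMA)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if unexpected:
        problems.append(f"new or renamed columns: {sorted(unexpected)}")
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

new_batch = pd.read_csv("latest_delivery.csv")
for issue in check_schema(new_batch):
    print("DATA QUALITY WARNING:", issue)   # or raise / page someone instead of printing
```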
So with some domain expertize, you can try and predict the likelihood of that happening, try and increase the Data literacy of the people that are the owners of that Data to try and reduce the likelihood of that happening and where it's impossible to be mindful that these types of breaks might happen. So if you find there's a failure in your model, you can kind of look at these places first because they're the most likely that the issue is going to be. And then one follow on point, which is from one of Cho's messages about figuring out who your customer is. [00:27:43] I heard something in the Jansing sometimes that consumers can be different from the customers. It depends on the setup. So if you're if you're building a report or some kind of data analysis output for your customer, who's asking you to do it, but he's going to be consumed by a number of people, try [00:28:00] to be mindful who the consuming audience is. I've been in some situations where we've created some output that's been now almost cutting edge or just really, you know, we kind of pleased ourselves by the output and felt really good about ourselves. But it's gone completely over the head of the customers because it's too advanced for their current levels of Data literacy sometimes, really. So I think you do need to be pragmatic about the technology you put in for the solution to make sure it is optimized for the audience. And that may mean that you need to swallow your pride a bit and do something that's not as good as you would like it to be so that it's one to the audience at that time and then try and try and increase that basic literacy and then the the the technology levels of the solution in time. [00:28:46] Thank you very much, Russell Alexander. And any [00:28:49] Of you got [00:28:49] All that? Yeah, definitely some good points there. [00:28:52] I just wanted to add that what Eric said very, very much resonates with me because in health care, Data the Data is essential for a whole different purpose, probably for something else. But we use it for secondary reasons, outcomes, deciding outcomes or something else. So in that case, um, Data quality is it's not really about the Data quality, because in the first in the first place, it's it's collected for some other main purpose within it being used because it's already there and can be used for secondary outcomes. And when it comes to Data quality, it is kind of important to understand as to why. What better to give an example, let's say there are four or five problems that actually has the date that we want. But some columns, even though the Data populated, you shouldn't be using it because it's populated for some other reason. And that's probably buried somewhere deep in the documents who essentially load the data, do the EPA process [00:30:00] for it. But as an analyst or data scientist, you need to know either through talking to other people who've been there for a while or somebody who is really aware of that process to figure out which columns you need to use. They all kind of look the same because they all have bits and it's still the data are valid. They're good quality, but you cannot still use it for your analysis. And that's not something you would want to do because your results will be totally off. And then when you go show to your boss, look, it just doesn't make sense. That's what will happen. 
Um, so I just want to say that when I believe that you need to figure out where the data, how it is generated and you need to understand that and figure out whether it's actually useful for your good analysis. But overall, talking about profiling, I pretty much agree with my to the purposes that we use it for. I think of it pretty much everything that I would say. [00:31:06] Are you an aspiring Data scientist struggling to break into the field or then check out DSG doco, forgo artists to reserve your spot for a free informational webinar on how you can break into the field, that's going to be filled with amazing tips that are specifically designed to help you land your first job. Check it out. DTG Dutko Forward Slash Artist. The mayor, thank you very much, if we can go ahead and move on from this topic unless anybody else has something else to add, a lot of great topics or questions coming into LinkedIn. And Greg, I know you guys have questions, but I just want to get through some of these. Doesn't look like anybody else has anything to say about Data quality. So let's continue to move on. Shout out to everybody on LinkedIn. What's up? Robert and Albert Alberts asking if it's a school pick today because we all have fresh haircuts. And happy birthday, Albert Einstein birthday. [00:32:00] Happy birthday, Albert Einstein. [00:32:02] Or was it yesterday? Well, happy birthday or belated birthday [00:32:05] And happy birthday, Albert. So, Antonio, comment from earlier talking about something that he's been waiting too long to go in on. It's no code that somebody is asking if we have any job openings. Don't ask that, please. And I can't say this person's name on India. Can you throw some light on how to survive in Data science if coming from a background other than math or stats, how difficult life would be without knowledge on these? To my second question is where to start the journey towards Data science? So we talk about this a lot and a lot of different previous office hours. I highly recommend you to go through my podcast, look at every single office hours and see the annotations. This is a question that gets not covered quite, quite frequently. So I urge you to please go in and listen to any one of those episodes in the podcast. There's a lot of stuff that that can help you there. Plus, oh, I like to watch a good friend of mine. What's up, man? So I want to know what are some key metrics to measure user experience on a mobile application to add more context, the idea is to see if adding advertising windows on a app page has any impact on the user experience. He understands that we should do a B testing, but he's not sure what metrics should be captured to measure user experience. That's a great question and I actually really like that. Let's go to it. Let's go to Ben on this one then here from from Marc, because I know Vince got a ton of experience on this. And I know that this is something that Mark works on with humor as well. Then, Rishard, if you got any insight on this as well, I would love to hear Bhavin go for it. [00:33:45] We're talking about apps, right? Like mobile apps. [00:33:49] Yeah. Mobile application, key metrics to measure user experience on a mobile application. 
[00:33:54] One that I just love: if you're doing A/B testing, look at the [00:34:00] flow, the difference between the flow with the ad and without the ad. That can be everything from time on page to time completing a particular task, and I don't know what kind of app you're using, but essentially look at the time people spend doing something in one version versus the other and figure out whether you've gotten in their way. When you think about impact on the user, think about exactly how much time they used to spend on that page, whether they're reading something or playing a game or whatever it ends up being. How long did they spend in that particular flow before the ad, and after the ad? You find so many different things by understanding how people continue to use it, and in some cases you see the time with that particular ad plummet, which is a horrible sign, because now you know it's been drastically impacted. So you're looking for big, drastic changes in the time people take to do a particular thing versus what they did before. [00:35:01] Thank you. Yeah, I remember we were talking about this a couple of weeks ago as well, Mark. I think it was you who had a similar question. First of all, how did that situation work out for you, and can you give us a bit more on it? [00:35:12] Yeah, I haven't had a chance to work on that project, because things always pop up at work and other things. But it's interesting, listening to Ben I was like, wow, it would be really nice if we had the infrastructure to do those things. So, talking more about the early side, how do you even analyze this? I think it ties back into this question of, when you're constrained by your data and you don't have access to things, how do you still drive value? And so for me, it's just getting logs of data. I'm always pestering our engineering team whenever they're building new features: what is the log going to look like, how can you structure it, what are the values, how are you going to define this? They kind of get annoyed, maybe, I don't know, because they're so busy themselves; they're like, why are you worrying about this stuff, [00:36:00] just get the logs. And so I think the biggest thing is getting the logs and being mindful about where you want to log, where the timestamps are, and what happened; that's really important, because it can tell you when a user reaches a certain point. I don't have access to data like where they were clicking and how long they were on for, but I know when they get to a certain page, and I know when they go through a certain workflow within our app or within different channels, because I have those specific logs.
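To make the "time in the flow, with the ad versus without it" idea concrete, here is a small sketch that computes time-to-complete from event logs like the ones just described and compares the two variants. The column and event names are hypothetical, and the Mann-Whitney test is just one reasonable choice for skewed duration data.

```python
# Sketch: compare time-to-complete-a-flow between ad and no-ad variants using event logs.
# Hypothetical columns: user_id, variant ("ad" / "no_ad"), event, timestamp.
import pandas as pd
from scipy.stats import mannwhitneyu

logs = pd.read_csv("app_events.csv", parse_dates=["timestamp"])

# Duration per user = time between entering the flow and completing it.
start = logs[logs["event"] == "flow_start"].groupby("user_id")["timestamp"].min()
end = logs[logs["event"] == "flow_complete"].groupby("user_id")["timestamp"].max()
seconds = (end - start).dt.total_seconds().rename("seconds")

variant = logs.groupby("user_id")["variant"].first()
df = pd.concat([seconds, variant], axis=1).dropna()

with_ad = df.loc[df["variant"] == "ad", "seconds"]
no_ad = df.loc[df["variant"] == "no_ad", "seconds"]

print(with_ad.median(), no_ad.median())   # a big jump or plummet here is the warning sign
stat, p = mannwhitneyu(with_ad, no_ad)    # rank-based test; duration data is usually skewed
print(f"Mann-Whitney U = {stat:.0f}, p = {p:.4f}")
```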
So like, things that might look through is like a funnel niles's or click through analysis. Those are kind of the things that I'm thinking about a lot for that. But I think some of them probably make you really affecting before talking about the metrics. It's like thinking about like what type of Data AIs impactful to even get in the beginning and like influencing people who are building the product to care about those values, to actually measure and then making it clear to them that like why measurements important? Because the argument we use, like, wow, you spent all this time building this product feature will be great if I can mathematically show it's awesome and that nobody gets gets the gets the buy. It makes [00:37:44] That like that last bit the Gregg or had any input here [00:37:49] Only to add something real quick. Oftentimes we say we want to test. And by the way, this question is very close to the question I have. I am so happy about this question and [00:38:00] I've enjoyed listening to Vin and Mark. So that's that's pretty awesome. But oftentimes we want to make do an experiment and then we say, well, how do we validate that is going to work? Well, before you run the experiment, what I can tell you is you need to define what success looks like to you and then work backwards from that and say, OK, I declare success as this based on my hypothesis. And then what are the key inputs that will translate to success, or are we actually collecting these inputs already? Or do we need to start creating ways for collecting these inputs? And then when we run the test, we go take a look at the inputs and we look at the behavior post experiment and then we see whether it works or not. So if activities increase when you run marketing one on Harp, then you can go through these inputs to see whether clickstream, increased, etc., etc. So always start with what success looks like in mind and then move backwards in terms of what does it take to measure that success? What are the key inputs to take a look at those outputs? Very, very [00:39:05] Practical and valuable advice. Thank you so much, Greg. Bushrod, anything to add here? As I recall correctly, work with like apps and stuff like that. Right? [00:39:13] Actually, now at the moment, not really. OK, although I do you I do use poly dash as I prefer, like visualization for for users. But I wouldn't say I've actually never really worked with mobile apps. But I will add, I do think I have something that I would suggest that you're thinking about. The total user experience oftentimes is not just the mobile app itself. And if you expand your definition of what the user experience is and what they're trying to do, you can get a clear idea of the cost of doing things like adding ads and making the interface slightly more clunky. So, for example, you if you have an app and if the person is trying to do something in the app and if they can't do it in the app, they will call your call center. Right. So that's outside the app. But it's like really, really important for business outcome because the call center is a much greater cost and the user is trying to do something and the ads make them [00:40:00] make some subset of users less able to do that thing. That's really bad. And so I would say, like the user experience, also think outside the app and and consider if you can measure that more omni channel Data, especially if you think of like a specific outcome. 
I mean, bank banking apps, for example, have lots of very specific things that users are trying to do that's relatively probably easier to constrain that problem. But it really depends on your app and the total universe of what the user could do and what the cost of those things are. [00:40:28] Awesome. Thank you very much. I appreciate that you got a lot of good insight there. This is recorded, obviously, so you can catch that later. Don't see any more questions trickling in from LinkedIn speaking on LinkedIn. If you want to win a copy of this book right here. This is shaped by Dr. Jordan Ellenberg, released a podcast episode with him today, New York Times best selling author of How Not to Be Wrong Power of Mathematical Thinking. You could win this book. Just share this live stream. Tag me tag Vienne Tegmark Tag Greg, spread the word, try to get these things blown up and big. So no more questions coming in from LinkedIn. So let's go to go straight into Greg's question and then we'll go to Eric's question. Anybody else got questions at all? Let me know right there in the chat. Wherever it is that you are viewing this, go for Greg. [00:41:15] All right. So my question is just about anyone who wants to answer this. We talk about Data science projects, where the Data is there we do some cleaning, etc. we train them auto, etc.. Well, let's take a case where we have a supervised. Where Data is not yet labeled and we want to test for causation, we have some hypothesis we want to test and we want to go beyond the AB testing. I want to hear from you guys and use cases about experiment urine tests that you collected data and was able to we're able to label your data and then use that collected data to train your model and deploy [00:42:00] to production. So in other words, who do you work with? How do you design these experiments? What is your control group? What is the treatment group? How do you make sure that you put a stamp on that causation to make sure you move forward and confirm that, yes, email is a solution for the problem you're experiencing? [00:42:25] That's a good question. Did I have to think on that one? As I said, I interviewed Elissa Simpson Rocktober yesterday, and she wrote this book called Real World I. And the company that used to work for is called Apon, and they were Data annotation company, specifically looking to them, looking at the work that she's done. I mean, I probably talk about the experimental design Harp, but as it relates to annotation and I'm not sure, Mark, I know you've done a lot of work in experimental design as well as if you have any insight, definitely time in here, rishard, than anyone go for it. [00:43:04] I was about to say this, but this process in which I recently read the article is really interesting where it said be careful when interpreting predictive models in search of causal insights because and again, I'm not I think there are way more advanced in thinking about these kind of things, for they're like a bunch of Microsoft researchers. But essentially the argument they're making, it's like with with predictive models, you know, you're looking to the future. When we try and show causality, that's more of a like looking the back understanding relationship. So that's that was I thought was a really interesting article that really stood out for me. And so when you're asking about causality, like how to prove that I was thinking like two sides of my mind doesn't go straight to, like, machine learning. 
My mind goes to setting up a randomized controlled trial, and this is coming from a health care background, or doing observational studies. The gold standard would be the randomized controlled trial. Those are really expensive, so you're thinking [00:44:00] about your inclusion and exclusion criteria, who's in this population, how do you make them similar enough that you won't have confounding, and then randomizing to control for, what is it, the confounders you don't measure; I'm blanking on the name of it. So there's that component, right? But that's really hard to do, especially in a business setting; that's a lot of investment, and I don't see that really happening. You do it for health care because you don't want to kill people with drugs; that's why they go through it. But many times in data science I think about observational studies, and there are different methods there. There's a question in the chat about confounding variables; I'm going to get back to it, it's going to bug me. [00:44:43] I'm going to put it in the chat, what I meant by those words: essentially, unobserved covariates. There we go, unobserved covariates. That's the thing you want to look out for. When you do an observational study, you already have a data set available to you, and many times you didn't collect it yourself, so you don't have control of the experimental design. But here's the thing: you have observed confounders, which you can control for with special statistical methods, but the unobserved ones you can't. So when thinking about that, again you go back to: what's my population, and what's my treatment and control? Who got the pill, who didn't get the pill, who got the intervention, who didn't, and how can you create similar populations between the two? My favorite method of all time is propensity score matching. I love it; it feels like magic. It essentially matches similar profiles, and then you can just do a simple test or regression and see the difference between them. But it goes deep; there are so many different methods, like instrumental variables, all these different things you can do. So I think it really comes down to the statistics, but also to how you frame the problem so you know you have two separate groups, how you measure that bias, and what kind of tools you put in place, not to [00:46:00] claim the bias is gone, but to point out, here's the bias and here's how we account for it, to give you a better idea of the causality between things and make a better decision to move forward with. And from there you can take those insights and do a predictive model. [00:46:14] Yeah, exactly. I think you're on the right track here, and you said it very well, because in my case the use case is that we want to gain more customers. We want to turn prospects into customers, we want to convert them, and we want to test. For example, for an experiment, we want to offer them some sort of marketing campaign, and in that campaign we will reduce the price of our product by 20 percent. We will select a group that will receive that campaign and a group that will not receive it, and then observe, collect data, and see what happens.
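A minimal sketch of how the readout of that campaign might look once the window closes, assuming a randomized split and a simple converted / not-converted label per prospect. The column names are made up, and the two-proportion z-test is just one standard way to check that the lift is real before those labels feed a model.

```python
# Sketch of reading out a randomized discount-campaign experiment.
# Hypothetical columns: group ("offer" / "control") and converted (0/1), one row per prospect.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("campaign_results.csv")

summary = df.groupby("group")["converted"].agg(["sum", "count"])
print(summary.assign(rate=summary["sum"] / summary["count"]))   # conversion rate per arm

counts = [summary.loc["offer", "sum"], summary.loc["control", "sum"]]
nobs = [summary.loc["offer", "count"], summary.loc["control", "count"]]
stat, pval = proportions_ztest(counts, nobs)
print(f"z = {stat:.2f}, p = {pval:.4f}")   # evidence the offer itself drove the lift

# Because assignment was randomized, these labeled outcomes can then train the
# targeting model described next.
```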
Now, you can use that to train the model, because now the model has annotated data that tells them who converted based on that campaign you're in on, that experiment you're in and who can tell you who will not convert so you can deploy it and now you know who to target. Better target for these marketing campaigns. So in real life, when you run these experiments, it takes months. So that's why, you know, Data science projects, we often think it takes three months, but it can take years. It takes time to run these experiments. So that's why I wanted to get the conversation going in terms of what other use cases, what kind of experiments I've done before, outside of marketing, outside of maybe testing. What are the use cases are you seeing on your daily lives? Do you work with economists because you got them is all they want is they want behavioral economics. Right? That's what you care when it comes to causality. Right. You do this. This will cause me to spend more money. You don't do this watching this money in my pocket. You know, those are the things that I wanted to talk about that [00:48:00] I [00:48:00] think one thing that I want to add real quick that that makes it so hard to do in the business use cases like when you have a customer who's paying for it, they many times they don't want to be experimented on. They don't want to be the one that didn't get that didn't get the treatment or about just like the best use case. And so it's really hard to push that. And also another challenge is that when you do ask tea or just like experiments, that one doesn't give you the result that you want so many times, which is science. That's fun for us. But the business stakeholder, that's a level of risk that they don't want to engage in. They only may want the answer. So I don't like doing the controlled trials or like showing that causality. It starts getting into the politics of being sometimes where you're rising comedy, always being. I always go back to it like, what's that? What's your business stakeholders risk threshold and experiments definitely steps on that I've noticed. [00:48:51] Yeah, yeah. It's a it's a tenderloins across that's for sure. [00:48:55] The just other use cases like I mean that's a statistician like my whole job is designing experiments coming up with randomization schemes and experimentation schemes and the I've seen a wide range of them. I've seen like bioequivalent study we were trying to figure out of two compounds are equivalent on that one particular endpoint. Do they do the same thing, like do studies where you're looking at dosing studies? We're trying to find out the optimal dose to dose someone. So you have increasing dose levels and trying to figure out what impact that has on some endpoint. Also, there's we did this flu design, this flu study, and the design was taking our drug, comparing it. We couldn't compare ethically to a placebo because you can't just like somebody has severe influenza, you can't give them like a sugar pill. So we give them the the standard treatment, which at that point was Tamiflu. So that's almost like a Bible study. And just see how how our drug works against that and things like that. Anybody else? Nesha, let's hear from you. I know you're doing some hardcore science as well. [00:50:00] And if anybody else wants to chime in on this, let me know. And if anybody else has questions, let me know as well. Nischelle, there you go. [00:50:07] Hey, I just wanted to come on. I have not had a practical use for this. 
It's more coursework-oriented so far: developing a probabilistic graphical model, what's usually called a Bayesian network, essentially. As far as practical use goes, I've never done a real use case with it, but this is done in the health domain very often. Say you're trying to figure out whether a particular drug is going to work on a patient or not in a big hospital: you create a network from the data that's available, whatever data you can get your hands on. Putting together that network takes real effort, because it requires domain expertise. Once the network is good, you can dig into the relationships between the variables, and you can ask the network any kind of question and it will give you a probability-based answer. So you have a level of confidence in how a particular variable is going to affect your end outcome. If you're trying to implement it, I'm trying something similar for my thesis currently, and I'm using a package that is essentially a graphical network tool: it draws out the network, and you have to figure out what the network should be. That's what I'm trying to use to generate my network, [00:52:00] to see if I can make it a proof of concept. So hopefully that helps, but I've never had a practical use yet; it's still in theory mode and I'm trying to figure it out myself. I've also got a nice set of material that I've been reading on and off, and I can share that if you're interested. [00:52:18] Is the work you're doing at all related to the stuff Judea Pearl is doing, the causal inference and causal models, like The Book of Why? He does these causal diagrams, and it sounds similar to what you're talking about. I was curious if there's some overlap there; that would be pretty interesting. [00:52:34] So, the question that I'm interested in is not really causal, but in the course of my research the literature does say that once you define a network, you can look at influence as well, rather than going through a full formal study, which is a lot of time and money in my domain. I know it can be applied to that too; I just haven't dived into that particular aspect of it yet, but I know it's applicable. [00:53:10] And for anybody else interested in designing experiments, Penn State's statistics department is my favorite, even though I didn't go to school there. I love their stuff. They have this awesome introduction to design of experiments course that you can just go through; it's all really well-thought-out notes, and the entire course is really interesting and comprehensive. Highly recommend it. I'll put a link to that right here in the chat. Anybody else have anything to add on this topic? Vin, you've been very helpful. [00:53:43] Yeah, I'm good; I just had a question for Nisha. I think she pointed out some of the specifics of the network she raised in the context of an explanation. I was curious to understand what that network is and how [00:54:00] it relates to the end outcome the project was originally for.
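Since the tool Nisha uses didn't come through clearly in the recording, here is an illustrative sketch of the same idea with pgmpy, one common Python library for Bayesian networks (an assumption, not necessarily her stack). The structure and column names are invented to mirror the "will this drug work for this patient?" example.

```python
# Illustrative Bayesian network sketch with pgmpy; structure and columns are made up.
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

data = pd.read_csv("patients.csv")   # discrete columns: age_group, comorbidity, drug, recovered

# Hand-specified structure: this is where the domain expertise Nisha mentions comes in.
model = BayesianNetwork([
    ("age_group", "comorbidity"),
    ("age_group", "recovered"),
    ("comorbidity", "recovered"),
    ("drug", "recovered"),
])

# Learn the conditional probability tables from the data.
model.fit(data, estimator=MaximumLikelihoodEstimator)

# Ask the network a question and get a probability-based answer back.
infer = VariableElimination(model)
answer = infer.query(["recovered"], evidence={"drug": "drug_A", "age_group": "65_plus"})
print(answer)   # distribution over "recovered" given that evidence
```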
[00:54:03] Did you step away or are you still there? Looks like Nesha slipped away, but definitely look me up on LinkedIn and send her a message. You might be able to provide more insight. And if you did hear that question, definitely let us know what you meant by networks in the context of the work that you're doing. I don't see any of the questions coming in through the LinkedIn or chat here. So we'll move on to Eric's question. Go for it. [00:54:28] Yeah. So I have been trying to think of, you know, trying to find a data set to analyze related to bias. Yep. Like bias in general to be able to see how it could be in a model. Right. And so actually even just like defining what a data set like that would look like is actually pretty tough. Like Mark and I were talking about it. And it's easy enough to see, like you can see injustice in Data a lot easier than you can find model bias. I guess I should just say it's like a much like smaller subset of Data. Right. And so just and my my question is, I'm going to like, give my I my basically my definition, my working definition of what this needs to look like. And I just want any ideas anybody has about like what kind of Data might fit in to that. So basically I'm looking for something that is a there's a desirable outcome that somebody is trying to attain, like a job, a promotion, parol, something like that. But their ability to attain that outcome is dependent upon the decision or verdict of someone else. And that's where that's where the bias comes in. And trying to figure out where I can find that Data [00:55:51] Probably look at like all of the court data that comes out of the Supreme Court of Alabama. I'm sure that will have a lot of bias [00:56:00] now because the good question, if anybody has any insight here, would be happy to to hear it rishard or more. Yeah, I mean, I got [00:56:12] I can't give you like this is the Data say you should look at. But in real estate this is like a big deal, historically speaking, because there's something called redlining many people are familiar with is the idea that zip codes were labeled based on their racial composition at certain times in the 20th century. And this affected the ability of people who lived in those zip codes to get loans, I would say. So you could look for trying to think like what the Data would look like. I mean, loans, sale, mortgage Data having they're connected to demographics of the receiver. Could be difficult, I imagine. But that's I'm just brainstorming here, to be honest. Yeah. Other things are like, ah, just lending in general because lending drive big driver of the economy and jobs and business formation and whatnot. And you could imagine that lending standards have varied by, you know, that they've been unfair to certain groups, especially if you can get historical data. My guess is that that government data set like Data might be a good place to start to get something like that. Yeah, those are like the biggest examples of bias I could think of right now. [00:57:17] There's something that that's kicking off my head from from that. And I'm not sure if this is I'm kind of in line with what you're looking for is like the city of Winnipeg, for example. We have Data, like in our open data portal, and we have data that is like there's a data set that has all the parking tickets that were handed out in the city. Right. And it says the neighborhood as well. And then, you know, the latitude longitude neighborhood as well that the ticket was given. 
[00:57:17] There's something that's kicking off in my head from that, and I'm not sure if it's in line with what you're looking for, but take the city of Winnipeg, for example. In our open data portal we have a data set that has all the parking tickets that were handed out in the city, and it gives the latitude, longitude, and neighborhood where each ticket was given. We also have demographic data for each neighborhood that includes stuff like median household income, racial makeup and things like that. So if you combine those two, then maybe you can test a hypothesis like: are police giving out [00:58:00] more parking tickets in poorer neighborhoods? Would that kind of fit the definition of the type of thing you're looking for? [00:58:07] So, I think it kind of fits the front half, which is detecting the bias, right? That's, like I was saying, detecting injustice. You can see, hey, police are giving more parking tickets in a disadvantaged neighborhood. The other half, and this is what I'm still trying to wrap my head around, is how do you mitigate the bias once you detect it? I'll use promotions as an example: if you can look at the various factors that go into getting a promotion and you can show that men are X percent more likely to get a promotion than women, then can you compensate for that in... and Vin's shaking his head like, nope, not that. So how do you address it? How do you even see it, and how do you address it?
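A minimal sketch of the detection half, the Winnipeg parking-ticket idea just described, assuming hypothetical file and column names for the open-data extracts. A correlation like this only surfaces a disparity; as the discussion that follows makes clear, it says nothing about where in the process the bias entered.

    # Join parking tickets to neighborhood demographics and look for a relationship
    # between ticket volume and median household income. File and column names are
    # hypothetical stand-ins for the open-data extracts.
    import pandas as pd

    tickets = pd.read_csv("parking_tickets.csv")        # one row per ticket, has 'neighborhood'
    demo = pd.read_csv("neighborhood_profiles.csv")     # 'neighborhood', 'population', 'median_income'

    per_hood = tickets.groupby("neighborhood").size().rename("tickets").reset_index()
    df = per_hood.merge(demo, on="neighborhood")
    df["tickets_per_1000"] = 1000 * df["tickets"] / df["population"]

    # A negative rank correlation would be consistent with more tickets per capita
    # in lower-income neighborhoods; it is a starting hypothesis, not proof of intent.
    print(df[["tickets_per_1000", "median_income"]].corr(method="spearman"))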
[00:59:08] Do you mind if I dive in on this one? [00:59:10] Absolutely. [00:59:11] Yeah, I worked on that exact scenario. You cannot de-bias the data set after the fact. You just can't do it, and I will argue until I'm blue in the face; I have argued this with very, very smart people. You can't de-bias the data set after the fact. Once it's gathered, once the bias is baked into it, that bias isn't just in what you detect. It's everywhere: every outcome, every label, everything you are learning from a data set that's biased. The only thing you can do is regather, especially when you're talking about policy, which is where you're getting into. You're not talking about hard outcomes, like a car going a hundred miles an hour and the person dying because of the collision; that's a hard outcome. You're talking about policies. You're talking about somebody making a decision, and doing decision support based on a recommendation that you're giving. [01:00:00] So when you do anything that recommends policy, and this is interesting, kind of going back to what Greg was asking about the causal side, there is a lot of causal analysis and a lot of causal graphs created to support policy decision making and the models that are used for policy, because of exactly this: you inevitably have bias in your data. When you go through the process with that initial model as a hypothesis, you're going to find that bias in the data, and then, starting from that hypothesis, you have to go back and build a new data set without the biases that you've detected in your initial data. [01:00:39] You also have a ton of other concerns, obviously, in regathering data. But one of the big ones, and one of the reasons why causal inference is used so much in policy recommendation, is because it's a nod to this: you have a biased data set, and once it's biased, it's done. You can't build a reliable model on it, so you have to use it as a starting point, not an ending point. There is a persistent opinion in some companies that you can de-bias data sets, and it never works. Every single time someone has tried it, through any number of de-biasing techniques, it's failed: the bias has been rediscovered in the data set through some other type of feature engineering, and it ends up in the model. So what you're doing is really related to the policy space and the policy recommendation space, and if you go down that road you're really walking toward causal inference: creating causal graphs and then validating those causal graphs to support the policy that any given model recommends. Where I think you're going to end up is that there is a primary, a secondary, and usually a tertiary data gathering process. That first data set you [01:02:00] can definitely use to identify the bias, so that the next data set you gather doesn't have the same biases. So that's one way of handling it, but it's not exactly speaking to the exact use case that you're referring to. [01:02:13] That's really helpful, because obviously when you say it like that, well, once the bias is baked in, it's in there. It's like, well, yeah, garbage in, garbage out; makes sense. And I think I need to think through a little bit more what the actual final use case is; it's just nebulous and big right now. So thanks. [01:02:36] Go for it. Yeah, so I've actually been talking with Eric about this use case for a little bit, so I've been quiet because I'm just trying to listen as well. But I also want feedback on a potential approach that we pitched, which is essentially going back to the policies you have, going back to sentencing. If someone is sentenced to a certain amount of time for a crime, you can tie that specific crime to a specific law, which gives you a recommended range of years, a lower limit and an upper limit. So maybe that's a potential way to feature engineer: someone has their basic demographics, they have their sentencing time, and you engineer a value for where that sentence falls within the range. Is it really far up, really far down? And maybe a way to detect bias is to take those differences and look at how many standard deviations someone's sentence is from comparable people's. That's more of a statistical way to see it; it's not so much finding bias as it is finding outliers, and therefore a proxy for bias. Would that be too much of a stretch, as just a starting point for building a labeled data set of these-are-potentially-biased instances? That was probably a lot in a quick second, so maybe I can repeat it or, like, talk it through one on one. [01:03:57] If anybody has anything to add onto [01:04:00] that one, go for it. It's a great topic. I'm doing a couple of quick Google searches here, and hopefully I can find something useful for you, Eric; I'll send some links right here in the chat. While I'm stalling for anybody who has something to say on this topic, let me just pull up what I found real quick.
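Here is one way the sentencing-range idea could be sketched out: normalize each sentence by where it falls between the statutory minimum and maximum for the charged offense, then flag cases that sit unusually far from comparable cases. The column names and the two-standard-deviation threshold are assumptions for illustration, and as the discussion makes clear, an outlier flag is a prompt for investigation, not a finding of bias.

    # Sketch of the sentencing-range feature: where does each sentence fall within the
    # statutory range, and which cases sit far from comparable cases? Columns are hypothetical.
    import pandas as pd

    cases = pd.read_csv("sentencing.csv")   # 'offense', 'min_years', 'max_years', 'sentence_years', ...

    # 0.0 = statutory minimum, 1.0 = statutory maximum.
    cases["range_position"] = (cases["sentence_years"] - cases["min_years"]) / (
        cases["max_years"] - cases["min_years"]
    )

    # Z-score each case against other cases with the same offense, then flag outliers.
    grouped = cases.groupby("offense")["range_position"]
    cases["z"] = (cases["range_position"] - grouped.transform("mean")) / grouped.transform("std")
    candidates = cases[cases["z"].abs() > 2]    # more than 2 standard deviations from peers
    print(candidates[["offense", "range_position", "z"]].head())

The flagged rows could serve as the starting labels for potentially biased instances that Mark describes.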
[01:04:18] So it sounds like, to Vin's point, you need to run a couple of experiments and test them. Like, if you have a biased data set like that, you would run some experiment to check whether the bias goes away, and then you would use that newly collected data to train a model. Is that what you would want? Is that what you meant, Vin, by once it's biased it's over and you would need to introduce another set of data? I just wanted to make sure I understand your point. [01:04:56] Well, actually, Mark's question kind of clarifies it, and I can tie both of these together. So, Mark, the bias was introduced before the data set, and this is the key piece to understand. When you're looking at defendants who are charged with a crime, that in and of itself has already been impacted by previous bias: how likely is it for any given person to be charged with this crime, given the same facts of the case? That would be the data you need to gather, because it would inform you about biases that exist in the data set that was gathered about sentencing; you already entered this entire process with an initial bias. And that's really where you begin to do experimentation. You're experimenting with what came before. You figure out where bias begins, because you have to find the origin, the first step that introduced bias into the process. [01:06:00] Then, in order to build a model, you have to gather data in some way that removes that bias from the process. Now, if your whole purpose is to discover bias, it's the same process: going back to that initial step and saying, OK, the decision to charge; what came before the decision to charge? The decision to investigate. Was there bias introduced in that part of the process? You create this complex chain where you say this thing has an origin, and a lot of your experimentation is really walking back the chain. [01:06:38] You find the original insertion of bias into the chain that you're measuring, and that's where you have to start, because if you start trying to find bias later, like I said, you're done; that data set has already got the bias baked into it, and it has bias baked into it that you're unable to measure, because you don't have enough data to understand it. So it's the same process, two different outcomes. When I say experimentation, in many cases it's to find that upstream process. I'm using a really simplistic example here; typically, if you're doing something like marketing, there's a level of "I just can't get to the origin of a buying decision or the origin of a search-with-intent decision." So obviously we're using a very simple example where it's obvious where bias may first have been introduced into the process. But when you look at more complex processes, ones you don't necessarily have full control over or full access to in order to perform each one of those experiments, that's where experimentation comes into play to understand how the bias exists, because all you have is data at that point. You can't do a physical experiment; you have to do an observational experiment with an existing data set, knowing that there's likely bias in it. And so you have to design your experiments such that:
And in each one of those phases, you're looking for two things. No. One, there's bias in the data someplace. Then you have to find it and improve your data gathering process. That doesn't mean your data is worthless. It just means you can't. Lie on it because, you know, bias is there and you may end up making some bounded statements where you say my model will likely perform well in these scenarios. However, it won't. And you talk about how bias plays into the bounds of performance and accuracy for your particular model and the uses that you can you can touch on and control, really, when this model should be used. And that's a lot of what you were talking about, Mark. And also, Eric, you're talking about trying to understand the bounds of utility for a particular model. And so you're. Yeah, and I'm sorry, I just kind of threw, like, the planet at you did a bit. [01:09:19] But it's amazing hearing from you then. And I can only go back to my manufacturing roots, lean principles, asking for AIs, going back to the root cause. I mean, we're really not changing anything in Data science, really, when you think about it, agile, mythology's lean principles. I mean, job instruction, training, it's really not changing. Right. Those are the things that we've come up with in the early nineteen eight or nineteen eighties or whatever for software development for Toyota came up with it in the eighteen hundreds and stuff. And we're just going to Bill's methods to go back to the root cause. At the end of [01:10:00] the day that's what it's about. So I think that's where you go into when and I hear from you. So that's pretty [01:10:05] Awesome. Yeah. We just have way better tools and a whole lot more data so we can do a lot of the things that quantitatively that we used to have to do, heuristically. We used to have to rely on expertize far more and now we can apply a more rigorous process to treat the data, to understand the complexity of the system that we are actually interacting with. Correct. [01:10:29] Awesome. Well, thank you very much, Alexandra. So you had a great point to make in there, so definitely go for it. [01:10:37] Oh, yeah. I mean, obviously, the conversation is kind of gone in a different direction. But I was just trying to think back to some cases that might have been applicable before we kind of had this head explosion moment. And I've done not from a statistical point of view, but more from an economist point of view. Some work with the Innocence Project and the idea of wrongful convictions and the difference between what the sentence outcome would be for people that use the public defendant versus a private attorney. So I know we've kind of gone in a different direction now, but I'm not sure even that type of question or idea would spark anything related to this idea of Data bias. [01:11:15] And to that point, I've dug up this kind of resource that I will share, link the chat. I think it's very applicable to everything that people have said already. And it's talking about a quantitative and qualitative assessment of bias in Data shortly, 12 pages. And just really great questions to consider when trying to think about bias in Data. So go ahead. I'll just share a link. Hopefully it's useful for you guys [01:11:40] And looks like this. Thank you. [01:11:42] Reference actually looks pretty good to, so I might be able to find a couple of good references there. Right. Does not look like there are any other questions coming in. 
So thank you so much for taking time out of your schedules [01:11:53] to be here today. I appreciate you guys coming and hanging out. Great discussions as usual; you guys make my Friday evenings [01:12:00] that much better by being here. Remember, you can win a copy of Shape by Jordan Ellenberg. Definitely tune into the episode I released earlier today, and just share this very live stream on LinkedIn between now and Monday; on Tuesday I'll pick a winner and post about it. It's a great book, I recommend it, and it actually touches on the geometry of bias and things like that as well; there's an entire section on gerrymandering and politics. But guys, thank you so much for hanging out. See you guys next week; I'm excited to have had you here. Take care. Remember, you've got one life on this planet. Why not try to do something big? Cheers, everyone.