comet-ml-feb21.mp3 [00:00:09] What's up, everybody? Welcome, welcome to the Comet open office hours, powered by The Artists of Data Science. Super excited to have you here; I see a couple of my friends in the chat. We've got Tor, and we've got my wonderful co-host, Odali. Mustafa's in the room. Welcome, guys. Super, super happy to have you here. [00:00:30] I'm really excited to see what kind of topics and questions we dig into today. If you didn't know, Odali hosted a session on Thursday this week; it should be up on Comet ML's YouTube channel, which I'll be sure to link in the show notes. It was a session on data preprocessing, data validation and things like that. So I'll give you guys a second or two to settle in. If you've got questions, put them into the chat so we can queue you up in line. [00:01:05] But while we're waiting for you guys to warm up here, Odali, talk to us about this concept of data validation. That's something I feel like I hear a lot, but I don't really know if I understand what it means. So when we talk about data validation, what is it that we mean? [00:01:25] Yeah, so I think there are a couple of different things, and data validation typically comes in early in the pipeline. First of all, it's making sure you're documenting your data set. Something that I have gotten data teams used to doing is creating datasheets for every data source they have: basically asking the questions, who collected the data, why was it collected, and making sure we have a good understanding of whether there are any reasons this dataset might be skewed, and then being able to use that in our modeling process. So we can say definitively that we know some historical context about this data, especially if it's about people. Maybe there are different methods of sampling, or different methods of modeling, that we want to use because of what we gained from this data validation. Second, has our data been manipulated before it gets to us? Especially when we're dealing with APIs, it's really hard to know. You know, a weather API takes the actual sensor data and rolls it into a numeric set that you can download as tabular data. So it's being able to understand what's been manipulated before it gets to us, as well as making sure that we're using the right modeling frameworks and building the right features based off of the kinds of data we have, like groupings, classes and categorical data, so you can get better predictions. It typically kind of falls, sometimes, under data cleaning. But I would say that the vast majority of data scientists and engineers don't spend enough time on this datasheet portion that you're talking about. [00:03:19] Is it like a special sheet? Is it simply an Excel sheet that you make up, or do you have one that people can refer to? [00:03:30] Yeah. So it is based off of a paper, which you can go look up, and you can lay it out in your own Excel sheet. If you are a Comet user, you can actually keep it in your notes, and there are templates for setting it up per project. So being in tune with and understanding the ethical requirements, and being able to go through and say what the composition of my data is and what the motivations were, will help out in your modeling process.
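As an aside, here is a minimal sketch of what a lightweight datasheet like the one described above might look like if kept next to the data source. The field names and example values are hypothetical, not a prescribed template:

```python
import json

# A hypothetical, lightweight "datasheet" for a single data source, capturing the
# questions discussed above: who collected it, why, and known reasons it might be skewed.
weather_api_datasheet = {
    "name": "daily_weather_observations",      # hypothetical source name
    "collected_by": "Third-party weather API",
    "motivation": "Aggregate station sensor readings into daily tabular data",
    "collection_period": "2015-2021",
    "preprocessing_before_receipt": [
        "Sensor readings averaged per station per day",
        "Missing readings interpolated by the provider",
    ],
    "known_skews_or_gaps": [
        "Fewer stations in rural regions",
        "Sensor model changed in 2018",
    ],
    "intended_uses": ["Feature source for demand forecasting"],
    "sensitive_attributes": [],  # empty for weather; critical to fill in for data about people
}

# Keeping this as version-controlled JSON or YAML next to the dataset makes it easy
# to revisit during feature-engineering and modeling decisions.
print(json.dumps(weather_api_datasheet, indent=2))
```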
[00:04:00] And is it just the same idea as a data dictionary or metadata? Is it the same kind of concept? [00:04:06] It's different in that it guides modeling, and it guides the later use of a dataset, especially in orgs where we come in a couple of years into their data practice and they say, we don't have our documentation together for the models that are already in production or in flight. So it's attempting to deal with a lot of those issues around reproducibility, and then really just transparency about how you've collected the data, especially when we make predictions about people. That's awesome. [00:04:42] Thank you so much for sharing some insight into that. For everybody that just joined, welcome to the office hours session; super happy to have you here. We're just discussing, at a high level, the webinar that Odali hosted a couple of days back on data validation and data preprocessing. So if you have any questions around that topic, definitely go for it. But at this time I'm happy to take questions from the attendees here. If anybody has a question, go ahead and take the floor. And while that person is asking their question, if you want to hold your place in line, just type "I have a question" right there into the chat; that way I can make sure we get to you in the right order. I see somebody is unmuted here; I'm happy to take your question, go for it. [00:05:30] Sure. Hey, good morning, good evening, wherever you are; hope you're having a great day. So I'm working with a group of students for a data challenge that's coming up this week. They have selected a topic of a COVID-19 case tracker, and they have the option to leverage additional data sets, like what the symptoms are in different regions, or whether prices went up and down because of cases in certain regions. Their job towards the end of the week is to come up with an interesting research question and then use these datasets to answer it through a visualization. So it's a one-week competition, and there's a rubric on which they will be given points at the end of the week. It's an interesting challenge. We had the kickoff meeting yesterday, and I'm helping the students with mentorship right now. But I wanted to ask the group here: what would be a good economic development research question, if anyone has that background? Like looking at different regions and the opportunities within each region, for example which area would require more supplies because they don't have enough medical care in that region. Data points like these are something we are having a challenge bringing together right now, along with coming up with a question that can yield some insights within a week. That's the challenge: we can dig in as much as we want, but they only have a week, and these are undergrad students, so it's not like they have really advanced analytical skills. But at a high level, what would be a good approach for them to pose those questions? [00:07:08] That's a really fascinating and interesting challenge. I'd absolutely love to hear from the rest of the audience on this as well: what kind of questions might be interesting to ask? Right off the top of my head,
I think something that could be interesting to look into is this: organizations like the World Bank have datasets that deal with demographic information and economic information for particular geographies. So in a situation like this, something I'd be interested in looking at is the recovery rate of COVID-19. Does that recovery rate change as the median household income increases for a certain geographic region? Are wealthier people getting more access to care? I think that could be a proxy question to ask. From there, in a situation like this, there might be opportunities to do some type of predictive analysis or machine learning, but I think the most interesting questions are going to come from statistical hypothesis testing. So that's something that comes to my mind: as median household income increases for a particular demographic or geographic region, is the recovery rate for COVID-19 actually increasing as well? Off the top of my head, that's something I think would be really interesting to look into. Odali, what about you? And I'm happy to hear from anybody else as well; it's a really fascinating question. [00:08:41] Yeah, my first thought is kind of similar, in that: are there differences geographically, and maybe this is something that is more exploratory than predictive, but are there factors that reduce the recovery rate? For example, maybe lower-income neighborhoods. So maybe being able to identify some of those factors. Those are kind of my first thoughts, because especially as we're looking at historical data and medical data, there's so much historical context around this data set that it's difficult to initially grasp; we're not social scientists, most of us. But I think being able to identify some of those risk factors, like clinical risk factors for having a reduced COVID recovery rate, might be an interesting challenge. [00:09:47] I wouldn't be... oh, sorry, I'm cutting somebody off. Go for it. [00:09:51] I was saying that's really interesting. And the additional point I wanted to make is that they have access to survey data which was conducted by Facebook in different regions. So that's a good dataset to use, but my only concern with it is that it could be skewed, because you might have survey respondents from a particular age group. What if, say, in Italy, there are only respondents in the age range of 25 to 40, and we go with the assumption that the recovery rate is better just because it's only a young crowd responding? So when posing research questions for an academic project, should these assumptions be given a lot of stress and then we build models on top of that, or should I kind of add noise to the data, or maybe not noise, but normalize the data in such a way that there is no bias involved? [00:10:41] That's definitely an important thing to consider, and it's actually what I was about to bring up as well: this whole issue of sampling, because if you're comparing two different groups, you want to make sure they're on as equal a footing as possible. So the way you design your experiment and the way you sample are going to be very, very important. If you're comparing two groups, you've got to make sure they're as evenly matched as possible. Right.
In that sense, something you might want to look into, and I know they do this in economic policy and I think to some extent in epidemiology, is this concept of propensity score matching. That's really, really useful for this type of observational study that you're doing, because this is essentially what you're doing: you're not randomizing and assigning people to a trial yourself, you're collecting observational data. So that might be an interesting keyword to research for your students as well, maybe "design of observational experiments" or "design of experiments for observational studies," because there's a whole host of factors you've got to take into consideration. And the core issue, I think, would definitely be with respect to sampling: how you're sampling. [00:12:01] Yeah, we are thinking of using random forests, because we have a hundred and fourteen countries in the data set and it's not a huge data set. But I would rather they focus their analysis on the top five countries in each region and then go down to a granular level to explain how those top five from each region relate to the smaller countries or smaller regions. But yeah, random forests are something we are exploring right now; it really depends on what the results look like. [00:12:29] Yeah, well, it also depends on what the actual question is that you're trying to answer. There might have been a bit of miscommunication when I was talking about sampling: I didn't mean sampling with respect to random forests, like bootstrap aggregation, anything like that, but sampling in terms of who we're going to include in our analysis or study. That type of sampling: sampling from your population in terms of, do we have an accurate and adequate mix of people, so that whatever inference we make is extendable beyond this small group to the general population? I think that's an important point. I'd love to hear from anybody else: what questions do you think would be interesting based on the description of the data he's provided us here? What are some questions that you personally would want to look into? Doesn't look like anybody has any; that's all good. Was that at all helpful to you? [00:13:27] Oh, yeah, absolutely. And I also came across your flowchart and the five steps of asking statistical questions that you shared last week. I think that's very helpful, to look at what data and what variables you have and pose the question accordingly. So I'm definitely forwarding that to the students. [00:13:44] Yeah, definitely. So, for people who don't know what he's talking about: you've got to get on my newsletter, where I send you free stuff every single week, a bunch of amazing goodies. A couple of the things I sent out recently: one was the five questions you should ask yourself when you're performing any type of statistical analysis, and the other was a flowchart of statistical methods, both of which I think you'll find very helpful in this case. But just on an intuitive level, I would say that something you really want to take into consideration is how you're sampling the individuals to be included in whatever analysis you're trying to do; maybe look into some sampling techniques for observational studies. A couple of keywords there for you that I think might be helpful.
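For anyone chasing down the propensity score matching keyword mentioned above, here is a minimal sketch of the idea using scikit-learn. The synthetic columns (income, age, a "treated" group flag, and a recovery outcome) are hypothetical stand-ins for whatever covariates and outcome the students end up with:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic observational data: "treated" might mean, say, an above-median-income region.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(45, 15, n),
    "income": rng.lognormal(10, 0.5, n),
})
p_treat = 1 / (1 + np.exp(-(df["income"] - df["income"].median()) / 20000))
df["treated"] = rng.binomial(1, p_treat)
df["recovered"] = rng.binomial(1, 0.55 + 0.1 * df["treated"])

# 1. Model the probability of being in the "treated" group from the covariates.
covariates = ["age", "income"]
ps_model = LogisticRegression().fit(df[covariates], df["treated"])
df["propensity"] = ps_model.predict_proba(df[covariates])[:, 1]

# 2. Match each treated unit to the control unit with the closest propensity score.
treated, control = df[df["treated"] == 1], df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched_control = control.iloc[idx.ravel()]

# 3. Compare outcomes on the matched sample rather than the raw, unbalanced groups.
effect = treated["recovered"].mean() - matched_control["recovered"].mean()
print(f"Estimated difference in recovery rate after matching: {effect:.3f}")
```

This is only a sketch under toy assumptions; in a real study you would also check covariate balance after matching before trusting the comparison.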
[00:14:33] Tor, I see you've got your hand up. Not so much questions, just some interesting angles. I just posted a link to where I download weekly COVID data and statistics, which I use for my own personal analysis to keep an eye on where things are going. One of the things I've been looking at is the travel holidays versus the trend in the number of cases, et cetera. But I'm also now looking at the vaccination programs and their impact on the actual cases and hospitalizations, as well as age in each area, which is quite interesting: a lot of the, quote unquote, older people had a high likelihood of catching it and getting sick from it, so what is the impact on the younger generation? Is that most likely what you're seeing in the new cases, or is it still the older generation? So there are lots of angles here. One of the things I looked at, for example: I live in France, and in Europe people still traveled last summer. The funny thing is that people didn't fly last summer; they were taking the train or their cars to travel within Europe. [00:15:58] And here we would see a lot of cars from Britain, Holland, Belgium and a few of the old Eastern Bloc countries. And, you know, a couple of months later, those were also the countries where you saw a lot of increases in cases and lockdowns as well. Now, that correlation I can't prove, but walking around the streets here and seeing all the Dutch cars kind of gave me that feeling; whether it's a real correlation or not, it's an assumption, a hypothesis. So there are lots of things to look at. And, like I mentioned as well, poor versus rich countries: you can even bring it up to the country level, look at GDP levels to see if there's any correlation, or at medical and health services in the various countries. So, lots and lots. [00:17:02] Yeah, I definitely agree. You're really just limited by your own curiosity, well, curiosity and what type of data you can get your hands on, obviously, but there's clearly a wide range of possibilities you can look into. We'll leave this topic open through the end of the session today, so if anybody comes up with an idea they want to share, go ahead and either drop it into the chat or just unmute yourself and let us know. I know a few people joined in; welcome, everybody, to the office hours. If you have a question, just type "I have a question" right there in the chat and we'll hold your place in line. I see a question in the chat from Mustafa, and I'll flip this one over to Odali here. Mustafa is asking about the difference between data validation and cross-validation. [00:17:54] Yeah, so that's a great question. Data validation is more so before we actually enter the model-building process: understanding the skew of the data and the summary statistics about a data set. Cross-validation is more so trying to test how well our model does on new data. So when we're looking at testing and validation data sets, we're able to use cross-validation to see if there are overfitting issues; that's more so part of the modeling process. [00:18:31] And speaking of cross-validation, part of your talk covered data preprocessing as well, and this is probably a question that might be a little open ended; there really isn't a right answer or a wrong answer, because like anything in data science, it depends. At what point do you do your train-test split with respect to data preprocessing and the whole cross-validation thing? Where does that fit into the pipeline of events? [00:19:03] Yeah, I would say for me it mostly happens after the feature engineering is done, so, I mean, combinations, interaction features; basically I try to do all of this before modeling and building new models. [00:19:23] Awesome. Thank you very much for that.
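Since both cross-validation and the ordering of preprocessing around the split come up here, a minimal sketch of how they often fit together in scikit-learn; the dataset and model are just illustrative, not from the session. The split happens first, the scaler lives inside a pipeline so it is re-fit on each training fold, and cross_val_score checks how the model generalizes to held-out folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out a final test set first; cross-validation then happens on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Putting the scaler inside the pipeline means it is re-fit on each training fold,
# so nothing about a held-out fold (or the test set) leaks into preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation: a large gap between these scores and training performance
# is one signal of overfitting, which is what cross-validation helps surface.
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print("Fold F1 scores:", scores.round(3))

# Final check on the untouched test set (accuracy, via the pipeline's default scorer).
pipe.fit(X_train, y_train)
print("Held-out accuracy:", round(pipe.score(X_test, y_test), 3))
```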
So looking into the chat, there are no current questions there, and it looks like Tor has a question. Tor, go ahead and unmute yourself. Folks, while Tor is asking his question, if you want to hold your place in line, go ahead and type that right there into the chat. Tor, go for it. [00:19:45] Just to give you a little context, it's funny: since I started joining this group, what I've noticed, for example on LinkedIn, is that the articles and posts I see are becoming more and more about data science and machine learning related topics. And, you know, I don't mind it one way or the other, but I'm starting to lose the all-too-common financial topics that I used to see, you know, the prioritization. The question I have is really this: I'm assuming this is like the trend on Facebook, where you become more and more radicalized, [00:20:25] pushed to more extreme content by the algorithms. The question I have is, what can I do from a user point of view to compensate for that, to bring the information I get back to what I want? That's one thing. The other side is, for the people who are making these algorithms and this analysis, what parameters have been built in to weight that trend in their analysis? And I assume that's machine learning on the back end. [00:21:00] So, to answer the second question: I wish I knew, because I'd be gaming that algorithm so much and have 100 million followers by now. But to answer it more concretely, there might be some type of content recommendation algorithms in the background, so maybe that's the keyword you're looking for: recommendation engines. There's a whole slew of different methodologies for that, and which one any given social media platform is using, they definitely keep proprietary and don't share that bit of the algorithm. But to answer your first question: one thing about LinkedIn is, from my understanding, their philosophy for content recommendation is delivering you stuff that you find interesting. They had a certain catchphrase around it; it escapes me right now. But I think one thing you could do, if you want to get more of that financial stuff back into your timeline, is start interacting more and more with the people and the content that you want to interact with, and you'll start getting fed more and more of that. So maybe I'm not answering the question completely, but I'd love to hear from Odali as well. [00:22:15] Yeah, I definitely took it the same way you did. Let's say on, maybe, Twitter: I know you can also follow specific hashtags, and then you start to see those recommendations.
Whether you search under your Moments or just look at your timeline, you'll start to see those desired topics pop up; I always see data science content because I interact with a ton of people who post data science. And then, similarly, for LinkedIn's recommendations: you have to think about it as, we are the users, and certain actions put us into these categories of what we have an interest in. So the more you like or comment on things that are about finance in general, the more often you will probably see those things in your feed naturally. [00:23:10] And I just looked up that catchphrase I was talking about. Now, this is coming from an article in 2019, so not terribly outdated, but, you know, the pace of technology, everything is moving quicker; still, I'll go ahead and share it here. Essentially what it's saying is that the mantra of the LinkedIn feed is "people you know, talking about the things you care about." So if you wanted to get more and more of that data science, I'm sorry, that finance content into your feed, then interact with the people who are posting that stuff. That's probably going to be the best way to do it, to answer your question. [00:23:51] But I think, from my point of view, the other side of it is even more important: what is being built in to control these algorithms, to manage that process so it doesn't go way too far to the left or the right, or up or down, or toward finance or the other extreme, from a programming perspective? So what considerations, if any, are taken now? [00:24:19] Yeah, that's a good question. I wish I had the answer to that. Um, there's an excellent documentary on Netflix that I watched towards the end of last year. It was all about this concept of content recommendation; the name of it is escaping me as well, but it was all about the content that you get on social media. The Social Dilemma, I think it was called. That's the one. Yes, that's the one. But yeah, I mean, we're never really going to know how the algorithms work, because that is proprietary. I see a comment here that algorithms get biased to monetize the platform; that's true. I mean, the whole point of these social media platforms is to get you to engage with the app and keep you engaged in that app. So to that extent, that LinkedIn mantra of "people talking about things you care about," getting you to continually use the app, is, I think, what's going to drive that algorithm. [00:25:20] But I'll give you an example, because one of the things I'm working on in my project, for example, is based on a questionnaire that generates some questions, used to generate a program, and people then use this program during a fieldwork period, for audits, et cetera. The idea, of course, is that there are lots of people doing this; when they come back, the system is going to analyze the information it gets and then start feeding that into future processes. And this is where I'm starting to get concerned: if a large number of people use only certain parts of it, what do I do to control for that, to make sure things aren't included that shouldn't be, even though they may not have thought about it? And how is that put into the algorithm? [00:26:25] Yeah, I don't have a good answer to that question.
[00:26:26] I think it's really hard to say, especially because, unfortunately, there is a massive lack of transparency in organizations about a lot of the things you mentioned; they could have more transparency than they currently do. So it's difficult. I would probably lean towards saying that maybe they are taking a couple of mitigation steps so that things don't go too far, but I would cautiously guess that they're probably not spending as much time on that as they should, because unfortunately, while it's kind of a negative for users, it's still positive for them as far as monetizing their platform. But I think it would be really interesting to have that; this idea of something imposed, like an FDA for algorithms, has kind of been tossed around for a while. That might be something that forces them to expose, or at least make public knowledge, how these processes work, or gives the public some way of auditing them. I think we'll see more of that with regulation, but it's just incredibly hard to know what they're really doing. [00:27:47] I just want to add, I just read a great book. It's called Weapons of Math Destruction. It's from a few years back, but it's still a great read. It pretty much touches on these algorithms that are being fed data that's already biased. And if humans aren't updating that data, you know, monitoring those algorithms and updating them based on how they should behave, those algorithms are going to give themselves a feedback loop of what they're already doing. So it's garbage in, garbage out; that's what ends up happening to these algorithms. All these articles we read, we kind of take at face value: it's like, oh, this algorithm was created by Facebook, they have X amount of data, this algorithm must be right. In this book, the author is a data scientist, and she goes through algorithms in different industries, from schools to how it skewed voting in the US and everything. So once you read it, I think you'll stop asking this question, because right now every algorithm we have out there is pretty much biased to a degree. [00:29:04] And like Odali said, we need to create a framework that has rules and regulations. But right now, when you're looking at these finance companies, it's private data. So when you question why, it's like they quantify you: they can't ask me my race, but they'll ask me my postal code, and based on my postal code, with data science, you can tell if I live in a rich neighborhood or a poor neighborhood. So even though they're not asking the question, they're still quantifying you as rich or poor, technically; they kind of find a loophole around it. So for what you're trying to do, I would suggest that book; it's a super easy read. She's a data scientist, she's worked in the field, she was there during the 2008 crisis, and she'll pretty much tell you everything that's lacking in these algorithms. And right now, the sad part is it's hard to get the regulations, because it's capitalism, right? Everybody wants to make profit for their shareholders. So trying to get some sort of regulation out there is going to take time. Very, very good. Thank you. [00:30:17] That would help, that'd be really good. Thank you. [00:30:22] I'd like to add something, just adding on to the previous comments. There's an organization called the Algorithmic Justice League.
They created a documentary called Coded Bias, and that is an amazing show about how much bias we have in our algorithms, [00:30:44] from job hunting to, [00:30:50] let's say, the financial aspects of it and so forth. They did an amazing documentary, and I would highly encourage everyone to watch it; again, it's called Coded Bias. Even for housing, people of color who need housing, how they're tracked, and things like facial recognition and so on. [00:31:12] It's amazing. And the book that was just mentioned, Weapons of Math Destruction: the author is also in that documentary, and quite a few other notable people spoke in it as well. [00:31:32] So it's an amazing documentary that really opens your eyes and makes you wonder about all the algorithms that are out there and how they are being coded. And not only can the data be biased, but even the algorithms they create can be biased, too. But I think this is coming up a lot these days, and people are trying to create something like an FDA for algorithms. I think this is a good thing, really; then you would have equality in all areas, whether it's housing, education or employment and so forth. Yeah, that's all I wanted to add. [00:32:12] Thank you very much. [00:32:13] And appreciate the person who's running that organization; she's a Rhodes Scholar, her name is Joy Buolamwini. Yeah, she's pretty prominent, and I think she's at MIT. [00:32:28] Yeah, yeah. [00:32:30] I'm definitely going to be digging into these resources, and hopefully I can get some of these people onto the podcast, because this is an area I would definitely love to explore. [00:32:38] So watch that show; you'll find a lot of notable people in it, and you can read more from there. [00:32:46] I forget her name, Cathy, I think, who wrote the Weapons of Math Destruction book. [00:32:52] Yeah, quite a few people in there. And it's definitely helpful too, because we're starting to see this question around ethics come up, unfortunately, because AI has spread into every industry. It's not even close to being isolated to tech and software; we're using it in health care, trying to triage patients. I think, especially when we're looking at these public-works kinds of algorithms, there is a much bigger need for them to be transparent. Take, say, Netflix's personalization: the worst thing that can happen is you don't like a movie on Netflix. But if we're thinking about triaging patients in emergency rooms, the worst that can happen is that you're delayed care. We have to, especially in industries like health care, policing and a lot of those environments, think about the harm that could be caused to users. Some things in finance may not seem like that big of a deal when we're inside these organizations, so getting denied for a credit card may seem kind of minor, but we're so far removed from any individual user's experience, and that can be a big deal for people. So we have to, especially with data about people, have stricter ethical standards and be transparent about how these algorithms are being used. With facial recognition systems, take the UK, where having public CCTV cameras is a lot more acceptable and people are comfortable with it.
It's kind of the norm there. [00:34:39] In those scenarios, every algorithm that's used on the general public should be open for the general public to also criticize, should be open for the press to look at the data being used, for us to investigate how decisions are being made. And it gets us closer to having an appeals process; this whole idea of an FDA for algorithms is so that people can say, I think maybe you made the wrong decision, and then we can go back and debug: for users number seven and twelve, why did we make these specific decisions? Then either we assess that we made the incorrect decision and change that outcome, or we're able to offer users an explanation of why. Looking at feature importances is fairly easy, but that's not something organizations really allow us to do right now. So I think we are getting there, but we are very much still in the first stages of awareness, and of organizations' awareness that we have profound power. I went from doing product analytics to working on things that were life-and-death scenarios, trying to identify where incidents would happen, and there we had a much higher threshold for what was acceptable error than for "oh, I'm just predicting user segments." I think we have to take this with more severity, especially in these public kinds of algorithms. [00:36:20] This has given me a lot of good stuff to think about, and I hope you guys have been absorbing this as well. It's definitely an important area in our field to really think about and discuss, and hopefully we can get some of these people onto the podcast, ask them some more questions and help share some of this knowledge. Another book on my reading list, which I just recently downloaded, is Invisible Women, which apparently is all about how several algorithms are essentially biased against women. So another book to add to the list. Tor, thank you very much for asking that question and getting this discussion going; I appreciate it. So we'll go ahead and continue on. There are a couple of questions in the chat, so I'll start with the earliest one here, from Mustafa. And folks, if you have a question, please feel free to hold your place in line by just saying you have a question, and we'll add you to the queue. So, the question here from Mustafa: would you please briefly describe the whole timeline of data, from preprocessing to generating insights? Mustafa, if you're still here, definitely go ahead and unmute yourself and maybe add some more color commentary around that. [00:37:35] Um, I don't know if you can hear me here. [00:37:40] Yeah, it's a little bit of an echo, but, um, OK, good, there you are. [00:37:48] It's just a general question, really. I'm fairly new to this, so I just want to get an overview of the process, from the preprocessing through working on the data; just a general point of view of the whole process. [00:38:24] Yeah, that's an excellent question. It's a huge question as well, and there are a number of different ways we can talk about it, because, like anything in data science, it does depend. But I'll turn this one over to Odali, because I think it maps onto the conversation we were having regarding the course that you created. So talk to us about that whole timeline or series of steps.
[00:38:49] So, from a high level, we would start with data acquisition. We have multiple ways of doing this; sometimes it's pulling from an API, or writing a Python script to scrape the Internet for things like tweets. Then we go into the preprocessing and the EDA, as well as data processing. When I say EDA, I mean exploratory data analysis: basically looking at things in our data set like summary statistics, the ranges, the standard deviations for different columns, and based off of the understanding and insights gained there, we're able to inform our feature selection and model selection. Once we have this clean data set, we're able to go ahead and create new features. This piece is huge, because it has a direct impact on how well our models perform: we want to select features that are correlated with our target variables. And then, let's say we're dealing with a fairly small dataset, we're also able to augment the data, adding in additional columns or fields that just help provide more color and make it easier for us to make these predictions. So we go through these steps to select what features we want to use to predict a certain thing, and after that we're able to start building models and exploring how well they do. The feature-processing piece also really depends on whether your features are mostly numerical or mostly categorical, because that determines the kinds of processing you can do. For categorical data, you could do something like one-hot encoding. So instead of a single category column for color, where in this context the values would be green and yellow, you would have a column for green and a column for yellow. [00:40:57] They are zero if a specific example is not in that category, and one if it is; so an example would have a one in one column and zeros in the columns for the other categories. Based off of all of that work, we're then able to start building models: understanding if we're trying to classify something, which is basically telling two or more objects apart, or if we're trying to predict a number for something and doing more regression modeling. Then we go into the evaluation. Say we have experimented and built ten different models; we want to understand how they compare against each other and which one is the most valuable for what we're looking at. The biggest thing I can say here is that accuracy is most commonly not the most important thing. There are other evaluation metrics, like precision and recall, and things like F1 score, each targeting different needs, and it's important to know which matters based off of your specific situation. Let's say you're in health care: it's easy to discuss with your team whether false positives or false negatives are more important to avoid. So based off of things like your context, you're able to evaluate these models. And from that finally comes this last step of insight. If you do have a good model, you're able to look at the feature importances, go in and say, for predicting the probability of a specific health care outcome, location, age and these other factors are the most important. So I hope that gives you a high-level understanding of what happens at each stage.
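To make a few of those stages concrete, here is a compressed sketch: one-hot encoding a color column the way it is described above, training a classifier, then looking at metrics beyond accuracy and at feature importances. The toy data and column names are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy dataset with one categorical and two numeric features (all hypothetical).
df = pd.DataFrame({
    "color":   ["green", "yellow", "green", "yellow", "green", "yellow"] * 50,
    "age":     [25, 40, 33, 51, 29, 62] * 50,
    "visits":  [3, 1, 4, 2, 5, 1] * 50,
    "outcome": [1, 0, 1, 0, 1, 0] * 50,
})

# One-hot encoding: the single 'color' column becomes color_green / color_yellow,
# each 1 when the row is that color and 0 otherwise.
X = pd.get_dummies(df.drop(columns="outcome"), columns=["color"])
y = df["outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Precision, recall and F1 rather than accuracy alone.
print(classification_report(y_test, model.predict(X_test)))

# Feature importances: which inputs drove the predictions most.
print(pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False))
```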
[00:42:54] That was absolutely awesome; I really enjoyed that. So, Mustafa, did that answer your question? That was really well put together and in-depth. Thank you. [00:43:06] Thank you. Thank you very much. Yeah. [00:43:10] So I've got a question here for Odali, something that is a head-scratcher for me: what exactly is an insight? Let's say I'm doing some exploratory data analysis and I'm looking at the shapes of my distributions for whatever feature I'm looking at. If I'm just able to identify that, oh, this particular feature follows a gamma distribution, is that considered an insight? [00:43:35] I would say, in a way, yes, because it can lead you to choosing specific models. I think of an insight as almost a small piece of information that we then use to guide our decisions. So, in general, data science is tasked with finding these insights about large data sets and then reporting to stakeholders that we should do X, Y or Z. I consider it that way: the insights we get while doing EDA are really there to guide all of the other modeling decisions. So let's say one column has an extreme number of outliers; we're able to say maybe this shouldn't be a feature that we use. [00:44:32] So in that case, it would still be considered an insight.
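A tiny sketch of the kind of EDA check being discussed here, quantifying skew and outliers for a column so the finding can feed into a feature decision; the data is synthetic and the threshold is just the common 1.5 times IQR rule of thumb:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic, heavily right-skewed column (e.g., a spend or duration feature).
col = pd.Series(rng.gamma(shape=1.5, scale=200, size=5000), name="spend")

q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

# These small summaries are the "insights" that guide later decisions:
# e.g., log-transform the feature, cap the outliers, or drop the column entirely.
print(f"skewness: {col.skew():.2f}")
print(f"outliers flagged by the 1.5*IQR rule: {len(outliers)} of {len(col)}")
```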
To follow up on that: is domain knowledge, domain expertise, required to produce an insight, or can we produce insights without actually having that domain knowledge? [00:44:47] I think that's really hard, because the best insights come from having some level of domain knowledge. When we talk about domain knowledge, it seems like we only recognize subject matter experts, people who really have a strong grasp of and experience with one field, health care, for example. I would say that we don't always need that domain knowledge, but it's helpful and it is impactful. So I would suggest, especially if, like me, you'd be working with health data for the first time, still taking some time to truly understand your context. Especially in health care and a lot of other industries, working with the doctors and radiologists on the ground who end up using these models is something that's invaluable; it's hard to overstate why it's needed, because so many of us data science people on these teams are pretty far removed from the actual models and how they get used. So being able to somewhat tether ourselves to the people who use our models matters. A good example is my last role, working with risk managers at companies like Uber and Lyft to try and understand, out of millions of drivers, which ones are the riskiest and which ones should be given certain kinds of driving courses, like, hey, we've noticed certain patterns, here are some courses on aggressive driving. Being able to identify what should go into making that decision was incredibly hard, and I had to schedule meetings and sit down with these risk managers and understand what they already do. So overall, domain knowledge is great, and if there's anything you can do to try to get more of it, or get closer to the situation, it will definitely help how we build these models and hopefully build them closer to what people actually use them for. [00:47:09] Thank you very much. I really appreciate that, and I know the audience did as well. So I've got a question here from Davran; I'm slowly making my way through the chat here. And if you have a question, now's the time to add it right there to the chat so we can get to it before the session ends. So Davran is asking: if I want to build a portfolio project, do I have to look at the data set first and then define the problem? This is one of those classic chicken-and-egg type questions. I personally feel like first you have to have an interesting question you want to attempt to answer; then you have to find the data to help support your progress towards answering that question. But in between, you might need to tweak the question you're asking based on the data that you have. [00:47:58] So to me, it's a little bit of a back and forth. My personal philosophy is: start with the question, then find the data to answer that question, and move in that direction. But given the landscape of self-learning resources out there, we're typically just looking for data sets and then coming up with questions, which I don't think is necessarily a wrong way to do things either. So your approach, I think, is reasonable. What about you, Odali? [00:48:31] Yeah, I tend to lean towards starting with the question. I say that mostly because a lot of the data you can easily find either wasn't collected or made for modeling, or we have a hard time when we can't really answer our questions with the data. I think there's an underlying assumption, especially for data folks, that we're going to be able to find something interesting regardless, and that's not always the case. It feels more defeating when we start with data and end up answering questions that aren't necessary, or, I know this is project-portfolio related, but in industry, aren't relevant or important to answer for our product. So I would say, for the most part, try to hypothesize an interesting question and then track down the data. On the portfolio project piece, it'll also show that you can really do the data collection and wrangling. I know when I was starting out, it was really hard to demonstrate: yes, I can go build whatever Python script I need to scrape this website or get data from this API. By starting with the question and then showing that you can go get the data, even if it's not data that matches up perfectly with what you need, it should give you a couple of extra points on the portfolio side. [00:50:07] A hundred percent agree with that philosophy. And, I mean, think about it: if you're doing a data science project, you're trying to find answers, and typically, in order to find answers, you need to start with a question. So you want to think of it that way as well. Someone here had some really great insight to share in the chat. [00:50:27] Go ahead and share that out loud, so it's not just sitting there in the chat. [00:50:31] Yeah, of course. So I'd like to use an example from a project that I worked on. I worked in a forensics data science practice, so a lot of the work we did was looking at financial information for potential fraud or risks for the company. For one of the projects, we had a sales data set, and as data scientists we were just expected to build visualizations to show whether the sales trends were going up and down at a particular point in time.
When you ask about whether to come up with a question first and then find the data, it goes both ways. In this particular case, we asked the client for additional data. Since we already had sales, we also got the billing and the delivery details, which showed where the company's products had been sold and who was actually purchasing those items. As a data scientist, I only saw trends going up and down, but I didn't really understand what risks the company might have, and this goes back to the question of having the domain knowledge. When we spoke with the business stakeholders, we realized that it was a potential channel stuffing scheme, which means that some company locations would inflate their sales towards the end of the month to show that they had maximized their profit. [00:51:46] And then in the next month, they would have those orders returned, which means there was no actual sale ever made: they just faked those sales to show that their profits went up and then returned those items back to the inventory. And we only found that after we had the additional billing and delivery data sets. So sometimes coming up with a question first and then asking for additional data can help. But at times, if you have enough data, it's more about understanding what problem you want to solve and what insights your business wants to get. And this goes hand in hand with the business and the technology: like I said, as a data scientist or an analyst, I can build interesting visualizations, but I wouldn't really know what my business or senior management team is trying to find in terms of risks for the client. That's where these conversations help, because you get to see the problem from different perspectives. I hope that helps. [00:52:41] I absolutely love that response; thank you very much for sharing your experience. So hopefully that answered your question there. I forgot who it was that asked, it was Davran. Davran, if you have any follow-up questions on that, go ahead and let us know. [00:52:59] Thank you very much to all of you. Yes, it was very good. [00:53:01] I think the next question I have here is going to be coming from, uh, Quinten. Yes, Quinten, go for it. [00:53:10] Hi everyone, the question is in line with what you guys have been talking about. So first of all, I agree completely with what Odali was saying, with everything that was said about domain knowledge. [00:53:22] I think if you don't have that knowledge, you cannot relate all of the work you're going to do on the data set, the analysis, to the actual purpose of adding value to the business. And I have recently read a book, I think it's pretty well known, maybe you guys know it: The Monk Who Sold His Ferrari, and it is really sort of amazing. The metaphor inside this book is about a garden, a garden in your mind that you have to take care of. And within the garden you have a lighthouse, and the lighthouse is like your purpose in life. Basically, if you keep the lighthouse lit up, you can focus your mind on it and achieve way more. And I think the problem statement, the data science problem, is the lighthouse. If you have the problem in mind, then you can relate all of the data, all of the research that you're going to do, to answering that question; you can focus your mind on the problem and solve it way more easily. The question, regarding all of this, is: how do you guys go about creating the narrative?
Because at the end of the day, even if you do amazing work, you have to sell it to business people, and if they're not sold, then you might have been doing this work for nothing. So how do you guys work? Are you doing a lot of analysis and then trying to connect the visualizations you've made afterwards? Do you have some kind of methodology to follow through the analysis of your data sets to build the narrative, or how do you go about it? [00:55:01] It's a very, very good question. And I just have to say, Robin Sharma is amazing; I'm a huge Robin Sharma fan. The Monk Who Sold His Ferrari and the Secret Letters from the Monk Who Sold His Ferrari, both of those are amazing books. For the record, I'm trying to get Robin Sharma to come on the podcast; that would be great, and I'll try my damn hardest. By the way, sorry, I couldn't help myself. [00:55:24] So this story, you're saying it's based on some reality? [00:55:31] The Monk Who Sold His Ferrari, and really all of Robin Sharma's books, are pretty much fiction books, but they're, I guess, more narrative nonfiction than fiction. [00:55:42] Right, so that's the genre I would classify it as, narrative nonfiction, because he's giving practical business advice, practical life advice, but wrapped in a narrative, if you know what I mean. And speaking of narrative nonfiction, I think that's probably a good way for you to communicate your insights and results to stakeholders, and I'm sure Odali has a lot more experience in this realm than I do, so I can't wait to hear from her on this. But I will say that there is an excellent course on LinkedIn Learning by Doug Rose on data storytelling that I highly recommend. What I've found helpful is really trying to convey what you've uncovered in the data in the form of a story, but in such a way that it's maybe not looking at everything holistically; instead you're painting the picture of an individual user, an individual customer, and really driving the point home with examples about that customer, if that makes sense. I'd love to hear from Odali on this. [00:56:44] Yeah, I definitely think one of the ways I've worked to create these narratives is by trying to understand what the stakeholders are really doing. So say I've completed an analysis, I have some insights and I'm ready to share: a lot of what I'll do is go to the team and all of the stakeholders who requested it and ask what they plan on doing with the insights. A good example: I was at a company, and a lot of my time was spent doing marketing kinds of analysis. They came with a request; they basically wanted something that would predict the gender of a user, because their idea was that they had noticed men and women tended to use the product in different ways. So they were like, well, based off of first name, can we just try to predict gender? I did a lot of digging and asking why, over and over, until I got to the little nugget; it's kind of weird how I tell this story. Basically, they wanted to do this whole thing just so they could understand power users versus really inactive users. So I ended up pitching a consumer segmentation model instead, and then creating that.
And when I went back, I basically tried to persuade them to use this for how they were targeting in-app messages and push notifications for the app. I was able to go and say, well, we're noticing that different users use our app in different ways, but it's deeper than you thought. [00:58:33] You thought it was just by gender, looking at men and women and how they interacted, but we actually found six different types of consumer segments, from power users, to diversified users who go to a lot of different kinds of gyms and do different activities, all the way down to inactive users. So I kind of pitched my work as an ethical extension of the question they were trying to solve. To build those narratives, it's about having a good understanding of why they care and what's important to the stakeholders, and then building on that, especially with a narrative, because those stories stick with people more than stats do. I think we tend to just recite statistics, because to us it's like, well, I'm just telling the facts. But that's not as memorable, especially for non-technical folks and people who are making decisions; it's not as memorable as, say, walking through the journey of maybe one example user. So from all the insights you gain, you can say there's this hypothetical user that does X, Y and Z, and then try to build your story around that. [00:59:49] OK, so, thank you for your answer. Basically, you're saying that, going from the point of view they started with, maybe that gender was the main reason why people were acting differently, you would start from that point and tell the narrative based on it, and add to it based on new findings, but in a business way: not going into the technical things that we love and everything, but with something that creates an emotional way to remember what we're talking about, especially regarding the impacts this can actually have on the business, and not just the technical insights. [01:00:28] Exactly. To add a little more color to my story there: I would say, OK, you can imagine user number 12,000, they signed up for the app because they saw this ad on social media, they came onto the app, and so on; you kind of walk them through that journey for this hypothetical user using the insights we've gained. And later on, in conversations after this project was done, the marketing team would be like, oh yeah, I remember that one big user, they did this, this and this. It's so much easier for them to remember those kinds of insights and stories, and they would bring them up consistently. So they were interested in: for users who take a different path, what kinds of messages should we be sending to them? So, yeah, trying to move away from our technical inclination to recite stats, and more so trying to build a story. You can have the fake names, the examples, like the monk and the Ferrari, you know, but those are the things that business people especially remember. [01:01:40] Yes, and people remember names and stories and things like that. So I think, sorry, sorry to cut you off there, as much as you can personalize it, the better. I'm sorry, I didn't mean to cut you off, [01:01:54] so go ahead and finish your thought and then I'll jump in. Also, I just wanted to ask about a different question as well, because you mentioned how to build a project portfolio, and I asked basically the same question last week.
[01:02:09] So, regarding the narrative: how do you go about writing it properly in your code as well, practically speaking, when you build a project? [01:02:18] So I think the narrative aspect would probably be incorporated more in the executive summary or the stakeholder-presentation type of thing. In terms of how you incorporate it in your code, you'll probably need to code out whatever you need to pull out one example and slice and dice the data for one particular segment of users. Maybe you put all of that into a separate notebook that you keep for yourself, but I wouldn't present that notebook to any stakeholders; I would just incorporate as much of it as possible into the presentation and the executive summary. [01:02:55] And I would just add on to that: it's also something you can go over verbally. Whenever I've been in interview processes that have had me walk through a past project, I may show a Jupyter notebook and walk them through these things, but at the end I always have my five bullet points of that storytelling piece: because I did this modeling and came up with these results, I was able to find X, Y and Z. I frame it in a way that, within a very short conversation, they have an understanding of what I did. So I go through the process of framing how I would talk about the project and what I found. And it's OK if that doesn't necessarily exist as, like, a paper at the end of your project; sometimes it's fine if you're in a setting where you can just talk through it and make some of those points verbally. [01:03:55] OK, thanks. I think another good example is when you're doing some type of clustering. If you're doing some type of clustering and you've got some number of features that you used, you don't want to tell a stakeholder, yeah, we used this particular clustering algorithm and then this distance formula to identify the closest members of this particular segment; you don't want to dive into those details. Instead, what you want to say is: here's cluster one, and we call these our mom-and-pop shops. Why are they mom-and-pop shops? Because this is the type of volume they see in terms of foot traffic and sales. And this cluster over here, cluster five, we call these our fast movers. Why are they fast movers? Because as soon as they get inventory, they tend to have a quick turnaround time; they don't hold much inventory in stock, things like that. You're using the data about that particular cluster to create a persona, or a story, about the cluster, and that can help bring it to life. Hopefully that made sense; let us know if you have any other questions on that point.
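For the cluster-persona idea just described, a minimal sketch: run a clustering algorithm, then summarize each cluster with the business-facing numbers you would use to name it (foot traffic, sales, inventory turnaround). The feature names, values, and any persona labels you attach are hypothetical:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-store features.
stores = pd.DataFrame({
    "daily_foot_traffic": [40, 55, 38, 900, 1100, 950, 300, 280, 320],
    "monthly_sales_k":    [12, 15, 10, 400, 520, 430, 90, 85, 95],
    "inventory_days":     [45, 50, 40, 30, 28, 35, 5, 6, 4],
})

# Cluster on standardized features so no single column dominates the distances.
X = StandardScaler().fit_transform(stores)
stores["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The per-cluster summary is what gets translated into personas
# ("mom-and-pop shops", "fast movers", ...), not the distance math itself.
print(stores.groupby("cluster").mean().round(1))
```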
So you're actually answering what they asked for, maybe with the storytelling around it, but the plus two, that's where you have the added value. In other words, when I go back and I've done my analysis of the transactional data, the contracts, the evaluation, then I will normally have two points where I'm saying, well, in addition to what you mentioned, here's two more. Of course, there's a whole list beyond that, but that is really just to show them, A, that you understood what they were looking for, and B, that you are now also providing them with more. But I never have more than five bullet points on that slide. That's it. Very, very short. The question I had about "elimination", in quotation marks, is that very often, for example in my case, I get data which is unusable, OK? It has no relevance to what I'm going to do. And I'm sure that you deal with the same thing in all fields when you get all these different data feeds coming in. And by eliminating, OK, I never delete data, because I always want to have the original source, but I will set it aside. Do you do a similar thing in your processes, where you actually turn around and tell your client that this particular data feed is wasting their money and time by being generated, and it's not required, and in the long run most likely won't be? I don't know how you feel. [01:07:34] Yeah, I would say that this is something I've definitely done, especially when looking at data that's not really helpful for modeling. I think that a lot of organizations set out with a "let's just collect it, we may need it down the line" mindset, and we end up in situations like the one you describe, where there's a lot that's not really useful for predicting a specific goal or meeting those KPIs as far as the analytics process goes. So in my experience, yeah, there have been many times where there are plenty of tables that aren't really used for model training or the rest of the data science process. [01:08:24] I absolutely love that feedback. I was going back and forth there for a second; I was like, oh, is this a feature selection question? But then towards the end, I realized that actually it's not feature selection, it's more about people who have that mindset of, if we can collect it, then why don't we collect it? So, yeah, that's a very interesting point, not one that I've had to deal with personally, but I'm also the type of person that just loves collecting data, so. [01:08:51] Oh yeah. But this is the challenge, you know, because with management, where you're dealing with key stakeholders, it's "this would be nice to have, that would be nice to have." But all these nice-to-haves are not really, quote unquote, necessary or required, or don't even have any value, and by eliminating them you remove the cost of the processes leading up to that data being collected. Because you are collecting the data, you have systems you have to pay for, storage and data capabilities, etc., you have people analyzing it, and then of course all the people handling data that is technically just wasting their time, including the programmers and even the analysts. So I'm just curious if you actually go back to the client and say, you know what, why don't you consider eliminating this particular data collection process?
Because it has really no value to your operation, your business. What is the best way to do that? [01:09:56] Probably by doing exactly what you said, right? Let's map out all the costs and clearly spell it out: in order for us to collect, for example, this particular data set, here's what that actually costs, right? There's a real-world process that occurs to generate this data, and then all the downstream effort to collect it, clean it, and then push it somewhere and have it saved. [01:10:19] What does that actually cost, right? And then probably tie it back to them, or rather turn the question back around to them and say, OK, great, it costs this much to collect all this data. What do you think this will be useful for, and how much do you think it will help you either increase profits or save money? Does that add up, is it going to outweigh the cost? And if not, then what are we doing collecting it? [01:10:44] But that also raises the question: is it cheaper to just buy the data somewhere else when you need it? Yeah, that's another evaluation process as well. So thanks. [01:10:54] I don't have many additional points; I think you hit the nail on the head. But really, it is about getting business folks to move away from saying "let's collect it all and figure out how to use it later" and instead inverting that process: do we have something that will use this? Do that cost-benefit analysis and then make the decision to stop collecting specific kinds of data that aren't as valuable. [01:11:23] Question here from Davran: Odali, what is your favorite data science book? That's a good question. It's making me actually look at my bookshelf. [01:11:33] I would say it's actually Effective Data Storytelling. So I transitioned into data science after working in marketing for a couple of years. My undergrad degree was in communication, and I probably had a little bit of hubris that I could communicate anything, that I kind of knew this really well, but that book gave me solid examples of how you talk to both technical teams and non-technical folks. I think that was the biggest gap where I was coming from: I have a communications and marketing background, I learned all the technical data science stuff, I can speak to that, but how do I translate it for people who haven't spent the time learning stats and algorithmic modeling? So that's one of my favorites. Is that the one that's by Cole Knaflic? Because the one I know is Storytelling with Data. Off the shelf, the one I have is Effective Data Storytelling, so I may be mixing up the titles there, too. [01:12:46] Wonderful. Thank you. It doesn't look like there are any more questions, so I just want to say, guys, thank you for taking part of your weekend to come hang out with us. Be sure to subscribe to Comet ML's channel, and be sure to go check out Odali's wonderful talk from earlier this week about data validation and data preprocessing if you haven't already. Guys, make sure you go download Comet ML. I mean, it's not necessarily downloading; you just sign up for it, pip install it, and then it's two lines of code and all of a sudden you're tracking all your experiments. It is a wonderful tool, highly recommend exploring it, completely free, and it'll change the way you do your machine learning for sure. So definitely sign up for it. It's super easy to set up.
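For anyone curious, the "two lines of code" being described look roughly like this; the API key and project name below are placeholders you would replace with your own values from the Comet dashboard.

```python
# pip install comet_ml
from comet_ml import Experiment

# Placeholder credentials; grab your own API key from the Comet UI
experiment = Experiment(api_key="YOUR_API_KEY", project_name="my-first-project")

# Anything you log afterwards shows up in the Comet dashboard for that run
experiment.log_parameter("learning_rate", 0.001)
experiment.log_metric("accuracy", 0.91)
```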
You're one pip install away from being able to actually track your experiments like a pro. [01:13:41] So definitely sign up for Comet ML, subscribe to their YouTube channel. And guys, this will be posted on my YouTube channel as well as the podcast; we'll have show notes and everything, so you guys will know exactly where to jump in if you want to revisit this conversation. You guys take care, have a good rest of the weekend. Remember, you've got one life on this planet, why not try to do something big? Guys, take care.