HH78-22-04-22.mp3 Harpreet: [00:00:09] What's up, everybody? Welcome, welcome to the Artists of Data Science happy hour. It is Friday, April 22nd. Man, I just had an awesome week in Boston hanging out with my coworkers for the first time ever, which is pretty awesome — I haven't actually, like, hung out with or even met my coworkers before, so that was pretty cool. Hopefully you guys got a chance to go out to ODSC and check out some of the talks, or, if you weren't there in person, to check out the virtual talks. It was a great event. Met a lot of awesome people. I met Ken in person and he hooked me up with the KNN t-shirt. I also got a chance to meet Kelsey. She came and hung out at the Pachyderm table for quite some time. She actually left her bag at the Pachyderm table, and we had to run around the venue trying to find her and get her bag back to her. So, Kelsey, where are you? Speaker2: [00:01:00] Here! I'm happy to be here. Harpreet: [00:01:01] I'm going to duck out a little bit early, in just the next couple of minutes here, but Mikiko is going to be taking over the ones and twos. She'll be monitoring not only the chat here but also the LinkedIn live stream for questions. So if you do have questions at any point, feel free to go ahead and drop them in the chat or the comment section and people will take care of it. Serge is in the building — I also got a chance to hang out with Serge and drink some beers with him. That was awesome. Yeah, it was a great event, and I'm looking forward to doing this again at MLOps World in Toronto in early June. So if you all are going to be there, please do come — it'll be a great event. Be sure to check out the episode that released today with the one and only Data Professor. We recorded that episode a while ago. If you want to watch it on YouTube, it's there as part of the live streams — look in the live stream section and you'll be able to find it — but it's also releasing on the podcast for you all to hear. That's it for me. I'm handing it over to you, Mikiko. Go — how you doing today? Speaker3: [00:02:07] Hey, hey, hey, everyone. Are you excited to spend an afternoon with the one and only Miki? Even though there are other Mikis in the world, only one Miki is me. So, yeah, hopefully y'all are excited for that. I'm excited for it. Harpreet: [00:02:22] Oh, yeah. Awesome. I'm looking forward to hearing y'all. I'll be watching live and direct — you'll see me in the LinkedIn chat, you might even see me in the YouTube chat — but it's all you. I'm out. Take care, y'all. Speaker3: [00:02:34] Very cool. Now, also, feel free to tag me in the LinkedIn chat, because I'm actually trying to chase down the different questions and all that in all the chats. So yeah, this will all be fun. Let's start off with this question: who here was at ODSC, and what was the best part of it? And for those who weren't — if you could have gone to ODSC, what session would you have liked to see? And I think, Serge, you were there, right?
Speaker2: [00:03:04] Yeah, that was — well, yeah. My favorite part of ODSC was meeting colleagues in real life. I've been stuck behind Zoom for two years, hadn't met a single data scientist in person in all that time, and it's been a weird time. So I guess that's, in general, my favorite thing. Besides that, there were some great workshops and keynotes, all sorts of things, from NLP with spaCy — of all the NLP libraries, spaCy is the one I've worked with the least, so it was actually great to get to use it — to a few very good ones on interpretability. As you all know, I really like that subject, so I got a few deep dives into that. And the keynotes — yeah, of course Ken Jee's keynote was amazing, and I also liked Cassie's a lot; I thought it was very clever the way she organized it. And the rest — there were a lot of times I wanted to go to a session, but I just got caught in a conversation. I met a lot of people, too many to list here. I met all of Harpreet's colleagues, awesome people at Pachyderm, and then I hung out with the guys from Seldon as well, very cool guys, and I had fun hanging out with them. You folks are a lot of fun. So yeah, that's my spiel — that's how it went. Speaker3: [00:04:58] Yeah, MLOps folks are a little bit nuts, because we were like, yeah, why don't we go into this field that's not well defined, that's a little bit crazy, where no one knows what they're doing and everyone comes from everywhere. Kind of like that show — what was it — Whose Line Is It Anyway?, where the points don't matter, and so on. So that's awesome. Did anyone else here get a chance to go to ODSC? No? That's okay. I went a few years ago — it's awesome, and hopefully we'll have more in the future. Cool. So I don't see any questions queued up in the chat. If you do have questions, either on LinkedIn or over in the chat, please do let us know. Let me ask another question to everyone here: in the last couple of weeks, was there a problem that you ran into — either while working on a data science project or a machine learning model, or even an analytics project — where you learned something new and you thought, this is so cool, I want to go share it with people, but maybe you didn't post it on LinkedIn? Does anyone here have something they learned that they really loved and would love to share with the group? Speaker2: [00:06:21] I wish I did. I can't think of anything. Speaker3: [00:06:24] Well, something fun that I learned personally — and this is informal, right — is that as much as people talk online about how innovative their companies are or how mature their companies are, in a lot of places not everyone is performing at the most mature, best-practices stage. So when we're thinking about deploying machine learning models or what have you, there isn't really a whole lot of consensus on the best way to do it. For me, that was really interesting. If any of you have been following along with Harpreet, he's doing the 66 Days of MLOps, so I would definitely check out that hashtag.
And every single day he's been sharing something new that he's been learning as part of his role in developer relations over at Pachyderm, and there are a lot of really cool insights there. One of the insights that stuck out to me was that there are a lot of best practices we talk about in testing and monitoring and containerization, but every company does it slightly differently. No matter where you go, though, Kubernetes is very, very important. Everyone talks about how Kubernetes is dying out, and yet so many companies rely on it to deploy and containerize their machine learning models. So for me that was personally pretty cool. Speaker2: [00:07:45] Yeah, I don't think it's dying at all. I'm in more and more conversations where they want to drag me into the ops side, because they know I have the software engineering background, but I want to keep it on the down low. That's what you get for being in that field, you know. I used to be a webmaster for a very large online poker site, so we were getting attacked by hackers all the time, and I had to deal with the load balancing, and it was all on premise, which made it even crazier — that was for both security reasons and because of jurisdictional issues. But that's all something I want to keep in the past, you know, because I like it — I like performance, I like structure, those are things I like — but at the same time I want to be more devoted to data exploration rather than just building stuff. At some point I'll swing back around. But yeah, I'm involved in the conversations in my company about these things, and I think they're very much important. Speaker3: [00:09:07] Yeah, absolutely. And it's a hard balance, because even in my team over at MailChimp — our MLOps team there — we try to figure out the trade-off between the exploratory analysis that is important for creating innovation, for asking new questions, for sourcing new potential areas of value, and, on the other hand, the fact that when we launch a machine learning model or data science project, we want to make sure that, first off, it doesn't break; secondly, that we're not unfairly biasing against certain segments of the population; and that we're able to be proactively responsive. So if there are changes in the underlying distribution of, for example, the sales predictions we're making or the marketing campaigns, we want to understand that before it's a little bit too late in the pipeline. There is this fine creative tension: in certain parts of the process you do need the flexibility to do the research, to sometimes take a little extra time to ask the hard questions; but at a certain point, it is software — it needs to get deployed, measured, and monitored. And I feel like we haven't necessarily found a way to understand those trade-offs without building something, having something break, and essentially doing it ad hoc and seeing what happens. We can try to put together a matrix of understanding, but it's a little tricky. I'd love to hear, Russell, your input on that. Speaker4: [00:10:55] Yeah.
So rather than specifically talking about one particular element, I think there's generally a syndrome with all businesses, all companies, that are innovative, in that there's a balance between innovation and technical debt. The faster and more aggressively you innovate, the faster your technical debt will accrue, and there's a balance to be had — and it's far more extreme for companies with a large workforce. If you've got a startup with five or ten people, it's not likely to be too much of a problem. But if you've got a workforce of multiple thousands, or tens or hundreds of thousands, it's going to be exponentially huge. So setting up a formal pipeline to manage the innovation and then deploy it throughout the organization in a controlled, practiced manner, to try and limit the technical debt, is a real challenge — and, as I say, far more so for those large companies than the small ones. Speaker2: [00:12:00] I would agree with that 100%. At the same time, in the field where I work there's a lot of R&D-type work, and then there's also product-building work, engineering work, and somewhere in the middle it's blending together. Some of the procedures and processes are structured from the engineering side, and they can be very restrictive. It sounds kind of silly, but even something like a naming rule baked into a GitHub bot that says, okay, I'm going to delete your repo because it's not named in accordance with the convention — you understand why those bots exist, but you're like, this repo is never going to become a product, it has nothing to do with the rest, but I need it there, right? It's kind of annoying to have to deal with this stuff, because they don't realize that it's R&D work — it's just stuff that's in development, it's not even close to being done. And when you turn it into a product, you're going to follow all the rules, but in the meantime it's just a nuisance. So yeah, I think these things are really hard to define, and you have to ask: when does something become something we have to monitor and formalize like that? Speaker4: [00:13:27] Yeah, maybe I can jump back in on that — I typed something in the chat, Mikiko. I think that's a very valid point, Serge. And for those things, those items that are changing — whether they be a fairly small, insignificant element in one corner of the organization or something that affects all levels and strata throughout the organization — if there's a formal change process and procedure in place that's followed ardently, that helps de-risk misfires and other issues with the implementation of changes, in that the change is identified, and the deployment of the change is planned, considered, and executed on a certain date. You can then have some kind of backup plan or mitigation plan if something goes wrong with implementation, to immediately roll back to where you were before. And if it does go ahead, okay —
you stamp that down and you have an audit history of all the changes that have happened from day zero, if you started that early. That helps, as I say, limit and de-risk issues with such changes — and probably more so for those that seem inconsequential or innocuous, that people might argue about in little enclaves of the organization. You document it, and then you can understand why it was changed that way, rather than just seeing two different sides of a large organization arguing about a naming convention and watching it flip-flop back and forth over weeks or months. So a formal change process, I think, is a big advantage. Speaker3: [00:15:12] Yeah, absolutely. And we have a question coming in on LinkedIn from Santosh saying he'd love to hear the latest on interpretability and explainability. I think that's a pretty broad question, so what I actually want to do is tie it back into talking about value for data science projects. I had this discussion with a junior engineer within our organization where the question I was asking them was about the difference between observability of normal software products versus observability of machine learning models. When we talk about observability for traditional software products, a lot of the time the question is: does it work, and does it work as intended? And one of the differences between traditional software products and machine learning or data science based products is that a lot of data science and machine learning is based on non-determinism. It could be non-deterministic data; it could be that even within the model itself there's some amount of randomization that is core to the model. So essentially my question to them was: is it enough that it works and it works as intended? More importantly, when we deploy models — let's say it works, it predicts, but it does so for different reasons between different sessions — I'd be curious to hear where interpretability and explainability sit in helping us understand the value of a data science model, how it drives a product or a feature, and how it helps with adoption of data science and machine learning products through a company or an organization. So I'd love to get, Serge, your take on it, and then Vin, if he's joined — I see that he's here — we'd love to hear some thoughts from him too. Speaker2: [00:17:14] Well, I think a lot of the value is in de-risking. There is value coming from understanding the model, of course — and maybe if you can present it to the stakeholder, the end user, and say, okay, this is what the model predicts and it's because of this, this, and this, that is value by itself, I think. But a lot of it has to be de-risking, and there are tools out there for
things like sensitivity analysis, error analysis, and uncertainty analysis that help form an understanding of the risk — in addition to tools for fairness and fairness mitigation, for understanding all the different things that happen with bias in the model, and then robustness; that's another thing, there's adversarial robustness. All these different tools can be used to understand where there's potential failure with your models. Of course, they don't cover everything. Explainability tools might help you understand things that are only understood on a case-by-case basis. You might find a subset that's being misclassified and you're like, this is very interesting — it happens with images that are like this, or sounds that are like that — and you start to find these commonalities, but no toolkit will actually bring that information together, at least at first glance. So interpretation is one of those things — and you'll find this also in the realm of statistics, of statistical analysis — that often takes a human; it's not easily automated. It takes a human to actually take the visualizations and connect the dots. That's what I find fascinating. I think as a lot of MLOps tools evolve, and especially the no-code ones, that will free up a lot of brainpower for people to focus on these elements, because I find that a lot of the value data scientists can bring is in these — I wouldn't say uncharted territories, but at least not-traveled-enough ones. There's not enough discussion, I've found, from data scientists in the areas of error analysis and uncertainty estimation and so on. Speaker3: [00:19:51] Yeah, absolutely. And actually, Vin, before Kosta, I'd love to get your take too, because a while back we had — I don't want to call it the Zillow disaster, but we had a little bit of a Zillow disaster — and you had some really good pieces on it, especially with regards to risk in data science projects and making sure not only that you're getting value for your money out of the model, but also that you're protecting the corporate brand and protecting future initiatives. Have you had experience using interpretability and explainability tools in your projects with clients, or what are some other methods you've found to help clients de-risk their endeavors? Speaker4: [00:20:36] The first thing a company has to do is create reliability requirements for everything. Every single project needs a reliability requirement set, and that's what's different about traditional software versus data science and machine learning, because traditional software is just supposed to work — here are your requirements, it will work and it will do these things. But if you look at something like Amazon Alexa, there's no way you could put together a requirements document for Amazon Alexa, because it works in a fair range of use cases, and then there's stuff that might work — I mean, ask it "why not?" and who knows what'll happen? So that's really what you have to do to define reliability: you have to say, here are the cases where it must work or we as a business cannot generate value from it. And if you look at Zillow, they actually did that. The C-suite said, look, we've got this really narrow window of profitability, and if our model is not this accurate, we will fail.
And what was really weird is the data science team gave them a model with fairly good reliability requirements, or reliability parameters — gates. The Street and pretty much everybody in the data world said, look, your estimates are bad, like, really bad, and so if you're going to do the same sort of work to estimate home prices for buying as you did with Zestimate, that'll end badly — you'll have a bad time. I mean, they should have just sent the meme. And Zillow just bowled over everyone who said you would have to do something far more complex to do this right, and said, well, the market's going to rise no matter what. And if you look at what just happened with CNN+, the same thing happened. People at McKinsey were of two minds. McKinsey on one side said, if you look at the data, we can make a case to support two million users by the end of the year. But the more analytical, hard-core data science side was like, what are you talking about? I have not only models but also anecdotes of this failing horribly — why would you do this? And again, they were overridden. That's where it doesn't matter what framework you use: at the end of the day, this becomes a business decision, and if you don't have the awareness at the C-suite, at the senior leadership level, to make a go/no-go decision when you're dealing with research, it doesn't matter how loudly any one segment of the data science team yells. Someone will fall in love with their model, will push it forward, will figure out a way to justify its reliability, and then it ends up in production and disaster happens. So when it comes to what I use to gauge the reliability of a model, it isn't so much explainability as a body of evidence. I can explain the model all day — if it doesn't work, it doesn't matter how well I explain it. It's a completely different framework. It's not about observation, because when you're talking about observing the model, you're already too late: if the model is failing and that's when you figure it out, that's not going to work. It isn't the model that you have to keep track of, it's the data in front of it. You have to understand your model well enough — and that's where you come into some of these explainability frameworks, where you really decompose your model and decide: how deeply do I have to understand this thing in order to deploy it to production? How much confidence do I have to have in it? That's where it comes in: you understand what changes in the data — what changes could be macro factors, like inflation is killing models right now — you understand what conditions could cause your model to begin to behave unpredictably, and that's what you're monitoring. That's really what you need observability into, so that you have a fail-over for your model. When it begins to become apparent that your model is probably not going to perform that well, it fails over to something else — usually traditional software, maybe even a legacy model that was just stable — until you can figure out how to retrain your model and get it back to the point where it meets reliability requirements. Speaker3: [00:24:46] I love it, I love it. Yeah, that's a really, really good point: if you can't understand it and you push it to production anyway, that's probably not a good place to be.
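As a concrete illustration of the monitoring Vin describes — watching the data in front of the model rather than waiting for the model to fail — here is a minimal sketch of input-drift detection with a two-sample test and a fail-over hook. The DataFrame names, the alert threshold, and the `serve_with`/`fallback_model` hook are assumptions for illustration, not anyone's actual stack.

```python
# Minimal drift-monitoring sketch: compare live numeric feature distributions
# against the training data and fail over when they diverge. Names are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumption: alert threshold, tune per use case


def drifted_features(train_df: pd.DataFrame, live_df: pd.DataFrame) -> list:
    """Return the numeric feature columns whose live distribution has shifted."""
    flagged = []
    for col in train_df.columns:
        # Two-sample Kolmogorov-Smirnov test: a small p-value means the
        # training and live distributions of this feature look different.
        _, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
        if p_value < DRIFT_P_VALUE:
            flagged.append(col)
    return flagged


# Hypothetical use: route traffic to a simpler, stable model when inputs drift.
# if drifted_features(train_df, live_window_df):
#     serve_with(fallback_model)
```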
Speaker2: [00:24:56] So the thing is — well, I mean, I've been working for the last three or four years in basically highly regulated environments, right? So you're talking about things like medical applications of AI. And the funny thing you have to grapple with is that the notion of risk is not just about how often am I right. It's also about weighing that against how often the prediction is wrong, and what the potential for, and limitation of, damage is. Risk essentially comes down to two things: not just the likelihood of being wrong, but also the damage caused when something is wrong. Now, this isn't new — stuff like this isn't new to the world; we've been doing risk analysis for a couple of hundred years, right? So essentially — sorry, I think I dropped off there for a second — the whole thing with a risk analysis is being able to trade off the damage done versus how likely it is to happen, and that's how we frame business decisions in the first place, even in a highly regulated environment. Tomorrow, if you go into a medical device audit and you're asked, hey, what happens when the model predicts something incorrectly, you're assessed on two essential fronts: one, what's the risk of patient harm, and two, what's the risk of eventual secondary adverse events happening from there? That's the main thing. If there's minimal patient harm, you're already down to a lower risk category; if there's no risk of patient harm, you're in an even lower risk category. So it's really about understanding and valuing that. We need to stop looking at this as, oh yes, there's this normal distribution and it must work to such-and-such a level. What is "work"? What does it mean to actually work? We need to define those risk factors — what kinds of risks are acceptable to a business. Let's take Zillow, for example. What's the risk of that prediction being wrong? Is it going to hyper-inflate house prices? I don't know anything about this market, by the way, so I'm talking random crap here. But what's the risk of it over-predicting and then hyper-inflating the value of all the properties they've listed? What happens if that happens? Is that going to have an effect on the housing market in general? What's the risk of that? So we need to be really conscious of those risks when we're designing our models, the way we're reporting them, and the way we implement them. Now, the downside of all of this is — let's say you're looking at a medical scoring tool. Let's say you're looking at retinal images and trying to detect whether someone has glaucoma. The way a lot of these things are tested — say an ophthalmologist comes in and diagnoses someone with a particular condition — the way they're reviewed legally is: among a group of similarly trained peers, is their decision reasonable? Would anybody else with the same training and the same level of experience have rationally made the same decision? So if you can find three or four other ophthalmologists who would go, I'd look at that and, to be honest, yeah, I might have made the same decision —
then you can rationalize that the correct decision was made at the time, even if it wasn't the correct decision from a medical diagnosis perspective — maybe there was something else going on that the doctor did not notice, right? That's the kind of framing they use both in the legal system and in the medical world. At some level, that's the only way we can actually come to this bona fide trust of a model: if a trained professional is able to make the same kind of call and could rationally justify having made it. Otherwise you end up sinking into this trap where a human might typically only achieve 80% precision in a particular task, but we expect models to have 99.9% precision — because without that understanding and framing of rational error, and without contextualizing it against the risk of damage done by that error, there's no way for us to find legal recourse, to feel safe in saying, if something goes wrong, what's the likelihood of us getting through it legally? And it turns into this litigious madness where companies are like, no, it's got to be 100% accurate before we'll use it, because otherwise we'd get litigated against. So there's this interesting environment where we have to start understanding what is rational from a legal perspective as we apply models in the real world in any context — whether that's the housing market, whether that's medical, whatever it is — and it takes a host of expertise to come up with that. That's where I think a lot of companies may not have the right combination of statistical knowledge, model understanding, and legal understanding in their specific domain. Having that right combination is probably what's going to give companies confidence that, hey, we can trust the models we deploy, because these are the factors we look at when assessing a model, not just precision and recall. Speaker3: [00:30:31] So you say legal and I'm just thinking, ah yes, nightmares of GDPR again, and democratization and access, and oh my God, why are we trusting people with data? But yeah, I think those are all fantastic perspectives, because especially when it comes to the question of, is the model working, is it working the way we want it to, and is the way we actually want it to work even the right way — I feel like that's the wonderfully creative art of data science. To a certain extent it's much like one person's diet versus another person's diet: some people want to eat cookies and some people can't eat cookies, and I'm not going to tell someone they can't eat cookies — they've got to decide for themselves. But I think that's all fantastic stuff. We have a question from Marc Lamy, right? Harpreet: [00:31:25] Yeah, that's it. You nailed it. Speaker3: [00:31:27] Awesome, fantastic. And it sounds like you're working on a social media NLP project. Do you want to talk us through it and pose the question that you're struggling with? Harpreet: [00:31:40] Yes. So I was an Instagram user — I stopped using Instagram — but over the past few years I've seen that on posts from really big accounts you have, in the comments, so many bots commenting random words in a sentence that don't really make sense. They often have photos
They often have photos [00:32:00] of girls in bikini porn, actress, everything, and they always have one link in the bio which leads you to page. Who leads you to page? Who is like You're going to have to pay for fake sexual content at one point and it's always the same through all the bots. So I was always. Speaker3: [00:32:16] Like, I get those same bots to it's. Speaker5: [00:32:18] Crazy. Yeah. Speaker3: [00:32:19] Don't know what they're doing. Harpreet: [00:32:21] Yeah. And so I was always wondering why Instagram doesn't ban them. Like is there really reason? And I was wondering, can I try to develop the model to make something to flag those users? So I collected a bunch of data. I have about 130,000 comments, which is roughly 80,000 users. I already labeled 4000, which is still a lot, but I have a lot to go, so I'm struggling on that part. I found techniques to label more bots because bots are similar, so from one to another they have the same description in the same photo, the same the same format for many things where I can find bots but I can't really like find more legit users that make sense. So I'm kind of struggling on that data labeling part. Speaker3: [00:33:10] Awesome. Cool. And just to reiterate, is the struggle that is a struggle that it's time intensive or that you're struggling to find more legitimate users of social media sites? Harpreet: [00:33:22] No, it's just very, very long to do because I when I see the users, I on like half and I know if they're if they're about on and about, but just there's just too many because I collected a lot of them because it's a very imbalanced problem, because there's way more alleged users than bots. So it's just really hard to label really long. Sorry to label. Speaker3: [00:33:45] Everyone. Awesome. I'm going to let anyone jump in on this one. The fun, fun question I thought that was interesting was why would a social media company be incentivized to not kick off more bots? I don't know. I think that's a fun one. But if anyone wants to jump in on the actual [00:34:00] techniques for how we can help out Marc, please feel free to do so. At this moment. Speaker5: [00:34:06] I think I have a rationale that sorry, I'm outside of it, and I think the model is really about engagement and advertising. And if the boss does not interrupt the engagement and their advertising budget, why shouldn't stuff block them? That's my rationale. I could be wrong, but unless somebody flags it that this is could land them in legal problems or whatever, you know, they're making money on it. So that's what I think is going on. They're making they're still making money on them. Speaker3: [00:34:43] Yep. Agreed. And I think from so in terms of tackling the data labeling perspective, all the much smarter data scientists on the call, which I know is most of you actually is much smarter than me, it seems like there could possibly be two options. One is to do upsampling upsampling downsampling, and another option is I can't think of another option. So this is where. People who are smarter than me. Feel free to suggest some. Harpreet: [00:35:09] So. I had a talk with Morgan Freeman on I think it was this week or last week actually recommended me to come here. And it was he was telling me about some supervised technique, which I never heard of before, but it just mentioned it like that. So I was just putting it up as you just told me that. Speaker3: [00:35:29] Very cool. And actually semi supervision is a little bit different right from upsampling down sampling. 
Would anyone here like to dive into semi-supervision or weak labeling? Speaker4: [00:35:40] I can recommend a book on it: Mastering Machine Learning Algorithms, by Giuseppe Bonaccorso. It will go over your head very quickly and it's almost biblical in size, so it's a book to own — a good one to have in our libraries for those super special cases. Speaker3: [00:35:58] Very, very cool. And it's good to see you, Tom — it's been a while; I don't remember when the last time I was at the data science happy hour was. Okay, cool. Well, if anyone on LinkedIn has questions — and then, yes, please, go ahead. Speaker4: [00:36:17] Look, I talk too much, so I try not to. Yeah — when you're dealing with bot detection, you've already kind of figured out that they all look the same and you're going to find the exact same messages. But the problem is that the message is different every iteration, so they'll have different talking points, and it almost follows news cycles sometimes, with bots that are meant to amplify a particular message. They'll also follow advertising cycles, and sometimes their contract cycle — you can tell some bots are on a one-month contract, some are on a three-month contract. So what you're actually doing is trying to figure out the cycle each bot is on, because no matter what you build, it'll only work for so long. The problem with bots is that the drift is so fast if you're identifying them from the content they post. The real way to go at it is to look at the network — to do a graph analysis and begin to understand why certain bots interact with the same content. Once you find content, or a message, that gets amplified, that's what you want the model to latch on to, because as soon as you find this week's message, that's what you want to label. You don't want to label the accounts themselves; you want to label the activity, and then the connectivity between bots. Because — and I'm not calling anyone out, so no angry DMs this time — there are certain influencer accounts that use bots, and reliably in their comments you will always find bots, but the bots also hit the like button, they also hit the retweet or reshare or whatever, and you can learn a lot from those aspects. When it comes to bot analysis, trying to analyze text is almost the most painful way to go, and eventually your model is going to break down because, like I said, the cycles are really going to blow you away. So do more of a network analysis. Once you find a small botnet, you're going to realize that they're all interconnected, because — I can't remember, there was an article written a while ago that I'll try to find, where they called out that there's a small number of bot farms on the planet; I guess it's kind of like crypto mining, where it's all centralized now — and once you find a small cluster, it gets really easy to map the rest of it from there. So that's where I would focus: don't worry so much about the content as the activity. Once you find a small botnet, begin to analyze their activity, and that's the data you want to mine — follow those accounts and see what they do, for how long, and with what main, legitimate accounts they work. Because that's going to be your indicator: really identifying a bot's behavior versus what a bot says.
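A minimal sketch of the network angle Vin describes — linking accounts that engage with the same posts and looking for unusually tight clusters. It assumes a `comments` table with `user_id` and `post_id` columns, which is a hypothetical schema for Marc's scraped data, and community detection here stands in for whatever graph tooling you actually prefer.

```python
# Co-engagement graph sketch: accounts become nodes, shared posts become weighted edges.
from itertools import combinations

import networkx as nx
import pandas as pd
from networkx.algorithms import community as nx_comm

comments = pd.DataFrame({          # toy stand-in for the scraped comment data
    "user_id": ["a", "b", "c", "a", "b", "d", "e"],
    "post_id": [1, 1, 1, 2, 2, 3, 3],
})

G = nx.Graph()
for _, group in comments.groupby("post_id"):
    users = group["user_id"].unique()
    for u, v in combinations(users, 2):  # link every pair that hit the same post
        weight = G.get_edge_data(u, v, default={"weight": 0})["weight"]
        G.add_edge(u, v, weight=weight + 1)

# Tight, heavily connected communities are candidates for manual bot review.
communities = nx_comm.greedy_modularity_communities(G, weight="weight")
suspicious = [c for c in communities if len(c) >= 3]  # size cutoff is an assumption
print(suspicious)
```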
Harpreet: [00:39:23] That makes sense. Maybe a little bit hard for me, because I don't have the activity of those bots — I can only see what they post, just the public data they share on Instagram. So I'm not sure how to do that precisely in this example. Speaker4: [00:39:40] Start with a manual analysis. This is one of those things where, if you watch the behavior, you'll create a little bit of a heuristic on your own, and that's how I've seen a ton of different analyses start: you have to watch the bots, and you'll see the pattern of behavior that you first model. So your first model is an expert system, your hard coding, and it's really meant to gather your data set and to begin experimenting with what manual, expert-system-type model starts to look sort of reliable. You're not going to be able to predict anything very well, but your data sets are going to get way better. And that's really where you're going right now: you want the data gathering to be automated, but you have to first create the heuristics that do that for you, because otherwise you end up with so much noise — bots, even though there are a lot of them, are only a small percentage of your data set, and you want to change that. You want to make it far more likely that you pick up bot behavior in all of your data sweeps. So create that heuristic to start gathering data behind, and then validate — obviously, rigorously validate your heuristic — before you dive in with the assumption that bots are bots and destroy your model. But you really have to watch them, and I hate to say it, it's a manually intensive task: you're going to have to watch, create the heuristic, go through several iterations, look at the actual data sets you gather, and trace back to figure out if they are bots. And this is a long-term thing — you'll be watching them for a week sometimes to figure out if this is a bot, or just somebody acting like one, or if your heuristics are just completely off. So it's hard to do. And like I said, you're also going to be looking at likes, reshares, and commenting, and it's really hard to gather like data, but you can do it by looking through their history — with most social media you can actually pull history for a particular user account, and that will typically include likes, interactions, and engagement. So it's not easy. You may have to do some scraping, and I used to say scraping's evil, but we've had court rulings in the US that now say it's okay, so go on ahead. It's great. Harpreet: [00:41:58] That's awesome. I love it. Yeah. Speaker3: [00:42:00] Very cool. And I think there are a couple of relevant comments, and also Kosta has his hand up, so if you have additional suggestions, please feel free to go ahead and talk about them. I just want to point out, Serge asked: wouldn't outlier analysis be helpful in this case? Another idea that popped up: if anyone knows Chip Huyen — she used to be over at Snorkel AI — their big thing is weak supervision, so that's an area to check out. And then, Kosta, feel free to go ahead and add some additional suggestions. Speaker2: [00:42:41] Awesome. To be honest, at the start of that question I didn't know much about how to go about it. My main insight was that we've got Marc with a Lamy and I've got a Lamy here — yeah, pen nerds, the fountain pen brand. But anyway.
But based on what I was listening to from Vin, this is weirdly a lot like pattern recognition in signal processing, right? You've got a radar signal that's got bucketloads of bounce-back, and you've got different reflections from different objects from everywhere, and it comes down to the way we visualize those signals — in this case it really comes down to how we represent information, right? As opposed to trying to represent information as, here's this linear timeline of what this one individual account did, you're trying to identify patterns in the whole mess. It's similar to how they look at security information for banks and fraudulent transactions and things like that — they're looking for patterns to match against. So if you can visualize the data of all activity per account, per user, over time, you're going to start noticing essentially what Vin was describing — feedback loops — because you've got this bot that's fitting to this particular news cycle, or this particular set of words or actions that are really popular right now, and as the trend moves on, they'll take a little bit longer to react. That's the transition you're trying to look for. And if you can overfeed the bots in some way — artificially prompt them — you see it all the time: have you ever played with those online chat bots, where if you spend enough time with one you can actually teach it a cycle of questioning that doesn't make sense, and you can detect that? So what we're doing is overfeeding this feedback loop, making these bots learn into it. It has a slow rise time but a super high oscillation, if you're thinking about this as a signal analysis. It's oscillating like crazy: it's staying on the old topic, we've moved on, so it corrects, and then it stays on that topic for a while, but we've already moved on again. It's that transient nature — it's always going to be a little bit behind, but just barely relevant, right? So if you look at this almost like a signal analysis at a mass scale, it might be more visible than if you look at it as, hey, is this individual account a bot — where you might end up looking for the trees and getting lost in the forest, in a sense. The scale of the data may tell you a different picture. It's the same analogy as looking for a line: you look at texture differences, but then what's the difference between the edge of my face and a wrinkle? Which line is the relevant one you're detecting? That's the difference between an edge detector and an object detector — totally different scales of what they're looking at. So yeah, that was just my "okay, wow, we can look at this problem in a completely different way." I'd be really curious if anyone's tried stuff like that. I think it's similar to the output of graph analysis, and it's similar to the kinds of things they do in, I believe, the financial transaction world. So maybe there are avenues there.
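Putting Vin's "heuristic first" advice together with the weak-supervision idea behind Snorkel that came up in the chat, here is a minimal sketch of hand-written labeling rules that vote on an account — written in plain Python rather than the Snorkel API. The profile fields (`bio`, `account_age_days`, `n_posts`) and the specific patterns are assumptions about what Marc's scrape might contain, not a prescription.

```python
# Hand-written labeling functions in the weak-supervision spirit (not Snorkel's API).
# Each rule votes BOT, LEGIT, or abstains; the profile fields are hypothetical.
import re

BOT, LEGIT, ABSTAIN = 1, 0, -1


def lf_paylink_in_bio(profile):
    # Marc noted the bots all carry one monetized link in the bio.
    pattern = r"(onlyfans|linktr\.ee|bit\.ly)"  # assumed examples of such links
    return BOT if re.search(pattern, profile["bio"].lower()) else ABSTAIN


def lf_copied_bio(profile, known_bot_bios):
    # Near-identical bios across accounts were one of his strongest signals.
    return BOT if profile["bio"] in known_bot_bios else ABSTAIN


def lf_old_active_account(profile):
    # Long-lived accounts with an ordinary posting history lean legit.
    if profile["account_age_days"] > 365 and profile["n_posts"] > 50:
        return LEGIT
    return ABSTAIN


def weak_label(profile, known_bot_bios):
    """Majority vote over the rules; ABSTAIN means send it for manual review."""
    votes = [lf_paylink_in_bio(profile),
             lf_copied_bio(profile, known_bot_bios),
             lf_old_active_account(profile)]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return BOT if votes.count(BOT) >= votes.count(LEGIT) else LEGIT
```

These rule-generated labels are noisy by design; the point, as Vin says, is to bootstrap a better-curated data set and validate the heuristics before trusting them.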
Harpreet: [00:46:09] Great. Thank you, thank you. I'll look into it. Speaker3: [00:46:12] Very cool. So, Tom Ives, feel free to jump in, and then if anyone has questions, please feel free to post them. Oh, right — there was someone who posted a question about insurance, so we'll go to that question after Tom Ives. Also, please feel free to eyeball the LinkedIn chat — Hamzah, good to hear from you, hello there. Very fun project, you all should check it out. He had some really good suggestions about ways that you could label, so feel free to check those out. Tom Ives, let's go to you, and for the person who asked about the insurance thing, let me go hunt down your message and then we'll get to your question. Speaker4: [00:46:54] I just want to help those in the audience understand: the chiropractic data engineer is Mark Freeman. And if you would like to understand this story more fully, all of you are welcome to join our Data Scientists Learning Guild on LinkedIn — just send me a direct message on LinkedIn and let me know you want to be part of it. Vin, I'm Catholic, and so I'm going to use some Catholic words here: I don't worship you, but I venerate you. That's funny — venerate. And so I was really trying to listen closely, like listening between your spoken lines, and I think what I heard you say was something pretty basic and important. I just want to make sure: if we're considering doing semi-supervised learning, don't rush right to it — still do a good job of unsupervised learning, go ahead and do all the analysis you can there before you just dive into trying to train something semi-supervised. Is that what I heard you say? Speaker4: [00:47:55] I think no matter what you do, you have to start with the data. But in this analysis in particular, the data gathering itself has to be done in such a way that you're curating a data set — sort of slanting your data gathering mechanism. And if you don't do that, it's even more intensive, I would say, than most efforts, because what you're looking for in bot analysis — or any anomaly in general, any graph-type behavior, ecosystem behavior, network behavior that's engaged in by a minority of the community or a minority of the graph — means you have to really curate your data set carefully, because you have so many different communities that bot behavior could overlap with pretty easily. You can end up with a perfectly normal-looking data set that destroys your model performance, because, like I said, you have so much community overlap when it comes to bot behaviors. Some people naturally act like bots — they say half of everyone you meet is of below-average intelligence, and that's especially true on social media, just so you know. Speaker4: Then the angle I'm coming at with my question is more about general methodologies, not this specific case, because I'm always trying to figure out best practices. And it seems like, typically, if you're at a point where you're doing semi-supervised learning, it's just because the labels are kind of expensive to collect and you're wanting to do more.
But typically, in my mind — and I could be wrong on this — unsupervised learning is when you're trying to figure out the groupings, classification-wise, of these features, so that you know what predictive analysis you might want to do. So I'm just trying to make sure — we don't talk about semi-supervised learning very much, and as a result I don't hear best practices around that type of analysis very much. That's where my question is coming from. Speaker4: [00:50:05] Any time you do labeling, I think that's actually way harder than unsupervised. I think supervised and semi-supervised are both harder to do than a completely unsupervised approach. Doing what you're saying is essentially using the modeling process to create a hypothesis, then going down that rabbit hole, curating a data set based on what you think you found, and then doing some sort of observational study. I think that's way easier than trying to do any sort of labeling at any sort of scale and then trying to reconcile multiple labels — and then you introduce semi-supervised techniques on top of that. I think that's one of the most ambitious approaches, and, you know, large companies have the resources to do that. So when you talk about best practices: any time you get into labeling, it's a rat's nest, because it's so hard to get just any labeling right. Speaker4: [00:51:08] And this line of questioning is coming from another conviction that I teach all the time: it just so happens that the 80% or more of the work we do in the data pipeline, before we get to predictive modeling, gives about 80% of the value back to the business if you treat it right. So I kind of hear you saying, again: don't rush off to any kind of learning, just do a lot of data visualization, because a lot of times that answers the most important questions. Anyway, it's really about asking the business what it is they actually need. Yeah, yeah, yeah — and then you don't have to do machine learning to figure that out; you can do a lot of data exploration to figure that out. Yeah. Okay, cool. We have an accord, Captain Jack Sparrow. Speaker3: [00:51:54] Very, very cool. What we'll do — let's actually hop over to this question from Cristiano that I posted in the chat, and I'm going to call on people; I'm wielding my evil powers of doom as the person running the happy hour. So we have a case study question, it sounds like. I would love to hear from Mark — the chiropractic data engineer — and from Eric and Santoni: how would you go about solving this problem? So the problem is: as an insurance company, your goal is to reduce the number of calls from your customers, because you need to maintain a large customer care workforce, which adds cost — okay, you need to save money; basically, for an insurance company, it's support calls. Your problem statement is — well, that's a lot of words, okay — why do non-drivable vehicles accrue more customer calls than drivable vehicles? You visually explore the distributions of both populations and extract the means.
Okay — so basically, if we were to sum it up, you have data... or, Cristiano, do you want to sum up the problem in, like, three sentences? Harpreet: [00:53:05] Yes. So basically an insurance company is finding out that customers with vehicles that were non-drivable are calling, on average, two times more than those with vehicles that were drivable. So when you look at the means of both of these populations, you find there were two extra calls coming from non-drivable vehicles as compared to drivable vehicles, and you want to find out what is contributing to that difference. What I did to start out was find the top factors that were contributing for non-drivable vehicles — basically, the top factors that are adding to the calls — but that still does not explain the difference between the two means, you know. Speaker3: [00:54:08] Awesome, cool. So, all the people that I just named in the chat: one, how would you solve this case study if you were doing it? Number two, if you're the hiring manager or the interviewer, what are you looking for from someone as they're solving this problem? We'll start with Mark, then we'll go to Eric, Santoni, and then Greg. Harpreet: [00:54:32] So my first thought was to ask what the problem actually is. I'm not trying to predict who is going to call or anything like that, so that removes a lot of things — this seems like a causal inference kind of problem. The second thing I'm thinking is that I don't control data collection, so this is going to be a secondary analysis using observational methods. That means I have observed covariates — possible reasons why the outcome variable, the number of calls, goes up or down — but it also means I have unobserved covariates, things that weren't collected that I'm not accounting for. So for me this becomes more of a statistical problem, thinking about your populations. I'm hearing "mean," but I feel like the mean may not be the most effective summary — is it the median, is it the mode? People say "average" and mean different things. You said mean, right, but you don't know the underlying distribution. So from there I would look at the underlying distribution of the outcome variable, which is the number of calls, in your two different populations, and I would first just try to understand what type of distribution I have. Is it a normal distribution? Which I highly doubt. That will determine what assumptions you have to use moving forward. And then you need to ask: am I missing data? If it's missing, why is it missing — is it missing at random, or is there a reason why it's missing?
So and I've come from a health care background, so that's how I use intervention. But like what's your treatment or control? And so that's, that's going to be the what you call it, what status they are for, what status they are for like non nonmoving and moving vehicle. There's along a long list and then from there you can see kind of like what are the driving factors? So whether it be like regression or whatever thing you may do with that test and that's how I'll start again, really try to start simple, make sure you understand your two populations and then engage foundational statistics for that. And there's more kind of advanced things you can think of like for example, like multi level modeling. So maybe, you know, you have these two populations, but like there's a subset where it's like maybe by zip code or by state, there's like some factor that's just being hidden by combining them all together. So those are some of the things that I really look for. Does that answer your question? And then I know there's another question. Speaker3: [00:57:39] I think let's let's actually hop over to because Santonio and Eric are asking some clarifying questions in the chat. So either center or Eric, do you want to hop on? And especially I think both of you have gone through interviews. You've been on the candidate side, but you've also been, I believe, on the interviewing and the hiring side. So [00:58:00] what are some questions that you've also asked in the chat? What are some what some additional information you would need to feel confident solving the problem or the case study? We'll start with let's start with Santonio, then we'll go to Eric. Speaker5: [00:58:13] Yeah, Eric. Those goes yeah, I will answer this question by answering the second part of the question, which is like what I'm looking for is a bunch of clarifying questions. And the question, so was this an interview question or that you came across or. Harpreet: [00:58:33] No, this is not an interview question. Speaker5: [00:58:35] Okay. Yeah. So I don't I don't understand the question. So first of all, what is a non drivable. Harpreet: [00:58:42] Vehicle that has shut down? Maybe the engine is off or something now it's not in a driving state, so it needs to be towed. So it's it's basically a non drivable and drivable is like it's still driving, it's just damaged from some some parts are damaged, but it's still driving. Speaker5: [00:59:03] So it's not a tire prior labeling that you have at the time of the call you get this piece of information is that this is the. Harpreet: [00:59:12] Yes, yes. Speaker5: [00:59:14] And then when your analysis of that data, you're finding that on average the most more calls have to do with a non drivable vehicle. Harpreet: [00:59:27] That's correct. That's correct. Speaker5: [00:59:28] What do we want to get? What's what's the question? What do we want to get from this? Harpreet: [00:59:33] We want to reduce the basically, we want to reduce the number of calls and we want to exactly find out why are those extra like why on average we get more calls from non drivable vehicles so that by looking at the factors maybe we can make some adjustments and reduce the number of calls. Once we know the factors that are impacting that are causing those calls, maybe we can use [01:00:00] them to our advantage. I hope you got it. Speaker5: [01:00:03] Yeah, I think you're answering your. Question a little bit. 
I was just to say, there are lots of other factors that we have to explore in the data before we could formulate a good answer. I mean, I like Mark's approach, which is like this is like general steps that I would take with with this problem. But really, to me, it's I mean, now I know what the goal is, which is reducing the number of calls or claims. And I also know that this labeling of the data non drivable versus drivable is just another piece of information I'm getting in that poll. And then so after that collision happens or accident happens, I'm going to get a bunch more information around what was involved in the collision, like what cars were involved, who was at fault versus not. So all of that and all of that information is going to be relevant to answering that question. So I am skeptical that we can get to the bottom of how do we reduce number of total incoming calls by using this particular piece of information around drivable versus non drivable. So I know, I know that this isn't about this particular problem to solve, but I think generally the approach is like if I was interviewing someone, I'd probably not ask that specific question. But putting that aside, if I did happen to ask that question, what I would want to see is defining what the goals are, defining what the data are that are available, and then taking that. Harpreet: [01:01:37] Approach based. Speaker5: [01:01:37] On that. So no formula. Speaker3: [01:01:40] Very cool. And Eric would love your insights as well. Harpreet: [01:01:44] Sure. So Sandra, kind of building on top of what you just said, it kind of would be kind of an interesting question to ask in an interview because, like, it like totally doesn't have like a good, like, straightforward answer where it's like, well, if you just start spouting off some some like [01:02:00] really detailed, heavily assumed plan, then maybe that's not wasn't really the point. Maybe is like, well, actually, I was hoping that you would ask me like, what the heck is a non drivable vehicle. That's like totally not a term we usually use. I agree it would be a tough interview question, but I guess it depends on what you're trying to get out of the interviewee. One of the thoughts, the questions that came to mind for me was, I don't know if reducing. I'd want to understand where the goal of reducing the number of calls came from. Because because it could be that people who have non drivable vehicles were in horrific accidents compared to people who got fender benders. And yeah, you just like call your insurance one time compared to the person who got hauled off to the hospital and is calling you because they're dealing with all sorts of problems. And in that case, it's like totally normal that they're going to call you perhaps even twice. And so and so I would want to understand why those people are calling multiple times. And it could be that you could reduce the number of calls by recognizing like, oh, maybe we should put in some sort of a service related to text me from your hospital bed or something like that. Harpreet: [01:03:12] I don't know if that ends up being one of the causes, but knowing that and then the other another angle that I was taking it from instead of like severity of accidents was I would want to understand maybe there are subpopulations within like the multiple calls group because there could be like multiple calls people who got in horrible accidents and have to call that a bunch of things. Or there could be multiple calls. 
People who drive way nicer cars than I will ever drive who are just like calling about their baby. Like, because there's like something that, you know, that they want to get some claim or something taken care of because it got damaged or something like that, in which case that would be a very different group and would, would, could potentially be treated differently. And you could see [01:04:00] like, oh, well, the value of this person's car is way different. So we could we could potentially treat them differently or offer them some different sort of, I don't know, something to help them. Yes. Press one if you're dying to if your condition is stable and then to just like hangs up on them because we can't afford to take another call. Anyway, those are kind of a couple. A couple of thoughts that came to mind is number of calls really like the target metric or do we want to look at something that's going to tie to revenue or the reason behind the calls or something like that? Now. Speaker3: [01:04:32] That was very funny and also very sad because I think some of that actually does happen, but that's brushing aside. But, you know, so, Greg, you've worked with a bunch of data scientists. You've you have you've had data scientists, data science initiatives under your belt. If you got a problem like this or a question like this from the business side, how would you sort of approach it or how would you want your data scientists to approach it? What would you look for? And also, how would a question or project like this end up becoming like a. Or product over at Amazon or even a report would love to get your love to get your insights on that. Harpreet: [01:05:11] So I can't I can't speak much for the company I work for. But when I hear cases like that, my, I guess my management senses are tingling in the sense of I want to ask way more questions than what's presented. And I think you guys, Santoni and Eric, you guys, you guys dove into this very well. And then, you know, Marc gave a great approach in terms of how to dissect the data. I think the key information for me is what are we trying to accomplish? Of course, as you guys said, what are we expecting as a as an outcome versus output? Right. So an [01:06:00] outcome really speaks to me, in my opinion, highly to what the business is expecting, because that's what you tie with, I guess, a business goal. Now, a key thing for me when I hear that is when I hear this use case, I would wonder exactly what Eric said. Why are we thinking what led to the notion of understanding why a certain group creates more calls? It becomes the goal, like, what are we trying to accomplish there? The way I would look at this is, you know, maybe there may be some some data dissecting that needs to happen in the sense of why in the first place, you know, a certain group of people are calling more than the other group and then kind of understand this demographic, understand, you know, if there's something different with regards to their lifestyle or the kind of car they're driving, etc., etc.. Harpreet: [01:07:02] But most importantly is isn't it expected though, like when somebody has a non drivable car that they should call more than someone with a drivable car that should be like it should be a common concept, right? This doesn't need any a B testing. Right. That of course if you have a non drivable car you should call more and then and then go from there. Right. 
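A toy illustration of the sub-population idea Eric raises above: before treating "multiple callers" as one group, look at what the repeat callers are actually calling about. The table, column names, and call reasons are invented for the example:

    import pandas as pd

    # Invented call log: one row per call, with the claim it belongs to,
    # whether the vehicle was drivable, and a coded reason for the call.
    calls = pd.DataFrame({
        "claim_id":    [1, 1, 1, 2, 3, 3, 4, 4, 5],
        "drivable":    [False, False, False, True, False, False, True, True, True],
        "call_reason": ["injury follow-up", "rental car", "claim status",
                        "claim status",
                        "high-value repair", "claim status",
                        "claim status", "rental car",
                        "claim status"],
    })

    # Keep only claims that generated more than one call.
    counts = calls["claim_id"].value_counts()
    repeat = calls[calls["claim_id"].isin(counts[counts > 1].index)]

    # Reasons behind the extra calls, split by drivable status: severe-accident
    # follow-ups and high-touch customers are very different segments and
    # probably deserve different fixes.
    print(pd.crosstab(repeat["call_reason"], repeat["drivable"], normalize="columns").round(2))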
So now when we ask all these questions and answer it through exploratory data analysis, then we go through a key. What can we kind of maybe do in terms of better control of our budget? This is where I think it's a little bit more useful there. So we have a budget. How can we leverage this data to predict where we're going to fall in that budget knowing that we don't have full control of when a car is going to break down? Right. Unless you tell me we have full control of when a car breaks down or not, then I [01:08:00] don't see why we should leverage who calls more than who calls less over that. So if we know that we don't have control over when the car breaks down, we can, however, use this data, understand the demographic to kind of predict when we're going to have a high season of calls, etc., etc., so we can better adjust our budgets. Harpreet: [01:08:24] So I think from a business perspective, what they don't like is unprepared outcomes or unexpected outcomes. But when they have tools and methodologies to better prevent when things are moving inside of a budget, I think that makes it a better or more powerful thing. So what I would partner with the data scientist and is to kind of like segment those callers understanding their lifestyle, understanding what causes these calls and then kind of like create some models that better predicts when we're going to have a high call season or whatever and then kind of better place the customer workforce to accommodate that. And this way I can kind of see where the budget is moving. I can better predict whether I'm going to hit budget or go over budget over time, etc., etc., and have a better conversation with the business in terms of what would be the expenses and things like that. And then longer term is to figure out what can we do as an insurance company to prevent these accidents from happening in the first place? Right. Or we're seeing younger crowds calling because they don't have a lot of experience. So what can I do as an insurance company to prevent that? Right. So this is the long term thinking to kind of minimize that impact and overall get that saving on the budget long term. Speaker3: [01:09:47] So we'll go to Mark. But see, this is why I want to be Greg when I grow up, because the point of the wouldn't we expect more calls from people whose cars are broken down and janky? It's like, yeah, [01:10:00] actually we would, we would. It's like practicality right there. But yeah, we'll go to Marc and we will. So I'll keep the chat open for another 1015 minutes. If anyone has additional questions, please do otherwise. Then we'll we'll shut it down while I wait for my better half to come back from the watch fair where he's been all day. Mark, please. Harpreet: [01:10:24] Do you have a question? There's room afterwards, too, but you're about to say something. Speaker4: [01:10:29] I'm sorry. I was going to say. Oh, he's not at the fountain pen flea market. Harpreet: [01:10:34] Sorry. Speaker3: [01:10:36] I can pay. Okay. You know, obviously, this is recorded and he never watched Save my calls so I can ties found pens, watches and James Bond movies. So that's if one of those forms there is there is a persona there. But I'm sure he'll come back with like two or three watches. But yeah, go ahead Mark and we'll, we'll take the question you have to afterwards. Harpreet: [01:11:00] So one, one thing I completely agree about asking the more clarifying questions. 
I would argue, though, that the way the question is set up, the unknown space around it is so wide that I felt like I wouldn't know how to ask the right questions. Does that make sense? That's why I really emphasize understanding the underlying distributions, because they didn't say, hey, we believe this factor is causing X, Y, Z and we want to explore it more. It was just: we have two populations, something is different, what's driving that? That's a very open question. So I feel like you can ask more targeted questions if you do that exploratory analysis and some quick hypothesis testing initially. Again, not saying that X, Y, Z causes it, but it gives you some signal about what's even worthwhile to start asking. When you go to business stakeholders, it's hard for them to imagine this level of data; it's very complex. We work with this data all day, so it's easier [01:12:00] for us to navigate, but for someone who's not used to the data, you can get some wrong leads and go down the wrong path. And I feel like adding structure by doing that exploratory analysis first - and don't spend hours and hours doing it, really time-box it - can help you create more targeted questions. I would actually push back on the assumption that, hey, non-operative cars, of course they're more likely to generate more calls. For me, whenever I rely on logic instead of the underlying data, that's when I get messed up. Harpreet: [01:12:36] This has happened to me over and over again in health care. And another thing to really consider, especially for insurance, is that it may be more obvious, but what about that population lets them better configure the policy plans to really optimize them? They're basically trying to minimize how much money they give away. Right. So it's an interesting problem space with way too much possibility, and I think looking at the data makes it more manageable to ask more targeted questions. Don't forget to ask the business folks what their motivations are. What are they trying to get, right? What brings victory to their pockets? Once you figure that out, you'll be better placed to come up with a solution that works for everyone. That's super important, because they can block any solution you come up with until it really speaks to what they're there for. And you can also get mired in the bureaucracy of things, right? A VP may be motivated by how long he or she thinks they will be in that position. For example, if you bring a solution that will only bear fruit in five years, that VP may say, I don't know about that, because they're thinking they might be leaving in the next two, right? They want that victory quick, and your solution is taking too long. So you have to think [01:14:00] about that too. Speaker5: [01:14:00] Yeah. Just to add on to that: data is everything, right? You can't begin to answer that question, or most of the questions that we work on, without looking at the data or studying the distributions and so on and so forth.
But the point of asking the questions upfront is that, taking this specific example, I'm already being told that this drivable versus non-drivable nature of cars has something to do with my profitability as an insurance company. That is a lot of assumptions, and someone, or some set of people, has gone through and made that set of assumptions and then built this question around it. My approach would be to start identifying those assumptions and then validating them before trying to answer the question at all, because I don't necessarily think it's the right question to ask. Speaker3: [01:14:59] I love these case-study type problems. I love them. Cool. Well, I hope that was helpful for you on that particular question, but also for the methodology and kind of approach you would use for future problems. I know it was super helpful for me; I always love these ones. Cool. So let's go to Mr. Chiropractic Data Engineer. It's funny, I saw that and thought, wait, I didn't let this person in - I mean, I would, but I just didn't remember seeing that name. But yeah, feel free to pop your question. Harpreet: [01:15:34] So my question hopefully doesn't turn into a long conversation; I think it's relatively straightforward. I'm doing reviews with my manager, and I got some really great feedback on how I can improve in my career. I'm a senior data scientist, and we're currently hiring for a staff data scientist, and the question I asked was: what are you hiring for in the staff data scientist [01:16:00] that I'm not filling? And that's not meant in a negative way; it's just that there's obviously a gap in my skill set that I'm really curious about. My manager provided really great feedback. Shout out to Vince's class - I feel like every call I bring this up - but that strategy class really got me to understand the business case. I'm able to identify business problems, pitch them, and get buy-in really easily now, and my manager commented on that. But now I have this problem where I struggle to break things down. I have the solution, I have the business problem, I know how to solve it, but breaking it down into individual projects is extremely difficult for me, and I don't know what process to use. My manager is saying that at the staff data scientist level, I should be able to not only spot the opportunity, which I'm doing, but then say: these are the series of projects you need to do to get to that solution, and this is how you delegate it. A great way to highlight this: I've delivered on projects where I said, hey, we're going to do this, and I do these long-term projects, and then I have a PR that's thousands of lines of code, and I have to break up the PR into different pieces so they can get code review. And for me that's like, oh wow, those could have been individual projects that would have gotten a lot more little wins along the way, instead of one long process to gain a big win at the end. So at the core of it, my question is: you see the opportunity, you see a project, it's a very large project - how do you go about breaking that down into smaller components and building momentum from there? I mean, this is a terrifying question. Speaker3: [01:17:32] Good group to ask. Go ahead, Eric.
Harpreet: [01:17:34] How big is a very large project? Like, the Olympics is a really big project. Yeah - let's say six months, and that's big for a startup. I'm at a startup, so six months for a startup is like five years. Speaker5: [01:17:49] Very real answer: hire a product manager. Speaker3: [01:17:52] Hey, this is perfect, because Greg has his hand up and he [01:18:00] has a PM background. Harpreet: [01:18:01] So one thing I can tell you here, Mark: I don't think it's you not knowing what to do. It's probably more about people and relationships than anything else. The way I look at it: you have an idea, you picture what the end outcome would be, you picture it in your mind, and you work backwards through one thing, which is the execution plan. To me, an execution plan explores everything you need in terms of dependencies that will allow you to get to that outcome, that end game. And once you have an execution plan, the only thing that's missing is alignment, because sometimes that execution plan will involve many projects that fall beyond your control, that other teams have to do. If they don't do them, there's a dependency blocking the overall six-month project. So once you have that plan, start socializing it with key people, and those people are the ones who will help you fine-tune it, because you're not going to catch everything; it's going to be impossible. The bigger the project, the more impossible it is for you to come up with every single line item to focus on. And then once you have that execution plan aligned, make sure during the alignment that you understand what's prioritized, because some of those sub-projects may not be delivered exactly as you would like. So have the groups responsible for their key sub-projects prioritize how they want to launch, and understand what the dependencies are, because if you miss a dependency, that will push your ETA further out, right? So really what you want there is, again: execution plan, alignment, and prioritization, and it's more about people and relationships than anything else. There's nothing else you're missing, [01:20:00] really. So maybe your opportunity is to beef up your networking - the people you call colleagues who can give you a push, who can really be champions of your work - and then get it moving once you align with the big dogs, the VPs, etc. And you should be good to go. Speaker3: [01:20:21] Very cool. Eric, do you want to hop in there? And I will welcome input from lots of people, actually - Eric, and then we'll go to Nivi. Harpreet: [01:20:32] I think Nivi had her hand up first. Speaker5: [01:20:36] So, a quick one for you, Mark. Whenever I'm on these projects that go anywhere between six months, nine months, to a year, it's very important to know what your deadline is, and whoever the top person is needs to let you know what that deadline is, and everybody below them should align to that deadline. That's number one. B, you want to have a RACI, and that stands for R-A-C-I: responsible, accountable, consulted, informed. You should look it up; it's a big concept in how you manage large or long-term projects, and it's important that you define those RACI roles in the project for you to be successful, and then have many smaller deadlines working towards that final deadline.
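A tiny sketch of the RACI breakdown just described, applied to a hypothetical six-month effort. Every sub-project, role, and date below is made up for illustration:

    import pandas as pd

    # Hypothetical RACI matrix: one row per sub-project, each with its own
    # owner set and an intermediate deadline rolling up to the final one.
    raci = pd.DataFrame(
        [
            ("Data pipeline",      "Data engineer", "Staff DS",    "Infra team", "VP Eng",  "2022-06-15"),
            ("Feature store",      "Staff DS",      "Eng manager", "Data eng",   "VP Eng",  "2022-07-15"),
            ("Model + evaluation", "DS team",       "Staff DS",    "Product",    "VP Data", "2022-08-31"),
            ("Monitoring rollout", "ML engineer",   "Eng manager", "SRE",        "VP Eng",  "2022-09-30"),
        ],
        columns=["sub_project", "responsible", "accountable", "consulted", "informed", "deadline"],
    )
    print(raci.to_string(index=False))

The point is less the tooling than the habit: every sub-project gets a named R, A, C, and I, plus its own deadline working toward the top-level one.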
So RACI is kind of a long topic, but for any long project you want people who are responsible for doing the work, people who are accountable for it, people who are consulted on it, and people who are kept informed on it. If you take almost any of your data science or analytics projects from that point of view, you're bound to be successful. That's [01:22:00] something I do for most of my long-term projects, especially my marketing mix work. Harpreet: [01:22:07] So I had a question. With the staff position and being able to break those things down, is the expectation that you are the one managing all of those smaller pieces, or doing all of those smaller pieces, or what's the balance there? Definitely - I wouldn't focus too much on the staff title; I was just using that as a differentiator to get more context from my manager. The expectation is that my manager is basically trying to groom me to be a technical leader within the company: seeing these strategic opportunities as a technical leader, like how we can shift our architecture or product choices from a technology standpoint, and then building out those capabilities over the long run. So it's creating those various projects and being able to either implement some of them myself, or give concrete things to people who are at a more junior level - this is what you need to do to move this long-term project forward. I'm not managing people, but I'm managing the process and making sure the end result happens. Harpreet: [01:23:14] Kind of similar to what you said. It's interesting - I'm really glad you asked about this, because certain aspects of it are something I deal with on a semi-frequent basis, maybe a little smaller, but things that are way bigger than anything I can get taken care of myself and also way beyond the analytics organization. And it's great, because I get to talk with lots of people all over the place, in teams where I'm not even quite sure what they all do, but somehow they all plug together to make work function. But I was wondering - is that what a data product manager is? Because I've been thinking about it; Eric Weber [01:24:00] talks about data product management. Is that the same type of thing, where you're able to say, this is the thing that needs to happen, and here are all the different technical pieces? Or is that something different? Speaker3: [01:24:14] If I could hop in there - well, Mark, from your question, there are potentially two components we can look at. There's the component of, no matter what your skill set, what happens when you're given a big, big project - regardless of whether you're senior or junior or staff - that is big relative to you. But then there's also the career progression aspect - oh, go ahead, real quick. Harpreet: [01:24:41] Yeah. The strategic component is that I'm not given these tasks; I'm seeing these opportunities, I'm seeing a gap within our business and opportunities in the market, and I make a case. So I'm developing the solution, I'm developing the problem statement.
And so I called out Vince because I didn't see this stuff before, and now I see it everywhere, and I see this opportunity. Most of my projects for the past year I've come up with myself; I don't really get projects handed to me anymore, because I just keep coming up with these big, big projects. And now it's like, how do I manage this? The scope and impact I'm having is going beyond me, which is great, but my manager is really trying to set me up to be effective at the next level. Speaker3: [01:25:21] Gotcha. And just to sort of - Harpreet: [01:25:23] Also, I don't have deadlines either, because I came up with it. So they're like, what do you think it is? Speaker3: [01:25:27] Gotcha. So just to re-summarize: the struggle, or the challenge, is breaking down the projects into pieces that you can deliver on. Harpreet: [01:25:38] Definitely. I can have the idea, I can see the whole thing, and I can build it end to end. But now it's getting to a point where I'm a bottleneck, and I need to be able to break this down, because if I can get a smaller win, I can get quicker buy-in and get more people involved, or I can give pieces to some of my colleagues - because it's now starting to become a strategic initiative, because leadership sees it now. [01:26:00] Yeah. Which is why you want alignment at the top, and alignment includes what you were saying: having a RACI that they all align on. And even before you go to individual teams for alignment, you want to socialize it with the VPs who make decisions, because you want to show how this big project ties into their overall mission and goals, right? If there's no alignment there, you're not going to have champions who will push for you. It takes one quick email from a VP to make the sub-groups, the responsible project owners, agree to align with you, whether they like you or not. If you have their VPs in your corner, you should be good to go. Then, Eric, you were talking about data product management. To me, there's no difference between a product manager and a data product manager, right? Whether it's data or something else. Harpreet: [01:27:01] When you think about it, a data product is the same thing as any product; it's just the way you build it that may be different. To me, a product creates value for its users or customers, right? By creating value, I mean you're tackling some pain point they have, and when they use the product, the solution eliminates the problem. A data product is the same thing. So as a data product manager, you're there to think about the strategy behind that data product, or that portfolio of products. It all starts with understanding who your users are. Based on your users, you're able not only to collect the pains they're having but also to anticipate the future needs they will have, and to understand what kind of [01:28:00] team you have that can help address those pain points. And then come the different methods and techniques when it comes to prioritizing, because you can't solve it all at the same time. To prioritize, you have to come up with certain frameworks. One framework might be: how much effort does it take for you to release a set of features for your data product? Or it could be: how aligned are you with the overall business objectives?
Right. Harpreet: [01:28:31] When you release feature a, y, a, x, y, z, whatever. And also, does that feature really solve a problem for your customers? So, you know, I can think of different data products like for example, when I released dashboards for Salesforce when I was prior to where I work now, I consider those data products because before, you know, Salesforce, they ran left and right to get information. Now you come to my dashboard, you know what prices are, what the future expects. Price is based on currency exchange rates for 49 different countries. You can kind of like create quick quotes so fast and things like that. I call it a data product because now my sales team, who are my customers now, could tell me, Hey, I can do XYZ with this product, but I still miss some things. So I kind of take their feedback to prioritize my features and kind of understand the trends of what they've been asking me to kind of like create a roadmap for the next 3 to 3 years to, to make to continuously make this product better. So you can there's no difference, right? Whether you hear data, product or something else. A product is a product and you need people you need more than one person to manage it and dream for it. Speaker3: [01:29:55] I think something I want to add there, because I've seen our [01:30:00] friends both in the company I work at, but also the companies that. The people I know at other companies where so they've recently been moved into technical leadership roles. And one thing I will say is that something that they had to navigate because I remember one of them saying the exact phrase, which is like, I'm becoming the bottleneck. And I think a change they had to make in their mental framework was that of an individual contributor versus a leader. And I think the big difference is as a leader, it's okay for you not to be hands on. And that is where the alignment comes in, because if you need to request head count, if you need to even get a dotted line like borrow someone from another team, that's where that alignment comes in and being able to sell that value. And we've done that like on our team. My manager has had that conversation with other managers right where we need to get projects through or like there's two ways that comes about. One way is we had to rebalance our portfolio of projects and we had to rebalance it away from operational service oriented stuff to more architecture builds. And it's really hard to do both. It's like trying to lose weight and build muscle at the same time. It's really hard to do both for most people. Yeah, it is, but it's the same thing with balancing that portfolio of projects. Speaker3: [01:31:24] So for us to ask, we need more architecture, building time or infrastructure building time. We do have to let go of service on call time and they had to have that conversation, but that conversation had to be around. You'll get more value out of the architecture building. But we've also had have other conversations, one thing at the manager level, but also at the staff engineer level where they've had to borrow people from other teams for expertize. And once again, that comes back down to can you make the strong enough argument? And so I see some people that I know who have made that transition or that's one of the struggles is they kind of feel [01:32:00] like they're they sort of have to do it all on their own when because they're being moved into technical leadership. 
Part of part of the reason they're moved into which they may or may not directly realize is because they were able to wield influence to get collaborators on stuff. And a lot of times they did that just because they're cool people to work with. But it's not always explicitly brought up that that was the reason why they were promoted. And so when they get promoted, they're still working in IT mentality and then they stop asking people for help, which is exactly the thing that got them promoted. So I think, yeah, like what Greg was saying definitely kind of resonates with at least what I've seen from people that I know recently who've made that move to technical leadership. Speaker3: [01:32:41] And part of that does also mean kind of like figuring out how to get unstuck on scoping and getting around paralysis analysis. And that's another thing that we've run into where we have projects that are really kind of very big for any individual on the team to take. And so we've had to kind of get together as a team or whoever we can kind of pull from other teams to literally even just mirabaud stuff out. Like, what do we think are the next steps? What do we think? Our tickets we bring in like technical leaders from other teams, go carry this up, give us all the bad stuff, like give us a feedback. And it's messy. It's a messy process because we're just sticking Post-it notes on a mirror board where we're like, Oh, we think we have to like migrate this thing and like we think we have to talk to the like sales engineers and we think we have to get like it access and did it. And you just see a bunch of randomly colored Post-its. But that's how we've also had to handle it sometimes because I feel like there's a lot of like methodologies out there for how to like refine tickets. But I also feel like it's theory a lot of times, and it still comes down to like putting Post-its on mirrored boards, but it's really oh, no, no, go ahead, go ahead. Harpreet: [01:33:50] I'll I say what's interesting is I feel like I have the actually same kind of problem, but on the opposite end of the coin, like it's actually not analysis paralysis. But I see an opportunity and the way I get by [01:34:00] because we move so quick in a startup is that I build an MVP within like a week or less, and I just show them like, this is what's possible if you give me time and they're like, Whoa, you can do that. I'm like, Yeah, I can. And then they're like, Okay, then do it. And I'm like, okay. And I realize actually this is a six month project and now I'm stuck on this. And so I get the buy in really quick and I move really fast, but then actually a plan it out and I'm like, Oh wow, I'm kind of in over my head. But I keep on delivering, being in over my head. Speaker3: [01:34:28] See what some of us call that is on the road to burnout. But because it sounds like those are some internal barriers and challenges and obstacles, the whole what got you here won't get you there kind of thing. But Kosta has his hand raised, so feel free to jump in there. I'm just chilling until my boo gets back from the watch fair. So. Speaker2: [01:34:51] So basically one of the things that that struck a very interesting chord with me, right? It's very easy to be very honest, very easy to get by in from the top. Right. Like that's not the. Toughest challenge if you show them some promising small statistics on, hey, this is going to make you bucketloads of money or save you bucket loads of money. Yeah, it's not hard to make that side of the business case. 
It's not hard to find impetus and motivation there. Who you've really got to convince, especially in a large organization, is all of the senior and architect-level people who are going to be in charge of delivering a lot of the subsystems. If you need significant infrastructure help, you're going to need buy-in from the infrastructure team to make sure it's actually rational and doable within a span of time. So I would almost say: if you've got a week and you've built something out to prove the idea in a week, maybe take a week and a day - take literally one day to figure out, okay, if I do pitch this, how quickly do I need an answer to "here's the plan on how we're going to execute"? Can you think ahead to [01:36:00] what the C-suite is going to tell you? They're going to want a plan right away. They're going to say, oh, that's great - can we do it? When can we do it? Let's get it done now. Speaker2: [01:36:07] That's where, if you already have a "here's a plan I prepared earlier," right, these are the unknowns - you could potentially go out and start talking to all the senior managers, start getting a bit of buy-in from the team, and start getting a reality check. Your infra team is going to tell you, hey mate, that's crazy, it'll take us two months just to get the infra in, and that's when you start getting at least the large pieces of the puzzle, the large epics, going: okay, there are going to be some dependencies in here, and then you build in some buffer time. So when you go and present, you say: hey guys, I did this initial experiment and it looks promising, I reckon we can do this. It's not going to be a next-week kind of thing to do this properly at scale; we might need to scale it out, so there are some risks there, and here's what I reckon the projected timeline is. Now, that is often a finger-in-the-air kind of measurement, because you're going to find things are totally different when you hit the ground. But just try to project that out. The more times you do that kind of pitch, if you do a bit of that preparation ahead of time, it's often really helpful, I guess. Harpreet: [01:37:16] That's one note, but I think this is key - I think that's the key piece I was missing that connected the dots. Everything everyone is saying is great, but I was having an issue right before that point, and I think you really got to it. Yeah, if I spend that extra day moving slow to go fast, I would prevent myself from getting into situations where I thought something would take a week but it's actually six months. Getting ahead of that is a really good idea. I should totally slow down a little bit more to get that kind of information in, and that way I can still move fast and identify the quick thing I can do in a week that won't solve the big thing but gets them hungry to actually start wanting to do this planning outside of me. That's [01:38:00] really helpful, thanks. Speaker2: [01:38:03] The only downside to that is, remember, estimation is a crazy game: you're going to estimate that it might take this long, and they're going to hear that as "oh, we can definitely get it done in this long." And estimations are always crap, right?
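One common way to build in the buffer Kosta is describing is a three-point (PERT-style) estimate. That's a generic technique, not something named on the call, and the tasks and numbers below are made up:

    # Three-point estimate per task: (optimistic, most likely, pessimistic), in weeks.
    tasks = {
        "infra provisioning":  (1, 2, 8),
        "data access / legal": (1, 3, 6),
        "pipeline build":      (2, 4, 10),
        "model + evaluation":  (2, 3, 6),
    }

    total = 0.0
    for name, (o, m, p) in tasks.items():
        expected = (o + 4 * m + p) / 6   # PERT weighted mean
        total += expected
        print(f"{name:20s} ~{expected:.1f} weeks")

    print(f"raw projected timeline: ~{total:.1f} weeks")
    # Pad what you say out loud, since estimates tend to be heard as deadlines.
    print(f"what to present:        ~{total * 1.3:.0f} weeks")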
Senior managers typically hear them as deadlines and then we hear them as optimistic. I have no ideas, right? So just managing expectations becomes a really good skill there and communicating that is kind of important. So yeah. Speaker5: [01:38:30] Yeah. Let me just add, the more you break it down, not just the overall project and the little pieces, but even the process of development, the better it's going to go. So think smaller than be MVP. Think POC first. Right? Get the POC up. Then you have to document you write, write a pitch. Even if you are not a product manager because it's a working document and others are going to collaborate on it and it's going to become its own thing. But and then write this spec and now you have material to use when you are socializing your idea and when you're saying, Oh, this is either going to be a six month project for me or it's going to be a one month project for four people, you know, that sort of thing. Yeah. Harpreet: [01:39:13] Oh. Speaker3: [01:39:14] No, no, no. Go ahead, Mark. Harpreet: [01:39:15] Go ahead. I was going to say, I appreciate you all so much. I strongly feel like there's going to be an inflection point for my career with this feedback. So I appreciate you all a lot. Speaker3: [01:39:23] Thank you, Mark, for asking the question. It's funny because like our team is going through something similar, the project that I was leading. So that is a question that was very good and the insights are awesome and I hope everyone who is still watching and stayed beyond the one hour point like got that because man, there is some fire in the last hour that really was okay. So if anyone has a last question, I thought there was a fun one. How do you convince a pharmaceutical company that they need data science? How do you how do you convert them to the Church of Data Science and [01:40:00] that they need it? Would anyone like to take it? Before we close out. Harpreet: [01:40:05] I can. I can go real quick. Speaker3: [01:40:07] Okay, cool. Harpreet: [01:40:09] So it depends on where in the pharmaceutical company it is. So, one, understand your population. Are they doing, like real world evidence stuff? Are they doing drug discovery? Really understand your population from there? Kind of. What are their main needs regarding that? And this is a really complex question. So I'm summing up as quickly as I can. Like one minute from there, what how are they using data today? Where are the roadblocks and what's preventing them? So for example, like they may be focused on biostatistics and statistics, not data science, right? They may be they may be a cultural thing where they think data sounds like maybe ML and they're already doing a lot of data work already. And so where are the roadblocks to that? And I think before I can even answer more, I think goes back to what I was saying. It's like you have to ask a lot of questions, understand what who they are, were their main outcomes and how would data be a strategic advantage for them, not just using one data scientist, but actually having like a whole data initiative because you just give them one data scientist. That's not going to change anything, especially for a pharmaceutical company. They have a lot going on. They're probably going to need a data team. So like how would a data team position themselves in the market to be even more competitive wherever the outcomes are? And that's like that's just scratching the surface and go way more. But I think we're out of time for me to go into more depth. I am. 
I'm curious what others would think about that or I'm only want to talk in and. Speaker3: [01:41:37] Actually this is a fun shotgun round of how would you convince a company that they need data science operations? Let's get a sound bite from everyone here before we close out Greg Vincent to Eric Costa. Russell, now's your time to shine. How would you convince a company that they need data science operations? Go, go, [01:42:00] go. Speaker2: [01:42:00] All right, I'll jump in. Ask them first how much data they collect and how much they're paying in database infrastructure and storage costs overall across the business. Right. And ask them if they're doing anything with that data. Speaker3: [01:42:13] Love it. Let's go to. Speaker2: [01:42:16] They're going to buy it. Speaker3: [01:42:16] Awesome. Let's go to Greg now. Greg, how would you convince a company that they need data science operations? Harpreet: [01:42:22] I would start by asking questions based on needs something like do you know how much revenue you will have in the next five years and which of the products, which of your product portfolio will contribute to that growth? If the answer is no and most likely is going to be no, then data science is something that can help with that. So it's kind of like surfacing something that they need and then say, okay, data science can do that because, you know, nowadays I can't think of a company who should who cannot work. I can't think of a company who would blatantly say that data is not useful for them in today's age. So that's how I would approach it. Speaker3: [01:43:03] Awesome. Let's go with Russell. What is your pitch? How do you convince a company? They need data science. Speaker4: [01:43:10] So I would I would ask them very loosely explain the data ecosystem of your company to your grandmother in three sentences. And if you can't do it, you need to employ someone who can. Speaker3: [01:43:24] Awesome. Let's go with San Antonio. How would you convince a company they need the data science? Speaker5: [01:43:30] Greg basically stole my answer, but yeah, I would ask them a bunch of questions about how they're doing depending on the product as well. Right? The questions might be relevant to what their users are doing, their conversion, etc., etc. It's a different kind of product. It's going to vary, but basically how much do you know about your company and how it's doing and how the users are wanting? Speaker3: [01:43:52] Awesome. Let's go with what Erik, did you did you speak at my. Harpreet: [01:43:56] Tongue in cheek? Answer was to be a used peer pressure and tell him that the competitors [01:44:00] are all doing it. But if I want to make it more serious, bring donuts to the pitch. Speaker3: [01:44:05] I like it. That's spicy. Mr. Vin, we will go ahead and end with you. I think. Take it home. Speaker4: [01:44:12] I'm going to run with Eric's because I think it's awesome. Yeah, you have to tell them. Look. Look at your competitors. They're going to own you. How do you think they're discovering drugs right now? What do you think they're doing? Trial and error like you're doing right now? They're not. They're using machine learning. I mean, seriously, Google it. Read a couple of academic papers. You're going to get crushed in the next two years if you're not using data science for at least what it is that you're building. How is you as a pharmaceutical company not know this? I mean, it's one of those questions that fear of missing out is actually pretty powerful. 
Speaker3: [01:44:43] That's right. Ask them: how do you like being dunked on, over and over and over again? Dunk. Okay. And that will end the data science multi-happy-hour; it's been many happy hours. So thank you, everyone, for joining and tuning in. I hope everyone enjoyed the insights; we talked about a lot. We talked about how you break down a project. We talked about how you would approach a case study if you were given one, and what a hiring manager or your business partner would look for. We talked about watches and drugs - no, we didn't talk about drugs, we just mentioned them, and watches, multiple times. Lots of good stuff. So, everyone, I will end with Harpreet's line: you have one life to live. Don't waste... Harpreet: [01:45:34] it. Why not do something big? Speaker3: [01:45:35] Why not do something big? Thank you for joining. I hope everyone has a good weekend. Take care.