happyhours-april2.mp3 Speaker1: [00:00:09] Oh, yeah, what's going on, everybody? Welcome, welcome to the @TheArtistsOfDataScience Happy Hour. Super excited to have all of you guys here. It is Friday, April 2nd. Oh, my God, Alison, April already. Jeez, man, I feel like it was the Christmas party just a couple of weeks ago. But damn it, it is Friday already, Friday, April 2nd. Welcome, everybody. Super excited to have you guys here. Thank you so much for joining me. Great, great week here at the podcast. Got an opportunity to interview a couple of cool people this week. I interviewed Lawrence Morenae, who hosts the Data... that is the Data Something podcast, I'll get it right. And then also Christina D. Jacomo, who is an industrial philosopher. That was a really fun episode. Also released this week: the Super Data Science podcast episode featuring yours truly. I hope you guys got an opportunity to check that out. I had a lot of fun recording with Jon for that episode, so I'll make sure to include a link to it in the show notes. Also announced that I will be emceeing the data science virtual conference that's happening next week, April 10th and 11th. That is actually next weekend, so looking forward to seeing you guys there. Had an episode released today with Schneid Equality. Super cool, man, one of my favorite people on LinkedIn. Absolutely love his content, love his posts, so it was really an honor to have him on the show. Yeah, a lot of awesome stuff happening next week. Believe it or not, Speaker1: [00:01:46] next week marks one year since I first released episodes of this podcast. April 8th of last year, I released twelve episodes all in one go. And, yeah, since then I've just been consistently interviewing a lot of people. It's been one hell of a ride, so thank you guys for sticking with me. I don't know if any of you guys have listened to those early episodes. If you have not yet listened to those early episodes, just don't listen to them, because they were not that good. My recent work is far, far better, so listen to that instead. But you know what that means: next Friday's happy hour session is going to be the one-year anniversary party, so we should all get ready for that. That'll be exciting. Also next Friday, releasing on the one-year anniversary: an episode with Robert Greene, author of The 48 Laws of Power, Mastery, The Laws of Human Nature, The 33 Strategies of War, The Art of Seduction. The world-famous Robert Greene. Also, guys, be sure to check out the episode of the How to Get an Analytics Job podcast, because one of our friends was on that episode. That is Tom. Tom, how are you doing, man? Shout out to everybody in the room. Christian, congratulations on getting the new job, man. That is super exciting. I was so happy to see that popping up on the LinkedIn news feed. Torx, Eric, Vivian, Albert. Susan Walsh is in the house. Congratulations on getting those final words down. Oh, man. Well, welcome, everybody. Super happy to have you guys here. Tom, how's it going, man? Speaker2: [00:03:12] Hey, permission to say an inappropriate thing? Speaker1: [00:03:16] You've always had permission to. Speaker2: [00:03:18] Susan, you look so damn sexy tonight. Speaker1: [00:03:23] Yes. That looks like somebody who should... Speaker3: [00:03:26] You're working me too hard, Mr. Burke. She finished the book.
It nearly broke me. Speaker1: [00:03:35] I've got a couple other friends in the house. Shout out to George, Furkan in the house. Speaker3: [00:03:41] Hey, everyone. Hello, everybody. Cool. Speaker1: [00:03:44] How's it going, man? So in Canada, we get today off. Today's a day off in Canada. I don't know if you guys in the States get it off; I think the Commonwealth nations do for sure. But it's a holiday here in Canada, Good Friday. Speaker3: [00:03:58] Um, I have a question: did you grow up in Canada, or... Speaker1: [00:04:01] No, I'm from Sacramento, California. Speaker4: [00:04:04] OK, because I was going to say, your accent is not... Speaker3: [00:04:06] You don't say "aboot" weird enough, you mean, like. Speaker1: [00:04:09] Yeah, yeah. No. OK, yeah. Sacramento, Sacramento, California. Joe is in the house. Joe, good to see you. Antonio, man, super excited to have you guys here. Somebody here is from... oh, shoot, it's A.A. Good to see you here. Greg is in the building. Man, nice. This is awesome. Yeah, I'm super excited for next week's data science virtual conference, because it is literally an @ArtistsOfData science reunion. That's what they should rename the event, because pretty much everybody who is there has been on the podcast or is going to be on the podcast. So that's pretty cool. I'm excited for that. We're super happy to have you guys here. Hey, so let's open it up for questions. If anybody has any questions whatsoever, go ahead and ask. And while that person is asking, if you want to put yourself in the queue, let me know and I'll go ahead and add you to the queue. Speaker3: [00:05:07] So I have a question that I am fairly certain I'm not going to understand the answer to, but if you can explain it like I'm five, that would be great. So I have been working on Docker for a couple of hours today and finally making some headway into actually getting it to do something, which is good. But I just don't totally understand the difference between, like, containers and virtual environments. From a quick Google, I think containers are bigger than virtual environments, but that's about all I know. So if you could help me understand that, that would be super. Speaker1: [00:05:44] So imagine you have, like, an apartment building, right? Each individual apartment in that apartment building might have its own little environment. People are going to hang different pieces of art up, configure their furniture in different ways and whatnot. But they all share some common services; for example, an individual apartment might not have its own water source or its own heat source or what have you. Those common resources are shared among all the apartments in that building. Likewise with containers: each container is similar to an individual apartment. Each one is its own little self-contained environment, in a sense, but it's sharing resources with whatever PC it's on. Comparing that to a virtual environment, I don't know if I have an apt analogy. I don't even know if my first analogy made sense. I'm going to turn this over to someone far smarter than me who goes by the name of Joe Reis. So, Joe. Speaker3: [00:06:42] But yeah. Yeah. Matt, you're here. Speaker5: [00:06:44] Sorry, yes, I'm on here. I think I just got my video on again this morning. You're like the wizard. Yeah.
So a container is a type of virtual machine, if you think of it. With an AWS virtual machine, you get your EC2 instance, and it has its own kernel, has a bunch of services, kind of like Harpreet was saying, and it's more or less isolated. Right? It shares some machine resources with the other instances on that machine. Now, that type of virtualization gives you extremely good isolation, but having a separate kernel for each virtual machine is very expensive. The kernel is the all-privileged part of the operating system that's allowed to do everything: it has access to all memory, it manages all the internal processes. So with containers, what you do is you say, I'm going to have one kernel, but then all the other important parts of the operating system are basically broken apart. In particular, a lot of important parts of the operating system live in the file system. For example, you have all kinds of file system references to operating system components, to executables, to code. And so a container completely isolates the file system. Each of your containers can have its own versions of executables, its own versions of Python, its own versions of C dependencies. Speaker5: [00:08:04] And you also virtualize networking on top of that, so that you can minimize a lot of the networking problems you get when you have multiple processes running inside the same main operating system. One way to talk about this: if you look up what the userland is in Linux or Unix, that's your file system plus all the processes that are basically outside the kernel, and that's what goes inside the container. And so you can have these kinds of virtual machines that are much lighter than a standard virtual machine: you get some of the benefits of virtualization, but with much greater efficiency on top of that. Now, one thing to be aware of from the ops perspective is that there is something called container escape. Standard virtualization does provide a much higher degree of isolation; there are various tricks for processes to get out of their containers and affect other things in the operating system. And so generally, if you're running containers together on a machine, there needs to be some level of trust that people aren't running malicious code there, because people can definitely do a lot of damage on your Kubernetes cluster if they can run malicious code inside of one container. Speaker6: [00:09:04] I think you were also asking about virtual environments, maybe the distinction between the two. That's a good point. So, yeah, a virtual environment is typically something you would spin up just for yourself. Right? It's Eric's virtual environment. If you want to start sharing your environment with other people, that's where Docker would probably be more appropriate. You can think of it that way, too. I mean, I assume you're talking about Python virtual environments, right? Yeah. So that gives you the ability to have a virtual environment where you just install whatever it is you're going to use. But if you want to share that with people, it becomes a bit of a problem. That's where, you know, containers and specifically Docker would be a much better fit. In my experience, people trying to share virtual environments ends up just being a nightmare for a multitude of reasons. So, yeah.
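For concreteness, a minimal sketch of the contrast Matt and Joe are drawing, assuming a Python project with a requirements.txt; the file names and base image are illustrative, not anything mentioned on the call:

```dockerfile
# A virtualenv only isolates Python packages on the machine you're on:
#   python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
# A container image also packages the OS userland Matt described
# (a pinned Python, C libraries, executables), so the same image runs
# on any machine with a container runtime, while sharing the host kernel.

# Base image: a slim OS filesystem with Python already in it.
FROM python:3.9-slim

WORKDIR /app

# Bake the dependencies into the image instead of the host OS.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# The process that runs inside the container's isolated userland.
CMD ["python", "main.py"]

# Build and run:
#   docker build -t myapp .
#   docker run myapp
```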
Speaker5: [00:09:57] And I'll point out one specific case where it's a problem. With a virtualenv, your Python typically still references operating-system-level, low-level C dependencies. And you'll notice sometimes that your Python install of some package just goes completely off the rails, because there's something in the OS you have to have; maybe there's a dependency that needs to be 64-bit and it's 32-bit, or vice versa. Containers also package up those dependencies, in addition to the core Python code that goes inside them. So they contain a lot more of those important operating system components, and you get much greater portability. To Joe's point, you can ship that container and it will more or less run on any Linux operating system. Speaker3: [00:10:37] So it seems like, if I understood it, a container is kind of giving you the best of both worlds. Instead of working on your local machine, where it's not portable, but without all of the, I guess what you might say is, extra stuff you'd take on with a full virtual machine, you just kind of grab the pieces you need in the container, and then you're able to run whatever you need. Exactly. Yep. That's it. Yeah. Speaker1: [00:11:10] So, kind of just by analogy here: if we needed to circumvent the "works on my machine, doesn't work on your machine" problem, that's when we would use a container. But if we wanted to circumvent the problem of "well, it used to work on my machine, now it doesn't work anymore, what did I do wrong?", then use a virtual environment. Is that kind of a good way to think about it? Speaker5: [00:11:32] Yeah, actually, I think that's good. Yeah. Speaker6: [00:11:35] Especially, you know, for example, I have a Mac M1 right now, and that ARM chip is going to have problems with a lot of software; it's only going to work with some of it. And Docker on the M1, by the way, also may not work for you; you have to get the preview version. But that's kind of beside the point. Speaker1: [00:11:56] So, Tom, I saw you were... Speaker2: [00:11:59] Just going to offer Eric a three-year-old explanation, if he wanted it. Speaker3: [00:12:05] Only if it involves throwing peas as well. Speaker2: [00:12:09] No, but it's like a membrane that only goes one way: stuff from the operating system, from outside the Python virtual environment, can be used inside that virtual environment, but what you put in the virtual environment can't go out to the greater system. And with Docker, it goes a little bigger, still using the OS and all those things, but it is just a bigger container with the same dynamics. A virtual machine is creating a brand new set of operating systems and hard drives and stuff, but it's still just contained at the hardware level, especially if it's a type 1 hypervisor, meaning it's connecting to the bare metal. I know that's a fancy-sounding term, but there are just two levels of virtualization at the virtual machine level, and the second level isn't really touching the hardware directly. So it just depends on how those virtual machine creators are made. But to me, that's the simple example: Docker is just in between virtual machines and Python virtual environments. Speaker1: [00:13:19] Thank you very much, Tom.
Was that satisfactory? Speaker3: [00:13:22] That was the three-year-old version, five-year-old version, and... Speaker1: [00:13:27] Yeah, every version right up. I appreciate everybody contributing to that. That was really helpful. Thank you. So, next question in the queue: I've got Greg, and then I've got George. George has a really, really great question, so I'm excited to get to George as well. But I'm always excited for Greg's questions, because I always learn something when Greg asks questions. So, Greg, Speaker3: [00:13:46] go for it. In this case, I'm building on top of Eric's question. My question to you guys is: what is the best way to serve a model? Are containers the best way nowadays, the most popular, the best and safest way? I mean, like, deploying models. Simple question. Speaker1: [00:14:05] I mean, we use an API, if that works. I hand it over to my machine learning engineer and he wraps it up into an API on a Kubernetes cluster, which also has some Docker with it as well, where it's in a container. And that's pretty much how it works. But then again, I'm not too much of an engineer, so I'll turn this one over to Brandon. It's been quite some time. Speaker3: [00:14:32] What you said is very similar to what we do as well. And I'm in a similar situation as you: I'm not really the engineer, but I work with an engineering team and they do just that there. Speaker1: [00:14:42] Joe or Matt or Tom? Speaker2: [00:14:44] I was just writing: it's very culture-dependent, in my experience. Jenkins is kind of popular, and it also depends on how cranky your system administrator is. I think the crankiest ones, the ones least willing to try new things, are sometimes the best ones, because their crap doesn't break. But if they'll let us use Docker, that's great, because it's so easy to deploy that way. I think it just varies from place to place. There's no wrong answer, either. Speaker6: [00:15:18] What kind of model are you talking about? Like, what makes the model? Is there a framework that makes the model, or...? Speaker3: [00:15:24] No, any machine learning model in general. Is there a go-to framework that is popular, safe, or easy to plug into, something like that? What is the ideal, if there actually is one? Speaker6: [00:15:41] So, to Tom's point, it's a very broad question; the answer is kind of "it depends." But, you know, if you make a model, you can deploy it on-prem or in the cloud, and the cloud-hosted services are a perfectly viable way to do it. I would totally recommend doing that versus writing, like, a classic API yourself. For example, if you're using TensorFlow, TensorFlow Extended has model serving, which is a very equivalent thing. If you're doing scikit-learn, it's more or less a pickle file with something in front of it. But if you have the advantage of using a hosted service, I would use that. I was actually talking to somebody from a company that does exactly this; they have the ability to let you just write a model and host it. So it's kind of cool to see a lot of these frameworks come up that start to address this problem sort of end to end. Speaker1: [00:16:37] So I guess the "it depends" part is really dependent on how you want to serve.
Like, how do you envision the person who is going to be using the results of this model actually using those results? Are they going to a website? Is it integrated into a much larger system? I guess that's one of the "it depends" that really matters. Right. So maybe that could help, if you have any ideas. Speaker3: [00:17:02] To refine my question: I was asking whether Docker containers are the go-to way, you know, a safer way to go about it. Are they the most popular? What makes them so powerful? That's with regards to deploying machine learning; that's what I'm trying to understand with regard to Docker containers. Speaker1: [00:17:24] I'll probably toss this one over to Matt, but I just want to make sure. Matt, correct me if my understanding is incorrect, but I think the great thing about containers is just reproducibility. Wherever this thing is being hosted, you don't need to worry about installing all the necessary packages and recreating the computing environment from scratch. Everything is there as you need it. Matt, is that a good interpretation? Speaker5: [00:17:50] That's a good interpretation. It's combining reproducibility with scalability. In some sense, think about what containers allow you to do. If you take, like, the Solutions Architect exam, they spend a lot of time talking about auto-scaling with virtual machines, which, again, is a way to get reproducibility: you can have the image. The problem is that's pretty coarse-grained scaling. Containers you can think of as allowing a smaller unit of compute. Say you're presenting an API for machine learning model evaluation: you can scale up gradually, just by adding one container at a time, and the cluster itself scales more closely to the load. And then often the other trick you can use is to have that Kubernetes cluster host multiple types of workloads, to sort of even out the loading on it. So you could even host things like machine learning training alongside your machine learning model evaluation in the same cluster, and then gradually just add more containers as those workloads vary. If you suddenly have a huge workload for your training, the cluster can scale up; as you get more hits to your site, your evaluation model can scale up as well. So reproducibility is one of the main drivers for that, and I think the thing that led Google to research containers in the first place. Speaker6: [00:18:58] There's sort of this argument, you've heard of it, of pets versus cattle. So, I mean, think of, like, a special server: you name it, you love it and feed it, and sometimes it dies. Cattle are basically, you know, not your pets. I hate to be crude, but they're expendable resources. You've got to think of containers that way. It's just an image of something that you want to run, and you can spin up as many as you want and kill them all off; it doesn't matter. Speaker3: [00:19:31] So excited. Thank you. Thank you guys, all. Speaker1: [00:19:34] I'm just curious about the container technology. I could look this up, but this concept of containerization, does it date back a couple of decades? Is this like a relatively recent concept? Speaker6: [00:19:44] Pretty old, actually. Speaker5: [00:19:46] Yeah, it goes fairly far back; people can identify antecedents from Sun Microsystems and others.
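A rough sketch of the pattern described above: a trained model wrapped in a small HTTP API that can then be containerized and scaled one container at a time. Flask and scikit-learn are assumptions here, not tools anyone on the call prescribed, and model.pkl is a hypothetical artifact:

```python
# serve.py - a minimal model-serving API (a sketch, not a production setup).
# Assumes a scikit-learn model was pickled to model.pkl at training time.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact, baked into the container image alongside this code.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2], ...]}
    rows = request.get_json()["features"]
    preds = model.predict(rows).tolist()
    return jsonify({"predictions": preds})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Containerize that with a Dockerfile like the earlier sketch, and a Kubernetes cluster can scale it exactly as Matt describes, by adding replicas as traffic grows.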
But I think Google contributed some key commits to the Linux kernel in the early 2000s that introduced containers. And then, as far as popularization, that really happened because of Docker, where people said, hey, this thing Google is doing is really cool, let's get it out there to the wider public. Speaker6: [00:20:08] Precisely. The reason is people ran hypervisors before that, and you just had to deal with it. Docker takes away a lot of that stuff. I mean, I remember the days when we had VMware, you know, and we're just in a better place now. Speaker1: [00:20:25] Yeah. Well, thank you very much. Greg, you have something there? Yeah, absolutely, please do. Speaker3: [00:20:30] The one thing that helps on our team is we have a lot of team members, right now around ten, that are developing the same app. And Docker lets us all have kind of the same environment to work with. So we have the Dockerfile in the Git repo, and then we can spin it up if we want to, whether it's on a VM or sometimes on your local machine if you've got to do some special testing. And then it also helps us track it, too: I can track the Dockerfile across different versions if we need to find out what changes were made last week, or if somebody installed a new package or something. Speaker1: [00:21:00] Thank you very much. Appreciate that, Brandon. Greg, was that helpful? Speaker3: [00:21:04] Good to go. Thank you. Appreciate it. Speaker6: [00:21:06] All roads lead to Docker. In the last couple of questions, whatever you've got going, we'll find a way to lead it back to Docker. Speaker3: [00:21:14] I blame that on Eric. Speaker1: [00:21:15] I guess that's a nice segue. George, let's get to your question. And if anybody else would like to put themselves in the queue for any question, just shoot me a message and I'll add you to the queue. George, my friend, good to finally see you here. It's been quite some time. How are you doing? Excited for your question. Speaker3: [00:21:32] Always happy to be part of these. Luckily, you offer so many of them to join. You're always putting great content in, so thank you, everybody, for your time. Yes, Joe, I'm moving away from the Docker question and answer here, more onto the business side. I was wondering: how do you best integrate data scientists in a company? Is there such a thing as the golden rule, the golden framework, that any company should adopt? Should you have data scientists as part of one central unit that provides it as a service? Should everybody be treated as a data scientist and be taught as such? Should you have data scientists within a particular business unit? There are different models, and I put a couple into the question that's out there. So I was wondering, from your experiences, what does that framework look like where you, as a data scientist, found that you were able to contribute the most, and the business benefited the most out of it? Speaker1: [00:22:29] I'm excited for the discussion that's going to come from this. Let's start with Antonio. Speaker3: [00:22:33] Sure. So right now I'm not a data scientist; I work more on the strategy side at my company. And if I could just ask people, when they answer this question, to talk about how big their company is, because I was literally about to ask the same thing today, since this is what I've been dealing with at work. So I work for a Fortune 50 company, a lot of teams.
And we didn't have a data organization, like no chief data officer, at one point, right when I started. So it was very much the spoke model, or, as you said, decentralized: each business unit had, like, a data scientist. The good part of it was, for a lot of things, I was very familiar with the business. Right. We interacted one-on-one, so those things were getting done. The negative side of that, for me personally, was I didn't have the best access to talent. As a data scientist, right, you can't know everything. I didn't have the best engineering. So sometimes, when we had to put out something like a model, we would just say, OK, let's send these results to SQL, you can read it from there. Kind of just put things together, right? Because being a data scientist doesn't mean you're a great engineer. So that was a negative. And then another problem that I saw, once I moved over more to a strategy role, is when you have the data scientists all in silos, you have a lot of people doing different things and you don't have that one source of truth. Speaker3: [00:24:02] Everybody's doing their own thing, people are using different tools, and that kind of becomes messy as the organization scales. Since probably last year or so, my company has hired a chief data officer and has put a lot of the data scientists, hundreds, probably close to a thousand, under the CDO. And we have engineers, architects, and all of that. What's been great with that is, when we have high-visibility projects, we're able to get all of the smart people in the room: all right, get me the best data engineer, get me the best architect, get me the best data scientist, let's all sit together and work on this project. So that has been great. I haven't had to struggle with, you know, scrapping for talent and begging people to work on things. So that has been a big plus for the business. And it also helps with data governance. You know, we have, like, an AI center where you have the same rules, like explainability: you want all AIs to be explainable, and you have everybody following those same guidelines. That has been great. The data taxonomy is starting to get standardized, so you kind of have those things across the silos; people are talking the same language. It's been great. The reason I was going to ask this question today is because, now that we sit away from the business, that is the negative part: we're a little bit distant from it. And what we've been trying to do is say, OK, invite us to your meetings. Speaker3: [00:25:29] Right. Which they try to do. It takes time, right? People have to gain your trust, because now you're not on their team anymore; you're sitting somewhere else. But people have their own meetings, though, and by the time they come to you, hey, we need this model, we want this model built, you're like, well, I don't think that's a good idea. They're like, well, we just spent three weeks talking about it. And you're like, well, if you'd invited us at the beginning, we wouldn't have spent three weeks, you know? So that's kind of the problem. So we have what I call a hub-and-spoke, kind of a hybrid model, where for the big projects you have the data scientists and engineers centralized, and then you also have smaller teams across the business. I think that works the best; at least, that's been my experience.
But again, I think it takes time, because once you're centralized, you have to gain the business's trust, and you have to meet people and be on the calls every day, you know. And I think it's a lot harder working from home. When you're in the office, you really see people, compared to when you're at home; people don't just give you that trust over an email. There have been a lot of meetings and stuff. So I know that was a lengthy answer, but I'm literally going through this stuff every day, so I wanted to share my experience, and I would love to hear what works for other people. Speaker1: [00:26:45] So, quick question: when you said a thousand data scientists, was that legit? Like, that many people? Speaker3: [00:26:51] So, yeah, I work at Verizon. I mean, of the Verizon employees, I know a lot of them are retail, but on the corporate side there's a lot of employees, and we have more than a thousand people combined who are data scientists and data engineers. So yeah, if you look at just that department, it's probably like a mid-sized company. It's a lot. But then that's why you see everybody has different ways of working, and you're just trying to figure out how to make everybody be on the same page. It's a challenge, but it's a good challenge. Speaker1: [00:27:31] Yeah, definitely. I'm going through very similar challenges at my current company, being the first data scientist in the organization and now having to create kind of a data strategy for the organization, and trying to think about the best way to build out a quote-unquote data group, or a data practice. We've done a little bit of research, and we're thinking the center of excellence model, at least, is what we're moving towards. But let's hear Brandon's take on this one. Thank you so much for that, Antonio. Speaker3: [00:28:04] Right now I'm working on an integrated team: we've got, let's see, about three or four data scientists, two or three data engineers, five or six software engineers, a business person, a business analyst. And we don't have a central data science organization in our company. Our company is publicly traded; I don't know how many employees, but I would say it's a medium-to-large company. I've gone back and forth with a few of my colleagues about this as well, and I'm not sure there's a really right or wrong answer here. I like integrated teams, just because I've recently been working on them, and I had struggles before. For me, the main thing is that people tend to do what their boss tells them to do, and if their boss has KPIs, then everybody's concerned about getting their KPIs up. The reason I bring that up is I've been in other organizations where it's more of a throw-it-over-the-wall model, and I got tied up with that whole thing before, because that was my duty then. I was working in an even funnier environment, where it was like a firmware and software team. It was an aftermarket device. My machine learning models were built in Matlab, and then they became part of the C code and got plugged in there. So there's no fancy AI service; it's just an aftermarket device that you put on a car. So there, we had to work with the engineering team in order to get our models into production. But they had different issues. Right.
They're working on, I don't know, issues with CPU usage and memory usage, a whole different world. They report to a different person, the VP of Engineering, and it was hard to force anybody to do anything at the moment. Speaker3: [00:29:34] When I came to them with follow-up questions, they'd say, you know what, you're going to have to write a ticket; we're going to have to prioritize this along with everything else. You can't just come to my office asking for, like, hours at a time. So that was difficult for me. I've also been in a consulting organization where we had shared resources. And one funny thing that kind of happened there is people know who the good people are, and there seems to be a way where, in between projects, people get put on these funny temp projects. Then, when that next SOW gets signed, boom, they've got you on the real project. So you see things like that. So I think in the environment of big organizations, you have issues like that. But, you know, I would love to hear people's arguments for having it centralized. Oh, and then one last thing I'll say: the way that we get communication going among the other data scientists is we just hold, like, data science office hours, just like this, right? Every Friday at 11:00, all the data scientists, whoever wants to join, can come, and we just exchange ideas there. We don't report to each other, we don't report to the same person, but we exchange ideas. Speaker1: [00:30:36] I really like that integrated type of idea, because I think it just makes for more well-rounded teammates. If you can really understand what people are struggling with, and what their concerns are with whatever it is that you build, I like that idea; it makes for a more holistic framework, and it just makes for a more well-rounded data scientist, I think. So, Torrisi, I see you have your hand up. Do you have a question or comment on this topic? Speaker3: [00:31:06] It was just a comment, because this is a very common problem in the industry when you have set up specialized departments with specialists, etc. To me, one of the key things, to avoid potential processes where people have been working for three weeks and, of course, they come and ask you and then it can't be done, there are some problems, the key thing is really to look at your workflows internally in the organization. I'm a true believer in the concept of an advisory board, where you set up representatives of different parts of the organization. Whether it's, like, a group of data analysts, you want one key contact person: one for the analysts, and you may want to have one for IT. You kind of have an advisory board, and when you have larger projects, you would then, of course, include all the key people in the process at a certain stage. I'm not a believer in big meetings. Speaker1: [00:32:07] I hate meetings, to be honest, myself. Speaker3: [00:32:11] Meetings have to be efficient; they shouldn't last more than 30 minutes. But for me, the key is to get the right person in at the right time in the workflow, and then the work process and how that is managed. That means you have to sit down in the organization and have a strategy, an overall strategy, and then you have to look at how the organization is operating. Once you figure that out, and how the processes are, then you should evaluate each individual project based on risk levels, the impact, the low fruit, high fruit, whatever you want to call it, and you have then generated the pipeline. Some projects need to have much larger involvement from more of the organization; smaller projects should have simpler processes to follow. However, when you look at it, they need to include the people that could potentially have an impact on the end result, whether that's a data analyst or a secretary or a cleaner, whatever it is. That's my philosophy on this. Speaker1: [00:33:11] I think that makes a lot of sense. Russell would like to add in here. After Russell, if anybody else wants to speak on this topic, let me know in the chat and I'll call on you; otherwise, after Russell, we'll move on to the next question, from Nisha. Russell, go for Speaker3: [00:33:28] it. Thank you. Evening, everybody. I'd say, very generally, and more from a business perspective than a data science perspective, that I think it is going to change depending on the business. If you've got a very niche business that does one thing well, regardless of the size, really, I think that lends itself well to having a centralized data science unit that can then serve all of the various arms. If you've got a varied business that has lots of different centers, either geographically based or based on actual activities, for example, the business I work in does lots of different things in lots of different fields, it just wouldn't work for us. So whilst I work within business analysis, data science, etc., and, funnily enough, I am centrally based, I do jump around a lot of places and liaise with a lot of people. I don't think one centralized arm to do everything would work for us, but perhaps a hybrid model: you've got a centralized control unit that then works as part of a network, with satellites embedded within the other divisions of the business. So, I suppose, a long way of saying it really depends on the business. Speaker1: [00:34:40] Yes, it sounds like that hub-and-spoke model that Antonio was talking about a little bit earlier. Tom, I'm curious: throughout your career, how have you seen the most successful data practices, data teams, structured? Speaker2: [00:34:56] I don't think it depends on structure, Harpreet. I think you gave the best one-word answer: integration. It's really about how much the data scientists care about each other, care to help each other serve the organization better, care to help each other learn what they need to learn to get each job done, care about being humble enough to go ask for help and not having a fear of asking for help. We don't know it all. Sometimes we knew something really well and haven't used it for five years, and we just forgot it even existed. So I think a very nurturing, kind, gentle, patient culture that's integrated and always wanting to help each other puts data science on the best footing in the overall organization. So what I'm saying is: I'm structure-apathetic, but integrated, caring culture to the max. Speaker1: [00:35:50] Thank you very much, Tom. I like that; that fits well with the whole integration theme. George, was that at all helpful?
If anybody else, by the way, wanted to add in here, either Joe or Matt or Greg, or anybody for that matter, go ahead. Speaker3: [00:36:09] But I actually have a follow-up question on this specifically. So I was doing an informational interview with someone who's at a much larger company. For context, I'm at a 60-person startup; the person I was talking to is a data science leader at a twenty-thousand-person company. And I was discussing my strategies, where I would talk to business stakeholders all the time, and that's how I stay integrated. But what he told me was that, at the scale he's at, that strategy would not work, because a lot of the key decision-makers are, like, these VPs where you get five minutes and it takes three weeks just to schedule it. So that strategy wouldn't really be purposeful for me, or would probably hinder me. And so that plays into this kind of thing about where the data science team sits and how it integrates with the rest of the company. Is there a certain point where you have to be more strategic with talking to people, or is it just that you have to put yourself out there? This conversation reminded me of that conversation I had last week. Speaker1: [00:37:10] So distill that question a little bit for me: is it a question about strategy and strategic communication? Maybe I missed the question there. Speaker3: [00:37:20] Yeah, the question was a little raw. So from what I'm hearing in this current conversation, it's: how should we set up the data science team, hub-and-spoke versus being in a centralized place? And from what I got, the key components are, like, communication and business needs, what should we work on. And as a smaller company, I'm able to just do that very easily: go and talk to people. At a larger company, that's maybe not the best strategy. And so, is it the team structure design that's really important, or is it to just be aggressive, be a go-getter, and talk to people? What's the best strategy for data scientists navigating these larger companies? Speaker1: [00:37:58] Oh, yeah, go for it. Speaker3: [00:38:00] One thing I always see companies forget to do is to self-assess where they are. I'm trying to remember where I found this; I thought it was so good, maybe I'll show you guys. It's about making an assessment of where you are in terms of some sort of maturity matrix, and then working from there to figure out where you want to be in terms of how you want to position your workers, because it will give you a little bit of an idea of where you are. The matrix goes from exploring, experimenting, formalizing, optimizing, transforming, and then across that you have your people, your processes, etc.: your strategy, data, technology, governance. And then you place yourself where you are. So if you're a company that's at the experimenting stage, then you have a feel for who are the people you have who can excel at, or perform, that experimenting level of AI, and then you can position yourself in terms of whether you want a centralized team or you want it dispersed. You have to have enough people per department who can perform at that experimenting level of AI, and it has to fit the business needs going forward.
So it is always easy to say, oh, we want to go after that, but companies have to do a self-assessment to understand where they are on that spectrum of AI, to know how to move forward from there. Speaker6: [00:39:41] I tend to agree with that. And also, it depends, you know, whether you're doing data stuff at all. If you aren't, it's probably better starting out as a skunkworks project, not actually getting the whole company involved just yet. But if you're Amazon and you've been doing data since day one, it's just part of the DNA of the company. So I don't think there's, like, a right answer. And there's also this dichotomy between hub-and-spoke and centralized. I mean, there are pros and cons to both, but what I've also seen be successful is actually a combination of the two, where you react one way or the other. It's a weird trade-off. Centralized obviously means it's easier to control. Hub-and-spoke means you run the risk of creating silos, right, and then practices and everything else sort of get out of control. So there's a third option: you kind of do both at the same time. But, you know, I put a link in the chat to Conway's law. It's sort of the immutable law of organizations: you're going to design systems that represent how your organization communicates. So if you're very much a command-and-control company, no amount of hub-and-spoke is going to save you. It will not work in that company, period. And you can say the same about the opposite: if you're a very decentralized company and you try to do command-and-control, that's not going to work either. So it depends on the organization. And most of you, I think, will find, too, that these cultural and organizational things have accidentally happened. I don't think a lot of people are intentional about creating culture or ways of communicating. It just organically happens, and I guess you get to deal with whatever that is. Speaker1: [00:41:19] Did any of those responses answer your question? Speaker3: [00:41:25] Definitely. I really appreciate that. I especially like the Conway's law link you're throwing in the chat; I'm going to share that with my manager. And that strategy component as well. That was really illuminating. Speaker1: [00:41:37] And, Mark, let me holler at you about getting me in touch with Liz Fosslien, because I'd love to have her on the podcast one of these days. Speaker3: [00:41:44] She's my favorite Slacker right now. Speaker1: [00:41:47] Awesome. Thank you. Her book sounds awesome; I want to read it, and I want to talk to you about it. Cool. Thank you. Nisha, you are up next. After Nisha, I've got Saurabh, and then after Saurabh, I've got Austin. If anybody else would like to ask a question, shoot me a message and let me know. Oh, yeah, and Vin dropped John Thompson's book on analytics teams. Yeah, great book, Building Analytics Teams. Definitely check that out; I think that might help, George. I've got an interview coming up with John at some point in the near future, so keep an eye out for that. But first, verifying: George, this all sprang from your question. Was that helpful? Speaker3: [00:42:26] Oh, yeah. Oh, yeah. Thank you. And thank you, Brandon, for mentioning Docker and not ruining the pattern so far.
Speaker1: [00:42:34] George, your course has been... so, I've been going through a lot of George's stuff online, just because of what I'm working on currently. George was nice enough to give me access to one of his courses, and it's been immensely helpful. Super excited. I think we're speaking on the twelfth, in just about ten days, so excited to chat with you and have that episode shared on the podcast. Let's go to Nisha. Speaker4: [00:42:59] So I had a question about the different scaling techniques that are usually available and need to be applied before performing a PCA. I am working on a project for my thesis, and I'm trying to find references as to which scaling to use. Is there a specific requirement that one scaling technique needs to be used? Different articles in my domain use different scaling techniques, but they don't give the reason as to why they are performing one scaling technique versus another. Most of them seem to use a standard scaler technique, which is just mean zero and unit variance, and there are other techniques that some other articles use. I'm just wondering if there is a reason why one would use one scaling technique versus another, especially before performing a PCA. So if anyone has any insights, I would much appreciate it. Thank you. Speaker1: [00:43:59] I think different scaling techniques really depend on how you're going to be using the data downstream. Some algorithms will have assumptions that data should be scaled in some particular way; I think that's one piece of it. For PCA, if I'm not mistaken, and I'll throw this one over to Brandon or Tom after, you should standardize before doing the principal component analysis, standardize in the sense of, you know, unit variance and mean zero. Brandon or Tom? Brandon, I don't know if you're frozen or not. Speaker3: [00:44:30] It's been too long since I've done this; I forget if you absolutely need to or not. But if I did, the first thing I would do is check out the distributions of all the input variables, because sometimes you just get funny distributions and you're like, what does a standard deviation even mean here, if I calculate it and then divide by it? And then sometimes you'll have, like, binary variables, or you'll have categorical variables, and then you have to think about how to transform those into something numerical. Speaker4: [00:44:59] In my case, everything is numeric. I do have to do scaling, because one variable is in terms of counts and the other one is in terms of dollars, so there's a huge difference in scale. I am pretty sure I need to do some kind of scaling. I'm just confused as to why some articles say you should do standard scaling and other articles say you should do min-max scaling. Is that a personal choice, or is it because of some criteria that I'm not seeing? And they're not mentioning it because it's so obvious to the people who wrote it. Speaker1: [00:45:36] So, again, look at the articles and see what is happening downstream of that scaling. Scaling is just one step in the process. So in your particular use case, what is it that you're actually trying to accomplish with this PCA? Speaker4: [00:45:51] Once the scaling is done, my goal essentially is to perform a logistic regression analysis. Speaker1: [00:46:04] You don't scale anything for logistic regression; you just leave the variables as they are.
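A minimal sketch of the standardize-then-PCA step being discussed, assuming scikit-learn and a made-up two-column matrix; nothing here is Nisha's actual data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is in dollars, column 1 is a small count,
# so the raw variances differ by orders of magnitude.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50_000, 15_000, 200),  # dollars
                     rng.normal(3, 1, 200)])           # counts

# StandardScaler gives each column zero mean and unit variance,
# so no single column dominates the principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
print(pca.explained_variance_ratio_)  # variance captured per component
```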
Speaker4: [00:46:09] The variables are supposed to be independent, though. My advisor was telling me one of the requirements for regression analysis is that the variables are supposed to be independent, and one way to do that is principal component analysis. So you get that dependence out of the equation. Speaker1: [00:46:28] Well, there are easier ways to get rid of dependency. Maybe you have collinearity present in your feature space. So I would advise doing feature selection before doing any PCA. Right? Meaning, did you compute variance inflation factors? Did you see if there were any features that were linear combinations of other features? Did you handle that? And once all that's taken care of, and you've reduced your feature set through feature selection, then, again looking at the variance inflation factor, maybe look into PCA. But then you're going to lose all interpretability when you perform PCA and then try logistic regression on that. So it just depends on how you're trying to use it. So how many columns, how many features are you dealing with? Two hundred? OK, so of those two hundred features: did you start by eliminating columns that were low variance? Did you start by eliminating columns that were possibly linear combinations of each other? Did you start by looking at the variance inflation factors? I keep saying that because that is key to reducing collinearity in the feature space. Did you do any of those steps first? Speaker4: [00:47:36] Yes. These two hundred features were essentially narrowed down by the domain experts from a much larger set of features. And these features are created in a rolled-up manner: the data I'm looking at is transaction data, and I'm rolling it up, so each row represents one specific provider. With regard to variance inflation factors, the domain experts are essentially not very happy with dropping variables that way, and I am in an academic setting, not an industry setting. So what they are saying is that I should start with all 200 variables and then narrow it down, for credibility. When it comes to PCA, what they are saying is that, based on the principal components, you can still make an observation that these are the variables associated with the first principal component, the second principal component, and so forth. So that's where my dilemma lies, and I was just hoping to get some idea as to what you usually do. Speaker1: [00:48:56] So, I mean, that crazy type of dimensionality reduction, to me, is always like a last resort. Once Tom gets back, if he gets back in time, I'll turn it over to him for that. But I'm more inclined to do more highly interpretable things. So: you have two hundred features, you've done your feature selection techniques, you've examined variance inflation factors. My next logical step, being a simpleton, would be to start looking at possibly stepwise feature selection, maybe starting with one feature and incrementally adding one, or starting with all of them and incrementally removing some. Or maybe doing some type of feature selection where I look at, again, some type of correlation threshold, where I look at features that are correlated with the target, and then find groups of features, try pockets of that feature space, and compare those models all across and see what happens.
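A sketch of the variance inflation factor check Harpreet keeps mentioning, assuming statsmodels and pandas; the toy DataFrame and the drop threshold of 10 are illustrative assumptions, not anything from the discussion:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column, with an intercept term included
    the way an actual regression would have one."""
    X = sm.add_constant(df).to_numpy()
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
    return pd.Series(vifs, index=df.columns).sort_values(ascending=False)

# Toy example with deliberate collinearity: z is nearly a copy of x.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({"x": x,
                   "y": rng.normal(size=500),
                   "z": x + rng.normal(scale=0.01, size=500)})

vifs = vif_table(df)
print(vifs)                         # x and z will have huge VIFs
print(vifs[vifs > 10].index.tolist())  # candidates to drop before modeling
```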
I do all of that stuff before I do something as drastic as PCA. But then, if I were to do PCA, I would make sure I standardized: unit variance, zero mean. That's probably how I would proceed with this. Speaker4: [00:49:59] So is there a reason you do zero mean and unit variance before PCA? Why not just scale all the variables between zero and one with a min-max scaler, or use a robust scaler? Speaker1: [00:50:13] Yeah. So for PCA, think about what it is that PCA is actually doing. For PCA, you have to compute eigenvalues and eigenvectors and things like that, right? So if you're just scaling those values between zero and one, that's essentially the same as leaving those features untouched, you know what I mean? Because you're just moving them to a different scale; it doesn't really help with the computational burden. That's my understanding of it. So maybe this is just the way I was taught, but I would make it unit variance, zero mean, before doing any PCA. Other than that, I don't have a good explanation. Joe's unmuted, so maybe he has something to say. Speaker6: [00:50:53] You know, I'm going to throw Matt under the bus on this question; he used to teach this stuff. Speaker5: [00:51:00] I mean, honestly, I taught linear algebra and other things, but I'd have to brush up on PCA and factor analysis and some of these other techniques to give you a good answer. It's been a while since I've looked at any of those in detail. Maybe next week I can have something prepared. Speaker3: [00:51:14] Tom, when we need him. Speaker1: [00:51:15] Yeah, Tom has disappeared for a second. Speaker3: [00:51:20] But I think it's telling, right, that you have a room full of data science professionals who can't give you a straight, off-the-top-of-the-head answer, because that's a main difference between academia and working in industry. It's been a long time since I've tried something as crazy as PCA, which in academia isn't crazy at all; it's well-established and old and simple. But if I were to go to a business person and say, oh, the reason why your credit was denied is because these three components somehow map into this other dimensional space, nobody's going to buy that. Right. They just want to know: oh, you have more than three credit cards, or you opened a credit card in the last 30 days, and that's bad. Speaker6: [00:51:56] And like Harpreet said, do the feature engineering and feature selection first. I think the main thing you want to watch out for, more than anything, is just whether your features are correlated with your label, right; that will help your model faster than anything. So that's the first thing I would look for. And then, obviously, trying to figure out which features are redundant, and tossing those out. PCA is my last resort; I don't know that you need to go that far. I mean, if you reduce your feature set to something like, what, three features or something like that, you keep a lot of fidelity and interpretability as well. So most of the time, I don't think you need it for structured data with logistic regression, because, again, you want interpretability. But with PCA, you're going to actually compress the feature space down to the few components that have the most variance. Right. So it's really hard to do it. It's hard.
It's hard to get interpretability out of that, because you've done a compression at that point. Speaker1: [00:52:57] Yeah, and that's actually the key point for why you should do, you know, unit variance with mean zero before PCA: because the variances for each column, the scales of those variances, can be wildly different. Yeah. Speaker6: [00:53:13] That's right, yeah. I mean, this is the crux of structured data problems in general, which is why you don't see deep learning with structured data being as successful as, like, image recognition or NLP, precisely because you're dealing with tabular data that has no rhyme or reason to it at all. It's just numbers thrown into a spreadsheet or a database table, maybe combined, but there's no sense of coherence to this data. Right. And then we talk about, like, scaling, for example. Yeah, how you want to do it just depends on how the data is distributed. Speaker1: [00:53:53] Yes. Like, when I have features and I need to compute some type of distance between features, and those features are all on these wonky, weird scales, then I would opt for some type of scaling between zero and one, because that makes sense in that case. So I guess that's kind of my intuitive approach for using one particular type of scaling versus another: if I'm doing some type of distance computation, then, yeah, I should probably make sure everything's on the same scale; if I care about doing something that involves variances, then I'd probably do, you know, unit variance and zero mean. Speaker6: [00:54:28] It also depends on whether you have a normal distribution to the data as well. Right. If you don't, that mean-zero scaling isn't that good for you. Speaker3: [00:54:38] So, that's a good question. It depends; how much can you say about what you're working on? Having context is important sometimes. And the two hundred variables: are they predictive? Are they useful? What are you working on? Speaker4: [00:54:56] The 200 variables come from domain experts; they essentially say that these are predictive. I'm looking at a fraud problem: essentially, I'm trying to predict providers committing fraud in the healthcare space. And we know for sure that these variables are correlated, because these variables are things like the minimum amount being paid to a provider, the standard deviation, the mean, the median, the third quartile. So we know, obviously, that the mean, median, and quartiles for that provider, from each line of transaction that happens, are going to be related. But at the same time, if the mean of one provider is very similar to another's, but the third quartile is very different, then something's different about that provider. So I'm trying to target those providers that are different. So I do need those variables, even though they are related, and doing that pretty quickly blows up to 200 when we do it for a lot of variables. If you think about it, it's kind of a box-plot way of figuring out how the distribution looks. So that's how this problem is being approached currently, and the initial set of variables comes from the domain experts. That gives a background of what I'm working on and why I'm trying to reduce the features. Speaker1: [00:56:28] Yeah.
Speaker1: [00:56:28] Yeah. So, I mean, I think feature selection is probably a better bet than dimensionality reduction, just because it sounds like you're in a context where interpretation matters a lot. And just like what Brandon was saying — try explaining to regulators that whatever this first principal component indicates means we shouldn't give you a loan. Then: what the fuck is that? What does this have to do with anything? Speaker4: [00:56:52] So I understand the interpretability concern, but the way I'm addressing that is that from the principal component analysis, I can say which variables are related to each principal component, and that's what I'm counting on — so I'm not completely losing the interpretability part. The reason I'm not doing feature selection is that if I do feature selection, then of a feature that's been divided into mean, median, and quartiles, only one is going to get selected, and that small difference between the mean and the quartile for a particular provider gets lost. I did try that as well; it just didn't work as well as my logistic regression on the principal components, I guess. But I'm still working on it — it's my thesis, so I have a long way to go. Speaker1: [00:57:43] Yeah. So we were saying this entire time, wait till Tom comes back. So, Tom, we need your help with some PCA stuff — Anisha just stepped away for one quick second. So, Tom, Anisha has a question. Speaker4: [00:57:56] Hi, Tom. Forgive me. Speaker2: [00:57:58] Yeah, I had to go pick up a child, so I need more context — you have a question about the different scaling techniques that are available, for example, before doing a PCA, and the reasons as to why you'd use one? So I think — yeah, I'd use, and I cannot remember the exact name, but the one that uses quartiles. I tend to use that if you have some horrible outliers, maybe. But you want to try different ones just to see how you do, too. But was there also some question about whether you should eliminate some features? Speaker4: [00:58:35] The question was more regarding — in my case, I'm working with close to two hundred features and I'm trying to reduce them, and I've gone down the path of doing feature extraction rather than feature selection, so I'm doing a PCA. In my domain, the literature says, OK, there are different techniques to apply before doing PCA — we obviously want to see whether all my variables are roughly on the same scale. In my case they're not, so I'm scaling each feature before I do the PCA. But the question is which scaling technique I should use. The most common one seems to be the standard scaler, which would just be mean zero, unit variance. But there are other techniques being used as well, like the min-max technique, scaling each variable between zero and one, or the robust scaler technique — I think the quartiles technique. Speaker2: [00:59:35] Yes, interquartile, or quartile. Speaker3: [00:59:37] So that's one piece you want to decide. Speaker2: [00:59:42] Yeah. So I wouldn't stress too much over which scaling technique is used, but what I would do is look for collinearity first, and then among those features that are collinear, you want to keep the one that has the strongest correlation to your label or labels. Does that make sense? Speaker4: [01:00:02] Yes. So my problem is, I know there's collinearity, and I cannot remove it.
So, for example, one feature could essentially say: OK, I have a hospital provider, and the mean number of clients — the mean number of people they're providing service to — is about 100, and the third quartile is about, let's say, 300. And there's another provider who has the same mean, 100, but their third quartile is different. And that raises a question about that provider, just because their third quartile is much different from most of the providers in the same range. Speaker2: [01:00:42] So are you saying you cannot get rid of the other features that are collinear with the strongest one? Speaker4: [01:00:51] Because they add value to my case, right? Speaker1: [01:00:54] OK, but what do they add? Predictive value? They might add descriptive value, which is fine — then you can use whatever features suit the situation. But if a feature is not contributing to a useful prediction, then you could toss it out for predictive purposes. Speaker2: [01:01:07] Exactly. What Harpreet was saying is correct, and what I was about to get to is: for the predictive model, you don't need to maintain all the features. You can still correlate the one you keep to the other ones. But if you try to keep them all in the prediction, it's just going to cause you mathematical issues in a lot of your models. Now, I could be wrong — I've never tested this — but I'm pretty sure that if you use one of the tree methods — basically decision trees, random forests, boosted methods — you can get away with collinearity then. Speaker1: [01:01:52] You absolutely can, yeah. Speaker2: [01:01:53] So that way you can keep them. But if you still want to reduce some of those features: first get rid of the collinear ones that you can get rid of, then look for ones that have next to no correlation to your labels and throw those out. And then — well, I'm sorry, now we're kind of flip-flopping back and forth — just for the sake of reducing features: go ahead and keep the one strongest feature among the collinear features, for the sake of finding other features you can take out, and then run PCA on the rest of it. And then the bottom of the pile for PCA, the components with the weakest eigenvalues — you can throw those out below a certain level, and you can even use that as a triage list: OK, I'm going to keep these with high eigenvalues and set aside these with low ones, and I'll experiment with adding them back in until I start to see a difference in the metrics — the metrics that you care about. And when you see that dropping them is not changing your metric significantly, that's a good place to stop including features. Is this helping? Speaker4: [01:03:11] Yes, but my question was about before the PCA — the scaling techniques. Speaker2: [01:03:17] I think — and I mean this very nicely; I'm struggling to come up with a nice way to say it — don't stress too much about which scaling technique. Just start with min-max. If you've got some bad outliers, just start with min-max. Speaker6: [01:03:34] A decent rule of thumb is Gaussian scaling when you've got normally distributed data, right? Otherwise min-max, which is fine. And even then, I think you can probably just throw it in and see what happens. Speaker2: [01:03:51] And Joe's right. But if you can run each of those features through a Gaussian tester first, to make sure it's approximately Gaussian and it's OK to scale it that way, that's a good thing to do, too. By the way, on Integrated ML dot com there's a blog post by Teena Marie that talks about these kinds of things, and testing for approximate Gaussian distribution is in there.
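Pulling Tom's and Joe's suggestions together, a rough sketch of that triage, assuming scikit-learn, SciPy, and pandas — the thresholds are invented for illustration, and the normality test here is just one of several you could reach for:

```python
import pandas as pd
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def reduce_features(X: pd.DataFrame, y: pd.Series,
                    corr_thresh: float = 0.9, var_kept: float = 0.95):
    # 1. Collinearity triage: among highly correlated pairs, keep the
    #    feature with the stronger correlation to the label.
    corr = X.corr().abs()
    drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in drop or b in drop:
                continue
            if corr.loc[a, b] > corr_thresh:
                drop.add(a if abs(X[a].corr(y)) < abs(X[b].corr(y)) else b)
    X = X.drop(columns=sorted(drop))

    # 2. Scaling: standardize roughly Gaussian features (D'Agostino-
    #    Pearson test, needs ~20+ rows), min-max scale everything else.
    X_scaled = X.copy()
    for col in X.columns:
        gaussian = stats.normaltest(X[col]).pvalue > 0.05
        scaler = StandardScaler() if gaussian else MinMaxScaler()
        X_scaled[col] = scaler.fit_transform(X[[col]]).ravel()

    # 3. PCA, throwing out the low-eigenvalue tail: keep just enough
    #    components to explain `var_kept` of the variance.
    pca = PCA(n_components=var_kept)
    return pca.fit_transform(X_scaled), pca
```

From there, Tom's add-back experiment is just refitting the downstream model with more or fewer of the retained components and stopping once the metric you care about stops moving.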
Speaker1: [01:04:17] Awesome — that kicked off a great discussion. Thank you very much, Anisha. And to Brandon's point, you've got a group of data scientists here and all of us are coming up with different answers, because this is, you know — it's not easy. Speaker2: [01:04:31] But there's — Speaker6: [01:04:34] And Tom's probably forgotten more math than we all know. Speaker2: [01:04:38] But there's an art to it too, Anisha. You know what I bet is going to happen? You'll try min-max, you'll try standard, and you'll go: oh, they didn't make that big a difference. That's what I'm guessing you will find. But you might come back and go: you idiots, it was this one, and I can't believe how badly you all advised me — darn it, I'm never coming back. I don't know. We'll see what happens; we'll hope for the best. Speaker6: [01:05:02] It's worth the price of admission, though. Speaker4: [01:05:04] So ultimately, whatever I do, my advisor has to sign off on it — she's not going to let me get away with just anything. Speaker1: [01:05:12] Is your advisor a statistician? Speaker4: [01:05:13] One of my committee members is a statistician; my main advisor is not. Speaker1: [01:05:20] What are you getting your degree in? Speaker4: [01:05:22] Health economics. Speaker1: [01:05:23] Oh — well, then just talk to the statistician and use the statistician's methodology. That's me as a statistician being biased. Anyway, thank you very much — it's a great question. Somebody else had a question, but he bounced out, so let's go to Austin. Speaker3: [01:05:38] Yeah, OK. So my question was semi-related — it was sparked by an earlier question, when we were talking about the different models for how you position your data science teams. It was really about when you should start looking into getting a potential chief data officer, or someone who can help determine your overall strategy. Because right now at my current company we're kind of working on this — our data team is working with IT a bit more, but we don't have anyone who's holistically looking at the strategy. So I'm just curious what other people think — how should you approach that? Speaker1: [01:06:21] That's a good question. I want to take a stab at that: if my boss is listening, it's time to get a chief data officer, because I need a raise, so now's a good time. But if anybody wants to — Speaker2: [01:06:32] Could you rephrase the question one more time? Speaker3: [01:06:35] So: when should you bring in someone who looks at the strategy more holistically? In my mind — we were talking about chief data officers earlier, and I know Harpreet is kind of leading the charge in his organization. At my organization we have, not an advisory board, but just some people on the data team and some people in IT trying to work across our broader functions. But we don't have anyone looking at it from a more holistic level.
It's just: what are the initiatives we're doing in the next year or two and how do we get things ready for that, versus how do we want things to connect in three years, five years? — You would want that to happen as soon as possible, right? I don't know if you're talking about a mature company or a startup, but for a startup, I would say start as early as possible, because that will determine whether you'll be able to scale or not. The sooner you draw up that data strategy, the better you're able to scale. A lot of companies nowadays understand the value of data, and if you don't build the infrastructure to support your need to harvest and store and manipulate the data and extract insights from it, you may pay the price later. And that strategy is going to depend on whether you want to build the infrastructure in-house or leverage third-party service providers, and on the cost — the money you have to spend on that infrastructure and everything else. So you would want that to be part of the company's future. That's the ideal, in my opinion: start as early as possible. Speaker1: [01:08:24] So Austin says this is a semi-mature company that's been around for over 100 years in manufacturing, but it's catching up on data strategy. That sounds eerily similar to the situation I'm in at work, being at a manufacturing company that's about 100 years old and catching up on data strategy. But I don't know — I'd love to hear other people's perspective on this, maybe Joe or Brandon, if you guys want to chime in. I know at some companies — even at the old startup I was at, Both Commerce, the data team was under the finance group, right? And right now the data group is under the CIO where I'm currently at. So does it make sense for a company whose primary asset is not data — that's not in the data business — to have a chief data officer? Should that role go to somebody like the CFO or the CIO? I'm not sure. Speaker2: [01:09:13] I'll say it, George. I'll say it, George, don't worry: why isn't data their chief asset? Speaker1: [01:09:20] Well, I mean, that's a great point. If you're in a manufacturing company, the product that you're selling, that you're making money off of, is a tangible good. I could be getting canceled here very soon, but that's just how I'm thinking about it. Yeah — what do you think? I'd actually love to hear from George on this. Speaker6: [01:09:39] I'll read some excerpts from John Thompson's book about this — you've got to hear his take, and I'm not going to reinvent the wheel as a writer. I was ready to write an article on data teams, and then John launches all these whoppers and I have nothing more to say on the issue. Here's what he writes: to start their journey, many companies will hire a chief data officer who has experience and a taste for analytics, to address the most common data hygiene issues while delivering analytics outcomes. Over time, the role will change to more of a chief analytics officer. So he defines the chief data officer's key role as data management, data governance, data security — certainly for large organizations — while the chief analytics officer is about extracting and delivering business value and economic impact from data, using analytics and data science. So I think it's a good way of delineating the two roles, and maybe that would be a good litmus test to decide which one you want to hire.
John's book focuses on analytics teams, and I think it leans in favor of a chief analytics officer as opposed to a chief data officer. But in either case, the reality is that chief data officer is a role that's becoming more common, especially in larger companies. I think John would also probably agree, though, that the CEO is actually the chief data officer at the end of the day. The CEO needs to take charge of the data; if you don't have the support of the CEO, I don't think anything's really going to happen. Speaker1: [01:11:02] So, George Firican, I'd love to hear from you on this. Coming from your role as a director of data governance — to me, that almost sounds like what a chief data officer would do. Am I mistaken? What are your thoughts on this question? Speaker3: [01:11:18] So I agree with José that really the CEO should sort of own that data piece, in the sense that they should sponsor it. And then, depending on how data-savvy they are, they might want to offload some of that responsibility to a CDO instead. But even so, yes, they should always be behind it, be the one who helps secure that ongoing funding and support, and really promote the importance of data as an asset to the organization. Speaker1: [01:11:47] Awesome, thank you very much. That is a good question. Would anybody else like to add anything here? Speaker6: [01:11:52] Just to echo another thing John was saying in his book: your chief data officer or your chief analytics officer — anyone dealing with the data and AI — should report to the CEO. Not the CIO, not the CFO, God forbid — the CEO, end of story. Because if that goes wrong and the data reports up to the CFO, then you start being accountable to the finance department, and that sucks. Speaker1: [01:12:19] Yeah, but wouldn't everyone be accountable to the finance department anyway? Speaker3: [01:12:23] While we're talking about data assets — do you guys think that older, mature companies that have their data on premise are behind? Is the general understanding that they're behind compared to companies on the cloud, who adopted quickly, when it comes to data science and things like that? I don't really hear about that, now that I think about it. Speaker1: [01:12:50] You cut out for a second. Speaker6: [01:12:51] Is that a whistle blowing in the background? So I don't — Speaker3: [01:12:56] I would say: companies that are older, mature companies, that have their data on premise — does that mean they're behind on data science and things like that? Because typically, when I hear about a company that's very hot, they're very pro-cloud. But I don't hear the stories of those that are fully on premise and still understand the value of data. Where are they on the spectrum of maturity? What are your thoughts? Speaker1: [01:13:26] A little anecdote here: my company put a model into production — it's in a customer-facing system right now, delivering lots of value — and that was done with data that was on premise, databases that were on premise. Everything that we have in terms of the deployment is happening on Azure, but the data is still there in on-premise databases, alongside the cloud. Speaker6: [01:13:50] You'd just be shifting your operational footprint, right?
So you just have to get from your data center to the cloud. But it really comes down to a set of practices. I mean, look at high-speed trading firms and shops like that, which I think are really operationally sophisticated but probably stay on-prem for a lot of reasons, right? I always call these dark matter companies — companies that fly under the radar, you never hear about them, and they're making a shit-ton of money, usually running Windows, just all the stuff you would think is highly unpopular, and they're just killing it. So yeah, it's a good question. We see a lot of clients, and I think there are quite a few doing wonderful stuff, even though they're certainly not going to get attention for it — these companies are probably less likely to talk about what they're doing, because what's interesting about a company running in my crusty data center next to a startup in the cloud, right? Speaker3: [01:14:47] Yeah. And I know there are products like Snowmobile and Snowcone — the AWS services for companies that are remote, or companies that want to transfer petabytes of data to the cloud, where they send a truck to your facility to transfer that data and bring it to a data center. I think that's pretty cool; I guess they're trying to make things easy for them. But does that change their flexibility in terms of adopting AI or data science and things like that, I wonder? Speaker6: [01:15:22] We have a lot of companies and clients around here, and some are migrating to the cloud. In fact, right where I live there's a Google data center down the road, and one of our clients is actually moving their on-prem workload into the cloud. And I think the challenge they're going to have, really, is that they can't operate in the cloud the way they do on-prem, trying to keep servers on all the time. You really have to know what you're doing with the cloud in order to make it effective — it requires a complete change in your practices, and otherwise, I think, it's disastrous. So you really just need to understand how the cloud works, how much stuff costs, and put the right guardrails in there. But what it gives a company, I think, is more flexibility and agility in terms of experimenting with stuff and not being so encumbered by their infrastructure. There are also reasons to stay on-prem, though. Like, one of our clients got bought by a private equity firm, so their metric is EBITDA, right? And what makes everything look better there is capital expenditure, where you're depreciating your on-prem assets over time. Whereas if you were to suddenly go to an operational expense model like the cloud — well — Speaker3: [01:16:39] Yeah, that's going to be a hard sell. Yeah, absolutely — they'd love depreciating assets. Definitely, I see what you mean. Speaker6: [01:16:45] You're just concerned with what's motivating a company like this. We told them: look, you're probably going to make more money going to the cloud, because you're going to be a lot more agile. But talking to the CFO, it was like: well, we already know what our model looks like here. This company has something like 40 percent market share in its respective category, so they're not going anywhere regardless — they can basically just fall asleep for the next five years.
You probably saw that. Speaker3: [01:17:13] That makes sense. Speaker1: [01:17:16] You had your hand up a couple of minutes ago — you, Tom. Speaker2: [01:17:19] It's just a historical perspective, kind of thinking: OK, what other STEM areas have been through this? And I love to think about the electrical age, which didn't really begin that long ago, and I think they went through some similar growing pains about perspectives and such. Let's remember, Tesla studied in the mechanical engineering department — isn't that hilarious? We're still feeling out specializations in this space, and I think organizations are still wrapping their heads around what data scientists do. But Joe, correct me if I'm wrong — I'm also a big fan of John's book, and my takeaway is that the chief analytics, data science, or data officer really needs to be reporting directly to the CEO. The spirit of it is: they shouldn't be under any other chief but that one, but he or she should be working very closely with the other ones, because to me, data science is there to serve the truth — the investigation of the organization should be data-driven, data-centric. I think we're going to keep feeling these growing pains. And you're right, Greg, this discussion is painful. It's just painful. Speaker1: [01:18:38] Austin, you've caused us all some pain. Speaker3: [01:18:42] I'm sorry. Speaker1: [01:18:44] So, Russell, you had some great comments in the chat as well. Feel free to bring those up if you want to give us your thoughts here. Speaker3: [01:18:52] OK, sure — I'll start with the most recent ones I put in there, because they're freshest in memory, and they build on the difference between on-premise and cloud. In my time I've worked on building a lot of data centers globally — some for independent colocation providers and some for big blue-chip clients. And they have exploded over the last, say, twenty years; they're going up everywhere, and there are environmental concerns with them. There's a big power drain with them — the cooling of the servers is basically their biggest cost at the moment. So there's a lot of work to do on those to make them more environmentally friendly and, hopefully, protect the cloud environment for us. I wrote a little piece a while back saying that the ultimate end-user adoption could be that we all start using dumb terminals, so our interface with everything is not a laptop or a PC but, you know, a terminal of some kind — with no local CPU capacity other than what's sufficient to connect it to the network, et cetera — and all of your processing capacity, as well as data storage, is handled remotely. The devices you buy become much less complicated and much less costly. Speaker3: [01:20:19] But it does require a lot more investment in data center infrastructure. And the biggest restriction on that is the space to build all these data centers somewhere they can run efficiently, plus those environmental concerns: the heat output from the CPUs, when multiplied thousands of times by the number of servers these places house, is a huge endeavor, and cooling them is a big draw on the electricity supply. So yeah, a lot to think about, but it's something that I think will happen in, say, 15 to 30 years, depending on how our technology progresses. One thing I'm really keen on seeing develop in the data center world is the processors themselves.
I know — I think Intel were talking about processors that create much less heat output. You know, fix the source of the problem rather than getting better at dealing with it. Liquid cooling is also becoming really big in that field. And the last thing is quantum computing — actual chips starting to use quantum computing. That's fascinating to me; I really can't wait to see that come into action. Speaker1: [01:21:27] I'd love to get somebody on the podcast to talk about quantum computing — anybody who's written a, you know, friendly book on that topic, please let me know, shoot me a message. I'd love to get somebody on the show to talk about that, because it's super fascinating stuff. Speaker3: [01:21:41] Just one quick thing back on that — hopefully they've already done it, and also haven't. A bit of a quantum joke, sorry. Speaker1: [01:21:50] That one's above my head. Speaker6: [01:21:52] It's either here or there, right? Yeah. It's interesting in Utah, right, because we're in the desert, and within about 20 minutes of here I can see the NSA data center, Facebook, Google, and a couple of others being built right now. I find it interesting because water is used for cooling — cooling is the interesting thing, crucial for data centers — and yet they build them here like it's one of those Arab states. It's totally crazy. I think it just has to do with the tax subsidies they got or something, because the legislature really doesn't give a shit about the longevity. So that's kind of interesting. Speaker1: [01:22:30] Pretty funny, right? All right, guys, we'll go ahead and wrap it up for today. Thank you so much — such great questions, and thanks for sticking around till the end. I really appreciate having you guys here. Remember, next week is the one-year anniversary party. I can't believe it has been one year since I launched this podcast. I remember reaching out to Brandon before I even had a podcast — I was like, please come on my show, and he was like, you don't even have a show. But he agreed to come on anyway, so thank you for that. Next week is the one-year anniversary party, and we've got the episode releasing with Robert Greene, which is one of the interviews I'm most proud of — mostly because Robert Greene said it was one of the most interesting interviews he's ever done, and Robert Greene has been everywhere. So that's pretty cool. A lot of awesome stuff happening throughout the remainder of the month: I'm speaking to Dana Mackenzie, coauthor of The Book of Why, so I'm super excited about that as well. Don't forget to register for the DataScienceGO virtual conference, where yours truly will be the master of ceremonies. Greg will be presenting there, so we'll see him, and Susan Walsh, and a lot of other people as well. Check out the interview I did with Jon Krohn on the SuperDataScience podcast. And yeah, man, looking forward to seeing you guys next week — hopefully everybody has a drink in hand. Tell all your friends and let's get one hundred people into this room; that'll be awesome. If you're listening at home and you've never come into one of these sessions, now is the time — come next week for the one-year party. Albert's joining in right now, right as we're about to shut it down. But yeah — next week, one-year anniversary party, guys. Looking forward to seeing you there. Take care, have a good rest of the evening, and have a good rest of the weekend.
Remember: you've got one life on this planet, so why not try to do something big? Cheers. Speaker3: [01:24:14] Oh — thank you, everyone!