Srivatsan Srinivasan: [00:00:00] I don't have a math background, and I may not be the best, but I'm able to survive. Just give your best on what you're trying to do and learn as much as possible. So don't worry if you don't have any skills; the skills can be built along the line. You don't have to be really scared of that. Harpreet Sahota: [00:00:30] What's up, everyone? Welcome to another episode of The Artists of Data Science. Be sure to follow the show on Instagram at the Artists of Data Science and on Twitter at Artists of Data. I'll be sharing awesome tips and wisdom on data science, as well as clips from the show. Join the free open mastermind Slack channel by going to bitly.com/artistsofdatascience. I'll keep you updated on biweekly open office hours that I'll be hosting for the community. I'm your host Harpreet Sahota, let's ride this beat out into another awesome episode. Harpreet Sahota: [00:01:17] Our guest today is a big fan of and contributor to all things data, cloud and artificial intelligence. He has nearly two decades of experience applying his intense passion for building data-driven products for top financial customers. During this time Harpreet Sahota: [00:01:30] he's amassed experience building complex analytical pipelines, machine learning models for extremely complex business processes, petabyte-scale data lakes, and high-frequency, high-volume streaming analytics pipelines. He's a strong leader who effectively motivates, mentors, and directs others, and has served as a trusted advisor to senior-level executives from business and technology, helping them with complex transformations in the data and analytics space. You might recognize him from LinkedIn, where he posts about AI and ML apps, discussing elements of the whole data pipeline for building and deploying models in production. In his postings, he covers less-discussed topics of data applications and tries to highlight what worked for him and what didn't. 
You might also recognize him from his YouTube channel, A.I. Engineering, where he shares his mind and knowledge working with real world data applications. As a testament to how helpful his content is, in the first seven months of his YouTube channel launch, he has garnered over 9000 subscribers and almost 120,000 views. He tries to cover the relatively less talked about topics of data applications and tries to tie them in to how real world applications are deployed. So please help me in welcoming our guest today. A well-known and highly respected data scientist and architect who's a business leader by title and hands-on practitioner by passion - Srivatsan Srinivasan. Harpreet Sahota: [00:02:49] Sri, man... Thank you so much for taking time out of your schedule to be here today. I really, really appreciate it. Srivatsan Srinivasan: [00:02:54] Thank you. Thank you for the nice introduction. Harpreet Sahota: [00:02:58] Awesome, man. So let's talk a little bit about how you first heard of data science and what drew you to the field, and maybe touch on some of the challenges you faced while breaking into the field. Srivatsan Srinivasan: [00:03:12] Sure. For me, breaking into the field was kind of a gradual transition. I've been in the data space from the beginning, though not in data science. I've been working with data from the start of my career. For the first year I was a Java application developer. I soon realized that Java application development was not my forte, so I started focusing more on the backend side of it, writing SQL and also working as a part-time DBA. Srivatsan Srinivasan: [00:03:41] So data has always been in my DNA. And slowly I transitioned from the regular data stuff to the ETL world, where Informatica and DataStage came into play, and then into the big data space. Right, so basically I was one of the initial adopters of big data - the Hadoop and NoSQL ecosystem. 
And all the time, Srivatsan Srinivasan: [00:04:02] whenever I was working with customers in the big data space, I was working on your typical advanced analytics and data mining projects. And what we realized is that we had to deal with large datasets and complex problems - the typical advanced analytics phase. So that's where my entry into data science - the real data science, with machine learning and all - came into play. I would say I was not great at it at the time. Srivatsan Srinivasan: [00:04:29] I was also learning and handling an ML course, so I was doing it in parallel. And we started on the project, trying to get some hypotheses and insights out of data. I would say we did not succeed because I was new and my team was new. But one good thing I had is that the client was very supportive of it. We failed at that time, but the failure was a good learning for us. So that project did not go ahead at that time. But eight months down the line, we retook the project and delivered it after a year and a half. So that's how my data science journey has been. For the last five years I have been with data science all along, doing machine learning and the engineering work. Harpreet Sahota: [00:05:10] And that's quite, quite a journey that you've had. Harpreet Sahota: [00:05:13] And you've been so generous with your knowledge and sharing your knowledge, you know, creating some really well crafted content for LinkedIn and YouTube. And I'm wondering what's the inspiration behind that? Srivatsan Srinivasan: [00:05:25] So, again, until two years ago I was not a LinkedIn user. I had a LinkedIn account for 16 years, but I never used it. I used to get a lot of e-mails from college grads asking how the real world industry works and how it is different from what they are learning. So all the time I've been replying individually to them wherever possible and when I get time. 
But then, when I was seeing the LinkedIn content, I thought it was more about the buzzwords, right, like VGG, convolutional neural networks, RNNs. And nobody was covering how industry ties up with academics. And that's why, rather than writing one on one, I started writing about ML engineering. That's how I started on LinkedIn, talking about model deployment, data collection, how to basically operationalize your insights with the business process. That's the key thing. Even if you develop the best model, unless you can operationalize it you're not going to get any outcome out of it. And that's how it started. And slowly, the same content transitioned to a YouTube channel. So I started creating videos. What we do there is end-to-end ML, which talks about ML engineering. Harpreet Sahota: [00:06:35] Yeah, definitely found your content very helpful and informative for me in my journey as well. So I'm glad that you're kind of filling that space for that type of content. So I'm curious, you know, taking into consideration the journey you've had into data science, where do you see the field headed in the next two to five years? Srivatsan Srinivasan: [00:06:53] So when we talk about where the field is headed, right, there are two aspects of it. The very first aspect is the research side of it, right. There's a lot going on in the research world on advanced algorithms and everything. The key thing is you have a lot of technology companies sitting over there like Amazon, Microsoft, and Google. They have a lot of data at their disposal, and they are trying to create pretty accurate systems for complex jobs. The complex job can be speech to text, or it can be OCR. That is typically not accessible to the industry, right. Industry does not have that much data to train a translation model or a speech to text model. 
So what I see is that the accuracy of these models will get better over time, and the insights will be democratized. So you'll see these as cloud services that are accessible to the industry. That is one aspect of it. The second may be the model explanation aspect of it. As we go into more complex models, we lose the ability to explain them. So there is a lot of research going on there; that is the research side of it. But on the industry side of it, there are a lot of initiatives getting started, but mostly in POC stages. The adoption is not completely federated across the enterprise. So what I see is that more and more enterprise lines of business will adopt these techniques, and that will fuel a new wave of adoption in industry. So that's what I see in two to five years. It's more adoption and models getting more accessible to end users. Complex models like speech to text exist today, but when you really use them in industry, you don't get very accurate models. What I mean is that they will become more accurate. Harpreet Sahota: [00:08:41] Very, very interesting. In this vision of the future, what's going to separate the great data scientists from the ones that are just merely good? Srivatsan Srinivasan: [00:08:51] So if you really see, right, the difference is going to be how you adopt your data science journey. When we say how you adopt your data science journey: today we are typically more focused on algorithms and technology, but the real focus should be on business outcomes. Srivatsan Srinivasan: [00:09:09] It does not matter whether you use TensorFlow or PyTorch to solve a problem. It's about how you are solving a problem and getting business outcomes. Right. That should be the clear focus of it. I think more and more data scientists today are technology focused. They need to use technology just to solve a problem. Right. So they should focus more on business outcomes. 
And that's what will really differentiate the good from the best data scientists. Harpreet Sahota: [00:09:40] Are you an aspiring data scientist struggling to break into the field? Then check out dsdj.co/artists to reserve your spot for a free informational webinar on how you can break into the field that's going to be filled with amazing tips that are specifically designed to help you land your first job. Check it out, dsdj.co/artists. Harpreet Sahota: [00:09:59] Definitely, man, a hundred percent agree with that. Harpreet Sahota: [00:10:08] You know, speaking of understanding the business outcomes and how the work you're doing is going to affect a business, what does it mean to be a good leader in data science? And how can an individual contributor embody the characteristics of a good leader without necessarily having the title? Srivatsan Srinivasan: [00:10:26] So when we say a good leader, right, one thing is that a good leader in data science specifically should be ready to embrace failure. The space that we are dealing with is highly experimental. We don't know until we get the data and test our hypothesis whether a particular problem can be solved by data science. So you should be ready to embrace failure in this highly experimental phase. And the second thing is, once you accept the failure, you must be ready to move on. But what I see today is that some leaders try to force a machine learning solution and deploy it, and it fails when it goes into production. So I would say a good leader basically understands what works and what does not work. Srivatsan Srinivasan: [00:11:08] The leadership title must not be just an existing title renamed to data science leader. You should have real hands-on experience of solving problems. Maybe you need not be a real coder, but you should be able to understand in which scenarios data science works and in which it does not. So that is what makes a good leader. 
Harpreet Sahota: [00:11:30] Speaking about productionizing a model - what are some challenges that a notebook data scientist can face when it comes time to productionize a model? And do you have any tips for how to overcome those hurdles? Srivatsan Srinivasan: [00:11:45] Productionizing models, I would say, is one of the most challenging activities, right. There are two aspects of it. One is you need to take your code and deploy it in production. Srivatsan Srinivasan: [00:11:55] Second is you have to integrate with your business process so that your business process can seamlessly act on it. So one thing that I have seen is that when we start writing notebook-based data science, right, we typically do not modularize it. We typically write our code, and then we go back and add features and correct it. And over time our notebook itself becomes not so readable and interpretable, right. Srivatsan Srinivasan: [00:12:21] So I would say start with modularizing your code; see where your common functions are that you can reuse. A typical example is the data collection aspect of it and the data preparation aspect of it. See if they can go into separate modules by themselves, so that what you're doing in your main notebook is just accessing those modules and using them. Srivatsan Srinivasan: [00:12:43] So when you take this notebook to production, you basically have pretty readable and reproducible code with all the dependent utility functions - I would just call these utility functions for data processing, your common feature engineering, and others. The second thing that happens is that when you take your code into production, typically there is a different team involved in deploying to production. It can be a software engineering team, depending on the need. 
Srivatsan Srinivasan: [00:13:07] And by modularizing your code and making it more readable, the reproducibility of deploying the code also increases. One of the things I have seen - when I started myself, I used to put everything in a single notebook. And then after the project was done, I had to kind of reframe it in a way that it could be deployed. Srivatsan Srinivasan: [00:13:29] So start doing it from the beginning of the project. Harpreet Sahota: [00:13:32] That's really good advice. And one thing that I think a lot of freshers and a lot of people breaking into the space don't get exposure to is what happens when a model is in production. So what are some things that we should be keeping track of Harpreet Sahota: [00:13:46] once we have deployed our model into production? Srivatsan Srinivasan: [00:13:50] I would say the fun really starts after the model goes into production. Right. Srivatsan Srinivasan: [00:13:55] Because there are multiple aspects to it. You need to make sure your technical SLAs (service level agreements) are met, right. In some business processes, you need to act as soon as the data comes in, right. In other business processes, it can wait. Typically, if you're in the credit card industry and somebody swipes a credit card and you have a model to find whether it's fraud or not, you just have a few seconds or a few milliseconds to act on it. Right. So the very first thing you'll be monitoring is whether your model consistently meets the required business SLA. Right? That is the first part. Second is whether there is drift in your model; your model can drift as soon as it goes into production. That drift is the second part. The third part is whether your business KPIs are being met and being monitored. So these are the different things that we do when it goes into production. Harpreet Sahota: [00:14:44] So let's talk about this concept of drift, right? So we've got two ideas of drift. 
Here we've got the "concept drift" idea and the "data drift" idea. If you don't mind, for our listeners, can you talk about these two? Starting with concept drift, and maybe some tips for quantifying, measuring, and tracking these? Srivatsan Srinivasan: [00:15:01] So if you see, what we are dealing with today is a pandemic outside - COVID. And a lot of the models have already drifted in some of the industries. The forecasting models that were used by retailers like Macy's or some specialized retailers have really gone for a toss because those sales have been coming down. So it's very important for anyone to monitor both concept drift and data drift. Concept drift is basically when your underlying business assumptions change, and data drift is basically when your data assumptions change. So you made some assumptions, but the data that is coming to you is changing. It can be because of an upstream mistake or it can be because of some change that is made to the business process. Right, so in any of these cases the model is not going to perform as expected. And that's why it's very important to continuously keep monitoring for both concept drift and data drift. There are a lot of statistical techniques that we typically use. A pretty common one is the population stability index, where we just rank-order the model's scores and then compare them with the distribution that was used to train the model. So you have the trained model, you have the dataset it was trained on, you take a distribution out of it and then see what the distribution looks like in the real world. So there are a lot of techniques like Population Stability Index, KS stat (Kolmogorov-Smirnov statistic), histogram comparison, even z-scores and t-scores and everything. Harpreet Sahota: [00:16:39] What's up, artists? Be sure to join the free open mastermind Slack community by going to bitly.com/artistsofdatascience. 
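[Editor's illustration, not part of the conversation] The population stability index check described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the guest's implementation: the function name `population_stability_index`, the choice of ten quantile bins, the `eps` clipping value, and the synthetic score arrays are all hypothetical choices made for illustration.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between training-time scores (expected) and production scores (actual).

    Hypothetical helper for illustration; bin count and eps are assumptions.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Rank-order the training scores into quantile bins; only the interior
    # edges are kept so production scores outside the training range still
    # fall into the first or last bin.
    inner_edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]

    # Share of scores landing in each bin, for training and for production.
    expected_pct = np.bincount(
        np.searchsorted(inner_edges, expected, side="right"), minlength=n_bins
    ) / expected.size
    actual_pct = np.bincount(
        np.searchsorted(inner_edges, actual, side="right"), minlength=n_bins
    ) / actual.size

    # Clip so that empty bins do not produce log(0) or division by zero.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Synthetic example: scores seen at training time vs. two production samples.
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
stable_scores = train_scores.copy()             # same distribution as training
shifted_scores = rng.normal(1.0, 1.0, 10_000)   # distribution has moved

psi_stable = population_stability_index(train_scores, stable_scores)
psi_shifted = population_stability_index(train_scores, shifted_scores)
```

A larger value means the production score distribution has moved further from the training distribution; what counts as "too large" is a judgment that depends on the business process being monitored.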
It's a great environment for us to talk all things data science, to learn together, to grow together. And I'll also keep you updated on the open biweekly office hours I'll be hosting for our community. Check out the show on Instagram @theartistsofdatascience. Follow us on Twitter @ArtistsOfData. Look forward to seeing you all there. Harpreet Sahota: [00:17:08] So at my current company, we use Azure a lot, and Azure has this built-in thing called Azure Data Drift, and for that they use this data drift coefficient and Wasserstein distance. And it's been very, very useful for tracking models in production. But that's very valuable advice. And I think what you've just touched on are things that most freshers are not aware of, but they need to be aware of. So it's giving them some great research topics to go spend some time on. So thank you for that. I was wondering if you had any advice or insight for people that are breaking into the field and see these job postings. They look like they want the abilities of an entire team rolled up into one person, and then people just become scared of applying. Do you have any tips or advice for them? Srivatsan Srinivasan: [00:17:54] Oh, yeah. I have seen some job postings that really require a unicorn, right. Srivatsan Srinivasan: [00:17:57] They want people with NLP, computer vision, and the regular machine learning. I would partly attribute that to the maturity of the industry as well. Srivatsan Srinivasan: [00:18:07] The problem is that a lot of industries are just experimenting with AI and ML and they don't know exactly what they want. So they want one person who can do everything. With maturity, it's going to get better and better. The same thing happened when big data started, right. If you look, all the open source technologies would be listed in the job description for a data role. 
But over time, things have got better, and the same is going to happen here as well. So what I would say to people who are applying is: just don't stop applying because of that. The reason is that even the industry is not going to find someone with all the skills. Only something like 0.1 or 0.2 percent of people will have all those skills. So start focusing on the core skills, and once you're ready with the core skills, start applying for the positions. At the end of the day, they will choose the best out of the lot. And it's better that you be the best, rather than thinking you should not apply because you don't know all the skills that they have posted in the job description. Harpreet Sahota: [00:19:12] Awesome advice. Thank you so much. I think a lot of up and coming data scientists also tend to focus primarily on the hard skills, and they think that it's those skills that are going to separate them from the rest of the crowd. What are some soft skills that candidates are missing that are really going to separate them from their competition? Srivatsan Srinivasan: [00:19:30] So hard skills are pretty key, right. But apart from that, when you are really operationalizing your insights, you are presenting them to someone who might not have the kind of hard skills you have. So basically, you need to present in a way that the end user or the business understands, right. You need to kind of tell a story out of your model. That is the key. I feel they should focus more on problem solving skills rather than mapping a technology to a project, right. It's not like, OK, fraud can only be done with anomaly detection. Fraud can even be done as a regular supervised learning problem, right. So, rather than attaching a technology, they should focus more on how to solve a problem and how to take that and present it to a business user, right. So the problem solving skills and the presentation skills are key. 
Harpreet Sahota: [00:20:23] And do you have any tips for a data scientist who might find themselves having to present to a non-technical audience or perhaps a room full of executives? Srivatsan Srinivasan: [00:20:33] Right. So the presentation skills I was talking about are pretty key because the output of the model goes to a non-technical user most of the time. Right. It's the business who are the consumers of the model, and not the technology team, in many cases. So make sure you convert your model outcomes to stories. And the stories need not be complex. They can even talk about what features are contributing to this prediction. And then show insight into the features by [communicating] the features in a simple way and telling them these are the personas you have. Srivatsan Srinivasan: [00:21:07] These are the segments you have that are contributing. Converting your model output scores to a story, along with the features, is pretty key. Harpreet Sahota: [00:21:16] Awesome. Thank you very much. So I've got one last question before the lightning round. What's the one thing you want people to learn from your story? Srivatsan Srinivasan: [00:21:23] I would say many are scared of mathematics. They think a math background is very important for data science. And I may be a little biased as well, because I don't have a math background, and I may not be the best, but I'm able to survive. Just give your best, right, on what you're trying to do and learn as much as possible. So don't worry if you don't have a skillset. The skillset can be built along the line. You don't have to be really scared of that. Harpreet Sahota: [00:21:52] I think that's excellent advice. Right. Like, just because you didn't go to school to study math doesn't mean that you can't learn math on your own outside of school. Excellent, excellent advice. So let's jump into a real quick lightning round here. 
Harpreet Sahota: [00:22:03] What's an academic topic outside of data science that you think every data scientist should spend some time researching? Srivatsan Srinivasan: [00:22:11] I would say the business aspect of it is pretty key. Right. So basically focus on an industry and try to understand how the industry's business processes work. So if it is finance, try to understand how credit card users are on-boarded, right, and how fraud is detected. There are a lot of research papers over there. I would say focus more on understanding the business behind any industry that you like. Harpreet Sahota: [00:22:40] That's excellent advice. Harpreet Sahota: [00:22:41] And I think one good way to do that is by reading case studies. Right. Harpreet Sahota: [00:22:44] So if you're interested in a particular industry, then read a case study that... Srivatsan Srinivasan: [00:22:48] There are pretty good research papers; there's the Google Scholar website, where you can go and search for research papers. And you get plenty of information. Or connect with the industry leaders. Right. Just send them a LinkedIn note or invite, and ask them for maybe ten or fifteen minutes of their time to quickly get it. Harpreet Sahota: [00:23:06] So what's your favorite question to ask during an interview? Srivatsan Srinivasan: [00:23:10] I typically focus on the resume. So I would say, if they did a project: why did you choose this approach? And if you had to redo it today, would you go with the same approach, or would you do it better? Harpreet Sahota: [00:23:24] It's actually the exact same question I ask as well during an interview. Harpreet Sahota: [00:23:27] So what's the strangest question that you've been asked in an interview? Srivatsan Srinivasan: [00:23:31] So it was at the beginning of my career. So that's quite long ago. I was interviewing for a computer hardware division. 
And he asked me, like, how would you sell this? And it's a very common question today. But frankly that was... I can answer any technical questions. Srivatsan Srinivasan: [00:23:48] So that was kind of not my forte, I would say. Harpreet Sahota: [00:23:51] What's the number one book you'd recommend our audience read, and your most impactful takeaway from it? Srivatsan Srinivasan: [00:23:57] So related to data science, I would say I like the Naked Statistics book. It's an amazing book. The reason is that [inaudible]. People take some artificial X and Y data, and then they try to teach it. This book teaches it as a story. Right. It tells a story where you can learn a lot of good information, including things you may be using in real life without realizing that you are using a statistical measure. So that's an amazing Srivatsan Srinivasan: [00:24:30] book, I would say. Harpreet Sahota: [00:24:31] Yeah, definitely a good recommendation. You know, this is an audio-only podcast, but for those who didn't see, as soon as he said that, literally that book was on my desk right here. So I pulled it out. Good recommendation. Harpreet Sahota: [00:24:42] Yeah, I enjoy that book a lot. Harpreet Sahota: [00:24:44] If you could somehow get a magic telephone that allowed you to contact 20 year old Srivatsan, what would you say to him? Srivatsan Srinivasan: [00:24:51] I would say don't delay building your brand. It took me a lot of time to really come into the social side of it Srivatsan Srinivasan: [00:24:57] and kind of tell people who Srivatsan is. Maybe I'm still in the starting place, but start building your brand sooner, so that people know what your skill sets are and what your capability is. Try interacting and being more active in the network. That will earn you more bonus points when going to an interview. Half of your interview is done if people know you. 
Harpreet Sahota: [00:25:20] What's the best advice you've ever received? Srivatsan Srinivasan: [00:25:23] So I'd say that the major thing is time management. Right. So obviously we don't have time. So basically, how you use that time is how you kind of position yourself. Harpreet Sahota: [00:25:38] How could people connect with you? Where could they find you? Srivatsan Srinivasan: [00:25:42] Yeah. So the best thing is LinkedIn. I am pretty much accessible through LinkedIn messages. Srivatsan Srinivasan: [00:25:50] And they can contact me there. If they are asking a technical question, I would prefer they go to YouTube and ask on the respective videos; if not, they can still contact me on LinkedIn. Harpreet Sahota: [00:26:03] Sri, thank you so much for your time. I appreciate you being here and taking time out of your schedule to chat with me today. Srivatsan Srinivasan: [00:26:08] Thank you very much.