carl-osipov-2020-06-21.mp3

Carl Osipov: [00:00:00] I think the success that I've had in my career has come during the times when I was willing and interested in collaborating with someone who had a very different background from mine. I mean this both from the technical standpoint and from the standpoint of personalities, the diversity standpoint. Whenever you collaborate with someone and you're willing to learn from them, you're going to come away as a person who really grows as an individual, not just in their career, but as a human.

Harpreet Sahota: [00:00:42] What's up, everyone? Welcome to another episode of The Artists of Data Science. Be sure to follow the show on Instagram at @TheArtistsOfDataScience and on Twitter at @ArtistsOfData. I'll be sharing awesome tips and wisdom on data science, as well as clips from the show. Join the free open Mastermind Slack channel by going to bitly.com/artistsofdatascience. I'll keep you updated on the biweekly open office hours I'll be hosting for the community. I'm your host, Harpreet Sahota. Let's ride this beat out into another awesome episode. And don't forget to subscribe, rate, and review the show.

Harpreet Sahota: [00:01:33] Our guest today has nearly two decades of information technology experience spanning roles such as program manager at Google, IT executive at IBM, and adviser to Fortune 500 companies. The early part of his career was focused on growing IBM's cloud business, and he was recognized for helping it reach over one million registered users, leading programs and projects across the United States and Europe, including in the areas of machine learning, computational natural language processing, and cloud computing. He's authored over 20 articles in professional, trade, and academic journals, holds six patents at the USPTO, and has received three corporate awards from IBM for his innovative work. When he's not busy writing code, drawing up design specs, publishing things, answering emails, or doing other business, you can find him brushing up on the latest computer science skills on Coursera, reading and writing about tech, or listening to audiobooks and podcasts. If he's especially bored, he's up in the sky flying around central Florida in a Cessna 172, going surfing, or spending time at the beach. Today, he's here to talk about his book, which is targeted at teams and individuals who are interested in building machine learning system implementations efficiently at scale. So please help me in welcoming our guest today, author of Serverless Machine Learning in Action, Carl Osipov. Carl, thank you so much for taking time out of your schedule to be here today. I really appreciate you coming on the show.

Carl Osipov: [00:02:58] Thank you for having me. And thank you for that very kind introduction. I really look forward to our conversation.

Harpreet Sahota: [00:03:03] Aw yeah man, same here. Let's start off by hearing about how you first got into kind of the data science world, the machine learning world. What drew you to this field? What were some of the challenges you faced breaking into the field?

Carl Osipov: [00:03:14] I got started with what we call data science back when I was doing undergraduate research at the University of Rochester. At the time, I focused very much on machine learning topics. And back then it was interesting because people didn't really know what machine learning and data science were going to be like. So it was a time for experimentation.
I remember writing my first neural network back in 2000. Back then I was working on a project to process digital images, and on those digital images there were pictures of fruit. One of the projects I worked on was to try to count the number of apples and oranges in the digital image. So I guess that was my first, let's say, machine learning slash data science project. It was a total failure. But I think what it taught me is that it is extremely important to have enough computational capacity to succeed. So I bounced back and forth in my career between more machine learning oriented topics and distributed systems design topics. And I think I really started focusing on data science about 10 years ago, in 2010. I used to lead a project for IBM that focused on academic research, collaboration, and analytics. And back then I was helping the company build a product to analyze academic research publications, process unstructured data and structured data, and help come up with profiles of academic researchers. The goal then was to help foster collaboration across different academic institutions and help the administrators of those academic institutions get a better grasp on the portfolios of research that they managed.

Harpreet Sahota: [00:04:46] That's pretty interesting, and some interesting early projects that you had. Thinking about where you started, when you first started implementing machine learning systems, to where the current world is now, how much more hyped has machine learning become since you first kind of broke into this?

Carl Osipov: [00:05:01] That's a fun question. I guess it comes down to how you define hype. And I think one way to look at hype is the relationship between publicity and results. So it's sort of a ratio, right, of publicity to results. And I think if you look at the hype that way, it's very interesting, because when I was doing undergraduate research back in 2000, there was quite a bit of hype back then. And that was simply because people were interested in neural networks. Books were getting published on neural networks. There was a lot of interest in processing data, like the example that I gave was image data with some of the early digital cameras back then, but the results were lacking. So I think if you look at hype that way, there was quite a bit of hype. And then in the period of the 2000s, maybe the early 2010s, we actually started getting real results. And I think if you look at the hype today, it's probably not as hyped up as you would imagine, simply because we're actually delivering concrete results with data science, with machine learning. So the denominator of that ratio is really going up.

Harpreet Sahota: [00:05:59] I like that definition: we're getting results, and so therefore it's not as hyped as people think it is, because we're actually delivering on what we're promising. So I like that. Where do you see the field of machine learning headed in the next two to five years?

Carl Osipov: [00:06:15] So on this one, I don't think I have a unique perspective. I'm going to say that the results that are coming out of the research community, the academic community, around semi-supervised learning or self-supervised learning are very interesting. And going forward, right, in the timeframe that you describe, three to five years, we'll see machine learning algorithms and data science algorithms become more data efficient; we'll be able to train machine learning models with fewer annotations, with less labeled data.
So overall, I think that's one key trend. But in addition to that, and that was more of a technical description, what's interesting is that the tools of data science and the tools of machine learning are becoming much more widely available, much more accessible. A specific example I can give you: back 20 years ago, when I was working with that neural network, I had to implement it in the C programming language, managing pointers in memory. Today, any undergraduate can pick up a framework like PyTorch and start working on fairly sophisticated machine learning models. And what really gets me excited about this situation over the next three to five years is that the availability of compute is changing. So with some of the frameworks that are available today, with serverless machine learning, it's possible for someone to pick up a machine learning framework and then launch their machine learning models on the cloud. And that's where they get access to effectively unlimited compute, unlimited storage, tremendous networking. And I think that's the direction that will produce the most spectacular results in the next few years.

Harpreet Sahota: [00:07:41] So as we kind of begin to move towards this vision of the future, we've got more computational speed and we need less data for our models to be trained on. What do you see being the biggest positive impact coming from this in the next two to five years?

Carl Osipov: [00:07:55] So the positive impact is going to be interesting, because if you take a look at our IT industry, and I'm going to draw a comparison to the Internet, the positive impact is not immediately clear. Usually what happens is that you have a gold rush of companies that are applying the new technologies. So think back to the early 2000s, the dotcom era. You had all these dotcom companies that promised positive results, and on the surface it sounded great. Right. And pets.com, getting your heavy pet food delivered to your home, sounded great. So I think the situation today and over the next few years is going to be similar. We'll see a lot of promised results, but very few of them will actually pan out. And I think one of the exciting parts of the information technology industry is that we don't really know in advance what the most positive results are going to be. We don't really know how it is going to transform our lives. But I think the takeaway from this is that the world is going to be changing, and I think it's going to be changing for the better. And I think that's a positive result.

Harpreet Sahota: [00:08:52] And on the flip side of this now, what do you think would be the scariest application of machine learning in the next two to five years?

Carl Osipov: [00:08:58] That's a great question. If we think about the scariest applications, I think the scariest applications are the ones that we do not anticipate. And here's what I mean by that. If you think about what the Internet has done to our day to day lives, to the way that we consume media, probably the scariest application of the Internet is fake news. It's scary not just to me, but to many politicians, to many public institutions around the world. And fake news was predicted. If you look at science fiction publications, let me give you an example: Vernor Vinge, a well-known science fiction author who actually coined the term "singularity" for artificial intelligence, has a great book called Rainbows End. That came out in the mid-2000s, and in there he talks about this problem of fake news from the Internet.
So I think some of the scariest applications are the ones that we're dismissing as science fiction today, but that are actually going to become reality. So to me, some of the scariest applications of machine learning are in the financial industry. Think about the system that we live in, capitalism, and the fact that today, if you take a look at the statistics, most of the trades in capital markets, think about trades in equities, stocks and bonds, the majority of these trades are now fully automated. In other words, computers are making decisions about how we allocate capital in our system. And to me, this is potentially very scary, because we have these systems and we do not fully understand what exactly they're doing. We suspect that they can use the stock market to potentially communicate with each other. But we're not really doing a good job in terms of trying to drill deep into that black box of these trading decisions and understand what implications these trading decisions have in terms of allocating capital to some of the top companies in the world and taking capital away from companies that may need it. For example, companies that may be giving jobs to folks in North Carolina, manufacturing furniture. So in a nutshell, I think these concerns that sound like science fiction today are probably going to be the scariest ones 10 years down the road.

Harpreet Sahota: [00:10:55] Very, very interesting. Very unique perspective. I really enjoyed hearing that. So as practitioners of data science and machine learning, as it becomes more ubiquitous, easier to use, easier to implement, what are some things that we should keep at the top of our mind as areas of concern so that we can kind of mitigate the risk of these scary applications?

Carl Osipov: [00:11:16] I think it comes down to education. I think the concepts behind machine learning are accessible at a fundamental level. And I think we should start educating people throughout the world on artificial intelligence, starting at high school age. So I think that level of understanding is important. And once we have a population that understands the concepts behind artificial intelligence, it's going to be up to them to decide how to best regulate it and how it is best applied to their world, to the economy in this case.

Harpreet Sahota: [00:11:45] In this vision of the future, as we start to move towards kind of a culture where concepts of artificial intelligence are a bit more ingrained from an earlier level in education, what do you think will separate the great data scientists from just the good ones?

Carl Osipov: [00:11:59] Oh, that's a great question. I think what will make the data scientists of tomorrow successful is going to be more about the understanding of human culture. So let me unpack that. I think what makes data scientists effective is not just understanding of the data or the tools for analyzing the data. I think it's about understanding the context in which that data is used. And let me give you a specific example. I talked about the stock market. If you simply build a machine that's designed to analyze stock trades and optimize for increasing profits, the machine is not going to be able to recognize the fact that maximization of profits is potentially going to lead to a loss of jobs, or to a reallocation of capital in a way that disrupts families and causes concern to people around the world.
And it takes this human perspective from data scientists, the understanding of the humanities, to really make those decisions about how to best build systems that analyze data and then make decisions based on that. So I think the data scientist of tomorrow is someone who knows the tools for data, but at the same time is also versed in concepts like ethics, is also versed in the humanities and history, and is able to talk about the impact of data science algorithms and data science systems.

Harpreet Sahota: [00:13:22] Hey, are you an aspiring data scientist struggling to break into the field? Well then check out DSDJ.co/artists to reserve your spot for a free informational webinar on how you can break into the field. It's going to be filled with amazing tips that are specifically designed to help you land your first job. Check it out: DSDJ.CO/artists.

Harpreet Sahota: [00:13:48] It's interesting. It's one thing to understand how to optimize your algorithm so that it produces an optimal result from the modeling perspective. And it's a whole other thing to understand how that fits into the context of the world around you, because the optimal decision that a machine learning model produces may not be the optimal decision when we're talking about people's lives. So I think that's a really interesting point you made. I'd like to get into your book. I got a chance to go through it; super interesting. And I guess we could start with a pretty high level question here. What is serverless machine learning, and how is it different from regular old fashioned machine learning?

Carl Osipov: [00:14:23] Sure. Let's dive deep into the book. So the book is not at all as philosophical as our discussion so far. It was a really fun conversation about where machine learning and data science are going, but the book is really an applied guide. Think of it almost as a roadmap for someone who is already familiar with machine learning but wants to become a more valuable contributor to their project, to their team, to their organization. And specifically, the book is about this idea of building machine learning systems in a way that allows a data scientist or machine learning practitioner to become as productive as they can be, meaning that they focus on writing data science and machine learning code and minimize the amount of time, the amount of effort, that they spend on operational concerns. So why is that important? I mentioned that the cloud is something that's impacting data science and machine learning a lot, and this book is really about helping the projects that need to scale up their machine learning models, specifically projects that need to scale up to data sets that do not fit in memory, or machine learning models that require many servers to train. Now, this may sound paradoxical: why is the book about serverless machine learning if it really is about training across many servers? Well, serverless is a moniker that the information technology industry has adopted to describe this model for programming systems, for writing software in a way that allows practitioners to just write software and really forget about the fact that there are servers running in the background. Now, of course, the servers are still there. They still have a limited amount of compute, a limited amount of memory. But for a practitioner, the serverless approach means that they don't need to worry about all those operational concerns. How do you provision those servers? What kind of operating systems are running on those servers?
Are those operating systems patched with the latest updates, etc.? It's really about making the practitioners, data science practitioners and machine learning practitioners, more productive. And if you have done some machine learning in the past, so if you have that experience, and if you're thinking about scaling up your machine learning model, you have to ask yourself: OK, how much of my machine learning system, the one that I'll put into production, is actually going to be machine learning code? And there are some studies, for example there's a well-known research paper by colleagues from Google, that show that machine learning systems that go into production actually end up being roughly five percent machine learning code. So ask yourself: look, I'm going to put this machine learning system that I've built into production; once it's in production, will I spend most of my time worrying about these operational concerns, worrying about those ninety five percent of my system that are not machine learning code? And the book is really about helping you address that problem. The book is really about helping the practitioner focus on what creates value, on that five percent of machine learning code, and minimize the amount of effort needed to build out the rest, the other 95 percent.

Harpreet Sahota: [00:17:10] So what is the difference between machine learning code and machine learning platform?

Carl Osipov: [00:17:15] When I think about the machine learning code, I think broadly: this is the code that includes everything related to data ingest, data preprocessing and cleanup. This includes any feature engineering on that data, and also the traditional concepts of the machine learning model, model training, and model inference. And then finally there's the question of, OK, how do you combine that code into a working system? So typically you think about a pipeline; we talk about the machine learning pipeline that takes data, trains a machine learning model, and then uses that model to generate predictions or make classifications, etc. Now, that's what I mean by machine learning code. When I talk about the machine learning platform, it's that ninety five percent that plays a supporting role. In other words, what is the platform that you're using to store the data? If you store the data, do you actually have to provision something like a data warehouse instance to store that data or not? And if you do have to provision that instance, what does it mean to actually maintain it in production? Does it have to have regular updates? Does somebody have to go and verify the health of that data store? These are what I describe as operational concerns. So in the book I describe the platform as capabilities like data warehousing, interactive queries, serverless object storage, a framework for serving a machine learning model, in other words the web serving infrastructure, and several other components that make up this supporting infrastructure. And really, what differentiates this book from some of the others in the marketplace that help you build machine learning systems is its focus on this serverless concept, the focus on minimizing the amount of effort that a machine learning practitioner or data scientist has to spend on maintaining that platform code.
So it's about helping avoid what I call the trap of MLOps, machine learning operations, where a data scientist or machine learning practitioner builds a model and then, once the model goes into production, spends most of their time actually caring for the servers or allocating storage to make sure that the data is actually provided to the machine learning model.

Harpreet Sahota: [00:19:14] And you bring this up in your book as well, that the contemporary practice of machine learning tends to suck a lot of productivity out of the practitioners. So what is it about that contemporary practice? You might have just covered it, but if you can just kind of clearly delineate that line for us: what is it about the contemporary practice of machine learning that tends to just suck productivity from the practitioner?

Carl Osipov: [00:19:35] Absolutely. Let me give you a simple example. It's very easy to get started with machine learning and data science code. Imagine that you are a data scientist. You get a small data set. Let's say you have a sample of a data set that's maybe just one or two gigabytes in size. You throw it into a Jupyter notebook and then maybe you apply your favorite framework, maybe scikit-learn. So you use the scikit-learn framework to process the data set. Maybe it's a data set about deliveries, trying to predict the estimated time of arrival for the delivery. So it's very easy to get started, in the sense that you can build a reasonably well performing model just off that data set in a Jupyter notebook in a matter of an hour or so, maybe training a decision forest. The next question is: what if the model that you built is actually needed by the business? Or maybe it's needed for a project for your organization. How do you go about using that model in production? And this is where the productivity drops for the data scientist, because you have somebody who is an expert in processing data, somebody who is an expert in analyzing different machine learning algorithms or different data science algorithms, and suddenly they're asked to build essentially a web server to serve that model. So you have somebody who is maybe going out reading a few tutorials or looking at online courses and books, trying to actually build out parts of that computing infrastructure, when in reality those parts don't really need to be built out. Much of the infrastructure that's needed to put a machine learning model in production is already available as serverless components from the major public cloud providers: Amazon, Google, Microsoft. So really what this is about is making sure that data scientists stay productive and keep working on what they do best, working with data, working with models, and helping them avoid spending time on integrating all these components, public cloud capabilities, data warehouses, serving infrastructure, together, when in reality there are frameworks that can help them do that.
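As a rough illustration of the kind of serverless serving Carl is describing, here is a minimal sketch of an HTTP prediction handler in the style of a Google Cloud Functions Python endpoint. The bucket name, model file, and feature names are hypothetical placeholders, and the model is assumed to be a pre-trained scikit-learn regressor; this is not a prescription from the book, just one possible shape for the "five percent" that the practitioner actually writes.

```python
# Minimal sketch of a serverless prediction endpoint (Cloud Functions style).
# Bucket, file, and feature names are hypothetical; the model is assumed to be
# a pre-trained scikit-learn regressor pickled into object storage.
import pickle

from google.cloud import storage

_model = None  # cached between invocations of the same warm function instance


def _load_model():
    """Download and unpickle the model from object storage on first use."""
    global _model
    if _model is None:
        blob = storage.Client().bucket("my-models-bucket").blob("eta_model.pkl")
        _model = pickle.loads(blob.download_as_bytes())
    return _model


def predict_eta(request):
    """HTTP entry point: accept JSON features, return a predicted ETA in minutes."""
    payload = request.get_json(silent=True) or {}
    features = [[payload["pickup_lat"], payload["pickup_lon"],
                 payload["dropoff_lat"], payload["dropoff_lon"]]]
    eta_minutes = float(_load_model().predict(features)[0])
    return {"eta_minutes": eta_minutes}
```

The point of the sketch is that provisioning, patching, and scaling of the web server never appear in the code; the practitioner only writes the prediction logic.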
Harpreet Sahota: [00:21:24] I absolutely agree with that. And I've been in that position myself, having to do something that I'm not really an expert in. It does take a lot of time to research, get up to speed, and then try to build it out and make it work. So I'm a huge proponent of everything you're talking about in your book here. So at what point then does it make sense for us to start using serverless machine learning? At what point can we say, "Boss, we need to get on the serverless wave, increase my budget"?

Carl Osipov: [00:21:47] It starts when the data set that you're working with no longer fits in memory. If you have a data set that can be easily processed, where you can build a machine learning model on a data set in memory, you don't need serverless machine learning. However, you may find yourself in a situation where you think you can actually scale up your machine learning efforts. For example, there are well-known papers such as "The Bitter Lesson" from [Rich] Sutton, where he talks about using more compute and more data to scale up machine learning. There's another well-known paper, "The Unreasonable Effectiveness of Data." So if you are a practitioner and you find yourself in a situation where you think you need to be able to scale up how well you do machine learning by using more compute, using more storage, essentially having all these computational capabilities networked together in a public cloud, then you start thinking about serverless machine learning. And really the question is: if your data set fits in memory today, will it fit in memory after you put it into production? For many companies, the answer is that we need to scale up machine learning sooner rather than later. You don't want to find yourself in a situation where you've built something that works on a small data set, but then once you put it in production, you've got to go back and tell your boss, hey, it used to work, but it doesn't work anymore because we didn't plan for enough capacity. This is the kind of problem that should be solved in advance.

Harpreet Sahota: [00:23:05] Speaking of data storage, you talk about two types of storage in your book: row oriented storage and column oriented storage. Can you talk to us about the difference between the two?

Carl Osipov: [00:23:17] Absolutely. And this is one of those distinctions that is not well known in the data science community. But as a data scientist, if you want to be a more impactful contributor to a team, a more productive member of your team, I think it's important to understand some of these engineering topics, like the one that you just described. So when I talk about row oriented storage, I'm thinking about the way that data is stored in traditional relational databases, something like Oracle, MySQL, PostgreSQL. Traditionally, those databases have been designed to process data in rows, or in blocks of rows. So, for example, imagine something like an e-commerce database. Let's say you have an e-commerce database of orders with maybe just three columns: a customer ID, a product ID, and a quantity of that product. With that kind of a database, with traditional relational databases, it's very easy to go in and make changes on a row-by-row basis. For example, it's possible to have a transaction that updates the quantity of the product purchased by a customer from a quantity of two to a quantity of five. So that's the ideal use case for a relational database and row oriented storage. Now, the problem with row oriented storage is that it's not well designed for data queries that are analytical, for business intelligence style queries. What do I mean by that? Let's say you're a data scientist and your boss comes to you and says: hey, you have this e-commerce database. Help me figure out all the items that were purchased in bulk.
When I say in bulk, I mean purchases where the quantity is more than one. I want you to help me figure out how many customers purchased products in bulk and, for each product, tell me the number of customers that made that purchase. So as soon as you talk about these kinds of business intelligence queries, business report type queries, where you have to process effectively the entire database, all these rows together, to come back with a result, this is something I would describe as a many to few kind of query, the row oriented storage approach doesn't work very well. Instead, you need to start considering what's known as column oriented storage, and that's available from technologies such as BigQuery from Google Cloud. It's also supported in AWS Athena and other modern serverless data warehouses. And the distinction between column oriented storage and row oriented storage is that instead of maintaining data on a row-by-row basis, as in row oriented databases, each column is actually stored as a separate file. So, for example, there's a file that stores all the data related to the quantities purchased, and there are separate files, essentially, that store all the customer IDs and all the product IDs. And when performing these analytical queries, when trying to answer the kinds of questions that I described, trying to understand the quantities of items purchased in bulk, it's possible to just load a single column, a single file's worth of data, into memory and process that as a single chunk of data, for example looking at all the rows where the quantity is greater than one. This opens up a lot of optimizations for modern serverless data warehouses. For example, it's possible to load the data for a single column into cache efficiently. Also, it makes it possible to compress a lot of that data, to compress those columns, and that also makes for more efficient processing, more efficient storage. And in general, whenever you try to answer these kinds of analytical queries about data, column oriented storage tends to provide better performance. That translates into lower latency when answering questions, and it also translates into lower storage bills, which is a factor when you're storing the data in the cloud.
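To make the "purchased in bulk" report concrete, here is a minimal sketch of that analytical query run against a columnar, serverless warehouse, using BigQuery's Python client. The project, dataset, table, and column names are hypothetical; the point is that the warehouse only needs to scan the few columns the query touches.

```python
# Sketch of the "items purchased in bulk" report against a columnar warehouse.
# Table and column names are hypothetical; BigQuery reads only the referenced columns.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT product_id,
           COUNT(DISTINCT customer_id) AS bulk_customers
    FROM `my-project.ecommerce.orders`
    WHERE quantity > 1
    GROUP BY product_id
    ORDER BY bulk_customers DESC
"""

# The many-to-few aggregation is pushed down to the serverless warehouse;
# only the small result set comes back to the notebook.
for row in client.query(sql).result():
    print(row.product_id, row.bulk_customers)
```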
Harpreet Sahota: [00:26:52] What's up, artists? Be sure to join the free, open Mastermind Slack community by going to bitly.com/artistsofdatascience. It's a great environment for us to talk all things data science, to learn together, to grow together. And I'll also keep you updated on the open biweekly office hours that I'll be hosting for our community. Check out the show on Instagram at @theartistsofdatascience. Follow us on Twitter at @ArtistsOfData. Look forward to seeing you all there.

Harpreet Sahota: [00:27:21] Thank you so much for that. So I'm wondering if we can get a hypothetical scenario where serverless machine learning would kind of be an ideal use case.

Carl Osipov: [00:27:31] Absolutely. So think about the scenario that I mentioned earlier. Think about an example where someone wants to do food delivery, and as part of food delivery they need to be able to predict estimated times of arrival. So let's talk about some of the features of this use case that make serverless applicable. If you're trying to scale this kind of prediction across the United States, or maybe across North America, this means you're going to be making predictions repeatedly for a large number of users. And also, you want to be able to improve that prediction over time.

So what you want to be able to do is use the data that you've collected in the past to improve your predictions in the future. Potentially we're talking about examples where you're processing terabytes of data that's coming in from a variety of devices, from mobile devices, potentially from desktop applications, and other devices. All that data needs to be stored so that later you can use it to help improve predictions. But at the same time, that data needs to be used for streaming predictions, meaning that as the data comes in, as you have the information about, let's say, a delivery happening between a pickup location and a drop off location, you need to be able to provide a prediction with very, very low latency. So these factors, massive amounts of data, streaming data, batch data that needs to be stored for historical reasons, and the need to improve the system over time, I would say this is a perfect type of use case for serverless machine learning.

Harpreet Sahota: [00:28:52] Thank you so much for that. So now to pick your brain on some aspects that I think our audience would love to get some insight on. I think of the field of data science and machine learning as quite a creative field, and one that requires a bit of ingenuity, especially when it comes to feature engineering, because I think that's really the most crucial part of building our model. So I was wondering what tips you can share with our audience so that we can be more thoughtful with our feature engineering.

Carl Osipov: [00:29:20] I agree, feature engineering can be one of the most positively impactful exercises when building out a machine learning model or data science model. So I think the key insight that I can share with you is that feature engineering should be treated more like science than engineering. And here's what I mean by that. I think even before starting with feature engineering, it's important to build out benchmark baseline models and evaluate those models in the absence of feature engineering. And then there's each feature engineering step, each step where you're taking raw data and trying to implement some algorithms that translate that raw data set into additional columns in your structured data, or maybe some additional input information in your unstructured data set. Before you even do that, make sure that you have a way of measuring these feature engineering steps independently. So let's take an example from the scenario that we just discussed, the scenario of doing ETA prediction for food delivery. In that case, let's say you're trying to do predictions based on the pickup and the drop off coordinates, and those coordinates are encoded as latitude and longitude locations. One example of feature engineering could be to try to replace these lat/long coordinates, which are extremely fine grained. Actually, if you look at the GPS system, it turns out that it has a resolution of about three feet. So these are extremely fine grained coordinates. But for something like food delivery ETA prediction, that's too fine grained. I mean, if the restaurant that you're ordering from is just down the block, that's not going to change the delivery ETA by a significant amount. So one thing you may want to consider as a feature engineering exercise is that instead of using those raw latitude and longitude values, maybe you want to build up something like a grid over a map, where you use that grid as a substitute, at a coarser level of precision, for the raw coordinates. Right.
So you can create, imagine, almost like a battleship style grid, where you represent the locations of different restaurants and the locations where the deliveries happen. That's a great feature engineering exercise. But the problem is, if you try to perform it, you don't really know, for example, what kind of grid resolution you may want to put on the map. So this becomes another kind of hyperparameter that you may want to tune. And this is where hyperparameter tuning and feature engineering really tie together, and this is where it really becomes more of a science. This is where you want to create hypotheses about the specific feature engineering experiment that I just described. This is where you may want to measure the performance of this feature engineering experiment on the models. And this is where you also may want to do some hyperparameter tuning, to try different grid resolutions and see how well they work. So I think that key takeaway, treating feature engineering as this experimental process, is probably one of the best things that you can do as a data scientist or machine learning practitioner.
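Here is a minimal sketch of the battleship-style grid feature Carl describes: bucketing raw latitude and longitude into grid cells, with the cell size exposed as the knob that ties the feature to hyperparameter tuning. The cell sizes and example coordinates are assumptions chosen only for illustration.

```python
# Sketch: replace fine-grained lat/long with a coarse "battleship-style" grid cell.
# cell_size_degrees is the tunable knob that links this feature to hyperparameter tuning.
import math


def grid_cell(lat: float, lon: float, cell_size_degrees: float = 0.01) -> str:
    """Bucket a coordinate into a grid cell id at the chosen resolution."""
    row = math.floor(lat / cell_size_degrees)
    col = math.floor(lon / cell_size_degrees)
    return f"{row}_{col}"


# Example: the same pickup point at two candidate resolutions.
print(grid_cell(28.5383, -81.3792, cell_size_degrees=0.01))  # finer grid
print(grid_cell(28.5383, -81.3792, cell_size_degrees=0.1))   # coarser grid
```

Each candidate cell size would then be evaluated against the baseline model built without the feature, which is the experimental framing Carl recommends.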
Harpreet Sahota: [00:31:55] And for my next question, I was going to ask you about hyperparameter tuning. I think a lot of data scientists who are first starting out arbitrarily pick values for hyperparameters just so they can get their grid search to run or whatnot. What are some tips that you can share with our audience so that we can be more thoughtful in our hyperparameter tuning?

Carl Osipov: [00:32:12] So grid search for hyperparameter tuning is my favorite pet peeve. I'm still seeing many publications today that use grid search, and I think any time somebody is putting out a blog post or putting out code that uses grid search for hyperparameter tuning, it needs to come with a huge disclaimer, something along the lines of: this is a temporary hack and needs to be fixed later. Today it's well known, right, that instead of using grid search, other techniques, such as random search over the parameters and Bayesian techniques that use the history of hyperparameter tuning to improve future results, give better results overall in practice. So the first comment I would make is to avoid using grid search in the first place. But if you do decide to get started with hyperparameter tuning, I think what you need to be focused on is building out the pipeline that allows you to do hyperparameter tuning rigorously and allows you to capture the results of hyperparameter tuning. Because, as I talk about in serverless machine learning, it's very important to establish this end to end pipeline, from the data coming in to the machine learning model in production on the other side of the pipeline. And as part of the pipeline, you want to capture the results of your hyperparameter tuning experiments and then use that data. Right? As data scientists, we should be using the data that we produce ourselves; we should be using that data to drive the hyperparameter tuning process. And what this means in practice is that once you've launched some of these hyperparameter tuning experiments, you want to make sure that you capture all the information that went into the experiment, everything starting from the details of the data, which records in the data were used as part of hyperparameter tuning, all the way down to the specific algorithms that you're using to build out the machine learning model, and everything including where the machine learning model is going to be deployed.

Because if you're deploying your machine learning model as an online interface, you're going to see a much different kind of behavior from the machine learning model than if you're deploying that machine learning model where it's used for batch data processing. So capturing as much information as you can in your machine learning pipeline in order to run hyperparameter tuning experiments is critical. And this is one of the topics that I discuss in depth several times in the book.
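As a small illustration of the alternative to grid search that Carl mentions, here is a sketch using scikit-learn's RandomizedSearchCV on a synthetic regression problem standing in for the ETA data; the search space and iteration count are assumptions, and Bayesian tools could be swapped in for the same role.

```python
# Sketch: random search over hyperparameters instead of an exhaustive grid.
# X and y are synthetic stand-ins for a prepared feature matrix and ETA targets.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=1_000, n_features=8, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(3, 20),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=25,                          # sample 25 configurations instead of a full grid
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

In the kind of pipeline Carl describes, each sampled configuration and its score would also be logged (for example with an experiment tracker) so the tuning history can drive later experiments.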
Harpreet Sahota: [00:34:17] And you mentioned end to end machine learning. I think a lot of people who are breaking into the field, or who are maybe more in a research type of role where they're kind of just working on stuff on their local machines, don't really deploy anything to production. They don't get too much of an opportunity to actually think about, or necessarily look at, what needs to be monitored and tracked once the model is deployed. So what are some things that you think data scientists who are kind of breaking into the field should go research and study up on, specifically with regards to monitoring and tracking a model, from both the business perspective and from a data science perspective?

Carl Osipov: [00:34:56] These are two very different perspectives. So let me break that question up a little bit and talk about things that I think data scientists need to be incorporating into their workflows today, and then I'll talk about the business side of things as well. I think data scientists who are just breaking into the field today need to spend some time with tools such as MLflow, for example. MLflow is an open source project. It helps data scientists capture their data science and machine learning experiments. It also provides some visualization and graphing capabilities. So tools like that should start becoming more popular in the workflow of the modern data scientist. Once the machine learning model is in production, you definitely need to be able to monitor some of the technical concerns about the model: the availability, the latency. All of this information is actually available in most cases from the public cloud provider that hosts the machine learning model for you. So if you're a data science practitioner, you need to understand what's available in clouds today to give you this kind of technical, operational information about the machine learning model that's serving. Now, when it comes to new data that comes in after you put the machine learning model in production, I think one of the most underused tools today in the data scientist's toolbox is the central limit theorem from statistics. When I talk to many junior data scientists who are graduating out of college or out of boot camps, I don't think the idea of the central limit theorem is really getting applied in practice. Fundamentally, think about it this way: even if the probability distribution of your data is not normal, is not normally distributed, it is true that the means of samples from your probability distribution, if those means exist, are going to be approximately normally distributed. And this is a very important indicator for new data that comes in after the machine learning model is put in production. So if you compare the means of these samples for the data that comes in, the data that arrives once the machine learning model is in production, and you see a huge deviation in those distributions compared to the historical data, the data that you actually used before putting the model in production, this is a huge alert to the data scientist that something is not going to work right. So I think this is one of those metrics that definitely needs to be used more often. The central limit theorem and applications of the central limit theorem are definitely among the underused tools that I'm seeing today. And then the other part is your question about the business: how do you track the machine learning model from the business point of view as soon as it's put in production? I think the business analysts need to think not just about the model, but about the reasons why the model exists in the first place. We talked about doing food delivery. I think the business analyst needs to be there thinking not just about metrics like ETA accuracy; the business analyst needs to be thinking about what the business case was for deploying that prediction in the first place. What is the product that's actually being used to make those kinds of predictions? And the business analyst should be the one raising the alarm if the company, or maybe the project, has decided that, hey, maybe ETA is not something that we want to predict in our product. Maybe instead of focusing on predicting food delivery time, we should instead focus on predicting customer satisfaction with the delivery, and the ETA is just a small part of it. So I think the business analyst is there to provide context for why the machine learning model exists.
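Here is a minimal sketch of the central-limit-theorem check described above: compare the mean of each incoming production batch against the mean of the training data and flag a drift alert when the deviation is improbably large. The simulated data, feature, and alert threshold are assumptions for illustration only.

```python
# Sketch: central-limit-theorem drift check on a single feature.
# Under the CLT, batch means are approximately normal, so a large z-score
# relative to the training data suggests the incoming data has drifted.
import numpy as np


def batch_mean_zscore(training_values: np.ndarray, batch_values: np.ndarray) -> float:
    """Z-score of the batch mean against the training distribution of sample means."""
    n = len(batch_values)
    expected_mean = training_values.mean()
    std_error = training_values.std(ddof=1) / np.sqrt(n)
    return (batch_values.mean() - expected_mean) / std_error


historical = np.random.default_rng(0).normal(loc=30.0, scale=8.0, size=50_000)  # e.g. past ETAs
incoming = np.random.default_rng(1).normal(loc=36.0, scale=8.0, size=500)       # new production batch

z = batch_mean_zscore(historical, incoming)
if abs(z) > 3:  # assumed alert threshold
    print(f"Possible data drift: batch mean z-score = {z:.1f}")
```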
Harpreet Sahota: [00:38:07] Thank you for that. Switching gears in terms of questions here: as somebody who's been a practitioner of machine learning for nearly two decades, I'm wondering how you view the field. Do you view it more as an art or more as a hard science?

Carl Osipov: [00:38:21] I think whether the field is an art or a science is more a question of how a particular practitioner approaches it. Some practitioners approach the field clearly as a science. When I talk about the science, I think about what's possible. If you're a scientist, you're thinking about the possibilities of what can be done with machine learning. And then the process for you is communication: telling others about what you found in terms of what's possible, maybe telling others about the state of the art results in machine learning, and working on experiments, working on different projects, trying to understand what else is possible in the field. So I think some individuals, some practitioners, definitely approach machine learning as a science. But it's also possible to approach it as an art in many different ways. When I think about the applications of machine learning to something like generative adversarial networks, GANs, and some of the possibilities there in terms of creating stylized art, I think this is transcending science and clearly venturing into art territory. And I say that because individuals suddenly look at the subjective aspect. So it's not just about stylizing somebody's image; it's not just about representing somebody's portrait in the style of Monet or in the style of Van Gogh. It's really about how that stylization, how that image, is going to be perceived by the audience. And I think as soon as you start thinking about this interaction, about the impact of the work that you're doing on the individuals, on the people in your audience potentially, or on the culture, it becomes more of an art.

Harpreet Sahota: [00:39:51] And in what ways do you think the creative process tends to manifest itself when we're working on a machine learning project?

Carl Osipov: [00:39:59] We spoke about feature engineering. I think feature engineering is still one of the most creative aspects of machine learning today. If you take a look at some of the results that are coming out in that space, they attempt to replace feature engineering or automate some of the most common feature engineering tasks. But I think, as a human being, if you can take a look at the problem and try to come up with creative feature engineering solutions, you're still going to be able to defeat these automated approaches that we're seeing from tools like Driverless AI and, in some cases, from AutoML and AutoAI technologies. So this is definitely one of those areas where somebody can demonstrate their creativity, in the sense that as a practitioner you can find unexpected solutions to well understood problems. I'm thinking about - you talked about my experiences in machine learning. I think the first time that I discovered that machine learning itself can be creative was back again when I was working in undergraduate research. And at that time we were looking at the results of applying genetic algorithms. Now, I know some of the audience will say that genetic algorithms may be on the sidelines of machine learning or the sidelines of data science. But this was an application of genetic algorithms to a video game, back way before DeepMind and those game playing results. Somebody at the University of Rochester decided to implement genetic algorithms to train a program to play Pac-Man. Everybody has seen Pac-Man; this is where the little yellow circle is running around eating little berries in a labyrinth, a maze. And one of the things that happens in the maze is that there are these blinking lights. If you navigate your character, navigate the little yellow circle, to one of the blinking lights, then the rules of the game change one hundred eighty degrees. So instead of your Pac-Man getting chased by ghosts, the ghosts become the hunted, let's put it that way. And in fact, you actually start chasing those ghosts, basically chasing them down for points. So the reason I bring this up: I think the first time I realized that algorithms, genetic algorithms in this case, can be creative is when this program found a creative solution. It would navigate all the way to one of those blinking lights in the Pac-Man labyrinth, the Pac-Man maze, but it actually would not swallow that little blinking light and did not change the rules of the game. It would simply shuffle back and forth right next to that blinking light until all the ghosts chasing Pac-Man came close, and then it would swallow the blinking light and go back and actually achieve a high score by chasing down those ghosts for points. I'd never seen that solution myself. I'm sure that as a human being, somebody could have invented that solution on their own. But there I saw an example of an algorithm identifying a creative solution for itself. So the reason I tell this story is not to say that genetic algorithms are the best thing since sliced bread. To the contrary, I think what's important here is recognizing that there's creativity in the combination of a data scientist's or machine learning practitioner's expertise and the capability of the algorithm.
So if you combine the two together, like in this example with Pac-Man, where you had the undergraduate researcher working with the Pac-Man game, combined together the result was really creative. So I think it would be great if, going forward, we all learn to be more creative by combining our expertise, our knowledge of data, tools, and techniques, with algorithms that deliver creative results.

Harpreet Sahota: [00:43:19] Thank you very much for sharing that. That's a really interesting story. Really, really enjoyed that. Thank you. So I wonder if we can get into a little bit of your experience at Google. While you were at Google, you had the opportunity to work with some of the foremost experts on machine learning, and you helped manage the company's efforts to democratize artificial intelligence. I'm curious, what does it mean to you for artificial intelligence to be democratized?

Carl Osipov: [00:43:40] I think the best way to explain it would be by contrasting the situation with artificial intelligence today to where it was back when I was doing undergraduate research. Back then, machine learning and some of the techniques of what eventually became known as data science were very much for the in-crowd. Here's what I mean by that. Academics would describe these techniques to each other in research publications, and there was a fairly significant barrier to entry into the field. If you wanted to understand machine learning, for example, it was nearly impossible to do that just by picking up research papers and trying to study them on your own. You really needed to have guidance from a professor, from a graduate student, from a mentor, really. And I think what has changed today is that artificial intelligence has become more democratized. It has become more widely available to students and others who are interested in artificial intelligence, and I think it's more widely available to people simply because of the availability of books and online courses, and simply because so many more people have started getting into the field and trying to teach artificial intelligence to somebody who is just starting out, trying to help somebody understand it from scratch, instead of expecting that somebody has to go through courses in linear algebra and calculus before they can even pick up a paper. So to me, what it means to democratize artificial intelligence is really to try to achieve this goal that I'm describing: first of all, making sure that people are educated on artificial intelligence from a fairly early age, from, let's say, high school age or the early teens. And then I hope to see the tools of artificial intelligence used more widely, with the hope that people can become more creative in using these tools and come up with better solutions for the world.

Harpreet Sahota: [00:45:21] And what would you say was the biggest lesson you learned about the democratization of A.I. while you were over there at Google?

Carl Osipov: [00:45:29] I think the biggest lesson that I learned is that trying to democratize artificial intelligence is a challenge that requires more than just one company, and probably more than just one industry, to be successful. I mentioned education, even high school level education. I think what's needed here is an effort at the scale of governments and changes at the scale of public policy.
And only when the next generation of students, the students who are graduating high school maybe in the next four years or so, are able to really understand the basics of these tools and then start applying those tools as part of their undergraduate studies, if they decide to pursue studies in academia, I think this is where we're going to see a really fundamental transformation of our society and our economy based on the tools of artificial intelligence.

Harpreet Sahota: [00:46:16] So you've got quite a prolific track record of patents, inventions, and publications. I'm wondering, what is your favorite one, or the one that you are most proud of?

Carl Osipov: [00:46:28] That's a great question. I guess they're like children; they're all my favorites. But if I were to highlight some of them, I think for this audience probably the one that's most relevant is a patent that has to do with ontology alignment. I used to work on something that's called industry models. When you think about industry models, this is about creating massive data warehouses for large companies. Imagine you're working with two major retailers, and these two major retailers decide to merge together, decide to merge their IT operations. Then the question is, how can they take their existing data warehouses, think about all the structured data that exists at these retailers, and make sure that they build a single data warehouse so they do not have these separate silos? So one of the patents that I have is on this problem of ontology alignment, which is about applying machine learning techniques to process the documentation for the data in these data warehouses, for example from two different retailers, and identify when there's a match. An example of a match could be if you have one data warehouse that describes something like a customer purchase, so it uses the terminology "customer purchase," and then another data warehouse uses the terminology "customer order." When you're talking about thousands and thousands of these different kinds of entries in a data warehouse, the system described in the patent is actually automatically identifying these kinds of matches. So instead of human beings having to go step by step and figure out this problem, which is order of n squared, one data warehouse against the other data warehouse, the machine learning algorithms try to rank the best matches so that human beings can come in and then see whether the match was successful or not. So I think that's one of the more interesting patents.

Harpreet Sahota: [00:48:11] That's actually super interesting, because I'm working on something very similar at work, actually. So I'm going to have to go check out your paper. I feel like there's something in there that's going to help me tackle something I'm working on. We have this system where, essentially what you describe, we have engineers drawing up designs, and the fields that they have could have different names, but they mean the same thing. And when we're loading these Excel documents up into our system, we want to have an easy way to kind of map them to the appropriate kind of internal code that we have, if that makes sense. But the spirit of what you described, it sounds like it can help me attack that problem. So I will definitely have to check that out.

Carl Osipov: [00:48:48] That's great. It sounds like it can be a fun natural language processing project combined with a mapping problem.
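As a toy illustration of the kind of vocabulary matching Carl describes, here is a sketch that ranks likely matches between two warehouses' term lists using a simple string-similarity score. This is not the method in the patent; the terms are made up, and difflib similarity stands in for the richer NLP features a real system would use.

```python
# Toy sketch of ranking likely matches between two warehouses' vocabularies.
# String similarity is a stand-in for the NLP features a production system would use.
from difflib import SequenceMatcher

warehouse_a = ["customer purchase", "customer id", "product quantity"]
warehouse_b = ["customer order", "client identifier", "item quantity"]

candidates = [
    (a, b, SequenceMatcher(None, a, b).ratio())
    for a in warehouse_a
    for b in warehouse_b
]

# Rank the n-squared candidate pairs so a human only reviews the most promising ones.
for a, b, score in sorted(candidates, key=lambda t: t[2], reverse=True)[:3]:
    print(f"{a!r} <-> {b!r}: {score:.2f}")
```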
Harpreet Sahota: [00:48:53] Yeah. So which of your publications and patents do you think are most applicable to our current times?

Carl Osipov: [00:49:00] There's one disclosure that I have, I think it's still going through the process at the United States Patent and Trademark Office, the USPTO, and that one has to do with differential privacy. When I talk about differential privacy, I think about this problem of deanonymizing data. If you think about collecting data at scale, for example, imagine data that's coming in from personal wearable devices. That might be data that includes somebody's location, that might include somebody's heart rate, that may include somebody's health data. So I think it's very important for the current times to ensure that that data stays anonymized. The user wants it to be anonymous. And let me give you an example of why and how it is possible to deanonymize data in some situations. For instance, say you have data that describes somebody's personal location from a personal device, imagine a device worn on the wrist. Let's say somebody has access to that data and, at the same time, has access to a different data set that describes somebody's location, for example based on the cell tower that their cell phone connected to. In many cases, simply by combining these two different data sets together, it's possible to make a very good guess about the matches between these two different data sets. In other words, simply by looking at the individual's heart rate and the individual's location on the personal device, and the individual's location based on the cell phone tower that they connected to, it's possible to actually make an educated guess about the identity of that individual and to find out the health data of that individual from their wearable device. And obviously, this is something that people want to avoid. People care about their privacy and want to avoid the inadvertent disclosure of their health data. So that patent has to do with providing some guarantees. The patent describes how to rescale the data sets from personal wearable devices to essentially minimize the chances of this inadvertent disclosure of personal health data, so that the chances of that disclosure are constant and are low. For example, you can say, hey, I'm agreeing to share some data about my personal health as long as the chances of that data becoming deanonymized are no better than some fixed, low percentage. So providing these kinds of privacy guarantees I think is important for the industry going forward.
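For readers new to the idea, here is a generic sketch of one standard differential-privacy mechanism, adding calibrated Laplace noise so that any one individual's contribution to a released statistic is bounded. This is the textbook Laplace mechanism, not the method in the patent, and the privacy budget, clamp range, and simulated data are assumptions.

```python
# Generic Laplace-mechanism sketch: release an average heart rate with a privacy budget.
# Illustrates the differential-privacy idea only; it is not the patented approach.
import numpy as np

rng = np.random.default_rng(0)
heart_rates = rng.normal(loc=72.0, scale=10.0, size=10_000)  # simulated wearable data

epsilon = 0.5                                  # assumed privacy budget
lower, upper = 40.0, 200.0                     # clamp range bounds each person's influence
clipped = np.clip(heart_rates, lower, upper)
sensitivity = (upper - lower) / len(clipped)   # max change one individual can cause in the mean

noisy_mean = clipped.mean() + rng.laplace(scale=sensitivity / epsilon)
print(f"Differentially private mean heart rate: {noisy_mean:.1f} bpm")
```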
And if any of them are listening, I promise I'm going to keep it private, so I'll apply some of these ideas from differential privacy while describing these soft skills. I think it's very important for data scientists to become more effective at operating independently, while at the same time making intelligent decisions about how to operate independently. And I think this is more important today than it used to be in the pre-COVID world: more and more individuals are working remotely, working independently from home, and it's very important to be effective at managing your personal time and how you go about doing your work. I'm still seeing some of these more junior data scientists trying to look for a manager, and here's what I mean by that. They're trying to look for someone who will affirm their day-to-day to-do lists, their day-to-day activities. It almost comes down to a junior data scientist saying, hey, here's my to-do list, please, Mr. Manager, check it for me and tell me that this to-do list is OK. As data scientists progress along their career path, it's very important to be able to think not just in terms of a simple to-do list. To-do lists are great, but they're a way of organizing personal activity. If you go out and talk to somebody about what you're working on, it's very important to change the language and talk more about the success metrics, talk more about the objectives of what you're trying to achieve, and then talk about why and when you're going to be successful. So to help data scientists become more senior in their careers, they're going to have to change how they think about their responsibilities and focus more on delivering results that are successful and provide measurable success. Harpreet Sahota: [00:53:49] Reminds me of something that Seth Godin talks about in his book Linchpin. He talks about the ability to navigate not with a step-by-step set of directions but with a compass, because when you have a compass, you can still navigate through the ambiguity and find out how to get to where you need to go. But when you're following a step-by-step map, once you take a wrong turn, it's difficult to get right back on course. Kind of reminded me of that. So do you have any tips for data scientists who might find themselves having to present to a non-technical audience or to a roomful of executives? Carl Osipov: [00:54:23] Going back to this question of the recurring patterns that I'm seeing with data scientists, one of the mistakes they make when speaking to an executive audience is what they focus on. As a data scientist, I think you should be able to review this idea of counterfactuals, these what-if questions, and the role they play in decision making. Too often data scientists focus on the solution. Definitely, if you're a data scientist and you have a degree in data science or experience in data science, you already have enough expertise to get yourself into the room of executives with your knowledge of how to build the solution. But for decision makers, for executives, it's far more important to understand the problem and the what-ifs around the problem: trying to understand whether the problem is worth solving in the first place, trying to understand the context around the problem. If the situation changes, will the problem still be relevant?
So the example that we gave earlier, the food delivery use case, is a good one: that kind of use case has become more important since the economic environment changed with COVID. Trying to understand the factors around the problem, what is impacting the problem, what is impacting the measures of success for solving it, these are far, far more important for data scientists to understand if they want to get in front of the executives and be a participant in the decision-making process. So I think that's one of the key skills for data scientists to develop in that area. Harpreet Sahota: [00:55:54] And how could a data scientist develop their business acumen or their product sense so that they can communicate in that same language as these executives? Carl Osipov: [00:56:04] I don't think there's one easy solution; there are different approaches. A lengthy approach, but the one that is probably the most effective, is trying to start a company of your own and trying to make that company successful. If you actually have to wear multiple hats and act both as a data scientist and as an owner and key manager of the company, that's a great way to learn. Unfortunately, this option is not available to everybody. Of course, there are other options, such as pursuing an MBA. But I would also recommend reading about the basics of business in the newspapers. If somebody is a data scientist and they want to understand business better and communicate in the right language, pick up a business newspaper or open MarketWatch.com and try to read it on a daily basis. Try to understand what's going on in the economy. And then, once you have that basic understanding of how the landscape of the marketplace is changing, ask yourself: what does it mean for the company where I am employed? What does it mean for the project that I'm working on? And ultimately, what is it going to mean for the success of my customers? Harpreet Sahota: [00:57:10] So what advice or insight can you share with people who are breaking into data science and are looking at these job postings, where some just look like they want an entire team rolled up into one person, and then they end up feeling discouraged and dejected and don't even bother applying? What advice or insight can you share with them? Carl Osipov: [00:57:27] I would say if there's a job description, a job role that you're after, and somewhere in the online posting the requirements seem too harsh, ignore them. Never feel dejected, and never feel like you should not apply for your dream role just because there are too many requirements. Hiring managers and HR have tremendous flexibility in terms of who they hire. And ultimately, I think if you continue applying and you continue improving over time, continue learning, and understand what kind of skills are asked for in interviews and by hiring managers, you're going to be successful. I think the most important lesson there is to be persistent and continue focusing on that one successful outcome. You only need to be successful once, so don't worry about any of the individual failures. Harpreet Sahota: [00:58:16] So, last formal question before we jump into a quick lightning round: what's the one thing you want people to learn from your story?
Carl Osipov: [00:58:24] I would like to encourage aspiring data scientists and aspiring machine learning practitioners to be open to working in adjacent fields. I think the success that I've had in my career has been during the times when I was willing and interested in collaborating with someone who had a very different background from mine. I mean this both from the technical standpoint and from the standpoint of personalities, the diversity standpoint. And whenever you collaborate with someone and you're willing to learn from them, learn deeply from their background, you're going to come away as a person who really grows as an individual, not just in their career, but as a human being. Harpreet Sahota: [00:59:03] Awesome advice and awesome insight from your journey. Appreciate you sharing that. So, jumping into the lightning round here. First off, where can people find your book? Carl Osipov: [00:59:10] The book is available from Manning Publications. If you search for serverless machine learning on Google, it's going to be there on the first page. I think what we're also going to do after this podcast is that once the episode is available online, the publisher is going to post about it on Twitter. So if you're interested in learning about the book that way, you can follow Manning on Twitter, or I'll continue posting about the book on my blog; it's called Clouds with Carl. Harpreet Sahota: [00:59:43] And we'll also be sharing a discount code that our audience can use to purchase the book as well. I highly recommend everybody listening purchase it; it's a really great book. So what's your data science superpower? Carl Osipov: [00:59:54] My superpower is being able to apply engineering to data science. Harpreet Sahota: [00:59:59] If AI could answer any question for you, what would you ask? Carl Osipov: [01:00:03] I would ask how to build a better AI. Harpreet Sahota: [01:00:05] What do you believe that other people think is crazy? Carl Osipov: [01:00:09] I think that once we build artificial general intelligence, it will have as much interest in conversing with human beings as human beings have in conversing with cockroaches. Harpreet Sahota: [01:00:21] If you could have a billboard anywhere, what would you put on it? Carl Osipov: [01:00:25] I would say: live your life with your eyes focused on the horizon and walk without stumbling on a single stone. Harpreet Sahota: [01:00:31] I like that. So what is an academic topic outside of data science that you think every data scientist should spend some time studying and researching? Carl Osipov: [01:00:41] I think every data scientist should spend time studying mathematics and deepening their understanding of it. Harpreet Sahota: [01:00:48] And what would be the number one book, fiction, nonfiction, or maybe one of each, that you would recommend our audience read? And what was your most impactful takeaway from it? Carl Osipov: [01:00:58] I really recommend Judea Pearl's The Book of Why, and I recommend it because it talks about the impact that the ability to answer counterfactual questions, what-if questions, has on human intelligence, and the role counterfactuals play in understanding causal relationships between events, understanding whether something causes something else. Harpreet Sahota: [01:01:21] Definitely check that out, The Book of Why.
So if we could get you a magic telephone that allowed you to contact 18-year-old Carl, what would you tell him? Carl Osipov: [01:01:30] I think if I could get such a magic telephone, I would try to keep it around for as long as I could, and then at some point decide to give a call and say, everything is going to be OK. Harpreet Sahota: [01:01:39] What's the best advice you have ever received? Carl Osipov: [01:01:42] Be kind to people. Harpreet Sahota: [01:01:43] What motivates you? Carl Osipov: [01:01:45] My family. Harpreet Sahota: [01:01:46] What song do you currently have on repeat? Carl Osipov: [01:01:50] I don't have a song that's on repeat, because on my playlist the songs keep coming up in non-repeating order. Harpreet Sahota: [01:01:56] All right. So how can people connect with you, and where can they find you online? Carl Osipov: [01:02:01] The easiest way is to Google my name, Carl Osipov, or you can find me on LinkedIn; that's the easiest way to connect with me. My blog is Clouds with Carl at cloudswithcarl.com, or if you'd like to follow me on Twitter, it's just Clouds with Carl. Harpreet Sahota: [01:02:18] Carl, thank you so much for taking time out of your schedule to be here today. I really, really appreciate you sharing your insights and going deep into your book. I know our audience is going to learn a ton from this episode, so I can't thank you enough. Carl Osipov: [01:02:30] Thank you so much for hosting this conversation.