Speaker 2: (00:00) Everybody can do it. You don't need to have a fancy degree or need to be the best everywhere. So it's also your phone.
Speaker 3: (00:15) [inaudible]
Speaker 3: (00:22) [inaudible]
Speaker 1: (00:30) What's up, everyone? Thank you so much for tuning in to The Artists of Data Science podcast. My goal with this podcast is to share the stories and journeys of the thought leaders in data science, the artists who are creating value for our field through the content they're creating, the work they're doing, and the positive impact they're having within their organizations, industries, society, and the art of data science as a whole. I can't even begin to express how excited I am that you're joining me today. My name is Harpreet Sahota and I'll be your host as we talk to some of the most amazing people in data science. Today's episode is brought to you by Data Science Dream Job. If you're wondering what it takes to break into the field of data science, check out dsdj.co/artists, with an S, for an invitation to a free webinar where we'll give you tips on how to land your first job in data science.
Speaker 1: (01:21) I've also got a free open mastermind Slack community called The Artists of Data Science Loft, and I encourage everyone listening to join. I'll make myself available to you for questions on all things data science and keep you posted on the biweekly open office hours that I'll be hosting for our community. Check that out at artofdatascienceloft.slack.com. Community is super important, and I'm hoping you guys will join the community, where we can keep each other motivated and keep each other in the loop on what's going on with our own journeys, so that we can learn, grow, and get better together. Let's ride this beat out into another awesome episode, and don't forget to subscribe, follow, like, love, rate, and review the show.
Speaker 3: (02:06) [inaudible].
Speaker 1: (02:20) Our guest today is someone I truly respect and look to as a role model. In fact, he's the main inspiration behind this very podcast.
Speaker 1: (02:26) He's a data engineering evangelist based in Germany who regularly shares data engineering tools, techniques, and skills, often via live videos that cover topics from the breadth of job ops in data science to real-time Apache Spark coding sessions. He's built his reputation demonstrating how data science can be applied to the real world by showcasing tools, techniques, and apps where data makes a difference. Since 2012 he's been helping his company develop data science platforms that process and analyze insane amounts of data every day, working with the nuts and bolts of data science, big data platform design, and data engineering techniques. I'm talking about everything from figuring out how to ingest and process that data to finding effective ways of storing it, so that the data scientists in his organization are enabled to do awesome work and deliver a gewaltig, that's German for tremendous, amount of value for their customers.
Speaker 1: (03:15) You might know him from LinkedIn, where he selflessly shares his passion for data engineering and data science by providing his audience tips, tricks, and tools that help build their career and reputation, so that they too can become awesome data professionals. His content around data engineering has even led to him being voted one of LinkedIn's top voices for data science and analytics two years running. You might also know him from his blog, podcast, or YouTube channel, Plumbers of Data Science, which by the way has more than 11,000 subscribers and over 150,000 views. So please help me in welcoming our guest today, the original plumber of data science, Andreas Kretz.
Speaker 2: (03:55) Wow, that's an intr.
Speaker 1: (04:01) It's such an honor and a privilege to have you on the show. You've contributed so much to data science, especially a part that is so important to data science: data engineering. Talk to me a little bit about your background, how you got into the world of data science, and why you chose to specialize in data engineering.
Speaker 2: (04:16) The thing is, I would say I'm an engineer. I've always been an engineer, and I'm coming from an IT background. So originally [inaudible] was more of a mechatronics study or apprenticeship, and then I got into IT and studied applied computer science. Then I ran into this data science field more or less by accident, because the traditional stuff didn't work anymore. There was a certain time when I thought, man, maybe the analytics stuff would be something for me, but it turned out that's not me. I'm an engineer, right? I like to build stuff, and I'm sticking with the engineering thing. That's what I'm good at.
Speaker 1: (05:11) Are you an aspiring data scientist struggling to break into the field? Then check out dsdj.co/artists to reserve your spot for a free informational webinar on how you can break into the field. It's going to be filled with amazing tips that are specifically designed to help you land your first job. Check it out: dsdj.co/artists.
Speaker 1: (05:37) And actually, I remember a post on LinkedIn where you made that proclamation: you know what, analytics is not for me, it's all about data engineering. And I was like, hey, that's cool, man, because that's such an important part. People get caught up in the hype of data science and AI and machine learning, and everybody wants to be a data scientist, but really the work you do, data engineering, is the essence, the real engine.
Speaker 2: (05:58) Yeah.
I mean, the data scientist and the engineer need to work together. It's completely useless to think you hire a data scientist and everything is going to be fine, that this is going to do just fine, you're going to deliver a great product, and everything works. For me, it's the plumbing, the engineering behind it, that's super important, because you don't see it. But to actually create a good product, you need a good engineering team. That's where everything fits together: the great data scientists who can do the analytics, and the engineering team who delivers the data, more or less, who can manage the whole thing.
Speaker 1: (06:35) You've got such a tremendous background. You're literally writing the book on this topic. So talk to me a little bit about the book you're working on, The Data Engineering Cookbook.
Speaker 2: (06:47) Originally, the idea behind that was that there's basically nothing really about data engineering and what you need. Most people think data engineering is only the part where you format the data and deal with different file types or something. So I thought, let's start something that is public, where people can go and get an overview of, basically, what is this field? What do you need to become a data engineer? And like everything I do, I just started. I didn't give it much thought. It was just: how do I do it? Let's do a GitHub. Okay. Then this thing grows, and it keeps growing.
Speaker 1: (07:35) I dig that mentality. It's like, all right, cool, if anybody's going to write this book, it might as well be me. Right? I like that mentality, man. So, okay, I'm dabbling with the idea of writing a book myself.
I don't want to get into that too much right now, but what I do want to get into is your creative process for writing the book, and how that is similar to your process when you're starting a new project at work.
Speaker 2: (07:57) The thing is, I was talking to someone about this recently, about how I work. Basically, I always start with the basic outline, like bullet points: what's the thing that I want to create? Just headlines. Then I go down and create the whole thing without getting too much into the details of writing whole sentences and so on. It's like when you do coding: you have the rough idea of what you want to code, then you make a draft, and then you fill it in. That's more or less the process. Whether it's a blog post, the cookbook, or a YouTube video, it's like coding, just in a different language.
Speaker 1: (08:45) Yeah, exactly. I like that: just like coding, in a different language. And being the engineer that you are, do you take an agile approach, an agile methodology, to writing, or do you just go for it?
Speaker 2: (08:56) That's more or less what I do. First of all, my process is that I cannot write every day. Sometimes it hits me and I need to quickly note it in my phone, and then I get to it. But I don't have a big backlog that I take things from. I have thought about it recently, because it's growing so much. What I do every week is make a list of six or seven things that I want to achieve this week. For me, because I have a full-time job, that's ambitious, but this is more or less what I do. It's a bit like OKRs: I'm making goals that have a value, so I know whether I have achieved them or not.
So this interview, that's one point from my seven.
Speaker 1: (09:56) Yeah, that's an excellent point. OKRs, for the listeners and readers out there: Measure What Matters by John Doerr, an excellent book, highly recommended. Maybe you could talk to me about your stance on certifications, whether you think people are too quick to hop on the certification bandwagon, and whether certifications are ever a substitute for work experience.
Speaker 2: (10:15) I have a very strong opinion on this. My opinion has always been that they're more or less useless. The thing is, a lot of people are hunting for these certificates: now I am a certified Google data engineer, now I am a certified cloud, blah, blah. The thing is, I have started a lot of trainings, and when I work, I want to apply this stuff. I basically never get through a whole training. The only training where I've really gone through the whole thing was a Cloudera training for Hadoop, and just because I was actually there in the Netherlands and more or less did it. But it was very practical and very informative. Usually when I go in there, I look at this stuff and pick out the things that I really need, that I think are interesting and good to apply, and then I apply them in a project. So you have these six, seven, 30 certificates. Who cares? I mean, it's like in school: you can learn for an exam, and what does that say? I don't know if it's still that popular, but a few years ago the big thing was: I'm taking this Hadoop exam, Cloudera number this and this, what do I need to learn? So basically people were learning just for the exam so they could get the certification, but didn't actually do the whole thing.
Speaker 1: (11:43) Yeah.
There's definitely a difference between taking an online course for the sake of gaining the knowledge and taking the online course just to get an attendance certificate, more or less. Then, you know, that's not really an effective use of your time, and it doesn't really signal much.
Speaker 2: (11:54) Exactly. For me, it's: go in there and pick the stuff that you really need, that you can really apply. And the other thing: if you then don't have the time to actually get the certification, for me that's not that important.
Speaker 1: Yeah. It's an interesting segue, because you mentioned being able to apply it. I think there's so much out there in terms of resources for how to build a data science project or a data science portfolio, but for the people out there who want to be data engineers and don't know how to build out a data engineering project, do you have any tips or ideas on how they can get started?
Speaker 2: (12:37) The main problem people usually have is that they want to build the whole thing from the beginning. They say, okay, I need to do a project, I need to make a big thing, I don't know which data and which tools, and they want to do the full chain. I always say: start small. Start by picking some data that makes sense, that is interesting to you. Then start with one tool, build something on top of that, then build something more on top of that, and maybe switch something out for something else that you're interested in. It doesn't make sense to go into the cookbook, look at every tool, and try to learn every tool. That's completely useless. Pick a few that are interesting or that you see are in demand, then look into them, apply them, and learn how to use them. That's the main thing.
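One way to picture the "start small" advice is a single data set and a single tool. The sketch below is an illustration only, not something shown in the interview: it uses just Python's standard library, and the sample rows, `read_in_slices`, and `average_temperature` are hypothetical names made up for this example. It replays a small CSV in slices, as if the rows were arriving over time from a live source.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = "\n".join([
    "city,temperature_c",
    "Berlin,21.5",
    "Munich,19.0",
    "Hamburg,17.2",
    "Cologne,22.1",
    "Frankfurt,23.4",
]) + "\n"


def read_in_slices(source: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield the rows of a CSV source in batches of at most batch_size rows."""
    reader = csv.DictReader(source)
    while True:
        # islice keeps consuming the same reader, so each call picks up
        # where the previous batch stopped.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def average_temperature(batch: List[Dict[str, str]]) -> float:
    """A tiny processing step: the mean temperature of one batch."""
    return sum(float(row["temperature_c"]) for row in batch) / len(batch)


if __name__ == "__main__":
    slices = read_in_slices(io.StringIO(SAMPLE_CSV), batch_size=2)
    for number, batch in enumerate(slices, start=1):
        print(number, len(batch), round(average_temperature(batch), 2))
```

From here, a single piece can be swapped out at a time, for example reading from a real file instead of the in-memory sample, while the rest stays unchanged.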
Speaker 1: (13:20) Would you suggest that whoever is out there trying to start a project for data engineering maybe try to hunt down some type of open API or open data portal, and just build a pipeline to extract data from there and do the manipulations? Do you think that's a good approach for building out a project?
Speaker 2: One approach is, like you say, you hunt down some sources like an API. The easiest thing is the Twitter API: you just get an account and get some data from Twitter. But what you can actually do is just go out there and look through the free data sources. There are tons of them. Just last week I added, I don't know how many, a lot, to the cookbook. You need to sift through them and find a data set, like one in CSV format.
Speaker 2: (14:00) And what you can do, if you have a big file, is slice it and simulate an API, or simulate a source that is posting somewhere. It's rather quick, and you don't need to fight with some APIs and so on. Both work. I have tried both.
Speaker 1: So, you mentioned this a little bit earlier, this data engineering coaching program that you have. Talk about how you got the idea to get that started and what it all entails.
Speaker 2: The thing is, like I said before, people have problems starting this journey. And what I have seen is that I have seen a lot out there. I can help people, basically point them in the right direction here and there, and help them get going. And I want to create a platform for people where they can basically share their knowledge.
Speaker 2: (14:47) I'm doing office hours right now, two days a week, and people can join. We're in the coaching right now.
It's an hour each, and I have time for them. They can ask questions, they basically have access to me, and I can help them. They can then post on the official Team Data Science blog. I advise them that each week they should make a post about what they have learned and document the whole thing, because that is one of the most important things out there. You need to document what you do. It doesn't matter. That's the other thing with the certifications we were talking about: you don't really need the certifications when you have a reference of what you have done. This is what I want to achieve with this coaching.
Speaker 2: (15:36) Give people the chance to start a project, to do projects, to get help doing that, and to learn to document the whole thing. Then finding a job, I think, will not be a problem, because when you put that in your CV, people can look at the GitHub code, people can look at your experience over 12 weeks or whatever. You can do this longer, it doesn't matter. That's what people are looking for who are looking for engineers or data scientists. It works the same way. It's the applied data science.
Speaker 1: Because it's hard for an employer to open your head and see, oh, what did you retain from the certification? But if you leave artifacts of your work, if you have that visible, then it's easy to not only verify that you know what you're talking about, it also leaves evidence of the work that you've done and the value that you can potentially contribute.
Speaker 2: (16:25) When I initially started the whole blogging and social media and so on, one of the sparks for that was that we had a new data scientist, and in one of the interviews they said, hmm,
Now, when I come here, I'm leaving... The point there was a bit about money, but whatever. What I'm going for here is that when I'm leaving my old company, I'm leaving more or less my complete reputation and everything behind. I'm coming here and starting fresh. I don't know anybody and nobody knows me, and I'm starting fresh here. During the interview I didn't think a lot about it, but after that it really hit me. I mean, that's a big point. What's your reputation? What is your knowledge? How do people see you?
Speaker 2: (17:21) And you don't need to go full out like me and do a blog and YouTube and everything, but just leave these breadcrumbs behind: what do you know, what can you do? And that was one of the sparks.
Speaker 1: A question for you about the coaching program: how can students sign up? Is there a cost structure to it? Is the coaching one on one? Is it group coaching?
Speaker 2: I just set up a website, teamdatascience.com. I just bought that, and I was very happy that Team Data Science was actually free. I was like, yeah, okay, I need to get this. So Team Data Science is the platform. There is a cost structure. Right now I'm charging around 400 euros.
Speaker 2: (18:09) There are not so many people, so I have more time. But the idea is that it's group coaching through office hours, because I don't want to keep this focused on me. People should learn from what others are talking about, what problems the other ones have, and hopefully collaborate there. For instance, two in the coaching are working on something with Spark, and they can ask each other throughout the thing. So there is a bit of a dynamic, and like-minded people can find each other, because, you know, engineering is, you know, the engineer usually is it stuff.
Speaker 1: Yeah.
That's good. I like that group coaching kind of idea. Teamwork makes the dream work, I guess, as they say. So, hey, before we jump into the lightning round here, what's one thing you want people to learn from your story?
Speaker 2: One thing is that everybody can do it. You don't need to have a fancy degree or need to be the best everywhere. You need to sit down. You need to actually work on stuff, apply stuff, and document what you do, and then everything is going to fall into place. It's nothing too complicated. It's not rocket science. I mean, if you take your time and invest the time, it's worth it.
Speaker 1: Yeah. Nothing works unless you do. Right?
Speaker 1: (19:30) Hey artists, check out our free open mastermind Slack channel, The Artists of Data Science Loft, at artofdatascienceloft.slack.com. I'll keep you posted on the biweekly open office hours that I'll be hosting, and it's a great environment and community for all of us to talk all things data science. [inaudible]
Speaker 1: (19:56) Let's go ahead and jump into the lightning round here. Python or R?
Speaker 2: Um, Java. All right, I'm coming from a Java background, but if you're giving me these two choices, I would say Python. I see a lot of Python out there. It's crazy. Python is going crazy everywhere.
Speaker 1: So would you say that for data engineers it would probably be most beneficial to learn Java, or do you think Python would be sufficient?
Speaker 2: There are certain use cases where you should look into Java, but actually, if you start with Python, you can make the switch. I mean, it's more or less only a bit of a difference in syntax. And yeah. Awesome. Python is, I mean, the tools are going the Python way, and so awesome.
They have it Java.
Speaker 1: So how about for somebody who's just trying to break into data engineering and doesn't know which platform to pick? There are two competitors, I think, out there, AWS and Azure. Which one would you recommend?
Speaker 2: (21:02) Yeah. So when you're starting out, for the learning thing, yes, you have the option of the clouds. Personally, I'm an AWS guy. I'm not that huge a fan of Azure. I think Azure is behind. I mean, I may be an AWS guy, but if you don't have the time or the money, or don't want to go to AWS, and you want to do a personal project, install Docker on your machine and run a few containers, and everything is good. That's not a hindrance. If I need to choose, the AWS route.
Speaker 1: Awesome. So, we touched on this earlier, but just to reiterate: certifications or self-study?
Speaker 2: Self-study. There you go, all the way.
Speaker 1: What's your favorite big data tool?
Speaker 2: I'm a huge Spark and Kafka guy. Hadoop, that's where I'm coming from, and Hadoop is going down a bit in popularity. But Spark is fascinating, and Kafka too. What you can do with these two tools, it's awesome.
Speaker 1: Awesome. So what's your favorite question to ask in an interview?
Speaker 2: It's a coding question. I'm always asking: tell me the difference between an object and a class. You immediately see if somebody is able to code, and you will not believe
Speaker 1: How many people are failing this?
Speaker 2: Especially somebody who is coming from university. It's really fun.
Speaker 1: (22:37) What's the ideal answer to that question? How would you answer that?
Speaker 2: Well, my ideal answer is that the class is more or less the blueprint, and the object is the actual thing that you build from the blueprint. For me, that would be absolutely okay.
You can go into more detail and so on, but that's the thing.
Speaker 1: But it's a simple question. Like, a class is a class until we instantiate it as an object. So that's...
Speaker 2: That's already the more technical answer. I'm not even looking for a technical answer. I'm looking for a practical one.
Speaker 1: What's the weirdest question you've been asked in an interview?
Speaker 2: I think I don't have a really good answer. Some years ago somebody asked me if I could describe how a motorcycle works, and that was quite funny.
Speaker 1: Sit on it and press the gas, or pull it.
Speaker 2: I'm not sure what they wanted.
Speaker 1: Awesome, man. Hey, well, how can everybody connect with you?
Speaker 2: On LinkedIn. LinkedIn is my biggest platform. I also have a Telegram chat group. It's Telegram, slash team data science, or just search for Andreas Kretz on Telegram, because I really like this chat function. It's so immediate, and you don't need to go to LinkedIn and get approval and connect with people. It's just chat, and people can chat. And there's YouTube or something. I'm basically,
Speaker 2: (24:16) I'm everywhere, but mainly it's LinkedIn and Telegram, or my email. These are the direct
Speaker 1: Awesome. Well, I'll be sure to drop links for your YouTube channel, Plumbers of Data Science, as well as the Telegram. Thank you so much for your time. I know it's super late for you over there, so I appreciate you taking time out of your schedule to sit and chat with me.
Speaker 2: Thanks.
Speaker 3: (24:42) [inaudible].
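The blueprint analogy from the lightning-round question about classes and objects can be sketched in a few lines of Python. This is only an illustration: the `Motorcycle` class, its attributes, and the brand names are hypothetical and chosen for the example, not taken from the interview.

```python
class Motorcycle:
    """The class is the blueprint: it describes what every motorcycle has and does."""

    def __init__(self, brand: str) -> None:
        self.brand = brand
        self.speed_kmh = 0

    def accelerate(self, delta_kmh: int) -> None:
        """Change the speed of this one motorcycle only."""
        self.speed_kmh += delta_kmh


# Objects are the actual things built from the blueprint.
first = Motorcycle("BrandA")
second = Motorcycle("BrandB")

# Each object carries its own state, so accelerating one leaves the other untouched.
first.accelerate(30)

if __name__ == "__main__":
    print(first.brand, first.speed_kmh)
    print(second.brand, second.speed_kmh)
```

Both objects come from the same class, yet only `first` changes speed, which is the practical difference the question is probing for.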