Ben: Hello and welcome to PodRocket. I'm Ben. I'm one of the co-founders of PodRocket, and we're interviewing Eric Muntz, who is the CTO of Mailchimp. How are you today, Eric? Eric Muntz: I'm doing great. Thanks for asking. How are you all? Ben: I'm doing well, really excited to have you on the show, Eric, and you've had a really interesting career, a bunch of roles in software development earlier in your career. Then starting as a software engineer at Mailchimp, almost 13 years ago, working your way all the way to CTO. I know Mailchimp has grown into a very large business and also had an acquisition by Intuit a few years ago, so a lot of exciting stuff to unpack. Maybe we could start with quick overview just, what is Mailchimp, in case folks are not familiar with the platform. Eric Muntz: Yeah, sure. Mailchimp is a marketing platform for small and medium sized businesses. We are mostly known for delivering a ton of email, but our platform does far more than that today. We support websites, landing pages, automations are super huge, forms, surveys, galore, sort of a one-stop shop for marketing for small businesses, with about 13 million customers headquartered in Atlanta, Georgia. As you noted, acquired by Intuit, a little more than a year ago, it was November 1st last year. Part of Intuit connected to the QuickBooks group, which is the small business and self-employed group at Intuit, which is pretty great. You get the front end and back end of everything you need to run a phenomenal small and medium sized business. Ben: Awesome. I imagine with millions of customers, it's staggering to think about the size of the customer base, and then multiply that by, what I imagine, is large number of emails and communications sent per customer. 
I'd be curious to understand over the years, you've probably seen orders of magnitude growth and scale of the platform, what are some of the most interesting or exciting, big technology decisions that you've had to make during your tenure? Eric Muntz: Yeah, that's a pretty long podcast to go through all of those. Just to start by talking about the scale when I started and the scale of what it is today: when I started we had about 300,000 total users on the platform. Today, we have over 13 million active customers. We're adding about 14,000 a day. I remember we used to have some pretty big celebrations when we had 1,000 new customers a day. When we hit that first million, we actually had a pretty big party. It was pretty cool. Going from that scale to where we are today, we have a sharding mechanism. The way our infrastructure works is we shard users by about one million per main shard, and your users won't see me air quoting "per main shard," and maybe it's at 1.5 or 7 million now, just total customer load. When I started, we only had one of those main shards, and we have 21 today that we manage. One of the first parts of growth was adding a second, and when you add a second you learn all the things you didn't configure right to be able to manage that and run that. That was pretty big. That was toward the end of 2010, which was the year I joined. From that, we learned how to handle login and hopping from login to the shard you're on. If you are a Mailchimp customer, you'll see a URL, it'll say usXX, where XX is the number, and that's the main shard that you're in. That was pretty big. When I started we were in managed hosting, and today we are mostly in colocation and are migrating to the cloud, migrating to GCP today. Along the way we have moved all over the place.
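As a rough illustration of the routing Eric describes (a login step that looks up which main shard an account lives on, then hops to a `usXX` host), here is a minimal Python sketch. The lookup table, the account names, the function names, and the `example.com` hostname pattern are all assumptions made for illustration, not Mailchimp's actual implementation.

```python
# Hypothetical sketch of shard routing: a central lookup maps each account
# to its main shard, and the shard number appears in the URL as "usXX".
# The table contents and hostname pattern below are illustrative assumptions.

SHARD_BY_ACCOUNT = {
    "bakery@example.com": 2,
    "bikeshop@example.com": 21,
}


def shard_label(shard_number: int) -> str:
    """Format a shard number the way it appears in the URL, e.g. 'us21'."""
    if shard_number < 1:
        raise ValueError("shard numbers start at 1")
    return f"us{shard_number}"


def shard_host(account: str, base_domain: str = "example.com") -> str:
    """Resolve an account to the hostname of the shard that holds its data."""
    try:
        shard_number = SHARD_BY_ACCOUNT[account]
    except KeyError:
        raise LookupError(f"unknown account: {account}") from None
    return f"{shard_label(shard_number)}.{base_domain}"
```

The point of the central lookup is that the login service is the only component that has to know about every shard; once the account is resolved, all later requests go straight to one `usXX` host.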
We were in managed hosting, at a company called SoftLayer, for I think about seven or eight of those shards, and then we migrated those into colocation and servers that we completely manage ourselves. That was a lot of work, as you can imagine. Those were some pretty big undertakings. We actually started to get pretty good at it. I would never say it was fun. It was always pretty scary because making sure that uptime and everything is running smoothly for customers is really at our core. We just care a ton about our customers, small businesses, and what they're doing. Scaling that, watching that scale, was pretty tough. Along the way we've had to re-architect several times and move from a specific type of hardware to another specific type of hardware. I think probably the biggest thing though, is when I started, I don't even know the email volume, it was probably a couple of million a day, and today we're at about a billion a day. Watching that, and I don't know if you know much about email delivery, but it's essentially a big giant black box, where you know how to manage what you're sending to ISPs, but what they do with it is just whatever they do with it, and they don't really give you a whole lot of information about that. Scaling all of that, watching that team scale, and building our own MTA infrastructure, MTA stands for mail transfer agent, and it is the thing that hangs on the edge of the internet that sends and receives email, and really getting that to scale to where you could go into Mailchimp right now and start to put together an email and say, I want to send myself a test and it's in your inbox in a few seconds. Being able to make that scale is just a monumental effort by the team. Ben: I imagine 10, 12 years ago, many of the primitives of a modern cloud deployment didn't exist. Kubernetes I don't think existed 12 years ago. Even containerization was maybe early at that point.
Over the years, there's been these big technological swings and as you mentioned now moving forward to the cloud, what has it been like to adopt some of these things? Have you adopted incrementally, or have there been big moves to do a lot at once? How have you managed that transition to modern cloud architecture? Eric Muntz: Yeah, that's a really great question, and I'm really proud of the way we've approached it, and how the team has really rallied around it. What we have decided to do, specifically in going from colocation into the cloud, is not to just do a lift and shift. We didn't just pick up our infrastructure as it existed three years ago, when we started working toward this, and just drop it into the cloud. Instead, what we did was lean into GCP and say, we're going to adopt these services and use those services to thin the monolith and get it thinned enough to a point where then it makes sense to pick it up and lift and shift it from there. It's been a really great journey. For example, we had a KV store that we have in MySQL. As for our technology stack, we're pretty much a LAMP stack. We use PHP, we use MySQL, and we do a really great job of managing MySQL. We've got over a million logical instances of MySQL that we manage across several thousand physical machines. Managing that, we had a thing called KV store that was basically a key value pair, you'd store blobs in it. We moved that out to a managed service within GCP and it reduced the database weight by about a third, which is super helpful. It allows us to store more in those shards and not have to build new shards in colocation, so we've just done that over and over and over again until we thinned out the monolith and now we've picked up the monolith and started to put it in GCP as it exists, adopting Cloud Run and all of those services along the way.
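To make the "KV store in MySQL" idea concrete, here is a minimal sketch of a key-value table that stores blobs in a relational database. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs anywhere; the table name, column names, and helper functions are hypothetical and are not taken from Mailchimp's schema.

```python
import sqlite3
from typing import Optional

# Stand-in sketch: a relational table used as a key-value blob store.
# SQLite replaces MySQL here purely so the example is self-contained;
# the schema and function names are illustrative assumptions.


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and make sure the key-value table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, key: str, value: bytes) -> None:
    """Insert the blob, overwriting any previous value for the same key."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(conn: sqlite3.Connection, key: str) -> Optional[bytes]:
    """Return the stored blob, or None if the key is absent."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return None if row is None else row[0]
```

Because every blob lives in the same database as the relational data, a table like this grows the database quickly, which is consistent with Eric's remark that moving it out reduced database weight by about a third.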
I love what you said about how, years ago, some of those constructs didn't exist and some of them existed but weren't resilient. For example, today we make very heavy use of BigQuery for our data warehouse, with about six petabytes of data in it. It's pretty amazing today. I don't know if you've gotten to play with it at big scale, but it is amazing technology, where you can put just a ton of records in it, six petabytes worth, and it responds very quickly. Years ago when it was new, we tried it out and we actually crashed it. I don't know that we crashed it necessarily for everyone, but for what we were doing, putting data into it, it just fell over. That just goes to show you what's happened in the last several years. We were just trying to put a couple terabytes of data in it as we were live streaming events and it fell over. Today we've got petabytes in it and it's just totally humming along. The other big thing is, as I mentioned before, we have to manage mail transfer agents and for email delivery, IP addresses are very important. If you're a specific small business customer of Mailchimp's, and you're sending enough volume, meaning thousands a day, you will want to have your own IP address so your reputation is tied to it. We need to actually manage thousands of IP addresses down to specific pieces of hardware, and we're managing all that in colocation. At this point, I'm not convinced that a cloud provider is going to manage that any better than we can in colocation. That infrastructure is staying where it is for the time being. I could see a future where maybe that changes, but for the time being, that's where it is. Ben: Got it, so kind of migrating most of the infrastructure to GCP, but then you'll continue to maintain your own physical infrastructure for the MTAs. Eric Muntz: That's right.
We have a huge network, so actually this past week we broke a record with about 57 gigabytes per second of sustained outbound traffic, which if you think about it, that's compressed text, it's just emails, just compressed text, that's a lot of email. It's a lot of compressed text, so we will continue to manage our own network and keep that in-house and in colocated data centers, and then just tie those to the cloud providers. Ben: One thing I was interested in, thinking about your 13 million customers, you're adding 14K per day, some percentage of those I'm imagining are spammers or some sort of fraudulent use. We all know email spam is a thing. How do you think about managing what I imagine is a small, but steady influx of spammers and making sure that they don't degrade the trust of your IP addresses or otherwise compromise the platform for others? Eric Muntz: Yeah, well some of that is our secret sauce, so I'm not going to give you too much of that info, but you're absolutely right. The saying, ruin the fun for everyone, that's actually what bad actors can do within an email ecosystem, because of the IP address reputation. If it's on a shared IP address, if 100,000 Mailchimp customers are all sending out emails from the same IP address, and 1,000 of them are doing really horrible things, well it's going to literally ruin the fun for everyone. We need to be super careful with that. What we have done, this is where machine learning has really come into play heavily at Mailchimp, and it's going to be that way for every ESP, so that's not too much of a secret sauce. You can do a little bit of content filtering, but unfortunately from a content standpoint, bad actors and specifically email prospectors and marketers tend to look exactly the same. You have to go deeper than that.
One of the big things we have playing to our advantage is that we have double digit billions of records of activity with email across the entire ecosystem going back to the 20 years that Mailchimp has existed. Then we can do a lot of machine learning and I'm one of those people who really hesitates to say AI, unless it's truly AI, but AI-like behavior, where we're looking at what's on a list and then also activity around how people are behaving. If they sign up and they quickly start doing things or they take a little while, they pop up out of nowhere to do some things, we have some triggers around how all of that works. We won't go too deep into that, because it's a bit of the secret sauce and why I believe we have the highest deliverability of anyone in the ESP world, but what's amazing to me is that team is about six people. That team is just ridiculously talented and is best in class across the entire email ecosystem. Ben: I'm curious, since your tenure and as CTO in the past few years, or even longer than that, your time at Mailchimp, what's one of the biggest problems you've faced in terms of system outage, or a bug, or an issue, and what was the problem and what did the team do to rally and get it fixed? Eric Muntz: Yeah, well, I always call these ghost stories. They can be fun when you talk about them when you're not in the middle of them. The biggest one we had, we call the SSD Apocalypse. It was in January of 2011, and we were still a pretty small team at the time. There's a guy named Joe who was our chief architect for quite a while, who's now a VP of Engineering on product teams and me and Joe and maybe three or so other engineers on the team at this time. I was actually off, I remember because it was January 2nd, and my wedding anniversary January 3rd, and so I was like, "Hey, I'm going to take time off, go hang out with my wife." 
Joe was one of those people, I say he is like a duck, where if you see a duck above the surface, it looks calm and chill, and that's at all times. Under the surface his legs might be going crazy like a duck's, but on the surface he is just always calm. We got one email that was like, hey team, things are looking a little weird on these servers. Then a next message that was like, ah, I'm holding it together, but I don't know, and then the third message was dire straits. Something has gone haywire. What happened is we just started losing hard drives. Hard drives just started shutting off and we were redundant, RAIDed, all of that. This was in a managed hosting provider. He's like, "I am barely able to get backups before all of this hardware is dying." The last one that we got backups for was seconds before they fully went offline. It took us a while to get everything back online. We did lose a little bit of data for customers we weren't able to get a backup for, and this was one of those 72 hours, all hands on deck, no sleep type things. Joe's famous line is like, "I don't trust technology anymore." We're just like, why would all hard drives die all at the same time all of a sudden? What we found out later was that we were on Crucial M4 drives, if I remember the brand and the manufacturer right. These were consumer grade, not commercial grade, which was a pretty big problem for our hosting provider to have given us those hard drives. They had a firmware bug where after 5,000 hours of use, they just shut down, so because they had taken them out of the wrapper and installed them all at roughly the same time, they all hit 5,000 hours within a very short timeframe, and all went goodnight, and just shut down. The amount of aggressive backing up we did after that was a little overboard, but that was quite a scary moment.
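The arithmetic behind that failure mode is simple enough to sketch. Assuming the 5,000-hour figure quoted above and some hypothetical install times (the dates below are invented for illustration), drives powered on within a few hours of each other will also hit the limit within a few hours of each other, roughly 208 days later:

```python
from datetime import datetime, timedelta

# Figure quoted in the interview; the install times used with these helpers
# are hypothetical and only illustrate the correlated-failure arithmetic.
FAILURE_AFTER_HOURS = 5_000


def failure_time(powered_on_at: datetime) -> datetime:
    """Moment a continuously powered drive reaches the firmware limit."""
    return powered_on_at + timedelta(hours=FAILURE_AFTER_HOURS)


def failure_window(install_times: list[datetime]) -> timedelta:
    """Gap between the first and the last drive hitting the limit."""
    failures = [failure_time(t) for t in install_times]
    return max(failures) - min(failures)


# Three drives racked over one working day (made-up times).
installs = [
    datetime(2010, 6, 1, 9),
    datetime(2010, 6, 1, 12),
    datetime(2010, 6, 1, 15),
]
window = failure_window(installs)  # the same 6-hour spread as the installs
```

Redundancy such as RAID protects against independent failures; here the failures are correlated by install time, so mirrored drives from the same batch go down together.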
Then I want to share one more, one of the ones that I like to talk about because it just shows the culture of Mailchimp. We're a blameless culture, a blameless environment, and we understand that mistakes happen. The biggest rule is never hide a problem. Around that same time I was writing code, I was just a software engineer, and one of the times people overreact is when they get a bug report from the CEO. Our CEO said, "Hey, when I go to this form and I change my settings, it's not the whatever, the cache version or whatever is not picking up those settings quickly. It takes a while for it to pick up those settings." I was like, "Ah, I got to fix Ben's problem." I went in, looked at it, made a change and tested it really quickly, was like, okay, that looks good. Got it peer reviewed, pushed it through. Then the next day, I think it may have been around 24 hours later, someone from support pinged me and was like, "Hey, a few users have just reported that anytime they make a change, everything gets reset." I was like, "Oh no." I know no one had touched that code recently, so I go and look at it and sure enough I was not only busting the cache, but I was busting all saved settings. Quickly reverted that, and then ran a script to see how many users may have been hitting that within the last day. It's just like The Matrix flowing by and I'm like, okay, do we have backups, and of course we didn't have backups, so I have to go in to the CEO and say, "Hey, so I over aggressively fixed that problem." I mean I was really freaked out. This was the first time I did something significantly bad at Mailchimp. I'm texting my wife, don't spend any money, I'm not going to have a job tomorrow. The CEO ends up having to send an email out to 24,000 customers, and he did so just with such humility and such grace and he called it the font-apocalypse, I guess we like apocalypses. He called it the font-apocalypse and said, we're very sorry for this, openly, here's what happened.
If this has drastically impacted you, come into support and use this keyword and we'll give you some money back. Crazily enough, what happened is that we ended up getting a lot of praise for how you openly deal with and acknowledge a problem. I thought I was going to get fired and flash forward a few years, and I'm the CTO running all of technology. It's just sort of a great story of never hide a problem and build a culture where people can fail, as long as they own the failure, fix the failure, and learn how to make things better from that failure. I did write a bunch of unit tests around that specific piece of code, even though unit testing really isn't my jam all the time, but I wrote some after that and I just think that's such a great story about our culture and how you can innovate and move quickly and do right by your customers. Ben: Yeah, and I'm curious now with having had experiences like that in the earlier days at the massive scale you operate at today, how do you think about balancing being able to move quickly and have engineers ship code quickly, but also if you do an operation that affects 13 million customers accidentally, it's a big deal. How do you think about balancing speed with safety of not making big mistakes? Eric Muntz: First off, we have a mission statement for our engineering team and it's, we give marketers production ready software designed to help them grow, and we succeed through togetherness, momentum, and pragmatism. There's four words that you can pull out there and it's production ready, togetherness, momentum, and pragmatism. Those are the four legs of the stool. If any one of those is out of balance, then things get a little haywire. The one specifically to this particular question is production ready. What we mean by that is not that it's perfect and bug free, we mean that it is both observable and it will be observed when it launches.
We have a social contract of deploying, which is that when you deploy, you are around and available, you watch the on-call channel, and you are ready to go if anything happens with the code that you have just deployed. We also have a bit of an odd stance, I don't really believe in staging environments because they tend to get too far separated from production environments. We feature flag extremely heavily and to the extent possible, we test in production. We deploy dark code, code that's in production but not in any user path. Then we add paths through feature flags to what we call internal customers only. That would be folks coming from our IPs or whatever so that we can then test in production with production load and a real environment. Obviously test in development, make sure it works in dev before you push it, but then you ramp up those feature flags and watch as you ramp them up. You may ramp to 5% and then watch, make sure everything is looking good there. Then the other thing is just that we have this big environment, we send a lot of email, but we're also adding new features and new channels all the time. If you make a change to the email delivery infrastructure, that has to work really well because it's happening a lot all the time. If you add 50 milliseconds to every email sent, the servers are going to catch fire. It's about acknowledging that there's hot paths and parts of the product that need to be treated like a 20 year old product and other parts that are just go and experiment and test and learn. If you're making a change to surveys, it's fine. It'll be fine if it's not a hundred percent perfect, what you want to do is figure out whether it's what users need and it's built the way users want it and it's going to empower small businesses. Ben: As I mentioned before, you've gone from software engineer and then worked your way up all the way to CTO, with the role you've had for the past three years.
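The flag ramp Eric describes (dark code first, then internal customers only, then a small percentage such as 5%) is commonly implemented with deterministic hash bucketing. The sketch below is one generic way to do it and is an assumption, not Mailchimp's system: the function names are hypothetical, and it identifies internal users by ID, whereas the interview mentions identifying them by IP.

```python
import hashlib

# Generic percentage-rollout sketch (not Mailchimp's implementation).
# Hashing flag + user gives each user a stable bucket from 0 to 99, so
# raising the percentage only ever adds users, it never swaps them.


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(
    flag: str,
    user_id: str,
    rollout_percent: int,
    internal_users: frozenset = frozenset(),
) -> bool:
    """Internal users always see the feature; others follow the ramp."""
    if user_id in internal_users:
        return True
    return bucket(flag, user_id) < rollout_percent
```

With `rollout_percent=0` and a non-empty `internal_users` set, the code path is live in production but reachable only by staff, which matches the "internal customers only" step; moving to 5 and then higher widens exposure without changing who was already included.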
I'm curious what's kept you at Mailchimp for so long and what has that path been like to both grow with the organization and grow your role in terms of role and responsibilities over the past 12 years? Eric Muntz: Yeah, I mean, first off, I've had just a tremendous amount of help along the way. I have an absolutely phenomenal team that's both the engineering team who roll up to me, but also peers and partners along the way. Then the founders at Mailchimp have created an environment where someone like me can thrive. That's pretty great. It's pretty awesome. For me, one of the things I learned in my career was that, so prior to Mailchimp, I worked for the federal government, and I built software for federal probation officers. Probation officers are sort of this weird mixture between social work and law enforcement. The people I worked for were helping people acclimate back into society after being released from federal prison. That's a super important job. Recidivism rates, it's always been a hard word for me to say, I shouldn't have tried it on a podcast, are pretty high. If I can build software that help these officers be safe and help them really make sure that these people thrive when they come back into society, then it feels like I'm doing something really meaningful. As you can imagine, working for the government, it's not always the easiest work. There's a lot of red tape and the software was pretty antiquated, but I loved that job because of the impact I was making on the end users. What I learned is I went from there to another company that had really modern tech stack, but I just didn't care about the business that much. I didn't care about the end users and making them happy. I was like, why am I happier at the government where it's actually harder to get work done? I realized that it's because I have to care about the end user. When I first started talking to Mailchimp, I was like email, why should I care about email? 
They said, "Well, because it's a really high ROI channel for small business, but it's really about small business and making small business succeed." That part is what's kept me around for almost 13 years. Also, I guess the term we're using in industry now is a scale-up. It wasn't a startup when I started. It was already a rocket ship. The ship was built, it was fueled, the ignition switch was hit, it had taken off, but I think it was tethered to the ground by its tech stack. We would release things and they were just super broken. We were going super fast, but we would release things with "TODO: implement this" in the interface, so my job was to release that tether and then add a navigation system so that we wouldn't just impale into a planet. Because of that, because of the scale-up aspect of it, my job was different every year, or at least every two years, my job sort of changed what was important. For a while, it was writing code, then it was writing code and building infrastructure, and then it was building teams. That mission statement is from a few years back and we're on version four probably of career levels. It's just changed so much every couple of years. It's kept me super engaged. Ben: Now with I believe around 500 engineers, what does your day-to-day job look like? Eric Muntz: Yeah, day-to-day I'm in a lot of meetings. I have not written production code in seven years. I can still read it though. I am still the seventh total contributor to the Mailchimp code base. There's an engineer named David who overtook me for number six. I keep sort of threatening in my mind that I'm going to go write some code, take him back, but CTO comes in a bunch of different flavors. There can be the sort of head nerd that's writing a lot of code and doing a lot of architecture and infrastructure work.
For me, I'm more the business-oriented CTO, who's in a lot of strategy discussions, and a lot of execution discussions with partners on the product side and the leader of marketing about how we can make our tech stack work really well for their team. I'm doing all of that through technology, but it's mostly about the business and strategy and how we grow users, how we grow revenue, and of course we were acquired a year ago, so there's a lot of work on the integration side and trying to learn Intuit and the engineering community and where we fit and all of that. My day-to-day is mostly on that side. Ben: I imagine being at Mailchimp through such a large part of your career and a large part of the history of Mailchimp, you have this immense knowledge of the history and decisions that were made in the past and why things had happened over the years. I imagine that's a big asset as you're a leader in the company. At the same time, I'm curious, how do you think about making sure you're constantly bringing in some amount of new ideas or figuring out what are other people in the world doing and bringing that into how you make decisions at Mailchimp? Eric Muntz: Yeah, that's a really good question. Well, the acquisition has made that a little easier because being acquired and gaining experience and expertise from an org that's as big as Intuit's, and as experienced as Intuit's, has been super helpful. We can lean on them and get a ton of feedback and help there. I'm a part of a ton of different communities, as are the rest of our principal engineers, and everyone else being involved in the Intuit architecture communities is super helpful. Prior to that, it's about mindset, and that no idea is a bad idea. When you hire new people, it's really important that they understand the context of why decisions were made in the past, because without that context, it can be like you have a KV store in MySQL, why on earth would you do that?
The answer is, well, back then with a really tiny team, we had operationalized MySQL super well and it fit there, there we go. Today, with different tech stacks and different tools, with your experience, how would you build this? Knowing how you would build it, well then how would you migrate us to that? It's a really tight balancing act because I don't believe in polyglot environments. I'm a strong believer of doing things one way and only one way, but I also can be convinced to do them a new way that's going to be better. For example, we are migrating to React for our front end, from Dojo, because it just makes a ton of sense. It's more modernized. You can hire tons of people who know it. It's faster. It's easier to understand all of those things. We just have to really balance the context of why you made decisions in the past with why decisions are right in the future. That's where the pragmatism part of that mission statement I said really comes in. Ben: What are you most proud of in your time at Mailchimp? Eric Muntz: Yeah, I am proud of a lot, but I think the thing I'm most proud of, is our apprenticeship program and the way it has grown really phenomenal engineers by giving them opportunity that they wouldn't have gotten at most companies or maybe are hard to find because they're from sort of non-traditional backgrounds. We started the program in I think 2012, we had some folks in our customer service department that were really highly technical and doing really highly technical work. When you asked them, "Well, why are you not an engineer?" You tend to get those typical imposter syndrome style answers. I don't have a degree in computer science, or I couldn't possibly do what you do because of X, Y, or Z, so we built a program, and I don't have a degree in computer science, I have a math degree. I think it's sort of natural for me to say, "Well, no, you can do it. You just have to be curious and understand logic and put the work in." 
We started that program in 2012. The way the program works is folks apply for a position on the engineering team. If they get it, they're on the engineering team for three months, and at the end of that three months, it's a try-before-you-buy. Both parties can decide whether they want to stick with it or not. If they don't, they go back to their other position. I don't know the numbers today, but recently we had over 70 folks in our engineering team that had started at the company outside of engineering, including staff engineers. We had a senior director in engineering who had an art degree background and started in customer service. It's just really awesome to watch these people do their work and it actually creates an engineer who really deeply values the customer experience and understands what it's like to be a customer. Sometimes that's even harder to teach than how to code, so super proud of how that's come together. Ben: I'm curious, you've been in the email world for a long time, but email's been around since the eighties in some form, and email's been around forever, but at the same time nowadays, there's a lot of other ways to communicate. There's other ways companies reach their customers, whether it's through social or through SMS. I'm curious, what do you see as the future of email? Will email still be a popular and primary channel in 10, 20, 30 years, or are there other channels you think will overtake email? Yeah, what is your opinion on that? Eric Muntz: That's a great question and one I get a lot, and I really need to polish up a better answer for it, because it's so hard to look into the crystal ball and see what's coming. I've been at Mailchimp for 12 years and since day one, people have talked to me or asked me about the inevitable downfall and demise of email. It seems like everyone thinks it's inevitable, but here it is as strong as ever. I'm not sure I would ever prophesy that it will officially go away anytime soon.
That said, technology naturally will evolve. Whether it's email or something that's email-like, that uses a different protocol or something else like that, I don't really know. I don't know what's coming. Privacy is a big concern. Email, the protocol for email around privacy, is pretty wacky. If you know the email address, go to town, and there you go. If you've ever looked at the SMTP protocol, it is rough. If any listeners out there haven't looked at it and need to take a nap, go try to read it. It really is a really tough protocol. It's very chatty. I could see the protocol evolving somehow and becoming a little less chatty, a little more specific. Maybe the address and the way the addresses work change. That said, it's a rich environment. You can deliver really, really rich content, where in text you can't. It's harder to deliver as rich content in text messages. We use Slack and you can deliver pretty rich stuff in Slack, but you have to also be part of the community and join specific channels and all of that, so it's tough. I think that messaging and personalized messaging that's sent from small businesses to consumers at the right time with the right content, whether that's email or something else, it's going to be here, and Mailchimp is going to be a part of it and working to make it work really well for small businesses. I love it. I actually love getting marketing. I'm one of those weird, I love getting marketing content. I'm like a marketer's dream come true. I really do believe that that type of ecosystem is here to stay, whether it's email or something that got evolved from email or looks like it. We send a billion a day, and that's not going away. Ben: Yeah. Well, we'll have to get you signed up for the LogRocket mailing list. Eric Muntz: Will do. Ben: Lastly, curious, what are you most excited for in terms of a specific technology or an engineering paradigm that's growing in popularity? What's most exciting to you?
Eric Muntz: Well, I'm excited about our cloud migration. That team's doing a bunch of great work, and there's just some amazing technologies that we're able to make use of there. If I'm being a hundred percent honest, I'm a little skeptical of blockchain and where that's going. I'm sort of looking at that with a side eye. I've yet to see a specific implementation where I'm like, that actually does require blockchain. I'm watching it. It's interesting, but I still don't really see a need yet for businesses. From a Mailchimp standpoint, we just launched a thing called webhooks in automations that I'm really excited about, to see what the developer community does. In a customer journey, you can make one of your destinations sending a webhook. You basically slap in any endpoint and it'll send a webhook with a little bit of data to it. It's sort of an open-ended way for folks to connect automations to the power of Mailchimp's activity. I'm excited to see where developers go with that. Anytime we open up some new API or something, it's exciting for me to see what all of the third party developers are going to do with it, so that's exciting. Ben: Well, Eric, thanks so much for joining us today. This has been great, and hope to have you back in the future. Eric Muntz: Would love to. Thanks again for having me, and happy end of year.