Patrick Dougherty: My machine, like my laptop and my VMs that I could work in, were the silo of data for the whole team, right? My boss is looking at it and he's like, "So if you got hit by a truck, like what would we do the next day?" and I was like, "Yeah, that might be a problem."

Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson. Today, we're discussing Rasgo. I'm with Patrick Dougherty, one of the creators of Rasgo. Welcome to the show, Patrick.

Patrick Dougherty: Thanks, Eric. Thanks for having me. Great to be here.

Eric Anderson: I'm excited about this. Patrick doesn't know that I spent some time at Google working on a product called Dataprep, Google Cloud Dataprep. Rasgo is a data preparation platform. Patrick, what does that mean exactly? Just so we're all on the same page as we talk.

Patrick Dougherty: Sure. I like to boil it down to what the product actually does under the covers, which is that it generates SQL. You can simplify it to just those two words. It does that through a number of different interfaces, some more on the automated side, some more just augmenting the user. But at its core, to interact with data in a data warehouse, you need to generate SQL. It's a sometimes terrible language to write from scratch, and so we make that a lot easier.

Eric Anderson: Got it. Are you doing this in a no-code way, or what are the interfaces where one would express something that then gets translated to SQL?

Patrick Dougherty: Yeah. Two primary interfaces, one being no code, low code, where you can work within our web app and a UI and point and click, enter in arguments. Like a join, right? I have to pick a table to join with, pick the columns to join on, and the type of join from a dropdown: left, right, or inner. Anything from that simple up to something complex like feature-engineering-type calculations where you run window functions over tens or hundreds of columns. So that's one option: no code, low code, and the UI. The other, which is feeding the open source product that we recently released, is a Python interface. So that's very similar in look and feel to working in Pandas, but the difference is that you are generating SQL under the hood and executing it in the data warehouse instead of against a data frame in memory.

Eric Anderson: For the purpose of the open source, that's kind of been seeded by this Python interface, and then you have a Rasgo product, the low code, no code?

Patrick Dougherty: That's right.

Eric Anderson: Fantastic. Do I have to have a data warehouse to use Rasgo intelligently, or is that optional?

Patrick Dougherty: It's required right now. To use Rasgo at all, in any interface, you need a Snowflake, BigQuery, or Postgres instance to connect us to. We do want to expand support to more of a DuckDB style of database where you could run it in memory. But frankly, that's only valuable from an enterprise standpoint if you could sort of do that for dev but ultimately deploy on a warehouse. That's where our value is, and so that's where our focus is.
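[Editor's note: to make the Pandas-versus-warehouse contrast concrete, here is a minimal sketch of the pattern Patrick describes. The class and method names below are invented for illustration, not the actual RasgoQL API; the point is that each Pandas-like chained call composes SQL rather than operating on an in-memory data frame.]

```python
# Illustrative sketch only -- hypothetical names, not the real RasgoQL API.
# Each method call composes SQL instead of mutating an in-memory DataFrame.

class WarehouseDataset:
    def __init__(self, fqtn: str):
        # Start from a fully qualified table name in the warehouse.
        self.sql = f"SELECT * FROM {fqtn}"

    def filter(self, condition: str) -> "WarehouseDataset":
        # Wrap the current SQL in a filtering subquery.
        out = WarehouseDataset.__new__(WarehouseDataset)
        out.sql = f"SELECT * FROM ({self.sql}) AS src WHERE {condition}"
        return out

    def aggregate(self, group_by: str, aggs: dict) -> "WarehouseDataset":
        # Wrap the current SQL in a GROUP BY subquery.
        cols = ", ".join(f"{fn}({col}) AS {col}_{fn}" for col, fn in aggs.items())
        out = WarehouseDataset.__new__(WarehouseDataset)
        out.sql = f"SELECT {group_by}, {cols} FROM ({self.sql}) AS src GROUP BY {group_by}"
        return out

# Pandas-like chaining, but the "work" is just string composition;
# execution would be handed off to Snowflake/BigQuery/Postgres.
orders = WarehouseDataset("analytics.public.orders")
summary = orders.filter("status = 'complete'").aggregate(
    group_by="customer_id", aggs={"amount": "sum"}
)
print(summary.sql)  # the generated SQL, ready to run in the warehouse
```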
Eric Anderson: Awesome. What sparked this? I mean, it's worth calling out that your website's rasgoml.com, and you mentioned a little bit of feature engineering. I imagine part of the Rasgo story is the rise of machine learning tooling; you fit in that world, and that world is in a lot of flux. I think it hasn't quite landed what the tool set looks like yet. Maybe you could help us understand how you arrived at this solution, how Rasgo came to be.

Patrick Dougherty: Yeah, very much in flux, and I lived that firsthand. When I started my career as a data scientist, I worked at Dell and I was on a team kind of devoted to the digital marketing team, a fairly small group built up around the Dell software portfolio at the time. My title was data scientist. My job function was anything that makes digital marketing more intelligent, right? That ran the gamut from things like writing a Python script to ingest our Facebook ads data, to be able to pull it into a database and start building some viz on it, all the way up to clustering the users that were responding to our ads and becoming users. Doing some lead scoring, further segmentation, from very descriptive to very predictive. In my experience, that is the job of a data scientist, as much as we may want it to be just one part of that spectrum all the time, right?

Patrick Dougherty: The other unique part about my role was I didn't have good data engineering support within my team. I was the first data scientist in the group and they didn't have any data engineers. So I was my own data engineer, and definitely not suited to do so day one. But that's part of the job; you pick that stuff up. So you can picture me landing at Dell and starting to build a bunch of stuff out in Pandas. Pandas was the workhorse language, or I guess package, for me that I could iterate in the quickest and get from raw data to insights.

Patrick Dougherty: But it created this silo where my machine, like my laptop and my VMs that I could work in, were the silo of data for the whole team, right? My boss is looking at it and he's like, "So if you got hit by a truck, what would we do the next day?" and I was like, "Yeah. You just got to go into my machine. You got to run this script... Make sure Airflow is still on," all of this mess, and they were like, "Yeah. That might be a problem." So you get to this problem statement of: how can a data scientist, without needing full-time resources from data engineering, be somewhat self-sufficient in data prep and able to deliver those insights without a massive human capital investment?

Eric Anderson: Totally. Yeah. I think you're not alone. Like, your laptop, currently on a flight to Mexico or something, is the single point of failure for the marketing team. Got it. So did you solve this yourself with Rasgo, or, at the time, did you cobble together scripts and do what everyone does?

Patrick Dougherty: Yeah. Cobbled together scripts. I remember looking at Luigi, which was Spotify's predecessor to Airflow for scheduling, and yeah, getting by, right? Making sure my insights were on time and the data was somewhat accurate for my stakeholders. I left Dell and went into consulting. I wanted to see if that problem was prolific, right? If it was just me, or if I would see it elsewhere. I worked at a firm called Slalom and eventually built up a whole practice of data scientists and data engineers. We did things like cloud migrations, moving customers from on-prem Hadoop or other database systems to a Snowflake or a BigQuery. We also did a lot of modeling and just insight generation, right? Which is all founded on this idea that a lot of times, we're the ones building data prep scripts from scratch for these customers because their internal teams didn't have the time or the resources to do so.
Patrick Dougherty: So after seeing that problem play out a number of times... I also met my co-founder while I worked at Slalom; he was working at Domino Data Lab at the time. We had a mutual client and were just solving this problem over and over with manpower, right? Just with great men and women consultants coming on board and knocking out scripts. It's like, "That just doesn't scale." I mean, what do you do if you're a company like Coca-Cola, who I worked with, that just had massive rewrites needed as they wanted to move to the cloud and get off Hadoop? That's the genesis of the idea for Rasgo: how can we make that user more powerful in this new world of, let's call it, the modern data stack, right? Where we're starting to centralize data, data prep can actually happen in-warehouse; it doesn't have to happen in an ETL tool. So I think the industry changes led to the fact that it was primed for a tool to come along and make that easier.

Eric Anderson: Yep. How do you co-found a company with somebody at a different company?

Patrick Dougherty: Like I said, we were working for a client in Atlanta. Jared was selling his software. I was telling my client to buy his software so I could use it, so my team could use it on the project. That was kind of the first time we met, just in a conference room at this client; it was pretty funny. And then we stayed in touch, we batted around a few ideas. We were both entrepreneurs, kind of in waiting or in hoping, right? Trying to find the right opportunity and thing that we were passionate about. So it was a lot of, I would say, Saturday Zoom sessions, because Jared was in New York at the time and I'm in North Carolina. And so that just built up gradually over a lot of creative thought processes, and also kind of being inside from COVID, right? Like, what better to do? When we launched, it was just mutual belief in that idea and then eventually connecting with the right angels that were willing to give us our little head start.

Eric Anderson: It's amazing. Where is Rasgo based then? I'm assuming you're not in Atlanta anymore; that was just a stopover.

Patrick Dougherty: I lived there for a couple of years. We are fully remote, though. I think, technically, our headquarters is New York City. We have two employees there, and the rest of the team is really everywhere. A couple on the West Coast, and then scattered throughout the East Coast as well. And then a couple in Austin, actually, which is another favorite place to go. I think we've got some good places covered from a team standpoint.

Eric Anderson: The decision to open source, was that apparent immediately?

Patrick Dougherty: It was apparent immediately that we should do it. The trick was when, because obviously it does take some extra time and effort to split out a piece of your code base like that. We had been iterating on this transform engine gradually over almost a full year, I guess, at the point we were ready to open source it. We were like, "We really want to split this out, but we should probably make sure it's valuable first so that we don't spend a month carving out that piece of the code base and open sourcing it if it's not valuable." That time was right as we turned into this year; we felt like we were pretty stable on it and it was significant enough to be valuable as a standalone piece. So yeah, we launched RasgoQL, the open source package. We launched it, I believe, February 1st of 2022. And yeah, it's been exciting so far.
Patrick Dougherty: We've got some good early evangelists that are getting on board and starting to carry the torch for us a little bit.

Eric Anderson: That's fantastic. This is fresh code, it sounds like.

Patrick Dougherty: Straight off the assembly line. But like I said, since we'd been iterating on it for at least nine months when we launched, now almost a year, it feels more baked than maybe some other things where you've started from scratch on the open source side, you know?

Eric Anderson: Yep. Switching gears a bit, we haven't said some words I thought we might have said up until this point. We did say feature engineering, but we haven't said feature stores, and I think that's deliberate even though it could be apropos. How do you see the world of feature stores, and where does Rasgo fit in?

Patrick Dougherty: Yeah. We spend a ton of time thinking about this question. The feature store world is starting to mature, and we have some customers that are using Rasgo as a feature store, and that's been going really well. The first thing you have to cover, though, when we bring up the term is, "What does it mean?" right?

Eric Anderson: Yeah.

Patrick Dougherty: Because our competitors use it very differently than we would, and I think no one's totally decided what it actually is yet. By the way, it's a terrible term, right? I mean, if you just take it literally, feature store, it sounds like a database. So what we have said from day one is we believe that the cloud data platforms will mature from an infrastructure standpoint to the degree that they can be the database for features, meaning you will not need a separate database in which to store feature values.

Patrick Dougherty: These databases are already starting to optimize for column organization and compression. Now they're even starting to emphasize speed, right? So latency of retrieval is starting to come down, and we're going to see things like better caching layers within the cloud data platform. All of that momentum we saw two years ago when we started, and we said, "We don't want to build at the infrastructure layer. We want to focus on building at the user interaction layer, the application layer, so that we are just how a user experiences their features." That spans the life cycle from, obviously, creation first, right? So taking a base data set and building features out of it that you think can be predictive in a model. Then organization: cataloging those, sharing those in our UI. We do some profiling, so looking for things like outliers, data quality, etc., of those features. And then finally, serving them.

Patrick Dougherty: Even though the database, we believe, should be where the values are, there's still complexity in making sure that each model gets the right feature values at the right time, and likely in Python, right? Those are the pieces that we focused on, and what we believe defines a good feature store. If we can agree that that's the definition, then I think I'm happy to speak of Rasgo in those terms, but we've decided that it's probably better at this point to just blaze our own trail, and we'll see what happens with that terminology.

Eric Anderson: One, I don't know if problem's the right word, but situation with the feature store branding is that the early feature stores were very, I've come to learn, opinionated, or at least specialized tools: they're for high-performance, high-volume streaming situations that maybe aren't broadly applicable.
Eric Anderson: Not every feature belongs in a feature store, per that early definition. But I think most people are thinking feature stores are where you put features. I imagine that Rasgo maybe is a more common denominator: for building features, Rasgo is a good place to do it, I would suspect.

Patrick Dougherty: Yeah. That was our objective in the first year of our product, as we onboarded our first customers and they had a lot of input into the early product. We wanted to make sure that their pain points were answered. Maybe not the way they asked for, but at least covered by product functionality that we were developing. Through that process, we continually heard... We expected going in that the friction would be, "I've got all these features and I can't organize them and I can't serve them to models fast enough," and over and over we heard, "Yeah, I really wish I had that problem. I don't have enough features in the first place," right? "I'm still stuck at the base table. I extract it from the data warehouse, I build some stuff in Python, and then where do I put it?" I think every customer was doing some form of a cloud storage bucket, like an S3, and sort of persisting their feature values there.

Patrick Dougherty: We've all seen the pain of throwing things in S3 and then trying to find later what they are, where to use them, when they were built, all this stuff. So we've very much oriented toward this creation problem: how do you create in a way that's scalable, where you don't have to go through code refactors before you can go to production? That's definitely been our north star since we heard that feedback loud and clear from the user. We're so excited about RasgoQL because it carves off a huge piece of that problem, which is: some things are really easy in Python and really hard in SQL, right? Just from a language standpoint, and it mitigates a lot of that pain. Once you have those features in SQL, you get all these other benefits for free that may not be readily apparent at first, but you'll come to appreciate.

Eric Anderson: One of the things this conversation brings to light, I think, you pointed out earlier: everyone's doing data science to arrive at some kind of conclusions. And then if you want to automate arriving at those conclusions, you just have to take your data science another step and turn the parameters you're using to get those conclusions into features so that the automation engine can digest them. Rasgo becomes the place where I do my data prep, all my data preparation that I think has a chance of ending up automated. If I'm trying to figure out who my best customers are, and I turn around with an answer and present it to folks, and they're like, "All right. Well, let's do something with these best customers," suddenly my data science needs to become a feature; it's feature engineering. You're kind of 90% of the way there, maybe, already, and I can do that end to end in Rasgo, it sounds like.

Patrick Dougherty: You got it. Yeah. That's a really good example of what we keep seeing. Before Rasgo, you would build that analysis, right? Who are my top customers, as defined by whatever the CFO thinks it is that day, right? It's some business logic; you would build that out in Python. A lot of times you would do a select star from the customers table, right? And then write some Pandas code, come up with some algorithm, and say, "Look, these are our most profitable customers."
Patrick Dougherty: Yeah. You take that back to the CFO, and god forbid he likes it, right? Like, "Oh, that's super helpful. I need that tomorrow, and I need that next week and in two weeks; I need that to be refreshed." We saw these teams springing up at our customers and prospects whose titles were often ML engineer or data engineer, but if you asked them what they did most of the time, it was truly code refactoring, where a data scientist would ship them Pandas code and they would say, "How can we make this fit within either PySpark or just raw SQL?"

Patrick Dougherty: And yeah, that's such a barrier, right? Because like you said, the internal customer of that insight thinks, "That's really valuable, I need that," and they're not realizing there's all this rework, just because of the way the tools have come together, that's required to get that on a refresh basis.

Eric Anderson: Totally. Very interesting. Tell us the state of the project today, Patrick, and where things are headed from here.

Patrick Dougherty: Yeah. We launched a couple months ago, and the initial reaction, mostly on social channels, the dbt Slack, and a couple of other related communities, was really positive. We've got a couple hundred stars on the GitHub repo, which is helping to drive a little bit of organic awareness. The primary roadmap item right now is to expand database support, so SQL syntax support, because of course all these products are claiming ANSI SQL compatibility, and that's somehow only 70% of their SQL functions while the other 30% is completely custom. Right now, Snowflake, BigQuery, and Postgres are the first three syntaxes we launched with. We've got strong community demand to support Redshift, and then Delta Lake, and something like DuckDB, which I mentioned earlier. And then you can imagine the long backlog of other syntaxes to support.

Patrick Dougherty: We actually just launched MySQL as well, I forgot about that one. So that's one dimension on which to expand the package: just make it more accessible to people working in any SQL syntax. That'll be step one; we're working through that here in Q2. The other component, dbt integration, is a big request from the community, and we shipped with some basic dbt integration. Think of it this way: I build up a long chain of SQL, right? That might have many CTEs, or common table expressions, in it, and at the end it ideally creates some valuable data that I obviously want to keep refreshed, and eventually schedule, test, organize, and catalog. dbt provides a lot of that framing around SQL, right? It's really good at that. We already have the ability to export that SQL chain to a dbt project and create a model in the dbt repo, but we can further that integration.

Patrick Dougherty: That's one of our goals: to make it even more seamless and even bidirectional. If you're a Python user that wants to extend an existing model with Python, instead of having to write raw dbt SQL, we would be able to ingest a model into a SQL chain, which is the RasgoQL primitive, we'll call it, and then start to add more transformations to that, which would be really cool.

Patrick Dougherty: dbt integration, database support expansion, and then the third thing that's core in our roadmap right now is bring-your-own SQL templates for transformations. We shipped with about 50 templates, which roughly map to one, or in some cases a couple, of Pandas' transform functions. But Pandas has 128, I believe, unique functions, so we want to further expand the library over time, but also allow our users to bring their own templates and, easily, if they want, keep those proprietary, right? Keep those in sort of a local storage and ingest them and use them in the package.
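[Editor's note: as an illustration of the "SQL template" idea, here is a minimal, hypothetical transform template rendered with Jinja. The template name, table, and argument names are invented for this example, not taken from the actual RasgoQL library; the real templates are similar in spirit, parameterized SQL that maps roughly onto a Pandas operation.]

```python
# Hypothetical example of a user-defined SQL transform template,
# in the spirit of the ~50 templates that ship with the package.
# Rendered here with plain Jinja; all names are invented.
from jinja2 import Template

# A "dropna"-like transform: keep rows where the listed columns are non-null.
DROPNA_SQL = Template(
    "SELECT *\n"
    "FROM {{ source_table }}\n"
    "WHERE {% for col in columns %}{{ col }} IS NOT NULL"
    "{% if not loop.last %} AND {% endif %}{% endfor %}"
)

sql = DROPNA_SQL.render(
    source_table="analytics.public.sensor_readings",
    columns=["reading_value", "sensor_id"],
)
print(sql)
# SELECT *
# FROM analytics.public.sensor_readings
# WHERE reading_value IS NOT NULL AND sensor_id IS NOT NULL
```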
Eric Anderson: Got it. Now, this dbt stuff sounds fascinating. It sounds like that's where initial users and use cases are coming from. dbt has mostly a SQL interface, right? And you're offering a non-SQL interface that produces SQL. Are you in front of dbt, like, I interact with Rasgo and it compiles to dbt SQL, which then compiles to data warehouse SQL? Is that the-

Patrick Dougherty: Yeah. You've got it. Exactly. Because when you think about this user that's used to Pandas and likes the syntax of Pandas, they're often intimidated by, or just outright not interested in, learning to write that raw dbt SQL. RasgoQL is giving them a bridge, right? It's saying, "Hey, you can still write in this kind of friendly, interactive, data-frame-oriented scripting language." But like you said, we'll compile the dbt SQL for you so you don't have to learn that stuff. That's a huge opportunity within that community to bring more users to dbt.

Eric Anderson: Got it. And then, we started this discussion with use cases around marketing just because that was your world originally, and certainly a lot of the modern data stack use cases here are marketing-centric. Is that the Rasgo wheelhouse, or where are the initial use cases?

Patrick Dougherty: Yeah. We've been really lucky to work with customers across industries so far, just by the nature of the early clients that we landed and have worked closely with. They span healthcare, manufacturing, energy, finance, like hedge funds, and then marketing and your more traditional use cases as well. There's a lot of really interesting feature engineering in the sensor space, I'll call it.

Patrick Dougherty: An energy customer, for instance, early on really pushed us to add transformation templates that could deal with sensor data well. You can imagine some of the common themes here: intermittent data showing up with readings. To aggregate that responsibly into a data set you can train a model on is really tricky, and it's just horrible in SQL, because you have to do a bunch of window calculations to create base tables and then you have to left join everything with that, just because the model needs to know where the gaps are. The model's not going to react well to just random intermittent data.

Patrick Dougherty: We have some really interesting transforms in the library that handle that problem. And just overall, I think sensor data, preventative maintenance, detection of just weird behavior when you're getting some type of stream, that is really interesting. That's one of the exciting ones, but I think you'll find that the templates are pretty generic. They're going to serve you reasonably well no matter the business domain that you're trying to feature engineer in.
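[Editor's note: the pattern Patrick describes, building a regular time spine and left joining sparse readings onto it so the model can see the gaps, looks roughly like the sketch below. The SQL is a generic, hand-written illustration in Snowflake-flavored syntax; the table and column names are invented, and this is not one of Rasgo's actual templates.]

```python
# Hand-written illustration of gap-aware sensor aggregation:
# 1) generate a regular hourly "spine" of timestamps,
# 2) cross join it with the known sensors,
# 3) left join the sparse readings onto it, so missing hours surface
#    as NULL rows that the model (or an imputation step) can see.
# Snowflake-flavored SQL; table and column names are invented.
GAP_FILL_SQL = """
WITH spine AS (
    SELECT DATEADD('hour', SEQ4(), '2022-01-01'::timestamp) AS hour_ts
    FROM TABLE(GENERATOR(ROWCOUNT => 8760))   -- one year of hours
),
sensors AS (
    SELECT DISTINCT sensor_id FROM raw.sensors.readings
),
hourly AS (
    SELECT sensor_id,
           DATE_TRUNC('hour', reading_ts) AS hour_ts,
           AVG(reading_value)             AS avg_value
    FROM raw.sensors.readings
    GROUP BY 1, 2
)
SELECT sp.hour_ts,
       s.sensor_id,
       h.avg_value          -- NULL wherever the sensor went quiet
FROM spine sp
CROSS JOIN sensors s
LEFT JOIN hourly h
       ON h.sensor_id = s.sensor_id
      AND h.hour_ts   = sp.hour_ts
ORDER BY s.sensor_id, sp.hour_ts
"""
```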
Eric Anderson: That's part of the promise of feature engineering, right? I take a real-world situation and abstract it into vectors that I can pass to a model, and the model's ultimately just predicting, forecasting where these vectors go.

Patrick Dougherty: 100%. Yeah. The data-centric AI movement, we'll call it, which has gathered steam, I'd say, over the past year, I like how it's reframing the conversation on extracting features from raw data as the area where you can actually meaningfully improve your model, right? AutoML tools have gotten so smart and so sophisticated that a human trying to hyperparameter tune is probably not the best use of your time anymore. I mean, don't stop doing it completely, but if you can spend more time understanding how the data was gathered, where the signal might be in that data, and then building features to extract that signal, I think that's where data scientists can actually add the most value in that life cycle, and it can be really fun too. You can go really deep on some interesting problems and learn about things in the field with sensors that you just never would've considered.

Eric Anderson: Yeah. I suppose that means the data scientists are being pulled from their ivory tower of algorithms and into the nitty-gritty of how sensors are collecting stuff.

Patrick Dougherty: I had a data scientist using Rasgo a couple weeks ago. He told me that he had just gotten back from a wind farm in Texas, and I was like, "Why are you at a wind farm in Texas?" and he said, "We're seeing such weird anomalies in the data. We wanted to go actually observe how this data's getting collected so that we can understand whether some of the things we think are anomalies are actually real, and vice versa." And so, yeah, absolutely. I mean, this movement to focus more on the data and how it's collected, I think that's awesome stuff. You're so much more likely to build an explainable, accurate model that others will trust if you go to that level of depth to understand those pieces of it.

Eric Anderson: Awesome. Patrick, anything we didn't cover today that you wanted to include?

Patrick Dougherty: I think there's an interesting conversation on the modern data stack as it relates to even some of the products you worked on, right? Working with data prep in GCP. I would love to get your 30-second prediction on how that evolves over the next couple years, and whether the kind of on-prem data transformation tools do all move into the cloud as the default place to transform your data.

Eric Anderson: That's an interesting question. I think one of the things that hung up maybe the early generation of data prep... I mean, a lot of those tools really started Hadoop-centric, and they were very job oriented. And so everything was around: you do some work offline, then you pass it into the cloud, and it executes in the cloud and does this job. I think, increasingly, our clusters are faster, right? Our workstations are more connected, and I'd love to see the world of local go away and we're just operating. The notion of a job kind of goes away; you're just doing your work, and I don't know if that means things are executing behind the scenes or everything executes on demand, but that's something I think we're still sorting out.

Eric Anderson: And then the other thing that hung it up is, I think everyone wanted a no-code, low-code interface, but they were nonstandard. I mean, that's the promise of code, I guess. It forces a rigidity of expression, and we're still using SQL after all these years in part because it's kind of low-code-ish and it's standard to a degree, as you pointed out.
Eric Anderson: But I'm optimistic that we can come to some paradigms that most people understand, and whether or not they're 100% consistent, at least we can adopt, like we have in spreadsheets, a way of working that most people get and grok, and they can pick up these tools rather easily.

Patrick Dougherty: Yeah. That's awesome. That's a great way of putting it. With the concept of dbt SQL, right? An abstraction from even the raw stuff a little bit, you have a layer at which to embed that commonality, which could be really, really powerful. I think if we solve that, we could solve a lot of the other stuff too.

Eric Anderson: Everyone can pick their interface, whether that's low code, Python, or SQL, and it all ends up in the same place and it's all in the same tooling. Yeah. That's a nice world. Patrick, thank you so much for your time today. Folks have a lot to look forward to. You mentioned this already, but they can go join communities on the website?

Patrick Dougherty: Yeah. rasgoml.com, join our Slack channel, our Rasgo user group. You can post questions about RasgoQL in there, and then you can find RasgoQL directly on GitHub. Just Google it, it should come up first. Download it, try it out, and tell us what you think.

Eric Anderson: Fantastic. Thanks for coming on the show.

Patrick Dougherty: Thank you.

Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson, and this has been Contributor.