Kenn Knowles: I think it's a pretty unique origin story for an Apache project. It wasn't like an already successful open source project that migrated to the Apache Software Foundation, which is how things typically happen. This was a bunch of separate communities forming a new one at the same time they formed the software project. So I loved it. I was so happy to be on a project when something this wild happened.

Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson.

Eric Anderson: Today we talk about Apache Beam, which is historical both for me personally, since I worked on Apache Beam while at Google, and for its place in the history of data processing at Google. When Google went from publishing papers to producing more open source, this was their first Apache contribution. The conversation arcs from a quick description of what the project is, into the history, which dates all the way back to MapReduce and covers some 15 years at Google, to some of the interesting things happening on the project today. In the middle, you hear a lot about the Apache way. We talk a bit about the transition from the way Google did open source to how the Apache Foundation does it. We'll also cover some of the technical intricacies of the project: how it handles event time versus processing time, unifies batch and streaming, and supports multiple languages and multiple execution runners. Kenn and Pablo are good friends; we worked together at Google, and I think you'll enjoy the conversation. So let's jump into the recording now.

Eric Anderson: Today we're talking about Apache Beam, and I have with us Pablo Estrada and Kenn Knowles, who are both PMC members on the project. I'll note that Apache Beam is historical for a couple of reasons. It's one of Google's first, I want to say Google's first, Apache contribution, and it marks a pivot from Google writing papers to Google doing open source projects. And it inspired Contributor: this was a project I worked on, as we'll get into, and it led me to an interest in open source and very directly to this podcast. So it's a bit of meta history for Contributor to do a history of Apache Beam. Pablo, Kenn, thanks for coming.

Pablo Estrada: Thanks.

Kenn Knowles: Yeah. Thanks for having us.

Eric Anderson: So let's do the quick description of what Apache Beam is.

Pablo Estrada: Kenn, go ahead.

Kenn Knowles: No Pablo, go ahead. No. Okay. Yeah. Apache Beam is a programming model and library. It's a framework for doing big data processing across a variety of engines, in your language of choice, sort of mixing and matching. It connects to any data that you've got, processes it any way you want, and puts it where you want it afterwards.

Eric Anderson: And it's kind of unique in that sense. There's probably a handful of data processing engines people are familiar with, some we've talked about on the show: Spark, Flink, and you'd be better at naming others. Beam is not necessarily one of those, but a kind of generalized interface where I can describe a pipeline like I would in one of those, and it executes on one of them.

Kenn Knowles: That's right. There's not a million ways to do embarrassingly parallel computation, and they have a lot in common. So while Beam sort of originates from the Cloud Dataflow SDK, Dataflow is just one of the engines you can run it on, the one I happen to be closely tied to.
Beam recognizes that there are core concepts to doing embarrassingly parallel big data computation, and that's where it comes from. So you can run a Beam pipeline on Spark, on Flink, on Cloud Dataflow, on Samza. People have created runners that were just experimental; IBM Streams had one for a time.

Pablo Estrada: There's Apache Nemo.

Kenn Knowles: Apache Nemo. Yeah.

Pablo Estrada: Yeah. There are several different communities that have written integrations between Beam and their engine, which usually runs on a cluster and can schedule Apache Beam workflows, or pipelines, as we call them.

Eric Anderson: And maybe before we move into the history, which we'll do shortly: you mentioned that I can do this in my language of choice. We've talked about the bottom half of this, where pipelines get executed, but what are my possibilities for writing a pipeline?

Pablo Estrada: Well, yeah. Today we have the two older ones: a Java SDK, which was the first one written, and a Python SDK. We have an almost fully supported Go SDK; we're releasing full support, I believe, next month. Oh, we also have SQL, so we support SQL inside pipelines written in the other SDKs, and you can also use a SQL console to write some pipelines. That's what we officially support. We recently worked on a prototype of a TypeScript SDK that we're trying to push toward something we can let others use. But yeah, that's what we have.

Kenn Knowles: Yeah. On this specific topic of using your language of choice, this ability to execute any language is a core part of Beam, as you say. Specifically, Beam comes with protocols and definitions for how to execute the user code inside all these pipelines, in whatever language. That's how the TypeScript SDK came together in a week: because we had these protocols, all you need to do is make a thing that can run your TypeScript code, then make the piece that submits the pipeline to the engine, and you can run it on any portable runner, right? So you can run TypeScript on Flink after a bunch of people get together for a week. And I want to specifically highlight that it's not dependent on the engine. This SDK now works on every engine that supports Beam's portable protocols, and you can mix and match languages within a pipeline, which is another way you get things off the ground quickly. We didn't write any code to be able to access, say, BigQuery in TypeScript or Go.

Kenn Knowles: We have very mature connectors there. The most mature, battle-tested connectors are written with the Java SDK, and you can plug them into a pipeline that you're writing in another language.

Eric Anderson: Got it. The fact that you can write a TypeScript front end for the framework in a week and have it execute on all the runners, using all the connectors, suggests quite a bit of, I don't know, modularity or something. Fantastic.
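To make that concrete, here is a minimal sketch of a Beam pipeline in the Python SDK. The file paths are placeholders; the point is that the same code runs on the local direct runner, Flink, Spark, or Dataflow just by changing the --runner pipeline option at submission time.

```python
# A minimal word-count sketch in the Beam Python SDK. 'input.txt' and
# 'counts' are placeholder paths. The engine is chosen when you launch,
# e.g. --runner=DirectRunner, --runner=FlinkRunner, --runner=DataflowRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Split' >> beam.FlatMap(lambda line: line.split())
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.MapTuple(lambda word, n: f'{word}: {n}')
     | 'Write' >> beam.io.WriteToText('counts'))
```

The cross-language connectors Kenn describes look much the same from the user's side: for instance, the Python SDK's Kafka connector is a thin wrapper that expands to the battle-tested Java implementation at runtime.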
Eric Anderson: So without further ado, let's talk about how this came to be, because I think there's a lot to tell. How far back should we go? Do we go all the way back to MapReduce? And maybe you can also embed the context of Google and the way they worked then.

Kenn Knowles: Yeah. Well, we can definitely go back to MapReduce. I'll start from the transition from MapReduce to this internal technology called Flume.

Eric Anderson: Yep.

Kenn Knowles: All right. At Google. So everybody knows MapReduce, probably. You have one shuffle: you do some stuff to your data, then you group it by key, and then you do some more stuff after you've grouped it. And people were chaining these together. Google then produced a system that could do that more efficiently, that could build a whole data processing pipeline and do better optimizations on it. So those are the two pieces of technology that came from Google. And what Google did to share those technologies with the world was write papers sketching how they worked, and they were reimplemented as Hadoop MapReduce and Apache Crunch. Of course, all of the systems out there now, like Apache Flink and Apache Spark, do the optimizations that come from Flume. I don't want to overstate things; it's not rocket science that you couldn't reinvent, so I don't know that they're derivative or anything, but everybody is fusing together operations and minimizing the amount of data you shuffle around.

Kenn Knowles: But yeah. So Cloud Dataflow is the externalization of the technology behind Flume, right? This is the data processing that powers just about everything at Google. And...

Eric Anderson: Let me cut you off, Kenn, and interject. So Flume is a code name inside Google, but there's also an Apache Flume, which...

Kenn Knowles: Not the one you're thinking of.

Eric Anderson: ... they are different, right? Let's be clear here. This is Google Flume, the descendant of MapReduce, which was becoming an external product on GCP under the name Dataflow.

Kenn Knowles: That's right. That's right. I first encountered Flume as FlumeJava, because I read the paper...

Eric Anderson: Oh, right.

Kenn Knowles: ... before I joined Google, and FlumeJava outlines this next generation after MapReduce. Yes. And so Apache Beam came out of Cloud Dataflow. Cloud Dataflow was the backend; it shared all the core technologies of Flume, because at Google we knew this was super powerful, we'd built the whole business on it. And we had an open source SDK: we distributed the Google Cloud Dataflow Java SDK. And I'm going to get into some technical details.

Eric Anderson: Yeah.

Kenn Knowles: Because it's part of the origin story to me, or at least how I think about it. We just needed the basics: you write a pipeline, you've got to test it, run it locally, try it out before you launch it on the cloud, right?

Kenn Knowles: And so we had this open source client, which is pretty typical, and it has hooks in it so you can run your pipeline a different way, not just on Dataflow. And then along come people from Data Artisans and Cloudera, and they just sort of jumped in: "Oh, hey, I can take this hook and make a runner on Flink, and make a runner on Spark." So that happened organically. People found the Dataflow SDK and thought, "Oh man, what if I run a Dataflow pipeline on these other engines?" And when we noticed that, we all got together and were like, "Oh, this is very cool." I think it's a pretty unique origin story for an Apache project that we actually had these three independent things going on, in different repos, and we dumped all three of them into the same repo in order to start Beam. Right?
Kenn Knowles: So it wasn't like an already successful open source project that migrated to the Apache Software Foundation, which is how things more typically happen, I think. This was a bunch of separate communities forming a new one at the same time they formed the software project. So I loved it. I was so happy to be on a project when something this wild happened. I probably skipped a bunch of interesting parts, but that's how Apache Beam came to be. We dumped the Spark runner, the Flink runner, and the Cloud Dataflow SDK all in one repo and started building.

Pablo Estrada: Jumping back for a second to when you were talking about FlumeJava: something that I believe also came into the Dataflow SDK was that the internal team at Google had built streaming capabilities, for analyzing data that is being continuously produced. I think FlumeJava initially had been built to run these kinds of MapReduce DAGs, directed acyclic graphs, and over time capabilities were also built, with Windmill, another internal Google system, to analyze data as it is produced. So instead of running a MapReduce or FlumeJava job every night, you would have a job running at all times, capturing data and either moving it, doing ETL with it, doing alerting, or computing some kind of statistics as the data comes in.

Kenn Knowles: I want to say that that was another example of a Google technology where we also wrote a paper. The name you'll find it under is MillWheel, which was the code name at the time; Windmill is the next generation of MillWheel.

Pablo Estrada: Ah, that might...

Kenn Knowles: So if somebody's out there Googling it up: yes, Dataflow includes the FlumeJava technology and the MillWheel technology. And that's where Beam's unified batch and streaming model originates: in these two really influential Google technologies.

Eric Anderson: Yeah. We'll do some glossary and links in the show notes to orient folks. Got it. So MillWheel, or its similarly named successor, came together with FlumeJava to create the Dataflow SDK, which is the open source bit, and the cloud engine. Then, with the SDK open source, folks piled in and suddenly you've got a project on your hands, and at some point someone said, maybe we should do an Apache something. How did that conversation go? Were people like, "I thought we wrote papers?"

Kenn Knowles: Were you there for that, Pablo?

Pablo Estrada: Nope.

Kenn Knowles: Oh, so I have to answer this. I was there for that. I was pretty low on the totem pole at the time, but it was a pretty top-down decision. Someone said, "We're doing this," because Apache has been synonymous with big data forever, right? And there were all these different Apache projects that are already related: Apache Flink and Apache Spark already existed. They weren't contributing yet, but we just saw that all of this was only going to thrive if we joined together. To be clear, the ability to run the pipelines on Spark and Flink was not in the Cloud Dataflow repo; they didn't contribute it to the Cloud Dataflow SDK. These were just separate projects maintained by other people. So yeah, we reached out to them: "Hey, do you want to put them into a blank repository and see if we can get it all to build together, continuously, and all that?"

Pablo Estrada: I might add, it's not only that they didn't contribute it to the repository. There was no way, or no formal way, to do it. Right.
There was no strict governance: what is the contributor license for this? Who owns this code if we bring it into a Google-owned Cloud Dataflow repository?

Eric Anderson: Got it. It wasn't that they were withholding it so much as there just wasn't a path or pattern for how to contribute.

Kenn Knowles: Exactly. It was what I call a throw-it-over-the-wall open source project at the time. We built the SDK internally and we dumped it onto GitHub. So we sort of reversed that: the source of truth became external instead of internal at the same time that we became an Apache project.

Eric Anderson: So I was also there for some of this, and maybe there are a few anecdotes we can dig up. You have to come up with a name when you make an Apache project.

Kenn Knowles: You do.

Eric Anderson: There were some other alternatives. I forget what they were now. But...

Kenn Knowles: Oh goodness, I can't remember. That's going to have to be homework for the listener; I do not remember the other names. But Beam is a bad pun, if that's what you're getting at. It's both batch and streaming, so it's just the words batch and stream smooshed together.

Eric Anderson: Yeah. "Stratch" doesn't work quite as well. No.

Kenn Knowles: Although it would be much easier to Google up.

Eric Anderson: Right, a bit more unique, "Stratch." But yeah. So Beam is batch and stream combined, which is a nod to the technology. Got it. Then tell me about the Apache way and transitioning from the Google way to the Apache way. This was new to me. I thought, "Yeah, we get the Apache validation," which I assumed was more of a licensing and governance decision and not so much a cultural one. But there are certainly some cultural elements, I came to learn.

Pablo Estrada: Well, one of the big things in Apache is: if it didn't happen on the mailing list, it didn't happen. Part of it is that you're very used to going to meetings with people and discussing things. Even if these people are not in the same company, you might still schedule time to talk to them and reach decisions together. But Apache has a very strong philosophy of being transparent and putting the community first, so these kinds of closed-door discussions are not favored. You can talk to people, of course, but what's important is to be transparent about the discussions and the work people are planning. They say: share your intentions. So yes, you can have meetings with people to brainstorm about your work, but in the end, it's important to keep the whole community in the loop. And the whole community means people in different countries, in different time zones, at different companies, who may have very different schedules. Maybe they look at the project once a month or even less frequently.

Pablo Estrada: So it's important to be very transparent about all of this communication. In Apache, everything is done on mailing lists, and projects pick the way they work; there are wikis, there are things like that, but the town square for Apache projects is the mailing list.

Kenn Knowles: Yep, I think you really nailed it there. There's so much variety in the communities. Putting my own spin on the Apache way: the point is that you can make a project that users can count on to outlive any particular stakeholder or contributor.
And in order to have that, it has to be easy for someone new to come along, figure out what's going on, and become part of it. It also has to be doable, like Pablo mentioned, for people with more sporadic engagement with the project: maybe they're a user who's getting paid but only needs to fix something once in a while, or a volunteer with limited time, or a full-time employee. I think it's actually remarkable that the foundation has come up with a governance model and a culture that allows all those kinds of people to work together in a fairly egalitarian way.

Eric Anderson: Yeah. Maybe a couple of elements that I recall as well: there's an incubation period, where you're listed as an incubating project, and then you can graduate to a top-level one. During the incubation period there are milestones you're reaching, and somebody from the Apache Foundation becomes a shepherd to the group, is that correct? In part they're there to instill the Apache way.

Kenn Knowles: Yeah, that's right. You have a champion, you have mentors; there's a whole system at this point. I've mentored a couple of projects now, and it's really interesting, because the Apache Foundation definitely has a culture of its own, and when projects come from different corporate environments, they are kind of different.

Eric Anderson: Were there any moments where you thought, "Man, what did we get ourselves into?" This Apache way thing, maybe the Apache way should be a little more like the Google way.

Kenn Knowles: No, I love it. I don't think so. But you know what? Internal to Google, it's already very similar, right? There's this concept of inner source, and it's not a word you hear inside Google, because you don't need it; it's already pervasive. You have this monorepo; if you need to fix something in somebody else's project, you just go do it. And there are mailing lists throughout Google. I just feel that while there is a membrane between internal stuff and external stuff, as long as you don't let that hold you back, it's not a huge leap between the Google way and the Apache way. That's my feeling about it.

Pablo Estrada: I feel like sometimes for people, and Eric, you might relate, it might have been a little shocking or confusing to have to work in this community. Maybe not everyone agrees with all of your proposals, and there's not a boss who gets to say, "Oh, we're doing this." So sometimes it can be a little confusing. But over time I've come to really appreciate it. I appreciate the openness. I appreciate, like Kenn says, the ability of a truly open project to outlive any contributor or any commercial interest built around it. I think it's awesome, and I really like the model Apache has where people represent themselves. It's a principle that allows companies to build commercial interests around a project, but it also prevents an overly powerful single business interest from pushing the project whichever way it wants, pushing people out, and things like that. Over time I've realized that it's really good.

Kenn Knowles: Part of that, I'll say, is that it also makes it futile for companies to try, which is good. If you did have a couple of really powerful companies that wanted to influence the project, the model makes it so that's not in their best interest.
It aligns the project's interests with what the companies can actually do.

Eric Anderson: Yeah. It's a bit of an immune system against overly commercial activity. Good. Maybe to table that and return to the history for a moment, because we breezed through it: any highlights we should recall, any stories? I'm thinking of times when you hit a milestone or discovered a new user that surprised you, that you didn't expect. That would be fun for us to hear.

Pablo Estrada: Something I could talk about: initially, as Kenn described, the Dataflow SDK was written in Java, and so the integrations with other runners were done directly in Java. Over time, there were investments in making the project truly language agnostic: any language you want and any runner you want. It took some time to figure out exactly what that looked like, I believe. Kenn was there longer than me, but the project went from being very Java focused, to figuring out that we really needed to fully develop these abstractions, this API that allowed us to be language agnostic, to actually having that language agnosticity, and recently to being able to build a prototype in a hackathon where we added a new language. Agnosticity, agnosticism. Anyway.

Kenn Knowles: It's hard for me to tell cool stories without feeling like I'm shortchanging everybody whose stories I don't remember off the top of my head. But I will call one out. I thought it was really cool when we were developing that, and we had a portable Dataflow runner that could run any language, and we were building out the Flink and Spark runners. I don't know if this is just a shout-out, but the folks working on the Samza runner just figured it out, and suddenly they were running Python on Samza. For me, my favorite thing when I work on bigger teams is when stuff happens that you didn't know was happening. So I'm calling that out: I just didn't know it was happening, and then it was done. So happy and impressed by that.

Eric Anderson: I remember when Scio showed up from the Spotify team. They seemed very excited about some clever things they wanted to introduce to the project. I don't recall what happened to that, but that was always fun.

Pablo Estrada: Some things have been introduced and some have not, but Scio is still there. It's a Scala wrapper for the Beam Java SDK, and, speaking from the Dataflow side, it was developed by Spotify, but I know they're not the only ones using it. So that's another awesome thing that came from a user and became a community contribution that benefits the whole community. And yes, there are very cool features developed in [inaudible 00:23:19]. They developed a shuffling algorithm that reduces the amount of data that actually gets serialized and exchanged between machines, because they were trying to reduce their costs for Spotify Wrapped, the yearly retrospective they do of the music you listen to. So really awesome stories.

Kenn Knowles: Oh, I just thought of another thing. I know the people, but I don't know which company it came from. Beam SQL was developed as a community contribution. In some sense, because I'm at Google, whenever something comes from someone not at Google, I'm like, "You're so cool."
And that's exactly what happened: it started outside of Google, and it was Ingin and Ingin, a team with about the same name, same handle. That's just another highlight of the awesome things: they were users, then they became power users, and then they wrapped SQL on top of Beam, and now that's heavily used.

Eric Anderson: We haven't covered it, but maybe it's worth giving a nod to, and folks who want to dig in more can. Part of what's special about Beam is how it addresses this problem of unifying batch and streaming that folks have struggled with, around representing time, timestamps, and temporality in general. Maybe you can just say a few words, and then we'll move on to where the project is today.

Kenn Knowles: I have opinions about this. I'm going to share an opinion that I'm still refining. I think that batch processing has always been used to implement streaming use cases, and that is sort of the origin of this. If you are running a daily job, your business is running continuously; you're just processing the data overnight. So what you're trying to do has always been streaming. It may or may not have latency constraints, but it is a streaming problem, and the whole continuum between batch processing and stream processing is applicable.

Kenn Knowles: So unifying these is actually really important, because you want to move towards streaming, but you certainly still need batch for when you make a change, want to run an experiment, or want to catch up. You need both batch and streaming even in a streaming-only context, which is what we're all moving towards, because streaming is a more natural fit for your business: it's nearer to the actual business you're trying to run almost all of the time.

Eric Anderson: Yeah. If I could jump in, I'm less the expert here, but I've often given a lot of thought to why we do batch or streaming for any given use case. And you're giving the implicit answer, which is that streaming is the superset of use cases; it's the base case for what we're trying to accomplish. It seems to me there are two reasons why we end up in batch. One is that we want to aggregate things, and so we run a job at the period of aggregation: rather than making that a function of the pipeline, we just make the pipeline run only when we want to aggregate. The other is that it creates a bit of modularity: I can run a pipeline and produce some files, and you can run a pipeline on those files, and we don't have to talk to each other; there are common interfaces for reading files.

Eric Anderson: So if I produce files, you can read files. Suddenly everyone can do their own thing, and it decentralizes the work. Whereas in streaming there's a bit of centralization: you have to be aware of the stream, you have to speak the same language the stream producer is speaking, and maybe there are proprietary interfaces you're not familiar with. So those are my two reasons why we ended up in batch, even though you're right: the world is a streaming world.

Kenn Knowles: Message queues are the streaming version of that file system interface, right?
You need to be able to throw stuff on the queue when the other person's not listening, and they need to be able to pick it up and process it.

Pablo Estrada: One thing is that, yes, the problem will often be a streaming problem, and then what happens is that current systems can't address it well. As an example, it's much easier to deal with your data schema changing in batch than in streaming. This is the case for Beam: if the schema of your data changes at the source, you have to relaunch your pipeline, or you have to deal with it somehow. It's more difficult to reason about that in a system that is running all the time than to say: we have a batch process that runs at some point before the schema change and at some point after, and we don't have to break our heads figuring out how to handle it. That's just one example of a technical problem that arises when a system tries to solve streaming problems.

Eric Anderson: Yeah, totally. Cascading batch processes are super brutal, right?

Kenn Knowles: The other thing you mentioned is the distinction between event time and processing time, which is something we have hammered home a lot. I think it's worth going into, because there's a high-level conceptual thing here. What you described, where you run the pipeline at the interval of aggregation, like your daily job aggregating daily stuff, has always been wrong. If you are doing a daily aggregation but you're trying to aggregate stuff from a few days ago, it doesn't make sense. You wouldn't just say, "Count everything, and run that every day." It makes no sense. What you would do, if you're in a SQL database, is say, "Count and group by day." So grouping your data by some piece of time that is part of the data has always been what you should be doing.

Kenn Knowles: And it has never made sense, not ever, to say, "I'm in a process and every minute I'll spit out what I got." That's fine if what you want is for your output to update every minute, but it's not at all fine if you want aggregations of stuff that happened within a minute. So event time is the term we use in Beam for the timestamp in your data: the timestamp in a time series or event stream that describes when the thing happened. If you want to know how many people logged on in an hour, you've got to group by that timestamp. And the tricky part is that it's very hard to know when you have all the data for a time period. That's the challenge if you want to, say, add up all the thingies that happened in an hour and you have unreliable networks or whatever.

Kenn Knowles: There's a classic list of reasons why you don't know that you have all the data for an hour, and you never will. Even when you're doing batch processing, you don't know. When you run your nightly batch, you don't know you have everything for that day; you probably don't. You might not even have everything from the week before. So it's inherent to the use case. Beam incorporates this concept where your data has timestamps in it, and those timestamps drive a watermark, which is just our way of approximating when we really believe an aggregation is complete. And you have to have some way of estimating that an aggregation is complete in order to solve a streaming use case, regardless of whether you're executing in batch or streaming.
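To make the "group by event time" idea concrete, here is a small sketch in the Beam Python SDK. The data is invented: each element carries its own event timestamp, the window becomes part of the grouping, and the watermark decides when a window is believed complete.

```python
# A sketch of event-time windowing in the Beam Python SDK. The (user, ts)
# pairs are made up; in a real pipeline the timestamps come from the data.
# Logins are counted by the hour they happened, not the hour they arrived.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as p:
    (p
     | beam.Create([('alice', 3600), ('bob', 3700), ('alice', 7300)])
     | beam.MapTuple(lambda user, ts: TimestampedValue(user, ts))  # attach event time
     | beam.WindowInto(FixedWindows(60 * 60))  # one-hour event-time windows
     | beam.Map(lambda user: (user, 1))
     | beam.CombinePerKey(sum)  # logins per user, per hour of event time
     | beam.Map(print))
```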
Pablo Estrada: I just wanted to reinforce Kenn's idea with an example that Kenn gave me at some point, which is invoicing. If you're trying to send invoices to your customers, you're trying to add up whatever happened in, let's say, the last month that is invoiceable. And you might get stuff that is processed late. Maybe you send the invoices on the second of the next month, and you get a report that something happened the previous month, but the report arrives on the fifth because people are processing it by hand. You don't just invoice that in the next month; you send an adjustment for the previous month to your customer. And this applies in streaming pipelines, where you have to decide what to do with delayed data. Yes, you can just clump it in with the next period of aggregation, but you can also do other things. So invoicing is a good example to reason about.

Kenn Knowles: I love that example, because I hadn't thought about it before, but banks have had event time and processing time forever. You look at your statement: there's a transaction time, when the transaction happened, and then there's when the transaction posted. When the bank sends you an invoice, they're not saying, "Here are all the transactions you made this month." They're saying, "Here are all the transactions that posted this month." So they're grouping by their processing time. If they wanted to do a proper grouping by when you made the transaction, they'd have to wait a little longer to send the invoice.

Eric Anderson: Great. And you mentioned watermarks. Maybe I didn't fully understand them at the time, but this watermark is a thing you can describe in Beam, in the programming interface, that says when it's probably safe to aggregate. You have control of it, but Beam also gives you defaults you can just run with. Is it a heuristic you feel is representative, or do you describe your watermark to Beam?

Pablo Estrada: I'll try to be as practical as possible. This all depends on how you're reading your data and where your data is coming from, but in the end, the watermark is just the system trying to estimate what data we've processed so far. Going back to the invoicing: when we send an invoice on the second of the month, it's because we're estimating that by then we will have received all of the transactions that occurred in the previous month. We're estimating that our watermark reached the end of the previous month, because our data from the previous month has arrived. And because we don't know the future, this is a guesstimate, usually depending on the source. The problem is we need to decide when to send the invoices, and our customers want their invoices soon. We could tell them, "Hey, we want to wait until the 15th of the month to send you the invoices, so that we're sure we got all of the transactions," but instead we make that trade-off.
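As a rough sketch (with invented values) of how that trade-off and Pablo's invoice adjustments can be expressed in the Beam Python SDK: emit a result when the watermark says a window is probably complete, but keep the window open for late data and emit corrections.

```python
# A hedged sketch of late-data handling in the Beam Python SDK. The customers,
# amounts, and timestamps are made up. In a streaming run, AfterWatermark fires
# when the watermark passes the end of each window; the late= clause re-fires
# an updated (accumulated) total for each late element, like an adjustment.
import apache_beam as beam
from apache_beam.transforms import trigger
from apache_beam.transforms.window import FixedWindows, TimestampedValue

DAY = 24 * 60 * 60

with beam.Pipeline() as p:
    (p
     | beam.Create([('acme', 10, 100), ('acme', 5, DAY - 1), ('acme', 7, DAY + 50)])
     | beam.MapTuple(lambda cust, amt, ts: TimestampedValue((cust, amt), ts))
     | beam.WindowInto(
         FixedWindows(DAY),                                   # daily billing windows
         trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
         allowed_lateness=7 * DAY,                            # accept data up to a week late
         accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
     | beam.CombinePerKey(sum)                                # total per customer per day
     | beam.Map(print))
```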
Pablo Estrada: And so the watermark is the way in which the runner, and particularly the source, the place we're reading from, usually a message queue, estimates the time up to which we've received or analyzed the data.

Eric Anderson: Great. Let's wrap up with where we are now. What's the Apache Beam project up to? It's been four-plus years since I followed things closely. Tell us what we have to look forward to this year.

Kenn Knowles: I'll start with a couple of small things. Pandas-compatible DataFrames are a new thing that's coming out. The Go SDK is obviously huge. So those are current events. And the portability story we mentioned is really coming to the fore: we've got this prototype TypeScript SDK and all of these connectors that are available to all the different SDKs. One thing I'll segue into: having these different language communities reaches different use cases and groups, and a really important development is in the Python arena. We've got this DataFrames API, and we also have integrations with machine learning frameworks. We call it AI now; there are a lot of AI integrations. The notable one, I think, is for serving predictions: there's a library in Beam now where you can do this, and I want to highlight the work around it. Beam is becoming a premier platform for AI engineering. And you need this.

Kenn Knowles: You might think that serving predictions is not that complicated, but the moment you start really trying to productionize it, you end up needing to deal with a lot of performance and deployment concerns: how to scale up your serving, how to deal with bad data, how to get metrics on how it's going. The name we have for it is RunInference, and that's coming out as the next milestone for Beam as an AI platform. Am I forgetting anything in that arena, Pablo?

Pablo Estrada: No, I was just going to mention one more problem that often happens, which we're trying to deal with, when you're working with large ML models, serving them, and doing inference: loading them into memory. A model can be gigabytes. So we're also working on how to load the model into memory once but still process in parallel with it. But yeah, I think those are the big things that come to mind.

Kenn Knowles: Yeah. I'll highlight how this relates to being a cross-engine, cross-language Apache project: we are building on top of scikit-learn and PyTorch and TensorFlow and TFX. Beam is a natural place for other portability projects, and so when you implement something like this in Beam, you naturally implement it as an abstraction on top of various other platforms.
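A rough sketch of what RunInference looks like in the Python SDK, assuming a pickled scikit-learn model at a placeholder path; handlers for PyTorch and TensorFlow follow the same pattern.

```python
# A hedged sketch of Beam's RunInference API. The model path is a placeholder,
# and the feature vectors are invented. The handler loads the model once per
# worker and shares it across the bundles processed in parallel.
import numpy
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

model_handler = SklearnModelHandlerNumpy(model_uri='gs://my-bucket/model.pkl')

with beam.Pipeline() as p:
    (p
     | beam.Create([numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])])
     | RunInference(model_handler)  # emits PredictionResult(example, inference) pairs
     | beam.Map(print))
```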
Pablo Estrada: I'll also mention, on the community side, that we have Beam Summit coming up, which is the biggest Beam community event. We're doing it in Austin, July 18 to 20, 2022. It's hybrid, so it's going to be live streamed, but we're also going to have people in person. If anyone is interested in attending, I'll share a discount code with Eric, and feel free to join. It's going to be fun: the first onsite Beam Summit since the pandemic started.

Eric Anderson: Yeah. I've been to a couple of conferences in the last three months, and boy, are people excited to see each other again. So I imagine the Beam community will be the same, getting together in the same place.

Kenn Knowles: Do you throw in one of those air horn drops at this part?

Eric Anderson: That's right.

Pablo Estrada: There you go.

Eric Anderson: Kenn, Pablo, great to see you again. We didn't talk much about the team, but I think what's maybe cool about this effort is that there are just a lot of people within Google working on it. Some open source projects start with an individual or a handful of individuals who represent the locus of control or thought, but Apache Beam is a village, right, that raised this child.

Kenn Knowles: It really is. And of course, it's nice to see you again as well. It's kind of wild: there's no benevolent dictator, and there's not even a candidate for who that would be. It's truly a distributed system. Sorry, sorry, I apologize.

Eric Anderson: You can subscribe to the podcast and check out our community Slack at contributor.fyi. If you like the show, please leave a rating and review on Apple Podcasts, Spotify, or wherever you get your podcasts. Until next time, I'm Eric Anderson, and this has been Contributor.