Git with Derrick Stolee
===

Paul: [00:00:00] Hi there, and welcome to PodRocket, a podcast brought to you by LogRocket. LogRocket helps software teams improve user experience with session replay, error tracking, and product analytics. Try it for free at logrocket.com today. My name is Paul, and joining us is Derrick Stolee. Derrick is a principal software engineer over at GitHub, and we're going to be talking a little bit about Git 2.41 and some of the recent blog posts Derrick has put out about packed object storage, commit history, and the ways that Git actually stores data on your file system, and then getting a little bit into Scalar and how Git fits in with monorepos and larger projects. Welcome to the podcast, Derrick.

Derrick: Thanks for having me, Paul.

Paul: So you joined us back in 2022, and we had you on talking about Git a little bit. But for people who didn't tune into that episode, could you tell us in your words who you are and what you do with Git?

Derrick: I'm a software engineer, and [00:01:00] I've worn many hats in my life. I used to be a mathematician doing research math, but then I switched to software engineering, and I've been doing Git-related things ever since. I did a little bit of server-side work, I've been doing client work, and now a little bit of server side again, but a big chunk of that has been contributing to the open source Git project: a lot of the performance work we really need on the client side to scale to monorepos. When I was at Microsoft, that meant focusing on the Microsoft Windows and Microsoft Office monorepos and making sure we could scale Git's operations to handle those giant repos that are just so much bigger than anything you see in the public open source world. But we also try to make sure the gains we make for those large repos are easy for regular Git users to access: enabled by default when possible, or at least easy to get to when they can't be on by default.

Paul: Do you feel like the things you studied as a mathematician drew you into this sector of computer science?

Derrick: I was actually a [00:02:00] specialist in computational math; I did computational approaches to pure graph theory problems. This idea of graph theory is going to be a common theme in our discussion, because underneath the hood, Git is a graph. That's part of the reason I got involved in Git in the first place: I was hired because of my graph theory skills to solve a specific graph problem that Azure DevOps was having with commit history. Then it expanded out from there to figure out all these other things.
And I'm definitely very attracted to problems that can be solved from the very bare bones: working in C, dealing with data structures that we have to write to disk and read from a memory map. That's something that's really exciting to me, and I really love being a part of it. It's so rare that we get to do something like that, as opposed to having the abstraction layer of a database on top. I'm actually in the nitty gritty, asking: where is the data going? How can we compress it? Where do we not need to compress it? Could we restructure it? Could we iterate over it differently? It just works for my skill set and the way my brain works, and it's [00:03:00] really exciting to work on.

Paul: I think it's interesting that you bring up that there's no database layer on top, and I'll segue into the blog posts you've put out. Derrick has a series on Git internals, about how Git works, and they're fantastic. If this is maybe your first time stepping into the Git internals, they're fantastic blog posts. In one of them, on packed object storage, you mention that we're essentially working with a database here; Git is a database. How does that compare to a typical database somebody might think about? Your average web developer is reaching for SQLite, right? That's a file-system-backed database. Can you break that down for us?

Derrick: The biggest thing I wanted to get across with this blog post was that a lot of us developers understand certain things about databases, even if we're not building databases ourselves. We're running queries, and we need to know how the data is structured and what the performance of those queries is. If we build our database schema incorrectly, or we write poorly [00:04:00] structured queries, our database will perform poorly. I wanted to take that perspective, that if you learn a little bit about your database and ask it the right questions, you can do really cool, really scalable things, and apply it to Git: think about what's going on underneath the hood, just as you'd say, "I need to know how my SQL table is indexed so these queries will go quickly." The first post in particular says that all of your Git data is essentially two tables. You've got your refs, which map plain-text, human-readable names to the 40-hex-character SHA-1 values, and those point into your object store, which maps those hashes to their contents. It's a content-addressable data store: if I have this 40-hex string, I can go ask which object that is, and it's essentially a document that Git then parses and says, "I can see this is a [00:05:00] commit. I've got my commit message here, I've got my parents, I've got my root tree." The refs give us pointers into that object store, which is then a graph that refers into itself.
So if I want to go look at the README for, say, the Git repository, I go to GitHub and it loads the page. It says: okay, the default branch is master; let's look that up in the ref storage; it's pointing to commit abc; let's go look at that object. It's got this message, and it's got a root tree, which describes its root directory. Let me look that up by its object ID in the object store and pull its contents. It's got a list of entries, all the different subtrees and paths at that level, and I can find which one is README.md. That's pointing to an object; go to that, and it's just the file contents of README.md. So I can step from [00:06:00] a ref to a commit to a tree to a blob, and that all needs to happen really fast in order to serve that content. Then think about that at larger scale, where you say, "I want to check out something," and you need to go walk all the trees, find all the blobs, and put them on disk, right, when you're doing an initial clone. All these things about what exactly is happening under the hood really help you understand why Git would slow down in certain cases. Or even simple things like: hey, when I do a rebase, why do my commit IDs change? Well, I'm changing the contents of the commits, so their hashes are of course going to change. You have to think about it as: I'm not editing a commit, I'm making a new one that looks like the old one with some differences. That's the kind of introduction to those internals I really wanted to get across.
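That walk can be reproduced with Git's plumbing commands. A minimal sketch; the object IDs shown are shortened, hypothetical placeholders:

```sh
# Resolve a human-readable ref to a commit ID (the "refs table" lookup).
$ git rev-parse master
d486ca60a51c9a...

# Print the commit object: its root tree, parent(s), author, and message.
$ git cat-file -p d486ca60
tree 7bf76a8f...
parent e83c5163...

# Print the root tree: one entry per path, each pointing at a blob or subtree.
$ git cat-file -p 7bf76a8f
100644 blob 67f8c1d3...    README.md
040000 tree 9a1b2c3d...    Documentation

# Print the blob: the literal file contents of README.md.
$ git cat-file -p 67f8c1d3
```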
Paul: Gotcha. That really shed some great light for me, learning the verbiage you use to describe the different abstractions: we've got blobs, we've got refs. Modeling it in the sense of a database definitely feels natural, and the blog posts are an ergonomic way to actually digest that idea. For people who don't know what happens when you [00:07:00] check out, I'm just curious: where does that data live on disk, under the hood? Because it's not in my current directory.

Derrick: Right. Your working directory is all the files that have been checked out and decompressed in all the different ways, into exactly what you'd build from to make your final application. But inside your working directory there's a hidden directory called .git, and that's where all of Git's internal data is stored. It's a dot-directory, so it's hidden from view by default, but if you go look, you can find it. Inside, there's a directory called refs, which stores all the reference names and things like that, and there's a directory called objects, and that's where your object store is. At the very base level, say I add a new version of my README file and use git add to stage it. Git says: let me take that content, let me hash it, and I've got a new object ID. I see I don't have that one already, so let me create what's called a loose object. Inside .git/objects, Git takes the first two hex characters and makes a directory, and the other 38 make a [00:08:00] file inside that directory, and it writes the compressed contents using zlib compression. That's a very simple way of saying: I've got my file on disk, and I have now stored it somewhere. (There's also some other bookkeeping pointing to it.) But if every single object were stored in this loose way: that two-character directory, there are only 256 of those, so you're going to end up with huge, wide directories full of a ton of files, and it's going to slow down. So the next thing Git does is say: I've got enough loose objects, let me collect them into what's called a pack file, which at the most basic level you can think of as those loose objects concatenated together: compressed and stuck together into one file. Then there's another file called the pack index, which essentially says: you want to look up this object ID? I've got a sorted list of object IDs and pointers into that pack file, telling you where that object starts. So the pack file feels like a table [00:09:00] of data that you'd have to scan linearly, checking all these hashes; the pack index lets you jump straight to the object. There are extra benefits once you're in the packed store. Git is designed for source code; it's designed for people working on files and editing a few lines at a time. They're not usually saying, "Oh, this file has absolutely nothing to do with the previous version." So we can actually store a delta, which is essentially a diff from the previous version: take this initial segment of content from the previous file, change these lines, then take the rest of the content from the previous file. A combination of those kinds of instructions reconstructs the file from the previous version. That lets you take a hundred different versions of the same file that share a lot of content and really reuse that shared content over and over again. So when you're doing a clone, yes, you're downloading every single version of every file that's ever existed in that repository, [00:10:00] but they're compressed in such a way that you're not downloading it as if it were a bunch of full snapshots. You're really saving a lot of time, and on the fly, Git is decompressing these things really quickly. It doesn't need to do much, because the disk cost is so much more expensive than the in-memory and CPU cost of decompressing. So it saves us both space and time for the reader. Now, computing those deltas is kind of expensive, and that's one of the challenges we have at GitHub, as server maintainers: making sure those things can be computed in a reasonable amount of time. But as far as the client is concerned, it's generally just reading really quickly and working with what the server gave it.
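You can poke at both storage formats yourself. A small sketch, runnable in any repository (the object directory and pack file names will differ in yours):

```sh
# How many objects are loose, and how many live in pack files?
$ git count-objects -v

# A loose object: two hex characters name the directory, the other 38 the file.
$ ls .git/objects/67/

# Roll the loose objects up into a pack file plus its .idx pack index.
$ git repack -d

# List a pack's contents; deltified objects show their chain depth and base.
$ git verify-pack -v .git/objects/pack/pack-*.idx | head
```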
Paul: Derrick, you certainly have a very good handle on everything; we just learned a huge amount about Git internals. The pack files are interesting to me. I've certainly looked in the .git directory and wondered what was hanging out in there, so thank you for shedding some light on that. At a [00:11:00] higher level, do you ever look at the product you and the team have built and think: I wonder if people use it for something other than file storage? When you describe these instructions for changing from one state to another, I'm thinking about things like database migrations. Have you seen this out in the wild? Do you hope to see it? Do you never hope to see it?

Derrick: I think Git is a very specific tool that's really good at what it's built for, which is managing multiple versions of human-generated text content. All these things fit into that, right? I mentioned delta compression is really the idea that a human has gone in and edited a few lines or added a few lines. They're not writing a hundred megabytes of meaningful data in one go. You'll eventually get there with some files, or with combinations of files, but it's always small contributions at small points in time. When I find people using Git in weird ways, those are the kinds of problems we see. "I'm going to start checking in my [00:12:00] computer-generated config file," which has a really compressed view, or a JSON file that's all on one line instead of nicely formatted, so all the contents look completely different each time. Or, "I'm storing log data, so every second I'm committing some new lines to a file." Those things delta-compress well, but there are so many of them that it becomes unmanageable for Git to be the thing tracking all those versions. So it's about making sure that if you're going to do something different from human-generated source code, you know where the limits are and what kind of effect it will have. One of the biggest things we see is that game developers have a lot of trouble using Git, because they want to use these large binary assets for their art, and they want that co-located with their code. Those binary assets change completely every time. They're not source code. They're human-generated, in the sense that artists are doing the work to create them, but they're essentially stored in a compressed format, so Git can't do any of this delta compression. It's storing a [00:13:00] full copy every time you change it, and that's just not what Git was built for. Your pack files start to get filled up with these really giant chunks that are just images, and the other pieces where Git needs to be performant end up fragmented across the pack file. It's a big reason why a tool like Git LFS exists: it essentially lets you have it all inside your repository, but the big binaries are stored in a system that's okay with "I care about this version only when I care about it, not all the time," and you can download each one separately and store it in a different place. So that's one of the ideas: where Git breaks down, there's maybe a tool out there that helps you extend it to do something different.
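For reference, the Git LFS workflow described above looks roughly like this; the file pattern and paths are just examples. Tracked files are replaced in the repository by small pointer files, and the real contents live in separate LFS storage:

```sh
# One-time setup of the LFS filters on this machine.
$ git lfs install

# Track large binary assets by pattern; this records the rule in .gitattributes.
$ git lfs track "*.psd"
$ git add .gitattributes

# Commit as usual: the .psd contents go to LFS storage, not into pack files.
$ git add art/title-screen.psd
$ git commit -m "Add title screen art"
```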
Paul: We're about to dive a little bit deeper into the topic of efficiency and some new features of Git. Right before we do that, I just want to remind our listeners that this podcast is brought to you by LogRocket. LogRocket can help you find and surface errors faster in your web application, so you can spend more time building a [00:14:00] great app and less time debugging and digging through the console, with features such as tracking what your users are doing in real time and the power of AI to find patterns and meaningful statistics to help your developers. Head over to logrocket.com today to try it for free. So Derrick, we have Git 2.41 coming out. For those who are wondering: is it "four point one"? No, it's 2.41, all one number. In one of the posts talking about how Git is going to start to serve larger projects and different types of projects, some of the bigger key features you noted were partial clone, sparse checkout, and background maintenance. Of those three, which one do you think we could dive into first, to talk about why it's one of the greatest features?

Derrick: Well, those three you mentioned are not [00:15:00] super new to 2.41, but they are what I've been thinking about in terms of: if you really want to get to that next level of scale, where vanilla Git isn't working for you, this is the next step. Partial clone is the one I think is easiest for people to grasp, because it's something you can just do at clone time. That's the only time you need to make a change, and everything else works the same; it's just that the trade-offs are different. Again, the idea I mentioned: at a clone, you're downloading every single reachable object. I want to get the tip refs, so I have to get their entire commit history and all the objects reachable from them. That way, after my clone is done, I can check out any version of the history. That's how Git is built; it's fully distributed. I don't need to rely on that server anymore to do my job. Well, partial clone breaks that complete independence by saying: let me just get the objects I care about. The default, I think, is a blobless partial clone, which means I don't have every version of every file, but I still have every commit, and I still have every tree, which is the [00:16:00] version of a directory. That's what it initially downloads. Then you do your initial checkout at tip, and now it actually downloads those blobs, just the tip versions. Those probably won't have any deltas, because they're all from different files. And you put those on disk.
Now your repository works. Say I want to check out a different branch. Some of the files will be in common; some of them will be different. And you actually need to go to the server again at that point to get those missing files. So you're essentially saying: I want my initial clone to be really fast, and I'm willing to pay a little extra later for just the things I need. This is really critical for really big repos, where your history is so long, you've got a million commits, and you're really not going to need those older versions. You just want to be able to do your work and move forward from this point. And yes, checkout is going to be a little slower here and there. Maybe you run git blame, and it's quite a bit slower, because it has to get every version of that file in history. But then that stuff is [00:17:00] local, and if you do it again, it's going to be fast. So it's about understanding where that trade-off is. If I get on a plane and I don't have wifi, I might not be able to do something; I might not be able to change my checkout. I can still work forward from where I'm at; I just can't necessarily go back to a different branch that I haven't fully gotten all the data for. So again, it breaks that fully distributed idea that you have absolutely everything you need, but it's super important for people for whom that initial clone is too expensive. Maybe there's something deep in the history that made your repository giant. You checked in large binaries, and you don't want to rewrite history to get rid of them, but you've fixed your tip, and your tip is now a much smaller repository. Partial clone is a great way to fix that, by saying: I'm not going to download that big file that's way back in the history and not at [00:18:00] tip anymore. I'll only download it if someone actually needs it. So this is really good for repos with a lot of history, where you're very unlikely to do deep file-history work on your client machine. You can use the web for that; the web is going to have really fast file history if you need it, on something like github.com.
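A minimal sketch of that trade-off, using the Git project itself as the example repository:

```sh
# Blobless partial clone: all commits and trees, but no file contents
# until a checkout (or blame, diff, etc.) actually needs them.
$ git clone --filter=blob:none https://github.com/git/git.git

# Commit history is local, so log stays fast, even offline.
$ git log --oneline -- README.md

# Blame needs old blob versions, so the first run fetches them on demand.
$ git blame README.md
```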
Paul: Do you find that this feature was born out of an ingrown need, because you expected projects to grow big, or was it more an observed thing, maybe from misuse or incorrect use?

Derrick: This absolutely grew out of our own need. When we were talking about getting the Windows monorepo onto Git, we knew that not only could we not download the full history of every file in the Windows repo in a reasonable amount of time, we couldn't even have every file at tip. If you did a full clone, at the very start it was a hundred gigabytes of packed Git data, and then you do the initial checkout and it blows up to 300 gigabytes in your working directory. That's the [00:19:00] size we're talking about with these big monorepos. But if we say, hey, let's remove the blob history, that hundred gigabytes goes down to something like ten gigabytes, or a gigabyte; I misremember the numbers right now, but it was something that's still not super fast, yet reasonable. And the cost of keeping the commit history and the tree history was super valuable, to make sure you could do things like look at the history, check out a different branch, or do a merge. Then there's the question: what do I actually need at the tip? That's where Windows used a virtualized file system approach. That was very heavy-handed, but it was necessary for their build system, where things are a bit messy after working on it for 40 years. They don't necessarily have a concrete way of saying, "I need these directories, and that's it; I know I'll be able to build from those." Their build system dynamically discovers what it needs as it goes along. So the virtualized file system fills things in as necessary: "Oh, I don't have this [00:20:00] file; let me go ask Git for it," and Git will download it and put it on disk. That was the approach there.

Paul: Gotcha. Okay.

Derrick: That's not super widely available, because it's really not for the typical person. But one thing you can do that is in core Git, which you mentioned in your list, is sparse checkout: a way to take your working directory and focus it on only the files you actually need to do your work. You're contributing to a monorepo; when you do a pull request and run CI, it runs the whole suite of things. But you as a developer don't need to have x, y, and z from a different component. You can focus on your component, build things locally, and let CI handle the deep integrations with all the other components. Focus on your piece as much as possible. That way, with partial clone and sparse checkout together, you don't have the old versions you don't need, and you don't even have the files at tip that you don't need. You only care about the things that are right in your focus zone.
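A rough sketch of combining the two; the repository URL and directory names here are hypothetical:

```sh
# A blobless partial clone that doesn't check out any files yet.
$ git clone --filter=blob:none --no-checkout https://example.com/monorepo.git
$ cd monorepo

# Cone-mode sparse checkout: top-level files plus only the listed directories.
$ git sparse-checkout init --cone
$ git sparse-checkout set components/word shared/build
$ git checkout main
```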
Paul: [00:21:00] Do you see partial clone ever stepping into a role-based sense of authority, where this team can work on this set of blobs and that team can work on those?

Derrick: There is a version of partial clone like that. I said blobless, right: ignore all blobs, don't send them to me until I ask for them dynamically. But there is a version that essentially combines with the sparse checkout definition and says: I care about everything in these directories; give me everything in those directories throughout history. The difficulty is that actually serving clones like that is really expensive, very CPU expensive, and we haven't really pursued it; it's a combination of there not being super high demand for it over blobless clones, and it being super expensive to make fast. We have this thing called reachability bitmaps, which I mentioned a little bit in my blog series: we use these bitmaps to [00:22:00] quickly compute which objects you need on the server side when the client is trying to fetch or clone. It allows us to skip a bunch of graph walking through the object store (go grab a commit, grab its tree, parse it) and instead walk the bitmaps, which is a really fast way of solving that. But as soon as you say "restricted to these paths," suddenly those bitmaps don't work, because the bitmaps aren't focused on path-level scopes. There's talk of, okay, what if we created some bitmaps that are path-scoped, and then took the union of some bitmaps? But that's theoretical at most right now, and from what I've seen, the cost of just filling in the history of files when you need them is not that much more expensive than something like this. So blobless clones are the thing we've been seeing people use, significantly.

Paul: On the topic of these big projects, can you talk to me a little bit about Scalar? It's already out, and people can use it to handle their larger projects, right? And I know it used to be sort of [00:23:00] a virtual file system as well. Am I correct?

Derrick: You're right that it was definitely born out of that. To back up: my team at Microsoft was working on VFS for Git, which was the solution for the Microsoft Windows monorepo. Then we re-shifted focus. Windows was on Git; they were happy enough, we were still making improvements, but they weren't the focus anymore. Office was coming, and we needed to worry about getting them on board. We had a different set of challenges, one of which was going cross-platform to macOS, because they ship macOS applications of their Office products, so Mac developers are a big part of their monorepo developers. So we tried to build a version of VFS for Git for Mac; the idea was, let's just drag and drop the solution for Windows over to Office and we'll be done. There were a bunch of hurdles with that, including a deprecation that made it technically infeasible on the macOS side. But there was also, at the root of it, a difference in their monorepos that made things not work quite right. Office is a really [00:24:00] componentized build system, where each directory at root is its own build unit, and they control the dependencies between those units really carefully. On one hand, that means their build system can pick what it needs in advance; on the other, it means they have over 2,000 directories in their root folder. And something like that just looks really bad when you're virtualizing it. You go to Explorer and look at it and say: it looks like I have 2,000 directories in here; I get lost trying to find the things I actually care about. Which is not the experience they had before.

Paul: Just for my understanding, could this be similar to how Yarn workspaces are set up, if you're familiar: root-level, independently built packages?
Derrick: I'm not familiar with Yarn, but I bet that's a very similar idea.

Paul: Sounds in line. Gotcha. Okay.

Derrick: Within each project they have their own build system and their own management of dependencies, but across those boundaries they're very restrictive. But yeah, users weren't expecting it to look like it did in the virtualized file system, because they wanted [00:25:00] just the thing they were focusing on. If I'm a Word developer, I don't want to see all these PowerPoint directories; I don't want to see Excel. So sparse checkout was where we got involved, as an alternative to using the virtual file system. We needed to build a new sparse-checkout builtin so it was easier to use, and we needed a new pattern-matching algorithm so it was fast enough. Essentially, we took the bones of VFS for Git, removed the virtualized file system part, and kept all the other pieces: how we were doing our version of partial clone, which was custom to Azure DevOps with its own implementation, and the background maintenance that was in there that we were reusing. We brought all that over, and then realized the architecture was wrong. Since we don't need a virtualized file system, we don't need a running process alive all the time, ready to receive an event fired from the file system asking, "Where's this content?" So we started taking things [00:26:00] from that managed-process layer and sticking them into our fork of Git. And the more we did that, the more we said: hey, wait a minute, this isn't just a thing for us; this is a thing we could make for everybody. So while the Microsoft fork of Git with Azure DevOps will give you custom things specific to that environment, we were able to upstream the bare bones of Scalar, the very fundamentals of it, so it's now available from Git 2.38 and later to every Git client. Not only is the git executable installed on your machine, but the scalar executable is as well.

Paul: Automatically?

Derrick: Automatically. If you've installed a newer version of Git, Scalar is on your machine. The way I like to describe what it has morphed into is that scalar clone is git clone with all the bells and whistles on. By default, it'll give you a blobless partial clone. By default, it'll start you in a sparse checkout. By default, it'll [00:27:00] start up background maintenance. These are all features you could enable via Git if you know the right custom options, but we could never make git clone do that; people expect git clone to work the way it works. So this is a way to say: we built this for ourselves, but if we put it here in Git, more people can use it. It's a way to get the best and most exciting scale features in people's hands. And if you didn't want a sparse checkout, you need to do something different, or disable it once you've cloned. Or if you didn't want a partial clone, then maybe you shouldn't have been doing scalar clone; you can just do a regular git clone. It's assuming: hey, I've got a big repo; give me everything that makes bootstrapping as quick as possible and keeps me as efficient as possible once I'm in there. It even sets some custom, optional Git config on your repo: you probably want the file system monitor enabled, which is going to make your git statuses really fast; you want these extra x, [00:28:00] y, z things. You can find that list in the code. There's a lot of cool stuff in there, things that could be on by default but that, for historical reasons, we've left off for regular git clone. For scalar clone, we can turn those knobs up to 11.
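In practice that looks something like the sketch below; the URL is a placeholder, and the second half is only an approximation of the individual features Scalar turns on:

```sh
# One command with the scale-oriented defaults: blobless partial clone,
# cone-mode sparse checkout, background maintenance, recommended config.
$ scalar clone https://example.com/monorepo.git

# Roughly the equivalent plain-Git setup, assembled by hand:
$ git clone --filter=blob:none --sparse https://example.com/monorepo.git
$ cd monorepo
$ git maintenance start
$ git config core.fsmonitor true
```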
Paul: Can you talk to me a little bit about that feature you just mentioned briefly, the file system monitor? What is that?

Derrick: Oh, this is a really cool thing. My colleague Jeff Hostetler upstreamed this. It again comes from something we figured out with VFS for Git: we had to integrate with a file system driver that was sending us every single file system event, saying somebody's reading here, somebody's writing here, and we were keeping that in memory. That allowed us to say: hey, wait a minute, if we have all this data about what's going on in the file system, could we tell Git about it, instead of Git needing to go to the file system itself? Git says, "I've got a copy of what I think the file system looks like in my index," but if somebody runs git status, it needs to double-check: has anything changed since the last time somebody ran a status? So [00:29:00] it starts groveling the file system. It starts reading directories, and if a directory changed, it has to dig in deeper. And that can be really expensive. So what Jeff built is a new daemon inside of Git, another long-running process that can spin up in the background and says: I want to listen to all the file system events from your working directory. If you change a file, the operating system tells Git, "Hey, by the way, this file changed," and the daemon keeps a list of those events. Then when git status runs, it can say: somebody just asked me for status; what's changed? The file system monitor can say: your last index was at this timestamp, and I've seen these three events since then. And status can say: great, I'll use those three events, incorporate them into my view of the world, and never touch the file system, so I'm going to operate a lot faster. This prevents that really long tail of git status groveling the file system and taking a long time, because those [00:30:00] operations are very incremental; you can't really stream them nicely, you can't parallelize them nicely, but they go all the way to disk, and it's definitely the slowest thing Git does a lot of the time. So this really speeds it up. Now, the one caveat: we have a macOS and a Windows version of this, but the Linux file system monitor does not exist yet,
partly because inotify has a size limit on how many inodes you can watch, and fanotify works at the root. We haven't figured out the right interface for interacting with Linux. But the good news is that Git was built with Linux in mind. The reason FS Monitor exists on these other platforms is that their file systems don't work the way Git expects, or the way Git was optimized for; it's making up for that deficiency. And Linux is still pretty fast without the FS Monitor, because of the way Git was designed to work on Linux.

Paul: Do you find that feature more popular with an enterprise audience, or [00:31:00] could really anybody benefit from it? Because you mentioned it's part of those features that come with Scalar.

Derrick: Yeah, it's one of the features that comes with Scalar. I haven't looked at the performance numbers recently, but the benefits really don't start showing up until you have about 10,000 files in your working directory, and it's around a hundred thousand that you start to feel like you need it. One of the biggest milestones we got to with the Office monorepo, with sparse checkout, with FS Monitor, and with an extra thing called the sparse index, all together, was just under a second for most git statuses. But that's still 900 to 1,000 milliseconds for a git status, which is not a great experience if you're used to running it in a repo like the Git project, with 50,000 files, where it's sub-100 milliseconds. That's the kind of fast you're expecting. Without those features, though, it would take six to ten seconds [00:32:00] for a very typical case. So that's the kind of barrier: it goes from super painful to good; it's not terrible. If you're feeling like you're in the one-second range, it'll probably get you down to sub-half-second; that's my gut feeling, depending on where you're at in your monorepo. So it's definitely something worth trying. You can do it just by setting core.fsmonitor to true in your Git config, but again, Scalar will set it up for you. It's one of those things where you don't need to know the feature exists or which config turns it on; Scalar just says, "This is probably good for you if you think you're big enough," and does it.
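Trying that by hand is a one-line config change; the daemon itself is managed for you:

```sh
# Enable the built-in file system monitor (macOS and Windows today).
$ git config core.fsmonitor true

# The daemon starts with the next command that needs it; you can inspect it.
$ git status
$ git fsmonitor--daemon status
```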
Paul: Is there anything people trying out Scalar should know about that's not an automatic bell and whistle, but is still something very potent for improving their workflow?

Derrick: One thing I like to talk about is the background maintenance aspect. We touched on it briefly, but getting background maintenance into Git was one of those things I thought was really interesting. It didn't affect people on Linux very much, but on Windows it [00:33:00] does, because on Windows, the way Git would launch processes in the background doesn't work; the API just isn't there on Windows for the way Git wants to do it. So instead of launching git gc --auto in the background, like it does on Linux or macOS, it would run it in the foreground. Your fetch finishes, and now it says: oh hey, I've noticed you've got such-and-such number of pack files and loose objects; I'm going to repack them now. Like a vacuum, it takes all these objects you have and zips them up into one big pack file, essentially rewriting your entire object directory. If that's a gigabyte in size or something, it's going to take a while to rearrange all that data, recompress it, create a new file, and delete the old ones. And that's all while you're just waiting for your fetch to finish so you can go do the next thing, right? So background maintenance interacts with your scheduler to say: let's run these kinds of operations to keep your objects directory clean and well maintained, and let's do it [00:34:00] in a way that never disrupts you in the foreground. It's not going to do something that completely rewrites everything and deletes it; it does something more incremental and keeps everything nice and neat. And now that you're no longer running this gc --auto process at the end, everything is a little bit faster, because you're never running that process, and those cases where you'd get tripped up and wait never happen.

Paul: Those would never happen, right.

Derrick: Now again, if you run git clone and then go into your repository and say git maintenance start, that's the same thing. The feature is there; it's really easy to enable if you know about it. But scalar clone runs it as part of its operation.
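For an existing clone, enabling it looks like this:

```sh
# Register this repository for scheduled background maintenance
# (prefetching, commit-graph writes, incremental repacks, and so on).
$ git maintenance start

# The repository is recorded in your global config; verify with:
$ git config --global --get-all maintenance.repo
```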
Paul: And that's just the first time, when you run the clone, correct?

Derrick: Yeah. Scalar was really designed for: I want to get started from scratch; I don't have anything on my machine; I want to go work with this big repo; how do I do it? scalar clone can do that. There's a second command, scalar register, for when you have an existing repository: hey, I cloned this one already, [00:35:00] but I want as many of the bells and whistles as will work. scalar register will do that. It'll turn on the background maintenance, and it'll turn on features like FS Monitor. It can't change your clone to be a partial clone, and it's not going to change whether you have sparse checkout enabled; that's for you to decide based on what you have. But it will say: let's be as efficient as possible from here on out, with all these other settings. The nice thing is also that, at least for Git for Windows, the installer has a step that reconfigures all of your Scalar repositories during an upgrade. That way, if the recommended configs have changed, those get updated in the repositories that were registered with Scalar. We don't have control of the installers on the other platforms, and a lot of times you're just installing binaries, as opposed to running some sort of wizard. But you can always run, I think it's scalar reconfigure with the --all option, and it'll update all your Scalar-registered repositories to have the latest and greatest features as we're adding them, right? Because Git's not done [00:36:00] making cool features. So we've got these built-in mechanisms for keeping people up to date.

Paul: Could you tell us about something that's not included with Scalar right now that the team is looking forward to putting out?

Derrick: Yes. One thing we're looking forward to: a feature we've built that's not integrated with Scalar yet, but is integrated with git clone if you know about it, is this thing called a bundle URI. One of the things we had in the VFS for Git world, since we had our own custom protocol and everything, is that we wanted to do essentially partial clone, but we wanted to save server resources on computing what we were handing out. So we had these things called cache servers, which had a bunch of Git data. Instead of being in the cloud with Azure DevOps, they were located near build machines, or in the lab with all the developers, and they pre-computed the commits-and-trees packs. That way, when you cloned, you just downloaded those files that already existed, and then got the rest dynamically from the server. This bundle URI feature allows you to do that with Git. It requires an [00:37:00] external server that someone has set up to compute these bundles, but you say git clone with your normal URL, and then you add --bundle-uri= pointing at a bundle list or a specific bundle. It downloads the object data, it downloads a set of refs, and you essentially apply that to your local machine. Then you go to the remote and say: okay, give me what's missing; I have these things; give me what's left over. So you still get the absolutely up-to-date stuff the server is giving you, but you're starting from this batch of pre-computed data. The benefit is that it's already computed; the server doesn't need to do it. If you have, for instance, a GitHub Enterprise instance that's just overloaded, too busy with everybody doing everything, a bundle server can offload some of that load. Or say you're hosting on github.com, which is hosted in the US, and you're in Europe or Asia, and you want to get most of your big data locally, from a machine that's on [00:38:00] your premises, or at least in your country, before going to github.com for the remainder: you can set up a bundle server.
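A sketch of that clone invocation; the bundle server address is a hypothetical placeholder:

```sh
# Bootstrap most of the object data from pre-computed bundles, then
# fetch whatever is missing from the origin server as usual.
$ git clone --bundle-uri=https://bundles.example.com/git/bundle-list \
    https://github.com/git/git.git
```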
My team has open sourced this. We built the feature into core Git, and that's available now, but we also open sourced a bundle server: it's at github.com/git-ecosystem/git-bundle-server, with dashes in between everything there. It's essentially a tool that says: no matter what your host is, whatever your server is, we're going to clone it, create this list of bundles, and serve a bundle list. You can also set up authentication, so people can clone using these bundles and get that data at a much faster rate, depending on how you set up your server and the localities involved. The main target right now, especially in the early phases since we open sourced it, is: say I have a CI farm, and I want to bootstrap a new CI machine with a full clone. One thing you can always say is, if you've got a big repo, use persistent [00:39:00] build machines that start from an existing clone and fetch, as opposed to cloning every time. But still, that first time you set up a new CI machine, it's got to clone, and your repo is really big. These machines are usually in a room somewhere, all hooked up to a rack. What if that bundle server were in that same rack? You'd get the absolute fastest possible connection speed, as opposed to going all the way across the internet. That's one of the big targets for where this can benefit. I don't think the ergonomics are quite there yet for doing this with developers. For instance, we haven't hooked it up with partial clone yet; we want to finish that up. But it is something developers could use if they've set it up. The other side of it is that the authentication story is very rigid right now. The idea that you have to store tokens around and register SSH keys and things makes it really hard to just tell an arbitrary developer, "Go register with this bundle server." But we're working on it. That's the biggest thing we're excited about. So if that's something you're really interested in, come hop into the repo, [00:40:00] ask a question as a discussion or something, and we'd be happy to talk to you about it.

Paul: One more time, what is the repo called?

Derrick: It's git-ecosystem/git-bundle-server, and there's a dash between every one of the words.
Paul: Derrick, thank you for your time coming on the podcast again.
And like we mentioned at the beginning of the show, Derrick is a common author on the blog; you can find his series of blog posts and the enjoyable one-offs talking about things like Git and Scalar, diving into some of these features. Thanks again, Derrick. It was a pleasure.

Derrick: Thanks, Paul.