Work on Parsing Millions of URLs per Second with Yagiz Nizipli ===

[00:00:00] Hi there, and welcome to PodRocket, a web development podcast brought to you by LogRocket. LogRocket helps software teams improve user experience with session replay, error tracking, and product analytics. Try it for free at logrocket.com today. My name is Paul, and joining us is Yagiz Nizipli. He's a senior software engineer over at Sentry, and I'm sure everybody listening to this podcast has used Sentry. We're going to be talking about something everybody touches: parsing URLs — millions of them. They're used everywhere. It's actually a deep topic, and we're going to be stepping into the drastic improvements that Yagiz has been able to explore and bring to the table. Excited to get into it. Welcome to the podcast, Yagiz.

Hello, and thank you for inviting me to this awesome podcast.

I'm glad we got the title of an awesome podcast — hopefully we live up to your expectations. [00:01:00] We're talking about URL parsing — parsing millions of URLs a second — and it relates intrinsically to performance. If we're talking about the need for faster software, why URL parsing? Why did you decide to tackle software performance from this angle?

For people who don't know my background, I'm a Node.js Technical Steering Committee member, and my main goal for the past two years has been to improve the performance of Node.js as a runtime in general. One of the most used components in Node.js is URLs, especially in ECMAScript modules — ES modules. They're used widely; anything that touches files is built upon file URLs. And we as developers have this assumption that URLs are not a bottleneck in production systems. I personally really like to break those assumptions and show the world that we don't need to accept slow code and slow systems. [00:02:00] So the easiest way to do that was to tackle the hardest challenge I could think of — URLs — because it looks easy, but it's extremely hard in practice.

So you're tackling performance; you want to make the runtimes faster. It's like the battle of the bands — we're in the battle of the runtimes now. We've got Deno and Bun. Are you trying to improve the runtime of native Node.js?

Yes, that's correct. Right now the state of Node.js is death by a thousand cuts, and URL parsing is one of the biggest performance issues. In order to solve this challenge, I needed to develop a new library with the help of Daniel Lemire, release it, and get it used by Node.js — a library developed specifically for the limitations of JavaScript runtimes. That was my intention.

Why is targeting specifically the software side of making this faster [00:03:00] so important now?

There's this perception, driven by Deno and Bun, that Node.js is slower compared to them in specific benchmarks. Even before that, I was working on trying to influence people to be more aware of writing performant code. To justify the reasoning behind optimizing URLs: in Node.js version 18.16 or earlier — before Ada — imagine you have two different GET request handlers. One of them receives the URL and returns it. The other receives the URL as a string, parses it with new URL, and then returns the href. So it only parses the URL.
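For illustration, here's a minimal sketch of that comparison in Node.js — the endpoint paths and port are invented for this example, not from the episode:

```ts
import { createServer } from "node:http";

// Two GET endpoints: /raw returns the request URL string untouched;
// /parsed round-trips it through new URL() and returns the href.
// The only difference between them is the URL parse.
const server = createServer((req, res) => {
  const url = req.url ?? "/";
  if (url.startsWith("/parsed")) {
    const parsed = new URL(url, "http://localhost"); // the extra parse being measured
    res.end(parsed.href);
  } else {
    res.end(url);
  }
});

server.listen(3000);
```

Benchmarking those two endpoints against each other isolates the cost of that single parse.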
The difference is 10%. That was the bottleneck we had to pay before Ada, the URL parser library that I developed with Daniel Lemire. A lot of people assumed this was a [00:04:00] small price we had to pay, and it wasn't limited to Node.js or C++ at all — it applied to the whole world. In the last ten years, there isn't a single academic paper about URL parsers that focused on performance or stability — anything other than security. We also wrote and published a paper called "Parsing Millions of URLs per Second," so I recommend reading that as well.

So you said 10% — was it a 10 percent hit just to parse the URL and bring it back? Does that feel high? I'm trying to frame it in my mind, because I know there were big improvements that we're going to talk about later down the line.

If you only parse a single URL, 10% is not a problem. But if you parse that same URL two times, it's a lot more. And what people don't know is that whenever you call [00:05:00] import, or any path functions, or the file system — the Node fs module — you somehow interact with URLs. This is all implemented inside Node.js core, and people don't realize they're calling new URL and making a C++ call from JavaScript. In practice, the difference is a lot higher. For example, with the development of Ada, fetch — undici — became at least two times faster, just from the change to new URL. That's the real-world impact that most people assume is negligible.

The real-world impact of something like a 2x improvement. One thing you could argue about those improvements is complexity. Does complexity get much better with something like a 2x or 4x improvement in your stack?

If you ask that question to ten different people, [00:06:00] you'll get ten different responses. But I care about performance and writing quality software, and I think tackling performance-related issues results in increased maintainability. You write simpler, more direct functions, because in order to optimize a specific function you need to write the most efficient, simplest function with the fewest instructions you can think of. That means you need to write it as cleanly as possible. So in reality, I think writing faster software results in better software.

Now let's move into some of the improvements people are seeing from this rethink of URL parsing. We've talked about why we got here and some of the factors, but what inspired you, and what were the desired outcomes? Yagiz and I are both looking at this little sticky note we have here with some of the bullet points we're going to go over, and [00:07:00] one of them is a 400 percent improvement overall since earlier Node.js versions. If we're talking 400 percent, Yagiz, is that Node from the prehistoric days, or what's our frame of reference?

It's version 18.16 to version 20. We replaced the URL parser in Node.js in version 20, and then we backported it to version 18 in 18.17. So 18.16 doesn't have Ada, and we basically compare 18.16 to 18.17, and 18.16 to version 20.
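If you want to reproduce that kind of comparison yourself, a rough micro-benchmark along these lines — the sample URLs and iteration count are arbitrary — can be run under each Node version side by side:

```ts
// Rough URL-parsing throughput measurement; run the same script under
// different Node.js versions (e.g., 18.16 vs 18.17) and compare.
const urls = [
  "https://example.com/",
  "https://example.com/path/to/resource?query=1",
  "https://user:pass@example.com:8080/a/b/c#fragment",
];

const iterations = 1_000_000;
const start = performance.now();
for (let i = 0; i < iterations; i++) {
  new URL(urls[i % urls.length]);
}
const seconds = (performance.now() - start) / 1000;
console.log(`${Math.round(iterations / seconds).toLocaleString()} URLs/second`);
```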
That conclusion actually came not from me but from one of the Node.js Technical Steering Committee members, who did a thorough benchmark of all the modules inside Node.js when we released version 20 and made it LTS. The 400 percent improvement figure comes from that specific benchmark. In reality, after version 20 we made lots and lots more improvements, and I think it's a little bit more than [00:08:00] that, but I don't have any scientific proof right now to say anything further.

We talked about a 2x improvement, but we also have Cloudflare engineers using Ada and reporting a 22 times improvement in their parsing for Cloudflare Workers. Cloudflare Workers — if you're listening and haven't used it, it's their own version of a little Lambda; you can have a TypeScript server or endpoint that runs there. Can you talk a little bit about how that number got so high, Yagiz? It seems crazy high.

So the issue is that Node.js is written in JavaScript and C++. Whenever you call new URL, you hit the bridge between JavaScript and C++, and that communication has a cost — just like network latency — because of serialization: you're passing a string, and you need to serialize it. But Cloudflare Workers doesn't have that, because Cloudflare Workers is written in C++, and I assume you don't go through the same steps. [00:09:00] One of the Node.js Technical Steering Committee members, James Snell — who is also an employee at Cloudflare — just recently replaced Cloudflare Workers' URL implementation with Ada and got a 22 times faster result. What's also funny about this result is that James was originally the author of the Node.js URL parser that Ada replaced. He developed it maybe eight years ago, we replaced it with Ada, and now he's also contributing to Ada and has replaced it in Cloudflare Workers as well.

And for anybody listening, every time we say Ada — because we didn't formally double-click on this fact — Ada is the thing that Yagiz and his team have been building. And the name is special, right?

Yes. Ada is my daughter's name, and I wanted to leave something for her to listen to, see, [00:10:00] and read maybe 20 years from now. And what better way to do that than through what I do best, which is engineering? That's where Ada comes from. The name we gave our daughter comes from Ada Lovelace, who is also behind the name of the Ada programming language. Ada Lovelace, for those who don't know, was the first female programmer in the world. And Ada also means "island" in Turkish. So there are lots of reasons behind the name.

We've talked a little bit about the shiny statistics we can read on paper about what you and the team have accomplished with Ada — the 22 times faster URL parsing, the 400 percent improvement, the 2x improvement. What I'd love to step into is your experience building this, Yagiz, and some of the technicalities. Right before we do that, I just want to remind our listeners that this podcast is brought to you by LogRocket. Similar to Yagiz and the work he does at [00:11:00] Sentry on ingestion — reading everything that's happening, whether from your loggers and such — LogRocket does this at the application layer. You're sending events from the DOM —
the actual DOM tree itself — and everything that's happening, and it allows you to do some pretty powerful stuff. You can use AI to find and surface issues faster that you may not have seen, plus things like heatmaps and your usual tracking, but it goes above and beyond, so you can spend more time using their developer tools and coding, and less time debugging in the console. That's probably where none of us want to spend our time — unless you're Yagiz and you're tackling the root of all our problems by parsing our URLs faster. So if you want to check out LogRocket — and you really should — head over to logrocket.com today and try it for free.

Like I mentioned, I want to talk about your experience building this thing. You said it's been like ten years with maybe no papers on URL parsing performance specifically. Maybe there was low-hanging [00:12:00] fruit when you first stepped in there — though like you said, it's very difficult, so I'm sure there were also things high on mountaintops to go get. But when you step into something that hasn't been nurtured or looked at in a while, you go, "Wow, I can't believe it's like this." I know you did better hashing and other things, but what about the low-hanging fruit? Is there anything that shocked you, surprised you, or that you'd like to share that was maybe a little humorous?

Yeah. It's funny, because I initially started working on the URL parser because James opened an issue in the Node.js repository saying we should investigate whether implementing the whole URL parser in WebAssembly would be more performant than the current C++ version. So I implemented the whole thing in Rust, compiled it to WebAssembly, added it to Node.js core, and ran a small benchmark. The reality was shocking: we all assume WebAssembly is super fast, [00:13:00] but it was two to three times slower, even though I was using the latest technologies and so on. Apparently, if you want to transfer string data between JavaScript and WebAssembly, you need to go through a serialization process just like JavaScript to C++. You have to use the TextEncoder and TextDecoder APIs to encode and decode messages through a shared memory that both WebAssembly and the runtime executing it use at the same time.

After this failed experiment, I re-implemented the whole thing in JavaScript, and two years ago I gave a conference talk about it — "Road to a Fast URL Parser in Node.js" was the title. But because of V8 limitations — and I think this isn't specific to V8, but to just-in-time compilers in general — it wasn't as fast as I expected it to be. It was 20 times slower than the C++ version. [00:14:00] Again, it was slow. Then a couple of months later I met a professor named Daniel Lemire from Canada, who is an expert in SIMD — single instruction, multiple data. We started tackling the whole problem once again from scratch, written in C++. This was the third time I implemented a URL parser. I brought the background knowledge, he had expertise that only a handful of people in the world have, and we made a great team, which resulted in Ada.
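As context for that hand-off, here's a sketch of the JS-side pattern he's describing — the Wasm module and its alloc/parse_url exports are hypothetical, but the encode/copy/decode steps are the standard shape:

```ts
// Hypothetical Wasm module exports; only the string hand-off is the point here.
declare const wasm: {
  memory: WebAssembly.Memory;
  alloc(len: number): number;                  // returns an offset into linear memory
  parse_url(ptr: number, len: number): number; // writes result at ptr, returns its length
};

const encoder = new TextEncoder();
const decoder = new TextDecoder();

function parseViaWasm(input: string): string {
  // 1. Encode the JS string to UTF-8 and copy it into Wasm linear memory.
  const bytes = encoder.encode(input);
  const ptr = wasm.alloc(bytes.length);
  new Uint8Array(wasm.memory.buffer, ptr, bytes.length).set(bytes);

  // 2. Run the Wasm-side parser on that buffer.
  const outLen = wasm.parse_url(ptr, bytes.length);

  // 3. Decode the UTF-8 result back into a JS string.
  // These per-call copies are the overhead that erased the Wasm gains.
  return decoder.decode(new Uint8Array(wasm.memory.buffer, ptr, outLen));
}
```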
In reality, though, the URL parser is defined by a consortium called the WHATWG — whose full name is extremely hard to say without looking at the website.

The Web Hypertext Application Technology Working Group?

Yes. It's a group founded by the browser engine vendors, and it maintains the URL specification. [00:15:00] Before browsers, of course, we also had a URL specification, defined in RFC 3986 maybe 10 or 12 years ago. The WHATWG URL specification is a living document, and it's an implementation-based specification, which means it actually tells you how to implement the URL parser. What's funny is that the people who wrote this URL specification didn't think about performance while writing it. They cared about stability and the correctness of the algorithm. So we implemented Ada from scratch in C++ and followed the specification step by step, and at that point it was maybe 20 times slower than curl. It was the slowest URL parser — it was even slower than the JavaScript version. Then over the course of six months, Daniel and I spent a lot of time [00:16:00] optimizing, finding shortcuts, and re-reading and re-implementing several parts of it, which became the current baseline of the URL parser.

That's really interesting, how you tackled it. You started with the spec to make sure you were compliant, and then you went on to the real engineering of being novel about how you could optimize.

Yes. What's also funny is that there's this test suite called Web Platform Tests, which gives browsers a common test suite so they have a consistent implementation across different browsers. But the URL tests in WPT specifically had 78 percent test coverage, consisting of maybe 5,000 tests, and we had to support all of those specific edge cases. That's why we first wanted a hundred percent — not test coverage, but success ratio — on the existing tests, so we could move forward.

Out of sheer curiosity, [00:17:00] is there anything in the spec you felt was redundant or not needed — something you'd leave out if you were to rewrite the spec? Or maybe it's the other side of the coin, and there's more you think should be included.

Not things to include, but there are parts of the URL specification that are extremely unperformant and not well thought out. For example, if you use any URL parser right now except Ada and you parse https://google.com, it will take 250 nanoseconds. But if you have a slash at the end, it will take half again more than that — 350 to 400 nanoseconds — just for parsing a single slash character. The root cause is that the URL specification is a state machine: you have a starting state, and according to the characters in the input you're parsing, it moves to [00:18:00] different states. After https://google.com, whenever the parser sees a slash, it goes to the "path start" state, and the path start state has its own impact on performance. But if you deliberately check whether there's a slash at the end, you don't need to go into the path state — you just append that slash to the string. Then you have a short path that makes your code a lot faster than all the other competitors on those particular inputs. That was just one example of what we did.

Oh, so you actually put that into your rewrite as an optimization?

Yes.
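A toy version of that shortcut, in TypeScript rather than Ada's actual C++, might look like this:

```ts
// Toy fragment of a spec-style parser: pathStart is the index where the
// path begins, right after the host.
function parsePath(input: string, pathStart: number): string {
  // Fast path: the remainder is a single "/", so skip the "path start"
  // state of the state machine and append the slash directly.
  if (pathStart === input.length - 1 && input[pathStart] === "/") {
    return "/";
  }

  // Slow path: walk the remaining characters through the state machine.
  let path = "";
  for (let i = pathStart; i < input.length; i++) {
    // ...per-character state transitions per the WHATWG spec...
    path += input[i];
  }
  return path;
}
```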
And there are maybe more than a hundred different optimizations like this in Ada. But what people don't know about URL parsers is that the URL specification is written specifically for browsers. That means whenever you go to the address bar in Google Chrome, you write www.google.com, and then you occasionally press [00:19:00] space, right? If you go to /yagiz nizipli — with a space — it resolves to yagiz%20nizipli, the encoded version. Prior to Google and prior to all web browsers, we didn't have this in the URL specification. It was deliberately done to make the browser experience better than what came before. But these kinds of decisions take a big toll on performance, because now there is no way to parse a URL without making any allocations: you need to check whether it includes a space character, and if it does, you need to allocate a new string that translates the space into %20.

So a space gets changed into %20, you said, right?

Yeah.

So all of the URL encoding and such — that's also [00:20:00] part of the stack you had to implement, top to bottom.

Yes, that's correct. And what's funny is that the URL specification also depends on the IDNA specification, and IDNA is a lot harder than the URL specification — much more vague and hard to interpret. All of the other URL parsers out there in the world depend on the ICU library for proper support of the ToASCII and ToUnicode functions. Ada is the first one that doesn't have any dependency at all — it's all written in C++. Because of that, we save hundreds of megabytes of bandwidth on every installation, because we don't need to pull in ICU anymore.

So it's written in C++, and we've been talking about using it in Node. This is where my knowledge of how your body of work can be strewn [00:21:00] throughout the world runs out. What other types of environments can we use your technology in? Is it only C++ and Node?

No. We wrote it in C++ and provided a C bridge for it, and we created libraries for Go, Python, and Rust through that C bridge. In Python, Ada is three to four times faster than Python's URL parser, urllib. In Rust, it is again about 3.5 times faster than the Rust URL parser. I don't have a benchmark for Go, so I can't speak to that, but it's indeed the fastest parser in each of those programming languages, even though it's written in C++.
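Two of the behaviors Yagiz just described — percent-encoding and IDNA — are easy to see from any WHATWG-compliant URL implementation; for example, in JavaScript:

```ts
// A space in the path is percent-encoded during parsing,
// which is why a fresh string has to be allocated.
const withSpace = new URL("https://google.com/yagiz nizipli");
console.log(withSpace.pathname); // "/yagiz%20nizipli"

// A Unicode hostname goes through IDNA's ToASCII (punycode),
// the part that traditionally pulled in the ICU dependency.
const unicodeHost = new URL("https://münchen.de/");
console.log(unicodeHost.hostname); // "xn--mnchen-3ya.de"
```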
Speaking of other languages — we're on a web development–centric podcast, and a lot of people listening are polyglots who code in many languages. But if [00:22:00] you were to put on the web developer's hat for a second: are there learnings about data structures, their usage, or programming paradigms that you could relate back to the standards and status-quo things you see done in the web dev world — like, "Hey folks, maybe you should think this way a little bit"? We haven't yet talked about how the optimization of hashing helped with URL parsing, but as an example, could that help people frame better web development practice?

I think the current problem in 2024 with web development is that we all assume there's only a single computer executing our code, which is the browser itself. And because technology is evolving super fast, we assume that all of the code executed in browsers runs on extremely fast machines, because it's an end-user computer. So we don't care about certain issues like memory [00:23:00] management, speed of execution, FPS, and so on. In reality, this is something we should think about, because even though we have this distinction between front-end and back-end development, JavaScript gives you the possibility to share code between the back-end and front-end worlds. And without caring about performance in one, we can't care about performance in the other. So that's one thing to take into account. The other thing is that the user consuming our web application is also consuming 10 or 20 other tabs and pieces of content all over the place, and you're sharing the pool of memory with all of those applications. If one of them uses too much, of course, Chrome and other browsers kill that process or isolate it so it doesn't affect [00:24:00] the others. But you should know there's limited space, and with this knowledge we can build much faster applications, even though technically we don't have to.

And just to circle back to before we talked about the other languages, Yagiz — there was one particular optimization I wanted to pick your brain about: perfect hashing, and how it was leveraged to yield speed improvements. Could you explain to me and the people listening, what is perfect hashing? And on the flip side, what is imperfect hashing?

For those who don't know, there's this data structure called a hash map, where for a particular key there is a value. Depending on how your hashing algorithm is implemented, the collision behavior between different keys varies widely. Ideally, collisions don't exist. For example, in an earlier phase of web development around 2010, we were using MD5, and we [00:25:00] realized that two different inputs could hash to the same key — meaning if I pass "yagiz" to a hash function it returns one, and if I pass "paul" to the hash function it returns one as well. That's not something we want. Because of that, we developed more and more complex hashing algorithms like SHA-1 and SHA-2, and later we evolved toward stronger cryptographic algorithms, because we started hashing passwords — you don't want two different passwords to resolve to a single hash. That's the main purpose of those hashing algorithms.

In the context of the URL parser, we had to check whether a particular protocol is a special protocol or not. In URL parsing, the protocol is the first several letters of a URL — https, http, ftp, ssh; those are protocols — and the special protocols are the ones the URL [00:26:00] specification has edge cases for: https, http, file, and so on. So when parsing the protocol — as I said, this is a state machine — we parse every character until we see a colon character in the URL.
And when you see that colon, you know the protocol starts at the H and ends at the S just before the colon — you have "https", a substring of the whole URL — and you have to know whether it's a special protocol, whether there's an edge case for it. The easiest way to match this — the worst case — is to have an array of special protocols, and in the worst case you traverse that list n times, where n is the length of the array. Can we do faster than that? Yes. How can we do [00:27:00] it? By storing it in a hash. In JavaScript terms, if we have an object with https, http, ftp as the keys and true as the values, then we could access it in O(1) — but in reality, you're still storing that dictionary data and traversing it as well. One small optimization to move beyond this: you can look at the first character of the protocol and, depending on it, compute the index into your array directly. Say it's "https", which has five characters: I take the first character's ASCII code, do a mathematical operation on it, and arrive at, say, six. The array has a length of seven, so I take the sixth element and say, yes, this is a special protocol. [00:28:00] So now I've eliminated traversing through the elements, using a couple of instructions — bitwise operations — to find the correct place for that specific string. That's how we came up with it. But yeah, this is one small optimization in the whole URL parser.

That's interesting, because traditionally you read, you learn, that it's O(1). But you're saying, hey, realistically there's still ground to be gained here. Which is really interesting.

And you have to implement this function in a way that's inlinable by the compiler, so that no matter how often you call it, there's no toll in accessing that particular function in the address space. We had to make a lot of tricky, smart decisions to make this as fast as possible. I'm pretty sure there are faster solutions out there [00:29:00] that apply to a more generic version of the problem space I'm describing, but in the context of the specific task we had — is this a special protocol or not — this was the perfect solution.

Just to make sure I understood what you said about accessing the function in the address space — you wanted to make sure it was local and quick to reference.

Yes. It should be a constant expression — constexpr in C++ terms. It needs to be inlinable, small enough that the compiler inlines it, so that calling it doesn't have any performance toll.
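Here's the idea in TypeScript — note that Ada's real version is constexpr C++, and this particular hash formula is a reconstruction for illustration, not necessarily the exact one Ada uses:

```ts
// The six "special" schemes from the WHATWG URL spec, placed so that
// (2 * length + first char code) & 7 maps each one to its own slot.
const SPECIAL = ["http", "", "https", "ws", "ftp", "wss", "file", ""];

function isSpecialScheme(scheme: string): boolean {
  if (scheme.length === 0) return false;
  const index = (2 * scheme.length + scheme.charCodeAt(0)) & 7;
  // One table lookup plus one comparison -- no list traversal.
  return SPECIAL[index] === scheme;
}

console.log(isSpecialScheme("https")); // true
console.log(isSpecialScheme("ssh"));   // false
```

Every special scheme lands in its own slot, and the final string comparison catches any non-special scheme that happens to hash into an occupied one.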
I could go on for another 20 or 30 minutes asking you about these really particular areas you and the team looked into. It's fascinating — especially the fact that you looked at an O(1) problem and decided to optimize it even further. It's really neat to hear how you approach problem solving. Unfortunately, we're getting close to [00:30:00] our time limit for today, so let's talk a little about what's coming in the future. What do you have planned? You already mentioned the next versions of Ada are going to be better; there are things coming down the pipe. Can you tell us what's planned, maybe what's done and in testing, and the stuff that's not planned yet but is sitting in the backlog, running around in your head?

With Node 21, we implemented two different versions of the URL parser: one is ada::url, and one is ada::url_aggregator. The URL aggregator focuses specifically on reducing the cost of serialization. Instead of keeping the URL's components — username, password, hostname, port — as separate strings, the URL aggregator keeps one large string that we allocate only once, and we store the start indexes of those components. So if you have the protocol's end index, you can say that the [00:31:00] substring of the href from zero to protocol-end is the protocol, and so on. This helped a lot with serialization.
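As a simplified picture of that layout — Ada's real structure is C++ and tracks more components, so treat this as a sketch:

```ts
// One allocation (href) plus integer offsets instead of per-component strings.
interface UrlAggregator {
  href: string;         // the full serialized URL, allocated once
  protocolEnd: number;  // offset just past "https:"
  hostEnd: number;      // offset just past the host
}

function getProtocol(u: UrlAggregator): string {
  // Components are recovered as substrings of href using the offsets.
  return u.href.slice(0, u.protocolEnd);
}

function getHost(u: UrlAggregator): string {
  return u.href.slice(u.protocolEnd + 2, u.hostEnd); // +2 skips "//"
}

const u: UrlAggregator = { href: "https://example.com/path", protocolEnd: 6, hostEnd: 19 };
console.log(getProtocol(u)); // "https:"
console.log(getHost(u));     // "example.com"
```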
Right now we have two different implementations in the codebase, because one of them is extremely easy to construct and the other is extremely easy to set and get — to call the setters and getters on — since you don't need to take substrings; you just return the string that's already constructed. With version 3, we're planning to reduce the implementations back down to one and make our job a lot easier. That's going to be a breaking change, because Node.js and several other runtimes depend on the first implementation in several cases, and the only way to do it without making everybody mad is to follow semantic versioning and do it in version 3. On top of that, we still believe there are lots of optimizations we could make. And quite recently, in the last several versions of Ada, we [00:32:00] added URLSearchParams support as well, so we're going to work on improving that too. So: more and more performance.

It's really interesting, like I mentioned, learning how you're pushing optimizations even past O(1), or things like taking the slash off the end. Is there some way for people listening to keep up to date with what the team is thinking — through GitHub, or blog posts where you do a mind dump of optimizations?

From time to time I try to write blog posts about the most impactful optimizations we did, and I recommend listeners read my blog. We also plan to create some good first issues for people to contribute to Ada, because this is an open space and I want this project to live for the next 20 years. Right now there are a total of 28 or 30 [00:33:00] different contributors to Ada, but if you're aiming for 20 years, we need more contributors.

At 20 years, you're not allowed to work on it anymore.

It's the only library that I will still work on after 20 years.

And Yagiz, we'll link the blog below the show, so if you're listening and just want an easy link to click, go straight to Yagiz's blog — we'll make sure to include it in the show notes. It's been a pleasure talking with you and really peering under the engine hood of what makes Node great. Every time there's a major Node version, I'm always wondering what's going on — what are these folks doing that's making it better and better? And here we are, with a firsthand look at one of the great minds behind the improvements that nobody thinks about but that really matter a lot.

So I'm going to share some — not small, but big — news with you and the listeners of this podcast: there's a high chance that Node 22 is [00:34:00] going to get a Node task runner, just like npm run, which is around seven or eight times faster than npm run's startup time.

That's huge. That's massive.

Hopefully — there's still a discussion about whether or not to land it — but fingers crossed, we will land it in version 22 and it will be ready.

One thing that's so exciting about the Node progression is the improvement of tooling — tooling just built in. And I just stumbled, because Emily sent us a little note saying we're going to have an episode coming up on PodRocket specifically about Node 22 and some of these improvements. So stay tuned; I'm going to be listening in to make sure I'm up to date and not left in the dust. But once again, it's been a pleasure having you. I'll make sure to have your blog linked down below so people can keep up to date. Truly, Godspeed continuing the amazing work you're bringing to our developer world.

Thank you, and thank you for giving me the chance to speak today. It's [00:35:00] an honor to join.