Anna (00:00:05):
Welcome to Zero Knowledge. I'm your host Anna Rose. In this podcast, we will be exploring the latest in zero knowledge research and the decentralized web, as well as new paradigms that promise to change the way we interact and transact online.

Anna (00:00:27):
This week, Tarun and I chat with Kelly and Simon from Supranational. We talk about all the types of hardware used for cryptographic computation. This is a topic that is increasingly important in the context of zero knowledge proof computation. We talk about the spectrum of hardware from CPUs to ASICs and how they compare, as well as the work that the Supranational team is doing on that front. Now, before we start in, I have two quick notes to share with you. First off, I don't know if you're aware of this, but earlier this year, we kicked off a zero knowledge blog. The zero knowledge blog is publishing one post a week. The topics range from zero knowledge deep dives to posts about particular blockchain or DeFi primitives, as well as posts that elaborate on previous episodes that we've recorded. This past week I published a blog post all about NFTs based on an earlier episode we had done on them.

Anna (00:01:17):
It's basically a thought exercise to help me understand if I might ever want to buy one. TL;DR, I'm still on the fence. Have a read, check it out, let us know what you think. But also the way that we're finding writers for this blog is that we're sourcing them directly from the zero knowledge podcast community. So if you are doing research on a topic or you've written a really great piece that you think we should have a look at, if you have top-notch writing skills, we do want the quality of the blog to be quite high. Do get in touch about potentially becoming a contributor. You can email us at blog@zeroknowledge.fm, just send over some sort of sample and we'll get back to you. Second really quick note I want to share is a big thank you to this week's sponsor Aave. Aave is an open source decentralized non-custodial liquidity protocol on Ethereum.

Anna (00:02:08):
With Aave users can participate as depositors, meaning they provide liquidity to earn a passive income, or they can act as borrowers to borrow in an over collateralized way or an under collateralized way. Think one block liquidity flash loans, a topic we've covered quite often on the show. A new feature that they've recently released is called credit delegation. This is where users can delegate their credit to another person who can borrow against it. Aave has an ecosystem grants program for anyone building anything that contributes to the Aave ecosystem. Do check out the Aave developer portal to learn more. I've added the link in the show notes, as well as the blog post, where they talk a little bit more about the grants program, visit Aave.com in general, to find out more about the project. So thank you so much Aave for being a sponsor of the zero knowledge podcast. Now here's our conversation all about hardware with the folks from Supranational. So today Tarun and I are chatting with Kelly Olson and Simon Peffers, both co-founders of Supranational, which is a hardware company. Today we will be doing a deep dive into hardware from CPUs to ASICs and look at how these are used for cryptographic processing. So welcome Kelly and Simon.

Kelly (00:03:22):
Hi, thanks for having us.

Simon (00:03:24):
Hi guys. Pleasure to be here.

Anna (00:03:26):
Tarun, you're the guest host today, and I think one of the reasons that we talked about you joining this one is that you actually have a background in hardware.

Tarun (00:03:34):
Yeah, I used to work on this ASIC for doing physics simulations of proteins. So I once upon a time actually knew something about hardware. Now, I'm sure, it's all dated.

Anna (00:03:47):
Actually, I think the first episode you were ever on, I think we did dig into that. Maybe I'll add the link so we can hear that. That was actually when we met. So I think as a starting point, I'd like to find out a little bit more about Supranational. What is it, what is this company and maybe where did it start from?

Kelly (00:04:05):
Yeah, sure. So I guess I can start. So, Simon and I have known each other for a number of years. We both worked at Intel Corporation, me for a decade and I think Simon a bit over a decade. I think, we actually first met during some investigations into the Bitcoin and blockchain space, while we were both at Intel. I was sitting on the business incubation side and Simon was in the pathfinding department, eventually that effort never really went anywhere within Intel, at least that initial effort that we looked at, but it was always an interest that sort of stayed in Simon and I's minds. And so, about two years ago, we decided to leave Intel and start Supranational. And Supranational is a company that focuses on acceleration of cryptography and other algorithms. We do that by writing software and also developing custom silicon.

Simon (00:04:58):
So, I'm Simon Peffers. And as Kelly said, I was at Intel for about 20 years and I worked in the server architecture team. One of our charters, our main charter, was to look at workloads that would be coming up in the next, say, five years or more, and anticipate what those would be and build support into the server chips for those workloads. And so the team I was on did a lot of work on cryptography, worked on the SHA and AES instructions and some of the large integer arithmetic instructions. So it was kind of from there as a natural jumping off point to work on the blockchain space, because there's a lot of really new and interesting cryptography coming out in this space. It's computationally very intensive, algorithmically it's complex. And so when it comes to accelerating those workloads, which is desperately or really needed in the space to become practical, it was kind of a natural place for us to jump in. We look anywhere from ASICs, as you mentioned, to CPU, GPU, FPGA, all the sort of hardware platforms, and we can look at the workloads, the cryptography that needs to be done, and understand how to map it onto those platforms and what the best platform is. We work at Assembly, C, algorithmic, down to transistors. So we try to cover the whole space and really understand the best way to get these algorithms going quickly and practical for people.

Tarun (00:06:24):
Were you working on Xeon Phi style chips, server chips, or what types of when you're talking about having SHA instructions, is this for generic processors or is this for like particular style thing? Because it's actually quite interesting, I think, especially for a cryptography audience, maybe to understand how those implementations differ on different pieces of hardware.

Simon (00:06:50):
Yeah. One of our colleagues is really an expert on the SHA instructions in particular, but yeah, the team I was on developed those instructions and they're used in a variety of processors anywhere from the mobile processors back when Intel had those to the client processors, to the servers, probably less on Knights, but more on the Xeon, the mainstream Xeons. And they do vary some, I mean, a lot of the core of it is going to be similar, but the number of functional units and things like that will vary.

Kelly (00:07:16):
Yeah. And I think maybe to add on top of that, cryptography instructions, you know, there's sort of a long history of those being introduced into processors, and SHA is one of the relatively recent ones. So it has been on AMD processors for, I think maybe a couple of years now. And it's starting to hit sort of your general laptop and desktop processors for the Intel processors as well.

Anna (00:07:42):
When was Supranational actually founded? Like how old is this business? Cause you've talked about like being at Intel, but like what is the window of time that Supranational has been around? Who is it? Who does it consist of? Is it a team, it's the two of you, but are there other people that are also part of the org?

Kelly (00:07:59):
Sure. Yeah. Maybe I can provide a better explanation there. So Supranational was founded in 2019. Simon was the first one actually to leave Intel and start the company. He was joined soon thereafter by myself and another one of our co-founders, Sean Gully. Sean Gully worked in a similar division as Simon, working on cryptography and data center pathfinding. We also have a number of other sort of part-time employees and contractors as well that we bring in on an as needed basis, depending on exactly what the project is that we're working on.

Anna (00:08:34):
I'm just curious to understand how you work with blockchain, because you're looking at the hardware, you don't have kind of an L1 of your own. So how do you interact with that space?

Kelly (00:08:51):
In the blockchain space in particular? I think, to date, we are predominantly a service company, so we've done a number of engagements and also received some grants from various foundations, like the Ethereum Foundation or Protocol Labs to work on hardware acceleration, not just for their own ecosystems, but also develop general purpose cryptography libraries that are useful broadly across the ecosystem. Moving forward, we are looking into developing some product offerings and these would be in the forms of various sort of cloud proofs as a service, as we call it, offerings to develop basically an easy way for developers to generate zero knowledge proofs quickly and efficiently. So just, you can think about that as a sort of an API surface, and longer term we're also investigating building custom silicon to do more general purpose cryptography operations with the goal of increasing the performance of these things by one to two orders of magnitude.

Anna (00:09:45):
Well, cool. Are there any other organizations that you know, that are doing something similar, maybe even like existing companies that we can kind of associate with the kind of work you're doing?

Kelly (00:09:55):
Yeah. In the cryptography space I think less so. I mean, there certainly are many Bitcoin mining companies and in some ways they do a similar sort of exercise as we do, but most of those are predominantly focused on just really high throughput parallel hashing. So, what's slightly different is they're not working on sort of general purpose cryptographic type operations.

Anna (00:10:20):
Simon, you gave a presentation at like SBC, I think, last time there was one. And in that you talked very much about like RSA, that was sort of your starting point. And I actually, I don't know if we've really talked that much about RSA on the show and I was thinking maybe we could talk a little bit about it here. What is RSA, exactly? And what does accelerating RSA have anything to do with this cryptography that we're talking about today?

Simon (00:10:51):
Well, RSA is one of the forms of cryptography out there. When you think about a public key and private key cryptography, the two main forms are RSA and elliptic curve. And I think what you're getting at is most of the cryptography used in blockchain is in fact elliptic curve. A big part of the reason is that the keys are much smaller, so when you have to store these things forever, with many copies of them, it makes sense to have smaller keys and RSA keys tend to be bigger, but in the fundamental sort of components of RSA and ECC are similar, I guess I would say they're all large integer arithmetic. And part of the reason we got involved in RSA is because, and Kelly can probably describe this more, but one of the first VDFs that we looked at was RSA based. And so when we started off the company, the first project was to look at the possibility of building an RSA-based VDF. And so that sort of set us off on this direction of looking at RSA. But since then, I mean, we look at both RSA and ECC. Most of our work has been elliptic curve cryptography.

Anna (00:11:54):
Thanks for that clarification. Because when I was doing some research for this, I was curious about that. Actually let's give a little framing to the project. So it was the EF that had contracted you to do this VDF RSA hardware.

Kelly (00:12:09):
Yeah, that's exactly right. So the Ethereum Foundation has been investigating sort of strong sources of randomness for a number of years and how to use that for leader election and proof of stake protocols. And so today the Ethereum Foundation uses something called RANDAO, which provides very, very strong guarantees, but there are ways to improve upon that. And so in the middle of 2018, a paper came out by Dan Boneh and some of the other PhD students at Stanford that described how to construct one of these RSA VDFs that would provide these stronger randomness properties. And one of the biggest assumptions that you need to make with any of these RSA VDFs is how fast can it go, basically. And so we were able to do some work with the Ethereum Foundation to make estimates about how fast this thing could be sped up to sort of inform the security properties of the VDF.

Tarun (00:13:02):
So when implementing this repeated squaring VDF (which is sort of the main, the simplest VDF that, I think, was covered on a previous episode with Joe Bonneau), what were some of the challenges that you've found in actually implementing it? Were the issues more things like you needed a lot of memory, or was it something where you actually ended up having a ton of gates or some sort of intermediate state? Yeah, it'd be great to talk through how conventional RSA circuits might have to differ for this application.

Simon (00:13:40):
I mean, first of all, on the CPU it's easy. You can just call libgmp and it uses the instructions in it. We were doing 2048 bit RSA, so that's pretty easy on the CPU. You can just call these functions and they're reasonably fast. With the VDF you want to make it go as low latency as possible. So to do repeated squaring with low latency, we initially built an FPGA implementation. So a CPU implementation would be about 1100 nanoseconds per squaring. For FPGA we got it down to about 65 nanoseconds per squaring. So quite a bit faster. And then part of the project was to look at an ASIC-based design, so building actual hardware dedicated for this operation. I think the biggest challenge is that if you want to do low latency, you have to parallelize as much of the operation as possible, and it's possible to do that pretty well with squaring and reduction. The problem is that you end up with a lot of gates and they end up taking a lot of area. And it's not really even the gates that are the biggest problem, but the wires. So you have a lot of wires, it's kind of an N-squared problem, and finding enough routing tracks for the wires becomes a problem. And then the delay, the time delay through the wires, becomes very significant. And so that ended up being the biggest challenge for cranking up the speed: getting the wires and the routing to be efficient. And the tools, the hardware design tools, struggled with that scale of design.
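To make the repeated squaring concrete, here is a minimal Python sketch of the evaluation step Simon describes. The function name `vdf_eval` and the toy modulus are assumptions for illustration only; a real deployment would use a 2048-bit modulus and a tuned big-integer library rather than plain Python integers. The latency figures are the ones quoted in the conversation.

```python
def vdf_eval(x: int, t: int, n: int) -> int:
    """Compute x^(2^t) mod n with t sequential modular squarings."""
    y = x % n
    for _ in range(t):
        # Each squaring needs the previous result, so the loop cannot be
        # spread across cores. Only the work inside one squaring
        # (partial products, reduction) can be parallelized in hardware.
        y = (y * y) % n
    return y


def wall_clock_seconds(t: int, ns_per_squaring: float) -> float:
    """Estimated evaluation time for t squarings at a given latency."""
    return t * ns_per_squaring * 1e-9


# Hypothetical toy modulus, far too small to be secure.
TOY_N = 1000003 * 999983

if __name__ == "__main__":
    y = vdf_eval(5, 1000, TOY_N)
    # Cross-check against Python's built-in modular exponentiation.
    print(y == pow(5, 1 << 1000, TOY_N))
    # Per-squaring latencies quoted in the conversation.
    for name, ns in (("CPU", 1100), ("FPGA", 65)):
        print(name, wall_clock_seconds(10**9, ns), "seconds per 1e9 squarings")
```

The sequential dependency in the loop is why latency per squaring, rather than throughput, is the number that matters here.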

Kelly (00:15:09):
Yeah. And maybe just to add on top of that, whenever you're developing custom silicon, you always have what are called PPA trade-offs: Power, Performance and Area. And in traditional, mainstream CPU that you know is going to be in every laptop and desktop, area is really critical because you're going to have to print millions or maybe hundreds of millions of these chips. And so generally CPU type implementations are really going to focus on low power, small area, because it's going to go into somebody's laptop. In the case of the VDF and developing custom silicon for it, the number one priority was to have the highest performance, the lowest latency on that. And so you'll choose very different sort of architectures for doing these multiplications in the silicon, and you'll sacrifice area, and you'll be willing to accept very high power budgets, so that you can get that best performance. And so that's sort of one of the trade-offs that could be made when you're developing custom silicon.

Simon (00:16:06):
Right. I guess, just to follow on Kelly, for the VDF it's a very unusual design target, which is what he's kind of getting at. Normally you want to have a balanced design. For a VDF, you can really trade off all the other parameters to get the lowest possible latency. So it's pretty interesting in that regard. That's not a target you see very often.

Tarun (00:16:24):
Is the device you made in use by anyone or actually like what ended up happening with it?

Kelly (00:16:33):
Yeah, I can talk about that briefly. So there are sort of three main components that I guess you think about with a VDF. The first is this evaluation, this repeated squaring. And what we were able to do is implement that on a CPU, make it 10 times faster on an FPGA and then ultimately design an ASIC that would be, let's say, another 10 times faster. The second main component is generating the proofs, and that's a highly parallel process, so we were able to develop code that would run on a graphics card and would generate those proofs very efficiently. And then the third thing (well, maybe there's four things) is the verification: how do you make the verification of these proofs very efficient? And then the fourth thing is what are the required cryptographic assumptions to actually make the VDF secure. And with the RSA VDF project, you have to do a trusted setup. And the biggest issue that we encountered with the RSA project was really a secure way to do a large multi-party computation to generate the underlying modulus for the RSA VDF. And so that was really probably the biggest bottleneck that we did hit in that project: generating that modulus. And that was one of the things that sort of stalled the project and prevented us from moving forward with manufacturing custom silicon.
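The evaluate, prove, and verify components Kelly lists can be sketched in Python. The conversation does not say which proof scheme the project used, so this sketch assumes a Wesolowski-style proof, one well-known construction for repeated-squaring VDFs. The helper names, the 20-bit challenge prime, and the toy modulus are all hypothetical. In this toy the factors of the modulus are visible in the source, which is exactly what the trusted setup Kelly mentions is meant to avoid.

```python
import hashlib


def is_prime(m: int) -> bool:
    """Trial division; adequate for the small toy challenge primes here."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    d = 3
    while d * d <= m:
        if m % d == 0:
            return False
        d += 2
    return True


def hash_to_prime(x: int, y: int, bits: int = 20) -> int:
    """Derive a challenge prime from (x, y). Toy size; real schemes use far more bits."""
    digest = hashlib.sha256(f"{x}|{y}".encode()).digest()
    candidate = (int.from_bytes(digest, "big") >> (256 - bits)) | 1
    while not is_prime(candidate):
        candidate += 2
    return candidate


def evaluate(x: int, t: int, n: int) -> int:
    """Slow, sequential part: y = x^(2^t) mod n."""
    y = x % n
    for _ in range(t):
        y = (y * y) % n
    return y


def prove(x: int, y: int, t: int, n: int) -> int:
    """Proof pi = x^floor(2^t / l) mod n, where l is the challenge prime."""
    l = hash_to_prime(x, y)
    return pow(x, (1 << t) // l, n)


def verify(x: int, y: int, pi: int, t: int, n: int) -> bool:
    """Cheap check: pi^l * x^(2^t mod l) == y (mod n)."""
    l = hash_to_prime(x, y)
    r = pow(2, t, l)
    return (pow(pi, l, n) * pow(x, r, n)) % n == y % n


# Hypothetical toy modulus; its factors are known here, so it is NOT secure.
TOY_N = 1000003 * 999983

if __name__ == "__main__":
    x, t = 7, 2000
    y = evaluate(x, t, TOY_N)
    pi = prove(x, y, t, TOY_N)
    print(verify(x, y, pi, t, TOY_N))
```

Verification costs two small exponentiations regardless of `t`, which is the asymmetry that makes the delay function verifiable.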

Simon (00:17:51):
So when we first started looking at this RSA 2K operation, one of the things that we learned about was a competition at MIT where Ron Rivest, one of the inventors of RSA, along with some others had created this cryptographic puzzle in, was it 1999 or so I think, Kelly may know the exact date, but it was supposed or expected to be solved in about 35 years, at the 35 year anniversary of this department. And we figured that by using the FPGA and implementing a fast repeated squaring circuit, we could actually solve it in about two months or three months. And so when we first started the company, that's the first thing we set out to do with support from EF and others. And using this 65 nanosecond per squaring FPGA, we were actually able to solve it in about two months and unlock the secret that had been encoded. And we had a nice ceremony at MIT where Ron Rivest opened a time capsule and everything. We were unfortunately not the first to unlock it. Two weeks before we unlocked it, a fellow called Bernard Fabrot had unlocked it by using a CPU, which he had run for about two and a half years, and just barely beat us to the punch. Just pure coincidence, pure coincidence, but it was pretty funny, it was a fun ceremony.
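A puzzle of this kind is a time-lock puzzle: the solver has to do the squarings one after another, while the puzzle's creator, who knows the factors of the modulus, can take a shortcut. The Python sketch below shows both paths on a toy modulus and then does the back-of-envelope timing. The squaring count `LCS35_T` is an assumption (the value recalled from the published puzzle statement, not something stated in this conversation), and the function names are illustrative.

```python
def solve_sequentially(t: int, n: int, base: int = 2) -> int:
    """What the solver must do: t modular squarings, one after another."""
    w = base % n
    for _ in range(t):
        w = (w * w) % n
    return w


def solve_with_trapdoor(t: int, p: int, q: int, base: int = 2) -> int:
    """What the creator can do knowing p and q: reduce the exponent mod phi(n).

    Relies on Euler's theorem, so base must be coprime to p * q.
    """
    phi = (p - 1) * (q - 1)
    exponent = pow(2, t, phi)
    return pow(base, exponent, p * q)


def days(t: int, ns_per_squaring: float) -> float:
    """Wall-clock days for t sequential squarings at the given latency."""
    return t * ns_per_squaring * 1e-9 / 86400


# Assumption: squaring count from the published puzzle statement, as recalled.
LCS35_T = 79_685_186_856_218
# Per-squaring latencies quoted in the conversation.
FPGA_NS, CPU_NS = 65, 1100

if __name__ == "__main__":
    p, q = 1000003, 999983  # hypothetical toy primes
    print(solve_sequentially(10_000, p * q) == solve_with_trapdoor(10_000, p, q))
    print("FPGA days:", days(LCS35_T, FPGA_NS))
    print("CPU years:", days(LCS35_T, CPU_NS) / 365)
```

Under that assumed squaring count, 65 nanoseconds per squaring works out to roughly two months, which is consistent with the figure Simon gives.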

Anna (00:19:10):
Cool. Well, we'll add actually, I think there's a link that we have about that competition. I'll add a link in the show notes to that in case people want to find out a bit more about it.

Tarun (00:19:19):
Maybe actually it might make sense to just have the terms FPGA and ASIC defined and what synthesis is and what the tools you use are and how you think. The PPA example was a really great example of trade offs that engineers have to think about in hardware that software people and cryptographers don't have to think about and aren't used to thinking about, and I think maybe a good way to, before we jump into talking about ZKP hardware is talking maybe a little bit about A) the different hardware platforms, B) kind of the tools you use, the languages, the synthesis, and then maybe a little bit about how you guys do fab, presumably are fab-less and you go to some type of aggregator or something like that.

Kelly (00:20:04):
Yeah, I'll give the very, very high level. And then I'll let Simon really dig deep into those. So the first thing is obviously a CPU. That's something I think everybody knows what that is. It sits inside of your laptop, it sits inside of your cell phone and that's going to do sort of general purpose computation. And to program the CPU, you can write in a myriad of languages all the way from Assembly up to higher level languages like Python. The next one that's probably most familiar to people is a GPU, which is a graphics processing unit. Traditionally, GPUs have been used sort of solely just for graphics processing, but about 10 years or so ago, we started to see a trend called GPGPU, which is general purpose graphics processing units. And that's really about doing general purpose computations on a graphics card.

Anna (00:20:55):
Are most graphics cards today that second type, that GPGPU?

Kelly (00:21:01):
Yeah, that's right. Most of the graphics cards today can support more general purpose programs. The area where a graphics card really excels is with highly parallel operations. So maybe on a normal CPU, you may have four cores or 12 cores, or if you're lucky you've got 32 or 64. A graphics card can have hundreds or thousands of cores. And so if you have a lot of little jobs, a graphics card is a great thing to use. Interestingly enough, it's actually suited quite well for cryptography. And that's one of the things that perhaps we could talk about, but it's a great tool for that. Clearly what you've started to see is people using GPUs for things like machine learning, for biology simulations, you know, a number of different things, and you're starting to see a trend in the industry where more and more people are using these graphics cards for general purpose computation.

Kelly (00:21:56):
The next definition, I guess, is an FPGA, that's a field programmable gate array. And you can think about this as, almost like a chip that you can kind of program on the fly. So you can take this programmable chip and make your own special chip with it. And you can get some performance benefits out of that, but it doesn't require spending millions and millions of dollars to build a custom chip, which is the last thing that we'll talk about, which is an ASIC or an application specific integrated circuit. And an ASIC is really a chip designed to do one or a small handful of things. And the audience will probably be most familiar with ASICs that are built for Bitcoin mining. So what those are is basically a chip that can do one function.

Kelly (00:22:43):
It can do SHA-2 hashing, or maybe some other hashing algorithm, but there'll be hundreds or thousands or hundreds of thousands of these tiny hashing engines on the chip meant to just do as many as possible. But you can't process graphics on that chip. You can't do any sort of simulations on that chip. You can't write programs for that chip. But I'm sure Simon can give a lot more information, not only on the overall ASIC process, but also just on some of the relative trade-offs between these platforms.

Simon (00:23:09):
Yeah. Thanks, Kelly. That was a good definition, I would say. So one way to think about it, I guess, is that we always try to look at the system level. So at a high level, when you think about running something, you have to think about the compute capabilities of the platform, the memory requirements and the I/O needed. And so if you walk down CPU, GPU, FPGA, ASIC: CPUs have a lot of I/O, and they have as much memory as you kind of want to put in there. They're easy to program, they're the easiest platform to use, but they have a limited number of cores. You might get 32 or 64 at most. So you've got limited parallelism, but they're pretty fast cores. As you go down to GPU, it has thousands of cores and a lot of memory bandwidth, but you have to access memory in particular ways for it to be efficient, so it's harder to extract performance out of GPUs.

Simon (00:24:00):
And so you tend to have to restructure your algorithm to work well on them. And as Kelly said, cryptography can work well, but we usually have to go in and change it around. Sometimes we use different algorithms. Sometimes we combine algorithms to really suit the platform. But if you can really take advantage of the capabilities there, the sort of extremely high memory bandwidth and all the cores, you get great performance for tasks that can be highly parallel. And they are quite common. I mean, most people have GPUs and they're pretty easy to come by.

Anna (00:24:29):
Although they do sell out, sometimes.

Simon (00:24:32):
They do sell out, which is an interesting development these days. Yeah. FPGA is increasingly specialized and you kind of wire everything up, but they have more primitive functional units like adders and things like that. And so it does use synthesis tools. They tend to be harder, again, like an order of magnitude harder to use than GPUs, but for certain applications they are much faster. And finally an ASIC is really just a blank sheet of paper. You can build anything you want. And so they're probably a thousand times harder to build and much, much more expensive, but you can really make it do the exact application really well. So you'll probably get orders of magnitude better performance out of an ASIC than a general purpose platform.

Anna (00:25:12):
Tarun, what did you work on when you worked in hardware? Which of those four?

Tarun (00:25:17):
ASICs, I worked on the biology ASICs that they were mentioning before. I think a lot of people, I feel like who worked in ASICs got into crypto by observing how dumb the mining hardware was and trying to be like, come on, there's gotta be something better and more interesting to do. At least that was how all my old colleagues used to be.

Anna (00:25:43):
So we've defined these different types of hardware platforms. Now Tarun, I think to continue on your point or question, what then are the techniques used to optimize these?

Tarun (00:25:55):
Yeah, I guess like maybe walk, I think software developers who are used to writing user space programs in Linux or Mac are probably not really used to this idea of like, what is Verilog and how do you have to think about when you're copying things from memory and not memory, and there's a ton of tools that hide a lot of those details for you, which I think people don't realize. And maybe it would be cool to talk about like what tools you guys find most helpful when writing cryptography for FPGAs and ASICs.

Simon (00:26:31):
Sure. I mean, at the end of the day, when you're designing for FPGAs or ASICs, it's really a fundamentally different way to think about it. So there's a couple of languages: Verilog and VHDL are the most popular. Verilog is probably the most popular. And RTL, by the way, stands for register transfer level. And what you're basically describing is the hardware: the registers, which store the bits, so that's sort of memory, along with the logic that goes in between them. You're describing all that in code that is a functional definition of the chip. Once you write that, and everything is parallel, so when you write RTL, in hardware, real hardware, everything happens at the same time, which is the way the world works. And so there's languages designed to model that behavior. So you write this model and it's pretty different from software because everything is parallel.
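The "everything happens at the same time" point can be illustrated with a small Python model of a clocked design. This is only a sketch of the idea behind RTL simulation, not how any particular simulator works: on each clock edge, every register's next value is computed from the current values, and then all registers update together. The register names and the `clock_edge` helper are hypothetical.

```python
def clock_edge(regs: dict, next_state_logic: dict) -> dict:
    """Advance a toy register-transfer model by one clock cycle."""
    # Phase 1: evaluate the logic between registers, reading only the
    # values the registers hold *before* the clock edge.
    next_values = {name: fn(regs) for name, fn in next_state_logic.items()}
    # Phase 2: every register latches its new value at the same instant.
    return next_values


# Logic feeding each register. "a" and "b" exchange values every cycle,
# and "total" captures their 8-bit sum.
LOGIC = {
    "a": lambda r: r["b"],
    "b": lambda r: r["a"],
    "total": lambda r: (r["a"] + r["b"]) & 0xFF,
}

if __name__ == "__main__":
    regs = {"a": 1, "b": 2, "total": 0}
    regs = clock_edge(regs, LOGIC)
    print(regs)  # a and b have swapped; total holds the sum of the old values

    # The same two assignments written as ordinary sequential software
    # do not swap: the second statement sees the already-updated value.
    a, b = 1, 2
    a = b
    b = a
    print(a, b)
```

In a hardware description language the swap needs no temporary variable, because both registers sample their inputs on the same clock edge.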

Simon (00:27:17):
And then you do something called simulation. There are tools that will compile this model much like GCC does and run it. And it'll tell you how it performs and how it behaves. And you run it and run tests on it and debug it and fix it and make sure it all works. In a hardware world, the timeframe to go build a chip is let's say roughly two years, 18 months to two years or more, depending on the complexity of the chip, and the cost is millions of dollars. So unlike software, where if you have a mistake and a bug, you can go fix it, if you build a chip with a bug in it, it's going to cost you another couple of years and millions of dollars. So you have to make sure you spend a ton of effort making sure it's right. So you go through a lot of simulation, a lot of testing, and then once you have the design, you have to go through synthesis. So you have to turn that functional definition into actual gates and ultimately transistors. And those get sent, as you said, to the factory, to what's called a foundry, who will actually build the silicon for you.

Tarun (00:28:21):
One actual question regarding testing, compared to maybe testing that you have worked on previously, how does writing tests for cryptography primitives differ? Like, is there a different type of testing suite? Do you find that you have to have more like edge cases? Is there any difference, because I could imagine that it's actually much harder to test cryptography primitives because you really care about the edge cases a lot more than you might in some other processing.

Simon (00:28:54):
Yeah. That's actually, it's a great question. It is different. So in a normal application, you tend to have a lot of what's called control logic. There's two parts of a chip: control logic and data path. So the data path are the units that are actually doing the computing, adding two numbers or something like that. And the control logic is what decides what data to send into the ALUs and where to put it afterward. It's sort of orchestrating the work. Control logic tends to have a lot of branches and things like that, there are different conditions it's looking for. So it tends to be difficult to test and validate. In cryptography you tend to have pretty simple control logic, unlike for a CPU, for example. That's good news, right? Because you can test it relatively easily. And then for the functional units, like say you have a multiplier, typically what you see with cryptography is it either works or it doesn't, and you really get the answer, the right answer, or there's just total nonsense.

Simon (00:29:45):
However, the corner cases, like if you miss a carry somewhere, that can be tricky. The approach generally used in these scenarios is to do formal verification, it's the best way to do it, which is where you build a model, a logical model of the design, and you have to go formally prove that it implements a multiplier or something like that. And that can be a difficult task, but it's the way you ensure that it's correct and you haven't missed corner cases. The other thing is, in some cases of cryptography like Bitcoin mining, for example, it's actually okay to get the wrong answer sometimes. If you come up with a hash that is wrong, because you miscomputed it because there was a corner case, well, in the end, you're going to check that on CPU anyway and just throw it away. So it's interesting in mining where you can actually accommodate sort of incorrect results pretty easily. A lot of applications can't do that.
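The missed-carry problem can be shown with a limb-based multiplier, which is roughly how wide multiplications are decomposed in both software and hardware. The Python sketch below checks such a multiplier against Python's built-in big integers using hand-picked corner cases plus random inputs. This is ordinary differential testing, not the formal verification Simon describes, and the 16-bit limb width, function names, and test parameters are assumptions for illustration.

```python
import random

LIMB_BITS = 16
MASK = (1 << LIMB_BITS) - 1


def to_limbs(x: int, n_limbs: int) -> list:
    """Split x into n_limbs little-endian limbs of LIMB_BITS each."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n_limbs)]


def from_limbs(limbs: list) -> int:
    """Recombine little-endian limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_limbs(a: list, b: list) -> list:
    """Schoolbook multiplication on limbs with explicit carry propagation."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            cur = out[i + j] + ai * bj + carry
            out[i + j] = cur & MASK
            carry = cur >> LIMB_BITS
        # Dropping this final carry is the classic corner-case bug:
        # most random inputs still pass, all-ones inputs do not.
        out[i + len(b)] += carry
    return out


def count_mismatches(n_limbs: int = 8, trials: int = 1000, seed: int = 0) -> int:
    """Compare mul_limbs against Python's big-integer product."""
    rng = random.Random(seed)
    top = (1 << (LIMB_BITS * n_limbs)) - 1
    corners = [0, 1, MASK, MASK + 1, top - 1, top]
    cases = [(a, b) for a in corners for b in corners]
    cases += [(rng.randint(0, top), rng.randint(0, top)) for _ in range(trials)]
    failures = 0
    for a, b in cases:
        got = from_limbs(mul_limbs(to_limbs(a, n_limbs), to_limbs(b, n_limbs)))
        if got != a * b:
            failures += 1
    return failures


if __name__ == "__main__":
    print("mismatches:", count_mismatches())
```

Random testing like this gives statistical confidence only. A formal proof is what rules out the rare input that no random trial happened to hit.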

Tarun (00:30:35):
One thing I guess, I remember a lot of times for chips I'd worked on in the past, we would split the design into pieces that were sort of mission critical and that we'd formally verify. And on other pieces we'd do statistical verification, almost like fuzzing, where we'd send a bunch of inputs that we knew input/output pair expectations for through the different units. Like, do you find that you partition the designs or do you do formal verification over the whole design? Cause I was just curious, cause it does add a bunch of time to the engineering process. And it'd be interesting to know how you guys think about that.

Simon (00:31:17):
I mean, you wouldn't formally verify the whole design most likely. There are some parts that you can validate with the normal validation techniques and get good enough coverage and have confidence that it'll work, or you can have workarounds in the case where it doesn't. The arithmetic is really the place where you mostly want to formally verify things. And if you can tolerate mistakes, and Bitcoin and some other applications are like that, then you don't have to do any formal verification. You can use statistical answers and avoid most testing. And say, well, if I'm right 99% of the time, that's good enough.
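
The statistical approach described here, sending inputs with known expected outputs through an arithmetic unit, can be illustrated in software. The following is only a minimal sketch, not Supranational's test flow: `limb_multiply` is a hypothetical stand-in for the unit under test, the 6 x 64-bit limb size is an assumption chosen to match a 384-bit field element, and Python's built-in big integers act as the reference model. Carry-heavy edge values (all-zero and all-one limbs) are mixed in because a missed carry is the kind of corner case mentioned above.

```python
import random

LIMB_BITS = 64
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split an integer into n little-endian 64-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def limb_multiply(a, b):
    """Schoolbook multiplication on limbs (the hypothetical unit under test)."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = out[i + j] + ai * bj + carry
            out[i + j] = t & LIMB_MASK
            carry = t >> LIMB_BITS
        out[i + len(b)] = carry
    return out

def random_operand(rng, limbs):
    """Random limbs, biased toward values that stress carry propagation."""
    edge = [0, 1, LIMB_MASK]
    return [rng.choice(edge) if rng.random() < 0.3 else rng.getrandbits(LIMB_BITS)
            for _ in range(limbs)]

def random_test(trials=1000, limbs=6, seed=0):
    """Compare the unit under test against the big-integer reference."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = random_operand(rng, limbs)
        b = random_operand(rng, limbs)
        if from_limbs(limb_multiply(a, b)) != from_limbs(a) * from_limbs(b):
            return False
    return True
```

A passing run only says that no mismatch was found among the sampled inputs. That gap is why arithmetic is the part singled out for formal verification.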

Tarun (00:31:50):
Cool. And I guess one last question, like sort of a background question, I think for a lot of listeners, it might be hard to know the whole hardware stack of how you go to a factory and how you make a mask and stuff like that. But maybe as a startup, you have to rely on a lot of other vendors and different counterparts who help manufacture the hardware. So what are the advantages and disadvantages of being a startup building hardware versus being at a big place that maybe has their own foundry, has their own fab, things like that. Sorry, fab is a fabrication plant.

Kelly (00:32:35):
So I mean, maybe I can give a little bit of information on that. I mean, the biggest problem with being a startup in this hardware and semiconductor manufacturing space is just that it's so expensive. It's an incredibly expensive process to make a chip. And when you look at the ecosystem at this point, there's really only three or four companies that have the ability to make cutting edge chips. So these are people like Intel, Taiwan Semiconductor or TSMC, and Samsung. And so, you know, in addition to being expensive, we're in a situation now where there's sort of a global semiconductor shortage. And so you really only have two companies that make chips for other people, and that is TSMC and Samsung. And as a startup, you're going and competing against the likes of Apple, AMD, Nvidia, Google, Amazon, who are all trying to make their chips out of one of these two factories. So I would say that's one of the most difficult things about being in this space, but Simon, maybe you can talk about some of the pros and cons, based off of your experience making chips at Intel.

Simon (00:33:40):
Yeah. I mean, I think one of the pros for us being in the space is that the big companies really aren't looking at crypto hardware at this point. So we get to work with some of the smartest and best people on these problems. And to the extent that we can, we pick problems that are the right size. I mean, we're not trying to compete with those big companies. We're building more dedicated functionality. We try to keep things very simple. And simplicity means that it takes less work to build, it's less expensive. And because the space is pretty new, there's actually an opportunity to deploy fairly simple designs that have a real impact. And so, being a startup here, we can look at those things, and they are more of the right size for what we need and what the industry needs currently. And so I think that's one of the advantages. There are a lot of companies that will provide services. So if you need, for example, PCIe interfaces or DDR, those are readily available in the TSMC ecosystem, so you don't need to be a big company to get all those. So we try to take building blocks that already exist in the ecosystem and just reuse them. And we try to take even open-source, there's the RISC-V open-source CPU. We try to reuse what's out there and really both benefit from and contribute to the open-source ecosystem in the hardware space. That's really a new kind of development in the past five or 10 years, but it's pretty exciting because it's like open source software. It's a chance to move hardware into more of this open source domain and hopefully accelerate innovation in this space.

Anna (00:35:10):
So you just mentioned open source hardware. Do you think you'd actually see a resurgence of open source Bitcoin ASICs? Or do you think you can only do this open source stuff for like future tech, something coming up?

Simon (00:35:26):
I think for Bitcoin ASICs it's probably pretty hard because those are on the seventh generation, or whatever they're on, of highly refined hardware. It's a pretty mature environment. I think it would be hard to compete on an open-source basis for Bitcoin ASICs just because they're really very good right now. But you know, when you think about building new chips quickly and cheaply, I think open-source does get pretty interesting and there's a need, I think, for that that's really developing and you look around and some of the big companies are interested in that too, like Google seems to be interested in this open source hardware. They have, Kelly, maybe you can talk about that platform that Google has opened up. I forget what it was called to build chips.

Kelly (00:36:08):
Yeah. I think the main thing here is that open source hardware in many ways is sort of in its infancy right now. I think, the barriers to entry are coming down in this space. And so there is an opportunity now to use open source tools, to develop open source hardware and to release that sort of the source code for the hardware out into the world. It is definitely picking up interest at some of the major manufacturers like Google and others, and then RISC-V is clearly another area where open source hardware is starting to see some momentum.

Tarun (00:36:41):
Do you foresee the open source hardware movement having a Linux-like evolution, where there's this many long years of grinding before suddenly just everyone adopts it because of some kind of use case, like in the case of Linux with AWS and data centers, or do you view it as a forever niche part of the market?

Kelly (00:37:06):
Yeah, I think it's a great question. Some of the efforts around things like RISC-V certainly give a path towards that Linux-type evolution for open source hardware. And one of the exciting things that we're starting to see in the ecosystem today is that things like data centers and enterprise infrastructure are starting to move towards what I would call a multi-architecture paradigm. So it used to just be that x86, and Intel, was the only thing in the data center. And we're starting to see everyone from Apple to Amazon to Google experimenting with at least ARM architecture in the data center. And so I could certainly foresee, as RISC-V matures and gets more performant, that could be an eventual outcome. You know, the biggest difficulty there will be: can you get enough talent to accrue high quality open source hardware implementations out there in the ecosystem. With Linux and software I think that's maybe an easier proposition than it is in hardware.

Tarun (00:38:12):
And actually I think just for our audience, RISC-V is an instruction set. So it's like a set of opcodes in Ethereum or opcodes in a virtual machine. And RISC stands for reduced instruction set computer. And it's one of the most well-known open source instruction sets.

Anna (00:38:36):
Do we think we're ready for SNARKs?

Tarun (00:38:38):
I think so. Yeah. Sorry. Hopefully, it wasn't too pedantic to go through a lot of hardware, but I do think for an audience of cryptographers and software developers, sometimes it's good to get a vernacular, so everyone knows the same language.

Anna (00:38:56):
Now, so far, we've talked about the project that you did on VDFs. We've now defined a lot of the pieces of the hardware stack, which I think is super useful. Let's move towards SNARKs. What projects have you been doing with SNARKs, what are your goals with this? How exactly do you interface with the SNARK community or SNARK technology?

Kelly (00:39:18):
Yeah, so I can start and maybe talk about one of the recent projects that we worked on with the folks at Protocol Labs. So for those that aren't familiar, Protocol Labs is the organization behind projects like IPFS, libp2p, and, most recently, Filecoin. And Filecoin is a novel new blockchain that uses storage power, rather than traditional mining, as a way to give users relative influence in the protocol. To ensure the accurate accounting of this storage power, they use zero knowledge proofs. And these are things like proof of spacetime and proof of replication, which I'm sure you've had folks talk about on the podcast. As part of this, the Filecoin network, as far as I know, is the network that processes more zero-knowledge proofs than any other blockchain protocol out there. I think they process something on the order of 5 million zero knowledge proofs a day.

Anna (00:40:16):
More than Zcash, I guess.

Kelly (00:40:19):
Yeah. Maybe more than Zcash over its entire lifetime at this point. So it's obviously a massive new sort of paradigm shift in terms of the use of zero knowledge proofs, but that also entails high computational requirements. So, one of the unfortunate things about zero knowledge proofs and this advanced cryptography is that it takes a lot of compute power to perform today. And so one of the recent projects that we worked on with them was speeding up both the verification of zero knowledge proofs and also the performance of actually generating the zero knowledge proofs.

Anna (00:40:59):
And you're talking specifically about SNARKs in this case.

Kelly (00:41:02):
Yes, that's correct. So in the case of Filecoin, what they use is the traditional Groth16-style zero knowledge proof, and they use the BLS12-381 curve as well. So, Simon, I guess maybe do you want to talk about some of the work that went on both on the CPU side for the proof verification as well as the GPU side for the proving?

Simon (00:41:25):
Yeah. So as Kelly mentioned, we looked at both of those. Filecoin had a concern that both verification and proving, for the number of SNARKs they are processing, were becoming computational bottlenecks. And so those are the areas we dove into. They're very different sizes, I would say. So verification of SNARKs is a pretty fast operation measured in milliseconds, whereas proving of SNARKs is measured in seconds, even minutes. So, one of them is much, much harder than the other. So the core operations of SNARKs tend to be something called multi exponentiation and FFT. Multi exponentiation is really the dominant computation that has to be done. So when we looked at verification, we looked at a few things. One is, there's a certain function that has to be performed to show that the SNARK is correct. This is done on CPU because it's pretty fast. One thing you have to consider when accelerating is that even sending work to a GPU takes time, milliseconds of time. So a lot of times, for something that's relatively quick, the overhead of going to an off-CPU platform isn't worth it. So in that case, we looked at how we could algorithmically improve that multi exponentiation on CPU. We make use of all the cores very efficiently. When we have a bunch of work to send to cores, we try to balance the work so that all the cores finish at the same time. If you do it naively, you can have one core that takes a lot longer and everybody's waiting for it, and that's inefficient. So we go to the trouble of making sure that the work is as balanced as possible, using all the resources. We do pre computations when we can, and that can speed up the multiplication many times over. So in this case, once we put all the work in, refactored the algorithm and made it efficient, we got about a 10 to 12x speed up for verification, which was pretty big and it met their needs as far as making the system continue to function smoothly.
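
The point about balancing work so that all cores finish together can be shown with a generic scheduling sketch. It is not the scheme Supranational used, which the conversation does not describe. It is the textbook greedy "longest task first" heuristic, and the function name `balance` and the per-task cost estimates are assumptions made for illustration.

```python
import heapq

def balance(costs, workers):
    """Assign tasks to workers, always giving the next-largest task
    to whichever worker currently has the least total work."""
    heap = [(0, w) for w in range(workers)]      # (current load, worker id)
    assignment = [[] for _ in range(workers)]
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, w = heapq.heappop(heap)            # least-loaded worker
        assignment[w].append(idx)
        heapq.heappush(heap, (load + costs[idx], w))
    loads = [sum(costs[i] for i in part) for part in assignment]
    return assignment, loads

def naive_loads(costs, workers):
    """Contiguous equal-count chunks, ignoring how expensive each task is."""
    size = (len(costs) + workers - 1) // workers
    return [sum(costs[i:i + size]) for i in range(0, len(costs), size)]
```

With estimated costs `[8, 7, 6, 5, 4, 3, 2, 1]` on four workers, the naive split gives loads of 15, 11, 7 and 3, so three workers sit idle waiting for the first. The greedy assignment gives every worker a load of 9.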

Anna (00:43:24):
And was that on the proving side or was that overall proving and verifying?

Simon (00:43:28):
That's just verifying. So those are pretty small. The next thing was to look at proving, and because it's so much of a bigger operation, that's a whole other optimization effort. There we used GPUs. That's another can of worms. So we use different algorithms. It's still multi exponentiation, but we use different algorithms, a different platform. And we refactored the whole thing. That we sped up by about four to five times and that isn't released yet. So that's still a work in progress as far as the integration, but that'll be coming in the coming months.

Anna (00:44:02):
You just said two different platforms. Do you mean like two different hardware platforms? Like we defined before, one for verifying, one for proving?

Simon (00:44:11):
Exactly. So for verifying, because it's relatively small, we just keep that on the CPU. We use all the cores that are available. For proving, because it's much, much bigger, we can tolerate the overhead of going off chip to the GPU. And it was already using GPU. In fact, GPU for proving is quite a bit faster than CPU already, but we optimized it further and got another 4X out of the GPU. For proving, it's a combination of using all the cores and the GPU as efficiently as you can.

Tarun (00:44:39):
Yeah. I guess I want to talk through some of the technical details. So it sounds like, in the case of verification, its memory footprint is smaller and it seems like fast serial computation is much more valued than massively parallel type of things, but what aspects of proving end up being the most parallelizable? Because it does sound like a lot of the architectural improvements you were able to make came from taking advantage of parallelism within proof generation. So is it like you broke up the FFT into different components? Going through that would be pretty great.

Simon (00:45:19):
Sure. I mean, in both cases, the theme is parallelism. So for verification, we use all the cores and we do the multi exponentiation in parallel. But because it's small, verification uses about 32,000 bases, we can do pre computation, meaning we can pre-compute N multiples of each base, and we can save a lot of work that way when it comes to actual verification and get an 8X speed up. And because it's small, 32,000 bases, we have enough memory on the system to actually hold this pre computation.
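
To make the pre computation idea concrete, here is a toy sketch of fixed-base windowed tables. It is not the blst or Filecoin code. It works in the multiplicative group modulo a small prime instead of on BLS12-381 curve points, and the modulus `P`, the 4-bit window and the 32-bit scalar size are all assumptions picked to keep the example short. For each base, every value `base^(d * 2^(4j))` is stored ahead of time, so the multi exponentiation itself needs only table lookups and multiplications, with no squarings.

```python
P = 2**61 - 1        # toy prime modulus (assumption, not a curve parameter)
WINDOW = 4           # bits consumed per table lookup
SCALAR_BITS = 32     # toy scalar size

def precompute(base):
    """table[j][d] = base^(d * 2^(WINDOW*j)) mod P, for every window j and digit d."""
    windows = (SCALAR_BITS + WINDOW - 1) // WINDOW
    table = []
    for j in range(windows):
        step = pow(base, 1 << (WINDOW * j), P)
        row = [1]
        for _ in range((1 << WINDOW) - 1):
            row.append(row[-1] * step % P)
        table.append(row)
    return table

def multi_exp_with_tables(tables, scalars):
    """Compute the product of base_i^scalar_i using only lookups and multiplies."""
    acc = 1
    mask = (1 << WINDOW) - 1
    for table, e in zip(tables, scalars):
        for j, row in enumerate(table):
            digit = (e >> (WINDOW * j)) & mask
            acc = acc * row[digit] % P
    return acc
```

In this toy setup each base costs 8 windows times 16 entries, or 128 stored values. That trade of memory for speed is affordable for tens of thousands of fixed bases, and it stops being practical once the bases alone strain memory.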

Tarun (00:45:51):
Does the lookup table fit fully in cache? Or do you actually have to go to main memory?

Simon (00:45:56):
Well, it goes to memory, but memory is pretty fast. And you can prefetch those, if you want to, because you know what the next base is going to be. So what you can do if you want to, it turns out it doesn't really help very much because the cores are pretty good at prefetching anyway, but you can look ahead to the next base you need, the next lookup table element you need, and tell the processor to go read it in advance so that it's ready. But in fact, there's enough work going on that that's mostly hidden for you. So proving is a bit of a different story because it's huge. So in the Filecoin case, they're working with a hundred million bases or more, instead of 32,000. So you can't pre-compute anything because you can hardly even store all the bases alone.

Simon (00:46:34):
So in that case, it's really about picking the best multi-scalar algorithm. In this case, it's pretty much accepted practice and state of the art to use the bucketing algorithm, Pippenger's bucketing algorithm, which was used here too. So, for speeding up this work on a GPU, it is massively parallel. Multi-scalar multiplication is embarrassingly and easily parallel by its nature. And that works to our advantage. So we looked at speeding up the field arithmetic on a GPU, and we used Nvidia GPUs in this case, that's pretty common. And you can write assembly code, a form of assembly code for GPUs, so we worked on that to make that a little bit more efficient, and then some of it was making the transfer of data to the GPU overlap with the computation as efficiently as we can. You want to pick the right size of work to send to the GPU so that it efficiently uses all the resources, things like that.
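
For readers who have not seen the bucketing algorithm, the following is a small single-threaded sketch of the bucket method. It is an illustration under toy assumptions, not the GPU implementation discussed here: it uses integers modulo a prime as the group, so "adding a point to a bucket" becomes a modular multiplication, and the function name `pippenger`, the window size and the scalar size are all hypothetical choices. Within one window, every base whose digit is `d` is multiplied into bucket `d`. The buckets are then combined with a running product so that bucket `d` ends up counted `d` times.

```python
def pippenger(bases, scalars, modulus, window=4, scalar_bits=32):
    """Compute the product of bases[i]^scalars[i] mod modulus with the bucket method."""
    num_windows = (scalar_bits + window - 1) // window
    mask = (1 << window) - 1
    result = 1
    for j in reversed(range(num_windows)):       # most significant window first
        for _ in range(window):                  # shift earlier windows up by `window` bits
            result = result * result % modulus
        buckets = [1] * (1 << window)
        for g, e in zip(bases, scalars):
            digit = (e >> (window * j)) & mask
            if digit:                            # digit 0 contributes nothing
                buckets[digit] = buckets[digit] * g % modulus
        running = 1                              # product of buckets[d], buckets[d+1], ...
        window_sum = 1                           # ends as the product of buckets[d]^d
        for d in range(mask, 0, -1):
            running = running * buckets[d] % modulus
            window_sum = window_sum * running % modulus
        result = result * window_sum % modulus
    return result

def naive_multi_exp(bases, scalars, modulus):
    """Reference: one full exponentiation per base."""
    acc = 1
    for g, e in zip(bases, scalars):
        acc = acc * pow(g, e, modulus) % modulus
    return acc
```

The per-base work inside a window is a single group operation, and the windows are independent of one another until the final combination, which is one reason the method parallelizes well.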

Tarun (00:47:30):
How hard was it actually to deal with implementing the field operations on the GPU, where you have quite a bit of difference in terms of pipelining for different operations and thinking about warps and how much memory you have and stuff like that. Like, it does actually seem like a lot of these big integer operations might be annoying on a GPU.

Simon (00:47:52):
In fact, you can actually use the same code you use on CPU pretty well on the GPU for those. And it's really about the size of work you give to each warp. You want to make sure, for example, that the way you access memory, you want to try to pull in memory so that all the memory you pull in can be used by various warps. One easy mistake to make is you can read in memory, use one piece of it for a warp and throw the rest away. And so how you traverse memory becomes important, how you use registers. You want to make sure that, or try to optimize in a way the number of registers each little program is using, that allows you to run more programs at once. You can do that by tweaking how much data you store on the stack, for example, tricks like that.

Anna (00:48:36):
When you talk about using GPU, CPU, you're optimizing it using Assembly, you said, but what are you actually producing? Do you produce libraries or some code that sits between the GPU and something else? I'm trying to figure out what the output is when you do this work. Because I guess you're not changing the GPU, you're not altering the hardware in this case.

Kelly (00:49:00):
Yeah. That's exactly right. I mean, at the end of the day, the output is just some open source code that gets put up on GitHub. So in the case of Filecoin, we provided some optimized code that they integrated into their software. And now it lives on GitHub and folks could use it.

Anna (00:49:18):
It would be in their client software then I guess, or in their SNARK proving software.

Kelly (00:49:22):
Exactly. Exactly. So, Filecoin uses a fork of Bellman, which is the original software that was developed by the folks at the Electric Coin Company and used in Zcash. They've made some modifications so that it can run on things like GPUs or at least some of the operations can run on GPUs. And so this code has moved under Filecoin's repository in their SNARK software, essentially.

Tarun (00:49:46):
Do you see proof generation moving to custom silicon versus staying on CPU, GPU, like more traditional platforms? And what benefits and detriments do you see, if say there's a zk-proving ASIC that can handle certain types of circuits? Let's look forward 10 years where we're in a world where people actually want to generate the zk proofs of sizeable programs, maybe like programs that people run on Ethereum. It's still a TI-83 level of computation, but it's still much more than a Zcash transfer.

Kelly (00:50:28):
Yeah. Looking forward, I think we see this running on a variety of different architectures. From the standpoint of privacy and decentralization, a lot of people are going to want to run some of these transactions on their laptop or desktop. And most people don't have the need for anything more than that. Ultimately the question about whether this will move to an ASIC is really a question of economics, I think. Does the ASIC perform well enough that it's worth moving to and developing this custom silicon? And so I think it's yet to be seen, as GPUs continue to progress, how much better ASICs will be and what that cost improvement is. But I do imagine a future where zero-knowledge proofs are happening locally on your own laptop or desktop, or they could be happening in the cloud. So we're starting to see zk rollups be talked about, or zk-zk rollups. So it may be the case that you compute a small proof on your laptop and you send it up to the cloud where they aggregate hundreds or thousands of transactions together and do a very, very large zero-knowledge proof. That could happen on large servers, that could happen on graphics processors in the cloud or someday, if there's enough demand, happen on an ASIC.

Tarun (00:51:42):
One actual question in that regard is, you know, in people's phones, and I'm not sure if people know this, but most cell phones nowadays have reasonable secure enclaves and a non-trivial percentage of silicon dedicated to either privacy or security in some manner. Do you foresee a world where there's a ZKP-specific enclave on people's phones or in other devices?

Kelly (00:52:12):
It's a great question. I think that world is very far off. You know, having worked at Intel, every tiny little bit of silicon somebody wants to add into a processor gets heavily scrutinized because that ends up getting printed millions or billions of times. So I think we're very, very far away from that happening. You know, that being said, a lot of the work that we do is to try and make these sorts of primitives as fast as possible on every platform. So one of the projects that we work on is a project called blst, which is an open source cryptography library for BLS signatures. And we've specifically written code so that you can run this on a Raspberry Pi, you can run this on your ARM Macintosh, you can run this on your iPhone. So we're optimistic that we'll continue to get increasing performance out of these platforms. And we specifically target those as we write these applications.

Tarun (00:53:09):
When you write libraries like this, do you end up relying on compiler optimizations to be able to guarantee performance across different platforms, like LLVM IR type of tools? How do you actually guarantee this sort of platform independent performance? Because historically that's always been quite difficult so I'm curious how you think about that with cryptography primitives.

Kelly (00:53:34):
Sure. Yeah. It's actually quite the opposite in terms of relying on the compiler optimization, as most of the low level crypto primitives are actually written in Assembly that is bespoke for those platforms, but maybe I can let Simon talk about some of the work we've done on this BLS signature library and the work that one of our colleagues, Andy Polyakov, has done, which is quite amazing work.

Simon (00:53:56):
Yeah. I mean, as Kelly said, the way you get the performance and really ensure that you're getting the behavior you want is to write Assembly. I mean, anytime you go through the compiler, when those change, you know, it'll change the code that's generated, and performance may change. And in fact, sometimes in order to take advantage of the architecture, the only way is to write Assembly. So on x86, there are instructions called ADCX and ADOX that let you have two carry chains, and the compiler can't even generate those instructions. So in order to get good performance for large integer arithmetic, you actually have to write Assembly. So, as Kelly said, most of the low-level code is in Assembly. It gives us a lot of control over what's running there. Andy Polyakov has been working in the space for a long time. He has worked on other libraries in the past, like OpenSSL. So for example, a lot of the work in cryptography on the internet runs on some of his code. I think part of what we do as a company is trying to find exceptional people in these areas to engage and really build the best software or hardware that we can, and really get people who understand the platform and the architecture and how to really get the performance out of them.
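
The idea of two carry chains can be modeled in a few lines, with the caveat that Python cannot issue x86 instructions and this is only a software model of the data flow, not blst's assembly. In one multiply-accumulate row, each 64 x 64-bit product splits into a low and a high half. The low halves are added with one carry (the role the carry flag plays for ADCX) and the high halves with a second, independent carry (the role the overflow flag plays for ADOX), so neither chain has to wait for the other. The names `mul_acc_row` and `multiply` are hypothetical.

```python
MASK = (1 << 64) - 1

def to_limbs(x, n):
    """Split an integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def mul_acc_row(acc, offset, a, b):
    """acc[offset:] += a * b, where b is a single limb, using two carry chains."""
    n = len(a)
    cf = 0   # carry of the low-half chain (models the flag used by ADCX)
    of = 0   # carry of the high-half chain (models the flag used by ADOX)
    for i in range(n):
        prod = a[i] * b
        lo, hi = prod & MASK, prod >> 64
        t = acc[offset + i] + lo + cf            # chain 1: low halves
        acc[offset + i], cf = t & MASK, t >> 64
        t = acc[offset + i + 1] + hi + of        # chain 2: high halves
        acc[offset + i + 1], of = t & MASK, t >> 64
    t = acc[offset + n] + cf                     # fold the two final carries in
    acc[offset + n] = t & MASK
    acc[offset + n + 1] += of + (t >> 64)

def multiply(a, b):
    """Full multi-limb product built from one row per limb of b."""
    acc = [0] * (len(a) + len(b) + 1)
    for j, bj in enumerate(b):
        mul_acc_row(acc, j, a, bj)
    return acc
```

With a single carry flag, the low-half and high-half additions would have to be serialized or the flag saved and restored between them. Keeping two flags live at once is what makes the interleaving in the loop body possible in hardware.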

Tarun (00:55:06):
If we view Moore's law as dead, or basically almost dead, depending on your part of the hardware world, do you view the adoption of ASICs and specialized hardware as strictly increasing over time? Or do you see custom architectures coming up for cryptography related primitives, or do you still think it will end up being RISC and ARM and x86?

Kelly (00:55:30):
Yeah, I'll give my take and I'll let Simon answer as well. I do think one of the inspirations for the company is folks like John Hennessy and David Patterson from UC Berkeley who had this vision of a new golden age of computer architecture that's driven by domain specific hardware. And cryptography is just one example of a domain where you generate hardware specifically for the workload. And I think we'll continue to see that for other workloads as they reach critical mass and size. It's certainly something that we've seen with things like neural networks. This used to be something that was done just on your CPU originally, and then it moved to your graphics cards. And then now, ultimately you've got a number of startups building custom silicon for high performance machine learning and high performance AI. So I don't think it's gonna be one or the other, but I do think that we will see increasing amounts of domain specific hardware as we move into the future.

Simon (00:56:32):
Yeah. Just to add to what Kelly said, while Moore's law is maybe not speeding things up on a single threaded basis every year the way it used to, there is a lot of parallelism available, and that's what FPGAs and GPUs and ultimately ASICs can provide. One of the questions, I mean, I think machine learning is a good analogy here. One of the challenges with going to custom silicon is that the space has to be mature enough to support it. The last thing you want is to build silicon and then find that two years later the application has changed, they're using different algorithms, different data types, and your silicon is no longer useful. So eventually it'll probably make sense, if one of these areas really takes off and becomes solidified, to have custom hardware. You have to be careful, though, because in the cryptography space it seems like every six months there are new schemes, new applications of things. So it's about getting to the right level of stability. And then on top of that, these companies are doing a great job increasing the performance of these commodity platforms every year. You can wait a year and buy a platform that has twice as many cores, is twice as fast off the shelf. And so those are the things, as Kelly said, it becomes an economic problem of balancing out those competing forces. And it has to just make sense in the end to spend that kind of money and lock in the functionality and get the performance around that particular design.

Anna (00:57:53):
So I have one last question and that's actually about Nvidia, the news that came out. I guess it was last week or two weeks ago, where Nvidia's now going to be blocking Ethereum mining. So in the GPUs that they're shipping, they're going to make them specific for gaming. And then they're going to release a new product that is more for this mining on GPU platforms. I'm curious what you think about that. I wonder if it means anything to the work that you're doing.

Kelly (00:58:21):
Sure. I mean, I'll take a swing at it first. I don't think it has a big impact on the work that we're doing. It seems that this announcement is very targeted at the Ethereum mining algorithm. And the code that we write is sort of general GPGPU type code that would be indistinguishable from any other program that someone may write. In terms of why they may be doing it, I think there's a number of reasons. I mean, obviously first and foremost, their core market is folks like gamers or even enterprises that are doing things like machine learning or AI on GPUs. And they wanna ensure that they can continue to serve those markets. The other thing, though, is that you don't necessarily need a lot of the GPU to do Ethereum mining. So one of the things this could do is provide a new product offering for them, where they can take off pieces of the silicon that aren't needed, or that ended up becoming faulty in the manufacturing process. And they can create this new product offering so that ideally they're able to better extract value out of that market segment. So if Ethereum mining is really hot, they can price that one higher and if it's not doing so well, maybe they can price it lower. So I think it was ultimately a business decision more than anything. So that's my 2 cents on it.

Anna (00:59:40):
Semantically, if they take off everything, but the thing you need to mine, wouldn't that just be an ASIC? Isn't that kind of what it becomes?

Kelly (00:59:49):
Yeah. I mean, at the end of the day, a GPU is kind of an ASIC. It just has quite a bit of functionality and it's got a programming language that you can use with it. But ASICs are on a spectrum from fully programmable, like a CPU, all the way down to single function, like a Bitcoin miner. And I think a GPU falls sort of in between there.

Anna (01:00:11):
Okay. And if you remove functionality, then it goes further along that path towards an ASIC, I guess.

Kelly (01:00:18):
Yeah. I think that's one way to look at it.

Anna (01:00:21):
I don't know if that makes sense, but at least that's how I'm hearing it. Well, listen, thank you so much, Simon and Kelly for coming on the show and helping us explore this pretty wild world of hardware, something we hadn't touched on before on the podcast.

Kelly (01:00:36):
Thanks a lot guys.

Simon (01:00:37):
Yeah. Thanks for having us, it's our pleasure.

Anna (01:00:40):
Cool. So if anyone would be interested in getting in touch or finding out more, how can they reach you? How can they participate in this?

Kelly (01:00:47):
Yeah, absolutely. If you're interested in some of the projects, you can always reach out at hello@supranational.net. Also, if you're interested in joining the team, we have a lot of interesting projects related to zero knowledge proofs. So whether you sit on the theory side or are interested in the hardware optimization or even software programming, please feel free to reach out. We have a number of open positions and would love to have some of the listeners join us here.

Anna (01:01:10):
Fantastic. Cool. So thanks again. And to our listeners, thanks for listening.