Davit Buniatyan: Eventually you make one of those decisions that you're going to always regret. Doesn't matter what. You choose to start a company, or you go and finish your PhD. My advisor gave me this advice saying that, "Hey. You can do both. You can do both at the same time. There's no problem with that. But you won't succeed in either of them. So, you have to focus and choose one and try to be best at it." Eric Anderson: This is Contributor, a podcast telling the stories behind the best open source projects and the communities that make them. I'm Eric Anderson. Today, we'll be discussing Activeloop Hub with Davit Buniatyan, who is one of the creators of Activeloop Hub. Davit, thanks for coming on the show. Davit Buniatyan: Thanks Eric. Thanks for having me. Eric Anderson: There's lots to discuss here, and I'm excited to get into all of it. But first, introduce Activeloop Hub to our listeners. Davit, if you would? Davit Buniatyan: So, Activeloop Hub is a dataset format for AI applications. The vision is to become a database for AI. And currently, it's more... Loosely speaking, or technically speaking, it's more like a data store for deep learning applications. So, the main goal behind the scenes is that you have all these data formats for tabular data, like Parquet. You have databases. You have data warehouses, data lakes, and now the so-called lakehouse. But you don't have one for specialized, especially deep learning, applications. And over the past 10 years, we have seen this huge growth in applications of neural networks on unstructured data or complex data: images, video, and audio. But still, data scientists... They operate on top of this data as if it were files on your desktop, or objects or blobs on an object storage. So, that's what we are trying to change here: to provide a unified format and access for data scientists to work with large-scale datasets when they are operating on top of deep learning applications. Eric Anderson: Got it.
That was a useful clarification at the beginning, that it aspires to be a database for images; today, it's more like a data store. And does that imply that with time, you might add more database-like functionality? Davit Buniatyan: Yes. So as we have seen, there's a lot of... Like when you mention a database, it comes with a whole lot of expectations on how the data should be reached and accessed. But a lot of those expectations are not there yet for deep learning practitioners. For example, ACID transactions. Or consistency, like the C in ACID, across multiple writers. A lot of times, the order is not important when you're training those models. Even further... For example, in a training process, you will expect the data to come in a shuffled manner, in random order, so you can train your models best. So, the expectations or the use cases are slightly different, and that informs how to name or how to frame the tool itself so that it can be helpful for processing the data. Eric Anderson: You compared it to Parquet. And when people are using Parquet, sometimes they refer to their storage, the data lake. Could you call this a data lake for computer vision? Davit Buniatyan: Yes. That's also a very good synonym for what we are doing. Eric Anderson: Good. And then, I guess one other aspect I'm curious about, and then we'll get into how this all began. But, you also said that most people are... The kind of traditional way is to download files locally to your computer and then operate on them as you're going through training exercises or testing inference. This prevents the need for downloading locally then. You're able to take the traditional tooling and kind of connect it to this cloud store, a cloud data lake. Is that the idea? And so, you do your kind of TensorFlow, if you will, or PyTorch, and it accesses Activeloop directly. Davit Buniatyan: Yes. Let me try to give you a... Kind of an example of how it works.
Let's say you have a million images, and you also have a million labels. And those labels are... Take the simplest case, where it's dog or cat. And you are trying to build a model that will classify if this is an image with dogs or cats. The way people usually operate now is that they write a hundred lines of boilerplate code to parse the folder structure if it's local on the machine. If it's remote, then you have to actually do a copy to your local machine and then start the process. Imagine this data grows huge. Or instead of having a single node, you also do multi-GPU training. Like distributed training, where you have to run on multiple machines, or you do training or inference on a hundred GPUs. Every time you start a virtual machine on the cloud, or open your Jupyter notebook on your local machine, you have to go and copy the data from the centralized source to your machine. Davit Buniatyan: So now what we do differently is, first of all, instead of treating these 1 million images or labels as single files, we treat them as tensors or columns in our... Let's say, call it a dataset. You have multiple columns. Each column for us, instead of being a one-dimensional array, is now an n-dimensional array, or a tensor. So now, in your dataset... You have an image tensor and a label tensor. And the image tensor has the shape of 1,000,000x512x512x3, and the label tensor is 1,000,000x1. And deep learning simply becomes learning a function from image tensor to label tensor. And you can either use... Connect this to PyTorch or TensorFlow. And instead of now actually copying the data, the data gets streamed over the network to your models directly. So, it doesn't get cached on the local disk, and you can save a lot of time while you are processing this dataset. Davit Buniatyan: And what we do behind the scenes, where the core value comes into place, is that we take this 1,000,000x512x512x3 huge array...
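[Editor's note: to make the column/tensor idea concrete, here is a minimal, purely illustrative Python sketch. This is not Hub's actual format; the names and the int64 encoding are hypothetical. The point is that storing a "label tensor" as one flat binary column, instead of a million tiny files, makes reading sample i a fixed-offset read.]

```python
import struct

# Illustrative only: a "label tensor" stored as one flat binary column.
# Each sample is an int64, so sample i lives at bytes [i*8, i*8+8).
labels = [0, 1, 1, 0, 1]  # 0 = cat, 1 = dog
column = b"".join(struct.pack("<q", v) for v in labels)

def read_label(buf, i, itemsize=8):
    # A real store could issue a ranged GET for exactly these bytes on S3,
    # instead of listing and opening one object per sample.
    (v,) = struct.unpack_from("<q", buf, i * itemsize)
    return v

print(read_label(column, 3))  # -> 0 (a cat)
```

The same addressing trick extends to the image tensor: with a known shape like 1,000,000x512x512x3, a sample index maps directly to a byte range, which is what lets frameworks stream samples without parsing a folder structure.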
This is not huge. This is like a small "huge" dataset, basically. And we chunk the data and lay it out on top of an object storage like S3 on AWS, such that when you stream the data from S3 to a virtual machine, for the GPUs or for the models while they are doing the training process, it feels like the data is actually local to the machine. Which means that now, instead of your model being limited to... Let's say one terabyte of your local disk, it can have access to a petabyte-scale dataset sitting on top of S3. And instead of you running a single machine, now you can have a workload of a thousand machines. Davit Buniatyan: So it's very similar to how... Let's say MapReduce or Spark would operate on top of this data, but with nuances, or with considerations, especially for deep learning applications. And one of those considerations is that, when you are running a compute in machine learning, deep learning models... They don't care what the data is. It's a tensor in and a tensor out. And then, you have the architecture behind the scenes. Obviously this architecture is biased toward the type of data it's accepting. Although with recent transformer models, it's becoming kind of universal. So our goal, or the goal of Hub, is to prepare the data, lay out the data on a remote storage, and then transfer it over the network to the model, as simply from the model's perspective as possible. Because before, when storing the data, we asked, "What is the best way for humans to store the data?" Davit Buniatyan: But now we are kind of asking, "Can we actually optimize this?" Because we know how the datasets are operated on, how they're going to be computed on with deep learning applications. Now, we can backtrack and say, "Okay, what would be the best way to store this data? On a local or remote storage? And then, how to move it over the network through the caches and then bring it to the machine?" And once you also know... The big difference... Let's call it another difference.
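[Editor's note: the chunking idea can be sketched as splitting the first axis into fixed-size chunks, each stored as its own object under a predictable key, so a reader fetches only the chunks covering the samples it needs. The key scheme and chunk size below are made up for illustration; they are not Hub's real layout.]

```python
# Hypothetical chunked layout: samples 0-3 in chunk_0, 4-7 in chunk_1, etc.
CHUNK = 4

def chunk_key(tensor_name, sample_index):
    # Maps a sample index to the object-storage key holding it.
    return f"{tensor_name}/chunk_{sample_index // CHUNK}"

def keys_for_slice(tensor_name, start, stop):
    # All chunk keys needed to serve samples [start, stop).
    first, last = start // CHUNK, (stop - 1) // CHUNK
    return [f"{tensor_name}/chunk_{c}" for c in range(first, last + 1)]

print(chunk_key("images", 9))          # images/chunk_2
print(keys_for_slice("images", 2, 7))  # ['images/chunk_0', 'images/chunk_1']
```

Because the mapping from sample index to key is pure arithmetic, a thousand workers can each compute which objects they need and fetch them independently, with no central index to consult.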
People can use distributed file systems. Or they can use a remote way of accessing this data. But one thing that these file systems, at the operating system level, don't know is how this data is going to be computed on. Like, "What does the future look like? What are the next 10 elements? How am I going to compute them?" And if you can incorporate this into the way you are both storing and then transferring this data over the network, then you can actually guarantee performance and speed for your machine learning models. Eric Anderson: Yeah, totally. And maybe to extend this conversation just a little bit, one of the principles that I think, as an industry, we've kind of agreed is important is that you want to process the data the same in training as you do in production and inference. The data is prepared the same way. I know that's really important when you're doing lots of data transformations, like you would with numerical or tabular data. It's probably less critical with images, although maybe there's some processing. But at least at a minimum what you're doing is, you're bringing training to the cloud where production's going to happen anyway, or inference is going to happen. Are there benefits in the pre-processing of images from having a similar processing environment the way I'm describing? Or is that not as important, given there's probably less processing on images? Davit Buniatyan: No, it's actually one of the considerations. "When is the right time to do the pre-processing of the data, and how to do that?" So, let's take the training example. When you get arbitrary-size images... When you give the input to the model, you actually have to resize and normalize the data... Eric Anderson: Yep. Davit Buniatyan: ... Before feeding it to the model. Otherwise, your model will not learn very well. And then there's a consideration, "Okay, when is the right time to do this pre-processing? Should I do it actually on the remote storage?"
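[Editor's note: the "knowing the future" point can be illustrated with a toy prefetching loader. Because the epoch's index order, even a shuffled one, is fixed up front, the loader can fetch ahead of the consumer. This is a hypothetical sketch, not Hub's actual loader; in a real pipeline the fetches would run concurrently with compute.]

```python
from collections import deque

def stream(indices, fetch, lookahead=2):
    # Yields samples in the given order, keeping `lookahead` fetches queued,
    # because the full future access order is known in advance.
    buf = deque()
    it = iter(indices)
    for i in it:  # warm the buffer
        buf.append(fetch(i))
        if len(buf) >= lookahead:
            break
    for i in it:
        yield buf.popleft()
        buf.append(fetch(i))  # in real life this fetch overlaps with compute
    while buf:
        yield buf.popleft()

data = {i: i * 10 for i in range(5)}   # stand-in remote store
order = [3, 0, 4, 1, 2]                # a pre-shuffled epoch
print(list(stream(order, data.__getitem__)))  # [30, 0, 40, 10, 20]
```

A generic file system cannot do this, because it sees only one read at a time; a deep-learning-aware loader sees the whole epoch's plan.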
And what can you do recently, using the new APIs on S3 or using Lambda functions? Or, can you actually do it on the CPU while your GPU is running the computation? Furthermore, with unstructured data, decompression actually becomes a big bottleneck. Davit Buniatyan: And you might think that the network is the bottleneck or the pre-processing is the bottleneck. But actually, decompressing the data... I don't know. It could take... The simplest example: you have a PNG file format, and then you're bringing it to the tensorized form, where you have to fully decompress it. That becomes the bottleneck. So, what Hub is doing behind the scenes on behalf of the data scientist is that it does all these tweaks. "Okay, how should I move the data? When should I decompress? Where should I put it into the cache? How should I take this from the cache and then apply the pre-processing function that the user defined?" And then, "How should I feed this into the machine learning models?" So, we try to make the data scientist's life as easy as possible, and then give them... You can think about it as data lake access for their datasets. Eric Anderson: Got it. I think I follow. This is good. And Davit, let's take some time to talk about how you got to this. Presumably, this was like a personal problem of yours. And maybe you can talk to us about how you encountered this problem and then discovered that it was kind of a broad problem? Davit Buniatyan: I wouldn't call it a personal problem. So before starting the company, I was doing a PhD at Princeton University. Actually, funny story from when I got into Princeton. I got into... Like the advisor... A professor there interviewed me, accepted me to a computer vision lab. I was so excited to come there. And apparently when I got there, he had left for the Bay Area to start his own self-driving car company. So, I had to find another professor. Because I didn't want to join a startup at that point. He did offer.
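[Editor's note: a quick standard-library illustration of the decompression point. The compressed bytes are cheap to move over the network, but turning them back into raw sample bytes costs CPU time, and that is the cost a data loader has to hide. zlib stands in here for PNG decoding; the sizes are arbitrary.]

```python
import time
import zlib

# ~1 MB of synthetic "sample" bytes, compressed once up front.
raw = bytes(range(256)) * 4096
compressed = zlib.compress(raw, level=9)

# Decompression is the step that must happen before tensorizing the sample.
t0 = time.perf_counter()
out = zlib.decompress(compressed)
elapsed_ms = (time.perf_counter() - t0) * 1000

assert out == raw
print(f"moved {len(compressed)} bytes, decompressed to {len(raw)} bytes "
      f"in {elapsed_ms:.2f} ms")
```

Multiply that per-sample cost by millions of images per epoch and it is easy for decompression, not the network, to become the pipeline's bottleneck, which is why deciding *where* and *when* to decompress matters.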
And I accidentally or incidentally got into this neuroscience lab led by Sebastian Seung, who was working on this field called connectomics. And connectomics is a new branch of neuroscience. I'm not sure if you've heard of it. It tries to reconstruct the connectivity of neurons inside the brain. So what we were doing more specifically, we were taking a mouse brain, a one millimeter volume of it, and cutting it into very thin slices. Davit Buniatyan: Each slice was at nanometer resolution... So, each slice was 100,000x100,000 pixels. And we had 20,000 of those slices per sample. So the dataset size was getting to petabyte scale. And what we had to do is, we had to take this data and then use deep learning, or artificial neural networks, to be able to segment the neurons, find the connections, build the graph, so later neuroscientists can do research to understand how the brain works, what the real, biologically inspired algorithms are. Because all the neural network stuff that you see today is actually based on a 50-year-old understanding of how the neurons, or how the brain, work. And a lot has changed. For example, one interesting thing you can prove out is that back-propagation doesn't exist inside the brain. That's an algorithm that is used widely in PyTorch or TensorFlow to train those models. Davit Buniatyan: So, I was not doing the neuroscience part. I was mostly involved with the infrastructure and training those models. And the problem that we had is that processing petabyte-scale data at this... Five years ago, on the cloud, it was super painful. Kubernetes was not scaling to that point. We tried Airflow; it was hardly working for distributing the tasks. The data... To be able to optimize the cost from, I think, on the order of millions, to five times cheaper, we had to rethink how the data should be stored on those machines, how it should be streamed over the network to the machine to run the CPU or the inference. But in some cases actually...
For some certain models, if you compile them pretty well for the CPUs, it's much cheaper to run on the CPUs rather than on GPUs. Even though it's slower, you can then scale. It's actually 30% cheaper. Davit Buniatyan: And then if you do all these optimizations, you understand that all the tooling you have for traditional analytical workloads... You have Snowflake, you have Databricks. You can't use them there. You have a bunch of others... All the work that has been done over the past 20 years for big data is unusable in this... I'll call it unstructured space, or complicated data space. Here, it was like four-dimensional data. And a huge dataset. And we had to rethink all those tools. And those problems, and the insights that we learned, appeared to exist not only inside our lab, but also in other labs, not necessarily in biomedical applications. And apparently when we started the company three years ago and got into Y Combinator, it appeared it's actually not only in our research community, but it's also getting into the industry, and startups, and also large companies, where we have seen Hub being used. Eric Anderson: That's fantastic. So, just to clarify some of those points. So, you were going to a computer vision lab originally, and then that didn't work out and you ended up in a neuroscience lab, I suppose. But had that not happened, you might not have kind of had these same insights into how neural networks operate, and then had to operate them at scale, that led you back to doing computer vision. Davit Buniatyan: Definitely. It's super interesting. And I find today, self-driving car companies, some of them... I won't name names. They have the exact issue of storing this data at scale. But they're already at the petabyte-scale datasets that we had in the lab five years ago. Exactly the same problems.
And they are still struggling, coming up with a third version or fourth version of their format to be able to store this data, and then also visualize this data. Visualization is another world. So this is very interesting. Davit Buniatyan: And one thing is interesting as well. Actually, our lab was one of the first that used... There's a famous neural network architecture called convolutional neural networks that started the revolution of deep learning around 10 years ago. They actually started using those neural nets in 2008, when no one was using them yet, for doing all the segmentation work. And all the... The revolution actually started after Alex Krizhevsky did AlexNet, which used convolutional nets for doing the classification of the ImageNet dataset. So, they were ahead of it by four or five years. So, yeah. That's pretty interesting actually. And you can find a lot of... I'll say golden pieces in all those research labs. They are focused on solving a totally different problem. But the problems that they solve on the way can actually unlock other use cases in totally independent industries. Eric Anderson: And you mentioned you didn't want to join a startup when you joined the university for the PhD. But I suppose at some point during this project, you decided that this could be a company, or it could be an open source project. When did that happen? And what led you to think that? Davit Buniatyan: Yeah, the problem that we had is that actually, it was super expensive to run this computation on the cloud. And we were wondering, "Why do we have to pay so much money to the clouds to be able to execute this?" And especially when you're in a research lab. Even though that project was super well funded by the government, we still... We had to be very, very lean on how each cent was used on the computation. So, we had to do all these back-of-the-envelope computations very well. And of course, we were PhD students.
So sometimes, we were messing up and we had to rerun the whole computation from the middle. And that was just increasing the infrastructure cost there. And at that point it's like, "Okay, why don't we try and see if this could be interesting, let's say, for Y Combinator?" Davit Buniatyan: And that's how we applied. We initially thought, "Okay. Maybe we will get in, maybe we will not. We'll see how it goes. Then we'll finish our PhDs, because we want to become professors, academics, and so on." And then, we got into YC and it's like, "Yeah, let's give it a try. It will be like an internship." And once you are here, you're like, "Okay, wow. So many opportunities where you can give it a try and try to solve things." And then at some point... Unfortunately, last year... I took a leave of absence from the university. Davit Buniatyan: So, last year I dropped out. They gave me three years; they were super kind. I'm super thankful. They helped me a lot on this journey. But then they said, "Hey. You decide now. Either you're coming back or you're continuing." And apparently... And this is on topic for this conversation. If you are doing an open source project and a startup... And I had this in my head. It's like, "I can have a much bigger impact on computer science while building the company than actually doing the research and trying to come up with an academic paper that could be useful in another sense." Davit Buniatyan: So eventually, you make one of those decisions that you're going to always regret. Doesn't matter what. You choose to start... Continue the company, or you go and finish your PhD. I wish I was at the point where I had graduated already and then started the company. But my advisor, and I think another professor at Princeton, who is also now CTO of TimescaleDB... He gave me this advice saying that, "Hey. You can do both. You can do both at the same time. There's no problem with that. But you won't succeed in either of them.
So, you have to focus and choose one and try to be best at it." And that's kind of the... I've been trying to operate on those principles so far. Eric Anderson: Well, I think a lot of us are pretty excited that you chose to pursue the company and the project. Having a research paper from you would be nice, but there are 4,000 GitHub stargazers that are excited to have the code you've produced. So, appreciate that. Great. So, going into YC, were there a lot of other PhDs in YC, or how common or uncommon is that? Davit Buniatyan: Well, there were a few PhDs. I remember in biotech, or in deep tech especially. Folks were doing deep tech startups. There were a few. But then, I have seen more folks actually coming from non-PhD, non-academic backgrounds to doing open source, which is more interesting. So basically, with open source... There's no barrier for you to start a project in the open source. And if it gets successful, then you can actually build a company on top. So to be fair, when we started, we had an open source solution, but that was not Hub. So, Hub actually came like a year later, after we had been integrating with customers and learning about their problems. So on our side, just to give some insight here... When we started working with companies, afterwards we connected the dots. "Wait. Here, they are actually storing their data inefficiently." Davit Buniatyan: We had one customer who had 18 million text documents, and they were storing these documents TXT by TXT on S3, which is very inefficient if you want to move the data. It's like you have to copy file by file. And tiny files are super bad. And it's like, "Wait. Why are they doing that?" Then we worked with another company who had really huge aerial imaging, petabyte-scale data on S3, in a totally different space. So, no documents. And they were storing this in the 20-year-old way the whole aerial imaging processing community has been storing this data. And it's like we said, "Okay.
They can save 30% with compression if they don't use the same compression they currently do." With the NLP company, it's not even that we're saving them on compression. They were just operating on single files. Davit Buniatyan: So, those are the kinds of problems we actually learned about while we were working with customers. So, we hadn't started the open source before starting the company. But there are a lot of companies, especially now, getting into Y Combinator, that start their open source before... At a previous company, they see how this gets useful across that company. Or they come from research as well. But then, YC has been mostly... When I joined, they had been mostly focused on SaaS B2B/B2C companies. And then, they were doing this transition from less consumer to more enterprise in their selection process, as far as I've noticed. So, that's just my humble opinion here. Eric Anderson: And earlier, you mentioned that this is a need in not just computer vision, but also NLP, with audio files. And you mentioned text documents. But you focused on computer vision, right? Could I use this on audio files, or would this not work? Davit Buniatyan: So the thing is that you can use this for audio files. You can use this for TXT files. You can even use this for tabular data instead of using Parquet. Eric Anderson: Yep. Davit Buniatyan: But you have to pick a battle. You have to focus on something, try to be the best there. And we focus on images and computer vision applications. So obviously as time goes on and we have more resources, we can go back and start optimizing for other spaces, other data modalities. But especially with computer vision, you actually have to support all the data modalities. It's just... Since the bottleneck is on the images, you have to actually solve that bottleneck first.
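[Editor's note: the tiny-files problem from the 18-million-document example can be sketched as packing documents into one blob plus an offset index, so reading document i becomes a single ranged read instead of one GET per file. This is illustrative only; it is not the format Activeloop used for that customer.]

```python
# Pack many small documents into one blob plus a (start, length) index.
docs = ["first doc", "second, longer document", "third"]
blob, offsets, pos = bytearray(), [], 0
for d in docs:
    b = d.encode("utf-8")
    offsets.append((pos, len(b)))
    blob += b
    pos += len(b)

def read_doc(i):
    # One ranged read against a single object, instead of one GET per file.
    start, length = offsets[i]
    return bytes(blob[start:start + length]).decode("utf-8")

print(read_doc(1))  # second, longer document
```

Moving or copying the dataset is then one large transfer rather than millions of per-object requests, which is where most of the inefficiency of "TXT by TXT on S3" comes from.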
So, that's where we put our first flag, essentially saying, "Okay, let us focus on images and video especially and try to make it as good as possible for others to train and deploy their models on top." So, we optimized first for images, but then we also support other data modalities, and we can expand the data modalities given customer, user, or community member feedback. Eric Anderson: Going back to when you said you were talking to these companies, and you discovered they had these other issues and needs, and you changed the product. That sounded like you had kind of a new open source... Hub kind of emerged from those learnings. Was that scary at all, to feel like, "Oh. We built one thing, but we really should have built Hub, a new thing?" Talk to us about kind of the emotions and the process of kind of discovering the need for Hub. Davit Buniatyan: We went through a lot. So, changing that product was not the scariest thing, to be honest, through that journey. There are a lot of problems and stress and a lot of issues that you face as a founder while you're building a company. And especially if you are doing deep tech, that makes the problem much trickier. And let me give you... Explain to you why. Well, the way I define product-market fit is when you build a company that's... I'll get into the nerdy stuff. Your company's basically end-to-end differentiable. And it's very similar to neural networks or deep learning, how you train those models. So, you can actually do certain actions and then increase the growth metrics that you are targeting. Until you do that, you as a founder... Your goal is to do all these Bayesian experiments to figure out: should you explore further, or should you try to exploit further. And explore further is, try another product or try another feature. Davit Buniatyan: And exploit is like, "Let's... We double down on this and try to see if it can go further."
So until you have this PMF, you are doing all this exploration work. And our initial product, to be honest, worked pretty well. So initially what we had deployed was software, a cloud alternative, for you to run computation on crypto-mining GPUs at distributed scale, which was 10 times cheaper than a GPU on AWS. So, we built that. We actually got on the order of a thousand users, I think, during the first month. Davit Buniatyan: And we got struggling. We were struggling on GPU access. We had on the order of... I think hundreds, up to a thousand, GPUs available. And all of them were just fully utilized. But the problem, what we learned from enterprises is, their data... They won't trust us anytime soon, even if we do all the best encryption, or whatever is possible to do, to move their data to these crypto-mining farms. And then it's like, "Okay. If we can't help those companies, then it's super difficult for us to build a meaningful business on top of the usage that we got," even though we got a lot of usage. So, we started then working closely with those companies to understand what their real needs are and what problems they are facing at the moment. And apparently, data was one of them. You might have thought at that point that it's already a problem solved by all these giants in the field. Eric Anderson: Awesome. So, take us to where you go from here. You've now got Hub. It's quite popular. I think we saw that it was one of the GitHub trending Python projects. You got a bunch of community members. You found customers. What are the kind of goals for the project that, as kind of an open source user, I can look forward to, or the vision from here? Davit Buniatyan: For the open source user especially, what we are pushing so far is: if you have a spectrum where on one end you have a format, we are going to push it toward the database solution and add all the database features while the deep learning community's also maturing.
Basically, their needs are becoming more mature, more exact. Davit Buniatyan: Instead of each company having their own different needs, they become normalized. So, that's what we promise on the open source side of things. And on the platform, we recently launched the managed version of Hub, which includes not only just storing this data, where you can upload to our managed service, or you can upload to your own S3 or GCS managed by us. We also help you to visualize these datasets at scale in your browser, and be able to run version control and analytics to understand all the data, and then integrate easily with PyTorch or TensorFlow to train those models. So, that's on... Our roadmap is to actually take the open source, validate that it is useful for companies and enterprises, and then double down on the growth of both the platform and the open source. Eric Anderson: That's awesome. The database vision is exciting. Help me understand how that fits with inference. So I... It makes sense to me in training. I can imagine a big database, and I can write a script that connects the database to my model training environment. If I then want to take that model and deploy it on a video stream, for example, where does the database come in, or does it? Would I save frames or images to the database, or just on occasion for future training? Davit Buniatyan: So, yes. If you're running a real-time inference, I wouldn't say you need Hub. You can just directly pass the data from an incoming stream to your model and run the inference. But if you are also storing this data, as you mentioned, or you're also storing the predictions of your model, and then you want to do backtracking later, once you get a new model, to be able to rerun it on this dataset, then you have to store it somehow. Davit Buniatyan: For example, a company might have 80 different models. They're constantly training, retraining, making sure. Or, let's say you want to be compliant.
Say your model is trained on a hundred million users. And then, there are like a thousand users who no longer want their data to be with your platform. You have to go and retrain your model by removing this data. And then, you have to run the inference back again on whatever the other users had. So, there's a lot you have to do, especially with storing these datasets, to be compliant and also run the inference. Inference is also a key part, like batched inference. For real-time inference, you might not need Hub at this point in time. Eric Anderson: Yeah. Makes sense. And then on your website, you list some areas of focus, which are kind of the areas where we're seeing a lot of computer vision used. Do you find that the product lends itself to... You talked about autonomous driving earlier. We know that medical imaging is a popular place. Drone or aerial imaging, I think, came up in our discussion already. Yeah, are there certain areas where you feel like this is a particularly good fit, or are these just communities that Activeloop has penetrated well and is popular in? Davit Buniatyan: So, yes. You have biomedical image processing, you have automotive, you have aerial imaging. There are a couple of verticals that we haven't shared yet. We haven't updated the website. What we have seen from the community is that today, a lot of Python developers, in computer vision especially... They have big problems with processing videos. And there are no good tools for video processing. And then, the decompression I was mentioning before is one of the core bottlenecks there. Having random access to this data, instead of actually having to watch through the video first... That's another problem there. So, what we have learned from the community, and it has been a target for us since earlier this year, is to focus on nailing down the video use case. And from the video use case, you can actually impact both the automotive industry and video processing for surveillance and security.
That's another vertical where we have seen a lot of interest in Hub. And then, aerial imaging obviously, as we already discussed. Eric Anderson: Yeah, it does seem like you're at an interesting point in the workflow. Not only are you the database, but you do some of the workflow tooling, such that it seems like you could expand to anywhere people are having trouble in the workflow and provide tooling. Like you said, video decompression is an area where you could add value, which is a neat place to be. Davit Buniatyan: Yeah. If you watched Silicon Valley, that TV series. Eric Anderson: Yeah. Davit Buniatyan: That's what their core startup was, right? To build compression, decompression... Eric Anderson: I [crosstalk 00:29:11] wasn't going to bring it up, but yes. Yes, yes. Totally. Good, Davit. We're winding down our time here. Anything you wanted to cover in our show today that we didn't get to? Davit Buniatyan: No. Thank you very much, Eric, for asking me and having this nice conversation. And, yeah. Happy to be helpful. And thanks also to our team members and to our community. They're all contributors that helped us come to the stage where we are at now. And there's a lot yet to be done. So, we are working on it. Eric Anderson: Thanks, Davit. Super amazing progress. Appreciate your shout-out to the community. I think folks can find you via your website, activeloop.ai. We'll put a link there, but it looks like you have a Slack channel. Davit Buniatyan: Feel free to join slack.activeloop.ai. That will basically get you directly to our community. Feel free to go to GitHub, star us, and start using our tool. We'd love your feedback and to see how we can be helpful to you. Eric Anderson: You can find today's show notes and past episodes at contributor.fyi. Until next time, I'm Eric Anderson, and this has been Contributor.