- Artificial intelligence and machine learning are becoming mainstream enterprise technologies. That changes the way businesses operate, and at the same time it impacts how IT does its job and how enterprise infrastructure evolves. Joining us to talk about AI and the data center are Tony Paikeday, senior director of AI systems at NVIDIA, and Doug O'Flaherty, global ecosystems leader for IBM Storage. Welcome, guys. - Thank you. - Great to be here, Steve. - So Tony, I want to start with you. How do you see enterprises investing in AI today? - You know, it's interesting. We've seen a really compelling pattern emerge, especially against the backdrop of the pandemic and these challenging, turbulent economic times. What we've noticed parallels previous downturns and similarly challenging periods, where organizations typically did three things really well. They drew customer relationships tighter and closer. They looked to save costs everywhere possible. And they looked for opportunities to create an economic moat, or some kind of business agility their competitors didn't have. The remarkable thing, and I think why we're seeing so many organizations double down on AI even amidst the current environment, is that AI is really good at all three, right? This is a great time for organizations to build chatbots, for instance, or build superhuman language understanding into their customer service operations, or to analyze sentiment and tailor customer experiences. That's drawing customers closer. It's also a great time to save costs, for instance on inventory, or by streamlining your supply chain with better forecasting and optimization models. That's also something AI is really good at. And obviously AI is helping a lot of organizations mine their data more effectively, to find new business opportunities they may never have considered before; unlocking the needle in the haystack is something AI is really good at. So it's no surprise to us that even amidst the current environment, more organizations than ever are investing in AI and AI infrastructure.
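As a concrete illustration of the customer-sentiment use case Tony describes, here is a minimal sketch using the open-source Hugging Face transformers library; the library choice and the sample tickets are our assumptions for illustration, not something named in the conversation:

```python
# A minimal sketch of analyzing customer sentiment with a pretrained model.
# Assumes the Hugging Face "transformers" package (plus a backend such as
# PyTorch) is installed; the default model downloads on first run.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# Hypothetical customer-service tickets, invented for the example.
tickets = [
    "The support team resolved my issue in minutes. Fantastic service.",
    "I've been waiting three weeks for a replacement part.",
]

for ticket, result in zip(tickets, classifier(tickets)):
    # Each result is a dict like {"label": "POSITIVE", "score": 0.99}.
    print(f"{result['label']:>8} ({result['score']:.2f})  {ticket}")
```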
- But deploying AI solutions in the data center is a little different from building infrastructure for more traditional workloads. Where do you see IT organizations struggling as they deploy AI? - Yeah. One of the problems is that many might be inclined to look at it as just another workload in the data center that needs compute, storage, and networking. But the reality is that AI is fundamentally different, both in how it's crafted and ultimately in how it's deployed and the kind of resources it consumes. From a how-it's-built perspective, it's a highly iterative process done by data science artisans, not IT DevOps people; they don't have experience writing applications, or writing them for scale, or even following good IT DevOps rigor. It's a highly iterative, experimental, creative process that constantly requires a human in the loop. And IT has a very important role in bringing rigor to that process of going from a viable AI concept, to a prototype, to a model that can be trained at scale and then deployed in a production application. So there's that aspect: having an IT platform that can manage that workflow. And obviously the data platform underneath is super important, because you need to enable effortless mobility of very large datasets from one end of that life cycle to the other. Additionally, when you think about the infrastructure, it's never been more important to have almost zero distance between the compute processing power that needs to act on the data, where that data lies, and the network fabric that interconnects them. For many organizations that don't have deep experience in high-performance computing, and in the architectures specifically designed for parallelizing large, complex mathematical problems over large numbers of compute nodes, this can be daunting, because it may not look at all familiar compared to the other workloads they run in the data center. This is where reference architectures, which show you how to strike that optimized balance of compute, networking, and storage, and what kinds of each you need, are super important to every IT leader. - Sure. I want to drill down on one of the things you mentioned: keeping GPUs fed. Keeping the AI monster fed requires very fast access to data, right? We want data piped into these systems quickly; we need data coming off of disk quickly. Doug, I'll throw it over to you. How does deploying AI change the way we think about storage architecture? - Well, it changes it in two very important ways. One of them is that question you raised about keeping the performance fed. GPUs are an incredibly powerful paradigm shift, as much as I'm not fond of that phrase, because we can now use statistical methods to come up with real insights on very disparate data by chunking through huge portions of it. The ability of a GPU to do that also means it has a lot of IO coming into it. And there's a lot of work being done, both by NVIDIA and by storage companies, to make that easier and faster, with innovations like RDMA directly to the GPU with GPUDirect Storage. But that is one part of the life cycle Tony was referring to, right? This is something that's continually being iterated on. So that question of where the data you're working on lives is the second piece: keeping the data scientists fed. You can have a very fast network, you can have something that feeds GPUs pretty well, but you actually have to embody the entire life cycle of that AI data pipeline. And the IT folks have a really great role here, as Tony alluded to, because it's regulatory, it's process, and it's accessibility. You want good data that is available to the data science teams, that doesn't need to be recreated, and that has a sense of self-service to it, so they can keep their own creative engine running as well as those GPUs. So you need performance for the individual, performance for the pipeline, and performance for the GPU. It's an interesting balance, and the reference architectures definitely help with that.
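For a concrete sense of the GPUDirect Storage path Doug mentions, here is a minimal sketch using RAPIDS KvikIO, a Python wrapper around NVIDIA's cuFile API. The wrapper choice, file name, and array size are our assumptions for illustration; KvikIO falls back to a host bounce buffer when GDS isn't available on the system:

```python
# A minimal sketch of moving data between storage and GPU memory with
# GPUDirect Storage via KvikIO. When GDS is active, the read below lands
# directly in the GPU buffer without staging through host RAM, which is
# the RDMA-to-the-GPU path Doug refers to.
import cupy
import kvikio

N = 1 << 20  # 1M float32 values (4 MiB), purely illustrative

# Write some data so the read below has something to fetch.
src = cupy.arange(N, dtype=cupy.float32)
f = kvikio.CuFile("sample.bin", "w")
f.write(src)
f.close()

# Read from storage directly into a GPU-resident buffer.
dst = cupy.empty(N, dtype=cupy.float32)
f = kvikio.CuFile("sample.bin", "r")
f.read(dst)
f.close()

assert bool((src == dst).all())
```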
- So it is all about balance, right? Tony, can you talk a little about what's required to scale AI in the enterprise? It's more than just having fast storage and some GPUs, right? - Yeah, definitely. This is something we've figured out over the course of quite a few years building large hyperscale-type environments for organizations around the world, and our friends at IBM have been a part of that. The reality is that, as we've been saying, AI is unique in how it consumes these resources, and you want to achieve the fastest time to solution on your computational problem. For that you need ultra-high-performance storage in combination with the training process and the compute power. So this optimal balance we talked about includes a few things. On the compute side, AI doesn't simply mean packing together as many servers stuffed with GPUs as possible, or scaling the same kind of storage you've been using to support mainstream workloads. You need to be thinking about IOPS and latency; you need to be thinking about the network fabric, for instance InfiniBand with 200 gigabit per second interconnectivity. Even the topology that interconnects multiple nodes across that fabric with the storage is very important. All of these things need to be, if you will, prescribed rather than stumbled into. And why is this important? Well, a lot of the mission-critical applications your developers and data scientists are working on depend on the ability to parallelize the model you're training across many nodes. These days, if you think about natural language processing as an example, we have architectures where you potentially need dozens or even hundreds of systems acting together in concert to achieve a reasonable time to solution. The good thing is that, as Doug pointed out, our teams at both companies, NVIDIA and IBM, have collaborated on a joint reference architecture that brings together that balance: architecturally, how many DGX A100 nodes you need with the right kind of IBM storage, what kind of network connectivity you need, and benchmarking all of it, providing validated, proven, documented performance for known workloads. And honestly, the easiest way to take the complexity out of all of this is to follow the reference architecture, or to work with our partners who have deep competency in both our solutions and know how to bring this together in a turnkey way.
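As a concrete sketch of the multi-node parallelization Tony mentions, here is a minimal data-parallel training loop. PyTorch with DistributedDataParallel over NCCL is our assumption for illustration (the speakers don't name a framework), and the model and data are stand-ins:

```python
# A minimal sketch of training one model across many GPU nodes with
# PyTorch DistributedDataParallel over NCCL. Launch a copy on each node
# with torchrun, e.g. for 4 nodes of 8 GPUs:
#   torchrun --nnodes=4 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL rides the fast
    # fabric between nodes (e.g. InfiniBand) when one is present.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # stand-in batch
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across every node here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```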
- So wouldn't an easier way to do all of this, and I know I can go to Amazon right now and rent A100s on AWS, wouldn't an easier way be to just do it all in the cloud? - That's a great question. We often find ourselves trying to rationalize one versus the other, but for AI we have to recognize that both in fact have a useful place in your AI development journey. I'll also say that cloud is not the hammer for every AI nail. It's a great way to engage in early productive experimentation, enabling your developers to get a fast start with a low barrier to entry, and it's great at supporting temporal needs as AI development is getting underway. But eventually, through ongoing iteration, AI models start to get more and more complex, consuming more and more compute cycles, and in parallel the data fueling that training gets exponentially larger. This is the point at which costs start to escalate. A lot of customers have expressed this to us: they start to feel the effect of what we call data gravity, and it becomes very noticeable. By this I mean more time and money is being spent pushing large datasets from where they're generated to where the compute resides. That's a classic problem when you think about a cloud-based deployment where your data lives within the four walls of your data center or your data lake infrastructure and your compute is sitting somewhere else. This speed bump drives up your opex, unfortunately, and it's typically the inflection point where a lot of the organizations our companies have dealt with start to realize there is a real, tangible benefit in a fixed-cost infrastructure: it removes that fear of budget overrun and gives your developers back a sense of freedom, the ability to creatively explore their models without fear of costs. Ultimately, when they're able to do that, they're going to build better, higher-quality models with the highest accuracy possible. So that's how we think about this: like many things, hybrid is often the most sensible way to go.
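To put rough numbers on the inflection point Tony describes, here is an illustrative back-of-envelope sketch. Every dollar figure is an assumption invented for the example, not a quote from NVIDIA, IBM, or any cloud provider; the point is the shape of the two cost curves:

```python
# An illustrative "data gravity" cost comparison. All constants below are
# made-up assumptions for the sketch, not real pricing.
CLOUD_GPU_HOUR = 32.77     # assumed 8-GPU cloud instance, $/hour
EGRESS_PER_TB = 90.0       # assumed data-transfer cost, $/TB moved
ONPREM_FIXED = 400_000.0   # assumed system cost amortized over its life
ONPREM_OPEX_HOUR = 5.0     # assumed power/cooling/admin, $/hour

def cloud_cost(gpu_hours, tb_moved):
    # Variable compute cost plus the cost of pushing data to the compute.
    return gpu_hours * CLOUD_GPU_HOUR + tb_moved * EGRESS_PER_TB

def onprem_cost(gpu_hours):
    # Fixed cost up front, modest variable cost thereafter.
    return ONPREM_FIXED + gpu_hours * ONPREM_OPEX_HOUR

for hours in (1_000, 5_000, 10_000, 20_000):
    tb = hours * 0.1  # assume datasets grow with training volume
    print(f"{hours:>6} GPU-hours: cloud ${cloud_cost(hours, tb):>10,.0f}"
          f"  on-prem ${onprem_cost(hours):>10,.0f}")
```

Under these invented numbers, cloud wins early and on-prem wins as usage and data volume grow, which is the crossover Tony calls the inflection point.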
- Sure. So Doug, Tony touched on the reference design work IBM and NVIDIA are doing together. How are IBM and NVIDIA making AI infrastructure easier for enterprises? - Well, one of the easy ways is that we have published a reference architecture that lets you start building with a building block. It has the parallel file system, the high-performance throughput, and the networking configuration. We've gone a few steps further, too, with the performance recommendations coming out of the work with NVIDIA, looking at certain workloads and the balance of throughput- and latency-oriented environments. The parallel file system of IBM Spectrum Scale has always been a performance leader in these clustered environments, and a lot of IT departments don't understand, or don't normally touch, this idea of clustered or scale-out computing. Even with a single DGX A100 with multiple GPUs in it, you're really feeding into a network in a sense, and you've got to have a network balance and an environment that goes with it. The other work we've done takes it a step further, with our IBM reference architectures and some of the work we've done with Red Hat OpenShift: looking at how the performance reference architecture, the NVIDIA DGX POD, fits into an AI data workflow, how you manage the turnover of it, and how you manage the data ops. One thing that's become very important is something Dr. Steve Elliot, a member of the IBM Chief Data Office, and I talk a lot about. He and I are presenting at GTC 2021; our session is 33231, if you want to go see it. It's really intriguing, because we talk about how exactly that data gravity has changed the way we architect even our own processes. And so we document that; we help build that out. There are very few people in the world better at scaling out a variety of kinds of data within the enterprise than IBM. So our ability to integrate the metadata, the visible data about the data, which is critical to both the IT process and the data science process, into the pipeline and the workflow is one of the unique skills IBM brings. And when you combine that with the incredible performance, tuning, and optimization NVIDIA has brought to the table, including the DGX SuperPOD reference architectures and things like NGC, their containerized library of fast applications for easy adoption, we really have a unique combination: the enterprise scaling and HPC scaling we've done traditionally with the IBM Spectrum products, married with this unique environment where we bring the large enterprise and NVIDIA brings the high performance it absolutely excels at. - Okay. To wrap this up, I'll give each of you the last word. Give me one takeaway, and Tony, we'll start with you, for an IT practitioner who might be listening to us and feeling daunted and overwhelmed by the prospect of having to support these systems. What would you tell them? - Don't go it alone. This is a multidisciplinary sport, on two sides. One is internal: there are business stakeholders, people who are close to the data, and obviously your data science team. You don't want to be operating in isolation from them. They need to be intimately connected to the key architectural and platform choices that ultimately affect their work. There are also people inside a customer's organization who know how to help instill some of that DevOps rigor into the data science realm. So there's definitely importance to bridging the gap between the data science team and the IT team; that's foundational to success. On the other side, as you think about the platform, I'd also say don't go it alone. Think about how you can leverage partners who have deep competency in the full stack. The fortunate part is that you've got two partners right here who have deep competency in every aspect of this solution. IBM has been doing this for many years; IBM storage solutions are specifically architected for the kinds of problems we're trying to solve, parallelizing these problems and feeding data across large-scale infrastructure, and now we're applying that know-how to AI, which is great. Their people know how to do this, and our mutual set of partners knows how to do this; they've done it many times before, for organizations around the world. From the NVIDIA perspective, if you're building an NLP application, or a recommender system, or an autonomous system, chances are you're going to run into a problem related to your model, or the framework you're choosing, or some kind of tool. And chances are also pretty good that your NVIDIA team has encountered the same thing somewhere else, either in-house with the work we do or with a customer just like you who's had to tackle it before. So I'd say enlist our help, enlist IBM's help, bring us in early on, and we can help you put together that game plan for success. - And Doug, I'll give you the last word as one of NVIDIA's partners in this space. - Tony's got a great point there about our business partners and our joint work: people have done this before. But there's one other call to action I always make when I'm talking to IT departments, and even to the data scientists, which is: don't abdicate your responsibility to be the person who anticipates scaling and building the business. IT departments have a great history in regulatory and governance work, and in the ability to foresee how to grow and balance IT infrastructure. For data scientists, that's not really their background, typically.
And so one of the really important things is to engage with the business partners, look at the material we've put out there, look at those reference architectures, figure out where you as an IT professional add value for your data scientists, and get involved in the conversation early. Because one of the things we see is that you can make an easy decision, or you can make a decision that makes it very easy to grow. And one thing I know about the reference architectures we've put together is that they are designed for seamless, little-to-no-downtime growth, for data tiering to deliver better economics with IBM Spectrum Scale and object storage, and for long-term visibility of your data and governance with things like Spectrum Discover. And that entire portfolio doesn't have to start on day one. You can start with something really easy, very simple, a couple of DGXs and a simple, easy ESS 3000, and know how that's going to map forward. So IT folks: get involved, think forward, think about where you're going, and involve those business partners and the people who can help you get there. - Great, great thought to leave on. So thank you both for taking the time today. - Thank you.