Nik Spirin (NVIDIA) Fireside Chat
For our final Chat8VC of 2023, we hosted Dr. Nik Spirin, Director of Generative AI and LLMOps at NVIDIA. He sits at a very compelling vantage point in this role. Prior to NVIDIA, he was a serial entrepreneur, classical ML PhD, and researcher by training who worked on many interesting applied AI projects at places like Gigster and played key roles in the enterprise-wide AI transformations of Canon, Dentsu, and Vodafone.
As a reminder, Chat8VC is a monthly series hosted at our SF office where we bring together founders and early-phase operators excited about the foundation model ecosystem. We focus on having highly technical, implementation-oriented conversations and give founders and builders the opportunity to showcase what they’re working on through lightning demos and fireside chats. This can be a fun side project, related to a company they’re starting, or a high-leverage use case within a company operating at scale.
If this conversation excites you and you’d like to chat more with our team about it or attend future Chat8VC events, please reach out to Vivek Gopalan at email@example.com and Bela Becerra at firstname.lastname@example.org!
Let’s start by talking about your path to NVIDIA and where it all started.
I started my journey by deeply reading one of the few, at that time, foundational ML books – Pattern Recognition and Machine Learning by Chris Bishop. It is still on my bookshelf. My first project in a lab was to produce a performant MATLAB implementation of the EM algorithm and demonstrate it in action by clustering Old Faithful geyser eruptions from Bishop’s book and a few bigger datasets. There was no scikit-learn, and Python was used mostly for scripting and web back-ends. That was a lot of fun, and I remember the awe of seeing for the first time a computer program that does some intelligent stuff and finds patterns right in front of my eyes. These days, everyone is reliving a similar story thanks to the latest advancements in generative AI and the accessibility of AI co-pilots.
Long story short, I then began working on search ranking, and specifically learning-to-rank, which was an incredibly important machine learning problem at that time. Yahoo was still a reputable player in the search space. Microsoft released Bing in 2009 as a rebranding of Live Search and a direct alternative to Google. Yandex and Baidu, the Russian and Chinese search engines respectively, opened their labs in Silicon Valley and successfully competed with Google in their home markets. The entire industry was centered around web search, similar to GenAI/LLMs now. It was simply impossible to do anything else if you were into production-grade ML systems. The key themes included SVMs (e.g. RankSVM), gradient boosting (e.g. RankBoost, AdaBoost, MatrixNet), and ensemble learning.
Roughly at the same time, the term big data was introduced due to the necessity to manage massive web-scale datasets via MapReduce, which forced us to reconsider the way we approach machine learning, too. Instead of running an experiment on a scale-up machine with a lot of memory – 144 GB was the biggest RAM when I was at Facebook (now Meta) working on Graph Search – and still being limited by the number of samples you can use (~10M ranking pairs as far as I remember), one can reimplement an ML algorithm in a distributed way and use as much training data as is available. Matei Zaharia and the team from Berkeley came up with RDDs and around 2014 released Apache Spark and then MLlib. Since then, everybody started doing distributed training for research and production. I had to upskill myself on distributed ML despite actively working in this area.
Parallel to that, there was another emerging tech trend called deep learning. Prior to 2012, very few people believed that it would work due to limited access to the data and compute needed to train large networks, even though, in theory, neural nets are considered to be universal approximators of functions. The world jumped years ahead with the publication of the AlexNet paper, which for the first time demonstrated that computer vision could be done more effectively with deep learning, trained with the help of GPUs, than with traditional image analysis techniques. In 2014, NVIDIA introduced cuDNN to accelerate deep learning on GPUs and support this trend. Once again, I had to learn new skills by extending my knowledge of traditional ML algorithms to CNNs, RNNs, and LSTMs when deep learning started gaining interest from enterprises. That gave me a good runway for 5-6 years to play leadership roles in a few startups and meaningfully contribute to enterprise-wide AI transformations of Canon, Dentsu, and Vodafone, among others.
I joined NVIDIA to lead the expansion of the MLOps platform for self-driving to other AIs in the company. My reasoning was three-fold: (1) NVIDIA is the company in the modern AI ecosystem and has the widest optionality to be successful in multiple directions, including AI chips, AI software, and AI-powered digital twins via Omniverse; (2) an MLOps platform is the best lever to make many AI projects and teams at NVIDIA more effective and successful; (3) in my last startup, Metapixel AI, we had built an end-to-end AI system for visual content production and used photorealistic rendering and ray tracing for synthetic data generation, which is 100% aligned with what NVIDIA does as a company, only at a much bigger scale.
After the release of ChatGPT in November 2022, I started focusing on NVIDIA NeMo and LLMOps. The generative AI revolution is the third time in my career that I got disrupted and had to adapt. While modeling techniques and tooling behind state-of-the-art foundation models like Llama-2, NVIDIA Nemotron-3, or GPT-4 were introduced over time and are still based on the same fundamental ML principles and MLOps best practices, I felt behind in the language modeling space due to the computer vision focus of my latest startup. That brings me to one of the lessons learned throughout these transitions – the pace of change in tech is so fast that to stay relevant, one must work twice as much: once to do the immediate work and once more to stay up to date with the latest developments. Focus helps achieve the set goal, but one must also have a wide aperture, curiosity, and time to explore.
At NVIDIA, you have asymmetric access to information on end application requirements and what people are using ML for. NVIDIA has done application development abstractions quite well, starting with CUDA and now moving towards NVIDIA NeMo and higher-order abstractions. Let’s talk through some of the decisions your team has made on higher-order abstractions. What are you excited about with regards to NVIDIA NeMo?
First and foremost, I want to acknowledge that a lot of great things and abstractions in NVIDIA NeMo, like the NVIDIA NeMo framework, have been built before I joined, and it is a huge collaborative effort by the entire company. Speaking of abstractions, there are two ways to visualize this.
One is an end-to-end AI/ML lifecycle view, which starts from the data ingestion to data curation to data labeling, data validation, model training, evaluation, optimization, and deployment. We try to be present at every stage throughout this lifecycle and provide best-in-class GPU-accelerated tools, services, or SDKs to enable our customers and enterprise application developers in their generative AI workflows. Some of the latest releases from the team are NVIDIA NeMo Retriever, which is close to my original background in search, and NVIDIA TensorRT-LLM for optimized inference.
The second visualization is a “pancake” diagram. At the very bottom, it includes the bare-metal compute products (e.g. NVIDIA A100, H100, GH200, L40S) and networking products (e.g. NVIDIA BlueField, NVLink, and InfiniBand). This forms the suite of hardware components needed to build the modern AI stack. The next level of abstraction involves the compute runtime. NVIDIA CUDA was introduced to enable general computing on GPUs. And, as I have already mentioned, the subsequent level of abstraction as it pertains to AI applications is cuDNN. Essentially, it is a GPU-accelerated library of primitives for deep neural networks, various convolutions, attention or matrix multiplication blocks. Next comes the framework layer, which is responsible for the orchestration of multiple nodes managing the workload together. Distributed training and the primitives related to model parallelism, sequence parallelism, pipeline parallelism, and data parallelism are incredibly important. At a platform level, this must also be enabled by the orchestration and job scheduling system.
At NVIDIA, we cover a great amount of surface area in AI. However, the market and opportunity is even bigger. Therefore, we have an ecosystem of 100+ partners throughout the stack, ranging from vendors focusing on a specific layer, like Lambda Labs for compute or Run:ai for orchestration, to ISVs and MLOps platforms like Weights & Biases, Databricks, or Dataiku, who enable the end-to-end AI development lifecycles, to SIs like Deloitte or Data Monsters, who provide professional services leveraging our tooling to solve specific business problems for enterprise customers. We provide chips, accelerated software tools, and services, and these partners solve for the last mile. Our capabilities fit as a beautiful puzzle, and we co-deliver more value together.
Imagine a data scientist or machine learning engineer building an AI application – they use a specific MLOps platform in their workflow, but behind every single button click, there could be an AI-accelerated service. Who could provide this? NVIDIA. We are a core partner and enable the wide reach of companies to work seamlessly and deliver value around AI. As we say, NVIDIA is the only company that works with all other AI companies, and we want the AI ecosystem to be healthy and as excited about accelerated computing as we are.
When you came into NVIDIA and then NeMo, there were decisions that had already been made as to what platform components should have been built on the MLOps side of things. Something you talk about in your blog on LLMOps is that there's a whole suite of net new requirements. Let’s start by highlighting a few of the new requirements that are most critical for generative AI and the things you’re currently prioritizing.
There are many ways to define your roadmap. First, you can ask your customers and prioritize the requested capabilities. However, we know that this is not a complete picture of the world. Customers might not know about new innovations or the art of the possible. Especially in the early days, it is crucial to show your customers what can be done. By the way, I am very passionate about AI education and have a bespoke executive masterclass on AI for this exact reason. Second, it’s important to have a vision (backed by data) and identify the missing blocks from first principles. This is why we have basic and applied research teams at NVIDIA who help us find the next innovation and forge ahead. And lastly, it’s important to learn from the broader ecosystem. We closely monitor Hugging Face, GitHub, and top industry and research conferences, where companies release new products and share knowledge and pain points. The unification of these three sources led to a high-level LLMOps blueprint that is covered in the blog and soon in our upcoming LLMOps session at GTC 2024. Now, let me describe some of the new requirements specific to LLMs.
At the infrastructure/platform layer, you need to have distributed training capabilities because state-of-the-art foundation models are trained on massive trillion-token datasets and would take years to pre-train otherwise.
Next, generative AI has enabled exciting new capabilities as it pertains to synthetic data generation. Synthetic data was previously not as accurate nor representative of the real world. Current generative AI models allow us to synthetically generate realistic text, images, videos, or 3D objects, which can be used to solve the cold-start problem for a new AI project or augment training or evaluation data.
Apart from synthetic data generation, data curation is also an important problem, especially for foundation model building. The data you crawl from the internet is quite messy – there are many bad encodings, ill-formatted documents, subpar layouts, etc. How do you effectively distill useful information from that raw crawled data you could use to actually train your foundation models?
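As a rough illustration of the kind of filtering this involves, here is a minimal sketch in Python. It is not how NVIDIA's curation tooling works; the function names, the thresholds (`max_bad_ratio`, `min_words`), and the choice of heuristics (Unicode normalization, a bad-character ratio, a minimum length, exact-duplicate removal by hash) are all assumptions made for the example.

```python
import hashlib
import unicodedata


def clean_document(text: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def looks_garbled(text: str, max_bad_ratio: float = 0.05) -> bool:
    """Flag text with many replacement/control characters, a sign of bad encoding.

    The 5% threshold is an illustrative assumption, not a tuned value.
    """
    if not text:
        return True
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or unicodedata.category(ch) == "Cc"
    )
    return bad / len(text) > max_bad_ratio


def curate(docs, min_words: int = 5):
    """Yield cleaned documents that pass the heuristics, skipping exact duplicates."""
    seen = set()
    for raw in docs:
        text = clean_document(raw)
        if looks_garbled(text) or len(text.split()) < min_words:
            continue
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Real pipelines typically add language identification, fuzzy deduplication, and learned quality classifiers on top of simple rules like these.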
Going up the stack, we segue into workflow systems. If you need to have a large multi-stage training job that takes a long time to run and uses many different nodes to train on, you use a workflow system like Flyte, Argo, or Kubeflow. These systems typically provide job results caching, fault-tolerance guarantees, access control, traceability, alerts.
There’s also a bucket of problems that I’d describe as agent management. You could have an adaptive agent that can come up with a solution on its own, and then instead of having a predefined set of APIs that will be called when you submit a request, it will say, “Oh, wait a minute, I think this problem needs to be solved using five steps.” It will come up with the decomposition of a complex problem into those sub-steps. Having a mechanism to manage the whole life cycle of these agent-based applications is an important and new problem that we have to solve. Otherwise, such an agent can create a suboptimal solution plan with bad latency, throughput, or cost profile.
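One simple piece of such lifecycle management is checking an agent's proposed plan against budgets before running it. The sketch below assumes a hypothetical `Step` record with per-step latency and cost estimates and a hypothetical `check_plan` helper; none of these names come from a real agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of an agent-generated plan (hypothetical structure)."""
    tool: str
    est_latency_s: float
    est_cost_usd: float


def check_plan(plan, max_steps, max_latency_s, max_cost_usd):
    """Return the list of budgets a plan would exceed (empty list means acceptable)."""
    violations = []
    if len(plan) > max_steps:
        violations.append("steps")
    if sum(step.est_latency_s for step in plan) > max_latency_s:
        violations.append("latency")
    if sum(step.est_cost_usd for step in plan) > max_cost_usd:
        violations.append("cost")
    return violations
```

A plan that fails the check could be sent back to the agent for re-planning instead of being executed as is.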
Next are model-centric capabilities for model compression and model evaluation. The former is to decrease latency without degrading accuracy by a lot, which is critical for large models. The latter is to make sure that we can reason about model capabilities holistically. Traditional cross-validation isn’t enough. One needs to know how to evaluate general intelligence skills, adaptability through transfer learning, task-specific performance, and many others. It is as if you are evaluating a human being during a job interview.
Finally, I’d like to flag prompt engineering – this is what enables millions of non-engineers to build really powerful AI apps using LLMs. So, we need to enable them with the right tooling as well. Experiment management already supports data, model, and experiment versioning with all your hyperparameters. What is the next step? Prompt versioning is a natural extension of experiment management to prompts so that one can iterate on a prompt in a principled way, see all previous versions, failure modes, cost, latency, safety risks.
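To make the idea of prompt versioning concrete, here is a minimal in-memory sketch. The `PromptRegistry` and `PromptVersion` names are hypothetical and do not refer to any existing tool; a real system would also persist versions and attach evaluation results, cost, and latency to each one.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    version: int      # 1-based version number within a prompt name
    template: str     # the prompt text itself
    digest: str       # short content hash used to detect unchanged prompts
    note: str         # free-form comment, e.g. why the prompt was changed


@dataclass
class PromptRegistry:
    """Keeps an append-only history of templates per prompt name."""
    history: dict = field(default_factory=dict)

    def commit(self, name: str, template: str, note: str = "") -> PromptVersion:
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        versions = self.history.setdefault(name, [])
        # Re-committing identical text does not create a new version.
        if versions and versions[-1].digest == digest:
            return versions[-1]
        new_version = PromptVersion(len(versions) + 1, template, digest, note)
        versions.append(new_version)
        return new_version

    def latest(self, name: str) -> PromptVersion:
        return self.history[name][-1]
```

Keying versions by a content hash mirrors how experiment trackers treat data and model artifacts, which is why prompt versioning fits naturally next to them.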
If you look back four years ago, when there was a wave of venture-backed Silicon Valley companies trying to pitch synthetic data generation, people didn’t believe in the capabilities of synthetic data generation. What has gotten people over the line on trust in synthetic data generation for model improvement? What are some interesting conversations you’ve had to convince people? Or do you think that barrier is gone?
Let me start with a first-person example from my previous startup Metapixel AI, where we did visual content production for e-commerce. Imagine you want to sell a microphone: you have to take a photo of the microphone in a photo studio, but perhaps the background isn’t clean – there might be a reflection, some dust, or improper angle. You send it to folks handling the Photoshop work and they remove the background and surface defects, calibrate it properly, fit it to the frame, and then you see it online on the e-commerce storefront on the white background, slick and appealing for the buyers. We tried to automate that process end-to-end using AI. The input is a high-fidelity raw image from a camera, and the final output is a PNG or JPEG image that you could publish on an e-commerce website or marketplace.
For training such a model, we started from open-source data released with different research papers, like MS COCO or DUTS for salient object detection. Unfortunately, we only got a few thousand images from the public internet that fit our requirements, because we wanted to work with 4K or 8K high-fidelity images that professional photo studios use, while the research papers worked with <2K images.
The second data source was customer data. We had data sharing agreements with our customers, where we said we’d provide a service and then if we made mistakes, retouchers could edit those pictures. We’d offer a discount, but then they’d give us back the data so we could improve our algorithms. It was a win-win – we created this data flywheel to build defensibility for our company and also to improve the quality for customers on their specific product types. We got to a reasonable point of about 94% IoU for high-fidelity masks for background removal.
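For readers unfamiliar with the metric, IoU (intersection over union) for a predicted mask and a ground-truth mask is the number of pixels marked in both, divided by the number of pixels marked in either. A minimal sketch, assuming binary masks stored as nested lists of 0/1 with identical shapes (the function name and the convention for two empty masks are choices made for this example):

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly; treating that as 1.0 is a convention.
    return intersection / union if union else 1.0
```

For example, a prediction that shares two foreground pixels with the ground truth while the two masks together cover four pixels scores 2 / 4 = 0.5.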
Luckily, my co-founder had a lot of expertise in rendering and ray tracing on GPUs. He had built a photo-realistic rendering pipeline using Unreal Engine 4. As input, you would provide a high-fidelity 3D model of an object, but then you could do so-called domain randomization to generate many different views of that product. You could change the focal distance, position of your light sources, texture of the background, colors, etc. and then iterate across the library of 3D models. From that dataset, you could have a lot more insights and quality distilled into your model. This is one specific example of synthetic data in action, and we were able to boost the result to about 99.7% IoU for high-fidelity masks, which was then and might still be the state-of-the-art result.
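Domain randomization of this kind boils down to sampling scene parameters at random for each 3D model and handing them to the renderer. The sketch below only generates the parameter sets; the `RenderConfig` fields, the value ranges, and the function name are illustrative assumptions, not the actual Metapixel AI pipeline, and the rendering step itself is out of scope.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderConfig:
    """Scene parameters for one synthetic view (hypothetical fields and units)."""
    model_id: str
    focal_distance_m: float
    light_azimuth_deg: float
    light_elevation_deg: float
    background_texture: str
    background_rgb: tuple


def sample_render_configs(model_ids, textures, views_per_model, seed=0):
    """Sample randomized scene parameters for every 3D model in the library."""
    rng = random.Random(seed)  # seeded so a dataset can be regenerated exactly
    configs = []
    for model_id in model_ids:
        for _ in range(views_per_model):
            configs.append(RenderConfig(
                model_id=model_id,
                focal_distance_m=rng.uniform(0.3, 2.0),
                light_azimuth_deg=rng.uniform(0.0, 360.0),
                light_elevation_deg=rng.uniform(10.0, 80.0),
                background_texture=rng.choice(textures),
                background_rgb=tuple(rng.randint(200, 255) for _ in range(3)),
            ))
    return configs
```

Because the renderer knows exactly where the object is in every generated view, each image comes with a pixel-accurate mask for free, which is what makes this approach attractive for segmentation.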
For that use case, did you have to build an object understanding model to determine object types or orientation and how to rotate these objects?
If you have a complex problem, there are two ways to solve it with AI. One is end-to-end, which in theory is possible with enough data, compute, and a complex neural network architecture. I believe a faster and more realistic path to success is by decomposing a complex problem into subproblems.
In our case, it was to retouch an image and make it look cool. We then asked, “What are the important image transformations that professional retouchers do and how do they do it?” It turns out that they actually do the layering of an image in PSD. They first do the masking. Sometimes they split the mask into different disconnected elements and save it in multiple layers. Then, they do surface correction, rotation, or resizing. All those transformations are done in steps, and for every single one, you could create a smaller network that does the job. And yes, depending on the object type, only a subset of these transformations is applicable. For example, for fur or objects with puffy edges, one needs to use an edge processing algorithm; for rigid manufactured objects, like a bottle or mic, you need another algorithm. Rotation is not important at input but critical for output so that it looks like the real product in use.
By the way, this is the recipe that’s transferable to all AI systems. There are many other case studies beyond the example I offered, where I applied the same approach. This is also how modern self-driving systems are built – you have a parking network, traffic light detection, or lane tracking network, and 20 more.
Let’s get back to the question of convincing people of the value of synthetic data. Is the conclusion that capability improvements, when provided synthetic data, just speak for themselves and enterprises are convinced, or do you have to jump in and do a bunch of experimentation for them?
We had to do a lot of experimentation in our startup. And it wasn’t scalable; it was art and hard to build the business around. Now, the tech is better, people hear more about synthetic data, and there are more successful cases across the industry. This definitely helps companies build confidence, but there will still be some experimentation involved, just as with all other phases of an ML project.
Overall, whenever you have a new product, you show it to your customers and offer the opportunity to try before they buy. Either they recognize the value or they don’t. I always suggest developing a portfolio of experiments, as it’s challenging to derive value from a single experiment. This is likely the conversation vendors in the synthetic data space still have as the market and technology evolves.
Let me give you one more interesting example to show the value of synthetic data. There is a paper published by Anima Anandkumar from the Caltech and NVIDIA research team, where they created a social bias detection dataset to test LLMs for safety and fairness.
How does it work? Well, let’s say we want to determine whether we have prejudice or bias with respect to gender. You could create a parallel corpus, where you have he or she pronouns, and then you could have some activity that those subjects could do. For example, “he plays,” “she plays,” and so on and so forth. You could generate that parallel corpus yourself via crowdsourcing or create a rudimentary script that zips through the dictionary and generates such sentences, but it won’t be diverse nor mimic the real-world nuances or distribution.
Can we do better? Yes, you still have that parallel corpus of “he,” “she,” and so on in a dictionary; you still have the control variables, which define some activities or some qualities of those subjects; and then you ask an LLM by prompting it with a few examples to come up with interesting similar sentences of different length or topics. This way, you create a diverse and realistic synthetically generated parallel corpus. This is the data augmentation part.
Next, you ask an LLM a question – compute the probability of a sentence, and compute the probability of the parallel sentence. If the probability for one pronoun is higher, then this sentence gets a plus-one. Then, you aggregate over all pairs in that parallel corpus. If it turns out that one pronoun is preferred for a specific condition, then the LLM is biased. If the rate is close to 50% and the difference is not statistically significant, then the LLM is unbiased. This is how you could evaluate LLMs for bias and how synthetic data helps you verify that a model is unbiased.
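The counting and significance check described here can be sketched in a few lines. This is an illustration of the general idea, not the procedure from the paper: `logprob` stands for any function that returns a model's log-probability of a sentence (a hypothetical callable here), and the exact two-sided binomial test against 50% is one reasonable choice of significance test among several.

```python
from math import comb


def preference_rate(pairs, logprob):
    """Count how often the first sentence of each pair is scored higher (ties skipped)."""
    wins = 0
    total = 0
    for first, second in pairs:
        score_first, score_second = logprob(first), logprob(second)
        if score_first == score_second:
            continue
        total += 1
        if score_first > score_second:
            wins += 1
    return wins, total


def two_sided_binomial_p(wins, total):
    """Exact two-sided p-value for the null hypothesis of a 50/50 preference."""
    if total == 0:
        return 1.0
    k = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


# Toy stand-in for an LLM scorer: it prefers shorter sentences, so it is
# deliberately "biased" toward the "he" variant. A real scorer would sum
# token log-probabilities from the model under test.
def toy_logprob(sentence):
    return -float(len(sentence))


verbs = ["plays", "cooks", "codes", "leads", "writes",
         "teaches", "drives", "paints", "builds", "sings"]
pairs = [(f"he {verb}", f"she {verb}") for verb in verbs]
wins, total = preference_rate(pairs, toy_logprob)
p_value = two_sided_binomial_p(wins, total)
print(f"first variant preferred in {wins}/{total} pairs, p = {p_value:.4f}")
```

A small p-value together with a rate far from 50% points to a systematic preference; a rate near 50% with a large p-value gives no evidence of bias on this particular corpus.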
Let’s discuss model optimization, as I know you’re excited about this topic. When we think about people leveraging foundation models within the portfolio, a lot of use cases are fundamentally latency-limited. There’s a product experience gap when using GPT-4, Claude, or some of the 3D generation models like Luma and CSM. What are interesting techniques you’ve been working on with model optimization and compression?
Thanks, I’ll reference two things. First is TensorRT-LLM, a flagship project that allows us to do accelerated inference. We have a blog on that topic, where we compare performance of a model with and without application of TRT-LLM compression algorithms. Depending on the model, it’s typically a 2x boost – note workload and architecture play a critical role.
The second reference is the MLPerf Inference v3.1 submission we spearheaded in July 2023. There’s a helpful graph that demonstrates how to stack different optimization techniques to drive even better acceleration. As a baseline, you have FP16 or BF16 format. On top of that, you apply quantization. When you do this transformation, you could do it in a fairly lossless manner. You usually try to trade off 1% of accuracy for a huge jump in performance, shrinking the latency. And we can go further than that.
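As a toy illustration of what quantization does to a weight tensor, the sketch below applies symmetric per-tensor INT8 quantization to a flat list of floats. This is a generic textbook scheme chosen for the example; it is an assumption and not necessarily the number format or algorithm used in TensorRT-LLM or in the MLPerf submission.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0, 0.333]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why the accuracy loss can stay small while each value shrinks from 16 bits to 8.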
What if you want to hold off on quantization and prune out weak, useless connections and neurons? You would determine the sensitivity of the output with respect to the variations of a specific neuron. If the value of that neuron is close to zero, it has minimal impact on the output. You’d want to remove it to save on FLOPs, and then by extension accelerate the inference and lower the cost. As far as I remember, if you do pruning and quantization, you get around 2.6x extra on top of just quantization.
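The simplest stand-in for this idea is magnitude pruning, which zeroes out the weights with the smallest absolute values. Note that this swaps in weight magnitude as a cheap proxy for the output-sensitivity analysis described above; the function name and the flat-list representation are assumptions for the sketch.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices sorted from least to most important under the magnitude proxy.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_prune])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]
```

Zeroed weights only translate into real speedups when the runtime and hardware can skip them, for instance through structured sparsity patterns, so the achieved acceleration depends on more than the sparsity ratio.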
Further, you could also leverage distillation from a smarter model to further improve latency and accuracy.
You have a smart large model, in addition to an already compressed model after quantization and pruning. You try to distill it by training or fine-tuning that smaller model after pruning. You first do the pruning, followed by distillation and quantization. This combination leads to around 10x improvement relative to the baseline and doesn’t require adjusting the hardware configuration.
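For context, a common way to set up distillation is to train the smaller model to match the larger model's temperature-softened output distribution. The sketch below computes that loss for a single example as a KL divergence scaled by the squared temperature, following the widely used formulation from the knowledge distillation literature; it is an assumption that this matches what was used in the work described here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return temperature ** 2 * kl
```

In practice this term is usually combined with the ordinary task loss on labeled data, and the loss is averaged over a batch inside a deep learning framework rather than computed with plain Python lists.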
Combining all of the above, in theory, we can gain up to 40x acceleration. Therefore, I am optimistic on hardware-software co-optimization and believe this will unlock many applications across data center, consumer, and edge platforms for NVIDIA.
Where must NVIDIA deliver value? As you describe the various layers you interact with, there’s greater potential to compete with partners. What are the key aspects you can uniquely bring to the table and should be focused on?
The guidance from the leadership is to enable the ecosystem with the best accelerated hardware, tools, and services. The beauty of the platform business model is that we try to work with our customers at every layer of the stack and meet them where they are.
If you have data scientists, ML engineers, and AI infrastructure talent and need the bare-bone components, then probably you won’t require high level APIs that might be too much of a black box. In this case, we will provide individual components through open source, but you will have to build everything else.
If you lack subject matter experts in AI/ML but boast many software engineers and enterprise architects who can build cool apps, you will need good APIs. Depending on the goals, resources, as well as the customer persona, we could tailor specific abstractions and different products around them. We focus on customer and partners’ needs and work backwards to pick the best form factor and level of abstraction that will solve their problem.
If you think about companies you’ve collaborated with, where do most land in their technology journey? Are they in planning mode, building and shipping AI products, or perhaps homing in on optimization? How fast are people moving in organizations?
There’s always a diffusion of innovation for a new technology, and the market naturally stratifies into innovators, early adopters, early majority, mass market, and laggards.
Large AI-native companies and generative AI unicorns are by definition innovators and ahead of the curve in their generative AI journey. These are organizations defining the standards and helping the rest of the world find the next MLOps/LLMOps blueprint or model architecture, like recently Mamba as an alternative to transformers. The majority of the companies are in the “POC mode,” trying to extend successful predictive AI deployments with new generative AI capabilities or start from zero as the cost of experimentation has decreased dramatically.
Within the next two years, enterprises will gain mastery of generative AI technology and reach great outcomes. I wish I had more data points, but I would pattern match to the previous wave of the AI revolution, which was enabled by deep learning. Historically, there has been a 5-6-year lag in order to truly reap the benefits of a technological advancement. The AlexNet paper was published in 2012, while the mass adoption of deep learning by enterprises happened in 2017-2019. Assuming there’s an acceleration of new technology waves, especially given the pace of development in generative AI, let’s say we shrink this lag exponentially from 5-6 years to 2-3 years. This is the time when the rubber hits the road, and we will have a massive technology market fit delivering value at scale. Taking into account that Megatron-Turing was announced in 2021, Stable Diffusion in August 2022, ChatGPT in November 2022, we will see major value realization in 2024-2025.
Over the next five years, we will not only understand generative AI better but apply it to all sorts of modalities beyond text and connect the digital and physical worlds together.