Charles Srisuwananukorn (Together AI) Fireside Chat

May 9, 2025

We were thrilled to feature Charles Srisuwananukorn from Together AI at January’s Chat8VC. Charles is the Founding Vice President of Engineering at Together AI, where he leads the company’s work on AI infrastructure and clusters. Previously, he was Head of Applied Machine Learning at Snorkel AI and held engineering roles at Apple. He studied Computer Science at Stanford and has helped steer Together from an early contributor to open-source AI to a full-stack infra platform.

You might know their team from their algorithmic and data work on projects like FlashAttention and RedPajama. In this convo, Charles shared his experience scaling both hardware and systems engineering at Together, as well as their team’s philosophy around efficient AI: bridging breakthroughs in model architectures with low-level optimizations in networking, kernel design, and cluster reliability. We cover the challenges of running physical infrastructure at scale, lessons learned from handling esoteric GPU failures, and Together’s ambitions to support both high-scale model training and the next wave of small models for fast inference. 

As a reminder, Chat8VC is a monthly series hosted at our SF office where we bring together founders and early-phase operators excited about the foundation model ecosystem. We focus on highly technical, implementation-oriented conversations, giving builders the chance to showcase what they’re working on through lightning demos and fireside chats.

If this conversation resonates and you'd like to get involved—or join us at a future event—reach out to Vivek Gopalan (vivek@8vc.com) and Bela Becerra (bela@8vc.com).

Vivek: Excited to have Charles here with us. Before we dive into tooling and infrastructure, let’s take a step back. What’s it been like building Together during this intense period of demand in AI?

Charles (Together): Yeah, it's been a wild ride. We run our own infrastructure - actual physical clusters - which is a massive challenge. You’re talking about deploying GPUs and building site processes, in addition to all the software we ship - it’s not purely virtual. It’s very real, and that complexity is a big part of our day-to-day. But it’s also what makes it fun. And we’re growing a lot. 

Vivek: Many AI-native companies are facing similar growth stories and challenges – though not with hardware, and not at the same level of the stack that you guys are dealing with.

When you look at the open-source AI ecosystem – what’s missing? What would you tell future founders in this room to build to support some of the challenges that you’re running into?

Charles: The obvious gap is good data. Clean, diverse, high-quality datasets are still hard to come by. That’s why we launched RedPajama early on. We saw the need - there was a lack of really good data to train models and explore LLMs - and tried to help fill it. That’s a big thing people need to contribute back to open source AI. 

The other big one is tooling for reinforcement learning. As reasoning models get better, the ability to steer and refine them through RL becomes more powerful. But the tools still lag behind.

Vivek: Yeah, and we've seen a bunch of people who are working on products to help with the post-training lifecycle!

Part of why we're here today is to talk about Together GPU clusters. Maybe tell us a little bit more about that. What kinds of companies are using them? And why expand in this direction rather than just offering pure inference as a service?

Charles: We have GPUs and people can rent them and actually use them to train models, run inference, or do whatever they need. These are H100s, H200s, and we’ll have GB200s and B200s soon. We set you up with Slurm or Kubernetes or whatever you want – and make sure everything works well so that on day one, you can be productive.

We use these clusters ourselves for our inference service, for continuous training, and for our internal research. They’re designed by AI engineers and researchers, for AI engineers and researchers.

The reason we’re doing this ties back to our broader mission. We want to offer a holistic platform for LLMs, and compute infrastructure is a critical piece. If you're doing anything serious with LLMs, reliable compute is table stakes. We think this is a core part of the product.
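For a concrete sense of what “productive on day one” means on a Slurm-managed cluster, here’s a minimal sketch of bringing up a distributed PyTorch job from the environment variables Slurm provides. It assumes srun launches one task per GPU and that MASTER_ADDR and MASTER_PORT are exported in the job script; it’s a generic pattern, not Together’s actual provisioning or tooling.

```python
# Minimal sketch: initialize distributed PyTorch from Slurm's environment.
# Assumes one srun task per GPU and MASTER_ADDR/MASTER_PORT exported by the
# job script. Generic pattern only - not Together's tooling.
import os
import torch
import torch.distributed as dist

def init_from_slurm() -> int:
    rank = int(os.environ["SLURM_PROCID"])        # global rank assigned by Slurm
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks (GPUs)
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node

    torch.cuda.set_device(local_rank)             # pin this process to one GPU
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return local_rank

if __name__ == "__main__":
    local_rank = init_from_slurm()
    # Sanity check: every rank contributes 1.0, so the sum equals the world size.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    if dist.get_rank() == 0:
        print(f"all_reduce across {dist.get_world_size()} GPUs: {x.item()}")
    dist.destroy_process_group()
```

Launched with something like `srun --ntasks-per-node=8 python train.py`, the same script runs unchanged on one node or many.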

Vivek: As you scale these clusters to 10K, 50K, 100K GPUs, what are the biggest challenges you’ve run into, whether it’s kernel optimization, networking, or something else?

Charles: Yeah, I mean, I think the most common issues we see in these clusters are the same ones you’d see in the Llama technical report – GPUs falling off the bus all the time, ECC errors in the GPUs, etc. But some of the more surprising things are around kernels and hardware reliability.

We’ve had issues with overheating transceivers. At one point we literally walked around the data center with an infrared camera to see what was running too hot and out of spec. That kind of low-level ops – it wasn’t what I thought I’d be doing when I started this job.

And then you start learning about things like: how does debris get cleaned off the optical fiber so that your InfiniBand links don’t flap and cause your training to slow down? These are deep physical-world problems. I didn’t expect to have to get this far into that layer, but you need it if you want to run clusters reliably.
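Many of these failure modes show up in software well before anyone walks the floor. As a rough illustration (not Together’s health-check stack), here’s how you might poll NVML for the uncorrected ECC errors Charles mentions; a GPU that has fallen off the bus often simply disappears from the device count or raises an NVML error.

```python
# Rough sketch of a GPU health probe using NVML (pip install nvidia-ml-py).
# Illustrative only; this is not Together's monitoring stack.
import pynvml

def check_gpus(expected_count: int) -> list[str]:
    problems = []
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        if count != expected_count:
            # A GPU that has "fallen off the bus" often just disappears here.
            problems.append(f"expected {expected_count} GPUs, NVML sees {count}")
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            try:
                # Volatile counters reset on reboot; uncorrected errors are the bad ones.
                ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC,
                )
                if ecc > 0:
                    problems.append(f"GPU {i}: {ecc} uncorrected ECC errors")
            except pynvml.NVMLError as err:
                problems.append(f"GPU {i}: NVML error {err}")
    finally:
        pynvml.nvmlShutdown()
    return problems

if __name__ == "__main__":
    for p in check_gpus(expected_count=8):
        print("PROBLEM:", p)
```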

Vivek: How do you think about ops overhead at that scale? How much can you rely on automation? Or do you literally have people walking around with IR cameras?

Charles: Yeah, it’s both. At our scale, we can’t function without a ton of automation. We’ve got agents running on every machine, monitoring GPU utilization, thermals, and failures, and paging the team when something goes wrong.

But then, yeah, someone might have to walk into the data center. Maybe it’s rebooting a node. Maybe it’s replacing a bad cable. The systems are built to self-report, but the physical work still has to happen. These aren’t abstract cloud resources. They're real servers that need care and feeding.
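A toy version of the agent loop Charles describes: poll each GPU’s utilization and temperature, and page someone when a threshold is crossed. The threshold and the page_oncall hook below are hypothetical placeholders, not Together’s production automation.

```python
# Toy monitoring agent: poll GPU temperature and utilization via NVML and
# page on-call when something looks wrong. Threshold and paging hook are
# hypothetical placeholders, not Together's production agent.
import time
import pynvml

TEMP_LIMIT_C = 85        # assumed threshold, for illustration only
POLL_INTERVAL_S = 30

def page_oncall(message: str) -> None:
    # Placeholder: in practice this would call your paging provider's API.
    print(f"[PAGE] {message}")

def poll_once() -> None:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        if temp >= TEMP_LIMIT_C:
            page_oncall(f"GPU {i} running hot: {temp}C (util {util}%)")

if __name__ == "__main__":
    pynvml.nvmlInit()
    try:
        while True:
            poll_once()
            time.sleep(POLL_INTERVAL_S)
    finally:
        pynvml.nvmlShutdown()
```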

Vivek: And when you look at the broader market, there are a lot of people offering inference and renting out compute. What’s your core competency, and what’s hard for others to replicate?

Charles: At Together, our focus is efficient AI. And to us, efficiency spans both the model side (algorithm design) and the infra side (kernel tuning, systems optimization, etc.).

So when you rent a GPU cluster from us, you’re not just getting bare metal. You get access to our Together Kernel Collection. This is a suite of custom kernels that we’ve optimized to make training faster and more efficient. Drop them into your training loop, and you might see 10%+ improvements out of the box.

And we’ve been through the pain ourselves. We operate these clusters not just for customers, but for our own research and services. That operational experience compounds. When you’re running 10K+ GPU clusters and keeping them reliable for your own workloads, you learn what matters.
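To make the drop-in idea concrete, the sketch below contrasts a naive attention implementation with the fused, FlashAttention-style kernel that ships with PyTorch: the same math, swapped underneath with no other changes to the loop. It’s a generic illustration of the pattern, not the Together Kernel Collection API.

```python
# Generic illustration of the "drop-in kernel" pattern: same math, different kernel.
# Uses PyTorch's built-in fused attention; this is NOT the Together Kernel
# Collection API, just a stand-in to show how little the training loop changes.
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Straightforward attention: materializes the full (seq x seq) score matrix.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # FlashAttention-style fused kernel: same output, far less memory traffic.
    return F.scaled_dot_product_attention(q, k, v)

if __name__ == "__main__":
    shape = (4, 16, 2048, 64)  # (batch, heads, seq, head_dim)
    q, k, v = (torch.randn(shape, device="cuda", dtype=torch.bfloat16) for _ in range(3))
    ref = naive_attention(q, k, v)
    fast = fused_attention(q, k, v)
    # Outputs match to bf16 precision; only the kernel underneath changed.
    print("max abs diff:", (ref - fast).abs().max().item())
```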

Vivek: Let’s talk about R1 for a second. R1 distillation showed you can compress a big model down to something like a 32B and still get impressive performance. How does that change your infrastructure planning? Do you worry about building for giant models only to have people prefer smaller ones? Do you risk infra obsolescence if models stay in this mid-sized range?

Charles: Yeah, I think about that a lot. And it’s lucky for us, because even though you can distill a model like R1 and get great performance at 70B, there’s still real demand for the full-sized versions, and more use cases open up for the smaller ones.

I was literally talking to someone tonight about this as we host DeepSeek, and they asked: “Are you hosting the big one?” People still want the big one! Especially in research, pretraining, frontier work. The appetite is there.

So when we build infra, we build it to support both. That means really fast InfiniBand and fully non-blocking topologies, so we can support fully distributed large models. But the same setup also runs small models just fine. It’s about flexibility.

Vivek: And what about the extreme end, the 1B and 3B models that can run locally? We’re big believers in local AI here, having invested in Ollama early. Do you see Together extending toward the edge with something in the middle, like edge points of presence colocated close to where end users are but not quite in the cloud? Is that a real problem today, or something more like 3-4 years out?

Charles: Yeah, totally. I actually think that problem is real today even without super-small models.

Latency matters. A lot of inference use cases, like AI companions and phone calls, are super latency-sensitive. People don’t want to wait. So we’re already thinking about edge POPs, routing, and co-locating inference close to the user. We’ve already had to tackle this latency problem and plan it out even with our large models – figuring out how to build an edge network and reduce latency no matter where the request is coming from.

The smaller models just make that more viable and you should see a lot of people playing around with them. You could imagine a hierarchy with device, then edge, then a really big hub data center, and that opens up some really interesting architectural ideas. Like treating the device as an extension of our cloud. 
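One way to picture that device/edge/hub hierarchy is as a routing decision: send each request to the lowest-latency tier that can actually serve the requested model. Everything in the sketch below (tier names, latencies, model names) is invented for illustration.

```python
# Hypothetical sketch of the device -> edge -> hub hierarchy: route each request
# to the nearest tier that can serve the requested model. All names and numbers
# here are made up for illustration.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    rtt_ms: float      # measured round-trip latency to this tier
    models: set[str]   # models this tier can serve

def route(model: str, tiers: list[Tier]) -> Tier:
    # Prefer the lowest-latency tier that actually hosts the model.
    candidates = [t for t in tiers if model in t.models]
    if not candidates:
        raise ValueError(f"no tier can serve {model}")
    return min(candidates, key=lambda t: t.rtt_ms)

if __name__ == "__main__":
    tiers = [
        Tier("on-device", rtt_ms=0.0, models={"tiny-3b"}),
        Tier("edge-pop", rtt_ms=8.0, models={"tiny-3b", "mid-70b"}),
        Tier("hub-dc", rtt_ms=45.0, models={"tiny-3b", "mid-70b", "big-moe"}),
    ]
    for m in ("tiny-3b", "mid-70b", "big-moe"):
        print(m, "->", route(m, tiers).name)
```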

Vivek: Maybe one last thing to close us out. We've got a lot more compute coming online over the next year or two, and you guys are obviously continuing to scale fast. What's one thing that keeps you up at night, and what are you most bullish and excited about right now?

Charles: I think the thing that keeps me up at night is that GPUs don't sleep. I'll get paged in the middle of the night, and I’m just making sure everything keeps running. Literally, PagerDuty keeps me up at night.

As we scale, the surface area just explodes. We’re at something like 160 people now, and the infra footprint is massive. It’s a lot of responsibility. 

I’m also super excited. The most recent model releases, like Qwen and DeepSeek, and what we’ve seen with open and smaller models, are game changers for accessibility. We’re going to see a huge wave of people suddenly able to run GPT-level models on their laptops.

That’s super cool and I can’t wait to see what people will do with it.