Joe Chen and Jonathan Shen (Upwork) Fireside Chat
We were thrilled to feature Joe and Jonathan from Upwork at March's Chat8VC in San Francisco. We covered their journey from teams like Google Brain and Cruise, and their own startup, to leading AI efforts at Upwork—building Uma, a suite of specialized LLMs powering workflows for freelancers and clients across the platform.
They shared lessons on building robust, production-grade models in-house, and why generalist LLMs fall short for long-form, human-in-the-loop interactions. We got a deep dive into their custom data pipelines, safety-aware training processes, and the tradeoffs between off-the-shelf APIs and tightly integrated models tuned to platform behavior.
As a reminder, Chat8VC is a monthly series hosted at our SF office where we bring together founders and early-phase operators excited about the foundation model ecosystem. We focus on highly technical, implementation-oriented conversations, giving builders the chance to showcase what they’re working on through lightning demos and fireside chats.
If this conversation resonates and you'd like to get involved—or join us at a future event—reach out to Vivek Gopalan (vivek@8vc.com) and Bela Becerra (bela@8vc.com).
It’s my pleasure to introduce our friends Joe & Jonathan at Upwork! Over the past couple of years, they've taught a good chunk of the 8VC team a great deal about LLM internals and helped us become better predictors of forward progress in AI. They started a company, were scouted by Upwork to build their tech there, and are now running AI R&D. Want to say a few words about yourselves before we dive in?
Joe (Upwork): Sure, absolutely. As Vivek mentioned, we were doing a startup and then joined Upwork to lead a lot of the LLM training and AI functionality here. My background is actually in AI robustness and reliability. I led reliability research at Cruise and also at Waymo in the self-driving car space. So I come at this from the "old man" mentality, because I’ve seen how things in self-driving don’t always work 100% of the time. That’s why it’s so important these systems are robust and reliable. That’s the mentality we’re bringing to Upwork too—building AI systems that actually work in practice.
Jonathan (Upwork): I’m definitely much more on the software side. I come from a background at Google Brain, doing research into deep learning models. Now that I’m at Upwork, I’m making sure we develop our own in-house models and continue doing foundational research—because that’s critical to bringing these models to market.
Cool! Let’s maybe start by talking about Uma. First, for those who don’t know—can you briefly explain what Upwork does and then talk about the first use cases you had in mind for Uma, both on the client and freelancer side?
Joe: Upwork is a large two-sided work marketplace for freelancers and businesses. It’s a public company that’s been around for a while. People come to the site to connect with skilled freelancers for their business needs—website development, mobile software, those kinds of things. A lot of what we’re doing with Uma, Upwork’s Mindful AI, is building models to help that process.
The first use cases were custom models to help freelancers write proposals—since on Upwork, a freelancer has to write a proposal for every job. We also built Uma to help clients post jobs and evaluate candidates. It’s very much human-in-the-loop. That excites us—it’s not about full end-to-end automation just yet. It’s about making the business flywheel spin faster.
Let’s go a bit deeper into the model side—why focus on specialized models? And what did you have to do around synthetic data, post-training, and data curation?
Joe: I should caveat that a lot of what we’re doing is still in user testing – we had to build out our entire GPU cluster first, and we got to a custom fine-tuned model five months after we joined. The first things we productionized were proposal writing and candidate evaluation. Even these models alone are making significant contributions to our business.
We love data curation. One of the biggest advantages of being at Upwork is that we can go out and hire talented writers and domain experts from our platform to write interesting new data for us. For some of our initial models, we had to define: What is Uma’s personality? What is the style of interaction we want? So we hired people—many of them screenwriters and others already on the Upwork platform—to write conversations. That helped us differentiate the style of interaction early on.
We also have a robust synthetic data program, which is another big element. You can only go so far with real data—it’s somewhat scalable, but not massively so. So we take raw data from help articles and internal documentation and use it to build better datasets. You can’t just toss raw tokens into a model and expect it to work.
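To make that concrete, here is a minimal sketch of the kind of distillation step Joe is describing: a teacher model turns raw help-article text into question–answer pairs suitable for fine-tuning. The prompt, file names, function names, and choice of the OpenAI client are our own illustrative assumptions, not details of Upwork's actual pipeline.

```python
# Hypothetical sketch: distill raw help articles into Q&A training pairs
# with a teacher model. Prompt, model choice, and file names are
# illustrative assumptions, not Upwork's pipeline.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "From the help article below, write three realistic user questions and "
    "answers grounded only in the article. Respond as a JSON object: "
    '{"pairs": [{"question": "...", "answer": "..."}]}\n\nArticle:\n'
)

def article_to_pairs(article_text: str) -> list[dict]:
    """Ask the teacher model to turn one article into Q&A pairs."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT + article_text}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["pairs"]

# Write fine-tuning rows, one JSON object per line.
with open("synthetic_qa.jsonl", "w") as out:
    for path in Path("help_articles").glob("*.txt"):
        for pair in article_to_pairs(path.read_text()):
            out.write(json.dumps(pair) + "\n")
```

In practice a pipeline like this would add deduplication and quality filtering before any of the generated pairs reach training, which echoes Joe's point that raw tokens alone aren't enough.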
Jonathan: On the technical side, we’re really leveraging open-source models like Llama and DeepSeek. Ultimately, for specialized experiences, we need to control the kind of data that has gone into the model weights.
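As a rough illustration of what fine-tuning an open-weights model involves, here is a minimal LoRA supervised fine-tuning sketch using recent versions of Hugging Face's trl and peft libraries. The base model, data file, and hyperparameters are assumptions for the example, not details Jonathan shared.

```python
# Minimal LoRA SFT sketch with Hugging Face trl/peft. Base model, data file,
# and hyperparameters are illustrative assumptions, not Upwork's setup.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Expects a JSONL file with a "text" column of formatted conversations.
dataset = load_dataset("json", data_files="uma_sft_data.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # any open-weights base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="uma-sft", max_seq_length=4096),
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```

LoRA keeps the base weights frozen and trains small low-rank adapters, which is one common way teams iterate on specialized models without the cost of full-parameter training.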
What’s been the most interesting use case where specialized models outperformed generalist ones—or where a generalist model would have required a ton of scaffolding?
Joe: Yeah, so when we joined Upwork, our initial hypothesis was that off-the-shelf models are great for short Q&A, but they fall short for the long tail of business use cases. Like, if you want to build an agent that can handle complex, multi-turn conversations—customer service, project scoping, that sort of thing—you don’t want something that’s just prompt-tuned to death with a giant flowchart.
So at our prior startup, we were working on algorithms to support long-form dialogue: how to stabilize those conversations, remove the need for prompt-tuning, and avoid overfitting to brittle conversation trees. That’s what led to us joining Upwork—Upwork saw what we had built and said, “Yes, this is what we need.”
And it’s worked out. In user studies, we saw that our fine-tuned models doubled quality scores—style, accuracy—relative to off-the-shelf models like GPT-4 or Claude. That’s not shocking. A model trained to act a certain way will always outperform one that’s just lightly prompted.
More surprising was how much better the custom models were even on non-conversational tasks—like pure factual Q&A. We have a slide that compares our Q&A model to GPT-4o and Claude. Even when RAG results were bad, our model did better than GPT-4o did with good RAG. That extra layer of robustness matters—it’s like teaching your model to forget how to be wrong.
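One way to probe the robustness Joe describes is to score a Q&A model while deliberately degrading retrieval. Below is a hedged sketch of such a harness; the `model_answer` callable, the data schema, and the substring-match metric are our own simplifications, not Upwork's eval setup.

```python
# Hypothetical robustness probe: score a Q&A model when a fraction of
# retrieved passages are replaced with distractors. The data schema and
# substring-match metric are simplifying assumptions for illustration.
import random

def eval_under_degraded_rag(model_answer, qa_items, corpus, noise_rate=0.5):
    """model_answer(question, passages) -> str is any system under test.

    Each qa_item needs "question", "answer", and "gold_passages" keys;
    corpus is a pool of unrelated passages used as distractors.
    """
    correct = 0
    for item in qa_items:
        passages = list(item["gold_passages"])
        if random.random() < noise_rate:
            # Simulate a bad retrieval call: all passages are distractors.
            passages = random.sample(corpus, k=len(passages))
        prediction = model_answer(item["question"], passages)
        correct += item["answer"].lower() in prediction.lower()
    return correct / len(qa_items)

# Running the same harness at noise_rate=0.0 and 1.0 shows how gracefully
# a model degrades when retrieval fails, which is the comparison above.
```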
In our demo of Uma, we covered everything from training data collection (how to simulate user behavior with our freelancer network and collect accurate human data) and Q&A evals to UX choices (how best to embed this within the existing Upwork product and imbue notions of memory) and handling contextual task switching. To learn more about the work Joe, Jonathan, and their team are doing, read this blog HERE.