Mike Del Balso (Tecton) Fireside Chat
At our September Chat8VC gathering, we hosted Mike Del Balso, founder and CEO of Tecton and a visionary behind Uber's Michelangelo, for a conversation with 8VC Partner Bhaskar Ghosh. As a reminder, Chat8VC is a monthly series hosted at our SF office where we bring together founders and early-phase operators excited about the foundation model ecosystem. We focus on having highly technical, implementation-oriented conversations and give founders and builders the opportunity to showcase what they’re working on through lightning demos and fireside chats. This can be a fun side project, related to a company they’re starting, or a high-leverage use case within a company operating at scale.
In this transcript, Mike discusses the problems in the MLOps ecosystem, the premise (& promises) of a feature platform, and how Tecton fits into the world of LLMs.
If any of these themes excite you and you’d like to chat more with our team about them or attend future Chat8VC events, please reach out to Vivek Gopalan at email@example.com and Bela Becerra at firstname.lastname@example.org!
I’d love to hear more about your background and life story.
It's quite encouraging to see how many people are working on AI-related companies as this was certainly not the case a few years ago. I've focused on machine learning & AI for a while – I started working on AI at Google ten years ago as a product manager. I devoted most of my time to the real-time AI system that powers the search business at Google.
When you type something into search, there's a real-time auction coupled with AI behind the scenes that helps find the right ad to display. And as you can imagine, this is one of the highest-scale AI systems in the world. At that time, the production workload required hundreds of people working on one model – it was more advanced ten years ago than a lot of the AI systems that large companies have in production today!
This context is important because it was very oriented to production. It wasn't just cool research work. We were hyper focused on questions like: How do we make money? How do we extract as much value as possible? This is how I came up in the AI world, which has always given me a slant towards production, product framing, and handling real-time constraints, which is different from many people who come from an academic research background.
When I joined Uber in 2015, I was hired as the machine learning product manager. There was very little, if any, machine learning driving that product at the time, but there was so much data and a significant opportunity to make decisions with said data. This coincided with the big data wave. We had a very unique, competitive advantage with our dataset. How do we make the product better? How do we make money and make Uber win? There were hundreds of use cases to apply data science and apply AI. I was the PM whose task was to figure out the associated opportunities for AI and build the tools to enable our data scientists to be successful.
We built a system called Michelangelo, which we’ll touch upon later. I do want to mention that we went from almost no ML in production to tens of thousands of models in production in about two and a half years. We were operating at the speed of light.
Should we just mention what Michelangelo is for everybody? My interpretation is that it’s a very comprehensive, productionized end-to-end machine learning platform.
At the time, the concept of a machine learning platform wasn't quite a thing yet. People would somewhat know what you’re referencing. There weren’t a ton of companies building ML platforms. There were a handful, but they weren't doing well.
Given this dynamic, we decided to build the infrastructure to enable automated smart decisions. This encompasses everything from… how do I get the data, refine it, turn it into features, train models automatically, productionize said models, serve models in real time and, ultimately, have others handle productionization.
All of the elements around that core workflow – from training models and monitoring models and data for drift, to visualization and sharing – all lacked priors at the time. It took us a good amount of time to figure it out as we had to invent a lot of these things along the way.
Michelangelo became this mega platform at Uber where all of the different AI systems at Uber are powered by it, still to this day. That became a pattern that a lot of folks wanted to repeat in their companies.
We wrote this blog post that effectively conveyed what we were building at Uber and how we're thinking about ML and AI. A lot of people subsequently reached out asking for advice and how to apply key learnings in their respective company.
If you look at Michelangelo as a holistic platform, what were the two or three things you got right? What would you do differently?
Sometimes you package something that already exists in a certain way and it takes off. We gave it a name. This concept of a “feature store” is actually something that we coined, but I'm sure you had a feature store and built this kind of infrastructure before at LinkedIn and Yahoo. It’s an interesting thing to reflect on with respect to building a product – if you can package something nicely and then put it on the market, you can take stuff that already exists and commercialize it.
The challenging task when you're building AI systems in production is the data. People can allegedly spend 80-90% of their time messing around with data pipelines and then retrieving, cleaning and transferring the data rather than building a model.
At first, we just built the model management layer. We subsequently created a system to train and serve a model. We naively thought we had solved the AI problem, but then started working with individual teams and were finding that projects weren't getting done. And so we would go in and help teams execute on the project. We went in with our customers and did the work for them, which allowed us to better empathize with their pain.
When we did that, we found that across customers, we were repeatedly solving the exact same problem. We were building the same data pipelines again and again. It's a question of: how do we plug into Uber's data and build pipelines that automate turning said data into features about a driver or user?
We ended up automating that work and centralizing it into the platform. This is what became the feature platform. It was a really big win for us, in addition to making sure we supported real-time, which Uber was very dependent on.
As an example, when you call an Uber, there's a score that represents how far someone is, how long it's going to take the car to get to you, etc. Models generate these scores in real time. Then there's another score, which is price. Price depends on the estimate for how far the car is from you. This chain of real-time AI predictions powers the economics of the Uber product. It was very important for us to be able to support that in order to create value.
I’ll mention this – there was a self-driving car project at Uber called ATG (Advanced Technologies Group), and we were the AI platform team behind it. We jumped at the opportunity to help this and quickly learned the division had a very different set of needs and didn't quite know what they wanted. We spent a lot of cycles trying to figure this out and ultimately decided to not support them. This was a very important move for us because we were able to narrow in on projects with a similar set of needs and requirements, and not stretch ourselves too thin or do a partially complete job of multiple things.
This was an important lesson and a good skill for people to hone. Don’t stretch yourself too thin, and be comfortable saying no to something, even if it looks like a good opportunity.
Looking back, anything you’d do differently?
There are many things we’d do differently, but the most obvious is something we had to fix along the way related to bundling. There are all of these different pieces that we were talking about, including feature pipelines, model training, model serving, model monitoring, data monitoring, etc. And now for each of these components, there's a whole dedicated industry around them. I could rattle off five scaled companies for every one of these components.
At the time, it wasn't even clear that these were the most important components. We built it all as one system, which was quite monolithic. You’d either opt into using the whole thing or you wouldn’t use it at all. The challenging aspect is that each use case differs in one way or another. If you're the fraud prevention team, maybe you want to use your own serving system? If you're on the safety team, maybe you have your own training system?
With everything being part of one system, each component has to be a good fit if you're going to serve a team as a customer. The chance that every component is a good technology fit, when they are all tightly bundled together, is quite low. You then end up not being a good fit for nearly anyone.
We decided to pursue unbundling and prioritize modularity. You can use our serving, our feature store, our model training, but then you can use your own monitoring. This was a significant unlock, but it took us a few years into the project to really nail that. It’s not an easy project or change to facilitate. It's a lesson to be focused on modularity earlier.
The Michelangelo blog post is an exceptional reference architecture for end-to-end AI lifecycle management. Walk us through the inspiration and how you decided to pen this piece.
I frankly didn’t even want to write a blog post. An engineering director asked me to prioritize writing so I was compelled to do it!
We published the post though and it quickly blew up. I wrote it roughly 7 years ago and we're still talking about it. But the interesting point is that you should lean into creating this kind of content because it's an accretive asset. You put it out there and it's going to be around for a long time – you never know who's going to read it and come back and chat with you, or if it's going to turn into some other opportunity.
Ninety-nine times out of a hundred a post like this is going to be worth zero, but it's still worth getting it out there because there's a small chance of a very high upside from creating this kind of content.
Don't be lazy, put your thoughts out there.
I’d love to learn more about the founding thoughts and vision around Tecton.
At Google, all we did was build pipelines for AI, and a lot of the work was centered around data engineering. At Uber, as I mentioned, we kept solving the same problem again and again – it was just data engineering. And when I would talk to folks at Yahoo, LinkedIn, etc. they have huge teams all doing the same thing. Talk to folks at Facebook, they have hundreds of people working on this infrastructure of data pipelines for AI!
It was fairly obvious that everyone's going to be doing AI in the future, and people are going to need this. The area that felt overlooked and seemed like we had a unique insight into was feature management, specifically the data layer for AI, as data is the hard part.
Tecton started as a feature store – here's the data pipeline, we'll compute your features, store them in this database and then we'll serve them for model inference and for model training. That by itself is fairly innovative. People were already doing this in totally hacky ways, in Python scripts that they had already lost by the time they got around to productionizing the model. The important thing is that getting the data is a critical part of any AI application. The model can only make good decisions if it has access to good data. And the data it gets comes from your business: user tables, merchant tables, transaction tables, etc. But all the data pipelines in AI have very unique requirements that are not served by traditional data or analytics tooling.
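To make the "compute, store, serve" idea concrete, here is a minimal sketch in plain Python. It is an illustration only and not Tecton's or Feast's API: `ToyFeatureStore`, `txn_count_pipeline`, and the sample transactions are hypothetical names and data. It shows one pipeline writing each feature value to two places, an online store holding the latest value for inference and an offline log holding the full history for training.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store (illustration only)."""

    # Online store: latest (timestamp, value) per (entity, feature), for inference lookups.
    online: dict = field(default_factory=dict)
    # Offline store: every timestamped write, for building training sets later.
    offline: list = field(default_factory=list)

    def write(self, entity_id, feature, value, ts):
        self.offline.append((ts, entity_id, feature, value))
        key = (entity_id, feature)
        # Only overwrite the online value if this write is at least as recent.
        if key not in self.online or ts >= self.online[key][0]:
            self.online[key] = (ts, value)

    def get_online(self, entity_id, features):
        """Latest value of each requested feature, or None if never written."""
        return {f: self.online.get((entity_id, f), (None, None))[1] for f in features}

    def get_history(self, entity_id, feature):
        """Full (timestamp, value) history for one feature, sorted by time."""
        return sorted(
            (ts, v) for ts, e, f, v in self.offline if e == entity_id and f == feature
        )


def txn_count_pipeline(transactions):
    """Toy feature pipeline: running transaction count per user."""
    counts = defaultdict(int)
    for ts, user_id, _amount in sorted(transactions):
        counts[user_id] += 1
        yield user_id, "txn_count", counts[user_id], ts


store = ToyFeatureStore()
transactions = [(1, "u1", 20.0), (2, "u2", 5.0), (3, "u1", 7.5)]  # (ts, user, amount)
for user_id, name, value, ts in txn_count_pipeline(transactions):
    store.write(user_id, name, value, ts)

print(store.get_online("u1", ["txn_count"]))
print(store.get_history("u1", "txn_count"))
```

A real system would back the two stores with separate infrastructure (a low-latency key-value store and a warehouse or data lake); the point of the sketch is only that both are fed by the same feature definition.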
You really care about point-in-time accuracy because you don't want to allow for leakage, where a training example sees feature values from after the moment of prediction. You have two consumers of data in AI – model serving in real time, and model training, where you're trying to get batch data sets that are gigantic.
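Point-in-time accuracy can be sketched as an "as-of" lookup: for each training label, take the latest feature value recorded at or before the label's timestamp and never a later one. The snippet below is a minimal illustration using only the standard library; `value_as_of`, the feature history, and the labels are made-up names and data, and production systems typically do this as a large batch join rather than a per-row lookup.

```python
from bisect import bisect_right


def value_as_of(history, ts):
    """Return the latest value in `history` with timestamp <= ts, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, ts)  # number of entries with timestamp <= ts
    return history[i - 1][1] if i else None


# Hypothetical feature history: running transaction count for user "u1".
txn_count_history = {"u1": [(1, 1), (3, 2), (10, 3)]}

# Hypothetical labels: (user, event timestamp, label).
labels = [("u1", 2, 0), ("u1", 5, 1), ("u1", 0, 0)]

training_rows = [
    {
        "user": user,
        "ts": ts,
        # Only values known at time `ts` are visible; the value written at ts=10
        # never leaks into rows for earlier events.
        "txn_count": value_as_of(txn_count_history[user], ts),
        "label": label,
    }
    for user, ts, label in labels
]

for row in training_rows:
    print(row)
```

Whether a value stamped exactly at the event time counts as "known" (`<=` versus `<`) is a design choice that depends on how timestamps are assigned; the sketch uses `<=`.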
There’s just a whole different data architecture. At Tecton, we started with that and it quickly became our domain. “We're going to be the data specialist for production AI systems.” We call Tecton a feature platform, and now we're getting into a real-time context platform. It's really about providing your AI models access to all of the best information that your business has about your users and your customers in an automated, fresh way, in real time, and at scale. It's a hard problem to solve, but everyone who's running AI in production needs to solve it.
One design decision question. I remember when we built this, at least at LinkedIn, we had the offline store for the batch workloads on Hadoop initially and for online workloads we had something called Voldemort. We then replaced it with Venice. Curious to hear what you did at Uber?
One interesting thing is at our layer, the underlying infrastructure is not the differentiator. That's an important thing to figure out with your product. Does your customer actually care what you use under the hood or do they just care that you solve the problem? And we didn't really know about that at first. We figured that out over time. So what you're asking is, "Hey, what is the online store?" The online store at Uber was Cassandra. We don't use Cassandra at Tecton. We started using Dynamo and then we moved to supporting Redis, and we're probably going to support other things over time.
Our customers do not care. They just want to know that it's going to work, that we can support their scale, their reliability, etc. We have two point solutions and they kind of roughly capture all of the needs from the market, and that's how we approach it.
Let’s talk a bit about open source at Tecton. Could you tell us a bit more about Feast?
Feast is an open-source project developed within an Uber competitor. After reading our blog post, employees at Gojek decided to try and recreate Michelangelo. They used it as an internal system and open sourced part of it, specifically the feature store. We liked the design and decided that we’d like the team to come work for us at Tecton. We brought them over and now maintain the Feast project.
We have a very unique open source dynamic where Feast is a separate codebase. It's not the same product as Tecton, but it's API compatible. So you build something in Feast and you can just use it and subsequently upgrade it in Tecton to get an enterprise-grade solution with monitoring, governance and lineage. But we are the maintainers for Feast now.
My partner at 8VC has invested in a brilliant open source project called Ollama. I led the Series A in a company called Acryl, which is based on the LinkedIn DataHub open-source project. We are also proud backers of Airbyte alongside at least eight or nine other open-source bets. And this is always a double-edged sword. Curious to get your thoughts on balancing the open vs. closed source aspect of it? How do you encourage users of open source Feast to maybe someday come and use Tecton’s paid offering?
If you're doing open source, it’s very important to think through how you don’t give away too much in your open source product and how you craft a clear path to a commercially viable business.
For us, because it's a different codebase, it made things slightly easier. The open source component is not quite production ready. No large companies use it in isolation in production at scale as they need a certain level of reliability, security, governance etc. It’s frankly obvious that they have to upgrade to Tecton when the time is right.
The constant worry though is whether your open source project will be successful. Because we were early, we defined the core concepts in this space. For this product category, there's a couple of core concepts, and this just became the canonical open source solution. There have been other open source solutions – they do some things better, some things worse, but none of them have taken off compared to this. When people write blog posts about the canonical open source stack, they reference Feast as the open source feature store.
You wrote a very interesting blog post about how feature stores could be important for certain use cases around using LLMs: real-time lookups, retrieving the right context, etc. Curious what's getting you excited there? Why is that an important use case?
Before foundation models, you’d have a use case–specific machine learning model to help you predict fraud, for example. You swipe your credit card, and you need that model to gain access to the relevant data quickly so it can make a decision. This has to happen in 50 milliseconds or less. Tecton’s early pitch is that we get you the aforementioned data so your model is fresh and deliver it as fast as possible.
Now with the advent of generative AI and LLMs, you don't need training anymore for every use case – you pass in a prompt, and you don't pass in features. We learned it's basically the exact same story. Your prompt can either be a subpar or good prompt. Similarly, you can have crappy features go into your model, or you can have good features go into your model. And if you have high-quality features, your model's prediction is going to be good. If you have a solid prompt, you're going to have good output from your LLM.
Let’s take a customer support use case. You could have a prompt that just includes a current customer complaint from a text box. Or… you could have a prompt with much more context – this is what the customer just told us and when, this is how many times they normally log in, the last error they hit, the number of products they bought from us this year, etc. You can go on and provide a lot of context to the LLM, so the LLM can provide a much richer response.
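The difference between the two prompts in the customer support example can be sketched in a few lines. This is an illustration under assumptions: `build_support_prompt`, the context keys, and their values are hypothetical, and in practice the context would be fetched from a real-time feature or context store rather than hard-coded.

```python
def build_support_prompt(complaint, context=None):
    """Assemble a support prompt, optionally enriched with customer context."""
    lines = ["You are a customer support assistant."]
    if context:
        lines.append("Customer context:")
        # Sort keys so the prompt is deterministic for a given context.
        for key in sorted(context):
            lines.append(f"- {key}: {context[key]}")
    lines.append(f"Customer message: {complaint}")
    return "\n".join(lines)


# Hypothetical context values, standing in for a real-time lookup.
context = {
    "logins_last_30d": 14,
    "last_error": "PAYMENT_DECLINED",
    "products_bought_this_year": 3,
}

bare_prompt = build_support_prompt("I can't check out.")
rich_prompt = build_support_prompt("I can't check out.", context)

print(bare_prompt)
print("---")
print(rich_prompt)
```

The model call itself is left out on purpose. The sketch only shows that the same complaint can reach the LLM with or without the surrounding customer data.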
It's the same underlying data problem. There are offline workloads around evaluation and online workloads around real-time inference. We realize it's the same problem, retrieving real-time context, and I think we're just lucky that our tooling happened to be perfectly ready for exactly this kind of new wave of AI.
Would welcome any quick advice for founders who are eager to play in the area of AI infrastructure!
We have three cultural values at Tecton. The first is customer obsession – care about your customer, understand them deeply, and truly be obsessed with them.
The second is to be an owner, not a renter. You have to care and take pride in your work. If you're not treating the workplace like a house you’ve purchased, then you’re not a fit for this environment.
And lastly, the third is fast, but focused. If you're not moving fast as a startup, you're going to die. It’s critical to remain laser focused though – you can go fast and sprint in the wrong direction. It takes a lot of effort to keep your focus tightly honed especially given the broader noise in the market around generative AI.
If you’re getting distracted along the way, you're going to spin your wheels. But if you have a finely honed vision and stick to it, you'll accomplish that small goal, start winning with that initial wedge and can ultimately build on top of it.