Knowledge Management & Retrieval -- Fireside Chat with Mem & Shortwave
We were delighted to engage in a dynamic conversation with Andrew Lee (co-founder/CEO of Shortwave) and Dennis Xu (co-founder/CEO of Mem) during our June Chat8VC gathering. As a reminder, Chat8VC is a monthly series hosted at our SF office where we bring together founders and early-phase operators excited about the foundation model ecosystem. We focus on having highly technical, implementation-oriented conversations and give founders and builders the opportunity to showcase what they’re working on through lightning demos and fireside chats. This can be a fun side project, related to a company they’re starting, or a high-leverage use case within a company operating at scale.
This was an exciting discussion about hard implementation problems, taking AI features into production (and how to think about prioritization), and how to experiment quickly in an ecosystem that changes underfoot each week. Please enjoy the transcript below!
If any of these themes excite you and you’d like to chat more with our team about them, please reach out to Vivek Gopalan at email@example.com!
Dennis, you’ve described Mem in the past as a “Google search bar for all non-public information – for every piece of information that is uniquely relevant to you.” Tell us about your inspiration for starting Mem and why that problem was uniquely exciting.
I'll start with our mission – how do we give everyone their own personal J.A.R.V.I.S. from Iron Man? How do we build this personal super intelligence that helps you think smarter, create efficiently, and reduce cognitive burden? We started the company in 2019, but it was an idea my co-founder and I have been loosely thinking about for the past 10 years.
It all started over a meal at a Korean restaurant…I pulled out my phone and remarked that if I gave someone my phone, they would be able to perfectly reconstruct my entire life. They would by and large know everything about me, from how I’m spending my time to what I’m thinking.
And yet, we can't actually make use of any of that information in our day-to-day lives. The only people who can are Google and Meta, and they’re using this information to personalize ads.
The question still remains – how do we actually build something that knows everything about you that you can actually use? This unyielding thought persisted and served as the foundational inspiration behind Mem.
Andrew, you're a founder many times over from biometrics to consumer internet and now, databases and email. Tell us a bit about your founder journey across multiple startups. Where do you find inspiration to tackle such meaningful problems?
For background, I'm likely best known for being one of the founders of Firebase. I left in 2017, took a few years off and in early 2019, was ready to kick off my next adventure.
There’s something quite amazing about the federated infrastructure that underpins email. It is the only truly decentralized communication network that we have. Crypto diehards often talk a big game about decentralization, but they're all on Discord!
And meanwhile, email exists and it works, at scale! It is the largest network effect – Facebook has roughly 3 billion users, whereas email has nearly 4.5 billion. Given this, I wanted to create a bright future for this open network.
It was clear to me in 2019 that neither Google nor Microsoft was going to pick up the torch and improve the overall experience. They were obsessed with building out their messenger functionality.
I want email to be intuitive and delightful, so this strong vision for what email could be spurred my team and me into action.
We wanted to center this discussion around the core AI features that Mem and Shortwave are rolling out at the forefront of knowledge management. As we were going through the Shortwave demo, one of the key components highlighted included retrieval – both of you face challenges when it comes to designing retrieval systems and have treated this problem with great nuance. What worked well initially, what ultimately broke? Any key insights from this process?
Dennis Xu (Mem) My Twitter feed is currently inundated with AI demos – people are spinning up impressive demos in 3-4 days. However, once you dig in beyond surface level and actually try the product, you often realize they are fairly subpar. We went through an analogous journey at Mem, as we initially relied solely on Curie embeddings. I'm sure everyone here is using Ada for speed or something more recent. Our original use case was for managing search over a significant amount of information across your knowledge base.
We started off as a notes app. The initial thesis: what if we could resurface semantically similar notes to what you were reviewing at any given point in time? When you first throw things into a vector database, you realize after the first few queries that it’s quite magical. You search for “Colorado,” and notes about Breckenridge surface.
Imagine all the stuff you had to build in the past to make that work. As you go deeper you understand that semantic similarity is not enough to truly model what the world is. There’s a lot to do with keyword search that's actually quite useful. But then, even if you have a hybrid system with keyword and semantic vectors, you discover many different types of relationships exist.
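To make the hybrid idea concrete, here is a minimal sketch of blending a keyword-overlap score with cosine similarity over embeddings. It is an illustration only, not Mem's implementation: the two-dimensional vectors are hand-written stand-ins for real embeddings, and the `alpha` weight and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of query words that literally appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_rank(query, query_vec, notes, alpha=0.5):
    """Rank notes by alpha * semantic score + (1 - alpha) * keyword score."""
    scored = []
    for note in notes:
        semantic = cosine(query_vec, note["vec"])
        lexical = keyword_score(query, note["text"])
        scored.append((alpha * semantic + (1 - alpha) * lexical, note["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]

# Toy data: the vectors are made up so that "Breckenridge" sits near "Colorado".
NOTES = [
    {"text": "Ski trip to Breckenridge", "vec": [0.9, 0.1]},
    {"text": "Colorado tax forms", "vec": [0.2, 0.9]},
    {"text": "Grocery list", "vec": [0.0, 1.0]},
]
COLORADO_VEC = [1.0, 0.0]

semantic_only = hybrid_rank("Colorado", COLORADO_VEC, NOTES, alpha=1.0)
blended = hybrid_rank("Colorado", COLORADO_VEC, NOTES, alpha=0.5)
```

With `alpha=1.0` the Breckenridge note ranks first on semantic similarity alone, while the blended ranking promotes the note that literally contains the word "Colorado". Neither score captures structured relationships such as who attended which event, which is the gap Dennis describes next.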
What about the structured relationships that can't be represented semantically? For example, I spoke at a fireside chat with Andrew Lee at the Chat8VC event. How do you actually represent that semantically? Our journey has been long and retrieval has been a core aspect of it.
Andrew Lee (Shortwave) I’d like to preface my response by acknowledging that we’re figuring things out as we go. The way our system worked for the demo earlier this evening is different than last week or the month prior. We’ve experimented greatly, and I’m always concerned we’re not doing enough.
I want to echo what Dennis said about semantic similarity not being sufficient. You can say, “I put everything into the database, grabbed things that are similar to my question and threw them into a prompt…done!” The problem is that the answer to your question might not be similar to the question. If I ask what happened a month ago, my question contains the phrase “a month ago,” but the answer won't. The answer will contain something that looks like a date, one that falls a month back. You need to be specific and aware of these unique cases.
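One way to picture the “a month ago” problem is to resolve the relative phrase into a concrete date range before retrieval, so the search can filter on dates rather than on the words in the question. The sketch below is a hedged illustration using only the standard library; the function name, the 30-day approximation of a month, and the three-day window are all assumptions, and the transcript does not say Shortwave does it this way.

```python
import re
from datetime import date, timedelta

# Assumption for illustration: a month is approximated as 30 days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}

def resolve_relative_date(question, today, window_days=3):
    """Turn phrases like 'a month ago' or '2 weeks ago' into a (start, end) date range.

    Returns None when the question contains no such phrase.
    """
    match = re.search(r"\b(a|an|\d+)\s+(day|week|month)s?\s+ago\b", question.lower())
    if match is None:
        return None
    count = 1 if match.group(1) in ("a", "an") else int(match.group(1))
    target = today - timedelta(days=count * UNIT_DAYS[match.group(2)])
    # Widen the target into a window, since "a month ago" is rarely exact.
    return (target - timedelta(days=window_days), target + timedelta(days=window_days))

date_range = resolve_relative_date("What happened a month ago?", date(2023, 6, 30))
```

The resulting range can then be used as a metadata filter alongside the semantic query.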
A lot of the work we've been spearheading is trying to understand intent, especially as it relates to metadata. For example, we leveraged metadata extraction in the demo. When you ask your question, the first thing we do is reformulate your query. We have a big prompt that has an understanding of the current context you have in the browser. It knows you're looking at a thread, what the thread says, your name, date and the state of the world to some extent.
We take that and your chat history, and we use an LLM to rewrite this into a single question that contains all of the relevant details necessary to answer the question. On the back end, we do single feature extraction – we look for date ranges, keywords and other labels. We then move forward and run many queries in parallel. We do semantic similarity because sometimes the answer's there, but we look for the different types of attributes. We do things based on time range, labels and keywords. We compile this information and run it through a cross encoder.
The cross encoder's job is to figure out for a given document and given question, how likely is the document to be useful in answering your question. We get a score and then apply a bunch of weights and heuristics on top of that using the same features to determine what's likely to answer the question. We throw this information into a big prompt and pick the top options.
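The flow Andrew describes (extract features, run several searches in parallel, merge the candidates, score each one against the question, apply heuristics, keep the top results) can be sketched roughly as follows. This is a toy under stated assumptions, not Shortwave's code: the in-memory `EMAILS` list, the three search functions, the `0.25` recency bonus, and especially `stub_cross_encoder` (a word-overlap stand-in for a real cross encoder model) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy corpus; "day" is a day-of-month stand-in for a real timestamp.
EMAILS = [
    {"id": 1, "text": "Invoice for the design work is attached", "label": "finance", "day": 10},
    {"id": 2, "text": "Lunch on Friday?", "label": "personal", "day": 28},
    {"id": 3, "text": "Reminder: invoice payment is due", "label": "finance", "day": 29},
]

def by_keyword(features):
    return [e for e in EMAILS if any(k in e["text"].lower() for k in features["keywords"])]

def by_label(features):
    return [e for e in EMAILS if e["label"] in features["labels"]]

def by_time_range(features):
    start, end = features["day_range"]
    return [e for e in EMAILS if start <= e["day"] <= end]

def stub_cross_encoder(question, document):
    """Stand-in for a real cross encoder: fraction of question words found in the document."""
    q_words = set(question.lower().split())
    d_words = set(document.lower().replace(":", " ").split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def retrieve(question, features, top_k=2):
    searches = [by_keyword, by_label, by_time_range]
    # Run the independent searches in parallel and collect their result lists.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda search: search(features), searches))
    # Merge and de-duplicate candidates by id.
    candidates = {e["id"]: e for results in result_lists for e in results}
    start, end = features["day_range"]
    scored = []
    for email in candidates.values():
        score = stub_cross_encoder(question, email["text"])
        if start <= email["day"] <= end:
            score += 0.25  # hypothetical heuristic: bonus for matching the date range
        scored.append((score, email["id"]))
    scored.sort(reverse=True)
    return [email_id for _, email_id in scored[:top_k]]

features = {"keywords": ["invoice"], "labels": ["finance"], "day_range": (25, 30)}
top_ids = retrieve("which invoice payment is due", features)
```

In a real system the features would come from the reformulated query, and the selected documents would be placed into the final prompt.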
You both brought up the concept of doing pre-filtering using metadata – either structured metadata that explicitly exists in the content that you have, or some form of inferred metadata that you are pulling out of unstructured data. How do you decide what's relevant and is there a path dependency with some of this pre-filtering? How much manual work is required?
Dennis Xu (Mem) We live in a world where we have hyper intelligent models that don’t often have full context or understanding. It's almost like a hyper intelligent human that can only remember the last second of memory. Context becomes everything. So, how do you provide context regarding an email thread you’re looking to actively reply to?
Beyond providing the model with context, we empower the LLM to make decisions. Nowadays, you actually don't need to build vertical specific rules as much as you used to. You have a domain general expert that is privy to details about nearly everything. Why make decisions yourself as a human when you can leverage technology like this?
Ultimately there’s less need for heuristics – a lot of things can just be inferred by the base model.
When you think about different strategies for embeddings, what have you tried? Have you played around with chunking, perhaps context to word chunking versus fixed chunking? What are some of the experiments you've run and how do you validate those experiments?
Andrew Lee (Shortwave) We’ve played a little bit with chunking, but that's one of the areas where I wish we had conducted more experiments. You can obviously do it on message boundaries and emails, but it doesn’t always fit. We've tried a few different models for the embeddings. Right now, we're actually using an open source one as opposed to the OpenAI one. We tried the OpenAI model but decided against it given speed, cost and privacy. This way, we don't have to make a call out to OpenAI for every single email. We only have to do it for the things that you ask AI for, which has some benefits.
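As a rough illustration of why message boundaries do not always fit, the sketch below chunks a thread on message boundaries and falls back to fixed-size word windows when a single message is too long. The `max_words` limit and the function name are assumptions made for the example; the transcript does not describe Shortwave's actual chunking logic.

```python
def chunk_thread(messages, max_words=50):
    """Chunk on message boundaries, splitting oversized messages into fixed-size word windows."""
    chunks = []
    for message in messages:
        words = message.split()
        if not words:
            continue  # skip empty messages
        if len(words) <= max_words:
            chunks.append(" ".join(words))  # the whole message fits in one chunk
        else:
            # Fallback: fixed-size windows, which may cut across sentences.
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
    return chunks

thread = ["short note", " ".join(["word"] * 120), ""]
chunks = chunk_thread(thread, max_words=50)
```

Here the short message stays intact while the 120-word message becomes three chunks of 50, 50 and 20 words.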
We recently discovered that if our system works the way I just described, maybe we actually don't care that much about the accuracy of the vector database. If our whole system is assuming we're going to pull a thousand documents and run it through a cross encoder, it’s actually much more important that the cross encoder is solid as opposed to the vector database. We've been thinking about cost factors here too. Embedding all of your emails, throwing this info into a vector database and running a model is quite expensive.
Dennis Xu (Mem) Evaluating embeddings has been very hard. Only in the last few months have we started getting serious about true evaluations and methodology. You actually can’t create holistic evaluation criteria until you have multiple people working on the exact same pipeline. If you just have one person working on it, the gains you can make are so obvious, you frankly don't need to validate them. Where evaluations become useful is when the product scales and requires cross-team collaboration.
It becomes challenging if a team member wants to make a prompt change to inspire a certain action, but a colleague might disagree. One of the new capabilities of GPT-4 is the idea of reflection – a model that is capable of reflecting on its own outputs is one thing, but now the model can reflect on any kind of action. This is something that just wasn't possible before. GPT-4 can now evaluate its own outputs and determine if it was actually correct.
How accurate is trusting GPT with auto evaluation? What are the guarantees that you're not biasing towards what GPT thinks is a good answer as opposed to what is objectively a better answer? A human might say that a shorter, more concise answer is better than a verbose one, for example.
Dennis Xu (Mem) This works well if you share examples of a “good” answer and set the standard. GPT can then study the model output and compare the response. To your point though, if you're just evaluating without properly labeling, this becomes challenging – you don't really know what’s in your retrieval system or the data that’s being referenced.
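A small sketch of reference-based grading, under clear assumptions: `build_judge_prompt` and `pass_rate` are hypothetical helper names, the test cases are invented toy data, and `stub_judge` is a trivial substring check standing in for a real model call, so the example runs without any API.

```python
def build_judge_prompt(question, reference_answer, candidate_answer):
    """Assemble a grading prompt that shows the judge a known-good answer."""
    return (
        "You are grading an answer against a reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with PASS if the candidate matches the reference in substance, otherwise FAIL."
    )

def pass_rate(cases, judge):
    """Fraction of cases the judge marks as PASS; `judge` is any callable taking a prompt."""
    if not cases:
        return 0.0
    verdicts = [
        judge(build_judge_prompt(c["question"], c["reference"], c["candidate"]))
        for c in cases
    ]
    passed = sum(1 for verdict in verdicts if verdict.strip().upper().startswith("PASS"))
    return passed / len(cases)

def stub_judge(prompt):
    """Stand-in for a model call: passes when the reference text appears in the candidate."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    reference = fields["Reference answer"].lower()
    candidate = fields["Candidate answer"].lower()
    return "PASS" if reference in candidate else "FAIL"

# Invented toy cases, purely for illustration.
CASES = [
    {"question": "Where is the event?", "reference": "123 Main St",
     "candidate": "The event is at 123 Main St."},
    {"question": "When is it?", "reference": "June 14",
     "candidate": "Sometime in July"},
]
rate = pass_rate(CASES, stub_judge)
```

Swapping `stub_judge` for a real model call keeps the rest of the harness unchanged, and the labeled references are what anchor the judge to a standard rather than to its own preferences.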
You can ask the LLM how confident it is, but this may not be accurate…
You previously raised the point as to why do your own reasoning when you can offload some of that to the model itself? What's the right trade-off though between what you offload to the LLM when you sacrifice a few seconds of back and forth, or is there another way to approach the problem?
Dennis Xu (Mem) What I was referencing doesn't require back and forth. It's more about how you retrieve the relevant context for the LLM to then make the decision. There's a whole host of challenges around the latency of going back and forth.
Andrew Lee (Shortwave) We follow a similar approach. We give the LLM appropriate context and let it make a decision about what it cares about. At one point, we considered making the first step an LLM call that tells us the type of question, then plugging in a different prompt depending on the answer, in effect a switch statement on the question type. You end up having only as many features as there are cases in your switch statement, and you can't combine the data in interesting ways. You lose functionality and have to be correct at every stage for it to follow the right path. We only have a few calls to the big LLMs, and each one of those has a pretty big prompt with a lot of data in it. This seems to work better than a bunch of smaller calls that are strung together with other tools.
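The contrast between the two designs can be sketched in a few lines. Everything here is hypothetical (the template text, the question types, and the function names); it only illustrates why a routed design caps functionality at the number of branches, while a single large prompt lets the model combine whatever context it needs.

```python
# Routed design: one narrow template per anticipated question type.
ROUTED_PROMPTS = {
    "summarize": "Summarize this thread:\n{thread}",
    "search": "Find emails matching:\n{question}",
}

def routed_prompt(question_type, question, thread):
    """Pick a template by question type; unanticipated types raise KeyError."""
    template = ROUTED_PROMPTS[question_type]
    return template.format(question=question, thread=thread)

def single_prompt(question, thread, user_name, today):
    """One large prompt carrying all available context, leaving the decision to the model."""
    return "\n".join([
        f"User: {user_name}",
        f"Today: {today}",
        f"Open thread:\n{thread}",
        f"Question: {question}",
        "Use whichever parts of the context are relevant to answer.",
    ])

question = "Summarize this thread and find related invoices"
thread = "Re: design work invoice"
routed = routed_prompt("summarize", question, thread)
combined = single_prompt(question, thread, "Alex", "2023-06-30")
```

The routed prompt for "summarize" never sees the second half of the request, whereas the single prompt passes the question and the thread together.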
There are instances where you do want to use non LLM infrastructure. LLMs can be unreliable, slow and expensive. If you want things to be fast, reliable and cheap, you do want to leverage normal data infrastructure as well if you can.
One potential solution to some of the issues of, "Hey, you can't combine a lot of these different switch statements together…" is the idea of an agent. How do you feel about agents, where they are today and where they might be headed in the near future?
Dennis Xu (Mem) I saw a tweet the other day that resonated – in a few months, people aren't going to be talking about any distinction between chains and agents. If you think about an agent as something that is repeatedly hitting an LLM and after each step, is coming back and hitting the LLM again, I totally agree. That being said, there's a lot of ways to have the planning capabilities of an agent without incurring the cost of repeatedly hitting the LLM.
You have to enable asynchronous UI to exist. There are actually quite a few tasks for a knowledge worker where it's fine if it takes 10, 15 or even 30 minutes to complete – you just have to set the right expectations. Same case here, as it’s all about the framing of the UX and proper presumptions.
Andrew Lee (Shortwave) I’m excited about this opportunity as well. We could not only share a support case received last week, but review your entire email history to summarize all support cases. We don't do this because it’s slow, but I could see someone wanting this capability.
How standardized do you think context retrieval can get? When you establish standards for context retrieval, are those standards generalizable or do you have to completely think about a new strategy of retrieving the relevant context?
Dennis Xu (Mem) We adjust our whole system weekly! Your question also likely encompasses whether we route different queries differently or if every query is treated similarly…I think you can build a series of primitives for how information is retrieved. We also rewrite our stuff by taking a query – there are only a certain number of ways in which you can rewrite a query and retrieve relevant information to actually have the proper context to write that query. We try to have one pipeline for all of the different executions, but that pipeline evolves.
The demos highlighted earlier heavily emphasized text in and text out, where the user puts in a query. I'm curious if you’ve thought about other modalities or other ways of approaching the UI?
Dennis Xu (Mem) I'm sure people have heard rumors about GPT-4 multimodality. I have two conflicting opinions on this topic…
Firstly, many people talk about how chat is not the right UX for everything. This is likely true in the long run, but right now, chat models are improving fastest, seeing as they’re attracting nearly all user behavior and activity. If you try to be unique and build something else as opposed to chat, the models aren't as good for a lot of those use cases.
That being said, what I am really excited about is the opportunity associated with images and video. What if you could pull out your phone, record something around you and index this the same way you would a PDF…
Andrew Lee (Shortwave) I’ve studied audio a bit, which translates very nicely into text. It’s a seamless and straightforward way to interface. The other opportunity worth exploring is taking attachments and doing smart things with them. There are a ton of images and documents going through email. When I ask a question of AI, it would be great if it surfaced images that contain said keywords.
I still believe there’s a ton of unexplored opportunities surrounding text – not just the email contents, but the metadata in those email contexts and the contacts you have.
There was a case where we inquired about the address for an event and it pulled up the address – we couldn't figure out where it got this information from. We reviewed all emails and the information wasn’t present. Turns out it was in an HTML email and wasn’t visible.
With everything changing so quickly, both of you have to re-implement things often on a weekly cadence. How do you keep abreast of what's new and not fall into this wave of fatigue? How do you think about what information to retain and tune out some of the noise here?
Andrew Lee (Shortwave) I just ask AI, truly! I love seeing events come back to the city, as this wasn’t the case for a while. That said, I’m selective about which ones I attend. It is a full-time job to stay on top of all new information and requires a good amount of effort. But, I think it's so critical and worth doing. And frankly, I'm actually writing a lot of code these days, which is fun. I want to ensure I’m being very hands on.
Dennis Xu (Mem) About a year and a half ago, I started following 10 AI researcher accounts on Twitter. After this, my newsfeed has just been pure AI research, demos and commentary. I definitely went through a steep learning curve and information absorption. Anytime a new paper came out, it felt almost like a chore to review. I think a lot of the progress has actually slowed down.
The most meaningful recent advancement is the reflection capabilities of GPT-4. This is the most important technical breakthrough of GPT-4 and I believe a lot of cool things will stem from this development. To stay above the noise, I choose a big bet and candidly ignore everything else unless I get a strong signal around other foundational advancements worth paying attention to.