The Other 90%: Introducing the Aryn-8VC Partnership
Unstructured data is the undiscovered country of the enterprise, containing the institutional wisdom AI seeks to capture. Unlocking the value of that data has proved elusive, as both enterprise search and LLMs have fallen short individually. Aryn's mission is to help you answer questions from all of your data. To this end, Aryn is bringing generative AI to OpenSearch and data preparation - and bridging the gap from results to answers.
While building analytics services at AWS, Aryn’s founding team observed several trends pointing towards a new company. Many customers successfully processed structured data but struggled with unstructured data - which represents up to 90% of enterprise-generated data. However, most analytics services under development focused on structured data. A few proprietary unstructured data search platforms existed for end users and IT teams, but developers were largely ignored. It became clear that better unstructured data services were urgently needed. More recently, advances in generative AI have captivated users, who increasingly prefer natural language interactions in their applications. The next wave of search experiences will be conversational. Aryn will make them attainable.
Architecture & Advantages
Aryn is built on two realizations: 1) In generative AI, data quality determines answer quality - hence the necessity of cleaning and enriching unstructured data. 2) Generative AI is not a layer or tool, but an enabler at every layer. By augmenting data preparation and search with AI, as well as for AI, it’s possible to deliver a conversational search experience that’s high-quality and easy to build on. The result is Aryn’s conversational search stack for unstructured data:
- Conversational APIs let developers easily build conversational search apps, leveraging their preferred generative AI model(s) with plug-and-play convenience.
- Hybrid Search integrates the best of semantic and keyword search behind a seamless interface, using OpenSearch, the popular open source search and analytics suite.
- Sycamore, Aryn’s semantic data preparation system, uses generative AI to clean, extract, enrich, and summarize data. In other words, it improves data quality to improve answer quality.
Aryn’s key differentiators are threefold:
- Speed to production and scale. For most companies, retraining LLMs to answer questions on private data is prohibitively expensive. This can be overcome using retrieval-augmented generation (RAG), which runs semantic search over a company's own data and passes the results to a generative AI model to ground its answers. However, this approach typically requires a complex pipeline of components lacking scale, security, and maintainability. Aryn’s conversational search stack provides everything developers need to build conversational apps - without requiring AI or search expertise. With OpenSearch as a central piece, developers can easily scale their applications to production.
- Better quality answers. When using RAG, answer quality reflects the data presented to the LLM, and the ability to experiment with different models and prompts can have a significant effect on quality. High-stakes work requires accurate outputs, not just decent probabilities. To tackle this problem, Aryn created and open sourced Sycamore, a robust, scalable semantic data preparation system for making unstructured data meaningful - and therefore searchable. Aryn allows developers to customize their data prep and search pipelines, so they can ground AI models on the highest quality data required by their use cases.
- Fully open-source. Aryn’s conversational search stack is 100% open source, under the Apache v2.0 license. It allows developers to customize pipelines and choose their AI models, avoiding lock-in or overreliance on any LLM.
A cross-section of early Aryn customers reveals the demand for conversational search, spanning verticals such as financial services, manufacturing, life sciences, government, productivity tools, and media.
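To make the retrieval-augmented generation (RAG) pattern from the first differentiator concrete, here is a minimal sketch in plain Python. It is not Aryn's implementation: the `retrieve` and `build_grounded_prompt` helpers are hypothetical names, the word-overlap retriever stands in for real semantic search, and the call to an actual model is left out entirely.

```python
def retrieve(question, passages, k=2):
    """Toy retriever (assumption): rank passages by shared words with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(question, passages):
    """Number each passage so a model could cite it as [1], [2], ..."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below, "
        "and cite the passages you use.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical corpus, loosely based on the sort-benchmark example.
passages = [
    "JouleSort measures the energy required to sort records.",
    "CloudSort measures the minimum cost of sorting on a public cloud.",
    "Pasta should be cooked in salted water.",
]
question = "What does CloudSort measure?"
top = retrieve(question, passages)
prompt = build_grounded_prompt(question, top)
print(prompt)
```

The point of the sketch is the shape of the flow: the model only sees passages selected from private data, so answer quality depends on what retrieval and data preparation put in front of it.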
Aryn in Action
Already, Aryn consistently generates answers that would be impossible using purely lexical or LLM-based approaches. As an illustration, take the following query: "What algorithm was the most recent winner of the sort benchmarks?" Aryn's answer:
"The most recent winners of the sort benchmarks in 2022 are Exoshuffle-CloudSort for CloudSort and ELSAR for JouleSort [citation]. CloudSort measures the minimum cost for sorting a large amount of data on a public cloud. JouleSort measures the amount of energy required to sort a certain amount of records. [citation]."
Asking this question of most lexical search platforms returns a document or snippet with matching keywords, not a direct answer. These systems force users to wade through results that may not even address the questions asked, let alone correctly.
In the absence of ground truth, most LLM-based platforms today don't fare much better. Here's ChatGPT's answer: "I don't have access to real-time data, and there isn't a specific 'sort benchmarks' competition or award that I'm aware of with regularly updated winners. Sorting algorithms and their performance are typically evaluated in research papers, textbooks, and online coding communities. Benchmarks are conducted by researchers and developers to compare the performance of different sorting algorithms in specific scenarios."
Famously, when engineers join Google’s search team, they’re given a presentation on the “Life of a Query”, describing exactly what happens from the moment a user types a query and hits “enter”. Applying this frame to Aryn’s conversational search stack, we can appreciate the full journey from question to answer:
- User input: Everything starts with a user posing a natural language question.
- Query understanding: The system gives the LLM both the recent conversation history and the input query as context, so that it can understand and rewrite the query for better retrieval. For example, if a user asks “What’s the temperature of a cool star?”, traditional search systems might interpret ‘cool’ as fascinating or interesting. LLMs, recognizing the scientific context, would understand that the user is talking about lower temperature stars, such as red dwarfs.
- Index lookup: As with a traditional search system, the system checks its index of known documents to identify potentially relevant documents for the retrieval set.
- Dual ranking: The system combines traditional lexical ranking signals, such as BM25 and TF-IDF, with semantic and contextual relevance from LLMs to sort the documents in the retrieval set by relevance.
- Synthesis: The system uses a foundation model to synthesize the answers from the top-ranked documents, along with their citations.
- Conversation platform: The system keeps a record of questions and answers within the same session, and uses that information to home in on certain topics over time.
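The dual ranking step above can be illustrated with a small, self-contained sketch. It assumes a common BM25 variant for the lexical signal, cosine similarity over precomputed embedding vectors for the semantic signal, and a min-max-normalized weighted sum to combine them. The function names, the `alpha` weight, and the toy vectors are all assumptions for illustration, not a description of how Aryn or OpenSearch implement hybrid search.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a standard BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Return document indices sorted by a weighted mix of both signals."""
    lexical = min_max(bm25_scores(query, docs))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


# Hypothetical documents and two-dimensional stand-in embeddings.
docs = [
    "sort benchmark winners 2022",
    "recipes for pasta",
    "cloud sorting cost",
]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
order = hybrid_rank("sort benchmark", [1.0, 0.0], docs, doc_vecs)
print(order)
```

In this toy data the third document shares no exact keyword with the query ("sorting" does not match "sort"), so it gets no lexical score, but its vector is closer to the query than the pasta document's. That is the kind of case where a semantic signal can recover a result that keyword matching alone would miss.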
Only an extraordinary team could resolve the challenges of creating true conversational search app infrastructure. CEO Mehul Shah served as GM of OpenSearch and Glue at AWS, leading product, engineering, operations, and GTM, and bringing these technologies to some of the world’s largest enterprises. CTO Ben Sowell was technical lead for both AWS Glue and AWS Lake Formation. Chief Product Officer Jon Fritz led product management for AWS’ Apache Spark and Hadoop services, and in-memory database services like ElastiCache for Redis. Having three former AWS Principal Engineers and two AWS GM/Directors among Aryn’s founding team is extraordinary, but only one token of their search, big data, and cloud systems bona fides. On a personal note, BG and Mehul are fellow practitioners from the old school of Silicon Valley database kernels, and have been hoping to work together for almost two decades. We’re confident Aryn will prove worth the wait.
Today, Aryn exited stealth, announcing a $7.5 million seed round. We participated for many reasons, starting with the team, which combines vast experience with a rare cohesiveness. Architecture-wise, many vector databases have cropped up, but we bet on a proven search platform (including a scalable vector database) already used by tens of thousands of engineers and thousands of enterprises, and pre-approved by their IT departments. The elegant integration of transformer models throughout adds untold possibilities to these mature, powerful frameworks - without locking customers into one model. Crucially, Aryn is the only end-to-end infrastructure for conversational search that is 100% open source, configurable, and model agnostic. For developers and users alike, this commitment to flexibility ensures they can meet mission requirements now and in the future.
In a sense, Aryn enables stereoscopic vision for the enterprise, combining familiar and novel methods into something natural and universal. In practice, Aryn achieves an elegant symbiosis: it allows developers to build apps for mission critical endeavors without needing to become AI experts, while making the contributions of AI experts exponentially more useful to fields where they have no prior knowledge. We are honored to be a part of Aryn’s story, and look forward to seeing that story become an epic.