Scalable AI Architectures

Apr 30, 2024

Today, two neural architectures scale to the limits of available data and compute:

  1. Transformers
  2. Diffusion models

There’s a myth spreading that scalable architectures aren’t rare, and that every architecture scales with enough optimization. However, decades of research have revealed countless architectures – inspired by all manner of physics, neuroscience, fruit fly mating habits, and mathematics – that don’t scale.

Even if data and compute are the most urgent bottlenecks in advancing today’s capabilities, we shouldn’t take scalable architectures for granted. We should value them, advance them, and find better architectures that address their limitations.

Neural architectures that don’t (black) and do (blue) scale

Finding new scalable architectures doesn’t necessarily require massive training runs, thanks to scaling laws. Scaling laws are empirical relationships that predict how model performance changes as dataset size, model size, and training compute vary. In other words, if we plot an architecture’s performance across enough small training runs along those dimensions, we can predict its performance at much larger runs.
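To make that concrete, here is a minimal sketch of fitting and extrapolating a scaling law with scipy. The model sizes, loss values, and the saturating power-law form L(N) = c + a * N^(-alpha) are illustrative assumptions, not measurements from any real architecture.

```python
# Sketch: fit a scaling law to small training runs and extrapolate.
# The loss values below are made up for illustration; a real fit uses
# losses measured from your own small-scale runs.
import numpy as np
from scipy.optimize import curve_fit

# Validation loss measured at several small model sizes (parameters).
model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])       # parameters
val_losses  = np.array([4.98, 4.73, 4.48, 4.27, 4.06])  # hypothetical

# Saturating power law: L(N) = c + a * N^(-alpha).
# The constant c is the "irreducible" loss the architecture plateaus at.
def scaling_law(n, a, alpha, c):
    return c + a * n ** (-alpha)

params, _ = curve_fit(scaling_law, model_sizes, val_losses,
                      p0=[10.0, 0.1, 1.0], maxfev=10000)
a, alpha, c = params

# Extrapolate to runs 10x-1000x larger than anything we trained.
for n in [1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {scaling_law(n, *params):.2f}")
print(f"fitted exponent alpha={alpha:.3f}, irreducible loss c={c:.2f}")
```

If the fitted curve flattens toward its irreducible loss well before the model sizes you care about, that is the plateau behavior discussed below.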

For many neural architectures, scaling laws level off well below the 1 billion-parameter mark. Kaplan’s 2020 paper, “Scaling Laws for Neural Language Models”, shows the scaling laws of traditional LSTMs plateauing at just 10 million parameters. Tay’s 2022 paper produces scaling laws for ten different architectures, and shows five non-transformer architectures’ scaling laws plateauing well before the 1 billion-parameter mark.

As a result, ruling out new architectures by examining their scaling laws can be quite cheap. According to Databricks, a 1 billion-parameter training run costs only $2,000.
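As a rough sanity check on that figure, here is a back-of-envelope estimate using the common approximation that transformer training takes about 6 * parameters * tokens FLOPs. The token budget, GPU throughput, utilization, and hourly price below are assumptions chosen for illustration, not quoted numbers.

```python
# Back-of-envelope cost of a ~1B-parameter training run.
# Uses the common approximation: training FLOPs ~= 6 * params * tokens.
# Throughput, utilization, and price below are illustrative assumptions.

params = 1e9            # 1B parameters
tokens = 20e9           # ~20 tokens per parameter (Chinchilla-style budget)
flops = 6 * params * tokens

peak_flops_per_gpu = 312e12   # assumed BF16 peak for an A100-class GPU
utilization = 0.40            # assumed sustained model FLOPs utilization
price_per_gpu_hour = 2.00     # assumed cloud price, USD

gpu_seconds = flops / (peak_flops_per_gpu * utilization)
gpu_hours = gpu_seconds / 3600
cost = gpu_hours * price_per_gpu_hour

print(f"total FLOPs:      {flops:.2e}")
print(f"GPU-hours needed: {gpu_hours:,.0f}")
print(f"estimated cost:   ${cost:,.0f}")
```

Depending on the token budget, utilization, and pricing you assume, this lands in the hundreds to low thousands of dollars, the same order of magnitude as the figure above.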

Traditional LSTMs’ scaling law plateauing in “Scaling Laws for Neural Language Models” (Kaplan, 2020)
Non-transformers’ scaling laws plateauing in “Scaling Laws vs Model Architectures” (Tay, 2022)

In fact, it was OpenAI’s 2019 GPT-2 transformer training run at 1.5 billion parameters that gave them the confidence to invest in a GPT-3 training run at 175 billion parameters, over two orders of magnitude larger:

  1. Scaling laws: “by the time we made GPT-2… you could look at the scaling laws and see what was going to happen.” - Sam Altman, On with Kara Swisher, 3/23/23
  2. Benchmarks: “state-of-the-art on Winograd Schema, LAMBADA” - OpenAI Announcement
  3. Emergent capabilities: “capable of generating samples… that feel close to human quality” - OpenAI Announcement
GPT scaling laws in “Scaling Laws for Neural Language Models” (Kaplan, 2020)

The straightforward next step to improve model capabilities is continuing to scale the architectures that work, by increasing the size of our data centers and datasets. However, these architectures have major limitations, such as logarithmic returns to scale and limited context windows, which means there's still an opportunity to innovate in core architecture research.

Positively bending our best scaling laws would help not only at the top end, by increasing the capabilities of models trained at the largest data centers on the largest datasets, but also at the middle and low ends, by allowing capable models to be used at affordable prices.

Some of the most exciting lines of work to improve existing scalable architectures include:

  1. Planning (e.g., tree search, reinforcement learning)
  2. Longer context (e.g., Stanford’s FlashAttention, Berkeley’s RingAttention)
  3. Ensembling (e.g., Google’s mixture of experts, Sakana’s evolutionary model merging; a minimal mixture-of-experts routing sketch follows this list)
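To give a flavor of the ensembling direction, here is a minimal top-k-gated mixture-of-experts layer in PyTorch. This is a simplified sketch, not any lab’s production implementation: there is no load-balancing loss, no capacity limits, and every expert runs densely rather than being sparsely dispatched.

```python
# Minimal sketch of a top-k gated mixture-of-experts (MoE) layer in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # router: scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        logits = self.gate(x)                                   # (B, S, num_experts)
        weights, indices = torch.topk(logits, self.k, dim=-1)   # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen k

        # Run every expert densely; a real MoE dispatches tokens sparsely instead.
        expert_outs = [expert(x) for expert in self.experts]    # each (B, S, d_model)

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = indices[..., slot]                            # (B, S) chosen expert ids
            w = weights[..., slot].unsqueeze(-1)                # (B, S, 1) mixing weight
            for e, e_out in enumerate(expert_outs):
                mask = (idx == e).unsqueeze(-1)                 # tokens routed to expert e
                out = out + mask * w * e_out
        return out

# Usage sketch:
# moe = TopKMoE(d_model=512, d_hidden=2048)
# y = moe(torch.randn(2, 16, 512))   # (batch=2, seq=16, d_model=512)
```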

There are also early candidates for new scalable architectures:

  1. State-space models (e.g., Stanford’s S4, Hyena, Cartesia’s Mamba, AI21’s Jamba; a toy state-space recurrence is sketched after this list)
  2. RNN variants (e.g., Bo Peng’s RWKV)
  3. Hybrid (“striped”) variants that interleave the above with existing architectures
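For a feel of what a state-space layer computes, here is a toy diagonal SSM recurrence in PyTorch. It is a deliberately naive sequential scan for illustration; real S4- and Mamba-style layers use structured parameterizations and convolution or parallel-scan tricks rather than an explicit Python loop, and Mamba additionally makes the dynamics input-dependent.

```python
# Toy diagonal state-space model (SSM) layer: a linear recurrence over the
# sequence, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiagonalSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        # Diagonal state transition A (kept stable in (0, 1) via exp(-softplus)),
        # plus per-channel input (B) and output (C) projections.
        self.log_a = nn.Parameter(torch.randn(d_model, d_state))
        self.b = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.c = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> y: (batch, seq, d_model)
        batch, seq, d_model = x.shape
        a = torch.exp(-F.softplus(self.log_a))        # decay factors in (0, 1)
        h = x.new_zeros(batch, d_model, a.shape[-1])  # hidden state per channel
        ys = []
        for t in range(seq):
            u = x[:, t, :].unsqueeze(-1)              # (B, d_model, 1)
            h = a * h + self.b * u                    # h_t = A h_{t-1} + B u_t
            ys.append((self.c * h).sum(-1))           # y_t = C h_t -> (B, d_model)
        return torch.stack(ys, dim=1)

# Usage sketch:
# ssm = ToyDiagonalSSM(d_model=64)
# y = ssm(torch.randn(2, 128, 64))   # (batch=2, seq=128, d_model=64)
```

The appeal of this family is that the state is fixed-size, so sequence length adds compute linearly rather than quadratically as in attention.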

Notably, these new architectures are mostly driven by small teams outside the major industry labs. For example, S4 has three authors, Mamba has two authors, and RWKV is primarily developed by Bo Peng.

These emerging architectures are already proving useful in real-world domains. For example, Arc Institute's Evo, a 7B biological model, is based on an architecture that avoids the quadratic complexity of attention, allowing it to process vast volumes of biological sequence data within a 131K context window.

We're still early in the history of AI, and our architectures are far from physics-limit optimal. While scaling via data center buildouts and data acquisition is the clear next step in advancing model capabilities, there are still major breakthroughs to come from the scalable architectures themselves.

Notes 

  1. I opted for simplicity over accuracy in the discussion of scaling laws. In reality, they’re not as well-understood as many believe. We only know a small sliver of how models scale above a certain level, and it’s hard to answer questions beyond “What is the best model I can train with a fixed budget of $__?”. Factors like data quality are also hard to quantify.
  2. Scaling law papers often prioritize speed over rigor (see failed replications), because the field is moving so fast.
  3. What will be the real bottlenecks in scaling known architectures? Data? GPU supply? Energy? Dollars? Data centers? Researchers?
  4. What should small teams outside the major industry labs work on?
  5. There are many other promising approaches to improving models that don't involve modifying the architecture, like novel training methods (e.g. UL2R), better optimization, synthetic data, data curation, compression, and alignment.

Thanks to friends and colleagues who provided feedback on early drafts of this.
