8VC Emerging Builders Spotlight: Piyush Jain (Yugabyte)

Mar 30, 2023

In supporting the industry-defining companies of the 8VC portfolio, we are fortunate to work with the brightest, most dedicated people in the world. We’re excited to feature some of the most promising engineering and product talent we have the pleasure of collaborating with, not only at 8VC, but within our broader network.

Today we are highlighting Piyush Jain from Yugabyte. Piyush is a software engineer at Yugabyte working in the database group. Before this, he worked at Nutanix, after which he completed his Master's at UT Austin with a focus on distributed systems. In his free time, Piyush keeps an eye on developments in layer 1 blockchains and enjoys hiking, practicing yoga, and playing tennis (which he picked up recently).

In your words, what is Yugabyte and why are you excited about the company?

Yugabyte is the company behind YugabyteDB, our open source distributed transactional database. You might be wondering “Why do we need another database?” Let me go back into some history to explain how YugabyteDB came to be.

Databases are complex pieces of software, which almost all technical systems rely on for their data storage and access needs. 

Databases have gone through a long and interesting journey and have evolved in two ways: 1) functionality-wise, to cater to new data management needs, and 2) scalability-wise, to support a larger population on the internet while still maintaining good response times.

Until the late 2000s, we simply had single-node databases. If you wanted to support more data and traffic, you would have to switch to a beefier single-node machine with more compute and storage. But these single-node databases were transactional, i.e., ACID compliant.

Having an ACID-compliant transactional database brought two major user benefits. Firstly, you could apply a set of operations atomically on the database without worrying about partial application of the operations. This helps in a bank transfer, for example, where the debit and the credit must either both happen or both not happen. The second benefit was that the database would serve multiple such atomic sets of operations in parallel for various clients, while ensuring that it behaves as if all transactions occurred serially, one after the other. This is important to ensure that when you transfer money to someone, two independent people don’t have to be blocked for your transfer to go through. This needs to be done while ensuring that two sets of operations on the same piece of data don’t trample on each other and lead to unintuitive outcomes, such as double spending from one account by performing simultaneous transfers that add up to more than the account balance.
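To make the atomicity point concrete, here is a minimal sketch of a bank transfer using Python's built-in sqlite3 module (a single-node database, not YugabyteDB). The table name, account names, and the helper functions `make_ledger`, `transfer`, and `balances` are assumptions made up for this illustration.

```python
import sqlite3


def make_ledger():
    """Create an in-memory ledger with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True if it committed."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises,
        # so the debit is never applied without the matching credit.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

In this sketch, a second transfer that would overdraw the account is rolled back as a whole, so the ledger is left exactly as it was before the attempt.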

As more people came online, applications grew, and you could vertically scale up your database node to handle the traffic. But there was a hard limit on the beefiest machine available on the market at any given time. These bigger servers also became more specialized and thus more expensive.

This paved the way for a class of distributed NoSQL databases that horizontally scaled by compromising on transactional guarantees. The application wouldn’t get the same intuitive guarantees that traditional databases provided, and would need some re-writing to bake in some/all of the required ACID guarantees into the application.

Reinventing the wheel for transactional requirements in each application is highly time-consuming and error prone as transactions are hard to get right. 

This is where YugabyteDB comes in: it provides the old, gold standard of being ACID compliant while still being horizontally scalable, similar to NoSQL databases. So, you get the best of both worlds. It also goes further, providing not just transactions, but also the classical features of relational databases in a distributed environment, which would have again required extra logic in the app if using a NoSQL database. Moreover, it runs almost anywhere, bare metal, any cloud, Kubernetes, etc.

Databases are all about tradeoffs. With the transition from SQL to NoSQL, you’re trading off consistency for scalability. What are the tradeoffs as you move to a distributed SQL database (i.e., horizontally scaled SQL) paradigm, and where does Yugabyte do the most work to minimize the tradeoffs that you’re forced to make in that transition?

Let’s discuss two different paths to distributed SQL databases: from NoSQL and from single-node relational databases.

When moving from NoSQL to distributed SQL, we are not really compromising on anything; instead, we are strictly gaining some benefits. A distributed SQL database can do everything that NoSQL does while achieving the same performance and providing the same scalability. This can be easily seen via the YCQL Cassandra API that YugabyteDB provides. Moreover, YCQL goes further by providing distributed transactions, which traditional NoSQL doesn't offer.

However, a trade-off exists when comparing with single-node relational databases. Distributed SQL has to incur higher latencies, which stem from the distributed nature of the system required for scale and availability, and ultimately from the laws of physics and the speed of light. Inter-node interaction is needed for all writes, since all data is replicated by a consensus algorithm that provides resilience to node failures and high availability. Moreover, distributed transactions that touch data on multiple nodes require additional inter-node coordination on top of the usual consensus replication. All distributed databases have to face this.

At Yugabyte there has been a lot of work put into ensuring that we pay only the penalty justifiable by this theoretical trade-off and nothing more. One example is that we make sure not to start a distributed transaction if we can decipher that it touches a single shard, thus saving on network round trips associated with the lifecycle of a distributed transaction. Another instance is that since a distributed transaction incurs overhead even during creation, the database keeps a pool of new distributed transactions around on each node to avoid that latency. Further, there are multiple micro optimizations baked in to solve the problems that distributed SQL databases face, for example, higher latencies stemming from clock skew between nodes.
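The single-shard fast path mentioned above can be illustrated with a toy planner. This is only a sketch of the general idea, not YugabyteDB's implementation: the hash-based `shard_for` function, the `NUM_SHARDS` constant, and the `plan_transaction` helper are all hypothetical.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count for this illustration


def shard_for(key: str) -> int:
    """Map a row key to a shard by hashing it (an assumed sharding scheme)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def plan_transaction(keys):
    """Choose the cheap path when every key in the transaction lives on one shard."""
    shards = sorted({shard_for(k) for k in keys})
    if len(shards) == 1:
        # No cross-shard coordination needed: one shard can apply the writes
        # through its normal replication alone.
        return ("single_shard", shards)
    # Keys span several shards: a full distributed transaction is required.
    return ("distributed", shards)
```

The point of the sketch is that the decision is made before any transaction machinery is started, so single-shard operations avoid the extra network round trips entirely.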

How did you first hear about Yugabyte and how did you initially decide to join? 

From my previous role and during my Master's at UT Austin, I was deeply into distributed systems, but not so much into databases. At that point, I was reading literature on Google Spanner and other distributed databases, and realized that transactional distributed databases were a space that I wanted to work in.

This was due to many reasons: they are a cutting-edge piece of technology which touches a lot of pieces in distributed systems theory, and this amplifies the complexity of the already complex field of databases. It is better to deal with that complexity once when making the database, rather than have it spill out into all applications, as NoSQL users have to do. If distributed SQL databases had been created earlier, tech companies wouldn’t have taken the NoSQL detour, but at that time we didn’t have the technical breakthroughs needed to make it happen. I could see that many people had started or were starting to move to distributed SQL, given that it is a win-win situation - you get both scale and consistency.

Karthik, our CTO, gave a talk at UT, which led me to start following Yugabyte and learning about the database world. I followed the company and its competitors for about five months and finally decided to reach out. I was inclined towards Yugabyte because of their strategy of reusing the PostgreSQL code base, which is in line with this philosophy and has historically proven successful.

How has your work evolved over time at Yugabyte? What does your day-to-day look like? 

Thanks to Yugabyte, I’ve grown a lot in the past two years here. I’ve worked on various features from start to finish, including roadmap-level planning and design as well as core implementation.

I started with simple features, like adding partial indexes for YCQL, our Cassandra wire-compatible query language. Then I began working on major additions to our distributed transactions layer, specifically a new isolation level and orthogonally, a new concurrency control scheme. This helped me gain an understanding of the innermost workings of the database, which makes it a true distributed SQL database with a truly differentiated architecture compared to old databases or the earlier NewSQL offerings.

PostgreSQL has three isolation levels at the core:

  • Serializable (strongest guarantees) 
  • Repeatable Read 
  • Read Committed (weakest guarantees) 

At the time, Yugabyte only had two. My task was to add the third isolation level, Read Committed, to bring us to parity with PostgreSQL.

Currently, I am working on improving our cost-based optimizer for better query planning. As for my day-to-day work, I must mention that a major chunk of our new feature development time is spent ensuring the correctness guarantees of the database, since that doesn’t come easy.

Even though Read Committed has the weakest guarantees, it’s still important for many personas, especially application developers given this is the default in Postgres. Can you tell us a bit more about the impact of adding this for end users?

There’s a huge impact. Although it is the weakest isolation level, many applications use Read Committed and give up stronger isolation guarantees to ensure better performance. This is ideal when the use case is straightforward enough to add some logic in the application and ensure required isolation guarantees on a case-by-case basis.

We have seen some very large workloads run Read Committed isolation on YugabyteDB. One example is our partnership with the financial banking application Temenos. Also, given that Read Committed is the default isolation in PostgreSQL, it is simple to lift and shift many existing PostgreSQL applications to YugabyteDB without needing to re-write the app.

Also, this isolation level brings YugabyteDB to parity with PostgreSQL on isolation levels, which are a major pillar of OLTP databases.

Can you tell our readers a bit more about the three isolation levels to contextualize this discussion? 

Assume you have a simple ledger of accounts, and a thousand transactions to process at some point. The simplest way would be to execute the transactions in their totality in order (serially), which ensures no conflicts, i.e., no unintuitive behavior. For example, in banking, this ensures that you don’t allow double spending. But this sequential execution results in a useless system: transactions between strangers will block your transactions, and every transaction blocks all the others that come after it.

This is where transactional databases come in with their isolation levels and why they differ from the simple ledger. They allow transactions (sets of operations) to execute atomically and simultaneously, while keeping varying levels of checks on the unintuitive behaviors based on the chosen isolation level. 

Serializable is the gold standard, the strongest, but also the easiest to understand - it doesn’t allow any unintuitive behavior. It guarantees that transactions behave as if they occurred one after the other serially, even though they are executed simultaneously. If that is not possible in some cases, the database returns an error to the client.

Repeatable Read and Read Committed aren't explainable in simple terms like Serializable, but the thing to note is that they offer more concurrency at the cost of allowing specific unintuitive behaviors.

Here’s a good post, which covers this in more depth and helps you to understand the trade-offs. 
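One difference between the two weaker levels can be made concrete with a toy multi-version store. In PostgreSQL, Read Committed takes a fresh snapshot for every statement, while Repeatable Read keeps one snapshot for the whole transaction. The sketch below models only that snapshot behavior; the `VersionedStore` and `Reader` classes are hypothetical, and as a simplification the transaction snapshot is taken when the reader is created.

```python
class VersionedStore:
    """A toy multi-version store: every commit gets an increasing timestamp."""

    def __init__(self):
        self.commit_ts = 0
        self.versions = {}  # key -> list of (commit_ts, value), oldest first

    def commit(self, writes):
        """Atomically publish a set of writes under one new timestamp."""
        self.commit_ts += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.commit_ts, value))

    def read_at(self, key, ts):
        """Return the newest value committed at or before timestamp ts."""
        visible = [v for (t, v) in self.versions.get(key, []) if t <= ts]
        return visible[-1] if visible else None


class Reader:
    """A read-only transaction running at a given isolation level."""

    def __init__(self, store, isolation):
        self.store = store
        self.isolation = isolation
        self.snapshot = store.commit_ts  # snapshot taken when the transaction starts

    def read(self, key):
        if self.isolation == "read committed":
            # Each statement sees everything committed so far.
            self.snapshot = self.store.commit_ts
        # Under "repeatable read" the original snapshot is reused.
        return self.store.read_at(key, self.snapshot)
```

If another transaction commits a new balance between two reads, the Read Committed reader sees the new value on its second read, while the Repeatable Read reader keeps seeing the old one.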

What are some surprising or unexpected things you’ve learned while at Yugabyte so far?

Databases are hard to build, and a distributed setting amplifies that complexity. Even with these challenges, I’m constantly impressed by how quickly and efficiently things are built at Yugabyte. This velocity comes from the brilliant minds that work here, not from compromising on quality or long-term vision.

It is surprising and inspiring to see people single-handedly own and drive whole features, which would normally be a whole team effort. And although a large chunk of work might be driven by a single individual, the benefit of Yugabyte’s collaborative approach is that there is still a lot of deliberation within the wider team to ensure the right design choices are made.

Another thing which is not unexpected, but worth mentioning, is the level of transparency and technical insight in the leadership team, along with their trust-building attitude towards employees, which in turn gives each individual an opportunity to grow.

What’s something on the Yugabyte product roadmap that you’re excited to work on?

Now that we have parity with PostgreSQL in terms of isolation levels and concurrency control, some of my colleagues are working on providing top-notch observability to complete the picture.

This includes various notable items, such as a view that provides information about the locks the database is holding and which transactions are blocking which others at any instant. Further in the future, it may also include reporting historical metrics on conflicting transactions to find hotspots in the workload. This will serve as a feedback loop for writing better applications.

An area I am focused on is better cost-based query planning. The PostgreSQL query planner is sophisticated and requires table-level statistics, such as histograms and column cardinalities, to make informed choices between various query plans. One item I will work on is finding ways to efficiently fetch a random sample of rows from a table in a distributed setting. Another item is to automatically perform such statistics collection when a table’s data changes by a significant amount.
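As an illustration of the sampling problem, one well-known approach is to tag every row with a random number, have each shard keep only its k smallest tags, and let a coordinator keep the k smallest tags overall. The result is a uniform random sample of the whole table, and each shard sends at most k rows over the network. This is a generic sketch, not a description of what YugabyteDB does; `shard_sample` and `merge_samples` are hypothetical names.

```python
import heapq
import random


def shard_sample(rows, k, rng):
    """Runs on each shard: tag every row randomly and keep the k smallest tags."""
    tagged = ((rng.random(), row) for row in rows)
    return heapq.nsmallest(k, tagged, key=lambda pair: pair[0])


def merge_samples(per_shard, k):
    """Runs on the coordinator: keep the k smallest tags across all shards."""
    all_pairs = (pair for sample in per_shard for pair in sample)
    merged = heapq.nsmallest(k, all_pairs, key=lambda pair: pair[0])
    return [row for _, row in merged]
```

Because the tags are independent of shard boundaries, a large shard naturally contributes more rows to the final sample than a small one, without the coordinator needing to know the shard sizes in advance.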

As Yugabyte grows, how do you and the team think about engineering culture? 

We have a culture of humility, even amongst a team full of brilliant people, and I look for this humility and self-awareness whenever I interview. It’s all about getting things done in the right way, and getting them done together.
