8VC Emerging Builders Spotlight: Harshal Sheth (Acryl Data)
At 8VC, we’re proud to partner with the most exceptional founders to build transformational technologies that create long-term economic and societal value. We seek out knowledge, map out industries, and believe in our ability to help fix the world around us. "Tikkun Olam," the duty to work to ameliorate an imperfect world, is the core founding theme of our firm and inspires our work. We are deeply proud of 8VC. This is not just our work; it is our art.
In supporting the generational companies of the 8VC portfolio, we are fortunate to work with the brightest, most dedicated people in the world. As we look to identify the next generation of best-in-class entrepreneurs, we naturally turned to our network of peers and friends.
Top engineers may be early in their careers, but they have often been obsessed with technology and innovation from a young age. They're also well attuned to who the other top technical minds in their cohort are! Many of the most important companies we’ve invested in or started won by attracting younger superstar engineers (Palantir, Oculus, Addepar, Qualia, Blend, Affinity, etc.).
We’re excited to feature some of the most promising engineering and product talent we have the pleasure of collaborating with not only at 8VC, but within our broader network.
To kick things off, we’re humbled to highlight Harshal Sheth, founding engineer at Acryl Data (an 8VC portfolio company). Harshal graduated from Yale CS in three years. He previously spent some time investing at 8VC and doing software engineering at Citadel, Google, and Instabase. In his spare time, he enjoys doing puzzles, playing spikeball, and building Legos.
In your words, what is Acryl and what are you excited about?
Acryl is pushing forward the open source LinkedIn DataHub project, which is a metadata management platform for the modern data stack. We’re leveraging metadata to manage data and enable economically important workflows like governance, discovery, data quality & observability, impact analysis, testability, etc. All of these things have metadata as a core, underlying layer.
What do these workflows mean in practice? Data discovery means giving users the ability to search over a diverse catalog of data assets (datasets - e.g. Snowflake / Redshift, dashboards - e.g. Looker, charts - e.g. Superset, data jobs - e.g. Airflow) and discover the most valuable resources to power data-driven applications and business insights. In surfacing these data assets, DataHub provides contextual information about how each asset was derived (pipelines & lineage), its schema and schema history, ownership & ACLs, and other important documentation. Alongside this information, Acryl extends DataHub to support governance use cases such as categorizing data based on risk, monitoring data access based on governance policies, and automating arbitrary metadata-driven governance workflows. Today, DataHub is the #1 open source metadata catalog and Acryl is the most complete solution for data discovery and governance.
I’m bullish – the vision for Acryl to be the central infrastructure layer for the data stack is incredibly ambitious. It’s conceivable, even likely, that if we execute well, metadata management will become a critical pillar of the data practice over the next 5-10 years.
This is a category-creating opportunity. We have the ability to define and set the standard for what you can do with high-quality, continuously changing metadata.
What part of Acryl do you work on today?
I primarily work on ingestion and drive our efforts there.
I was drawn to this focus area because in order to build a metadata platform, we need to extract metadata from many different data tools. What does this look like in practice? This effectively means we’re building API integrations with a host of data tools. I specifically work on the core framework to enable these integrations as well as a couple of the specific, high-fidelity connectors.
Similar to any startup environment, I wear many different hats beyond my primary role – I help manage our public documentation, think about pricing strategy, and a handful of other responsibilities. While I certainly enjoy the deep focus associated with coding, exposure to everything from community building to operations has been immensely beneficial as I expand my personal toolkit and understand what it truly entails to build a company.
What is the hardest technical challenge associated with the ingestion problem?
Certainly the richness of the metadata proves to be challenging.
It’s fairly straightforward to build low quality connectors – it’s easy to go to Snowflake and if you have thousands of tables, surface the associated names.
This isn’t necessarily useful though, as we need more depth to this information – what does the data look like, how do the assets relate to each other, how frequently are they updated, etc. You need to construct a complex graph under the hood (DataHub has a Generalized Metadata Architecture which encompasses a graph database - Neo4j - as one of multiple data stores).
There is a whole host of other enrichment and metadata associated with these assets. It’s a significant tech challenge to be able to offer this type of coverage in a way that makes sense to end users.
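To make the difference between a shallow and a rich connector concrete, here is a small in-memory sketch. It is an assumption-laden toy, not DataHub's Generalized Metadata Architecture and not a Neo4j model: the `Asset` fields, the `MetadataGraph` class, and the example dataset names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Asset:
    """A data asset with a few enrichment fields (hypothetical shape)."""
    name: str
    columns: List[str] = field(default_factory=list)
    owner: str = "unknown"
    refresh: str = "unknown"


class MetadataGraph:
    """Assets as nodes, lineage as directed edges from upstream to downstream."""

    def __init__(self) -> None:
        self.assets: Dict[str, Asset] = {}
        self.downstream: Dict[str, Set[str]] = {}

    def add_asset(self, asset: Asset) -> None:
        self.assets[asset.name] = asset
        self.downstream.setdefault(asset.name, set())

    def add_lineage(self, upstream: str, downstream: str) -> None:
        # Both ends must already be known assets.
        if upstream not in self.assets or downstream not in self.assets:
            raise KeyError("register both assets before linking them")
        self.downstream[upstream].add(downstream)

    def consumers(self, name: str) -> List[str]:
        """Direct downstream assets of the given asset, sorted by name."""
        return sorted(self.downstream[name])


graph = MetadataGraph()
graph.add_asset(Asset("raw_orders", ["id", "amount"], owner="data-eng", refresh="hourly"))
graph.add_asset(Asset("orders_summary", ["day", "total"], owner="analytics", refresh="daily"))
graph.add_lineage("raw_orders", "orders_summary")

# A shallow connector stops here: just a list of names.
shallow = sorted(graph.assets)
print(shallow)

# A richer one can also answer relationship questions.
print(graph.consumers("raw_orders"))
```

Listing names only needs the keys of `assets`; questions like "who reads from this table?" need the edges, which is where the graph structure starts to pay off.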
The other axis here is the diverse set of sources that you might ingest metadata from. DataHub has over 40 different integration sources, spanning data warehouses (Snowflake / BigQuery / Redshift), streaming (Kafka), BI (Looker / Tableau), ML Ops (Sagemaker / Feast), pipelines (Airflow), dimensional models (dbt), online databases (MongoDB, Postgres), and even third party SaaS applications (Okta).
How do you bucketize your time?
I spend a fair amount of time coding, as is natural for a SWE. Since I joined Acryl as an intern about a year and a half ago, I have a good amount of context around processes, inputs that informed key decisions and the driving rationale behind ingestion. Recently, I’ve taken on responsibilities associated with conducting code reviews and managing the design of ingestion.
I’ve naturally gravitated towards focusing on hiring – it’s exciting to recruit my friends and leverage other talent networks I’ve cultivated over the years. The team is scaling quickly so it’s important to prioritize bringing on board best-in-class talent.
I also really enjoy customer support and community building initiatives, especially since DataHub is an open source platform. I’ve gleaned a ton of really compelling insights by spending time with users & contributors and understanding what’s top of mind.
How has open source impacted Acryl? How does the scope of work change, given that DataHub is open source? What is the most exciting aspect of DataHub?
The energy and excitement around DataHub is palpable – the community is growing by nearly 8% MoM!
In my spare time, I like to hop on calls and talk to folks who are using the product and learn about use-cases and their data challenges. This breadth of access is something you only get in open source.
It’s really cool to also see individuals actually write blog posts on DataHub highlighting how they’re using it and why they’re excited – oftentimes we don’t even know that a company is using DataHub until they blog about it! The interest is purely organic and this certainly makes me even more excited about Acryl and the ability to leverage and manage this community.
What products are you excited about post ingestion?
We’ve reached a stage where just pulling data into DataHub isn’t sufficient – I’m starting to think about outbound integrations. For example, how can we embed contextual information in Looker highlighting information about the upstream data pipeline (e.g. “This Look was delayed because of some Airflow job not running, and this is the associated impact & contextual details”). This is the type of metadata information we can provide as we kick off outbound integrations. We’re effectively meeting people closer to where they live.
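One way to picture the Looker example is as a walk up the lineage graph from a dashboard to any upstream job that did not succeed. The sketch below is only an illustration under stated assumptions: the asset identifiers, the `UPSTREAMS` and `JOB_STATUS` tables, and both helper functions are hypothetical, and it does not call any Looker, Airflow, or DataHub API.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAMS: Dict[str, List[str]] = {
    "looker:revenue_look": ["snowflake:revenue_daily"],
    "snowflake:revenue_daily": ["airflow:load_revenue"],
    "airflow:load_revenue": [],
}

# Hypothetical latest run status per job; anything not listed counts as fine.
JOB_STATUS: Dict[str, str] = {"airflow:load_revenue": "failed"}


def failed_upstreams(asset: str) -> List[str]:
    """Breadth-first walk up the lineage, collecting jobs that did not succeed."""
    seen = {asset}
    queue = deque([asset])
    failed: List[str] = []
    while queue:
        current = queue.popleft()
        for parent in UPSTREAMS.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
            if JOB_STATUS.get(parent, "success") != "success":
                failed.append(parent)
    return failed


def annotation_for(asset: str) -> str:
    """Build the short context message an outbound integration might display."""
    failed = failed_upstreams(asset)
    if not failed:
        return f"{asset}: all upstream jobs succeeded"
    return f"{asset}: may be stale, upstream job(s) not successful: {', '.join(failed)}"


print(annotation_for("looker:revenue_look"))
```

In a real outbound integration the message would be pushed into the BI tool itself; here it is simply printed.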
What has been the most surprising thing you’ve learned at Acryl to date?
I’ve been incredibly impressed with the velocity of execution – the team has come so far in the past year and a half. The team is quite lean, but the community is robust. This enables us to move rapidly and not drop the ball as there’s always someone at the ready to pick up responsibilities. The entire frontend and the ingestion framework didn’t even exist a year and a half ago!
The DataHub community is also unique and highly engaged – they actively ask thoughtful questions, surface interesting use cases, propose new features, come to our Town Halls, and support each other on Slack. We’ve got a fantastic community culture - people build with each other without us in the loop.
We’d love to get your thoughts on the Modern Data Stack. How do you make sense of the landscape and where Acryl fits in?
Most tooling is continuing the trend towards open source. The storage layer remains proprietary, but a lot of other components are shifting to open source (e.g. BI tools, dbt, orchestration, real-time analytics, etc.). The entire stack is becoming, by default, open source.
There’s another trend worth addressing, which is that data is shifting left. It’s effectively the movement of ownership from the old model of a central data team towards a self service model.
As every function within a company becomes driven by data to a greater degree, democratization becomes important. The central data team moves away from handling tickets towards maintaining core infrastructure as well as datasets.
Metadata is an essential tier for enabling self service because it distributes context and information across your entire data ecosystem to every function that’s using that data.
To wrap up, any hot takes?
Every data lineage, data quality, or data observability company will eventually either become a data catalog or get acquired by one.