The usage-based data stack is finally here (and pretty cheap)

June 07, 2022

Lately I’ve been talking to a lot of early-stage startup founders, and have been interested to find that many of them regard having a modern, scalable data analytics setup as a luxury.

Of course, founders have a lot on their plates, and building out data infrastructure is often rightfully not top of mind. Moreover, up until recently, they would have been right.

But technology moves quickly, and yesterday’s luxuries become today’s commodities. Having corporate email was a luxury before Google Apps (later renamed G Suite) launched in 2006. Accepting online payments was similarly luxurious before Stripe launched in 2011. Now, both are considered foundational digital infrastructure, with accounts often set up shortly after purchasing a domain name for a new business.

Assigning an exact year to the analogous milestone for data analytics would be harder, since a “modern data stack” has multiple components, typically including a cloud data warehouse, a BI tool, a tool for extracting and loading data (or “ELT”, as it’s called nowadays), and possibly another one for orchestrating data transformation jobs. Each layer is maturing at its own speed.

But I think there’s a strong case that we’ll look back at 2022 as the year it happened. The reason is that it’s finally possible to set up a world-class data stack at low cost, taking advantage of usage-based pricing at each layer.

Over the past decade, building out a data stack has been an exercise riddled with cost discontinuities. The mid-2010s generation of market-leading tools – e.g. Looker for BI, Fivetran for ELT, and Redshift for warehousing – all had five-figure-per-year entry points. Since no seed-stage startup has a $50k/year budget for data tools, most companies would make do with a patchwork of band-aids and lower-end tools, and eventually upgrade once they were a bit more “grown up”. These upgrades would come at a high cost in engineering resources and employee retraining.

But in 2022 – and more specifically, as of the general release of Airbyte’s cloud product in April of this year – each layer of the data stack has a best-in-class option available under a usage-based model, with the potential to get started for a few hundred bucks a month and scale up to enterprise-level usage.

Let’s look at the evolution one layer at a time:

  1. Warehouses. In 2016, Snowflake became the first cloud-agnostic warehouse to offer a self-service pricing option. This addressed a cost discontinuity for AWS-based startups, who otherwise faced a choice between over-provisioning a Redshift cluster at high cost, or postponing investment in a warehouse altogether. (This was less of a problem in the GCP ecosystem, where BigQuery was already a solid option.)
  2. Data ingestion. April 2022 saw the general release of Airbyte Cloud, a fully managed version of the Airbyte open source project. Airbyte Cloud offers the ability to ingest data using pre-built connectors, or author your own, all under a self-service, usage-based pricing model. This presents an appealing alternative to the likes of Fivetran, which gates similar functionality behind low/mid five-figure annual contracts.
  3. Transformation and orchestration. Airflow and dbt, arguably the two most popular tools for defining and orchestrating data transformations, both became available in self-service SaaS form with usage-based pricing in 2019. With the release of dbt Cloud and Astronomer (essentially hosted Airflow), companies could start using these tools without diverting engineering resources for self-hosting.
  4. Business intelligence. Although there is no clear market leader among BI products that also offer self-service signup and seat-based pricing, there are plenty of good options. To start, Snowflake itself comes with Snowsight, a rudimentary but perfectly usable UI for data exploration. The last few years have also seen the release of cost-effective, fully hosted versions of open source data analytics tools such as Superset and Metabase. Both are solid options, and will probably come in at under a tenth of the cost of a contract with Looker, Tableau, or Mode.

Of course, this stuff is still not free. Even a minimal, conservatively used data stack could still run you a few hundred to maybe a thousand bucks a month. But in 2022, the “usage-based data stack” I describe here offers a way to leverage great tools while saving hassle and cost in the long run. I recommend it for almost any founder who wants to set their company up for success using data.
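
For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The $50k/year figure is the seed-stage budget number mentioned earlier; the $300 and $1,000 monthly figures are only my stand-ins for "a few hundred to maybe a thousand bucks a month", not vendor quotes, and the function names are made up for illustration.

```python
# Back-of-the-envelope comparison. All numbers are illustrative assumptions.
LEGACY_ENTRY_PER_YEAR = 50_000  # the "$50k/year" figure used earlier in the post


def annual_cost(monthly_spend: float) -> float:
    """Annualize a monthly usage-based bill."""
    return monthly_spend * 12


def compare(monthly_low: float, monthly_high: float,
            legacy_per_year: float = LEGACY_ENTRY_PER_YEAR) -> dict:
    """Return the annual usage-based range and its share of a legacy contract."""
    low = annual_cost(monthly_low)
    high = annual_cost(monthly_high)
    return {
        "usage_low": low,
        "usage_high": high,
        "low_share_of_legacy": low / legacy_per_year,
        "high_share_of_legacy": high / legacy_per_year,
    }


if __name__ == "__main__":
    # Assumed range: $300 to $1,000 per month (hypothetical).
    result = compare(300, 1_000)
    for key, value in result.items():
        print(f"{key}: {value}")
```

Under those assumptions, even the high end of the usage-based range comes to roughly a quarter of a single $50k/year contract, and the bill grows with usage rather than jumping at a contract boundary.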


© Nathan Gould, 2021