The Year in Data and Analytics – 2020
Deep breath. In and out. I’ve been doing a lot of that in 2020, and I know I’m not alone. I hesitated to write a look back on the year, but despite living through a decade’s worth of news in a single year, there were a number of noteworthy data and analytics stories that brought some light to my little world. In fact, 2020 felt to me like the year when a number of trends finally clicked, and the “modern data stack” came into better focus.
With that, here’s my list of the most noteworthy events and insights from across the data and analytics world in 2020.
Snowflake IPO
In all my years working in tech, I can’t recall an Enterprise product that engineers felt such love for. And yes, that’s what Snowflake is. It’s both big Enterprise and craved by engineers. It was also the biggest IPO of the year, and despite the rollercoaster of a stock price since, I’m a big believer in Snowflake now and going forward.
But do I consider Snowflake to be “cool”? Not really, to be honest. Buying it is a lot like buying any other Enterprise software, sales team included. Not that that’s a bad thing, but it’s a different feel from being on the buying end of other sought-after data products. Same goes for the documentation, user interfaces, and marketing. They all “work”, but it’s Enterprise all the way through. And you know what? That’s part of what makes it work. Engineers love it, and execs feel comfortable signing contracts with a company filled with industry veterans and a professional image.
Snowflake didn’t win over engineers by being trendy or nailing the brand; they won them over on the technology. It’s hard to overstate the performance and reliability of Snowflake. For those unfamiliar, the thing that made Snowflake stand out from day 1 is their (genius) decision to separate storage from compute. It made technical as well as economic sense: it’s easy to scale up and down as needed. Need to run a massive query? Spin up an XL warehouse, crunch it, and then shut the warehouse down. Your data is still ready to access, but via a smaller (and thus cheaper) warehouse. Power when you need it, lower bills when you don’t.
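To make that concrete, here’s roughly what the spin-up-and-suspend workflow looks like in Snowflake SQL. This is a minimal sketch: the warehouse and table names are made up for illustration, and AUTO_SUSPEND is measured in seconds.

    -- Create a dedicated XL warehouse for the heavy lift
    -- (hypothetical names; it stays suspended until first use)
    CREATE WAREHOUSE IF NOT EXISTS reporting_xl
      WAREHOUSE_SIZE = 'XLARGE'
      AUTO_SUSPEND = 60
      INITIALLY_SUSPENDED = TRUE;

    USE WAREHOUSE reporting_xl;

    -- The massive query runs on XL compute...
    SELECT customer_id, SUM(amount) AS lifetime_value
    FROM orders
    GROUP BY customer_id;

    -- ...then shut the warehouse down; the data itself is untouched
    ALTER WAREHOUSE reporting_xl SUSPEND;

Because storage lives apart from compute, the same tables are immediately queryable from any other (smaller, cheaper) warehouse.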
I hope that the IPO marks Snowflake’s next chapter of success and not a celebration of cashing in (though many people sure will when the lockup period is over!). Nothing lasts forever, but I think data teams everywhere will be collectively paying Snowflake invoices for a long time to come.
The dbt Rocketship
Perhaps even more exciting, and definitely more surprising (to me at least), was seeing Fishtown Analytics raise both a Series A and a Series B within a seven-month period. Not that they don’t deserve it. dbt is a runaway success and quickly becoming the gold standard as the “T” in ELT. In other words, in the ELT paradigm, once data is ingested (Extracted-Loaded) into a data warehouse, it’s ready to be Transformed (or modeled). dbt takes what was traditionally a pile of independent SQL scripts for creating data models and adds structure, code reuse, dependency management, testing, and more. Put another way, it’s not only tooling for analytics engineers but also the foundation of a solid development process.
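If you haven’t seen dbt before, a minimal sketch helps: a model is just a SELECT statement saved to a file, and the ref() function is what turns a pile of scripts into a managed dependency graph. The model and column names below are hypothetical.

    -- models/customer_orders.sql (a hypothetical dbt model)
    -- ref() declares a dependency on the stg_orders model, so dbt
    -- knows to build stg_orders first and resolves the right
    -- schema/table name for whatever environment you're targeting
    SELECT
        customer_id,
        COUNT(*)    AS order_count,
        SUM(amount) AS total_spend
    FROM {{ ref('stg_orders') }}
    GROUP BY customer_id

Tests, documentation, and deployment all hang off that same dependency graph, which is where the “solid development process” part comes in.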
In addition to building such a popular open-source platform (with a commercial option in dbt Cloud) that data analysts (make that analytics engineers) go to battle to get their organizations to adopt, Fishtown coined the analytics engineer title. Their influence on the analytics community is clear as day in the highly active Slack community and in the runaway success of their first conference (cleverly named Coalesce), despite having to go remote in a Covid-19 world.
I’ve used dbt hands-on in past consulting work, and we use it at HubSpot, where I currently work. Momentum has been building for the last few years, and it’s exciting to see the Fishtown team get the funding to grow and take dbt to a wider audience.
Airflow 2.0
For the last 5+ years, Airflow has been perhaps the most popular Workflow Management Platform out there. If you’ve ever heard of a DAG (directed acyclic graph), there’s a high probability that you first heard the term in the context of Airflow.
In the final weeks of the year, the shiny new Airflow 2.0 release was made official. There’s a lot to like in the new release, but I’m most excited about the fact that the core concepts remain, and breaking changes are minimal to non-existent for most 1.x users. Airflow’s taken some jabs recently, but I think unfairly so. Sure, DAGs can get unruly, but I find that speaks to the way it’s used and not the tool itself. In fact, I’ve found success pairing Airflow with dbt as a way to reduce DAG complexity by letting Airflow do what it does best: schedule and orchestrate tasks at any scale.
Trends Carrying into 2021
There are three things I’m keeping an eye on as we kick off 2021:
The global acceptance of ELT over ETL: I’ve written about this over the last few years, but the benefits of working in an ELT paradigm rather than traditional ETL cannot be overstated. MPP, columnar databases like Redshift, Snowflake, and BigQuery made it possible. dbt is a great example of the power of breaking out the Transform step and empowering analytics engineers (see the sketch after this list). Data teams that haven’t moved to ELT yet are going to feel the pressure build.
An explosion of “EL” frameworks: Just like dbt has planted a flag on the Transform step, I expect to see the emergence of a new breed of tools that make data ingestion (the Extract and Load steps in ELT) even easier for data engineers. Though commercial products like Fivetran and Stitch have seen success over the last few years, I’ve noticed new activity in the space from players including Airbyte. There are a number of smaller open source projects popping up as well, so expect one or two to stand out.
Real Competition for Snowflake: You don’t have an IPO like Snowflake’s without generating more competition. Snowflake has pretty much stomped on Redshift over the past few years, and BigQuery seems to be stuck in the background at the moment. I expect both Amazon and Google to double down on their respective warehouse technologies in 2021 and try to take some ground back. Amazon gave it a shot with Redshift RA3 nodes in 2019, but I wouldn’t count out a more substantial architecture shift in the near future.
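To put the first trend above in concrete terms, here’s the ELT pattern in miniature, using Snowflake-flavored SQL: land the raw data untouched, then model it inside the warehouse after the fact. This is a rough sketch with hypothetical table and field names.

    -- Extract-Load: land raw records as-is, e.g. one JSON payload per row
    CREATE TABLE raw_events (
        payload   VARIANT,
        loaded_at TIMESTAMP
    );

    -- Transform: shape the data in the warehouse, with plain SQL
    CREATE VIEW events AS
    SELECT
        payload:event_id::STRING AS event_id,
        payload:user_id::STRING  AS user_id,
        payload:ts::TIMESTAMP    AS event_ts
    FROM raw_events;

Because the Transform step is just SQL running where the data already lives, it can be revised and re-run without re-ingesting anything, which is exactly the seam that tools like dbt build on.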