
Why data teams should not build their own dbt incident alerting

by Nick Freund - March 20, 2024

A few weeks ago, we launched sophisticated (and free) Slack alerting for dbt, and the reaction from the community far exceeded our expectations. 

We’ve onboarded many new companies as clients, and we’ve had discussions with startups and enterprises alike about their alerting challenges with dbt — and with their broader data stack. We heard a lot of comments like, “Before Workstream, our dbt Slack alerts were not detailed or actionable, which made it hard for our team to respond to nightly dbt errors from our phones.”

That requisite detail includes, for example, specifics about the broken models and tests, information on impacted downstream lineage, and failure messages surfaced automatically from your logs.

If this resonates with you, I’d recommend that you spend the next 10 minutes setting up our free Slack alerting for dbt. But if you are instead somewhat skeptical – maybe of software vendors in general, or because you have built (or are considering building) your own alerting for dbt – the rest of this article is for you.

Building incident alerting is expensive

dbt users fit into two buckets: those who manage their own dbt Core deployment, and those who leverage the managed dbt Cloud offering from dbt Labs. While that choice greatly shapes your environment, both paths leave you with pretty poor options for incident alerting.

At Workstream.io, we have naturally run into numerous teams that decided to build their own alerting systems. As software builders ourselves, we love speaking and comparing notes with these folks, because we learn so much.

There are some pros to the build-it approach – the most notable being that, before Workstream.io, self-built alerting was really the only option for teams that wanted visibility into dbt incidents without suffering through extreme alert fatigue. And with a custom build, teams generally feel they will get a solution tailored to their organization’s needs.

The obvious and unfortunate tradeoff is that you need to dedicate engineering hours and dollars to building your own alerting. The average team dedicates about a month of a team member’s time to stand up its own alerting, and then expects that person to spend 2-3 weeks each year thereafter on ongoing maintenance and enhancements.

If we put aside the considerable hassle of doing this and assume a $100K salary and a 30% benefits burden, custom alerting actually costs self-build teams:

  • $10,833 in the first year for the build
  • ~$8,000 per year thereafter for support, maintenance and enhancements
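
If you want to see where those figures come from, here is the quick back-of-the-envelope version, using the same assumptions (one month to build, 2-3 weeks a year of upkeep, and a $100K salary with a 30% benefits burden):

```python
# Back-of-the-envelope math behind the figures above (illustrative only).
SALARY = 100_000
BENEFITS_BURDEN = 0.30
fully_loaded = SALARY * (1 + BENEFITS_BURDEN)   # $130,000 per year

build_cost = fully_loaded / 12                  # one month of build time
maintenance_cost = fully_loaded / 52 * 3        # ~3 weeks per year of upkeep

print(f"First-year build cost: ${build_cost:,.0f}")       # ≈ $10,833
print(f"Ongoing yearly upkeep: ${maintenance_cost:,.0f}")  # ≈ $7,500, in the same ballpark as the ~$8,000 above
```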

In our commitment to add continuous value to data workflows, we decided to build dbt incident alerting and provide it to our customers for free. And it will always be free. All you have to do to get started is create a free account, and we’ll help you get set up the same day.

All other options lack the basics that teams need

We will be the first to tell you that our customers and partners are the data experts. Our expertise, by contrast, is in processes and workflows for data teams. We are obsessed with efficiency, and we have designed workflow solutions that are integrated with data tooling and built for the unique needs of data teams.

Nearly without exception, other alerting options lack the basics that dbt users need. These include:

  1. Intelligently bucketing dbt model failures into incidents that persist across runs.

    Homegrown solutions typically act like a firehose — if dbt runs and a test fails, the data team is sent an alert. But being alerted about the same issue every 3 hours quickly becomes noisy, and teams stop paying attention. There is a good chance your eyes will glaze over by the time you see that 10th Slack notification in a row from your #data-alerts channel.

    This paradigm costs more than annoyance, however – ignoring alerts is dangerous, because thinking “I already know” means you will likely miss the next alert that comes up. Instead, teams need to be notified of incidents when they first occur, and then kept updated as they evolve. In addition, teams can use our incident 360 view to see every incident currently open in their environment and a canonical timeline of each incident as it unfolds.
  2. Alerts should be detailed and actionable – so you can easily triage and respond from your phone.
    Homegrown solutions also do not provide sufficient context in their alerts, making it nearly impossible for teams to quickly understand and respond to them from their phones.

    Teams need to understand what broke, including the details of the failing model, test, and failure message. And teams need to understand the blast radius of an incident in order to gauge its relative priority.

    Finally, your alerts should be actionable. You should be able to easily loop in someone else using @mentions, delegate ownership, or mark an incident as being monitored. And you need to be able to do this the minute you see the alert in Slack, right from your phone. (To see what even a bare-bones homegrown version of these two basics involves, see the sketch after this list.)
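
To make that concrete, here is a deliberately minimal sketch of what a homegrown version of those two basics can look like: read dbt’s run_results.json, bucket failures into incidents keyed by node so that repeat failures of the same model or test stay quiet, and post the failure details to a Slack webhook. The paths, the incident-state file, and the webhook URL are placeholders, and this is an illustration rather than a description of how our platform works:

```python
import json
import urllib.request
from pathlib import Path

# All paths and URLs below are hypothetical placeholders for illustration.
RUN_RESULTS = Path("target/run_results.json")   # written by `dbt build` / `dbt test`
INCIDENT_STORE = Path("open_incidents.json")    # crude incident state that persists across runs
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook URL


def load_failures() -> dict:
    """Return failing or erroring nodes from the latest dbt run, keyed by unique_id."""
    results = json.loads(RUN_RESULTS.read_text())["results"]
    return {
        r["unique_id"]: r.get("message") or "(no failure message captured)"
        for r in results
        if r["status"] in ("error", "fail")
    }


def load_open_incidents() -> dict:
    return json.loads(INCIDENT_STORE.read_text()) if INCIDENT_STORE.exists() else {}


def post_to_slack(text: str) -> None:
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def main() -> None:
    failures = load_failures()
    open_incidents = load_open_incidents()

    # Bucket by node: only brand-new failures trigger an alert, so the same broken
    # model does not re-alert every few hours.
    new_incidents = {k: v for k, v in failures.items() if k not in open_incidents}
    resolved = [k for k in open_incidents if k not in failures]

    for unique_id, message in new_incidents.items():
        post_to_slack(f":rotating_light: New dbt incident: `{unique_id}`\n> {message}")
    for unique_id in resolved:
        post_to_slack(f":white_check_mark: Resolved: `{unique_id}` is passing again.")

    # Persist the open incidents so the next run knows what it has already alerted on.
    INCIDENT_STORE.write_text(json.dumps(failures, indent=2))


if __name__ == "__main__":
    main()
```

Even this toy version omits most of what the alerts above need – downstream lineage and exposures (which means parsing manifest.json as well), @mentions and ownership, status updates as an incident evolves, and any notion of resolution beyond “the test passes again” – and filling those gaps is exactly where the 2-3 weeks a year of maintenance goes.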

Hand-rolled alerting does not lay the groundwork for the future

Every data team using dbt needs these alerting basics – unless, for some reason, you do not care about dbt testing or have not yet had the bandwidth to invest in it.

Having testing without proper alerting is like the proverbial tree falling in the forest; proper alerting is table stakes.

But there is a subset of data teams that are further along in their journey: larger teams, or those with more mature processes and testing practices. These teams need not just better alerting, but alerting that is tightly integrated with our capabilities for advanced incident management and sophisticated incident insights:

  1. Advanced incident management capabilities include:
    1. Setting or scheduling* an on-call person, and automatically assigning them incidents
    2. Documenting incident retrospectives
    3. Streamlining stakeholder communications via alert routing across Slack channels*, and data status pages for core / certified dashboards
  2. Sophisticated incident insights include providing your team core KPIs such as:
    1. # of incidents over time by source, dbt project, or dbt model*
    2. Mean time to resolution* (sketched below)
    3. # of impacted dashboards over time

*Indicates feature on the near term roadmap
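
To give a flavor of the insights side, here is a small, hypothetical sketch of the mean-time-to-resolution calculation referenced above; the record shape and timestamps are invented for illustration:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; a real system would read these from its incident store.
incidents = [
    {"opened_at": "2024-03-01T02:00:00", "resolved_at": "2024-03-01T09:30:00"},
    {"opened_at": "2024-03-04T02:00:00", "resolved_at": "2024-03-05T14:00:00"},
    {"opened_at": "2024-03-10T02:00:00", "resolved_at": None},  # still open, excluded below
]


def hours_to_resolve(incident: dict) -> float:
    opened = datetime.fromisoformat(incident["opened_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    return (resolved - opened).total_seconds() / 3600


closed = [i for i in incidents if i["resolved_at"] is not None]
mttr_hours = mean(hours_to_resolve(i) for i in closed)
print(f"Mean time to resolution: {mttr_hours:.1f} hours")  # 21.8 hours for this sample data
```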

Leveraging a purpose-built system means you not only get free alerting, but also a single, integrated platform for more mature incident management should those needs ever arise.

We are shipping 10X faster than you

As you can tell, we have a long and exciting roadmap for our platform – and we are shipping incredibly fast. Today we are excited to announce our most recent release, which includes all of the functionality below, developed over the past three weeks:

Our Slack Alerting for dbt now includes:

  • More detailed naming for dbt incidents, so you can more quickly understand an issue
  • Downstream incident lineage, including all impacted models, metrics and dashboards (exposures)
  • Quick actions in Slack to assign an incident owner and update incident status

Our Advanced Incident Management now offers:

  • The ability to manually create incidents in our system, in addition to our existing support for incidents from dbt Core, dbt Cloud, or Monte Carlo
  • Centralized search for incidents via our UI

Our Sophisticated Incident Insights now surface trends for:

  • Certified dashboards that are impacted by upstream data incidents

All of the above functionality is now available in our platform for everyone to use.

Try our free alerting today

Whether you are a data team just starting to feel the pain of insufficient alerting, or one that has already built its own system, please consider trying our free alerting today. You don’t have much to lose beyond the 10 minutes needed to set up an account and connect dbt and Slack – and you have quite a bit of time, and avoided incident headaches, to gain.
