No matter the size of your team or company, every business is creating data at an exponential rate, but getting access to that data and leveraging it to solve problems can be difficult. One of the first steps toward a data-driven future is centralizing your data under one roof in a data warehouse or database. Once your data is collected, it can be used for analysis and automation.

As your team begins to rely on your data pipeline and warehouse to do their jobs, it’s vital to make sure the information continues to flow in as you’d expect. If team members are counting on accurate data to make decisions or drive automation, a break in the pipeline can create confusion or lead people to make incorrect decisions.

The reality is that data pipelines fail, and they fail for a number of reasons. Load times may increase, causing a backlog and delaying data from arriving at the expected time. Changes at the source can break the jobs that extract the data. As the person responsible for this data, you know these issues are par for the course, and you want to be the first to know when something unexpected happens.

Data pipeline monitoring and alerting

You’ll want to set up a system to monitor the performance of your pipeline, so that alerts can be sent to the appropriate team member when something looks out of the ordinary.

Step 1: Choose your KPIs

Every step of your data pipeline and ETL process can be measured using key performance indicators (KPIs), such as:

For individual jobs/steps of your data pipeline:

  • Average runtime
  • Success/failure status of each run

Table-level KPIs:

  • Rows inserted/updated/deleted
  • Space utilization
  • I/O rates

Unusual behavior (anomalies) in these KPIs may or may not indicate a problem, but they generally warrant investigation. If you have a Service Level Agreement (SLA) in place with customers or internal teams relying on this data, the metrics promised within that SLA are a good place to start for identifying what to monitor/alert on.

Step 2: Determine where your KPI data is located

Next, you must determine where the data and logs required to calculate each of your KPIs are located. Typically, this information lives in log files or directly in database tables. In either case, you will likely want to centralize it into standard KPI tables that can be easily queried for dashboards and alerts.
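If your warehouse is Postgres (which the pg_relation_size call later in this post assumes), a lot of the table-level KPI data is already tracked for you in the statistics views. Here’s a rough sketch of where to look; the views and columns below are standard Postgres, but the counters are cumulative since the last stats reset, so you’d snapshot and diff them rather than read them as per-load numbers:

-- Row activity is tracked in pg_stat_user_tables, I/O counters in pg_statio_user_tables.
select
  s.relname                        as table_name,
  s.n_tup_ins                      as rows_inserted,
  s.n_tup_upd                      as rows_updated,
  s.n_tup_del                      as rows_deleted,
  pg_total_relation_size(s.relid)  as total_bytes,
  io.heap_blks_read                as blocks_read_from_disk,
  io.heap_blks_hit                 as blocks_served_from_cache
from pg_stat_user_tables s
join pg_statio_user_tables io using (relid);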

Here’s a sample data model you can use for the KPI tables:

[Image: Example ETL data model]
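Since the original diagram isn’t reproduced here, here is a minimal sketch of what those tables could look like. The Table_Performance columns match the insert statement below; the job-level table (Job_Performance) and its columns are illustrative and should be adapted to whatever your scheduler logs:

-- One row per table per load, for table-level KPIs.
create table Table_Performance (
  date        date,     -- load date, used to compare across time periods
  tableName   text,
  totalRows   bigint,
  spaceUsage  bigint     -- in bytes, e.g. from pg_relation_size()
);

-- One row per job/step per run, for job-level KPIs (runtime, success/failure).
-- Illustrative; adjust to the metadata your scheduler actually produces.
create table Job_Performance (
  date        date,
  jobName     text,
  startTime   timestamp,
  endTime     timestamp,
  status      text       -- e.g. 'success' or 'failure'
);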

You can create some very basic scripts that pull KPI data into these tables, and run those in accordance with the load schedule.

Here’s a sample SQL script that pulls the row count and size of an individual database table (“Accounts”) into the standard KPI table:

insert into Table_Performance (date, tableName, totalRows, spaceUsage)
select current_date, 'Accounts', count(*), pg_relation_size('Accounts')
from Accounts;

If you have many tables, you can quickly generate the statements for all of them with a bit of scripting, like so:

awk '{ print "insert into Table_Performance (date, tableName, totalRows, spaceUsage) select current_date, \x27" $1 "\x27, count(*), pg_relation_size(\x27" $1 "\x27) from " $1 ";" }' <table_list>

Where <table_list> is just a file containing the list of tables you want to monitor.

This script generates the SQL statements needed to monitor each of the tables listed in <table_list>. Run these statements on a set schedule (again, using cron or your favorite scheduler) to populate the analysis table (“Table_Performance”).
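For example, a minimal cron entry for a Postgres warehouse might look like the line below. The file name, paths, database name, and timing are all placeholders; the idea is simply to run the generated SQL shortly after your nightly loads finish:

# Run the KPI collection statements every day at 6:30 am, after the nightly loads.
# Assumes the generated SQL was saved to collect_table_kpis.sql and that
# connection credentials are handled via ~/.pgpass or environment variables.
30 6 * * * psql -d warehouse -f /opt/etl/monitoring/collect_table_kpis.sql >> /var/log/etl/kpi_collection.log 2>&1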

When you create the rows for the KPIs, make sure each row has a consistent timestamp or some other way to compare across time periods.

Step 3: Set up dashboards

Now you should have your KPIs loading into the analysis tables regularly, following each data load. Using this metadata, you can keep track of changes in pipeline performance. If you see any unusual changes in these metrics, you can dig in and identify whether there’s an issue worth addressing.

You can set up a chart for each metric, creating a dashboard that looks something like this:

[Image: sample dashboard; credit to Adonis Salazar]

Dashboards are a great way to keep an eye on a few things, drill down into the data, and look back at patterns over time. You can use the dashboard tools you already have, or something open source like Metabase.
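If your dashboard tool accepts SQL directly (Metabase does), each chart can be a simple query over the analysis table. For instance, a sketch of a rows-over-time chart, one series per table:

-- Daily row counts for the last 30 days, suitable for a line chart with one line per table.
select date, tableName, totalRows
from Table_Performance
where date >= current_date - 30
order by date, tableName;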

That said, you’re also going to want to set up automated monitoring so that you can be aware of situations as they arise, even if you’re not looking at your dashboard. Also, you’ll find that once you need to keep track of the status of more than a few services, automated monitoring is really the only way to reliably do that. Many dashboard tools have some basic alerting built in, but since they aren’t purpose-built for monitoring and alerting, they won’t give you the kind of coverage you’ll need in order to know what’s happening across your pipeline.

Step 4: Set up monitoring and alerting

You can use Revere to easily set up the monitoring and alerting for your data pipeline. This is useful when you can’t be looking at the dashboards all the time, or when you have so many jobs running and pipelines to manage that it just doesn’t make sense to try to keep track of them manually. Getting started is just a matter of granting Revere read-only access to the analysis tables. You can learn more about setting up an account here.

Once your data is connected, you can build the alerts in a matter of minutes. To set up an alert, you’ll want to identify the data to monitor, define the conditions that you want to be notified about, and set the timing for how frequently you want Revere to run that check for you.

The fastest way to get started in this case is to select all of the columns of your analysis table, and build advanced conditions off of that one alert. I’d suggest structuring your alerts by theme, so the notifications stay relevant to that theme. For example, you may have one set of alerts that keeps track of unexpected events, like anomalies, and links to a wiki with related background or processes. You may want another set that tracks commitments made in an SLA and links to the relevant documentation. The routing for these issues may also differ, which is another reason to structure your alerts this way.

Revere has a point-and-click builder to help create the alerts, but for this example we’re going to assume you know some SQL. To create an alert, start by selecting the data you want to monitor, then define the condition you’d like to be alerted about.

There are a lot of different kinds of issues you can keep track of, but let’s look at two of the most common approaches:

  1. Alerting on anomalies in the expected behavior of the KPIs.
  2. Alerting when SLAs are not being met.

How to monitor statistical anomalies

The quickest way to identify whether something is outside of the normal range is to look for statistical anomalies.

Here’s example SQL for identifying statistical anomalies. We start by calculating the average number of rows we expect to see, based on the last thirty days, and include that average in the data we select to monitor.

[Image: Finding statistical anomalies]

The alert condition looks for any case where the rows loaded is significantly higher than the average number of rows.

[Image: High anomaly]

In this case, the alert will fire whenever the number of rows in any table increases by more than 20% above the average.
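The screenshots above come from the Revere builder; expressed as plain SQL against Table_Performance, the same idea might look roughly like this (the 30-day window and 20% threshold are taken from the example, and the query shape is just a sketch rather than Revere’s exact output):

-- Compare today's row count per table against its trailing 30-day average,
-- and return any table whose count is more than 20% above that average.
with recent as (
  select tableName, avg(totalRows) as avg_rows
  from Table_Performance
  where date >= current_date - 30
    and date < current_date
  group by tableName
)
select t.tableName, t.totalRows, r.avg_rows
from Table_Performance t
join recent r on r.tableName = t.tableName
where t.date = current_date
  and t.totalRows > r.avg_rows * 1.2;  -- alert fires if any rows come back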

How to monitor your SLA

If you have an SLA in place, there are specific issues you’ll already know to look for, so it’s just a matter of defining those in Revere.

Let’s say, for example, you’ve committed to having data loads completed by 7 am, when business users start their day.

You’ll want to select the fields related to load names, times, and status, like so:

[Image: Monitor load times]

Then define the alert condition as when the end time goes later than 7 am:

[Image: Load time alert]

Then you can set the alert timing to run daily after 7 am, and if it’s triggered, your selected recipients will be notified.
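Expressed outside of Revere, in plain SQL against the illustrative Job_Performance table sketched earlier, that check might look something like this (the 7 am cutoff comes from the SLA above; the table and column names are assumptions):

-- Today's loads that missed the 7 am SLA: finished late, failed, or still running.
select jobName, startTime, endTime, status
from Job_Performance
where date = current_date
  and (endTime is null
       or endTime::time > time '07:00'
       or status <> 'success');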

Here are a few other ideas for monitoring your data pipeline:

  • Alert when a table’s storage footprint increases significantly or hits a set threshold (see the sketch below).
  • Alert if the number of rows ingested is significantly higher or lower than usual.
  • Alert if average load time is significantly higher than usual or over a set threshold.
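As a sketch of the first idea, a fixed-threshold check on the space usage recorded in Table_Performance could be as simple as the query below (the 100 GB threshold is just an illustrative value):

-- Tables whose storage footprint has crossed a fixed threshold (100 GB here).
select tableName, spaceUsage
from Table_Performance
where date = current_date
  and spaceUsage > 100::bigint * 1024 * 1024 * 1024;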