Adding Data Observability and Alerts to your Data Pipeline is easier than you think

Ensuring Data Quality with Data Observability

Posted by PipeRider · 6 min read

After you’ve transformed the data in your data warehouse and sent it on its way, you might think your job is done. That is, until you get a call that data is missing, the schema has changed unexpectedly, or an outlier has crept into the numbers. To understand when and where these issues occurred, you need some form of data observability for your pipeline.

Data observability means that you can monitor the data moving through the pipeline, be alerted to any changes or issues with the data structure, and compare data to help visualize change and aid in tracking down issues.


With PipeRider, the open-source data observability toolkit, you can add observability to your data source and start understanding more about your data in minutes with:

  • Non-intrusive implementation: focus on understanding your data without changing it
  • Data profiling: in-depth analysis of the structure of your data source
  • Data assertions: ensure your data stays within acceptable ranges through testing
  • Reporting: the data profile and testing results are exported to an HTML report

The following are the steps you need to get started adding data observability and data assertions to your existing data pipeline.

1. Install PipeRider

PipeRider is installed via pip:

pip install -U piperider

By default it comes with SQLite, but the following connectors are also available:

  • Postgres
  • Snowflake
  • BigQuery
  • Redshift
  • dbt (with one of the supported connectors)
  • DuckDB
  • CSV
  • Parquet

Install PipeRider with a connector like this:

pip install -U 'piperider[postgres,snowflake]'


2. Initialize a PipeRider project

Once installed, initialize a new project with the following command.

piperider init

Just select your data source, enter the relevant details, and you’re ready to go. dbt project settings will be auto-detected, so dbt projects really are zero config!
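During init, PipeRider saves your connection details to a project config file (typically .piperider/config.yml). As a rough sketch, assuming a SQLite data source (the exact keys vary by connector and PipeRider version, so treat the names below as illustrative):

dataSources:
- name: my_sqlite_source   # illustrative data source name
  type: sqlite
  dbpath: ./data.db        # illustrative path to a SQLite database file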

Verify your connection settings with the diagnose command.

piperider diagnose


3. Run PipeRider

With a data source connected you’re ready to run PipeRider.

piperider run

This one command will do the following:

  • Profile your data source
  • Generate recommended data assertions (on first run)
  • Test the data profile against the data assertions
  • Display the data assertion test results on the CLI
  • Generate an HTML report with the data profile and test results
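If you want to gate an existing pipeline on these checks, a minimal shell sketch could look like the one below. It assumes that piperider run exits with a non-zero status when an assertion fails; verify that behavior for your PipeRider version before relying on it.

#!/bin/sh
# Minimal sketch: gate a pipeline step on PipeRider's checks.
# Assumption: `piperider run` exits non-zero when an assertion fails
# (verify this for your PipeRider version).
set -e

piperider run   # profile the source and test it against your assertions
echo "Data assertions passed; safe to continue the pipeline."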


4. Test your PipeRider data profile with data assertions

PipeRider creates a set of recommended assertions based on the current state of your data. You can add to or edit these using the available suite of built-in assertions, and through custom assertions you can create your own data reliability tests.
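Assertions are defined in YAML files inside your project. As a rough sketch (check the assertion names and exact layout against the PipeRider docs for your version; the table and column names here are illustrative), a file that bounds a table’s row count and a column’s minimum value might look like:

orders:                              # illustrative table name
  tests:
  - name: assert_row_count_in_range  # built-in assertion
    assert:
      count: [1000, 200000]
  columns:
    amount:                          # illustrative column name
      tests:
      - name: assert_column_min_in_range
        assert:
          min: [0, 5]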


5. Compare data profile reports

When your data changes and you have multiple PipeRider runs, compare reports easily with the following command.

piperider compare-reports

You can also compare the last two reports automatically (without needing to manually select them) by using the --last flag.

piperider compare-reports --last
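Put together, a before-and-after check around a transformation could look like this, where run_transform.sh is a placeholder for your own dbt run or ETL job:

piperider run                      # baseline profile
./run_transform.sh                 # placeholder: your transformation step
piperider run                      # profile the transformed data
piperider compare-reports --last   # compare the two most recent reports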


Sample Reports

Links to example reports, including samples created with PipeRider 0.7, are always available in the PipeRider documentation.


Who makes PipeRider?

PipeRider is developed by InfuseAI, the company behind the end-to-end machine learning platform PrimeHub.

InfuseAI has an impressive portfolio of open-source projects, so you know you’re in good hands!


InfuseAI is solving data quality issues

InfuseAI makes PipeRider, the open-source data reliability CLI tool that adds data profiling and assertions to data warehouses such as BigQuery, Snowflake, Redshift and more. Data profile and data assertion results are provided in an HTML report each time you run PipeRider.