tl;dr
Data profiling helps you understand your data. Paired with data assertions, a data profile can be used to determine data reliability. PipeRider is a non-intrusive open-source platform for adding profiling and assertions to data sources (such as dbt, Postgres, and Snowflake). Star PipeRider on GitHub.
Why data profiling?
Profiling is essential for understanding your data. A data profile is essentially data about your data. On its own, a data profile can provide interesting insights, but the real value comes from pairing the profile with data assertions. Together they form the basis of data reliability. Through data assertions you define the acceptable norms for your data, and then test whether your profile meets that specification.
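To make the pairing of profile and assertion concrete, here is a minimal Python sketch. It is not how PipeRider is implemented: `profile_column` and `assert_in_range` are hypothetical helpers, and the sample prices are made up for illustration.

```python
from statistics import mean


def profile_column(values):
    """Summarise one column: counts, nulls, distinct values, and numeric range."""
    non_null = [v for v in values if v is not None]
    return {
        "total": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": mean(non_null) if non_null else None,
    }


def assert_in_range(profile, metric, low, high):
    """An assertion is a rule about a profile metric, not about individual rows."""
    value = profile[metric]
    return value is not None and low <= value <= high


# Hypothetical `price` column with one missing value.
prices = [12.5, 9.0, None, 30.0, 9.0]
price_profile = profile_column(prices)
print(price_profile)

# Norm 1: no nulls allowed. Norm 2: the maximum price stays between 0 and 100.
print("no nulls:", assert_in_range(price_profile, "nulls", 0, 0))
print("max in range:", assert_in_range(price_profile, "max", 0, 100))
```

The profile alone tells you there is one null; the assertion is what turns that observation into a pass or fail.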
[data]Pipe[line]Rider
With PipeRider you can easily add data profiling and assertions to your dbt project and start profiling and testing your dbt models.
PipeRider shows profiling results on the command line, and also in an HTML report. Plus, if you run your dbt tests as part of the PipeRider run, they’ll also be included in the report.
PipeRider also works with other data sources, such as Snowflake, Postgres, and SQLite, but for this tutorial I’ll focus on dbt.
If you’d prefer to watch how this is done, check out the YouTube video. Otherwise, read on.
dbt, ‘yeah you know me’
This guide assumes that you already have a dbt project ready to go, and that your dbt project is using either Snowflake or Postgres as a data source. Let’s add data reliability to your dbt project:
Install PipeRider
Install PipeRider using pip, specifying the data source your dbt project uses. I recommend using a virtual environment such as venv or Conda.
Postgres
pip install 'piperider[postgres]'
Snowflake
pip install 'piperider[snowflake]'
(Don’t forget the quotes: without them, shells such as zsh treat the square brackets as a glob pattern.)
Initialize PipeRider
Inside your dbt project, run the following command to initialize PipeRider:
piperider init
PipeRider will then automatically detect your dbt data source settings from your dbt_project.yml and profiles.yml.
Check your connection
To test the connection to your data source, run the diagnose command:
piperider diagnose
In the Check Connections section, you should see your dbt models listed.

Run PipeRider
Now all you need to do is run PipeRider!
piperider run
Since this is a dbt project, you likely already have some dbt tests configured. If you want your dbt tests to appear in your PipeRider report, you can add the --dbt-test or --dbt-build option to the command. Check the PipeRider docs for more info on available options.
Profile and Assertions
PipeRider will profile the tables and columns from your data source and, as this is the first run, prompt you to generate some recommended assertion files.

If you choose ‘yes’, PipeRider will make assertion files with some sensible defaults based on the current structure of your data.
If you choose ‘no’, then PipeRider will create skeleton assertion files.
Choose ‘yes’ and you’ll see that PipeRider creates assertion files for your tables, and then tests them against the recommended assertions.

As it’s the first run, and you’re using recommended assertions, all of the assertions pass. You’ll find out how to edit the assertions below.
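Why everything passes on the first run can be illustrated with a small Python sketch. The rule used here (let the row count drift by 10% and accept no more nulls than exist today) is an assumption for illustration, not PipeRider's actual recommendation logic. `recommend_assertions`, `lookup`, `run_assertions`, and the profile layout are all hypothetical.

```python
def recommend_assertions(profile, slack_pct=10):
    """Derive (low, high) ranges from the current profile (assumed rule, not PipeRider's)."""
    rows = profile["row_count"]
    recommended = {
        # Allow the row count to drift by slack_pct percent either way.
        "row_count": (rows * (100 - slack_pct) // 100, rows * (100 + slack_pct) // 100),
    }
    for column, stats in profile["columns"].items():
        # Accept at most as many nulls as the column has right now.
        recommended[f"{column}.nulls"] = (0, stats["nulls"])
    return recommended


def lookup(profile, key):
    """Resolve 'row_count' or 'column.metric' to a value in the profile."""
    if key == "row_count":
        return profile["row_count"]
    column, metric = key.split(".")
    return profile["columns"][column][metric]


def run_assertions(profile, assertions):
    """Return {assertion key: passed?} for every (low, high) range."""
    return {
        key: low <= lookup(profile, key) <= high
        for key, (low, high) in assertions.items()
    }


# Hypothetical profile of an `orders` model on the first run.
first_run = {"row_count": 1000, "columns": {"order_id": {"nulls": 0}, "coupon": {"nulls": 120}}}
assertions = recommend_assertions(first_run)
print(run_assertions(first_run, assertions))  # ranges were derived from this very profile

# A later run in which half the rows have gone missing.
later_run = {"row_count": 500, "columns": {"order_id": {"nulls": 0}, "coupon": {"nulls": 60}}}
print(run_assertions(later_run, assertions))  # row_count now falls outside its range
```

Because the ranges are derived from the data as it is today, they only become informative on later runs, which is why reviewing and tightening them is worthwhile.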
HTML Report
At the end of the CLI output, you’ll also find a link to the PipeRider report.

The report contains the data profile for each table, plus the PipeRider assertion and dbt test results (if you ran those as well).

Assertions
The recommended assertions provide some sensible defaults that you can use as a starting point. If you’re confident your data is currently in good health, then you can leave them as-is.
Most likely, you’ll want to review the data profile in your PipeRider report, and then adjust the recommended assertions as you see fit.
The assertion files for your tables can be found in the .piperider/assertions directory.

Open one of the assertion files and you’ll see the defaults that PipeRider has set up for you.

PipeRider comes with some built-in assertions you can use to test your data profile. If you want to get your hands dirty then you can also make custom assertions — if it’s in your data profile, you can test it.
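The idea that anything in the profile can be tested is sketched below in plain Python. This is not PipeRider's custom assertion API (see the PipeRider docs for that). The function names, the metric names `total`, `nulls`, and `distinct`, and the sample numbers are all assumptions made for illustration.

```python
def assert_unique(column_profile):
    """Custom check: every non-null value appears exactly once."""
    non_null = column_profile["total"] - column_profile["nulls"]
    return column_profile["distinct"] == non_null


def assert_null_ratio_below(column_profile, limit):
    """Custom check: the share of nulls stays under `limit` (0.0 to 1.0)."""
    if column_profile["total"] == 0:
        return True  # an empty column has no nulls to complain about
    return column_profile["nulls"] / column_profile["total"] < limit


# Hypothetical column profiles, as a profiler might report them.
order_id = {"total": 1000, "nulls": 0, "distinct": 1000}
email = {"total": 1000, "nulls": 150, "distinct": 820}

results = {
    "order_id is unique": assert_unique(order_id),
    "email is unique": assert_unique(email),
    "email null ratio < 20%": assert_null_ratio_below(email, 0.20),
}
for name, passed in results.items():
    print("PASS" if passed else "FAIL", name)
```

Both checks combine metrics that are already in the profile, so no extra query against the data source is needed to evaluate them.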
After updating your assertion files, simply run PipeRider again to test your data profile against the new assertions. Failed assertion results can be viewed both on the CLI, and in the generated reports.

What’s next?
There’s lots more to check out with PipeRider, such as comparing reports, regenerating assertions, and GitHub Actions. Check out the PipeRider documentation for full details of all features.
We’re InfuseAI: lovers of open-source, data-quality aficionados.
If you’ve read this far, you might want to learn more about data reliability: