How to Use PipeRider's Built-in Assertion and Custom Assertion to Test the Water Quality Kaggle Dataset

Use PipeRider's assertion method to test the water quality Kaggle data

Posted by PipeRider · 6 mins read

tl;dr PipeRider can help detect data issues in a water quality dataset through data profiling and custom assertions, which can be useful for determining the drinkability of water and solving problems immediately if alerts are raised.

Determining the drinkability of water is the perfect case for a data reliability tool. Not only can we detect missing data, but we can also catch values that fall outside safe ranges. If a water resources management center has a good data assertion tool for monitoring water quality, it can address problems as soon as they appear. Managers and employees can also review a daily report showing the data distribution and water status. If PipeRider’s data assertions raise an alert, they can decide how to respond and manage the situation better.

In this article, I will use a Kaggle water quality dataset and show how PipeRider can help detect data issues through data profiling and custom assertions.

PipeRider is a data quality toolkit for data professionals. With PipeRider, you can profile your data sources, create highly customizable data quality assertions, and generate insightful reports.

Get the dataset from Kaggle

First, download the water quality dataset from the Kaggle website and put it into your working environment.

If you want to use the Kaggle CLI tool, you need to download the Kaggle API key JSON file from the Kaggle website. Proceed as follows:

Go to “Account”, scroll down the page, and find the “API” section. Click the “Create New API Token” button to download the “kaggle.json” file, then put the file into the ~/.kaggle/ folder.

Download the API token JSON file on the Kaggle Account page.

Then, we can download the dataset through the Kaggle CLI tool.

pip install kaggle

kaggle datasets download -d adityakadiwal/water-potability
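The Kaggle CLI saves the dataset as a zip archive, so the CSV has to be extracted before the next step. Here is a minimal sketch using only the Python standard library. The function name extract_csvs is my own, and the archive name water-potability.zip in the usage note is an assumption, so check what the CLI actually saved in your folder.

```python
import zipfile
from pathlib import Path


def extract_csvs(archive_path, dest_dir="."):
    """Extract every CSV file in the archive and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            # Skip anything that is not a CSV file (readme files, folders).
            if member.lower().endswith(".csv"):
                archive.extract(member, dest)
                extracted.append(dest / member)
    return extracted
```

Calling extract_csvs("water-potability.zip") returns the paths of the extracted CSV files, which is a handy way to confirm the exact file name to pass to the conversion step.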


Transfer CSV to SQLite

PipeRider supports four data sources: dbt integration, Postgres Connector, Snowflake Connector, and SQLite database. Here we use SQLite as our database example.

Because we are using SQLite as the data source, we first need to load the CSV file into a SQLite database. We can use the open source tool csvs-to-sqlite to do this.

You might also want to check the article “Transfer the CSV Files into an SQLite Database”.

Run the following commands to convert the CSV file into a SQLite database:

pip install csvs-to-sqlite
csvs-to-sqlite water-potability.csv water-potability.db
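If you would rather not install an extra tool, the same conversion can be sketched with Python's standard library. This is not how csvs-to-sqlite works internally. It is a simplified stand-in that assumes every column in the CSV is numeric (which holds for this dataset) and stores empty cells as NULL. The function name load_csv_into_sqlite is hypothetical.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table):
    """Create `table` in the SQLite file and copy every CSV row into it."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [
            # Empty cells become NULL so missing readings stay detectable.
            [float(cell) if cell != "" else None for cell in row]
            for row in reader
        ]
    # Assumption: every column is numeric, so each one is declared REAL.
    columns = ", ".join(f'"{name}" REAL' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection = sqlite3.connect(db_path)
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(f'DROP TABLE IF EXISTS "{table}"')
            connection.execute(f'CREATE TABLE "{table}" ({columns})')
            connection.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows
            )
    finally:
        connection.close()
    return len(rows)
```

The function returns the number of rows it loaded, which you can compare against the row count of the CSV as a quick sanity check.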

Add built-in assertion

Now, we can start to use PipeRider. The Quick Start tutorial on the documentation page shows how to initialize, configure, diagnose, and run PipeRider. After you run the piperider run command, you will find the “.piperider” folder in your project folder.

The structure of the .piperider folder

Although PipeRider can generate assertions automatically, we want to configure assertions with our own logic. For example, pH values range from 0 to 14. If a value falls outside this range, something went wrong with the measurement, and the data needs to be collected again. PipeRider provides two types of built-in assertions: those that take no parameters and those that take parameters.
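To make the pH example concrete, here is a rough sketch of what a range check amounts to, written directly against the SQLite database instead of through PipeRider. It is not PipeRider's implementation. The function name column_range_violations is hypothetical, and the table and column names are whatever your database uses.

```python
import sqlite3


def column_range_violations(db_path, table, column, low, high):
    """Count non-NULL values of `column` that fall outside [low, high]."""
    connection = sqlite3.connect(db_path)
    try:
        query = (
            f'SELECT COUNT(*) FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL '
            f'AND ("{column}" < ? OR "{column}" > ?)'
        )
        (count,) = connection.execute(query, (low, high)).fetchone()
    finally:
        connection.close()
    return count
```

For the pH column, you would call it with low=0 and high=14. Any count above zero means some readings are physically impossible and should be collected again.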

Built-In Assertions Guide

You can go to .piperider/assertions/ and add assertions to <table>.yml. Here is an example of the built-in assertion YAML file.

After modifying the YAML file, rerun piperider run and check the assertion result.

The result of piperider run (after configuring the built-in assertion YAML file)

Add custom assertion

We can see that some assertions failed because the maximum value exceeded the WHO recommended limit. When users see an assertion error, they need to find the root cause in the dataset and improve the water quality.

The description of the Kaggle dataset tells us, “WHO has recommended maximum permissible limit of pH from 6.5 to 8.5.” Therefore, I want to test that the average pH value falls between 6.5 and 8.5. However, the built-in assertions do not include this check, so we need to write our own assertion logic.

PipeRider provides a few built-in assertions and supports custom assertions as plugins that can satisfy your data quality check requirements. In this showcase, we add the new assert_column_avg_in_range assertion method and use the assertion to test the data profiling values.

By default, PipeRider automatically loads custom assertion functions from Python files under .piperider/plugins. The .piperider/plugins folder is created by piperider init with a scaffold of a custom assertion function. You can rename the file or add assertion functions in other Python files.

After modifying the YAML file, rerun piperider run and check the assertion result.
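The core logic of an average-in-range check can be sketched on its own, separate from PipeRider's plugin classes. The sketch below is not the plugin scaffold. The function name column_avg_in_range is hypothetical, and the nested layout of the metrics dictionary (tables, then columns, then an avg value) is my assumption about how the profiling result is organized, so verify it against your own profiling output.

```python
def column_avg_in_range(metrics, table, column, low, high):
    """Return (passed, avg) for the profiled average of one column."""
    # Assumed layout: metrics["tables"][table]["columns"][column]["avg"]
    column_metrics = (
        metrics.get("tables", {})
        .get(table, {})
        .get("columns", {})
        .get(column)
    )
    if column_metrics is None or column_metrics.get("avg") is None:
        # No profiled average to judge, so treat the check as failed.
        return False, None
    avg = column_metrics["avg"]
    return low <= avg <= high, avg
```

For the WHO guideline, the call would pass low=6.5 and high=8.5 for the pH column. Inside a real plugin, the same comparison would sit in the assertion class generated by the scaffold, which then reports success or failure back to PipeRider.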

The result of piperider run (after adding the custom assertion method)

The result shows that the new assertion method ran successfully. You can follow the same Python class structure to write your own assertion methods.

You can also check historical assertion results in the PipeRider UI.

Assertion testing in PipeRider UI

I am Simon

Hi, I am Simon, a Customer Success Engineer at InfuseAI. If you find this article helpful, please give it some applause, and feel free to share your suggestions. You are also welcome to discuss it with me on the InfuseAI Discord.

InfuseAI is solving data quality issues

InfuseAI makes PipeRider, the open-source data reliability CLI tool that adds data profiling and assertions to data warehouses such as BigQuery, Snowflake, Redshift and more. Data profile and data assertion results are provided in an HTML report each time you run PipeRider.