Data Quality Verification

Data verification includes checking your data for:

  • Accuracy: Is your data correct?
  • Completeness: Is there missing data?
  • Consistency: Are fields entered consistently?
  • Currency: Is your data up to date?

Data verification is about managing data quality and enhancing its reliability. Let’s identify data quality concerns that organizations contend with.

Accuracy includes ensuring a match between source data and destination data. How can accuracy become an issue?

  • Data migrating from source systems often contains duplicated records.
  • When users enter data manually, typos can find their way into the data records, yielding out-of-range values, outliers, and spelling mistakes.
  • Sometimes large chunks of data become misaligned, causing data corruption.

For example, a field in a CSV file might contain a legitimate comma, which the new system can misinterpret as a column separator.
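
To make the misalignment risk concrete, here is a short illustrative Python sketch (not part of the lab) showing how a naive split on commas shifts the columns when a field legitimately contains a comma, while a proper CSV parser respects the quoting:

import csv
import io

# One record whose name field legitimately contains a comma
raw_line = '1001,"Doe, John",New York\n'

# Naive parsing: splitting on every comma misaligns the columns
naive_fields = raw_line.strip().split(",")
print(naive_fields)   # ['1001', '"Doe', ' John"', 'New York'] -- 4 columns instead of 3

# A real CSV parser honors the quoting and keeps the record intact
proper_fields = next(csv.reader(io.StringIO(raw_line)))
print(proper_fields)  # ['1001', 'Doe, John', 'New York']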

Completeness: Data is incomplete when the business finds missing values, such as voids or nulls in fields that should be populated, or haphazard use of placeholders such as "999" or "-1" to indicate a missing value.

  • Entire records can also be missing due to upstream system failures.

Consistency is another important data quality concern. Are there deviations from standard terminology?

  • Are dates entered consistently? For example, year-month-day and month-day-year formats are incompatible.
  • Is data entered consistently? For example, Mr. John Doe and John Doe might refer to the same person in the real world, but the system will see them as distinct.
  • Are the units consistent? For example, you are expecting "kilograms," but you might have entries based on "pounds," or you are expecting "dollar amounts," but you might have entries in "thousands of dollars." (A short normalization sketch follows this list.)
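
As an illustration (not part of the lab), a load routine often normalizes such values to a single convention before loading or comparing them. The sketch below assumes dates and weights arrive in a couple of known formats; the formats and conversion factors are illustrative only:

from datetime import datetime

# Hypothetical incoming values in mixed conventions
raw_dates = ["2024-03-05", "03/05/2024"]        # year-month-day vs. month/day/year
raw_weights = [("12.5", "kg"), ("27.6", "lb")]  # kilograms vs. pounds

def to_iso_date(value):
    """Try each known date format and return an ISO year-month-day string."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: " + value)

def to_kilograms(amount, unit):
    """Convert the supported units to kilograms."""
    factors = {"kg": 1.0, "lb": 0.45359237}
    return float(amount) * factors[unit]

print([to_iso_date(d) for d in raw_dates])                     # ['2024-03-05', '2024-03-05']
print([round(to_kilograms(a, u), 2) for a, u in raw_weights])  # [12.5, 12.52]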

Currency is an ongoing data quality concern for most businesses. Currency is about ensuring your data remains up to date.

  • For example, you might have dimension tables that contain customer addresses, some of which might be outdated.
  • In the US, you could check these against a change-of-address database and update your table as required.
  • Another currency concern would be name changes as customers can change their names for various reasons.

Plan

Determining how to resolve and prevent bad data can be a complex and iterative process.

  • First, you’ll implement rules to detect bad data.
  • Then you’ll apply those rules to capture and quarantine any bad data.
  • You might need to report any bad data and share the findings with the appropriate domain experts.
  • You and your team can investigate the root cause of each problem, searching for clues upstream in the data lineage.
  • Once you diagnose each problem, you can begin correcting the issues.
  • Ultimately, you want to automate the entire data cleaning workflow as much as possible.

For example, you need to validate the quality of data in the staging area before loading the data into a data warehouse for analytics. You determine that data from certain data sources consistently has data quality issues including:

  • Missing data,
  • Duplicate values,
  • Out-of-range values, and
  • Invalid values.

Here’s how an organization might manage and resolve these issues.

  1. First, write SQL queries to detect these issues and test for them.
  2. Next, address some of the quality issues that you’ve repeatedly identified by creating rules for treating them, such as removing rows that have out-of-range values.
  3. Create a script that runs queries to detect data quality issues that happen during the nightly loads to the data warehouse. This script applies corrective measures and transformations for some of these known issues.
  4. Next, create a second script that schedules and automates the script you created in step 3.
  5. After the data is extracted from the various data sources each night, this second script automatically runs the first script's SQL data validation queries in the staging area.
  6. The script you created in step 3 generates a report of any remaining issues that could not be automatically resolved.
  7. The administrator can review this report and address the unresolved issues.
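
As a hedged sketch of what the detection queries in step 1 might look like, the snippet below runs a few checks with psycopg2 against the billing staging database used later in this reading. The table and column names (FactBilling, billedamount, customerid, monthid) and the out-of-range rule are assumptions made for illustration:

import os
import psycopg2

# Detection queries for the issues listed above; the table and column names
# below are assumptions based on the billing star schema used later in this reading
checks = {
    "missing data": 'SELECT COUNT(*) FROM "FactBilling" WHERE billedamount IS NULL',
    "duplicate values": 'SELECT COUNT(*) FROM (SELECT customerid, monthid FROM "FactBilling" '
                        'GROUP BY customerid, monthid HAVING COUNT(*) > 1) AS dup',
    "out-of-range values": 'SELECT COUNT(*) FROM "FactBilling" WHERE billedamount < 0',
}

conn = psycopg2.connect(
    user="postgres",
    password=os.environ.get("PGPASSWORD"),
    host="postgres",
    port="5432",
    database="billingDW")
cursor = conn.cursor()

for issue, query in checks.items():
    cursor.execute(query)
    offending_rows = cursor.fetchone()[0]
    # A non-zero count means the staging data needs attention before loading
    print(issue + ":", offending_rows, "offending rows")

cursor.close()
conn.close()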

Some of the leading vendors and their tools for data quality solutions include:

  • IBM InfoSphere Information Server for Data Quality,
  • Informatica Data Quality,
  • SAP Data Quality Management,
  • SAS Data Quality,
  • Talend Open Studio for Data Quality,
  • Precisely Spectrum Quality,
  • Microsoft Data Quality Services,
  • Oracle Enterprise Data Quality,
  • And an open-source tool called OpenRefine.

The “IBM InfoSphere Information Server for Data Quality” is an example of a product that can help you perform data verification in a unified environment.

  • “InfoSphere Information Server for Data Quality” enables you to continuously monitor the quality of your data, and keep your data clean on an ongoing basis, helping you turn your data into trusted information.
  • In addition, the “IBM InfoSphere Information Server for Data Quality” comes with built-in, end-to-end data quality tools to:
    • Help you understand your data and its relationships.
    • Monitor and analyze data quality continuously.
    • Clean, standardize, and match data.
    • Maintain data lineage, which is the history of the data’s origin and what happened to the data along the way.

Data Quality Verification Example


Let’s go through a quick example, on a cloud-based PostgreSQL server, of how to:

  • Check for null values
  • Check for duplicate values
  • Check minimum and maximum values
  • Check for invalid values
  • Generate a report on data quality

Create PostgreSQL instance

Download Staging Area Script

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/8ZUkKar_boDbhNgMiwGAWg/setupstagingarea.sh

Export Password

Export the PostgreSQL password as an environment variable so that the scripts can connect to the server:

export PGPASSWORD=emiGkf3h4nVAVYB4gnq8DH0M;

Setup Staging Area with Script

This will set up the billingDW database, extract the data for all tables, and populate all tables after setting up their schema.

bash setupstagingarea.sh

# RESPONSE
Finished loading data
Verifying data
"Checking row in DimMonth Table"
 count 
-------
   132
(1 row)

"Checking row in DimCustomer Table"
 count 
-------
  1000
(1 row)

"Checking row in FactBilling Table"
 count  
--------
 132000
(1 row)

Perform Quality Checks

You can perform most of the data quality checks by manually running SQL queries on the data warehouse, but it is a good idea to automate these checks using custom programs or tools. Automation helps you to easily:

  • create new tests,
  • run tests,
  • and schedule tests.

Download Python Scripts

We will be using a Python-based framework to run the data quality tests.

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/dataqualitychecks.py

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/HB0XK4MDrGwigMmVPmPoeQ/dbconnect.py

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/mytests.py

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/saTxV8y9Kt-e8Zxe29M0TA/generate-data-quality-report.py

ls

# OUTPUT

2024-10-10 17:39:43 (379 MB/s) - 'generate-data-quality-report.py' saved [1171/1171]

DimCustomer.sql              dbconnect.py
DimMonth.sql                 generate-data-quality-report.py
FactBilling.sql              mytests.py
billing-datawarehouse.tgz    setupstagingarea.sh
billing-datawarehouse.tgz.1  star-schema.sql
dataqualitychecks.py         verify.sql

Install Python Driver for PostgreSQL

python3 -m pip install psycopg2

# RESPONSE
...
 Stored in directory: /home/theia/.cache/pip/wheels/7d/75/13/da1c6d88687ae81bf5e3cfa07d702981ba137963163472b050
Successfully built psycopg2
Installing collected packages: psycopg2
Successfully installed psycopg2-2.9.9

Update Password in Connection File

We need to update the password in the Python connection file. Look for dbconnect.py and open it for editing:

  • Save the PostgreSQL password in the corresponding line:
  • pgpassword = os.environ.get('PGPASSWORD')

Test db Connectivity

Check all of the following:

  • if the PostgreSQL Python driver is installed properly,
  • if the PostgreSQL server is up and running, and
  • if our micro framework can connect to the database.

$ python3 dbconnect.py
Successfully connected to warehouse
Connection closed

Here is a copy of the dbconnect.py file:

import os
import psycopg2

# Read the database password from the environment variable exported earlier
pgpassword = os.environ.get('PGPASSWORD')

conn = None
try:
    # Connect to the billingDW staging database
    conn = psycopg2.connect(
        user = "postgres",
        password = pgpassword,
        host = "postgres",
        port = "5432",
        database = "billingDW")
except Exception as e:
    print("Error connecting to data warehouse")
    print(e)
else:
    print("Successfully connected to warehouse")
finally:
    if conn:
        conn.close()
        print("Connection closed")

Create Quality Report

Install Pandas and Tabulate

python3 -m pip install pandas tabulate

Create Sample Report

python3 generate-data-quality-report.py
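
The generate-data-quality-report.py script itself is not reproduced in this reading. As a rough, hypothetical sketch of what such a script could do, the snippet below collects the result of each test dictionary defined in mytests.py into a pandas DataFrame and prints it with tabulate; the call signature assumed for the check functions and the report layout are assumptions, and the real script may differ:

import pandas as pd
from tabulate import tabulate
import mytests

# Gather every dictionary in mytests.py whose name starts with "test"
tests = [getattr(mytests, name) for name in dir(mytests) if name.startswith("test")]

results = []
for test in tests:
    # Everything except the name and the function is passed to the check
    params = {k: v for k, v in test.items() if k not in ("testname", "test")}
    # Assumption: each check function accepts these fields as keyword arguments
    # and returns True (pass) or False (fail)
    passed = test["test"](**params)
    results.append({
        "Test Name": test["testname"],
        "Table": test["table"],
        "Column": test["column"],
        "Result": "PASS" if passed else "FAIL",
    })

report = pd.DataFrame(results)
print(tabulate(report, headers="keys", tablefmt="psql", showindex=False))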

Explore Data Quality

Open the file mytests.py; its contents are shown below.

It provides a quick and easy way to author and run new data quality tests.

The testing framework provides the following tests:

  • check_for_nulls - this test will check for nulls in a column
  • check_for_min_max - this test will check if the values in a column are within a range of minimum and maximum values
  • check_for_valid_values - this test will check for any invalid values in a column
  • check_for_duplicates - this test will check for duplicates in a column

Each test is authored by specifying a minimum of four parameters:

  • testname - The human-readable name of the test, for reporting purposes
  • test - The actual test function that the testing micro framework provides
  • table - The table name on which the test is to be performed
  • column - The column name on which the test is to be performed

from dataqualitychecks import check_for_nulls
from dataqualitychecks import check_for_min_max
from dataqualitychecks import check_for_valid_values
from dataqualitychecks import check_for_duplicates


test1={
    "testname":"Check for nulls",
    "test":check_for_nulls,
    "column": "monthid",
    "table": "DimMonth"
}


test2={
    "testname":"Check for min and max",
    "test":check_for_min_max,
    "column": "month",
    "table": "DimMonth",
    "minimum":1,
    "maximum":12
}


test3={
    "testname":"Check for valid values",
    "test":check_for_valid_values,
    "column": "category",
    "table": "DimCustomer",
    "valid_values":{'Individual','Company'}
}


test4={
    "testname":"Check for duplicates",
    "test":check_for_duplicates,
    "column": "monthid",
    "table": "DimMonth"
}
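
The dataqualitychecks.py module is downloaded rather than shown in this reading. To give a sense of what a check such as check_for_nulls might boil down to, here is a hypothetical sketch that counts NULLs with a SQL query over a psycopg2 connection; the connection handling, parameter names, and return convention are assumptions, and the downloaded file may differ:

import os
import psycopg2

# Hypothetical module-level connection, similar to the one in dbconnect.py
conn = psycopg2.connect(
    user="postgres",
    password=os.environ.get("PGPASSWORD"),
    host="postgres",
    port="5432",
    database="billingDW")

def check_for_nulls(column, table):
    """Hypothetical sketch: pass (return True) only when the column contains no NULLs."""
    cursor = conn.cursor()
    cursor.execute('SELECT COUNT(*) FROM "{}" WHERE {} IS NULL'.format(table, column))
    null_count = cursor.fetchone()[0]
    cursor.close()
    return null_count == 0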

Check for Nulls

Here is a sample test definition:

test1={
    "testname":"Check for nulls",
    "test":check_for_nulls,
    "column": "monthid",
    "table": "DimMonth"
}

All tests must be named test followed by a unique number to identify the test.

  • Give an easy-to-understand description for testname.
  • Specify check_for_nulls for test.
  • Specify the column name on which you wish to check for nulls.
  • Specify the table name where this column exists.

Let us now create a new check_for_nulls test and run it. The test below checks if there are any null values in the column year in the table DimMonth.

  • The test fails if nulls exist.
  • Copy and paste the code below at the end of the mytests.py file.

test5={
    "testname":"Check for nulls",
    "test":check_for_nulls,
    "column": "year",
    "table": "DimMonth"
}

Execute Script

Now that the script is edited and saved, execute the Python script:

python3 generate-data-quality-report.py

Check for Min & Max

Here is a sample test definition:

test2={
    "testname":"Check for min and max",
    "test":check_for_min_max,
    "column": "monthid",
    "table": "DimMonth",
    "minimum":1,
    "maximum":12
}

In addition to the usual fields, you have two more fields here.

  • minimum is the lowest valid value for this column (for example, 1 for a month number).
  • maximum is the highest valid value for this column (for example, 12 for a month number).

Let us now create a new check_for_min_max test and run it. The test below checks for a minimum of 1 and a maximum of 4 in the column quarter in the table DimMonth. The test fails if there are any values less than the minimum or greater than the maximum. Add this code to the bottom of the file and execute the report again:

test6={
    "testname":"Check for min and max",
    "test":check_for_min_max,
    "column": "quarter",
    "table": "DimMonth",
    "minimum":1,
    "maximum":4
}
python3 generate-data-quality-report.py

Check for Invalid Entries

In addition to the usual fields, you have an additional field here.

  • Use the valid_values field to specify the valid values for this column.

Let us now create a new check_for_valid_values test and run it. The test below checks for valid values in the column quartername in the table DimMonth. The valid values are Q1, Q2, Q3, and Q4. The test fails if there are any values outside this set. Add this code to the file and execute the report again:

test7={
    "testname":"Check for valid values",
    "test":check_for_valid_values,
    "column": "quartername",
    "table": "DimMonth",
    "valid_values":{'Q1','Q2','Q3','Q4'}
}
python3 generate-data-quality-report.py

Check for Duplicates

Let us now create a new check_for_duplicates test and run it. The test below checks for any duplicate values in the column customerid in the table DimCustomer. The test fails if duplicates exist.

test8={
    "testname":"Check for duplicates",
    "test":check_for_duplicates,
    "column": "customerid",
    "table": "DimCustomer"
}
python3 generate-data-quality-report.py