Data Quality Verification
Data verification includes checking your data for:
- Accuracy: Is your data correct?
- Completeness: Is there missing data?
- Consistency: Are fields entered consistently?
- Currency: Is your data up to date?
Data verification is about managing data quality and enhancing its reliability. Let’s identify data quality concerns that organizations contend with.
Accuracy includes ensuring a match between source data and destination data. How can accuracy become an issue?
- Data migrating from source systems often contains duplicated records.
- When users enter data manually, typos can find their way into the data records, yielding out-of-range values, outliers, and spelling mistakes.
- Sometimes large chunks of data become misaligned, causing data corruption.
For example, a CSV file might contain a legitimate comma, which the new system can misinterpret as a column separator.
Completeness: Data is incomplete when the business finds missing values, such as voids or nulls in fields that should be populated, or the haphazard use of placeholders such as “999” or “-1” to indicate a missing value.
- Entire records can also be missing due to upstream system failures.
Consistency is another important data quality concern. Are there deviations from standard terminology?
- Are dates entered consistently? For example, year-month-day and month-day-year formats are incompatible.
- Is data entered consistently? For example, Mr. John Doe and John Doe might refer to the same person in the real world, but the system will see them as distinct.
- Are the units consistent? For example, you are expecting “kilograms,” but you might have entries based on “pounds,” or you are expecting “dollar amounts,” but you might have entries based on “thousands of dollars.” (A short normalization sketch follows this list.)
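As a quick illustration of how consistency issues like these can be repaired in code, here is a small sketch. It is not part of any lab script, and the accepted date formats, unit labels, and conversion factor are assumptions for illustration only.
# Hypothetical consistency fixes; the accepted formats and units are assumptions.
from datetime import datetime

def normalize_date(value: str) -> str:
    """Accept year-month-day or month/day/year and return ISO year-month-day."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value}")

def to_kilograms(amount: float, unit: str) -> float:
    """Convert weights recorded in pounds to kilograms; pass kilograms through."""
    return amount * 0.453592 if unit.lower() in ("lb", "lbs", "pounds") else amount

print(normalize_date("03/15/2024"))  # 2024-03-15
print(to_kilograms(10, "pounds"))    # 4.53592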
Currency is an ongoing data quality concern for most businesses. Currency is about ensuring your data remains up to date.
- For example, you might have dimension tables that contain customer addresses, some of which might be outdated.
- In the US, you could check these against a change-of-address database and update your table as required.
- Another currency concern would be name changes as customers can change their names for various reasons.
Plan
Determining how to resolve and prevent bad data can be a complex and iterative process.
- First, you’ll implement rules to detect bad data.
- Then you’ll apply those rules to capture and quarantine any bad data.
- You might need to report any bad data and share the findings with the appropriate domain experts.
- You and your team can investigate the root cause of each problem, searching for clues upstream in the data lineage.
- Once you diagnose each problem, you can begin correcting the issues.
- Ultimately, you want to automate the entire data cleaning workflow as much as possible.
For example, you need to validate the quality of data in the staging area before loading the data into a data warehouse for analytics. You determine that data from certain data sources consistently has data quality issues including:
- Missing data,
- Duplicate values,
- Out-of-range values, and
- Invalid values.
Here’s how an organization might manage and resolve these issues (a minimal sketch of the detection step follows these steps).
- First, write SQL queries to detect these issues and test for them.
- Next, address some of the quality issues that you’ve repeatedly identified by creating rules for treating them, such as removing rows that have out-of-range values.
- Create a script that runs queries to detect data quality issues that happen during the nightly loads to the data warehouse. This script applies corrective measures and transformations for some of these known issues.
- Next, create a second script that automates running the script you created in step 3.
- After the data is extracted from the various data sources, this script automatically runs the prior script’s SQL data validation queries every night in the staging area.
- The script you created in step 3 generates a report of any remaining issues that could not be automatically resolved.
- The administrator can review this report and address the unresolved issues.
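To make the detection step concrete, here is a minimal sketch of nightly detection queries run from Python. Everything in it is an illustrative assumption rather than part of this course’s scripts: the staging_billing table, its columns, the STAGING_DSN connection string, and the count-only reporting.
# Minimal sketch of nightly data quality detection queries (illustration only).
# The staging_billing table, its columns, and the STAGING_DSN environment
# variable are hypothetical; adapt the queries to your own staging schema.
import os
import psycopg2

CHECKS = {
    "missing billed amount":
        "SELECT COUNT(*) FROM staging_billing WHERE billedamount IS NULL",
    "duplicate customer/month rows":
        """SELECT COUNT(*) FROM (
               SELECT customerid, monthid
               FROM staging_billing
               GROUP BY customerid, monthid
               HAVING COUNT(*) > 1
           ) AS dup""",
    "out-of-range billed amount":
        "SELECT COUNT(*) FROM staging_billing WHERE billedamount < 0",
    "invalid customer category":
        "SELECT COUNT(*) FROM staging_billing "
        "WHERE category NOT IN ('Individual', 'Company')",
}

conn = psycopg2.connect(os.environ["STAGING_DSN"])
cur = conn.cursor()
for rule, sql in CHECKS.items():
    cur.execute(sql)
    offending = cur.fetchone()[0]
    # Rows that fail a rule would typically be copied to a quarantine table
    # and reported to a domain expert before the warehouse load continues.
    print(f"{rule}: {offending} offending row(s)")
conn.close()
A real pipeline would add the quarantine and correction steps around these queries and wire the script into the nightly load, which is what the remaining steps describe.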
Some of the leading vendors and their tools for data quality solutions include:
- IBM InfoSphere Server for Data Quality,
- Informatica Data Quality,
- SAP Data Quality Management,
- SAS Data Quality,
- Talend Open Studio for Data Quality,
- Precisely Spectrum Quality,
- Microsoft Data Quality Services,
- Oracle Enterprise Data Quality,
- And an open-source tool called OpenRefine.
The “IBM InfoSphere Information Server for Data Quality” is an example of a product that can help you perform data verification in a unified environment.
- “InfoSphere Information Server for Data Quality” enables you to continuously monitor the quality of your data, and keep your data clean on an ongoing basis, helping you turn your data into trusted information.
- In addition, the “IBM InfoSphere Information Server for Data Quality” comes with built-in, end-to-end data quality tools to:
- Help you understand your data and its relationships.
- Monitor and analyze data quality continuously.
- Clean, standardize, and match data; and
- Maintain data lineage, which is the history of the data’s origin and what happened to the data along the way.
Data Quality Verification Example
Let’s go through a quick example, on a cloud-based PostgreSQL server, of how to:
- Check for null values
- Check for duplicate values
- Check for min and max values
- Check for invalid values
- Generate a report on data quality
Create PostgreSQL instance
Download Staging Area Script
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/8ZUkKar_boDbhNgMiwGAWg/setupstagingarea.sh
Export Password
export PGPASSWORD=emiGkf3h4nVAVYB4gnq8DH0M
Setup Staging Area with Script
This will set up the billingDW database, extract the data for all tables, and populate all tables after setting up their schema.
bash setupstagingarea.sh
# RESPONSE
Finished loading data
Verifying data
"Checking row in DimMonth Table"
 count
-------
   132
(1 row)

"Checking row in DimCustomer Table"
 count
-------
  1000
(1 row)

"Checking row in FactBilling Table"
 count
--------
 132000
(1 row)
Perform Quality Checks
You can perform most of the data quality checks by manually running SQL queries on the data warehouse. It is a good idea to automate these checks using custom programs or tools. Automation makes it easy to:
- create new tests,
- run tests,
- and schedule tests.
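For the scheduling part, one common approach is a small wrapper that runs the report script downloaded in the next step, triggered nightly by cron. This is only a hedged sketch, not a lab file; the wrapper name nightly_dq_run.py and the cron entry are assumptions.
# nightly_dq_run.py -- hypothetical wrapper around the report script; not a lab file.
# One possible cron entry to run it every night at 2am:
#   0 2 * * * /usr/bin/python3 /path/to/nightly_dq_run.py >> /var/log/dq.log 2>&1
import subprocess
import sys

def main() -> int:
    # Run the data quality report script and surface its output and exit code.
    result = subprocess.run(
        [sys.executable, "generate-data-quality-report.py"],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())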
Download Python Scripts
We will be using a Python-based framework to run the data quality tests.
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/dataqualitychecks.py
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/HB0XK4MDrGwigMmVPmPoeQ/dbconnect.py
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DB0260EN-SkillsNetwork/labs/Verifying%20Data%20Quality%20for%20a%20Data%20Warehouse/mytests.py
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/saTxV8y9Kt-e8Zxe29M0TA/generate-data-quality-report.py
ls
# OUTPUT
2024-10-10 17:39:43 (379 MB/s) - 'generate-data-quality-report.py' saved [1171/1171]

DimCustomer.sql             billing-datawarehouse.tgz.1        mytests.py
DimMonth.sql                dataqualitychecks.py               setupstagingarea.sh
FactBilling.sql             dbconnect.py                       star-schema.sql
billing-datawarehouse.tgz   generate-data-quality-report.py    verify.sql
Install Python Driver for PostgreSQL
python3 -m pip install psycopg2
# RESPONSE
...
Stored in directory: /home/theia/.cache/pip/wheels/7d/75/13/da1c6d88687ae81bf5e3cfa07d702981ba137963163472b050
Successfully built psycopg2
Installing collected packages: psycopg2
Successfully installed psycopg2-2.9.9
Update Password in Connection File
We need to update the password in the connection Python file. Look for dbconnect.py and open it to edit:
- set the PostgreSQL password on the corresponding line; here it is read from the PGPASSWORD environment variable exported earlier:
pgpassword = os.environ.get('PGPASSWORD')
Test db Connectivity
Check all of these:
- if the PostgreSQL Python driver is installed properly,
- if the PostgreSQL server is up and running, and
- if our micro framework can connect to the database.
$ python3 dbconnect.py
Successfully connected to warehouse
Connection closed
Here is a copy of the dbconnect.py file:
import os
import psycopg2

# Read the PostgreSQL password from the PGPASSWORD environment variable.
pgpassword = os.environ.get('PGPASSWORD')

conn = None
try:
    conn = psycopg2.connect(
        user = "postgres",
        password = pgpassword,
        host = "postgres",
        port = "5432",
        database = "billingDW")
except Exception as e:
    print("Error connecting to data warehouse")
    print(e)
else:
    print("Successfully connected to warehouse")
finally:
    # Close the connection if it was opened.
    if conn:
        conn.close()
        print("Connection closed")
Create Quality Report
Install Pandas
python3 -m pip install pandas tabulate
Create Sample Report
python3 generate-data-quality-report.py
Explore Data Quality
Open the file mytests.py. It provides a quick and easy way to author and run new data quality tests.
The testing framework provides the following tests:
- check_for_nulls - this test will check for nulls in a column
- check_for_min_max - this test will check if the values in a column are within a range of min and max values
- check_for_valid_values - this test will check for any invalid values in a column
- check_for_duplicates - this test will check for duplicates in a column
Each test is authored by providing a minimum of 4 parameters:
- testname - The human-readable name of the test, for reporting purposes
- test - The actual test name that the testing micro framework provides
- table - The table name on which the test is to be performed
- column - The column name on which the test is to be performed
Here are the contents of mytests.py:

from dataqualitychecks import check_for_nulls
from dataqualitychecks import check_for_min_max
from dataqualitychecks import check_for_valid_values
from dataqualitychecks import check_for_duplicates

test1 = {
    "testname": "Check for nulls",
    "test": check_for_nulls,
    "column": "monthid",
    "table": "DimMonth"
}

test2 = {
    "testname": "Check for min and max",
    "test": check_for_min_max,
    "column": "month",
    "table": "DimMonth",
    "minimum": 1,
    "maximum": 12
}

test3 = {
    "testname": "Check for valid values",
    "test": check_for_valid_values,
    "column": "category",
    "table": "DimCustomer",
    "valid_values": {'Individual', 'Company'}
}

test4 = {
    "testname": "Check for duplicates",
    "test": check_for_duplicates,
    "column": "monthid",
    "table": "DimMonth"
}
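The downloaded generate-data-quality-report.py is the script this lab actually runs. Purely to show how such a driver could pick up the testN dictionaries above, here is a hedged sketch; the calling convention in it (each check function receiving a cursor plus the remaining keys as keyword arguments) is an assumption, not the real API of dataqualitychecks.py.
# Hypothetical report driver; use the downloaded generate-data-quality-report.py
# for the lab. The check-function calling convention below is an assumption.
import os
import psycopg2
import mytests  # the test definitions edited in this lab

conn = psycopg2.connect(
    user="postgres",
    password=os.environ.get("PGPASSWORD"),
    host="postgres",
    port="5432",
    database="billingDW",
)
cursor = conn.cursor()

# Collect every attribute named test1, test2, ... from mytests.py.
tests = [getattr(mytests, name) for name in dir(mytests) if name.startswith("test")]

for test in tests:
    params = dict(test)                 # copy so the original definition stays untouched
    testname = params.pop("testname")
    check = params.pop("test")
    # Assumed convention: pass the cursor plus the remaining keys (column, table,
    # and any extras such as minimum/maximum or valid_values) to the check function.
    passed = check(cursor=cursor, **params)
    print(f"{testname} on {params['table']}.{params['column']}: "
          f"{'PASS' if passed else 'FAIL'}")

conn.close()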
Check for Nulls
Here is a sample test definition:
test1 = {
    "testname": "Check for nulls",
    "test": check_for_nulls,
    "column": "monthid",
    "table": "DimMonth"
}
All tests must be named test followed by a unique number that identifies the test.
- Give an easy-to-understand description for testname.
- Mention check_for_nulls for test.
- Mention the column name on which you wish to check for nulls.
- Mention the table name where this column exists.
Let us now create a new check_for_nulls test and run it. The test below checks if there are any null values in the column year in the table DimMonth.
- The test fails if nulls exist.
- Copy and paste the code below at the end of the mytests.py file.
test5 = {
    "testname": "Check for nulls",
    "test": check_for_nulls,
    "column": "year",
    "table": "DimMonth"
}
Execute Script
Now that the script is edited and saved, execute the python script
python3 generate-data-quality-report.py
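The real null check is implemented in the downloaded dataqualitychecks.py. Purely as an illustration, a check like test5 could boil down to counting NULLs in that column; the snippet below is a hedged sketch whose connection parameters mirror dbconnect.py, and the quoted table name assumes the schema uses mixed-case identifiers, as the test definitions suggest.
# Illustration only; the lab's null check lives in dataqualitychecks.py.
import os
import psycopg2

conn = psycopg2.connect(user="postgres", password=os.environ.get("PGPASSWORD"),
                        host="postgres", port="5432", database="billingDW")
cur = conn.cursor()
# Count NULLs in DimMonth.year; the test should pass only when the count is zero.
cur.execute('SELECT COUNT(*) FROM "DimMonth" WHERE year IS NULL')
null_count = cur.fetchone()[0]
print("Check for nulls on DimMonth.year:", "PASS" if null_count == 0 else "FAIL")
conn.close()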
Check for Min & Max
Here is a sample test definition:
test2 = {
    "testname": "Check for min and max",
    "test": check_for_min_max,
    "column": "monthid",
    "table": "DimMonth",
    "minimum": 1,
    "maximum": 12
}
In addition to the usual fields, you have two more fields here.
- minimum is the lowest valid value for this column (for example, 1 in the case of a month number).
- maximum is the highest valid value for this column (for example, 12 in the case of a month number).
- Let us now create a new check_for_min_max test and run it.
- The test below checks for a minimum of 1 and a maximum of 4 in the column quarter in the table DimMonth.
- The test fails if there are any values less than the minimum or more than the maximum.
- Add this code to the bottom of the file and execute it again:
test6 = {
    "testname": "Check for min and max",
    "test": check_for_min_max,
    "column": "quarter",
    "table": "DimMonth",
    "minimum": 1,
    "maximum": 4
}
python3 generate-data-quality-report.py
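Under the hood, a min/max check like test6 presumably counts values outside the allowed range. The query below is only a guess at that logic and can be run the same way as the null-check sketch earlier:
# Hypothetical query behind a min/max check on DimMonth.quarter (illustration only).
MIN_MAX_SQL = 'SELECT COUNT(*) FROM "DimMonth" WHERE quarter < 1 OR quarter > 4'
# test6 should pass only when this count is zero.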
Check for Invalid Entries
In addition to the usual fields, you have an additional field here.
- Use the field valid_values to mention the valid values for this column.
- Let us now create a new check_for_valid_values test and run it.
- The test below checks for valid values in the column quartername in the table DimMonth.
- The valid values are Q1, Q2, Q3, and Q4.
- The test fails if any value in the column falls outside the set of valid values.
- Add it to the file and execute it:
test7 = {
    "testname": "Check for valid values",
    "test": check_for_valid_values,
    "column": "quartername",
    "table": "DimMonth",
    "valid_values": {'Q1', 'Q2', 'Q3', 'Q4'}
}
python3 generate-data-quality-report.py
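Again as a guess at the underlying logic, a valid-values check like test7 could count rows whose quartername is outside the allowed set (NULLs are the job of the separate null check):
# Hypothetical query behind a valid-values check on DimMonth.quartername (illustration only).
VALID_VALUES_SQL = (
    'SELECT COUNT(*) FROM "DimMonth" '
    "WHERE quartername NOT IN ('Q1', 'Q2', 'Q3', 'Q4')"
)
# test7 should pass only when this count is zero.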
Check for Duplicates
- Let us now create a new check_for_duplicates test and run it.
- The test below checks for any duplicate values in the column customerid in the table DimCustomer.
- The test fails if duplicates exist.
test8 = {
    "testname": "Check for duplicates",
    "test": check_for_duplicates,
    "column": "customerid",
    "table": "DimCustomer"
}
python3 generate-data-quality-report.py
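For completeness, a duplicate check like test8 could be approximated by counting customerid values that occur more than once; as with the other sketches, this is an assumption about what dataqualitychecks.py does, not its actual code:
# Hypothetical query behind a duplicate check on DimCustomer.customerid (illustration only).
DUPLICATES_SQL = (
    'SELECT COUNT(*) FROM ('
    '  SELECT customerid FROM "DimCustomer"'
    '  GROUP BY customerid HAVING COUNT(*) > 1'
    ') AS dup'
)
# test8 should pass only when this count is zero.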