# List of fields in MyDimDate
dateid
date
year
quarter
quartername
month
monthname
day
weekday weekdayname
Create DW for Solid Waste Co
In this project we’ll:
Develop dimension and fact tables to organize and structure data for analysis -> Design a DW
Employ SQL queries to create and load data into dimension and fact tables
Create materialized views to optimize query performance
Scenario
You are a data engineer hired by a solid waste management company. The company collects and recycles solid waste across major cities in the country of Brazil. The company operates hundreds of trucks of different types to collect and transport solid waste. The company would like to create a data warehouse so that it can create reports like:
- Total waste collected per year per city
- Total waste collected per month per city
- Total waste collected per quarter per city
- Total waste collected per year per truck type
- Total waste collected per truck type per city
- Total waste collected per truck type per station per city
Tasks
We’ll perform the following tasks using our understanding of dimensional modeling, SQL querying, and data warehousing concepts:
Task 1: Design the dimension table MyDimDate
Task 2: Design the dimension table MyDimWaste
Task 3: Design the dimension table MyDimZone
Task 4: Design the fact table MyFactTrips
Task 5: Create the dimension table MyDimDate
Task 6: Create the dimension table MyDimWaste
Task 7: Create the dimension table MyDimZone
Task 8: Create the fact table MyFactTrips
Task 9: Load data into the dimension table DimDate
Task 10: Load data into the dimension table DimTruck
Task 11: Load data into the dimension table DimStation
Task 12: Load data into the fact table FactTrips
Task 13: Create a grouping sets query
Task 14: Create a rollup query
Task 15: Create a cube query using the columns year, city, station, and average waste collected
Task 16: Create a materialized view named max_waste_per_station using the columns city, station, trucktype, and max waste collected
Design Star Schema DW
Data
Here is a partial view of the data provided:
MyDimDate - T1
Write down the fields in the MyDimDate table. The company is looking at a granularity of day, which means they would like to have the ability to generate the report on a yearly, monthly, daily, and weekday basis.
MyDimWaste - T2
Write down the fields in the MyDimWaste table
# List of fields in MyDimWaste
wastetypeid wastetype
MyDimZone - T3
Write down the fields in the MyDimZone table
# List of fields in MyDimZone
zoneid
zonename city
MyFactTrips - T4
Task 4: Design the fact table MyFactTrips
Write down the fields in the MyFactTrips table
# List of fields in MyFactTrips
tripid
wastetypeid
zoneid
dateid wastecollected
Create Tables Schema
We will use PostgreSQL and in particular, pgAdmin and create a database named Project, then create the following tables.
MyDimDate - T5
CREATE TABLE MyDimDate (
dateid INT PRIMARY KEY,
date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL,2) NOT NULL,
quarterName VARCHAR(
month INT NOT NULL,9) NOT NULL,
monthname VARCHAR(
day INT NOT NULL,
weekday INT NOT NULL,9) NOT NULL
weekdayName VARCHAR(; )
MyDimWaste - T6
CREATE TABLE MyDimWaste (
wasteid INT PRIMARY KEY,15) NOT NULL
wastetype VARCHAR(; )
MyDimZone - T7
CREATE TABLE MyDimZone (
zoneid INT PRIMARY KEY,25) NOT NULL,
zonename VARCHAR(15) NOT NULL
city VARCHAR(; )
MyFactTrips - T8
CREATE TABLE MyFactTrips (
tripid INT PRIMARY KEY,
wastetypeid INT NOT NULL,
zoneid INT NOT NULL,
dateid INT NOT NULL,
wastecollected FLOAT NOT NULL,
FOREIGN KEY (wasteid) REFERENCES MyDimWaste(wasteid),
FOREIGN KEY (zoneid) REFERENCES MyDimZone(zoneid),
FOREIGN KEY (dateid) REFERENCES MyDimDate(dateid); )
Change Schema
After the initial schema design, you were told that due to operational issues, data could not be collected in the format initially planned. This implies that the previous tables (MyDimDate, MyDimWaste, MyDimZone, MyFactTrips) in the Project database and their associated attributes are no longer applicable to the current design. The company has now provided data in CSV files with new tables DimTruck and DimStation as per the new design.
Data
- You will need to load the data provided by the company in CSV format.
- First, create a new database named FinalProject.
- Then, create the tables DimDate, DimTruck, DimStation, and FactTrips by defining the structure of the columns as per the CSV files.
- Next, load the data from the CSV files into the appropriate tables.
Note: Ensure that you upload the files to this path: /var/lib/pgadmin/
Recreate & Populate
DimDate - T9
Based on the data in the csv file, here is the new schema and query to create the table, view first 5 rows
CREATE TABLE DimDate (
dateid INT PRIMARY KEY,
date DATE NOT NULL,
Year INT NOT NULL,
Quarter INT NOT NULL,2) NOT NULL,
QuarterName VARCHAR(
Month INT NOT NULL,255) NOT NULL,
Monthname VARCHAR(
Day INT NOT NULL,
Weekday INT NOT NULL,255) NOT NULL
WeekdayName VARCHAR(; )
DimTruck -T10
Based on the data in the csv file, here is the new schema and query to create the table, view first 5 rows
CREATE TABLE DimTruck (
Truckid INT PRIMARY KEY,100) NOT NULL
TruckType VARCHAR(; )
DimStation - T11
Based on the data in the csv file, here is the new schema and query to create the table, view first 5 rows
CREATE TABLE DimStation (
Stationid INT PRIMARY KEY,100) NOT NULL
city VARCHAR(; )
FactTrips - T12
Based on the data in the csv file, here is the new schema and query to create the table, view first 5 rows
CREATE TABLE FactTrips (255) PRIMARY KEY,
Tripid VARCHAR(
Dateid INT NOT NULL,
Stationid INT NOT NULL,
Truckid INT NOT NULL,10, 2) NOT NULL,
Wastecollected DECIMAL(
FOREIGN KEY (Dateid) REFERENCES DimDate(dateid),
FOREIGN KEY (Stationid) REFERENCES DimStation(Stationid),
FOREIGN KEY (Truckid) REFERENCES DimTruck(Truckid); )
Aggregation Queries
Grouping Sets - T13
Create a grouping sets query using the columns stationid, trucktype, total waste collected.
- We are joining FactTrips ‘f’ with DimStation ‘p’ on their stationid to correlate waste collected for each trucktype
- Output is 58 rows
- You see towards the end of the table you see a row for the entire group of Scania
- One row for the entire group of Volvo, both of which are the TruckType
- The rows of () are the grand total rows which doesn’t group by any Stationid but returns the sum for all Stationids
SELECT
f.Stationid,
t.TruckType,
SUM(f.Wastecollected) AS TotalWasteCollected
FROM
FactTrips f
INNER JOIN= t.Truckid
DimTruck t ON f.Truckid
GROUP BY GROUPING SETS (
(f.Stationid, t.TruckType),
f.Stationid,
t.TruckType,
()
)
ORDER BY
f.Stationid,; t.TruckType
- To be more targeted and only group by the combination of the two use
- And we only get 37 rows
- Notice how we group by Stationid first then by TruckType
SELECT
f.Stationid,
t.TruckType,
SUM(f.Wastecollected) AS TotalWasteCollected
FROM
FactTrips f
INNER JOIN= t.Truckid
DimTruck t ON f.Truckid
GROUP BY GROUPING SETS (
(f.Stationid, t.TruckType),
()
)
ORDER BY
f.Stationid,; t.TruckType
- If we don’t use the ORDER BY clause we get:
Rollup Query - T14
Create a rollup query using the columns year, city, stationid, and total waste collected.
- Table is 8751 rows
SELECT
d.dateid,
s.city,
f.Stationid,
SUM(f.Wastecollected) AS TotalWasteCollected
FROM
FactTrips f
JOIN= s.Stationid
DimStation s ON f.Stationid
JOIN= d.dateid
DimDate d ON f.dateid
GROUP BY ; ROLLUP (d.dateid, d.year, s.city, f.stationid)
Cube Query - T15
Create a cube query using the columns year, city, stationid, and average waste collected.
- Output has 129 rows
SELECT
d.Year,
s.City,
f.Stationid,
AVG(f.wastecollected) AS AverageWasteCollected
FROM
FactTrips f
INNER JOIN= s.Stationid
DimStation s ON f.Stationid
INNER JOIN= d.dateid
DimDate d ON f.dateid
GROUP BY ; CUBE (d.Year, s.City, f.Stationid)
Materialized View
Max_waste_stats
Create a materialized view named max_waste_stats using the columns city, stationid, trucktype, and max waste collected.
CREATE MATERIALIZED VIEW max_waste_stats AS
SELECT
s.City,
f.Stationid,
t.TruckType,
MAX(f.wastecollected) AS MaxWaste
FROM
FactTrips f
JOIN= s.Stationid
DimStation s ON f.Stationid
JOIN= t.Truckid
DimTruck t ON f.Truckid
GROUP BY
s.City,
f.Stationid,
t.TruckType; WITH DATA
Refresh View
Refresh the view and print it out
; REFRESH MATERIALIZED VIEW max_waste_stats
* FROM max_waste_stats; SELECT
- Output is 36 rows