Junk Dimensions contains a combination of low-cardinality attributes that may or may not belong to any other dimension. It can act as a definition table for the combination of different values and can be added to our reporting layer for better performance in defining the dimension attributes and it can also play an important role in managing multi-columnar relationship databases.
Junk Dimensions can be used in multiple scenarios, one of them is when we have the relationship of multiple columns between different tables and when we are building a Dimensional Model.
Let’s say we have multiple different schemas, and each schema has a parent table that may or may not be related with other schema’s parent table through different number of columns. Let’s take 4 different tables as an example and assume that they are related with each other through 4 different columns.
Here we have the MasterOrders table related to DoctorsOrder table with 5 different columns which we can say is our Layer 1. To use this in a reporting tool, Power BI, for example, we need to define a simplified column that would define the uniqueness. As in Power BI, the joins can be made but there will be some columns that will have many to many relationships between them and some might have the one-to-many sided relationships and when we make the multiple relationships between these tables the performance of our reports is affected.
That’s where the Junk Dimension concept comes in, with Junk Dimension we can combine the combination of values and create a unique key on that and then can retrieve in the other tables later with SQL during our Data Warehousing.
We also need to see where there is more uniqueness of the combination of multiple columns exists. In our example, let’s say we have combination of 2M records in DoctorOrder and 5M in MasterOrder.
So, we will be creating the junk dimension based on the MasterOrder.
WITH Temp_OrderKey
AS (
SELECT DISTINCT Order_id
,location_id
,order_location_id
,order_type_id
,order_status_id
FROM MasterOrder
WHERE order_id IS NOT NULL
AND location_id IS NOT NULL
AND order_location_id IS NOT NULL
AND order_type_id IS NOT NULL
AND order_status_id IS NOT NULL
)
SELECT Order_id
,location_id
,order_location_id
,order_type_id
,order_status_id
,ROW_NUMBER() OVER (
ORDER BY order_id
,location_id
,order_location_id
,order_type_id
,order_status_id
) AS Order_Key
FROM Temp_OrderKey
After our Junk Dimension is created, we can now make our joins within SQL and only retrieve the OrderKey in our Facts and Dimension tables for simplified joins.
At this point we can call this as our Layer 2, we have OrderJunk in our layer 2 which is related with MasterOrders and DoctorOrders through the 5 columns (order_id, location_id, order_location_id, order_type_id, order_status_id). The SQL to retrieve the Order_Key in MasterOrders and DoctorOrders would be:
—-For DoctorOrders—-
SELECT a.doctor_id
,b.order_key
FROM DoctorOrders a
LEFT JOIN OrderJunk b ON a.order_id = b.order_id
AND a.location_id = b.location_id
AND a.order_location_id = b.order_location_id
AND a.order_type_id = b.order_type_id
AND a.order_status_id = b.order_status_id
—-For MasterOrders—-
SELECT a.order_status_id
,b.order_key
FROM MasterOrders a
LEFT JOIN OrderJunk b ON a.order_id = b.order_id
AND a.location_id = b.location_id
AND a.order_location_id = b.order_location_id
AND a.order_type_id = b.order_type_id
AND a.order_status_id = b.order_status_id
After creating our Junk Dimension and adding it in our MasterOrders and DoctorOrders tables we have a clean data model that looks like below.
Gulfam Pervaiz
Consultant