As a data engineer, I have created and developed many SQL statements by hand or using ETL tools to develop and maintain data pipelines and mappings using SCD methods in a data warehouse context. I was focused on Oracle databases and I was curious about how SCD with Type 2 will be implemented and run on a different Database like MariaDB. There are many cases where you have to leave your comfort zone and find new ways to achieve a solution in a different environment.
Beside that I wanted to create a short documentation for "experienced" SQL beginners who wants to start into this area and get an idea about the "traditional" data historization. 

Before you start, you need to know the basics about SCD type 2, so please refer to my shot introduction here.
Otherwise without further ado, I would like to share my experience of creating a simple SCD type 2 with MariaDB. 

My setup:

  • MariaDB on Ubuntu
  • phpAdmin as development environment
  • Sakila sample database

 

Sakila is a great sample database. It comes with tables and data that represents a DVD rental store. You can download the database from the MySQL site, it is working fine with MariaDB, thanks to the backward compatibility. It contains the DDL SQL structures in one and the data sets in a separat file. After creating successfully the objects, lets check quickly our new tables.

Listening 1:

First of all, it is a good style to study our new Sakila database and analyse how the objects are related to each other. Due to time constraints, I will skip the deep analysis part for now here, but keep in mind it is important to learn the data model for every project.

For this small tutorial, I chose the table inventory to start with the SCD type 2 historisation. My assumption was that I would later develop a BI report to track inventory movements over time and build small KPI's to analyse inventory trends and item cycles.

Lets see what is inside in inventory

Listening 2:

USE saklia;

SELECT count(*) FROM INVENTORY;
-- 4581

SELECT * FROM INVENTORY;

inventory

This table contains an auto increment Id (inventory_id) and is referencing to related objects (film and store) with foreign key constraints. It also has an last_update column, even the date is not perfect maintained. In a DWH project you need to specify the grain of the data that you want to report in the fact table. The grain is the level of detail that you want to measure. For now I'm fine with the inventory data, because I can count the number of film titles for each store on a specific date.

To store the historisation of the inventory table, I have to create a special table to store the values from the source but also additional information for tracking purpose. 

Listening 3:

--
-- Tabele `co_inventory`
--

CREATE TABLE IF NOT EXISTS `co_inventory` (
  `dwh_inventroy_id` int(11) NOT NULL AUTO_INCREMENT,
  `dwh_valid_from` date NOT NULL,
  `dwh_valid_to` date DEFAULT NULL,
  `dwh_inst_id` int(11) DEFAULT -1,
  `dwh_upd_id` int(11) DEFAULT -1,
  `dwh_is_active` enum('N','J') NOT NULL DEFAULT 'N' COMMENT 'Active record',
  `dwh_evt_active` enum('INS','UPD','DEL','') NOT NULL DEFAULT 'INS',
  `inventory_id` int(10) UNSIGNED NOT NULL DEFAULT 0,
  `film_id` int(10) UNSIGNED NOT NULL,
  `store_id` int(10) UNSIGNED NOT NULL,
  `last_update` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE current_timestamp(),
  PRIMARY KEY (`dwh_inventroy_id`) USING BTREE,
  UNIQUE KEY `idx_inventory_dwh:valid_from` (`inventory_id`,`dwh_valid_from`)
) ENGINE=InnoDB AUTO_INCREMENT=14268 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

The table co_inventory contains a surrogate key (dwh_inventory_id) as an auto-incrementing column as well as the additional columns for historisation tracking with the prefix 'dwh_'.

dwh_valid_to and dwh_valid_from will store the date how long the record is / was active. The column dwh_is_active is redundant to this it can have the flags 'N ' for not active or 'J' for active. It will be used later to select directly the active records.

dwh_evt_active will tell us which DDL was performed on this record (ins, upd, del). The flag 'del' tells us that the record does not exist in the source any more. Nothing is deleted in the DWH, it is just flagged the record as deleted.

Now, we have a target table to store and historize the source data. The next step is to calculate the 'delta' between the source and target table. I was a bit naive to think that MariaDB would work similarly to Oracle in many cases, but I have to start from scratch. Beginning with the circumstances that MairaDB dose not support a full out join. But no worries we can do an union of a left out join select  and a right outer join select from source and target table like this:

Listening 4:

with full_join as (select 
    src.inventory_id, src.film_id, src.store_id, src.last_update,
    trg.dwh_valid_from as trg_dwh_valid_from, trg.dwh_valid_to as trg_dwh_valid_to, trg.inventory_id as trg_inventory_id, trg.dwh_is_active as trg_dwh_is_active, trg.dwh_evt_active as trg_dwh_evt_active, trg.film_id as trg_film_id, trg.store_id as trg_store_id, trg.last_update as trg_last_update
    from inventory src left outer join co_inventory trg on (src.inventory_id = trg.inventory_id)
    union
    select 
    src.inventory_id, src.film_id, src.store_id, src.last_update,
    trg.dwh_valid_from as trg_dwh_valid_from, trg.dwh_valid_to as trg_dwh_valid_to, trg.inventory_id as trg_inventory_id, trg.dwh_is_active as trg_dwh_is_active, trg.dwh_evt_active as trg_dwh_evt_active, trg.film_id as trg_film_id, trg.store_id as trg_store_id, trg.last_update as trg_last_update
    from inventory src right outer join co_inventory trg on (src.inventory_id = trg.inventory_id)
),
inv as (
	select count(*) from inventory 
)
select * from full_join ;

If co_inventory is empty, it only contains data from the source like in this case. With some modificatioins on this SQL statment we are able to perform an initial load. At this point I considered writing a merge statement, but MariaDB does not support merge statements. But no worries, we can create a temporary table to track the changes and perform the historisation on the co_inventory table as well. So the next SQL statemens getting already the delta check.

Listening 5:

-- Preparing the temp table $flow_inventory
drop table if exists $flow_inventory;

-- Create the temp table $flow_inventory
create table $flow_inventory 
with full_join as (select 
    src.inventory_id, src.film_id, src.store_id, src.last_update,
    trg.dwh_valid_from as trg_dwh_valid_from, trg.dwh_valid_to as trg_dwh_valid_to, trg.inventory_id as trg_inventory_id, trg.dwh_is_active as trg_dwh_is_active, trg.dwh_evt_active as trg_dwh_evt_active, trg.film_id as trg_film_id, trg.store_id as trg_store_id, trg.last_update as trg_last_update
    from inventory src left outer join co_inventory trg on (src.inventory_id = trg.inventory_id)
    union
    select 
    src.inventory_id, src.film_id, src.store_id, src.last_update,
    trg.dwh_valid_from as trg_dwh_valid_from, trg.dwh_valid_to as trg_dwh_valid_to, trg.inventory_id as trg_inventory_id, trg.dwh_is_active as trg_dwh_is_active, trg.dwh_evt_active as trg_dwh_evt_active, trg.film_id as trg_film_id, trg.store_id as trg_store_id, trg.last_update as trg_last_update
    from inventory src right outer join co_inventory trg on (src.inventory_id = trg.inventory_id)
),
inv as (
	select count(*) from inventory 
),
delta_data as (
	select null      as dwh_inventory_id,
           date(sysdate()) as dwh_valid_from,
           date(str_to_date('31.12.9999','%d.%m.%Y')) as dwh_valid_to,
           trg_dwh_valid_from,
           trg_dwh_valid_to,
	       1910      as dwh_inst_id,
		   -1        as dwh_upd_id,
		   'J'       as dwh_is_active,
           trg_dwh_is_active,
		   'INS'     as dwh_evt_active,
           trg_dwh_evt_active,
		   inventory_id,
		   film_id,
		   store_id,
		   last_update,
           trg_inventory_id,
		   trg_film_id,
		   trg_store_id,
		   trg_last_update,
           case when (film_id <> trg_film_id)
		          or (store_id <> trg_store_id)
		          or (last_update <> trg_last_update)
           then 'UPD'
                when inventory_id is not null and trg_inventory_id is null  
           then 'INS'
                when inventory_id is null and trg_inventory_id is not null 
           then 'DEL'
           else 'EQU'
           end as inc
	  from full_join     
	  where 
	  (trg_dwh_valid_from is null and trg_dwh_valid_to is null) or sysdate() between trg_dwh_valid_from and trg_dwh_valid_to
)
select * from delta_data;

select * from $flow_inventory;

 

Listening 6:

$flow_inventory

As I mentioned already above the temp table contains information that can be used for the target table to historize according records. The column 'inc' is here the central drive for this task.

Listening 7:

--deactive old version of existing data
update co_inventory trg
inner join $flow_inventory src on (src.inventory_id = trg.inventory_id and src.inc = 'UPD' and sysdate() between trg.dwh_valid_from and trg.dwh_valid_to)
set
trg.dwh_valid_to = date(src.dwh_business_date)-1, 
trg.dwh_inst_id = -1, 
trg.dwh_upd_id = 1910, 
trg.dwh_is_active ='N', 
trg.dwh_evt_active = 'UPD';

--activ new version of existing data
insert into co_inventory 
select 
null, `dwh_business_date`, `dwh_valid_to`, `dwh_inst_id`, `dwh_upd_id`, `dwh_is_active`, `dwh_evt_active`, `inventory_id`, `film_id`, `store_id`, `last_update`
 from $flow_inventory where inc = 'UPD';


--new records
insert into co_inventory 
select 
null, `dwh_business_date`, `dwh_valid_to`, `dwh_inst_id`, `dwh_upd_id`, `dwh_is_active`, `dwh_evt_active`, `inventory_id`, `film_id`, `store_id`, `last_update`
 from $flow_inventory where inc = 'INS';
 
 
-- deactive deleted recods 
update co_inventory trg
inner join $flow_inventory src on (src.trg_inventory_id = trg.inventory_id and src.inc = 'DEL' and sysdate() between trg.dwh_valid_from and trg.dwh_valid_to)
set
trg.dwh_valid_to = date(src.dwh_business_date)-1, 
trg.dwh_inst_id = -1, 
trg.dwh_upd_id = 1910, 
trg.dwh_is_active ='N', 
trg.dwh_evt_active = 'DEL';

 

We have an interim result for now. In the next blog we will do a real run on co_inventory and test the scripts.