Implement SCD-2 ETL Data Pipelines in Snowflake using Informatica Cloud (IICS)

Debi Prasad Mishra
5 min readAug 31, 2021

--

Slowly Changing Dimension is the technique for implementing the history and to capture the records in the dimension table. There are two commonly used SCD techniques for most of the use cases, SCD1 and SCD2.

SCD2: Slowly Changing Dimension Type 2 also known SCD Type 2 is one of the most commonly used type of Dimension table in a Data Warehouse. SCD type 2 will store the history of a given record in the dimension table. In SCD type 2 effective dates (such as start date & end date) and the current flag approach is the most prominent way now a days in any ETL applications. The concepts of SCD type 2 is — Identify the new records and insert them into the dimension table with Start Date as SYSDATE, End Date as NULL and Current Flag as “Y” (stands for active indicator). — Identify the changed records meaning those records already exist in the dimension table with old values, re-insert them into the table Start Date as SYSDATE, End Date as NULL and Current Flag as “Y” too. — Update the existing record in the table with End Date as SYSDATE and Current Flag as “N” for those records have been identified as value changed, in the above process.

I have explained the things in the traditional way using ETL logic, I hope it will help you to implement the same in your project.

Metadata Structures: Source and target table structures are shown below.

Steps for Data Pipeline

Login to IICS and select the Data Integration services. Click on New Asset-> Mappings-> Mapping

1: Drag source and configure it with source table.

2: Drag an expression. Connect with the source and include CUSTOMER_ID only. Create an expression port as output flg_DUMMY type as string(1). Configure it as ‘Y’.

3: Drag a lookup. Configure it with the target table and connect with the expression as mentioned above. Add the conditions:

4: Drag an expression. And connect both lookup and source to it as mentioned. It may ask to resolve the name conflicts, resolve them by using prefix or suffix. I did SQ_ as prefix and _LKP as suffix. Include only the fields that may change from the source such as SQ_EMAIL, SQ_ADDRESS. Add an output port o_SRC_Hashkey as type string(200) and configure it MD5(SQ_EMAIL||SQ_ADDRESS) this will be compared with the lookup HashKey value to check if there is any change. Add another output port o_LKP_Hashkey as type string(200) and configure it with the expression MD5(LKP_EMAIL||LKP_ADDRESS).

5: Drag an expression and connect to the previous expression as mentioned. Add an output port o_CDC_Flag as type string(1) and configure it as IIF(ISNULL(CUSTOMER_KEY_LKP), ‘I’, IIF(NOT ISNULL(CUSTOMER_KEY_LKP) AND o_SRC_Hashkey <> o_LKP_Hashkey,‘U’, NULL))

6: Add a router and connect the expression to router as mentioned. Create two groups Insert and Update. Configure Insert as o_CDC_Flag = ‘I’ and Update as o_CDC_Flag = ‘U’.

7: Drag an expression and connect the insert group to it. Only include source fields in it and add 3 output fields. o_START_DATE and o_LST_MODIFIED_TS both as type datetime and configure it SYSADTE. o_FLAG as string, configure it ‘Y’.

8: Drag an expression again and connect the update group to it and only include source fields in it and add 3 output fields. as mentioned at step 7.

9: Drag an expression and connect the update group to it. Only include lkp key i.e. CUSTOMER_KEY. Add 3 output ports. o_END_DATE and o_LST_MODIFIED_TS both as type datetime and configure it SYSADTE. o_FLAG as string, configure it ‘N’.

10: Create a sequence generator and connect with both the target instances as mentioned, it helps to generate the surrogate keys in the dimension table.

11: Drag target (INS_N). Connect the expression and sequence generator both. Connect with the target as mentioned. Map the target fields using field mapping tab.

12: Drag target (INS_E). Connect the expression and sequence generator both. Connect with the target as mentioned. Map the target fields using field mapping tab.

13: Drag target(UPD_E) and connect the expression to it as mentioned. Select the operation as Data Driven from the drop down. Make data driven condition as DD_UPDATE also select the Surrogate key(CUSTOMER_KEY) as the update column.

Execution step: Save and close the mapping. Click on New Asset -> Tasks -> Mapping Task. Input your Task Name and select Runtime Environment and choose the Mapping which you have created now. Click Next (Sequence) and Finish. The Mapping Task will be saved then run the task.

  • After first run, target table i.e. DIM_CUSTOMER will look as below
  • Pick one CUSTOMER_ID and make some changes on the record, then re-run the job to see the history records are being captured in the target table as per SCD2 logic we have defined.
  • As you can see now, CUSTOMER_KEY having 6 has been inserted as a new record and old record CUSTOMER_KEY as 5 has been flagged as ’N’.

Conclusion — Thank you for reading, I hope this blog will help you in implementing SCD2 logic in your project using IICS. You can reach out to me in case of more questions you have, on my twitter handle or my LinkedIn or leave a comment below. Good luck!

--

--

Debi Prasad Mishra
Debi Prasad Mishra

Written by Debi Prasad Mishra

Engineer by Profession, Student by Passion. Working as a Senior Business Intelligence Engineer. Snowflake Data Superhero 2022-23

No responses yet