Implement SCD-2 ETL Data Pipelines in Snowflake using Merge Statement

Debi Prasad Mishra
4 min read · Sep 2, 2021


The Merge statement in SQL is a popular clause that can handle inserts, updates, and deletes in a single transaction, without separate logic for each operation. You specify the conditions under which the merge statement should insert, update, or delete, which gives you great flexibility when customizing complex SQL scripts for real-world use cases. The merge statement modifies an existing table based on the result of a comparison between its key fields and those of another table. The most common use case is implementing Slowly Changing Dimensions (SCD) in a data warehouse. In that scenario, you need to insert new records into the warehouse, remove or flag records in the warehouse that no longer exist in the source, and update the values of warehouse records that have been updated in the source.

Syntax: First, you specify the target table and the source table in the Merge statement.

Second, the merge_condition determines how the rows from the source table are matched with the rows from the target table. Typically, you use key columns (a primary key or a unique key) for matching.

Third, the merge_condition results in three states: "Matched, Not Matched, and Not Matched By Source".

  • Matched: these are the rows that match the merge condition. For the matching rows, you update the columns in the target table with values from the source table.
  • Not Matched: these are the rows from the source table that do not have any matching rows in the target table. In this case, you insert the rows from the source table into the target table.
  • Not Matched By Source: these are the rows in the target table that do not match any rows in the source table. If you want to synchronize the target table with the data from the source table, you use this condition to delete rows from the target table. (Note that Snowflake's MERGE syntax supports only the MATCHED and NOT MATCHED clauses; the "not matched by source" case is typically handled with a separate DELETE or UPDATE statement.)
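Putting the pieces together, the general shape of a Snowflake merge looks like the following sketch (all table and column names here are placeholders, not from the example that follows):

```sql
MERGE INTO target_table t
USING source_table s
    ON t.key_col = s.key_col              -- the merge_condition on key columns
WHEN MATCHED THEN                         -- row exists in both: update it
    UPDATE SET col1 = s.col1,
               col2 = s.col2
WHEN NOT MATCHED THEN                     -- row exists only in source: insert it
    INSERT (key_col, col1, col2)
    VALUES (s.key_col, s.col1, s.col2);
```

Each WHEN clause can also carry an additional AND predicate, which is what makes MERGE flexible enough to express SCD-2 logic, as shown below.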

Hands-on Example:

The records below have been loaded into the source table for demonstration -
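The original screenshots are not reproduced here; a minimal setup along the same lines might look like the following sketch. All table names, column names, and sample values are assumptions, except customer id 885, which the walkthrough below refers to:

```sql
-- Source (staging) table holding the latest snapshot from the operational system
CREATE OR REPLACE TABLE SRC_CUSTOMER (
    CUSTOMER_ID   NUMBER,
    CUSTOMER_NAME VARCHAR,
    ADDRESS       VARCHAR
);

-- Target SCD-2 dimension, with validity metadata for history tracking
CREATE OR REPLACE TABLE DIM_CUSTOMER (
    CUSTOMER_ID   NUMBER,
    CUSTOMER_NAME VARCHAR,
    ADDRESS       VARCHAR,
    START_DATE    TIMESTAMP_NTZ,
    END_DATE      TIMESTAMP_NTZ,   -- NULL while the row is current
    IS_CURRENT    BOOLEAN
);

-- Hypothetical sample records
INSERT INTO SRC_CUSTOMER VALUES
    (885, 'John Doe', '12 Oak Street'),
    (886, 'Jane Roe', '34 Pine Avenue');
```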

Run the query below to see the change in the dimension table -
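A common way to express the SCD-2 load, assuming the hypothetical tables sketched above, is a merge that expires changed rows and inserts brand-new customers, followed by an insert that creates the new version of each expired row:

```sql
-- Step 1: expire the current row when the address changed, and insert new customers
MERGE INTO DIM_CUSTOMER d
USING SRC_CUSTOMER s
    ON d.CUSTOMER_ID = s.CUSTOMER_ID
   AND d.IS_CURRENT = TRUE
WHEN MATCHED AND d.ADDRESS <> s.ADDRESS THEN
    UPDATE SET END_DATE = CURRENT_TIMESTAMP(),
               IS_CURRENT = FALSE
WHEN NOT MATCHED THEN
    INSERT (CUSTOMER_ID, CUSTOMER_NAME, ADDRESS, START_DATE, END_DATE, IS_CURRENT)
    VALUES (s.CUSTOMER_ID, s.CUSTOMER_NAME, s.ADDRESS,
            CURRENT_TIMESTAMP(), NULL, TRUE);

-- Step 2: insert the new version for customers whose current row was just expired
INSERT INTO DIM_CUSTOMER
SELECT s.CUSTOMER_ID, s.CUSTOMER_NAME, s.ADDRESS,
       CURRENT_TIMESTAMP(), NULL, TRUE
FROM SRC_CUSTOMER s
WHERE NOT EXISTS (
    SELECT 1 FROM DIM_CUSTOMER d
    WHERE d.CUSTOMER_ID = s.CUSTOMER_ID
      AND d.IS_CURRENT = TRUE
);
```

The two statements together keep exactly one IS_CURRENT = TRUE row per customer while preserving every prior version as history.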

After the merge statement executes successfully, the target table will look like this -

Then, let’s update a source record to see how the history is captured in the target table.

First, run the merge statement as shown above. Once it completes, execute a select on the target table; the updated source record (customer id = 885, whose address value changed) should be captured in the target table per the ETL logic defined in the merge statement.
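Continuing with the hypothetical tables sketched earlier, the change and the verification might look like this:

```sql
-- Simulate a change in the operational system: customer 885 moves
UPDATE SRC_CUSTOMER
SET ADDRESS = '99 Maple Boulevard'   -- assumed new value for illustration
WHERE CUSTOMER_ID = 885;

-- Re-run the merge and insert from the previous step, then inspect the history
SELECT CUSTOMER_ID, ADDRESS, START_DATE, END_DATE, IS_CURRENT
FROM DIM_CUSTOMER
WHERE CUSTOMER_ID = 885
ORDER BY START_DATE;
-- Expect two rows: the old address with IS_CURRENT = FALSE and a populated
-- END_DATE, plus the new address with IS_CURRENT = TRUE and END_DATE NULL.
```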

Performance Optimization: First, you need to shape the merge statement to meet all of the required conditions. A few general guidelines are listed below, though the right approach depends on individual optimizations on the source or target side. The possible optimizations for a merge query are not limited to the ones mentioned here; tuning is often a matter of trial and error, specific to the scenario.

1- Analyze the shape and size of the datasets that need to be merged.
2- Both source and target should have unique or primary key constraints.
3- The join condition should include those KEY columns.
4- The KEY columns should ideally have an Integer/Number/Numeric data type and be identical in both source and target.
5- The KEY columns should never contain NULLs at any point in time.
6- Create a unique index on the source table's KEY column.
7- When the source is smaller than the target, a nested loops operator is preferable.
8- When the source and target are of similar size and indexes are applied to both tables, a merge join operator is the most efficient.

Conclusion: Thank you for reading. I hope this blog helps you design SCD-2 logic using the merge statement in Snowflake. If you have more questions, you can reach me on my Twitter handle or my LinkedIn, or leave a comment below. Good luck!



Written by Debi Prasad Mishra

Engineer by Profession, Student by Passion. Working as a Senior Business Intelligence Engineer. Snowflake Data Superhero 2022-23
