Snowflake Performance Tuning Checks Towards Query Optimization - Part 2
This post continues Query Optimization - Part 1, which covered Snowflake's internal memory architecture and caching concepts. Please read Part 1 before moving on with this post. Below are a few design recommendations to keep in mind while implementing data pipelines.
1- Storing Semi-structured Data in a VARIANT Column: For data that is mostly regular and uses only native data types (such as strings and integers), the storage requirements and query performance for relational data and for data in a VARIANT column are very similar. Load the data set into a VARIANT column in a table, then use the FLATTEN function to extract the objects and keys you plan to query into a separate table.
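A minimal sketch of this pattern, assuming a hypothetical raw_events table whose JSON payload contains an items array:

```sql
-- Land the raw JSON in a VARIANT column.
CREATE OR REPLACE TABLE raw_events (
    event_payload VARIANT
);

-- Extract the objects and keys you plan to query into a separate, typed table.
CREATE OR REPLACE TABLE events_flattened AS
SELECT
    e.event_payload:"event_id"::NUMBER   AS event_id,
    e.event_payload:"event_type"::STRING AS event_type,
    f.value:"item_id"::NUMBER            AS item_id,
    f.value:"quantity"::NUMBER           AS quantity
FROM raw_events e,
     LATERAL FLATTEN(input => e.event_payload:"items") f;
```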
2- Date/Time Data Types for Columns: When defining columns to contain dates or timestamps, choose a DATE or TIMESTAMP data type rather than a character or VARCHAR data type. Snowflake stores date and timestamp data more efficiently than VARCHAR, which generally results in better query performance.
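For illustration, a hedged before/after sketch with hypothetical orders tables:

```sql
-- Avoid: dates stored as text.
CREATE OR REPLACE TABLE orders_text (
    order_id   NUMBER,
    order_date VARCHAR(10)   -- e.g. '2023-01-15'
);

-- Prefer: native date/timestamp types.
CREATE OR REPLACE TABLE orders_typed (
    order_id   NUMBER,
    order_date DATE,
    created_at TIMESTAMP_NTZ
);
```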
3- Set a Clustering Key for Larger Datasets: Specifying a clustering key is not necessary for most tables. In general, Snowflake produces well-clustered data through automatic tuning via its optimization engine and micro-partitioning. Consider setting a clustering key for data sets larger than 1 TB, and when the query profile in the web UI indicates that a significant percentage of the total duration is spent scanning. A clustering key can contain one or more columns (Snowflake suggests a maximum of four). Analyzing the problem queries to find the right clustering key column is the most important step. The SQL snippet below can help identify potential performance issues in queries that run for more than 2 minutes.
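A sketch of such a check against the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view; the seven-day window and result limit are arbitrary choices, and access to the ACCOUNT_USAGE schema is required:

```sql
-- Queries running longer than 2 minutes, with how much of the table they scanned.
-- A consistently high scanned/total partition ratio may suggest a clustering key.
SELECT
    query_id,
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 AS elapsed_seconds,
    partitions_scanned,
    partitions_total
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND total_elapsed_time > 120000   -- more than 2 minutes (milliseconds)
  AND execution_status = 'SUCCESS'
ORDER BY total_elapsed_time DESC
LIMIT 50;
```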
4- Use Transient Tables as Needed: Snowflake supports the creation of transient tables. Snowflake does not maintain a Fail-safe period for these tables, which can result in a measurable reduction of your Snowflake storage costs.
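A minimal example, assuming a hypothetical staging table holding intermediate data that can be reloaded if lost:

```sql
-- Transient table: no Fail-safe period; Time Travel retention set to 0 days here.
CREATE OR REPLACE TRANSIENT TABLE stg_daily_load (
    load_id   NUMBER,
    loaded_at TIMESTAMP_NTZ,
    payload   VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 0;
```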
5- Maintain a Dashboard for Snowflake Usage: Snowflake provides usage dashboards, such as Snowsight dashboards, to visualize your worksheet query results as charts in the web UI. Consider building visualizations for the cost incurred by the account and other usage metrics under the Account section. These account-level usage views are available to the ACCOUNTADMIN role and can also be used for audit trails.
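As a starting point for such a dashboard tile, a hedged example query against the SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view (the 30-day window is an assumption):

```sql
-- Credits consumed per warehouse over the last 30 days.
SELECT
    warehouse_name,
    SUM(credits_used) AS total_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
```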
6- Leverage Materialized Views: As in other RDBMS systems, materialized views are used to improve query performance when the same sets of queries are run repeatedly (a sketch follows the list below). Materialized views are helpful when:
- The query results require significant processing
- The results don't change often and are used often
- The query consumes a lot of resources, for example through joins, subqueries, and expressions
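A minimal sketch, assuming a hypothetical sales table and an edition that supports materialized views (Enterprise Edition or higher):

```sql
-- Pre-aggregate a frequently run, expensive rollup as a materialized view.
CREATE OR REPLACE MATERIALIZED VIEW mv_daily_sales AS
SELECT
    sale_date,
    region,
    SUM(amount) AS total_amount,
    COUNT(*)    AS order_count
FROM sales
GROUP BY sale_date, region;
```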
7- Join Order Criteria: Snowflake does not expose the optimizer hints found in other databases to control the order in which joins are performed. In general, the SQL query optimizer chooses the right order for joining tables; in some rare situations, it cannot identify the join ordering that would result in the fastest execution. Joins can therefore lead to very slow query performance under conditions such as:
- Joining columns have many NULL or duplicate values
- Joining columns are string, alphanumeric, or date data types
- Joining columns are calculations, expressions, or function results
- Records from one table match multiple records in the joined table, causing the join to explode (effectively a Cartesian product) during execution
- A LIKE operator is applied to columns, including in the join condition
Proper modeling of the data in the warehouse will minimize the occurrence of exploding joins. There are also techniques to work around the performance problem, such as rewriting the query with a subquery to avoid the problematic join. For any calculation or expression in a join condition, a good habit is to pre-compute it into a temporary table first. Columns with numeric data types are generally a better choice for join conditions. Consider the following example:
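A hedged illustration with hypothetical orders and customers tables, where the join key needs a cleanup expression:

```sql
-- Anti-pattern: expression evaluated inside the join condition itself, e.g.
--   ... JOIN customers c
--         ON UPPER(TRIM(o.customer_code)) = UPPER(TRIM(c.customer_code))

-- Pre-compute the cleaned key once into temporary tables, then join plain columns.
CREATE OR REPLACE TEMPORARY TABLE tmp_orders AS
SELECT o.*, UPPER(TRIM(customer_code)) AS customer_code_clean
FROM orders o;

CREATE OR REPLACE TEMPORARY TABLE tmp_customers AS
SELECT c.*, UPPER(TRIM(customer_code)) AS customer_code_clean
FROM customers c;

SELECT tc.customer_name,
       COUNT(*) AS order_count
FROM tmp_orders tor
JOIN tmp_customers tc
  ON tor.customer_code_clean = tc.customer_code_clean
GROUP BY tc.customer_name;
```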
In reality, the best approach depends on the problem itself and is situation specific: the bottleneck may sit on the data ingestion side, in the transformation steps, or in end-user queries. Some of the most effective solutions come from the design approach rather than query tuning alone. Scaling up to a larger warehouse is not always a good solution, nor a good strategy. Rather than reaching for every Snowflake feature, first identify and focus on the actual problem statement.
Conclusion: Thank you for reading. I hope this blog helps you gain a basic understanding of Snowflake performance tuning activities and their measures, whether through Snowflake components or your design approach. If you have more questions, you can reach out to me on my Twitter handle or LinkedIn, or leave a comment below. Good luck!
References: Snowflake online documentation
Clustering Keys & Clustered Tables — Snowflake Documentation