Snowflake schema illustrated

Tutorial

Snowflake schema illustrated

In database design, there are several data modeling techniques such as conceptual, logical, and physical. But there are two schema-types that are currently popular amongst software developers and the enterprise data warehouse (EDW) professionals. They are:

  1. Star schema
  2. Snowflake schema

In this tutorial, we will look at snowflake schema in detail, explaining how it has evolved from star and exactly where it differs.

Snowflake schema is an extension of star schema in a way; it separates itself from Star when it comes to handling large dimension tables. A star schema focuses on a centralized design with a fact table in it connecting to different dimension tables end to end.

>> Read our tutorial on Star schema

As shown in Figure 1, a Policy sales table connects to different dimension tables following the design pattern of a centralized fact table. If you want to arrive at agent wise policy- sales in terms of sum assured, it will have a ‘join’ to agent table, where we can see that agent table with channel information. If we want to arrive at zonal sales count of Policies, a ‘join’ to branch table is required. (See Figure 1: Star schema explained)

Where does the problem arrive?

The following two things can be observed:

  1. You need to connect the whole branch table even when you are asking for zone level information only.
  2. For single zone, for example, say ‘North’, there may be 200 or 300 branches. Therefore, to get North’s count of policies, we need to process all 300 rows!

Deficiency occurs when you are connecting to large dimension tables, resulting in consuming increased execution time of queries in a data warehouse. A large group of users executing similar kind of statements may result in low performance, and the purpose of having an online analytical processing (OLAP) in place fails because of this very reason. Also see: Quick facts on snowflake schema.

star schema illustrated

Figure 1: Star schema explained (Click on image to enlarge)

What leads to the problem?

If a dimension is sparse, that is, most of the possible values for the dimension have no data, and/ or a dimension has a long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and ‘snow flaking’ may be appropriate.

A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will hence be easier to implement. 

>> Also see Quick facts on snowflake schema

Resolution

A snowflake schema applies normalization over a star schema, in which very large dimension tables are normalized into multiple tables. Dimensions with hierarchies can be decomposed into a snowflake structure when you want to avoid joins to big dimension tables when you are using an aggregate of the fact table.

If a fact table contains a ‘1 to many’ relationship with each of the dimensions in a data warehouse schema, then it is appropriate to use a star schema.

Although if a fact table has a ‘many-to-many relationship’ with its dimensions, i.e. many rows in a fact equal many rows in a dimension, then you must resolve this using a snow flake schema where the bridge table contains a unique key to each row in the fact table.

For example, zonal sales or count of policy can easily be derived by having a join between zone table and Zonal sales table avoiding three way join.

 think of a retail system where you need to maintain large dimension tables like ‘Customers’ and join of sales table is taking place with customers for customer id. Having millions of records in ‘sales’ and ‘customers’ will adversely affect the performance when generating sales reports.

Generally, in snowflake, the third normal form is applied to get the separated dimension tables.

Creation of aggregate tables with such small dimension table solves the purpose of having a snowflake in place.

The only disadvantage of snowflake is that the design becomes complicated and increases the number of dimension tables leading to overheads for maintenance.

Figure 1 can be modified following a snowflake structure as shown in Figure 2: Snowflake structure below.

snowflake schema structure

Figure 2: Snowflake schema structure (Click on image to enlarge)

Snowflake schema: Quick facts 

  • The normalization of dimension tables tends to increase number of dimension tables or sub-dimension tables that require more foreign key joins when querying the data therefore reduce the query performance.
  • Snowflake schema helps in saving space by normalizing dimension tables.
  • The query of snowflake schema is more complex than query of star schema due to multiple joins from dimension table to sub-dimension tables.
  • It is more difficult for business users who use data warehouse system using snowflake schema because they have to work with more tables in a database than star schema.
  • By creating aggregate table(s) and joining it (them) to the required dimension table(s) improves performance by reducing the execution time.

<< Back to Top

 

About the author: Mohit Tayal is Senior Consultant - Technology at Mindcraft Software. He has more than four years of experience in BI and OLTP database design and deployment. Currently he is working on a project for BNP Paribas India Services at its Global Equity and Commodity Derivatives unit. Prior to this, he worked on projects at Birla Sunlife Insurance and Polaris Software Lab.

This was first published in September 2012