Tuesday, January 28, 2014

Star Schema is completely de-normalized - Really??

The first thing we were taught was star schema is de-normalized. In this article I will examine if star schema is really not at all normalized. I will analyze star schema for each normal form:

First Normal Form: By definition a relation is in 1 NF if and only if each tuple contains exactly one value for each attribute.
For star schema:
In star schema each attribute is atomic. So, we can say star schema is in first normal form.

Second Normal Form: By definition, a relationship is in 2 NF if:
1.) It is in 1 NF
2.) Every non-key attribute is fully functionally dependent on the entire primary key.
Dimension Table: Kimball says "Every dimension table has single primary key column." So, all the attributes in dimension table are fully functionally dependent on primary key.
Fact Table: Fact table's primary key is combination of primary key from each dimension table which is related to that fact table. Fact table consists of measurements which are dependent on dimensions which make up that fact table. We can say, for separate fact table the non-key attribute in fact table is fully functional dependence on key attribute. I have here taken the case of separate fact table. But, it can be applied to other fact tables.

Third Normal Form: By definition a relationship is in 3 NF iff
1.) It is in 2 NF
2.) There are no transitive dependency for non-key attributes.
Star Schema: This is where star schema becomes de-normalized. As, star schema is flat (more and more columns are combined together). So, there are transitive dependencies in star schema. Hence, star schema is not in 3 NF. For ex:

In above figure, category consists of one or more brand which in turn consists of one or more products. In this example products determine brand which in turn determines category and hence there is transitive dependency.

Now, I would like to show that if we assume that star schema is in 3 NF and neglect transitive dependency for time being then it can be proven that star schema is also in BCNF, 4 NF and 5 NF as shown below:

Boyce-Codd Normal Form (BCNF): By definition a relation is in BCNF if and only if every determinant in the relation is a candidate key.
Dimension Table: In dimension table use of surrogate key makes the dimension tables in BCNF. Kimball says " Each one of the surrogate keys should be simple integer, starting with one and going up to the highest number that is needed." 
So, using surrogate key we can uniquely find out any row, but using non-key attribute we cannot determine key as surrogate key's values are independent of non-key attribute values.
Fact Table: Fact table consists of measurement for various dimension. The measurements are dependent on primary key but cannot uniquely determine the primary key.  For ex: sales dollars and sales units cannot determine product_id or market_id in the fact table.
So, we can say star schema is in BCNF.

Fourth Normal Form: By definition 'A relation is in fourth normal form if it is in BCNF and it contains no multi-valued dependencies.
 A multi-valued dependency come when all the following conditions are met:
1.) A relation has at least 3 attributes (let’s call the attributes A, B, C)
2.) For each value of A, there is a well-defined (multi-valued) set of values for B, and a well-defined (multi-valued) set of values for C.
3.) The set of values for B is independent of the set of values for C.
Dimension Table: We will take example of hierarchical star schema, the first two conditions for multi-valued attributes are met. But, the third condition fails as in hierarchical star schema child node's value is dependent on it's parent's value. In fact, each dimension is arranged so that related attributes are in single table. So, we can say there are no multi-valued dependency. Hence, it is in 4 NF. 
Fact Table: Fact table contains measurements and their values are independent of other measurements. For ex:
image
 Sales dollars (unit price) are independent of sales units (quantity) purchased. Now, we can argue that there can be another column total price which is dependent on unit price and quantity. But, total price can be determined from unit price and quantity and does not add any new information. Moreover, it's a design preference. So, I have not taken it into account.
Now, the keys of fact table is a composite key consisting of keys from dimensional table. Primary key of dimensional table are independent of each other as they are just integers starting from 1. Hence, we do not have multi-value dependency and so fact table is in 4 NF.

Fifth Normal Form: This is based on join dependency. 
Dimension Table: Dimension table has surrogate keys and so we do not have join dependencies in them.
Fact Table: Fact table has composite key. But, each part of composite key is independent of each other as proved above and so it is also in 5 NF. 

So, we can say if transitive dependency is removed from a star schema then it becomes normalized or in other words we can generalize the statement by saying that snow flake schema is nothing but star schema with no transitive dependency.

The main purpose of this article is to show that star schema is still normalized to some level. These were completely my views. Kindly leave comment on what you think about it. 

References: 
Normalization slides provided by Prof. Currim
The Data Warehouse Toolkit by Kimball