Thursday, March 13, 2014

Large-Scale Data Analysis - A Comparison

In this post I will analyze the paper "A Comparison of Approaches to Large-Scale Data Analysis".

This paper compares Hadoop with parallel databases on various factors. I will list some of the comparisons and provide my opinions on them.

1.) Schema Support: Parallel DBMSs store data in tables with a fixed schema, while Hadoop accepts unstructured data. According to the paper this is a drawback for Hadoop, as programs have to extract the data from files themselves, and the data in those files needs to be in the same format throughout.
I deviate from the claim made in the paper. Since a lot of data lives in flat files, Hadoop provides an efficient way to analyze it. Computer World states that unstructured data presently accounts for 70-80% of all data, while structured data accounts for only 20-30%. So Hadoop gives us a way of analyzing that 70-80% of data which cannot be analyzed using a parallel SQL DBMS.
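
To make this concrete, here is a minimal Python sketch of the kind of extraction code the paper has in mind: the map function imposes structure on flat-file input at read time. The log format and field names are invented for illustration:

    # A map function that parses structure out of a flat log file at read
    # time ("schema on read"). The log layout below is hypothetical.
    def map_fn(line):
        # e.g. "2014-03-13 10:41:07 GET /index.html 200"
        date, time, method, url, status = line.strip().split(" ")
        if status == "200":
            yield (url, 1)  # count one hit per successful request

    with open("access.log") as logfile:
        for line in logfile:
            for key, value in map_fn(line):
                print(key, value)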

2.) Flexibility: The paper argues that SQL has stored procedures, user-defined functions and aggregations, making it almost as flexible as MapReduce.
I don't agree with the authors. SQL procedures provide some flexibility, but stating that they are almost as flexible as MapReduce is a little optimistic. It is simpler to develop a MapReduce program than to write procedures and triggers in SQL. Also, due to the widespread use of MapReduce, there are predefined programs for many common tasks, making it simpler to use.
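
For example, the classic word-count job needs only a few lines of map and reduce logic. Below is a single-process Python sketch of that logic (not an actual Hadoop job), just to show how little code is involved:

    from collections import defaultdict

    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def run(lines):
        groups = defaultdict(list)            # the "shuffle": group by key
        for line in lines:
            for word, count in map_fn(line):
                groups[word].append(count)
        return [reduce_fn(word, counts) for word, counts in groups.items()]

    print(run(["the quick brown fox", "the lazy dog"]))
    # [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]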

3.) Joins: The paper shows results for joins. Below is a snapshot from the paper:
As seen from the figure, Vertica and DBMS-X take significantly less time than Hadoop. The authors used 100 machines for this analysis.
In my opinion, this example will always favor the parallel DBMSs, because Hadoop is inherently not designed for join queries. The other problem with this example is the use of only 100 machines. The authors say that most applications require around 100 nodes for processing, and so they used 100 nodes in this experiment. But there are many applications that require more than 100 nodes, and parallel DBMSs do not scale as well; had the authors used more than 100 nodes, the parallel DBMSs would likely have slowed down far more sharply.
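
To see why joins are awkward for MapReduce, consider the usual reduce-side (repartition) join: both inputs are re-keyed on the join attribute and shuffled across the network, which is exactly the data movement a parallel DBMS can avoid with co-partitioned tables. A minimal single-process Python sketch with made-up data:

    from collections import defaultdict

    # Reduce-side join: group rows from both inputs by the join key.
    users = [(1, "alice"), (2, "bob")]                    # (user_id, name)
    visits = [(1, "a.com"), (1, "b.com"), (2, "c.com")]   # (user_id, url)

    groups = defaultdict(lambda: ([], []))
    for user_id, name in users:
        groups[user_id][0].append(name)    # rows tagged as side 1
    for user_id, url in visits:
        groups[user_id][1].append(url)     # rows tagged as side 2

    # The "reduce" step: cross the two sides within each key group.
    joined = [(user_id, name, url)
              for user_id, (names, urls) in groups.items()
              for name in names for url in urls]
    print(joined)
    # [(1, 'alice', 'a.com'), (1, 'alice', 'b.com'), (2, 'bob', 'c.com')]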

But there were some points in the paper where I do agree with the authors. For example: there is a lot of overhead in data distribution over the network in Hadoop, which results in slower performance.
Another example: if there are 1,000 map instances and 500 reduce instances, each map writes 500 local output files, giving 500,000 local files in total. Each of the 500 reduce instances must then pull its input from every map's output, and several reduce instances can end up reading from the same map node at once, inducing a lot of disk seeks and increasing the time required.

To conclude, the paper provides interesting insights into Hadoop and parallel databases, even though some of its examples are ones for which MapReduce was not designed. Still, the paper does show us that MapReduce is not a silver bullet and can only solve a subset of database problems.

Reference:
A Comparison of Approaches to Large-Scale Data Analysis by Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden and Michael Stonebraker.

Wednesday, February 26, 2014

SQL vs NoSQL

In this post I will differentiate SQL and NoSQL systems on the basis of some features:

Scale to 1000s: This tells us about the scalability of the system - whether it can be scaled to 1000 or more machines. RDBMSs generally cannot be scaled that far. On the other hand, almost all NoSQL systems can be scaled to 1000 machines or more.

Joins/Analytics: This means combining records from one or more different data sources (I am generalizing the definition of a join, as NoSQL databases also cover flat files, JSON documents and so on, so saying just "tables" won't suffice). It is probably one of the most important operations in relational systems. RDBMSs can perform joins efficiently, but the same is not the case with NoSQL systems: they are inefficient when joins are required, and such queries can become prohibitively slow.
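
Since most NoSQL stores offer no join operator, the join logic moves into the application, typically as one lookup per record. A small Python sketch with made-up data:

    # Client-side "join" against a key-value style store: the application
    # fetches the related record for each row itself. Data is made up.
    orders = [
        {"order_id": 1, "customer_id": "c1", "total": 30},
        {"order_id": 2, "customer_id": "c2", "total": 45},
    ]
    customers = {
        "c1": {"name": "Alice"},
        "c2": {"name": "Bob"},
    }

    for order in orders:
        customer = customers[order["customer_id"]]   # one lookup per order
        print(order["order_id"], customer["name"], order["total"])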

Integrity Constraints: By definition, "database integrity ensures that data entered into the database is accurate, valid and consistent." Referential integrity means requiring the existence of a related row in another table.
These constraints are maintained in SQL databases. NoSQL databases, in contrast, are generally not consistent the moment they are updated, but most of them become consistent after a period of time (eventual consistency). For example, on Facebook, when someone updates their status it is not immediately reflected throughout the world; it takes some time to get reflected.

References: www.wikipedia.org
Scalable SQL and NoSQL Data Stores by Rick Cattell



Thursday, February 13, 2014

ER Model for Star Schema - Part 2

In the last post I introduced a new construct called Dimension (Dm). In this post I will go into its details.

I will first explain some existing constructs and why we can't use them.
1.) Aggregation: An aggregate entity instance is a collection of entities from one or more entity classes.
For example: attendants, residents and nurses can be combined to form an aggregate entity called Medical Team.
We can't use this approach to model a star or snowflake schema because the fact table is not a collection of instances from one or more dimension classes.
2.) Composite Class: A composite entity instance is composed of a set of members from another entity class.
For example: a combination of different types of employees forms an employee type.
We can't use it because the dimension table is not a specialization of the fact table.
3.) Typing and Instantiation Classes: These are generally used when we have to model instances of an entity type, which is not what we need here.

So, as we can see, there is no existing construct that can help us model the ER diagram efficiently, and so we need a new one.

Now, let's have a look at an example of a star schema modeled using the Dimension (Dm) approach.

If we create the ER model for it, district and region will be joined to market using the Dm (Dimension) construct, and the fact table will be modeled as a weak entity. Following is the ER diagram for it:


If required, we can show cardinalities in the ER diagram.

Degenerate Dimensions: We can model a degenerate dimension as a normal entity with only a primary key. Since the fact table is a weak entity, the primary key of the degenerate dimension becomes a foreign key of the fact table.

References: 
"The Relation of Star Schema into Entity-Relation Diagrams" by Dr. Krippendorf and Dr. Song
ER Diagram Slides by Dr. Currim

Tuesday, February 4, 2014

ER Modelling for Star Schema

I will first compare two approaches to ER modelling for star schemas, along with the deficiencies of each. Then I will present another way of modelling a star schema.

First Approach: The first approach is taken from the paper "From ER Models to Dimensional Models" by Dr. Moody and Dr. Kortink.
They suggested creating three new kinds of entities:
1.) Transaction entity: used to represent fact tables
2.) Component entity: used to represent dimension tables
3.) Classification entity: used to represent tables related to a dimension table
They created ER diagrams using these three entities and used a notation similar to crow's foot.
Shortcoming of this approach: it introduces a completely new model for ER diagrams instead of reusing the existing framework.

Second Approach: This approach is taken from the paper "The Translation of Star Schema into Entity-Relationship Diagrams" by Dr. Krippendorf and Dr. Song.
It uses the existing ER framework to model the star schema.
Dimension tables are converted into entities, and fact tables are represented either as relationships or as weak entities.
Shortcomings of this approach: 1.) Dr. Teorey, in the paper "A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Model", argues that only entities can have attributes. So fact tables cannot be represented as relationships, because fact tables have attributes.
2.) It does not give us the details inside a dimension. For example, a market dimension with the attributes market description, district and region would be depicted as a single entity. This loses the detail that district and region were originally separate entities.

My Approach: I will now present my approach, which aims at solving the problems we saw in the above two approaches.
Broadly speaking, the ER model of a star schema can be represented as an n-ary relationship, with many dimensions connected to a single fact through this relationship.

ER representation of the fact table:
1.) The fact table should always be converted to a weak entity.
2.) If the fact table has its own primary key, it should be shown as a partial identifier in the ER diagram.
3.) Foreign keys from the dimension tables are not shown in the fact entity.
For example:

In the figure shown on the left, SKU_ID and Store_id are foreign keys from the dimension tables, which will be shown as primary keys of the dimension entities.
Date, receipt_nbr and Receipt_line_Item form the primary key of the fact table.
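
To make this concrete, below is a relational sketch of the same fact table using SQLite from Python. The column types and the sales_dollars measure are my assumptions; the key structure follows the figure:

    import sqlite3

    # Relational sketch of the fact table from the example. Date,
    # receipt_nbr and receipt_line_item form the primary key; sku_id and
    # store_id are foreign keys to the dimension tables. Types and the
    # measure column are assumptions.
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE retail_sales_fact (
            sale_date         TEXT,
            receipt_nbr       INTEGER,
            receipt_line_item INTEGER,
            sku_id            INTEGER,   -- FK to the SKU dimension
            store_id          INTEGER,   -- FK to the store dimension
            sales_dollars     REAL,      -- measure (assumed)
            PRIMARY KEY (sale_date, receipt_nbr, receipt_line_item)
        )
    """)
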
ER Representation:

As seen in the ER diagram, Receipt Item, Receipt Number and Date are shown as partial identifiers, while the keys from the dimension tables are not shown.

Before modelling dimension tables, I will discuss super classes, sub classes and classification tables.

Super Class and Sub Class: In an ER diagram, a sub class inherits all the attributes of its super class. For example:

In the figure on the left, Manager and Associate will inherit Employee's attributes, such as name and address.

Classification Tables: Tables that make up a dimension table are called classification tables. For example, in the figure shown below, district and region would be classification tables and market would be the dimension table:

ER Modelling for Dimension Tables:
I propose a new construct called Dm (Dimension) to model dimension tables. While creating the ER diagram for a dimension table, the classification tables are linked to the dimension table using the Dm construct. This can be compared to how a super class is linked to its sub classes using the P construct.

Classification tables transfer all their non-key attributes to the dimension table, in the same way that a super class passes its attributes to its sub classes.
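
In relational terms this collapse is just a join of the classification tables into the dimension. Below is a sketch using SQLite from Python; any column names beyond market, district and region are my assumptions:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Classification tables (the snowflaked form).
        CREATE TABLE region   (region_id INTEGER PRIMARY KEY,
                               region_name TEXT);
        CREATE TABLE district (district_id INTEGER PRIMARY KEY,
                               district_name TEXT,
                               region_id INTEGER REFERENCES region);
        CREATE TABLE market   (market_id INTEGER PRIMARY KEY,
                               market_description TEXT,
                               district_id INTEGER REFERENCES district);

        -- The flattened dimension the Dm construct models: the non-key
        -- attributes of district and region transfer into market.
        CREATE TABLE market_dimension AS
        SELECT m.market_id, m.market_description,
               d.district_name, r.region_name
        FROM market m
        JOIN district d ON m.district_id = d.district_id
        JOIN region   r ON d.region_id   = r.region_id;
    """)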

Below is ER Representation of Market Dimension:


In this example, region name and district name will get transferred to the market table.


The arrow points toward the dimension table.

As seen, the only new construct introduced in my approach is Dm. Apart from that, my approach uses traditional ER modelling techniques.

In this article I have shown only basic ER modelling. I may cover this in more detail in coming posts, with some examples.

References:
"The Relation of Star Schema into Entity-Relation Diagrams" by Dr. Krippendorf and Dr. Song.
"From ER Models to Dimensional Models." by Dr. Moody and Dr. Kortink
"A Logical Design Methodology for Relational Databases - Using the Extended Entity" by Dr. Teorey, Dr. Yang and Dr. Fry
The Data warehouse toolkit by Dr. Kimball and Dr. Ross
ER Diagram Slides by Dr. Currim


Tuesday, January 28, 2014

Star Schema is completely de-normalized - Really??

The first thing we were taught was that a star schema is de-normalized. In this article I will examine whether a star schema really is not normalized at all, by analyzing it against each normal form:

First Normal Form: By definition, a relation is in 1 NF if and only if each tuple contains exactly one value for each attribute.
For star schema:
In a star schema every attribute is atomic, so we can say a star schema is in first normal form.

Second Normal Form: By definition, a relation is in 2 NF if:
1.) It is in 1 NF
2.) Every non-key attribute is fully functionally dependent on the entire primary key.
Dimension Table: Kimball says, "Every dimension table has a single primary key column." So all the attributes in a dimension table are fully functionally dependent on the primary key.
Fact Table: The fact table's primary key is the combination of the primary keys of the dimension tables related to it. The fact table consists of measurements which are dependent on the dimensions that make up the fact table. So the non-key attributes of the fact table are fully functionally dependent on the entire key. I have taken the case of a single fact table here, but the argument applies to the other fact tables as well.

Third Normal Form: By definition, a relation is in 3 NF iff:
1.) It is in 2 NF
2.) There are no transitive dependencies for non-key attributes.
Star Schema: This is where the star schema becomes de-normalized. Because a star schema is flat (more and more columns are collapsed together into its dimension tables), there are transitive dependencies in it. Hence, a star schema is not in 3 NF. For example:

In the figure above, a category consists of one or more brands, which in turn consist of one or more products. In this example, product determines brand, which in turn determines category, and hence there is a transitive dependency.
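
This dependency chain can be checked mechanically. Here is a small Python sketch over made-up rows:

    # Check product -> brand and brand -> category on sample rows.
    rows = [
        {"product": "Clean Fast", "brand": "PowerClean", "category": "Detergent"},
        {"product": "Zippy",      "brand": "PowerClean", "category": "Detergent"},
        {"product": "Choco Bar",  "brand": "SweetTooth", "category": "Candy"},
    ]

    def determines(rows, lhs, rhs):
        """True if every value of column lhs maps to exactly one value of rhs."""
        seen = {}
        for row in rows:
            if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
                return False
        return True

    # Both hold, so category depends transitively on the key (product).
    print(determines(rows, "product", "brand"))    # True
    print(determines(rows, "brand", "category"))   # True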

Now I would like to show that if we neglect the transitive dependencies for the time being and assume the star schema is in 3 NF, then it can be proven that the star schema is also in BCNF, 4 NF and 5 NF, as shown below:

Boyce-Codd Normal Form (BCNF): By definition, a relation is in BCNF if and only if every determinant in the relation is a candidate key.
Dimension Table: In a dimension table, the use of surrogate keys puts the table in BCNF. Kimball says, "Each one of the surrogate keys should be a simple integer, starting with one and going up to the highest number that is needed."
So, using the surrogate key we can uniquely identify any row, but using the non-key attributes we cannot determine the key, as the surrogate key's values are independent of the non-key attribute values.
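
Such a surrogate key is nothing more than a counter, which is what makes it independent of every non-key attribute. A minimal Python sketch with made-up product names:

    from itertools import count

    # Assign surrogate keys: simple integers starting at one, carrying no
    # information about the row's other attributes.
    next_key = count(start=1)
    product_dimension = [
        {"product_key": next(next_key), "product_name": name}
        for name in ["Clean Fast", "Zippy", "Choco Bar"]
    ]
    print(product_dimension)
    # [{'product_key': 1, 'product_name': 'Clean Fast'}, ...]
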
Fact Table: The fact table consists of measurements for the various dimensions. The measurements depend on the primary key but cannot uniquely determine it. For example, sales dollars and sales units cannot determine product_id or market_id in the fact table.
So we can say the star schema is in BCNF.

Fourth Normal Form: By definition, "a relation is in fourth normal form if it is in BCNF and it contains no multi-valued dependencies."
A multi-valued dependency arises when all of the following conditions are met:
1.) A relation has at least 3 attributes (let’s call the attributes A, B, C)
2.) For each value of A, there is a well-defined (multi-valued) set of values for B, and a well-defined (multi-valued) set of values for C.
3.) The set of values for B is independent of the set of values for C.
Dimension Table: Take the example of a hierarchical star schema: the first two conditions for a multi-valued dependency are met, but the third condition fails, as in a hierarchical star schema a child node's value is dependent on its parent's value. In fact, each dimension is arranged so that related attributes sit in a single table. So we can say there are no multi-valued dependencies, and hence it is in 4 NF.
Fact Table: A fact table contains measurements, and their values are independent of the other measurements. For example, sales dollars are independent of sales units (the quantity purchased). Now, one could argue that there can be another column, total price, which is dependent on unit price and quantity. But total price can be derived from unit price and quantity and adds no new information; moreover, including it is a design preference, so I have not taken it into account.
Now, the key of the fact table is a composite key consisting of the keys from the dimension tables. The primary keys of the dimension tables are independent of each other, as they are just integers starting from 1. Hence we do not have multi-valued dependencies, and so the fact table is in 4 NF.

Fifth Normal Form: This is based on join dependencies.
Dimension Table: Dimension tables have surrogate keys, so we do not have join dependencies in them.
Fact Table: The fact table has a composite key, but each part of the composite key is independent of the others, as argued above, so it is also in 5 NF.

So we can say that if the transitive dependencies are removed from a star schema, then it becomes normalized; in other words, we can generalize by saying that a snowflake schema is nothing but a star schema with no transitive dependencies.

The main purpose of this article is to show that a star schema is still normalized to some level. These are entirely my own views; kindly leave a comment on what you think about it.

References: 
Normalization slides provided by Prof. Currim
The Data Warehouse Toolkit by Kimball