Thursday, March 13, 2014

Large-Scale Data Analysis - A Comparison

In this blog post I will analyze the paper "A Comparison of Approaches to Large-Scale Data Analysis".

The paper compares Hadoop with parallel databases on various factors. I will list some of the comparisons and also give my opinions on each.

1.) Schema Support: Parallel DBMSs store data in tables with a fixed schema, while Hadoop accepts unstructured data. According to the paper this is a drawback for Hadoop, since every program must extract the fields it needs from the input files itself, and the data in those files must stay in the same format throughout.
I would deviate from the claim made in the paper. Since a lot of data lives in flat files, Hadoop provides an efficient way to analyze it. Computerworld estimates that unstructured data currently accounts for 70-80% of all data, while structured data accounts for only 20-30%. So Hadoop gives us a way to analyze the roughly 80% of data that cannot be analyzed using parallel SQL systems.
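As a concrete illustration of this "schema-on-read" style, here is a minimal Hadoop mapper sketch that parses a hypothetical space-separated web log. The log format (`<ip> <timestamp> <url>`) and the class names are my own assumptions for illustration, not anything from the paper:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Structure is imposed at read time: each input line is a raw log
// record, and the parsing logic lives in user code rather than in a
// fixed table schema.
public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical space-separated log line: "<ip> <timestamp> <url>".
        String[] fields = line.toString().split("\\s+");
        if (fields.length >= 3) {
            // Emit the URL so a reducer can count hits per page.
            context.write(new Text(fields[2]), ONE);
        }
        // Malformed lines are simply skipped -- no schema enforcement.
    }
}
```

No table definition or load step is needed: the raw file can sit in HDFS as-is, and the mapper decides how to interpret it.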

2.) Flexibility: The paper argues that SQL has stored procedures, user-defined functions, and aggregates, making it almost as flexible as MapReduce.
I don't agree with the authors here. SQL procedures provide some flexibility, but stating that SQL is almost as flexible as MapReduce is a little optimistic. It is simpler to develop a MapReduce program than to write procedures and triggers in SQL. Also, due to the widespread use of MapReduce, predefined programs exist for many common tasks, making it even simpler to use.
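To give a sense of how little code a MapReduce task takes, here is the reducer that would pair with the mapper sketched above to count hits per URL. Again, this is my own illustrative sketch, not code from the paper:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Summing per-URL counts is a handful of lines of ordinary Java, with
// no stored-procedure dialect or trigger machinery to learn.
public class HitCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(url, new IntWritable(total));
    }
}
```

The whole job is plain Java, testable with ordinary tooling, which is part of why MapReduce feels more approachable than SQL procedural extensions.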

3.) Joins: The paper reports timing results for a join task. Below is a snapshot from the paper:
As seen in the figure, Vertica and DBMS-X take significantly less time than Hadoop. The authors used 100 machines for this analysis.
In my opinion, this example will always favor the parallel DBMSs, because Hadoop is inherently not designed for join queries. The other problem with this example is the use of only 100 machines. The authors say that most applications require around 100 nodes for processing, so they used 100 nodes in this experiment. But there are many applications that require more than 100 nodes, and parallel DBMSs do not scale well. Had the authors used more than 100 nodes, the parallel DBMSs would likely have taken disproportionately more time.
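To see why joins are painful in plain MapReduce, here is a minimal sketch of a reduce-side join. The two comma-separated inputs (a users file with `<id>,<name>` lines and an orders file with `<userId>,<amount>` lines) and all class names are hypothetical, not the paper's benchmark code:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A reduce-side join must be assembled by hand: each mapper tags its
// records with their source, and the reducer reconstructs the join.
public class JoinSketch {

    public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",", 2);
            if (f.length == 2) {
                ctx.write(new Text(f[0]), new Text("U:" + f[1])); // tag as user record
            }
        }
    }

    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable k, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",", 2);
            if (f.length == 2) {
                ctx.write(new Text(f[0]), new Text("O:" + f[1])); // tag as order record
            }
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> tagged, Context ctx)
                throws IOException, InterruptedException {
            List<String> users = new ArrayList<>();
            List<String> orders = new ArrayList<>();
            for (Text t : tagged) {
                String v = t.toString();
                if (v.startsWith("U:")) users.add(v.substring(2));
                else orders.add(v.substring(2));
            }
            // Emit matching pairs -- the work a DBMS join operator
            // would do automatically from a one-line SQL query.
            for (String u : users)
                for (String o : orders)
                    ctx.write(userId, new Text(u + "\t" + o));
        }
    }
}
```

In SQL the same join is a single line; in MapReduce the programmer must tag records by source, route them through the shuffle, and buffer one side in the reducer by hand.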

But there were some points in the paper where I do agree with the authors. For example, there is a lot of overhead in distributing data over the network in Hadoop, which results in slower performance.
Another example: if there are 1,000 map instances and 500 reduce instances, each map instance writes one local file per reduce partition, so there are 1,000 × 500 = 500,000 local files in total. Each of the 500 reduce instances must then pull its partition from every map node, and with many reducers reading from the same map node at once this induces a lot of disk seeks and increases the time required.
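The 500,000-file figure falls out of how the shuffle partitions map output. A sketch of the partitioning rule (this mirrors Hadoop's default hash partitioner) makes clear that every map task writes one local partition per reduce task:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Records with equal keys land in the same reduce partition. Each map
// task therefore writes numReduceTasks local partition files, so with
// M map tasks and R reduce tasks the shuffle materializes M * R local
// files (1,000 * 500 = 500,000 in the paper's example).
public class SketchPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```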

To conclude, I would say that the paper provides interesting insights into Hadoop and parallel databases, though some of its benchmarks are tasks for which MapReduce was never designed. Still, the paper does show us that MapReduce is not a silver bullet and can solve only a subset of database problems.

Reference:
A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. SIGMOD 2009.