To understand what's coming, we first have to understand our past, and how Big Data differs fundamentally from what we were used to. Data generation has grown rapidly around the world: social networking sites, mobile applications, sensor devices, smart meters, and many of the everyday devices we now use all create data. In the 1990s, data took the form of spreadsheets and soft copies, and confidential data resided in relational database applications. Over the last decade, the following firms have become major contributors of data, as described in the table below:
| Firm | Description | Founded |
|---|---|---|
| Facebook | Social networking service | February 2004 |
| Twitter | Social networking service | March 2006 |
| LinkedIn | Business-oriented social networking service | December 2002 (launched May 2003) |
| Pinterest | Photo sharing website | March 2010 |
| StumbleUpon | Discovery engine that finds and recommends web content to users | 2001 |
| Google+ | Social networking layer for Google | June 2011 |
Facebook is the most popular social media application in terms of data generation: it grew from one million users in 2005 to more than one billion in 2012, a thousand-fold increase in less than eight years. The value generated by a social networking site is proportional to the number of connections among its users rather than to the number of registered users.
According to Metcalfe's Law and its variants, the value of a network with N users is proportional to N² (or, in a widely cited refinement, to N log N). Thus, the growth of data on social networking sites is driven by the interactions among users of those applications, and the popularity of the internet is the main reason for the growth of communication and interconnectivity in the world. Google was founded in 1998 with the goal of organizing all the information in the world, and it became the dominant content search platform on the World Wide Web. Google faced challenges, such as crawling, indexing, storing, and serving billions of web pages, that existing data management systems could not handle.
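The difference between these growth laws can be made concrete with a short sketch. The function names below are illustrative, not from any library: one computes the original N² formulation of Metcalfe's Law, the other the N log N refinement.

```python
import math

def metcalfe_value(n):
    # Original Metcalfe's Law: network value grows proportionally to n^2,
    # since each of n users can in principle connect to every other user.
    return n * n

def odlyzko_value(n):
    # Refined variant (due to Odlyzko and colleagues): value grows as
    # n * log(n), reflecting that most users interact with few contacts.
    return n * math.log(n)

# Compare how the two estimates diverge as the network grows.
for users in (1_000, 1_000_000, 1_000_000_000):
    print(f"{users:>13,}  n^2={metcalfe_value(users):.3e}  "
          f"n*log(n)={odlyzko_value(users):.3e}")
```

Either way, the value (and hence the data generated) grows faster than the raw user count, which is why a thousand-fold increase in users means far more than a thousand-fold increase in interactions.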
The amount of publicly available data in the Google search index exploded from 26 million web pages in 1998 to roughly one trillion in less than a decade. Moreover, much of this content was semi-structured or unstructured: images, videos, and PDF files. Google had to develop its infrastructure from scratch. In 2003, it published the idea of the Google File System (GFS), a fault-tolerant, distributed file system that runs on commodity hardware. In December 2004, Google Labs published a paper introducing MapReduce, a parallel processing framework that enables distributed programming. These two systems became the blueprint for Apache Hadoop.
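The MapReduce model itself is simple enough to sketch in a few lines. The following is a minimal, in-memory word-count illustration of the map, shuffle, and reduce phases; it is not Google's or Hadoop's implementation, and all names here are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each input split is processed independently (hence in parallel
    # on a real cluster), emitting (key, value) pairs.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: the framework groups all values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: each key's values are aggregated into a final result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data at scale"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
```

Because the map and reduce steps touch each record independently, the framework can scatter them across thousands of machines, which is exactly what made indexing billions of pages tractable.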
Doug Cutting is the creator of two open-source projects, Apache Lucene and Apache Nutch; Nutch is used to crawl web pages on the World Wide Web. Inspired by Google's papers, he implemented the MapReduce model together with a distributed file system in Nutch. In 2006, Cutting joined Yahoo!, and this work was spun out into an open-source project called Apache Hadoop; the journey of Apache Hadoop starts from there. Apache Hadoop is an open-source distributed computing platform that enables software applications to run on commodity hardware. Hadoop has two layers: a storage layer, HDFS, for storing large and complex datasets, and a computational layer provided by MapReduce. For more information, you can join a Hadoop and Big Data course if you want to master this revolutionary software framework.
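The division of labor between the two layers can be seen in a Hadoop Streaming-style word count, where the mapper and reducer are plain scripts exchanging tab-separated lines. This is a minimal local simulation, not a real Hadoop job: the sample input is made up, and `sorted()` stands in for the cluster-wide shuffle/sort that Hadoop performs between the phases over data stored in HDFS.

```python
from itertools import groupby

def mapper(lines):
    # Mapper: in Hadoop Streaming, each mapper reads lines (from an HDFS
    # block) on stdin and emits tab-separated key/value pairs on stdout;
    # here it is modeled as a generator.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Reducer: Hadoop's shuffle/sort delivers records grouped by key,
    # so the reducer can aggregate consecutive lines with the same key.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Locally simulate the framework: map, then sort (the shuffle), then reduce.
print(list(reducer(sorted(mapper(["big data", "data lake"])))))
```

On a real cluster, HDFS would supply the input splits and MapReduce would schedule the mapper and reducer processes next to the data, so the same two small scripts scale to very large datasets.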