Hadoop is the underpinning of most big data architectures. A closer look at the Hadoop 1.x architecture revealed limitations that needed to be analyzed and resolved. The Apache Software Foundation therefore released next-generation Hadoop 2.x, a substantial overhaul of the architecture. The move from Hadoop 1’s more confined model of batch-oriented MapReduce jobs to the more interactive processing models of Hadoop 2 has further positioned the Hadoop ecosystem as the dominant Big Data analysis platform.
Apache Hadoop 2.x Key Components:
1. NameNode High Availability:
Issue: In Hadoop 1.x, the NameNode was a single point of failure. Each cluster had a single NameNode. If the NameNode became unavailable, the whole cluster became unavailable and remained so until the NameNode was either restarted or brought up on another machine.
Solution: The NameNode High Availability feature addresses this problem by running two NameNodes in the same cluster in an Active/Passive arrangement.
Architecture: In a High Availability cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby acts as a slave, maintaining enough state to provide a fast failover if required.
When the Active node performs any namespace modification, it logs a record of the change to an edit log file stored in a shared directory. The Standby node constantly watches this directory for edits, and as it sees them, it applies them to its own namespace. In the event of a failover, the Standby makes sure it has read all the edits from the shared storage before promoting itself to Active. This guarantees that the namespace state is fully synchronized before a failover happens.
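The Active/Standby flow described above can be sketched in a few lines of Python. This is a toy in-memory model, not Hadoop code: the class names `SharedEditLog` and `ToyNameNode`, the `create`/`delete` operations, and the `become_active` method are all hypothetical names chosen for illustration, and the shared directory is replaced by a plain list.

```python
from dataclasses import dataclass, field


@dataclass
class SharedEditLog:
    """Stand-in for the edit log kept in the shared directory (hypothetical)."""
    edits: list = field(default_factory=list)

    def append(self, op, path):
        self.edits.append((op, path))


@dataclass
class ToyNameNode:
    name: str
    log: SharedEditLog
    active: bool = False
    namespace: set = field(default_factory=set)
    applied: int = 0  # number of shared edits already applied locally

    def _apply(self, op, path):
        if op == "create":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)

    def mutate(self, op, path):
        """Namespace change requested by a client; only the Active node serves it."""
        if not self.active:
            raise RuntimeError(f"{self.name} is in Standby state")
        self.log.append(op, path)  # record the change in the shared log
        self._apply(op, path)
        self.applied = len(self.log.edits)

    def tail_edits(self):
        """Standby side: apply every shared edit that has not been seen yet."""
        for op, path in self.log.edits[self.applied:]:
            self._apply(op, path)
        self.applied = len(self.log.edits)

    def become_active(self):
        """Failover: catch up on all shared edits before serving clients."""
        self.tail_edits()
        self.active = True


log = SharedEditLog()
nn1 = ToyNameNode("nn1", log, active=True)
nn2 = ToyNameNode("nn2", log)

nn1.mutate("create", "/data/a")
nn1.mutate("create", "/data/b")
nn1.mutate("delete", "/data/a")

nn1.active = False   # simulate losing the Active node
nn2.become_active()  # Standby replays the shared log, then takes over
print(sorted(nn2.namespace))  # ['/data/b']
```

Because the Standby replays the whole shared log before switching state, its namespace matches the one the failed Active node had at the moment of failover.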
2. HDFS Federation:
Issue: The earlier HDFS architecture allows only a single namespace for the whole cluster, and a single NameNode manages that namespace.
Solution: With HDFS Federation, multiple NameNode servers manage namespaces, which allows horizontal scaling, performance improvements, and multiple namespaces. The implementation of HDFS Federation allows existing single-NameNode configurations to keep running without changes.
Architecture: In order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces. The NameNodes are federated, that is, they are independent and do not require coordination with one another. The DataNodes are used as common storage for blocks by all the NameNodes. Each DataNode registers with all the NameNodes in the cluster. DataNodes send periodic heartbeats and block reports to the NameNodes and handle commands from them.
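The relationship between independent NameNodes and shared DataNodes can be illustrated with a small Python model. It is only a sketch under simplifying assumptions: `ToyFederatedNameNode`, `ToyDataNode`, and the namespace names `users` and `logs` are hypothetical, heartbeats are omitted, and block reports are plain method calls rather than RPCs.

```python
from collections import defaultdict


class ToyFederatedNameNode:
    """One independent NameNode owning one namespace; it never contacts its peers."""

    def __init__(self, namespace_id):
        self.namespace_id = namespace_id
        self.datanodes = set()                   # DataNodes registered here
        self.block_locations = defaultdict(set)  # block id -> DataNode ids

    def register(self, datanode_id):
        self.datanodes.add(datanode_id)

    def receive_block_report(self, datanode_id, block_ids):
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)


class ToyDataNode:
    """Common block storage shared by every NameNode in the cluster."""

    def __init__(self, datanode_id, namenodes):
        self.datanode_id = datanode_id
        self.namenodes = namenodes
        self.blocks = defaultdict(set)  # namespace id -> block ids stored here
        for nn in namenodes:            # register with all NameNodes
            nn.register(datanode_id)

    def store(self, namespace_id, block_id):
        self.blocks[namespace_id].add(block_id)

    def send_block_reports(self):
        # Each NameNode only hears about the blocks of its own namespace.
        for nn in self.namenodes:
            nn.receive_block_report(self.datanode_id, self.blocks[nn.namespace_id])


nn_users = ToyFederatedNameNode("users")
nn_logs = ToyFederatedNameNode("logs")
datanodes = [ToyDataNode(f"dn{i}", [nn_users, nn_logs]) for i in range(3)]

datanodes[0].store("users", "blk_1")
datanodes[2].store("users", "blk_1")  # second replica of the same block
datanodes[1].store("logs", "blk_2")

for dn in datanodes:
    dn.send_block_reports()

print(sorted(nn_users.datanodes))                 # ['dn0', 'dn1', 'dn2']
print(sorted(nn_users.block_locations["blk_1"]))  # ['dn0', 'dn2']
print("blk_2" in nn_users.block_locations)        # False
```

Every DataNode is known to both NameNodes, yet each NameNode only tracks blocks belonging to its own namespace. In HDFS the set of blocks belonging to one namespace is called a block pool.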
3. HDFS Snapshots:
Hadoop 2.x HDFS supports taking snapshots of the file system. We can save the state at a point in time and restore it later. Snapshots are useful for data backup, recovering from user errors, and disaster recovery.
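The save-and-restore idea can be shown with a toy Python file system. This is a conceptual sketch only: `ToyFileSystem` and its methods are hypothetical, and the snapshot here is a full copy of a dictionary for simplicity, whereas HDFS itself avoids copying block data when a snapshot is taken.

```python
class ToyFileSystem:
    """In-memory stand-in for a snapshottable directory (hypothetical)."""

    def __init__(self):
        self.files = {}      # path -> contents
        self.snapshots = {}  # snapshot name -> frozen copy of files

    def write(self, path, data):
        self.files[path] = data

    def delete(self, path):
        self.files.pop(path, None)

    def create_snapshot(self, name):
        # Save the current state; later writes do not affect this copy.
        self.snapshots[name] = dict(self.files)

    def restore(self, name):
        # Bring the live state back to what the snapshot recorded.
        self.files = dict(self.snapshots[name])


fs = ToyFileSystem()
fs.write("/reports/q1.csv", "revenue,100")
fs.create_snapshot("before-cleanup")

fs.delete("/reports/q1.csv")  # user error: the file is removed by mistake
fs.restore("before-cleanup")  # recover it from the snapshot
print(fs.files)  # {'/reports/q1.csv': 'revenue,100'}
```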
4. YARN (Yet Another Resource Negotiator)
MapReduce underwent a complete overhaul in hadoop-0.23, and we now have what we call MapReduce 2.0 (MRv2), or YARN.
The fundamental idea of MRv2 is to split the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
The idea is to have a ResourceManager (RM) and a per-application ApplicationMaster (AM). An application can be a single job in the classical sense of MapReduce jobs.
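The split of responsibilities can be sketched as two small Python classes. This is an illustrative model, not the YARN API: `ToyResourceManager`, `ToyApplicationMaster`, the notion of a numeric container count, and the task names are all assumptions made for the example.

```python
class ToyResourceManager:
    """Cluster-wide resource management only; it knows nothing about job progress."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ToyApplicationMaster:
    """Per-application scheduling and monitoring of that application's tasks."""

    def __init__(self, app_id, tasks, rm):
        self.app_id = app_id
        self.pending = list(tasks)
        self.done = []
        self.rm = rm

    def run(self):
        while self.pending:
            granted = self.rm.allocate(len(self.pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(batch)   # pretend each task ran in one container
            self.rm.release(granted)  # hand the containers back to the RM
        return self.done


rm = ToyResourceManager(total_containers=2)
am = ToyApplicationMaster("app_1", ["map-0", "map-1", "map-2", "reduce-0"], rm)
print(am.run())  # ['map-0', 'map-1', 'map-2', 'reduce-0']
print(rm.free)   # 2
```

The resource manager only hands out and reclaims capacity, while the tracking of which tasks are pending or finished lives entirely in the per-application master. In Hadoop 1.x both jobs were done by the JobTracker.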