The Ultimate Guide to Apache Hadoop
According to a report by EMC Corporation, we are living in an era of information explosion. What does that mean? The world’s data used to double every century; now it doubles every two years. This explosion is driven by the Internet of Things, by mobile devices, and by our ability to generate more digital content than ever before. It is also fueled by enterprises all over the world transitioning to interfacing with their customers via web 2.0 and mobile technologies. Here is an intriguing statistic on how quickly this is happening: the digital universe was projected to grow from four zettabytes of data in 2013 to a whopping 44 zettabytes in 2020.
Where is this data coming from? Is it any different from what we used to deal with five or ten years ago? And, most importantly, what sets it apart when it comes to storing and analyzing it? Arguably, this trend began when businesses started to shift how they interact with their customers. If you were a retail business operating a fleet of stores in the United States 20 years ago,
the bulk of the customer interaction data would have come from your own cash registers in physical stores, tracking both purchases and payment methods. Today, you are likely operating a digital storefront on the web, and, in addition to the brick-and-mortar data, you are getting much more. For starters, you are getting very detailed web logs tracking how customers interact with your website. You are also tracking each individual customer profile, knowing when to target them with certain promotions, or even which items they may be interested in purchasing, based on their past behavior or the demographic they belong to. Outside of your own digital footprint, you can also track your customers’ sentiment on social media and through the search engines they use to reach your website.
All of this leads to what are known as the 3 Vs of Big Data: data volume, data velocity, and data variety. In 2001, the industry analyst Doug Laney described Big Data using those 3 Vs, and the name stuck. They capture the essence of Big Data so well that it would not be an oversimplification to say they have become its definition. All three terms should be fairly self-explanatory, especially in the context of our retail example, but let’s quickly walk through them one by one. Variety reflects the fact that unstructured and semi-structured data is becoming as strategically important as the traditional structured data you would store in a relational database. Volume speaks for itself: data is arriving from ever more sources, and increased regulation in multiple areas means that storing more data for longer periods becomes a necessity. Velocity is the requirement that machine data, as well as data coming from new sources, be ingested at speeds not even imagined a few years ago. So, whenever anybody says that Hadoop is a Big Data technology, what they are really saying is that Hadoop was designed from the ground up to deal with all three of these Vs. Specifically, Hadoop is well-suited for any scenario where the volume and variety of data overwhelm existing systems, and where the data velocity is also too much for them to handle.
BIRTH OF HADOOP
Back in 2006, most traditional enterprises were still blissfully unaware of big data challenges and could not yet appreciate its opportunities. Big data was the domain of Internet giants, and one such company came to a point where it had to achieve a lot of the business outcomes we reviewed earlier. Given its scale and the size of the data sets it started to accumulate, it also had to solve this challenge in a cost-effective manner. After all, since the value of big data is typically proportional to the volume of data available for analysis, it made no sense to pay for traditional databases and/or run them on custom hardware appliances. Whatever the solution, it had to:
– Be free from draconian licensing costs
– Run on commodity hardware without requirements for custom servers and/or networking
– Scale linearly with the growth of data volume
– Afford efficient data processing and analytics that would scale well with the size of the data.
The name of the company was Yahoo! Inc. and a guy by the name of Doug Cutting, who had just joined it, had a solution in mind. A few years before, he and a friend of his, Mike Cafarella, started hacking on a project named after Doug’s son’s elephant toy: Hadoop. It would be fair to say that, while the ideas behind Hadoop were clearly conceived at Google, Yahoo! gave us the Hadoop we know and love today.
Hadoop = HDFS + YARN
From its inception to this day, Hadoop has focused on providing a scalable, reliable platform for storage and data analysis that runs on commodity hardware and is fault tolerant. Storage is offered by HDFS (Hadoop Distributed File System) and processing capabilities are offered by YARN (Yet Another Resource Negotiator). Unlike a database, Hadoop does not know what kind of data will be stored as files in HDFS: it does not know whether the data has certain fields in it or is entirely opaque. Only during the processing step is some kind of structure imposed on those raw files. This is known as the schema-on-read approach to data management: you simply dump whatever raw data comes your way into files in HDFS and do not think about it until the processing step. Such a repository of raw, unstructured data is called a data lake.
Hadoop is basically about two components. First, there is YARN, which manages all of the CPU and memory, and then there is HDFS, which manages all of the direct-attached storage. Both come from Apache Hadoop, but they can also be used independently. Some traditional enterprise storage vendors, for example, provide HDFS-compatible layers on top of their old-school storage products, and this works because HDFS and YARN are independent: as long as a storage product speaks the HDFS API, YARN will be more than happy to work with it. This loose coupling of the YARN and HDFS APIs gives Hadoop customers a lot of flexibility. Hadoop becomes the kind of platform that frees you to capture any data and store it for as long as you need it, and to analyze that data with any application you already use, or any future application you might create. You can also use the platform to explore your data with any combination of batch, interactive, search, and streaming analytics, and you can do quite sophisticated machine learning on top of it. Finally, you can deploy all these capabilities however you like, and change those deployments whenever it suits your needs. As a technology, then, Hadoop is really all about giving you ultimate flexibility in dealing with your Big Data challenges. Interestingly enough, much of that flexibility stems from the way it is developed as an open source project. So, let’s look at how Hadoop really became Hadoop. Hadoop happens to be one of the projects developed under the Apache umbrella, and, speaking of which, we should always remember that it is Apache Hadoop.
Suppose you happen to be a data scientist in an enterprise organization who has just come into possession of a most precious data set. The data set consists of files, but it is also large, which means you cannot really store it on your server’s hard drive as you normally would. In fact, you have to use multiple servers just to store the data. And remember: this is your most precious data set, so you need to make sure that, when you store it on multiple servers, you can still read it even if any of the drives in any of the servers fail. This is precisely what HDFS (Hadoop Distributed File System) has been designed for, and it does so while maintaining a very familiar, user-friendly interface.
From the end user’s perspective, HDFS looks and feels like a regular filesystem: the one you are used to on your desktop. Just like any filesystem, HDFS stores data in files, and files are grouped together into a tree of subdirectories. HDFS splits all the data stored in the files into a series of chunks called blocks, and its blocks are really big. It then stores those blocks on different servers in the cluster, keeping multiple copies of the same block to achieve reliability. If a server hosting a block fails, or experiences a faulty hard drive, HDFS will still give you that data back; it will just have to read it from a different server in the cluster. As with any distributed filesystem, HDFS hides all of this bookkeeping complexity from its clients: you just read the files, and the blocks come to you. Unlike most distributed filesystems, however, HDFS allows one critical piece of bookkeeping information to be given back to the client: it lets the client know where the replicas of each block are in the cluster. In other words, if a client asks, HDFS gives that client a full map.
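To make that familiarity concrete, here is a minimal sketch of everyday HDFS shell usage. It assumes a running cluster with the hadoop client on the PATH; the user and file names are hypothetical:

```shell
$ hdfs dfs -mkdir -p /user/alice/sales              # create a directory tree
$ hdfs dfs -put sales-2020.csv /user/alice/sales/   # upload a local file
$ hdfs dfs -ls /user/alice/sales                    # list it, just like ls
$ hdfs dfs -cat /user/alice/sales/sales-2020.csv    # read it back
```

Behind each of these commands, HDFS is transparently splitting, replicating, and reassembling blocks; the client-facing interface stays deliberately close to a regular Unix filesystem.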
Every HDFS cluster is comprised of one or two NameNodes, and as many DataNodes as your IT budget will allow. With just one NameNode – if it goes down your whole HDFS deployment is unavailable (even though the DataNodes may be running just fine). With two NameNodes running in an Active/Standby configuration, the Standby can take over in cases where the Active one fails or needs to be brought down for maintenance. This is called a High Availability (HA) configuration.
NameNode (one or two per cluster)
– Is the master service of HDFS
– Determines and maintains how the chunks of data are distributed across the DataNodes
– Actual data never resides here, only metadata (e.g., maps of where blocks are distributed).
DataNode (as many as you want per cluster)
– Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes
– Default number of replicas on most clusters is 3 (but it can be changed on a per-file basis)
– Default block size on most clusters is 128MB.
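As a back-of-the-envelope sketch of what those defaults mean in practice (the 1 GB file size here is just an illustrative assumption):

```shell
# how many blocks and block replicas a 1 GB file needs with default settings
FILE_MB=1024      # hypothetical file size, in MB
BLOCK_MB=128      # default HDFS block size
REPLICAS=3        # default replication factor
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))   # number of blocks, rounded up
echo "blocks=$BLOCKS replicas=$(( BLOCKS * REPLICAS ))"
```

So a 1 GB file occupies 8 blocks and, with the default factor of 3, 24 block replicas across the cluster. On a live cluster, the per-file replication mentioned above can be changed with `hdfs dfs -setrep`.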
The typical Hadoop cluster may look something like this. First, there will be a few master nodes dedicated to running the daemons that coordinate the overall activities of the cluster. In HDFS’s case, that coordination is done by the NameNode daemon, running on master node number 1, with other coordination daemons spread across master nodes 1 to 4. While the master nodes are extremely critical to the overall health and performance of the cluster, strictly speaking they don’t do any of the real work: they don’t store blocks of data, nor do they run data processing tasks. That work happens on the worker nodes, numbered 1 to 7, each of them running two daemons: HDFS’s DataNode and YARN’s NodeManager. Finally, there are utility nodes; those are typically not considered members of the cluster, but they serve as gateways into it.
From HDFS’s perspective, a cluster consists of a NameNode coordinating a whole bunch of DataNodes, while also providing three fundamental services to every HDFS client: metadata management, namespace management, and block management. Metadata management keeps track of permissions and ownership of files and folders, plus any kind of extended metadata, such as block size, replication level, user quotas, or anything else specific to HDFS. Namespace management provides the hierarchical namespace: all folders are rooted at / and you traverse the tree to get to the files. Block management maintains the block map, which knows which blocks belong to which files, and where on the cluster each replica is stored.
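One way to see that block map for yourself, assuming a running cluster and a hypothetical file path, is the fsck tool:

```shell
$ hdfs fsck /user/alice/sales/sales-2020.csv -files -blocks -locations
```

The -blocks and -locations flags print each block ID along with the DataNodes holding its replicas, which is exactly the "full map" described above.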
Once the NameNode is up and running, it makes itself available on the network, and a whole bunch of DataNodes initiate connections to it. Some DataNodes are active, while others may be inactive, down, or offline: say, DataNode number 2. The NameNode keeps track of all of this; its job is to keep an eye on all the DataNodes connected to it, and it needs to know when a given DataNode goes down, so it can reroute requests for blocks to other DataNodes. For example, even though block 1-2-3 is hosted on three DataNodes (DataNode 1, DataNode 2, and DataNode 4), while DataNode 2 is down the request for block 1-2-3 can only be served from DataNode 1 and DataNode 4. The NameNode tracks this through heartbeats: DataNode 1 keeps sending a heartbeat, basically saying “Hey, I’m still here. This is my latest heartbeat.” DataNode number 2, on the other hand, stops sending heartbeats, which makes the NameNode realize that DataNode 2 has gone down. The NameNode will then instruct DataNode 1, which is active and has a copy of block 1-2-3, to replicate that block to another live DataNode, restoring the desired replication level.
So, let’s take the example of a client trying to write a file to HDFS. What happens is this: first, the client sends a request to the NameNode to add a file to HDFS, and receives a reply in which the NameNode basically says, “Here is your lease to a file path.” The client then iterates over all the blocks it has and, for every block, asks the NameNode to provide a block ID and a list of destination DataNodes. Once the client has that information, the NameNode gets out of the way completely, and the client writes directly to the first DataNode in the list. The replication pipeline then takes care of making sure that DataNode number 1 writes a copy of the block to DataNode number 2, and so on down the list.
HDFS supports the notion of users and groups of users.
HDFS offers classic POSIX filesystem permissions for controlling who can read and write (e.g., -rwxr-xr--)
HDFS also offers extended Access Control Lists (ACLs).
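A brief sketch of both mechanisms, with hypothetical users, groups, and paths:

```shell
$ hdfs dfs -chmod 750 /user/alice/sales               # classic POSIX bits
$ hdfs dfs -chown alice:analytics /user/alice/sales   # owner and group
$ hdfs dfs -setfacl -m user:bob:r-x /user/alice/sales # grant bob access via an ACL entry
$ hdfs dfs -getfacl /user/alice/sales                 # inspect the resulting ACL
```

The ACL here lets a specific user in without widening the group or world permission bits, which is exactly what extended ACLs are for.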
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.
Before YARN, processing this data would require writing an algorithm, running it on a single node, and making that node act as a single client to the HDFS data, essentially sifting through all of it record by record. As you may imagine, if your data set truly qualifies as big data, that is an extremely slow process.
ResourceManager (one or two per cluster), which provides
– Global resource scheduler
– Hierarchical queues
NodeManager (running next to the DataNode)
– Encapsulates the RAM and CPU resources available on a worker node into units called YARN containers
– Manages the lifecycle of YARN containers
– Monitors container resource usage
ApplicationMaster (created on-demand)
– Manages application scheduling and task execution
– Typically, specific to a higher-level framework (e.g. MapReduce Application Master).
As with HDFS, YARN runs its coordination daemon called ResourceManager on one of the master nodes; master node number 2, in our case.
Also, there is a whole bunch of NodeManager daemons co-located with the DataNodes on the worker nodes: worker node 1 to worker node 7.
The ResourceManager component provides a number of services that perform three main duties: first, there is scheduling, then, there is node management, and then, there is security.
YARN scheduling is handled by a single component that controls resource usage according to parameters set by Hadoop administrators.
This allows for greater efficiency by letting different organizations share a centrally pooled set of cluster resources, also known as cluster multi-tenancy.
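Those hierarchical queues are defined in the CapacityScheduler configuration. Here is a minimal sketch of a capacity-scheduler.xml fragment, assuming two hypothetical queues named analytics and etl that split the cluster 70/30:

```xml
<!-- two hypothetical child queues under root, sharing the cluster 70/30 -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>analytics,etl</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.etl.capacity</name>
  <value>30</value>
</property>
```

Each organization submits jobs to its own queue, and the scheduler enforces the configured shares across the whole cluster.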
The ResourceManager also coordinates the NodeManagers, very much like HDFS’s NameNode coordinates all of the DataNodes.
Just like the NameNode, the ResourceManager does so by monitoring the NodeManagers for heartbeats, which are sent by the NodeManagers every second by default; if no heartbeat arrives within 10 minutes, the node is considered lost.
The ResourceManager also provides a few security capabilities.
The YARN NodeManager is a daemon service that runs on each worker node and manages local resources on behalf of requesting services;
it also tracks the health of the node and communicates its status to the ResourceManager.
The chief duty of the NodeManager is to use the available CPU and RAM capacity on the node to run code, typically written in Java, and given to it by YARN scheduling requests.
The NodeManager reacts to any such valid request by allocating the required amounts of CPU and RAM capacity and spinning up a YARN container, which can then use that much CPU and RAM to run the user-submitted code.
A container is a unit of work within a YARN application that is allocated specific CPU and memory resources by the NodeManager on behalf of the ResourceManager
The container is the component that performs the work of the specific YARN application.
A container is also launched each time a new ApplicationMaster is required; in that case, the request comes from the ResourceManager itself. When a job is executed, the ApplicationMaster requests additional resources from the ResourceManager via the NodeManager on which it is running. If additional resources can be allotted, the ResourceManager then requests additional containers to run the application’s tasks across the cluster.
STEPS TO SETUP HADOOP
1) Java Installation
# yum update
# cd /opt/
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u141-b15/336fa29ff2bb4ef291e347e091f7f4a7/jdk-8u141-linux-x64.tar.gz"
# tar xzf jdk-8u141-linux-x64.tar.gz
# cd /opt/jdk1.8.0_141/
# alternatives --install /usr/bin/java java /opt/jdk1.8.0_141/bin/java 2
# alternatives --config java
There are 3 programs which provide 'java'.
   1 /opt/jdk1.7.0_71/bin/java
*+ 2 /opt/jdk1.8.0_45/bin/java
   3 /opt/jdk1.8.0_141/bin/java
Enter to keep the current selection[+], or type selection number: 3
# alternatives --install /usr/bin/jar jar /opt/jdk1.8.0_141/bin/jar 2
# alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_141/bin/javac 2
# alternatives --set jar /opt/jdk1.8.0_141/bin/jar
# alternatives --set javac /opt/jdk1.8.0_141/bin/javac
# java -version
To configure the environment variables:
# export JAVA_HOME=/opt/jdk1.8.0_141
# export JRE_HOME=/opt/jdk1.8.0_141/jre
# export PATH=$PATH:/opt/jdk1.8.0_141/bin:/opt/jdk1.8.0_141/jre/bin
and add all of these environment variables to the /etc/environment file so they are loaded automatically on system boot.
2) Create a user named hadoop in order to avoid running Hadoop as root
# adduser hadoop
# passwd hadoop
# su – hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
$ cd ~
$ wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
$ tar xzf hadoop-2.6.5.tar.gz
$ mv hadoop-2.6.5 hadoop
Edit the hadoop user's ~/.bashrc file, append the Hadoop environment variables at the end of the file, and then apply the changes:
$ source ~/.bashrc
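A typical set of variables to append to ~/.bashrc, assuming Hadoop was unpacked to $HOME/hadoop as in the previous step, might look like this:

```shell
# typical Hadoop environment variables for ~/.bashrc;
# assumes Hadoop was unpacked to $HOME/hadoop as in the steps above
export HADOOP_HOME=$HOME/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
```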
$ hdfs namenode -format
$ cd $HADOOP_HOME/sbin/
$ start-dfs.sh
$ start-yarn.sh
The Hadoop NameNode listens on port 50070 by default. Access these web UIs in your favorite browser:
http://vps.hadoop.com:50070/ for information about the NameNode and the HDFS cluster
http://vps.hadoop.com:8088/ for information about the cluster and all of its applications (the ResourceManager UI)
http://vps.hadoop.com:50090/ for information about the Secondary NameNode
http://vps.hadoop.com:50075/ for information about the DataNode
By now, you should really understand that Hadoop = HDFS + YARN: HDFS for data storage, and YARN for resource management and scheduling. Together, these two enable the rest of the Hadoop ecosystem, most of which consists of parallel data processing frameworks, and allow you to unlock the business value of Big Data.