Many cloud platforms and third-party service providers offer Hadoop as a service or as a prebuilt VM/container image, which lowers the barrier to entry for anyone getting started with Hadoop. In this hands-on lab, you will deploy a single-node Hadoop cluster in a pseudo-distributed configuration. Doing so walks you through deploying and configuring each individual Hadoop component, preparing you for the move to a multi-node cluster where Hadoop services run on separate hosts. In this learning activity, you will be performing the following:
* Installing Java
* Deploying Hadoop from an archive file
* Configuring Hadoop’s `JAVA_HOME`
* Configuring the default filesystem for Hadoop
* Configuring HDFS replication
* Setting up passwordless SSH
* Formatting the Hadoop Distributed File System (HDFS)
* Starting Hadoop
* Creating files and directories in Hadoop
* Examining a text file with a MapReduce job
Learning Objectives
Successfully complete this lab by achieving the following learning objectives:
- Install Java
Log into Node 1 as `cloud_user` and install the `java-19-amazon-corretto-devel` package:

`sudo yum -y install java-19-amazon-corretto-devel`
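Before moving on, it can help to confirm that the JDK is installed and note where it lives, since that path is used for `JAVA_HOME` later in the lab. A quick check might look like this:

```sh
# Confirm the JDK installed by yum is on the PATH (the version string varies by build)
java -version

# Locate the Corretto installation directory referenced in a later step
ls -d /usr/lib/jvm/java-19-amazon-corretto*
```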
- Deploy Hadoop
From the `cloud_user` home directory, download the Hadoop 3.3.4 release archive from your desired Apache mirror, for example:

`curl -O http://mirrors.gigenet.com/apache/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz`

Unpack the archive in place:

`tar -xzf hadoop-3.3.4.tar.gz`

Delete the archive file:

`rm hadoop-3.3.4.tar.gz`

Rename the installation directory:

`mv hadoop-3.3.4/ hadoop/`
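As an optional sanity check, you can verify that the unpacked installation contains the directories used throughout the rest of this lab:

```sh
# bin/ and sbin/ hold the Hadoop commands and start scripts; etc/hadoop/ holds the config files
ls ~/hadoop/bin ~/hadoop/sbin ~/hadoop/etc/hadoop
```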
- Configure `JAVA_HOME`

From `/home/cloud_user/hadoop`, set `JAVA_HOME` in `etc/hadoop/hadoop-env.sh` by finding the line that defines `JAVA_HOME` (it may be commented out) and replacing it with:

`export JAVA_HOME=/usr/lib/jvm/java-19-amazon-corretto/`

Save and close the file.
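If you would rather not edit the file by hand, appending an explicit definition achieves the same result, since the last assignment in the script wins. This is just an optional shortcut, not part of the official steps:

```sh
# Append an explicit JAVA_HOME definition to Hadoop's environment script
echo 'export JAVA_HOME=/usr/lib/jvm/java-19-amazon-corretto/' >> ~/hadoop/etc/hadoop/hadoop-env.sh

# Verify that Hadoop now finds the JDK
cd ~/hadoop && bin/hadoop version
```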
- Configure Core Hadoop
Set the default filesystem to `hdfs` on `localhost` in `/home/cloud_user/hadoop/etc/hadoop/core-site.xml` by changing the following lines:

    <configuration>
    </configuration>

Change them to this:

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>

Save and close the file.
- Configure HDFS
Set the default block replication to `1` in `/home/cloud_user/hadoop/etc/hadoop/hdfs-site.xml` by changing the following lines:

    <configuration>
    </configuration>

Change them to this:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

Save and close the file.
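With both files edited, you can optionally confirm that Hadoop resolves the new values. The `hdfs getconf` command only reads configuration, so it works before any daemons are running:

```sh
cd ~/hadoop

# Should print hdfs://localhost:9000
bin/hdfs getconf -confKey fs.defaultFS

# Should print 1
bin/hdfs getconf -confKey dfs.replication
```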
- Set Up Passwordless SSH Access to localhost
As `cloud_user`, generate a public/private RSA key pair with:

`ssh-keygen`

The default option for each prompt will suffice.

Add your newly generated public key to your authorized keys list with:

`cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys`
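Depending on how the home directory was provisioned, you may also need to tighten permissions on the authorized keys file, and a quick test confirms that the Hadoop start scripts will be able to reach `localhost` without a password prompt:

```sh
# sshd rejects authorized_keys files that are group- or world-writable
chmod 600 ~/.ssh/authorized_keys

# Should log in and exit without asking for a password
# (accept the host key fingerprint if prompted on first connection)
ssh localhost exit
```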
- Format the Filesystem
From `/home/cloud_user/hadoop/`, format HDFS with:

`bin/hdfs namenode -format`
- Start Hadoop
Start the `NameNode` and `DataNode` daemons from `/home/cloud_user/hadoop` with:

`sbin/start-dfs.sh`
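To confirm the daemons came up, you can list the running Java processes with `jps` (included with the JDK installed earlier). In this pseudo-distributed setup you should see a NameNode, a DataNode, and a SecondaryNameNode, though the process IDs will differ:

```sh
# All three HDFS daemons should appear in the process list
jps
```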
- Download and Copy the Latin Text to Hadoop
From `/home/cloud_user/hadoop`, download the `latin.txt` file with:

`curl -O https://raw.githubusercontent.com/linuxacademy/content-hadoop-quick-start/master/latin.txt`

From `/home/cloud_user/hadoop`, create the `/user` and `/user/cloud_user` directories in Hadoop with:

`bin/hdfs dfs -mkdir -p /user/cloud_user`

From `/home/cloud_user/hadoop/`, copy the `latin.txt` file to Hadoop at `/user/cloud_user/latin` with:

`bin/hdfs dfs -put latin.txt latin`
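If you want to verify the upload before running a job, list your HDFS home directory; relative paths such as `latin` resolve to `/user/cloud_user`:

```sh
cd ~/hadoop

# Should list /user/cloud_user/latin with the same size as the local latin.txt
bin/hdfs dfs -ls
```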
- Examine the latin.txt Text with MapReduce
From `/home/cloud_user/hadoop/`, use the `hadoop-mapreduce-examples-*.jar` to calculate the average length of the words in the `/user/cloud_user/latin` file and save the job output to `/user/cloud_user/latin_wordmean_output` in Hadoop with:

`bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordmean latin latin_wordmean_output`

From `/home/cloud_user/hadoop/`, examine your wordmean job output files with:

`bin/hdfs dfs -cat latin_wordmean_output/*`
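When you are done exploring, the HDFS daemons can be stopped with the companion script to `start-dfs.sh`. This is optional cleanup rather than a required lab step:

```sh
cd ~/hadoop

# Stops the NameNode, DataNode, and SecondaryNameNode started earlier
sbin/stop-dfs.sh
```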