description
In the previous blog, we discussed how to compile hadoop 3 from source and import the source code into Eclipse.
As one step of the build (mvn clean install -Pdist -DskipTests -Dtar), we generated the hadoop distribution hadoop-dist/target/hadoop-3.3.0-SNAPSHOT.tar.gz.
In this blog, we will show how to use the generated distribution to run the hadoop services on a mac, and how to debug the hadoop client as well as the services with eclipse.
The hadoop version is 3.3.0-SNAPSHOT, built from the trunk branch.
environment setup
copy and extract
- cp hadoop-dist/target/hadoop-3.3.0-SNAPSHOT.tar.gz /Users/wang.yan/public_work/hadoop_distribute
- cd /Users/wang.yan/public_work/hadoop_distribute
- tar zxvf /Users/wang.yan/public_work/hadoop_distribute/hadoop-3.3.0-SNAPSHOT.tar.gz
configure environment variable
add HADOOP_HOME to ~/.profile
export HADOOP_HOME=/Users/wang.yan/public_work/hadoop_distribute/hadoop-3.3.0-SNAPSHOT/
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
source ~/.profile
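A quick way to confirm that the profile change took effect is to check where the hadoop launcher resolves from. The sketch below is only an illustration: check_hadoop_env is a hypothetical helper name, not something shipped with hadoop.

```shell
#!/usr/bin/env bash
# Sketch: verify that HADOOP_HOME is set and that "hadoop" is found on PATH.
# check_hadoop_env is a hypothetical helper, not part of the distribution.
check_hadoop_env() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set"
    return 1
  fi
  # Tolerate a trailing slash in HADOOP_HOME, as in the export above.
  local launcher="${HADOOP_HOME%/}/bin/hadoop"
  if [ ! -x "$launcher" ]; then
    echo "no executable launcher at $launcher"
    return 1
  fi
  local resolved
  if ! resolved="$(command -v hadoop)"; then
    echo "hadoop is not on PATH"
    return 1
  fi
  echo "hadoop resolves to $resolved"
}
```

Running check_hadoop_env in a new terminal should print the resolved path of the launcher; a non-zero exit status means the profile was not picked up.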
try hadoop command
- Execute ./bin/hadoop; the following output is printed
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class
OPTIONS is none or any of:
--config dir Hadoop config directory
--debug turn on shell script debug mode
--help usage information
buildpaths attempt to add class files from build tree
hostnames list[,of,host,names] hosts to use in slave mode
hosts filename list of hosts to use in slave mode
loglevel level set the log4j level for this command
workers turn on worker mode
create local paths for namenode and datanode to use
- mkdir -p /Users/wang.yan/tmp/namenode/
- mkdir -p /Users/wang.yan/tmp/datanode/
configure hadoop properties
Change the properties files under etc/hadoop/ to run hadoop in pseudo-distributed mode.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///Users/wang.yan/tmp/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///Users/wang.yan/tmp/datanode</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
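Since a small mistake in these files only shows up later as a startup failure, it can help to read a value back and compare it with the folders created earlier. The following is a rough sketch, assuming each name and value tag sits on its own line as in the files above; get_prop is a hypothetical helper, not a hadoop command.

```shell
#!/usr/bin/env bash
# Sketch: print the value of one property from a *-site.xml file.
# get_prop is a hypothetical helper; it is a line-based reader, not an XML parser.
get_prop() {
  # $1 = path to the site file, $2 = property name
  awk -v want="$2" '
    /<name>/ {
      n = $0
      sub(/.*<name>[ \t]*/, "", n)
      sub(/[ \t]*<\/name>.*/, "", n)
    }
    /<value>/ {
      if (n == want) {
        v = $0
        sub(/.*<value>[ \t]*/, "", v)
        sub(/[ \t]*<\/value[ \t]*>.*/, "", v)
        print v
        exit
      }
    }
  ' "$1"
}
```

For example, get_prop etc/hadoop/hdfs-site.xml dfs.data.dir prints the configured datanode folder, which can then be compared with the path passed to mkdir. Once the distribution is on the PATH, bin/hdfs getconf -confKey dfs.data.dir should give the value as hadoop itself resolves it.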
setup passphraseless ssh
Refer to the hadoop wiki for this step.
I tried it, but it did not work properly on my local machine, so I skipped this step.
Because I skipped it, I cannot use the scripts under sbin/; I will use the commands under bin/ directly instead.
cleanup before starting all services
./bin/hdfs --daemon stop namenode
Note that in this walkthrough every namenode shutdown and restart is treated as a new cluster, so we need to clean up the namenode and datanode data folders before starting the namenode/datanode services.
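The cleanup can be collected into one helper. This is a minimal sketch under a few assumptions: it is run from the extracted distribution folder, the data folders are the ones created earlier, and a freshly emptied namenode folder is formatted with hdfs namenode -format before the next start. The names reset_cluster, run and DRY_RUN are hypothetical and only exist in this example.

```shell
#!/usr/bin/env bash
# Sketch of a reset helper; reset_cluster, run and DRY_RUN are hypothetical names.
# The default folders are the ones created earlier in this post.
NAMENODE_DIR="${NAMENODE_DIR:-/Users/wang.yan/tmp/namenode}"
DATANODE_DIR="${DATANODE_DIR:-/Users/wang.yan/tmp/datanode}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_cluster() {
  # Stop the daemons; ignore failures from daemons that are not running.
  run ./bin/yarn --daemon stop nodemanager || true
  run ./bin/yarn --daemon stop resourcemanager || true
  run ./bin/hdfs --daemon stop datanode || true
  run ./bin/hdfs --daemon stop namenode || true
  # Remove the old metadata and blocks, then recreate the empty folders.
  run rm -rf "$NAMENODE_DIR" "$DATANODE_DIR"
  run mkdir -p "$NAMENODE_DIR" "$DATANODE_DIR"
  # Assumption: the emptied namenode folder must be formatted before start.
  run ./bin/hdfs namenode -format
}
```

Setting DRY_RUN=1 before calling reset_cluster prints the commands without running them, which is a safe way to review what would be deleted.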
start all services
./bin/hdfs --daemon start namenode
./bin/hdfs --daemon start datanode
./bin/yarn --daemon start resourcemanager
./bin/yarn --daemon start nodemanager
check UIs
We are able to check the UIs.
- hdfs UI : http://localhost:9870/
- resource manager UI : http://localhost:8088/
check logs
The logs are under logs/
test running word count in pseudo-distributed mode
prepare data
bin/hdfs dfs -rm -r /tmp/input
bin/hdfs dfs -rm -r /tmp/output
bin/hdfs dfs -mkdir -p /tmp/input
bin/hdfs dfs -ls /tmp/input
bin/hdfs dfs -put etc/hadoop/hadoop-env.sh hdfs:///tmp/input/
run word count hadoop job
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0-SNAPSHOT.jar wordcount hdfs:///tmp/input hdfs:///tmp/output
check result
bin/hdfs dfs -cat /tmp/output/part-r-00000
- Part of the output is
"AS 1
"License"); 1
"log 1
# 323
## 12
### 33
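To cross-check the job result without hadoop, the same count can be approximated with standard shell tools. The sketch below assumes the example job splits words on whitespace only and writes "word, tab, count" sorted by the raw bytes of the word; local_wordcount is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: approximate the example WordCount locally for a cross-check.
# local_wordcount is a hypothetical helper, not a hadoop command.
local_wordcount() {
  # $1 = input file. Put one whitespace-separated token per line,
  # drop empty lines, then count identical tokens in byte order.
  tr -s ' \t\n\r\f' '\n' < "$1" \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | LC_ALL=C uniq -c \
    | awk '{ print $2 "\t" $1 }'
}
```

For example, local_wordcount etc/hadoop/hadoop-env.sh can be compared against the output of bin/hdfs dfs -cat /tmp/output/part-r-00000 with diff; if the assumptions above hold for this build, the two listings should agree.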
run word count in standalone mode and debug with eclipse
clean up the output folder and create data
mkdir -p /tmp/input
rm -rf /tmp/output
cp etc/hadoop/hadoop-env.sh /tmp/input/
- open eclipse, open the WordCount class, set a breakpoint on the main function, open debug configurations, and add this to the program arguments : /tmp/input /tmp/output
- click the debug button; we can then debug the hadoop client side, and the job finishes successfully.
- note that this way we are not loading any specific hadoop configuration files, so the resourcemanager/nodemanager and the namenode/datanode are not used. We can debug and see how it works in standalone mode.
- the call flow observed in the debugger is
-> construct Job, set mapper, combiner, reducer, etc.
-> Job.waitForCompletion
  -> Job.submit
    -> Job.connect
      -> Cluster.initialize : create the ClientProtocolProvider, which provides the protocol for the job client and the job tracker (resource manager) to communicate.
    -> JobSubmitter.submitJobInternal
      -> checkSpecs : check the output specification
      -> JobSubmissionFiles.getStagingDir : get a working directory for the job and create it
         = file:/tmp/hadoop/mapred/staging/wang.yan449283448/.staging
      -> submitClient.getNewJobID() : generate jobID
         = job_local2038582574_0001
      -> copyAndConfigureFiles() : upload libjars and so on to the job working directory
      -> writeSplits() : calculate the splits and write them to the job working directory
      -> writeConf() : write the configuration file to the job working directory
         = /tmp/hadoop/mapred/staging/wang.yan2038582574/.staging/job_local2038582574_0001/job.xml
      -> submitClient.submitJob() : execute the job
        -> LocalJobRunner.Job : we are not using yarn, so it is still standalone mode, using LocalJobRunner to do the work. It loads the job files from the job working directory.
          -> get local job dir : file:/Users/wang.yan/public_work/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/build/test/mapred/local/localRunner/wang.yan/job_local470153672_0001
          -> write configuration file : /Users/wang.yan/public_work/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/build/test/mapred/local/localRunner/wang.yan/job_local470153672_0001/job_local470153672_0001.xml
          -> run() : use a thread to execute the map/reduce task attempts
            -> MapTask.run() :
              -> Mapper.run() : loop over the input and invoke the mapper
                -> WordCount.TokenizerMapper.map : the map class configured on the client side.
run word count in pseudo-distributed mode and debug hadoop client with eclipse
clean up the output folder before running word count
bin/hdfs dfs -rm -r /tmp/output
modify the etc/hadoop/hadoop-env.sh file and add the following line
HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000 $HADOOP_OPTS"
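Because hadoop-env.sh is read by every hadoop command, an unconditional suspend=y makes all of them wait for a debugger. One possible variation, sketched here, is to add the agent only when a variable is set in the calling shell. HADOOP_DEBUG_PORT, jdwp_opts and apply_debug_opts are hypothetical names introduced for this example, and the sketch assumes the launcher scripts pass the caller's environment through to hadoop-env.sh.

```shell
# Sketch for etc/hadoop/hadoop-env.sh: enable the JDWP agent only on request.
# HADOOP_DEBUG_PORT, jdwp_opts and apply_debug_opts are hypothetical names.

# Print the agent option used above for the given port ($1).
jdwp_opts() {
  echo "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$1"
}

# Prepend the agent option to HADOOP_OPTS when HADOOP_DEBUG_PORT is set.
apply_debug_opts() {
  if [ -n "${HADOOP_DEBUG_PORT:-}" ]; then
    HADOOP_OPTS="$(jdwp_opts "$HADOOP_DEBUG_PORT") ${HADOOP_OPTS:-}"
  fi
}

apply_debug_opts
```

With this in place, a command prefixed with HADOOP_DEBUG_PORT=5000 would wait for the debugger, while the same command without the prefix would run normally, so the file does not have to be restored afterwards.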
run word count job
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0-SNAPSHOT.jar wordcount hdfs:///tmp/input hdfs:///tmp/output
- The command then blocks, waiting for a debugger to connect
open eclipse -> Project -> Debug configuration -> Remote Java Application -> enter the following and debug
host : localhost
port : 5000
Since we have configured hadoop to run in pseudo-distributed mode, we can now debug in eclipse to see how the hadoop client works.
run word count in pseudo-distributed mode and debug hadoop service with eclipse
- restore the original hadoop-env.sh
- stop resource manager : ./bin/yarn --daemon stop resourcemanager
modify the etc/hadoop/hadoop-env.sh file and add the following line
HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000 $HADOOP_OPTS"
start resource manager : ./bin/yarn --daemon start resourcemanager
- eclipse -> open the ResourceManager class, set a breakpoint on the main function
- open eclipse -> Project -> Debug configuration -> Remote Java Application -> enter the following and debug the resource manager
host : localhost
port : 5000
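Setting HADOOP_OPTS affects every command that reads hadoop-env.sh, not only the resource manager. A narrower alternative is sketched below, assuming the per-daemon variable YARN_RESOURCEMANAGER_OPTS described in etc/hadoop/yarn-env.sh is honoured by this build; treat it as an untested variation rather than the setup used in this post.

```shell
# Sketch for etc/hadoop/yarn-env.sh: attach the JDWP agent to the
# resource manager only, leaving other yarn and hadoop commands untouched.
# Assumption: this build applies YARN_RESOURCEMANAGER_OPTS to the daemon JVM.
YARN_RESOURCEMANAGER_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000 ${YARN_RESOURCEMANAGER_OPTS:-}"
export YARN_RESOURCEMANAGER_OPTS
```

The eclipse side stays the same: the remote debug configuration still connects to localhost on port 5000.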
run word count in pseudo-distributed mode and debug hadoop child process with eclipse
modify etc/hadoop/mapred-site.xml and add the following property
<property>
<name>mapred.child.java.opts</name>
<value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002</value>
</property>
- eclipse -> open YarnChild -> set a breakpoint on the main function
- open eclipse -> Project -> Debug configuration -> Remote Java Application -> enter the following and debug the child process
host : localhost
port : 5002