run word count on hadoop 3 and debug hadoop by eclipse in mac

description

In the previous blog, we discussed how to compile hadoop 3 from the source code and import the source code into Eclipse.

As one step of the build(mvn clean install -Pdist -DskipTests -Dtar), we generated the hadoop distribution of hadoop-dist/target/hadoop-3.3.0-SNAPSHOT.tar.gz.

In this blog, we will show how to use the generated hadoop distribution to run the hadoop services on mac, and how to debug the hadoop client as well as the services by eclipse.

The hadoop version is 3.3.0-SNAPSHOT, built from the trunk branch.

environment setup

copy and extract

  • cp hadoop-dist/target/hadoop-3.3.0-SNAPSHOT.tar.gz /Users/wang.yan/public_work/hadoop_distribute
  • cd /Users/wang.yan/public_work/hadoop_distribute
  • tar zxvf /Users/wang.yan/public_work/hadoop_distribute/hadoop-3.3.0-SNAPSHOT.tar.gz

configure environment variable

  • add HADOOP_HOME to the ~/.profile

    export HADOOP_HOME=/Users/wang.yan/public_work/hadoop_distribute/hadoop-3.3.0-SNAPSHOT/
    PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  • source ~/.profile
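To confirm the variables took effect after sourcing, a quick sanity check (a sketch, using the same path as above):

```shell
# Set the variables as in ~/.profile, then verify them (a sketch).
export HADOOP_HOME=/Users/wang.yan/public_work/hadoop_distribute/hadoop-3.3.0-SNAPSHOT/
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

# Both checks should print "ok".
[ -n "$HADOOP_HOME" ] && echo "HADOOP_HOME ok"
echo "$PATH" | grep -q "$HADOOP_HOME/bin" && echo "PATH ok"
```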

try hadoop command

  • Execute ./bin/hadoop; the following output will be printed
    Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
    or hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
    where CLASSNAME is a user-provided Java class

    OPTIONS is none or any of:

    --config dir Hadoop config directory
    --debug turn on shell script debug mode
    --help usage information
    buildpaths attempt to add class files from build tree
    hostnames list[,of,host,names] hosts to use in slave mode
    hosts filename list of hosts to use in slave mode
    loglevel level set the log4j level for this command
    workers turn on worker mode

create local paths for namenode and datanode to use

  • mkdir -p /Users/wang.yan/tmp/namenode/
  • mkdir -p /Users/wang.yan/tmp/datanode/

configure hadoop properties

Change the properties files under etc/hadoop/ to run hadoop in pseudo-distributed mode.

  • core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>file:///Users/wang.yan/tmp/namenode</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>file:///Users/wang.yan/tmp/datanode</value>
      </property>
    </configuration>
  • mapred-site.xml

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
      <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
      </property>
    </configuration>
  • yarn-site.xml

    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
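A side note: dfs.name.dir and dfs.data.dir in the hdfs-site.xml above are the old (pre-2.x) property names; hadoop still accepts them through its deprecation table, but the current names are dfs.namenode.name.dir and dfs.datanode.data.dir. The equivalent fragment with the current names (same directories as above):

```xml
<!-- Same directories as above, using the non-deprecated property names -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///Users/wang.yan/tmp/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///Users/wang.yan/tmp/datanode</value>
</property>
```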

setup passphraseless ssh

Refer to the hadoop wiki for this step.

I tried it, but it did not work properly on my local machine, so I skipped this step.

Because I skipped this step, I cannot use the scripts under sbin/, so I will use the commands under bin/ directly instead.

cleanup before start all service

./bin/hdfs --daemon stop namenode
./bin/hdfs --daemon stop datanode
./bin/yarn --daemon stop resourcemanager
./bin/yarn --daemon stop nodemanager
rm -rf /Users/wang.yan/tmp/datanode/*
./bin/hdfs namenode -format

Note that formatting the namenode creates a new cluster with a new clusterID, so we need to clean up the datanode data folder before starting the namenode/datanode services; otherwise the datanode will fail to register because its stored clusterID no longer matches the namenode's.

start all service

./bin/hdfs --daemon start namenode
./bin/hdfs --daemon start datanode
./bin/yarn --daemon start resourcemanager
./bin/yarn --daemon start nodemanager
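After starting, jps should list all four daemons. A sketch that checks a captured listing (the pids and the listing itself below are illustrative; on a real setup use JPS_OUT="$(jps)"):

```shell
# jps output should contain these four daemon names; here we check a
# sample captured listing with illustrative pids.
JPS_OUT='12001 NameNode
12002 DataNode
12003 ResourceManager
12004 NodeManager'
for d in NameNode DataNode ResourceManager NodeManager; do
  echo "$JPS_OUT" | grep -q "$d" && echo "$d up"
done
```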

check UIs

We can now check the web UIs: the namenode UI at http://localhost:9870 and the resourcemanager UI at http://localhost:8088.

check logs

The daemon logs are written under logs/ in the hadoop distribution directory.

test to run word count in pseudo-distributed mode

  • prepare data

    bin/hdfs dfs -rm -r /tmp/input
    bin/hdfs dfs -rm -r /tmp/output
    bin/hdfs dfs -mkdir -p /tmp/input
    bin/hdfs dfs -ls /tmp/input
    bin/hdfs dfs -put etc/hadoop/hadoop-env.sh hdfs:///tmp/input/
  • run word count hadoop job

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0-SNAPSHOT.jar wordcount hdfs:///tmp/input hdfs:///tmp/output
  • check result

    bin/hdfs dfs -cat /tmp/output/part-r-00000
    • Part of the output is
      "AS	1
      "License"); 1
      "log 1
      # 323
      ## 12
      ### 33
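As a mental model, the wordcount job computes the same thing as this plain-shell pipeline (a sketch on a tiny in-line sample, not how MapReduce executes it internally):

```shell
# Emulate wordcount: tokenize (map), sort (shuffle), count duplicates (reduce).
sample='hello world
hello hadoop'
counts="$(printf '%s\n' "$sample" | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2, $1}')"
echo "$counts"
# -> hadoop 1
#    hello 2
#    world 1
```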

run word count in standalone mode and debug by eclipse

  • clean up output folder and create data

    mkdir -p /tmp/input
    rm -rf /tmp/output
    cp etc/hadoop/hadoop-env.sh /tmp/input/
  • open eclipse, open the WordCount class, set a breakpoint on the main function, open debug configurations, and add this to the program arguments: /tmp/input /tmp/output.

  • click the debug button; we can then step through the hadoop client side, and the job finishes successfully.
  • note that in this way we are not loading any specific hadoop configuration files, so resourcemanager/nodemanager and namenode/datanode are not used. We can debug and see how hadoop works in standalone mode. The observed call flow is:
    -> construct Job, set mapper, combiner, reducer, etc.
    -> Job.waitForCompletion
    -> Job.submit
    -> Job.connect
    -> Cluster.initialize : create ClientProtocolProvider, which provides the protocol for the job client and the job tracker (resource manager) to communicate.
    -> JobSubmitter.submitJobInternal
    -> checkSpecs : check the output specification
    -> JobSubmissionFiles.getStagingDir : get a working directory for the job and create it
    = file:/tmp/hadoop/mapred/staging/wang.yan449283448/.staging
    -> submitClient.getNewJobID() : generate jobID
    = job_local2038582574_0001
    -> copyAndConfigureFiles() : upload libjars and so on to the job working directory
    -> writeSplits() : calculate split and write to the job working directory
    -> writeConf() : write configuration file to the job working directory
    = /tmp/hadoop/mapred/staging/wang.yan2038582574/.staging/job_local2038582574_0001/job.xml
    -> submitClient.submitJob() : execute the job
    -> LocalJobRunner.Job : we are not using yarn, so it is still standalone mode, using LocalJobRunner to do the work. It loads job files from the job working directory.
    -> get local job dir : file:/Users/wang.yan/public_work/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/build/test/mapred/local/localRunner/wang.yan/job_local470153672_0001
    -> write configuration file : /Users/wang.yan/public_work/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-examples/build/test/mapred/local/localRunner/wang.yan/job_local470153672_0001/job_local470153672_0001.xml
    -> run() : use a thread to execute the map/reduce task attempt
    -> MapTask.run() :
    -> Mapper.run() : loop input and invoke mapper
    -> TokenizerMapper.map : the map class configured on the client side.

run word count in pseudo-distributed mode and debug hadoop client by eclipse

  • cleanup output folder before running word count

    bin/hdfs dfs -rm -r /tmp/output
  • modify etc/hadoop/hadoop-env.sh file and add following line

    HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000 $HADOOP_OPTS"
  • run word count job

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0-SNAPSHOT.jar wordcount hdfs:///tmp/input hdfs:///tmp/output
    • Then this command should hang, waiting for the debugger to connect
  • open eclipse -> Project -> Debug Configurations -> Remote Java Application -> input the following and debug

    host : localhost
    port : 5000
  • since we have configured hadoop to run in pseudo-distributed mode, we can now debug in eclipse to see how the hadoop client works.
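The agent string used above is standard JDWP: transport=dt_socket listens on a TCP socket, server=y makes the JVM the listening side, suspend=y makes it wait for the debugger before running main, and address is the port eclipse connects to. A sketch that builds the same string from its parts (port 5000 is an arbitrary free port):

```shell
# Build the JDWP agent string from its parts; suspend=y blocks the JVM
# until eclipse attaches on the given port.
DEBUG_PORT=5000
HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${DEBUG_PORT} $HADOOP_OPTS"
echo "$HADOOP_OPTS"
```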

run word count in pseudo-distributed mode and debug hadoop service by eclipse

  • revert the change made to hadoop-env.sh in the previous section
  • stop resource manager : ./bin/yarn --daemon stop resourcemanager
  • modify etc/hadoop/hadoop-env.sh file and add following line

    HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000 $HADOOP_OPTS"
  • start resource manager : ./bin/yarn --daemon start resourcemanager

  • eclipse -> open the ResourceManager class, set a breakpoint on the main function
  • open eclipse -> Project -> Debug Configurations -> Remote Java Application -> input the following and debug the resource manager
    host : localhost
    port : 5000

run word count in pseudo-distributed mode and debug the hadoop child process by eclipse

  • modify etc/hadoop/mapred-site.xml and add the following property

    <property>
    <name>mapred.child.java.opts</name>
    <value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002</value>
    </property>
  • eclipse -> open the YarnChild class -> set a breakpoint on the main function

  • open eclipse -> Project -> Debug Configurations -> Remote Java Application -> input the following and debug the child process
    host : localhost
    port : 5002