hive scratch directory

This article aims at explaining hive scratch directory.

Scratch directory usage

Hive scratch directory is a temporary working space for storing the plans for different map/reduce stages of the query as well as the intermediate outputs of these stages.

Scratch directory clean up

Hive scratch directory is usually cleaned up by the hive client when the query finishes. However, some data may be left behind if hive client terminates abnormally. Hive server2 contains a thread (ClearDanglingScratchDir) to clean up the remaining files, we can also write our own script to do the clean up if not running Hive server2.

Scratch directory types

Hive queries may be procesed in local(the instance which hive client is invoked) or in remote(hadoop cluster). There also have two kinds of scratch dir accordingly, one in local, the other in hdfs.

Scratch directory configuration

hive.exec.local.scratchdir for local and hive.exec.scratchdir for HDFS(hive configuration).

Note that since hive 0.14.0, the HDFS scratch directory created will be ${hive.exec.scratchdir}\${user_name} indicating it supports multi-tenant natively and there is no need to include user_id in the value.

Scratch directory example

We run a simple query and see what are the files generated in the scratch directory.

1
2
3
4
--@INTERNAL hive_version:hive2
select count(*) from tb1
where col1>0
group by col2

  • when query is submitted to the cluster and waiting for containers

    1
    2
    3
    drwxr-xr-x ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/_tmp.-ext-10001
    -rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/map.xml
    -rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/reduce.xml
    • -ext- : a dir indicates the final query output
    • -mr- : a output directory for each MapReduce job
    • map.xml : map plan
    • reduce.xml : reduce plan
  • when query is running in the hadoop cluster

    1
    -rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/-ext-10001/000000_0
    • data is generated in the -ext- dir
  • when query finished, scratch dir with all files are cleaned up

These info logs are generated when running the query in the previous section.

1
2
3
4
5
6
7
8
9
session.SessionState: Created HDFS directory: /${hive.exec.scratchdir}/${job_id}/${user_name}
session.SessionState: Created HDFS directory: /${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4
session.SessionState: Created HDFS directory: /${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/_tmp_space.db
ql.Context: New scratch dir is hdfs://${namenode_ip}:8020$/{hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1
common.FileUtils: Creating directory if it doesn't exist: hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1
ql.Context: New scratch dir is hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2
exec.Utilities: PLAN PATH = hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/map.xml
exec.Utilities: PLAN PATH = hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/reduce.xml
exec.FileSinkOperator: Moving tmp dir: hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/_tmp.-ext-10001 to: hdfs://${namenode_ip}:8020/${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/-ext-10001
  • First several HDFS scratch directories are created during start SessionState.

  • _hive.hdfs.session.path = ${hive.exec.scratchdir}/${job_id}/${user_name}/${hive.session.id}

  • hive.exec.plan = ${hive.exec.scratchdir}/${job_id}/${user_name}/${hive.session.id}/${context execution id}-${task runner id}/-mr-${path id}/${random uuid}
  • map.xml path = ${hive.exec.plan}/map.xml
  • reduce.xml path = ${hive.exec.plan}/reduce.xml