This article explains the Hive scratch directory.
Scratch directory usage
The Hive scratch directory is a temporary working space that stores the plans for the different map/reduce stages of a query, as well as the intermediate outputs of those stages.
Scratch directory clean up
The Hive scratch directory is usually cleaned up by the Hive client when the query finishes. However, some data may be left behind if the Hive client terminates abnormally. HiveServer2 contains a thread (ClearDanglingScratchDir) that cleans up the remaining files; if HiveServer2 is not running, we can write our own script to do the cleanup.
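As a starting point for such a script, the sketch below picks out old entries from the output of `hdfs dfs -ls`. It is an illustration, not what ClearDanglingScratchDir does: the function name `stale_scratch_dirs` and the seven-day threshold are assumptions, and directory age is only a crude heuristic, since a long-running session could be flagged by mistake.

```python
from datetime import datetime, timedelta


def stale_scratch_dirs(ls_lines, now, max_age=timedelta(days=7)):
    """Return the paths in `hdfs dfs -ls` output that are older than max_age.

    The 7-day default is an arbitrary, hypothetical threshold.
    """
    stale = []
    for line in ls_lines:
        # Expected fields: permissions, replication, owner, group,
        # size, date, time, path.
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skips the "Found N items" header and blank lines
        try:
            modified = datetime.strptime(
                f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M"
            )
        except ValueError:
            continue  # not a listing line we understand
        if now - modified > max_age:
            stale.append(fields[7])
    return stale
```

The returned paths could then be reviewed and removed with `hdfs dfs -rm -r`, once it is certain that no live session still owns them.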
Scratch directory types
Hive queries may be processed locally (on the instance where the Hive client is invoked) or remotely (on the Hadoop cluster). Accordingly, there are two kinds of scratch directory: one on the local file system and one in HDFS.
Scratch directory configuration
The local scratch directory is set with hive.exec.local.scratchdir and the HDFS scratch directory with hive.exec.scratchdir; both are Hive configuration properties.
Note that since Hive 0.14.0, the HDFS scratch directory that gets created is ${hive.exec.scratchdir}/${user_name}. Multi-tenancy is therefore supported natively, and there is no need to include a user ID in the value.
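To make the two properties concrete, here is a minimal sketch that reads them from a hive-site.xml document and derives the per-user HDFS directory described above. The helper names (`scratch_settings`, `user_hdfs_scratch_dir`) and the example values `/tmp/hive` and `/tmp/hive-local` are illustrative assumptions, not values taken from a real cluster.

```python
import posixpath
import xml.etree.ElementTree as ET

SCRATCH_KEYS = ("hive.exec.scratchdir", "hive.exec.local.scratchdir")

# Illustrative configuration; the values are assumptions for this example.
EXAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive-local</value>
  </property>
</configuration>"""


def scratch_settings(hive_site_xml):
    """Extract the two scratch directory properties from hive-site.xml text."""
    root = ET.fromstring(hive_site_xml)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in SCRATCH_KEYS:
            settings[name] = prop.findtext("value", default="").strip()
    return settings


def user_hdfs_scratch_dir(settings, user_name):
    """Per-user HDFS scratch directory: ${hive.exec.scratchdir}/${user_name}."""
    return posixpath.join(settings["hive.exec.scratchdir"], user_name)
```

`posixpath` is used rather than `os.path` so that the HDFS path is joined with forward slashes regardless of the operating system running the script.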
Scratch directory example
We run a simple query and look at the files generated in the scratch directory.
--@INTERNAL hive_version:hive2
select count(*) from tb1
where col1>0
group by col2
When the query has been submitted to the cluster and is waiting for containers:
drwxr-xr-x ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/_tmp.-ext-10001
-rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/map.xml
-rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-2/-mr-10004/8cf23c5d-9a81-4e99-ae69-d8b99eee1a08/reduce.xml
- -ext- : a directory that holds the final query output
- -mr- : an output directory for each MapReduce job
- map.xml : map plan
- reduce.xml : reduce plan
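The naming conventions in the list above can be applied mechanically. The following sketch labels the components of a scratch path according to those conventions. The function name `describe_path` is hypothetical, and treating the `_tmp.` prefix as a temporary form of the `-ext-` directory is an assumption based on the listing above.

```python
import re

# Component patterns and their roles, following the list above.
# The optional "_tmp." prefix on -ext- is an assumption from the listing.
ROLES = [
    (re.compile(r"^(_tmp\.)?-ext-\d+$"), "final query output directory"),
    (re.compile(r"^-mr-\d+$"), "MapReduce job output directory"),
    (re.compile(r"^map\.xml$"), "map plan"),
    (re.compile(r"^reduce\.xml$"), "reduce plan"),
]


def describe_path(path):
    """Return (component, role) pairs for the recognized parts of a path."""
    found = []
    for part in path.split("/"):
        for pattern, role in ROLES:
            if pattern.match(part):
                found.append((part, role))
                break
    return found
```

Components that match none of the patterns, such as the session ID or the .hive-staging directory, are skipped.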
When the query is running in the Hadoop cluster:
-rw-r--r-- ${hive.exec.scratchdir}/${job_id}/${user_name}/bf17195d-b591-457a-a5e1-28426156c7f4/hive_2018-11-18_11-32-47_530_1102432233025308705-1/-mr-10000/.hive-staging_hive_2018-11-18_11-32-47_530_1102432233025308705-1/-ext-10001/000000_0
- Data is generated in the -ext- directory.
- When the query finishes, the scratch directory and all of its files are cleaned up.
Scratch directory related INFO logs
These INFO logs were generated while running the query in the previous section.
session.SessionState: Created HDFS directory: /${hive.exec.scratchdir}/${job_id}/${user_name}
First, several HDFS scratch directories are created when SessionState starts.
- _hive.hdfs.session.path = ${hive.exec.scratchdir}/${job_id}/${user_name}/${hive.session.id}
- hive.exec.plan = ${hive.exec.scratchdir}/${job_id}/${user_name}/${hive.session.id}/${context execution id}-${task runner id}/-mr-${path id}/${random uuid}
- map.xml path = ${hive.exec.plan}/map.xml
- reduce.xml path = ${hive.exec.plan}/reduce.xml
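The layout above can be written as a small path-building function, which makes it easier to check against the example listing. This is a sketch of the naming scheme only, not Hive's own code; the function name `plan_paths` and its parameter names are hypothetical.

```python
import posixpath


def plan_paths(scratch_dir, job_id, user_name, session_id,
               execution_id, task_runner_id, path_id, plan_uuid):
    """Build the session path, plan directory and plan file paths."""
    session_path = posixpath.join(scratch_dir, job_id, user_name, session_id)
    plan_dir = posixpath.join(
        session_path,
        f"{execution_id}-{task_runner_id}",  # context execution id + task runner id
        f"-mr-{path_id}",
        plan_uuid,
    )
    return {
        "_hive.hdfs.session.path": session_path,
        "hive.exec.plan": plan_dir,
        "map.xml": posixpath.join(plan_dir, "map.xml"),
        "reduce.xml": posixpath.join(plan_dir, "reduce.xml"),
    }


# Values taken from the example listing; the first three stay as placeholders.
paths = plan_paths(
    "${hive.exec.scratchdir}", "${job_id}", "${user_name}",
    "bf17195d-b591-457a-a5e1-28426156c7f4",
    "hive_2018-11-18_11-32-47_530_1102432233025308705",
    2, 10004,
    "8cf23c5d-9a81-4e99-ae69-d8b99eee1a08",
)
```

With these inputs, `paths["map.xml"]` and `paths["reduce.xml"]` match the map.xml and reduce.xml entries in the example listing.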