Monday 27 June 2016

HDFS: Practical scenarios of HDFS Trash

Imagine you have deleted a file from your Windows machine unknowingly. Now, you don't have to worry much because it is going to be in your Recycle bin. The case is same when it comes to HDFS filesystem, whenever a file is deleted from HDFS, it is automatically moved to "Trash" folder by default. This post discusses about this feature of HDFS.

Parameter:  fs.trash.interval
------------
This parameter "fs.trash.interval" can be a number greater than 0 and is set in core-site.xml. After the trash feature is enabled, when something is removed from HDFS by using the rm command, the removed files or directories will not be wiped out immediately. Instead, they will be moved to a trash directory (for example, /user/${username}/.Trash). 

 Let us consider a sample output below:
-------
hadoop dfs -rm -r /tmp/10gb

27/06/16 21:32:47 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes. Moved: ‘hdfs://hdpnn/tmp/10gb’ to trash at: hdfs://hdpnn/user/manoj/.Trash/Current
 -------

From the above output, we can see two parameters highlighted, they are:

Deletion interval: means the time in minutes a checkpoint will be expired before it is deleted. It is the value of fs.trash.interval. The NameNode runs a thread to periodically remove expired checkpoints from the file system.

Emptier interval specifies the time in minutes the NameNode waits before running a thread to manage checkpoints. The NameNode deletes checkpoints that are older than fs.trash.interval and creates a new checkpoint from /user/${username}/.Trash/Current. This frequency is determined by the value of fs.trash.checkpoint.interval, and it must not be greater than the deletion interval. This ensures that in an emptier window, there are one or more checkpoints in the trash.

Let us take a configuration example,
-----------
fs.trash.interval = 360 (deletion interval = 6 hours)
fs.trash.checkpoint.interval = 60 (emptier interval = 1 hour)
-----------

This forces the NameNode to create a new checkpoint every one hour and delete checkpoints that have existed longer than 6 hours.

Briefly about checkpoint

A checkpoint is a directory under the user trash that is used to store all files or directories that were deleted before the checkpoint is created. If you want to take a look at the trash directory, you can see it at /user/${username}/.Trash/{timestamp_of_checkpoint_creation}.

Empty the Tash using:  hadoop fs -expunge

This command causes the NameNode to permanently delete files from the trash that are older than the threshold. It immediately removes expired checkpoints from the file system.





Real world Scenario

It is always good to enable trash to avoid unexpected removals of HDFS files and directories. Enabling trash provides a chance to recover data caused due to operational and user errors. At the same time, it is also important to set appropriate values for fs.trash.interval and fs.trash.checkpoint.interval. Example, if you need to frequently upload and delete files from the HDFS, you probably want to set fs.trash.interval to a smaller value, otherwise the checkpoints would take too much space.

Now, let us take a scenario where trash is enabled and you remove some files, HDFS capacity does not increase because files are not fully deleted. The HDFS does not reclaim the space unless the files are removed from the trash, which occurs only after checkpoints are expired. Sometimes you might want to temporarily disable trash when deleting files. In that case, you can run the rm command with the -skipTrash option as below:
-----------
hadoop fs -rm -skipTrash /path/to/permanently/delete
-----------

This bypasses the trash and removes the files immediately from the file system.



No comments:

Post a Comment

Note: only a member of this blog may post a comment.