Tuesday, 21 April 2015

HOW TO FIND AND FIX FS ERRORS IN HDFS

The easiest way to check for FS errors is to run an fsck on the filesystem. If the Hadoop environment variables are set, you should be able to use a path of /; if not, use the full URI hdfs://ip.or.hostname:8020/, where 8020 is the NameNode RPC port (it may be 9000 or something else in your configuration; the 50070 web UI port will not work in an hdfs:// URI).

-------
hadoop fsck /
--------

OR

--------
hadoop fsck hdfs://ip.or.hostname:8020/
-------
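
If you want to script this check, the summary line at the end of the fsck report states whether the filesystem is HEALTHY or CORRUPT. A minimal sketch (assuming the summary wording printed by current HDFS releases) could be:
--------
# Grab just the summary verdict from the fsck report (sketch; adjust the path as needed).
status=$(hadoop fsck / | grep 'The filesystem under path')
echo "$status"
echo "$status" | grep -q CORRUPT && echo "Corruption found - see the steps below."
--------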

Missing replicas should self-heal over time as HDFS re-replicates the blocks. Files with corrupt blocks, however, will not; if you want to move them to /lost+found, you can use:
------
hadoop fsck / -move
------

OR delete them with:
------
hadoop fsck / -delete
------
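
If you went the -move route, the recovered files end up under /lost+found in HDFS (the default location), and a quick listing shows what was moved there:
--------
# List everything fsck -move has parked in lost+found.
hadoop fs -ls -R /lost+found
--------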

If the summary at the end of the output reports the filesystem status as CORRUPT rather than HEALTHY, there are corrupted blocks in HDFS.

The fsck output above will be verbose, but it will mention which blocks are corrupt. You can strip out most of the noise with the command below:

---------
hadoop fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica 
----------

OR

--------
hadoop fsck hdfs://ip.or.host:8020/ | egrep -v '^\.+$' | grep -v replica | grep -v Replica
--------

This lists the affected files rather than a wall of progress dots. It may also include files that currently have under-replicated blocks, which isn't necessarily a problem. Each corrupt file shows up as a line naming its path and the corrupt or missing block.

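To capture just the affected paths for later cleanup, one option is to pull the corrupt-file lines out of the report. This is a rough sketch that assumes each affected file is reported on a line starting with its path followed by a colon:
--------
# Collect the paths of files flagged as CORRUPT or MISSING into a text file.
hadoop fsck / | grep -E 'CORRUPT|MISSING' | grep '^/' | cut -d: -f1 | sort -u > corrupt_files.txt
--------

On recent HDFS releases, hadoop fsck / -list-corruptfileblocks reports the corrupt files directly and can replace the grepping.
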
The next step is to determine how important the file is: can it simply be removed and copied back into place, or does it contain data that has to be regenerated?

If it's easy enough just to replace the file, that's the route I would take.

Remove the corrupted file from your Hadoop cluster

Either of these commands will move the corrupted file to the trash:
---------
hadoop fs -rm /path/to/filename.fileextension
hadoop fs -rm hdfs://ip.or.hostname.of.namenode:8020/path/to/filename.fileextension
---------
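
The file then sits in your HDFS trash until it is expunged; you can confirm it landed there (this assumes your local user name matches your HDFS user):
--------
# Trash lives under the invoking user's home directory in HDFS.
hadoop fs -ls -R /user/$USER/.Trash/Current
--------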

Or you can skip the trash and delete it permanently (which is probably what you want to do):

-------------
hadoop fs -rm -skipTrash /path/to/filename.fileextension
hadoop fs -rm -skipTrash hdfs://ip.or.hostname.of.namenode:8020/path/to/filename.fileextension
-------------
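
If a good copy of the data still exists outside HDFS (a backup, the source system, another cluster), putting it back into place is a single command; the local path below is hypothetical:
--------
# Hypothetical local path to a known-good copy of the file.
hadoop fs -put /local/backup/filename.fileextension /path/to/filename.fileextension
--------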

How would I repair a corrupted file if it was not easy to replace?


This may or may not be possible, but the first step is to gather information on the file's location and blocks.

---------
hadoop fsck /path/to/filename.fileextension -files -blocks -locations

hadoop fsck hdfs://ip.or.hostname.of.namenode:8020/path/to/filename.fileextension -files -blocks -locations
--------
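
The useful pieces of that output are the block IDs and the DataNodes holding them. A quick sketch to pull out just the block IDs (assuming they are printed in the usual blk_<number> form) is:
--------
# List the unique block IDs that make up the file.
hadoop fsck /path/to/filename.fileextension -files -blocks -locations | grep -o 'blk_[0-9]*' | sort -u
--------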

From this data you can track down the nodes where the corruption lives. On those nodes, look through the logs to determine what the issue is: a replaced disk, I/O errors on the server, and so on. If you can recover that machine and bring the partition holding the blocks back online, the DataNode will report the blocks back to the NameNode and the file will be healthy again. If that isn't possible, you will unfortunately have to regenerate the file some other way.

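Once you know which DataNode held the bad replica, grepping that node's DataNode log for the block ID usually points at the underlying cause. The block ID and log path below are hypothetical and will vary by distribution:
--------
# Hypothetical block ID and log location - adjust for your cluster.
grep blk_1073741825 /var/log/hadoop-hdfs/*datanode*.log*
--------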