Saturday 5 December 2015

Hadoop: Using distcp to Copy Data Between Hadoop Clusters

In this post, I will explain how to use distcp to migrate data between two clusters.

The distcp tool that ships with Hadoop copies data between two clusters. If the clusters run different versions of Hadoop, run distcp with hftp:// as the source file system and hdfs:// as the destination file system. The source is then read over the HFTP protocol and the destination is written over the HDFS protocol. The default port for HFTP is 50070, and the default port for HDFS is 8020.

Source URI: hftp://namenode-location:50070/source-directory

where namenode-location is the source cluster's NameNode hostname, as defined by its fs.default.name setting, and 50070 is the NameNode's HTTP server port, as defined by its dfs.http.address setting.

Destination URI: hdfs://nameservice-id/destination-directory or hdfs://namenode-location:8020/destination-directory

Here nameservice-id or namenode-location refers to the destination cluster's NameNode, as defined by its fs.defaultFS setting.
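To make the two URI formats concrete, here is a minimal sketch that assembles them from a hostname and a path. The function names hftp_uri and hdfs_uri and the HFTP_PORT and HDFS_PORT variables are made up for this illustration; the ports default to the values mentioned above, so override them if your cluster uses different ones.

```shell
#!/bin/sh
# Default ports as described in this post; override via the environment
# if your NameNodes listen elsewhere.
HFTP_PORT="${HFTP_PORT:-50070}"
HDFS_PORT="${HDFS_PORT:-8020}"

# Build the HFTP source URI: hftp://<namenode>:<port><path>
hftp_uri() {
    printf 'hftp://%s:%s%s\n' "$1" "$HFTP_PORT" "$2"
}

# Build the HDFS destination URI: hdfs://<namenode>:<port><path>
hdfs_uri() {
    printf 'hdfs://%s:%s%s\n' "$1" "$HDFS_PORT" "$2"
}

# Example hostnames and paths (placeholders, not real machines).
hftp_uri hadoop1-namenode /backup/source-directory
hdfs_uri hadoop2-namenode /backup/destination-directory
```

With the default ports, the two calls at the end print the same source and destination URIs that are used in the distcp examples in this post.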

NOTE: If you are using distcp as part of an upgrade, run the distcp commands on the destination cluster only, because HFTP is a read-only file system and the copy job must run where it can write. For example, to copy from the hadoop1 cluster (version 1.0) to the hadoop2 cluster (version 2.0), run the following on hadoop2:

Command:
--------
$ hadoop distcp -skipcrccheck -update hftp://hadoop1-namenode:50070/backup/source-directory hdfs://hadoop2-namenode:8020/backup/destination-directory
--------
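In the command above, -update copies only the files that are missing or different at the destination, and -skipcrccheck skips the CRC comparison between source and target, which is typically needed when the two Hadoop versions compute checksums differently. Before launching a long copy, it can help to print the exact command first. The sketch below is one way to do that; run_distcp and the DRY_RUN variable are names invented for this example, not part of Hadoop.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the command.
# Set DRY_RUN=0 to actually launch the copy on the destination cluster.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: $1 is the hftp:// source, $2 is the hdfs:// destination.
run_distcp() {
    src="$1"
    dst="$2"
    if [ "$DRY_RUN" = "1" ]; then
        # Show what would be executed without touching either cluster.
        echo "hadoop distcp -skipcrccheck -update $src $dst"
    else
        hadoop distcp -skipcrccheck -update "$src" "$dst"
    fi
}

run_distcp \
    hftp://hadoop1-namenode:50070/backup/source-directory \
    hdfs://hadoop2-namenode:8020/backup/destination-directory
```

Once the printed command looks right, re-run the script with DRY_RUN=0 on the destination cluster to start the copy.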

You can also copy a specific path, such as /hbase to move HBase data:

$ hadoop distcp hftp://hadoop1-namenode:50070/hbase hdfs://hadoop2-namenode:8020/hbase

distcp is a general utility for copying files between distributed filesystems in different clusters.

Keep reading :)
