Monday, 22 October 2018

Sqoop: How to Export data from Hadoop (Hive or Hbase) to RDBMS


We have an sqoop export tool to export the data back to RDBMS from Hadoop. The only thing you should take care here is, you must have the table created in RDBMS before exporting data from Hadoop to a relational database.

Scenario: I need to migrate a hive table "largedeck" to a RDBMS ( mysql) table "demo" using sqoop.

Step 1:

Download a sample largedeck.txt file to local folder of the hive server.
cd /tmp/testing/

Sample content of largedeck of cards:
[root@manoj-testing]# head -10 largedeck.txt

Step 2:
Create an external table "largedeck" in hive and Load the data from local folder "/tmp/testing" to
 that external table.

Command to create a hive external table "largedeck" to location "/user/manoj/largedeck":
create external table largedeck (color string, suit string, pip string)
LOCATION '/user/manoj/largedeck.txt';


LOAD the data to hive table "largedeck"
LOAD DATA  LOCAL INPATH '/tmp/testing/largedeck.txt' INTO TABLE largedeck;

STEP 4: Run the following commands from hive beeline shell to make sure data has been load
>select * from largedeck limit 10;
> select suit, count(1) from largedeck group by suit;

STEP 5:Create a database "demo" in mysql servers and set the user as "demo_user". The password for it is 'changeit'

create database demo;
create user 'demo_user' identified by 'changeit';
grant all on demo.* to demo_user;
flush privileges;

STEP 6: Now create the following tables in table

mysql -u demo_user -p

create table deck_of_cards (color varchar(10), suit varchar(10), pip varchar(2));

STEP 7:  Run the command from sqoop server to start exporting the hive table "largedeck" to mysql table "demo_user".

 sqoop export --connect "jdbc:mysql://<hostname-of-mysql-server>/demo" --username demo_user --password changeit --table deck_of_cards --export-dir /user/manoj/largedeck.txt --input-fields-terminated-by '|' --num-mappers 2 --outdir java_files

STEP 8:  Once sqoop export is complete run the below command in mysql shell to see the output

> select count(1) from deck_of_cards;

Thursday, 18 October 2018

Sqoop: Export/Import data from RDBMS to Hadoop and viceversa


Sqoop is a tool used for data transfer between RDBMS (like MySQL, Oracle SQL etc.) and Hadoop (Hive, HDFS, and HBASE etc.)It is used to import data from RDBMS to Hadoop and export data from Hadoop to RDBMS.
Image result for sqoop architecture

Sqoop Architecture:

Image result for sqoop architecture

When Sqoop starts functioning, only mapper job will run and reducer is not required. Here only Map phase will run and reduce is not required because the complete import and export process doesn’t require any aggregation and so there is no need of reducers in Sqoop.


Sqoop import tool is used to import the tables from a relational database like MySQL, Oracle SQL the Hadoop system like Sqoop import to HDFS or Hbase etc.

Each row of an RDBMS table is treated as the records of the file. It will store text data in text file and binary data in sequence file or Avro file.

For example, I have one table as below-

The table name is “emp” in the database “userdb” in MySQL database server

Now we will import this table emp data into HDFS. Let’s see how easy it is-

$ sqoop import --connect jdbc:mysql://<mysql-server-hostname>/<database-name>  --username <username-of-mysql-db> --table <table-name>  -n <number-of-mappers> 


$ sqoop import –connect jdbc:mysql://localhost/userdb –username root –table emp m 1

Once you will run the above commands, you will find it running and using some mappers. Once done, you can also see the output using below commands-
$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

You will find the output file in HDFS something like below. It will be a comma (,) separated.

101,     John,   5000,   IT,        Developer

102,     Chris,   6000,   Sales,   Manager

103,     Jamie,  7000,   Support, Director

Using the above command, the file will be saved in a default location in HDFS.

Import Table data in a particular location

But what if we want to save the file in a particular location (let’s say rdbmsdata) in HDFS?

You can specify the location (target directory) in above command. For this, we use the below command-

–target-dir /rdbmsdata

So the complete command will be something like below-

$ sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp --m 1 --target-dir /rdbmsdata

Verify the result using the below command-

$HADOOP_HOME/bin/hadoop fs -cat /rdbmsdata/part-m-*

Incremental Import

This section will be highly helpful to you if you are using on real time data. For example, you are working on some employee data of a company and everyday new hiring is getting done and so new employee will get added as well.

So in such situation, you will only want to import only the newly updated records and in such case incremental import will help you a lot.

To get this done, you need mainly three things-

Incremental-  Mode like appended
check-column – based on which column you want to decide the incremental value
last value – What was the last value you imported so that it will take next value after that

Now considering a new employee has been added in the emp table with the record as below-
104,     Henning,         4000,   IT,        Developer

And we want to take ID column as the check column and so the last value was 103 and the new is 104.

The complete query to import the newly added rows in the table emp.

Let’s see the query-

sqoop import --connect jdbc:mysql://localhost/userdb --username root --table emp 

--m 1 --incremental append --check-column id --last-value 103

Again you can check the output using the below command-

$HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

As you have seen so far, the result will be comma separated by default while importing the data from the relational database to Hadoop.

In all the above examples, I have shown you how to import table in HDFS through Sqoop. 

How to Import Table Data in HBase with Sqoop?

Like we imported table in HDFS through Sqoop, in a similar way we can do for Hbase as well but there is some twist.

Sqoop doesn’t now permit you to import a relational table directly into an HBase table having multiple column families. To work around this limitation, you can create the HBase table first and then execute three Sqoop import operations to finish the task.

Let’s start by creating an Hbase table first-

Create ‘tablename’, ‘columnfamily’

For example: create ‘empdetails’, ‘personal data’, ‘professional data’

Once you have created a table in HBase, you are all set to write the command to import the RDBMS table into Hbase through Sqoop. Below is the code-

sqoop import --connect jdbc:mysql://localhost/serviceorderdb --username root -P --table emp --columns “designation,name” --hbase-table empdetails --column-family personal details --hbase-row-key ID -m 1

Once executed, you can scan the HBase table to check the output.
Scan empdetails;

In the next post I'll write the procedure to export data from Hadoop to RDBMS.

Wednesday, 1 August 2018

OpenSSL commands

OpenSSL is an open-source implementation of SSL/TLS protocols and is considered to be one of the most versatile SSL tools. 
Open SSL is normally used to generate a Certificate Signing Request (CSR) and private key for different platforms. However, it also has several different functions, which can be listed as follows. It is used to:
  • View details about a CSR or a certificate
  • Compare MD5 hash of a certificate and private key to ensure they match
  • Verify proper installation of the certificate on a website
  • Convert the certificate format
Most of the functions mentioned below can also be performed without involving OpenSSL by using these convenient SSL tools. Here, we have put together few of the most common OpenSSL commands.

General OpenSSL Commands

These are the set of commands that allow the users to generate CSRs, Certificates, Private Keys and many other miscellaneous tasks. Here, we have listed few such commands:
(1) Generate a Certificate Signing Request (CSR) and new private key
openssl req -out CSR.csr -new -newkey rsa:2048 -nodes -keyout privateKey.key

(2) Generate a self-signed certificate
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout privateKey.key -out certificate.crt

(3) Create CSR based on an existing private key
openssl req -out CSR.csr -key privateKey.key –new

(4) Create CSR based on an existing certificate
openssl x509 -x509toreq -in certificate.crt -out CSR.csr -signkey privateKey.key

(5) Passphrase removal from a private key
openssl rsa -in privateKey.pem -out newPrivateKey.pem

SSL Check Commands

These commands are very helpful if the user wants to check the information within an SSL certificate, a Private Key, and CSR. Few online tools can also help you check CSRs and check SSL certificates.
(1) Certificate Signing Request (CSR)
openssl req -text -noout -verify -in CSR.csr

(2) Private Key
openssl rsa -in privateKey.key –check

(3) SSL Certificate
openssl x509 -in certificate.crt -text –noout

(4) PKCS#12 File (.pfx or .p12)
openssl pkcs12 -info -in keyStore.p12

Convert Commands

As per the title, these commands help convert the certificates and keys into different formats to impart them the compatibility with specific servers types. For example, a PEM file, compatible with Apache server, can be converted to PFX (PKCS#12), after which it would be possible for it to work with Tomcat or IIS. 
(1) Convert DER Files (.crt, .cer, .der) to PEM
openssl x509 -inform der -in certificate.cer -out certificate.pem

(2) Convert PEM to DER
openssl x509 -outform der -in certificate.pem -out certificate.der

(3) Convert PKCS #12 File (.pfx, .p12) Containing a Private Key and Certificate to PEM
openssl pkcs12 -in keyStore.pfx -out keyStore.pem –nodes
To output only the private key, users can add –nocerts or –nokeys to output only the certificates.

(4) Convert PEM Certificate (File and a Private Key) to PKCS # 12 (.pfx #12)
openssl pkcs12 -export -out certificate.pfx -inkey privateKey.key -in certificate.crt -certfile CACert.crt

Debugging Using OpenSSL Commands

If there are error messages popping up about your private key not matching the certificate or that the newly-installed certificate is not trusted, you can rely on one of the comments mentioned below. 
(1) Check SSL Connection (All certificates, including Intermediates, are to be displayed)
Here, all the certificates should be displayed, including the Intermediates as well.
openssl s_client -connect

(2) Check MD5 Hash of Public Key
This is to ensure that the public key matches with the CSR or the private key.
openssl x509 -noout -modulus -in certificate.crt | openssl md5
openssl rsa -noout -modulus -in privateKey.key | openssl md5
openssl req -noout -modulus -in CSR.csr | openssl md5

SSL Keytool List

Java Keytool is a key and certificate management utility that allows the users to cache the certificate and manage their own private or public key pairs and certificates. Java Keytool stores all the keys and certificates in a ‘Keystore’, which is, by default, implemented as a file. It contains private keys and certificates that are essential for establishing the reliability of the primary certificate and completing a chain of trust.
Every certificate in Java Keystore has a unique pseudonym/alias. For creating a ‘Java Keystore’, you need to first create the .jks file containing only the private key in the beginning. After that, you need to generate a Certificate Signing Request (CSR) and generate a certificate from it. After this, import the certificate to the Keystore including any root certificates.
The ‘Java Keytool’ basically contains several other functions that help the users export a certificate or to view the certificate details or the list of certificates in Keystore.
Here are few important Java Keytool commands:

For Creating and Importing

These Keytool commands allow users to create a new Java Keytool keysKeystore, generate a Certificate Signing Request (CSR) and import certificates. Before you import the primary certificate for your domain, you need to first import any root or intermediate certificates.
(1) Import a root or intermediate CA certificate to an existing Java keystore
keytool -import -trustcacerts -alias root -file Thawte.crt -keystore keystore.jks

(2) Import a signed primary certificate to an existing Java keystore
keytool -import -trustcacerts -alias mydomain -file mydomain.crt -keystore keystore.jks

(3) Generate a keystore and self-signed certificate
keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass password -validity 360 -keysize 2048

(4) Generate Key Pair & Java Keystore
keytool -genkey -alias mydomain -keyalg RSA -keystore keystore.jks -keysize 2048

(5) Generate CSR for existing Java Keystore
keytool -certreq -alias mydomain -keystore keystore.jks -file mydomain.csr

For Checking

Users can check the information within a certificate or Java keystore by using the following commands:
(1) Check an individual certificate
keytool -printcert -v -file mydomain.crt

(2) Check certificates in Java keystore
keytool -list -v -keystore keystore.jks

(3) Check specific keystore entry using an alias
keytool -list -v -keystore keystore.jks -alias mydomain

Other Java Keytool Commands

(1) Delete a certificate from Java Keystore keystore
keytool -delete -alias mydomain -keystore keystore.jks

(2) Change the password in Java keystore / Change a Java keystore password
keytool -storepasswd -new new_storepass -keystore keystore.jks

(3) Export certificate from Java keystore
keytool -export -alias mydomain -file mydomain.crt -keystore keystore.jks

(4) List the trusted CA Certificate
keytool -list -v -keystore $JAVA_HOME/jre/lib/security/cacerts

(5) Import new CA into Trusted Certs
keytool -import -trustcacerts -file /path/to/ca/ca.pem -alias CA_ALIAS -keystore $JAVA_HOME/jre/lib/securi

Tuesday, 31 July 2018

Hadoop Security: Useful commands for Active Directory usergroup management

In this blog post, I will discuss on some of the common but very useful commands to manage the users in AD.

I've seen that in many Hadoop projects there is a separate AD team for managing Active Directory servers. Many a time a Hadoop admin want to see whether the user has been added in AD or whether a user has been added to a group or whether the password of the user expired etc: The following commands helps in these situations. 

Case1: To check in which all group a user belongs to:

Command:  id <username>

For example: 

[root@manoj ~]$ id hdpadmin

uid=731803102(hdpadmin) gid=731800513(domain_users) groups=731800513(domain_users),731801610(hadoopadmin)

The example states that hdpadmin is a part of "hadoopadmin" group and "domain_users" group.

Case2: Which all users belong to a particular group:

Command:  getent group  <groupname>

For example: 

[root@manoj1 ~]$ getent group hadoopadmin


The output shows that in "hadoopadmin" group "hdpadmin" and "ambari" users are present.

Case2: To check whether the password is working for a user:

Command:  ldapsearch -D <username@domainname> -W

For example:

[root@manoj1 ~]$: ldapsearch -D -W

Then give the password of hdpadmin user. If you get the output as password accepted then you are fine.

Tuesday, 24 April 2018

How to find swap memory utilizing processes in a Linux machine

Top command:

Type the top command as root:
# top

To sort process as per swap page usage (SWAP = VIRT – RES) type capital O (option) followed by p (small p) and [Enter] key:

Single line script to find out the processes consuming high swap memory.

The following script will be useful in finding the process which is utilizing high swap.

(echo "COMM PID SWAP"; for file in /proc/*/status ; do awk '/^Pid|VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | grep kB | grep -wv "0 kB" | sort -k 3 -n -r) | column -t


Tips to find process utilization in a Linux machine

The ps command has several flags that enable you to specify which processes to list and what information to display about each process.
To show all processes running on your system, at the prompt, type the following:
ps -ef
The system displays information similar to the following:
    root     1     0   0   Jun 28      -  3:23 /etc/init 
    root  1588  6963   0   Jun 28      -  0:02 /usr/etc/biod 6 
    root  2280     1   0   Jun 28      -  1:39 /etc/syncd 60 
    mary  2413 16998   2 07:57:30      -  0:05 aixterm 
    mary 11632 16998   0 07:57:31  lft/1  0:01 xbiff 
    mary 16260  2413   1 07:57:35  pts/1  0:00 /bin/ksh 
    mary 16469     1   0 07:57:12  lft/1  0:00 ksh /usr/lpp/X11/bin/xinit 
    mary 19402 16260  20 09:37:21  pts/1  0:00 ps -ef 
The columns in the previous output are defined as follows:

Item Description
USER User login name
PID         Process ID
PPID Parent process ID
C         CPU utilization of process
STIME Start time of process
TTY         Controlling workstation for the process
TIME Total execution time for the process
CMD Command
In the previous example, the process ID for the ps -ef command is 19402. Its parent process ID is 16260, the /bin/kshcommand.
If the listing is very long, the top portion scrolls off the screen. To display the listing one page (screen) at a time, pipe the pscommand to the pg command. At the prompt, type the following:
ps -ef | pg
To display status information of all processes running on your system, at the prompt, type the following:
ps gv
This form of the command lists a number of statistics for each active process. Output from this command looks similar to the following:
     0      - A     0:44    7     8     8    xx     0     0  0.0  0.0 swapper
     1      - A     1:29  518   244   140    xx    21    24  0.1  1.0 /etc/init
   771      - A     1:22    0    16    16    xx     0     0  0.0  0.0 kproc
  1028      - A     0:00   10    16     8    xx     0     0  0.0  0.0 kproc
  1503      - A     0:33  127    16     8    xx     0     0  0.0  0.0 kproc
  1679      - A     1:03  282   192    12 32768   130     0  0.7  0.0 pcidossvr
  2089      - A     0:22  918    72    28    xx     1     4  0.0  0.0 /etc/sync
  2784      - A     0:00    9    16     8    xx     0     0  0.0  0.0 kproc
  2816      - A     5:59 6436  2664   616     8   852   156  0.4  4.0 /usr/lpp/
  3115      - A     0:27  955   264   128    xx    39    36  0.0  1.0 /usr/lib/
  3451      - A     0:00    0    16     8    xx     0     0  0.0  0.0 kproc
  3812      - A     0:00   21   128    12 32768    34     0  0.0  0.0 usr/lib/lpd/
  3970      - A     0:00    0    16     8    xx     0     0  0.0  0.0 kproc
  4267      - A     0:01  169   132    72 32768    16    16  0.0  0.0 /etc/sysl
  4514  lft/0 A     0:00   60   200    72    xx    39    60  0.0  0.0 /etc/gett
  4776  pts/3 A     0:02  250   108   280     8   303   268  0.0  2.0 -ksh 
  5050      - A     0:09 1200   424   132 32768   243    56  0.0  1.0 /usr/sbin
  5322      - A     0:27 1299   156   192    xx    24    24  0.0  1.0 /etc/cron
  5590      - A     0:00    2   100    12 32768    11     0  0.0  0.0 /etc/writ
  5749      - A     0:00    0   208    12    xx    13     0  0.0  0.0 /usr/lpp/
  6111      - T     0:00   66   108    12 32768    47     0  0.0  0.0 /usr/lpp/

Thursday, 19 April 2018

Hbase Replication Implementation


This post explains the procedure to implement hbase replication between two clusters.


Source cluster server hadoop215 ( HADOOP-INT) and destination cluster server hadoop220  (HADOOP-ANA). ( SSL disabled cluster to SSL enabled cluster).


Step 1:  Created the peers in below format in the source cluster server hadoop215 ( HADOOP-INT).

add_peer 'ID', 'CLUSTER_KEY'  



Note: The value for “zookeeper.znode.parent” can be obtained from "/etc/hbase/conf/hbase-site.xml" property. For unsecured cluster this value is “/hbase” and for secured cluster this value is “/hbase-unsecure”.


To create a peer with peer id ‘2’ and to list the peers.

From the hbase shell run the following command as below:

hbase(main)> add_peer '2', 'hadoop216,hadoop217,hadoop218:2181:/hbase'

hbase(main)> list_peers

Step 2: Create a test table ‘reptable4’ with REPLICATION_SCOPE => ‘1’ in source cluster 

Step 3: Enable replication for table ‘reptable4’ in source cluster :

Useful commands related to hbase replication:

  • ·         To list the peers:

  • ·         To create a column family with replication scope 1
>create 'reptable4', { NAME => 'cf1', REPLICATION_SCOPE => 1}

  • ·         To describe a table
>describe 'reptable4'

  • ·         To enable replication for a table:
>enable_table_replication 'reptable4’

  • ·         To insert a row and value to a column
>put 'reptable4', 'row1', 'cf1:v2', 'bar'

  • ·         To see the detailed schema for a table:
>scan 'reptable4'

  • ·         To disable replication for a table:

>disable_table_replication 'reptable4’

  • ·         To list replicated tables:

Reference link: