Wednesday, 1 August 2018

OpenSSL commands

OpenSSL is an open-source implementation of SSL/TLS protocols and is considered to be one of the most versatile SSL tools. 
Open SSL is normally used to generate a Certificate Signing Request (CSR) and private key for different platforms. However, it also has several different functions, which can be listed as follows. It is used to:
  • View details about a CSR or a certificate
  • Compare MD5 hash of a certificate and private key to ensure they match
  • Verify proper installation of the certificate on a website
  • Convert the certificate format
Most of the functions mentioned below can also be performed without involving OpenSSL by using these convenient SSL tools. Here, we have put together few of the most common OpenSSL commands.

General OpenSSL Commands

These are the set of commands that allow the users to generate CSRs, Certificates, Private Keys and many other miscellaneous tasks. Here, we have listed few such commands:
(1) Generate a Certificate Signing Request (CSR) and new private key
openssl req -out CSR.csr -new -newkey rsa:2048 -nodes -keyout privateKey.key

(2) Generate a self-signed certificate
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout privateKey.key -out certificate.crt

(3) Create CSR based on an existing private key
openssl req -out CSR.csr -key privateKey.key –new

(4) Create CSR based on an existing certificate
openssl x509 -x509toreq -in certificate.crt -out CSR.csr -signkey privateKey.key

(5) Passphrase removal from a private key
openssl rsa -in privateKey.pem -out newPrivateKey.pem

SSL Check Commands

These commands are very helpful if the user wants to check the information within an SSL certificate, a Private Key, and CSR. Few online tools can also help you check CSRs and check SSL certificates.
(1) Certificate Signing Request (CSR)
openssl req -text -noout -verify -in CSR.csr

(2) Private Key
openssl rsa -in privateKey.key –check

(3) SSL Certificate
openssl x509 -in certificate.crt -text –noout

(4) PKCS#12 File (.pfx or .p12)
openssl pkcs12 -info -in keyStore.p12

Convert Commands

As per the title, these commands help convert the certificates and keys into different formats to impart them the compatibility with specific servers types. For example, a PEM file, compatible with Apache server, can be converted to PFX (PKCS#12), after which it would be possible for it to work with Tomcat or IIS. 
(1) Convert DER Files (.crt, .cer, .der) to PEM
openssl x509 -inform der -in certificate.cer -out certificate.pem

(2) Convert PEM to DER
openssl x509 -outform der -in certificate.pem -out certificate.der

(3) Convert PKCS #12 File (.pfx, .p12) Containing a Private Key and Certificate to PEM
openssl pkcs12 -in keyStore.pfx -out keyStore.pem –nodes
To output only the private key, users can add –nocerts or –nokeys to output only the certificates.

(4) Convert PEM Certificate (File and a Private Key) to PKCS # 12 (.pfx #12)
openssl pkcs12 -export -out certificate.pfx -inkey privateKey.key -in certificate.crt -certfile CACert.crt

Debugging Using OpenSSL Commands

If there are error messages popping up about your private key not matching the certificate or that the newly-installed certificate is not trusted, you can rely on one of the comments mentioned below. 
(1) Check SSL Connection (All certificates, including Intermediates, are to be displayed)
Here, all the certificates should be displayed, including the Intermediates as well.
openssl s_client -connect

(2) Check MD5 Hash of Public Key
This is to ensure that the public key matches with the CSR or the private key.
openssl x509 -noout -modulus -in certificate.crt | openssl md5
openssl rsa -noout -modulus -in privateKey.key | openssl md5
openssl req -noout -modulus -in CSR.csr | openssl md5

SSL Keytool List

Java Keytool is a key and certificate management utility that allows the users to cache the certificate and manage their own private or public key pairs and certificates. Java Keytool stores all the keys and certificates in a ‘Keystore’, which is, by default, implemented as a file. It contains private keys and certificates that are essential for establishing the reliability of the primary certificate and completing a chain of trust.
Every certificate in Java Keystore has a unique pseudonym/alias. For creating a ‘Java Keystore’, you need to first create the .jks file containing only the private key in the beginning. After that, you need to generate a Certificate Signing Request (CSR) and generate a certificate from it. After this, import the certificate to the Keystore including any root certificates.
The ‘Java Keytool’ basically contains several other functions that help the users export a certificate or to view the certificate details or the list of certificates in Keystore.
Here are few important Java Keytool commands:

For Creating and Importing

These Keytool commands allow users to create a new Java Keytool keysKeystore, generate a Certificate Signing Request (CSR) and import certificates. Before you import the primary certificate for your domain, you need to first import any root or intermediate certificates.
(1) Import a root or intermediate CA certificate to an existing Java keystore
keytool -import -trustcacerts -alias root -file Thawte.crt -keystore keystore.jks

(2) Import a signed primary certificate to an existing Java keystore
keytool -import -trustcacerts -alias mydomain -file mydomain.crt -keystore keystore.jks

(3) Generate a keystore and self-signed certificate
keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass password -validity 360 -keysize 2048

(4) Generate Key Pair & Java Keystore
keytool -genkey -alias mydomain -keyalg RSA -keystore keystore.jks -keysize 2048

(5) Generate CSR for existing Java Keystore
keytool -certreq -alias mydomain -keystore keystore.jks -file mydomain.csr

For Checking

Users can check the information within a certificate or Java keystore by using the following commands:
(1) Check an individual certificate
keytool -printcert -v -file mydomain.crt

(2) Check certificates in Java keystore
keytool -list -v -keystore keystore.jks

(3) Check specific keystore entry using an alias
keytool -list -v -keystore keystore.jks -alias mydomain

Other Java Keytool Commands

(1) Delete a certificate from Java Keystore keystore
keytool -delete -alias mydomain -keystore keystore.jks

(2) Change the password in Java keystore / Change a Java keystore password
keytool -storepasswd -new new_storepass -keystore keystore.jks

(3) Export certificate from Java keystore
keytool -export -alias mydomain -file mydomain.crt -keystore keystore.jks

(4) List the trusted CA Certificate
keytool -list -v -keystore $JAVA_HOME/jre/lib/security/cacerts

(5) Import new CA into Trusted Certs
keytool -import -trustcacerts -file /path/to/ca/ca.pem -alias CA_ALIAS -keystore $JAVA_HOME/jre/lib/securi

Tuesday, 31 July 2018

Hadoop Security: Useful commands for Active Directory usergroup management

In this blog post, I will discuss on some of the common but very useful commands to manage the users in AD.

I've seen that in many Hadoop projects there is a separate AD team for managing Active Directory servers. Many a time a Hadoop admin want to see whether the user has been added in AD or whether a user has been added to a group or whether the password of the user expired etc: The following commands helps in these situations. 

Case1: To check in which all group a user belongs to:

Command:  id <username>

For example: 

[root@manoj ~]$ id hdpadmin

uid=731803102(hdpadmin) gid=731800513(domain_users) groups=731800513(domain_users),731801610(hadoopadmin)

The example states that hdpadmin is a part of "hadoopadmin" group and "domain_users" group.

Case2: Which all users belong to a particular group:

Command:  getent group  <groupname>

For example: 

[root@manoj1 ~]$ getent group hadoopadmin


The output shows that in "hadoopadmin" group "hdpadmin" and "ambari" users are present.

Case2: To check whether the password is working for a user:

Command:  ldapsearch -D <username@domainname> -W

For example:

[root@manoj1 ~]$: ldapsearch -D -W

Then give the password of hdpadmin user. If you get the output as password accepted then you are fine.

Tuesday, 24 April 2018

How to find swap memory utilizing processes in a Linux machine

Top command:

Type the top command as root:
# top

To sort process as per swap page usage (SWAP = VIRT – RES) type capital O (option) followed by p (small p) and [Enter] key:

Single line script to find out the processes consuming high swap memory.

The following script will be useful in finding the process which is utilizing high swap.

(echo "COMM PID SWAP"; for file in /proc/*/status ; do awk '/^Pid|VmSwap|Name/{printf $2 " " $3}END{ print ""}' $file; done | grep kB | grep -wv "0 kB" | sort -k 3 -n -r) | column -t


Tips to find process utilization in a Linux machine

The ps command has several flags that enable you to specify which processes to list and what information to display about each process.
To show all processes running on your system, at the prompt, type the following:
ps -ef
The system displays information similar to the following:
    root     1     0   0   Jun 28      -  3:23 /etc/init 
    root  1588  6963   0   Jun 28      -  0:02 /usr/etc/biod 6 
    root  2280     1   0   Jun 28      -  1:39 /etc/syncd 60 
    mary  2413 16998   2 07:57:30      -  0:05 aixterm 
    mary 11632 16998   0 07:57:31  lft/1  0:01 xbiff 
    mary 16260  2413   1 07:57:35  pts/1  0:00 /bin/ksh 
    mary 16469     1   0 07:57:12  lft/1  0:00 ksh /usr/lpp/X11/bin/xinit 
    mary 19402 16260  20 09:37:21  pts/1  0:00 ps -ef 
The columns in the previous output are defined as follows:

Item Description
USER User login name
PID         Process ID
PPID Parent process ID
C         CPU utilization of process
STIME Start time of process
TTY         Controlling workstation for the process
TIME Total execution time for the process
CMD Command
In the previous example, the process ID for the ps -ef command is 19402. Its parent process ID is 16260, the /bin/kshcommand.
If the listing is very long, the top portion scrolls off the screen. To display the listing one page (screen) at a time, pipe the pscommand to the pg command. At the prompt, type the following:
ps -ef | pg
To display status information of all processes running on your system, at the prompt, type the following:
ps gv
This form of the command lists a number of statistics for each active process. Output from this command looks similar to the following:
     0      - A     0:44    7     8     8    xx     0     0  0.0  0.0 swapper
     1      - A     1:29  518   244   140    xx    21    24  0.1  1.0 /etc/init
   771      - A     1:22    0    16    16    xx     0     0  0.0  0.0 kproc
  1028      - A     0:00   10    16     8    xx     0     0  0.0  0.0 kproc
  1503      - A     0:33  127    16     8    xx     0     0  0.0  0.0 kproc
  1679      - A     1:03  282   192    12 32768   130     0  0.7  0.0 pcidossvr
  2089      - A     0:22  918    72    28    xx     1     4  0.0  0.0 /etc/sync
  2784      - A     0:00    9    16     8    xx     0     0  0.0  0.0 kproc
  2816      - A     5:59 6436  2664   616     8   852   156  0.4  4.0 /usr/lpp/
  3115      - A     0:27  955   264   128    xx    39    36  0.0  1.0 /usr/lib/
  3451      - A     0:00    0    16     8    xx     0     0  0.0  0.0 kproc
  3812      - A     0:00   21   128    12 32768    34     0  0.0  0.0 usr/lib/lpd/
  3970      - A     0:00    0    16     8    xx     0     0  0.0  0.0 kproc
  4267      - A     0:01  169   132    72 32768    16    16  0.0  0.0 /etc/sysl
  4514  lft/0 A     0:00   60   200    72    xx    39    60  0.0  0.0 /etc/gett
  4776  pts/3 A     0:02  250   108   280     8   303   268  0.0  2.0 -ksh 
  5050      - A     0:09 1200   424   132 32768   243    56  0.0  1.0 /usr/sbin
  5322      - A     0:27 1299   156   192    xx    24    24  0.0  1.0 /etc/cron
  5590      - A     0:00    2   100    12 32768    11     0  0.0  0.0 /etc/writ
  5749      - A     0:00    0   208    12    xx    13     0  0.0  0.0 /usr/lpp/
  6111      - T     0:00   66   108    12 32768    47     0  0.0  0.0 /usr/lpp/

Thursday, 19 April 2018

Hbase Replication Implementation


This post explains the procedure to implement hbase replication between two clusters.


Source cluster server hadoop215 ( HADOOP-INT) and destination cluster server hadoop220  (HADOOP-ANA). ( SSL disabled cluster to SSL enabled cluster).


Step 1:  Created the peers in below format in the source cluster server hadoop215 ( HADOOP-INT).

add_peer 'ID', 'CLUSTER_KEY'  



Note: The value for “zookeeper.znode.parent” can be obtained from "/etc/hbase/conf/hbase-site.xml" property. For unsecured cluster this value is “/hbase” and for secured cluster this value is “/hbase-unsecure”.


To create a peer with peer id ‘2’ and to list the peers.

From the hbase shell run the following command as below:

hbase(main)> add_peer '2', 'hadoop216,hadoop217,hadoop218:2181:/hbase'

hbase(main)> list_peers

Step 2: Create a test table ‘reptable4’ with REPLICATION_SCOPE => ‘1’ in source cluster 

Step 3: Enable replication for table ‘reptable4’ in source cluster :

Useful commands related to hbase replication:

  • ·         To list the peers:

  • ·         To create a column family with replication scope 1
>create 'reptable4', { NAME => 'cf1', REPLICATION_SCOPE => 1}

  • ·         To describe a table
>describe 'reptable4'

  • ·         To enable replication for a table:
>enable_table_replication 'reptable4’

  • ·         To insert a row and value to a column
>put 'reptable4', 'row1', 'cf1:v2', 'bar'

  • ·         To see the detailed schema for a table:
>scan 'reptable4'

  • ·         To disable replication for a table:

>disable_table_replication 'reptable4’

  • ·         To list replicated tables:

Reference link:

Hbase Regions are not online or assigned


This blog post  explains how to resolve the HBase Regions issue where Regions are not assigned to any one/two of Regions Server, though Region Servers are up & running


In HBase, tables are split into regions and are served by the region servers. So all Region Severs will be assigned number of Regions depending upon size of Data available in Hbase. If any unexpected exit/failure at Service/Server level happens the regions normally assigned to other available Region Servers by Hbase Master and once failed Region Server is back online, regions started getting assigned by same Hbase Master.

If Regions have not being assigned or number of online / available regions are zero for any particular Regions Server of Hbase, user(s) will be facing issue/failure at Hbase read/write operations via jobs / application. 

From Active Hbase Master Web UI address like: http://<Hbase_Master_Host_Name>:60010/master-status, number of regions will be showing NULL/ZERO.

Hbase Region Server logs can be captured with error as



To fix the issue where Regions are not being assigned or online regions is not visible at Hbase  Master Web UI Hbase Master Web UI address like:


1.Login to shell in Hbase Master host using putty/terminal emulator

**sudo rights required, if Kerberos Security have been enabled

2.Authenticate user by generating Kerberos Ticket for Hbase

**Hbase user Kerberos ticket required as below command will be executed in hbase prompt

3.In $hbase shell execute "assign"  command.

Command: $hbase>assign;

**Validate from Hbase Master Web UI, Regions will be started assigning for effected Region Server; depending of Data Size "assigning of regions" can take 15 to 30min
**Some cases executing "assign" command don't resolve the Region assigning issue.

4.If 'Step 4' fails, executing "balancer" command in hbase shell will help

5.Execute "hbase hbck" command out side of hbase shell OR "hbck" command inside hbase shell
** "hbase hbck" is a command-line tool that checks for region consistency and table integrity problems and repairs corruption.

6.Repair all inconsistencies and corruption at once, use the "-repair" option, which includes all the region and table consistency options.

Command: $hbase hbck -repair

7.[OPTIONAL STEPS] Some case "-repair" doesn't fix the inconsistency of Regions; hence user should try fixing Regions step by step mention below:

a.-fixAssignments repairs unassigned, incorrectly assigned or multiply assigned regions

Command: $hbase hbck -fixAssignments

b.-fixMeta removes rows from hbase:meta when their corresponding regions are not present in HDFS and adds new meta rows if regions are present in HDFS but not in hbase:meta.

Command: $hbase hbck -fixMeta

c.-repairHoles creates HFiles for new empty regions on the filesystem and ensures that the new regions are consistent.

Command: $hbase hbck -repairHoles

d.-fixHdfsOrphans repairs a region directory that is missing a region metadata file (the .regioninfo file).

Command: $hbase hbck -fixHdfsOrphans

**Regions are closed during repair.

8.Once fixing Hbase Regions have been completed with Step -6,Step -7,Step -8, below checks need to confirm

        •Confirm online/available regions have been assigned to all Region Server through Active Hbase Master Web UI

  • Confirm "hbase hbck" from command prompt status replicate "0 inconsistencies detected"

Friday, 13 April 2018

Distribution of Executors, Cores and Memory for a Spark Application running in Yarn

We can normally run a spark job via spark submit. Ever wondered how to configure --num-executors, --executor-memory and --executor-cores spark config params for your cluster?

spark-submit --class <CLASS_NAME> --num-executors ? --executor-cores ? --executor-memory ? ..

More specific as below by mentioning the executor memory and core for a wordcount example:
spark-submit --class com.hadoop.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster --driver-memory=2g --executor-memory 2g --executor-cores 1 --num-executors 1 SparkSimple-0.0.1SNAPSHOT.jar hdfs:// hdfs://

Following list captures some recommendations to keep in mind while configuring them:

  • Hadoop/Yarn/OS Deamons: When we run spark application using a cluster manager like Yarn, there’ll be several daemons that’ll run in the background like NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. So, while specifying num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.
  • Yarn ApplicationMaster (AM): ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the containers and their resource consumption. If we are running spark on yarn, then we need to budget in the resources that AM would need (~1024MB and 1 Executor).
  • HDFS Throughput: HDFS client has trouble with tons of concurrent threads. It was observed that HDFS achieves full write throughput with ~5 tasks per executor . So it’s good to keep the number of cores per executor below that number.
  • MemoryOverhead: Following picture depicts spark-yarn-memory-usage

Two things to make note of from this picture:
Full memory requested to yarn per executor =
          spark-executor-memory + spark.yarn.executor.memoryOverhead.
 spark.yarn.executor.memoryOverhead = 
        Max(384MB, 7% of spark.executor-memory

So, if we request 20GB per executor, AM will actually get 20GB + memoryOverhead = 20 + 7% of 20GB = ~23GB memory for us.

  • Running executors with too much memory often results in excessive garbage collection delays.
  • Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.
Hands on,

Now, let’s consider a 10 node cluster with following config and analyse different possibilities of executors-core-memory distribution:

**Cluster Config:**
10 Nodes
16 cores per Node
64GB RAM per Node

First Approach: Tiny executors [One Executor per core]:

Tiny executors essentially means one executor per core. Following table depicts the values of our spar-config params with this approach:

- `--num-executors` = `In this approach, we'll assign one executor per core`
                                 = `total-cores-in-cluster`
                                 = `num-cores-per-node * total-nodes-in-cluster` 
                                 = 16 x 10 = 160
- `--executor-cores` = 1 (one executor per core)
- `--executor-memory` = `amount of memory per executor`
                                     = `mem-per-node/num-executors-per-node`
                                     = 64GB/16 = 4GB

Analysis: With only one executor per core, as we discussed above, we’ll not be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated in each core of the nodes which is 16 times. Also, we are not leaving enough memory overhead for Hadoop/Yarn daemon processes and we are not counting in ApplicationManager. NOT GOOD!

Second Approach: Fat executors (One Executor per node):

Fat executors essentially means one executor per node. Following table depicts the values of our spark-config params with this approach:

- `--num-executors`  = `In this approach, we'll assign one executor per node`
                                 = `total-nodes-in-cluster`
                                 = 10
- `--executor-cores` = `one executor per node means all the cores of the node are assigned to one executor`
                                 = `total-cores-in-a-node`
                                 = 16
- `--executor-memory` = `amount of memory per executor`
                                     = `mem-per-node/num-executors-per-node`
                                     = 64GB/1 = 64GB

Analysis: With all 16 cores per executor, apart from ApplicationManager and daemon processes are not counted for, HDFS throughput will hurt and it’ll result in excessive garbage results. Also,NOT GOOD!

Third Approach: Balance between Fat (vs) Tiny

According to the recommendations which we discussed above:

  • Based on the recommendations mentioned above, Let’s assign 5 core per executors => --executor-cores = 5 (for good HDFS throughput)
  • Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16-1 = 15
  • So, Total available of cores in cluster = 15 x 10 = 150
  • Number of available executors = (total cores/num-cores-per-executor) = 150/5 = 30
  • Leaving 1 executor for ApplicationManager => --num-executors = 29
  • Number of executors per node = 30/10 = 3
  • Memory per executor = 64GB/3 = 21GB
  • Counting off heap overhead = 7% of 21GB = 3GB. So, actual --executor-memory = 21 - 3 = 18GB
So, recommended config is: 29 executors, 18GB memory each and 5 cores each!!

Analysis: It is obvious as to how this third approach has found right balance between Fat vs Tiny approaches. Needless to say, it achieved parallelism of a fat executor and best throughput of a tiny executor!!


We’ve seen:

Couple of recommendations to keep in mind which configuring these params for a spark-application like:

Budget in the resources that Yarn’s Application Manager would need
How we should spare some cores for Hadoop/Yarn/OS daemon processes
Learnt about spark-yarn-memory-usage
Also, checked out and analysed three different approaches to configure these params:
Tiny Executors - One Executor per Core
Fat Executors - One executor per Node
Recommended approach - Right balance between Tiny (Vs) Fat coupled with the recommendations.
--num-executors, --executor-cores and --executor-memory.. these three params play a very important role in spark performance as they control the amount of CPU & memory your spark application gets. This makes it very crucial for users to understand the right way to configure them. Hope this blog helped you in getting that perspective…