Thursday, 19 April 2018

HBase Replication Implementation


OBJECTIVE :

This post explains the procedure to implement HBase replication between two clusters.

SCENARIO:


The source cluster server is hadoop215 (HADOOP-INT) and the destination cluster server is hadoop220 (HADOOP-ANA); replication runs from an SSL-disabled cluster to an SSL-enabled cluster.

PROCEDURE:

Step 1: Create the peer in the source cluster server hadoop215 (HADOOP-INT) in the format below.

add_peer 'ID', 'CLUSTER_KEY'  

CLUSTER_KEY Syntax: 

<hbase.zookeeper.quorum>:<hbase.zookeeper.property.clientPort>:<zookeeper.znode.parent>

Note: The value for “zookeeper.znode.parent” can be obtained from the corresponding property in "/etc/hbase/conf/hbase-site.xml". For an unsecured cluster this value is typically “/hbase” (or “/hbase-unsecure”), and for a secured cluster it is “/hbase-secure”.

Example

To create a peer with peer ID ‘2’ and then list the peers, run the following commands from the HBase shell:

hbase(main)> add_peer '2', 'hadoop216,hadoop217,hadoop218:2181:/hbase'

hbase(main)> list_peers

Step 2: Create a test table ‘reptable4’ with REPLICATION_SCOPE => 1 in the source cluster:
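For example, from the HBase shell on the source cluster (the 'cf1' column family matches the useful commands listed later):

hbase(main)> create 'reptable4', { NAME => 'cf1', REPLICATION_SCOPE => 1}
hbase(main)> describe 'reptable4'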



Step 3: Enable replication for table ‘reptable4’ in the source cluster:
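For example, from the HBase shell on the source cluster:

hbase(main)> enable_table_replication 'reptable4'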



Useful commands related to HBase replication:


  • To list the peers:
>list_peers

  • To create a table with a column family that has replication scope 1:
>create 'reptable4', { NAME => 'cf1', REPLICATION_SCOPE => 1}

  • To describe a table:
>describe 'reptable4'

  • To enable replication for a table:
>enable_table_replication 'reptable4'

  • To insert a row with a value into a column:
>put 'reptable4', 'row1', 'cf1:v2', 'bar'

  • To scan the rows of a table:
>scan 'reptable4'

  • To disable replication for a table:
>disable_table_replication 'reptable4'

  • To list replicated tables:
>list_replicated_tables



HBase Regions are not online or assigned

Objective:

This blog post explains how to resolve the issue where HBase regions are not assigned to one or more region servers, even though the region servers are up and running.

Description:

In HBase, tables are split into regions that are served by the region servers, so each region server is assigned a number of regions depending on the size of the data in HBase. If a service or server fails or exits unexpectedly, its regions are normally reassigned to the other available region servers by the HBase Master, and once the failed region server is back online, the Master starts assigning regions to it again.

If regions are not being assigned, or the number of online/available regions is zero for a particular region server, users will face issues or failures in HBase read/write operations from their jobs and applications.

In the active HBase Master web UI (at an address like http://<Hbase_Master_Host_Name>:60010/master-status), the number of regions will show as null/zero.


Corresponding errors can be found in the HBase region server logs.

 


Steps:

To fix the issue where regions are not being assigned, or no online regions are visible in the HBase Master web UI at an address like http://<Hbase_Master_Host_Name>:60010/master-status, follow the steps below.

1. Log in to a shell on the HBase Master host using PuTTY or another terminal emulator.

**sudo rights are required if Kerberos security has been enabled.

2. Authenticate by generating a Kerberos ticket for the hbase user.

**A Kerberos ticket for the hbase user is required, as the commands below are executed from the hbase shell.
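**A sketch of this step (the keytab path and principal are examples; adjust them to your cluster):

kinit -kt /etc/security/keytabs/hbase.headless.keytab hbase@EXAMPLE.COM
klist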

3. In the hbase shell, execute the "assign" command for the affected region(s), passing the region name.

Command: hbase> assign 'REGION_NAME'

**Validate from the HBase Master web UI that regions start getting assigned to the affected region server; depending on the data size, assigning the regions can take 15 to 30 minutes.
**In some cases, executing the "assign" command does not resolve the region assignment issue.

4. If Step 3 does not resolve the issue, executing the "balancer" command in the hbase shell can help.

5.Execute "hbase hbck" command out side of hbase shell OR "hbck" command inside hbase shell
** "hbase hbck" is a command-line tool that checks for region consistency and table integrity problems and repairs corruption.

6. To repair all inconsistencies and corruption at once, use the "-repair" option, which includes all the region and table consistency options.

Command: $hbase hbck -repair

7. [OPTIONAL STEPS] In some cases "-repair" does not fix the region inconsistencies; in that case, try fixing the regions step by step as mentioned below:

a.-fixAssignments repairs unassigned, incorrectly assigned or multiply assigned regions

Command: $hbase hbck -fixAssignments

b.-fixMeta removes rows from hbase:meta when their corresponding regions are not present in HDFS and adds new meta rows if regions are present in HDFS but not in hbase:meta.

Command: $hbase hbck -fixMeta

c.-repairHoles creates HFiles for new empty regions on the filesystem and ensures that the new regions are consistent.

Command: $hbase hbck -repairHoles

d.-fixHdfsOrphans repairs a region directory that is missing a region metadata file (the .regioninfo file).

Command: $hbase hbck -fixHdfsOrphans

**Regions are closed during repair.

8. Once the region fixes from the steps above have been completed, confirm the checks below:

  • Confirm that online/available regions have been assigned to all region servers through the active HBase Master web UI.
  • Confirm that "hbase hbck" run from the command prompt reports "0 inconsistencies detected" (see the example below).

Friday, 13 April 2018

Distribution of Executors, Cores and Memory for a Spark Application running in Yarn


We normally run a Spark job via spark-submit. Ever wondered how to configure the --num-executors, --executor-memory and --executor-cores Spark config params for your cluster?

Example:
-----
spark-submit --class <CLASS_NAME> --num-executors ? --executor-cores ? --executor-memory ? ..
------

More specifically, as below, by specifying the executor memory and cores for a word-count example:
----
spark-submit --class com.hadoop.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster --driver-memory=2g --executor-memory 2g --executor-cores 1 --num-executors 1 SparkSimple-0.0.1SNAPSHOT.jar hdfs://hadoop.com:9000/user/spark-test/word-count/input hdfs://hadoop.com:9000/user/spark-test/word-count/output
------

Following list captures some recommendations to keep in mind while configuring them:

  • Hadoop/Yarn/OS Daemons: When we run a Spark application using a cluster manager like Yarn, several daemons run in the background, like the NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. So, while specifying num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons to run smoothly.
  • Yarn ApplicationMaster (AM): The ApplicationMaster is responsible for negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the containers and their resource consumption. If we are running Spark on Yarn, we need to budget in the resources that the AM would need (~1024MB and 1 executor).
  • HDFS Throughput: The HDFS client has trouble with tons of concurrent threads. It was observed that HDFS achieves full write throughput with ~5 tasks per executor, so it’s good to keep the number of cores per executor below that number.
  • MemoryOverhead: The following summarizes Spark-on-YARN memory usage per executor.

Two things to make note of here:
-------
Full memory requested to YARN per executor =
          spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead =
          max(384MB, 7% of spark.executor.memory)
-------

So, if we request 20GB per executor, YARN will actually allocate 20GB + memoryOverhead = 20GB + max(384MB, 7% of 20GB) = 20GB + 1.4GB ≈ 21.4GB of memory for us.
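If the default is not enough, the overhead can also be set explicitly on spark-submit (a sketch; the 2048MB value and the class name are placeholders):

-------
spark-submit --class <CLASS_NAME> --master yarn --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=2048 ...
-------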

  • Running executors with too much memory often results in excessive garbage collection delays.
  • Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.
Hands on:

Now, let’s consider a 10-node cluster with the following config and analyse different possibilities of executor-core-memory distribution:

**Cluster Config:**
-------
10 Nodes
16 cores per Node
64GB RAM per Node
-------


First Approach: Tiny executors [One Executor per core]:

Tiny executors essentially means one executor per core. The following depicts the values of our spark-config params with this approach:

-------
- `--num-executors` = `In this approach, we'll assign one executor per core`
                                 = `total-cores-in-cluster`
                                 = `num-cores-per-node * total-nodes-in-cluster` 
                                 = 16 x 10 = 160
- `--executor-cores` = 1 (one executor per core)
- `--executor-memory` = `amount of memory per executor`
                                     = `mem-per-node/num-executors-per-node`
                                     = 64GB/16 = 4GB
--------

Analysis: With only one executor per core, as we discussed above, we’ll not be able to take advantage of running multiple tasks in the same JVM. Also, shared/cached variables like broadcast variables and accumulators will be replicated once per executor on each node, i.e. 16 times. Also, we are not leaving enough overhead for Hadoop/Yarn daemon processes and we are not counting in the ApplicationMaster. NOT GOOD!

Second Approach: Fat executors (One Executor per node):

Fat executors essentially means one executor per node. Following table depicts the values of our spark-config params with this approach:

----------
- `--num-executors`  = `In this approach, we'll assign one executor per node`
                                 = `total-nodes-in-cluster`
                                 = 10
- `--executor-cores` = `one executor per node means all the cores of the node are assigned to one executor`
                                 = `total-cores-in-a-node`
                                 = 16
- `--executor-memory` = `amount of memory per executor`
                                     = `mem-per-node/num-executors-per-node`
                                     = 64GB/1 = 64GB
------------

Analysis: With all 16 cores per executor, apart from the ApplicationMaster and daemon processes not being accounted for, HDFS throughput will suffer and it’ll result in excessive garbage collection. Also, NOT GOOD!

Third Approach: Balance between Fat (vs) Tiny

According to the recommendations which we discussed above:

  • Based on the recommendations mentioned above, let’s assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
  • Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16 - 1 = 15
  • So, total available cores in the cluster = 15 x 10 = 150
  • Number of available executors = (total cores / num-cores-per-executor) = 150/5 = 30
  • Leaving 1 executor for the ApplicationMaster => --num-executors = 29
  • Number of executors per node = 30/10 = 3
  • Memory per executor = 64GB/3 = 21GB
  • Counting off-heap overhead = 7% of 21GB ≈ 1.5GB. So, actual --executor-memory = 21 - 1.5 ≈ 19GB
So, the recommended config is: 29 executors, 19GB memory each and 5 cores each!!
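Plugging these numbers into spark-submit would look roughly like this (a sketch reusing the word-count class from the earlier example; the jar and paths are placeholders):

-------
spark-submit --class com.hadoop.sparksimple.wordcount.JobRunner --master yarn --deploy-mode cluster \
  --num-executors 29 --executor-cores 5 --executor-memory 19g \
  SparkSimple-0.0.1SNAPSHOT.jar <input-path> <output-path>
-------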

Analysis: It is obvious how this third approach has found the right balance between the fat and tiny approaches. Needless to say, it achieves the parallelism of a fat executor and the best throughput of a tiny executor!!

Conclusion:

We’ve seen:

  • A couple of recommendations to keep in mind while configuring these params for a Spark application:
      • Budget in the resources that Yarn’s ApplicationMaster would need.
      • Spare some cores for Hadoop/Yarn/OS daemon processes.
  • Learnt about spark-yarn memory usage.
  • Checked out and analysed three different approaches to configure these params:
      • Tiny executors - one executor per core
      • Fat executors - one executor per node
      • Recommended approach - the right balance between tiny (vs) fat, coupled with the recommendations.

These three params (--num-executors, --executor-cores and --executor-memory) play a very important role in Spark performance as they control the amount of CPU and memory your Spark application gets. This makes it very crucial for users to understand the right way to configure them. Hope this blog helped you in getting that perspective…




Thursday, 12 April 2018

Kerberos: Creating Keytab file

In this post, I'll discuss the basics of the keytab file and the procedure to create one:

1) A keytab is a file that contains a Kerberos principal and encrypted keys.
2) You can authenticate to Kerberos using a keytab.
3) If you change your Kerberos password, you need to create a new keytab file.
4) Keytab files are used in scripts to automate Kerberos authentication, and by some service accounts.

We can create a keytab file in two ways:
----
Using ktutil command
From kadmin console
----

Note: Before reading further, please go through the below post to get a basic idea of MIT Kerberos.

http://www.maninmanoj.com/2018/04/mit-kerberos-installation-and.html 

Scenario 1: Creating the keytab file using kadmin from the KDC server. The steps are highlighted in the below snapshot:

1) Create a principal manoj@HADOOP.COM from kadmin.local.
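A sketch of the command (kadmin.local will prompt you to set a password for the new principal):

kadmin.local
kadmin.local:  addprinc manoj@HADOOP.COM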




2) Create a keytab named manoj.keytab for the principal manoj@HADOOP.COM using the ktadd command. This can be done as below:

Note: The -norandkey option is used so that the password for the principal manoj@HADOOP.COM is not changed.
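A sketch of the ktadd command from kadmin.local (the keytab is written to the current directory here):

kadmin.local:  ktadd -norandkey -k manoj.keytab manoj@HADOOP.COM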


3) Log out of the kadmin shell; using the below command we can now get a ticket for the principal without a password:

kinit -kt manoj.keytab manoj@HADOOP.COM

whereas running kinit directly with the principal name will show a password prompt:



Scenario 2: Using ktutil from the client machine, create the keytab for the principal manoj@HADOOP.COM. Here we need to mention the encryption types, as configured in /etc/krb5.conf.

The encryption types used here as below:

des3-cbc-sha1-kd
arcfour-hmac-md5
des-hmac-sha1
des-cbc-md5
des-cbc-md4

Use "wkt" command to write the keytab to the desired location, here I've written to the location /tmp/manoj.keytab.



Now, you will be able to see the details of manoj.keytab as below, and you can kinit using the keytab without a password.
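For example (klist -kt lists the entries stored in the keytab):

----
klist -kt /tmp/manoj.keytab
kinit -kt /tmp/manoj.keytab manoj@HADOOP.COM
klist
----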




Scenario 3: Creating a keytab using ktadd without the "-norandkey" option. This will overwrite the password of the user principal manoj@HADOOP.COM. It should never be used in a production environment, as it replaces the existing password with a new random key.


Reference: 

https://www.youtube.com/watch?v=_TyNUign4Ko&list=PLY-V_O-O7h4fHTSxNCipvqfOvFCa-6f07&index=19

https://www.ibm.com/support/knowledgecenter/en/SSZUMP_7.1.2/management_sym/sym_kerberos_creating_principal_keytab.html





MIT Kerberos installation and configuration on CentOS6 Server

In this post I will discuss the practical implementation of MIT Kerberos.

Scenario:

I have two machines, one acting as the Kerberos server and the other as the client machine, as below:

--------
192.168.56.51 ---> kerberos-server.hadoop.com
192.168.56.52 ---> kerberos-client.hadoop.com
--------

I've added the details of both machines in the /etc/hosts file, as below, for internal name resolution:
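The entries (taken from the host list above) look like this:

--------
192.168.56.51   kerberos-server.hadoop.com
192.168.56.52   kerberos-client.hadoop.com
--------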




Step 1: The following steps are to be done on kerberos-server.hadoop.com.

Install the following packages on the server:

------
yum -y install krb5-server krb5-libs krb5-workstation
------

Step 2:

Open and edit /etc/krb5.conf (with vi) on the server "kerberos-server.hadoop.com" and update the realm name and KDC server:

Sample krb5.conf as below:

The parameters that need to be edited are highlighted  below:
------
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = HADOOP.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 HADOOP.COM = {
  kdc = kerberos-server.hadoop.com
  admin_server = kerberos-server.hadoop.com
 }

[domain_realm]
 .hadoop.com = HADOOP.COM
 hadoop.com = HADOOP.COM
 ---------

Step 3: Create the Kerberos database using kdb5_util; the password to be provided is the KDC database master password.
-------
kdb5_util create -s
------


Step 4:

Start services

service krb5kdc start
service kadmin start

Turn on the services

chkconfig krb5kdc on
chkconfig kadmin on



Step 5:

Create one admin principal using
------
kadmin.local -q "addprinc admin/admin"
------





Step 6: Open the file /var/kerberos/krb5kdc/kadm5.acl and edit the ACL as below:
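A typical ACL grants full privileges to all admin principals in the HADOOP.COM realm (a common default; adjust it to your own policy):

------
*/admin@HADOOP.COM      *
------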

Restart the service as below:

------
service kadmin restart
------



Step 7: 

Log in to the kadmin shell using the principal admin/admin@HADOOP.COM.

listprincs will list all available principals, and we can create a new principal using the addprinc command, as in the sketch below:
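A sketch of the kadmin session (the new principal name matches the one used in the verification step):

------
kadmin -p admin/admin@HADOOP.COM
kadmin:  listprincs
kadmin:  addprinc abcd@HADOOP.COM
kadmin:  quit
------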


Step 8: Verification 

Now you will be able to kinit using the principal abcd@HADOOP.COM:
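For example (kinit will prompt for the principal's password):

------
kinit abcd@HADOOP.COM
klist
------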



Step 9: Configuration on the client side (kerberos-client.hadoop.com):

Install the following packages on the client:

-----------
yum -y install  krb5-libs krb5-workstation
-----------

Step 10: Copy the same krb5.conf file that was created on the server to the client.


Sample krb5.conf as below:
------
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = HADOOP.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 HADOOP.COM = {
  kdc = kerberos-server.hadoop.com
  admin_server = kerberos-server.hadoop.com
 }

[domain_realm]
 .hadoop.com = HADOOP.COM
 hadoop.com = HADOOP.COM
 ---------


Step 11: Verification from the client machine:

Now you will be able to kinit using the admin principal and abcd@HADOOP.COM from the client machine, as below:




Notes: 

1) If you face any errors related to Java while creating the admin principal, install the Java packages as below:

yum -y install *jdk*

2) In order to troubleshoot errors while setting up and initializing the KDC database, use the below tip:

--> export KRB5_TRACE=/dev/stdout

To disable this tracing, unset the variable:

unset  KRB5_TRACE





Kerberos Authentication in Hadoop

Authentication is the first level of security for any system. It is all about validating the identity of a user or a process. In a simple sense, it means verifying a username and password. In this article, we will try to understand the need for a secure authentication method and its implementation in a Hadoop cluster.

In a secure system, the users and the processes are required to identify themselves. Then the system needs to validate the identity. The system must ensure that you are the one who you claim to be. 
The authentication doesn't end there. Once your identity is validated, it must flow further down to the system. Your identity must propagate to the system along with your every action and to every resource that you access on the network. This kind of authentication is not only needed for users, but it is also mandatory for every process or service. 

In the absence of authentication, a process or a user can pose as a trusted identity and gain access to the data. Most systems implement this capability. For example, your Linux OS is capable of validating your credentials and propagating them further down. Now, coming back to a Hadoop cluster: why can't Hadoop rely on Linux authentication?
Hadoop works on a group of computers. Each computer executes an independent operating system. OS authentication works within the boundary of an OS, but Hadoop works across those boundaries. So, ideally, Hadoop should have a network-based authentication system. Unfortunately, Hadoop doesn't have a built-in capability to authenticate users and propagate their identity. So, the community had the following options.

Develop an authentication capability into Hadoop.
Integrate with some other system that is purposefully designed to provide the authentication capability over a networked environment.
They decided to go with the second option. So, Hadoop uses Kerberos for authentication and identity propagation. You may ask a question here. Why Kerberos? Why not something else like SSL certificates or OAuth? 
Well, OAuth was not around at that time, and they gave two reasons for preferring Kerberos over SSL.

1) Performance
2) Simplicity

Kerberos performs better than SSL, and managing users in Kerberos is much more straightforward. To remove a user, we just delete it from Kerberos whereas revoking an SSL certificate is a complicated thing.

What is Kerberos?

Kerberos is a network authentication protocol created by MIT. It eliminates the need for transmission of passwords across the network and removes the potential threat of an attacker sniffing the network. 
To understand the Kerberos protocol and how it works, you must know a few terms and components of the Kerberos system. Let me introduce you to all of them.
The first one is the KDC, the Key Distribution Center. The KDC is the authentication server in a Kerberos environment. In most cases, it resides on a separate physical server. We can logically divide the KDC into three parts.

A Database
An Authentication Server (AS)
A Ticket Granting Server (TGS)

The database stores user and service identities. These identities are known as principals. The KDC database also stores other information like encryption keys, ticket validity duration, expiration date, etc.
The Authentication Server (AS) authenticates the user and issues a TGT (Ticket Granting Ticket). If you have a valid TGT, it means the AS has verified your credentials.
The TGS is the part of the KDC that issues service tickets. Before accessing any service on a Hadoop cluster, you need to get a service ticket from the TGS.

How does Kerberos authentication work?


Let's assume you want to list a directory from HDFS on a Kerberos enabled Hadoop cluster.

1. First thing, you must be authenticated by Kerberos. On a Linux machine, you can do it by executing the kinit tool. The kinit program will ask you for the password. Then, it will send an authentication request to Kerberos Authentication Server.

2. On a successful authentication, the AS will respond back with a TGT.

3. The kinit will store the TGT in your credentials cache. So, now you have your TGT that means, you have got your authentication, and you are ready to execute a Hadoop command.

4. Let's say you run the following command:

hadoop fs -ls /

So, you are using the hadoop command. That's a Hadoop client, right?

5. Now, the Hadoop client will use your TGT and reach out to TGS. The client approaches TGS to ask for a service ticket for the Name Node service.

6. The TGS will grant you a service ticket, and the client will cache the service ticket.

7. Now, you have a ticket to communicate with the Name Node. So, the Hadoop RPC will use the service ticket to reach out to Name Node.

8. They will again exchange tickets. Your ticket proves your identity, and the Name Node's ticket proves the identity of the Name Node. Both sides are sure that they are talking to an authenticated entity. We call this mutual authentication.

9. The next part is authorization. If you have permission to list the root directory, the NameNode will return the results to you. The whole flow is sketched below, and that's all about Kerberos authentication in Hadoop.
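From the user's point of view, the sequence above boils down to a few commands (a sketch; the principal name is an example, and the service-ticket exchange happens transparently inside the Hadoop client):

kinit manoj@HADOOP.COM      (steps 1-3: obtain a TGT from the AS)
klist                       (shows the TGT in your credentials cache)
hadoop fs -ls /             (steps 4-9: the client gets a service ticket from the TGS and talks to the NameNode)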


Friday, 6 April 2018

Hadoop Security: Sentry Authorisation

Hive authorization with Sentry is managed via the beeline shell.

  • Become the hive principal using the HiveServer2 keytab (a kinit sketch follows below).
  • Log in to the server where HiveServer2 is running.
  • Start beeline.
  • Execute the !connect command with a JDBC connection URL.
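A minimal sketch of the first two steps, assuming a hive service keytab is available on the HiveServer2 host (the keytab path here is a placeholder; the principal format matches the example below):

------------
kinit -kt /path/to/hive.keytab hive/<hostname-hive-server2>@<REALM>
beeline
------------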

Example of JDBC connection URL:

Syntax:
------------
!connect jdbc:hive2://<hostname-hive-server2>:10000/default;principal=<hive-server2-principal-name>;saslQop=auth-conf
------------

Example:
------------
!connect jdbc:hive2://lcd218.brv.llkz:10000/default;principal=hive/lcd218.brv.llkz@MANOJ.ROOTDOM.NET;saslQop=auth-conf
------------



               "show roles;" will show available Sentry roles.
               "show grant role <rolename>" will show grants for the role
                     
The mapping of Sentry privileges is as follows:

SELECT privilege -> Read access 
INSERT privilege -> Write access 
ALL privilege -> Read and Write

Scenario:  

The customer wants to create a new role “crashdata_viewer”, grant it read access on the database “abd_crashdata”, and assign the following groups to that role:

az_mx_access_us
azspaactuaries
globalpc

Steps for implementation:

Run the below commands from the beeline shell:

Create role crashdata_viewer

> create role crashdata_viewer;

Assign role to DB abd_crashdata with read access.

> grant select on database abd_crashdata to role crashdata_viewer;

Assign groups to the role crashdata_viewer

> grant role crashdata_viewer to group az_mx_access_us;
> grant role crashdata_viewer to group azspaactuaries;
> grant role crashdata_viewer to group globalpc;

Post implementation checks:

Check the permissions granted for the role crashdata_viewer.

> show grant role crashdata_viewer; 









Check the roles assigned to a particular group:

> show role grant group az_mx_access_us;

Reference: 

https://www.cloudera.com/documentation/enterprise/5-5-x/topics/sg_hive_sql.html