Authentication is the first level of security for any system. It is all about validating the identity of a user or a process. In a simple sense, it means verifying a username and password. In this article, we will try to understand the need for secure authentication method and its implementation in a Hadoop cluster.
In a secure system, the users and the processes are required to identify themselves. Then the system needs to validate the identity. The system must ensure that you are the one who you claim to be.
The authentication doesn't end there. Once your identity is validated, it must flow further down to the system. Your identity must propagate to the system along with your every action and to every resource that you access on the network. This kind of authentication is not only needed for users, but it is also mandatory for every process or service.
In the absence of an authentication, a process or a user can pose itself to be a trusted identity and gain access to the data. Most of the systems implement this capability. For example, your Linux OS is capable of validating your credentials and propagating it further down. Now, coming back to a Hadoop cluster. Why can't Hadoop rely on Linux authentication?
Hadoop works on a group of computers. Each computer executes an independent operating system. OS authentication works within the boundary of an OS. But Hadoop works across those boundaries. So, Ideally, Hadoop should have a network-based authentication system. But unfortunately, Hadoop doesn't have a built-in capability to authenticate users and propagate their identity. So, the community had following options.
Develop an authentication capability into Hadoop.
Integrate with some other system that is purposefully designed to provide the authentication capability over a networked environment.
They decided to go with the second option. So, Hadoop uses Kerberos for authentication and identity propagation. You may ask a question here. Why Kerberos? Why not something else like SSL certificates or OAuth?
Well, OAuth was not there at that time. And they give two reasons over SSL.
Kerberos performs better than SSL, and managing users in Kerberos is much more straightforward. To remove a user, we just delete it from Kerberos whereas revoking an SSL certificate is a complicated thing.
What is Kerberos?
Kerberos is a network authentication protocol created by MIT. It eliminates the need for transmission of passwords across the network and removes the potential threat of an attacker sniffing the network.
To understand the Kerberos protocol and how it works, You must realize few jargons and components of the Kerberos system. Let me introduce you to all of them.
The first one is KDC. We call it the Key Distribution Center. KDC is the authentication server in Kerberos environment. Most of the cases, it resides on a separate physical server. We can logically divide the KDC into three parts.
An Authentication Server (AS)
A Ticket Granting Server (TGS)
The database stores user and service identities. These identities are known as principals. KDC database also stores other information like an encryption key, ticket validity duration, expiration date, etc.
The Kerberos Authentication Service authenticates the user and issues a TGT ticket. If you have a valid TGT, means AS has verified your credential.
TGS is the application server of KDC which provides service ticket. Before accessing any service on a Hadoop cluster, you need to get a service ticket from TGS
How Kerberos authentication works?
Let's assume you want to list a directory from HDFS on a Kerberos enabled Hadoop cluster.
1. First thing, you must be authenticated by Kerberos. On a Linux machine, you can do it by executing the kinit tool. The kinit program will ask you for the password. Then, it will send an authentication request to Kerberos Authentication Server.
2. On a successful authentication, the AS will respond back with a TGT.
3. The kinit will store the TGT in your credentials cache. So, now you have your TGT that means, you have got your authentication, and you are ready to execute a Hadoop command.
4. Let's say you run following command.
hadoop fs --ls /
So, you are using Hadoop command. That's a Hadoop client. Right?
5. Now, the Hadoop client will use your TGT and reach out to TGS. The client approaches TGS to ask for a service ticket for the Name Node service.
6. The TGS will grant you a service ticket, and the client will cache the service ticket.
7. Now, you have a ticket to communicate with the Name Node. So, the Hadoop RPC will use the service ticket to reach out to Name Node.
8. They will again exchange the tickets. Your Ticket proves your identity and Name node's Ticket determines the identification of the Name Node. Both are sure that they are talking to an authenticated entity. We call this a mutual authentication.
9. The next part is authorization. If you have permissions to list the root directory, the NN will return the results to you. That's all about Kerberos Authentication in Hadoop.