New in CDH 5.2: Impala Authentication with LDAP and Kerberos

Impala authentication can now be handled by a combination of LDAP and Kerberos. Here’s why, and how.

Impala, the open source analytic database for Apache Hadoop, supports authentication (the act of proving you are who you say you are) using both Kerberos and LDAP. Kerberos has been supported since release 1.0, LDAP support was added more recently, and with CDH 5.2, you can use both at the same time.

Using LDAP and Kerberos together provides significant value; Kerberos remains the core authentication protocol and is always used when Impala daemons connect to each other and to the Hadoop cluster. However, Kerberos can require more maintenance to support. LDAP is ubiquitous across the enterprise and is commonly utilized by client applications connecting to Impala via ODBC and JDBC drivers. A mix of the two therefore frequently makes sense.

This table demonstrates the various combinations and their use cases:

In this post, I’ll explain why and how to set up Impala authentication using a combination of LDAP and Kerberos.


Kerberos remains the primary authentication mechanism for Apache Hadoop. A little Kerberos terminology will help with the discussion to follow.

  • A principal is some Kerberos entity, like a person or a daemon process. For our purposes, a principal looks like name/hostname@realm for daemon processes, or just name@realm for users.
  • The name field can be associated with a process, like “impala”, or it can be a username, like “myoder”.
  • The hostname field can be the fully qualified name of the machine, or the Hadoop-specific magic _HOST string, which is auto-replaced with the fully qualified hostname.
  • The realm is similar to (but not necessarily the same as) a DNS domain. 
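To make the principal format concrete, here is a minimal Python sketch that splits a principal into its name, hostname, and realm, and expands the _HOST placeholder. The Principal type and parse_principal function are illustrative names only, not part of Impala or Hadoop, and the substitution here is a simplification of what Hadoop does.

```python
import socket
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    name: str
    hostname: Optional[str]  # None for user principals such as myoder@REALM
    realm: str


def parse_principal(text: str, local_fqdn: Optional[str] = None) -> Principal:
    """Split 'name/hostname@realm' or 'name@realm' into its parts.

    The _HOST placeholder is replaced with local_fqdn, or with this
    machine's fully qualified name when local_fqdn is not given.
    """
    primary, at_sign, realm = text.rpartition("@")
    if not at_sign or not primary or not realm:
        raise ValueError(f"not a principal: {text!r}")
    name, slash, hostname = primary.partition("/")
    if not slash:
        # User principal: there is no hostname component.
        return Principal(name, None, realm)
    if hostname == "_HOST":
        hostname = local_fqdn or socket.getfqdn()
    return Principal(name, hostname, realm)
```

For example, parse_principal("impala/_HOST@EXAMPLE.COM", local_fqdn="node1.example.com") yields a daemon principal whose hostname field is node1.example.com, while parse_principal("myoder@EXAMPLE.COM") yields a user principal with no hostname.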

Kerberos principals can prove that they are who they say that they are by either supplying a password (if the principal is a human) or by providing a “keytab” file. Impala daemons need a keytab file, which must be well protected: anyone who can read that keytab file can impersonate the Impala daemons.

Basic Kerberos support in Impala is straightforward: supply the following arguments, and the daemons will use the given principal and the keys in the keytab file to take on the identity of that principal for all communication.

  • --principal=impala/hostname@realm and
  • --keytab_file=/full/path/to/keytab

There is another wrinkle if the Impala daemon (impalad) is sitting behind a load balancer. When the clients running queries go through the load balancer (a proxy) the client is expecting the impalad to have a principal that’s the same as the name of the load balancer. So the impalad has to use a principal matching the name of the proxy when it services these external queries, but will need to use a principal matching its actual host name when doing back-end communication between daemons. The new flags to the impalad in this case would be:

  • --principal=impala/proxy-hostname@realm
  • --be_principal=impala/actual-hostname@realm
  • --keytab_file=/full/path/to/keytab

The first --principal specifies what principal to use when the impalad services external queries, and the --be_principal specifies the principal for when the impalad is doing back-end communication. Keys for both of these principals must reside in the same keytab file.
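The two cases (with and without a load balancer) can be summarized in a short Python sketch that assembles the Kerberos flags described above. The kerberos_flags helper and its parameters are hypothetical; only the flag names come from this post.

```python
from typing import List, Optional


def kerberos_flags(realm: str, actual_host: str, keytab: str,
                   proxy_host: Optional[str] = None) -> List[str]:
    """Build the impalad Kerberos flags for a given host.

    Without a load balancer, a single --principal is enough. Behind a
    load balancer, --principal names the proxy (for external queries)
    and --be_principal names the real host (for back-end traffic).
    """
    flags = []
    if proxy_host is None:
        flags.append(f"--principal=impala/{actual_host}@{realm}")
    else:
        flags.append(f"--principal=impala/{proxy_host}@{realm}")
        flags.append(f"--be_principal=impala/{actual_host}@{realm}")
    # Keys for every principal listed above must be in this one keytab.
    flags.append(f"--keytab_file={keytab}")
    return flags
```

Calling kerberos_flags("EXAMPLE.COM", "node1.example.com", "/full/path/to/keytab", proxy_host="lb.example.com") produces the three-flag form shown above, with the host names here being placeholders.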

Debugging Kerberos

Kerberos is an elegant protocol, but practical implementations are not always very helpful if something goes wrong. The top two things to check in case of failure are:

  • Time. Kerberos is dependent on synchronized clocks, so it is a best practice to install and use NTP (the Network Time Protocol) on all machines dealing with Kerberos.
  • DNS. Make sure that your hostnames are fully qualified and that forward (name->IP) and reverse (IP->name) DNS lookups are correct.
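The DNS check can be automated. The following Python sketch, using only the standard library, verifies that a hostname is fully qualified and that the forward and reverse lookups agree. The function name dns_is_consistent is illustrative, and the lookup functions are passed in as parameters so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable


def _reverse_lookup(ip: str) -> str:
    # gethostbyaddr returns (primary_name, aliases, addresses).
    return socket.gethostbyaddr(ip)[0]


def dns_is_consistent(hostname: str,
                      forward: Callable[[str], str] = socket.gethostbyname,
                      reverse: Callable[[str], str] = _reverse_lookup) -> bool:
    """Return True if hostname is fully qualified and name->IP->name round-trips.

    Lookup failures (socket.gaierror, socket.herror) are left to propagate,
    since a failed lookup is itself a useful diagnostic.
    """
    if "." not in hostname.rstrip("."):
        return False  # a short name such as "node1" is not fully qualified
    ip = forward(hostname)
    resolved = reverse(ip)
    return resolved.rstrip(".").lower() == hostname.rstrip(".").lower()
```

Running dns_is_consistent(socket.getfqdn()) on each machine in the cluster is one quick way to spot a host whose reverse lookup returns a short or stale name.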

Beyond that, it is possible to set two environment variables that will give you Kerberos debugging information. The output may be a little overwhelming, but frequently it will point the way to a solution.

  • KRB5_TRACE=/full/path/to/trace/output.log : This environment variable will instruct all Kerberos clients and utilities to print debug output to the named file.
  • : This environment variable is passed to the Impala daemons, which in turn pass it to the internal Java component.
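As a small illustration of the KRB5_TRACE variable, this Python sketch builds an environment with tracing enabled, which can then be handed to a Kerberos command-line utility. The helper name env_with_krb5_trace and the trace path in the usage note are assumptions for the example.

```python
import os
from typing import Dict, Optional


def env_with_krb5_trace(trace_path: str,
                        base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with KRB5_TRACE pointing at trace_path.

    The original mapping is not modified, so tracing only affects
    processes that are explicitly started with the returned environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5_TRACE"] = trace_path
    return env
```

A typical use would be subprocess.run(["klist"], env=env_with_krb5_trace("/tmp/krb5-trace.log")), after which the trace file can be inspected for clues.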

In CDH 5.2 and later you can also supply the --krb5_debug_file parameter, which will turn on Kerberos debugging and write the output to the given file. You can supply it in Cloudera Manager via the Impala Configuration “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters. (Environment variables like those above can be supplied in the adjacent “Impala Service Environment Advanced Configuration Snippet” parameters.) It also goes without saying that Google is your friend when debugging problems with Kerberos.

Kerberos Flags

The Cloudera documentation for Kerberos and Impala covers this in greater detail, but these are the basic flags:


Kerberos is great, but it does require that the end user have a valid Kerberos credential, which is not practical in many environments—because every user who interacts with Impala and the Hadoop cluster must have a Kerberos principal configured. For organizations that use Active Directory to manage user accounts, it can be onerous to create corresponding user accounts for each user in an MIT Kerberos realm. Many corporate environments use the LDAP protocol instead, where clients authenticate themselves using their username and password.

When configured to use LDAP, think of the impalad as an LDAP proxy: the client (the Impala shell, ODBC, JDBC, Hue, and so on) sends its username and password to the impalad, and the impalad takes the username and password and sends them to the LDAP server in an attempt to log in. In LDAP terminology, the impalad issues an LDAP "bind" operation. If the LDAP server returns success for the login attempt, the impalad accepts the connection.
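The proxy behavior described above boils down to "accept the connection only if the bind succeeds." The Python sketch below captures that flow with the bind operation passed in as a function, so no particular LDAP library is assumed. The names accept_connection and BindFn are illustrative, and the empty-credential check is an assumption of this sketch rather than a statement about the impalad's code.

```python
from typing import Callable

# A bind function takes (username, password) and reports whether the
# LDAP server accepted the login attempt.
BindFn = Callable[[str, str], bool]


def accept_connection(username: str, password: str, ldap_bind: BindFn) -> bool:
    """Decide whether to accept a client, acting as an LDAP proxy."""
    if not username or not password:
        # Assumption of this sketch: refuse empty credentials outright,
        # because some servers treat an empty password as an anonymous bind.
        return False
    # Forward the credentials as-is and trust the server's verdict.
    return ldap_bind(username, password)
```

In a real deployment the ldap_bind argument would wrap an actual LDAP client call; in a test it can be a simple lookup against a dictionary of known users.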

LDAP is only used to authenticate external clients, such as the Impala shell, ODBC, JDBC, and Hue. All other back-end authentication is handled by Kerberos.

LDAP Configurations

LDAP is complicated (and powerful) because it is so flexible; there are many ways to configure LDAP entities and authenticate those entities. In general, every person in LDAP has a Distinguished Name, or DN, which can be considered the username or principal according to LDAP.

Let’s examine how users are set up for two different LDAP servers. The first user is named "Test1 Person" and resides in Windows 2008 Active Directory.

# Test1 Person, Users,
dn: CN=Test1 Person,CN=Users,DC=ad,DC=sec,DC=cloudera,DC=com
cn: Test1 Person
sAMAccountName: test1

The second is me: the entry for user myoder, residing in an OpenLDAP server:

# myoder, People,
dn: uid=myoder,ou=People,dc=cloudera,dc=com
cn: Michael Yoder
uid: myoder
homeDirectory: /home/myoder

Many items have been removed from the above for simplicity. Let’s note some of the similarities and differences in these two accounts:

  • DN: The first line after the comment is the DN, the primary identifying string for one LDAP account. The name starts out specific (CN=Test1 Person and uid=myoder) and works out to the more general; DC=cloudera,DC=com corresponds to the DNS domain cloudera.com. The two are quite different: the AD entry has a human name in the first field (CN=Test1 Person) and the OpenLDAP entry has an account username (uid=myoder).
  • CN: The Common Name. For AD, it’s the same as in the DN; for OpenLDAP it’s the human name, which is not the uid in the DN.
  • sAMAccountName: This AD-only entry is a legacy form of a username. Despite being deprecated, it is widely used and documented.
  • userPrincipalName: This AD-only entry, by convention, should map to the user’s email name, so it usually takes the form username@domain. This is the modern Active Directory username and is widely used.

There is an additional interesting implementation detail about AD. Normally, authentication in LDAP is based on the DN. With AD, several items are tried in turn:

  • First, the DN
  • userPrincipalName
  • sAMAccountName + "@" + the DNS domain name
  • Netbios domain name + "\" + the sAMAccountName
  • And several other somewhat more obscure mechanisms (see the link above)
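To see how the first four forms relate to one another, here is a small Python sketch that derives them for a single account. The function ad_login_forms and all example values in the usage note are hypothetical; the more obscure mechanisms are deliberately left out.

```python
from typing import List


def ad_login_forms(dn: str, user_principal_name: str, sam_account_name: str,
                   dns_domain: str, netbios_domain: str) -> List[str]:
    """List the names one AD account can bind with, in the order tried."""
    return [
        dn,                                       # the Distinguished Name
        user_principal_name,                      # userPrincipalName
        f"{sam_account_name}@{dns_domain}",       # sAMAccountName + "@" + DNS domain
        f"{netbios_domain}\\{sam_account_name}",  # NetBIOS domain + "\" + sAMAccountName
    ]
```

For an account with sAMAccountName test1 in a DNS domain ad.example.com whose NetBIOS name is AD, the last two forms come out as test1@ad.example.com and AD\test1.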

LDAP and the Impalad

Given all these differences, it is fortunate that the Impala daemon provides several mechanisms to address the varieties of LDAP configurations out there. First, let’s start simple:

  • --enable_ldap_auth must be set, and
  • --ldap_uri=ldap:// needs to be specified.

With just those set, the username given to the impalad (by the Impala shell, JDBC, ODBC, etc.) is passed straight through to the LDAP server unaltered. This approach works out great for AD if the user name is fully qualified (that is, it already includes the domain): it’ll match either the userPrincipalName or the sAMAccountName plus the DNS domain name.

It’s also possible to set up the impalad so that the domain is automatically added to the username, by passing the --ldap_domain flag as an argument to the impalad. Now when a client username comes in, like "test1", the impalad appends "@" and that domain name before passing the result to AD. This behavior can be a convenience to your users.

So far, things are working smoothly for AD. But what about other LDAP directories, like OpenLDAP? It doesn’t have any of the sAMAccountName or userPrincipalName stuff, and instead we have to authenticate directly against the DN. Users aren’t going to know their LDAP DN! 

Fortunately, the impalad has parameters for this scenario, too. The --ldap_baseDN=X parameter is used to convert the username into the LDAP DN, so that the resulting DN looks like uid=username,X . For example, if --ldap_baseDN=ou=People,dc=cloudera,dc=com , and the username passed in is "myoder", the resulting query passed to LDAP will look like uid=myoder,ou=People,dc=cloudera,dc=com —which does indeed match the DN of user myoder above. Presto!

For maximum flexibility, it’s also possible to specify an arbitrary mapping from usernames into a DN via the --ldap_bind_pattern string. The idea is that the string specified must have a placeholder named #UID inside it, and that #UID is replaced with the username. For example, you could mimic the behavior of --ldap_baseDN by specifying --ldap_bind_pattern=uid=#UID,ou=People,dc=cloudera,dc=com . When the username of "myoder" comes in, it replaces the #UID, and we’ll get the same string as above. This option should be used when more control over the DN is needed.
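The username mappings described in this section can be summarized in a few lines of Python. This is a sketch of the string transformations only, not the impalad's actual implementation; the function ldap_bind_name and its parameter names are made up for illustration, and treating the three options as mutually exclusive is an assumption of the sketch.

```python
from typing import Optional


def ldap_bind_name(username: str,
                   domain: Optional[str] = None,
                   base_dn: Optional[str] = None,
                   bind_pattern: Optional[str] = None) -> str:
    """Map a client-supplied username to the name sent to the LDAP server."""
    chosen = [opt for opt in (domain, base_dn, bind_pattern) if opt]
    if len(chosen) > 1:
        # Assumption: only one mapping style is configured at a time.
        raise ValueError("choose at most one of domain, base_dn, bind_pattern")
    if domain:
        # AD convenience: test1 -> test1@<domain>
        return f"{username}@{domain}"
    if base_dn:
        # --ldap_baseDN=X style: username -> uid=username,X
        return f"uid={username},{base_dn}"
    if bind_pattern:
        # --ldap_bind_pattern style: replace the #UID placeholder.
        if "#UID" not in bind_pattern:
            raise ValueError("bind pattern must contain #UID")
        return bind_pattern.replace("#UID", username)
    # No mapping configured: pass the username through unaltered.
    return username
```

With base_dn="ou=People,dc=cloudera,dc=com" or bind_pattern="uid=#UID,ou=People,dc=cloudera,dc=com", the username "myoder" maps to the same DN in both cases. A production implementation would also need to escape characters that are special in a DN; this sketch does not.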


When using LDAP, the username and password are sent over the connection to the LDAP server in the clear. This means that without any other sort of protection, anyone can see the password travelling over the wire. To prevent that, you must protect the connection with TLS (Transport Layer Security, formerly known as SSL). There are two different connections to protect: between the client and the impalad, and between the impalad and the LDAP server.

TLS Between the Client and the Impalad

Authentication for TLS connections is done with certificates, so the impalad (as a TLS server) will need its own certificate. The impalad presents this certificate to clients in order to prove that it really is the impalad. To supply this certificate, the impalad must be started with --ssl_server_certificate=/full/path/to/impalad-cert.pem and --ssl_private_key=/full/path/to/impalad-key.pem.

Now clients must use TLS to talk to the impalad. In the Impala shell, you accomplish that with the --ssl and --ca_cert=/full/path/to/ca-certificate.pem arguments. The ca_cert argument specifies the certificate of the authority that signed the ssl_server_certificate above. For ODBC connections, consult the documentation for the Cloudera ODBC driver for Impala. It offers a thorough description of the settings required for certificates, authentication, and TLS.
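To illustrate what a CA certificate argument amounts to on the client side, here is a minimal sketch using Python's standard ssl module. It is not the Impala shell's code; the function client_tls_context is a hypothetical helper showing how a client can insist that the server's certificate chains back to a trusted CA.

```python
import ssl
from typing import Optional


def client_tls_context(ca_cert: Optional[str] = None) -> ssl.SSLContext:
    """Create a client-side TLS context that verifies the server.

    ca_cert is the path to the PEM file of the CA that signed the
    server certificate. When it is None, the system's default trust
    store is used instead.
    """
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_cert)
    # create_default_context already requires a valid certificate and
    # checks the hostname; a client should leave both settings on.
    return context
```

A socket wrapped with this context fails the handshake if the server presents a certificate that was not signed by the given CA, which is the property that protects clients from talking to an impostor.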

Frankly, using TLS between the Impala clients and the impalad is a good idea, regardless of whether or not LDAP is being used. Otherwise, your queries, and the results of those queries, go over the wire in the clear.

TLS Between the Impalad and the LDAP Server

There are two ways to turn on TLS with the LDAP Server:

  • Supply --ldap_tls as an argument to the impalad. The connection will take place over the usual LDAP port, but after the connection is first made it will issue a STARTTLS request which will upgrade the connection to a secure connection using TLS on that same port.
  • Supply a URI starting with ldaps://. This uses a different port than ldap:// (by default 636 rather than 389), and the connection is encrypted from the start.
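The difference between the two approaches can be sketched in Python as a function that inspects the URI and reports which port and TLS mode would apply. The function ldap_connection_plan and the dictionary it returns are illustrative only; 389 and 636 are the standard default ports for ldap:// and ldaps://.

```python
from typing import Dict, Union
from urllib.parse import urlparse


def ldap_connection_plan(ldap_uri: str,
                         ldap_tls: bool = False) -> Dict[str, Union[str, int, None]]:
    """Describe how a connection to the LDAP server would be secured."""
    parsed = urlparse(ldap_uri)
    if parsed.scheme == "ldaps":
        # TLS from the first byte, on a dedicated port.
        return {"host": parsed.hostname, "port": parsed.port or 636, "tls": "implicit"}
    if parsed.scheme == "ldap":
        # Plain port; optionally upgraded in place with STARTTLS.
        mode = "starttls" if ldap_tls else "none"
        return {"host": parsed.hostname, "port": parsed.port or 389, "tls": mode}
    raise ValueError(f"unsupported LDAP URI scheme: {parsed.scheme!r}")
```

A plan whose "tls" value is "none" means passwords would cross the network in the clear, so it is worth checking for that case in any configuration review.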

Finally, the connection to the LDAP server needs its own authentication; this way, you know that the impalad is talking to the correct LDAP server and you’re not giving your passwords to a rogue man-in-the-middle attacker. You’ll need to pass --ldap_ca_certificate to the impalad to specify the location of the certificate that signed the LDAP server’s certificate.

LDAP Flags

The Cloudera documentation for LDAP and Impala contains much of this information, and the documentation for TLS between the Impala client and the Impala daemon is required reading as well. In Cloudera Manager, you set these flags in the Impala Configuration in the “Service-Wide” -> “Security” menu. You must specify them in the “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters.

To summarize all these flags:

Bringing it All Together

Correctly implementing authentication in the most secure manner possible results in quite a lot of flags being passed to the Impala daemons. Here is an example invocation of the impalad (minus other flags), assuming that we want to enable both Kerberos and LDAP authentication:

impalad --enable_ldap_auth \
    --ldap_uri=ldap:// \
    --ldap_tls \
    --ldap_ca_certificate=/full/path/to/certs/ldap-ca-cert.pem \
    --ssl_server_certificate=/full/path/to/certs/impala-cert.pem \
    --ssl_private_key=/full/path/to/certs/impala-key.pem \
    --principal=impala/_HOST@EXAMPLE.COM \
    --keytab_file=/full/path/to/keytab

Connecting from the Impala shell might look like this:

impala-shell --ssl \
    --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
    -k

when authenticating with Kerberos, or

impala-shell --ssl \
    --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
    -l -u <username>

when authenticating with LDAP.

Michael Yoder is a Software Engineer at Cloudera.