Impala authentication can now be handled by a combination of LDAP and Kerberos. Here’s why, and how.
Impala, the open source analytic database for Apache Hadoop, supports authentication (the act of proving you are who you say you are) using both Kerberos and LDAP. Kerberos has been supported since release 1.0, LDAP support was added more recently, and with CDH 5.2, you can use both at the same time.
Using LDAP and Kerberos together provides significant value; Kerberos remains the core authentication protocol and is always used when Impala daemons connect to each other and to the Hadoop cluster. However, Kerberos can require more maintenance to support. LDAP is ubiquitous across the enterprise and is commonly utilized by client applications connecting to Impala via ODBC and JDBC drivers. A mix of the two therefore frequently makes sense.
This table demonstrates the various combinations and their use cases:
In this post, I’ll explain why and how to set up Impala authentication using a combination of LDAP and Kerberos.
Kerberos remains the primary authentication mechanism for Apache Hadoop. A little Kerberos terminology will help with the discussion to follow.
Kerberos principals can prove that they are who they say that they are by either supplying a password (if the principal is a human) or by providing a “keytab” file. Impala daemons need a keytab file, which must be well protected: anyone who can read that keytab file can impersonate the Impala daemons.
Basic support for Kerberos in Impala is straightforward: supply the --principal and --keytab_file arguments, and the daemons will use the given principal and the keys in the keytab file to take on the identity of the principal for all communication.
There is another wrinkle if the Impala daemon (impalad) is sitting behind a load balancer. When the clients running queries go through the load balancer (a proxy) the client is expecting the impalad to have a principal that’s the same as the name of the load balancer. So the impalad has to use a principal matching the name of the proxy when it services these external queries, but will need to use a principal matching its actual host name when doing back-end communication between daemons. The new flags to the impalad in this case would be:
--principal specifies what principal to use when the impalad services external queries, and --be_principal specifies the principal for when the impalad is doing back-end communication. Keys for both of these principals must reside in the same keytab file.
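As a rough illustration of the load-balancer case, the following Python sketch assembles the Kerberos-related impalad arguments. The flag names come from the text above. The helper name, the host names, and the realm are hypothetical, and the impala/<host>@<REALM> principal format is an assumption modeled on the example invocation at the end of this post.

```python
from typing import List, Optional


def kerberos_flags(realm: str, keytab: str, host: str,
                   proxy_host: Optional[str] = None) -> List[str]:
    """Build the Kerberos arguments for one impalad (illustrative helper)."""
    if proxy_host is None:
        # No load balancer: a single principal covers all communication.
        flags = [f"--principal=impala/{host}@{realm}"]
    else:
        # Behind a load balancer: clients expect the proxy's name, while
        # back-end traffic uses the daemon's real host name.
        flags = [
            f"--principal=impala/{proxy_host}@{realm}",
            f"--be_principal=impala/{host}@{realm}",
        ]
    # Keys for every principal listed above live in this one keytab file.
    flags.append(f"--keytab_file={keytab}")
    return flags


# Hypothetical host names, used only to show the shape of the result.
print(" ".join(kerberos_flags("EXAMPLE.COM", "/full/path/to/keytab",
                              host="impalad-1.example.com",
                              proxy_host="impala-lb.example.com")))
```

The point of the sketch is simply that the proxy case adds one flag rather than replacing anything: the keytab argument stays the same, but the file must hold keys for both principals.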
Kerberos is an elegant protocol, but practical implementations are not always very helpful if something goes wrong. The top two things to check in case of failure are:
Beyond that, it is possible to set two environment variables that will give you Kerberos debugging information. The output may be a little overwhelming, but frequently it will point the way to a solution.
KRB5_TRACE=/full/path/to/trace/output.log: This environment variable will instruct all Kerberos clients and utilities to print debug output to the named file.
JAVA_TOOL_OPTIONS=-Dsun.security.krb5.debug=true: This environment variable is passed to the Impala daemons, which in turn pass it to the internal Java component.
In CDH 5.2 and later you can also supply the --krb5_debug_file parameter, which will turn on Kerberos debugging and write the output to the given file. You can supply it in Cloudera Manager via the Impala Configuration “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters. (Environment variables like those above can be supplied in the adjacent “Impala Service Environment Advanced Configuration Snippet” parameters.) It also goes without saying that Google is your friend when debugging problems with Kerberos.
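Outside Cloudera Manager, one way to try the two environment variables is to set them for a single command. The Python sketch below only builds such an environment. The helper name is hypothetical, and note that it overwrites any existing JAVA_TOOL_OPTIONS value, which may not be what you want.

```python
import os
from typing import Dict, Optional


def kerberos_debug_env(trace_file: str,
                       base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Kerberos debugging switched on."""
    env = dict(os.environ if base_env is None else base_env)
    # Native Kerberos clients and utilities write trace output to this file.
    env["KRB5_TRACE"] = trace_file
    # Read by the JVM; turns on Java-side Kerberos debug output.
    env["JAVA_TOOL_OPTIONS"] = "-Dsun.security.krb5.debug=true"
    return env


env = kerberos_debug_env("/full/path/to/trace/output.log")
# Pass env=env to subprocess.run([...]) for whichever command is being debugged.
```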
The Cloudera documentation for Kerberos and Impala covers this in greater detail, but these are the basic flags:
Kerberos is great, but it requires every end user to have a valid Kerberos credential, because every user who interacts with Impala and the Hadoop cluster must have a Kerberos principal configured. That is not practical in many environments. For organizations that use Active Directory to manage user accounts, it can be onerous to create corresponding user accounts for each user in an MIT Kerberos realm. Many corporate environments use the LDAP protocol instead, where clients authenticate themselves using their username and password.
When configured to use LDAP, think of the impalad as an LDAP proxy: the client (the Impala shell, ODBC, JDBC, Hue, and so on) sends its username and password to the impalad, and the impalad takes the username and password and sends them to the LDAP server in an attempt to log in. In LDAP terminology, the impalad issues an LDAP "bind" operation. If the LDAP server returns success for the login attempt, the impalad accepts the connection.
LDAP is only used to authenticate external clients, such as the Impala shell, ODBC, JDBC, and Hue. All other back-end authentication is handled by Kerberos.
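The proxy idea can be sketched in a few lines of Python. This is not Impala's implementation: the function names are hypothetical, the LDAP server is replaced by an in-memory stand-in, and the empty-password check is an assumption added here because many LDAP servers treat a bind with an empty password as anonymous.

```python
from typing import Callable, Dict

# A "bind" takes a name and a password and reports whether the LDAP server
# accepted them. A real deployment would call an LDAP client library here.
BindFn = Callable[[str, str], bool]


def accept_connection(username: str, password: str, ldap_bind: BindFn) -> bool:
    """Accept the client only if the LDAP server accepts the forwarded bind."""
    if not password:
        # Assumption for this sketch: reject empty passwords up front.
        return False
    return ldap_bind(username, password)


def make_fake_directory(accounts: Dict[str, str]) -> BindFn:
    """Stand-in for an LDAP server, backed by an in-memory dict (demo only)."""
    def bind(name: str, password: str) -> bool:
        return accounts.get(name) == password
    return bind


directory = make_fake_directory({"myoder": "s3cret"})
print(accept_connection("myoder", "s3cret", directory))  # True
print(accept_connection("myoder", "wrong", directory))   # False
```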
LDAP is complicated (and powerful) because it is so flexible; there are many ways to configure LDAP entities and authenticate those entities. In general, every person in LDAP has a Distinguished Name, or DN, which can be considered the username or principal according to LDAP.
Let’s examine how users are set up for two different LDAP servers. The first user is named "Test1 Person" and resides in Windows 2008 Active Directory.
    # Test1 Person, Users, ad.sec.cloudera.com
    dn: CN=Test1Person,CN=Users,DC=ad,DC=sec,DC=cloudera,DC=com
    cn: Test1Person
    sAMAccountName: test1
    userPrincipalName: firstname.lastname@example.org
The second is me: the entry for user myoder, residing in an OpenLDAP server:
    # myoder, People, cloudera.com
    dn: uid=myoder,ou=People,dc=cloudera,dc=com
    cn: MichaelYoder
    uid: myoder
    homeDirectory: /home/myoder
Many items have been removed from the above for simplicity. Let’s note some of the similarities and differences in these two accounts:
Both entries have a DN, which starts with the most specific element and works out to the more general; DC=cloudera,DC=com corresponds to cloudera.com.
The DNs are quite different, though: the AD entry has a human name in the first field (CN=Test1 Person) and the OpenLDAP entry has an account username (uid=myoder).
There is an additional interesting implementation detail about AD. Normally, authentication in LDAP is based on the DN. With AD, several items are tried in turn:
sAMAccountName + "@" + the DNS domain name
Given all these differences, it is fortunate that the Impala daemon provides several mechanisms to address the varieties of LDAP configurations out there. First, let’s start simple: --enable_ldap_auth must be set, and --ldap_uri=ldap://ldapserver.your.company.com needs to be specified.
With just those set, the username given to the impalad (by the Impala shell, JDBC, ODBC, etc.) is passed straight through to the LDAP server unaltered. This approach works out great for AD if the user name is fully qualified, like email@example.com: it’ll match either the userPrincipalName or the sAMAccountName plus the DNS domain name.
It’s also possible to set up the impalad so that the domain (ad.sec.cloudera.com in this case) is automatically added to the username, by setting --ldap_domain=ad.sec.cloudera.com as an argument to the impalad. Now when a client username comes in, like "test1", the impalad appends that domain name so that the result passed to AD becomes test1@ad.sec.cloudera.com. This behavior can be a convenience to your users.
So far, things are working smoothly for AD. But what about other LDAP directories, like OpenLDAP? It doesn’t have any of the userPrincipalName stuff, and instead we have to authenticate directly against the DN. Users aren’t going to know their LDAP DN!
Fortunately, the impalad has parameters for this scenario, too. The --ldap_baseDN=X parameter is used to convert the username into the LDAP DN, so that the resulting DN looks like uid=username,X. For example, if --ldap_baseDN=ou=People,dc=cloudera,dc=com, and the username passed in is "myoder", the resulting query passed to LDAP will look like uid=myoder,ou=People,dc=cloudera,dc=com, which does indeed match the DN of user myoder above. Presto!
For maximum flexibility, it’s also possible to specify an arbitrary mapping from usernames into a DN via the --ldap_bind_pattern string. The idea is that the string specified must have a placeholder named #UID inside it, and that #UID is replaced with the username. For example, you could mimic the behavior of --ldap_baseDN by specifying --ldap_bind_pattern=uid=#UID,ou=People,dc=cloudera,dc=com. When the username of "myoder" comes in, it replaces the #UID, and we’ll get the same string as above. This option should be used when more control over the DN is needed.
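To make the username mappings concrete, here is a small Python sketch of the behavior described above. It is not Impala's code: the function and parameter names are hypothetical stand-ins for the --ldap_domain, --ldap_baseDN, and --ldap_bind_pattern flags, and treating the three options as mutually exclusive is an assumption of this sketch.

```python
from typing import Optional


def ldap_bind_name(username: str,
                   ldap_domain: Optional[str] = None,
                   ldap_base_dn: Optional[str] = None,
                   ldap_bind_pattern: Optional[str] = None) -> str:
    """Return the name that would be sent in the LDAP bind (illustrative only)."""
    chosen = [opt for opt in (ldap_domain, ldap_base_dn, ldap_bind_pattern) if opt]
    if len(chosen) > 1:
        # Assumption for this sketch: the three options are alternatives.
        raise ValueError("specify at most one mapping option")
    if ldap_domain:
        # AD style: append "@" and the domain to the short username.
        return f"{username}@{ldap_domain}"
    if ldap_base_dn:
        # OpenLDAP style: build uid=<username>,<base DN>.
        return f"uid={username},{ldap_base_dn}"
    if ldap_bind_pattern:
        # Arbitrary mapping: substitute the #UID placeholder.
        return ldap_bind_pattern.replace("#UID", username)
    # No mapping configured: the username is passed through unaltered.
    return username


print(ldap_bind_name("test1", ldap_domain="ad.sec.cloudera.com"))
print(ldap_bind_name("myoder", ldap_base_dn="ou=People,dc=cloudera,dc=com"))
print(ldap_bind_name("myoder",
                     ldap_bind_pattern="uid=#UID,ou=People,dc=cloudera,dc=com"))
```

The last two calls produce the same DN, which is the sense in which the bind pattern can mimic the base DN option.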
When using LDAP, the username and password are sent over the connection to the LDAP server in the clear. This means that without any other sort of protection, anyone can see the password travelling over the wire. To prevent that, you must protect the connection with TLS (Transport Layer Security, formerly known as SSL). There are two different connections to protect: between the client and the impalad, and between the impalad and the LDAP server.
Authentication for TLS connections is done with certificates, so the impalad (as a TLS server) will need its own certificate. The impalad presents this certificate to clients in order to prove that it really is the impalad. To supply this certificate, the impalad must be started with the --ssl_server_certificate and --ssl_private_key arguments, each set to the full path of the corresponding file.
Now clients must use TLS to talk to the impalad. In the Impala shell, you accomplish that with the --ssl and --ca_cert=/full/path/to/ca-certificate.pem arguments. The ca_cert argument specifies the certificate that signed the ssl_server_certificate above. For ODBC connections, consult the documentation for the Cloudera ODBC driver for Impala. It offers a thorough description of the settings required for certificates, authentication, and TLS.
Frankly, using TLS between the Impala clients and the impalad is a good idea, regardless of whether or not LDAP is being used. Otherwise, your queries, and the results of those queries, go over the wire in the clear.
There are two ways to turn on TLS with the LDAP Server:
Supply --ldap_tls as an argument to the impalad. The connection will take place over the usual LDAP port, but after the connection is first made the impalad issues a STARTTLS request, which upgrades the connection to a secure one using TLS on that same port.
Finally, the connection to the LDAP server needs its own authentication; this way, you know that the impalad is talking to the correct LDAP server and you’re not giving your passwords to a rogue man-in-the-middle attacker. You’ll need to pass --ldap_ca_certificate to the impalad to specify the location of the certificate that signed the LDAP server’s certificate.
The Cloudera documentation for LDAP and Impala contains much of this information, and the documentation for TLS between the Impala client and the Impala daemon is required reading as well. In Cloudera Manager, you set these flags in the Impala Configuration in the “Service-Wide” -> “Security” menu. You must specify them in the “Service-Wide” -> “Advanced” -> “Impala Command Line Argument Advanced Configuration Snippet” parameters.
To summarize all these flags:
Correctly implementing authentication in the most secure manner possible results in quite a lot of flags being passed to the Impala daemons. Here is an example invocation of the impalad (minus other flags), assuming that we want to enable both Kerberos and LDAP authentication:

    impalad --enable_ldap_auth \
        --ldap_uri=ldap://ldapserver.your.company.com \
        --ldap_tls \
        --ldap_ca_certificate=/full/path/to/certs/ldap-ca-cert.pem \
        --ssl_server_certificate=/full/path/to/certs/impala-cert.pem \
        --ssl_private_key=/full/path/to/certs/impala-key.pem \
        --principal=impala/_HOST@EXAMPLE.COM \
        --keytab_file=/full/path/to/keytab
Connecting from the Impala shell might look like this when authenticating with Kerberos:

    impala-shell.sh --ssl \
        --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
        -k

or like this when authenticating with LDAP:

    impala-shell.sh --ssl \
        --ca_cert=/full/path/to/cert/impala-ca-cert.pem \
        -l -u email@example.com
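If you launch the shell from a script, the two variants differ only in their final arguments. The Python sketch below builds either argument list; the helper name is hypothetical, and the flags are the ones shown in the commands above.

```python
from typing import List, Optional


def impala_shell_args(ca_cert: str, ldap_user: Optional[str] = None) -> List[str]:
    """Build impala-shell arguments for a TLS connection (illustrative helper)."""
    args = ["impala-shell.sh", "--ssl", f"--ca_cert={ca_cert}"]
    if ldap_user is None:
        args.append("-k")                 # Kerberos authentication
    else:
        args += ["-l", "-u", ldap_user]   # LDAP authentication as this user
    return args


print(impala_shell_args("/full/path/to/cert/impala-ca-cert.pem"))
print(impala_shell_args("/full/path/to/cert/impala-ca-cert.pem",
                        ldap_user="email@example.com"))
```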
Michael Yoder is a Software Engineer at Cloudera.