How to connect secure Impala cluster from RStudio on macOS with implyr
Impala is very fast SQL-on-Hadoop, and it will enhance your R experience with implyr, a dplyr based interface for Apache Impala (incubating) created by Ian Cook. I will show you how to setup connection to Kerberized Impala cluster with implyr from local macOS. You can find my GitHub repo as follows:
chezou/implyr-example
_implyr-example - Example repository of implyr_github.com
Setting up ODBC environment for macOS
Install unixODBC with homebrew
First, we will install unixODBC to handle Impala with ODBC. In R world, ODBC is preferred to connect Impala because of its performance and compatibility. Let’s install unixODBC with homebrew.
$ brew install unixodbc
Download and install the latest version of the Impala ODBC driver from Cloudera
You can download the latest Impala ODBC Driver.
Configure your .odbc.ini and .odbcinst.ini
After installing Impala ODBC driver for macOS, basic configuration templates can be found in /opt/cloudera/impalaodbc/Setup/
.
cp /opt/cloudera/impalaodbc/Setup/odbc.ini ~/.odbc.ini
cp /opt/cloudera/impalaodbc/Setup/odbcinst.ini ~/.odbcinst.ini
Before using following setting, you must replace HOST
and KrbRealm
with appropriate ones. Let’s modify your .odbc.ini
as follows:
[ODBC]
# Specify any global ODBC configuration here such as ODBC tracing.
[ODBC Data Sources]
Impala=Cloudera ODBC Driver for Impala
[Impala]
# Description: DSN Description.
# This key is not necessary and is only to give a description of the data source.
Description=Cloudera Impala ODBC Driver DSN
# Driver: The location where the ODBC driver is installed to.
Driver=/opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib
# The DriverUnicodeEncoding setting is only used for SimbaDM
# When set to 1, SimbaDM runs in UTF-16 mode.
# When set to 2, SimbaDM runs in UTF-8 mode.
#DriverUnicodeEncoding=2
# Values for HOST, PORT, KrbFQDN, and KrbServiceName should be set here.
# They can also be specified on the connection string.
HOST=[REPLACE_YOUR_IMPALA_HOST]
PORT=21050
Schema=default
# The authentication mechanism.
# 0 — No authentication (NOSASL)
# 1 — Kerberos authentication (SASL)
# 2 — Username authentication (SASL)
# 3 — Username/password authentication (NOSASL or SASL depending on UseSASL configuration)
AuthMech=1
# Set to 1 to use SASL for authentication.
# Set to 0 to not use SASL.
# When using Kerberos authentication (SASL) or Username authentication (SASL) SASL is always used
# and this configuration is ignored. SASL is always not used for No authentication (NOSASL).
UseSASL=1
# Kerberos related settings.
KrbFQDN=_HOST
KrbRealm=[REPLACE_YOUR_REALM]
KrbServiceName=impala
# Username/password authentication with SASL settings.
UID=
PWD=
# Set to 0 to disable SSL.
# Set to 1 to enable SSL.
SSL=1
CAIssuedCertNamesMismatch=1
TrustedCerts=/opt/cloudera/impalaodbc/lib/universal/cacerts.pem
# If you use SSL with AllowSelfSignedServerCert, you can set this configuration.
#AllowSelfSignedServerCert=1
# Specify the proxy user ID to use.
#DelegationUID=
# General settings
TSaslTransportBufSize=1000
RowsFetchedPerBlock=10000
SocketTimeout=0
StringColumnLength=32767
UseNativeQuery=0
After setting up the .odbc.ini
, your application will refer this setting with appropriate DSN name, like Impala
in this case.
Check the configuration
After configuration, you should kinit with your principal.
$ kinit $USER@YOUR_REALM
You should replace `$USER` and `YOUR_REALM` with the appropriate REALM.
Before using RStudio on you mac, you can check configuration with `isql` command.
$ isql -v “Impala”
+ — — — — — — — — — — — — — — — — — — — -+
| Connected! |
| |
| sql-statement |
| help [tablename] |
| quit |
| |
+ — — — — — — — — — — — — — — — — — — — -+
SQL>
Implyr Example
After setting .odbc.ini you can connect secure Impala cluster with {implyr}
. For instance, We will visualize the airports’ data.
First, install R packages.
install.packages(c(“implyr”, “odbc”, “DBI”, “dplyr”, “ggplot2”, “ggExtra”))
Then, connect the Impala cluster.
library(implyr)
library(odbc)
drv <- odbc::odbc()
impala <- src_impala(
drv = drv,
dsn = “Impala”
)
If your .odbc.ini
is configured properly, you can connect to Impala cluster.
Let’s visualize the airports data. In this case, we assume the data is in u_ariga
database, so that we will change database using SQL use u_ariga
.
library(DBI)
# Change database
dbExecute(impala, “use u_ariga”)
dbGetQuery(impala, “show tables”)
airports <- tbl(impala, “airports_pq”)
# Show the head of airports data
View(airports)
airports %>% filter(latitude < 35) %>% count()
#903
Finally, we will show a joint histogram of longitude and latitude.
airports_by_geo <- airports %>% select(longitude, latitude) %>% collect()
library(ggplot2)
p <- ggplot(airports_by_geo, aes(longitude, latitude)) + geom_point() + theme_classic()
ggExtra::ggMarginal(p, type = “histogram”)
Conclusion
{implyr}
is a great package for Impala and dplyr but it is pretty young project. If you find some problems, why don’t you post into the GitHub issue?