Using PyHive with a Kerberos ticket to connect to a kerberized Hadoop cluster - python

I would like to connect to Hive on our kerberized Hadoop cluster and then run some HQL queries (obviously :)) from a machine which already has its own working Kerberos client; the keytab has been passed in and tested.
Our Hadoop cluster runs HWS 3.1 on CentOS 7; my machine also runs CentOS 7.
I'm using Python 3.7.3 and PyHive (0.6.1).
I have installed a bunch of libraries (and also tried uninstalling them) while going through different forums (HWS, Cloudera, here on SO...).
Through pip I installed these SASL libraries:
pure-sasl (0.6.1)
pysasl (0.4.1)
sasl (0.2.1)
thrift-sasl (0.3.0)
Through yum I installed:
cyrus-sasl-2.1.26-23.el7.x86_64
cyrus-sasl-lib-2.1.26-23.el7.x86_64
cyrus-sasl-plain-2.1.26-23.el7.x86_64
saslwrapper-devel-0.16-5.el7.x86_64
saslwrapper-0.16-5.el7.x86_64
cyrus-sasl-lib-2.1.26-23.el7.i686
cyrus-sasl-devel-2.1.26-23.el7.x86_64
Below is my connection to Hive:
return hive.Connection(host=self.host, port=self.port,
                       database=self.database, auth=self.__auth,
                       kerberos_service_name=self.__kerberos_service_name)
This is the relevant part of my YAML:
hive_interni_hdp:
  db_type: hive
  host: domain.xx.lan
  database: database_name
  user: user_name
  port: 10000
  auth: KERBEROS
  kerberos_service_name: hive
When I try to run the code, I get the following error:
File "/opt/Python3.7.3/lib/python3.7/site-packages/dfpy/location.py", line 1647, in conn
self.__conn = self._create_connection()
File "/opt/Python3.7.3/lib/python3.7/site-packages/dfpy/location.py", line 1633, in _create_connection
kerberos_service_name=self.__kerberos_service_name)
File "/opt/Python3.7.3/lib/python3.7/site-packages/pyhive/hive.py", line 192, in __init__
self._transport.open()
File "/opt/Python3.7.3/lib/python3.7/site-packages/thrift_sasl/__init__.py", line 79, in open
message=("Could not start SASL: %s" % self.sasl.getError()))
thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'
Has anyone had luck with this? Where is the obstacle: the PyHive libs, or wrong Kerberos connection settings?

I found a solution. I checked out this documentation, https://www.cyrusimap.org/sasl/sasl/sysadmin.html,
where GSSAPI is mentioned (with Kerberos 5, which I'm using), and verified that I had no GSSAPI support on my machine using
sasl2-shared-mechlist
It stated
GSS-SPNEGO,LOGIN,PLAIN,ANONYMOUS
but after installing the GSSAPI plugin
yum install cyrus-sasl-gssapi
the mechlist states
GSS-SPNEGO,GSSAPI,LOGIN,PLAIN,ANONYMOUS
Then I ran the code again and, hooray, it worked!
P.S. Don't forget to authenticate and verify that your keytab is valid:
kinit -kt /root/user.keytab user@domain.com
klist
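For reference, here is a minimal end-to-end sketch of the working setup; the host, database, and principal below are placeholders taken from the YAML above, so adjust them to your cluster:

# Minimal PyHive-over-Kerberos sketch (assumes a valid ticket from kinit,
# plus the pyhive, sasl, thrift, and thrift-sasl packages and the
# cyrus-sasl-gssapi OS package installed).
from pyhive import hive

conn = hive.Connection(
    host='domain.xx.lan',          # HiveServer2 host (placeholder)
    port=10000,                    # default HiveServer2 Thrift port
    database='database_name',      # placeholder
    auth='KERBEROS',               # selects the GSSAPI SASL mechanism
    kerberos_service_name='hive',  # service part of the hive/_HOST principal
)
cursor = conn.cursor()
cursor.execute('SHOW TABLES')
print(cursor.fetchall())
cursor.close()
conn.close()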

Related

How to properly connect to SQL Server from a python script when python packages are based on github?

Suppose that due to an HTTP 403 error it's not possible to download packages from the PyPI repo (so no pip install <package> commands), which forces me to install pyodbc by cloning the repo from GitHub (https://github.com/mkleehammer/pyodbc) and running the following Windows .cmd file:
cd "root_folder"
git activate
git clone https://github.com/mkleehammer/pyodbc.git --depth 1
Note that this package is downloaded to the same root folder where my Python script is. After this I try to open a connection to Microsoft SQL Server:
import pyodbc as pyodbc
# set connection settings
server="servername"
database="DB1"
user="user1"
password="123"
# establishing connection to db
conn = pyodbc.connect("DRIVER={SQL Server};SERVER="+server+";DATABASE="+database+";UID="+user+";PWD="+password)
cursor=conn.cursor()
print("Succesful connection to sql server")
However, when I run the above code the following traceback arises:
Traceback (most recent call last):
File "/dcleaner.py", line 47, in
conn = pyodbc.connect("DRIVER={SQL Server};SERVER="+server+";DATABASE="+database+";UID="+user+";PWD="+password)
AttributeError: module 'pyodbc' has no attribute 'connect'
Do you know how I can properly connect from a Python script to a SQL Server database?
After you have cloned pyodbc:
cd "root_folder"
git activate
git clone https://github.com/mkleehammer/pyodbc.git --depth 1
On your local machine, go into the cloned directory, open a terminal, and run the command below:
python setup.py build
If it errors, try installing an appropriate C++ compiler (the error message usually reveals this detail; in VS Code it gave a URL to open and download the Microsoft C++ Build Tools).
Reboot the machine and run this again:
python setup.py build    # if it succeeds, continue with the command below
python setup.py install
After that you should be able to import and use it from your local machine:
import pyodbc as pyodbc
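To sanity-check the build before wiring up the connection string, a quick sketch like this can help (the driver names printed will vary by machine):

# Verify the locally built pyodbc imports, and list the ODBC drivers the
# driver manager can see; if 'SQL Server' is not listed, use a name that
# is (e.g. 'ODBC Driver 17 for SQL Server', if installed).
import pyodbc

print(pyodbc.version)    # version string of the built module
print(pyodbc.drivers())  # ODBC drivers visible to the driver manager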

redis.exceptions.ConnectionError: Error 97 connecting to localhost:6379. Address family not supported by protocol

Whenever I try to run my program, the following error is raised:
redis.exceptions.ConnectionError: Error 97 connecting to localhost:6379. Address family not supported by protocol.
Previously the program ran normally; now this error is raised.
Traceback (most recent call last):
File "securit.py", line 26, in <module>
bank = red.get('bank')
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 880, in get
return self.execute_command('GET', name)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 578, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 563, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 538, in send_packed_command
self.connect()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 442, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 97 connecting to localhost:6379. Address family not supported by protocol.
Finally I got the answer to the above question. Step by step, this is what I did:
Setup
Before you install redis, there are a couple of prerequisites that need to be downloaded to make the installation as easy as possible.
Start off by updating all of the apt-get packages:
**sudo apt-get update**
Once the process finishes, install a compiler via build-essential, which will help us install Redis from source:
**sudo apt-get install build-essential**
Finally, we need to install tcl:
**sudo apt-get install tcl8.5**
Installing Redis
With all of the prerequisites and dependencies in place, we can go ahead and install Redis from source:
Download the latest stable release tarball from Redis.io.
**wget http://download.redis.io/releases/redis-stable.tar.gz**
Untar it and switch into that directory:
**tar xzf redis-stable.tar.gz**
**cd redis-stable**
Proceed with the make command:
**make**
Run the recommended make test:
**make test**
Finish up by running make install, which installs the program system-wide.
**sudo make install**
Once the program has been installed, Redis comes with a built-in script that sets it up to run as a background daemon.
To access the script, move into the utils directory:
**cd utils**
From there, run the Ubuntu/Debian install script:
**sudo ./install_server.sh**
As the script runs, you can choose the default options by pressing enter. Once the script completes, the redis-server will be running in the background.
I had the same problem. Deleting the IPv6 addresses from the /etc/hosts file helped for me.
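If editing /etc/hosts is not an option, forcing an explicit IPv4 address instead of localhost should sidestep the address-family problem. A small sketch (default port and db assumed; the key name is taken from the traceback above):

# Connect via 127.0.0.1 rather than 'localhost' so the client never tries
# the IPv6 address that triggers 'Address family not supported by protocol'.
import redis

red = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)
print(red.ping())        # True if the server is reachable
bank = red.get('bank')   # same call as in the original traceback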

How to access Hive on remote server using python client

Case: I have Hive on a Cloudera platform. There is a database on Hive that I want to access from my computer using a Python client. I read a similar SO question, but it uses pyhs2, which I am unable to install on the remote server. Another SO question uses Thrift, but I can't seem to install that either.
Code: After following the documentation, when I execute the following program it gives me an error:
import pyodbc, sys, os
pyodbc.autocommit=True
con = pyodbc.connect("DSN=default",driver='SQLDriverConnect',autocommit=True)
cursor = con.cursor()
cursor.execute("select * from fb_mpsp")
Error: ssh://ashish#ServerIPAddress/home/ashish/anaconda/bin/python2.7 -u /home/ashish/PyCharm_proj/hdfsConnect/home/ashish/PyCharm_proj/hdfsConnect/Hive_connect/hive_connect.py
Traceback (most recent call last):
File "/home/ashish/PyCharm_proj/hdfsConnect/home/ashish/PyCharm_proj/hdfsConnect/Hive_connect/hive_connect.py", line 5, in
con = pyodbc.connect("DSN=default", driver='SQLDriverConnect',autocommit=True)
pyodbc.Error: ('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found, and no default driver specified (0) (SQLDriverConnect)')
Process finished with exit code 1
Please suggest how I can solve this problem. Also, I am not sure why I have to specify the driver as SQLDriverConnect when the code will be executed against Hadoop Hive.
Thanks
This worked for me:
oODBC = pyodbc.connect("DSN=Cloudera Hive DSN 64;", autocommit = True, ansi = True )
And now everything works fine.
Be sure everything is fine with your DSN using:
isql -v "Cloudera Hive DSN 64"
and replace "Cloudera Hive DSN 64" with the name you used in your odbc.ini
Also, currently I'm not able to use Kerberos authentication unless I create a ticket by hand. Impala works smoothly using Kerberos keytab files.
Any help on getting Hive ODBC working with keytab files is appreciated.
If you do decide to revisit pyhs2, note that it doesn't need to be installed on the remote server; it's installed on your local client.
If you continue with pyodbc, you need to install the ODBC driver for Hive, which you can get from Cloudera's site.
You don't need to specify the driver in your connection, it should be part of your DSN. The specifics of creating the DSN depend on your OS, but essentially you will create it using Administrative Tools -> Data Sources (Windows), install ODBC and edit /Library/ODBC/odbc.ini (Mac), or edit /etc/odbc.ini (Linux).
Conceptually, think of the DSN as a specification that represents all the information about the connection - it will contain the host, port, and driver information. That way in your code you don't have to specify these things and you can switch details about the database without changing your code.
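For example, on Linux the /etc/odbc.ini entry backing the DSN used in the snippet below might look like the following; the driver path is an assumption based on the default Cloudera Hive ODBC install location, so check where the package actually put the .so:

# Hypothetical /etc/odbc.ini entry for the DSN named Hive1
[Hive1]
Description = Hive DSN (sample)
Driver      = /opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HOST        = your.hive.host
PORT        = 10000
Schema      = default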
# Note only the DSN name specifies the connection
import pyodbc
conn = pyodbc.connect("DSN=Hive1")
cursor = conn.cursor()
cursor.execute("select * from YYY")
Alternatively, I've updated the other question you referenced with information about how to install the thrift libraries. I think that's the way to go, if you have that option.
Try this method also, to connect and get data remotely from the Hive server:
Connect to the remote server with ssh and run the CLI command to pull data from it:
ssh -o UserKnownHostsFile=/dev/null -o ConnectTimeout=90 -o StrictHostKeyChecking=no shashanks#remote_host 'hive -e "select * from DB.testtable limit 5;" >/home/shashanks/testfile'
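If you want to drive that same ssh-plus-hive-CLI approach from Python rather than a shell, a thin wrapper along these lines captures the rows locally instead of writing to a remote file (user, host, and query are placeholders; passwordless ssh is assumed):

# Run the hive CLI on the remote host over ssh and capture stdout locally.
# capture_output and text require Python 3.7+.
import subprocess

query = 'select * from DB.testtable limit 5;'
result = subprocess.run(
    ['ssh', 'shashanks@remote_host', 'hive -e "{}"'.format(query)],
    capture_output=True, text=True, check=True,
)
print(result.stdout)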

python cannot connect hiveserver2

I have tried to use the example at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2
but I get the following errors:
/usr/lib/python2.7/dist-packages/pkg_resources.py:1031: UserWarning: /home/dsnadmin/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
warnings.warn(msg, UserWarning)
Traceback (most recent call last):
File "hs2.py", line 8, in <module>
database='default') as conn:
File "build/bdist.linux-x86_64/egg/pyhs2/__init__.py", line 7, in connect
File "build/bdist.linux-x86_64/egg/pyhs2/connections.py", line 46, in __init__
File "build/bdist.linux-x86_64/egg/pyhs2/cloudera/thrift_sasl.py", line 66, in open
thrift.transport.TTransport.TTransportException: Could not start SASL: Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found
Here is the hive log:
ERROR [HiveServer2-Handler-Pool: Thread-31]: server.TThreadPoolServer (TThreadPoolServer.java:run(296)) - Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TSaslTransportException: No data or no sasl data in the stream
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TSaslTransportException: No data or no sasl data in the stream
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:328)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
Can anyone help solve this problem? Thank you very much.
OS version: Ubuntu 14.04.1
Hive version: apache-hive-1.2.0
SASL version: sasl-0.1.3
Thrift version: thrift-0.9.1
You are missing some dependencies; make sure you install cyrus-sasl-devel and cyrus-sasl-gssapi:
On an RHEL-based distro:
sudo yum install cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-md5 cyrus-sasl-plain
... or on a Debian-based distro:
sudo apt-get install sasl2-bin libsasl2-2 libsasl2-dev libsasl2-modules
Per @KenKennedy, also add the libsasl2-modules-gssapi-mit package if using GSSAPI for authentication.
1. In hive-site.xml, set the configuration as below:
<property>
  <name>hive.server2.authentication</name>
  <value>NOSASL</value>
</property>
2. Change the pyhs2 program code as below:
with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="NOSASL",
                   user='user',
                   password='password',
                   database='default') as conn:
Please note that the username and password cannot be empty; pass any (non-empty) username and password when connecting with pyhs2.
Setting the following environment variable worked for me (this is for Ubuntu):
SASL_PATH=/usr/lib/x86_64-linux-gnu/sasl2
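Exporting it in the shell before launching Python is the safe route. Setting it from inside the script may also work, since libsasl2 only reads SASL_PATH when it first loads its plugins, but treat this as an untested assumption:

# Untested sketch: set SASL_PATH before any SASL-backed module is imported,
# so libsasl2 picks it up when it loads its plugins. The path is the usual
# Ubuntu plugin directory; adjust for your distro.
import os
os.environ['SASL_PATH'] = '/usr/lib/x86_64-linux-gnu/sasl2'

import pyhs2  # imported only after the environment variable is in place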
The above answers didn't work in my case, and I also tried others. Finally, I solved my problem (no idea if it will work for you).
Just execute
export LD_LIBRARY_PATH=/usr/lib64:/usr/local/lib:$LD_LIBRARY_PATH
before running your script.
My original LD_LIBRARY_PATH was /usr/local/lib:/usr/lib64.

Create a DSN for pyodbc similar to PHP PDO. Is it possible?

I use PHP PDO to connect to a MySQL DB and it works great. I have something like:
$dsn = 'mysql:host=localhost;dbname=database_name';
$user_db = 'admin';
$password = 'password';
$pdo = new PDO($dsn, $user_db, $password);
Now I need to load the same database from a Python script and have to use the pyodbc module.
But I'm running into an issue.
If I do (in Python):
pyodbc.connect('DRIVER={MySQL};SERVER=localhost;DATABASE=database_name;UID=admin;PWD=password;')
I get an error in the log:
pyodbc.Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib '/usr/lib64/libmyodbc5.so' : file not found (0) (SQLDriverConnect)")
If I check /etc/odbcinst.ini I can see:
# Driver from the mysql-connector-odbc package
# Setup from the unixODBC package
[MySQL]
Description = ODBC for MySQL
Driver = /usr/lib/libmyodbc5.so
Setup = /usr/lib/libodbcmyS.so
Driver64 = /usr/lib64/libmyodbc5.so
Setup64 = /usr/lib64/libodbcmyS.so
FileUsage = 1
I tried to add the package mysql-connector-odbc via yum and got mysql-connector-odbc.x86_64 0:5.1.5r1144-7.el6.
And then running my script I got a new error:
/usr/local/bin/python2.7: relocation error: /usr/lib64/libmyodbc5.so: symbol strmov, version libmysqlclient_16 not defined in file libmysqlclient_r.so.16 with link time reference
It seems this version is not compatible with the MySQL version I have: 5.5.37-cll - MySQL Community Server (GPL).
I did a yum remove to restore the previous configuration.
And now? Any suggestions? Thanks!
My configuration:
My Server: CENTOS 6.6 x86_64 virtuozzo
MySql: 5.5.37-cll - MySQL Community Server (GPL)
Finally I fixed it!
yum install unixODBC-devel
yum install mysql-connector-odbc
yum install openssl098e
and then:
rpm -ivh libmysqlclient16-5.1.69-1.w5.x86_64.rpm
Now
pyodbc.connect('DRIVER={MySQL};SERVER=localhost;DATABASE=database_name;UID=admin;PWD=password;')
works! Yeah!
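To get something closer to the PDO-style DSN from the question, you can also define a DSN section in /etc/odbc.ini and keep the Python side short. A sketch, where the DSN name, user, and password are placeholders:

# With a [mysql_dsn] section in /etc/odbc.ini pointing at the MySQL driver,
# the connect call shrinks to a PDO-like one-liner.
import pyodbc

conn = pyodbc.connect('DSN=mysql_dsn;UID=admin;PWD=password')
cursor = conn.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchone())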
