Connecting to multiple hosts in Hive with SqlAlchemy - python

I've already had a working connection through ODBC using Cloudera ODBC Driver for Apache Hive, where I had my DSN set and all I needed was to call pyodbc.connect(f"DSN={mydsn}", autocommit=True).
Since I'm planning to use pandas on the query result, I've read that SQLAlchemy is the preferred choice, and I'd like to avoid the warnings that result from other ways of connecting. My DSN for Hive uses Zookeeper, and the "Hosts" field is filled in the form host1:2181,host2:2181,host3:2181. I'm trying to connect to these 3 hosts and I've tried changing the connection URL analogously to the one provided here, but I got invalid literal for int() with base 10: '2181,host2:2181,host3:2181...'
import pandas as pd
from sqlalchemy import create_engine

query = """SELECT TOP 10 * from eb.mobile_sa"""
conn_url = f'hive://{UID}@host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
engine = create_engine(conn_url)
with engine.connect() as conn:
    df = pd.read_sql(query, conn)
I found the kazoo module, which is said to be a ZooKeeper implementation in Python, but when I tried the very first lines from Basic Usage with just one host:
from kazoo.client import KazooClient
zk = KazooClient(hosts = "host1:2181", read_only=True)
zk.start()
I got many lines of Connection dropped: socket connection error.
How can I correctly connect to multiple hosts in Hive?
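One workaround I'm considering is resolving a live HiveServer2 instance from ZooKeeper myself and then handing SQLAlchemy a single host. This is only a sketch under assumptions: that the instances register under /hiveserver2 (the zooKeeperNamespace above) with znode names like serverUri=host:port;..., and UID is my user as before.
from kazoo.client import KazooClient
from sqlalchemy import create_engine

# Assumption: HiveServer2 registers znodes under the zooKeeperNamespace,
# named like "serverUri=some-host:10000;version=...;sequence=...".
zk = KazooClient(hosts="host1:2181,host2:2181,host3:2181", read_only=True)
zk.start()
try:
    znodes = zk.get_children("/hiveserver2")
finally:
    zk.stop()

server_uri = znodes[0].split(";")[0].replace("serverUri=", "")  # e.g. "some-host:10000"
engine = create_engine(f"hive://{UID}@{server_uri}/default")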

Related

SQL Server connection - Works in pyodbc, but not SQLAlchemy

This is a fairly common question, and I've tried the answers on SO (like here), but I still can't connect.
When I setup my connection to pyodbc I can connect with the following:
cnxn = pyodbc.connect('DRIVER={SQL Server Native Client 11.0};SERVER=ip,port;DATABASE=db;UID=user;PWD=pass')
cursor = cnxn.cursor()
cursor.execute("some select query")
for row in cursor.fetchall():
    print(row)
and it works.
However, to do a .read_sql() in pandas I need to connect with SQLAlchemy.
I have tried with both hosted connections and pass-through pyodbc connections like the below:
quoted = urllib.parse.quote_plus('DRIVER={SQL Server Native Client 11.0};Server=ip;Database=db;UID=user;PWD=pass;Port=port;')
engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
engine.connect()
I have tried with both SERVER=ip,port format and the separate Port=port parameter like above but still no luck.
The error I'm getting is Login failed for user 'user'. (18456)
Any help is much appreciated.
I assume that you want to create a DataFrame, so once you have a cnxn you can pass it to the pandas read_sql_query function.
Example:
import pandas
import pyodbc

cnxn = pyodbc.connect('your connection string')
query = 'some query'
df = pandas.read_sql_query(query, cnxn)
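If you specifically need a SQLAlchemy engine so pandas doesn't complain, the quote_plus connection string from the question can be reused as-is; this is just a sketch with the same placeholder values:
import urllib.parse
import pandas as pd
import sqlalchemy

quoted = urllib.parse.quote_plus('DRIVER={SQL Server Native Client 11.0};SERVER=ip,port;DATABASE=db;UID=user;PWD=pass')
engine = sqlalchemy.create_engine(f'mssql+pyodbc:///?odbc_connect={quoted}')
df = pd.read_sql_query('some select query', engine)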

Python: SQL Alchemy Create Engine Syntax Issues

Let's say I have the following connection information for a MSSQL server:
'Driver={SQL Server};'
'Server=VCAB18RPACRGZ12\GNRSRZ11,1414;'
'Database=sampleDB;'
'uid=sampleID;'
'pwd=samplePW'
I want to write a python dataframe to the MSSQL server as a table. I have the following code:
from sqlalchemy import create_engine
connection = create_engine('mssql+pyodbc://sampleID:samplePW@myhost:VCAB18RPACRGZ12\GNRSRZ11,1414/sampleDB?driver=SQL+Server+Native+Client+10.0')
My above connection code is erroring out. I'm not sure exactly where my connection information is supposed to go in the create_engine statement.
This is my error ...
ValueError: invalid literal for int() with base 10:
'VCAB18RPACRGZ12\GNRSRZ11,1414'
Your Server Address is not correct.
If 1414 is the port#, you should use ":" instead of ",".
SQLAlchemy uses pyodbc as the default DBAPI for mssql; pymssql is also available.
Below is the connection string sample:
# pyodbc - DSN
engine = create_engine('mssql+pyodbc://scott:tiger@mydsn')
# pymssql
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/dbname')
# pyodbc - DSN-less connection
from sqlalchemy import create_engine
# assumes driver name = [SQL+Server+Native+Client+10.0]
# engine = create_engine('mssql+pyodbc://username:password@hostname:port/databasename?driver=SQL+Server+Native+Client+10.0')
engine = create_engine(r'mssql+pyodbc://sampleID:samplePW@VCAB18RPACRGZ12\GNRSRZ11:1414/sampleDB?driver=SQL+Server+Native+Client+10.0')
print(engine)
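If escaping the backslash in the instance name and the port by hand proves error-prone, a hedged alternative on SQLAlchemy 1.4+ is to let URL.create build and escape the URL; the values below are just the sample ones from the question, and whether a named instance plus an explicit port connects still depends on your server setup:
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

url = URL.create(
    "mssql+pyodbc",
    username="sampleID",
    password="samplePW",
    host=r"VCAB18RPACRGZ12\GNRSRZ11",
    port=1414,
    database="sampleDB",
    query={"driver": "SQL Server Native Client 10.0"},
)
engine = create_engine(url)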

How to connect to a cluster in Amazon Redshift using SQLAlchemy?

In Amazon Redshift's Getting Started Guide, it's mentioned that you can utilize SQL client tools that are compatible with PostgreSQL to connect to your Amazon Redshift Cluster.
In the tutorial, they utilize the SQL Workbench/J client, but I'd like to utilize Python (in particular SQLAlchemy). I've found a related question, but the issue is that it does not go into detail or give the Python script that connects to the Redshift cluster.
I've been able to connect to the cluster via SQL Workbench/J, since I have the JDBC URL, as well as my username and password, but I'm not sure how to connect with SQLAlchemy.
Based on this documentation, I've tried the following:
from sqlalchemy import create_engine
engine = create_engine('jdbc:redshift://shippy.cx6x1vnxlk55.us-west-2.redshift.amazonaws.com:5439/shippy')
ERROR:
Could not parse rfc1738 URL from string 'jdbc:redshift://shippy.cx6x1vnxlk55.us-west-2.redshift.amazonaws.com:5439/shippy'
I don't think SQL Alchemy "natively" knows about Redshift. You need to change the JDBC "URL" string to use postgres.
jdbc:postgres://shippy.cx6x1vnxlk55.us-west-2.redshift.amazonaws.com:5439/shippy
Alternatively, you may want to try sqlalchemy-redshift, following the instructions they provide.
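With sqlalchemy-redshift installed, the URL would look something like this (a sketch; the host and database are the sample values from the question, and LOGIN/PASSWORD are placeholders):
from sqlalchemy import create_engine

# the redshift+psycopg2 dialect is registered by the sqlalchemy-redshift package
engine = create_engine(
    "redshift+psycopg2://LOGIN:PASSWORD@shippy.cx6x1vnxlk55.us-west-2.redshift.amazonaws.com:5439/shippy"
)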
I was running into the exact same issue, and then I remembered to include my Redshift credentials:
eng = create_engine('postgresql://[LOGIN]:[PASSWORD]@shippy.cx6x1vnxlk55.us-west-2.redshift.amazonaws.com:5439/shippy')
sqlalchemy-redshift works for me, but only after a few days of research.
packages (python3.4):
SQLAlchemy==1.0.14 sqlalchemy-redshift==0.5.0 psycopg2==2.6.2
First of all, I checked that my query works in SQL Workbench (http://www.sql-workbench.net), then I got it to work in SQLAlchemy (this answer https://stackoverflow.com/a/33438115/2837890 helped me understand that autocommit or session.commit() is required):
from sqlalchemy import create_engine, text

db_credentials = (
    'redshift+psycopg2://{p[redshift_user]}:{p[redshift_password]}@{p[redshift_host]}:{p[redshift_port]}/{p[redshift_database]}'
    .format(p=config['Amazon_Redshift_parameters']))
engine = create_engine(db_credentials, connect_args={'sslmode': 'prefer'})
connection = engine.connect()
result = connection.execute(text(
    "COPY assets FROM 's3://xx/xx/hello.csv' WITH CREDENTIALS "
    "'aws_access_key_id=xxx_id;aws_secret_access_key=xxx'"
    " FORMAT csv DELIMITER ',' IGNOREHEADER 1 ENCODING UTF8;").execution_options(autocommit=True))
result = connection.execute("select * from assets;")
print(result, type(result))
print(result.rowcount)
connection.close()
After that, I got sqlalchemy_redshift's CopyCommand to work, perhaps in a bad way; it looks a little tricky:
import sqlalchemy as sa
from sqlalchemy_redshift import dialect as dialect_rs
from sqlalchemy_redshift.dialect import RedshiftDialect

# TableAssets is the table name, defined elsewhere
tbl2 = sa.Table(TableAssets, sa.MetaData())
copy = dialect_rs.CopyCommand(
    tbl2,
    data_location='s3://xx/xx/hello.csv',
    access_key_id=access_key_id,
    secret_access_key=secret_access_key,
    truncate_columns=True,
    delimiter=',',
    format='CSV',
    ignore_header=1,
    # empty_as_null=True,
    # blanks_as_null=True,
)
print(str(copy.compile(dialect=RedshiftDialect(), compile_kwargs={'literal_binds': True})))
print(dir(copy))
connection = engine.connect()
connection.execute(copy.execution_options(autocommit=True))
connection.close()
This does essentially the same thing I did with plain SQLAlchemy, executing the query, except that the query is composed by CopyCommand. I haven't seen much benefit from it :(.
The following works for me with Databricks on all kinds of SQL:
import sqlalchemy as SA
import psycopg2

host = 'your_host_url'
username = 'your_user'
password = 'your_passw'
db = 'your_db_name'
port = 5439
url = "{d}+{driver}://{u}:{p}@{h}:{port}/{db}".\
    format(d="redshift",
           driver='psycopg2',
           u=username,
           p=password,
           h=host,
           port=port,
           db=db)
engine = SA.create_engine(url)
cnn = engine.connect()
strSQL = "your_SQL ..."
try:
    cnn.execute(strSQL)
except:
    raise
import sqlalchemy as db
engine = db.create_engine('postgres://username:password@url:5439/db_name')
This worked for me
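Note that SQLAlchemy 1.4+ no longer accepts the postgres:// scheme, only postgresql://, so on newer versions the same one-liner would be:
import sqlalchemy as db
engine = db.create_engine('postgresql://username:password@url:5439/db_name')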

How do I connect to SQL Server via sqlalchemy using Windows Authentication?

sqlalchemy, a db connection module for Python, uses SQL Authentication (database-defined user accounts) by default. If you want to use your Windows (domain or local) credentials to authenticate to the SQL Server, the connection string must be changed.
By default, as defined by sqlalchemy, the connection string to connect to the SQL Server is as follows:
sqlalchemy.create_engine('mssql://*username*:*password*@*server_name*/*database_name*')
This, if used using your Windows credentials, would throw an error similar to this:
sqlalchemy.exc.DBAPIError: (Error) ('28000', "[28000] [Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user '***S\\username'. (18456) (SQLDriverConnect); [28000] [Microsoft][ODBC SQL Server Driver][SQL Server]Login failed for user '***S\\username'. (18456)") None None
In this error message, the code 18456 identifies the error message thrown by the SQL Server itself. This error signifies that the credentials are incorrect.
In order to use Windows Authentication with sqlalchemy and mssql, the following connection string is required:
ODBC Driver:
engine = sqlalchemy.create_engine('mssql://*server_name*/*database_name*?trusted_connection=yes')
SQL Express Instance:
engine = sqlalchemy.create_engine('mssql://*server_name*\\SQLEXPRESS/*database_name*?trusted_connection=yes')
If you're using a trusted connection/AD and not using username/password, or otherwise see the following:
SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
Then this method should work:
from sqlalchemy import create_engine
server = <your_server_name>
database = <your_database_name>
engine = create_engine('mssql+pyodbc://' + server + '/' + database + '?trusted_connection=yes&driver=ODBC+Driver+13+for+SQL+Server')
A more recent response if you want to connect to the MSSQL DB from a different user than the one you're logged with on Windows. It works as well if you are connecting from a Linux machine with FreeTDS installed.
The following worked for me from both Windows 10 and Ubuntu 18.04 using Python 3.6 & 3.7:
import getpass
from sqlalchemy import create_engine
password = getpass.getpass()
eng_str = fr'mssql+pymssql://{domain}\{username}:{password}@{hostip}/{db}'
engine = create_engine(eng_str)
What changed was to add the Windows domain before \username.
You'll need to install the pymssql package.
Create Your SqlAlchemy Connection URL From Your pyodbc Connection String OR Your Known Connection Parameters
I found all the other answers to be educational, and I found the SqlAlchemy Docs on connection strings helpful too, but I kept failing to connect to MS SQL Server Express 19 where I was using no username or password and trusted_connection='yes' (just doing development at this point).
Then I found THIS method in the SqlAlchemy Docs on Connection URLs built from a pyodbc connection string (or just a connection string), which is also built from known connection parameters (i.e. this can simply be thought of as a connection string that is not necessarily used in pyodbc). Since I knew my pyodbc connection string was working, this seemed like it would work for me, and it did!
This method takes the guesswork out of creating the correct format for what you feed to the SqlAlchemy create_engine method. If you know your connection parameters, you put those into a simple string per the documentation exemplified by the code below, and the create method in the URL class of the sqlalchemy.engine module does the correct formatting for you.
The example code below runs as is and assumes a database named master and an existing table named table_one with the schema shown below. Also, I am using pandas to import my table data. Otherwise, we'd want to use a context manager to manage connecting to the database and then closing the connection like HERE in the SqlAlchemy docs.
import pandas as pd
import sqlalchemy
from sqlalchemy.engine import URL

# table_one dictionary:
table_one = {'name': 'table_one',
             'columns': ['ident int IDENTITY(1,1) PRIMARY KEY',
                         'value_1 int NOT NULL',
                         'value_2 int NOT NULL']}

# pyodbc stuff for MS SQL Server Express
driver = '{SQL Server}'
server = r'localhost\SQLEXPRESS'
database = 'master'
trusted_connection = 'yes'

# pyodbc connection string
connection_string = f'DRIVER={driver};SERVER={server};'
connection_string += f'DATABASE={database};'
connection_string += f'TRUSTED_CONNECTION={trusted_connection}'

# create sqlalchemy engine connection URL
connection_url = URL.create(
    "mssql+pyodbc", query={"odbc_connect": connection_string})

""" more code not shown that uses pyodbc without sqlalchemy """

engine = sqlalchemy.create_engine(connection_url)
d = {'value_1': [1, 2], 'value_2': [3, 4]}
df = pd.DataFrame(data=d)
df.to_sql('table_one', engine, if_exists="append", index=False)
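To sanity-check the append, you can read the rows straight back with the same engine (assuming table_one exists with the schema above):
check_df = pd.read_sql("SELECT * FROM table_one", engine)
print(check_df)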
Update
Let's say you've installed SQL Server Express on your Linux machine. You can use the following commands to make sure you're using the correct strings for the following:
For the driver: odbcinst -q -d
For the server: sqlcmd -S localhost -U <username> -P <password> -Q 'select @@SERVERNAME'
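You can also list the installed ODBC driver names from Python itself; pyodbc exposes them directly:
import pyodbc
# each entry can be used as the ?driver= parameter in the connection URL
print(pyodbc.drivers())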
I think that you need to put "+pyodbc" after mssql. Try this:
from sqlalchemy import create_engine
engine = create_engine("mssql+pyodbc://user:password#host:port/databasename?driver=ODBC+Driver+17+for+SQL+Server")
cnxn = engine.connect()
It works for me. Good luck!
If you are attempting to connect:
DSN-less
Windows Authentication for a server not locally hosted.
Without using ODBC connections.
Try the following:
import sqlalchemy
engine = sqlalchemy.create_engine('mssql+pyodbc://' + server + '/' + database + '?trusted_connection=yes&driver=SQL+Server')
This avoids using ODBC connections and thus avoids pyodbc interface errors from DBAPI2 vs DBAPI3 conflicts.
I would recommend using the URL creation tool instead of creating the url from scratch.
connection_url = sqlalchemy.engine.URL.create("mssql+pyodbc",database=databasename, host=servername, query = {'driver':'SQL Server'})
engine = sqlalchemy.create_engine(connection_url)
See this link for creating a connection string with SQL Server Authentication (non-domain, uses username and password)
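As a hedged sketch of that recommendation with SQL Server Authentication, URL.create also handles escaping of special characters in the password; the server, database, credentials and driver below are placeholders:
import sqlalchemy

connection_url = sqlalchemy.engine.URL.create(
    "mssql+pyodbc",
    username="user",
    password="p@ss/word",  # URL.create escapes special characters for you
    host="servername",
    database="databasename",
    query={"driver": "ODBC Driver 17 for SQL Server"},
)
engine = sqlalchemy.create_engine(connection_url)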

Connect to an URI in postgres

I'm guessing this is a pretty basic question, but I can't figure out why:
import psycopg2
psycopg2.connect("postgresql://postgres:postgres#localhost/postgres")
Is giving the following error:
psycopg2.OperationalError: missing "=" after
"postgresql://postgres:postgres@localhost/postgres" in connection info string
Any idea? According to the docs about connection strings I believe it should work, however it only does like this:
psycopg2.connect("host=localhost user=postgres password=postgres dbname=postgres")
I'm using the latest psycopg2 version on Python 2.7.3 on Ubuntu 12.04.
I would use the urlparse module to parse the URL and then use the result in the connection method. This way it's possible to overcome the psycopg2 problem.
import psycopg2
from urlparse import urlparse  # for python 3+ use: from urllib.parse import urlparse

result = urlparse("postgresql://postgres:postgres@localhost/postgres")
username = result.username
password = result.password
database = result.path[1:]
hostname = result.hostname
port = result.port
connection = psycopg2.connect(
    database=database,
    user=username,
    password=password,
    host=hostname,
    port=port
)
The connection string passed to psycopg2.connect is not parsed by psycopg2: it is passed verbatim to libpq. Support for connection URIs was added in PostgreSQL 9.2.
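For completeness: with a reasonably recent stack (libpq from PostgreSQL 9.2+ and psycopg2 >= 2.7), the URI form is accepted directly, and psycopg2 also exposes libpq's parser if you want to inspect the components (a small sketch):
import psycopg2
from psycopg2.extensions import parse_dsn

uri = "postgresql://postgres:postgres@localhost/postgres"
print(parse_dsn(uri))          # e.g. {'user': 'postgres', 'host': 'localhost', 'dbname': 'postgres', ...}
conn = psycopg2.connect(uri)   # works when libpq understands URIs (9.2+)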
To update on this, Psycopg3 does actually include a way to parse a database connection URI.
Example:
import psycopg  # must be psycopg 3
pg_uri = "postgres://jeff:hunter2@example.com/db"
conn_dict = psycopg.conninfo.conninfo_to_dict(pg_uri)
with psycopg.connect(**conn_dict) as conn:
    ...
Another option is using SQLAlchemy for this. It's not just an ORM; it consists of two distinct components, Core and ORM, and it can be used completely without the ORM layer.
SQLAlchemy provides such functionality out of the box via the create_engine function. Moreover, via the URI you can specify the DBAPI driver or many various PostgreSQL settings.
Some examples:
# default
engine = create_engine("postgresql://user:pass@localhost/mydatabase")
# psycopg2
engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydatabase")
# pg8000
engine = create_engine("postgresql+pg8000://user:pass@localhost/mydatabase")
# psycopg3 (available only in SQLAlchemy 2.0, which is currently in beta)
engine = create_engine("postgresql+psycopg://user:pass@localhost/test")
And here is a fully working example:
import sqlalchemy as sa

# set connection URI here ↓
engine = sa.create_engine("postgresql://user:password@db_host/db_name")

ddl_script = sa.DDL("""
    CREATE TABLE IF NOT EXISTS demo_table (
        id serial PRIMARY KEY,
        data TEXT NOT NULL
    );
""")

with engine.begin() as conn:
    # do DDL and insert data in a transaction
    conn.execute(ddl_script)
    conn.exec_driver_sql("INSERT INTO demo_table (data) VALUES (%s)",
                         [("test1",), ("test2",)])
    conn.execute(sa.text("INSERT INTO demo_table (data) VALUES (:data)"),
                 [{"data": "test3"}, {"data": "test4"}])

with engine.connect() as conn:
    cur = conn.exec_driver_sql("SELECT * FROM demo_table LIMIT 2")
    for name in cur.fetchall():
        print(name)

# you also can obtain raw DBAPI connection
rconn = engine.raw_connection()
SQLAlchemy provides many other benefits:
You can easily switch DBAPI implementations just by changing the URI (psycopg2, psycopg2cffi, etc.), or maybe even databases.
It implements connection pooling out of the box (both psycopg2 and psycopg3 have connection pooling, but the API is different).
asyncio support via create_async_engine (psycopg3 also supports asyncio); see the sketch below.
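As a small illustration of the last point, an asyncio engine can be created like this (a sketch; it assumes SQLAlchemy 1.4+/2.0 and an async-capable driver such as asyncpg):
import asyncio
import sqlalchemy as sa
from sqlalchemy.ext.asyncio import create_async_engine

async def main():
    # same URI scheme, but pointing at an async DBAPI driver
    engine = create_async_engine("postgresql+asyncpg://user:password@db_host/db_name")
    async with engine.connect() as conn:
        result = await conn.execute(sa.text("SELECT 1"))
        print(result.scalar())
    await engine.dispose()

asyncio.run(main())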
