I need to query a remote DB2 database with a SQL statement and load the result into a Pandas DataFrame, all from my Mac. I looked at using Pandas' read_sql with the ibm_db_sa adapter, but it looks like the prerequisite client-side software is not supported on the Mac.
I came up with a JDBC option, which I'm posting, but I'm curious to know if anyone else has any ideas.
Here's an option using JDBC, the pip-installable JayDeBeApi, and the appropriate database jar file.
Note: this could be used for other JDBC/JayDeBeApi-compliant databases like Oracle, MS SQL Server, etc.
import jaydebeapi
import pandas as pd

def read_jdbc(sql, jclassname, driver_args, jars=None, libs=None):
    '''
    Reads jdbc compliant data sources and returns a Pandas DataFrame

    uses jaydebeapi.connect and doc strings :-)
    https://pypi.python.org/pypi/JayDeBeApi/

    :param sql: select statement
    :param jclassname: Full qualified Java class name of the JDBC driver.
        e.g. org.postgresql.Driver or com.ibm.db2.jcc.DB2Driver
    :param driver_args: Argument or sequence of arguments to be passed to the
        Java DriverManager.getConnection method. Usually the
        database URL. See
        http://docs.oracle.com/javase/6/docs/api/java/sql/DriverManager.html
        for more details
    :param jars: Jar filename or sequence of filenames for the JDBC driver
    :param libs: Dll/so filenames or sequence of dlls/sos used as
        shared library by the JDBC driver
    :return: Pandas DataFrame
    '''
    # Connection errors (jaydebeapi.DatabaseError) propagate to the caller.
    conn = jaydebeapi.connect(jclassname, driver_args, jars, libs)
    try:
        curs = conn.cursor()
        try:
            curs.execute(sql)
            columns = [desc[0] for desc in curs.description]  # column headers
            # convert the list of tuples from fetchall() to a DataFrame
            return pd.DataFrame(curs.fetchall(), columns=columns)
        finally:
            # only reached once the cursor exists, so close() is always safe
            curs.close()
    finally:
        conn.close()
Some examples
#DB2
conn = 'jdbc:db2://<host>:5032/<db>:currentSchema=<schema>;'
class_name = 'com.ibm.db2.jcc.DB2Driver'
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'])
#PostgreSQL
conn = 'jdbc:postgresql://<host>:5432/<db>?currentSchema=<schema>'
class_name = 'org.postgresql.Driver'
jar = '/path/to/jar/postgresql-9.4.1212.jar'
sql = 'SELECT name FROM table_name LIMIT 5'
df = read_jdbc(sql, class_name, [conn, 'myname', 'mypwd'], jars=jar)
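The cursor.description trick in read_jdbc comes from the Python DB-API 2.0, so it is not tied to JayDeBeApi. Here is a minimal sketch of the same pattern using the standard-library sqlite3 module as a stand-in, on the assumption that no JVM or JDBC jar is at hand. The helper name fetch_columns_and_rows and the demo table are made up for illustration.

```python
import sqlite3
from contextlib import closing


def fetch_columns_and_rows(conn, sql):
    """Run `sql` on a DB-API 2.0 connection; return (column names, rows)."""
    # closing() guarantees curs.close() even if execute() raises.
    with closing(conn.cursor()) as curs:
        curs.execute(sql)
        columns = [desc[0] for desc in curs.description]
        return columns, curs.fetchall()


# Stand-in data source: an in-memory SQLite database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, qty INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [("a", 1), ("b", 2)])

columns, rows = fetch_columns_and_rows(
    conn, "SELECT name, qty FROM demo ORDER BY name"
)
conn.close()
```

From there, pd.DataFrame(rows, columns=columns) builds the frame the same way read_jdbc does.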
I found a simpler answer at https://stackoverflow.com/a/33805547/914967, which uses only the pip module ibm_db:
import ibm_db
import ibm_db_dbi
import pandas as pd
conn_handle = ibm_db.connect('DATABASE={};HOSTNAME={};PORT={};PROTOCOL=TCPIP;UID={};PWD={};'.format(db_name, hostname, port_number, user, password), '', '')
conn = ibm_db_dbi.Connection(conn_handle)
df = pd.read_sql(sql, conn)
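The long format string above is easier to check when it is assembled from named parts. A small sketch, assuming the same keyword=value; layout shown in the snippet; the helper name db2_conn_str is hypothetical, and the values (including port 50000) are placeholders rather than real settings. It does no escaping, so a value containing ";" would break the string.

```python
def db2_conn_str(db_name, hostname, port_number, user, password):
    """Assemble the keyword=value; string passed to ibm_db.connect."""
    parts = {
        "DATABASE": db_name,
        "HOSTNAME": hostname,
        "PORT": port_number,
        "PROTOCOL": "TCPIP",
        "UID": user,
        "PWD": password,
    }
    # Each pair is terminated by ";", matching the layout used above.
    return "".join(f"{key}={value};" for key, value in parts.items())


# Placeholder values, not real credentials.
conn_str = db2_conn_str("pydev", "host.test.com", 50000, "db2inst1", "secret")
print(conn_str)
```

The result can be passed as the first argument of ibm_db.connect, with empty strings for the other two arguments as in the snippet above.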
Bob, you should check out ibmdbpy (https://pypi.python.org/pypi/ibmdbpy). It is a pandas data frame style API to DB2 and dashDB tables. It supports both underlying DB2 client drivers, ODBC and JDBC.
So as prerequisites you need to set up the DB2 client driver package for Mac that you can find here: http://www-01.ibm.com/support/docview.wss?uid=swg21385217
After @IanBjorhovde commented on my question, I investigated another solution that allows me to use sqlalchemy and pandas' read_sql().
Here are the steps I took. Note: I got this working on OSX Yosemite (10.10.4) for python 3.4 and 3.5
1) Download IBM DB2 Express-C (no-cost community edition of DB2)
https://www-01.ibm.com/marketing/iwm/iwm/web/pick.do?source=swg-db2expressc&S_TACT=000000VR&lang=en_US&S_OFF_CD=10000761
2) After navigating to the unzipped dir
sudo ./db2_install
I accepted the default location of /opt/IBM/db2/V10.1
3) Install ibm_db and ibm_db_sa
pip install ibm_db
I built ibm_db_sa from source because the pip install failed
python setup.py install
That should do it. You might get an error like 'Reason: image not found' when you try to connect to your db so read this for the fix. Note: might require a reboot
Example usage:
import ibm_db_sa
import pandas as pd
from sqlalchemy import select, create_engine
eng = create_engine('ibm_db_sa://<user_name>:<pwd>@<host>:5032/<db name>')
sql = 'SELECT name FROM table_name FETCH FIRST 5 ROWS ONLY'
df = pd.read_sql(sql, eng)
Related
I am using Python to connect to a DB2 database.
I have installed the ibm_db and ibm_db_dbi packages and imported them in the code:
import ibm_db
import ibm_db_dbi
1) Created a connection string as conn_str
conn_str='database=pydev;hostname=host.test.com;port=portno;protocol=tcpip;uid=db2inst1;pwd=secret'
ibm_db_conn = ibm_db.connect(conn_str,'','')
conn = ibm_db_dbi.Connection(ibm_db_conn)
2) Now I need to read a DB2 table that lives under a schema called "BRUD" into a pandas DataFrame.
Could anyone please help me query it through this connection?
I'm not sure about the SQL syntax, but the solution looks like:
df = pd.read_sql('SELECT * FROM BRUD.table_name', conn)
I have just started learning SQL and I'm having some difficulty importing my .sql file into Python.
The .sql file is on my desktop, as is my .py file.
This is what I have tried so far:
import codecs
from codecs import open
import pandas as pd
sqlfile = "countries.sql"
sql = open(sqlfile, mode='r', encoding='utf-8-sig').read()
pd.read_sql_query("SELECT name FROM countries")
But I got the following error message:
TypeError: read_sql_query() missing 1 required positional argument: 'con'
I think I have to create some kind of connection, but I can't find a way to do that. Converting my data to an ordinary pandas DataFrame would help me a lot.
Thank you
This code snippet, taken from https://www.dataquest.io/blog/python-pandas-databases/, should help.
import pandas as pd
import sqlite3
conn = sqlite3.connect("flights.db")
df = pd.read_sql_query("select * from airlines limit 5;", conn)
Do not read a database as an ordinary file. It has a specific binary format, and a dedicated client should be used.
With that client you can create a connection that handles SQL queries and can be passed to read_sql_query.
Refer to the documentation often: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html
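If countries.sql is a plain-text script of SQL statements rather than a database file, one way to get a connection is to run the script against an in-memory SQLite database. This is only a sketch, and it assumes the script is written in a dialect SQLite accepts. The script text below is a made-up stand-in for the real file.

```python
import sqlite3

# Hypothetical contents standing in for countries.sql; in practice this would
# be read with open("countries.sql", encoding="utf-8-sig").read().
script = """
CREATE TABLE countries (name TEXT, continent TEXT);
INSERT INTO countries VALUES ('Brazil', 'South America');
INSERT INTO countries VALUES ('Japan', 'Asia');
"""

conn = sqlite3.connect(":memory:")  # throwaway database held in RAM
conn.executescript(script)          # runs every statement in the script

names = [
    row[0]
    for row in conn.execute("SELECT name FROM countries ORDER BY name")
]
print(names)
```

With the connection in place, pd.read_sql_query("SELECT name FROM countries", conn) returns the same rows as a DataFrame.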
You need a database connection. I don't know which SQL flavor you are using, but suppose you want to run your query in SQL Server:
import pyodbc
con = pyodbc.connect(driver='{SQL Server}', server='yourserverurl', database='yourdb', trusted_connection='yes')
Then pass the connection instance to pandas:
pd.read_sql_query("SELECT name FROM countries", con)
more about pyodbc here
And if you want to query an SQLite database
import sqlite3
con = sqlite3.connect('pathto/example.db')
More about sqlite here
I have a sqlite db in my home dir.
stephen@stephen-AO725:~$ pwd
/home/stephen
stephen@stephen-AO725:~$ sqlite db1
SQLite version 2.8.17
Enter ".help" for instructions
sqlite> select * from test
...> ;
3|4
5|6
sqlite> .quit
When I try to connect from a Jupyter notebook with sqlalchemy and pandas, something does not work.
db=sqla.create_engine('sqlite:////home/stephen/db1')
pd.read_sql('select * from db1.test',db)
~/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
578
579 def do_execute(self, cursor, statement, parameters, context=None):
--> 580 cursor.execute(statement, parameters)
581
582 def do_execute_no_params(self, cursor, statement, context=None):
DatabaseError: (sqlite3.DatabaseError) file is not a database
[SQL: select * from db1.test]
(Background on this error at: http://sqlalche.me/e/4xp6)
I also tried:
db=sqla.create_engine('sqlite:///~/db1')
same result
Personally, just to complete the code of @Stephen with the modules required:
# 1.-Load module
import sqlalchemy
import pandas as pd
#2.-Turn on database engine
dbEngine=sqlalchemy.create_engine('sqlite:////home/stephen/db1.db') # ensure this is the correct path for the sqlite file.
#3.- Read data with pandas
pd.read_sql('select * from test',dbEngine)
#4.- I also want to add a new table from a dataframe in sqlite (a small one)
df_todb.to_sql(name = 'newTable',con= dbEngine, index=False, if_exists='replace')
Another way to read is using the sqlite3 library, which may be more straightforward:
#1. - Load libraries
import sqlite3
import pandas as pd
# 2.- Create your connection.
cnx = sqlite3.connect('/home/stephen/db1.db')  # sqlite3 takes a plain file path, not a sqlite:/// URL
cursor = cnx.cursor()
# 3.- Query and print all the tables in the database engine
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())
# 4.- READ TABLE OF SQLITE CALLED test
dfN_check = pd.read_sql_query("SELECT * FROM test", cnx) # we need real name of table
# 5.- Now I want to delete all rows of this table
cnx.execute("DELETE FROM test;")
# 6. -COMMIT CHANGES! (mandatory if you want to save these changes in the database)
cnx.commit()
# 7.- Close the connection with the database
cnx.close()
Please let me know if this helps!
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:///db1.db')
Note that you need three slashes in sqlite:/// in order to use a relative path for the DB. If you want an absolute path, use four slashes: sqlite:////
Source: Link
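The slash counting is less mysterious once you see that the prefix is always sqlite:/// and the fourth slash is simply the leading / of an absolute POSIX path. A small sketch using only the standard library; the helper name sqlite_url is hypothetical.

```python
import os


def sqlite_url(path):
    """Build a SQLAlchemy-style SQLite URL from a filesystem path."""
    # A leading "~" is a shell convention, so expand it explicitly.
    expanded = os.path.expanduser(path)
    # The prefix is always "sqlite:///"; an absolute POSIX path brings its
    # own leading "/", which is where the fourth slash comes from.
    return "sqlite:///" + expanded


relative_url = sqlite_url("db1.db")
absolute_url = sqlite_url("/home/stephen/db1")
print(relative_url)
print(absolute_url)
```

The returned string can be handed to create_engine as-is.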
The issue is a lack of backward compatibility, as noted by Everila. Anaconda installs its own sqlite, which is SQLite 3.x, and that version cannot load databases created by SQLite 2.x.
After creating the db with SQLite 3, the code works fine:
db=sqla.create_engine('sqlite:////home/stephen/db1')
pd.read_sql('select * from test',db)
which confirms the 4 slashes are needed.
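A quick way to tell whether a file is a SQLite 3 database before blaming the URL is to look at its first 16 bytes, which for SQLite 3 are the text "SQLite format 3" followed by a NUL byte. A minimal sketch; the helper name is_sqlite3_file is hypothetical and the demo files are created in a temporary directory.

```python
import os
import sqlite3
import tempfile

SQLITE3_MAGIC = b"SQLite format 3\x00"  # 16-byte header of a SQLite 3 file


def is_sqlite3_file(path):
    """Return True if the file at `path` starts with the SQLite 3 header."""
    with open(path, "rb") as fh:
        return fh.read(16) == SQLITE3_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # A real SQLite 3 database written by the standard-library module.
    good = os.path.join(tmp, "good.db")
    conn = sqlite3.connect(good)
    conn.execute("CREATE TABLE test (a INTEGER, b INTEGER)")
    conn.commit()
    conn.close()

    # Any other file, standing in for a database in an older format.
    other = os.path.join(tmp, "other.db")
    with open(other, "wb") as fh:
        fh.write(b"not an SQLite 3 file")

    good_result = is_sqlite3_file(good)
    other_result = is_sqlite3_file(other)

print(good_result, other_result)
```

If the check fails on your file, the "file is not a database" error is about the file format and not about the number of slashes.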
None of the sqlalchemy solutions worked for me with python 3.10.6 and sqlalchemy 2.0.0b4; it could be a beta issue, or version 2.0.0 changed things. @corina-roca's solution was close, but not right, as you need to pass a connection object, not an engine object. That's what the documentation says, but it didn't actually work. After a bit of experimentation, I discovered that engine.raw_connection() works, although you get a warning on the CLI. Here are my working examples:
The sqlite one works out of the box - but it's not ideal if you are thinking of changing databases later
import sqlite3
import pandas as pd
conn = sqlite3.connect("/home/stephen/db1")  # sqlite3 takes a plain file path, not a sqlite:/// URL
df = pd.read_sql_query('SELECT * FROM test', conn)
df.head()
# works, no problem
sqlalchemy lets you abstract your db away
import pandas as pd
from sqlalchemy import create_engine, text
engine = create_engine("sqlite:////home/stephen/db1")
conn = engine.connect()  # <- this is also what you are supposed to
                         #    pass to pandas... it doesn't work
result = conn.execute(text("select * from test"))
for row in result:
    print(row)  # outside pandas, this works - proving that
                # the connection is established
conn = engine.raw_connection()  # with this workaround, it works; but you
                                # get a warning UserWarning: pandas only
                                # supports SQLAlchemy connectable ...
df = pd.read_sql_query(sql='SELECT * FROM test', con=conn)
df.head()
I am using a Databricks notebook and trying to export my dataframe as CSV to my local machine after querying it. However, it does not save my CSV to my local machine. Why?
Connect to Database
#SQL Connector
import pandas as pd
import psycopg2
import numpy as np
from pyspark.sql import *
#Connection
cnx = psycopg2.connect(dbname= 'test', host='test', port= '1234', user= 'test', password= 'test')
cursor = cnx.cursor()
SQL Query
query = """
SELECT * from products;
"""
# Execute the query
try:
    cursor.execute(query)
except psycopg2.OperationalError as msg:
    print("Command skipped: ", msg)
#Fetch all rows from the result
rows = cursor.fetchall()
# Convert into a Pandas Dataframe
df = pd.DataFrame( [[ij for ij in i] for i in rows] )
Exporting Data as CSV to Local Machine
df.to_csv('test.csv')
It does NOT give any error, but when I use my Mac's search to find "test.csv", it does not exist. I presume that the operation did not work, so the file was never saved from the Databricks cloud server to my local machine. Does anybody know how to fix it?
Select from SQL Server:
import pypyodbc
cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=Server_Name;"
"Database=TestDB;"
"Trusted_Connection=yes;")
#cursor = cnxn.cursor()
#cursor.execute("select * from Actions")
cursor = cnxn.cursor()
cursor.execute('SELECT * FROM Actions')
for row in cursor:
    print('row = %r' % (row,))
From SQL Server to Excel:
import pyodbc
import pandas as pd
# cnxn = pyodbc.connect("Driver={SQL Server};SERVER=xxx;Database=xxx;UID=xxx;PWD=xxx")
cnxn = pyodbc.connect(r"Driver={SQL Server};SERVER=EXCEL-PC\SQLEXPRESS;Database=NORTHWND;")  # raw string keeps the backslash literal
data = pd.read_sql('SELECT * FROM Orders',cnxn)
data.to_excel('C:\\your_path_here\\foo.xlsx')
Since you are using Databricks, you are most probably working on a remote machine. As already mentioned, saving the file the way you do won't work (the file will be saved to the machine your notebook's master node is on). Try running:
import os
os.listdir(os.getcwd())
This will list all the files in the directory the notebook is running from (at least, that is how Jupyter notebooks work). You should see the saved file there.
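A related check is to ask where a relative filename such as 'test.csv' resolves to, since that is the location to_csv writes to when given no directory. A minimal sketch using only the standard library; the helper name where_will_it_land is hypothetical.

```python
from pathlib import Path


def where_will_it_land(filename):
    """Absolute path that a relative filename resolves to right now."""
    return Path(filename).resolve()


target = where_will_it_land("test.csv")
print(target)           # location on the machine that runs this code
print(target.exists())  # whether a file is already there
```

Run inside the notebook, this reports a path on the remote machine, which is why nothing shows up on the local Mac.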
However, I would think that Databricks provides utility functions to their clients for easy data download from the cloud. Also, try using Spark to connect to the db; it might be a little more convenient.
I think these two links should be useful for you:
Similar question on databricks forums
Databricks documentation
Because you're running this in a Databricks notebook, when you're using Pandas to save your file to test.csv, this is being saved to the Databricks driver node's file directory. A way to test this out is the following code snippet:
# Within Databricks, there are sample files ready to use within
# the /databricks-datasets folder
df = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", inferSchema=True, header=True)
# Converting the Spark DataFrame to a Pandas DataFrame
import pandas as pd
pdDF = df.toPandas()
# Save the Pandas DataFrame to disk
pdDF.to_csv('test.csv')
The location of your test.csv is within the /databricks/driver/ folder of your Databricks' cluster driver node. To validate this:
# Run the following shell command to see the results
%sh cat test.csv
# The output directory is shown here
%sh pwd
# Output
# /databricks/driver
To save the file to your local machine (i.e. your Mac), you can view the Spark DataFrame using the display command within your Databricks notebook. From here, you can click on the "Download to CSV" button which is highlighted in red in the below image.
Using Python: when connecting to SQL Server using pyodbc, everything works fine, but when I switch to sqlalchemy, the connection fails, giving me the error message:
('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
My code:
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=servername;DATABASE=dbname;UID=username;PWD=password')
engine = sqlalchemy.create_engine("mssql+pyodbc://username:password@servername/dbname")
I can't find the error in my code, and don't understand why the first option works but the second doesn't.
Help is highly appreciated!
Ran into this problem as well; appending a driver query string to the end of my connection path worked:
"mssql+pyodbc://" + uname + ":" + pword + "@" + server + "/" + dbname + "?driver=SQL+Server"
Update (July 2021): as above, just modernized (Python 3.6+):
f"mssql+pyodbc://{uname}:{pword}@{server}:{port}/{dbname}?driver=ODBC+Driver+17+for+SQL+Server"
Note that driver= must be all lowercase.
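The plus signs in the driver name are just URL encoding for spaces, and special characters in the user name or password need the same treatment, since "@" and ":" are separators in the URL. A minimal sketch using urllib.parse.quote_plus from the standard library; the helper name mssql_pyodbc_url is hypothetical, the credentials are placeholders, and the driver name has to match one that is installed on your machine.

```python
from urllib.parse import quote_plus


def mssql_pyodbc_url(user, password, server, database,
                     driver="ODBC Driver 17 for SQL Server"):
    """Assemble an mssql+pyodbc URL with URL-encoded credentials and driver."""
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}/{database}?driver={quote_plus(driver)}"
    )


# Placeholder credentials; note the "@" inside the password.
url = mssql_pyodbc_url("username", "p@ssword", "servername", "dbname")
print(url)
```

The resulting string is what gets passed to sqlalchemy.create_engine.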
It works using pymssql, instead of pyodbc.
Install pymssql using pip, then change your code to:
engine = sqlalchemy.create_engine("mssql+pymssql://username:password@servername/dbname")
Very late, but experienced the same such problem myself recently. Turned out it was a problem with the latest SQLAlchemy version. Had to rollback my version from 1.4.17 to 1.4.12 (unsure of in-between versions, just went with a version I knew worked).
pip install sqlalchemy==1.4.12
I had the original poster's problem with a trusted connection to a Microsoft SQL Server database (pandas 1.5.3, SQLAlchemy 2.0.4). Using answers from this question, this did the trick for me:
import sqlalchemy
import pandas as pd
server = "servername"
database = "dbname"
driver = "ODBC+Driver+17+for+SQL+Server"
url = f"mssql+pyodbc://{server}/{database}?trusted_connection=yes&driver={driver}"
engine = sqlalchemy.create_engine(url)
query = """
SELECT [column1]
,[column2] as some_other_name
FROM [server].[dbo].[table]"""
with engine.begin() as conn:
    sqla_query = sqlalchemy.text(query)
    df = pd.read_sql(sqla_query, conn)
It should be noted that pandas is not yet fully compatible with SQLAlchemy 2.0: https://pandas.pydata.org/docs/whatsnew/v1.5.3.html