I have a pandas DataFrame in Python and want to write it directly to a Netezza database.
I would like to use the pandas.to_sql() method that is described here, but it seems that this method requires SQLAlchemy to connect to the database.
The problem: SQLAlchemy does not support Netezza.
What I am using at the moment to connect to the database is pyodbc. But pyodbc, on the other hand, is not understood by pandas.to_sql(), or am I wrong about that?
My workaround is to write the DataFrame to a CSV file via pandas.to_csv() and send that to the Netezza database via pyodbc.
Since I am dealing with big data, writing the CSV first is a performance issue. I actually do not care whether I have to use SQLAlchemy, pyodbc, or something different, but I cannot change the fact that I have a Netezza database.
I am aware of deontologician's project, but as the author states himself, it "is far from complete, has a lot of bugs".
I got the package to work (see my solution below). But if someone knows a better solution, please let me know!
I figured it out. For my solution, see the accepted answer.
Solution
I found a solution that I want to share with everyone who has the same problem.
I tried the netezza dialect from deontologician, but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I only made some small changes; it is mostly the work of deontologician, and nobody is maintaining it.
With the netezza dialect in place, I got pandas.to_sql() to work directly with the Netezza database:
import netezza_dialect
from sqlalchemy import create_engine

engine = create_engine("netezza://ODBCDataSourceName")

df.to_sql("YourTable",          # name of the target table
          engine,
          if_exists='append',
          index=False,
          dtype=your_dtypes,
          chunksize=1600,
          method='multi')
A little explanation of the to_sql() parameters:
It is essential that you use the method='multi' parameter if you do not want pandas to take forever to write to the database, because without it pandas sends one INSERT query per row. You can use 'multi' or define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
When using method='multi' it can happen (it happened at least to me) that you exceed the parameter limit. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
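To see these two parameters in action without a Netezza system at hand, here is a small runnable sketch that uses an in-memory SQLite engine as a stand-in; the to_sql() call itself is identical to the one above (table and column names are made up for the demo):

```python
# Sketch: demonstrating to_sql() with method='multi' and chunksize.
# An in-memory SQLite engine stands in for the Netezza engine above;
# only the create_engine() URL differs.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # stand-in for create_engine("netezza://...")
df = pd.DataFrame({"id": range(5), "value": list("abcde")})

# method='multi' packs many rows into each INSERT statement;
# chunksize caps how many rows go into one statement, so the
# driver's parameter limit is not exceeded.
df.to_sql("demo_table", engine, if_exists="append", index=False,
          chunksize=1600, method="multi")

with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM demo_table")).scalar()
print(count)  # 5
```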
Note
If you get a warning or error like the following:
C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
"No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
Then you probably tried to connect to the database via
engine = create_engine("netezza://usr:pass@address:port/database_name")
You have to set up the database in the Windows ODBC Data Source Administrator tool and then use the name you defined there:
engine = create_engine("netezza://ODBCDataSourceName")
Then it should have no problem finding the driver.
I know you already answered the question yourself (thanks for sharing the solution).
One general comment about large data writes to Netezza:
I'd always choose to write the data to a file and then use the external table/ODBC interface to insert it. Instead of inserting 1,600 rows at a time, you can probably insert millions of rows in the same timeframe.
We use UTF-8 data in the flat file, and CSV, unless you want to load binary data, which will probably require fixed-width files.
I'm not Python savvy, but I hope you can follow me ...
If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
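As a rough illustration of the file-plus-external-table approach from Python: dump the DataFrame to a flat file, then issue one INSERT ... SELECT FROM EXTERNAL statement over the existing pyodbc connection. The table name, file path, and the external-table options (DELIMITER, REMOTESOURCE 'ODBC', SKIPROWS) are assumptions; check the IBM documentation linked above for the exact options your Netezza version supports.

```python
# Sketch: bulk-loading a DataFrame into Netezza via a transient external
# table instead of row-wise INSERTs. Table name, file path, and the
# external-table options are illustrative, not verified against a live
# system; consult the IBM external-table docs for your version.

def build_external_load_sql(target_table: str, csv_path: str) -> str:
    """Build the INSERT ... SELECT FROM EXTERNAL statement."""
    return (
        f"INSERT INTO {target_table} "
        f"SELECT * FROM EXTERNAL '{csv_path}' "
        "USING (DELIMITER ',' REMOTESOURCE 'ODBC' SKIPROWS 1)"
    )

def load_dataframe(df, conn, target_table: str, csv_path: str) -> None:
    """Write df to csv_path, then load it in one bulk operation."""
    df.to_csv(csv_path, index=False, encoding="utf-8")
    cursor = conn.cursor()                      # conn = pyodbc.connect(...)
    cursor.execute(build_external_load_sql(target_table, csv_path))
    conn.commit()

example_sql = build_external_load_sql("MY_TARGET_TABLE", "/tmp/bulk_load.csv")
```

The point of the design is that the database server (or the ODBC driver, with REMOTESOURCE) streams the whole file in one operation, which is why it beats even multi-row INSERTs by orders of magnitude.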
Related
I am aware that a similar question exists. It has not been marked as answered yet, and I have tried all the suggestions so far. I am also not a native speaker; please excuse spelling mistakes.
I have written a small class in Python to interact with a SQL Server database.
Now I want to be able to connect to either a SQL Server or a MySQL database with the same functionality.
It would be perfect for me to just change the connection type in the instance initialization, to keep my class maintainable. Otherwise I would need to create a second class using, for example, mysql.connector, which would result in two classes with nearly the same structure and content.
This is how I tried to use pyodbc so far:
conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=xyzhbv;'
                      'Database=Test;'
                      'ENCRYPT=yes;'
                      'UID=root;'
                      'PWD=12345;')
Please note that I changed all credentials.
What do I need to change to use pyodbc for MySQL?
Is that even possible?
Or can I use both libraries within one class without conflicts? (They share function names.)
Many thanks for any help.
Have a great day.
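For what it's worth, pyodbc itself is database-agnostic; in principle only the connection string changes, so a single class can be kept by parameterizing it. A sketch of that idea follows. The MySQL driver name ("MySQL ODBC 8.0 Unicode Driver") is an assumption that depends on which ODBC driver is installed; check what pyodbc.drivers() reports on your machine.

```python
# Sketch: one helper that builds a pyodbc connection string for either
# SQL Server or MySQL. The MySQL driver name is an assumption; it must
# match an installed ODBC driver (see pyodbc.drivers()). Server,
# database, and credentials are the placeholders from the question.

def build_conn_str(db_type: str, server: str, database: str,
                   user: str, password: str) -> str:
    drivers = {
        "mssql": "{SQL Server}",
        "mysql": "{MySQL ODBC 8.0 Unicode Driver}",  # assumed driver name
    }
    return (f"Driver={drivers[db_type]};"
            f"Server={server};"
            f"Database={database};"
            f"UID={user};"
            f"PWD={password};")

demo_str = build_conn_str("mysql", "xyzhbv", "Test", "root", "12345")
# conn = pyodbc.connect(demo_str)
```

Because pyodbc exposes the standard DB-API (cursor(), execute(), fetchall(), ...) for both back ends, the rest of the class should not need to change, and no second library with clashing function names is required.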
I'm trying to set up a connection to AWS Redshift from the Great Expectations framework (GE) according to the tutorial, using Python, and I am facing two issues:
When I use postgresql+psycopg2 as the driver in the connection string in step 5, adding the datasource (context.add_datasource(**datasource_config)) takes extremely long (up to 20 minutes!). Validating expectations afterwards works as expected and even runs quite fast. I assume the huge amount of time needed is due to the size of the Redshift cluster I'm connecting to (more than 1000 schemas) and the postgresql driver not being optimized for Redshift.
In search of alternatives to the postgresql driver, I came across the sqlalchemy-redshift driver. Changing it in the connection string (redshift+psycopg2) adds the datasource instantly; however, validating some expectations (e.g. expect_column_values_to_not_be_null) fails! After some digging through the code, I realized it might be due to GE creating a temporary table in the SQL query. So when I specify the query:
select * from my_redshift_schema.my_table;
GE actually seems to run something like:
CREATE TEMPORARY TABLE "ge_temp_bf3cbfa2" AS select * from my_redshift_schema.my_table;
For certain expectations, sqlalchemy-redshift tries to look up information about the columns of the table; however, it searches for the name of the temporary table rather than the actual one I specified in the SQL query. It consequently fails, as it obviously cannot find a table with that name in the Redshift cluster. More specifically, it results in a KeyError in the dialect.py file within sqlalchemy-redshift:
.venv/lib/python3.8/site-packages/sqlalchemy_redshift/dialect.py\", line 819, in _get_redshift_columns
return all_schema_columns[key]
KeyError: RelationKey(name='ge_temp_bf3cbfa2', schema='public')
Has anyone succeeded in running GE on Redshift? How could I mitigate the issues I'm facing (make option 1 faster, or fix the error in option 2)?
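For context, the datasource config from step 5 of the tutorial looks roughly like this (the cluster host, credentials, and datasource name are placeholders, and the exact keys may differ between GE versions); the driver prefix in the connection string is the only part that changes between the two options described above:

```python
# Sketch of the datasource config from the GE tutorial. All names and
# the connection string are placeholders; swapping the
# "postgresql+psycopg2" prefix for "redshift+psycopg2" is the change
# discussed above.
connection_string = (
    "postgresql+psycopg2://user:password"
    "@my-cluster.abc123.eu-west-1.redshift.amazonaws.com:5439/dev"
)

datasource_config = {
    "name": "my_redshift_datasource",
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        "connection_string": connection_string,
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

# context.add_datasource(**datasource_config)
```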
I am trying to fetch CLOB data from an Oracle server, with the connection made through an SSH tunnel.
When I try to run the following code:
(id,clob) = cursor.fetchone()
print('one fetched')
clob_data = clob.read()
print(clob_data)
the execution freezes.
Can someone help me figure out what's wrong here? I have referred to the cx_Oracle docs, and the example code is just the same.
It is possible that there is a round trip taking place that is not being handled properly by the cx_Oracle driver. Please create an issue here (https://github.com/oracle/python-cx_Oracle/issues) with a few more details such as platform, Python version, Oracle database/client version, etc.
You can probably work around the issue, however, by simply returning the CLOBs as strings as can be seen in this sample: https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py.
I have a large SQL dump file ... with multiple CREATE TABLE and INSERT INTO statements. Is there any way to load these all into a SQLAlchemy SQLite database at once? I plan to use the introspected ORM from sqlsoup after I've created the tables. However, when I use the engine.execute() method it complains: sqlite3.Warning: You can only execute one statement at a time.
Is there a way to work around this issue? Perhaps splitting the file with a regexp or some kind of parser, but I don't know enough SQL to cover all of the cases with a regexp.
Any help would be greatly appreciated.
Will
EDIT:
Since this seems important ... the dump file was created from a MySQL database, so it contains quite a bit of syntax that sqlite3 does not understand correctly.
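For the "one statement at a time" error specifically, the standard-library sqlite3 module's executescript() runs a whole multi-statement script in one call (with SQLAlchemy you can reach it via engine.raw_connection()). It sidesteps the warning, though it will not translate MySQL-specific syntax in the dump; the toy dump below is illustrative:

```python
# Sketch: sqlite3's executescript() executes a multi-statement script
# at once, avoiding "You can only execute one statement at a time".
# It does not, however, fix MySQL-specific syntax in the dump.
import sqlite3

dump = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (name) VALUES ('alice');
INSERT INTO users (name) VALUES ('bob');
"""

conn = sqlite3.connect(":memory:")
conn.executescript(dump)   # via SQLAlchemy: engine.raw_connection()...executescript(dump)
rows = conn.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('alice',), ('bob',)]
```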
"or some kind of parser"
I've found MySQL to be a great parser for MySQL dump files :)
You said it yourself: "so it has quite a few commands/syntax that sqlite3 does not understand correctly." Clearly then, SQLite is not the tool for this task.
As for your particular error: without context (i.e. a traceback) there's nothing I can say about it. Martelli or Skeet could probably reach across time and space and read your interpreter's mind, but me, not so much.
The SQL recognized by MySQL and the SQL in SQLite are quite different. I suggest dumping the data of each table individually, then loading the data into equivalent tables in SQLite.
Create the tables in SQLite manually, using a subset of the "CREATE TABLE" commands given in your raw-dump file.
I am facing an atypical conversion problem. About a decade ago I coded up a large site in ASP. Over the years this turned into ASP.NET, but it kept the same database.
I've just re-done the site in Django and I've copied all the core data but before I cancel my account with the host, I need to make sure I've got a long-term backup of the data so if it turns out I'm missing something, I can copy it from a local copy.
To complicate matters, I no longer have Windows; I moved to Ubuntu on all my machines some time back. I could ask the host to send me a backup, but having no access to a machine with MSSQL, I wouldn't be able to use it if I needed to.
So I'm looking for something that does:
db = {}
for table in database:
    db[table.name] = [row for row in table]
And then I could serialize db off somewhere for later consumption... But how do I do the table iteration? Is there an easier way to do all of this? Can MSSQL do a cross-platform SQL dump (including data)?
For MSSQL I've previously used pymssql, but I don't know how to iterate over the tables and copy the rows (ideally with column headers so I can tell what the data is). I'm not looking for much code, but I need a poke in the right direction.
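The iteration itself only needs the standard DB-API, which pymssql implements: run a SELECT per table and read the column headers from cursor.description. A runnable sketch follows, using sqlite3 as a stand-in connection so it works anywhere; with pymssql only the connect call and the table-listing query differ (table and column names here are made up):

```python
# Sketch: dump every listed table of a DB-API connection into a dict of
# {table_name: [row_dict, ...]}. pymssql connections work the same way;
# only the query that lists the tables is engine-specific.
import sqlite3  # stand-in for pymssql so the sketch is runnable

def dump_tables(conn, table_names):
    db = {}
    for name in table_names:
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM {name}")          # trust table_names!
        headers = [col[0] for col in cur.description]  # column names
        db[name] = [dict(zip(headers, row)) for row in cur.fetchall()]
    return db

# For MSSQL/pymssql the table list would come from e.g.:
#   SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
#   WHERE TABLE_TYPE = 'BASE TABLE'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (name TEXT, kind TEXT)")
conn.execute("INSERT INTO pets VALUES ('rex', 'dog')")

snapshot = dump_tables(conn, ["pets"])
print(snapshot)  # {'pets': [{'name': 'rex', 'kind': 'dog'}]}
```

The resulting dict of plain rows can then be serialized with json or pickle for the long-term backup.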
Have a look at the sysobjects and syscolumns tables. Also try:
SELECT * FROM sysobjects WHERE name LIKE 'sys%'
to find any other metadata tables of interest. See here for more info on these tables and the newer SQL Server 2005 counterparts.
I've liked the ADOdb Python module when I've needed to connect to SQL Server from Python. Here is a link to a simple tutorial/example: http://phplens.com/lens/adodb/adodb-py-docs.htm#tutorial
I know you said JSON, but it's very simple to generate a SQL script to do an entire dump in XML:
SELECT REPLACE(REPLACE('SELECT * FROM {TABLE_SCHEMA}.{TABLE_NAME} FOR XML RAW',
                       '{TABLE_SCHEMA}', QUOTENAME(TABLE_SCHEMA)),
               '{TABLE_NAME}', QUOTENAME(TABLE_NAME))
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME
As an aside to your coding approach, I'd say:
set up a virtual machine with a Windows eval on it
put a SQL Server eval on it
restore your data
check it manually or automatically, using the excellent database scripting tools from Red Gate to script the data and the schema
If everything is fine, then you have (a) a good backup and (b) a scripted output.