I'm trying to set up a connection to AWS Redshift from the Great Expectations framework (GE) according to the tutorial, using Python, and I am facing two issues:
When I'm using postgresql+psycopg2 as driver in the connection string in step 5, adding the datasource (context.add_datasource(**datasource_config)) takes extremely long (up to 20 minutes !!!). Validating expectations afterwards works as expected and even runs quite fast. I'm assuming the huge amount of time needed is due to the size of the redshift cluster I'm connecting to (more than 1000 schemas) and the postgresql driver not being optimized for redshift.
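For reference, this is roughly the shape of the datasource config I'm adding (connection details are placeholders, and the key names are reproduced from memory from the tutorial, so treat them as approximate); the connection_string is the only part I change between the two drivers:

import great_expectations  # context below is the DataContext from the earlier tutorial steps

datasource_config = {
    "name": "my_redshift_datasource",  # placeholder name
    "class_name": "Datasource",
    "execution_engine": {
        "class_name": "SqlAlchemyExecutionEngine",
        # option 1 (tutorial): postgresql+psycopg2, option 2: redshift+psycopg2
        "connection_string": "postgresql+psycopg2://user:password@my-cluster.abc123.eu-west-1.redshift.amazonaws.com:5439/mydb",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}

# this is the call that takes up to 20 minutes with the postgresql driver
context.add_datasource(**datasource_config)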
In search of alternatives to the postgresql driver I came across the sqlalchemy-redshift driver. Changing it in the connection string (redshift+psycopg2) adds the datasource instantly; however, validating some expectations (e.g. expect_column_values_to_not_be_null) fails! After some digging through the code I realized it might be due to GE creating a temporary table from the SQL query. So when I specify the query:
select * from my_redshift_schema.my_table;
GE actually seems to run something like:
CREATE TEMPORARY TABLE "ge_temp_bf3cbfa2" AS select * from my_redshift_schema.my_table;
For certain expectations sqlalchemy-redshift tries to find information about the columns of the table; however, it searches for the name of the temporary table and not the actual one I specified in the SQL query. It consequently fails, as it obviously can't find a table with that name in the Redshift cluster. More specifically, it results in a KeyError in dialect.py within sqlalchemy-redshift:
.venv/lib/python3.8/site-packages/sqlalchemy_redshift/dialect.py", line 819, in _get_redshift_columns
return all_schema_columns[key]
KeyError: RelationKey(name='ge_temp_bf3cbfa2', schema='public')
Has anyone succeeded in running GE on Redshift? How could I mitigate the issues I'm facing (make option 1 faster or fix the error in option 2)?
I have a Python script to import data from raw csv/xlsx files. For these I use Pandas to load the files, do some light transformation, and save them to an sqlite3 database. This is fast (as fast as any method). After this, I run some queries against these tables to make some intermediate datasets, which I run through a function (see below).
More information: I am using Anaconda/Python3 (3.9) on Windows 10 Enterprise.
UPDATE:
Just as information for anybody reading this, I ended up going back to just using standalone Python (still using JupyterLab though)... I no longer have this issue. So not sure if it is a problem with something Anaconda does or just the versions of various libraries being used for that particular Anaconda distribution (latest available). My script runs more or less in the time that I would expect using Python 3.11 and the versions pulled in by pip for Pandas and sqlite (1.5.3 and 3.38.4).
Python function for running sqlite3 queries:
def runSqliteScript(destConnString, queryString):
    '''Runs an sqlite script given a connection string and a query string
    '''
    try:
        print('Trying to execute sql script: ')
        print(queryString)
        cursorTmp = destConnString.cursor()
        cursorTmp.executescript(queryString)
    except Exception as e:
        print('Error caught: {}'.format(e))
Because somebody asked, here is the function that creates the "destConnString" (it's called something else in the actual function call, but it's the same type):
def createSqliteDb(db_file):
    ''' Creates an sqlite database at the directory/file name specified
    '''
    conSqlite = None
    try:
        conSqlite = sqlite3.connect(db_file)
        return conSqlite
    except sqlite3.Error as e:
        print('Error {} when trying to create {}'.format(e, db_file))
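For completeness, the two helpers are used together roughly like this (file name and query are placeholders standing in for the real ones):

import sqlite3

# Create (or open) the target database file and get the connection back.
connDest = createSqliteDb('output/ob_data.sqlite3')  # placeholder path

# Run one of the intermediate-dataset scripts against that connection.
queryString = 'BEGIN; drop table if exists tbl_1110_cop_omd_fmd; COMMIT;'  # cut-down example
runSqliteScript(connDest, queryString)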
Example of one of the queries (I commented out journal mode/synchronous pragmas after it didn't seem to help at all):
-- PRAGMA journal_mode = WAL;
-- PRAGMA synchronous = NORMAL;
BEGIN;
drop table if exists tbl_1110_cop_omd_fmd;
COMMIT;
BEGIN;
create table tbl_1110_cop_omd_fmd as
select
siteId,
orderNumber,
familyGroup01, familyGroup02,
count(*) as countOfLines
from tbl_0000_ob_trx_for_frazelle
where 1 = 1
-- and dateCreated between datetime('now', '-365 days') and datetime('now', 'localtime') -- temporarily commented due to no date in file
group by siteId,
orderNumber,
familyGroup01, familyGroup02
order by dateCreated asc
;
COMMIT
;
Here is a complete list of the things that I have tried. Unfortunately, no matter what combination of things I have tried, it has ended up having one bottleneck or another. It seems there is some kind of write bottleneck from Python to sqlite3, yet the pandas to_sql method doesn't seem to be affected by it.
I tried wrapping all my queries in begin/commit statements. I put these in-line with the query, though I'd be interested in knowing if this is the correct way to do this (see the sketch after this list). This seemed to have no effect.
I tried setting the journal mode to WAL and synchronous to normal, again to no effect.
I tried running the queries in an in-memory database.
Firstly, I tried creating everything from scratch in the in-memory database. The tables didn't create any faster. Saving this in-memory database seems to be a bottleneck (backup method).
Next, I tried creating views instead of tables (again, creating everything from scratch in the in-memory database). This created really quickly. Weirdly, querying these views was very fast. Saving this in-memory database seems to be a bottleneck (backup method).
I tried just writing views to the database file (not in-memory). Unfortunately, the views take as long as making the tables when running from Python/sqlite.
I don't really want to do anything strictly in-memory for the database creation, as this Python script is used for different sets of data, some of which could have too many rows for an in-memory setup. The only thing I have left to try is to take the in-memory from-scratch setup, make views instead of tables, read ALL the in-memory db tables with pandas (read_sql), then write ALL the tables to a file db with pandas (to_sql)... Hoping there is something easy to try to resolve this problem.
connOBData = sqlite3.connect('file:cachedb?mode=memory&cache=shared', uri=True)
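For reference, this is what I mean by the in-line begin/commit question: the alternative would be to let the sqlite3 connection manage the transaction itself (sketch only, with a cut-down version of one query; note that Python's sqlite3 only opens implicit transactions for DML statements, so the drop/create may still autocommit):

import sqlite3

conn = sqlite3.connect('output/ob_data.sqlite3')  # placeholder file name
try:
    # The connection works as a context manager: commit on success,
    # rollback if an exception is raised inside the block.
    with conn:
        conn.execute('drop table if exists tbl_1110_cop_omd_fmd')
        conn.execute('create table tbl_1110_cop_omd_fmd as '
                     'select siteId, orderNumber, count(*) as countOfLines '
                     'from tbl_0000_ob_trx_for_frazelle '
                     'group by siteId, orderNumber')
except sqlite3.Error as e:
    print('Error caught: {}'.format(e))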
These take approximately 1,000 times longer (or more) than if I run the queries directly in DB Browser (an sqlite frontend). The queries aren't that complex and run fine (in ~2-4 seconds) in DB Browser. All told, if I run all the queries in a row in DB Browser, they run in 1-2 minutes. If I let them run through the Python script, it literally takes close to 20 hours. I'd expect the queries to finish in approximately the same time that they run in DB Browser.
I have a pandas DataFrame in Python and want to write this DataFrame directly into a Netezza database.
I would like to use the pandas.to_sql() method that is described here, but it seems that this method requires SQLAlchemy to connect to the database.
The problem: SQLAlchemy does not support Netezza.
What I am using at the moment to connect to the database is pyodbc. But this, on the other hand, is not understood by pandas.to_sql(), or am I wrong about that?
My workaround to this is to write the DataFrame into a csv file via pandas.to_csv() and send this to the Netezza Database via pyodbc.
Since I have big data, writing the csv first is a performance issue. I actually do not care if I have to use SQLAlchemy or pyodbc or something different but I cannot change the fact that I have a Netezza Database.
I am aware of the deontologician project, but as the author himself states, it "is far from complete, has a lot of bugs".
I got the package to work (see my solution below). But if someone knows a better solution, please let me know!
I figured it out. For my solution see accepted answer.
Solution
I found a solution that I want to share for everyone with the same problem.
I tried the netezza dialect from deontologician, but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I just made some small changes, that it is mostly the work of deontologician, and that nobody is maintaining it.
Having the netezza dialect, I got pandas.to_sql() to work directly with the Netezza database:
import netezza_dialect
from sqlalchemy import create_engine
engine = create_engine("netezza://ODBCDataSourceName")
df.to_sql("YourDatabase",
          engine,
          if_exists='append',
          index=False,
          dtype=your_dtypes,
          chunksize=1600,
          method='multi')
A little explanation of the to_sql() parameters:
It is essential that you use the method='multi' parameter if you do not want pandas to take forever to write to the database, because without it pandas sends an INSERT query per row. You can use 'multi' or you can define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
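For anyone who wants to go the custom route: according to the pandas docs, the callable receives (table, conn, keys, data_iter). A rough sketch of such a method (not Netezza-specific, I have not benchmarked it, and df/engine are the objects from the snippet above):

import sqlalchemy as sa

def insert_in_chunks(table, conn, keys, data_iter):
    # table: pandas SQLTable, conn: SQLAlchemy connection,
    # keys: list of column names, data_iter: iterable of row tuples
    columns = ', '.join(keys)
    placeholders = ', '.join(':{}'.format(k) for k in keys)
    sql = sa.text('INSERT INTO {} ({}) VALUES ({})'.format(table.name, columns, placeholders))
    rows = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(sql, rows)  # one executemany per chunk

df.to_sql("YourDatabase", engine, if_exists='append', index=False,
          chunksize=1600, method=insert_in_chunks)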
When using method='multi' it can happen (it happened to me, at least) that you exceed the parameter limit. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
Note
If you get a warning or error like the following:
C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
"No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
Then you probably tried to connect to the database via
engine = create_engine("netezza://usr:pass@address:port/database_name")
You have to set up the database in the ODBC Data Source Administrator tool from Windows and then use the name you defined there.
engine = create_engine("netezza://ODBCDataSourceName")
Then it should have no problem finding the driver.
I know you already answered the question yourself (thanks for sharing the solution)
One general comment about large data-writes to Netezza:
I’d always choose to write data to a file and then use the external table/ODBC interface to insert the data. Instead of inserting 1600 rows at a time, you can probably insert millions of rows in the same timeframe.
We use UTF-8 data in the flat file, in CSV format, unless you want to load binary data, which will probably require fixed-width files.
I'm not Python savvy, but I hope you can follow me...
If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
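Roughly what I mean, in Python terms (sketch only: table/file/DSN names are made up, and the exact USING options have to match how the file was written; see the external table syntax in the link above):

import pyodbc

# 1) Dump the DataFrame (df from the question) to a delimited UTF-8 flat file.
df.to_csv('/tmp/my_table.dat', sep='|', index=False, header=False, encoding='utf-8')

# 2) Let Netezza bulk-load the file through a transient external table
#    over the ODBC connection, instead of row-wise INSERTs.
conn = pyodbc.connect('DSN=ODBCDataSourceName')
cur = conn.cursor()
cur.execute("""
    INSERT INTO my_target_table
    SELECT * FROM EXTERNAL '/tmp/my_table.dat'
    USING (DELIMITER '|' REMOTESOURCE 'ODBC' ENCODING 'internal')
""")
conn.commit()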
I have data from 6 months of emails (email properties like send date and subject line, plus recipient details like age, gender, etc., altogether around 20 columns) in my Teradata table. It comes to around 20 million rows in total, which I want to bring into Python for further predictive modelling purposes.
I tried to run the selection query using the 'pyodbc' connector, but it just runs for hours and hours. Then I stopped it and modified the query to fetch just 1 month of data (maybe 3-4 million rows), but it still takes a very long time.
Is there any better (faster) option than 'pyodbc', or any different approach altogether?
Any input is appreciated. Thanks!
When communicating between Python and Teradata, I recommend using the teradata package (pip install teradata; https://developer.teradata.com/tools/reference/teradata-python-module). It leverages ODBC (or REST) to connect.
Besides this, you could use JDBC via JayDeBeApi. JDBC can sometimes be somewhat faster than ODBC.
Both options support the Python Database API Specification, so your surrounding code doesn't have to be touched. E.g. pandas.read_sql works fine with connections from the above.
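A minimal sketch with the teradata module (system, credentials, and table names are placeholders; the UdaExec pattern is from the module's documentation linked above):

import teradata
import pandas as pd

udaExec = teradata.UdaExec(appName="EmailExtract", version="1.0", logConsole=False)

# method="odbc" uses the Teradata ODBC driver; "rest" would go through the REST API.
with udaExec.connect(method="odbc", system="tdprod",
                     username="myuser", password="mypassword") as session:
    # The session implements the Python DB API, so pandas can use it directly.
    df = pd.read_sql("SELECT * FROM mydb.email_events", session)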
Your performance issues look like they stem from some other issues:
network connectivity
Python (Pandas) memory handling
ad 1) throughput can only be replaced with more throughput
ad 2) you could try to do as much as possible within the database (feature engineering), and your local machine should have enough RAM ("pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset"). Maybe Apache Arrow can relieve some of your local RAM issues (see the sketch below).
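If RAM is the limit, you can also let pandas fetch the result in chunks instead of all 20 million rows at once (sketch, reusing the session and table name from the snippet above; the chunk size and the grouping column are arbitrary examples):

results = []
for chunk in pd.read_sql("SELECT * FROM mydb.email_events", session, chunksize=500000):
    # Reduce each chunk (aggregate / feature-engineer) instead of keeping all raw rows.
    results.append(chunk.groupby("gender").size())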
Check:
https://developer.teradata.com/tools/reference/teradata-python-module
http://wesmckinney.com/blog/apache-arrow-pandas-internals/
What would be the best way to import multi-million record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking of whether the record already exists, and a few other things.
Can this process be made to execute in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How will these be used with a direct load?
I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we took a radical strategy change and did the import (only) with Java and JDBC (+ some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).
I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
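A rough sketch of that kind of script (file, table, and column handling are made up, and the quoting is simplistic, so adapt as needed):

import csv

# Turn records.csv into a plain SQL script that can be fed straight to the
# mysql client, e.g.:  mysql mydb < inserts.sql
with open('records.csv', newline='') as src, open('inserts.sql', 'w') as dst:
    reader = csv.reader(src)
    columns = next(reader)  # assume the first row holds the column names
    for row in reader:
        values = ', '.join("'{}'".format(v.replace("'", "''")) for v in row)
        dst.write('INSERT INTO my_import_table ({}) VALUES ({});\n'
                  .format(', '.join(columns), values))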
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.
As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for MySQL you can use the LOAD DATA command. I'm sure there's something similar for Postgres as well.
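For the MySQL LOAD route, a minimal sketch from within Django (table and file names are placeholders, and LOCAL INFILE has to be enabled on both the client and the server):

from django.db import connection

with connection.cursor() as cursor:
    # Bulk-load the CSV straight into the Django-managed table,
    # bypassing the ORM entirely.
    cursor.execute("""
        LOAD DATA LOCAL INFILE '/data/records.csv'
        INTO TABLE myapp_record
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        IGNORE 1 LINES
    """)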
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.
Like Craig said, you'd better fill the db directly first.
It implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, db feeding: a tool of choice for doing this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data control scripts from within Django, and when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 dbapi driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs and, if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
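A sketch of the engine-wide setting mentioned above (the connection URL is a placeholder, and the exact option name depends on your SQLAlchemy version):

from sqlalchemy import create_engine, text

# With the psycopg2 dialect this makes SQLAlchemy use named (server side)
# cursors, so rows stream to the client instead of being buffered up front.
engine = create_engine("postgresql+psycopg2://user:pass@host/db",
                       server_side_cursors=True)

with engine.connect() as conn:
    result = conn.execute(text("SELECT time, value FROM data ORDER BY time"))
    for row in result:
        print(row)  # rows arrive incrementally instead of after a long initial pause

# (With the ORM, Query.yield_per(N) gives similar batched fetching.)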
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PGSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.