Getting Data from Snowflake

Getting Data from Snowflake - python

So I have been trying to get a process set up to pull data from my Snowflake database in python using the Snowflake Python Connector. I have made a method for requesting data (shown below)
import snowflake.connector
def request_data(s, query):
snowflake_connection = snowflake.connector.connect(user = 'user',
password = 'password',
account = 'account',
warehouse = 'warehouse',
database = 'database',
schema = s)
try:
with snowflake_connection.cursor() as cursor:
cursor.execute(query)
data = cursor.fetch_pandas_all()
finally:
snowflake_connection.close()
return data
I have been able to request data from this method and work with the data in Python such as the example below
query = "select * from books.sales where date between '2020-08-01' and '2020-11-31'"
sales = request_data('BOOKS', query)
However, when I try to request a larger amount of data (such as change the date range to 2020-08-01 through 2021-07-31), I am getting an error
250003: Failed to get the response. Hanging? method: get, url: <snowflake url>
I have tried looking through the documentation and one thing that I have tested is printing the attribute rowcount in the cursor - print(cursor.rowcount) (which I added to the request data method between cursor.execute(query) and data = cursor.fetch_pandas_all(). When I did this I saw that rowcount was matching the number of rows that I got when I tested the query on a worksheet in snowflakecomputing.com.
So I am imagining it has to do with something related to the amount of data. The query where I have the date range be 2020-08-01 to 2021-07-31 was about 39,000 rows. I have looked at documentation for a limit on the amount of data in a request with the Snowflake Python Connector, and the only number that I ever saw was 10,000.
When I tried to reduce my date range so that the number of rows would be less than that, I was still getting the same error so I am not sure what is wrong. I could break the one query into multiple queries, but I am trying to keep my requests to a minimum. If someone knows how to solve this issue, I would greatly appreciate it.

This is a classic symptom of a packet inspector decrypting packets on your network. Snowflake sends small result sets one way and larger result sets another way. The larger way sometimes confuses packet inspectors because it's snowflakecomputing.com traffic that appears to be coming directly from the underlying host (AWS, Azure, or GCP).
Start by running this query:
select t.value:type::varchar as type,
t.value:host::varchar as host,
t.value:port as port
from table(flatten(input => parse_json(system$whitelist()))) as t;
After you run that, you'll see that the row for STAGE is not "snowflakecomputing.com" but something at aws.com, azure.com, or cloud.google.com. That's where the larger result sets are coming from and what something on the network is inspecting.
Show the results to your network security team. Explain to them that there can be no packet inspection (decryption) on any of these URL endpoints. It's possible to use IP ranges instead, but the IP ranges are vast so the URL is the way to go.
If they claim to have made the changes and things still aren't working (that happens sometimes), you can run SnowCD to test the required list: https://docs.snowflake.com/en/user-guide/snowcd.html Usually it turns out they missed one.
If you're using PrivateLink, refer to the documentation on URLs and ports here: https://docs.snowflake.com/en/sql-reference/functions/system_whitelist_privatelink.html

Related

Importing database with psycopg2 results a table with fewer lines than expected [duplicate]

I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki

Consider using server side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)

fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()

Here is the code to use for simple server side cursor with the speed of fetchmany management.
The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.

EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)

As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.
With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. The whole records are buffered in the client's memory. It is when you execute a query like curs.execute('SELECT * FROM ...'.
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how fetchone can help us memory wise ?
A: It fetches only one record from the stored data and creates a single Python object and hands you in your Python code while fetchall will fetch and create n Python objects from this data and hands it to you all in one chunk.
So If your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000 result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set + fetchall --> 1,000,000 Python objects
Of-course fetchone helped but still we have the whole records in memory. This is where Server-side cursors comes into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The draw-back :
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.

PyMongo cursor operations are very slow

I'm new to both MongoDB and pyMongo,
and am having some performance issues
regarding cursors.
TL,DNR: Anything operation I try to perform
using a cursor takes about a second.
Long version
I have a small database, which I bulkloaded. Each entry has 3 fields:
dom: domain name (unique)
date: date, YYYYMMDD
flag: string
I've loaded about 1.9 million entries, without incident, and quite quickly.
I created a hash index on the dom field.
Now, I want to grab certain records by the domain field, and update them, using a Python program.
That's where the problem lies.
I'm using the latest MongoDB, and the latest pyMongo.
stripped down program...
import pymongo
from pymongo import MongoClient
db = client.myindexname
posts = db.posts
print list(db.profiles.index_information()) # shows hash index is present
for k in newdomainlist.keys(): #iterate list of domains to check
ret = posts.find({"dom": k}) #this runs fine, and quickly
#'ret' is a cursor
print ret #this runs quickly
#Here's the problem
print ret.count() #this takes about a second. why?
If I just 'print ret', the speed is fine. However, if I try to
reference anything in the cursor, the speed drops to the floor - I
can do about 1 operation per second.
In this case, I'm just trying to see if ret.count() returns '0' (we don't
have this domain), or '1' (we have it already).
I've tried adding a batch_size(10000) to the find, without it helping.
I DO have the Python C extensions loaded.
What the heck am I doing wrong?
thanks

It turned out that I'd created my hashed index on the wrong field, 'collection', rather than 'posts'. Chalk it up to mongodb inexperience. We can close this one now, or delete it entirely.

bq.py Not Paging Results

We're working on writing a wrapper for bq.py and are having some problems with result sets larger than 100k rows. It seems like in the past this has worked fine (we had related problems with Google BigQuery Incomplete Query Replies on Odd Attempts). Perhaps I'm not understanding the limits explained on the doc page?
For instance:
#!/bin/bash
for i in `seq 99999 100002`;
do
bq query -q --nouse_cache --max_rows 99999999 "SELECT id, FROM [publicdata:samples.wikipedia] LIMIT $i" > $i.txt
j=$(cat $i.txt | wc -l)
echo "Limit $i Returned $j Rows"
done
Yields (note there are 4 lines of formatting):
Limit 99999 Returned 100003 Rows
Limit 100000 Returned 100004 Rows
Limit 100001 Returned 100004 Rows
Limit 100002 Returned 100004 Rows
In our wrapper, we directly access the API:
while row_count < total_rows:
data = client.apiclient.tabledata().list(maxResults=total_rows - row_count,
pageToken=page_token,
**table_dict).execute()
# If there are more results than will fit on a page,
# you will recieve a token for the next page
page_token = data.get('pageToken', None)
# How many rows are there across all pages?
total_rows = min(total_rows, int(data['totalRows'])) # Changed to use get(data[rows],0)
raw_page = data.get('rows', [])
We would expect to get a token in this case, but none is returned.

sorry it took me a little while to get back to you.
I was able to identify a bug that exists server-side, you would end up seeing this with the Java client as well as the python client. We're planning on pushing a fix out this coming week. Your client should start to behave correctly as soon as that happens.
BTW, I'm not sure if you knew this already or not but there's a whole standalone python client that you can use to access the API from python as well. I thought that might be a bit more convenient for you than the client that's distributed as part of bq.py. You'll find a link to it on this page:
https://developers.google.com/bigquery/client-libraries

I can reproduce the behavior you're seeing with the bq command-line. That seems like a bug, I'll see what I can do to fix it.
One thing I did notice about the data you're querying was that selecting the id field only, and capping the number of rows around 100,000. This produces about ~1M worth of data so the server would likely not paginate the results. Selecting a larger amount of data will force the server to paginate since it will not be able to return all the results in a single response. If you did a select * for 100,000 rows of samples.wikipedia you'd be getting ~50M back which should be enough to start to see some pagination happen.
Are you seeing too few results come back from the python client as well or were you surprised that no page_token was returned for your samples.wikipedia query?

python postgres can I fetchall() 1 million rows?

I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?

The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki

Consider using server side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)

fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()

Here is the code to use for simple server side cursor with the speed of fetchmany management.
The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.

EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In an another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the case bottlekneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
# Comma Race List
race_list = ','.join(map(str, race_ids))
# PPs Query
query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
"FROM hrdb_raceindex raceindex "
"INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
"INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
"LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
"WHERE raceindex.race_id IN (" + race_list + ") "
"ORDER BY runlines.line_date DESC;")
print(query)
# Run Query
cursor.execute(query)
# Query Fields
fields = [i[0] for i in cursor.description]
# PPs List
pps = []
# Loop Results
for row in cursor:
a = 0
#this_pp = {}
#for i, value in enumerate(row):
# this_pp[fields[i]] = value
#pps.append(this_pp)
return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.

Tho you have only 501 rows it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and when faced with this large result set it has to out page out to disk to handle the result set.
If you reduce the number of columns called and/or constrain to, say LIMIT 10 on your query do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?

I know this is late, however, I have run into similar issues with mysql and python. My solution is to use queries using another language...I use R to make my queries which is blindly fast, do what I can in R and then send the data to python if need be for more general programming, although R has many general purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this side steps the heart of the problem.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.