We're writing a wrapper for bq.py and are having problems with result sets larger than 100k rows. This seems to have worked fine in the past (we had related problems with Google BigQuery Incomplete Query Replies on Odd Attempts). Perhaps I'm misunderstanding the limits explained on the doc page?
For instance:
#!/bin/bash
for i in `seq 99999 100002`;
do
bq query -q --nouse_cache --max_rows 99999999 "SELECT id FROM [publicdata:samples.wikipedia] LIMIT $i" > $i.txt
j=$(cat $i.txt | wc -l)
echo "Limit $i Returned $j Rows"
done
Yields (note there are 4 lines of formatting):
Limit 99999 Returned 100003 Rows
Limit 100000 Returned 100004 Rows
Limit 100001 Returned 100004 Rows
Limit 100002 Returned 100004 Rows
In our wrapper, we directly access the API:
while row_count < total_rows:
    data = client.apiclient.tabledata().list(maxResults=total_rows - row_count,
                                             pageToken=page_token,
                                             **table_dict).execute()

    # If there are more results than will fit on a page,
    # you will receive a token for the next page
    page_token = data.get('pageToken', None)

    # How many rows are there across all pages?
    total_rows = min(total_rows, int(data['totalRows']))
    raw_page = data.get('rows', [])
We would expect to get a token in this case, but none is returned.
Sorry it took me a little while to get back to you.
I was able to identify a server-side bug; you would end up seeing this with the Java client as well as the Python client. We're planning to push a fix out this coming week. Your client should start to behave correctly as soon as that happens.
BTW, I'm not sure if you knew this already, but there's a whole standalone Python client that you can use to access the API from Python as well. That might be a bit more convenient for you than the client that's distributed as part of bq.py. You'll find a link to it on this page:
https://developers.google.com/bigquery/client-libraries
I can reproduce the behavior you're seeing with the bq command-line tool. That seems like a bug; I'll see what I can do to fix it.
One thing I did notice about the data you're querying: you're selecting only the id field and capping the number of rows at around 100,000, which produces only about 1 MB of data, so the server likely won't paginate the results. Selecting a larger amount of data will force the server to paginate, since it won't be able to return all the results in a single response. If you did a SELECT * for 100,000 rows of samples.wikipedia, you'd get roughly 50 MB back, which should be enough to start seeing some pagination happen.
Are you seeing too few results come back from the Python client as well, or were you just surprised that no page_token was returned for your samples.wikipedia query?
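In the meantime, for reference, the loop shape I'd expect to work once the fix is out looks roughly like this (a sketch reusing the names from your snippet, not tested against your wrapper): keep calling tabledata().list() and follow pageToken until the server stops returning one.

rows = []
page_token = None
while True:
    data = client.apiclient.tabledata().list(maxResults=100000,  # page size; the server may return fewer per page
                                             pageToken=page_token,
                                             **table_dict).execute()
    rows.extend(data.get('rows', []))
    # The absence of a pageToken means this was the last page.
    page_token = data.get('pageToken')
    if not page_token:
        break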
Related
I have a collection students and I want to read this collection as a list in Python, but unfortunately I get the following error: CursorNextError: [HTTP 404][ERR 1600] cursor not found. Is there a way to read a 'huge' collection without an error?
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
print(db.collections())
students = db.collection('students')
#students.all()
students = db.collection('students').all()
list(students)
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
students = list(db.collection('students'))
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
As suggested in my comment, if raising the TTL is not an option (which I wouldn't do either), I would get the data in chunks instead of all at once. In most cases you don't need the whole collection anyway, so maybe think about limiting that first. Do you really need all documents and all their fields?
That being said, I have no experience with arango, but this is what I would do:
entries = db.collection('students').count()  # total number of documents in the collection
limit = 100    # block size you want to request
yourlist = []  # final output

for x in range(int(entries / limit) + 1):
    block = db.collection('students').all(skip=x * limit, limit=limit)
    yourlist.extend(block)  # the cursor is iterable, so extend() consumes the block
Something like this (based on the documentation here: https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/).
Limit your request to a reasonable amount and then skip that amount with your next request. You'll have to check whether the range() expression works exactly like that; you might have to find a better way of defining the number of iterations you need.
This also assumes arango returns all() results in a stable order by default.
So what is the idea?
Determine the number of entries in the collection.
Based on that, determine how many requests you need (e.g. 1,000 entries -> 10 blocks, each containing 100 entries).
Make x requests, skipping the blocks you already have: first iteration, entries 1-100; second iteration, 101-200; third iteration, 201-300; and so on.
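If it turns out the ordering of all() isn't guaranteed, a variant of the same idea with an explicit sort via AQL should be deterministic (a sketch assuming python-arango's db.aql.execute with bind variables; again, I have no arango experience):

for x in range(int(entries / limit) + 1):
    cursor = db.aql.execute(
        'FOR s IN students SORT s._key LIMIT @offset, @count RETURN s',
        bind_vars={'offset': x * limit, 'count': limit},
    )
    yourlist.extend(list(cursor))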
By default, AQL queries generate the complete result, which is then held in memory and provided batch by batch. So the cursor is simply fetching the next batch of the already-calculated result. In most cases this is fine, but if your query produces a huge result set, this can take a long time and require a lot of memory.
As an alternative you can create a streaming cursor. See https://www.arangodb.com/docs/stable/http/aql-query-cursor-accessing-cursors.html and check the stream option.
Streaming cursors calculate the next batch on demand and are therefore better suited to iterate a large collection.
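With python-arango that would look roughly like this (a minimal sketch; stream and batch_size are passed through db.aql.execute, and process() stands in for your own handling):

cursor = db.aql.execute(
    'FOR s IN students RETURN s',
    stream=True,      # the server computes each batch on demand
    batch_size=1000,  # documents fetched per round trip
)
for student in cursor:
    process(student)  # placeholder for whatever you do per document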
So I have been trying to get a process set up to pull data from my Snowflake database in Python using the Snowflake Python Connector. I have written a method for requesting data (shown below):
import snowflake.connector

def request_data(s, query):
    snowflake_connection = snowflake.connector.connect(user='user',
                                                       password='password',
                                                       account='account',
                                                       warehouse='warehouse',
                                                       database='database',
                                                       schema=s)
    try:
        with snowflake_connection.cursor() as cursor:
            cursor.execute(query)
            data = cursor.fetch_pandas_all()
    finally:
        snowflake_connection.close()
    return data
I have been able to request data with this method and work with it in Python, as in the example below:
query = "select * from books.sales where date between '2020-08-01' and '2020-11-30'"
sales = request_data('BOOKS', query)
However, when I try to request a larger amount of data (such as changing the date range to 2020-08-01 through 2021-07-31), I get an error:
250003: Failed to get the response. Hanging? method: get, url: <snowflake url>
I have looked through the documentation, and one thing I tested was printing the cursor's rowcount attribute (print(cursor.rowcount)), which I added to the request_data method between cursor.execute(query) and data = cursor.fetch_pandas_all(). When I did this, I saw that rowcount matched the number of rows I got when I tested the query in a worksheet on snowflakecomputing.com.
So I imagine it has to do with the amount of data: the query with the date range 2020-08-01 to 2021-07-31 returns about 39,000 rows. I have looked through the documentation for a limit on the amount of data in a request with the Snowflake Python Connector, and the only number I ever saw was 10,000.
When I reduced my date range so that the number of rows would be below that, I still got the same error, so I am not sure what is wrong. I could break the one query into multiple queries, but I am trying to keep my requests to a minimum. If someone knows how to solve this issue, I would greatly appreciate it.
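For what it's worth, if I do end up splitting things, I assume I could process the result in chunks with the connector's fetch_pandas_batches() instead of fetch_pandas_all(), something like this (untested; handle_batch is a placeholder):

with snowflake_connection.cursor() as cursor:
    cursor.execute(query)
    for batch in cursor.fetch_pandas_batches():  # yields DataFrames chunk by chunk
        handle_batch(batch)  # placeholder for my processing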
This is a classic symptom of a packet inspector decrypting packets on your network. Snowflake sends small result sets one way and larger result sets another way. The larger way sometimes confuses packet inspectors because it's snowflakecomputing.com traffic that appears to be coming directly from the underlying host (AWS, Azure, or GCP).
Start by running this query:
select t.value:type::varchar as type,
t.value:host::varchar as host,
t.value:port as port
from table(flatten(input => parse_json(system$whitelist()))) as t;
After you run that, you'll see that the host for the STAGE row is not snowflakecomputing.com but something at aws.com, azure.com, or cloud.google.com. That's where the larger result sets come from, and that's what something on your network is inspecting.
Show the results to your network security team. Explain to them that there can be no packet inspection (decryption) on any of these endpoints. It's possible to use IP ranges instead, but the IP ranges are vast, so the URLs are the way to go.
If they claim to have made the changes and things still aren't working (that happens sometimes), you can run SnowCD to test the required list: https://docs.snowflake.com/en/user-guide/snowcd.html. Usually it turns out they missed one.
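A quick way to hand SnowCD its input is to dump the allowlist from Python (a sketch using your existing connection; SnowCD takes the JSON file as its argument):

with snowflake_connection.cursor() as cursor:
    cursor.execute("SELECT SYSTEM$WHITELIST()")
    allowlist_json = cursor.fetchone()[0]  # JSON array of required endpoints

with open("allowlist.json", "w") as f:
    f.write(allowlist_json)
# then run: snowcd allowlist.json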
If you're using PrivateLink, refer to the documentation on URLs and ports here: https://docs.snowflake.com/en/sql-reference/functions/system_whitelist_privatelink.html
Assume query is some already-defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query.limit(n)).fetchall() return the same result set. I'm wondering if one of them is more idiomatic or, more importantly, more performant?
Example usage would be:
query = select([
    census.columns.state,
    (census.columns.pop2008 - census.columns.pop2000).label("pop_change"),
]).group_by(census.columns.state).order_by(desc("pop_change"))

results_1 = connection.execute(query.limit(5)).fetchall()
results_2 = connection.execute(query).fetchmany(5)  # results_2 == results_1
limit becomes part of the SQL query sent to the database server.
With fetchmany, the query is executed without any limit, but the client (the Python code) requests only a certain number of rows.
Therefore, using limit should be faster in most cases.
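To make the difference concrete (same names as in the question, assuming n is defined):

# limit(): the cap is compiled into the SQL, so the server
# computes and sends only n rows.
small = connection.execute(query.limit(n)).fetchall()

# fetchmany(): the full query runs on the server; the client
# simply stops pulling rows after n.
first_n = connection.execute(query).fetchmany(n)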
I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
    SomeLargeTable.__table__.select()
)

while chunk := result.fetchmany(10000):  # only get 10K rows at a time
    for row in chunk:
        ...  # process each row before moving on to the next chunk
I'm new to both MongoDB and PyMongo, and am having some performance issues regarding cursors.
TL;DR: Any operation I try to perform using a cursor takes about a second.
Long version
I have a small database, which I bulk-loaded. Each entry has 3 fields:
dom: domain name (unique)
date: date, YYYYMMDD
flag: string
I've loaded about 1.9 million entries, without incident, and quite quickly.
I created a hash index on the dom field.
Now, I want to grab certain records by the domain field, and update them, using a Python program.
That's where the problem lies.
I'm using the latest MongoDB, and the latest pyMongo.
Here's a stripped-down program...
import pymongo
from pymongo import MongoClient

client = MongoClient()  # assuming a default local connection; this line was implied
db = client.myindexname
posts = db.posts

print list(db.profiles.index_information())  # shows the hash index is present

for k in newdomainlist.keys():  # iterate the list of domains to check
    ret = posts.find({"dom": k})  # this runs fine, and quickly
    # 'ret' is a cursor
    print ret  # this runs quickly

    # Here's the problem
    print ret.count()  # this takes about a second. why?
If I just print ret, the speed is fine. However, if I try to reference anything in the cursor, the speed drops through the floor; I can do about 1 operation per second.
In this case, I'm just trying to see whether ret.count() returns '0' (we don't have this domain) or '1' (we have it already).
I've tried adding a batch_size(10000) to the find, without it helping.
I DO have the Python C extensions loaded.
What the heck am I doing wrong?
thanks
It turned out that I'd created my hashed index on the wrong collection, 'collection', rather than 'posts'. Chalk it up to MongoDB inexperience. We can close this one now, or delete it entirely.
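For anyone who finds this later, the fix was just creating the hashed index on the collection the finds actually run against (a sketch with pymongo; find_one avoids the separate count() round trip when you only need an existence check):

posts.create_index([("dom", pymongo.HASHED)])

# With the index in place, an existence check doesn't need count():
exists = posts.find_one({"dom": k}, {"_id": 1}) is not None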
I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed, etc. I'm porting this application to Python from PHP/Codeigniter, where it ran fine (I foolishly thought getting out of PHP would help speed it up).
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn = MySQLdb.connect(host="127.0.0.1", user="*", passwd="*", db="*", cursorclass=MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place, and the entire process did not speed up at all.

db_results = "test"

# Loop Results
for row in cursor:
    a = 0  # this was the dummy code I put in to test

return db_results
The query result itself is only 501 rows (with a large number of columns)... it took 0.029 seconds outside of Python, but significantly longer than that within Python.
The project is related to horse racing. The query is done within the function below. The query itself is long; however, it runs well outside of Python. I commented out the code within the loop on purpose for testing, and added the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)

    return pps
One final note... I haven't settled on the ideal way to handle the result. I believe one of the cursor classes allows the result to come back as a set of dictionaries. I haven't even made it to that point yet, as the query and return themselves are so slow.
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1 KB, that would be close to 27 MB of data returned.
To get a sense of how much data MySQL is pushing, you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
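From Python, that check might look like this (a sketch against the cursor from your update; Bytes_sent is cumulative for the session, so read it right after the query):

cursor.execute(query)
rows = cursor.fetchall()

# How much has MySQL sent over this session so far?
cursor.execute('SHOW SESSION STATUS LIKE "Bytes_sent"')
print(cursor.fetchone())  # -> ('Bytes_sent', <total bytes as a string>)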
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using phpMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, so I'm not sure how it returns results).
Perhaps the Python process is memory-constrained, and when faced with this large result set it has to page out to disk to handle it.
If you reduce the number of columns called and/or constrain the query to, say, LIMIT 10, do you get improved speed? (There's a quick sketch of that test after these questions.)
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process, and how that allocation and usage compare to the same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
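For the LIMIT test mentioned above, a quick sketch against the query from your update (hypothetical; it just caps the existing statement for an A/B timing comparison):

# Strip the trailing semicolon and cap the result set.
test_query = query.rstrip(';') + ' LIMIT 10;'
cursor.execute(test_query)
rows = cursor.fetchall()
print(len(rows))  # at most 10; compare the wall time against the full query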
I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to make the queries from another language: I use R, which is blindingly fast for this, do what I can in R, and then send the data to Python if need be for more general programming (although R has many general-purpose libraries as well). Just wanted to post something that may help someone with a similar problem, and I know this sidesteps the heart of the problem.