I have a collection student and I want this collection as list in Python, but unfortunately I got the following error CursorNextError: [HTTP 404][ERR 1600] cursor not found. Is there an option to read a 'huge' collection without an error?
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
print(db.collections())
students = db.collection('students')
#students.all()
students = db.collection('handlingUnits').all()
list(students)
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
students = list(db.collection('students'))
[OUT] CursorNextError: [HTTP 404][ERR 1600] cursor not found
as suggested in my comment, if raising the ttl is not an option (what I wouldn't do either) I would get the data in chunks instead of all at once. In most cases you don't need the whole collection anyway, so maybe think of limiting that first. Do you really need all documents and all their fields?
That beeing said I have no experience with arango, but this is what I would do:
entries = db.collection('students').count() # get total amount of documents in collection
limit=100 # blocksize you want to request
yourlist = [] # final output
for x in range(int(entries/limit) + 1):
block = db.collection('students').all(skip=x*limit, limit=100)
yourlist.extend(block) # assuming block is of type list. Not sure what arango returns
something like this. (Based on the documentation here: https://python-driver-for-arangodb.readthedocs.io/_/downloads/en/dev/pdf/)
Limit your request to a reasonable amount and then skip this amount with your next request. You have to check if this "range()" thing works like that you might have to think of a better way of defining the number of iterations you need.
This also assumes arango sorts the all() function per default.
So what is the idea?
determin the number of entries in the collection.
based on that determin how many requests you need (f.e. size=1000 -> 10 blocks each containing 100 entries)
make x requests where you skip the blocks you already have. First iteration entries 1-100; second iteration 101-200, third iteration 201-300 etc.
By default, AQL queries generate the complete result, which is then held in memory, and provided batch by batch. So the cursor is simply fetching the next batch of the already calculated result. In most of the cases this is fine, but if your query produces a huge result set, then this can take a long time and will require a lot of memory.
As an alternative you can create a streaming cursor. See https://www.arangodb.com/docs/stable/http/aql-query-cursor-accessing-cursors.html and check the stream option.
Streaming cursors calculate the next batch on demand and are therefore better suited to iterate a large collection.
Related
Assuming query is some already defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query).limit(n).fetchall() apparently return the same result set. I'm wondering if one of them is more idiomatic or — more importantly — more performant?
Example usage would be:
query = select([census.columns.state, (census.columns.pop2008 - census.columns.pop2000).label("pop_change")]).group_by(census.columns.state).order_by(desc("pop_change"))
results_1 = query.limit(5).fetchall()
results_2 = connection.execute(query).fetchmany(n) #`results_2` = `results_1`
limit will be a part of the sql query sent to the database server.
With fetchmany the query is executed without any limit, but the client (python code) requests only certain number of rows.
Therefore using limit should be faster in most cases.
I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
SomeLargeTable.__table__.select()
)
while chunk:= result.fetchmany(10000) ## only get 10K rows at a time
for row in chunk:
## process each row before moving onto the next chunk
I am having trouble with the parameter of an SNMP query in a python script. An SNMP query takes an OID as a parameter. The OID I use here is written in the code below and, if used alone in a query, should return a list of states for the interfaces of the IP addresses I am querying onto.
What I want is to use that OID with a variable appended to it in order to get a very precise information (if I use the OID alone I will only get a list of thing that would only complexify my problem).
The query goes like this:
oid = "1.3.6.1.4.1.2011.5.25.119.1.1.3.1.2."
variable = "84.79.84.79"
query = session.get(oid + variable)
Here, this query will return a corrupted SNMPObject, as in the process of configuration of the device I am querying on, another number is added, for some reason we do not really care about here, between these two elements of the parameter.
Below is a screenshot showing some examples of an SNMP request that only takes as a parameter the OID above, without the variable appended, on which you may see that my variable varies, and so does the highlighted additional number:
Basically what I am looking for here is the response, but unfortunately I cannot predict for each IP address I am querying what will that "random" number be.
I could use a loop that tries 20 or 50 queries and only saves the response of the only one that would have worked, but it's ugly. What would be better is some built-in function or library that would just say to the query:
"SNMP query on that OID, with any integer appended to it, and with my variable appended to that".
I definitely don't want to generate a random int, as it is already generated in the configuration of the device I am querying, I just want to avoid looping just to get a proper response to a precise query.
I hope that was clear enough.
Something like this should work:
from random import randint
variable = "84.79.84.79"
numbers = "1.3.6.1.4.1.2011.5.25.119.1.1.3.1.2"
query = session.get('.'.join([numbers, str(randint(1,100)), variable])
What is the proper way of moving a number of records from one collection to another. I have come across several other SO posts such as this which deal with achieve the same goal but none have a python implementation.
#Taking a number of records from one database and returning the cursor
cursor_excess_new = db.test_collection_new.find().sort([("_id", 1)]).limit(excess_num)
# db.test.insert_many(doc for doc in cursor_excess_new).inserted_ids
# Iterating cursor and Trying to write to another database
# for doc in cursor_excess_new:
# db.test_collection_old.insert_one(doc)
result = db.test_collection_old.bulk_write([
for doc in cursor_excess_new:
InsertMany(doc for each doc in cur)
pprint(doc)
])
If I use insert_many, I get the following error: pymongo.errors.OperationFailure: Writes to config servers must have batch size of 1, found 10
bulk_write is giving me a syntax error at the start of for loop.
What is the best practice and correct way of transferring records from one collection to another in pymongo so that it is atomic?
Collection.bulk_write accepts as argument an iterable of query operations.
pymongo has pymongo.operations.InsertOne operation not InsertMany.
For your situation, you can build a list of InsertOne operations for each document in the source collection. Then do a bulk_write on the destination using the built-up list of operations.
from pymongo import InsertOne
...
cursor_excess_new = (
db.test_collection_new
.find()
.sort([("_id", 1)])
.limit(excess_num)
)
queries = [InsertOne(doc) for doc in cursor_excess_new]
db.test_collection_old.bulk_write(queries)
You don't need "for loop".
myList=list(collection1.find({}))
collection2.insert_many(myList)
collection1.delete_many({})
If you need to filter it you can use the following code:
myList=list(collection1.find({'status':10}))
collection2.insert_many(myList)
collection1.delete_many({'status':10})
But be careful, because it has no warranty to move successfully so you need to control transactions. If you're going to use the following code you should consider that your MongoDb shouldn't be Standalone and you need to active Replication and have another instance.
with myClient.start_session() as mySession:
with mySession.start_transaction():
...yourcode...
Finally, the above code has the warranty to move (insert and delete) successfully but the transaction isn't in your hand and you can't get the result of this transaction so you can use the following code to control both moving and transaction:
with myClient.start_session() as mySession:
mySession.start_transaction()
try:
...yourcode...
mySession.commit_transaction()
print("Done")
except Exception as e:
mySession.abort_transaction()
print("Failed",e)
I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki
Consider using server side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)
fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()
Here is the code to use for simple server side cursor with the speed of fetchmany management.
The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.
EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)
As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.
With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. The whole records are buffered in the client's memory. It is when you execute a query like curs.execute('SELECT * FROM ...'.
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how fetchone can help us memory wise ?
A: It fetches only one record from the stored data and creates a single Python object and hands you in your Python code while fetchall will fetch and create n Python objects from this data and hands it to you all in one chunk.
So If your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000 result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set + fetchall --> 1,000,000 Python objects
Of-course fetchone helped but still we have the whole records in memory. This is where Server-side cursors comes into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The draw-back :
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.
I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In an another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the case bottlekneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
# Comma Race List
race_list = ','.join(map(str, race_ids))
# PPs Query
query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
"FROM hrdb_raceindex raceindex "
"INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
"INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
"LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
"WHERE raceindex.race_id IN (" + race_list + ") "
"ORDER BY runlines.line_date DESC;")
print(query)
# Run Query
cursor.execute(query)
# Query Fields
fields = [i[0] for i in cursor.description]
# PPs List
pps = []
# Loop Results
for row in cursor:
a = 0
#this_pp = {}
#for i, value in enumerate(row):
# this_pp[fields[i]] = value
#pps.append(this_pp)
return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.
Tho you have only 501 rows it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and when faced with this large result set it has to out page out to disk to handle the result set.
If you reduce the number of columns called and/or constrain to, say LIMIT 10 on your query do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late, however, I have run into similar issues with mysql and python. My solution is to use queries using another language...I use R to make my queries which is blindly fast, do what I can in R and then send the data to python if need be for more general programming, although R has many general purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this side steps the heart of the problem.