I am working on a Trac plugin...
To retrieve my data I create a cursor object and get the result table like this:
db = self.env.get_db_cnx()
cursor = db.cursor()
cursor.execute("SELECT...")
Now the result is used in three different functions. My problem is that the cursor is exhausted after the first loop through it (as described here: http://packages.python.org/psycopg2/cursor.html).
I then tried to copy the cursor object, but this failed too: the copy(cursor) function seems to have problems with a big dataset, and deepcopy(cursor) fails anyway (according to this bug: http://bugs.python.org/issue1515).
How can I solve this issue?
Storing the values from any finite iterable is simple:
results = list(cursor)
Iterate over the iterable and store the results in a list. This list can be iterated over as many times as necessary.
You don't need a copy of the cursor, just a copy of the results of the query.
For this specific case, you should do what 9000 suggests in his comment -- use the cursor's built-in functionality to get the results as a list, which should be as fast as or faster than manually calling list.
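A minimal sketch of that approach, using the standard-library sqlite3 module purely for illustration (psycopg2 cursors expose the same fetchall() interface; the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE ticket (id INTEGER)')
cursor.executemany('INSERT INTO ticket VALUES (?)', [(1,), (2,), (3,)])

cursor.execute('SELECT id FROM ticket')
results = cursor.fetchall()  # materialise the result set once

# The list can now be passed to all three functions and
# iterated over as many times as needed.
print([row[0] for row in results])
print(sum(row[0] for row in results))
```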
If you want to avoid looping through the data an extra time you could try wrapping it in a generator:
def lazy_execute(sql, cursor=cursor):
    results = []
    cursor.execute(sql)
    def fetch():
        if results:
            # Rows were already fetched; replay the cached copies.
            for r in results:
                yield r
        else:
            # First call: read from the cursor and cache each row.
            for r in cursor:
                results.append(r)
                yield r
    return fetch
This essentially creates a list as you need it, but lets you call the same function everywhere, safely. You would then use this like so:
results = lazy_execute(my_sql)
for r in results():
    "do something with r"
This is almost certainly an over-engineered premature optimization, though it does have the advantage that the same name means the same thing in every case, as opposed to generating a new list and then having the same data under two different names.
If I were going to argue for using this, I would use that same-name argument, unless the data set was pretty huge; but if it's huge enough to matter, there's a good chance you don't want to store it all in memory anyway.
Also it's completely untested.
Related
I currently have a for loop which is finding and storing combinations in a list. The possible combinations are very large and I need to be able to access the combos.
Can I use an empty relational DB like SQLite to store my list on disk instead of using list = []?
Essentially, what I am asking is whether there is a DB equivalent to list = [] that I can use to store the combinations generated via my script.
Edit:
SQLite is not a must. Any DB will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL, and then empty the list in Python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was: why do you explode the combinations objects into lists? But of course, since we have two nested for loops, the iterator in the inner loop is consumed on the first iteration of the outer loop if it is not converted to a list.
However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both sets have 50 elements, the combinations of 2 elements number 1225, with a memory size (if the items are integers) of about 120 bytes each, i.e. 147 KB, while the combinations of 8 elements number about 5.36e+08, at about 336 bytes each, i.e. 180 GB. So the first thing is: keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be noticeably faster.
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the Primary Key, so your queries will be efficient; otherwise, you will save space on disk by using an auto-incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, and so will a few other DBMSs).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to go from 85 to over 96,000 inserts per second.
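A minimal, self-contained sketch of that insert phase (the table name and the sizes of the sets are made up for the demo): it keeps the larger combinations object lazy, and writes rows in batches inside a single transaction, which is dramatically faster than committing per row:

```python
import sqlite3
from itertools import combinations, islice

conn = sqlite3.connect(':memory:')  # use a file path for on-disk storage
conn.execute('CREATE TABLE combos (combo TEXT)')

set1 = range(4)
set2 = range(6)
small = list(combinations(set1, 2))  # explode only the smaller side
large = combinations(set2, 3)        # keep the larger side lazy

def rows():
    # Outer loop over the lazy (large) side, inner loop over the small list.
    for j in large:
        for i in small:
            yield (repr(i + j),)

batch_size = 1000
it = rows()
with conn:  # one transaction for all batches
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany('INSERT INTO combos VALUES (?)', batch)

count = conn.execute('SELECT COUNT(*) FROM combos').fetchone()[0]
print(count)  # C(4,2) * C(6,3) = 6 * 20 = 120
```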
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
chunk_count = 0
mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> '
                 f'ORDER BY <primarykey> LIMIT {chunk_size}')
chunk = mycursor.fetchall()
while chunk:
    for combo in chunk:
        print(combo)  # or do what you need
    chunk_count += 1
    mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> '
                     f'ORDER BY <primarykey> '
                     f'LIMIT {chunk_size} OFFSET {chunk_size * chunk_count}')
    chunk = mycursor.fetchall()
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
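As a runnable illustration of that chunked-read pattern (in-memory sqlite3, made-up table name, and no WHERE clause, just to keep the sketch short):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE combos (id INTEGER PRIMARY KEY, combo TEXT)')
conn.executemany('INSERT INTO combos (combo) VALUES (?)',
                 [(f'combo-{i}',) for i in range(10)])

chunk_size = 4
offset = 0
seen = []
while True:
    rows = conn.execute(
        'SELECT combo FROM combos ORDER BY id LIMIT ? OFFSET ?',
        (chunk_size, offset)).fetchall()
    if not rows:
        break
    for (combo,) in rows:
        seen.append(combo)  # process one row at a time
    offset += chunk_size

print(len(seen))
```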
I don't believe SQLite has a built in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation from another user on this site for storing an array can be found here: How to store array in one column in Sqlite3?
Another solution can be found here: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?
What is the most efficient way to loop through the cursor object in Pymongo?
Currently, this is what I'm doing:
list(my_db.my_collection.find())
Which converts the cursor to list object so that I can iterate over each element. This works fine if the find() query returns a small amount of data. However, when I scale the DB to return 10 million documents, the cursor conversion to the list is taking forever. Instead of converting the DB result(cursor) to list, I tried converting the cursor to dataframe as below:
pd.DataFrame(my_db.my_collection.find())
which didn't give me any performance improvement.
What is the most efficient way to loop through a cursor object in python?
I haven't used pymongo to date, but one thing I can say for certain: if you're fetching a huge amount of data by doing
list(my_db.my_collection.find())
then you should use a generator, because building a list here increases memory usage significantly and may raise a MemoryError if the result grows beyond the available memory.
def get_data():
    for doc in my_db.my_collection.find():
        yield doc
Try using such methods, which will not use much memory.
The cursor object pymongo gives you is already lazily loading objects, no need to do anything else.
for doc in my_db.my_collection.find():
    pass  # process doc
The find() method returns a Cursor, which you can iterate over:
for match in my_db.my_collection.find():
    # do something
    pass
I use SELECT COUNT(*) FROM db WHERE <expression> to check whether a set of records is empty. So:
>>> cnt = c.fetchone()
>>> print cnt
(0L,)
My question is: how do you test for this condition?
I have a number of other ways to accomplish this. Is something like the following possible?
if cnt == (0L,):
    # do something
fetchone returns a row, which is a sequence of columns.
If you want to get the first value in a sequence, you use [0].
You could instead compare the row to (0,), as you're suggesting. But as far as I know neither the general DB-API nor the specific MySQLdb library guarantee what kind of sequence a row is; it could be a list, or a custom sequence class. So, relying on the fact that it's a tuple is probably not a good idea. And, since it's just as easy to not do so, why not be safe and portable?
So:
count_row = c.fetchone()
count = count_row[0]
if count == 0:
    do_something()
Or, putting it together in one line:
if c.fetchone()[0] == 0:
    do_something()
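A runnable version of that pattern (using the standard-library sqlite3 module purely for illustration; the table is made up, but MySQLdb rows support the same [0] indexing):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE records (val INTEGER)')
c.execute('SELECT COUNT(*) FROM records WHERE val > 10')

count_row = c.fetchone()  # one row, whatever sequence type the driver uses
count = count_row[0]      # first column; works for tuples, lists, custom rows
print(count)

if count == 0:
    print('empty result set')
```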
Thank you. Your first sequence works; I don't know how I did not try that one, but I did not. The second construction gets an error: ...object has no attribute '__getitem__'. I would guess my version of MySQLdb (1.2.3_4, Python 2.7) does not support it.
What I did in the interim was to construct the zero tuple by executing a count(*) constructed to return zero records. This seems to work fine
It's often easier to use the .rowcount attribute of the cursor object to check whether there are any rows in your result set. This attribute is specified in the Python Database API:
This read-only attribute specifies the number of rows that the last
.execute*() produced (for DQL statements like SELECT) or
affected (for DML statements like UPDATE or INSERT). [9]
The attribute is -1 in case no .execute*() has been performed on
the cursor or the rowcount of the last operation is cannot be
determined by the interface. [7]
When .rowcount cannot be used
Note that per the above specs, Cursor.rowcount should be set to -1 when the number of rows produced or affected by the last statement "cannot be determined by the interface." This happens when using the SSCursor and SSDictCursor cursor classes.
The reason is that the MySQL C API has two different functions for retrieving result sets: mysql_store_result() and mysql_use_result(). The difference is that mysql_use_result() reads rows from the result set as you ask for them, rather than storing the entire result set as soon as the query is executed. For very large result sets, this "unbuffered" approach can be faster and uses much less memory on the client machine; however, it makes it impossible for the interface to determine how many rows the result set contains at the time the query is executed.
Both SSCursor and SSDictCursor call mysql_use_result(), so their .rowcount attribute should hold the value -1 regardless of the size of the result set. In contrast, DictCursor and the default Cursor class call mysql_store_result(), which reads and counts the entire result set immediately after executing the query.
To make matters worse, the .rowcount attribute only ever holds the value -1 when the cursor is first opened; once you execute a query, it receives the return value of mysql_affected_rows(). The problem is that mysql_affected_rows() returns an unsigned long long integer, which represents the value -1 in a way that can be very counterintuitive and wouldn't be caught by a condition like cursor.rowcount == -1.
Counting for counting's sake
If the only thing you're doing is counting records, then .rowcount isn't that useful because your COUNT(*) query is going to return a row whether the records exist or not. In that case, test for the zero value in the same way that you would test for any value when fetching results from a query. Whether you can do c.fetchone()[0] == 0 depends on the cursor class you're using; it would work for a Cursor or SSCursor but fail for a DictCursor or SSDictCursor, which fetch dictionaries instead of tuples.
The important thing is just to be clear in your code about what's happening, which is why I would recommend against using c.fetchone() == (0,). That tests an entire row when all you need to do is test a single value; get the value out of the row before you test it, and your code will be more clear. Personally, I find c.fetchone()[0] to be needlessly opaque; I would prefer:
row = cursor.fetchone()
if row[0] == 0:
    do_something()
This makes it abundantly clear, without being too verbose, that you're testing the first item of the row. When I'm doing anything more complicated than a simple COUNT() or EXISTS(), I prefer to use DictCursor so that my code relies on (at most) explicit aliases and never on implicit column ordering.
Testing for an empty result set
On the other hand, if you actually need to fetch a result set and the counting is purely incidental, as long as you're not using one of the unbuffered cursor classes you can just execute the important query and not worry about the COUNT():
cursor.execute(r"SELECT id, name, email FROM user WHERE date_verified IS NULL;")
if cursor.rowcount == 0:
    print 'No results'
I am trying to get the results to a SQLAlchemy query. I know that if I loop over the query I can put the results in a list (like below), but this seems inefficient for a large set of results and looks ugly when the result will be a single number (as below). Is there a more direct and/or efficient way to return query results?
mylist = []
for item in session.query(func.max(mytable.id)):
    mylist.append(item)
Looping through the result, as you do, is correct. You can also use all() to get the list of sequences (rows). Maybe more efficient is to not store the data in a list, get smaller result sets, and/or do the operation immediately on each row. You could also use a server side cursor if your DBMS supports it.
When only one row with one field is fetched, you can use first() and get the first element of the returned sequence. Code wise, this is probably most efficient:
maxid_mytable = session.query(func.max(mytable.id)).first()[0]
This is equivalent to:
mylist = session.query(func.max(mytable.id)).all()
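When the query produces a single value, SQLAlchemy's Query.scalar() returns it directly (the first column of the first row, or None if there are no rows). A minimal, self-contained sketch, where the MyTable model and the in-memory SQLite engine are made up for the demo (written against the 1.4-style API):

```python
from sqlalchemy import Column, Integer, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class MyTable(Base):
    __tablename__ = 'mytable'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')  # in-memory DB for the demo
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([MyTable(id=i) for i in (1, 5, 3)])
    session.commit()
    # scalar() avoids both the list and the [0] indexing.
    max_id = session.query(func.max(MyTable.id)).scalar()
    print(max_id)
```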
I'm having an issue when trying to pass a sqlite query to another function.
The issue is that the sqlite query may contain a list, and therefore I cannot use *args, as it unpacks the tuple but then ignores the list. An example query I'm attempting to pass to the function:
'SELECT postname FROM history WHERE postname = ? COLLATE NOCASE', [u'Test']
So in this case I could use args as opposed to *args in the destination function. However, I may have a sqlite query that doesn't contain a list, and therefore I can't always do this, e.g.
'SELECT * FROM history'
so I guess my question in a nutshell is how can I successfully pass a sqlite query to another function whether it contains a list or not, using args?
Can you just try,except it?
try:
    func(*args)
except TypeError:
    func(args)
Of course, this will catch TypeErrors inside your function as well. As such, you may want to create another function which actually deals with the unpacking and makes sure to give you an unpackable object in return. This also doesn't work for strings since they'll unpack too (see comments).
Here's a function which will make sure an object can be unpacked.
def unpackable(obj):
    # Strings are iterable too, but should stay a single value
    # rather than unpacking into individual characters.
    if hasattr(obj, '__iter__') and not isinstance(obj, str):
        return obj
    else:
        return (obj,)

func(*unpackable(args))
I would argue the best answer here is to try and ensure you are always putting in an iterable, rather than trying to handle the odd case of having a single item.
Where you have ('SELECT postname FROM history WHERE postname = ? COLLATE NOCASE', [u'Test']) in one place, it makes more sense to pass in a tuple of length one - ('SELECT * FROM history', ) as opposed to the string.
You haven't said where the strings are coming from, so it's possible you simply can't change the way the data is, but if you can, the tuple is the much better option to remove the edge case from your code.
If you truly can't do that, then what you want is to unpack any non-string iterable, checking for that can be done as shown in this question.
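One way to sketch the always-an-iterable approach: normalise whatever comes in to a (sql, params) pair before executing, so the caller can pass either a bare string or a string-plus-parameters tuple. The run_query helper name is made up for this illustration:

```python
import sqlite3

def run_query(cursor, query):
    # Normalise: query is either a bare SQL string
    # or a (sql, params) tuple.
    if isinstance(query, str):
        sql, params = query, []
    else:
        sql, params = query
    return cursor.execute(sql, params).fetchall()

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE history (postname TEXT)')
cur.execute("INSERT INTO history VALUES ('Test')")

# Both call styles from the question now work:
print(run_query(cur, 'SELECT * FROM history'))
print(run_query(cur, ('SELECT postname FROM history '
                      'WHERE postname = ? COLLATE NOCASE', [u'Test'])))
```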