Return Results of SQLAlchemy Search Query - python

I am trying to get the results of a SQLAlchemy query. I know that if I loop over the query I can put the results in a list (like below), but this seems inefficient for a large result set and looks ugly when the result is a single number (as below). Is there a more direct and/or efficient way to return query results?
mylist = []
for item in session.query(func.max(mytable.id)):
    mylist.append(item)

Looping through the result, as you do, is correct. You can also use all() to get the list of sequences (rows). It may be more efficient not to store the data in a list at all: fetch smaller result sets and/or operate on each row immediately. You could also use a server-side cursor if your DBMS supports it.
When only one row with one field is fetched, you can use first() and take the first element of the returned sequence. Code-wise, this is probably the most efficient:
maxid_mytable = session.query(func.max(mytable.id)).first()[0]
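For what it's worth, SQLAlchemy's Query also has a scalar() method that does the same thing in one call (and returns None instead of raising when there is no row):

maxid_mytable = session.query(func.max(mytable.id)).scalar()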

The loop in the question is equivalent to:
mylist = session.query(func.max(mytable.id)).all()

Related

make function memory efficient or store data somewhere else to avoid memory error

I currently have a for loop which is finding and storing combinations in a list. The possible combinations are very large and I need to be able to access the combos.
Can I use an empty relational DB like SQLite to store my list on disk instead of using list = []?
Essentially what I am asking is whether there is a db equivalent to list = [] that I can use to store the combinations generated via my script?
Edit:
SQLite is not a must. Any DB will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in Python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was, why do you explode the combinations objects into lists? But of course, since we have two nested for loops, the iterator in the inner loop would be consumed on the first iteration of the outer loop if it were not converted to a list.
However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both sets have 50 elements, there are 1225 combinations of 2 elements, at roughly 120 bytes each (for integer items), i.e. about 147 KB, while there are about 5.37e+08 combinations of 8 elements, at roughly 336 bytes each, i.e. about 180 GB. So the first thing is: keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be considerably faster.
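As a minimal sketch of that idea (permute_lazy is a made-up name), the function materializes only the small side and yields results instead of building a full list:

from itertools import combinations

def permute_lazy(set1, set2):
    set1_combos = list(combinations(set1, 2))  # small side: safe to materialize
    for j in combinations(set2, 8):            # large side: stays a lazy iterator
        for i in set1_combos:
            yield i + j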
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
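As a minimal sketch (the table and file names are made up, and each combination is serialized with repr() for simplicity), the function could write straight to SQLite instead of appending to a list:

import sqlite3
from itertools import combinations

def permute_to_db(set1, set2, db_path='combos.db'):
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS combos (combo TEXT)')
    set1_combos = list(combinations(set1, 2))  # small side, materialized
    with conn:                                 # one transaction for all inserts
        for j in combinations(set2, 8):        # large side, lazy
            for i in set1_combos:
                conn.execute('INSERT INTO combos (combo) VALUES (?)',
                             (repr(i + j),))
    conn.close()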
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the Primary Key, so your queries will be efficient; otherwise, you will save space on disk by using an auto-incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, and a few other DBMSs will do the same).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to go from 85 to over 96K inserts per second.
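Building on the sketch above, the two optimizations that matter most are wrapping the inserts in a single transaction and batching them with executemany() (the batch size here is an arbitrary choice):

batch = []
BATCH_SIZE = 10_000  # arbitrary; tune to your RAM budget
with conn:
    for j in combinations(set2, 8):
        for i in set1_combos:
            batch.append((repr(i + j),))
            if len(batch) >= BATCH_SIZE:
                conn.executemany('INSERT INTO combos (combo) VALUES (?)', batch)
                batch.clear()
    if batch:  # flush the remainder
        conn.executemany('INSERT INTO combos (combo) VALUES (?)', batch)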
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows, you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall(), but you may still have a problem with the size of the query's result set.
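A fetchmany() version of the loop above would look something like this (the batch size is arbitrary):

mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
while True:
    rows = mycursor.fetchmany(1000)  # pull 1000 rows at a time
    if not rows:
        break
    for combo in rows:
        print(combo)  # or do what you need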
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
offset = 0
while True:
    mycursor.execute(
        f'SELECT * FROM <table> WHERE <conditions> '
        f'ORDER BY <primarykey> LIMIT {chunk_size} OFFSET {offset}')
    rows = mycursor.fetchall()
    if not rows:
        break
    for combo in rows:
        print(combo)  # or do what you need
    offset += chunk_size
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
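A common refinement here, assuming an integer primary key column named id, is keyset pagination: instead of OFFSET (which makes the DB skip all previous rows on every query), remember the last key you saw and filter on it:

last_id = 0
while True:
    mycursor.execute(
        f'SELECT * FROM <table> WHERE <conditions> AND id > {last_id} '
        f'ORDER BY id LIMIT {chunk_size}')
    rows = mycursor.fetchall()
    if not rows:
        break
    for combo in rows:
        print(combo)  # or do what you need
    last_id = rows[-1][0]  # assumes id is the first selected column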
I don't believe SQLite has a built in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation from another user on this site for storing an array can be found here: How to store array in one column in Sqlite3?
Another solution can be found: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?
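One simple version of that idea is to serialize each combination to a JSON string (JSON is just one choice of encoding; repr() plus ast.literal_eval() would also work):

import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE combos (combo TEXT)')

combo = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
conn.execute('INSERT INTO combos VALUES (?)', (json.dumps(combo),))

row = conn.execute('SELECT combo FROM combos').fetchone()
restored = tuple(json.loads(row[0]))  # back to a tuple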

How to get nth record from result set of aerospike scan() output in Python, without looping through all results?

A beginner question with python probably.
I am able to iterate over the results of an Aerospike DB query like this -
import aerospike

client = aerospike.client(config).connect()
scan = client.scan('namespace', 'setName')
scan.select('PK', 'expiresIn', 'clientId', 'scopes', 'roles')  # scan from aerospike

def process_result(record_tuple):
    key, metadata, record = record_tuple
    expiresIn = record.get("expiresIn")

scan.foreach(process_result)
Now, all I want to do is get the nth record from this set, without having to iterate through all.
I tried looking at Get the nth item of a generator in Python but could not make much sense.
Results from a scan operation come from all the nodes in the cluster, pipelined, in no particular order. In that sense, there is no difference between the first record and the Nth record in terms of ordering. There is no order.
I wrote some Medium posts on how to sort results from a scan or query:
Sorted Results from a Secondary Index Query in Aerospike — Part I
Sorted Results from a Secondary Index Query — Part II
As usual, the workaround would be to set the scan policy to return just the digests, store them as a list (or several records with smaller lists) and paginate over those with batch reads. You can set reasonable TTLs so that this saved result set lives for a reasonable length of time.
I can provide sample code if needed.

How to convert a tuple (in place) to first item in list?

I have boiler plate code that performs my sql queries and returns results (this is based off of working code written years ago by someone else). Generally, that code will return a list of tuples, which is totally fine for what I need.
However, if there's only one result, the code returns a single tuple instead, and breaks code that expects to loop through a list of tuples.
I need an easy way to convert the tuple into the first item of a list, so I can use it in my code that expects to loop through lists.
What's the most straightforward way to do this in a single line of code?
I figured there must be a straightforward way to do this, and there is. If my result set is called rows:
if not isinstance(rows, list):
    rows = [rows]
I don't know if this is the most Pythonic construction, or if there's a way of combining the isinstance check and rows = [rows] into a single statement.
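For what it's worth, a conditional expression folds it into one line:

rows = rows if isinstance(rows, list) else [rows]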

Efficient use of Python list comprehensions

I have a Python list of objects that could be pretty long. At particular times, I'm interested in all of the elements in the list that have a certain attribute, say flag, that evaluates to False. To do so, I've been using a list comprehension, like this:
objList = list()
# ... populate list
sublist = [x for x in objList if not x.flag]
Which seems to work well. After forming the sublist, I have a few different operations that I might need to do:
Subscript the sublist to get the element at index ind.
Calculate the length of the sublist (i.e. the number of elements that have flag == False).
Search the sublist for the first instance of a particular object (i.e. using the list's .index() method).
I've implemented these using the naive approach of just forming the sublist and then using its methods to get at the data I want. I'm wondering if there are more efficient ways to go about these. #1 and #3 at least seem like they could be optimized, because in #1 I only need the first ind + 1 matching elements of the sublist, not necessarily the entire result set, and in #3 I only need to search through the sublist until I find a matching element.
Is there a good Pythonic way to do this? I'm guessing I might be able to use the () syntax in some way to get a generator instead of creating the entire list, but I haven't happened upon the right way yet. I obviously could write loops manually, but I'm looking for something as elegant as the comprehension-based method.
If you need to do more than one of these operations, or do any of them more than once, the overhead of the other methods adds up, and the list is the best way. It's also probably the clearest, so if memory isn't a problem, then I'd recommend just going with it.
If memory/speed is a problem, then there are alternatives - note that speed-wise, these might actually be slower, depending on the common case for your software.
For your scenarios:
from itertools import dropwhile

# value = sublist[n]
value = nth((x for x in objList if not x.flag), n)

# value = len(sublist)
value = sum(not x.flag for x in objList)

# value = sublist.index(target) -- note this yields the element itself, not its index
value = next(dropwhile(lambda x: x != target, (x for x in objList if not x.flag)))
Using itertools.dropwhile() and the nth() recipe from the itertools docs.
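For reference, the nth() recipe from the itertools documentation is a thin wrapper around islice():

from itertools import islice

def nth(iterable, n, default=None):
    "Returns the nth item or a default value."
    return next(islice(iterable, n, None), default)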
I'm going to assume you might do any of these three things, and you might do them more than once.
In that case, what you want is basically to write a lazily evaluated list class. It would keep two pieces of data, a real list cache of evaluated items, and a generator of the rest. You could then do ll[10] and it would evaluate up to the 10th item, ll.index('spam') and it would evaluate until it finds 'spam', and then len(ll) and it would evaluate the rest of the list, all the while caching in the real list what it sees so nothing is done more than once.
Constructing it would look like this:
LazyList(x for x in obj_list if not x.flag)
But nothing would actually be computed until you actually start using it as above.
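A minimal sketch of such a class (the real thing would want more of the list interface, negative-index handling, and so on) could look like this:

class LazyList:
    def __init__(self, iterable):
        self._cache = []           # items evaluated so far
        self._it = iter(iterable)  # the rest, still lazy

    def _fill_to(self, n):
        # Pull items until the cache holds n + 1 of them, or the source runs dry.
        while len(self._cache) <= n:
            try:
                self._cache.append(next(self._it))
            except StopIteration:
                break

    def __getitem__(self, n):
        self._fill_to(n)
        return self._cache[n]  # IndexError if the source was too short

    def index(self, value):
        i = 0
        while True:
            self._fill_to(i)
            if i >= len(self._cache):
                raise ValueError(f'{value!r} is not in list')
            if self._cache[i] == value:
                return i
            i += 1

    def __len__(self):
        self._cache.extend(self._it)  # exhaust the source
        return len(self._cache)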
Since you commented that your objList can change, if you don't also need to index or search objList itself, then you might be better off just storing two different lists, one with .flag = True and one with .flag = False. Then you can use the second list directly instead of constructing it with a list comprehension each time.
If this works in your situation, it is likely the most efficient way to do it.
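A sketch of that two-list bookkeeping (a one-time partition; keeping the lists in sync as flags change is up to your update code):

flagged = [x for x in objList if x.flag]
unflagged = [x for x in objList if not x.flag]

# The three operations then use unflagged directly:
value = unflagged[ind]               # 1. subscript
count = len(unflagged)               # 2. length
position = unflagged.index(target)   # 3. first occurrence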

Copy cursor object in Python

I am working on a Trac-Plugin...
To retrieve my data I create a cursor object and get the result table like this:
db = self.env.get_db_cnx()
cursor = db.cursor()
cursor.execute("SELECT...")
Now the result is used in 3 different functions. My problem is that the cursor is exhausted after the first loop through it (as described here: http://packages.python.org/psycopg2/cursor.html).
I then tried to copy the cursor object, but this failed too: the copy(cursor) function seems to have problems with a big dataset, and deepcopy(cursor) fails anyway (according to this bug: http://bugs.python.org/issue1515).
How can I solve this issue?
Storing the values from any finite iterable is simple:
results = list(cursor)
Iterate over the iterable and store the results in a list. This list can be iterated over as many times as necessary.
You don't need a copy of the cursor, just a copy of the results of the query.
For this specific case, you should do what 9000 suggests in his comment -- use the cursor's built-in functionality to fetch the results as a list, which should be as fast as or faster than manually calling list.
If you want to avoid looping through the data an extra time you could try wrapping it in a generator:
def lazy_execute(sql, cursor=cursor):
    results = []
    cursor.execute(sql)
    def fetch():
        if results:
            # Already cached by a previous full pass: replay from the list.
            for r in results:
                yield r
        else:
            # First pass: cache each row as it comes off the cursor.
            for r in cursor:
                results.append(r)
                yield r
    return fetch
This essentially creates a list as you need it, but lets you call the same function everywhere, safely. You would then use this like so:
results = lazy_execute(my_sql)
for r in results():
    ...  # do something with r
This is almost certainly over-engineered premature optimization, though it does have the advantage that the same name means the same thing in every case, as opposed to generating a new list and having the same data under two different names.
I think if I were going to argue for using this I would use the same-names argument, unless the data set was pretty huge, but if it's huge enough to matter then there's a good chance you don't want to store it all in memory anyway.
Also it's completely untested.
