Python MySQLdb test for select count(*) = zero

I use SELECT COUNT(*) FROM db WHERE <expression> to see whether a set of records is empty. So:
>>> cnt = c.fetchone()
>>> print cnt
(0L,)
My question is: how do you test for this condition?
I have a number of other ways to accomplish this. Is something like the following possible?
if cnt == (0L,):
    # do something

fetchone returns a row, which is a sequence of columns.
If you want to get the first value in a sequence, you use [0].
You could instead compare the row to (0,), as you're suggesting. But as far as I know neither the general DB-API nor the specific MySQLdb library guarantee what kind of sequence a row is; it could be a list, or a custom sequence class. So, relying on the fact that it's a tuple is probably not a good idea. And, since it's just as easy to not do so, why not be safe and portable?
So:
count_row = c.fetchone()
count = count_row[0]
if count == 0:
    do_something()
Or, putting it together in one line:
if c.fetchone()[0] == 0:
    do_something()

Thank you. Your first sequence works; I don't know how I missed trying that one. The second construction gets an error: ...object has no attribute '__getitem__'. I would guess my version of MySQLdb (1.2.3_4, Python 2.7) does not support it.
What I did in the interim was to construct the zero tuple by executing a COUNT(*) query built to return zero records. This seems to work fine.

It's often easier to use the .rowcount attribute of the cursor object to check whether there are any rows in your result set. This attribute is specified in the Python Database API:
This read-only attribute specifies the number of rows that the last
.execute*() produced (for DQL statements like SELECT) or
affected (for DML statements like UPDATE or INSERT). [9]
The attribute is -1 in case no .execute*() has been performed on
the cursor or the rowcount of the last operation cannot be
determined by the interface. [7]
When .rowcount cannot be used
Note that per the above specs, Cursor.rowcount should be set to -1 when the number of rows produced or affected by the last statement "cannot be determined by the interface." This happens when using the SSCursor and SSDictCursor cursor classes.
The reason is that the MySQL C API has two different functions for retrieving result sets: mysql_store_result() and mysql_use_result(). The difference is that mysql_use_result() reads rows from the result set as you ask for them, rather than storing the entire result set as soon as the query is executed. For very large result sets, this "unbuffered" approach can be faster and uses much less memory on the client machine; however, it makes it impossible for the interface to determine how many rows the result set contains at the time the query is executed.
Both SSCursor and SSDictCursor call mysql_use_result(), so their .rowcount attribute should hold the value -1 regardless of the size of the result set. In contrast, DictCursor and the default Cursor class call mysql_store_result(), which reads and counts the entire result set immediately after executing the query.
To make matters worse, the .rowcount attribute only ever holds the value -1 when the cursor is first opened; once you execute a query, it receives the return value of mysql_affected_rows(). The problem is that mysql_affected_rows() returns an unsigned long long integer, which represents the value -1 in a way that can be very counterintuitive and wouldn't be caught by a condition like cursor.rowcount == -1.
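As a quick illustration of the pitfall (a hedged sketch; the connection arguments and table name are placeholders), an unbuffered cursor reports the unsigned form of -1 rather than the Python integer -1:
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(db='test', cursorclass=MySQLdb.cursors.SSCursor)
c = conn.cursor()
c.execute("SELECT * FROM user")
print c.rowcount  # typically 18446744073709551615 (2**64 - 1), the unsigned
                  # representation of -1, so `c.rowcount == -1` is False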
Counting for counting's sake
If the only thing you're doing is counting records, then .rowcount isn't that useful because your COUNT(*) query is going to return a row whether the records exist or not. In that case, test for the zero value in the same way that you would test for any value when fetching results from a query. Whether you can do c.fetchone()[0] == 0 depends on the cursor class you're using; it would work for a Cursor or SSCursor but fail for a DictCursor or SSDictCursor, which fetch dictionaries instead of tuples.
The important thing is just to be clear in your code about what's happening, which is why I would recommend against using c.fetchone() == (0,). That tests an entire row when all you need to do is test a single value; get the value out of the row before you test it, and your code will be more clear. Personally, I find c.fetchone()[0] to be needlessly opaque; I would prefer:
row = cursor.fetchone()
if row[0] == 0:
    do_something()
This makes it abundantly clear, without being too verbose, that you're testing the first item of the row. When I'm doing anything more complicated than a simple COUNT() or EXISTS(), I prefer to use DictCursor so that my code relies on (at most) explicit aliases and never on implicit column ordering.
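For example, a DictCursor version of the count test might look like this (a sketch; the connection arguments, table name, and alias are assumptions):
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(db='test', cursorclass=MySQLdb.cursors.DictCursor)
c = conn.cursor()
c.execute("SELECT COUNT(*) AS user_count FROM user")
row = c.fetchone()
if row['user_count'] == 0:  # keyed by the explicit alias, not column position
    do_something()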
Testing for an empty result set
On the other hand, if you actually need to fetch a result set and the counting is purely incidental, as long as you're not using one of the unbuffered cursor classes you can just execute the important query and not worry about the COUNT():
cursor.execute(r"SELECT id, name, email FROM user WHERE date_verified IS NULL;")
if cursor.rowcount == 0:
    print 'No results'

Related

make function memory efficient or store data somewhere else to avoid memory error

I currently have a for loop which finds and stores combinations in a list. The number of possible combinations is very large and I need to be able to access the combos.
can I use an empty relational db like SQLite to store my list on a disk instead of using list = []?
Essentially what I am asking is whether there is a db equivalent to list = [] that I can use to store the combinations generated via my script?
Edit:
SQLite is not a must. Any database will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - Could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in the python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was: why do you explode the combinations objects into lists at all? But of course, since there are two nested for loops, the iterator in the inner loop would be consumed on the first iteration of the outer loop if it were not converted to a list.
However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both sets have 50 elements, there are 1,225 combinations of 2 elements, each with a memory size (if the items are integers) of about 120 bytes, i.e. about 147 KB in total, while there are about 5.36e+08 combinations of 8 elements, at about 336 bytes each, i.e. about 180 GB in total. So the first thing is: keep the larger combo set as a combinations object and iterate over it in the outer loop, as in the sketch below. By the way, this will also be much faster.
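A minimal sketch of that rearrangement, keeping the permute signature from the question but turning it into a generator so the full result is never materialized:
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))  # the small side: materialized for reuse
    for j in combinations(set2, 8):            # the large side: stays a lazy iterator
        for i in set1_combos:
            yield i + j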
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the primary key, so your queries will be efficient; otherwise, you will save space on disk by using an auto-incrementing integer as the PK (SQLite will create one for you if you don't explicitly define a PK, as will a few other DBMSs).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to go from 85 to over 96,000 inserts per second.
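Putting those pieces together, a hedged sketch with sqlite3 (the file, table, and column names are placeholders, set1 and set2 are assumed to be defined, and the single transaction is the main insert-speed optimization mentioned above):
import sqlite3
from itertools import combinations

conn = sqlite3.connect('combos.db')
conn.execute('CREATE TABLE IF NOT EXISTS full_sets (combo TEXT)')

set1_combos = list(combinations(set1, 2))  # the small side, reused on every pass
with conn:  # one transaction around the whole bulk insert
    conn.executemany(
        'INSERT INTO full_sets (combo) VALUES (?)',
        ((repr(i + j),) for j in combinations(set2, 8) for i in set1_combos))
conn.close()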
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.
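For reference, a minimal fetchmany() loop might look like this (a sketch; it reads the result set in batches of 1000 instead of all at once):
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
while True:
    rows = mycursor.fetchmany(1000)
    if not rows:
        break
    for combo in rows:
        print(combo)  # or do what you need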
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
chunk_count = 0
mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> '
                 f'ORDER BY <primarykey> LIMIT {chunk_size}')
rows = mycursor.fetchall()
while rows:
    for combo in rows:
        print(combo)  # or do what you need
    chunk_count += 1
    mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> '
                     f'ORDER BY <primarykey> '
                     f'LIMIT {chunk_size} OFFSET {chunk_size * chunk_count}')
    rows = mycursor.fetchall()
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
I don't believe SQLite has a built-in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation by another user on this site for getting array-like storage can be found here: How to store array in one column in Sqlite3?
Another solution can be found here: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes, it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
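One common low-setup approach (a sketch, independent of the links above; the file and table names are placeholders) is to serialize each list to JSON and store it in an ordinary TEXT column:
import json
import sqlite3

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (combo TEXT)')
with conn:
    conn.execute('INSERT INTO items (combo) VALUES (?)',
                 (json.dumps([1, 2, 3]),))
row = conn.execute('SELECT combo FROM items').fetchone()
restored = json.loads(row[0])  # back to a Python list: [1, 2, 3]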
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?

Sqlite and Python, wildcard after equals

I have the following code
def function(foo):
    # Function that counts records in db with SB = 0 and SO = foo
    # SO takes the values {0,1}
    SQL = """SELECT Count(*) FROM EvaluatedPlays
             WHERE
                 SB = 0
                 AND SO = ?
          """
    data = SQLfunctions.fetchSQL(SQL, foo)
    # This queries the db with the above SQL and foo as a binding
    return data
My question is: can I use a value for foo (some sort of wildcard) that will count all records where just SB = 0, i.e. where the value of SO doesn't matter? If not, I will have to write a whole new query, which I'd rather not do due to readability and redundancy issues.
No, there is no value that matches all values.
You could merge the two queries into one more complicated query—e.g., a LIKE instead of = that's either a non-wildcard string or a pure-wildcard string, or something like (? IS NULL OR SO = ?) that takes the parameter twice. But these are likely to be less readable, rather than more, and may cause a performance hit as well.
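For illustration, a sketch of the parameter-twice variant using the question's table (passing None as foo disables the SO filter):
SQL = """SELECT Count(*) FROM EvaluatedPlays
         WHERE SB = 0
           AND (? IS NULL OR SO = ?)"""
cursor.execute(SQL, (foo, foo))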
In situations like this, it may be reasonable to format the SQL statements dynamically—you're not formatting in any user data, or generating an open-ended set of different statements, just two specific ones. So:
SQL = """ SELECT Count(*) FROM EvaluatedPlays
WHERE
SB = 0"""
if foo is not None:
SQL += "AND SO = ?"
cursor.execute(SQL, [foo] if foo is not None else [])
You could make the SQL a bit more complicated, so it uses ? in a way where it has no effect instead of not using it at all, allowing you to pass the parameter unconditionally. But that can be very confusing to a reader.
And you can find ways to wrap this up more generically, but you'll have to be careful that you're not opening the doors to SQL injection, a combinatorial explosion of statements (which may ruin the database's statement cache), etc.
And of course you can always push the logic of making the switch based on the parameter value into a stored procedure (with SQLite this can even just be a function in your script), but that can be a pretty over-complicated solution for such a simple use case.
So, you'll often find that it looks cleaner and more readable to just have two different statements.

Matlab checking for a value in a database

I have the following Boolean statement in Python:
db_connection.query(
    'select storage_time from traces where id=' + trace_id
).dictresult()[0]['storage_time'] == None
It basically checks if there is a value in storage_time and I would like to do the same thing in Matlab, but I can't find anything equivalent to None.
Could you please help me out?
Thanks
The equivalent of Python's None in a database is NULL. Since you connect to your database via the Matlab Database Toolbox, you need to specify how NULL values retrieved from the database are to be represented in Matlab. This can be done by setting 'NullNumberRead' via the setdbprefs function from the Matlab Database Toolbox. For instance, you can do this:
setdbprefs('NullNumberRead','NaN')
or this:
setdbprefs('NullNumberRead','0')
Unfortunately, there is no guarantee that the NULL representation value specified this way won't be confused with real non-NULL values obtained as a result of your query (it is your own responsibility in this case to guarantee that the query results never contain NaNs or zeros, respectively, among their non-NULL values).
But if you have to connect to PostgreSQL, there exists, as far as I know, at least one Matlab-PostgreSQL connector that respects NULLs in a fully consistent manner: the high-performance PostgreSQL client library PgMex.
In PostgreSQL, both a value itself and the value's elements (for array types) can be NULL. This makes the representation of NULLs in Matlab less trivial than expected.
To illustrate the way PgMex represents NULLs in Matlab, consider the following example. Suppose you have to retrieve the results of a query returning one field myfield of float8[] type with two tuples, where the value of myfield for the first tuple is NULL itself, while for the second tuple the corresponding value equals {0,NULL,NaN}. The results are obtained as follows (we assume that the argument of the very first command connect below is filled in properly and that the table mytable, containing myfield of float8[] type among its fields, already exists in the respective database):
% Create the database connection
dbConn=com.allied.pgmex.pgmexec('connect',[...
'host=<yourhost> dbname=<yourdb> port=<yourport> '...
'user=<your_postgres_username> password=<your_postgres_password>']);
pgResult=com.allied.pgmex.pgmexec('exec',dbConn,...
'select myfield from mytable'); % perform the query
SMyField=com.allied.pgmex.pgmexec('getf',pgResult,...
'%float8[]',0); % retrieve the results
Here SMyField is a structure with three fields: valueVec, isNullVec and isValueNullVec. isValueNullVec equals the logical column array [true;false]: the entire value for the first tuple is NULL, while the value for the second tuple is not NULL as a whole. isNullVec equals the column cell array {[];[false,true,false]}, indicating that only the second element of the array serving as the value of myfield for the second tuple is NULL. Finally, valueVec equals the column cell array {[];[0 0 NaN]}. Only the first and third elements of the second cell are relevant; isNullVec already indicates that the second element is NULL, so its zero value does not matter (some default value is chosen for each particular data type).
As for your example, the respective Matlab code may look as follows (we assume that dbConn obtained as above already exists, that storage_time is of timestamp type, and that the variable trace_id is already defined):
pgResult=com.allied.pgmex.pgmexec('exec',dbConn,...
    ['select storage_time from traces where id=' trace_id]); % perform the query
SStorageTime=com.allied.pgmex.pgmexec('getf',pgResult,...
    '%timestamp',0); % retrieve the results
% check that the value for the first tuple is not NULL
isStorageTime=~SStorageTime.isValueNullVec(1);
Hence it is sufficient to check only isValueNullVec.
EDIT: There are free academic licenses for PgMex.
MATLAB's Database Toolbox has preferences for how to handle NULL values, and depending on those settings you can get different values back. See SETDBPREFS for details. You can change the preferences in the GUI as well.
By default you will get NaN if you read the data as numeric, and 'NULL' strings if you read it as strings. In the numeric case, check for NaN with the ISNAN function:
null_idx = isnan(fetcheddata);
For strings use STRCMP:
null_idx = strcmp(upper(fetcheddata), 'NULL');
In addition, if you fetch the data as cell array, you may need to deal with them with CELLFUN or convert to matrix with CELL2MAT.
Matlab idiom is usually to use the isempty() function.
isempty(somefunction(someargument))
returns true if somefunction(someargument) returns any empty result, and false otherwise.
I have not worked with the Matlab DB toolbox much, so I'm not sure what the full translation of your Python statement is.
If you use this query you can check for True or False instead of None:
trace_id_exists = db_connection.query("""\
select exists (
    select 1
    from traces
    where id = %s
) "exists"
""" % trace_id
).dictresult()[0]['exists']

if trace_id_exists:
    ...
You could also return something else like 1 or 0.

Copy cursor object in Python

I am working on a Trac-Plugin...
To retrieve my data I create a cursor object and get the result table like this:
db = self.env.get_db_cnx()
cursor = db.cursor()
cursor.execute("SELECT...")
Now the result is being used in 3 different functions. My problem is that the cursor is exhausted after looping through it the first time (as described here: http://packages.python.org/psycopg2/cursor.html).
I then tried to copy the cursor object, but this failed too. The copy(cursor) function seems to have problems with a big dataset, and deepcopy(cursor) fails anyway (according to this bug: http://bugs.python.org/issue1515).
How can I solve this issue?
Storing the values from any finite iterable is simple:
results = list(cursor)
Iterate over the iterable and store the results in a list. This list can be iterated over as many times as necessary.
You don't need a copy of the cursor, just a copy of the results of the query.
For this specific case, you should do what 9000 suggests in his comment: use the cursor's built-in functionality to get the results as a list (fetchall), which should be as fast as or faster than manually calling list.
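A minimal sketch of that pattern, with hypothetical names standing in for the three functions from the question:
cursor.execute("SELECT ...")
rows = cursor.fetchall()  # read the result set once into a plain list

first_function(rows)   # each consumer iterates the same list,
second_function(rows)  # as many times as it likes
third_function(rows)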
If you want to avoid looping through the data an extra time you could try wrapping it in a generator:
def lazy_execute(sql, cursor=cursor):
    results = []
    cursor.execute(sql)
    def fetch():
        if results:  # already cached: replay from the list
            for r in results:
                yield r
        else:        # first pass: cache rows as they stream in
            for r in cursor:
                results.append(r)
                yield r
    return fetch
This essentially creates a list as you need it, but lets you call the same function everywhere, safely. You would then use this like so:
results = lazy_execute(my_sql)
for r in results():
    # do something with r
This is almost certainly over-engineered premature optimization, though it does have the advantage that the same name means the same thing in every case, as opposed to generating a new list and then having two different names for the same data.
I think if I were going to argue for using this I would use the same-names argument, unless the data set was pretty huge, but if it's huge enough to matter then there's a good chance you don't want to store it all in memory anyway.
Also it's completely untested.

Data Structure for storing a sorting field to efficiently allow modifications

I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
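For concreteness, here is a minimal Python sketch of the in-memory operation, independent of any storage scheme:
def rotate(items, subset):
    # positions occupied by the subset members, in list order
    idxs = [i for i, x in enumerate(items) if x in subset]
    vals = [items[i] for i in idxs]
    for i, v in zip(idxs, vals[1:] + vals[:1]):  # first member moves to the last slot
        items[i] = v
    return items

rotate(list('ABCDE'), {'A', 'B', 'C', 'E'})  # -> ['B', 'C', 'E', 'D', 'A']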
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float of a specified number of decimals. Either I over-estimate, and store way more data than I need to, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be spread evenly over the database: some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL: ordering such a list in an SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for postgresql.
Less graceful (and write-heavy) solution:
Start a transaction. Use "select for update" within scope for row-level locks. Move the target record to position 0, update the succeeding records by +1 where their position is higher than the target's original position (or vice versa), and then update the target to the new position: a single additional write over what would be needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution if you can wait for Postgresql 8.5 (Alpha is available) :)
Wrap it in a transaction, select for update in scope, and use a deferred constraint (postgresql 8.5 has support for deferred unique constraints like Oracle).
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:
A 10      B 10
B 25  ->  C 25
C 26      E 26
E 34      A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some pythonic magic you shift the identifiers around and insert them into a temp table:
create temporary table reorder (
    id varchar(20), -- whatever
    sort_order number,
    primary key (id));
Now for the update:
update XYZ
set sort_order = (select sort_order from reorder where reorder.id = xyz.id)
where id in (select id from reorder)
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example) there can be issues. So, assume all order values are even:
1. Begin the transaction.
2. Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3. Select the data you just updated. If any sort_order fields are even, some other process has added a record that matches your criteria; you can either abort the transaction and restart, or just drop the record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4. Fill your temporary reorder table as above, using the proper even sort_orders.
5. Update the main table as above.
6. Drop the temporary table.
7. Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the row
in question until the transaction completes:
update XYZ set sort_order = sort_order + 1
where -- whatever your select criteria are
select * from XYZ
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of using the sort_order field is that indexing a BIT field or a LOCK_BY_USERID field that is usually null tends to perform poorly: the index is meaningless 99% of the time, and SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way" as opposed to using straight SQL. Searching for "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight-SQL linked-list style solution can be found in this Stack Overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, like a max of 16 (or 255), initially?
Start initially with labeling things aaa through zzz (that should be 17576 entries). (You could also add in 0-9, and the uppercase letters and symbols for an optimization.)
As items are added, they can go to the end up to the maximum you allow for the additional 'end times' (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.)
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would set the new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would remove the '0' from the front.
Obviously, you could do the same thing with floats and regular rebalancing; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field, large enough to make it very, very unlikely that you would run out of digits to do the ordering, and also make it unlikely that you would ever have to modify the schema, while not wasting a lot of space.
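A small sketch of generating the initial fixed-width labels (the width and alphabet being the assumptions above):
import itertools
import string

def initial_labels(n, width=3, alphabet=string.ascii_lowercase):
    # 'aaa', 'aab', 'aac', ...; 26**3 = 17576 labels are available at width 3
    return (''.join(t) for t in itertools.islice(
        itertools.product(alphabet, repeat=width), n))

for label in initial_labels(5):
    print(label)  # aaa, aab, aac, aad, aae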
You can solve the renumbering issue by making the order column an integer that is always even. When you move the data, you change the order field to the new sort value + 1 and then do a quick update to convert all the odd order fields to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep the uniqueness of sort_order as a constraint
EDIT: Okay, looking at the question again, I've started a new answer.
