I have the following code
def function(foo):
    # Function that counts records in db with SB = 0 and SO = foo
    # SO takes the values {0, 1}
    SQL = """ SELECT Count(*) FROM EvaluatedPlays
              WHERE
              SB = 0
              AND SO = ?
          """
    data = SQLfunctions.fetchSQL(SQL, foo)
    # This queries the db with the above SQL and foo as a binding
    return data
My question is: can I use a value for foo (some sort of wildcard) that will count all records such that just SB = 0, i.e. the value of SO doesn't matter? If not, I will have to write a whole new query to do this, which I'd rather not do due to readability and redundancy issues.
No, there is no value that matches all values.
You could merge the two queries into one more complicated query—e.g., a LIKE instead of = that's either a non-wildcard string or a pure-wildcard string, or something like (? IS NULL OR SO = ?) that takes the parameter twice. But these are likely to be less readable, rather than more, and may cause a performance hit as well.
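For illustration, a minimal sketch of that second option, assuming a DB-API cursor with qmark placeholders as in the rest of this answer; the parameter is simply passed twice:

SQL = """ SELECT Count(*) FROM EvaluatedPlays
          WHERE
          SB = 0
          AND (? IS NULL OR SO = ?)
      """
# Pass foo twice; passing None makes the first test true and disables the SO filter.
cursor.execute(SQL, [foo, foo])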
In situations like this, it may be reasonable to format the SQL statements dynamically—you're not formatting in any user data, or generating an open-ended set of different statements, just two specific ones. So:
SQL = """ SELECT Count(*) FROM EvaluatedPlays
WHERE
SB = 0"""
if foo is not None:
SQL += "AND SO = ?"
cursor.execute(SQL, [foo] if foo is not None else [])
You could make the SQL a bit more complicated, so it uses ? in a way where it has no effect instead of not using it at all, allowing you to pass the parameter unconditionally. But that can be very confusing to a reader.
And you can find ways to wrap this up more generically, but you'll have to be careful that you're not opening the doors to SQL injection, a combinatorial explosion of statements (which may ruin the database's statement cache), etc.
And of course you can always push the logic of making the switch based on the parameter value into a stored procedure (with SQLite this can even just be a function in your script), but that can be a pretty over-complicated solution for such a simple use case.
So, you'll often find that it looks cleaner and more readable to just have two different statements.
What's the difference between having multiple nested lookups inside queryset.filter and queryset.exclude?
For example, car ratings: a user can create ratings of multiple types for any car.
class Car(Model):
    ...

class Rating(Model):
    type = ForeignKey('RatingType')  # names like engine, design, handling
    user = ...  # user
Let's try to get a list of cars without rating by user "a" and type "design".
Approach 1
car_ids = Car.objects.filter(
    rating__user="A", rating__type__name="design"
).values_list('id', flat=True)
Car.objects.exclude(id__in=car_ids)
Approach 2
Car.objects.exclude(
    rating__user="A", rating__type__name="design"
)
Approach 1 works well for me, whereas Approach 2 appears to exclude more cars. My suspicion is that a nested lookup inside exclude does not behave like AND (for the rating), but rather like OR.
Is that true? If not, why do these two approaches result in different querysets?
Regarding filter, "multiple parameters are joined via AND in the underlying SQL statement". Your first approach results not in one but in two SQL queries roughly equivalent to:
SELECT ... WHERE rating.user='A' AND rating.type.name='design';
SELECT ... WHERE car.id NOT IN (id1, id2, id3 ...);
Here's the part of the documentation that answers your question very precisely regarding exclude:
https://docs.djangoproject.com/en/stable/ref/models/querysets/#exclude
The evaluated SQL query would look like:
SELECT ... WHERE NOT (rating.user='A' AND rating.type.name='design')
Nested lookups inside filter and exclude behave similarly and use AND conditions. At the end of the day, most of the time, your two approaches are indeed equivalent... except that the Car table might have been updated between the 1st and the 2nd query of your approach 1.
Are you sure that's not the case? To be sure, maybe try wrapping the two lines of approach 1 in a transaction.atomic block. In any case, your second approach is probably the best (the fewer queries, the better).
If you have any doubt, you can also have a look at the evaluated queries (see here or here).
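For example, a quick sketch: str() of a queryset's query attribute shows the SQL Django will run (without quoted parameters), and connection.queries lists what was actually executed when DEBUG is on:

qs = Car.objects.exclude(rating__user="A", rating__type__name="design")
print(str(qs.query))

# Or, after the querysets have been evaluated (requires DEBUG = True):
from django.db import connection
print(connection.queries[-2:])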
With sqlalchemy core you can bind a parameter and then fill it in at execution time, e.g.
query = select([my_table]).limit(bindparam('lim'))
session.execute(query, {'lim': 10})
I have a situation where the queries received may or may not use certain parameters. So I'd like a way to verify whether a parameter exists, or even to get a list of the parameters that have to be substituted at execution time, e.g. to allow limits in the query as above.
The values which may need to be substituted will be known, so it's just a case of checking whether they're actually there. At the moment, as a small temporary workaround, I have found I can set the names to complex things like random strings of digits and then just check whether those exist in the query's string form. Of course other solutions exist, like storing/passing which ones have or haven't been used, but I'm looking to see if there's a way just to check which ones exist and are expected.
You can get the parameters from the compiled query, like this:
>>> q = select([test1]).limit(bindparam('lim'))
>>> c = q.compile()
>>> c.params
{'lim': None}
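Building on that, a small sketch of checking for a specific parameter before executing; the compile() call and the params attribute are as above, the rest is illustrative:

compiled = query.compile()
expected = set(compiled.params)  # e.g. {'lim'}

if 'lim' in expected:
    result = session.execute(query, {'lim': 10})
else:
    result = session.execute(query)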
I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they're running nearly continuously.
The query associated with each count operation is rather involved, and I don't think there's a way to get them all to run with indices (i.e. as COUNT_SCAN operations).
The only feasible solution I've come up with is to:
Run a full collection scan every hour, pulling every document out of the db
Once each document is in memory, run all of the count operations against it myself
Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution the server is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.
So my question is whether there's any support from mongo drivers (pymongo in my case, but I'm curious in general) in interpreting query documents but running them locally against data in memory, not against data on the mongodb server.
Initially this felt like an odd request, but there are actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.
MongoDB In-Memory storage engine
If you want to process data using complex queries only in RAM, with MongoDB syntax, you can configure MongoDB to use the In-Memory storage engine, which avoids disk I/O entirely.
To me, this is the best option: you keep the ability to run complex queries and get the best performance.
Python in-memory databases:
You can use one of the following:
PyDbLite - a fast, pure-Python, untyped, in-memory database engine that uses Python syntax to manage data instead of SQL
TinyDB - if you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you. But it is not a fast solution and has a few other disadvantages (a minimal sketch follows below).
They should allow working with data directly in RAM, but I'm not sure if this is better than the previous option.
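As a rough illustration of the second option, a minimal sketch assuming a recent TinyDB; the field names echo the first question and are otherwise arbitrary:

from tinydb import TinyDB, Query
from tinydb.storages import MemoryStorage

db = TinyDB(storage=MemoryStorage)  # the table lives purely in RAM
db.insert_multiple([
    {"SB": 0, "SO": 0},
    {"SB": 0, "SO": 1},
    {"SB": 1, "SO": 0},
])

Play = Query()
print(len(db.search((Play.SB == 0) & (Play.SO == 1))))  # 1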
Own custom solution (e.g. written in Python)
Some services handle data in RAM at the application level only. If your solution is not complicated and the queries are simple, this is fine. But over time queries become more complicated and the code requires some abstraction layer (for advanced CRUD), like the databases above.
The last solution can have the best performance, but it takes more time to develop and support.
As you are using Python, have you considered Pandas? You could transform your JSON data into a pandas data frame and query it as you like; you can achieve a whole bunch of operations like count, group by, aggregate, etc. Please take a look at the docs. I'm adding a small example below to help you relate. Hope this helps.
For example:
import pandas as pd
from pandas.io.json import json_normalize

data = {
    "data_points": [
        {"a": 1, "b": 3, "c": 2},
        {"a": 3, "b": 2, "c": 1},
        {"a": 5, "b": 4, "d": 3}
    ]
}

# convert json to data frame
df = json_normalize(data["data_points"])
(The resulting data frame has columns a, b, c and d, with NaN where a key is missing from a data point.)
Now you can perform operations on the frame, like sum, count, etc.
Example:
# sum of column `a`
df['a'].sum()
output: 9
# sum of column `c` that has null values.
df['c'].sum()
output: 3.0
# count of column `c` that has null values.
df['c'].count()
output: 2
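Since the original goal is counting documents that match a query, boolean indexing on the frame gives each count directly; a quick sketch using the same toy frame:

# count of rows where a > 2 and b >= 2
len(df[(df['a'] > 2) & (df['b'] >= 2)])
output: 2
# or equivalently, sum the boolean mask
((df['a'] > 2) & (df['b'] >= 2)).sum()
output: 2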
Here's the code I have currently to solve this problem. I have enough tests running against it to qualify it for my use case, but it's probably not 100% correct. I certainly don't handle all possible query documents.
import operator
from collections.abc import Mapping

# Note: `Mongo` (string constants for the operator names, e.g. Mongo.AND)
# and `nested_dict_get` (dotted-key lookup into nested dicts) are project
# helpers defined elsewhere.


def check_doc_against_mongo_query(doc, query):
    """Return whether the given doc would be returned by the given query.

    Initially this might seem like work the db should be doing, but consider a use case where we
    need to run many complex queries regularly to count matches. If each query results in a full-
    collection scan, it is often faster to run a single scan fetching the entire collection into
    memory, then run all of the matches locally.

    We don't support mongo's full query syntax here, so we'll need to add support as the need
    arises."""
    # Run our check recursively
    return _match_query(doc, query)


def _match_query(doc, query):
    """Return whether the given doc matches the given query."""
    # We don't expect a null query
    assert query is not None
    # Check each top-level field for a match, we AND them together, so return on mismatch
    for k, v in query.items():
        # Check for AND/OR operators
        if k == Mongo.AND:
            if not all(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.OR:
            if not any(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.COMMENT:
            # Ignore comments
            pass
        else:
            # Now grab the doc's value and match it against the given query value
            doc_v = nested_dict_get(doc, k)
            if not _match_doc_and_query_value(doc_v, v):
                return False
    # All top-level fields matched so return match
    return True


def _match_doc_and_query_value(doc_v, query_v):
    """Return whether the given doc and query values match."""
    cmps = []  # we AND these together below, trailing bool for negation
    # Check for operators
    if isinstance(query_v, Mapping):
        # To handle 'in' we use a tuple, otherwise we use an operator and a value
        for k, v in query_v.items():
            if k == Mongo.IN:
                cmps.append((operator.eq, tuple(v), False))
            elif k == Mongo.NIN:
                cmps.append((operator.eq, tuple(v), True))
            else:
                op = {Mongo.EQ: operator.eq, Mongo.GT: operator.gt, Mongo.GTE: operator.ge,
                      Mongo.LT: operator.lt, Mongo.LTE: operator.le, Mongo.NE: operator.ne}[k]
                cmps.append((op, v, False))
    else:
        # We expect a simple value here, perform an equality check
        cmps.append((operator.eq, query_v, False))
    # Now perform each comparison
    return all(_invert(_match_cmp(op, doc_v, v), invert) for op, v, invert in cmps)


def _invert(result, invert):
    """Invert the given result if necessary."""
    return not result if invert else result


def _match_cmp(op, doc_v, v):
    """Return whether the given values match with the given comparison operator.

    If v is a tuple then we require op to match with any element.

    We take care to handle comparisons with null the same way mongo does, i.e. only null ==/<=/>=
    null returns true, all other comps with null return false. See:
    https://stackoverflow.com/questions/29835829/mongodb-comparison-operators-with-null
    for details.

    As an important special case of null comparisons, ne null matches any non-null value.
    """
    if doc_v is None and v is None:
        return op in (operator.eq, operator.ge, operator.le)
    elif op is operator.ne and v is None:
        return doc_v is not None
    elif v is None:
        return False
    elif isinstance(v, tuple):
        return any(op(doc_v, x) for x in v)
    else:
        return op(doc_v, v)
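For reference, a usage sketch, assuming the Mongo.* constants hold the usual operator strings (e.g. Mongo.OR == '$or', Mongo.GTE == '$gte', Mongo.IN == '$in') and that nested_dict_get resolves dotted keys like 'stats.SO' into nested dicts:

doc = {"SB": 0, "stats": {"SO": 1}}

print(check_doc_against_mongo_query(doc, {"SB": 0, "stats.SO": {"$in": [0, 1]}}))            # True
print(check_doc_against_mongo_query(doc, {"$or": [{"SB": 1}, {"stats.SO": {"$gte": 2}}]}))   # False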
Maybe you could try another approach?
I mean, MongoDB performs really badly at counting, especially with big collections.
I had a pretty similar problem at my last company, and what we did was create some "counter" objects and update them on every write you perform on your data.
In this way, you avoid counting at all.
The document would be something like:
{
    query1count: 12,
    query2count: 512312,
    query3count: 6
}
If query1count is related to the query "all documents where userId = 13", then in your Python layer you can check, before creating/updating a document, whether userId = 13, and if so increase the desired counter.
It will add a lot of extra complexity to your code, but reads of the counters are performed in O(1).
Of course, not all queries may be that easy, but you can reduce the execution time a lot with this approach.
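For instance, with pymongo the counter document can be kept up to date with an atomic $inc on every write; a sketch where the collection and field names are made up:

# After creating/updating a document that matches "query1" (userId == 13),
# bump its counter atomically.
if new_doc.get("userId") == 13:
    db.query_counters.update_one(
        {"_id": "counters"},
        {"$inc": {"query1count": 1}},
        upsert=True,
    )

# Reading all the counts later is a single O(1) fetch.
counters = db.query_counters.find_one({"_id": "counters"})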
I use SELECT COUNT(*) FROM db WHERE <expression> to see whether a set of records is empty. So:
>>> cnt = c.fetchone()
>>> print cnt
(0L,)
My question is: how do you test for this condition?
I have a number of other ways to accomplish this. Is something like the following possible?
if cnt==(0L,):
# do something
fetchone returns a row, which is a sequence of columns.
If you want to get the first value in a sequence, you use [0].
You could instead compare the row to (0,), as you're suggesting. But as far as I know neither the general DB-API nor the specific MySQLdb library guarantee what kind of sequence a row is; it could be a list, or a custom sequence class. So, relying on the fact that it's a tuple is probably not a good idea. And, since it's just as easy to not do so, why not be safe and portable?
So:
count_row = c.fetchone()
count = count_row[0]
if count == 0:
do_something()
Or, putting it together in one line:
if c.fetchone()[0] == 0:
do_something()
Thank you. Your first sequence works; I don't know how I did not try that one, but I did not. The second construction gets an error: ...object has no attribute '__getitem__'. I would guess my version of MySQLdb (1.2.3_4, Python 2.7) does not support it.
What I did in the interim was to construct the zero tuple by executing a count(*) constructed to return zero records. This seems to work fine
It's often easier to use the .rowcount attribute of the cursor object to check whether there are any rows in your result set. This attribute is specified in the Python Database API:
This read-only attribute specifies the number of rows that the last
.execute*() produced (for DQL statements like SELECT) or
affected (for DML statements like UPDATE or INSERT). [9]
The attribute is -1 in case no .execute*() has been performed on
the cursor or the rowcount of the last operation is cannot be
determined by the interface. [7]
When .rowcount cannot be used
Note that per the above specs, Cursor.rowcount should be set to -1 when the number of rows produced or affected by the last statement "cannot be determined by the interface." This happens when using the SSCursor and SSDictCursor cursor classes.
The reason is that the MySQL C API has two different functions for retrieving result sets: mysql_store_result() and mysql_use_result(). The difference is that mysql_use_result() reads rows from the result set as you ask for them, rather than storing the entire result set as soon as the query is executed. For very large result sets, this "unbuffered" approach can be faster and uses much less memory on the client machine; however, it makes it impossible for the interface to determine how many rows the result set contains at the time the query is executed.
Both SSCursor and SSDictCursor call mysql_use_result(), so their .rowcount attribute should hold the value -1 regardless of the size of the result set. In contrast, DictCursor and the default Cursor class call mysql_store_result(), which reads and counts the entire result set immediately after executing the query.
To make matters worse, the .rowcount attribute only ever holds the value -1 when the cursor is first opened; once you execute a query, it receives the return value of mysql_affected_rows(). The problem is that mysql_affected_rows() returns an unsigned long long integer, which represents the value -1 in a way that can be very counterintuitive and wouldn't be caught by a condition like cursor.rowcount == -1.
Counting for counting's sake
If the only thing you're doing is counting records, then .rowcount isn't that useful because your COUNT(*) query is going to return a row whether the records exist or not. In that case, test for the zero value in the same way that you would test for any value when fetching results from a query. Whether you can do c.fetchone()[0] == 0 depends on the cursor class you're using; it would work for a Cursor or SSCursor but fail for a DictCursor or SSDictCursor, which fetch dictionaries instead of tuples.
The important thing is just to be clear in your code about what's happening, which is why I would recommend against using c.fetchone() == (0,). That tests an entire row when all you need to do is test a single value; get the value out of the row before you test it, and your code will be more clear. Personally, I find c.fetchone()[0] to be needlessly opaque; I would prefer:
row = cursor.fetchone()
if row[0] == 0:
do_something()
This makes it abundantly clear, without being too verbose, that you're testing the first item of the row. When I'm doing anything more complicated than a simple COUNT() or EXISTS(), I prefer to use DictCursor so that my code relies on (at most) explicit aliases and never on implicit column ordering.
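For example, with a DictCursor the count comes back keyed by an explicit alias rather than by position; a sketch in which the table name and alias are illustrative:

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(db="mydb", cursorclass=MySQLdb.cursors.DictCursor)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) AS n FROM user WHERE date_verified IS NULL")
row = cursor.fetchone()
if row["n"] == 0:
    do_something()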
Testing for an empty result set
On the other hand, if you actually need to fetch a result set and the counting is purely incidental, as long as you're not using one of the unbuffered cursor classes you can just execute the important query and not worry about the COUNT():
cursor.execute(r"SELECT id, name, email FROM user WHERE date_verified IS NULL;")
if cursor.rowcount == 0:
print 'No results'
I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of a list, with a subset of the intervening items bubbling up to replace the previous item like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float of a specified number of decimals. Either I over-estimate, and store way more data than I need to, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be evenly distributed across the database - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL - ordering such a list in the SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for postgresql.
Less graceful (and write-heavy) solution:
Start a transaction. "select for update" within scope for row-level locks. Move the target record to position 0, update the target's succeeding records to +1 where their position is higher than the target's original position (or vice versa), and then update the target to the new position - a single additional write over what's needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution if you can wait for Postgresql 8.5 (Alpha is available) :)
Wrap it in a transaction, select for update in scope, and use a deferred constraint (postgresql 8.5 has support for deferred unique constraints like Oracle).
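As an illustration of the deferred-constraint variant, a sketch using a plain DB-API connection with psycopg2-style placeholders; the table and column names ("items", "id", "pos") are invented, a_id and b_id are the primary keys of the two rows being swapped, and the ALTER TABLE needs PostgreSQL 9.0+:

cur = conn.cursor()
cur.execute("""
    ALTER TABLE items
        ADD CONSTRAINT items_pos_uniq UNIQUE (pos)
        DEFERRABLE INITIALLY DEFERRED
""")
conn.commit()

# Swap the positions of two rows; psycopg2 opens the transaction implicitly,
# and the unique constraint is only checked at COMMIT, so the intermediate
# duplicate position is allowed.
cur.execute("SELECT id, pos FROM items WHERE id IN (%s, %s) FOR UPDATE", (a_id, b_id))
positions = dict(cur.fetchall())  # {a_id: a_pos, b_id: b_pos}
cur.execute("UPDATE items SET pos = %s WHERE id = %s", (positions[b_id], a_id))
cur.execute("UPDATE items SET pos = %s WHERE id = %s", (positions[a_id], b_id))
conn.commit()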
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:
before        after
A 10          B 10
B 25          C 25
C 26          E 26
E 34          A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some pythonic magic you shift the identifiers around and insert them into a temp table:
create temporary table reorder (
    id varchar(20),    -- whatever
    sort_order number,
    primary key (id));
Now for the update:
update xyz
set sort_order = (select sort_order from reorder where xyz.id = reorder.id)
where id in (select id from reorder)
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
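A rough Python sketch of that "pythonic magic" plus the update, using DB-API placeholders; cursor setup and error handling are omitted, and the names follow the temp table and xyz table above:

# Rotate the ids so A moves to the end and B, C, E bubble up,
# then pair them back with the original sort orders.
rows = [['A', 10], ['B', 25], ['C', 26], ['E', 34]]
ids = [r[0] for r in rows]
orders = [r[1] for r in rows]
reordered = list(zip(ids[1:] + ids[:1], orders))  # [('B', 10), ('C', 25), ('E', 26), ('A', 34)]

cur.executemany("INSERT INTO reorder (id, sort_order) VALUES (%s, %s)", reordered)
cur.execute("""
    UPDATE xyz
    SET sort_order = (SELECT sort_order FROM reorder WHERE xyz.id = reorder.id)
    WHERE id IN (SELECT id FROM reorder)
""")
conn.commit()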
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example) there can be issues. So, assume all order values are even:
1. Begin the transaction.
2. Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3. Select the data you just updated; if any sort_order fields are even, some other process has added a record that matches your criteria. You can either abort the transaction and restart, or you can just drop the record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4. Fill your temporary reorder table as above using the proper even sort_orders.
5. Update the main table as above.
6. Drop the temporary table.
7. Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the row in question until the transaction completes:
update XYZ set sort_order = sort_order + 1
where -- whatever your select criteria are
select * from XYZ
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of reusing the sort_order field is that indexing a BIT field or a LOCK_BY_USERID field that is usually null tends to perform poorly, since the index is meaningless 99% of the time. SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way" as opposed to using straight SQL. Searching "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight SQL linked-list style solution can be found in this Stack Overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, like a max of 16 (or 255), initially?
Start by labeling things aaa through zzz (that should be 17576 entries). (You could also add in 0-9, and the uppercase letters and symbols, as an optimization.)
As items are added, they can go at the end, up to the maximum length you allow, using additional 'end' labels (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.).
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would set the new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would remove the '0' from the front.
Obviously, you could do the same thing with floats and rebalance regularly; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field, large enough to make it very, very, very unlikely that you would run out of digits to do the ordering, and also make it unlikely that you would ever have to modify the schema, while not wasting a lot of space.
You can solve the renumbering issue by doing the order column as an integer that is always an even number. When you are moving the data, you change the order field to the new sort value + 1 and then do a quick update to convert all the odd order fields to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep the uniqueness of sort_order as a constraint.
EDIT: Okay, looking at the question again, I've started a new answer.