Currently I'm focussing on the pythonic way of writing my code and I ran into two situations where I wonder what's best.
First the situation for method overloading, which is not available in python. How would I best solve the situation where I have a function that fetches data from a database, however depending on an argument being an integer or a list of integers the query would be different. Example:
def getData(ids):
if type(ids) == int:
# query the database in an efficient manner for a single ID
elif type(ids) is list:
# query the database in a different manner efficiently for multiple ID's
# also return the data differently
Would I do all the work in a single function or do I use different functions which are called from the above function to do the work? Or would I just need to call a different function explicitly depending on whether I have a list of ID's or just a single ID? What do you believe is best?
Use isinstance:
import collections
def getData(ids):
if isinstance(ids, collections.Iterable):
# query the database efficiently for multiple ID's
else:
# query the database in an efficient manner for a single ID
Related
What's the difference between having multiple nested lookups inside queryset.filter and queryset.exclude?
For example car ratings. User can create ratings of multiple types for any car.
class Car(Model):
...
class Rating(Model):
type = ForeignKey('RatingType') # names like engine, design, handling
user = ... # user
Let's try to get a list of cars without rating by user "a" and type "design".
Approach 1
car_ids = Car.objects.filter(
rating__user="A", rating__type__name="design"
).values_list('id',flat=True)
Car.objects.exclude(id__in=car_ids)
Approach 2
Car.objects.exclude(
rating__user="A", rating__type__name="design"
)
The Approach 1 works well to me whereas the Approach 2 looks to be excluding more cars. My suspicion is that nested lookup inside exclude does not behave like AND (for the rating), rather it behaves like OR.
Is that true? If not, why these two approaches results in different querysets?
Regarding filter, "multiple parameters are joined via AND in the underlying SQL statement". Your first approach results not in one but in two SQL queries roughly equivalent to:
SELECT ... WHERE rating.user='A' AND rating.type.name='design';
SELECT ... WHERE car.id NOT IN (id1, id2, id3 ...);
Here's the part of the documentation that answers your question very precisely regarding exclude:
https://docs.djangoproject.com/en/stable/ref/models/querysets/#exclude
The evaluated SQL query would look like:
SELECT ... WHERE NOT (rating.user='A' AND rating.type.name='design')
Nested lookups inside filter and exclude behave similarly and use AND conditions. At the end of the day, most of the time, your 2 approaches are indeed equivalent... Except that the Car table might have been updated between the 1st and the 2d query of your approach 1.
Are you sure it's not the case? To be sure, try maybe to wrap the 2 lines of approach 1 in a transaction.atomic block? In any case, your second approach is probably the best (the less queries, the better).
If you have any doubt, you can also have a look at the evaluated queries (see here or here).
I have a QuerySet, let's call it qs, which is ordered by some attribute which is irrelevant to this problem. Then I have an object, let's call it obj. Now I'd like to know at what index obj has in qs, as efficiently as possible. I know that I could use .index() from Python or possibly loop through qs comparing each object to obj, but what is the best way to go about doing this? I'm looking for high performance and that's my only criteria.
Using Python 2.6.2 with Django 1.0.2 on Windows.
If you're already iterating over the queryset and just want to know the index of the element you're currently on, the compact and probably the most efficient solution is:
for index, item in enumerate(your_queryset):
...
However, don't use this if you have a queryset and an object obtained by some unrelated means, and want to learn the position of this object in the queryset (if it's even there).
If you just want to know where you object sits amongst all others (e.g. when determining rank), you can do it quickly by counting the objects before you:
index = MyModel.objects.filter(sortField__lt = myObject.sortField).count()
Assuming for the purpose of illustration that your models are standard with a primary key id, then evaluating
list(qs.values_list('id', flat=True)).index(obj.id)
will find the index of obj in qs. While the use of list evaluates the queryset, it evaluates not the original queryset but a derived queryset. This evaluation runs a SQL query to get the id fields only, not wasting time fetching other fields.
QuerySets in Django are actually generators, not lists (for further details, see Django documentation on QuerySets).
As such, there is no shortcut to get the index of an element, and I think a plain iteration is the best way to do it.
For starter, I would implement your requirement in the simplest way possible (like iterating); if you really have performance issues, then I would use some different approach, like building a queryset with a smaller amount of fields, or whatever.
In any case, the idea is to leave such tricks as late as possible, when you definitely knows you need them.
Update: You may want to use directly some SQL statement to get the rownumber (something lie . However, Django's ORM does not support this natively and you have to use a raw SQL query (see documentation). I think this could be the best option, but again - only if you really see a real performance issue.
It's possible for a simple pythonic way to query the index of an element in a queryset:
(*qs,).index(instance)
This answer will unpack the queryset into a list, then use the inbuilt Python index function to determine it's position.
You can do this using queryset.extra(…) and some raw SQL like so:
queryset = queryset.order_by("id")
record500 = queryset[500]
numbered_qs = queryset.extra(select={
'queryset_row_number': 'ROW_NUMBER() OVER (ORDER BY "id")'
})
from django.db import connection
cursor = connection.cursor()
cursor.execute(
"WITH OrderedQueryset AS (" + str(numbered_qs.query) + ") "
"SELECT queryset_row_number FROM OrderedQueryset WHERE id = %s",
[record500.id]
)
index = cursor.fetchall()[0][0]
index == 501 # because row_number() is 1 indexed not 0 indexed
I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they're running nearly continuously.
The query associated with each count operation is rather involved, and I don't think there's a way to get them all to run with indices (i.e. as COUNT_SCAN operations).
The only feasible solution I've come up with is to:
Run a full collection scan every hour, pulling every document out of the db
Once each document is in memory, run all of the count operations against it myself
Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution the server is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.
So my question is whether there's any support from mongo drivers (pymongo in my case, but I'm curious in general) in interpreting query documents but running them locally against data in memory, not against data on the mongodb server.
Initially this felt like an odd request, but there's actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.
MongoDB In-Memory storage engine
If you want to process data using complex queries only in RAM using MongoDB syntax, you can configure MongoDB to use In-Memory only storage engine that avoids disk I/O at all.
For me, it is the best option to have the ability to have complex queries and best performance.
Python in-memory databases:
You can use one of the following:
PyDbLite - a fast, pure-Python, untyped, in-memory database engine, using Python syntax to manage data, instead of SQL
TinyDB - if you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you. But not a fast solution and have few other disadvantages.
They should allow working with data directly in RAM, but I'm not sure if this is better than the previous option.
Own custom solution (e.g. written in Python)
Some services handle data in RAM only on application level only. If your solution is not complicated and queries are simple - this is ok. But since some time queries become more complicated and code require some abstraction level (for advanced CRUD), like previous databases.
The last solution can have the best performance, but it takes more time to develop and support it.
As you are using python, have you considered Pandas?, you could basically try and transform your JSON data to pandas data frame and query it as you like, you could achieve whole bunch of operations like count, group by, aggregate etc. Please take a look at the doc. Adding a small example below to help you relate. Hope this helps.
For example:
import pandas as pd
from pandas.io.json import json_normalize
data = {
"data_points":[
{"a":1,"b":3,"c":2},
{"a":3,"b":2,"c":1},
{"a":5,"b":4,"d":3}
]
}
# convert json to data frame
df = json_normalize(data["data_points"])
Pandas data frame view above.
now you could just try and perform operation on them like sum, count etc.
Example:
# sum of column `a`
df['a'].sum()
output: 9
# sum of column `c` that has null values.
df['c'].sum()
output: 3.0
# count of column `c` that has null values.
df['c'].count()
output: 2
Here's the code I have currently to solve this problem. I have enough tests running against it to qualify it for my use case, but it's probably not 100% correct. I certainly don't handle all possible query documents.
def check_doc_against_mongo_query(doc, query):
"""Return whether the given doc would be returned by the given query.
Initially this might seem like work the db should be doing, but consider a use case where we
need to run many complex queries regularly to count matches. If each query results in a full-
collection scan, it is often faster to run a single scan fetching the entire collection into
memory, then run all of the matches locally.
We don't support mongo's full query syntax here, so we'll need to add support as the need
arises."""
# Run our check recursively
return _match_query(doc, query)
def _match_query(doc, query):
"""Return whether the given doc matches the given query."""
# We don't expect a null query
assert query is not None
# Check each top-level field for a match, we AND them together, so return on mismatch
for k, v in query.items():
# Check for AND/OR operators
if k == Mongo.AND:
if not all(_match_query(doc, x) for x in v):
return False
elif k == Mongo.OR:
if not any(_match_query(doc, x) for x in v):
return False
elif k == Mongo.COMMENT:
# Ignore comments
pass
else:
# Now grab the doc's value and match it against the given query value
doc_v = nested_dict_get(doc, k)
if not _match_doc_and_query_value(doc_v, v):
return False
# All top-level fields matched so return match
return True
def _match_doc_and_query_value(doc_v, query_v):
"""Return whether the given doc and query values match."""
cmps = [] # we AND these together below, trailing bool for negation
# Check for operators
if isinstance(query_v, Mapping):
# To handle 'in' we use a tuple, otherwise we use an operator and a value
for k, v in query_v.items():
if k == Mongo.IN:
cmps.append((operator.eq, tuple(v), False))
elif k == Mongo.NIN:
cmps.append((operator.eq, tuple(v), True))
else:
op = {Mongo.EQ: operator.eq, Mongo.GT: operator.gt, Mongo.GTE: operator.ge,
Mongo.LT: operator.lt, Mongo.LTE: operator.le, Mongo.NE: operator.ne}[
k]
cmps.append((op, v, False))
else:
# We expect a simple value here, perform an equality check
cmps.append((operator.eq, query_v, False))
# Now perform each comparison
return all(_invert(_match_cmp(op, doc_v, v), invert) for op, v, invert in cmps)
def _invert(result, invert):
"""Invert the given result if necessary."""
return not result if invert else result
def _match_cmp(op, doc_v, v):
"""Return whether the given values match with the given comparison operator.
If v is a tuple then we require op to match with any element.
We take care to handle comparisons with null the same way mongo does, i.e. only null ==/<=/>=
null returns true, all other comps with null return false. See:
https://stackoverflow.com/questions/29835829/mongodb-comparison-operators-with-null
for details.
As an important special case of null comparisons, ne null matches any non-null value.
"""
if doc_v is None and v is None:
return op in (operator.eq, operator.ge, operator.le)
elif op is operator.ne and v is None:
return doc_v is not None
elif v is None:
return False
elif isinstance(v, tuple):
return any(op(doc_v, x) for x in v)
else:
return op(doc_v, v)
Maybe you could try another approach?
I mean, MongoDB performs really bad in counting, overall with big collections.
I had a pretty similar problem in my last company and what we did is to create some "counters" object, and update them in every update you perform over your data.
In this way, you avoid counting at all.
The document would be something like:
{
query1count: 12,
query2count: 512312,
query3count: 6
}
If the query1count is related to the query: "all documents where userId = 13", then in your python layer you can check before creating/updating a document if the userId = 13, and if so then increase the desired counter.
It will do add a lot of extra complexity to your code, but the reads of the counters will be performed in O(1).
Of course, not all the queries may be that easy but you can reduce a lot the execution time with this approach.
I have a list of strings that I need to determine if they are present in my model names. Currently I am using a series of if and elif statements to determine how many strings are in my list of strings then execute a filter request to my database like this (this works but is extremely inefficient as I am constantly repeating myself):
#example if my list contained only 2 strings
if len(listOfStrings) == 2:
queryResults = MyModel.objects.filter(Q(name__contains=listOfStrings[0])|Q(name__contains=listOfStrings[1]))
Then I notify the client based on the result of these queries. If I test the strings against my models each individually I might get the same models (because some model names contain all three or two strings) then notify the client twice. So I would think to solve this problem using a for loop like so
listOfStrings = {"string1","string2","string3"}
queryResults = ""
for string in listOfStrings:
queryResults += MyModel.objects.filter(name__contains=string)
I understand that the filter() method is lazy and doesn't execute right away and that this python logic is most likely wrong. But how do I concatenate filter requests in order to avoid duplicate model results.
You probably want to use the distinct filter.
The results of an ORM query (e.g., MyObject.query()) need to be ordered according to a ranking algorithm that is based on values not within the database (i.e. from a separate search engine). This means 'order_by' will not work, since it only operates on fields within the database.
But, I don't want to convert the query results to a list, then reorder, because I want to maintain the ability to add further constraints to the query. E.g.:
results = MyObject.query()
results = my_reorder(results)
results = results.filter(some_constraint)
Is this possible to accomplish via SQLAlchemy?
I am afraid you will not be able to do it, unless the ordering can be derived from the fields of the object's table(s) and/or related objects' tables which are in the database.
But you could return from your code the tuple (query, order_func/result). In this case the query can be still extended until it is executed, and then resorted. Or you could create a small Proxy-like class, which will contain this tuple, and will delegate the query-extension methods to the query, and query-execution methods (all(), __iter__, ...) to the query and apply ordering when executed.
Also, if you could calculate the value for each MyObject instance beforehand, you could add a literal column to the query with the values and then use order_by to order by it. Alternatively, add temporary table, add rows with the computed ordering value, join on it in the query and add ordering. But I guess these are adding more complexity than the benefit they bring.