What I have is a list of model objects I haven't run bulk_create on:
objs = [Model(id=1, field=foo), Model(id=2, field=bar)]
What I'd like to do is intersect objs with Model.objects.all() and return only those objects which aren't already in the database (based on the field value).
For example, if my database was:
[Model(id=3, field=foo)]
Then the resulting objs should be:
objs = [Model(id=2, field=bar)]
Is something like this possible?
Edit:
So a bit of further explanation:
What I'm doing is this: I have an import command, to which I'm trying to add an --append flag.
Without the flag, I delete the tables, and start fresh.
With the flag, I want to bulk create a large number of objects (single creation is much slower - I've checked), and I don't want to have objects with the same field values but different ids.
I've already tried filtering out duplicates after insertion, but it's quite slow and I wanted to test this approach to see if it's faster.
The objects are read from CSV files, and it's faster to make a list of Model, and then bulk_create, as opposed to running create on each row.
We can probably best do this by first constructing a set of the values that the field column already has in the database:
field_vals = set(Model.objects.values_list('field', flat=True).distinct())
and then we can perform a filtering like:
filtered_objs = [obj for obj in objs if obj.field not in field_vals]
By constructing a set first, we run a single query, construct a set in O(n) (with n the number of Models in the database), and then filter in O(m) (with m the number of objects in objs). So the algorithm is O(m+n).
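Putting the two pieces together with bulk_create from the question - a minimal sketch, assuming objs has already been built from the CSV rows:
# One query to collect the existing field values, one Python pass to
# drop duplicates, and a single bulk insert for everything new.
field_vals = set(Model.objects.values_list('field', flat=True))
new_objs = [obj for obj in objs if obj.field not in field_vals]
Model.objects.bulk_create(new_objs)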
Based on the question, however, it looks like you could probably save the effort of constructing these objects in the first place. You can use Django's get_or_create, like:
obj, created = Model.objects.get_or_create(field=foo)
with obj the object (either fetched from or created in the database), and created a boolean that is True if we had to create a new object.
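In the append-mode import, that might look like the following (a sketch: csv_rows and the 'field' column name are hypothetical stand-ins for the CSV parsing in the question):
# One query pair per row - slower than bulk_create, but it never
# creates an object whose field value already exists.
for row in csv_rows:
    obj, created = Model.objects.get_or_create(field=row['field'])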
Related
I have a QuerySet, let's call it qs, which is ordered by some attribute which is irrelevant to this problem. Then I have an object, let's call it obj. Now I'd like to know what index obj has in qs, as efficiently as possible. I know that I could use .index() from Python or possibly loop through qs comparing each object to obj, but what is the best way to go about doing this? I'm looking for high performance, and that's my only criterion.
Using Python 2.6.2 with Django 1.0.2 on Windows.
If you're already iterating over the queryset and just want to know the index of the element you're currently on, the compact and probably the most efficient solution is:
for index, item in enumerate(your_queryset):
    ...
However, don't use this if you have a queryset and an object obtained by some unrelated means, and want to learn the position of this object in the queryset (if it's even there).
If you just want to know where your object sits amongst all others (e.g. when determining rank), you can do it quickly by counting the objects before it:
index = MyModel.objects.filter(sortField__lt=myObject.sortField).count()
Assuming for the purpose of illustration that your models are standard with a primary key id, then evaluating
list(qs.values_list('id', flat=True)).index(obj.id)
will find the index of obj in qs. While the call to list() forces evaluation, it evaluates not the original queryset but a derived one: that evaluation runs a SQL query fetching only the id column, not wasting time on other fields.
QuerySets in Django are lazily evaluated iterables, not lists (for further details, see the Django documentation on QuerySets).
As such, there is no shortcut to get the index of an element, and I think a plain iteration is the best way to do it.
For starters, I would implement your requirement in the simplest way possible (like iterating); if you really run into performance issues, then I would use some different approach, like building a queryset with a smaller number of fields, or whatever.
In any case, the idea is to leave such tricks as late as possible, when you definitely know you need them.
Update: You may want to use some SQL statement directly to get the row number (something like ROW_NUMBER()). However, Django's ORM does not support this natively and you have to use a raw SQL query (see the documentation). I think this could be the best option, but again - only if you really see a real performance issue.
There is a simple Pythonic way to query the index of an element in a queryset:
(*qs,).index(instance)
This unpacks the queryset into a list, then uses Python's built-in index method to determine its position.
You can do this using queryset.extra(…) and some raw SQL like so:
queryset = queryset.order_by("id")
record500 = queryset[500]
numbered_qs = queryset.extra(select={
    'queryset_row_number': 'ROW_NUMBER() OVER (ORDER BY "id")'
})
from django.db import connection
cursor = connection.cursor()
cursor.execute(
    "WITH OrderedQueryset AS (" + str(numbered_qs.query) + ") "
    "SELECT queryset_row_number FROM OrderedQueryset WHERE id = %s",
    [record500.id]
)
index = cursor.fetchall()[0][0]
index == 501  # because row_number() is 1-indexed, not 0-indexed
I need to compare 2 querysets from the same model from 2 different databases.
I want the difference between them. In this case, I grab only one column (a CharField) from two databases and want to compare these "lists", i.e. it would be great to work with sets and the difference method of sets.
But I can't simply subtract querysets; set(queryset) and list(queryset) also give me nothing (not an error), i.e.
diff_set = set(articles1) - set(articles2)
I switched DBs on the fly, made 2 querysets, and tried to compare them (with filter or exclude):
articles1 = list(Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ'))
# right connection
connections.databases['tmp2']['HOST'] = db2.host
connections.databases['tmp2']['NAME'] = db2.name
articles2 = list(Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ'))
# okay to chain Smdocuments objects, gives all the entries
from itertools import chain
all_docs = list(chain(articles1, articles2))
# got nothing; even len(diff_set) is zero
diff_set = set(articles1) - set(articles2)
# this one raises the error: Subqueries aren't allowed across different databases.
articles_exclude = Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ')
len(articles1)
diff_ex = Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ').exclude(id__in=articles_exclude)
len(diff_ex)
diff_ex raises an error:
Subqueries aren't allowed across different databases. Force the inner
query to be evaluated using list(inner_query).
So, "Model objects" not so easy to manipulate, and querysets between difference databases as well.
I see, thats not a good db scheme, but it's another application with distributed db, and I need to compare them.
It's would be enough to compare by one column, but probably compare full queryset will work for future.
Or, should I convert queryset to list and compare raw data?
Your question is really unclear about what you actually expect, but here are a couple of hints anyway:
First, model instances (assuming they are instances of the same model of course) compare on their primary key value, which is also used as hash for dicts and sets, so if you want to compare the underlying database records you should not work on model instances but on the raw db values as lists of tuples or dicts. You can get those using (resp.) Queryset.values_list() or Queryset.values() - not forgetting to list() them so you really get a list and not a queryset.
Which brings us to the second important point: while presenting themselves as list-likes (in that they support len(), iteration, subscripting and - with some restrictions - slicing), QuerySets are NOT lists. You can't compare two querysets (well, you can, but they compare on identity, which means that two querysets will only be equal if they are actually the very same object), and, more importantly, using a queryset as an argument to a 'field__in=' lookup will result in a SQL subquery, whereas passing a proper list results in a plain 'field IN (...)' where clause. This explains the error you get with the exclude(...) approach.
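For example, the exclude(...) attempt from the question can be made to work by forcing the inner queryset into a real list first, exactly as the error message suggests:
# Evaluate the first queryset into a plain list of ids, so Django
# emits an "id IN (...)" clause instead of a cross-database subquery.
ids1 = list(Smdocuments.objects.using('tmp1')
            .filter(doctype__exact='CQ')
            .values_list('id', flat=True))
diff_ex = (Smdocuments.objects.using('tmp2')
           .filter(doctype__exact='CQ')
           .exclude(id__in=ids1))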
To make a long story short, if you want to effectively compare database rows, you want:
# the fields you want to compare records on
fields = 'field1', 'field2', 'fieldN'
rows1 = list(YourModel.objects.using('tmp1').filter(...).values_list(*fields))
rows2 = list(YourModel.objects.using('tmp2').filter(...).values_list(*fields))
# now you have two lists of tuples so you can apply ordinary python comparisons / set operations etc
print rows1 == rows2
print set(rows1) - set(rows2)
# etc
I'm currently using list comprehension inside dictionary comprehension to detect changes between 2 dictionaries with lists as values.
The code looks something like this:
detectedChanges = {table: [field for field in tableDict[table] if field not in fieldBlackList] for table in modifiedTableDict if table not in tableBlackList}
This will create a dictionary where each entry is a table name, and associated with it is a list of changes.
The problem I'm getting is that although this code works, the resulting structure detectedChanges is filled with entries that only contain a table name and an empty list (meaning that no changes were detected).
I'm currently doing a subsequent sweep through the dictionary to remove these entries, but I would like to avoid putting them in the dictionary in the first place.
Basically, if I could somehow do a length check or something over [field for field in tableDict[table]], I could validate it before creating the key:value entry.
Is there way to do this with the current method I'm using?
Although dict comprehensions are cool, they should not be misused. The following code is not much longer and it can be kept on a narrow screen as well:
detectedChanges = {}
for table, fields in modifiedTableDict.iteritems():
    if table not in tableBlackList:
        good_fields = [field for field in fields
                       if field not in fieldBlackList]
        if good_fields:
            detectedChanges[table] = good_fields
Just an addition to eumiro's answer. Please use their answer first, as it is more readable. However, if I'm not mistaken, comprehensions are in general faster, so there is one use case for this - but ONLY IF THIS IS A BOTTLENECK IN YOUR CODE. I cannot emphasize that enough.
detectedChanges = {table: [field for field in tableDict[table]
                           if field not in fieldBlackList]
                   for table in modifiedTableDict
                   if table not in tableBlackList
                   if set(tableDict[table]) - set(fieldBlackList)}
Notice how ugly this is. I enjoy doing things like this to get a better understanding of Python, and because I have had things like this be bottlenecks before. However, you should always profile before trying to solve issues that may not exist.
The addition to your code [...] if set(tableDict[table])-set(fieldBlackList) [...] creates a set of the entries in the current table, and a set of the blacklisted fields and gets the entries that are in the current table but not the blacklist. Empty sets evaluate to False causing the comprehension to ignore that table, the same as if it were in the tableBlackList variable. To make it more explicit, one could compare the result to an empty set or check whether it has a value in it at all.
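For illustration, a small self-contained version of that explicit check (the sample data here is made up):
tableDict = {'users': ['name', 'password'], 'secrets': ['password']}
fieldBlackList = ['password']

for table in tableDict:
    # Keep the table only if at least one field survives the blacklist;
    # comparing against an empty set makes the truthiness test explicit.
    if set(tableDict[table]) - set(fieldBlackList) != set():
        print table  # only 'users' survives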
Also, prefer the following for speed:
detectedChanges = {table: [field for field in fields
                           if field not in fieldBlackList]
                   for table, fields in modifiedTableDict.iteritems()
                   if table not in tableBlackList
                   if set(fields) - set(fieldBlackList)}
The results of an ORM query (e.g., MyObject.query()) need to be ordered according to a ranking algorithm that is based on values not within the database (i.e. from a separate search engine). This means 'order_by' will not work, since it only operates on fields within the database.
But, I don't want to convert the query results to a list, then reorder, because I want to maintain the ability to add further constraints to the query. E.g.:
results = MyObject.query()
results = my_reorder(results)
results = results.filter(some_constraint)
Is this possible to accomplish via SQLAlchemy?
I am afraid you will not be able to do it, unless the ordering can be derived from the fields of the object's table(s) and/or related objects' tables which are in the database.
But you could return from your code the tuple (query, order_func/result). In this case the query can still be extended until it is executed, and then re-sorted. Or you could create a small Proxy-like class, which will contain this tuple and will delegate query-extension methods to the query, while for query-execution methods (all(), __iter__, ...) it will execute the query and apply the ordering to the result.
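A minimal sketch of such a proxy, under the assumption that order_func is a plain Python key function supplied by the ranking code (all names here are hypothetical):
class OrderedQuery(object):
    """Wraps (query, order_func): extends the query lazily, applies
    the external ordering only when the query is executed."""

    def __init__(self, query, order_func):
        self._query = query
        self._order_func = order_func

    def filter(self, *args, **kwargs):
        # Query-extension methods delegate to the query and re-wrap.
        return OrderedQuery(self._query.filter(*args, **kwargs),
                            self._order_func)

    def all(self):
        # Query-execution methods run the query, then sort the fetched
        # objects with the external key function.
        return sorted(self._query.all(), key=self._order_func)

    def __iter__(self):
        return iter(self.all())
Other extension methods (order_by, join, ...) could be delegated the same way; only filter() is shown here.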
Also, if you could calculate the value for each MyObject instance beforehand, you could add a literal column to the query with the values and then use order_by to order by it. Alternatively, add temporary table, add rows with the computed ordering value, join on it in the query and add ordering. But I guess these are adding more complexity than the benefit they bring.
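For the literal-column idea, one possible sketch uses a CASE expression that maps primary keys to the externally computed ranks (rank_map is a hypothetical result from the search engine; MyObject and some_constraint are taken from the question, and the dict form of SQLAlchemy's case() is assumed):
from sqlalchemy import case

# Hypothetical ranking from the search engine: primary key -> rank.
rank_map = {7: 0, 3: 1, 12: 2}

# Rows not present in rank_map are pushed to the end.
ordering = case(rank_map, value=MyObject.id, else_=len(rank_map))

results = MyObject.query().order_by(ordering)
# The result is still a query, so constraints can be added afterwards:
results = results.filter(some_constraint)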
I tend to obsess over expressing code as compactly and succinctly as possible without sacrificing runtime efficiency.
Here's my code:
p_audio = plate.parts.filter(content__iendswith=".mp3")
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")
extra_context.update({
'p_audio': p_audio and p_audio[0],
'p_video': p_video and p_video[0],
'p_swf': p_swf and p_swf[0]
})
Are there any Python/Django gurus who can drastically shorten this code?
Actually, in your pursuit of compactness and efficiency, you have managed to come up with code that is terribly inefficient. This is because when you refer to p_audio or not p_audio, the queryset gets evaluated - and because you haven't sliced it before then, the entire result of the filter is fetched from the database, e.g. all the plate parts whose content ends with .mp3, and so on.
You should ensure you do the slice for each query first, before you refer to the value of that query. Since you're concerned with code compactness, you probably want to slice with [:1] first, to get a queryset of a single object:
p_audio = plate.parts.filter(content__iendswith=".mp3")[:1]
p_video = not p_audio and plate.parts.filter(content__iendswith=".flv")[:1]
p_swf = not p_audio and not p_video and plate.parts.filter(content__iendswith=".swf")[:1]
and the rest can stay the same.
Edit to add: this works because you're only interested in the first element of each list, as evidenced by the fact that you only pass [0] from each element into the context. But in your code, not p_audio refers to the original, unsliced queryset: to determine the true/false value of the qs, Django has to evaluate it, which fetches all matching elements from the database and converts them into Python objects. Since you don't actually want those objects, you're doing a lot more work than you need to.
Note though that it's not re-running it every time: just the first time, since after the first evaluation the queryset is cached internally. But as I say, that's already more work than you want.
Besides featuring less redundancy, this is also way easier to extend with new content types.
kinds = (("p_audio", ".mp3"), ("p_video", ".flv"), ("p_swf", ".swf"))
extra_context.update((key, False) for key, _ in kinds)
for key, ext in kinds:
    entries = plate.parts.filter(content__iendswith=ext)
    if entries:
        extra_context[key] = entries[0]
        break
Just adding this as another answer inspired by Pyroscope's above (as my edit there has to be peer reviewed)
The latest incarnation exploits the fact that the Django template system simply disregards nonexistent context items when they are referenced, so mp3, etc. below do not need to be initialized to False (or 0). So, the following meets all the functionality of the code from the OP. The other optimization is that mp3, etc. are used as key names (instead of "p_audio", etc.).
for key in ['mp3', 'flv', 'swf']:
    entries = plate.parts.filter(content__iendswith=key)[:1]
    extra_context[key] = entries and entries[0]
    if extra_context[key]:
        break