Django compare queryset from different databases - python

I need to compare 2 querysets from the same model from 2 different databases.
I want the difference between them. In this case I fetch only one column (a CharField) from each of the two databases and want to compare these "lists", so it would be great to work with sets and their difference methods.
But I can't simply subtract querysets, and set(queryset) or list(queryset) doesn't help either: this gives me nothing (and no error), i.e.
diff_set = set(articles1) - set(articles2)
I switch databases on the fly, build 2 querysets and try to compare them (with filter or exclude):
articles1 = list(Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ'))
# right connection
connections.databases['tmp2']['HOST'] = db2.host
connections.databases['tmp2']['NAME'] = db2.name
articles2 = list(Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ'))
# okay to chain Smdocuments objects, gives all the entries
all = list(chain(articles1, articles2))
# got nothing, even len(diff_set) is 0
diff_set = set(articles1) - set(articles2)
# this one raises the error "Subqueries aren't allowed across different databases."
articles_exclude = Smdocuments.objects.using('tmp1').only('id').filter(doctype__exact='CQ')
len(articles1)
diff_ex = Smdocuments.objects.using('tmp2').only('id').filter(doctype__exact='CQ').exclude(id__in=articles_exclude)
len(diff_ex)
diff_ex raises an error:
Subqueries aren't allowed across different databases. Force the inner
query to be evaluated using list(inner_query).
So "model objects" are not so easy to manipulate, and neither are querysets from different databases.
I know that's not a good db schema, but it's another application with a distributed db, and I need to compare them.
Comparing by one column would be enough, but comparing full querysets would probably be useful in the future.
Or should I convert the querysets to lists and compare the raw data?

Your question is really unclear about what you actually expect, but here are a couple hints anyway:
First, model instances (assuming they are instances of the same model of course) compare on their primary key value, which is also used as hash for dicts and sets, so if you want to compare the underlying database records you should not work on model instances but on the raw db values as lists of tuples or dicts. You can get those using (resp.) Queryset.values_list() or Queryset.values() - not forgetting to list() them so you really get a list and not a queryset.
Which brings us to the second important point: while presenting themselves as list-likes (in that they support len(), iteration, subscripting and - with some restrictions - slicing), Querysets are NOT lists. You can't compare two querysets (well you can but they compare on identity, which means that two querysets will only be equal if they are actually the very same object), and, more important, using a queryset as an argument to a 'field__in=' lookup will result in a SQL subquery where passing a proper list results in a plain 'field IN (...)' where clause. This explains the error you get with the exclude(...) approach.
To make a long story short, if you want to effectively compare database rows, you want:
# the fields you want to compare records on
fields = 'field1', 'field2', 'fieldN'
rows1 = list(YourModel.objects.using('tmp1').filter(...).values_list(*fields))
rows2 = list(YourModel.objects.using('tmp2').filter(...).values_list(*fields))
# now you have two lists of tuples, so you can apply ordinary Python comparisons, set operations, etc.
print(rows1 == rows2)
print(set(rows1) - set(rows2))
# etc
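The primary-key point can be illustrated without Django. The sketch below uses a hypothetical Record class (not a Django model) whose equality and hash depend only on pk, on the assumption that this mirrors how model instances behave. It shows why a set difference of instances can come back empty while a set difference of raw value tuples does not.

```python
class Record:
    """Hypothetical stand-in for a model instance: compares and hashes by pk only."""

    def __init__(self, pk, title):
        self.pk = pk
        self.title = title

    def __eq__(self, other):
        return isinstance(other, Record) and self.pk == other.pk

    def __hash__(self):
        return hash(self.pk)


# Same primary keys in both "databases", but one title differs.
db1 = [Record(1, "alpha"), Record(2, "beta")]
db2 = [Record(1, "alpha"), Record(2, "BETA changed")]

# Instances compare by pk, so the difference is empty.
instance_diff = set(db1) - set(db2)

# Raw value tuples compare field by field, so the changed row shows up.
rows1 = [(r.pk, r.title) for r in db1]
rows2 = [(r.pk, r.title) for r in db2]
row_diff = set(rows1) - set(rows2)

print(instance_diff)
print(row_diff)
```

In real code the tuples would come from values_list() as shown above rather than from a list comprehension.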

Related

session.execute returns model objects instead of actual data

I have switched from connection.execute to session.execute. I am not able to get usable data from it. The results seem to contain references to models instead of actual row data.
with Session(engine) as s:
    q = select(WarrantyRequest)
    res = s.execute(q)
    keys = res.keys()
    data_list = res.all()
    print(keys) # should print list of column names
    print(data_list) # should print list of lists with row data
    dict_list = s.execute(q).mappings().all()
    print(dict_list) # should print list of dicts with column names as keys
It prints
RMKeyView(['WarrantyRequest'])
[(<models.mock.WarrantyRequest object at 0x7f4d065d3df0>,), ...]
[{'WarrantyRequest': <models.mock.WarrantyRequest object at 0x7f002b464df0>}, ... ]
When doing the same with connection.execute, I got the expected results.
What am I missing?
There is this paragraph in the docs which kind of describes the behaviour, but I am not able to tell what I am supposed to do to get data out of it.
It’s important to note that while methods of Query such as Query.all() and Query.one() will return instances of ORM mapped objects directly in the case that only a single complete entity were requested, the Result object returned by Session.execute() will always deliver rows (named tuples) by default; this is so that results against single or multiple ORM objects, columns, tables, etc. may all be handled identically.
If only one ORM entity was queried, the rows returned will have exactly one column, consisting of the ORM-mapped object instance for each row. To convert these rows into object instances without the tuples, the Result.scalars() method is used to first apply a “scalars” filter to the result; then the Result can be iterated or deliver rows via standard methods such as Result.all(), Result.first(), etc.
Querying a model
q = select(WarrantyRequest)
returns rows that consist of individual model instances. To get rows of raw data instead, query the model's __table__ attribute:
q = select(WarrantyRequest.__table__)
SQLAlchemy's ORM layer presents database tables and rows in an object-oriented way, on the assumption that the programmer wants to work with objects and their attributes rather than raw data.

Django: What's the difference between Queryset.union() and the OR operator?

When combining QuerySets, what's the difference between the QuerySet.union() method and using the OR operator between QuerySets |?
Consider the following 2 QuerySets:
qs1 = Candidate.objects.filter(id=1)
qs2 = Candidate.objects.filter(id=2)
How is qs1 | qs2 different from qs1.union(qs2)? Is there some subtlety under the hood that I'm missing?
From the QuerySet API reference
union(), intersection(), and difference() return model instances of the type of the first QuerySet even if the arguments are QuerySets of other models.
The .union() method returns a QuerySet with the schema/column names of only the first QuerySet passed, whereas this is not the case with the OR (|) operator.
From the QuerySet API reference:
The UNION operator selects only distinct values by default. To allow duplicate values, use the all=True argument.
The .union() method allows some granularity in specifying whether to keep or eliminate duplicate records returned. This choice is not available with the OR operator.
Also, QuerySets created by a .union() call cannot have .distinct() called on them.
This is more a SQL question than a Django question. In the example you post, the Django ORM will translate qs1 | qs2 as something along the lines of
SELECT * FROM candidate WHERE id=1 OR id=2
whereas in qs1.union(qs2) it will be something like
SELECT * FROM candidate WHERE id=1 UNION SELECT * FROM candidate WHERE id=2
In this particular example there will be no difference, however I don't believe anyone would write it with UNION.
If you have an expensive query, there will be a difference in timing when you choose one form over the other. You can use EXPLAIN to experiment. In some tests I ran, UNION took much longer to return the first row, but finished a bit faster.
If query optimization is not an issue, it is more common to use OR.
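The SQL-level difference can be tried out with the standard library's sqlite3 module. This is only a sketch: the candidate table and its rows are made up for illustration, and the SQL is written by hand rather than generated by Django.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidate (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO candidate (id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cy")],
)

# Roughly what qs1 | qs2 amounts to: one SELECT with an OR condition.
or_rows = conn.execute(
    "SELECT id, name FROM candidate WHERE id = 1 OR id = 2 ORDER BY id"
).fetchall()

# Roughly what qs1.union(qs2) amounts to: two SELECTs joined by UNION.
union_rows = conn.execute(
    "SELECT id, name FROM candidate WHERE id = 1 "
    "UNION SELECT id, name FROM candidate WHERE id = 2 ORDER BY id"
).fetchall()

# With overlapping conditions, UNION drops duplicates and UNION ALL keeps them.
union_distinct = conn.execute(
    "SELECT id FROM candidate WHERE id <= 2 "
    "UNION SELECT id FROM candidate WHERE id >= 2 ORDER BY id"
).fetchall()
union_all = conn.execute(
    "SELECT id FROM candidate WHERE id <= 2 "
    "UNION ALL SELECT id FROM candidate WHERE id >= 2 ORDER BY id"
).fetchall()

print(or_rows == union_rows)
print(union_distinct, union_all)
```

UNION ALL corresponds to passing all=True to union() in Django.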
The UNION operator is used to combine the result sets of two or more querysets. The querysets can be from the same or from different models. When the querysets are from different models, the fields and their datatypes should match.
As the definition above says, qs1.union(qs2) combines the results of both queries.
The OR operator, by contrast, is a boolean operator: rather than concatenating two result sets, it combines the conditions of both querysets into a single query that returns rows matching either one, and for the OR operator both querysets must come from the same model.

Index of row looping over django queryset [duplicate]

I have a QuerySet, let's call it qs, which is ordered by some attribute that is irrelevant to this problem. Then I have an object, let's call it obj. Now I'd like to know what index obj has in qs, as efficiently as possible. I know that I could use .index() from Python or possibly loop through qs comparing each object to obj, but what is the best way to go about doing this? I'm looking for high performance and that's my only criterion.
Using Python 2.6.2 with Django 1.0.2 on Windows.
If you're already iterating over the queryset and just want to know the index of the element you're currently on, the compact and probably the most efficient solution is:
for index, item in enumerate(your_queryset):
...
However, don't use this if you have a queryset and an object obtained by some unrelated means, and want to learn the position of this object in the queryset (if it's even there).
If you just want to know where your object sits amongst all others (e.g. when determining rank), you can do it quickly by counting the objects before it:
index = MyModel.objects.filter(sortField__lt = myObject.sortField).count()
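The counting idea can be checked against plain SQL using sqlite3. The player table, its columns and the helper index_by_count are hypothetical names chosen for this sketch. It also shows one caveat worth knowing: when several rows share the same sort value, the count gives the position of the first of them, not necessarily of the row you asked about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO player (name, score) VALUES (?, ?)",
    [("ann", 50), ("bob", 70), ("cy", 90), ("dee", 70)],
)


def index_by_count(connection, score):
    """Count the rows that sort strictly before the given score."""
    (count,) = connection.execute(
        "SELECT COUNT(*) FROM player WHERE score < ?", (score,)
    ).fetchone()
    return count


# Reference ordering, with name as a tie-breaker.
ordered = [
    name
    for (name,) in conn.execute("SELECT name FROM player ORDER BY score, name")
]

index_of_cy = index_by_count(conn, 90)   # unique score: matches the real index
index_of_dee = index_by_count(conn, 70)  # tied score: index of the first tied row

print(ordered, index_of_cy, index_of_dee)
```

If the sort field is not unique, adding a tie-breaker such as the primary key to both the ordering and the count condition avoids the ambiguity.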
Assuming for the purpose of illustration that your models are standard with a primary key id, then evaluating
list(qs.values_list('id', flat=True)).index(obj.id)
will find the index of obj in qs. While the use of list evaluates the queryset, it evaluates not the original queryset but a derived queryset. This evaluation runs a SQL query to get the id fields only, not wasting time fetching other fields.
QuerySets in Django are lazily evaluated iterables, not lists (for further details, see Django documentation on QuerySets).
As such, there is no shortcut to get the index of an element, and I think a plain iteration is the best way to do it.
For a start, I would implement your requirement in the simplest way possible (like iterating); if you really have performance issues, then I would use a different approach, like building a queryset with fewer fields, or whatever.
In any case, the idea is to leave such tricks as late as possible, until you definitely know you need them.
Update: You may want to use a SQL statement directly to get the row number (something like …). However, Django's ORM does not support this natively and you have to use a raw SQL query (see documentation). I think this could be the best option, but again, only if you really see a real performance issue.
There is a simple Pythonic way to query the index of an element in a queryset:
(*qs,).index(instance)
This unpacks the queryset into a tuple, then uses the built-in index method to determine its position.
You can do this using queryset.extra(…) and some raw SQL like so:
queryset = queryset.order_by("id")
record500 = queryset[500]
numbered_qs = queryset.extra(select={
'queryset_row_number': 'ROW_NUMBER() OVER (ORDER BY "id")'
})
from django.db import connection
cursor = connection.cursor()
cursor.execute(
"WITH OrderedQueryset AS (" + str(numbered_qs.query) + ") "
"SELECT queryset_row_number FROM OrderedQueryset WHERE id = %s",
[record500.id]
)
index = cursor.fetchall()[0][0]
index == 501 # because row_number() is 1 indexed not 0 indexed
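The same ROW_NUMBER() idea can be tried in isolation with sqlite3, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The item table and its contents are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (id, name) VALUES (?, ?)",
    [(10, "a"), (20, "b"), (30, "c"), (40, "d")],
)

# Number the rows in the desired order, then look up a single row's number.
sql = """
WITH ordered AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM item
)
SELECT rn FROM ordered WHERE id = ?
"""
(position,) = conn.execute(sql, (30,)).fetchone()

# ROW_NUMBER() starts at 1, so subtract 1 for a Python-style index.
zero_based_index = position - 1

print(position, zero_based_index)
```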

Django intersection of non commited model objects and commited model objects

What I have is a list of model objects I haven't run bulk_create on:
objs = [Model(id=1, field=foo), Model(id=2, field=bar)]
What I'd like to do is intersect objs with Model.objects.all() and return only those objects which aren't already in the database (based on the field value).
For example, if my database was:
[Model(id=3, field=foo)]
Then the resulting objs should be:
objs = [Model(id=2, field=bar)]
Is something like this possible?
Edit:
So a bit of further explanation:
I have an import command, and I'm trying to add an --append flag to it.
Without the flag, I delete the tables, and start fresh.
With the flag, I want to bulk create a large number of objects (single creation is much slower - I've checked), and I don't want to have objects with the same field values but different ids.
I've already tried filtering out duplicates after insertion, but it's quite slow and I wanted to test this approach to see if it's faster.
The objects are read from CSV files, and it's faster to make a list of Model, and then bulk_create, as opposed to running create on each row.
We can probably best do this by first constructing a set of values the field column in the database has:
field_vals = set(Model.objects.values_list('field', flat=True).distinct())
and then we can perform a filtering like:
filtered_objs = [obj for obj in objs if obj.field not in field_vals]
By constructing a set first, we run a single query, construct a set in O(n) (with n the number of Models in the database), and then filter in O(m) (with m the number of objects in objs). So the algorithm is O(m+n).
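The set-based filtering can be sketched in plain Python. Item is a hypothetical stand-in for unsaved model instances, and existing_fields plays the role of the set built from values_list. Tracking values already seen within the batch is an addition not present in the answer above; it is there so that two new objects with the same field value are not both kept.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for an unsaved model instance."""
    id: int
    field: str


def filter_new(objs, existing_fields):
    """Keep only objects whose field value is not already known."""
    seen = set(existing_fields)  # O(n) to build, O(1) per lookup
    fresh = []
    for obj in objs:
        if obj.field not in seen:
            fresh.append(obj)
            seen.add(obj.field)  # also skip later duplicates in the same batch
    return fresh


existing_fields = {"foo"}  # values already stored in the database
objs = [Item(1, "foo"), Item(2, "bar"), Item(3, "bar")]

filtered_objs = filter_new(objs, existing_fields)
print(filtered_objs)
```

The filtered list is what would then be passed to bulk_create.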
Based on the question however, it looks like you could probably save the effort of constructing these objects in the first place. You can use Django's get_or_create function. And use it like:
obj, created = Model.objects.get_or_create(field=foo)
with obj the object (either fetched from the database, or created in the database), and created a boolean that is True if we had to create a new object.

Reorder of SQLAlchemy Query results based on external ranking

The results of an ORM query (e.g., MyObject.query()) need to be ordered according to a ranking algorithm that is based on values not within the database (i.e. from a separate search engine). This means 'order_by' will not work, since it only operates on fields within the database.
But, I don't want to convert the query results to a list, then reorder, because I want to maintain the ability to add further constraints to the query. E.g.:
results = MyObject.query()
results = my_reorder(results)
results = results.filter(some_constraint)
Is this possible to accomplish via SQLAlchemy?
I am afraid you will not be able to do it, unless the ordering can be derived from the fields of the object's table(s) and/or related objects' tables which are in the database.
But you could return from your code the tuple (query, order_func/result). In this case the query can be still extended until it is executed, and then resorted. Or you could create a small Proxy-like class, which will contain this tuple, and will delegate the query-extension methods to the query, and query-execution methods (all(), __iter__, ...) to the query and apply ordering when executed.
Also, if you could calculate the value for each MyObject instance beforehand, you could add a literal column to the query with the values and then use order_by to order by it. Alternatively, add a temporary table, insert rows with the computed ordering value, join on it in the query and add ordering. But I guess these add more complexity than the benefit they bring.
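The proxy idea mentioned above might look roughly like the sketch below. RankedQuery and FakeQuery are hypothetical names: FakeQuery only imitates the two methods used here (filter and all) and is not SQLAlchemy's Query, so a real wrapper would need to delegate whichever methods your code actually calls.

```python
class RankedQuery:
    """Wraps a query-like object and applies an external ranking on execution."""

    def __init__(self, query, rank_key):
        self._query = query
        self._rank_key = rank_key

    def filter(self, *args, **kwargs):
        # Query-extension methods are delegated; the ranking travels along.
        return RankedQuery(self._query.filter(*args, **kwargs), self._rank_key)

    def all(self):
        # Query-execution methods run the query, then reorder in Python.
        return sorted(self._query.all(), key=self._rank_key)

    def __iter__(self):
        return iter(self.all())


class FakeQuery:
    """Minimal in-memory stand-in so the sketch runs without a database."""

    def __init__(self, items):
        self._items = list(items)

    def filter(self, predicate):
        return FakeQuery(x for x in self._items if predicate(x))

    def all(self):
        return list(self._items)


# Ranking that lives outside the database, e.g. from a search engine.
external_rank = {"a": 3, "b": 1, "c": 2}

ranked = RankedQuery(FakeQuery(["a", "b", "c"]), rank_key=external_rank.__getitem__)
narrowed = ranked.filter(lambda name: name != "c")  # still extendable
result = narrowed.all()
print(result)
```

The ordering is applied only when results are fetched, so further constraints can still be added after the reorder step has been requested.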
