I have two examples of code which accomplish the same thing. One is using python, the other is in SQL.
Exhibit A (Python):
surveys = Survey.objects.all()
consumer = Consumer.objects.get(pk=24)
for ballot in consumer.ballot_set.all()
consumer_ballot_list.append(ballot.question_id)
for survey in surveys:
if survey.id not in consumer_ballot_list:
consumer_survey_list.append(survey.id)
Exhibit B (SQL):
SELECT * FROM clients_survey WHERE id NOT IN (SELECT question_id FROM consumers_ballot WHERE consumer_id=24) ORDER BY id;
I want to know how I can make exhibit A much cleaner and more efficient using Django's ORM and subqueries.
In this example:
I have ballots which contain a question_id that refers to the survey which a consumer has answered.
I want to find all of the surveys that the consumer hasn't answered. So I need to check each question_id(survey.id) in the consumer's set of ballots against the survey model's id's and make sure that only the surveys that the consumer does NOT have a ballot of are returned.
You more or less have the correct idea. To replicate your SQL code using Django's ORM you just have to break the SQL into each discrete part:
1.create table of question_ids the consumer 24 has answered
2.filter the survey for all ids not in the aformentioned table
consumer = Consumer.objects.get(pk=24)
# step 1
answered_survey_ids = consumer.ballot_set.values_list('question_id', flat=True)
# step 2
unanswered_surveys_ids = Survey.objects.exclude(id__in=answered_survey_ids).values_list('id', flat=True)
This is basically what you did in your current python based approach except I just took advantage of a few of Django's nice ORM features.
.values_list() - this allows you to extract a specific field from all the objects in the given queryset.
.exclude() - this is the opposite of .filter() and returns all items in the queryset that don't match the condition.
__in - this is useful if we have a list of values and we want to filter/exclude all items that match those values.
Hope this helps!
Related
I've 2 DynamicDocuments:
class Tasks(db.DynamicDocument):
task_id = db.UUIDField(primary_key=True,default=uuid.uuid4)
name = db.StringField()
flag = db.IntField()
class UserTasks(db.DynamicDocument):
user_id = db.ReferenceField('User')
tasks = db.ListField(db.ReferenceField('Tasks'),default=list)
I want to filter the UserTasks document by checking whether the flag value (from Tasks Document) of the given task_id is 0 or 1, given the task_id and user_id. So I query in the following way:-
obj = UserTasks.objects.get(user_id=user_id,tasks=task_id)
This fetches me an UserTask object.
Now I loop around the task list and first I get the equivalent task and then check its flag value in the following manner.
task_list = obj.tasks
for t in task_list:
if t['task_id'] == task_id:
print t['flag']
Is there any better/direct way of querying UserTasks Document in order to fetch the flag value of Tasks Document.
PS : I could have directly fetched flag value from the Tasks Document, but I also need to check whether the task is associated with the user or not. Hence I directly queried the USerTasks document.
Can we directly filter on a document with the ReferenceField's fields in a single query?
No, its not possible to directly filter a document with the fields of ReferenceField as doing this would require joins and mongodb does not support joins.
As per MongoDB docs on database references:
MongoDB does not support joins. In MongoDB some data is denormalized,
or stored with related data in documents to remove the need for joins.
From another page on the official site:
If we were using a relational database, we could perform a join on
users and stores, and get all our objects in a single query. But
MongoDB does not support joins and so, at times, requires bit of
denormalization.
Relational purists may be feeling uneasy already, as if we were
violating some universal law. But let’s bear in mind that MongoDB
collections are not equivalent to relational tables; each serves a
unique design objective. A normalized table provides an atomic,
isolated chunk of data. A document, however, more closely represents
an object as a whole.
So in 1 query, we can't both filter tasks with a particular flag value and with the given user_id and task_id on the UserTasks model.
How to perform the filtering then?
To perform the filtering as per the required conditions, we will need to perform 2 queries.
In the first query we will try to filter the Tasks model with the given task_id and flag. Then, in the 2nd query, we will filter UserTasks model with the given user_id and the task retrieved from the first query.
Example:
Lets say we have a user_id, task_id and we need to check if the related task has flag value as 0.
1st Query
We will first retrive the my_task with the given task_id and flag as 0.
my_task = Tasks.objects.get(task_id=task_id, flag=0) # 1st query
2nd Query
Then in the 2nd query, you need to filter on UserTask model with the given user_id and my_task object.
my_user_task = UserTasks.objects.get(user_id=user_id, tasks=my_task) # 2nd query
You should perform 2nd query only if you get a my_task object with the given task_id and flag value. Also, you will need to add error handling in case there are no matched objects.
What if we have used EmbeddedDocument for the Tasks model?
Lets say we have defined our Tasks document as an EmbeddedDocument and the tasks field in UserTasks model as an EmbeddedDocumentField, then to do the desired filtering we could have done something like below:
my_user_task = UserTasks.objects.get(user_id=user_id, tasks__task_id=task_id, tasks__flag=0)
Getting the particular my_task from the list of tasks
The above query will return a UserTask document which will contain all the tasks. We will then need to perform some sort of iteration to get the desired task.
For doing that, we can perform list comprehension using enumerate().
Then the desired index will be the 1st element of the 1-element list returned.
my_task_index = [i for i,v in enumerate(my_user_task.tasks) if v.flag==0][0]
#Praful, based on your schema you need two queries because mongodb does not have joins, so if you want to get "all the data" in one query you need a schema which fit that case.
ReferenceField is a special field which does a lazy load of the other collection (it requires a query).
Based on the query you need, I recommend you to change your schema to fit that. The idea behind NOSQL engines is "denormalization" so it is not bad to have a list of EmbeddedDocument. EmbeddedDocument can be a smaller document (denormalized version) with a set of fields instead of all of them.
If you do not want to load the whole document into memory while querying you can exclude that fields using a "projection".
Supossing your UserTasks has a list of EmbeddedDocument with the task you could do:
UserTasks.objects.exclude('tasks').filter(**filters)
I hope it helps you.
Good luck!
I'm running into an issue that I can't find an explanation for.
Given one object (in this case, an "Article"), I want to use another type of object (in this case, a "Category") to determine which other articles are most similar to article X, as measured by the number of categories they have in common. The relationship between Article and Category is Many-to-Many. The use case is to get a quick list of related Objects to present as links.
I know exactly how I would write the SQL by hand:
select
ac.article_id
from
Article_Category ac
where
ac.category_id in
(
select
category_id
from
Article_Category
where
article_id = 1 -- get all categories for article in question
)
and ac.article_id <> 1
group by
ac.article_id
order by
count(ac.category_id) desc, random() limit 5
What I'm struggling with is how to use the Django Model aggregation to match this logic and only run one query. I'd obv. prefer to do it within the framework if possible. Does anybody have pointers on this?
Adding this in now that I've found a way within the model framework to do this.
related_article_list = Article.objects.filter(category=self.category.all())\
.exclude(id=self.id)
related_article_ids = related_article_list.values('id')\
.annotate(count=models.Count('id'))\
.order_by('-count','?')
In the related_article_list part, other Article objects that match on 2 or more Categories will be included separate times. Thus, when using annotation to count them the number will be > 1 and they can be ordered that way.
I think the correct answer if you really want to filter articles on all category should look like this:
related_article_list = Article.objects.filter(category__in=self.category.all())\
.exclude(id=self.id)
I'm trying to do ranking of a QuerySet efficiently (keeping it a QuerySet so I can keep the filter and order_by functions), but cannot seem to find any other way then to iterate through the QuerySet and tack on a rank. I dont want to add rank to my model if I don't have to.
I know how I can get the values I need through SQL query, but can't seem to translate that into Django:
SET #rank = 0, #prev_val = NULL;
SELECT rank, name, school, points FROM
(SELECT #rank := IF(#prev_val = points, #rank, #rank+1) AS rank, #prev_val := points, points, CONCAT(users.first_name, ' ', users.last_name) as name, school.name as school
FROM accounts_userprofile
JOIN schools_school school ON school_id = school.id
JOIN auth_user users ON user_id = users.id
ORDER BY points DESC) as profile
ORDER BY rank DESC
I found that if I did iterate through the QuerySet and tacked on 'rank' manually and then further filtered the results, my 'rank' would disappear - unless is turned it into a list (which made filtering and sorting a bit of pain). Is there any other way you can think of to add rank to my QuerySet? Is there any way I could do the above query and get a QuerySet with filter and order_by functions still intact? I'm currently using the jQuery DataTables with Django to generate a leaderboard with pagination (which is why I need to preserver filtering and order_by).
Thanks in advance! Sorry if I did not post my question correctly - any help would be much appreciated.
I haven't used it myself, but I'm pretty sure you can do that with the extra() method.
Looking at your raw SQL I don't see anything special in your logic. You are simply enumerating all SQL rows on the join ordered by points with a counter from 1, while collapsing the same point values to a same rank.
My suggestion would be to write a custom manager that uses raw() or extra() method. In your manager you would use python's enumerate on all model instances as a rank previously ordered by points. Of course you would have to keep current max value of points and override what enumerate returns to you if they have the same amount of points. Look here for an example of something similar.
Then you could do something like:
YourQuerySet.objects.with_rankings().all()
Using SQLAlchemy, I have a one to many relation with two tables - users and scores. I am trying to query the top 10 users sorted by their aggregate score over the past X amount of days.
users:
id
user_name
score
scores:
user
score_amount
created
My current query is:
top_users = DBSession.query(User).options(eagerload('scores')).filter_by(User.scores.created > somedate).order_by(func.sum(User.scores).desc()).all()
I know this is clearly not correct, it's just my best guess. However, after looking at the documentation and googling I cannot find an answer.
EDIT:
Perhaps it would help if I sketched what the MySQL query would look like:
SELECT user.*, SUM(scores.amount) as score_increase
FROM user LEFT JOIN scores ON scores.user_id = user.user_id
WITH scores.created_at > someday
ORDER BY score_increase DESC
The single-joined-row way, with a group_by added in for all user columns although MySQL will let you group on just the "id" column if you choose:
sess.query(User, func.sum(Score.amount).label('score_increase')).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by("score increase desc")
Or if you just want the users in the result:
sess.query(User).\
join(User.scores).\
filter(Score.created_at > someday).\
group_by(User).\
order_by(func.sum(Score.amount))
The above two have an inefficiency in that you're grouping on all columns of "user" (or you're using MySQL's "group on only a few columns" thing, which is MySQL only). To minimize that, the subquery approach:
subq = sess.query(Score.user_id, func.sum(Score.amount).label('score_increase')).\
filter(Score.created_at > someday).\
group_by(Score.user_id).subquery()
sess.query(User).join((subq, subq.c.user_id==User.user_id)).order_by(subq.c.score_increase)
An example of the identical scenario is in the ORM tutorial at: http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#selecting-entities-from-subqueries
You will need to use a subquery in order to compute the aggregate score for each user. Subqueries are described here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=subquery#using-subqueries
I am assuming the column (not the relation) you're using for the join is called Score.user_id, so change it if this is not the case.
You will need to do something like this:
DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
However this will result in tuples of (user_id, total_score). I'm not sure if the computed score is actually important to you, but if it is, you will probably want to do something like this:
users_scores = []
q = DBSession.query(Score.user_id, func.sum(Score.score_amount).label('total_score')).group_by(Score.user_id).filter(Score.created > somedate).order_by('total_score DESC')[:10]
for user_id, total_score in q:
user = DBSession.query(User)
users_scores.append((user, total_score))
This will result in 11 queries being executed, however. It is possible to do it all in a single query, but due to various limitations in SQLAlchemy, it will likely create a very ugly multi-join query or subquery (dependent on engine) and it won't be very performant.
If you plan on doing something like this often and you have a large amount of scores, consider denormalizing the current score onto the user table. It's more work to upkeep, but will result in a single non-join query like:
DBSession.query(User).order_by(User.computed_score.desc())
Hope that helps.
Can anyone turn me to a tutorial, code or some kind of resource that will help me out with the following problem.
I have a table in a mySQL database. It contains an ID, Timestamp, another ID and a value. I'm passing it the 'main' ID which can uniquely identify a piece of data. However, I want to do a time search on this piece of data(therefore using the timestamp field). Therefore what would be ideal is to say: between the hours of 12 and 1, show me all the values logged for ID = 1987.
How would I go about querying this in Django? I know in mySQL it'd be something like less than/greater than etc... but how would I go about doing this in Django? i've been using Object.Filter for most of database handling so far. Finally, I'd like to stress that I'm new to Django and I'm genuinely stumped!
If the table in question maps to a Django model MyModel, e.g.
class MyModel(models.Model):
...
primaryid = ...
timestamp = ...
secondaryid = ...
valuefield = ...
then you can use
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=<min_timestamp>
).exclude(
timestamp__gt=<max_timestamp>
).values_list('valuefield', flat=True)
This selects entries with the primaryid 1987, with timestamp values between <min_timestamp> and <max_timestamp>, and returns the corresponding values in a list.
Update: Corrected bug in query (filter -> exclude).
I don't think Vinay Sajip's answer is correct. The closest correct variant based on his code is:
MyModel.objects.filter(
primaryid=1987
).exclude(
timestamp__lt=min_timestamp
).exclude(
timestamp__gt=max_timestamp
).values_list('valuefield', flat=True)
That's "exclude the ones less than the minimum timestamp and exclude the ones greater than the maximum timestamp." Alternatively, you can do this:
MyModel.objects.filter(
primaryid=1987
).filter(
timestamp__gte=min_timestamp
).exclude(
timestamp__gte=max_timestamp
).values_list('valuefield', flat=True)
exclude() and filter() are opposites: exclude() omits the identified rows and filter() includes them. You can use a combination of them to include/exclude whichever you prefer. In your case, you want to exclude() those below your minimum time stamp and to exclude() those above your maximum time stamp.
Here is the documentation on chaining QuerySet filters.