Efficiently counting leaves in a tree stored in a db

I have a Django project that has two models: Group and Person. Groups can contain either Person objects or other Group objects. Groups cannot form a cycle (i.e. Group A containing Group B containing Group A), which results in a tree structure where Person objects are leaves.
My question is - how can I count all the contained Group objects and Person objects within a high level Group (like the root Group) with as few SQL queries as possible?
A naive approach with O(N) (where N is # of subgroups) SQL queries would be:
import operator
from django.db import models
class Group(models.Model):
    name = models.CharField(max_length=150)
    parent_group = models.ForeignKey('self', related_name='child_groups', null=True, blank=True)
    # returns tuple (# of subgroups, # of person objects)
    def count_objects(self):
        count = (self.child_groups.count(), self.people.count())
        for child_group in self.child_groups.all():
            # this adds tuples together ( e.g: (1,2) and (1,2) make (2,4) )
            count = tuple(map(operator.add, count, child_group.count_objects()))
        return count
class Person(models.Model):
    user = models.ForeignKey(User)
    picture = models.ImageSpecField(...)
    group = models.ForeignKey('Group', related_name="people")
Is there a way to improve this or should I just store these values within the Group object?

So this is an existing problem that many others have tackled. If you're using Django, check this out:
http://django-mptt.github.com/django-mptt/index.html

Within Postgres you could use recursive queries, although there is no direct support for this in Django.
Alternatively you could consider denormalising the count, possibly there are libraries to do this. A quick google gave me: http://pypi.python.org/pypi/django-composition/
If you have to select the same values quite often and they don't change that much, you could try caching them.
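To make the recursive-query idea concrete, here is a minimal sketch using Python's built-in sqlite3 module, whose WITH RECURSIVE syntax is close to what PostgreSQL accepts. The table and column names (grp, person, parent_id, group_id) and the helper count_descendants are assumptions made for illustration, not the tables Django would generate for the models in the question.

```python
import sqlite3


def count_descendants(conn, root_id):
    """Return (number of subgroups, number of people) below root_id in one query."""
    row = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            -- start from the chosen root group
            SELECT id FROM grp WHERE id = :root
            UNION ALL
            -- then repeatedly add the children of groups already found
            SELECT g.id FROM grp AS g JOIN subtree AS s ON g.parent_id = s.id
        )
        SELECT
            (SELECT COUNT(*) FROM subtree WHERE id <> :root),
            (SELECT COUNT(*) FROM person
              WHERE group_id IN (SELECT id FROM subtree))
        """,
        {"root": root_id},
    ).fetchone()
    return tuple(row)


# Hypothetical sample data: group 1 is the root, 2 and 3 are its children,
# and 4 is a child of 2.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE grp (id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE person (id INTEGER PRIMARY KEY, group_id INTEGER);
    INSERT INTO grp (id, parent_id) VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
    INSERT INTO person (id, group_id) VALUES (1, 1), (2, 2), (3, 4), (4, 4);
    """
)

print(count_descendants(conn, 1))  # counts for the whole tree
print(count_descendants(conn, 2))  # counts for the subtree rooted at group 2
```

In a Django project the same SQL could be sent through a raw query, at the cost of tying the code to databases that support recursive common table expressions.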

Related

Django - How do I write a queryset that's equivalent to this SQL query? - Managing duplicates with Counting and FIRST_VALUE

I have Model "A" that both relates to another model and acts as a public face to the actual data (Model "B"), users can modify the contents of A but not of B.
For every B there can be many As, and they have a one to many relation.
When I display this model anytime there's two or more A's related to the B I see "duplicate" records with (almost always) the same data, a bad experience.
I want to return a queryset of A items that relate to the B items, and when there's more than one roll them up to the first entered item.
I also want to count the related model B items and return that count to give me an indication of how much duplication is available.
I wrote the following analogous SQL query which counts the related items and uses first_value to find the first A created partitioned by B.
SELECT *
FROM
(
SELECT
COUNT(*) OVER (PARTITION BY b_id) as count_related_items,
FIRST_VALUE(id) OVER (PARTITION BY b_id order by created_time ASC) as first_filter,
*
FROM A
) AS A1
WHERE
A1.first_filter = A1.id;
As requested, here's a simplified view of the models:
class CoreData(models.Model):
    title = models.CharField(max_length=500)
class UserData(models.Model):
    core = models.ForeignKey("CoreData", on_delete=models.CASCADE)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    title = models.CharField(max_length=500)
When a user creates data it first checks/creates the CoreData, storing things like the title, and then it creates the UserData, with a reference to the CoreData.
When a second user creates a piece of data and it references the same CoreData is when the "duplication" is introduced and why you can roll up the UserData (in SQL) to find the count and the "first" entry in the one to many relation.

Assuming my understanding is correct:
If you are querying from the UserData model, the query would look something like this (taking CoreData.id = 18 as an example):
from django.db.models import Count, Q
# core__userdata is the default reverse lookup name (the lowercased model name)
user_data = (
    UserData.objects.filter(core__id=18)
    .order_by("created_time")
    .annotate(duplicate_count=Count("core__userdata", filter=Q(core__id=18)))
    .first()
)
user_data would be the first object created that is related to the CoreData object. Also, user_data.duplicate_count will give you the count of UserData objects that are related to the CoreData object.
See the Django documentation on annotate() for reference.
Update:
If you need the list of UserData objects for a specific CoreData, drop the .first() call:
user_data = (
    UserData.objects.filter(core__id=18)
    .order_by("created_time")
    .annotate(duplicate_count=Count("core__userdata", filter=Q(core__id=18)))
)
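To check what the SQL from the question should return for every B at once, it can help to run it against a few sample rows first. The sketch below does that with Python's built-in sqlite3 module. The table a, its columns and the helper roll_up are assumptions made for illustration, and it needs an SQLite build with window-function support (3.25 or newer).

```python
import sqlite3

# The query from the question, narrowed to three output columns and given
# a stable ORDER BY so the result is easy to compare.
ROLLUP_SQL = """
SELECT id, b_id, count_related_items
FROM (
    SELECT
        COUNT(*) OVER (PARTITION BY b_id) AS count_related_items,
        FIRST_VALUE(id) OVER (PARTITION BY b_id ORDER BY created_time ASC) AS first_filter,
        *
    FROM a
) AS a1
WHERE a1.first_filter = a1.id
ORDER BY id
"""


def roll_up(conn):
    """Return one (id, b_id, count) row per b_id: the earliest A plus the group size."""
    return conn.execute(ROLLUP_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER, created_time INTEGER);
    -- b_id 10 has three rows (id 4 is the oldest), b_id 20 has one row
    INSERT INTO a (id, b_id, created_time) VALUES
        (1, 10, 100), (2, 10, 200), (3, 20, 300), (4, 10, 50);
    """
)

print(roll_up(conn))
```

Each b_id is rolled up to its earliest row, and count_related_items tells you how many rows were collapsed into it.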

Django How to filter and sort queryset by related model

I have this model relationship:
class Account:
    < ... fields ... >
class Balance(models.Model):
    name = models.CharField(...)
    count = models.FloatField(...)
    account = models.ForeignKey(Account, related_name='balance')
Let's say we have some number of accounts. I need to filter these accounts by balance__name and sort by balance__count. I need sorted accounts, not a list of balances.
How do I do that? I can't even think of a way to do it by iterating.

You can implement a queryset like:
Account.objects.filter(
balance__name='my_balance_name'
).order_by('balance__count')
Note that here an account can occur multiple times if there are multiple Balances that have the given name.
If you want to sort in descending order (so from larger counts to smaller counts), then you should add a minus (-):
Account.objects.filter(
balance__name='my_balance_name'
).order_by('-balance__count')

Django - Filter based on number of times a field comes up

I have a model that tracks the number of impressions for ads.
class Impression(models.Model):
ad = models.ForeignKey(Ad, on_delete=models.CASCADE)
user_ip = models.CharField(max_length=50, null=True, blank=True)
clicked = models.BooleanField(default=False)
time_created = models.DateTimeField(auto_now_add=True)
I want to find all the user_ip values that have more than 1000 impressions, in other words, every user_ip that comes up in more than 1000 instances of Impression. How can I do that? I wrote a function for this, but it is very inefficient and slow because it loops over every impression.
def check_ip():
    for i in Impression.objects.all():
        if Impression.objects.filter(user_ip=i.user_ip).count() > 1000:
            print(i.user_ip)

You should be able to do this in one query with aggregation. It is possible to filter on aggregate values (like Count()) as follows:
from django.db.models import Count
for ip in Impression.objects.values('user_ip').annotate(ipcount=Count('user_ip')).filter(ipcount__gt=1000):
    # do something with ip['user_ip'] and ip['ipcount']
    ...

Django querysets have an annotate() method which supports what you're trying to do.
from django.db.models import Count
Impression.objects.values('user_ip')\
.annotate(ip_count=Count('user_ip'))\
.filter(ip_count__gt=1000)
This will give you a queryset which returns dictionaries with 'user_ip' and 'ip_count' keys when used as an iterable.
To understand what's happening here you should look at Django's aggregation guide: https://docs.djangoproject.com/en/1.11/topics/db/aggregation/ (in particular this section which explains how annotate interacts with values)
The SQL generated is something like:
SELECT "impression"."user_ip", COUNT("impression"."user_ip") AS "ip_count"
FROM "impression"
GROUP BY "impression"."user_ip"
HAVING COUNT("impression"."user_ip") > 1000;
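If you want to see the GROUP BY / HAVING pattern work outside of Django, here is a small sketch with Python's built-in sqlite3 module. The impression table is a stripped-down stand-in for the model above and ips_over_threshold is a hypothetical helper. The threshold is a parameter so the example can run on a handful of rows.

```python
import sqlite3


def ips_over_threshold(conn, threshold):
    """Return dicts for every user_ip that appears more than `threshold` times."""
    rows = conn.execute(
        """
        SELECT user_ip, COUNT(user_ip) AS ip_count
        FROM impression
        GROUP BY user_ip
        HAVING COUNT(user_ip) > ?
        ORDER BY user_ip
        """,
        (threshold,),
    ).fetchall()
    # Same shape as the dictionaries produced by values().annotate() above.
    return [{"user_ip": ip, "ip_count": count} for ip, count in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE impression (id INTEGER PRIMARY KEY, user_ip TEXT)")
conn.executemany(
    "INSERT INTO impression (user_ip) VALUES (?)",
    [("10.0.0.1",)] * 3 + [("10.0.0.2",)] + [("10.0.0.3",)] * 2,
)

print(ips_over_threshold(conn, 1))
print(ips_over_threshold(conn, 2))
```

The database does the counting in a single pass, which is what makes this so much faster than issuing one count() query per impression.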

Querying objects using attribute of member of many-to-many

I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff
    def __str__(self):
        return self.ref
class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField, I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that that would force it to just shove the id into a string.
Use some feature of Many-To-Many Fields which I don't know about to more efficiently check for matches
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]

You should take a look at annotate:
okmems = Member.objects.annotate(
feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)

Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics on both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra({'feature_id_str':"CAST( \
feature_id AS VARCHAR)"}).order_by( \
'-feature_id_str').values_list('feature_id_str',flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
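For anyone wondering why the two earlier attempts from the question matched nothing, the string handling can be reproduced in plain Python without a database. In this sketch the lists and sets are stand-ins for the querysets, and the second id is made up for illustration.

```python
# Stand-in for list(ndtmp.values('feature_id')): one dict per row.
rows = [{"feature_id": 2286047272}, {"feature_id": 2286047273}]
# Stand-in for the text values stored in Member.ref.
refs = {"2286047272", "999"}

# Attempt 1: str() of each row is the repr of a dict, not the id itself.
sids_wrong = [str(row) for row in rows]

# Attempt 2: str() of the whole collection is one long string, and
# iterating over a string yields single characters.
ids = [row["feature_id"] for row in rows]
as_one_string = str(ids)
chars = set(as_one_string)

# What was intended: one string per id.
sids_right = [str(i) for i in ids]

print(sids_wrong[0])
print(as_one_string)
print(sorted(refs.intersection(sids_right)))
```

Only the last form produces values that can equal a ref. The CAST in the query above gets the same effect on the database side, so the ids never have to be pulled into Python.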

How to chain Django querysets preserving individual order

I'd like to append or chain several Querysets in Django, preserving the order of each one (not the result). I'm using a third-party library to paginate the result, and it only accepts lists or querysets. I've tried these options:
Queryset join: Doesn't preserve ordering in individual querysets, so I can't use this.
result = queryset_1 | queryset_2
Using itertools: Calling list() on the chain object actually evaluates the querysets, which could cause a lot of overhead, couldn't it?
result = list(itertools.chain(queryset_1, queryset_2))
How do you think I should approach this?

This solution prevents duplicates:
from django.db.models import Case, IntegerField, Q, Value, When
q1 = Q(...)
q2 = Q(...)
q3 = Q(...)
qs = (
Model.objects
.filter(q1 | q2 | q3)
.annotate(
search_type_ordering=Case(
When(q1, then=Value(2)),
When(q2, then=Value(1)),
When(q3, then=Value(0)),
default=Value(-1),
output_field=IntegerField(),
)
)
.order_by('-search_type_ordering', ...)
)

If the querysets are of different models, you have to evaluate them to lists and then you can just append:
result = list(queryset_1) + list(queryset_2)
If they are the same model, you should combine the queries using the Q object and 'order_by("queryset_1 field", "queryset_2 field")'.
The right answer largely depends on why you want to combine these and how you are going to use the results.
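If the paginator only needs one page at a time, a rough sketch of the list-based approach is to chain the sources in order and slice out just that page. Plain lists stand in for the querysets here and page is a hypothetical helper. Real querysets would still be evaluated when they are iterated, so this only avoids building one big combined list.

```python
from itertools import chain, islice


def page(iterables, offset, limit):
    """Return `limit` items starting at `offset`, reading the iterables in the order given."""
    combined = chain.from_iterable(iterables)
    return list(islice(combined, offset, offset + limit))


# Stand-ins for two querysets, each already in its own order.
queryset_1 = ["b", "a"]
queryset_2 = ["d", "c"]

print(page([queryset_1, queryset_2], 0, 3))
print(page([queryset_1, queryset_2], 3, 3))
```

Every item of queryset_1 comes before any item of queryset_2, and the order inside each source is left untouched.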

So, inspired by Peter's answer this is what I did in my project (Django 2.2):
from django.db import models
from .models import MyModel
# Add an extra field to each query with a constant value
queryset_0 = MyModel.objects.annotate(
qs_order=models.Value(0, models.IntegerField())
)
# Each constant should basically act as the position where we want the
# queryset to stay
queryset_1 = MyModel.objects.annotate(
qs_order=models.Value(1, models.IntegerField())
)
[...]
queryset_n = MyModel.objects.annotate(
qs_order=models.Value(n, models.IntegerField())
)
# Finally, I ordered the union result by that extra field.
union = queryset_0.union(
queryset_1,
queryset_2,
[...],
queryset_n).order_by('qs_order')
With this, I could order the resulting union as I wanted without changing any private attribute while only evaluating the querysets once.
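Roughly speaking, the trick comes down to a UNION where each branch selects a constant column that is then used for ordering. Here is a sketch of that in plain SQL, run through Python's built-in sqlite3 module. The item table, its columns and the helper ordered_union are assumptions made for illustration.

```python
import sqlite3


def ordered_union(conn):
    """Featured items first, then the rest, each part sorted by name."""
    return conn.execute(
        """
        SELECT 0 AS qs_order, name FROM item WHERE featured = 1
        UNION
        SELECT 1 AS qs_order, name FROM item WHERE featured = 0
        ORDER BY qs_order, name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, featured INTEGER)")
conn.executemany(
    "INSERT INTO item (name, featured) VALUES (?, ?)",
    [("pear", 0), ("apple", 1), ("fig", 0), ("kiwi", 1)],
)

print(ordered_union(conn))
```

Because the constant differs between the branches, a row that matches both branches would appear twice, once per position. If that matters, remove the overlap in the filters before combining.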

I'm not 100% sure this solution works in every possible case, but it looks like the result is the union of two QuerySets (on the same model) preserving the order of the first one:
union = qset1.union(qset2)
union.query.extra_order_by = qset1.query.extra_order_by
union.query.order_by = qset1.query.order_by
union.query.default_ordering = qset1.query.default_ordering
union.query.get_meta().ordering = qset1.query.get_meta().ordering
I did not test it extensively, so before you use that code in production, make sure it behaves as expected.

If you need to merge two querysets into a third queryset, here is an example, using _result_cache.
model
class ImportMinAttend(models.Model):
    country=models.CharField(max_length=2, blank=False, null=False)
    status=models.CharField(max_length=5, blank=True, null=True, default=None)
From this model, I want to display a list of all the rows such that:
(query 1) rows with an empty status go first, ordered by country
(query 2) rows with a non-empty status go second, ordered by country
I want to merge query 1 and query 2.
#get all the objects
queryset=ImportMinAttend.objects.all()
#get the first queryset
queryset_1=queryset.filter(status=None).order_by("country")
#len or anything that hits the database
len(queryset_1)
#get the second queryset
queryset_2=queryset.exclude(status=None).order_by("country")
#append the second queryset to the first one AND PRESERVE ORDER
for query in queryset_2:
    queryset_1._result_cache.append(query)
#final result
queryset=queryset_1
It might not be very efficient, but it works :).

For Django 1.11 (released on April 4, 2017) use union() for this, documentation here:
https://docs.djangoproject.com/en/1.11/ref/models/querysets/#django.db.models.query.QuerySet.union
Here is the Version 2.1 link to this:
https://docs.djangoproject.com/en/2.1/ref/models/querysets/#union

Use the union() function to combine multiple querysets, rather than the or (|) operator. This avoids a very inefficient OUTER JOIN query that reads the entire table.

If two querysets have a common field, you can order the combined queryset by that field. The querysets are not evaluated during this operation.
For example:
class EventsHistory(models.Model):
    id = models.IntegerField(primary_key=True)
    event_time = models.DateTimeField()
    event_id = models.IntegerField()
class EventsOperational(models.Model):
    id = models.IntegerField(primary_key=True)
    event_time = models.DateTimeField()
    event_id = models.IntegerField()
qs1 = EventsHistory.objects.all()
qs2 = EventsOperational.objects.all()
qs_combined = qs2.union(qs1).order_by('event_time')
