Peewee ORM get similar entries based on foreign key - python

I'm having problem with writing query for getting similar posts in a blog based on tags they have. I have following models:
class Articles(BaseModel):
name = CharField()
...
class Tags(BaseModel):
name = CharField()
class ArticleTags(BaseModel):
article = ForeignKeyField(Articles, related_name = "articles")
tags = ForeignKeyField(Tags, related_name = "tags")
What i'd like to do is to get articles with similar tags sorted by amount of common tags.
Edit
After 2 hours of fiddling with it i got the anwser i was looking for, i'm not sure if it's the most efficient way but it's working:
Here is the function if anyone might need that in the future:
def get_similar_articles(self,common_tags = 1, limit = 3):
"""
Get 3 similar articles based on tag used
Minimum 1 common tags i required
"""
art = (ArticleTags.select(ArticleTags.tag)\
.join(Articles)\
.where(ArticleTags.article == self))
return Articles.select(Articles, ArticleTags)\
.join(ArticleTags)\
.where((ArticleTags.article != self) & (ArticleTags.tag << art))\
.group_by(Articles)\
.having(fn.Count(ArticleTags.id) >= common_tags)\
.order_by(fn.Count(Articles.id).desc())\
.limit(limit)

Just a stylistic nit, table names (and model classes) should preferably be singular.
# Articles tagged with 'tag1'
Articles.select().join(ArticleTags).join(Tags).where(Tags.name == 'tag1')

Related

Peewee - Access an intermediary table easily

Say I have peewee models like so:
class Users(_BaseModel):
id = AutoField(primary_key=True, null=False, unique=True)
first_name = CharField(null=False)
last_name = CharField(null=False)
# Cut short for clarity
class Cohorts(_BaseModel):
id = AutoField(primary_key=True, null=False, unique=True)
name = CharField(null=False, unique=True)
# Cut short for clarity
class CohortsUsers(_BaseModel):
cohort = ForeignKeyField(Cohorts)
user = ForeignKeyField(Users)
is_primary = BooleanField(default=True)
I need to access easily from the user what cohort they are in and for example the cohort's name.
If a user could be in just one cohort, it would be easy but here, having it be many2many complicates things.
Here's what I got so far, which is pretty ugly and inefficient
Users.select(Users, CohortsUsers).join(CohortsUsers).where(Users.id == 1)[0].cohortsusers.cohort.name
Which will do what I require it to but I'd like to find a better way to do it.
Is there a way to have it so I can do Users.get_by_id(1).cohort.name ?
EDIT: I'm thinking about making methods to access them easily on my Users class but I am not really sure it's the best way of doing it nor how to go about it
If it do it like so, it's quite ugly because of the import inside the method to avoid circular imports
#property
def cohort(self):
from dst_datamodel.cohorts import CohortsUsers
return Users.select(Users, CohortsUsers).join(CohortsUsers).where(Users.id == self.id)[0].cohortsusers.cohort
But having this ugly method allows me to do Users.get_by_id(1).cohort easily
This is all covered in the documentation here: http://docs.peewee-orm.com/en/latest/peewee/relationships.html#implementing-many-to-many
You have a many-to-many relationship, where a user can be in zero, one or many cohorts, and a cohort may have zero, one or many users.
If there is some invariant where a user only has one cohort, then just do this:
# Get all cohorts for a given user id and print their name(s).
q = Cohort.select().join(CohortUsers).where(CohortUsers.user == some_user_id)
for cohort in q:
print(cohort.name)
More specific to your example:
#property
def cohort(self):
from dst_datamodel.cohorts import CohortsUsers
cohort = Cohort.select().join(CohortsUsers).where(CohortUsers.user == self.id).get()
return cohort.name

Django Efficiency For Data Manipulation

I am doing some data changes in a django app with a large amount of data and would like to know if there is a way to make this more efficient. It's currently taking a really long time.
I have a model that used to look like this (simplified and changed names):
class Thing(models.Model):
... some fields...
stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
I need to split the list up based on a new model.
class Tag(models.Model):
name = models.CharField(max_length=200)
class Thing(models.Model):
.... some fields ...
stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
other_stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
tags = models.Many2ManyField(Tag)
What I need to do is take the list that is currently in stuff, and split it up. For items that have a tag in the Tag model, add it to the Many2Many. For things that don't have a Tag, I add it to other_stuff. Then in the end, the stuff field should contain of the items that were saved in tags.
I start by looping through the Tags to make a dict that maps the string version that would be in the stuff list to the tag object so I don't have to keep querying the Tag model.
Then I loop through the Thing model, get the stuff field, loop through that and add each Tag item to the many2many while keeping lists for each item that is or isn't in Tags. Then put those in the stuff and other stuff fields at the end.
tags = Tag.objects.all()
tag_dict = {tag.name.lower():Tag for tag in tags}
things = Thing.objects.all()
for thing in things:
stuff_list = thing.stuff
stuff_in_tags = []
stuff_not_in_tags = []
for item in stuff_list:
if item.lower() in tag_dict.keys():
stuff_in_tags.append(item)
thing.tags.add(tag_dict[item.lower()])
else:
stuff_not_in_tags.append(item)
thing.stuff = stuff_in_tags
thing.other_stuff = stuff_not_in_tags
thing.save()
(Ignore any typos. This code works in my actual code)
That seems pretty efficient to me, but its taking hours to run as our database is pretty big (about 500k+ records). Are there any other ways to make this more efficient?
Unless you move some work to the database level with bulk operations, it won't run faster. You are making at least N (500k+) UPDATE queries.
If the parsing cannot be done on the DB level, chunked bulk_update is the next option.
Also, you can use iterator() to avoid loading all the objects to memory and only() to load only relevant columns.
There is a typo in tag_dict - it should be : tag (instance) instead of : Tag (model).
EDIT: I've originally missed the thing.tags.add - this will need additional handling. You have to bulk_create m2m table rows.
chunk_size = 10000
TagsToThing = Thing.tags.through
tag_dict = {tag.name.lower():tag for tag in Tag.objects.all()}
for_update = []
tags_for_create = []
for thing in Thing.objects.only('pk', 'stuff').iterator(chunk_size):
stuff_in_tags = []
stuff_not_in_tags = []
for item in thing.stuff:
if item.lower() in tag_dict.keys():
stuff_in_tags.append(item)
tags_for_create.append(
TagsToThing(thing=thing, tag=tag_dict[item.lower()])
)
else:
stuff_not_in_tags.append(item)
thing.stuff = stuff_in_tags
thing.other_stuff = stuff_not_in_tags
for_update.append(thing)
if len(for_update) == chunk_size:
Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True) # in case the tag is already assigned
for_update = []
tags_for_create = []
# Save remaining objects
Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True) # in case the tag is already assigned

Django - cannot retrieve just one record in multi-part filter on model with multiple relations

I can't seem to isolate a single record from this query:
subcust = OwnerCustom.objects.get(carcustom=ncset, owner=sset)
This is the error:
OwnerCustom matching query does not exist
In the actual data, there is only actually one matching record in OwnerCustom for each record in CarCustom. It's supposed to be a kind of many-to-many where there are standard differences listed in CarCustom for each Car, and each owner may maintain their own customizations (overrides) or those default OwnerCustom entries.
Note, there are many different Owner of the same Car. And of course, I'm not actually doing cars, this is a renaming from the original purpose.
Here's the relevant models:
class Car(models.Model):
car_name = models.CharField(max_length=50)
class CarCustom(models.Model):
car = models.ForeignKey(Car, models.PROTECT)
class Owner(models.Model):
car = models.ForeignKey(Car, models.PROTECT)
class OwnerCustom(models.Model):
owner = models.ForeignKey(Owner, models.PROTECT)
carcustom = models.ForeignKey(CarCustom, models.PROTECT)
name = models.CharField(max_length=50)
And the code:
car_queryset = Car.objects.filter(car_name="fancy car")
for nset in car_queryset:
owner_queryset = Owner.objects.filter(car=nset)
for sset in owner_queryset :
carcustom_queryset = CarCustom.objects.filter(car=nset)
for ncset in carcustom_queryset:
subcust = OwnerCustom.objects.get(carcustom=ncset, owner=sset)
I've tried stuff like:
subcust = OwnerCustom.objects.filter(carcustom=ncset, owner=sset).first()
Which gives me a NoneType, and then tried:
subcust = OwnerCustom.objects.filter(carcustom=ncset, owner=sset)[:1].get()
Which gives "matching query does not exist" and this:
subcust = OwnerCustom.objects.filter(carcustom=ncset, owner=sset)[0]
Gives "list index out of range"
UPDATE: I CAN get a working function by using code like this, but I would think since there is only one (guaranteed by application) matching record possible for OwnerCustom.objects.filter(carcustom=ncset, owner=sset) that I could find a better way to fetch it:
car_queryset = Car.objects.filter(car_name="fancy car")
for nset in car_queryset:
owner_queryset = Owner.objects.filter(car=nset)
for sset in owner_queryset :
carcustom_queryset = CarCustom.objects.filter(car=nset)
for ncset in carcustom_queryset:
subcust_queryset = OwnerCustom.objects.filter(carcustom=ncset, owner=sset)
for subcust in subcust_queryset :
logger.info(subcust.name)

django multiple filters on field

I'm doing a simple filter -
filters.py
class TblserversFilter(django_filters.FilterSet):
name = django_filters.CharFilter(name="servername", lookup_type="exact")
class Meta:
model = Tblservers
fields = ['servername']
What I would like to do, if possible, is to have two lookup_types associated with the field. Specifically I want exact AND contains and then somehow replace the operator depending on the filter.
name=serverabc would be an exact search and name~abc will be a fuzzy search.
You could do a method_filter and then prefix your filter queries with different symbols for exact and icontains and other filters that you want at the client side.
Since code is better than a thousand words:
exact_prefix = '#'
icontains_prefix = '~'
class TblserversFilter(django_filters.FilterSet):
name = django_filters.MethodFilter(
action=name_filter)
def name_filter(self, value):
if value:
value_prefix = value[0]
if value_prefix == exact_prefix:
return self.filter(name=value)
elif value_prefix == icontains_prefix:
return self.filter(name__icontains=value)
# this can continue for all the types of filters you want
else:
return self.filter(name=value)
else:
return self.filter(name=value)
class Meta:
model = Tblservers
fields = ['servername']
EDIT:
In django-filter 1.0 MethodFilter was replaced with Filter's method argument. So solution rewritten for v1.0 would be following (not tested):
exact_prefix = '#'
icontains_prefix = '~'
class TblserversFilter(django_filters.FilterSet):
name = django_filters.CharFilter(
method='name_filter')
def name_filter(self, qs, name, value):
if value:
value_prefix = value[0]
if value_prefix == exact_prefix:
return qs.filter(name=value)
elif value_prefix == icontains_prefix:
return qs.filter(name__icontains=value)
# this can continue for all the types of filters you want
else:
return qs.filter(name=value)
else:
return qs.filter(name=value)
class Meta:
model = Tblservers
fields = ['servername']
First my apologies for the shameless library self-plug.
At some point I was trying to do something similar in django-filters however the solution was much more complex then anticipated. I ended up creating my own library for doing filtering in Django which natively supports the exact functionality you are looking for - django-url-filter. Its API is very similar to django-filters:
from django import forms
from url_filter.filter import Filter
from url_filter.filtersets import ModelFilterSet
class TblserversFilter(FilterSet):
name = Filter(form_field=forms.CharField(max_length=15), lookups=['exact', 'contains'])
class Meta(object):
model = Tblservers
fields = ['name', 'servername']
Note that the URL will look a bit different though:
?name=foo # exact
?name__exact=foo
?name__contains=foo
Also you will need to manually call the filter set in order to filter the queryset:
fs = TblserversFilter(data=query, queryset=...)
filtered_qs = fs.filter()
Syntax of the URL parameters is very similar to Django ORM.
You can look at the docs for more examples. Hopefully it might be of use.

Django query: Joining two models with two fields

I have the following models:
class AcademicRecord(models.Model):
record_id = models.PositiveIntegerField(unique=True, primary_key=True)
subjects = models.ManyToManyField(Subject,through='AcademicRecordSubject')
...
class AcademicRecordSubject(models.Model):
academic_record = models.ForeignKey('AcademicRecord')
subject = models.ForeignKey('Subject')
language_group = IntegerCharField(max_length=2)
...
class SubjectTime(models.Model):
time_id = models.CharField(max_length=128, unique=True, primary_key=True)
subject = models.ForeignKey(Subject)
language_group = IntegerCharField(max_length=2)
...
class Subject(models.Model):
subject_id = models.PositiveIntegerField(unique=True,primary_key=True)
...
The academic records have list of subjects each with a language code and the subject times have a subject and language code.
With a given AcademicRecord, how can I get the subject times that matches with the AcademicRecordSubjects that the AcademicRecord has?
This is my approach, but it makes more queries than needed:
# record is the given AcademicRecord
times = []
for record_subject in record.academicrecordsubject_set.all():
matched_times = SubjectTime.objects.filter(subject=record_subject.subject)
current_times = matched_times.filter(language_group=record_subject.language_group)
times.append(current_times)
I want to make the query using django ORM not with raw SQL
SubjectTime language group has to match with Subject's language group aswell
I got it, in part thanks to #Robert Jørgensgaard Eng
My problem was how to do the inner join using more than 1 field, in which the F object came on handly.
The correct query is:
SubjectTime.objects.filter(subject__academicrecordsubject__academic_record=record,
subject__academicrecordsubject__language_group=F('language_group'))
Given an AcademicRecord instance academic_record, it is either
SubjectTime.objects.filter(subject__academicrecordsubject_set__academic_record=academic_record)
or
SubjectTime.objects.filter(subject__academicrecordsubject__academic_record=academic_record)
The results reflect all the rows of the join that these ORM queries become in SQL. To avoid duplicates, just use distinct().
Now this would be much easier, if I had a django shell to test in :)

Categories

Resources