Django Efficiency For Data Manipulation - python

I am doing some data changes in a django app with a large amount of data and would like to know if there is a way to make this more efficient. It's currently taking a really long time.
I have a model that used to look like this (simplified and changed names):
class Thing(models.Model):
    # ... some fields ...
    stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
I need to split the list up based on a new model.
class Tag(models.Model):
    name = models.CharField(max_length=200)

class Thing(models.Model):
    # ... some fields ...
    stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
    other_stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
    tags = models.ManyToManyField(Tag)
What I need to do is take the list that is currently in stuff and split it up. For items that have a matching entry in the Tag model, add the tag to the many-to-many field. For items that don't have a Tag, add them to other_stuff. In the end, the stuff field should contain only the items that were saved as tags.
I start by looping through the Tags to make a dict that maps the string version that would be in the stuff list to the tag object so I don't have to keep querying the Tag model.
Then I loop through the Thing model, get the stuff field, loop through that, and add each Tag item to the many-to-many while keeping separate lists of the items that are and aren't in Tags. Then I put those lists in the stuff and other_stuff fields at the end.
tags = Tag.objects.all()
tag_dict = {tag.name.lower(): Tag for tag in tags}
things = Thing.objects.all()
for thing in things:
    stuff_list = thing.stuff
    stuff_in_tags = []
    stuff_not_in_tags = []
    for item in stuff_list:
        if item.lower() in tag_dict.keys():
            stuff_in_tags.append(item)
            thing.tags.add(tag_dict[item.lower()])
        else:
            stuff_not_in_tags.append(item)
    thing.stuff = stuff_in_tags
    thing.other_stuff = stuff_not_in_tags
    thing.save()
(Ignore any typos. This code works in my actual code)
That seems pretty efficient to me, but it's taking hours to run as our database is pretty big (500k+ records). Are there any other ways to make this more efficient?

Unless you move some of the work to the database with bulk operations, it won't run faster. You are making at least N (500k+) UPDATE queries.
If the parsing cannot be done at the DB level, chunked bulk_update is the next option.
You can also use iterator() to avoid loading all the objects into memory, and only() to load only the relevant columns.
There is a typo in tag_dict: it should be : tag (the instance) instead of : Tag (the model).
EDIT: I originally missed the thing.tags.add - that needs additional handling. You have to bulk_create the m2m through-table rows.
chunk_size = 10000
TagsToThing = Thing.tags.through
tag_dict = {tag.name.lower(): tag for tag in Tag.objects.all()}

for_update = []
tags_for_create = []
for thing in Thing.objects.only('pk', 'stuff').iterator(chunk_size):
    stuff_in_tags = []
    stuff_not_in_tags = []
    for item in thing.stuff:
        if item.lower() in tag_dict:
            stuff_in_tags.append(item)
            tags_for_create.append(
                TagsToThing(thing=thing, tag=tag_dict[item.lower()])
            )
        else:
            stuff_not_in_tags.append(item)
    thing.stuff = stuff_in_tags
    thing.other_stuff = stuff_not_in_tags
    for_update.append(thing)
    if len(for_update) == chunk_size:
        Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
        TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True)  # in case the tag is already assigned
        for_update = []
        tags_for_create = []

# Save remaining objects
Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True)  # in case the tag is already assigned
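The flush-when-full pattern used above can be seen in isolation. This is a plain-Python sketch, not runnable Django code: integers stand in for Thing rows and a list of chunks stands in for the bulk_update/bulk_create calls (the chunk size is shrunk for illustration):

```python
chunk_size = 3
flushed = []  # stands in for rows written by bulk_update/bulk_create
buffer = []

for item in range(8):  # stands in for Thing.objects...iterator()
    buffer.append(item)
    if len(buffer) == chunk_size:
        # one bulk call per full chunk
        flushed.append(list(buffer))
        buffer = []

# Save remaining objects -- the final partial chunk
if buffer:
    flushed.append(list(buffer))
```

Note the trailing flush after the loop: without it, the last partial chunk (here `[6, 7]`) would never be written, which is why the answer repeats the bulk_update/bulk_create pair after the for loop.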

Related

Django: Make a single DB call instead of iterating QuerySet

I have the following code that iterates the tags queryset, and for each item, creates a Department object and adds it to the departments list:
departments: List[Department] = []
tags = Tag.objects.filter(type="department")
for tag in tags:
    dept_id = tag.reference_id
    dept_name = tag.name
    parent_tag = Tag.objects.get(type="department", reference_id=tag.parent_reference_id)
    dept_parent_id = parent_tag.reference_id
    departments.append(Department(dept_id, dept_name, dept_parent_id))
However, as you can see, it is making multiple DB calls via Tag.objects.get(), which seems highly inefficient. Is there an efficient way to populate that departments list without making so many DB calls?
TIA.
What you need to use is the "in" lookup in your query.
Check the queryset documentation:
Entry.objects.filter(id__in=[1, 3, 4])
Entry.objects.filter(headline__in='abc')
So in your case you can use the following example:
tags = Tag.objects.filter(id=some_id, type="department").values('id')
tags_list = [tag['id'] for tag in tags]
parent_tags = Tag.objects.filter(id__in=tags_list, type="department")
I used parts of the answer from @Vanda to write the following solution, which solves my problem.
departments: List[Department] = []
tags = Tag.objects.filter(type="department")
parents_set = {tag.parent_reference_id for tag in tags}
for tag in tags:
    dept_id = tag.reference_id
    dept_name = tag.name
    dept_parent_id = tag.parent_reference_id
    if dept_parent_id not in parents_set:
        dept_parent_id = None
    departments.append(Department(dept_id, dept_name, dept_parent_id))
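The batched-lookup idea behind both answers, sketched in plain Python with hypothetical dict rows standing in for Tag instances (here parents are validated against the set of known reference ids; the field names mirror the question's model):

```python
# Hypothetical rows standing in for the Tag queryset.
tags = [
    {"reference_id": 10, "name": "Engineering", "parent_reference_id": 1},
    {"reference_id": 11, "name": "Sales", "parent_reference_id": 99},
    {"reference_id": 1, "name": "Company", "parent_reference_id": None},
]

# One pass collects every known id; set-membership tests then replace
# the per-row Tag.objects.get() round trips.
known_ids = {t["reference_id"] for t in tags}

departments = []
for t in tags:
    parent_id = t["parent_reference_id"]
    if parent_id not in known_ids:
        parent_id = None  # parent missing from the fetched set
    departments.append((t["reference_id"], t["name"], parent_id))
```

The whole thing costs one query (the initial fetch) plus O(1) dictionary/set work per row, instead of one query per tag.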

Why is it so slow when update ListField in mongoengine?

It's too slow when I update a ListField with mongoengine. Here is an example:
class Comment(EmbeddedDocument):
    comment = StringField()
    ...

class Post(Document):
    _id = StringField()
    txt = StringField()
    comments = ListField(EmbeddedDocumentField(Comment))

position = 3000
_id = 3
update_comment_str = "example"

# query
post_obj = Post.objects(_id=str(_id)).first()
# update
post_obj.comments[position].comment = update_comment_str
# save
post_obj.save()
The time it costs increases with the length of post_obj.comments.
How can I optimize it?
Post.objects(id=str(_id)).update(**{"comments__{}__comment".format(position): update_comment_str})
In your code:
You fetch the whole document into a Python instance, which takes up RAM.
Then you update the 3000th comment, which does some magic in mongoengine (marking changed fields and so on).
Then you save the document.
In my answer, I send the update instruction to MongoDB instead of fetching the whole document with its N comments into Python, which saves memory (RAM) and time.
mongoengine/MongoDB supports index-based updates like:
set__comments__1000__comment="blabla"
In order to pass the position as a variable, I've used the Python dictionary and kwargs trick.
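The dictionary-and-kwargs trick can be demonstrated in isolation; `fake_update` below is a hypothetical stand-in for `Post.objects(...).update(...)` that just echoes the keyword arguments it receives:

```python
position = 3000
update_comment_str = "example"

# Build the keyword name at runtime, then splat it with ** -- this is how
# a variable index ends up in a "set__comments__<n>__comment" keyword.
update_kwargs = {"set__comments__{}__comment".format(position): update_comment_str}

def fake_update(**kwargs):
    # Stand-in for the mongoengine update call; returns what it was given.
    return kwargs

result = fake_update(**update_kwargs)
```

Keyword argument names must be built as dictionary keys like this because a literal `set__comments__position__comment=...` would not substitute the variable.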

Peewee ORM get similar entries based on foreign key

I'm having a problem writing a query to get similar posts in a blog based on the tags they have. I have the following models:
class Articles(BaseModel):
    name = CharField()
    ...

class Tags(BaseModel):
    name = CharField()

class ArticleTags(BaseModel):
    article = ForeignKeyField(Articles, related_name="articles")
    tags = ForeignKeyField(Tags, related_name="tags")
What I'd like to do is get articles with similar tags, sorted by the number of common tags.
Edit
After 2 hours of fiddling with it I got the answer I was looking for. I'm not sure if it's the most efficient way, but it's working.
Here is the function if anyone needs it in the future:
def get_similar_articles(self, common_tags=1, limit=3):
    """
    Get 3 similar articles based on tags used.
    A minimum of 1 common tag is required.
    """
    art = (ArticleTags.select(ArticleTags.tags)
           .where(ArticleTags.article == self))
    return (Articles.select(Articles, ArticleTags)
            .join(ArticleTags)
            .where((ArticleTags.article != self) & (ArticleTags.tags << art))
            .group_by(Articles)
            .having(fn.Count(ArticleTags.id) >= common_tags)
            .order_by(fn.Count(ArticleTags.id).desc())
            .limit(limit))
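What the GROUP BY / HAVING / ORDER BY combination computes can be sketched in plain Python, with hypothetical (article_id, tag_id) pairs standing in for ArticleTags rows:

```python
from collections import Counter

# Hypothetical rows from the ArticleTags join table: (article_id, tag_id).
article_tags = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "c")]

current = 1
current_tags = {tag for (article, tag) in article_tags if article == current}

# Count shared tags per other article -- the GROUP BY + COUNT step.
shared = Counter(article for (article, tag) in article_tags
                 if article != current and tag in current_tags)

# most_common() orders by count descending -- the ORDER BY COUNT(...) DESC step.
similar = [article for article, n in shared.most_common()]
```

Article 3 shares no tags with article 1, so it is filtered out entirely; that is what the `HAVING COUNT(...) >= common_tags` clause does in the query.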
Just a stylistic nit, table names (and model classes) should preferably be singular.
# Articles tagged with 'tag1'
Articles.select().join(ArticleTags).join(Tags).where(Tags.name == 'tag1')

Creating a django query that will retrieve the previous and next object based on alphabetical order

I have a django model that looks something like this:
class Definition(models.Model):
    name = models.CharField(max_length=254)
    text = models.TextField()
If I do the following query:
animal = Definition.objects.get(name='Owl')
and if I have the following definitions with these names in my database:
Elephant, Owl, Zebra, Human
is there a way to do a Django query (or queries) that will show me the previous and the next Definition relative to the animal object, based on alphabetical order of the name field?
I know that there are ways of getting previous/next based on datetime fields, but I am not so sure for this case.
I don't know of any way of doing this in fewer than three queries.
target = 'Owl'
animal = Definition.objects.get(name=target)
previous_animal = Definition.objects.order_by('-name').filter(name__lt=target)[0]
next_animal = Definition.objects.order_by('name').filter(name__gt=target)[0]
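Prev/next by alphabetical order can be mirrored in plain Python over an in-memory list: the previous name is the maximum of everything that sorts before the target, and the next name is the minimum of everything that sorts after it:

```python
names = sorted(["Elephant", "Owl", "Zebra", "Human"])
target = "Owl"

# Equivalent of filter(name__lt=target).order_by('-name')[0]
previous_name = max((n for n in names if n < target), default=None)

# Equivalent of filter(name__gt=target).order_by('name')[0]
next_name = min((n for n in names if n > target), default=None)
```

The `default=None` mirrors what happens at the ends of the list, where one of the two filtered querysets is empty (in the Django version, indexing `[0]` on an empty queryset raises IndexError instead).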
If anyone comes across this like I just did, here's my solution. It also loops: if you're on the last item it shows the first item as next, and if you're on the first item it shows the last item as previous.
def get_previous_by_title(self):
    curr_title = self.get_object().title
    queryset = self.my_queryset()
    try:
        prev = queryset.filter(title__lt=curr_title).order_by("-title")[0:1].get()
    except Video.DoesNotExist:
        prev = queryset.order_by("-title")[0:1].get()
    return prev

def get_next_by_title(self):
    curr_title = self.get_object().title
    queryset = self.my_queryset()
    try:
        next = queryset.filter(title__gt=curr_title).order_by("title")[0:1].get()
    except Video.DoesNotExist:
        next = queryset.order_by("title")[0:1].get()
    return next
I have custom querysets based on user level, so I could just set the queryset to a normal queryset like Video.objects.all(), but any place I repeat code more than once I make a function.

How can I improve this many-to-many Django ORM query and model set?

I have a Django query and some Python code that I'm trying to optimize because 1) it's ugly and it's not as performant as some SQL I could use to write it, and 2) because the hierarchical regrouping of the data looks messy to me.
So,
1. Is it possible to improve this to be a single query?
2. How can I improve my Python code to be more Pythonic?
Background
This is for a photo gallery system. The particular view is attempting to display the thumbnails for all photos in a gallery. Each photo is statically sized several times to avoid dynamic resizing, and I would like to also retrieve the URLs and "Size Type" (e.g. Thumbnail, Medium, Large) of each sizing so that I can Lightbox the alternate sizes without hitting the database again.
Entities
I have 5 models that are of relevance:
class Gallery(models.Model):
    Photos = models.ManyToManyField('Photo', through='GalleryPhoto', blank=True, null=True)

class GalleryPhoto(models.Model):
    Gallery = models.ForeignKey('Gallery')
    Photo = models.ForeignKey('Photo')
    Order = models.PositiveIntegerField(default=1)

class Photo(models.Model):
    GUID = models.CharField(max_length=32)

class PhotoSize(models.Model):
    Photo = models.ForeignKey('Photo')
    PhotoSizing = models.ForeignKey('PhotoSizing')
    PhotoURL = models.CharField(max_length=1000)

class PhotoSizing(models.Model):
    SizeName = models.CharField(max_length=20)
    Width = models.IntegerField(default=0, null=True, blank=True)
    Height = models.IntegerField(default=0, null=True, blank=True)
    Type = models.CharField(max_length=10, null=True, blank=True)
So, the rough idea is that I would like to get all Photos in a Gallery through GalleryPhoto, and for each Photo, I want to get all the PhotoSizes, and I would like to be able to loop through and access all this data through a dictionary.
A rough sketch of the SQL might look like this:
Select PhotoSize.PhotoURL
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Where Gallery.id = 5
Order By GalleryPhoto.Order Asc
I would like to turn this into a list that has a schema like this:
(
    photo: {
        'guid': 'abcdefg',
        'sizes': {
            'Thumbnail': 'http://mysite/image1_thumb.jpg',
            'Large': 'http://mysite/image1_full.jpg',
            more sizes...
        }
    },
    more photos...
)
I currently have the following Python code (it doesn't exactly mimic the schema above, but it'll do for an example).
gallery_photos = [(photo.Photo_id, photo.Order) for photo in GalleryPhoto.objects.filter(Gallery=gallery)]
photo_list = list(PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(
    Photo__id__in=[gallery_photo[0] for gallery_photo in gallery_photos]))
photos = {}
for photo in photo_list:
    order = 1
    for gallery_photo in gallery_photos:
        if gallery_photo[0] == photo.Photo.id:
            order = gallery_photo[1]  # this gets the order column value
    guid = photo.Photo.GUID
    if not guid in photos:
        photos[guid] = {'Photo': photo.Photo, 'Thumbnail': None, 'Sizes': [], 'Order': order}
    photos[guid]['Sizes'].append(photo)
sorted_photos = sorted(photos.values(), key=operator.itemgetter('Order'))
The Actual Question, Part 1
So, my question is first of all whether I can do my many-to-many query better so that I don't have to do the double query for both gallery_photos and photo_list.
The Actual Question, Part 2
I look at this code and I'm not too thrilled with the way it looks. I sure hope there's a better way to group up a hierarchical queryset result by a column name into a dictionary. Is there?
When you have a SQL query that is hard to write using the ORM, you can use PostgreSQL views (not sure about MySQL). In this case you will have:
Raw SQL like:
CREATE VIEW photo_urls AS
Select
    photo.id,  -- pseudo primary key for django mapper
    Gallery.id as gallery_id,
    PhotoSize.PhotoURL as photo_url
From PhotoSize
Inner Join Photo On Photo.id = PhotoSize.Photo_id
Inner Join GalleryPhoto On GalleryPhoto.Photo_id = Photo.id
Inner Join Gallery On Gallery.id = GalleryPhoto.Gallery_id
Order By GalleryPhoto.Order Asc
Django model like:
class PhotoUrls(models.Model):
    class Meta:
        managed = False
        db_table = 'photo_urls'

    gallery_id = models.IntegerField()
    photo_url = models.CharField(max_length=1000)
ORM Queryset like:
PhotoUrls.objects.filter(gallery_id=5)
Hope it will help.
Django has some built-in functions that will clean up the way your code looks. They result in subqueries, so I guess it depends on performance: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.values
gallery_photos = GalleryPhoto.objects.filter(Gallery=gallery).values('Photo_id', 'Order')
photo_queryset = PhotoSize.objects.select_related('Photo', 'PhotoSizing').filter(
    Photo__id__in=gallery_photos.values_list('Photo_id', flat=True))
Calling list() will instantly evaluate the queryset, which might affect performance if you have a lot of data.
Additionally, there should be a rather easy way to get rid of if gallery_photo[0] == photo.Photo.id: - this seems like it can be resolved without the nested loop, by looking up the order for all photos at once.
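One way to drop that inner loop without an extra query is a dict keyed by photo id, built once from the (Photo_id, Order) pairs already fetched. A sketch with hypothetical data:

```python
# Hypothetical (Photo_id, Order) pairs as fetched for gallery_photos.
gallery_photos = [(7, 2), (9, 1), (4, 3)]

# One dict build replaces the nested scan over gallery_photos
# that runs once per PhotoSize row.
order_by_photo_id = dict(gallery_photos)

# Inside the photo loop, the lookup becomes O(1);
# the second argument keeps the original default of 1.
order = order_by_photo_id.get(9, 1)
```

This turns the O(photos x gallery_photos) pairing into O(photos + gallery_photos) with no additional database round trip.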
You can retrieve all the data with a single query and get a list of data dictionaries. Then you can manage this dictionary, or create a new one, to form your final dictionary. You can use reverse relations in filtering and selecting specific rows from a table. So:
Let x be your selected Gallery...
GalleryPhoto.objects.filter(Gallery=x).values('Order', 'Photo__GUID', 'Photo__Photo__PhotoURL', 'Photo__Photo__PhotoSizing__SizeName', 'Photo__Photo__PhotoSizing__Width', 'Photo__Photo__PhotoSizing__Height', 'Photo__Photo__PhotoSizing__Type')
Using Photo__ will create an inner join to the Photo table, while Photo__Photo__ will create an inner join to PhotoSize (via the reverse relation) and Photo__Photo__PhotoSizing__ will inner join to PhotoSizing.
You get a list of dictionaries:
[{'Order': ..., 'GUID': ..., 'PhotoURL': ..., 'SizeName': ..., 'Width': ..., 'Height': ..., 'Type': ...}, {'Order': ..., 'GUID': ..., 'PhotoURL': ..., 'SizeName': ..., 'Width': ..., 'Height': ..., 'Type': ...}, ...]
You can select the rows you need and get all values as a list of dictionaries. Then you can write a loop or iterator to go through this list and create a new dictionary while grouping your data.
