Why is updating a ListField in mongoengine so slow?


Updating a ListField with mongoengine is too slow. Here is an example:
from mongoengine import Document, EmbeddedDocument, EmbeddedDocumentField, ListField, StringField
# Comment must be defined before Post refers to it
class Comment(EmbeddedDocument):
    comment = StringField()
class Post(Document):
    _id = StringField()
    txt = StringField()
    comments = ListField(EmbeddedDocumentField(Comment))
...
...
position = 3000
_id = 3
update_comment_str = "example"
# query
post_obj = Post.objects(_id=str(_id)).first()
# update
post_obj.comments[position].comment = update_comment_str
# save
post_obj.save()
The time this takes grows with the length of post_obj.comments.
How can I optimize it?

Post.objects(id=str(_id)).update(**{"set__comments__{}__comment".format(position): update_comment_str})
Your code does three things:
1. It fetches the whole document into a Python instance, which takes up RAM.
2. It updates the 3000th comment, which triggers mongoengine's change tracking (marking changed fields and so on).
3. It saves the document.
My answer sends the update instruction straight to MongoDB instead of fetching the whole document with its N comments into Python, which saves both memory (RAM) and time.
mongoengine/MongoDB support updating a list element by index, for example:
set__comments__1000__comment="blabla"
Because the position comes from a variable, I build the keyword name as a dictionary key and unpack it with **kwargs.
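The dictionary-and-kwargs trick can be tried without a database. The sketch below only builds the keyword arguments; the helper name build_list_item_update is hypothetical and not part of mongoengine, and it assumes the explicit set__ prefix shown above.

```python
def build_list_item_update(list_field, position, sub_field, value):
    """Return kwargs for a positional ``set`` update on one list element.

    Hypothetical helper: it only formats the keyword name, it does not
    talk to MongoDB.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    key = "set__{}__{}__{}".format(list_field, position, sub_field)
    return {key: value}


update_kwargs = build_list_item_update("comments", 3000, "comment", "example")
print(update_kwargs)

# With mongoengine and the Post model from the question available,
# the result would be passed on like this:
#   Post.objects(id=str(_id)).update(**update_kwargs)
```

Keeping the key construction in one place makes it easy to check the generated name before it reaches the database.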

Related

Django Efficiency For Data Manipulation

I am doing some data changes in a django app with a large amount of data and would like to know if there is a way to make this more efficient. It's currently taking a really long time.
I have a model that used to look like this (simplified and changed names):
class Thing(models.Model):
    ... some fields...
    stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
I need to split the list up based on a new model.
class Tag(models.Model):
    name = models.CharField(max_length=200)
class Thing(models.Model):
    .... some fields ...
    stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
    other_stuff = models.JSONField(encoder=DjangoJSONEncoder, default=list, blank=True)
    tags = models.ManyToManyField(Tag)
What I need to do is take the list that is currently in stuff and split it up. Items that have a matching tag in the Tag model get added to the many-to-many relation. Items that don't have a matching Tag get added to other_stuff. In the end, the stuff field should contain only the items that were saved in tags.
I start by looping through the Tags to build a dict that maps the string version that would appear in the stuff list to the tag object, so I don't have to keep querying the Tag model.
Then I loop through the Thing model, get the stuff field, loop through that, and add each Tag item to the many-to-many relation while keeping separate lists of the items that are and aren't in Tags. At the end, I put those lists in the stuff and other_stuff fields.
tags = Tag.objects.all()
tag_dict = {tag.name.lower():Tag for tag in tags}
things = Thing.objects.all()
for thing in things:
    stuff_list = thing.stuff
    stuff_in_tags = []
    stuff_not_in_tags = []
    for item in stuff_list:
        if item.lower() in tag_dict.keys():
            stuff_in_tags.append(item)
            thing.tags.add(tag_dict[item.lower()])
        else:
            stuff_not_in_tags.append(item)
    thing.stuff = stuff_in_tags
    thing.other_stuff = stuff_not_in_tags
    thing.save()
(Ignore any typos. This works in my actual code.)
That seems pretty efficient to me, but it's taking hours to run because our database is pretty big (500k+ records). Are there any other ways to make this more efficient?

Unless you move some work to the database level with bulk operations, it won't run faster. You are making at least N (500k+) UPDATE queries.
If the parsing cannot be done on the DB level, chunked bulk_update is the next option.
Also, you can use iterator() to avoid loading all the objects into memory, and only() to load just the relevant columns.
There is a typo in tag_dict: the value should be tag (the instance) instead of Tag (the model).
EDIT: I originally missed the thing.tags.add call, which needs additional handling. You have to bulk_create the rows of the m2m table.
chunk_size = 10000
TagsToThing = Thing.tags.through
tag_dict = {tag.name.lower(): tag for tag in Tag.objects.all()}
for_update = []
tags_for_create = []
for thing in Thing.objects.only('pk', 'stuff').iterator(chunk_size):
    stuff_in_tags = []
    stuff_not_in_tags = []
    for item in thing.stuff:
        if item.lower() in tag_dict:
            stuff_in_tags.append(item)
            tags_for_create.append(
                TagsToThing(thing=thing, tag=tag_dict[item.lower()])
            )
        else:
            stuff_not_in_tags.append(item)
    thing.stuff = stuff_in_tags
    thing.other_stuff = stuff_not_in_tags
    for_update.append(thing)
    if len(for_update) == chunk_size:
        Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
        TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True)  # in case the tag is already assigned
        for_update = []
        tags_for_create = []
# Save remaining objects
Thing.objects.bulk_update(for_update, ['stuff', 'other_stuff'], chunk_size)
TagsToThing.objects.bulk_create(tags_for_create, ignore_conflicts=True)  # in case the tag is already assigned
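The accumulate-and-flush pattern above can be checked without Django. The following is a minimal sketch using plain dicts in place of model instances; split_by_tags, process_in_chunks, and the flush callback are hypothetical names standing in for the bulk_update/bulk_create calls.

```python
def split_by_tags(stuff, tag_lookup):
    """Partition items by case-insensitive membership in tag_lookup."""
    in_tags, not_in_tags = [], []
    for item in stuff:
        if item.lower() in tag_lookup:
            in_tags.append(item)
        else:
            not_in_tags.append(item)
    return in_tags, not_in_tags


def process_in_chunks(records, tag_lookup, chunk_size, flush):
    """Buffer processed records and hand them to flush() chunk_size at a time."""
    pending = []
    for record in records:
        in_tags, not_in_tags = split_by_tags(record["stuff"], tag_lookup)
        pending.append(
            {"pk": record["pk"], "stuff": in_tags, "other_stuff": not_in_tags}
        )
        if len(pending) == chunk_size:
            flush(pending)  # stands in for one bulk write
            pending = []
    if pending:  # write whatever is left after the loop
        flush(pending)


tag_lookup = {"red": "tag:red", "blue": "tag:blue"}
records = [{"pk": i, "stuff": ["Red", "round"]} for i in range(5)]
batches = []
process_in_chunks(records, tag_lookup, 2, batches.append)
print([len(batch) for batch in batches])
```

With five records and a chunk size of two, flush is called three times instead of five, which is the same saving the bulk operations give at the 500k scale.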

How can you query embedded document that is null with mongoengine

I am new to mongoengine and querying. I have a document and an embedded document that look like the following:
class Plan(EmbeddedDocument):
    name = StringField()
    size = FloatField()
class Test(Document):
    date = DateTimeField()
    plan = EmbeddedDocumentField(Plan)
How can I get all Test documents that have no size set, that is, where size is null/None?
I tried a __raw__ query, but it did not work for me.

Attributes of nested/embedded documents are queried in the following manner (see the docs):
class LightSaber(EmbeddedDocument):
    color = StringField()
    length = FloatField()
class Jedi(Document):
    name = StringField()
    light_saber = EmbeddedDocumentField(LightSaber)
saber1 = LightSaber(color='red', length=32)
Jedi(name='Obiwan', light_saber=saber1).save()
saber2 = LightSaber(color='yellow', length=None)
Jedi(name='Yoda', light_saber=saber2).save()
Jedi(name='Rey', light_saber=None).save()
for jedi in Jedi.objects(light_saber__length=None):
    print(jedi.name)
# prints:
# Yoda
# Rey
That being said, by naming your attribute "size" you are hitting an edge case. "size" is also a mongoengine query operator, so if you query Test.objects(plan__size=None) you'll get an error because MongoEngine assumes you want to use the size operator.
To do the same with __raw__, use the following:
for jedi in Jedi.objects(__raw__={'light_saber.length': None}):
    print(jedi.name)
Using __raw__ works fine with "size" as well. In your example that would be: Test.objects(__raw__={'plan.size': None})
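Why both Yoda and Rey are returned: a MongoDB filter of the form {path: None} matches documents where the value is null and documents where the path is missing. The sketch below mimics that rule on plain dicts; the stored shapes are an assumption (that fields set to None are left out of the saved document), and get_path and matches_null are hypothetical helpers.

```python
_MISSING = object()


def get_path(doc, dotted_path):
    """Walk a dotted path through nested dicts; return _MISSING if a step is absent."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches_null(doc, dotted_path):
    """Mimic a {path: None} filter: true for null values and for missing paths."""
    value = get_path(doc, dotted_path)
    return value is _MISSING or value is None


# Assumed stored shapes for the three Jedi from the example above
jedis = [
    {"name": "Obiwan", "light_saber": {"color": "red", "length": 32.0}},
    {"name": "Yoda", "light_saber": {"color": "yellow"}},
    {"name": "Rey"},
]
names = [j["name"] for j in jedis if matches_null(j, "light_saber.length")]
print(names)
```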

Wtforms- Create Dict using formfield that has two values per key

I'd like to collect both the track name and the start time for each track on a CD into a dict or JSON column in a table.
I have defined a FormField to capture the track names and save them in a dict:
class SeperateTracks(NoCsrfForm):
    track1 = TextField('track1')
    track2 = TextField('track2')
    track3 = TextField('track3')
    track4 = TextField('track4')
class SendForm(Form):
    alltracks = FormField(SeperateTracks)
This creates a dictionary that looks something like this:
{"track1": "songname1", "track2": "songname2", "track3": "songname3", "track4": "songname4"}
What I'd like to achieve is to have two TextFields per track: one for the track name and one for the start time of the track.
I realize that, to accommodate this, I could simply add more text fields to hold the start times, like so:
class SeperateTracks(NoCsrfForm):
    track1 = TextField('track1')
    track2 = TextField('track2')
    track3 = TextField('track3')
    track4 = TextField('track4')
    starttime1 = TextField('starttime1')
    starttime2 = TextField('starttime2')
    starttime3 = TextField('starttime3')
    starttime4 = TextField('starttime4')
However, this wouldn't associate the times with the corresponding tracks. What would be the recommended way of doing something like this?

You'd want something like this:
class Track(NoCsrfForm):
    songname = StringField('Song Name')
    starttime = StringField('start time')
class SeparateTracks(NoCsrfForm):
    tracks = FieldList(FormField(Track), min_entries=1, max_entries=4)
I based max_entries on your example, but it's not strictly required, and you can manage anywhere between 1 and N entries this way.
The data in Python would look something like:
[{"songname": "name 1", "starttime": "123"}, {"songname": "name 2", "starttime": "456"}, ... ]
More info: see the "Field Enclosures" section of the WTForms documentation.
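If a dict keyed by track is still wanted for the JSON column, the list of entries can be reshaped afterwards. This is a minimal sketch on plain data; tracks_to_dict is a hypothetical helper and the "trackN" key format is an assumption carried over from the question.

```python
import json


def tracks_to_dict(entries):
    """Key each track entry as track1, track2, ... in submission order."""
    result = {}
    for number, entry in enumerate(entries, start=1):
        result["track{}".format(number)] = {
            "songname": entry["songname"],
            "starttime": entry["starttime"],
        }
    return result


# Shape of the data as described in the answer above
entries = [
    {"songname": "name 1", "starttime": "123"},
    {"songname": "name 2", "starttime": "456"},
]
by_track = tracks_to_dict(entries)
print(json.dumps(by_track, sort_keys=True))
```

Each key now carries both values, so the start time stays attached to its track.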

NDB query using filters on Structured property which is also repeated ?

I am creating a sample application that stores user details along with their class information.
The model classes being used are:
Model class for saving a user's class data
class MyData(ndb.Model):
    subject = ndb.StringProperty()
    teacher = ndb.StringProperty()
    strength = ndb.IntegerProperty()
    date = ndb.DateTimeProperty()
Model class for the user
class MyUser(ndb.Model):
    user_name = ndb.StringProperty()
    email_id = ndb.StringProperty()
    my_data = ndb.StructuredProperty(MyData, repeated = True)
I can store data in the datastore successfully and can also run simple queries on the MyUser entity using filters on email_id and user_name.
But when I query MyUser with a filter on a property of the MyUser model's structured property, my_data, it does not give the correct result.
I think I am querying incorrectly.
Here is my query function:
# Function to query based on the repeated structured property
def queryMyUserWithStructuredPropertyFilter():
    shail_users_query = MyUser.query(ndb.AND(MyUser.email_id == "napolean#gmail.com", MyUser.my_data.strength > 30))
    shail_users_list = shail_users_query.fetch(10)
    maindatalist=[]
    for each_user in shail_users_list:
        logging.info('NEW QUERY :: The user details are : %s %s'% (each_user.user_name, each_user.email_id))
        # Class data
        myData = each_user.my_data
        for each_my_data in myData:
            templist = [each_my_data.strength, str(each_my_data.date)]
            maindatalist.append(templist)
            logging.info('NEW QUERY :: The class data is : %s %s %s %s'% (each_my_data.subject, each_my_data.teacher, str(each_my_data.strength),str(each_my_data.date)))
    return maindatalist
I want to fetch the entity so that its repeated structured property (my_data) is a list containing only the items with strength > 30.
Please help me understand what I am doing wrong.
Thanks.

A query over a repeated StructuredProperty returns every entity for which at least one of its structured values satisfies the condition. If you want to filter the structured values themselves, you'll have to do it afterwards.
Something like this should do the trick:
def queryMyUserWithStructuredPropertyFilter():
    shail_users_query = MyUser.query(MyUser.email_id == "napolean#gmail.com", MyUser.my_data.strength > 30)
    shail_users_list = shail_users_query.fetch(10)
    # Here, shail_users_list has at most 10 users with email being
    # 'napolean#gmail.com' and at least one element in my_data
    # with strength > 30
    maindatalist = [
        [[data.strength, str(data.date)] for data in user.my_data if data.strength > 30] for user in shail_users_list
    ]
    # Now in maindatalist you have ONLY those my_data with strength > 30
    return maindatalist
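Note that the comprehension above yields one inner list per user, whereas the question's loop built a single flat list. If the flat shape is preferred, the post-filter can be written as below. This is only a sketch: namedtuples stand in for the ndb entities, the dates are plain strings, and flatten_strong_classes is a hypothetical name.

```python
from collections import namedtuple

# Stand-ins for the ndb models, so the filtering logic runs without App Engine
MyData = namedtuple("MyData", ["subject", "teacher", "strength", "date"])
MyUser = namedtuple("MyUser", ["user_name", "email_id", "my_data"])


def flatten_strong_classes(users, min_strength=30):
    """Return one flat list of [strength, date] pairs above min_strength."""
    return [
        [data.strength, str(data.date)]
        for user in users
        for data in user.my_data
        if data.strength > min_strength
    ]


users = [
    MyUser(
        "napolean",
        "napolean#gmail.com",
        [
            MyData("maths", "A", 45, "2014-01-01"),
            MyData("art", "B", 20, "2014-01-02"),
        ],
    )
]
rows = flatten_strong_classes(users)
print(rows)
```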

Accessing fields in model in post procedure in Google App Engine

I have a post(self) method and I want to add some logic to it that stores lat and lng (computed from Google Maps) in the datastore, as defined in my db model. Should I add them to data, or should I do it some other way, such as through the original class? What is the best way to do this?
So...
class Company(db.Model):
    company_type = db.StringProperty(required=True, choices=["PLC", "LTD", "LLC", "Sole Trader", "Other"])
    company_lat = db.StringProperty(required=True)
    company_lng = db.StringProperty(required=True)
class CompanyForm(djangoforms.ModelForm):
    company_description = forms.CharField(widget=forms.Textarea(attrs={'rows':'2', 'cols':'20'}))
    company_address = forms.CharField(widget=forms.Textarea(attrs={'rows':'2', 'cols':'20'}))
    class Meta:
        model = Company
        # each excluded field must be its own list item
        exclude = ['company_lat', 'company_lng']
def post(self):
    data = CompanyForm(data=self.request.POST)
    map_url = ''
    address = self.request.get("company_postcode")
    ...
    lat = response['results'][0]['geometry']['location']['lat']
    lng = response['results'][0]['geometry']['location']['lng']
    ...
    # How do I add these fields lat and lng to my data store?
    # Should I add them to data? if this is possible?
    # Or shall I do it some other way?
Thanks

The djangoforms help page explains how to add data to your datastore entity: call the save method with commit=False. It returns the datastore entity, and you can then set additional fields before saving it with put().
def post(self):
    ...
    # This code is after the code above
    if data.is_valid():
        entity = data.save(commit=False)
        # company_lat and company_lng are declared as StringProperty, so store strings
        entity.company_lat = str(lat)
        entity.company_lng = str(lng)
        entity.put()
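The lat/lng lookup from the question can be isolated into a small function that also copes with an empty result list. This is a minimal sketch on a plain dict; extract_lat_lng is a hypothetical helper, and the response layout is taken from the indexing used in the question.

```python
def extract_lat_lng(response):
    """Return (lat, lng) as strings, or None when the response has no results."""
    results = response.get("results") or []
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    # Strings, because the Company model declares both fields as StringProperty
    return str(location["lat"]), str(location["lng"])


sample = {"results": [{"geometry": {"location": {"lat": 51.5, "lng": -0.12}}}]}
coords = extract_lat_lng(sample)
print(coords)
```

Returning None for an empty result lets post() skip the put() instead of failing on an IndexError.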

It really depends on the types of queries you intend to do. If you want to perform geospatial queries, GeoModel is built for your use case.
