I am currently working on a Django website and at a certain point need to generate a dataset for a graph out of model data. Over the last few days I have been trying to optimize the code that generates that data, but it is still relatively slow with a dataset that will definitely get a lot larger once the website is up.
The model I am working with looks like this:
class Prices(models.Model):
    name = models.CharField(max_length=200, blank=False)
    website = models.CharField(max_length=200, blank=False, choices=WEBSITES)
    date = models.DateField('date', default=date.today(), blank=False, null=False)
    price = models.IntegerField(default=0, blank=False, null=False)

    class Meta:
        indexes = [
            models.Index(fields=['website']),
            models.Index(fields=['date']),
        ]
It stores prices for different items (identified by name) on different websites.
For the graph representation of the model I need two arrays: one with all the dates as the y-axis, and one with the price of the item on each of those dates (which might not be available, since not every item has an entry for every day on every website). And I need all of that for every website.
My code for generating that data currently looks like this:
for website in websites:
    iterator = Prices.objects.filter(website=website).iterator()
    data = []
    entry = next(iterator)
    for date in labels:
        if entry is not None and date == entry.date:
            data.append(entry.price * 0.01)
            entry = next(iterator, None)
        else:
            data.append(None)
    ... do stuff with data (not relevant for performance)
I loop over each website and retrieve all its price data from my model. Then I loop over all dates (which are in the array labels), check whether an entry exists for that date, and if so append its price to the array, otherwise None.
Does anybody have tips or ideas on how to optimize the inner for-loop, as that is where 90% of my performance problem comes from?
This might not be as efficient as you want, but you can try this.
If this still does not meet your requirements, then all I can think of is doing it asynchronously.
from collections import defaultdict
from functools import reduce

iterator = Prices.objects.filter(website=website)

result = reduce(
    lambda acc, prices: acc[prices.date].append((prices.price * 0.01, prices.website)) or acc,
    filter(lambda x: x.date in labels, iterator),
    defaultdict(list),
)
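The result maps each date to a list of (price, website) tuples, so the per-date array from the question becomes a dictionary lookup instead of a second pass over the queryset. A minimal usage sketch (my own addition, assuming at most one entry per date for the filtered website):

data = [result[d][0][0] if d in result else None for d in labels]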
One way to do this asynchronously is with the concurrent.futures module. You can specify the maximum number of workers like this: ThreadPoolExecutor(max_workers=10). Moreover, if you want multiple processes instead of threads, you can simply replace ThreadPoolExecutor with ProcessPoolExecutor.
import concurrent.futures
from collections import defaultdict

result = defaultdict(list)

def reducer_function(price):
    if price.date in labels:
        result[price.date].append((price.price * 0.01, price.website))

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(reducer_function, iterator)
Related
I am trying to iterate through each instance of a model I have defined.
Say I have the following in models.py, under the people Django app:
class foo(models.Model):
    name = models.CharField(max_length=20, default="No Name")
    age = models.PositiveSmallIntegerField(default=0)
And I have populated the database to have the following data in foo:
name="Charley", age=17
name="Matthew", age=63
name="John", age=34
Now I want to work out the average age of my dataset. In another part of the program (outside the people app, inside the project folder), in a file called bar.py that will be set up to run automatically every day at a specific time, I calculate this average.
from people.models import foo

def findAverageAge():
    total = 0
    for instance in //What goes here?// :
        total += instance.age
    length = //What goes here, to find the number of instances?//
    average = total / length
    return average

print(findAverageAge)
In the code above, the //What goes here?// signifies I don't know what goes there and need help.
You can retrieve all elements with .all() [Django-doc]:
from people.models import foo

def findAverageAge():
    total = 0
    qs = foo.objects.all()
    for instance in qs:
        total += instance.age
    length = len(qs)
    average = total / length
    return average

print(findAverageAge())
But you should not calculate the average at the Django/Python level. This requires retrieving all records, which can be quite expensive for a large table. The database itself can calculate the average, and it (normally) does this in a more efficient way. You can use .aggregate(…) [Django-doc] on the queryset:
from django.db.models import Avg

from people.models import foo

def findAverageAge():
    return foo.objects.aggregate(
        avg_age=Avg('age')
    )['avg_age'] or 0

print(findAverageAge())
You should make a management command for this, however: it will load the Django apps properly, and it also makes it more convenient to run the command with parameters, schedule it, etc.
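A minimal sketch of such a management command (the file path and command name below are my own assumptions, not from the question):

# people/management/commands/find_average_age.py (hypothetical location)
from django.core.management.base import BaseCommand
from django.db.models import Avg

from people.models import foo

class Command(BaseCommand):
    help = 'Prints the average age of all foo records.'

    def handle(self, *args, **options):
        average = foo.objects.aggregate(avg_age=Avg('age'))['avg_age'] or 0
        self.stdout.write(str(average))

It can then be run (or scheduled, for example with cron) as python manage.py find_average_age.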
I would like to display the percentage mark instead of the total sum of the marks. Right now I have a table that displays the student name and their attendance mark. I would like to convert the attendance mark into a percentage. The current output is:
Student Name Attendance
Annie 200
Anny 150
But I would like to show the attendance as a percentage, for example:
Student Name Attendance
Annie 100%
Anny 85%
I am not sure how to implement the method, but I have tried this:
# models.py:
class MarkAtt(models.Model):
    studName = models.ForeignKey(
        Namelist, on_delete=models.SET_NULL, blank=True, null=True, default=None,
    )
    classGrp = models.ForeignKey(GroupInfo, on_delete=models.SET_NULL, null=True)
    currentDate = models.DateField(default=now())
    week = models.IntegerField(default=1)
    attendance = models.IntegerField(default=100)  # 1 is present

    def get_percentage(self):
        ttlCount = MarkAtt.objects.filter(studName).count()
        perc = ttlCount / 1100 * 100
        return perc
# views.py:
def attStudName(request):
    students = MarkAtt.objects.values('studName__VMSAcc').annotate(mark=Sum('attendance'))
    context = {'students': students}
    return render(request, 'show-name.html', context)
So you have your numerator, but you need your denominator. I'm not exactly sure what your denominator should be with your current setup, but creating a new field that uses Count rather than Sum might do the trick for you. Then you would divide the sum field by the count field. I would probably just do this in the view and not mess with the model.
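A rough sketch of that idea in the view (field names are taken from the question; the assumption that a full mark per session is 100 is mine):

from django.db.models import Count, Sum
from django.shortcuts import render

def attStudName(request):
    students = list(
        MarkAtt.objects.values('studName__VMSAcc').annotate(
            mark=Sum('attendance'),
            sessions=Count('id'),
        )
    )
    for student in students:
        # marks earned divided by the maximum possible marks (sessions * 100), as a percentage
        student['percentage'] = student['mark'] / (student['sessions'] * 100) * 100
    context = {'students': students}
    return render(request, 'show-name.html', context)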
You can use the format specification mini-language to express a percentage in a string:
>>> attendance = 20
>>> total = 100
>>> '{:.0%}'.format(attendance / total)
'20%'
Bear in mind this returns the answer as a string instead of an int or float.
To use this in your get_percentage, there are a few issues that will need to be addressed:
studName is not defined or passed into the method but is used in the filter query. This will cause a NameError.
Your filter query is just counting the matching rows instead of summing the attendance. You should sum the attendance, like the Sum annotation you used in the view.
You will need to be able to get the value for 100% attendance to be able to divide by it.
To modify your implementation of this on the model, you could do something like this:
def get_percentage(self):
    total = MarkAtt.objects.filter(studName=self.studName).aggregate(mark=Sum('attendance'))['mark'] or 0
    return '{:.0%}'.format(total / 1100)
However, I don't think the MarkAtt model is the right place to do this, as many MarkAtt objects could relate to one Namelist object, which could result in running the same query several times for each student. I think it would be better to do this on Namelist or in the view itself.
class Namelist(models.Model):
    ...

    def get_percentage(self):
        total = self.markatt_set.aggregate(total=Sum('attendance'))['total'] or 0
        return '{:.0%}'.format(total / 1100)
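For completeness, usage might then look like this (a sketch; I am assuming VMSAcc is the name field on Namelist, as the studName__VMSAcc lookup in the view suggests):

for student in Namelist.objects.all():
    print(student.VMSAcc, student.get_percentage())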
There are around 5000 companies and each company has around 4500 prices, that makes a total of around 22,000,000 prices.
Now, a while ago I wrote code that stored this data in a format like this:
class Endday(models.Model):
    company = models.TextField(null=True)
    eop = models.CommaSeparatedIntegerField(blank=True, null=True, max_length=50000)
And the code to store it was:
for i in range(1, len(contents)):
    csline = contents[i].split(",")
    prices = csline[1:len(csline)]
    company = csline[0]
    entry = Endday(company=company, eop=prices)
    entry.save()
The code was slow (obviously), but it did work and stored the data in the database. One day, I decided to delete all the contents of Endday and tried to store the data again, but it did not work, throwing a "database locked" error.
Anyway, I did a little research and read that MySQL cannot handle this much data. So how did it get stored in the first place? I came to the conclusion that these prices were stored at the very beginning, after which a lot more had accumulated in the database, so they could no longer be stored.
After a little more research, I read that I should use PostgreSQL, so I changed the database, ran the migrations and tried the code again, but no luck. I got an error saying:
psycopg2.DataError: value too long for type character varying(50000)
Alright, so I thought I'd try bulk_create and modified the code a bit, but I was greeted with the same error.
Next, I thought I'd make two models: one to hold the company names and the other for the prices with a key to that particular company. So again, I changed the code:
class EnddayCompanies(models.Model):
    company = models.TextField(max_length=500)

class Endday(models.Model):
    foundation = models.ForeignKey(EnddayCompanies, null=True)
    eop = models.FloatField(null=True)
And the view:
to_be_saved = []
for i in range(1, len(contents)):
    csline = contents[i].split(",")
    prices = csline[1:len(csline)]
    company = csline[0]
    companies.append(csline[0])
    prices = [float(x) for x in prices]
    before_save = []
    for j in range(len(prices)):
        before_save.append(Endday(company=company, eop=prices[j]))
    to_be_saved.append(before_save)
Endday.objects.bulk_create(to_be_saved)
But to my surprise, this was so slow that partway through it just stopped on one company. I tried to find which particular code was slowing it down, and it was:
before_save = []
for j in range(len(prices)):
    before_save.append(Endday(company=company, eop=prices[j]))
to_be_saved.append(before_save)
Well, now I am back to square one and cannot think of anything else, so I rang the bell of SO. The questions I have now:
How should I go about this?
Why did the save work with MySql?
Is there a better way to do this? (Of course there must be)
If there is, what is it?
I think you can create separate models for Company and Price, something like this:
class Company(models.Model):
    name = models.CharField(max_length=20)

class Price(models.Model):
    company = models.ForeignKey(Company, related_name='prices')
    price = models.FloatField()
This is how you save the data:
# Assuming that contents is a list of strings with a format like this:
contents = [
    'Company 1, 1, 2, 3, 4...',
    'Company 2, 1, 2, 3, 4...',
    ....
]

for content in contents:
    tokens = content.split(',')
    company = Company.objects.create(name=tokens[0])
    Price.objects.bulk_create(
        Price(company=company, price=float(x.strip()))
        for x in tokens[1:]
    )

# Then you can call prices from company
company.prices.order_by('price')
UPDATE: I just noticed that this is similar to your second implementation; the only difference is the way of saving the data. My implementation has fewer iterations.
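One extra option (my own note, not part of the original answer): with roughly 4,500 prices per company, each bulk_create call builds one fairly large INSERT. If that ever becomes a memory or packet-size problem, bulk_create accepts a batch_size argument; the value 500 below is just an illustrative choice:

Price.objects.bulk_create(
    [Price(company=company, price=float(x.strip())) for x in tokens[1:]],
    batch_size=500,  # split the insert into chunks of 500 rows
)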
I have the following models in Django:
class Job(models.Model):
    cost = models.FloatField()

class Account(models.Model):
    job = models.ManyToManyField(Job, through='HasJob')

class HasJob(models.Model):
    account = models.ForeignKey(Account, related_name='hasjobs')
    job = models.ForeignKey(Job, related_name='hasjobs')
    quantity = models.IntegerField()
So an Account can have many Jobs in different quantities. I want to be able to sum up the total cost of an account. Is that possible at the database level, or should I do it in Python? Like this:
account = Account.objects.get(pk=1)
sum = 0
for hasjob in account.hasjobs.all():
    sum += hasjob.quantity * hasjob.job.cost
I know it's a very "beginner" way to do it, and I am guessing it causes many hits on the database. So is there a better way?
AFAIK, aggregation can't sum over F() expressions, so you have to calculate the sum in Python code.
But you can reduce the number of db hits to one: just add a select_related() call to the queryset:
total_sum = sum(hasjob.quantity * hasjob.job.cost
for hasjob in account.hasjobs.all().select_related('job'))
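(A side note, not part of the original answer: Django versions that support expressions inside aggregates, roughly 1.8 and later, can push this sum down to the database. A sketch under that assumption:)

from django.db.models import F, FloatField, Sum

total = account.hasjobs.aggregate(
    total=Sum(F('quantity') * F('job__cost'), output_field=FloatField())
)['total'] or 0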
I have a product database that contains products, parts, and labels for each part based on language codes.
The problem I'm having, and haven't got around yet, is the huge amount of resources used to get the different datasets and merge them into a dict that suits my needs.
The products in the database are based on a number of parts, each of a certain type (i.e. color, size), and each part has a label for each language. I created four different models for this: Products, ProductParts, ProductPartTypes and ProductPartLabels.
I've narrowed it down to about 10 lines of code that seem to generate the problem. Currently I have 3 products, 3 types, 3 parts for each type, and 2 languages, and the request takes a whopping 5500ms to generate.
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

        ## Start of problem lines ##
        for defaultPart in product.defaultPartsData:
            for label in labelsForLangCode:
                if label.key() in defaultPart.partLabelList:
                    typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel

        for optionalPart in product.optionalPartsData:
            for label in labelsForLangCode:
                if label.key() in optionalPart.partLabelList:
                    typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
        ## end problem lines ##

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
I guess the problem is that there are too many for loops and I have to iterate over the same data over and over again. labelsForLangCode gets all labels from ProductPartLabels that match the current langCode.
All parts for a product are stored in a db.ListProperty(db.Key). The same goes for all labels for a part.
The reason I need the somewhat complex dict is that I want to display all data for a product with its default parts and show a selector for the optional ones.
The defaultPartsData and optionalPartsData are properties on the Product model that look like this:
@property
def defaultPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts)

@property
def optionalPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.optionalParts)
When the completed dict is in memcache it works smoothly, but isn't the memcache reset if the application goes into hibernation? Also, I would like to show the page to a first-time user (memcache empty) without the enormous delay.
Also, as I said above, this is only a small number of parts per product. What will the result be when it's 30 products with 100 parts each?
Is one solution to create a scheduled task that caches it in memcache every hour? Is this efficient?
I know this is a lot to take in, but I'm stuck. I've been at this for about 12 hours straight and can't figure out a solution.
..fredrik
EDIT:
An AppStats screenshot here.
From what I can read, the queries seem fine in AppStats, only taking about 200-400 ms. How can the difference be that big?
EDIT 2:
I implemented dound's solution and added a bit. Now it looks like this:
langCode = 'en'
typeData = Products.ProductPartTypes.all()
productData = Products.Product.all()
labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode = langCode)
productList = []

label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
labelData = memcache.get(label_cache_key)

if labelData is None:
    langDict = {}
    for langLabel in labelsForLangCode:
        langDict[str(langLabel.key())] = langLabel.partLangLabel
    memcache.add(label_cache_key, langDict, 500)
    labelData = memcache.get(label_cache_key)

GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')

for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if partData is None:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            for lb in part.partLabelList:
                if str(lb) in labelData:
                    label = labelData[str(lb)]
                    break

            if part.key() in product.defaultParts:
                typeDict[part.type.typeId]['default'] = label
            elif part.key() in product.optionalParts:
                typeDict[part.type.typeId]['optional'].append(label)

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
The result is much better. I now have about 3000ms without memcache and about 700ms with it.
I'm still a bit worried about the 3000ms, and on the local app_dev server the memcache gets filled on each reload. Shouldn't it put everything in there once and then read from it?
Last but not least, does anyone know why the request takes about 10x as long on the production server as on app_dev?
EDIT 3:
I noticed that none of the db.Models are indexed; could this make a difference?
EDIT 4:
After consulting AppStats (and understanding it took some time), it seems that the big problem lies within part.type.typeId, where part.type is a db.ReferenceProperty. I should have seen it before, and maybe explained it better :) I'll rethink that part and get back to you.
..fredrik
A few simple ideas:
1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)
2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.
3) Like Will pointed out, IN queries perform one query PER argument.
Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:
class ProductParts(db.Model):
    ...
    products = db.ListProperty(db.Key)  # product keys
    ...
Then you can do ONE query per product instead of N queries per product. For example, you could do this:
parts = ProductParts.all().filter("products =", product).fetch(1000)
The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).
Less critical improvement idea:
4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:
GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')

@property
def defaultPartsData(self):
    return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)
You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.
Here is what the code might look like if you rewrote it to fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).
GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')

for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name

    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)

    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }

        # here's a new approach that does just ONE datastore query (for each product)
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            if part.key() in product.defaultParts:
                part_type = 'default'
            else:
                part_type = 'optional'

            for label in labelsForLangCode:
                if label.key() in part.partLabelList:
                    if part_type == 'default':
                        typeDict[part.type.typeId]['default'] = label.partLangLabel
                    else:
                        typeDict[part.type.typeId]['optional'].append(label.partLangLabel)
        # (end new code)

        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)

    productDict['parts'] = partData
    productList.append(productDict)
One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.
So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.
Here's the relevant section from the GQL Reference:
Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
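If a key list can grow past that limit, one possible workaround (my own sketch, not something from the original answer) is to split the keys into chunks of at most 30 and run one IN query per chunk:

def fetch_parts_by_keys(keys, chunk_size=30):
    # one IN query per chunk of keys, to stay under the 30-subquery limit
    parts = []
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        parts.extend(ProductParts.gql('WHERE __key__ IN :1', chunk).fetch(len(chunk)))
    return parts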
You might try installing AppStats for your app to see where else it might be slowing down.
I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.
GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.