I am working on a new project and had to put together an outline of a few pages quickly.
I imported a catalogue of 280k products that I want to search through. I opted for Whoosh and Haystack to provide search, as I had used them on a previous project.
I added definitions for the indexing and kicked off that process. However, it seems that Django is really, really slow to iterate over the QuerySet.
Initially, I thought the indexing was taking more than 24 hours, which seemed ridiculous, so I tested a few other things. I can now confirm that it would take many hours just to iterate over the QuerySet.
Maybe there's something I'm not used to in Django 2.2? I previously used 1.11 but thought I'd use a newer version now.
The model I'm trying to iterate over:
class SupplierSkus(models.Model):
    sku = models.CharField(max_length=20)
    link = models.CharField(max_length=4096)
    price = models.FloatField()
    last_updated = models.DateTimeField("Date Updated", null=True, auto_now=True)
    status = models.ForeignKey(Status, on_delete=models.PROTECT, default=1)
    category = models.CharField(max_length=1024)
    family = models.CharField(max_length=20)
    family_desc = models.TextField(null=True)
    family_name = models.CharField(max_length=250)
    product_name = models.CharField(max_length=250)
    was_price = models.FloatField(null=True)
    vat_rate = models.FloatField(null=True)
    lead_from = models.IntegerField(null=True)
    lead_to = models.IntegerField(null=True)
    deliv_cost = models.FloatField(null=True)
    prod_desc = models.TextField(null=True)
    attributes = models.TextField(null=True)
    brand = models.TextField(null=True)
    mpn = models.CharField(max_length=50, null=True)
    ean = models.CharField(max_length=15, null=True)
    supplier = models.ForeignKey(Suppliers, on_delete=models.PROTECT)
and, as I mentioned, there are roughly 280k lines in that table.
When I do something as simple as:
from products.models import SupplierSkus
sku_list = SupplierSkus.objects.all()
len(sku_list)
The process quickly sucks up most of the CPU power and never finishes. Likewise, I cannot iterate over it:
for i in sku_list:
    print(i.sku)
This will also just take hours without printing a single line. However, I can iterate over it using:
for i in sku_list.iterator():
    print(i.sku)
That doesn't help me very much, as I still need to do the indexing via Haystack and I believe that the issues are related.
This wasn't the case with earlier projects I've worked on. Even a much more sizeable list (3-5m lines) would iterate quite quickly. A query for the list length would take a moment, but return the result in seconds rather than hours.
So, I wonder, what's going on?
Is this something someone else has come across?
Okay, I've found that the problem was the Python MySQL driver. Without using the .iterator() method, a for loop would get stuck on the last element of the QuerySet. I have posted a more detailed answer on an expanded question here.
I was not using the Django-recommended mysqlclient; I was using the one created by Oracle/MySQL. There seems to be a bug that causes an iterator to get "stuck" on the last element of the QuerySet in a for loop and be trapped in an endless loop in certain circumstances.
Come to think of it, it may well be that this is a design feature of the MySQL driver. I remember having a similar issue with a Java version of this driver before. Maybe I should just ditch MySQL and move to PostgreSQL?
I will try to raise a bug with Oracle anyway.
Related
I am making models for my store project and I want to know why the code I wrote is wrong, and how I can write it correctly.
I want the time when a product is sold to be recorded in the database, and only when it is actually sold.
class Product(models.Model):
    product_type = models.ForeignKey(ProductType, on_delete=models.PROTECT, related_name='products_types')
    upc = models.BigIntegerField(unique=True)
    title = models.CharField(max_length=32)
    description = models.TextField(blank=True)
    category = models.ForeignKey(Category, on_delete=models.PROTECT, related_name='products')
    brand = models.ForeignKey(Brand, on_delete=models.PROTECT, related_name='products')
    soled = models.BooleanField(default=False)
    if soled == True:
        soled_time = models.DateTimeField(auto_now_add=True)
    created_time = models.DateTimeField(auto_now_add=True)
    modified_time = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.title
I hope that solving my problem will also answer this question for many friends <3
Welcome to SO. Great question; let me explain.
Your model is correct aside from the soled_time field and the if statement above it. Model fields are class attributes that are evaluated once when the class is defined, so an if on a field can never react to an individual instance's value.
That being said, I think you may be missing some logic here: unless you have some kind of single-item, single-sale thing going on (as opposed to a stock of items), you may need to add another model.
The product tends to be its own entity in this kind of example, and we would create supporting models with relationships to the Product model.
For example, you may have a Cart model, which looks something like:
class Cart(models.Model):
    # note: ManyToManyField does not take an on_delete argument
    products = models.ManyToManyField(Product, related_name='carts_in')
    # add fk to user, so we know who ordered
    # add some functions which sum the total cost of products
I hope this kind of makes sense. We would then move the soled_time field away from Product and onto Cart, because then we know when the cart was sold, which items were in the cart, and which user made the cart.
Please also consider looking into existing django packages which manage A LOT of the heavy lifting when it comes to e-commerce. I would Google and sniff around a bit and see if any of the existing packages suit your needs first; most of them are quite extendable as well, so you should be able to make them work for your use case even if they don't fit right out of the box.
If you have any questions or I can add clarity to my answer, please feel free to add a comment. If this satisfies your question, then don't forget to accept this answer and leave an upvote if you like.
As for the question of "How do I update the field when the item is sold?"
The answer is that you override the model's save method, like so:
from django.utils import timezone

class Cart(models.Model):
    products = models.ManyToManyField(Product, related_name='carts_in')
    soled = models.BooleanField(default=False)
    soled_time = models.DateTimeField(null=True, blank=True)
    # add fk to user, so we know who ordered
    # add some functions which sum the total cost of products

    def save(self, *args, **kwargs):
        # set the timestamp before calling super(), otherwise it is
        # never written to the database
        if self.soled and self.soled_time is None:
            self.soled_time = timezone.now()
        super().save(*args, **kwargs)
So I have models like these
class Status(models.Model):
    name = models.CharField(max_length=255, choices=StatusName.choices, unique=True)

class Case(models.Model):
    # has some fields
    pass

class CaseStatus(models.Model):
    case = models.ForeignKey("cases.Case", on_delete=models.CASCADE, related_name="case_statuses")
    status = models.ForeignKey("cases.Status", on_delete=models.CASCADE, related_name="case_statuses")
    created = models.DateTimeField(auto_now_add=True)
I need to filter the cases on the basis of the status of their case-statuses, but the catch is that only the latest case-status should be taken into account.
To get Case objects based on all the case-statuses, this query works:
Case.objects.filter(case_statuses__status=status_name)
But I need to get the Case objects such that only their latest case_status object (descending created) is taken into account. Something like this is what I am looking for:
Case.objects.filter(case_statuses__order_by_created_first__status=status_name)
I have tried Prefetch as well, but it doesn't seem to work for my use case:
sub_query = CaseStatus.objects.filter(
    id=CaseStatus.objects.select_related('case').order_by('-created').first().id)
Case.objects.prefetch_related(Prefetch('case_statuses', queryset=sub_query)).filter(
    case_statuses__status=status_name)
This would be easy to solve in raw Postgres by using LIMIT 1, but I am not sure how to make that work in the Django ORM.
You can annotate your cases with their latest status and then filter on that annotation.
from django.db.models import OuterRef, Subquery

status_qs = CaseStatus.objects.filter(case=OuterRef('pk')).order_by('-created').values('status__name')[:1]
Case.objects.annotate(last_status=Subquery(status_qs)).filter(last_status=status_name)
I am sitting with a query looking like this:
# Get the amount of kilo attached to products
product_data = {}
for productSpy in ProductSpy.objects.all():
    product_data[productSpy.product.product_id] = productSpy.kilo  # RERUN
I do not see how, on my last line, I would be able to use prefetch_related. The examples in the docs are very simplified and somehow make sense, but I do not understand the whole concept well enough to see my way out of this. Could someone explain what's being done and how? I find this very important to understand, as I was met by my first N+1 here.
Thank you up front for your time.
models.py
class ProductSpy(models.Model):
    created_by = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    kilo = models.FloatField()  # implied by the loop above; not shown in the original

    def __str__(self):
        return str(self.kilo)

class Product(models.Model):
    product_id = models.IntegerField()
    name = models.CharField(max_length=150)

    def __str__(self):
        return self.name
Django fetches related tables lazily at runtime:
each access to productSpy.product fetches the matching row from the product table using the stored foreign key.
The latency of all those I/O operations makes this code highly inefficient. Using prefetch_related will fetch the products for all the ProductSpy objects in one extra query, resulting in much better performance:
# Get the amount of kilo attached to products
product_data = {}
# prefetch_related returns a new queryset, so reassign it; kilo is a
# plain field, not a relation, so only 'product' needs prefetching
product_spies = ProductSpy.objects.all().prefetch_related('product')
for productSpy in product_spies:
    product_data[productSpy.product.product_id] = productSpy.kilo
When one writes productSpy.product and the related object is not already fetched, Django will automatically make a query to the database to get the related Product instance. Hence, if ProductSpy.objects.all() returned N instances, writing productSpy.product in a loop makes N more queries, which is what we call the N+1 problem.
Moving further: although you could use prefetch_related here (it would use 2 queries in your case), it would be better to use select_related [Django docs], which uses a SQL JOIN and gets you the related instances in 1 query:
product_data = {}
queryset = ProductSpy.objects.select_related('product')
for productSpy in queryset:
    product_data[productSpy.product.product_id] = productSpy.kilo  # No extra queries as we used `select_related`
Note: there seems to be a problem with your logic here, though. Multiple ProductSpy instances can have the same Product, so your loop might overwrite some values.
I'm experiencing some severe performance issues with prefetch_related on a model with 5 m2m fields, where I'm also pre-fetching a few nested m2m fields.
class TaskModelManager(models.Manager):
    def get_queryset(self):
        return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).prefetch_related("parent", "takes", "takes__flags", "assignedUser", "assignedUser__flags", "asset", "asset__flags", "status", "approvalWorkflow", "viewers", "requires", "linkedTasks", "activities")
class Task(models.Model):
    uuid = models.UUIDField(primary_key=True, default=genOptimUUID, editable=False)
    internalStatus = models.IntegerField(default=0)
    parent = models.ForeignKey("self", blank=True, null=True, related_name="childs")
    name = models.CharField(max_length=45)
    taskType = models.ForeignKey("TaskType", null=True)
    priority = models.IntegerField()
    startDate = models.DateTimeField()
    endDate = models.DateTimeField()
    status = models.ForeignKey("ProgressionStatus")
    assignedUser = models.ForeignKey("Asset", related_name="tasksAssigned")
    asset = models.ForeignKey("Asset", related_name="tasksSubject")
    viewers = models.ManyToManyField("Asset", blank=True, related_name="followedTasks")
    step = models.ForeignKey("Step", blank=True, null=True, related_name="tasks")
    approvalWorkflow = models.ForeignKey("ApprovalWorkflow")
    linkedTasks = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="linkedTo")
    requires = models.ManyToManyField("self", symmetrical=False, blank=True, related_name="depends")

    objects = TaskModelManager()
The number of queries is fine and the database query time is fine too; for example, if I query 700 objects of my model I get 35 queries with an average query time of 100~200ms, but the total request time is approximately 8 seconds.
(screenshot: Silk timings)
I've run some profiling and it pointed out that more than 80% of the time spent was on the prefetch_related_objects call.
(screenshot: profiling results)
I'm using Django==1.8.5 and djangorestframework==3.4.6
I'm open to any way to optimize this.
Thanks in advance for your help.
Edit with select_related:
I've tried the improvement proposed by Alasdair
class TaskModelManager(models.Manager):
    def get_queryset(self):
        return super(TaskModelManager, self).get_queryset().exclude(internalStatus=2).select_related("parent", "status", "approvalWorkflow", "step").prefetch_related("takes", "takes__flags", "assignedUser", "assignedUser__flags", "asset", "asset__flags", "viewers", "requires", "linkedTasks", "activities")
The new result is still 8 seconds for the request with 32 queries and 150ms of query time.
Edit :
It seems that a ticket was opened on the Django issue tracker 4 years ago and is still open: https://code.djangoproject.com/ticket/20577
I ran into the same problem.
Following the issue you linked, I found that you can improve prefetch_related performance by using a Prefetch object with the to_attr argument.
From the commit that introduces the Prefetch object:
When a Prefetch instance specifies a to_attr argument, the result is
stored in a list rather than a QuerySet. This has the fortunate
consequence of being significantly faster. The performance improvement
is due to the fact that we save the costly creation of a QuerySet
instance.
So I significantly improved my code (from about 7 seconds to 0.88 seconds) simply by calling:
Foo.objects.filter(...).prefetch_related(Prefetch('bars', to_attr='bars_list'))
instead of
Foo.objects.filter(...).prefetch_related('bars')
Try using select_related for foreign keys like parent and approvalWorkflow instead of prefetch_related.
When you use select_related, Django fetches the related models with a JOIN, unlike prefetch_related, which issues an extra query per relation. You might find that this improves performance.
If the DB is 150ms but your request is 8 seconds, it's not your query (in itself, at least). A few possible issues:
1) Your HTML or template is too complex, spending too much time in generating the response. Or consider template caching.
2) All those objects are complex and you load too many fields, so while the query is fast, sending it and processing all those objects in Python is slow. Explore using only(), defer() and values() or values_list() to load only what you need.
Optimization is hard and we'd need more details to give you a better idea. I'd suggest installing Django Debug Toolbar (Django app) or Opbeat (3rd party utility), they can help you detect where your time is spent and then you can optimize accordingly.
I was wondering how I could make this work.
I'm currently doing this Django project that creates and modifies Client models, as such:
class Client(models.Model):
    name = models.CharField(max_length=300)
    birth_date = models.DateTimeField('birth date')
    gender = models.CharField(max_length=1, choices=GENDER, null=True)
    phone_number = models.CharField(max_length=13, null=True)
    email = models.EmailField(max_length=254, null=True)
    medical_condition = models.CharField(max_length=300, null=True)
    experience = models.CharField(max_length=3, choices=EXPERIENCE, blank=True)
    next_class = models.DateTimeField('next class')
Now, everything is pretty much working fine. I'm currently using generic form views to create and update students, so that's fine. However, I'm planning a view where I can see a financial report of all the students who have taken classes, payments included. How would I go about this? I could try appending next_class to a dictionary of some sort each time I change its value, but I feel that's not efficient at all.
I've tried to see how I can do this; I was mostly thinking of using Django's Admin history app and integrating it to my actual webpage, but I found out about Simple History which seems pretty good. I feel like it doesn't exactly do the queries I want, though.
Thanks in advance, and I apologize if it's vague in any way. I can update my post with more code if necessary, just didn't want to make this question much longer.
I think your model is wrong. It seems like you need a separate related model for the classes that a student takes. It could be as simple as:
class Class(models.Model):
    client = models.ForeignKey('Client', on_delete=models.CASCADE)
    date = models.DateTimeField()
so that instead of changing a field in the Client model, you add a new row to the Class model. Then it will be fairly simple to query all classes for a client between two dates:
classes_taken = my_client.class_set.filter(date__range=[start_date, end_date])
where my_client is an instance of Client, and start_date and end_date are the relevant dates.