Iterating through large datasets with the Django ORM is slow - python

I'm using the Django ORM to get data out of a database with a few million items. However, the computation takes a while (40+ minutes), and I'm not sure how to pinpoint where the issue is.
Models I've used:
class user_chartConfigurationData(models.Model):
    username_chartNum = models.ForeignKey(user_chartConfiguration, related_name='user_chartConfigurationData_username_chartNum')
    openedConfig = models.ForeignKey(user_chartConfigurationChartID, related_name='user_chartConfigurationData_user_chartConfigurationChartID')
    username_selects = models.CharField(max_length=200)
    blockName = models.CharField(max_length=200)
    stage = models.CharField(max_length=200)
    variable = models.CharField(max_length=200)
    condition = models.CharField(max_length=200)
    value = models.CharField(max_length=200)
    type = models.CharField(max_length=200)
    order = models.IntegerField()

    def __unicode__(self):
        return str(self.username_chartNum)
class data_parsed(models.Model):
    setid = models.ForeignKey(sett, related_name='data_parsed_setid', primary_key=True)
    setid_hash = models.CharField(max_length=100, db_index=True)
    block = models.CharField(max_length=2000, db_index=True)
    username = models.CharField(max_length=2000, db_index=True)
    time = models.IntegerField(db_index=True)
    time_string = models.CharField(max_length=200, db_index=True)

    def __unicode__(self):
        return str(self.setid)
class unique_variables(models.Model):
    setid = models.ForeignKey(sett, related_name='unique_variables_setid')
    setid_hash = models.CharField(max_length=100, db_index=True)
    block = models.CharField(max_length=200, db_index=True)
    stage = models.CharField(max_length=200, db_index=True)
    variable = models.CharField(max_length=200, db_index=True)
    value = models.CharField(max_length=2000, db_index=True)

    class Meta:
        unique_together = (("setid", "block", "variable", "stage", "value"),)
The code I'm running loops through data_parsed and matches its rows against the relevant data in user_chartConfigurationData and unique_variables.
#After we get the tab, we will get the configuration data from the config button. We will need the tab ID,
#which is chartNum, and the actual chart that is opened, which is the chartID.
chartIDKey = user_chartConfigurationChartID.objects.get(chartID=chartID)
set_of_pk_values = []  # accumulated result rows (initialized before this snippet in the original code)
for i in user_chartConfigurationData.objects.filter(username_chartNum=chartNum, openedConfig=chartIDKey).order_by('order').iterator():
    iterator = data_parsed.objects.all().iterator()
    #We loop through the parsed objects, and at the same time use the setid (unique for all blocks), which contains
    #multiple variables. Using the condition, we can set the variable gte (greater than or equal) or lte (less than
    #or equal), so that the condition matches the setid for the data_parsed object and the variable condition.
    for contents in iterator:
        #These are two flags. found is set when we already have an entry inside the dictionary that matches
        #the same setid, meaning they are the same block. For example, FlowBranch and FlowPure can belong
        #to the same block. So when we find an entry that matches the same id, we put it in the same dictionary.
        #added is used when the current item does not map to a previous setid entry in the dictionary; then we
        #add this new entry to the list of dictionaries (set_of_pk_values). Otherwise we would be adding a lot
        #of entries that don't have any values for variables (because the value was added to another entry's dictionary).
        found = False
        added = False
        storeItem = {}
        #Initial information for the row
        storeItem['block'] = contents.block
        storeItem['username'] = contents.username
        storeItem['setid'] = contents.setid
        storeItem['setid_hash'] = contents.setid_hash
        if (i.variable != ""):
            for findPrevious in set_of_pk_values:
                if(str(contents.setid) == str(findPrevious['setid'])):
                    try:
                        items = unique_variables.objects.get(setid=contents.setid, variable=i.variable)
                        findPrevious[variableName] = items.value  # variableName is defined elsewhere in the original code
                        found = True
                        break
                    except:
                        pass
            if(found == False):
                try:
                    items = unique_variables.objects.get(setid=contents.setid, variable=i.variable)
                    storeItem[variableName] = items.value
                    added = True
                except:
                    pass
        if(found == False and added == True):
            storeItem['time_string'] = contents.time_string
            set_of_pk_values.append(storeItem)
I've tried using select_related() and prefetch_related(), since the code needs to go to the unique_variables objects and get some data, but it still takes a long time.
Is there a better way to approach this problem?

Definitely have a look at django-debug-toolbar. It will tell you how many queries you execute and how long they take. I can't really live without this package when I have to optimize something =).
PS: execution will be even slower while the toolbar is enabled.
edit: You may also want to enable db_index for the fields you filter on, or index_together for more than one field. Of course, measure the times between your changes so you can be sure which option is better.
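For example, a minimal sketch of the index idea against the models in the question, plus one way to batch the per-row lookup your loop performs (the batching assumes the per-variable values fit in memory):

class unique_variables(models.Model):
    # ... fields as in the question ...
    class Meta:
        unique_together = (("setid", "block", "variable", "stage", "value"),)
        index_together = (("setid", "variable"),)  # composite index for the per-row .get(setid=..., variable=...)

# Alternatively, fetch all values for the current chart variable in ONE query,
# once per outer iteration, instead of one query per data_parsed row:
value_by_setid = dict(
    unique_variables.objects.filter(variable=i.variable).values_list('setid', 'value')
)
# inside the inner loop, value_by_setid.get(contents.setid_id) then replaces
# unique_variables.objects.get(setid=contents.setid, variable=i.variable).value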

Related

How do I display Django data from a related model of a related model?

I am trying to display data from several models that are related together through a QuerySet. My ultimate goal is to display some information from the Site model, and some information from the Ppack model, based on a date range filter of the sw_delivery_date in the Site model.
Here are my models:
class Site(models.Model):
    mnemonic = models.CharField(max_length=5)
    site_name = models.CharField(max_length=100)
    assigned_tech = models.ForeignKey('Person', on_delete=models.CASCADE, null=True, blank=True)
    hw_handoff_date = models.DateField(null=True, blank=True)
    sw_delivery_date = models.DateField(null=True, blank=True)
    go_live_date = models.DateField(null=True, blank=True)
    web_url = models.CharField(max_length=100, null=True, blank=True)
    idp_url = models.CharField(max_length=100, null=True, blank=True)

    def __str__(self):
        return '(' + self.mnemonic + ') ' + self.site_name

class Ring(models.Model):
    ring = models.IntegerField()

    def __str__(self):
        return "6." + str(self.ring)

class Ppack(models.Model):
    ppack = models.IntegerField()
    ring = models.ForeignKey('Ring', on_delete=models.CASCADE)

    def __str__(self):
        return str(self.ring) + " pp" + str(self.ppack)

class Code_Release(models.Model):
    Inhouse = 'I'
    Test = 'T'
    Live = 'L'
    Ring_Location_Choices = (
        (Inhouse, 'Inhouse'),
        (Test, 'Test'),
        (Live, 'Live'),
    )
    site_id = models.ForeignKey('Site', on_delete=models.CASCADE)
    type = models.CharField(max_length=1, choices=Ring_Location_Choices, blank=True, null=True)
    release = models.ForeignKey('Ppack', on_delete=models.CASCADE)

    def __str__(self):
        return "site:" + str(self.site_id) + ", " + self.type + " = " + str(self.release)
If I use the following,
today = datetime.date.today()
future = datetime.timedelta(days=60)
new_deliveries = Site.objects.select_related().filter(sw_delivery_date__range=[today, (today + future)])
I can get all of the objects in the Site model that meet my criteria; however, because there is no relation from Site to Code_Release (there's a one-to-many coming the other way), I can't get at the Code_Release data.
If I run a for loop, I can iterate through every Site returned from the above query, and select the data from the Code_Release model, which allows me to get the related data from the Ppack and Ring models.
site_itl = {}
for delivery in new_deliveries:
    itl = {}  # a fresh dict per site, otherwise every site shares the same entries
    releases = Code_Release.objects.select_related().filter(site_id=delivery.id)
    for rel in releases:
        itl[rel.id] = rel.release
    site_itl[delivery.id] = itl
But, that seems overly complex to me, with multiple database hits and possibly a difficult time parsing through that in the template.
Based on that, I was thinking that I needed to select from the Code_Release model. That relates back to both the Site model and the Ppack model (which relates to the Ring model). I've struggled to make the right query / access the data in this way that accomplishes what I want, but I feel this is the right way to go.
How would I best accomplish this?
You can use the related manager here. When you declare a ForeignKey, Django lets you access the reverse relationship. To be specific, let's say you have multiple code releases pointing to one specific site. You can access them all via the site object using the <your_model_name_lowercase>_set attribute. So in your case:
site.code_release_set.all()
will return a QuerySet of all Code_Release objects whose ForeignKey points to the site object.
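For instance, a sketch of building the per-site mapping from the question with this reverse accessor (note each site gets its own fresh dict):

site_itl = {}
for site in new_deliveries:
    site_itl[site.id] = {rel.id: rel.release for rel in site.code_release_set.all()}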
You can access the releases from a Site object. First, set a related_name to give the reverse relation between the models a friendly name:
site_id = models.ForeignKey('Site', on_delete=models.CASCADE, related_name="releases")
and then, from a Site object, you can make normal queries against the Code_Release model:
site.releases.all()
site.releases.filter(...)
...
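With many sites this still costs one query per site; here is a sketch that prefetches everything in a handful of queries instead (assuming the related_name above):

new_deliveries = (
    Site.objects
        .filter(sw_delivery_date__range=[today, today + future])
        .prefetch_related('releases__release__ring')
)
for site in new_deliveries:
    for rel in site.releases.all():   # served from the prefetch cache, no extra query
        print(site.mnemonic, rel.release)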

How do I store a string in ArrayField? (Django and PostgreSQL)

I am unable to store a string in ArrayField. There are no exceptions thrown when I try to save something in it, but the array remains empty.
Here is some code from models.py:
# models.py
from django.db import models
import uuid
from django.contrib.auth.models import User
from django.contrib.postgres.fields import JSONField, ArrayField

# Create your models here.
class UserDetail(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    key = models.CharField(max_length=50, default=False, primary_key=True)
    api_secret = models.CharField(max_length=50)
    user_categories = ArrayField(models.CharField(max_length=1000), default=list)

    def __str__(self):
        return self.key

class PreParentProduct(models.Model):
    product_user = models.ForeignKey(UserDetail, default=False, on_delete=models.CASCADE)
    product_url = models.URLField(max_length=1000)
    pre_product_title = models.CharField(max_length=600)
    pre_product_description = models.CharField(max_length=2000)
    pre_product_variants_data = JSONField(blank=True, null=True)
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)

    def __str__(self):
        return self.pre_product_title
I try to save it this way:
catlist = ast.literal_eval(res.text)
for jsonitem in catlist:
    key = jsonitem.get('name')
    id = jsonitem.get("id")
    dictionary = {}
    dictionary['name'] = key
    dictionary['id'] = id
    tba = json.dumps(dictionary)
    print("It works till here.")
    print(type(tba))
    usersearch[0].user_categories.append(tba)
    print(usersearch[0].user_categories)
    usersearch[0].save()
    print(usersearch[0].user_categories)
The output I get is:
It works till here.
<class 'str'>
[]
It works till here.
<class 'str'>
[]
[]
Is this the correct way to store a string inside ArrayField?
I cannot store JSONField inside an ArrayField, so I had to convert it to a string.
How do I fix this?
Solution to the append problem.
You haven't shown how your usersearch is built; I suspect it's something like this:
usersearch = UserDetail.objects.all()
If that is so, you are not mutating a persistent object: each time you index into the QuerySet you get a freshly fetched instance, so your changes are lost. Try this and you will see that the id is unchanged too:
usersearch[0].id = 1000
print(usersearch[0].id)
But this works
usersearch = list(UserDetail.objects.all())
and so does
u = usersearch[0]
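Applied to the code in the question, a minimal sketch of the fix:

u = usersearch[0]                # take ONE concrete instance and keep it
u.user_categories.append(tba)
u.save()                         # saves the same instance that was mutated
print(u.user_categories)         # now shows the appended string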
Solution to the real problem
user_categories = ArrayField(models.CharField(max_length = 1000), default = list)
This is wrong. ArrayFields shouldn't be used in this manner. You will soon find that you need to search through them, and as the PostgreSQL documentation warns:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
ref: https://www.postgresql.org/docs/9.5/static/arrays.html
You need to normalize your data. You need a Category model, and your UserDetail should be related to it through a foreign key.
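A sketch of that normalization (the Category model and its field names here are illustrative, not taken from the original code):

class Category(models.Model):
    user = models.ForeignKey(UserDetail, on_delete=models.CASCADE, related_name='categories')
    name = models.CharField(max_length=1000)
    external_id = models.CharField(max_length=100, blank=True)

# instead of appending JSON strings to an ArrayField:
Category.objects.create(user=u, name=key, external_id=str(id))
u.categories.filter(name='books')   # now searchable with normal queries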

Filter Generic Foreign Key

Is there a more "Python/Django" way to query/filter objects by generic foreign key? I'm trying to get all FullCitation objects for a particular software, where is_primary is True.
I know I can't do this but I want to do something like this:
ct_supported = ContentType.objects.get(app_label="supportedprogram", model="software")
primary_citations = FullCitation.objects.filter(content_type__name=ct_supported, object_id__in='', is_primary=True)
models.py
class FullCitation(models.Model):
    # the software to which this citation belongs:
    # either a supported software program or a non-supported software program
    limit = models.Q(app_label='myprograms', model='supportedprogram') | models.Q(app_label='myprograms', model='nonsupportedprogram')
    content_type = models.ForeignKey(ContentType, limit_choices_to=limit)
    object_id = models.PositiveIntegerField()
    content_object = generic.GenericForeignKey('content_type', 'object_id')
    is_primary = models.BooleanField(help_text="Is this the Primary Citation for the software program?")

class NonSupportedProgram(models.Model):
    title = models.CharField(max_length=256, blank=True)
    full_citation = generic.GenericRelation('FullCitation')

class SupportedProgram(models.Model):
    title = models.CharField(max_length=256, blank=True)
    full_citation = generic.GenericRelation('FullCitation')
    # and a bunch of other fields.....
views.py (my current attempt):
primary_citations = []
sw_citations = sw.full_citation.all()
for x in sw_citations:
    if x.is_primary:
        primary_citations.append(x)
Comprehensions should be a last resort for filtering QuerySets. Far better to let them remain as QuerySets as long as you can. I think this is what you're looking for:
ct_supported = ContentType.objects.get_for_model(SupportedProgram)
primary_citations = FullCitation.objects.filter(content_type=ct_supported, is_primary=True)
Updated:
If you want to filter for a specific SupportedProgram instance, do this:
my_supported = SupportedProgram.objects.get(id=instance_id_goes_here)
ct_supported = ContentType.objects.get_for_model(SupportedProgram)
primary_citations = FullCitation.objects.filter(object_id=my_supported.pk, content_type=ct_supported, is_primary=True)  # the GenericForeignKey itself is not filterable, so filter on object_id + content_type
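Since both program models declare a GenericRelation, here is a sketch of the same query from the program side, which skips ContentType entirely:

sw = SupportedProgram.objects.get(id=instance_id_goes_here)
primary_citations = sw.full_citation.filter(is_primary=True)  # one query, no Python-side filtering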

Django - Getting/Saving large objects takes a lot of time

I'm trying to get a few million items from a model and parse them. However, somehow getting and saving the data takes a lot of time.
These are the current models that I have:
class mapt(models.Model):
    s = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=2000)

    def __unicode__(self):
        return str(self.s)

class datt(models.Model):
    s = models.IntegerField(primary_key=True)
    setid = models.IntegerField()
    var = models.IntegerField()
    val = models.IntegerField()

    def __unicode__(self):
        return str(self.s)

class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.IntegerField()
    username = models.IntegerField()
    ts = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

class data_parsed(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    data = models.CharField(max_length=200000)
    time = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)
The var and val fields of the datt model should actually act as foreign keys to mapt's s field. Furthermore, datt's setid field should act as a foreign key to sett's setid.
Lastly, data_parsed's setid is a foreign key to the sett model.
The algorithm is currently written this way:
def database_rebuild(start_data_parsed):
    listSetID = []
    #Part 1
    for items in sett.objects.filter(setid__gte=start_data_parsed):
        listSetID.append(items.setid)
    uniqueSetID = listSetID
    #Part 2
    for items in uniqueSetID:
        try:
            SetID = items
            settObject = sett.objects.get(setid=SetID)
            UserName = mapt.objects.get(pk=settObject.username).name
            TS = settObject.ts
            BlockName = mapt.objects.get(pk=settObject.block).name
            DataPairs_1 = []
            DataPairs_2 = []
            DataPairs_1_Data = []
            DataPairs_2_Data = []
            for data in datt.objects.filter(setid__exact=SetID):
                DataPairs_1.append(data.var)
                DataPairs_2.append(data.val)
            for Data in DataPairs_1:
                DataPairs_1_Data.append(mapt.objects.get(pk=Data).name)
            for Data in DataPairs_2:
                DataPairs_2_Data.append(mapt.objects.get(pk=Data).name)
            assert (len(DataPairs_1) == len(DataPairs_2)), "Length not equal"
            #Part 3
            Serialize = []
            for idx, val in enumerate(DataPairs_1_Data):
                Serialize.append(str(DataPairs_1_Data[idx]) + ":PARSEABLE:" + str(DataPairs_2_Data[idx]) + ":PARSEABLENEXT:")
            Serialize_Text = ""
            for Data in Serialize:
                Serialize_Text += Data
            Data = Serialize_Text
            p = data_parsed(SetID, BlockName, UserName, Data, TS)
            p.save()
        except AssertionError, e:
            print "Error:" + str(e.args)
            print "Possibly DataPairs does not have equal length"
        except Exception as e:
            print "Error:" + str(sys.exc_info()[0])
            print "Stack:" + str(e.args)
Basically, what it does is:
Find all sett objects whose setid is greater than a given number.
Get the UserName, TS, and BlockName, then get all the rows in datt whose var and val fields map to the mapt s field. var and val are basically a NAME_OF_FIELD:VALUE kind of relationship.
Serialize all the var and val parameters, so that I can gather the parameters from var and val, which are spread across the mapt table, into one row of data_parsed.
The current solution does everything I would like it to; however, on an Intel Core i5-4300U CPU @ 1.90GHz it parses around 15000 rows of data daily on a celery periodic worker. I have around 3355566 rows in my sett table, so it would take around 23 days to parse them all.
Is there a way to speed up the process?
============================Updated============================
New Models:
class mapt(models.Model):
    s = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=2000)

    def __unicode__(self):
        return str(self.s)

class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.ForeignKey(mapt, related_name='sett_block')
    username = models.ForeignKey(mapt, related_name='sett_username')
    ts = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

# class sett(models.Model):
#     setid = models.IntegerField(primary_key=True)
#     block = models.IntegerField()
#     username = models.IntegerField()
#     ts = models.IntegerField()
#     def __unicode__(self):
#         return str(self.setid)

class datt(models.Model):
    s = models.IntegerField(primary_key=True)
    setid = models.ForeignKey(sett, related_name='datt_setid')
    var = models.ForeignKey(mapt, related_name='datt_var')
    val = models.ForeignKey(mapt, related_name='datt_val')

    def __unicode__(self):
        return str(self.s)

# class datt(models.Model):
#     s = models.IntegerField(primary_key=True)
#     setid = models.IntegerField()
#     var = models.IntegerField()
#     val = models.IntegerField()
#     def __unicode__(self):
#         return str(self.s)

class data_parsed(models.Model):
    setid = models.ForeignKey(sett, related_name='data_parsed_setid', primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    data = models.CharField(max_length=2000000)
    time = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)
New Parsing:
def database_rebuild(start_data_parsed, end_data_parsed):
    for items in sett.objects.filter(setid__gte=start_data_parsed, setid__lte=end_data_parsed):
        try:
            UserName = mapt.objects.get(pk=items.username_id).name
            TS = items.ts
            BlockName = mapt.objects.get(pk=items.block_id).name
            DataPairs_1 = []
            DataPairs_2 = []
            DataPairs_1_Data = []
            DataPairs_2_Data = []
            for data in datt.objects.filter(setid_id__exact=items.setid):
                DataPairs_1.append(data.var_id)
                DataPairs_2.append(data.val_id)
            for Data in DataPairs_1:
                DataPairs_1_Data.append(mapt.objects.get(pk=Data).name)
            for Data in DataPairs_2:
                DataPairs_2_Data.append(mapt.objects.get(pk=Data).name)
            assert (len(DataPairs_1) == len(DataPairs_2)), "Length not equal"
            Serialize = []
            for idx, val in enumerate(DataPairs_1_Data):
                Serialize.append(str(DataPairs_1_Data[idx]) + ":PARSEABLE:" + str(DataPairs_2_Data[idx]))
            Data = ":PARSEABLENEXT:".join(Serialize)
            p = data_parsed(items.setid, BlockName, UserName, Data, TS)
            p.save()
        except AssertionError, e:
            print "Error:" + str(e.args)
            print "Possibly DataPairs does not have equal length"
        except Exception as e:
            print "Error:" + str(sys.exc_info()[0])
            print "Stack:" + str(e.args)
Building lists by appending repeatedly is slow. Use list comprehensions or even just the list() constructor.
In Python you should not join a list of strings using for loops and +=; you should use str.join().
But that is not the primary bottleneck here. You have a lot of objects.get() calls, each of which takes a database round trip. If you don't have millions of rows in the mapt table, you should probably just make a dictionary mapping mapt primary keys to mapt objects.
Had you defined your foreign keys as foreign keys, the Django ORM could help you do much of this in about five queries in total. That is, instead of SomeModel.objects.get(id=some_instance.some_fk_id) you can do some_instance.some_fk (which will only hit the database the first time you do it for each instance). You can then even get rid of the foreign-key query if some_instance was initialized as some_instance = SomeOtherModel.objects.select_related('some_fk').get(id=id_of_some_instance).
Perhaps changing the models without changing the database will work.
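Here is a sketch of the dictionary approach combined with the updated foreign keys (assuming the mapt table fits comfortably in memory; names follow the question's code):

# one query: map every mapt pk to its name
name_by_pk = dict(mapt.objects.values_list('s', 'name'))

rows = []
for items in sett.objects.filter(setid__gte=start_data_parsed, setid__lte=end_data_parsed):
    UserName = name_by_pk[items.username_id]    # dict lookup, no query
    BlockName = name_by_pk[items.block_id]      # dict lookup, no query
    pairs = [name_by_pk[d.var_id] + ":PARSEABLE:" + name_by_pk[d.val_id]
             for d in datt.objects.filter(setid_id=items.setid)]
    Data = ":PARSEABLENEXT:".join(pairs)
    rows.append(data_parsed(setid=items, block=BlockName, username=UserName, data=Data, time=items.ts))
# one INSERT for the whole batch instead of one per row; for millions of rows,
# call bulk_create in chunks to keep memory bounded
data_parsed.objects.bulk_create(rows)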

Selecting a Field With a String Name Only

I have a model that looks like this:
class WeekOne(models.Model):
    # Required benchmarks for given exercises
    squatBenchmark = 1000
    lungeBenchmark = 250
    stairDaysCountBenchmark = 3
    totalGoals = 4
    squats = models.PositiveIntegerField(default=0)
    lunges = models.PositiveIntegerField(default=0)
    skipStairs = models.BooleanField(default=False)
    stairDaysCount = models.PositiveSmallIntegerField(default=0)
    # Running count of benchmarks met.
    completeCount = models.PositiveSmallIntegerField(default=0)
    # Set to true if benchmarks reached.
    weekOneComplete = models.BooleanField(default=False)
I want to access the field 'squats', e.g. in a variable assignment amount = user.week_one.squats; but because of the way the views and templates work, I don't have a reference to the squats field, only the string 'squats'. Is there any way to use this string to access that field?
This is what getattr is for:
amount = getattr(user.week_one, 'squats')
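If the string might not name a real field, getattr's optional third argument supplies a fallback (field_name here stands for whatever string variable you have):

field_name = 'squats'
amount = getattr(user.week_one, field_name, 0)   # 0 if no such attribute exists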
