Computing a large dataset in Django - python

I am generating a matrix of similarities between items in order to feed it to a recommender system in Django (it's O(n^2) at the end of the day).
The issue I am having is that whether I use iterator() or not, my RAM still gets eaten up.
I do something like this:
rated_apps_list = Rating.objects.values_list('item_id', flat=True).order_by('-item_id').distinct()
rated_apps_iter = MemorySavingQuerysetIterator(rated_apps_list[start:])
for i, app_above in enumerate(rated_apps_iter, start):
    rated_apps_below_iter = MemorySavingQuerysetIterator(rated_apps_list[i+1:])
    for app_below in rated_apps_below_iter:
        ...
where MemorySavingQuerysetIterator is:
import copy

class MemorySavingQuerysetIterator(object):
    def __init__(self, queryset, max_obj_num=1000):
        self._base_queryset = queryset
        self._generator = self._setup()
        self.max_obj_num = max_obj_num

    def _setup(self):
        for i in xrange(0, self._base_queryset.count(), self.max_obj_num):
            # By making a copy of the queryset and using that to actually access
            # the objects we ensure that there are only `max_obj_num` objects in
            # memory at any given time
            smaller_queryset = copy.deepcopy(self._base_queryset)[i:i + self.max_obj_num]
            #logger.debug('Grabbing next %s objects from DB' % self.max_obj_num)
            for obj in smaller_queryset.iterator():
                yield obj

    def __iter__(self):
        return self

    def next(self):
        return self._generator.next()
At first I tried just using the .iterator() function, but then I believe it was the database client that was caching the results.
The leak is still there and I have to restart the script after a while.
I know it doesn't look efficient to create as many iterators as there are elements, because then I would end up with all the elements in memory. How would you guys do it?
Any thoughts? Thanks!

Actually, your solution is almost OK. A few pieces of advice:
Don't deepcopy the queryset; it's cloned when you slice it anyway.
Slicing is inefficient from the database's perspective (a large OFFSET in the SQL): the database has to prepare many rows and then hand you only a few.
Slicing is also unsafe if anything can be added to or deleted from the table in between.
You can adapt this approach to your case. Just use item_id instead of pk. It uses a condition instead of an offset, so it's much more efficient.
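For illustration, here is a minimal sketch of that idea under the question's setup. The helper name ids_in_batches is mine, and it iterates in ascending item_id order rather than the descending order used above:

def ids_in_batches(id_queryset, batch_size=1000, start_after=None):
    # Sketch only: id_queryset is assumed to be a flat values_list, e.g.
    # Rating.objects.values_list('item_id', flat=True).distinct()
    last_id = start_after
    while True:
        batch = id_queryset.order_by('item_id')
        if last_id is not None:
            # WHERE item_id > last seen value -- a condition, not an OFFSET
            batch = batch.filter(item_id__gt=last_id)
        batch = list(batch[:batch_size])  # one small, bounded query per round
        if not batch:
            break
        for item_id in batch:
            yield item_id
        last_id = batch[-1]

The pair loop then becomes for app_above in ids_in_batches(rated_apps_list): followed by for app_below in ids_in_batches(rated_apps_list, start_after=app_above):, and no query ever needs a large offset.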

Related

Django, using "|": Expression tree is too large (maximum depth 1000)

I'm trying to concatenate many querysets together. I tried out the marked answer from this question a while back, but that didn't work in my case. I needed to return a queryset, not a list. So I used the | operator, from the second answer. This worked fine at the time, but now that I'm trying to use it again for something else I get the following error:
Expression tree is too large (maximum depth 1000)
I originally thought that | would concatenate the querysets, but after reading the docs it appears that it combines the actual queries, and this specific problem occurs when the resulting query becomes too long/complex.
This is what I'm trying to do:
def properties(self, request, pk=None):
    project = self.get_object()
    if project is None:
        return Response({'detail': 'Missing project id'}, status=404)
    functions = Function.objects.filter(project=project)
    properties = Property.objects.none()
    for function in functions:
        properties = properties | function.property_set.all()
    return Response([PropertySerializer(x).data for x in properties])
Since the functions query returns roughly 1200 results, and each function has about 5 properties, I can understand the query becoming too long/complex.
How can I prevent the query from becoming too complex? Or how can I execute multiple queries and concat them afterwards, while keeping the end result a queryset?
I think you want to obtain all the Property objects whose Function belongs to a certain project.
We can query this with:
properties = Property.objects.filter(function__project=project)
This is a queryset that contains all Property objects for which the function (I assume this is a ForeignKey to Function) has as project (probably again a ForeignKey) the given project. This will result in a single query as well, but you avoid constructing gigantic unions.
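As a sketch of how this would slot into the view above (assuming Django REST Framework, where passing many=True to the serializer replaces the list comprehension):

def properties(self, request, pk=None):
    project = self.get_object()
    if project is None:
        return Response({'detail': 'Missing project id'}, status=404)
    # One JOIN query instead of a union of ~1200 querysets
    properties = Property.objects.filter(function__project=project)
    return Response(PropertySerializer(properties, many=True).data)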
Alternatively, you can do it in two steps, but this would actually make it slower:
# probably less efficient
function_ids = (Function.objects.filter(project=project)
                                .values_list('pk', flat=True))
properties = Property.objects.filter(function_id__in=function_ids)
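Note that function_ids is not evaluated here: as far as I know, when a queryset is passed to __in, Django compiles it into a SQL subquery, so this still executes as a single query. You can pass the inner queryset directly as well:

# equivalent sketch: Django turns the inner queryset into a subquery
properties = Property.objects.filter(
    function__in=Function.objects.filter(project=project)
)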

Idiomatic/fast Django ORM check for existence on mysql/postgres

If I want to check for the existence of, and if possible retrieve, an object, which of the following methods is faster? More idiomatic? And why? If neither of the two examples I list, how else would one go about doing this?
if Object.objects.filter(**kwargs).exists():
    my_object = Object.objects.get(**kwargs)

my_object = Object.objects.filter(**kwargs)
if my_object:
    my_object = my_object[0]
If relevant, I care about mysql and postgres for this.
Why not do this in a try/except block to avoid the multiple queries / query then an if?
try:
    obj = Object.objects.get(**kwargs)
except Object.DoesNotExist:
    pass
Just add your else logic under the except.
Django provides a pretty good overview of exists().
Using your first example, it will run the query two times; according to the documentation:
if some_queryset has not yet been evaluated, but you
know that it will be at some point, then using some_queryset.exists()
will do more overall work (one query for the existence check plus an
extra one to later retrieve the results) than simply using
bool(some_queryset), which retrieves the results and then checks if
any were returned.
So if you're going to be using the object after checking for existence, the docs suggest just using it and forcing evaluation one time, using
if my_object:
    pass
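For completeness, a one-query sketch of the same "use it if it exists" pattern (QuerySet.first(), available since Django 1.6, returns None instead of raising; do_something and handle_missing are hypothetical handlers):

my_object = Object.objects.filter(**kwargs).first()  # single query; None if no match
if my_object is not None:
    do_something(my_object)   # hypothetical handler
else:
    handle_missing()          # hypothetical handler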

Re evaluate django query after changes done to database

I have this long queryset statement in a view:
contributions = user_profile.contributions_chosen.all()\
    .filter(payed=False).filter(belongs_to=concert)\
    .filter(contribution_def__left__gt=0)\
    .filter(contribution_def__type_of='ticket')
That I use in my template:
context['contributions'] = contributions
And later in that view I make changes (add or remove a record) to the table contributions_chosen, and if I want my context['contributions'] updated I need to re-query the database with the same lengthy query:
contributions = user_profile.contributions_chosen.all()\
    .filter(payed=False).filter(belongs_to=concert)\
    .filter(contribution_def__left__gt=0)\
    .filter(contribution_def__type_of='ticket')
And then update my context again:
context['contributions'] = contributions
So I was wondering if there's any way I can avoid repeating myself and re-evaluate contributions so that it actually reflects the real data in the database.
Ideally I would modify the queryset contributions and its values would be updated, and at the same time the database would reflect these changes, but I don't know how to do this.
UPDATE:
This is what I do between the two context['contributions'] = contributions lines.
I add a new contribution object to contributions_chosen (this is an m2m relation):
contribution = Contribution.objects.create(kwarg=something,kwarg2=somethingelse)
user_profile.contributions_chosen.add(contribution)
contribution.save()
user_profile.save()
And in some cases I delete a contribution object:
contribution = user_profile.contributions_chosen.get(id=1)
contribution.delete()
As you can see I'm modifying the table contributions_chosen, so I have to re-issue the query and update the context.
What am I doing wrong?
UPDATE
After seeing your comments about evaluating, I realize I do evaluate the queryset: I call len(contributions) between the two context['contributions'] assignments, and that seems to be the problem.
I'll just move it after the database operations and that's it. Thanks, guys.
Update
It seems you have not evaluated the queryset contributions, thus there is no need to worry about updating it, because it has not yet fetched data from the DB.
Can you post the code between the two context['contributions'] = contributions lines? Normally, before you evaluate the queryset contributions (for example by iterating over it or calling its __len__()), it does not contain anything read from the DB, hence you don't have to update its content.
To re-evaluate a queryset, you could:
# make a clone
contributions._clone()
# or any op that makes a clone, for example
contributions.filter()
# or clear its cache
contributions._result_cache = None
# you could even directly add new items to contributions._result_cache,
# but that could cause unexpected behavior w/o carefulness
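As a small illustration of the caching behavior being described (a sketch; calling .all() on an already-evaluated queryset returns a clone without the result cache):

contributions = user_profile.contributions_chosen.filter(payed=False)
len(contributions)       # first evaluation: hits the DB and caches the rows
# ... add or delete rows here ...
contributions = contributions.all()  # clone: drops the cached results
len(contributions)       # evaluates again, returning fresh data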
I don't know how you can avoid re-evaluating the query, but one way to save some repeated statements in your code would be to create a dict with all those filters and to specify the filter args as a dict:
query_args = dict(
    payed=False,
    belongs_to=concert,
    contribution_def__left__gt=0,
    contribution_def__type_of='ticket',
)
and then
contributions = user_profile.contributions_chosen.filter(**query_args)
This just removes some repeated code, but does not solve the repeated query. If you need to change the args, just handle query_args as a normal Python dict; it is one, after all :)
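Building on that, a sketch of how to avoid both kinds of repetition: wrap the query in a small helper (the name get_open_ticket_contributions is mine) and call it again after the writes, so each call returns a fresh, unevaluated queryset:

def get_open_ticket_contributions(user_profile, concert):
    # A new queryset on every call, so each call reflects the current DB state
    return user_profile.contributions_chosen.filter(
        payed=False,
        belongs_to=concert,
        contribution_def__left__gt=0,
        contribution_def__type_of='ticket',
    )

context['contributions'] = get_open_ticket_contributions(user_profile, concert)
# ... add/remove contributions ...
context['contributions'] = get_open_ticket_contributions(user_profile, concert)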

A more pythonic way to build a class based on a string (how not to use eval)

OK.
So I've got a database where I want to store references to other Python objects (right now I'm using it to store inventory information for personal stores of beer recipe ingredients).
Since there are about 15-20 different categories of ingredients (all represented by individual SQLObjects), I don't want to do a bunch of RelatedJoin columns since, well, I'm lazy, and it doesn't seem like the "best" or most "pythonic" solution as it is.
So right now I'm doing this:
from datetime import datetime

from sqlobject import (SQLObject, IntCol, DecimalCol, DateCol,
                       UnicodeCol, CurrencyCol)

class Inventory(SQLObject):
    inventory_item_id = IntCol(default=0)
    amount = DecimalCol(size=6, precision=2, default=0)
    amount_units = IntCol(default=Measure.GM)  # Measure is the app's own units module
    purchased_on = DateCol(default=datetime.now())
    purchased_from = UnicodeCol(default=None, length=256)
    price = CurrencyCol(default=0)
    notes = UnicodeCol(default=None)
    inventory_type = UnicodeCol(default=None)

    def _get_name(self):
        return eval(self.inventory_type).get(self.inventory_item_id).name

    def _set_inventory_item_id(self, value):
        self.inventory_type = value.__class__.__name__
        self._SO_set_inventory_item_id(value.id)
Please note the ICKY eval() in the _get_name() method.
How would I go about calling the SQLObject class referenced by the string I'm getting from __class__.__name__ without using eval()? Or is this an appropriate place to use eval()? (I'm sort of the mindset that it's never appropriate to use eval() -- however, since the system never feeds any end-user input into the eval(), it seems "safe".)
To get the value of a global by name, use:
globals()[self.inventory_type]
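In context, _get_name() would then look something like this (a sketch; it assumes the ingredient classes are defined or imported in the same module, since globals() only sees the current module's namespace):

def _get_name(self):
    cls = globals()[self.inventory_type]  # look the class up by name, no eval()
    return cls.get(self.inventory_item_id).name

If the classes live in another module, getattr(that_module, self.inventory_type) does the same job.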

l[:] performance problem

Recently I was debugging some code like the following:
def getList():
    #query db and return a list
    total_list = Model.objects.all()
    result = list()
    for item in total_list:
        if item.attr1:
            result.append(item)
    return result

# in main code
org_list = getList()
list = orgList[:]#this line cause cpu problems.
if len(org_list) > 0 and org_list[0].is_special:
    my_item = org_list[0]
for i in list:
    print_item(i)
doSomethingelse(list[0])
In order to simplify the code I changed most of it, but the main part is here.
In the getList method we query the db and get 20~30 rows. Then we create a Python list from them and return it.
In the main code we get the org_list variable from getList,
slice it with orgList[:],
loop over the resulting list, and access specific elements on it like list[0].
The problem here is that this code runs on a really busy server, and unfortunately it uses most of the CPU and eventually locks our servers.
The problem is the line where we slice the list variable with [:].
If we don't do that and just use the org_list variable instead, our servers don't have a problem. Does anybody have any idea why that might happen? Does slicing use a lot of CPU, or does using a sliced list use a lot of CPU?
The code that you are showing would run in the 0.1 microseconds that it would take to raise an exception:
org_list = getList()
list = orgList[:]#this line cause cpu problems.
orgList should be org_list. Show the minimal code that actually reproduces the problem.
Also, that kills the list built-in function. Don't do that.
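(To spell that out: the assignment rebinds the name list, so any later call to the built-in fails. A tiny demo:)

list = [1, 2, 3]   # shadows the built-in list type
empty = list()     # TypeError: 'list' object is not callable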
Update: Another thought: a common response to "My Django app runs slowly!" is "turn off the debug flag" ... evidently it doesn't free up memory in debug mode.
Update 2, about "I found out that when you slice a list, it actually works as a view to that list and just calls the original list methods to get the actual item": Where did you get that idea? That can't be the case with a plain old list! Have you redefined list somewhere?
In your getList function:
(1) put in print type(list)
(2) replace result = list() with result = [] and see whether the problem goes away.
list = org_list[:] makes a copy of org_list. Usually you see something like this when you need to modify the list but you want to keep the original around as well.
The code you are showing doesn't seem to actually modify org_list, and if doSomethingelse(list[0]) doesn't modify list, I don't see why it's being copied in the first place. Even if it does modify list, as long as org_list isn't needed after those few lines of code, you could probably get away with just using org_list and not doing the slice copy.
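For reference, [:] makes a shallow copy: a new list object whose slots refer to the same element objects, so copying 20-30 rows' worth of references is cheap. A quick demo:

org = [{'a': 1}, {'a': 2}]
copy_ = org[:]
print copy_ is org        # False: a new list object
print copy_[0] is org[0]  # True: the elements are shared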
I would like to know:
- what is the type of total_list? Please do print type(total_list)
- what are the types of the elements of total_list?
- do you have an idea of the sizes of these elements?
- what is the number of elements in the list returned by getList()? 20-30?
Remarks:
- using the built-in name "list" as a name for a list isn't a good idea (in list = orgList[:])
- your getList function can be written as:
def getList():
    #query db and return a list
    total_list = Model.objects.all()
    return [item for item in total_list if item.attr1]
Unfortunately the actual code is the property of the company I worked for, so I cannot give it exactly as it is, but here is some part of it (variable names are changed, but the list argument used inside main_tree really was named list):
mainObj = get_object_or_404(MainObj, code_name=my_code_name)
if (not mainObj.is_now()) and (not request.user.is_staff):
    return HttpResponseRedirect("/")
main_tree = mainObj.main_tree()
objects = main_tree[:]  # !!!
if objects.__len__() == 1 and not objects[0].has_special_page:
    my_code_name = objects[0].code_name
...
mainObj.main_tree() is the following:
def main_tree(self):
    def append_obj_recursive(list, obj):
        list.append(obj)
        for c in self.obj_set.filter(parent_obj=obj):
            append_obj_recursive(list, c)
        return list

    root_obj = self.get_or_create_root_obj()
    top_level_objs = self.obj_set.filter(parent_obj=root_obj).order_by("weight")
    objs = []
    for tlc in top_level_objs:
        append_obj_recursive(objs, tlc)
    return objs
This is a really weird problem. Slicing should not cause such a bizarre problem. If I can find an answer I will post it here too.
