Is it possible to lazily query the database with mongoengine (python)?

I have a memory issue with mongoengine (in python).
Let's say I have a very large number of custom_documents (several thousand).
I want to process them all, like this:
for item in custom_documents.objects():
    process(item)
The problem is that custom_documents.objects() loads every object into memory, and my app uses several GB ...
How can I make this more memory-efficient?
Is there a way to make mongoengine query the DB lazily (i.e. request objects only as we iterate over the queryset)?

According to the docs (and in my experience), collection.objects returns a lazy QuerySet. Your first problem might be that you're calling the objects attribute, rather than just using it as an iterable. I feel like there must be some other reason your app is using so much memory, perhaps process(object) stores a reference to it somehow? Try the following code and check your app's memory usage:
queryset = custom_documents.objects
print queryset.count()
Since QuerySets are lazy, you can also do things like custom_documents.objects.skip(500).limit(100) to fetch only the 100 documents after the first 500.
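For example, here is a minimal sketch of processing the collection in chunks with skip()/limit(); the connection string and document class are placeholders standing in for the custom_documents class from the question:
from mongoengine import Document, StringField, connect

connect("mydb")  # assumed database name

class CustomDocument(Document):
    name = StringField()
    meta = {"collection": "custom_documents"}  # assumed collection name

def process(item):
    ...  # your per-document work

CHUNK = 1000
total = CustomDocument.objects.count()
for start in range(0, total, CHUNK):
    # Each chunk is a fresh queryset, so at most CHUNK documents are held
    # in memory at once and the result cache never grows unbounded.
    for item in CustomDocument.objects.skip(start).limit(CHUNK):
        process(item)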

I think you want to look at querysets - they are the MongoEngine wrapper around cursors:
http://mongoengine.org/docs/v0.4/apireference.html#querying
They let you control the number of objects returned, essentially taking care of the batch size settings etc. that you can set directly in the pymongo driver:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
Cursors are generally set up to behave this way by default; you would have to go out of your way to make them return everything in one shot, even in the native mongodb shell.
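If you want to see or tune that behaviour at the driver level, a minimal pymongo sketch might look like this (the host, database, and collection names are assumptions):
from pymongo import MongoClient

client = MongoClient("localhost", 27017)         # assumed local mongod
collection = client["mydb"]["custom_documents"]  # assumed names

# find() returns a lazy cursor: documents are pulled from the server in
# batches as you iterate, never all at once.
cursor = collection.find({}).batch_size(1000)  # documents per round trip
for doc in cursor:
    pass  # process(doc) goes here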

Related

Do Django filters increase RAM consumption per user?

I don't know where else I could have asked this question, so I'm asking it here. I want to know whether, if I apply multiple Django filters on a page that draw on multiple DB tables, that will affect RAM consumption whenever a user visits this page, since only the filtered data is shown to the user. I'm using Django with PostgreSQL on an Ubuntu-based VM. Also, if there is any documentation that could help me understand RAM utilization, please suggest it.
Django filters and querysets are lazy. What that actually means is that you are not hitting the database until you evaluate them. Quoting the official documentation:
Internally, a QuerySet can be constructed, filtered, sliced, and generally passed around without actually hitting the database. No database activity actually occurs until you do something to evaluate the queryset.
So the only space that gets taken up in your RAM is the queryset object itself and your program. Memory only fills up when the query is evaluated and data is extracted from the database, and how much depends on how much data is extracted. It would also be a good idea to look at iterators.
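As a rough illustration of that laziness (the Article model and its fields are hypothetical, not from the question):
from myapp.models import Article  # hypothetical app and model

qs = Article.objects.filter(published=True)  # no database query yet
qs = qs.exclude(title="")                    # still no query
qs = qs.order_by("-created")                 # still no query

first_page = list(qs[:20])  # the database is hit here, for 20 rows only
total = qs.count()          # a separate COUNT query; no rows are loaded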

Iterating over Django QuerySet without pre-populating cache

I want to iterate over all the objects in a QuerySet. However, the QuerySet matches hundreds of thousands if not millions of objects. So when I try to start an iteration, my CPU usage goes to 100%, and all my memory fills up, and then things crash. This happens before the first item is returned:
bts = Backtrace.objects.all()
for bt in bts:
    print bt
I can ask for an individual object and it returns immediately:
bts = Backtrace.objects.all()
print(bts[5])
But getting a count of all objects crashes just as above, so I can't iterate using this method since I don't know how many objects there will be.
What's a way to iterate without causing the whole result to get pre-cached?
First of all make sure that you understand when a queryset is evaluated (hits the database).
Of course, one approach would be to filter the queryset.
If this is out of the question, there are some workarounds you can use, according to your needs (a sketch of the slicing/pagination approach follows below):
Queryset Iterator
Pagination
A django snippet that claims to do the job using a generator
Raw SQL using cursors.
Fetch a list of necessary values or a dictionary of necessary values and work with them.
Here is a nice article that tries to tackle this issue on a more theoretical level.
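For instance, here is a minimal sketch of the slicing/pagination workaround; Backtrace is the model from the question and the chunk size is an arbitrary choice:
CHUNK = 2000
start = 0
while True:
    # Each slice is a separate LIMIT/OFFSET query, so only CHUNK objects
    # are materialised per iteration.
    batch = list(Backtrace.objects.order_by("pk")[start:start + CHUNK])
    if not batch:
        break
    for bt in batch:
        print(bt)
    start += CHUNK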

How to cache a row instead of an model instance when using SQLAlchemy?

I'm using SQLAlchemy with python-memcached.
Before actually running a real SQLAlchemy query, I first check whether the corresponding key is cached; if it is not, I run the query and store the found instance (which is a SQLAlchemy model object) in the cache.
For most of my functions this is fast enough, but in functions where thousands of objects get queried, a fair amount of time is spent in cPickle.loads deserialising the objects.
Because serialisation and deserialisation of a tuple/dict row could be several times faster than that of an object, I'm wondering if I can somehow cache the row instead.
You could use the dogpile.cache package for this.
There's even an example of usage on the SQLAlchemy documentation website.
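If you would rather do the row-as-dict caching by hand, a minimal sketch might look like the following; the memcached address and the key scheme are assumptions, and python-memcached pickles the value for you:
import memcache

mc = memcache.Client(["127.0.0.1:11211"])  # assumed memcached address

def row_dict(obj):
    # A plain dict of column values pickles/unpickles much faster than a
    # full ORM instance with its instrumentation.
    return {c.name: getattr(obj, c.name) for c in obj.__table__.columns}

def get_cached(session, model, obj_id):
    key = "%s:%d" % (model.__name__, obj_id)
    data = mc.get(key)
    if data is None:
        obj = session.query(model).get(obj_id)
        data = row_dict(obj)
        mc.set(key, data)
    return data  # callers get a dict, not a live ORM instance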

Iterate over large collection in django - cache problem

I need to iterate over a large collection (3 * 10^6 elements) in Django to do a kind of analysis that can't be done with a single SQL statement.
Is it possible to turn off collection caching in Django? (Caching all the data is not acceptable; the data is around 0.5 GB.)
Is it possible to make Django fetch the collection in chunks? It seems that it tries to prefetch the whole collection into memory and then iterate over it. That is what I think, judging by the speed of execution:
iter(Coll.objects.all()).next() - this takes forever
iter(Coll.objects.all()[:10000]).next() - this takes less than a second
Use QuerySet.iterator() to walk over the results instead of loading them all first.
It seems that the problem was caused by the database backend (sqlite), which doesn't support reading in chunks.
I used sqlite since the database will be thrown away once I've done all the computations, but it seems sqlite isn't good even for that.
Here is what I found in the Django source code of the sqlite backend:
class DatabaseFeatures(BaseDatabaseFeatures):
    # SQLite cannot handle us only partially reading from a cursor's result set
    # and then writing the same rows to the database in another cursor. This
    # setting ensures we always read result sets fully into memory all in one
    # go.
    can_use_chunked_reads = False
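For reference, a minimal sketch of the iterator() suggestion above (Coll is the model from the question; as the backend flag shows, this only helps when the database supports chunked reads, and newer Django versions also accept a chunk_size argument):
# iterator() streams rows from the database cursor instead of filling the
# queryset's result cache, so memory use stays roughly constant per row.
for item in Coll.objects.all().iterator():
    analyse(item)  # hypothetical per-element analysis function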

Is this a good approach to avoid using SQLAlchemy/SQLObject?

Rather than use an ORM, I am considering the following approach in Python and MySQL with no ORM (SQLObject/SQLAlchemy). I would like to get some feedback on whether this seems likely to have any negative long-term consequences since in the short-term view it seems fine from what I can tell.
Rather than translate a row from the database into an object:
each table is represented by a class
a row is retrieved as a dict
an object representing a cursor provides access to a table like so:
cursor.mytable.get_by_ids(low, high)
removing means setting the time_of_removal to the current time
So essentially this does away with the need for an ORM since each table has a class to represent it and within that class, a separate dict represents each row.
Type mapping is trivial because each dict (row), being a first-class object in Python, lets you know the class of the object; besides, the low-level database library in Python handles converting types at the field level into their appropriate application-level types.
If you see any potential problems with going down this road, please let me know. Thanks.
That doesn't do away with the need for an ORM. That is an ORM. In which case, why reinvent the wheel?
Is there a compelling reason you're trying to avoid using an established ORM?
You will still be using SQLAlchemy: the rows a ResultProxy gives you from .fetchmany() or similar already behave like dictionaries.
Use SQLAlchemy as a tool that makes managing connections easier, as well as executing statements. Documentation is pretty much separated in sections, so you will be reading just the part that you need.
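For instance, a minimal sketch of that connections-and-statements usage, written against the SQLAlchemy 1.4+ API (the URL, table, and column names are assumptions):
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pw@localhost/mydb")  # assumed URL

with engine.connect() as conn:
    result = conn.execute(
        text("SELECT * FROM mytable WHERE id BETWEEN :low AND :high"),
        {"low": 10, "high": 20},
    )
    for row in result.mappings():  # each row behaves like a read-only dict
        print(dict(row))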
web.py has a decent db abstraction too (not an ORM).
Queries are written in SQL (not specific to any rdbms), but your code remains compatible with any of the supported dbs (sqlite, mysql, postgresql, and others).
from http://webpy.org/cookbook/select:
myvar = dict(name="Bob")
results = db.select('mytable', myvar, where="name = $name")
