Django find_each (like RoR) - python

Is there a way to use find_each in Django?
According to the rails documentation:
This method is only intended to use for batch processing of large
amounts of records that wouldn’t fit in memory all at once. If you
just need to loop over less than 1000 records, it’s probably better
just to use the regular find methods.
http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each
Thanks.

One possible solution is to use the built-in Paginator class (it could save a lot of hassle).
https://docs.djangoproject.com/en/dev/topics/pagination/
Try something like:
from django.core.paginator import Paginator
from yourapp.models import YourModel

result_query = YourModel.objects.filter(<your find conditions>)
paginator = Paginator(result_query, 1000)  # the desired batch size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can add your required code
        pass
Or, you could use queryset slicing (limiting) as needed to iterate over the results.
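Another option worth mentioning (assuming Django 2.0+ for the chunk_size argument) is QuerySet.iterator(), which fetches rows in chunks instead of caching the whole result set; a minimal sketch:

from yourapp.models import YourModel

# Stream rows in chunks rather than loading the whole queryset into memory;
# chunk_size controls how many rows are fetched from the database at a time.
for row in YourModel.objects.filter(<your find conditions>).iterator(chunk_size=1000):
    # process row here
    pass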

You can query parts of the whole table with a loop and by slicing the queryset.
If you're working with DEBUG = True, it's important to clear the query log after each loop iteration, since Django stores every query that was run until the process finishes or dies, and this can cause memory issues.
If you need to restrict the results of the queryset, you can replace ".all()" with the appropriate ".filter(conditions)".
from django import db
from myapp.models import MyModel

# Get the total number of records in the table
total_count = MyModel.objects.all().count()
chunk_size = 1000  # You can change this to any amount you can keep in memory
total_checked = 0

while total_checked < total_count:
    # Query all the objects and slice only the part you need to work
    # with at the moment (only that part will be loaded into memory)
    query_set = MyModel.objects.all()[total_checked:total_checked + chunk_size]
    for item in query_set:
        # Do what you need to do with your results
        pass
    total_checked += chunk_size
    # Clear Django's query log to avoid a memory leak under DEBUG = True
    db.reset_queries()
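A possible alternative sketch, assuming an auto-incrementing integer primary key: paginating on the primary key instead of slicing avoids increasingly expensive OFFSETs deep into the table.

from myapp.models import MyModel

chunk_size = 1000
last_pk = 0
while True:
    # Fetch the next chunk ordered by primary key, starting after the last one seen
    batch = list(MyModel.objects.filter(pk__gt=last_pk).order_by('pk')[:chunk_size])
    if not batch:
        break
    for item in batch:
        pass  # process item
    last_pk = batch[-1].pk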

Related

PyMongo cursor batch_size

With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)
However, the cursor always returns the full collection size immediately. I'm using it as described in the docs.
Does anyone have an idea how I would properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still get the full collection first, and would only loop over the already pulled documents in memory. The batch_size parameter is supposed to get a batch and make a round-trip every time to the server, to save memory space.
PyMongo has some quality-of-life helpers for the Cursor class, so it will automatically do the batching for you and return results to you as documents.
The batch_size setting is respected, but the idea is that you only need to set it in the find() call; you don't have to make manual low-level calls or iterate through the batches yourself.
For example, if I have 100 documents in my collection:
> db.test.count()
100
I then set the profiling level to log all queries:
> db.setProfilingLevel(0,-1)
{
"was": 0,
"slowms": 100,
"sampleRate": 1,
"ok": 1,
...
I then use pymongo to specify batch_size of 10:
import pymongo
import bson
conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))
Running that query, I see in the MongoDB log:
2019-02-22T15:03:54.522+1100 I COMMAND [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, ....
(getMore repeated 9 more times)
So the query was fetched from the server in the specified batches. It's just hidden from you via the Cursor class.
Edit
If you really need to get the documents in batches, there is a function find_raw_batches() under Collection (doc link). This method works similarly to find() and accepts the same parameters. However be advised that it will return raw BSON which will need to be decoded by the application in a separate step. Notably, this method does not support sessions.
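For illustration, a minimal sketch of that approach using the same client and collection names as the question; bson.decode_all() turns each raw batch back into a list of documents:

import bson
import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

# Each item yielded by the raw-batch cursor is one batch of raw BSON bytes
for batch in coll.find_raw_batches({}, batch_size=500):
    docs = bson.decode_all(batch)  # decode the whole batch into a list of dicts
    # process up to 500 documents here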
Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:
find({'some_field': {'$gte': <some criteria>, '$lte': <some other criteria>}})
Range queries are easier to optimize, can use indexes, and are (in my opinion) easier to debug and easier to restart should the query get interrupted. In contrast, with batches you have to restart the query from scratch and go over all the batches again if it gets interrupted.
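A minimal sketch of that range-based idea, assuming the default ObjectId _id field and the same client/collection as above; it resumes after the last _id seen:

import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

last_id = None
while True:
    # Resume after the last _id we processed; the _id index makes this cheap
    query = {} if last_id is None else {'_id': {'$gt': last_id}}
    docs = list(coll.find(query).sort('_id', pm.ASCENDING).limit(500))
    if not docs:
        break
    for doc in docs:
        pass  # process doc
    last_id = docs[-1]['_id']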
This is how I do it; it helps chunk up the data, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks from the cursor and ensures that used chunks are deleted.
import pymongo as pm

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this I'll update my answer.
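For what it's worth, one arguably cleaner form of the same idea is a generic chunking helper built on itertools.islice; it works on any iterable, including a PyMongo cursor:

from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

for chunk in chunked(cursor, CHUNK_SIZE):
    # do processing here
    pass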

How to improve performance of a script operating on a large amount of data?

My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, and thus more data needs to be stored, I noticed performance issues: the script now computes, on average, two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved to be modified).
I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage, but unfortunately the docs only mention how to configure frameworks such as Zope or Plone. I'm using ZODB only.
I wonder if RelStorage would be faster in my case.
Here's how I currently set up the ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear to me that ZODB is currently the bottleneck of my script.
I'm looking for advice on how I could solve this problem.
I chose ZODB because I thought a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually built as follows:
actions_values = { # BTree
    str(state): { # BTree
        # contains actions (column to pick, to be exact, as I'm working on an agent playing Connect 4)
        # and their values (only actions previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
state is a simple 2D array representing the game board. Possible values of its fields are 1, 2 or None:
board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000

player = ReinforcementPlayer(dbroot.actions_values, config)

while dbroot.games_played < should_play:
    # max_epsilon at start and then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(packing also takes much time but it's not the main problem; I could pack database after program finishes)
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(Functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards that are vertical reflections of each other are equivalent.)
After 550000 games len(dbroot.actions_values) is 6018450.
According to iotop IO operations take 90% of the time.
Using any (other) database would probably not help, as they are subject to the same disk I/O and memory limitations as ZODB. If you manage to offload computations to the database engine itself (e.g. PostgreSQL plus SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here and the same things can most likely be done with ZODB quite easily.
Some ideas of what can be done:
Have indexes of the data instead of loading full objects (loading everything is the equivalent of a SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have the __slots__ trick; see the sketch at the end of this answer).
Use transactions in an intelligent fashion. Don't try to process all the data in a single big chunk.
Parallel processing: use all CPU cores instead of a single-threaded approach.
Don't use BTrees: maybe there is something more efficient for your use case.
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.
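A minimal sketch of the __slots__ idea mentioned above, using a hypothetical class (not taken from the question's code):

class ActionValue(object):
    # Declaring __slots__ removes the per-instance __dict__,
    # which shrinks each object's memory footprint.
    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value

Note that combining __slots__ with persistent.Persistent subclasses has its own caveats, so treat this purely as an illustration of the memory-saving idea.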
Just to be clear here, which BTree class are you actually using? An OOBTree?
A few aspects about those BTrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking the C code for them and recompiling to hold a larger number, as the value was chosen nearly two decades ago.
2) It is sometimes possible to construct very unbalanced BTrees, e.g. if you add values in sorted order (such as a timestamp that only ever increases) you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.
3) Rather than using an OOBTree you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so it may end up too big in your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on an update).
-Matt
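As a minimal illustration of the OOBucket suggestion (assuming the BTrees package that ZODB uses; the keys and values are placeholders):

from BTrees.OOBTree import OOBTree, OOBucket

# An OOBucket has the same mapping interface as an OOBTree,
# but it is stored as a single record in ZODB instead of many linked Buckets.
inner = OOBucket()
inner[1] = 0.4356
inner[5] = 0.3456

actions_values = OOBTree()
actions_values['some_state_string'] = inner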

Why does my algorithm get slower and slower?

Premise: I'm not a programmer!
I've written Python code to iterate over a DB's fields with the ArcGIS 9.2 geoprocessor.
The algorithm has to iterate over more than two thousand records, and its speed decreases progressively until a single iteration takes 5-6 minutes!
This is my code:
# Import system modules
import sys, string, os, arcgisscripting

# Create the Geoprocessor object
gp = arcgisscripting.create()

# Set the necessary product code
gp.SetProduct("ArcEditor")

# Load required toolboxes...
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Analysis Tools.tbx")
gp.OverWriteOutput = "True"

# Global variables...
PP_ID_2012_dbf = "C:\\...\\4Paths_PP_ID.dbf"
Paths_2012 = "C:\\...\\Paths_2012.shp"

# Search Cursor
src = gp.SearchCursor(PP_ID_2012_dbf)
row = src.Next()
src.Reset()

# While cycle
while row:
    SQL_expr = row.GetValue("Query")  # SQL query
    print row.GetValue("Query")
    Selected_Path = "C:\\...\\Path_" + str(row.GetValue("PointPath_")) + ".shp"
    # Process: Select...
    gp.Select_analysis(Paths_2012, Selected_Path, SQL_expr)
    Paths_2012_Select_Simplify_shp = "C:\\...\\Path_" + str(row.GetValue("PointPath_")) + "_Simplify.shp"
    # Process: Simplify Lines...
    gp.SimplifyLine_management(Selected_Path, Paths_2012_Select_Simplify_shp, "POINT_REMOVE", "20 Meters", "FLAG_ERRORS", "KEEP_COLLAPSED_POINTS", "CHECK")
    del SQL_expr
    del Selected_Path
    del Paths_2012_Select_Simplify_shp
    row = src.Next()
What's wrong? I think it's a memory/caching problem, but I'm not able to solve it on my own.
Please, help me!
If you're performing a SQL query inside of a loop, it's plausible that you could make your code much faster by figuring out how to get all of the data in one query and then iterating through the result of that query to process it.
As someone else mentioned, you can figure out where you're spending your time by profiling your code, but DB calls are a common culprit, and if you can avoid making your DB calls inside of a loop, you may be able to reduce 2,000 db calls to one or two db calls.
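As a minimal sketch of the profiling suggestion (assuming the loop is wrapped in a hypothetical main() function; cProfile ships with the standard library):

import cProfile
import pstats

cProfile.run('main()', 'profile.out')  # run the script's main loop under the profiler

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive call sites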

How expensive is it to save a Django ORM model without changes?

Sometimes we have to call a model instance's .save() regardless of whether any field changed, just to be safe and to speed up development.
How expensive is this with the Django ORM?
Are signals always sent?
Is any SQL query executed?
I tested with the Django Debug Toolbar by doing 10 .save() calls at points where nothing in the model had changed, and the log does not register SQL queries.
Is there another way to test it, or some article on this?
Thank you in advance.
I'm not entirely sure how your application handles this.
But I ran a small test:
a = Blog.objects.get(pk=1)
for b in range(1, 100):
    a.save()
This gave me a result of:
87.04 ms (201 queries)
Be aware as well that a save does two queries:
SELECT ••• FROM `fun_blog` WHERE `fun_blog`.`id` = 1 LIMIT 1
UPDATE `fun_blog` SET `title` = 'This is my testtitle', `body` = 'This is a testbody' WHERE `fun_blog`.`id` = 1
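If you want to verify this yourself, here is a minimal sketch using Django's test utilities to capture the queries around a single save() (the Blog import path is hypothetical, matching the fun_blog table above):

from django.db import connection
from django.test.utils import CaptureQueriesContext
from fun.models import Blog  # hypothetical app path matching the fun_blog table

a = Blog.objects.get(pk=1)
with CaptureQueriesContext(connection) as ctx:
    a.save()

print(len(ctx.captured_queries), 'queries')
for q in ctx.captured_queries:
    print(q['sql'])  # on the answerer's setup this showed a SELECT followed by an UPDATE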

Django: How do I avoid unnecessary SQL statements?

I'm optimizing a slow page load in our (first) Django project. The overall project does test status management, so there are protocols which have cases which have planned executions. Currently the code is:
protocols = Protocol.active.filter(team=team, release=release)
cases = Case.active.filter(protocol__in=protocols)
caseCount = cases.count()
plannedExecs = Planned_Exec.active.filter(case__in=cases, team=team, release=release)
# Start aggregating test suite information
# pgi Model
testSuite['pgi_model'] = []
for pgi in PLM.objects.filter(release=release).values('pgi_model').distinct():
    plmForPgi = PLM.objects.filter(pgi_model=pgi['pgi_model'])
    peresults = plannedExecs.filter(plm__in=plmForPgi).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, 0))
# Browser
testSuite['browser'] = []
for browser in BROWSER_OPTIONS:
    peresults = plannedExecs.filter(browser=browser[0]).count()
    try:
        testSuite['browser'].append((browser[1], "", "", peresults, int(peresults/float(testlistCount)*100)))
    except ZeroDivisionError:
        testSuite['browser'].append((browser[1], "", "", peresults, 0))
# ... more different categories are aggregated below, then the report is generated...
This code makes a lot of SQL statements. The PLM.objects.filter(release=release).values('pgi_model').distinct() returns a list of 50 strings, and the two filter operations both execute an SQL statement for each string, meaning 100 SQL statements for just this for loop. (Also, it seems like that should use values_list with flat=True.)
Since I want to get information about relevant cases and plannedExecutions, I think I really only need to retrieve those two tables, then perform some analysis on that. Using filter and count() seemed like the obvious solution at the time, but I'm wondering if I wouldn't be better off just building a dict of relevant case and plannedExecution information using .values() and then analyzing that instead, so as to avoid unnecessary SQL statements. Any helpful advice? Thanks!
Edit: In trying to profile this to understand where the time goes, I'm using Django Debug toolbar. It explains that there are over 200 queries, and each of which runs extremely quickly, so that overall they account for very little time. However, could it be that the execution of the SQL is relatively quick, but the building of the ORM adds up, given that it happens over 200 times? I refactored a previous page which took 3 minutes to load, and used values() instead of the ORM, thus getting the page load down to 2.7 seconds and 5 SQL statements.
Creating a queryset does not hit the database; only accessing results from it does. Accordingly, merely creating querysets is not your issue.
Note that passing a queryset to another queryset does not create two queries. Accordingly, building dicts will not reduce the number of database hits.
If you can build up dicts, it may be that you manage to create a simpler query than you would otherwise, which would speed up the actual query execution. That is something of a separate issue, however.
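Not from the original answer, but as a concrete sketch of the "simpler query" point: the per-browser count() loop in the question could be collapsed into a single aggregate query with values() and annotate():

from django.db.models import Count

# One query that groups planned executions by browser and counts each group
rows = plannedExecs.values('browser').annotate(n=Count('pk'))
browser_counts = {row['browser']: row['n'] for row in rows}

# The per-browser loop can then read from browser_counts
# instead of issuing a separate .count() query for each browser.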
This strikes me as a case for reverse foreign key lookups. We should be able to reduce the top for loop by getting all pgi_models associated with PLMs in the release. I assume you have a model for PGI, for which the PLM model has a foreign key field named pgi_model. If this is the case, you can find the PGIs in a PLM release with the following. You still have a loop, but the iterations of the loop should be reduced, theoretically:
pgis = PGI.objects.filter(plm__in=PLM.objects.filter(release=release))
for pgi in pgis:
    peresults = plannedExecs.filter(plm=pgi.plm).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, 0))
