With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)
However, the cursor always returns the full collection immediately, even though I'm using it as described in the docs.
Does anyone have an idea how to properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still fetch the full collection first and only loop over documents already pulled into memory. The batch_size parameter is supposed to fetch one batch at a time, making a round-trip to the server for each batch, to save memory.
PyMongo has some quality-of-life helpers for the Cursor class, so it will automatically do the batching for you and return results to you as documents.
The batch_size setting is respected, but the idea is that you only need to set it in the find() method; you don't have to make manual low-level calls or iterate through the batches yourself.
For example, if I have 100 documents in my collection:
> db.test.count()
100
I then set the profiling level to log all queries:
> db.setProfilingLevel(0,-1)
{
"was": 0,
"slowms": 100,
"sampleRate": 1,
"ok": 1,
...
I then use pymongo to specify batch_size of 10:
import pymongo
import bson
conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))
Running that query, I see in the MongoDB log:
2019-02-22T15:03:54.522+1100 I COMMAND [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, ....
(getMore repeated 9 more times)
So the results were fetched from the server in the specified batches; the batching is just hidden from you by the Cursor class.
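In other words, you don't need list() at all: just iterate the cursor and let it fetch batch after batch behind the scenes. A minimal sketch, reusing the same test collection as above:
import pymongo

conn = pymongo.MongoClient()
cur = conn.test.test.find({}, batch_size=10)

# Iteration is lazy: the driver keeps only the current batch of 10 documents
# in memory and issues a getMore for the next batch when it runs out.
for doc in cur:
    # replace this with your own per-document processing
    print(doc['_id'])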
Edit
If you really need to get the documents in batches, there is a function find_raw_batches() under Collection (doc link). This method works similarly to find() and accepts the same parameters. However, be advised that it will return raw BSON, which will need to be decoded by the application in a separate step. Notably, this method does not support sessions.
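As a rough sketch of what that decoding step could look like (just an illustration using bson.decode_all(), not a snippet from the docs):
import bson
import pymongo

conn = pymongo.MongoClient()
coll = conn.test.test

# Each item yielded here is one raw batch of BSON bytes, not a single document.
for raw_batch in coll.find_raw_batches({}, batch_size=10):
    docs = bson.decode_all(raw_batch)  # decode the whole batch into a list of dicts
    print(len(docs))  # at most 10 documents per batch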
Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:
find({'<field>': {'$gte': <some criteria>, '$lte': <some other criteria>}})
Range queries are easier to optimize, can use indexes, and are (in my opinion) easier to debug and easier to restart should the query get interrupted. Batches are less flexible here: if the query gets interrupted, you have to restart it from scratch and go over all the batches again.
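A common way to apply that idea is to page on an indexed field such as _id, resuming from the last value seen. A sketch of that pattern (the field choice and page size are only illustrative):
import pymongo

conn = pymongo.MongoClient()
coll = conn.test.test

page_size = 500
last_id = None
while True:
    # _id is always indexed, so each page is an efficient range scan
    query = {} if last_id is None else {'_id': {'$gt': last_id}}
    page = list(coll.find(query).sort('_id', pymongo.ASCENDING).limit(page_size))
    if not page:
        break
    for doc in page:
        pass  # process the document
    last_id = page[-1]['_id']  # natural restart point if the job is interrupted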
This is how I do it. It helps to get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks of documents, and it ensures the already-used chunks are deleted.
import pymongo as pm
CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)
def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor: a PyMongo cursor to iterate over
    :param chunk_size: number of documents per chunk
    :return: yields lists of documents
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this I'll update my answer.
Related
I have a problem. I want to get all documents of a collection with ~1 million documents inside. I asked myself what the fastest way is to get all documents inside a collection: is it with a cursor or with .all()? And are there any recommendations for the batch_size?
cursor
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc', stream=True, ttl=3600, batch_size=<batchSize>)
collection = [doc for doc in cursor]
.all - with custom HTTP Client
from arango import ArangoClient
from arango.http import HTTPClient
class MyCustomHTTPClient(HTTPClient):
    REQUEST_TIMEOUT = 1000
# Initialize the ArangoDB client.
client = ArangoClient(
http_client=MyCustomHTTPClient())
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
collec = db.collection('<Collection>')
collection = collec.all()
If you want all documents in memory then .all will be the fastest, because it uses the library's optimized method for fetching all the results.
If you can process each document as it comes in, then the cursor is the best way to do it, to avoid the memory overhead.
But the best way to decide is to run tests and measure the timing, because many factors can affect the speed, such as the type and speed of the connection to the DB, the amount of memory in your machine, etc. The examples you gave look simple enough to do such measurements pretty quickly.
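A rough sketch of such a measurement, if you want a starting point (the collection name, credentials and batch size below are placeholders, not values from your setup):
import time
from arango import ArangoClient

client = ArangoClient()
db = client.db('<db>', username='<username>', password='<password>')

# Time the streaming cursor
start = time.perf_counter()
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc',
                        stream=True, ttl=3600, batch_size=10000)
docs_via_cursor = [doc for doc in cursor]
print('cursor: %.2f s' % (time.perf_counter() - start))

# Time .all()
start = time.perf_counter()
docs_via_all = list(db.collection('<Collection>').all())
print('.all(): %.2f s' % (time.perf_counter() - start))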
In my constructor I initialize an empty dictionary, and then in a UDF I update it with the new data that arrived from the batch.
My problem is that in every new batch the dictionary is empty again.
How can I avoid this reset, so that new batches have access to all the values I have already added to my dictionary?
import CharacteristicVector
import update_charecteristic_vector

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


class SomeClass(object):
    def __init__(self):
        self.grid_list = {}

    def run_stream(self):
        def update_grid_list(grid):
            if grid not in self.grid_list:
                self.grid_list[grid] = CharacteristicVector()
            self.grid_list[grid] = update_charecteristic_vector(self.grid_list[grid])
            return self.grid_list[grid].Density
        .
        .
        .
        udf_update_grid_list = udf(update_grid_list, StringType())

        grids_dataframe = hashed.select(
            hashed.grid.alias('grid'),
            udf_update_grid_list(hashed.grid).alias('Density')
        )

        query = grids_dataframe.writeStream.format("console").start()
        query.awaitTermination()
Unfortunately, this code cannot work, for multiple reasons. Even with a single batch, or in a batch application, it would work only if there is a single active Python worker process. Also, it is not possible in general to have globally synchronized state with support for both reads and writes.
You should be able to use stateful transformations, but for now they are supported only in Java / Scala, and the interface is still experimental / evolving.
Depending on your requirements you can try to use an in-memory data grid, a key-value store, or a distributed cache.
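For illustration only, here is a rough sketch of the key-value store option using Redis via the redis-py client. The Redis instance, the key naming and the plain counter are all assumptions; this replaces the CharacteristicVector logic with a simple increment just to show the pattern:
import redis
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def update_grid_list(grid):
    # The state lives in Redis, not in the Python worker, so it survives
    # across micro-batches and across multiple worker processes.
    r = redis.Redis(host='localhost', port=6379)
    density = r.incr('grid:%s:density' % grid)  # atomic server-side update
    return str(density)

udf_update_grid_list = udf(update_grid_list, StringType())
In a real job you would reuse one connection per executor rather than opening one per call; this is kept simple on purpose.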
My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, and thus more data needs to be stored, I noticed performance issues: the script now computes data on average two to even ten times slower (the only thing that changed is the amount of data to be stored and later retrieved to be modified).
I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage but unfortunately in the docs they mention only how to configure frameworks such as Zope or Plone. I'm using ZODB only.
I wonder if RelStorage would be faster in my case.
Here's how I currently set up the ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear to me that ZODB is currently the bottleneck of my script.
I'm looking for advice on how I could solve this problem.
I chose ZODB because I thought that a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually built as follows:
actions_values = { # BTree
    str(state): { # BTree
        # contains actions (the column to pick, to be exact, as I'm working on an agent playing Connect 4)
        # and their values (only actions previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
state is a simple 2D array representing the game board. Possible values of its fields are 1, 2 or None:
board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000
player = ReinforcementPlayer(dbroot.actions_values, config)
while dbroot.games_played < should_play:
    # max_epsilon at start, then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(packing also takes much time but it's not the main problem; I could pack database after program finishes)
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(Functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards that are vertical reflections of each other are equivalent.)
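For reference, such a mirror helper is presumably just an index flip. A hypothetical sketch, assuming cols = 7 as above:
def mirror_action(self, action, cols=7):
    # column 0 maps to column 6, 1 maps to 5, and so on
    return cols - 1 - action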
After 550000 games len(dbroot.actions_values) is 6018450.
According to iotop, IO operations take 90% of the time.
Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (e.g. PostgreSQL plus SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here and the same things can most likely be done with ZODB quite easily.
Some ideas what can be done:
Have indexes of the data instead of loading full objects (loading everything is the equivalent of a SQL "full table scan"). Keep intelligent, preprocessed copies of the data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have the __slots__ trick; see the sketch after this list)
Use transactions in an intelligent fashion. Don't try to process all the data in one big chunk.
Parallel processing - use all CPU cores instead of a single-threaded approach
Don't use BTrees - maybe there is something more efficient for your use case
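As a small illustration of the __slots__ point above (the class and attribute names are made up, not taken from the original script):
class ActionValue(object):
    # Restricting instances to these attributes removes the per-instance
    # __dict__, which shrinks both pickles and RAM when you have millions of objects.
    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value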
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.
Just to be clear here, which BTree class are you actually using? An OOBTree?
A few points about those BTrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking their C code and recompiling to hold a larger number, as the value was chosen nearly two decades ago.
2) It is sometimes possible to construct very unbalanced BTrees. E.g. if you add values in sorted order (such as a timestamp that only ever increases) then you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.
3) Rather than using an OOBTree, you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so it may end up too big for your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on every update).
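A rough sketch of the difference, purely for illustration:
from BTrees.OOBTree import OOBTree, OOBucket

tree = OOBTree()     # splits into many small persistent buckets as it grows
bucket = OOBucket()  # always stored as one pickle in the ZODB

tree['some_state'] = 0.4356    # same mapping interface either way
bucket['some_state'] = 0.4356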
-Matt
Is there a way to use find_each in Django?
According to the Rails documentation:
This method is only intended to use for batch processing of large
amounts of records that wouldn’t fit in memory all at once. If you
just need to loop over less than 1000 records, it’s probably better
just to use the regular find methods.
http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each
Thanks.
One possible solution could be to use the built-in Paginator class (could save a lot of hassle).
https://docs.djangoproject.com/en/dev/topics/pagination/
Try something like:
from django.core.paginator import Paginator
from yourapp.models import YourModel
result_query = YourModel.objects.filter(<your find conditions>)
paginator = Paginator(result_query, 1000) # the desired batch size
for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can add your required code
        pass
Or, you could use the limiting options as your needs dictate to iterate over the results.
You can query parts of the whole table with a loop and by slicing the queryset.
If you're working with DEBUG = True, it's important that you flush the queries after each loop, since Django stores all the queries that were run until the script finishes or dies, and this can cause memory issues.
If you need to restrict the results of the queryset you can replace ".all()" with the appropriate ".filter(conditions)"
from django import db
from myapp import MyModel
# Getting the total of records in the table
total_count = MyModel.objects.all().count()
chunk_size = 1000 # You can change this to any amount you can keep in memory
total_checked = 0
while total_checked < total_count:
    # Querying all the objects and slicing only the part you need to work
    # with at the moment (only that part will be loaded into memory)
    query_set = MyModel.objects.all()[total_checked:total_checked + chunk_size]
    for item in query_set:
        # Do what you need to do with your results
        pass
    total_checked += chunk_size
    # Clearing django's query cache to avoid a memory leak
    db.reset_queries()
Premise: I'm not a programmer!
I've written a Python script to perform iterations over a DB's field with the ArcGIS 9.2 geoprocessor.
The algorithm has to iterate over more than two thousand records, and its speed decreases progressively until a single iteration takes 5-6 minutes!
This is my code:
# Import system modules
import sys, string, os, arcgisscripting
# Create the Geoprocessor object
gp = arcgisscripting.create()
# Set the necessary product code
gp.SetProduct("ArcEditor")
# Load required toolboxes...
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Data Management Tools.tbx")
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Analysis Tools.tbx")
gp.OverWriteOutput = "True"
# Global variables...
PP_ID_2012_dbf="C:\\...\\4Paths_PP_ID.dbf"
Paths_2012 = "C:\\...\\Paths_2012.shp"
# Search Cursor
src = gp.SearchCursor(PP_ID_2012_dbf)
row = src.Next()
src.Reset()
# While cycle
while row:
    SQL_expr = row.GetValue("Query")  # Query SQL
    print row.GetValue("Query")
    Selected_Path = "C:\\...\\Path_" + str(row.GetValue("PointPath_")) + ".shp"
    # Process: Select...
    gp.Select_analysis(Paths_2012, Selected_Path, SQL_expr)
    Paths_2012_Select_Simplify_shp = "C:\\...\\Path_" + str(row.GetValue("PointPath_")) + "_Simplify.shp"
    # Process: Simplify Lines...
    gp.SimplifyLine_management(Selected_Path, Paths_2012_Select_Simplify_shp, "POINT_REMOVE", "20 Meters", "FLAG_ERRORS", "KEEP_COLLAPSED_POINTS", "CHECK")
    del SQL_expr
    del Selected_Path
    del Paths_2012_Select_Simplify_shp
    row = src.Next()
What's wrong? I think it's a problem with cache memory, but I'm not able to solve it on my own.
Please, help me!
If you're performing a SQL query inside of a loop, it's plausible that you could make your code much faster by figuring out how to get all of the data in one query and then iterating through the result of that query to process it.
As someone else mentioned, you can figure out where you're spending your time by profiling your code, but DB calls are a common culprit, and if you can avoid making your DB calls inside of a loop, you may be able to reduce 2,000 db calls to one or two db calls.
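If you want to confirm where the time actually goes before restructuring anything, the standard library profiler is enough. A minimal sketch (the wrapped function is hypothetical; put your existing while cycle inside it):
import cProfile
import pstats

def process_all_rows():
    # run the existing search-cursor loop here
    pass

cProfile.run('process_all_rows()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive call sites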