I need to read a whole collection from MongoDB (the collection name is "test") in Python code. I tried this:
self.__connection__ = Connection('localhost',27017)
dbh = self.__connection__['test_db']
collection = dbh['test']
How can I read through the collection in chunks of 1000 (to avoid running out of memory, because the collection can be very large)?
Inspired by @Rafael Valero's answer, fixing the last-chunk bug in his code and making it more general, I created a generator function to iterate through a Mongo collection with a query and projection:
def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
    chunks = range(start_from, collection.find(query).count(), int(chunksize))
    num_chunks = len(chunks)
    for i in range(1, num_chunks + 1):
        if i < num_chunks:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks[i]]
        else:
            yield collection.find(query, projection=projection)[chunks[i-1]:chunks.stop]
So, for example, you first create an iterator like this:
mess_chunk_iter = iterate_by_chunks(db_local.conversation_messages, 200, 0, query={}, projection=projection)
and then iterate over it chunk by chunk:
chunk_n = 0
total_docs = 0
for docs in mess_chunk_iter:
    chunk_n = chunk_n + 1
    chunk_len = 0
    for d in docs:
        chunk_len = chunk_len + 1
        total_docs = total_docs + 1
    print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
print("total docs iterated: ", total_docs)
chunk #: 1, chunk_len: 400
chunk #: 2, chunk_len: 400
chunk #: 3, chunk_len: 400
chunk #: 4, chunk_len: 400
chunk #: 5, chunk_len: 400
chunk #: 6, chunk_len: 400
chunk #: 7, chunk_len: 281
total docs iterated: 2681
I agree with Remon, but you mention batches of 1000, which his answer doesn't really cover. You can set a batch size on the cursor:
cursor.batch_size(1000);
You can also skip records, e.g.:
cursor.skip(4000);
Is this what you're looking for? This is effectively a pagination pattern. However, if you're just trying to avoid memory exhaustion then you don't really need to set batch size or skip.
Use cursors. Cursors have a "batchSize" variable that controls how many documents are actually sent to the client per batch after doing a query. You don't have to touch this setting, though, since the default is fine and the complexity of invoking "getmore" commands is hidden from you in most drivers. I'm not familiar with pymongo, but it works like this:
cursor = db.col.find() // Get everything!
while(cursor.hasNext()) {
    /* This will use the documents already fetched and, if it runs out of documents in its local batch, it will fetch another X of them from the server (where X is batchSize). */
    document = cursor.next();
    // Do your magic here
}
Here is a generic solution to iterate over any iterator or generator by batch:
def _as_batch(cursor, batch_size=50):
    # iterate over something (pymongo cursor, generator, ...) by batch.
    # Note: the last batch may contain less than batch_size elements.
    batch = []
    try:
        while True:
            for _ in range(batch_size):
                batch.append(next(cursor))
            yield batch
            batch = []
    except StopIteration:
        if len(batch):
            yield batch
This will work as long as the cursor defines a method __next__ (i.e. we can use next(cursor)). Thus, we can use it on raw cursor or also on transformed records.
Examples
Simple usage:
for batch in _as_batch(db['coll_name'].find()):
    pass  # do stuff
More complex usage (useful for bulk updates, for example):
def update_func(doc):
    # dummy transform function
    doc['y'] = doc['x'] + 1
    return doc
query = (update_func(doc) for doc in db['coll_name'].find())
for batch in _as_batch(query):
    pass  # do stuff
Reimplementation of the count() function:
sum(map(len, _as_batch( db['coll_name'].find() )))
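The same batching idea can also be expressed with itertools.islice from the standard library. This is only a sketch of an alternative, not a replacement for the answer's code: the name as_batches is made up for this example, and a plain generator stands in for the pymongo cursor. Because it calls iter() first, it also accepts plain iterables such as lists.

```python
from itertools import islice


def as_batches(iterable, batch_size=50):
    """Yield lists of up to batch_size items from any iterable (hypothetical helper)."""
    iterator = iter(iterable)  # works for cursors, generators, lists, ...
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


# Stand-in for a cursor: any iterable of documents will do.
docs = ({"x": i} for i in range(7))
batch_lengths = [len(batch) for batch in as_batches(docs, batch_size=3)]
print(batch_lengths)
```

As with _as_batch, the last batch may be shorter than batch_size.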
To create the initial connection in Python 2 using PyMongo:
host = 'localhost'
port = 27017
db_name = 'test_db'
collection_name = 'test'
To connect using MongoClient:
from pymongo import MongoClient
# Connect to MongoDB
client = MongoClient(host=host, port=port)
# Make a query to the specific DB and Collection
dbh = client[db_name]
collection = dbh[collection_name]
So from here the proper answer.
I want to read by using chunks (in this case of size 1000).
chunksize = 1000
For example, we can first decide how many chunks of size chunksize we need.
# Some variables to create the chunks
skips_variable = range(0, dbh[collection_name].find(query).count(), int(chunksize))
if len(skips_variable) <= 1:
    skips_variable = [0, len(skips_variable)]
Then we can retrieve each chunk.
# Note: this loop stops at the last skip value, so documents after it are not retrieved
for i in range(1, len(skips_variable)):
    # Expand the cursor and retrieve data
    data_from_chunk = dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]]
Where query in this case is query = {}.
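The slicing arithmetic can be checked without a database. The following is a minimal sketch, not the original author's code: chunk_bounds is a hypothetical helper that takes the total document count (however you obtain it) and returns (start, stop) pairs that also cover the final partial chunk.

```python
def chunk_bounds(total, chunksize):
    """Return (start, stop) slice bounds covering range(total) in chunks."""
    starts = range(0, total, chunksize)
    # min() clips the last chunk to the total count
    return [(start, min(start + chunksize, total)) for start in starts]


bounds = chunk_bounds(2681, 1000)
print(bounds)
```

Each pair could then be used to slice the cursor, in the same way as the skips_variable values above.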
Here I use similar ideas to create dataframes from MongoDB.
Here I use something similar to write to MongoDB in chunks.
I hope it helps.
I have a MongoDB database (media_mongo) with a collection main_hikari and a lot of data inside. I'm trying to write a function that creates a .csv file from this data as fast as possible. I'm using this code, but it takes too much time and CPU:
import pandas as pd
import pymongo
from pymongo import MongoClient
mongo_client = MongoClient('mongodb://admin:password@localhost:27017')
db = mongo_client.media_mongo
def download_file(down_file_name="hikari"):
    docs = pd.DataFrame(columns=[])
    if down_file_name == "kokyaku":
        col = db.main_kokyaku
    if down_file_name == "hikari":
        col = db.main_hikari
    if down_file_name == "hikanshou":
        col = db.main_hikanshou
    cursor = col.find()
    mongo_docs = list(cursor)
    for num, doc in enumerate(mongo_docs):
        doc["_id"] = str(doc["_id"])
        doc_id = doc["_id"]
        series_obj = pd.Series(doc, name=doc_id)
        docs = docs.append(series_obj)
    csv_export = docs.to_csv("file.csv", sep=",")
download_file()
My database has data in this format (sorry for that Japanese :D)
_id:"ObjectId("5e0544c4f4eefce9ee9b5a8b")"
事業者受付番号:"data1"
開通区分/処理区分:"data2"
開通ST/処理ST:"data3"
申込日,顧客名:"data4"
郵便番号:"data5"
住所1:"data6"
住所2:"data7"
連絡先番号:"data8"
契約者電話番号:"data9"
And about 150000 entries like this
If you have a lot of data as you indicate, then this line is going to hurt you:
mongo_docs = list(cursor)
It basically means read the entire collection into a client-side array at once. This will create a huge memory high water mark.
Better to use mongoexport as noted above or walk the cursor yourself instead of having list() slurp the whole thing, e.g.:
cursor = col.find()
for doc in cursor:
    # read docs one at a time
    pass
Or, to be more Pythonic about it:
for doc in col.find():  # or find(expression of your choice)
    # read docs one at a time
    pass
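Walking the cursor also means the CSV can be written row by row instead of building a DataFrame first. The following is a minimal sketch, assuming every document has the same fields; write_docs_to_csv and the sample documents are made up for illustration, and any iterable of dicts (such as col.find()) could be passed in instead.

```python
import csv
import io


def write_docs_to_csv(docs, out_file, fieldnames):
    """Write an iterable of dict-like documents to CSV one row at a time."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for doc in docs:
        row = dict(doc)
        row["_id"] = str(row["_id"])  # same str() conversion as in the question
        writer.writerow(row)
        count += 1
    return count


# Stand-in for col.find(): a generator of documents.
sample_docs = ({"_id": i, "name": f"data{i}"} for i in range(3))
buffer = io.StringIO()
written = write_docs_to_csv(sample_docs, buffer, fieldnames=["_id", "name"])
print(written)
```

Only one document is held in memory at a time, so the memory high-water mark no longer depends on the collection size.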
I'm trying to count the number of docs in a Firestore collection with Python. When I use db.collection('xxxx').stream() I get the following error:
503 The datastore operation timed out, or the data was temporarily unavailable.
about halfway through. It was working fine. Here is the code:
docs = db.collection(u'theDatabase').stream()
count = 0
for doc in docs:
    count += 1
print(count)
Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20 second timeout?
Although Juan's answer works for basic counting, if you need more of the data from Firebase than just the id (a common use case being a full migration of the data that does not go through GCP), the recursive algorithm will eat your memory.
So I took Juan's code and transformed it to a standard iterative algorithm. Hope this helps someone.
limit = 1000  # Reduce this if it uses too much of your RAM
def stream_collection_loop(collection, count, cursor=None):
    while True:
        docs = []  # Very important. This frees the memory incurred in the recursion algorithm.
        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]
        for doc in docs:
            print(doc.id)
            print(count)
            # The `doc` here is already a `DocumentSnapshot` so you can already call `to_dict` on it to get the whole document.
            process_data_and_log_errors_if_any(doc)
            count = count + 1
        if len(docs) == limit:
            cursor = docs[limit-1]
            continue
        break
stream_collection_loop(db_v3.collection('collection'), 0)
Try using a recursive function to batch document retrievals and keep them under the timeout. Here's an example based on the delete_collections snippet:
from google.cloud import firestore
# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()
def count_collection(coll_ref, count, cursor=None):
    if cursor is not None:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
    else:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").stream()]
    count = count + len(docs)
    if len(docs) == 1000:
        return count_collection(coll_ref, count, docs[999].get())
    else:
        print(count)
count_collection(db.collection('users'), 0)
Other answers show how to use pagination to solve the timeout issue.
I suggest using a generator in combination with pagination, which lets you process the documents in the same way as you did with query.stream().
Here is an example of a function that takes a Query and returns a generator, just as the Query stream() method does.
from typing import Generator, Optional, Any
from google.cloud.firestore import Query, DocumentSnapshot
def paginate_query_stream(
    query: Query,
    order_by: str,
    cursor: Optional[DocumentSnapshot] = None,
    page_size: int = 10000,
) -> Generator[DocumentSnapshot, Any, None]:
    paged_query = query.order_by(order_by)
    document = cursor
    has_any = True
    while has_any:
        has_any = False
        if document:
            paged_query = paged_query.start_after(document)
        paged_query = paged_query.limit(page_size)
        for document in paged_query.stream():
            has_any = True
            yield document
Keep in mind that if your target collection grows constantly, you need to put an upper bound on the query in advance to prevent a potential infinite loop.
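The control flow of such a paginating generator can be exercised without Firestore. The sketch below is an illustration under assumptions: fetch_page is a hypothetical callable standing in for a query with start_after and limit applied, and an in-memory list of ids stands in for the collection.

```python
def paginate(fetch_page, page_size):
    """Yield items page by page; fetch_page(after, limit) returns a list."""
    last = None
    while True:
        page = fetch_page(last, page_size)
        if not page:
            return  # an empty page means nothing is left
        for item in page:
            yield item
        last = page[-1]  # acts as the cursor for the next page


# Stand-in data source: sorted document ids held in memory.
ids = list(range(1, 26))


def fetch_page(after, limit):
    start = 0 if after is None else ids.index(after) + 1
    return ids[start:start + limit]


seen = list(paginate(fetch_page, page_size=10))
print(len(seen))
```

The caller sees one continuous stream of items, while each underlying request stays small.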
A usage example that counts the documents:
from google.cloud.firestore import Query
docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = Query(docs)
count = 0
for doc in paginate_query_stream(query, order_by='__name__'):
    count += 1
print(count)
All documents in my collection in MongoDB have the same fields. My goal is to load them into Python into pandas.DataFrame or dask.DataFrame.
I'd like to speed up the loading procedure by parallelizing it. My plan is to spawn several processes or threads. Each process would load a chunk of the collection, and then these chunks would be merged together.
How do I do it correctly with MongoDB?
I have tried a similar approach with PostgreSQL. My initial idea was to use SKIP and LIMIT in SQL queries. It failed, since each cursor opened for a particular query started reading the table from the beginning and just skipped the specified number of rows. So I had to create an additional column containing record numbers and specify ranges of these numbers in the queries.
MongoDB, by contrast, assigns a unique ObjectID to each document. However, I've found that it is impossible to subtract one ObjectID from another; they can only be compared with ordering operations: less, greater and equal.
Also, pymongo returns a cursor object that supports indexing and has some methods that seem useful for my task, like count and limit.
MongoDB connector for Spark accomplishes this task somehow. Unfortunately, I'm not familiar with Scala, therefore, it's hard for me to find out how they do it.
So, what is the correct way for parallel loading data from Mongo into python?
So far, I've come up with the following solution:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# import other modules.
collection = get_mongo_collection()
cursor = collection.find({ })
def process_document(in_doc):
    out_doc = ...  # process doc keys and values
    return pd.DataFrame(out_doc)
df = dd.from_delayed( (delayed(process_document)(d) for d in cursor) )
However, it looks like dask.dataframe.from_delayed internally creates a list from the passed generator, effectively loading the whole collection in a single thread.
Update. I've found in the docs that the skip method of pymongo.Cursor also starts from the beginning of the collection, like PostgreSQL. The same page suggests using pagination logic in the application. The solutions I've found so far use a sorted _id for this. However, they also store the last seen _id, which implies that they also work in a single thread.
Update2. I've found the code of the partitioner in the official MongoDb Spark connector: https://github.com/mongodb/mongo-spark/blob/7c76ed1821f70ef2259f8822d812b9c53b6f2b98/src/main/scala/com/mongodb/spark/rdd/partitioner/MongoPaginationPartitioner.scala#L32
It looks like this partitioner initially reads the key field from all documents in the collection and calculates ranges of values.
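A range calculation of that kind might look roughly like the following sketch. It is not the Spark connector's code: key_ranges is a hypothetical helper, and the integer keys stand in for sorted _id values. Each range is half-open, and the last one has no upper bound so that the final key is not lost.

```python
def key_ranges(sorted_keys, partition_size):
    """Split sorted keys into (lower, upper) pairs: lower <= key < upper.

    The last pair has upper=None, meaning "no upper bound".
    """
    lowers = sorted_keys[::partition_size]  # first key of every partition
    uppers = lowers[1:] + [None]
    return list(zip(lowers, uppers))


keys = list(range(100, 112))  # 12 stand-in keys
ranges = key_ranges(keys, partition_size=5)
print(ranges)
```

Since every pair describes an independent filter on the key, each worker could run its own query for one pair without knowing what the others have seen.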
Update3: My incomplete solution.
Doesn't work, gets the exception from pymongo, because dask seems to incorrectly treat the Collection object:
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/dask/delayed.pyc in <genexpr>(***failed resolving arguments***)
81 return expr, {}
82 if isinstance(expr, (Iterator, list, tuple, set)):
---> 83 args, dasks = unzip((to_task_dask(e) for e in expr), 2)
84 args = list(args)
85 dsk = sharedict.merge(*dasks)
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/pymongo/collection.pyc in __next__(self)
2342
2343 def __next__(self):
-> 2344 raise TypeError("'Collection' object is not iterable")
2345
2346 next = __next__
TypeError: 'Collection' object is not iterable
What raises the exception:
def process_document(in_doc, other_arg):
    # custom processing of incoming records
    return out_doc
def compute_id_ranges(collection, query, partition_size=50):
    cur = collection.find(query, {'_id': 1}).sort('_id', pymongo.ASCENDING)
    id_ranges = [cur[0]['_id']]
    count = 1
    for r in cur:
        count += 1
        if count > partition_size:
            id_ranges.append(r['_id'])
            count = 0
    id_ranges.append(r['_id'])
    return zip(id_ranges[:len(id_ranges)-1], id_ranges[1: ])
def load_chunk(id_pair, collection, query={}, projection=None):
    q = query
    q.update( {"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}} )
    cur = collection.find(q, projection)
    return pd.DataFrame([process_document(d, other_arg) for d in cur])
def parallel_load(*args, **kwargs):
    collection = kwargs['collection']
    query = kwargs.get('query', {})
    projection = kwargs.get('projection', None)
    id_ranges = compute_id_ranges(collection, query)
    dfs = [ delayed(load_chunk)(ir, collection, query, projection) for ir in id_ranges ]
    df = dd.from_delayed(dfs)
    return df
collection = connect_to_mongo_and_return_collection_object(credentials)
# df = parallel_load(collection=collection)
id_ranges = compute_id_ranges(collection)
dedf = delayed(load_chunk)(id_ranges[0], collection)
load_chunk runs perfectly when called directly. However, calling delayed(load_chunk)( blah-blah-blah ) fails with the exception mentioned above.
I was looking into pymongo parallelization and this is what worked for me. It took my humble gaming laptop nearly 100 minutes to process my MongoDB collection of 40 million documents. The CPU was 100% utilised and I had to turn on the AC :)
I used skip and limit functions to split the database, then assigned batches to processes. The code is written for Python 3:
import multiprocessing
from pymongo import MongoClient
def your_function(something):
    <...>
    return result
def process_cursor(skip_n, limit_n):
    print('Starting process', skip_n//limit_n, '...')
    collection = MongoClient().<db_name>.<collection_name>
    cursor = collection.find({}).skip(skip_n).limit(limit_n)
    for doc in cursor:
        <do your magic>
        # for example:
        result = your_function(doc['your_field'])  # do some processing on each document
        # update that document by adding the result into a new field
        collection.update_one({'_id': doc['_id']}, {'$set': {'<new_field_eg>': result}})
    print('Completed process', skip_n//limit_n, '...')
if __name__ == '__main__':
    n_cores = 7  # number of splits (logical cores of the CPU-1)
    collection_size = 40126904  # your collection size
    batch_size = round(collection_size/n_cores+0.5)
    skips = range(0, n_cores*batch_size, batch_size)
    processes = [multiprocessing.Process(target=process_cursor, args=(skip_n, batch_size)) for skip_n in skips]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
The last split will have a limit larger than the number of remaining documents, but that won't raise an error.
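The skip/limit arithmetic can also be written with integer ceiling division, which avoids rounding a float. This is a minimal sketch; split_for_workers is a hypothetical name, and the numbers are the ones from the code above.

```python
def split_for_workers(collection_size, n_workers):
    """Return (skip, limit) pairs that cover collection_size documents.

    Assumes collection_size > 0 and n_workers > 0.
    """
    batch_size = -(-collection_size // n_workers)  # ceiling division
    return [(skip, batch_size) for skip in range(0, collection_size, batch_size)]


parts = split_for_workers(40126904, 7)
print(parts[0], parts[-1])
```

Each pair can be passed to a worker as its skip_n and limit_n arguments.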
I think dask-mongo will do the job here. You can install it with pip or conda, and in the repo you can find some examples in a notebook.
dask-mongo will read the data you have in MongoDB as a Dask bag, but then you can go from a Dask bag to a Dask DataFrame with df = b.to_dataframe(), where b is the bag you read from Mongo using dask_mongo.read_mongo.
"Read the mans, they're rulez" :)
pymongo.Collection has method parallel_scan that returns a list of cursors.
UPDATE. This function can do the job, if the collection does not change too often, and queries are always the same (my case). One could just store query results in different collections and run parallel scans.
I need to fetch a huge amount of data from Oracle (using cx_Oracle) in Python 2.6 and produce a csv file.
The data size is about 400k records x 200 columns x 100 chars each.
What is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
... the script remains in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SqlDeveloper is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch chunks of rows whose size is defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        #log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        # Process your rows here
        pass
That is exactly how I do it in my TableHunter for Oracle.
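fetchmany() is part of the Python DB-API, so the pattern can be tried without an Oracle server. The sketch below uses the standard library's sqlite3 module as a stand-in for cx_Oracle; the table and the iter_chunks name are made up for this example.

```python
import sqlite3


def iter_chunks(cursor, size=256):
    """Yield lists of rows, at most `size` rows per list."""
    cursor.arraysize = size  # fetchmany() defaults to cursor.arraysize
    while True:
        rows = cursor.fetchmany()
        if not rows:
            break
        yield rows


# In-memory database standing in for the real Oracle connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, f"row{i}") for i in range(1000)])

cur = conn.cursor()
cur.execute("SELECT id, name FROM t ORDER BY id")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cur, size=256)]
print(chunk_sizes)
```

Only one chunk of rows is held in memory at a time, and a progress message can be printed once per chunk rather than once per row.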
add print statements after each line
add a counter to your loop indicating progress after each N rows
look into a module like 'progressbar' for displaying a progress indicator
I think your code is asking the database for the data one row at the time which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
Results = ctemp.fetchall()
for row in Results:
    file.write(row[1])
I'm developing part of a system where processes are limited to about 350MB of RAM; we use cx_Oracle to download files from an external system for processing.
The external system stores files as BLOBs, and we can grab them doing something like this:
# ... set up Oracle connection, then
cursor.execute(u"""SELECT filename, data, filesize
                  FROM FILEDATA
                  WHERE ID = :id""", id=the_one_you_wanted)
filename, lob, filesize = cursor.fetchone()
with open(filename, "w") as the_file:
    the_file.write(lob.read())
lob.read() will obviously fail with MemoryError when we hit a file larger than 300-350MB, so we've tried something like this instead of reading it all at once:
read_size = 0
chunk_size = lob.getchunksize() * 100
while read_size < filesize:
    data = lob.read(chunk_size, read_size + 1)
    read_size += len(data)
    the_file.write(data)
Unfortunately, we still get MemoryError after several iterations. From the time lob.read() is taking, and the out-of-memory condition we eventually get, it looks as if lob.read() is pulling ( chunk_size + read_size ) bytes from the database every time. That is, reads are taking O(n) time and O(n) memory, even though the buffer is quite a bit smaller.
To work around this, we've tried something like:
read_size = 0
while read_size < filesize:
    q = u'''SELECT dbms_lob.substr(data, 2000, %s)
            FROM FILEDATA WHERE ID = :id''' % (read_size + 1)
    cursor.execute(q, id=filedataid[0])
    row = cursor.fetchone()
    read_size += len(row[0])
    the_file.write(row[0])
This pulls 2000 bytes (argh) at a time, and takes forever (something like two hours for a 1.5GB file). Why 2000 bytes? According to the Oracle docs, dbms_lob.substr() stores its return value in a RAW, which is limited to 2000 bytes.
Is there some way I can store the dbms_lob.substr() results in a larger data object and read maybe a few megabytes at a time? How do I do this with cx_Oracle?
I think that the argument order in lob.read() is reversed in your code. The first argument should be the offset, the second argument should be the amount to read. This would explain the O(n) time and memory usage.
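With the arguments in that order, the chunked loop from the question could look like the sketch below. FakeLob is a made-up stand-in so the example runs without Oracle; it only assumes that read(offset, amount) takes a 1-based offset and that size() returns the total length, which is my understanding of how cx_Oracle LOB objects behave. copy_lob is likewise a hypothetical name.

```python
import io


class FakeLob:
    """In-memory stand-in for a LOB with a 1-based read(offset, amount)."""

    def __init__(self, data):
        self._data = data

    def size(self):
        return len(self._data)

    def read(self, offset=1, amount=None):
        start = offset - 1
        end = None if amount is None else start + amount
        return self._data[start:end]


def copy_lob(lob, out_file, chunk_size):
    """Copy a LOB to out_file in pieces of at most chunk_size."""
    offset = 1  # LOB offsets start at 1, not 0
    total = lob.size()
    while offset <= total:
        data = lob.read(offset, chunk_size)  # offset first, then amount
        out_file.write(data)
        offset += len(data)
    return offset - 1  # number of bytes copied


payload = bytes(range(256)) * 10  # 2560 bytes of sample data
target = io.BytesIO()
copied = copy_lob(FakeLob(payload), target, chunk_size=1000)
print(copied)
```

Each call then asks for at most chunk_size bytes, so memory use should stay bounded by the chunk size rather than growing with the offset.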