Effective way to create .csv file from MongoDB - python

I have a MongoDB(media_mongo) with collection main_hikari and a lot of data inside. I'm trying to make a function to create a .csv file from this data asap. I'm using this code, but it takes too much time and CPU usage
import pymongo
from pymongo import MongoClient
mongo_client = MongoClient('mongodb://admin:password#localhost:27017')
db = mongo_client.media_mongo
def download_file(down_file_name="hikari"):
docs = pd.DataFrame(columns=[])
if down_file_name == "kokyaku":
col = db.main_kokyaku
if down_file_name == "hikari":
col = db.main_hikari
if down_file_name == "hikanshou":
col = db.main_hikanshou
cursor = col.find()
mongo_docs = list(cursor)
for num, doc in enumerate(mongo_docs):
doc["_id"] = str(doc["_id"])
doc_id = doc["_id"]
series_obj = pandas.Series(doc, name=doc_id)
docs = docs.append(series_obj)
csv_export = docs.to_csv("file.csv", sep=",")
download_file()
My database has data in this format (sorry for that Japanese :D)
_id:"ObjectId("5e0544c4f4eefce9ee9b5a8b")"
事業者受付番号:"data1"
開通区分/処理区分:"data2"
開通ST/処理ST:"data3"
申込日,顧客名:"data4"
郵便番号:"data5"
住所1:"data6"
住所2:"data7"
連絡先番号:"data8"
契約者電話番号:"data9"
And about 150000 entries like this

If you have a lot of data as you indicate, then this line is going to hurt you:
mongo_docs = list(cursor)
It basically means read the entire collection into a client-side array at once. This will create a huge memory high water mark.
Better to use mongoexport as noted above or walk the cursor yourself instead of having list() slurp the whole thing, e.g.:
cursor = col.find()
for doc in cursor:
# read docs one at a time
or to be very pythonic about it:
for doc in col.find(): # or find(expression of your choice)
# read docs one at a time

Related

How do i download a large collection in Firestore with Python without getting at 503 error?

Trying to count the number of docs in a firestore collection with python. When i use db.collection('xxxx").stream() i get the following error:
503 The datastore operation timed out, or the data was temporarily unavailable.
about half way through. It was working fine. Here is the code:
docs = db.collection(u'theDatabase').stream()
count = 0
for doc in docs:
count += 1
print (count)
Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20 second timeout?
Although Juan's answer works for basic counting, in case you need more of the data from Firebase and not just the id (a common use case of which is total migration of the data that is not through GCP), the recursive algorithm will eat your memory.
So I took Juan's code and transformed it to a standard iterative algorithm. Hope this helps someone.
limit = 1000 # Reduce this if it uses too much of your RAM
def stream_collection_loop(collection, count, cursor=None):
while True:
docs = [] # Very important. This frees the memory incurred in the recursion algorithm.
if cursor:
docs = [snapshot for snapshot in
collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
else:
docs = [snapshot for snapshot in collection.limit(limit).order_by('__name__').stream()]
for doc in docs:
print(doc.id)
print(count)
# The `doc` here is already a `DocumentSnapshot` so you can already call `to_dict` on it to get the whole document.
process_data_and_log_errors_if_any(doc)
count = count + 1
if len(docs) == limit:
cursor = docs[limit-1]
continue
break
stream_collection_loop(db_v3.collection('collection'), 0)
Try using a recursive function to batch document retrievals and keep them under the timeout. Here's an example based on the delete_collections snippet:
from google.cloud import firestore
# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()
def count_collection(coll_ref, count, cursor=None):
if cursor is not None:
docs = [snapshot.reference for snapshot
in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
else:
docs = [snapshot.reference for snapshot
in coll_ref.limit(1000).order_by("__name__").stream()]
count = count + len(docs)
if len(docs) == 1000:
return count_collection(coll_ref, count, docs[999].get())
else:
print(count)
count_collection(db.collection('users'), 0)
In other answers was shown how to use the pagination to solve the timeout issue.
I suggest to use a generator in combination with pagination, which lets you to process the documents in the same way as you were doing it with query.stream().
Here is an example of function that takes a Query and returns a generator in the same way as the Query stream() method.
from typing import Generator, Optional, Any
from google.cloud.firestore import Query, DocumentSnapshot
def paginate_query_stream(
query: Query,
order_by: str,
cursor: Optional[DocumentSnapshot] = None,
page_size: int = 10000,
) -> Generator[DocumentSnapshot, Any, None]:
paged_query = query.order_by(order_by)
document = cursor
has_any = True
while has_any:
has_any = False
if document:
paged_query = paged_query.start_after(document)
paged_query = paged_query.limit(page_size)
for document in paged_query.stream():
has_any = True
yield document
Take in mind if your target collection constantly grows then you need to filter the upper bound in the query in advance to prevent a potential infinite loop.
A usage example with counting of documents.
from google.cloud.firestore import Query
docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = Query(docs)
count = 0
for doc in paginate_query_stream(query, order_by='__name__'):
count += 1
print(count)

Parallelizing loading data from MongoDB into python

All documents in my collection in MongoDB have the same fields. My goal is to load them into Python into pandas.DataFrame or dask.DataFrame.
I'd like to speedup the loading procedure by parallelizing it. My plan is to spawn several processes or threads. Each process would load a chunk of a collection, then these chunks would be merged together.
How do I do it correctly with MongoDB?
I have tried similar approach with PostgreSQL. My initial idea was to use SKIP and LIMIT in SQL queries. It has failed, since each cursor, opened for each particular query, started reading data table from the beginning and just skipped specified amount of rows. So I had to create additional column, containing record numbers, and specify ranges of these numbers in queries.
On the contrary, MongoDB assigns unique ObjectID to each document. However, I've found that it is impossible to subtract one ObjectID from another, they can be only compared with ordering operations: less, greater and equal.
Also, pymongo returns the cursor object, that supports indexing operation and has some methods, seeming useful for my task, like count, limit.
MongoDB connector for Spark accomplishes this task somehow. Unfortunately, I'm not familiar with Scala, therefore, it's hard for me to find out how they do it.
So, what is the correct way for parallel loading data from Mongo into python?
up to now, I've come to the following solution:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# import other modules.
collection = get_mongo_collection()
cursor = collection.find({ })
def process_document(in_doc):
out_doc = # process doc keys and values
return pd.DataFrame(out_doc)
df = dd.from_delayed( (delayed(process_document)(d) for d in cursor) )
However, it looks like dask.dataframe.from_delayed internally creates a list from passed generator, effectively loading all collection in a single thread.
Update. I've found in docs, that skip method of pymongo.Cursor starts from beginning of a collection too, as PostgreSQL. The same page suggests using pagination logic in the application. Solutions, that I've found so far, use sorted _id for this. However, they also store last seen _id, that implies that they also work in a single thread.
Update2. I've found the code of the partitioner in the official MongoDb Spark connector: https://github.com/mongodb/mongo-spark/blob/7c76ed1821f70ef2259f8822d812b9c53b6f2b98/src/main/scala/com/mongodb/spark/rdd/partitioner/MongoPaginationPartitioner.scala#L32
Looks like, initially this partitioner reads the key field from all documents in the collection and calculates ranges of values.
Update3: My incomplete solution.
Doesn't work, gets the exception from pymongo, because dask seems to incorrectly treat the Collection object:
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/dask/delayed.pyc in <genexpr>(***failed resolving arguments***)
81 return expr, {}
82 if isinstance(expr, (Iterator, list, tuple, set)):
---> 83 args, dasks = unzip((to_task_dask(e) for e in expr), 2)
84 args = list(args)
85 dsk = sharedict.merge(*dasks)
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/pymongo/collection.pyc in __next__(self)
2342
2343 def __next__(self):
-> 2344 raise TypeError("'Collection' object is not iterable")
2345
2346 next = __next__
TypeError: 'Collection' object is not iterable
What raises the exception:
def process_document(in_doc, other_arg):
# custom processing of incoming records
return out_doc
def compute_id_ranges(collection, query, partition_size=50):
cur = collection.find(query, {'_id': 1}).sort('_id', pymongo.ASCENDING)
id_ranges = [cur[0]['_id']]
count = 1
for r in cur:
count += 1
if count > partition_size:
id_ranges.append(r['_id'])
count = 0
id_ranges.append(r['_id'])
return zip(id_ranges[:len(id_ranges)-1], id_ranges[1: ])
def load_chunk(id_pair, collection, query={}, projection=None):
q = query
q.update( {"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}} )
cur = collection.find(q, projection)
return pd.DataFrame([process_document(d, other_arg) for d in cur])
def parallel_load(*args, **kwargs):
collection = kwargs['collection']
query = kwargs.get('query', {})
projection = kwargs.get('projection', None)
id_ranges = compute_id_ranges(collection, query)
dfs = [ delayed(load_chunk)(ir, collection, query, projection) for ir in id_ranges ]
df = dd.from_delayed(dfs)
return df
collection = connect_to_mongo_and_return_collection_object(credentials)
# df = parallel_load(collection=collection)
id_ranges = compute_id_ranges(collection)
dedf = delayed(load_chunk)(id_ranges[0], collection)
load_chunk perfectly runs when called directly. However, call delayed(load_chunk)( blah-blah-blah ) fails with exception, mentioned above.
I was looking into pymongo parallelization and this is what worked for me. It took my humble gaming laptop nearly 100 minutes to process my mongodb of 40 million documents. The CPU was 100% utilised I had to turn on the AC :)
I used skip and limit functions to split the database, then assigned batches to processes. The code is written for Python 3:
import multiprocessing
from pymongo import MongoClient
def your_function(something):
<...>
return result
def process_cursor(skip_n,limit_n):
print('Starting process',skip_n//limit_n,'...')
collection = MongoClient().<db_name>.<collection_name>
cursor = collection.find({}).skip(skip_n).limit(limit_n)
for doc in cursor:
<do your magic>
# for example:
result = your_function(doc['your_field'] # do some processing on each document
# update that document by adding the result into a new field
collection.update_one({'_id': doc['_id']}, {'$set': {'<new_field_eg>': result} })
print('Completed process',skip_n//limit_n,'...')
if __name__ == '__main__':
n_cores = 7 # number of splits (logical cores of the CPU-1)
collection_size = 40126904 # your collection size
batch_size = round(collection_size/n_cores+0.5)
skips = range(0, n_cores*batch_size, batch_size)
processes = [ multiprocessing.Process(target=process_cursor, args=(skip_n,batch_size)) for skip_n in skips]
for process in processes:
process.start()
for process in processes:
process.join()
The last split will have a larger limit than the remaining documents, but that won't raise an error
I think dask-mongo will do the work for here. You can install it with pip or conda, and in the repo you can find some examples in a notebook.
dask-mongo will read the data you have in MongoDB as a Dask bag but then you can go from a Dask bag to a Dask Dataframe with df = b.to_dataframe() where b is the bag you read from mongo using with dask_mongo.read_mongo
"Read the mans, thery're rulez" :)
pymongo.Collection has method parallel_scan that returns a list of cursors.
UPDATE. This function can do the job, if the collection does not change too often, and queries are always the same (my case). One could just store query results in different collections and run parallel scans.

How to read through collection in chunks by 1000?

I need to read whole collection from MongoDB ( collection name is "test" ) in Python code. I tried like
self.__connection__ = Connection('localhost',27017)
dbh = self.__connection__['test_db']
collection = dbh['test']
How to read through collection in chunks by 1000 ( to avoid memory overflow because collection can be very large ) ?
inspired by #Rafael Valero + fixing last chunk bug in his code and making it more general I created generator function to iterate through mongo collection with query and projection:
def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
chunks = range(start_from, collection.find(query).count(), int(chunksize))
num_chunks = len(chunks)
for i in range(1,num_chunks+1):
if i < num_chunks:
yield collection.find(query, projection=projection)[chunks[i-1]:chunks[i]]
else:
yield collection.find(query, projection=projection)[chunks[i-1]:chunks.stop]
so for example you first create an iterator like this:
mess_chunk_iter = iterate_by_chunks(db_local.conversation_messages, 200, 0, query={}, projection=projection)
and then iterate it by chunks:
chunk_n=0
total_docs=0
for docs in mess_chunk_iter:
chunk_n=chunk_n+1
chunk_len = 0
for d in docs:
chunk_len=chunk_len+1
total_docs=total_docs+1
print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
print("total docs iterated: ", total_docs)
chunk #: 1, chunk_len: 400
chunk #: 2, chunk_len: 400
chunk #: 3, chunk_len: 400
chunk #: 4, chunk_len: 400
chunk #: 5, chunk_len: 400
chunk #: 6, chunk_len: 400
chunk #: 7, chunk_len: 281
total docs iterated: 2681
I agree with Remon, but you mention batches of 1000, which his answer doesn't really cover. You can set a batch size on the cursor:
cursor.batch_size(1000);
You can also skip records, e.g.:
cursor.skip(4000);
Is this what you're looking for? This is effectively a pagination pattern. However, if you're just trying to avoid memory exhaustion then you don't really need to set batch size or skip.
Use cursors. Cursors have a "batchSize" variable that controls how many documents are actually sent to the client per batch after doing a query. You don't have to touch this setting though since the default is fine and the complexity if invoking "getmore" commands is hidden from you in most drivers. I'm not familiar with pymongo but it works like this :
cursor = db.col.find() // Get everything!
while(cursor.hasNext()) {
/* This will use the documents already fetched and if it runs out of documents in it's local batch it will fetch another X of them from the server (where X is batchSize). */
document = cursor.next();
// Do your magic here
}
Here is a generic solution to iterate over any iterator or generator by batch:
def _as_batch(cursor, batch_size=50):
# iterate over something (pymongo cursor, generator, ...) by batch.
# Note: the last batch may contain less than batch_size elements.
batch = []
try:
while True:
for _ in range(batch_size):
batch.append(next(cursor))
yield batch
batch = []
except StopIteration as e:
if len(batch):
yield batch
This will work as long as the cursor defines a method __next__ (i.e. we can use next(cursor)). Thus, we can use it on raw cursor or also on transformed records.
Examples
Simple usage:
for batch in db['coll_name'].find():
# do stuff
More complex usage (useful for bulk updates for example):
def update_func(doc):
# dummy transform function
doc['y'] = doc['x'] + 1
return doc
query = (update_func(doc) for doc in db['coll_name'].find())
for batch in _as_batch(query):
# do stuff
Reimplementation of the count() function:
sum(map(len, _as_batch( db['coll_name'].find() )))
To the create the initial connection currently in Python 2 using Pymongo:
host = 'localhost'
port = 27017
db_name = 'test_db'
collection_name = 'test'
To connect using MongoClient
# Connect to MongoDB
client = MongoClient(host=host, port=port)
# Make a query to the specific DB and Collection
dbh = client[dbname]
collection = dbh[collection_name]
So from here the proper answer.
I want to read by using chunks (in this case of size 1000).
chunksize = 1000
For example we could decide the how many chunks of size (chunksize) we want.
# Some variables to create the chunks
skips_variable = range(0, db_aux[collection].find(query).count(), int(chunksize))
if len(skips_variable)<=1:
skips_variable = [0,len(skips_variable)]
Then we can retrieve each chunk.
for i in range(1,len(skips_variable)):
# Expand the cursor and retrieve data
data_from_chunk = dbh[collection_name].find(query)[skips_variable[i-1]:skips_variable[i]]))
Where query in this case is query = {}.
Here I use similar ideas to create dataframes from MongoDB.
Here I use something similar to write to MongoDB in chunks.
I hope it helps.

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = []
new = []
for row in fOld:
old.append(row)
for row in fNew:
new.append(row)
output = []
x = len(new)
i = 0
num = 0
while i < x:
if new[num] not in old:
fNewUpdate.writerow(new[num])
num += 1
i += 1
fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
fileAin.close()
fileBin.close()
output = {}
for row_id in new:
if row_id not in old or not old[row_id] == new[row_id]:
output[row_id] = new[row_id]
for row_id in output:
fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort

SQLAlchemy insert millions data inefficiently

It takes a long time to finish 100000 (user,password) tuple insertions.
def insertdata(db,name,val):
i = db.insert()
i.execute(user= name, password=val)
#-----main-------
tuplelist = readfile("C:/py/tst.txt") #parse file is really fast
mydb = initdatabase()
for ele in tuplelist:
insertdata(mydb,ele[0],ele[1])
which function take more time ? Is there a way to test bottleneck in python?
Can I avoid that by caching and commit later?
have the DBAPI handle iterating through the parameters.
def insertdata(db,tuplelist):
i = db.insert()
i.execute([dict(user=elem[0], password=elem[1]) for elem in tuplelist])
#-----main-------
tuplelist = readfile("C:/py/tst.txt") #parse file is really fast
mydb = initdatabase()
insertdata(mydb,tuplelist)

Categories

Resources