Parallelizing loading data from MongoDB into Python

All documents in my collection in MongoDB have the same fields. My goal is to load them into Python into pandas.DataFrame or dask.DataFrame.
I'd like to speed up the loading procedure by parallelizing it. My plan is to spawn several processes or threads; each would load a chunk of the collection, and then these chunks would be merged together.
How do I do it correctly with MongoDB?
I have tried a similar approach with PostgreSQL. My initial idea was to use SKIP and LIMIT in SQL queries. It failed, since each cursor, opened for each particular query, started reading the data table from the beginning and just skipped the specified number of rows. So I had to create an additional column containing record numbers and specify ranges of these numbers in the queries.
In contrast, MongoDB assigns a unique ObjectId to each document. However, I've found that it is impossible to subtract one ObjectId from another; they can only be compared with ordering operations: less than, greater than, and equal.
Also, pymongo returns a cursor object that supports indexing and has some methods that seem useful for my task, such as count and limit.
The MongoDB connector for Spark accomplishes this task somehow. Unfortunately, I'm not familiar with Scala, so it's hard for me to find out how they do it.
So, what is the correct way to load data from Mongo into Python in parallel?
Up to now, I've come up with the following solution:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# import other modules.

collection = get_mongo_collection()
cursor = collection.find({})

def process_document(in_doc):
    out_doc = ...  # process doc keys and values
    return pd.DataFrame(out_doc)

df = dd.from_delayed((delayed(process_document)(d) for d in cursor))
However, it looks like dask.dataframe.from_delayed internally creates a list from the passed generator, effectively loading the whole collection in a single thread.
Update. I've found in the docs that the skip method of pymongo.Cursor starts from the beginning of the collection too, just like PostgreSQL. The same page suggests implementing pagination logic in the application. Solutions that I've found so far use a sorted _id for this. However, they also store the last seen _id, which implies that they also work in a single thread.
Update 2. I've found the code of the partitioner in the official MongoDB Spark connector: https://github.com/mongodb/mongo-spark/blob/7c76ed1821f70ef2259f8822d812b9c53b6f2b98/src/main/scala/com/mongodb/spark/rdd/partitioner/MongoPaginationPartitioner.scala#L32
It looks like this partitioner initially reads the key field from all documents in the collection and calculates ranges of values.
Update 3: my incomplete solution.
It doesn't work; it gets an exception from pymongo, because dask seems to treat the Collection object incorrectly:
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/dask/delayed.pyc in <genexpr>(***failed resolving arguments***)
81 return expr, {}
82 if isinstance(expr, (Iterator, list, tuple, set)):
---> 83 args, dasks = unzip((to_task_dask(e) for e in expr), 2)
84 args = list(args)
85 dsk = sharedict.merge(*dasks)
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/pymongo/collection.pyc in __next__(self)
2342
2343 def __next__(self):
-> 2344 raise TypeError("'Collection' object is not iterable")
2345
2346 next = __next__
TypeError: 'Collection' object is not iterable
What raises the exception:
import pymongo
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

def process_document(in_doc, other_arg):
    # custom processing of incoming records
    return out_doc

def compute_id_ranges(collection, query={}, partition_size=50):
    cur = collection.find(query, {'_id': 1}).sort('_id', pymongo.ASCENDING)
    id_ranges = [cur[0]['_id']]
    count = 1
    for r in cur:
        count += 1
        if count > partition_size:
            id_ranges.append(r['_id'])
            count = 0
    id_ranges.append(r['_id'])
    return zip(id_ranges[:len(id_ranges) - 1], id_ranges[1:])

def load_chunk(id_pair, collection, query={}, projection=None):
    q = query
    q.update({"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}})
    cur = collection.find(q, projection)
    return pd.DataFrame([process_document(d, other_arg) for d in cur])

def parallel_load(*args, **kwargs):
    collection = kwargs['collection']
    query = kwargs.get('query', {})
    projection = kwargs.get('projection', None)
    id_ranges = compute_id_ranges(collection, query)
    dfs = [delayed(load_chunk)(ir, collection, query, projection) for ir in id_ranges]
    df = dd.from_delayed(dfs)
    return df

collection = connect_to_mongo_and_return_collection_object(credentials)
# df = parallel_load(collection=collection)
id_ranges = compute_id_ranges(collection)
dedf = delayed(load_chunk)(id_ranges[0], collection)
load_chunk runs perfectly when called directly. However, the call delayed(load_chunk)(...) fails with the exception mentioned above.
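A workaround worth trying (a sketch only, under the assumption that the failure comes from dask traversing the non-iterable Collection argument): don't pass the live Collection into delayed at all, but pass plain connection parameters and open the collection inside each task. The URI, database and collection names below are placeholders, and process_document / other_arg are the ones from the code above.

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
from pymongo import MongoClient

def load_chunk_by_params(id_pair, mongo_uri, db_name, coll_name, query=None, projection=None):
    # Open a fresh client inside the task, so only picklable arguments
    # (strings, dicts, ObjectIds) cross the delayed boundary.
    coll = MongoClient(mongo_uri)[db_name][coll_name]
    q = dict(query or {})
    q.update({"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}})
    docs = coll.find(q, projection)
    return pd.DataFrame([process_document(d, other_arg) for d in docs])

# id_ranges is still computed once in the main process, as above
dfs = [delayed(load_chunk_by_params)(ir, "mongodb://localhost:27017", "my_db", "my_coll")
       for ir in id_ranges]
df = dd.from_delayed(dfs)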

I was looking into pymongo parallelization and this is what worked for me. It took my humble gaming laptop nearly 100 minutes to process my MongoDB collection of 40 million documents. The CPU was 100% utilised; I had to turn on the AC :)
I used the skip and limit functions to split the database, then assigned the batches to processes. The code is written for Python 3:
import multiprocessing
from pymongo import MongoClient

def your_function(something):
    <...>
    return result

def process_cursor(skip_n, limit_n):
    print('Starting process', skip_n // limit_n, '...')
    collection = MongoClient().<db_name>.<collection_name>
    cursor = collection.find({}).skip(skip_n).limit(limit_n)
    for doc in cursor:
        <do your magic>
        # for example:
        result = your_function(doc['your_field'])  # do some processing on each document
        # update that document by adding the result into a new field
        collection.update_one({'_id': doc['_id']}, {'$set': {'<new_field_eg>': result}})
    print('Completed process', skip_n // limit_n, '...')

if __name__ == '__main__':
    n_cores = 7                  # number of splits (logical cores of the CPU - 1)
    collection_size = 40126904   # your collection size
    batch_size = round(collection_size / n_cores + 0.5)
    skips = range(0, n_cores * batch_size, batch_size)

    processes = [multiprocessing.Process(target=process_cursor, args=(skip_n, batch_size)) for skip_n in skips]

    for process in processes:
        process.start()

    for process in processes:
        process.join()
The last split will have a limit larger than the number of remaining documents, but that won't raise an error.

I think dask-mongo will do the work here. You can install it with pip or conda, and in the repo you can find some examples in a notebook.
dask-mongo reads the data you have in MongoDB as a Dask bag, and you can then go from a Dask bag to a Dask DataFrame with df = b.to_dataframe(), where b is the bag you read from Mongo using dask_mongo.read_mongo.
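A rough sketch of what that could look like (the keyword names follow the dask-mongo examples as I remember them, so double-check the current signature; the host, database and collection values are placeholders):

from dask_mongo import read_mongo

# Read the collection as a Dask bag, partitioned into chunks of 500 documents.
bag = read_mongo(
    database="my_db",
    collection="my_collection",
    connection_kwargs={"host": "localhost", "port": 27017},
    chunksize=500,
)

df = bag.to_dataframe()  # Dask bag -> Dask DataFrame, as described above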

"Read the mans, thery're rulez" :)
pymongo.Collection has method parallel_scan that returns a list of cursors.
UPDATE. This function can do the job, if the collection does not change too often, and queries are always the same (my case). One could just store query results in different collections and run parallel scans.
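For what it's worth, a minimal sketch of draining those cursors concurrently, assuming a pymongo/MongoDB version that still provides parallel_scan (it has been deprecated and removed in newer releases):

import threading
import pandas as pd

def drain(cursor, results, idx):
    # Each thread consumes one of the independent cursors.
    results[idx] = pd.DataFrame(list(cursor))

cursors = collection.parallel_scan(4)  # up to 4 cursors over disjoint parts of the collection
results = [None] * len(cursors)
threads = [threading.Thread(target=drain, args=(c, results, i)) for i, c in enumerate(cursors)]
for t in threads:
    t.start()
for t in threads:
    t.join()

df = pd.concat(results, ignore_index=True)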

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high-dimensional biological count data (single-cell RNA sequencing, where rows are cell IDs and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (e.g. muscle cell, heart cell), subtype (e.g. a lung dataset can be split into normal lung and cancerous lung), cancer stage (e.g. stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, and gene combination, and keep that readily accessible so that when a person queries my web app for a plot, I can quickly retrieve the results (refer to the figure below to understand what I want to create). I have written Python code to assemble the dictionary below, and it has sped up how quickly I can create visualizations.
The only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or should I consider another storage framework (I briefly saw something called Redis hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files takes similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write them to hdf5, and query them:
import h5py
import numpy as np
import time

def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}': np.random.random() for i in range(num_agg_metrics)}

    level, num_groups = level_counts.popitem()
    return {f'{level}_{i+1}': create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path, d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f, d):
        for k, v in d.items():
            #check if the next level is also a dict
            sk, sv = v.popitem()
            v[sk] = sv

            if type(sv) == dict:
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k), v)
            else:
                #this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk, sv in v.items():
                    leaf.attrs[sk] = sv

    with h5py.File(hdf5_path, 'w') as f:
        _recur_write(f, d)

def query_hdf5(hdf5_path, search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path, 'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}
        return dict(f.attrs)

################
#    start     #
################

#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene': 40,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene': 400,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

#Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'

np.random.seed(1)

start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time() - start))

start = time.time()
write_dict_to_hdf5(hdf5_path, data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time() - start))

#Search terms in order of most broad to least
search_terms = ['Metadata_1', 'Unique_Group_3', 'Dataset_8', 'Cell_Type_15', 'Gene_17']

start = time.time()
query_result = query_hdf5(hdf5_path, search_terms)
print('queried in {:.2f} seconds'.format(time.time() - start))

direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of 'Agg_metric_1', 'Agg_metric_2', etc. for every gene in your dataset. It is likely that these duplicate strings are taking up a significant amount of memory. These can be deduplicated with sys.intern, so that although you still have just as many references to the string in your dictionary, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code by simply changing the assignment to data[sys.intern('Agg_metric_1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
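A small illustration of the idea (the key names just mirror the test data above, not your real data):

import sys

# Interned keys: every dict entry points at the same single string object
# instead of carrying its own copy of 'Agg_metric_1', 'Agg_metric_2', ...
gene_metrics = {}
for gene in ('Gene_1', 'Gene_2'):
    gene_metrics[sys.intern(gene)] = {
        sys.intern('Agg_metric_1'): 0.42,
        sys.intern('Agg_metric_2'): 0.17,
    }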

How do I download a large collection in Firestore with Python without getting a 503 error?

Trying to count the number of docs in a Firestore collection with Python. When I use db.collection('xxxx').stream() I get the following error:
503 The datastore operation timed out, or the data was temporarily unavailable.
about halfway through. It was working fine. Here is the code:
docs = db.collection(u'theDatabase').stream()

count = 0
for doc in docs:
    count += 1
print(count)
Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20 second timeout?
Although Juan's answer works for basic counting, in case you need more of the data from Firebase and not just the id (a common use case being a full migration of the data that doesn't go through GCP), the recursive algorithm will eat your memory.
So I took Juan's code and transformed it into a standard iterative algorithm. Hope this helps someone.
limit = 1000  # Reduce this if it uses too much of your RAM

def stream_collection_loop(collection, count, cursor=None):
    while True:
        docs = []  # Very important. This frees the memory incurred in the recursion algorithm.
        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').stream()]

        for doc in docs:
            print(doc.id)
            print(count)
            # The `doc` here is already a `DocumentSnapshot`, so you can call
            # `to_dict` on it to get the whole document.
            process_data_and_log_errors_if_any(doc)
            count = count + 1

        if len(docs) == limit:
            cursor = docs[limit - 1]
            continue

        break

stream_collection_loop(db_v3.collection('collection'), 0)
Try using a recursive function to batch document retrievals and keep them under the timeout. Here's an example based on the delete_collections snippet:
from google.cloud import firestore

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()

def count_collection(coll_ref, count, cursor=None):
    if cursor is not None:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
    else:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").stream()]

    count = count + len(docs)

    if len(docs) == 1000:
        return count_collection(coll_ref, count, docs[999].get())
    else:
        print(count)

count_collection(db.collection('users'), 0)
Other answers have shown how to use pagination to solve the timeout issue.
I suggest using a generator in combination with pagination, which lets you process the documents in the same way as you would with query.stream().
Here is an example of a function that takes a Query and returns a generator in the same way as the Query stream() method.
from typing import Generator, Optional, Any
from google.cloud.firestore import Query, DocumentSnapshot

def paginate_query_stream(
    query: Query,
    order_by: str,
    cursor: Optional[DocumentSnapshot] = None,
    page_size: int = 10000,
) -> Generator[DocumentSnapshot, Any, None]:
    paged_query = query.order_by(order_by)
    document = cursor
    has_any = True
    while has_any:
        has_any = False
        if document:
            paged_query = paged_query.start_after(document)
        paged_query = paged_query.limit(page_size)
        for document in paged_query.stream():
            has_any = True
            yield document
Bear in mind that if your target collection grows constantly, you need to filter the upper bound in the query in advance to prevent a potential infinite loop.
A usage example that counts documents:
from google.cloud.firestore import Query

docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = Query(docs)

count = 0
for doc in paginate_query_stream(query, order_by='__name__'):
    count += 1
print(count)

Azure Database ingestion speed

I'm trying to improve the ingestion speed in Azure SQL. Even though I'm using a SQLAlchemy connection pool, the speed doesn't increase at all after a certain number of threads and is stuck at about 700 inserts per second.
Azure SQL shows 50% resource utilization. The code runs within Azure, so network shouldn't be an issue.
Is there a way to increase the speed?
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED

import pyodbc
from sqlalchemy.engine import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

def connect():
    return pyodbc.connect('....')

engine = create_engine('mssql+pyodbc://', creator=connect, pool_recycle=20, pool_size=128, pool_timeout=30)
session_factory = sessionmaker(bind=engine)

def process_entry(i):
    session = scoped_session(session_factory)
    # skipping logic for computing vec, name1, name2
    # vec - list of floats, name1, name2 - strings
    vec = [55.0, 33.2, 22.3, 44.5]
    name1 = 'foo'
    name2 = 'bar'
    for j, score in enumerate(vec):
        parms = {'name1': name1, 'name2': name2, 'score': score}
        try:
            session.execute('INSERT INTO sometbl (name1, name2, score) VALUES (:name1, :name2, :score)', parms)
            session.commit()
        except Exception as e:
            print(e)

fs = []
pool = ThreadPoolExecutor(max_workers=128)
for i in range(0, N):  # N = number of entries to process
    future = pool.submit(process_entry, i)
    fs.append(future)
concurrent.futures.wait(fs, timeout=None, return_when=ALL_COMPLETED)
commit()ing every row imposes a wait for the row to be saved to the log file and possibly to a secondary replica. Instead, commit every N rows. Something like:
rn = 0
for j, score in enumerate(vec):
    parms = {'name1': name1, 'name2': name2, 'score': score}
    try:
        session.execute('INSERT INTO sometbl (name1, name2, score) VALUES (:name1, :name2, :score)', parms)
        rn = rn + 1
        if rn % 100 == 0:
            session.commit()
    except Exception as e:
        print(e)
session.commit()
If you want to load data even faster, you can send JSON documents containing batches of data and parse and insert them in bulk using the OPENJSON function in SQL Server. There are also special bulk loading APIs, but AFAIK these aren't easily accessible from Python.
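For example, a rough sketch of the JSON/OPENJSON idea on top of the question's session; the column types in the WITH clause are assumptions, and name1, name2 and vec are the variables from the question:

import json

# One round trip and one commit per batch: send the whole batch as a JSON
# array and let SQL Server shred it with OPENJSON.
rows = [{'name1': name1, 'name2': name2, 'score': s} for s in vec]
session.execute(
    """
    INSERT INTO sometbl (name1, name2, score)
    SELECT name1, name2, score
    FROM OPENJSON(:payload)
    WITH (name1 NVARCHAR(200), name2 NVARCHAR(200), score FLOAT)
    """,
    {'payload': json.dumps(rows)},
)
session.commit()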
You'll also probably hit maximum throughput at a modest number of workers. Unless your table is Memory Optimized it's likely that your inserts will need to coordinate access to shared resources, like latching the leading page in a BTree, or row locks in secondary indexes.
The high level of concurrency you currently have is probably just (partially) compensating for the current per-row commit strategy.
Depending on the database size, you may consider scaling up to a premium tier during these IO-intensive workloads to speed things up, and once they finish you can scale back to the original tier.
You may also consider using batching to improve performance. Here you will find how to use batching and other strategies to improve insert performance, like SqlBulkCopy, UpdateBatchSize, etc.; a Python-side sketch follows the guidelines below.
For the fastest insert performance, follow these general guidelines but test your scenario:
For < 100 rows, use a single parameterized INSERT command.
For < 1000 rows, use table-valued parameters.
For >= 1000 rows, use SqlBulkCopy.
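From Python specifically, the closest thing to SqlBulkCopy that I'm aware of is pyodbc's fast_executemany, which SQLAlchemy can enable on the engine; a minimal sketch (the connection string is a placeholder, table and columns follow the question):

from sqlalchemy import create_engine, text

# fast_executemany turns executemany() into a single parameter-array round trip
# instead of one INSERT statement per row.
engine = create_engine('mssql+pyodbc://user:password@my_dsn', fast_executemany=True)

rows = [{'name1': 'foo', 'name2': 'bar', 'score': s} for s in (55.0, 33.2, 22.3, 44.5)]
with engine.begin() as conn:  # one transaction, one commit for the whole batch
    conn.execute(
        text('INSERT INTO sometbl (name1, name2, score) VALUES (:name1, :name2, :score)'),
        rows,
    )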

Fastest way to get a large number of nodes from Neo4j using py2neo

I'm trying to load nodes (about 400) and relationships (about 800) from a Neo4j DB to create a force-directed graph using D3. This is my get function (I'm using Tornado):
def get(self):
    query_string = "START r=rel(*) RETURN r"
    query = neo4j.CypherQuery(graph_db, query_string)
    results = query.execute().data

    start = set([r[0].start_node for r in results])
    end = set([r[0].end_node for r in results])
    nodes_to_keep = list(start.union(end))

    nodes = []
    for n in nodes_to_keep:
        nodes.append({
            "name": n['name'].encode('utf-8'),
            "group": n['type'].encode('utf-8'),
            "description": n['description'].encode('utf-8'),
            "node": int(n['node_id'])})

    #links
    links = []
    for r in results:
        links.append({"source": int(r[0].start_node['node_id']),
                      "target": int(r[0].end_node['node_id'])})

    self.render(
        "index.html",
        page_title='My Page',
        page_heading='Sweet D3 Force Diagram',
        nodes=nodes,
        links=links,
    )
I'm thinking the expensive part is the for n in nodes_to_keep: and for r in results: loops, since every time I get a property, that's a trip to the server. Right?
What's the best way to accomplish this task?
The reason the above process was taking so long is that every time I asked for a node property, I was taking a trip to the server to fetch something out of the database. I was able to drastically reduce the time this process takes by simply modifying the Cypher query.
For instance, to get all nodes with relationships I used this query:
query_string = """MATCH (n)-[r]-(m)
RETURN n, n.node_id, n.name, n.type, n.description, m.node_id, m.name, m.type, m.description"""
query = neo4j.CypherQuery(graph_db, query_string)
results = query.execute().data
The results contain the information I need, so I just loop through the results to get the properties.
The takeaway is that you need to write your queries such that they get you the info you need the first time around.

How to read through collection in chunks by 1000?

I need to read a whole collection from MongoDB (the collection name is "test") in Python code. I tried
self.__connection__ = Connection('localhost',27017)
dbh = self.__connection__['test_db']
collection = dbh['test']
How do I read through the collection in chunks of 1000 (to avoid memory overflow, because the collection can be very large)?
Inspired by @Rafael Valero, fixing the last-chunk bug in his code, and making it more general, I created a generator function to iterate through a Mongo collection with a query and projection:
def iterate_by_chunks(collection, chunksize=1, start_from=0, query={}, projection={}):
    chunks = range(start_from, collection.find(query).count(), int(chunksize))
    num_chunks = len(chunks)
    for i in range(1, num_chunks + 1):
        if i < num_chunks:
            yield collection.find(query, projection=projection)[chunks[i - 1]:chunks[i]]
        else:
            yield collection.find(query, projection=projection)[chunks[i - 1]:chunks.stop]
So, for example, you first create an iterator like this:
mess_chunk_iter = iterate_by_chunks(db_local.conversation_messages, 200, 0, query={}, projection=projection)
and then iterate it by chunks:
chunk_n = 0
total_docs = 0
for docs in mess_chunk_iter:
    chunk_n = chunk_n + 1
    chunk_len = 0
    for d in docs:
        chunk_len = chunk_len + 1
        total_docs = total_docs + 1
    print(f'chunk #: {chunk_n}, chunk_len: {chunk_len}')
print("total docs iterated: ", total_docs)
chunk #: 1, chunk_len: 400
chunk #: 2, chunk_len: 400
chunk #: 3, chunk_len: 400
chunk #: 4, chunk_len: 400
chunk #: 5, chunk_len: 400
chunk #: 6, chunk_len: 400
chunk #: 7, chunk_len: 281
total docs iterated: 2681
I agree with Remon, but you mention batches of 1000, which his answer doesn't really cover. You can set a batch size on the cursor:
cursor.batch_size(1000)
You can also skip records, e.g.:
cursor.skip(4000)
Is this what you're looking for? This is effectively a pagination pattern. However, if you're just trying to avoid memory exhaustion then you don't really need to set batch size or skip.
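A rough sketch of that pagination pattern with pymongo (collection here stands for the pymongo collection from the question, and the page size is arbitrary):

page_size = 1000
page = 0
while True:
    # Each iteration asks the server for the next page only.
    docs = list(collection.find({}).skip(page * page_size).limit(page_size))
    if not docs:
        break
    for doc in docs:
        pass  # process each document here
    page += 1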
Use cursors. Cursors have a "batchSize" variable that controls how many documents are actually sent to the client per batch after doing a query. You don't have to touch this setting, though, since the default is fine and the complexity of invoking "getmore" commands is hidden from you in most drivers. I'm not familiar with pymongo, but it works like this:
cursor = db.col.find() // Get everything!
while(cursor.hasNext()) {
/* This will use the documents already fetched and if it runs out of documents in its local batch it will fetch another X of them from the server (where X is batchSize). */
document = cursor.next();
// Do your magic here
}
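For reference, a minimal pymongo equivalent of that shell loop (assuming db and a collection named col as above); batch_size only tunes how many documents come back per server round trip, and the driver fetches the next batch for you as you iterate:

cursor = db.col.find().batch_size(1000)  # get everything, 1000 docs per round trip
for document in cursor:
    # Do your magic here
    pass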
Here is a generic solution to iterate over any iterator or generator by batch:
def _as_batch(cursor, batch_size=50):
    # iterate over something (pymongo cursor, generator, ...) by batch.
    # Note: the last batch may contain less than batch_size elements.
    batch = []
    try:
        while True:
            for _ in range(batch_size):
                batch.append(next(cursor))
            yield batch
            batch = []
    except StopIteration as e:
        if len(batch):
            yield batch
This will work as long as the cursor defines a __next__ method (i.e. we can use next(cursor)). Thus, we can use it on a raw cursor or on transformed records.
Examples
Simple usage:
for batch in _as_batch(db['coll_name'].find()):
    # do stuff
More complex usage (useful for bulk updates for example):
def update_func(doc):
    # dummy transform function
    doc['y'] = doc['x'] + 1
    return doc

query = (update_func(doc) for doc in db['coll_name'].find())
for batch in _as_batch(query):
    # do stuff
Reimplementation of the count() function:
sum(map(len, _as_batch( db['coll_name'].find() )))
To create the initial connection, currently in Python 2, using pymongo:
host = 'localhost'
port = 27017
db_name = 'test_db'
collection_name = 'test'
To connect using MongoClient
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient(host=host, port=port)
# Make a query to the specific DB and Collection
dbh = client[db_name]
collection = dbh[collection_name]
So from here the proper answer.
I want to read by using chunks (in this case of size 1000).
chunksize = 1000
For example, we could decide how many chunks of size chunksize we want.
# Some variables to create the chunks
skips_variable = range(0, dbh[collection_name].find(query).count(), int(chunksize))
if len(skips_variable) <= 1:
    skips_variable = [0, len(skips_variable)]
Then we can retrieve each chunk.
for i in range(1, len(skips_variable)):
    # Expand the cursor and retrieve data
    data_from_chunk = dbh[collection_name].find(query)[skips_variable[i - 1]:skips_variable[i]]
Where query in this case is query = {}.
Here I use similar ideas to create dataframes from MongoDB.
Here I use something similar to write to MongoDB in chunks.
I hope it helps.
