It takes a long time to finish inserting 100,000 (user, password) tuples.
def insertdata(db, name, val):
    i = db.insert()
    i.execute(user=name, password=val)

#-----main-------
tuplelist = readfile("C:/py/tst.txt")  # parsing the file is really fast
mydb = initdatabase()
for ele in tuplelist:
    insertdata(mydb, ele[0], ele[1])
Which function takes more time? Is there a way to find the bottleneck in Python?
Can I avoid the overhead by caching the rows and committing later?
Have the DBAPI handle iterating through the parameters: pass the whole list of parameter dictionaries to a single execute() call so the driver can use executemany() under the hood.
def insertdata(db, tuplelist):
    i = db.insert()
    i.execute([dict(user=elem[0], password=elem[1]) for elem in tuplelist])

#-----main-------
tuplelist = readfile("C:/py/tst.txt")  # parsing the file is really fast
mydb = initdatabase()
insertdata(mydb, tuplelist)
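As for finding the bottleneck: the standard library's cProfile module will show where the time actually goes. A minimal sketch, assuming the readfile, initdatabase and insertdata names from the question:

import cProfile
import pstats

def main():
    tuplelist = readfile("C:/py/tst.txt")
    mydb = initdatabase()
    insertdata(mydb, tuplelist)

# Profile the whole run and dump the stats to a file.
cProfile.run("main()", "insert.prof")
# Show the ten entries with the largest cumulative time.
pstats.Stats("insert.prof").sort_stats("cumulative").print_stats(10)

If most of the cumulative time sits inside execute, batching the statements (or wrapping the loop in a single transaction and committing once at the end) is the right lever.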
I have a MongoDB database (media_mongo) with a collection main_hikari that holds a lot of data. I'm trying to write a function that creates a .csv file from this data as quickly as possible. I'm using the code below, but it takes too long and uses too much CPU:
import pandas as pd
import pymongo
from pymongo import MongoClient

mongo_client = MongoClient('mongodb://admin:password@localhost:27017')
db = mongo_client.media_mongo

def download_file(down_file_name="hikari"):
    docs = pd.DataFrame(columns=[])
    if down_file_name == "kokyaku":
        col = db.main_kokyaku
    if down_file_name == "hikari":
        col = db.main_hikari
    if down_file_name == "hikanshou":
        col = db.main_hikanshou
    cursor = col.find()
    mongo_docs = list(cursor)
    for num, doc in enumerate(mongo_docs):
        doc["_id"] = str(doc["_id"])
        doc_id = doc["_id"]
        series_obj = pd.Series(doc, name=doc_id)
        docs = docs.append(series_obj)
    csv_export = docs.to_csv("file.csv", sep=",")

download_file()
My database has data in this format (sorry for the Japanese :D):
_id:"ObjectId("5e0544c4f4eefce9ee9b5a8b")"
事業者受付番号:"data1"
開通区分/処理区分:"data2"
開通ST/処理ST:"data3"
申込日,顧客名:"data4"
郵便番号:"data5"
住所1:"data6"
住所2:"data7"
連絡先番号:"data8"
契約者電話番号:"data9"
And about 150000 entries like this
If you have a lot of data as you indicate, then this line is going to hurt you:
mongo_docs = list(cursor)
It basically means read the entire collection into a client-side array at once. This will create a huge memory high water mark.
Better to use mongoexport as noted above or walk the cursor yourself instead of having list() slurp the whole thing, e.g.:
cursor = col.find()
for doc in cursor:
    # read docs one at a time
or to be very pythonic about it:
for doc in col.find():  # or find(expression of your choice)
    # read docs one at a time
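If the end goal is just a CSV, you don't need pandas at all: stream the cursor straight into csv.DictWriter so only one document is in memory at a time. A rough sketch along those lines (it assumes all documents share the same fields, as the sample data suggests, and reuses the connection details from the question):

import csv
from pymongo import MongoClient

db = MongoClient('mongodb://admin:password@localhost:27017').media_mongo

def download_file_streaming(down_file_name="hikari", out_path="file.csv"):
    col = {"kokyaku": db.main_kokyaku,
           "hikari": db.main_hikari,
           "hikanshou": db.main_hikanshou}[down_file_name]
    cursor = col.find()
    first = next(cursor, None)
    if first is None:
        return  # empty collection, nothing to write
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        # Use the first document's keys as the CSV header row.
        writer = csv.DictWriter(f, fieldnames=list(first.keys()))
        writer.writeheader()
        first["_id"] = str(first["_id"])
        writer.writerow(first)
        # Write the remaining documents one at a time, straight from the cursor.
        for doc in cursor:
            doc["_id"] = str(doc["_id"])
            writer.writerow(doc)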
I'm trying to count the number of docs in a Firestore collection with Python. When I use db.collection('xxxx').stream() I get the following error:
503 The datastore operation timed out, or the data was temporarily unavailable.
about halfway through. It was working fine until then. Here is the code:
docs = db.collection(u'theDatabase').stream()

count = 0
for doc in docs:
    count += 1
print(count)
Every time I get a 503 error at about 73,000 records. Does anyone know how to overcome the 20 second timeout?
Although Juan's answer works for basic counting, if you need more of the data from Firebase than just the id (a common use case being a full migration of the data that doesn't go through GCP), the recursive algorithm will eat your memory.
So I took Juan's code and transformed it into a standard iterative algorithm. Hope this helps someone.
limit = 1000  # Reduce this if it uses too much of your RAM

def stream_collection_loop(collection, count, cursor=None):
    while True:
        docs = []  # Very important. This frees the memory incurred in the recursion algorithm.
        if cursor:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').start_after(cursor).stream()]
        else:
            docs = [snapshot for snapshot in
                    collection.limit(limit).order_by('__name__').stream()]
        for doc in docs:
            print(doc.id)
            print(count)
            # The `doc` here is already a `DocumentSnapshot`, so you can already
            # call `to_dict` on it to get the whole document.
            process_data_and_log_errors_if_any(doc)
            count = count + 1
        if len(docs) == limit:
            cursor = docs[limit - 1]
            continue
        break

stream_collection_loop(db_v3.collection('collection'), 0)
Try using a recursive function to batch document retrievals and keep them under the timeout. Here's an example based on the delete_collections snippet:
from google.cloud import firestore

# Project ID is determined by the GCLOUD_PROJECT environment variable
db = firestore.Client()

def count_collection(coll_ref, count, cursor=None):
    if cursor is not None:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").start_after(cursor).stream()]
    else:
        docs = [snapshot.reference for snapshot
                in coll_ref.limit(1000).order_by("__name__").stream()]

    count = count + len(docs)

    if len(docs) == 1000:
        return count_collection(coll_ref, count, docs[999].get())
    else:
        print(count)

count_collection(db.collection('users'), 0)
Other answers have shown how to use pagination to solve the timeout issue.
I suggest using a generator in combination with pagination, which lets you process the documents the same way you would with query.stream().
Here is an example of a function that takes a Query and returns a generator, in the same way as the Query.stream() method.
from typing import Generator, Optional, Any

from google.cloud.firestore import Query, DocumentSnapshot

def paginate_query_stream(
    query: Query,
    order_by: str,
    cursor: Optional[DocumentSnapshot] = None,
    page_size: int = 10000,
) -> Generator[DocumentSnapshot, Any, None]:
    paged_query = query.order_by(order_by)
    document = cursor
    has_any = True
    while has_any:
        has_any = False
        if document:
            paged_query = paged_query.start_after(document)
        paged_query = paged_query.limit(page_size)
        for document in paged_query.stream():
            has_any = True
            yield document
Keep in mind that if your target collection grows constantly, you need to put an upper bound on the query in advance to prevent a potential infinite loop.
A usage example that counts documents:
from google.cloud.firestore import Query

docs = db.collection(u'theDatabase')
# Query without conditions, get all documents.
query = Query(docs)

count = 0
for doc in paginate_query_stream(query, order_by='__name__'):
    count += 1
print(count)
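If the collection keeps growing while you iterate, one way to put that upper bound in place is to fix a cutoff before starting. This is only a sketch: the created field is hypothetical and would have to be whatever timestamp (or other monotonically increasing) field your documents actually carry:

from datetime import datetime, timezone

cutoff = datetime.now(timezone.utc)
docs = db.collection(u'theDatabase')
# Only documents created before the cutoff are visited, so writes that
# happen during the loop cannot keep the pagination going forever.
bounded_query = docs.where('created', '<', cutoff)

count = 0
for doc in paginate_query_stream(bounded_query, order_by='created'):
    count += 1
print(count)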
All documents in my MongoDB collection have the same fields. My goal is to load them into Python as a pandas.DataFrame or dask.DataFrame.
I'd like to speed up the loading procedure by parallelizing it. My plan is to spawn several processes or threads. Each process would load a chunk of the collection, and these chunks would then be merged together.
How do I do this correctly with MongoDB?
I have tried a similar approach with PostgreSQL. My initial idea was to use SKIP and LIMIT in the SQL queries. It failed, because each cursor, opened for each query, started reading the table from the beginning and simply skipped the specified number of rows. So I had to create an additional column containing record numbers and specify ranges of those numbers in the queries.
In contrast, MongoDB assigns a unique ObjectId to each document. However, I've found that it is impossible to subtract one ObjectId from another; they can only be compared with ordering operations: less than, greater than, and equal.
Also, pymongo returns a cursor object that supports indexing and has some methods that seem useful for my task, like count and limit.
The MongoDB connector for Spark accomplishes this task somehow. Unfortunately, I'm not familiar with Scala, so it's hard for me to find out how they do it.
So, what is the correct way to load data from MongoDB into Python in parallel?
Up to now, I've come to the following solution:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
# import other modules.

collection = get_mongo_collection()
cursor = collection.find({})

def process_document(in_doc):
    out_doc = ...  # process doc keys and values
    return pd.DataFrame(out_doc)

df = dd.from_delayed((delayed(process_document)(d) for d in cursor))
However, it looks like dask.dataframe.from_delayed internally creates a list from the passed generator, effectively loading the whole collection in a single thread.
Update. I've found in the docs that the skip method of pymongo.Cursor also starts from the beginning of the collection, just as in PostgreSQL. The same page suggests implementing pagination logic in the application. The solutions I've found so far use a sorted _id for this. However, they also store the last seen _id, which implies that they also work in a single thread.
Update 2. I've found the code of the partitioner in the official MongoDB Spark connector: https://github.com/mongodb/mongo-spark/blob/7c76ed1821f70ef2259f8822d812b9c53b6f2b98/src/main/scala/com/mongodb/spark/rdd/partitioner/MongoPaginationPartitioner.scala#L32
It looks like this partitioner initially reads the key field from all documents in the collection and calculates ranges of values.
Update 3: my incomplete solution.
It doesn't work; it gets an exception from pymongo, because dask seems to treat the Collection object incorrectly:
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/dask/delayed.pyc in <genexpr>(***failed resolving arguments***)
81 return expr, {}
82 if isinstance(expr, (Iterator, list, tuple, set)):
---> 83 args, dasks = unzip((to_task_dask(e) for e in expr), 2)
84 args = list(args)
85 dsk = sharedict.merge(*dasks)
/home/user/.conda/envs/MBA/lib/python2.7/site-packages/pymongo/collection.pyc in __next__(self)
2342
2343 def __next__(self):
-> 2344 raise TypeError("'Collection' object is not iterable")
2345
2346 next = __next__
TypeError: 'Collection' object is not iterable
What raises the exception:
def process_document(in_doc, other_arg):
    # custom processing of incoming records
    return out_doc

def compute_id_ranges(collection, query, partition_size=50):
    cur = collection.find(query, {'_id': 1}).sort('_id', pymongo.ASCENDING)
    id_ranges = [cur[0]['_id']]
    count = 1
    for r in cur:
        count += 1
        if count > partition_size:
            id_ranges.append(r['_id'])
            count = 0
    id_ranges.append(r['_id'])
    return zip(id_ranges[:len(id_ranges) - 1], id_ranges[1:])

def load_chunk(id_pair, collection, query={}, projection=None):
    q = query
    q.update({"_id": {"$gte": id_pair[0], "$lt": id_pair[1]}})
    cur = collection.find(q, projection)
    return pd.DataFrame([process_document(d, other_arg) for d in cur])

def parallel_load(*args, **kwargs):
    collection = kwargs['collection']
    query = kwargs.get('query', {})
    projection = kwargs.get('projection', None)
    id_ranges = compute_id_ranges(collection, query)
    dfs = [delayed(load_chunk)(ir, collection, query, projection) for ir in id_ranges]
    df = dd.from_delayed(dfs)
    return df

collection = connect_to_mongo_and_return_collection_object(credentials)
# df = parallel_load(collection=collection)
id_ranges = compute_id_ranges(collection)
dedf = delayed(load_chunk)(id_ranges[0], collection)
load_chunk runs perfectly when called directly. However, calling delayed(load_chunk)( blah-blah-blah ) fails with the exception mentioned above.
I was looking into pymongo parallelization and this is what worked for me. It took my humble gaming laptop nearly 100 minutes to process my MongoDB collection of 40 million documents. The CPU was 100% utilised; I had to turn on the AC :)
I used the skip and limit functions to split the database, then assigned the batches to processes. The code is written for Python 3:
import multiprocessing
from pymongo import MongoClient

def your_function(something):
    <...>
    return result

def process_cursor(skip_n, limit_n):
    print('Starting process', skip_n // limit_n, '...')
    collection = MongoClient().<db_name>.<collection_name>
    cursor = collection.find({}).skip(skip_n).limit(limit_n)
    for doc in cursor:
        <do your magic>
        # for example:
        result = your_function(doc['your_field'])  # do some processing on each document
        # update that document by adding the result into a new field
        collection.update_one({'_id': doc['_id']}, {'$set': {'<new_field_eg>': result}})
    print('Completed process', skip_n // limit_n, '...')

if __name__ == '__main__':
    n_cores = 7                 # number of splits (logical cores of the CPU - 1)
    collection_size = 40126904  # your collection size
    batch_size = round(collection_size / n_cores + 0.5)
    skips = range(0, n_cores * batch_size, batch_size)

    processes = [multiprocessing.Process(target=process_cursor, args=(skip_n, batch_size))
                 for skip_n in skips]

    for process in processes:
        process.start()

    for process in processes:
        process.join()
The last split will have a limit larger than the number of remaining documents, but that won't raise an error.
I think dask-mongo will do the work here. You can install it with pip or conda, and you can find some example notebooks in the repo.
dask-mongo reads the data you have in MongoDB as a Dask bag, and you can then go from a Dask bag to a Dask DataFrame with df = b.to_dataframe(), where b is the bag you read from Mongo using dask_mongo.read_mongo.
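A rough sketch of what that looks like, based on the dask-mongo examples (the connection string, database and collection names are placeholders, and the read_mongo keyword arguments may differ between versions, so check the repo's notebooks for the exact signature):

import dask_mongo

# Read the collection as a Dask bag of documents (dicts), in chunks.
b = dask_mongo.read_mongo(
    connection_kwargs={"host": "mongodb://localhost:27017"},
    database="my_database",
    collection="my_collection",
    chunksize=1000,
)

# Convert the bag of dicts into a Dask DataFrame.
df = b.to_dataframe()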
"Read the mans, thery're rulez" :)
pymongo.Collection has method parallel_scan that returns a list of cursors.
UPDATE. This function can do the job, if the collection does not change too often, and queries are always the same (my case). One could just store query results in different collections and run parallel scans.
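For what it's worth, a minimal sketch of what that looked like under PyMongo 3.x (parallel_scan relied on a server command that was removed in MongoDB 4.2, and the method is gone in PyMongo 4, so this no longer works on recent versions):

from multiprocessing.pool import ThreadPool

def drain(cursor):
    # Each cursor yields a disjoint subset of the collection's documents.
    return list(cursor)

# Ask the server for up to 4 independent cursors over the collection.
cursors = collection.parallel_scan(4)
with ThreadPool(len(cursors)) as pool:
    chunks = pool.map(drain, cursors)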
I'm trying to load nodes (about 400) and relationships (about 800) from a Neo4j DB to create a force-directed graph using D3. This is my get function (I'm using Tornado):
def get(self):
    query_string = "START r=rel(*) RETURN r"
    query = neo4j.CypherQuery(graph_db, query_string)
    results = query.execute().data

    start = set([r[0].start_node for r in results])
    end = set([r[0].end_node for r in results])
    nodes_to_keep = list(start.union(end))

    nodes = []
    for n in nodes_to_keep:
        nodes.append({
            "name": n['name'].encode('utf-8'),
            "group": n['type'].encode('utf-8'),
            "description": n['description'].encode('utf-8'),
            "node": int(n['node_id'])})

    # links
    links = []
    for r in results:
        links.append({"source": int(r[0].start_node['node_id']),
                      "target": int(r[0].end_node['node_id'])})

    self.render(
        "index.html",
        page_title='My Page',
        page_heading='Sweet D3 Force Diagram',
        nodes=nodes,
        links=links,
    )
I'm thinking the expensive part is the for n in nodes_to_keep: and for r in results: loops, since every time I read a property that's a round trip to the server. Right?
What's the best way to accomplish this task?
The reason the above process takes so long is that every time I ask for a node property, I make a round trip to the server to fetch it from the database. I was able to drastically reduce the time this takes simply by modifying the Cypher query.
For instance, to get all nodes with relationships I used this query:
query_string = """MATCH (n)-[r]-(m)
RETURN n, n.node_id, n.name, n.type, n.description, m.node_id, m.name, m.type, m.description"""
query = neo4j.CypherQuery(graph_db, query_string)
results = query.execute().data
The results contain the information I need, so I just loop through the results to get the properties.
The takeaway is that you need to write your queries such that they get you the info you need the first time around.
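For completeness, a sketch of that loop (it assumes the same result indexing the question already uses, i.e. r[i] is the i-th RETURN column of the query above):

nodes_by_id = {}
links = []
for r in results:
    # Columns: n, n.node_id, n.name, n.type, n.description,
    #          m.node_id, m.name, m.type, m.description
    # Note: the undirected pattern (n)-[r]-(m) returns each relationship
    # twice, once per direction.
    source_id, target_id = int(r[1]), int(r[5])
    nodes_by_id[source_id] = {"name": r[2], "group": r[3],
                              "description": r[4], "node": source_id}
    nodes_by_id[target_id] = {"name": r[6], "group": r[7],
                              "description": r[8], "node": target_id}
    links.append({"source": source_id, "target": target_id})
nodes = list(nodes_by_id.values())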
I'm trying to extract a huge amount of data from a database and write it to a CSV file, and I'm trying to find the fastest way to do this. I found that running writerows on the result of a fetchall was 40% slower than the code below.
with open(filename, 'a') as f:
    writer = csv.writer(f, delimiter='\t')
    cursor.execute("SELECT * FROM table")
    writer.writerow([i[0] for i in cursor.description])

    count = 0
    builder = []
    row = cursor.fetchone()
    DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
    while row:
        count += 1
        # Add row with delimiters to builder
        builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
        if count == 1000:
            count = 0
            f.write(''.join(builder))
            builder[:] = []
        row = cursor.fetchone()
    f.write(''.join(builder))
Edit: The database I'm using is unique to the small company I work for, so unfortunately I can't provide much information on that front. I'm using jpype to connect to the database, since the only way to connect is via a JDBC driver. I'm running CPython 2.7.5; I would love to use PyPy, but it doesn't work with pandas.
Since I'm extracting such a large number of rows, I'm hesitant to use fetchall for fear of running out of memory. row has comparable performance and is much easier on the eyes, so I think I'll use that. Thanks a bunch!
With the little you've given us to go on, it's hard to be more specific, but…
I've wrapped your code up as a function, and written three alternative versions:
def row():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        for row in cursor:
            writer.writerow(row)

def rows():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor)

def rowsall():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        writer.writerows(cursor.fetchall())
Notice that the last one is the one you say you tried.
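For reference, manual in the timings below is just the question's loop wrapped in a function of the same shape (reconstructed here so the timing table reads on its own):

def manual():
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        count = 0
        builder = []
        row = cursor.fetchone()
        DELIMITERS = ['\t'] * (len(row) - 1) + ['\n']
        while row:
            count += 1
            builder += [str(item) for pair in zip(row, DELIMITERS) for item in pair]
            if count == 1000:
                count = 0
                f.write(''.join(builder))
                builder[:] = []
            row = cursor.fetchone()
        f.write(''.join(builder))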
Now, I wrote this test driver:
def randomname():
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(30))

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR)')
db.executemany('INSERT INTO mytable (name) VALUES (?)',
               [[randomname()] for _ in range(10000)])

filename = 'db.csv'

for f in manual, row, rows, rowsall:
    t = timeit.timeit(f, number=1)
    print('{:<10} {}'.format(f.__name__, t))
And here are the results:
manual 0.055549702141433954
row 0.03852885402739048
rows 0.03992213006131351
rowsall 0.02850699401460588
So, your code takes nearly twice as long as calling fetchall and writerows in my test!
When I repeat a similar test with other databases, however, rowsall is anywhere from 20% faster to 15% slower than manual (never 40% slower, but as much as 15%)… but row or rows is always significantly faster than manual.
I think the explanation is that your custom code is significantly slower than csv.writerows, but that in some databases, using fetchall instead of fetchone (or just iterating the cursor) slows things down significantly. The reason this isn't true with an in-memory sqlite3 database is that fetchone is doing all of the same work as fetchall and then feeding you the list one at a time; with a remote database, fetchone may do anything from fetch all the lines, to fetching a buffer at a time, to fetching a row at a time, making it potentially much slower or faster than fetchall, depending on your data.
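If memory is what's keeping you away from fetchall, a middle ground worth trying is fetchmany, which is part of the standard DB-API: you keep writerows' speed but only hold a bounded batch in memory at a time. A sketch in the same shape as the functions above:

def rowsmany(batch_size=1000):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        cursor = db.execute("SELECT * FROM mytable")
        writer.writerow([i[0] for i in cursor.description])
        while True:
            # Pull at most batch_size rows per round trip.
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)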
But for a really useful explanation, you'd have to tell us exactly which database and library you're using (and which Python version—CPython 3.3.2's csv module seems to be a lot faster than CPython 2.7.5's, and PyPy 2.1/2.7.2 seems to be faster than CPython 2.7.5 as well, but then either one also might run your code faster too…) and so on.