MongoDB cursor iteration slow - Python

I am querying data from MongoDB in chunks; the collection has 100M+ records in total. I then want to process those records within a function.
My query:
cursor = collection.find({'facial_task':False}).sort("_id", -1).skip(1000000).limit(10000)
When I iterate over the cursor it takes too much time and frequently gets stuck, even if I limit the number of results to 10. I also tried retrieving the results with a batch size, but looping over them still takes too much time.
My loop is like this:
for dd in cursor:
    ab = threading.Thread(target=insert_func, args=(dd,))
    ab.start()
    main_threads.append(ab)
    if len(main_threads) >= 5000:
        print("****Joining Main Thread***")
        for ii in main_threads:
            ii.join()
        main_threads = []

If the slowdown occurs from the very first iteration, then it most likely has to do with the cursor itself.
A MongoDB cursor is not a 'cached' list of the entire result set; it queries the database one batch at a time as you iterate over it.
You can try casting that cursor into a list and see if the iteration then shows a significant speed-up.
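A minimal sketch of that check, assuming the same cursor and insert_func as in the question; materialising the list separates the fetch cost from the per-document processing cost:
import time

t0 = time.time()
docs = list(cursor)  # forces every batch to be fetched from the server up front
print("fetch took", time.time() - t0, "seconds for", len(docs), "documents")

t0 = time.time()
for dd in docs:  # pure in-memory iteration, no further round-trips
    insert_func(dd)
print("processing took", time.time() - t0, "seconds")
If the fetch itself is what is slow, the time is going into the query (the large skip() is a common cause); if the loop body is slow, the cursor is not to blame.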

Related

Pymongo - Efficient way to loop through cursor having large data

What is the most efficient way to loop through the cursor object in Pymongo?
Currently, this is what I'm doing:
list(my_db.my_collection.find())
Which converts the cursor to a list object so that I can iterate over each element. This works fine if the find() query returns a small amount of data. However, when I scale the DB to return 10 million documents, converting the cursor to a list takes forever. Instead of converting the DB result (cursor) to a list, I tried converting the cursor to a DataFrame as below:
pd.DataFrame(my_db.my_collection.find())
which didn't give me any performance improvement.
What is the most efficient way to loop through a cursor object in python?
I haven't used pymongo to date.
But one thing I can definitely say: if you're fetching a huge amount of data by doing
list(my_db.my_collection.find())
then you should use a generator instead.
Using list here would increase memory usage significantly and may raise a MemoryError if it grows beyond the permitted value.
def get_data():
    for doc in my_db.my_collection.find():
        yield doc
Try using such methods, which will not use much memory.
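A usage sketch for that generator, with process_doc standing in for whatever per-document work you need (a hypothetical name):
for doc in get_data():
    process_doc(doc)  # documents arrive one at a time, never all held in memory at once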
The cursor object pymongo gives you is already lazily loading objects, no need to do anything else.
for doc in my_db.my_collection.find():
    # process doc
The method find() returns a Cursor which you can iterate:
for match in my_db.my_collection.find():
    # do something
    pass
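Since the cursor is lazy, round-trips to the server can dominate when the per-document work is cheap. A minimal sketch of controlling the batch size while still iterating lazily, using pymongo's standard Cursor.batch_size():
for doc in my_db.my_collection.find().batch_size(1000):
    # each round-trip now returns up to 1000 documents,
    # but only the current batch is held in memory
    pass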

Insert slowing down over time as database grows (no index)

I'm trying to create a (single) database file (that will be regularly updated, occasionally partially recreated, and occasionally queried) that is over 200 GB, so relatively large in my view. There are about 16k tables, ranging in size from a few KB to ~1 GB; they have 2-21 columns. The longest table has nearly 15 million rows.
The script I wrote goes through the input files one by one, doing a bunch of processing and regex to get usable data. It regularly sends a batch (0.5-1 GB) to be written to sqlite3, with one separate executemany statement for each table that data is inserted into. There are no commit or create table statements etc. in between these execute statements, so I believe it all comes under a single transaction.
Initially the script worked fast enough for my purposes, but it slows down dramatically over time as it nears completion, which is unfortunate given that I will need to slow it down further to keep memory use manageable on my laptop.
I did some quick benchmarking comparing inserting identical sample data into an empty database versus inserting into the 200 GB database. The latter test was ~3 times slower to execute the insert statements (the relative slowdown of the commit was even worse, but in absolute terms it's insignificant); aside from that there was no significant difference between the two.
When I researched this topic before, it mostly returned results about indexes slowing down inserts on large tables. The answer seemed to be that inserts on tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't make any indexes. I even double-checked and ran a check for indexes, which, if I have it right, should exclude that as a cause:
c.execute('SELECT name FROM sqlite_master WHERE type="index"')
print(c.fetchone())  # returned None
The other issue that cropped up was transactions, but I don't see how that could be a problem when it's the same script writing the same data; only the size of the target database differs.
abbreviated relevant code:
# process pre-defined objects, files, retrieve data in batch -
# all fine, no slowdown on full database
conn = sqlite3.connect(db_path)
c = conn.cursor()
# list of tuples: (table name "subject-item", subject, item)
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]
# creates a new table if needed for new subjects/items
targeted_create_tables = functools.partial(create_tables, c)
list(map(targeted_create_tables, table_breakdown))  # no slowdown on full database
# inserts data for a specific subject-item combo
targeted_insert_data = functools.partial(insert_data, c)
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower
conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()
and relevant insert function:
def insert_data(c, tup):
    global collector   # list of tuples of data for a combo of a subject and item
    global sql_length  # pre-defined dictionary translating the item into the
                       # right-length (?,?,?...) string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if not (subject_data == []):
        statement = '''INSERT INTO "{0}" VALUES {1}'''.format(tbl_name, sql_length[item])
        # massively slower; about 80% of inserts > twice slower
        c.executemany(statement, subject_data)
        subject_data = []
EDIT: table create function, per CL's request. I'm aware that this is inefficient (it takes roughly the same time to check whether a table name exists this way as to create the table), but it's not significant to the slowdown.
def create_tables(c, tup):
    global collector
    global title  # list of column schemes to match to items
    tbl_name = tup[0]
    bm_unit = tup[1]
    item = tup[2]
    subject_data = collector[bm_unit][item]
    if not (subject_data == []):
        c.execute('SELECT * FROM sqlite_master WHERE name = "{0}" and type="table"'.format(tbl_name))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
There are, all told, 65 different column schemes in the title dict, but this is an example of what they look like:
title.append(('WINDFOR','(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'))
Anyone got any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic; I've come into this topic area completely cold.
Appending rows to the end of a table is the fastest way to insert data (and since you are not playing games with the rowid, you are indeed appending to the end).
However, you are not using a single table but 16k tables, so the overhead of managing the table structures is multiplied.
Try increasing the cache size. But the most promising change would be to use fewer tables.
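A minimal sketch of the cache-size suggestion, assuming the same sqlite3 connection as in the question; the value is illustrative and needs tuning to the machine:
conn = sqlite3.connect(db_path)
# negative value = cache size in KiB; this asks for roughly 1 GiB of page cache
conn.execute("PRAGMA cache_size = -1000000")
c = conn.cursor()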
It makes sense to me that the time to INSERT increases as a function of the database size. The operating system itself may be slower when opening/closing/writing to larger files. An index could slow things down much more of course, but that doesn't mean that there would be no slowdown at all without an index.

MySQL Date Range Query Loop in Python

I'm querying a few date ranges in a loop:
con = MySQLdb.connect(host='xxx.com', port=3306, user='user', passwd='pass', db='db')

intervals = 10
precision = 1
dateChunks = list()
for i in range(0, intervals, precision):
    results = load_data_window(i, precision)
    dateChunks.append(results)

def load_data_window(start, precision):
    length = start + precision  # <-- time period
    limit = 20000
    cur = con.cursor()
    sql = "SELECT data FROM table WHERE date < DATE_SUB(NOW(), INTERVAL %s HOUR) AND date > DATE_SUB(NOW(), INTERVAL %s HOUR)"
    cur.execute(sql, (start, length))
    results = cur.fetchall()
It works lightning fast for the first few loops, sometimes even all of them, but bogs down significantly from time to time. There are other actions happening on the database from different places... Is there something I can do to ensure my query has priority? Something like a transaction?
######### EDIT
I did notice that if I move the con = MySQLdb... inside the load_data_window function, I get really fast results and then bogging down, whereas if I keep the con = MySQLdb... outside, it is consistently slower...
Does your date column have an index on it? Such an index may help performance a great deal. That's especially true if your queries progressively slow down: MySQL may be scanning through your table, further and further for each successive hour, looking for the first row to return.
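A minimal sketch of adding that index, reusing the connection from the question; the table and column names are the placeholders used there, not real schema names:
cur = con.cursor()
# one-time operation; lets the date-range WHERE clause seek to the right rows instead of scanning
cur.execute("CREATE INDEX idx_date ON `table` (date)")
con.commit()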
It looks like you're fetching an hour's worth of data with each request, and running fetchall() to slurp all that data into RAM in your python program at once. Is that what you want? If you have an hour with lots of results in it you may hammer the RAM in your python program, forcing garbage collection and even operating-system thrashing.
Can you, instead, cause your python program to iterate over the rows in the result set? This will most likely take a lot less RAM. You can do something like this: http://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-fetchone.html
cur.execute(sql, (start, length))
for row in cur:
    print(row)
This sort of strategy would also allow you to process your data efficiently in chunks bigger than an hour if that would help your application. In many, but not all, cases fewer queries means better queries.
Do some of your hours have hugely more rows in them? This could be causing the unpredictable performance.
Finally, you have an off-by-one bug in your query. It should say this...
WHERE ... AND date >= DATE_SUB(NOW(), INTERVAL %s HOUR)
(Notice the >= replacing your >)

Getting results from the database one by one

I'm writing a small program that queries for results from a database (single table). I'm using Python 3.3, SQLAlchemy and a PostgreSQL database.
result = db_session.query(Data).all()
progress = 0
for row in result:
    update_progress_bar(progress, len(result))
    do_something_with_data(row)
    progress += 1
The variable 'result' will contain a few thousand rows, and processing the data takes some time. That is why I introduced a simple progress bar to give an idea of how much time it will take.
The problem is that 30% of the total time is spent querying the database (the first line). So when I start the program I get a big delay before the progress bar starts moving. In addition, I don't need to keep all the results in memory; I can process them separately.
Is there any way to modify the above program to get rows one by one until all rows are received, without loading everything into memory? In addition, I want to monitor the progress of querying and processing the data.
You need to just loop over the query without calling .all(), and call .yield_per() to set a batch size:
for row in db_session.query(Data).yield_per(10):
    do_something_with_data(row)
.all() indeed turns the whole result set into a list first, causing a delay if the resultset is large. Iterating over the query directly after setting .yield_per() instead fetches results as needed, provided the database API supports it.
If you wanted to know up-front how many rows will be returned, call .count() first:
result = db_session.query(Data)
count = result.count()
progress = 0
for row in result.yield_per(10):
    update_progress_bar(progress, count)
    do_something_with_data(row)
    progress += 1
.count() asks the database to give us an item count first.
Your database could still be pre-caching the result rows, leading to a start-up delay, even when using .yield_per(). In that case you'll need to use a windowed query to break up your query into blocks based on the range of values in one of the columns. Whether or not this will work depends on your exact table layout.
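A minimal sketch of such a windowed query, assuming Data has an integer primary-key column Data.id to window on; the real column and window size depend on your schema:
window = 1000
last_id = 0
while True:
    block = (db_session.query(Data)
             .filter(Data.id > last_id)
             .order_by(Data.id)
             .limit(window)
             .all())
    if not block:
        break
    for row in block:
        do_something_with_data(row)
    last_id = block[-1].id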

Python is slow when iterating over a large list

I am currently selecting a large list of rows from a database using pyodbc. The result is then copied to a large list, and then I am trying to iterate over the list. Before I abandon Python and try to create this in C#, I wanted to know if there was something I was doing wrong.
clientItemsCursor.execute("Select ids from largetable where year = ?", year)
allIDRows = clientItemsCursor.fetchall()  # takes maybe 8 seconds
for clientItemRow in allIDRows:
    aID = str(clientItemRow[0])
    # Do something with str -- removed while trying to determine what was slow
    count = count + 1
Some more information:
The for loop is currently running at about 5 loops per second, and that seems insanely slow to me.
The total rows selected is ~489,000.
The machine it's running on has lots of RAM and CPU. It seems to only use one or two cores, and RAM usage is 1.72 GB of 4 GB.
Can anyone tell me what's wrong? Do scripts just run this slow?
Thanks
This should not be slow with Python native lists - but maybe ODBC's driver is returning a "lazy" object that tries to be smart but just gets slow. Try just doing
allIDRows = list(clientItemsCursor.fetchall())
in your code and post further benchmarks.
(Python lists can get slow if you start inserting things in the middle, but just iterating over a large list should be fast.)
It's probably slow because you load all the results into memory first and then iterate over a list. Try iterating over the cursor instead.
And no, scripts shouldn't be that slow.
clientItemsCursor.execute("Select ids from largetable where year = ?", year)
for clientItemRow in clientItemsCursor:
    aID = str(clientItemRow[0])
    count = count + 1
More investigation is needed here... consider the following script:
bigList = range(500000)
doSomething = ""
count = 0
arrayList = [[x] for x in bigList]  # takes a few seconds
for x in arrayList:
    doSomething += str(x[0])
    count += 1
This is pretty much the same as your script, minus the database stuff, and takes a few seconds to run on my not-terribly-fast machine.
When you connect to your database directly (I mean you get an SQL prompt), how many seconds does this query take?
When the query ends, you get a message like this:
NNNNN rows in set (0.01 sec)
So, if that time is large even when the query runs "natively", you may have to create an index on that table.
This is slow because you are
Getting all the results
Allocating memory and assigning the values to that memory to create the list allIDRows
Iterating over that list and counting.
If execute gives you back a cursor, then use the cursor to its advantage and start counting as you get results back, saving time on the memory allocation.
clientItemsCursor.execute("Select ids from largetable where year = ?", year)
for clientItemRow in clientItemsCursor:
    count += 1
Other hints:
create an index on year
use 'select count(*) from ...' to get the count for the year; this will probably be optimised on the db (see the sketch below)
remove the aID line if it's not needed; it converts the first item of the row to a string even though it's not used
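A minimal sketch of that count(*) hint, assuming the same pyodbc cursor and table as in the question:
clientItemsCursor.execute("Select count(*) from largetable where year = ?", year)
count = clientItemsCursor.fetchone()[0]  # the database does the counting; only one number crosses the wire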
