I'm writing a small program that queries results from a database (a single table). I'm using Python 3.3, SQLAlchemy and a PostgreSQL database.
result = db_session.query(Data).all()
progress = 0
for row in result:
    update_progress_bar(progress, len(result))
    do_something_with_data(row)
    progress += 1
The variable 'result' will contain a few thousand rows, and processing the data takes some time. This is why I introduced a simple progress bar to give an idea of how much time it will take.
The problem is that 30% of the total time is spent querying the database (the first line). So when I start the program I get a big delay before my progress bar starts moving. In addition, I don't need to keep all the results in memory; I can process them separately.
Is there any way to modify the above program to get rows one by one until all rows are received, without loading everything into memory? In addition, I want to monitor the progress of querying and processing the data.
You just need to loop over the query without calling .all(), and call .yield_per() to set a batch size:
for row in db_session.query(Data).yield_per(10):
    do_something_with_data(row)
.all() indeed turns the whole result set into a list first, causing a delay if the resultset is large. Iterating over the query directly after setting .yield_per() instead fetches results as needed, provided the database API supports it.
If you wanted to know up-front how many rows will be returned, call .count() first:
result = db_session.query(Data)
count = result.count()
progress = 0
for row in result.yield_per(10):
    update_progress_bar(progress, count)
    do_something_with_data(row)
    progress += 1
.count() asks the database to give us an item count first.
Your database could still be pre-caching the result rows, leading to a start-up delay, even when using .yield_per(). In that case you'll need to use a windowed query to break up your query into blocks based on the range of values in one of the columns. Whether or not this will work depends on your exact table layout.
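A rough sketch of that idea, assuming Data has an integer primary key column id (an assumption about your schema; this is a keyset-style variant of the windowed-query approach):
window = 1000
last_id = 0
progress = 0
while True:
    rows = (db_session.query(Data)
            .filter(Data.id > last_id)
            .order_by(Data.id)
            .limit(window)
            .all())
    if not rows:
        break
    for row in rows:
        update_progress_bar(progress, count)  # count from .count() as above
        do_something_with_data(row)
        progress += 1
    last_id = rows[-1].id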
I'm making a Python app with mongoengine, where I have a MongoDB database of n users and each user holds n daily records. I have a list of n new records per user that I want to add to my db.
I want to check if a record for a certain date already exists for a user before adding a new record to that user.
What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that's an O(n^2) algorithm and took 5 solid seconds for 300 records, which is too long. Below is an abbreviated version of the code.
There has to be a better way to query, right? I tried accessing something like user.records.date but that throws a not-found error.
import mongoengine

# snippet here is abbreviated and does not run
# zone of interest is in conditional_insert()

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

class User(mongoengine.Document):
    # meta{}
    # account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date):  # I had to compare as strings
            # because Python kept converting the datetime obj to str
            return
    # if no record with a duplicate date was found, insert the new record
    save_record(user, new_record)

def save_record(user, new_record): pass

if __name__ == "__main__":
    lst_to_insert = []  # list of (user, record_to_insert)
    for user, record in lst_to_insert:  # O(n)
        conditional_insert(user, record)  # O(n)
    # and I have n lst_to_insert so in reality I'm currently at O(n^3)
Hi everyone (and future me who will probably search for the same question 10 years later)
I optimized the code using the idea of a search tree. Instead of putting all records in a single list in User, I broke it down by year and month:
class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

class Year(mongoengine.EmbeddedDocument):
    # Month is defined first so the reference below resolves
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)

class User(mongoengine.Document):
    # meta{}
    # account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)
Because it's MongoDB, I can later partition by decades, heck even centuries, but by that point I don't think this code will be relevant.
I then group my data to insert by month into separate pandas dataframes and feed each dataframe separately. The data flow thus looks like:
0) get monthly df
1) loop through years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates
The algorithm to insert with a duplicate check is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter that much in my use case).
Mongo cannot conveniently offer you suitable indexes, very sad.
You frequently iterate over user.records.
If you can afford to allocate the memory for 300 users,
just iterate once and throw them into a set, which
offers O(1) constant time lookup, and
offers RAM speed rather than network latency.
When you save a user, also make note of it with cache.add((user_id, str(new_record.date))).
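A hedged sketch of that idea, reusing names from the question (user.id is assumed to be the document's primary key):
# build the set once; afterwards every duplicate check is an O(1)
# lookup instead of a scan over user.records
cache = set()
for user, _record in lst_to_insert:
    for r in user.records:
        cache.add((user.id, str(r.date)))

def conditional_insert(user, new_record):
    key = (user.id, str(new_record.date))
    if key in cache:
        return
    save_record(user, new_record)
    cache.add(key)  # note the save, as described above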
EDIT
If you can't afford the memory for all those (user_id, date) tuples,
then sure, a relational database JOIN is fair,
it's just an out-of-core merge of sorted records.
I have had good results with using sqlalchemy to
hit local sqlite (memory or file-backed), or
heavier databases like Postgres or MariaDB.
Bear in mind that relational databases offer lovely ACID guarantees,
but you're paying for those guarantees. In this application
it doesn't sound like you need such properties.
Something as simple as /usr/bin/sort could do an out-of-core
ordering operation that puts all of a user's current
records right next to his historic records,
letting you filter them appropriately.
Sleepycat is not an RDBMS, but its B-tree does offer
external sorting, sufficient for the problem at hand.
(And yes, one can do transactions with sleepycat,
but again this problem just needs some pretty pedestrian reporting.)
Bench it and see.
Without profiling data for a specific workload,
it's pretty hard to tell if any extra complexity
would be worth it.
Identify the true memory or CPU bottleneck,
and focus just on that.
You don't necessarily need ordering, as hashing would suffice,
given enough core.
Send those tuples to a Redis cache, and make it Redis's problem
to store them somewhere.
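If you go that route, a minimal sketch with the redis-py client might look like this (the key name and member encoding are made up for illustration):
import redis

r = redis.Redis()  # assumes a local Redis server

def already_seen(user_id, date):
    return r.sismember('seen_records', f'{user_id}:{date}')

def mark_seen(user_id, date):
    r.sadd('seen_records', f'{user_id}:{date}')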
I currently have a for loop which is finding and storing combinations in a list. The number of possible combinations is very large and I need to be able to access the combos.
Can I use an empty relational DB like SQLite to store my list on disk instead of using list = []?
Essentially what I am asking is whether there is a db equivalent to list = [] that I can use to store the combinations generated via my script?
Edit:
SQLite is not a must. Any database will work if it can accomplish my task.
Here is the exact function that is causing me so much trouble. Maybe there is a better solution in general.
Idea - could I insert the list into the database on each loop and then empty the list? Basically, create a list on each loop, send that list to PostgreSQL and then empty the list in Python to keep the RAM usage down?
from itertools import combinations

def permute(set1, set2):
    set1_combos = list(combinations(set1, 2))
    set2_combos = list(combinations(set2, 8))
    full_sets = []
    for i in set1_combos:
        for j in set2_combos:
            full_sets.append(i + j)
    return full_sets
Ok, a few ideas
My first thought was: why do you explode the combinations objects into lists? But of course, since we have two nested for loops, the iterator in the inner loop is consumed at the first iteration of the outer loop if it is not converted to a list.
However, you don't need to explode both objects: you can explode just the smaller one. For instance, if both our sets are made of 50 elements, the combinations of 2 elements are 1225 with a memsize (if the items are integers) of about 120 bytes each, i.e. 147KB, while the combinations of 8 elements are 5.36e+08 with a memsize of about 336 bytes, i.e. 180GB. So the first thing is, keep the larger combo set as a combinations object and iterate over it in the outer loop. By the way, this will also be much faster.
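Those combination counts can be sanity-checked with math.comb (available from Python 3.8):
from math import comb

print(comb(50, 2))  # 1225 pairs from the smaller set
print(comb(50, 8))  # 536878650, roughly 5.4e8, from the larger set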
Now the database part. I assume a relational DBMS, be it SQLite or anything.
You want to create a table with a single column defined. Each row of your table will contain one final combination. Instead of appending each combination to a list, you will insert it in the table.
Now the question is, how do you need to access the data you created? Do you just need to iterate over the final combos sequentially, or do you need to query them, for instance finding all the combos which contain one specific value?
In the latter case, you'll want to define your column as the Primary Key, so your queries will be efficient; otherwise, you will save space on disk by using an auto-incrementing integer as the PK (SQLite will create it for you if you don't explicitly define a PK, and a few other DBMSs will do so as well).
One final note: the insert phase may be painfully slow if you don't take some specific measures: check this very interesting SO post for details. In short, with a few optimizations they were able to go from 85 to over 96K inserts per second.
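Putting those pieces together, a rough sketch with the standard sqlite3 module might look like this (a single-column table storing each combination as text; set1 and set2 are the inputs from the question's permute(), and the table name is made up):
import sqlite3
from itertools import combinations

conn = sqlite3.connect('combos.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS combos (combo TEXT)')

set1_combos = list(combinations(set1, 2))  # the smaller side, exploded
for j in combinations(set2, 8):            # the larger side stays a lazy iterator
    cur.executemany('INSERT INTO combos (combo) VALUES (?)',
                    ((repr(i + j),) for i in set1_combos))
conn.commit()  # committing once at the end keeps the insert phase fast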
EDIT: iterating over the saved data
Once we have the data in the DB, iterating over them could be as simple as:
mycursor.execute('SELECT * FROM <table> WHERE <conditions>')
for combo in mycursor.fetchall():
    print(combo)  # or do what you need
But if your conditions don't filter away most of the rows you will meet the same memory issue we started with. A first step could be using fetchmany() or even fetchone() instead of fetchall() but still you may have a problem with the size of the query result set.
So you will probably need to read from the DB a chunk of data at a time, exploiting the LIMIT and OFFSET parameters in your SELECT. The final result may be something like:
chunk_size = 1000  # or whatever number fits your case
chunk_count = 0
chunk = mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> ORDER BY <primarykey> LIMIT {chunk_size}').fetchall()
while chunk:
    for combo in chunk:
        print(combo)  # or do what you need
    chunk_count += 1
    chunk = mycursor.execute(f'SELECT * FROM <table> WHERE <conditions> ORDER BY <primarykey> LIMIT {chunk_size} OFFSET {chunk_size * chunk_count}').fetchall()
Note that you will usually need the ORDER BY clause to ensure rows are returned as you expect them, and not in a random manner.
I don't believe SQLite has a built in array data type. Other DBMSs, such as PostgreSQL, do.
For SQLite, a good recommendation by another user on this site to obtain an array in SQLite can be found here: How to store array in one column in Sqlite3?
Another solution can be found here: https://sqlite.org/forum/info/99a33767e8a07e59
In either case, yes it is possible to have a DBMS like SQLite store an array (list) type. However, it may require a little setup depending on the DBMS.
Edit: If you're having memory issues, have you thought about storing your data as a string and accessing the portions of the string you need when you need it?
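For example, a hedged sketch of that string-based approach using JSON and sqlite3 (the table and column names are made up):
import json
import sqlite3

conn = sqlite3.connect('combos.db')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS combos (combo TEXT)')

combo = (1, 2, 3, 4)
cur.execute('INSERT INTO combos (combo) VALUES (?)', (json.dumps(combo),))
conn.commit()

row = cur.execute('SELECT combo FROM combos').fetchone()
restored = tuple(json.loads(row[0]))  # back to a tuple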
I would like to retrieve a random set of X entries from my postgres database using sqlalchemy. My first approach was this
random_set_of_Xrows = models.Table.query.filter(something).order_by(func.random()).limit(len(X)).all()
Since my table is quite big, this command takes about 1 second, and I was wondering how to optimise it. I guess the order_by function requires looking at all rows, so I figured using offset instead might make it faster. However, I can't quite see how to avoid the row count entirely.
Here is an approach using offset
rowCount = db.session.query(func.count(models.Table.id)).filter(something).scalar()
random_set_of_Xrows = models.Table.query.offset(func.floor(func.random()*rowCount)).limit(len(X)).all()
which however is not faster, with most of the time spent getting rowCount.
Any ideas how to make this faster?
cheers
carl
EDIT: As suggested below I added a column to the table with a random value and used that to extract the rows like
random_set_of_Xrows = models.Table.query.filter(something).order_by(models.Table.random_value).limit(len(X)).all()
I did ignore the offset part, since it doesn't matter to me if two calls give me the same results, I just need a random set of rows.
I've optimized this before by adding an indexed column r that inserts a random value automatically when a row is created. Then when you need a random set of rows just SELECT * FROM table ORDER BY r LIMIT 10 OFFSET some_random_value. You can run a script that updates your schema to add this column to your existing rows. You'll add a slight performance hit to writes with this approach, but if this is a functionality you need persistently it should be a fair trade off.
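A hedged sketch of that indexed column in plain SQLAlchemy declarative style (SQLAlchemy 1.4+; the model, session, something and X names follow the question and are placeholders):
import random
from sqlalchemy import Column, Float, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Table(Base):
    __tablename__ = 'table'
    id = Column(Integer, primary_key=True)
    random_value = Column(Float, index=True, default=random.random)  # filled on insert

# later, fetching X roughly-random rows without ORDER BY random():
rows = (session.query(Table)
        .filter(something)
        .order_by(Table.random_value)
        .limit(X)
        .all())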
I'm trying to create a (single) database file (that will be regularly updated/occasionally partially recreated/occasionally queried) that is over 200GB, so relatively large in my view. There are about 16k tables and they range in size from a few kb to ~1gb; they have 2-21 columns. The longest table has nearly 15 million rows.
The script I wrote goes through the input files one by one, doing a bunch of processing and regex to get usable data. It regularly sends a batch (0.5-1GB) to be written to sqlite3, with one separate executemany statement for each table that data is inserted into. There are no commit or create table statements etc. in between these execute statements, so I believe it all comes under a single transaction.
Initially the script worked fast enough for my purposes, but it slowed down dramatically over time as it neared completion, which is unfortunate given that I will need to slow it down further to keep memory use manageable on my laptop in normal use.
I did some quick benchmarking comparing inserting identical sample data into an empty database versus inserting into the 200GB database. The latter test was ~3 times slower to execute the insert statements (the relative slowdown of the commit was even worse, but in absolute terms it's insignificant); aside from that, there was no significant difference between the two.
When I researched this topic before, it mostly returned results about indexes slowing down inserts on large tables. The answer seemed to be that inserts on tables without an index should stay at more or less the same speed regardless of size; since I don't need to run numerous queries against this database, I didn't create any indexes. I even double-checked and ran a check for indexes, which, if I have it right, should exclude that as a cause:
c.execute('SELECT name FROM sqlite_master WHERE type="index"')
print(c.fetchone())  # returned None
The other issue that cropped up was transactions, but I don't see how that could be a problem only when writing to a large database, given the same script and the same data being written.
abbreviated relevant code:
import sqlite3
import functools

# process pre-defined objects, files, retrieve data in batch -
# all fine, no slowdown on full database

conn = sqlite3.connect(db_path)
c = conn.cursor()

table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]
# creates a list of tuples (table name "subject-item", subject, item)

targeted_create_tables = functools.partial(create_tables, c)
# creates a new table if needed for new subjects/items
list(map(targeted_create_tables, table_breakdown))  # no slowdown on full database

targeted_insert_data = functools.partial(insert_data, c)
# inserts data for a specific subject/item combo
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower

conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()
and relevant insert function:
def insert_data(c, tup):
    global collector   # list of tuples of data for a combo of a subject and item
    global sql_length  # pre-defined dictionary translating the item into the
                       # right-length (?,?,?...) string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if not (subject_data == []):
        statement = '''INSERT INTO "{0}" VALUES {1}'''.format(tbl_name, sql_length[item])
        c.executemany(statement, subject_data)  # massively slower, about 80% of
                                                # inserts > twice slower
        subject_data = []
EDIT: table create function per CL's request. I'm aware that this is inefficient (it takes roughly the same time to check if a table name exists this way as to create the table), but it's not significant to the slowdown.
def create_tables(c, tup):
    global collector
    global title  # list of column schemes to match to items
    tbl_name = tup[0]
    bm_unit = tup[1]
    item = tup[2]
    subject_data = collector[bm_unit][item]
    if not (subject_data == []):
        c.execute('SELECT * FROM sqlite_master WHERE name = "{0}" and type="table"'.format(tbl_name))
        if c.fetchone() == None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
There are, all told, 65 different column schemes in the title dict, but this is an example of what they look like:
title.append(('WINDFOR','(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'))
Has anyone got any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic; I've come into this topic area completely cold.
Appending rows to the end of a table is the fastest way to insert data (and you are not playing games with the rowid, so you are indeed appending to the end).
However, you are not using a single table but 16k tables, so the overhead for managing the table structure is multiplied.
Try increasing the cache size. But the most promising change would be to use fewer tables.
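A hedged sketch of both suggestions (the consolidated table layout and the rows_to_insert variable are made up for illustration; db_path is from the question):
import sqlite3

conn = sqlite3.connect(db_path)
c = conn.cursor()
c.execute('PRAGMA cache_size = -200000')  # negative value means size in KiB (~200 MB)

# one table keyed by subject and item instead of 16k per-combination tables
c.execute('''CREATE TABLE IF NOT EXISTS records
             (subject TEXT, item TEXT, payload TEXT)''')
c.executemany('INSERT INTO records VALUES (?, ?, ?)', rows_to_insert)
conn.commit()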
It makes sense to me that the time to INSERT increases as a function of the database size. The operating system itself may be slower when opening/closing/writing to larger files. An index could slow things down much more of course, but that doesn't mean that there would be no slowdown at all without an index.
I am currently selecting a large list of rows from a database using pyodbc. The result is then copied to a large list, and then I am trying to iterate over the list. Before I abandon Python and try to create this in C#, I wanted to know if there was something I was doing wrong.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
allIDRows = clientItemsCursor.fetchall()  # takes maybe 8 seconds
for clientItemRow in allIDRows:
    aID = str(clientItemRow[0])
    # Do something with str -- removed because I was trying to determine what was slow
    count = count + 1
Some more information:
The for loop is currently running at about 5 loops per second, and that seems insanely slow to me.
The total rows selected is ~489,000.
The machine it's running on has lots of RAM and CPU. It seems to only use one or two cores, and RAM usage is 1.72GB of 4GB.
Can anyone tell me what's wrong? Do scripts just run this slow?
Thanks
This should not be slow with Python native lists - but maybe ODBC's driver is returning a "lazy" object that tries to be smart but just gets slow. Try just doing
allIDRows = list(clientItemsCursor.fetchall())
in your code and post further benchmarks.
(Python lists can get slow if you start inserting things in its middle, but just iterating over a large list should be fast)
It's probably slow because you load all the results into memory first and then iterate over the list. Try iterating over the cursor instead.
And no, scripts shouldn't be that slow.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
for clientItemRow in clientItemsCursor:
    aID = str(clientItemRow[0])
    count = count + 1
More investigation is needed here... consider the following script:
bigList = range(500000)
doSomething = ""
count = 0
arrayList = [[x] for x in bigList]  # takes a few seconds
for x in arrayList:
    doSomething += str(x[0])
    count += 1
This is pretty much the same as your script, minus the database stuff, and takes a few seconds to run on my not-terribly-fast machine.
When you connect to your database directly (I mean you get an SQL prompt), how many seconds does this query take to run?
When the query ends, you get a message like this:
NNNNN rows in set (0.01 sec)
So, if that time is large and your query is slow even as "native" SQL, maybe you have to create an index on that table.
This is slow because you are
Getting all the results
Allocating memory and assigning the values to that memory to create the list allIDRows
Iterating over that list and counting.
If execute gives you back a cursor, then use the cursor to its advantage and start counting as you get rows back, saving time on the memory allocation.
clientItemsCursor.execute("Select ids from largetable where year =?", year)
for clientItemRow in clientItemsCursor:
    count += 1
Other hints:
create an index on year
use 'select count(*) from ...' to get the count for the year; this will probably be optimised on the DB (see the sketch after this list)
Remove the aID line if it's not needed; it converts the first item of the row to a string even though it's not used.
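A hedged sketch of the first two hints using pyodbc, as in the question (the index name is made up):
# create an index on year (one-time setup; the index name is hypothetical)
clientItemsCursor.execute("CREATE INDEX idx_largetable_year ON largetable (year)")
clientItemsCursor.commit()

# let the database do the counting instead of looping in Python
clientItemsCursor.execute("SELECT COUNT(*) FROM largetable WHERE year = ?", year)
count = clientItemsCursor.fetchone()[0]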