Insert slowing down over time as database grows (no index) - python

I'm trying to create a (single) database file (that will be regularly updated/occasionally partially recreated/occasionally queried) that is over 200GB, so relatively large in my view. There are about 16k tables ranging in size from a few KB to ~1GB; they have 2-21 columns. The longest table has nearly 15 million rows.
The script I wrote goes through the input files one by one, doing a bunch of processing and regex to extract usable data. It regularly sends a batch (0.5-1GB) to be written to sqlite3, with one separate executemany statement per table that data is inserted into. There are no commit or create-table statements etc. in between these execute statements, so I believe it all falls under a single transaction.
Initially the script worked fast enough for my purposes, but it slows down dramatically as it nears completion - which is unfortunate, given that I will need to slow it down further anyway to keep memory use manageable on my laptop.
I did some quick benchmarking, comparing inserts of identical sample data into an empty database versus into the 200GB database. The latter test was ~3 times slower to execute the insert statements (the relative speed of the commit was even worse, but in absolute terms it's insignificant); aside from that there was no significant difference between the two.
When I researched this topic it mostly returned results about indexes slowing down inserts on large tables. The consensus seemed to be that inserts into tables without an index should stay at more or less the same speed regardless of table size; since I don't need to run numerous queries against this database, I didn't create any indexes. I even double-checked and ran a check for indexes, which, if I have it right, should rule that out as a cause:
c.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
print(c.fetchone())  # returned None
The other issue that cropped up was transactions, but I don't see how that could be the cause when it's the same script writing the same data, and the slowdown only appears against the large database.
Abbreviated relevant code:
# process predefined objects and files, retrieve data in a batch -
# all fine, no slowdown on the full database
conn = sqlite3.connect(db_path)
c = conn.cursor()
table_breakdown = [(tup[0] + '-' + tup[1], tup[0], tup[1]) for tup in all_tup]
# list of tuples: (table name "subject-item", subject, item)
targeted_create_tables = functools.partial(create_tables, c)
# creates a new table if needed, for new subject/item combos
list(map(targeted_create_tables, table_breakdown))  # no slowdown on full database
targeted_insert_data = functools.partial(insert_data, c)
# inserts data for a specific subject/item combo
list(map(targeted_insert_data, table_breakdown))  # (3+)x slower
conn.commit()  # significant relative slowdown, but insignificant in absolute terms
conn.close()
And the relevant insert function:
def insert_data(c, tup):
    global collector   # dict of lists of data tuples for each subject/item combo
    global sql_length  # predefined dict translating an item into the
                       # right-length '(?,?,?...)' placeholder string
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if subject_data:
        statement = 'INSERT INTO "{0}" VALUES {1}'.format(tbl_name, sql_length[item])
        c.executemany(statement, subject_data)  # massively slower; about 80% of
                                                # inserts are > twice as slow
        subject_data = []
EDIT: table create function per CL's request. I'm aware that this is inefficient (it takes roughly as long to check whether a table name exists this way as to create the table), but it's not what's causing the slowdown.
def create_tables(c, tup):
    global collector
    global title  # list of column schemas matched to items
    tbl_name = tup[0]
    subject = tup[1]
    item = tup[2]
    subject_data = collector[subject][item]
    if subject_data:
        c.execute('SELECT * FROM sqlite_master WHERE name = "{0}" AND type = "table"'.format(tbl_name))
        if c.fetchone() is None:
            c.execute('CREATE TABLE "{0}" {1}'.format(tbl_name, title[item]))
There are, all told, 65 different column schemas in title, but this is an example of what they look like:
title.append(('WINDFOR','(TIMESTAMP TEXT, SP INTEGER, SD TEXT, PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)'))
Does anyone have any ideas about where to look or what could cause this issue? I apologize if I've left out important information or missed something horribly basic; I've come into this topic area completely cold.

Appending rows to the end of a table is the fastest way to insert data (and you are not playing games with the rowid, so you are indeed appending to the end).
However, you are not using a single table but 16k tables, so the overhead for managing the table structure is multiplied.
Try increasing the cache size. But the most promising change would be to use fewer tables.
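A minimal sketch of both suggestions, assuming Python's sqlite3 and borrowing the WINDFOR columns from the question as a stand-in schema (for cache_size, a negative value is in KiB):
conn = sqlite3.connect(db_path)
conn.execute('PRAGMA cache_size = -200000')  # ~200 MiB of page cache; negative = KiB
# Fewer tables: fold subject and item into columns of one table instead of
# encoding them in 16k table names.
conn.execute('''CREATE TABLE IF NOT EXISTS windfor
                (SUBJECT TEXT, ITEM TEXT, TIMESTAMP TEXT, SP INTEGER, SD TEXT,
                 PUBLISHED TEXT, WIND_CAP NUMERIC, WIND_FOR NUMERIC)''')
With one table per column schema (65 tables instead of 16k), the per-table bookkeeping is amortized across far more rows.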

It makes sense to me that the time to INSERT increases as a function of the database size. The operating system itself may be slower when opening/closing/writing to larger files. An index could slow things down much more, of course, but that doesn't mean there would be no slowdown at all without an index.

Related

mongoengine query for duplicates in an embedded document list

I'm making a Python app with mongoengine, where I have a MongoDB database of n users, and each user holds n daily records. I have a list of n new records per user that I want to add to my db.
I want to check whether a record for a certain date already exists for a user before adding a new record to that user.
What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that's an O(n^2) algorithm and it took 5 solid seconds for 300 records - too long. Below is an abbreviated version of the code.
There's gotta be a better way to query, right? I tried accessing something like user.records.date but that throws a not-found error.
import mongoengine

# snippet here is abbreviated and does not run
# zone of interest is in conditional_insert()

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

class User(mongoengine.Document):
    # meta = {}
    # account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date):  # I had to do that in my program
            # because Python kept converting datetime objects to str
            return
    # if no record with a duplicate date is found, insert the new record
    save_record(user, new_record)

def save_record(user, new_record): pass

if __name__ == "__main__":
    lst_to_insert = []  # list of (user, record_to_insert) tuples
    for pair in lst_to_insert:  # O(n)
        conditional_insert(pair[0], pair[1])  # O(n)
    # and I have n lst_to_insert, so in reality I'm currently at O(n^3)
Hi everyone (and future me, who will probably search for the same question 10 years later):
I optimized the code using the idea of a search tree. Instead of putting all records in a single list on User, I broke them down by year and month.
class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required=True)
    # contents = ...

# Month must be defined before Year references it
class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

class Year(mongoengine.EmbeddedDocument):
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)

class User(mongoengine.Document):
    # meta = {}
    # account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)
Because it's MongoDB, I can later partition by decades, heck even centuries, but by that point I don't think this code will still be relevant.
I then group the data to insert by month into separate pandas dataframes and feed in each dataframe separately. The data flow thus looks like:
0) get the monthly df
1) loop through the years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through the months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates
The algorithm to insert with the duplicate check is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter much for my use case).
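A minimal sketch of that walk, assuming Year and Month each also store their numeric value in a field (those fields are not shown in the abbreviated schema above, so the names year and month here are hypothetical):
def find_daily_records(user, d):
    # walk the year/month tree for date d; return the day list or None
    for y in user.yearly_records:
        if y.year == d.year:            # hypothetical Year.year field
            for m in y.monthly_records:
                if m.month == d.month:  # hypothetical Month.month field
                    return m.daily_records
    return None

def conditional_insert(user, new_record):
    days = find_daily_records(user, new_record.date)
    if days and any(str(r.date) == str(new_record.date) for r in days):
        return  # duplicate date; at most ~31 comparisons here
    save_record(user, new_record)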
Mongo cannot conveniently offer you suitable indexes, very sad.
You frequently iterate over user.records.
If you can afford to allocate the memory for 300 users,
just iterate once and throw them into a set, which
offers O(1) constant time lookup, and
offers RAM speed rather than network latency
When you save a user, also make note of it with cache.add((user_id, str(new_record.date))).
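A minimal sketch of that approach, using the schema from the question (cache holds (user_id, date-string) tuples):
def build_cache(users):
    # one pass over the existing data, O(total records)
    cache = set()
    for user in users:
        for r in user.records:
            cache.add((user.id, str(r.date)))
    return cache

cache = build_cache(User.objects)  # all users, fetched once
for user, new_record in lst_to_insert:
    key = (user.id, str(new_record.date))
    if key not in cache:  # O(1) membership test, RAM speed
        save_record(user, new_record)
        cache.add(key)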
EDIT
If you can't afford the memory for all those (user_id, date) tuples,
then sure, a relational database JOIN is fair,
it's just an out-of-core merge of sorted records.
I have had good results with using sqlalchemy to
hit local sqlite (memory or file-backed), or
heavier databases like Postgres or MariaDB.
Bear in mind that relational databases offer lovely ACID guarantees,
but you're paying for those guarantees. In this application
it doesn't sound like you need such properties.
Something as simple as /usr/bin/sort could do an out-of-core
ordering operation that puts all of a user's current
records right next to his historic records,
letting you filter them appropriately.
Sleepycat is not an RDBMS, but its B-tree does offer
external sorting, sufficient for the problem at hand.
(And yes, one can do transactions with sleepycat,
but again this problem just needs some pretty pedestrian reporting.)
Bench it and see.
Without profiling data for a specific workload,
it's pretty hard to tell if any extra complexity
would be worth it.
Identify the true memory or CPU bottleneck,
and focus just on that.
You don't necessarily need ordering, as hashing would suffice,
given enough core.
Send those tuples to a redis cache, and make it his problem
to store them somewhere.
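A minimal sketch of the redis variant, assuming a local server and the redis-py client; the set name is made up:
import redis

r = redis.Redis()  # assumes a server on localhost:6379
SEEN = 'seen_user_dates'  # hypothetical redis set of "user_id:date" members

def conditional_insert(user, new_record):
    member = f'{user.id}:{new_record.date}'
    # SADD returns 1 if the member was new, 0 if it already existed
    if r.sadd(SEEN, member):
        save_record(user, new_record)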

SQLite Optimization (Python): Finding duplicate entries, and then consolidating referenced values to a representative unique entry?

My problem is as follows:
I have a list of entries in a file that is analogous to names and addresses in a phone book. This list is unordered, somewhere on the order of 1-10 billion entries long, and occupies 500GB on disk. I'd like to count how many times each name is represented in this file (i.e., how many duplicates there are for each name), and then concatenate all known addresses onto that name.
My first approach was to insert the entries with a try/except clause, and update if necessary (UPSERT?):
c.execute('CREATE TABLE IF NOT EXISTS MYTable (Name TEXT PRIMARY KEY, Address TEXT) WITHOUT ROWID')
try:
    c.execute(f'INSERT INTO MYTable (Name, Address) VALUES ("{thename}", "{theaddress}")')
except sqlite3.IntegrityError:
    c.execute(f'UPDATE MYTable SET Address = Address || "{theaddress}" WHERE Name = "{thename}"')
And this performs the job at a rate of about 50,000 rows/s. However, I'm trying to make it go faster, as this approach will take about 50-60 hours to create the database. I've tried the following optimization attempts:
PRAGMA journal_mode = WAL; little effect
PRAGMA synchronous = OFF; little effect
beginning/committing transactions manually; little effect
removing the index, using executemany, adding the index back, and then using the fast indexed searches to find duplicates; ~2-4x faster to create the initial DB, but it preserves the duplicates until they can be indexed and then collapsed into a single (name: address1, address2, ..., address_n) representation, and it also occupies a lot of disk space.
Is there something about either the approach I should change, or some optimization to solve this issue? Thanks for any help!
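For what it's worth, the try/except round trip above can be collapsed into a single statement on SQLite 3.24+ using the UPSERT clause, with parameter binding instead of string formatting; a minimal sketch:
stmt = '''INSERT INTO MYTable (Name, Address) VALUES (?, ?)
          ON CONFLICT(Name) DO UPDATE SET Address = Address || excluded.Address'''
c.execute(stmt, (thename, theaddress))
# or batched, which also cuts Python-level overhead per row:
c.executemany(stmt, batch_of_name_address_pairs)  # iterable of (name, address) tuples
Whether this beats the try/except loop is workload-dependent, but it at least removes one Python round trip per duplicate.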

Recommendation for writing data from SQLite file via Python sqlite3

I have generated a giant SQLite database and need to get some data out of it. I wrote a script to do so, and profiling led to the unfortunate conclusion that the write process would take approx. 3 days with the current setup. I wrote the script to be as simplistic as possible to make it as fast as possible.
I am wondering if you have any tricks to speed up the whole process. The database has a unique index, but the columns I am querying against don't (because of duplicate rows in those).
Would it make sense to use any multi-processing Python library here?
The script would be like this:
import sqlite3

def write_from_query(db_name, table_name, condition, content_column, out_file):
    '''
    Writes contents from a SQLite database column to an output file.

    Keyword arguments:
    db_name (str): Path of the .sqlite database file.
    table_name (str): Name of the target table in the SQLite file.
    condition (str): Condition for querying the SQLite database table.
    content_column (str): Name of the column that contains the content for the output file.
    out_file (str): Path of the output file that will be written.
    '''
    # Connecting to the database file
    conn = sqlite3.connect(db_name)
    c = conn.cursor()
    # Querying the database and writing the output file
    c.execute('SELECT ({}) FROM {} WHERE {}'.format(content_column, table_name, condition))
    with open(out_file, 'w') as outf:
        for row in c:
            outf.write(row[0])
    # Closing the connection to the database
    conn.close()

if __name__ == '__main__':
    write_from_query(
        db_name='my_db.sqlite',
        table_name='my_table',
        condition='variable1=1 AND variable2<=5 AND variable3="Zinc_Plus"',
        content_column='variable4',
        out_file='sqlite_out.txt'
    )
Link to this script on GitHub
Thanks for your help, I am looking forward to your suggestions!
EDIT:
more information about the database:
I assume that you are running the write_from_query function for a huge number of queries.
If so, the problem is the missing indices on your filter criteria.
This results in the following: for each query you execute, SQLite loops through the whole 50GB of data and checks whether your conditions hold true. That is VERY inefficient.
The easiest way would be to slap indices on your columns.
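For example (a sketch; the table and column names are taken from the question's __main__ block):
CREATE INDEX IF NOT EXISTS idx_var1 ON my_table (variable1);
CREATE INDEX IF NOT EXISTS idx_var2 ON my_table (variable2);
CREATE INDEX IF NOT EXISTS idx_var3 ON my_table (variable3);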
An alternative would be to formulate fewer queries that cover multiple of your cases, and then loop over that data again to split it into different files. How well this can be done, however, depends on how your data is structured.
I'm not sure about multiprocessing/threading; SQLite is not really made for concurrency, but I guess it could work out, since you only read data...
Either you dump the content and filter in your own program, or you add indices to all the columns you use in your conditions.
Adding indices to all the columns will take a long, long time.
But for many different queries there is no alternative.
No, multiprocessing will probably not help. An SSD might, or 64GiB of RAM. But neither is needed with indices; queries are fast on normal disks too.
In conclusion: you created a database without creating indices for the columns you want to query. With 8 million rows this won't work.
Whilst the process of actually writing this data to a file will take a while, I would expect it to be more like minutes than days; e.g. at a 50MB/s sequential write speed, 15GB works out to around 5 minutes.
I suspect that the issue is with the queries / lack of indexes. I would suggest trying to build composite indexes based on the combinations of columns that you need to filter on. As the SQLite documentation explains, you can add as many columns as you want to an index.
Just to make you aware: adding indexes will slow down inserts/updates to your database, because every insert now also needs to find the appropriate place in each relevant index to add an entry, as well as appending data to the end of the table. But this is probably your only option to speed up the queries.
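A sketch of what such a composite index could look like for the condition in the question; putting the equality columns first and the range column (variable2 <= 5) last generally lets SQLite satisfy the whole filter from one index:
CREATE INDEX IF NOT EXISTS idx_filter
    ON my_table (variable1, variable3, variable2);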
I will look at the unique indices! But meanwhile another thing I just stumbled upon... Sorry for writing my own answer to my question here, but I thought it was better for the organization...
I was thinking that the .fetchall() command could also speed up the whole process, but I find the sqlite3 documentation on this a little brief... Would something like
with open(out_file, 'w') as outf:
    c.execute('SELECT * ...')
    results = c.fetchmany(10000)
    while results:
        for row in results:
            outf.write(row[0])
        results = c.fetchmany(10000)
make sense?

Getting results from database one by one

I'm writing a small program that queries results from a database (a single table). I'm using Python 3.3, SQLAlchemy and a PostgreSQL database.
result = db_session.query(Data).all()
progress = 0
for row in result:
    update_progress_bar(progress, len(result))
    do_something_with_data(row)
    progress += 1
The variable 'result' will contain a few thousand rows, and processing the data takes some time. That is why I introduced a simple progress bar, to give an idea of how much time it will take.
The problem is that 30% of the total time is spent querying the database (the first line), so when I start the program I get a big delay before my progress bar starts moving. In addition, I don't need to keep all the results in memory; I can process them one at a time.
Is there any way to modify the above program to fetch rows one by one until all rows are received, without loading everything into memory? In addition, I want to monitor the progress of querying and processing the data.
You need to just loop over the query without calling .all(), and call .yield_per() to set a batch size:
for row in db_session.query(Data).yield_per(10):
    do_something_with_data(row)
.all() indeed turns the whole result set into a list first, causing a delay if the resultset is large. Iterating over the query directly after setting .yield_per() instead fetches results as needed, provided the database API supports it.
If you wanted to know up-front how many rows will be returned, call .count() first:
result = db_session.query(Data)
count = result.count()
progress = 0
for row in result.yield_per(10):
    update_progress_bar(progress, count)
    do_something_with_data(row)
    progress += 1
.count() asks the database to give us an item count first.
Your database could still be pre-caching the result rows, leading to a start-up delay, even when using .yield_per(). In that case you'll need to use a windowed query to break your query up into blocks based on the range of values in one of the columns. Whether or not this will work depends on your exact table layout.
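A minimal sketch of one windowed variant, assuming Data has an indexed integer primary key id (this is keyset pagination; the SQLAlchemy wiki's windowed-range-query recipe is the more general form):
BLOCK = 1000
count = db_session.query(Data).count()
progress, last_id = 0, 0
while True:
    block = (db_session.query(Data)
             .filter(Data.id > last_id)
             .order_by(Data.id)
             .limit(BLOCK)
             .all())
    if not block:
        break
    for row in block:
        update_progress_bar(progress, count)
        do_something_with_data(row)
        progress += 1
    last_id = block[-1].id  # next block starts after the last seen id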

Data Structure for storing a sorting field to efficiently allow modifications

I'm using Django and PostgreSQL, but I'm not absolutely tied to the Django ORM if there's a better way to do this with raw SQL or database specific operations.
I've got a model that needs sequential ordering. Lookup operations will generally retrieve the entire list in order. The most common operation on this data is to move a row to the bottom of the list, with a subset of the intervening items bubbling up to replace the previous item, like this:
(operation on A, with subset B, C, E)
A -> B
B -> C
C -> E
D -> D
E -> A
Notice how D does not move.
In general, the subset of items will not be more than about 50 items, but the base list may grow to tens of thousands of entries.
The most obvious way of implementing this is with a simple integer order field. This seems suboptimal. It requires the compromise of making the position ordering column non-unique, where non-uniqueness is only required for the duration of a modification operation. To see this, imagine the minimal operation using A with subset B:
oldpos = B.pos
B.pos = A.pos
A.pos = oldpos
Even though you've stored the position, at the second line you've violated the uniqueness constraint. Additionally, this method makes atomicity problematic - your read operation has to happen before the write, during which time your records could change. Django's default transaction handling documentation doesn't address this, though I know it should be possible in the SQL using the "REPEATABLE READ" level of transaction locking.
I'm looking for alternate data structures that suit this use pattern more closely. I've looked at this question for ideas.
One proposal there is the Dewey decimal style solution, which makes insert operations occur numerically between existing values, so inserting A between B and C results in:
A=1 -> B=2
B=2 -> A=2.5
C=3 -> C=3
This solves the column uniqueness problem, but introduces the issue that the column must be a float with a specified number of decimals. Either I over-estimate and store way more data than I need, or the system becomes limited by whatever arbitrary decimal length I impose. Furthermore, I don't expect usage to be evenly distributed over the database - some keys are going to be moved far more often than others, making this solution hit the limit sooner. I could solve this problem by periodically re-numbering the database, but it seems that a good data structure should avoid needing this.
Another structure I've considered is the linked list (and variants). This has the advantage of making modification straightforward, but I'm not certain of its properties with respect to SQL - ordering such a list in a SQL query seems like it would be painful, and extracting a non-sequential subset of the list has terrible retrieval properties.
Beyond this, there are B-Trees, various Binary Trees, and so on. What do you recommend for this data structure? Is there a standard data structure for this solution in SQL? Is the initial idea of going with sequential integers really going to have scaling issues, or am I seeing problems where there are none?
Preferred solutions:
A linked list would be the usual way to achieve this. A query to return the items in order is trivial in Oracle, but I'm not sure how you would do it in PostgreSQL.
Another option would be to implement this using the ltree module for postgresql.
Less graceful (and write-heavy) solution:
Start a transaction. SELECT FOR UPDATE within scope for row-level locks. Move the target record to position 0, update the target's future succeeding records by +1 where their position is higher than the target's original position (or vice versa), and then update the target to the new position - a single additional write over what would be needed without a unique constraint. Commit :D
Simple (yet still write-heavy) solution, if you can wait for PostgreSQL 8.5 (an alpha is available) :)
Wrap it in a transaction, SELECT FOR UPDATE in scope, and use a deferred constraint (PostgreSQL 8.5 has support for deferred unique constraints, like Oracle).
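A sketch of the deferred-constraint variant (the table and column names here are made up; deferrable unique constraints landed in what became PostgreSQL 9.0):
ALTER TABLE items
    ADD CONSTRAINT items_pos_uniq UNIQUE (pos)
    DEFERRABLE INITIALLY DEFERRED;

BEGIN;
SELECT * FROM items WHERE id IN (1, 2, 3) FOR UPDATE;
-- rotate the pos values freely here; uniqueness is
-- only checked at commit time
UPDATE items SET pos = 34 WHERE id = 1;
UPDATE items SET pos = 10 WHERE id = 2;
COMMIT;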
A temp table and a transaction should maintain atomicity and the unique constraint on sort order. Restating the problem, you want to go from:

A 10        B 10
B 25   to   C 25
C 26        E 26
E 34        A 34
Where there can be any number of items in between each row. So, first you read in the records and create a list [['A',10],['B',25],['C',26],['E',34]]. Through some Pythonic magic you shift the identifiers around and insert them into a temp table:
create temporary table reorder (
    id varchar(20),  -- whatever
    sort_order number,
    primary key (id));
Now for the update:
update xyz
set sort_order = (select sort_order from reorder where xyz.id = reorder.id)
where id in (select id from reorder);
I'm only assuming pgsql can handle that query. If it can, it will be atomic.
Optionally, create table REORDER as a permanent table and the transaction will ensure that attempts to reorder the same record twice will be serialized.
EDIT: There are some transaction issues. You might need to implement both of my ideas. If two processes both want to update item B (for example), there can be issues. So, assume all order values are even:
1) Begin the transaction.
2) Increment all the orders being used by 1. This puts row-level write locks on all the rows you are going to update.
3) Select the data you just updated; if any sort_order fields are even, some other process has added a record that matches your criteria. You can either abort the transaction and restart, or just drop the record and finish the operation using only the records that were updated in step 2. The "right" thing to do depends on what you need this code to accomplish.
4) Fill your temporary reorder table as above, using the proper even sort_orders.
5) Update the main table as above.
6) Drop the temporary table.
7) Commit the transaction.
Step 2 ensures that if two lists overlap, only the first one will have access to the row
in question until the transaction completes:
update xyz set sort_order = sort_order + 1
where -- whatever your select criteria are

select * from xyz
where -- same select criteria
order by sort_order
Alternatively, you can add a control field to the table to get the same effect, and then you don't need to play with the sort_order field. The benefit of using the sort_order field is that indexing a BIT field or a LOCK_BY_USERID field that is usually null tends to perform poorly, since the index is meaningless 99% of the time. SQL engines don't like indexes that spend most of their time empty.
It seems to me that your real problem is the need to lock a table for the duration of a transaction. I don't immediately see a good way to solve this problem in a single operation, hence the need for locking.
So the question is whether you can do this in a "Django way", as opposed to using straight SQL. Searching for "django lock table" turned up some interesting links, including this snippet; there are many others that implement similar behavior.
A straight-SQL linked-list-style solution can be found in this Stack Overflow post; it appeared logical and succinct to me, but again it's two operations.
I'm very curious to hear how this turns out and what your final solution is, be sure to keep us updated!
Why not use a simple character field of some length, like a max of 16 (or 255), initially?
Start by labeling things aaa through zzz (that's 17,576 entries). (You could also add 0-9, the uppercase letters and symbols as an optimization.)
As items are added, they can go at the end, up to the maximum you allow for additional entries (zzza, zzzaa, zzzaaa, zzzaab, zzzaac, zzzaad, etc.).
This should be reasonably simple to program, and it's very similar to the Dewey Decimal system.
Yes, you will need to rebalance it occasionally, but that should be a simple operation. The simplest approach is two passes: pass 1 would set each new ordering tag to '0' (or any character earlier than the first character) followed by the new tag of the appropriate length, and pass 2 would remove the '0' from the front.
Obviously, you could do the same thing with floats and rebalance regularly; this is just a variation on that. The one advantage is that most databases will allow you to set a ridiculously large maximum size for the character field - large enough to make it very, very unlikely that you would run out of digits for the ordering, and unlikely that you would ever have to modify the schema, while not wasting a lot of space.
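A sketch of the key operation - computing a character tag that sorts strictly between two existing tags - treating tags as base-26 fractions (a=0 ... z=25). It assumes tags never end in 'a', and never produces one that does:
from fractions import Fraction

def key_to_frac(key):
    # 'bc' -> 1/26 + 2/676: digits after a base-26 radix point
    return sum(Fraction(ord(c) - ord('a'), 26 ** i)
               for i, c in enumerate(key, start=1))

def frac_to_key(f):
    # greedy base-26 expansion; terminates for midpoints of finite tags
    out = []
    while f:
        f *= 26
        d = int(f)
        out.append(chr(ord('a') + d))
        f -= d
    return ''.join(out)

def key_between(lo, hi):
    # requires lo < hi; e.g. key_between('a', 'b') == 'an'
    return frac_to_key((key_to_frac(lo) + key_to_frac(hi)) / 2)
Rebalancing then amounts to walking the rows in order and reassigning evenly spaced fixed-length tags.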
You can solve the renumbering issue by making the order column an integer that is always even. When you move the data, you change the order field to the new sort value + 1, and then do a quick update to convert all the odd order fields back to even:
update table set sort_order = bitand(sort_order, '0xFFFFFFFE')
where sort_order <> bitand(sort_order, '0xFFFFFFFE')
Thus you can keep a uniqueness constraint on sort_order.
EDIT: Okay, looking at the question again, I've started a new answer.
