I'm trying to do a large scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~ 2500 rows, and due to the SQL_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
for i in range(0,len(expression_samples),step):
gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. This also didn't really change anything for me performance wise. Lastly, as per this test on the peewee github page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds) but it also seems that the author used raw sql code to get this performance. Has anyone been able to do such a high performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign key PRAGMA is going to slow things down, why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
In all of the documentation where code using atomic appears as a context manager, it's been used as a function. Since it sounds like you're never seeing your code exit the with block, you're probably not seeing an error about not having an __exit__ method.
Can you try with helper.db.atomic():?
atomic() is starting a transaction. Without an open transaction, inserts are much slower because some expensive book keeping has to be done for every write, as opposed to only at the beginning and end.
EDIT
Since the code to start the question was changed, can I have some more information about the table you're inserting into? Is it large, how many indices are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.
I wrote a python program that processes a large amount of text and puts the data into its multi-level dictionary. As a result, dictionary gets really big (2GBs or above), eats up memory, and leads to slowness/memory error.
So I'm looking to use sqlite3 instead of putting the data in python dict.
But come to think of it, the entire db from sqlite3 will have to be accessible throughout the running of the program. So in the end, wouldn't it lead to the same result where the memory is eaten up by the large db?
Sorry my understanding of memory is a little clumsy. I want to make things clear before I bother to port my program to using db.
Thank you in advance.
SQLite creates temporary files as needed to store data not yet committed to a database. It will do so even for in-memory databases.
As such, 2GB of data will not eat up all your memory when stored in a SQLite database.
I`m trying to write loader to sqlite that will load as fast as possible simple rows in DB.
Input data looks like rows retrieved from postgres DB. Approximated amount of rows that will go to sqlite: from 20mil to 100mil.
I cannot use other DB except sqlite due to project restrictions.
My question is :
what is a proper logic to write such loader?
At first try I`ve tried to write set of encapsulated generators, that will take one row from Postgres, slightly ammend it and put it into sqlite. I ended up with the fact that for each row, i create separate sqlite connection and cursor. And that looks awfull.
At second try , i moved sqlite connection and cursor out of the generator , to the body of the script and it became clear that i do not commit data to sqlite untill i fetch and process all 20mils records. And this possibly could crash all my hardware.
At third try I strated to consider to keep Sqlite connection away from the loops , but create/close cursor each time i process and push one row to Sqlite. This is better but i think also have some overhead.
I also considered to play with transactions : One connection, one cursor, one transaction and commit called in generator each time row is being pushed to Sqlite. Is this i right way i`m going?
Is there some widely-used pattern to write such a component in python? Because I feel as if I am inventing a bicycle.
SQLite can handle huge transactions with ease, so why not commit at the end? Have you tried this at all?
If you do feel one transaction is a problem, why not commit ever n transactions? Process rows one by one, insert as needed, but every n executed insertions add a connection.commit() to spread the load.
See my previous answer about bulk and SQLite. Possibly my answer here as well.
A question: Do you control the SQLite database? There are compile time options you can tweak related to cache sizes, etc. you can adjust for your purposes as well.
In general, the steps in #1 are going to get you the biggest bang-for-your-buck.
Finally i managed to resolve my problem. Main issue was in exessive amount of insertions in sqlite. After i started to load all data from postgress to memory, aggregate it proper way to reduce amount of rows, i was able to decrease processing time from 60 hrs to 16 hrs.
I have a database which I regularly need to import large amounts of data into via some python scripts. Compacted, the data for a single months imports takes about 280mb, but during the import file size swells to over a gb.
Given the 2gb size limit on mdb files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting inbetween each, are there any techniques for avoiding the increase in file size?
Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.
And to forstall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.
If it could help, I could preprocess in sqlite.
Edit:
Just to add some further information (some already listed in my comments):
The data is being generated in Python on a table by table basis, and then all of the records for that table batch inserted via odbc
All processing is happening in Python: all the mdb file is doing is storing the data
All of the fields being inserted are valid fields (none are being excluded due to unique key violations, etc.)
Given the above, I'll be looking into how to disable row level locking via odbc and considering presorting the data and/or removing then reinstating indexes. Thanks for the suggestions.
Any further suggestions still welcome.
Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over a 100 megs when working on a 5 meg file. (in other words the file barley grew after turning off row locking to about 6 megs). With row locking on, the same operation results in a file well over 100 megs in size.
Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.
Do you have ms-access installed here, or are you just using JET (JET is the data engine that ms-access uses. You can use JET without access).
Open the database in ms-access and go:
Tools->options
On the advanced tab, un-check the box:
[ ] Open databases using record level locking.
This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of 10 times.
There also a registry setting that you can use here.
And, Are you using odbc, or an oleDB connection?
You can try:
Set rs = New ADODB.Recordset
With rs
.ActiveConnection = RsCnn
.Properties("Jet OLEDB:Locking Granularity") = 1
Try the setting from accesss (change the setting), exit, re-enter and then compact and repair. Then run your test import…the bloat issue should go away.
There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.
For furher reading and an example see here:
Does ACEDAO support row level locking?
One thing to watch out for is records which are present in the append queries but aren't inserted into the data due to duplicate key values, null required fields, etc. Access will allocate the space taken by the records which aren't inserted.
About the only significant thing I'm aware of is to ensure you have exclusive access to the database file. Which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000) when the Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks then records are inserted once per 4k page rather than as many as can be stuffed into a page. Likely because this made index insert/update operations faster.
Now compacting does indeed put as many records in the same 4k page as possible but that isn't of help to you.
A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.
Also, the way the insertions take place matters with regards to the way the file size balloons... For example: batched, vs. one/few records at a time, sorted (relative to particular index(es)), number of indexes (as you mentioned readily dropping some during the insert phase)...
Tentatively a pre-processing approach with say storing of new rows to a separate linked table, heap fashion (no indexes), then sorting/indexing this data is a minimal fashion, and "bulk loading" it to its real destination. Similar pre-processing in SQLite (has hinted in question) would serve the serve purpose. Keeping it "ALL MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefuly ;-)]...)
EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk space and harmony, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management but I suspect that in several cases, node operations cause a particular node (or much of it) to be copied anew, and the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur and hence we slow the growth.
EDIT again: After reading and discussing things from Tony Toews and Albert Kallal it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to set the database in single user mode to avoid this. (Read Tony's and Albert's response for more details.
Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.
I find I am able to link from Access to Sqlite and to run a make table query to import the data. I used this ODBC Driver: http://www.ch-werner.de/sqliteodbc/ and created User DNS.
File --> Options --> Current Database -> Check below options
* Use the Cache format that is compatible with Microsoft Access 2010 and later
* Clear Cache on Close
Then, you file will be saved by compacting to the original size.
I need a real DBA's opinion. Postgres 8.3 takes 200 ms to execute this query on my Macbook Pro while Java and Python perform the same calculation in under 20 ms (350,000 rows):
SELECT count(id), avg(a), avg(b), avg(c), avg(d) FROM tuples;
Is this normal behaviour when using a SQL database?
The schema (the table holds responses to a survey):
CREATE TABLE tuples (id integer primary key, a integer, b integer, c integer, d integer);
\copy tuples from '350,000 responses.csv' delimiter as ','
I wrote some tests in Java and Python for context and they crush SQL (except for pure python):
java 1.5 threads ~ 7 ms
java 1.5 ~ 10 ms
python 2.5 numpy ~ 18 ms
python 2.5 ~ 370 ms
Even sqlite3 is competitive with Postgres despite it assumping all columns are strings (for contrast: even using just switching to numeric columns instead of integers in Postgres results in 10x slowdown)
Tunings i've tried without success include (blindly following some web advice):
increased the shared memory available to Postgres to 256MB
increased the working memory to 2MB
disabled connection and statement logging
used a stored procedure via CREATE FUNCTION ... LANGUAGE SQL
So my question is, is my experience here normal, and this is what I can expect when using a SQL database? I can understand that ACID must come with costs, but this is kind of crazy in my opinion. I'm not asking for realtime game speed, but since Java can process millions of doubles in under 20 ms, I feel a bit jealous.
Is there a better way to do simple OLAP on the cheap (both in terms of money and server complexity)? I've looked into Mondrian and Pig + Hadoop but not super excited about maintaining yet another server application and not sure if they would even help.
No the Python code and Java code do all the work in house so to speak. I just generate 4 arrays with 350,000 random values each, then take the average. I don't include the generation in the timings, only the averaging step. The java threads timing uses 4 threads (one per array average), overkill but it's definitely the fastest.
The sqlite3 timing is driven by the Python program and is running from disk (not :memory:)
I realize Postgres is doing much more behind the scenes, but most of that work doesn't matter to me since this is read only data.
The Postgres query doesn't change timing on subsequent runs.
I've rerun the Python tests to include spooling it off the disk. The timing slows down considerably to nearly 4 secs. But I'm guessing that Python's file handling code is pretty much in C (though maybe not the csv lib?) so this indicates to me that Postgres isn't streaming from the disk either (or that you are correct and I should bow down before whoever wrote their storage layer!)
I would say your test scheme is not really useful. To fulfill the db query, the db server goes through several steps:
parse the SQL
work up a query plan, i. e. decide on which indices to use (if any), optimize etc.
if an index is used, search it for the pointers to the actual data, then go to the appropriate location in the data or
if no index is used, scan the whole table to determine which rows are needed
load the data from disk into a temporary location (hopefully, but not necessarily, memory)
perform the count() and avg() calculations
So, creating an array in Python and getting the average basically skips all these steps save the last one. As disk I/O is among the most expensive operations a program has to perform, this is a major flaw in the test (see also the answers to this question I asked here before). Even if you read the data from disk in your other test, the process is completely different and it's hard to tell how relevant the results are.
To obtain more information about where Postgres spends its time, I would suggest the following tests:
Compare the execution time of your query to a SELECT without the aggregating functions (i. e. cut step 5)
If you find that the aggregation leads to a significant slowdown, try if Python does it faster, obtaining the raw data through the plain SELECT from the comparison.
To speed up your query, reduce disk access first. I doubt very much that it's the aggregation that takes the time.
There's several ways to do that:
Cache data (in memory!) for subsequent access, either via the db engine's own capabilities or with tools like memcached
Reduce the size of your stored data
Optimize the use of indices. Sometimes this can mean to skip index use altogether (after all, it's disk access, too). For MySQL, I seem to remember that it's recommended to skip indices if you assume that the query fetches more than 10% of all the data in the table.
If your query makes good use of indices, I know that for MySQL databases it helps to put indices and data on separate physical disks. However, I don't know whether that's applicable for Postgres.
There also might be more sophisticated problems such as swapping rows to disk if for some reason the result set can't be completely processed in memory. But I would leave that kind of research until I run into serious performance problems that I can't find another way to fix, as it requires knowledge about a lot of little under-the-hood details in your process.
Update:
I just realized that you seem to have no use for indices for the above query and most likely aren't using any, too, so my advice on indices probably wasn't helpful. Sorry. Still, I'd say that the aggregation is not the problem but disk access is. I'll leave the index stuff in, anyway, it might still have some use.
Postgres is doing a lot more than it looks like (maintaining data consistency for a start!)
If the values don't have to be 100% spot on, or if the table is updated rarely, but you are running this calculation often, you might want to look into Materialized Views to speed it up.
(Note, I have not used materialized views in Postgres, they look at little hacky, but might suite your situation).
Materialized Views
Also consider the overhead of actually connecting to the server and the round trip required to send the request to the server and back.
I'd consider 200ms for something like this to be pretty good, A quick test on my oracle server, the same table structure with about 500k rows and no indexes, takes about 1 - 1.5 seconds, which is almost all just oracle sucking the data off disk.
The real question is, is 200ms fast enough?
-------------- More --------------------
I was interested in solving this using materialized views, since I've never really played with them. This is in oracle.
First I created a MV which refreshes every minute.
create materialized view mv_so_x
build immediate
refresh complete
START WITH SYSDATE NEXT SYSDATE + 1/24/60
as select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;
While its refreshing, there is no rows returned
SQL> select * from mv_so_x;
no rows selected
Elapsed: 00:00:00.00
Once it refreshes, its MUCH faster than doing the raw query
SQL> select count(*),avg(a),avg(b),avg(c),avg(d) from so_x;
COUNT(*) AVG(A) AVG(B) AVG(C) AVG(D)
---------- ---------- ---------- ---------- ----------
1899459 7495.38839 22.2905454 5.00276131 2.13432836
Elapsed: 00:00:05.74
SQL> select * from mv_so_x;
COUNT(*) AVG(A) AVG(B) AVG(C) AVG(D)
---------- ---------- ---------- ---------- ----------
1899459 7495.38839 22.2905454 5.00276131 2.13432836
Elapsed: 00:00:00.00
SQL>
If we insert into the base table, the result is not immediately viewable view the MV.
SQL> insert into so_x values (1,2,3,4,5);
1 row created.
Elapsed: 00:00:00.00
SQL> commit;
Commit complete.
Elapsed: 00:00:00.00
SQL> select * from mv_so_x;
COUNT(*) AVG(A) AVG(B) AVG(C) AVG(D)
---------- ---------- ---------- ---------- ----------
1899459 7495.38839 22.2905454 5.00276131 2.13432836
Elapsed: 00:00:00.00
SQL>
But wait a minute or so, and the MV will update behind the scenes, and the result is returned fast as you could want.
SQL> /
COUNT(*) AVG(A) AVG(B) AVG(C) AVG(D)
---------- ---------- ---------- ---------- ----------
1899460 7495.35823 22.2905352 5.00276078 2.17647059
Elapsed: 00:00:00.00
SQL>
This isn't ideal. for a start, its not realtime, inserts/updates will not be immediately visible. Also, you've got a query running to update the MV whether you need it or not (this can be tune to whatever time frame, or on demand). But, this does show how much faster an MV can make it seem to the end user, if you can live with values which aren't quite upto the second accurate.
I retested with MySQL specifying ENGINE = MEMORY and it doesn't change a thing (still 200 ms). Sqlite3 using an in-memory db gives similar timings as well (250 ms).
The math here looks correct (at least the size, as that's how big the sqlite db is :-)
I'm just not buying the disk-causes-slowness argument as there is every indication the tables are in memory (the postgres guys all warn against trying too hard to pin tables to memory as they swear the OS will do it better than the programmer)
To clarify the timings, the Java code is not reading from disk, making it a totally unfair comparison if Postgres is reading from the disk and calculating a complicated query, but that's really besides the point, the DB should be smart enough to bring a small table into memory and precompile a stored procedure IMHO.
UPDATE (in response to the first comment below):
I'm not sure how I'd test the query without using an aggregation function in a way that would be fair, since if i select all of the rows it'll spend tons of time serializing and formatting everything. I'm not saying that the slowness is due to the aggregation function, it could still be just overhead from concurrency, integrity, and friends. I just don't know how to isolate the aggregation as the sole independent variable.
Those are very detailed answers, but they mostly beg the question, how do I get these benefits without leaving Postgres given that the data easily fits into memory, requires concurrent reads but no writes and is queried with the same query over and over again.
Is it possible to precompile the query and optimization plan? I would have thought the stored procedure would do this, but it doesn't really help.
To avoid disk access it's necessary to cache the whole table in memory, can I force Postgres to do that? I think it's already doing this though, since the query executes in just 200 ms after repeated runs.
Can I tell Postgres that the table is read only, so it can optimize any locking code?
I think it's possible to estimate the query construction costs with an empty table (timings range from 20-60 ms)
I still can't see why the Java/Python tests are invalid. Postgres just isn't doing that much more work (though I still haven't addressed the concurrency aspect, just the caching and query construction)
UPDATE:
I don't think it's fair to compare the SELECTS as suggested by pulling 350,000 through the driver and serialization steps into Python to run the aggregation, nor even to omit the aggregation as the overhead in formatting and displaying is hard to separate from the timing. If both engines are operating on in memory data, it should be an apples to apples comparison, I'm not sure how to guarantee that's already happening though.
I can't figure out how to add comments, maybe i don't have enough reputation?
I'm a MS-SQL guy myself, and we'd use DBCC PINTABLE to keep a table cached, and SET STATISTICS IO to see that it's reading from cache, and not disk.
I can't find anything on Postgres to mimic PINTABLE, but pg_buffercache seems to give details on what is in the cache - you may want to check that, and see if your table is actually being cached.
A quick back of the envelope calculation makes me suspect that you're paging from disk. Assuming Postgres uses 4-byte integers, you have (6 * 4) bytes per row, so your table is a minimum of (24 * 350,000) bytes ~ 8.4MB. Assuming 40 MB/s sustained throughput on your HDD, you're looking at right around 200ms to read the data (which, as pointed out, should be where almost all of the time is being spent).
Unless I screwed up my math somewhere, I don't see how it's possible that you are able to read 8MB into your Java app and process it in the times you're showing - unless that file is already cached by either the drive or your OS.
I don't think that your results are all that surprising -- if anything it is that Postgres is so fast.
Does the Postgres query run faster a second time once it has had a chance to cache the data? To be a little fairer your test for Java and Python should cover the cost of acquiring the data in the first place (ideally loading it off disk).
If this performance level is a problem for your application in practice but you need a RDBMS for other reasons then you could look at memcached. You would then have faster cached access to raw data and could do the calculations in code.
One other thing that an RDBMS generally does for you is to provide concurrency by protecting you from simultaneous access by another process. This is done by placing locks, and there's some overhead from that.
If you're dealing with entirely static data that never changes, and especially if you're in a basically "single user" scenario, then using a relational database doesn't necessarily gain you much benefit.
Are you using TCP to access the Postgres? In that case Nagle is messing with your timing.
You need to increase postgres' caches to the point where the whole working set fits into memory before you can expect to see perfomance comparable to doing it in-memory with a program.
Thanks for the Oracle timings, that's the kind of stuff I'm looking for (disappointing though :-)
Materialized views are probably worth considering as I think I can precompute the most interesting forms of this query for most users.
I don't think query round trip time should be very high as i'm running the the queries on the same machine that runs Postgres, so it can't add much latency?
I've also done some checking into the cache sizes, and it seems Postgres relies on the OS to handle caching, they specifically mention BSD as the ideal OS for this, so I thinking Mac OS ought to be pretty smart about bringing the table into memory. Unless someone has more specific params in mind I think more specific caching is out of my control.
In the end I can probably put up with 200 ms response times, but knowing that 7 ms is a possible target makes me feel unsatisfied, as even 20-50 ms times would enable more users to have more up to date queries and get rid of a lots of caching and precomputed hacks.
I just checked the timings using MySQL 5 and they are slightly worse than Postgres. So barring some major caching breakthroughs, I guess this is what I can expect going the relational db route.
I wish I could up vote some of your answers, but I don't have enough points yet.