I wrote a Python program that processes a large amount of text and puts the data into a multi-level dictionary. As a result, the dictionary gets really big (2 GB or more), eats up memory, and leads to slowness or a MemoryError.
So I'm looking to use sqlite3 instead of putting the data in python dict.
But come to think of it, the entire sqlite3 database will have to be accessible throughout the run of the program. So in the end, wouldn't it lead to the same result, with memory eaten up by the large database?
Sorry, my understanding of memory is a little shaky. I want to make this clear before I go to the trouble of porting my program to a database.
Thank you in advance.
SQLite creates temporary files as needed to store data not yet committed to a database. It will do so even for in-memory databases.
As such, 2GB of data stored in a SQLite database will not eat up all your memory: the data lives in the database file on disk, and SQLite keeps only a limited page cache in memory.
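To make that concrete, here is a minimal sketch of what the port could look like, assuming a hypothetical two-level dict d[outer][inner] = value (the table and column names are made up, not your actual schema): the bulk of the data stays on disk, not in Python memory.
import sqlite3

conn = sqlite3.connect('textdata.db')
conn.execute("""CREATE TABLE IF NOT EXISTS counts (
    outer_key TEXT,
    inner_key TEXT,
    value     INTEGER,
    PRIMARY KEY (outer_key, inner_key)
)""")

def set_value(outer_key, inner_key, value):
    # equivalent of d[outer_key][inner_key] = value
    conn.execute("INSERT OR REPLACE INTO counts VALUES (?, ?, ?)",
                 (outer_key, inner_key, value))

def get_value(outer_key, inner_key, default=None):
    # equivalent of d[outer_key][inner_key]
    row = conn.execute("SELECT value FROM counts WHERE outer_key = ? AND inner_key = ?",
                       (outer_key, inner_key)).fetchone()
    return row[0] if row else default

# remember to conn.commit() once in a while (or wrap batches in "with conn:")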
Related
I'm trying to do a large-scale bulk insert into a SQLite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to SQLITE_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way my code is written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. That also didn't really change anything for me performance-wise. Lastly, as per this test on the peewee GitHub page, it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw SQL to get this performance. Has anyone been able to do such a high-performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign-key PRAGMA is going to slow things down, not speed them up; every insert then has to verify that the referenced rows exist.
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
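A rough sketch of how those settings might be applied with peewee 3.x; the database file name, model fields, and index are placeholders here, not the asker's actual schema:
from peewee import SqliteDatabase, Model, CharField, FloatField

db = SqliteDatabase('bulk_load.db', pragmas={
    'journal_mode': 'wal',        # PRAGMA journal_mode=wal
    'synchronous': 0,             # PRAGMA synchronous=0
    'locking_mode': 'exclusive',  # PRAGMA locking_mode=EXCLUSIVE
})

class GeneExpressionRead(Model):
    sample = CharField()          # placeholder fields
    value = FloatField()

    class Meta:
        database = db

db.connect()
db.create_tables([GeneExpressionRead])

step = 200                        # stay under the variable-number limit
expression_samples = [{'sample': 's1', 'value': 1.0}] * 1000  # placeholder data

with db.atomic():                 # one transaction for the whole load
    for i in range(0, len(expression_samples), step):
        GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()

# create indexes only after the data is loaded (adjust the table name to yours)
db.execute_sql('CREATE INDEX IF NOT EXISTS idx_sample ON gene_expression_read (sample)')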
In all of the documentation where atomic appears as a context manager, it is called as a method (atomic()) rather than referenced as a bare attribute. Since it sounds like your code never exits the with block, you're probably just not seeing the error about the missing __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower, because some expensive bookkeeping has to be done for every write instead of only at the beginning and end.
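To see the difference the transaction makes, here is a small self-contained comparison using plain sqlite3 rather than peewee (the file and table names are made up):
import sqlite3
import time

conn = sqlite3.connect('timing_demo.db')
conn.execute('CREATE TABLE IF NOT EXISTS t (x INTEGER)')
conn.commit()

rows = [(i,) for i in range(1_000)]

# one commit per insert: the transaction bookkeeping is paid on every write
start = time.perf_counter()
for row in rows:
    conn.execute('INSERT INTO t VALUES (?)', row)
    conn.commit()
print('per-row commits:', time.perf_counter() - start)

# a single transaction: the bookkeeping is paid once
start = time.perf_counter()
with conn:
    conn.executemany('INSERT INTO t VALUES (?)', rows)
print('one transaction:', time.perf_counter() - start)
conn.close()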
EDIT
Since the code at the start of the question was changed, can I get some more information about the table you're inserting into? How large is it, and how many indexes does it have?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.
I want to take advantage of the speed benefits of holding an SQLite database (via SQLAlchemy) in memory while I go through a one-time process of inserting content, and then dump it to file, stored to be used later.
Consider a bog-standard database created in the usual way:
from sqlalchemy import create_engine

# in-memory database
e = create_engine('sqlite://')
Is there a quicker way of moving its contents to disc, other than just creating a brand new database and inserting each entry manually?
EDIT:
There is some doubt as to whether I'd even see any benefit from using an in-memory database. Unfortunately, I already see a huge time difference of about 120x.
This confusion is probably due to my leaving out some important detail in the question, and probably also due to a lack of understanding on my part re: caches / page sizes / etc. Allow me to elaborate:
I am running simulations of a system I have set up, with each simulation going through the following stages:
Make some queries to the database.
Make calculations / run a simulation based on the results of those queries.
Insert new entries into the database based on the most recent simulation.
Make sure the database is up to date with the new entries by running commit().
While I only ever make a dozen or so insertions on each simulation run, I do however run millions of simulations, and the results of each simulation need to be available for future simulations to take place. As I say, this read and write process takes considerably longer when running a file-backed database; it's the difference between 6 hours and a month.
Hopefully this clarifies things. I can cobble together a simple Python script to outline my process a little further if necessary.
SQLAlchemy and SQLite know how to cache and do batch-inserts just fine.
There is no benefit in using an in-memory SQLite database here, because that database uses pages just like the on-disk version would; the only difference is that for a disk-based database those pages eventually get written to disk. The difference in performance is only about 1.5 times; see SQLite Performance Benchmark -- why is :memory: so slow...only 1.5X as fast as disk?
There is also no way to move the in-memory database to a disk-based database at a later time, short of running queries on the in-memory database and executing batch inserts into the disk-based database on two separate connections.
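For completeness, a sketch of that two-connection approach, assuming SQLAlchemy 1.4+ and that the in-memory engine has already been populated (the target file name is made up; table names are discovered by reflection):
from sqlalchemy import create_engine, MetaData, select

mem_engine = create_engine('sqlite://')            # in-memory, already populated
disk_engine = create_engine('sqlite:///dump.db')   # file-backed copy

metadata = MetaData()
metadata.reflect(bind=mem_engine)                  # discover the in-memory schema
metadata.create_all(bind=disk_engine)              # recreate it on disk

with mem_engine.connect() as src, disk_engine.begin() as dst:
    for table in metadata.sorted_tables:
        rows = [dict(r._mapping) for r in src.execute(select(table))]
        if rows:
            dst.execute(table.insert(), rows)      # batch insert into the disk copy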
I'm trying to write a loader for SQLite that will load simple rows into the DB as fast as possible.
The input data looks like rows retrieved from a Postgres DB. The approximate number of rows that will go into SQLite: from 20 million to 100 million.
I cannot use any DB other than SQLite due to project restrictions.
My question is:
What is the proper logic for writing such a loader?
On the first try, I wrote a set of encapsulated generators that take one row from Postgres, slightly amend it, and put it into SQLite. I ended up creating a separate SQLite connection and cursor for every row, which looks awful.
On the second try, I moved the SQLite connection and cursor out of the generator into the body of the script, and it became clear that I do not commit data to SQLite until I have fetched and processed all 20 million records, which could possibly crash all my hardware.
On the third try, I considered keeping the SQLite connection out of the loops, but creating/closing the cursor each time I process and push one row to SQLite. This is better, but I think it still has some overhead.
I also considered playing with transactions: one connection, one cursor, and one transaction, with commit called in the generator each time a row is pushed to SQLite. Is this the right way to go?
Is there some widely used pattern for writing such a component in Python? I feel as if I am reinventing the wheel.
SQLite can handle huge transactions with ease, so why not commit at the end? Have you tried this at all?
If you do feel one transaction is a problem, why not commit every n insertions? Process rows one by one, insert as needed, but every n executed insertions add a connection.commit() to spread the load.
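Something along these lines, as a sketch; the table schema and the rows iterator are placeholders for whatever comes out of your Postgres side:
import sqlite3

BATCH_SIZE = 10_000                          # tune to your hardware

def load(rows, db_path='target.db'):
    conn = sqlite3.connect(db_path)          # one connection for the whole load
    cur = conn.cursor()                      # one cursor, reused for every insert
    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER, value TEXT)')
    pending = 0
    for row in rows:                         # rows: any iterator/generator of tuples
        cur.execute('INSERT INTO data VALUES (?, ?)', row)
        pending += 1
        if pending >= BATCH_SIZE:            # spread the load: commit every N rows
            conn.commit()
            pending = 0
    conn.commit()                            # flush the final partial batch
    conn.close()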
See my previous answer about bulk and SQLite. Possibly my answer here as well.
A question: Do you control the SQLite database? There are compile time options you can tweak related to cache sizes, etc. you can adjust for your purposes as well.
In general, the steps in #1 are going to get you the biggest bang-for-your-buck.
Finally I managed to resolve my problem. The main issue was the excessive number of insertions into SQLite. After I started loading all the data from Postgres into memory and aggregating it properly to reduce the number of rows, I was able to decrease the processing time from 60 hours to 16 hours.
I insert 10,000,000+ records into an SQLite database. Is it safe to wait with connection.commit() until I finish all the insertions? Or is there some out-of-memory/overflow risk? Is it possible that the uncommitted entries will use all the available memory and cause page swapping? Committing after each INSERT is a performance killer, so I want to avoid it.
I use the sqlite3 module with Python.
Well, in the first place, SQLite is not really meant to handle that much data; it is generally for smaller data sets. However, committing that much data might take longer when committing all at once than if you were to commit, say, after every 2 or 3 inserts.
As far as your memory/overflow risk goes, SQLite dumps everything into a file cache, then on commit puts the whole file cache into the SQLite db file.
However, committing all at once shouldn't give you any issues.
Committing all the insertions at once is a perfectly acceptable optimization. If an error should occur (which is unlikely) then it will occur within the client library, and at that point you can look into solving that issue. But worrying about that prematurely is unnecessary.
SQLite's rollback journal stores the old contents of all pages that were changed in the database file.
If you are only inserting data, you are not actually changing much data that is already in the database (only some management structures), so not much overhead data will accumulate for the transaction.
(The situation would be different if you had enabled WAL mode.)
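To illustrate: a single transaction committed once at the end, fed by a generator, keeps Python-side memory flat, and with the default rollback journal only the few existing pages the inserts touch need to be preserved. The record generator below is just a stand-in for your real data source.
import sqlite3

conn = sqlite3.connect('big.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (id INTEGER, payload TEXT)')

def generate_records():
    for i in range(10_000_000):               # stand-in for the real data source
        yield (i, 'row-%d' % i)

with conn:                                    # one transaction, committed once on exit
    conn.executemany('INSERT INTO records VALUES (?, ?)', generate_records())
conn.close()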
I need to track a large volume of inotify messages for a set of files that, during their lifetime, will move between several specific directories, with inodes intact; I need to track the movement of these inodes, as well as create/delete and changes to a file's content. There will be many hundreds of changes per second.
Because of limited resources, I can't store it all in RAM (or on disk, or in a database).
Luckily, most of these files will be deleted in short order; the file content- and movement-history just need to be stored for later analysis. The files that are not deleted immediately will end up staying in a particular directory for a known period of time.
So it seems to me that I need a data structure that is partially stored in RAM, and partially saved to disk; part of the portion saved to disk will need to be recalled (the files not deleted), but most will not. I will not need to query the data, only access it by an identifier (the file name, which is [A-Z0-9]{8}). It would be helpful to be able to configure when the file data is flushed to disk.
Does such a beast exist?
Edit: I've asked a related question.
Why not a database? Say SQLite.
While SQLite isn't the most efficient storage mechanism in terms of space, there are a number of advantages -- the first and foremost being that it is an SQL RDBMS. The amount of memory SQLite uses (to temporarily cache data) can be configured through the cache_size pragma.
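A rough sketch of what that could look like for this use case (the schema and the cache limit are invented; a negative cache_size is interpreted as a size in KiB):
import sqlite3
import time

conn = sqlite3.connect('inotify_history.db')
conn.execute('PRAGMA cache_size = -16000')    # cap the page cache at roughly 16 MB
conn.execute('''CREATE TABLE IF NOT EXISTS events (
    name   TEXT,    -- the [A-Z0-9]{8} file identifier
    event  TEXT,    -- create / delete / move / modify
    detail TEXT,
    ts     REAL
)''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_events_name ON events (name)')

with conn:
    conn.execute('INSERT INTO events VALUES (?, ?, ?, ?)',
                 ('A1B2C3D4', 'moved', '/stage/A1B2C3D4', time.time()))

# later analysis: recall everything recorded for one identifier
history = conn.execute('SELECT event, detail, ts FROM events WHERE name = ?',
                       ('A1B2C3D4',)).fetchall()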
If SQLite isn't an option, what about one of the "key-value stores"? They range from distributed client/server in-memory (e.g. memcached) to local embedded disk-based (e.g. BDB) to memory-with-a-persistent-backing-for-overflow, and anywhere in between. They do not have the SQL DDL/DQL (although some might allow relationships), but they are efficient at what they do -- store keys and values.
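If a plain key-value store turns out to be enough, even the standard library's shelve gives you a disk-backed mapping keyed by the file name; a quick sketch (the key and event tuples are invented):
import shelve

with shelve.open('file_history.shelf') as history:
    key = 'A1B2C3D4'                              # the 8-character file name
    events = history.get(key, [])
    events.append(('moved', '/stage/A1B2C3D4'))   # record a new event
    history[key] = events                         # write the value back to disk

    # later: recall the history of a file that was not deleted
    print(history.get('A1B2C3D4', []))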
Of course, one could always implement an LRU structure (say, a basic sorted list with a limit) with overflow to a simple extensible disk-based hash implementation... but... consider the above first :) [There may also be some micro-KV libraries/source out there].
Happy coding.