How to write a proper big data loader to sqlite - python

I'm trying to write a loader for SQLite that will load simple rows into the DB as fast as possible.
The input data looks like rows retrieved from a Postgres DB. The approximate number of rows that will go to SQLite: from 20 million to 100 million.
I cannot use any DB other than SQLite due to project restrictions.
My question is: what is the proper logic for writing such a loader?
On my first try, I wrote a set of encapsulated generators that take one row from Postgres, slightly amend it and put it into SQLite. I ended up creating a separate SQLite connection and cursor for each row, and that looks awful.
On my second try, I moved the SQLite connection and cursor out of the generator into the body of the script, and it became clear that I do not commit data to SQLite until I have fetched and processed all 20 million records. That could possibly exhaust my hardware.
On my third try, I started considering keeping the SQLite connection outside the loops, but creating/closing a cursor each time I process and push one row to SQLite. This is better, but I think it still has some overhead.
I also considered playing with transactions: one connection, one cursor, one transaction, and a commit called in the generator each time a row is pushed to SQLite. Is this the right way to go?
Is there some widely used pattern for writing such a component in Python? I feel as if I am reinventing the wheel.

SQLite can handle huge transactions with ease, so why not commit at the end? Have you tried this at all?
If you do feel one big transaction is a problem, why not commit every n insertions? Process the rows one by one, insert as needed, but after every n executed insertions call connection.commit() to spread the load.
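For illustration, a minimal sketch of that batched-commit pattern using Python's sqlite3 module (the table and column names are made up):

import sqlite3

BATCH_SIZE = 10000  # commit every n insertions; tune to taste

def load_rows(rows, sqlite_path):
    conn = sqlite3.connect(sqlite_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS target_table (a, b)")  # hypothetical table
    for i, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO target_table (a, b) VALUES (?, ?)", row)
        if i % BATCH_SIZE == 0:
            conn.commit()  # spread the load over periodic commits
    conn.commit()  # commit whatever is left over
    conn.close()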

See my previous answer about bulk inserts and SQLite. Possibly my answer here as well.
A question: do you control the SQLite database? There are compile-time options related to cache sizes, etc. that you can adjust for your purposes as well.
In general, the steps in #1 are going to get you the biggest bang for your buck.

Finally I managed to resolve my problem. The main issue was the excessive number of insertions into SQLite. After I started to load all the data from Postgres into memory and aggregate it properly to reduce the number of rows, I was able to decrease the processing time from 60 hrs to 16 hrs.
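A rough sketch of that kind of approach, assuming psycopg2 on the Postgres side and a simple sum aggregation (the table and column names are hypothetical):

import sqlite3
import psycopg2

pg = psycopg2.connect("dbname=source user=loader")
pg_cur = pg.cursor(name="stream")  # server-side cursor, so rows are fetched lazily
pg_cur.itersize = 50000
pg_cur.execute("SELECT key, value FROM source_table")

# aggregate in memory to reduce the number of rows that reach SQLite
totals = {}
for key, value in pg_cur:
    totals[key] = totals.get(key, 0) + value

lite = sqlite3.connect("target.db")
lite.execute("CREATE TABLE IF NOT EXISTS aggregated (key TEXT, total REAL)")
with lite:  # one transaction for the whole bulk insert
    lite.executemany("INSERT INTO aggregated (key, total) VALUES (?, ?)", totals.items())
lite.close()
pg.close()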

Related

What is the fastest and most efficient way to check a database for a new entry?

I currently have an application that, at any given time, will INSERT new data into my database. I also have a different Python script that checks my database in an infinite loop for a new entry, and when it finds one, it selects it, uses it, then waits again.
I'm wondering if there is any way of doing this more efficiently and more accurately?
Thanks
I currently have a setup like this:
import pyodbc

conn = pyodbc.connect('Driver={ODBC Driver 13 for SQL Server};'
                      'Server=REDACTED;'
                      'Database=REDACTED;'
                      'UID=REDACTED;'
                      'PWD=REDACTED;')
cursor = conn.cursor()
in a loop such as:
i = 1
while i < 2:
    # check database for new entry with a select statement, compare the old list with the
    # current list and see if there is a difference. If there is, use that new key and process the data.
It currently works fine, but I feel like it is doing a lot of work for nothing. For example, during the week it will only ever really access the database 30-50 times a day, but on the weekend it will do it close to 100-200 times a day... the only thing is, there is no set number of times it will access it, or when.
Any help would be useful.
Thanks
I've never had to do this before, but three ideas come to mind.
(1) If your script is writing to another database, then an alternative could be setting up replication of your source database. The DBA of the original database can set that up for you.
(2) If you are willing to overhaul the original database, you could consider a real-time in-memory database (such as redis); depending on your use case that might help.
(3) It appears that sqlalchemy has a built-in event listener. I use sqlalchemy in Python but have never used this particular feature.
https://docs.sqlalchemy.org/en/13/orm/session_events.html
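As a rough illustration of idea (3), a hedged sketch using SQLAlchemy's ORM event API; the Entry model and the processing function are placeholders, and note that mapper events only fire for inserts made through this same SQLAlchemy application, not for writes from other clients:

from sqlalchemy import event

# assumes a mapped model named Entry defined elsewhere in the application
@event.listens_for(Entry, "after_insert")
def handle_new_entry(mapper, connection, target):
    # called once for each newly inserted Entry during a session flush
    process_new_entry(target)  # hypothetical processing function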

peewee with bulk insert is very slow into sqlite db

I'm trying to do a large scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to SQLITE_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. That also didn't really change anything for me performance-wise. Lastly, as per this test on the peewee github page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds), but it also seems that the author used raw SQL to get that performance. Has anyone been able to do such a high-performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign-key PRAGMA is going to slow things down; why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much, much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
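A minimal sketch of applying those settings from Python's sqlite3 module before a bulk load, with index creation deferred until after the data is in (the table here is a made-up stand-in):

import sqlite3

conn = sqlite3.connect("bulk.db")
conn.execute("PRAGMA journal_mode=wal;")
conn.execute("PRAGMA synchronous=0;")
conn.execute("PRAGMA locking_mode=EXCLUSIVE;")

conn.execute("CREATE TABLE IF NOT EXISTS reads (sample TEXT, value REAL)")  # hypothetical table

rows = [("s1", 1.0), ("s2", 2.0)]  # stand-in for the real bulk data
with conn:  # one transaction for the whole load
    conn.executemany("INSERT INTO reads (sample, value) VALUES (?, ?)", rows)

# create indexes only after the data is loaded, as recommended above
conn.execute("CREATE INDEX IF NOT EXISTS idx_reads_sample ON reads (sample)")
conn.close()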
In all of the documentation where code using atomic appears as a context manager, it's been used as a function call. Since it sounds like you're never seeing your code exit the with block, you're probably not seeing an error about a missing __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower because some expensive bookkeeping has to be done for every write, as opposed to only at the beginning and end.
EDIT
Since the code at the start of the question was changed, can I have some more information about the table you're inserting into? Is it large, and how many indices are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS mount.
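For reference, a hedged sketch of the bulk-insert pattern the peewee docs describe: one transaction around the loop and insert_many in chunks sized to stay under SQLite's variable limit. It reuses the names from the question and assumes peewee 3.x, where chunked is available:

from peewee import chunked

BATCH = 200  # rows per INSERT; keep fields_per_row * BATCH under SQLITE_MAX_VARIABLE_NUMBER

with helper.db.atomic():
    for batch in chunked(expression_samples, BATCH):
        gtd.GeneExpressionRead.insert_many(batch).execute()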

Can anyone tell me what's the point of connection.commit() in Python pyodbc?

I used to be able to run and execute Python using simply an execute statement. This would insert the values 1,2 into A,B accordingly. But starting last week, I got no error, yet nothing happened in my database. No flag, nothing... 1,2 didn't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit() if I have lost the connection to the server. So I have added:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little bit: why do I need this if I know my connection did not get lost?
New to Python - thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
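A small sketch of the usual pyodbc pattern with an explicit commit/rollback pair (the connection string is defined elsewhere and elided here):

import pyodbc

conn = pyodbc.connect(connection_string)  # connection_string defined elsewhere; autocommit is off by default
cursor = conn.cursor()
try:
    cursor.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
    conn.commit()  # make the change visible to other connections
except Exception:
    conn.rollback()  # undo everything since the last commit
    raise

# alternatively, opt in to autocommit at connect time:
# conn = pyodbc.connect(connection_string, autocommit=True)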
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently - for instance, transferring money from one account to another using two UPDATE queries?
You can set autocommit to true if you don't want this behaviour, but there aren't many reasons to do that.

what is the fastest way to update thousands of rows in mysql

Let's assume you have a table with 1M rows and growing...
Every five minutes of every day you run a Python program which has to update some fields of 50K rows.
My question is: what is the fastest way to do the work?
Run those updates in a loop and fire a cursor commit after the last one is executed?
Or generate a file and then run it through the command line?
Create a temp table with a huge and fast insert and then run a single update against the production table?
Use prepared statements?
Split it up into 1K updates per execute, to generate smaller log files?
Turn off logging while running the updates?
Or use CASE expressions in MySQL like in the examples (but this works only up to 255 rows)?
I don't know... has anyone done something like this? What is the best practice? I need to run it as fast as possible...
Here are some ways you could speed up your UPDATEs.
When you UPDATE, the table records are just being rewritten with new data, and all of this has to be done again on an INSERT. That's why you should always use INSERT ... ON DUPLICATE KEY UPDATE instead of REPLACE.
The former is an UPDATE operation in case of a key violation, while the latter is a DELETE / INSERT.
Here's an example: INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1; More on this here.
UPDATE1: It's a good idea to do your inserts all in a single query. This should speed up your UPDATEs. See here on how to do that.
UPDATE2: Now that I have had a chance to read your other sub-questions, here's what I know (see also the sketch after this list):
Instead of a loop, try to execute all the UPDATEs in a single SQL statement with a single commit.
Not sure this is going to make any difference. The SQL queries are more important.
Now this is something you could experiment with. Benchmark it. This kind of thing depends on the size of the TABLE and the INDEXES you have, plus InnoDB or MyISAM.
No idea about this.
Refer to the first point.
Yes, this might speed your stuff up slightly. Also check whether you have slow_query_log turned on. That logs all slow queries to a separate logfile; turn it off too.
Again, refer to the first point.
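As a rough sketch of the "single query, single commit" idea, assuming a MySQL DB-API driver such as PyMySQL and a hypothetical table t with primary key id:

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="appdb")
rows = [(1, "new-a"), (2, "new-b")]  # (id, val) pairs to upsert; stand-in data

with conn.cursor() as cur:
    # one multi-row statement; rows with an existing id are updated, new ids are inserted
    cur.executemany(
        "INSERT INTO t (id, val) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE val = VALUES(val)",
        rows,
    )
conn.commit()  # a single commit for the whole batch
conn.close()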
Regarding the query execution process: the server first parses your query and then executes it. You need to analyze your queries; the less time the server spends parsing them, the faster they execute.
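If the point is cutting parse time, prepared statements (item 4 in the question above) are the usual route. A hedged sketch with mysql-connector-python, which supports prepared cursors; the table, columns and data are placeholders:

import mysql.connector

cnx = mysql.connector.connect(host="localhost", user="app", password="secret", database="appdb")
cur = cnx.cursor(prepared=True)  # the statement is parsed once, then re-executed

stmt = "UPDATE t SET val = %s WHERE id = %s"
for new_val, row_id in [("a", 1), ("b", 2)]:  # stand-in data
    cur.execute(stmt, (new_val, row_id))  # only the parameters are sent after the first call
cnx.commit()
cnx.close()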

MS-Access Database getting very large during inserts

I have a database which I regularly need to import large amounts of data into via some Python scripts. Compacted, the data for a single month's imports takes about 280 MB, but during the import the file size swells to over a GB.
Given the 2 GB size limit on mdb files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting in between each, are there any techniques for avoiding the increase in file size?
Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.
And to forestall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.
If it could help, I could preprocess in sqlite.
Edit:
Just to add some further information (some already listed in my comments):
The data is being generated in Python on a table-by-table basis, and then all of the records for that table are batch-inserted via ODBC
All processing is happening in Python: all the mdb file is doing is storing the data
All of the fields being inserted are valid fields (none are being excluded due to unique key violations, etc.)
Given the above, I'll be looking into how to disable row-level locking via ODBC and considering presorting the data and/or removing then reinstating indexes. Thanks for the suggestions.
Any further suggestions still welcome.
Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over 100 megs when working on a 5 meg file (in other words, after turning off row locking the file barely grew, to about 6 megs). With row locking on, the same operation results in a file well over 100 megs in size.
Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.
Do you have ms-access installed here, or are you just using JET? (JET is the data engine that ms-access uses; you can use JET without Access.)
Open the database in ms-access and go:
Tools->options
On the advanced tab, un-check the box:
[ ] Open databases using record level locking.
This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of 10 times.
There is also a registry setting that you can use here.
And are you using ODBC, or an OLE DB connection?
You can try:
Set rs = New ADODB.Recordset
With rs
    .ActiveConnection = RsCnn
    .Properties("Jet OLEDB:Locking Granularity") = 1
End With
Try the setting from Access (change the setting), exit, re-enter and then compact and repair. Then run your test import... the bloat issue should go away.
There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.
For furher reading and an example see here:
Does ACEDAO support row level locking?
One thing to watch out for is records which are present in the append queries but aren't inserted into the data due to duplicate key values, null required fields, etc. Access will allocate the space taken by the records which aren't inserted.
About the only significant thing I'm aware of is to ensure you have exclusive access to the database file. Which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000) when the Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks then records are inserted once per 4k page rather than as many as can be stuffed into a page. Likely because this made index insert/update operations faster.
Now compacting does indeed put as many records in the same 4k page as possible but that isn't of help to you.
A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.
Also, the way the insertions take place matters with regards to the way the file size balloons... For example: batched, vs. one/few records at a time, sorted (relative to particular index(es)), number of indexes (as you mentioned readily dropping some during the insert phase)...
Tentatively, a pre-processing approach would be to store the new rows in a separate linked table, heap fashion (no indexes), then sort/index this data in a minimal fashion, and "bulk load" it to its real destination. Similar pre-processing in SQLite (as hinted in the question) would serve the same purpose. Keeping it "all MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefully ;-)]...)
EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk space and harmony, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management but I suspect that in several cases, node operations cause a particular node (or much of it) to be copied anew, and the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur and hence we slow the growth.
EDIT again: After reading and discussing things with Tony Toews and Albert Kallal, it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to set the database to single-user mode to avoid this. (Read Tony's and Albert's responses for more details.)
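Returning to the pre-processing idea above, a hedged sketch of the SQLite variant hinted at in the question: stage the rows in an index-free SQLite table, index once, then push them to the MDB over ODBC in sorted order. The driver name, table and columns are assumptions:

import sqlite3
import pyodbc

# 1. stage the month's rows in a heap-style SQLite table (no indexes yet)
incoming_rows = [(1, "a"), (2, "b")]  # stand-in for the generated data
stage = sqlite3.connect("staging.db")
stage.execute("CREATE TABLE IF NOT EXISTS staging (k INTEGER, payload TEXT)")
with stage:
    stage.executemany("INSERT INTO staging (k, payload) VALUES (?, ?)", incoming_rows)

# 2. index once, after the load, so rows can be read back in key order
stage.execute("CREATE INDEX IF NOT EXISTS idx_staging_k ON staging (k)")

# 3. bulk-load into the MDB in sorted order
acc = pyodbc.connect(r"Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\data\target.mdb")
cur = acc.cursor()
cur.executemany(
    "INSERT INTO real_table (k, payload) VALUES (?, ?)",
    stage.execute("SELECT k, payload FROM staging ORDER BY k").fetchall(),
)
acc.commit()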
Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.
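A hedged sketch of that drop-indexes / load / re-create sequence over ODBC (the driver, index and table names are hypothetical; the compact steps still have to be done from Access or a separate compact utility):

import pyodbc

acc = pyodbc.connect(r"Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\data\target.mdb")
cur = acc.cursor()

cur.execute("DROP INDEX idx_k ON real_table")  # remove the index before the bulk load
rows = [(1, "a"), (2, "b")]  # stand-in for the month's data
cur.executemany("INSERT INTO real_table (k, payload) VALUES (?, ?)", rows)
acc.commit()

cur.execute("CREATE INDEX idx_k ON real_table (k)")  # re-instate the index afterwards
acc.commit()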
I find I am able to link from Access to SQLite and to run a make-table query to import the data. I used this ODBC driver: http://www.ch-werner.de/sqliteodbc/ and created a User DSN.
File --> Options --> Current Database --> check the options below:
* Use the Cache format that is compatible with Microsoft Access 2010 and later
* Clear Cache on Close
Then your file will be compacted back to its original size when it is saved.
