Combine inserts into one transaction with Python SQLite3

I am trying to insert thousands of rows into SQLite3 with INSERT, but the time it takes is far too long. I've heard speed is greatly increased if the inserts are combined into one transaction. However, I cannot seem to get SQLite3 to skip checking that the file is written to the hard disk.
this is a sample:
if repeat != 'y':
    c.execute('INSERT INTO Hand (number, word) VALUES (null, ?)', [wordin[wordnum]])
    print wordin[wordnum]
    data.commit()
This is what I have at the beginning:
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')
However, it does not seem to make a difference. A way to increase the insert speed would be much appreciated.

According to the SQLite documentation, a transaction started with BEGIN should be ended with COMMIT:
Transactions can be started manually using the BEGIN command. Such
transactions usually persist until the next COMMIT or ROLLBACK
command. But a transaction will also ROLLBACK if the database is
closed or if an error occurs and the ROLLBACK conflict resolution
algorithm is specified. See the documentation on the ON CONFLICT
clause for additional information about the ROLLBACK conflict
resolution algorithm.
So, your code should be like this:
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')
if repeat != 'y':
    c.execute('INSERT INTO Hand (number, word) VALUES (null,?)', [wordin[wordnum]])
    print wordin[wordnum]

data.commit()
c.execute('commit')

https://stackoverflow.com/a/3689929/1147726 answers the question. execute('begin') does not have any effect. Apparently, a connection.commit() is sufficient.
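For anyone skimming, a minimal sketch of that single-commit pattern (the Hand schema and the word list here are placeholders standing in for the asker's data, not taken from the question verbatim):

import sqlite3

words = ['alpha', 'beta', 'gamma']  # placeholder data standing in for wordin

data = sqlite3.connect('databasenew')
c = data.cursor()
# hypothetical schema matching the question's INSERT
c.execute('CREATE TABLE IF NOT EXISTS Hand (number INTEGER PRIMARY KEY, word TEXT)')
# with the default isolation_level, sqlite3 opens a transaction implicitly
# before the first INSERT and keeps it open until commit()
for word in words:
    c.execute('INSERT INTO Hand (number, word) VALUES (null, ?)', (word,))
data.commit()   # one commit, so one transaction and one sync to disk
data.close()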

(In case someone is still looking for an answer to this)
You should use executemany if you are just doing thousands of inserts in succession.
Look at What is the optimized way to insert large number of records (more than 40,000) in sqlite3
I just struggled with a LOT (on the order of millions) of executes that were taking about 30 minutes to complete. Switching to executemany got it down to about 10 minutes.
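For the question's Hand table, an executemany version might look like the following sketch (the schema and the word list are assumptions, not the asker's real data):

import sqlite3

conn = sqlite3.connect('databasenew')
c = conn.cursor()
# hypothetical schema matching the question's INSERT
c.execute('CREATE TABLE IF NOT EXISTS Hand (number INTEGER PRIMARY KEY, word TEXT)')

words = ['alpha', 'beta', 'gamma']   # placeholder data
rows = [(w,) for w in words]         # executemany expects a sequence of parameter tuples
c.executemany('INSERT INTO Hand (number, word) VALUES (null, ?)', rows)
conn.commit()                        # still one transaction for the whole batch
conn.close()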

You can use executemany, see this SO question: python sqlite question - Insert method

Related

Speed up insertion of pandas dataframe using fast_executemany Python pyodbc

I am trying to insert data contained in a .csv file from my pc to a remote server. The values are inserted in a table that contains 3 columns, namely Timestamp, Value and TimeseriesID. I have to insert approximately 3000 rows at a time, therefore I am currently using pyodbc and executemany.
My code up to now is the one shown below:
with contextlib.closing(pyodbc.connect(connection_string, autocommit=True)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True  # new in pyodbc 4.0.19
        # Insert values into the DataTable table
        insert_df = df[["Time (UTC)", column]]
        insert_df["id"] = timeseriesID
        insert_df = insert_df[["id", "Time (UTC)", column]]
        sql = "INSERT INTO %s (%s, %s, %s) VALUES (?, ?, ?)" % (
            sqltbl_datatable, 'TimeseriesId', 'DateTime', 'Value')
        params = [i.tolist() for i in insert_df.values]
        cursor.executemany(sql, params)
As I am using pyodbc 4.0.19, I have the option fast_executemany set to True, which is supposed to speed things up. However, for some reason, I do not see any great improvement when I enable the fast_executemany option. Is there any alternative way I could use to speed up insertion of my file?
Moreover, regarding the performance of the code shown above, I noticed that when I disable the autocommit=True option and instead call cursor.commit() at the end of my code, the data is imported significantly faster. Is there any specific reason why this happens that I am not aware of?
Any help would be greatly appreciated :)
Regarding the cursor.commit() speed-up that you are noticing: when you use autocommit=True you are asking the code to execute one database transaction per insert. This means the code resumes only after the database confirms the data is stored on disk. When you call cursor.commit() after the numerous INSERTs you are effectively executing one database transaction, and the data is held in RAM in the interim (it may be written to disk, but not all of it at the moment you instruct the database to finalize the transaction).
The process of finalizing a transaction typically entails updating tables on disk, updating indexes, flushing logs, syncing copies, etc., which is costly. That is why you observe such a speed-up between the two scenarios you describe.
When going the faster way, please note that until you execute cursor.commit() you cannot be 100% sure that the data is in the database, so you may need to reissue the query in case of an error (any partial transaction is going to be rolled back).
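A sketch of the commit-at-the-end variant described above (the connection string, SQL text, and parameters are placeholders standing in for the question's own values):

import contextlib
import pyodbc

connection_string = "..."  # same connection string as in the question
sql = "INSERT INTO DataTable (TimeseriesId, DateTime, Value) VALUES (?, ?, ?)"  # hypothetical table name
params = [(1, "2019-01-01 00:00:00", 0.0)]  # placeholder rows

with contextlib.closing(pyodbc.connect(connection_string, autocommit=False)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True
        cursor.executemany(sql, params)
    conn.commit()  # one transaction's worth of disk work instead of one per row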

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quickly. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed, etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up).
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (with a large number of columns) and took 0.029 seconds outside of Python. It takes significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)

    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, it has to page out to disk to handle it.
If you reduce the number of columns selected and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
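If it helps, here is a rough sketch of the Bytes_sent check suggested above, using MySQLdb (the credentials and the query text are placeholders, not the asker's real values):

import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="user", passwd="secret", db="racing")
cursor = conn.cursor()

cursor.execute("SHOW SESSION STATUS LIKE 'Bytes_sent'")
before = int(cursor.fetchone()[1])

query = "SELECT ..."        # paste the slow PPs query from the question here
cursor.execute(query)
rows = cursor.fetchall()    # pull the whole result set across the wire

cursor.execute("SHOW SESSION STATUS LIKE 'Bytes_sent'")
after = int(cursor.fetchone()[1])

print("rows:", len(rows), "bytes sent by MySQL:", after - before)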
I know this is late, however, I have run into similar issues with MySQL and Python. My solution is to run the queries from another language: I use R to make my queries, which is blindingly fast, do what I can in R, and then send the data to Python if need be for more general programming, although R has many general-purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this sidesteps the heart of the problem.

Remove all data from table but last N entries

I'm using psycopg2 with Python.
I'd like to periodically flush data from my db. I've set up a task with Timer for this. I had asked this question before, but using the answer listed there freezes up my machine (keyboard stops responding and the entire system grinds to a halt). Instead, I would like to delete all entries in my table except the last N (not sure that this is the right approach either).
Basically, there is another python process that is running (a separate executable), which is populating the db that I wish to interrogate. It seems that if I delete all entries while that other process is running, it can lead to the freeze. I don't know of a safe way in which I can remove entries; it's almost as if the other process is relying on an incrementing ID as it writes to the db.
If anyone could help me work this out it'd be greatly appreciated. Thoughts?
A possible solution is to run a DELETE on all ids except those returned by select ... order by pk desc limit N given an autoincremental pk. If no such pk exists, having a created_date and ordering by it should do the same.
Untested example:
import psycopg2

connection = psycopg2.connect('dbname=test user=postgres')
cursor = connection.cursor()
query = ('delete from my_table where id not in '
         '(select id from my_table order by id desc limit 30)')
cursor.execute(query)
connection.commit()  # necessary: psycopg2 does not autocommit by default
cursor.close()
connection.close()
This is probably much faster:
CREATE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl ORDER BY <undisclosed> LIMIT <N>;
TRUNCATE TABLE tbl;
INSERT INTO tbl SELECT * FROM tbl_tmp;
Do it all in one session. Specifics depend on additional circumstances you did not disclose.
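A sketch of that same temp-table idea driven from psycopg2 in one session/transaction (the table name, ordering column, and N are placeholders; adjust to your schema):

import psycopg2

N = 30
connection = psycopg2.connect('dbname=test user=postgres')
with connection:                      # commits on success, rolls back on error
    with connection.cursor() as cursor:
        cursor.execute("CREATE TEMP TABLE tbl_tmp AS "
                       "SELECT * FROM my_table ORDER BY id DESC LIMIT %s", (N,))
        cursor.execute("TRUNCATE TABLE my_table")
        cursor.execute("INSERT INTO my_table SELECT * FROM tbl_tmp")
connection.close()                    # the with block does not close the connection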
Compare to this related, comprehensive answer (your case is simpler):
Remove duplicates from table based on multiple criteria and persist to other table

Merging SQLite databases is driving me mad. Help?

I've got 32 SQLite (3.7.9) databases with 3 tables each that I'm trying to merge together using the idiom that I've found elsewhere (each db has the same schema):
attach db1.sqlite3 as toMerge;
insert into tbl1 select * from toMerge.tbl1;
insert into tbl2 select * from toMerge.tbl2;
insert into tbl3 select * from toMerge.tbl3;
detach toMerge;
and rinse-repeating for the entire set of databases. I do this using Python and the sqlite3 module:
for fn in filelist:
    completedb = sqlite3.connect("complete.sqlite3")
    c = completedb.cursor()
    c.execute("pragma synchronous = off;")
    c.execute("pragma journal_mode=off;")
    print("Attempting to merge " + fn + ".")
    query = "attach '" + fn + "' as toMerge;"
    c.execute(query)
    try:
        c.execute("insert into tbl1 select * from toMerge.tbl1;")
        c.execute("insert into tbl2 select * from toMerge.tbl2;")
        c.execute("insert into tbl3 select * from toMerge.tbl3;")
        c.execute("detach toMerge;")
        completedb.commit()
    except sqlite3.Error as err:
        print("Error! ", type(err), " Error msg: ", err)
        raise
2 of the tables are fairly small, only 50K rows per db, while the third (tbl3) is larger, about 850-900K rows. Now, what happens is that the inserts progressively slow down until I get to about the fourth database, when they grind to a near halt (on the order of a megabyte or two in file size added every 1-3 minutes to the combined database). In case it was Python, I've even tried dumping out the tables as INSERTs (.insert; .out foo; sqlite3 complete.db < foo is the skeleton, found here) and combining them in a bash script using the sqlite3 CLI to do the work directly, but I get exactly the same problem.
The table setup of tbl3 isn't too demanding - a text field containing a UUID, two integers, and four real values. My worry is that it's the number of rows, because I ran into exactly the same trouble at exactly the same spot (about four databases in) when the individual databases were an order of magnitude larger in terms of file size with the same number of rows (I trimmed the contents of tbl3 significantly by storing summary stats instead of raw data). Or maybe it's the way I'm performing the operation? Can anyone shed some light on this problem that I'm having before I throw something out the window?
Try adding or removing indexes/primary key for the larger table.
You didn't mention the OS you were using or the db file sizes. Windows can have issues with files that are bigger than 2Gb depending on what version.
In any case, since this is a glorified batch script why not get rid of the for loop, get the filename from sys.argv, and then just run it once for each merge db. That way you will never have to deal with memory issues from doing too much in one process.
Mind you, if you end the loop with the following, that will likely also fix things:
c.close()
completedb.close()
You say that the same thing occurs when you follow this process using the CLI and quitting after every db. I assume that you mean the Python CLI, and that quitting means you exit and restart Python. If that is true, and it still develops a problem every fourth database, then something is wrong with your SQLite shared library. It shouldn't be keeping state like that.
If I were in your shoes, I would stop using attach and just open multiple connections in Python, then move the data in batches of about 1000 records per commit. It would be slower than your technique because all the data moves in and out of Python objects, but I think it would also be more reliable. Open the complete db, then loop around opening a second db, copying, then closing the second db. For the copying, I would use OFFSET and LIMIT on the SELECT statements to process batches of 100 records, then commit, then repeat.
In fact, I would also count the completedb records, and the second db records before copying, then after copying count the completedb records to ensure that I had copied the expected amount. Also, you would be keeping track of the value of the next OFFSET and I would write that to a text file right after committing, so that I could interrupt and restart the process at any time and it would carry on where it left off.
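A sketch of that batched-copy approach (the row-count verification and the checkpoint file are left out for brevity; the batch size is a placeholder, and filelist and the three table names follow the question):

import sqlite3

BATCH = 1000                               # records per commit, as suggested above
filelist = ["db1.sqlite3", "db2.sqlite3"]  # placeholder list of source databases

complete = sqlite3.connect("complete.sqlite3")
for fn in filelist:
    source = sqlite3.connect(fn)
    for table in ("tbl1", "tbl2", "tbl3"):
        offset = 0
        while True:
            rows = source.execute(
                "SELECT * FROM %s LIMIT ? OFFSET ?" % table,
                (BATCH, offset)).fetchall()
            if not rows:
                break
            placeholders = ",".join("?" * len(rows[0]))
            complete.executemany(
                "INSERT INTO %s VALUES (%s)" % (table, placeholders), rows)
            complete.commit()              # one transaction per batch
            offset += BATCH
    source.close()
complete.close()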

Join with Python's SQLite module is slower than doing it manually

I am using Python's built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, and the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins ").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
There is a discussion about it here: http://www.mail-archive.com/python-list#python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is advice on how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)
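If it's more convenient, the same indexes can be created from Python before the join runs; a quick sketch (the database path here is a placeholder):

import sqlite3

conn = sqlite3.connect("proteins.db")   # placeholder path
conn.executescript("""
    CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id);
    CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId);
""")
conn.commit()
conn.close()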
I wanted to update this because I am noticing the same issue and we are now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3 table inner join clocks in around ~12 minutes of run time in python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed and the EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in, so I used an example with {foo} here):
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Then run os.system from inside Python and send the .sql file to sqlite3:
import os
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out, and re-instantiate any connection objects afterward if you're going back to working with SQLite within Python.
Hope this helps and if anyone has figured the source of this out, please link to it!
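For what it's worth, if you'd rather avoid shell redirection, a roughly equivalent sketch uses subprocess to feed the .sql file to the sqlite3 CLI on stdin (database is the same placeholder variable as above):

import subprocess

with open("filename.sql", "rb") as script:
    subprocess.run(["sqlite3", database], stdin=script, check=True)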
