Join with Pythons SQLite module is slower than doing it manually - python

I am using pythons built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
"FROM Proteins, Annotations " +
"WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins ").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all), the performance it great. I'm still wondering why this is so slow with the sqlite3-module.

There is a discussion about it here: http://www.mail-archive.com/python-list#python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is an advice how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite

You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)

I wanted to update this because I am noticing the same issue and we are now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3 table inner join clocks in around ~12 minutes of run time in python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed and the EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in so I used an example with {foo} here)
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {Foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside python and send the .sql file to sqlite3
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out and you'll have to re-instantiate any connection objects afterward if you're going back to working in sqlite within python.
Hope this helps and if anyone has figured the source of this out, please link to it!

Related

Python sqlite3 never returns an inner join with 28 milion+ rows

Sqlite database with two tables, each over 28 million rows long. Here's the schema:
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);
Here's an example row from MASTER:
ID PATH FILE FULLPATH MODIFIED_TIME
---------- --------------- ---------- ----------------------- -------------
1 e:\ae/BONDS/0/0 100.bin e:\ae/BONDS/0/0/100.bin 1213903192.5
The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.
If I execute the following query in sqlite, I get the results I expect:
select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;
That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.
However, if I execute the same query in Python:
changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")
It will never return - I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of cpu or memory - it stays pretty static. And the process itself doesn't actually seem to go unresponsive - however, I can't Ctrl-C to break, and have to kill the process to actually stop the script.
I do not have these issues with a small test database - everything runs fine in Python.
I realize this is a large amount of data, but if sqlite is handling the actual queries, python shouldn't be choking on it, should it? I can do other large queries from python against this database. For instance, this works:
new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")
Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?
Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.
Python tends to have older versions of the SQLite library, especially Python 2.x, where it is not updated.
However, your actual problem is that the query is slow.
Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.

python 5x slower than perl mySql query

I am translating a code from perl to python.
Even if it works exactly the same, there is a part of the code that is 5x slower in python than in perl and I cannot figure out why.
Both perl and python are in the same machine, as well as the mysql database.
The code queries a db to download all columns of a table and then process each row.
There are more than 5 million rows to process and the big issue is in retrieving the data from the database to the python processing.
Here I attach the two code samples:
Python:
import os
import mysql.connector **<--- import mySqlDb**
import time
outDict = dict()
## DB parameters
db = mysql.connector.connect **<----- mySqlDb.connect( ...**
(host=dbhost,
user=username, # your username
passwd=passw, # your password
db=database) # name of the data base
cur = db.cursor(prepared=True)
sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;"
cur.execute(sql)
print('\t eDiVa public omics start')
s = time.time()
sz = 1000
rows = cur.fetchall()
for row in rows:
## process out dict
print time.time() - s
cur.close()
db.close()
While here comes the Perl equivalent script:
use strict;
use Digest::MD5 qw(md5);
use DBI;
use threads;
use threads::shared;
my $dbh = DBI->connect('dbi:mysql:'.$database.';host='.$dbhost.'',$username,$pass)
or die "Connection Error!!\n";
my $sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat\;";
## prepare statement and query
my $stmt = $dbh->prepare($sql);
$stmt->execute or die "SQL Error!!\n";
my $c = 0;
#process query result
while (my #res = $stmt->fetchrow_array)
{
$edivaStr{ $res[0].";".$res[1] } = $res[4].",".$res[2];
$c +=1;
}
print($c."\n");
## close DB connection
$dbh->disconnect();
The runtime for these two scripts is:
~40s for the Perl script
~200s for the Python script
I cannot figure out why this happens [I tried using fetchone() or fetchmany() to see if there are memory issues but the runtime at most reduces 10% from the 200s].
My main problem is understanding why there is such a relevant performance difference between the two functionally equivalent code blocks.
Any idea about how can I verify what is happening would be greatly appreciated.
Thanks!
UPDATE ABOUT SOLUTION
Peeyush'comment could be an answer and I'd like him to post it because it allowed me to find a solution.
The problem is the python connector. I just changed that for mySqlDb module which is a C compiled module. That made the python code slightly faster than the perl code.
I added the changes in the python code with a <---- "" to show how easy it has been to gain performance.
the cursor.fetchall means you load all your data in memory at once, instead of doing it slowly when needed.
Replace
row = cur.fetchall()
for row in rows:
by
for row in cur:
The problem is the python connector. I just changed that for mySqlDb module which is a C compiled module. That made the python code slightly faster than the perl code.
I added the changes in the python code with a <---- "" to show how easy it has been to gain performance
I encounter the same problem. With Python cx_Oracle, here's my environment performance stats -- Python takes very long to connect to Oracle DB.
connect to DB, elaps:0.38108
run query, elaps:0.00092
get filename from table, elaps:8e-05
run query to read BLOB, elaps:0.00058
decompress data and write to file, elaps:0.00187
close DB connection, elaps:0.00009
Over all, elaps:0.38476
same function in Perl, elaps:0.00213
If anyone else is struggling with using Python and MySQL, I think that Oracle's mysql.connector for Python tends to be really slow for doing UPDATEs and DELETEs. I've found that mysql.connector is really fast for doing SELECT queries, and using .executemany() for doing INSERTs is blazing fast as well. However, UPDATEs and DELETEs are painfully slow from what I've found. The solution I've decided to go with is to move my data over to PostgreSQL simply because I know that Postgres has a really good Python library (psycopg2). Anyway, hope my feedback helps!
Python for loops are quite slow. You should look into an alternative to treat your query.
From python wiki : https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In an another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the case bottlekneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
# Comma Race List
race_list = ','.join(map(str, race_ids))
# PPs Query
query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
"FROM hrdb_raceindex raceindex "
"INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
"INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
"LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
"WHERE raceindex.race_id IN (" + race_list + ") "
"ORDER BY runlines.line_date DESC;")
print(query)
# Run Query
cursor.execute(query)
# Query Fields
fields = [i[0] for i in cursor.description]
# PPs List
pps = []
# Loop Results
for row in cursor:
a = 0
#this_pp = {}
#for i, value in enumerate(row):
# this_pp[fields[i]] = value
#pps.append(this_pp)
return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.
Tho you have only 501 rows it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and when faced with this large result set it has to out page out to disk to handle the result set.
If you reduce the number of columns called and/or constrain to, say LIMIT 10 on your query do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late, however, I have run into similar issues with mysql and python. My solution is to use queries using another language...I use R to make my queries which is blindly fast, do what I can in R and then send the data to python if need be for more general programming, although R has many general purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this side steps the heart of the problem.

Merging SQLite databases is driving me mad. Help?

I've got 32 SQLite (3.7.9) databases with 3 tables each that I'm trying to merge together using the idiom that I've found elsewhere (each db has the same schema):
attach db1.sqlite3 as toMerge;
insert into tbl1 select * from toMerge.tbl1;
insert into tbl2 select * from toMerge.tbl2;
insert into tbl3 select * from toMerge.tbl3;
detach toMerge;
and rinse-repeating for the entire set of databases. I do this using python and the sqlite3 module:
for fn in filelist:
completedb = sqlite3.connect("complete.sqlite3")
c = completedb.cursor()
c.execute("pragma synchronous = off;")
c.execute("pragma journal_mode=off;")
print("Attempting to merge " + fn + ".")
query = "attach '" + fn + "' as toMerge;"
c.execute(query)
try:
c.execute("insert into tbl1 select * from toMerge.tbl1;")
c.execute("insert into tbl2 select * from toMerge.tbl2;")
c.execute("insert into tbl3 select * from toMerge.tbl3;")
c.execute("detach toMerge;")
completedb.commit()
except sqlite3.Error as err:
print "Error! ", type(err), " Error msg: ", err
raise
2 of the tables are fairly small, only 50K rows per db, while the third (tbl3) is larger, about 850 - 900K rows. Now, what happens is that the inserts progressively slow down until I get to about the fourth database when they grind to a near halt (on the order of a a megabyte or two in file size added every 1-3 minutes to the combined database). In case it was python, I've even tried dumping out the tables as INSERTs (.insert; .out foo; sqlite3 complete.db < foo is the skeleton, found here) and combining them in a bash script using the sqlite3 CLI to do the work directly, but I get exactly the same problem.
The table setup of tbl3 isn't too demanding - a text field containing a UUID, two integers, and four real values. My worry is that it's the number of rows, because I ran into exactly the same trouble at exactly the same spot (about four databases in) when the individual databases were an order of magnitude larger in terms of file size with the same number of rows (I trimmed the contents of tbl3 significantly by storing summary stats instead of raw data). Or maybe it's the way I'm performing the operation? Can anyone shed some light on this problem that I'm having before I throw something out the window?
Try adding or removing indexes/primary key for the larger table.
You didn't mention the OS you were using or the db file sizes. Windows can have issues with files that are bigger than 2Gb depending on what version.
In any case, since this is a glorified batch script why not get rid of the for loop, get the filename from sys.argv, and then just run it once for each merge db. That way you will never have to deal with memory issues from doing too much in one process.
Mind you, if you end the loop with the following that will likely also fix things.
c.close()
completedb.close()
You say that the same thing occurs when you follow this process using the CLI and quitting after every db. I assume that you mean the Python CLI, and quitting means that you exit and restart Python. If that is true, and it still develops a problem every 4th database, then something is wrong with your SQLITE shared library. It shouldn't be keeping state like that.
If I were in your shoes, I would stop using attach and just open multiple connections in Python, then move the data in batches of about 1000 records per commit. It would be slower than your technique because all the data moves in and out of Python objects, but I think it would also be more reliable. Open the complete db, then loop around opening a second db, copying, then closing the second db. For the copying, I would use OFFSET and LIMIT on the SELECT statements to process batches of 100 records, then commit, then repeat.
In fact, I would also count the completedb records, and the second db records before copying, then after copying count the completedb records to ensure that I had copied the expected amount. Also, you would be keeping track of the value of the next OFFSET and I would write that to a text file right after committing, so that I could interrupt and restart the process at any time and it would carry on where it left off.

Python script to diff same table in two different databases

I am about to write a python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, and I can either use, or use as a starting point for rolling my own at least. The idea is to diff the data between specific tables, and then to store the diff as SQL INSERT statements to be applied to the earlier version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
# return all rows in table 2 that are not in table1
pass
def persist_rows_tofile(rows, tablename):
# save rows to file
pass
dbnames=('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')
for table in tables_to_process:
table1 = dbnames[0]+'.'+table
table2 = dbnames[1]+'.'+table
rows = diff_table(table1, table2)
if len(rows):
persist_rows_tofile(rows, table)
Is this a good way to write such a script or could it be improved?. I suspect it could be improved by cacheing database connections etc (which I have left out - because I am not too familiar with SqlAlchemy etc).
Any tips on how to add SqlAlchemy and to generally improve such a script?
To move data between two databases I use pg_comparator. It's like diff and patch for sql! You can use it to swap the order of columns but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron-job runs every five minutes and pushes all changes on the "master"-database to the "slave"-databases. Especially handy if you only need distribute a single table, or a not all columns per table etc.

Categories

Resources