I am translating code from Perl to Python.
Even though it works exactly the same, one part of the code is 5x slower in Python than in Perl and I cannot figure out why.
Both Perl and Python run on the same machine, as does the MySQL database.
The code queries a DB to download all columns of a table and then processes each row.
There are more than 5 million rows to process, and the big issue is in retrieving the data from the database into the Python processing.
Here I attach the two code samples:
Python:
import os
import mysql.connector   # <--- changed to: import MySQLdb
import time

outDict = dict()

## DB parameters
db = mysql.connector.connect(   # <--- changed to: MySQLdb.connect(
    host=dbhost,
    user=username,    # your username
    passwd=passw,     # your password
    db=database)      # name of the database
cur = db.cursor(prepared=True)

sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;"
cur.execute(sql)

print('\t eDiVa public omics start')
s = time.time()
sz = 1000
rows = cur.fetchall()
for row in rows:
    pass    ## process out dict
print(time.time() - s)

cur.close()
db.close()
While here comes the Perl equivalent script:
use strict;
use Digest::MD5 qw(md5);
use DBI;
use threads;
use threads::shared;

my $dbh = DBI->connect('dbi:mysql:'.$database.';host='.$dbhost, $username, $pass)
    or die "Connection Error!!\n";

my $sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;";

## prepare statement and query
my $stmt = $dbh->prepare($sql);
$stmt->execute or die "SQL Error!!\n";

my $c = 0;

# process query result
while (my @res = $stmt->fetchrow_array)
{
    $edivaStr{ $res[0].";".$res[1] } = $res[4].",".$res[2];
    $c += 1;
}
print($c."\n");

## close DB connection
$dbh->disconnect();
The runtime for these two scripts is:
~40s for the Perl script
~200s for the Python script
I cannot figure out why this happens (I tried using fetchone() and fetchmany() to rule out memory issues, but the runtime drops by at most 10% from the 200s).
My main problem is understanding why there is such a large performance difference between two functionally equivalent code blocks.
Any idea about how I can verify what is happening would be greatly appreciated.
Thanks!
UPDATE ABOUT SOLUTION
Peeyush's comment could be an answer and I'd like him to post it, because it allowed me to find the solution.
The problem is the Python connector. I just changed it to the MySQLdb module, which is a compiled C extension. That made the Python code slightly faster than the Perl code.
I marked the changes in the Python code with <--- to show how little it took to gain this performance.
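For reference, here is roughly what the changed lines look like with MySQLdb (a sketch using the same placeholder connection parameters as above):

import MySQLdb    # compiled C extension, replaces mysql.connector

db = MySQLdb.connect(host=dbhost,      # same placeholder parameters as before,
                     user=username,    # just a different driver
                     passwd=passw,
                     db=database)
cur = db.cursor()
cur.execute("select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;")
for row in cur:    # iterate instead of fetchall() so 5M rows are not held in memory at once
    pass           # process row into outDict
cur.close()
db.close()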
cursor.fetchall() means you load all of your data into memory at once, instead of fetching it lazily as it is needed.
Replace
rows = cur.fetchall()
for row in rows:
by
for row in cur:
I encountered the same problem. With Python and cx_Oracle, here are my timing stats; Python takes very long just to connect to the Oracle DB:
connect to DB, elaps:0.38108
run query, elaps:0.00092
get filename from table, elaps:8e-05
run query to read BLOB, elaps:0.00058
decompress data and write to file, elaps:0.00187
close DB connection, elaps:0.00009
Over all, elaps:0.38476
same function in Perl, elaps:0.00213
If anyone else is struggling with using Python and MySQL, I think that Oracle's mysql.connector for Python tends to be really slow for doing UPDATEs and DELETEs. I've found that mysql.connector is really fast for doing SELECT queries, and using .executemany() for doing INSERTs is blazing fast as well. However, UPDATEs and DELETEs are painfully slow from what I've found. The solution I've decided to go with is to move my data over to PostgreSQL simply because I know that Postgres has a really good Python library (psycopg2). Anyway, hope my feedback helps!
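For example, a batched INSERT with .executemany() looks like this (a sketch with a made-up table and columns):

rows = [("chr1", 100), ("chr1", 250), ("chr2", 42)]   # made-up data
cur.executemany("INSERT INTO positions (chr, pos) VALUES (%s, %s)", rows)
db.commit()   # mysql.connector does not autocommit by default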
Python for loops are quite slow. You should look into an alternative way to process your query results.
From the Python wiki: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops
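For instance, the per-row work in the original question only builds a lookup dict, which can be written as a dict comprehension straight over the cursor (a sketch, assuming the columns arrive in the order of the SELECT):

# key is "chr;pos", value is "region,lengthofrepeat", mirroring the Perl %edivaStr hash
outDict = {
    "%s;%s" % (chrom, pos): "%s,%s" % (region, lengthofrepeat)
    for (chrom, pos, lengthofrepeat, copyNum, region) in cur
}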
Related
I have to carry out some statistical treatments on data that is stored in PostgreSQL tables. I have been hesitating between using R and Python.
With R I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;")
which is very fast: it takes only 5 seconds to fetch a table with ~200,000 rows and 15 columns and almost no NULLs in it.
With Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;")
my_table = cursor.fetchall()
and surprisingly, it causes my Python session to freeze and my computer to crash.
As I use these libraries as "black boxes", I don't understand why something that is so quick in R can be so slow (and thus almost impossible to use in practice) in Python.
Can someone explain this difference in performance, and can someone tell if there exists a more efficient method to fetch a pgSQL table in Python?
I am no expert in R, but very obviously what dbGetQuery() (actually: what dbFetch()) returns is a lazy object that does not load all results into memory -- otherwise it would of course take ages and eat all your RAM too.
As for Python / psycopg2, you definitely DON'T want to fetchall() a huge dataset. The proper solution here is to use a server-side cursor and iterate over it.
Edit - answering the questions in your comments:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As spelled out in the example I linked to, what "does the trick" is using a server-side cursor, which is done (in psycopg2) by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor
psycopg2 creates a server-side cursor, which prevents all of the
records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
The DictCursor stuff is actually irrelevant (and should not have been mentioned in this example, since it obviously confuses newcomers).
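Put together, the server-side cursor pattern looks something like this (a sketch; conn_string, the table name, and process() are placeholders):

import psycopg2

conn = psycopg2.connect(conn_string)            # conn_string assumed to be defined elsewhere
cursor = conn.cursor('my_server_side_cursor')   # naming the cursor makes it server-side
cursor.itersize = 2000                          # rows fetched per round trip while iterating
cursor.execute("SELECT * FROM some_table;")
for row in cursor:                              # rows stream in batches instead of all at once
    process(row)                                # placeholder for your own processing
cursor.close()
conn.close()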
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I don't know zilch about R and its implementation - I deduce from the behaviour you describe that whatever dbFetch returns is a lazy object - but there's nothing magical about an object that lazily fetches values from an external source. Python's file object is a well-known example:
with open("/some/huge/file.txt") as f:
    for line in f:
        print(line)
In the above snippet, the file object f fetches data from disk only when needed. All that needs to be stored is the file pointer position (and a buffer of the last N bytes that were read from disk, but that's an implementation detail).
If you want to learn more, read about Python's iterables and iterators.
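As a small illustration of the same lazy idea applied to a DB-API cursor, a generator like this (a sketch) yields rows on demand instead of materializing them all:

def iter_rows(cursor, size=1000):
    """Yield rows from a DB-API cursor, fetching `size` rows per round trip."""
    while True:
        batch = cursor.fetchmany(size)
        if not batch:
            return
        for row in batch:
            yield row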
The typical MySQLdb library query can use a lot of memory and perform poorly in Python, when a large result set is generated. For example:
cursor.execute("SELECT id, name FROM `table`")
for i in xrange(cursor.rowcount):
id, name = cursor.fetchone()
print id, name
There is an optional cursor class that will fetch just one row at a time, really speeding up the script and cutting its memory footprint a lot.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(user="user", passwd="password", db="dbname",
                       cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id, name FROM users")
row = cur.fetchone()
while row is not None:
    doSomething()
    row = cur.fetchone()
cur.close()
conn.close()
But I can't find anything about using SSCursor with with nested queries. If this is the definition of doSomething():
def doSomething():
    cur2 = conn.cursor()
    cur2.execute('select id,x,y from table2')
    rows = cur2.fetchall()
    for row in rows:
        doSomethingElse(row)
    cur2.close()
then the script throws the following error:
_mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now")
It sounds as if SSCursor is not compatible with nested queries. Is that true? If so that's too bad because the main loop seems to run too slowly with the standard cursor.
This problem is discussed a bit in the MySQLdb User's Guide, under the heading of the threadsafety attribute (emphasis mine):
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result()); with the latter you must ensure all the rows have been read before another query can be executed.
The documentation for the MySQL C API function mysql_use_result() gives more information about your error message:
When using mysql_use_result(), you must execute mysql_fetch_row()
until a NULL value is returned, otherwise, the unfetched rows are
returned as part of the result set for your next query. The C API
gives the error Commands out of sync; you can't run this command now
if you forget to do this!
In other words, you must completely fetch the result set from any unbuffered cursor (i.e., one that uses mysql_use_result() instead of mysql_store_result() - with MySQLdb, that means SSCursor and SSDictCursor) before you can execute another statement over the same connection.
In your situation, the most direct solution would be to open a second connection to use while iterating over the result set of the unbuffered query. (It wouldn't work to simply get a buffered cursor from the same connection; you'd still have to advance past the unbuffered result set before using the buffered cursor.)
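A minimal sketch of that two-connection approach, based on the code in the question (the inner WHERE clause is made up for illustration):

import MySQLdb
import MySQLdb.cursors

conn_outer = MySQLdb.connect(user="user", passwd="password", db="dbname",
                             cursorclass=MySQLdb.cursors.SSCursor)    # unbuffered
conn_inner = MySQLdb.connect(user="user", passwd="password", db="dbname")  # buffered

outer = conn_outer.cursor()
outer.execute("SELECT id, name FROM users")
for user_id, name in outer:                 # streams row by row from the server
    inner = conn_inner.cursor()             # separate connection, so no "commands out of sync"
    inner.execute("SELECT id, x, y FROM table2 WHERE user_id = %s", (user_id,))
    for row in inner.fetchall():
        doSomethingElse(row)
    inner.close()

outer.close()
conn_outer.close()
conn_inner.close()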
If your workflow is something like "loop through a big result set, executing N little queries for each row," consider looking into MySQL's stored procedures as an alternative to nesting cursors from different connections. You can still use MySQLdb to call the procedure and get the results, though you'll definitely want to read the documentation of MySQLdb's callproc() method since it doesn't conform to Python's database API specs when retrieving procedure outputs.
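For completeness, calling a stored procedure with MySQLdb looks roughly like this (the procedure name and argument are made up; see the callproc() docs for how OUT parameters are actually retrieved):

cur = conn.cursor()
cur.callproc('my_proc', (some_id,))   # hypothetical procedure name and argument
rows = cur.fetchall()                 # first result set produced by the procedure
while cur.nextset():                  # a procedure may return several result sets
    rows += cur.fetchall()
cur.close()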
A second alternative is to stick to buffered cursors, but split up your query into batches. That's what I ended up doing for a project last year where I needed to loop through a set of millions of rows, parse some of the data with an in-house module, and perform some INSERT and UPDATE queries after processing each row. The general idea looks something like this:
QUERY = r"SELECT id, name FROM `table` WHERE id BETWEEN %s and %s;"
BATCH_SIZE = 5000
i = 0
while True:
cursor.execute(QUERY, (i + 1, i + BATCH_SIZE))
result = cursor.fetchall()
# If there's no possibility of a gap as large as BATCH_SIZE in your table ids,
# you can test to break out of the loop like this (otherwise, adjust accordingly):
if not result:
break
for row in result:
doSomething()
i += BATCH_SIZE
One other thing I would note about your example code: you can iterate directly over a cursor in MySQLdb instead of calling fetchone() explicitly inside xrange(cursor.rowcount). This is especially important when using an unbuffered cursor, because the rowcount attribute is undefined and will give a very unexpected result (see: Python MysqlDB using cursor.rowcount with SSDictCursor returning wrong count).
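In other words, the first example can be reduced to (a sketch; do_something is a placeholder):

cursor.execute("SELECT id, name FROM `table`")
for row_id, name in cursor:     # works for buffered and unbuffered cursors alike,
    do_something(row_id, name)  # and never touches cursor.rowcount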
I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (with a large number of columns)... it took 0.029 seconds outside of Python, but takes significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)

    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one of the cursor classes allows the result to come back as a set of dictionaries. I haven't even made it to that point yet, as the query and return itself are so slow.
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K, that would be close to 27MB of data returned.
To get a sense of how much data MySQL is pushing, you can run this right after your query:
SHOW SESSION STATUS LIKE "bytes_sent"
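From Python that would be something like the following (a sketch; Bytes_sent is cumulative for the session, so compare the value before and after the query):

cursor.execute('SHOW SESSION STATUS LIKE "Bytes_sent"')
name, value = cursor.fetchone()   # returns a (variable_name, value) tuple
print(name, value)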
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using phpMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, so I'm not sure how it returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, it has to page out to disk to handle it.
If you reduce the number of columns called and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to run the queries in another language: I use R to make my queries, which is blindingly fast, do what I can in R, and then send the data to Python if needed for more general programming (although R has many general-purpose libraries as well). Just wanted to post something that may help someone who has a similar problem, even though I know this sidesteps the heart of the problem.
I am using pythons built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds; the code above takes forever. Am I missing something here?
I tried the same with apsw and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
There is a discussion about it here: http://www.mail-archive.com/python-list@python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is some advice on how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)
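If you'd rather create them from Python, the stdlib sqlite3 module can run both statements in one go (a sketch; the database filename is a placeholder):

import sqlite3

conn = sqlite3.connect("proteins.db")   # placeholder filename
conn.executescript("""
    CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id);
    CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId);
""")
conn.commit()
conn.close()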
I wanted to update this because I am noticing the same issue and we are now in 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100,000 rows * >200 columns). In particular, I noticed that my 3-table inner join clocks in at around ~12 minutes of run time in Python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed, and the EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due, so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in, so I used an example with {Foo} here):
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {Foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside python and send the .sql file to sqlite3
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out, and re-instantiate any connection objects afterward if you're going back to working with sqlite inside Python.
Hope this helps and if anyone has figured the source of this out, please link to it!
I've got a very large SQLite table with over 500,000 rows and about 15 columns (mostly floats). I want to transfer data from the SQLite DB to a Django app (which could be backed by many RDBMSs, but Postgres in my case). Everything works OK, but as the iteration continues, memory usage jumps by 2-3 MB a second for the Python process. I've tried using 'del' to delete the EVEMapDenormalize and row objects at the end of each iteration, but the bloat continues. Here's an excerpt, any ideas?
class Importer_mapDenormalize(SQLImporter):
    def run_importer(self, conn):
        c = conn.cursor()
        for row in c.execute('select * from mapDenormalize'):
            mapdenorm, created = EVEMapDenormalize.objects.get_or_create(id=row['itemID'])
            mapdenorm.x = row['x']
            mapdenorm.y = row['y']
            mapdenorm.z = row['z']
            if row['typeID']:
                mapdenorm.type = EVEInventoryType.objects.get(id=row['typeID'])
            if row['groupID']:
                mapdenorm.group = EVEInventoryGroup.objects.get(id=row['groupID'])
            if row['solarSystemID']:
                mapdenorm.solar_system = EVESolarSystem.objects.get(id=row['solarSystemID'])
            if row['constellationID']:
                mapdenorm.constellation = EVEConstellation.objects.get(id=row['constellationID'])
            if row['regionID']:
                mapdenorm.region = EVERegion.objects.get(id=row['regionID'])
            mapdenorm.save()
        c.close()
I'm not at all interested in wrapping this SQLite DB with the Django ORM. I'd just really like to figure out how to get the data transferred without sucking all of my RAM.
Silly me, this was addressed in the Django FAQ.
Needed to clear the DB query cache while in DEBUG mode.
from django import db
db.reset_queries()
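In the importer above, that means calling it periodically inside the row loop, for example (a sketch; import_row is a placeholder for the per-row ORM work shown in the question):

from django import db

for i, row in enumerate(c.execute('select * from mapDenormalize')):
    import_row(row)          # placeholder for the per-row ORM work
    if i % 1000 == 0:
        db.reset_queries()   # clear the DEBUG-mode query log so it can't grow without bound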
I think doing a select * from mapDenormalize and loading the whole result into memory will always be a bad idea. My advice is to split the script into chunks, using LIMIT to get the data in portions.
Get the first portion, work with it, close the cursor, and then get the next portion.
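A rough sketch of that pattern with the stdlib sqlite3 module (the chunk size and the import_row helper are arbitrary, and conn is assumed to be an open connection):

CHUNK = 10000
offset = 0
while True:
    c = conn.cursor()
    c.execute('SELECT * FROM mapDenormalize LIMIT ? OFFSET ?', (CHUNK, offset))
    rows = c.fetchall()
    c.close()
    if not rows:
        break
    for row in rows:
        import_row(row)      # placeholder for the per-row work
    offset += CHUNK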