I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In an another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the case bottlekneck, and it's not. I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (large amount of columns)... took 0.029 seconds outside of Python. Taking significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long, however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
# Comma Race List
race_list = ','.join(map(str, race_ids))
# PPs Query
query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
"FROM hrdb_raceindex raceindex "
"INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
"INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
"LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
"WHERE raceindex.race_id IN (" + race_list + ") "
"ORDER BY runlines.line_date DESC;")
print(query)
# Run Query
cursor.execute(query)
# Query Fields
fields = [i[0] for i in cursor.description]
# PPs List
pps = []
# Loop Results
for row in cursor:
a = 0
#this_pp = {}
#for i, value in enumerate(row):
# this_pp[fields[i]] = value
#pps.append(this_pp)
return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor allows the result to come back as a set of dictionaries. I haven't even made it to that point yet as the query and return itself is so slow.
Tho you have only 501 rows it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data mysql is pushing you can add this to your query:
SHOW SESSION STATUS LIKE "bytes_sent"
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using PHPMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, not sure about how that returns results).
Perhaps the Python process is memory-constrained and when faced with this large result set it has to out page out to disk to handle the result set.
If you reduce the number of columns called and/or constrain to, say LIMIT 10 on your query do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late, however, I have run into similar issues with mysql and python. My solution is to use queries using another language...I use R to make my queries which is blindly fast, do what I can in R and then send the data to python if need be for more general programming, although R has many general purpose libraries as well. Just wanted to post something that may help someone who has a similar problem, and I know this side steps the heart of the problem.
Related
Assuming query is some already defined query. As far as I can tell, connection.execute(query).fetchmany(n) and connection.execute(query).limit(n).fetchall() apparently return the same result set. I'm wondering if one of them is more idiomatic or — more importantly — more performant?
Example usage would be:
query = select([census.columns.state, (census.columns.pop2008 - census.columns.pop2000).label("pop_change")]).group_by(census.columns.state).order_by(desc("pop_change"))
results_1 = query.limit(5).fetchall()
results_2 = connection.execute(query).fetchmany(n) #`results_2` = `results_1`
limit will be a part of the sql query sent to the database server.
With fetchmany the query is executed without any limit, but the client (python code) requests only certain number of rows.
Therefore using limit should be faster in most cases.
I have found fetchmany to be very useful when you need to get a very large dataset from the database but you do not want to load all of those results into memory. It allows you to process the results in smaller batches.
result = conn.execution_options(stream_results=True).execute(
SomeLargeTable.__table__.select()
)
while chunk:= result.fetchmany(10000) ## only get 10K rows at a time
for row in chunk:
## process each row before moving onto the next chunk
I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki
Consider using server side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)
fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()
Here is the code to use for simple server side cursor with the speed of fetchmany management.
The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.
EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)
As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.
With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. The whole records are buffered in the client's memory. It is when you execute a query like curs.execute('SELECT * FROM ...'.
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how fetchone can help us memory wise ?
A: It fetches only one record from the stored data and creates a single Python object and hands you in your Python code while fetchall will fetch and create n Python objects from this data and hands it to you all in one chunk.
So If your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000 result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set + fetchall --> 1,000,000 Python objects
Of-course fetchone helped but still we have the whole records in memory. This is where Server-side cursors comes into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The draw-back :
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.
I have noticed a huge timing difference between using django connection.cursor vs using the model interface, even with small querysets.
I have made the model interface as efficient as possible, with values_list so no objects are constructed and such. Below are the two functions tested, don't mind the spanish names.
def t3():
q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
with connection.cursor() as c:
c.execute(q)
return list(c)
def t4():
return list(EventoAgendado.objects.all().values_list('id','numerosDisponibles')[:1000])
Then using a function to time (self made with time.clock())
r1 = timeme(t3); r2 = timeme(t4)
The results are as follows:
0.00180384529631 and 0.00493390727024 for t3 and t4
And just to make sure the queries are and take the same to execute:
connection.queries[-2::]
Yields:
[
{u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000', u'time': u'0.002'},
{u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]
As you can see, two exact queries, returning two exact lists (performing r1 == r2 returns True), takes totally different timings (difference gets bigger with a bigger query set), I know python is slow, but is django doing so much work behind the scenes to make the query that slower?
Also, just to make sure, I have tried building the queryset object first (outside the timer) but results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure.
I have also tried using the iterator() function at the end of the query but that doesn't help neither.
I know the difference is minimal, both execute blazingly fast, but this is being bencharked with apache ab, and this minimal difference, when having 1k concurrent requests, makes day and light.
By the way, I'm using django 1.7.10 with mysqlclient as the db connector.
EDIT: For the sake of comparison, the same test with a 11k result query set, the difference gets even bigger (3x slower, compared to the first one where it is around 2.6x slower)
r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558
EDIT2: Another funny test, if I actually convert the queryset object to it's actual string query (with str(queryset.query)), and use it inside a raw query instead, I get the same good performance as the raw query, by the execption that using the queryset.query string sometimes gives me an actual invalid SQL query (ie, if the queryset has a filter on a date value, the date value is not escaped with '' on the string query, giving an sql error when executing it with a raw query, this is another mystery)
-- EDIT3:
Going through the code, seems like the difference is made by how the result data is retrieved, for a raw query set, it simply calls iter(self.cursor) which I believe when using a C implemented connector will run all in C code (as iter is also a built in), while the ValuesListQuerySet is actually a python level for loop with a yield tuple(row) statement, which will be quite slow. I guess there's nothing to be done in this matter to have the same performance as the raw query set :'(.
If anyone is interested, the slow loop is this one:
for row in self.query.get_compiler(self.db).results_iter():
yield tuple(row)
-- EDIT 4: I have come with a very hacky code to convert a values list query set into usable data to be sent to a raw query, having the same performance as running a raw query, I guess this is very bad and will only work with mysql, but, the speed up is very nice while allowing me to keep the model api filtering and such. What do you think?
Here's the code.
def querysetAsRaw(qs):
q = qs.query.get_compiler(qs.db).as_sql()
with connection.cursor() as c:
c.execute(q[0], q[1])
return c
The answer was simple, update to django 1.8 or above, which changed some code that no longer has this issue in performance.
I am using psycopg2 module in python to read from postgres database, I need to some operation on all rows in a column, that has more than 1 million rows.
I would like to know would cur.fetchall() fail or cause my server to go down? (since my RAM might not be that big to hold all that data)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
doSomething(row)
what is the smarter way to do this?
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one-by-one. I access an external database over an internet connection, that might be a reason for it.
Having a server side cursor and fetching bunches of rows proved to be the most performant solution. You can change the sql statements (as in alecxe answers) but there is also pure python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
rows = cursor.fetchmany(5000)
if not rows:
break
for row in rows:
# do something with row
pass
you find more about server side cursors in the psycopg2 wiki
Consider using server side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned an huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)
fetchall() fetches up to the arraysize limit, so to prevent a massive hit on your database you can either fetch rows in manageable batches, or simply step through the cursor till its exhausted:
row = cur.fetchone()
while row:
# do something with row
row = cur.fetchone()
Here is the code to use for simple server side cursor with the speed of fetchmany management.
The principle is to use named cursor in Psycopg2 and give it a good itersize to load many rows at once like fetchmany would do but with a single loop of for rec in cursor that does an implicit fetchnone().
With this code I make queries of 150 millions rows from multi-billion rows table within 1 hour and 200 meg ram.
EDIT: using fetchmany (along with fetchone() and fetchall(), even with a row limit (arraysize) will still send the entire resultset, keeping it client-side (stored in the underlying c library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using limit in the sql with an order-by, then analyzing the results and augmenting the next query with where (ordered_val = %(last_seen_val)s and primary_key > %(last_seen_pk)s OR ordered_val > %(last_seen_val)s)
This is misleading for the library to say the least, and there should be a blurb in the documentation about this. I don't know why it's not there.
Not sure a named cursor is a good fit without having a need to scroll forward/backward interactively? I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain
# from_iterable added >= python 2.7
from_iterable = chain.from_iterable
# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
if parms is None:
curs.execute(sql)
else:
curs.execute(sql, parms)
chunks_until_empty = iter(partial(fetchmany, chunksize), [])
return from_iterable(chunks_until_empty)
# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
print 'lots of waffles: %s' % (row,)
As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.
With normal cursors (client-side), Psycopg fetches all the records returned by the backend, transferring them to the client process. The whole records are buffered in the client's memory. It is when you execute a query like curs.execute('SELECT * FROM ...'.
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how fetchone can help us memory wise ?
A: It fetches only one record from the stored data and creates a single Python object and hands you in your Python code while fetchall will fetch and create n Python objects from this data and hands it to you all in one chunk.
So If your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000 result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set + fetchall --> 1,000,000 Python objects
Of-course fetchone helped but still we have the whole records in memory. This is where Server-side cursors comes into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The draw-back :
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.
I've got 32 SQLite (3.7.9) databases with 3 tables each that I'm trying to merge together using the idiom that I've found elsewhere (each db has the same schema):
attach db1.sqlite3 as toMerge;
insert into tbl1 select * from toMerge.tbl1;
insert into tbl2 select * from toMerge.tbl2;
insert into tbl3 select * from toMerge.tbl3;
detach toMerge;
and rinse-repeating for the entire set of databases. I do this using python and the sqlite3 module:
for fn in filelist:
completedb = sqlite3.connect("complete.sqlite3")
c = completedb.cursor()
c.execute("pragma synchronous = off;")
c.execute("pragma journal_mode=off;")
print("Attempting to merge " + fn + ".")
query = "attach '" + fn + "' as toMerge;"
c.execute(query)
try:
c.execute("insert into tbl1 select * from toMerge.tbl1;")
c.execute("insert into tbl2 select * from toMerge.tbl2;")
c.execute("insert into tbl3 select * from toMerge.tbl3;")
c.execute("detach toMerge;")
completedb.commit()
except sqlite3.Error as err:
print "Error! ", type(err), " Error msg: ", err
raise
2 of the tables are fairly small, only 50K rows per db, while the third (tbl3) is larger, about 850 - 900K rows. Now, what happens is that the inserts progressively slow down until I get to about the fourth database when they grind to a near halt (on the order of a a megabyte or two in file size added every 1-3 minutes to the combined database). In case it was python, I've even tried dumping out the tables as INSERTs (.insert; .out foo; sqlite3 complete.db < foo is the skeleton, found here) and combining them in a bash script using the sqlite3 CLI to do the work directly, but I get exactly the same problem.
The table setup of tbl3 isn't too demanding - a text field containing a UUID, two integers, and four real values. My worry is that it's the number of rows, because I ran into exactly the same trouble at exactly the same spot (about four databases in) when the individual databases were an order of magnitude larger in terms of file size with the same number of rows (I trimmed the contents of tbl3 significantly by storing summary stats instead of raw data). Or maybe it's the way I'm performing the operation? Can anyone shed some light on this problem that I'm having before I throw something out the window?
Try adding or removing indexes/primary key for the larger table.
You didn't mention the OS you were using or the db file sizes. Windows can have issues with files that are bigger than 2Gb depending on what version.
In any case, since this is a glorified batch script why not get rid of the for loop, get the filename from sys.argv, and then just run it once for each merge db. That way you will never have to deal with memory issues from doing too much in one process.
Mind you, if you end the loop with the following that will likely also fix things.
c.close()
completedb.close()
You say that the same thing occurs when you follow this process using the CLI and quitting after every db. I assume that you mean the Python CLI, and quitting means that you exit and restart Python. If that is true, and it still develops a problem every 4th database, then something is wrong with your SQLITE shared library. It shouldn't be keeping state like that.
If I were in your shoes, I would stop using attach and just open multiple connections in Python, then move the data in batches of about 1000 records per commit. It would be slower than your technique because all the data moves in and out of Python objects, but I think it would also be more reliable. Open the complete db, then loop around opening a second db, copying, then closing the second db. For the copying, I would use OFFSET and LIMIT on the SELECT statements to process batches of 100 records, then commit, then repeat.
In fact, I would also count the completedb records, and the second db records before copying, then after copying count the completedb records to ensure that I had copied the expected amount. Also, you would be keeping track of the value of the next OFFSET and I would write that to a text file right after committing, so that I could interrupt and restart the process at any time and it would carry on where it left off.