I have noticed a huge timing difference between using Django's connection.cursor and using the model interface, even with small querysets.
I have made the model interface as efficient as possible, using values_list so that no model objects are constructed. Below are the two functions tested (don't mind the Spanish names).
def t3():
    q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
    with connection.cursor() as c:
        c.execute(q)
        return list(c)
def t4():
    return list(EventoAgendado.objects.all().values_list('id', 'numerosDisponibles')[:1000])
Then I time both with timeme, a small self-made helper built on time.clock():
r1 = timeme(t3); r2 = timeme(t4)
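Roughly, timeme does something like this (a sketch only; the real helper isn't shown here, the important points are that it prints the elapsed time and returns the function's result, which is why r1 == r2 can compare the fetched data itself):

import time

def timeme(func):
    # Sketch of the timing helper: measure one call with time.clock(),
    # print the elapsed time, and hand back the function's return value.
    start = time.clock()
    result = func()
    print(time.clock() - start)
    return result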
The results are as follows:
0.00180384529631 and 0.00493390727024 for t3 and t4
And just to make sure both queries are identical and take the same time to execute:
connection.queries[-2::]
Yields:
[
{u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000', u'time': u'0.002'},
{u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]
As you can see, these are two identical queries returning two identical lists (r1 == r2 returns True), yet they take totally different times, and the difference grows with a bigger result set. I know Python is slow, but is Django really doing so much work behind the scenes that the query gets that much slower?
Also, just to make sure, I have tried building the queryset object first (outside the timer), but the results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure.
I have also tried using the iterator() function at the end of the query, but that doesn't help either.
I know the difference is minimal and both execute blazingly fast, but this is being benchmarked with Apache ab, and at 1k concurrent requests this minimal difference is night and day.
By the way, I'm using Django 1.7.10 with mysqlclient as the DB connector.
EDIT: For the sake of comparison, the same test with an 11k-row result set makes the difference even bigger (3x slower, compared to around 2.6x slower in the first test):
r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558
EDIT 2: Another funny test: if I convert the queryset object to its actual SQL string (with str(queryset.query)) and use that in a raw query instead, I get the same good performance as the raw query. The exception is that the queryset.query string sometimes gives me invalid SQL (e.g., if the queryset has a filter on a date value, the date is not quoted in the string, which produces an SQL error when executed as a raw query; that's another mystery).
-- EDIT 3:
Going through the code, it seems the difference comes from how the result data is retrieved. A raw queryset simply calls iter(self.cursor), which with a C-implemented connector runs entirely in C code (iter is a built-in as well), while ValuesListQuerySet is a Python-level for loop with a yield tuple(row) statement, which is quite slow. I guess there's nothing to be done here to match the raw queryset's performance :'(.
If anyone is interested, the slow loop is this one:
for row in self.query.get_compiler(self.db).results_iter():
    yield tuple(row)
-- EDIT 4: I have come up with a very hacky piece of code that converts a values-list queryset into data usable in a raw query, with the same performance as running a raw query. I guess this is very bad and will only work with MySQL, but the speed-up is very nice, and it lets me keep the model API for filtering and such. What do you think?
Here's the code.
def querysetAsRaw(qs):
    q = qs.query.get_compiler(qs.db).as_sql()
    with connection.cursor() as c:
        c.execute(q[0], q[1])
        return c
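Usage would look roughly like this (a sketch; querysetAsRaw returns the raw cursor, so consume it right away):

# Sketch: keep the ORM for filtering/slicing, fetch through the raw cursor path.
qs = EventoAgendado.objects.values_list('id', 'numerosDisponibles')[:1000]
rows = list(querysetAsRaw(qs))  # same data as t4, fetched like t3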
The answer was simple: upgrade to Django 1.8 or above, which changed the relevant code and no longer has this performance issue.
Related
I have to carry out some statistical treatments on data that is stored in PostgreSQL tables. I have been hesitating between using R and Python.
With R I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;")
which is very fast: it takes only 5 seconds to fetch a table with ~200,000 rows and 15 columns and almost no NULLs in it.
With Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;")
my_table = cursor.fetchall()
and surprisingly, it causes my Python session to freeze and my computer to crash.
As I use these libraries as "black boxes", I don't understand why something that is so quick in R can be that slow (and thus almost impossible to use in practice) in Python.
Can someone explain this difference in performance, and can someone tell if there exists a more efficient method to fetch a pgSQL table in Python?
I am no expert in R, but very obviously what dbGetQuery() (actually: what dbFetch()) returns is a lazy object that does not load all results into memory, else it would of course take ages and eat all your RAM too.
As for Python/psycopg2, you definitely do NOT want to fetchall() a huge dataset. The proper solution here is to use a server-side cursor and iterate over it.
Edit - answering the questions in your comments:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As spelled out in the example I linked to, what "does the trick" is using a server-side cursor, which is done (in psycopg2) by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor
psycopg2 creates a server-side cursor, which prevents all of the
records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
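A minimal, self-contained sketch of that approach (conn_string and some_table are the placeholders from the question, and process() stands in for whatever treatment you apply per row):

import psycopg2

conn = psycopg2.connect(conn_string)
# Naming the cursor makes it server-side: rows are streamed in batches
# instead of being downloaded all at once.
cursor = conn.cursor('my_server_side_cursor')
cursor.itersize = 2000  # rows fetched per round trip while iterating
cursor.execute("SELECT * FROM some_table;")

for row in cursor:
    process(row)  # placeholder for your per-row statistical treatment

cursor.close()
conn.close()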
The DictCursor stuff is actually irrelevant (and should not have been mentioned in that example, since it obviously confuses newcomers).
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I don't know zilch about R and its implementation (I deduce that whatever dbFetch returns is a lazy object from the behaviour you describe), but there's nothing magical about an object that lazily fetches values from an external source. Python's file object is a well-known example:
with open("/some/huge/file.txt") as f:
for line in f:
print line
In the above snippet, the file object f fetches data from disk only when needed. All that needs to be stored is the file pointer position (and a buffer of the last N bytes that were read from disk, but that's an implementation detail).
If you want to learn more, read up on Python's iterables and iterators.
I'm using web.py to create a simple report page from Oracle. When I take the best practice approach of using vars= to pass parameters, the delay is 11-12 seconds. When I do the same query using string substitution, the query runs in less than a second. Here's how I'm checking:
sql = """
SELECT a, b, c
FROM my_table
WHERE what = $what
ORDER BY a, b"""
print('before', datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
result = db.query(sql, vars={'what': '1234'})
print('after', datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
The "before" and "after" show clearly than I'm getting a massive delay on the query. I've tried using select() with the same vars= and I get the same delay. So, my initial suggestion is that it's web.db's SQL escape functions that are creating the delay. I don't want to pass unescaped input, and it doesn't seem like there should be so much overhead.
Is there anything else that could be creating this delay? If it is the escape logic, are there any gotchas of which I need to be aware?
Thanks in advance!
EDIT:
On further investigation, I've proven (to myself, at least) that the delay is not specific to web.py but happens in cx_Oracle. I reached this conclusion by adjusting my SQL syntax and doing:
cursor = db._db_cursor()
lines = cursor.execute(sql.format(my_table, {'what': '1234'}))
...this produces a similar ten-to-twelve second delay compared to hard-coding the variable. Any cx_Oracle advice?
Alrighty. On further, further investigation, I figured out that the problem is a mismatch in encoding between Python and Oracle for the string parameter I pass. I fixed it with a simple what.encode('iso-8859-1'). If that doesn't work for you, check the Oracle encoding using the PL/SQL dump() function. If the Python encoding doesn't work, try decoding first.
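In code, the fix amounts to something like this (a sketch using cx_Oracle's named-bind form; the table, columns, and value are the ones from the example above):

# Sketch: encode the unicode bind value to match the database character set
# before passing it as a parameter.
what = u'1234'
cursor = db._db_cursor()  # web.py's underlying cx_Oracle cursor, as before
cursor.execute(
    "SELECT a, b, c FROM my_table WHERE what = :what ORDER BY a, b",
    {'what': what.encode('iso-8859-1')},
)
rows = cursor.fetchall()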
I have a general ledger table in my DB with the columns: member_id, is_credit and amount. I want to get the current balance of the member.
Ideally that could be done with two queries, where the first query has is_credit == True and the second has is_credit == False, something close to:
credit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==True)
debit_amount = session.query(func.sum(Funds.amount).label('Debit_Amount')).filter(Funds.member_id==member_id, Funds.is_credit==False)
balance = credit_amount - debit_amount
and then subtract the result. Is there a way to have the above run in one query to give the balance?
From the comments you state that hybrids are too advanced right now, so I will propose an easier but not as efficient solution (still, it's okay):
(session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount'))
    .filter(Funds.member_id == member_id)
    .group_by(Funds.is_credit))
What will this do? You will receive a two-row result: one row holds the credit total, the other the debit total, distinguished by the is_credit column of the result. The second part (Debit_Amount) holds the value. You then evaluate the two rows to get the balance: a single query fetches both values.
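Evaluating the two rows could look roughly like this (a sketch building on the query above):

# Sketch: collapse the (is_credit, sum) rows into a dict and subtract.
totals = dict(
    session.query(Funds.is_credit, func.sum(Funds.amount).label('Debit_Amount'))
    .filter(Funds.member_id == member_id)
    .group_by(Funds.is_credit)
    .all()
)
# Missing keys mean the member has no rows of that kind yet.
balance = totals.get(True, 0) - totals.get(False, 0)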
If you are unsure what group_by does, I recommend you read up on SQL before doing it in SQLAlchemy. SQLAlchemy makes SQL very easy to use, but it requires that you understand SQL as well. Thus, I recommend: first build the query in SQL and check that it does what you want, then translate it to SQLAlchemy and check that it does the same. Otherwise SQLAlchemy will often generate highly inefficient queries, because you asked for the wrong thing.
I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In another area, the select queries are VERY slow.
Executing the same query in PhpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed, etc. I'm porting this application to Python from PHP/CodeIgniter, where it ran fine (I foolishly thought getting out of PHP would help speed it up).
PHP/CodeIgniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/CodeIgniter, but is taking 10 seconds in Python BEFORE any analysis of the data is done.
My link to the database is fairly standard...
dbconn=MySQLdb.connect(host="127.0.0.1",user="*",passwd="*",db="*", cursorclass = MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not: I put dummy code in its place, and the entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
a = 0 (this was the dummy code I put in to test)
return db_results
The query result itself is only 501 rows (with a large number of columns)... it took 0.029 seconds outside of Python, but significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long; however, it runs well outside of Python. I commented out the code within the loop on purpose for testing, and added the print(query) in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))
    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)
    # Run Query
    cursor.execute(query)
    # Query Fields
    fields = [i[0] for i in cursor.description]
    # PPs List
    pps = []
    # Loop Results
    for row in cursor:
        a = 0
        #this_pp = {}
        #for i, value in enumerate(row):
        #    this_pp[fields[i]] = value
        #pps.append(this_pp)
    return pps
One final note... I haven't settled on the ideal way to handle the result yet. I believe one of the cursor classes allows the result to come back as a set of dictionaries. I haven't even made it to that point, as the query and the return itself are so slow.
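For what it's worth, the dictionary-style handling I have in mind would be something like this (an untested sketch using MySQLdb's DictCursor; connection values are placeholders as above):

import MySQLdb
import MySQLdb.cursors

# Untested sketch: DictCursor returns each row as a dict keyed by column name.
dbconn = MySQLdb.connect(host="127.0.0.1", user="*", passwd="*", db="*",
                         cursorclass=MySQLdb.cursors.DictCursor)
cursor = dbconn.cursor()
cursor.execute(query)
pps = list(cursor.fetchall())  # list of dicts, one per row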
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K that would be close to 27MB of data returned.
To get a sense of how much data MySQL is pushing, you can run this alongside your query:
SHOW SESSION STATUS LIKE "bytes_sent"
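From Python, that check could look roughly like this (a sketch reusing the question's cursor and query; Bytes_sent is a cumulative session counter, so sample it before and after the SELECT):

def bytes_sent(cursor):
    # Cumulative bytes the server has sent to this session so far.
    cursor.execute("SHOW SESSION STATUS LIKE 'Bytes_sent'")
    return int(cursor.fetchone()[1])

before = bytes_sent(cursor)
cursor.execute(query)
rows = cursor.fetchall()
after = bytes_sent(cursor)
print("MySQL sent roughly %d bytes" % (after - before))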
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using phpMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, so I'm not sure how it returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, it has to page out to disk to handle it.
If you reduce the number of columns called and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process and how that allocation and usage compares to those same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to make the queries from another language: I use R, which is blindingly fast for this, do what I can in R, and then send the data to Python if need be for more general programming (although R has many general-purpose libraries as well). Just posting something that may help someone with a similar problem, even though I know this sidesteps the heart of the problem.
How is it possible to implement an efficient large Sqlite db search (more than 90000 entries)?
I'm using Python and SQLObject ORM:
import re
...
def search1():
    cr = re.compile(ur'foo')
    for item in Item.select():
        if cr.search(item.name) or cr.search(item.skim):
            print item.name
This function runs in more than 30 seconds. How should I make it run faster?
UPD: The test:
for item in Item.select():
    pass
... takes almost the same time as my initial function (0:00:33.093141 vs 0:00:33.322414). So the regexps eat practically no time.
A Sqlite3 shell query:
select '' from item where name like '%foo%';
runs in about a second. So the main time consumption comes from the ORM's inefficient data retrieval from the DB. I guess SQLObject grabs entire rows here, while SQLite touches only the necessary fields.
The best way would be to rework your logic to do the selection in the database instead of in your python program.
Instead of doing Item.select(), you should rework it to do something like Item.select("""name LIKE ..."""), as sketched below.
If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
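For example, something along these lines (a sketch using SQLObject's raw-SQL clause form; escaping of the search string is omitted for brevity):

def search2():
    # Let SQLite do the matching instead of pulling every row into Python.
    for item in Item.select("name LIKE '%foo%' OR skim LIKE '%foo%'"):
        print item.name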
30 seconds to fetch 90,000 rows might not be all that bad.
Have you benchmarked the time required to do the following?
for item in Item.select():
    pass
Just to see if the time is DB time, network time or application time?
If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.
If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.
The best thing would be to write an SQLite function that performs the comparison for you in the DB engine, instead of in Python.
You could also switch to a DB server like PostgreSQL, which has support for SIMILAR TO.
http://www.postgresql.org/docs/8.3/static/functions-matching.html
I would definitely take Reed's suggestion to push the filter into the SQL (forget the index part, though).
I do not think that selecting only specific fields rather than all fields makes a difference (unless you do have a lot of large fields). I would bet that SQLObject creates/instantiates 80K objects and puts them into a Session/UnitOfWork for tracking. This could definitely take some time.
Also, if you do not need the objects in your session, there must be a way to select just the fields you need using custom query creation, so that no Item objects are created, only tuples.
Initially, doing the regex in Python was considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster).
GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
See the Endnotes at http://yserial.sourceforge.net/ for more details.
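For illustration, a GLOB query through Python's standard sqlite3 module could look like this (a sketch; the database path is a placeholder and the table/column names follow the question):

import sqlite3

conn = sqlite3.connect("items.db")  # placeholder path to the database file
# GLOB matches inside SQLite itself; note it is case-sensitive, unlike LIKE.
rows = conn.execute(
    "SELECT name FROM item WHERE name GLOB ? OR skim GLOB ?",
    ("*foo*", "*foo*"),
).fetchall()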
Given your example and expanding on Reed's answer, your code could look a bit like the following:
import re
import sqlalchemy.sql.expression as expr
...
def search1():
    searchStr = ur'foo'
    whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr), itemsTable.c.skimColumn.contains(searchStr))
    for item in Items.select().where(whereClause):
        print item.name
which translates to
SELECT * FROM items WHERE name LIKE '%foo%' or skim LIKE '%foo%'
This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.
You can find some info on the .contains() method here.
As well as the SQLAlchemy SQL Expression Language Tutorial here.
Of course, the example above assumes variable names for your itemsTable and the columns it has (nameColumn and skimColumn).