Reduce Queries by optimizing _sets in Django - python

The following results in 4 db hits. Since lines 3 and 4 are just filtering what I grabbed in line 2, what do I need to change so it doesn't hit the db again?
page = get_object_or_404(Page, url__iexact=page_url)
installed_modules = page.module_set.all()
navigation_links = installed_modules.filter(module_type=ModuleTypeCode.MODAL)
module_map = dict([(m.module_static_object.key, m) for m in installed_modules])

Django querysets are lazy, so the following line doesn't hit the database:
installed_modules = page.module_set.all()
The query isn't executed until you iterate over the queryset in this line:
module_map = dict([(m.module_static_object.key, m) for m in installed_modules])
So the code you posted only looks like 3 database queries to me, not 4.
Since you are fetching all of the modules from the database already, you could filter the navigation links using a list comprehension instead of another query:
navigation_links = [m for m in installed_modules if m.module_type == ModuleTypeCode.MODAL]
You would have to do some benchmarking to see if this improved performance. It looks like it could be premature optimisation to me.
You might also be doing one database query for each module when you access module_static_object.key. In that case, you could use select_related.
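For instance, a minimal sketch (assuming module_static_object is a ForeignKey on the module model, which is implied but not shown in the question):
installed_modules = page.module_set.select_related('module_static_object')
# One query fetches modules and their related static objects together;
# both of these then work off the cached results without further queries.
navigation_links = [m for m in installed_modules if m.module_type == ModuleTypeCode.MODAL]
module_map = {m.module_static_object.key: m for m in installed_modules}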

This is a case of premature optimization. 4 DB queries for a page load is not bad. The idea is to use as few queries as possible, but you're never going to get it down to 1 in every scenario. The code you have there doesn't seem off-the-wall in terms of needlessly creating queries, so it's highly probable that it's already as optimized as you'll be able to make it.

Related

Why has accessing a Django QuerySet become very slow?

I have a model query in Django:
Query = Details.objects.filter(name__iexact=nameSelected)
I filter it later:
Query2 = Query.filter(title__iexact=title0)
Then I access it using:
...Query2[0][0]...
A few days ago it worked very fast, but now it has become at least 20 times slower.
I tested it on another PC, and there it works very fast.
Update: the filtering is not the reason for the delay; Query2[0][0] is.
Besides that, it became super slow suddenly, not gradually over time.
What can make it so slow on my first PC?
Maybe you could try to make a list out of the QuerySet when you create it, so that you have a real list and not just a lazy QuerySet:
Query2 = list(Query.filter(title__iexact=title0))
The best way is to avoid filtering the query in a loop. What I did was create a hash-map dictionary:
dict0 = {}
Then I added the list of items and the data that corresponds to each item in the query:
dict0 = dict(zip(title0List, DataList))
Finally, I use dict0 instead of the query. It boosts the speed at least 10 times for me.
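For anyone reading along, a minimal sketch of that approach against the model from the question (the 'data' field name is an assumption; substitute whatever column backs DataList):
# One query up front; subsequent lookups are plain dict access, so no per-title queries.
rows = Details.objects.filter(name__iexact=nameSelected).values_list('title', 'data')
dict0 = dict(rows)           # maps title -> data
value0 = dict0.get(title0)   # replaces the slow Query2[0][0] access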

Django QuerySet vs Raw Query performance

I have noticed a huge timing difference between using Django's connection.cursor vs using the model interface, even with small querysets.
I have made the model interface as efficient as possible, with values_list so no objects are constructed and such. Below are the two functions tested; don't mind the Spanish names.
def t3():
    q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
    with connection.cursor() as c:
        c.execute(q)
        return list(c)

def t4():
    return list(EventoAgendado.objects.all().values_list('id', 'numerosDisponibles')[:1000])
Then, timing both with a self-made function (built with time.clock()):
r1 = timeme(t3); r2 = timeme(t4)
The results are as follows:
0.00180384529631 and 0.00493390727024 for t3 and t4
And just to make sure the queries are the same and take the same time to execute:
connection.queries[-2::]
Yields:
[
    {u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000', u'time': u'0.002'},
    {u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]
As you can see, two identical queries returning two identical lists (r1 == r2 returns True) take totally different timings, and the difference gets bigger with a larger query set. I know Python is slow, but is Django doing so much work behind the scenes that it makes the query that much slower?
Also, just to make sure, I have tried building the queryset object first (outside the timer), but the results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure.
I have also tried using the iterator() function at the end of the query, but that doesn't help either.
I know the difference is minimal and both execute blazingly fast, but this is being benchmarked with Apache ab, and with 1k concurrent requests this minimal difference makes a night-and-day difference.
By the way, I'm using Django 1.7.10 with mysqlclient as the db connector.
EDIT: For the sake of comparison, with an 11k-result query set the difference gets even bigger (3x slower, compared to around 2.6x slower in the first test).
r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558
EDIT 2: Another funny test: if I actually convert the queryset object to its actual string query (with str(queryset.query)) and use that inside a raw query instead, I get the same good performance as the raw query, except that the queryset.query string sometimes gives me an invalid SQL query (e.g. if the queryset has a filter on a date value, the date value is not escaped with quotes in the string query, giving an SQL error when executing it as a raw query; this is another mystery).
-- EDIT 3:
Going through the code, it seems the difference comes from how the result data is retrieved. For a raw query set, it simply calls iter(self.cursor), which I believe runs entirely in C when using a C-implemented connector (as iter is also a built-in), while the ValuesListQuerySet is actually a Python-level for loop with a yield tuple(row) statement, which will be quite slow. I guess there's nothing to be done in this matter to get the same performance as the raw query set :'(.
If anyone is interested, the slow loop is this one:
for row in self.query.get_compiler(self.db).results_iter():
    yield tuple(row)
-- EDIT 4: I have come up with some very hacky code to convert a values-list queryset into usable data to be sent to a raw query, getting the same performance as running a raw query. I guess this is very bad and will only work with MySQL, but the speed-up is very nice while still letting me keep the model API filtering and such. What do you think?
Here's the code.
def querysetAsRaw(qs):
    q = qs.query.get_compiler(qs.db).as_sql()
    with connection.cursor() as c:
        c.execute(q[0], q[1])
        return c
The answer was simple: update to Django 1.8 or above, which changed the relevant code and no longer has this performance issue.

How can I reuse objects in django without adding more queries?

In the following code, every amount = u.filter(email__icontains=email) call makes Django perform another query for my filter. How can I avoid these queries?
u = User.objects.all()
shares = Share.objects.all()
for o in shares:
    email = o.email
    type = "CASH"
    amount = u.filter(email__icontains=email).count()
This whole piece of code is very inefficient, and some more context would help.
What do you need u = User.objects.all() for?
Each u.filter(...).count() call in the loop triggers a query. By calling filter() you just specify criteria for the record set you want to obtain; how else are you supposed to get the records matching your conditions if not by running a DB query? If you want Django not to run a DB query at all, then you probably don't know what you are doing.
Filtering with filter(email__icontains=email) is very inefficient; the database can't use any index, and your query will be very slow. Can't you just replace that with filter(email=email)?
Calling a bunch of queries in a loop is suboptimal.
So again, some context on what you are trying to do would be helpful, as someone could then find a better solution for your problem.
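That said, if the goal is just a per-share count of matching users, the loop of queries can be collapsed into a single pass; a minimal sketch, assuming an exact (case-insensitive) email match is acceptable rather than icontains:
from collections import Counter

# One query for all user emails, counted once in Python.
email_counts = Counter(e.lower() for e in User.objects.values_list('email', flat=True) if e)

for o in Share.objects.all():                # one query for all shares
    amount = email_counts[o.email.lower()]   # dict lookup, no extra query per share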

Slow MySQL queries in Python but fast elsewhere

I'm having a heckuva time dealing with slow MySQL queries in Python. In one area of my application, "load data infile" goes quick. In another area, the select queries are VERY slow.
Executing the same query in phpMyAdmin AND Navicat (as a second test) yields a response ~5x faster than in Python.
A few notes...
I switched to MySQLdb as the connector and am also using SSCursor. No performance increase.
The database is optimized, indexed etc. I'm porting this application to Python from PHP/Codeigniter where it ran fine (I foolishly thought getting out of PHP would help speed it up)
PHP/Codeigniter executes the select queries swiftly. For example, one key aspect of the application takes ~2 seconds in PHP/Codeigniter, but is taking 10 seconds in Python BEFORE any of the analysis of the data is done.
My link to the database is fairly standard...
dbconn = MySQLdb.connect(host="127.0.0.1", user="*", passwd="*", db="*", cursorclass=MySQLdb.cursors.SSCursor)
Any insights/help/advice would be greatly appreciated!
UPDATE
In terms of fetching/handling the results, I've tried it a few ways. The initial query is fairly standard...
# Run Query
cursor.execute(query)
I removed all of the code within this loop just to make sure it wasn't the bottleneck, and it's not; I put dummy code in its place. The entire process did not speed up at all.
db_results = "test"
# Loop Results
for row in cursor:
    a = 0  # this was the dummy code I put in to test
return db_results
The query result itself is only 501 rows (with a large number of columns)... it took 0.029 seconds outside of Python, but takes significantly longer than that within Python.
The project is related to horse racing. The query is done within this function. The query itself is long; however, it runs well outside of Python. I commented out the code within the loop on purpose for testing... also the print(query), in hopes of figuring this out.
# Get PPs
def get_pps(race_ids):
    # Comma Race List
    race_list = ','.join(map(str, race_ids))

    # PPs Query
    query = ("SELECT raceindex.race_id, entries.entry_id, entries.prognum, runlines.line_id, runlines.track_code, runlines.race_date, runlines.race_number, runlines.horse_name, runlines.line_date, runlines.line_track, runlines.line_race, runlines.surface, runlines.distance, runlines.starters, runlines.race_grade, runlines.post_position, runlines.c1pos, runlines.c1posn, runlines.c1len, runlines.c2pos, runlines.c2posn, runlines.c2len, runlines.c3pos, runlines.c3posn, runlines.c3len, runlines.c4pos, runlines.c4posn, runlines.c4len, runlines.c5pos, runlines.c5posn, runlines.c5len, runlines.finpos, runlines.finposn, runlines.finlen, runlines.dq, runlines.dh, runlines.dqplace, runlines.beyer, runlines.weight, runlines.comment, runlines.long_comment, runlines.odds, runlines.odds_position, runlines.entries, runlines.track_variant, runlines.speed_rating, runlines.sealed_track, runlines.frac1, runlines.frac2, runlines.frac3, runlines.frac4, runlines.frac5, runlines.frac6, runlines.final_time, charts.raceshape "
             "FROM hrdb_raceindex raceindex "
             "INNER JOIN hrdb_runlines runlines ON runlines.race_date = raceindex.race_date AND runlines.track_code = raceindex.track_code AND runlines.race_number = raceindex.race_number "
             "INNER JOIN hrdb_entries entries ON entries.race_date=runlines.race_date AND entries.track_code=runlines.track_code AND entries.race_number=runlines.race_number AND entries.horse_name=runlines.horse_name "
             "LEFT JOIN hrdb_charts charts ON runlines.line_date = charts.race_date AND runlines.line_track = charts.track_code AND runlines.line_race = charts.race_number "
             "WHERE raceindex.race_id IN (" + race_list + ") "
             "ORDER BY runlines.line_date DESC;")
    print(query)

    # Run Query
    cursor.execute(query)

    # Query Fields
    fields = [i[0] for i in cursor.description]

    # PPs List
    pps = []

    # Loop Results
    for row in cursor:
        a = 0
        # this_pp = {}
        # for i, value in enumerate(row):
        #     this_pp[fields[i]] = value
        # pps.append(this_pp)

    return pps
One final note... I haven't considered the ideal way to handle the result. I believe one cursor class allows the result to come back as a set of dictionaries. I haven't even made it to that point yet, as the query and return itself is so slow.
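(For what it's worth, MySQLdb ships a DictCursor that returns each row as a dict keyed by column name; a minimal sketch, with the same placeholder credentials as above:)
import MySQLdb
import MySQLdb.cursors

# Same connection as before, but each fetched row comes back as a dict.
dbconn = MySQLdb.connect(host="127.0.0.1", user="*", passwd="*", db="*",
                         cursorclass=MySQLdb.cursors.DictCursor)
cursor = dbconn.cursor()
cursor.execute("SELECT 1 AS one")
print(cursor.fetchone())  # {'one': 1}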
Though you have only 501 rows, it looks like you have over 50 columns. How much total data is being passed from MySQL to Python?
501 rows x 55 columns = 27,555 cells returned.
If each cell averaged "only" 1K, that would be close to 27MB of data returned.
To get a sense of how much data MySQL is pushing, you can run this alongside your query:
SHOW SESSION STATUS LIKE "bytes_sent"
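A quick sketch of doing that from the existing cursor (it's a separate statement, run on the same connection after the main result set has been consumed):
cursor.execute('SHOW SESSION STATUS LIKE "Bytes_sent"')
print(cursor.fetchone())  # e.g. ('Bytes_sent', '27555000') -- the value shown here is only illustrative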
Is your server well-resourced? Is memory allocation well configured?
My guess is that when you are using phpMyAdmin you are getting paginated results. This masks the issue of MySQL returning more data than your server can handle (I don't use Navicat, so I'm not sure how it returns results).
Perhaps the Python process is memory-constrained and, when faced with this large result set, has to page out to disk to handle it.
If you reduce the number of columns selected and/or constrain the query to, say, LIMIT 10, do you get improved speed?
Can you see if the server running Python is paging to disk when this query is called? Can you see what memory is allocated to Python, how much is used during the process, and how that allocation and usage compares to the same values in the PHP version?
Can you allocate more memory to your constrained resource?
Can you reduce the number of columns or rows that are called through pagination or asynchronous loading?
I know this is late; however, I have run into similar issues with MySQL and Python. My solution is to make the queries from another language: I use R, which is blindingly fast at this, do what I can in R, and then send the data to Python if need be for more general programming, although R has many general-purpose libraries as well. Just wanted to post something that may help someone with a similar problem, even though I know this sidesteps the heart of the problem.

How can I query rows with unique values on a joined column?

I'm trying to have my popular_query subquery remove duplicate Place.id values, but it doesn't remove them. The code is below. I tried using distinct, but it does not respect the order_by rule.
SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)
popular_query = (db.session.query(Post, func.count(SimilarPost.id)).
                 join(Place, Place.id == Post.place_id).
                 join(PostOption, PostOption.post_id == Post.id).
                 outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val).
                 join(SimilarPost, SimilarPost.id == SimilarPostOption.post_id).
                 filter(Place.id == Post.place_id).
                 filter(self.radius_cond()).
                 group_by(Post.id).
                 group_by(Place.id).
                 order_by(desc(func.count(SimilarPost.id))).
                 order_by(desc(Post.timestamp))
                 ).subquery().select()
all_posts = db.session.query(Post).select_from(filter.pick()).all()
I did a test printout with
print [x.place.name for x in all_posts]
[u'placeB', u'placeB', u'placeB', u'placeC', u'placeC', u'placeA']
How can I fix this?
Thanks!
This should get you what you want:
SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)

post_popularity = (db.session.query(func.count(SimilarPost.id))
                   .select_from(PostOption)
                   .filter(PostOption.post_id == Post.id)
                   .correlate(Post)
                   .outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val)
                   .join(SimilarPost, sql.and_(
                       SimilarPost.id == SimilarPostOption.post_id,
                       SimilarPost.place_id == Post.place_id))
                   .as_scalar())

popular_post_id = (db.session.query(Post.id)
                   .filter(Post.place_id == Place.id)
                   .correlate(Place)
                   .order_by(post_popularity.desc())
                   .limit(1)
                   .as_scalar())

deduped_posts = (db.session.query(Post, post_popularity)
                 .join(Place)
                 .filter(Post.id == popular_post_id)
                 .order_by(post_popularity.desc(), Post.timestamp.desc())
                 .all())
I can't speak to the runtime performance with large data sets, and there may be a better solution, but that's what I managed to synthesize from quite a few sources (MySQL JOIN with LIMIT 1 on joined table, SQLAlchemy - subquery in a WHERE clause, SQLAlchemy Query documentation). The biggest complicating factor is that you apparently need to use as_scalar to nest the subqueries in the right places, and therefore can't return both the Post id and the count from the same subquery.
FWIW, this is kind of a behemoth, and I concur with user1675804 that SQLAlchemy code this deep is hard to grok and not very maintainable. You should take a hard look at any lower-tech solutions available, like adding columns to the db or doing more of the work in Python code.
I don't want to sound like the bad guy here, but... in my opinion your approach to the issue seems far less than optimal. If you're using PostgreSQL you could simplify the whole thing using WITH ..., but a better approach, factoring in my assumption that these posts will be read much more often than they are updated, would be to add some columns to your tables that are updated by triggers on inserts/updates to the other tables. If performance is ever likely to become an issue, that is the solution I'd go with.
I'm not very familiar with SQLAlchemy, so I can't write it in clear code for you, but the only other solution I can come up with uses at least one subquery to select the order_by values for each of the columns in the group_by, and that will add significantly to your already slow query.
