Django: How do I avoid unnecessary SQL statements?

I'm optimizing a slow page load in our (first) Django project. The overall project does test status management, so there are protocols which have cases which have planned executions. Currently the code is:
protocols = Protocol.active.filter(team=team, release=release)
cases = Case.active.filter(protocol__in=protocols)
caseCount = cases.count()
plannedExecs = Planned_Exec.active.filter(case__in=cases, team=team, release=release)
# Start aggregating test suite information
# pgi Model
testSuite['pgi_model'] = []
for pgi in PLM.objects.filter(release=release).values('pgi_model').distinct():
    plmForPgi = PLM.objects.filter(pgi_model=pgi['pgi_model'])
    peresults = plannedExecs.filter(plm__in=plmForPgi).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi['pgi_model'], "", "", peresults, 0))
# Browser
testSuite['browser'] = []
for browser in BROWSER_OPTIONS:
    peresults = plannedExecs.filter(browser=browser[0]).count()
    try:
        testSuite['browser'].append((browser[1], "", "", peresults, int(peresults/float(testlistCount)*100)))
    except ZeroDivisionError:
        testSuite['browser'].append((browser[1], "", "", peresults, 0))
# ... more different categories are aggregated below, then the report is generated...
This code makes a lot of SQL statements. The PLM.objects.filter(release=release).values('pgi_model').distinct() returns a list of 50 strings, and the two filter operations both execute an SQL statement for each string, meaning 100 SQL statements for just this for loop. (Also, it seems like that should use values_list with flat=True.)
Since I want to get information about relevant cases and plannedExecutions, I think I really only need to retrieve those two tables, then perform some analysis on that. Using filter and count() seemed like the obvious solution at the time, but I'm wondering if I wouldn't be better off just building a dict of relevant case and plannedExecution information using .values() and then analyzing that instead, so as to avoid unnecessary SQL statements. Any helpful advice? Thanks!
Edit: In trying to profile this to understand where the time goes, I'm using Django Debug Toolbar. It shows that there are over 200 queries, each of which runs extremely quickly, so that overall they account for very little time. However, could it be that the execution of the SQL is relatively quick, but the overhead of building the ORM objects adds up, given that it happens over 200 times? I refactored a previous page which took 3 minutes to load, and used values() instead of the ORM, thus getting the page load down to 2.7 seconds and 5 SQL statements.

Creating a queryset does not hit the database; only accessing results from it does. Accordingly, merely creating querysets is not your issue.
Note that passing a queryset to another queryset does not create two queries; the inner queryset is compiled into a subquery of the outer one. Accordingly, building dicts will not reduce the number of database hits.
If you can build up dicts, it may be that you manage to create a simpler query than you would otherwise, which would speed up the actual query execution. That is something of a separate issue, however.
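To make the laziness concrete, here is a small sketch using the querysets from the question (str(qs.query) is standard Django for inspecting the SQL a queryset would run; nothing below touches the database until count()):

protocols = Protocol.active.filter(team=team, release=release)
cases = Case.active.filter(protocol__in=protocols)

# Still no query has run: Django compiles the nested queryset into one
# SELECT ... WHERE protocol_id IN (SELECT ...) statement.
print(cases.query)

# The database is hit only here, once.
caseCount = cases.count()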

This strikes me as a case for reverse foreign key lookups. We should be able to reduce the top for loop by getting all the pgi_models associated with PLMs in the release. I assume you have a model PGI, to which the PLM model has a foreign key field named pgi_model. If that is the case, you can find the PGIs in a PLM release with the following. You still have a loop, but its iterations should, in theory, be reduced:
pgis = PGI.objects.filter(plm__in=PLM.objects.filter(release=release)).distinct()
for pgi in pgis:
    # count the planned executions whose PLM points at this PGI
    peresults = plannedExecs.filter(plm__pgi_model=pgi).count()
    if peresults > 0:
        try:
            testSuite['pgi_model'].append((pgi, "", "", peresults, int(peresults/float(testlistCount)*100)))
        except ZeroDivisionError:
            testSuite['pgi_model'].append((pgi, "", "", peresults, 0))
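Not part of either answer above, but if the goal is purely to cut the number of statements, Django's aggregation API can fold the whole pgi_model loop into a single GROUP BY query. This is only a sketch, assuming plm is the Planned_Exec foreign key already used above and pgi_model is the field on PLM you want to group by:

from django.db.models import Count

# One query: group the planned executions by their PLM's pgi_model and
# let the database do the counting; only non-zero groups come back.
counts = plannedExecs.values('plm__pgi_model').annotate(total=Count('pk'))

for row in counts:
    peresults = row['total']
    try:
        percent = int(peresults / float(testlistCount) * 100)
    except ZeroDivisionError:
        percent = 0
    testSuite['pgi_model'].append((row['plm__pgi_model'], "", "", peresults, percent))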

Related

Struggling with how to iterate data

I am learning Python 3 and I have a fairly simple task to complete, but I am struggling with how to glue it all together. I need to query an API and return the full list of applications, which I can do; I store this and need to use it again to gather more data for each application from a different API call.
applistfull = requests.get(url, authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
else:
    print(applistfull.status_code)
I now have, I think, 'summaryguid', and I need to query a different API and return a value that could exist many times for each application; in this case, the compiler used to build the code.
I can statically put a GUID in the URL and return the correct information, but I haven't yet figured out how to get it to do the below for all of the above and build a master list:
summary = requests.get(f"url{summaryguid}moreurl",authmethod)
if summary.ok:
fulldata = summary.json()
for appsummary in fulldata["static-analysis"]["modules"]["module"]:
print(appsummary["compiler"])
I would prefer that someone not just type out the right answer yet, but rather drop a few hints and let me continue to work through it logically, so I learn how to deal with what I assume is a common issue in the future. My thought right now is that I need to move my second if up into my initial block and continue the logic in that space, but I'm stuck there.
You are on the right track! Here is the hint: the second API request can be nested inside the loop that iterates through the list of applications from the first API call. That way the second API call is made once for each application, and you can gather the information you need.
import requests

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    data = applistfull.json()
    for app in data["_embedded"]["applications"]:
        print(app["profile"]["name"], app["guid"])
        summaryguid = app["guid"]
        summary = requests.get(f"url/{summaryguid}/moreurl", authmethod)
        fulldata = summary.json()
        for appsummary in fulldata["static-analysis"]["modules"]["module"]:
            print(app["profile"]["name"], appsummary["compiler"])
else:
    print(applistfull.status_code)
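Once that clicks, one way to turn the prints into the "master list" you mentioned is to accumulate the results in a dict keyed by application name. A rough sketch under the same assumptions (the URLs and authmethod are placeholders, as above):

import requests

compilers_by_app = {}

applistfull = requests.get("url", authmethod)
if applistfull.ok:
    for app in applistfull.json()["_embedded"]["applications"]:
        name = app["profile"]["name"]
        summary = requests.get(f"url/{app['guid']}/moreurl", authmethod)
        if summary.ok:
            modules = summary.json()["static-analysis"]["modules"]["module"]
            # collect every compiler reported for this application
            compilers_by_app[name] = [m["compiler"] for m in modules]
else:
    print(applistfull.status_code)

print(compilers_by_app)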

What is the return of an UPDATE query?

I'm using sqlalchemy in combination with sqlite and the databases library, and I'm trying to wrap my head around what that combination returns when doing UPDATE queries. I'm running a testcase, and I have the databases connection set up to roll back after each testcase via force_rollback=True.
db = databases.Database(DB_URL, force_rollback=True)
query = update(my_table).where(my_table.columns.id == some_id_to_update).values(**values)
res = await db.execute(query)
When working with psql, I'd expect res to be the number of rows affected by the UPDATE query, but from reading the documentation, sqlite seems to behave differently in that it doesn't return anything. I tested this manually by connecting to the database via sqlite3 and, as expected, there is no output when doing UPDATE queries. sqlalchemy, however, does return something, which I assume is the total number of rows in the table, but I'm not sure. Can anybody shed some light on what is actually returned?
What's more, when I tried to get the number of rows affected by the UPDATE query via SELECT changes(), I also got the total number of rows in the table rather than the rows affected by the most recent query. Do I have a misunderstanding of what changes() does?
"The changes() function returns the number of database rows that were changed or inserted or deleted by the most recently completed INSERT, DELETE, or UPDATE statement, exclusive of statements in lower-level triggers."
When you use the Python sqlite3 module, you use the .executeXXX() interfaces to run your query. If the query is supposed to modify the database, it does so at this stage. You use the same interface to run a SELECT statement, but in either case .executeXXX() does not hand back the query's result (it returns a cursor object). To get the result of a SELECT query, you have to use a .fetchXXX() interface after running .executeXXX().
To get the number of rows changed by an INSERT, DELETE, or UPDATE statement via sqlite3, you can also take the difference of con.total_changes before and after running .executeXXX().
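For illustration, a self-contained sqlite3 session that shows both the cursor's rowcount and the total_changes delta after an UPDATE, and that the UPDATE itself yields no result rows:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO t (val) VALUES (?)", [("a",), ("b",), ("c",)])

before = con.total_changes
cur = con.execute("UPDATE t SET val = 'x' WHERE id >= 2")

print(cur.rowcount)                # rows affected by this UPDATE -> 2
print(con.total_changes - before)  # same number via total_changes -> 2
print(cur.fetchall())              # an UPDATE produces no result rows -> []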

How to improve performance of a script operating on large amount of data?

My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, so that more data needs to be stored, I noticed performance issues: the script now computes data on average two to even ten times slower (the only thing that changed is the amount of data to be stored and later retrieved to be modified).
I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage, but unfortunately the docs only mention how to configure frameworks such as Zope or Plone. I'm using ZODB only.
I wonder if RelStorage would be faster in my case.
Here's how I currently setup ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear to me that ZODB is currently the bottleneck of my script.
I'm looking for advice on how I could solve this problem.
I chose ZODB because I thought that a NoSQL database would better fit my case, and I liked the idea of an interface similar to Python's dict.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually built as follows:
actions_values = {  # BTree
    str(state): {  # BTree
        # contains actions (the column to pick, to be exact, as I'm working on an
        # agent playing Connect 4) and their values (only actions previously taken
        # by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
state is a simple 2D array representing the game board. Possible values of its fields are 1, 2 or None:
board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000

player = ReinforcementPlayer(dbroot.actions_values, config)

while dbroot.games_played < should_play:
    # max_epsilon at start, then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(Packing also takes a lot of time, but it's not the main problem; I could pack the database after the program finishes.)
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(The functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards which are vertical reflections of each other are equivalent.)
After 550000 games len(dbroot.actions_values) is 6018450.
According to iotop IO operations take 90% of the time.
Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (PostgreSQL plus SQL scripts) it might help, as the database engine has more information to make intelligent choices about how to execute the code, but there is nothing magical here, and most of the same things can quite easily be done with ZODB.
Some ideas what can be done:
Have indexes of your data instead of loading full objects (loading everything is the equivalent of a SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have the __slots__ trick; see the sketch after this list).
Use transactions in an intelligent fashion. Don't try to process all the data in a single big chunk.
Parallel processing - use all CPU cores instead of a single-threaded approach.
Don't use BTrees - maybe there is something more efficient for your use case
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.
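To illustrate the __slots__ item from the list above, here is a minimal, generic sketch with made-up class names (plain Python, not tied to ZODB's Persistent; whether it applies depends on how your per-state objects are modelled):

import sys

class ActionValue:
    # Without __slots__, every instance carries its own __dict__.
    def __init__(self, action, value):
        self.action = action
        self.value = value

class SlimActionValue:
    # __slots__ replaces the per-instance __dict__ with fixed slots,
    # which noticeably shrinks millions of small objects.
    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value

print(sys.getsizeof(ActionValue(1, 0.4356).__dict__))  # size of the per-instance dict alone
print(sys.getsizeof(SlimActionValue(1, 0.4356)))       # slotted instance, no __dict__ at all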
Just to be clear here, which BTree class are you actually using? An OOBTree?
A few aspects of those BTrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they hold currently, but I did once try tweaking their C code and recompiling to hold a larger number, as the value was chosen nearly two decades ago.
2) It is sometimes possible to construct very unbalanced BTrees. E.g. if you add values in sorted order (such as a timestamp that only ever increases), you will end up with a tree that is O(n) to search. There was a script written by the folks at Jarn a number of years ago that could rebalance the BTrees in Zope's Catalog, which might be adaptable for you.
3) Rather than using an OOBTree you can use an OOBucket instead. This will end up being just a single pickle in the ZODB, so it may end up too big for your use case, but if you are doing all the writes in a single transaction then it may be faster (at the expense of having to re-write the entire Bucket on an update).
-Matt
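For reference, both classes Matt mentions come from the BTrees package that ZODB uses. A minimal sketch of how they are used (they share the same mapping interface; the difference is in how they are stored, as described above):

from BTrees.OOBTree import OOBTree, OOBucket

# An OOBTree is persisted as many separate buckets, so a change touches only
# the affected bucket(s); an OOBucket is a single persistent record that is
# re-pickled whole on every change.
tree = OOBTree()
bucket = OOBucket()

for container in (tree, bucket):
    container['some-state-string'] = {1: 0.4356, 5: 0.3456}

print(list(tree.items()))
print(list(bucket.items()))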

Django find_each (like RoR)

Is there a way to use something like find_each in Django?
According to the rails documentation:
This method is only intended to use for batch processing of large
amounts of records that wouldn’t fit in memory all at once. If you
just need to loop over less than 1000 records, it’s probably better
just to use the regular find methods.
http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_each
Thanks.
One possible solution could be to use the built-in Paginator class (could save a lot of hassle).
https://docs.djangoproject.com/en/dev/topics/pagination/
Try something like:
from django.core.paginator import Paginator
from yourapp.models import YourModel

result_query = YourModel.objects.filter(<your find conditions>)
paginator = Paginator(result_query, 1000)  # the desired batch size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can add your required code
        pass
Or you could use queryset limiting (slicing) as needed to iterate over the results.
You can query parts of the whole table in a loop by slicing the queryset.
If you're working with DEBUG = True, it's important that you clear Django's query log after each loop, since otherwise this can cause memory issues (Django stores all the queries that were run until the script finishes or dies).
If you need to restrict the results of the queryset, you can replace ".all()" with the appropriate ".filter(conditions)".
from django import db
from myapp.models import MyModel

# Getting the total number of records in the table
total_count = MyModel.objects.all().count()
chunk_size = 1000  # You can change this to any amount you can keep in memory
total_checked = 0

while total_checked < total_count:
    # Querying all the objects and slicing only the part you need to work
    # with at the moment (only that part will be loaded into memory)
    query_set = MyModel.objects.all()[total_checked:total_checked + chunk_size]
    for item in query_set:
        # Do what you need to do with your results
        pass
    total_checked += chunk_size
    # Clearing Django's query log to avoid a memory leak
    db.reset_queries()
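Depending on your Django version, QuerySet.iterator() is also worth a look as a closer analogue to find_each: it streams results from the database cursor instead of caching the whole result set on the queryset (the chunk_size argument exists from Django 2.0 onward). A sketch with the same hypothetical model as above:

from myapp.models import MyModel

# Stream rows in chunks instead of loading and caching the entire
# result set; add .filter(...) conditions as needed.
for item in MyModel.objects.all().iterator(chunk_size=1000):
    # Do what you need to do with your results
    pass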

Datastore performance, my code or the datastore latency

For the last month I've had a bit of a problem with a quite basic datastore query. It involves two db.Models, one referring to the other via a db.ReferenceProperty.
The problem is that, according to the admin logs, the request takes about 2-4 seconds to complete. I stripped it down to a bare form and a list to display the results.
The put works fine, but the get accumulates (in my opinion) way too much CPU time.
# The get looks like this:
outputData['items'] = {}
labelsData = Label.all()
for label in labelsData:
    labelItem = label.item.name
    if labelItem not in outputData['items']:
        outputData['items'][labelItem] = {'item': labelItem, 'labels': []}
    outputData['items'][labelItem]['labels'].append(label.text)
path = os.path.join(os.path.dirname(__file__), 'index.html')
self.response.out.write(template.render(path, outputData))
# And the models:
class Item(db.Model):
    name = db.StringProperty()

class Label(db.Model):
    text = db.StringProperty()
    lang = db.StringProperty()
    item = db.ReferenceProperty(Item)
I've tried building it a number of different ways, e.g. instead of the ReferenceProperty, storing all Label keys in the Item model as a db.ListProperty.
My test data is just 10 rows in Item and 40 in Label.
So my question: is it a fool's errand to try to optimize this, because the high CPU usage is due to problems with the datastore, or have I just screwed up somewhere in the code?
..fredrik
EDIT:
I got a great response from djidjadji on the Google App Engine mailing list.
The new code looks like this:
outputData['items'] = {}
labelsData = Label.all().fetch(1000)
labelItems = db.get([Label.item.get_value_for_datastore(label) for label in labelsData])
for label, labelItem in zip(labelsData, labelItems):
    name = labelItem.name
    try:
        outputData['items'][name]['labels'].append(label.text)
    except KeyError:
        outputData['items'][name] = {'item': name, 'labels': [label.text]}
There's certainly things you can do to optimize your code. For example, you're iterating over a query, which is less efficient than fetching the query and iterating over the results.
I'd recommend using Appstats to profile your app, and check out the Patterns of Doom series of posts.
Don't just try things. That's guessing. You'll only be right some of the time. Don't ask other people to guess either, for the same reason.
Be right every time.
Just pause the code several times and look at the call stack. That will tell you exactly what's going on.
