I am very new to working with SQL queries. Any suggestions to improve this bit of code:
(by the way, I really don't care about sql security here; this is a bit of code that will be in a pyexe file connecting to a local sqlite file - so it doesnt make sense to worry about security of the query here).
def InitBars(QA = "GDP1POP1_20091224_gdp", QB = "1 pork", reset = False):
global heights, values
D, heights, values, max, = [], {}, {}, 0.0001
if reset: GHolder.remove()
Q = "SELECT wbcode, Year, "+QA+" FROM DB WHERE commodity='"+QB+"' and "+QA+" IS NOT 'NULL'"
for i in cursor.execute(Q):
D.append((str(i[0]) + str(i[1]), float(i[2])))
if float(i[2]) > max: max = float(i[2])
for (i, n) in D: heights[i] = 5.0 / max * n; values[i] = n
Gui["YRBox_Slider"].set(0.0)
Gui["YRBox_Speed"].set(0.0)
after following the advices, this is what I got:
def InitBars(QA = "GDP1POP1_20091224_gdp", QB = "1 pork", reset = False):
global heights, values; D, heights, values, max, = [], {}, {}, 0.0001
if reset: GHolder.remove()
Q = "SELECT wbcode||Year, %s FROM DB WHERE commodity='%s' and %s IS NOT 'NULL'" % (QA, QB, QA)
for a, b in cursor.execute(Q):
if float(b) > max: max = float(b)
values[a] = float(b)
for i in values: heights[i] = 5.0 / max * values[i]
Gui["YRBox_Slider"].set(0.0); Gui["YRBox_Speed"].set(0.0)
If this is a one-off script where you totally trust all of the input data and you just need to get a job done, then fine.
If this is part of a system, and this is indicative of the kind of code in it, there are several problems:
Don't construct SQL queries by appending strings. You said that you don't care about security, but this is such a big problem and so easily solved, then really -- you should do it right all of the time
This function seems to use and manipulate global state. Again, if this is a small one-time use script, then go for it -- in systems that span just a few files, this becomes impossible to maintain.
Naming conventions --- not following any consistency in capitalization
Names of things are not helpful at all. QA, D, QB, -- QA and QB don't even seem to be the same kind of thing -- one is a field, and the other is a value.
All kinds of questionable things are uncommented -- why is max .0001? What the heck is GHolder? What could that loop be doing at the end? Really, the code should be clearer, but if not, throw the maintainer a bone.
Use more descriptive variable names than QA and QB.
Comment the code.
Don't put multiple statements in the same line
Try not to use globals. Use member variables instead.
if QA and QB may come from user input, don't use them to build SQL queries
You should check for SQL injection. Make sure that there's no SQL statement in QA. Also you should probably add slashes if it applies.
Use
Q = "SELECT wbcode, Year, %s FROM DB WHERE commodity='%s' and %s IS NOT 'NULL'" % (QA, QB, QA)
instead:
Q = "SELECT wbcode, Year, "+QA+" FROM DB WHERE commodity='"+QB+"' and "+QA+" IS NOT 'NULL'"
Care about security (sql injection).
Look at any ORM (SqlAlchemy, for example). It makes things easy :)
Related
I have a SQL query I execute, and it comes into my Python program ~500ms (about 100k rows).
I want to quickly insert this into redis, but it currently takes ~6sec, even with piping.
pipe = r.pipeline()
for row in q:
pipe.zincrby(SKEY, row["name"], 1)
pipe.execute()
Is there a way to speed this up?
The problem is you insert a large number of items in a sorted set. Redis doc says that the time complexity of zincrby is O(log(N)) where N is the number of elements in the sorted set. So the more items you insert, the longer it takes. You probably should rethink the way you use Redis in this case. Maybe the sorted set is not the best answer to your use case.
In general there's no way to speed this up from redis's perspective, but there are two things you can do:
1 If keys repeat themselves, try reducing the number of rows by summing up the names before calling redis. i.e.:
d = dict()
for row in q:
name = row["name"]
d[name] = d.get(name, 0) + 1
and then if you have recurring ids, you'll make less queries in redis.
2 Another thing I would try it to call execute() every say 1000 or 5000 commands or so, that way redis would not be blocking for other callers while this is executed, and python itself would allocate less memory, which might speed things up.
e.g. (combined with the above):
d = dict()
for row in q:
name = row["name"]
d[name] = d.get(name, 0) + 1
pipe = r.pipeline()
for i, (k, v) in enumerate(d.iteritems()):
pipe.zincrby(SKEY, k, v)
if i > 0 and i % 5000 == 0:
pipe.execute()
pipe.execute()
It said that
Record.objects.order_by('?')[:n]
have performance issues, and recommend doing something like this: (here)
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Since that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea about when these code running what is django actually doing in background. Please tell me the one-line-code at last is more efficient or not? why?
================Edit 2013-5-12 23:21 UCT+8 ========================
I spent my whole afternoon to do this test.
My computer : CPU Intel i5-3210M RAM 8G
System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6
What I did was:
Create an app.
build a simple model with a index and a CharField content, then Syncdb.
Create 3 views can get a random set with 20 records in 3 different ways above, and output the time used.
Modify settings.py that Django can output log into console.
Insert rows into table, untill the number of the rows is what I want.
Visit the 3 views, note the SQL Query statement, SQL time, and the total time
repeat 5, 6 in different number of rows in the table.(10k, 200k, 1m, 5m)
This is views.py:
def test1(request):
start = datetime.datetime.now()
result = Record.objects.order_by('?')[:20]
l = list(result) # Queryset是惰性的,强制将Queryset转为list
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start).microseconds/1000))
def test2(request):
start = datetime.datetime.now()
sample = random.sample(xrange(Record.objects.count()),20)
result = [Record.objects.all()[i] for i in sample]
l = list(result)
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start)
def test3(request):
start = datetime.datetime.now()
result = random.sample(Record.objects.all(),20)
l = list(result)
end = datetime.datetime.now()
return HttpResponse("time: <br/> %s" % (end-start)
As #Yeo said,result = random.sample(Record.objects.all(),n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] always better then others, especially the table smaller then 1m rows. Here is the data:
and the charts:
So, what's happened?
In the last test, 5,195,536 rows in tatget table, result = random.sample(Record.objects.all(),n) actually did ths:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Every one is right. And it used 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did ths:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you see, get one row, cost 3 seconds. I find that the larger index, the more time needed.
But... why?
My think is:
If there is some way can speed up the large index query,
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best. Except(!) the table is smaller then 1m rows.
The problem with .order_by(?) is that under the hood it does ORDER BY RAND() (or equivalent, depending on DB) which basically has to create a random number for each row and do the sorting. This is a heavy operation and requires lots of time.
On the other hand doing Record.objects.all() forces your app to download all objects and then you choose from it. It is not that heavy on the database side (it will be faster then sorting) but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).
However it may still be inefficient since .count might be slow (as usual: depends on DB).
Record.objects.count() gets translated into very light SQL Query.
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL Query.
SELECT * FROM TABLE LIMIT 1
Record.objects.all() usually the results get slice off to increase the performance
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure.
SELECT * FROM TABLE
Thus, any time you convert a Queryset into a list, that's where the expensive happened
In your example, random.sample() will convert into a list. (If I'm not wrong).
Thus when you do result = random.sample(Record.objects.all(),n) it will do the Full Queryset and convert into a list and then random pick the list.
Just imagine if you have millions of records. Are you going to query and store it into a list with millions element? or would you rather query one by one
I'm trying to do some text processing on around 200,000 entries in a SQlite database which I'm accessing using SQLAlchemy. I'd like to parallelize it (I'm looking at Parallel Python), but I'm not sure how exactly to do it.
I want to commit the session each time an entry is processed, so that if I need to stop the script I won't lose the work it's already done. However, when I try to pass the session.commit() command to the callback function, it does not seem to work.
from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring
def matchIng(rawIng, ingreds):
maxScore = 0
choice = ""
for (ingred, parentIng) in ingreds.iteritems():
score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
if score > maxScore:
maxScore = score
choice = ingred
refIng = parentIng
return (refIng, choice, maxScore)
def callbackFunc(match, session, inputTuple):
print inputTuple
match.refIng_id = inputTuple[0]
match.refIng_name = inputTuple[1]
match.matchScore = inputTuple[2]
session.commit()
# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)
if len(sys.argv) > 1:
ncpus = int(sys.argv[1])
# Creates jobserver with ncpus workers
job_server = pp.Server(ncpus, ppservers=ppservers)
else:
# Creates jobserver with automatically detected number of workers
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
ingreds[synonym] = parentIng
jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
rawIng = match.ingredient
jobs.append((match, job_server.submit(matchIng,(rawIng,ingreds), (fuzzy_substring,),callback=callbackFunc,callbackargs=(match,session))))
The session is imported from assignDB. I'm not getting any error, it's just not updating the database.
Thanks for your help.
UPDATE
Here is the code for fuzzy_substring
def fuzzy_substring(needle, haystack):
"""Calculates the fuzzy match of needle in haystack,
using a modified version of the Levenshtein distance
algorithm.
The function is modified from the levenshtein function
in the bktree module by Adam Hupp"""
m, n = len(needle), len(haystack)
# base cases
if m == 1:
return not needle in haystack
if not n:
return m
row1 = [0] * (n+1)
for i in range(0,m):
row2 = [i+1]
for j in range(0,n):
cost = ( needle[i] != haystack[j] )
row2.append( min(row1[j+1]+1, # deletion
row2[j]+1, #insertion
row1[j]+cost) #substitution
)
row1 = row2
return min(row1)
which I got from here: Fuzzy Substring. In my case, "needle" is one of ~8000 possible choices, while haystack is the raw string I'm trying to match. I loop over all possible "needles" and choose the one with the best score.
Without looking at your specific code, it can be fairly said that:
Using serverless SQLite and
Seeking increased write performance through paralleism
are mutually incompatible desires. Quoth the SQLite FAQ:
… However, client/server database engines (such as PostgreSQL, MySQL,
or Oracle) usually support a higher level of concurrency and allow
multiple processes to be writing to the same database at the same
time. This is possible in a client/server database because there is
always a single well-controlled server process available to coordinate
access. If your application has a need for a lot of concurrency, then
you should consider using a client/server database. But experience
suggests that most applications need much less concurrency than their
designers imagine. …
And that's even without whatever gating and ordering SQLAlchemy uses. It is also not clear at all when — if at all — the Parallel Python jobs are completing.
My suggestion: get it working correctly first and then look for optimizations. Especially when the pp secret sauce might not be buying you much at all even if it was working perfectly.
added in response to comment:
If fuzzy_substring matching is the bottleneck it appears completely decoupled from the database access and you should keep that in mind. Without seeing what fuzzy_substring is doing, a good starting assumption is that you can make algorithmic improvements which may make the single-threaded programming computationally feasible. Approximate string matching is a very well studied problem and choosing the right algorithm is often far better than "throw more processors at it".
Far better in this sense is that you have cleaner code, don't waste the overhead of segmenting and reassembling the problem, have a more extensible and debuggable program at the end.
#msw has provided an excellent overview of the problem, giving a general way to think about parallelization.
Notwithstanding these comments, here is what I got to work in the end:
from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring
def matchIng(rawIng, ingreds):
maxScore = 0
choice = ""
for (ingred, parentIng) in ingreds.iteritems():
score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
if score > maxScore:
maxScore = score
choice = ingred
refIng = parentIng
return (refIng, choice, maxScore)
# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)
if len(sys.argv) > 1:
ncpus = int(sys.argv[1])
# Creates jobserver with ncpus workers
job_server = pp.Server(ncpus, ppservers=ppservers)
else:
# Creates jobserver with automatically detected number of workers
job_server = pp.Server(ppservers=ppservers)
print "Starting pp with", job_server.get_ncpus(), "workers"
ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
ingreds[synonym] = parentIng
rawIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None)
numIngredients = session.query(Ingredient).filter(Ingredient.refIng_id == None).count()
stepSize = 30
for i in range(0, numIngredients, stepSize):
print i
print numIngredients
if i + stepSize > numIngredients:
stop = numIngredients
else:
stop = i + stepSize
jobs = []
for match in rawIngredients[i:stop]:
rawIng = match.ingredient
jobs.append((match, job_server.submit(matchIng,(rawIng,ingreds), (fuzzy_substring,))))
job_server.wait()
for match, job in jobs:
inputTuple = job()
print match.ingredient
print inputTuple
match.refIng_id = inputTuple[0]
match.refIng_name = inputTuple[1]
match.matchScore = inputTuple[2]
session.commit()
Essentially, I've chopped the problem into chunks. After matching 30 substrings in parallel, the results are returned and committed to the database. I chose 30 somewhat arbitrarily, so there might be gains to be had in optimizing that number. It seems to have sped up a fair bit, as I'm using all 3(!) of the cores in my processor now.
I have a productdatabase that contains products, parts and labels for each part based on langcodes.
The problem I'm having and haven't got around is a huge amount of resource used to get the different datasets and merging them into a dict to suit my needs.
The products in the database are based on a number of parts that is of a certain type (ie. color, size). And each part has a label for each language. I created 4 different models for this. Products, ProductParts, ProductPartTypes and ProductPartLabels.
I've narrowed it down to about 10 lines of code that seams to generate the problem. As of currently I have 3 Products, 3 Types, 3 parts for each type, and 2 languages. And the request takes a wooping 5500ms to generate.
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if not partData:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
## Start of problem lines ##
for defaultPart in product.defaultPartsData:
for label in labelsForLangCode:
if label.key() in defaultPart.partLabelList:
typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel
for optionalPart in product.optionalPartsData:
for label in labelsForLangCode:
if label.key() in optionalPart.partLabelList:
typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
## end problem lines ##
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
I guess the problem lies in the number of for loops is too many and have to iterate over the same data over and over again. labelForLangCode get all labels from ProductPartLabels that match the current langCode.
All parts for a product is stored in a db.ListProperty(db.key). The same goes for all labels for a part.
The reason I need the some what complex dict is that I want to display all data for a product with it's default parts and show a selector for the optional one.
The defaultPartsData and optionaPartsData are properties in the Product Model that looks like this:
#property
def defaultPartsData(self):
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts)
#property
def optionalPartsData(self):
return ProductParts.gql('WHERE __key__ IN :key', key = self.optionalParts)
When the completed dict is in the memcache it works smoothly, but isn't the memcache reset if the application goes in to hibernation? Also I would like to show the page for first time user(memcache empty) with out the enormous delay.
Also as I said above, this is only a small amount of parts/product. What will the result be when it's 30 products with 100 parts.
Is one solution to create a scheduled task to cache it in the memcache every hour? It this efficient?
I know this is alot to take in, but I'm stuck. I've been at this for about 12 hours straight. And can't figure out a solution.
..fredrik
EDIT:
A AppStats screenshoot here.
From what I can read the queries seams fine in AppStats. only taking about 200-400 ms. How can the difference be that big?
EDIT 2:
I implemented dound's solution and added abit. Now it looks like this:
langCode = 'en'
typeData = Products.ProductPartTypes.all()
productData = Products.Product.all()
labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode = langCode)
productList = []
label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
labelData = memcache.get(label_cache_key)
if labelData is None:
langDict = {}
for langLabel in labelsForLangCode:
langDict[str(langLabel.key())] = langLabel.partLangLabel
memcache.add(label_cache_key, langDict, 500)
labelData = memcache.get(label_cache_key)
GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if partData is None:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
GQL_PARTS_BY_PRODUCT.bind(product)
parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
for part in parts:
for lb in part.partLabelList:
if str(lb) in labelData:
label = labelData[str(lb)]
break
if part.key() in product.defaultParts:
typeDict[part.type.typeId]['default'] = label
elif part.key() in product.optionalParts:
typeDict[part.type.typeId]['optional'].append(label)
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
The result is much better. I now have about 3000ms with out memcache and about 700ms with.
I'm still abit worried about the 3000ms, and on the local app_dev server the memcache gets filled up for each reload. Shouldn't put everything in there and then read from it?
Last but not least, does anyone know why the request take about 10x as long on the production server the the app_dev?
EDIT 3:
I noticed that non of the db.Model are indexed, could this make a differance?
EDIT 4:
After consulting AppStats (And understanding it, took some time. It seams that the big problems lies within part.type.typeId where part.type is a db.ReferenceProperty. Should have seen it before. And maybe explained it better :) I'll rethink that part. And get back to you.
..fredrik
A few simple ideas:
1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)
2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.
3) Like Will pointed out, IN queries perform one query PER argument.
Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:
class ProductParts(db.Model):
...
products = db.ListProperty(db.Key) # product keys
...
Then you can do ONE query to per product instead of N queries per product. For example, you could do this:
parts = ProductParts.all().filter("products =", product).fetch(1000)
The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).
Less critical improvement idea:
4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:
GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')
#property
def defaultPartsData(self):
return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)
You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.
Here is what the code might look like if you re-wrote it fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).
GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if not partData:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
# here's a new approach that does just ONE datastore query (for each product)
GQL_PARTS_BY_PRODUCT.bind(product)
parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
for part in parts:
if part.key() in self.defaultParts:
part_type = 'default'
else:
part_type = 'optional'
for label in labelsForLangCode:
if label.key() in defaultPart.partLabelList:
typeDict[defaultPart.type.typeId][part_type] = label.partLangLabel
# (end new code)
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.
So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.
Here's the relevant section from the GQL Reference:
Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
You might try installing AppStats for your app to see where else it might be slowing down.
I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.
GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.
This is a query that totals up every players game results from a game and displays the players who match the conditions.
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(* ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
dotagames as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dg.winner <> 0
and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h
order by totalscore desc
Now I'm not too sure what's the best way to go but what would in your opinion be to optimize this query?
I run a Q6600 # 2.4ghz, 4gb of ram, 64-bit Linux Ubuntu 9.04 system and this query can take up to 6.7 seconds to run (I do have a huge database).
Also I would like to paginate the results as well and executing extra conditions on top of this query is far too slow....
I use django as a frontend so any methods that include using python +/- django methods would be great. MySQL, Apache2 tweaks are also welcome. And of course, I'm open to changing the query to make it run faster.
Thanks for reading my question; look forward to reading your answers!
Edit: EXPLAIN QUERY RESULTS
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 783 Using filesort
2 DERIVED sc ALL name,name_2 NULL NULL NULL 2099 Using temporary; Using filesort
2 DERIVED gp ref gameid,colour,name name 17 development.sc.name 2
2 DERIVED ga eq_ref PRIMARY,id,id_2 PRIMARY 4 development.gp.gameid 1 Using index
2 DERIVED dg ref gameid,winner gameid 4 development.ga.id 1 Using where
2 DERIVED dp ref gameid_2,colour gameid_2 4 development.ga.id 10 Using where
First of all, the SQL is badly formatted. The most obvious error is the line splitting before each AS clause. Second obvious problem is using implicit joins instead of explicitly using INNER JOIN ... ON ....
Now to answer the actual question.
Without knowing the data or the environment, the first thing I'd look at would be some of the MySQL server settings, such as sort_buffer and key_buffer. If you haven't changed any of these, go read up on them. The defaults are extremely conservative and can often be raised more than ten times their default, particularly on the large iron like you have.
Having reviewed that, I'd be running pieces of the query to see speed and what EXPLAIN says. The effect of indexing can be profound, but MySQL has a "fingers-and-toes" problem where it just can't use more than one per table. And JOINs with filtering can need two. So it has to descend to a rowscan for the other check. But having said that, dicing up the query and trying different combinations will show you where it starts stumbling.
Now you will have an idea where a "tipping point" might be: this is where a small increase in some raw data size, like how much it needs to extract, will result in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary tables size. Beware that this kind of optimization is a bit of a black art. :-)
However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts will run this expensive query from time-to-time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on-the-fly, either in the application, or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even 2 seconds cache on a request that is run 15 times a second will show a visible improvement.
You could find ways of producing the same data by running half-a-dozen queries that each return some of the data, and post-processing the data. You could also run version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster - an order of magnitude, sometimes two - than one big query that is trying to do it all.
No index will help you since you are scanning entire tables.
As your database grows the query will always get slower.
Consider accumulating the stats : after every game, insert the row for that game, and also increment counters in the player's row, Then you don't need to count() and sum() because the information is available.
select * is bad most times - select only the columns you need
break the select into multiple simple selects, use temporary tables when needed
the sum(case part could be done with a subselect
mysql has a very bad performance with or-expressions. use two selects which you union together
Small Improvement
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(1 ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
( select * from dotagames dg1 where dg.winner <> 0 ) as dg,
games as ga,
dotaplayers as dp,
scores as sc
where and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h order by totalscore desc
Changes:
1. count (*) chnaged to count(1)
2. In the FROM, The number of rows are reduced.