How to get something random in datastore (AppEngine)? - python

Currently I'm using something like this:
images = Image.all()
count = images.count()
random_numb = random.randrange(1, count)
image = Image.get_by_id(random_numb)
But it turns out that the ids in the datastore on AppEngine don't start from 1.
I have two images in datastore and their ids are 6001 and 7001.
Is there a better way to retrieve random images?

The datastore is distributed, so IDs are non-sequential: two datastore nodes need to be able to generate an ID at the same time without causing a conflict.
To get a random entity, you can attach a random float between 0 and 1 to each entity on create. Then to query, do something like this:
rand_num = random.random()
entity = MyModel.all().order('rand_num').filter('rand_num >=', rand_num).get()
if entity is None:
    entity = MyModel.all().order('rand_num').get()
Edit: Updated fall-through case per Nick's suggestion.
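To see the selection rule in isolation, here is a minimal in-memory sketch. It assumes the entities are plain (rand_num, name) tuples kept sorted by rand_num; pick_random and its optional r argument are made-up names for illustration, not datastore API.

```python
import bisect
import random

def pick_random(entities, r=None):
    """entities: list of (rand_num, name) tuples sorted by rand_num."""
    if not entities:
        return None
    if r is None:
        r = random.random()
    keys = [rand_num for rand_num, _ in entities]
    # Index of the first entity whose rand_num >= r (the filter + order above).
    i = bisect.bisect_left(keys, r)
    if i == len(entities):
        # Nothing at or above r: fall through to the smallest rand_num.
        i = 0
    return entities[i][1]

entities = sorted((random.random(), "image-%d" % n) for n in range(5))
print(pick_random(entities))
```

Passing r explicitly makes the fall-through case easy to check by hand.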

Another solution, if you don't want to add an additional property: keep a set of keys in memory.
import random
# Get all the keys, not the Entities
q = ItemUser.all(keys_only=True).filter('is_active =', True)
item_keys = q.fetch(2000)
# Get a random set of those keys, in this case 20
random_keys = random.sample(item_keys, 20)
# Get those 20 Entities
items = db.get(random_keys)
The above code illustrates the basic method for getting only keys and then creating a random set with which to do a batch get. You could keep that set of keys in memory, add to it as you create new ItemUser entities, and then have a method that returns n random entities. You'll have to implement some overhead to manage the keys in memcache. I like this solution better if you're performing the query for random elements often (I assume using a batch get for n entities is more efficient than a query for n entities).

I think Drew Sears's answer above (attach a random float to each entity on create) has a potential problem: not every item has an equal chance of getting picked. For example, if there are only 2 entities, and one gets a rand_num of 0.2499 and the other gets 0.25, one of them will get picked almost all the time (with the wrap-around query shown above it is the 0.2499 one, because the 0.25 one is only returned for draws that land in the narrow interval between the two values). This might or might not matter to your application. You could fix this by changing the rand_num of an entity every time it is selected, but that means each read also requires a write.
And pix's answer will always select the first key.
Here's the best general-purpose solution I could come up with:
num_images = Image.all().count()
offset = random.randrange(0, num_images)
image = Image.all().fetch(1, offset)[0]
No additional properties needed, but the downside is that count() and fetch() both have performance implications if the number of Images is large.
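To put numbers on the uneven-chance concern from the start of this answer, here is a small sketch. It assumes distinct rand_num values and the "first rand_num >= r, else wrap to the smallest" rule from the earlier answer; selection_probabilities is a made-up helper name.

```python
def selection_probabilities(rand_nums):
    """Chance that each stored rand_num is the one returned."""
    ordered = sorted(rand_nums)
    probs = {}
    prev = 0.0
    for value in ordered:
        # Draws landing in (prev, value] return this entity.
        probs[value] = value - prev
        prev = value
    # Draws above the largest value fall through to the smallest one.
    probs[ordered[0]] += 1.0 - ordered[-1]
    return probs

print(selection_probabilities([0.2499, 0.25]))
```

Each entity's chance is just the gap below its rand_num, so the spread only evens out as the number of entities grows.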

Another (less efficient) method, which requires no setup:
query = MyModel.all(keys_only=True)
# query.filter("...")
selected_key = None
n = 0
for key in query:
    if random.randint(0,n)==0:
        selected_key = key
    n += 1
# just in case the query is empty
if selected_key is None:
    entry = None
else:
    entry = MyModel.get(selected_key)
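The loop above is a single-item reservoir sample. Here is the same idea as a standalone sketch over any iterable, with no datastore involved (reservoir_pick is a made-up name):

```python
import random

def reservoir_pick(iterable, rng=random):
    """Return one item chosen uniformly at random, or None if empty."""
    selected = None
    for n, item in enumerate(iterable):
        # Replace the current pick with probability 1 / (n + 1).
        if rng.randint(0, n) == 0:
            selected = item
    return selected

print(reservoir_pick(["key-a", "key-b", "key-c"]))
```

The first item is always taken, the second replaces it half the time, the third a third of the time, and so on, which works out to an equal chance for every item without knowing the count up front. The cost is that every key still has to be iterated.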

Related

How to speed up insertion into Redis from SQL Query using Python

I have a SQL query I execute, and it comes into my Python program ~500ms (about 100k rows).
I want to quickly insert this into redis, but it currently takes ~6sec, even with piping.
pipe = r.pipeline()
for row in q:
    pipe.zincrby(SKEY, row["name"], 1)
pipe.execute()
Is there a way to speed this up?
The problem is that you insert a large number of items into a sorted set. The Redis documentation says that the time complexity of zincrby is O(log(N)), where N is the number of elements in the sorted set, so the more items you insert, the longer each insertion takes. You should probably rethink the way you use Redis in this case; a sorted set may not be the best fit for your use case.
In general there's no way to speed this up from redis's perspective, but there are two things you can do:
1. If keys repeat themselves, try reducing the number of rows by summing up the names before calling redis, i.e.:
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1
and then, if you have recurring names, you'll make fewer queries to redis.
2. Another thing I would try is to call execute() every 1000 or 5000 commands or so. That way redis would not be blocking other callers while this is executed, and python itself would allocate less memory, which might speed things up.
e.g. (combined with the above):
d = dict()
for row in q:
    name = row["name"]
    d[name] = d.get(name, 0) + 1
pipe = r.pipeline()
for i, (k, v) in enumerate(d.items()):
    pipe.zincrby(SKEY, k, v)
    if i > 0 and i % 5000 == 0:
        pipe.execute()
pipe.execute()
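Here is a runnable sketch of both ideas together that you can try without a Redis server. RecordingPipeline is a made-up stand-in that only records calls, and load_counts is a hypothetical helper; the zincrby argument order follows the question's code, so check it against the client version you actually use.

```python
from collections import Counter

class RecordingPipeline:
    """Hypothetical stand-in for a Redis pipeline; it only records calls."""
    def __init__(self):
        self.buffered = []
        self.batches = []

    def zincrby(self, key, member, amount):
        self.buffered.append((key, member, amount))

    def execute(self):
        # Flush whatever has been buffered as one batch.
        if self.buffered:
            self.batches.append(self.buffered)
            self.buffered = []

def load_counts(rows, pipe, key, batch_size=5000):
    # Sum repeated names first so each name costs one command.
    counts = Counter(row["name"] for row in rows)
    for i, (name, amount) in enumerate(counts.items(), start=1):
        pipe.zincrby(key, name, amount)
        if i % batch_size == 0:
            pipe.execute()
    pipe.execute()  # flush the remainder
    return counts

rows = [{"name": n} for n in ["a", "b", "a", "c", "a"]]
pipe = RecordingPipeline()
load_counts(rows, pipe, "scores", batch_size=2)
print(pipe.batches)
```

How much this helps depends on how often names repeat in the 100k rows; if every name is unique, only the batching has any effect.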

Get random record set with Django, what is affecting the performance?

It is said that
Record.objects.order_by('?')[:n]
has performance issues, and it is recommended to do something like this instead: (here)
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
In that case, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea what Django is actually doing in the background when this code runs. Is the one-line version at the end more efficient or not, and why?
================Edit 2013-5-12 23:21 UTC+8 ========================
I spent my whole afternoon running this test.
My computer: CPU Intel i5-3210M, RAM 8G
System: Win8.1 pro x64, Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12), Python2.7.5, Django1.4.6
What I did was:
1. Create an app.
2. Build a simple model with an index and a CharField content, then run syncdb.
3. Create 3 views that each get a random set of 20 records in one of the 3 ways above, and output the time used.
4. Modify settings.py so that Django outputs its log to the console.
5. Insert rows into the table until the number of rows is what I want.
6. Visit the 3 views and note the SQL query statement, the SQL time, and the total time.
7. Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
import datetime
import random

from django.http import HttpResponse

from randomrecords.models import Record  # app name assumed from the table name in the SQL log

def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; converting to a list forces evaluation
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % ((end - start).microseconds / 1000))

def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()),20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(),20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As @Yeo said, result = random.sample(Record.objects.all(),n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] was always better than the others, especially when the table was smaller than 1m rows. Here are the data and the charts:
[data table and charts not preserved]
So, what happened?
In the last test, with 5,195,536 rows in the target table, Record.objects.order_by('?')[:n] actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everyone is right: it took 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
... (20 such queries)
As you can see, getting one row costs 3 seconds. I find that the larger the index (the offset), the more time is needed.
But... why?
My thinking is:
If there is some way to speed up the query for a large index,
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best, except(!) when the table is smaller than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or the equivalent, depending on the DB), which basically has to create a random number for each row and then sort. This is a heavy operation and takes a lot of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting), but it is heavy on network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on DB).
However it may still be inefficient since .count might be slow (as usual: depends on DB).
Record.objects.count() gets translated into a very light SQL query:
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL query:
SELECT * FROM TABLE LIMIT 1
With Record.objects.all(), the results usually get sliced to improve performance:
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure:
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expensive part happens.
In your example, random.sample() will convert the QuerySet into a list (if I'm not wrong).
Thus when you do result = random.sample(Record.objects.all(),n), it will run the full query, convert the result into a list, and then pick randomly from that list.
Just imagine you have millions of records. Are you going to query them all and store them in a list with millions of elements, or would you rather query them one by one?
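If you want to try the two SQL shapes discussed in this thread without a Django project, here is a minimal sketch with the standard-library sqlite3 module. The record table, the two function names and the 1000-row size are assumptions for illustration; SQLite spells the random function RANDOM() rather than RAND(), and timings on a real MySQL table will differ.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany("INSERT INTO record (content) VALUES (?)",
                 [("row %d" % i,) for i in range(1000)])

def sample_by_offset(conn, n, rng=random):
    """COUNT(*) once, then one LIMIT 1 OFFSET i query per sampled index."""
    (count,) = conn.execute("SELECT COUNT(*) FROM record").fetchone()
    rows = []
    for offset in rng.sample(range(count), n):
        rows.append(conn.execute(
            "SELECT id, content FROM record LIMIT 1 OFFSET ?", (offset,)
        ).fetchone())
    return rows

def sample_by_order(conn, n):
    """Single query that sorts every row by a random value."""
    return conn.execute(
        "SELECT id, content FROM record ORDER BY RANDOM() LIMIT ?", (n,)
    ).fetchall()

print(sample_by_offset(conn, 3))
print(sample_by_order(conn, 3))
```

A likely reason the large-offset queries in the question were slow is that the database still has to step over all the skipped rows before it can return the one at the offset, so the cost grows with the offset.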

Python: How to speed up creating of objects?

I'm creating objects derived from a rather large txt file. My code is working properly but takes a long time to run. This is because the elements I'm looking for in the first place are not ordered and not (necessarily) unique. For example I am looking for a digit-code that might be used twice in the file but could be in the first and the last row. My idea was to check how often a certain code is used...
counter=collections.Counter([l[3] for l in self.body])
...and then loop through the counter. Advantage: if a code is only used once, you don't have to iterate over the whole file. However, you are still stuck with a lot of iterations, which makes the process really slow.
So my question really is: how can I improve my code? Another idea, of course, is to order the data first, but that could take quite long as well.
The crucial part is this method:
def get_pc(self):
    counter=collections.Counter([l[3] for l in self.body])
    # This returns something like {'187': 2, '199': 1, ...}
    pcode = []
    # loop through entries of counter
    for k,v in counter.iteritems():
        i = 0
        # find post code in body
        for l in self.body:
            if i == v:
                break
            # find first appearance of key
            if l[3] == k:
                # first encounter...
                if i == 0:
                    # ...so create object
                    self.pc = CodeCana(k,l[2])
                    pcode.append(self.pc)
                i += 1
                # make attributes
                self.pc.attr((l[0],l[1]),l[4])
                if v <= 1:
                    break
    return pcode
I hope the code explains the problem sufficiently. If not, let me know and I will expand the provided information.
You are looping over body way too many times. Collapse this into one loop, and track the CodeCana items in a dictionary instead:
def get_pc(self):
    pcs = dict()
    pcode = []
    for l in self.body:
        pc = pcs.get(l[3])
        if pc is None:
            pc = pcs[l[3]] = CodeCana(l[3], l[2])
            pcode.append(pc)
        pc.attr((l[0],l[1]),l[4])
    return pcode
Counting all items first then trying to limit looping over body by that many times while still looping over all the different types of items defeats the purpose somewhat...
You may want to consider giving the various indices in l names. You can use tuple unpacking:
for foo, bar, baz, egg, ham in self.body:
    pc = pcs.get(egg)
    if pc is None:
        pc = pcs[egg] = CodeCana(egg, baz)
        pcode.append(pc)
    pc.attr((foo, bar), ham)
but building body out of a namedtuple-based class would help in code documentation and debugging even more.
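A rough sketch of what the namedtuple version could look like. The Row field names and this bare-bones CodeCana are placeholders, since the question shows neither the real column meanings nor the real class.

```python
from collections import namedtuple

# Hypothetical field names; the real columns are not named in the question.
Row = namedtuple("Row", "foo bar baz code ham")

class CodeCana:
    """Minimal stand-in for the question's class."""
    def __init__(self, code, baz):
        self.code = code
        self.baz = baz
        self.attrs = []

    def attr(self, pair, ham):
        self.attrs.append((pair, ham))

def get_pc(body):
    pcs = {}
    for row in body:  # single pass over the data
        pc = pcs.get(row.code)
        if pc is None:
            pc = pcs[row.code] = CodeCana(row.code, row.baz)
        pc.attr((row.foo, row.bar), row.ham)
    # Dicts keep insertion order (Python 3.7+), i.e. first appearance.
    return list(pcs.values())

body = [Row(1, 2, "x", "187", "a"),
        Row(3, 4, "y", "199", "b"),
        Row(5, 6, "x", "187", "c")]
for pc in get_pc(body):
    print(pc.code, pc.attrs)
```

With named fields, row.code says what l[3] only implied, and a mistyped field name fails loudly instead of silently reading the wrong column.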

how to fetch more than 1000 entities NON keybased?

If I should be approaching this problem through a different method, please suggest so. I am creating an item-based collaborative filter. I populate the db with the LinkRating2 class, and for each link there are more than 1000 users whose ratings I need to fetch in order to perform calculations, which I then use to create another table. So I need to fetch more than 1000 entities for a given link.
For instance, let's say over 1000 users have rated 'link1'; there will then be over 1000 instances of this class for the given link property that I need to fetch.
How would I complete this example?
class LinkRating2(db.Model):
    user = db.StringProperty()
    link = db.StringProperty()
    rating2 = db.FloatProperty()

query = LinkRating2.all()
link1 = 'link string name'
a = query.filter('link = ', link1)
aa = a.fetch(1000)  ## how would I get more than 1000 for a given link1 as shown?
## the example from another post (below) gets over 1000 by key; I need a method for a subset, not by key
class MyModel(db.Expando):
    @classmethod
    def count_all(cls):
        """
        Count *all* of the rows (without maxing out at 1000)
        """
        count = 0
        query = cls.all().order('__key__')
        while count % 1000 == 0:
            current_count = query.count()
            if current_count == 0:
                break
            count += current_count
            if current_count == 1000:
                last_key = query.fetch(1, 999)[0].key()
                query = query.filter('__key__ > ', last_key)
        return count
The 1000-entity fetch limit was removed recently; you can fetch as many as you need, provided you can do so within the time limits. Your entities look like they'll be fairly small, so you may be able to fetch significantly more than 1000 in a request.
Wooble points out that the 1,000 entity limit is a thing of the past now, so you actually don't need to use cursors to do this - just fetch everything at once (it'll be faster than getting them in 1,000 entity batches too since there will be fewer round-trips to the datastore, etc.)
The 1000-entity limit was removed in version 1.3.1: http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
Old solution using cursors:
Use query cursors to fetch results beyond the first 1,000 entities:
# continuing from your code ... get ALL of the query's results:
more = aa
while len(more) == 1000:
    a.with_cursor(a.cursor())  # start the query where we left off
    more = a.fetch(1000)       # get the next 1000 results
    aa = aa + more             # copy the additional results into aa
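The same "keep fetching until a page comes back short" loop, sketched in plain Python so it can be run outside App Engine. fetch_page and fake_fetch_page are made-up stand-ins for a query plus its cursor, not datastore API.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect every result by following cursors until a short page."""
    results = []
    cursor = None
    while True:
        page, cursor = fetch_page(cursor, page_size)
        results.extend(page)
        if len(page) < page_size:
            return results

DATA = list(range(2500))

def fake_fetch_page(cursor, limit):
    # Pretend cursor: simply the index where the next page starts.
    start = cursor or 0
    page = DATA[start:start + limit]
    return page, start + len(page)

print(len(fetch_all(fake_fetch_page)))
```

When the total is an exact multiple of the page size, the loop makes one extra call that returns an empty page before it stops, which is also how the cursor loop above behaves.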

Loading datasets from datastore and merge into single dictionary. Resource problem

I have a product database that contains products, parts, and labels for each part based on langcodes.
The problem I'm having, and haven't been able to get around, is the huge amount of resources used to fetch the different datasets and merge them into a dict that suits my needs.
The products in the database are based on a number of parts that are of a certain type (i.e. color, size). Each part has a label for each language. I created 4 different models for this: Products, ProductParts, ProductPartTypes and ProductPartLabels.
I've narrowed it down to about 10 lines of code that seem to cause the problem. Currently I have 3 products, 3 types, 3 parts for each type, and 2 languages, and the request takes a whopping 5500ms to generate.
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name
    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)
    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
        ## Start of problem lines ##
        for defaultPart in product.defaultPartsData:
            for label in labelsForLangCode:
                if label.key() in defaultPart.partLabelList:
                    typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel
        for optionalPart in product.optionalPartsData:
            for label in labelsForLangCode:
                if label.key() in optionalPart.partLabelList:
                    typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
        ## end problem lines ##
        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)
    productDict['parts'] = partData
    productList.append(productDict)
I guess the problem is that there are too many for loops, which have to iterate over the same data over and over again. labelsForLangCode gets all labels from ProductPartLabels that match the current langCode.
All parts for a product are stored in a db.ListProperty(db.Key). The same goes for all labels for a part.
The reason I need the somewhat complex dict is that I want to display all data for a product with its default parts and show a selector for the optional ones.
The defaultPartsData and optionalPartsData are properties in the Product model that look like this:
@property
def defaultPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts)
@property
def optionalPartsData(self):
    return ProductParts.gql('WHERE __key__ IN :key', key = self.optionalParts)
When the completed dict is in memcache it works smoothly, but isn't memcache reset if the application goes into hibernation? Also, I would like to show the page to a first-time user (memcache empty) without the enormous delay.
Also, as I said above, this is only a small number of parts/products. What will the result be when there are 30 products with 100 parts?
Is one solution to create a scheduled task to cache it in memcache every hour? Is this efficient?
I know this is a lot to take in, but I'm stuck. I've been at this for about 12 hours straight and can't figure out a solution.
..fredrik
EDIT:
An AppStats screenshot here.
From what I can read, the queries seem fine in AppStats, only taking about 200-400 ms. How can the difference be that big?
EDIT 2:
I implemented dound's solution and added a bit. Now it looks like this:
langCode = 'en'
typeData = Products.ProductPartTypes.all()
productData = Products.Product.all()
labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode = langCode)
productList = []
label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
labelData = memcache.get(label_cache_key)
if labelData is None:
    langDict = {}
    for langLabel in labelsForLangCode:
        langDict[str(langLabel.key())] = langLabel.partLangLabel
    memcache.add(label_cache_key, langDict, 500)
    labelData = memcache.get(label_cache_key)
GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name
    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)
    if partData is None:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            for lb in part.partLabelList:
                if str(lb) in labelData:
                    label = labelData[str(lb)]
                    break
            if part.key() in product.defaultParts:
                typeDict[part.type.typeId]['default'] = label
            elif part.key() in product.optionalParts:
                typeDict[part.type.typeId]['optional'].append(label)
        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)
    productDict['parts'] = partData
    productList.append(productDict)
The result is much better. I now have about 3000ms without memcache and about 700ms with.
I'm still a bit worried about the 3000ms, and on the local app_dev server the memcache gets filled up on each reload. Shouldn't it put everything in there once and then read from it?
Last but not least, does anyone know why the request takes about 10x as long on the production server as on app_dev?
EDIT 3:
I noticed that none of the db.Model classes are indexed. Could this make a difference?
EDIT 4:
After consulting AppStats (and understanding it, which took some time), it seems that the big problem lies within part.type.typeId, where part.type is a db.ReferenceProperty. I should have seen it before, and maybe explained it better :) I'll rethink that part and get back to you.
..fredrik
A few simple ideas:
1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)
2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.
3) Like Will pointed out, IN queries perform one query PER argument.
Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:
class ProductParts(db.Model):
    ...
    products = db.ListProperty(db.Key) # product keys
    ...
Then you can do ONE query per product instead of N queries per product. For example, you could do this:
parts = ProductParts.all().filter("products =", product).fetch(1000)
The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).
Less critical improvement idea:
4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:
GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')
@property
def defaultPartsData(self):
    return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)
You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.
Here is what the code might look like if you rewrote it to fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).
GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')
for product in productData:
    productDict = {}
    typeDict = {}
    productDict['productName'] = product.name
    cache_key = 'productparts_%s' % (slugify(product.key()))
    partData = memcache.get(cache_key)
    if not partData:
        for type in typeData:
            typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
        # here's a new approach that does just ONE datastore query (for each product)
        GQL_PARTS_BY_PRODUCT.bind(product)
        parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
        for part in parts:
            is_default = part.key() in product.defaultParts
            for label in labelsForLangCode:
                if label.key() in part.partLabelList:
                    if is_default:
                        typeDict[part.type.typeId]['default'] = label.partLangLabel
                    else:
                        typeDict[part.type.typeId]['optional'].append(label.partLangLabel)
        # (end new code)
        memcache.add(cache_key, typeDict, 500)
        partData = memcache.get(cache_key)
    productDict['parts'] = partData
    productList.append(productDict)
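The rewrite above still scans every label for every part. The question's second edit replaces that scan with a dictionary keyed by label key; here is a datastore-free sketch of that lookup. The plain dicts and the field names key, type_id and label_keys are stand-ins I made up, not the real models.

```python
def build_type_dict(parts, label_by_key, default_keys, type_ids):
    """Group part labels by type using dictionary lookups only."""
    type_dict = {t: {"default": "", "optional": []} for t in type_ids}
    for part in parts:
        # First label of this part that exists for the current language.
        label = next((label_by_key[k] for k in part["label_keys"]
                      if k in label_by_key), None)
        if label is None:
            continue  # no label in this language
        if part["key"] in default_keys:
            type_dict[part["type_id"]]["default"] = label
        else:
            type_dict[part["type_id"]]["optional"].append(label)
    return type_dict

parts = [
    {"key": "p1", "type_id": "color", "label_keys": ["l1-en", "l1-sv"]},
    {"key": "p2", "type_id": "color", "label_keys": ["l2-en", "l2-sv"]},
    {"key": "p3", "type_id": "size", "label_keys": ["l3-en"]},
]
label_by_key = {"l1-en": "Red", "l2-en": "Blue", "l3-en": "Large"}
print(build_type_dict(parts, label_by_key, {"p1", "p3"}, ["color", "size"]))
```

Each part now costs a handful of dictionary lookups instead of a pass over all labels. It does nothing about the part.type dereference that EDIT 4 identifies as the main cost.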
One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.
So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.
Here's the relevant section from the GQL Reference:
Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
You might try installing AppStats for your app to see where else it might be slowing down.
I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.
GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.
