If I should be approaching this problem through a different method, please suggest so. I am creating an item based collaborative filter. I populate the db with the LinkRating2 class and for each link there are more than a 1000 users that I need to call and collect their ratings to perform calculations which I then use to create another table. So I need to call more than 1000 entities for a given link.
For instance lets say there are over a 1000 users rated 'link1' there will be over a 1000 instances of this class for the given link property that I need to call.
How would I complete this example?
class LinkRating2(db.Model):
user = db.StringProperty()
link = db.StringProperty()
rating2 = db.FloatProperty()
query =LinkRating2.all()
link1 = 'link string name'
a = query.filter('link = ', link1)
aa = a.fetch(1000)##how would i get more than 1000 for a given link1 as shown?
##keybased over 1000 in other post example i need method for a subset though not key
class MyModel(db.Expando):
#classmethod
def count_all(cls):
"""
Count *all* of the rows (without maxing out at 1000)
"""
count = 0
query = cls.all().order('__key__')
while count % 1000 == 0:
current_count = query.count()
if current_count == 0:
break
count += current_count
if current_count == 1000:
last_key = query.fetch(1, 999)[0].key()
query = query.filter('__key__ > ', last_key)
return count
The 1000-entity fetch limit was removed recently; you can fetch as many as you need, provided you can do so within the time limits. Your entities look like they'll be fairly small, so you may be able to fetch significantly more than 1000 in a request.
Wooble points out that the 1,000 entity limit is a thing of the past now, so you actually don't need to use cursors to do this - just fetch everything at once (it'll be faster than getting them in 1,000 entity batches too since there will be fewer round-trips to the datastore, etc.)
The removal of the 1000 entity limit was removed in version 1.3.1: http://googleappengine.blogspot.com/2010/02/app-engine-sdk-131-including-major.html
Old solution using cursors:
Use query cursors to fetch results beyond the first 1,000 entities:
# continuing from your code ... get ALL of the query's results:
more = aa
while len(more) == 1000:
a.with_cusor(a.cursor()) # start the query where we left off
more = a.fetch(1000) # get the next 1000 results
aa = aa + more # copy the additional results into aa
Related
I'm using describe_table_statistics to retrieve the list of tables in the given DMS task, and conditionally looping over describe_table_statistics with the response['Marker'].
When I use no filters, I get the correct number of records, 13k+.
When I use a filter, or combination of filters, whose result set has fewer than MaxRecords, I get the correct number of records.
However, when I pass in a filter that would get a record set larger than MaxRecords, I get far fewer records than I should.
Here's my function to retrieve the set of tables:
def get_dms_task_tables(account, region, task_name, schema_name=None, table_state=None):
tables=[]
max_records=500
filters=[]
if schema_name:
filters.append({'Name':'schema-name', 'Values':[schema_name]})
if table_state:
filters.append({'Name':'table-state', 'Values':[table_state]})
task_arn = get_dms_task_arn(account, region, task_name)
session = boto3.Session(profile_name=account, region_name=region)
client = session.client('dms')
response = client.describe_table_statistics(
ReplicationTaskArn=task_arn
,Filters=filters
,MaxRecords=max_records)
tables += response['TableStatistics']
while len(response['TableStatistics']) == max_records:
response = client.describe_table_statistics(
ReplicationTaskArn=task_arn
,Filters=filters
,MaxRecords=max_records
,Marker=response['Marker'])
tables += response['TableStatistics']
return tables
For troubleshooting, I loop over tables printing one line per table:
print(', '.join((
t['SchemaName']
,t['TableName']
,t['TableState'])))
When I pass in no filters and grep for that table state of 'Table completed' I get 12k+ records, which is the correct count, via the console
So superficially at least, the response loop works.
When I pass in a schema name and the table state filter conditions, I get the correct count, as confirmed by the console, but this count is less than MaxRecords.
When I just pass in the table state filter for 'Table completed', I only get 949 records, so I'm missing 11k records.
I've tried omitting the Filter parameter from the describe_table_statistics inside the loop, but I get the same results in all cases.
I suspect there's something wrong with my call to describe_table_statistics inside the loop but I've been unable to find examples of this in amazon's documentation to confirm that.
When filters are applied, describe_table_statistics does not adhere to the MaxRecords limit.
In fact, what it seems to do is retrieve (2 x MaxRecords), apply the filter, and return that set. Or possibly it retrieves MaxRecords, applies the filter, and continues until the result set is larger than MaxRecords. Either way my while condition was the problem.
I replaced
while len(response['TableStatistics']) == max_records:
with
while 'Marker' in response:
and now the function returns the correct number of records.
Incidentally, my first attempt was
while len(response['TableStatistics']) >= 1:
but on the last iteration of the loop it threw this error:
KeyError: 'Marker'
The completed and working function now looks thusly:
def get_dms_task_tables(account, region, task_name, schema_name=None, table_state=None):
tables=[]
max_records=500
filters=[]
if schema_name:
filters.append({'Name':'schema-name', 'Values':[schema_name]})
if table_state:
filters.append({'Name':'table-state', 'Values':[table_state]})
task_arn = get_dms_task_arn(account, region, task_name)
session = boto3.Session(profile_name=account, region_name=region)
client = session.client('dms')
response = client.describe_table_statistics(
ReplicationTaskArn=task_arn
,Filters=filters
,MaxRecords=max_records)
tables += response['TableStatistics']
while 'Marker' in response:
response = client.describe_table_statistics(
ReplicationTaskArn=task_arn
,Filters=filters
,MaxRecords=max_records
,Marker=response['Marker'])
tables += response['TableStatistics']
return tables
I wrote this iterate through a series of facebook user likes. The scrubbing process requires the code first pick a user, then pick a like, then a character from that like. If too many characters in a like are not english characters (in the alphanum string) then the like is assumed to be gibberish and is removed.
This filtering process continues through all likes and all users. I know having nested loops is a no no, but I don't see a way to do this without having a triple nested loop. Any suggestions? Additionally if anyone has any other efficiency or conventional advice I would love to hear it.
def cleaner(likes_path):
'''
estimated run time for 170k users: 3min
this method takes a given csv format datasheet of noisy facebook likes.
data is scrubbed row by row (meaning user by user) removing 'likes' that are not useful
data is parsed into manageable size specified files.
if more data is continuously added method will just keep adding new files
if more data is added at a later time choosing a new folder to put it in would
work best so that the update method can add it to existing counts instead
of starting over
'''
with open(os.path.join(likes_path)) as likes:
dct = [0]
file_num = 0
#initializes naming scheme for self-numbering files
file_size = 30000
#sets file size to 30000 userId's
alphanum = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890 $%#-'
user_count = 0
too_big = 1000
too_long = 30
for rows in likes:
repeat_check = []
user_count += 1
user_likes = make_like_list(rows)
to_check = user_likes[1:]
if len(to_check) < too_big:
#users with more than 1000 likestake up much more resources/time
#and are of less analytical value
for like in to_check:
if len(like) > too_long or len(like) == 0:
#This changes the filter sensitivity. Most useful likes
#are under 30 char long
user_likes.remove(like)
else:
letter_check = sum(1 for letter in like[:5] if letter in alphanum)
if letter_check < len(like[:5])-1:
user_likes.remove(like)
if len(user_likes) > 1 and len(user_likes[0]) == 32:
#userID's are 32 char long, this filters out some mistakes
#filters out users with no likes
scrubbed_to_check = user_likes[1:]
for like in scrubbed_to_check:
if like == 'Facebook' or like == 'YouTube':
#youtube and facebook are very common likes but
#aren't very useful
user_likes.remove(like)
#removes duplicate likes
elif like not in repeat_check:
repeat_check.append(like)
else:
user_likes.remove(like)
scrubbed_rows = '"'+'","'.join(user_likes)+'"\n'
if user_count%file_size == 1:
#This block allows for data to be parsed into
#multiple smaller files
file_num += 1
dct.append(file_num)
dct[file_num] = open(file_write_path + str(file_num) +'.csv', 'w')
if file_num != 1:
dct[file_num-1].close()
dct[file_num].writelines(scrubbed_rows)
if user_counter(user_count, 'Users Scrubbed:', 200000):
break
print 'Total Users Scrubbed:', user_count
I have two collections in a db page and pagearchive I am trying to clean up. I noticed that new documents were being created in the pagearchive instead of adding values to embedded documents as intended. So essentially what this script is doing is going through every document in page and then finding all copies of that document in pagearchive and moving data I want into a single document and deleted the extras.
The problem is there is only 200K documents in pagearchive and based on the count variable I am printing at the bottom, it's taking anywhere from 30min to 60+ min to iterate through 1000 records. This is extremely slow. The largest count in duplicate docs I have seen is 88. But for the most part when I query in pageArchive on uu, I see 1-2 duplicate documents.
mongodb is on a single instance 64 bit machine with 16GB of RAM.
The uu key that is being iterating on the pageArchive collection is a string. I made sure there was an index on that field db.pagearchive.ensureIndex({uu:1}) I also did a mongod --repair for good measure.
My guess is the problem is with my sloppy python code (not very good at it) or perhaps something I am missing that is necessary for mongodb. Why is it going so slow or what can I do to speed it up dramatically?
I thought maybe because the uu field is a string it's causing a bottleneck, but that's the unique property in the document (or will be once I clean up this collection). On top of that, when I stop the process and restart it, it speeds up to about 1000 records a second. Until it starts finding duplicates again in the collection, then it goes dog slow again (deleting about 100 records every 10-20 minutes)
from pymongo import Connection
import datetime
def match_dates(old, new):
if old['coll_at'].month == new['coll_at'].month and old['coll_at'].day == new['coll_at'].day and old['coll_at'].year == new['coll_at'].year:
return False
return new
connection = Connection('dashboard.dev')
db = connection['mydb']
pageArchive = db['pagearchive']
pages = db['page']
count = 0
for page in pages.find(timeout=False):
archive_keep = None
ids_to_delete = []
for archive in pageArchive.find({"uu" : page['uu']}):
if archive_keep == None:
#this is the first record we found, so we will store data from duplicate records with this one; delete the rest
archive_keep = archive
else:
for attr in archive_keep.keys():
#make sure we are dealing with an embedded document field
if isinstance(archive_keep[attr], basestring) or attr == 'updated_at':
continue
else:
try:
if len(archive_keep[attr]) == 0:
continue
except TypeError:
continue
try:
#We've got our first embedded doc from a property to compare against
for obj in archive_keep[attr]:
if archive['_id'] not in ids_to_delete:
ids_to_delete.append(archive['_id'])
#loop through secondary archive doc (comparing against the archive keep)
for attr_old in archive.keys():
#make sure we are dealing with an embedded document field
if isinstance(archive[attr_old], basestring) or attr_old == 'updated_at':
continue
else:
try:
#now we know we're dealing with a list, make sure it has data
if len(archive[attr_old]) == 0:
continue
except TypeError:
continue
if attr == attr_old:
#document prop. match; loop through embedded document array and make sure data wasn't collected on the same day
for obj2 in archive[attr_old]:
new_obj = match_dates(obj, obj2)
if new_obj != False:
archive_keep[attr].append(new_obj)
except TypeError, te:
'not iterable'
pageArchive.update({
'_id':archive_keep['_id']},
{"$set": archive_keep},
upsert=False)
for mongoId in ids_to_delete:
pageArchive.remove({'_id':mongoId})
count += 1
if count % 100 == 0:
print str(datetime.datetime.now()) + ' ### ' + str(count)
I'd make following changes to code:
in match_dates return None instead False and do if new_obj is not None: it will check reference, without calling object __ne__ or __nonzero__.
for page in pages.find(timeout=False): If only uu key is used and pages are big, fields=['uu'] parameter to find should speedup queries.
archive_keep == None to archive_keep is None
archive_keep[attr] is called 4 times. It will be little faster to save keep_obj = archive_keep[attr] and then use keep_obj.
change ids_to_delete = [] to ids_to_delete = set(). Then if archive['_id'] not in ids_to_delete: will be O(1)
Currently i'm using something like this:
images = Image.all()
count = images.count()
random_numb = random.randrange(1, count)
image = Image.get_by_id(random_numb)
But it turns out that the ids in the datastore on AppEngine don't start from 1.
I have two images in datastore and their ids are 6001 and 7001.
Is there a better way to retrieve random images?
The datastore is distributed, so IDs are non-sequential: two datastore nodes need to be able to generate an ID at the same time without causing a conflict.
To get a random entity, you can attach a random float between 0 and 1 to each entity on create. Then to query, do something like this:
rand_num = random.random()
entity = MyModel.all().order('rand_num').filter('rand_num >=', rand_num).get()
if entity is None:
entity = MyModel.all().order('rand_num').get()
Edit: Updated fall-through case per Nick's suggestion.
Another solution (if you don't want to add an additional property). Keep a set of keys in memory.
import random
# Get all the keys, not the Entities
q = ItemUser.all(keys_only=True).filter('is_active =', True)
item_keys = q.fetch(2000)
# Get a random set of those keys, in this case 20
random_keys = random.sample(item_keys, 20)
# Get those 20 Entities
items = db.get(random_keys)
The above code illustrates the basic method for getting only keys and then creating a random set with which to do a batch get. You could keep that set of keys in memory, add to it as you create new ItemUser Entities, and then have a method that returns a n random Entities. You'll have to implement some overhead to manage the memcached keys. I like this solution better if you're performing the query for random elements often (I assume using a batch get for n Entities is more efficient than a query for n Entities).
I think Drew Sears's answer above (attach a random float to each entity on create) has a potential problem: every item doesn't have an equal chance of getting picked. For example, if there are only 2 entities, and one gets a rand_num of 0.2499, and the other gets 0.25, the 0.25 one will get picked almost all the time. This might or might not matter to your application. You could fix this by changing the rand_num of an entity every time it is selected, but that means each read also requires a write.
And pix's answer will always select the first key.
Here's the best general-purpose solution I could come up with:
num_images = Image.all().count()
offset = random.randrange(0, num_images)
image = Image.all().fetch(1, offset)[0]
No additional properties needed, but the downside is that count() and fetch() both have performance implications if the number of Images is large.
Another (less efficient) method, which requires no setup:
query = MyModel.all(keys_only=True)
# query.filter("...")
selected_key = None
n = 0
for key in query:
if random.randint(0,n)==0:
selected_key = key
n += 1
# just in case the query is empty
if selected_key is None:
entry = None
else:
entry = MyModel.get(selected_key)
I have a productdatabase that contains products, parts and labels for each part based on langcodes.
The problem I'm having and haven't got around is a huge amount of resource used to get the different datasets and merging them into a dict to suit my needs.
The products in the database are based on a number of parts that is of a certain type (ie. color, size). And each part has a label for each language. I created 4 different models for this. Products, ProductParts, ProductPartTypes and ProductPartLabels.
I've narrowed it down to about 10 lines of code that seams to generate the problem. As of currently I have 3 Products, 3 Types, 3 parts for each type, and 2 languages. And the request takes a wooping 5500ms to generate.
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if not partData:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
## Start of problem lines ##
for defaultPart in product.defaultPartsData:
for label in labelsForLangCode:
if label.key() in defaultPart.partLabelList:
typeDict[defaultPart.type.typeId]['default'] = label.partLangLabel
for optionalPart in product.optionalPartsData:
for label in labelsForLangCode:
if label.key() in optionalPart.partLabelList:
typeDict[optionalPart.type.typeId]['optional'].append(label.partLangLabel)
## end problem lines ##
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
I guess the problem lies in the number of for loops is too many and have to iterate over the same data over and over again. labelForLangCode get all labels from ProductPartLabels that match the current langCode.
All parts for a product is stored in a db.ListProperty(db.key). The same goes for all labels for a part.
The reason I need the some what complex dict is that I want to display all data for a product with it's default parts and show a selector for the optional one.
The defaultPartsData and optionaPartsData are properties in the Product Model that looks like this:
#property
def defaultPartsData(self):
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts)
#property
def optionalPartsData(self):
return ProductParts.gql('WHERE __key__ IN :key', key = self.optionalParts)
When the completed dict is in the memcache it works smoothly, but isn't the memcache reset if the application goes in to hibernation? Also I would like to show the page for first time user(memcache empty) with out the enormous delay.
Also as I said above, this is only a small amount of parts/product. What will the result be when it's 30 products with 100 parts.
Is one solution to create a scheduled task to cache it in the memcache every hour? It this efficient?
I know this is alot to take in, but I'm stuck. I've been at this for about 12 hours straight. And can't figure out a solution.
..fredrik
EDIT:
A AppStats screenshoot here.
From what I can read the queries seams fine in AppStats. only taking about 200-400 ms. How can the difference be that big?
EDIT 2:
I implemented dound's solution and added abit. Now it looks like this:
langCode = 'en'
typeData = Products.ProductPartTypes.all()
productData = Products.Product.all()
labelsForLangCode = Products.ProductPartLabels.gql('WHERE partLangCode = :langCode', langCode = langCode)
productList = []
label_cache_key = 'productpartslabels_%s' % (slugify(langCode))
labelData = memcache.get(label_cache_key)
if labelData is None:
langDict = {}
for langLabel in labelsForLangCode:
langDict[str(langLabel.key())] = langLabel.partLangLabel
memcache.add(label_cache_key, langDict, 500)
labelData = memcache.get(label_cache_key)
GQL_PARTS_BY_PRODUCT = Products.ProductParts.gql('WHERE products = :1')
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if partData is None:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
GQL_PARTS_BY_PRODUCT.bind(product)
parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
for part in parts:
for lb in part.partLabelList:
if str(lb) in labelData:
label = labelData[str(lb)]
break
if part.key() in product.defaultParts:
typeDict[part.type.typeId]['default'] = label
elif part.key() in product.optionalParts:
typeDict[part.type.typeId]['optional'].append(label)
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
The result is much better. I now have about 3000ms with out memcache and about 700ms with.
I'm still abit worried about the 3000ms, and on the local app_dev server the memcache gets filled up for each reload. Shouldn't put everything in there and then read from it?
Last but not least, does anyone know why the request take about 10x as long on the production server the the app_dev?
EDIT 3:
I noticed that non of the db.Model are indexed, could this make a differance?
EDIT 4:
After consulting AppStats (And understanding it, took some time. It seams that the big problems lies within part.type.typeId where part.type is a db.ReferenceProperty. Should have seen it before. And maybe explained it better :) I'll rethink that part. And get back to you.
..fredrik
A few simple ideas:
1) Since you need all the results, instead of doing a for loop like you have, call fetch() explicitly to just go ahead and get all the results at once. Otherwise, the for loop may result in multiple queries to the datastore as it only gets so many items at once. For example, perhaps you could try:
return ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts).fetch(1000)
2) Maybe only load part of the data in the initial request. Then use AJAX techniques to load additional data as needed. For example, start by returning the product information, and then make additional AJAX requests to get the parts.
3) Like Will pointed out, IN queries perform one query PER argument.
Problem: An IN query does one equals query for each argument you give it. So key IN self.defaultParts actually does len(self.defaultParts) queries.
Possible Improvement: Try denormalizing your data more. Specifically, store a list of products each part is used in on each part. You could structure your Parts model like this:
class ProductParts(db.Model):
...
products = db.ListProperty(db.Key) # product keys
...
Then you can do ONE query to per product instead of N queries per product. For example, you could do this:
parts = ProductParts.all().filter("products =", product).fetch(1000)
The trade-off? You have to store more data in each ProductParts entity. Also, when you write a ProductParts entity, it will be a little slower because it will cause 1 row to be written in the index for each element in your list property. However, you stated that you only have 100 products so even if a part was used in every product the list still wouldn't be too big (Nick Johnson mentions here that you won't get in trouble until you try to index a list property with ~5,000 items).
Less critical improvement idea:
4) You can create the GqlQuery object ONCE and then reuse it. This isn't your main performance problem by any stretch, but it will help a little. Example:
GQL_PROD_PART_BY_KEYS = ProductParts.gql('WHERE __key__ IN :1')
#property
def defaultPartsData(self):
return GQL_PROD_PART_BY_KEYS.bind(self.defaultParts)
You should also use AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot of appstats info about your request along with your post.
Here is what the code might look like if you re-wrote it fetch the data with fewer round-trips to the datastore (these changes are based on ideas #1, #3, and #4 above).
GQL_PARTS_BY_PRODUCT = ProductParts.gql('WHERE products = :1')
for product in productData:
productDict = {}
typeDict = {}
productDict['productName'] = product.name
cache_key = 'productparts_%s' % (slugify(product.key()))
partData = memcache.get(cache_key)
if not partData:
for type in typeData:
typeDict[type.typeId] = { 'default' : '', 'optional' : [] }
# here's a new approach that does just ONE datastore query (for each product)
GQL_PARTS_BY_PRODUCT.bind(product)
parts = GQL_PARTS_BY_PRODUCT.fetch(1000)
for part in parts:
if part.key() in self.defaultParts:
part_type = 'default'
else:
part_type = 'optional'
for label in labelsForLangCode:
if label.key() in defaultPart.partLabelList:
typeDict[defaultPart.type.typeId][part_type] = label.partLangLabel
# (end new code)
memcache.add(cache_key, typeDict, 500)
partData = memcache.get(cache_key)
productDict['parts'] = partData
productList.append(productDict)
One important thing to be aware of is the fact that IN queries (along with != queries) result in multiple subqueries being spawned behind the scenes, and there's a limit of 30 subqueries.
So your ProductParts.gql('WHERE __key__ IN :key', key = self.defaultParts) query will actually spawn len(self.defaultParts) subqueries behind the scenes, and it will fail if len(self.defaultParts) is greater than 30.
Here's the relevant section from the GQL Reference:
Note: The IN and != operators use multiple queries behind the scenes. For example, the IN operator executes a separate underlying datastore query for every item in the list. The entities returned are a result of the cross-product of all the underlying datastore queries and are de-duplicated. A maximum of 30 datastore queries are allowed for any single GQL query.
You might try installing AppStats for your app to see where else it might be slowing down.
I think the problem is one of design: wanting to construct a relational join table in memcache when the framework specifically abhors that.
GAE will toss your job out because it takes too long, but you shouldn't be doing it in the first place. I'm a GAE tyro myself, so I cannot specify how it should be done, unfortunately.