Getting The Most Recent Data Item - Google App Engine - Python

I need to retrieve the most recent item added to a collection. Here is how I'm doing it:
class Box(db.Model):
    ID = db.IntegerProperty()

class Item(db.Model):
    box = db.ReferenceProperty(Box, collection_name='items')
    date = db.DateTimeProperty(auto_now_add=True)

# get the most recent item
lastItem = box.items.order('-date')[0]
Is this an expensive way to do it? Is there a better way?

If you are going to iterate over a list of boxes, that is a very bad way to do it. You will run an additional query for every box. You can easily see what is going on with Appstats.
If you are doing one of these per request, it may be OK, but it is not ideal. You might also want to use: lastItem = box.items.order('-date').get(). get() only returns the first result to the app.
If possible, it would be significantly faster to add a lastItem property to Box, or store the Box ID (your ID attribute) on Item; in other words, denormalize the data. If you are going to fetch a list of Boxes together with each one's most recent item, you need this type of approach.
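With denormalization, the read becomes a plain attribute access instead of a query. A framework-free sketch of the idea (plain Python, no App Engine types; all names are illustrative):

```python
class Box:
    def __init__(self):
        self.items = []
        self.last_item = None  # denormalized: always points at the newest item

    def add_item(self, item):
        # Pay a little extra work at write time so reads need no sort/query.
        self.items.append(item)
        self.last_item = item

box = Box()
box.add_item("first")
box.add_item("second")
print(box.last_item)  # -> second
```

The trade-off is the usual one for denormalized data: writes must keep the copy in sync, but fetching many boxes with their latest items no longer costs one query per box.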

Related

Google Directory API groups

I am trying to use Google's admin directory API (with Google's python library).
I am able to list the users on the directory just fine, using some code like this:
results = client.users().list(customer='my_customer').execute()
However, those results do not include which groups each user is a member of. Instead, it appears that once I have the list of users, I then have to make a call to get the list of groups:
results = client.groups().list(customer='my_customer').execute()
And then go through each group and call the "members" API to see which users are in it:
results = client.members().list(groupKey='[group key]').execute()
Which means that I have to make a new request for every group.
It seems to be horribly inefficient. There has to be a better way than this, and I'm just missing it. What is it?
There is no better way than the one you described (yet?): iterate over groups and get the member list of each one.
I agree that this is not the answer you want to read, but it is also the way we do maintenance on group members.
The following method should be more efficient, although not 100% great:
For each user, call the groups.list method and pass the userKey parameter to specify that you only want groups that have user X as a member.
I'm not a Python developer, but it should look like this:
results = client.groups().list(customer='my_customer',userKey='user#domain.com').execute()
For each user:
results = client.groups().list(userKey=user,pageToken=None).execute()
where 'user' is the user's primary/alias email address or ID
This will return a page of groups the user is a member of, see:
https://developers.google.com/admin-sdk/directory/v1/reference/groups/list
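Since groups.list is paginated, the per-user lookup still needs to follow nextPageToken to collect every page. A small helper along these lines would do it; the client mirrors the google-api-python-client shape used above (client.groups().list(...).execute()), and the Fake* classes are a hypothetical stand-in so the helper can be exercised without the real API:

```python
def groups_for_user(client, user_key):
    """Return all groups user_key is a member of, following pagination."""
    groups, page_token = [], None
    while True:
        results = client.groups().list(userKey=user_key,
                                       pageToken=page_token).execute()
        groups.extend(results.get('groups', []))
        page_token = results.get('nextPageToken')
        if page_token is None:
            return groups

# --- Illustrative stand-in for the Directory API client (not the real library) ---
class _FakeRequest:
    def __init__(self, payload):
        self.payload = payload
    def execute(self):
        return self.payload

class _FakeGroupsResource:
    def __init__(self, pages):
        self.pages = pages  # maps pageToken -> response dict
    def list(self, userKey=None, pageToken=None):
        return _FakeRequest(self.pages[pageToken])

class _FakeClient:
    def __init__(self, pages):
        self._groups = _FakeGroupsResource(pages)
    def groups(self):
        return self._groups

pages = {
    None: {'groups': [{'email': 'eng@domain.com'}], 'nextPageToken': 't1'},
    't1': {'groups': [{'email': 'sales@domain.com'}]},
}
all_groups = groups_for_user(_FakeClient(pages), 'user@domain.com')
print([g['email'] for g in all_groups])  # -> ['eng@domain.com', 'sales@domain.com']
```

This is still one round of requests per user, but each request is scoped to that user's memberships rather than enumerating every group's full member list.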

Get by name in google app engine

Instead of using the get_by_id() method to fetch a specific entry by its ID and print that entry's content from the Google datastore, I am trying to look entries up by name and print their content. For example:
print all the content that has this specific name (there may be more than one row of content with this name)
print the content of the specific id
I am using get_by_id(long(id)) to get the entry in the second part of my example, and it's working. I am trying to use get_by_key_name(name), but it is not working. Any ideas on that? Thank you.
Sorry, but since I couldn't leave a comment, I am editing my question. I can now get all the animal names from my datastore, and I have made them clickable using HTML in a template file. The datastore can contain entries with the same animal name more than once (e.g. name=duck, content=water and name=duck, content=lake). I use DISTINCT in my GQL query so that repeated names (e.g. duck) are printed only once. Since name=duck has two contents, when I click on "duck" I want to see both of them. My problem is that get_by_id(long(id)) returns a single entity by its unique ID, so it cannot print both contents for name=duck, because every entry has a different ID. I want all the content of the entries with the same name. I am trying the following, but it is not working.
msg = MODEL.Animals.get_by_key_name(name)
self.response.write("%s" % msg.content)
With get_by_id() you can get an entity only if you know its ID. These operations are named "small operations" in the quota and are cheaper than datastore reads, but to get a list of entities filtered by an indexed property you should use filters.
query = MODEL.Animals.query()
query = query.filter(MODEL.Animals.name == 'duck')
ducks = query.fetch(limit=100)  # limit number of returned animals
for duck in ducks:
    self.response.write('%s - %s' % (duck.name, duck.content))
By default, all string properties are indexed, so you will be able to do such requests.

Mongoengine - How to perform a "save new item or increment counter" operation?

I'm using MongoEngine in a web-scraping project. I would like to keep track of all the images I've encountered on all the scraped webpages.
To do so, I store the image src URL and the number of times the image has been encountered.
The MongoEngine model definition is the following:
class ImagesUrl(Document):
    """ Model representing images encountered during web-scraping.

    When an image is encountered on a web-page during scraping,
    we store its url and the number of times it has been
    seen (default counter value is 1).

    If the image has been seen before, we do not insert a new document
    in the collection, but merely increment the corresponding counter value.
    """
    # The url of the image. There cannot be any duplicates.
    src = URLField(required=True, unique=True)

    # Counter of the total number of occurrences of the image during
    # the data-mining process.
    counter = IntField(min_value=0, required=True, default=1)
I'm looking for the proper way to implement the "save or increment" process.
So far, I'm handling it that way, but I feel there might be a better, built-in way of doing it with MongoEngine:
def save_or_increment(self):
    """ If it is the first time the image has been encountered, insert
    its src in mongo, along with a counter=1 value.
    If not, increment its counter value by 1.
    """
    # check if the item is already stored
    # if not, save a new item
    if not ImagesUrl.objects(src=self.src):
        ImagesUrl(
            src=self.src,
            counter=self.counter,
        ).save()
    else:
        # if the item is already stored in Mongo, just increment its counter
        ImagesUrl.objects(src=self.src).update_one(inc__counter=1)
Is there a better way of doing it?
Thank you very much for your time.
You should be able to just do an upsert, e.g.:
ImagesUrl.objects(src=self.src).update_one(
    upsert=True,
    inc__counter=1,
    set__src=self.src)
update_one, as in @ross's answer, returns the count of modified documents (or the full result of the update); it will not return the document or the new counter value. If you want one, you should use upsert_one:
images_url = ImagesUrl.objects(src=self.src).upsert_one(
    inc__counter=1,
    set__src=self.src)
print images_url.counter
It will create the document if it does not exist, or modify it and increase the counter.
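The single-call form above is also what removes the race between the "does it exist?" check and the insert in the original save_or_increment. A purely illustrative, in-memory stand-in for the collection (plain Python, no MongoDB involved, all names hypothetical) captures the intended semantics:

```python
def save_or_increment(collection, src):
    """Emulate the one-call upsert: create {src, counter} on first sight,
    otherwise bump the existing counter, in a single operation."""
    doc = collection.setdefault(src, {"src": src, "counter": 0})
    doc["counter"] += 1
    return doc["counter"]

images = {}
print(save_or_increment(images, "http://example.com/a.png"))  # -> 1
print(save_or_increment(images, "http://example.com/a.png"))  # -> 2
```

In real MongoDB the upsert performs the find-or-create and the increment server-side, so two scrapers hitting the same URL concurrently cannot both insert a fresh document.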

How to delete entities not found in feed on GAE

I am updating and adding items from a feed (which can have about 40,000 items) to the datastore, 200 items at a time. The problem is that the feed can change, and some items might be deleted from it.
I have this code:
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)

def updateFeed(offset, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name)
        )
    db.put(feedEntriesToAdd)
How do I find out which items were not in the feed and delete them from the datastore?
I thought about creating a list of items (in the datastore) and just removing from it all the items that I updated; the ones left would be the ones to delete. But that seems rather slow.
PS: All item.id values are unique for the feed and are consistent.
If you add a DateTimeProperty with auto_now=True, it will record the last modified time of each entity. Since you update every item in the feed, by the time you've finished they will all have times after the moment you started, so anything with a date before then isn't in the feed any more.
Xavier's generation counter is just as good - all we need is something guaranteed to increase between refreshes, and never decrease during a refresh.
Not sure from the docs, but I expect a DateTimeProperty is bigger than an IntegerProperty. The latter is a 64 bit integer, so they might be the same size, or it may be that DateTimeProperty stores several integers. A group post suggests maybe it's 10 bytes as opposed to 8.
But remember that by adding an extra property that you do queries on, you're adding another index anyway, so the difference in size of the field is diluted as a proportion of the overhead. Further, 40k times a few bytes isn't much even at $0.24/G/month.
With either a generation or a datetime, you don't necessarily have to delete the data immediately. Your other queries could filter on date/generation of the most recent refresh, meaning that you don't have to delete data immediately. If the feed (or your parsing of it) goes funny and fails to produce any items, or only produces a few, it might be useful to have the last refresh lying around as a backup. Depends entirely on the app whether it's worth having.
I would add a generation counter:
class FeedEntry(db.Model):
    name = db.StringProperty(required=True)
    generation = db.IntegerProperty(required=True)

def updateFeed(offset, generation, number=200):
    response = fetchFeed(offset, number)
    feedItems = parseFeed(response)
    feedEntriesToAdd = []
    for item in feedItems:
        feedEntriesToAdd.append(
            FeedEntry(key_name=item.id, name=item.name, generation=generation)
        )
    db.put(feedEntriesToAdd)

def deleteOld(generation):
    q = db.GqlQuery("SELECT * FROM FeedEntry " +
                    "WHERE generation != :1", generation)
    db.delete(q)
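The generation pattern is framework-independent; a minimal in-memory sketch (a plain dict standing in for the datastore, all names hypothetical) shows the refresh-then-purge cycle:

```python
store = {}  # key_name -> {"name": ..., "generation": ...}

def update_feed(items, generation):
    # Upsert every item currently in the feed, stamping it with this
    # refresh's generation number.
    for item_id, name in items:
        store[item_id] = {"name": name, "generation": generation}

def delete_old(generation):
    # Anything not stamped with the current generation was absent
    # from the latest feed, so it can be purged.
    for key in [k for k, v in store.items() if v["generation"] != generation]:
        del store[key]

update_feed([("a", "Apple"), ("b", "Banana")], generation=1)
update_feed([("a", "Apple")], generation=2)  # "b" dropped out of the feed
delete_old(generation=2)
print(sorted(store))  # -> ['a']
```

As the answer above notes, the purge does not have to run immediately; queries can simply filter on the latest generation, keeping the previous refresh around as a fallback.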

Aggregating across columns in Django

I'm trying to figure out if there's a way to do a somewhat-complex aggregation in Django using its ORM, or if I'm going to have to use extra() to stick in some raw SQL.
Here are my object models (stripped to show just the essentials):
class Submission(models.Model):
    favorite_of = models.ManyToManyField(User, related_name="favorite_submissions")

class Response(models.Model):
    submission = models.ForeignKey(Submission)
    voted_up_by = models.ManyToManyField(User, related_name="voted_up_responses")
What I want to do is sum all the votes for a given submission: that is, all of the votes for any of its responses, plus the number of people who marked the submission as a favorite.
I have the first part working using the following code; this returns the total votes for all responses of each submission:
submission_list = Response.objects\
    .values('submission')\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
(So after getting the vote total, I sort in descending order and return the top TOP_NUM submissions, to get a "best of" listing.)
That part works. Is there any way you can suggest to include the number of people who have favorited each submission in its votes? (I'd prefer to avoid extra() for portability, but I'm thinking it may be necessary, and I'm willing to use it.)
EDIT: I realized after reading the suggestions below that I should have been clearer in my description of the problem. The ideal solution would be one that allowed me to sort by total votes (the sum of voted_up_by and favorited) and then pick just the top few, all within the database. If that's not possible then I'm willing to load a few of the fields of each response and do the processing in Python; but since I'll be dealing with 100,000+ records, it'd be nice to avoid that overhead. (Also, to Adam and Dmitry: I'm sorry for the delay in responding!)
One possibility would be to re-arrange your current query slightly. What if you tried something like the following:
submission_list = Response.objects\
    .annotate(votes=Count('voted_up_by'))\
    .filter(votes__gt=0)\
    .order_by('-votes')[:TOP_NUM]
submission_list.query.group_by = ['submission_id']
This will return a queryset of Response objects (objects with the same Submission will be lumped together). In order to access the related submission and/or the favorite_of list/count, you have two options:
num_votes = submission_list[0].votes
submission = submission_list[0].submission
num_favorite = submission.favorite_of.count()
or...
submissions = []
for response in submission_list:
    submission = response.submission
    submission.votes = response.votes
    submissions.append(submission)

num_votes = submissions[0].votes
submission = submissions[0]
num_favorite = submission.favorite_of.count()
Basically the first option has the benefit of still being a queryset, but you have to be sure to access the submission object in order to get any info about the submission (since each object in the queryset is technically a Response). The second option has the benefit of being a list of the submissions with both the favorite_of list as well as the votes, but it is no longer a queryset (so be sure you don't need to alter the query anymore afterwards).
You can count favorites in another query, like:
favorite_list = Submission.objects.annotate(favorites=Count('favorite_of'))
After that you add the values from the two lists (note that submission_list was built with .values('submission'), so each item is a dict holding the submission's id):
total_votes = {}
for item in submission_list:
    total_votes[item['submission']] = item['votes']
for item in favorite_list:
    has_votes = total_votes.get(item.id, 0)
    total_votes[item.id] = has_votes + item.favorites
I am using ids in the dictionary because Submission objects will not be identical. If you need the Submissions themselves, you may use one more dictionary or store tuple (submission, votes) instead of just votes.
Added: this solution is better than the previous one because you make only two DB requests.
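Once the totals dictionary is built, picking the "best of" listing is an in-Python sort; a small sketch with illustrative data (TOP_NUM as in the question):

```python
# total_votes maps submission id -> combined vote count, built as above.
total_votes = {1: 5, 2: 9, 3: 2}
TOP_NUM = 2

# Sort (id, count) pairs by count, descending, and keep the top few.
top = sorted(total_votes.items(), key=lambda pair: pair[1], reverse=True)[:TOP_NUM]
print(top)  # -> [(2, 9), (1, 5)]
```

This is the in-Python processing step the question hoped to avoid; with 100,000+ records it trades database work for memory, which is why the in-database ordering remains preferable when it is possible.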
