I started using Google App Engine three months ago and I have a question about memcaching with Python.
I'll try to describe my problem as clearly as I can.
I use ndb (App Engine Datastore) and I have a "table" of entities like this:
class Event(ndb.Model):
    dateInsert = ndb.DateTimeProperty(auto_now_add=True)  # insertion date
    notes = ndb.StringProperty(indexed=False)             # event notes
    geohash = ndb.StringProperty(required=True)           # geohash of the coordinates
    eventLatitude = ndb.FloatProperty(indexed=True, required=True)   # self explanatory
    eventLongitude = ndb.FloatProperty(indexed=True, required=True)  # self explanatory
Client side (from a mobile app, for example) a user can store an event at specified coordinates in the datastore.
The inserted events are of course visible in the mobile app (on a map) and on a website.
Right now, to retrieve stored events, the client calls a web method that searches for events near a given location:
class getEvents(webapp.RequestHandler):
    def get(self):
        # ... read the passed parameters ...
        # hMinPos and hMaxPos are hashed coordinates passed by the client +/- X meters.
        # In this way I can filter stored events in a precise bounding box.
        # For example, I can get events near my location in a box of 5000 meters.
        qryEvent = Event.query(ndb.AND(Event.geohash >= hMinPos, Event.geohash <= hMaxPos))
        events = qryEvent.fetch(1000)
Then I loop over every result to build the JSON list that is returned to the client:
for event in events:
    # do my stuff
Everything is working fine, but the big problem is the useless read operations performed every single time the method is called.
Every call fetches the same events that other clients have already requested, or worse, the same events as the same client's previous request (if I move 50 meters and make another request, 99% of the events are the same as before).
This eats up quota and will push read operations over quota very soon.
I think I should use memcache to store fetched events and check the cache before reading from the datastore, but I have no idea how to implement this with my structure.
My idea was to use the geohash as the memcache key, but I can't iterate over cached elements; I can only do an exact get on a given key, so that approach doesn't work as-is (I can't address memcache with a single key, I would need to iterate over the cached elements to find the events that fit my coordinate-range request).
Does anyone have a hint or a suggestion?
I can think of 2 solutions:
1) Store in memcache the information of smaller boxes (e.g. 100 meters long), keyed by a latitude-longitude identifier. You could request from ndb a bigger box (e.g. 5500 meters long) and save the information of all the contained small boxes in memcache. When the user moves 50, 100 or 400 meters, you'll be able to answer with memcached data, and if someone else is near that place (within 500 meters), the same thing will happen. A rough sketch is given at the end of this answer.
2) You could use ElasticSearch, specifically the Geo Distance Filter. With it, you can filter "documents that include only hits that exists within a specific distance from a geo point".
Note: if getEvents returns events in a box of 5000 meters, maybe you shouldn't trigger a new request when moving just 50 meters, but only after a longer distance.
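A rough sketch of option 1), using the Event model and ndb query from the question. The helpers cells_in_box(), cell_id() and geohash_bounds() are assumptions you would build on top of your geohash library; this is a sketch of the idea, not a drop-in implementation:

from google.appengine.api import memcache

def get_events_near(lat, lng, radius_m=5000):
    # Small ~100 m cells that cover the requested bounding box (hypothetical helper).
    cell_keys = cells_in_box(lat, lng, radius_m)
    cached = memcache.get_multi(cell_keys, key_prefix='evbox:')
    missing = [k for k in cell_keys if k not in cached]
    if missing:
        # One datastore query for a slightly larger box (e.g. 5500 m), then
        # bucket the events into their small cells and cache each bucket.
        hMinPos, hMaxPos = geohash_bounds(lat, lng, radius_m + 500)
        events = Event.query(ndb.AND(Event.geohash >= hMinPos,
                                     Event.geohash <= hMaxPos)).fetch(1000)
        buckets = dict((k, []) for k in missing)
        for e in events:
            k = cell_id(e.eventLatitude, e.eventLongitude)
            if k in buckets:
                buckets[k].append(e.to_dict())
        memcache.set_multi(buckets, key_prefix='evbox:', time=300)
        cached.update(buckets)
    return [item for cell in cached.values() for item in cell]

A nearby user then hits memcache for most cells and only falls back to the datastore for the few cells that were never cached or got evicted.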
Related
I am extending the wagtail rich text editor, draftail. I want to give the user the ability to insert a $RANDOM entity.
The site has a global option to set the random interval, let's call these random_start and random_end. These values are stored in the database.
When the $RANDOM entity is used, I would like the visited page to display a random number between random_start and random_end.
My current attempt allows me to generate the random number but the generation happens only when the page is published, not accessed. All subsequent page visits show the same number.
Previously (before switching to Wagtail) the Django code was simple. The page view did the following (roughly the sketch shown after this list):
- get random_start
- get random_end
- generate a random number between them
- pass the generated number to the template
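In plain Django that view was roughly this (a sketch; the SiteSettings model and template name are placeholders):

import random
from django.shortcuts import render

def page_view(request):
    settings = SiteSettings.objects.first()  # wherever random_start/random_end live
    number = random.randint(settings.random_start, settings.random_end)
    return render(request, 'page.html', {'random_number': number})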
My current 'incomplete' solution is based on http://docs.wagtail.io/en/v2.5.1/advanced_topics/customisation/extending_draftail.html#creating-new-entities.
By modifying stock_entity_decorator I get my current code:
def stock_entity_decorator(props):
    """
    Draft.js ContentState to database HTML.
    Converts the STOCK entities into a span tag.
    """
    return DOM.create_element('span', {
        'data-stock': props['stock'],
    }, str(random.randint(random_start, random_end)))
Both random_start and random_end are values that can change in the database.
I know I can use JavaScript to calculate the number client side. But I am hoping a solution exists which avoids client side calculation as that would introduce other problems.
Update
I had simplified the use case a little. The random number is not completely random; it is based on a few parameters which should preferably remain secret. Doing the calculation client side would mean disclosing those values.
I thought about setting up a RESTful endpoint and using client-side JavaScript to get the values that way.
For this project I am not concerned with caching and can disable it for the required pages.
You don't need to extend Draftail. Override the get_context() method of your Page-based model; it runs on every page request, so the number is regenerated on each visit:

class YourPage(Page):
    def get_context(self, request):
        context = super().get_context(request)
        context['random_number'] = your_random_number_function()
        return context
Wagtail Docs
I'm using Google App Engine (python) for the backend of a mobile social game. The game uses Twitter integration to allow people to follow relative leaderboards and play against their friends or followers.
By far the most expensive piece of the puzzle is the background (push) task that hits the Twitter API to query for the friends and followers of a given user, and then stores that data within our datastore. I'm trying to optimize that to reduce costs as much as possible.
The Data Model:
There are three main models related to this portion of the app:
class User(ndb.Model):
    '''General user info, like scores and stats'''
    # key id => randomly generated string that uniquely identifies a user,
    # along the lines of user_kdsgj326
    # (I realize I probably should have just used the integer ID that GAE
    # creates, but it's too late for that)

class AuthAccount(ndb.Model):
    '''Authentication mechanism.
    A user may have multiple auth accounts - one for each provider.'''
    # key id => concatenation of the auth provider and the auth provider's unique
    # ID for that user, i.e. "tw:555555", where '555555' is their twitter ID
    auth_id = ndb.StringProperty(indexed=True)  # ie, '555555'
    user = ndb.KeyProperty(kind=User, indexed=True)
    extra_data = ndb.JsonProperty(indexed=False)  # twitter picture url, name, etc.

class RelativeUserScore(ndb.Model):
    '''Denormalization for quickly generating relative leaderboards'''
    # key id => same as their User id, i.e. user_kdsgj326, so that we can quickly
    # retrieve the object for each user
    follower_ids = ndb.StringProperty(indexed=True, repeated=True)
    # misc properties for the user's score, name, etc. needed for the leaderboard
I don't think it's necessary for this question, but just in case, here is a more detailed discussion that led to this design.
The Task
The background thread receives the twitter authentication data and requests a chunk of friend IDs from the Twitter API, via tweepy. Twitter sends up to 5000 friend IDs by default, and I'd rather not arbitrarily limit that more if I can avoid it (you can only make so many requests to their API per minute).
Once I get the list of friend IDs, I can easily translate them into "tw:" AuthAccount key IDs and use get_multi to retrieve the AuthAccounts. Then I drop the None results for twitter users not in our system, and collect the user IDs of the twitter friends who are in our system. Those IDs are also the keys of the RelativeUserScore entities, so I use a bunch of transactional tasklets to add this user's ID to each RelativeUserScore's follower_ids list.
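For context, a minimal sketch of what such a transactional tasklet could look like (simplified and illustrative; names follow the models above):

@ndb.transactional_tasklet
def add_follower_async(friend_user_id, follower_user_id):
    # friend_user_id is also the RelativeUserScore key id
    score = yield RelativeUserScore.get_by_id_async(friend_user_id)
    if score is not None and follower_user_id not in score.follower_ids:
        score.follower_ids.append(follower_user_id)
        yield score.put_async()

# fired once per friend who plays the game, then waited on together:
# ndb.Future.wait_all([add_follower_async(fid, my_user_id)
#                      for fid in friend_system_ids])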
The Optimization Questions
The first thing that happens is a call to Twitter's API. Given that this is required for everything else in the task, I'm assuming I would not get any gains in making this asynchronous, correct? (GAE is already smart enough to use the server for handling other tasks while this one blocks?)
When determining if a twitter friend is playing our game, I currently convert all twitter friend ids to auth account IDs, and retrieve by get_multi. Given that this data is sparse (most twitter friends will most likely not be playing our game), would I be better off with a projection query that just retrieves the user ID directly? Something like...
twitter_friend_ids = twitter_api.friend_ids()  # potentially 5000 values
friend_system_ids = AuthAccount \
    .query(AuthAccount.auth_id.IN(twitter_friend_ids)) \
    .fetch(projection=[AuthAccount.user_id])
(I can't remember or find where, but I read this is better because you don't waste time attempting to read model objects that don't exist.)
Whether I end up using get_multi or a projection query, is there any benefit to breaking up the request into multiple async queries, instead of trying to get / query for potentially 5000 objects at once?
I would organize the task like this:
- Make an asynchronous fetch call to the Twitter feed.
- Use memcache to hold all of the AuthAccount -> User data: request the data from memcache, and if it doesn't exist make a fetch_async() call on AuthAccount to populate memcache and a local dict.
- Run each of the twitter IDs through the dict.
Here is some sample code:
future = twitter_api.friend_ids()  # make this asynchronous
auth_users = memcache.get('auth_users')
if auth_users is None:
    auth_accounts = AuthAccount.query() \
        .fetch(projection=[AuthAccount.auth_id,
                           AuthAccount.user_id])
    auth_users = dict((a.auth_id, a.user_id) for a in auth_accounts)
    memcache.add('auth_users', auth_users, 60)

twitter_friend_ids = future.get_result()  # get async twitter results
friend_system_ids = []
for id in twitter_friend_ids:
    friend_id = auth_users.get(str(id))  # auth_id is stored without the "tw:" prefix
    if friend_id:
        friend_system_ids.append(friend_id)
This is optimized for a relatively small number of users and a high rate of requests. Your comments above indicate a higher number of users and a lower rate of requests, so I would only make this change to your code:
twitter_friend_ids = twitter_api.friend_ids()  # potentially 5000 values
auth_account_keys = [ndb.Key("AuthAccount", "tw:%s" % id) for id in twitter_friend_ids]
# get_multi returns None for friends who aren't in the system; filter those out
friend_accounts = filter(None, ndb.get_multi(auth_account_keys))
friend_system_ids = [account.user_id for account in friend_accounts]
This will use ndb's built-in memcache to hold data when using get_multi() with keys.
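On the third question: if you still want to experiment with splitting the lookup into several async batches, a sketch could look like the following. Note that ndb already batches RPCs under the hood, so whether this actually helps is something you would have to measure:

def get_multi_in_batches(keys, batch_size=500):
    # get_multi_async returns one future per key; fire them off in chunks
    # and collect all results afterwards.
    futures = []
    for i in range(0, len(keys), batch_size):
        futures.extend(ndb.get_multi_async(keys[i:i + batch_size]))
    return [f.get_result() for f in futures]

# accounts = [a for a in get_multi_in_batches(auth_account_keys) if a is not None]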
I use python, ndb and the datastore. My model ("Event") has a property:
created = ndb.DateTimeProperty(auto_now_add=True).
Events get saved every now and then, sometimes several within one second.
I want to "poll for new events", without getting the same Event twice, and get an empty result if there aren't any new Events. However, polling again might give me new events.
I have seen Cursors, but I don't know if they can somehow be used to poll for new Events after having reached the end of the first query. The "next_cursor" is None when I've reached the (current) end of the data.
Keeping the last received "created" DateTime property and using it to fetch the next batch works, but that only has a resolution of seconds, so the ordering might get screwed up.
Must I create my own transactional, incrementing counter in Event for this?
Yes, using cursors is a valid option. Even though this link is from the Java documentation, it's valid for Python as well. The second paragraph is what you are looking for:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
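A minimal sketch of that pattern with ndb, using the created property from the question (you would persist the cursor between polls, e.g. in memcache or a small datastore entity):

from google.appengine.datastore.datastore_query import Cursor

def poll_new_events(cursor_urlsafe=None):
    start = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    query = Event.query().order(Event.created)
    events, next_cursor, more = query.fetch_page(100, start_cursor=start)
    # An empty result means nothing new; keep the old cursor and poll again later.
    new_cursor = next_cursor.urlsafe() if next_cursor else cursor_urlsafe
    return events, new_cursor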
EDIT: Prospective search was shut down on December 1, 2015.
Rather than polling, an alternative approach would be to use prospective search:
https://cloud.google.com/appengine/docs/python/prospectivesearch/
From the docs
"Prospective search is a querying service that allows your application
to match search queries against real-time data streams. For every
document presented, prospective search returns the ID of every
registered query that matches the document."
I'm trying to do some practicing with the GAE datastore to get a feeling for the query and billing mechanisms.
I've read the O'Reilly book about GAE and watched the Google videos about the datastore. My problem is that the best-practice methods usually concern apps with more reads than writes to the datastore.
I built a super simple app:
- there are two webpages: one to choose links, and one to view the chosen links
- every user can choose to add url links to his "links feed"
- the user can choose as many links as he wants, whenever he wants
- on a different webpage, I want to show the user the 10 most recent links he chose
- every user has his own "links feed" webpage
- for every link I want to save and show some metadata, for example: the url itself, when it was chosen, how many times it has appeared in the feed already, etc.
In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more than it reads (a write when the user chooses another link; a read when the user opens the webpage to see his "links feed").
Question 1:
I can think of (at least) two options how to handle the data for this app:
Option A:
- maintain an entity per user with the user details, registration, etc.
- maintain another entity per user that holds his 10 most recently chosen links, which is rendered to the user's webpage when he asks for it
Option B:
- maintain an entity per url link, which means the urls chosen by all users are stored in one big table
- maintain an entity per user's details (same as in Option A), but add references to the user's urls in the big table of urls
What will be the better method?
Question 2:
If I want to count the total number of urls chosen to date, or the daily number of urls a user chose, or any other count, should I compute it with queries, or should I keep counters in the entities I described above? (I want to reduce the number of datastore writes as much as I can.)
EDIT (to answer @Elad's comment):
Assume I want to save only the 10 most recent urls per user. The rest I want to get rid of (so as not to overpopulate my DB with unnecessary data).
EDIT 2: after adding the code
So I gave it a try with the following code (trying Elad's method first).
Here's my class:
class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    currentList = db.StringListProperty()  # holds the last 20-30 urls
Then I serialize the url & metadata into a JSON string, which the user POSTs from the first page.
Here's how the POST is handled:
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())

        # update with the new item that the user adds
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)
        current_user.currentList.append(dataJson)
        sizePlaylist = len(current_user.currentList)
        self.response.out.write('<p>size of currentList is: %s</p>' % sizePlaylist)

        # whenever the list reaches 30 entries I cut it back to 20
        if sizePlaylist > 30:
            current_user.currentList = current_user.currentList[-20:]

        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')
where Updater is my method for pushing the feed update to the webpage via the Channel API.
Now, it all works; I can see that each user has a ListProperty with 20-30 links (when it hits 30 I cut it down to 20), but the prices are quite high...
Each POST like this one takes ~200 ms, 121 cpu_ms, cpm_usd = 0.003588. That's very expensive considering all I do is append a string to a list...
I think the problem might be that the entity gets big because of the large ListProperty?
First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records into a single model exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.
First Question
I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).
This is efficient for writing since you only update a single record - there are no updates to all the link records whenever the user adds links. Keep in mind that keeping a rolling list (FIFO) with URLs stored as separate entities means two write actions for every new URL: an insert of the new URL, and a delete to remove the oldest one.
It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
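A minimal sketch of that single-entity approach, using the db API from your code (model and helper names are illustrative, not prescriptive):

import json
from datetime import datetime
from google.appengine.ext import db

class UserFeed(db.Model):                   # one entity per user, key_name = nickname
    recent_links = db.StringListProperty()  # JSON-serialized link + metadata

def add_link(nickname, url):
    feed = UserFeed.get_or_insert(nickname)
    feed.recent_links.append(json.dumps({'url': url,
                                         'added': datetime.utcnow().isoformat()}))
    feed.recent_links = feed.recent_links[-10:]  # keep only the 10 newest
    feed.put()                                   # a single entity write per added link

Rendering the feed is then one get_by_key_name() plus a json.loads() per item.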
From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variance of URLs chosen by your users, so it's likely that the mean URL will have a single user - no real gain in RDBMS-style normalization of the data.
Another optimization idea: if your users usually add several links in a short period, you can try to write them in bulk rather than separately. Use memcache to store newly added user URLs, and the Task Queue to periodically write that transient data to the persistent datastore. I'm not sure what the resource cost of using Tasks is, though - you'll have to check.
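One way that buffering could look, as a sketch (the handler URL and memcache key names are assumptions, and memcache data can be evicted at any time, so you trade durability for fewer writes):

from google.appengine.api import memcache, taskqueue

def buffer_link(user_id, link_json):
    buf = memcache.get('linkbuf:' + user_id) or []
    buf.append(link_json)
    memcache.set('linkbuf:' + user_id, buf)
    if len(buf) == 1:
        # first buffered link for this user: schedule a flush in ~60 seconds
        taskqueue.add(url='/tasks/flush_links',
                      params={'user_id': user_id}, countdown=60)

# The /tasks/flush_links handler reads and clears 'linkbuf:<user_id>' and
# applies all buffered links to the user entity with a single put().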
Here's a good article to read on the subject.
Second Question
Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes and blog posts on the subject; just google "appengine counters". Here too, using memcache should be a good option to reduce the total number of datastore writes.
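As an illustration of the memcache idea, a deliberately simple counter sketch (a real implementation would follow the sharded-counter recipes mentioned above, since memcache values can be evicted and this flush is not race-free):

from google.appengine.api import memcache
from google.appengine.ext import db

class UrlCounter(db.Model):  # illustrative single counter entity
    total = db.IntegerProperty(default=0)

FLUSH_EVERY = 100  # roughly one datastore write per 100 increments

def count_chosen_url():
    pending = memcache.incr('urls_pending', initial_value=0)
    if pending and pending >= FLUSH_EVERY:
        def txn():
            counter = (UrlCounter.get_by_key_name('total')
                       or UrlCounter(key_name='total'))
            counter.total += pending
            counter.put()
        db.run_in_transaction(txn)
        memcache.decr('urls_pending', delta=pending)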
Answer 1
Store Links as separate entities. Also store an entity per user with a ListProperty holding the keys of the 20 most recent links. As the user chooses more links you just update the ListProperty of keys. ListProperty maintains order, so you don't need to worry about the chronological order of the chosen links as long as you follow a FIFO insertion order.
When you want to show the user's chosen links (page 2) you can do one get(keys) to fetch all the user's links in one call.
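A sketch of that layout with the db API (model names are illustrative):

class Link(db.Model):
    url = db.StringProperty(required=True)
    chosen_at = db.DateTimeProperty(auto_now_add=True)

class UserLinks(db.Model):                 # one per user, key_name = nickname
    recent_keys = db.ListProperty(db.Key)  # FIFO, newest last, at most 20 keys

def fetch_feed(nickname):
    user_links = UserLinks.get_by_key_name(nickname)
    return db.get(user_links.recent_keys)  # one batch get for all 20 Link entities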
Answer 2
Definitely keep counters. As the number of entities grows, the cost of counting records keeps increasing, but with counters the performance remains the same.
How to do this on Google App Engine (Python):
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
Long version:
I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.
My event entities look like this:
class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)
With an SQL database, I would do something like
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
First thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:
query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)

usernames = []
for event in query:
    usernames.append(event.user)
answer = len(set(usernames))
But this won't work, because it will only support up to 1000 events. Next thing that occurs to me is to get 1000 events, then when those run out get the next thousand and so on. But that won't work either, because going through a thousand queries and retrieving a million entities would take over 30 seconds, which is the request time limit.
Then I thought I should ORDER BY user to skip over duplicates faster. But that is not allowed because I am already using the inequality "t >= start_time AND t <= end_time".
It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to find 1000 pageview events, get the distinct usernames from those, and put them in an entity like Chard. It could look something like this:
class Chard(db.Model):
    usernames = db.StringListProperty(required=True)
So each chard would have up to 1000 usernames in it, fewer if duplicates were removed. After about 16 hours (which is fine) I would have all the chards and could do something like:
chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)
It seems like it might work, but it's hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it in the hope that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.
Is there any prettier solution to my problem?
Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.
As of SDK v1.7.4, there is now experimental support for the DISTINCT function.
See : https://developers.google.com/appengine/docs/python/datastore/gqlreference
Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.
You would have a memcache variable called unique_visits_today or something similar. Every time that a user had their first pageview of the day, you would use the .incr() function to increment that counter.
Determining that this is the user's first visit is accomplished by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is earlier than today, you update it to today and increment your memcache counter.
At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:
class UniqueVisitsRecord(db.Model):
    # be careful to set the date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()
You could then simply, easily and quickly get all the UniqueVisitsRecords that match any date range and add up the numbers in their event_count fields.
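A sketch of the nightly cron step and the range query, using the model above (the memcache key name is an assumption):

from google.appengine.api import memcache

def flush_unique_visits(today):
    count = memcache.get('unique_visits_today') or 0
    UniqueVisitsRecord(activity_date=today, event_count=int(count)).put()
    memcache.set('unique_visits_today', 0)

def unique_visits_between(start_date, end_date):
    query = UniqueVisitsRecord.all()
    query.filter('activity_date >=', start_date)
    query.filter('activity_date <=', end_date)
    return sum(record.event_count for record in query)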
NDB still does not support DISTINCT. I have written a small utility method to be able to use distinct with GAE.
See here. http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html
Google App Engine, and more particularly GQL, does not support a DISTINCT function.
But you can use Python's set function as described in this blog and in this SO question.