Optimizing for Social Leaderboards - python

I'm using Google App Engine (python) for the backend of a mobile social game. The game uses Twitter integration to allow people to follow relative leaderboards and play against their friends or followers.
By far the most expensive piece of the puzzle is the background (push) task that hits the Twitter API to query for the friends and followers of a given user, and then stores that data within our datastore. I'm trying to optimize that to reduce costs as much as possible.
The Data Model:
There are three main models related to this portion of the app:
class User(ndb.Model):
    '''General user info, like scores and stats'''
    # key id => randomly generated string that uniquely identifies a user,
    # along the lines of user_kdsgj326
    # (I realize I probably should have just used the integer ID that GAE
    # creates, but it's too late for that)

class AuthAccount(ndb.Model):
    '''Authentication mechanism.
    A user may have multiple auth accounts - one for each provider'''
    # key id => concatenation of the auth provider and the auth provider's unique
    # ID for that user, i.e., "tw:555555", where '555555' is their twitter ID
    auth_id = ndb.StringProperty(indexed=True)  # i.e., '555555'
    user = ndb.KeyProperty(kind=User, indexed=True)
    extra_data = ndb.JsonProperty(indexed=False)  # twitter picture url, name, etc.

class RelativeUserScore(ndb.Model):
    '''Denormalization for quickly generating relative leaderboards'''
    # key id => same as their User id, i.e., user_kdsgj326, so that we can quickly
    # retrieve the object for each user
    follower_ids = ndb.StringProperty(indexed=True, repeated=True)
    # misc properties for the user's score, name, etc. needed for the leaderboard
I don't think it's necessary for this question, but just in case, here is a more detailed discussion that led to this design.
The Task
The background thread receives the twitter authentication data and requests a chunk of friend IDs from the Twitter API, via tweepy. Twitter sends up to 5000 friend IDs by default, and I'd rather not arbitrarily limit that more if I can avoid it (you can only make so many requests to their API per minute).
Once I get the list of the friend IDs, I can easily translate that into "tw:" AuthAccount key IDs, and use get_multi to retrieve the AuthAccounts. Then I remove all of the Null accounts for twitter users not in our system, and get all the user IDs for the twitter friends that are in our system. Those ids are also the keys of the RelativeUserScores, so I use a bunch of transactional_tasklets to add this user's ID to the RelativeUserScore's followers list.
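A simplified sketch of that last step might look like this (with current_user_id standing in for the ID of the user whose friends are being processed):
from google.appengine.ext import ndb

@ndb.transactional_tasklet
def add_follower(friend_system_id, current_user_id):
    # each RelativeUserScore is keyed by the same ID as its User
    score = yield RelativeUserScore.get_by_id_async(friend_system_id)
    if score is not None and current_user_id not in score.follower_ids:
        score.follower_ids.append(current_user_id)
        yield score.put_async()

# fire one tasklet per friend and wait for all of them to finish
futures = [add_follower(fid, current_user_id) for fid in friend_system_ids]
ndb.Future.wait_all(futures)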
The Optimization Questions
The first thing that happens is a call to Twitter's API. Given that this is required for everything else in the task, I'm assuming I would not get any gains in making this asynchronous, correct? (GAE is already smart enough to use the server for handling other tasks while this one blocks?)
When determining if a twitter friend is playing our game, I currently convert all twitter friend ids to auth account IDs, and retrieve by get_multi. Given that this data is sparse (most twitter friends will most likely not be playing our game), would I be better off with a projection query that just retrieves the user ID directly? Something like...
twitter_friend_ids = twitter_api.friend_ids()  # potentially 5000 values
friend_system_ids = AuthAccount \
    .query(AuthAccount.auth_id.IN(twitter_friend_ids)) \
    .fetch(projection=[AuthAccount.user_id])
(I can't remember or find where, but I read this is better because you don't waste time attempting to read model objects that don't exist.)
Whether I end up using get_multi or a projection query, is there any benefit to breaking up the request into multiple async queries, instead of trying to get / query for potentially 5000 objects at once?
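For example, would something along these lines be worthwhile? (A rough sketch that splits the keys into batches with ndb.get_multi_async; the chunk size is arbitrary.)
CHUNK_SIZE = 500  # arbitrary batch size
futures = []
for i in range(0, len(auth_account_keys), CHUNK_SIZE):
    futures.append(ndb.get_multi_async(auth_account_keys[i:i + CHUNK_SIZE]))
# each call returns one future per key; resolve them all at the end
accounts = [f.get_result() for chunk in futures for f in chunk]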

I would organize the task like this:
Make an asynchronous fetch call to the Twitter feed
Use memcache to hold all the AuthAccount->User data:
Request the data from memcache; if it doesn't exist, make a fetch_async() call on AuthAccount to populate memcache and a local dict
Run each of the twitter IDs through the dict
Here is some sample code:
future = twitter_api.friend_ids()  # make this asynchronous

auth_users = memcache.get('auth_users')
if auth_users is None:
    auth_accounts = AuthAccount.query() \
        .fetch(projection=[AuthAccount.auth_id, AuthAccount.user])
    # auth_id holds the bare twitter ID (e.g. '555555'), per the model above
    auth_users = dict((a.auth_id, a.user.id()) for a in auth_accounts)
    memcache.add('auth_users', auth_users, 60)

twitter_friend_ids = future.get_result()  # get async twitter results

friend_system_ids = []
for twitter_id in twitter_friend_ids:
    friend_id = auth_users.get(str(twitter_id))
    if friend_id:
        friend_system_ids.append(friend_id)
This is optimized for a relatively smaller number of users and a high rate of requests. Your comments above indicate a higher number of users and a lower rate of requests, so I would only make this change to your code:
twitter_friend_ids = twitter_api.friend_ids()  # potentially 5000 values
auth_account_keys = [ndb.Key("AuthAccount", "tw:%s" % friend_id) for friend_id in twitter_friend_ids]
friend_accounts = [a for a in ndb.get_multi(auth_account_keys) if a is not None]
friend_system_ids = [a.user.id() for a in friend_accounts]
This relies on ndb's built-in caching (the in-context cache and memcache) to hold the data when fetching by key with get_multi().

Related

Stripe Subscription Testing

I'm trying to create a monthly subscription with Stripe.
I wrote sample code like this:
if event_type == 'checkout.session.completed':
    # Payment is successful and the subscription is created.
    # You should provision the subscription and save the customer ID to your database.
    try:
        session = stripe.checkout.Session.retrieve(
            event['data']['object'].id, expand=['line_items', 'subscription'])
        _current_period_end = session['subscription']['current_period_end']
        _current_period_start = session['subscription']['current_period_start']
        _product_id = session['subscription']['items']['data'][0]['plan']['product']
        _user_id = session['client_reference_id']
        _stripe_customer_id = session['customer']
        _subscription_id = session["subscription"]['id']
        '''
        do other things to update user package
        '''
    except Exception as e:
        '''
        error log
        '''
elif event_type == 'invoice.paid':
    if THIS_IS_NOT_FIRST_TIME:
        parse_event_data
        '''
        do other things to extend subscription
        '''
I have some questions:
1) I parsed the webhook result from the dict returned by the stripe.checkout.Session.retrieve object. That seemed a little odd to me. What if Stripe updates its API response and uses different names for the dict keys I used? Is there another way to get these values, such as with dot notation (session.get.product_id, maybe)?
2) How can I tell that an invoice.paid event was not triggered by the first invoice of a subscription?
3) I want to test the renewal process of my monthly subscription. I used stripe trigger invoice.payment_succeeded, but I need real data for my test accounts (with a test customer, subscription, product, etc.).
4) I can update my user's package using the CHECKOUT_SESSION_ID from the checkout success url ("success?session_id={CHECKOUT_SESSION_ID}"). Should I do that, or use the checkout.session.completed webhook?
5) I returned an HTTP 500 response to every request to my webhook URL to see whether Stripe would show an error message to the user on the checkout page. However, Stripe just created a successful subscription. In that case Stripe will take a payment from my customer even though I cannot update my customer's package in my database. What should I do to prevent this? Should I create a scheduled job to sync data between Stripe and my DB?
You have many separate questions that would be better suited for Stripe's support team directly: https://support.stripe.com/contact/email
Now I will try to touch on some of the questions you had starting with the first one.
When you call session = stripe.checkout.Session.retrieve(...) you get back an object representing a Checkout Session. All the properties of the object map to the properties covered in the API Reference for Session. This means you can do session.id, which is something like cs_test_123, or session.created, which is a timestamp of the creation date. It's not really different from accessing it as a dictionary overall.
You're also asking whether those can change; Stripe explains their backwards-compatibility policy in detail in their docs here. If they were to change a name from created to created_at, they would do it in a new API version for new integrations, and it wouldn't impact your code unless you manually changed the API version for your account, so that is safe.
For the invoice.paid event, you want to look at the invoice's billing_reason property which would be subscription_create for the first invoice.
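For example, a rough sketch of that check (assuming event is the verified webhook event and handle_renewal is a placeholder for your own logic):
if event_type == 'invoice.paid':
    invoice = event['data']['object']
    if invoice['billing_reason'] == 'subscription_create':
        # first invoice of the subscription - already handled via
        # checkout.session.completed, so nothing to do here
        pass
    else:
        # e.g. 'subscription_cycle' on renewal - extend the subscription
        handle_renewal(invoice)  # placeholder for your own logic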
You can test all of this easily in Test mode, create a session, start a subscription, etc. You can also simulate cycle changes.
I hope this helps but chatting with their support team is your best bet since those are more integration questions and not coding questions.

Caching query results on GAE

I started using Google App Engine 3 months ago and I have a question about memcaching with Python.
I try to describe my problem as best as I can.
I use ndb (App Engine Datastore) and I have a "table" of entities like this:
class Event(ndb.Model):
    dateInsert = ndb.DateTimeProperty(auto_now_add=True)  # insertion date
    notes = ndb.StringProperty(indexed=False)  # event notes
    geohash = ndb.StringProperty(required=True)  # coordinates geohash
    eventLatitude = ndb.FloatProperty(indexed=True, required=True)  # self explanatory
    eventLongitude = ndb.FloatProperty(indexed=True, required=True)  # self explanatory
Client side (with a mobile app for example) a user can store in datastore an event in specified coordinates.
Those inserted events are visible of course by mobile app (on the map) and on a website.
Right now, to retrieve stored events, the client calls a web method that searches for events near a given location:
class getEvents(webapp.RequestHandler):
    def get(self):
        # ... get passed parameters ...
        # hMinPos and hMaxPos are hashed coordinates passed by the client + X meters.
        # In this way I can filter stored events in a precise bounding box.
        # For example, I can get events near my location in a box of 5000 meters.
        qryEvent = Event.query(ndb.AND(Event.geohash >= hMinPos, Event.geohash <= hMaxPos))
        events = qryEvent.fetch(1000)
Then I loop over every result to build the JSON list that is returned to the client:
for event in events:
    # do my stuff
Everything is working fine, but the big problem is the useless read operations EVERY single time that method is called.
I mean, every time the method gets called it fetches the same events as other clients' requests, or worse, the same events as the same client's previous request (if I move by 50 meters and make another request, 99% of the events are the same as before).
This will burn through my quota and put read operations over-quota very soon.
I think I should use memcache to store fetched events and check it before reading from the datastore, but I have no idea how to implement that with my structure.
My idea was to use the geohash as a memcache key, but memcache only supports exact gets on a given key; I can't iterate through cached elements to find the events that fit my coordinate-range request, so that solution is not applicable.
Someone has a hint or a suggestion?
I can think of 2 solutions:
1) Store in memcache the information of smaller boxes (e.g. 100 meters long), with a latitude-longitude identifier. You could request from ndb a big box of e.g. 5500 meters long and save the information of all the contained small boxes in memcache. When the user moves 50, 100 or 400 meters, you'll be able to answer with memcached data, and the same will happen for anyone else near that place (within 500 meters).
2) You could use ElasticSearch, specifically the Geo Distance Filter. With it, you can filter "documents that include only hits that exists within a specific distance from a geo point".
Note: If getEvents returns events in a box of 5000 meters, maybe you shouldn't trigger a new request when moving 50 meters, but only after a longer distance.
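A rough sketch of option 1, assuming a geohash prefix is used as the cell identifier (get_cells_for_box and fetch_events_for_cell are hypothetical helpers you would implement):
from google.appengine.api import memcache

def get_events_in_box(hMinPos, hMaxPos):
    events = []
    for cell in get_cells_for_box(hMinPos, hMaxPos):  # small cells covering the requested box
        cached = memcache.get('events_' + cell)
        if cached is None:
            cached = fetch_events_for_cell(cell)  # one datastore query per uncached cell
            memcache.add('events_' + cell, cached, time=300)  # cache for 5 minutes
        events.extend(cached)
    return events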

How to find what serial number a user has in Twitter

I would like to follow certain users just to monitor their tweets in my program, but I don't know where to find the serial numbers that identify Twitter users. I have
follow_list = []
streamer.filter(follow = follow_list)
I know that the users are identified by strings like '1234567890', but I don't know where a list of these serial numbers is...
You need to use Twitter's users/lookup API method; it takes usernames and returns a list of user objects with extended user data.
In tweepy there is a lookup_users method wrapping this API call. According to the tweepy source, it should be something like:
users = tweepy.api.lookup_users(screen_names=['twitter', 'cleg'])
for user in users:
    print(user)
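To turn that result into the ID list the stream filter expects, something like this should work (id_str is the user's numeric ID as a string):
users = tweepy.api.lookup_users(screen_names=['twitter', 'cleg'])
follow_list = [user.id_str for user in users]
streamer.filter(follow=follow_list)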

FRIENDS Table Datastore Design + App Engine Python

In my application, we need to develop a FRIENDS relationship table in datastore. And of course a quick solution I've thought of would be this:
user = db.ReferenceProperty(User, required=True, collection_name='user')
friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Will this be too inefficient?
Performance is always a priority for us. This is very much needed, as we will have a few more relationships following this same design.
Please give advice on the best approach to design for FRIENDS relationship table using datastore within App Engine Python environment.
EDIT
Other than the FRIENDS relationship, a FOLLOWER relationship will be created as well. And I believe all these relationships will be queried very often, given how social-media oriented my application tends to be.
For example, if I follow some users, I will get updates in a news feed on what they are doing, etc. That activity will increase over time. As for how many users, I can't answer yet as we haven't gone live, but I foresee having millions of users as we go on.
Hopefully this helps you give more specific advice. Or is there an alternative to this approach?
Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for the list of a user's friends is O(N), where N is the number of friends, due to the table you've described in your post. However, each of those queries requires another O(N) operation to retrieve content that the friend has shared. This leads to O(N^2) each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
class SharedItems(db.Model):
    user = db.ReferenceProperty(User, required=True)       # logged-in user
    from_user = db.ReferenceProperty(User, required=True)  # who shared it
    item = db.ReferenceProperty(Item, required=True)       # the item itself
    posted = db.DateTimeProperty()                         # when it was shared
When it comes time to render the stream of updates, you need an O(N) query (N is the number of items you want to display) to look up all the items shared with the user (ordered by date descending). Keep N small to keep this as fast as possible.
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
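A minimal sketch of that fan-out using the deferred library (get_follower_keys is a hypothetical helper that returns the keys of everyone who should see the item):
import datetime

from google.appengine.ext import db, deferred

def share_item(poster_key, item_key):
    follower_keys = get_follower_keys(poster_key)  # hypothetical helper
    # fan out in batches so no single task has to write thousands of entities
    for i in range(0, len(follower_keys), 500):
        deferred.defer(write_shared_items, follower_keys[i:i + 500], poster_key, item_key)

def write_shared_items(follower_keys, poster_key, item_key):
    shared = [SharedItems(user=k, from_user=poster_key, item=item_key,
                          posted=datetime.datetime.utcnow())
              for k in follower_keys]
    db.put(shared)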
Property lists are a great way to get cheap/simple indexing in GAE, but as you have correctly identified, there are a few limitations:
- the index size of the entire entity is limited (I think currently 5000 entries), and each property-list value requires an index entry, so basically the property list must stay below ~5000 items
- serialisation of such a large property list is expensive
- bringing back a 2 MB entity is slow and will cost CPU
If you're expecting a large property list, don't do it. The alternative is to create a JOIN table that models the relationship:
class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True)    # the logged-in user
    friend = db.ReferenceProperty(User, required=True)  # the user they are friends with
Just an entity with 2 keys.
This allows for simple querying to find all friends for a user:
SELECT * FROM Friends WHERE user = :me
and to find all users I am the friend of:
SELECT * FROM Friends WHERE friend = :me
Since these queries return keys, you can do a bulk get(keylist) to fetch the actual friends' details.

GAE datastore - best practice when there are more writes than reads

I'm trying to do some practicing with the GAE datastore to get a feel for the queries and billing mechanisms.
I've read the O'Reilly book about GAE and watched the Google videos about the datastore. My problem is that the best-practice methods usually concern more reads than writes to the datastore.
I built a super simple app:
there are two webpages: one to choose links, and one to view chosen links
every user can choose to add url links to his "links feed"
the user can choose as many links as he wants, whenever he wants.
on a different webpage, I want to show the user the most recent 10 links he chose.
every user has his own "links feed" webpage.
on every "link" I want to save and show some metadata - for example: the url link itself; when it was chosen; how many times it appeared on the feed already; etc.
In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more than it reads (write - when the user chooses another link; read - when the user opens the webpage to see his "links feed").
Question 1:
I can think of (at least) two options how to handle the data for this app:
Option A:
- maintain entity per user with the user details, registration, etc
- maintain another entity per user that holds his recent 10 chosen links, which will be rendered to the user's webpage after he asks for it
Option B:
- maintain an entity per url link - which means the urls of all users will be stored as entities of the same kind
- maintain entity per user details (same as in Option A), but add a reference to the user's urls in the big table of the urls
What will be the better method?
Question 2:
If I want to count the total number of urls chosen to date, or the daily amount of urls a user chose, or do any other counting - should I compute it with queries using the SDK tools, or should I keep counters in the entities I described above? (I want to reduce the amount of datastore writes as much as I can.)
EDIT (to answer #Elad's comment):
Assume I want to save only the 10 most recent urls per user; the rest I want to get rid of (so as not to overpopulate my DB with unnecessary data).
EDIT 2: after adding the code
So I gave it a try with the following code (trying Elad's method first):
Here's my class:
class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    currentList = db.StringListProperty()  # holds the last 20-30 urls
Then I serialize the url & metadata into JSON strings, which the user POSTs from the first page.
Here's how the POST is handled:
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())

        # updating the new item that the user adds
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)
        current_user.currentPlaylist.append(dataJson)
        sizePlaylist = len(current_user.currentPlaylist)
        self.response.out.write('<p>size of currentplaylist is: %s</p>' % sizePlaylist)
        # whenever the list gets to 30 I cut it to be 20 long
        if sizePlaylist > 30:
            for i in range(0, 9):
                current_user.currentPlaylist.pop(i)
        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')
where Updater is my method for updating with Channel-API the webpage with the feed.
Now, it all works, I can see each user has a ListProperty with 20-30 links (when it hits 30, I cut it down to 20 with the pop()), but! the prices are quite high...
each POST like the one here takes ~200ms, 121 cpu_ms, cpm_usd= 0.003588. This is very expensive considering all I do is save a string to the list...
I think the problem might be that the entity gets big with the big ListProperty?
First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records into a single model table exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.
First Question
I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).
This is efficient for writing since you only update a single record - there are no updates to all the link records whenever the user adds links. Keep in mind that to keep a rolling (FIFO) list with URLs stored as separate entities, every new URL means two write actions: an insert of the new URL, and a delete to remove the oldest one.
It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variance of URLs chosen by your users, so it's likely that the mean URL will have a single user - no real gain in RDBMS-style normalization of the data.
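A rough sketch of this single-entity approach (the model and property names here are illustrative, not prescriptive):
import json

from google.appengine.ext import db

class UserFeed(db.Model):
    # key_name == the user's ID
    recent_links = db.StringListProperty()  # each item is a JSON-serialized URL + metadata

def add_link(user_id, url, metadata):
    feed = UserFeed.get_or_insert(user_id)
    feed.recent_links.append(json.dumps({'url': url, 'meta': metadata}))
    feed.recent_links = feed.recent_links[-10:]  # keep only the 10 most recent
    feed.put()  # a single write per added link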
Another optimization idea: if your users usually add several links in a short period you can try to write them in bulk rather than separately. Use memcache to store newly added user URLs, and the Task Queue to periodically write that transient data to the persistent datastore. I'm not sure what's the resource cost of using Tasks though - you'll have to check.
Here's a good article to read on the subject.
Second Question
Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes and blog posts on the subject - just google appengine counters. Here too, using memcache should be a good option in order to reduce the total number of datastore writes.
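As an illustration, one way to buffer increments in memcache and flush them to the datastore periodically (the model name and flush schedule are illustrative; note that memcache is volatile, so a few increments could be lost on eviction):
from google.appengine.api import memcache
from google.appengine.ext import db

class UrlCounter(db.Model):
    total = db.IntegerProperty(default=0)

def increment_url_count():
    # cheap in-memory increment on every chosen link
    memcache.incr('url_count_buffer', initial_value=0)

def flush_url_count():
    # run this from a cron job or task queue, not on every request
    buffered = memcache.get('url_count_buffer') or 0
    if buffered:
        counter = UrlCounter.get_or_insert('total_urls')
        counter.total += int(buffered)
        counter.put()  # one datastore write for many increments
        memcache.decr('url_count_buffer', delta=int(buffered))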
Answer 1
Store Links as separate entities. Also store an entity per user with a ListProperty holding keys to the most recent 20 links. As the user chooses more links, you just update the ListProperty of keys. ListProperty maintains order, so you don't need to worry about the chronological order of chosen links as long as you follow a FIFO insertion order.
When you want to show the user's chosen links (page 2) you can do one get(keys) to fetch all the user's links in one call.
Answer 2
Definitely keep counters: as the number of entities grows, the cost of counting records will keep increasing, but with counters the performance will remain the same.
