I have a Python server running on Google App Engine that implements a social network. I am trying to find the best way (best = fast and cheap) to implement interactions on items.
Just like any other social network I have the stream items ("Content") and users can "like" these items.
As for queries, I want to be able to:
Get the list of users who liked the content
Get a total count of the likers.
Get an intersection of the likers with any other users list.
My current implementation includes:
1. An IntegerProperty on the content item that holds the total likers count
2. InteractionModel - an ndb model with a key id equal to the content id (fast fetch) and a JsonProperty that holds the likers' usernames
Each time a user likes a content item I need to update the counter and the list of users. This requires me to run and pay for 4 datastore operations (2 reads, 2 writes).
On top of that, items with lots of likers result in an InteractionModel with a huge JSON that takes time to serialize and deserialize when reading/writing (still faster than a RepeatedProperty).
None of the updated fields is indexed (built-in index) or included in a composite index (index.yaml).
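For reference, the current setup looks roughly like this (a minimal sketch; property names and the like() helper are assumptions, not from the question):

from google.appengine.ext import ndb

class Content(ndb.Model):
    # ...other content fields...
    likers_count = ndb.IntegerProperty(default=0, indexed=False)  # total likers

class InteractionModel(ndb.Model):
    # key id equal to the content id, so it can be fetched by key
    likers = ndb.JsonProperty()  # list of liker usernames

def like(content_id, username):
    # the 4 paid operations per like: 2 reads + 2 writes
    content = Content.get_by_id(content_id)
    interaction = InteractionModel.get_by_id(content_id)
    if interaction is None:
        interaction = InteractionModel(id=content_id, likers=[])
    if username not in interaction.likers:
        interaction.likers.append(username)
        content.likers_count += 1
        ndb.put_multi([content, interaction])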
I'm looking for a more efficient and cost-effective way to implement the same requirements.
I'm guessing you have two entities in your model: User and Content. Your queries seem to aggregate over multiple Content objects.
What about keeping these aggregated values on the User object? This way, you don't need to run any queries, but rather only look up the data stored in the User object.
At some point, though, you might consider not using the datastore and look at SQL storage instead. It has a higher constant cost, but I'm guessing that at some point (more content/users) it might be worth considering, both in terms of cost and performance.
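A minimal sketch of the suggested denormalization (the property names and helper are assumptions):

from google.appengine.ext import ndb

class User(ndb.Model):
    # keys of content this user has liked, denormalized onto the user
    liked_content = ndb.KeyProperty(kind='Content', repeated=True)

# "which of these users liked this content?" then becomes a lookup, not a query:
def likers_among(users, content_key):
    return [u for u in users if content_key in u.liked_content]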
I use Django 1.11 with PostgreSQL as the database. I know how to store and retrieve data from a db, but I can't find an example of the correct way to store and retrieve an entire discussion between two users.
This is my simple idea:
Two users connect to 127.0.0.1, and on this page there is a text-area form. Both users can write into the text-area and post their content by pressing a button. The page reloads and all messages are displayed.
What I want to know is whether the correct way to store and retrieve would be:
one db row => one message from one user
If two users exchange, say, 15 messages, it will store 15 rows. To make the discussion unambiguous, I can put another column into the db, something like a discussion "id", so all 15 rows would share the same id along with the user:
db row1 ---> (pk=1, message="hello there", user=Mike, discussion_id=45)
db row2 ---> (pk=2, message="hello world", user=Jessy, discussion_id=45)
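For concreteness, the schema described above maps to a Django model roughly like this (a sketch; the discussion column is named discussion_id here because Django reserves id for the primary key):

from django.db import models

class Discussion(models.Model):
    # one row per message; discussion_id groups messages into one discussion
    message = models.TextField()
    user = models.CharField(max_length=100)
    discussion_id = models.IntegerField(db_index=True)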
When the page reloads, Django will run:
discussion = Discussion.objects.filter(discussion_id=45)
to retrieve the discussion.
Only two users can discuss in private, so every pair of users has a discussion page like 127.0.0.1/one, 127.0.0.1/two, and so on.
If this is the correct way to store and retrieve from the db, my question is: how does that scale? Can I rely on this design to store and retrieve data efficiently, or will it become heavy in the near future? I worry that 1,000 users could quickly grow into 10,000 rows.
So the answer to your question depends on how you plan on using the data in the future and what you need to do with it. It is entirely possible to store an entire conversation between N users in a relational database such as Postgres as individual records per message. However, as with all programming questions, there are multiple paradigms that answer your question. I will explore the pros/cons of a couple of them here (with the knowledge that there are certainly more).
Paradigm 1: New record (row) per message
Pros:
Simpler querying for individual messages.
Analytical functions can easily be applied at the message level (e.g. summing the number of messages by certain users).
Record size is (relatively) small.
Cons:
Tables grow very long.
Searching becomes time-consuming as the table grows.
Post-processing needed on a collection (e.g. all records from a conversation).
More work is shifted to the server.
Paradigm 2: New record (row) per conversation
Pros:
Simpler querying for individual conversations.
Shorter tables.
Post-processing happens on a single object (i.e. the entire conversation stored as a JSON object).
Cons:
Larger row size that can grow substantially depending on the number and size of messages.
Harder to query individual messages or text within messages (you need more expensive functions such as LIKE '%...%' on blobs of text, which is slow).
Less conducive to performing any type of analytical function on messages.
Messages become an append exercise.
More work is shifted to the client/application.
Which is best? YMMV
Again, there are probably a half-dozen or so more ways you could store your application's messages, and all depend on your downstream needs. Additionally, I would implore you to look into projects such as Apache Kafka, which specialize in message publishing, as a potentially scalable, drop-in solution.
Three recommendations:
If you give PostgreSQL a decent amount of resources (say, an Amazon m3.large instance), then "a lot of rows" for a PostgreSQL database is around 100 million rows (depending). That's not a limit, it's just enough rows that you'll have to spend some time working on performance. So assuming that chats average 100 messages, then that would be one million conversations. So having one row per message is not a performance problem at the scale you're talking about.
Don't use a numerical PK as your main way of ordering messages (you might still have one; Django likes having one). Do have a timestamptz column, which is how you reconstruct the order of a conversation.
Have a unique index on (user, timestamptz) - since a user can't post two messages simultaneously - and another unique index on (conversation, timestamptz), which will allow you to reconstruct conversations quickly.
You should also have a table called "conversations" which summarizes conversation_id, list-of-users, because this will make it easy to answer the request "show me all my conversations".
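A sketch of those recommendations as Django 1.11 models (field names are assumptions):

from django.conf import settings
from django.db import models

class Conversation(models.Model):
    # the "conversations" summary table: which users participate
    users = models.ManyToManyField(settings.AUTH_USER_MODEL, related_name='conversations')

class Message(models.Model):
    conversation = models.ForeignKey(Conversation, on_delete=models.CASCADE)
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    posted = models.DateTimeField()  # timestamptz in PostgreSQL when USE_TZ = True
    text = models.TextField()

    class Meta:
        # the two unique indexes recommended above
        unique_together = [('user', 'posted'), ('conversation', 'posted')]
        ordering = ['posted']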
Does that answer your questions?
I have the following two entities:
class Photo(db.Model):
    name = db.StringProperty()
    registerdate = db.DateTimeProperty()
    iso = db.StringProperty()
    exposure = db.StringProperty()

class PhotoRatings(db.Model):
    ratings = db.IntegerProperty()
I need to do the following.
Get all the photos (Photo) with iso=800 sorted by ratings (PhotoRatings).
I cannot add ratings inside Photo because ratings change all the time, and I would have to write the entire Photo entity every single time. This would cost me more time and money, and the application would take a performance hit.
I read this,
https://developers.google.com/appengine/articles/modeling
but I could not get much information from it.
EDIT: I want to avoid fetching too many items and performing the match manually. I need a fast and efficient solution.
You're trying to do relational database queries with an explicitly non-relational Datastore.
As you might imagine, this presents problems. If you want the Datastore to sort the results for you, it has to be able to index on what you want to sort by. Indexes cannot span multiple entity types, so you can't have an index for Photos that is ordered by PhotoRatings.
Sorry.
Consider, however - which will happen more often? Querying for this ordering of photos, or someone rating a photo? Chances are, you'll have far more views than actions, so storing the rating as part of the Photo entity might not be as big a hit as you fear.
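A sketch of that approach, folding the rating into Photo (the fetch size is arbitrary):

class Photo(db.Model):
    name = db.StringProperty()
    registerdate = db.DateTimeProperty()
    iso = db.StringProperty()
    exposure = db.StringProperty()
    ratings = db.IntegerProperty(default=0)  # moved in from PhotoRatings

# one datastore-sorted query; needs a composite index on (iso, -ratings) in index.yaml
photos = Photo.all().filter('iso =', '800').order('-ratings').fetch(20)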
If you look at the billing docs you'll notice that an entity write is charged according to the number of changed properties.
So what you are trying to do will not reduce write costs, but it will definitely increase read costs, as you'll be reading double the number of entities.
I want to have a property on one of my database models in Google App Engine, and I am not sure which property type works best. I need it to be a tag cloud similar to the tags on SO. Would a text property be best, or should I use a string property with repeated=True?
The second seems best to me; then I can just divide the tags up with a comma as a delimiter. My goal is to be able to search through these tags and count the total number of each type of tag.
Does this seem like a reasonable solution?
This might be of interest, depending on exactly what you want to do.
GAE Sharding Counters
When developing an efficient application on Google App Engine, you need to pay attention to how often an entity is updated. While App Engine's datastore scales to support a huge number of entities, it is important to note that you can only expect to update any single entity or entity group about five times a second. That is an estimate and the actual update rate for an entity is dependent on several attributes of the entity, including how many properties it has, how large it is, and how many indexes need updating. While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding.
The question is, what if you had an entity that you wanted to update faster than five times a second? For example, you might count the number of votes in a poll, the number of comments, or even the number of visitors to your site.
So you would increment a tag's counter like:
increment(tag)
which also happens to create it if it does not exist. To read the count:
get_count(tag)
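A condensed sketch of such a sharded counter, adapted from the linked article (the shard model and shard count are assumptions):

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # raise this if a tag is updated more than ~5 times/sec

class TagCounterShard(db.Model):
    name = db.StringProperty(required=True)  # the tag
    count = db.IntegerProperty(default=0)

def increment(tag):
    # add 1 to a randomly picked shard, creating it if it does not exist
    shard_key = '%s-%d' % (tag, random.randint(0, NUM_SHARDS - 1))
    def txn():
        shard = TagCounterShard.get_by_key_name(shard_key)
        if shard is None:
            shard = TagCounterShard(key_name=shard_key, name=tag)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(tag):
    # sum across all shards for the tag
    return sum(s.count for s in TagCounterShard.all().filter('name =', tag))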
But yes, you can make a repeated property, which is essentially a list, store it, load it, and count the values in it. It depends on how many you are going to have: the datastore has a limit on entity size, and if you store them in a single model in a single list it will eventually grow too large.
So perhaps one model per tag, all of a single type? And when you run into the 5/sec problem, the above will come in handy.
A repeated string property is your best option.
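A sketch of what that looks like with ndb (model and property names assumed):

from google.appengine.ext import ndb

class Post(ndb.Model):
    tags = ndb.StringProperty(repeated=True)  # each value is indexed individually

# find posts carrying a given tag:
posts = Post.query(Post.tags == 'python').fetch(20)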
In my application, we need to develop a FRIENDS relationship table in datastore. And of course a quick solution I've thought of would be this:
class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='user')
    friend = db.ReferenceProperty(User, required=True, collection_name='friends')
But what would happen when the friend list grows to a huge number, say a few thousand or more? Will this be too inefficient?
Performance is always a priority for us. This is very much needed, as we will have a few more relationships following this same design.
Please give advice on the best approach to designing a FRIENDS relationship table using the datastore within the App Engine Python environment.
EDIT
Other than the FRIENDS relationship, a FOLLOWER relationship will be created as well. And I believe all these relationships will be queried very often, given the social-media-oriented nature of my application.
For example, if I follow some users, I will get updates as a news feed on what they are doing, etc. And the activity will increase over time. As for how many users, I can't answer yet as we haven't gone live, but I foresee millions of users as we go on.
Hopefully this helps for more specific advice. Or is there an alternative to this approach?
Your FRIENDS model (and presumably also your FOLLOWERS model) should scale well. The tricky part in your system is actually aggregating the content from all of a user's friends and followees.
Querying for a list of a user's friends is O(N), where N is the number of friends, using the table you've described in your post. However, each of those queries requires another O(N) operation to retrieve content that the friend has shared. This leads to O(N^2) work each time a user wants to see recent content. This particular query is bad for two reasons:
An O(N^2) operation isn't what you want to see in your core algorithm when designing a system for millions of users.
App Engine tends to limit these kinds of queries. Specifically, the IN keyword you'd need to use in order to grab the list of shared items won't work for more than 30 friends.
For this particular problem, I'd recommend creating another table that links each user to each piece of shared content. Something like this:
class SharedItems(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='shared_with')  # logged-in user
    sharer = db.ReferenceProperty(User, required=True, collection_name='shared_by')  # who shared it ('from' is a reserved word in Python)
    item = db.ReferenceProperty(Item, required=True)  # the item itself
    posted = db.DateTimeProperty()  # when it was shared
When it comes time to render the stream of updates, you need an O(N) query (where N is the number of items you want to display) to look up all the items shared with the user, ordered by date descending. Keep N small to keep this as fast as possible.
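A sketch of that stream query (it assumes the SharedItems model above and needs a composite index on user + posted):

items = (SharedItems.all()
         .filter('user =', current_user)
         .order('-posted')
         .fetch(20))  # keep N small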
Sharing an item requires creating O(N) SharedItems where N is the number of friends and followers the poster has. If this number is too large to handle in a single request, shard it out to a task queue or backend.
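A sketch of handing that fan-out to the task queue (the handler URL is an assumption):

from google.appengine.api import taskqueue

def share(item, poster):
    item.put()
    # the /tasks/fanout handler creates one SharedItems row per friend/follower,
    # writing them in batches with db.put(list_of_entities)
    taskqueue.add(url='/tasks/fanout',
                  params={'item': str(item.key()), 'poster': str(poster.key())})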
Property lists are a great way to get cheap/simple indexing in GAE.
But as you have correctly identified, there are a few limitations:
The index size of the entire entity is limited (I think currently 5000 entries). Each property-list value requires an index entry, so basically property list size < 4999.
Serialization of such a large property list is expensive!
Bringing back a 2MB entity is slow and will cost CPU.
If you're expecting a large property list, don't do it.
The alternative is to create a join table that models the relationship:
class Friends(db.Model):
    user = db.ReferenceProperty(User, required=True, collection_name='friendships')  # logged-in user
    friend = db.ReferenceProperty(User, required=True, collection_name='friend_of')  # the friend ('from' is a reserved word in Python)
Just an entity with two keys.
This allows for simple querying to find all friends for a user:
SELECT * FROM Friends WHERE user = :me
And to find all users who have me as a friend:
SELECT * FROM Friends WHERE friend = :me
Since each row holds keys, you can do a bulk get(keylist) to fetch the actual friends' details.
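A sketch of that lookup (it assumes the Friends model above; get_value_for_datastore reads the stored key without dereferencing the ReferenceProperty):

rows = Friends.all().filter('user =', me).fetch(1000)
friend_keys = [Friends.friend.get_value_for_datastore(r) for r in rows]
friends = db.get(friend_keys)  # one batch get for all friend entities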
I'm trying to get some practice with the GAE datastore, to get a feel for the query and billing mechanisms.
I've read the O'Reilly book about GAE and watched the Google videos about the datastore. My problem is that the best-practice methods usually concern more reads than writes to the datastore.
I built a super simple app:
there are two webpages: one to choose links, and one to view chosen links
every user can choose to add url links to his "links feed"
the user can choose as many links as he wants, whenever he wants.
on a different webpage, I want to show the user the most recent 10 links he chose.
every user has his own "links feed" webpage.
on every "link" I want to save and show some metadata - for example: the url link itself; when it was chosen; how many times it appeared on the feed already; etc.
In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more than it reads (a write when the user chooses another link; a read when the user opens the webpage to see his "links feed").
Question 1:
I can think of (at least) two options for how to handle the data for this app:
Option A:
- maintain an entity per user with the user details, registration, etc.
- maintain another entity per user that holds his 10 most recently chosen links, which will be rendered to the user's webpage when he asks for it
Option B:
- maintain an entity per url link - which means the urls of all users will be stored in one big table
- maintain an entity per user's details (same as in Option A), but add a reference from the user to his urls in the big table of urls
Which will be the better method?
Question 2:
If I want to count the total number of urls chosen to date, or the daily number of urls a user chose, or any other count - should I compute it with queries, or should I keep counters in the entities I described above? (I want to reduce the number of datastore writes as much as I can.)
EDIT (to answer #Elad's comment):
Assume I want to save only the last 10 urls per user; the rest I want to get rid of (so as not to overpopulate my DB with unnecessary data).
EDIT 2: after adding the code
So I tried the following code (trying Elad's method first):
Here's my class:
class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    currentList = db.StringListProperty()  # holds the last 20-30 urls
Then I serialized the url & metadata into JSON strings, which the user POSTs from the first page.
Here's how the POST is handled:
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())

        # updating the new item that the user adds
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)

        current_user.currentList.append(dataJson)
        sizePlaylist = len(current_user.currentList)
        self.response.out.write('<p>size of currentplaylist is: %s</p>' % sizePlaylist)

        # whenever the list gets to 30 I cut it back to 20 long
        if sizePlaylist > 30:
            del current_user.currentList[0:10]  # drop the 10 oldest entries

        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')
where Updater is my helper for pushing updates to the feed webpage via the Channel API.
Now, it all works: I can see each user has a ListProperty with 20-30 links (when it hits 30, I cut it down to 20), but the prices are quite high...
Each POST like the one here takes ~200ms, 121 cpu_ms, cpm_usd = 0.003588. This is very expensive considering all I do is append a string to a list...
I think the problem might be that the entity gets big with the big ListProperty?
First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records into a single model table exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.
First Question
I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).
This is efficient for writing since you only update a single record - there are no updates to all the link records whenever the user adds links. Keep in mind that to keep a rolling list (FIFO) with URLs stored as separate entities, every new URL means two write actions: an insert of the new URL, and a delete to remove the oldest one.
It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variance of URLs chosen by your users, so it's likely that the mean URL will have a single user - no real gain in RDBMS-style normalization of the data.
Another optimization idea: if your users usually add several links within a short period, you can try to write them in bulk rather than separately. Use memcache to store newly added user URLs, and the Task Queue to periodically write that transient data to the persistent datastore. I'm not sure what the resource cost of using tasks is, though - you'll have to check.
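A sketch of that buffering idea (the names and the flush URL are assumptions; the memcache read-modify-write is simplified and not concurrency-safe):

from google.appengine.api import memcache, taskqueue

def buffer_link(user_id, link_json):
    key = 'pending-links-%s' % user_id
    pending = memcache.get(key) or []
    pending.append(link_json)
    memcache.set(key, pending)
    # flush the buffer to the datastore a minute later
    taskqueue.add(url='/tasks/flush-links',
                  params={'user_id': user_id}, countdown=60)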
Here's a good article to read on the subject.
Second Question
Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes, and blog posts on the subject - just google "appengine counters". Here too, using memcache should be a good option to reduce the total number of datastore writes.
Answer 1
Store links as separate entities. Also store an entity per user with a ListProperty holding keys to the 20 most recent links. As the user chooses more links, you just update the ListProperty of keys. ListProperty maintains order, so you don't need to worry about the chronological order of chosen links as long as you follow a FIFO insertion order.
When you want to show the user's chosen links (page 2), you can do one get(keys) call to fetch all the user's links at once.
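A sketch of this answer (model and property names are assumptions):

from google.appengine.ext import db

class Link(db.Model):
    url = db.StringProperty()
    chosen = db.DateTimeProperty(auto_now_add=True)

class UserLinks(db.Model):
    recent = db.ListProperty(db.Key)  # keys of the 20 most recent Links, oldest first

def add_link(user_links, link):
    link.put()
    user_links.recent.append(link.key())
    user_links.recent = user_links.recent[-20:]  # FIFO: keep the newest 20
    user_links.put()

def recent_links(user_links):
    return db.get(user_links.recent)  # one bulk get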
Answer 2
Definitely keep counters. As the number of entities grows, the cost of counting records keeps increasing, but with counters the performance remains the same.