I have the following two entities.
class Photo(db.Model):
    name = db.StringProperty()
    registerdate = db.DateTimeProperty()
    iso = db.StringProperty()
    exposure = db.StringProperty()

class PhotoRatings(db.Model):
    ratings = db.IntegerProperty()
I need to do the following.
Get all the photos (Photo) with iso=800 sorted by ratings (PhotoRatings).
I cannot add ratings inside Photo because the ratings change all the time, and I would have to rewrite the entire Photo entity every single time. That would cost me more time and money, and the application would take a performance hit.
I read this,
https://developers.google.com/appengine/articles/modeling
But I could not get much information from it.
EDIT: I want to avoid fetching too many items and performing the match manually. I need a fast and efficient solution.
You're trying to do relational database queries against an explicitly non-relational Datastore.
As you might imagine, this presents problems. If you want the Datastore to sort the results for you, it has to be able to index on what you want to sort by. Indexes cannot span multiple entity kinds, so you can't have an index for Photo that is ordered by PhotoRatings.
Sorry.
Consider, however - which will happen more often? Querying for this ordering of photos, or someone rating a photo? Chances are, you'll have far more views than actions, so storing the rating as part of the Photo entity might not be as big a hit as you fear.
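For illustration, a minimal sketch of that denormalized approach, reusing the db models from the question (the rating property and the fetch limit are my assumptions):

from google.appengine.ext import db

class Photo(db.Model):
    name = db.StringProperty()
    registerdate = db.DateTimeProperty()
    iso = db.StringProperty()
    exposure = db.StringProperty()
    rating = db.IntegerProperty(default=0)  # denormalized onto Photo

# One query now answers the original requirement directly.
photos = Photo.all().filter('iso =', '800').order('-rating').fetch(20)

Note that combining an equality filter with a sort order on a different property requires a composite index declared in index.yaml.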
If you look at the billing docs, you'll notice that an entity write is charged per changed property.
So what you are trying to do will not reduce your write costs, but it will definitely increase your read costs, since you'll be reading twice as many entities.
Related
I use Django 1.11 with PostgreSQL as the database. I know how to store and retrieve data from a db, but I can't find an example of the correct way to store and retrieve an entire discussion between two users.
This is my simple idea:
Two users connect to 127.0.0.1, and on this page there is a text-area form. Both users can write into the text-area and post their content by pressing a button. The page then reloads, and all messages are displayed.
What I want to know is whether the correct way to store and retrieve would be:
one db row => one message from one user
If two users exchange, say, 15 messages, that stores 15 rows. To tie them into a single discussion, I can add another column to the db, something like a discussion "id", so all 15 rows would share the same id along with the user:
db row1 ---> "pk=1, message=hello there, user=Mike, id=45")
db row2 ---> "pk=2, message=hello world, user=Jessy, id=45")
When the page reloads, Django will run something like:
discussion = Discussion.objects.filter(id=45)
to retrieve the discussion.
Only two users can discuss in private, so each pair of users has its own discussion page, like 127.0.0.1/one, 127.0.0.1/two, and so on.
If this is the correct way to store and retrieve from the db, my question is: how does that scale? Can I rely on this design to store and retrieve data from the database efficiently, or will it become heavy in the near future? I worry that 1000 users could quickly grow into 10000 rows.
The answer to your question depends on how you plan on using the data in the future and what you need to do with it. It is entirely possible to store an entire conversation between N users in a relational database such as Postgres as individual records per message. However, as with all programming questions, there are multiple paradigms that answer your question. I will explore the pros and cons of a couple of them here (with the knowledge that there are certainly more).
Paradigm 1: New record (row) per message
Pros:
Simpler querying for individual messages.
Analytical functions can easily be applied at a message level (e.g., summing the number of messages by certain users)
Record size is (relatively) small
Cons:
Very long table sizes
Searching becomes time-consuming as the table grows.
Post-processing needed on a collection (e.g., all records from a conversation)
More work is shifted to the server
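As a concrete sketch, paradigm 1 might look like this in Django (model and field names are illustrative, not from the question):

from django.db import models

class Message(models.Model):
    # One row per message, tied together by a shared discussion id.
    discussion_id = models.IntegerField(db_index=True)
    user = models.CharField(max_length=100)
    text = models.TextField()
    created = models.DateTimeField(auto_now_add=True)

# Reconstruct discussion 45 in order:
# Message.objects.filter(discussion_id=45).order_by('created')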
Paradigm 2: New record (row) per conversation
Pros:
Simpler querying for individual conversations
Shorter table sizes
Post-processing happens on a single object (i.e., the entire conversation stored as a JSON object)
Cons:
Larger row size that can grow substantially depending on the number and size of messages.
Harder to query individual messages or text within messages (you need more expensive operations such as LIKE '%...%' on blobs of text, which is slow)
Less conducive to performing any type of analytical function on messages.
Messages become an append exercise
More work is shifted to the client/application
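Paradigm 2 might look like this with the Postgres JSONField available in Django 1.11 (again, names are illustrative):

from django.db import models
from django.contrib.postgres.fields import JSONField

class Conversation(models.Model):
    # One row per conversation; all messages live in a single JSON blob.
    users = models.CharField(max_length=200)
    messages = JSONField(default=list)

# Appending rewrites the whole blob (messages become an append exercise):
# conv = Conversation.objects.get(pk=45)
# conv.messages.append({'user': 'Jessy', 'text': 'hello world'})
# conv.save()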
Which is best? YMMV
Again, there are probably a half-dozen or so more ways you could store your application's messages, and all depend on your downstream needs. Additionally, I would implore you to look into projects such as Apache Kafka, which specialize in message publishing, as a potentially scalable, drop-in solution.
Three recommendations:
If you give PostgreSQL a decent amount of resources (say, an Amazon m3.large instance), then "a lot of rows" for a PostgreSQL database is around 100 million rows (depending). That's not a limit, it's just enough rows that you'll have to spend some time working on performance. So assuming that chats average 100 messages, then that would be one million conversations. So having one row per message is not a performance problem at the scale you're talking about.
Don't use a numerical PK as your main way of ordering messages in a conversation (you might still have one; Django likes having one). Do have a timestamptz column, which is how you reconstruct the order of a conversation.
Have a unique index on (user, timestamptz), since a user can't post two messages simultaneously, and another unique index on (conversation, timestamptz); this will allow you to reconstruct conversations quickly.
You should also have a table called "conversations" that summarizes (conversation_id, list-of-users), because this will make it easy to answer the request "show me all my conversations".
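Put together, those recommendations might look roughly like this in Django (a sketch; the model and field names are mine, and DateTimeField maps to timestamptz when USE_TZ is enabled):

from django.db import models

class Conversation(models.Model):
    # Summarizes the participants: answers "show me all my conversations".
    users = models.ManyToManyField('auth.User')

class Message(models.Model):
    conversation = models.ForeignKey(Conversation, on_delete=models.CASCADE)
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    text = models.TextField()
    posted_at = models.DateTimeField()  # timestamptz in Postgres

    class Meta:
        ordering = ['posted_at']
        unique_together = (
            ('user', 'posted_at'),          # a user can't post twice at once
            ('conversation', 'posted_at'),  # fast ordered reconstruction
        )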
Does that answer your questions?
I have a Python server running on Google App Engine that implements a social network. I am trying to find the best way (best = fast and cheap) to implement interactions on items.
Just like any other social network, I have stream items ("Content"), and users can "like" these items.
As for queries, I want to be able to:
Get the list of users who liked the content
Get a total count of the likers.
Get an intersection of the likers with any other list of users.
My current implementation includes:
1. An IntegerProperty on the content item which holds the total likers count
2. An InteractionModel - an ndb model with a key id equal to the content id (for a fast fetch) and a JsonProperty that holds the likers' usernames (both sketched below)
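Here is roughly what that setup looks like in ndb (kind and property names are guesses from the description):

from google.appengine.ext import ndb

class Content(ndb.Model):
    likers_count = ndb.IntegerProperty(default=0, indexed=False)

class InteractionModel(ndb.Model):
    # Keyed with the same id as the Content entity for a fast direct get.
    likers = ndb.JsonProperty()  # list of usernames

def like(content_id, username):
    # The four operations described: 2 reads + 2 writes per like.
    content = Content.get_by_id(content_id)
    interaction = InteractionModel.get_by_id(content_id)
    likers = interaction.likers or []
    if username not in likers:
        likers.append(username)
        interaction.likers = likers
        content.likers_count += 1
        interaction.put()
        content.put()

# Intersection with any other user list: set(interaction.likers) & set(others)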
Each time a user likes a piece of content, I need to update the counter and the list of users. This requires me to run, and pay for, 4 datastore operations (2 reads, 2 writes).
On top of that, items with lots of likers result in an InteractionModel with a huge JSON value that takes time to serialize and deserialize when reading/writing (still faster than a repeated property, though).
None of the updated fields is indexed (built-in index) or included in a composite index (index.yaml).
I am looking for a more efficient and cost-effective way to implement the same requirements.
I'm guessing you have two entities in your model: User and Content. Your queries seem to aggregate over multiple Content objects.
What about keeping these aggregated values on the User object? That way you don't need to run any queries, but rather only look up the data stored in the User object.
At some point, though, you might consider not using the datastore and look at SQL storage instead. It has a higher constant cost, but I'm guessing that at some point (more content/users) it might be worth considering, both in terms of cost and performance.
I'm trying to figure out the best approach to keeping track of the number of entities of a certain NDB kind I have in my cloud datastore.
One approach is, when I want to know how many I have, to get the .count() of a query that I know will return all of them, but that costs a ton of datastore small operations (the cost looks proportional to the number of entities of that kind). So that's not ideal.
Another option would be having a counter in the datastore that gets updated every time I create or delete an entity, but that's also not ideal because it would add an extra read and write operation to every entity I create or destroy.
As of now, it looks like the second option is my best choice, so my question is--do you agree? Are there any other options that would be more cost-effective?
Thanks a lot.
PS: Working in Python if that makes a difference.
Second option is the way to go.
Other considerations:
If you have many writes per second, you may wish to consider using a sharded counter
To reduce datastore writes, you could use a cron job to update the datastore at timed intervals (i.e., count how many entities have been created since the last run)
Also consider using memcache.incr() in conjunction with a cron job to persist the data. The downside of this is that your memcache key could drop, so it's only really an option if the count doesn't have to be accurate. A sketch of this approach follows.
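For instance, a minimal sketch of the memcache + cron combination (the counter kind, key name, and function names are my assumptions; the count drifts if memcache evicts the key):

from google.appengine.api import memcache
from google.appengine.ext import ndb

COUNTER_KEY = 'entity_count_delta'  # hypothetical memcache key

class KindCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

def record_create():
    # Called on every entity creation; cheap, no datastore write.
    memcache.incr(COUNTER_KEY, initial_value=0)

def flush_counter():
    # Run from a cron job: fold the accumulated delta into the datastore.
    delta = int(memcache.get(COUNTER_KEY) or 0)
    if delta:
        counter = KindCounter.get_or_insert('singleton')
        counter.count += delta
        counter.put()
        memcache.decr(COUNTER_KEY, delta=delta)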
There's actually a better/cheaper/faster way to get the info you are looking for, but it might not work if you need to know the EXACT number of entities at any given moment, since it's only updated a couple of times a day (i.e., you can access it anytime, but it may be a few hours out of date).
The "Datastore Statistics" page in GAE dashboard displays some detailed data about kinds/entities including "count" numbers and there's a way to access it programmatically. See more info here: https://cloud.google.com/appengine/docs/python/datastore/stats
In my app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?
while Person.all().count() < N:
    pass

for p in Person.all():
    # do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people: store a link to the first one, have him link to the second one, and so on. It seems the performance would be poor, however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that instead of one entity group you use several, perhaps depending on some aspect of a group of Person entities (e.g., the mailing list they are on, or the type of email being sent). Does each Person contain only a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# 'kind' is the db model you're using (should be 'Person' in this case) and
# 'id_or_name' should be the key id or name for the parent.
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
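Once every Person is created under that parent, an ancestor query returns strongly consistent results. A sketch using the same db API and the ancestor_key helper above:

# Restricting the query to one entity group makes it strongly consistent.
parent = ancestor_key('Kind', id_or_name)
for p in Person.all().ancestor(parent):
    send_email(p)  # hypothetical step-3 work; no eventual-consistency gap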
I want to have a property on one of my database models in Google App Engine, and I am not sure which kind works best. I need it to be a tag cloud similar to the tags on SO. Would a text property be best, or should I use a string property and make it repeated=True?
The second seems best to me; then I can just divide the tags up with a comma as a delimiter. My goal is to be able to search through these tags and count the total number of each type of tag.
Does this seem like a reasonable solution?
This might be of interest, depending on exactly what you want to do.
GAE Sharding Counters
When developing an efficient application on Google App Engine, you need to pay attention to how often an entity is updated. While App Engine's datastore scales to support a huge number of entities, it is important to note that you can only expect to update any single entity or entity group about five times a second. That is an estimate and the actual update rate for an entity is dependent on several attributes of the entity, including how many properties it has, how large it is, and how many indexes need updating. While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding.
The question is, what if you had an entity that you wanted to update faster than five times a second? For example, you might count the number of votes in a poll, the number of comments, or even the number of visitors to your site.
So you would create a tag counter like:
increment(tag)
which also happens to create it if it does not exist.
To get the count:
get_count(tag)
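A condensed sketch of such a sharded counter, modeled on the linked article (the shard kind, shard count, and exact function bodies are illustrative):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class TagCountShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(tag):
    # Picking a random shard spreads parallel updates across entities,
    # sidestepping the ~5 updates/sec limit on a single entity group.
    shard_id = '%s-%d' % (tag, random.randint(0, NUM_SHARDS - 1))
    shard = TagCountShard.get_by_id(shard_id)
    if shard is None:
        shard = TagCountShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(tag):
    # Sum across all shards to get the total for a tag.
    total = 0
    for i in range(NUM_SHARDS):
        shard = TagCountShard.get_by_id('%s-%d' % (tag, i))
        if shard is not None:
            total += shard.count
    return total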
But yes, you can make a repeated property, which is essentially a list, and store it, load it, and count the values in it. Whether that works depends on how many tags you will have: the datastore has a limit on entity size, and if you store everything in a single entity in a single list, it will eventually grow too large.
So perhaps one entity per tag, all of a single kind? And when you run into the 5-updates-per-second problem, the sharded counter above will come in handy.
A repeated string property is your best option.
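If you go the repeated-property route, a minimal ndb sketch (the kind name and example tag are placeholders):

from google.appengine.ext import ndb

class Photo(ndb.Model):
    # Each value of a repeated property is indexed individually,
    # so no comma-delimited parsing is needed.
    tags = ndb.StringProperty(repeated=True)

# Find everything tagged 'sunset':
# Photo.query(Photo.tags == 'sunset').fetch()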