Google Apps Engine Datastore Search

Google Apps Engine Datastore Search - python

I was wondering if there was any way to search the datastore for a entry. I have a bunch of entries for songs(title, artist,rating) but im not sure how to really search through them for both song title and artist. We take in a search term and are looking for all entries that "match." But we are lost :( any help is much appreciated!
We are using python
edit1: current code is useless, its an exact search but might help you see the issue
query = song.gql("SELECT * FROM song WHERE title = searchTerm OR artist = searchTerm")

The song data you work with sounds as a rather static data set (primarily inserts, no or few updates). In that case there is GAE technique called Relation Index Entity (RIE) which is an efficient way to implement keyword-based search.
But some preparation work required which is briefly:
build special RIE entity where you place all searchable keywords
from each song (one-to-one relationship).
RIE stores them in StringListProperty which supports searches like this:
keywords = 'SearchTerm'
(returns True if any of the values in the list keywords matches 'SearchTerm'`)
AND condition works immediately by adding multipe filters as above
OR condition needs more work by implementing in-memory merge from AND-only queries
You can find details on solution workflow and code samples in my blog Relation Index Entities with Python for Google Datastore.

http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine

Related

Sorting and Filtering multiple queries of the same collection in Firestore

I'm new on cloud firestore and I'm trying to make queries as efficient as possible but I kind of desperate with an specific one. I would greatly appreciate your help.
This is the situation:
I want to show a project list which that I'm getting from an user field and 2 queries in project entity. The user field let’s called "favorite projects" and it has the projects id that reference those projects on their entity. The other query retrieve me the public projects (==) and the last the private projects where the user is a contributor (array_contains).
I want to sort and filtering the result of the two queries. Is there an option to merge both queries and use sort and filter as a we do with a collection reference?
Thank you for your time, have a nice day!

Based on this and this documentation, I do not believe there is an out of the box solution for joining the results of queries such as the ones described.
You'll need to achieve that within the your code.
For example you can run the first query and store all the data of the document in a map or array. Then use the reference of the other document within the document_reference to make the second query and the third.
Once you have all of them you can do as you please using Python. But getting them ready using a single query or auto-joining the queries seems to not be supported yet.

Google app engine, query multiple entities

I have following 2 entities.
class Photo(db.Model):
name=db.StringProperty()
registerdate=db.DateTimeProperty()
iso=db.StringProperty()
exposure=db.StringProperty()
class PhotoRatings(db.Model):
ratings=db.IntegerProperty()
I need to do the following.
Get all the photos (Photo) with iso=800 sorted by ratings (PhotoRatings).
I cannot add add ratings inside Photo because ratings change all the time and I would have to write entire Photo entity every single time. This will cost me more time and money and the application will take performance hit.
I read this,
https://developers.google.com/appengine/articles/modeling
But could not get much information from it.
EDIT: I want to avoid fetching too many items and perform the match manually. I need fast and efficient solution.

You're trying to do relational database queries with an explicitly non-relational Datastore.
As you might imagine, this presents problems. If you want to the Datastore to sort the results for you, it has to be able to index on what you're wanting to sort. Indices cannot span multiple entity types, so you can't have an index for Photos that is ordered by PhotoRatings.
Sorry.
Consider, however - which will happen more often? Querying for this ordering of photos, or someone rating a photo? Chances are, you'll have far more views than actions, so storing the rating as part of the Photo entity might not be as big a hit as you fear.

If you look at the billing docs you'll notice that an Entity write is charged per changed number of properties.
So what you are trying to do will not reduce write cost, but will definitely increase read cost as you'll be reading double the number of entities.

How to use High Replication Datastore

Okay, I have watched the video and read the articles in the App Engine documentation (including Using the High Replication Datastore). However I am still completely confused on the practical usage of it. I understand the benefits (from the video) and they sound great. But what I am lacking is a few practical examples. There are plenty of master/slave examples on the web, but very little illustrating (with proper documentation) the high replication datastore. The guestbook code example used in the Using the High Replication Datastore article illustrates the ancestor key by adding a new functionality that the previous guestbook code example does not have (seems you can change guestbook). This just adds to the confusion.
I often use djangoforms on GAE and I was wondering if someone can help me translate all these queries into high replication datastore compatible queries (let's forget for a moment the discussion that not all queries necessarily need to be high replication datastore compatible queries and focus on the example itself).
UPDATE: with high replication datastore compatible queries I refer to queries that always return the latest data and not potential stale data. Using entity groups seems to be the way to go here but as mentioned before, I don't have many practical code examples of how to do this, so that is what I am looking for!
So the queries in this article are:
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name') // datastore request
validating the form happens like:
data = ItemForm(data=self.request.POST)
if data.is_valid():
# Save the data, and redirect to the view page
entity = data.save(commit=False)
entity.added_by = users.get_current_user()
entity.put() // datastore request
and getting the latest entry from the datastore for populating a form happens like:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id)) // datastore request
data = ItemForm(data=self.request.POST, instance=item)
So what do I/we need to do to make all these datastore requests compatible with the high replication datastore?
One last thing that is also not clear to me. Using ancestor keys, does this have any impact on the model in datastore. For example, in the guestbook code example they use:
def guestbook_key(guestbook_name=None):
return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')
However 'Guestbook' does not exist in the model, so how can you use 'db.Key.from_path' on this and why would this work? Does this change how data is stored in the datastore which I need to keep into account when retrieving the data (e.g. does it add another field I should exclude from showing when using djangoforms)?
Like I said before, this is confusing me a lot and your help is greatly appreciated!

I'm not sure why you think you need to change your queries at all. The documentation that you link to clearly states:
The back end changes, but the datastore API does not change at all. You'll use the same programming interfaces no matter which datastore you're using.
The point of that page is just to say that queries may be out of sync if you don't use entity groups. Your final code snippet is just an example of that - the string 'Guestbook' is exactly an ancestor key. I don't understand why you think it needs to exist in the model. Once again, this is unchanged from the non-HR datastore - it has always been the case that keys are built up from paths, which can consist of arbitrary strings. You probably need to reread the documentation on entity groups and keys.

The changes to use the HRD are not in how queries are made, but in what guarantees are made about what data you get back. The example you give:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
will work in the HRD as well. The catch (basically) is that this kind of query (using either this syntax, or the Item.all() form) can return objects slightly out-of-date. This is probably not a big deal with the guestbook.
Note that if you're getting an object by key directly, it will never be out-of-date. It's only for queries that you can see this issue. You can avoid this problem with queries by placing all the entities that need to be consistent in a single entity group. Note that this limits the rate at which you can write to the entity group.
In answer to your follow-up question, "Guestbook" is the name of the entity.

Choose between many-to-many techniques in App Engine DB

There is going to be "articles" and "tags" in my App Engine application.
And there are two techniques to implement that (thanks to Nick Johnson's article):
# one entity just refers others
class Article(db.Model):
tags = db.ListProperty(Tag)
# via separate "join" table
class ArticlesAndTags(db.Model):
article = db.ReferenceProperty(Article)
tag = db.ReferenceProperty(Tag)
Which one should I prefer according to the following tasks?
Create tag cloud (frequently),
Select articles by a tag (rather rarely)

Because of the lack of a 'reduce' feature in appengine's map reduce (nor an SQL group by like query), tag clouds are tricky to implement efficiently because you need to count all tags you have manually. Which ever implementation you go with, what I would suggest for the tag cloud is to have a separate model TagCounter that keeps track of how many tags you have. Otherwise the tag query could get expensive if you have a lot of them.
class TagCounter:
tag = db.ReferenceProperty(Tag)
counter = db.IntegerProperty(default=0)
Every time you choose to update your tags on an article, make sure you increment and decrement from this table accordingly.
As for selecting articles by a tag, the first implementation is sufficient (the second is overly complex imo).
class Article(db.Model):
tags = db.ListProperty(Tag)
#staticmethod
def select_by_tag(tag):
return Article.all().filter("tags", tag).run()

I have created a huge tag cloud * on GAEcupboard opting for the first solution:
class Post(db.Model):
title = db.StringProperty(required = True)
tags = db.ListProperty(str, required = True)
The tag class has a counter property that is updated each time a new post is created/updated/deleted.
class Tag(db.Model):
name = db.StringProperty(required = True)
counter = db.IntegerProperty(required = True)
last_modified = db.DateTimeProperty(required = True, auto_now = True)
Having the tags organized in a ListProperty it's quite easy to offer a drill-down feature that allows user to compose different tags to search for the desired articles:
Example:
http://www.gaecupboard.com/tag/python/web-frameworks
The search is done using:
posts = Post.all()
posts.filter('tags', 'python').filter('tags', 'web-frameworks')
posts.fetch()
that does not need any custom index at all.
ok, it's too huge, I know :)

Creating a tag-cloud in app-engine is really difficult because the datastore doesn't support the GROUP BY construct normally used to express that; Nor does it supply a way to order by the length of a list property.
One of the key insights is that you have to show a tag cloud frequently, but you don't have to create one except when there are new articles, or articles get retagged, since you'll get the same tag-clout in any case; In fact, the tag cloud doesn't change very much for each new article, maybe a tag in the cloud becomes a little larger or a little smaller, but not by much, and not in a way that would affect its usefullness.
This suggests that tag-clouds should be created periodically, cached, and displayed much like static content. You should think about doing that in the Task Queue API.
The other query, listing articles by tag, would be utterly unsupported by the first techinque you've shown; Inverting it, having a Tag model with an articles ListProperty does support the query, but will suffer from update contention when popular tags have to get added to it at a high rate. The other technique, using an association model, suffers from neither of these concerns, but makes it harder to make the article listing queries convenient.
The way I would deal with this is to start with the ArticlesAndTags model, but add some additional data to the model to have a useful ordering; an article date, article title, whatever makes sense for the particular kind of site you're making. You'll also need a monotonic sequence (say, a timestamp) on this so you know when the tag applied.
The tag cloud query would be supported using a Tag entity that has Only a numeric article count, and also a reference to the same timestamp used in the ArticlesAndTags Model.
A task queue can then query for the 1000 oldest ArticlesAndTags that are newer than oldest Tag, sum the frequencies of each and add it to the counts in the Tags. Tag removals are probably rare enough that they can update the Tag model immediately without too much contention, but if that assumption turns out to be wrong, then delete events should be added to the ArticlesAndTags as well.

You don't seem to have very specific/complex requirements so my opinion is it's likely neither method would show significant benefits, or rather, the pros/cons will depend completely on what you're used to, how you want to structure your code, and how you implement caching and counting mechanisms.
The things that come to mind for me are:
-The ListProperty method leaves the data models looking more natural.
-The AtriclesAndTags method will mean you'd have to query for the relationships and then the Articles (ugh..), instead of doing Article.all().filter('tags =', tag).

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?

At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.

The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must have and may have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more that 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom built text search engine and part of it's query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on a DEC Alphas with 64MB of main memory. (This was in the late 80's/early 90's.)
I'm guessing that filtering on something equivalent to date_added could be done used in combination the date of the last time you ran your queries, or maybe the highest id at last query run time. If you need to re-run the queries against a modified record you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying accomplishing.

If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.