I use python, ndb and the datastore. My model ("Event") has a property:
created = ndb.DateTimeProperty(auto_now_add=True).
Events get saved every now and then, sometimes several within the same second.
I want to "poll for new events", without getting the same Event twice, and get an empty result if there aren't any new Events. However, polling again might give me new events.
I have seen Cursors, but I don't know whether they can be used to poll for new Events after I've reached the end of the first query. The next_cursor is None when I've reached the (current) end of the data.
Keeping the last received "created" DateTime property and using that to get the next batch works, but that only gives me a resolution of seconds, so the ordering might get mixed up.
Must I create my own transactional, incrementing counter in Event for this?
Yes, using cursors is a valid option. Even though this link is from the Java documentation, it's valid for Python as well. The second paragraph is what you are looking for:
An interesting application of cursors is to monitor entities for unseen changes. If the app sets a timestamp property with the current date and time every time an entity changes, the app can use a query sorted by the timestamp property, ascending, with a Datastore cursor to check when entities are moved to the end of the result list. If an entity's timestamp is updated, the query with the cursor returns the updated entity. If no entities were updated since the last time the query was performed, no results are returned, and the cursor does not move.
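For example, a minimal sketch of that pattern with ndb's fetch_page (the page size and the poll_events wrapper are just illustrative, not part of the question's code):

from google.appengine.ext import ndb

class Event(ndb.Model):
    created = ndb.DateTimeProperty(auto_now_add=True)

def poll_events(cursor=None):
    # Order by creation time, ascending, so newly written entities are
    # appended to the end of the result list.
    query = Event.query().order(Event.created)
    events, next_cursor, more = query.fetch_page(20, start_cursor=cursor)
    # When nothing new has been written, the result is empty; keep reusing
    # the cursor you already have (fall back to it if next_cursor is None)
    # and the next poll will pick up entities created after this point.
    return events, next_cursor or cursor

Between polls you can persist the cursor with cursor.urlsafe() and restore it with Cursor(urlsafe=...) from google.appengine.datastore.datastore_query.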
EDIT: Prospective search was shut down on December 1, 2015.
Rather than polling, an alternative approach would be to use prospective search:
https://cloud.google.com/appengine/docs/python/prospectivesearch/
From the docs:
"Prospective search is a querying service that allows your application
to match search queries against real-time data streams. For every
document presented, prospective search returns the ID of every
registered query that matches the document."
I can't find any way to set up a TTL on a document within AWS Elasticsearch using the python elasticsearch library.
I looked at the code of the library itself and there is no argument for it, and I have yet to see any answers on Google.
There is none. You can use an index management policy if you like, which operates at the index level, not at the document level. You have a bit of wriggle room, though, in that you can create a pattern data-* and have more than one index, e.g. data-expiring-2020-..., data-keep-me.
You can apply a template to the pattern data-expiring-* and set a transition to delete an index after, let's say, 20 days. If you roll over to a new index each day, the oldest index will be deleted once it is over 20 days old.
This method is much preferable, because deleting individual documents can consume large amounts of your cluster's capacity, whereas dropping an entire index (and its shards) is cheap. Other NoSQL databases such as DynamoDB operate in a similar fashion: often what you can do is add another field to your docs, such as deletionDate, and add that to your query to filter out docs which are marked for deletion but are still alive in your index because a deletion job has not yet cleaned them up. That is how the TTL in DynamoDB behaves as well; data is not deleted the moment the TTL expires, but rather in batches to improve performance.
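If you go the time-sliced route by hand rather than through an index management policy, a rough sketch with the python elasticsearch client might look like this (the index naming, retention window, and connection details are assumptions):

from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch()  # adjust hosts/auth for your AWS Elasticsearch domain

def index_expiring_doc(doc):
    # Write into a daily index matching the data-expiring-* pattern,
    # e.g. data-expiring-2020-06-01.
    index = "data-expiring-%s" % datetime.utcnow().strftime("%Y-%m-%d")
    es.index(index=index, body=doc)

def drop_old_indices(days_to_keep=20):
    # Instead of deleting documents one by one, drop whole daily indices
    # once they fall outside the retention window.
    cutoff = datetime.utcnow() - timedelta(days=days_to_keep)
    for name in es.indices.get(index="data-expiring-*"):
        day = datetime.strptime(name.replace("data-expiring-", ""), "%Y-%m-%d")
        if day < cutoff:
            es.indices.delete(index=name)

An index management policy performs the same transition automatically; the sketch only shows the shape of the approach.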
I'm looking for a way to constantly check my database (MySQL) for new entries. Once a new entry is committed I want to output it in a webpage using Flask.
Since the process takes time to finish, I would like to give the users the impression that it took only a few seconds to retrieve the data.
For now I'm waiting for the whole process to finish before giving the user the full result. But I would prefer to update the result web page every time a new entry is added to the DB. So, for example, the first entry is added to the DB and the user can immediately see it on the web page; then a second entry is added, and the user can now see both the first and the second entries, and so on. I don't know whether this has to come from Flask or from some other mechanism.
Any idea?
You can set MySQL to log all commits to the General Query Log and monitor all changes (for example via Watchdog or PyNotify). Once the file changes, you can parse the new log entries and get your signal. This way you avoid polling for changes.
The better way would, of course, be to send the signal while storing the data in the database.
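A rough sketch of the log-watching idea with the watchdog library (the log path is an assumption, so check general_log_file on your server, and replace the print with however you push the event to Flask, e.g. a queue or websocket):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

LOG_PATH = "/var/log/mysql/general.log"  # assumed location of the General Query Log

class GeneralLogHandler(FileSystemEventHandler):
    def __init__(self):
        self.position = 0

    def on_modified(self, event):
        if event.src_path != LOG_PATH:
            return
        # Read only the lines appended since the last notification.
        with open(LOG_PATH) as f:
            f.seek(self.position)
            new_lines = f.readlines()
            self.position = f.tell()
        for line in new_lines:
            if "INSERT" in line or "COMMIT" in line:
                print("new entry detected:", line.strip())  # signal the web layer here

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(GeneralLogHandler(), path="/var/log/mysql", recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()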
I'm sure a lot of services online today must perform a task similar to what I'm doing. A user has friends, and I want to get all status updates of all the user's friends since each friend's last status update date.
That was a mouthful, but here's what I have:
A user has, say, 10 friends. What I want to do is get new status updates for all of his friends. So, I prepare a dictionary with each friend's last status date. Something like:
friends = []
for friend in user.friends:
    friends.append({
        'userId': friend.id,
        'lastDate': friend.mostRecentStatusUpdate.date,
    })
Then, on my server side, I do something like this:
for friend in friends:
    userId = friend['userId']
    lastDate = friend['lastDate']
    # each get below, however, launches an RPC and does a separate table
    # lookup, so if I have 100 friends, this seems extremely inefficient
    get statusUpdates for userId where postDate > lastDate
The problem with the above approach is that, on the server side, each iteration of the for loop launches a new query, which launches an RPC. So if there are a lot of friends, it would seem to be really inefficient.
Is there a better way to design my structure to make this task more efficient? How does, say, Twitter do something like this, where it gets new timeline updates?
From the high level, I'd suggest you follow the prescribed app-engine mantra - make writes expensive to make reads cheap.
For each friend, you should keep a collection of known friends and their last status updates. This will allow you to update friends at write time. This is expensive for the write, but saves you processing and querying at read. This also assumes that you read more than you write.
Additionally, if you are just trying to display the N latest updates for each friend, I would suggest you use an ndb StructuredProperty to store the Friend objects - this way you can create a matching data structure. As part of the object, keep a collection of keys that correspond to the status updates. When a status update is written, add it to the collection, and potentially remove older entries (if space is a concern).
This way, when you need to retrieve the updates, you are getting them by key instead of through more expensive query types.
An alternative that avoids these additional key fetches is to keep the entire update instead of just keys. However, this takes a lot more storage - 10 friends all interconnected means 100 copies of the same update.
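A rough ndb sketch of the keys-only fan-out (model names are hypothetical; since ndb does not allow a repeated property inside a repeated StructuredProperty, this version keeps one FriendFeed entity per friend instead of nesting the key list in structured Friend objects):

from google.appengine.ext import ndb

class StatusUpdate(ndb.Model):
    author_id = ndb.StringProperty()
    text = ndb.TextProperty()
    post_date = ndb.DateTimeProperty(auto_now_add=True)

class FriendFeed(ndb.Model):
    # One entity per (user, friend) pair, created with the user's key as
    # parent so a single ancestor query fetches the whole friend list.
    friend_id = ndb.StringProperty()
    recent_update_keys = ndb.KeyProperty(kind=StatusUpdate, repeated=True)

def fan_out_update(update_key, author_id, max_recent=10):
    # Expensive write: push the new update's key into every follower's
    # FriendFeed so reads become batch gets instead of N queries.
    feeds = FriendFeed.query(FriendFeed.friend_id == author_id).fetch()
    for feed in feeds:
        feed.recent_update_keys = ([update_key] + feed.recent_update_keys)[:max_recent]
    ndb.put_multi(feeds)

def read_timeline(user_key):
    # Cheap read: one ancestor query plus one batch get by key.
    feeds = FriendFeed.query(ancestor=user_key).fetch()
    keys = [k for feed in feeds for k in feed.recent_update_keys]
    return ndb.get_multi(keys)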
I'm trying to move from Redis to DynamoDB, and so far everything is working great! The only thing I have yet to figure out is key expiration. Currently, I have my data set up with one primary key and no range key, as so:
{
    "key"       => string,
    "value"     => ["string", "string"],
    "timestamp" => seconds since epoch
}
What I was thinking of doing was a scan over the database for entries where the timestamp is less than a particular value, and then explicitly deleting them. This, however, seems extremely inefficient and would use up a ridiculous number of read/write units for no reason! On top of that, the expirations would only happen when I run the scan, so stale entries could conceivably build up.
So, has anyone found a good solution to this problem?
I'm also using DynamoDB the way we used to use Redis.
My suggestion is to write the keys into different time-sliced tables.
For example, say a type of record should last a few minutes, at most an hour. Then you can:
Create a new table every day for this type of record and store new records in today's table.
Use read repair when you read records: if you can't find a record in today's table, try to find it in yesterday's table and copy it into today's table if necessary.
If you find the record in either table, verify it against its timestamp. It's not necessary to delete expired records at this moment.
Drop entire stale tables in a scheduled task.
This is easier to maintain and more cost-efficient.
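A minimal boto3 sketch of the daily-table idea (the table naming and attribute names are assumptions, and it presumes a scheduled job already creates tomorrow's table and drops the stale ones):

import time
from datetime import datetime, timedelta
import boto3

dynamodb = boto3.resource("dynamodb")

def table_for(day):
    # Hypothetical naming scheme: one table per day.
    return dynamodb.Table("cache-%s" % day.strftime("%Y-%m-%d"))

def put_record(key, value):
    table_for(datetime.utcnow()).put_item(Item={
        "key": key,
        "value": value,
        "timestamp": int(time.time()),
    })

def get_record(key):
    today = datetime.utcnow()
    for day in (today, today - timedelta(days=1)):
        item = table_for(day).get_item(Key={"key": key}).get("Item")
        if item is not None:
            if day is not today:
                # Read repair: copy the record forward into today's table.
                table_for(today).put_item(Item=item)
            return item
    return None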
You could do lazy expiration and delete it on request.
For example:
store key "a" with an "expiration" attribute; it expires in 10 minutes.
fetch it at 9 minutes, check the expiration, return it.
fetch it at 11 minutes, check the expiration; since it's earlier than now, delete the entry.
This is what memcached was doing when I looked at the source a few years ago.
You'd still need to do a scan to remove all the old entries.
You could also consider using ElastiCache, which is meant for caching rather than as a permanent data store.
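A rough boto3 sketch of that lazy-expiration check (the table name, key schema, and TTL length are assumptions; you would still need the periodic scan mentioned above to clean up keys that are never read again):

import time
import boto3

table = boto3.resource("dynamodb").Table("cache")  # hypothetical table name
TTL_SECONDS = 600  # e.g. 10 minutes

def put(key, value):
    table.put_item(Item={
        "key": key,
        "value": value,
        "expiration": int(time.time()) + TTL_SECONDS,
    })

def get(key):
    item = table.get_item(Key={"key": key}).get("Item")
    if item is None:
        return None
    if item["expiration"] < int(time.time()):
        # Expired: delete lazily on read and pretend it was never there.
        table.delete_item(Key={"key": key})
        return None
    return item["value"]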
It seems that Amazon just added expiration support to DynamoDB (as of February 27, 2017). Take a look at the official blog post:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
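Enabling it is a one-off call; a sketch with boto3 (the table name is hypothetical, and the chosen attribute must hold the expiry time as epoch seconds, so you would store something like an expires_at value rather than the creation timestamp):

import boto3

client = boto3.client("dynamodb")

# Items whose expires_at value is in the past are removed by DynamoDB in
# the background (typically within a day or two) at no read/write cost.
client.update_time_to_live(
    TableName="cache",  # hypothetical table name
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "expires_at",
    },
)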
You could use the timestamp as the range key, which would be indexed and allow for easier time-based operations.
I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered into the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the DB and somehow run it against this one new object to see if it matches that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as mini-docs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", and that built a list of possibly interesting searches to run; then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries - often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to be a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all of the involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That will probably still run into scaling issues if you have a ton of writes to the DB and a lot of saved searches, but it is a straightforward Django approach.
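A minimal Django sketch of that signal-based approach, assuming the saved searches store a pickled Q object plus a ContentType (the model names and the notify hook are hypothetical):

import pickle

from django.contrib.contenttypes.models import ContentType
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class SavedSearch(models.Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    pickled_query = models.BinaryField()  # the pickled Q object

# Connected with no sender, so it fires for every model; the content_type
# filter below narrows it down to searches registered for that model.
@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    if not created:
        return
    ct = ContentType.objects.get_for_model(sender)
    for search in SavedSearch.objects.filter(content_type=ct):
        q = pickle.loads(search.pickled_query)
        # Re-run the stored Q against just this one new row.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify(search, instance)

def notify(search, instance):
    # Hypothetical notification hook: email, queue message, etc.
    print("saved search %s matched %r" % (search.pk, instance))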