I have about 1000 user account entities like this:
class UserAccount(ndb.Model):
    email = ndb.StringProperty()
Some of these email values contain uppercase letters, like JohnathanDough@email.com. I want to select all the email values from all UserAccount entities and apply Python's str.lower() to them. How can I do this efficiently, and most importantly, without errors?
Note: The email values are important for login, so I cannot afford to mess this up. Is there a way to back up this data in case I do make a mistake?
Thank you.
Yes, of course. Even though Datastore Administration is an experimental feature, you can back up and restore data without writing any code. Follow these instructions for the backup flow: Backing up data.
For processing your data, the most efficient way is to use the MapReduce library.
MapReduce works, but it's an excessive complication if you've never used it before.
Use task queues: each task can handle one page of query results, store the cursor for the next page, and enqueue another task for that page.
This is slower than MapReduce if you run the tasks sequentially, but 1000 entities is not much; it will probably be done in about a minute.
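If you go the task-queue route, here is a rough sketch of one way to chain the work using the deferred library and query cursors (assuming the UserAccount model above; the function name and batch size are placeholders):
from google.appengine.ext import deferred, ndb

BATCH_SIZE = 100  # placeholder page size

def lower_emails(cursor=None):
    # Each task processes one page of UserAccount entities.
    accounts, next_cursor, more = UserAccount.query().fetch_page(
        BATCH_SIZE, start_cursor=cursor)
    changed = []
    for account in accounts:
        lowered = account.email.lower()
        if lowered != account.email:
            account.email = lowered
            changed.append(account)
    if changed:
        ndb.put_multi(changed)  # one batched write per page
    if more and next_cursor:
        # Chain a new task for the next page instead of doing everything
        # in a single request.
        deferred.defer(lower_emails, cursor=next_cursor)
Kick it off once with deferred.defer(lower_emails); with 1000 entities and a page size of 100, that is only about ten tasks.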
I want to add 'check username availability' functionality to my signup page using AJAX. I have a few doubts about how I should implement it.
1. With which event should I register my AJAX requests? We can send the requests when the user focuses out of the 'username' input field (the blur event) or as he types (the keyup event). Which provides a better user experience?
2. On the server side, a simple way of dealing with requests would be to query my main 'Accounts' database. But this could lead to a lot of requests hitting my database (even more if we POST using the keyup event). Should I maintain a separate model for registered usernames only and use that to get better results?
3. Is it possible to use Memcache in this case? For example, initializing the cache with every username as a key, updating it as we register users, and using a random key to check whether the cache is actually initialized, otherwise passing the queries directly to the db.
Answers -
1. Do the check on blur. If you do it on keyup, you will hammer your server with unnecessary queries, annoy the user who is not yet done typing, and likely lag the typing anyway.
2. If your Account entity is very large, you may want to create a separate AccountName entity, and create a matching AccountName whenever you create a real Account (though this is probably an unnecessary optimization). When you create the Account (or AccountName), be sure to assign id=name. Then you can do an AccountName.get_by_id(name) to quickly see whether the name has already been taken, and it will automatically be pulled from memcache if it has been dealt with recently.
3. By default, GAE NDB will automatically populate memcache for you when you put or get entities. If you follow my advice in step 2, lookups will be very fast and you won't have to mess around with pre-populating memcache. (A sketch of such a check handler follows the transaction example below.)
4. If you are concerned about two people simultaneously requesting the same username, put your create method in a transaction:
@classmethod
@ndb.transactional()
def create_account(cls, name, **other_params):
    # Inside the transaction, get_by_id() gives a consistent check that
    # the username (used as the key id) is not already taken.
    acct = Account.get_by_id(name)
    if not acct:
        acct = Account(id=name, **other_params)
        acct.put()
    return acct
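As an illustration of points 2 and 3 above, a minimal availability-check handler might look like the following; webapp2 and the AccountName model are my assumptions here, not part of the question:
import json

import webapp2
from google.appengine.ext import ndb

class AccountName(ndb.Model):
    # No properties needed: the entity's key id is the username itself.
    pass

class CheckUsernameHandler(webapp2.RequestHandler):
    def get(self):
        name = self.request.get('username')
        if not name:
            self.abort(400)
        # get_by_id() is a key lookup; NDB serves it from memcache when
        # the entity has been read or written recently.
        taken = AccountName.get_by_id(name) is not None
        self.response.headers['Content-Type'] = 'application/json'
        self.response.write(json.dumps({'available': not taken}))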
I would recommend the blur event of the username field, combined with some sort of inline error/warning display.
I would also suggest maintaining a memcache of registered usernames, to reduce DB hits and improve the user experience - although probably not populating it with a warm-up, but instead only when requests are made. This is sometimes called a "Repository" pattern.
BUT, you can only populate the cache with USED usernames - you should not store the "available" usernames here (or if you do, use a much lower timeout).
You should always check directly against the DB/Datastore when actually performing the registration. And ideally in some sort of transactional method so that you don't have race conditions with multiple people registering.
BUT, all of this work is dependent on several things, including how busy your app is and what data storage tech you are using!
I'm trying to figure out the best approach to keeping track of the number of entities of a certain NDB kind I have in my cloud datastore.
One approach is just, when I want to know how many I have, to get the .count() of a query that I know will return all of them, but that costs a ton of datastore small operations (the cost looks proportional to the number of entities of that kind I have). So that's not ideal.
Another option would be having a counter in the datastore that gets updated every time I create or delete an entity, but that's also not ideal because it would add an extra read and write operation to every entity I create or destroy.
As of now, it looks like the second option is my best choice, so my question is--do you agree? Are there any other options that would be more cost-effective?
Thanks a lot.
PS: Working in Python if that makes a difference.
The second option is the way to go.
Other considerations:
If you have many writes per second, you may wish to consider using a sharded counter
To reduce datastore writes, you could use a cron job to update the datastore at timed intervals (i.e. count how many entities have been created since the last run)
Also consider using memcache.incr() in conjunction with a cron job to persist the data. The downside of this is that your memcache key could drop, so it's only really an option if the count doesn't have to be exact.
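For the memcache.incr() idea, a rough sketch (the counter model, memcache key, and cron wiring are placeholders) might be:
from google.appengine.api import memcache
from google.appengine.ext import ndb

COUNTER_KEY = 'mykind_count_delta'  # placeholder memcache key

class KindCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

def record_created():
    # Call this whenever an entity of the kind is created; it touches
    # only memcache, so there is no extra datastore write.
    memcache.incr(COUNTER_KEY, initial_value=0)

def persist_count():
    # Run this from a cron job. If the memcache key gets evicted the
    # pending delta is lost, which is why the total is only approximate.
    delta = int(memcache.get(COUNTER_KEY) or 0)
    if delta:
        counter = KindCounter.get_or_insert('total')
        counter.count += delta
        counter.put()
        memcache.decr(COUNTER_KEY, delta=delta)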
There's actually a better/cheaper/faster way to see the info you are looking for, but it might not work if you need to know the EXACT number of entities at any given moment, since it's only updated a couple of times a day (i.e. you can access it at any time, but it may be a few hours out of date).
The "Datastore Statistics" page in GAE dashboard displays some detailed data about kinds/entities including "count" numbers and there's a way to access it programmatically. See more info here: https://cloud.google.com/appengine/docs/python/datastore/stats
I'm sure a lot of services online today must perform a task similar to what I'm doing. A user has friends, and I want to get all status updates of all the user's friends since each friend's last status update.
That was a mouthful, but here's what I have:
A user has, say, 10 friends. What I want to do is get new status updates for all his friends. So, for each friend I prepare a dictionary with that friend's last status date. Something like:
friends = []
for friend in user.friends:
    friends.append({'userId': friend.id,
                    'lastDate': friend.mostRecentStatusUpdate.date})
Then, on my server side, I do something like this:
for entry in friends:
    user_id = entry['userId']
    last_date = entry['lastDate']
    # Each query below launches a separate RPC and table lookup, so
    # with 100 friends this seems extremely inefficient.
    updates = StatusUpdate.query(StatusUpdate.userId == user_id,
                                 StatusUpdate.postDate > last_date).fetch()
The problem with the above approach is that on the server side each iteration of the for loop launches a new query, which launches an RPC. So if there are a lot of friends, it would seem to be really inefficient.
Is there a better way to design my structure to make this task more efficient? How does, say, Twitter do something like this when it fetches new timeline updates?
From the high level, I'd suggest you follow the prescribed app-engine mantra - make writes expensive to make reads cheap.
For each user, you should keep a collection of known friends and their last status updates. This will allow you to update the collection at write time. This is expensive for the write, but saves you processing and querying at read time. This also assumes that you read more than you write.
Additionally, if you are just trying to display the N latest updates for each friend, I would suggest you use an NDB StructuredProperty to store the Friend objects - this way you can create a matching data structure. As part of the object, keep a collection of keys that correspond to the status updates. When a status update is written, add its key to the collection, and potentially remove older entries (if space is a concern).
This way, when you need to retrieve the updates, you get them by key instead of through more expensive queries.
An alternative to this, which avoids even the extra key fetches, is to keep the entire update instead of just its keys. However, this takes a lot more storage - 10 friends all interconnected means 100 copies of the same update.
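Here is a rough sketch of the keys-at-write-time idea; every model and property name is an assumption, and it keeps one small entity per (user, friend) pair rather than nesting a repeated key list inside a repeated StructuredProperty, which NDB does not allow:
from google.appengine.ext import ndb

class FriendFeed(ndb.Model):
    # One entity per (user, friend) pair, keyed e.g. '<user_id>:<friend_id>'.
    # Updated whenever the friend posts; older keys get trimmed away.
    recent_update_keys = ndb.KeyProperty(kind='StatusUpdate', repeated=True)

def fan_out_update(author_id, follower_ids, update_key, keep=10):
    # Write-time work: record the new update's key for every follower.
    feeds = []
    for follower_id in follower_ids:
        feed = FriendFeed.get_or_insert('%s:%s' % (follower_id, author_id))
        feed.recent_update_keys = (feed.recent_update_keys + [update_key])[-keep:]
        feeds.append(feed)
    ndb.put_multi(feeds)

def latest_updates(user_id, friend_ids):
    # Read-time work: two batched key lookups, no per-friend queries.
    feed_keys = [ndb.Key('FriendFeed', '%s:%s' % (user_id, fid))
                 for fid in friend_ids]
    feeds = ndb.get_multi(feed_keys)
    update_keys = [k for feed in feeds if feed for k in feed.recent_update_keys]
    return ndb.get_multi(update_keys)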
I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database that have a last-modified date since the last run; then these get filtered and alerts issued. You can perhaps put some of the filtering into the query statement in the database. However, this is a bit trickier if notifications need to be sent if items get deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
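A rough sketch of the timed-job idea in Django terms (the Item model, its last_modified field, the as_q() helper, and the alert hook are all assumptions):
from myapp.models import Item, SavedSearch   # hypothetical models
from myapp.notifications import send_alert   # hypothetical alert hook

def check_recent_items(last_run):
    # Push the "changed since last run" filter into the query itself,
    # then apply each stored search only to that small set.
    recent = Item.objects.filter(last_modified__gte=last_run)
    for search in SavedSearch.objects.all():
        q = search.as_q()  # hypothetical: rebuild the stored Q object
        for item in recent.filter(q):
            send_alert(search.user, item)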
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of their must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries", which built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries - often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
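A very rough sketch of that "database of queries" idea, with every name invented for illustration:
from collections import defaultdict, namedtuple

# A stored query reduced to its term lists plus whatever runs it in full.
StoredQuery = namedtuple('StoredQuery', 'query_id must_have may_have run')

queries_by_term = defaultdict(set)  # term -> ids of stored queries
stored_queries = {}                 # query id -> StoredQuery

def index_query(sq):
    # Index the stored query under all of its must-have and may-have terms.
    stored_queries[sq.query_id] = sq
    for term in set(sq.must_have) | set(sq.may_have):
        queries_by_term[term].add(sq.query_id)

def queries_to_run(new_doc_terms):
    # Use the new doc's term list as a query against the database of
    # queries: only stored queries sharing at least one term can match,
    # so only those few are executed in full against the new doc.
    candidate_ids = set()
    for term in new_doc_terms:
        candidate_ids |= queries_by_term[term]
    return [stored_queries[qid] for qid in candidate_ids]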
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4-10,000,000 new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
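A sketch of that signal-based approach; the SavedSearch model, its fields, and the notification hook are assumptions about how the pickled searches might be stored:
import pickle

from django.contrib.contenttypes.models import ContentType
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import SavedSearch    # hypothetical model holding pickled Q objects
from myapp.notifications import notify  # hypothetical notification hook

@receiver(post_save)
def run_saved_searches(sender, instance, created, **kwargs):
    content_type = ContentType.objects.get_for_model(sender)
    # Only load the searches registered against this model's content type.
    for search in SavedSearch.objects.filter(content_type=content_type):
        q = pickle.loads(search.pickled_query)
        # Re-run the stored Q against just this one saved object.
        if sender.objects.filter(q, pk=instance.pk).exists():
            notify(search.user, instance)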
I want to load info from another site (this part is done), but I am doing it every time the page is loaded, and that won't do. So I was thinking of having a variable in a settings table like 'last checked bbc site', and when the page loads it would check whether it has been long enough since the last check to check again. Is there anything silly about doing it that way?
Also, do I absolutely have to use tables to store one-off variables like this setting?
I think there are two options that would work for you, besides creating an entity in the datastore to keep track of the "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
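A sketch of the memcache variant; the key name, interval, and URL are placeholders:
import time

from google.appengine.api import memcache, urlfetch

CHECK_INTERVAL = 15 * 60  # seconds between refreshes; pick what suits you

def maybe_refresh_external_page():
    last_checked = memcache.get('bbc_last_checked')
    if last_checked and time.time() - last_checked < CHECK_INTERVAL:
        return  # fetched recently enough, skip the external request
    result = urlfetch.fetch('http://www.bbc.co.uk/')  # placeholder URL
    # ... process result.content here ...
    # If the memcache entry was evicted, the worst case is one extra
    # fetch, so losing it is harmless.
    memcache.set('bbc_last_checked', time.time())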
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.
If you want some handler on your GAE app (including one for a scheduled task, reception of messages, web page visits, etc.) to store some new information in such a way that some handler in the future can recover that information, then GAE's storage is the only good general way (memcache could expire from under you, for example). I'm not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's storage, the answer is "yes". (Under very specific circumstances you might want to put that data in some different place on the network, such as your visitor's browser, e.g. via cookies, or an Amazon storage instance, etc., but it does not appear to me that those specific circumstances are applicable to your use case.)