App engine app design questions

App engine app design questions - python

I want to load info from another site (this part is done), but I am doing this every time the page is loaded and that won't do. So I was thinking of having a variable in a table of settings like 'last checked bbc site', and when the page loads it would check whether it's been long enough since the last check to check again. Is there anything silly about doing it that way?
Also, do I absolutely have to use tables to store one-off variables like this setting?

I think there are 2 options that would work for you, besides creating an entity in the datastore to keep track of "last visited time".
One way is to just check the external page periodically, using the cron API as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.

You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could have this in a script / config file.

If you want some handler on your GAE app (including one for a scheduled task, reception of messages, web page visits, etc.) to store some new information in such a way that some handler in the future can recover that information, then GAE's storage is the only good general way (memcache could expire from under you, for example). Not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's storage, the answer is "yes". (Under very specific circumstances you might want to put that data in some different place on the network, such as your visitor's browser, e.g. via cookies, or an Amazon storage instance, etc., but it does not appear to me that those specific circumstances are applicable to your use case.)

Related

microservices and multiple databases

I have written microservices for auth, location, etc.
Each microservice has its own database, and some data, for example location, is present in all of these databases. When any of my projects needs a user's location, it first looks in the cache and, if it is not found there, hits the database. So far so good. Now, when a location is changed in any one of these databases, I need to update it in the other databases as well as in my cache.
Currently I have a model (called subscription) with a URL as its field; whenever a location is changed in any database, a subscription object is created. A periodic task checks the subscription model, and when it finds such objects it hits the APIs of the other services, updates the location, and updates the cache.
I am wondering if there is any better way to do this?

I am wondering if there is any better way to do this?
"better" is entirely subjective. if it meets your needs, it's fine.
something to consider, though: don't store the same information in more than one place.
if you need an address, look it up from the service that provides address, every time.
this may be a performance hit, but it eliminates the problem of replicating the data everywhere.
another option would be a more proactive approach, as suggested in comments.
instead of creating a task list for changes and processing it periodically, send a message across rabbitmq immediately when the change happens. let every service that needs to know get a copy of the message and update its own cache of info.
just remember, though. every time you have more than one copy of the information, you reduce the "correctness" of the system, as a whole. it will always be possible for the information found in one of your apps to be out of date, because it did not get an update from the official source.
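a rough sketch of the "publish on change" idea, using a tiny in-process Broker class as a stand-in for rabbitmq so it runs without any server. the Broker and Service classes, the "location.changed" topic name and the message fields are all assumptions made for illustration, not a real rabbitmq client API.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for a message broker such as RabbitMQ."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver a copy of the message to every subscriber of the topic.
        for handler in self._subscribers[topic]:
            handler(dict(message))


class Service:
    """Hypothetical service that keeps its own local copy of locations."""

    def __init__(self, name, broker):
        self.name = name
        self.locations = {}
        broker.subscribe("location.changed", self.on_location_changed)

    def on_location_changed(self, message):
        self.locations[message["user_id"]] = message["location"]


broker = Broker()
auth = Service("auth", broker)
billing = Service("billing", broker)

# The owning service publishes as soon as the change happens.
broker.publish("location.changed", {"user_id": 42, "location": "Berlin"})
```

with a real broker the delivery is asynchronous, so the copies can still lag behind the source for a short time.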

update existing cache data with newer items in django

I want to use caching in Django and I am stuck on how to go about it. I have data in some specific models which are write intensive: records get added to the model continuously. Each user has some specific data in the model, similar to an orders table.
Since my model is write intensive, I am not sure how effective the caching frameworks in Django are going to be. I tried Django's per-view caching, and I am trying to develop a view which first picks up data from the cache. Then I will have another call which brings in the data that was added to the model after the caching was done. What I want to do is add the updated data to the original cached data and store it again.
That is, I don't want to expire my cache; I just want to keep adding to my existing cached data. Maybe once every 3 hours I can clear it.
Is what I am doing right? Are there better ways than this? Can I really add items to existing cached data?
I will be very glad for your help

You ask about "caching" which is a really broad topic, and the answer is always a mix of opinion, style and the specific app requirements. Here are a few points to consider.
If the data is per user, you can cache it per user:
from django.core.cache import cache
cache.set(request.user.id,"foo")
cache.get(request.user.id)
The common practice is to keep a database flag that tells you if the user's data changed since it was cached. So before you fetch the data from cache, check only this flag from the DB. If the flag says nothing changed, get the data from cache. If it did change, pull from DB, replace the cache, and set the flag again.
The flag check should be fast and simple: one table, indexed by user.id, with a boolean flag field. This squeezes a lot of index rows into a single DB page and makes fetching a single one-field row fast. Yet you still have a persistent, up-to-date main storage, which prevents the use of stale cache data. You can check this flag in a middleware.
You can run expiry in many ways: clear cache when user logs out, run a cron script that clears items, or let the cache backend expire items. If you use a flag check before you use the cache, there is no issue in keeping items in cache except space, and caching backends handle that. If you use the django simple file cache (which is easy, simple and zero config), you will have to clear the cache. A simple cron script will do.
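Here is a minimal sketch of the flag check, assuming plain dicts as stand-ins for the Django cache, the flag table and the write-heavy model so that it runs on its own. The names add_order, get_orders, changed_flags and orders_db are made up for illustration.

```python
cache = {}          # stand-in for django.core.cache.cache
changed_flags = {}  # stand-in for the small flag table, keyed by user id
orders_db = {}      # stand-in for the write-heavy model


def add_order(user_id, order):
    """Write path: store the record and mark the user's cache as stale."""
    orders_db.setdefault(user_id, []).append(order)
    changed_flags[user_id] = True


def get_orders(user_id):
    """Read path: consult only the flag before trusting the cache."""
    if changed_flags.get(user_id, True) or user_id not in cache:
        # Data changed (or was never cached): rebuild from the main storage.
        cache[user_id] = list(orders_db.get(user_id, []))
        changed_flags[user_id] = False
    return cache[user_id]
```

In a real app the write path could set the flag from the model's save method or a signal handler, and the read path would replace the dict lookups with one small query plus cache.get and cache.set.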

Wait for datastore changes before redirecting

Very similar to this question, except that the answer is not suitable.
I populate a table from a datastore query, then there is a link allowing the user to delete a specific row. Clicking the link goes to a url that deletes the row from the datastore then redirects back to the table.
Changes more often than not aren't shown in the table until reloading again.
Easy solution is to redirect to another page, that uses a javascript redirect to add a delay of a couple of seconds. Other alternative is to send details back to the page like action=delete&key=### and then make sure that item is missed from the table. That's a pain though.

The answer is with ancestor queries.
https://cloud.google.com/appengine/docs/python/datastore/queries#Python_Ancestor_queries
Create the entities with a parent. When one of the entities is deleted, you can run an ancestor query for your table list view which will have strong consistency when data is changed.
Example ancestor query:
tom = Person(key_name='Tom')
photo_query = Photo.all()
photo_query.ancestor(tom)

With the datastore, unless you can use ancestors, you can't guarantee when the indexes will be updated; a synchronous put (one without async) only guarantees the entity itself, for getting it by key later. Best is a combination of your suggestion, where the client takes its own action into account to patch the UI, plus maybe using memcache to remember recent actions and patch query results server-side before returning them to the client.
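A small sketch of the server-side patching idea, assuming a dict as a stand-in for memcache and rows represented as dicts with a "key" field. The names remember_delete, patch_results and RECENT_TTL are hypothetical, and the 30 second window is an assumed value, not a documented consistency bound.

```python
import time

RECENT_TTL = 30  # seconds to remember a delete; an assumed value

# Stand-in for memcache: maps deleted key -> time of deletion.
recently_deleted = {}


def remember_delete(key, now=None):
    """Record that a row was just deleted by a request handler."""
    recently_deleted[key] = time.time() if now is None else now


def patch_results(rows, now=None):
    """Drop rows whose keys were deleted recently, even if the query
    still returns them because its index has not caught up."""
    now = time.time() if now is None else now
    # Forget deletes older than the assumed window.
    for key, deleted_at in list(recently_deleted.items()):
        if now - deleted_at > RECENT_TTL:
            del recently_deleted[key]
    return [row for row in rows if row["key"] not in recently_deleted]
```

The delete handler would call remember_delete before redirecting, and the table view would pass its query results through patch_results.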

Here is a different approach. Use Javascript and AJAX. When the user clicks a link, you do two things:
Use Javascript/jQuery to remove the row from the DOM, and
Send an AJAX call to the server to do the appropriate datastore modifications.
It makes for a nice user experience because you are not reloading the page at all.

You might want to consider that there is always room for displaying outdated info in the table: for example displaying the table simultaneously in 2 different windows/tabs, then in one of them deletion is performed, the other will still display a delete link which will cause a 404 if followed.
With this in mind I'd 1st focus on managing the expectations (the user should know that the page may occasionally display outdated info) and then on the user's ability to get an outdated page in sync (refresh button?). Which might make the issue moot.
The delay-based "solutions" are bound to fail sooner or later in race condition scenarios, I wouldn't bother with the extra complexity. Especially for the document where the deletion is done: that's exactly where the user knows that the info is outdated (for free) and would be likely inclined to refresh until the recent change becomes visible.

Django session race condition?

Summary: is there a race condition in Django sessions, and how do I prevent it?
I have an interesting problem with Django sessions which I think involves a race condition due to simultaneous requests by the same user.
It has occurred in a script for uploading several files at the same time, being tested on localhost. I think this makes simultaneous requests from the same user quite likely (low response times due to localhost, long requests due to file uploads). It's still possible for normal requests outside localhost though, just less likely.
I am sending several (file post) requests that I think do this:
1. Django automatically retrieves the user's session*
2. Unrelated code that takes some time
3. Get request.session['files'] (a dictionary)
4. Append data about the current file to the dictionary
5. Store the dictionary in request.session['files'] again
6. Check that it has indeed been stored
7. More unrelated code that takes time
8. Django automatically stores the user's session
Here the check at 6. will indicate that the information has indeed been stored in the session. However, future requests indicate that sometimes it has, sometimes it has not.
What I think is happening is that two of these requests (A and B) happen simultaneously. Request A retrieves request.session['files'] first, then B does the same, changes it and stores it. When A finally finishes, it overwrites the session changes by B.
Two questions:
Is this indeed what is happening? Is the django development server multithreaded? On Google I'm finding pages about making it multithreaded, suggesting that by default it is not? Otherwise, what could be the problem?
If this race condition is the problem, what would be the best way to solve it? It's an inconvenience but not a security concern, so I'd already be happy if the chance can be decreased significantly.
Retrieving the session data right before the changes and saving it right after should decrease the chance significantly I think. However I have not found a way to do this for the request.session, only working around it using django.contrib.sessions.backends.db.SessionStore. However I figure that if I change it that way, Django will just overwrite it with request.session at the end of the request.
So I need a request.session.reload() and request.session.commit(), basically.

Yes, it is possible for a request to start before another has finished. You can check this by printing something at the start and end of a view and launching a bunch of requests at the same time.
Indeed the session is loaded before the view and saved after the view. You can reload the session using request.session = engine.SessionStore(session_key), where engine is the module named by settings.SESSION_ENGINE, and save it using request.session.save().
Reloading the session however does discard any data added to the session before that (in the view or before it). Saving before reloading would destroy the point of loading late. A better way would be to save the files to the database as a new model.
The essence of the answer is in the discussion of Thomas' answer, which was incomplete so I've posted the complete answer.
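To make the lost update concrete, here is a small simulation, assuming a dict as a stand-in for the session backend and a list as a stand-in for a table of per-file rows. It does not use Django at all; load_session and save_session are hypothetical helpers that mimic "load before the view, save after the view".

```python
import copy

session_store = {"files": {}}  # stand-in for the session backend
file_records = []              # stand-in for a table of per-file rows


def load_session():
    # Each request works on its own copy, as loaded at request start.
    return copy.deepcopy(session_store)


def save_session(session):
    # Saving writes the whole session back, replacing what was there.
    session_store.clear()
    session_store.update(copy.deepcopy(session))


# Requests A and B both load the session before either one saves.
session_a = load_session()
session_b = load_session()

session_b["files"]["b.txt"] = "uploaded"
save_session(session_b)

session_a["files"]["a.txt"] = "uploaded"
save_session(session_a)  # overwrites B's change: only a.txt survives

# Alternative: each request inserts its own row, so nothing is overwritten.
file_records.append({"name": "b.txt"})
file_records.append({"name": "a.txt"})
```

Storing one record per file sidesteps the problem because concurrent requests only add rows and never rewrite a shared value.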

Mark just nailed it, only minor addition from me, is how to load that session:
for key in list(session.keys()):  # if you have potential removals; copy the keys, since we delete while looping
    del session[key]
session.update(session.load())
session.modified = False  # just making it clean
First line optional, you only need it if certain values might be removed meanwhile from the session.
Last line is optional, if you update the session, then it does not really matter.

That is true. You can confirm it by having a look at the django.contrib.sessions.middleware.SessionMiddleware.
Basically, request.session is loaded before request hits your view (in process_request), and it is updated in the session backend (if needed) after the response has left your view (in process_response).
If what I mean is unclear, you might want to have a look at the django documentation for Middleware.
The best way to solve the issue will depend on what you're trying to achieve with that information. I'll update my answer if you provide that information!

Where is the best place to put cache-evicting logic in an AppEngine application?

I've written an application for Google AppEngine, and I'd like to make use of the memcache API to cut down on per-request CPU time. I've profiled the application and found that a large chunk of the CPU time is in template rendering and API calls to the datastore, and after chatting with a co-worker I jumped (perhaps a bit early?) to the conclusion that caching a chunk of a page's rendered HTML would cut down on the CPU time per request significantly. The caching pattern is pretty clean, but the question of where to put this logic of caching and evicting is a bit of a mystery to me.
For example, imagine an application's main page has an Announcements section. This section would need to be re-rendered after:
first read for anyone in the account,
a new announcement being added, and
an old announcement being deleted
Some options of where to put the evict_announcements_section_from_cache() method call:
in the Announcement Model's .delete(), and .put() methods
in the RequestHandler's .post() method
anywhere else?
Then in the RequestHandler's get page, I could potentially call get_announcements_section() which would follow the standard memcache pattern (check cache, add to cache on miss, return value) and pass that HTML down to the template for that chunk of the page.
Is it the typical design pattern to put the cache-evicting logic in the Model, or the Controller/RequestHandler, or somewhere else? Ideally I'd like to avoid having evicting logic with tentacles all over the code.

I've got just such a decorator up in an open source Github project:
http://github.com/jamslevy/gae_memoize/tree/master
It's a bit more in-depth, allowing for things like forcing execution of the function (when you want to refresh the cache) or forcing caching locally...these were just things that I needed in my app, so I baked them into my memoize decorator.

A couple of alternatives to regular eviction:
The obvious one: Don't evict, and set a timer instead. Even a really short one - a few seconds - can cut down on effort a huge amount for a popular app, without users even noticing data may be a few seconds stale.
Instead of evicting, generate the cache key based on criteria that change when the data does. For example, if retrieving the key of the most recent announcement is cheap, you could use that as part of the key of the cached data. When a new announcement is posted, you go looking for a key that doesn't exist, and create a new one as a result.
