Access GAE datastore from background thread - python

I'm writing a web app through Google App Engine and I'd like to have a script frequently update user profiles based on live, temporary information I'm getting from an XML feed. I'm doing this with a GAE background_thread so the site can continue to operate while this runs.
Outside this background thread, users can still navigate the website and thereby make changes to their profile.
The background thread does exactly what it should, updating user profiles based on the live XML data and re-entering the profiles into the datastore. However, when a user makes a change to their profile, the background thread does not pick up on the change: the list returned by the ndb datastore query does not reflect the changes users make.
The curious detail is that it DOES reflect the correct changes when a new user is added to the datastore; it just doesn't reflect changes when a preexisting user profile is modified. I should be able to query/put the datastore from a background thread, right?
The meat of the background thread:
def update_accounts():
    while True:
        # Get data from the XML feed.
        info_dict = get_live_data()
        # Get all the users from the GAE datastore.
        gprofiles = mUserStats.query()
        for profile in gprofiles:
            # This isn't the actual condition but there's a condition here.
            if needs_update in profile.m_needsUpdate:
                # Modify the current profile.
                profile.make_change(info_dict)
                # Re-enter it into the datastore.
                profile.put()
        # Sleep between passes, as this doesn't need to run that frequently.
        time.sleep(20)

class updateAccounts():
    def start_thread(self):
        # Pass the function itself; calling it here would run the loop in this thread.
        t = background_thread.start_new_background_thread(target=update_accounts)
This is where profiles are modified:
def post(self):
    session = get_current_session()
    user_key = mUserStats_key(session['me'].m_email)
    curr_user = mUserStats.get_by_id(session['me'].m_email, user_key)
    curr_user.change_profile()
    curr_user.put()

Just some random thoughts, don't really know which would work best (if any at all):
Instead of doing profile.put() inside the loop, you could collect changed entities in a list and make a single ndb.put_multi() call after the loop. This reduces the number of datastore calls by roughly the number of mUserStats entities, shortening execution time and leaving less chance for a profile to be changed by a user while the background task is running.
If the gprofiles = mUserStats.query() line actually fetches whole entities, you could instead run the query with keys_only=True and get each mUserStats entity individually inside the loop. This increases execution time and the number of datastore calls, but there is much less chance that an entity was changed by a user between being fetched and being written back.
Are the properties updated by the XML feed the same properties updated by the user? If not, maybe they could be stored in different models.
You could also take a look at query cursors and iterators, which might help automate suggestions 1 & 2; a sketch combining them follows below.
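A minimal sketch combining suggestions 1 and 2, reusing mUserStats, get_live_data(), and the placeholder condition from the question: the keys_only query avoids holding stale entity copies, each profile is re-fetched just before modification, and all writes go out in one batched ndb.put_multi():

from google.appengine.ext import ndb

def update_accounts_once():
    info_dict = get_live_data()
    # Fetch keys only, so we don't hold on to stale copies of the entities.
    keys = mUserStats.query().fetch(keys_only=True)
    changed = []
    for key in keys:
        # Re-fetch each profile as late as possible to narrow the race window.
        profile = key.get()
        if profile is None:
            continue  # deleted since the keys_only fetch
        if needs_update in profile.m_needsUpdate:  # placeholder condition from the question
            profile.make_change(info_dict)
            changed.append(profile)
    # One batched write instead of a put() per entity.
    if changed:
        ndb.put_multi(changed)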

Related

How to handle multi-user database interaction with PyQt5

I am developing a GUI app that will supposedly be used by multiple users. In my app, I use a QAbstractTableModel to display an MS Access database (stored on a local server, accessed by several PCs) in a QTableView. I developed everything I needed for single-user interaction, but now I'm moving to the step where I need to think about multi-user interaction.
For example, if user A changes a specific line, the instance of the app on user B's PC needs to update the changed line. Another example: if user A is modifying a specific line and user B also wants to modify it, B needs to be notified that it is "already being modified, please wait", and once user A's modification is done, user B needs to see that modification before any further interaction.
Today, because of the local nature of the MS Access database, I have to refresh the table view many times, based on user interaction, in order not to miss any database modification from other potential users. This is quite greedy in terms of performance and resources.
I was thinking about using Django to make the different app instances communicate with each other, but maybe I'm overthinking it and there are other solutions.
I don't know if that's clear; I'm happy to provide more information!
Perhaps you could simply store a "lastUpdated" timestamp on the row. With each update, you update that timestamp.
Now, when you submit an update, you include that timestamp, and if the timestamps don't match, you let the user know and handle the conflict on the frontend (perhaps a simple "overwrite local copy, or force the update and overwrite the server copy" choice).
That's a simple and robust solution, but if you don't want users wasting time writing updates for stale rows, you could use WebSockets to let a server tell any clients with that row open for editing that the row has been updated.
If you want to "lock" rows while they are being edited, you could simply store an "inUse" boolean and have users check its value before continuing.
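A minimal sketch of the timestamp check, assuming a Python DB-API connection and a hypothetical rows table with value and last_updated columns (none of these names come from the question):

def try_update(conn, row_id, new_value, seen_timestamp):
    """Apply the update only if the row still carries the timestamp we last read."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE rows SET value = ?, last_updated = NOW() "
        "WHERE id = ? AND last_updated = ?",
        (new_value, row_id, seen_timestamp),
    )
    conn.commit()
    # rowcount == 0 means someone else changed the row first: surface the conflict.
    return cur.rowcount == 1

Putting the timestamp in the WHERE clause makes the compare-and-write a single atomic statement, so there is no race window between checking the timestamp and writing the row.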
Usually, when using an MVC pattern (which is what QAbstractTableModel + QTableView is), the responsibility for updating the view lies with the model itself, i.e. it's the model that should notify the view that something changed.
It seems that QAbstractTableModel has a dataChanged signal that gets emitted on data changes.
I suggest connecting it to your view's refresh slot as done here.
In this way you avoid the need for another moving part/infrastructure component (Django).
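For the Qt-side wiring, here's a minimal PyQt5 sketch of the idea, with a polling timer standing in for whatever mechanism detects external changes (fetch_rows and the 2-second interval are assumptions, not from the question):

from PyQt5.QtCore import QAbstractTableModel, QModelIndex, Qt, QTimer

class PollingTableModel(QAbstractTableModel):
    """Table model that re-reads the database itself and notifies its views."""

    def __init__(self, fetch_rows, parent=None):
        super().__init__(parent)
        self._fetch_rows = fetch_rows      # callable returning a list of row tuples
        self._rows = fetch_rows()
        self._timer = QTimer(self)         # stand-in for a real change notification
        self._timer.timeout.connect(self.refresh)
        self._timer.start(2000)

    def rowCount(self, parent=QModelIndex()):
        return len(self._rows)

    def columnCount(self, parent=QModelIndex()):
        return len(self._rows[0]) if self._rows else 0

    def data(self, index, role=Qt.DisplayRole):
        if role == Qt.DisplayRole:
            return self._rows[index.row()][index.column()]
        return None

    def refresh(self):
        new_rows = self._fetch_rows()
        if new_rows == self._rows or not new_rows:
            return
        # Assumes a fixed row set; use begin/endResetModel if rows come and go.
        self._rows = new_rows
        self.dataChanged.emit(
            self.index(0, 0),
            self.index(self.rowCount() - 1, self.columnCount() - 1),
        )

Any QTableView attached via setModel() repaints the changed cells on dataChanged automatically; you can additionally connect the signal to a custom refresh slot if the view needs extra work.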

Monitor MySQLdb in python for new entries and Flask

I'm looking for a way to constantly check my database (MySQL) for new entries. Once a new entry is committed, I want to output it on a webpage using Flask.
Since the process takes time to finish, I would like to give users the impression that it only took a few seconds to retrieve the data.
For now I wait for the whole process to finish before giving the user the whole result, but I would prefer to update the result webpage every time a new entry is added to the DB. So, for example, the first entry is added to the DB and the user can immediately see it on the webpage; then a second entry is added and the user can now see both the first and the second entries; and so on. I don't know whether this should come from Flask or from somewhere else.
Any ideas?
You can set MySQL to log all commits to the General Query Log and monitor that file for changes (for example via Watchdog or PyNotify). Once the file changes, you can parse the new log entries and raise the signal. This way you avoid polling the database for changes.
The better way would of course be to send the signal at the moment the data is stored in the database.
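A rough sketch of the log-watching approach with the watchdog package; the log path and the notify_clients() callback are illustrative assumptions:

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

LOG_PATH = "/var/log/mysql/general.log"  # wherever the General Query Log is written

class QueryLogHandler(FileSystemEventHandler):
    def __init__(self, on_insert):
        super().__init__()
        self._on_insert = on_insert
        self._pos = 0  # read offset, so we only parse new log lines

    def on_modified(self, event):
        if event.src_path != LOG_PATH:
            return
        with open(LOG_PATH) as f:
            f.seek(self._pos)
            new_lines = f.readlines()
            self._pos = f.tell()
        for line in new_lines:
            if "INSERT" in line.upper():
                self._on_insert(line)  # e.g. push the new entry to the Flask page

observer = Observer()
observer.schedule(QueryLogHandler(on_insert=notify_clients), path="/var/log/mysql")
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()

On the Flask side, notify_clients() could push the parsed entry to open pages via server-sent events or WebSockets, so the list grows entry by entry as described.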

Checking username availability - Handling of AJAX requests (Google App Engine)

I want to add the 'check username available' functionality on my signup page using AJAX. I have few doubts about the way I should implement it.
With which event should I register my AJAX requests? We can send the requests when the user focuses out of the 'username' input field (the blur event) or as he types (the keyup event). Which provides the better user experience?
On the server side, a simple way of dealing with requests would be to query my main 'Accounts' database. But this could lead to a lot of requests hitting my database (even more if we POST using the keyup event). Should I maintain a separate model for registered usernames only and use that to get better results?
Is it possible to use Memcache in this case? For example, initializing the cache with every username as a key, updating it as we register users, and using a random key to check whether the cache is actually initialized, otherwise passing the queries directly to the db.
Answers -
Do the check on blur. If you do it on keyup, you will be hammering your server with unnecessary queries, annoying the user who is not yet done typing, and likely lagging the typing anyway.
If your Account entity is very large, you may want to create a separate AccountName entity, and create a matching AccountName whenever you create a real Account (but this is probably an unnecessary optimization). When you create the Account (or AccountName), be sure to assign id=name. Then you can do an AccountName.get_by_id(name) to quickly see if the name has already been assigned, and it will automatically be pulled from memcache if it has been recently dealt with.
By default, GAE NDB will automatically populate memcache for you when you put or get entities. If you follow my advice in step 2, things will be very fast and you won't have to mess around with pre-populating memcache.
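A minimal sketch of that pattern (the AccountName model here only illustrates the idea):

from google.appengine.ext import ndb

class AccountName(ndb.Model):
    pass  # no properties needed: the entity's id *is* the username

def username_taken(name):
    # get_by_id checks NDB's built-in memcache layer before the datastore.
    return AccountName.get_by_id(name) is not None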
If you are concerned about 2 people simultaneously requesting the same user name, put your create method in a transaction:
@classmethod
@ndb.transactional()
def create_account(cls, name, **other_params):
    acct = Account.get_by_id(name)
    if not acct:
        acct = Account(id=name, **other_params)  # plus whatever other assigns you need
        acct.put()
    return acct
I would recommend the blur event of the username field, combined with some sort of inline error/warning display.
I would also suggest maintaining a memcache of registered usernames, to reduce DB hits and improve user experience - although probably not populating it via a warm-up, but instead only as requests are made. This is sometimes called a "Repository" pattern.
BUT, you can only populate the cache with USED usernames - you should not store the "available" ones here (or if you do, use a much lower timeout).
You should always check directly against the DB/Datastore when actually performing the registration, and ideally in some sort of transactional method so that you don't have race conditions with multiple people registering.
BUT, all of this work is dependent on several things, including how busy your app is and what data storage tech you are using!
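A minimal sketch of that caching strategy with GAE's memcache API, assuming the Account model from the previous answer (the key prefix is illustrative):

from google.appengine.api import memcache

def username_taken(name):
    cache_key = 'username-taken:' + name
    if memcache.get(cache_key):
        return True  # we cached a positive hit earlier
    taken = Account.get_by_id(name) is not None
    if taken:
        # Only cache USED names; "available" can change at any moment.
        memcache.set(cache_key, True)
    return taken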

GAE datastore - best practice when there are more writes than reads

I'm trying to do some practicing with the GAE datastore to get a feel for the query and billing mechanisms.
I've read the O'Reilly book about GAE and watched the Google videos about the datastore. My problem is that the best-practice methods usually concern apps with more reads than writes to the datastore.
I built a super simple app:
there are two webpages: one to choose links, and one to view chosen links
every user can choose to add URL links to his "links feed"
the user can choose as many links as he wants, whenever he wants
on a different webpage, I want to show the user the 10 most recent links he chose
every user has his own "links feed" webpage
for every "link" I want to save and show some metadata - for example: the URL itself; when it was chosen; how many times it has already appeared on the feed; etc.
In this case, since the user can choose as many links as he wants, whenever he wants, my app writes to the datastore much more than it reads (a write when the user chooses another link; a read when the user opens the webpage to see his "links feed").
Question 1:
I can think of (at least) two options how to handle the data for this app:
Option A:
- maintain an entity per user with the user details, registration, etc.
- maintain another entity per user that holds his 10 most recently chosen links, to be rendered on the user's webpage when he asks for it
Option B:
- maintain an entity per URL link - meaning the URLs of all users are stored in one big table
- maintain an entity per user's details (same as in Option A), but add references to the user's URLs in the big table of URLs
Which method is better?
Question 2:
If I want to count the total number of URLs chosen to date, or the number of URLs a user chose per day, or do any other counting - should I do it with my SDK tools, or should I keep counters in the entities described above? (I want to reduce the number of datastore writes as much as I can.)
EDIT (to answer #Elad's comment):
Assume I want to save only the last 10 URLs per user; the rest I want to get rid of (so as not to overpopulate my DB with unnecessary data).
EDIT 2: after adding the code
So I gave it a try with the following code (trying Elad's method first):
Here's my class:
class UserChannel(db.Model):
    currentUser = db.UserProperty()
    userCount = db.IntegerProperty(default=0)
    currentPlaylist = db.StringListProperty()  # holds the last 20-30 urls
Then I serialize the URL & metadata into a JSON string, which the user POSTs from the first page.
Here's how the POST is handled:
def post(self):
    user = users.get_current_user()
    if user:
        # logging messages for debugging
        self.response.headers['Content-Type'] = 'text/html'
        #self.response.out.write('<p>the user_id is: %s</p>' % user.user_id())
        # updating the new item that the user adds
        current_user = UserChannel.get_by_key_name(user.nickname())
        dataJson = self.request.get('dataJson')
        #self.response.out.write('<p>the dataJson is: %s</p>' % dataJson)
        current_user.currentPlaylist.append(dataJson)
        sizePlaylist = len(current_user.currentPlaylist)
        self.response.out.write('<p>size of currentplaylist is: %s</p>' % sizePlaylist)
        # whenever the list gets to 30 I cut it back to 20
        if sizePlaylist > 30:
            # drop the 10 oldest entries in one go (pop(i) over a shifting
            # list would skip every other item)
            del current_user.currentPlaylist[:10]
        current_user.userCount += 1
        current_user.put()
        Updater().send_update(dataJson)
    else:
        self.response.headers['Content-Type'] = 'text/html'
        self.response.out.write('user_not_logged_in')
where Updater is my method for updating the feed webpage via the Channel API.
Now, it all works - I can see that each user has a ListProperty with 20-30 links (when it hits 30, I cut it down to 20) - but the prices are quite high...
Each POST like the one above takes ~200ms, 121 cpu_ms, cpm_usd = 0.003588. This is very expensive considering all I do is save a string to a list...
I think the problem might be that the entity gets big because of the large ListProperty?
First, you're right to worry about lots of writes to the GAE datastore - my own experience is that they're very expensive compared to reads. For instance, an app of mine that did nothing but insert records into a single model table exhausted the free quota with a few tens of thousands of writes per day. So handling writes efficiently translates directly into your bottom line.
First Question
I wouldn't store links as separate entities. The datastore is not an RDBMS, so standard normalization practices do not necessarily apply. For each User entity, use a ListProperty to store the most recent URLs along with their metadata (you can serialize everything into a string).
This is efficient for writing, since you only update a single record - there are no updates to all the link records whenever the user adds links. Keep in mind that to maintain a rolling (FIFO) list with URLs stored as separate entities, every new URL means two write actions - an insert of the new URL and a delete to remove the oldest one.
It's also efficient for reading since a single read on the user record gives you all the data you need to render the User's feed.
From a storage perspective, the total number of URLs in the world far exceeds your number of users (even if you become the next Facebook), and so does the variance of URLs chosen by your users, so the average URL will likely have a single user - there is no real gain from RDBMS-style normalization of the data.
Another optimization idea: if your users usually add several links within a short period, you can try to write them in bulk rather than separately. Use memcache to store newly added user URLs, and the Task Queue to periodically flush that transient data to the persistent datastore (a sketch of this follows below). I'm not sure what the resource cost of using tasks is, though - you'll have to check.
Here's a good article to read on the subject.
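A rough sketch of that batching idea, using memcache as the buffer and the deferred library (which rides on the Task Queue) for the periodic flush. It reuses the asker's UserChannel model; record_link and flush_links are illustrative names, and the memcache read-modify-write here is not atomic, so treat this as a sketch rather than production code:

from google.appengine.api import memcache
from google.appengine.ext import deferred

def record_link(user_nickname, link_json):
    # Buffer the new link in memcache under a per-user key.
    key = 'pending-links:' + user_nickname
    pending = memcache.get(key) or []
    pending.append(link_json)
    memcache.set(key, pending)
    # Flush in ~30s; repeated defers are harmless because flush_links drains the buffer.
    deferred.defer(flush_links, user_nickname, _countdown=30)

def flush_links(user_nickname):
    key = 'pending-links:' + user_nickname
    pending = memcache.get(key) or []
    if not pending:
        return
    memcache.delete(key)
    user = UserChannel.get_by_key_name(user_nickname)
    user.currentPlaylist.extend(pending)  # one datastore write for the whole batch
    del user.currentPlaylist[:-20]        # keep only the 20 most recent
    user.put()

The trade-off mentioned above applies here: memcache entries can be evicted at any time, so a buffered link can be lost before the flush reaches the datastore.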
Second Question
Use counters. Just keep in mind that they aren't trivial in a distributed environment, so read up - there are many GAE articles, recipes, and blog posts on the subject; just google "appengine counters". Here too, memcache should be a good option for reducing the total number of datastore writes.
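A minimal sketch of a memcache-buffered counter (model and key names are illustrative); increments accumulate cheaply in memcache and only occasionally cost a datastore write:

from google.appengine.api import memcache
from google.appengine.ext import db

FLUSH_EVERY = 100  # one datastore write per ~100 increments

class Counter(db.Model):
    count = db.IntegerProperty(default=0)

def increment(name):
    # memcache.incr is atomic; initial_value seeds the key on first use.
    n = memcache.incr('counter:' + name, initial_value=0)
    if n is not None and n >= FLUSH_EVERY:
        # Move the buffered increments to the datastore in one write.
        memcache.decr('counter:' + name, delta=n)
        counter = Counter.get_or_insert(name)
        counter.count += n
        counter.put()

Evicted memcache entries mean lost increments, and concurrent flushes can race - which is why the recipes mentioned above (e.g. sharded counters) exist; this sketch trades exactness for cheap writes.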
Answer 1
Store links as separate entities. Also store an entity per user with a ListProperty holding the keys of the 20 most recent links. As the user chooses more links, you just update the ListProperty of keys. A ListProperty maintains order, so you don't need to worry about the chronological order of the links as long as you follow a FIFO insertion order.
When you want to show the user's chosen links (page 2), you can do one get(keys) call to fetch all of the user's links at once.
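A minimal sketch of that layout with the db API (all names are illustrative):

from google.appengine.ext import db

class Link(db.Model):
    url = db.StringProperty()
    chosen_at = db.DateTimeProperty(auto_now_add=True)

class UserFeed(db.Model):
    recent = db.ListProperty(db.Key)  # keys of the 20 most recent Links, oldest first

def add_link(feed, url):
    link = Link(url=url)
    link.put()
    feed.recent.append(link.key())
    del feed.recent[:-20]  # FIFO: keep only the 20 newest keys
    feed.put()

def recent_links(feed):
    # One batch get fetches every link shown on the user's feed page.
    return db.get(feed.recent)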
Answer 2
Definitely keep counters. As the number of entities grows, the cost of counting records by query keeps increasing, but with counters the performance remains the same.

App engine app design questions

I want to load info from another site (this part is done), but I am doing it every time the page is loaded, and that won't do. So I was thinking of having a variable in a settings table, like 'last checked bbc site', and when the page loads it would check whether enough time has passed since the last check to check again. Is there anything silly about doing it that way?
Also, do I absolutely have to use tables to store one-off variables like this setting?
I think there are two options that would work for you, besides creating an entity in the datastore to keep track of the "last visited time".
One way is to just check the external page periodically, using the cron api as described by jldupont.
The second way is to store the last visited time in memcache. Although memcache is not permanent, it doesn't have to be if you are only storing last refresh times. If your entry in memcache were to disappear for some reason, the worst that would happen would be that you would fetch the page again, and update memcache with the current date/time.
The first way would be best if you want to check the external page at regular intervals. The second way might be better if you want to check the external page only when a user clicks on your page, and you haven't fetched that page yourself in the recent past. With this method, you aren't wasting resources fetching the external page unless someone is actually looking for data related to it.
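A minimal sketch of the memcache variant (the key name and the 10-minute interval are illustrative):

import time
from google.appengine.api import memcache

REFRESH_SECONDS = 600  # re-fetch the external page at most every 10 minutes

def page_needs_refresh():
    last = memcache.get('bbc-last-fetch')
    if last is not None and time.time() - last < REFRESH_SECONDS:
        return False
    # If the memcache entry was evicted, we simply fetch again - the worst
    # case described above.
    memcache.set('bbc-last-fetch', time.time())
    return True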
You could also use Scheduled Tasks.
Also, you don't absolutely need to use the Datastore for configuration parameters: you could keep these in a script / config file.
If you want some handler in your GAE app (including one for a scheduled task, reception of messages, web page visits, etc.) to store new information in such a way that some handler in the future can recover it, then GAE's storage is the only good general way (memcache could expire from under you, for example). Not sure what you mean by "tables" (?!), but guessing that you actually mean GAE's storage, the answer is "yes". (Under very specific circumstances you might want to put that data somewhere else on the network, such as your visitor's browser, e.g. via cookies, or an Amazon storage instance, etc., but those specific circumstances do not appear to apply to your use case.)
