I'm working on a multiprocessed application, and each process sometimes executes the following code:
db_cursor.execute("SELECT MAX(id) FROM prqueue;")
for record in db_cursor.fetchall():
if record[0]:
db_cursor.execute("DELETE FROM prqueue WHERE id='%s'" % record[0]);
db_connector.commit()
And I'm facing the following problem: two processes may read the same maximum value and both try to delete it. That is not acceptable in the context of my application: each value must be taken (and deleted) by exactly one process.
How can I achieve this? Is locking the table while taking the maximum and deleting it absolutely necessary, or is there another, nicer way to do that?
Thank you.
Consider simulating record locks with GET_LOCK();
Choose a name specific to the operation you want to lock, e.g. 'prqueue_max_del'.
Call SELECT GET_LOCK('prqueue_max_del', 30) to lock the name 'prqueue_max_del'. It returns 1 and sets the lock if the name is available, or returns 0 if the lock could not be obtained within 30 seconds (the second parameter is the timeout).
Use SELECT RELEASE_LOCK('prqueue_max_del') when you are finished.
You will have to use the same name in each process; note that calling GET_LOCK() again in the same session releases the previously held lock.
Beware: since only the abstract name is locked, any other process that does not use this method and the same name is still free to modify your table.
GET_LOCK() docs
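For illustration, here is a minimal Python sketch of the GET_LOCK() approach, reusing the db_cursor/db_connector names from the question (any DB-API driver such as MySQLdb or pymysql should behave the same way):

db_cursor.execute("SELECT GET_LOCK('prqueue_max_del', 30)")
(got_lock,) = db_cursor.fetchone()
if got_lock == 1:
    try:
        db_cursor.execute("SELECT MAX(id) FROM prqueue")
        row = db_cursor.fetchone()
        if row and row[0] is not None:
            # parameterized query instead of string interpolation
            db_cursor.execute("DELETE FROM prqueue WHERE id = %s", (row[0],))
            db_connector.commit()
    finally:
        # always give the name back, even if the delete failed
        db_cursor.execute("SELECT RELEASE_LOCK('prqueue_max_del')")
else:
    pass  # could not obtain the lock within 30 seconds; retry or give up

Every process that runs this block waits on the name before touching the queue, so only one of them can grab and delete the current maximum at a time.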
I have a system that sends concurrent get queries to Couchbase. Every time the system gets an existing key, it should update (prolong) its lifetime. The exact amount of time is not so important and is measured in days: the main idea is that a key should be removed after nobody has fetched it for some time (20 days, for example).
I guess the touch operation should be used, but should I also lock the keys, which would make things more difficult? Is it OK to use the memcached package (it does not seem to have a lock API, but perhaps gets would do the trick)?
import pylibmc

class Cache(Singleton):
    def __init__(self):
        self.mc = pylibmc.Client(
            # connection settings here
        )

    def get(self, key):
        """Get key without locking it and update lifetime."""
        result = self.mc.get(key)
        if result:
            # prolong key for another 20 days
            self.mc.touch(key, 60*60*24*20)
        return result

    def get_and_lock(self, key):
        """Lock the key while getting it and update lifetime."""
        # should use couchbase package as memcached does not have lock API
        # or use 'gets' instead?
I think that you mixed up two independent topics.
First is the touch operation, which takes the new lifetime as a parameter. This simply sets a new time to live for the data object.
The lock operation is unrelated to the time to live. By default, Couchbase uses optimistic locking (by means of a CAS value) to guarantee the consistency of updates. This means the data object is not actually locked, because locking is expensive and very often pointless (most of the time no other operation touches the locked object in the meantime); instead, there is a CAS value that changes with every update. Nevertheless, if you know beforehand that a data object will be accessed very often (meaning there will be many concurrent updates), you can decide to use pessimistic locking (i.e. the lock operation). But this behaviour is not related to the time to live at all (by the way, locks can be given a time to live, too).
Conclusion: your touch command would work. Never use pessimistic locking unless you know you really need it. Optimistic locking is perfect for most cases (there is a reason Couchbase chose optimistic locking as the default behaviour!).
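As a rough illustration of the optimistic (CAS) path, here is a sketch assuming the Couchbase Python SDK 2.x Bucket API (get/replace, with KeyExistsError raised on a CAS mismatch); the connection string and key handling are made up:

from couchbase.bucket import Bucket
from couchbase.exceptions import KeyExistsError

bucket = Bucket('couchbase://localhost/default')  # hypothetical bucket

def add_score(key, delta):
    """Read-modify-write protected only by the CAS value, with retry."""
    while True:
        rv = bucket.get(key)          # rv.cas identifies the version we read
        try:
            bucket.replace(key, rv.value + delta, cas=rv.cas)
            return
        except KeyExistsError:
            # someone updated the key in between; re-read and try again
            continue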
Better solution for your implementation: according to the API, you can pass the get operation a ttl parameter as well, like
get(key, ttl=60*60*24*20)
This modifies the TTL of the object you fetched, so you won't need the additional touch command.
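Applied to the Cache class from the question, the get-and-touch variant might look roughly like this, assuming the Couchbase Python SDK 2.x client (the Singleton base and connection string are omitted/invented here, and the expiry parameter name can differ between SDK versions):

from couchbase.bucket import Bucket
from couchbase.exceptions import NotFoundError

TWENTY_DAYS = 60 * 60 * 24 * 20

class Cache(object):
    def __init__(self):
        self.bucket = Bucket('couchbase://localhost/default')  # hypothetical

    def get(self, key):
        """Get the key and prolong its lifetime in a single operation."""
        try:
            rv = self.bucket.get(key, ttl=TWENTY_DAYS)  # get-and-touch
            return rv.value
        except NotFoundError:
            return None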
I'm using the Python multiprocessing library to generate several processes that each write to a shared (MongoDB) database. Is this safe, or will the writes overwrite each other?
So long as you make sure to create a separate database connection for each worker process, it's perfectly safe to have multiple processes accessing a database at the same time. Any queries they issue that change the database will be applied individually, typically in the order the database receives them. In most situations this will be safe, but:
If your processes are all just inserting documents into the database, each insert will typically create a separate object.
The exception is if you explicitly specify an _id for a document, and that identifier has already been used within the collection. This will cause the insert to fail. (So don't do that: leave the _id out, and MongoDB will always generate a unique value for you.)
If your processes are deleting documents from the database, the operation will fail if another process has already deleted the same object. (This is not strictly a failure, though; it just means that someone else got there before you.)
If your processes are updating documents in the database, things get murkier.
So long as each process is updating a different document, you're fine.
If multiple processes are trying to update the same document at the same time, you need to be careful. Updates that replace values on an object are applied in order, which can cause changes made by one process to be inadvertently overwritten by another. Be careful not to specify fields you don't intend to change, and consider MongoDB's update operators ($inc, $set, etc.) to perform operations such as changing a numeric field atomically; a short pymongo sketch follows the diagram below.
Note that "at the same time" doesn't necessarily mean that operations are occurring at exactly the same time. It means more generally that there's an "overlap" in the time two processes are working with the same document, e.g.
Process A                      Process B
---------                      ---------
Reads object from DB           ...
working...                     Reads object from DB
working...                     working...
updates object with changes    working...
                               updates object with changes
In the above situation, it's possible for some of the changes made by process A to inadvertently be overwritten by process B.
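For example, with a reasonably recent pymongo and the update operators, the read-modify-write cycle above disappears entirely because the server applies the change atomically (the database, collection, and field names here are only illustrative):

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()              # create one client per worker process
scores = client.mydb.scores         # hypothetical collection

def record_point(user_id):
    # $inc changes the counter atomically on the server, and $set only
    # touches the named field, so concurrent workers cannot clobber
    # each other's unrelated changes.
    scores.update_one(
        {'_id': user_id},
        {'$inc': {'points': 1}, '$set': {'last_seen': datetime.utcnow()}},
        upsert=True,
    )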
In short, yes it is perfectly reasonable (and actually preferred) to let your database worry about the concurrency of your database operations.
Any relevant database driver (MongoDB included) will handle concurrent operations for you automatically.
How can I get a task by name?
from google.appengine.api import taskqueue
taskqueue.add(name='foobar', url='/some-handler', params={'foo': 'bar'})
task_queue = taskqueue.Queue('default')
task_queue.delete_tasks_by_name('foobar') # would work
# looking for a method like this:
foobar_task = task_queue.get_task_by_name('foobar')
It should be possible with the REST API (https://developers.google.com/appengine/docs/python/taskqueue/rest/tasks/get). But I would prefer something like task_queue.get_task_by_name('foobar'). Any ideas? Did I miss something?
There is no guarantee that the task with this name exists - it may have been already executed. And even if you manage to get a task, it may be executed while you are trying to do something with it. So when you try to put it back, you have no idea if you are adding it for the first time or for the second time.
Because of this uncertainty, I can't see any use case where getting a task by name may be useful.
EDIT:
You can give a name to your task in order to ensure that a particular task only executes once. When you add a task with a name to a queue, App Engine will check if the task with such name already exists. If it does, the subsequent attempt will fail.
For example, you may have many instances running, and each instance may need to insert an entity into the Datastore. Your first option is to check whether the entity already exists in the datastore. This is a relatively slow operation, and by the time you receive the response (entity does not exist) and decide to insert it, another instance may have already inserted it. So you end up with two entities instead of one.
Your second option is to use tasks. Instead of inserting the new entity directly into the datastore, an instance creates a named task that inserts it. If another instance tries to add a task with the same name, the add simply fails (as described above), so the insert is scheduled at most once. As a result, you are guaranteed that the entity will be inserted only once.
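A minimal sketch of that pattern (the handler URL and entity id are made up; TaskAlreadyExistsError and TombstonedTaskError are what the taskqueue API raises for duplicate or already-executed names):

from google.appengine.api import taskqueue

def enqueue_insert(entity_id):
    try:
        taskqueue.add(
            name='insert-%s' % entity_id,    # same name => at most one task is accepted
            url='/tasks/insert-entity',      # hypothetical handler that does the datastore put
            params={'entity_id': entity_id},
        )
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # Another instance already enqueued (or the task already ran); nothing to do.
        pass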
I have a MySQL table that uses a write-lock mechanism. The lock might be held for quite a while (we're talking about 1-2 minutes here).
I had to add a check for whether the table is in use before the update is done (using the beforeUpdate signal),
but even after the check reports that the table is in use, the system hangs until the other user unlocks the table. Is it possible to prevent the data from updating when the flag says the table is in use?
I'm searching for a better way to handle this. I don't want to re-implement the setData method, since doing that is a pain; but if you have a good re-implementation for it, that would be very helpful.
Thanks in advance.
Python threading: http://docs.python.org/library/thread.html. You can create threads that wait until the table is free; the cost in system resources is negligible, and your end user won't have to wait for the system to respond before continuing with a different task.
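As a rough sketch of that idea, the blocking UPDATE can be pushed onto a worker thread so the caller (for example the GUI) keeps responding; note that the connection must be safe to use from that thread, which depends on your driver:

import threading

def update_in_background(db_connector, query, params):
    """Run a potentially blocking statement without freezing the caller."""
    def worker():
        cursor = db_connector.cursor()
        cursor.execute(query, params)   # blocks here until the table lock is released
        db_connector.commit()

    t = threading.Thread(target=worker)
    t.daemon = True                     # do not keep the process alive just for this
    t.start()
    return t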
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), so that whenever you update a record it checks whether more than an hour has passed since the last snapshot and, if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
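A rough sketch of that idea with the old db API; the model, property names, and interval are made up for illustration:

from datetime import datetime, timedelta
from google.appengine.ext import db

SNAPSHOT_INTERVAL = timedelta(hours=1)

class UserRankSnapshot(db.Model):
    # key_name encodes user + hour, see UserRank.put() below
    score = db.IntegerProperty()
    taken_at = db.DateTimeProperty(auto_now_add=True)

class UserRank(db.Model):
    user_id = db.StringProperty(required=True)
    score = db.IntegerProperty(default=0)
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.utcnow()
        if self.last_snapshot is None or now - self.last_snapshot >= SNAPSHOT_INTERVAL:
            # Key name derived from the hour: two concurrent updates in the
            # same hour write the same snapshot entity instead of two copies.
            key_name = '%s:%s' % (self.user_id, now.strftime('%Y%m%d%H'))
            UserRankSnapshot(key_name=key_name, score=self.score).put()
            self.last_snapshot = now
        return super(UserRank, self).put(**kwargs)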
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful not to loop infinitely: make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
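A very rough sketch of that ping-pong loop, using webapp2 and datastore cursors (the handler paths, the UserRank model, and the process() call are placeholders, not part of the original article):

import webapp2

BATCH_SIZE = 50

class UpdateUsersA(webapp2.RequestHandler):
    """Hit by cron; processes one batch, then bounces to the partner URL."""
    def get(self):
        cursor = self.request.get('cursor') or None
        query = UserRank.all()               # hypothetical model defined elsewhere
        if cursor:
            query.with_cursor(cursor)
        batch = query.fetch(BATCH_SIZE)
        for user in batch:
            process(user)                    # hypothetical per-user work
        if batch:
            # More users left: continue under the other URL so App Engine
            # does not mistake the chain for an accidental redirect loop.
            self.redirect('/tasks/update-users-b?cursor=%s' % query.cursor())
        # otherwise the end state is reached and the loop stops

app = webapp2.WSGIApplication([('/tasks/update-users-a', UpdateUsersA)])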