I have a system that sends concurrent get queries to Couchbase. Every time the system gets an existing key, it should extend (prolong) that key's lifetime. The exact amount of time is not very important and is measured in days: the main idea is that a key should be removed once nobody has read it for some time (20 days, for example).
I guess the touch operation should be used, but should I also lock the keys, which would make things more difficult? Is it OK to use a memcached package (it does not seem to have a lock API, but perhaps gets would do the trick)?
import pylibmc

class Cache(Singleton):
    def __init__(self):
        self.mc = pylibmc.Client(
            # connection settings here
        )

    def get(self, key):
        """get key without locking it and update lifetime"""
        result = self.mc.get(key)
        if result is not None:
            # extend the key's lifetime for another 20 days
            self.mc.touch(key, 60*60*24*20)
        return result

    def get_and_lock(self, key):
        """lock the key while getting it and update lifetime"""
        # should use couchbase package as memcached does not have lock API
        # or use 'gets' instead?
I think you have mixed up two independent topics.
The first is the touch operation, which takes the new lifetime as a parameter and sets a new time to live for the data object.
The lock operation is unrelated to the time to live. By default, Couchbase uses optimistic locking (via a CAS value) to guarantee the consistency of updates. This means that the data object is not locked, because locking is expensive and very often useless (usually no other operation requests the locked object in the meantime); instead there is a CAS value that changes with every update. Nevertheless, if you know beforehand that the data object will be accessed very often (which means there will be a lot of concurrent updates), you can decide to use pessimistic locking (i.e. the lock operation). But this behaviour is not related to the time to live at all (by the way, it is possible to give locks a time to live, too).
Conclusion: your touch command will work. Never use pessimistic locking unless you know that you really need it. Optimistic locking is fine for most cases (there is a reason why Couchbase chose optimistic locking as the default behaviour).
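To make the optimistic approach concrete, here is a minimal in-memory sketch of a CAS retry loop. It does not use the Couchbase SDK; TinyStore, CasMismatch and increment are hypothetical names invented for this illustration, and only the idea (read value plus CAS, write back only if the CAS is unchanged, retry otherwise) comes from the answer above.

```python
import threading


class CasMismatch(Exception):
    """Raised when a write carries a stale CAS token."""


class TinyStore:
    """In-memory stand-in for a store that versions each value with a CAS token."""

    def __init__(self):
        self._data = {}                 # key -> (value, cas)
        self._guard = threading.Lock()  # protects the dict itself, not the keys

    def get(self, key):
        with self._guard:
            return self._data.get(key, (None, 0))

    def replace(self, key, value, cas):
        with self._guard:
            _, current = self._data.get(key, (None, 0))
            if current != cas:
                raise CasMismatch(key)
            self._data[key] = (value, current + 1)


def increment(store, key, amount=1, max_retries=100):
    """Optimistic update: nobody is blocked, a losing writer simply retries."""
    for _ in range(max_retries):
        value, cas = store.get(key)
        try:
            store.replace(key, (value or 0) + amount, cas)
            return
        except CasMismatch:
            continue  # someone else updated the key first; read again
    raise RuntimeError("gave up after %d attempts" % max_retries)
```

A writer that loses the race only pays for one extra read and write, which is why this is usually cheaper than holding a lock on every access.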
A better solution for your implementation: according to the API, you can pass a ttl parameter to the get operation as well, for example:
get(key, ttl=60*60*24*20)
This will modify the ttl of the object you got and you won't need the additional touch command!
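The effect of a get that also resets the ttl can be sketched without a server. The class below is a hypothetical in-memory model (SlidingCache and its clock parameter are made up for the example, not part of any Couchbase or memcached client); it only illustrates the "expire N seconds after the last read" behaviour the question asks for.

```python
import time


class SlidingCache:
    """Toy cache whose entries expire a fixed time after they were last refreshed."""

    def __init__(self, clock=time.time):
        self._items = {}     # key -> (value, expires_at)
        self._clock = clock  # injectable so the behaviour can be tested

    def set(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key, ttl=None):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: behave as if the key never existed
            return None
        if ttl is not None:
            # A successful read pushes the expiry forward, like get(key, ttl=...)
            self._items[key] = (value, self._clock() + ttl)
        return value
```

Because the read and the refresh happen in one call, there is no window between a separate get and touch in which the key could expire.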
I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that a connection represents just that: a connection to a database. Information you pass into it will come out pretty plain. It won't keep track of your transactions unless you tell it to, and it won't care about the order in which you send it data. So if it matters that you create your Widget object before you create your Sprocket object, you had better do that in a thread-safe context. The same generally goes for keeping track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions. Without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while another thread still expects to be able to add another object (or several).
In case you don't know what a transaction is, Wikipedia covers it in depth, but the short version is that transactions help make sure your data stays consistent. If you have 15 inserts and updates, and insert 15 fails, you might not want to keep the other 14. A transaction lets you cancel the entire operation in bulk.
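As a small illustration of the all-or-nothing behaviour described above, here is a sketch that uses the standard-library sqlite3 module instead of SQLAlchemy so that it runs without extra dependencies. The widget table and the function names are made up for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one table that rejects duplicate names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE widget (name TEXT NOT NULL UNIQUE)")
    return conn


def insert_all_or_nothing(conn, names):
    """Insert every name in one transaction; if any insert fails, keep none."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            for name in names:
                conn.execute("INSERT INTO widget (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        return False
    return True


def count_widgets(conn):
    return conn.execute("SELECT COUNT(*) FROM widget").fetchone()[0]
```

If the last name in a batch violates the UNIQUE constraint, the earlier inserts of that batch are rolled back together with it.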
I have a game in which users contact a server to find a user of their level who wants to play a game. Here is the basic architecture of a game request.
I am using ndb to store a waiting queue for each user level in the Google DataStore.
I am accessing these queues by their keys to ensure strong consistency (per this article). The entities are stored in the queue using a repeated (list of) LocalStructuredProperty.
Questions:
An entity is deleted from a waiting queue because it is matched to a request. The transaction is committed but not yet applied. That same entity is matched with another request and deleted. Will this throw an error?
These strongly consistent accesses are limited to ~1 write/sec. Is there a better architecture that would eliminate this constraint?
One thing I've considered for the latter question is to maintain multiple queues (whose number grows and shrinks with demand).
Not sure about your first question, but you might be able to simulate it with a sleep statement in your transaction.
For your second question, there is another architecture that you could use. If the waiting queue duration is relatively short (minutes instead of hours), you might want to use memcache. It will be a lot faster than writing to disk and you can avoid dealing with consistency issues.
1.- If you do the entity get and the put inside a transaction, the same entity cannot be matched to two requests, so no error occurs and the queue remains consistent.
2.- The limit of 1 write per second applies to transactions inside the same entity group. If you need more, you can shard the queue entity.
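The sharding idea in point 2 can be sketched in memory as follows. This is not ndb code: ShardedQueue is a hypothetical stand-in in which each deque plays the role of one queue entity in its own entity group, so that each write touches only one of them.

```python
import random
from collections import deque


class ShardedQueue:
    """In-memory stand-in for a waiting queue split across several shards."""

    def __init__(self, num_shards=4, rng=None):
        self._shards = [deque() for _ in range(num_shards)]
        self._rng = rng or random.Random()

    def put(self, player_id):
        # Each put touches a single, randomly chosen shard, so the per-shard
        # write rate is roughly the total rate divided by the shard count.
        self._rng.choice(self._shards).append(player_id)

    def pop(self):
        # Start at a random shard and take the first waiting player found.
        start = self._rng.randrange(len(self._shards))
        for offset in range(len(self._shards)):
            shard = self._shards[(start + offset) % len(self._shards)]
            if shard:
                return shard.popleft()
        return None  # every shard is empty
```

The trade-off is that strict first-in-first-out order across the whole queue is lost; only the order within a shard is kept.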
You can use a dedicated memcache or a Redis instance to avoid contention. These are much faster than the datastore.
See how these guys use tree nodes to do the match making:
https://www.youtube.com/watch?v=9nWyWwY2Onc
I'm working on a multiprocessed application, and each process sometimes executes the following code:
db_cursor.execute("SELECT MAX(id) FROM prqueue;")
for record in db_cursor.fetchall():
    if record[0]:
        db_cursor.execute("DELETE FROM prqueue WHERE id='%s'" % record[0])
        db_connector.commit()
And I'm facing the following problem: two processes may read the same maximum value and both try to delete it. That is not acceptable in my application; each value must be taken (and deleted) by exactly one process.
How can I achieve this? Is locking the table while taking the maximum and deleting it absolutely necessary, or is there another, nicer way to do that?
Thank you.
Consider simulating record locks with GET_LOCK():
Choose a name specific to the operation you want to lock, e.g. 'prqueue_max_del'.
Call SELECT GET_LOCK('prqueue_max_del',30) to lock the name 'prqueue_max_del'. It will return 1 and set the lock if the name becomes available, or return 0 if the lock is not available after 30 seconds (the second parameter is the timeout).
Use SELECT RELEASE_LOCK('prqueue_max_del') when you are finished.
You will have to use the same name in each transaction. Note that in MySQL versions before 5.7.5, calling GET_LOCK() again in a transaction releases the previously set lock.
Beware: as only the abstract name is locked, any process that does not use this method and the same name will still be able to savage your table independently.
GET_LOCK() docs
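The steps above could be wrapped roughly like this. It is a sketch only, assuming a DB-API connection from a MySQL driver that uses the %s parameter style (as MySQLdb and PyMySQL do); named_lock and pop_max_id are hypothetical helper names, while the lock name and table come from the question and answer.

```python
from contextlib import contextmanager

LOCK_NAME = "prqueue_max_del"  # the name suggested in the answer above


@contextmanager
def named_lock(cursor, name, timeout=30):
    """Hold a MySQL named lock for the duration of the with-block."""
    cursor.execute("SELECT GET_LOCK(%s, %s)", (name, timeout))
    (acquired,) = cursor.fetchone()
    if acquired != 1:
        raise RuntimeError("could not acquire lock %r within %s s" % (name, timeout))
    try:
        yield
    finally:
        cursor.execute("SELECT RELEASE_LOCK(%s)", (name,))
        cursor.fetchone()  # consume the single result row


def pop_max_id(connection):
    """Delete and return the largest id in prqueue, or None if the table is empty."""
    cursor = connection.cursor()
    with named_lock(cursor, LOCK_NAME):
        cursor.execute("SELECT MAX(id) FROM prqueue")
        (max_id,) = cursor.fetchone()
        if max_id is not None:
            cursor.execute("DELETE FROM prqueue WHERE id = %s", (max_id,))
        # Commit before the lock is released so the next holder sees the delete.
        connection.commit()
        return max_id
```

Every process has to go through pop_max_id (or at least the same lock name) for this to help; the lock does nothing against code that bypasses it.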
I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due to the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships-- thereby triggering the automatic addition to the identity map despite being created in separate, concurrent, threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence why I believe the identity map has already been populated, because it's when I try to add and commit using the global session outside of the concurrent code, the assertion error is triggered.
The final detail is that I cannot find a way to get a reference to whatever session object(s) are involved (scoped or otherwise, as I don't fully understand how automatic addition to the identity map works with multithreading), so even if I wanted to deal with a separate session per process, I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
import functools
import multiprocessing

class ObjA(object):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(object):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(object):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b

def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]

def parallel_generate():
    list_of_a = session.query(ObjA).all() # assume there are 1000 of these
    list_of_b = session.query(ObjB).all() # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    # pool.map returns one list of ObjC per ObjA; flatten into a single list
    all_the_things = [c for chunk in pool.map(fxn, list_of_a) for c in chunk]
    return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: As far as I could tell, the error occurs during the adding phase; the commit attempt never even happens because the assertion fires pre-commit.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
    session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascading) are still beyond me, and I cannot explain why the conflict originally occurred. All I know is that using the merge function causes SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious about the why, if anyone happens across this.
I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another way the way to go?
Intro
There seem to be a lot of arm-chair suggestions and no working examples. None of the answers listed here even suggest using multiprocessing, which is quite disappointing and disturbing. As Python lovers we should support our built-in libraries, and while parallel processing and synchronization are never a trivial matter, I believe they can be made trivial with proper design. This is becoming extremely important in modern multi-core architectures and cannot be stressed enough! That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy, with quite a few pitfalls and bugs, and it is geared towards functional programming (which I detest). Currently I still prefer the Pyro module (which is way ahead of its time) over multiprocessing, due to multiprocessing's severe limitation of being unable to share newly created objects while the server is running. The "register" class method of the manager objects will only actually register an object BEFORE the manager (or its server) is started. Enough chatter, more code:
Server.py
from multiprocessing.managers import SyncManager

class MyManager(SyncManager):
    pass

syncdict = {}

def get_dict():
    return syncdict

if __name__ == "__main__":
    MyManager.register("syncdict", get_dict)
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.start()
    raw_input("Press any key to kill server".center(50, "-"))
    manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager, which can supply synchronized shared objects. This code will not work when run in the interactive interpreter, because the multiprocessing library is quite touchy about how it finds the "callable" for each registered object. Running Server.py starts a customized SyncManager that shares the syncdict dictionary among multiple processes; clients can connect from the same machine or, if the server runs on an IP address other than loopback, from other machines. In this case the server runs on loopback (127.0.0.1) on port 5000. The authkey parameter makes the manager authenticate every connection that manipulates syncdict. When the user presses Enter (raw_input waits for a full line), the manager is shut down.
Client.py
from multiprocessing.managers import SyncManager
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("syncdict")

if __name__ == "__main__":
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.connect()
    syncdict = manager.syncdict()

    print "dict = %s" % (dir(syncdict))
    key = raw_input("Enter key to update: ")
    inc = float(raw_input("Enter increment: "))
    sleep = float(raw_input("Enter sleep time (sec): "))

    try:
        #if the key doesn't exist create it
        if not syncdict.has_key(key):
            syncdict.update([(key, 0)])
        #increment key value every sleep seconds
        #then print syncdict
        while True:
            syncdict.update([(key, syncdict.get(key) + inc)])
            time.sleep(sleep)
            print "%s" % (syncdict)
    except KeyboardInterrupt:
        print "Killed client"
The client must also create a customized SyncManager and register "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customized SyncManager to connect to the manager started in Server.py, using the loopback IP address (127.0.0.1), port 5000 and the same authkey to authenticate. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
The key in syncdict to operate on
The amount to increment the value accessed by the key every cycle
The amount of time to sleep per cycle in seconds
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
Annoying problems
The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
Closing
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze, and thanks to this answer I have now hit the nail on the head. I hope this is useful to the Python community as input on how to improve the multiprocessing module; I do believe it has a great deal of promise, but in its infancy it falls short of what is possible. Despite the annoying problems described, I think this is still quite a viable alternative and is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show. Depending on your requirements that would probably be an even simpler solution; it just feels unnatural to me.
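For comparison, the "pass the proxy to Processes as an argument" variant mentioned in the closing paragraph might look like the sketch below. It assumes Python 3 (the code above is Python 2), and add_to_key, worker and run_demo are hypothetical names. A read followed by a write on a managed dict is two separate proxy calls, so the explicit lock is what makes the += atomic across processes.

```python
import multiprocessing


def add_to_key(shared, lock, key, amount):
    """Read-modify-write one key while holding the lock, so no update is lost."""
    with lock:
        shared[key] = shared.get(key, 0) + amount


def worker(shared, lock, key, repeats):
    for _ in range(repeats):
        add_to_key(shared, lock, key, 1)


def run_demo(processes=4, repeats=100):
    """Start a manager, let several processes increment one key, return the result."""
    with multiprocessing.Manager() as manager:
        shared = manager.dict()  # proxy to a dict living in the manager process
        lock = manager.Lock()    # proxy to a lock shared by all workers
        procs = [
            multiprocessing.Process(target=worker, args=(shared, lock, "hits", repeats))
            for _ in range(processes)
        ]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return dict(shared)      # copy out before the manager shuts down


if __name__ == "__main__":
    print(run_demo())
```

Unlike the Server.py and Client.py pair, this only works for processes started by the same parent, which is the limitation the answer complains about.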
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclib to make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment to perform the increment and one taking just the key and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistent shared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
In response to the request for an appropriate solution to the concurrent-write issue: I did some very quick research and found an article that suggests a lock/semaphore solution (http://effbot.org/zone/thread-synchronization.htm).
While the example isn't specific to a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea.
If I had a requirement to implement something like this in a thread-safe manner, I'd probably use the Python Semaphore solution (assuming my earlier merge technique wouldn't work). I believe that semaphores generally reduce thread efficiency due to their blocking nature.
From the site:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
import threading

semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
# ... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
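A class-based wrapper along those lines could look like this minimal sketch. LockedCounterDict is a hypothetical name, and it uses a plain threading.Lock, which plays the same role as a semaphore whose counter starts at one. It covers threads only, not separate processes.

```python
import threading


class LockedCounterDict:
    """Dictionary wrapper whose increment is atomic with respect to other threads."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def increment(self, key, amount=1):
        # The whole read-modify-write happens while the lock is held.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def snapshot(self):
        # Return a copy so callers never iterate over the live dict.
        with self._lock:
            return dict(self._data)


def hammer(counter, key, repeats):
    for _ in range(repeats):
        counter.increment(key)
```

Using the lock as a context manager guarantees it is released even if the body raises, which a bare acquire()/release() pair does not.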
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain their own instance of a dictionary and either merge at the end of the thread processing or periodically use a call-back to merge copies of the individual thread dictionaries together?
I don't know exactly what you are doing, so keep in mind that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.
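A rough sketch of that merge-at-the-end design, using threads and collections.Counter: the function names and the word-counting task are invented for the example. Each worker fills only its own Counter, so nothing is shared while the workers run.

```python
import threading
from collections import Counter


def count_items(items, results, index):
    """Worker: tally into a private Counter, then publish it in its own slot."""
    local = Counter()
    for item in items:
        local[item] += 1
    results[index] = local  # each worker writes to a different index


def parallel_count(chunks):
    """Run one worker per chunk and merge the private Counters afterwards."""
    results = [None] * len(chunks)
    threads = [
        threading.Thread(target=count_items, args=(chunk, results, i))
        for i, chunk in enumerate(chunks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = Counter()
    for local in results:
        total.update(local)  # Counter.update adds counts instead of replacing them
    return total
```

The merge runs in a single thread after every join, so it needs no locking at all.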