edit: im asking if global variables are safe in a single-threaded web framework like tornado
im using the mongoengine orm, which gets a database connection from a global variable:
_get_db() # gets the db connection
im also using tornado, a single-threaded python web framework. in one particular view, i need to grab a database connection and dereference a DBRef object [similar to a foreign key]:
# dereference a DBRef
_get_db().dereference(some_db_ref)
since the connection returned by _get_db is a global var, is there a possibility of collision and the wrong value being returned to the wrong thread?
Threads are always required to hold the GIL when interacting with Python objects. The namespace holding the variables is a Python object (either a frameobject or a dict, depending on what kind of variable it is.) It's always safe to get or set variables in multiple threads. You will never get garbage data.
However, the usual race conditions do apply as to which object you get, or which object you replace when you assign. A statement like x += 1 is not thread-safe, because a different thread can run between the get and the store, changing the value of x, which you would then overwrite.
Assuming MongoEngine is wrapping PyMongo (and I believe it is), then you should be fine. PyMongo is completely thread-safe.
No, but locks are pretty straightforward to use in python. Use the try: finally: pattern to ensure that a lock is released after you modify your global variable.
There is nothing about globals that makes them any more or less thread safe than any other variables. Whether or not it's possible for an operation to fail or return incorrect results when run in different threads, the best practice is that you should protect data shared between threads.
If I'm reading you right, you're asking if a variable is safe in a single-threaded environment. In this case, where data is not shared between concurrent processes, the variable is safe (after all, there's nothing else running that may interrupt it).
Related
Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A made a request to a view called make_foo and the operation within that view takes time. We want to have a flag variable that says making_foo = True that is viewable by User B that will make a request and by any other user or service within that django app and be able to set it to False when done
Don't take the example too seriously, I know about task queues but what I am trying to understand is the idea of having a shared mutable variable across processes without the need to use a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your django server is running in production, there is not just one django process, there will be several worker threads running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because it's already been said before by other people, but Redis is an in-memory database that uses key-value storing (unlike how Django uses a model, Redis is essentially a giant dictionary). Redis is fast and most operations are atomic which means you are unlikely to encounter race conditions.
I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships-- thereby triggering the automatic addition to the identity map despite being created in separate, concurrent, threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence why I believe the identity map has already been populated, because it's when I try to add and commit using the global session outside of the concurrent code, the assertion error is triggered.
The final detail is that whatever session object(s), (scoped or otherwise, as I don't fully understand how automatic addition to the identity map works in the case of multithreading) I cannot find a way / don't know how to get a reference to them so that even if I wanted to deal with a separate session per process I could.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(object):
obj_c = relationship('ObjC', backref='obj_c')
class ObjB(object):
obj_c = relationship('ObjC', backref='obj_c')
class ObjC(object):
obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
obj_b_id = Column(Integer, ForeignKey('obj_b.id'))
def __init__(self, obj_a, obj_b):
self.obj_a = obj_a
self.obj_b = obj_b
def make_a_bunch_of_c(obj_a, list_of_b=None):
return [ObjC(obj_a, obj_b) for obj_b in list_of_b]
def parallel_generate():
list_of_a = session.query(ObjA).all() # assume there are 1000 of these
list_of_b = session.query(ObjB).all() # and 30 of these
fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
pool = multiprocessing.Pool(10)
all_the_things = pool.map(fxn, list_of_a)
return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: The error occurs during the adding phase, the commit attempt never even happens because the assertion occurs pre-commit. As far as I could tell.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via the relationship cascading) is still beyond me and I cannot explain why the conflict originally occurred. All I know is that using the merge function will cause SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious in the why, if anyone happens across this.
First off, I have no idea if "Ownership" is the correct term for this, it's just what I am calling it in Java.
I am currently building a Server that uses SQLite, and I am encountering errors concerning object "ownership":
I have one Module that manages the SQLite Database. Let's call it "pyDB". Simplified:
import threading
import sqlite3
class DB(object):
def __init__(self):
self.lockDB = threading.Lock()
self.conn = sqlite3.connect('./data.sqlite')
self.c = self.conn.cursor()
[...]
def doSomething(self,Param):
with self.lockDB:
self.c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
(Note that the lockDB object is there because the Database-Class can be called by multiple concurrent threads, and although SQLite itself is thread-safe, the cursor-Object is not, as far as I know).
Then I have a worker thread that processes stuff.
import pyDB
self.DB = pyDB.DB()
class Thread(threading.Thread):
[omitting some stuff that is not relevant here]
def doSomethingElse(self, Param):
DB.doSomething(Param)
If I am executing this, I am getting the following exception:
self.process(task)
File "[removed]/ProcessingThread.py", line 67, in process
DB.doSomething(Param)
File "[removed]/pyDB.py", line 101, in doSomething
self.c.execute(self,"SELECT * FROM xyz WHERE ID = ?", Param)
ProgrammingError: SQLite objects created in a thread can only be used in that same
thread.The object was created in thread id 1073867776 and this is thread id 1106953360
Now, as far as I can see, this is the same problem I had earlier (Where Object ownership was given not to the initialized class, but to the one that called it. Or so I understand it), and this has led me to finally accept that I generally don't understand how object ownership in Python works. I have seached the Python Documentation for an understandable explanation, but have not found any.
So, my Questions are:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
Where can I read up on this stuff to finally "get" it?
Is the term "Object ownership" even correct, or is there an other term for this in Python? (Edit: For explanations concerning this, read the comments of the main question)
I will be glad to take specific advice for this case, but am generally more interested in the whole concept of "what belongs to who" in Python, because to me it seems pretty different to the way Java handles it, and since I am planning to use Python a lot in the future, I might as well just learn it now, as this is a pretty important part of Python.
ProgrammingError: SQLite objects created in a thread can only be used in that same
The problem is that you're trying to conserve the cursor for some reason. You should not be doing this. Create a new cursor for every transaction; or if you're not totally sure where transactions start or end, a new cursor per query.
import sqlite3
class DB(object):
def __init__(self):
self.conn_uri = './data.sqlite'
[...]
def doSomething(self,Param):
conn = sqlite.connect(self.conn_uri)
c = conn.cursor()
c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
Edit, Re comments in your question: What's going on here has very little to do with python. When you create a sqlite resource, which is a C library and totally independent of python, sqlite requires that resource be used only in the thread that created it. It verifies this by looking at the thread ID of the currently running thread, and not at all attempting to coordinate the transfer of the resource from one thread to another. As such, you are obligated to create sqlite resources in each thread that needs them.
In your code, you create all of the sqlite resources in the DB object's __init__ method, which is probably called only once, and in the main thread. Thus these resources are only permitted to be used in that thread, threading.Lock not withstanding.
Your questions:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
The thread that created it. Since it looks like you're calling DB() at the module level, it's very likely that it's the main thread.
Where can I read up on this stuff to finally "get" it?
There's not really much of anything to get. Nothing is happening at all behind the scenes, except what SQLite has to say on the matter, when you are using it.
Is the term "Object ownership" even correct, or is there an other term for this in Python?
Python doesn't really have much of anything at all to do with threading, except that it allows you to use threads. It's on you to coordinate multi-threaded applications properly.
EDIT again:
Objects do not live inside particular threads. When you call a method on an object, that method runs in the calling thread. ten threads can call the same method on the same object; all will run concurrently (or whatever passes for that re the GIL), and it's up to the caller or the method body to make sure nothing breaks.
I'm the author of an alternate SQLite wrapper for Python (APSW) and very familiar with this issue. SQLite itself used to require that objects - the database connection and cursors could only be used in the same thread. Around SQLite 3.5 this was changed and you could use objects concurrently although internally SQLite did its own locking so you didn't actually get concurrent performance. The default Python SQLite wrapper (aka pysqlite) supports even old versions of SQLite 3 so it continues to enforce this restriction even though it is no longer necessary for SQLite itself. However the pysqlite code would need to be modified to allow concurrency as the way it wraps SQLite is not safe - eg handling error messages is not safe because of SQLite API design flaws and requires special handling.
Note that cursors are very cheap. Do not try to reuse them or treat them as precious. The actual underlying SQLite objects (sqlite3_stmt) are kept in a cache and reused as needed.
If you do want maximum concurrency then open multiple connections and use them simultaneously.
The APSW doc has more about multi-threading and re-entrancy. Note that it has extra code to allow the actual concurrent usage that pysqlite does not have, but the other tips and info apply to any usage of SQLite.
Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from a queue, extracting the information from it using processdocument, and writes the information to a new entry in a table in the database.
The worker loops breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is I'm not sure exactly what to put into the worker functions for each process to write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure whether what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The Metadata instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with the multiprocessing but and do this in a single process, but I will have to write code that has several processes reading for the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another way the way to go?
Intro
There seems to be a lot of arm-chair suggestions and no working examples. None of the answers listed here even suggest using multiprocessing and this is quite a bit disappointing and disturbing. As python lovers we should support our built-in libraries, and while parallel processing and synchronization is never a trivial matter, I believe it can be made trivial with proper design. This is becoming extremely important in modern multi-core architectures and cannot be stressed enough! That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy stages with quite a few pitfalls, bugs, and being geared towards functional programming (which I detest). Currently I still prefer the Pyro module (which is way ahead of its time) over multiprocessing due to multiprocessing's severe limitation in being unable to share newly created objects while the server is running. The "register" class-method of the manager objects will only actually register an object BEFORE the manager (or its server) is started. Enough chatter, more code:
Server.py
from multiprocessing.managers import SyncManager
class MyManager(SyncManager):
pass
syncdict = {}
def get_dict():
return syncdict
if __name__ == "__main__":
MyManager.register("syncdict", get_dict)
manager = MyManager(("127.0.0.1", 5000), authkey="password")
manager.start()
raw_input("Press any key to kill server".center(50, "-"))
manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager which can supply synchronized shared objects. This code will not work running in the interpreter because the multiprocessing library is quite touchy on how to find the "callable" for each registered object. Running Server.py will start a customized SyncManager that shares the syncdict dictionary for use of multiple processes and can be connected to clients either on the same machine, or if run on an IP address other than loopback, other machines. In this case the server is run on loopback (127.0.0.1) on port 5000. Using the authkey parameter uses secure connections when manipulating syncdict. When any key is pressed the manager is shutdown.
Client.py
from multiprocessing.managers import SyncManager
import sys, time
class MyManager(SyncManager):
pass
MyManager.register("syncdict")
if __name__ == "__main__":
manager = MyManager(("127.0.0.1", 5000), authkey="password")
manager.connect()
syncdict = manager.syncdict()
print "dict = %s" % (dir(syncdict))
key = raw_input("Enter key to update: ")
inc = float(raw_input("Enter increment: "))
sleep = float(raw_input("Enter sleep time (sec): "))
try:
#if the key doesn't exist create it
if not syncdict.has_key(key):
syncdict.update([(key, 0)])
#increment key value every sleep seconds
#then print syncdict
while True:
syncdict.update([(key, syncdict.get(key) + inc)])
time.sleep(sleep)
print "%s" % (syncdict)
except KeyboardInterrupt:
print "Killed client"
The client must also create a customized SyncManager, registering "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customized SycnManager to connect using the loopback IP address (127.0.0.1) on port 5000 and an authkey establishing a secure connection to the manager started in Server.py. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
The key in syncdict to operate on
The amount to increment the value accessed by the key every cycle
The amount of time to sleep per cycle in seconds
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
Annoying problems
The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
Closing
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze and now thanks to this answer I have hit the nail on the head. I hope this is useful to the python community on how to improve the multiprocessing module as I do believe it has a great deal of promise but in its infancy falls short of what is possible. Despite the annoying problems described I think this is still quite a viable alternative and is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show and it would probably be an even simpler solution depending on your requirements it just feels unnatural to me.
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclib to make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment to perform the increment and one taking just the key and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistent shared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
In response to an appropriate solution to the concurrent-write issue. I did very quick research and found that this article is suggesting a lock/semaphore solution. (http://effbot.org/zone/thread-synchronization.htm)
While the example isn't specificity on a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea.
If I had a requirement to implement something like this in a thread safe manner, I'd probably use the Python Semaphore solution. (Assuming my earlier merge technique wouldn't work.) I believe that semaphores generally slow down thread efficiencies due to their blocking nature.
From the site:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain their own instance of a dictionary and either merge at the end of the thread processing or periodically use a call-back to merge copies of the individual thread dictionaries together?
I don't know exactly what you are doing, so keep in my that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.