Storing subprocess object in memory using global singleton instance - python

So I am using subprocess to spawn a long running process through the web interface using Django. Now if a user wants to come back to the page I would like to give him the option of terminating the subprocess at a later stage.
How can I do this? I implemented the same thing in Java and made a global singleton ProcessManager dictionary to store the Process object in memory. Can I do something similar in Python?
EDIT
Yes Singletons and a hash of ProcessManager is the way of doing it cleanly. Emmanuel's code works perfectly fine with a few modifications.
Thanks

I think an easy way to implement Singleton pattern in python is via class attributes:
import subprocess

class ProcessManager(object):
    __PROCESS = None

    @staticmethod
    def set_process(args):
        # Sets the singleton process
        if ProcessManager.__PROCESS is None:
            ProcessManager.__PROCESS = subprocess.Popen(args)
        # else: exception handling

    @staticmethod
    def kill_process():
        # Kills the process
        if ProcessManager.__PROCESS is None:
            pass  # exception handling
        else:
            ProcessManager.__PROCESS.kill()
Then you can use this class via:
from my_module import ProcessManager
my_args = ...
ProcessManager.set_process(my_args)
...
ProcessManager.kill_process()
Notes:
the ProcessManager is in charge of creating the process, to be symmetrical with its ending
I don't have enough knowledge in multi-threading to know if this works in multi-threading mode
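A sketch of how the same idea could extend to several processes, along the lines of the "hash of ProcessManager" mentioned in the question's edit. The class name ProcessRegistry, the job keys, and the threading.Lock are assumptions added for illustration and are not part of the answer above; the lock only guards against threads inside a single Python process.

```python
import subprocess
import sys
import threading


class ProcessRegistry:
    """Hypothetical registry: one Popen object per job key, kept in memory."""

    _lock = threading.Lock()
    _procs = {}

    @classmethod
    def start(cls, key, args):
        with cls._lock:
            existing = cls._procs.get(key)
            # Refuse to start a second process under a key that is still running.
            if existing is not None and existing.poll() is None:
                raise RuntimeError("process already running for %r" % (key,))
            proc = subprocess.Popen(args)
            cls._procs[key] = proc
            return proc

    @classmethod
    def stop(cls, key):
        with cls._lock:
            proc = cls._procs.pop(key, None)
        if proc is None:
            return False  # nothing was registered under this key
        if proc.poll() is None:
            proc.terminate()
            proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

A view could call ProcessRegistry.start(job_id, args) when the job is launched and ProcessRegistry.stop(job_id) when the user comes back to cancel it, provided both requests are served by the same Python process.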

You can use the same technique in Python as you did in Java, that is store the reference to the process in a module variable or implement a kind of a singleton.
The only problem you have as opposed to Java, is that Python does not have that rich analogy to the Servlet specification, and there is no interface to handle the application start or finish. In most cases you should not be worried how many instances of your application are running, because you fetch all data from a persistent storage. But in this case you should understand how your application is deployed.
If there is a single long-running instance of your application (a FastCGI instance, for example, or a single WSGI application on CherryPy), you can isolate the process-handling functionality in a separate module and initialize it when the module is imported (any module is imported only once within an application). If there are many instances (many FastCGI instances, or plain CGI scripts), you had better detach the child processes, keep their IDs in persistent storage (a database or files), and intersect them with the list of currently running processes on demand.
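For the multi-instance case, a minimal sketch of keeping process IDs in persistent storage might look like the following. The JSON file location, the job IDs, and the function names start_job and stop_job are all hypothetical. A real deployment would also need to check that a stored PID still belongs to the expected program, since the operating system can reuse PIDs.

```python
import json
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical location for the PID table; a database row would work as well.
PID_FILE = os.path.join(tempfile.gettempdir(), "job_pids.json")


def _load():
    try:
        with open(PID_FILE) as fh:
            return json.load(fh)
    except (OSError, ValueError):
        return {}  # no table yet, or unreadable contents


def _save(pids):
    with open(PID_FILE, "w") as fh:
        json.dump(pids, fh)


def start_job(job_id, args):
    """Spawn the child and record its PID under job_id."""
    proc = subprocess.Popen(args)
    pids = _load()
    pids[job_id] = proc.pid
    _save(pids)
    return proc


def stop_job(job_id):
    """Terminate the recorded PID; works from any instance that can read PID_FILE."""
    pids = _load()
    pid = pids.pop(job_id, None)
    _save(pids)
    if pid is None:
        return False
    try:
        os.kill(pid, signal.SIGTERM)
    except OSError:
        return False  # the process had already exited
    return True
```

This sketch does no file locking, so two instances updating the table at the same moment could overwrite each other's entries.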

Related

multiprocessing.Process stored in an attribute not accessible to terminate in class in Python 3

When declaring a class attribute as a multiprocessing.Process instance, the attribute isn't accessible to the class.
Background
I'm working on a free web development desktop application in Python for people new to coding. It downloads all the tools necessary to begin web development and sets up the system in a single install. It will set up and manage MongoDB and NodeJS instances, push and pull projects to and from a Github repository, build the application, and export a package that can be uploaded to a server, all from a single GUI. I'm currently having some issues managing the NodeJS instances. The first issue I ran into is piping multiple commands into the CLI, as Node doesn't play well without user intervention, but I figured out a work around by writing out the commands in at most 2 lines.
Current Issue
My issue now is shutting down the NodeJS server. The GUI is built using customTkinter, and to avoid locking up the UI I have to start Node using threading.Thread, which has no method for stopping the thread. I tried setting up subprocess.run and Popen in a while loop so I could pass a termination flag and break out of the process, but that just kept spawning NodeJS servers until all system resources were consumed. My next attempt used threading.Thread to wrap multiprocessing.Process, which in turn wraps subprocess.run, since multiprocessing.Process has a built-in terminate method. (I tried subprocess.Popen, but that doesn't work when wrapped in multiprocessing.Process, as it returns a pickling error.) I stored the resulting multiprocessing.Process in a class attribute called NPM; however, when I call self.NPM.terminate(), the program raises an AttributeError stating that the attribute doesn't exist.
Code
from subprocess import run
from multiprocessing import Process
from threading import Thread
...
    self.startbtn=ctkButton(command=Thread(target=lambda:self.NPMStart(self.siteDir)))

def NPMStart( self, siteDir ):
    self.stopbtn = ctk.ctkButton(command=self.NPMStop)
    self.NPM = Process(target = run(['powershell', 'npm', 'run', 'dev'], cwd=siteDir))
    self.admin.start()

def NPMStop( self ):
    self.startbtn=ctkButton(command=Thread(target=lambda:self.NPMStart(self.siteDir)))
    self.NPM.terminate()
Closing Notes
I have no idea what I'm doing wrong here as from everything I've read this SHOULD work. Any explanation as to what I'm doing that is preventing the class from accessing the self.NPM attribute outside the NPMStart method would be greatly appreciated.
If you want to see the full code I currently have, feel free to check out my Github repository:
https://github.com/ToneseekerMusical/PPIM
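One possible direction, sketched under assumptions rather than taken from the repository: subprocess.Popen returns as soon as the child is spawned, so the resulting object can be stored on the instance and terminated later without a thread or multiprocessing wrapper. The DevServer class and its method names are hypothetical stand-ins for the question's GUI class. Note also that terminating a wrapper process such as powershell may not stop any children it has spawned.

```python
import subprocess
import sys


class DevServer:
    """Hypothetical holder for one child process."""

    def __init__(self, args, cwd=None):
        self.args = args
        self.cwd = cwd
        self.proc = None

    def start(self):
        # Popen does not block, so this is safe to call from a GUI callback.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.args, cwd=self.cwd)

    def stop(self):
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait()
        self.proc = None

    def running(self):
        return self.proc is not None and self.proc.poll() is None
```

The button callbacks would then be the bound methods themselves (for example command=server.start and command=server.stop), passed without calling them.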

Python parallel programming model

I'm writing a machine learning program with the following components:
A shared "Experience Pool" with a binary-tree-like data structure.
N simulator processes. Each adds an "experience object" to the pool every once in a while. The pool is responsible for balancing its tree.
M learner processes that sample a batch of "experience objects" from the pool every few moments and perform whatever learning procedure.
I don't know what's the best way to implement the above. I'm not using Tensorflow, so I cannot take advantage of its parallel capability. More concretely,
I first think of Python3's built-in multiprocessing library. Unlike multithreading, however, multiprocessing module cannot have different processes update the same global object. My hunch is that I should use the server-proxy model. Could anyone please give me a rough skeleton code to start with?
Is MPI4py a better solution?
Any other libraries that would be a better fit? I've looked at celery, disque, etc. It's not obvious to me how to adapt them to my use case.
Based on the comments, what you're really looking for is a way to update a shared object from a set of processes that are carrying out a CPU-bound task. Being CPU-bound makes multiprocessing an obvious choice; if most of your work were IO-bound, multithreading would have been a simpler choice.
Your problem follows a simpler server-client model: the clients use the server as a simple stateful store, no communication between any child processes is needed, and no process needs to be synchronised.
Thus, the simplest way to do this is to:
Start a separate process that contains a server.
Inside the server logic, provide methods to update and read from a single object.
Treat both your simulator and learner processes as separate clients that can periodically read and update the global state.
From the server's perspective, the identity of the clients doesn't matter - only their actions do.
Thus, this can be accomplished by using a customised manager in multiprocessing, like so:
# server.py
from multiprocessing.managers import BaseManager
# this represents the data structure you've already implemented.
from ... import ExperienceTree

# An important note: the way proxy objects work is by shared weak reference to
# the object. If all of your workers die, it takes your proxy object with
# it. Thus, if you have an instance, the instance is garbage-collected
# once all references to it have been erased. I have chosen to sidestep
# this in my code by using class variables and objects so that instances
# are never used - you may define __init__, etc. if you so wish, but
# just be aware of what will happen to your object once all workers are gone.
class ExperiencePool(object):
    tree = ExperienceTree()

    @classmethod
    def update(cls, experience_object):
        ''' Implement methods to update the tree with an experience object. '''
        cls.tree.update(experience_object)

    @classmethod
    def sample(cls):
        ''' Implement methods to sample the tree's experience objects. '''
        return cls.tree.sample()

# subclass base manager
class Server(BaseManager):
    pass

# register the class you just created - now you can access an instance of
# ExperiencePool using Server.Shared_Experience_Pool().
Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # run the server on port 8080 of your own machine.
    # get_server() only works on a manager that has not been started, so the
    # manager is not entered as a context manager (that would call start()).
    server_process = Server(('localhost', 8080), authkey=b'none')
    server_process.get_server().serve_forever()
Now for all of your clients you can just do:
# client.py - you can always have a separate client file for a learner and a simulator.
from multiprocessing.managers import BaseManager
from server import ExperiencePool

class Server(BaseManager):
    pass

Server.register('Shared_Experience_Pool', ExperiencePool)

if __name__ == '__main__':
    # connect to the server running on port 8080 of your own machine.
    server_process = Server(('localhost', 8080), authkey=b'none')
    server_process.connect()
    experience_pool = server_process.Shared_Experience_Pool()
    # now do your own thing and call `experience_pool.sample()` or `update` whenever you want.
You may then launch one server.py and as many workers as you want.
Is This The Best Design?
Not always. You may run into race conditions in that your learners may receive stale or old data if they are forced to compete with a simulator node writing at the same time.
If you want to ensure a preference for latest writes, you may additionally use a lock whenever your simulators are trying to write something, preventing your other processes from getting a read until the write finishes.
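One way the locking idea could be sketched is to put the lock inside the shared object on the server side. A manager server handles each client connection in its own thread, so a threading.Lock held around every read and write serializes them. The names LockedPool, get_pool, PoolServer and serve are hypothetical, and a plain list stands in for the experience tree.

```python
import random
import threading
from multiprocessing.managers import BaseManager


class LockedPool:
    """Hypothetical pool; a list stands in for the real tree structure."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def update(self, item):
        with self._lock:  # writers exclude readers and other writers
            self._items.append(item)

    def sample(self, k):
        with self._lock:  # readers never observe a half-finished write
            return random.sample(self._items, min(k, len(self._items)))

    def size(self):
        with self._lock:
            return len(self._items)


_POOL = LockedPool()


def get_pool():
    # Always hand out the same instance so every client shares one pool.
    return _POOL


class PoolServer(BaseManager):
    pass


PoolServer.register("get_pool", callable=get_pool)


def serve(address=("localhost", 8080), authkey=b"none"):
    """Run the manager server in the current process (blocks forever)."""
    PoolServer(address, authkey=authkey).get_server().serve_forever()
```

Clients would register "get_pool" on their own BaseManager subclass, connect, and call get_pool() to obtain a proxy whose update and sample calls execute under the lock in the server process.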

Understanding Python sqlite mechanics in multi-module environments

First off, I have no idea if "Ownership" is the correct term for this, it's just what I am calling it in Java.
I am currently building a Server that uses SQLite, and I am encountering errors concerning object "ownership":
I have one Module that manages the SQLite Database. Let's call it "pyDB". Simplified:
import threading
import sqlite3

class DB(object):
    def __init__(self):
        self.lockDB = threading.Lock()
        self.conn = sqlite3.connect('./data.sqlite')
        self.c = self.conn.cursor()

    [...]

    def doSomething(self,Param):
        with self.lockDB:
            self.c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
(Note that the lockDB object is there because the Database-Class can be called by multiple concurrent threads, and although SQLite itself is thread-safe, the cursor-Object is not, as far as I know).
Then I have a worker thread that processes stuff.
import pyDB
self.DB = pyDB.DB()

class Thread(threading.Thread):
    [omitting some stuff that is not relevant here]
    def doSomethingElse(self, Param):
        DB.doSomething(Param)
If I am executing this, I am getting the following exception:
self.process(task)
File "[removed]/ProcessingThread.py", line 67, in process
DB.doSomething(Param)
File "[removed]/pyDB.py", line 101, in doSomething
self.c.execute(self,"SELECT * FROM xyz WHERE ID = ?", Param)
ProgrammingError: SQLite objects created in a thread can only be used in that same
thread.The object was created in thread id 1073867776 and this is thread id 1106953360
Now, as far as I can see, this is the same problem I had earlier (where object ownership was given not to the initialized class, but to the one that called it, or so I understand it), and this has led me to finally accept that I generally don't understand how object ownership in Python works. I have searched the Python documentation for an understandable explanation, but have not found any.
So, my Questions are:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
Where can I read up on this stuff to finally "get" it?
Is the term "Object ownership" even correct, or is there an other term for this in Python? (Edit: For explanations concerning this, read the comments of the main question)
I will be glad to take specific advice for this case, but am generally more interested in the whole concept of "what belongs to who" in Python, because to me it seems pretty different to the way Java handles it, and since I am planning to use Python a lot in the future, I might as well just learn it now, as this is a pretty important part of Python.
ProgrammingError: SQLite objects created in a thread can only be used in that same thread.
The problem is that you're trying to conserve the cursor for some reason. You should not be doing this. Create a new cursor for every transaction; or if you're not totally sure where transactions start or end, a new cursor per query.
import sqlite3

class DB(object):
    def __init__(self):
        self.conn_uri = './data.sqlite'

    [...]

    def doSomething(self,Param):
        # open the connection and cursor in whichever thread makes the call
        conn = sqlite3.connect(self.conn_uri)
        c = conn.cursor()
        c.execute("SELECT * FROM xyz WHERE ID = ?", Param)
Edit, re comments in your question: What's going on here has very little to do with Python. When you create a sqlite resource (sqlite is a C library and totally independent of Python), that resource may be used only in the thread that created it. This is verified by looking at the thread ID of the currently running thread; no attempt is made to coordinate the transfer of the resource from one thread to another. As such, you are obligated to create sqlite resources in each thread that needs them.
In your code, you create all of the sqlite resources in the DB object's __init__ method, which is probably called only once, and in the main thread. Thus these resources are only permitted to be used in that thread, threading.Lock notwithstanding.
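A middle ground between one shared connection and a new connection per query is one connection per thread, created lazily in whichever thread first needs it. The sketch below assumes that approach; the class name ThreadLocalDB and its execute method are hypothetical, and the connections are left open for the lifetime of each thread.

```python
import os
import sqlite3
import tempfile
import threading


class ThreadLocalDB:
    """Hypothetical wrapper: each thread gets, and keeps, its own connection."""

    def __init__(self, path):
        self.path = path
        self._local = threading.local()

    def _conn(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # Created in the calling thread, so the same-thread check passes.
            conn = sqlite3.connect(self.path)
            self._local.conn = conn
        return conn

    def execute(self, sql, params=()):
        conn = self._conn()
        with conn:  # commits on success, rolls back on an exception
            return conn.execute(sql, params).fetchall()
```

Because no SQLite object ever crosses a thread boundary, no lock is needed around the cursor; SQLite's own file locking arbitrates between the connections.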
Your questions:
Who owns the cursor object in this case? The Processing Thread or the DB thread?
The thread that created it. Since it looks like you're calling DB() at the module level, it's very likely that it's the main thread.
Where can I read up on this stuff to finally "get" it?
There's not really much of anything to get. Nothing is happening at all behind the scenes, except what SQLite has to say on the matter, when you are using it.
Is the term "Object ownership" even correct, or is there an other term for this in Python?
Python doesn't really have much of anything at all to do with threading, except that it allows you to use threads. It's on you to coordinate multi-threaded applications properly.
EDIT again:
Objects do not live inside particular threads. When you call a method on an object, that method runs in the calling thread. Ten threads can call the same method on the same object; all will run concurrently (or whatever passes for that given the GIL), and it's up to the caller or the method body to make sure nothing breaks.
I'm the author of an alternate SQLite wrapper for Python (APSW) and very familiar with this issue. SQLite itself used to require that objects (the database connection and cursors) be used only in the thread that created them. Around SQLite 3.5 this was changed and you could use objects concurrently, although internally SQLite did its own locking, so you didn't actually get concurrent performance. The default Python SQLite wrapper (aka pysqlite) supports even old versions of SQLite 3, so it continues to enforce this restriction even though it is no longer necessary for SQLite itself. However, the pysqlite code would need to be modified to allow concurrency, as the way it wraps SQLite is not safe: for example, handling error messages is not safe because of SQLite API design flaws and requires special handling.
Note that cursors are very cheap. Do not try to reuse them or treat them as precious. The actual underlying SQLite objects (sqlite3_stmt) are kept in a cache and reused as needed.
If you do want maximum concurrency then open multiple connections and use them simultaneously.
The APSW doc has more about multi-threading and re-entrancy. Note that it has extra code to allow the actual concurrent usage that pysqlite does not have, but the other tips and info apply to any usage of SQLite.

How to use simple sqlalchemy calls while using thread/multiprocessing

Problem
I am writing a program that reads a set of documents from a corpus (each line is a document). Each document is processed using a function processdocument, assigned a unique ID, and then written to a database. Ideally, we want to do this using several processes. The logic is as follows:
The main routine creates a new database and sets up some tables.
The main routine sets up a group of processes/threads that will run a worker function.
The main routine starts all the processes.
The main routine reads the corpus, adding documents to a queue.
Each process's worker function loops, reading a document from a queue, extracting the information from it using processdocument, and writes the information to a new entry in a table in the database.
The worker loops breaks once the queue is empty and an appropriate flag has been set by the main routine (once there are no more documents to add to the queue).
Question
I'm relatively new to sqlalchemy (and databases in general). I think the code used for setting up the database in the main routine works fine, from what I can tell. Where I'm stuck is I'm not sure exactly what to put into the worker functions for each process to write to the database without clashing with the others.
There's nothing particularly complicated going on: each process gets a unique value to assign to an entry from a multiprocessing.Value object, protected by a Lock. I'm just not sure what I should be passing to the worker function (aside from the queue), if anything. Do I pass the sqlalchemy.Engine instance I created in the main routine? The MetaData instance? Do I create a new engine for each process? Is there some other canonical way of doing this? Is there something special I need to keep in mind?
Additional Comments
I'm well aware I could just not bother with multiprocessing and do this in a single process, but I will have to write code that has several processes reading from the database later on, so I might as well figure out how to do this now.
Thanks in advance for your help!
The MetaData and its collection of Table objects should be considered a fixed, immutable structure of your application, not unlike your function and class definitions. As you know with forking a child process, all of the module-level structures of your application remain present across process boundaries, and table defs are usually in this category.
The Engine however refers to a pool of DBAPI connections which are usually TCP/IP connections and sometimes filehandles. The DBAPI connections themselves are generally not portable over a subprocess boundary, so you would want to either create a new Engine for each subprocess, or use a non-pooled Engine, which means you're using NullPool.
You also should not be doing any kind of association of MetaData with Engine, that is "bound" metadata. This practice, while prominent on various outdated tutorials and blog posts, is really not a general purpose thing and I try to de-emphasize this way of working as much as possible.
If you're using the ORM, a similar dichotomy of "program structures/active work" exists, where your mapped classes of course are shared between all subprocesses, but you definitely want Session objects to be local to a particular subprocess - these correspond to an actual DBAPI connection as well as plenty of other mutable state which is best kept local to an operation.
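To illustrate the split between shared structure and per-worker connections without third-party packages, here is a sketch that swaps in the standard library's sqlite3 for SQLAlchemy and threads for processes. The schema string plays the role of the MetaData, and the connection opened inside worker plays the role of a per-subprocess Engine. The table layout, the word-count body of processdocument, and the function names worker and run are all assumptions.

```python
import os
import queue
import sqlite3
import tempfile
import threading

# Fixed structure, shared by everyone (the analogue of MetaData and Table objects).
SCHEMA = "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, length INTEGER)"
STOP = None  # sentinel telling a worker that no more documents will arrive


def processdocument(line):
    # Placeholder for the real extraction step: counts whitespace-separated tokens.
    return len(line.split())


def worker(db_path, docs):
    conn = sqlite3.connect(db_path)  # opened inside the worker, never shared
    try:
        while True:
            line = docs.get()
            if line is STOP:
                break
            with conn:  # one transaction per document
                conn.execute("INSERT INTO documents (length) VALUES (?)",
                             (processdocument(line),))
    finally:
        conn.close()


def run(db_path, corpus, n_workers=2):
    setup = sqlite3.connect(db_path)
    with setup:
        setup.execute(SCHEMA)
    setup.close()
    docs = queue.Queue()
    workers = [threading.Thread(target=worker, args=(db_path, docs))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for line in corpus:
        docs.put(line)
    for _ in workers:
        docs.put(STOP)  # one sentinel per worker
    for w in workers:
        w.join()
```

With SQLAlchemy the same shape would apply: pass only the database URL and the queue to each worker, and build the Engine (and any Session) inside the worker function.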

How to synchronize a python dict with multiprocessing

I am using Python 2.6 and the multiprocessing module for multi-threading. Now I would like to have a synchronized dict (where the only atomic operation I really need is the += operator on a value).
Should I wrap the dict with a multiprocessing.sharedctypes.synchronized() call? Or is another way the way to go?
Intro
There seems to be a lot of arm-chair suggestions and no working examples. None of the answers listed here even suggest using multiprocessing and this is quite a bit disappointing and disturbing. As python lovers we should support our built-in libraries, and while parallel processing and synchronization is never a trivial matter, I believe it can be made trivial with proper design. This is becoming extremely important in modern multi-core architectures and cannot be stressed enough! That said, I am far from satisfied with the multiprocessing library, as it is still in its infancy stages with quite a few pitfalls, bugs, and being geared towards functional programming (which I detest). Currently I still prefer the Pyro module (which is way ahead of its time) over multiprocessing due to multiprocessing's severe limitation in being unable to share newly created objects while the server is running. The "register" class-method of the manager objects will only actually register an object BEFORE the manager (or its server) is started. Enough chatter, more code:
Server.py
from multiprocessing.managers import SyncManager

class MyManager(SyncManager):
    pass

syncdict = {}

def get_dict():
    return syncdict

if __name__ == "__main__":
    MyManager.register("syncdict", get_dict)
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.start()
    raw_input("Press any key to kill server".center(50, "-"))
    manager.shutdown()
In the above code example, Server.py makes use of multiprocessing's SyncManager, which can supply synchronized shared objects. This code will not work when run in the interactive interpreter, because the multiprocessing library is quite touchy about how it finds the "callable" for each registered object. Running Server.py starts a customized SyncManager that shares the syncdict dictionary among multiple processes; clients can connect from the same machine or, if the server is run on an IP address other than loopback, from other machines. In this case the server is run on loopback (127.0.0.1) on port 5000. The authkey parameter authenticates connections to the manager when manipulating syncdict. When any key is pressed the manager is shut down.
Client.py
from multiprocessing.managers import SyncManager
import sys, time

class MyManager(SyncManager):
    pass

MyManager.register("syncdict")

if __name__ == "__main__":
    manager = MyManager(("127.0.0.1", 5000), authkey="password")
    manager.connect()
    syncdict = manager.syncdict()
    print "dict = %s" % (dir(syncdict))
    key = raw_input("Enter key to update: ")
    inc = float(raw_input("Enter increment: "))
    sleep = float(raw_input("Enter sleep time (sec): "))
    try:
        #if the key doesn't exist create it
        if not syncdict.has_key(key):
            syncdict.update([(key, 0)])
        #increment key value every sleep seconds
        #then print syncdict
        while True:
            syncdict.update([(key, syncdict.get(key) + inc)])
            time.sleep(sleep)
            print "%s" % (syncdict)
    except KeyboardInterrupt:
        print "Killed client"
The client must also create a customized SyncManager, registering "syncdict", this time without passing in a callable to retrieve the shared dict. It then uses the customized SyncManager to connect to the manager started in Server.py, using the loopback IP address (127.0.0.1) on port 5000 and the authkey to authenticate the connection. It retrieves the shared dict syncdict by calling the registered callable on the manager. It prompts the user for the following:
The key in syncdict to operate on
The amount to increment the value accessed by the key every cycle
The amount of time to sleep per cycle in seconds
The client then checks to see if the key exists. If it doesn't it creates the key on the syncdict. The client then enters an "endless" loop where it updates the key's value by the increment, sleeps the amount specified, and prints the syncdict only to repeat this process until a KeyboardInterrupt occurs (Ctrl+C).
Annoying problems
The Manager's register methods MUST be called before the manager is started otherwise you will get exceptions even though a dir call on the Manager will reveal that it indeed does have the method that was registered.
All manipulations of the dict must be done with methods and not dict assignments (syncdict["blast"] = 2 will fail miserably because of the way multiprocessing shares custom objects)
Using SyncManager's dict method would alleviate annoying problem #2 except that annoying problem #1 prevents the proxy returned by SyncManager.dict() being registered and shared. (SyncManager.dict() can only be called AFTER the manager is started, and register will only work BEFORE the manager is started so SyncManager.dict() is only useful when doing functional programming and passing the proxy to Processes as an argument like the doc examples do)
The server AND the client both have to register even though intuitively it would seem like the client would just be able to figure it out after connecting to the manager (Please add this to your wish-list multiprocessing developers)
Closing
I hope you enjoyed this quite thorough and slightly time-consuming answer as much as I have. I was having a great deal of trouble getting straight in my mind why I was struggling so much with the multiprocessing module where Pyro makes it a breeze, and now thanks to this answer I have hit the nail on the head. I hope this is useful to the Python community on how to improve the multiprocessing module, as I do believe it has a great deal of promise but in its infancy falls short of what is possible. Despite the annoying problems described, I think this is still quite a viable alternative and is pretty simple. You could also use SyncManager.dict() and pass it to Processes as an argument the way the docs show, and it would probably be an even simpler solution depending on your requirements; it just feels unnatural to me.
I would dedicate a separate process to maintaining the "shared dict": just use e.g. xmlrpclib to make that tiny amount of code available to the other processes, exposing via xmlrpclib e.g. a function taking key, increment to perform the increment and one taking just the key and returning the value, with semantic details (is there a default value for missing keys, etc, etc) depending on your app's needs.
Then you can use any approach you like to implement the shared-dict dedicated process: all the way from a single-threaded server with a simple dict in memory, to a simple sqlite DB, etc, etc. I suggest you start with code "as simple as you can get away with" (depending on whether you need a persistent shared dict, or persistence is not necessary to you), then measure and optimize as and if needed.
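A minimal sketch of such a dedicated process, using the Python 3 names for the XML-RPC modules (xmlrpc.server and xmlrpc.client; xmlrpclib is the Python 2 name). The function names make_server, increment and get are hypothetical, and a default of 0 for missing keys is an assumed choice.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def make_server(host="127.0.0.1", port=0):
    """Build an XML-RPC server owning the dict; port 0 picks a free port."""
    counts = {}

    def increment(key, amount):
        # SimpleXMLRPCServer handles one request at a time, so this
        # read-modify-write cannot interleave with another client's call.
        counts[key] = counts.get(key, 0) + amount
        return counts[key]

    def get(key):
        return counts.get(key, 0)  # assumed default for missing keys

    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(increment)
    server.register_function(get)
    return server
```

The dedicated process would call make_server(...).serve_forever(); each worker would then create ServerProxy("http://host:port") and call proxy.increment(key, amount) in place of dict[key] += amount.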
As for an appropriate solution to the concurrent-write issue: I did some very quick research and found an article that suggests a lock/semaphore solution (http://effbot.org/zone/thread-synchronization.htm).
While the example isn't specifically about a dictionary, I'm pretty sure you could code a class-based wrapper object to help you work with dictionaries based on this idea.
If I had a requirement to implement something like this in a thread safe manner, I'd probably use the Python Semaphore solution. (Assuming my earlier merge technique wouldn't work.) I believe that semaphores generally slow down thread efficiencies due to their blocking nature.
From the site:
A semaphore is a more advanced lock mechanism. A semaphore has an internal counter rather than a lock flag, and it only blocks if more than a given number of threads have attempted to hold the semaphore. Depending on how the semaphore is initialized, this allows multiple threads to access the same code section simultaneously.
import threading
semaphore = threading.BoundedSemaphore()
semaphore.acquire() # decrements the counter
# ... access the shared resource; work with dictionary, add item or whatever.
semaphore.release() # increments the counter
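The class-based wrapper mentioned above could be sketched as follows. It uses a plain threading.Lock rather than a semaphore (a BoundedSemaphore with its default count of 1 would behave the same way here), and the class name LockedCounterDict is hypothetical. It protects threads within one process only, not separate processes.

```python
import threading


class LockedCounterDict:
    """Hypothetical wrapper whose += style update is atomic across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def increment(self, key, amount=1):
        # The read and the write happen under one lock acquisition,
        # so no other thread can slip in between them.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key, default=0):
        with self._lock:
            return self._data.get(key, default)
```

Every thread then calls counter.increment(key) in place of d[key] += 1.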
Is there a reason that the dictionary needs to be shared in the first place? Could you have each thread maintain their own instance of a dictionary and either merge at the end of the thread processing or periodically use a call-back to merge copies of the individual thread dictionaries together?
I don't know exactly what you are doing, so keep in mind that my written plan may not work verbatim. What I'm suggesting is more of a high-level design idea.
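The merge idea could be sketched like this, assuming the per-worker result is a tally that can simply be summed. The word-counting task and the function names count_words and merged_counts are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Each task fills its own private Counter; nothing is shared while it runs."""
    local = Counter()
    for word in chunk.split():
        local[word] += 1
    return local


def merged_counts(chunks, max_workers=4):
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Merging happens only here, in the calling thread, so no lock is needed.
        for local in pool.map(count_words, chunks):
            total.update(local)
    return total
```

Because each worker owns its dictionary until it is finished, the only synchronization point is the final merge.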
