Detect if a process is already running and collaborate with it - python

I'm trying to create a program that starts a process pool of, say, 5 processes, performs some operation, and then quits, but leaves the 5 processes open. Later the user can run the program again, and instead of it starting new processes it uses the existing 5. Basically it's a producer-consumer model where:
The number of producers varies.
The number of consumers is constant.
The producers can be started at different times by different programs or even different users.
I'm using the builtin multiprocessing module, currently in Python 2.6.4, but with the intent to move to 3.1.1 eventually.
Here's a basic usage scenario:
Beginning state - no processes running.
User starts program.py operation - one producer, five consumers running.
Operation completes - five consumers running.
User starts program.py operation - one producer, five consumers running.
User starts program.py operation - two producers, five consumers running.
Operation completes - one producer, five consumers running.
Operation completes - five consumers running.
User starts program.py stop and it completes - no processes running.
User starts program.py start and it completes - five consumers running.
User starts program.py operation - one producer, five consumers running.
Operation completes - five consumers running.
User starts program.py stop and it completes - no processes running.
The problem I have is that I don't know where to start on:
Detecting that the consumer processes are running.
Gaining access to them from a previously unrelated program.
Doing 1 and 2 in a cross-platform way.
Once I can do that, I know how to manage the processes. There has to be some reliable way to detect existing processes since I've seen Firefox do this to prevent multiple instances of Firefox from running, but I have no idea how to do that in Python.

There are a couple of common ways to do your item #1 (detecting running processes), but to use them would first require that you slightly tweak your mental picture of how these background processes are started by the first invocation of the program.
Think of the first program not as starting the five processes and then exiting, but rather as detecting that it is the first instance started and not exiting. It can create a file lock (one of the common approaches for preventing multiple occurrences of an application from running), or merely bind to some socket (another common approach). Either approach will raise an exception in a second instance, which then knows that it is not the first and can refocus its attention on contacting the first instance.
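For example, a minimal sketch of the socket-binding variant (the port number is an arbitrary assumption):

    import socket

    LOCK_PORT = 47263  # arbitrary fixed port, an assumption for this sketch

    def i_am_first_instance():
        """Try to bind a localhost port; success means no other instance holds it."""
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", LOCK_PORT))
        except socket.error:
            s.close()
            return False, None
        return True, s   # keep the socket open for the lifetime of the process

    first, lock_socket = i_am_first_instance()
    if first:
        print("No other instance detected - start the consumers here.")
    else:
        print("Another instance already holds the port - contact it instead.")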
If you're using multiprocessing, you should be able simply to use the Manager support, which involves binding to a socket to act as a server.
The first program starts the processes, creates Queues, proxies, or whatever. It creates a Manager to allow access to them, possibly allowing remote access.
Subsequent invocations first attempt to contact that server/Manager on the predefined socket (or use other techniques to discover which socket it is on). Instead of calling serve_forever(), they connect() and communicate using the usual multiprocessing mechanisms.
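A rough sketch of that flow with multiprocessing.managers.BaseManager (the address, authkey and registered queue name are illustrative assumptions):

    try:
        import queue                      # Python 3
    except ImportError:
        import Queue as queue             # Python 2
    from multiprocessing.managers import BaseManager

    ADDRESS = ("127.0.0.1", 50000)        # assumed fixed address
    AUTHKEY = b"change-me"                # assumed shared secret
    job_queue = queue.Queue()

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_jobs", callable=lambda: job_queue)

    def serve_or_connect():
        manager = QueueManager(address=ADDRESS, authkey=AUTHKEY)
        try:
            server = manager.get_server()     # fails if the address is already bound
            return "server", server           # first instance: start the consumers,
                                              # then call server.serve_forever()
        except Exception:
            manager.connect()                 # later instance: talk to the first one
            return "client", manager.get_jobs()

The first invocation would start its consumer processes and then call server.serve_forever(); later invocations get a proxy to the same queue and simply put work on it.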

Take a look at these different Service Discovery mechanisms: http://en.wikipedia.org/wiki/Service_discovery
The basic idea is that the consumers would each register a service when they start. The producer would go through the discovery process when starting. If it finds the consumers, it binds to them. If it doesn't find them, it starts up new consumers. In most of these systems, services can also publish properties, so you can have each consumer uniquely identify itself and give other information to the discovering producer.
Bonjour/zeroconf is pretty well supported cross-platform. You can even configure Safari to show you the zeroconf services on your local network, so you can use that to debug the service advertisement for the consumers. One side advantage of this kind of approach is that you could easily run the producers on different machines than the consumers.
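As an illustration, a consumer might advertise itself roughly like this with the third-party python-zeroconf package (the service type, instance name, port and properties are all assumptions, and the ServiceInfo signature varies between package versions):

    import socket
    from zeroconf import ServiceInfo, Zeroconf  # pip install zeroconf

    info = ServiceInfo(
        "_myconsumer._tcp.local.",                 # assumed service type
        "consumer-1._myconsumer._tcp.local.",      # assumed instance name
        addresses=[socket.inet_aton("127.0.0.1")],
        port=9000,
        properties={"role": "consumer", "pid": "1234"},
    )

    zc = Zeroconf()
    zc.register_service(info)     # producers can now discover this consumer
    # ... serve jobs ...
    zc.unregister_service(info)
    zc.close()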

You need a client-server model on a local system. You could do this using TCP/IP sockets to communicate between your clients and servers, but it's faster to use local named pipes if you don't have the need to communicate over a network.
The basic requirements for you if I understood correctly are these:
1. A producer should be able to spawn consumers if none exist already.
2. A producer should be able to communicate with consumers.
3. A producer should be able to find pre-existing consumers and communicate with them.
4. Even if a producer completes, consumers should continue running.
5. More than one producer should be able to communicate with the consumers.
Let's tackle each one of these one by one:
(1) is a simple process-creation problem, except that consumer (child) processes should continue running, even if the producer (parent) exits. See (4) below.
(2) A producer can communicate with consumers using named pipes. See os.mkfifo() and the unix man page of mkfifo() to create named pipes.
(3) You need to create named pipes from the consumer processes in a well-known path when they start running. The producer can find out whether any consumers are running by looking for these well-known pipes in the same location. If the pipes do not exist, no consumers are running and the producer can spawn them.
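A minimal sketch of (3), assuming /tmp/myapp as the well-known location:

    import os, errno

    FIFO_PATH = "/tmp/myapp/consumer.fifo"   # assumed well-known location

    def consumer_create_fifo():
        if not os.path.isdir(os.path.dirname(FIFO_PATH)):
            os.makedirs(os.path.dirname(FIFO_PATH))
        try:
            os.mkfifo(FIFO_PATH, 0o600)
        except OSError as e:
            if e.errno != errno.EEXIST:      # already created by another consumer
                raise

    def producer_consumers_running():
        # The pipe existing is the signal that consumers were started
        # (a stale pipe left behind by a crash would fool this check).
        return os.path.exists(FIFO_PATH)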
(4) You'll need to use os.setsid() for this and make the consumer processes act like daemons. See the unix man page of setsid().
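For (4), the usual UNIX double-fork recipe looks like this (error handling and stdio redirection trimmed):

    import os, sys

    def daemonize():
        """Detach the current process so it survives its parent's exit."""
        if os.fork() > 0:        # first fork: parent returns control to the caller
            sys.exit(0)
        os.setsid()              # become session leader, detach from the terminal
        if os.fork() > 0:        # second fork: ensure we can never re-acquire a tty
            sys.exit(0)
        os.chdir("/")
        os.umask(0)
        # (optionally redirect stdin/stdout/stderr to /dev/null here)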
(5) This one is tricky. Multiple producers can communicate with the consumers using the same named pipe, but writes larger than PIPE_BUF are not guaranteed to be atomic, so data from different producers may interleave and you cannot reliably tell which producer sent which data.
A better way to do (5) is to have each consumer open a "control" named pipe (/tmp/control.3456, 3456 being the consumer pid) on startup. Producers first set up a communication channel using the "control" pipe. When a producer connects, it sends its pid, say "1234", to the consumer on the "control" pipe, which tells the consumer to create a named pipe for data exchange with that producer, say "/tmp/data.1234". The producer then closes the "control" pipe and opens "/tmp/data.1234" to communicate with the consumer. Each consumer has its own "control" pipe (use the consumer pids to distinguish between pipes of different consumers), and each producer gets its own "data" pipe. When a producer finishes, it should clean up its data pipe or tell the consumer to do so. Similarly, when a consumer finishes, it should clean up its control pipe.
A difficulty here is preventing multiple producers from connecting to the control pipe of a single consumer at the same time. The "control" pipe is a shared resource, so you need to synchronize access to it between producers. Use semaphores or file locking for this; see the posix_ipc python module.
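For instance, with a named POSIX semaphore from the third-party posix_ipc module (the semaphore name is an assumption; fcntl.flock on a lock file is a stdlib alternative):

    import posix_ipc  # pip install posix_ipc

    # One semaphore per consumer control pipe, e.g. for consumer pid 3456.
    sem = posix_ipc.Semaphore("/control.3456", flags=posix_ipc.O_CREAT,
                              initial_value=1)
    sem.acquire()                 # only one producer talks on the control pipe
    try:
        # write our pid to /tmp/control.3456 here (not shown)
        pass
    finally:
        sem.release()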
Note: I have described most of the above in terms of general UNIX semantics, but all you really need is the ability to create daemon processes, ability to create "named" pipes/queues/whatever so that they can be found by an unrelated process, and ability to synchronize between unrelated processes. You can use any python module which provides such semantics.

Related

Is it possible to use multiprocessing.Queue to communicate between TWO python scripts?

I have just learned about python concurrency and its library module multiprocessing. Most examples I have encountered are within ONE python script: it spawns several processes and communicates among them using multiprocessing.Queue.
My question is: without using a message broker or a third supervising application, can TWO python scripts communicate with each other using multiprocessing.Queue?
The multiprocessing module is a package that supports spawning processes, so you can write code that executes in parallel. This means that you can write one python script that spawns multiple processes transparently, without worrying much about how these processes serialize data and pass it to each other.
As for your question, it depends... Why do they need to be separate?
If the only concern is that your functions are defined in different modules/scripts, you can just import everything you need in the script that uses the Queue and make all your functions available in one script.
If your use-case is that you want one script to wait for requests (server) & the other script to be a client (it sends requests to the server when needed and waits for response), then you need to implement some sort of RPC protocol.
You can make an http server using web frameworks like Flask & send http requests to it from the client, or if you only need to share short simple messages, you can implement your own message exchange protocol using sockets.
So to sum up: it is possible for two python processes to communicate without a message broker (e.g. through sockets). But you want to use multiprocessing if you want to run one python script that spawns multiple processes that can communicate with one another. If instead you need to start two independent scripts and have one of them ask the other to do some work and return the output, you need to implement some RPC protocol between them. The multiprocessing.Queue object itself is not a replacement for a message broker. If you want independently started scripts to communicate through a message queue, that queue needs to live either in one of the communicating processes (i.e. the server) or in a third process.
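A bare-bones example of the "sockets" option between two independently started scripts (the port and message format are assumptions):

    # server.py - started first, waits for a request
    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 6000))         # assumed port
    srv.listen(1)
    conn, _addr = srv.accept()
    request = conn.recv(1024)
    conn.sendall(b"done: " + request)     # reply to the client
    conn.close()

    # client.py - started later, sends one request and reads the reply
    import socket

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect(("127.0.0.1", 6000))
    cli.sendall(b"do some work")
    print(cli.recv(1024))
    cli.close()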

RabbitMQ: Consuming only one message at a time from multiple queues

I'm trying to stay connected to multiple queues in RabbitMQ. Each time I pop a new message from one of these queue, I'd like to spawn an external process.
This process will take some time to process the message, and I don't want to start processing another message from that specific queue until the one I popped earlier is completed. If possible, I wouldn't want to keep a process/thread around just to wait on the external process to complete and ack the server. Ideally, I would like to ack in this external process, maybe passing some identifier so that it can connect to RabbitMQ and ack the message.
Is it possible to design this system with RabbitMQ? I'm using Python and Pika, if this is relevant to the answer.
Thanks!
RabbitMQ can do this.
You only want to read from the queue when you're ready - so spin up a thread that can spawn the external process and watch it, then fetch the next message from the queue when the process is done. You can then have multiple threads running in parallel to manage multiple queues.
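With Pika, that per-queue loop might look roughly like this (the queue name and external command are assumptions):

    import subprocess
    import time
    import pika  # pip install pika

    def manage_queue(queue_name):
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        while True:
            method, _props, body = channel.basic_get(queue=queue_name,
                                                     auto_ack=False)
            if method is None:          # queue empty right now, poll again shortly
                time.sleep(1)
                continue
            # Block this thread on the external process; the message stays unacked
            # (and is not redelivered to anyone else) until we finish.
            subprocess.call(["process_message", body.decode("utf-8")])  # assumed command
            channel.basic_ack(delivery_tag=method.delivery_tag)

    # run one thread per queue, e.g.:
    # threading.Thread(target=manage_queue, args=("queue_a",)).start()

For long-running external processes you may need to tune or disable the connection heartbeat, or the broker will drop the idle connection while the worker is busy.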
I'm not sure what you want an ack for? Are you trying to stop RabbitMQ from adding new elements to that queue if it gets too full (because its elements are being processed too slowly/not at all)? There might be a way to do this when you add messages to the queues - before adding an item, check to make sure that the number of messages already in that queue is not "much greater than" the average across all queues?

Design a multi-process daemon

I am writing a web application in Python (Django) that will execute tasks/processes on the side, typically network scans. I would like the user to be able to terminate a scan and view its status or results in real time.
I thought one of the best ways to do this is to have a job manager daemon that is a stand-alone process, which:
Accepts new jobs via a TCP connection.
Accepts user-commands, typically to terminate or restart a process.
Reports on the status of a job.
I am struggling with the structure of this code. I am thinking that a TCP port on the daemon process will accept new jobs. It will then os.fork(), and that child will os.fork() again. The second child will perform an os.execv() for nmap. The first child will monitor the second (how?) and, when it completes, report back to the master daemon that it has ended. The first child must also be able to terminate the second child process.
How does that sound? Has a structure like this already been done? I would hate to reinvent the wheel.
Finally, how would the first child know that the second child, the one running the os.execv(), has terminated? Or whether it's still running? I would hate to continuously poll a list of processes.
And as I've said, this must be done in Python.
I opted for a fork-based approach. This approach is "wrong", but it works and fulfills my needs.
https://gist.github.com/FarhansCode/a0f27469142b6afaa6c2
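For reference, a generic sketch of the fork/exec/wait pattern described in the question (not the contents of the gist); report_back is a hypothetical hook for notifying the master daemon:

    import os

    def run_and_watch(argv):
        """Fork a watcher; the watcher forks again and execs the real command."""
        watcher = os.fork()
        if watcher == 0:                      # watcher (first child)
            worker = os.fork()
            if worker == 0:                   # worker (second child)
                os.execv(argv[0], argv)       # e.g. ["/usr/bin/nmap", "-sP", "10.0.0.0/24"]
            # os.waitpid() blocks until the worker exits - no polling needed.
            pid, status = os.waitpid(worker, 0)
            report_back(pid, os.WEXITSTATUS(status))   # hypothetical reporting hook
            os._exit(0)
        return watcher                        # the daemon keeps the watcher's pid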

Recommended way to send messages between threads in python?

I have read lots about python threading and the various means to 'talk' across thread boundaries. My case seems a little different, so I would like to get advice on the best option:
Instead of having many identical worker threads waiting for items in a shared queue, I have a handful of mostly autonomous, non-daemonic threads with unique identifiers going about their business. These threads do not block and normally do not care about each other. They sleep most of the time and wake up periodically. Occasionally, based on certain conditions, one thread needs to 'tell' another thread to do something specific - an action - meaningful to the receiving thread. There are many different combinations of actions and recipients, so using Events for every combination seems unwieldy. The Queue object seems to be the recommended way to achieve this. However, if I have a shared queue and post an item on it that has just one recipient thread, then every other thread needs to monitor the queue, pull every item, check whether it is addressed to it, and put it back in the queue if it was addressed to another thread. That seems like a lot of getting and putting items from the queue for nothing. Alternatively, I could employ a 'router' thread: one shared-by-all queue plus one queue for every 'normal' thread, shared with the router thread. Normal threads only ever put items in the shared queue; the router pulls every item, inspects it and puts it on the addressee's queue. Still, a lot of putting and getting items from queues...
Are there any other ways to achieve what I need to do ? It seems a pub-sub class is the right approach, but there is no such thread-safe module in standard python, at least to my knowledge.
Many thanks for your suggestions.
Instead of having many identical worker threads waiting for items in a shared queue
I think this is the right approach. Just remove identical and shared from the above statement, i.e.
having many worker threads waiting for items in queues
So I would suggest using Celery for this approach.
Occasionally, based on certain conditions, one thread needs to 'tell'
another thread to do something specific - an action, meaningful to the receiving thread.
This can be done by calling another celery task from within the calling task. All the tasks can have separate queues.
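Roughly, with Celery, that could look like the sketch below (the broker URL, task names and queue names are assumptions); a worker started with -Q thread_b would consume B's queue:

    from celery import Celery  # pip install celery

    app = Celery("threads", broker="redis://localhost:6379/0")  # assumed broker

    @app.task
    def thread_b_action(payload):
        print("B received:", payload)

    @app.task
    def thread_a_work():
        # ...do A's own periodic work, then tell B to do something specific:
        thread_b_action.apply_async(args=[{"cmd": "refresh"}], queue="thread_b")

    # run a worker for B's queue with: celery -A <your module> worker -Q thread_b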
Thanks for the response. After some thought, I have decided to use the approach of many queues and a router thread (hub-and-spoke). Every 'normal' thread has its private queue to the router, enabling separate send and receive queues or 'channels'. The router's queue is shared by all threads (as a property) and used by 'normal' threads as a send-only channel, i.e. they only post items to this queue and only the router pulls items from it. Additionally, each 'normal' thread uses its own queue as a receive-only channel on which it listens and which is shared only with the router. Threads register themselves with the router on the router queue/channel; the router maintains a list of registered threads, including their queues, so it can send an item to a specific thread after its registration.
This means that peer to peer communication is not possible, all communication is sent via the router.
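A stripped-down sketch of that hub-and-spoke routing (the message format and registration convention are assumptions):

    import threading
    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2

    router_q = queue.Queue()    # the shared, send-only channel into the router

    def router():
        inboxes = {}                               # thread name -> private queue
        while True:
            sender, recipient, action = router_q.get()
            if action == "register":               # 'recipient' carries the new inbox
                inboxes[sender] = recipient
            elif recipient in inboxes:
                inboxes[recipient].put((sender, action))

    router_thread = threading.Thread(target=router)
    router_thread.daemon = True    # daemonised here only so this sketch can exit
    router_thread.start()

    # A 'normal' thread registers its inbox, then addresses peers by name:
    my_inbox = queue.Queue()
    router_q.put(("thread-A", my_inbox, "register"))
    router_q.put(("thread-A", "thread-B", "do-something"))  # delivered once B registers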
There are several reasons I did it this way:
1. There is no logic in the thread for checking if an item is addressed to 'me', making the code simpler and no constant pulling, checking and re-putting of items on one shared queue. Threads only listen on their queue, when a message arrives the thread can be sure that the message is addressed to it, including the router itself.
2. The router can act as a message bus, do vocabulary translation and has the possibility to address messages to external programs or hosts.
3. Threads don't need to know anything about the other threads' capabilities; they just speak the language of the router. In a peer-to-peer world, all peers must be able to understand each other, and since my threads are of many different classes, I would have to teach each class all the other classes' vocabulary.
Hope this helps someone some day when faced with a similar challenge.

Queueing: N producers to N consumers

The requirement is as follows:
There are N producers that generate messages or jobs or whatever you want to call them.
Messages from each producer must be processed in order, and each message must be processed exactly once.
There's one more restriction: at any time, for any given producer, there must be no more than one message being processed.
The consuming side consists of a number of threads (they are identical in their functionality) that are spread across a number of processes - it is a WSGI application run via mod_wsgi.
At the moment, the queueing on the consuming side is implemented as a custom queue that subclasses Queue, but it has its own problems that I won't get into, the main one being that the queue is lost when the process restarts.
Is there a product that will make it possible to fulfill the requirements I've outlined above? Support for persistence would be great, though it is not so important (since the queue would no longer reside in the worker process's memory).
There are many products that do what you are looking for. People with Django experience will probably tell you "celery", but that's not a complete answer. Celery is a (useful) wrapper around the actual queuing system, and using a wrapper doesn't mean you don't have to think about your underlying technology.
ZeroMQ, Redis, and RabbitMQ are a few different solutions that come to mind. There are of course more options. I'm fairly certain that no queueing solution will support your "at any time for any given producer there must be not more than one message that is being processed" requirement as a configuration parameter; you should probably implement this requirement at the producer (i.e. do not submit job #2 until you receive confirmation that job #1 has completed).
Redis is not a real queueing system, but a very fast database with pub/sub features; you would not be able to use Redis pub/sub to satisfy the "job processed exactly once" requirement out of the box, although you could use Redis pub/sub to publish jobs to a single subscriber which then pushes them into the database as a list (a poor man's queue). Your consumers would then atomically pull a job from the list. It'll work if you want to go this route.
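That "poor man's queue" boils down to a push onto a Redis list on one side and a blocking, atomic pop on the other (the key name and payload are assumptions):

    import json
    import redis  # pip install redis

    r = redis.Redis()

    # producer side: append a job to the per-producer list
    r.rpush("jobs:producer-1", json.dumps({"task": "scan", "target": "10.0.0.1"}))

    # consumer side: atomically take the oldest job, blocking until one arrives;
    # each job is handed to exactly one consumer.
    _key, raw = r.blpop("jobs:producer-1")
    job = json.loads(raw)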
RabbitMQ is an "enterprise" queueing system, and would absolutely meet your requirements, but you'd have to deploy the RabbitMQ server somewhere, and it might be overkill. For the record, I use RabbitMQ on numerous projects, and it gets the job done. Set up a "direct"-type exchange, bind it to a single queue, and subscribe all your consumers to this queue. You get pretty good persistence from RabbitMQ too.
ZeroMQ has a very very flexible queueing model, and ZeroMQ can absolutely be made to do what you want. ZeroMQ is basically just the transport mechanism though, so when it comes to making your publishers and subscribers and a broker to distribute them, you may end up rolling your own.
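For example, a minimal PUSH/PULL sketch with pyzmq (the endpoint is an assumption); anything beyond raw job distribution, such as the per-producer ordering constraint, you would still build yourself:

    import zmq  # pip install pyzmq

    ctx = zmq.Context()

    # consumer side (normally its own process): PULL socket receives jobs
    pull = ctx.socket(zmq.PULL)
    pull.connect("tcp://127.0.0.1:5557")   # assumed endpoint

    # producer side: PUSH socket distributes jobs round-robin to connected PULLs
    push = ctx.socket(zmq.PUSH)
    push.bind("tcp://127.0.0.1:5557")
    push.send_json({"producer": 1, "job": 42})

    print(pull.recv_json())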
