Issues with Pool and multiprocessing in Python 3 - python

Currently I'm trying to convert my little Python script to support multiple threads/cores. I've been reading about the multiprocessing module for several days now, and I've also been trying to get it to suit my needs for some time, but I still don't have a clue why it won't work.
This is the working code, and this is my approach to implementing the pool workers. As there are no locks in place and I didn't want to make it too complicated at first, I have already disabled logging to file.
Still it doesn't work. It doesn't even output any kind of error message. After running it, it just displays the welcome message and then keeps running, but without producing any of the desired output, which would be 2 lines per converted file (before + after converting).

All your workers do is wait for the started subprocesses to finish. They don't have any real work to do, as that is performed by the external subprocesses, so they will be idle all the time.
Using multiprocessing for what you do really is overkill; it's much more appropriate to use threads for that.
If you want to learn how to do multiprocessing, try something which involves inter-process communication, synchronisation, pipes, ...
But to also address your question:
Have a look at what arguments subprocess.call takes. You call it with a single space-separated command string. If you want that to work you have to pass shell=True, otherwise the whole string is interpreted as the executable's name.
The preferred way to call a program using subprocess is to specify the program and its arguments as a list:
subprocess.Popen(['/path/to/program', 'arg1', 'arg2'], *otherarguments)
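For illustration, a minimal sketch of the two forms (the ffmpeg command line is a made-up placeholder, not the converter from the question):

import subprocess

# A single command string is only split into program + arguments by a shell:
subprocess.call('ffmpeg -i input.avi output.mp4', shell=True)   # placeholder command

# Preferred: pass program and arguments as a list, no shell involved:
subprocess.call(['ffmpeg', '-i', 'input.avi', 'output.mp4'])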

Related

How to run multiple files in python

So the title doesn't explain much.
I have a function that should run in a separate .py file.
So I have a Python file that takes some variables and waits for an event to happen, then continues until it finishes. You can think of this as an event listener (web socket).
This file runs without giving output, just does some work, and closes when the event finishes. Running one file on its own is no problem, but I want to run more than 10 of these at the same time, for different purposes but the same event, and this causes problems where some of them don't work or miss the event.
I do this by running 10 terminals (cmd or shell), which I think creates problems because of running this much event handling in shells; in the future I might use more than 10 files, maybe 50-100.
So what I tried:
I tried a one-file approach using threading (multi-threading) and it didn't work.
My goal
I want help with this problem so I can run as many of these files as needed without missing events and without slowing down the system. I am open to ideas.
concurrent.futures could be a good option; it executes a piece of code in another thread. See the documentation: https://docs.python.org/3/library/concurrent.futures.html.
The threading library that comes with Python allows you to start the same function multiple times in different threads without having to wait for them to finish. See the documentation: https://docs.python.org/3/library/threading.html.
A similar API in the multiprocessing library allows you to do the same in multiple processes. Documentation: https://docs.python.org/3/library/multiprocessing.html. One difference is that in Python threads are virtual, all managed within the single interpreter process. With multiprocessing you start several processes, which probably has less impact on performance.
The code you have to run in a process or a thread has to be in a defined function. It seems that this code is in a separate .py file, a module, so you first have to import it (https://docs.python.org/3/tutorial/modules.html). Then one file manages the threads/processes in a loop, another holds the code listening for the event, and only one terminal is required to start them, as sketched below.
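A rough sketch of that layout, assuming your event-listening file has been wrapped in a module here called listener with a function listener.run(settings) (both names are placeholders, not your actual code):

import threading

import listener  # hypothetical module: your .py file with its code moved into run()

settings_list = ['purpose-1', 'purpose-2', 'purpose-3']  # placeholder per-listener settings

threads = [threading.Thread(target=listener.run, args=(s,)) for s in settings_list]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every listener has handled its event and returned

Swapping threading.Thread for multiprocessing.Process gives you the process-based variant with the same structure.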
You can use multithreading.
On this page you will find some very useful examples of what you want (I recommend using concurrent.futures because with newer versions of Python 3 you can run into some issues using threading directly):
https://realpython.com/intro-to-python-threading/#using-a-threadpoolexecutor
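Along the lines of that page, a minimal ThreadPoolExecutor sketch (again assuming a hypothetical listener.run() function holding your listening code):

from concurrent.futures import ThreadPoolExecutor

import listener  # hypothetical module holding your event-listening function

purposes = range(10)  # e.g. ten listeners for ten different purposes

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(listener.run, p) for p in purposes]
    for future in futures:
        future.result()  # blocks until done and re-raises any exception from the worker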

Python file-based queue that is process-safe

Is there a process safe persistent (disk-based) Python FIFO queue?
Could someone provide a simple example with a script that writes a string to the queue, and another that reads one-string-at-a-time from the queue? Note that each of these processes can be launched multiple times from the command line. I.e. multiple writers and multiple readers.
Background: I have several scripts that each produces some output every once in a while. I would like to be aware of the output they create, and use another script(s) to process the information they produce. Unfortunately I cannot use the multiprocessing or threading modules of Python, because the scripts can run from different machines on the same file system. I.e. each of the scripts is launched from the command line.
All I need is that each of the queue elements is a string, and that the queue is process-safe.
Edit: I found the queuelib and pqueue modules, but they don't provide process safety. I am thinking of maybe replacing the file handling of queuelib with atomicfile calls, but I am afraid of spending too long debugging it, since I am short of time and there are always fine details to take care of when writing software that should be process-safe.
Note1: I am using Python 2.7.
Note2: I am aware of this question, however my requirements are very modest, and I am looking for a simple solution, while the provided answers on that question refer to complex libraries, with no examples on how to use the queues.
EDIT (2019): For anyone landing here: I found that an SQLite database provides atomic access, it has a Python API, and it is serverless. Therefore, I believe it is possible (and easy) to implement the queue in my question with it.
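For example, a rough sketch of a disk-based FIFO on top of SQLite (table/column names and the queue.db path are made up, error handling is omitted); BEGIN IMMEDIATE takes the write lock, so two readers cannot pop the same row:

import sqlite3

DB = 'queue.db'  # any path on the shared file system

def _connect():
    # isolation_level=None -> autocommit; transactions are managed explicitly below
    conn = sqlite3.connect(DB, timeout=30, isolation_level=None)
    conn.execute('CREATE TABLE IF NOT EXISTS fifo '
                 '(id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)')
    return conn

def put(item):
    conn = _connect()
    try:
        conn.execute('INSERT INTO fifo (item) VALUES (?)', (item,))
    finally:
        conn.close()

def get():
    conn = _connect()
    try:
        conn.execute('BEGIN IMMEDIATE')  # lock out other readers while we pop
        row = conn.execute('SELECT id, item FROM fifo ORDER BY id LIMIT 1').fetchone()
        if row is not None:
            conn.execute('DELETE FROM fifo WHERE id = ?', (row[0],))
        conn.execute('COMMIT')
        return None if row is None else row[1]
    finally:
        conn.close()

Multiple writers call put() and multiple readers call get() from different processes (or machines, if the file system's locking is reliable; network file systems often are not, which is worth checking).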

How to pass variables in parent to subprocess in python?

I am trying to have a parent Python script send variables to a child script to help me speed up and automate video analysis.
I am now using the subprocess.Popen() call to start up 6 instances of a child script, but cannot find a way to pass variables and modules already called for in the parent to the child. For example, the parent file would have:
import os
import subprocess
import sys
parent_dir = os.path.realpath(sys.argv[0])
subprocess.Popen([sys.executable, 'analysis.py'])
but then the imports and parent_dir have to be declared again in "analysis.py". Is there a way to pass them to the child?
In short, what I am trying to achieve is: I have a folder with a couple hundred video files. I want the parent python script to list the video files and start up to 6 parallel instances of an analysis script that each analyse one video file. If there are no more files to be analysed the parent file stops.
The simple answer here is: don't use subprocess.Popen, use multiprocessing.Process. Or, better yet, multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor.
With subprocess, your program's Python interpreter doesn't know anything about the subprocess at all; for all it knows, the child process is running Doom. So there's no way to directly share information with it.* But with multiprocessing, Python controls launching the subprocess and getting everything set up so that you can share data as conveniently as possible.
Unfortunately "as conveniently as possible" still isn't 100% as convenient as all being in one process. But what you can do is usually good enough. Read the section on Exchanging objects between processes and the following few sections; hopefully one of those mechanisms will be exactly what you need.
But, as I implied at the top, in most cases you can make it even simpler, by using a pool. Instead of thinking about "running 6 processes and sharing data with them", just think about it as "running a bunch of tasks on a pool of 6 processes". A task is basically just a function—it takes arguments, and returns a value. If the work you want to parallelize fits into that model—and it sounds like your work does—life is as simple as could be. For example:
import multiprocessing
import os
import sys
import analysis
folderpath = os.path.dirname(os.path.realpath(sys.argv[0]))  # or wherever your video files live
paths = [os.path.join(folderpath, file)
         for file in os.listdir(folderpath)]
with multiprocessing.Pool(processes=6) as pool:
    results = pool.map(analysis.analyze, paths)
If you're using Python 3.2 or earlier (including 2.7), you can't use a Pool in a with statement. I believe you want this:**
pool = multiprocessing.Pool(processes=6)
try:
    results = pool.map(analysis.analyze, paths)
finally:
    pool.close()
    pool.join()
This will start up 6 processes,*** then tell the first one to do analysis.analyze(paths[0]), the second to do analysis.analyze(paths[1]), etc. As soon as any of the processes finishes, the pool will give it the next path to work on. When they're all finished, you get back a list of all the results.****
Of course this means that the top-level code that lived in analysis.py has to be moved into a function def analyze(path): so you can call it. Or, even better, you can move that function into the main script, instead of a separate file, if you really want to save that import line.
* You can still indirectly share information by, e.g., marshaling it into some interchange format like JSON and pass it via the stdin/stdout pipes, a file, a shared memory segment, a socket, etc., but multiprocessing effectively wraps that up for you to make it a whole lot easier.
** There are different ways to shut a pool down, and you can also choose whether or not to join it immediately, so you really should read up on the details at some point. But when all you're doing is calling pool.map, it really doesn't matter; the pool is guaranteed to shut down and be ready to join nearly instantly by the time the map call returns.
*** I'm not sure why you wanted 6; most machines have 4, 8, or 16 cores, not 6; why not use them all? The best thing to do is usually to just leave out the processes=6 entirely and let multiprocessing ask your OS how many cores to use, which means it'll still run at full speed on your new machine with twice as many cores that you'll buy next year.
**** This is slightly oversimplified; usually the pool will give the first process a batch of files, not one at a time, to save a bit of overhead, and you can manually control the batching if you need to optimize things or sequence them more carefully. But usually you don't care, and this oversimplification is fine.

Python subprocess signal

I would like to establish a very simple communication between two Python scripts. I have decided that the best way to communicate is to have both scripts read from a text file. I would like the main program to wait while the child programs execute.
Normally I would make the main program wait x amount of time and continuously check the text file for an okay flag. However, I have seen people talk about using a signal.
Could someone please give an example of this?
There is a Popen.send_signal() method that allows you to send a signal to a child process.
Here's a code example that sends SIGINT to a ping subprocess so that it prints its summary in the output on exit.
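Something along these lines (Unix only here, since -c and the SIGINT summary are Unix ping behaviour; example.com is a placeholder host):

import signal
import subprocess
import time

# Start ping, let it run briefly, then send SIGINT so it prints
# its summary statistics before exiting.
proc = subprocess.Popen(['ping', '-c', '10', 'example.com'])
time.sleep(3)
proc.send_signal(signal.SIGINT)
proc.wait()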
You need one process to write and one to read; both processes reading leads to no communication. Signals are used only for special purposes, not for normal inter-process communication. Use something like pipes or sockets. It's not more complicated than files, but much more powerful.
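For example, the parent can talk to the child over the child's stdin/stdout pipes (a minimal sketch; child.py is a hypothetical script that reads a line from stdin and writes an "okay" line to stdout when it is done):

import subprocess

child = subprocess.Popen(['python', 'child.py'],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         universal_newlines=True)
child.stdin.write('start\n')
child.stdin.flush()
print(child.stdout.readline().strip())  # e.g. the child's "okay" flag
child.wait()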

Use several workers to execute python code

I'm executing Python code on several files. Since the files are all very big, and since one call treats only one file, it takes very long until the last file is treated. Hence my question: is it possible to use several workers which treat the files in parallel?
Is this a possible invocation?:
import annotation as annot # this is a .py-file
import multiprocessing
pool = multiprocessing.Pool(processes=4)
pool.map(annot, "")
The .py-file uses for-loops (etc.) to get all files by itself.
The problem is: if I have a look at all the processes (with 'top'), I only see 1 process which is working with the .py-file. So... I suspect that I shouldn't use multiprocessing like this... should I?
Thanks for any help! :)
Yes. Use multiprocessing.Pool.
import multiprocessing
pool = multiprocessing.Pool(processes=<pool size>)
result = pool.map(<your function>, <file list>)
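Concretely, this assumes the per-file logic in your annotation module is wrapped in a function that takes a single path; annotate_file and the glob pattern below are placeholders for whatever your module actually provides:

import glob
import multiprocessing

import annotation as annot  # the module from your question

def process_one(path):
    # Each call treats exactly one file; the for-loop over all files
    # moves out of the module and into the pool.
    return annot.annotate_file(path)  # assumed per-file entry point

if __name__ == '__main__':
    files = glob.glob('data/*.txt')  # placeholder pattern for your input files
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(process_one, files)

This also likely explains why you only saw one process in 'top': mapping over an empty string gives the pool nothing to hand out, so any work still happens in the single process that imported the module.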
My answer is not purely a python answer though I think it's the best approach given your problem.
This will only work on Unix systems (OS X/Linux/etc.).
I do stuff like this all the time, and I am in love with GNU Parallel. See this also for an introduction by the GNU Parallel developer. You will likely have to install it, but it's worth it.
Here's a simple example. Say you have a python script called processFiles.py:
#!/usr/bin/python
#
# Script to print out file name
#
import sys

fileName = sys.argv[1]  # first command line argument (the file passed in by parallel)
print( fileName )       # adapt for python 2.7 if you need to
To make this file executable:
chmod +x processFiles.py
And say all your large files are in largeFileDir. Then to run all the files in parallel with four processors (-P4), run this at the command line:
$ parallel -P4 ./processFiles.py ::: largeFileDir/*
This will output
file1
file3
file7
file2
...
They may not be in order because each job is running independently in parallel. To adapt this to your process, insert your file processing script instead of just stupidly printing the file to screen.
This is preferable to threading in your case because each file processing job will get its own instance of the Python interpreter. Since each file is processed independently (or so it sounds) threading is overkill. In my experience this is the most efficient way to parallelize a process like you describe.
There is something called the Global Interpreter Lock that I don't understand very well, but it has caused me headaches when trying to use Python built-ins to multithread. Which is why I say: if you don't need to thread, don't. Instead do as I've recommended and start up independent Python processes.
There are many options.
multiple threads
multiple processes
"green threads", I personally like Eventlet
Then there are more "Enterprise" solutions, which are even able to run workers on multiple servers, e.g. Celery; for more, search for "distributed task queue python".
In all cases, your scenario will become more complex, and sometimes you will not gain much, e.g. if your processing is limited by I/O operations (reading the data) and not by computation and processing.
Yes, this is possible. You should investigate the threading module and the multiprocessing module. Both will allow you to execute Python code concurrently. One note with the threading module, though, is that because of the way Python is implemented (Google "python GIL" if you're interested in the details), only one thread will execute at a time, even if you have multiple CPU cores. This is different from the threading implementation in other languages, where each thread will run at the same time, each potentially using a different core. Because of this limitation, in cases where you want to do CPU-intensive operations concurrently, you'll get better performance with the multiprocessing module.
