Use several workers to execute python code - python

I'm executing python code on several files. Since the files are all very big and since one call one treats one file, it lasts very long till the final file is treated. Hence, here is my question: Is it possible to use several workers which treat the files in parallel?
Is this a possible invocation?:
import annotation as annot # this is a .py-file
import multiprocessing
pool = multiprocessing.Pool(processes=4)
pool.map(annot, "")
The .py-file uses for-loops (etc.) to get all files by itself.
The problem is: If I have a look at all the processes (with 'top'), I only see 1 process which is working with the .py-file. So...I suspect that I shouldn't use multiprocessing like this...does I?
Thanks for any help! :)

Yes. Use multiprocessing.Pool.
import multiprocessing
pool = multiprocessing.Pool(processes=<pool size>)
result = pool.map(<your function>, <file list>)

My answer is not purely a python answer though I think it's the best approach given your problem.
This will only work on Unix systems (OS X/Linux/etc.).
I do stuff like this all the time, and I am in love with GNU Parallel. See this also for an introduction by the GNU Parallel developer. You will likely have to install it, but it's worth it.
Here's a simple example. Say you have a python script called processFiles.py:
#!/usr/bin/python
#
# Script to print out file name
#
fileName = sys.argv[0] # command line argument
print( fileName ) # adapt for python 2.7 if you need to
To make this file executable:
chmod +x processFiles.py
And say all your large files are in largeFileDir. Then to run all the files in parallel with four processors (-P4), run this at the command line:
$ parallel -P4 processFiles.py ::: $(ls largeFileDir/*)
This will output
file1
file3
file7
file2
...
They may not be in order because each thread is operating independently in parallel. To adapt this to your process, insert your file processing script instead of just stupidly printing the file to screen.
This is preferable to threading in your case because each file processing job will get its own instance of the Python interpreter. Since each file is processed independently (or so it sounds) threading is overkill. In my experience this is the most efficient way to parallelize a process like you describe.
There is something called the Global Interpreter Lock that I don't understand very well, but has caused me headaches when trying to use python built-ins to hyperthread. Which is why I say if you don't need to thread, don't. Instead do as I've recommended and start up independent python processes.

There are many options.
multiple threads
multiple processes
"green threads", I personally like Eventlet
Then there are more "Enterprise" solutions, which are even able running workers on multiple servers, e.g. Celery, for more search Distributed task queue python.
In all cases, your scenario will become more complex and sometime you will not gain much, e.g. if your processing is limited by I/O operations (reading the data) and not by computation and processing.

Yes, this is possible. You should investigate the threading module and the multiprocessing module. Both will allow you to execute Python code concurrently. One note with the threading module, though, is that because of the way Python is implemented (Google "python GIL" if you're interested in the details), only one thread will execute at a time, even if you have multiple CPU cores. This is different from the threading implementation in our languages, where each thread will run at the same time, each using a different core. Because of this limitation, in cases where you want to do CPU-intensive operations concurrently, you'll get better performance with the multiprocessing module.

Related

How to run multiple files in python

so title doesn't explain much.
I have a function where it should run on a separate .py file.
So I have a python file where it takes some variables and waits for event to happen than continue until it finishes. You can think this as event listener (web socket).
This file runs and doesn't give output just does some functions and when event finishes it closes. So running one file only is no problem but I want to run more than 10 of these at the same time for different purposes but same event, this causes problems where some of them doesn't work or miss the event.
I do this by running 10 terminals (cmd or shell). Which I think it creates problem because of running this much of event handling in shells, in future I might use more than 10 files maybe 50-100.
So what I tried:
I tried one-file using threading (multi-threading) and it didn't work.
My goal
I want help with this problem where I can run as many as these files without missing events and slowing down the system. I am open to ideas.
Concurrent.future could be a good option, it execute a piece of
code in another thread. See documentation
https://docs.python.org/3/library/concurrent.futures.html.
The threading library that comes with Python allows you to start
mutliple times the same function in different threads whithout having
to wait for them to finish. See the documentation
ttps://docs.python.org/3/library/threading.html.
A similar API is in the library multiprocessing allow you do the
same in pultiple processes. Documentation is
https://docs.python.org/3/library/multiprocessing.html. One
difference is that in Python threads are virtual, all manage in the
single interpreter process. With multiprocessing you start several
processes and probably have less impact on the performance.
The code you have to run in a process or a thread has to be in a defined function. It seems that this code is in a separate .py file, a module, therefore you have to import it (https://docs.python.org/3/tutorial/modules.html) first. So one file manage the thread/multiprocess in a loop, another for the code listening the event and only one terminal will be required to start them.
You can use multi Threading.
here in this page you will find some very useful examples of what you want (I recommend using the concurrent.futures cause in the new version of python 3 you will run into some bugs using the threading ).
https://realpython.com/intro-to-python-threading/#using-a-threadpoolexecutor

How to pass variables in parent to subprocess in python?

I am trying to have a parent python script sent variables to a child script to help me speed-up and automate video analysis.
I am now using the subprocess.Popen() call to start-up 6 instances of a child script but cannot find a way to pass variables and modules already called for in the parent to the child. For example, the parent file would have:
import sys
import subprocess
parent_dir = os.path.realpath(sys.argv[0])
subprocess.Popen(sys.executable, 'analysis.py')
but then import sys; import subprocess; parent_dir have to be called again in "analysis.py". Is there a way to pass them to the child?
In short, what I am trying to achieve is: I have a folder with a couple hundred video files. I want the parent python script to list the video files and start up to 6 parallel instances of an analysis script that each analyse one video file. If there are no more files to be analysed the parent file stops.
The simple answer here is: don't use subprocess.Popen, use multiprocessing.Process. Or, better yet, multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor.
With subprocess, your program's Python interpreter doesn't know anything about the subprocess at all; for all it knows, the child process is running Doom. So there's no way to directly share information with it.* But with multiprocessing, Python controls launching the subprocess and getting everything set up so that you can share data as conveniently as possible.
Unfortunately "as conveniently as possible" still isn't 100% as convenient as all being in one process. But what you can do is usually good enough. Read the section on Exchanging objects between processes and the following few sections; hopefully one of those mechanisms will be exactly what you need.
But, as I implied at the top, in most cases you can make it even simpler, by using a pool. Instead of thinking about "running 6 processes and sharing data with them", just think about it as "running a bunch of tasks on a pool of 6 processes". A task is basically just a function—it takes arguments, and returns a value. If the work you want to parallelize fits into that model—and it sounds like your work does—life is as simple as could be. For example:
import multiprocessing
import os
import sys
import analysis
parent_dir = os.path.realpath(sys.argv[0])
paths = [os.path.join(folderpath, file)
for file in os.listdir(folderpath)]
with multiprocessing.Pool(processes=6) as pool:
results = pool.map(analysis.analyze, paths)
If you're using Python 3.2 or earlier (including 2.7), you can't use a Pool in a with statement. I believe you want this:**
pool = multiprocessing.Pool(processes=6)
try:
results = pool.map(analysis.analyze, paths)
finally:
pool.close()
pool.join()
This will start up 6 processes,*** then tell the first one to do analysis.analyze(paths[0]), the second to do analysis.analyze(paths[1]), etc. As soon as any of the processes finishes, the pool will give it the next path to work on. When they're all finished, you get back a list of all the results.****
Of course this means that the top-level code that lived in analysis.py has to be moved into a function def analyze(path): so you can call it. Or, even better, you can move that function into the main script, instead of a separate file, if you really want to save that import line.
* You can still indirectly share information by, e.g., marshaling it into some interchange format like JSON and pass it via the stdin/stdout pipes, a file, a shared memory segment, a socket, etc., but multiprocessing effectively wraps that up for you to make it a whole lot easier.
** There are different ways to shut a pool down, and you can also choose whether or not to join it immediately, so you really should read up on the details at some point. But when all you're doing is calling pool.map, it really doesn't matter; the pool is guaranteed to shut down and be ready to join nearly instantly by the time the map call returns.
*** I'm not sure why you wanted 6; most machines have 4, 8, or 16 cores, not 6; why not use them all? The best thing to do is usually to just leave out the processes=6 entirely and let multiprocessing ask your OS how many cores to use, which means it'll still run at full speed on your new machine with twice as many cores that you'll buy next year.
**** This is slightly oversimplified; usually the pool will give the first process a batch of files, not one at a time, to save a bit of overhead, and you can manually control the batching if you need to optimize things or sequence them more carefully. But usually you don't care, and this oversimplification is fine.

What is the difference between running multiple python scripts and multiprocessing

I was thinking about adding multiprocessing to one of my scripts to
increase performance.
It's nothing fancy, there are 1-2 parameters to the main method.
Is there anything wrong with just running four of the same scripts cloned
on the terminal versus actually adding multiprocessing into the python code?
ex for four cores:
~$ script.py &
script.py &
script.py &
script.py;
I've read that linux/unix os's automatically divide the
programs amongst the available cores.
Sorry if ^ stuff i've mentioned above is totally wrong. Nothing above was
formally learned, it was just all online stuff.
Martijn Pieters' comment hits the nail on the head, I think. If each of your processes only consumes a small amount of memory (so that you can easily have all four running in parallel without running out of RAM) and if your processes do not need to communicate with each other, it is easiest to start all four processes independently, from the shell, as you suggested.
The python multiprocessing module is very useful if you have slightly more complex requirements. You might, for instance, have a program that needs to run in serial at startup, then spawn several copies for the more computationally intensive section, and finally do some post-processing in serial. multiprocessing would be invaluable for this sort of synchronization.
Alternatively, your program might require a large amount of memory (maybe to store a large matrix in scientific computing, or a large database in web programming). multiprocessing lets you share that object among the different processes so that you don't have n copies of the data in memory (using multiprocessing.Value and multiprocessing.Array objects).
Using multiprocessing might also become the better solution if you'd want to run your script, say, 100 times on only 4 cores.
Then your terminal-based approach would become pretty nasty.
In this case you might want to use a Pool from the multiprocessing module.

Issues with Pool and multiprocessing in Python 3

Currently I'm trying to convert my little python script to support multiple threads/cores. I've been reading about the multiprocessing module for several days now and I've also been trying to get it to suit my needs for some time, still I don't have a clue why it won't work.
This is the working code, and this is my approach on implementing the pool workers. As there are no locks in place and I didn't want to make it too complicated at first I already disabled the logging to file.
Still it doesn't work. It doesn't even output any kind of error message. After running it it just displays the welcome message and then it just keeps running, but without outputting any of the desired output, which would be 2 lines per converted file (before + after converting).
all your workers do is wait for started subprocesses to finish. they don't have any real work to do as that is performed by the external subprocesses, so they will be idle all the time.
using multiprocessing for what you do really is overkill, it's much more appropriate to use threads for that.
if you want to learn how to do multiprocessing, try something which involves inter-process communication, synchronisation, pipes, ...
but to also address your question:
hava a look at what arguments subprocess.call takes. you call it with a single space-separated command string. if you want that to work you have to pass shell=True, otherwise the whole string is interpreted as the executable's name.
the preferred way to call a program using subprocess is is to specify program and arguments as a list:
subprocess.Popen(['/path/to/program', 'arg1', 'arg2'], *otherarguments)

Parallelize my python program

I have a python program that reads a line from a input file, does some manipulation and writes it to output file. I have a quadcore machine, and I want to utilize all of them. I think there are two alternatives to do this,
Creating n multiple python processes each handling a total number of records/n
Creating n threads in a single python process for every input record and each thread processing a record.
Creating a pool of n threads in a single python process, each executing a input record.
I have never used python mutliprocessing capabilities, can the hackers please tell which method is best option?
The reference implementation of the Python interpreter (CPython) holds the infamous "Global Interpreter Lock" (GIL), effectively allowing only one thread to execute Python code at a time. As a result, multithreading is very limited in Python -- unless your heavy lifting gets done in C extensions that release the GIL.
The simplest way to overcome this limitation is to use the multiprocessing module instead. It has a similar API to threading and is pretty straight-forward to use. In your case, you could use it like this (assuming that the manipulation is the hard part):
import multiprocessing
def process_line(line):
# This function is executed in your worker processes. Manipulate the
# line and return the results.
return manipulate(line)
if __name__ == '__main__':
with open('input.txt') as fin, open('output.txt', 'w') as fout:
# This creates a pool of N worker processes, where N is the number
# of CPUs in your machine.
pool = multiprocessing.Pool()
# Let the workers do the manipulation and write the results to
# the output file:
for manipulated_line in pool.imap(process_line, fin):
fout.write(manipulated_line)
Number one is the right answer.
First of all, it is easier to create and manage multiple processes than multiple threads. You can use the multiprocessing module or something like pyro to take care of the details. Secondly, threading needs to deal with Python's global interpreter lock which makes it more complicated even if you are an expert at threading with Java or C#. And most importantly, performance on multicore machines is harder to predict than you might think. If you haven't implemented and measured two different ways to do things, your intuition as to which way is fastest, is probably wrong.
By the way if you really are an expert at Java or C# threading, then you probably should go with threading instead, but use Jython or IronPython instead of CPython.
Reading the same file from several processes concurrently is tricky. Is it possible to split the file beforehand?
While Python has the GIL both Jython and IronPython hasn't that limitation.
Also make sure that a simple single process doesn't already max disk I/O. You will have a hard time gaining anything if it does.

Categories

Resources