Python: Non-Responsive multiprocessing.pool.map_async() function

Python: Non-Responsive multiprocessing.pool.map_async() function - python

I have a strange problem here.
I have a python program that executes code held in seperate .py files, designed to be executed in sequence, one after another. The codes work fine, however they take too long to run. My plan was to split up processing each of these .py files amongst 4 processors using multiprocessing.pool.map_async(function, arguments) using execfile() as the function and the filename as the argument.
So anyways, when I run the code, absolutely nothing happens at all, not even an error.
Take a look and see if you can help me out, I run the file in SeqFile.runner(SeqFile.file).
class FileRunner:
def __init__(self, file):
self.file = file
def runner(self, file):
self.run = pool.map_async(execfile, file)
SeqFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/SequencePickler.py")
VolFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/VolumePickler.py")
CWFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/Combine_and_Write.py")
(SeqFile.runner(SeqFile.file))

You have several problems here - I'm guessing you never used multiprocessing before.
One of your problems is that you fire off an async operation but never wait for it to end. If you did wait for it to end, you'd get more info. For example, add:
result = SeqFile.run.get()
Do that, and you'll see the exception raised in the child process: you're mapping execfile over the string bound to file, so execfile sees one character at a time. execfile barfs when the first thing it tries to do is (in effect):
execfile("/")
apply_async() would make a lot more sense, or map_async() passed a list of all the files you want to run.
Etc - this gets tedious ;-)
Specifics
Let's get rid of the irrelevant cruft here, and show a complete executable program. I have three files a.py, b.py and c.py. Here's a.py:
print "I'm A!"
The other two are the obvious variations.
Here's my entire driver:
if __name__ == "__main__":
import multiprocessing as mp
files = ["a.py", "b.py", "c.py"]
pool = mp.Pool(2)
pool.imap_unordered(execfile, files)
pool.close()
pool.join()
That's all it takes, and prints (some permutation of):
I'm A!
I'm B!
I'm C!
imap_unordered() splits the list of files up among the worker processes, and doesn't care ("unordered") which order they run in. That's maximally efficient. Note that I restricted the number of workers to 2, just to show that it works fine even though there are more files (3) than worker processes (2).
You can get any of the Pool functions to work similarly. If you have ;-) to use map_async(), for example, replace the imap_unordered() call with:
async = pool.map_async(execfile, files)
async.get()
Or:
asyncs = [pool.apply_async(execfile, (fn,)) for fn in files]
for a in asyncs:
a.get()
Clearer? Keep it as simple as possible at first.

Related

Issues with files using Multithreading

My code is conceptually something like this
import multiprocessing.dummy
def working_with_files(test_file):
open test_file
...bunch of stuff...
create_fileA(variable)
create_fileB_from_fileA(fileA)
os.remove(fileA)
if __name__ == "__main__":
files = glob("/Users/Name/Documents/TestData/*")
pool = multiprocessing.dummy.Pool(8)
results = pool.map(working_with_files, files)
pool.close()
pool.join()
From my understanding, each thread is running concurrently, but inside each thread, its still happening in sequence. Since each thread is a function, everything inside the function should still be happening in sequence. I am, however, getting some weird errors. For example, when trying to os.remove(fileA), it says fileA doesn't exist (only occurs sometimes); however, it should exist since I'm only running that line after creating the file. These errors don't exist for single threads.

In the comment section, #AskioFrio confirmed that different threads could create files with same filenames. So I think the issue is race condition which could be illustrated with the following example (steps happening sequentially):
Thread A creates a file abc.
Thread B creates a file with the same filename abc; so abc gets overwritten.
Thread A deletes abc.
Thread B tries to delete abc, which has been deleted by thread A and thus the error occurs.
Actully the most notable race conditions happen in system memory when multiple threads try to write to the memory on the same address (e.g., writing to the same element in an array).
To avoid the race conditions, you may use lock or semaphore to coordinate the activities of threads.

Python multiprocessing doesn't finish all tasks

I have a lot of files that need to be processed by some software. They don't need to be processed in the order.
Let's say I have 12 files and divided them in three lists then tried to send these lists to different processes to be executed:
# import all files
files = glob.glob(src_path + "*.fits")
files_list = [files[0::3], files[1::3], files[2::3]]
num_processors = 3 # Create a pool of processors
p = Pool(processes = num_processors) # get them to work in parallel
output = pool.map(run2, [f for f in files_list])
def run2(files, *args):
for ffit in files:
terminal_astrometry(command)
def terminal_astrometry(command):
result = subprocess.run(command, stdout=subprocess.PIPE)
The problem is that sometimes, the program doesn't process all of these files, i.e. 11 files do get processed but one does not. Or other time, 9 finished but 3 were skipped. Sometimes it does finish all tasks(process all of the files).
Essentially, in run2() function I am calling that particular software that I want to be run in parallel (Astrometry.net) on every file run2() function received.
EDIT2: I trimmed run2() function because it contains a lot of calculation(statistics) not relevant to a problem here(at least I think so) and posted it here.

Your symptoms sound like a race condition, however pool.map blocks the main process until all tasks have finished so the code will not progress past that line until all tasks have finished. Therefore, I think the problem may be within the run2 function - could you post its code?
Edit: I previously had the following text in the answer too, the question has now been edited:
You are calling run2 twice for each file - once asynchronously with the pool, and once in the main process. Depending on the logic within this function, this could be the cause of the odd behaviour you're seeing.

Software that I'm calling inside the run2() function is causing problems. It tries to write stdout in the same file which causes it to not complete all the tasks.

Using multiprocessing with runpy

I have a Python module that uses multiprocessing. I'm executing this module from another script with runpy. However, this results in (1) the module running twice, and (2) the multiprocessing jobs never finish (the script just hangs).
In my minimal working example, I have a script runpy_test.py:
import runpy
runpy.run_module('module_test')
and a directory module_test containing an empty __init__.py and a __main__.py:
from multiprocessing import Pool
print 'start'
def f(x):
return x*x
pool = Pool()
result = pool.map(f, [1,2,3])
print 'done'
When I run runpy_test.py, I get:
start
start
and the script hangs.
If I remove the pool.map call (or if I run __main__.py directly, including the pool.map call), I get:
start
done
I'm running this on Scientific Linux 7.6 in Python 2.7.5.

Rewrite your __main__.py like so:
from multiprocessing import Pool
from .implementation import f
print 'start'
pool = Pool()
result = pool.map(f, [1,2,3])
print 'done'
And then write an implementation.py (you can call this whatever you want) in which your function is defined:
def f(x):
return x*x
Otherwise you will have the same problem with most interfaces in multiprocessing, and independently of using runpy. As #Weeble explained, when Pool.map tries to load the function f in each sub-process it will import <your_package>.__main__ where your function is defined, but since you have executable code at module-level in __main__ it will be re-executed by the sub-process.
Aside from this technical reason, this is also better design in terms of separation of concerns and testing. Now you can easily import and call (including for test purposes) the function f without running it in parallel.

Try defining your function f in a separate module. It needs to be serialised to be passed to the pool processes, and then those processes need to recreate it, by importing the module it occurs in. However, the __main__.py file it occurs in isn't a module, or at least, not a well-behaved one. Attempting to import it would result in the creation of another Pool and another invocation of map, which seems like a recipe for disaster.

Although not the "right" way to do it, one solution that ended up working for me was to use runpy's _run_module_as_main instead of run_module. This was ideal for me since I was working with someone else's code and required the fewest changes.

multiprocessing - calling function with different input files

I have a function which reads in a file, compares a record in that file to a record in another file and depending on a rule, appends a record from the file to one of two lists.
I have an empty list for adding matched results to:
match = []
I have a list restrictions that I want to compare records in a series of files with.
I have a function for reading in the file I wish to see if contains any matches. If there is a match, I append the record to the match list.
def link_match(file):
links = json.load(file)
for link in links:
found = False
try:
for other_link in other_links:
if link['data'] == other_link['data']:
match.append(link)
found = True
else:
pass
else:
print "not found"
I have numerous files that I wish to compare and I thus wish to use the multiprocessing library.
I create a list of file names to act as function arguments:
list_files=[]
for file in glob.glob("/path/*.json"):
list_files.append(file)
I then use the map feature to call the function with the different input files:
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=6)
pool.map(link_match,list_files)
pool.close()
pool.join()
CPU use goes through the roof and by adding in a print line to the function loop I can see that matches are being found and the function is behaving correctly.
However, the match results list remains empty. What am I doing wrong?

multiprocessing runs a new instance of Python for each process in the pool - the context is empty (if you use spawn as a start method) or copied (if you use fork), plus copies of any arguments you pass in (either way), and from there they're all separate. If you want to pass data between branches, there's a few other ways to do it.
Instead of writing to an internal list, write to a file and read from it later when you're done. The largest potential problem here is that only one thing can write to a file at a time, so either you make a lot of separate files (and have to read all of them afterwards) or they all block each other.
Continue with multiprocessing, but use a multiprocessing.Queue instead of a list. This is an object provided specifically for your current use-case: Using multiple processes and needing to pass data between them. Assuming that you should indeed be using multiprocessing (that your situation wouldn't be better for threading, see below), this is probably your best option.
Instead of multiprocessing, use threading. Separate threads all share a single environment. The biggest problems here are that Python only lets one thread actually run Python code at a time, per process. This is called the Global Interpreter Lock (GIL). threading is thus useful when the threads will be waiting on external processes (other programs, user input, reading or writing files), but if most of the time is spent in Python code, it actually takes longer (because it takes a little time to switch threads, and you're not doing anything to save time). This has its own queue. You should probably use that rather than a plain list, if you use threading - otherwise there's the potential that two threads accessing the list at the same time interfere with each other, if it switches threads at the wrong time.
Oh, by the way: If you do use threading, Python 3.2 and later has an improved implementation of the GIL, which seems like it at least has a good chance of helping. A lot of stuff for threading performance is very dependent on your hardware (number of CPU cores) and the exact tasks you're doing, though - probably best to try several ways and see what works for you.

When multiprocessing, each subprocess gets its own copy of any global variables in the main module defined before the if __name__ == '__main__': statement. This means that the link_match() function in each one of the processes will be accessing a different match list in your code.
One workaround is to use a shared list, which in turn requires a SyncManager to synchronize access to the shared resource among the processes (which is created by calling multiprocessing.Manager()). This is then used to create the list to store the results (which I have named matches instead of match) in the code below.
I also had to use functools.partial() to create a single argument callable out of the revised link_match function which now takes two arguments, not one (which is the kind of function pool.map() expects).
from functools import partial
import glob
import multiprocessing
def link_match(matches, file): # note: added results list argument
links = json.load(file)
for link in links:
try:
for other_link in other_links:
if link['data'] == other_link['data']:
matches.append(link)
else:
pass
else:
print "not found"
if __name__ == '__main__':
manager = multiprocessing.Manager() # create SyncManager
matches = manager.list() # create a shared list here
link_matches = partial(link_match, matches) # create one arg callable to
# pass to pool.map()
pool = multiprocessing.Pool(processes=6)
list_files = glob.glob("/path/*.json") # only used here
pool.map(link_matches, list_files) # apply partial to files list
pool.close()
pool.join()
print(matches)

Multiprocessing creates multiple processes. The context of your "match" variable will now be in that child process, not the parent Python process that kicked the processing off.
Try writing the list results out to a file in your function to see what I mean.

To expand cthrall's answer, you need to return something from your function in order to pass the info back to your main thread, e.g.
def link_match(file):
[put all the code here]
return match
[main thread]
all_matches = pool.map(link_match,list_files)
the list match will be returned from each single thread and map will return a list of lists in this case. You can then flatten it again to get the final output.
Alternatively you can use a shared list but this will just add more headache in my opinion.

python -> multiprocessing module

Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
Not using threads in Python as it essentially comes to running a single process (due to GIL).
Hence using multiprocessing module. i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good, now I need a shared object which all the sub-processes have access to. I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks I guess. With this setup when I run, I do not get any error (so the parent process seems fine), it just stalls. When I press ctrl-C I see a traceback (one for each sub-process). Also no output is written to the output file. Here's code (note that everything runs fine without multi-processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool
data_file = open('out.txt', 'w+')
def worker(task_queue):
for file in iter(task_queue.get, 'STOP'):
data = mine_imdb_page(os.path.join(DATA_DIR, file))
if data:
data_file.write(repr(data)+'\n')
return
def main():
task_queue = Queue()
for file in glob.glob('*.csv'):
task_queue.put(file)
task_queue.put('STOP') # so that worker processes know when to stop
# this is the block of code that needs correction.
if multi_process:
# One way to spawn 4 processes
# pool = Pool(processes=4) #Start worker processes
# res = pool.apply_async(worker, [task_queue, data_file])
# But I chose to do it like this for now.
for i in range(4):
proc = Process(target=worker, args=[task_queue])
proc.start()
else: # single process mode is working fine!
worker(task_queue)
data_file.close()
return
what am I doing wrong? I also tried passing the open file_object to each of the processes at the time of spawning. But to no effect. e.g.- Process(target=worker, args=[task_queue, data_file]). But this did not change anything. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file_object is not getting replicated (at the time of spawn) or some other quirk... Anybody got an idea?
EXTRA: Also Is there any way to keep a persistent mysql_connection open & pass it across to the sub_processes? So I open a mysql connection in my parent process & the open connection should be accessible to all my sub-processes. Basically this is the equivalent of a shared_memory in python. Any ideas here?

Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a method called 'Pool' which is perfect for my needs.
It's optimizes itself to the number of cores my system has. i.e. only as many processes are spawned as the no. of cores. Of course this is customizable. So here's the code. Might help someone later-
from multiprocessing import Pool
def main():
po = Pool()
for file in glob.glob('*.csv'):
filepath = os.path.join(DATA_DIR, file)
po.apply_async(mine_page, (filepath,), callback=save_data)
po.close()
po.join()
file_ptr.close()
def mine_page(filepath):
#do whatever it is that you want to do in a separate process.
return data
def save_data(data):
#data is a object. Store it in a file, mysql or...
return
Still going through this huge module. Not sure if save_data() is executed by parent process or this function is used by spawned child processes. If it's the child which does the saving it might lead to concurrency issues in some situations. If anyone has anymore experience in using this module, you appreciate more knowledge here...

The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.