My code is conceptually something like this:
import os
import multiprocessing.dummy
from glob import glob

def working_with_files(test_file):
    # open test_file
    # ...bunch of stuff...
    create_fileA(variable)
    create_fileB_from_fileA(fileA)
    os.remove(fileA)

if __name__ == "__main__":
    files = glob("/Users/Name/Documents/TestData/*")
    pool = multiprocessing.dummy.Pool(8)
    results = pool.map(working_with_files, files)
    pool.close()
    pool.join()
From my understanding, each thread runs concurrently, but inside each thread everything still happens in sequence: since each thread runs a single function, the statements inside that function should execute one after another. I am, however, getting some weird errors. For example, when trying to os.remove(fileA), it sometimes says fileA doesn't exist; however, it should exist, since I only run that line after creating the file. These errors don't occur when I run with a single thread.
In the comment section, #AskioFrio confirmed that different threads could create files with the same filename. So I think the issue is a race condition, which could be illustrated with the following example (steps happening sequentially):
Thread A creates a file abc.
Thread B creates a file with the same filename abc; so abc gets overwritten.
Thread A deletes abc.
Thread B tries to delete abc, which has been deleted by thread A and thus the error occurs.
Actually, the most notable race conditions happen in system memory when multiple threads try to write to the same memory address (e.g., writing to the same element in an array).
To avoid race conditions, you can use a lock or a semaphore to coordinate the activities of the threads, as sketched below.
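For example, here is a minimal sketch of the lock approach, assuming the collisions come from every thread reusing the same temporary filename (SHARED_TMP and the file contents are placeholders; the create/convert steps stand in for create_fileA() and create_fileB_from_fileA() from the question):

import os
import multiprocessing.dummy
import threading
from glob import glob

SHARED_TMP = "temp_fileA"   # hypothetical fixed name that every thread reuses
lock = threading.Lock()     # one lock shared by all worker threads

def working_with_files(test_file):
    # work that only touches test_file can safely run concurrently
    with open(test_file) as f:
        data = f.read()

    # the create -> convert -> delete sequence touches the shared name,
    # so only one thread at a time is allowed inside this block
    with lock:
        with open(SHARED_TMP, "w") as tmp:
            tmp.write(data)              # stand-in for create_fileA()
        # ... stand-in for create_fileB_from_fileA(SHARED_TMP) ...
        os.remove(SHARED_TMP)            # guaranteed to still exist here

if __name__ == "__main__":
    files = glob("/Users/Name/Documents/TestData/*")
    pool = multiprocessing.dummy.Pool(8)
    pool.map(working_with_files, files)
    pool.close()
    pool.join()

An even simpler fix is to give every thread its own unique temporary filename (e.g. via the tempfile module), which removes the shared resource entirely.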
I have a lot of files that need to be processed by some software. They don't need to be processed in any particular order.
Let's say I have 12 files and divided them into three lists, then tried to send these lists to different processes to be executed:
import glob
import subprocess
from multiprocessing import Pool

# import all files
files = glob.glob(src_path + "*.fits")
files_list = [files[0::3], files[1::3], files[2::3]]

num_processors = 3                      # create a pool of processes
pool = Pool(processes=num_processors)   # get them to work in parallel
output = pool.map(run2, [f for f in files_list])

def run2(files, *args):
    for ffit in files:
        terminal_astrometry(command)    # command is built from ffit (omitted in the question)

def terminal_astrometry(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
The problem is that sometimes the program doesn't process all of these files, i.e. 11 files get processed but one does not, or another time 9 finish but 3 are skipped. Sometimes it does finish all the tasks (processes all of the files).
Essentially, in the run2() function I am calling the software that I want to run in parallel (Astrometry.net) on every file that run2() receives.
EDIT2: I trimmed the run2() function because it contains a lot of calculation (statistics) not relevant to the problem here (at least I think so) and posted it here.
Your symptoms sound like a race condition; however, pool.map blocks the main process until all tasks have finished, so the code will not progress past that line until all the tasks are done. Therefore, I think the problem may be within the run2 function - could you post its code?
Edit: I previously had the following text in the answer too, the question has now been edited:
You are calling run2 twice for each file - once asynchronously with the pool, and once in the main process. Depending on the logic within this function, this could be the cause of the odd behaviour you're seeing.
The software that I'm calling inside the run2() function was causing the problems: it tries to write its stdout to the same file, which causes it to not complete all the tasks.
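If the collision really is on the captured stdout, one direction for a fix (sketched below; the extra ffit parameter and the .log suffix are assumptions, not part of the original code) is to give every invocation its own output file:

import subprocess

def terminal_astrometry(command, ffit):
    # hypothetical tweak: derive a separate log file from the input name,
    # so parallel runs never write their stdout to the same place
    with open(ffit + ".log", "w") as log:
        subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)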
I'm kind of new to multiprocessing. However, assume that we have a program as below. The program seems to work fine. Now to the question. In my opinion we will have 4 instances of SomeKindOfClass with the same name (a) at the same time. How is that possible? Moreover, is there a potential risk with this kind of programming?
from multiprocessing.dummy import Pool
import numpy
from theFile import SomeKindOfClass

n = 8
allOutputs = numpy.zeros(n)

def work(index):
    a = SomeKindOfClass()
    a.theSlowFunction()
    allOutputs[index] = a.output

pool = Pool(processes=4)
pool.map(work, range(0, n))
The name a is only local in scope within your work function, so there is no conflict of names here. Internally, Python keeps track of each class instance with a unique identifier. If you wanted to check this, you could look at the object id using the id function:
print(id(a))
I don't see any issues with your code.
Actually, you will have 8 instances of SomeKindOfClass (one for each worker), but only 4 will ever be active at the same time.
multiprocessing vs multiprocessing.dummy
Your program will only work if you continue to use the multiprocessing.dummy module, which is just a wrapper around the threading module. You are still using "Python threads" (not separate processes). Python threads share the same global state; processes don't. Python threads also share the same GIL, so they're still limited to running one Python bytecode statement at a time, unlike processes, which can all run Python code simultaneously.
If you were to change your import to from multiprocessing import Pool, you would notice that the allOutputs array remains unchanged after all the workers finish executing (you would also likely get an error, because you're creating the pool in the global scope; you should probably put that inside a main() function). This is because multiprocessing makes a new copy of the entire global state when it creates a new process. When a worker modifies the global allOutputs, it is modifying a copy of that initial global state. When the process ends, nothing is returned to the main process, and the global state of the main process remains unchanged.
Sharing State Between Processes
Unlike threads, processes don't share the same memory.
If you want to share state between processes, you have to explicitly declare shared variables and pass them to each process, or use pipes or some other method to allow the worker processes to communicate with each other or with the main process.
There are several ways to do this, but perhaps the simplest is to use the Manager class:
import multiprocessing

def worker(args):
    index, array = args
    a = SomeKindOfClass()
    a.some_expensive_function()
    array[index] = a.output

def main():
    n = 8
    manager = multiprocessing.Manager()
    array = manager.list([0] * n)
    pool = multiprocessing.Pool(4)
    pool.map(worker, [(i, array) for i in range(n)])
    print(array)

if __name__ == "__main__":
    main()
You can declare class instances inside the pool workers, because each instance has a separate place in memory so they don't conflict. The problem is if you declare a class instance first, then try to pass that one instance into multiple pool workers. Then each worker has a pointer to the same place in memory, and it will fail (this can be handled, just not this way).
Basically pool workers must not have overlapping memory anywhere. As long as the workers don't try to share memory somewhere, or perform operations that may result in collisions (like printing to the same file), there shouldn't be any problem.
Make sure whatever they're supposed to do (like something you want printed to a file, or added to a broader namespace somewhere) is returned as a result at the end, which you then iterate through.
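As a rough sketch of that pattern (SomeKindOfClass is stubbed out here with a placeholder, since the real class lives in theFile), each worker returns its result and only the parent iterates over them and does the writing:

from multiprocessing import Pool

class SomeKindOfClass:            # stand-in for the class from the question
    def theSlowFunction(self):
        self.output = 42          # placeholder "slow" computation

def work(index):
    a = SomeKindOfClass()
    a.theSlowFunction()
    return index, a.output        # return the result instead of mutating shared state

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(work, range(8))   # one (index, output) pair per task
    for index, output in results:
        print(index, output)      # write to a file or a broader namespace here, in the parent only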
If you are using multiprocessing you shouldn't worry - processes don't share memory (by default). So there is no risk in having several independent objects of class SomeKindOfClass - each of them will live in its own process. How does it work? Python runs your program and then starts 4 child processes. That's why it's very important to have the if __name__ == '__main__' guard before pool.map(work, range(0, n)). Otherwise you will get an infinite loop of process creation.
Problems could arise if SomeKindOfClass keeps state on disk - for example, if it writes something to a file or reads it back. One workaround is sketched below.
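If the on-disk state is just scratch data, one way to avoid collisions is to give every worker its own uniquely named temp file, so concurrent workers never touch the same path. A sketch under that assumption (the function name and file contents are placeholders):

import os
import tempfile

def the_slow_function():
    # each call gets its own uniquely named scratch file, so concurrent
    # workers (threads or processes) never read or delete each other's state
    fd, path = tempfile.mkstemp(prefix="worker_%d_" % os.getpid(), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("intermediate state")
        # ... read the file back, compute something ...
        return 0
    finally:
        os.remove(path)           # always clean up this worker's own file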
I have a strange problem here.
I have a Python program that executes code held in separate .py files, designed to be executed in sequence, one after another. The scripts work fine; however, they take too long to run. My plan was to split the processing of these .py files amongst 4 processors using multiprocessing.pool.map_async(function, arguments), with execfile() as the function and the filename as the argument.
So anyways, when I run the code, absolutely nothing happens at all, not even an error.
Take a look and see if you can help me out; I run the file with SeqFile.runner(SeqFile.file).
class FileRunner:
    def __init__(self, file):
        self.file = file
    def runner(self, file):
        self.run = pool.map_async(execfile, file)

SeqFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/SequencePickler.py")
VolFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/VolumePickler.py")
CWFile = FileRunner("/Users/haysb/Dropbox/Stuart/Sample_proteins/Code/SVS_CodeParts/Combine_and_Write.py")
(SeqFile.runner(SeqFile.file))
You have several problems here - I'm guessing you never used multiprocessing before.
One of your problems is that you fire off an async operation but never wait for it to end. If you did wait for it to end, you'd get more info. For example, add:
result = SeqFile.run.get()
Do that, and you'll see the exception raised in the child process: you're mapping execfile over the string bound to file, so execfile sees one character at a time. execfile barfs when the first thing it tries to do is (in effect):
execfile("/")
apply_async() would make a lot more sense, or map_async() passed a list of all the files you want to run.
Etc - this gets tedious ;-)
Specifics
Let's get rid of the irrelevant cruft here, and show a complete executable program. I have three files a.py, b.py and c.py. Here's a.py:
print "I'm A!"
The other two are the obvious variations.
Here's my entire driver:
if __name__ == "__main__":
    import multiprocessing as mp
    files = ["a.py", "b.py", "c.py"]
    pool = mp.Pool(2)
    pool.imap_unordered(execfile, files)
    pool.close()
    pool.join()
That's all it takes, and prints (some permutation of):
I'm A!
I'm B!
I'm C!
imap_unordered() splits the list of files up among the worker processes, and doesn't care ("unordered") which order they run in. That's maximally efficient. Note that I restricted the number of workers to 2, just to show that it works fine even though there are more files (3) than worker processes (2).
You can get any of the Pool functions to work similarly. If you have to ;-) use map_async(), for example, replace the imap_unordered() call with:
async = pool.map_async(execfile, files)
async.get()
Or:
asyncs = [pool.apply_async(execfile, (fn,)) for fn in files]
for a in asyncs:
    a.get()
Clearer? Keep it as simple as possible at first.
I have a long-running Python script which creates and deletes temporary files. I notice that a non-trivial amount of time is spent on file deletion, but the only purpose of deleting those files is to ensure that the program doesn't eventually fill up all the disk space during a long run. Is there a cross-platform mechanism in Python to asynchronously delete a file, so the main thread can continue to work while the OS takes care of the file delete?
You can try delegating deleting the files to another thread or process.
Using a newly spawned thread:
thread.start_new_thread(os.remove, (filename,))
Or, using a process:
# create the process pool once
process_pool = multiprocessing.Pool(1)
results = []

# later on, removing a file in an async fashion
# note: hold on to the async result until it has completed,
# and prune finished entries now and then so the list doesn't grow forever
results.append(process_pool.apply_async(os.remove, (filename,)))
results[:] = [r for r in results if not r.ready()]
The process version may allow for more parallelism, because Python threads do not execute in parallel due to the notorious global interpreter lock. I would expect, though, that the GIL is released when any blocking kernel function such as unlink() is called, so that Python lets another thread make progress. In other words, a background worker thread that calls os.unlink() may be the best solution; see Tim Peters' answer.
Yet multiprocessing itself uses Python threads underneath to asynchronously communicate with the processes in the pool, so some benchmarking is required to figure out which version gives more parallelism.
An alternative method that avoids using Python threads but requires more coding is to spawn another process and send the filenames to its standard input through a pipe. This way you trade os.remove() for a synchronous os.write() (one write() syscall). It can be done using the deprecated os.popen(), and this usage of the function is perfectly safe because it only communicates in one direction, to the child process. A working prototype:
#!/usr/bin/python
from __future__ import print_function
import os, sys

def remover():
    for line in sys.stdin:
        filename = line.strip()
        try:
            os.remove(filename)
        except Exception:  # ignore errors
            pass

def main():
    if len(sys.argv) == 2 and sys.argv[1] == '--remover-process':
        return remover()
    remover_process = os.popen(sys.argv[0] + ' --remover-process', 'w')
    def remove_file(filename):
        print(filename, file=remover_process)
        remover_process.flush()
    for file in sys.argv[1:]:
        remove_file(file)

if __name__ == "__main__":
    main()
You can create a thread to delete files, following a common producer-consumer pattern:
import threading, Queue

dead_files = Queue.Queue()
END_OF_DATA = object()  # a unique sentinel value

def background_deleter():
    import os
    while True:
        path = dead_files.get()
        if path is END_OF_DATA:
            return
        try:
            os.remove(path)
        except:  # add the exceptions you want to ignore here
            pass  # or log the error, or whatever

deleter = threading.Thread(target=background_deleter)
deleter.start()

# when you want to delete a file, do:
# dead_files.put(file_path)

# when you want to shut down cleanly,
dead_files.put(END_OF_DATA)
deleter.join()
CPython releases the GIL (global interpreter lock) around internal file deletion calls, so this should be effective.
Edit - new text
I would advise against spawning a new process per delete. On some platforms, process creation is quite expensive. I would also advise against spawning a new thread per delete: in a long-running program, you really never want the possibility of creating an unbounded number of threads at any point. Depending on how quickly file-deletion requests pile up, that could happen here.
The "solution" above is wordier than the others, because it avoids all that. There's only one new thread total. Of course it could easily be generalized to use any fixed number of threads instead, all sharing the same dead_files queue. Start with 1, add more if needed ;-)
The OS-level file removal primitives are synchronous on both Unix and Windows, so I think you pretty much have to use a worker thread. You could have it pull files to delete off a Queue object, and then when the main thread is done with a file it can just post the file to the queue. If you're using NamedTemporaryFile objects, you probably want to set delete=False in the constructor and just post the name to the queue, not the file object, so you don't have object lifetime headaches.
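For instance, a short sketch of that NamedTemporaryFile variant, reusing a dead_files-style queue like the one in the previous answer (the deleter thread that drains the queue is omitted here):

import queue
import tempfile

dead_files = queue.Queue()   # consumed by a worker thread as in the answer above

# create the temp file with delete=False so closing it does NOT remove it
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".tmp", delete=False)
try:
    tmp.write("intermediate data")
finally:
    tmp.close()

# post only the name, not the file object, to avoid object-lifetime headaches
dead_files.put(tmp.name)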
Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
I'm not using threads in Python, as it essentially comes down to running a single process (due to the GIL).
Hence I'm using the multiprocessing module, i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good. Now I need a shared object which all the sub-processes have access to; I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file - a potential place to use Locks, I guess. With this setup, when I run it, I do not get any error (so the parent process seems fine), it just stalls. When I press Ctrl-C I see a traceback (one for each sub-process). Also, no output is written to the output file. Here's the code (note that everything runs fine without multiple processes) -
import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data)+'\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP')  # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4)  # Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])
        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else:  # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
What am I doing wrong? I also tried passing the open file object to each of the processes at the time of spawning, e.g. Process(target=worker, args=[task_queue, data_file]), but to no effect. I feel the subprocesses are not able to write to the file for some reason. Either the instance of the file object is not getting replicated (at the time of spawn) or there is some other quirk... Anybody got an idea?
EXTRA: Also, is there any way to keep a persistent mysql_connection open and pass it across to the sub-processes? So I open a mysql connection in my parent process, and the open connection should be accessible to all my sub-processes. Basically this is the equivalent of shared memory in Python. Any ideas here?
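On the EXTRA question: a database connection generally can't be shared across processes the way shared memory can, but a common workaround is to open one connection per worker process using the Pool's initializer argument. A sketch, assuming a MySQLdb-style driver (the module name, credentials and file names below are placeholders, not from the original code):

from multiprocessing import Pool

_conn = None   # one connection per worker process, created in the initializer

def init_worker():
    global _conn
    import MySQLdb                       # placeholder driver; any DB-API module works
    _conn = MySQLdb.connect(host="localhost", user="me",
                            passwd="secret", db="imdb")

def worker(filename):
    cur = _conn.cursor()                 # every task in this process reuses _conn
    # ... parse the file and insert rows via cur.execute(...) ...
    _conn.commit()

if __name__ == "__main__":
    pool = Pool(processes=4, initializer=init_worker)
    pool.map(worker, ["a.csv", "b.csv"])  # placeholder file names
    pool.close()
    pool.join()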
Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a class called Pool which is perfect for my needs.
It optimizes itself to the number of cores my system has, i.e. only as many processes are spawned as there are cores. Of course this is customizable. So here's the code; it might help someone later -
from multiprocessing import Pool

def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()

def mine_page(filepath):
    # do whatever it is that you want to do in a separate process.
    return data

def save_data(data):
    # data is a object. Store it in a file, mysql or...
    return
Still going through this huge module. I'm not sure whether save_data() is executed by the parent process or whether this function is used by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience using this module, I'd appreciate more knowledge here...
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results and the main code gets the results and writes it to the file.
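For instance, a minimal sketch of that arrangement, with the parsing stubbed out (the file names and the 'parsed ...' payload are placeholders), where only the parent process ever writes to out.txt:

from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    # stand-in for mine_imdb_page(); real parsing goes here
    for filename in iter(task_queue.get, 'STOP'):
        result_queue.put('parsed ' + filename)
    result_queue.put('DONE')              # tell the writer this worker is finished

if __name__ == "__main__":
    task_queue, result_queue = Queue(), Queue()

    for name in ['a.csv', 'b.csv', 'c.csv', 'd.csv']:   # placeholder file names
        task_queue.put(name)

    num_workers = 4
    for _ in range(num_workers):
        task_queue.put('STOP')            # one sentinel per worker

    workers = [Process(target=worker, args=(task_queue, result_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()

    # only the main process touches the output file
    done = 0
    with open('out.txt', 'w') as out:
        while done < num_workers:
            item = result_queue.get()
            if item == 'DONE':
                done += 1
            else:
                out.write(item + '\n')

    for w in workers:
        w.join()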