I have a binary (say a.out) that I want to call with different configs. I want to run these configs on a 40-core machine in parallel. Below is a sketch of my code.
It is very straightforward: I generate a config and pass it into the worker, and the worker calls the binary with the config using subprocess. I am also redirecting the output to a file. Let's call this piece of code run.py:
import subprocess
from multiprocessing import Pool

def worker(cmdlist, filename):
    outputfile = open(filename, 'wb')
    # here it essentially executes: a.out config > outputfile
    subprocess.call(cmdlist, stderr=outputfile, stdout=outputfile)
    outputfile.close()
def main():
    pool = Pool(processes=40)
    results = []
    for config in all_configs:
        filename, cmdlist = genCmd(config)
        res = pool.apply_async(worker, [cmdlist, filename])
        results.append(res)
    for res in results:
        res.get()
    pool.close()
But after I kicked it off, I realized that I am not spawning as many processes as I want. I definitely submitted more than 40 jobs, but in top, I am only seeing about 20 instances of a.out.
I do see many run.py processes in the "sleeping" state (i.e., "S" in top). When I do ps auf, I also see a lot of run.py processes in the "S+" state with no binary spawned from them. Only about half of them spawned a.out.
I am wondering why this is happening. I am redirecting the output to a network-mounted hard drive, which could be a reason, but in top I only see 10%wa (which, in my understanding, means 10% of the time is spent waiting for IO). I don't think that accounts for 50% of the CPUs sitting idle. Plus, I should at least see the binary spawned, instead of the workers being stuck in run.py. My binary's runtime is also long enough that I should really be seeing 40 jobs running for a long time.
Any other explanation? Anything I did wrong in my python code?
An approach I have used to make use of many simultaneous processes running at once on multiple cores is to use p = subprocess.Popen(...) and p.poll(). In your case I think you would be able to skip using Pool altogether. I'd give you a better example but unfortunately I don't have access to that code anymore.
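To make that concrete, here is a minimal sketch of the idea, reusing genCmd and all_configs from the question; the 40-job cap and the one-second polling interval are assumptions, not tested values. The parent owns a plain list of live Popen handles, so you can see exactly how many children exist at any moment:

import subprocess
import time

MAX_JOBS = 40  # assumed concurrency limit, matching the 40-core machine

pending = list(all_configs)  # all_configs and genCmd as defined in the question
running = []

while pending or running:
    # reap finished processes and close their output files
    still_running = []
    for proc, outfile in running:
        if proc.poll() is None:  # None means the process is still running
            still_running.append((proc, outfile))
        else:
            outfile.close()
    running = still_running

    # top up to MAX_JOBS concurrent subprocesses
    while pending and len(running) < MAX_JOBS:
        filename, cmdlist = genCmd(pending.pop())
        outfile = open(filename, 'wb')
        proc = subprocess.Popen(cmdlist, stdout=outfile, stderr=outfile)
        running.append((proc, outfile))

    time.sleep(1)  # avoid busy-waiting between polls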
Related
I have a lot of files that need to be processed by some software. They don't need to be processed in any particular order.
Let's say I have 12 files and divided them into three lists, then tried to send these lists to different processes to be executed:
import glob
from multiprocessing import Pool

# import all files
files = glob.glob(src_path + "*.fits")
files_list = [files[0::3], files[1::3], files[2::3]]

num_processors = 3  # Create a pool of processors
p = Pool(processes=num_processors)  # get them to work in parallel
output = p.map(run2, files_list)
import subprocess

def run2(files, *args):
    for ffit in files:
        # the Astrometry.net command is built per file (trimmed from this excerpt)
        terminal_astrometry(command)

def terminal_astrometry(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
The problem is that sometimes the program doesn't process all of these files, i.e. 11 files get processed but one does not. Another time, 9 finish but 3 are skipped. Sometimes it does finish all tasks (processing all of the files).
Essentially, in run2() function I am calling that particular software that I want to be run in parallel (Astrometry.net) on every file run2() function received.
EDIT2: I trimmed the run2() function because it contains a lot of calculation (statistics) not relevant to the problem here (at least I think so) and posted it here.
Your symptoms sound like a race condition; however, pool.map blocks the main process until all tasks have finished, so the code will not progress past that line before then. Therefore, I think the problem may be within the run2 function - could you post its code?
Edit: I previously had the following text in the answer too, the question has now been edited:
You are calling run2 twice for each file - once asynchronously with the pool, and once in the main process. Depending on the logic within this function, this could be the cause of the odd behaviour you're seeing.
The software that I'm calling inside the run2() function was causing the problems. It tries to write its stdout to the same file from every process, which causes it to not complete all the tasks.
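A common fix for that kind of collision is to give every task its own output path so parallel workers never write to the same file. A minimal sketch of the idea (the per-file log naming is an assumption for illustration, not the asker's actual fix):

import os
import subprocess

def terminal_astrometry(command, input_file):
    # derive a unique log name from the input file so no two
    # parallel tasks ever share an output file
    logname = os.path.basename(input_file) + '.log'
    with open(logname, 'wb') as logfile:
        result = subprocess.run(command, stdout=logfile, stderr=logfile)
    return result.returncode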
I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 Gb of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty, it picks a random file (task) and passes work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs on instances with a single CPU. If I want to scale up the calculations, I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost-effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                # set up and run analysis module, etc.
                pass
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1, then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, and empty brackets (tuple?), and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() objects.
If you want to use Pool() to run a function 2 times then pass a list with 2 elements (even if you don't need the arguments), because that is the information Pool() uses to run it 2 times.
But if you run the function 2 times against the same folder then it may run the same task 2 times. If you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
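A minimal sketch of both options, assuming workermaster is given one dummy parameter (Pool.map requires an iterable so it has something to hand each call, even if the function ignores it):

from multiprocessing import Pool, Process

def workermaster(worker_id):
    # same infinite loop as in the question; worker_id exists only so
    # Pool.map has something to iterate over, and is otherwise ignored
    pass

if __name__ == '__main__':
    # option 1: create the processes by hand
    procs = [Process(target=workermaster, args=(i,)) for i in range(2)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

    # option 2: let a Pool run the function once per element of the list
    p = Pool(processes=2)
    p.map(workermaster, [0, 1])  # a 2-element list means 2 runs
    p.close()
    p.join()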
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
There seems to be a litany of questions and answers on Stack Overflow about the multiprocessing library. I have looked through all the relevant ones I can find and have not found one that directly speaks to my problem.
I am trying to apply the same function to multiple files in parallel. Whenever I start the processing though, the computer just spins up several instances of python and then does nothing. No computations happen at all and the processes just sit idle.
I have looked at all of the similar questions on Stack Overflow, and none seem to have my problem of idle processes.
What am I doing wrong?
Define the function (abbreviated for the example; checked to make sure it works):
import pandas as pd
import numpy as np
import glob
import os
#from timeit import default_timer as timer
import talib
from multiprocessing import Process

def example_function(file):
    df = pd.read_csv(file, header=1)
    stock_name = os.path.basename(file)[:-4]
    macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
    df['macd'] = macdhist*1000
    print(f'stock{stock_name} processed')
    final_macd_report.append(df)
Getting a list of all the files in the directory I want to run the function on:
import glob
path = r'C:\Users\josiahh\Desktop\big_test3/*'
files = [f for f in glob.glob(path, recursive=True)]
Attempting multiprocessing:
import multiprocessing as mp

if __name__ == '__main__':
    p = mp.Pool(processes=5)
    async_result = p.map_async(example_function, files)
    p.close()
    p.join()
    print("Complete")
Any help would be greatly appreciated.
There's nothing wrong with the structure of the code, so something is going wrong that can't be guessed from what you posted. Start with something very much simpler, then move it in stages to what you're actually trying to do. You're importing mountains of extension (3rd party) code, and the problem could be anywhere. Here's a start:
def example_function(arg):
    from time import sleep
    msg = "crunching " + str(arg)
    print(msg)
    sleep(arg)
    print("done " + msg)

if __name__ == '__main__':
    import multiprocessing as mp
    p = mp.Pool(processes=5)
    async_result = p.map_async(example_function, reversed(range(15)))
    print("result", async_result.get())
    p.close()
    p.join()
    print("Complete")
That works fine on Win10 under 64-bit Python 3.7.4 for me. Does it for you?
Note especially the async_result.get() at the end. That displays a list with 15 None values. You never do anything with your async_result. Because of that, if any exception was raised in a worker process, it will most likely silently vanish. In such cases .get()'ing the result will (re)raise the exception in your main program.
Also please verify that your files list isn't in fact empty. We can't guess at that from here either ;-)
EDIT
I moved the async_result.get() into its own line, right after the map_async(), to maximize the chance of revealing otherwise-silent exceptions in the worker processes. At least add that much to your code too.
While I don't see anything wrong per se, I would like to suggest some changes.
In general, worker functions in a Pool are expected to return something. This return value is transferred back to the parent process. I like to use that as a status report. It is also a good idea to catch exceptions in the worker process, just in case.
For example:
def example_function(file):
    # imports as in the question: pandas as pd, os, talib
    status = 'OK'
    try:
        df = pd.read_csv(file, header=1)
        stock_name = os.path.basename(file)[:-4]
        macd, macdsignal, macdhist = talib.MACD(df.Close, fastperiod=12, slowperiod=26, signalperiod=9)
        df['macd'] = macdhist*1000
        final_macd_report.append(df)
    except Exception:
        status = 'exception caught!'
    return {'filename': file, 'result': status}
(This is just a quick example. You might want to e.g. report the full exception traceback to help with debugging.)
If workers run for a long time, I like to get feedback ASAP.
So I prefer to use imap_unordered, especially if some tasks can take much longer than others. This returns an iterator that yields results in the order that jobs finish.
if __name__ == '__main__':
    with mp.Pool() as p:
        for res in p.imap_unordered(example_function, files):
            print(res)
This way you get unambiguous proof that a worker finished, what the result was, and whether any problems occurred.
This is preferable to just calling print from the workers. With stdout buffering and multiple workers inheriting the same output stream, there is no telling when you will actually see something.
Edit: As you can see here, multiprocessing.Pool does not work well with interactive interpreters, especially on ms-windows. Basically, ms-windows lacks the fork system call that lets UNIX-like systems duplicate a process. So on ms-windows, multiprocessing has to try to mimic fork, which means importing the original program file in the child processes. That doesn't work well with interactive interpreters like IPython. One would probably have to dig deep into the internals of Jupyter and multiprocessing to find out the exact cause of the problem.
It seems that a workaround for this problem is to define the worker function in a separate module and import that in your code in IPython.
It is actually mentioned in the documentation that multiprocessing.Pool doesn't work well with interactive interpreters. See the note at the end of this section.
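As a minimal sketch of that workaround (the module name workers.py is made up for illustration):

# workers.py -- the worker function lives in a plain importable module
def example_function(arg):
    return arg * 2  # stand-in for the real processing

# in the IPython / Jupyter session:
import multiprocessing as mp
from workers import example_function  # imported, not defined interactively

if __name__ == '__main__':
    with mp.Pool() as p:
        for res in p.imap_unordered(example_function, range(10)):
            print(res)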
I am trying to use the Python multiprocessing library in order to parallelize a task I am working on:
import multiprocessing as MP

def myFunction((x, y, z)):
    ...create a sqlite3 database specific to x, y, z
    ...write to the database (one DB per process)

y = 'somestring'
z = <large read-only global dictionary to be shared>

jobs = []
for x in X:
    jobs.append((x, y, z,))

pool = MP.Pool(processes=16)
pool.map(myFunction, jobs)
pool.close()
pool.join()
Sixteen processes are started, as seen in htop; however, no errors are returned, no files are written, and no CPU is used.
Could it happen that there is an error in myFunction that is not reported to STDOUT and blocks execution?
Perhaps it is relevant that the Python script is called from a bash script running in the background.
The lesson learned here was to follow the strategy suggested in one of the comments and use multiprocessing.dummy until everything works.
At least in my case, errors were not visible otherwise and the processes were still running as if nothing had happened.
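The nice part is that multiprocessing.dummy exposes the same Pool API on top of threads, so the swap is a one-line change and worker exceptions surface normally while debugging. A minimal sketch (the job tuples here are placeholders, not the original workload):

from multiprocessing.dummy import Pool  # threads: tracebacks are visible
# from multiprocessing import Pool      # switch back once everything works

def myFunction(job):
    x, y, z = job  # unpack the (x, y, z) tuple explicitly
    # ...the real per-job work (sqlite3 writes, etc.) would go here...
    return x

jobs = [(x, 'somestring', {'key': 'value'}) for x in range(16)]
pool = Pool(processes=16)
print(pool.map(myFunction, jobs))
pool.close()
pool.join()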
Here's what I am trying to accomplish -
I have about a million files which I need to parse & append the parsed content to a single file.
Since a single process takes ages, this option is out.
I'm not using threads in Python, as they essentially amount to running a single process (due to the GIL).
Hence, I'm using the multiprocessing module, i.e. spawning 4 sub-processes to utilize all that raw core power :)
So far so good. Now I need a shared object which all the sub-processes have access to; I am using Queues from the multiprocessing module. Also, all the sub-processes need to write their output to a single file. A potential place to use Locks, I guess. With this setup, when I run it, I do not get any error (so the parent process seems fine); it just stalls. When I press Ctrl-C I see a traceback (one for each sub-process). Also, no output is written to the output file. Here's the code (note that everything runs fine without multiple processes):
import os
import glob
from multiprocessing import Process, Queue, Pool

data_file = open('out.txt', 'w+')

def worker(task_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            data_file.write(repr(data) + '\n')
    return

def main():
    task_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)
    task_queue.put('STOP')  # so that worker processes know when to stop

    # this is the block of code that needs correction.
    if multi_process:
        # One way to spawn 4 processes
        # pool = Pool(processes=4)  # Start worker processes
        # res = pool.apply_async(worker, [task_queue, data_file])
        # But I chose to do it like this for now.
        for i in range(4):
            proc = Process(target=worker, args=[task_queue])
            proc.start()
    else:  # single process mode is working fine!
        worker(task_queue)
    data_file.close()
    return
What am I doing wrong? I also tried passing the open file object to each of the processes at the time of spawning, e.g. Process(target=worker, args=[task_queue, data_file]), but to no effect. I feel the sub-processes are not able to write to the file for some reason. Either the instance of the file object is not getting replicated (at the time of spawning) or there is some other quirk... Anybody got an idea?
EXTRA: Also, is there any way to keep a persistent MySQL connection open & pass it across to the sub-processes? So I open a MySQL connection in my parent process & the open connection should be accessible to all my sub-processes. Basically, this is the equivalent of shared memory in Python. Any ideas here?
Although the discussion with Eric was fruitful, later on I found a better way of doing this. Within the multiprocessing module there is a class called 'Pool' which is perfect for my needs.
It optimizes itself to the number of cores my system has, i.e. only as many processes are spawned as there are cores. Of course, this is customizable. So here's the code; it might help someone later:
import glob
import os
from multiprocessing import Pool

def main():
    po = Pool()
    for file in glob.glob('*.csv'):
        filepath = os.path.join(DATA_DIR, file)
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()
    file_ptr.close()

def mine_page(filepath):
    # do whatever it is that you want to do in a separate process.
    return data

def save_data(data):
    # data is an object. Store it in a file, mysql or...
    return
Still going through this huge module. I'm not sure if save_data() is executed by the parent process or if this function is used by the spawned child processes. If it's the child that does the saving, it might lead to concurrency issues in some situations. If anyone has more experience using this module, I'd appreciate more knowledge here...
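For what it's worth on that last point: the callback passed to apply_async runs in the parent process (multiprocessing handles results on an internal thread there), which a quick PID check can illustrate. A minimal sketch, assuming nothing beyond the standard library:

import os
from multiprocessing import Pool

def mine_page(filepath):
    # executed in a child process
    return (filepath, os.getpid())

def save_data(result):
    # executed back in the parent process, so a single writer owns the file
    filepath, worker_pid = result
    print('mined %s in pid %s; saved in pid %s' % (filepath, worker_pid, os.getpid()))

if __name__ == '__main__':
    po = Pool()
    for filepath in ['a.csv', 'b.csv']:
        po.apply_async(mine_page, (filepath,), callback=save_data)
    po.close()
    po.join()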
The docs for multiprocessing indicate several methods of sharing state between processes:
http://docs.python.org/dev/library/multiprocessing.html#sharing-state-between-processes
I'm sure each process gets a fresh interpreter and then the target (function) and args are loaded into it. In that case, the global namespace from your script would have been bound to your worker function, so the data_file would be there. However, I am not sure what happens to the file descriptor as it is copied across. Have you tried passing the file object as one of the args?
An alternative is to pass another Queue that will hold the results from the workers. The workers put the results, and the main code gets the results and writes them to the file.
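A minimal sketch of that results-queue pattern, reusing the question's names (mine_imdb_page and DATA_DIR are assumed to be defined elsewhere, as in the original code); note one 'STOP' sentinel is queued per worker so each one can exit:

import glob
import os
from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    for file in iter(task_queue.get, 'STOP'):
        data = mine_imdb_page(os.path.join(DATA_DIR, file))
        if data:
            result_queue.put(data)
    result_queue.put('DONE')  # tell the parent this worker is finished

def main():
    task_queue = Queue()
    result_queue = Queue()
    for file in glob.glob('*.csv'):
        task_queue.put(file)

    num_workers = 4
    for _ in range(num_workers):
        task_queue.put('STOP')  # one sentinel per worker
        Process(target=worker, args=(task_queue, result_queue)).start()

    # only the parent touches out.txt, so no Lock is needed
    finished = 0
    with open('out.txt', 'w') as data_file:
        while finished < num_workers:
            data = result_queue.get()
            if data == 'DONE':
                finished += 1
            else:
                data_file.write(repr(data) + '\n')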