I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up the calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though any particular calculation only "reads" one small portion of the data. I want to make everything more efficient and cost-effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and in the Python docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here

def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                pass  # set up and run analysis module, etc.
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but they still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process()es.
If you want to use Pool() to run a function 2 times then pass it a list with 2 elements (even if you don't need the arguments), because that is the information Pool() uses to run it 2 times.
But if you run the function 2 times with the same folder then it may run the same task 2 times. If you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
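For example, here is a minimal sketch of both approaches, assuming workermaster() is importable from your script; the worker count of 4 is only an illustration:

import multiprocessing

# from workermaster_script import workermaster  # assumed import, adjust to your layout

def call_workermaster(_dummy):
    # Wrapper so Pool.map() has something to iterate over; the value is ignored.
    workermaster()

if __name__ == '__main__':
    num_workers = 4  # or multiprocessing.cpu_count()

    # Option 1: create the processes manually with Process().
    procs = [multiprocessing.Process(target=workermaster) for _ in range(num_workers)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()

    # Option 2: the same thing with Pool() and a dummy iterable of length num_workers.
    # pool = multiprocessing.Pool(num_workers)
    # pool.map(call_workermaster, range(num_workers))

Each process runs its own copy of the infinite loop, so option 1 is usually the simpler fit here; the dummy range() in option 2 only exists to tell Pool() how many times to call the function.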
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
Related
Hi, I don't feel like I have quite understood multiprocessing in Python correctly.
I want to run a function called 'run_worker' (which is simply code that runs and manages a subprocess) 20 times in parallel and wait for all the functions to complete. Each run_worker should run on a separate core/thread. I don't mind what order the processes complete in, hence I used async, and I don't have a return value, so I used map.
I thought that I should use:
import multiprocessing as mp

if __name__ == "__main__":
    num_workers = 20
    param_map = []
    for i in range(num_workers):
        param_map += [experiment_id]

    pool = mp.Pool(processes=num_workers)
    pool.map_async(run_worker, param_map)
    pool.close()
    pool.join()
However this code exits straight away and doesn't appear to execute run_worker properly. Also, do I really have to create a param_map of the same experiment_id to pass to the worker? This seems like a hack just to get the right number of run_workers created. Ideally I would like to run a function with no parameters and no return value over multiple cores.
Note: I am using Windows Server 2019 in AWS.
Edit: added run_worker, which calls a subprocess that writes to a file:
def run_worker(experiment_id):
    hostname = socket.gethostname()
    experiment = conn.experiments(experiment_id).fetch()
    while experiment.progress.observation_count < experiment.observation_budget:
        suggestion = conn.experiments(experiment.id).suggestions().create()
        value = evaluate_model(suggestion.assignments)
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id,
            value=value,
            metadata=dict(hostname=hostname),
        )
        # Update the experiment object
        experiment = conn.experiments(experiment_id).fetch()
It seems that for this simple purpose you would be better off using pool.map instead of pool.map_async. Both run the tasks in parallel, but pool.map blocks until all operations are finished (see also this question). pool.map_async is especially meant for situations like this:
result = pool.map_async(func, iterable)
while not result.ready():
    # do some work while map_async is running
    pass
# blocking call to get the result
out = result.get()
Regarding your question about the parameters, the fundamental idea of a map operation is to map the values of one list/array/iterable to a new list of values of the same size. As far as I can see in the docs, multiprocessing does not provide any method to run a function multiple times without parameters.
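A common workaround, sketched below on the assumption that the mapped function can simply ignore its argument, is to map over a dummy iterable whose length is the number of workers you want:

import multiprocessing as mp

def run_worker(_dummy):
    # the argument only exists to satisfy map(); it is ignored
    print("running in process", mp.current_process().name)

if __name__ == "__main__":
    num_workers = 20
    with mp.Pool(processes=num_workers) as pool:
        pool.map(run_worker, range(num_workers))  # blocks until all 20 calls return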
If you would also share your run_worker function, that might help to get better answers to your question. That might also clear up why you would run a function without any arguments and return values using a map operation in the first place.
I have a lot of files that need to be processed by some software. They don't need to be processed in any particular order.
Let's say I have 12 files, divided into three lists, which I then tried to send to different processes to be executed:
# import all files
files = glob.glob(src_path + "*.fits")
files_list = [files[0::3], files[1::3], files[2::3]]

num_processors = 3                      # Create a pool of processors
pool = Pool(processes=num_processors)   # get them to work in parallel
output = pool.map(run2, [f for f in files_list])

def run2(files, *args):
    for ffit in files:
        terminal_astrometry(command)

def terminal_astrometry(command):
    result = subprocess.run(command, stdout=subprocess.PIPE)
The problem is that sometimes the program doesn't process all of these files, i.e. 11 files get processed but one does not. Or another time 9 finish but 3 are skipped. Sometimes it does finish all the tasks (processes all of the files).
Essentially, in the run2() function I am calling the particular piece of software that I want to run in parallel (Astrometry.net) on every file the run2() function received.
EDIT2: I trimmed the run2() function because it contains a lot of calculation (statistics) not relevant to the problem here (at least I think so) and posted it here.
Your symptoms sound like a race condition; however, pool.map blocks the main process until all tasks have finished, so the code will not progress past that line until all tasks are done. Therefore, I think the problem may be within the run2 function - could you post its code?
Edit: I previously had the following text in the answer too; the question has since been edited:
You are calling run2 twice for each file - once asynchronously with the pool, and once in the main process. Depending on the logic within this function, this could be the cause of the odd behaviour you're seeing.
The software that I'm calling inside the run2() function was causing the problem. It tries to write its stdout to the same file in every process, which causes some of the tasks not to complete.
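For anyone hitting the same symptom, one fix is to give every call its own output file so parallel runs never collide; a minimal sketch, where task_id and log_dir are illustrative parameters, not part of the original code:

import os
import subprocess

def terminal_astrometry(command, task_id, log_dir="logs"):
    # Each task writes its stdout/stderr to its own file, so parallel
    # processes never compete for the same output file.
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, "task_{}.log".format(task_id))
    with open(log_path, "w") as log_file:
        return subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)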
I have a lot of data (more than one million entries) and need to do a calculation on it that finds a certain value out of the millions. This is time-consuming. To speed up the process I am trying to use all the cores of the CPU. To do this, I am using a multiprocessing pool and I am calling the worker process with starmap_async (I need to hand over multiple arguments). Basically it works so far, with the limitation that I have to wait until all values of the list have been processed and all processes have finished before I can continue. Is there a possibility to end the starmap process once one of the processes finds the correct value?
I have already tried several different things, such as terminating from the worker process and changing the whole structure to a for loop, but it seems that the starmap process needs to run to its end and cannot be stopped. The only way seems to be to extract each list value individually and feed it to a separate Process, which again creates a big overhead and slows things down significantly.
Does anyone have an idea?
The solution described here Terminate a Python multiprocessing program once a one of its workers meets a certain condition looks like the same problem, but it is not. I have tried it but it doesn't work. The difference seems to be that none of the arguments is the iterable in the described issue. I have played with this and I couldn't get it to end the processes before the starmap process had finished completely. In the recommended solution the processes are started and run independently until one finds a solution. In my case starmap seems to keep feeding the processes without checking termination conditions.
import multiprocessing

def worker(x, arg1, arg2):
    # some calculation with all arguments
    # **here I need a possibility to cancel all processes and return the current x value**
    pass

if __name__ == '__main__':
    arg1 = something
    arg2 = something_else
    value_list = (a, b, c, d, e, .......)

    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    p = pool.starmap_async(worker, [(x, arg1, arg2) for x in value_list])
    pool.close()
    pool.join()

    for y in p.get():
        if y == correct_value:
            print(correct_value)
            break
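One common pattern for stopping early, sketched under the assumption that worker() returns x when it matches and None otherwise: bind the fixed arguments with functools.partial, consume results from imap_unordered, and terminate the pool as soon as one qualifies.

import functools
import multiprocessing

def worker(x, arg1, arg2):
    # placeholder calculation for this sketch: "found" means x equals arg1
    return x if x == arg1 else None

if __name__ == '__main__':
    arg1, arg2 = 123456, 'unused'
    value_list = range(1000000)

    func = functools.partial(worker, arg1=arg1, arg2=arg2)
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        # imap_unordered yields each result as soon as it is ready,
        # so the loop can stop consuming and kill the pool early.
        for result in pool.imap_unordered(func, value_list, chunksize=1000):
            if result is not None:
                print('correct value found:', result)
                pool.terminate()   # stop the remaining workers immediately
                break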
Here is my code for a simple multiprocessing task in Python:
from multiprocessing import Process

def myfunc(num):
    tmp = num * num
    print 'squared O/P will be ', tmp
    return(tmp)

a = [i**3 for i in range(5)]                           ## just defining a list
task = [Process(target=myfunc, args=(i,)) for i in a]  ## creating processes

for each in task: each.start()  # starting processes <------ problem line
for each in task: each.join()   # waiting for all to finish up
When I run this code, it hangs at a certain point, so to identify it I ran it line by line in the Python shell and found that when I call 'each.start()' the shell pops up a dialogue box saying:
"The program is still running, do you want to kill it?"
and if I select 'yes' the shell closes.
When I replace Process with 'threading.Thread' the same code runs, but with this nonsense output:
Squared Squared Squared Squared Squared 0 1491625
36496481
Is there any help in this regard? Thanks in advance.
To run my Python code I use the IdleX IDE, which I start from a terminal.
I have an Intel Xeon processor with 4 cores / 8 threads, and 8 GB of RAM.
With a little thought I finally found the problem.
This is happening because in Python, the float and int objects are not 'thread-safe', meaning the memory allocated to calculate any function's value by one thread/process can be overwritten by another and hence they show absurd values. This is called a race condition.
To solve this problem, use deque() from the collections module or, even better, use the 'Lock' facility. deque() works with arrays but it's meant for arrays of the same kind (much like MATLAB arrays) and is thread/process safe. 'Lock' avoids race conditions.
So the edit would be:

def myfunc(num):
    lock.acquire()
    .......some code.....
    .......some code......
    lock.release()
That's all.
But one problem still persists and that is with the multiprocessing module. Even after calling 'lock', the problem mentioned in the question remains.
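For completeness, a runnable sketch of the Lock idea with threading; the lock only serializes the print so the output lines stop interleaving, it does not change the computed values:

import threading

lock = threading.Lock()

def myfunc(num):
    tmp = num * num
    with lock:   # equivalent to lock.acquire()/lock.release()
        print('squared O/P will be %s' % tmp)
    return tmp

a = [i**3 for i in range(5)]
tasks = [threading.Thread(target=myfunc, args=(i,)) for i in a]
for t in tasks:
    t.start()
for t in tasks:
    t.join()

With the multiprocessing module a plain threading.Lock is not shared between processes; a multiprocessing.Lock would have to be created in the parent and passed to each Process, which may be why the problem persisted there.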
Save the code above into a .py file and then run it in a gnome-terminal with
python myfile.py
Where "myfile.py" is the filename you saved to.
I would assume that the IDE you are using is confused somehow by Process()
I am running through a csv file of about 800k rows. I need a threading solution that runs through each row and spawns 32 threads at a time into a worker. I want to do this without a queue. It looks like the current Python threading solution with a queue is eating up a lot of memory.
Basically I want to read a csv file row and put it into a worker thread, with only 32 threads running at a time.
This is the current script. It appears that it is reading the entire csv file into a queue and then doing a queue.join(). Is it correct that it is loading the entire csv into a queue and then spawning the threads?
queue=Queue.Queue()

def worker():
    while True:
        task=queue.get()
        try:
            subprocess.call(['php {docRoot}/cli.php -u "api/email/ses" -r "{task}"'.format(
                docRoot=docRoot,
                task=task
            )],shell=True)
        except:
            pass
        with lock:
            stats['done']+=1
            if int(time.time())!=stats.get('now'):
                stats.update(
                    now=int(time.time()),
                    percent=(stats.get('done')/stats.get('total'))*100,
                    ps=(stats.get('done')/(time.time()-stats.get('start')))
                )
                print("\r {percent:.1f}% [{progress:24}] {persec:.3f}/s ({done}/{total}) ETA {eta:<12}".format(
                    percent=stats.get('percent'),
                    progress=('='*int((23*stats.get('percent'))/100))+'>',
                    persec=stats.get('ps'),
                    done=int(stats.get('done')),
                    total=stats.get('total'),
                    eta=snippets.duration.time(int((stats.get('total')-stats.get('done'))/stats.get('ps')))
                ),end='')
        queue.task_done()

for i in range(32):
    workers=threading.Thread(target=worker)
    workers.daemon=True
    workers.start()

try:
    with open(csvFile,'rb') as fh:
        try:
            dialect=csv.Sniffer().sniff(fh.readline(),[',',';'])
            fh.seek(0)
            reader=csv.reader(fh,dialect)
            headers=reader.next()
        except csv.Error as e:
            print("\rERROR[CSV] {error}\n".format(error=e))
        else:
            while True:
                try:
                    data=reader.next()
                except csv.Error as e:
                    print("\rERROR[CSV] - Line {line}: {error}\n".format(line=reader.line_num, error=e))
                except StopIteration:
                    break
                else:
                    stats['total']+=1
                    queue.put(urllib.urlencode(dict(zip(headers,data)+dict(campaign=row.get('Campaign')).items())))

queue.join()
32 threads is probably overkill unless you have some humongous hardware available.
The rule of thumb for the optimum number of threads or processes is: (no. of cores * 2) - 1
which comes to either 7 or 15 on most hardware.
The simplest way would be to start 7 threads, passing each thread an "offset" as a parameter, i.e. a number from 0 to 6.
Each thread would then skip rows until it reached its "offset" number and process that row. Having processed the row, it can skip 6 rows and process the 7th -- repeat until there are no more rows.
This setup works for threads and for multiple processes, and is very efficient in I/O terms on most machines, as all the threads should be reading roughly the same part of the file at any given time.
I should add that this method is particularly good for Python, as each thread is more or less independent once started and avoids the dreaded Python global interpreter lock contention common to other approaches.
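A minimal sketch of that offset/stride idea, assuming the real per-row work lives in a process_row() placeholder and the file path is hypothetical:

import csv
import threading

NUM_THREADS = 7   # (no. of cores * 2) - 1 on a 4-core machine

def process_row(row):
    pass   # placeholder for the real per-row work (e.g. the subprocess call)

def worker(offset, csv_path):
    # Each thread reads the file itself and only handles the rows where
    # row_number % NUM_THREADS == offset, so no shared queue is needed.
    with open(csv_path) as fh:
        reader = csv.reader(fh)
        headers = next(reader)
        for row_number, row in enumerate(reader):
            if row_number % NUM_THREADS == offset:
                process_row(dict(zip(headers, row)))

if __name__ == '__main__':
    csv_path = 'input.csv'   # hypothetical file name
    threads = [threading.Thread(target=worker, args=(i, csv_path)) for i in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()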
I don't understand why you want to spawn 32 threads per row. However, processing data in parallel is a fairly common, embarrassingly parallel thing to do and is easily achievable with Python's multiprocessing library.
Example:
from multiprocessing import Pool

def job(args):
    # do some work
    pass

inputs = [...]  # define your inputs

Pool().map(job, inputs)
I leave it up to you to fill in the blanks to meet your specific requirements.
See: https://bitbucket.org/ccaih/ccav/src/tip/bin/ for many examples of this pattern.
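To connect that pattern to the question, a hedged fill-in of the blanks: read the rows up front, then let the pool spread the external calls over the CPUs. The cli.php invocation mirrors the question's worker; the file name and pool size are illustrative.

import csv
import subprocess
from multiprocessing import Pool

def job(row):
    # one csv row -> one call to the external PHP script
    subprocess.call(['php', 'cli.php', '-u', 'api/email/ses', '-r', str(row)])

if __name__ == '__main__':
    with open('input.csv') as fh:        # hypothetical file name
        inputs = list(csv.DictReader(fh))  # rows are small, so loading them up front is usually fine

    pool = Pool(processes=7)             # (cores * 2) - 1, per the other answer
    pool.map(job, inputs, chunksize=100)
    pool.close()
    pool.join()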
Other answers have explained how to use Pool without having to manage queues (it manages them for you) and that you do not want to set the number of processes to 32, but to your CPU count - 1. I would add two things. First, you may want to look at the pandas package, which can easily import your csv file into Python. The second is that the examples of using Pool in the other answers only pass it a function that takes a single argument. Unfortunately, you can only pass Pool a single object with all the inputs for your function, which makes it difficult to use functions that take multiple arguments. Here is code that allows you to call a previously defined function with multiple arguments using pool:
import multiprocessing
from multiprocessing import Pool

def multiplyxy(x, y):
    return x*y

def funkytuple(t):
    """
    Breaks a tuple into a function to be called and a tuple
    of arguments for that function. Changes that new tuple into
    a series of arguments and passes those arguments to the
    function.
    """
    f = t[0]
    t = t[1]
    return f(*t)

def processparallel(func, arglist):
    """
    Takes a function and a list of arguments for that function
    and processes them in parallel.
    """
    parallelarglist = []
    for entry in arglist:
        parallelarglist.append((func, tuple(entry)))

    cpu_count = int(multiprocessing.cpu_count() - 1)
    pool = Pool(processes=cpu_count)
    database = pool.map(funkytuple, parallelarglist)
    pool.close()
    return database

# Necessary on Windows
if __name__ == '__main__':
    x = [23, 23, 42, 3254, 32]
    y = [324, 234, 12, 425, 13]
    i = 0
    arglist = []
    while i < len(x):
        arglist.append([x[i], y[i]])
        i += 1
    database = processparallel(multiplyxy, arglist)
    print(database)
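For reference, on Python 3.3+ the same multi-argument case is covered directly by Pool.starmap, which unpacks each argument tuple for you; a minimal sketch:

from multiprocessing import Pool, cpu_count

def multiplyxy(x, y):
    return x * y

if __name__ == '__main__':
    arglist = [(23, 324), (23, 234), (42, 12), (3254, 425), (32, 13)]
    with Pool(processes=max(1, cpu_count() - 1)) as pool:
        print(pool.starmap(multiplyxy, arglist))   # [7452, 5382, 504, 1382950, 416]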
Your question is pretty unclear. Have you tried initializing your Queue to have a maximum size of, say, 64?
myq = Queue.Queue(maxsize=64)
Then a producer (one or more) trying to .put() new items on myq will block until consumers reduce the queue size to less than 64. This will correspondingly limit the amount of memory consumed by the queue. By default, queues are unbounded: if the producer(s) add items faster than consumers take them off, the queue can grow to consume all the RAM you have.
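A self-contained sketch of that bounded pattern, using the Python 3 queue module; the row-generating loop stands in for the csv reader:

import queue
import threading

myq = queue.Queue(maxsize=64)   # .put() blocks once 64 items are waiting

def consumer():
    while True:
        task = myq.get()
        # ... do the real per-row work here ...
        myq.task_done()

for _ in range(32):
    threading.Thread(target=consumer, daemon=True).start()

# Producer: at most ~64 rows are ever held in memory at once.
for row_number in range(800000):           # stand-in for reading csv rows
    myq.put('row {}'.format(row_number))   # blocks while the queue is full

myq.join()   # wait until every queued row has been processed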
EDIT
"This is current script. It appears that it is reading the entire csv file into queue and doing a queue.join(). Is it correct that it is loading the entire csv into a queue then spawning the threads?"
The indentation is messed up in your post, so I have to guess some, but:
The code obviously starts 32 threads before it opens the CSV file.
You didn't show the code that creates the queue. As already explained above, if it's a Queue.Queue, by default it's unbounded, and can grow to any size if your main loop puts items on it faster than your threads remove them. Since you haven't said anything about what worker() does (or shown its code), we don't have enough information to guess whether that's the case. But the fact that memory use is out of hand suggests that it is.
And, as also explained, you can stop that easily by specifying a maximum size when you create the queue.
To get better answers, supply better info ;-)
ANOTHER EDIT
Well, the indentation is still messed up in spots, but it's better. Have you tried any of the suggestions? It looks like your worker threads each spawn a new process, so they'll take very much longer than it takes just to read another line from the csv file. So it's indeed very likely that you put items on the queue far faster than they're taken off. So, for the umpteenth time ;-), TRY initializing the queue with (say) maxsize=64. Then reveal what happens.
BTW, the bare except: clause in worker() is a Really Bad Idea. If anything goes wrong, you'll never know. If you have to ignore every possible exception (including even KeyboardInterrupt and SystemExit), at least log the exception info.
And note what @JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.
And note what #JamesAnderson said: unless you have extraordinary hardware resources, trying to run 32 processes at a time is almost certainly slower than running a number of processes that's no more than twice the number of available cores. Then again, that depends too a lot on what your PHP program does. If, for example, the PHP program uses disk I/O heavily, any multiprocessing may be slower than none.