I'm trying to parallelize some web requests in python using multiprocessing, but it appears that occasionally, not all of the functions I send to map complete.
These results appear whether I'm using python 2 or 3.
Test script:
#!/usr/bin/env python
import multiprocessing
def my_print(string):
    print(string)
all_strings = ["alpaca", "bear", "cat", "dog", "elephant", "frog"]
pool = multiprocessing.Pool()
pool.map(my_print, all_strings)
I run it like so:
for i in `seq 1 50`; do ./test.py | wc -l; done | sort | uniq -c
And my results will look like:
6 5
44 6
...so most of the time all 6 executions of the function run, but occasionally only 5 of them run before the overall script completes. I expect the result to be "50 6" (i.e., all functions getting executed on every run).
The documentation (https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map) says "It blocks until the result is ready." I took that to mean that all functions will complete before we move to the next line of code.
Am I misunderstanding that? Does using a pool require you to always call pool.close() and pool.join() to ensure the tasks are complete?
Edit: I'm running on AWS, if that makes any obvious difference - a coworker told me I should mention that.
Thanks very much in advance!
All workers run their functions and return any values before map returns. That is true. But that doesn't mean you will see all strings immediately.
You have multiple worker processes trying to write to the same file/terminal. To make that work reliably you might have to import sys and call sys.stdout.flush() after every print() in the worker process, so each worker's buffered output is actually written out.
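For example, a minimal sketch of that fix applied to the test script (also adding the pool.close()/pool.join() calls you asked about, for an explicit shutdown):
#!/usr/bin/env python
import multiprocessing
import sys

def my_print(string):
    print(string)
    sys.stdout.flush()  # make sure this worker's buffered output reaches the terminal

all_strings = ["alpaca", "bear", "cat", "dog", "elephant", "frog"]
pool = multiprocessing.Pool()
pool.map(my_print, all_strings)
pool.close()
pool.join()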
I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs in an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances and this means there is a large cost overhead for creating many 80 Gb disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way, they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value, because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                # set up and run analysis module, etc.
                pass
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing
if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE is on Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1, then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what that is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but both still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once, then use a normal Process().
If you want to run the same function 2 times, then you can manually create 2 Process() objects.
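For example, a minimal sketch with two Process() objects, assuming workermaster is defined as in your code (join() never returns here, because workermaster loops forever):
import multiprocessing

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=workermaster) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # blocks forever, since workermaster never returns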
If you want to use Pool() to run the function 2 times, then add a list with 2 elements (even if you don't need the arguments), because that is what tells Pool() to run it 2 times.
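A sketch of the Pool() variant: map() passes one item of the iterable to each call, so workermaster needs to accept (and can simply ignore) a dummy argument:
import multiprocessing

def workermaster(dummy):
    # assumption: your existing infinite loop goes here; the dummy argument
    # exists only so that Pool.map() has something to pass in
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    pool.map(workermaster, range(2))  # a 2-item iterable => run it 2 times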
But if you run the function 2 times with the same folder, then it may run the same task 2 times, and if you run it 5 times, then it may run the same task 5 times. I don't know if that is what you need.
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
I am trying to execute n processes simultaneously. The example below works with 2 processes that are supplied externally.
At the moment it is all hard-coded for just these 2 processes but I would need to come up with the generic solution how to accomplish the same - i.e. run n processes at the same time.
My code is as follows:
import multiprocessing
'''
The first process: print 'aa'
The second Process: print 'BB'
'''
def TR1():
    print 'aaaaaaaaa'

def TR2():
    print 'BBBBBBBB'

if __name__ == '__main__':
    process_1 = multiprocessing.Process(name='process_1', target=TR1)
    process_2 = multiprocessing.Process(name='process_2', target=TR2)
    process_1.start()
    process_2.start()
Thanks for your suggestions!
You can either spawn the processes in a loop, or use an executor pool.
In real life the latter is often the preferred approach, as you can limit the pool size and gather results easily.
If you're using Python 2, there's a backport (the futures package on PyPI) that includes ProcessPoolExecutor.
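A minimal sketch of the executor approach, using the two functions from the question (extend the targets list to run n functions):
from concurrent.futures import ProcessPoolExecutor

def TR1():
    print('aaaaaaaaa')

def TR2():
    print('BBBBBBBB')

if __name__ == '__main__':
    targets = [TR1, TR2]  # extend this list for n processes
    with ProcessPoolExecutor(max_workers=len(targets)) as executor:
        futures = [executor.submit(fn) for fn in targets]
        for f in futures:
            f.result()  # waits for completion and re-raises worker exceptions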
I'm trying to run multiple exes (12 of them); because of computer resources I can spawn a maximum of 4 at a time before I get performance degradation.
I'm trying to find out if there is a way to call 4 exes at a time and, as soon as one of them finishes, call another exe to fill the resources that have freed up.
My current code does this:
import subprocess

excs = [r"path\to\exe\exe.exe", r"path\to\exe\exe.exe", r"path\to\exe\exe.exe", r"path\to\exe\exe.exe"]
running = [subprocess.Popen(ex) for ex in excs]
[process.wait() for process in running]
It repeats this process three times so that it runs all 12. Unfortunately, that means it needs to wait for all of them to finish before moving on to the next set. Is there a more efficient way of doing this?
For the record, all of the exes have different run times.
Python has ThreadPoolExecutor, which makes this very convenient:
import subprocess
from concurrent.futures import ThreadPoolExecutor
def create_pool(N, commands):
    pool = ThreadPoolExecutor(max_workers=N)
    for command in commands:
        pool.submit(subprocess.call, command)
    pool.shutdown(wait=False)

def main():
    N_WORKERS = 4
    commands = [job1, job2, ...]
    create_pool(N_WORKERS, commands)
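A hypothetical way to wire this up to the 12 exe paths from the question (the paths are placeholders); note that shutdown(wait=False) lets create_pool return without blocking on the still-running commands:
if __name__ == "__main__":
    excs = [r"path\to\exe\exe.exe"] * 12  # placeholder: your 12 exe paths
    create_pool(4, excs)                  # at most 4 commands run at any one time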
I have some classifiers which I want to evaluate on a single sample. This task can be run in parallel since the classifiers are independent of each other, so I want to parallelize it.
I tried it with python and also as a bash script. The problem is that when I run the program for the first time, it takes 30s-40s to finish. When I run the program multiple times consecutively, it takes just 1s-3s to finish. Even when I feed the classifiers different input I get different results, so the results do not seem to be cached. When I run some other program and afterwards rerun this one, it again takes 40s to finish.
I also observed in htop that the CPUs are not utilized much when the program is run for the first time, but when I rerun it again and again the CPUs are fully utilized.
Can someone please explain this strange behaviour to me? How can I avoid it, so that even the first run of the program is fast?
Here is the python code:
import time
import os
from fastText import load_model
from joblib import delayed, Parallel, cpu_count
import json

os.system("taskset -p 0xff %d" % os.getpid())

def format_duration(start_time, end_time):
    m, s = divmod(end_time - start_time, 60)
    h, m = divmod(m, 60)
    return "%d:%02d:%02d" % (h, m, s)

def classify(x, classifier_name, path):
    f = load_model(path + os.path.sep + classifier_name)
    labels, probabilities = f.predict(x, 2)
    if labels[0] == '__label__True':
        return classifier_name
    else:
        return None

if __name__ == '__main__':
    with open('classifier_names.json') as json_data:
        classifiers = json.load(json_data)
    x = "input_text"
    start_time = time.time()
    Parallel(n_jobs=cpu_count(), verbose=100, backend='multiprocessing', pre_dispatch='all')(
        delayed(classify)(x, classifier, 'clfs/') for classifier in classifiers)
    end_time = time.time()
    print(format_duration(start_time, end_time))
Here is the bash code:
#!/usr/bin/env bash
N=4
START_TIME=$SECONDS
open_sem(){
    mkfifo pipe-$$
    exec 3<>pipe-$$
    rm pipe-$$
    local i=$1
    for((;i>0;i--)); do
        printf %s 000 >&3
    done
}

run_with_lock(){
    local x
    read -u 3 -n 3 x && ((0==x)) || exit $x
    (
        "$@"
        printf '%.3d' $? >&3
    )&
}

open_sem $N
for d in classifiers/* ; do
    run_with_lock ~/fastText/fasttext predict "$d" test.txt
done

ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo time taken $ELAPSED_TIME seconds
EDITED
The bigger picture is that I am running a Flask app with 2 API methods. Each of them calls the function that parallelizes the classification. When I make requests, the app behaves the same way as the program above. The first request to method A takes a long time, and then subsequent requests take around 1s. When I switch to method B it is the same behaviour as with method A. If I switch between method A and method B several times, like A,B,A,B, then each request takes around 40s to finish.
One approach is to modify your python code to use an event loop, stay running all the time, and execute new jobs in parallel whenever new jobs are detected. One way to do this is to have a job directory, and place a file in that directory whenever there is a new job to do. The python script should also move completed jobs out of that directory to prevent running them more than once. See How to run an function when anything changes in a dir with Python Watchdog?
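A minimal sketch of that job-directory idea using plain polling (the jobs/ and done/ directory names and handle_job() are assumptions, not part of your code):
import os
import shutil
import time
from multiprocessing import Pool

JOBS_DIR = 'jobs'   # new jobspec files get dropped here
DONE_DIR = 'done'   # completed jobspecs are moved here so they run only once

def handle_job(path):
    # placeholder: classify the sample described by this jobspec
    return path

if __name__ == '__main__':
    pool = Pool()  # the script (and its workers) stay running between jobs
    while True:
        jobs = [os.path.join(JOBS_DIR, name) for name in os.listdir(JOBS_DIR)]
        if jobs:
            pool.map(handle_job, jobs)
            for job in jobs:
                shutil.move(job, DONE_DIR)  # prevent running a job twice
        else:
            time.sleep(1)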
Another option is to use a fifo file which is piped to the python script, and add new lines to that file for new jobs. https://www.linuxjournal.com/content/using-named-pipes-fifos-bash
I personally dislike parallelizing in python, and prefer to parallelize in bash using GNU parallel. To do it this way, I would:
- implement the event loop and the jobs directory or fifo-file job queue using bash and GNU parallel
- modify the python script to remove all the parallel code, read each jobspec from stdin, and process each one serially in a loop
- pipe jobs to parallel, which pipes them to ncpu python processes, each of which runs forever waiting for the next job from stdin
e.g., something like:
run_jobs.sh:
mkfifo jobs
cat jobs | parallel --pipe --round-robin -n1 ~/fastText/fasttext
queue_jobs.sh:
echo jobspec >> jobs
.py:
for jobspec in sys.stdin:
    ...
This has the disadvantage that all ncpu python processes may have the slow startup problem, but they can stay running indefinitely, so the problem becomes insignificant, and the code is much simpler and easier to debug and maintain.
Using a jobs directory and a file for each jobspec instead of a fifo jobs queue requires slightly more code, but it also makes it more straightforward to see which jobs are queued and which jobs are done.
I have a python function that has to run 12 times in total. I currently have this set up to use Pool from the multiprocessing library to run them in parallel. Typically I run 6 at a time, because the function is CPU intensive and running 12 in parallel often causes the program to crash. When we do 6 at a time, the second set of 6 will not begin until all of the first 6 processes have finished. Ideally, we would like another one (e.g. the 7th) to kick off as soon as one from the initial batch of 6 is finished, so that 6 are running at once while there are more to start. Right now the code looks like this (it would be called twice, passing the first 6 elements in one list and then the second 6 in another):
from multiprocessing import Pool
def start_pool(project_list):
    pool = Pool(processes=6)
    pool.map(run_assignments_parallel, project_list[0:6])
So I have been trying to implement a worker/queue solution, and have run into some issues. I have a worker function that looks like this:
def worker(work_queue, done_queue):
    try:
        for proj in iter(work_queue.get, 'STOP'):
            print proj
            run_assignments_parallel(proj)
            done_queue.put('finished ' + proj)
    except Exception, e:
        done_queue.put("%s failed on %s with: %s" % (current_process().name, proj, e.message))
    return True
And the code to call the worker function is as follows:
from multiprocessing import Process, Queue, current_process

workers = 6
work_queue = Queue()
done_queue = Queue()
processes = []

for project in project_list:
    print project
    work_queue.put(project)

for w in xrange(workers):
    p = Process(target=worker, args=(work_queue, done_queue))
    p.start()
    processes.append(p)
    work_queue.put('STOP')

for p in processes:
    p.join()

done_queue.put('STOP')

for status in iter(done_queue.get, 'STOP'):
    print status
project_list is just a list of paths for the 12 projects that need to be run in the function 'run_assignments_parallel.'
The way this is written now, the function is getting called more than once for the same process (project) and I can't really tell what is going on. This code is based on an example I found, and I am pretty sure the looping structure is messed up. Any help would be great and I apologize for my ignorance on the matter. Thanks!
Ideally, we would like another one (e.g. the 7th) to kick off as soon as one from the initial batch of 6 is finished- So that 6 are running at once while there are more to start.
All you need to change is to pass all 12 input parameters instead of 6:
from multiprocessing import Pool
pool = Pool(processes=6) # run no more than 6 at a time
pool.map(run_assignments_parallel, project_list) # pass full list (12 items)
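If you also want to see each project's status as soon as its worker finishes, rather than waiting for map() to return the whole list, imap_unordered is a drop-in variant; a sketch, assuming run_assignments_parallel returns something printable:
for status in pool.imap_unordered(run_assignments_parallel, project_list):
    print(status)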
You can use the MPipe module.
Create a 6-worker, single-stage pipeline and feed in all your projects as tasks. Then just read the results (in your case, statuses) off the end.
from mpipe import Pipeline, OrderedStage
...
pipe = Pipeline(OrderedStage(run_assignments_parallel, 6))
for project in project_list:
    pipe.put(project)

pipe.put(None)  # Signal end of input.

for status in pipe.results():
    print(status)