Why are the parallel tasks always slow the first time? - python

I have several classifiers which I want to evaluate on a single sample. The classifiers are independent of each other, so the task can be run in parallel, which is what I want to do.
I tried it in Python and also as a bash script. The problem is that when I run the program for the first time, it takes about 30s-40s to finish. When I run it several times in a row, it takes just 1s-3s. Even when I feed the classifiers different input I get different results, so there does not seem to be any caching of results. When I run some other program and afterwards rerun my program, it again takes about 40s to finish.
I also observed in htop that the CPUs are not utilized much when the program runs for the first time, but when I rerun it again and again they are fully utilized.
Can someone please explain this strange behaviour to me? How can I avoid it, so that even the first run of the program is fast?
Here is the python code:
import time
import os
import json

from fastText import load_model
from joblib import delayed, Parallel, cpu_count

os.system("taskset -p 0xff %d" % os.getpid())

def format_duration(start_time, end_time):
    m, s = divmod(end_time - start_time, 60)
    h, m = divmod(m, 60)
    return "%d:%02d:%02d" % (h, m, s)

def classify(x, classifier_name, path):
    f = load_model(path + os.path.sep + classifier_name)
    labels, probabilities = f.predict(x, 2)
    if labels[0] == '__label__True':
        return classifier_name
    else:
        return None

if __name__ == '__main__':
    with open('classifier_names.json') as json_data:
        classifiers = json.load(json_data)
    x = "input_text"
    start_time = time.time()
    Parallel(n_jobs=cpu_count(), verbose=100, backend='multiprocessing', pre_dispatch='all') \
        (delayed(classify)(x, classifier, 'clfs/') for classifier in classifiers)
    end_time = time.time()
    print(format_duration(start_time, end_time))
Here is the bash code:
#!/usr/bin/env bash
N=4
START_TIME=$SECONDS
open_sem(){
    mkfifo pipe-$$
    exec 3<>pipe-$$
    rm pipe-$$
    local i=$1
    for((;i>0;i--)); do
        printf %s 000 >&3
    done
}
run_with_lock(){
    local x
    read -u 3 -n 3 x && ((0==x)) || exit $x
    (
        "$@"
        printf '%.3d' $? >&3
    )&
}
open_sem $N
for d in classifiers/* ; do
    run_with_lock ~/fastText/fasttext predict "$d" test.txt
done
ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo time taken $ELAPSED_TIME seconds
EDITED
The bigger picture is that I am running a flask app with 2 API methods. Each of them calls the function that parallelizes the classification. When I am making requests, it behaves the same way as the program above: the first request to method A takes a long time, and then subsequent requests take about 1s. When I switch to method B, the behavior is the same as with method A. If I switch between method A and method B several times, like A,B,A,B, then each request takes about 40s to finish.

One approach is to modify your python code to use an event loop, stay running all the time, and execute new jobs in parallel whenever new jobs are detected. One way to do this is to have a job directory, and to place a file in that directory whenever there is a new job to do. The python script should also move completed jobs out of that directory to prevent running them more than once. How to run a function when anything changes in a dir with Python Watchdog?
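For illustration, a minimal sketch of that job-directory idea using the watchdog package; the directory names, the process_job stub and the move-to-done step are placeholders, not taken from the question:

import shutil
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

JOBS_DIR = 'jobs'        # hypothetical: new jobspec files appear here
DONE_DIR = 'jobs_done'   # hypothetical: completed jobspecs are moved here

def process_job(path):
    # placeholder: load the classifiers and classify the jobspec found at `path`
    print('processing', path)

class JobHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        # run the classification for this jobspec, then move the file
        # out of the jobs directory so it is never processed twice
        process_job(event.src_path)
        shutil.move(event.src_path, DONE_DIR)

if __name__ == '__main__':
    observer = Observer()
    observer.schedule(JobHandler(), JOBS_DIR, recursive=False)
    observer.start()            # the script now stays running as an event loop
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()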
Another option is to use a fifo file which is piped to the python script, and add new lines to that file for new jobs. https://www.linuxjournal.com/content/using-named-pipes-fifos-bash
I personally dislike parallelizing in python, and prefer to parallelize in bash using GNU parallel. To do it this way, I would:
- implement the event loop and jobs directory or the fifo file job queue using bash and GNU parallel
- modify the python script to remove all the parallel code
  - read each jobspec from stdin
  - process each one serially in a loop
- pipe jobs to parallel, which pipes them to ncpu python processes, which each runs forever waiting for the next job from stdin
e.g., something like:
run_jobs.sh:
mkfifo jobs
cat jobs | parallel --pipe --round-robin -n1 ~/fastText/fasttext
queue_jobs.sh:
echo jobspec >> jobs
.py:
import sys

for jobspec in sys.stdin:
    ...
This has the disadvantage that all ncpu python processes may have the slow startup problem, but they can stay running indefinitely, so the problem becomes insignificant, and the code is much simpler and easier to debug and maintain.
Using a jobs directory and a file for each jobspec instead of a fifo jobs queue requires slightly more code, but it also makes it more straightforward to see which jobs are queued and which jobs are done.
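For the python side of that scheme, a rough sketch of such a serial worker, assuming a jobspec is one line of input text and reusing the classifier files from the original question (classifier_names.json, clfs/); the output format here is made up for illustration:

import json
import os
import sys
from fastText import load_model

# load every classifier once, at startup: the expensive part is paid only
# when the long-running worker process starts, not once per request
with open('classifier_names.json') as f:
    classifier_names = json.load(f)
models = {name: load_model(os.path.join('clfs', name)) for name in classifier_names}

# then serve jobspecs from stdin forever (here: one input text per line)
for jobspec in sys.stdin:
    text = jobspec.strip()
    matches = [name for name, model in models.items()
               if model.predict(text, 2)[0][0] == '__label__True']
    print(' '.join(matches))
    sys.stdout.flush()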

Related

Mulitprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes the work to a calculation module. If there is nothing to do, the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up the calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though each calculation only "reads" one small portion of the data. I want to make everything more efficient and cost effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/confusion are that:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and in the Python docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
    while True:
        tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket>')
        if tasklist:
            tasknumber = random.randint(2, len(tasklist))
            assignedtask = tasklist[tasknumber]
            print 'Assigned task is now: ' + assignedtask
            subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
            tasktype = assignedtask.split('#')[0]
            if tasktype == 'Calculation':
                currentcalcid = assignedtask.split('#')[1]
                currentfilenumber = assignedtask.split('#')[2].replace('part', '')
                currentstartfile = assignedtask.split('#')[3]
                currentendfile = assignedtask.split('#')[4].replace('.csv', '')
                calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
            elif tasktype == 'Analysis':
                # set up and run analysis module, etc.
                pass
            print ' Operation completed!'
            os.remove(taskfilepath + assignedtask)
        else:
            print 'There are no tasks to be processed. Going to sleep...'
            time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing on Windows (which I am using for development, but GCE runs Linux). The p = multiprocessing.Pool() line creates a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it did before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, an empty string and empty brackets (a tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but they still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() objects.
If you want to use Pool() to run the function 2 times, then add a list with 2 arguments (even if you don't need arguments), because that is the information Pool() uses to run it 2 times.
But if you run the function 2 times with the same folder then it may run the same task 2 times. If you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
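For illustration, a minimal sketch of both options, assuming workermaster really needs no arguments (the dummy parameter exists only so Pool.map has something to pass):

import multiprocessing

def workermaster(_=None):
    # the real function would loop over the task bucket here
    print('worker started')

if __name__ == '__main__':
    # Option 1: start two Process objects by hand
    procs = [multiprocessing.Process(target=workermaster) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Option 2: use a Pool and pass a dummy iterable with one entry per run
    pool = multiprocessing.Pool(processes=2)
    pool.map(workermaster, [None, None])
    pool.close()
    pool.join()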
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.

Python: Run multiple exe with load balance

I'm trying to run multiple exe's (12 of them); because of computer resources I can spawn a maximum of 4 at a time before I get performance degradation.
I'm trying to find out if there is a way to call 4 exe's at a time and, as soon as one of them finishes, call another exe to fill the resources that have freed up.
My current code does this:
import subprocess

excs = [r"path\to\exe\exe.exe", r"path\to\exe\exe.exe", r"path\to\exe\exe.exe", r"path\to\exe\exe.exe"]
running = [subprocess.Popen(ex) for ex in excs]
[process.wait() for process in running]
It repeats this process three times so that it runs all 12. Unfortunately this means it needs to wait for all of them to finish before moving on to the next set. Is there a more efficient way of doing this?
For the record, all of the exe's have different run times.
Python has ThreadPoolExecutor which makes this very convenient
import subprocess
from concurrent.futures import ThreadPoolExecutor

def create_pool(N, commands):
    pool = ThreadPoolExecutor(max_workers=N)
    for command in commands:
        pool.submit(subprocess.call, command)
    pool.shutdown(wait=False)

def main():
    N_WORKERS = 4
    commands = [job1, job2, ...]
    create_pool(N_WORKERS, commands)
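If the script should block until all 12 commands have finished, a variant that uses the executor as a context manager (same idea, just waiting on shutdown) could look roughly like this; the command list is a placeholder copied from the question:

import subprocess
from concurrent.futures import ThreadPoolExecutor

commands = [r"path\to\exe\exe.exe"] * 12   # placeholder paths, as in the question

with ThreadPoolExecutor(max_workers=4) as pool:   # at most 4 exes run at once
    futures = [pool.submit(subprocess.call, cmd) for cmd in commands]
# leaving the with-block waits for every submitted command to finish
return_codes = [f.result() for f in futures]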

python multiprocessing pool.map not blocking?

I'm trying to parallelize some web requests in python using multiprocessing, but it appears that occasionally not all of the functions I send to map complete.
These results appear whether I'm using python 2 or 3.
Test script:
#!/usr/bin/env python
import multiprocessing
def my_print(string):
    print(string)
all_strings = ["alpaca", "bear", "cat", "dog", "elephant", "frog"]
pool = multiprocessing.Pool()
pool.map(my_print, all_strings)
I run it like so:
for i in `seq 1 50`; do ./test.py | wc -l; done | sort | uniq -c
And my results will look like:
6 5
44 6
...so most of the time all 6 executions of the function run, but occasionally only 5 of them run before the overall script completes. I expect the result to be 50 6 (i.e., all functions getting executed on every run).
The documentation https://docs.python.org/2/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map says "It blocks until the result is ready". I assumed that to mean all functions will complete before we move to the next line of code.
Am I misunderstanding that? Does using a pool require you to always call pool.close() and pool.join() to ensure the tasks are complete?
Edit: I'm running on AWS, if that makes any obvious difference - a coworker told me I should mention that.
Thanks very much in advance!
All workers run their functions and return any values before map returns. That is true. But that doesn't mean you will see all strings immediately.
You have multiple worker processes trying to write to the same file/terminal. To make that work you might have to import sys and call sys.stdout.flush() after every print() in the worker process.
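For example, the test script could be adjusted like this (just a sketch of that suggestion):

#!/usr/bin/env python
import multiprocessing
import sys

def my_print(string):
    print(string)
    sys.stdout.flush()  # push the line out of this worker's stdout buffer right away

all_strings = ["alpaca", "bear", "cat", "dog", "elephant", "frog"]
pool = multiprocessing.Pool()
pool.map(my_print, all_strings)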

Synchronize a Linux system command and a while-loop in Python

With the RaspberryPi system I have to synchronize a Raspbian system command (raspivid -t 20000) with a while loop that reads continuously from a sensor and stores samples in an array. The Raspbian command starts a video recording with the RaspberryPi camera CSI module, and I have to be sure that it starts at the same instant as the acquisition by the sensor. I have seen many solutions that have confused me, among modules like multiprocessing, threading, subprocess, etc. So far the only thing I have understood is that the os.system() function blocks execution of the following Python commands in the script for as long as it runs. So if I try:
import os
import numpy as np

os.system("raspivid -t 20000 /home/pi/test.h264")
data = np.zeros(20000, dtype="float") # memory pre-allocation supposing I have to save 20000 samples from the sensor (1 for each millisecond of the video)
indx = 0
while True:
    sens = readbysensor() # where the readbysensor() function is defined before in the script and reads a sample from the sensor
    data[indx] = sens
    if indx == 19999:
        break
    else:
        indx += 1
that while-loop will run only after the os.system() call finishes. But as I wrote above, I need the two processes to be synchronized and to work in parallel. Any suggestions?
Just add an & at the end, to make the process detach to the background:
os.system("raspivid -t 20000 /home/pi/test.h264 &")
According to bash man pages:
If a command is terminated by the control operator &, the shell
executes the command in the background in a subshell. The shell does
not wait for the command to finish, and the return status is 0.
Also, if you want to minimize the time it takes for the loop to start after executing raspivid, you should allocate your data and indx prior to the call:
data = np.zeros(20000, dtype="float")
indx=0
os.system("raspivid -t 20000 /home/pi/test.h264 &")
while True:
    # ....
Update:
Since we discussed further in the comments, it is clear that there is not really a need to start the loop "at the same time" as raspivid (whatever that might mean), because if you are trying to read data from the I2C and make sure you don't miss any data, you are best off starting the reading operation prior to running raspivid. This way you are certain that in the meantime (however big a delay there is between those two executions) you are not missing any data.
Taking this into consideration, your code could look something like this:
data = np.zeros(20000, dtype="float")
indx=0
os.system("(sleep 1; raspivid -t 20000 /home/pi/test.h264) &")
while True:
    # ....
This is the simplest version in which we add a delay of 1 second before running raspivid, so we have time to enter our while loop and start waiting for I2C data.
This works, but it is hardly production quality code. For a better solution, run the data acquisition function in one thread and the raspivid in a second thread, preserving the launch order (the reading thread is started first).
Something like this:
import Queue
import threading
import os

# we will store all data in a Queue so we can process
# it at a custom speed, without blocking the reading
q = Queue.Queue()

# thread for getting the data from the sensor
# it puts the data in a Queue for processing
def get_data(q):
    for cnt in xrange(20000):
        # assuming readbysensor() is a
        # blocking function
        sens = readbysensor()
        q.put(sens)

# thread for processing the results
def process_data(q):
    for cnt in xrange(20000):
        data = q.get()
        # do something with data here
        q.task_done()

t_get = threading.Thread(target=get_data, args=(q,))
t_process = threading.Thread(target=process_data, args=(q,))
t_get.start()
t_process.start()

# when everything is set and ready, run the raspivid
os.system("raspivid -t 20000 /home/pi/test.h264 &")

# wait for the threads to finish
t_get.join()
t_process.join()

# at this point all processing is completed
print "We are all done!"
You could rewrite your code as:
import subprocess
import numpy as np
n = 20000
p = subprocess.Popen(["raspivid", "-t", str(n), "/home/pi/test.h264"])
data = np.fromiter(iter(readbysensor, None), dtype=float, count=n)
subprocess.Popen() returns immediately without waiting for raspivid to end.

Python multiprocessing + subprocess issues

I have a binary (say a.out) that I want to call with different configs. I want to run these configs on a 40-core machine in parallel. Below is a sketch of my code.
It is very straightforward: I generate a config and pass it into the worker, and the worker calls the binary with the config using subprocess. I am also redirecting the output to a file. Let's call this piece of code run.py.
from multiprocessing import Pool
import subprocess

def worker(cmdlist, filename):
    outputfile = open(filename, 'wb')
    # here it essentially executes: a.out config > outputfile
    subprocess.call(cmdlist, stderr=outputfile, stdout=outputfile)
    outputfile.close()

def main():
    pool = Pool(processes=40)
    results = []
    for config in all_configs:
        filename, cmdlist = genCmd(config)
        res = pool.apply_async(worker, [cmdlist, filename])
        results.append(res)
    for res in results:
        res.get()
    pool.close()
But after I kick it off, I realized that I am not spawning as many processes as I want. I definitely submitted more than 40 workers, but in top, I am only seeing about 20 of a.out.
I do see many of the run.py processes in "sleeping" state (i.e., "S" in top). When I do a ps auf, I also see a lot of run.py in "S+" state, with no binary spawned. Only about half of them have spawned a.out.
I am wondering why this is happening. I am redirecting the output to a network-mounted hard drive, which could be a reason, but in top I only see 10% wa (which in my understanding is 10% of the time waiting for IO). I don't think that explains 50% of the CPUs being idle. Plus, I should at least see the binary spawned, instead of everything being stuck at run.py. My binary's runtime is also long enough; I should really be seeing 40 jobs running for a long time.
Any other explanation? Anything I did wrong in my python code?
An approach I have used to run many simultaneous processes at once on multiple cores is to use p = subprocess.Popen(...) and p.poll(). In your case I think you would be able to skip using Pool altogether. I'd give you a better example but unfortunately I don't have access to that code anymore.
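For illustration, a rough sketch of that Popen/poll approach, keeping at most 40 children alive at a time; all_jobs is a hypothetical list of (cmdlist, filename) pairs such as genCmd would produce:

import subprocess
import time

# hypothetical job list: fill with (cmdlist, filename) pairs, e.g. from genCmd(config)
all_jobs = []
MAX_RUNNING = 40

pending = list(all_jobs)
running = []                    # (Popen object, open output file) pairs

while pending or running:
    # start new children while there is work and a free slot
    while pending and len(running) < MAX_RUNNING:
        cmdlist, filename = pending.pop()
        outputfile = open(filename, 'wb')
        running.append((subprocess.Popen(cmdlist, stdout=outputfile, stderr=outputfile),
                        outputfile))

    # reap finished children: poll() returns None while a process is still running
    still_running = []
    for p, outputfile in running:
        if p.poll() is None:
            still_running.append((p, outputfile))
        else:
            outputfile.close()
    running = still_running
    time.sleep(0.5)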
