What resource does numpy "lock" across processes? - python

So a friend noticed something curious about numpy. Here is a minimal example that runs the same script first serially, then as two instances in parallel, each in its own process:
#!/bin/bash
# This is runner.sh
fl=$(mktemp /tmp/test_XXXXX.py)
trap "rm -fv '$fl'" EXIT
cat - > "$fl" <<-'EndOfHereDoc'
#!/usr/bin/env python
import datetime
import numpy as np
import sys

if __name__ == '__main__':
    if len(sys.argv) > 1: print(sys.argv[1] + ' start: ' + str(datetime.datetime.now()))
    cube_size = 100
    cube = np.zeros((cube_size, cube_size, cube_size))
    cube_ones = np.ones((cube_size, cube_size, cube_size))
    for x in range(10000):
        np.add(cube_ones, cube, out=cube)
    if len(sys.argv) > 1: print(sys.argv[1] + ' end: ' + str(datetime.datetime.now()))
EndOfHereDoc
echo "Serial"
time python "$fl" 0
echo
echo "Parallel"
time python "$fl" 1 &
time python "$fl" 2 &
wait
rm -fv "$fl"
trap '' EXIT
The output of which is:
$ runner.sh
Serial
0 start: 2018-09-19 15:46:52.540881
0 end: 2018-09-19 15:47:04.592280
real 0m12,105s
user 0m12,084s
sys 0m0,020s
Parallel
1 start: 2018-09-19 15:47:04.665260
2 start: 2018-09-19 15:47:04.780635
2 end: 2018-09-19 15:47:27.053261
real 0m22,480s
user 0m22,448s
sys 0m0,128s
1 end: 2018-09-19 15:47:27.097312
real 0m22,505s
user 0m22,409s
sys 0m0,040s
removed '/tmp/test_zWN0V.py'
No speedup. It is as if the processes were run one after the other. I assume numpy is using a resource exclusively and the other process waits for that resource to be freed. But what exactly is going on here? The GIL should only be an issue with multi-threading, not with multiple processes, right? I find it especially weird that process 2 is not simply waiting for process 1 to finish. Instead, BOTH processes take ~22s to finish. I'd expect one to get the resource and finish in half the time, while the other waits until the first releases it and takes an additional ~12s.
Note that this also occurs when running the python code with python's own multiprocessing module in a Pool. It does not occur, however, if you do something that doesn't involve certain numpy functions, for example:
cube_size = 25
cube = [0 for i in range(cube_size**3)]
for x in range(10000):
    cube = [value + 1 for value in cube]
Edit:
I have a real 4-core CPU. I kept hyperthreading in mind; it's not the issue here. During the single-process part, one CPU is at 100% and the rest are idle. During the two-process part, two are at 100% and the rest are idle (as per htop). I understand that numpy uses ATLAS, LAPACK and BLAS libraries in the background, which are not Python (in fact pure C or Fortran). These might utilize parallel techniques. My question here is: why doesn't that show up in the CPU utilization?

Numpy is not restricted by the GIL as much as core Python is. This is because numpy only stores the array container as a Python object; the actual data itself is stored as "primitive" types defined in C. This is also why iterating over a numpy array is much slower than iterating over a Python list: the numpy array has to build a Python object for each value it yields, whereas the Python list already holds Python objects.
As numpy is not hampered by the GIL, it is able to use threaded math libraries where available. That is to say, your parallel processes took longer to run because each process was already maxing out your machine and so both processes were competing for the same resources.
Take a look at the output of the following and see what's available on your machine (be warned, it's quite verbose).
import numpy.distutils.system_info as sysinfo
sysinfo.show_all()
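If the suspicion is that a threaded BLAS backend is what the two processes are fighting over, a quick way to test it is to cap the backend's thread pools before numpy is imported. This is only a sketch; which environment variable actually matters depends on whether your numpy is linked against OpenBLAS, MKL or a plain OpenMP build:
import os

# Limit the common BLAS/OpenMP thread pools to one thread per process.
# Which of these variables is honoured depends on how numpy was built.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # import after setting the variables

cube = np.zeros((100, 100, 100))
cube_ones = np.ones((100, 100, 100))
for _ in range(10000):
    np.add(cube_ones, cube, out=cube)
If two such capped processes still take roughly twice as long as one, the shared resource is probably not a thread pool but something else both processes need, such as memory bandwidth.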

Related

Multiprocessing pool for function with no arguments/iterable?

I'm running Python 2.7 on the GCE platform to do calculations. The GCE instances boot, install various packages, copy 80 GB of data from a storage bucket and run a "workermaster.py" script with nohup. The workermaster runs an infinite loop which checks a task-queue bucket for tasks. When the task bucket isn't empty it picks a random file (task) and passes work to a calculation module. If there is nothing to do the workermaster sleeps for a number of seconds and checks the task list again. The workermaster runs continuously until the instance is terminated (or something breaks!).
Currently this works quite well, but my problem is that my code only runs instances with a single CPU. If I want to scale up calculations I have to create many identical single-CPU instances, and this means there is a large cost overhead for creating many 80 GB disks and transferring the data to them each time, even though the calculation is only "reading" one small portion of the data for any particular calculation. I want to make everything more efficient and cost-effective by making my workermaster capable of using multiple CPUs, but after reading many tutorials and other questions on SO I'm completely confused.
I thought I could just turn the important part of my workermaster code into a function, and then create a pool of processes that "call" it using the multiprocessing module. Once the workermaster loop is running on each CPU, the processes do not need to interact with each other or depend on each other in any way; they just happen to be running on the same instance. The workermaster prints out information about where it is in the calculation, and I'm also confused about how it will be possible to tell the "print" statements from each process apart, but I guess that's a few steps from where I am now! My problems/points of confusion are:
1) My workermaster "def" doesn't return any value because it just starts an infinite loop, whereas every web example seems to have something in the format myresult = pool.map(.....); and
2) My workermaster "def" doesn't need any arguments/inputs - it just runs, whereas the examples of multiprocessing that I have seen on SO and on the Python Docs seem to have iterables.
In case it is important, the simplified version of the workermaster code is:
# module imports are here
# filepath definitions go here
def workermaster():
while True:
tasklist = cloudstoragefunctions.getbucketfiles('<my-task-queue-bucket')
if tasklist:
tasknumber = random.randint(2, len(tasklist))
assignedtask = tasklist[tasknumber]
print 'Assigned task is now: ' + assignedtask
subprocess.call('gsutil -q cp gs://<my-task-queue-bucket>/' + assignedtask + ' "' + taskfilepath + assignedtask + '"', shell=True)
tasktype = assignedtask.split('#')[0]
if tasktype == 'Calculation':
currentcalcid = assignedtask.split('#')[1]
currentfilenumber = assignedtask.split('#')[2].replace('part', '')
currentstartfile = assignedtask.split('#
currentendfile = assignedtask.split('#')[4].replace('.csv', '')
calcmodule.docalc(currentcalcid, currentfilenumber, currentstartfile, currentendfile)
elif tasktype == 'Analysis':
#set up and run analysis module, etc.
print ' Operation completed!'
os.remove(taskfilepath + assignedtask)
else:
print 'There are no tasks to be processed. Going to sleep...'
time.sleep(30)
I'm trying to "call" the function multiple times using the multiprocessing module. I think I need to use the "pool" method, so I've tried this:
import multiprocessing

if __name__ == "__main__":
    p = multiprocessing.Pool()
    pool_output = p.map(workermaster, [])
My understanding from the docs is that the __name__ line is there only as a workaround for doing multiprocessing in Windows (which I am doing for development, but GCE is on Linux). The p = multiprocessing.Pool() line is creating a pool of workers equal to the number of system CPUs, as no argument is specified. If the number of CPUs were 1 then I would expect the code to behave as it does before I attempted to use multiprocessing. The last line is the one that I don't understand. I thought that it was telling each of the processors in the pool that the "target" (thing to run) is workermaster. From the docs there appears to be a compulsory argument which is an iterable, but I don't really understand what this is in my case, as workermaster doesn't take any arguments. I've tried passing it an empty list, empty string, empty brackets (tuple?) and it doesn't do anything.
Please would it be possible for someone to help me out? There are lots of discussions about using multiprocessing, and this thread Mulitprocess Pools with different functions and this one python code with mulitprocessing only spawns one process each time seem to be close to what I am doing, but both still have iterables as arguments. If there is anything critical that I have left out please advise and I will modify my post - thank you to anyone who can help!
Pool() is useful if you want to run the same function with different arguments.
If you want to run a function only once then use a normal Process().
If you want to run the same function 2 times then you can manually create 2 Process() instances.
If you want to use Pool() to run a function 2 times then pass a list with 2 elements (even if you don't need arguments), because that is what tells Pool() to run it 2 times.
But if you run the function 2 times against the same folder then it may run the same task 2 times. If you run it 5 times then it may run the same task 5 times. I don't know if that is what you need.
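A minimal sketch of both approaches (assuming workermaster is adjusted to accept, and ignore, one dummy argument so that Pool.map can call it):
import multiprocessing

def workermaster(_=None):
    # placeholder for your real infinite loop; the argument is ignored
    print('worker started')

if __name__ == '__main__':
    # Option 1: start N independent Process objects, no arguments needed
    procs = [multiprocessing.Process(target=workermaster) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Option 2: use a Pool and pass a dummy iterable with one entry per run
    pool = multiprocessing.Pool(2)
    pool.map(workermaster, [None, None])  # runs workermaster twice
    pool.close()
    pool.join()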
As for Ctrl+C, I found Catch Ctrl+C / SIGINT and exit multiprocesses gracefully in python on Stack Overflow, but I don't know if it resolves your problem.
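For what it's worth, a common sketch of that pattern (untested against your setup) is to have the pool workers ignore SIGINT and let the parent catch KeyboardInterrupt and terminate the pool:
import signal
import multiprocessing

def init_worker():
    # worker processes ignore SIGINT; only the parent reacts to Ctrl+C
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def work(task):
    return task * 2  # stand-in for real work

if __name__ == '__main__':
    pool = multiprocessing.Pool(4, initializer=init_worker)
    try:
        # a timeout on get() keeps the parent interruptible while it waits
        results = pool.map_async(work, range(100)).get(timeout=9999)
        pool.close()
    except KeyboardInterrupt:
        pool.terminate()
    pool.join()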

Why are the parallel tasks always slow at the first time?

I have some classifiers which I want to evaluate on one sample. This task can be run in parallel since the classifiers are independent of each other, so I want to parallelize it.
I tried it with python and also as a bash script. The problem is that when I run the program for the first time, it takes 30s-40s to finish. When I run the program multiple times consecutively, it takes just 1s-3s to finish. Even if I feed the classifiers different input I get different results, so it does not seem to be caching. When I run some other program and afterwards rerun this one, it again takes 40s to finish.
I also observed in htop that the CPUs are not utilized that much when the program is run for the first time, but when I rerun it again and again the CPUs are fully utilized.
Can someone please explain this strange behaviour to me? How can I avoid it, so that even the first run of the program is fast?
Here is the python code:
import time
import os
from fastText import load_model
from joblib import delayed, Parallel, cpu_count
import json

os.system("taskset -p 0xff %d" % os.getpid())

def format_duration(start_time, end_time):
    m, s = divmod(end_time - start_time, 60)
    h, m = divmod(m, 60)
    return "%d:%02d:%02d" % (h, m, s)

def classify(x, classifier_name, path):
    f = load_model(path + os.path.sep + classifier_name)
    labels, probabilities = f.predict(x, 2)
    if labels[0] == '__label__True':
        return classifier_name
    else:
        return None

if __name__ == '__main__':
    with open('classifier_names.json') as json_data:
        classifiers = json.load(json_data)
    x = "input_text"
    start_time = time.time()
    Parallel(n_jobs=cpu_count(), verbose=100, backend='multiprocessing', pre_dispatch='all')(
        delayed(classify)(x, classifier, 'clfs/') for classifier in classifiers)
    end_time = time.time()
    print(format_duration(start_time, end_time))
Here is the bash code:
#!/usr/bin/env bash
N=4
START_TIME=$SECONDS

open_sem(){
    mkfifo pipe-$$
    exec 3<>pipe-$$
    rm pipe-$$
    local i=$1
    for((;i>0;i--)); do
        printf %s 000 >&3
    done
}

run_with_lock(){
    local x
    read -u 3 -n 3 x && ((0==x)) || exit $x
    (
        "$@"
        printf '%.3d' $? >&3
    )&
}

open_sem $N
for d in classifiers/* ; do
    run_with_lock ~/fastText/fasttext predict "$d" test.txt
done
ELAPSED_TIME=$(($SECONDS - $START_TIME))
echo time taken $ELAPSED_TIME seconds
EDITED
The bigger picture is that I am running a flask app with 2 API methods. Each of them calls the function that parallelizes the classification. When I make requests, it behaves the same way as the programs above. The first request to method A takes a long time, and then subsequent requests take about 1s. When I switch to method B it is the same behavior as with method A. If I switch between method A and method B several times, like A,B,A,B, then each request takes about 40s to finish.
One approach is to modify your python code to use an event loop, stay running all the time, and execute new jobs in parallel whenever new jobs are detected. One way to do this is to have a job directory, and place a file in that directory whenever there is a new job to do. The python script should also move completed jobs out of that directory to prevent running them more than once. See How to run an function when anything changes in a dir with Python Watchdog?
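A minimal polling sketch of that idea (the jobs/ and done/ directory names are just assumptions for illustration; the Watchdog approach linked above would replace the polling loop):
import os
import shutil
import time

JOBS_DIR = 'jobs'   # new job files appear here (assumed layout)
DONE_DIR = 'done'   # completed jobs are moved here so they only run once

def run_job(path):
    # stand-in for loading the model once and classifying the job's input
    print('processing %s' % path)

if __name__ == '__main__':
    while True:
        for name in sorted(os.listdir(JOBS_DIR)):
            src = os.path.join(JOBS_DIR, name)
            run_job(src)
            shutil.move(src, os.path.join(DONE_DIR, name))
        time.sleep(1)  # idle briefly before checking for new jobs again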
Another option is to use a fifo file which is piped to the python script, and add new lines to that file for new jobs. https://www.linuxjournal.com/content/using-named-pipes-fifos-bash
I personally dislike parallelizing in python, and prefer to parallelize in bash using GNU parallel. To do it this way, I would:
- implement the event loop and jobs directory or the fifo-file job queue using bash and GNU parallel
- modify the python script to:
  - remove all the parallel code
  - read each jobspec from stdin
  - process each one serially in a loop
- pipe jobs to parallel, which pipes them to ncpu python processes, each of which runs forever waiting for the next job from stdin
e.g., something like:
run_jobs.sh:
mkfifo jobs
cat jobs | parallel --pipe --round-robin -n1 ~/fastText/fasttext
queue_jobs.sh:
echo jobspec >> jobs
.py:
for jobspec in sys.stdin:
    ...
This has the disadvantage that all ncpu python processes may have the slow startup problem, but they can stay running indefinitely, so the problem becomes insignificant, and the code is much simpler and easier to debug and maintain.
Using a jobs directory and a file for each jobspec instead of a fifo jobs queue requires slightly more code, but it also makes it more straightforward to see which jobs are queued and which jobs are done.
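To make the worker side concrete, here is a hedged sketch of what the stripped-down python script could look like, assuming the slow part of your first run is load_model and that each stdin line names one classifier to apply to a fixed input (adjust to your real jobspec format):
import sys
import os
from fastText import load_model

MODELS = {}  # cache: each worker process loads every model at most once

def get_model(path):
    if path not in MODELS:
        MODELS[path] = load_model(path)
    return MODELS[path]

if __name__ == '__main__':
    for jobspec in sys.stdin:
        classifier = jobspec.strip()
        if not classifier:
            continue
        model = get_model(os.path.join('clfs', classifier))
        labels, probabilities = model.predict("input_text", 2)
        print('%s\t%s' % (classifier, labels[0]))
        sys.stdout.flush()  # make output visible to the caller immediately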

Python multiprocessing to binary gives bad scaling

I need to run multiple instances of a binary in parallel. For this I am using the python multiprocessing module. The binary itself has its own parallelization, which can be set using the OMP_NUM_THREADS environment variable. A minimal example of my code is the following:
import sys
import os
from numpy import *
import time
import xml.etree.ElementTree as ET
from multiprocessing import Process, Queue

def cal_dist(filename):
    tic = time.time()
    ################################### COPY THE INPUT FILE ########################################
    tree = ET.parse(inputfilename + '.feb')
    tree.write(filename + '.feb', xml_declaration=True, encoding="ISO-8859-1")
    ##################################### SUBMIT THE JOB ###########################################
    os.system('export OMP_NUM_THREADS=12')
    os.system('$HOME/febiosource-2.0/bin/febio2.lnx64 -noconfig -i ' + filename + '.feb -silent')
    toc = time.time()
    print "Job %s completed in %5.2f minutes" % (filename, (toc - tic) / 60.)
    return

# INPUT PARAMETERS
inputfilename = "main-step1"
tempfilename = 'temp'
nCPU = 7

for iter in range(0, 1):
    ################################### PARALLEL PROCESSING STARTS ########################################
    # CREATE ALL THE PROCESSES,
    p = []
    maxj = nCPU
    for j in range(0, nCPU):
        p.append(Process(target=cal_dist, args=(tempfilename + str(j),)))
    # START THE PROCESSES,
    for j in range(0, nCPU):
        p[j].start()
        time.sleep(0.2)
    # JOIN THEM,
    for j in range(0, nCPU):
        p[j].join()
    ################################### PARALLEL PROCESSING ENDS ########################################
If I set OMP_NUM_THREADS=1, then increasing the nCPU gives a good scaling. That is,
for nCPU=1, job time=3.5 minutes
for nCPU=7, job time=4.2 minutes
However, if I set OMP_NUM_THREADS=12, then increasing the nCPU gives a very bad scaling. That is,
for nCPU=1, job time=3.4 minutes
for nCPU=5, job time=5.7 minutes
for nCPU=7, job time=7.5 minutes
Any ideas on how I can solve this issue? I really need to use a high number of CPUs and a high OMP_NUM_THREADS for my actual problem (the architecture of the computer is such that each node has 12 processors, and I run the job on nCPU*12 processors).
It looks like you're overloading your CPUs. With nCPU set to 1 and OMP_NUM_THREADS=12, you're spawning one process that uses twelve threads, which means you're keeping all your CPUs fully saturated. When you set nCPU to 7 with OMP_NUM_THREADS=12, you're spawning seven processes that use twelve threads each, which means you've got 12 * 7 = 84 threads running in parallel, fighting over 12 CPUs. My guess is this is creating a high context-switching overhead for the OS, and that's slowing you down.
With only 12 CPUs to work with, you're going to get diminishing returns if you try to run more than 12 threads+processes in parallel. (Unless a bunch of the work being done is I/O-bound, which doesn't seem to be the case here.)
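If you really need both levels of parallelism, one option (a sketch only, with made-up process counts) is to split the node's 12 cores between the processes and pass a per-process OMP_NUM_THREADS through the child's environment; note that os.system('export OMP_NUM_THREADS=12') in the original code runs in its own shell and does not carry over to the later os.system call anyway:
import os
import subprocess
from multiprocessing import Process

N_CORES = 12        # cores per node (from the question)
N_PROC = 4          # number of concurrent binary instances (assumption)
THREADS_PER_PROC = N_CORES // N_PROC  # keep processes * threads <= cores

def run_binary(filename):
    # give each child its own OMP_NUM_THREADS via its environment
    env = dict(os.environ, OMP_NUM_THREADS=str(THREADS_PER_PROC))
    binary = os.path.expandvars('$HOME/febiosource-2.0/bin/febio2.lnx64')
    subprocess.call([binary, '-noconfig', '-i', filename + '.feb', '-silent'], env=env)

if __name__ == '__main__':
    procs = [Process(target=run_binary, args=('temp' + str(j),)) for j in range(N_PROC)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()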

Why does python thread consume so much memory?

Why does a python thread consume so much memory?
I measured that spawning one thread consumes 8 megs of memory, almost as much as a whole new python process!
OS: Ubuntu 10.10
Edit: due to popular demand I'll give some examples. Here they are:
from os import getpid
from time import sleep
from threading import Thread

def nap():
    print 'sleeping child'
    sleep(999999999)

print getpid()
child_thread = Thread(target=nap)
sleep(999999999)
On my box, pmap pid will give 9424K
Now, let's run the child thread:
from os import getpid
from time import sleep
from threading import Thread

def nap():
    print 'sleeping child'
    sleep(999999999)

print getpid()
child_thread = Thread(target=nap)
child_thread.start() # <--- ADDED THIS LINE
sleep(999999999)
Now pmap pid will give 17620K
So, the cost for the extra thread is 17620K - 9424K = 8196K
ie. 87% of running a whole new separate process!
Now isn't that just, wrong?
This is not Python-specific, and has to do with the separate stack that gets allocated by the OS for every thread. The default maximum stack size on your OS happens to be 8MB.
Note that the 8MB is simply a chunk of address space that gets set aside, with very little memory committed to it initially. Additional memory gets committed to the stack when required, up to the 8MB limit.
The limit can be tweaked using ulimit -s, but in this instance I see no reason to do this.
As an aside, pmap shows address space usage. It isn't a good way to gauge memory usage. The two concepts are quite distinct, if related.
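If you did want a smaller per-thread reservation without touching ulimit, CPython lets you lower the stack size for threads created afterwards; a small sketch:
import threading
from time import sleep

# Request a 512 KiB stack for threads created after this call.
# The value must be at least 32 KiB and may be rounded up by the OS.
threading.stack_size(512 * 1024)

def nap():
    sleep(10)

t = threading.Thread(target=nap)
t.start()
t.join()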

Python 2.6 GC appears to cleanup objects, but memory is not released

I have a program written in python 2.6 that creates a large number of short-lived instances (it is a classic producer-consumer problem). I noticed that the memory usage as reported by top and pmap seems to increase when these instances are created and never goes back down. I was concerned that some python module I was using might be leaking memory, so I carefully isolated the problem in my code. I then proceeded to reproduce it in as short an example as possible. I came up with this:
class LeaksMemory(list):
    timesDelCalled = 0

    def __del__(self):
        LeaksMemory.timesDelCalled += 1

def leakSomeMemory():
    l = []
    for i in range(0, 500000):
        ml = LeaksMemory()
        ml.append(float(i))
        ml.append(float(i*2))
        ml.append(float(i*3))
        l.append(ml)

import gc
import os

leakSomeMemory()

print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(gc.collect()) + " objects collected")
print("__del__ was called " + str(LeaksMemory.timesDelCalled) + " times")
print(str(os.getpid()) + " : check memory usage with pmap or top")
If you run this with something like 'python2.6 -i memoryleak.py' it will stay at the interactive prompt, and you can use pmap -x PID to check the memory usage. I added the __del__ method so I could verify that GC was occurring. It is not there in my actual program and does not appear to make any functional difference. Each call to leakSomeMemory() increases the amount of memory consumed by this program. I fear I am making some simple error and that references are getting kept by accident, but I cannot identify it.
Python will release the objects, but it will not release the memory back to the operating system immediately. Instead, it will re-use the same segments for future allocations within the same interpreter.
Here's a blog post about the issue: http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
UPDATE: I tested this myself with Python 2.6.4 and didn't notice persistent increases in memory usage. Some invocations of leakSomeMemory() caused the memory footprint of the Python process to increase, and some made it decrease again. So it all depends on how the allocator is re-using the memory.
According to Alex Martelli:
"The only really reliable way to
ensure that a large but temporary use
of memory DOES return all resources to
the system when it's done, is to have
that use happen in a subprocess, which
does the memory-hungry work then
terminates."
So, in your situation it sounds like it would make sense to use the multiprocessing module to run the short-lived functions in separate processes to ensure the return of resources when the process finishes.
import multiprocessing as mp

def NOT_leakSomeMemory(i):
    # do stuff
    return result

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(NOT_leakSomeMemory, range(500000))
For more ideas on how to set things up using multiprocessing, see Doug Hellman's tutorial:
