I am using Ubuntu 17.04 64-bit with an Intel® Core™ i7-7500U CPU @ 2.70GHz × 4 and 16 GB of RAM.
When I run this program, it uses a single core instead of all 4 cores.
import time
import multiprocessing

def boom1(*args):
    print(5**10000000000)

def boom2(*args):
    print(5**10000000000)

def boom3(*args):
    print(5**10000000000)

def boom4(*args):
    print(5**10000000000)

if __name__ == "__main__":
    array = []
    p1 = multiprocessing.Process(target=boom1, args=(array,))
    p2 = multiprocessing.Process(target=boom2, args=(array,))
    p3 = multiprocessing.Process(target=boom3, args=(array,))
    p4 = multiprocessing.Process(target=boom4, args=(array,))
    p1.start()
    p2.start()
    p3.start()
    p4.start()
    p1.join()
    p2.join()
    p3.join()
    p4.join()
    print('Done')
Now if I print the same base raised to a much smaller exponent in each function:
print(5 ** 10000000)
then for a small duration of time a single core runs at 100%, and only afterwards do all 4 cores run at 100%.
Why is this so? Shouldn't it start with all cores at 100%?
What I have come to understand is that Python performs some operation beforehand and was doing that on a single core. If that is so, then what is the point of Python being an interpreted language, or am I missing something?
The peephole optimizer is trying to constant-fold the 5**10000000000 calculation. This happens before any worker processes launch.
Most languages have a constant-folding optimization: when an operation between constants appears, the compiler will perform the operation and replace the expression with the single-constant result.
Python does this as well. I expect that your multi-core activity was simply the sequence of start-print-join on each process.
If you want to get longer runs on the four processes, try an expression that can't be evaluated at compile time. For instance, pass the base in the argument list and use that instead of 5, or perhaps have each process pick a random number in the range 1-10 and add that to the exponent. This should force run-time evaluation.
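For instance, a minimal sketch of the first suggestion (using the smaller exponent from the question so it terminates): the base is only known at run time, so the power cannot be folded into a constant and each worker does the real work.
import multiprocessing

def boom(base):
    # 'base' is an argument, so 'base ** 10000000' cannot be constant-folded
    # at compile time; the computation happens inside each worker process.
    print(base ** 10000000)

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=boom, args=(5,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()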
I believe the previous answers are correct, but may not fully explain your observations. As others have pointed out, the single-core time you're seeing is the time the interpreter spends calculating the value of the exponential expression. Because you're using integers, and Python supports arbitrarily long integers, this takes quite a while, probably exponential in the number of zeros in your exponent. In the first case, the calculation takes so long that it appears never to get past that point (I don't know if you ran it to completion, or if your machine could even do it without running out of memory).
In the second case, you've removed enough zeros that it can calculate the value (single-threaded, in the interpreter) and then proceed to print it (in parallel). However long that took, it will probably take at least 1000 times longer in the first case.
The attached script evaluates the numpy.conjugate routine for varying numbers of parallel processes on differently sized matrices and records the corresponding run times.
The matrix shape only varies in its first dimension (from 1,64,64 to 256,64,64). Conjugation calls are always made on 1,64,64 sub-matrices to ensure that the parts being worked on fit into the L2 cache on my system (256 KB per core; the L3 cache in my case is 25 MB). Running the script yields the following diagram (with slightly different axis labels and colors).
As you can see, starting from a shape of around 100,64,64 the runtime depends on the number of parallel processes used.
What could be the cause of this?
Or why is the dependence on the number of processes for matrices below (100,64,64) so low?
My main goal is to find a modification to this script such that the runtime becomes as independent as possible of the number of processes for matrices 'a' of arbitrary size.
In the case of 20 processes:
all 'a' matrices take at most: 20 * 16 * 256 * 64 * 64 bytes = 320 MiB
all 'b' sub-matrices take at most: 20 * 16 * 1 * 64 * 64 bytes = 1.25 MiB
So all sub-matrices fit simultaneously into the L3 cache, and each fits individually into the per-core L2 cache of my CPU.
I only used physical cores, no hyper-threading, for these tests.
Here is the script:
from multiprocessing import Process, Queue
import time
import numpy as np
import os
from matplotlib import pyplot as plt

os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

def f(q, size):
    a = np.random.rand(size, 64, 64) + 1.j*np.random.rand(size, 64, 64)
    start = time.time()
    n = a.shape[0]
    for i in range(20):
        for b in a:
            b.conj()
    duration = time.time() - start
    q.put(duration)

def speed_test(number_of_processes=1, size=1):
    process_list = []
    queue = Queue()
    # Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f, args=(queue, size))
        process_list.append(p)
        p.start()
    # Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)

if __name__ == '__main__':
    processes = np.arange(1, 20, 3)
    data = [[] for i in processes]
    for p_id, p in enumerate(processes):
        for size_0 in range(1, 257):
            data[p_id].append(speed_test(number_of_processes=p, size=size_0))
    fig, ax = plt.subplots()
    for d in data:
        ax.plot(d)
    ax.set_xlabel('Matrix Size: 1-256,64,64')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')
The problem is due to at least a combination of two complex effects: cache thrashing and frequency scaling. I can reproduce the effect on my 6-core i5-9600KF processor.
Cache thrashing
The biggest effect comes from a cache-thrashing issue. It can easily be tracked by looking at the RAM throughput: it is 4 GiB/s for 1 process and 20 GiB/s for 6 processes. The read throughput is similar to the write throughput, so the traffic is symmetric. My RAM is able to reach up to ~40 GiB/s, but usually only ~32 GiB/s for mixed read/write patterns. This means the RAM pressure is pretty big. Such a use case typically occurs in two situations:
an array is read from / written back to RAM because the caches are not big enough;
many accesses are made to different locations in memory, but they map to the same cache lines in the L3.
At first glance, the first case is much more likely to happen here since the arrays are contiguous and pretty big (the other effect unfortunately also happens, see below). In fact, the main problem is that the a array is too big to fit in the L3. Indeed, when size is >128, a takes more than 128*64*64*8*2 = 8 MiB per process. Actually, a is built from two arrays that must be read, so the space needed in cache is 3 times bigger than that, i.e. >24 MiB per process. The thing is, all processes allocate the same amount of memory, so the more processes there are, the bigger the cumulative space taken by a. When the cumulative space is bigger than the cache, the processor needs to write data to RAM and read it back, which is slow.
In fact, this is even worse: processes are not fully synchronized, so some processes can flush data needed by others while filling a.
Furthermore, b.conj() creates a new array that may not be allocated at the same memory location every time, so the processor also needs to write that data back. This effect is dependent on the low-level allocator being used. One can use the out parameter to fix this problem. That being said, the problem was not significant on my machine (using out was 2% faster with 6 processes and equally fast with 1 process).
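For reference, a minimal sketch of the out-parameter idea (array sizes are just illustrative): np.conjugate can write into a preallocated buffer, so the destination stays at the same memory location on every iteration.
import numpy as np

a = np.random.rand(256, 64, 64) + 1.j * np.random.rand(256, 64, 64)
out = np.empty_like(a[0])         # reused destination buffer
for b in a:
    np.conjugate(b, out=out)      # no fresh allocation per call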
Put shortly, more processes access more data, and when the global amount of data does not fit in the CPU caches, performance decreases since arrays need to be reloaded over and over.
Frequency scaling
Modern processors use frequency scaling (like turbo boost) to make (fairly) sequential applications faster, but they cannot use the same high frequency for all cores when all of them are doing computation, because processors have a limited power budget. This results in lower theoretical scalability: since all processes do the same work, N processes running on N cores take more time than 1 process running on 1 core.
When 1 process is created, two cores operate at 4550-4600 MHz (and the others at 3700 MHz), while when 6 processes are running, all cores operate at 4300 MHz. This is enough to explain a difference of up to 7% on my machine.
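If you want to observe this on your own machine, one option (assuming the third-party psutil package is installed and your platform reports per-core frequencies) is to sample them while the workers run:
import time
import psutil

# Print the current frequency of every core once per second (stop with Ctrl+C).
while True:
    freqs = psutil.cpu_freq(percpu=True)
    print([round(f.current) for f in freqs])  # MHz, one entry per core
    time.sleep(1)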
You can hardly control the turbo frequency, but you can either disable it completely or pin the frequency so that the minimum and maximum frequency are both set to the base frequency. Note that the processor is free to use a much lower frequency in pathological cases (i.e. throttling, when a critical temperature is reached). I do see improved behaviour by tweaking frequencies (7-10% better in practice).
Other effects
When the number of processes is equal to the number of cores, the OS does more context switches of the processes than if one core is left free for other tasks. Context switches slightly decrease the performance of the processes. This is especially true when all cores are allocated, because it is harder for the OS scheduler to avoid unnecessary migrations. This usually happens on PCs with many running processes, but not so much on computing machines. This overhead is about 5-10% on my machine.
Note that the number of processes should not exceed the number of cores (not hyper-threads). Beyond this limit, performance is hardly predictable and many complex overheads appear (mainly scheduling issues).
I'll accept Jérôme's answer.
For the interested reader who might ask:
Why are you subdividing your big numpy array and only working on sub-matrices?
The answer is that it's faster!
Let's consider a complex matrix 'a' which is 128 MB big (too big to fit in cache).
For a single process one can quickly check that in
import numpy as np
import timeit
a=np.random.rand(8192,32,32)+1.j*np.random.rand(8192,32,32)
print(timeit.timeit('a.conj()',number=100,globals=globals()))
print(timeit.timeit('for i in range(0,8192,8): a[i:i+8].conj()',number = 100 ,globals = globals()))
the second timeit call, which iterates over 128 kB sub-matrices, finishes faster than the first (provided 128 kB is somewhere between your L1 and L2 cache sizes).
In the following plots I show computation time vs. sub-matrix size, computed on two test machines. There are two plots for each test case, covering the sub-matrix size ranges 16 kB - 1024 kB (in 16 kB steps) and 0.5 MB - 64 MB (in 0.5 MB steps) respectively.
Machine I: 2 * Xeon E5-2640 v3 (L1i=L1d=32 KB, L2=256 KB, L3=20 MB, 10 cores)
Machine II: 2 * Xeon E5-2640 v4 (L1i=L1d=32 KB, L2=256 KB, L3=50 MB, 20 cores)
The sub-matrix size for which the calculation completes the quickest (64 KB) is suspiciously exactly the size of the combined L1 cache of the two CPUs on each of the test machines.
At the value of the combined L2 cache (512 KB) nothing special happens.
As soon as the combined sub-matrix size of all parallel running processes exceeds the L3 cache of one of the available CPUs, the computation time starts to increase rapidly (e.g. Machine I: 19 processes at ~1 MB; Machine II: 37 processes at ~1.3 MB).
Here is the script for the plots:
from multiprocessing import Process, Queue
import time
import numpy as np
import timeit
from matplotlib import pyplot as plt
import os

os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

m_shape = (8192, 32, 32)

def f(q, size):
    a = np.random.rand(*m_shape) + 1.j*np.random.rand(*m_shape)
    start = time.time()
    n = a.shape[0]
    for i in range(0, n, size):
        a[i:i+size].conj()
    duration = time.time() - start
    q.put(duration)

def speed_test(number_of_processes=1, size=1):
    process_list = []
    queue = Queue()
    # Start processes
    for p_id in range(number_of_processes):
        p = Process(target=f, args=(queue, size))
        process_list.append(p)
        p.start()
    # Wait until all processes are finished
    for p in process_list:
        p.join()
    output = []
    while queue.qsize() != 0:
        output.append(queue.get())
    return np.mean(output)

if __name__ == '__main__':
    processes = np.arange(1, 20, 3)
    data = [[] for i in processes]

    ## L1 L2 cache data range
    sub_matrix_sizes = list(range(1, 64, 1))

    ## L3 cache data range
    #sub_matrix_sizes = list(range(32, 4098, 32))
    #sub_matrix_sizes.append(8192)

    for p_id, p in enumerate(processes):
        for size_0 in sub_matrix_sizes:
            data[p_id].append(speed_test(number_of_processes=p, size=size_0))
        print('{} of {} finished.'.format(p_id+1, len(processes)))

    data = np.array(data)
    sub_size_in_kb = np.array(sub_matrix_sizes)*np.dtype(complex).itemsize*np.prod(m_shape[1:])/1024
    sub_size_in_mb = sub_size_in_kb/1024

    fig, ax = plt.subplots()
    for d in data:
        ax.plot(sub_size_in_kb, d)
    ax.set_xlabel('Matrix Size in KB')
    #ax.set_xlabel('Matrix Size in MB')
    ax.set_ylabel('Runtime in seconds')
    fig.savefig('result.png')
    print('done.')
I have a function which I will run using multi-processing. However the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.
cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))

for core in cores:
    core.start()

for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
from multiprocessing import Pool
import os

# Renaming this variable just for clarity of the example here
work_queue = datasets

# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()

# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)

# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)

# Do something with the result
print(processed_work)
You cannot return a variable from another process. The recommended way is to create a Queue (multiprocessing.Queue), have your subprocess put its results into that queue, and once it's done, read them back -- this works well if you have a lot of results.
If you just need a single number, using a Value or an Array could be easier.
Just remember that you cannot use a simple variable for that; it has to be wrapped with the above-mentioned classes from the multiprocessing library.
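A minimal sketch of the Queue approach, adapted to the question's code (processImages and dataSets are assumed to exist as in the question):
from multiprocessing import Process, Queue
import os

def worker(queue, dataset):
    # Run the real work and push its return value into the shared queue.
    queue.put(processImages(dataset))

if __name__ == "__main__":
    queue = Queue()
    cores = [Process(target=worker, args=(queue, dataSets[i]))
             for i in range(os.cpu_count())]
    for core in cores:
        core.start()
    # Drain the queue before joining so large results cannot block the workers.
    results = [queue.get() for _ in cores]
    for core in cores:
        core.join()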
If you want to use the result object returned by a multiprocessing worker, try this:
from multiprocessing.pool import ThreadPool

def fun(fun_argument1, ... , fun_argumentn):
    <blabla>
    return object_1, object_2

pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ... , fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.
I have 100000 images and I need to get the vector for each image:
imageVectors = []
for i in range(100000):
fileName = "Images/" + str(i) + '.jpg'
imageVectors.append(getvector(fileName).reshape((1,2048)))
cPickle.dump( imageVectors, open( 'imageVectors.pkl', "w+b" ), cPickle.HIGHEST_PROTOCOL )
getvector is a function that takes one image at a time and needs about 1 second to process it. So, basically my problem reduces to:
for i in range(100000):
    A = callFunction(i)  # a complex function that takes 1 sec for each call
The things that I have already tried are (only pseudo-code is given here):
1) Using numpy vectorize:
def callFunction1(i):
    return callFunction2(i)

vfunc = np.vectorize(callFunction1)
imageVectors = vfunc(list(range(100000)))
2) Using python map:
def callFunction1(i):
    return callFunction2(i)

imageVectors = map(callFunction1, list(range(100000)))
3) Using python multiprocessing:
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 4  # arbitrary default

pool = multiprocessing.Pool(processes=cpus)
result = pool.map(callFunction, xrange(100000000))
4) Using multiprocessing in a different way:
from multiprocessing import Process, Queue
q = Queue()
N = 100000000
p1 = Process(target=callFunction, args=(N/4,q))
p1.start()
p2 = Process(target=callFunction, args=(N/4,q))
p2.start()
p3 = Process(target=callFunction, args=(N/4,q))
p3.start()
p4 = Process(target=callFunction, args=(N/4,q))
p4.start()
results = []
for i in range(4):
    results.append(q.get(True))
p1.join()
p2.join()
p3.join()
p4.join()
All of the above methods take an immensely long time. Is there any more efficient way, so that I can process many elements simultaneously instead of sequentially, or speed this up in any other way?
The time is mainly taken by the getvector function itself. As a workaround, I have split my data into 8 different batches, running the same program for different parts of the loop as eight separate instances of Python on an octa-core VM in Google Cloud. Could anyone suggest whether map-reduce, or using GPUs via PyCUDA, might be a good option?
The multiprocessing.Pool solution is a good one, in the sense that it uses all your cores. So it should be approximately N times faster than using plain old map, where N is the number of cores you have.
BTW, you can skip determining the number of cores. By default multiprocessing.Pool uses as many processes as your CPU has cores.
Instead of a plain map (which blocks until everything has been processed), I would suggest using imap_unordered. This is an iterator that will start returning results as soon as they become available, so your parent process can start further processing right away. If ordering is important, you might want to return a tuple (number, array) to identify each result.
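A minimal sketch of that idea, reusing getvector from the question (wrapping it so each result carries its index lets the parent restore the original order):
import multiprocessing

def indexed_getvector(i):
    # Return the index together with the vector so results can be reordered.
    fileName = "Images/" + str(i) + ".jpg"
    return i, getvector(fileName).reshape((1, 2048))

if __name__ == "__main__":
    imageVectors = [None] * 100000
    with multiprocessing.Pool() as pool:
        for i, vec in pool.imap_unordered(indexed_getvector, range(100000)):
            imageVectors[i] = vec   # handle each result as soon as it arrives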
Your function returns a numpy array of 2048 values, which I assume are numpy.float64. Using the standard mapping functions will transport the results back to the parent process using IPC. On a 4-core machine that results in 4 IPC transports of 2048*8 = 16384 bytes each, so 65536 bytes/second. That doesn't sound too bad. But I don't know how much overhead the IPC (which involves pickling and queues) will incur.
In case the overhead is large, you might want to create a shared memory area to store the results in. You would need approximately 1.5 GiB to store 100000 results of 2048 8-byte floats. That is a sizeable amount of memory, but not impractical for current machines.
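A minimal sketch of the shared-memory idea, assuming a fork-based start method so the child processes inherit the block (getvector is the question's function; the shapes follow the numbers above):
import multiprocessing
import numpy as np

N, DIM = 100000, 2048
# One flat, process-shared block of float64 (~1.5 GiB); results are written in
# place instead of being sent back over IPC.
shared = multiprocessing.Array('d', N * DIM, lock=False)

def worker(start, stop):
    # Each worker fills its own slice of the shared block.
    view = np.frombuffer(shared, dtype=np.float64).reshape(N, DIM)
    for i in range(start, stop):
        view[i] = getvector("Images/" + str(i) + ".jpg").reshape(2048)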
With 100000 images, 4 cores and each image taking around one second, your program's running time would be on the order of 7 hours.
Your most important optimization task would therefore be to look into reducing the runtime of the getvector function. For example, would it work just as well if you reduced the size of the images by half? Assuming the runtime scales linearly with the number of pixels, that should cut the runtime to about 0.25 s per image.
I need to write a functional test that asserts that one run is significantly faster than another.
Here is the code I have written so far:
def test_run5(self):
    cmd_line = ["python", self.__right_def_file_only_files]

    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime1 = end - start

    start = time.clock()
    with self.assertRaises(SystemExit):
        ClassName().run(cmd_line)
    end = time.clock()
    runtime2 = end - start

    self.assertTrue(runtime2 < runtime1 * 1.4)
It works, but I don't like this approach because the 1.4 factor was chosen experimentally from my specific example of execution.
How would you test that the second execution is always faster than the first?
EDIT
I didn't think it would be necessary to explain this, but in the context of my program it is not up to me to say what factor is significant for an unknown execution.
The whole program is a kind of Make, and it is the pipeline definition file that defines what the "significant difference of speed" is, not me:
If the definition file contains a lot of rules that are very fast, the difference in execution time between two consecutive executions will be very small, let's say 5% faster, but still significant;
else, if the definition file contains few but very long rules, the difference will be big, let's say 90% faster, so a difference of 5% would not be significant at all.
I found an equation, the Michaelis-Menten kinetics equation, which fits my needs. Here is the function which should do the trick:
def get_best_factor(full_exec_time, rule_count, maximum_ratio=1):
    average_rule_time = full_exec_time / rule_count
    return 1 + (maximum_ratio * average_rule_time / (1.5 + average_rule_time))
The full_exec_time parameter is runtime1, which is the maximum execution time for a given pipeline definition file.
rule_count is the number of rules in the given pipeline definition file.
maximum_ratio means that the second execution will be, at most, 100% faster than the first (impossible in practice).
The variable parameter of the Michaelis-Menten kinetics equation is the average rule execution time. I have arbitrarily chosen 1.5 seconds as the average rule execution time at which the second execution should be maximum_ratio / 2 faster. It is the one parameter that depends on your use of this equation.
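Presumably the returned factor then replaces the hard-coded 1.4 in the test; one way it might be used (this is only an assumed sketch, and rule_count would have to be read from the pipeline definition file under test):
factor = get_best_factor(runtime1, rule_count)
# Assert that the second run is at least 'factor' times faster than the first.
self.assertTrue(runtime2 * factor < runtime1)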
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time

def tester(num):
    return np.cos(num)

if __name__ == '__main__':
    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print('The time taken with multiple processes:', timetaken)
    print('The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this.
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Your job seems to be the computation of a single cos value. This is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1000000 cos values each and you should see them running in parallel.
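A minimal sketch of that suggestion (the chunk boundaries are arbitrary): each task now covers a million values, so the work per IPC round trip is large enough for the parallelism to show.
import numpy as np
from multiprocessing import Pool

def cos_chunk(bounds):
    lo, hi = bounds
    return np.cos(np.arange(lo, hi))   # one big vectorised computation per task

if __name__ == '__main__':
    chunks = [(i, i + 1000000) for i in range(0, 5000000, 1000000)]
    with Pool() as pool:
        results = pool.map(cos_chunk, chunks)   # 5 tasks running in parallel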
First, you wrote:
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the same time is displayed twice. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get :
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed loop is slower than doing the for loop. Why?
The multiprocessing module uses multiple processes rather than threads, so the Global Interpreter Lock is not the limit here; the catch is that processes do not share memory. So when you launch a Pool, the useful variables have to be copied to each process, the calculation is done, and then the results are sent back. This costs a little time for every task and makes you less efficient.
But this only hurts because you are doing a very small computation: multiprocessing is only useful for larger calculations, where the memory copying and result retrieval are cheaper (in time) than the calculation itself.
I tried with the following tester, which is much more expensive, on 2000 runs:
def expenser_tester(num):
    A = np.random.rand(10*num)        # creation of a random 1D array
    for k in range(0, len(A)-1):      # some useless but costly operation
        A[k+1] = A[k]*A[k+1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that on an expensive calculation it is more efficient to use multiprocessing, even if you don't always get what you might expect (I could have had a 4x speedup, but I only got 2x).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be memory-expensive.
If you really want to improve a small calculation like your example, make it bigger by grouping and sending a list of values to the pool instead of one value per call.
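If you would rather keep the one-value-per-call tester, note that Pool.map also accepts a chunksize argument that ships tasks to the workers in batches, which reduces the IPC cost (each call still computes only one cos, though):
# Reusing 'pool' and 'tester' from the question; the batch size is arbitrary.
pool_outputs = pool.map(tester, range(5000000), chunksize=100000)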
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed those up.
If the problem is CPU-bound, then you should see the expected speed-up (if the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes) it's easier to end up with a memory-bound problem.