Multiprocessing.Pool runs slow - python

I'm trying to take advantage of multiprocessing in Python, so I ran some tests and found that the multiprocessing code runs much slower than the plain version. What am I doing wrong?
Here is the test script:
import numpy as np
from datetime import datetime
from multiprocessing import Pool
def some_func(argv):
    x = argv[0]
    y = argv[1]
    return np.sum(x * y)
def other_func(argv):
    x = argv[0]
    y = argv[1]
    f1 = np.fft.rfft(x)
    f2 = np.fft.rfft(y)
    CC = np.fft.irfft(f1 * np.conj(f2))
    return CC
N = 20000
X = np.random.randint(0, 10, size=(N, N))
Y = np.random.randint(0, 10, size=(N, N))
output_check = np.zeros(N)
D1 = datetime.now()
for k in range(len(X)):
    output_check[k] = np.max(some_func((X[k], Y[k])))
print('Plain: ', datetime.now()-D1)
output = np.zeros(N)
D1 = datetime.now()
with Pool(10) as pool: # CPUs
    for ind, res in enumerate(pool.imap(some_func, zip(X, Y), chunksize=1)):
        output[ind] = np.max(res)
    pool.close()
    pool.join()
print('Pool: ', datetime.now()-D1)
Output:
Plain: 0:00:00.904062
Pool: 0:00:15.386251
Why is there such a big difference? What consumes the time?
I have 80 CPUs available and have tried different pool sizes and chunksize values.
The actual function is more complex (like other_func); with it, the plain and parallel versions take almost the same time, but there is still no speed-up.
The input is a large 3D NumPy array, and I need the pairwise convolution of its elements.
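One likely contributor, offered as a hedged sketch rather than a diagnosis: Pool pickles every task's arguments and result, so with chunksize=1 each pair of rows is serialized and shipped to a worker separately, and that traffic can dwarf a cheap np.sum(x * y). The sketch below sends one block of rows per task and computes all row sums of a block in a single vectorized call. The names row_blocks, block_sums and pooled_row_sums are hypothetical and do not come from the question.

```python
import numpy as np
from multiprocessing import Pool


def row_blocks(n_rows, n_blocks):
    """Split range(n_rows) into at most n_blocks contiguous, non-empty slices."""
    bounds = np.linspace(0, n_rows, n_blocks + 1, dtype=int)
    return [slice(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]


def block_sums(args):
    """Row-wise sum of x * y for one block of rows, without a Python loop."""
    x, y = args
    return np.einsum('ij,ij->i', x, y)


def pooled_row_sums(X, Y, n_workers=4):
    """One task per block, so serialization is paid once per block, not per row."""
    blocks = row_blocks(len(X), n_workers)
    with Pool(n_workers) as pool:
        parts = pool.map(block_sums, [(X[s], Y[s]) for s in blocks])
    return np.concatenate(parts)
```

Called under an `if __name__ == '__main__':` guard, pooled_row_sums(X, Y) should match (X * Y).sum(axis=1). For work this cheap the single vectorized call may well beat any pool, so the sketch is mainly useful as a template for a heavier function such as other_func.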

Related

numpy: Bottleneck of multiprocessing when memory is not shared and no file IO

The following Python code measures the speedup as the number of processes increases. Each process just multiplies a random matrix by itself repeatedly; the matrix size is also varied, and the elapsed time is measured for each combination.
Note that the processes do not share any objects and are completely independent. I therefore expected the speedup curve over the number of processes to be almost the same for every matrix size. The plotted results (see below) show otherwise: when the matrix size becomes large (80, 160), performance hardly improves as the number of processes increases. Note: the figure legend indicates the matrix sizes.
Could you explain why performance does not improve when the matrix size is large?
For your information, here is the spec of my CPU:
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
Product Family: AMD Ryzen™ Processors
Product Line: AMD Ryzen™ 9 Desktop Processors
# of CPU Cores: 12
# of Threads: 24
Max. Boost Clock: Up to 4.6GHz
Base Clock: 3.8GHz
L1 Cache: 768KB
L2 Cache: 6MB
L3 Cache: 64MB
main script
import numpy as np
import pickle
from dataclasses import dataclass
import time
import multiprocessing
import os
import subprocess
def split_number(n_total, n_split):
    return [n_total // n_split + (1 if x < n_total % n_split else 0) for x in range(n_split)]
def task(args):
    n_iter, idx, matrix_size = args
    #cores = "{},{}".format(2 * idx, 2 * idx+1)
    #os.system("taskset -p -c {} {}".format(cores, os.getpid()))
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)
def measure_time(n_process: int, matrix_size: int) -> float:
    n_total = 100
    assigne_list = split_number(n_total, n_process)
    pool = multiprocessing.Pool(n_process)
    ts = time.time()
    pool.map(task, zip(assigne_list, range(n_process), [matrix_size] * n_process))
    elapsed = time.time() - ts
    return elapsed
if __name__ == "__main__":
    n_experiment_sample = 5
    n_logical = os.cpu_count()
    n_physical = int(0.5 * n_logical)
    result = {}
    for mat_size in [5, 10, 20, 40, 80, 160]:
        subresult = {}
        result[mat_size] = subresult
        for n_process in range(1, n_physical + 1):
            elapsed = np.mean([measure_time(n_process, mat_size) for _ in range(n_experiment_sample)])
            subresult[n_process] = elapsed
            print("{}, {}, {}".format(mat_size, n_process, elapsed))
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
plot script
import numpy as np
import matplotlib.pyplot as plt
import pickle
with open("result.pkl", "rb") as f:
    result = pickle.load(f)
fig, ax = plt.subplots()
for matrix_size in result.keys():
    subresult = result[matrix_size]
    n_process_list = list(subresult.keys())
    elapsed_time_list = np.array(list(subresult.values()))
    speedups = elapsed_time_list[0] / elapsed_time_list
    ax.plot(n_process_list, speedups, label=matrix_size)
ax.set_xlabel("number of process")
ax.set_ylabel("speed up compared to single process")
ax.legend(loc="upper left", borderaxespad=0, fontsize=10, framealpha=1.0)
plt.show()
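One thing that may be worth ruling out, stated as an assumption and not as the confirmed cause: the BLAS library behind A.dot(A) can itself start several threads once matrices are large enough, in which case a single process would already occupy many cores and adding processes would gain little. The sketch below caps BLAS at one thread per process. Which of the three environment variables takes effect depends on the BLAS that NumPy was built against, and repeated_product is a hypothetical stand-in for task().

```python
import os

# Hypothetical cap: one BLAS thread per process. Set these before NumPy is
# first imported, since BLAS libraries typically read them when they load.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the environment is prepared


def repeated_product(matrix_size, n_products=3, seed=0):
    """Same kind of work as task(): repeated A.dot(A) on a random matrix."""
    a = np.random.default_rng(seed).standard_normal((matrix_size, matrix_size))
    for _ in range(n_products):
        a = a.dot(a)
    return a
```

If the speedup curves for sizes 80 and 160 change once the cap is in place, nested threading was at least part of the picture; if they do not, the cause lies elsewhere.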

Memory usage increases when building a large NumPy array

In the following example, I initialize a 500x2000x2000 three-dimensional NumPy array named a. At each iteration, a random two-dimensional array r is inserted into array a. This example represents a larger piece of code where the r array would be created from various calculations during each iteration of the for-loop. Consequently, each slice in the z dimension of the a array is calculated at each iteration.
# ex1_basic.py
import numpy as np
import time
def main():
    tic = time.perf_counter()
    z = 500 # depth
    x = 2000 # rows
    y = 2000 # columns
    a = np.zeros((z, x, y))
    for i in range(z):
        r = np.random.rand(x, y)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')
if __name__ == '__main__':
    main()
This example's memory is profiled using the memory-profiler package. The steps for running the memory-profiler for this example are:
# Run the memory profiler
$ mprof run ex1_basic.py
# Plot the memory profiler results
$ mprof plot
The memory usage is plotted below. The memory usage increases over time as the values are added to the array.
I profiled another example where the data type for the array is defined as np.float32. See below for the example code and memory usage plot. This decreased the overall memory use but the memory still increases with each iteration.
import numpy as np
import time
def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500 # depth
    x = 2000 # rows
    y = 2000 # columns
    a = np.zeros((z, x, y), dtype=np.float32)
    for i in range(z):
        r = rng.standard_normal((x, y), dtype=np.float32)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')
if __name__ == '__main__':
    main()
Since I initialized array a using np.zeros, I would expect the memory usage to remain constant based on the block of memory initialized for that array. But it appears that memory usage increases as values are inserted into array a.
So I have two questions related to these examples:
Why does the memory usage increase with time?
How do I create and store array a on disk and only have slice a[i] and the r array in memory at each iteration? Basically, how would I run these examples if the a array did not fit in memory (RAM)?
Update
I ran an example using numpy.memmap but there is no improvement in memory usage. It seems like memmap is still keeping the entire array in memory.
import numpy as np
import time
def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500
    x = 2000
    y = 2000
    a = np.memmap('file.dat', dtype=np.float32, mode='w+', shape=(z, x, y))
    for i in range(z):
        r = rng.standard_normal((x, y), dtype=np.float32)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')
if __name__ == '__main__':
    main()
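For comparison, here is a minimal NumPy-only sketch of the "write one slice at a time, read one slice back" pattern. It uses numpy.lib.format.open_memmap, which produces a regular .npy file, in place of the raw np.memmap above. The helper names write_slices and read_slice and the tiny shape are illustrative assumptions. Pages of a mapped file can still be counted as resident until the operating system evicts them, so this is not guaranteed to lower the number a profiler reports.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def write_slices(path, shape, seed=0):
    """Fill a .npy file on disk one z-slice at a time."""
    rng = np.random.default_rng(seed)
    a = open_memmap(path, mode='w+', dtype=np.float32, shape=shape)
    for i in range(shape[0]):
        a[i] = rng.standard_normal(shape[1:], dtype=np.float32)
    a.flush()  # write dirty pages back to the file
    del a      # drop the reference to the mapping


def read_slice(path, i):
    """Map the file read-only and copy out a single slice."""
    a = np.load(path, mmap_mode='r')
    return np.array(a[i])


path = os.path.join(tempfile.mkdtemp(), 'a.npy')
write_slices(path, (5, 20, 20))   # tiny shape for illustration
print(read_slice(path, 3).shape)  # (20, 20)
```

Only the slice returned by read_slice is copied into an ordinary in-memory array; the rest of the file stays on disk until it is indexed.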
Using the h5py package, I can create an hdf5 file that contains a dataset that represents the a array. The dset variable is similar to the a variable discussed in the question. This allows the array to reside on disk, not in memory. The generated hdf5 file is 8 GB on disk which is the size of the array containing np.float32 values. The elapsed time for this approach is similar to the examples discussed in the question; therefore, writing to the hdf5 file seems to have a negligible performance impact.
import numpy as np
import h5py
import time
def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500 # depth
    x = 2000 # rows
    y = 2000 # columns
    # the context manager closes the file even if an iteration fails
    with h5py.File('file.hdf5', 'w') as f:
        dset = f.create_dataset('data', shape=(z, x, y), dtype=np.float32)
        for i in range(z):
            r = rng.standard_normal((x, y), dtype=np.float32)
            dset[i, :, :] = r
        toc = time.perf_counter()
    print('elapsed time =', round(toc - tic, 2), 'sec')
    s = np.float32().nbytes * (z * x * y) / 1e9 # where 1 GB = 1000 MB
    print('calculated storage =', s, 'GB')
if __name__ == '__main__':
    main()
Output from running this example on a MacBook Pro with 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM:
elapsed time = 22.97 sec
calculated storage = 8.0 GB
Running the memory profiler for this example gives the plot shown below. The peak memory usage is about 100 MiB which is drastically lower than the examples demonstrated in the question.

Fast way to generate large-scale random ndarray

I want to generate a random matrix of shape (1e7, 800). But I find numpy.random.rand() becomes very slow at this scale. Is there a quicker way?
A simple way to do that is to write a multi-threaded implementation using Numba:
import numba as nb
import numpy as np
@nb.njit('float64[:,:](int_, int_)', parallel=True)
def genRandom(n, m):
    res = np.empty((n, m))
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res
This is 6.4 times faster than np.random.rand() on my 6-core machine.
Note that using 32-bit floats may speed up the computation a bit, although the precision will be lower.
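If Numba is not an option, a similar effect can be sketched with NumPy alone: spawn one independent Generator per thread and let each thread fill its own block of rows through the out= argument. This is a different technique from the Numba loop above, and the name parallel_random, the default thread count and the seed are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def parallel_random(n, m, n_threads=4, seed=0):
    """Fill an (n, m) float64 array with uniform [0, 1) values using threads."""
    res = np.empty((n, m), dtype=np.float64)
    # One statistically independent generator per thread.
    children = np.random.SeedSequence(seed).spawn(n_threads)
    rngs = [np.random.default_rng(s) for s in children]
    bounds = np.linspace(0, n, n_threads + 1, dtype=int)

    def fill(k):
        block = res[bounds[k]:bounds[k + 1]]  # contiguous block of rows
        if block.size:
            rngs[k].random(out=block)         # writes in place, no temporary

    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(fill, range(n_threads)))
    return res


x = parallel_random(1000, 8)
print(x.shape)  # (1000, 8)
```

Because every thread owns its generator and its rows, the result is reproducible for a fixed seed and thread count. How much faster this runs than a single np.random.rand call depends on the machine and was not measured here.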
Numba is a good option. Another option that might work well is dask.array, which creates lazy blocks of NumPy arrays and performs parallel computations on those blocks. On my machine I get a factor of 2 improvement in speed (for a 1e6 x 1e3 matrix, since I don't have enough memory for the full size).
rows = 10**6
cols = 10**3
import dask.array as da
x = da.random.random(size=(rows, cols)).compute() # takes about 5 seconds
# import numpy as np
# x = np.random.rand(rows, cols) # takes about 10 seconds
Note that .compute at the end is only there to bring the computed array into memory. In general you can keep exploiting dask's parallel computations to get much faster calculations (which can also scale beyond a single machine); see the docs.
An attempt to compare the answers given so far:
I wrote a script that combines the answers already given (by SultanOrazbayev and Jérôme Richard). It contains one function each for the Numba, Dask and NumPy approaches and measures the time spent for arrays of different sizes.
The code
import dask.array as da
import matplotlib.pyplot as plt
import numba as nb
import timeit
import numpy as np
@nb.njit('float64[:,:](int_, int_)', parallel=True)
def nmb(n, m):
    res = np.empty((n, m))
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.rand()
    return res
def nmp(n, m):
    return np.random.random((n, m))
def dask(n, m):
    return da.random.random(size=(n, m)).compute()
if __name__ == '__main__':
    data = []
    for i in range(1, 16):
        dmm = 2 ** i
        s_nmb = timeit.default_timer()
        nmb(dmm, dmm)
        e_nmb = timeit.default_timer()
        s_nmp = timeit.default_timer()
        nmp(dmm, dmm)
        e_nmp = timeit.default_timer()
        s_dask = timeit.default_timer()
        dask(dmm, dmm)
        e_dask = timeit.default_timer()
        data.append([
            dmm,
            e_nmb - s_nmb,
            e_nmp - s_nmp,
            e_dask - s_dask
        ])
    data = np.array(data)
    plt.plot(data[:, 0], data[:, 1], "-r", label="Numba")
    plt.plot(data[:, 0], data[:, 2], "-g", label="Numpy")
    plt.plot(data[:, 0], data[:, 3], "-b", label="Dask")
    plt.xlabel("Number of elements on each axis")
    plt.ylabel("Time spent (s)")
    plt.legend()
    plt.show()
The result

multiprocessing with Xarray and Numpy array

So I am trying to implement a solution that was already described here, but I am changing it a bit. Instead of just trying to change the array with operations, I am trying to read from a NetCDF file using xarray and then write to a shared numpy array with the multiprocessing module.
I feel as though I am getting pretty close, but something is going wrong. I have pasted a reproducible, easy copy/paste example below. As you can see, when I run the processes, they can all read the files that I created, but they do not correctly update the shared numpy array that I am trying to write to. Any help would be appreciated.
Code
import ctypes
import logging
import multiprocessing as mp
import xarray as xr
from contextlib import closing
import numpy as np
info = mp.get_logger().info
def main():
    data = np.arange(10)
    for i in range(4):
        ds = xr.Dataset({'x': data})
        ds.to_netcdf('test_{}.nc'.format(i))
        ds.close()
    logger = mp.log_to_stderr()
    logger.setLevel(logging.INFO)
    # create shared array
    N, M = 4, 10
    shared_arr = mp.Array(ctypes.c_float, N * M)
    arr = tonumpyarray(shared_arr, dtype=np.float32)
    arr = arr.reshape((N, M))
    # Fill with zeros
    arr[:, :] = np.zeros((N, M))
    arr_orig = arr.copy()
    files = ['test_0.nc', 'test_1.nc', 'test_2.nc', 'test_3.nc']
    parameter_tuples = [
        (files[0], 0),
        (files[1], 1),
        (files[2], 2),
        (files[3], 3)
    ]
    # write to arr from different processes
    with closing(mp.Pool(initializer=init, initargs=(shared_arr,))) as p:
        # many processes access different slices of the same array
        p.map_async(g, parameter_tuples)
    p.join()
    print(arr_orig)
    print(tonumpyarray(shared_arr, np.float32).reshape(N, M))
def init(shared_arr_):
    global shared_arr
    shared_arr = shared_arr_ # must be inherited, not passed as an argument
def tonumpyarray(mp_arr, dtype=np.float64):
    return np.frombuffer(mp_arr.get_obj(), dtype)
def g(params):
    """no synchronization."""
    print("Current File Name: ", params[0])
    tmp_dataset = xr.open_dataset(params[0])
    print(tmp_dataset["x"].data[:])
    arr = tonumpyarray(shared_arr)
    arr[params[1], :] = tmp_dataset["x"].data[:]
    tmp_dataset.close()
if __name__ == '__main__':
    mp.freeze_support()
    main()
What's wrong?
1. You forgot to reshape the array after calling tonumpyarray inside g.
2. You used the wrong dtype in tonumpyarray (the float64 default instead of float32).
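Both points can be checked in a single process, without a pool. The following is a small sketch with the same N and M as in the question; the variable names wrong, right, view and fresh are illustrative.

```python
import ctypes
import multiprocessing as mp

import numpy as np

N, M = 4, 10
shared_arr = mp.Array(ctypes.c_float, N * M)  # 40 four-byte floats

wrong = np.frombuffer(shared_arr.get_obj(), dtype=np.float64)
right = np.frombuffer(shared_arr.get_obj(), dtype=np.float32)
print(wrong.shape)  # (20,): 160 bytes read as 8-byte floats
print(right.shape)  # (40,): matches N * M

# Without reshape the view is 1-D, so arr[row, :] raises IndexError.
view = right.reshape(N, M)
view[2, :] = np.arange(M)

# The write went into the shared buffer, so a fresh view sees it too.
fresh = np.frombuffer(shared_arr.get_obj(), dtype=np.float32).reshape(N, M)
print(fresh[2])
```

Since map_async is called without fetching its result, an exception raised inside g is never re-raised in the parent, which may be why the original script appears to run cleanly while leaving the array unchanged.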
Code
import ctypes
import logging
import multiprocessing as mp
import xarray as xr
from contextlib import closing
import numpy as np
info = mp.get_logger().info
def main():
    data = np.arange(10)
    for i in range(4):
        ds = xr.Dataset({'x': data})
        ds.to_netcdf('test_{}.nc'.format(i))
        ds.close()
    logger = mp.log_to_stderr()
    logger.setLevel(logging.INFO)
    # create shared array
    N, M = 4, 10
    shared_arr = mp.Array(ctypes.c_float, N * M)
    arr = tonumpyarray(shared_arr, dtype=np.float32)
    arr = arr.reshape((N, M))
    # Fill with zeros
    arr[:, :] = np.zeros((N, M))
    arr_orig = arr.copy()
    files = ['test_0.nc', 'test_1.nc', 'test_2.nc', 'test_3.nc']
    parameter_tuples = [
        (files[0], 0),
        (files[1], 1),
        (files[2], 2),
        (files[3], 3)
    ]
    # write to arr from different processes
    with closing(mp.Pool(initializer=init, initargs=(shared_arr, N, M))) as p:
        # many processes access different slices of the same array
        p.map_async(g, parameter_tuples)
    p.join()
    print(arr_orig)
    print(tonumpyarray(shared_arr, np.float32).reshape(N, M))
def init(shared_arr_, N_, M_): # add shape
    global shared_arr
    global N, M
    shared_arr = shared_arr_ # must be inherited, not passed as an argument
    N = N_
    M = M_
def tonumpyarray(mp_arr, dtype=np.float32): # change type
    return np.frombuffer(mp_arr.get_obj(), dtype)
def g(params):
    """no synchronization."""
    print("Current File Name: ", params[0])
    tmp_dataset = xr.open_dataset(params[0])
    print(tmp_dataset["x"].data[:])
    arr = tonumpyarray(shared_arr).reshape(N, M) # reshape
    arr[params[1], :] = tmp_dataset["x"].data[:]
    tmp_dataset.close()
if __name__ == '__main__':
    mp.freeze_support()
    main()

Issues with using numpy.fromiter & numpy.array in concurrent.futures.ProcessPoolExecutor map() and submit() methods

Background:
This blog reported speed benefits from using numpy.fromiter() over numpy.array(). Using the provided script as a base, I wanted to see the benefits of numpy.fromiter() when executed in the map() and submit() methods of Python's concurrent.futures.ProcessPoolExecutor class.
Below are my findings for a 2 seconds run:
It is clear that numpy.fromiter() is faster than numpy.array() when the array size is <256 in general.
However, the performance of numpy.fromiter() and numpy.array() can be significantly poorer than in a serial run, and is not consistent, when executed by the map() and submit() methods of Python's concurrent.futures.ProcessPoolExecutor class.
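As a cross-check on findings like these, the two constructors can also be timed with a fixed number of calls instead of a fixed wall-clock window. The sketch below uses timeit.repeat and keeps the best run, which is a different measurement method from the scripts in this question. The helper name time_constructors and its default arguments are assumptions.

```python
import timeit

import numpy as np


def time_constructors(size_array, number=1000, repeat=5):
    """Best per-call time in seconds for np.array and np.fromiter."""
    pyarray = [float(x) for x in range(size_array)]
    t_array = min(timeit.repeat(
        lambda: np.array(pyarray, dtype=np.float32),
        number=number, repeat=repeat)) / number
    t_fromiter = min(timeit.repeat(
        lambda: np.fromiter(pyarray, dtype=np.float32, count=size_array),
        number=number, repeat=repeat)) / number
    return t_array, t_fromiter


for size in [2, 64, 1024]:
    t_array, t_fromiter = time_constructors(size)
    print(f"size={size}: np.array {t_array:.2e} s, np.fromiter {t_fromiter:.2e} s")
```

Taking the minimum over several repeats tends to reduce noise from other activity on the machine, though it is still a wall-clock measurement and will vary between runs.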
Questions:
Can the inconsistent and poorer performance of numpy.fromiter() and numpy.array() in the map() and submit() methods of Python's concurrent.futures.ProcessPoolExecutor class be avoided? How can I improve my scripts?
The python scripts that I had used for this benchmarking are given below.
map():
#!/usr/bin/env python3.5
import concurrent.futures
from itertools import chain
import time
import numpy as np
import pygal
from os import path
list_sizes = [2**x for x in range(1, 11)]
seconds = 2
def test(size_array):
    pyarray = [float(x) for x in range(size_array)]
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.fromiter(pyarray, dtype=np.float32, count=size_array)
        iterations += 1
    fromiter_count = iterations
    # array
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.array(pyarray, dtype=np.float32)
        iterations += 1
    array_count = iterations
    #return array_count, fromiter_count
    return size_array, array_count, fromiter_count
begin = time.time()
results = {}
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    data = list(chain.from_iterable(executor.map(test, list_sizes)))
print('data = ', data)
for i in range( 0, len(data), 3 ):
    res = tuple(data[i+1:i+3])
    size_array = data[i]
    results[size_array] = res
    print("Result for size {} in {} seconds: {}".format(size_array,seconds,res))
out_folder = path.dirname(path.realpath(__file__))
print("Create diagrams in {}".format(out_folder))
chart = pygal.Line()
chart.title = "Performance in {} seconds".format(seconds)
chart.x_title = "Array size"
chart.y_title = "Iterations"
array_result = []
fromiter_result = []
x_axis = sorted(results.keys())
print(x_axis)
chart.x_labels = x_axis
chart.add('np.array', [results[x][0] for x in x_axis])
chart.add('np.fromiter', [results[x][1] for x in x_axis])
chart.render_to_png(path.join(out_folder, 'result_{}_concurrent_futures_map.png'.format(seconds)))
end = time.time()
compute_time = end - begin
print("Program Time = ", compute_time)
submit():
#!/usr/bin/env python3.5
import concurrent.futures
from itertools import chain
import time
import numpy as np
import pygal
from os import path
list_sizes = [2**x for x in range(1, 11)]
seconds = 2
def test(size_array):
    pyarray = [float(x) for x in range(size_array)]
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.fromiter(pyarray, dtype=np.float32, count=size_array)
        iterations += 1
    fromiter_count = iterations
    # array
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.array(pyarray, dtype=np.float32)
        iterations += 1
    array_count = iterations
    return size_array, array_count, fromiter_count
begin = time.time()
results = {}
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    future_to_size_array = {executor.submit(test, size_array):size_array
                            for size_array in list_sizes}
    data = list(chain.from_iterable(
        f.result() for f in concurrent.futures.as_completed(future_to_size_array)))
print('data = ', data)
for i in range( 0, len(data), 3 ):
    res = tuple(data[i+1:i+3])
    size_array = data[i]
    results[size_array] = res
    print("Result for size {} in {} seconds: {}".format(size_array,seconds,res))
out_folder = path.dirname(path.realpath(__file__))
print("Create diagrams in {}".format(out_folder))
chart = pygal.Line()
chart.title = "Performance in {} seconds".format(seconds)
chart.x_title = "Array size"
chart.y_title = "Iterations"
x_axis = sorted(results.keys())
print(x_axis)
chart.x_labels = x_axis
chart.add('np.array', [results[x][0] for x in x_axis])
chart.add('np.fromiter', [results[x][1] for x in x_axis])
chart.render_to_png(path.join(out_folder, 'result_{}_concurrent_futures_submitv2.png'.format(seconds)))
end = time.time()
compute_time = end - begin
print("Program Time = ", compute_time)
Serial (with minor changes to the original code):
#!/usr/bin/env python3.5
import time
import numpy as np
import pygal
from os import path
list_sizes = [2**x for x in range(1, 11)]
seconds = 2
def test(size_array):
    pyarray = [float(x) for x in range(size_array)]
    # fromiter
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.fromiter(pyarray, dtype=np.float32, count=size_array)
        iterations += 1
    fromiter_count = iterations
    # array
    start = time.time()
    iterations = 0
    while time.time() - start <= seconds:
        np.array(pyarray, dtype=np.float32)
        iterations += 1
    array_count = iterations
    return array_count, fromiter_count
begin = time.time()
results = {}
for size_array in list_sizes:
    res = test(size_array)
    results[size_array] = res
    print("Result for size {} in {} seconds: {}".format(size_array,seconds,res))
out_folder = path.dirname(path.realpath(__file__))
print("Create diagrams in {}".format(out_folder))
chart = pygal.Line()
chart.title = "Performance in {} seconds".format(seconds)
chart.x_title = "Array size"
chart.y_title = "Iterations"
x_axis = sorted(results.keys())
print(x_axis)
chart.x_labels = x_axis
chart.add('np.array', [results[x][0] for x in x_axis])
chart.add('np.fromiter', [results[x][1] for x in x_axis])
#chart.add('np.array', [x[0] for x in results.values()])
#chart.add('np.fromiter', [x[1] for x in results.values()])
chart.render_to_png(path.join(out_folder, 'result_{}_serial.png'.format(seconds)))
end = time.time()
compute_time = end - begin
print("Program Time = ", compute_time)
The inconsistent and poor performance of numpy.fromiter() and numpy.array() that I encountered earlier appears to be associated with the number of CPUs used by concurrent.futures.ProcessPoolExecutor. I had earlier used 6 CPUs. The diagrams below show the corresponding performance of numpy.fromiter() and numpy.array() when 2, 4, 6, and 8 CPUs were used. They show that there is an optimum number of CPUs. Using too many CPUs (i.e. >4) can be detrimental for small array sizes (<512 elements). For example, >4 CPUs can cause slower performance (by a factor of 1/2) and even inconsistent performance compared to a serial run.
