Numba @guvectorize returns garbage values - python

The code below is a test for calculating the distance between points in a periodic system.
import itertools
import time
import numpy as np
import numba
from numba import njit

@njit(cache=True)
def get_dr(i=np.array([]), j=np.array([]), cellsize=np.array([])):
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    return np.linalg.norm(k)

@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64)"],
                   "(m),(m),(m)->()", nopython=True, cache=True)
def get_dr_vec(i, j, cellsize, dr):
    dr = 0.0
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    dr = np.sqrt(np.square(k[0]) + np.square(k[1]) + np.square(k[2]))

N, dim = 50, 3  # 50 particles in 3D
vec = np.random.random((N, dim))
cellsize = np.array([26.4, 26.4, 70.0])

rList = []; rList2 = []
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
    rList.append(get_dr(pI, pJ, cellsize))
end = time.perf_counter()
print("Time {:.3g}s".format(end-start))

newvec1 = []; newvec2 = []
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
    newvec1.append(pI)
    newvec2.append(pJ)
cellsizeVec = np.full(shape=np.shape(newvec1), fill_value=cellsize, dtype=float)
rList2 = get_dr_vec(np.array(newvec1), np.array(newvec2), cellsizeVec)
end = time.perf_counter()
print("Time {:.3g}s".format(end-start))
print(rList2)
exit()
Compared to get_dr(), which returns the correct results, get_dr_vec() returns garbage and NaN values. Inside the function, the correct value for dr is calculated, but the returned array (of the correct shape) contains garbage. Can someone suggest how to resolve this issue?

You made a small mistake in the guvectorized function. Guvectorize does not want you to rebind the output variable; the output array/scalar must be filled in instead. The code below should work:
@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64[:])"],
                   "(m),(m),(m)->()", nopython=True, cache=True)
def get_dr_vec(i, j, cellsize, dr):
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    # The mistake was on this line. You had "dr =", but it should be "dr[0] ="
    dr[0] = np.sqrt(np.square(k[0]) + np.square(k[1]) + np.square(k[2]))
The reason dr = does not work is that guvectorize already allocates the dr array before your function is called. dr = rebinds the name to a new array at a new location in memory, so when Numba looks at the original location, where it expects to find the result, it finds an unfilled buffer instead. dr[0] = does work, because that way we fill in the value at the original location in memory, where Numba expects it to be.
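A plain-NumPy illustration of that rebinding-versus-filling distinction (just a sketch, not Numba-specific):

import numpy as np

def rebind(out):
    out = np.array([1.0])   # rebinds the local name; the caller's buffer is untouched

def fill(out):
    out[0] = 1.0            # writes into the caller's buffer

buf = np.zeros(1)
rebind(buf); print(buf)     # [0.]
fill(buf);   print(buf)     # [1.]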
If it is still not 100% clear, I recommend looking through the Numba documentation on this topic.
The "garbage values" you were seeing were the contents of an output array that was never filled in, similar to what you would see if you called print(np.empty(10)).

Related

Compute and fill an array in parallel

As part of a signal processing task, I am doing some computation per frequency step.
I have a frequencies list of length 513.
I have a 3D numpy array A of shape (81, 81, 513), where 513 is the number of frequency points, so I have an 81x81 matrix per frequency.
I want to apply some modification to each of those matrices, to end up with a modified version of A, which I'll call B here; it will also be of shape (81, 81, 513).
For that, I start by pre-allocating B with:
B = np.zeros_like(A)
I then loop over my frequencies and call a dothing function like:
for index, frequency in enumerate(frequencies):
    B[:,:,index] = dothing(A[:,:,index])
The problem is that dothing takes a lot of time, and run sequentially over 513 frequency steps it seems endless.
So I wanted to parallelize it, but even after reading a lot of docs and watching a lot of videos, I get lost in all the libraries and potential solutions.
Computations at all individual frequencies can be done independently. But in the end I need to assign everything back to B in the right order.
Any idea on how to do that?
Thanks in advance
Antoine
Here I would use a shared array using shared_memory, as there's no need to protect write access if no two loop iterations ever use the same memory address. I eliminated the second array to shorten the example (only construct a single shared array), and I re-ordered the array shape to better preserve memory-aligned access.
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import numpy as np
import numpy.typing as npt
from typing import Any
from time import sleep

def dothing(arr: np.ndarray, t_func: Any) -> np.ndarray:
    sleep(.05)  # simulate hard work
    return arr * 2

def dodothing(args: tuple[int, Any]):
    global arr
    index = args[0]
    t_func = args[1]
    arr[index] = dothing(arr[index], t_func)  # write result back to self to avoid need for 2 shared arrays

def init(shm: SharedMemory, shape: tuple[int, ...], dtype: npt.DTypeLike):
    global arr
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

if __name__ == "__main__":
    _A = np.ones((513, 81, 81), np.float64)  # source data
    t_funcs = ["some transfer function"] * _A.shape[0]  # added example of passing some data + an index

    nbytes = _A.size * _A.itemsize
    dtype = _A.dtype
    shape = _A.shape
    shm = SharedMemory(create=True, size=nbytes)
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    A[:] = _A[:]  # copy contents into shared A

    with Pool(initializer=init, initargs=(shm, shape, dtype)) as pool:
        pool.map(dodothing, enumerate(t_funcs))  # enumerate returns tuple[int, Any] each loop

    print(A.sum()/_A.sum())  # prove we multiplied all elements by 2

    shm.close()
    shm.unlink()
multiprocessing.Pool is a bit funny sometimes in what can be a valid argument to a target function, so I tend to share things like Lock, Queue, shared_memory etc. via the pool's initialization function, which accepts arguments just like Process does.
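For example, the same initializer pattern works for a Lock, which cannot be passed through pool.map arguments directly (a minimal sketch, not part of the answer above):

from multiprocessing import Pool, Lock

lock = None

def init(l):
    global lock           # each worker stores the lock at start-up
    lock = l

def work(i):
    with lock:            # serialize access to a shared resource
        print("task", i)

if __name__ == "__main__":
    with Pool(initializer=init, initargs=(Lock(),)) as pool:
        pool.map(work, range(4))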

Multiprocessing pool.map does not return from function call

I have a short snippet in which the pool.map call hangs: it does not return in any reasonable amount of time. The idea is to have a couple of point lists and find a measure of the correspondence to another point list.
import numpy as np
import multiprocessing
import scipy.spatial as ss

def dist(args):  # returns the sum of distances of the nearest neighbors.
    kdSet, points = args
    dist, _ = kdSet.query(points)
    return np.sum(dist)

# just define some points to play with
points = np.array([[13.27, 25.49], [13.18, 25.39], [13.08, 25.39],
                   [12.99, 25.39], [12.89, 25.39], [12.80, 25.39],
                   [12.71, 25.39], [12.61, 25.30], [12.52, 25.30]])

pointList = [points + idx for idx in range(3)]  # creates a list of points

# create a list of KDTrees for fast lookup. Add some number to make them a little different from pointList
kdTreeList = [ss.KDTree(pointsLocal + idx/10) for idx, pointsLocal in enumerate(pointList)]

# this part works fine: serial execution
distances = list(map(dist, zip(kdTreeList, pointList)))
print(distances)

# this part hangs if called as a multiprocess; expected the same result as above
with multiprocessing.Pool() as pool:
    print('Calling pool.map')
    distancesMultiprocess = pool.map(dist, zip(kdTreeList, pointList))  # <- This call does not return, but uses the CPU 100%
    print(distancesMultiprocess)
Calling the function sequentially works fine, but when it is called through pool.map it neither raises an error nor returns anything; it just hangs there. The CPU load goes to 100%, so something is happening.
Removing the KDTree class for testing purposes did not change the result.
Since no error is thrown, I have no idea what's going wrong here. Any hint?
Thanks a lot to juanpa.arrivillaga
adding
if __name__ == "__main__":
did the trick.
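In other words, the fix is to wrap everything that uses the pool (and the data it depends on) in the main guard, so the worker processes that re-import the module do not re-execute it. A trimmed-down sketch of the snippet above with the guard applied:

import numpy as np
import multiprocessing
import scipy.spatial as ss

def dist(args):  # returns the sum of distances of the nearest neighbors.
    kdSet, points = args
    d, _ = kdSet.query(points)
    return np.sum(d)

if __name__ == "__main__":
    points = np.array([[13.27, 25.49], [13.18, 25.39], [13.08, 25.39]])
    pointList = [points + idx for idx in range(3)]
    kdTreeList = [ss.KDTree(p + idx/10) for idx, p in enumerate(pointList)]
    with multiprocessing.Pool() as pool:
        print(pool.map(dist, zip(kdTreeList, pointList)))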

Python for loop: Proper implementation of multiprocessing

The following for loop is part of an iterative simulation process and is the main bottleneck regarding computation time:
import numpy as np

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            cols_red_list = []
            rows_list = list(range(2500))  # row idx for diff where a negative element is known to appear
            diff = np.random.uniform(-1.323, 3.780, (2500, 300))  # np.random.uniform is just used as a toy example
            for row in rows_list:
                col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
                cols_red_list.append(col)
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Hence, I tried to parallelize it using the multiprocessing package, in the hope of reducing computation time:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

class Simulation(object):
    def __init__(self, n_int):
        self.n_int = n_int

    def loop(self):
        for itr in range(self.n_int):
            # some preceding code which updates rows_list and diff with every itr
            rows_list = list(range(2500))
            diff = np.random.uniform(-1, 1, (2500, 300))
            if __name__ == '__main__':
                num_of_workers = cpu_count()
                print('number of CPUs : ', num_of_workers)
                pool = Pool(num_of_workers)
                cols_red_list = pool.map(partial(crossings, diff=diff), rows_list)
                pool.close()
                print(len(cols_red_list))
            # some subsequent code which uses the cols_red_list data

sim1 = Simulation(n_int=10)
sim1.loop()
Unfortunately, the parallelization turns out to be much slower than the sequential piece of code.
Hence my question: did I use the multiprocessing package properly in this particular example? Are there alternative ways to parallelize the above-mentioned for loop?
Disclaimer: As you're trying to reduce the runtime of your code through parallelisation, this doesn't strictly answer your question but it might still be a good learning opportunity.
As a golden rule, before moving to multiprocessing to improve
performance (execution time), one should first optimise the
single-threaded case.
Your
rows_list = list(range(2500))
generates the numbers 0 to 2499 (that's the range) and stores them in memory (the list), which takes time for allocating the required memory and for the actual writes. You then use these predictable values only once each, by reading them back from memory (which also takes time), in a predictable order:
for row in rows_list:
This is particularly relevant to the runtime of your loop function as you do it repeatedly (for itr in range(n_int):).
Instead, consider generating the number only when you need it, without an intermediate store (which conceptually removes any need to access RAM):
for row in range(2500):
Secondly, on top of sharing the same issue (unnecessary accesses to memory), the following:
diff = np.random.uniform(-1, 1, (2500, 300))
# ...
col = next(idx for idx, val in enumerate(diff[row,:]) if val < 0)
seems to me to be optimisable at the level of math (or logic).
What you're trying to do is get a random variable (that col index) by defining it as "the first time I encounter a random variable in [-1;1] that is lower than 0". But notice that figuring out if a random variable with a uniform distribution over [-α;α] is negative, is the same as having a random variable over {0,1} (i.e. a bool).
Therefore, you're now working with bools instead of floats, and you don't even have to do the comparison (val < 0), as you already have a bool. This potentially makes the code much faster. Using the same idea as for rows_list, you can generate that bool only when you need it, testing it until it is True (or False; choose one, it obviously doesn't matter). By doing so, you only generate as many random bools as you need, no more and no less (BTW, what happens in your code if all 300 elements in a row are non-negative? ;) ):
import itertools
import random

for _ in range(n_int):
    cols_red_list = []
    for row in range(2500):
        col = next(i for i in itertools.count() if random.getrandbits(1))
        cols_red_list.append(col)
or, with list comprehension:
cols_red_list = [next(i for i in count() if getrandbits(1))
for _ in range(2500)]
I'm sure that, through proper statistical analysis, you can even express that col random variable as a non-uniform variable over [0;limit[, allowing you to compute it much faster.
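For what it's worth, that distribution is geometric: the column index of the first "success" with p = 0.5 can be drawn directly. A sketch with NumPy, ignoring the 300-column cap of the original diff array:

import numpy as np

rng = np.random.default_rng()
# geometric() returns the 1-based trial of the first success,
# so subtract 1 to get a 0-based column index.
cols_red_list = rng.geometric(p=0.5, size=2500) - 1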
Please test the performance of an "optimized" version of your single-threaded implementation first. If the runtime is still not acceptable, you should then look into multithreading.
multiprocessing uses system processes (not threads!) for parallelization, which require expensive IPC (inter-process communication) to share data.
This bites you in two spots:
diff = np.random.uniform(-1, 1, (2500, 300)) creates a large matrix which is expensive to pickle/copy to another process
rows_list = list(range(2500)) creates a smaller list, but the same applies here.
To avoid this expensive IPC, you have one and a half choices:
If you are on a POSIX-compliant system, initialize your variables at module level; that way each process gets a quick-and-dirty copy of the required data when it is forked. This is not scalable, as it requires POSIX and a somewhat odd architecture (you probably don't want to put everything at module level), and it doesn't support sharing changes to that data.
Use shared memory. This mostly supports only primitive data types, but mp.Array should cover your needs.
The second problem is that setting up a pool is expensive, as num_cpu processes need to be started. Your workload is small enough to be negligible compared to this overhead. A good practice is to create only one pool and reuse it.
Here is a quick-and-dirty example of the POSIX only solution:
import numpy as np
from multiprocessing import Pool, cpu_count
from functools import partial

n_int = 10
rows_list = np.array(range(2500))
diff = np.random.uniform(-1, 1, (2500, 300))

def crossings(row, diff):
    return next(idx for idx, val in enumerate(diff[row,:]) if val < 0)

def workload(_):
    cols_red_list = [crossings(row, diff) for row in rows_list]
    print(len(cols_red_list))

class Simulation(object):
    def loop(self):
        num_of_workers = cpu_count()
        with Pool(num_of_workers) as pool:
            pool.map(workload, range(10))
            pool.close()

sim1 = Simulation()
sim1.loop()
For me (and my two cores) this is roughly twice as fast as the sequential version.
Update with shared memory:
import numpy as np
from multiprocessing import Pool, cpu_count, Array
from functools import partial

n_int = 10
ROW_COUNT = 2500

### WORKER

diff = None
result = None

def init_worker(*args):
    global diff, result
    (diff, result) = args

def crossings(i):
    result[i] = next(idx for idx, val in enumerate(diff[i*300:(i+1)*300]) if val < 0)

### MAIN

class Simulation():
    def loop(self):
        num_of_workers = cpu_count()
        diff = Array('d', range(ROW_COUNT*300), lock=False)
        result = Array('i', ROW_COUNT, lock=False)

        # Shared memory needs to be passed when workers are spawned
        pool = Pool(num_of_workers, initializer=init_worker, initargs=(diff, result))

        for i in range(n_int):
            # SLOW, I assume you use a different source of values anyway.
            diff[:] = np.random.uniform(-1, 1, ROW_COUNT*300)
            pool.map(partial(crossings), range(ROW_COUNT))
            print(len(result))

        pool.close()

sim1 = Simulation()
sim1.loop()
A few notes:
Shared memory needs to be set up at worker creation, so it's global anyway.
This still isn't faster than the sequential version, but that's mainly because the random.uniform output needs to be copied entirely into shared memory. I assume those are just values for testing, and in reality you'd fill it differently anyway.
I only pass indices to the worker, and use them to read and write values to the shared memory.

How do I vectorize the following loop in Numpy?

"""Some simulations to predict the future portfolio value based on past distribution. x is
a numpy array that contains past returns. The interpolated_returns are the returns
generated from the cdf of the past returns to simulate future returns. The portfolio
starts with a value of 100. portfolio_value is filled up progressively as
the program goes through every loop. The value is multiplied by the returns in that
period and a dollar is removed."""
portfolio_final = []
for i in range(10000):
    portfolio_value = [100]
    rand_values = np.random.rand(600)
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))
I couldn't find a way to write this code using numpy. I was having a look at iterations using nditer but I was unable to move ahead with that.
I guess the easiest way to figure out how to vectorize your code would be to look at the equations that govern your evolution and see how your portfolio actually iterates, finding patterns that can be vectorized instead of trying to vectorize the code you already have. You would then notice that a cumulative product (cumprod) appears quite naturally in your iterations.
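To make that concrete, the recurrence being iterated is V_j = r_j V_{j-1} - 1 (multiply by the period return, withdraw one dollar). Unrolling it, with P_k the cumulative product of the returns, gives

V_n = P_n V_0 - \sum_{k=1}^{n} \frac{P_n}{P_k} = P_n \Big( V_0 - \sum_{k=1}^{n} \frac{1}{P_k} \Big), \qquad P_k = \prod_{i=1}^{k} r_i,

which is exactly what the vector version below computes, with rcp playing the role of P and np.cumsum(1.0/rcp) the role of the sum.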
Nevertheless, you can find the semi-vectorized code below. I included your code as well so that you can compare the results. I also included a simple loop version of your code, which is much easier to read and translates directly into mathematical equations. So if you share this code with somebody else I would definitely use the simple loop option. If you want some fancy-pants vectorizing you can use the vector version. In case you need to keep track of the individual steps, you can also add an array to the simple loop option and append the pv at every step.
Hope that helps.
Edit: I have not tested anything for speed. That's something you can easily do yourself with timeit.
import numpy as np
from scipy.special import erf

# Prepare simple return model - normally distributed with mu & sigma = 0.01
x = np.linspace(-10, 10, 100)
cdf_values = 0.5*(1+erf((x-0.01)/(0.01*np.sqrt(2))))

# Prepare setup such that every code snippet uses the same number of steps
# and the same random numbers
nSteps = 600
nIterations = 1
rnd = np.random.rand(nSteps)

# Your code - gives the (supposedly) correct results
portfolio_final = []
for i in range(nIterations):
    portfolio_value = [100]
    rand_values = rnd
    interpolated_returns = np.interp(rand_values, cdf_values, x)
    interpolated_returns = np.add(interpolated_returns, 1)
    for j in range(1, len(interpolated_returns)+1):
        portfolio_value.append(interpolated_returns[j-1]*portfolio_value[j-1])
        portfolio_value[j] = portfolio_value[j]-1
    portfolio_final.append(portfolio_value[-1])
print(np.mean(portfolio_final))

# Using vectors
portfolio_final = []
for i in range(nIterations):
    portfolio_values = np.ones(nSteps)*100.0
    rcp = np.cumprod(np.interp(rnd, cdf_values, x) + 1)
    portfolio_values = rcp * (portfolio_values - np.cumsum(1.0/rcp))
    portfolio_final.append(portfolio_values[-1])
print(np.mean(portfolio_final))

# Simple loop
portfolio_final = []
for i in range(nIterations):
    pv = 100
    rets = np.interp(rnd, cdf_values, x) + 1
    for i in range(nSteps):
        pv = pv * rets[i] - 1
    portfolio_final.append(pv)
print(np.mean(portfolio_final))
Forget about np.nditer. It does not improve the speed of iteration. Only use it if you intend to go on and use the C version (via Cython).
I'm puzzled about that inner loop. What is it supposed to be doing that's special? Why the loop?
In tests with simulated values these 2 blocks of code produce the same thing:
interpolated_returns = np.add(interpolated_returns,1)
for j in range(1,len(interpolated_returns)+1):
    portfolio_value.append(interpolated_returns[j-1]*portfolio[j-1])
    portfolio_value[j] = portfolio_value[j]-1

interpolated_returns = (interpolated_returns+1)*portfolio - 1
portfolio_value = portfolio_value + interpolated_returns.tolist()
I am assuming that interpolated_returns and portfolio are 1d arrays of the same length.
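A small self-contained check of that equivalence (using made-up arrays, and comparing only the appended values, i.e. ignoring the leading 100 in portfolio_value):

import numpy as np

rng = np.random.default_rng(0)
interpolated_returns = rng.random(5)
portfolio = rng.random(5)

# loop version (interpolated_returns already incremented by 1, as above)
loop_result = []
ir = interpolated_returns + 1
for j in range(1, len(ir) + 1):
    loop_result.append(ir[j-1] * portfolio[j-1])
    loop_result[j-1] -= 1

# vectorized version
vec_result = (interpolated_returns + 1) * portfolio - 1

print(np.allclose(loop_result, vec_result))  # True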

numpy -- Transform non-contiguous data to contiguous data in place

Consider the following code:
import numpy as np
a = np.zeros(50)
a[10:20:2] = 1
b = c = a[10:40:4]
print b.flags # You'll see that b and c are not C_CONTIGUOUS or F_CONTIGUOUS
My question:
Is there a way (with only a reference to b) to make both b and c contiguous?
It is completely fine if np.may_share_memory(b,a) returns False after this operation.
Things that come close but don't quite work out are np.ascontiguousarray/np.asfortranarray, as they will return a new array.
My use case is that I have very large 3D fields stored in a subclass of a numpy.ndarray. In order to save memory, I would like to chop those fields down to the portion of the domain that I am actually interested in processing:
a = a[ix1:ix2,iy1:iy2,iz1:iz2]
Slicing for the subclass is somewhat more restricted than slicing of ndarray objects, but this should work and it will "do the right thing" -- the various custom meta-data attached on the subclass will be transformed/preserved as expected. Unfortunately, since this returns a view, numpy won't free the big array afterward so I don't actually save any memory here.
To be completely clear, I'm looking to accomplish 2 things:
Preserve the metadata on my class instance. Slicing will work, but I'm not sure about other forms of copying.
make it so that the original array is free to be garbage collected
According to Alex Martelli:
"The only really reliable way to ensure that a large
but temporary use of memory DOES return all resources to the system
when it's done, is to have that use happen in a subprocess, which
does the memory-hungry work then terminates."
However, the following appears to free at least some of the memory:
Warning: my way of measuring free memory is Linux-specific:
import time
import numpy as np

def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total

def gen_change_in_memory():
    """
    https://stackoverflow.com/a/14446011/190597 (unutbu)
    """
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2

change_in_memory = gen_change_in_memory().next
Before allocating the large array:
print(change_in_memory())
# 0
a = np.zeros(500000)
a[10:20:2] = 1
b = c = a[10:40:4]
After allocating the large array:
print(change_in_memory())
# 3844 # KiB
a[:len(b)] = b
b = a[:len(b)]
a.resize(len(b), refcheck=0)
time.sleep(1)
Free memory increases after resizing:
print(change_in_memory())
# -3708 # KiB
You can do this in cython:
In [1]: %load_ext cythonmagic

In [2]: %%cython
cimport numpy as np
np.import_array()
def to_c_contiguous(np.ndarray a):
    cdef np.ndarray new
    cdef int dim, i
    new = a.copy()
    dim = np.PyArray_NDIM(new)
    for i in range(dim):
        np.PyArray_STRIDES(a)[i] = np.PyArray_STRIDES(new)[i]
    a.data = new.data
    np.PyArray_UpdateFlags(a, np.NPY_C_CONTIGUOUS)
    np.set_array_base(a, new)

In [8]:
import sys
import numpy as np
a = np.random.rand(10, 10, 10)
b = c = a[::2, 1::3, 2::4]
d = a[::2, 1::3, 2::4]
print sys.getrefcount(a)
to_c_contiguous(b)
print sys.getrefcount(a)
print np.all(b==d)
The output is:
4
3
True
to_c_contiguous(a) will create a C-contiguous copy of a and make it the base of a.
After the call to to_c_contiguous(b), the refcount of a is decreased, and when the refcount of a becomes 0, it will be freed.
I would claim the correct way to accomplish the 2 things you listed is to np.copy the slices you create.
Of course, in order for that to work correctly, you would need to define an appropriate __array_finalize__. You weren't very clear about why you decided to avoid it in the first place, but my feeling is that you should define it. (How did you solve the bx**2 problem without using __array_finalize__?)
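A minimal sketch of what that could look like (the class and attribute names here are made up for illustration; .copy() on the slice gives a contiguous array, __array_finalize__ carries the metadata over, and rebinding the name lets the original buffer be garbage collected):

import numpy as np

class Field(np.ndarray):                      # hypothetical subclass
    def __new__(cls, input_array, meta=None):
        obj = np.asarray(input_array).view(cls)
        obj.meta = meta
        return obj

    def __array_finalize__(self, obj):
        if obj is None:                       # called from __new__
            return
        self.meta = getattr(obj, 'meta', None)   # views, slices and copies inherit the metadata

a = Field(np.zeros((100, 100, 100)), meta={'dx': 0.1})
a = a[10:40, 10:40, 10:40].copy()             # contiguous copy; the big buffer can now be freed
print(a.flags['C_CONTIGUOUS'], a.meta)        # True {'dx': 0.1}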
