Equivalence between python for-loop and 3D numpy matrix additions - python

I cannot figure out a bug in a very simple transition from a for-loop to a vectorized numpy operation. The code is the following
for null_pos in null_positions:
    np.add(singletree[null_pos, parent.x, :, :],
           posteriors[parent.u, null_pos, :, :],
           out=singletree[null_pos, parent.x, :, :])
Since it is a simple addition between 2D matrices, I generalise into a 3D addition:
np.add(singletree[null_positions, parent.x, :, :],
       posteriors[parent.u, null_positions, :, :],
       out=singletree[null_positions, parent.x, :, :])
The thing is, it appears the result is different! Can you see why?
Thanks!
Update:
It seems that
singletree[null_positions, parent.x, :, :] = \
    posteriors[parent.u, null_positions, :, :] + \
    singletree[null_positions, parent.x, :, :]
solves the problem. How does this differ from the add operation? (Apart from allocating a new matrix; I'm interested in the semantic aspects.)

The problem is that out=singletree[null_positions, parent.x, :, :] makes a copy of that portion of singletree, because you are using advanced indexing (as opposed to basic indexing, which returns views). Hence, the result is written to a temporary array and the original one remains unmodified.
However, you can use advanced indexing to assign values. In your case, the recommended syntax would be:
singletree[null_positions, parent.x, :, :] += \
    posteriors[parent.u, null_positions, :, :]
which minimizes the use of intermediate arrays.
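The difference between the two forms can be seen on a toy array. The sketch below is only an illustration: `a` and `idx` are made-up stand-ins for singletree and null_positions, not names from the question. It also shows one caveat worth keeping in mind, namely that indexed `+=` applies a repeated index only once, while `np.add.at` accumulates the way a Python loop would.

```python
import numpy as np

a = np.zeros((4, 3))
idx = [0, 2]  # hypothetical stand-in for null_positions

# a[idx] uses advanced indexing, so it is a copy: `out=` fills that copy
# and the result never reaches `a`.
np.add(a[idx], 1.0, out=a[idx])
after_out = a.copy()  # snapshot taken before the working update below

# Indexed assignment goes through a.__setitem__, so `a` itself is updated.
a[idx] += 1.0

# Caveat: with repeated indices, `+=` applies each index only once,
# whereas np.add.at accumulates one addition per occurrence.
b = np.zeros(3)
b[[0, 0, 1]] += 1.0
c = np.zeros(3)
np.add.at(c, [0, 0, 1], 1.0)
```

If null_positions can contain duplicates, `np.add.at` is probably the closer match to the original for-loop.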

Related

3D matrix multiplication in tensorflow: AAB and AAB matrices to get a new AAB matrix

Suppose I have two images with dimensions of 32x32x3 (number of channels=3). I want to multiply them (like "matmul" function) on the first and the second dimensions for each of these 3 channels in Tensorflow to get a new 32x32x3 image.
Can someone help me with this?
Something like this loop:
# x.shape = (32, 32, 3)
# y.shape = (32, 32, 3)
a = np.zeros((x.shape[-3], x.shape[-2], x.shape[-1]), dtype='float32')
for i in range(a.shape[-1]):
    a[:, :, i] = tf.matmul(x[:, :, i], y[:, :, i])
a = tf.convert_to_tensor(a, dtype=tf.float32)
I was wondering whether there is a more efficient way to do this.
Actually, I found the answer.
tf.matmul also works on 3-D arrays: it treats the leading dimension as a batch and multiplies the last two dimensions. The channels therefore need to come first, so we use tf.transpose when they are in the last dimension, as follows:
x = tf.transpose(x, perm=[2, 0, 1])
y = tf.transpose(y, perm=[2, 0, 1])
a = tf.matmul(x, y)
a = tf.transpose(a, perm=[1, 2, 0])
It gives the same result as the loop I wrote above.
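np.matmul follows the same batching rule, so the equivalence can be checked without TensorFlow. The following is a sketch in NumPy with made-up random inputs. The einsum line is an alternative formulation that skips the transposes; it is not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))
y = rng.random((32, 32, 3))

# Reference: one 2-D matrix product per channel.
loop = np.empty_like(x)
for i in range(x.shape[-1]):
    loop[:, :, i] = x[:, :, i] @ y[:, :, i]

# Channels first, batched matmul, then channels last again.
batched = np.matmul(x.transpose(2, 0, 1), y.transpose(2, 0, 1)).transpose(1, 2, 0)

# The same contraction written as an einsum over the shared index j.
via_einsum = np.einsum('ijc,jkc->ikc', x, y)

print(np.allclose(loop, batched), np.allclose(loop, via_einsum))
```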

How to use np.put when target array is more than 2d?

I want to change BGR image elements.
Specifically, if a pixel's 2nd element equals its 3rd one, both of them should be set to 0.
arg1 = np.argwhere(img[:, :, 1] == img[:, :, 2])
np.put(img[:, :, 1], arg1, 0)
np.put(img[:, :, 2], arg1, 0)
I tried this, but it doesn't work.
Your code does run, just not the way you might expect. np.put interprets its indices as flat indices into the flattened target array, while np.argwhere gives you a 2-D array of (row, column) pairs.
To make it much simpler, you can use a boolean mask to get the job done:
mask = img[:, :, 1] == img[:, :, 2]
img[:, :, 1][mask] = 0
img[:, :, 2][mask] = 0
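As a small check of the masking approach, here is a sketch on a made-up 2x2 BGR image (the pixel values are arbitrary). It zeroes the 2nd and 3rd elements wherever they are equal, which is what the question asks for, and does both channels in one assignment.

```python
import numpy as np

# Hypothetical 2x2 image with 3 channels per pixel.
img = np.array([[[1, 5, 5], [2, 3, 4]],
                [[7, 0, 0], [9, 8, 8]]], dtype=np.uint8)

# True where the 2nd and 3rd channel of a pixel are equal.
mask = img[:, :, 1] == img[:, :, 2]

# The 2-D mask selects pixels; the slice 1: selects channels 1 and 2.
img[mask, 1:] = 0
```

Because the mask is applied through an assignment on `img` itself, no copy is involved and the image is modified in place.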

How to make this python code more efficient?

I have the following for loop:
for x in range(int(r.shape[3]/2)):
    for d in range(int(r.shape[1]/2)):
        r[:, d, :, x, :] = r[:, 0, :, x+d, :]
and I want to get rid of the nested for loops and solely use the numpy library functions to make this code more efficient. How can I do that?
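One possible way to remove both loops is to gather all shifted slices with a single integer index array. This is a sketch under assumptions: the shape of `r` below is made up, and the approach relies on the observation that the loop only ever reads from position 0 on axis 1, which it never changes (the d = 0 step copies a slice onto itself).

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random((2, 4, 3, 8, 2))  # hypothetical shape

# Reference result from the original nested loops.
expected = r.copy()
for x in range(expected.shape[3] // 2):
    for d in range(expected.shape[1] // 2):
        expected[:, d, :, x, :] = expected[:, 0, :, x + d, :]

nd, nx = r.shape[1] // 2, r.shape[3] // 2
# shift[d, x] = x + d, the source position along axis 3.
shift = np.arange(nd)[:, None] + np.arange(nx)[None, :]

# r[:, 0] has shape (A, C, X, E); indexing its axis 2 with `shift`
# returns a copy of shape (A, C, nd, nx, E).
gathered = r[:, 0][:, :, shift, :]

# Reorder to (A, nd, C, nx, E) to match the target block.
r[:, :nd, :, :nx, :] = gathered.transpose(0, 2, 1, 3, 4)

print(np.array_equal(r, expected))
```

The gathered block is a temporary of roughly a quarter of the size of `r`, so this trades memory for speed.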

Faster definition of "matrix multiplication" in Python

I need to define matrix multiplication from scratch, as instead of multiplying each constant together, each constant is actually another array and any two arrays need to be "convolved" together (I don't think it's necessary to define what a convolution is here).
I have made a picture that hopefully explains what I'm trying to say better:
The code I have for this is:
for row in range(arr1.shape[2]):
    for column in range(arr2.shape[3]):
        for index in range(arr2.shape[2]):  # Could also be "arr1.shape[3]"
            out[:, :, row, column] += convolve(
                arr2[:, :, :, column][:, :, index],
                arr1[:, :, row, :][:, :, index]
            )
However, this method has proved to be very slow for me, so I was wondering if there was a faster way to do this.
If the intermediate fits in memory, the following should be reasonably efficient:
import numpy as np
from scipy.signal import fftconvolve, convolve
# example
rng = np.random.default_rng()
A = rng.random((5, 6, 2, 3))
B = rng.random((4, 3, 3, 4))
# custom matmul
Ae, Be = A[..., None], B[:, :, None]
shsh = np.maximum(Ae.shape[2:], Be.shape[2:])
Ae = np.broadcast_to(Ae, (*Ae.shape[:2], *shsh))
Be = np.broadcast_to(Be, (*Be.shape[:2], *shsh))
C = fftconvolve(Ae, Be, axes=(0, 1), mode='valid').sum(3)
# original loop for reference
out = np.zeros_like(C)
for row in range(A.shape[2]):
    for column in range(B.shape[3]):
        for index in range(B.shape[2]):  # Could also be "A.shape[3]"
            out[:, :, row, column] += convolve(
                B[:, :, :, column][:, :, index],
                A[:, :, row, :][:, :, index],
                mode='valid'
            )
print(np.allclose(C, out))
# True
By doing the convolution in bulk, we reduce the total number of FFTs we have to do.
If need be, this could be further optimized for both speed and memory by doing the sum reduction in Fourier space using einsum. This would require doing the FFT convolution by hand, though.
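A rough sketch of what "by hand" could look like, using only NumPy, is given below. It is one possible reading of the suggestion rather than the answerer's own code: both operands are zero-padded to the full convolution size, multiplied and summed over the inner index with einsum in Fourier space, transformed back, and cropped to the 'valid' region. The cropping assumes the first operand is at least as large as the second on both convolution axes, as in the example shapes. The direct-sum loop is only there as a reference for checking.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 6, 2, 3))   # (h1, w1, rows, inner)
B = rng.random((4, 3, 3, 4))   # (h2, w2, inner, cols)
(h1, w1), (h2, w2) = A.shape[:2], B.shape[:2]

# Zero-pad both operands to the full linear-convolution size.
full_shape = (h1 + h2 - 1, w1 + w2 - 1)
FA = np.fft.rfft2(A, s=full_shape, axes=(0, 1))
FB = np.fft.rfft2(B, s=full_shape, axes=(0, 1))

# Pointwise product and sum over the inner index k, all in Fourier space.
FC = np.einsum('xyrk,xykc->xyrc', FA, FB)
full = np.fft.irfft2(FC, s=full_shape, axes=(0, 1))

# Keep only the 'valid' part (assumes h1 >= h2 and w1 >= w2).
C = full[h2 - 1:h1, w2 - 1:w1]

# Direct-sum reference: one shifted window of A per kernel tap of B.
ref = np.zeros_like(C)
for m in range(h2):
    for n in range(w2):
        window = A[h2 - 1 - m:h1 - m, w2 - 1 - n:w1 - n]
        ref += np.einsum('xyrk,kc->xyrc', window, B[m, n])

print(np.allclose(C, ref))
```

Compared with the broadcast version, only one inverse transform per (row, column) pair is needed, and the large broadcast intermediate is never convolved.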

Memory growth with broadcast operations in NumPy

I am using NumPy to handle some large data matrices (around 50GB in size). The machine where I am running this code has 128GB of RAM, so doing simple linear operations of this magnitude shouldn't be a problem memory-wise.
However, I am witnessing a huge memory growth (to more than 100GB) when computing the following code in Python:
import numpy as np
# memory allocations (everything works fine)
a = np.zeros((1192953, 192, 32), dtype='f8')
b = np.zeros((1192953, 192), dtype='f8')
c = np.zeros((192, 32), dtype='f8')
a[:] = b[:, :, np.newaxis] - c[np.newaxis, :, :] # memory explodes here
Please note that initial memory allocations are done without any problems. However, when I try to perform the subtract operation with broadcasting, the memory grows to more than 100GB. I always thought that broadcasting would avoid making extra memory allocations but now I am not sure if this is always the case.
As such, can someone give some details on why this memory growth is happening, and how the following code could be rewritten using more memory efficient constructs?
I am running the code in Python 2.7 within IPython Notebook.
@rth's suggestion to do the operation in smaller batches is a good one. You could also try using the function np.subtract and give it the destination array to avoid creating an additional temporary array. I also think you don't need to index c as c[np.newaxis, :, :]: broadcasting aligns shapes from the trailing dimensions, so the 2-D array c is already treated as if it had a leading axis of length 1.
So instead of
a[:] = b[:, :, np.newaxis] - c[np.newaxis, :, :] # memory explodes here
try
np.subtract(b[:, :, np.newaxis], c, a)
The third argument of np.subtract is the destination array.
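A scaled-down sketch of this call is shown below. The array sizes are made up so that it runs quickly; the names mirror the question.

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.random((1000, 192))
c = rng.random((192, 32))
a = np.empty((1000, 192, 32))

# c has shape (192, 32); broadcasting lines it up with the trailing axes of
# b[:, :, np.newaxis] (shape (1000, 192, 1)), so no explicit new axis is needed.
result = np.subtract(b[:, :, np.newaxis], c, out=a)

# The ufunc returns the very array passed as `out`.
print(result is a)
```

Passing the destination by keyword (`out=a`) is equivalent to passing it as the third positional argument and is a little easier to read.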
Well, your array a already takes 1192953*192*32*8 bytes / 1e9 ≈ 58 GB of memory.
Broadcasting does not make additional memory allocations for the input arrays, but the result of
b[:, :, np.newaxis] - c[np.newaxis, :, :]
is still stored in a temporary array. Therefore, at this line you have allocated at least 2 arrays with the shape of a, for a total memory use of more than 116 GB.
You can avoid this issue by operating on a smaller subset of your array at a time:
CHUNK_SIZE = 100000
for start in range(0, b.shape[0], CHUNK_SIZE):
    sl = slice(start, start + CHUNK_SIZE)  # the last chunk may be shorter
    a[sl] = b[sl, :, np.newaxis] - c[np.newaxis, :, :]
This will be marginally slower, but uses much less memory.
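The chunked approach can be wrapped in a small helper and checked on toy data. This is a sketch: the function name `subtract_in_chunks` and the array sizes are made up for illustration. The row count is deliberately not a multiple of the chunk size, to show that stepping the range by the chunk size also covers the final, shorter block.

```python
import numpy as np

def subtract_in_chunks(a, b, c, chunk_size):
    """Fill a[i, j, k] = b[i, j] - c[j, k] one block of rows at a time."""
    for start in range(0, b.shape[0], chunk_size):
        sl = slice(start, start + chunk_size)  # last block may be shorter
        # The temporary on the right-hand side is only chunk-sized.
        a[sl] = b[sl, :, np.newaxis] - c
    return a

rng = np.random.default_rng(0)
b = rng.random((1050, 8))
c = rng.random((8, 4))
a = subtract_in_chunks(np.empty((1050, 8, 4)), b, c, chunk_size=100)

print(np.allclose(a, b[:, :, np.newaxis] - c))
```

Peak extra memory is then about chunk_size * 192 * 32 * 8 bytes for the shapes in the question, which is roughly 4.9 GB for a chunk size of 100000.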
