When to use .shape and when to use .reshape? - python

I ran into a memory problem when trying to use .reshape on a numpy array and figured if I could somehow reshape the array in place that would be great.
I realised that I could reshape arrays by simply changing the .shape value.
Unfortunately, when I tried using .shape I again got a memory error, which has me thinking that it doesn't reshape in place.
I was wondering: when do I use one and when do I use the other?
Any help is appreciated.
If you want additional information please let me know.
EDIT:
I added my code and how the matrix I want to reshape is created in case that is important.
Change the N value depending on your memory.
import numpy as np
N = 100
a = np.random.rand(N, N)
b = np.random.rand(N, N)
c = a[:, np.newaxis, :, np.newaxis] * b[np.newaxis, :, np.newaxis, :]
c = c.reshape([N*N, N*N])
c.shape = ([N, N, N, N])
EDIT2:
This is a better representation. Apparently the transpose is important, as it changes the arrays from C-contiguous to F-contiguous; the result of the multiplication in the case above is contiguous, while in the one below it is not.
import numpy as np
N = 100
a = np.random.rand(N, N).T
b = np.random.rand(N, N).T
c = a[:, np.newaxis, :, np.newaxis] * b[np.newaxis, :, np.newaxis, :]
c = c.reshape([N*N, N*N])
c.shape = ([N, N, N, N])

numpy.reshape will copy the data if it can't make a proper view, whereas setting the shape attribute will raise an error instead of copying the data.
It is not always possible to change the shape of an array without copying the data. If you want an error to be raised when the data would be copied, you should assign the new shape to the shape attribute of the array.
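A minimal sketch of that difference, using a transposed (non-contiguous) view that cannot be flattened without copying:
import numpy as np
a = np.arange(6).reshape(2, 3)
b = a.T                        # non-contiguous view of a
c = b.reshape(6)               # succeeds: numpy quietly returns a copy
print(np.shares_memory(a, c))  # False, the data was copied
b.shape = (6,)                 # raises AttributeError: incompatible shape
                               # for in-place modification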

I would like to revisit this question focusing on the OOP paradigm, despite the memory issues presented as the problem.
When to use .shape and when to use .reshape?
OOP principle of Encapsulation
Following the OOP paradigm, since shape is a property of the numpy.ndarray object, it is generally advisable to change it through a method call on the object rather than by assigning to the attribute directly. This adheres to the OOP principle of encapsulation.
Performance Issues
As for performance, there seems to be no difference.
import numpy as np
# creates an array of 1,000,000 random floats
a = np.array(np.random.rand(1_000_000))
# (1000000,)
a.shape
# using IPython to time both operations resulted in
# 201 ns ± 4.85 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.shape = (5_000, 200)
# 217 ns ± 0.957 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit a.reshape(5_000, 200)
Running hardware
OS : Linux 4.15.0-142-generic #146~16.04.1-Ubuntu
CPU: Intel(R) Core(TM) i3-4170 CPU @ 3.70GHz, 4 cores
RAM: 16GB

Related

Fastest Way to Find the Dot Product of a Large Matrix of Vectors

I am looking for suggestions on the most efficient way to solve the following problem:
I have two arrays called A and B. They are both of shape NxNx3. They represent two 2D matrices of positions, where each position is a vector of x, y, and z coordinates.
I want to create a new array, called C, of shape NxN, where C[i, j] is the dot product of the vectors A[i, j] and B[i, j].
Here are the solutions I've come up with so far. The first uses numpy's einsum function (which is beautifully described here). The second uses numpy's broadcasting rules along with its sum function.
>>> import numpy as np
>>> A = np.random.randint(0, 10, (100, 100, 3))
>>> B = np.random.randint(0, 10, (100, 100, 3))
>>> C = np.einsum("ijk,ijk->ij", A, B)
>>> D = np.sum(A * B, axis=2)
>>> np.allclose(C, D)
True
Is there a faster way? I've heard murmurs that numpy's tensordot function can be blazing fast but I've always struggled to understand it. What about using numpy's dot, or inner functions?
For some context, the A and B arrays will typically have between 100 and 1000 elements.
Any guidance is much appreciated!
With a bit of reshaping, we can use matmul. The idea is to treat the first 2 dimensions as 'batch' dimensions, and do the dot product on the last:
In [278]: E = A[...,None,:]@B[...,:,None]
In [279]: E.shape
Out[279]: (100, 100, 1, 1)
In [280]: E = np.squeeze(A[...,None,:]@B[...,:,None])
In [281]: np.allclose(C,E)
Out[281]: True
In [282]: timeit E = np.squeeze(A[...,None,:]@B[...,:,None])
130 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [283]: timeit C = np.einsum("ijk,ijk->ij", A, B)
90.2 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Comparative timings can be a bit tricky. In the current versions, einsum can take different routes depending on the dimensions. In some cases it appears to delegate the task to matmul (or at least the same underlying BLAS-like code). While it's nice that einsum is faster in this test, I wouldn't generalize that.
tensordot just reshapes (and if needed transposes) the arrays so it can apply the ordinary 2d np.dot. It doesn't actually work here, because you are treating the first 2 axes as a 'batch', whereas tensordot takes an outer product over them.
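A small sketch of that point, with reduced sizes so the intermediate fits in memory:
import numpy as np
A = np.random.randint(0, 10, (10, 10, 3))
B = np.random.randint(0, 10, (10, 10, 3))
# tensordot contracts the named axes and takes an outer product over the rest,
# so the two batch dimensions of A and B are not paired up:
T = np.tensordot(A, B, axes=([2], [2]))
print(T.shape)                        # (10, 10, 10, 10), not (10, 10)
# the batched dot products live on the 'diagonal' T[i, j, i, j]:
i = np.arange(10)[:, None]
j = np.arange(10)[None, :]
C = np.einsum("ijk,ijk->ij", A, B)
print(np.allclose(C, T[i, j, i, j]))  # True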

How to matrix-multiply a 2D numpy array with a 3D array to give a 3D array?

I am solving a photometric stereo problem, in which I have "n" light sources with 3 channels (Red, Green, Blue) each.
Thus light array is of shape nx3: lights.shape = nx3
I have the images corresponding to each lighting condition. image dimensions are hxw (height x width), images.shape = nxhxw
I want to matrix-multiply each pixel in the images by a matrix of shape 3 x n and get another array of shape 3xhxw; these will be the normal vectors of each pixel in the image.
shapes of:
images : (n_ims, h, w)
lights : (n_ims, 3)
S = lights
S_pinv = np.linalg.inv(S.T@S)@S.T # pinv is pseudo inverse, S_pinv.shape : (3, n_ims)
b = S_pinv @ images # I want (3xn @ nxhxw = 3xhxw)
But I am getting this error:
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 100 is different from 3)
The problem is that numpy views multidimensional arrays as stacks of matrices, and always the last two dimensions are assumed to be the linear space dimensions. This means that the dot product will not work by collapsing the first dimension of your 3d array.
Instead, the simplest thing you can do is to reshape your 3d array into a 2d one, do the matrix multiplication, and reshape back into a 3d array. This will also make use of optimised BLAS code, which is one of the great advantages of numpy.
import numpy as np
S_pinv = np.random.rand(3, 4)
images = np.random.rand(4, 5, 6)
# error:
# (S_pinv @ images).shape
res_shape = S_pinv.shape[:1] + images.shape[1:] # (3, 5, 6)
res = (S_pinv @ images.reshape(images.shape[0], -1)).reshape(res_shape)
print(res.shape) # (3, 5, 6)
So instead of (3,n) x (n,h,w) we do (3,n) x (n, h*w) -> (3, h*w), which we reshape back to (3, h, w). Reshaping is free, because it doesn't involve any actual movement of data in memory (only a reinterpretation of the single block of memory that underlies the array), and, as I said, proper matrix products are highly optimized in numpy.
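A quick way to convince yourself that the reshape is only a reinterpretation (using the same toy shapes as above):
import numpy as np
images = np.random.rand(4, 5, 6)            # C-contiguous
flat = images.reshape(images.shape[0], -1)  # (4, 30)
print(np.shares_memory(images, flat))       # True: no data was moved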
Since you asked for a more intuitive way, here's an alternative making use of numpy.einsum. It will probably be slower, but it's very transparent if you get a little bit used to its notation:
res_einsum = np.einsum('tn,nhw -> thw', S_pinv, images)
print(np.array_equal(res, res_einsum)) # True
This notation names each of the dimensions of the input arrays: for S_pinv the first and second dimensions are named t and n, respectively, and similarly n, h and w for images. The output is set to have dimensions thw, which means that any remaining dimensions not present in the output shape will be summed over after multiplying the input arrays. This is exactly what you need.
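If it helps, the same contraction can be spelled out with plain broadcasting and a sum over the shared n axis (a sketch with toy shapes):
import numpy as np
S_pinv = np.random.rand(3, 4)     # (t, n)
images = np.random.rand(4, 5, 6)  # (n, h, w)
res_einsum = np.einsum('tn,nhw -> thw', S_pinv, images)
res_manual = (S_pinv[:, :, None, None] * images[None, :, :, :]).sum(axis=1)
print(np.allclose(res_einsum, res_manual))  # True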
As you noted in a comment, you could also transpose your arrays so that np.dot finds the right dimensions in the right place. But this will also be slow because this might lead to copies in memory, or at least suboptimal looping over your arrays.
I made a quick timing comparison using the following definitions:
def reshaped(S_pinv, images):
    res_shape = S_pinv.shape[:1] + images.shape[1:]
    return (S_pinv @ images.reshape(images.shape[0], -1)).reshape(res_shape)
def einsummed(S_pinv, images):
    return np.einsum('tn,nhw -> thw', S_pinv, images)
def transposed(S_pinv, images):
    return (S_pinv @ images.transpose(2, 0, 1)).transpose(1, 2, 0)
And here's the timing test using IPython's built-in %timeit magic, and some more realistic array sizes:
>>> S_pinv = np.random.rand(3, 100)
... images = np.random.rand(100, 200, 300)
... args = S_pinv, images
... %timeit reshaped(*args)
... %timeit einsummed(*args)
... %timeit transposed(*args)
5.92 ms ± 460 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.9 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
44.5 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The answer is np.swapaxes:
import numpy as np
q = np.random.random([2, 5, 5])
q.shape
w = np.random.random([3, 2])
w.shape
w @ q
and we have a ValueError, but
import numpy as np
q = np.random.random([5, 2, 5])
q.shape
w = np.random.random([3, 2])
w.shape
res = (w @ q).swapaxes(0, 1)
res.shape  # (3, 5, 5)
This is basically what np.einsum is for.
Instead of:
b = S_pinv @ images
Use
b = np.einsum('ij, ikl -> jkl', S_pinv, images)
in this case i = n_ims, j = 3, k = h and l = w
Since this is a single contraction, you can also do it with np.tensordot()
b = np.tensordot(S_pinv.T, images, axes = 1)
or,
b = np.tensordot(S_pinv, images, axes = ([0], [0]))
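A quick check (with hypothetical toy shapes, and S_pinv taken as (n_ims, 3) to match the einsum indices above) that all three spellings agree:
import numpy as np
S_pinv = np.random.rand(4, 3)     # (n_ims, 3)
images = np.random.rand(4, 5, 6)  # (n_ims, h, w)
b_einsum = np.einsum('ij, ikl -> jkl', S_pinv, images)
b_td1 = np.tensordot(S_pinv.T, images, axes=1)
b_td2 = np.tensordot(S_pinv, images, axes=([0], [0]))
print(b_einsum.shape)                # (3, 5, 6)
print(np.allclose(b_einsum, b_td1))  # True
print(np.allclose(b_einsum, b_td2))  # True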
One easy way would be np.inner; inner reduces along the last axis and preserves all others; therefore it is, up to a transpose, a perfect match:
n,h,w = 10,384,512
images = np.random.randint(1,10,(n,h,w))
S_pinv = np.random.randint(1,10,(n,3))
res_inr = np.inner(images.T,S_pinv.T).T
res_inr.shape
# (3, 384, 512)
Similarly, using transposes, matmul actually does the right thing:
res_mml = (images.T@S_pinv).T
assert (res_mml==res_inr).all()
These two seem to be roughly as fast as @AndrasDeak's einsum method.
In particular, they are not as fast as the reshaped matmul (unsurprising, since a single straight matmul must be one of the most optimized operations there is). They trade speed for convenience.

Applying a mapping function to each member of an ndarray with indices as arguments

I have an ndarray representing an RGB image with the shape of (width, height, 3), and I wish to replace each value with the result of some function of itself, its position and the color channel it belongs to. Doing so in three nested for loops is extremely slow; is there a way to express this as a native array operation?
Edit: I am looking for an in-place solution - one that does not involve creating another O(width x height) ndarray (unless numpy has some magic that can prevent such an ndarray from actually being allocated).
I'm not sure if I got your question right! What I understood is that you want to apply a mapping to each channel of your RGB image based on its corresponding indices; if so, the code below might help, since not many details were available in your question.
import numpy as np
bit_depth = 8
patch_size = 32
def lut_generator(constant_multiplier):
    x = np.arange(2 ** bit_depth)
    y = constant_multiplier * x
    return dict(zip(x, y))
rgb = np.random.randint(0, (2**bit_depth), (patch_size, patch_size, 3))
# Considering a simple lookup table without using indices.
lut = lut_generator(5)
# Split the three channels and their respective indices.
# You can use the indices wherever you need them.
r, g, b = np.dsplit(rgb, rgb.shape[-1])
indexes = np.arange(rgb.size).reshape(rgb.shape)
r_idx, g_idx, b_idx = np.dsplit(indexes, indexes.shape[-1])
# Apply transformation on each channel.
transformed_r = np.vectorize(lut.get)(r)
transformed_g = np.vectorize(lut.get)(g)
transformed_b = np.vectorize(lut.get)(b)
Good luck!
Take note of the qualifications in many of the comments: using numpy arithmetic directly will often be easier and faster.
import numpy as np
def test(item, ix0, ix1, ix2):
    # A function with the required signature. This you customise to suit.
    return item*(ix0+ix1+ix2)//202

def make_function_for(arr, f):
    ''' where arr is a 3D numpy array and f is a function taking four arguments.
    item : the item from the array
    ix0 ... ix2 : the three indices
    it returns the required result from these 4 arguments.
    '''
    def user_f(ix0, ix1, ix2):
        # np.fromfunction requires only the three indices as arguments.
        ix0 = ix0.astype(np.int32)
        ix1 = ix1.astype(np.int32)
        ix2 = ix2.astype(np.int32)
        return f(arr[ix0, ix1, ix2], ix0, ix1, ix2)
    return user_f
# user_f is a function suitable for calling in np.fromfunction
a=np.arange(100*100*3)
a.shape=100,100,3
a[...]=np.fromfunction(make_function_for(a, test), a.shape)
My test function is pretty simple so I can do it in numpy.
Using fromfunction:
%timeit np.fromfunction(make_function_for(a, test), a.shape)
5.7 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy arithmetic:
def alt_func(arr):
    temp = np.add.outer(np.arange(arr.shape[0]), np.arange(arr.shape[1]))
    temp = np.add.outer(temp, np.arange(arr.shape[2]))
    return arr*temp//202
%timeit alt_func(a)
967 µs ± 4.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So numpy arithmetic is almost 6 times as fast on my machine for this case.
Edited to correct my seemingly inevitable typos!

np.shuffle much slower than np.random.choice

I have an array of shape (N, 3) and I'd like to randomly shuffle the rows. N is on the order of 100,000.
I discovered that np.random.shuffle was bottlenecking my application. I tried replacing the shuffle with a call to np.random.choice and experienced a 10x speed-up. What's going on here? Why is it so much faster to call np.random.choice? Does the np.random.choice version generate a uniformly distributed shuffle?
import timeit
task_choice = '''
N = 100000
x = np.zeros((N, 3))
inds = np.random.choice(N, N, replace=False)
x[np.arange(N), :] = x[inds, :]
'''
task_shuffle = '''
N = 100000
x = np.zeros((N, 3))
np.random.shuffle(x)
'''
task_permute = '''
N = 100000
x = np.zeros((N, 3))
x = np.random.permutation(x)
'''
setup = 'import numpy as np'
timeit.timeit(task_choice, setup=setup, number=10)
>>> 0.11108078400138766
timeit.timeit(task_shuffle, setup=setup, number=10)
>>> 1.0411593900062144
timeit.timeit(task_permute, setup=setup, number=10)
>>> 1.1140159380011028
Edit: For anyone curious, I decided to go with the following solution since it is readable and outperformed all other methods in my benchmarks:
task_ind_permute = '''
N = 100000
x = np.zeros((N, 3))
inds = np.random.permutation(N)
x[np.arange(N), :] = x[inds, :]
'''
You're comparing very differently sized arrays here. In your first example, although you create an array of zeros, you simply use np.random.choice(100000, 100000, replace=False), which draws 100000 distinct values from the range 0-99999. In your second example you are shuffling a (100000, 3) shaped array.
>>> x.shape
(100000, 3)
>>> np.random.choice(N, N, replace=False).shape
(100000,)
Timings on more equivalent samples:
In [979]: %timeit np.random.choice(N, N, replace=False)
2.6 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [980]: x = np.arange(100000)
In [981]: %timeit np.random.shuffle(x)
2.29 ms ± 67.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [982]: x.shape == np.random.choice(N, N, replace=False).shape
Out[982]: True
permutation and shuffle are linked; in fact, permutation calls shuffle under the hood!
The reason why shuffle is slower than permutation for a multidimensional array is that permutation only needs to shuffle the indices along the first axis, which reduces to the special case of shuffling a 1d array (the first if-else block in the source).
This special case is also explained in the source as well:
# We trick gcc into providing a specialized implementation for
# the most common case, yielding a ~33% performance improvement.
# Note that apparently, only one branch can ever be specialized.
For shuffle, on the other hand, the multidimensional ndarray operation requires a bounce buffer; creating that buffer, especially when the dimensions are relatively big, becomes expensive. Additionally, we can no longer use the trick mentioned above that helps the 1d case.
With replace=False and using choice to generate a new array of the same size, choice and permutation are the same, see here. The extra time would have to come from the time spent creating intermediate index arrays.
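A small sketch of the point above: for a 2-D array, permutation amounts to shuffling a 1-D index array and fancy-indexing the rows with it.
import numpy as np
N = 100000
x = np.arange(N * 3, dtype=float).reshape(N, 3)
inds = np.random.permutation(N)  # only a 1-D index array is shuffled
x_perm = x[inds]                 # rows reordered through fancy indexing
# undoing the permutation recovers the original rows exactly
print(np.array_equal(x_perm[np.argsort(inds)], x))  # True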

Is there a way to efficiently invert an array of matrices with numpy?

Normally I would invert an array of 3x3 matrices in a for loop like in the example below. Unfortunately for loops are slow. Is there a faster, more efficient way to do this?
import numpy as np
A = np.random.rand(3,3,100)
Ainv = np.zeros_like(A)
for i in range(100):
    Ainv[:,:,i] = np.linalg.inv(A[:,:,i])
It turns out that you're getting burned two levels down in the numpy.linalg code. If you look at numpy.linalg.inv, you can see it's just a call to numpy.linalg.solve(A, identity(A.shape[0])). This has the effect of recreating the identity matrix in each iteration of your for loop. Since all your arrays are the same size, that's a waste of time. Skipping this step by pre-allocating the identity matrix shaves ~20% off the time (fast_inverse). My testing suggests that pre-allocating the array or allocating it from a list of results doesn't make much difference.
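A quick sanity check of that equivalence:
import numpy as np
A = np.random.rand(3, 3)
I = np.identity(3)
print(np.allclose(np.linalg.inv(A), np.linalg.solve(A, I)))  # True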
Look one level deeper and you find the call to the lapack routine, but it's wrapped in several sanity checks. If you strip all these out and just call lapack in your for loop (since you already know the dimensions of your matrix and maybe know that it's real, not complex), things run MUCH faster (Note that I've made my array larger):
import numpy as np
A = np.random.rand(1000,3,3)
def slow_inverse(A):
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.inv(A[i])
    return Ainv

def fast_inverse(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    Ainv = np.zeros_like(A)
    for i in range(A.shape[0]):
        Ainv[i] = np.linalg.solve(A[i], identity)
    return Ainv

def fast_inverse2(A):
    identity = np.identity(A.shape[2], dtype=A.dtype)
    return np.array([np.linalg.solve(x, identity) for x in A])

from numpy.linalg import lapack_lite
lapack_routine = lapack_lite.dgesv
# Looking one step deeper, we see that solve performs many sanity checks.
# Stripping these, we have:
def faster_inverse(A):
    b = np.identity(A.shape[2], dtype=A.dtype)
    n_eq = A.shape[1]
    n_rhs = A.shape[2]
    pivots = np.zeros(n_eq, np.intc)
    identity = np.eye(n_eq)

    def lapack_inverse(a):
        b = np.copy(identity)
        pivots = np.zeros(n_eq, np.intc)
        results = lapack_lite.dgesv(n_eq, n_rhs, a, n_eq, pivots, b, n_eq, 0)
        if results['info'] > 0:
            raise np.linalg.LinAlgError('Singular matrix')
        return b

    return np.array([lapack_inverse(a) for a in A])
%timeit -n 20 aI11 = slow_inverse(A)
%timeit -n 20 aI12 = fast_inverse(A)
%timeit -n 20 aI13 = fast_inverse2(A)
%timeit -n 20 aI14 = faster_inverse(A)
The results are impressive:
20 loops, best of 3: 45.1 ms per loop
20 loops, best of 3: 38.1 ms per loop
20 loops, best of 3: 38.9 ms per loop
20 loops, best of 3: 13.8 ms per loop
EDIT: I didn't look closely enough at what gets returned in solve. It turns out that the 'b' matrix is overwritten and contains the result in the end. This code now gives consistent results.
A few things have changed since this question was asked and answered, and now numpy.linalg.inv supports multidimensional arrays, handling them as stacks of matrices with the matrix indices being last (in other words, arrays of shape (..., N, N)). This seems to have been introduced in numpy 1.8.0. Unsurprisingly this is by far the best option in terms of performance:
import numpy as np
A = np.random.rand(3,3,1000)
def slow_inverse(A):
    """Looping solution for comparison"""
    Ainv = np.zeros_like(A)
    for i in range(A.shape[-1]):
        Ainv[...,i] = np.linalg.inv(A[...,i])
    return Ainv

def direct_inverse(A):
    """Compute the inverse of matrices in an array of shape (N,N,M)"""
    return np.linalg.inv(A.transpose(2,0,1)).transpose(1,2,0)
Note the two transposes in the latter function: the input of shape (N,N,M) has to be transposed to shape (M,N,N) for np.linalg.inv to work, and then the result has to be permuted back to shape (N,N,M).
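The transposes themselves are cheap, since they only swap strides and do not copy any data:
import numpy as np
A = np.random.rand(3, 3, 1000)
At = A.transpose(2, 0, 1)
print(At.shape)      # (1000, 3, 3)
print(At.base is A)  # True: just a view, no data copied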
A check and timing results using IPython, on python 3.6 and numpy 1.14.0:
In [5]: np.allclose(slow_inverse(A),direct_inverse(A))
Out[5]: True
In [6]: %timeit slow_inverse(A)
19 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit direct_inverse(A)
1.3 ms ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy BLAS calls are not always the fastest possibility
On problems where you have to calculate lots of inverses, eigenvalues, dot products of small 3x3 matrices or similar cases, the numpy-MKL build which I use can often be outperformed by quite a margin.
These external BLAS routines are usually built for problems with larger matrices; for smaller ones you can write out a standard algorithm or take a look at e.g. Intel IPP.
Please keep also in mind that Numpy uses C-ordered arrays by default (last dimension changes fastest).
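A small sketch of what that layout means for the two storage orders used in this thread:
import numpy as np
A = np.random.rand(3, 3, 1000)  # matrices addressed as A[:, :, i]
B = np.random.rand(1000, 3, 3)  # matrices addressed as B[i]
# With C order, only the (N, 3, 3) layout keeps each 3x3 matrix in one
# contiguous block of memory:
print(A[:, :, 0].flags['C_CONTIGUOUS'])  # False
print(B[0].flags['C_CONTIGUOUS'])        # True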
For this example I took the code from Matrix inversion (3,3) python - hard coded vs numpy.linalg.inv and modified it a bit.
import numpy as np
import numba as nb
import time
@nb.njit(fastmath=True)
def inversion(m):
    minv = np.empty(m.shape, dtype=m.dtype)
    for i in range(m.shape[0]):
        determinant_inv = 1./(m[i,0]*m[i,4]*m[i,8] + m[i,3]*m[i,7]*m[i,2] + m[i,6]*m[i,1]*m[i,5] - m[i,0]*m[i,5]*m[i,7] - m[i,2]*m[i,4]*m[i,6] - m[i,1]*m[i,3]*m[i,8])
        minv[i,0]=(m[i,4]*m[i,8]-m[i,5]*m[i,7])*determinant_inv
        minv[i,1]=(m[i,2]*m[i,7]-m[i,1]*m[i,8])*determinant_inv
        minv[i,2]=(m[i,1]*m[i,5]-m[i,2]*m[i,4])*determinant_inv
        minv[i,3]=(m[i,5]*m[i,6]-m[i,3]*m[i,8])*determinant_inv
        minv[i,4]=(m[i,0]*m[i,8]-m[i,2]*m[i,6])*determinant_inv
        minv[i,5]=(m[i,2]*m[i,3]-m[i,0]*m[i,5])*determinant_inv
        minv[i,6]=(m[i,3]*m[i,7]-m[i,4]*m[i,6])*determinant_inv
        minv[i,7]=(m[i,1]*m[i,6]-m[i,0]*m[i,7])*determinant_inv
        minv[i,8]=(m[i,0]*m[i,4]-m[i,1]*m[i,3])*determinant_inv
    return minv

# I was too lazy to modify the code from the link above more thoroughly
def inversion_3x3(m):
    m_TMP = m.reshape(m.shape[0], 9)
    minv = inversion(m_TMP)
    return minv.reshape(minv.shape[0], 3, 3)
#Testing
A = np.random.rand(1000000,3,3)
# Warmup, so that the compilation overhead of the first call is not measured.
# You may also use @nb.njit(fastmath=True, cache=True), but this also has about 0.2 s
# overhead on the first call.
Ainv = inversion_3x3(A)
t1=time.time()
Ainv = inversion_3x3(A)
print(time.time()-t1)
t1=time.time()
Ainv2 = np.linalg.inv(A)
print(time.time()-t1)
print(np.allclose(Ainv2,Ainv))
Performance
np.linalg.inv: 0.36 s
inversion_3x3: 0.031 s
For loops are indeed not necessarily much slower than the alternatives, and in this case avoiding them will not help you much either. But here is a suggestion:
import numpy as np
A = np.random.rand(100,3,3)  # this is to make it possible to index
                             # the matrices as A[i]
Ainv = np.array(list(map(np.linalg.inv, A)))  # list() is needed on Python 3, where map returns an iterator
Timing this solution vs. your solution yields a small but noticeable difference:
# The for loop:
100 loops, best of 3: 6.38 ms per loop
# The map:
100 loops, best of 3: 5.81 ms per loop
I tried to use the numpy routine 'vectorize' with the hope of creating an even cleaner solution, but I'll have to take a second look into that. The change of ordering in the array A is probably the most significant change, since it exploits the fact that numpy arrays are C-ordered (row-major) by default, and therefore a linear readout of each 3x3 matrix is ever so slightly faster this way.
