I have a matrix that is supposed to be symmetric (it is the inverse of a symmetric matrix), but it is not exactly symmetric because of numerical errors in the inversion, etc.
So I add a step that symmetrizes the matrix (a = .5 * (a + a')), and I see a numerical disaster if I do it in-place (out-of-place is fine). Code:
import numpy as np
def check_sym(x):
    print("||a-a'||^2 = %e" % np.sum((x - x.T)**2))
# make a symmetric matrix
dim = 100
a = np.random.randn(dim,dim)
a = np.matmul(a, a.T)
b = a.copy()
check_sym(a)
print("symmetrizing in-place")
a += a.T
a *= .5
check_sym(a)
print("symmetrizing out-of-place")
b = .5 * (b + b.T)
check_sym(b)
And the output is:
||a-a'||^2 = 1.184044e-26
symmetrizing in-place
||a-a'||^2 = 7.313593e+04
symmetrizing out-of-place
||a-a'||^2 = 0.000000e+00
Note that for lower dimension (e.g. dim=10) the problem does not appear.
EDIT some more info is given by looking at a-a' after the in-place version:
The error comes from the line a += a.T. It is a known pitfall of in-place operations (I cannot find the relevant piece of documentation right now), but to quote the SciPy lecture notes:
The transposition is a view.
As a result, the following code is wrong and will not make a matrix symmetric:
a += a.T
It will work for small arrays (because of buffering) but fail for large ones, in unpredictable ways.
The reason is that while a is being updated with a.T, a.T itself is changing (since it is a view of a, not a copy), so some entries of a are updated with values that have already been modified.
If you want to symmetrize a matrix in-place, you could do the following:
a = np.random.rand(4,4)
a[np.tril_indices_from(a)] = a.T[np.tril_indices_from(a)]
Or, if you want to stick to your notation:
a += a.T.copy()
since copy creates a temporary copy of a.T that is not affected by the update.
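As a quick check of the explicit-copy fix, the sketch below compares it with the out-of-place formula. The helper name asymmetry and the fixed seed are only for illustration. np.shares_memory confirms that a.T is a view onto the same buffer as a. As far as I know, NumPy 1.13 and later detect this kind of overlap and also give the correct result for a plain a += a.T, but the explicit copy works on any version.

```python
import numpy as np

def asymmetry(x):
    """Squared Frobenius norm of x - x.T (zero for a symmetric matrix)."""
    return np.sum((x - x.T) ** 2)

rng = np.random.default_rng(0)        # fixed seed, illustration only
dim = 100
m = rng.standard_normal((dim, dim))   # deliberately non-symmetric input

# The transpose is a view: it shares its memory with m.
print(np.shares_memory(m, m.T))       # True

# In-place symmetrization with an explicit copy of the transpose.
a = m.copy()
a += a.T.copy()
a *= 0.5

# Out-of-place symmetrization for comparison.
b = 0.5 * (m + m.T)

print(asymmetry(a), asymmetry(b))
```

Both variants compute (m[i,j] + m[j,i]) / 2 for every entry, so they should agree with each other.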
I wrote the following function, which takes as inputs three 1D arrays (namely int_array, x, and y) and a number lim. The output is a number as well.
def integrate_to_lim(int_array, x, y, lim):
    if lim >= np.max(x):
        res = 0.0
    elif lim <= np.min(x):
        res = int_array[0]
    else:
        index = np.argmax(x > lim) # To find the first element of x larger than lim
        partial = int_array[index]
        slope = (y[index-1] - y[index]) / (x[index-1] - x[index])
        rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
        res = partial + rest
    return res
Basically, outside of the limit cases lim>=np.max(x) and lim<=np.min(x), the idea is that the function finds the index of the first value of the array x larger than lim and then uses it to make some simple calculations.
In my case, however, lim can also be a fairly big 2D array (shape ~2000 by ~1000 elements).
I would like to rewrite it so that it makes the same calculations when lim is a 2D array.
Obviously, the output should also be a 2D array with the same shape as lim.
I am having a real struggle figuring out how to vectorize it.
I would like to stick only to the numpy package.
PS I want to vectorize my function because efficiency is important and as I understand using for loops is not a good choice in this regard.
Edit: my attempt
I was not aware of the function np.take, which made the task way easier.
Here is my brutal attempt that seems to work (suggestions on how to clean up or to make the code faster are more than welcome).
def integrate_to_lim_vect(int_array, x, y, lim_mat):
    lim_mat = np.asarray(lim_mat) # Make sure that it is an array
    shape_3d = list(lim_mat.shape) + [1]
    x_3d = np.ones(shape_3d) * x # 3 dimensional version of x
    lim_3d = np.expand_dims(lim_mat, axis=2) * np.ones(x_3d.shape) # also 3d
    # I use np.argmax on the 3d matrices (is there a simpler way?)
    index_mat = np.argmax(x_3d > lim_3d, axis=2)
    # Silly calculations
    partial = np.take(int_array, index_mat)
    y1_mat = np.take(y, index_mat)
    y2_mat = np.take(y, index_mat - 1)
    x1_mat = np.take(x, index_mat)
    x2_mat = np.take(x, index_mat - 1)
    slope = (y1_mat - y2_mat) / (x1_mat - x2_mat)
    rest = (x1_mat - lim_mat) * (y1_mat + (lim_mat - x1_mat) * slope / 2.0)
    res = partial + rest
    # Make the cases with np.select
    condlist = [lim_mat >= np.max(x), lim_mat <= np.min(x)]
    choicelist = [0.0, int_array[0]] # Should these options be a 2d matrix?
    output = np.select(condlist, choicelist, default=res)
    return output
I am aware that if the limit is larger than the maximum value in the array np.argmax returns the index zero (leading to wrong results). This is why I used np.select to check and correct for these cases.
Is it necessary to define the three-dimensional matrices x_3d and lim_3d, or is there a simpler way to find the 2D matrix of indices index_mat?
Suggestions, especially to improve the way I expanded the dimension of the arrays, are welcome.
I think you can solve this using two tricks. First, a 2d array can be easily flattened to a 1d array, and then your answers can be converted back into a 2d array with reshape.
Next, your use of argmax suggests that your array is sorted. Then you can find your full set of indices using digitize. Thus instead of a single index, you will get a complete array of indices. All the calculations you are doing are intrinsically supported as array operations in numpy, so that should not cause any problems.
You will have to specifically look at the limiting cases. If those are rare enough, then it might be okay to let the answers be derived by the default formula (they will be garbage values), and then replace them with the actual values you desire.
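A minimal sketch of this idea, assuming x is sorted in strictly increasing order (the use of argmax suggests this, but the question does not state it). The function name integrate_to_lim_digitize is made up for illustration. np.digitize accepts an input of any shape in current NumPy, so the flatten/reshape step is skipped here; the limiting cases are computed with the default formula on clipped indices and then overwritten.

```python
import numpy as np

def integrate_to_lim_digitize(int_array, x, y, lim):
    """Vectorized over lim; assumes x is strictly increasing (an assumption)."""
    int_array = np.asarray(int_array, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lim = np.asarray(lim, dtype=float)

    # For increasing bins, digitize returns the index of the first
    # element of x that is strictly larger than lim.
    index = np.digitize(lim, x)
    # Keep indices valid for out-of-range limits; those entries hold
    # garbage for now and are replaced below.
    index = np.clip(index, 1, len(x) - 1)

    slope = (y[index - 1] - y[index]) / (x[index - 1] - x[index])
    rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
    res = int_array[index] + rest

    # Limiting cases, as in the scalar function.
    res = np.where(lim >= x[-1], 0.0, res)
    res = np.where(lim <= x[0], int_array[0], res)
    return res

# Hypothetical example data, only to show the call.
x_demo = np.linspace(0.0, 1.0, 11)
y_demo = x_demo ** 2 + 1.0
int_demo = np.linspace(5.0, 0.0, 11)
lim_demo = np.array([[-0.5, 0.0, 0.05], [0.33, 1.0, 2.0]])
print(integrate_to_lim_digitize(int_demo, x_demo, y_demo, lim_demo).shape)
```

This avoids building the 3D helper arrays, so memory use stays proportional to the size of lim.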
NumPy arrays can be stored in different contiguous layouts (C- and F-). When using numpy.swapaxes(), the layout changes. I need to add two multidimensional arrays (3D, to be specific), one of which comes from another array with swapped axes. What I've noticed is that when the first axis is swapped with the last axis of a 3D array, the layout changes from C- to F-contiguous. And adding two arrays with different layouts is extremely slow (~6 times slower than adding two C-contiguous arrays). However, if other axes are swapped (0-1 or 1-2), the resulting array has false flags for both C- and F-contiguity (it is non-contiguous). The weird thing to me is that adding one C-contiguous array and one array that is neither C- nor F-contiguous is in fact only slightly slower than adding two arrays of the same type. Here are my two questions:
Why is there such a difference between C- plus F-contiguous array addition and C- plus non-contiguous array addition? Is it caused by a different rearranging mechanism, or simply because the rearranging distance between C- and F-contiguous is the longest among all possible axis orders?
If I have to add a C-contiguous array and an F-contiguous/non-contiguous array, what is the best way to speed it up?
Below is a minimal example of what I encountered. The four printed durations on my computer are 2.0s (C-contiguous + C-contiguous), 12.4s (C-contiguous + F-contiguous), 3.4s (C-contiguous + non-contiguous) and 3.3s (C-contiguous + non-contiguous).
import numpy as np
import time
a = np.random.random((300, 400, 500)) # C-contiguous
b = np.swapaxes(np.random.random((500, 400, 300)), 0, 2) # F-contiguous
c = np.swapaxes(np.random.random((300, 500, 400)), 1, 2) # Non-contiguous
d = np.swapaxes(np.random.random((400, 300, 500)), 0, 1) # Non-contiguous
t = time.time()
for n in range(10):
    result = a + a
print(time.time() - t)
t = time.time()
for n in range(10):
    result = a + b
print(time.time() - t)
t = time.time()
for n in range(10):
    result = a + c
print(time.time() - t)
t = time.time()
for n in range(10):
    result = a + d
print(time.time() - t)
These types (C and F) denote whether a matrix (or multi-dimensional array) is stored in row-major order (C, as in the C language, which uses row-major storage) or column-major order (F, as in Fortran, which uses column-major storage).
Neither layout is inherently faster than the other. It is just an abstraction layer, and on its own either one gives the same performance.
However, what makes an enormous difference is whether arrays are contiguous or not. If they are contiguous, you get good timings because of caching effects, vectorization and other optimizations that the compiler might apply.
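One way to see what swapaxes does to the layout is to inspect the flags attribute, and one possible way to speed up repeated additions is to pay once for a copy into C order with np.ascontiguousarray. The sketch below uses smaller shapes than the question. Whether the extra copy pays off depends on how often the converted array is reused, so it should be timed on the real data.

```python
import numpy as np

a = np.random.random((30, 40, 50))                      # C-contiguous
b = np.swapaxes(np.random.random((50, 40, 30)), 0, 2)   # F-contiguous view
c = np.swapaxes(np.random.random((30, 50, 40)), 1, 2)   # neither C nor F

for name, arr in (("a", a), ("b", b), ("c", c)):
    print(name, arr.flags['C_CONTIGUOUS'], arr.flags['F_CONTIGUOUS'])

# Convert once to C order, then reuse the converted array in later sums.
b_c = np.ascontiguousarray(b)
result = a + b_c
```

The values are unchanged by the conversion; only the memory order differs, so a + b and a + b_c give the same result.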
Can a single numpy einsum statement replicate gemm functionality? Scalar and matrix multiplication seem straightforward, but I haven't found how to get the "+" working. In case it's simpler, D = alpha * A * B + beta * C would be acceptable (preferable, actually).
alpha = 2
beta = 3
A = np.arange(9).reshape(3, 3)
B = A + 1
C = B + 1
left_part = alpha*np.dot(A, B)
left_part = np.einsum(',ij,jk->ik', alpha, A, B)
There seems to be some confusion here: np.einsum handles operations that can be cast in the following form: broadcast–multiply–reduce. Element-wise summation is not part of its scope.
The reason why you need this sort of thing for the multiplication is that writing these operations out "naively" may exceed memory or computing resources quickly. Consider, for example, matrix multiplication:
import numpy as np
x, y = np.ones((2, 2000, 2000))
# explicit loop - ridiculously slow
a = sum(x[:,j,np.newaxis] * y[j,:] for j in range(2000))
# explicit broadcast-multiply-reduce: throws MemoryError
a = (x[:,:,np.newaxis] * y[np.newaxis,:,:]).sum(1)
# einsum or dot: fast and memory-saving
a = np.einsum('ij,jk->ik', x, y)
The Einstein convention, however, factorizes over addition, so you can write your BLAS-like problem simply as:
d = np.einsum(',ij,jk->ik', alpha, a, b) + np.einsum(',ik', beta, c)
with minimal memory overhead (you can rewrite most of it as in-place operations if you are really concerned about memory) and constant runtime overhead (the cost of two python-to-C calls).
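As a small check of the two-call form, here is a sketch using the arrays from the question. The names D_einsum and D_ref are made up for illustration, and the reference is plain np.dot arithmetic.

```python
import numpy as np

alpha = 2
beta = 3
A = np.arange(9).reshape(3, 3)
B = A + 1
C = B + 1

# GEMM-like update written as two einsum calls joined by an ordinary "+".
# The empty subscript before the first comma stands for the scalar operand.
D_einsum = np.einsum(',ij,jk->ik', alpha, A, B) + np.einsum(',ik', beta, C)

# Reference result with ordinary NumPy arithmetic.
D_ref = alpha * np.dot(A, B) + beta * C

print(np.array_equal(D_einsum, D_ref))
```

The addition happens outside einsum, which matches the point above that element-wise summation is not something einsum itself expresses.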
So regarding performance, this seems, respectfully, like a case of premature optimization to me: have you actually verified that the split of GEMM-like operations into two separate numpy calls is a bottleneck in your code? If it indeed is, then I suggest the following (in order of increasing involvedness):
Try, carefully!, scipy.linalg.blas.dgemm. I would be surprised if you get significantly better performance, since dgemms are usually only building blocks themselves.
Try an expression compiler (essentially you are proposing such a thing) like Theano.
Write your own generalised ufunc using Cython or C.
I have a 3D numpy array like a = np.zeros((100,100, 20)). I want to perform an operation over every x,y position that involves all the elements over the z axis and the result is stored in an array like b = np.zeros((100,100)) on the same corresponding x,y position.
Right now I'm doing it using a for loop:
d_n = np.array([...]) # a parameter with the same shape as b
for (x,y), v in np.ndenumerate(b):
    C = a[x,y,:]
    ### calculate some_value using C
    minv = sys.maxint
    depth = -1
    C = a[x,y,:]
    for d in range(len(C)):
        e = 2.5 * float(math.pow(d_n[x,y] - d, 2)) + C[d] * 0.05
        if e < minv:
            minv = e
            depth = d
    some_value = depth
    if depth == -1:
        some_value = len(C) - 1
    b[x,y] = some_value
The problem now is that this operation is much slower than others done the pythonic way, e.g. c = b * b (I actually profiled this function and it's around 2 orders of magnitude slower than others using numpy built in functions and vectorized functions, over a similar number of elements)
How can I improve the performance of such kind of functions mapping a 3D array to a 2D one?
What is usually done in 3D images is to swap the Z axis to the first index:
>>> a = a.transpose((2,0,1))
>>> a.shape
(20, 100, 100)
And now you can easily iterate over the Z axis:
>>> for slice in a:
...     do something
The slice here will be each of the 100x100 slices of your 3D matrix. Additionally, transposing allows you to access each of the 2D slices directly by indexing the first axis. For example, a[10] will give you the 11th 2D 100x100 slice.
Bonus: If you store the data contiguously so that no transposing is needed (or convert to a contiguous array using a = np.ascontiguousarray(a.transpose((2,0,1)))), access to your 2D slices will be faster, since they are mapped contiguously in memory.
Obviously you want to get rid of the explicit for loop, but I think whether this is possible depends on what calculation you are doing with C. As a simple example,
a = np.zeros((100,100, 20))
a[:,:] = np.linspace(1,20,20) # example data: 1,2,3,.., 20 as "z" for every "x","y"
b = np.sum(a[:,:]**2, axis=2)
will fill the 100 by 100 array b with the sum of the squared "z" values of a, that is 1+4+9+...+400 = 2870.
If your inner calculation is sufficiently complex, and not amenable to vectorization, then your iteration structure is good, and does not contribute significantly to the calculation time:
for (x,y), v in np.ndenumerate(b):
    C = a[x,y,:]
    for d in range(len(C)):
        ... # complex, not vectorizable calc
    b[x,y] = some_value
There doesn't appear to be a special structure in the 1st 2 dimensions, so you could just as well think of it as 2D mapping on to 1D, e.g. mapping a (N,20) array onto a (N,) array. That doesn't speed up anything, but may help highlight the essential structure of the problem.
One step is to focus on speeding up that C to some_value calculation. There are functions like cumsum and cumprod that help you do sequential calculations on a vector. cython is also a good tool.
A different approach is to see if you can perform that internal calculation over the N values all at once. In other words, if you must iterate, it is better to do so over the smallest dimension.
In a sense this is a non-answer. But without full knowledge of how you get some_value from C and d_n, I don't think we can do more.
It looks like e can be calculated for all points at once:
e = 2.5 * float(math.pow(d_n[x,y] - d, 2)) + C[d] * 0.05
E = 2.5 * (d_n[...,None] - np.arange(a.shape[-1]))**2 + a * 0.05 # (100,100,20)
E.min(axis=-1) # smallest value along the last dimension
E.argmin(axis=-1) # index of where that min occurs
At first glance it looks like this E.argmin is the b value that you want (tweaked for some boundary conditions if needed).
I don't have realistic a and d_n arrays, but with simple test ones, this E.argmin(-1) matches your b, with a 66x speedup.
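A sketch of such a comparison on small random test arrays. The shapes, the seed, and the function names depth_loop and depth_vectorized are made up for illustration. The depth == -1 fallback from the question is left out, on the assumption that all values of e are finite and far below sys.maxint.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustration only
a = rng.random((6, 5, 20))                # small stand-in for the (100, 100, 20) array
d_n = rng.uniform(0, 19, size=(6, 5))     # parameter with the same shape as b

def depth_loop(a, d_n):
    """Loop version: index of the smallest e along the last axis."""
    b = np.zeros(a.shape[:2], dtype=int)
    for (x, y), _ in np.ndenumerate(b):
        C = a[x, y, :]
        best, depth = np.inf, -1
        for d in range(len(C)):
            diff = d_n[x, y] - d
            e = 2.5 * (diff * diff) + C[d] * 0.05
            if e < best:
                best, depth = e, d
        b[x, y] = depth
    return b

def depth_vectorized(a, d_n):
    """Same calculation for all (x, y) at once via broadcasting."""
    diff = d_n[..., None] - np.arange(a.shape[-1])   # shape (6, 5, 20)
    E = 2.5 * (diff * diff) + a * 0.05
    return E.argmin(axis=-1)

print(np.array_equal(depth_loop(a, d_n), depth_vectorized(a, d_n)))
```

Both versions pick the first index at which the minimum occurs, so they also agree when there are ties.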
How can I improve the performance of such kind of functions mapping a 3D array to a 2D one?
Many functions in Numpy are "reduction" functions*, for example sum, any, std, etc. If you supply an axis argument other than None to such a function it will reduce the dimension of the array over that axis. For your code you can use the argmin function, if you first calculate e in a vectorized way:
d = np.arange(a.shape[2])
e = 2.5 * (d_n[...,None] - d)**2 + a*0.05
b = np.argmin(e, axis=2)
The indexing with [...,None] is used to engage broadcasting. The values in e are floating point values, so it's a bit strange to compare to sys.maxint but there you go:
I, J = np.indices(b.shape)
b[e[I,J,b] >= sys.maxint] = a.shape[2] - 1
* Strictly speaking, a reduction function is of the form reduce(operator, sequence), so technically not std and argmin.
As part of a complex task, I need to compute matrix cofactors. I did this in a straightforward way using this nice code for computing matrix minors. Here is my code:
def matrix_cofactor(matrix):
    C = np.zeros(matrix.shape)
    nrows, ncols = C.shape
    for row in xrange(nrows):
        for col in xrange(ncols):
            minor = matrix[np.array(range(row)+range(row+1,nrows))[:,np.newaxis],
                           np.array(range(col)+range(col+1,ncols))]
            C[row, col] = (-1)**(row+col) * np.linalg.det(minor)
    return C
It turns out that this matrix cofactor code is the bottleneck, and I would like to optimize the code snippet above. Any ideas as to how to do this?
If your matrix is invertible, the cofactor is related to the inverse:
def matrix_cofactor(matrix):
return np.linalg.inv(matrix).T * np.linalg.det(matrix)
This gives large speedups (~ 1000x for 50x50 matrices). The main reason is fundamental: this is an O(n^3) algorithm, whereas the minor-det-based one is O(n^5).
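The relation follows from A · adj(A) = det(A) · I, where the adjugate adj(A) is the transpose of the cofactor matrix. A small sanity check of the inverse-based formula, using an arbitrary invertible 3x3 matrix chosen for illustration:

```python
import numpy as np

def matrix_cofactor(matrix):
    # Valid only for invertible matrices: cof(A) = det(A) * inv(A).T
    return np.linalg.inv(matrix).T * np.linalg.det(matrix)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 7.0, 1.0]])   # example matrix, det(A) = 3

cof = matrix_cofactor(A)

# Adjugate identity: A @ cof.T should equal det(A) * I up to rounding.
print(np.allclose(A @ cof.T, np.linalg.det(A) * np.eye(3)))
```

For matrices that are singular or nearly singular the inverse is unavailable or inaccurate, so this check is only meaningful for well-conditioned inputs.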
This probably means that for non-invertible matrices, too, there is some clever way to calculate the cofactor (i.e., not using the mathematical formula that you use above, but some other equivalent definition).
If you stick with the det-based approach, what you can do is the following:
The majority of the time seems to be spent inside det. (Check out line_profiler to find this out yourself.) You can try to speed that part up by linking Numpy with the Intel MKL, but other than that, there is not much that can be done.
You can speed up the other part of the code like this:
minor = np.zeros([nrows-1, ncols-1])
for row in xrange(nrows):
    for col in xrange(ncols):
        minor[:row,:col] = matrix[:row,:col]
        minor[row:,:col] = matrix[row+1:,:col]
        minor[:row,col:] = matrix[:row,col+1:]
        minor[row:,col:] = matrix[row+1:,col+1:]
This gains some 10-50% in total runtime, depending on the size of your matrices. The original code has Python range and list manipulations, which are slower than direct slice indexing. You could also try to be more clever and copy only the parts of the minor that actually change. However, already after the above change, close to 100% of the time is spent inside numpy.linalg.det, so further optimization of the other parts does not make much sense.
The calculation of np.array(range(row)+range(row+1,nrows))[:,np.newaxis] does not depend on col, so you could move it outside the inner loop and cache the value. Depending on the number of columns you have, this might give a small optimization.
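A sketch of that hoisting idea in Python 3 syntax, using np.delete instead of index arrays (a swapped-in technique, not the code from the question). The function name cofactor_by_minors is made up for illustration. The row removal is done once per row and reused for every column. No speedup is claimed, since det is likely to dominate the runtime either way.

```python
import numpy as np

def cofactor_by_minors(matrix):
    """Cofactor matrix via explicit minors; works for singular input too."""
    matrix = np.asarray(matrix, dtype=float)
    nrows, ncols = matrix.shape
    C = np.zeros((nrows, ncols))
    for row in range(nrows):
        # Dropping the row does not depend on col, so do it once per row.
        rows_kept = np.delete(matrix, row, axis=0)
        for col in range(ncols):
            minor = np.delete(rows_kept, col, axis=1)
            C[row, col] = (-1) ** (row + col) * np.linalg.det(minor)
    return C

A = np.array([[1, 2, 0],
              [0, 3, 0],
              [0, 7, 1]])
print(cofactor_by_minors(A))
```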
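Instead of using the inverse and determinant, I'd suggest using the SVD:
def cofactors(A):
    U,sigma,Vt = np.linalg.svd(A)
    N = len(sigma)
    g = np.tile(sigma,N)
    g[::(N+1)] = 1  # row i of the (N,N) reshape is sigma with sigma[i] replaced by 1
    # cof(A) = det(U)*det(Vt) * U cof(diag(sigma)) Vt; the sign factor is +1 or -1
    sign = np.linalg.det(U) * np.linalg.det(Vt)
    G = np.diag(sign*np.prod(np.reshape(g,(N,N)),1))
    return U @ G @ Vt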
from sympy import *
A = Matrix([[1,2,0],[0,3,0],[0,7,1]])
And the output (which is cofactor matrix) is:
Matrix([
[ 3, 0,  0],
[-2, 1, -7],
[ 0, 0,  3]])