For performance reasons,
I'm curious if there is a way to multiply a stack of a stack of matrices. I have a 4-D array (500, 201, 2, 2). It's basically a 500-long stack of (201,2,2) matrices where, for each of the 500, I want to multiply the adjacent matrices using einsum and get another (201,2,2) matrix.
I am only doing matrix multiplication on the [2x2] matrices at the end. Since my explanation is already heading off the rails, I'll just show what I'm doing now, along with the 'reduce' equivalent and why it's not helpful (because it's computationally the same speed). Preferably this would be a numpy one-liner, but I don't know what that is, or even if it's possible.
Code:
import numpy as np
from functools import reduce

Arr = np.random.rand(500, 201, 2, 2)

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1, len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)

A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
print(np.all(A1 == A2))
print(A1.shape); print(A2.shape)
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
Returns:
True
(201, 2, 2)
(201, 2, 2)
10 loops, best of 3: 34.8 ms per loop
10 loops, best of 3: 35.2 ms per loop
Any help would be appreciated. Things are functional, but when I have to iterate this over a large series of parameters, the code tends to take a long time, and I'm wondering if there's a way to avoid the 500 iterations through a loop.
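A hedged aside: since matrix multiplication is associative, one way to shrink the Python-level loop is a pairwise (tree) reduction, which needs only about log2(500) ≈ 9 einsum calls instead of 499. A minimal sketch of my own (untimed, not from the answers below):

import numpy as np

def tree_mult(arr):
    # Fold the (n, 201, 2, 2) stack pairwise; each pass halves n,
    # so only ~log2(n) einsum calls are made instead of n-1.
    while arr.shape[0] > 1:
        if arr.shape[0] % 2:          # odd count: set the last matrix aside
            head, tail = arr[:-1], arr[-1:]
            head = np.einsum('nfij,nfjk->nfik', head[::2], head[1::2])
            arr = np.concatenate([head, tail])
        else:
            arr = np.einsum('nfij,nfjk->nfik', arr[::2], arr[1::2])
    return arr[0]

# should match loopMult(Arr) up to floating-point error:
# np.allclose(tree_mult(Arr), loopMult(Arr))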
I don't think it's possible to do this efficiently using numpy (the cumprod solution was elegant, though). This is the sort of situation where I would use f2py. It's the simplest way of calling a faster language that I know of and only requires a single extra file.
fortran.f90:
subroutine multimul(a, b)
  implicit none
  real(8), intent(in)  :: a(:,:,:,:)
  real(8), intent(out) :: b(size(a,1),size(a,2),size(a,3))
  real(8) :: work(size(a,1),size(a,2))
  integer :: i, j
  !$omp parallel do private(work,i,j)
  do i = 1, size(b,3)
    b(:,:,i) = a(:,:,i,size(a,4))
    do j = size(a,4)-1, 1, -1
      work = matmul(b(:,:,i), a(:,:,i,j))
      b(:,:,i) = work
    end do
  end do
end subroutine
Compile with f2py -c -m fmuls fortran.f90 (or F90FLAGS="-fopenmp" f2py -c -m fmuls fortran.f90 -lgomp to enable OpenMP acceleration). Then you would use it in your script as
import numpy as np
from functools import reduce
import fmuls

Arr = np.random.standard_normal([500, 201, 2, 2])

def loopMult(Arr):
    ArrMult = Arr[0]
    for i in range(1, len(Arr)):
        ArrMult = np.einsum('fij,fjk->fik', ArrMult, Arr[i])
    return ArrMult

def myeinsum(A1, A2):
    return np.einsum('fij,fjk->fik', A1, A2)

A1 = loopMult(Arr)
A2 = reduce(myeinsum, Arr)
A3 = fmuls.multimul(Arr.T).T
print(np.allclose(A1, A2))
print(np.allclose(A1, A3))
%timeit loopMult(Arr)
%timeit reduce(myeinsum, Arr)
%timeit fmuls.multimul(Arr.T).T
Which outputs
True
True
10 loops, best of 3: 48.4 ms per loop
10 loops, best of 3: 48.8 ms per loop
100 loops, best of 3: 5.82 ms per loop
So that's a factor-of-8 speedup. The reason for all the transposes is that f2py implicitly transposes all the arrays, and we need to transpose them manually to tell it that our Fortran code expects things to be transposed. This avoids a copy operation. The cost is that each of our 2x2 matrices is transposed, so to avoid performing the wrong operation we have to loop in reverse.
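As a quick illustration of why passing Arr.T avoids the copy (a check of my own, not part of the original answer): the .T of a C-contiguous array is already a Fortran-contiguous view, which is exactly the layout f2py wants:

import numpy as np
Arr = np.zeros((500, 201, 2, 2))     # C-contiguous by default
print(Arr.flags['C_CONTIGUOUS'])     # True
print(Arr.T.flags['F_CONTIGUOUS'])   # True: .T is a zero-copy view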
Speedups greater than 8 should be possible; I didn't spend any time trying to optimize this.
Related
I have two 3d arrays A and B with shape (N, 2, 2) that I would like to multiply along the N-axis, taking the matrix product of each pair of 2x2 matrices. With a loop implementation, it looks like
C[i] = dot(A[i], B[i])
Is there a way I could do this without using a loop? I've looked into tensordot, but haven't been able to get it to work. I think I might want something like tensordot(a, b, axes=([1,2], [2,1])) but that's giving me an NxN matrix.
It seems you are doing matrix multiplications for each slice along the first axis. For that, you can use np.einsum like so -
np.einsum('ijk,ikl->ijl',A,B)
We can also use np.matmul -
np.matmul(A,B)
On Python 3.x, this matmul operation simplifies with the @ operator -
A @ B
Benchmarking
Approaches -
def einsum_based(A,B):
    return np.einsum('ijk,ikl->ijl',A,B)

def matmul_based(A,B):
    return np.matmul(A,B)

def forloop(A,B):
    N = A.shape[0]
    C = np.zeros((N,2,2))
    for i in range(N):
        C[i] = np.dot(A[i], B[i])
    return C
Timings -
In [44]: N = 10000
...: A = np.random.rand(N,2,2)
...: B = np.random.rand(N,2,2)
In [45]: %timeit einsum_based(A,B)
...: %timeit matmul_based(A,B)
...: %timeit forloop(A,B)
100 loops, best of 3: 3.08 ms per loop
100 loops, best of 3: 3.04 ms per loop
100 loops, best of 3: 10.9 ms per loop
You just need to perform the operation on the first dimension of your tensors, which is labeled by 0:
c = tensordot(a, b, axes=(0,0))
This will work as you wish. Also, you don't need a list of axes, because you're performing the operation along just one dimension. With axes=([1,2],[2,1]) you're cross-multiplying the 2nd and 3rd dimensions. If you write it in index notation (Einstein summation convention) this corresponds to c[i,j] = a[i,k,l]*b[j,k,l], thus you're contracting the indices you want to keep.
EDIT: Ok, the problem is that the tensor product of two 3d objects is a 6d object. Since contractions involve pairs of indices, there's no way you'll get a 3d object out of a single tensordot operation. The trick is to split your calculation in two: first do the tensordot on the index pair that performs the matrix operation, then take a tensor diagonal in order to reduce your 4d object to 3d. In one command:
d = np.diagonal(np.tensordot(a, b, axes=(2,1)), axis1=0, axis2=2)
In tensor notation d[i,j,k] = c[i,j,i,k] = a[i,j,l]*b[i,l,k].
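A small check of this against the plain loop (a sketch of mine; note that np.diagonal moves the diagonal axis to the end, so the result comes back as (2, 2, N) and needs a transpose to compare):

import numpy as np
N = 5
a = np.random.rand(N, 2, 2)
b = np.random.rand(N, 2, 2)
c_loop = np.array([a[i].dot(b[i]) for i in range(N)])
d = np.diagonal(np.tensordot(a, b, axes=(2, 1)), axis1=0, axis2=2)
print(d.shape)                                    # (2, 2, N)
print(np.allclose(c_loop, d.transpose(2, 0, 1)))  # True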
I have these variables with the following dimensions:
A - (3,)
B - (4,)
X_r - (3,K,N,nS)
X_u - (4,K,N,nS)
k - (K,)
and I want to compute (A.dot(X_r[:,:,n,s])*B.dot(X_u[:,:,n,s])).dot(k) for every possible n and s. The way I am doing it now is the following:
np.array([[(A.dot(X_r[:,:,n,s])*B.dot(X_u[:,:,n,s])).dot(k) for n in range(N)] for s in range(nS)]) #nSxN
But this is super slow, and I was wondering if there is a better way of doing it.
However, there is another computation I am doing, and I am sure it can be optimized:
np.sum(np.array([(X_r[:,:,n,s]*B.dot(X_u[:,:,n,s])).dot(k) for n in range(N)]),axis=0)
In this one I am creating a numpy array just to sum it along one axis and then discard it. If this were a 1-D list I would use reduce and optimize it; what should I use for numpy arrays?
Using a few np.einsum calls -
# Calculation of A.dot(X_r[:,:,n,s])
p1 = np.einsum('i,ijkl->jkl',A,X_r)
# Calculation of B.dot(X_u[:,:,n,s])
p2 = np.einsum('i,ijkl->jkl',B,X_u)
# Include .dot(k) part to get the final output
out = np.einsum('ijk,i->kj',p1*p2,k)
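A quick sanity check of those three einsum calls against the original double loop, with small random shapes (a sketch of mine):

import numpy as np
K, N, nS = 5, 6, 7
A = np.random.rand(3); B = np.random.rand(4); k = np.random.rand(K)
X_r = np.random.rand(3, K, N, nS)
X_u = np.random.rand(4, K, N, nS)
ref = np.array([[(A.dot(X_r[:,:,n,s])*B.dot(X_u[:,:,n,s])).dot(k)
                 for n in range(N)] for s in range(nS)])
p1 = np.einsum('i,ijkl->jkl', A, X_r)
p2 = np.einsum('i,ijkl->jkl', B, X_u)
out = np.einsum('ijk,i->kj', p1*p2, k)
print(np.allclose(ref, out))   # True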
About the second example, this solves it:
p1 = np.einsum('i,ijkl->jkl',B,X_u)   # shape (K, N, nS)
sol = np.einsum('ijkl,j->il',X_r*p1[None,:,:,:],k)   # shape (3, nS)
You can use dot for multiplication of matrices in higher dimensions, but the running indices must be the last two.
When we reorder your matrices
X_r_t = X_r.transpose(2,3,0,1)
X_u_t = X_u.transpose(2,3,0,1)
we obtain for your first expression
res1_imp = (A.dot(X_r_t)*B.dot(X_u_t)).dot(k).T # shape nS x N
and for the second expression
res2_imp = np.sum((X_r_t * B.dot(X_u_t)[:,:,None,:]).dot(k),axis=0)[-1]
Timings
Divakar's solution gives on my computer 10000 loops, best of 3: 21.7 µs per loop
my solution gives 10000 loops, best of 3: 101 µs per loop
Edit
My timings above included the computation of both expressions. When I include only the first expression (as Divakar does), I obtain 10000 loops, best of 3: 41 µs per loop, which is still slower but closer to his timings.
In Matlab, the builtin isequal does a check if two arrays are equal. If they are not equal, this might be very fast, as the implementation presumably stops checking as soon as there is a difference:
>> A = zeros(1e9, 1, 'single');
>> B = A(:);
>> B(1) = 1;
>> tic; isequal(A, B); toc;
Elapsed time is 0.000043 seconds.
Is there any equivalent in Python/numpy? all(A==B) or all(equal(A, B)) is far slower, because it compares all elements, even if the first one differs:
In [13]: A = zeros(10**9, dtype='float32')
In [14]: B = A.copy()
In [15]: B[0] = 1
In [16]: %timeit all(A==B)
1 loops, best of 3: 612 ms per loop
Is there any numpy equivalent? It should be very easy to implement in C, but slow to implement in Python because this is a case where we do not want to broadcast, so it would require an explicit loop.
Edit:
It appears array_equal does what I want. However, it is not faster than all(A==B), because it's not a built-in, but just a short Python function doing A==B. So it does not meet my need for a fast check.
In [12]: %timeit array_equal(A, B)
1 loops, best of 3: 623 ms per loop
First, it should be noted that in the OP's example the arrays have identical elements, because B = A[:] is just a view onto the array, so:
>>> print(A[0], B[0])
1.0 1.0
But, although the test isn't a fair one, the basic complaint is true: numpy does not have a short-circuiting equality check.
One can easily see from the source that allclose, array_equal, and array_equiv are all just variations on all(A==B), adapted to their respective details, and are not notably faster.
An advantage of numpy, though, is that slices are just views and are therefore very fast, so one can write a short-circuiting comparison fairly easily (I'm not saying this is ideal, but it does work):
from numpy import *

A = zeros(10**8, dtype='float32')
B = A[:]
B[0] = 1
C = array(B)
C[0] = 2
D = array(A)
D[-1] = 2

def short_circuit_check(a, b, n):
    L = len(a) // n   # chunk length (integer division)
    for i in range(n):
        j = i * L
        if not all(a[j:j+L] == b[j:j+L]):
            return False
    return True
In [26]: %timeit short_circuit_check(A, C, 100) # 100x faster
1000 loops, best of 3: 1.49 ms per loop
In [27]: %timeit all(A==C)
1 loops, best of 3: 158 ms per loop
In [28]: %timeit short_circuit_check(A, D, 100)
10 loops, best of 3: 144 ms per loop
In [29]: %timeit all(A==D)
10 loops, best of 3: 160 ms per loop
I have an array (N = 10^4) and I need to find the difference between every pair of its entries (calculating a potential given the coordinates of the atoms).
Here is the code I am writing in pure Python, but it's really not efficient; can anyone tell me how to speed it up (using numpy or weave)? Here x, y are arrays of coordinates of the atoms (just simple 1D arrays).
def potential(r):
    U = 4.*(np.power(r,-12) - np.power(r,-6))
    return U

def total_energy(x):
    E = 0.
    # need to speed up this part
    for i in range(N-1):
        for j in range(i):
            E += potential(np.sqrt((x[i]-x[j])**2))
    return E
First, you can use array arithmetic:
def potential(r):
    return 4.*(r**(-12) - r**(-6))

def total_energy(x):
    E = 0.
    for i in range(N-1):
        E += potential(np.sqrt((x[i]-x[:i])**2)).sum()
    return E
or you can test the fully vectorized version:
def total_energy(x):
    b = np.diag(x).cumsum(1) - x
    return potential(abs(b[np.triu_indices_from(b, 1)])).sum()
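To see why the cumsum trick works, here is a tiny worked example (my addition): np.diag(x).cumsum(1) builds a matrix whose row i is zero up to column i and x[i] from there on, so subtracting x leaves x[i] - x[j] in every entry with j >= i:

import numpy as np
x = np.array([1., 2., 4.])
b = np.diag(x).cumsum(1) - x
print(b)
# [[ 0. -1. -3.]
#  [-1.  0. -2.]
#  [-1. -2.  0.]]
# upper triangle (i < j): b[i,j] == x[i] - x[j], i.e. all pairwise differences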
I would recommend looking into scipy.spatial.distance. Using pdist in particular computes all pairwise distances of an array.
I am assuming that you have an array that is of shape (Nx3), thus we need to slightly change your code:
def potential(r):
    U = 4.*(np.power(r,-12) - np.power(r,-6))
    return U

def total_energy(x):
    E = 0.
    # need to speed up this part
    for i in range(N):   # to N here
        for j in range(i):
            E += potential(np.sqrt(np.sum((x[i]-x[j])**2)))   # add sum here
    return E
Now let's rewrite this using spatial:
import scipy.spatial.distance as sd

def scipy_LJ(arr, sigma=None):
    """
    Computes the Lennard-Jones potential for an array (M x N) of M points
    in N dimensional space. Usage of a sigma parameter is optional.
    """
    if len(arr.shape) == 1:
        arr = arr[:, None]
    r = sd.pdist(arr)
    if sigma is None:
        np.power(r, -6, out=r)        # r now holds r**-6
        return np.sum(r**2 - r) * 4   # 4*(r**-12 - r**-6)
    else:
        r /= sigma                    # scale so that r**-6 becomes (sigma/r)**6
        np.power(r, -6, out=r)
        return np.sum(r**2 - r) * 4
Let's run some tests:
N = 1000
points = np.random.rand(N,3)+0.1
np.allclose(total_energy(points), scipy_LJ(points))
Out[43]: True
%timeit total_energy(points)
1 loops, best of 3: 13.6 s per loop
%timeit scipy_LJ(points)
10 loops, best of 3: 24.3 ms per loop
Now it is ~500 times faster!
N = 10000
points = np.random.rand(N,3)+0.1
%timeit scipy_LJ(points)
1 loops, best of 3: 3.05 s per loop
This used ~2GB of ram.
Here is the final answer, with some timings.
0) The Plain version (Really slow)
In [16]: %timeit total_energy(points)
1 loops, best of 3: 14.9 s per loop
1) SciPy version
In [9]: %timeit scipy_LJ(points)
10 loops, best of 3: 44 ms per loop
1-2) Numpy version
%timeit sum( potential(np.sqrt((x[i]-x[:i])**2 + (y[i]-y[:i])**2 + (z[i] - z[:i])**2)).sum() for i in range(N-1))
10 loops, best of 3: 126 ms per loop
2) Insanely fast Fortran version (! marks a comment)
subroutine EnergyForces(Pos, PEnergy, Dim, NAtom)
  implicit none
  integer, intent(in) :: Dim, NAtom
  real(8), intent(in), dimension(0:NAtom-1, 0:Dim-1) :: Pos
  ! real(8), intent(in) :: L
  real(8), intent(out) :: PEnergy
  real(8), dimension(Dim) :: rij, Posi
  real(8) :: d2, id2, id6, id12
  integer :: i, j
  PEnergy = 0.
  do i = 0, NAtom - 1
    ! store Pos(i,:) in a temporary array for faster access in the j loop
    Posi = Pos(i,:)
    do j = i + 1, NAtom - 1
      rij = Pos(j,:) - Posi
      ! rij = rij - L * dnint(rij / L)
      d2 = sum(rij * rij)      ! squared distance only; no sqrt needed
      id2 = 1. / d2            ! inverse squared distance
      id6 = id2 * id2 * id2    ! inverse sixth power
      id12 = id6 * id6         ! inverse twelfth power
      PEnergy = PEnergy + 4. * (id12 - id6)
    enddo
  enddo
end subroutine
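The build step isn't shown; assuming the file is saved as ljlib.f90, the ljlib module used below would be produced by something like:

$ f2py -c -m ljlib ljlib.f90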
After calling it:
In [14]: %timeit ljlib.energyforces(points.transpose(), 3, N)
10000 loops, best of 3: 61 us per loop
3) Conclusion: Fortran is roughly 700 times faster than scipy (61 µs vs 44 ms), about 2000 times faster than numpy, and millions of times faster than pure Python. That is because the scipy version creates a matrix of differences and then analyzes it, whereas the Fortran version does everything on the fly.
Thank you for your help. Here is what I have found.
The shortest version
return sum( potential(np.sqrt((x[i]-x[:i])**2)).sum() for i in range(N-1))
The scipy version is also good.
The fastest version one may consider is to use f2py: write the slow bottleneck part in pure Fortran (which is insanely fast), compile it, and then plug it into your python code as a library.
For example, I have program_lj.f90:
$ gfortran -c program_lj.f90
If one defines all the types explicitly in the Fortran program, we are good to go.
$ f2py -c -m program_lj program_lj.f90
After compilation, the only thing left is to call the program from python.
In the python program:
import program_lj
result = program_lj.subroutine_in_program(parameters)
In case you need a more general reference, please refer to the wonderful f2py webpage.
I have n documents in MongoDB containing a scipy sparse vector, stored as a pickle object and initially created with scipy.sparse.lil_matrix. The vectors are all of the same size, say p x 1.
What I need to do is to put all these vectors into a sparse n x p matrix back in python. I am using mongoengine and thus defined a property to load each pickled vector:
class MyClass(Document):
    vector_text = StringField()

    @property
    def vector(self):
        return cPickle.loads(self.vector_text)
Here's what I'm doing now, with n = 4700 and p = 67:
items = MyClass.objects()
M = items[0].vector
for item in items[1:]:
    to_add = item.vector
    M = scipy.sparse.hstack((M, to_add))
The loading part (i.e. calling the property n times) takes about 1.3 s; the stacking part about 2.7 s. Since n is going to increase seriously in the future (possibly to more than a few hundred thousand), I sense that this is not optimal :)
Any idea how to speed the whole thing up? If you know how to speed up only the "loading" or only the "stacking", I'm happy to hear it. For instance, maybe the solution is to store the entire matrix in MongoDB? Thanks!
First, what you describe would require vstack, not hstack. In any case, your choice of sparse format is part of your performance problem. Try the following:
import numpy as np
import scipy.sparse as sps

n, p = 4700, 67
csr_vecs = [sps.rand(1, p, density=0.5, format='csr') for j in range(n)]
lil_vecs = [vec.tolil() for vec in csr_vecs]
%timeit sps.vstack(csr_vecs, format='csr')
1 loops, best of 3: 722 ms per loop
%timeit sps.vstack(lil_vecs, format='lil')
1 loops, best of 3: 1.34 s per loop
So there's already a 2x improvement simply from switching to CSR. Furthermore, the stacking functions of scipy.sparse do not seem to be very optimized, definitely not for sparse vectors. The following two functions stack a list of CSR or LIL vectors, returning a CSR sparse matrix:
def csr_stack(vectors):
    data = np.concatenate([vec.data for vec in vectors])
    indices = np.concatenate([vec.indices for vec in vectors])
    indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
    return sps.csr_matrix((data, indices, indptr),
                          shape=(len(vectors), vectors[0].shape[1]))

import itertools as it

def lil_stack(vectors):
    indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
    data = np.fromiter(it.chain(*(vec.data[0] for vec in vectors)),
                       dtype=vectors[0].dtype, count=indptr[-1])
    indices = np.fromiter(it.chain(*(vec.rows[0] for vec in vectors)),
                          dtype=np.intp, count=indptr[-1])
    return sps.csr_matrix((data, indices, indptr),
                          shape=(len(vectors), vectors[0].shape[1]))
It works:
>>> np.allclose(sps.vstack(csr_vecs).A, csr_stack(csr_vecs).A)
True
>>> np.allclose(csr_stack(csr_vecs).A, lil_stack(lil_vecs).A)
True
And is substantially faster:
%timeit csr_stack(csr_vecs)
100 loops, best of 3: 11.7 ms per loop
%timeit lil_stack(lil_vecs)
10 loops, best of 3: 37.6 ms per loop
%timeit lil_stack(lil_vecs).tolil()
10 loops, best of 3: 53.6 ms per loop
So, by switching to CSR, you can improve performance by over 100x. If you stick with LIL, your performance improvement will be only around 30x, more if you can live with CSR in the combined matrix, less if you insist on LIL.
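Tying this back to the original setup, a hypothetical usage with the OP's mongoengine documents might look like this (the .tocsr() conversion is my assumption, since the stored vectors are LIL):

# load each pickled LIL vector once and convert to CSR
csr_vectors = [item.vector.tocsr() for item in MyClass.objects()]
M = csr_stack(csr_vectors)   # one (n, p) CSR matrix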
I think you should try to use a ListField, which is essentially a Python-list representation of a BSON array, to store your vectors. In that situation, you won't need to unpickle them every time.
class MyClass(Document):
    vector = ListField()

items = MyClass.objects()
M = items[0].vector
The only problem I can see with that solution is that you have to convert the Python lists to a scipy sparse vector type, but I believe that should be faster.
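A minimal sketch of that conversion (assuming each ListField holds a dense p-length row of plain floats):

import numpy as np
import scipy.sparse as sps
rows = [item.vector for item in MyClass.objects()]   # lists straight from MongoDB
M = sps.csr_matrix(np.asarray(rows, dtype=float))    # n x p sparse matrix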