I'm looking for a fast way to calculate a sum of n outer products.
Essentially, I start with two matrices generated from normal distributions - there are n vectors with v elements:
A = np.random.normal(size = (n, v))
B = np.random.normal(size = (n, v))
What I'd like is to calculate the outer products of each vector of size v in A and B and sum them together.
Note that A * B.T doesn't work - A is of shape n x v whereas B.T is of shape v x n, so the element-wise product doesn't broadcast.
The best I can do is create a loop where the outer products are constructed, then summed later. I have it like so:
outers = np.array([np.outer(A[i], B[i]) for i in range(n)])
This creates an n x v x v array (the loop is within the list comprehension, which is subsequently converted into an array), which I can then sum together by using np.sum(outers, axis = 0). However, this is quite slow, and I was wondering if there's a vectorized function I could use to speed this up.
If anybody has any advice, I would really appreciate it!
It seems to me all you need to do is change which matrix you transpose and take the matrix product A.T.dot(B) (equivalently A.T @ B) instead of A * B.T.
If that's not quite what you are after, take a look at np.einsum, which can do some very powerful voodoo. For the above example, you would do:
np.einsum('ij,ik->jk', A, B)
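For a quick sanity check, here is a minimal sketch (small n and v chosen arbitrarily) showing that the explicit loop, the matrix product, and the einsum call all agree:

import numpy as np

n, v = 100, 5
A = np.random.normal(size=(n, v))
B = np.random.normal(size=(n, v))

loop = sum(np.outer(A[i], B[i]) for i in range(n))  # explicit sum of outer products
mat = A.T.dot(B)                                     # matrix product over the n axis
ein = np.einsum('ij,ik->jk', A, B)                   # einsum over the shared row index

np.testing.assert_allclose(loop, mat)
np.testing.assert_allclose(loop, ein)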
Also consider np.outer.
np.array([np.outer(A[i], B[i]) for i in xrange(n)]).sum(0)
although np.einsum suggested by @Jamie is the clear winner.
In [63]: %timeit np.einsum('ij,ik->jk', A, B)
100000 loops, best of 3: 4.61 us per loop
In [64]: %timeit np.array([np.outer(A[i], B[i]) for i in xrange(n)]).sum(0)
10000 loops, best of 3: 169 us per loop
and, to be sure, their results are identical:
In [65]: np.testing.assert_allclose(method_outer, method_einsum)
But, as an aside, note that the element-wise products A.T * B and A * B.T do not broadcast successfully.
Consider x, an n x 3 vector.
Is it possible, using built-in methods of numpy or tensorflow, or any Python library, to get a vector of shape n x 1 such that each row is a vector of shape 3 x 1? That is, if x is [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]T, can a vector of the form [[1, 2, 3]T, [4, 5, 6]T, [7, 8, 9]T, [10, 11, 12]T]T be obtained without for loops or introducing new axes (say, via np.newaxis)?
The motive behind this is to get only the diagonal elements of the dot product of x and its transpose. We could, of course, do something like np.diag(x.dot(x.T)). But if n is significantly large, say 202933, one can hear the CPU fan wheezing. How can I compute only the diagonal entries of that phantom dot product, without forming the full product and without iterating?
Let's take a look at the formula for each element in the result of multiplying x by its own transpose. I don't feel like trying to coerce the Stack Overflow UI into allowing me to use tensor notation, so we'll look conceptually.
Each element at row i, column j of the result is the dot product of row i in x and column j in x.T. Now column j in x.T is just row j in x, and the diagonal is where i and j are the same. So what you want is a sum across the rows of the squared elements of x:
d = (x * x).sum(axis=1)
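As a minimal check (tiny array, values chosen arbitrarily) that this matches the diagonal of the full product:

import numpy as np

x = np.arange(12.0).reshape(4, 3)
d = (x * x).sum(axis=1)                              # row-wise squared sums
np.testing.assert_allclose(d, np.diag(x.dot(x.T)))   # same as the diagonal of x @ x.T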
To address the first part of your question, the transpose operation in numpy rarely makes a copy of your data, so x.T or np.transpose(x) are constant-time operations for even the largest arrays. The reason is that numpy arrays are stored as a block of data along with some meta-data like dimensions, strides between elements in each dimension, and data size. Transposing an array only requires you to modify a small amount of meta-data in the array object, like sizes along each dimension and strides, not copy the whole data set.
The time consuming part is performing the multiplication. Simply having the objects x and x.T costs almost nothing: they both use the same data buffer.
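A small sketch illustrating that point: the transpose is a view that shares the original buffer and only swaps the shape and strides metadata (the strides shown assume a C-contiguous float64 array).

import numpy as np

x = np.zeros((202933, 3))
y = x.T
print(np.shares_memory(x, y))  # True: same data buffer
print(x.shape, x.strides)      # (202933, 3) (24, 8)
print(y.shape, y.strides)      # (3, 202933) (8, 24)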
This function is likely one of the most efficient ways to handle this. (Taken from trimesh: https://github.com/mikedh/trimesh/blob/main/trimesh/util.py#L589)
def diagonal_dot(a, b):
    """
    Dot product by row of a and b.

    There are a lot of ways to do this though
    performance varies very widely. This method
    uses a dot product to sum the row and avoids
    function calls if at all possible.

    Parameters
    ------------
    a : (m, d) float
      First array
    b : (m, d) float
      Second array

    Returns
    -------------
    result : (m,) float
      Dot product of each row
    """
    # make sure `a` is numpy array
    # doing it for `a` will force the multiplication to
    # convert `b` if necessary and avoid function call otherwise
    a = np.asanyarray(a)
    # 3x faster than (a * b).sum(axis=1)
    # avoiding np.ones saves 5-10% sometimes
    return np.dot(a * b, [1.0] * a.shape[1])
Comparing performance of some equivalent versions:
In [1]: import numpy as np; import trimesh
In [2]: a = np.random.random((10000, 3))
In [3]: b = np.random.random((10000, 3))
In [4]: %timeit (a * b).sum(axis=1)
1000 loops, best of 3: 181 us per loop
In [5]: %timeit np.einsum('ij,ij->i', a, b)
10000 loops, best of 3: 62.7 us per loop
In [6]: %timeit np.diag(np.dot(a, b.T))
1 loop, best of 3: 429 ms per loop
In [7]: %timeit np.dot(a * b, np.ones(a.shape[1]))
10000 loops, best of 3: 61.3 us per loop
In [8]: %timeit trimesh.util.diagonal_dot(a, b)
10000 loops, best of 3: 55.2 us per loop
I have two 3d arrays A and B with shape (N, 2, 2) that I would like to multiply slice-wise along the N-axis, i.e. with a matrix product for each pair of 2x2 matrices. With a loop implementation, it looks like
C[i] = dot(A[i], B[i])
Is there a way I could do this without using a loop? I've looked into tensordot, but haven't been able to get it to work. I think I might want something like tensordot(a, b, axes=([1,2], [2,1])) but that's giving me an NxN matrix.
It seems you are doing matrix-multiplications for each slice along the first axis. For the same, you can use np.einsum like so -
np.einsum('ijk,ikl->ijl',A,B)
We can also use np.matmul -
np.matmul(A,B)
On Python 3.x, this matmul operation simplifies with the @ operator -
A @ B
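As a minimal sanity check (small N chosen arbitrarily), the vectorized forms reproduce the per-slice loop:

import numpy as np

N = 5
A = np.random.rand(N, 2, 2)
B = np.random.rand(N, 2, 2)

ref = np.array([np.dot(A[i], B[i]) for i in range(N)])  # loop reference
np.testing.assert_allclose(np.matmul(A, B), ref)
np.testing.assert_allclose(np.einsum('ijk,ikl->ijl', A, B), ref)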
Benchmarking
Approaches -
def einsum_based(A,B):
    return np.einsum('ijk,ikl->ijl',A,B)

def matmul_based(A,B):
    return np.matmul(A,B)

def forloop(A,B):
    N = A.shape[0]
    C = np.zeros((N,2,2))
    for i in range(N):
        C[i] = np.dot(A[i], B[i])
    return C
Timings -
In [44]: N = 10000
...: A = np.random.rand(N,2,2)
...: B = np.random.rand(N,2,2)
In [45]: %timeit einsum_based(A,B)
...: %timeit matmul_based(A,B)
...: %timeit forloop(A,B)
100 loops, best of 3: 3.08 ms per loop
100 loops, best of 3: 3.04 ms per loop
100 loops, best of 3: 10.9 ms per loop
You just need to perform the operation on the first dimension of your tensors, which is labeled by 0:
c = tensordot(a, b, axes=(0,0))
This will work as you wish. Also you don't need a list of axes, because you're performing the operation along just one dimension. With axes=([1,2],[2,1]) you're contracting the 2nd and 3rd dimensions. In index notation (Einstein summation convention) that corresponds to c[i,j] = a[i,k,l]*b[j,l,k], so you're contracting exactly the indices you want to keep.
EDIT: Ok, the problem is that the tensor product of two 3d objects is a 6d object. Since contractions involve pairs of indices, there's no way you'll get a 3d object from a single tensordot operation. The trick is to split the calculation in two: first do a tensordot over the index that performs the matrix multiplication (the last axis of a against the middle axis of b), then take a tensor diagonal to reduce the resulting 4d object to 3d. In one command:
d = np.diagonal(np.tensordot(a, b, axes=(2,1)), axis1=0, axis2=2)
In tensor notation, d[j,k,i] = c[i,j,i,k] = a[i,j,l]*b[i,l,k] (summing over l). Note that np.diagonal puts the diagonal axis last, so the result has shape (2, 2, N); move that axis back to the front to recover the (N, 2, 2) layout.
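A minimal check (small N, arbitrary data) that this reproduces the per-slice matrix products once the diagonal axis is moved back to the front:

import numpy as np

N = 4
a = np.random.rand(N, 2, 2)
b = np.random.rand(N, 2, 2)

d = np.diagonal(np.tensordot(a, b, axes=(2, 1)), axis1=0, axis2=2)  # shape (2, 2, N)
c = np.moveaxis(d, -1, 0)                                           # back to (N, 2, 2)

ref = np.array([a[i].dot(b[i]) for i in range(N)])
np.testing.assert_allclose(c, ref)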
Given two matrices X1 (N,3136) and X2 (M,3136), where every element in every row is a binary value, I am trying to calculate the Hamming distance so that each row in X1 is compared to all of the rows in X2, such that the result matrix is (N,M).
I have written two functions for it (the first one with the help of numpy, the other one with plain loops):
def hamming_distance(X, X_train):
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])
    return array

def hamming_distance2(X, X_train):
    a = len(X[:,0])
    b = len(X_train[:,0])
    hamming_distance = np.zeros(shape=(a, b))
    for i in range(0, a):
        for j in range(0, b):
            hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])
    return hamming_distance
My problem is that the upper function is much slower than the lower one where I use two for loops. Is it possible to improve the first function so that I use only one loop?
PS. Sorry for my English, it isn't my first language, but I tried to do my best!
Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N,M,K) which you can sum along its third dimension:
hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)
We inject a singleton dimension into the first array to make it of shape (N,1,K), the second array is implicitly compatible with shape (1,M,K), so the operation can be performed.
In the comments @ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization and loop over either M or N (or both; this would be your current approach). But this doesn't concern numpy itself; it's about striking a balance between available resources and performance.
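A minimal sketch (small, assumed shapes just to illustrate the broadcast) checking the one-liner against the double-loop version:

import numpy as np

X = np.random.randint(0, 2, (6, 10))        # (N, K)
X_train = np.random.randint(0, 2, (4, 10))  # (M, K)

broadcast = (X[:, None, :] != X_train).sum(axis=-1)  # (N, M)
loops = np.array([[np.count_nonzero(X[i] != X_train[j])
                   for j in range(len(X_train))]
                  for i in range(len(X))])
print(np.array_equal(broadcast, loops))  # True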
What you are doing is very similar to dot product. Consider these two binary arrays:
1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1
We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the different ones. For example, a1.dot(1-a2) counts (1, 0) pairs. Since we also need the number of (0, 1) pairs, we will add a2.dot(1-a1) to that. The good thing about dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
Here's a demo:
prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
np.allclose(res1, res2)
Out: True
And timings:
%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop
%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop
%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop
Hi, I'm writing a program for the AES mix-columns stage. Here I have to multiply two matrices of shape (4,4). The only difference is that while multiplying the two matrices I have to take 'xor' instead of adding, e.g.
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
np.dot(a,b) # this gives [[(1*5+2*7),(1*6+2*8)][(3*5+4*7),(3*6+4*8)]]
# but I want [[((1*5)^(2*7)),((1*6)^(2*8))][((3*5)^(4*7)),((3*6)^(4*8))]]
Here's the solution with loops
result = [[0,0,0,0],
          [0,0,0,0],
          [0,0,0,0],
          [0,0,0,0]]

# iterate through rows of X
for i in range(len(X)):
    # iterate through columns of Y
    for j in range(len(Y[0])):
        # iterate through rows of Y
        for k in range(len(Y)):
            result[i][j] = result[i][j] ^ (X[i][k] * Y[k][j])
How to achieve that without using loops?
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
For explanation, consider a rectangular problem for easier identification:
a=np.arange(12).reshape(4,3).astype(object)
b=np.arange(12).reshape(3,4).astype(object)
The object dtype is there to provide Python's arbitrary-precision integers for AES.
Products are obtained by broadcasting:
c=a[...,None]*b # dims : (4,3,1) * ((1),3,4) -> (4,3,4) , c_ijk =a_ij*b_jk
The dot product is then obtained by:
dot_ab=c.sum(axis=1) # ->(4,4)
In [734]: (dot_ab==a.dot(b)).all()
Out[734]: True
Then change to the equivalent xor function:
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
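A quick check, as a sketch with a small integer example (values assumed), that this vectorized form matches the triple loop from the question:

import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[5, 6], [7, 8]])

# triple-loop reference: xor-accumulate the products
ref = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(Y.shape[1]):
        for k in range(Y.shape[0]):
            ref[i, j] ^= X[i, k] * Y[k, j]

vec = np.bitwise_xor.reduce(X[..., None] * Y, axis=1)
print(np.array_equal(ref, vec))  # True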
As an alternative, you can compile your loops with numba (0.23):
from numba import jit

@jit(nopython=True)
def xor(X, Y):
    result = np.zeros((4,4), np.uint64)
    for i in range(len(X)):
        # iterate through columns of Y
        for j in range(Y.shape[1]):
            # iterate through rows of Y
            for k in range(len(Y)):
                result[i,j] = result[i,j] ^ (X[i,k] * Y[k,j])
    return result
for an impressive efficiency gain, due to optimal memory usage.
But you are limited to 32 bits for a and b:
In [790]: %timeit xor(a,b)
1000000 loops, best of 3: 580 ns per loop
In [791]: %timeit xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
100000 loops, best of 3: 13.2 µs per loop
In [792]: (xor(a,b)==np.bitwise_xor.reduce(a[...,None]*b,axis=1)).all()
Out[792]: True
I have these variables with the following dimensions:
A - (3,)
B - (4,)
X_r - (3,K,N,nS)
X_u - (4,K,N,nS)
k - (K,)
and I want to compute (A.dot(X_r[:,:,n,s])*B.dot(X_u[:,:,n,s])).dot(k) for every possible n and s. The way I am doing it now is the following:
np.array([[(A.dot(X_r[:,:,n,s])*B.dot(X_u[:,:,n,s])).dot(k) for n in xrange(N)] for s in xrange(nS)]) #nSxN
But this is super slow, and I was wondering if there is a better way of doing it.
However, there is another computation I am doing that I am sure can be optimized:
np.sum(np.array([(X_r[:,:,n,s]*B.dot(X_u[:,:,n,s])).dot(k) for n in xrange(N)]),axis=0)
In this one I am creating a numpy array just to sum it along one axis and then discard it. If this were a 1-D list I would use reduce to optimize it; what should I use for numpy arrays?
Using a few np.einsum calls -
# Calculation of A.dot(X_r[:,:,n,s])
p1 = np.einsum('i,ijkl->jkl',A,X_r)
# Calculation of B.dot(X_u[:,:,n,s])
p2 = np.einsum('i,ijkl->jkl',B,X_u)
# Include .dot(k) part to get the final output
out = np.einsum('ijk,i->kj',p1*p2,k)
About the second example, this solves it:
p1 = np.einsum('i,ijkl->jkl',B,X_u)#OUT_DIM - (k,N,nS)
sol = np.einsum('ijkl,j->il',X_r*p1[None,:,:,:],k)#OUT_DIM (3,nS)
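A minimal verification sketch (small assumed dimensions for K, N and nS) against the original loop-based expression:

import numpy as np

K, N, nS = 6, 5, 4
A = np.random.rand(3)
B = np.random.rand(4)
X_r = np.random.rand(3, K, N, nS)
X_u = np.random.rand(4, K, N, nS)
k = np.random.rand(K)

p1 = np.einsum('i,ijkl->jkl', A, X_r)
p2 = np.einsum('i,ijkl->jkl', B, X_u)
out = np.einsum('ijk,i->kj', p1 * p2, k)  # (nS, N)

ref = np.array([[(A.dot(X_r[:, :, n, s]) * B.dot(X_u[:, :, n, s])).dot(k)
                 for n in range(N)] for s in range(nS)])
print(np.allclose(out, ref))  # True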
You can use dot for multiplication of matrices in higher dimensions but the running indices must be the last two.
When we reorder your matrices
X_r_t = X_r.transpose(2,3,0,1)
X_u_t = X_u.transpose(2,3,0,1)
we obtain for your first expression
res1_imp = (A.dot(X_r_t)*B.dot(X_u_t)).dot(k).T # shape nS x N
and for the second expression
res2_imp = np.sum((X_r_t * B.dot(X_u_t)[:,:,None,:]).dot(k),axis=0)[-1]
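A minimal check (same assumed small shapes as above) that the first transposed expression reproduces the loop-based result:

import numpy as np

K, N, nS = 6, 5, 4
A = np.random.rand(3)
B = np.random.rand(4)
X_r = np.random.rand(3, K, N, nS)
X_u = np.random.rand(4, K, N, nS)
k = np.random.rand(K)

X_r_t = X_r.transpose(2, 3, 0, 1)  # (N, nS, 3, K)
X_u_t = X_u.transpose(2, 3, 0, 1)  # (N, nS, 4, K)

res1_imp = (A.dot(X_r_t) * B.dot(X_u_t)).dot(k).T  # (nS, N)
ref = np.array([[(A.dot(X_r[:, :, n, s]) * B.dot(X_u[:, :, n, s])).dot(k)
                 for n in range(N)] for s in range(nS)])
print(np.allclose(res1_imp, ref))  # True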
Timings
Divakar's solution gives on my computer 10000 loops, best of 3: 21.7 µs per loop
My solution gives 10000 loops, best of 3: 101 µs per loop
Edit
My timings above included the computation of both expressions. When I include only the first expression (as Divakar does) I obtain 10000 loops, best of 3: 41 µs per loop, which is still slower but closer to his timing.