Is there any way to multiply a 2D sparse matrix by a 3D numpy array please?
For example, I have this function:
def myFun(x, p):
    r = 2
    out = x * np.log(p) + r * np.log(1-p)
    return out
where x is an array of shape (3500, 90) and p another array of shape (3500, 90, 70). At the moment both x and p are dense arrays and I am just broadcasting when I call the function:
out = myFun(x[..., None], p)
However, array x is quite sparse: only 7% of its elements are non-zero. On the other hand, p doesn't have any zeros, only floats between zero and one.
I am hoping that with a sparse matrix (probably from scipy.sparse) I will see a speed improvement. However, I do not know how to do this operation, or whether it would be more efficient.
I am using python 3.
Many thanks
You can exploit the sparseness of x using the where keyword.
def sprse(x, p):
    r = 2
    out = x * np.log(p, where=x.astype(bool)) + r * np.log(1-p)
    return out
import numpy as np
from timeit import timeit

x = np.random.uniform(-13, 1, (3500, 90, 1)).clip(0, None)
p = np.random.random((3500, 90, 70))

assert np.all(sprse(x, p) == myFun(x, p))

def f():
    return myFun(x, p)

print(timeit(f, number=3))

def f():
    return sprse(x, p)

print(timeit(f, number=3))
Sample run:
5.171174691990018
3.2122434769989923
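One caveat with the where trick: without an out= argument, np.log leaves the entries skipped by where uninitialized, and 0 * uninitialized can become NaN if the leftover bits happen to encode inf or NaN. A minimal safer variant (my sketch, not part of the timing above):
def sprse_safe(x, p):
    r = 2
    # preallocate out so the entries skipped by `where` are defined zeros
    logp = np.log(p, where=x.astype(bool), out=np.zeros_like(p))
    return x * logp + r * np.log(1 - p)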
You can try the following implementation. For such a simple function this may look like overkill, but I also had trouble getting numexpr to work with Intel SVML (otherwise I would prefer numexpr). This solution should give about 0.07 s per call on a quad-core i7 and should scale quite well on more cores. Please also note that the first call has a compilation overhead of about 0.5 s.
Installing Intel SVML
import numpy as np
import numba as nb
x = np.random.uniform(-13, 1, (3500, 90, 1)).clip(0, None)
p = np.random.random((3500, 90, 70))
@nb.njit(parallel=True, fastmath=True)
def nb_myFun_sp(x, p):
    out = np.empty(p.shape, p.dtype)
    r = 2.
    for i in nb.prange(p.shape[0]):
        for j in range(p.shape[1]):
            if x[i, j, 0] != 0.:
                x_ = x[i, j, 0]
                for k in range(p.shape[2]):
                    out[i, j, k] = x_ * np.log(p[i, j, k]) + r * np.log(1. - p[i, j, k])
            else:
                for k in range(p.shape[2]):
                    out[i, j, k] = r * np.log(1. - p[i, j, k])
    return out
@nb.njit(parallel=True, fastmath=True)
def nb_myFun(x, p):
    out = np.empty(p.shape, p.dtype)
    r = 2.
    for i in nb.prange(p.shape[0]):
        for j in range(p.shape[1]):
            x_ = x[i, j, 0]
            for k in range(p.shape[2]):
                out[i, j, k] = x_ * np.log(p[i, j, k]) + r * np.log(1. - p[i, j, k])
    return out
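For completeness, here is a hedged sketch of the scipy.sparse route the question asked about: keep x as a 2D COO matrix, compute the dense r*log(1-p) term everywhere, and add the x-dependent term only at the ~7% of non-zero (row, column) positions.
import numpy as np
from scipy import sparse

def myFun_sparse(x2d, p, r=2):
    # x2d has shape (3500, 90), p has shape (3500, 90, 70)
    xs = sparse.coo_matrix(x2d)   # non-zero coordinates and values
    out = r * np.log(1 - p)       # dense term, needed everywhere
    # add x*log(p) only where x is non-zero
    out[xs.row, xs.col, :] += xs.data[:, None] * np.log(p[xs.row, xs.col, :])
    return out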
Related
I'm trying to implement a differential in python via numpy that can accept a scalar, a vector, or a matrix.
import numpy as np

def foo_scalar(x):
    f = x * x
    df = 2 * x
    return f, df

def foo_vector(x):
    f = x * x
    n = x.size
    df = np.zeros((n, n))
    for mu in range(n):
        for i in range(n):
            if mu == i:
                df[mu, i] = 2 * x[i]
    return f, df

def foo_matrix(x):
    f = x * x
    m, n = x.shape
    df = np.zeros((m, n, m, n))
    for mu in range(m):
        for nu in range(n):
            for i in range(m):
                for j in range(n):
                    if (mu == i) and (nu == j):
                        df[mu, nu, i, j] = 2 * x[i, j]
    return f, df
This works fine, but it seems like there should be a way to do this in a single function, and let numpy "figure out" the correct dimensions. I could force everything into a 2-D array form with something like
x = np.array(x)
if len(x.shape) == 0:
    x = x.reshape(1, 1)
elif len(x.shape) == 1:
    x = x.reshape(-1, 1)
if len(f.shape) == 0:
    f = f.reshape(1, 1)
elif len(f.shape) == 1:
    f = f.reshape(-1, 1)
and always have 4 nested for loops, but this doesn't scale if I need to generalize to higher-order tensors.
Is what I'm trying to do possible, and if so, how?
I highly doubt there is a built-in Numpy function to generate the second value returned by your function. That being said, you can play with the features of Numpy and Python so as to vectorize this and make the function faster. You first need to generate the indices, then generate the target matrix and set it. Note that operating on generic N-dimensional arrays tends to be slow and tricky in non-trivial cases. The magic * unpacking operator is used to pass N parameters.
def foo_generic(x):
    f = x ** 2
    idx = np.stack(np.meshgrid(*[np.arange(e) for e in x.shape], indexing='ij'))
    idx = tuple(np.concatenate((idx, idx)).reshape(2*x.ndim, -1))
    df = np.zeros([*x.shape, *x.shape])
    df[idx] = 2 * x.ravel()
    return f, df
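A quick sanity check against the question's foo_matrix (a sketch):
x = np.arange(6.0).reshape(2, 3)
f_gen, df_gen = foo_generic(x)
f_mat, df_mat = foo_matrix(x)
assert np.allclose(f_gen, f_mat) and np.allclose(df_gen, df_mat)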
Note that foo_generic does not support scalars (and it would be very inefficient to use it for that anyway), but you can add a condition to handle this special case separately.
The df matrix will very quickly become huge for higher orders, so I strongly advise you not to use dense matrices for it: already in the matrix case, the number of zeros is huge compared to the number of non-zero values. In fact, for a 5x5 input, more than 95% of the entries of df are zero. Not to mention the matrix quickly becomes enormous, and filling a huge matrix with zeros is not efficient. Sparse matrices fix this.
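Since the Jacobian of this elementwise operation is diagonal once the input is flattened, a minimal sparse sketch (assuming a scipy.sparse representation of the flattened Jacobian is acceptable to the caller) is:
import numpy as np
from scipy import sparse

def foo_sparse(x):
    # f stays dense; df is the flattened Jacobian, a sparse diagonal matrix
    # whose (i, i) entry is d f_i / d x_i = 2 * x_i
    f = x ** 2
    df = sparse.diags(2 * x.ravel())
    return f, df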
Is it possible to accelerate the following nested loop in Python (possibly with CUDA or parallel processing)? The order of the elements that are appended to outputList[i] doesn't matter.
Currently, the code takes a long time to complete. Which part is slowing down the code? Is it the append() or the calculation f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)?
from math import sqrt, sin, cos, pi

N = 640*480
outputList = [[] for i in range(N)]

def foo(a, b):  # a, b are always integers
    c = sqrt(a**2 + b**2)
    for deg in range(360):
        f = c*sin(deg*pi/180) + (1/c)*cos(deg*pi/180)
        if f < 1:
            outputList[a].append(f)

if __name__ == '__main__':
    for x in range(N):
        for y in range(60):
            foo(x, y)
As talonmies mentioned, your first step should be using numpy.
import math
import numpy as np

def foo(a, b):
    c = math.hypot(a, b)
    degs = np.arange(360, dtype='f8')
    rads = np.deg2rad(degs)
    fs = c * np.sin(rads) + (1/c)*np.cos(rads)
    below1 = fs[fs < 1.]
    outputList[a].extend(below1)
There are fancier methods to do the sin + cos computation, but I've kept it simple for now. Also, depending on your use case, consider changing outputList to a 2D numpy array or a list of 1D numpy arrays.
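One of those "fancier methods" (an illustration of the hint, not code from the answer) is to collapse c*sin(t) + (1/c)*cos(t) into a single sinusoid using the identity a*sin(t) + b*cos(t) = hypot(a, b) * sin(t + atan2(b, a)):
# one vectorized sin() call instead of a sin() and a cos()
amp = math.hypot(c, 1/c)
phase = math.atan2(1/c, c)
fs = amp * np.sin(rads + phase)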
I have varimax rotation code from Wikipedia:
def varimax(Phi, gamma=1, q=20, tol=1e-6):
    from numpy import eye, asarray, dot, sum, diag
    from numpy.linalg import svd
    p, k = Phi.shape
    R = eye(k)
    d = 0
    for i in range(q):
        d_old = d
        Lambda = dot(Phi, R)
        u, s, vh = svd(dot(Phi.T, asarray(Lambda)**3 - (gamma/p) * dot(Lambda, diag(diag(dot(Lambda.T, Lambda))))))
        R = dot(u, vh)
        d = sum(s)
        if d/d_old < tol: break
    return dot(Phi, R)
and I use it this way:
varimax(X) ## X is a numpy array
but it returns numbers like 2.4243244e-15, which is not my expected answer.
Should I change the other arguments, for example gamma or q?
I'm not familiar with varimax rotation.
Can you post an example of what you're using as the inputs for X and what kind of outputs you're expecting?
I tested your code by fixing up the indenting in your code, like this:
from numpy import eye, asarray, dot, sum, diag
from numpy.linalg import svd

def varimax(Phi, gamma=1, q=20, tol=1e-6):
    p, k = Phi.shape
    R = eye(k)
    d = 0
    for i in range(q):
        d_old = d
        Lambda = dot(Phi, R)
        u, s, vh = svd(dot(Phi.T, asarray(Lambda)**3 - (gamma/p) * dot(Lambda, diag(diag(dot(Lambda.T, Lambda))))))
        R = dot(u, vh)
        d = sum(s)
        if d/d_old < tol: break
    return dot(Phi, R)
And making some dummy components to test it like this:
import numpy as np

comps = np.linalg.svd(
    np.random.randn(100, 10),
    full_matrices=False
)[0]
rot_comps = varimax(comps)

print("Original components dimension {}".format(comps.shape))
print("Component norms")
print(np.sum(comps**2, axis=0))
print("Rotated components dimension {}".format(rot_comps.shape))
print("Rotated component norms")
print(np.sum(rot_comps**2, axis=0))
The inputs and outputs are 100 x 10 arrays with unit norm, just as you'd expect.
I have the following code. It is taking forever in Python. There must be a way to translate this calculation into a broadcast...
def euclidean_square(a, b):
    squares = np.zeros((a.shape[0], b.shape[0]))
    for i in range(squares.shape[0]):
        for j in range(squares.shape[1]):
            diff = a[i, :] - b[j, :]
            sqr = diff**2.0
            squares[i, j] = np.sum(sqr)
    return squares
You can use np.einsum after calculating the differences in a broadcasted way, like so -
ab = a[:,None,:] - b
out = np.einsum('ijk,ijk->ij',ab,ab)
Or use scipy's cdist with its optional metric argument set as 'sqeuclidean' to give us the squared euclidean distances as needed for our problem, like so -
from scipy.spatial.distance import cdist
out = cdist(a,b,'sqeuclidean')
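As a quick sanity check (a sketch with small random inputs, not part of the original answer), both routes give the same squared distances:
import numpy as np
from scipy.spatial.distance import cdist

a = np.random.rand(5, 3)
b = np.random.rand(7, 3)
ab = a[:, None, :] - b
assert np.allclose(np.einsum('ijk,ijk->ij', ab, ab), cdist(a, b, 'sqeuclidean'))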
I collected the different methods proposed here, and in two other questions, and measured the speed of the different methods:
import numpy as np
import scipy.spatial
import sklearn.metrics

def dist_direct(x, y):
    d = np.expand_dims(x, -2) - y
    return np.sum(np.square(d), axis=-1)

def dist_einsum(x, y):
    d = np.expand_dims(x, -2) - y
    return np.einsum('ijk,ijk->ij', d, d)

def dist_scipy(x, y):
    return scipy.spatial.distance.cdist(x, y, "sqeuclidean")

def dist_sklearn(x, y):
    return sklearn.metrics.pairwise.pairwise_distances(x, y, "sqeuclidean")

def dist_layers(x, y):
    res = np.zeros((x.shape[0], y.shape[0]))
    for i in range(x.shape[1]):
        res += np.subtract.outer(x[:, i], y[:, i])**2
    return res

# inspired by the excellent https://github.com/droyed/eucl_dist
def dist_ext1(x, y):
    nx, p = x.shape
    x_ext = np.empty((nx, 3*p))
    x_ext[:, :p] = 1
    x_ext[:, p:2*p] = x
    x_ext[:, 2*p:] = np.square(x)
    ny = y.shape[0]
    y_ext = np.empty((3*p, ny))
    y_ext[:p] = np.square(y).T
    y_ext[p:2*p] = -2*y.T
    y_ext[2*p:] = 1
    return x_ext.dot(y_ext)

# https://stackoverflow.com/a/47877630/648741
def dist_ext2(x, y):
    return np.einsum('ij,ij->i', x, x)[:, None] + np.einsum('ij,ij->i', y, y) - 2 * x.dot(y.T)
I use timeit to compare the speed of the different methods. For the comparison, I use vectors of length 10, with 100 vectors in the first group, and 1000 vectors in the second group.
import timeit

p = 10
x = np.random.standard_normal((100, p))
y = np.random.standard_normal((1000, p))

for method in dir():
    if not method.startswith("dist_"):
        continue
    t = timeit.timeit(f"{method}(x, y)", number=1000, globals=globals())
    print(f"{method:12} {t:5.2f}ms")
On my laptop, the results are as follows:
dist_direct 5.07ms
dist_einsum 3.43ms
dist_ext1 0.20ms <-- fastest
dist_ext2 0.35ms
dist_layers 2.82ms
dist_scipy 0.60ms
dist_sklearn 0.67ms
While the two methods dist_ext1 and dist_ext2, both based on the idea of writing (x-y)**2 as x**2 - 2*x*y + y**2, are very fast, there is a downside: when the distance between x and y is very small, cancellation error can make the numerical result (very slightly) negative.
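If exact non-negativity matters downstream (for example before taking a square root), clamping is a cheap fix (a sketch, not part of the benchmark above):
# clamp tiny negatives caused by cancellation
d2 = np.maximum(dist_ext2(x, y), 0)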
Another solution besides using cdist is the following:
difference_squared = np.zeros((a.shape[0], b.shape[0]))
for dimension_iterator in range(a.shape[1]):
    difference_squared = difference_squared + np.subtract.outer(a[:, dimension_iterator], b[:, dimension_iterator])**2.
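A quick equivalence check against the question's euclidean_square (a sketch):
from scipy.spatial.distance import cdist
assert np.allclose(difference_squared, cdist(a, b, 'sqeuclidean'))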
I'm trying to use numbapro to write a simple matrix vector multiplication below:
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

n = 100

@cuda.jit('void(float32[:,:], float32[:], float32[:])')
def cu_matrix_vector(A, b, c):
    y, x = cuda.grid(2)
    if y < n:
        c[y] = 0.0
    if x < n and y < n:
        for i in range(n):
            c[y] += A[y, i] * b[i]

A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, 1)), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector(dA, dB, dC)
dC.to_host()
e = time()
tcuda = e - s
but I'm getting following error:
numbapro.cudadrv.error.CudaDriverError: CUDA_ERROR_LAUNCH_FAILED Failed to copy memory D->H
I don't understand why the device-to-host copy is failing. Please help.
Your code has multiple problems.
The B and C vectors are Nx1 2D matrices, not 1D vectors, but the type signature of your kernel lists them as "float32[:]" -- 1D vectors. It also indexes them with a single index, which results in runtime errors on the GPU due to misaligned access (cuda-memcheck is your friend here!)
Your kernel assumes a 2D grid, but only uses 1 column of it -- meaning many threads doing the same computation and overwriting each other.
There is no execution configuration given, so NumbaPro is launching a kernel with 1 block of 1 thread. (nvprof is your friend here!)
Here is code that works. Note that it uses a 1D grid of 1D blocks and loops over the columns of the matrix, so it is optimized for the case where the number of rows in the vector/matrix is large. A kernel optimized for a short and wide matrix would need another approach (parallel reductions). But I would use CUBLAS sgemv (which is also exposed in NumbaPro) instead.
from numbapro import cuda
from numba import *
import numpy as np
import math
from timeit import default_timer as time

m = 100000
n = 100

@cuda.jit('void(f4[:,:], f4[:], f4[:])')
def cu_matrix_vector(A, b, c):
    row = cuda.grid(1)
    if row < m:
        sum = 0
        for i in range(n):
            sum += A[row, i] * b[i]
        c[row] = sum

A = np.array(np.random.random((m, n)), dtype=np.float32)
B = np.array(np.random.random(m), dtype=np.float32)
C = np.empty_like(B)

s = time()
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.to_device(C)
cu_matrix_vector[(m+511)//512, 512](dA, dB, dC)
dC.to_host()
print(C)
e = time()
tcuda = e - s
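To confirm the kernel computes the right thing, a quick check against numpy (a sketch; float32 tolerance assumed, and note the kernel only reads the first n entries of b):
# GPU result vs. numpy's matrix-vector product
assert np.allclose(C, A.dot(B[:n]), rtol=1e-4)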