cuda gemm transpose with numpy - python

I am wondering how the GEMM transpose works. I have a matrix which I want to multiply by the same matrix transposed, such as A.T * A.
I have something like this:
def bptrs(a):
    return gpuarray.arange(a.ptr, a.ptr + a.shape[0] * a.strides[0],
                           a.strides[0], dtype=ctypes.c_void_p)

handle = cublasCreate()
A = np.ones((s, 3)).astype(np.float64)
B = A.T  # transposed
m, k = A.shape
k, n = B.shape
a_gpu = gpuarray.to_gpu(A)
b_gpu = gpuarray.to_gpu(B)  # I am guessing I need to do a copy since A.T is a view
c_gpu = gpuarray.empty((m, n), np.float64)  # Not 100% sure if this is right. I want a view returned, so I can save on memory
alpha = np.float64(1.0)
beta = np.float64(0.0)
cublasDgemmBatched(handle, 't', 'n',
                   n, m, k, alpha,
                   b_gpu.gpudata, m,
                   a_gpu.gpudata, k,
                   beta, c_gpu.gpudata, m, 1)
I am using cuBLAS 7.

To compute A^T A, use the cuBLAS syrk function instead of gemm.
If you want to understand gemm, you have to read the documentation carefully. Don't bother performing the transposition in Python, as gemm has arguments to do it on the fly. Something like this should get you what you want:
s = ...
k = 3
handle = cublasCreate()
A = np.ones((s, k)).astype(np.float64)
# cuBLAS is column-major while numpy defaults to row-major, so upload A in
# Fortran (column-major) order to match the leading dimensions used below
a_gpu = gpuarray.to_gpu(np.asfortranarray(A))
c_gpu = gpuarray.empty((k, k), np.float64)
alpha = np.float64(1.0)
beta = np.float64(0.0)
cublasDgemm(handle,
            't', 'n',          # op(A) = A^T, op(B) = A, so C = A^T * A
            k,                 # m: number of rows of op(A) and C
            k,                 # n: number of columns of op(B) and C
            s,                 # k: number of columns of op(A) and rows of op(B)
            alpha,
            a_gpu.gpudata, s,  # lda: A is stored as an s x k column-major array
            a_gpu.gpudata, s,  # ldb: same buffer and layout for the second operand
            beta,
            c_gpu.gpudata, k)  # ldc: C is k x k (symmetric, so the row-major view is fine)
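To illustrate the syrk route mentioned at the top, here is a minimal sketch. It assumes scikit-cuda exposes cublasDsyrk with the standard cuBLAS argument order and reuses the Fortran-order upload from above; note that syrk writes only one triangle of C:
# Sketch: C = alpha * op(A) * op(A)^T + beta * C, one triangle only.
a_gpu = gpuarray.to_gpu(np.asfortranarray(A))  # column-major s x k
c_gpu = gpuarray.zeros((k, k), np.float64)
cublasDsyrk(handle,
            'l',               # uplo: cuBLAS's lower triangle (upper in numpy's row-major view)
            't',               # trans: op(A) = A^T, so C = A^T * A
            k,                 # n: order of C
            s,                 # k: inner dimension
            alpha,
            a_gpu.gpudata, s,  # lda
            beta,
            c_gpu.gpudata, k)  # ldc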

Related

Understanding fancy einsum equation

I was reading about attention and came across this equation:
import einops
from fancy_einsum import einsum
import torch
x = torch.rand((200, 10, 768))
y = torch.rand((20, 768, 64))
res = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", x, y)
I am not able to understand the underlying operations that give the result res.
I thought it might be matmul, and tried this:
import torch
x_ = x.unsqueeze(dim=2).unsqueeze(dim=2)
y_ = torch.broadcast_to(y, (1, 1, 20, 768, 64))
res2 = x_ @ y_
res2 = res2.squeeze(dim=-2)
(res == res2).all()  # Prints False
But that does not seem to be right.
Any help regarding this is greatly appreciated
Whenever you use einsum, it is best to think about the meaning of the dimensions. Basically, we perform a multiplication between the two inputs in this case. The signature passed to einsum shows which dimensions will be preserved and which ones will be "summed away". I simplified the signature to single letters here:
res = einsum("b q m, n m h -> b q n h", x, y)
We can read from this that both x and y have three dimensions. Furthermore, both have a dimension called m, which doesn't appear in the output, so we can conclude that it gets "summed away". For each entry of the output we then have the following formula (for simplicity I reused the dimension names as indices), for every b, q, n, h:
res[b,q,n,h] = sum_m x[b,q,m] * y[n,m,h]
Doing this with any function other than einsum is usually more cumbersome. First we need to reorder and unsqueeze the dimensions so that they are compatible for multiplication (the shapes are annotated in the comment below):
#(b,q,m,n,h) (b, q, m, 1, 1) (m, n, h)
product = x[:, :, :, None, None] * y.permute([1,0,2])
Due to the broadcasting rules, the second (y-) term will implicitly get the required leading dummy dimensions.
Then we can "sum away" the dimension m:
res = product.sum(dim=2) # (b,q,n,h)
So you can interpret that as a matrix multiplication if you want, or also just a scalar product, but of course with many "batch"-dimensions.
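For completeness, here is the whole equivalence as a runnable check (a sketch with smaller sizes than the question, so the broadcast product stays cheap; the exact-equality test from the question fails only because the floating-point summation order differs, so allclose with a tolerance is the fairer comparison):
import torch

x = torch.rand((8, 10, 16))  # (b, q, m)
y = torch.rand((4, 16, 6))   # (n, m, h)

res = torch.einsum("bqm,nmh->bqnh", x, y)

# the manual route from above
product = x[:, :, :, None, None] * y.permute([1, 0, 2])  # (b, q, m, n, h)
res2 = product.sum(dim=2)                                # (b, q, n, h)

# exact equality can fail due to summation order; compare with a tolerance
print(torch.allclose(res, res2, atol=1e-5))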

How to compute the operator Schmidt decomposition of a matrix using python

How to do THIS in python?
Currently, I can transform a 4x4 matrix 'A' into its bipartite shape [2,2,2,2], as described here.
Performing the SVD on the latter (the transformed version of 'A') in Python via:
NumPy SVD: U, S, Vh = numpy.linalg.svd(A)
PyTorch SVD: U, S, Vh = torch.linalg.svd(A)
it turns out that (in both cases):
U.shape = Vh.shape = [2,2,2,2]
S.shape = [2,2,2]
I want to write a function:
def to_schmidt(A):
    U, S, Vh = torch.linalg.svd(A)
    #
    # Do stuff with U, S, and Vh
    #
    return [[s1, ..., s4], [B1, ..., B4], [C1, ..., C4]]
which returns a list of the s's, B's, and C's, just as described here. To clarify (using PyTorch in the example):
should_be_the_original_matrix_A = None
for i in range(4):
    if should_be_the_original_matrix_A is None:
        should_be_the_original_matrix_A = s[i] * torch.kron(B[i], C[i])
    else:
        should_be_the_original_matrix_A += s[i] * torch.kron(B[i], C[i])
#
# The one with A_{i}{j} for i, j = 0, 1, 2, 3,
# or alternatively A.shape = [4, 4], i.e.
# before the transformation to the bipartite rep.
#
The answer given there uses Mathematica, but as I'm not familiar with it:
How to complete this function?
Hopefully, using PyTorch, NumPy, or anything convertible to them, fairly trivially.
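Not having the linked Mathematica answer at hand, here is a sketch of one standard way to do this in PyTorch. It assumes the bipartite reshuffle convention R[(i1,j1),(i2,j2)] = A[(i1,i2),(j1,j2)]; the reconstruction check at the end verifies that the convention is self-consistent:
import torch

def to_schmidt(A):
    # Reshuffle the 4x4 matrix so that R[(i1, j1), (i2, j2)] = A[(i1, i2), (j1, j2)]
    # (assumed convention), then take an ordinary matrix SVD of R.
    R = A.reshape(2, 2, 2, 2).permute(0, 2, 1, 3).reshape(4, 4)
    U, S, Vh = torch.linalg.svd(R)
    s = [S[i] for i in range(4)]
    B = [U[:, i].reshape(2, 2) for i in range(4)]
    C = [Vh[i, :].reshape(2, 2) for i in range(4)]
    return [s, B, C]

# Check: A should be recovered as sum_i s_i * kron(B_i, C_i)
A = torch.rand(4, 4)
s, B, C = to_schmidt(A)
A_rec = sum(s[i] * torch.kron(B[i], C[i]) for i in range(4))
print(torch.allclose(A, A_rec, atol=1e-5))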

How to quickly build a graph with Numpy?

In order to use PyStruct to perform image segmentation (by means of inference [1]), I first need to build a graph whose nodes correspond to pixels and whose edges link these pixels.
I have thus written a function, which works, to do so:
def create_graph_for_pystruct(mrf, betas, nb_labels):
    M, N = mrf.shape
    b1, b2, b3, b4 = betas
    edges = []
    pairwise = np.zeros((nb_labels, nb_labels))
    # loop over rows
    for i in range(M):
        # loop over columns
        for j in range(N):
            # get rid of pixels belonging to the image's borders
            if i != 0 and i != M - 1 and j != 0 and j != N - 1:
                # get the current linear index
                current_linear_ind = i * N + j
                # retrieve its neighborhood (yields a list of (row, col) tuples)
                neigh = np.array(getNeighborhood(i, j, M, N))
                # convert neighbor indices to linear ones
                neigh_linear_ind = neigh[:, 0] * N + neigh[:, 1]
                # add edges
                [edges.append((current_linear_ind, n)) for n in neigh_linear_ind]
                mat1 = b1 * np.eye(nb_labels)
                mat2 = b2 * np.eye(nb_labels)
                mat3 = b3 * np.eye(nb_labels)
                mat4 = b4 * np.eye(nb_labels)
                pairwise = np.ma.dstack((pairwise, mat1, mat1, mat2, mat2, mat3, mat3, mat4, mat4))
    return np.array(edges), pairwise[:, :, 1:]
However, it is slow and I am wondering where I can improve my function in order to speed it up.
[1] https://pystruct.github.io/generated/pystruct.inference.inference_dispatch.html
Here is a code suggestion that should run much faster (in numpy one should favour vectorisation over for-loops). I try to build the whole output in a single vectorised pass, using the helpful np.ogrid to generate the xy coordinates.
def new(mrf, betas, nb_labels):
    M, N = mrf.shape
    b1, b2, b3, b4 = betas
    mat1, mat2, mat3, mat4 = np.array([b1, b2, b3, b4])[:, None, None] * np.eye(nb_labels)[None, :, :]
    pairwise = np.array([mat1, mat1, mat2, mat2, mat3, mat3, mat4, mat4] * ((M - 2) * (N - 2))).transpose()
    m, n = np.ogrid[0:M, 0:N]
    a, b, c = m[0:-2] * N + n[:, 0:-2], m[1:-1] * N + n[:, 0:-2], m[2:] * N + n[:, 0:-2]
    d, e, f = m[0:-2] * N + n[:, 1:-1], m[1:-1] * N + n[:, 1:-1], m[2:] * N + n[:, 1:-1]
    g, h, i = m[0:-2] * N + n[:, 2:],   m[1:-1] * N + n[:, 2:],   m[2:] * N + n[:, 2:]
    center_index = e
    edges_index = np.stack([a, b, c, d, f, g, h, i])
    edges = np.empty(list(edges_index.shape) + [2])
    edges[:, :, :, 0] = center_index[None, :, :]
    edges[:, :, :, 1] = edges_index
    edges = edges.reshape(-1, 2)
    return edges, pairwise
Timing and comparison test:
import timeit
args=(np.empty((40,50)), [1,2,3,4], 10)
f1=lambda : new(*args)
f2=lambda : create_graph_for_pystruct(*args)
edges1, pairwise1 = f1()
edges2, pairwise2 = f2()
# outputs are not exactly identical: the ordering isn't the same,
# so I sort both before comparing
edges1 = edges1[np.lexsort(np.fliplr(edges1).T)]
edges2 = edges2[np.lexsort(np.fliplr(edges2).T)]
print("edges identical ?",(edges1 == edges2).all())
print("pairwise identical ?",(pairwise1 == pairwise2).all())
print("new : ",timeit.timeit(f1,number=1))
print("old : ",timeit.timeit(f2,number=1))
Output :
edges identical ? True
pairwise identical ? True
new : 0.015270026000507642
old : 4.611805051001284
Note: I had to guess what was in the getNeighborhood function
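For reference, here is one plausible reconstruction of that helper (purely a guess, matching the 8-connected neighborhood the vectorised version assumes):
def getNeighborhood(i, j, M, N):
    # 8-connected neighbors of pixel (i, j), clipped to the M x N image
    return [(i + di, j + dj)
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < M
            and 0 <= j + dj < N]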

Element-wise maximum of two sparse matrices

Is there an easy/build-in way to get the element-wise maximum of two (or ideally more) sparse matrices? I.e. a sparse equivalent of np.maximum.
This did the trick:
def maximum(A, B):
    BisBigger = A - B
    BisBigger.data = np.where(BisBigger.data < 0, 1, 0)
    return A - A.multiply(BisBigger) + B.multiply(BisBigger)
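As a quick sanity check of that trick against dense np.maximum (just a test sketch; sparse.random needs a reasonably recent scipy):
import numpy as np
from scipy import sparse

A = sparse.random(50, 50, density=0.1, format='csr')
B = sparse.random(50, 50, density=0.1, format='csr')
M = maximum(A, B)
print((M.toarray() == np.maximum(A.toarray(), B.toarray())).all())  # True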
No, there's no built-in way to do this in scipy.sparse. The easy solution is
np.maximum(X.A, Y.A)
but this is obviously going to be very memory-intensive when the matrices have large dimensions and it might crash your machine. A memory-efficient (but by no means fast) solution is
import numpy as np
from scipy.sparse import coo_matrix

# convert to COO, if necessary
X = X.tocoo()
Y = Y.tocoo()

Xdict = {(i, j): v for i, j, v in zip(X.row, X.col, X.data)}
Ydict = {(i, j): v for i, j, v in zip(Y.row, Y.col, Y.data)}

keys = list(set(Xdict.keys()).union(Ydict.keys()))

XmaxY = [max(Xdict.get((i, j), 0), Ydict.get((i, j), 0)) for i, j in keys]
XmaxY = coo_matrix((XmaxY, tuple(zip(*keys))))
Note that this uses pure Python instead of vectorized idioms. You can try shaving some of the running time off by vectorizing parts of it.
Here's another memory-efficient solution that should be a bit quicker than larsmans'. It's based on finding the set of unique indices for the nonzero elements in the two arrays using code from Jaime's excellent answer here.
import numpy as np
from scipy import sparse
def sparsemax(X, Y):
    # the indices of all non-zero elements in both arrays
    idx = np.hstack((X.nonzero(), Y.nonzero()))
    # find the set of unique non-zero indices
    idx = tuple(unique_rows(idx.T).T)
    # take the element-wise max over only these indices
    X[idx] = np.maximum(X[idx].A, Y[idx].A)
    return X

def unique_rows(a):
    void_type = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    b = np.ascontiguousarray(a).view(void_type)
    idx = np.unique(b, return_index=True)[1]
    return a[idx]
Testing:
def setup(n=1000, fmt='csr'):
    return sparse.rand(n, n, format=fmt), sparse.rand(n, n, format=fmt)

X, Y = setup()
Z = sparsemax(X, Y)
print(np.all(Z.A == np.maximum(X.A, Y.A)))
# True

%%timeit X, Y = setup()
sparsemax(X, Y)
# 100 loops, best of 3: 4.92 ms per loop
The latest scipy (0.13) defines element-wise booleans for sparse matrices. So:
BisBigger = B>A
A - A.multiply(BisBigger) + B.multiply(BisBigger)
np.maximum does not (yet) work because it uses np.where, which is still trying to get the truth value of an array.
Curiously B>A returns a boolean dtype, while B>=A is float64.
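A small demonstration of the comparison-based route (a sketch; dtype details may differ across scipy versions, and the observations above were made on 0.13):
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
B = sparse.csr_matrix(np.array([[0.0, 3.0], [0.0, 1.0]]))

BisBigger = B > A  # sparse boolean mask
M = A - A.multiply(BisBigger) + B.multiply(BisBigger)
print(M.toarray())  # [[1. 3.] [0. 2.]]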
Here is a function that returns a sparse matrix that is element-wise maximum of two sparse matrices. It implements the answer by hpaulj:
def sparse_max(A, B):
    """
    Return the element-wise maximum of sparse matrices `A` and `B`.
    """
    AgtB = (A > B).astype(int)
    M = AgtB.multiply(A - B) + B
    return M
Testing:
A = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
B = sparse.csr_matrix(np.random.randint(-9,10, 25).reshape((5,5)))
M = sparse_max(A, B)
M2 = sparse_max(B, A)
# Test symmetry:
print((M.A == M2.A).all())
# Test that M is larger or equal to A and B, element-wise:
print((M.A >= A.A).all())
print((M.A >= B.A).all())
from scipy import sparse
from numpy import array
I = array([0,3,1,0])
J = array([0,3,1,2])
V = array([4,5,7,9])
A = sparse.coo_matrix((V,(I,J)),shape=(4,4))
A.data.max()
9
If you haven't already, you should try out IPython. You could have saved yourself time by making your sparse matrix A and then simply typing A. followed by Tab; this prints a list of the methods you can call on A. From this you would see that A.data gives you the non-zero entries as an array, so you just want the maximum of that.

How to run a .py module?

I've got zero experience with Python. I have looked at some tutorial materials, but it seems difficult to understand advanced code, so I came here for a more specific answer.
My goal is to reproduce the code on my own computer.
Here is the scenario:
I'm a graduate student studying tensor factorization in relational learning. A paper [1] provides code to run this algorithm, as follows:
import logging, time
from numpy import dot, zeros, kron, array, eye, argmax
from numpy.linalg import qr, pinv, norm, inv
from scipy.linalg import eigh
from numpy.random import rand
__version__ = "0.1"
__all__ = ['rescal', 'rescal_with_random_restarts']
__DEF_MAXITER = 500
__DEF_INIT = 'nvecs'
__DEF_PROJ = True
__DEF_CONV = 1e-5
__DEF_LMBDA = 0
_log = logging.getLogger('RESCAL')
def rescal_with_random_restarts(X, rank, restarts=10, **kwargs):
    """
    Restarts RESCAL multiple times from random starting points and
    returns the factorization with the best fit.
    """
    models = []
    fits = []
    for i in range(restarts):
        res = rescal(X, rank, init='random', **kwargs)
        models.append(res)
        fits.append(res[2])
    return models[argmax(fits)]
def rescal(X, rank, **kwargs):
    """
    RESCAL

    Factors a three-way tensor X such that each frontal slice
    X_k = A * R_k * A.T. The frontal slices of a tensor are
    N x N matrices that correspond to the adjacency matrices
    of the relational graph for a particular relation.

    For a full description of the algorithm see:
    Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel,
    "A Three-Way Model for Collective Learning on Multi-Relational Data",
    ICML 2011, Bellevue, WA, USA

    Parameters
    ----------
    X : list
        List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
    rank : int
        Rank of the factorization
    lmbda : float, optional
        Regularization parameter for A and R_k factor matrices. 0 by default
    init : string, optional
        Initialization method of the factor matrices. 'nvecs' (default)
        initializes A based on the eigenvectors of X. 'random' initializes
        the factor matrices randomly.
    proj : boolean, optional
        Whether or not to use the QR decomposition when computing R_k.
        True by default
    maxIter : int, optional
        Maximum number of iterations of the ALS algorithm. 500 by default.
    conv : float, optional
        Stop when residual of factorization is less than conv. 1e-5 by default

    Returns
    -------
    A : ndarray
        array of shape ('N', 'rank') corresponding to the factor matrix A
    R : list
        list of 'M' arrays of shape ('rank', 'rank') corresponding to the factor matrices R_k
    f : float
        function value of the factorization
    iter : int
        number of iterations until convergence
    exectimes : ndarray
        execution times to compute the updates in each iteration
    """
    # init options
    ainit = kwargs.pop('init', __DEF_INIT)
    proj = kwargs.pop('proj', __DEF_PROJ)
    maxIter = kwargs.pop('maxIter', __DEF_MAXITER)
    conv = kwargs.pop('conv', __DEF_CONV)
    lmbda = kwargs.pop('lmbda', __DEF_LMBDA)
    if not len(kwargs) == 0:
        raise ValueError('Unknown keywords (%s)' % (kwargs.keys()))

    sz = X[0].shape
    dtype = X[0].dtype
    n = sz[0]
    k = len(X)

    _log.debug('[Config] rank: %d | maxIter: %d | conv: %7.1e | lmbda: %7.1e' %
               (rank, maxIter, conv, lmbda))
    _log.debug('[Config] dtype: %s' % dtype)

    # precompute norms of X
    normX = [norm(M) ** 2 for M in X]
    Xflat = [M.flatten() for M in X]
    sumNormX = sum(normX)

    # initialize A
    if ainit == 'random':
        A = array(rand(n, rank), dtype=dtype)
    elif ainit == 'nvecs':
        S = zeros((n, n), dtype=dtype)
        T = zeros((n, n), dtype=dtype)
        for i in range(k):
            T = X[i]
            S = S + T + T.T
        evals, A = eigh(S, eigvals=(n - rank, n - 1))
    else:
        # the original raised a bare string, which is not valid Python
        raise ValueError('Unknown init option ("%s")' % ainit)

    # initialize R
    if proj:
        Q, A2 = qr(A)
        X2 = __projectSlices(X, Q)
        R = __updateR(X2, A2, lmbda)
    else:
        R = __updateR(X, A, lmbda)

    # compute factorization
    fit = fitchange = fitold = f = 0
    exectimes = []
    ARAt = zeros((n, n), dtype=dtype)
    for iter in range(maxIter):  # xrange in the original Python 2 code
        tic = time.time()        # time.clock() originally; removed in Python 3.8
        fitold = fit
        A = __updateA(X, A, R, lmbda)
        if proj:
            Q, A2 = qr(A)
            X2 = __projectSlices(X, Q)
            R = __updateR(X2, A2, lmbda)
        else:
            R = __updateR(X, A, lmbda)

        # compute fit value
        f = lmbda * (norm(A) ** 2)
        for i in range(k):
            ARAt = dot(A, dot(R[i], A.T))
            f += normX[i] + norm(ARAt) ** 2 - 2 * dot(Xflat[i], ARAt.flatten()) + lmbda * (R[i].flatten() ** 2).sum()
        f *= 0.5

        fit = 1 - f / sumNormX
        fitchange = abs(fitold - fit)

        toc = time.time()
        exectimes.append(toc - tic)
        _log.debug('[%3d] fit: %.5f | delta: %7.1e | secs: %.5f' %
                   (iter, fit, fitchange, exectimes[-1]))
        if iter > 1 and fitchange < conv:
            break
    return A, R, f, iter + 1, array(exectimes)
def __updateA(X, A, R, lmbda):
    n, rank = A.shape
    F = zeros((n, rank), dtype=X[0].dtype)
    E = zeros((rank, rank), dtype=X[0].dtype)
    AtA = dot(A.T, A)
    for i in range(len(X)):
        F += dot(X[i], dot(A, R[i].T)) + dot(X[i].T, dot(A, R[i]))
        E += dot(R[i], dot(AtA, R[i].T)) + dot(R[i].T, dot(AtA, R[i]))
    A = dot(F, inv(lmbda * eye(rank) + E))
    return A
def __updateR(X, A, lmbda):
    r = A.shape[1]
    R = []
    At = A.T
    if lmbda == 0:
        ainv = dot(pinv(dot(At, A)), At)
        for i in range(len(X)):
            R.append(dot(ainv, dot(X[i], ainv.T)))
    else:
        AtA = dot(At, A)
        tmp = inv(kron(AtA, AtA) + lmbda * eye(r ** 2))
        for i in range(len(X)):
            AtXA = dot(At, dot(X[i], A))
            R.append(dot(AtXA.flatten(), tmp).reshape(r, r))
    return R
def __projectSlices(X, Q):
    q = Q.shape[1]
    X2 = []
    for i in range(len(X)):
        X2.append(dot(Q.T, dot(X[i], Q)))
    return X2
It's tedious to paste such long code, but there is no other way to explain my problems. I'm sorry about this.
I import this module and pass it arguments according to the author's website:
import pickle, sys
from rescal import rescal

rank = int(sys.argv[1])  # sys.argv values are strings, so convert
with open('us-presidents.pickle', 'rb') as fh:
    X = pickle.load(fh)  # pickle.load expects a file object, not a filename
A, R, f, iter, exectimes = rescal(X, rank, lmbda=1.0)
The dataset us-presidents.rdf can be found here.
My questions are:
According to the code comments, the tensor X is a list. I don't quite understand this: how do I relate a list to a tensor in Python? Can I understand it as tensor = list in Python?
Should I convert the RDF format to a triple (subject, predicate, object) format first? I'm not sure of the data structure of X. How do I assign values to X by hand?
Then, how do I run it?
I pasted the author's code without his authorization; is it an act of infringement? If so, I am sorry and I will delete it soon.
These problems may be a little boring, but they are important to me. Any help would be greatly appreciated.
[1] Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel, "A Three-Way Model for Collective Learning on Multi-Relational Data", in Proceedings of the 28th International Conference on Machine Learning, 2011, Bellevue, WA, USA.
To answer Q2: you need to transform the RDF data and save it before you can load it from the file 'us-presidents.pickle'. The author of that code probably did this once, because Python's native pickle format loads faster. Since the pickle format includes the datatype of the data, X may be an instance of some numpy class, and you would need either an example pickle file as used by this code, or some code doing the pickle.dump, to figure out how to convert the RDF data to the particular pickle file that rescal expects.
This might also answer Q1: from the code you can see that the X parameter to rescal has a length (k = len(X)) and can be indexed (T = X[i]), so its elements are used as a list (even if X might be some other datatype that just behaves like one).
As an aside: if you are not familiar with Python and are just interested in the result of the computation, you might get more help by contacting the author of the software.
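To make the pickle point concrete, here is a hypothetical round-trip; the file name and sizes are placeholders, not the actual format the author used:
import pickle
import numpy as np

# Hypothetical example: X as a list of N x N frontal slices, one per relation
N = 5
X = [np.zeros((N, N)) for _ in range(3)]

# whoever produced the pickle presumably did something like this once:
with open('my-tensor.pickle', 'wb') as fh:
    pickle.dump(X, fh)

# ...after which it can be loaded back cheaply:
with open('my-tensor.pickle', 'rb') as fh:
    X = pickle.load(fh)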
According to the code comments, the tensor X is a list. I don't quite understand this: how do I relate a list to a tensor in Python? Can I understand it as tensor = list in Python?
Not necessarily, but the author of the code has decided to represent the tensor data as a list. As the comments indicate, the list X contains:
List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
That means the tensor is represented as a list of frontal slices, each an N x N array: [X_0, ..., X_{M-1}].
I'm not sure of the data structure of X. How do I assign values to X by hand?
Now that we know the data structure of X, we can fill it with ordinary list indexing and assignment. The following assigns an N x N slice to the first position in the list X (the first position is at index 0, the second at index 1, et cetera):
X[0] = np.zeros((N, N))
Similarly, the following assigns another slice to the second position:
X[1] = np.eye(N)
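Putting the pieces together, a hypothetical minimal run of rescal on a hand-built X (entity and relation counts are made up for illustration; init='random' avoids the older scipy eigh API used by the 'nvecs' path):
import numpy as np
from rescal import rescal

# toy tensor: 2 relations over N = 3 entities,
# stored as a list of N x N frontal slices (adjacency matrices)
N = 3
X = [np.zeros((N, N)) for _ in range(2)]
X[0][0, 1] = 1.0  # relation 0 holds between entity 0 and entity 1
X[1][2, 0] = 1.0  # relation 1 holds between entity 2 and entity 0

A, R, f, n_iter, exectimes = rescal(X, rank=2, init='random', lmbda=1.0)
print(A.shape)             # (3, 2): one rank-2 embedding per entity
print(len(R), R[0].shape)  # 2 (2, 2): one mixing matrix per relation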
