I was reading about attention and came across this equation:
import einops
from fancy_einsum import einsum
import torch
x = torch.rand((200, 10, 768))
y = torch.rand((20, 768, 64))
res = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", x, y)
And I am not able to understand the underlying operations that give the result res
I thought it might be matmul and tried this:
import torch
x_ = x.unsqueeze(dim = 2).unsqueeze(dim = 2)
y_ = torch.broadcast_to(y, (1, 1, 20, 768, 64))
res2 = x_ @ y_
res2 = res2.squeeze(dim = -2)
(res == res2).all() # Prints False
But that does not seem to be right.
Any help regarding this is greatly appreciated
So whenever using einsum you best think about the meaning of the dimensions. Basically we perform a multiplication between the two inputs in this case. The signature passed to einsum shows what dimensions will be preserved and which ones will be "summed away". I simplified the signature with single letters here:
res = einsum("b q m, n m h -> b q n h", x, y)
We can read from this that both x and y have three dimensions. Furthermore both have a dimension called m, and this doesn't appear in the output. So we can conclude that it gets "summed away". So for each entry of the output we have following formula. For simplicity I reused the dimension names as indices, so for every b,q,n,h we get
res[b,q,n,h] = \sum_m x[b,q,m] * y[n,m,h]
To do this with any other function than einsum is usually more cumbersome. So first we need to reorder and unsqueeze the dimensions in a way that they are compatible to be multiplied, so we can do the following (the shapes annotated above):
#(b,q,m,n,h) (b, q, m, 1, 1) (m, n, h)
product = x[:, :, :, None, None] * y.permute([1,0,2])
Due to the broadcasting rules, the second (y-) term will implicitly get the required leading dummy dimensions.
Then we can "sum away" the dimension m:
res = product.sum(dim=2) # (b,q,n,h)
So you can interpret that as a matrix multiplication if you want, or also just a scalar product, but of course with many "batch"-dimensions.
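To make the equivalence concrete, here is a minimal sketch that checks the three readings against each other. It uses NumPy rather than torch and much smaller, made-up shapes, so treat it as an illustration of the contraction and not as the original setup. It also suggests that the matmul attempt in the question computes the same contraction, and that the exact `==` comparison most likely fails only because of floating-point rounding, which is why `allclose` is used here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 3, 8))   # (batch, query_pos, d_model), small sizes for illustration
y = rng.random((5, 8, 6))   # (n_heads, d_model, d_head)

# Reference: contract over the shared axis m (d_model)
res_einsum = np.einsum("bqm,nmh->bqnh", x, y)

# Same contraction via broadcasting and an explicit sum over m
product = x[:, :, :, None, None] * y.transpose(1, 0, 2)   # (b, q, m, n, h)
res_sum = product.sum(axis=2)                              # (b, q, n, h)

# Same contraction as a batched matrix product: (1, m) @ (m, h) per (b, q, n)
res_matmul = (x[:, :, None, None, :] @ y[None, None]).squeeze(-2)

print(res_einsum.shape)
# Compare with a tolerance; the summation order differs between the variants
print(np.allclose(res_einsum, res_sum), np.allclose(res_einsum, res_matmul))
```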
The shape of Y[n,:,:] is (200,1), so I need Z[n,:,:]*H[n,:,:] (or something related) to be (200,1) as well. But Z[n,:,:] and H[n,:,:] are both (200,6), so I need a multiplication operator that multiplies them and gets rid of the 6 to give a result of shape (200,1). Any suggestions? The code is below:
M = 200
dW = np.sqrt(1/n)*randn(n,M,D);
H=cap(dW,1/n,np.log(n))#the generation of the Brownian motion increment
X = define_X(1,dW,1,1,1)
Y = np.zeros((n+1,M,1))
Z = np.zeros_like(X)
Y[n-1,:,:]= Y[n,:,:] +f(X[n-1,:,:],Y[n,:,:],Z[n-1,:,:])*(1/10)-Z[n,:,:]*H[n,:,:]
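One operation that maps two (200,6) arrays to a (200,1) result is a row-wise dot product: multiply elementwise, then sum over the axis of length 6. Whether that is the mathematically intended product of Z and H is an assumption here. The names `Z_n` and `H_n` below are hypothetical stand-ins for `Z[n,:,:]` and `H[n,:,:]`, filled with random data.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 200, 6
Z_n = rng.standard_normal((M, D))   # stands in for Z[n, :, :]
H_n = rng.standard_normal((M, D))   # stands in for H[n, :, :]

# Row-wise dot product: multiply elementwise, then sum away the axis of length D.
# keepdims=True keeps that axis with length 1, giving shape (200, 1).
zh = (Z_n * H_n).sum(axis=1, keepdims=True)

# The same contraction written with einsum, with the trailing axis re-added
zh_einsum = np.einsum("md,md->m", Z_n, H_n)[:, None]

print(zh.shape)
```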
I am attempting to write a program which constructs a matrix and performs a singular value decomposition on it. I am evaluating the function ax^2 +bx + 1 on a grid. I then make a uniform meshgrid of a and b. The rows of the matrix correspond to different quadratic coefficients, while each column corresponds to a grid point at which the function is evaluated.
The matlab code is here:
% Collect data
x = linspace(-1,1,100);
[a,b] = meshgrid(0:0.1:1,0:0.1:1);
sz = size(D)
% Build “Dose” matrix
for i=1:numel(a)
    D(:,i) = a(i)*x.^2+b(i)*x+1;
end
% Do the SVD:
[U,S,V] = svd(D);
D_reconstructed = U*S*V';
This is my attempt at a solution:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-1, 1, 100)
def f(x, a, b):
    return a*x*x + b*x + 1
a, b = np.mgrid[0:1:0.1,0:1:0.1]
#a = b = np.arange(0,1,0.01)
D = np.zeros((x.size, a.size))
for i in range(a.size):
    D[i] = a[i]*x*x +b[i]*x +1
U, S, V = np.linalg.svd(D)
fig = plt.figure()
ax = plt.axes(projection="3d")
ax.scatter(a, b, V[0])
but I always get broadcasting errors which I am not sure how to fix.
Firstly, in MATLAB you're assigning to D(:,i), but in python you're assigning to D[i]. The latter is equivalent to D[i, ...] which is in your case D[i, :]. Instead you seem to need D[:, i].
Secondly, in MATLAB using a linear index into a 2d array (namely a and b) will give you flattened views. If you do that with numpy you get slices of an array instead, just as I mentioned with D[i].
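A small sketch of the indexing difference, using the same `mgrid` call as in the question. The variable names `row`, `elem_c`, and `elem_f` are only illustrative.

```python
import numpy as np

a, b = np.mgrid[0:1:0.1, 0:1:0.1]   # both have shape (10, 10)

# A single integer index on a 2d NumPy array selects a whole row
row = a[3]                           # shape (10,)

# A flat (linear) index has to be requested explicitly
elem_c = a.ravel()[3]                # row-major (C) order: element a[0, 3]
elem_f = a.ravel(order="F")[3]       # column-major order, the way MATLAB counts: a[3, 0]

print(row.shape, elem_c, elem_f)
```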
You can do away with the loop with broadcasting and getting your desired 2d array by .ravelling (or reshaping) your a and b arrays:
x = np.linspace(-1, 1, 100)[:, None] # inject trailing singleton for broadcasting
a, b = np.mgrid[0:1:0.1, 0:1:0.1]
D = a.ravel() * x**2 + b.ravel() * x + 1
The way this works is that x has shape (100, 1) after we inject a trailing singleton (in MATLAB trailing singletons are implied, in numpy leading ones), and both a.ravel() and b.ravel() have shape (10*10,) which is compatible with (1, 10*10), making broadcasting possible into shape (100, 10*10). You could also replace the calls to ravel with
a, b = np.mgrid[...].reshape(2, -1)
which is a trick I sometimes use, but this is harder to read if you're unfamiliar with the pattern.
Side note: it's better to use example data where dimensions end up being of different size so that you notice if something ends up being transposed.
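As a sanity check, the following sketch builds the matrix both ways (a loop that fills `D[:, i]`, and the broadcast expression) and then runs the SVD. The names `D_loop`, `xc`, and `Vh` are mine. Note that `np.linalg.svd` returns the third factor already transposed, unlike MATLAB's `svd`, so the reconstruction does not transpose it again.

```python
import numpy as np

x = np.linspace(-1, 1, 100)
a, b = np.mgrid[0:1:0.1, 0:1:0.1]
a_flat, b_flat = a.ravel(), b.ravel()

# Loop version, filling one column per (a, b) pair as in the MATLAB code
D_loop = np.zeros((x.size, a_flat.size))
for i in range(a_flat.size):
    D_loop[:, i] = a_flat[i] * x**2 + b_flat[i] * x + 1

# Broadcast version: (100, 1) against (100,) gives (100, 100)
xc = x[:, None]
D = a_flat * xc**2 + b_flat * xc + 1

# Reduced SVD; S is a 1d array of singular values, Vh is V transposed
U, S, Vh = np.linalg.svd(D, full_matrices=False)
D_reconstructed = (U * S) @ Vh

print(np.allclose(D, D_loop), np.allclose(D, D_reconstructed))
```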
I find myself reshaping 1D vectors way too many times. I wonder if this is because I'm doing something wrong, or because it is an inherent fault of numpy.
Why can't numpy infer that when it gets an object of shape (400,) it should transform it to (400,1)? And why do so many numpy operations result in removing the axis completely?
def predict(Theta1, Theta2, X):
    m = X.shape[0]
    X = np.c_[np.ones(m), X]
    hidden = sigmoid(X @ Theta1.T)
    hidden = np.c_[np.ones(m), hidden]
    output = sigmoid(hidden @ Theta2.T)
    result = np.argmax(output, axis=1) + 1 # removes the 2nd axis - (400,)
    return result.reshape((-1, 1)) # re-add the axis - (400,1)
pred = predict(Theta1, Theta2, X)
print(np.mean(pred == y))
If I don't reshape the result in the last row, I get funky behavior when comparing pred (400,) and y (400,1).
you can use
np.array_split(data, s)
knowing that new dimensions will have length of s (data shape is s * s)
NumPy 1.22 added an optional keepdims argument to argmax, so the reduced axis can be kept with length 1 instead of being removed.
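A short sketch of both effects, assuming NumPy 1.22 or newer. The `output` and `y` arrays are random placeholders, not the data from the question. It shows why comparing (400,) with (400,1) looks "funky" (the comparison broadcasts to a 400 x 400 table) and how `keepdims=True` avoids the reshape.

```python
import numpy as np

rng = np.random.default_rng(0)
output = rng.random((400, 10))            # hypothetical class scores
y = rng.integers(1, 11, size=(400, 1))    # hypothetical labels in 1..10, as a column

flat_pred = np.argmax(output, axis=1) + 1                 # shape (400,)
col_pred = np.argmax(output, axis=1, keepdims=True) + 1   # shape (400, 1)

# (400,) against (400, 1) broadcasts to a (400, 400) comparison table
print((flat_pred == y).shape)
# Matching shapes compare element by element
print((col_pred == y).shape)
```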
How can I simplify and extend the following code for arbitrary shapes of A?
import numpy as np
A = np.random.random([10,12,13,5,5])
B = np.zeros([10,12,13,10,10])
s2 = np.array([[0,1],[-1,0]])
for i in range(10):
    for j in range(12):
        for k in range(13):
            B[i,j,k,:,:] = np.kron(A[i,j,k,:,:],s2)
I know it would be possible with np.einsum, but also there I would have to explicitly give the shape.
That output shape has to be computed for the last two axes -
out_shp = A.shape[:-2] + tuple(A.shape[-2:]*np.array(s2.shape))
Then einsum or explicit extension of dims could be used -
B_out = (A[...,:,None,:,None]*s2[:,None]).reshape(out_shp)
B_out = np.einsum('ijklm,no->ijklnmo',A,s2).reshape(out_shp)
That einsum one could be generalized more for generic dims with ellipsis ... -
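A sketch of what the ellipsis form could look like, checked against an explicit loop over the leading axes. The shapes are made up and deliberately all different, and `B_ref` is just a reference built with `np.kron`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4, 5, 6))          # arbitrary leading axes; the last two are the matrix axes
s2 = np.array([[0, 1], [-1, 0]])

# Only the last two axes grow, each by the matching size of s2
out_shp = A.shape[:-2] + tuple(np.array(A.shape[-2:]) * np.array(s2.shape))

# '...' absorbs however many leading axes A has
B_out = np.einsum('...lm,no->...lnmo', A, s2).reshape(out_shp)

# Reference: explicit loop over the leading axes
B_ref = np.zeros(out_shp)
for idx in np.ndindex(*A.shape[:-2]):
    B_ref[idx] = np.kron(A[idx], s2)

print(B_out.shape, np.allclose(B_out, B_ref))
```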
Extend to generic dims
We can generalize to generic dims that would accept the axes along which the kronecker multiplications are to be performed with some reshaping work -
def kron_along_axes(a, b, axis):
    # Extend a to the extent of the broadcasted o/p shape
    ae = a.reshape(np.insert(a.shape,np.array(axis)+1,1))
    # Extend b to the extent of the broadcasted o/p shape:
    # put b's sizes at the chosen axes, then add singleton axes before them
    d = np.ones(a.ndim,dtype=int)
    np.put(d,axis,b.shape)
    be = b.reshape(np.insert(d,np.array(axis),1))
    # Get o/p and reshape back to a's dims
    out = ae*be
    out_shp = np.array(a.shape)
    out_shp[list(axis)] *= b.shape
    return out.reshape(out_shp)
Thus, to solve our case, it would be -
B = kron_along_axes(A, s2, axis=(3,4))
With numpy.kron
If you are looking for elegance and okay with something slower, we can use the built-in np.kron too with some axes-permutations -
def kron_along_axes(a, b, axis):
    # Move the chosen axes to the end, apply np.kron, then undo the permutation
    new_order = list(np.setdiff1d(range(a.ndim),axis)) + list(axis)
    return np.kron(a.transpose(new_order),b).transpose(np.argsort(new_order))
flattened_A = A.reshape([-1, A.shape[-2], A.shape[-1]])
flattened_kron_product = np.kron(flattened_A, s2)
dims = list(A.shape[:-2]) + [flattened_kron_product.shape[-2], flattened_kron_product.shape[-1]]
result = flattened_kron_product.reshape(dims)
Subtracting result from B results in a zero-filled matrix.
I've got zero experience with Python. I have looked through some tutorials, but advanced code is still difficult for me to understand, so I came here for a more specific answer.
My goal is to run the code on my own computer.
Here is the scenario:
I'm a graduate student studying tensor factorization in relational learning. A paper [1] provides code to run this algorithm, as follows:
import logging, time
from numpy import dot, zeros, kron, array, eye, argmax
from numpy.linalg import qr, pinv, norm, inv
from scipy.linalg import eigh
from numpy.random import rand
__version__ = "0.1"
__all__ = ['rescal', 'rescal_with_random_restarts']
__DEF_INIT = 'nvecs'
__DEF_PROJ = True
__DEF_CONV = 1e-5
__DEF_MAXITER = 500
__DEF_LMBDA = 0
_log = logging.getLogger('RESCAL')
def rescal_with_random_restarts(X, rank, restarts=10, **kwargs):
    """
    Restarts RESCAL multiple time from random starting point and
    returns factorization with best fit.
    """
    models = []
    fits = []
    for i in range(restarts):
        res = rescal(X, rank, init='random', **kwargs)
        models.append(res)
        fits.append(res[2])
    return models[argmax(fits)]
def rescal(X, rank, **kwargs):
    """
    Factors a three-way tensor X such that each frontal slice
    X_k = A * R_k * A.T. The frontal slices of a tensor are
    N x N matrices that correspond to the adjacency matrices
    of the relational graph for a particular relation.

    For a full description of the algorithm see:
      Maximilian Nickel, Volker Tresp, Hans-Peter-Kriegel,
      "A Three-Way Model for Collective Learning on Multi-Relational Data",
      ICML 2011, Bellevue, WA, USA

    Parameters
    ----------
    X : list
        List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
    rank : int
        Rank of the factorization
    lmbda : float, optional
        Regularization parameter for A and R_k factor matrices. 0 by default
    init : string, optional
        Initialization method of the factor matrices. 'nvecs' (default)
        initializes A based on the eigenvectors of X. 'random' initializes
        the factor matrices randomly.
    proj : boolean, optional
        Whether or not to use the QR decomposition when computing R_k.
        True by default
    maxIter : int, optional
        Maximum number of iterations of the ALS algorithm. 500 by default.
    conv : float, optional
        Stop when residual of factorization is less than conv. 1e-5 by default

    Returns
    -------
    A : ndarray
        array of shape ('N', 'rank') corresponding to the factor matrix A
    R : list
        list of 'M' arrays of shape ('rank', 'rank') corresponding to the factor matrices R_k
    f : float
        function value of the factorization
    iter : int
        number of iterations until convergence
    exectimes : ndarray
        execution times to compute the updates in each iteration
    """
    # init options
    ainit = kwargs.pop('init', __DEF_INIT)
    proj = kwargs.pop('proj', __DEF_PROJ)
    maxIter = kwargs.pop('maxIter', __DEF_MAXITER)
    conv = kwargs.pop('conv', __DEF_CONV)
    lmbda = kwargs.pop('lmbda', __DEF_LMBDA)
    if not len(kwargs) == 0:
        raise ValueError( 'Unknown keywords (%s)' % (kwargs.keys()) )

    sz = X[0].shape
    dtype = X[0].dtype
    n = sz[0]
    k = len(X)

    _log.debug('[Config] rank: %d | maxIter: %d | conv: %7.1e | lmbda: %7.1e' % (rank,
        maxIter, conv, lmbda))
    _log.debug('[Config] dtype: %s' % dtype)

    # precompute norms of X
    normX = [norm(M)**2 for M in X]
    Xflat = [M.flatten() for M in X]
    sumNormX = sum(normX)

    # initialize A
    if ainit == 'random':
        A = array(rand(n, rank), dtype=dtype)
    elif ainit == 'nvecs':
        S = zeros((n, n), dtype=dtype)
        T = zeros((n, n), dtype=dtype)
        for i in range(k):
            T = X[i]
            S = S + T + T.T
        evals, A = eigh(S,eigvals=(n-rank,n-1))
    else :
        raise ValueError('Unknown init option ("%s")' % ainit)

    # initialize R
    if proj:
        Q, A2 = qr(A)
        X2 = __projectSlices(X, Q)
        R = __updateR(X2, A2, lmbda)
    else :
        R = __updateR(X, A, lmbda)

    # compute factorization
    fit = fitchange = fitold = f = 0
    exectimes = []
    ARAt = zeros((n,n), dtype=dtype)
    for iter in xrange(maxIter):
        tic = time.clock()
        fitold = fit
        A = __updateA(X, A, R, lmbda)
        if proj:
            Q, A2 = qr(A)
            X2 = __projectSlices(X, Q)
            R = __updateR(X2, A2, lmbda)
        else :
            R = __updateR(X, A, lmbda)

        # compute fit value
        f = lmbda*(norm(A)**2)
        for i in range(k):
            ARAt = dot(A, dot(R[i], A.T))
            f += normX[i] + norm(ARAt)**2 - 2*dot(Xflat[i], ARAt.flatten()) + lmbda*(R[i].flatten()**2).sum()
        f *= 0.5

        fit = 1 - f / sumNormX
        fitchange = abs(fitold - fit)

        toc = time.clock()
        exectimes.append( toc - tic )
        _log.debug('[%3d] fit: %.5f | delta: %7.1e | secs: %.5f' % (iter,
            fit, fitchange, exectimes[-1]))
        if iter > 1 and fitchange < conv:
            break
    return A, R, f, iter+1, array(exectimes)
def __updateA(X, A, R, lmbda):
    n, rank = A.shape
    F = zeros((n, rank), dtype=X[0].dtype)
    E = zeros((rank, rank), dtype=X[0].dtype)

    AtA = dot(A.T,A)
    for i in range(len(X)):
        F += dot(X[i], dot(A, R[i].T)) + dot(X[i].T, dot(A, R[i]))
        E += dot(R[i], dot(AtA, R[i].T)) + dot(R[i].T, dot(AtA, R[i]))
    A = dot(F, inv(lmbda * eye(rank) + E))
    return A

def __updateR(X, A, lmbda):
    r = A.shape[1]
    R = []
    At = A.T
    if lmbda == 0:
        ainv = dot(pinv(dot(At, A)), At)
        for i in range(len(X)):
            R.append( dot(ainv, dot(X[i], ainv.T)) )
    else :
        AtA = dot(At, A)
        tmp = inv(kron(AtA, AtA) + lmbda * eye(r**2))
        for i in range(len(X)):
            AtXA = dot(At, dot(X[i], A))
            R.append( dot(AtXA.flatten(), tmp).reshape(r, r) )
    return R

def __projectSlices(X, Q):
    q = Q.shape[1]
    X2 = []
    for i in range(len(X)):
        X2.append( dot(Q.T, dot(X[i], Q)) )
    return X2
Sorry for pasting such a long piece of code, but I see no other way to explain my problem.
I import this module and pass them arguments according to the author's website:
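import pickle, sys
from rescal import rescal
rank = int(sys.argv[1])  # command-line arguments arrive as strings
with open('us-presidents.pickle', 'rb') as fh:  # pickle.load expects an open file, not a path
    X = pickle.load(fh)
A, R, f, iter, exectimes = rescal(X, rank, lmbda=1.0)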
The dataset us-presidents.rdf can be found here.
My questions are:
According to the code note, the tensor X is a list. I don't quite understand this: how do I relate a list to a tensor in Python? Can I understand tensor = list in Python?
Should I convert the RDF format to a triple (subject, predicate, object) format first? I'm not sure of the data structure of X. How do I assign values to X by hand?
Then, how do I run it?
I pasted the author's code without his authorization. Is that an act of infringement? If so, I am sorry and I will delete it soon.
These questions may be basic, but they are important to me. Any help would be greatly appreciated.
[1] Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel,
A Three-Way Model for Collective Learning on Multi-Relational Data,
in Proceedings of the 28th International Conference on Machine Learning, 2011 , Bellevue, WA, USA
To answer Q2: you need to transform the RDF and save it before you can load it from the file 'us-presidents.pickle'. The author of that code probably did that once because the Python native pickle format loads faster. As the pickle format includes the datatype of the data, it is possible that X is some numpy class instance and you would need either an example pickle file as used by this code, or some code doing the pickle.dump to figure out how to convert from RDF to this particular pickle file as rescal expects it.
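As a rough illustration of that conversion step, here is a minimal sketch that turns (subject, predicate, object) triples into one N x N adjacency matrix per relation and round-trips the list through pickle. The triples, the file name `toy-tensor.pickle`, and the dense `float` matrices are all assumptions for the example. How the author actually converted the RDF file is not known from the post.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical toy data: (subject, predicate, object) triples
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "carol"),
]

# Give every entity and every relation an integer index
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})
ent_idx = {e: i for i, e in enumerate(entities)}
rel_idx = {r: k for k, r in enumerate(relations)}

# One N x N adjacency matrix (frontal slice) per relation
n = len(entities)
X = [np.zeros((n, n)) for _ in relations]
for s, p, o in triples:
    X[rel_idx[p]][ent_idx[s], ent_idx[o]] = 1.0

# Save once, then reload; pickle works on open binary files
path = os.path.join(tempfile.mkdtemp(), "toy-tensor.pickle")
with open(path, "wb") as fh:
    pickle.dump(X, fh)
with open(path, "rb") as fh:
    X_loaded = pickle.load(fh)

print(len(X_loaded), X_loaded[0].shape)
```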
So this might answer Q1: the tensor consists of a list of elements. From the code you can see that the X parameter to rescal has a length (k = len(X)) and can be indexed (T = X[i]). So its elements are used as in a list (even if X might be some other datatype that just behaves like one).
As an aside: If you are not familiar with Python and are just interested in the result of the computation, you might get more help contacting the author of the software.
According to the code note, the tensor X is a list. I don't quite understand this, how do I relate a list to a tensor in Python? Can I understand tensor = list in Python?
Not necessarily but the author of the code has decided to represent the tensor data as a list data structure. As the comments indicate, the list X contains:
List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
That means the tensor is represented as a list of matrices, [X_1, ..., X_M], where each X_k is a two-dimensional array of shape (N, N). The code reads X[0].shape and X[0].dtype, so the elements need to be NumPy arrays (or something that behaves like one) rather than plain tuples.
I'm not sure of the data structure of X. How do I assign values to X by hand?
Now that we know the data structure of X, we can assign values to it by indexing. The following replaces the first slice in the list X with an N x N array of zeros (the first position is at index 0, the second at index 1, et cetera):
X[0] = zeros((N, N))
Similarly, the following sets the entry in row 2, column 4 of the second slice to 1:
X[1][2, 4] = 1
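To relate the list to the usual picture of a three-way tensor, here is a small sketch. Since rescal reads `X[0].shape` and `X[0].dtype`, the slices are assumed to be NumPy arrays. The size `N = 3` and the two entries set by hand are made up.

```python
import numpy as np

N = 3
# Two relations, each an N x N adjacency matrix
X = [np.zeros((N, N)), np.zeros((N, N))]
X[0][0, 1] = 1.0    # relation 0 holds between entity 0 and entity 1
X[1][2, 0] = 1.0    # relation 1 holds between entity 2 and entity 0

# The quantities rescal reads from X
k = len(X)          # number of slices (relations)
sz = X[0].shape     # (N, N)

# Stacking the slices along a third axis gives the N x N x k tensor
tensor = np.stack(X, axis=-1)
print(k, sz, tensor.shape)
```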