How to fastly build a graph with Numpy? - python

In order to use PyStruct to perform image segmentation (by means of inference [1]), I first need to build a graph whose nodes correspond to pixels and edges are the link between these pixels.
I have thus written a function, which works, to do so:
def create_graph_for_pystruct(mrf, betas, nb_labels):
M, N = mrf.shape
b1, b2, b3, b4 = betas
edges = []
pairwise = np.zeros((nb_labels, nb_labels))
# loop over rows
for i in range(M):
# loop over columns
for j in range(N):
# get rid of pixels belonging to image's borders
if i!=0 and i!=M-1 and j!=0 and j!=N-1:
# get the current linear index
current_linear_ind = i * N + j
# retrieve its neighborhood (yield a list of tuple (row, col))
neigh = np.array(getNeighborhood(i, j, M, N))
# convert neighbors indices to linear ones
neigh_linear_ind = neigh[:, 0] * N + neigh[:, 1]
# add edges
[edges.append((current_linear_ind, n)) for n in neigh_linear_ind]
mat1 = b1 * np.eye(nb_labels)
mat2 = b2 * np.eye(nb_labels)
mat3 = b3 * np.eye(nb_labels)
mat4 = b4 * np.eye(nb_labels)
pairwise = np.ma.dstack((pairwise, mat1, mat1, mat2, mat2, mat3, mat3, mat4, mat4))
return np.array(edges), pairwise[:, :, 1:]
However, it is slow and I am wondering where I can improve my function in order to speed it up.
[1] https://pystruct.github.io/generated/pystruct.inference.inference_dispatch.html

Here is a code suggestion that should run much faster (in numpy one should focus on using vectorisation against for-loops). I try to build the whole output in a single pass using vectorisation, I used the helpfull np.ogrid to generate xy coordinates.
def new(mrf, betas, nb_labels):
M, N = mrf.shape
b1, b2, b3, b4 = betas
mat1,mat2,mat3,mat4 = np.array([b1,b2,b3,b4])[:,None,None]*np.eye(nb_labels)[None,:,:]
pairwise = np.array([mat1, mat1, mat2, mat2, mat3, mat3, mat4, mat4]*((M-2)*(N-2))).transpose()
m,n=np.ogrid[0:M,0:N]
a,b,c= m[0:-2]*N+n[:,0:-2],m[1:-1]*N+n[:,0:-2],m[2: ]*N+n[:,0:-2]
d,e,f= m[0:-2]*N+n[:,1:-1],m[1:-1]*N+n[:,1:-1],m[2: ]*N+n[:,1:-1]
g,h,i= m[0:-2]*N+n[:,2: ],m[1:-1]*N+n[:,2: ],m[2: ]*N+n[:,2: ]
center_index = e
edges_index = np.stack([a,b,c,d,f,g,h,i])
edges=np.empty(list(edges_index.shape)+[2])
edges[:,:,:,0]= center_index[None,:,:]
edges[:,:,:,1]= edges_index
edges=edges.reshape(-1,2)
return edges,pairwise
Timing and comparison test :
import timeit
args=(np.empty((40,50)), [1,2,3,4], 10)
f1=lambda : new(*args)
f2=lambda : create_graph_for_pystruct(*args)
edges1, pairwise1 = f1()
edges2, pairwise2 = f2()
#outputs are not exactly indentical: the order isn't the the same
#I sort both to compare the results
edges1 = edges1[np.lexsort(np.fliplr(edges1).T)]
edges2 = edges2[np.lexsort(np.fliplr(edges2).T)]
print("edges identical ?",(edges1 == edges2).all())
print("pairwise identical ?",(pairwise1 == pairwise2).all())
print("new : ",timeit.timeit(f1,number=1))
print("old : ",timeit.timeit(f2,number=1))
Output :
edges identical ? True
pairwise identical ? True
new : 0.015270026000507642
old : 4.611805051001284
Note: I had to guess what was in the getNeighborhood function

Related

Outer product of large vectors exceeds memory

I have three 1D vectors. Let's say T with 100k element array, f and df each with 200 element array:
T = [T0, T1, ..., T100k]
f = [f0, f1, ..., f200]
df = [df0, df1, ..., df200]
For each element array, I have to calculate a function such as the following:
P = T*f + T**2 *df
My first instinct was to use the NumPy outer to find the function with each combination of f and df
P1 = np.outer(f,T)
P2 = np.outer(df,T**2)
P = np.add.outer(P1, P2)
However, in this case, I am facing the ram issue and receiving the following error:
Unable to allocate 2.23 PiB for an array with shape (200, 100000, 200,
100000) and data type float64
Is there a good way that I can calculate this?
My attempt using for loops
n=100
f_range = 5e-7
df_range = 1.5e-15
fsrc = np.arange(f - n * f_range, f + n * f_range, f_range) #array of 200
dfsrc = np.arange(df - n * df_range, df + n * df_range, df_range) #array of 200
dfnus=pd.DataFrame(fsrc)
numf=dfnus.shape[0]
dfnudots=pd.DataFrame(dfsrc)
numfdot=dfnudots.shape[0]
test2D = np.zeros([numf,(numfdot)])
for indexf, f in enumerate(fsrc):
for indexfd, fd in enumerate(dfsrc):
a=make_phase(T,f,fd) #--> this is just a function that performs T*f + T**2 *df
zlauf2d=z_n(a, n=1, norm=1) #---> And this is just another function that takes this 1D "a" and gives another 1D element array
test2D[indexf, indexfd]=np.copy(zlauf2d) #---> I do this so I could make a contour plot at the end. It just copys the same thing to 2D
Now my test2D has the shape of (200,200). This is what I want, however the floor loop is taking ages and I want somehow reduce two for loop to at least one.
Using broadcasting:
P1 = (f[:, np.newaxis] * T).sum(axis=-1)
P2 = (df[:, np.newaxis] * T**2).sum(axis=-1)
P = P1[:, np.newaxis] + P2
Alternatively, using outer:
P1 = (np.outer(f, T)).sum(axis=-1)
P2 = (np.outer(df, T**2)).sum(axis=-1)
P = P1[..., np.newaxis] + P2
This produces an array of shape (f.size, df.size) == (200, 200).
Generally speaking, if the final output array size is very large, one can either:
Reduce the size of the datatypes. One way is to change the datatypes of the arrays used to calculate the final output via P1.astype(np.float32). Alternatively, some operations allow one to pass in a dtype=np.float32 as a parameter.
Chunk the computation and work with smaller subsections of the result.
Based on the most recent edit, compute an array a with shape (200, 200, 100000). Then, take its element-wise norm along the last axis to produce an array z with shape (200, 200).
a = (
f[:, np.newaxis, np.newaxis] * T
+ df[np.newaxis, :, np.newaxis] * T**2
)
# L1 norm along last axis.
z = np.abs(a).sum(axis=-1)
This produces an array of shape (f.size, df.size) == (200, 200).

How to vectorize indexing and computation when indexed tensors are different dimensions?

I'm trying to vectorize the following for-loop in Pytorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...) # Tensor of shape [B x N x d]
batch_labels = torch.tensor(...) # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = [] # Ultimately should be [B x 1]
batch_centroids = [] # Ultimately should be [B x K_i x dim]
for i in range(B):
centroids = [] # Keep track of the means for each class.
classes = torch.unique(labels) # Get the unique labels for the classes.
# NOTE: The number of classes K for each item in the batch might actually
# be different. This may complicate batch-level operations.
total_loss = 0
# For each class independently. This is the part I want to vectorize.
for cl in classes:
# Take the subset of training examples with that label.
subset = data[torch.where(labels == cl)]
# Find the centroid of that subset.
centroid = subset.mean(dim=0)
centroids.append(centroid)
# Get the distance between each point in the subset and the centroid.
dists = subset - centroid
norm = torch.linalg.norm(dists, dim=1)
# The loss is the mean of the hinge loss across the subset.
margin = norm - delta
hinge = torch.clamp(margin, min=0.0) ** 2
total_loss += hinge.mean()
# Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
loss = total_loss.mean()
batch_losses.append(loss)
batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.
It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but code should be directly translatable to torch. The key technique is to:
Sort by ragged array membership
Perform an accumulation
Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and n-length distances to respective centroids:
def vcentroids(X, label):
"""
Vectorized version of centroids.
"""
# order points by cluster label
ix = np.argsort(label)
label = label[ix]
Xz = X[ix]
# compute pos where pos[i]:pos[i+1] is span of cluster i
d = np.diff(label, prepend=0) # binary mask where labels change
pos = np.flatnonzero(d) # indices where labels change
pos = np.repeat(pos, d[pos]) # repeat for 0-length clusters
pos = np.append(np.insert(pos, 0, 0), len(X))
Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
Xsums = np.cumsum(Xz, axis=0)
Xsums = np.diff(Xsums[pos], axis=0)
counts = np.diff(pos)
c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
repeated_centroids = np.repeat(c, counts, axis=0)
aligned_centroids = repeated_centroids[inverse_permutation(ix)]
dist = np.sum((X - aligned_centroids) ** 2, axis=1)
return c, dist
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
batch_k = batch_labels.max(axis=1) + 1
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += base.expand_dims(1)
So now each batch element has a unique contiguous range of labels. I.e., the first batch element will have n labels in some range [0, k0) where k0 = batch_k[0], the second will have range [k0, k0 + k1) where k1 = batch_k[1], etc.
Then just flatten the n x B x d input to n*B x d and call the same vectorized method. Your loss function is derivable using the final distances and same position-array based reduction technique.
For a detailed explanation of how the vectorization works, see my blog post.
You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch
B = 32
N = 1000
dim = 50
K = 25
batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))
batch_one_hot = torch.nn.functional.one_hot(batch_labels)
centroids = torch.matmul(
batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
batch_data
) / batch_one_hot.sum(1)[..., None]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], axis=-1)
# Compute the rest of your loss
# ...
A couple things to watch out for:
You'll get a divide by zero for any batches that have a missing class. You can handle this by first computing the class sums (with matmul) and counts (summing the one-hot tensor along axis 1) separately. Then, mask the sums with count == 0 and divide the rest of them by their class counts.
If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from #VF1 probably makes more sense.

Calculating medoid of a cluster (Python)

So I'm running a KNN in order to create clusters. From each cluster, I would like to obtain the medoid of the cluster.
I'm employing a fractional distance metric in order to calculate distances:
where d is the number of dimensions, the first data point's coordinates are x^i, the second data point's coordinates are y^i, and f is an arbitrary number between 0 and 1
I would then calculate the medoid as:
where S is the set of datapoints, and δ is the absolute value of the distance metric used above.
I've looked online to no avail trying to find implementations of medoid (even with other distance metrics, but most thing were specifically k-means or k-medoid which [I think] is relatively different from what I want.
Essentially this boils down to me being unable to translate the math into effective programming. Any help would or pointers in the right direction would be much appreciated! Here's a short list of what I have so far:
I have figured out how to calculate the fractional distance metric (the first equation) so I think I'm good there.
I know numpy has an argmin() function (documented here).
Extra points for increased efficiency without lack of accuracy (I'm trying not to brute force by calculating every single fractional distance metric (because the number of point pairs might lead to a factorial complexity...).
compute pairwise distance matrix
compute column or row sum
argmin to find medoid index
i.e. numpy.argmin(distMatrix.sum(axis=0)) or similar.
So I've accepted the answer here, but I thought I'd provide my implementation if anyone else was trying to do something similar:
(1) This is the distance function:
def fractional(p_coord_array, q_coord_array):
# f is an arbitrary value, but must be greater than zero and
# less than one. In this case, I used 3/10. I took advantage
# of the difference of cubes in this case, so that I wouldn't
# encounter an overflow error.
a = np.sum(np.array(p_coord_array, dtype=np.float64))
b = np.sum(np.array(q_coord_array, dtype=np.float64))
a2 = np.sum(np.power(p_coord_array, 2))
ab = np.sum(p_coord_array) * np.sum(q_coord_array)
b2 = np.sum(np.power(p_coord_array, 2))
diffab = a - b
suma2abb2 = a2 + ab + b2
temp_dist = abs(diffab * suma2abb2)
temp_dist = np.power(temp_dist, 1./10)
dist = np.power(temp_dist, 10./3)
return dist
(2) The medoid function (if the length of the dataset was less than 6000 [if greater than that, I ran into overflow errors... I'm still working on that bit to be perfectly honest...]):
def medoid(dataset):
point = []
w = len(dataset)
if(len(dataset) < 6000):
h = len(dataset)
dist_matrix = [[0 for x in range(w)] for y in range(h)]
list_combinations = [(counter_1, counter_2, data_1, data_2) for counter_1, data_1 in enumerate(dataset) for counter_2, data_2 in enumerate(dataset) if counter_1 < counter_2]
for counter_3, tuple in enumerate(list_combinations):
temp_dist = fractional(tuple[2], tuple[3])
dist_matrix[tuple[0]][tuple[1]] = abs(temp_dist)
dist_matrix[tuple[1]][tuple[0]] = abs(temp_dist)
Any questions, feel free to comment!
If you don't mind using brute force this might help:
def calc_medoid(X, Y, f=2):
n = len(X)
m = len(Y)
dist_mat = np.zeros((m, n))
# compute distance matrix
for j in range(n):
center = X[j, :]
for i in range(m):
if i != j:
dist_mat[i, j] = np.linalg.norm(Y[i, :] - center, ord=f)
medoid_id = np.argmin(dist_mat.sum(axis=0)) # sum over y
return medoid_id, X[medoid_id, :]
Here is an example of computing a medoid for a single cluster with Euclidean distance.
import numpy as np, pandas as pd, matplotlib.pyplot as plt
a, b, c, d = np.array([0,1]), np.array([1, 3]), np.array([4,2]), np.array([3, 1.5])
vCenroid = np.mean([a, b, c, d], axis=0)
def GetMedoid(vX):
vMean = np.mean(vX, axis=0) # compute centroid
return vX[np.argmin([sum((x - vMean)**2) for x in vX])] # pick a point closest to centroid
vMedoid = GetMedoid([a, b, c, d])
print(f'centroid = {vCenroid}')
print(f'medoid = {vMedoid}')
df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
plt.plot(vCenroid[0], vCenroid[1], 'ro', ms=10); # plot centroid as red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20); # plot medoid as red star
You can also use the following package to compute medoid for one or more clusters
!pip -q install scikit-learn-extra > log
from sklearn_extra.cluster import KMedoids
GetMedoid = lambda vX: KMedoids(n_clusters=1).fit(vX).cluster_centers_
GetMedoid([a, b, c, d])[0]
I would say that you just need to compute the median.
np.median(np.asarray(points), axis=0)
Your median is the point with the biggest centrality.
Note: if you are using distances different than Euclidean this doesn't hold.

Repeating A Function for Many Values and Building a 2D Array

I have a code.
It takes in a value N and does a quantum walk for that many steps and gives an array that shows the probability at each position.
It's quite a complex calculation and N must be a single integer.
What I want to do is repeat this calculation for 100 values of N and build a large 2D array.
Any idea how I would do this?
Here's my code:
N = 100 # number of random steps
P = 2*N+1 # number of positions
#defining a quantum coin
coin0 = array([1, 0]) # |0>
coin1 = array([0, 1]) # |1>
#defining the coin operator
C00 = outer(coin0, coin0) # |0><0|
C01 = outer(coin0, coin1) # |0><1|
C10 = outer(coin1, coin0) # |1><0|
C11 = outer(coin1, coin1) # |1><1|
C_hat = (C00 + C01 + C10 - C11)/sqrt(2.)
#step operator
ShiftPlus = roll(eye(P), 1, axis=0)
ShiftMinus = roll(eye(P), -1, axis=0)
S_hat = kron(ShiftPlus, C00) + kron(ShiftMinus, C11)
#walk operator
U = S_hat.dot(kron(eye(P), C_hat))
#defining the initial state
posn0 = zeros(P)
posn0[N] = 1 # array indexing starts from 0, so index N is the central posn
psi0 = kron(posn0,(coin0+coin1*1j)/sqrt(2.))
#the state after N steps
psiN = linalg.matrix_power(U, N).dot(psi0)
#finidng the probabilty operator
prob = empty(P)
for k in range(P):
posn = zeros(P)
posn[k] = 1
M_hat_k = kron( outer(posn,posn), eye(2))
proj = M_hat_k.dot(psiN)
prob[k] = proj.dot(proj.conjugate()).real
prob[prob==0] = np.nan
nanmask = np.isfinite(prob)
prob_masked=prob[nanmask] #this is the final probability to be plotted
P_masked=arange(P)[nanmask] #these are the possible positions
Rather than writing out the array I get as it is 100 units, this is a graph of the position and probability at N = 100
I eventually want to make a 3D plot of position against N against probability.

How to run a .py module?

I've got zero experience with Python. I have looked around some tutorial materials, but it seems difficult to understand a advanced code. So I came here for a more specific answer.
For me the mission is to redo the code in my computer.
Here is the scenario:
I'm a graduate student studying tensor factorization in relation learning. A paper[1] providing a code to run this algorithm, as follows:
import logging, time
from numpy import dot, zeros, kron, array, eye, argmax
from numpy.linalg import qr, pinv, norm, inv
from scipy.linalg import eigh
from numpy.random import rand
__version__ = "0.1"
__all__ = ['rescal', 'rescal_with_random_restarts']
__DEF_MAXITER = 500
__DEF_INIT = 'nvecs'
__DEF_PROJ = True
__DEF_CONV = 1e-5
__DEF_LMBDA = 0
_log = logging.getLogger('RESCAL')
def rescal_with_random_restarts(X, rank, restarts=10, **kwargs):
"""
Restarts RESCAL multiple time from random starting point and
returns factorization with best fit.
"""
models = []
fits = []
for i in range(restarts):
res = rescal(X, rank, init='random', **kwargs)
models.append(res)
fits.append(res[2])
return models[argmax(fits)]
def rescal(X, rank, **kwargs):
"""
RESCAL
Factors a three-way tensor X such that each frontal slice
X_k = A * R_k * A.T. The frontal slices of a tensor are
N x N matrices that correspond to the adjecency matrices
of the relational graph for a particular relation.
For a full description of the algorithm see:
Maximilian Nickel, Volker Tresp, Hans-Peter-Kriegel,
"A Three-Way Model for Collective Learning on Multi-Relational Data",
ICML 2011, Bellevue, WA, USA
Parameters
----------
X : list
List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
rank : int
Rank of the factorization
lmbda : float, optional
Regularization parameter for A and R_k factor matrices. 0 by default
init : string, optional
Initialization method of the factor matrices. 'nvecs' (default)
initializes A based on the eigenvectors of X. 'random' initializes
the factor matrices randomly.
proj : boolean, optional
Whether or not to use the QR decomposition when computing R_k.
True by default
maxIter : int, optional
Maximium number of iterations of the ALS algorithm. 500 by default.
conv : float, optional
Stop when residual of factorization is less than conv. 1e-5 by default
Returns
-------
A : ndarray
array of shape ('N', 'rank') corresponding to the factor matrix A
R : list
list of 'M' arrays of shape ('rank', 'rank') corresponding to the factor matrices R_k
f : float
function value of the factorization
iter : int
number of iterations until convergence
exectimes : ndarray
execution times to compute the updates in each iteration
"""
# init options
ainit = kwargs.pop('init', __DEF_INIT)
proj = kwargs.pop('proj', __DEF_PROJ)
maxIter = kwargs.pop('maxIter', __DEF_MAXITER)
conv = kwargs.pop('conv', __DEF_CONV)
lmbda = kwargs.pop('lmbda', __DEF_LMBDA)
if not len(kwargs) == 0:
raise ValueError( 'Unknown keywords (%s)' % (kwargs.keys()) )
sz = X[0].shape
dtype = X[0].dtype
n = sz[0]
k = len(X)
_log.debug('[Config] rank: %d | maxIter: %d | conv: %7.1e | lmbda: %7.1e' % (rank,
maxIter, conv, lmbda))
_log.debug('[Config] dtype: %s' % dtype)
# precompute norms of X
normX = [norm(M)**2 for M in X]
Xflat = [M.flatten() for M in X]
sumNormX = sum(normX)
# initialize A
if ainit == 'random':
A = array(rand(n, rank), dtype=dtype)
elif ainit == 'nvecs':
S = zeros((n, n), dtype=dtype)
T = zeros((n, n), dtype=dtype)
for i in range(k):
T = X[i]
S = S + T + T.T
evals, A = eigh(S,eigvals=(n-rank,n-1))
else :
raise 'Unknown init option ("%s")' % ainit
# initialize R
if proj:
Q, A2 = qr(A)
X2 = __projectSlices(X, Q)
R = __updateR(X2, A2, lmbda)
else :
R = __updateR(X, A, lmbda)
# compute factorization
fit = fitchange = fitold = f = 0
exectimes = []
ARAt = zeros((n,n), dtype=dtype)
for iter in xrange(maxIter):
tic = time.clock()
fitold = fit
A = __updateA(X, A, R, lmbda)
if proj:
Q, A2 = qr(A)
X2 = __projectSlices(X, Q)
R = __updateR(X2, A2, lmbda)
else :
R = __updateR(X, A, lmbda)
# compute fit value
f = lmbda*(norm(A)**2)
for i in range(k):
ARAt = dot(A, dot(R[i], A.T))
f += normX[i] + norm(ARAt)**2 - 2*dot(Xflat[i], ARAt.flatten()) + lmbda*(R[i].flatten()**2).sum()
f *= 0.5
fit = 1 - f / sumNormX
fitchange = abs(fitold - fit)
toc = time.clock()
exectimes.append( toc - tic )
_log.debug('[%3d] fit: %.5f | delta: %7.1e | secs: %.5f' % (iter,
fit, fitchange, exectimes[-1]))
if iter > 1 and fitchange < conv:
break
return A, R, f, iter+1, array(exectimes)
def __updateA(X, A, R, lmbda):
n, rank = A.shape
F = zeros((n, rank), dtype=X[0].dtype)
E = zeros((rank, rank), dtype=X[0].dtype)
AtA = dot(A.T,A)
for i in range(len(X)):
F += dot(X[i], dot(A, R[i].T)) + dot(X[i].T, dot(A, R[i]))
E += dot(R[i], dot(AtA, R[i].T)) + dot(R[i].T, dot(AtA, R[i]))
A = dot(F, inv(lmbda * eye(rank) + E))
return A
def __updateR(X, A, lmbda):
r = A.shape[1]
R = []
At = A.T
if lmbda == 0:
ainv = dot(pinv(dot(At, A)), At)
for i in range(len(X)):
R.append( dot(ainv, dot(X[i], ainv.T)) )
else :
AtA = dot(At, A)
tmp = inv(kron(AtA, AtA) + lmbda * eye(r**2))
for i in range(len(X)):
AtXA = dot(At, dot(X[i], A))
R.append( dot(AtXA.flatten(), tmp).reshape(r, r) )
return R
def __projectSlices(X, Q):
q = Q.shape[1]
X2 = []
for i in range(len(X)):
X2.append( dot(Q.T, dot(X[i], Q)) )
return X2
It's boring to paste such a long code but there is no other way to figure out my problems. I'm sorry about this.
I import this module and pass them arguments according to the author's website:
import pickle, sys
from rescal import rescal
rank = sys.argv[1]
X = pickle.load('us-presidents.pickle')
A, R, f, iter, exectimes = rescal(X, rank, lmbda=1.0)
The dataset us-presidents.rdf can be found here.
My questions are:
According to the code note, the tensor X is a list. I don't quite understand this, how do I relate a list to a tensor in Python? Can I understand tensor = list in Python?
Should I convert RDF format to a triple(subject, predicate, object) format first? I'm not sure of the data structure of X. How do I assignment values to X by hand?
Then, how to run it?
I paste the author's code without his authorization, is it an act of infringement? if so, I am so sorry and I will delete it soon.
The problems may be a little bored, but these are important to me. Any help would be greatly appreciated.
[1] Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel,
A Three-Way Model for Collective Learning on Multi-Relational Data,
in Proceedings of the 28th International Conference on Machine Learning, 2011 , Bellevue, WA, USA
To answer Q2: you need to transform the RDF and save it before you can load it from the file 'us-presidents.pickle'. The author of that code probably did that once because the Python native pickle format loads faster. As the pickle format includes the datatype of the data, it is possible that X is some numpy class instance and you would need either an example pickle file as used by this code, or some code doing the pickle.dump to figure out how to convert from RDF to this particular pickle file as rescal expects it.
So this might answer Q1: the tensor consists of a list of elements. From the code you can see that the X parameter to rescal has a length (k = len(X) ) and can be indexed (T = X[i]). So it elements are used as a list (even if it might be some other datatype, that just behaves as such.
As an aside: If you are not familiar with Python and are just interested in the result of the computation, you might get more help contacting the author of the software.
According to the code note, the tensor X is a list. I don't quite understand this, how do I relate a list to a tensor in Python? Can I
understand tensor = list in Python?
Not necessarily but the author of the code has decided to represent the tensor data as a list data structure. As the comments indicate, the list X contains:
List of frontal slices X_k of the tensor X. The shape of each X_k is ('N', 'N')
That means the tensor is repesented as a list of tuples: [(N, N), ..., (N, N)].
I'm not sure of the data structure of X. How do I assignment values to X by hand?
Now that we now the data structure of X, we can assign values to it using assignment. The following will assign the tuple (1, 3) to the first position in the list X (as the first position is at index 0, the second at position 1, et cetera):
X[0] = (1, 3)
Similarly, the following will assign the tuple (2, 4) to the second position:
X[1] = (2, 4)

Categories

Resources