Dimension out of range when applying L2 normalization in PyTorch - python

I'm getting a runtime error:
RuntimeError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
and can't figure out how to fix it.
The error appears to refer to the line:
i_enc = F.normalize(input =i_batch, p=2, dim=1, eps=1e-12) # (batch, K, feat_dim)
I'm trying to encode image features (batch x 36 x 2048) by applying an L2 norm. Below is the full code for the section.
def forward(self, q_batch, i_batch):
    # batch size = 512
    # q -> 512(batch)x14(length)
    # i -> 512(batch)x36(K)x2048(f_dim)
    # one-hot -> glove
    emb = self.embed(q_batch)
    output, hn = self.gru(emb.permute(1, 0, 2))
    q_enc = hn.view(-1, self.h_dim)
    # image encoding with l2 norm
    i_enc = F.normalize(input=i_batch, p=2, dim=1, eps=1e-12)  # (batch, K, feat_dim)
    q_enc_copy = q_enc.repeat(1, self.K).view(-1, self.K, self.h_dim)
    q_i_concat = torch.cat((i_enc, q_enc_copy), -1)
    q_i_concat = self.non_linear(q_i_concat, self.td_W, self.td_W2)  # 512 x 36 x 512
    i_attention = self.att_w(q_i_concat)  # 512 x 36 x 1
    i_attention = F.softmax(i_attention.squeeze(), 1)
    # weighted sum
    i_enc = torch.bmm(i_attention.unsqueeze(1), i_enc).squeeze()  # (batch, feat_dim)
    # element-wise multiplication
    q = self.non_linear(q_enc, self.q_W, self.q_W2)
    i = self.non_linear(i_enc, self.i_W, self.i_W2)
    h = torch.mul(q, i)  # (batch, hid_dim)
    # output classifier
    # BCE with logits loss
    score = self.c_Wo(self.non_linear(h, self.c_W, self.c_W2))
    return score
I would appreciate any help.
Thanks

I would suggest checking the shape of i_batch (e.g. print(i_batch.shape)), as I suspect i_batch has only one dimension (e.g. shape [N]).
This would explain why PyTorch complains that you can only normalize over dimension 0, while you are asking for the operation to be done over dimension 1 (cf. dim=1).
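For example (a minimal sketch illustrating the point above; the 512 x 36 x 2048 shape is the one described in the question):
import torch
import torch.nn.functional as F

x = torch.randn(36 * 2048)              # a 1-D tensor of shape [N]
# F.normalize(x, p=2, dim=1)            # raises: Dimension out of range (expected [-1, 0], but got 1)
x = torch.randn(512, 36, 2048)          # the intended 3-D shape (batch, K, feat_dim)
i_enc = F.normalize(x, p=2, dim=-1)     # dim=-1 (or dim=2) normalizes each 2048-d feature vector
print(x.shape, i_enc.shape)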

Related

Porting Theano function() with updates to PyTorch (negative sampling RuntimeError: Expected hidden size)

I'm trying to port code from Theano to PyTorch. To be frank, I have a very limited understanding of how both frameworks actually work, so please bear with me! I would greatly appreciate any help in furthering my understanding.
This is the code I'm trying to port: https://github.com/hidasib/GRU4Rec/blob/master/gru4rec.py#L614
Part of the code has already been ported to PyTorch; it can be found here: https://github.com/hungthanhpham94/GRU4REC-pytorch/tree/master/lib
A number of features that exist in the original code are missing from the PyTorch implementation. I've made a bunch of modifications already but have hit a block with regard to negative sampling.
In the original code, a batch size is defined (default = 32) and additional negative samples (default n_sample = 2048 per batch afaik) are stored in GPU memory.
In Theano:
P = theano.shared(pop.astype(theano.config.floatX), name='P')
self.ST = theano.shared(np.zeros((generate_length, self.n_sample), dtype='int64'))
self.STI = theano.shared(np.asarray(0, dtype='int64'))
X = mrng.uniform((generate_length*self.n_sample,))
updates_st = OrderedDict()
updates_st[self.ST] = gpu_searchsorted(P, X, dtype_int64=True).reshape((generate_length, self.n_sample))
updates_st[self.STI] = np.asarray(0, dtype='int64')
generate_samples = theano.function([], updates=updates_st)
generate_samples()
sample_pointer = 0
The above block creates an array of idxs stored in GPU memory, which I've implemented in the DataLoader class as:
def generate_negatives(self):
    P = torch.FloatTensor(self.pop)
    ST = torch.LongTensor(np.zeros((self.generate_length, self.n_sample), dtype='int64'))
    STI = torch.LongTensor(np.asarray(0, dtype='int64'))
    X = torch.rand((self.generate_length * self.n_sample,))
    return torch.searchsorted(P, X).reshape((self.generate_length, self.n_sample))
In Theano, the negative generator is used here:
while not finished:
    ........
    else:
        y = out_idx
        if self.n_sample:
            if sample_pointer == generate_length:
                generate_samples()
                sample_pointer = 0
            sample_pointer += 1
        reset = (start+i+1 == end-1)
        cost = train_function(in_idx, y, len(iters), reset.reshape(len(reset), 1))
where the train_function is defined as:
train_function = function(inputs=[X, Y, M, R], outputs=cost, updates=updates, allow_input_downcast=True, on_unused_input='ignore')
and an example loss function is as follows:
def bpr(self, yhat, M):
    return T.cast(T.sum(-T.log(T.nnet.sigmoid(gpu_diag(yhat, keepdims=True)-yhat))), theano.config.floatX)
In PyTorch, I've attempted to implement the negative generator in the same way:
while not finished:
    minlen = (end - start).min()
    # Item indices (for embedding) for clicks where the first sessions start
    idx_target = df.item_idx.values[start]
    for i in range(minlen - 1):
        # Build inputs & targets
        idx_input = idx_target
        idx_target = df.item_idx.values[start + i + 1]
        if self.n_sample:
            if sample_pointer == self.generate_length:
                neg_samples = self.generate_negatives()
                sample_pointer = 0
            sample = neg_samples[sample_pointer]
            sample_pointer += 1
            # idx_target = np.hstack([idx_target, sample])  # like cpu version (doesn't work due to hidden size)
        input = torch.LongTensor(idx_input)
        target = torch.LongTensor(idx_target)
        yield input, target, mask
The above generator is used in the train_epoch method of the Trainer class:
if self.n_sample:
    dataloader = DataLoader(self.train_data, self.batch_size, self.n_sample, self.generate_length)
else:
    dataloader = DataLoader(self.train_data, self.batch_size)
for ii, (input, target, mask) in enumerate(dataloader):
    input = input.to(self.device)
    target = target.to(self.device)
    self.optim.zero_grad()
    hidden = reset_hidden(hidden, mask).detach()
    logit, hidden = self.model(input, hidden)
    # output sampling
    logit_sampled = logit[:, target.view(-1)]
    loss = self.loss_func(logit_sampled)
    losses.append(loss.item())
    loss.backward()
    self.optim.step()
The same loss function is defined as:
class BPRLoss(nn.Module):
    def __init__(self):
        super(BPRLoss, self).__init__()

    def forward(self, logit):
        # the diagonal holds the scores of the target items; off-diagonal entries
        # act as intra-mini-batch negative samples
        diff = logit.diag().view(-1, 1).expand_as(logit) - logit
        loss = -torch.mean(F.logsigmoid(diff))
        return loss
From my understanding, in the Theano version, in_idx and y (input item idxs and target item idxs, respectively) both have shape (batch_size,). A matrix is produced whose diagonal (note: keepdims=True / expand_as() in the two loss functions) holds the scores for the target items, while the remaining elements are treated as intra-mini-batch negative samples. Since this is consistent across implementations, how then is the loss calculated on the additional 2048 negative samples in the Theano version?
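To make the in-batch negative structure described above concrete, here is a minimal sketch (batch size and item count are illustrative):
import torch
import torch.nn.functional as F

batch_size, n_items = 4, 10
scores = torch.randn(batch_size, n_items)           # model scores over all items
target = torch.randint(0, n_items, (batch_size,))   # one target item per example
logit = scores[:, target]                           # (batch, batch); column j holds every
                                                    # example's score for example j's target
diff = logit.diag().view(-1, 1).expand_as(logit) - logit  # positives minus in-batch negatives
loss = -torch.mean(F.logsigmoid(diff))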
In the Theano CPU implementation (which is deprecated):
y = np.hstack([out_idx, sample])
The GPU implementation:
def model(self, X, H, M, R=None, Y=None, drop_p_hidden=0.0, drop_p_embed=0.0, predict=False):
    sparams, full_params, sidxs = [], [], []
    if (hasattr(self, 'ST')) and (Y is not None) and (not predict) and (self.n_sample > 0):
        A = self.ST[self.STI]
        Y = T.concatenate([Y, A], axis=0)
If our batch size were 32 and n_sample were 2048, then using the above logic (concatenating the samples to the target) we would yield an unchanged input of size 32 while the target would be of size 32 + 2048 = 2080, resulting in the following error:
RuntimeError: Expected hidden size (3, 2080, 100), got [3, 32, 100].
How can this dimension mismatch be resolved?
I've tried reshaping the input (copying input idxs) and the target (with concatenated negative samples) and then looping through, but this is not parallelized and therefore extremely slow.
I've also tried changing the shape to the new batch size (n_sample + batch_size), with idx_input = np.repeat(idx_input, (self.n_sample + self.batch_size) // self.batch_size) when calling init_hidden(); however, this produces other runtime errors: OOM, and RuntimeError: The expanded size of the tensor (9248) must match the existing size (544) at non-singleton dimension 0. Target sizes: [9248, 544]. Tensor sizes: [544, 1]
Kind regards

Applying Gaussian blur on tensor in custom loss

I have a custom loss where I want to apply a Gaussian filter to a predicted label to manipulate it a little. Using max or average pooling is simple, as they are predefined in Keras, but I had to make my own class for Gaussian pooling:
import numpy as np
from keras.layers import DepthwiseConv2D
from keras.layers import Input
from keras.models import Model
import tensorflow as tf

class Gaussian():
    def __init__(self, shape, f=3):
        self.filt = f
        self.g = self.gaussFilter(shape)

    def doFilter(self, data):
        return self.g.predict(data, steps=1)  # steps are for predicting on const tensor, I change it when predicting on predictions

    def gauss2D(self, shape=(3,3), sigma=0.5):
        m,n = [(ss-1.)/2. for ss in shape]
        y,x = np.ogrid[-m:m+1,-n:n+1]
        h = np.exp( -(x*x + y*y) / (2.*sigma*sigma) )
        h[ h < np.finfo(h.dtype).eps*h.max() ] = 0
        sumh = h.sum()
        if sumh != 0:
            h /= sumh
        return h

    def gaussFilter(self, size=256):
        kernel_weights = self.gauss2D(shape=(self.filt,self.filt))
        in_channels = 1  # the number of input channels
        kernel_weights = np.expand_dims(kernel_weights, axis=-1)
        kernel_weights = np.repeat(kernel_weights, in_channels, axis=-1)  # apply the same filter on all the input channels
        kernel_weights = np.expand_dims(kernel_weights, axis=-1)  # for shape compatibility reasons
        inp = Input(shape=(size,size,1))
        g_layer = DepthwiseConv2D(self.filt, use_bias=False, padding='same')(inp)
        model_network = Model(input=inp, output=g_layer)
        print(model_network.summary())
        model_network.layers[1].set_weights([kernel_weights])
        model_network.trainable = False
        return model_network
This works as expected when feeding a constant tensor to the doFilter function; an example with simple data:
a = np.array([[[1, 2, 3], [4, 5, 6], [4, 5, 6]]])
filt = Gaussian(3)
print(filt.doFilter(tf.constant(a.reshape(1,3,3,1))))
However, if I try to use this in a custom loss:
def custom_loss_no_true(input_tensor, length):
    def loss(y_true, y_pred):
        gaus_pooler = Gaussian(256, length//8)
        a = gaus_pooler.doFilter(y_pred)
        # ...more stuff comes after
I get an error:
ValueError: When feeding symbolic tensors to a model, we expect the
tensors to have a static batch size. Got tensor with shape: (None,
256, 256, 1)
As I have found, this is caused by the fact that I am feeding a tensor that is the output of another model, i.e. symbolic data rather than actual values (source). Thus I need to change the logic of my approach, because evaluating the tensor to feed my class would break the graph and lead to no gradient propagation within the loss (or am I incorrect?). How can I apply such a convolution operation to a tensor that is the output of another model? Is it even possible? Or maybe there is a way to use it without adding the layer to the model, such as MaxPooling?
You don't really need a complex Keras Model or a Keras Layer if all you want to do is convolve your input with a Gaussian kernel. Here is a port of your code using simple TensorFlow ops:
import tensorflow as tf

def get_gaussian_kernel(shape=(3,3), sigma=0.5):
    """build the gaussian filter"""
    m,n = [(ss-1.)/2. for ss in shape]
    x = tf.expand_dims(tf.range(-n,n+1,dtype=tf.float32),1)
    y = tf.expand_dims(tf.range(-m,m+1,dtype=tf.float32),0)
    h = tf.exp(tf.math.divide_no_nan(-((x*x) + (y*y)), 2*sigma*sigma))
    h = tf.math.divide_no_nan(h, tf.reduce_sum(h))
    return h

def gaussian_blur(inp, shape=(3,3), sigma=0.5):
    """Convolve using tf.nn.depthwise_conv2d"""
    in_channel = tf.shape(inp)[-1]
    k = get_gaussian_kernel(shape, sigma)
    k = tf.expand_dims(k, axis=-1)
    k = tf.repeat(k, in_channel, axis=-1)
    k = tf.reshape(k, (*shape, in_channel, 1))
    # using padding same to preserve size (H,W) of the input
    conv = tf.nn.depthwise_conv2d(inp, k, strides=[1,1,1,1], padding="SAME")
    return conv
You can use it directly in your custom loss (assuming a 4D y_pred of shape [batch, height, width, channel]):
a = gaussian_blur(y_pred)
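For example, a custom loss that blurs the prediction before comparing it to the ground truth could look like this (a sketch reusing gaussian_blur and the tensorflow import from above; the plain MSE comparison is an assumption, not part of the answer):
def blurred_mse_loss(y_true, y_pred):
    # blur the predicted label; gradients flow through the depthwise convolution
    y_pred_blur = gaussian_blur(y_pred, shape=(3, 3), sigma=0.5)
    return tf.reduce_mean(tf.square(y_true - y_pred_blur))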

Numpy RNN gradient check failure

So I am building an RNN from scratch using numpy just to get the hang of how they work internally. My backpropagation through time is here:
def backprop_through_time(self, X, Y):
    assert(len(X.shape) == 3)
    seq_length = Y.shape[1] if self.return_sequences else 1
    _, (Z_states, States, Z_outs, Outs) = self.feed_forward(X, cache=True)
    if not self.return_sequences:
        Outs = Outs[:,-1,:]
    # setup gradients
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    dLdB_state = np.zeros(self.B_state.shape)
    dLdB_out = np.zeros(self.B_out.shape)
    dLdOuts = self.loss_function_prime(Outs, Y)
    if not self.return_sequences:
        # we need dLdOuts to have a seq_length dim at axis 1
        dLdOuts = np.expand_dims(dLdOuts, axis=1)
    for t in range(seq_length):
        adjusted_t = seq_length-1 if not self.return_sequences else t
        # print("adjusted_t {}".format(adjusted_t))
        dOuts_tdZ_out = self.output_activation_function_prime(Z_outs[:,adjusted_t,:])
        dLdZ_out = np.multiply(dLdOuts[:, adjusted_t, :], dOuts_tdZ_out)
        # Z_state = dot(X_t, self.U) + dot(State_{t-1}, self.W) + self.B_state
        # State_t = f(Z_state)
        # Z_out = dot(State_t, self.V) + self.B_out
        # Out_t = g(Z_out)
        dLdV += np.dot(States[:,adjusted_t,:].T, dLdZ_out)
        dLdB_out += np.sum(dLdZ_out, axis=0, keepdims=True)
        dLdZ_state = np.multiply(np.dot(dLdZ_out, self.V.T),
                                 self.hidden_activation_function_prime(Z_states[:,adjusted_t,:]))
        for t_prev in range(max(0, adjusted_t-self.backprop_through_time_limit), adjusted_t+1)[::-1]:
            dLdB_state += np.sum(dLdZ_state, axis=0, keepdims=True)
            dLdW += np.dot(States[:,t_prev-1,:].T, dLdZ_state)
            dLdU += np.dot(X[:,t_prev,:].T, dLdZ_state)
            dLdZ_state = np.multiply(np.dot(dLdZ_state, self.W.T),
                                     self.hidden_activation_function_prime(States[:,t_prev-1,:]))
    return (dLdU, dLdV, dLdW), (dLdB_state, dLdB_out)
However, I am still failing a gradient check for the parameters `dLdU, dLdW, dLdB_state`. I have gone through the math about a dozen times now, and I cannot find what is wrong with my implementation.
X and Y are both assumed to be 3D arrays: X has shape X.shape := (batch_size, seq_length, input_dim),
while Y has shape Y.shape := (batch_size, seq_length, output_dim).
Caching the feed_forward operation, I am returning Z_states with shape Z_states.shape := (batch_size, seq_length, hidden_dim), Z_outs and Outs with shape Z_outs.shape, Outs.shape := (batch_size, seq_length, output_dim), and States as States.shape := (batch_size, seq_length+1, hidden_dim). States[:,-1,:] is the original zeros of shape States[:,-1,:].shape := (batch_size, hidden_dim) that the RNN state is initialized with. Could anyone help me?
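For reference, the gradient check meant here is the standard central-difference comparison against the analytic gradients; a simplified sketch with placeholder names, not my exact check:
import numpy as np

def numerical_grad(loss_fn, param, eps=1e-5):
    """Central-difference estimate of dLoss/dParam, element by element."""
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = param[idx]
        param[idx] = orig + eps
        plus = loss_fn()
        param[idx] = orig - eps
        minus = loss_fn()
        param[idx] = orig
        grad[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

# compared against the analytic result, e.g. np.allclose(numerical_grad(loss_fn, self.U), dLdU, atol=1e-6)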
EDIT
I found my answer. My math is right, but I was calling the wrong variable. When I update dLdZ_state in the 2nd inner loop (the backprop-through-time part), I am multiplying with self.hidden_activation_function_prime(States[:,t_prev-1,:]). This should instead be self.hidden_activation_function_prime(Z_states[:,t_prev-1,:]).
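In code, only the argument to the activation derivative changes in that inner-loop update:
# before (fails the gradient check): derivative evaluated at the post-activation states
dLdZ_state = np.multiply(np.dot(dLdZ_state, self.W.T),
                         self.hidden_activation_function_prime(States[:,t_prev-1,:]))
# after (correct): derivative evaluated at the pre-activation values
dLdZ_state = np.multiply(np.dot(dLdZ_state, self.W.T),
                         self.hidden_activation_function_prime(Z_states[:,t_prev-1,:]))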

Dimensionality for stacked LSTM network in TensorFlow

In reviewing the numerous similar questions concerning multidimensional inputs and stacked LSTM RNNs, I have not found an example which lays out the dimensionality for the initial_state placeholder and the following rnn_tuple_state below. The attempted [lstm_num_layers, 2, None, lstm_num_cells, 2] is an extension of the code from these examples (http://monik.in/a-noobs-guide-to-implementing-rnn-lstm-using-tensorflow/, https://medium.com/#erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40), with an extra dimension of feature_dim added at the end for the multiple values at each time step of the features (this doesn't work, but instead produces a ValueError due to mismatched dimensions in the tensorflow.nn.dynamic_rnn call).
time_steps = 10
feature_dim = 2
label_dim = 4
lstm_num_layers = 3
lstm_num_cells = 100
dropout_rate = 0.8
# None is to allow for variable size batches
features = tensorflow.placeholder(tensorflow.float32,
                                  [None, time_steps, feature_dim])
labels = tensorflow.placeholder(tensorflow.float32, [None, label_dim])
cell = tensorflow.contrib.rnn.MultiRNNCell(
    [tensorflow.contrib.rnn.LayerNormBasicLSTMCell(
        lstm_num_cells,
        dropout_keep_prob = dropout_rate)] * lstm_num_layers,
    state_is_tuple = True)
# not sure of the dimensionality for the initial state
initial_state = tensorflow.placeholder(
    tensorflow.float32,
    [lstm_num_layers, 2, None, lstm_num_cells, feature_dim])
# which impacts these two lines as well
state_per_layer_list = tensorflow.unstack(initial_state, axis = 0)
rnn_tuple_state = tuple(
    [tensorflow.contrib.rnn.LSTMStateTuple(
        state_per_layer_list[i][0],
        state_per_layer_list[i][1]) for i in range(lstm_num_layers)])
# also not sure if expanding the feature dimensions is correct here
outputs, state = tensorflow.nn.dynamic_rnn(
    cell, tensorflow.expand_dims(features, -1),
    initial_state = rnn_tuple_state)
What would be most helpful is an explanation of the generic situation where:
each time step has N values
each time sequence has S steps
each batch has B sequences
each output has R values
there are L hidden LSTM layers in the network
each layer has M number of nodes
so the pseudocode version of this would be:
# B, S, N, and R are undefined values for the purpose of this question
features = tensorflow.placeholder(tensorflow.float32, [B, S, N])
labels = tensorflow.placeholder(tensorflow.float32, [B, R])
...
which, if I could finish it, I wouldn't be asking here in the first place. Thanks in advance. Any comments on relevant best practices are welcome.
After much trial and error, the following produces a stacked LSTM dynamic_rnn regardless of the dimensionality of the features:
time_steps = 10
feature_dim = 2
label_dim = 4
lstm_num_layers = 3
lstm_num_cells = 100
dropout_rate = 0.8
learning_rate = 0.001
features = tensorflow.placeholder(
    tensorflow.float32, [None, time_steps, feature_dim])
labels = tensorflow.placeholder(
    tensorflow.float32, [None, label_dim])
cell_list = []
for _ in range(lstm_num_layers):
    cell_list.append(
        tensorflow.contrib.rnn.LayerNormBasicLSTMCell(lstm_num_cells,
                                                      dropout_keep_prob=dropout_rate))
cell = tensorflow.contrib.rnn.MultiRNNCell(cell_list, state_is_tuple=True)
initial_state = tensorflow.placeholder(
    tensorflow.float32, [lstm_num_layers, 2, None, lstm_num_cells])
state_per_layer_list = tensorflow.unstack(initial_state, axis=0)
rnn_tuple_state = tuple(
    [tensorflow.contrib.rnn.LSTMStateTuple(
        state_per_layer_list[i][0],
        state_per_layer_list[i][1]) for i in range(lstm_num_layers)])
state_series, last_state = tensorflow.nn.dynamic_rnn(
    cell=cell, inputs=features, initial_state=rnn_tuple_state)
hidden_layer_output = tensorflow.transpose(state_series, [1, 0, 2])
last_output = tensorflow.gather(hidden_layer_output, int(
    hidden_layer_output.get_shape()[0]) - 1)
weights = tensorflow.Variable(tensorflow.random_normal(
    [lstm_num_cells, int(labels.get_shape()[1])]))
biases = tensorflow.Variable(tensorflow.constant(
    0.0, shape=[labels.get_shape()[1]]))
predictions = tensorflow.matmul(last_output, weights) + biases
mean_squared_error = tensorflow.reduce_mean(
    tensorflow.square(predictions - labels))
minimize_error = tensorflow.train.RMSPropOptimizer(
    learning_rate).minimize(mean_squared_error)
Part of what started this journey down one of many proverbial rabbit holes was that the previously referenced examples reshaped the output to accommodate a classifier instead of a regressor (which is what I was attempting to build). Since this is independent of the feature dimensionality, it serves as a generic template for this use case.
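For completeness, one way to feed an all-zeros initial state matching the [lstm_num_layers, 2, None, lstm_num_cells] placeholder above (a sketch; the session and batch variables are assumptions, not part of the code above):
import numpy as np
batch_size = 32  # example value; the placeholder's None dimension accepts any batch size
zero_state = np.zeros(
    (lstm_num_layers, 2, batch_size, lstm_num_cells), dtype=np.float32)
# fed alongside the features and labels when running a training step, e.g.
# session.run(minimize_error, feed_dict={features: feature_batch,
#                                        labels: label_batch,
#                                        initial_state: zero_state})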

Simple Binary Classification Using Theano Error

I got an error when trying to create a simple binary classifier for the XOR case using Theano. It says dimension mismatch, but I can't figure out which variable causes it.
The strange part is that my program works when I change the number of neurons in the last layer. When I use 2 neurons in the last layer, change that layer to a softmax layer, and also use the negative log likelihood (multiclass classification style), the program works fine.
This is my full code:
import numpy as np
import theano
import theano.tensor as T

class HiddenLayer(object):
    def __init__(self, input, nIn, nOut, is_last, W=None):
        self.input = input
        W_val = np.random.randn(nIn,nOut)*0.001
        b_val = np.zeros((nOut,))
        self.W = theano.shared(np.asarray(W_val,dtype=theano.config.floatX),
                               name='W',borrow=True)
        self.b = theano.shared(np.asarray(b_val,dtype=theano.config.floatX),
                               name='b',borrow=True)
        self.z = T.dot(input,self.W) + self.b
        if(is_last==0):
            self.output = T.switch(self.z < 0 , 0 ,self.z)
        else:
            self.output = T.nnet.sigmoid(self.z)
            self.y_pred = self.output > 0.5
        self.params = [self.W, self.b]

    def cost_function(self,y):
        return -T.mean(y*T.log(self.output)+(1-y)*T.log(1-self.output))

    def errors(self,y):
        return T.mean(T.neq(self.y_pred,y))

alfa = 1
epoch = 1000
neu = 5
inpx = np.array([[1,0],[1,1],[0,0],[0,1]])
inpy = np.array([1,0,0,1])
x = T.fmatrix('x')
y = T.ivector('y')
layer0 = HiddenLayer(
    input = x,
    nIn = 2,
    nOut = neu,
    is_last=0
)
layer1 = HiddenLayer(
    input = layer0.output,
    nIn = neu,
    nOut = 1,
    is_last=1
)
params = layer0.params + layer1.params
cost = layer1.cost_function(y)
grads = T.grad(cost, params)
updates = [(param_i, param_i - alfa * grad_i) for param_i, grad_i in zip(params, grads)]
eror = layer1.errors(y)
train_model = theano.function([x,y], [eror,cost],updates=updates,allow_input_downcast=True)
test_model = theano.function([x,y],[eror,layer1.y_pred],allow_input_downcast=True)
for i in xrange(epoch):
    etr,ctr = train_model(inpx, inpy)
    if i%(epoch/10)==0:
        print etr,ctr
et,pt = test_model(inpx,inpy)
print pt
and the error:
ValueError: Input dimension mis-match. (input[0].shape[1] = 1, input[1].shape[1] = 4)
Apply node that caused the error: Elemwise{neq,no_inplace}(sigmoid.0, DimShuffle{x,0}.0)
Toposort index: 41
Inputs types: [TensorType(float32, matrix), TensorType(int32, row)]
Inputs shapes: [(4L, 1L), (1L, 4L)]
Inputs strides: [(4L, 4L), (16L, 4L)]
Inputs values: [array([[ 0.94264328],
[ 0.99725735],
[ 0.5 ],
[ 0.95675617]], dtype=float32), array([[1, 0, 0, 1]])]
Outputs clients: [[Shape(Elemwise{neq,no_inplace}.0), Sum{acc_dtype=int64}(Elemwise{neq,no_inplace}.0)]]
Thank you in advance for any help.
Your problem is with your y and inpy variables: what you are trying to do is to have y be the expected output of the network. Your network is given a dataset with 4 elements, each having 2 features, so your input matrix has 4 rows and 2 columns. You are thus expected to have 4 elements in your predicted output, that is, 4 rows in your y or inpy matrix, but you are using a vector, which in Theano is a row vector and thus has only one row. You need either to transpose your y vector when computing the cost, or to define your y variable as a matrix and thus have inpy be a (4,1) matrix instead of a (4,) vector (once again, vectors are row vectors in Theano).
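For example, either of the following changes should make the shapes line up (a sketch based on the explanation above, not tested against the full script):
# option 1: declare y as a matrix and give inpy an explicit column shape
y = T.fmatrix('y')                      # instead of T.ivector('y')
inpy = np.array([[1],[0],[0],[1]])      # shape (4, 1) instead of (4,)

# option 2: keep the vector but add the missing axis where the cost/errors are computed,
# e.g. compare against y.dimshuffle(0, 'x') so it broadcasts as a (4, 1) column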
Hope this helps,
Best
