Keras model.predict function giving input shape error - python

I have implemented universal sentence encoder in Tensorflow and now I am trying to predict the class probabilities on a sentence. I am converting the string to an array as well.
Code:
if model.model_type == "universal_classifier_basic":
    class_probs = model.predict(np.array(['this is a random sentence'], dtype=object))
Error Message:
InvalidArgumentError (see above for traceback): input must be a vector, got shape: []
[[Node: lambda_1/module_apply_default/tokenize/StringSplit = StringSplit[skip_empty=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](lambda_1/module_apply_default/RegexReplace_1, lambda_1/module_apply_default/tokenize/Const)]]
Any leads, suggestions or explanations are welcomed and highly appreciated.
Thank You :)

It is not as easy as you would like. Usually a model expects a vector of integers as input, where each integer is the index of the corresponding word in a vocabulary. For example:
vocab = {"hello":0, "world":1}
and you want to give the sentence "hello world" to the network as input, then you should build the vector as follows:
net_input = [vocab.get(word) for word in "hello world".split(" ")]
Note also that if you trained the network with mini-batches, you will also need to add an extra first (batch) dimension to the vector you feed to the network. You can easily do this with numpy:
import numpy as np
net_input = np.expand_dims(net_input, 0)
This way your net_input has the shape [1, 2] and you can feed it into the network.
There is still one problem that could stop you from feeding the network with such a vector. At training time you have probably defined a placeholder for the input with a fixed length (30 or 40 tokens, say). At test time you need to match that size, either by padding your sentence if it doesn't fill the whole length or by cutting it if it is longer.
You can truncate or add padding as follows:
net_input = [list(old_in[:max_len]) + [vocab.get("PAD")] * (max_len - len(old_in[:max_len])) for old_in in net_input]
This line truncates each input to at most max_len tokens with old_in[:max_len] (slicing does nothing if the input is already shorter than max_len) and fills the remaining max_len - len(old_in[:max_len]) slots with padding tokens (+ [vocab.get("PAD")] * ...). It assumes your vocabulary contains a "PAD" entry.
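Putting the lookup, truncation and padding steps together, a minimal sketch could look like this. The vocabulary, the "PAD" and "UNK" entries, the max_len value and the encode helper are all made up for illustration; they are not part of the original model.

```python
# Hypothetical vocabulary; the "PAD" and "UNK" entries are assumptions of this sketch.
vocab = {"PAD": 0, "UNK": 1, "hello": 2, "world": 3}
max_len = 4  # assumed fixed input length used at training time


def encode(sentence, vocab, max_len):
    """Map words to indices, then truncate or right-pad to max_len."""
    ids = [vocab.get(word, vocab["UNK"]) for word in sentence.split()]
    ids = ids[:max_len]  # no effect if the sentence is already short enough
    return ids + [vocab["PAD"]] * (max_len - len(ids))


# A batch is a list of equally long rows, i.e. shape [batch_size, max_len].
batch = [encode(s, vocab, max_len)
         for s in ["hello world", "hello big wide world again"]]
print(batch)  # [[2, 3, 0, 0], [2, 1, 1, 3]]
```

Mapping unknown words to a dedicated index avoids getting None from vocab.get for words that were never seen during training.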
Hope this helps.
If this is not the case you are in, just write down a comment to the answer and I'll try to figure out other solutions.

Related

Add sequential features to 1D CNN classification model

I am building a 1D CNN model using Keras for text classification where the input is a sequence of words generated by tokenizer.texts_to_sequences. Is there a way to also feed in a sequence of numerical features (e.g. a score) for each word in the sequence? For example, for sentence 1 the input would be ['the', 'dog', 'barked'] and each word in this particular sequence has the scores [0.9, 0.75, 0.6]. The scores are not word specific, but sentence specific scores of the words (if that makes a difference for how to format the input). Would an LSTM be more appropriate in this case?
Many thanks in advance!
Yes, just use 2 channels in the input tensor.
More precisely, if your input previously had shape: (batch_size, seq_len)
Now you could have: (batch_size, seq_len, 2)
If you look at the Keras documentation, you will see that the data_format parameter takes a string, either channels_last (the default) or channels_first. In this case the default is fine, because the 2 (the number of channels) comes last.
You can just stack the 2 input arrays into a tensor with this shape.
If you use a word embedding, the number of channels will probably not be 2 but embedding_dim + 1, so the final input shape would be: (batch_size, seq_len, embedding_dim + 1)
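Both shapes can be checked quickly with numpy. This is only a sketch: the word indices, scores, embedding_dim and the random embedding table are invented for illustration.

```python
import numpy as np

# Toy batch: 2 sentences, already padded to seq_len = 3 (values are made up).
word_ids = np.array([[4, 7, 2], [5, 1, 0]])             # (batch_size, seq_len)
scores = np.array([[0.9, 0.75, 0.6], [0.8, 0.3, 0.0]])  # (batch_size, seq_len)

# Case 1: two raw channels per time step, channels last.
two_channels = np.stack(
    [word_ids.astype("float32"), scores.astype("float32")], axis=-1)
print(two_channels.shape)  # (2, 3, 2)

# Case 2: embedded words plus one extra score channel.
embedding_dim = 5
embedding_matrix = np.random.rand(10, embedding_dim).astype("float32")  # hypothetical table
embedded = embedding_matrix[word_ids]                   # (2, 3, embedding_dim)
with_score = np.concatenate(
    [embedded, scores[..., None].astype("float32")], axis=-1)
print(with_score.shape)  # (2, 3, 6)
```

In the second case the score simply becomes the last feature of every time step, which matches the embedding_dim + 1 channel count.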
In general you can also refer to this other Stack Overflow question.
In any case, both a 1D CNN and an LSTM could be good models, but this you need to discover yourself depending on your task, data and model constraints.
As a final remark, you could even think of a model with multiple inputs: one for the word sequence and the other for the scores. See this documentation page or this random tutorial I found on the internet. You can again refer also to the same SO question.

Variable sentence length for LSTM using word2vec as inputs on tensorflow

I am building an LSTM Model using word2vec as an input. I am using the tensorflow framework. I have finished word embedding part, but I am stuck with LSTM part.
The issue here is that I have different sentence lengths, which means that I have to either do padding or use dynamic_rnn with specified sequence length. I am struggling with both of them.
Padding.
The confusing part of padding is when I do padding. My model goes like
word_matrix=model.wv.syn0
X = tf.placeholder(tf.int32, shape)
data = tf.placeholder(tf.float32, shape)
data = tf.nn.embedding_lookup(word_matrix, X)
Then, I am feeding sequences of word indices for word_matrix into X. I am worried that if I pad the sequences fed into X with zeros, I would incorrectly keep feeding unnecessary input (word_matrix[0] in this case).
So, I am wondering what is the correct way of 0 padding. It would be great if you let me know how to implement it with tensorflow.
dynamic_rnn
For this, I have declared a list containing the lengths of all sentences and feed it along with X and y at the end. In this case, I cannot feed the inputs as a batch though. Then I encountered this error (ValueError: as_list() is not defined on an unknown TensorShape.), which suggests to me that the sequence_length argument only accepts a list? (My thoughts might be entirely incorrect though.)
The following is my code for this.
X = tf.placeholder(tf.int32)
labels = tf.placeholder(tf.int32, [None, numClasses])
length = tf.placeholder(tf.int32)
data = tf.placeholder(tf.float32, [None, None, numDimensions])
data = tf.nn.embedding_lookup(word_matrix, X)
lstmCell = tf.contrib.rnn.BasicLSTMCell(lstmUnits, state_is_tuple=True)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.25)
initial_state=lstmCell.zero_state(batchSize, tf.float32)
value, _ = tf.nn.dynamic_rnn(lstmCell, data, sequence_length=length,
                             initial_state=initial_state, dtype=tf.float32)
I am really struggling with this part, so any help would be very much appreciated.
Thank you in advance.
Tensorflow does not support variable length Tensor. So when you declare a Tensor, the list/numpy array should have a uniform shape.
From your first part, I understand that you were already able to pad the last time steps of each sequence with zeros, which is the ideal situation. Here is how it should look for a batch size of 4, a max sequence length of 10 and 50 hidden units:
[4,10,50] would be the size of your whole batch, but internally it can be pictured like this once you visualize the padding:
`[[5+5pad,50],[10,50],[8+2pad,50],[9+1pad,50]]`
Each pad represents one time step holding a size-50 vector filled with nothing but zeros. Look at this question and this one to know more about how to pad manually.
You use dynamic_rnn precisely because you do not want to compute on the padded time steps. The tf.nn.dynamic_rnn API ensures that when you pass the sequence_length argument.
For the example above, that argument will be [5,10,8,9]. You can compute it by counting the non-zero entries for each batch component. A simple way to compute that would be:
data_mask = tf.cast(X, tf.bool)  # X holds the zero-padded word indices, shape [batch, max_len]
data_len = tf.reduce_sum(tf.cast(data_mask, tf.int32), axis=1)
and pass it in the tf.nn.dynamic_rnn api:
tf.nn.dynamic_rnn(lstmCell, data, sequence_length=data_len, initial_state=initial_state)
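The length computation can be checked outside of TensorFlow with a small numpy sketch. It assumes that index 0 is reserved for padding and never denotes a real word; the index values below are invented.

```python
import numpy as np

# Zero-padded batch of word indices (0 is assumed to be the padding id).
padded_ids = np.array([
    [3, 8, 2, 9, 4, 0, 0, 0, 0, 0],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    [5, 5, 1, 2, 7, 7, 3, 2, 0, 0],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
])

# Count the non-padding entries along the time axis.
seq_len = np.count_nonzero(padded_ids, axis=1)
print(seq_len.tolist())  # [5, 10, 8, 9]
```

If index 0 is also a real word in your vocabulary, this count is wrong, so in that case shift the word indices by one or keep the true lengths around from preprocessing.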

How to correctly use mask_zero=True for Keras Embedding with pre-trained weights?

I am confused about how to format my own pre-trained weights for Keras Embedding layer if I'm also setting mask_zero=True. Here's a concrete toy example.
Suppose I have a vocabulary of 4 words [1,2,3,4] and am using vector weights defined by:
weight[1]=[0.1,0.2]
weight[2]=[0.3,0.4]
weight[3]=[0.5,0.6]
weight[4]=[0.7,0.8]
I want to embed sentences of length up to 5 words, so I have to zero pad them before feeding them into the Embedding layer. I want to mask out the zeros so further layers don't use them.
Reading the Keras docs for Embedding, it says the 0 value can't be in my vocabulary.
mask_zero: Whether or not the input value 0 is a special "padding"
value that should be masked out. This is useful when using recurrent
layers which may take variable length input. If this is True then all
subsequent layers in the model need to support masking or an exception
will be raised. If mask_zero is set to True, as a consequence, index 0
cannot be used in the vocabulary (input_dim should equal size of
vocabulary + 1).
So what I'm confused about is how to construct the weight array for the Embedding layer, since "index 0 cannot be used in the vocabulary." If I build the weight array as
[[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
then normally, word 1 would point to index 1, which in this case holds the weights for word 2. Or is it that when you specify mask_zero=True, Keras internally makes it so that word 1 points to index 0? Alternatively, do you just prepend a vector of zeros in index zero, as follows?
[[0.0,0.0],
[0.1,0.2],
[0.3,0.4],
[0.5,0.6],
[0.7,0.8]]
This second option seems to me to put the zero into the vocabulary. In other words, I'm very confused. Can anyone shed light on this?
Your second approach is correct. You will want to construct your embedding layer in the following way:
embedding = Embedding(
    output_dim=embedding_size,
    input_dim=vocabulary_size + 1,
    input_length=input_length,
    mask_zero=True,
    weights=[np.vstack((np.zeros((1, embedding_size)),
                        embedding_matrix))],
    name='embedding'
)(input_layer)
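where embedding_matrix is the first matrix you provided (one row per word, without the zero row). The np.vstack call prepends the zero row, which yields your second matrix.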
You can see this by looking at the implementation of keras' embedding layer. Notably, how mask_zero is only used to literally mask the inputs
def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    output_mask = K.not_equal(inputs, 0)
    return output_mask
thus the entire kernel is still multiplied by the input, meaning all indexes are shifted up by one.
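The index shift is easy to verify with plain numpy, using the toy weights from the question. This sketch only imitates the row lookup and the mask test; it is not the Keras layer itself.

```python
import numpy as np

# One row per real word (words 1..4), as in the question.
word_vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])

# Prepend a row for index 0 (padding): shape (vocabulary_size + 1, embedding_size).
weights = np.vstack([np.zeros((1, word_vectors.shape[1])), word_vectors])

sentence = np.array([2, 4, 1, 0, 0])  # zero-padded to length 5
embedded = weights[sentence]          # plain row lookup, shape (5, 2)
mask = sentence != 0                  # same test as in compute_mask

print(weights.shape)   # (5, 2)
print(embedded[0])     # [0.3 0.4]
print(mask.tolist())   # [True, True, True, False, False]
```

Word 2 now correctly maps to [0.3, 0.4], and the two padded positions are the ones flagged False in the mask.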

Using sample_weight in Keras for sequence labelling

I am working on a sequential labeling problem with unbalanced classes and I would like to use sample_weight to resolve the unbalance issue. Basically if I train the model for about 10 epochs, I get great results. If I train for more epochs, val_loss keeps dropping, but I get worse results. I'm guessing the model just detects more of the dominant class to the detriment of the smaller classes.
The model has two inputs, for word embeddings and character embeddings, and the output is one of 7 possible classes from 0 to 6.
With the padding, the shape of my input layer for word embeddings is (3000, 150) and the input layer for char embeddings is (3000, 150, 15). I use a 0.3 split for testing and training data, which means X_train for word embeddings is (2000, 150) and (2000, 150, 15) for char embeddings. y contains the correct class for each word, encoded in a one-hot vector of dimension 7, so its shape is (3000, 150, 7). y is likewise split into a training and testing set. Each input is then fed into a Bidirectional LSTM.
The output is a matrix with one of the 7 categories assigned for each word of the 2000 training samples, so the size is (2000, 150, 7).
At first, I simply tried to define sample_weight as an np.array of length 7 containing the weights for each class:
count = [list(array).index(1) for arrays in y for array in arrays]
count = dict(Counter(count))
count[0] = 0
total = sum([count[key] for key in count])
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = count[f]
But I get the following error ValueError: Found a sample_weight array with shape (7,) for an input with shape (2000, 150, 7). sample_weight cannot be broadcast.
Looking at the docs, it looks like I should instead be passing a 2D array with shape (samples, sequence_length). So I create a (3000, 150) array with a concatenation of the weights of every word of each sequence:
weights = []
for sample in y:
    current_weight = []
    for line in sample:
        current_weight.append(frequency[list(line).index(1)])
    weights.append(current_weight)
weights = np.array(weights)
and pass that to the fit function through the sample_weight parameter after having added the sample_weight_mode="temporal" option in compile().
I first got an error telling me the dimension was wrong, however after generating the weights for only the training sample, I end up with a (2000, 150) array that I can use to fit my model.
Is this a proper way to define sample_weights or am I doing it all wrong ? I can't say I've noticed any improvements from adding the weights, so I must have missed something.
I think you are confusing sample_weight and class_weight. Checking the docs, we can see the differences between them:
sample_weight is used to provide a weight for each training sample. That means you should pass a 1D array with the same number of elements as your training samples (indicating the weight for each of those samples). If you are using temporal data, you may instead pass a 2D array, which lets you give a weight to each timestep of each sample.
class_weight is used to provide a weight or bias for each output class. This means you should pass a weight for each class that you are trying to classify. Furthermore, this parameter expects a dictionary to be passed to it (not an array, which is why you got that error). For example, consider this situation:
class_weight = {0 : 1. , 1: 50.}
In this case (a binary classification problem) you are giving 50 times as much weight (or "relevance") to your samples of class 1 compared to class 0. This way you can compensate for imbalanced datasets. Here is another useful post explaining more about this and other options to consider when dealing with imbalanced datasets.
If I train for more epochs, val_loss keeps dropping, but I get worse results.
Probably you are over-fitting, and the imbalanced classes in your dataset may be contributing to that, as you correctly suspected. Compensating with class weights should help mitigate this; however, there may be other causes of over-fitting that are beyond the scope of this question/answer (so make sure to watch out for those after solving this question).
Judging by your post, it seems to me that what you need is class_weight to balance your dataset for training, for which you will need to pass a dictionary indicating the weight ratios between your 7 classes. Consider using sample_weight only if you want to give each sample a custom weight.
If you want a more detailed comparison between the two, consider checking this answer I posted on a related question. Spoiler: sample_weight overrides class_weight, so you have to use one or the other, but not both, so be careful not to mix them.
Update: As of the moment of this edit (March 27, 2020), looking at the source code of training_utils.standardize_weights() we can see that it now supports both class_weights and sample_weights:
Everything gets normalized to a single sample-wise (or timestep-wise)
weight array. If both sample_weights and class_weights are provided,
the weights are multiplied together.
I searched online for the same question and I did have good accuracy improvement after using sample_weight correctly in my case.
I think your understanding is correct and the procedure is also correct. One possible reason that you don't see improvements in your case is that, when you pass in sample_weight, a higher value means a higher weight. This means that you cannot use the word counts directly, since the dominant class would then get the largest weight. You might consider using the inverted frequency instead:
total = sum([count[key] for key in count])
count = {key: count[key] / total for key in count}
category_weights = np.zeros(7)
for f in count:
    category_weights[f] = 1 - count[f]
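To connect this with the (samples, sequence_length) array from the question, here is a small numpy sketch that derives the inverted frequencies from one-hot labels and expands them to one weight per timestep. The labels and the number of classes are toy values, and the "1 - frequency" scheme is just the one suggested above, not the only reasonable choice.

```python
import numpy as np

num_classes = 3
# Toy labels: 2 sequences of 4 timesteps (class ids, made up for illustration).
labels = np.array([[0, 0, 0, 1],
                   [0, 2, 0, 0]])
y = np.eye(num_classes)[labels]        # one-hot targets, shape (2, 4, 3)

# Relative frequency of each class over all timesteps.
class_ids = y.argmax(axis=-1)          # back to class ids, shape (2, 4)
freq = np.bincount(class_ids.ravel(), minlength=num_classes) / class_ids.size

# Rare classes get larger weights.
category_weights = 1.0 - freq

# One weight per timestep: shape (samples, sequence_length).
sample_weight = category_weights[class_ids]
print(category_weights.tolist())  # [0.25, 0.875, 0.875]
print(sample_weight.shape)        # (2, 4)
```

An array like sample_weight is what the question passes to fit together with sample_weight_mode="temporal" in compile().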

How to use LSTM with sequence data of varying length with keras without embedding?

I have an input data where each example is some varying number of vectors of length k. In total I have n examples. So the dimensions of the input is n * ? * k. The question mark symbolizes varying length.
I want to input it to an LSTM layer in Keras, if possible, without using embedding (it isn't your ordinary words dataset).
Could someone write a short example of how to do this?
The data is currently a double nested python array, e.g.
example1 = [[1,0,1], [1,1,1]]
example2 = [[1,1,1]]
my_data = []
my_data.append(example1)
my_data.append(example2)
I think you could use pad_sequences. This should get all of your inputs to the same length.
You can use padding (pad_sequences) and a Masking layer.
You can also train on batches of different lengths in a manual training loop:
for e in range(epochs):
    for batch_x, batch_y in list_of_batches:  # provided you separated the batches by length
        model.train_on_batch(batch_x, batch_y)
The key point in all of this is that your input_shape=(None, k), so the time dimension is left unspecified.
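One way to build such a list_of_batches is to group the examples by their length, so that every batch becomes a regular array. This is a sketch under assumptions: the third example and the my_labels targets are invented, and each example is assumed to have exactly one label.

```python
from collections import defaultdict

import numpy as np

k = 3
my_data = [
    [[1, 0, 1], [1, 1, 1]],   # example1 from the question (length 2)
    [[1, 1, 1]],              # example2 from the question (length 1)
    [[0, 0, 1], [0, 1, 0]],   # extra made-up example (length 2)
]
my_labels = [1, 0, 1]         # hypothetical targets, one per example

# Group examples that share the same number of timesteps.
buckets = defaultdict(list)
for x, y in zip(my_data, my_labels):
    buckets[len(x)].append((x, y))

# Each bucket becomes one batch of shape (batch, length, k).
list_of_batches = []
for length, items in sorted(buckets.items()):
    batch_x = np.array([x for x, _ in items], dtype="float32")
    batch_y = np.array([y for _, y in items], dtype="float32")
    list_of_batches.append((batch_x, batch_y))

for batch_x, batch_y in list_of_batches:
    print(batch_x.shape, batch_y.shape)
```

Because the lengths differ only between batches and never within one, no padding or masking is needed with this approach.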
