Error when checking model input keras when predicting new results - python

I am trying to use a Keras model I built on new data, but I get an input error when I try to predict on it.
Here's my code for the model:
def build_model(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
    model.add(LSTM(128))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop')
    return model
And my code to predict the output predictions of my new data:
LSTM_model = load_model('LSTMmodel.h5')
data = pickle.load(open('traindata.pkl', 'rb'))
#### LSTM ####
"""Run train/test on logistic regression model"""
# Extract data and labels
X = [x[1] for x in data]
labels = [x[0] for x in data]
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}
max_features = len(valid_chars) + 1
maxlen = np.max([len(x) for x in X])
# Convert characters to int and pad
X = [[valid_chars[y] for y in x] for x in X]
X = sequence.pad_sequences(X, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels]
y_pred = LSTM_model.predict(X)
The error I get when running this code:
ValueError: Error when checking input: expected embedding_1_input to have shape (57,) but got array with shape (36,)
My error comes from maxlen because for my training data, maxlen=57 and with my new data, maxlen=36.
So I tried to set in my prediction code maxlen=57 but then I get this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[31,53] = 38 is not in [0, 38)
[[Node: embedding_1/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, embedding_1/embedding_lookup/axis)]]
What should I do in order to resolve these issues? Change my embedding layer?

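Either set the input_length of the Embedding layer to the maximum length you expect to see in any dataset, or pass pad_sequences the same maxlen value you used when building the model. In that case any sequence shorter than maxlen is padded and any sequence longer than maxlen is truncated.
Also make sure the features are the same at training and prediction time, i.e. the character-to-index mapping must not change. Your second error (an index of 38 where the embedding only accepts [0, 38)) comes from rebuilding valid_chars on the new data, which produces indices the trained Embedding layer has never seen. Reuse the dictionary that was built from the training data.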
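A minimal sketch of that idea, in plain Python so it runs without Keras: the character dictionary and maxlen are stored at training time and reloaded at prediction time. The helper names (build_char_index, encode), the toy strings, and the choice to map unseen characters to 0 are assumptions made for illustration; the padding logic only imitates the pad_sequences defaults (pad and truncate at the front).

```python
import json

def build_char_index(strings):
    """Map every character seen in training to an integer >= 1 (0 is reserved for padding)."""
    return {ch: idx + 1 for idx, ch in enumerate(sorted(set(''.join(strings))))}

def encode(strings, char_index, maxlen):
    """Encode strings with a fixed dictionary, then left-pad / left-truncate to maxlen."""
    rows = []
    for s in strings:
        # Assumption: characters never seen in training are mapped to 0.
        ids = [char_index.get(ch, 0) for ch in s]
        ids = ids[-maxlen:]                        # keep the last maxlen items
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

# Training time: build the dictionary once and store it next to the model.
train_strings = ["google.com", "example.org"]      # toy data
char_index = build_char_index(train_strings)
maxlen = max(len(s) for s in train_strings)
saved = json.dumps({"char_index": char_index, "maxlen": maxlen})

# Prediction time: reload the stored values instead of recomputing them.
state = json.loads(saved)
X_new = encode(["new-site.net", "a.io"], state["char_index"], state["maxlen"])
```

Every row of X_new now has the length the Embedding layer expects, and no index exceeds the largest index used in training.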

Related

Keras model with fasttext word embedding

I am trying to learn a language model to predict the last word of a sentence given all the previous words using keras. I would like to embed my inputs using a learned fasttext embedding model.
I managed to preprocess my text data and embed it using fasttext. My training data consists of sentences of 40 tokens each. I created two NumPy arrays, X and y, as inputs, where y is what I want to predict.
X has shape (44317, 39, 300), with 44317 the number of example sentences, 39 the number of tokens in each sentence, and 300 the dimension of the word embedding.
y has shape (44317, 300) and holds, for each example, the embedding of the last token of the sentence.
My code for the Keras model is as follows (inspired by this):
#importing all the needed tensorflow.keras components
model = Sequential()
model.add(InputLayer((None, 300)))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(300, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=20)
model.save('model.h5')
However, the accuracy I get while training this model is extremely low (around 1.5%). I think there is some component of the Keras model that I misunderstood, because if I don't embed my inputs and add an extra embedding layer instead of the InputLayer, I get an accuracy of about 60 percent.
My main doubt is the value of 300 in my second Dense layer. I read that this should correspond to the vocabulary size of my word embedding model (which is 48000), but if I put anything other than 300 I get a dimension error. So I understand that I'm doing something wrong, but I can't find how to fix it.
PS:
I have also tried y = to_categorical(y, num_classes=vocab_size), with vocab_size the vocabulary size of my word embedding, and replacing 300 with this same value in the second Dense layer. However, it then tries to create an array of shape (13295100, 48120) instead of the (44317, 48120) I expect.
If you really want to use the word vectors from Fasttext, you will have to incorporate them into your model using a weight matrix and Embedding layer. The goal of the embedding layer is to map each integer sequence representing a sentence to its corresponding 300-dimensional vector representation:
import gensim.downloader as api
import numpy as np
import tensorflow as tf
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
fasttext = api.load("fasttext-wiki-news-subwords-300")
embedding_dim = 300
in_filename = 'data.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)
text_sequences = tokenizer.texts_to_sequences(lines)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
vocab_size = len(tokenizer.word_index) + 1
text_sequences = np.array(text_sequences)
X, y = text_sequences[:, :-1], text_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
max_length = X.shape[1]
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = fasttext[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)
sentence_input = tf.keras.layers.Input(shape=(max_length,))
x = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[weight_matrix],
                              input_length=max_length)(sentence_input)
x = tf.keras.layers.LSTM(100, return_sequences=True)(x)
x = tf.keras.layers.LSTM(100)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)
output = tf.keras.layers.Dense(vocab_size, activation='softmax')(x)
model = tf.keras.Model(sentence_input, output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=5, epochs=20)
Note that I am using the dataset and preprocessing steps from the tutorial you linked.
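The shape reported in the question, (13295100, 48120), is consistent with each of the 44317 x 300 float values in y being treated as its own class label. The NumPy sketch below illustrates this with toy sizes. The one_hot helper is a hypothetical stand-in for a one-hot encoder, not the Keras implementation, and all sizes are assumptions.

```python
import numpy as np

num_sentences, embedding_dim, vocab_size = 6, 4, 10   # toy sizes (assumptions)

# What the question passes: one embedding vector per sentence.
y_vectors = np.zeros((num_sentences, embedding_dim))
# What a softmax over the vocabulary needs: one integer word id per sentence.
y_ids = np.array([3, 7, 1, 0, 9, 3])

def one_hot(labels, num_classes):
    """Hypothetical stand-in for a one-hot encoder: every element becomes one row."""
    flat = np.asarray(labels, dtype=int).ravel()
    return np.eye(num_classes)[flat]

print(one_hot(y_vectors, vocab_size).shape)  # (24, 10): 6 * 4 rows
print(one_hot(y_ids, vocab_size).shape)      # (6, 10): one row per sentence
```

With integer word ids as targets, the sparse_categorical_crossentropy loss is an alternative that avoids building the one-hot array at all.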
It is very difficult to train RNN models on a next-word prediction task. LSTM/GRU models often lack the capacity to extract enough features from the text.
There are two ways to address this:
1. Predict characters instead of word classes.
2. Use a transformer model. For example, BERT is good at extracting features and predicting masked words.

LSTM: Input 0 of layer sequential is incompatible with the layer

I know there are several questions about this here, but I haven't found one which fits exactly my problem.
I'm trying to fit an LSTM with data from Pandas DataFrames but getting confused about the format I have to provide them.
I created a small code snippet that shows what I am trying to do:
import pandas as pd, tensorflow as tf, random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
targets = pd.DataFrame(index=pd.date_range(start='2019-01-01', periods=300, freq='D'))
targets['A'] = [random.random() for _ in range(len(targets))]
targets['B'] = [random.random() for _ in range(len(targets))]
features = pd.DataFrame(index=targets.index)
for i in range(len(features)):
    features[str(i)] = [random.random() for _ in range(len(features))]
model = Sequential()
model.add(LSTM(units=targets.shape[1], input_shape=features.shape))
model.compile(optimizer='adam', loss='mae')
model.fit(features, targets, batch_size=10, epochs=10)
This results in:
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [10, 300]
which I expect relates to the dimensions of the features DataFrame provided. I guess that once this is fixed, the next error will mention the targets DataFrame.
As far as I understand, the 'units' parameter of my first layer defines the output dimensionality of this model. The inputs have to have a 3D shape, but I don't know how to create them out of the 2D world of the DataFrames.
I hope you can help me understand the reshape mechanism in Python and how to use it in combination with Pandas DataFrames. (I'm quite new to Python and come from R.)
Thanks in advance
Let's look at a few popular ways in which LSTMs are used.
Many to Many
Example: You have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict the part of speech (POS) of each word.
So you have n words and you feed one word per timestep to the LSTM. Each LSTM timestep (also called LSTM unrolling) will produce an output. Each word is represented by a set of features, normally a word embedding. So the input to the LSTM is of size batch_size x time_steps x features.
Keras code:
# Imports used by this and the following snippets
import numpy as np
from tensorflow import keras
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = True)(inputs)
outputs = keras.layers.TimeDistributed(keras.layers.Dense(5, activation='softmax'))(lstm)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
# Random data, only meant to demonstrate the shapes
X = np.random.randn(4,10,3)
y = np.random.randint(0,2, size=(4,10,5))
model.fit(X, y, epochs=2)
print (model.predict(X).shape)
Many to One
Example: Again you have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict whether the sentiment of the sentence is positive or negative.
Keras code
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = False)(inputs)
outputs =keras.layers.Dense(5, activation='softmax')(lstm)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
X = np.random.randn(4,10,3)
y = np.random.randint(0,2, size=(4,5))
model.fit(X, y, epochs=2)
print (model.predict(X).shape)
Many to multi-headed
Example: You have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict the sentiment of the sentence as well as the author of the sentence.
This is a multi-headed model where one head predicts the sentiment and another head predicts the author. Both heads share the same LSTM backbone.
Keras code
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = False)(inputs)
output_A = keras.layers.Dense(5, activation='softmax')(lstm)
output_B = keras.layers.Dense(5, activation='softmax')(lstm)
model = keras.Model(inputs=inputs, outputs=[output_A, output_B])
model.compile(loss='categorical_crossentropy', optimizer='adam')
X = np.random.randn(4,10,3)
y_A = np.random.randint(0,2, size=(4,5))
y_B = np.random.randint(0,2, size=(4,5))
model.fit(X, [y_A, y_B], epochs=2)
y_hat_A, y_hat_B = model.predict(X)
print (y_hat_A.shape, y_hat_B.shape)
What you are looking for is the many-to-multi-head model, where one head makes the predictions for A and another head makes the predictions for B.
The input data for the LSTM has to be 3D.
If you print the shapes of your DataFrames you get:
targets : (300, 2)
features : (300, 300)
The input data has to be reshaped into (samples, time steps, features). The targets must then have the same number of samples as the reshaped input.
You need to set a number of time steps for your problem, in other words, how many consecutive observations will be used to make a prediction.
For example, if you have 300 days and 2 features, the time step can be 3, so that three days are used to make one prediction (you can choose this arbitrarily). Here is the code for reshaping your data (with a few more changes):
import pandas as pd
import numpy as np
import tensorflow as tf
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
data = pd.DataFrame(index=pd.date_range(start='2019-01-01', periods=300, freq='D'))
data['A'] = [random.random() for _ in range(len(data))]
data['B'] = [random.random() for _ in range(len(data))]
# Choose the time_step size.
time_steps = 3
# Use numpy for the 3D array as it is easier to handle.
data = np.array(data)
def make_x_y(ts, data):
    """
    Parameters
    ts : int
    data : numpy array
    This function creates two arrays, x and y.
    x is the input data and y is the target data.
    """
    x, y = [], []
    offset = 0
    for i in data:
        if offset < len(data)-ts:
            x.append(data[offset:ts+offset])
            y.append(data[ts+offset])
            offset += 1
    return np.array(x), np.array(y)
x, y = make_x_y(time_steps, data)
print(x.shape, y.shape)
nodes = 100 # This is the width of the network.
out_size = 2 # Number of outputs produced by the network. Same size as features.
model = Sequential()
model.add(LSTM(units=nodes, input_shape=(x.shape[1], x.shape[2])))
model.add(Dense(out_size)) # For the output a Dense (fully connected) layer is used.
model.compile(optimizer='adam', loss='mae')
model.fit(x, y, batch_size=10, epochs=10)
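As a possible alternative to the explicit loop in make_x_y, the same windows can be built with NumPy's sliding_window_view (available from NumPy 1.20). This is only a sketch: the name make_x_y_vectorized is hypothetical, and the random array stands in for the 300 x 2 data used above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_x_y_vectorized(ts, data):
    """Return x of shape (samples, ts, features) and y of shape (samples, features)."""
    # Windows over axis 0; the window axis is appended last: (n - ts + 1, features, ts)
    windows = sliding_window_view(data, window_shape=ts, axis=0)
    # Reorder to (n - ts + 1, ts, features), the layout an LSTM expects
    windows = windows.transpose(0, 2, 1)
    x = windows[:-1]   # the last window has no following row to predict
    y = data[ts:]      # the row that follows each window
    return x, y

data = np.random.rand(300, 2)   # stand-in for the two columns A and B
x, y = make_x_y_vectorized(3, data)
print(x.shape, y.shape)         # (297, 3, 2) (297, 2)
```

Note that x is a read-only view into data; call x.copy() if it needs to be modified.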
Well, just to finalize this issue, I would like to share a solution I have worked on in the meantime. The class TimeseriesGenerator in tf.keras.... made it quite easy to provide the data in the right shape to an LSTM model:
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np
window_size = 7
batch_size = 8
sampling_rate = 1
train_gen = TimeseriesGenerator(X_train.values, y_train.values,
                                length=window_size, sampling_rate=sampling_rate,
                                batch_size=batch_size)
valid_gen = TimeseriesGenerator(X_valid.values, y_valid.values,
                                length=window_size, sampling_rate=sampling_rate,
                                batch_size=batch_size)
test_gen = TimeseriesGenerator(X_test.values, y_test.values,
                               length=window_size, sampling_rate=sampling_rate,
                               batch_size=batch_size)
There are many other ways of implementing generators, e.g. using more_itertools, which provides the function windowed, or making use of tf.data.Dataset and its window method.
For me the TimeseriesGenerator was sufficient to feed the tests I did.
In case you would like to see an example that models the DAX based on some stocks, I'm sharing a notebook on GitHub.

Encoder-Decoder LSTM model gives 'nan' loss and predictions

I am trying to create a basic encoder-decoder model for training a chatbot. X contains the questions or human dialogues and Y contains the bot answers. I have padded the sequences to the max size of input and output sentences. X.shape = (2363, 242, 1) and Y.shape = (2363, 144, 1). But during training, the loss has value 'nan' for all epochs and the prediction gives array with all values as 'nan'. I have tried using 'rmsprop' optimizer instead of 'adam'. I cannot use loss function 'categorical_crossentropy' as the output is not one-hot encoded but a sequence. What exactly is wrong with my code?
Model
model = Sequential()
model.add(LSTM(units=64, activation='relu', input_shape=(X.shape[1], 1)))
model.add(RepeatVector(Y.shape[1]))
model.add(LSTM(units=64, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(units=1)))
print(model.summary())
model.compile(optimizer='adam', loss='mean_squared_error')
hist = model.fit(X, Y, epochs=20, batch_size=64, verbose=2)
model.save('encoder_decoder_model_epochs20.h5')
Data Preparation
def remove_punctuation(s):
    s = s.translate(str.maketrans('','',string.punctuation))
    s = s.encode('ascii', 'ignore').decode('ascii')
    return s
def prepare_data(fname):
    word2idx = {'PAD': 0}
    curr_idx = 1
    sents = list()
    for line in open(fname):
        line = line.strip()
        if line:
            tokens = remove_punctuation(line.lower()).split()
            tmp = []
            for t in tokens:
                if t not in word2idx:
                    word2idx[t] = curr_idx
                    curr_idx += 1
                tmp.append(word2idx[t])
            sents.append(tmp)
    sents = np.array(pad_sequences(sents, padding='post'))
    return sents, word2idx
human = 'rdany-conversations/human_text.txt'
robot = 'rdany-conversations/robot_text.txt'
X, input_vocab = prepare_data(human)
Y, output_vocab = prepare_data(robot)
X = X.reshape((X.shape[0], X.shape[1], 1))
Y = Y.reshape((Y.shape[0], Y.shape[1], 1))
First of all, check that you do not have any NaNs in your input. If there are none, the cause might be exploding gradients. Standardize your inputs (min-max or z-scaling), try smaller learning rates, clip the gradients, and try different weight initializations.
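The first two checks can be sketched with NumPy alone. The array below is a toy stand-in for the padded token-id sequences in the question, and min_max_scale is a hypothetical helper. Whether scaling raw token ids is a sensible representation is a separate question; the sketch only shows the mechanics.

```python
import numpy as np

# Toy stand-in for padded token ids with shape (samples, timesteps, 1)
X = np.array([[5, 12, 0], [300, 7, 1]], dtype=float).reshape(2, 3, 1)

# 1. Check the input for NaNs before training.
has_nan = bool(np.isnan(X).any())
print("NaNs in input:", has_nan)

# 2. Min-max scale to [0, 1] so large raw values do not blow up the activations.
def min_max_scale(a):
    """Scale an array to [0, 1]; a constant array is mapped to all zeros."""
    lo, hi = a.min(), a.max()
    if hi == lo:
        return np.zeros_like(a)
    return (a - lo) / (hi - lo)

X_scaled = min_max_scale(X)
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```

For gradient clipping, Keras optimizers accept the clipnorm and clipvalue arguments, which can be combined with a smaller learning rate.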

How should the inputs relate/map to label y if Keras Model.fit() is given a list of input train arrays?

I am trying to work with a deep learning model in the following two scenarios, where two different inputs are given. I want to achieve the following:
Train two models (with different weights but the same architecture) with the same input and concatenate the result. So in model.fit(), I am passing just the trainX value. The code is given below. It works fine.
def create_model(input_tensor):
    x = Conv1D(filters=16, kernel_size=6, strides=5, kernel_initializer="uniform", activation="relu")(input_tensor)
    x = GlobalMaxPooling1D()(x)
    x = Dense(2, activation='softmax')(x)
    return x
dataframe = pd.read_csv(Filename, index_col=0)
X = dataframe.values[:,:].astype(float)
Y = dataframe.values[:,1]
trainX, testX, trainy, testy = train_test_split(X, Y, test_size=0.2, random_state=200, shuffle=True)
input_shape = (33000,1)
input_tensor = Input(input_shape)
pred_a = create_model(input_tensor)
pred_b = create_model(input_tensor)
out = keras.layers.Multiply()([pred_a, pred_b])
model = Model(inputs=input_tensor, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
history = model.fit(trainX, trainy)
Train the same model (with the same weights) twice but with different inputs. I am confused about how to pass the inputs in this case. Normally, trainX and trainy have the same number of instances. If I pass a list like model.fit([x_train_1, x_train_2], trainy), then the combined number of instances in x_train_1 and x_train_2 will be double that of y. How does trainy correspond to the inputs in this case?
The input and the corresponding output of a model have shapes X = (batch_size, ...) and y = (batch_size, ...), so every input array must have the same number of samples as y.
In the case of multiple inputs, you can define multiple input layers and feed them to your model instances as follows:
inp_A = Input(shape=(...))
inp_B = Input(shape=(...))
pred_A = create_model(inp_A)
pred_B = create_model(inp_B)
# *** Other layers and code ***
model = Model(inputs=[inp_A, inp_B], outputs=out)
# *** Other code ***
Then you can call model.fit with a list of inputs and a single output. The list is not concatenated along the sample axis: sample i of each input array is paired with label i.
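To make the pairing concrete, here is a small NumPy sketch of how a list of inputs lines up with one label array. It illustrates the indexing only and is not how Keras is implemented internally; iter_batches, the array shapes, and the batch size are all assumptions.

```python
import numpy as np

n_samples = 6
x_train_1 = np.random.rand(n_samples, 8, 1)   # first input (hypothetical shape)
x_train_2 = np.random.rand(n_samples, 8, 1)   # second input, same number of samples
trainy = np.arange(n_samples)                  # one label per sample

def iter_batches(inputs, labels, batch_size):
    """Yield ([batch of each input], batch of labels) taken from the same rows."""
    n = len(labels)
    if any(len(a) != n for a in inputs):
        raise ValueError("every input must have as many samples as the labels")
    for start in range(0, n, batch_size):
        rows = slice(start, start + batch_size)
        yield [a[rows] for a in inputs], labels[rows]

batches = list(iter_batches([x_train_1, x_train_2], trainy, batch_size=4))
for xs, ys in batches:
    print([x.shape for x in xs], ys)
```

Each batch holds the same rows from both inputs together with the matching labels, so the number of training instances stays at n_samples rather than doubling.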

Huge difference between training accuracy and evaluation accuracy using Tensorflow Datasets class + Keras

I created a dataset from my data and this dataset is in the form of (features,labels). features' dimension is [?,731,7] (where the ? should be 400), the corresponding labels' dimension is [4,] as shown in my dataset. Each [731,7] sample corresponds to a 4 elements array like [0,1,0,0].
A few sample data:
Sampledata1
Sampledata2
After building a simple multi-layer neural network, the training process looks normal, as follows. But when I use the same dataset to evaluate (just to check whether the algorithm is working), I get a huge difference.
I don't think this is right, but I am not sure whether it happens because I used .eval() incorrectly or because my datasets are wrong.
My code for datasets creation:
filenames = glob.glob(main_dir+keywords)
# filenames = ['test.txt','test2.txt']
length = len(filenames) # num of files
length_samesat = 100 # happens to be this... I designed it in propagation
batch_num = 731 # happen to be this...
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.flat_map(lambda filename: tf.data.TextLineDataset(filename).skip(3))
dataset = dataset.map(lambda string: tf.string_split([string],delimiter=', ').values)
dataset = dataset.map(lambda x: tf.strings.to_number(x))
dataset = dataset.batch(batch_num)
dataset = dataset.map(lambda tensor: tf.reshape(tensor,[batch_num,7]))
dataset = dataset.batch(1).repeat()
Then I zip my dataset with my label dataset, create the NN, and run it:
dataset_all = tf.data.Dataset.zip((dataset, datalabel))
dataset_all = dataset_all.shuffle(400)
visual_dataset(dataset_all,0,20)
# NN Model
inputs = tf.keras.Input(shape=(731,7,)) # Returns a placeholder tensor
# A layer instance is callable on a tensor, and returns a tensor.
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(400, activation='tanh')(x)
x = tf.keras.layers.Dense(400, activation='tanh')(x)
# x = tf.keras.layers.Dense(450, activation='tanh')(x)
# x = tf.keras.layers.Dense(300, activation='tanh')(x)
# x = tf.keras.layers.Dense(450, activation='tanh')(x)
# x = tf.keras.layers.Dense(200, activation='relu')(x)
# x = tf.keras.layers.Dense(100, activation='relu')(x)
predictions = tf.keras.layers.Dense(4, activation='softmax')(x)
# Instantiate the model given inputs and outputs.
model = tf.keras.Model(inputs=inputs, outputs=predictions)
# The compile step specifies the training configuration.
model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# Trains for 5 epochs
model.fit(dataset_all, epochs=5, steps_per_epoch=400)
model.evaluate(dataset_all, steps=400)
Thanks!
