I'm throwing myself into machine learning, and wish to use Keras for a university project that's time-critical. I realise it would be best to learn individual concepts and building blocks, but it's important that this is done soon.
I'm working with someone who has some experience and interest in machine learning, but we cannot seem to get further than this. The code below was adapted from GitHub code mentioned in a guide on Machine Learning Mastery.
For context, I've got data from multiple physical sensors (where each sensor is a column), with each sample from those sensors represented by one row. I wish to use machine learning to determine who the sensors were tracking at any given time. I'm trying to allocate approximately 80% of the rows to training and 20% to testing, and am creating my own "y" set of data (with the first 521,549 rows being from one participant, and the remainder from another). My data (training and test) has a total of 1,019,802 rows, and 16 columns (all populated), but the number of columns can be reduced if need be.
I would love to know the following:
What does this error mean in the context of what I'm trying to achieve, and how can I change my code to avoid it?
Is the below code suitable for what I'm trying to achieve?
Does this code represent any specific fundamental flaw in my understanding of what machine learning (generally or specifically) is designed to achieve?
Below is the Python code I'm trying to run to make use of machine learning:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

# Load the sensor data: one row per sample, one column per sensor
x_all = pd.read_csv("(redacted)...csv",
                    delim_whitespace=True, header=None, low_memory=False).values

# Labels: the first 521,549 rows belong to participant 0, the remainder to participant 1
y_all = np.append(np.full((521549, 1), 0), np.full((498253, 1), 1))

# Roughly 80/20 train/test split
limit = 815842
x_train = x_all[:limit]
y_train = y_all[:limit]
x_test = x_all[limit:]
y_test = y_all[limit:]

max_features = 16
maxlen = 80
batch_size = 32

model = Sequential()
model.add(Embedding(500, 32, input_length=max_features))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))

score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
This is an excerpt from the CSV referenced in the code:
6698.486328125 4.28260869565217 4.6304347826087 10.6195652173913 2.4392579293836 2.56134051466188 9.05326152004788 0.0 1.0812 924.898261191267 -1.55725190839695 -0.244274809160305 0.320610687022901 -0.122938530734633 0.490254872563718 0.382308845577211
6706.298828125 4.28260869565217 4.58695652173913 10.5978260869565 2.4655894673848 2.50867743865949 9.04368641532017 0.0 1.0812 924.898261191267 -1.64885496183206 -0.366412213740458 0.381679389312977 -0.122938530734633 0.490254872563718 0.382308845577211
6714.111328125 4.26086956521739 4.64130434782609 10.5978260869565 2.45601436265709 2.57809694793537 9.03411131059246 0.0 1.0812 924.898261191267 -0.931297709923664 -0.320610687022901 0.320610687022901 -0.125937031484258 0.493253373313343 0.371814092953523
The following error occurs when running this:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 972190 is not in [0, 500)
[[Node: embedding_1/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:#training/Adam/Assign_2"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, training/Adam/gradients/embedding_1/embedding_lookup_grad/concat/axis)]]
For reference, I'm on a 2017 27-inch iMac Retina 5K with 4.2 GHz i7, 32 GB RAM, with a Radeon Pro 580 8 GB.
There are some more tutorials on Machine Learning Mastery for what you want to accomplish:
https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
And I'll give my own quick explanation of what you probably want to do.
Right now it looks like you are using the exact same data for the X and y inputs into your model. The y inputs are the labels, which in your case is "who the sensors were tracking". So in the binary case of having 2 possible people, it is set to 0 for the first person and 1 for the second person.
The sigmoid activation on the final layer will output a number between 0 and 1. If the number is below 0.5 then it is predicting that the sensor is tracking person 0, and if it is above 0.5 then it is predicting person 1. This will be reflected in the accuracy score.
You will probably not want to use an embedding layer; it's possible that you might, but I would drop it to start with. Do normalize your data before feeding it into the net, though, to improve training. Scikit-Learn has good tools for this if you want a quick solution.
http://scikit-learn.org/stable/modules/preprocessing.html
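For example, a minimal sketch of that normalization step, assuming the x_train and x_test arrays from your code (StandardScaler is just one of several scalers in that module):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training rows only, then apply the same
# transformation to the test rows to avoid leaking test statistics
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)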
When working with time series data you often want to feed in a window of time points rather than a single point. If you send your time series to Keras model.fit() then it will use a single point as input.
In order to have a time window as input you need to reorganize each example in the data set to be a whole window, or you can use a generator if that would take up too much memory. This is described in the Machine Learning Mastery pages that I linked.
Keras has a generator that you can use called TimeseriesGenerator
from keras.preprocessing.sequence import TimeseriesGenerator
timeseries_generator = TimeseriesGenerator(data, targets, length, sampling_rate)
where data is your time series of features and targets is your time series of labels.
If you use the timeseries generator then when fitting you will have to use fit_generator
model.fit_generator(timeseries_generator)
The same goes for evaluating, using evaluate_generator().
If you have your data set up correctly then your model should work
model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
You could also try a simpler dense model:
model = Sequential()
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
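To tie the TimeseriesGenerator and the LSTM variant together, here is a minimal sketch; the window length of 80 time steps and the generator batch size are my own assumptions, and x_train, y_train, x_test, y_test are the arrays from your question (ideally normalized as above):

from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.preprocessing.sequence import TimeseriesGenerator

window = 80  # assumed window: how many past samples the model sees per prediction

train_gen = TimeseriesGenerator(x_train, y_train, length=window, batch_size=32)
test_gen = TimeseriesGenerator(x_test, y_test, length=window, batch_size=32)

model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2,
               input_shape=(window, x_train.shape[1])))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit_generator(train_gen, epochs=15)
score, acc = model.evaluate_generator(test_gen)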
One more issue I see is that it appears you would be splitting off a test set that contains only one type of label, which is not only bad practice but will also weight your training set towards the other label, which might hurt your results.
Hopefully that gets you started. Make sure you get your data set up correctly!
Related
I am currently working on a supervised machine learning solution to categorize some data into two classes.
So far I have worked on a Keras/TensorFlow Python script which seems to manage that just fine:
from keras.models import Sequential
from keras.layers import Dense

# one input feature per column, minus the label column
input_dim = len(data.columns) - 1
print(input_dim)

model = Sequential()
model.add(Dense(8, input_dim=input_dim, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(train_x, train_y, validation_split=0.33, epochs=1500, batch_size=1000, verbose=1)
The input data I use is a CSV file with 168 input features. When I first ran this script successfully, I was very surprised to see that I actually got an accuracy of over 99% after only a couple hundred epochs of training. I haven't even normalized the input data yet.
What I am trying to find out now is which of my 168 input features are responsible for such a high accuracy rate and which features have little effect during training.
Is there a way to check the weights of each input column to see which of them are used the most, and which make the most impact?
Answering your last question:
model.layers[0].get_weights()
However, unless there is an obviously dominating weight, looking at the raw weights alone is unlikely to tell you which features matter. For feature selection, try replacing some features of your input by their mean and check how the predictions fluctuate: little to no fluctuation means that the feature is not important.
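As a rough sketch of that mean-replacement check (model and train_x here are placeholders for your own fitted model and feature matrix):

import numpy as np

baseline = model.predict(train_x)
for col in range(train_x.shape[1]):
    x_perturbed = train_x.copy()
    # replace one feature column by its mean and measure how much predictions move
    x_perturbed[:, col] = x_perturbed[:, col].mean()
    shift = np.abs(model.predict(x_perturbed) - baseline).mean()
    print(col, shift)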
Also, please consider posting ML questions on https://datascience.stackexchange.com/
There is going to be a connection from each 'column' to each neuron in the first layer. Apart from randomizing or dropping column values (dropping is equivalent to replacing them with their mean, as suggested in the answer above), you could go two ways about finding the relative importance of columns using the weights. Please keep in mind that these methods only make sense if you feed in a standardized dataset.
You could use the L1 or L2 norm of each column's weights in the first layer.
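For instance, something along these lines, assuming the dense model from the question (get_weights()[0] is the kernel matrix with one row per input feature):

import numpy as np

weights = model.layers[0].get_weights()[0]            # shape: (n_features, n_units)
importance = np.linalg.norm(weights, ord=2, axis=1)   # one L2 norm per input column
print(importance)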
Say your input has 100 columns. You create a layer that weights the input with a trainable tensor of size (100,), and you feed the output of this layer into your sequential model. Your trained (100,) tensor then reflects the relative importance of your columns.
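One way to read that suggestion is an elementwise feature-weighting layer whose trained vector is inspected afterwards; a hypothetical Keras sketch (the layer name and the 'ones' initializer are my own choices):

from keras.layers import Layer

class FeatureWeighting(Layer):
    """Multiplies each input column by a trainable scalar weight."""
    def build(self, input_shape):
        self.w = self.add_weight(name='feature_weights',
                                 shape=(input_shape[-1],),
                                 initializer='ones',
                                 trainable=True)
        super(FeatureWeighting, self).build(input_shape)

    def call(self, inputs):
        return inputs * self.w

# used as the first layer of your Sequential model; after training,
# model.layers[0].get_weights()[0] holds the per-column importances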
I am developing a convolutional neural network in Python. I create a Sequential model and I want to fit this model on different datasets, so I call fit in a for loop. But calling fit with one dataset vs. calling it in a for loop gives different results. How can I reset the model parameters?
My code is below:
for tr_ind in range(len(train_set_month_list)):
    test_dataset_month_info = test_set_month_list[tr_ind]
    train_dataset_month_info = train_set_month_list[tr_ind]
    model = Sequential()
    history = fit_model_cnn(model, train_x_df, train_x_df_reshaped, train_y_df,
                            validation_data_x_df_reshaped, validation_data_y_df,
                            timesteps, epoch_size, batch_size)


def fit_model_cnn(model, train_x_df, train_x_df_reshaped, train_y_df,
                  validation_data_x_df_reshaped, validation_data_y_df,
                  timesteps, epoch_size, batch_size):
    model.add(Conv1D(filters=filter_size, kernel_size=kernel_size, activation=activation_func,
                     padding='same', input_shape=(timesteps, train_x_df.shape[1] / timesteps)))
    model.add(MaxPooling1D(pool_size=1))
    model.add(Flatten())
    model.add(Dense(node_count, activation=activation_func, kernel_initializer='he_uniform'))
    model.add(Dense(1))
    model.compile(optimizer=optimizer_type, loss='mse')
    # fit model
    history = model.fit(train_x_df_reshaped, train_y_df.values,
                        validation_data=(validation_data_x_df_reshaped, validation_data_y_df),
                        batch_size=batch_size, epochs=epoch_size)
    return history
I think you should clear your previous model inside the loop, which you can do with keras.backend.clear_session() (see https://keras.io/backend/). That should solve your problem.
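A minimal sketch of what that could look like in your loop (clear_session() is the documented Keras call; the rest mirrors your own code):

from keras import backend as K

for tr_ind in range(len(train_set_month_list)):
    K.clear_session()  # destroy the previous graph so each model starts fresh
    model = Sequential()
    history = fit_model_cnn(model, train_x_df, train_x_df_reshaped, train_y_df,
                            validation_data_x_df_reshaped, validation_data_y_df,
                            timesteps, epoch_size, batch_size)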
From a very simplistic point of view, the data is fed in sequentially, which suggests that at the very least, it's possible for the data order to have an effect on the output. If the order doesn't matter, randomization certainly won't hurt. If the order does matter, randomization will help to smooth out those random effects so that they don't become systematic bias. In short, randomization is cheap and never hurts, and will often minimize data-ordering effects.
In other words, when you feed your neural network with different datasets, your model can get biased towards the latest dataset it was trained on.
You should always make sure that you are randomly sampling from all datasets you have.
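As a simple illustration with hypothetical arrays, a single random permutation applied to the combined features and labels achieves this:

import numpy as np

# combine the datasets, then shuffle rows and labels with the same permutation
x_combined = np.concatenate([x_dataset_1, x_dataset_2])
y_combined = np.concatenate([y_dataset_1, y_dataset_2])
perm = np.random.permutation(len(x_combined))
x_combined, y_combined = x_combined[perm], y_combined[perm]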
I am training a neural network to predict a whole day of availability (144 samples, 6 features) by passing yesterday's availability (144 samples). I'm having trouble finding good resources or explanations on how to define a neural network to predict time series in a regression problem. The training is defined as a supervised learning problem. My definition of the neural network is,
lstm_neurons = 30
model = Sequential()
model.add(LSTM(lstm_neurons * 2, input_shape=(self.train_x.shape[1], self.train_x.shape[2]), return_sequences=True))
model.add(LSTM(lstm_neurons * 2))
model.add(Dense(len_day, activation='softmax'))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=[rmse, 'mae', 'mape'])
I am training for 20 epochs with a batch size of 200 where the used datasets have the following shapes,
Train X (9631, 144, 6)
Train Y (9631, 144)
Test X (137, 144, 6)
Test Y (137, 144)
Validation X (3990, 144, 6)
Validation Y (3990, 144)
All of this produces NaN values for the loss, RMSE, MAE, etc. during training. While this looks like a problem, I can still use the generated model to make predictions, and they look good-ish.
The first question to ask is: are you trying to predict a time series by interpreting availability as a probability measure?
The softmax activation function would work best under this scenario - but you may be misspecifying it when you are in fact attempting to forecast an interval time series, which may be why you are obtaining NaN readings for your results.
This example might be of use to you - LSTM is used in this example to forecast weekly fluctuations in hotel cancellations.
Similarly to your example, X_train and X_val are reshaped as samples, time steps, features:
import numpy as np

X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_val = np.reshape(X_val, (X_val.shape[0], 1, X_val.shape[1]))
The LSTM network is defined as follows:
# Generate LSTM network
import tensorflow as tf
from tensorflow.keras.layers import Dense, LSTM

model = tf.keras.Sequential()
model.add(LSTM(4, input_shape=(1, previous)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, Y_train, epochs=20, batch_size=1, verbose=2)
As you can see, the mean squared error is used as the loss function since the cancellation variable in question is interval (i.e. can take on a wide range of values and is not necessarily restricted by any particular scale).
I can only speculate as I have not seen your data or results, but you may be going wrong by defining softmax as your activation function when it is not appropriate - I suspect this is the case as you are also using mean squared error as the loss measurement.
In the above example, the Dense layer does not specify an activation function per se, i.e. it falls back to a linear activation.
In terms of how you might choose to validate whether your time series forecast with LSTM is effective, a potentially good idea is to compare the findings to that of a simpler time series model; e.g. ARIMA.
Using our example, ARIMA performed better when forecasting for Hotel 1, but LSTM performed better when forecasting for Hotel 2:
H1 Results
Reading ARIMA LSTM
MDA 0.86 0.8
RMSE 57.95 63.89
MFE -12.72 -54.25
H2 Results
Reading ARIMA LSTM
MDA 0.86 0.8
RMSE 274.07 95.28
MFE 156.32 38.65
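If you want a quick ARIMA baseline to compare against, a rough sketch with statsmodels might look like the following; the library choice, the (1, 1, 1) order and the train_series/test_series names are assumptions on my part, not what the linked example necessarily uses:

from statsmodels.tsa.arima.model import ARIMA

# fit a simple ARIMA model on the training series and forecast the test horizon
arima_fit = ARIMA(train_series, order=(1, 1, 1)).fit()
arima_forecast = arima_fit.forecast(steps=len(test_series))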
Finally, when creating your datasets using the train and validation sets, you must also ensure that you are using the correct previous parameter, i.e. the number of time periods going back with which you choose to regress against the observations at time t.
For instance, you are using yesterday's availability - but you might find that the model is improved using the previous 5 or 10 days, for instance.
# Number of previous time periods to regress on
previous = 5
X_train, Y_train = create_dataset(train, previous)
X_val, Y_val = create_dataset(val, previous)
In your situation, the first thing I would check is the use of the softmax activation function, and work from there.
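Concretely, that check might amount to swapping the output layer of your model for a plain linear one, as in this small hypothetical change (reusing len_day from your code):

# replace Dense(len_day, activation='softmax') with a default (linear) activation,
# which is consistent with a mean-squared-error regression loss
model.add(Dense(len_day))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])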
I want to classify patterns in images. My original images are 200,000 x 200,000 pixels; I reshaped them to 96 x 96, and the patterns are still recognizable to the human eye. Pixel values are 0 or 1.
I'm using the following neural network:
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense
import numpy as np

# hold out 20% of the images as a test set
train_X, test_X, train_Y, test_Y = train_test_split(cnn_mat, img_bin["Classification"], test_size=0.2, random_state=0)

# compensate for the class imbalance
class_weights = class_weight.compute_class_weight('balanced',
                                                  np.unique(train_Y),
                                                  train_Y)

train_Y_one_hot = to_categorical(train_Y)
test_Y_one_hot = to_categorical(test_Y)

# split a validation set off the training data
train_X, valid_X, train_label, valid_label = train_test_split(train_X, train_Y_one_hot, test_size=0.2, random_state=13)

model = Sequential()
model.add(Conv2D(24, kernel_size=3, padding='same', activation='relu',
                 input_shape=(96, 96, 1)))
model.add(MaxPool2D())
model.add(Conv2D(48, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool2D())
model.add(Conv2D(64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

train = model.fit(train_X, train_label, batch_size=80, epochs=20, verbose=1,
                  validation_data=(valid_X, valid_label), class_weight=class_weights)
I have already run some experiments to find a "good" number of hidden and fully connected layers. It's probably not the most optimal architecture, since my computer is slow: I just ran each model once and selected the best one from its confusion matrix, and I didn't use cross-validation. I also didn't try more complex architectures since my amount of data is small, and I have read that small architectures are best. Is it worth trying a more complex architecture?
Here are the results with 5 and with 12 epochs, batch size 80. This is the confusion matrix for my test set:
As you can see, it looks like I'm overfitting. When I only run 5 epochs, most of the classes are assigned to class 0; with more epochs, class 0 is less dominant but the classification is still bad.
I added 0.8 dropout after each convolutional layer, e.g.:
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
With dropout, 95% of my images are classified into class 0.
I tried image augmentation: I rotated all my training images and still used the class weights, but the results didn't improve. Should I try to augment only the classes with a small number of images? Most of what I have read says to augment the whole dataset...
To summarize, my questions are:
Should I try a more complex model?
Is it useful to do image augmentation only on under-represented classes? If so, should I still use class weights (I guess not)?
Can I hope to find a "good" model with a CNN given the size of my dataset?
I think that, given the imbalanced data, it is better to create a custom data generator for your model, so that each generated batch contains at least one sample from each class (see the sketch after the links below). It is also better to use a Dropout layer after each dense layer instead of after the conv layers. For data augmentation it is better to use at least a combination of rotation, horizontal flip and vertical flip. There are some other approaches to data augmentation, like using a GAN or random pixel replacement.
For GANs you can check this SO post.
For using a GAN as a data augmenter you can read this article.
For a combination of pixel-level augmentation and GANs, see pixel level data augmentation.
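A bare-bones sketch of such a balanced generator (my own illustrative code, not taken from the linked posts; train_X and train_Y are assumed to be NumPy arrays from the question, with integer labels rather than one-hot vectors):

import numpy as np
from keras.utils import to_categorical

def balanced_batches(x, y, batch_size, n_classes):
    # indices of the samples belonging to each class
    class_indices = [np.where(y == c)[0] for c in range(n_classes)]
    while True:
        # one guaranteed sample per class, the rest drawn uniformly at random
        idx = [np.random.choice(ci) for ci in class_indices]
        idx += list(np.random.randint(0, len(x), batch_size - n_classes))
        idx = np.array(idx)
        yield x[idx], to_categorical(y[idx], num_classes=n_classes)

# model.fit_generator(balanced_batches(train_X, train_Y, 80, 16), steps_per_epoch=100, epochs=20)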
What I used - in a different setting - was to upsample my data with ADASYN. This algorithm calculates the amount of new data required to balance your classes, and then uses the available data to sample novel examples.
There is an implementation for Python. Otherwise, you also have very little data. SVMs perform well even with little data, so you might want to try them or other image classification algorithms, depending on whether the expected pattern is always at the same position or varies. You could also try the Viola–Jones object detection framework.
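For reference, the imbalanced-learn package provides an ADASYN implementation; a minimal sketch (the package choice and the flattening step are my assumptions, since ADASYN works on 2-D feature matrices):

from imblearn.over_sampling import ADASYN

# ADASYN expects a 2-D feature matrix, so flatten the 96x96 images first
x_flat = train_X.reshape(len(train_X), -1)
x_resampled, y_resampled = ADASYN().fit_resample(x_flat, train_Y)
x_resampled = x_resampled.reshape(-1, 96, 96, 1)  # back to image shape for the CNN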
I have a dataset spanning hundreds of values regarding temperature. Obviously, in meteorology, it is helpful to predict what future values will be based on the past.
I have the following stateful model, built in Keras:
from keras.models import Sequential
from keras.layers import Dense, LSTM

look_back = 1

model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, look_back, 1), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

# fit in a loop, resetting the LSTM state after each pass over the data
for i in range(10):
    model.fit(trainX, trainY, epochs=4, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
# make predictions
trainPredict = model.predict(trainX, batch_size=batch_size)
I have successfully been able to train and test the model on my dataset to reasonable results, however am struggling to comprehend what is required to predict the next, say, 20 points in the dataset. Obviously, these 20 points are outside of the dataset, and they have yet to "occur".
I would appreciate anything that would be of help; I feel like I am missing some simple functionality in Keras.
Thank you.
I feel like I am missing some simple functionality in Keras.
You have all you need right there. To obtain predictions on new data you have to use model.predict() again, but on the desired range. This depends on how your data looks.
Let's assume your time series trainX had events with x ranging over [0, 100].
Then to predict the next 20 events you want to call predict() on values 101 to 120, something like:
futureData = np.array(range(101, 121))              # [101, 102, ..., 120]
futureData = futureData.reshape(-1, look_back, 1)   # reshape to the model's (samples, look_back, features) input
futurePred = model.predict(futureData)
Again, this depends on what your "next 20" events look like. If your bin size were instead 0.1 (100, 100.1, 100.2, ...) you should evaluate the prediction accordingly.
You may also like to check this page where they give examples and explain more about Timeseries in Keras with RNNs, if you are interested.