Predicting future values with a multivariate LSTM model - python

I have a question about making future predictions with an LSTM model.
Let me explain:
I am using an LSTM model to predict the stock price for the next 36 hours.
I have a dataset with 10 features.
I use these 10 features as inputs in my model with a single output (the expected price).
Here is my overall model:
model = Sequential()
# input shape == (336, 10), I use 336 hours for my lookback and 10 features
model.add(LSTM(units=50,return_sequences=True,input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(LSTM(units=50,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='linear'))
model.compile(optimizer='adam',loss='mean_squared_error')
I can assess the performance of my model on my test data, but now I would like to use it to predict the next 36 hours; that is the goal, after all.
And here I get the impression that there is a big black hole on the internet: everyone shows how to build models and evaluate them on test data, but nobody actually uses them to forecast beyond the dataset...
I found two interesting examples that consist of feeding the prediction back into the last window iteratively.
The examples are at the bottom of these articles:
https://towardsdatascience.com/time-series-forecasting-with-recurrent-neural-networks-74674e289816
https://towardsdatascience.com/using-lstms-to-forecast-time-series-4ab688386b1f
In itself this works, but only with a single input feature.
I have 10 features, and my model returns only one output value, so I cannot feed it back into the last window, which expects 10 features in its shape.
Do you see the problem?
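Here is roughly what the iterative loop would look like in my case, and where it breaks with 10 features (just a sketch; the variable names and the assumption that the price is feature 0 are mine):

import numpy as np

# last_window has shape (1, 336, 10): 336 hours of lookback, 10 features
last_window = X_test[-1:].copy()
predicted_prices = []

for _ in range(36):                                   # forecast 36 hours, one step at a time
    next_price = model.predict(last_window)           # shape (1, 1): the price only
    predicted_prices.append(next_price[0, 0])
    # Problem: sliding the window requires a full new row of 10 features,
    # but the model only gives me 1 of them (the price).
    new_row = np.zeros((1, 1, 10))                    # the other 9 features are unknown here
    new_row[0, 0, 0] = next_price[0, 0]               # assuming the price is feature 0
    last_window = np.concatenate([last_window[:, 1:, :], new_row], axis=1)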
I really hope you can point me in the right direction on this subject.
Adrien

Related

On which basis should I set input and output shapes in a Python Keras LSTM?

I have a dataset of shape (143312, 30) and I'm using the following code to set up the model:
model = Sequential()
model.add(LSTM(100, activation='sigmoid', input_shape=(30, 1)))
model.add(Dense(5, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy', f1_m, precision_m, recall_m])
It is working, but I have no idea why. Is it just about the number of features? When I have 30 features, do I simply set it like this? What does the 1 mean, and on what basis was the Dense layer set to 5?
About this one:
LSTM(100, activation='sigmoid', input_shape=(30, 1))
You have created an RNN that works on sequences of 30 items, where each item has a single feature. This matches your dataset of shape (143312, 30): it contains 143312 sequences, each 30 items long, and each item is just one feature.
The 100 specifies the number of units (recurrent neurons) in the LSTM. It is a hyperparameter: use a bigger number for a more complex model and a smaller one if your model overfits the data.
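For example, if your data currently sits in a 2-D array of shape (143312, 30), you would typically add a trailing feature axis before feeding it to this LSTM (a sketch; X is a placeholder name):

import numpy as np

X = np.random.rand(143312, 30)            # stand-in for your dataset
X_lstm = X.reshape((X.shape[0], 30, 1))   # (samples, timesteps, features), as the LSTM expects
# equivalently: X_lstm = np.expand_dims(X, axis=-1)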
Regarding this one:
model.add(Dense(5, activation='softmax'))
This is the output layer of your model. Apparently you are using your model for classification ('softmax' activation function) and your labels have 5 classes, hence 5 neurons in the Dense layer.
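With categorical_crossentropy, the labels would normally be one-hot encoded so that they match those 5 output neurons, for instance (a sketch):

from keras.utils import to_categorical
import numpy as np

y = np.array([0, 3, 1, 4, 2])                  # integer labels for 5 classes
y_onehot = to_categorical(y, num_classes=5)    # shape (5, 5), matches Dense(5) + softmax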

Tensorflow LSTM not learning despite large network, small sample size and preprocessed data

I have the following Neural Network:
model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(LSTM(100, activation='relu', return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(150, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(10))
model.add(Dropout(0.3))
model.add(Dense(2, activation='softmax')) # Activation_layer
opt = Adam(lr=1e-3, decay=1e-6)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
The network will be fed sequential data, and is trying to classify the data to either 1 or 0.
Example of one of the samples:
X:
[[0.56450562 0.69825955 0.57768099 0.69077864]
[0.58818427 0.70355375 0.61725885 0.30270281]
[0.57407927 0.72501532 0.59603936 0.29196058]
[0.56501804 0.69072662 0.59064673 0.66034622]
[0.56552001 0.70354009 0.59136487 0.1586415 ]
[0.56501496 0.68205159 0.57877241 0.62252169]
[0.54535762 0.67067675 0.58414928 0.9077868 ]
[0.56197241 0.71226839 0.5920788 0.1339519 ]
[0.57308813 0.70469134 0.59749238 0.27085101]
[0.56146488 0.69258436 0.58377929 0.7065891 ]
[0.55943607 0.69106406 0.59569036 0.69378783]
[0.5670203 0.68271571 0.58702014 0.70585781]
[0.58320254 0.71228948 0.60867704 0.19280208]
[0.56904526 0.71490986 0.59027546 0.35757948]
[0.56398908 0.67858148 0.58197139 0.75064535]
[0.57005691 0.7062191 0.60363236 0.38345417]
[0.5705625 0.70394121 0.58630169 0.19171352]
[0.56145905 0.69106039 0.58340288 0.76821359]
[0.55183665 0.68991404 0.5935228 0.53419864]
[0.56549613 0.68800419 0.58013082 0.74470123]
[0.54926442 0.67315638 0.58336904 0.77819332]
[0.56802882 0.71842805 0.60222782 0.12845991]
[0.59591035 0.70927878 0.61161172 0.68023463]
[0.56904526 0.713053 0.58773435 0.20017562]
[0.58321778 0.69939555 0.61194041 0.47063807]
[0.57814777 0.71113559 0.58991151 0.62149082]
[0.56044844 0.69257776 0.58738045 0.39285414]
[0.56853912 0.70091102 0.59713724 0.21938703]
[0.56398364 0.69939514 0.59316136 0.43031303]
[0.56701957 0.69901619 0.5935228 0.39333831]
[0.56701916 0.68082684 0.58701647 0.84346823]
[0.57765044 0.70812209 0.60147335 0.38961049]
[0.58975543 0.71340576 0.6050683 0.61008348]
[0.57207508 0.70280098 0.59821004 0.44573693]
[0.56702537 0.71035313 0.59424384 0.30333905]
[0.58417429 0.69901619 0.60288387 0.7210835 ]
[0.56400225 0.70128289 0.59028243 0.42721302]
[0.5725759 0.70241467 0.60000056 0.22784863]
[0.57055816 0.69561772 0.59136355 0.66855609]
[0.58766922 0.70995564 0.60538235 0.71163122]
[0.57206444 0.69788453 0.59567842 0.707679 ]
[0.5775922 0.70956495 0.60249313 0.32745877]
[0.57407031 0.6997696 0.57952909 0.54327415]
[0.55346759 0.69223554 0.58920848 0.27867972]
[0.58612784 0.7031614 0.617901 0.76338596]
[0.58659902 0.72005896 0.60604811 0.48696192]
[0.57004823 0.70539865 0.59173347 0.47288217]
[0.57405756 0.7023936 0.59030119 0.49981083]
[0.55801818 0.68813345 0.58564415 0.38486918]
[0.55900944 0.69300306 0.58527681 0.41875207]
[0.56351994 0.68585174 0.58239563 0.70965566]
[0.5509523 0.69524821 0.59280378 0.46280846]
[0.56753474 0.69713124 0.59172507 0.29915786]
[0.56753451 0.69939326 0.5978358 0.59996518]
[0.56954889 0.69109776 0.57734904 0.27905973]
[0.55595081 0.68429475 0.59424321 0.86881108]
[0.57005376 0.71486763 0.60215717 0.20096972]
[0.57509255 0.70467308 0.59028491 0.29196681]
[0.5584625 0.68958804 0.59028342 0.24039387]
[0.57005412 0.70203582 0.5964024 0.59344888]]
y:
1
The issue I am having is that the loss starts out at around 0.69 and never decreases significantly (it fluctuates a bit), and the loss and validation loss both stay around 0.5.
What I've tried so far:
Checked training and validation data for NaNs or values < 0 or > 1 -> none found
Reduced the sample size dramatically (down to 50 samples) with a network that should be more than large enough to overfit, but still the same result
Preprocessed the data in a completely different way
Used a sigmoid activation instead of softmax to classify the labels
Reduced the learning rate
Removed the second-to-last Dense layer
Used LeakyReLU with alpha=0.05
Although the data could be next to random, shouldn't a sufficiently large network easily overfit on 50 samples or fewer?
Two suggestions:
It appears you have a binary classification problem (either 0 or 1); perhaps you could try a binary cross-entropy loss instead (see the sketch below)?
Are you using a method such as to_categorical to one-hot encode your labels?
Other factors that can sometimes dramatically affect accuracy, which you haven't mentioned trying or changing:
Using different optimizers
Exploring different architectures: have you considered a CNN-LSTM model? Have you tested different architectures, and do some learn better than others?
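A minimal sketch of the binary setup from the first suggestion (sigmoid output plus binary cross-entropy, keeping the labels as plain 0/1 integers; this is a trimmed-down version of the question's stack, with X_train as defined there):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=X_train.shape[1:]))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))          # single unit for a 0/1 target
model.compile(optimizer=Adam(lr=1e-3),
              loss='binary_crossentropy',          # labels stay as integers, no one-hot needed
              metrics=['accuracy'])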

Comprehending which inputs have the highest weight in a neural network

I am currently working on a supervised machine learning solution to categorize some data into two classes.
So far I have written a Keras/TensorFlow Python script which seems to manage that just fine:
input_dim = len(data.columns) - 1
print(input_dim)
model = Sequential()
model.add(Dense(8, input_dim=input_dim, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_x, train_y, validation_split=0.33, epochs=1500, batch_size=1000, verbose=1)
The input data I use is a CSV file with 168 input features. When I first ran this script successfully, I was very surprised to see that I actually got an accuracy of over 99% after only a couple hundred epochs of training. I haven't even bothered to normalize the input data yet.
What I am trying to find out now is which of my 168 input features are responsible for such a high accuracy and which features don't have much of an effect during training.
Is there a way to check the weights of each input column to see which of them are used the most, i.e. which have the most impact?
Answering your last question:
model.layers[0].get_weights()
However, unless there is an obviously dominating weight, it is unlikely that a single feature gives you good accuracy. For feature selection, try replacing some features of your input with their mean and check how the prediction fluctuates. Little to no fluctuation means that the feature is not important.
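A rough sketch of that mean-replacement check, assuming a trained model and a 2-D feature matrix X_val (both names are placeholders):

import numpy as np

baseline = model.predict(X_val)
importance = []
for j in range(X_val.shape[1]):                  # one input column at a time
    X_perturbed = X_val.copy()
    X_perturbed[:, j] = X_val[:, j].mean()       # replace the feature by its mean
    delta = np.abs(model.predict(X_perturbed) - baseline).mean()
    importance.append(delta)                     # a small delta means the feature barely matters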
Also, please consider posting ML questions on https://datascience.stackexchange.com/
There is going to be a connection from each 'column' to each neuron in the first layer. Apart from randomizing or dropping column values (equivalent to replacing them with the mean, as suggested in the answer above), there are two ways you could find the relative importance of columns using the weights. Keep in mind that these methods make sense only if you feed in a standardized dataset:
You could use the L1 or L2 norm of each column's weights in the first layer (see the sketch after this list).
Say your input has 100 columns. You create a layer that takes the dot product of the input with a trainable tensor of size (100,). You then feed the output of this layer into your sequential model. The trained (100,) tensor is the relative importance of your columns.
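A short sketch of the first option (L1/L2 norms of the first layer's weights; as noted above, this is only meaningful on standardized inputs):

import numpy as np

W = model.layers[0].get_weights()[0]          # first Dense layer kernel, shape (n_features, n_units)
l1_importance = np.abs(W).sum(axis=1)         # one score per input column
l2_importance = np.sqrt((W ** 2).sum(axis=1))
ranking = np.argsort(l1_importance)[::-1]     # columns ordered from most to least weighted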

CNN on small dataset is overfitting

I want to classify patterns in images. My original images are 200,000 x 200,000 pixels; I resize them to 96 x 96, and the patterns are still recognizable to the human eye. Pixel values are 0 or 1.
I'm using the following neural network:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

train_X, test_X, train_Y, test_Y = train_test_split(cnn_mat, img_bin["Classification"], test_size=0.2, random_state=0)
class_weights = class_weight.compute_class_weight('balanced',
                                                  np.unique(train_Y),
                                                  train_Y)
train_Y_one_hot = to_categorical(train_Y)
test_Y_one_hot = to_categorical(test_Y)
train_X, valid_X, train_label, valid_label = train_test_split(train_X, train_Y_one_hot, test_size=0.2, random_state=13)

model = Sequential()
model.add(Conv2D(24, kernel_size=3, padding='same', activation='relu',
                 input_shape=(96, 96, 1)))
model.add(MaxPool2D())
model.add(Conv2D(48, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool2D())
model.add(Conv2D(64, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(16, activation='softmax'))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

train = model.fit(train_X, train_label, batch_size=80, epochs=20, verbose=1,
                  validation_data=(valid_X, valid_label), class_weight=class_weights)
I have already run some experiments to find a "good" number of hidden layers and fully connected layers. It's probably not the most optimal architecture, since my computer is slow; I just ran each model once and selected the best one using the confusion matrix. I didn't use cross-validation, and I didn't try more complex architectures since my dataset is small; I have read that small architectures are best. Is it worth trying a more complex architecture?
Here are the results with 5 and 12 epochs, batch size 80. This is the confusion matrix for my test set.
As you can see, it looks like I'm overfitting. When I only run 5 epochs, most of the classes are assigned to class 0; with more epochs, class 0 is less dominant but the classification is still bad.
I added 0.8 dropout after each convolutional layer, e.g.:
model.add(Conv2D(48,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
model.add(Conv2D(64,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPool2D())
model.add(Dropout(0.8))
With dropout, 95% of my images are classified into class 0.
I tried image augmentation; I rotated all my training images and still used class weighting, but the results didn't improve. Should I try to augment only the classes with a small number of images? Most of what I read says to augment the whole dataset...
To sum up, my questions are:
Should I try a more complex model?
Is it useful to do image augmentation only on under-represented classes? And should I then still use class weights (I guess not)?
Can I hope to find a "good" model with a CNN given the size of my dataset?
Given the imbalanced data, I think it is better to create a custom data generator for your model, so that each generated batch contains at least one sample from each class. It is also better to use a Dropout layer after each dense layer rather than after the conv layers. For data augmentation, it is better to use at least a combination of rotation, horizontal flip and vertical flip (see the sketch after these links). There are other approaches to data augmentation, such as using a GAN or random pixel replacement.
For GANs you can check this SO post.
For using a GAN as a data augmenter you can read this article.
For a combination of pixel-level augmentation and GANs: pixel level data augmentation.
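A minimal augmentation sketch with Keras' ImageDataGenerator, combining rotation and both flips as suggested above (the parameter values are illustrative; train_X, train_label, valid_X, valid_label and class_weights are taken from the question):

from keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=90,
                               horizontal_flip=True,
                               vertical_flip=True)
train_flow = augmenter.flow(train_X, train_label, batch_size=80)
model.fit_generator(train_flow,
                    steps_per_epoch=len(train_X) // 80,
                    epochs=20,
                    validation_data=(valid_X, valid_label),
                    class_weight=class_weights)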
What I used, in a different setting, was to upsample my data with ADASYN. This algorithm calculates the amount of new data required to balance your classes, and then uses the available data to sample novel examples.
There is a Python implementation. Otherwise, you also have very little data. SVMs perform well even with little data. You might want to try them or other image classification algorithms, depending on whether the expected pattern is always at the same position or varies. You could also try the Viola-Jones object detection framework.
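A minimal ADASYN sketch using the imbalanced-learn implementation (assuming a recent version with fit_resample; the images are flattened into feature vectors for resampling and reshaped afterwards):

from imblearn.over_sampling import ADASYN

X_flat = train_X.reshape((train_X.shape[0], -1))        # ADASYN works on 2-D feature matrices
X_res, y_res = ADASYN().fit_resample(X_flat, train_Y)   # oversample the minority classes
X_res = X_res.reshape((-1, 96, 96, 1))                  # back to image shape for the CNN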

InvalidArgumentError with RNN/LSTM in Keras

I'm throwing myself into machine learning, and wish to use Keras for a university project that's time-critical. I realise it would be best to learn individual concepts and building blocks, but it's important that this is done soon.
I'm working with someone who has some experience and interest in machine learning, but we cannot seem to get further than this. The below code was adapted from GitHub code mentioned in a guide in Machine Learning Mastery.
For context, I've got data from multiple physical sensors (where each sensor is a column), with each sample from those sensors represented by one row. I wish to use machine learning to determine who the sensors were tracking at any given time. I'm trying to allocate approximately 80% of the rows to training and 20% to testing, and am creating my own "y" set of data (with the first 521,549 rows being from one participant, and the remainder from another). My data (training and test) has a total of 1,019,802 rows, and 16 columns (all populated), but the number of columns can be reduced if need be.
I would love to know the following:
What does this error mean in the context of what I'm trying to achieve, and how can I change my code to avoid it?
Is the below code suitable for what I'm trying to achieve?
Does this code represent any specific fundamental flaw in my understanding of what machine learning (generally or specifically) is designed to achieve?
Below is the Python code I'm trying to run to make use of machine learning:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

x_all = pd.read_csv("(redacted)...csv",
                    delim_whitespace=True, header=None, low_memory=False).values
y_all = np.append(np.full((521549, 1), 0), np.full((498253, 1), 1))

limit = 815842
x_train = x_all[:limit]
y_train = y_all[:limit]
x_test = x_all[limit:]
y_test = y_all[limit:]

max_features = 16
maxlen = 80
batch_size = 32

model = Sequential()
model.add(Embedding(500, 32, input_length=max_features))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=15,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
This is an excerpt from the CSV referenced in the code:
6698.486328125 4.28260869565217 4.6304347826087 10.6195652173913 2.4392579293836 2.56134051466188 9.05326152004788 0.0 1.0812 924.898261191267 -1.55725190839695 -0.244274809160305 0.320610687022901 -0.122938530734633 0.490254872563718 0.382308845577211
6706.298828125 4.28260869565217 4.58695652173913 10.5978260869565 2.4655894673848 2.50867743865949 9.04368641532017 0.0 1.0812 924.898261191267 -1.64885496183206 -0.366412213740458 0.381679389312977 -0.122938530734633 0.490254872563718 0.382308845577211
6714.111328125 4.26086956521739 4.64130434782609 10.5978260869565 2.45601436265709 2.57809694793537 9.03411131059246 0.0 1.0812 924.898261191267 -0.931297709923664 -0.320610687022901 0.320610687022901 -0.125937031484258 0.493253373313343 0.371814092953523
The following error occurs when running this:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 972190 is not in [0, 500)
[[Node: embedding_1/embedding_lookup = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:#training/Adam/Assign_2"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embeddings/read, embedding_1/Cast, training/Adam/gradients/embedding_1/embedding_lookup_grad/concat/axis)]]
For reference, I'm on a 2017 27-inch iMac Retina 5K with 4.2 GHz i7, 32 GB RAM, with a Radeon Pro 580 8 GB.
There are some more tutorials on Machine Learning Mastery for what you want to accomplish
https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
And I'll give my own quick explanation of what you probably want to do.
Right now it looks like you are using the exact same data for the X and y inputs to your model. The y inputs are the labels, which in your case are "who the sensors were tracking". So in the binary case of having two possible people, it is set to 0 for the first person and 1 for the second.
The sigmoid activation on the final layer will output a number between 0 and 1. If the number is below 0.5 it is predicting that the sensor is tracking person 0, and if it is above 0.5 it is predicting person 1. This will be reflected in the accuracy score.
You will probably not want to use an embedding layer; it's possible that you might, but I would drop it to start with. An Embedding(500, ...) layer expects integer token indices in [0, 500), whereas your sensor readings are continuous floats like 972190, which is exactly what the InvalidArgumentError about indices not being in [0, 500) is complaining about. Instead, normalize your data before feeding it into the net to improve training. Scikit-Learn has good tools for this if you want a quick solution.
http://scikit-learn.org/stable/modules/preprocessing.html
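For example, something along these lines with MinMaxScaler (a sketch; fit the scaler on the training rows only):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)   # learn min/max from the training data
x_test = scaler.transform(x_test)         # apply the same scaling to the test data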
When working with time series data you often want to feed in a window of time points rather than a single point. If you send your time series to Keras model.fit() then it will use a single point as input.
In order to have a time window as input, you need to reorganize each example in the dataset to be a whole window, or you can use a generator if that would take up too much memory. This is described in the Machine Learning Mastery pages that I linked.
Keras has a generator that you can use called TimeseriesGenerator
from keras.preprocessing.sequence import TimeseriesGenerator
timeseries_generator = TimeseriesGenerator(data, targets, length, sampling_rate)
where data is your time series of features and targets is your time series of labels.
If you use the timeseries generator, then when fitting you will have to use fit_generator:
model.fit_generator(timeseries_generator)
The same goes for evaluating, using evaluate_generator().
If you have your data set up correctly then your model should work
model = Sequential()
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
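Roughly wired together with the generators described above (a sketch; the window length of 80 reuses maxlen from the question, the batch size is illustrative, and the compile call is copied from the question):

from keras.preprocessing.sequence import TimeseriesGenerator

window = 80
train_gen = TimeseriesGenerator(x_train, y_train, length=window, batch_size=32)
test_gen = TimeseriesGenerator(x_test, y_test, length=window, batch_size=32)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(train_gen, epochs=15, validation_data=test_gen)
score, acc = model.evaluate_generator(test_gen)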
You could also try a simpler dense model:
model = Sequential()
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
One more issue I see is that you would be splitting off a test set that contains only one type of label, which is not only bad practice but will also weight your training set towards the other label, and that might hurt your results.
Hopefully that gets you started. Make sure you get your data set up correctly!
