Masking in LSTM with variable length input does not work - python

I'm building an LSTM model with variable-length arrays as input. Many resources recommend padding (inserting 0s until all input arrays have the same length) and then applying Masking so that the model ignores the 0s.
However, after many training runs, I feel that Masking does not work as expected; the padded 0s in the input still hurt the model's predictions.
After concatenating all sequences into one array, my training data looks like this (shown without padding):
X          y
[1 2 3]    4
[2 3 4]    5
[3 4 5]    6
...
My python implementation:
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Masking
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
""" Raw Training Input """
arr = np.array([
[1, 2, 3, 4, 5, 6],
[5, 6, 7],
[11, 12, 13, 14]
], dtype=object)
timesteps = 3
n_features = 1
maxlen = 6
""" Padding """
padded_arr = pad_sequences(arr, maxlen=maxlen, padding='pre', truncating='pre')
""" Concatenate all sequences into one array """
sequence = np.concatenate(padded_arr)
sequence = sequence.reshape((len(sequence), n_features))
# print(sequence)
""" Training Data Generator """
generator = TimeseriesGenerator(sequence, sequence, length=timesteps, batch_size=1)
""" Check Generator """
for i in range(len(generator)):
    x, y = generator[i]
    print('%s => %s' % (x, y))
""" Build Model """
model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(timesteps, n_features))) # masking to ignore padded 0
model.add(LSTM(1024, activation='relu', input_shape=(timesteps, n_features), return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(generator, steps_per_epoch=1, epochs=1000, verbose=10)
""" Prediction """
x_input = np.array([2,3,4]).reshape((1, timesteps, n_features))
yhat = model.predict(x_input, verbose=0)
print(yhat) # here I'm expecting 5 because the input is [2, 3, 4]
For the prediction, I input [2, 3, 4] and most of the time I keep getting values very far from the expected value (= 5).
I'm wondering if I missed something, or if the model architecture simply isn't tuned correctly.
I want to understand why the model is not predicting correctly. Is the Masking the issue, or is it something else?

The problem is that the batch size is 1 and there is also just one step per epoch, so each epoch trains on a single window and no meaningful gradient can be calculated. Put all the training data into one batch and you should get good results:
""" Training Data Generator """
generator = TimeseriesGenerator(sequence, sequence, length=timesteps,
batch_size=15)
[Alternatively, you could leave the batch size at 1 but change steps_per_epoch to len(generator). That seems to work with the Adam optimizer, but not with SGD, and it is much slower.]
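Putting it together, the training part would look roughly like the sketch below. It keeps the question's preprocessing (sequence, timesteps, n_features) and its 1024-unit model; with three padded sequences of length 6 there are 15 windows, hence one batch of 15:
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Masking

# One batch containing all 15 windows (len(sequence) - timesteps), so every epoch
# sees the whole training set and a meaningful gradient is computed.
generator = TimeseriesGenerator(sequence, sequence, length=timesteps,
                                batch_size=len(sequence) - timesteps)

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(timesteps, n_features)))
model.add(LSTM(1024, activation='relu', return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(generator, epochs=1000, verbose=0)

print(model.predict(np.array([2, 3, 4]).reshape((1, timesteps, n_features))))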

Related

How do I properly make a tf Dataset for a recurrent neural network?

I have just preprocessed my categorical data into one-hot encoding and used tf.argmax(). This way, I got it into a range of numbers from 1 to 34; thus, there are sequences of, say: [7, 4, 28, 14, 5, 15, 22, 22].
My question is about how to do the further dataset preparation. I intend to use 3 numbers in the sequence to predict the next 2. So, do I need to map the dataset into features being the first 3 and labels being the last 2? Should I batch it with buffer_size 5 to specify the sequence length? Lastly, do I keep the one-hot representation or transform it back into a simpler number?
No one answered in time, so I found the solution myself a few days ago.
Just a note that I'm now using 8 numbers to predict the next 2.
First, to predict the 2 next steps, I decided I can predict the label for the 8 initial steps, then make another prediction using the last 7 time steps of those initial steps plus the first predicted value as the 8th input. Second, my teacher told me it was better for the RNN to use one-hot encoding, so I mapped the dataset as features of 8 one-hots and a label of 1 one-hot. Third, what impressed me is that I was in fact able to use batching as a way of grouping the split data sequence, thereby indicating that the last one-hot is my label.
So here is the code:
INPUT_WIDTH = 8
LABEL_WIDTH = 1
shift = 1
INPUT_SLICE = slice(0, INPUT_WIDTH)
total_window_size = INPUT_WIDTH + shift
label_start = total_window_size - LABEL_WIDTH
LABELS_SLICE = slice(label_start, None)
BATCH_SIZE = INPUT_WIDTH + LABEL_WIDTH
Above are some constants that I took from [1]. The only one I couldn't fully understand is the shift variable (it is the offset between the end of the input window and the end of the label window); set it to 1 and you're fine.
Below, the split function:
def split_window(features):
    inputs = features[INPUT_SLICE]
    labels = features[LABELS_SLICE]
    return inputs, labels
Simple and cute, isn't it? But to be clear, it was a real pain to adapt this function from [1] and make it compatible with my input shape.
Now the dataset:
def create_seq_dataset(data):
    ds = tf.data.Dataset.from_tensor_slices(data)
    # At this point, my input was a single array of 3644 string categories to be turned
    # into one-hot. The function below just one-hots the data and casts it to float32,
    # as required by the rnn, so I won't cover it in detail.
    ds = ds.map(get_seq_label)
    # As I'm using the from_tensor_slices() function, the 3644 number disappears from the shape.
    # Now, from shape (1), I've got shape (53) after one-hotting it, 53 being the number of
    # possible categories that I'm working with.
    # Here I batch the one-hot data.
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    # Shape (53) => (9, 53)
    # Without drop_remainder, the rnn will complain that it can't predict "empty" data.
    ds = ds.map(split_window)
    # Got features shape (8, 53) and labels shape (53,)
    ds = ds.batch(16, drop_remainder=True)
    # Got features shape (16, 8, 53) and labels shape (16, 53)
    return ds
train_ds = create_seq_dataset(train_df)
for features_batch, labels_batch in train_ds:
    print(features_batch.shape)
    print(labels_batch.shape)
    break
What I got from the print was:
(16, 8, 53) and
(16, 53)
Lastly, the LSTM:
NUM_CLASSES = 53  # number of possible categories (the one-hot width)

inputs = layers.Input(name="sequence", shape=(INPUT_WIDTH, 53), dtype='float32')
# Reminder that the batch size is implicit when the input is a dataset, unless the LSTM
# is stateful.

def create_rnn_model():
    # From [1]: shape [batch, time, features] => [batch, time, lstm_units]
    # In my case [16, 8, 53] => [16, 8, 100], but since 16 is the implicit batch size, it
    # appears as None if you call rnn_model.summary() after declaring the model.
    x = layers.LSTM(100, activation='tanh', stateful=False, return_sequences=False, name='LSTM_1')(inputs)
    # Shape after this layer is [16, 100], or [None, 100].
    x = layers.Dense(32, activation='relu')(x)
    output = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=output)
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
rnn_model = create_rnn_model()
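For completeness, training and predicting with the prepared dataset would then be something like the following sketch (it assumes train_ds, rnn_model and the imports from above; the epoch count is arbitrary):
# Hypothetical usage sketch, assuming train_ds, rnn_model and the imports above.
rnn_model.fit(train_ds, epochs=20)

# Predict one batch and map the softmax outputs back to integer category ids.
for features_batch, _ in train_ds.take(1):
    probs = rnn_model.predict(features_batch)           # shape (16, 53)
    predicted_ids = tf.argmax(probs, axis=-1).numpy()   # shape (16,)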
[1]. https://www.tensorflow.org/tutorials/structured_data/time_series#4_create_tfdatadatasets

Deep learning: how to split a 5-dimensional time series and pass some dimensions through an embedding layer

I have an input that is a time series of 5 dimensions:
a = [[8,3],[2], [4,5],[1], [9,1],[2], ...]  # 100 timestamps in total
For each element, dims 0 and 1 are numerical data and dim 2 is a numerical encoding of a category. This is per sample; there are 3200 samples.
The category has 3 possible values (0,1,2)
I want to build a NN such that the last dimension (the category) will go through an embedding layer with output size 8, and then will be concatenated back to the first two dims (the numerical data).
So, this will be something like:
input1 = keras.layers.Input(shape=(2,)) #the numerical features
input2 = keras.layers.Input(shape=(1,)) #the encoding of the categories. this part will be embedded to 5 dims
x2 = Embedding(input_dim=1, output_dim = 8)(input2) #apply it to every timestamp and take only dim 3, so [2],[1], [2]
x = concatenate([input1,x2]) #will get 10 dims at each timepoint, still 100 timepoints
x = LSTM(units=24)(x) #the input has 10 dims/features at each timepoint, total 100 timepoints per sample
x = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[input1, input2] , outputs=[x]) #input1 is 1D vec of the width 2 , input2 is 1D vec with the width 1 and it is going through the embedding
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc']
)
How can I do this (preferably in Keras)?
My problem is how to apply the embedding to every time point.
Meaning, if I have 1000 timepoints with 3 dims each, I need to convert them to 1000 timepoints with 8 dims each (the embedding layer should transform input2 from (1000, 1) to (1000, 8)).
There are a couple of issues you are having here.
First let me give you a working example and explain along the way how to solve them.
Imports and Data Generation
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.models import Model
num_timesteps = 100
max_features_values = [100, 100, 3]
num_observations = 2
input_list = [[[np.random.randint(0, v) for _ in range(num_timesteps)]
               for v in max_features_values]
              for _ in range(num_observations)]
input_arr = np.array(input_list) # shape (2, 3, 100)
In order to use an embedding, we need the vocabulary size voc_size as the input_dim, as stated in the Embedding documentation.
Embedding and Concatenation
voc_size = len(np.unique(input_arr[:, 2, :])) + 1 # 4
Now we need to create the inputs. The inputs should be of size [None, 2, num_timesteps] and [None, 1, num_timesteps], where the first dimension is flexible and will be filled with the number of observations we pass in. Let's apply the embedding right after that, using the previously calculated voc_size.
inp1 = layers.Input(shape=(2, num_timesteps)) # TensorShape([None, 2, 100])
inp2 = layers.Input(shape=(1, num_timesteps)) # TensorShape([None, 1, 100])
x2 = layers.Embedding(input_dim=voc_size, output_dim=8)(inp2) # TensorShape([None, 1, 100, 8])
x2_reshaped = tf.transpose(tf.squeeze(x2, axis=1), [0, 2, 1]) # TensorShape([None, 8, 100])
The embedding output cannot simply be concatenated with inp1, since all dimensions must match except the one along the concatenation axis, and here they don't. Therefore we reshape x2: we squeeze out the singleton dimension (axis 1) and then transpose so that the embedding channels line up with inp1's feature axis.
Now we can concatenate without any issue and everything works in a straightforward fashion:
x = layers.concatenate([inp1, x2_reshaped], axis=1)
x = layers.LSTM(32)(x)
x = layers.Dense(1, activation='sigmoid')(x)
model = Model(inputs=[inp1, inp2], outputs=[x])
Check on Dummy Example
inp1_np = input_arr[:, :2, :]
inp2_np = input_arr[:, 2:, :]
model.predict([inp1_np, inp2_np])
# Output:
# array([[0.544262 ],
#        [0.6157502]], dtype=float32)
# This outputs values between 0 and 1, just as expected.
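If you also want to train this model end to end, a minimal sketch could look like the following (the binary labels are made up here, since the question doesn't specify targets):
# Hypothetical training sketch; the labels are random and only illustrate the shapes.
labels = np.random.randint(0, 2, size=(num_observations, 1))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit([inp1_np, inp2_np], labels, epochs=2, batch_size=2)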
In case you are not looking for Embedding the way it's usually used in Keras (mapping positive integers to dense vectors), you might be looking for some sort of unprojection or basis expansion, in which 3 dimensions get mapped (embedded) to 8 and the result is concatenated back. This can be done using the kernel trick or other methods, but it also happens implicitly in neural networks with non-linear activations.
As such, you can do something like the following, using a similar format to pythonic833's answer because it was good (but with timesteps in the middle, per the Keras LSTM documentation, which asks for [batch, timesteps, feature]):
Input generation
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.models import Model
num_timesteps = 100
num_features = 5
num_observations = 2
input_list = [[[np.random.randint(1, 100) for _ in range(num_features)]
               for _ in range(num_timesteps)]
              for _ in range(num_observations)]
input_arr = np.array(input_list) # shape (2, 100, 5)
Model construction
Then you can process the inputs:
input1 = layers.Input(shape=(num_timesteps, 2,))
input2 = layers.Input(shape=(num_timesteps, 3))
x2 = layers.Dense(8, activation='relu')(input2)
x = layers.concatenate([input1,x2], axis=2) # This produces tensors of shape (None, 100, 10)
x = layers.LSTM(units=24)(x)
x = layers.Dense(1, activation='sigmoid')(x)
model = Model(inputs=[input1, input2] , outputs=[x])
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc']
)
Results
inp1_np = input_arr[:, :, :2]
inp2_np = input_arr[:, :, 2:]
model.predict([inp1_np, inp2_np])
which produces
array([[0.44117224],
       [0.23611131]], dtype=float32)
Other explanations about basis expansion to check out:
https://stats.stackexchange.com/questions/527258/embedding-data-into-a-larger-dimension-space
https://www.reddit.com/r/MachineLearning/comments/2ffejw/why_dont_researchers_use_the_kernel_method_in/

If I want to predict the next element in a sequence of numbers, what do I need to pass as second argument to Keras' fit method?

I'm trying to program a simple example to understand how LSTMs work. I want to take a simple integer series 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and predict the next number. I have some code, but I don't know what the second argument of the fit method needs to be.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM
df = pd.DataFrame(columns = ['Serie'])
for i in range(0, 10):
    df.loc[i, 'Serie'] = i
sc = MinMaxScaler(feature_range = (0, 1))
train_set = sc.fit_transform(df.iloc[:, [True]])
xTrain = []
for i in range(0, len(train_set) - 3):
    xTrain.append(train_set[i:i + 3, 0])
xTrain = np.array(xTrain)
xTrain = np.reshape(xTrain, (xTrain.shape[0], xTrain.shape[1], 1))
regresor = Sequential()
regresor.add(LSTM(units = 1, input_shape = (3, 1)))
regresor.compile(optimizer = 'rmsprop', loss = 'mse')
regresor.fit(xTrain, ???, batch_size = 1)
Can someone give me a very simple example of this?
You need to set the problem up as a supervised one: every sample contains the independent variable x and the dependent variable y. Based on your question, x contains samples of 3 timesteps and 1 feature. Start off by doing the necessary imports:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import tensorflow as tf
Let's define some constants:
points = 30 # number of data points to generate
timesteps = 3 # number of time steps per sample as LSTM layers need input shape (samples, time steps, features)
features = 1 # number of features per time step as LSTM layers need input shape (samples, time steps, features)
A sequence generation from 0 ... 30:
x = np.arange(points + 1) # array([ 0, 1, ..., 29, 30])
Here is where we start setting the problem up as a supervised one, with x as a sequence of numbers and y as the sequence of next numbers:
y = x[1:] # [ 1, 2, ..., 29, 30 ]
x = x[:30] # [ 0, 1, ..., 28, 29 ]
Put both x and y together for scaling:
dataset = np.hstack((x.reshape((points, 1)),y.reshape((points, 1))))
scaler = MinMaxScaler((0, 1))
scaled = scaler.fit_transform(dataset)
Let's define the inputs and outputs of our model:
x_train = scaled[:,0] # first column
x_train = x_train.reshape((points // timesteps, timesteps, features)) # as i stated before LSTM layers need input shape (samples, time steps, features)
y_train = scaled[:,1] # second column
y_train = y_train[2::3] # start at the third element in steps of 3, for a total of 10
Model definition and compilation. I decided to make the model architecture a little more robust for "better" performance (see the results below):
regresor = tf.keras.models.Sequential()
regresor.add(tf.keras.layers.LSTM(units = 4, return_sequences = True))
regresor.add(tf.keras.layers.LSTM(units = 2))
regresor.add(tf.keras.layers.Dense(units = 1))
regresor.compile(optimizer = 'rmsprop', loss = 'mse')
Train the model:
regresor.fit(x_train, y_train, batch_size = 2, epochs = 500, verbose = 1)
Some predictions:
y_hats = regresor.predict(x_train)
The results:
real y      predicted y
0.068966    0.086510
0.172414    0.162209
0.275862    0.252749
0.379310    0.356117
0.482759    0.467885
0.586207    0.582081
0.689655    0.692756
0.793103    0.795362
0.896552    0.887317
1.000000    0.967796
As you can see, the predictions are close enough to the real values.
A plot of the results (figure omitted here) shows the same close agreement.
Note that for simplicity I performed the predictions on the training data set; testing should be done on held-out test data. For that, you would generate more points and split them accordingly (e.g. 70% training, 30% testing). Also, you can obtain the values in the original range by calling the scaler's inverse_transform method.
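For example, a minimal sketch of mapping the scaled predictions back to the original 0..30 range (this pairs each prediction with its corresponding scaled x value, since the scaler was fitted on two columns):
# Sketch: inverse_transform expects the same two columns the scaler was fitted on,
# so pair each prediction with the corresponding scaled x value before inverting.
scaled_x_for_labels = scaled[2::3, 0].reshape(-1, 1)  # x values aligned with y_train
original_scale = scaler.inverse_transform(np.hstack((scaled_x_for_labels, y_hats)))
print(original_scale[:, 1])  # predictions back on the original 0..30 scale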

How to prepare the inputs in a Keras implementation of WaveNet for time-series prediction

In the Keras implementation of WaveNet, the input shape is (None, 1). I have a time series (val(t)) in which the target is to predict the next data point given a window of past values (the window size depends on the maximum dilation). The input shape in WaveNet is confusing to me. I have a few questions about it:
How does Keras figure out the input dimension (None) when a full sequence is given? According to the dilations, we want the input to have a length of 2^8.
If an input series of shape (1M, 1) is given as training X, do we need to generate vectors of 2^8 time steps as input? It seems we can just use the input series as the input of WaveNet (I'm not sure why the raw time-series input does not give an error).
In general, how can we debug such Keras networks? I tried to apply the function on numerical data, e.g. Conv1D(16, 1, padding='same', activation='relu')(inputs); however, it gives an error.
n_filters = 32
filter_width = 2
dilation_rates = [2**i for i in range(7)] * 2
from keras.models import Model
from keras.layers import Input, Conv1D, Dense, Activation, Dropout, Lambda, Multiply, Add, Concatenate
from keras.optimizers import Adam
history_seq = Input(shape=(None, 1))
x = history_seq
skips = []
for dilation_rate in dilation_rates:
    # preprocessing - equivalent to time-distributed dense
    x = Conv1D(16, 1, padding='same', activation='relu')(x)
    # filter
    x_f = Conv1D(filters=n_filters,
                 kernel_size=filter_width,
                 padding='causal',
                 dilation_rate=dilation_rate)(x)
    # gate
    x_g = Conv1D(filters=n_filters,
                 kernel_size=filter_width,
                 padding='causal',
                 dilation_rate=dilation_rate)(x)
    # combine filter and gating branches
    z = Multiply()([Activation('tanh')(x_f),
                    Activation('sigmoid')(x_g)])
    # postprocessing - equivalent to time-distributed dense
    z = Conv1D(16, 1, padding='same', activation='relu')(z)
    # residual connection
    x = Add()([x, z])
    # collect skip connections
    skips.append(z)
# add all skip connection outputs
out = Activation('relu')(Add()(skips))
# final time-distributed dense layers
out = Conv1D(128, 1, padding='same')(out)
out = Activation('relu')(out)
out = Dropout(.2)(out)
out = Conv1D(1, 1, padding='same')(out)
# extract training target at end
def slice(x, seq_length):
    return x[:, -seq_length:, :]

pred_seq_train = Lambda(slice, arguments={'seq_length': 1})(out)
model = Model(history_seq, pred_seq_train)
model.compile(Adam(), loss='mean_absolute_error')
You are using extreme values for the dilation rates; they don't make sense. Try reducing them, for example to a sequence like [1, 2, 4, 8, 16, 32]. The dilation rates aren't a constraint on the dimension of the input you pass.
Your network works by simply passing it this input:
import numpy as np

n_filters = 32
filter_width = 2
dilation_rates = [1, 2, 4, 8, 16, 32]
....
model = Model(history_seq, pred_seq_train)
model.compile(Adam(), loss='mean_absolute_error')
n_sample = 5
time_step = 100
X = np.random.uniform(0,1, (n_sample,time_step,1))
model.predict(X)
Specifying a None dimension in Keras means leaving the model free to receive any size along that dimension. This does not mean you can pass samples of varying dimension within the same batch; they must always have the same format... but you can build the model each time with a different temporal dimension:
for time_step in np.random.randint(100, 200, 4):
    print('temporal dim:', time_step)
    n_sample = 5
    model = Model(history_seq, pred_seq_train)
    model.compile(Adam(), loss='mean_absolute_error')
    X = np.random.uniform(0, 1, (n_sample, time_step, 1))
    print(model.predict(X).shape)
I also suggest a premade Keras library that provides a WaveNet-style (TCN) implementation: https://github.com/philipperemy/keras-tcn. You can use it as a baseline and also study its code to build your own WaveNet.
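For instance, a rough usage sketch with that library (pip install keras-tcn); the constructor arguments follow its README and may differ between versions:
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tcn import TCN  # pip install keras-tcn

# Rough sketch; treat the exact TCN arguments as assumptions to check against
# the installed version's documentation.
inp = Input(shape=(None, 1))
x = TCN(nb_filters=32, kernel_size=2, dilations=[1, 2, 4, 8, 16, 32])(inp)
out = Dense(1)(x)
model = Model(inp, out)
model.compile('adam', loss='mean_absolute_error')
print(model.predict(np.random.uniform(0, 1, (5, 100, 1))).shape)  # (5, 1)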

How can I improve the speed of my simple neural network?

I've just started exploring TensorFlow and I'm facing an issue regarding performance. As a starting example, I tried implementing a model to simulate a logic gate. Let's say there are two inputs A and B and one output Y. Suppose Y depended only on B and not on A. That means that the following are valid examples:
[0, 0] -> 0
[1, 0] -> 0
[0, 1] -> 1
[1, 1] -> 1
I created training sets for this data and created a model that uses a DenseFeatures layer using two features A and B. This layer feeds into a Dense(128, 'relu') layer, which feeds into a Dense(16, 'relu') layer, which finally feeds into a Dense(1, 'sigmoid') layer.
Training this NN works fine and the predictions are perfect. However, I noticed that on my MacBook, each prediction takes about 250ms. This is too much, since my final goal is to use such a NN to test hundreds of predictions each second.
So I stripped the network down to DenseFeatures([A, B]) -> Dense(8, 'relu') -> Dense(1, 'sigmoid'); however, predictions for this NN still take about the same amount of time. I was expecting the execution speed to depend on the complexity of the model, but that doesn't seem to be the case here. What am I doing wrong?
Also, I had read that TensorFlow uses floating-point math for accuracy, that this comes with a performance penalty, and that converting the data to integer math would speed things up. However, I have no idea how to achieve that.
I would really appreciate it if someone could help me understand why predictions for such a simple logic gate and such a simple NN take this long, and how I can speed them up.
For reference, here is my code in python:
import random
from typing import Any, List
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow import feature_column
from tensorflow.keras import layers
class Input:
    def __init__(self, data: List[int]):
        self.data = data

class Output:
    def __init__(self, value: float):
        self.value = value

class Item:
    def __init__(self, input: Input, output: Output):
        self.input = input
        self.output = output
DATA: List[Item] = []
for i in range(10000):
    x = Input([random.randint(0, 1), random.randint(0, 1)])
    y = Output(x.data[1])
    DATA.append(Item(x, y))
BATCH_SIZE = 5
DATA_TRAIN, DATA_TEST = train_test_split(DATA, shuffle=True, test_size=0.2)
DATA_TRAIN, DATA_VAL = train_test_split(DATA_TRAIN, shuffle=True, test_size=0.2/0.8)
def toDataSet(data: List[Item], shuffle: bool, batch_size: int):
    a = {
        'a': [x.input.data[0] for x in data],
        'b': [x.input.data[1] for x in data],
    }
    b = [x.output.value for x in data]
    return tf.data.Dataset.from_tensor_slices((a, b)).shuffle(buffer_size=len(data)).batch(BATCH_SIZE)
DS_TRAIN = toDataSet(DATA_TRAIN, True, 5)
DS_VAL = toDataSet(DATA_VAL, True, 5)
DS_TEST = toDataSet(DATA_TEST, True, 5)
FEATURES = []
FEATURES.append(feature_column.numeric_column('a'))
FEATURES.append(feature_column.numeric_column('b'))
feature_layer = tf.keras.layers.DenseFeatures(FEATURES)
model = tf.keras.models.load_model('MODEL.H5')
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(DS_TRAIN, validation_data=DS_VAL, epochs=10)
loss, accuracy = model.evaluate(DS_TEST)
for i in range(1000):
    val = model.predict([np.array([random.randint(0, 1)]), np.array([random.randint(0, 1)])])
Since you are only using integers, change the model input to 8-bit signed integers. You can do this by adding the dtype parameter to your input layer. This will improve processing speed, since you won't be wasting calculations on unneeded precision.
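As a minimal sketch of what that could look like (assuming a plain Input layer in place of the question's DenseFeatures layer; note that the Dense kernels still compute in float32, so an explicit cast is kept in the graph):
import tensorflow as tf

# Hypothetical sketch (not the original poster's code): the input is declared as
# int8 via the dtype parameter; the cast keeps the float32 Dense layers happy.
inputs = tf.keras.Input(shape=(2,), dtype='int8', name='ab')
x = tf.keras.layers.Lambda(lambda t: tf.cast(t, tf.float32))(inputs)
x = tf.keras.layers.Dense(8, activation='relu')(x)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])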
