Transformer model is very slow and doesn't predict well - python

I created my first transformer model after having worked only with LSTMs so far. It is meant for multivariate time series prediction: I have 10 meteorological features (temperature, humidity, wind speed, pollution concentration, among others) and use them to predict sequences of 24 consecutive hourly values of air pollution. My input has the shape X.shape = (75575, 168, 10): 75575 time sequences, each containing 168 hourly vectors of 10 meteorological features. My output has the shape y.shape = (75575, 24): 75575 sequences, each containing 24 consecutive hourly values of the air pollution concentration.
As a model I took an example from the official Keras site. It was written for classification problems; I only removed the softmax activation and set the number of neurons in the last dense layer to 24, hoping it would work. It runs and trains, but it does no better than the LSTMs I have used on the same problem and, more importantly, it is very slow: 4 min/epoch. The model is attached below, and I would like to know:
I) Have I done something wrong in the model? Can the accuracy or speed be improved? Are there other parts of the code I need to change for it to work on regression rather than classification problems?
II) Can a transformer work at all on multivariate problems of my kind (10 input features, 1 output feature), or do transformers only work on univariate problems? Thanks.
def build_transformer_model(input_shape, head_size, num_heads, ff_dim, num_transformer_blocks, mlp_units, dropout=0, mlp_dropout=0):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        # Normalization and Attention
        x = layers.LayerNormalization(epsilon=1e-6)(x)
        x = layers.MultiHeadAttention(
            key_dim=head_size, num_heads=num_heads, dropout=dropout
        )(x, x)
        x = layers.Dropout(dropout)(x)
        res = x + inputs
        # Feed Forward Part
        x = layers.LayerNormalization(epsilon=1e-6)(res)
        x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
        x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
        x = x + res
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    x = layers.Dense(24)(x)
    return keras.Model(inputs, x)
model_tr = build_transformer_model(input_shape=(window_size, X_train.shape[2]), head_size=256, num_heads=4, ff_dim=4, num_transformer_blocks=4, mlp_units=[128], mlp_dropout=0.4, dropout=0.25)
m_tr_history = model_tr.fit(x=X_train, y=y_train, validation_split=0.25, batch_size=64, epochs=10, callbacks=[modelsave_cb])
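On question II, it may help to see that attention operates on one vector per time step, so a 10-feature input is handled the same way as any other token vector. The following is only a minimal NumPy sketch of single-head scaled dot-product self-attention, not the Keras MultiHeadAttention implementation; the random weights, the seed and key_dim = 16 are assumptions chosen for illustration.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x has shape (timesteps, features); each weight matrix has shape
    (features, key_dim).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (timesteps, timesteps)
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
timesteps, features, key_dim = 168, 10, 16           # key_dim is an arbitrary choice
x = rng.normal(size=(timesteps, features))           # one sample: 168 hours x 10 features
w_q, w_k, w_v = (rng.normal(size=(features, key_dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The attention matrix here has one entry per pair of time steps (168 x 168), and a matrix of that size is computed per head and per sample, which is one plausible contributor to the training time.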


LSTM: Input 0 of layer sequential is incompatible with the layer

I know there are several questions about this here, but I haven't found one which fits exactly my problem.
I'm trying to fit an LSTM with data from Pandas DataFrames, but I am confused about the format in which I have to provide them.
I created a small code snippet that shows what I am trying to do:
import pandas as pd, tensorflow as tf, random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
targets = pd.DataFrame(index=pd.date_range(start='2019-01-01', periods=300, freq='D'))
targets['A'] = [random.random() for _ in range(len(targets))]
targets['B'] = [random.random() for _ in range(len(targets))]
features = pd.DataFrame(index=targets.index)
for i in range(len(features)):
    features[str(i)] = [random.random() for _ in range(len(features))]
model = Sequential()
model.add(LSTM(units=targets.shape[1], input_shape=features.shape))
model.compile(optimizer='adam', loss='mae')
model.fit(features, targets, batch_size=10, epochs=10)
This results in:
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [10, 300]
which I expect relates to the dimensions of the features DataFrame provided. I guess that once this is fixed, the next error will mention the targets DataFrame.
As far as I understand, the 'units' parameter of my first layer defines the output dimensionality of this model. The inputs have to have a 3D shape, but I don't know how to create them out of the 2D world of the DataFrames.
I hope you can help me understand the reshape mechanism in Python and how to use it in combination with Pandas DataFrames. (I'm quite new to Python and come from R.)
Thanks in advance
Let's look at a few popular ways in which LSTMs are used.
Many to Many
Example: You have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict the part of speech (POS) of each word.
So you have n words and you feed one word per timestep to the LSTM. Each LSTM timestep (also called LSTM unrolling) will produce an output. Each word is represented by a set of features, normally word embeddings. So the input to the LSTM is of size batch_size X time_steps X features
Keras code:
import numpy as np
from tensorflow import keras
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = True)(inputs)
outputs = keras.layers.TimeDistributed(keras.layers.Dense(5, activation='softmax'))(lstm)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
X = np.random.randn(4,10,3)
y = np.random.randint(0,2, size=(4,10,5))
model.fit(X, y, epochs=2)
print (model.predict(X).shape)
Many to One
Example: Again you have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict whether the sentiment of the sentence is positive or negative.
Keras code
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = False)(inputs)
outputs =keras.layers.Dense(5, activation='softmax')(lstm)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
X = np.random.randn(4,10,3)
y = np.random.randint(0,2, size=(4,5))
model.fit(X, y, epochs=2)
print (model.predict(X).shape)
Many to multi-headed
Example: You have a sentence (composed of words in sequence). Given this sequence of words, you would like to predict the sentiment of the sentence as well as the author of the sentence.
This is multi-headed model where one head will predict the sentiment and another head will predict the author. Both the heads share the same LSTM backbone.
Keras code
inputs = keras.Input(shape=(10,3))
lstm = keras.layers.LSTM(8, input_shape = (10, 3), return_sequences = False)(inputs)
output_A = keras.layers.Dense(5, activation='softmax')(lstm)
output_B = keras.layers.Dense(5, activation='softmax')(lstm)
model = keras.Model(inputs=inputs, outputs=[output_A, output_B])
model.compile(loss='categorical_crossentropy', optimizer='adam')
X = np.random.randn(4,10,3)
y_A = np.random.randint(0,2, size=(4,5))
y_B = np.random.randint(0,2, size=(4,5))
model.fit(X, [y_A, y_B], epochs=2)
y_hat_A, y_hat_B = model.predict(X)
print (y_hat_A.shape, y_hat_B.shape)
What you are looking for is the many to multi-headed model, where one head makes the predictions for A and another head makes the predictions for B.
The input data for the LSTM has to be 3D.
If you print the shapes of your DataFrames you get:
targets : (300, 2)
features : (300, 300)
The input data has to be reshaped into (samples, time steps, features). This means that the features and the targets must contain the same number of samples.
You need to set a number of time steps for your problem, in other words, how many consecutive observations will be used to make one prediction.
For example, if you have 300 days and 2 features, the time step can be 3, so that three days will be used to make one prediction (you can choose this arbitrarily). Here is the code for reshaping your data (with a few more changes):
import pandas as pd
import numpy as np
import tensorflow as tf
import random
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
data = pd.DataFrame(index=pd.date_range(start='2019-01-01', periods=300, freq='D'))
data['A'] = [random.random() for _ in range(len(data))]
data['B'] = [random.random() for _ in range(len(data))]
# Choose the time_step size.
time_steps = 3
# Use numpy for the 3D array as it is easier to handle.
data = np.array(data)
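def make_x_y(ts, data):
    """
    Parameters
    ts : int
    data : numpy array
    This function creates two arrays, x and y.
    x is the input data and y is the target data.
    """
    x, y = [], []
    offset = 0
    for i in data:
        if offset < len(data)-ts:
            # Each sample is a window of ts rows; its target is the row that follows.
            x.append(data[offset:ts+offset])
            y.append(data[ts+offset])
            offset += 1
    return np.array(x), np.array(y)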
x, y = make_x_y(time_steps, data)
print(x.shape, y.shape)
nodes = 100 # This is the width of the network.
out_size = 2 # Number of outputs produced by the network. Same size as features.
model = Sequential()
model.add(LSTM(units=nodes, input_shape=(x.shape[1], x.shape[2])))
model.add(Dense(out_size)) # For the output a Dense (fully connected) layer is used.
model.compile(optimizer='adam', loss='mae')
model.fit(x, y, batch_size=10, epochs=10)
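As an alternative to an explicit loop, the same (samples, time steps, features) layout can be produced with NumPy's sliding_window_view. This is only a sketch under the assumptions of the example above (300 rows, 2 columns, 3 time steps); the random array stands in for the A and B columns.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

time_steps = 3
data = np.random.default_rng(0).random((300, 2))     # stand-in for columns A and B

# sliding_window_view appends the window axis last: shape (298, 2, 3).
windows = sliding_window_view(data, time_steps, axis=0)
windows = windows.transpose(0, 2, 1)                  # -> (298, 3, 2)

x = windows[:-1]         # the last window has no following row to predict
y = data[time_steps:]    # the row that follows each window
print(x.shape, y.shape)
```

Each x[i] holds rows i to i + 2 of the data and y[i] holds row i + 3, so x can be passed to an LSTM with input_shape=(3, 2).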
Well, just to finalize this issue, I would like to share a solution I have worked on in the meantime. The class TimeseriesGenerator in tf.keras.... made it quite easy to provide the data in the right shape to an LSTM model:
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np
window_size = 7
batch_size = 8
sampling_rate = 1
train_gen = TimeseriesGenerator(X_train.values, y_train.values,
                                length=window_size, sampling_rate=sampling_rate,
                                batch_size=batch_size)
valid_gen = TimeseriesGenerator(X_valid.values, y_valid.values,
                                length=window_size, sampling_rate=sampling_rate,
                                batch_size=batch_size)
test_gen = TimeseriesGenerator(X_test.values, y_test.values,
                               length=window_size, sampling_rate=sampling_rate,
                               batch_size=batch_size)
There are many other ways of implementing generators, e.g. using more_itertools, which provides the function windowed, or making use of tf.data.Dataset and its window method.
For me the TimeseriesGenerator was sufficient to feed the tests I did.
In case you would like to see an example modeling the DAX based on some stocks, I'm sharing a notebook on GitHub.
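For readers who want to see what such a generator does, here is a small hand-written sketch in plain NumPy. It is not the TimeseriesGenerator implementation; window_batches is a hypothetical helper, and it assumes a sampling rate of 1 and that each target is the row immediately after its window.

```python
import numpy as np

def window_batches(data, targets, length, batch_size):
    """Yield (samples, targets) batches.

    Each sample is `length` consecutive rows of `data`; its target is the
    entry of `targets` at the index right after the window.
    """
    ends = np.arange(length, len(data))               # index of each target row
    for start in range(0, len(ends), batch_size):
        rows = ends[start:start + batch_size]
        samples = np.stack([data[r - length:r] for r in rows])
        yield samples, targets[rows]

X = np.arange(40, dtype=float).reshape(20, 2)         # 20 time steps, 2 features
y = np.arange(20, dtype=float)

batches = list(window_batches(X, y, length=7, batch_size=8))
first_x, first_y = batches[0]
print(len(batches), first_x.shape, first_y.shape)
```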

Character-based Text Classification with Triplet Loss

I'm trying to implement a text classifier using triplet loss to classify different job descriptions into categories, based on this paper. But whatever I do, the classifier yields very bad results.
For the embedding I followed this tutorial, and the NN architecture is based on this article.
I create my encodings using:
max_char_length = 20
group_numbers = range(0, len(job_groups))
char_vocabulary = {'PAD':0}
X_char = []
y_temp = []
i = 1
for group, number in zip(job_groups, group_numbers):
    for job in group:
        job_cleaned = some_cleaning_function(job)
        job_enc = []
        for c in job_cleaned:
            if c not in char_vocabulary:
                # First time this character is seen: assign the next free index.
                char_vocabulary[c] = i
                i += 1
            job_enc.append(char_vocabulary[c])
        X_char.append(job_enc)
        y_temp.append(number)
X_char = pad_sequences(X_char, maxlen=max_char_length, truncating='post')
My Neural Network is set up the following way:
def create_base_model():
    char_in = Input(shape=(max_char_length,), name='Char_Input')
    char_enc = Embedding(input_dim=len(char_vocabulary)+1, output_dim=20, mask_zero=True,name='Char_Embedding')(char_in)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(char_enc)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
    x = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.2, dropout=0.4))(x)
    x = Bidirectional(LSTM(64, return_sequences=False, recurrent_dropout=0.2, dropout=0.4))(x)
    out = Dense(128, activation = "softmax")(x)
    return Model(char_in, out)
def get_siamese_triplet_char():
    anchor_input_c = Input(shape=(max_char_length,),name='Char_Input_Anchor')
    pos_input_c = Input(shape=(max_char_length,),name='Char_Input_Positive')
    neg_input_c = Input(shape=(max_char_length,),name='Char_Input_Negative')
    base_model = create_base_model(encoding_generator)
    encoded_anchor = base_model(anchor_input_c)
    encoded_positive = base_model(pos_input_c)
    encoded_negative = base_model(neg_input_c)
    inputs = [anchor_input_c, pos_input_c, neg_input_c]
    outputs = [encoded_anchor, encoded_positive, encoded_negative]
    siamese_triplet = Model(inputs, outputs)
    siamese_triplet.compile(loss=None, optimizer='adam')
    return siamese_triplet, base_model
The triplet loss is defined as follows:
def triplet_loss(inputs):
    anchor, positive, negative = inputs
    positive_distance = K.square(anchor - positive)
    negative_distance = K.square(anchor - negative)
    positive_distance = K.sqrt(K.sum(positive_distance, axis=-1, keepdims = True))
    negative_distance = K.sqrt(K.sum(negative_distance, axis=-1, keepdims = True))
    loss = positive_distance - negative_distance
    loss = K.maximum(0.0, 1 + loss)
    return K.mean(loss)
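To make the arithmetic of this loss concrete, here is a NumPy sketch of the same formula (Euclidean distances, margin 1) on two hand-made triplets. The function name triplet_loss_np and the example vectors are assumptions for illustration only.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, margin=1.0):
    """Mean over the batch of max(0, margin + d(a, p) - d(a, n))."""
    d_pos = np.sqrt(np.sum((anchor - positive) ** 2, axis=-1))
    d_neg = np.sqrt(np.sum((anchor - negative) ** 2, axis=-1))
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))

anchor   = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[3.0, 4.0], [0.0, 1.0]])   # distances to the anchor: 5 and 1
negative = np.array([[0.0, 1.0], [3.0, 4.0]])   # distances to the anchor: 1 and 5

loss = triplet_loss_np(anchor, positive, negative)
print(loss)
```

The first triplet violates the margin (1 + 5 - 1 = 5) and the second satisfies it (1 + 1 - 5 is negative, so it is clipped to 0), which gives a mean loss of 2.5.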
The model is then trained with:
shuffle=True, batch_size=8, epochs=22, verbose=1)
My goal is, first, to train the network without label data in order to minimize the distances between related phrases and, second, to add a classification layer and create the final classifier.
My general problem is that, even though the first phase shows decreasing cost values, it overfits and the validation results jump around; the second phase fails badly because I am not able to train the model to actually classify.
My questions are the following:
Could someone explain the Embedding architecture? What does the output dimension refer to? The individual characters? Would that even make sense? Or is there a better way to encode the input data?
How can I add validation_data to a network that does not contain labeled data? I could use validation_split, but I would prefer passing specific data to validate on, as my data is stratified.
Is there a reason why the classification does not work? Applying a simple K-Nearest Neighbor algorithm achieves at best 0.5 accuracy! Is it because of the data? Or is there a systematic error in my system?
All ideas and suggestions are really appreciated!
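On the first question, an embedding layer can be pictured as a lookup table with one row per vocabulary index, where output_dim is the length of each row. The NumPy sketch below only mimics that lookup; the sizes, the random matrix and the encoded strings are made-up assumptions, and in a real layer the matrix is learned.

```python
import numpy as np

vocab_size, output_dim, max_len = 30, 20, 20          # illustrative sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, output_dim))

# Two padded, integer-encoded strings (0 is the padding index).
encoded = np.zeros((2, max_len), dtype=int)
encoded[0, :3] = [5, 2, 9]
encoded[1, :4] = [7, 7, 1, 3]

# The lookup replaces every character index by its row of the matrix.
embedded = embedding_matrix[encoded]
print(embedded.shape)
```

So with character-level encoding, each character is mapped to its own vector of length output_dim, and a sequence of max_len characters becomes a (max_len, output_dim) matrix.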

Huge difference between training accuracy and evaluation accuracy using Tensorflow Datasets class + Keras

I created a dataset from my data in the form (features, labels). The dimension of the features is [?,731,7] (where the ? should be 400), and the dimension of the corresponding labels is [4,], as shown in my dataset. Each [731,7] sample corresponds to a 4-element array like [0,1,0,0].
A few sample data:
After building a simple multi-layer neural network, the training process looks normal, as shown below. But when I use the same dataset to validate (just to check whether the algorithm is working), I get a huge difference.
I don't think this is right, but I am not sure whether it happens because I used .eval() wrong or because my datasets are wrong.
My code for datasets creation:
filenames = glob.glob(main_dir+keywords)
# filenames = ['test.txt','test2.txt']
length = len(filenames) # num of files
length_samesat = 100 # happens to be this... I designed it in propagation
batch_num = 731 # happens to be this...
dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.flat_map(lambda filename: tf.data.TextLineDataset(filename))
dataset = dataset.map(lambda string: tf.string_split([string], delimiter=', ').values)
dataset = dataset.map(lambda x: tf.strings.to_number(x))
dataset = dataset.batch(batch_num)
dataset = dataset.map(lambda tensor: tf.reshape(tensor, [batch_num, 7]))
dataset = dataset.batch(1).repeat()
Then I zip my dataset with my label dataset, create the NN and run it:
dataset_all = tf.data.Dataset.zip((dataset, datalabel))
dataset_all = dataset_all.shuffle(400)
# NN Model
inputs = tf.keras.Input(shape=(731,7,)) # Returns a placeholder tensor
# A layer instance is callable on a tensor, and returns a tensor.
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(400, activation='tanh')(x)
x = tf.keras.layers.Dense(400, activation='tanh')(x)
# x = tf.keras.layers.Dense(450, activation='tanh')(x)
# x = tf.keras.layers.Dense(300, activation='tanh')(x)
# x = tf.keras.layers.Dense(450, activation='tanh')(x)
# x = tf.keras.layers.Dense(200, activation='relu')(x)
# x = tf.keras.layers.Dense(100, activation='relu')(x)
predictions = tf.keras.layers.Dense(4, activation='softmax')(x)
# Instantiate the model given inputs and outputs.
model = tf.keras.Model(inputs=inputs, outputs=predictions)
# The compile step specifies the training configuration.
# Trains for 5 epochs
model.fit(dataset_all, epochs=5, steps_per_epoch=400)
model.evaluate(dataset_all, steps=400)

How to implement a 1D convolutional neural network with residual connections and batch-normalization in Keras?

I am trying to develop a 1D convolutional neural network with residual connections and batch-normalization based on the paper Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks, using keras.
This is the code so far:
# define model
x = Input(shape=(time_steps, n_features))
# First Conv / BN / ReLU layer
y = Conv1D(filters=n_filters, kernel_size=n_kernel, strides=n_strides, padding='same')(x)
y = BatchNormalization()(y)
y = ReLU()(y)
shortcut = MaxPooling1D(pool_size = n_pool)(y)
# First Residual block
y = Conv1D(filters=n_filters, kernel_size=n_kernel, strides=n_strides, padding='same')(y)
y = BatchNormalization()(y)
y = ReLU()(y)
y = Dropout(rate=drop_rate)(y)
y = Conv1D(filters=n_filters, kernel_size=n_kernel, strides=n_strides, padding='same')(y)
# Add Residual (shortcut)
y = add([shortcut, y])
# Repeated Residual blocks
for k in range (2,3): # smaller network for testing
    shortcut = MaxPooling1D(pool_size = n_pool)(y)
    y = BatchNormalization()(y)
    y = ReLU()(y)
    y = Dropout(rate=drop_rate)(y)
    y = Conv1D(filters=n_filters * k, kernel_size=n_kernel, strides=n_strides, padding='same')(y)
    y = BatchNormalization()(y)
    y = ReLU()(y)
    y = Dropout(rate=drop_rate)(y)
    y = Conv1D(filters=n_filters * k, kernel_size=n_kernel, strides=n_strides, padding='same')(y)
    y = add([shortcut, y])
z = BatchNormalization()(y)
z = ReLU()(z)
z = Flatten()(z)
z = Dense(64, activation='relu')(z)
predictions = Dense(classes, activation='softmax')(z)
model = Model(inputs=x, outputs=predictions)
# Compiling
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
# Fitting
model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch)
And this is the graph of a simplified model of what I am trying to build.
The model described in the paper uses an incrementing number of filters:
The network consists of 16 residual blocks with 2 convolutional layers per block. The convolutional layers all have a filter length of 16 and have 64k filters, where k starts out as 1 and is incremented every 4-th residual block. Every alternate residual block subsamples its inputs by a factor of 2, thus the original input is ultimately subsampled by a factor of 2^8. When a residual block subsamples the input, the corresponding shortcut connections also subsample their input using a Max Pooling operation with the same subsample factor.
But I can only make it work if I use the same number of filters in every Conv1D layer, with k=1, strides=1 and padding=same, without applying any MaxPooling1D. Any change in these parameters causes a tensor size mismatch and a failure to compile with the following error:
ValueError: Operands could not be broadcast together with shapes (70, 64) (70, 128)
Does anyone have any idea how to fix this size mismatch and make it work?
In addition, if the input has more than one channel (or feature), the mismatch is even worse! Is there a way to deal with more than one channel?
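The filter and subsampling schedule quoted from the paper can be written out as a small table. This is a sketch of the arithmetic only: block_plan is a hypothetical helper, and the choice that the odd-indexed blocks are the ones that subsample is an assumption, since the quote only says "every alternate residual block".

```python
def block_plan(n_blocks=16, base_filters=64, blocks_per_k=4):
    """Return (block index, number of filters, subsample factor) per residual block."""
    plan = []
    for block in range(n_blocks):
        k = 1 + block // blocks_per_k               # k = 1, 2, 3, 4
        subsample = 2 if block % 2 == 1 else 1      # assumption: odd-indexed blocks subsample
        plan.append((block, base_filters * k, subsample))
    return plan

plan = block_plan()
total_subsampling = 1
for _, _, factor in plan:
    total_subsampling *= factor

for block, filters, factor in plan:
    print(block, filters, factor)
print("total subsampling factor:", total_subsampling)
```

Eight of the sixteen blocks halve the sequence, which reproduces the overall factor of 2^8 = 256 stated in the quote, and the number of filters changes at blocks 4, 8 and 12, which is where a shortcut and its residual branch would stop having the same number of channels.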
The tensor shape mismatch most likely occurs in the add([shortcut, y]) layer. The MaxPooling1D layer halves the number of time steps by default (you can change this with the pool_size parameter), whereas your residual branch does not reduce the time steps by the same amount. You should apply strides=2 with padding='same' in one of the Conv1D layers of the branch (preferably the last one) before adding shortcut and y.
For reference, you can check out the ResNet code here: Keras-applications-github
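The shapes that reach the add layer can be checked by hand with the usual output-length rules for padding='same' convolutions and default ('valid') max pooling. The sketch below uses plain Python; the helper names and the 140-step, 64-filter input are assumptions for illustration.

```python
import math

def conv1d_same_len(steps, strides=1):
    """Output length of Conv1D with padding='same'."""
    return math.ceil(steps / strides)

def maxpool1d_valid_len(steps, pool_size=2, strides=None):
    """Output length of MaxPooling1D with padding='valid' (strides default to pool_size)."""
    strides = pool_size if strides is None else strides
    return (steps - pool_size) // strides + 1

steps, filters = 140, 64                                     # illustrative input shape
shortcut = (maxpool1d_valid_len(steps, 2), filters)          # (70, 64)
branch_stride1 = (conv1d_same_len(steps, 1), filters)        # (140, 64): length mismatch
branch_stride2 = (conv1d_same_len(steps, 2), filters)        # (70, 64): matches the shortcut
branch_wider = (conv1d_same_len(steps, 2), 2 * filters)      # (70, 128): channel mismatch

for name, shape in [("shortcut", shortcut), ("stride 1", branch_stride1),
                    ("stride 2", branch_stride2), ("stride 2, 2x filters", branch_wider)]:
    print(name, shape, shape == shortcut)
```

Note that the shapes in the reported error, (70, 64) and (70, 128), already agree in the number of time steps and differ only in the number of filters. One common option in that case is to pass the shortcut through a Conv1D with kernel_size=1 and the new number of filters so that both inputs to add have the same last dimension; whether the paper does this is not stated in the quoted passage.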

classifier after binary classifier

I have some images which contain some structure (a building, for example).
I have the pixel data and the labels data.
The labels data are in the form:
np.array([[0, nan, nan],
[1, 2, 1],
[1, 3, 2]])
The first column indicates whether a structure is present: 1 or 0.
The second column gives the type of the structure: 1, 2 or 3.
The third column gives the number of structures.
When there is no structure, all other values are nan.
So, for the example above we have:
first line: no structure
second line: one structure of type 2
third line: two structures of type 3
So, first I do a binary classification to find out whether there is a structure or not.
I am using vggnet with pretrained weights.
imagenet_weights = './vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
base_vgg16 = VGG16(include_top=False, weights=imagenet_weights, input_shape=input_shape)
last_layer = base_vgg16.output
x = Flatten()(last_layer)
x = Dense(512, activation='relu')(x)
x = Dropout(0.3)(x)
preds = Dense(1, activation='sigmoid')(x)
base_vgg16.trainable = False
model_vgg16 = Model(base_vgg16.input, preds)
For training I am using 3-fold cross-validation and data augmentation.
This gives 97% accuracy.
Now, I want to run a second classification to find out the type, based on the binary classification results.
So, I have:
pred_val = model.predict(X_val)
Now, I am not sure how to proceed.
What to use as input for the classifier.
I tried:
X_train = X_train[np.where(pred_val >= 0.5), :]
pred_train = model.predict(X_train)
X_train = X_train[np.where(pred_train.squeeze() >= 0.5), :]
(I am not sure whether it makes sense to predict on the training data.)
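For the selection step itself, a boolean mask built from the predictions of the same array that is being filtered tends to be easier to reason about than np.where. The NumPy sketch below uses made-up stand-ins for the images, the labels and the sigmoid outputs; all names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4, 4, 3))                        # 6 stand-in images
labels = np.array([[0, np.nan, np.nan],
                   [1, 2, 1],
                   [1, 3, 2],
                   [0, np.nan, np.nan],
                   [1, 1, 1],
                   [1, 2, 3]])
# Stand-in for the sigmoid outputs of the binary model on X, shape (N, 1).
pred = np.array([[0.1], [0.9], [0.8], [0.6], [0.7], [0.95]])

mask = pred.squeeze(axis=-1) >= 0.5                 # (N,) boolean: predicted "structure"
has_type = ~np.isnan(labels[:, 1])                  # rows that actually carry a type label
keep = mask & has_type                              # drops the false positive in row 3

X_stage2 = X[keep]
y_type = labels[keep, 1].astype(int) - 1            # types 1, 2, 3 -> class ids 0, 1, 2
print(X_stage2.shape, y_type)
```

The mask, the images and the labels all refer to the same rows here, so the inputs and targets of the second classifier stay aligned.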
However I run the second network, I always get awful results.
Val_loss is almost steady around (10-11)
Train_loss is around 10-12
Val_acc is almost steady around (0.25-0.35)
Train_acc is 0.28-0.38
For the second network:
inputs = Input(shape=input_shape)
x = Flatten()(inputs)
x = Dense(512, activation='relu')(x)
x = Dense(256, activation='relu')(x)
preds = Dense(5, activation='softmax')(x)
model = Model(inputs, preds)
Whatever combination I try, with more or fewer units (I tried only a few, like x = Dense(8, activation='relu')(x)),
or whatever batch size, optimizer and learning rate I use, the result is the same.

