from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

def build_model():
    model2 = Sequential()
    model2.add(LSTM(8, batch_input_shape=(12, 12, 1), stateful=True))
    model2.add(Dense(8))
    model2.add(Dense(8))
    model2.add(Dense(1))
    model2.compile(loss='mse', optimizer='adam')
    return model2

model = KerasRegressor(build_fn=build_model, epochs=50, batch_size=12, verbose=0)
kfold = KFold(n_splits=5, random_state=np.random.seed(7))
score = cross_val_score(model, ts_x, ts_y, cv=kfold, scoring='neg_mean_squared_error')
ts_x.shape is (228,12,1)
ts_y.shape is (228,1,1)
As you can see, I have 228 samples now, but when I run it I get:
ValueError: In a stateful network, you should only pass inputs with a number of samples that can be divided by the batch size. Found: 183 samples.
I want to know why it found 183 samples instead of 228 samples.
What the error means:
The batch_size you have provided is 12, that is, 12 records are fed to the model in each training step, so the number of training samples must be divisible by 12 (the model is stateful, so Keras refuses to silently drop the leftover records).
However, the model is not actually trained on all 228 records at once. You are also using 5-fold cross-validation, which means your dataset is divided into 5 parts; one part is held out as a validation set while the model trains on the other 4. With 228 records, each fold holds roughly 228/5 = 45.6 records, so the training portion holds roughly 228 * 4/5 = 182.4 (~183) records.
So each training run actually sees 183 records, and 183 is not a multiple of 12, hence the error.
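A quick way to see this for yourself (a minimal sketch; the exact fold sizes depend on KFold's defaults, here no shuffling):
from sklearn.model_selection import KFold
import numpy as np

kfold = KFold(n_splits=5)
train_sizes = [len(train_idx) for train_idx, _ in kfold.split(np.zeros(228))]
print(train_sizes)   # [182, 182, 182, 183, 183] -- none of these is divisible by 12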
Potential solution:
You can try setting the batch_size to a factor of 183 (1, 3, 61, 183), which doesn't give you many reasonable options.
Alternatively, you can try changing n_splits to something nearby (like 6), so that the training size 228 * (n_splits - 1) / n_splits has more convenient factors (with n_splits = 6 the training size is 190, and 10 is one possible batch_size). Keep in mind that batch_input_shape in build_model must be changed to match the new batch_size, and, since the network is stateful, the held-out fold must also be divisible by it; see the sketch below.
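A minimal sketch of that idea. The extra assumptions here: build_model is modified to use batch_input_shape=(19, 12, 1), and batch_size=19 is chosen because it divides both the 190 training samples and the 38 held-out samples of each fold:
from sklearn.model_selection import KFold, cross_val_score

# n_splits=6 -> 190 training / 38 validation samples per fold; 19 divides both
kfold = KFold(n_splits=6, shuffle=True, random_state=7)
model = KerasRegressor(build_fn=build_model, epochs=50, batch_size=19, verbose=0)
score = cross_val_score(model, ts_x, ts_y, cv=kfold, scoring='neg_mean_squared_error')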
Apart from that, I'm afraid I don't have much experience with TensorFlow since I use PyTorch, and PyTorch doesn't raise an error even if the last batch isn't a full batch. Still, you could look at TensorFlow's documentation and its own Q&A forums for another answer.
I hope this solves your problem or at least guides you in the right direction towards a solution.
Related
I know that all these parameters are in the documentation, but the terminology is confusing. I'm not sure what the difference is between an 'execution' and an 'epoch'. My current understanding is this
max_trials: the number of combinations of hyper parameters to search over
executions_per_trial: the number of times to update weights for each combination of hyper parameters
epochs: the number of times to go through the process of: updating the weights executions_per_trial times for each of the max_trials trials
So using the code below as an example
tuner = RandomSearch(
    hypermodel = build_model,
    max_trials = 5,
    executions_per_trial = 6,
    hyperparameters = hp,
    objective = 'mse',
    ...
)

tuner %>% fit_tuner(x = x, y = y,
                    epochs = 100,
                    validation_data = list(x_val, y_val))
I would expect this to update the weights of the model 6 times for 5 combinations of parameters, and do this 100 times. But, like I said, I'm really not sure.
The basic idea is the following pseudo code:
for trial in 1 to max_trials:
    hp = select_hyperparameters()
    for execution in 1 to executions_per_trial:
        model = build_model(hp)
        for epoch in 1 to epochs:
            model.update_weights()
For one combination of hyperparameters, you may want to build multiple models, because the model you want to evaluate is not deterministic. The weights of a neural network are initialized randomly (as is the order in which it receives the training data), so two networks with the same architecture will not lead to exactly the same model even if they are trained on the same data. If you train multiple models with the same combination of hyperparameters, you are less likely to observe a case where a good combination led to a bad score, or a bad combination led to a good score.
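Your snippet uses the R interface, but the same settings map directly onto the Python KerasTuner API. A rough sketch (build_model here is a placeholder hypermodel, x, y, x_val, y_val are your data, and objective is set to 'val_loss' just for illustration):
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # one tunable hyperparameter, purely for illustration
    units = hp.Int("units", min_value=8, max_value=64, step=8)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = kt.RandomSearch(
    hypermodel=build_model,
    objective="val_loss",
    max_trials=5,            # 5 hyperparameter combinations are sampled
    executions_per_trial=6,  # each combination is trained 6 times from scratch
)

# every execution runs for 100 epochs; a trial's score is averaged over its executions
tuner.search(x, y, epochs=100, validation_data=(x_val, y_val))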
So I was trying to run this Kaggle notebook in my Jupyter to test the performance of my laptop.
I made some modifications to the code to fit my environment:
#from scipy.ndimage import imread
from imageio import imread
At block [11], I received the error below.
Any help or suggestions are appreciated.
You have specified steps_per_epoch incorrectly.
The steps_per_epoch should be equal to
steps_per_epoch = ceil(number_of_samples / batch_size)
For your case
steps_per_epoch = ceil(1161 / 16) = ceil(72.56) = 73
Try specifying steps_per_epoch = 73
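For instance, a minimal sketch (math.ceil does the calculation; model, train_generator and the epoch count stand in for whatever the notebook defines):
import math

number_of_samples = 1161
batch_size = 16
steps_per_epoch = math.ceil(number_of_samples / batch_size)   # = 73

history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=10,
)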
As you can see, your entire dataset is exhausted in 73 steps. Now, if you specify steps_per_epoch any higher than 73 (i.e. 74), there is no data available, and therefore you get the "input generator ran out of data" error.
More Information:
Model training comprises two parts: a forward pass and a backward pass.
1 train step = 1 forward pass + 1 backward pass
A single train step (1 forward pass + 1 backward pass) is computed on a single batch.
So if you have 100 samples and your batch size is 10.
Your model will have 10 train steps.
Epoch: Epoch is defined as complete iteration over the dataset.
Therefore, for your model to completely iterate over the dataset of 100 samples, it should undergo 10 train steps.
This number of train steps is exactly what steps_per_epoch specifies.
The steps_per_epoch argument usually needs to be specified when you give an infinite data generator to your fit() call, and does not need to be specified if you have finite data.
I am absolutely new to TensorFlow and Keras, and I am trying to make my way around trying out some code that I am finding online.
In particular I am using fashion-MNIST, which consists of a training set of 60000 examples and a test set of 10000 examples. Each of them is a 28x28 grayscale image.
I am following this tutorial "https://towardsdatascience.com/building-your-first-neural-network-in-tensorflow-2-tensorflow-for-hackers-part-i-e1e2f1dfe7a0", and I have no problem until the definition of
history = model.fit(
    train_dataset.repeat(),
    epochs=10,
    steps_per_epoch=500,
    validation_data=val_dataset.repeat(),
    validation_steps=2)
As far as I understood, I need to use train_dataset.repeat() as the input dataset because otherwise I won't have enough training examples for those hyperparameter values (epochs, steps_per_epoch).
My question is: how can I avoid having to use .repeat()?
How do I need to change the hyperparameters?
I am copying the code here, for simplicity:
def preprocess(x, y):
    x = tf.cast(x, tf.float32) / 255.0
    y = tf.cast(y, tf.float32)
    return x, y

def create_dataset(xs, ys, n_classes=10):
    ys = tf.one_hot(ys, depth=n_classes)
    return tf.data.Dataset.from_tensor_slices((xs, ys)).map(preprocess).shuffle(len(ys)).batch(128)
model.compile(optimizer='adam',
              loss=tf.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history1 = model.fit(train_dataset.repeat(),
                     epochs=10,
                     steps_per_epoch=500,
                     validation_data=val_dataset.repeat(),
                     validation_steps=2)
Thanks!
If you don't want to use .repeat(), you need to have your model pass through your entire dataset only once per epoch.
In order to do that you need to calculate how many steps it takes for your model to pass through the entire dataset; the calculation is easy:
steps_per_epoch = len(train_dataset) // batch_size
So with a train_dataset of 60,000 samples and a batch_size of 128, you need 468 steps per epoch.
By setting this parameter that way you make sure that you do not exceed the size of your dataset.
I encountered the same problem and here is what I found.
Documentation of tf.keras.Model.fit: "If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted."
In other words, we don't need to specify steps_per_epoch if we use a tf.data dataset as the training data; TF will figure out how many steps there are. Meanwhile, TF will automatically re-iterate the dataset when the next epoch begins, so you can specify any number of epochs.
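For example, a minimal sketch (assuming train_dataset and val_dataset are the finite, batched datasets returned by create_dataset() above):
# no .repeat() and no steps_per_epoch: Keras iterates the whole dataset once
# per epoch and starts over when the next epoch begins
history = model.fit(
    train_dataset,
    epochs=10,
    validation_data=val_dataset,
)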
When passing an infinitely repeating dataset (e.g. dataset.repeat()), you must specify the steps_per_epoch argument.
Let's suppose I have a sequence of integers:
0,1,2, ..
and want to predict the next integer given the last 3 integers, e.g.:
[0,1,2]->3, [3,4,5]->6, etc.
Suppose I set up my model like so:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

batch_size = 1
time_steps = 3

model = Sequential()
model.add(LSTM(4, batch_input_shape=(batch_size, time_steps, 1), stateful=True))
model.add(Dense(1))
It is my understanding that model has the following structure (please excuse the crude drawing):
First Question: is my understanding correct?
Note I have drawn the previous states C_{t-1}, h_{t-1} entering the picture as this is exposed when specifying stateful=True. In this simple "next integer prediction" problem, the performance should improve by providing this extra information (as long as the previous state results from the previous 3 integers).
This brings me to my main question: It seems the standard practice (for example see this blog post and the TimeseriesGenerator keras preprocessing utility), is to feed a staggered set of inputs to the model during training.
For example:
batch0: [[0, 1, 2]]
batch1: [[1, 2, 3]]
batch2: [[2, 3, 4]]
etc
This has me confused because it seems this requires the output of the 1st LSTM cell (corresponding to the 1st time step). See this figure:
From the tensorflow docs:
stateful: Boolean (default False). If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
it seems this "internal" state isn't available and all that is available is the final state. See this figure:
So, if my understanding is correct (which it's clearly not), shouldn't we be feeding non-overlapped windows of samples to the model when using stateful=True? E.g.:
batch0: [[0, 1, 2]]
batch1: [[3, 4, 5]]
batch2: [[6, 7, 8]]
etc
The answer is: it depends on the problem at hand. For your case of one-step prediction - yes, you can, but you don't have to. But whether you do or not will significantly impact learning.
Batch vs. sample mechanism ("see AI" = see "additional info" section)
All models treat samples as independent examples; a batch of 32 samples is like feeding 1 sample at a time, 32 times (with differences - see AI). From model's perspective, data is split into the batch dimension, batch_shape[0], and the features dimensions, batch_shape[1:] - the two "don't talk." The only relation between the two is via the gradient (see AI).
Overlap vs no-overlap batch
Perhaps the best approach to understand it is information-based. I'll begin with timeseries binary classification, then tie it to prediction: suppose you have 10-minute EEG recordings, 240000 timesteps each. Task: seizure or non-seizure?
As 240k is too much for an RNN to handle, we use CNN for dimensionality reduction
We have the option to use "sliding windows" - i.e. feed a subsegment at a time; let's use 54k
Take 10 samples, shape (240000, 1). How to feed?
(1) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[54000:108000] ...
(2) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[1:54001] ...
Which of the two above do you take? If (2), your neural net will never confuse a seizure for a non-seizure for those 10 samples. But it'll also be clueless about any other sample. I.e., it will massively overfit, because the information it sees per iteration barely differs (1/54000 = 0.0019%) - so you're basically feeding it the same batch several times in a row. Now suppose (3):
(3) (10, 54000, 1), all samples included, slicing as sample[0:54000]; sample[24000:81000] ...
A lot more reasonable; now our windows have a 50% overlap, rather than 99.998%.
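To make the three slicing options concrete, here is a small illustrative sketch (tiny numbers instead of 240000/54000, purely hypothetical):
import numpy as np

def make_windows(seq, window, stride):
    # slice `seq` into windows of length `window`, advancing by `stride`;
    # stride == window gives no overlap, stride < window gives overlapping windows
    starts = range(0, len(seq) - window + 1, stride)
    return np.stack([seq[s:s + window] for s in starts])

x = np.arange(20)
print(make_windows(x, 6, 6))  # option (1): no overlap    -> [0..5], [6..11], [12..17]
print(make_windows(x, 6, 1))  # option (2): 1-step shift  -> [0..5], [1..6], [2..7], ...
print(make_windows(x, 6, 3))  # option (3): ~50% overlap  -> [0..5], [3..8], [6..11], ...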
Prediction: overlap bad?
If you are doing a one-step prediction, the information landscape is now changed:
Chances are, your sequence length is faaar from 240000, so overlaps of any kind don't suffer the "same batch several times" effect
Prediction fundamentally differs from classification in that the labels (the next timestep) differ for every subsample you feed; classification uses one label for the entire sequence
This dramatically changes your loss function, and what is 'good practice' for minimizing it:
A predictor must be robust to its initial sample, especially for LSTM - so we train for every such "start" by sliding the sequence as you have shown
Since labels differ timestep-to-timestep, the loss function changes substantially timestep-to-timestep, so risks of overfitting are far less
What should I do?
First, make sure you understand this entire post, as nothing here's really "optional." Then, here's the key about overlap vs no-overlap, per batch:
One sample shifted: model learns to better predict one step ahead for each starting step - meaning: (1) LSTM's robust against initial cell state; (2) LSTM predicts well for any step ahead given X steps behind
Many samples, shifted in later batch: model less likely to 'memorize' train set and overfit
Your goal: balance the two; 1's main edge over 2 is:
2 can handicap the model by making it forget seen samples
1 allows model to extract better quality features by examining the sample over several starts and ends (labels), and averaging the gradient accordingly
Should I ever use (2) in prediction?
If your sequence lengths are very long and you can afford to "slide window" w/ ~50% its length, maybe, but depends on the nature of data: signals (EEG)? Yes. Stocks, weather? Doubt it.
Many-to-many prediction: more common to see (2), largely for longer sequences.
LSTM stateful: may actually be entirely useless for your problem.
Stateful is used when LSTM can't process the entire sequence at once, so it's "split up" - or when different gradients are desired from backpropagation. With former, the idea is - LSTM considers former sequence in its assessment of latter:
t0=seq[0:50]; t1=seq[50:100] makes sense; t0 logically leads to t1
seq[0:50] --> seq[1:51] makes no sense; t1 doesn't causally derive from t0
In other words: do not overlap in stateful in separate batches. Same batch is OK, as again, independence - no "state" between the samples.
When to use stateful: when LSTM benefits from considering previous batch in its assessment of the next. This can include one-step predictions, but only if you can't feed the entire seq at once:
Desired: 100 timesteps. Can do: 50. So we set up t0, t1 as in above's first bullet.
Problem: not straightforward to implement programmatically. You'll need to find a way to feed to LSTM while not applying gradients - e.g. freezing weights or setting lr = 0.
When and how does LSTM "pass states" in stateful?
When: only batch-to-batch; samples are entirely independent
How: in Keras, only batch-sample to batch-sample: stateful=True requires you to specify batch_shape instead of input_shape - because, Keras builds batch_size separate states of the LSTM at compiling
Per above, you cannot do this:
# sampleNM = sample N at timestep(s) M
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample21, sample41, sample11, sample31]
This implies 21 causally follows 10 - and will wreck training. Instead do:
batch1 = [sample10, sample20, sample30, sample40]
batch2 = [sample11, sample21, sample31, sample41]
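A minimal sketch of this batch-to-batch state passing (random placeholder data; the shapes are hypothetical):
import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 4, 50, 1

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(loss='mse', optimizer='adam')

# batch2[i] must be the causal continuation of batch1[i], e.g. seq[50:100] after seq[0:50]
x1, y1 = np.random.rand(batch_size, timesteps, features), np.random.rand(batch_size, 1)
x2, y2 = np.random.rand(batch_size, timesteps, features), np.random.rand(batch_size, 1)

model.train_on_batch(x1, y1)   # final states of this batch...
model.train_on_batch(x2, y2)   # ...become the initial states of this one
model.reset_states()           # clear states before feeding unrelated sequences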
Batch vs. sample: additional info
A "batch" is a set of samples - 1 or greater (assume always latter for this answer)
Three approaches to iterate over data: Batch Gradient Descent (entire dataset at once), Stochastic GD (one sample at a time), and Minibatch GD (in-between). (In practice, however, we call the last one SGD too and only distinguish it from BGD - assume so for this answer.) Differences:
SGD never actually optimizes the train set's loss function - only its 'approximations'; every batch is a subset of the entire dataset, and the gradients computed only pertain to minimizing loss of that batch. The greater the batch size, the better its loss function resembles that of the train set.
The above extends to fitting batch vs. sample: a sample is an approximation of the batch - or, a poorer approximation of the dataset
First fitting 16 samples and then 16 more is not the same as fitting 32 at once - since weights are updated in-between, so model outputs for the latter half will change
The main reason for picking SGD over BGD is not, in fact, computational limitations - but that it's superior, most of the time. Explained simply: a lot easier to overfit with BGD, and SGD converges to better solutions on test data by exploring a more diverse loss space.
I have the following data
feat_1 feat_2 ... feat_n label
gene_1 100.33 10.2 ... 90.23 great
gene_2 13.32 87.9 ... 77.18 soso
....
gene_m 213.32 63.2 ... 12.23 quitegood
The size of M is large ~30K rows, and N is much smaller ~10 columns.
My question is: what is an appropriate deep learning structure to train and test on data like the above?
At the end of the day, the user will provide a vector of genes with their expression values.
gene_1 989.00
gene_2 77.10
...
gene_N 100.10
And the system will say which label applies to each gene, e.g. great or soso, etc.
By structure I mean one of these:
Convolutional Neural Network (CNN)
Autoencoder
Deep Belief Network (DBN)
Restricted Boltzmann Machine
To expand a little on @sung-kim's comment:
CNNs are used primarily for problems in computer vision, such as classifying images. They are modelled on the animal visual cortex; they basically have a connection network such that there are tiles of features which have some overlap. Typically they require a lot of data, more than 30k examples.
Autoencoders are used for feature generation and dimensionality reduction. They start with lots of neurons per layer, then this number is reduced, and then increased again, and the network is trained to reconstruct each input from itself. This results in the middle layers (low number of neurons) providing a meaningful projection of the feature space in a low dimension.
While I don't know much about DBNs, they are essentially stacks of restricted Boltzmann machines, usually pre-trained layer by layer without labels and then fine-tuned with supervision. Lots of parameters to train.
Again, I don't know much about restricted Boltzmann machines, but they aren't widely used for this sort of problem (to my knowledge).
As with all modelling problems though, I would suggest starting from the most basic model to look for signal. Perhaps a good place to start is Logistic Regression before you worry about deep learning.
If you have got to the point where you want to try deep learning, for whatever reason, then for this type of data a basic feed-forward network is the best place to start. In terms of deep learning, 30k data points is not a large number, so it's always best to start with a small network (1-3 hidden layers, 5-10 neurons) and then get bigger. Make sure you have a decent validation set when performing parameter optimisation, though. If you're a fan of the scikit-learn API, I suggest Keras as a good place to start.
One further comment: you will want to use a OneHotEncoder on your class labels before you do any training.
EDIT
I see from the bounty and the comments that you want to see a bit more about how these networks work. Below is an example of how to build a feed-forward model and do some simple parameter optimisation:
import numpy as np
from sklearn import preprocessing
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# Create some random data
np.random.seed(42)
X = np.random.random((10, 50))

# Similar labels
labels = ['good', 'bad', 'soso', 'amazeballs', 'good']
labels += labels
labels = np.array(labels)
np.random.shuffle(labels)

# Change the labels to the required one-hot format
numericalLabels = preprocessing.LabelEncoder().fit_transform(labels)
numericalLabels = numericalLabels.reshape(-1, 1)
y = preprocessing.OneHotEncoder(sparse=False).fit_transform(numericalLabels)

# Simple Keras model builder
def buildModel(nFeatures, nClasses, nLayers=3, nNeurons=10, dropout=0.2):
    model = Sequential()
    model.add(Dense(nNeurons, input_dim=nFeatures))
    model.add(Activation('sigmoid'))
    model.add(Dropout(dropout))

    for i in range(nLayers - 1):
        model.add(Dense(nNeurons))
        model.add(Activation('sigmoid'))
        model.add(Dropout(dropout))

    model.add(Dense(nClasses))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='sgd')
    return model

# Do an exhaustive search over a given parameter space
for nLayers in range(2, 4):
    for nNeurons in range(5, 8):
        model = buildModel(X.shape[1], y.shape[1], nLayers, nNeurons)
        modelHist = model.fit(X, y, batch_size=32, epochs=10,
                              validation_split=0.3, shuffle=True, verbose=0)
        minLoss = min(modelHist.history['val_loss'])
        epochNum = modelHist.history['val_loss'].index(minLoss)
        print('{0} layers, {1} neurons best validation at'.format(nLayers, nNeurons),
              'epoch {0} loss = {1:.2f}'.format(epochNum, minLoss))
Which outputs
2 layers, 5 neurons best validation at epoch 0 loss = 1.18
2 layers, 6 neurons best validation at epoch 0 loss = 1.21
2 layers, 7 neurons best validation at epoch 8 loss = 1.49
3 layers, 5 neurons best validation at epoch 9 loss = 1.83
3 layers, 6 neurons best validation at epoch 9 loss = 1.91
3 layers, 7 neurons best validation at epoch 9 loss = 1.65
A deep learning structure would be recommended if you were dealing with raw data and wanted to find features that work towards your classification goal automatically. But based on the names of your columns and their number (only 10), it seems that your features are already engineered.
For this reason you could just go with a standard multi-layer neural network and use supervised learning (backpropagation). Such a network would have the number of inputs matching the number of your columns (10), followed by a number of hidden layers, followed by an output layer with the number of neurons matching the number of your labels. You could experiment with different numbers of hidden layers and neurons, different activation types (sigmoid, tanh, rectified linear, etc.), and so on, as in the sketch below.
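A minimal sketch of that architecture (the layer sizes and the number of classes are placeholders, not a recommendation):
import tensorflow as tf

n_features = 10   # one input per feature column
n_classes = 4     # one output neuron per label (great, soso, quitegood, ...)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='tanh', input_shape=(n_features,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])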
Alternatively you could use the raw data (if it's available) and then go with DBNs (they're known to be robust and achieve good results across different problems) or auto-encoders.
If you expect the output to be something like scores for each label (as I understood from your question), try a supervised multi-class logistic regression classifier; the highest score takes the label.
If you're bound to use deep learning, a simple feed-forward ANN should do, with supervised learning through backpropagation: an input layer with N neurons, and one or two hidden layers, not more than that. There is no need to go 'deep' and add more layers for this data; with more layers you risk overfitting easily, it becomes tricky to figure out what the problem is, and the test accuracy suffers greatly.
Simply plotting or visualizing the data, e.g. with t-SNE, can be a good start if you need to figure out which features are important (or any correlations that may exist); see the sketch at the end of this answer.
You can then play with higher powers of those feature dimensions, or add increased weight to their scores.
For problems like this, deep learning probably isn't very well suited, but a simpler ANN architecture like this should work well, depending on the data.
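A minimal sketch of the t-SNE suggestion above (assuming X is the m-by-n feature matrix and labels is a NumPy array of the per-row class labels from the question):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# project the scaled features down to 2-D and colour the points by label
X_2d = TSNE(n_components=2, random_state=0).fit_transform(StandardScaler().fit_transform(X))

for label in np.unique(labels):
    mask = labels == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=5, label=label)
plt.legend()
plt.show()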