I want to train a network on the isolet dataset, consisting of 6238 samples with 300 features each.
This is my code so far:
import tensorflow as tf
import sklearn.preprocessing as prep
import numpy as np
import matplotlib.pyplot as plt
def main():
    X, C, Xtst, Ctst = load_isolet()

    # normalize
    #X = (X - np.mean(X, axis = 1)[:, np.newaxis]) / np.std(X, axis = 1)[:, np.newaxis]
    #Xtst = (Xtst - np.mean(Xtst, axis = 1)[:, np.newaxis]) / np.std(Xtst, axis = 1)[:, np.newaxis]
    scaler = prep.MinMaxScaler(feature_range=(0, 1))
    scaledX = scaler.fit_transform(X)
    scaledXtst = scaler.transform(Xtst)

    # Build the tf.keras.Sequential model by stacking layers. Choose an optimizer and loss function for training:
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(X.shape[1], activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(26, activation='softmax')
    ])

    ES_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=1e-2, patience=10, verbose=1)

    initial_learning_rate = 0.01
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate, decay_steps=100000, decay_rate=0.9999, staircase=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

    history = model.fit(scaledX, C, epochs=100, callbacks=[ES_callback], batch_size=32)

    plt.figure(1)
    plt.plot(range(len(history.history['loss'])), history.history['loss'])
    plt.plot(range(len(history.history['accuracy'])), history.history['accuracy'])
    plt.show()
Up to now, I have pretty much turned every knob I know:
different number of layers
different sizes of layers
different activation functions
different learning rates
different optimizers (we were asked to test with 'adam' and stochastic gradient descent)
different batch sizes
different data preparations (the feature values range from -1 to 1). I tried normalizing along the feature axis, standardizing (z_i = (x_i - mean) / std(x_i)), and, as in the code above, scaling the values to [0, 1] (since I guessed 'relu' activation wouldn't work well with negative input values).
Pretty much everything I tried gives weird outputs with extremely high loss values (depending on the learning rate) and very low accuracies during training. The loss increases over the epochs almost all of the time, and it seems to be independent of the accuracy values.
For the code, I followed the tutorials I was provided; however, something is very off, since the task is to find the best hyperparameters, but I'm not able to find any good ones whatsoever.
I'd be very glad to get some pointers on where I got the code wrong or where I need to preprocess the data differently.
Edit: Using loss='categorical_crossentropy' was given, so at least this one should be correct.
First of all:
Your convergence problems may be due to an "incorrect" loss function. tf.keras supports a variety of losses, and the right one depends on the format of your labels.
Try the option that matches your labels (see the sketch below):
tf.keras.losses.CategoricalCrossentropy if your labels are one-hot vectors,
tf.keras.losses.SparseCategoricalCrossentropy if your labels are integer class indices (0, 1, 2, ...),
or tf.keras.losses.BinaryCrossentropy if your labels are just 0/1.
Honestly, this part of tf.keras is a bit tricky, and a setting like this often needs adjusting.
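A minimal sketch of how the label format drives the loss choice; the 26-class output matches the question, but the rest is illustrative and not the asker's actual code:
import tensorflow as tf

num_classes = 26  # as in the isolet setup above

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])

# If C holds integer class indices with shape (n_samples,):
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

# If C is one-hot encoded with shape (n_samples, 26), use instead:
# model.compile(optimizer='adam',
#               loss=tf.keras.losses.CategoricalCrossentropy(),
#               metrics=['accuracy'])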
Second of all - this part:
scaler = prep.MinMaxScaler(feature_range=(0,1))
scaledX = scaler.fit_transform(X)
scaledXtst = scaler.fit_transform(Xtst)
assuming Xtst is your test set, you want to scale it based on your training set, so the correct scaling would be just
scaledXtst = scaler.transform(Xtst)
Hope this helps!
I'm currently working with a time series dataset of 46 rows of meteorological measurements, taken approximately every 3 hours over one week. My explanatory variables (X) consist of 26 variables, some with different units of measurement (degrees, millimeters, g/m3, etc.). My variable to explain (y) is a single variable, temperature.
My goal is to predict the temperature (y) over a 12h-24h window from the ensemble of variables (X).
For that I used Keras (TensorFlow) and Python, with an MLP regressor model:
X = df_forcast_cap.loc[:, ~df_forcast_cap.columns.str.startswith('l')]
X = X.drop(['temperature_Y'],axis=1)
y = df_forcast_cap['temperature_Y']
y = pd.DataFrame(data=y)
# normalize the dataset X
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(X)
normalized = scaler.transform(X)
# normalize the dataset y
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit_transform(y)
normalized = scaler.transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# define base model
def norm_model():
    # create model
    model = Sequential()
    model.add(Dense(26, input_dim=26, kernel_initializer='normal', activation='relu'))  # 26 is the number of neurons
    #model.add(Dense(6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# evaluate model with standardized dataset
estimator = KerasRegressor(build_fn=norm_model, epochs=100, batch_size=5, verbose=1)
kfold = KFold(n_splits=10, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold)
print(results)
[-0.00454741 -0.00323181 -0.00345096 -0.00847261 -0.00390925 -0.00334816
-0.00239754 -0.00681044 -0.02098541 -0.00140129]
# invert predictions
X_train = scaler.inverse_transform(X_train)
y_train = scaler.inverse_transform(y_train)
X_test = scaler.inverse_transform(X_test)
y_test = scaler.inverse_transform(y_test)
results = scaler.inverse_transform(results)
print("Results: %.2f (%.2f) MSE" % (results.mean(), results.std()))
Results: -0.01 (0.01) MSE
(1) I read that cross-validation is not well suited to time-series prediction. So I'm wondering which other techniques exist and which one is better adapted to time series.
(2) Secondly, I decided to normalize my data because my X dataset is composed of different metrics (degrees, millimeters, g/m3, etc.) and my variable to explain, y, is in degrees. That way, I know I have to deal with a more complicated interpretation of the MSE because its result won't be in the same unit as my y variable. But for the next step of my study I need to save the predicted y values (made by the MLP model), and I need those values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in the normalized format (see my code above). Does anyone see my mistake(s)?
The model that you present above is looking at a single instance of 26 measurements to make a prediction. From your description it seems that you would like to make predictions from a sequence of these measurements. I'm not sure if I fully understood the description but I'll assume that you have a sequence of 46 measurements, each with 26 values that you believe should be good predictors of the temperature. If that is the case, the input shape of your model should be (46, 26,). The 46 here is called time_steps, 26 is the number of features.
For a time series you need to select a model design. There are 2 approaches: a recurrent network or a convolutional network (or a mixture of the two). A convolutional network is typically used to detect patterns in the input data which may be located somewhere in the data. For instance, suppose you want to detect a given shape in an image: convolutional networks are a good starting point. Recurrent networks update their internal state after each time step. They can detect patterns as well as a convolutional network can, but you can think of them as being less position-independent.
Simple example of a convolutional approach.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential, Model
average_tmp = 0.0
model = Sequential([
    InputLayer(input_shape=(46, 26,)),
    Conv1D(16, 4),
    Conv1D(32, 4),
    Conv1D(64, 2),
    Conv1D(128, 4),
    MaxPooling1D(),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(1, bias_initializer=keras.initializers.Constant(average_tmp)),
])
model.compile('adam', 'mse')
model.summary()
A mixed approach would replace the Flatten layer above with an LSTM layer. That would probably be a reasonable starting point for experimenting; a sketch follows.
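As an illustration only (the LSTM width of 64 and keeping the Dense(256) head are my own guesses, not something specified above), such a mixed model could look like this:
from tensorflow import keras
from tensorflow.keras.layers import InputLayer, Conv1D, MaxPooling1D, LSTM, Dense
from tensorflow.keras.models import Sequential

average_tmp = 0.0  # mean of the (scaled) target, as above

mixed_model = Sequential([
    InputLayer(input_shape=(46, 26,)),
    Conv1D(16, 4),
    Conv1D(32, 4),
    Conv1D(64, 2),
    Conv1D(128, 4),
    MaxPooling1D(),
    LSTM(64),  # replaces Flatten(): summarises the remaining time steps
    Dense(256, activation='relu'),
    Dense(1, bias_initializer=keras.initializers.Constant(average_tmp)),
])
mixed_model.compile('adam', 'mse')
mixed_model.summary()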
(1) I read that cross-validation is not well suited to time-series prediction. So I'm wondering which other techniques exist and which one is better adapted to time series.
Cross-validation is a technique that is very well suited to this problem. If you try the example model above, I can almost guarantee that it will overfit your dataset very significantly. Cross-validation can help you determine the right regularisation parameters for your model in order to avoid overfitting.
Examples of regularisation techniques that you probably want to consider (a sketch combining them follows):
Saving the model weights at the epoch with the lowest validation score.
Dropout and/or BatchNormalization.
Kernel regularisation.
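A rough sketch of how these could be combined, reusing the (46, 26) sequence framing from above; the layer sizes, the l2 factor of 1e-4, the dropout rate of 0.3 and the X_seq_train / y_seq_train names are illustrative assumptions, not tuned values:
import tensorflow as tf
from tensorflow.keras import layers, regularizers

regularised = tf.keras.Sequential([
    layers.InputLayer(input_shape=(46, 26)),
    layers.Conv1D(32, 4, kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.MaxPooling1D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.3),
    layers.Dense(1),
])
regularised.compile('adam', 'mse')

# Keep only the weights from the epoch with the lowest validation loss.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    'best_weights.h5', monitor='val_loss', save_best_only=True)

# regularised.fit(X_seq_train, y_seq_train, validation_split=0.2,
#                 epochs=100, callbacks=[checkpoint])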
(2) Secondly, I decided to normalize my data because my X dataset is composed of different metrics (degrees, millimeters, g/m3, etc.) and my variable to explain, y, is in degrees.
Good call. It will avoid training cycles of your model trying to discover the bias at very high values from the random initialisation.
That way, I know I have to deal with a more complicated interpretation of the MSE because its result won't be in the same unit as my y variable.
This is orthogonal. The inputs are not assumed to be in the same unit as y. In a DNN we assume we can build a combination of linear transformations of the inputs (plus non-linear activations); that carries no implicit assumption about units.
But for the next step of my study I need to save the predicted y values (made by the MLP model), and I need those values to be in degrees. So I tried to invert the normalization, but without success: when I print my results, the predicted values are still in the normalized format (see my code above). Does anyone see my mistake(s)?
scaler.inverse_transform(results) should do the trick.
It doesn't make sense to inverse-transform the inputs X_train and X_test. And it would probably help you keep your code straight not to use the same variable name for both the X and y scalers.
It is also possible to refrain from scaling Y. If you choose to do so, I'd suggest that you initialise the output layer bias with the mean of the Ys.
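A small sketch of the separate-scaler pattern; the dummy X and y below only stand in for the question's data frames, and the shapes are illustrative:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Dummy stand-ins for the question's X (26 features) and y (temperature).
X = np.random.rand(46, 26) * 100.0
y = np.random.rand(46, 1) * 30.0

# One scaler per quantity, so each can be inverted independently.
x_scaler = MinMaxScaler(feature_range=(0, 1))
y_scaler = MinMaxScaler(feature_range=(0, 1))

X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y)

# ... fit the model on X_scaled, y_scaled; predictions come out scaled ...
y_pred_scaled = y_scaled  # placeholder for model.predict(X_scaled)

# Only the target needs inverting, and only with the y scaler:
y_pred_degrees = y_scaler.inverse_transform(y_pred_scaled)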
I was trying to do a pretty simple thing: train an LSTM that takes a sequence of random numbers and outputs their sum. But after some hours without converging, I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of random-number sequences of some fixed length and label each sequence with its sum (the numbers are drawn from a normal distribution).
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input.
Intuitively the LSTM should learn to set the weights of the input gate to 1 (bias 0), the weights of the forget gate to 0 (bias 1), and the weights of the output gate to 1 (bias 0), and learn to add these numbers, but it doesn't. I'm pasting the code I use; I tried different learning rates, optimizers, and batching, observed the gradients and the outputs, and can't find the exact reason why it's failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def generate_sequences(n_samples, seq_len):
    total_shape = n_samples * seq_len
    random_values = np.random.randn(total_shape)
    random_values = random_values.reshape(n_samples, -1)
    targets = np.sum(random_values, axis=1)
    return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
lr=0.1
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
    my_inp, target = iterator.get_next()
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(my_inp)
        my_out = output(tf.reshape(my_inp, shape=(batch_size, seq_len, 1)))
        loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)), 1) / batch_size
    grads = tape.gradient(loss, output.trainable_variables)
    optimizer.apply_gradients(zip(grads, output.trainable_variables),
                              global_step=tf.train.get_or_create_global_step())
I also have a conjecture that this is a theoretical problem: if we sum different random values drawn from a normal distribution, the output is not in the [-1, 1] range and the LSTM, due to its tanh activations, can't learn it. But changing them didn't improve the performance (they are changed to linear in the example code).
EDIT:
Setting the activations to linear, I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states, which was squashing my outputs and not allowing them to grow by adding up the summands. Setting all activations to linear works pretty fine.
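A rough tf.keras reconstruction of that final setup (not the asker's exact code; the sample count, learning rate and epoch count are arbitrary choices):
import numpy as np
import tensorflow as tf

# Sequences of random numbers and their sums, as in the setup above.
n_samples, seq_len = 10000, 2
x = np.random.randn(n_samples, seq_len, 1)
y = x.sum(axis=1)

# LSTM with linear cell/gate activations, so the cell state can grow
# beyond [-1, 1] instead of being squashed by tanh.
model = tf.keras.Sequential([
    tf.keras.layers.RNN(
        tf.keras.layers.LSTMCell(1, activation=None,
                                 recurrent_activation=None),
        input_shape=(seq_len, 1)),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss='mse')
model.fit(x, y, batch_size=32, epochs=10, verbose=0)

print(model.predict(np.array([[[1.5], [-0.5]]])))  # target sum is 1.0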
I managed to predict y=x**2 and y=x**3, but functions like y=x**4, y=x**5 or y=x**7 converge only to inaccurate lines.
What am I doing wrong? What could I improve?
import numpy as np
from keras.layers import Dense, Activation
from keras.models import Sequential
import matplotlib.pyplot as plt
import math
import time
x = np.arange(-100, 100, 0.5)
y = x**4
model = Sequential()
model.add(Dense(50, input_shape=(1,)))
model.add(Activation('sigmoid'))
model.add(Dense(50) )
model.add(Activation('elu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
t1 = time.clock()
for i in range(100):
    model.fit(x, y, epochs=1000, batch_size=len(x), verbose=0)
    predictions = model.predict(x)
    print(i, " ", np.mean(np.square(predictions - y)), " t: ", time.clock() - t1)
    plt.hold(False)
    plt.plot(x, y, 'b', x, predictions, 'r--')
    plt.hold(True)
    plt.ylabel('Y / Predicted Value')
    plt.xlabel('X Value')
    plt.title([str(i), " Loss: ", np.mean(np.square(predictions - y)), " t: ", str(time.clock() - t1)])
    plt.pause(0.001)
    #plt.savefig("fig2.png")
plt.show()
The problem is that your input and output variables have values that are too large and hence are not compatible with the (initial) weights of the network. For a Dense layer the default kernel initializer is glorot_uniform; the documentation states that:
It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / (fan_in + fan_out)) where fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor.
For your example the weights of the first and last layer are therefore sampled from the interval [-0.34, 0.34]. Now there are two issues that have to do with the magnitude of the weights and the inputs/outputs:
The inputs are in the range [-100, 100] and hence the output of the first Dense layer will be about 58 * 0.2 ~= 10 (the two numbers are the std. dev. of input and weights respectively); it will be smaller for smaller inputs but larger for larger ones. Because this is fed into a sigmoid activation it is likely to saturate. For the example value it will be (1 + exp(-10))**-1 ~= 0.99995. This will cause problems during backpropagation because the weight updates are proportional to the gradient of the activation function which in this case is very small; i.e. the weights are not updated much.
The other issue has to do with the magnitude of the outputs y. To see why, let's step through the network. The sigmoid activation outputs values in the range [0, 1], and hence the activations of the next dense layer will be of the same order of magnitude (given the default glorot_uniform initializer). The ELU activation won't change the order of magnitude, and hence the input to the last layer is still on the order of magnitude 1. The last layer also uses the glorot_uniform initializer and hence has weights within the range [-0.34, 0.34]. However, the targets are in the range [0, 1e8] for y = x**4 (and even wider for higher powers). To generate such huge outputs, the optimizer would need to step through about 7 (!) orders of magnitude during the fitting procedure. This will take (almost) forever.
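For concreteness, a quick numeric check of the magnitudes mentioned above:
import numpy as np

x = np.arange(-100, 100, 0.5)

# Glorot-uniform limit for a Dense(50) layer fed by a single input:
limit = np.sqrt(6 / (1 + 50))
print(limit)                         # ~0.34

# Typical pre-activation of the first layer: std(inputs) * std(weights)
print(x.std() * limit / np.sqrt(3))  # ~11, deep in the sigmoid's flat region

# The sigmoid is effectively saturated at such values:
print(1 / (1 + np.exp(-10)))         # ~0.99995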
So what can we do about it? On the one hand we could modify the weight initialization and on the other hand we can scale the inputs and outputs to a more moderate range. The latter is a much better idea since any numerical computation is much more accurate when performed in the order of magnitude 1. Also MSE loss is going to explode for orders of magnitude difference.
Variable scaling
The scikit-learn package provides various routines for data preparation as for example the StandardScaler. This will subtract the mean from the data and then divide by its standard deviation, i.e. x -> (x - mu) / sigma.
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()
x = x_scaler.fit_transform(x[:, None])  # Features are expected as column vectors.
y = y_scaler.fit_transform(y[:, None])
...  # Model definition and fitting goes here.
# Invert the transformation before plotting.
x = x_scaler.inverse_transform(x).ravel()
y = y_scaler.inverse_transform(y).ravel()
predictions = y_scaler.inverse_transform(predictions).ravel()
After 2000 epochs of training (full batch size):
Weight initialization
Not recommended! Feature scaling should be used instead; I just provide this example for the sake of completeness. In order to make the weights compatible with the input/output, we can specify custom initializers for the first and last layer of the network:
from keras.initializers import RandomUniform

model.add(Dense(50, input_shape=(1,),
                kernel_initializer=RandomUniform(-0.001, 0.001)))
...  # Activations and intermediate layers.
model.add(Dense(1, kernel_initializer=RandomUniform(-1e7, 1e7)))
Note the small weights for the first layer (in order to prevent saturation of the sigmoid) and the large weights of the last layer (in order to help the network scaling the outputs by the required 7 orders of magnitude).
Again, after 2000 epochs (full batch size):
As you can see, it works as well, but not as well as the scaled-feature approach. Furthermore, the larger the numbers involved, the larger the risk of running into numerical instabilities. A good rule of thumb is to try to always stay in the region around 1 (plus or minus a (very) few orders of magnitude).
That's a cool question!
This happens because the data is not properly scaled. As a consequence, some activations (i.e. sigmoid) saturate more easily and gradients get close to zero. The easiest solution is to scale your data as follows:
x_orig = x
y_orig = y
x_mean = np.mean(x)
x_std = np.std(x)
x = (x - x_mean)/x_std
y_mean = np.mean(y)
y_std = np.std(y)
y = (y - y_mean)/y_std
As a result of scaling the data in this manner, the approximation at the first iteration is:
The original range can then be recovered as follows:
y_pred = predictions*y_std + y_mean
plt.plot(x_orig, y_orig, 'b', x_orig, y_pred, 'r--')
I think it is because the range of the input data is so large. Adding a batchnorm layer can improve the performance. (Figure showing the fit of the model with the batchnorm layer omitted.)
Here is the code:
import numpy as np
import keras
from keras.layers import Dense, Activation
from keras.models import Sequential
import matplotlib.pyplot as plt
import math
import time
x = np.arange(-100, 100, 0.5)
y = x**4
model = Sequential()
model.add(keras.layers.normalization.BatchNormalization(input_shape=(1,)))
model.add(Dense(200))
model.add(Activation('relu'))
model.add(Dense(50))
model.add(Activation('elu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
t1 = time.clock()
for i in range(100):
    model.fit(x, y, epochs=1000, batch_size=len(x), verbose=0)
    predictions = model.predict(x)
    print(i, " ", np.mean(np.square(predictions - y)), " t: ", time.clock() - t1)
    plt.hold(False)
    plt.plot(x, y, 'b', x, predictions, 'r--')
    plt.hold(True)
    plt.ylabel('Y / Predicted Value')
    plt.xlabel('X Value')
    plt.title([str(i), " Loss: ", np.mean(np.square(predictions - y)), " t: ", str(time.clock() - t1)])
    plt.pause(0.001)
plt.show()
I am new to machine learning. I tried the MNIST dataset and got an accuracy of around 97%, but then I tried working on my own image dataset and got an accuracy of 0%. Please help me out.
This is the 97% accuracy model code:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, Flatten
from keras.callbacks import ModelCheckpoint

x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
model = Sequential()
model.add(Flatten())
model.add(Dense(128, activation = 'relu'))
model.add(Dense(128, activation = 'relu'))
model.add(Dense(10, activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
checkpointer = ModelCheckpoint(filepath = 'mnist.model.weights.best.hdf5',verbose = 1,save_best_only = True, monitor = 'loss')
model.fit(x_train, y_train, epochs = 3, callbacks = [checkpointer],
batch_size = 32,verbose = 2,shuffle = True)
Now I tried with my 10 images and none of them were predicted correctly.
Below is the code:
from skimage import io
from skimage import color
import numpy as np
import tensorflow as tf
import keras
img_0 = color.rgb2gray(io.imread("0.jpg"))
img_2 = color.rgb2gray(io.imread("2.jpg"))
img_3 = color.rgb2gray(io.imread("3.jpg"))
img_4 = color.rgb2gray(io.imread("4.jpg"))
img_5 = color.rgb2gray(io.imread("5.jpg"))
img_6 = color.rgb2gray(io.imread("6.jpg"))
img_7 = color.rgb2gray(io.imread("7.jpg"))
img_8 = color.rgb2gray(io.imread("8.jpg"))
img_9 = color.rgb2gray(io.imread("9.jpg"))
array = [img_0, img_2, img_3, img_4, img_5, img_6, img_7, img_8, img_9]
#normalized the data between 0-1
array = tf.keras.utils.normalize(array, axis = 1)
#used the loop to increase the dimensions of the input layer as 1,28,28 which will be converted into 1*784
for i in array:
    i = np.expand_dims(i, axis=0)
    print(i.shape)
new_model = tf.keras.models.load_model('mnist_save.model')
new_model.load_weights('mnist.model.weights.best.hdf5')
predictions = new_model.predict(array)
Can you please help me out with my problem.
If I were you, I would check the following three things.
1. Visualize both training and testing data side by side
This is the simplest way to see whether the low performance is reasonable. Basically, if the testing data looks very different from the training data, there is no way for your pretrained model to achieve high performance in this new testing domain. Even if this is not the case, visualization should help you decide what simple domain adaptation could be applied to achieve better performance.
2. Double-check your L2 normalization
I took a look at the source code of keras.utils.normalize
@tf_export('keras.utils.normalize')
def normalize(x, axis=-1, order=2):
    """Normalizes a Numpy array.

    Arguments:
        x: Numpy array to normalize.
        axis: axis along which to normalize.
        order: Normalization order (e.g. 2 for L2 norm).

    Returns:
        A normalized copy of the array.
    """
    l2 = np.atleast_1d(np.linalg.norm(x, order, axis))
    l2[l2 == 0] = 1
    return x / np.expand_dims(l2, axis)
Since you are using the TensorFlow backend, what does normalizing along the 1st axis mean? Normalizing each row? That is strange. The right way to do normalization here is to (1) vectorize your input images, i.e. each image becomes a vector, and (2) normalize the resulting vectors (at axis=1).
Actually, L2 normalization is somewhat inappropriate here anyway, especially when you want to apply a pretrained model in a different domain, because it is sensitive to the distribution of nonzero values. MNIST samples are almost binarized, i.e. mostly 0s and 1s, whereas your grayscale images may contain values across [0, 255], which is a completely different distribution.
You may try the simple (0,1) normalization, i.e.
x_normalized = (x-min(x))/(max(x)-min(x))
but this requires you to retrain a new model.
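A small sketch of that (0, 1) normalization applied per image; minmax_normalize is a hypothetical helper, not part of Keras, and the training images would need the same treatment:
import numpy as np

def minmax_normalize(img):
    # Scale a single grayscale image to the [0, 1] range.
    img = img.astype('float32')
    return (img - img.min()) / (img.max() - img.min())

# Hypothetical usage on one of the loaded test images:
# batch = minmax_normalize(img_0).reshape(1, 28, 28)
# prediction = new_model.predict(batch)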
3. Apply domain adaptation techniques
This means you want to do the following things before feeding a testing image to your model (even before normalization).
binarize your testing image, i.e. convert to 0/1 images
negate your testing image, i.e. turn 0s into 1s and 1s into 0s
centralize your testing image, i.e. shift your image such that its mass center is the image center.
Of course, which techniques to apply depends on the domain differences that you observe in the visualization results; a rough sketch of the steps above follows.
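This sketch assumes 28x28 grayscale images already scaled to [0, 1] and uses SciPy for the shift; the 0.5 threshold and the adapt_digit name are arbitrary choices:
import numpy as np
from scipy import ndimage

def adapt_digit(img, threshold=0.5):
    # Binarize, invert to bright digit on dark background (the MNIST
    # convention; skip the inversion if your digits are already bright),
    # and shift the mass centre to the image centre.
    binary = (img > threshold).astype('float32')
    inverted = 1.0 - binary
    cy, cx = ndimage.center_of_mass(inverted)
    shift = (img.shape[0] / 2 - cy, img.shape[1] / 2 - cx)
    return ndimage.shift(inverted, shift, order=0)

# Hypothetical usage: adapted = adapt_digit(img_0)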
With an objective of learning Keras LSTM and RNNs, I thought to create a simple problem to work on: given a sine wave, can we predict its frequency?
I wouldn't expect a simple neural network to be able to predict the frequency, given that the notion of time is important here. However, even with LSTMs, I am unable to learn the frequency; the model only learns a trivial zero as the estimated frequency (even on training samples).
Here's the code to create the train set.
import numpy as np
import matplotlib.pyplot as plt
def create_sine(frequency):
    return np.sin(frequency * np.linspace(0, 2*np.pi, 2000))
train_x = np.array([create_sine(x) for x in range(1, 300)])
train_y = list(range(1, 300))
Now, here's a simple neural network for this example.
from keras.models import Model
from keras.layers import Dense, Input, LSTM
input_series = Input(shape=(2000,),name='Input')
dense_1 = Dense(100)(input_series)
pred = Dense(1, activation='relu')(dense_1)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100], train_y[:100], epochs=100)
As expected, this NN doesn't learn anything useful. Next, I tried a simple LSTM example.
input_series = Input(shape=(2000,1),name='Input')
lstm = LSTM(100)(input_series)
pred = Dense(1, activation='relu')(lstm)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100].reshape(100, 2000, 1), train_y[:100], epochs=100)
However, this LSTM based model also doesn't learn anything useful.
Why doesn't it learn?
You think it's a simple problem to train an RNN on, but actually your setup isn't easy for the network at all:
As already mentioned, there's a lack of important samples. You throw a lot of data at it (300 * 2000 points), but the actual target (the frequency) is seen only once by the network. Even if the network does learn something, there's a high chance it will overfit.
Inconsistent data. Remember that RNNs are good at capturing similar patterns in series data. For instance, in NLP all sentences in the corpus are governed by the same language rules, and more sentences help the RNN understand these rules better, i.e., more data helps.
In your case, the series with different frequencies aren't very much alike: compare the sine with frequency=1 and frequency=100. This kind of diversity in the data makes it harder to learn, not easier. It doesn't mean that the frequency is impossible for an RNN to learn; it simply means that you shouldn't be surprised that a trivial RNN like yours has a hard time.
Data scale. Changing the frequency from 1 to 300, changes the scale of both x and y by two orders of magnitude, which may be problematic for any neural network.
Solution
Since your goal is rather educational, I solved the second and third items simply by limiting the target frequency to 10, so that scaling and distribution diversity isn't much of an issue (you are welcome to try different values here: you should see that increasing this one parameter to, say, 50 makes the task much more complex).
The first item is solved by giving the RNN 10 examples of each frequency, instead of just one. I've also added one more hidden layer to increase network flexibility, plus a simple regularizer (Dropout layer).
The complete code:
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout, LSTM
max_freq = 10
time_steps = 100
def create_sine(frequency, offset):
    return np.sin(frequency * np.linspace(offset, 2 * np.pi + offset, time_steps))
train_y = list(range(1, max_freq)) * 10
train_x = np.array([create_sine(freq, np.random.uniform(0,1)) for freq in train_y])
train_y = np.array(train_y)
input_series = Input(shape=(time_steps, 1), name='Input')
lstm = LSTM(units=100)(input_series)
hidden = Dense(units=100, activation='relu')(lstm)
dropout = Dropout(rate=0.1)(hidden)
output = Dense(units=1, activation='relu')(dropout)
model = Model(input_series, output)
model.compile('adam', 'mean_squared_error')
model.fit(train_x.reshape(-1, time_steps, 1), train_y, epochs=200)
# Trying the network on the same data
test_x = train_x.reshape(-1, time_steps, 1)
test_y = train_y
predicted = model.predict(test_x).reshape([-1])
print()
print((predicted - train_y)[:12])
print(np.mean(np.abs(predicted - train_y)))
The output:
max_freq=10
[-0.05612183 -0.01982236 -0.03744316 -0.02568841 -0.11959982 -0.0770483
0.04643679 0.12057972 -0.00625324 -0.00724655 -0.16919005 -0.04512954]
0.0503574344847
max_freq=20 (everything else is the same)
[ 0.51365542 0.09269333 -0.009691 0.0619092 0.09852839 0.04378462
0.01430321 -0.01953268 0.00722599 0.02558327 -0.04520988 -0.0614748 ]
0.146024380232
max_freq=30 (everything else is the same)
[-0.28205156 -0.28922796 -0.00569081 -0.21314907 0.1068716 0.23497915
0.23975039 0.25955486 0.26333141 0.24235058 0.08320332 -0.03686047]
0.406703719805
Note that the results are random, and increasing max_freq actually increases the chances of divergence. But even when it converges, the performance doesn't improve despite having more data; instead it gets worse, and quickly.
The number of sample data items is very low: only one per frequency.
Add small noise and use more data.
Normalize the output data to the [-1, 1] range.
Then try again.
As you said, you want to predict the frequency, and you want to use an LSTM. First we generate enough data to train on, then we build the network. I'm sorry my example is not in Keras; I'm using tflearn.
import numpy as np
import tflearn
from random import shuffle
# parameters
n_input=100
n_train=2000
n_test = 500
# generate data
xs=[]
ys=[]
frequencies = np.linspace(1,50,n_train+n_test)
shuffle(frequencies)
t=np.linspace(0,2*np.pi,n_input)
for freq in frequencies:
    xs.append(np.sin(t * freq))
    ys.append(freq)
xs_train=np.array(xs[:n_train]).reshape(n_train,n_input,1)
xs_test=np.array(xs[n_train:]).reshape(n_test,n_input,1)
ys_train = np.array(ys[:n_train]).reshape(-1,1)
ys_test = np.array(ys[n_train:]).reshape(-1,1)
# LSTM network prediction
net = tflearn.input_data(shape=[None, n_input, 1])
net = tflearn.lstm(net, 10)
net = tflearn.fully_connected(net, 100, activation="relu")
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam', loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train, n_epoch=100)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
# [[ 13.08494568 12.76470588]
# [ 22.23135376 21.98039216]
# [ 39.0812912 37.58823529]
# [ 15.77548409 15.66666667]
# [ 26.57996941 25.58823529]
# [ 26.57759476 25.11764706]
# [ 16.42217445 15.8627451 ]
# [ 32.55020905 30.80392157]
# [ 44.16622925 43.01960784]
# [ 26.18071365 25.45098039]]
If you have the data in that form, you don't actually need an LSTM; you can easily replace the LSTM part with a plain deep neural network:
# Deep network instead of LSTM
net = tflearn.input_data(shape=[None, n_input])
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam',loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
Both codes are going to give you as result the predicted value of the frequency. I also created a gist with the program.